Call transcription and scoring that replaced a QA team for a 40-seat B2B SaaS sales org.
Two-tier speech-to-text (fine-tuned Whisper Large-v3 async + Deepgram Nova-2 realtime) plus playbook-as-code scoring. 3,000+ calls scored per month. 89% agreement with senior reviewers — enough to retire the manual QA pass and redirect the time into one-on-one coaching.
Client
B2B SaaS sales organization. 40 seats across SDR / AE / AM. Calls on Zoom, Teams, and Twilio-routed cell lines. Client operates on their own AWS account — data residency and retention are non-negotiable.
Engagement
14-week build, Terraform-deployed into the client's AWS. Ongoing retainer for playbook tuning and model upkeep.
The QA bottleneck was eating a third of the coaching budget.
A single manual QA reviewer could score ~40 calls per week against the company's 30-point rubric. The team generated 3,000+ calls per month. Coverage was under 10%, the sample was non-random (reviewers cherry-picked obvious bad calls), and coaching conversations relied on whatever the VP happened to remember from last week's ride-along.
Off-the-shelf call-intelligence tools gave them transcripts and buzzword counts — "mentioned the competitor 3 times" — but not scoring against their playbook. The playbook had 30 criteria spread across discovery, demo, and closing stages, each with a specific linguistic shape that a keyword matcher couldn't catch.
They wanted their rubric, scored consistently, across every call, with the same rigor as their senior reviewer. That meant the LLM had to see the same evidence their reviewer would see — and had to be evaluable, not trusted on faith.
Two-tier STT. Playbook as Pydantic code. Real-time coaching prompts under 300 ms.
Three separate pipelines, one data model, one eval harness.
Two-tier speech-to-text. Deepgram Nova-2 runs realtime against the live call for sub-second caption and coaching latency. Whisper Large-v3 — fine-tuned on ~180 hours of the client's labeled calls to pick up their product names, prospect companies, and internal acronyms — runs async on the full recording and becomes the source of truth for scoring. Nova-2 for speed, fine-tuned Whisper for accuracy. 96% word-level accuracy on the client's test set, up from 87% on stock Whisper.
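The 96%-vs-87% comparison is standard word-level accuracy, i.e. 1 minus word error rate (WER) against a human reference transcript. A minimal stdlib sketch of that measurement — the function and sample strings are illustrative, not the client's actual eval code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Domain vocabulary is exactly where a stock model splits and garbles words.
ref = "does flowmetrics integrate with initech today"
stock = "does flow metrics integrate with in a tech today"
print(f"word accuracy: {1 - word_error_rate(ref, stock):.0%}")
```

Run the same metric per call segment and the product-name errors cluster visibly — which is what made the fine-tune an easy sell.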
Playbook as Pydantic code, not prompts. The 30-criterion rubric is compiled into a Pydantic model: 14 discovery criteria, 9 demo criteria, 7 closing criteria. Each criterion has a scoring LLM prompt, an evidence extractor, a score range, and a refusal path. Changing a criterion is a PR, not a prompt-engineering session. The scoring LLM (GPT-4o) outputs a schema-validated response per criterion — evidence spans, score, rationale.
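The shape of that rubric-as-code model can be sketched like so — stdlib dataclasses stand in for Pydantic here so the sketch is dependency-free, and the criterion names, field names, and score range are illustrative assumptions, not the client's schema. The production model also carries an evidence extractor per criterion, omitted for brevity:

```python
from dataclasses import dataclass
from enum import Enum

class Stage(str, Enum):
    DISCOVERY = "discovery"
    DEMO = "demo"
    CLOSING = "closing"

@dataclass(frozen=True)
class Criterion:
    """One playbook criterion: scoring prompt, score range, refusal path."""
    name: str
    stage: Stage
    prompt: str                      # instruction sent to the scoring LLM
    min_score: int = 0
    max_score: int = 5
    refusal: str = "not_applicable"  # emitted when the call has no evidence

    def validate(self, score: int) -> int:
        # Schema validation: reject out-of-range LLM output instead of storing it.
        if not self.min_score <= score <= self.max_score:
            raise ValueError(f"{self.name}: score {score} out of range")
        return score

RUBRIC = [
    Criterion(
        name="discovery_depth",
        stage=Stage.DISCOVERY,
        prompt="Did the rep uncover a quantified business pain? "
               "Cite the evidence span or refuse.",
    ),
    # ...29 more criteria, reviewed the same way any code change is
]
```

Because the rubric is a module, a diff on a criterion is a reviewable diff — which is what makes "changing a criterion is a PR" literal rather than aspirational.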
Real-time coaching. Deepgram Nova-2 streams into a LangGraph coaching agent that fires prompts to the rep through an Electron desktop app: "prospect said budget — ask about decision timeline," "you've been talking 90 seconds straight, ask a question." End-to-end latency under 300 ms so prompts land before the moment closes.
Eval harness on 500 double-scored calls. Before the system went live, we had senior reviewers score 500 calls independently. That's the ground truth. Every scoring-model change runs against that harness. We knew the system hit 89% agreement with senior reviewers before shipping — and we know the moment a change drops it below 85%.
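The agreement number behind that gate can be computed per (call, criterion) pair rather than per call, so one weak criterion can't hide behind strong ones. A minimal sketch — whether the client counted exact matches or allowed a tolerance band is an assumption, hence the parameter:

```python
def per_criterion_agreement(
    human: dict[tuple[str, str], int],   # (call_id, criterion) -> score
    model: dict[tuple[str, str], int],
    tolerance: int = 0,
) -> float:
    """Fraction of (call, criterion) pairs where the model's score
    lands within `tolerance` points of the human reviewer's score."""
    keys = human.keys() & model.keys()
    if not keys:
        raise ValueError("no overlapping scores to compare")
    hits = sum(abs(human[k] - model[k]) <= tolerance for k in keys)
    return hits / len(keys)

human = {("call-1", "discovery_depth"): 4, ("call-1", "next_step"): 5}
model = {("call-1", "discovery_depth"): 4, ("call-1", "next_step"): 3}
print(per_criterion_agreement(human, model))  # 0.5
```

Run on the 500 double-scored calls, this is the single number every scoring-model change is gated on: ship above 85%, roll back below it.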
Client AWS, Terraform-deployed. Entire stack — STT, scoring service, coaching service, Postgres, object storage — lives in the client's AWS account. Azure OpenAI endpoint with data-residency configured. We wrote Terraform; they own the infrastructure.
Architecture (three pipelines)
100% call coverage, $34 per seat per month, and coaching that's tied to evidence.
Calls scored per month: 3,000+
100% of sales calls across SDR / AE / AM. Previously <10% coverage, cherry-picked.
Word-level STT accuracy: 96%
On the client's test set. Up from 87% on stock Whisper — the fine-tune pulled in their product names, prospect companies, and acronyms.
Senior-reviewer agreement: 89%
Per-criterion agreement against 500 double-scored calls. Above the 85% threshold the VP set for retiring manual review.
QA time saved
Manual reviewer time redirected into one-on-one coaching sessions with reps — using the scorecards as the conversation anchor.
Cost per seat per month: $34
All-in run cost: STT (both tiers) + scoring LLM + coaching LLM + infrastructure. Azure OpenAI + Deepgram + self-hosted Whisper on the client's AWS.
Coaching-prompt latency: <300 ms
Measured from realtime transcript to desktop overlay. Fast enough that prompts land while the conversational moment is still open.
Pipeline velocity is up by about a third. We're not running more calls — we're running the same calls with reps who actually got coached on the last one. The scorecards gave us a shared language.
Four decisions where generic call-intelligence tools fail.
1. Fine-tune the STT, don't apologize for it. 87% accuracy on stock Whisper sounds fine until you realize every product name, every prospect company, every internal acronym is in the 13% that's wrong. Scoring calls on garbled transcripts is worse than not scoring at all. Six hours of GPU time to fine-tune paid for itself inside a week.
2. The playbook is code, not a prompt. Thirty criteria in a prompt is unmaintainable. Thirty criteria as a Pydantic model with a scoring function per criterion is version-controlled, testable, diffable. When the VP changed how "discovery depth" was scored, that was a PR with an eval run — not a panicked late-night prompt edit.
3. Coaching agent and scoring model are separate. Coaching runs on GPT-4o-mini because it has to be fast. Scoring runs on GPT-4o because it has to be accurate. Same company, different budgets, different latency envelopes — don't force one model to do both.
4. Double-scored eval set before you ship. 500 calls each scored by two senior reviewers. That's the ground truth the LLM has to reach. Without it, you have no idea if you're at 70% agreement or 90% — and neither does the VP you're selling the system to.
Fourteen weeks, Terraform-deployed into the client's AWS.
- Weeks 1–2
Playbook capture + eval set
Workshopped the 30-criterion rubric with the VP and senior reviewer. Double-scored 500 historical calls — that's the ground truth.
- Weeks 3–4
Whisper fine-tune
Labeled 180 hours of client audio. Fine-tuned Whisper Large-v3 on client vocabulary. 87% → 96% word-level on the test set.
- Weeks 5–6
Scoring pipeline + Pydantic rubric
Compiled rubric to code. Per-criterion scoring service with schema-validated output. First eval run against the 500-call harness.
- Weeks 7–8
Iterate to 89% agreement
Tuned prompts and evidence extractors criterion-by-criterion. Every change re-ran the eval. Hit 89% per-criterion agreement.
- Weeks 9–10
Realtime coaching agent
Deepgram Nova-2 stream into a LangGraph coaching agent. Electron desktop overlay. <300 ms prompt latency.
- Weeks 11–14
Terraform deploy + rollout
Entire stack Terraform'd into the client's AWS. Azure OpenAI endpoint BYOK. Rolled out to all 40 seats. Manual QA retired.
Same approach, different problem.
We can score every call for you, against your rubric, in 10–14 weeks.
Bring us your playbook, ~100 hours of labeled audio, and a senior reviewer for the eval set. We'll ship into your AWS or ours.