iGaming QA assessment — from 66% to 91% accuracy with schema-guided reasoning
A Tier-1 online casino operator's first LLM QA system agreed with human reviewers only 66% of the time, while the manual QA team could audit just 2% of support interactions. We rebuilt the evaluation on a three-stage schema-guided pipeline, backed by a CI-gated eval harness and a two-model ensemble on regulatory-risk criteria, reaching 91% agreement and scaling coverage to 25%.
Client
Tier-1 online casino operator. 10M+ registered players, multiple European and LatAm licenses, regulated across 12 jurisdictions. NDA.
Engagement
14-week rebuild of an existing v1 direct-prompting QA system. Handoff to the client's QA Ops team with CI-gated eval harness and observability.
The v1 system landed at 66% agreement with human reviewers. Not good enough for regulatory reporting.
The operator's customer-support QA team was manually auditing ~2% of ~150K monthly agent-player interactions (chat + call transcripts). A 2% sample wasn't representative — it was biased toward easy escalations and missed the long-tail regulatory risk the license reviewer actually cared about.
The first attempt used direct LLM prompting: "read this conversation, score 1–5 on each of 8 criteria, cite evidence." Inter-rater agreement with the human QA team landed at 66%. For a system that had to produce evidence for a license auditor, that wasn't close enough to defend.
The brief was concrete: accuracy >85% against human reviewers, coverage >20% of all interactions, cost per audit under $0.05, a reviewer UI that QA managers actually used, and an audit-ready evidence packet per case.
Schema-guided reasoning. Rubric as code. Eval harness before model tuning.
The 91% number didn't come from a better prompt. It came from replacing free-form scoring with a three-stage validated pipeline.
Schema-guided reasoning instead of one-shot scoring. Direct prompting asks the model to score and explain in one pass — accuracy caps at ~65–70% because the model will invent evidence to justify the score it wants. We switched to three stages: (1) extract structured evidence spans for each criterion, (2) validate each criterion against only its evidence, (3) aggregate into a final score. Accuracy on the same held-out set jumped to 91%.
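The shape of the three stages can be sketched as plain functions. In this sketch the LLM calls are passed in as callables (`extract` and `validate` are stand-ins for the actual model calls; all names are illustrative, not the production code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    criterion: str
    span: str        # verbatim quote from the transcript
    turn_index: int  # conversation turn the quote came from

@dataclass
class CriterionResult:
    criterion: str
    passed: bool
    evidence: list[Evidence]

def score_interaction(
    transcript: str,
    criteria: list[str],
    extract: Callable[[str, str], list[Evidence]],    # stage 1: LLM pulls evidence spans
    validate: Callable[[str, list[Evidence]], bool],  # stage 2: LLM judges evidence only
) -> dict:
    """Stage 1 extracts evidence, stage 2 validates each criterion against
    only its own evidence, stage 3 aggregates deterministically (no LLM)."""
    results = []
    for criterion in criteria:
        evidence = extract(transcript, criterion)
        # the validator never sees the full transcript, so it cannot
        # invent evidence to justify a score it already "wants"
        passed = validate(criterion, evidence)
        results.append(CriterionResult(criterion, passed, evidence))
    score = sum(r.passed for r in results) / len(results)  # stage 3
    return {"score": score, "results": results}
```

The design choice that matters is the information barrier between stages: the stage-2 validator only ever sees the extracted spans, which is what kills the "score first, rationalize later" failure mode of one-shot prompting.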
Rubric as code, not as a prompt. 8 QA criteria with regulatory sub-criteria (KYC flow, responsible gambling language, license disclosure, GDPR compliance). Each criterion is its own Pydantic schema, its own evidence extraction step, its own validator. Changes ship through GitHub PR with the eval delta attached — QA Ops can see what will change before it changes.
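A criterion-as-schema might look like the following sketch (Pydantic v2; the criterion name, fields, and validation rule are illustrative, not the client's actual rubric):

```python
from pydantic import BaseModel, Field, model_validator

class EvidenceSpan(BaseModel):
    quote: str = Field(min_length=1)  # verbatim text from the transcript
    turn_index: int = Field(ge=0)

class ResponsibleGamblingCheck(BaseModel):
    """One rubric criterion: did the agent use compliant RG language?"""
    used_rg_language: bool
    evidence: list[EvidenceSpan]

    @model_validator(mode="after")
    def pass_requires_evidence(self):
        # a "pass" the model cannot back with a quoted span is rejected
        # before it ever reaches a score
        if self.used_rg_language and not self.evidence:
            raise ValueError("claimed pass with no supporting evidence")
        return self
```

Because each criterion is its own schema with its own validator, a rubric change is a diff on a file, and the PR carries the eval delta alongside it.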
Two-model ensemble on high-stakes criteria. Regulatory-risk criteria (RG language, license disclosure, complaint handling) get scored by both GPT-4o and Claude 3.5 Sonnet. Disagreements are flagged for human review. Inter-model agreement rate: 94%. Everywhere else, one model is enough; we don't pay twice when the signal isn't worth it.
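The routing logic is deliberately small. A minimal sketch, with the two model calls stubbed as callables and all names hypothetical:

```python
from typing import Callable

Scorer = Callable[[str, str], bool]  # (criterion, transcript) -> pass/fail

def ensemble_score(
    criterion: str,
    transcript: str,
    primary: Scorer,
    secondary: Scorer,
    high_stakes: set[str],
) -> dict:
    """One model for routine criteria; two for regulatory-risk criteria,
    where a disagreement is itself the signal."""
    a = primary(criterion, transcript)
    if criterion not in high_stakes:
        return {"passed": a, "needs_human_review": False}
    b = secondary(criterion, transcript)
    if a == b:
        return {"passed": a, "needs_human_review": False}
    # models disagree: withhold the verdict and route to a human
    return {"passed": None, "needs_human_review": True}
```

Note that a disagreement produces no score at all rather than a tie-break: on regulatory criteria, an uncertain verdict is worth less than a human's ten seconds.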
Reviewer dashboard designed for QA managers, not engineers. Every score surfaces the evidence span that triggered it. Reviewers can dispute in one click. Disputes feed into the eval harness as new test cases — not into the model as training data, but into the test set, so we know when the system starts drifting.
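The dispute loop can be as small as appending a JSONL record to the gold set. A sketch with hypothetical field names:

```python
import json
from pathlib import Path

def dispute_to_eval_case(dispute: dict, eval_set: Path) -> None:
    """A reviewer dispute becomes a regression test case, not a training
    example: the human verdict is the gold label the harness checks against."""
    case = {
        "transcript_id": dispute["transcript_id"],
        "criterion": dispute["criterion"],
        "model_verdict": dispute["model_verdict"],
        "gold": dispute["reviewer_verdict"],  # what the system should have said
    }
    with eval_set.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```

If the model later regresses on the same pattern, the harness catches it at PR time instead of a reviewer catching it in production.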
Eval harness first, model tuning second. 1,200 gold-standard cases before any prompt work. CI-gated: no rubric or prompt change ships until the eval harness clears on both accuracy and reviewer-agreement. LangSmith integration for trace-level debug when something misfires.
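The gate itself is a small deterministic check. A sketch, assuming eval results come back as predicted/gold pairs (field names and thresholds are illustrative):

```python
def eval_gate(
    results: list[dict],
    baseline_accuracy: float,
    accuracy_floor: float = 0.85,
) -> dict:
    """Decide whether a rubric/prompt change may merge: the change must
    clear the absolute floor and must not regress against the baseline."""
    accuracy = sum(r["predicted"] == r["gold"] for r in results) / len(results)
    return {
        "accuracy": round(accuracy, 4),
        "delta": round(accuracy - baseline_accuracy, 4),
        "ship": accuracy >= accuracy_floor and accuracy >= baseline_accuracy,
    }
```

The delta is what lands on the PR: reviewers approve a measured change, not a prompt diff they have to eyeball.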
Architecture (data flow)
One year in production. 12+ months without a regulatory finding.
The 91% number matters. That the license auditor asks fewer questions now matters more.
Accuracy: 91%
Up from 66% at baseline. Measured as inter-rater agreement with a blinded human QA panel on 600 held-out cases.
Coverage: 25%
Up from 2%. The QA team audits 12x more interactions with the same headcount, because the AI does detection and humans do review.
Cost per audit: under $0.05
Cost per scored interaction, including ensemble runs, came in well under the $0.05 target.
Inter-model agreement: 94%
GPT-4o vs. Claude 3.5 Sonnet on regulatory-risk criteria. The 6% of cases where they disagree are flagged for human review.
Gold-standard eval cases: 1,200
Curated by the QA team, reviewed monthly. No rubric change ships without the eval delta.
Regulatory findings: zero
In 12+ months of audits, the license auditor hasn't flagged a single missed criterion. That's the metric that matters.
We went from auditing 2% of conversations with a QA team of 8 to auditing 25% with the same team — because the AI does the detection and they do the review. The 91% matters less to me than that our license auditor asks fewer questions now.
Four decisions most LLM-QA rebuilds skip.
1. Schema-guided reasoning as the core technique, not a tweak. Most teams try "better prompts" on direct scoring and cap at 70%. The 3-stage pipeline (evidence → validate → aggregate) is where the 91% came from. Accuracy lives in architecture, not wording.
2. Rubric as code, reviewable in git. QA managers can see every rubric change on a PR. Reviewers built trust in the system because they could audit what would change before it changed anything. No hidden prompt edits.
3. Eval harness first, model tuning second. 1,200 gold cases before any prompt work. Every PR runs the eval delta before merge. Tuning without an eval harness is a slot machine.
4. Ensemble only where it pays. Two-model ensemble doubles cost. We use it on regulatory-risk criteria where disagreement is signal — not on easy criteria where one model is enough. Cost discipline matters when you're scoring 37K interactions a month.
From audit to production in 14 weeks.
- Week 1–2
v1 audit + baseline
Failure-mode analysis on the existing direct-prompting system. Baseline confirmed at 66%. Defined the gold-standard test set spec.
- Week 3–4
Rubric-as-code rebuild
Pydantic schemas per criterion. Initial 200-case gold set curated by the client QA team.
- Week 5–7
Schema-guided pipeline on LangGraph
Three-stage evidence → validate → aggregate pipeline. Gold set scaled to 800. First run clearing 85% on the held-out eval.
- Week 8–10
Ensemble + reviewer dashboard
Two-model ensemble on regulatory criteria. Reviewer UI v1. Coverage scaled to 10% of monthly traffic.
- Week 11–14
CI eval harness + handoff
Gold set to 1,200. Coverage scaled to 25%. CI-gated deploys. Handoff to client QA Ops with runbook.
- Ongoing
Retainer
New regulatory criteria as license profiles change. Eval harness maintenance. Quarterly drift review.
Same methodology, different domains.
If you're stuck at 70% accuracy, we can get you to 90%+.
Bring us your current LLM-eval system, your test set, and your accuracy number. We'll tell you in a 20-minute call whether schema-guided reasoning closes the gap.