iGaming QA assessment — from 66% to 91% accuracy with schema-guided reasoning
A Tier-1 online casino operator's first LLM QA system agreed with human reviewers only 66% of the time, while the manual QA team could audit just 2% of support interactions. We rebuilt the evaluation on a three-stage schema-guided pipeline, backed by a CI-gated eval harness and a two-model ensemble on regulatory-risk criteria, reaching 91% agreement and scaling coverage to 25%.
Client
Tier-1 online casino operator. 10M+ registered players, multiple European and LatAm licenses, regulated across 12 jurisdictions. NDA.
Engagement
14-week rebuild of an existing v1 direct-prompting QA system. Handoff to the client's QA Ops team with CI-gated eval harness and observability.
The v1 system landed at 66% agreement with human reviewers. Not good enough for regulatory reporting.
The operator's customer-support QA team was manually auditing ~2% of ~150K monthly agent-player interactions (chat + call transcripts). A 2% sample wasn't representative — it was biased toward easy escalations and missed the long-tail regulatory risk the license reviewer actually cared about.
The first attempt used direct LLM prompting: "read this conversation, score 1–5 on each of 8 criteria, cite evidence." Inter-rater agreement with the human QA team landed at 66%. For a system that had to produce evidence for a license auditor, that wasn't close enough to defend.
The brief was concrete: accuracy >85% against human reviewers, coverage >20% of all interactions, cost per audit under $0.05, a reviewer UI that QA managers actually used, and an audit-ready evidence packet per case.
Schema-guided reasoning. Rubric as code. Eval harness before model tuning.
The 91% number didn't come from a better prompt. It came from replacing free-form scoring with a three-stage validated pipeline.
Schema-guided reasoning instead of one-shot scoring. Direct prompting asks the model to score and explain in one pass — accuracy caps at ~65–70% because the model will invent evidence to justify the score it wants. We switched to three stages: (1) extract structured evidence spans for each criterion, (2) validate each criterion against only its evidence, (3) aggregate into a final score. Accuracy on the same held-out set jumped to 91%.
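The shape of the three stages can be sketched as plain functions. In this sketch the LLM calls are passed in as callables (`extract` and `validate` are stand-ins for the actual model calls; all names are illustrative, not the production code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    criterion: str
    span: str        # verbatim quote from the transcript
    turn_index: int  # conversation turn the quote came from

@dataclass
class CriterionResult:
    criterion: str
    passed: bool
    evidence: list[Evidence]

def score_interaction(
    transcript: str,
    criteria: list[str],
    extract: Callable[[str, str], list[Evidence]],    # stage 1: LLM pulls evidence spans
    validate: Callable[[str, list[Evidence]], bool],  # stage 2: LLM judges evidence only
) -> dict:
    """Stage 1 extracts evidence, stage 2 validates each criterion against
    only its own evidence, stage 3 aggregates deterministically (no LLM)."""
    results = []
    for criterion in criteria:
        evidence = extract(transcript, criterion)
        # the validator never sees the full transcript, so it cannot
        # invent evidence to justify a score it already "wants"
        passed = validate(criterion, evidence)
        results.append(CriterionResult(criterion, passed, evidence))
    score = sum(r.passed for r in results) / len(results)  # stage 3
    return {"score": score, "results": results}
```

The design choice that matters is the information barrier between stages: the stage-2 validator only ever sees the extracted spans, which is what kills the "score first, rationalize later" failure mode of one-shot prompting.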
Rubric as code, not as a prompt. 8 QA criteria with regulatory sub-criteria (KYC flow, responsible gambling language, license disclosure, GDPR compliance). Each criterion is its own Pydantic schema, its own evidence extraction step, its own validator. Changes ship through GitHub PR with the eval delta attached — QA Ops can see what will change before it changes.
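A criterion-as-schema might look like the following sketch (Pydantic v2; the criterion name, fields, and validation rule are illustrative, not the client's actual rubric):

```python
from pydantic import BaseModel, Field, model_validator

class EvidenceSpan(BaseModel):
    quote: str = Field(min_length=1)  # verbatim text from the transcript
    turn_index: int = Field(ge=0)

class ResponsibleGamblingCheck(BaseModel):
    """One rubric criterion: did the agent use compliant RG language?"""
    used_rg_language: bool
    evidence: list[EvidenceSpan]

    @model_validator(mode="after")
    def pass_requires_evidence(self):
        # a "pass" the model cannot back with a quoted span is rejected
        # before it ever reaches a score
        if self.used_rg_language and not self.evidence:
            raise ValueError("claimed pass with no supporting evidence")
        return self
```

Because each criterion is its own schema with its own validator, a rubric change is a diff on a file, and the PR carries the eval delta alongside it.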
Two-model ensemble on high-stakes criteria. Regulatory-risk criteria (RG language, license disclosure, complaint handling) get scored by both GPT-4o and Claude 3.5 Sonnet. Disagreements are flagged for human review. Inter-model agreement rate: 94%. Everywhere else, one model is enough; we don't pay twice when the signal isn't worth it.
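The routing logic is deliberately small. A minimal sketch, with the two model calls stubbed as callables and all names hypothetical:

```python
from typing import Callable

Scorer = Callable[[str, str], bool]  # (criterion, transcript) -> pass/fail

def ensemble_score(
    criterion: str,
    transcript: str,
    primary: Scorer,
    secondary: Scorer,
    high_stakes: set[str],
) -> dict:
    """One model for routine criteria; two for regulatory-risk criteria,
    where a disagreement is itself the signal."""
    a = primary(criterion, transcript)
    if criterion not in high_stakes:
        return {"passed": a, "needs_human_review": False}
    b = secondary(criterion, transcript)
    if a == b:
        return {"passed": a, "needs_human_review": False}
    # models disagree: withhold the verdict and route to a human
    return {"passed": None, "needs_human_review": True}
```

Note that a disagreement produces no score at all rather than a tie-break: on regulatory criteria, an uncertain verdict is worth less than a human's ten seconds.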
Reviewer dashboard designed for QA managers, not engineers. Every score surfaces the evidence span that triggered it. Reviewers can dispute in one click. Disputes feed into the eval harness as new test cases — not into the model as training data, but into the test set, so we know when the system starts drifting.
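The dispute loop can be as small as appending a JSONL record to the gold set. A sketch with hypothetical field names:

```python
import json
from pathlib import Path

def dispute_to_eval_case(dispute: dict, eval_set: Path) -> None:
    """A reviewer dispute becomes a regression test case, not a training
    example: the human verdict is the gold label the harness checks against."""
    case = {
        "transcript_id": dispute["transcript_id"],
        "criterion": dispute["criterion"],
        "model_verdict": dispute["model_verdict"],
        "gold": dispute["reviewer_verdict"],  # what the system should have said
    }
    with eval_set.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```

If the model later regresses on the same pattern, the harness catches it at PR time instead of a reviewer catching it in production.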
Eval harness first, model tuning second. 1,200 gold-standard cases before any prompt work. CI-gated: no rubric or prompt change ships until the eval harness clears on both accuracy and reviewer-agreement. LangSmith integration for trace-level debug when something misfires.
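The gate itself is a small deterministic check. A sketch, assuming eval results come back as predicted/gold pairs (field names and thresholds are illustrative):

```python
def eval_gate(
    results: list[dict],
    baseline_accuracy: float,
    accuracy_floor: float = 0.85,
) -> dict:
    """Decide whether a rubric/prompt change may merge: the change must
    clear the absolute floor and must not regress against the baseline."""
    accuracy = sum(r["predicted"] == r["gold"] for r in results) / len(results)
    return {
        "accuracy": round(accuracy, 4),
        "delta": round(accuracy - baseline_accuracy, 4),
        "ship": accuracy >= accuracy_floor and accuracy >= baseline_accuracy,
    }
```

The delta is what lands on the PR: reviewers approve a measured change, not a prompt diff they have to eyeball.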
Architecture (data flow)
One year in production. 12+ months without a regulatory finding.
The 91% number matters. That the license auditor asks fewer questions now matters more.
Accuracy: 91%
Up from 66% at baseline. Measured as inter-rater agreement with a blinded human QA panel on 600 held-out cases.
Coverage: 25%
Up from 2%. The QA team audits 12x more interactions with the same headcount, because the AI does detection and humans do review.
Cost per audit: under $0.05
Cost per scored interaction, including ensemble runs, came in well under the $0.05 target.
Inter-model agreement: 94%
GPT-4o vs. Claude 3.5 Sonnet on regulatory-risk criteria. The 6% of cases where they disagree are flagged for human review.
Gold-standard eval cases: 1,200
Curated by the QA team, reviewed monthly. No rubric change ships without the eval delta.
Regulatory findings: zero
In 12+ months of audits, the license auditor hasn't flagged a single missed criterion. That's the metric that matters.
We went from auditing 2% of conversations with a QA team of 8 to auditing 25% with the same team — because the AI does the detection and they do the review. The 91% matters less to me than that our license auditor asks fewer questions now.
Four decisions most LLM-QA rebuilds skip.
1. Schema-guided reasoning as the core technique, not a tweak. Most teams try "better prompts" on direct scoring and cap at 70%. The 3-stage pipeline (evidence → validate → aggregate) is where the 91% came from. Accuracy lives in architecture, not wording.
2. Rubric as code, reviewable in git. QA managers can see every rubric change on a PR. Reviewers built trust in the system because they could audit what would change before it changed anything. No hidden prompt edits.
3. Eval harness first, model tuning second. 1,200 gold cases before any prompt work. Every PR runs the eval delta before merge. Tuning without an eval harness is a slot machine.
4. Ensemble only where it pays. Two-model ensemble doubles cost. We use it on regulatory-risk criteria where disagreement is signal — not on easy criteria where one model is enough. Cost discipline matters when you're scoring 37K interactions a month.
From audit to production in 14 weeks.
- Week 1–2
v1 audit + baseline
Failure-mode analysis on the existing direct-prompting system. Baseline confirmed at 66%. Defined the gold-standard test set spec.
- Week 3–4
Rubric-as-code rebuild
Pydantic schemas per criterion. Initial 200-case gold set curated by the client QA team.
- Week 5–7
Schema-guided pipeline on LangGraph
Three-stage evidence → validate → aggregate pipeline. Gold set scaled to 800. First run clearing 85% on the held-out eval.
- Week 8–10
Ensemble + reviewer dashboard
Two-model ensemble on regulatory criteria. Reviewer UI v1. Coverage scaled to 10% of monthly traffic.
- Week 11–14
CI eval harness + handoff
Gold set to 1,200. Coverage scaled to 25%. CI-gated deploys. Handoff to client QA Ops with runbook.
- Ongoing
Retainer
New regulatory criteria as license profiles change. Eval harness maintenance. Quarterly drift review.
Same methodology, different domains.
If you're stuck at 70% accuracy, we can get you to 90%+.
Bring us your current LLM-eval system, your test set, and your accuracy number. We'll tell you in a 20-minute call whether schema-guided reasoning closes the gap.