Enterprise RAG Challenge — a winning architecture for document-grounded answers
A short technical case study. The architecture that placed first at the Enterprise RAG Challenge is the same architecture we deploy for clients: hybrid retrieval, structured document parsing, schema-validated refusal, and query decomposition. Benchmarks that reward hallucination-free answers look a lot like production systems that have to not embarrass anyone.
Context
Open Enterprise RAG Challenge — a public benchmark on enterprise financial documents. Not a client engagement. The architecture we submitted is the architecture we deploy when a client asks for grounded Q&A over their corpus.
Engagement
4-week research sprint. Architecture + eval harness open-sourced.
Dumb-baseline RAG ships broken production systems.
The Enterprise RAG Challenge tests a system's ability to answer precise questions against a large corpus of enterprise financial documents: annual reports, 10-Ks, investor presentations. Correct answers require (1) retrieving the right page, (2) extracting the right figure, and (3) not hallucinating when the document doesn't contain the answer.
The dumb baseline — embed chunks, top-k retrieval, stuff into GPT-4o — lands around 55–60% on a representative eval set. Production RAG systems built on that architecture ship wrong answers every week. We built the submission the same way we build production RAG for clients — because the techniques that win benchmarks are the techniques that survive deployment.
Hybrid retrieval. Structured parsing. Schema-validated refusal. Query decomposition.
Four layers, each fixing a specific failure mode in naive RAG.
Hybrid retrieval: BM25 + dense embeddings + reranker. Pure-vector retrieval misses exact-match financial terms (ticker symbols, line-item names). Pure-BM25 misses paraphrased queries. We fan out to both, then rerank with a cross-encoder (bge-reranker-large) that scores each candidate against the query. The top 5 after reranking go to the generator.
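The write-up doesn't name the rule for merging the BM25 and dense candidate lists before the reranker sees them; reciprocal rank fusion (RRF) is one standard choice, sketched below. The chunk ids and ranked lists are made up for illustration.

```python
def rrf_fuse(bm25_ranked: list[str], dense_ranked: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, chunk_id in enumerate(ranked):
            # Standard RRF: each list contributes 1 / (k + rank); the constant k
            # damps the advantage of a single first-place vote.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidates: exact-match hits from BM25, paraphrase hits from dense.
bm25 = ["10k_p42", "10k_p7", "deck_p3"]
dense = ["deck_p3", "10k_p42", "10k_p88"]
candidates = rrf_fuse(bm25, dense)   # union of both lists, in fused order
```

The fused union (four ids here) is what goes to the cross-encoder; the reranker, not the fusion rule, decides the final top 5.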
Structured document parsing, not naive chunking. Financial PDFs have tables, footnotes, page numbers, and section hierarchy. We parse with Unstructured.io for layout + a custom table extraction pass for income statements, balance sheets, and cash flows. Each chunk carries structured metadata: page, section, table name, row. That metadata survives into the generator's context.
Schema-validated generation with a refusal path. The generator (GPT-4o) outputs a Pydantic-typed response with fields: answer, confidence, evidence_spans, refusal_reason. If the evidence doesn't support the answer, the schema forces a refusal: the model has no field in which to invent a number it didn't see.
Query decomposition for multi-hop questions. Questions like "what was the YoY change in segment X revenue" need two retrievals. A pre-step LLM call decomposes the question into atomic sub-queries, retrieves for each, and the generator sees all evidence before answering.
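The orchestration above can be sketched with the decomposer stubbed out. The keyword trigger below is a toy stand-in for the real pre-step LLM call, and the function names are illustrative.

```python
def decompose(question: str) -> list[str]:
    # Toy stand-in for the pre-step LLM call: a YoY question becomes one
    # atomic sub-query per fiscal year. Production uses an LLM, not keywords.
    if "YoY change in " in question:
        base = question.replace("YoY change in ", "")
        return [f"{base} (current fiscal year)", f"{base} (prior fiscal year)"]
    return [question]

def answer(question, retrieve, generate):
    sub_queries = decompose(question)
    # Retrieve per sub-query; the generator sees the pooled evidence at once.
    evidence = [chunk for q in sub_queries for chunk in retrieve(q)]
    return generate(question, evidence)

subs = decompose("what was the YoY change in segment X revenue")
```

The point is the ordering: both retrievals complete before a single generation call, so the model never has to answer a two-hop question from one hop of evidence.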
Eval harness on a held-out set. Every pipeline change ran against our own 500-question eval set (not the competition set) before submission. We knew our accuracy within 2 points before the public result landed.
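A minimal sketch of the scoring loop, assuming numeric ground truth and None for unanswerable questions; the open-sourced harness's actual matching rules aren't reproduced here.

```python
def run_eval(pipeline, eval_set, rel_tol=0.005):
    """Fraction correct over (question, expected) pairs.

    expected is None for unanswerable questions, where only an explicit
    refusal (pipeline returns None) scores; otherwise a numeric answer
    within rel_tol of ground truth scores.
    """
    correct = 0
    for question, expected in eval_set:
        got = pipeline(question)
        if expected is None:
            correct += got is None          # fabricating an answer scores zero
        elif got is not None and abs(got - expected) <= rel_tol * abs(expected):
            correct += 1
    return correct / len(eval_set)

# Toy pipeline: answers one known question, refuses everything else.
kb = {"fy23 net revenue": 4218.0}
toy = lambda q: kb.get(q)
score = run_eval(toy, [("fy23 net revenue", 4218.0), ("segment margin", None)])
```

Scoring refusals explicitly is what makes the harness track the hallucination metric, not just accuracy.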
Architecture (data flow)
First place. More importantly, zero hallucinations on refusal-required questions.
On the leaderboard
Enterprise RAG Challenge. Evaluated against the organizers' private question set.
Top-5 hit rate
Retrieval hit rate on held-out eval — the correct evidence span landed in the top 5 after rerank.
Hallucinations
On 18% of questions where the document didn't contain an answer, the system refused via the Pydantic schema. Zero fabricated numbers.
Questions decomposed
Multi-hop questions where the decomposer fired. These were the questions that broke naive RAG in our eval.
Avg per answer
All-in cost: decomposition + retrieval + rerank + generation. Representative of production economics.
Held-out eval questions
Our own eval set, curated from the corpus. We knew our accuracy within 2 points before submission.
The architecture that won the RAG Challenge is the same architecture we deploy for clients. Benchmarks that reward hallucination-free answers look a lot like production systems that have to not embarrass anyone.
Four decisions where naive RAG breaks in production.
1. Hybrid retrieval is not optional for enterprise documents. Financial data has too many exact-match terms for pure-vector to win on its own. BM25 catches what embeddings miss — ticker symbols, line-item names, footnote references.
2. Schema-validated refusal beats accuracy tuning. A RAG system that knows when to say "the document doesn't answer that" outperforms one that tries to be slightly more accurate on ambiguous questions. Refusal is an architecture decision, not a prompt.
3. Query decomposition before retrieval, not after. Multi-hop questions need sub-queries before retrieval. Retrieving once and hoping the LLM stitches multiple pieces together is the #1 failure mode in production RAG.
4. Eval harness on a held-out set, not on the test set. You don't optimize against the leaderboard. You build your own 500-question set, iterate there, and submit once.
Four-week research sprint.
- Week 1
Corpus analysis + baseline
Pinecone + GPT-4o baseline landed at 63% on our eval. Documented every failure mode.
- Week 2
BM25 + rerank + structured parsing
Hybrid retrieval layer → 78%. Unstructured.io + table extraction → 84%. Retrieval failure rate collapsed.
- Week 3
Query decomposition + schema-validated refusal
Multi-hop questions unlocked. Refusal path forced hallucinations to zero. >90% on held-out eval.
- Week 4
Final runs + open-source
Eval harness CI-gated. Architecture and eval-harness repo published. Submission.
Same architecture, different domains.
We can ship it on your enterprise docs in 6–10 weeks.
Bring us your corpus, your accuracy target, and your refusal budget — we'll tell you what the architecture costs to build and run.