Research · RAG · Public benchmark · Architecture open-sourced

Enterprise RAG Challenge — a winning architecture for document-grounded answers

A short technical case study. The architecture that placed first at the Enterprise RAG Challenge is the same architecture we deploy for clients: hybrid retrieval, structured document parsing, schema-validated refusal, and query decomposition. Benchmarks that reward hallucination-free answers look a lot like production systems that have to not embarrass anyone.

1st place · >90% top-5 hit rate · 0 hallucinations · $0.08 per answer

Context

Open Enterprise RAG Challenge — a public benchmark on enterprise financial documents. Not a client engagement. The architecture we submitted is the architecture we deploy when a client asks for grounded Q&A over their corpus.

Engagement

4-week research sprint. Architecture + eval harness open-sourced.

Dumb-baseline RAG ships broken production systems.

The Enterprise RAG Challenge tests a system's ability to answer precise questions against a large corpus of enterprise financial documents: annual reports, 10-Ks, investor presentations. Correct answers require (1) retrieving the right page, (2) extracting the right figure, and (3) not hallucinating when the document doesn't contain the answer.

The dumb baseline — embed chunks, top-k retrieval, stuff into GPT-4o — lands around 55–60% on a representative eval set. Production RAG systems built on that architecture ship buggy outputs every week. We built the submission the same way we build production RAG for clients — because the techniques that win benchmarks are the techniques that survive deployment.

Hybrid retrieval. Structured parsing. Schema-validated refusal. Query decomposition.

Four layers, each fixing a specific failure mode in naive RAG.

Hybrid retrieval: BM25 + dense embeddings + reranker. Pure-vector retrieval misses exact-match financial terms (ticker symbols, line-item names). Pure-BM25 misses paraphrased queries. We fan out to both, then rerank with bge-reranker-large, which scores each candidate against the query. The top 5 after rerank go to the generator.
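The fan-out-then-rerank shape can be sketched as below. This is a minimal illustration, not the competition code: the retriever and reranker callables are stand-ins for the real OpenSearch, Pinecone, and bge-reranker-large components.

```python
from typing import Callable

def hybrid_retrieve(
    query: str,
    bm25_search: Callable[[str, int], list[str]],   # lexical retriever stub
    dense_search: Callable[[str, int], list[str]],  # vector retriever stub
    rerank_score: Callable[[str, str], float],      # cross-encoder stub
    fan_out: int = 20,
    top_k: int = 5,
) -> list[str]:
    """Fan out to lexical and dense retrieval, dedupe, rerank, keep top_k."""
    # Union of both candidate lists, order-preserving dedupe.
    candidates = list(dict.fromkeys(
        bm25_search(query, fan_out) + dense_search(query, fan_out)
    ))
    # Cross-encoder rerank: score every candidate against the query.
    ranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return ranked[:top_k]
```

The point of the fan-out is that each retriever covers the other's blind spot; the reranker is what resolves the merged list into a single order.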

Structured document parsing, not naive chunking. Financial PDFs have tables, footnotes, page numbers, and section hierarchy. We parse with Unstructured.io for layout + a custom table extraction pass for income statements, balance sheets, and cash flows. Each chunk carries structured metadata: page, section, table name, row. That metadata survives into the generator's context.

Schema-validated generation with a refusal path. The generator (GPT-4o) outputs a Pydantic-typed response with four fields: answer, confidence, evidence_spans, refusal_reason. If the evidence doesn't support the answer, the schema forces a refusal — the model structurally can't invent a number it didn't see.

Query decomposition for multi-hop questions. Questions like "what was the YoY change in segment X revenue" need two retrievals. A pre-step LLM call decomposes the question into atomic sub-queries, retrieves for each, and the generator sees all evidence before answering.
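The decompose-retrieve-pool loop can be sketched as follows; the three callables stand in for the real LLM decomposer, the hybrid retriever, and the generator, which are assumptions of this illustration.

```python
from typing import Callable

def answer_multi_hop(
    question: str,
    decompose: Callable[[str], list[str]],       # LLM call in production; stub here
    retrieve: Callable[[str], list[str]],        # per-sub-query retrieval
    generate: Callable[[str, list[str]], str],   # generator sees ALL evidence at once
) -> str:
    """Decompose into atomic sub-queries, retrieve per sub-query, pool evidence."""
    sub_queries = decompose(question) or [question]  # fall back to the raw question
    evidence: list[str] = []
    for sq in sub_queries:
        for span in retrieve(sq):
            if span not in evidence:   # order-preserving dedupe of pooled evidence
                evidence.append(span)
    return generate(question, evidence)
```

The key ordering decision is that decomposition happens before retrieval, so a YoY question triggers one retrieval per fiscal year rather than one retrieval that must cover both.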

Eval harness on a held-out set. Every pipeline change ran against our own 500-question eval set (not the competition set) before submission. We knew our accuracy within 2 points before the public result landed.
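The harness itself needs nothing exotic; a minimal exact-match version is sketched below. The real harness presumably scores numeric tolerance and refusal correctness as well, which this toy omits.

```python
from typing import Callable

def run_eval(pipeline: Callable[[str], str],
             eval_set: list[tuple[str, str]]) -> float:
    """Exact-match accuracy of a pipeline over a held-out question set."""
    correct = sum(1 for question, expected in eval_set
                  if pipeline(question) == expected)
    return correct / len(eval_set)
```

Running every pipeline change through a fixed held-out set is what makes the pre-submission accuracy estimate possible: the number moves only when the system does.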

Architecture (data flow)

1. Parse: Unstructured.io layout + custom table extraction
2. Chunk: structured chunks with (page, section, table, row) metadata
3. Index: Pinecone (dense) + OpenSearch (BM25) + metadata filters
4. Decompose: GPT-4o decomposer → atomic sub-queries
5. Retrieve: BM25 + dense + metadata filter → top-20 candidates
6. Rerank: bge-reranker-large → top-5
7. Generate: GPT-4o with Pydantic schema (answer + evidence + refusal)
8. Validate: schema check + evidence span citation → final response
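Step 8, the final gate, can be sketched as below. The dict shape and the fallback behavior are illustrative assumptions: an answer passes only if every cited span actually appears in the retrieved evidence; anything else is converted into a refusal.

```python
def validate(response: dict, retrieved: list[str]) -> dict:
    """Final gate: accept only answers whose cited spans all appear in
    the retrieved evidence; otherwise downgrade to an explicit refusal."""
    spans = response.get("evidence_spans", [])
    grounded = bool(spans) and all(s in retrieved for s in spans)
    if response.get("answer") is not None and grounded:
        return response
    return {"answer": None, "confidence": 0.0, "evidence_spans": [],
            "refusal_reason": "evidence citation check failed"}
```

This is defense in depth on top of the schema: even a schema-valid response is rejected if its citations don't resolve against what the retriever actually returned.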
GPT-4o · Claude 3.5 Sonnet · bge-reranker-large · Pinecone · OpenSearch · Unstructured.io · Pydantic · LangGraph · LangSmith · Python 3.12 · FastAPI · Docker

First place. More importantly, zero hallucinations on refusal-required questions.

1st

On the leaderboard

Enterprise RAG Challenge. Evaluated against the organizers' private question set.

>90%

Top-5 hit rate

Retrieval hit rate on held-out eval — the correct evidence span landed in the top 5 after rerank.

0

Hallucinations

On the 18% of questions where the document didn't contain an answer, the system refused via the Pydantic schema. Zero fabricated numbers.

24%

Questions decomposed

Multi-hop questions where the decomposer fired. These were the questions that broke naive RAG in our eval.

$0.08

Avg per answer

All-in cost: decomposition + retrieval + rerank + generation. Representative of production economics.

500

Held-out eval questions

Our own eval set, curated from the corpus. We knew our accuracy within 2 points before submission.

The architecture that won the RAG Challenge is the same architecture we deploy for clients. Benchmarks that reward hallucination-free answers look a lot like production systems that have to not embarrass anyone.

— Viktor Andriichuk, Founder, DataFlux Software

Four decisions where naive RAG breaks in production.

1. Hybrid retrieval is not optional for enterprise documents. Financial data has too many exact-match terms for pure-vector to win on its own. BM25 catches what embeddings miss — ticker symbols, line-item names, footnote references.

2. Schema-validated refusal beats accuracy tuning. A RAG system that knows when to say "the document doesn't answer that" outperforms one that tries to be slightly more accurate on ambiguous questions. Refusal is an architecture decision, not a prompt.

3. Query decomposition before retrieval, not after. Multi-hop questions need sub-queries before retrieval. Retrieving once and hoping the LLM stitches multiple pieces together is the #1 failure mode in production RAG.

4. Eval harness on a held-out set, not on the test set. You don't optimize against the leaderboard. You build your own 500-question set, iterate there, and submit once.

Four-week research sprint.

We can ship it on your enterprise docs in 6–10 weeks.

Bring us your corpus, your accuracy target, and your refusal budget — we'll tell you what the architecture costs to build and run.