Enterprise RAG Challenge — a winning architecture for document-grounded answers
A short technical case study. The architecture that placed first at the Enterprise RAG Challenge is the same architecture we deploy for clients: hybrid retrieval, structured document parsing, schema-validated refusal, and query decomposition. Benchmarks that reward hallucination-free answers look a lot like production systems that have to not embarrass anyone.
Context
Open Enterprise RAG Challenge — a public benchmark on enterprise financial documents. Not a client engagement. The architecture we submitted is the architecture we deploy when a client asks for grounded Q&A over their corpus.
Engagement
4-week research sprint. Architecture + eval harness open-sourced.
Dumb-baseline RAG ships broken production systems.
The Enterprise RAG Challenge tests a system's ability to answer precise questions against a large corpus of enterprise financial documents: annual reports, 10-Ks, investor presentations. Correct answers require (1) retrieving the right page, (2) extracting the right figure, and (3) not hallucinating when the document doesn't contain the answer.
The dumb baseline — embed chunks, top-k retrieval, stuff into GPT-4o — lands around 55–60% on a representative eval set. Production RAG systems built on that architecture ship wrong answers every week. We built the submission the same way we build production RAG for clients — because the techniques that win benchmarks are the techniques that survive deployment.
Hybrid retrieval. Structured parsing. Schema-validated refusal. Query decomposition.
Four layers, each fixing a specific failure mode in naive RAG.
Hybrid retrieval: BM25 + dense embeddings + reranker. Pure-vector retrieval misses exact-match financial terms (ticker symbols, line-item names). Pure-BM25 misses paraphrased queries. We fan out to both, then rerank with a cross-encoder (bge-reranker-large) that scores each candidate against the query. The top 5 after reranking go to the generator.
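The write-up doesn't name the rule for merging the BM25 and dense candidate lists before the reranker sees them; reciprocal rank fusion (RRF) is one standard choice, sketched below. The chunk ids and ranked lists are made up for illustration.

```python
def rrf_fuse(bm25_ranked: list[str], dense_ranked: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, chunk_id in enumerate(ranked):
            # Standard RRF: each list contributes 1 / (k + rank); the constant k
            # damps the advantage of a single first-place vote.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidates: exact-match hits from BM25, paraphrase hits from dense.
bm25 = ["10k_p42", "10k_p7", "deck_p3"]
dense = ["deck_p3", "10k_p42", "10k_p88"]
candidates = rrf_fuse(bm25, dense)   # union of both lists, in fused order
```

The fused union (four ids here) is what goes to the cross-encoder; the reranker, not the fusion rule, decides the final top 5.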
Structured document parsing, not naive chunking. Financial PDFs have tables, footnotes, page numbers, and section hierarchy. We parse with Unstructured.io for layout + a custom table extraction pass for income statements, balance sheets, and cash flows. Each chunk carries structured metadata: page, section, table name, row. That metadata survives into the generator's context.
Schema-validated generation with a refusal path. The generator (GPT-4o) outputs a Pydantic-typed response with fields: answer, confidence, evidence_spans, refusal_reason. If the evidence doesn't support the answer, the schema forces a refusal: the model has no field in which to invent a number it didn't see.
Query decomposition for multi-hop questions. Questions like "what was the YoY change in segment X revenue" need two retrievals. A pre-step LLM call decomposes the question into atomic sub-queries, retrieves for each, and the generator sees all evidence before answering.
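The orchestration above can be sketched with the decomposer stubbed out. The keyword trigger below is a toy stand-in for the real pre-step LLM call, and the function names are illustrative.

```python
def decompose(question: str) -> list[str]:
    # Toy stand-in for the pre-step LLM call: a YoY question becomes one
    # atomic sub-query per fiscal year. Production uses an LLM, not keywords.
    if "YoY change in " in question:
        base = question.replace("YoY change in ", "")
        return [f"{base} (current fiscal year)", f"{base} (prior fiscal year)"]
    return [question]

def answer(question, retrieve, generate):
    sub_queries = decompose(question)
    # Retrieve per sub-query; the generator sees the pooled evidence at once.
    evidence = [chunk for q in sub_queries for chunk in retrieve(q)]
    return generate(question, evidence)

subs = decompose("what was the YoY change in segment X revenue")
```

The point is the ordering: both retrievals complete before a single generation call, so the model never has to answer a two-hop question from one hop of evidence.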
Eval harness on a held-out set. Every pipeline change ran against our own 500-question eval set (not the competition set) before submission. We knew our accuracy within 2 points before the public result landed.
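A minimal sketch of the scoring loop, assuming numeric ground truth and None for unanswerable questions; the open-sourced harness's actual matching rules aren't reproduced here.

```python
def run_eval(pipeline, eval_set, rel_tol=0.005):
    """Fraction correct over (question, expected) pairs.

    expected is None for unanswerable questions, where only an explicit
    refusal (pipeline returns None) scores; otherwise a numeric answer
    within rel_tol of ground truth scores.
    """
    correct = 0
    for question, expected in eval_set:
        got = pipeline(question)
        if expected is None:
            correct += got is None          # fabricating an answer scores zero
        elif got is not None and abs(got - expected) <= rel_tol * abs(expected):
            correct += 1
    return correct / len(eval_set)

# Toy pipeline: answers one known question, refuses everything else.
kb = {"fy23 net revenue": 4218.0}
toy = lambda q: kb.get(q)
score = run_eval(toy, [("fy23 net revenue", 4218.0), ("segment margin", None)])
```

Scoring refusals explicitly is what makes the harness track the hallucination metric, not just accuracy.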
Architecture (data flow)
First place. More importantly, zero hallucinations on refusal-required questions.
On the leaderboard
Enterprise RAG Challenge. Evaluated against the organizers' private question set.
Top-5 hit rate
Retrieval hit rate on held-out eval — the correct evidence span landed in the top 5 after rerank.
Hallucinations
On 18% of questions where the document didn't contain an answer, the system refused via the Pydantic schema. Zero fabricated numbers.
Questions decomposed
Multi-hop questions where the decomposer fired. These were the questions that broke naive RAG in our eval.
Avg per answer
All-in cost: decomposition + retrieval + rerank + generation. Representative of production economics.
Held-out eval questions
Our own eval set, curated from the corpus. We knew our accuracy within 2 points before submission.
The architecture that won the RAG Challenge is the same architecture we deploy for clients. Benchmarks that reward hallucination-free answers look a lot like production systems that have to not embarrass anyone.
Four decisions where naive RAG breaks in production.
1. Hybrid retrieval is not optional for enterprise documents. Financial data has too many exact-match terms for pure-vector to win on its own. BM25 catches what embeddings miss — ticker symbols, line-item names, footnote references.
2. Schema-validated refusal beats accuracy tuning. A RAG system that knows when to say "the document doesn't answer that" outperforms one that tries to be slightly more accurate on ambiguous questions. Refusal is an architecture decision, not a prompt.
3. Query decomposition before retrieval, not after. Multi-hop questions need sub-queries before retrieval. Retrieving once and hoping the LLM stitches multiple pieces together is the #1 failure mode in production RAG.
4. Eval harness on a held-out set, not on the test set. You don't optimize against the leaderboard. You build your own 500-question set, iterate there, and submit once.
Four-week research sprint.
- Week 1
Corpus analysis + baseline
Pinecone + GPT-4o baseline landed at 63% on our eval. Documented every failure mode.
- Week 2
BM25 + rerank + structured parsing
Hybrid retrieval layer → 78%. Unstructured.io + table extraction → 84%. Retrieval failure rate collapsed.
- Week 3
Query decomposition + schema-validated refusal
Multi-hop questions unlocked. Refusal path forced hallucinations to zero. >90% on held-out eval.
- Week 4
Final runs + open-source
Eval harness CI-gated. Architecture and eval-harness repo published. Submission.
Same architecture, different domains.
We can ship it on your enterprise docs in 6–10 weeks.
Bring us your corpus, your accuracy target, and your refusal budget — we'll tell you what the architecture costs to build and run.