A benchmarking study demonstrating how agentic multi-step RAG outperforms single-shot retrieval on complex banking compliance queries, with measurable, evaluator-graded results.
Compliance professionals in banking and financial services routinely ask complex, multi-part questions that span multiple regulatory documents. A question like "What are the capital adequacy requirements for community banks, and how do they interact with lending limits?" requires synthesizing information from several semantically disconnected policy sections.
Traditional single-shot RAG systems fail here. They retrieve once, using the user's initial phrasing, which rarely matches the terminology used across all relevant regulatory documents. The result is a confident-sounding answer that is missing critical context, a dangerous failure mode in compliance-heavy environments.
"The most dangerous failure mode is not hallucination—it is confident incompleteness."
— From the benchmarking analysis
I designed and implemented an agentic RAG workflow using LangGraph that iteratively refines its retrieval strategy until the retrieved context demonstrably covers all dimensions of the question. The system was benchmarked against a standard single-shot RAG pipeline on the same knowledge base and evaluated by an LLM judge.
| Agent | Responsibility |
|---|---|
| Query Rewriter | Rewrites raw user query into semantically richer form using corpus-native regulatory terminology before retrieval |
| Retriever | Fetches semantically relevant documents from the NCUA compliance knowledge base |
| Relevance Grader | Quality gate: explicitly judges whether retrieved docs actually cover all dimensions of the question |
| Decision Router | Routes to Generator if the retrieved docs are graded relevant; otherwise routes back to Rewriter while retry budget remains |
| Retry Counter | Enforces retry budget; triggers graceful fallback to Generator when budget exhausted |
| Generator | Synthesizes final answer from original question and best retrieved context |
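
The control flow described in the table can be sketched in plain Python. This is a minimal illustration, not the LangGraph implementation used in the study: the names `RagState`, `run_agentic_rag`, and `max_retries`, and the callable signatures, are assumptions made for the example, and "best retrieved context" is simplified here to the most recently retrieved documents.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RagState:
    """Hypothetical state object passed between the steps of the loop."""
    question: str                      # original user question, never overwritten
    query: str                         # current (possibly rewritten) retrieval query
    docs: List[str] = field(default_factory=list)
    retries: int = 0                   # rewrite/retrieve retries used so far


def run_agentic_rag(
    question: str,
    rewrite: Callable[[str, int], str],              # Query Rewriter
    retrieve: Callable[[str], List[str]],            # Retriever
    is_relevant: Callable[[str, List[str]], bool],   # Relevance Grader
    generate: Callable[[str, List[str]], str],       # Generator
    max_retries: int = 2,                            # assumed retry budget
) -> str:
    state = RagState(question=question, query=question)
    while True:
        # Rewrite from the ORIGINAL question so phrasing drift does not compound.
        state.query = rewrite(state.question, state.retries)
        state.docs = retrieve(state.query)

        # Decision Router: go to the Generator once the grader is satisfied.
        if is_relevant(state.question, state.docs):
            break

        # Retry Counter: budget exhausted, fall back to the Generator
        # with whatever context was retrieved last.
        if state.retries >= max_retries:
            break
        state.retries += 1

    # The Generator always sees the original question, not the rewritten query.
    return generate(state.question, state.docs)
```

With `max_retries=2` the loop performs at most three retrieval attempts (one initial pass plus two retries) before falling back. In a graph framework, each callable would typically become a node and the two `break` conditions a conditional edge.
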
Knowledge base: ~33 publicly available NCUA compliance documents, ~304K tokens, ~739 chunks. Embeddings generated with voyage-law-2, a legal and regulatory-optimized embedding model. Evaluation profile: Compliance-Grade (50% accuracy weight, 30% source coverage).
The benchmarking study evaluated both pipelines on 10 representative compliance queries. Results were scored by an LLM judge across five dimensions: accuracy, completeness, source coverage, consistency, and latency.
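To make the weighted scoring concrete, here is a small sketch of how a per-query score could be combined under the Compliance-Grade profile. Only the accuracy (50%) and source coverage (30%) weights come from the text above; the split of the remaining 20% across completeness, consistency, and latency, the 0 to 10 score scale, and the function name `weighted_score` are assumptions made purely for illustration.

```python
from typing import Dict

# Accuracy and source coverage weights are stated in the text.
# The other three weights are ASSUMED for this example; the source does not give them.
COMPLIANCE_GRADE_WEIGHTS: Dict[str, float] = {
    "accuracy": 0.50,
    "source_coverage": 0.30,
    "completeness": 0.10,   # assumed
    "consistency": 0.05,    # assumed
    "latency": 0.05,        # assumed
}


def weighted_score(
    scores: Dict[str, float],
    weights: Dict[str, float] = COMPLIANCE_GRADE_WEIGHTS,
) -> float:
    """Combine per-dimension judge scores into one weighted score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalise by the total weight so the result stays on the input scale
    # even if the weights do not sum to exactly 1.
    return sum(weights[dim] * scores[dim] for dim in weights) / total_weight
```

Under these assumed weights, an answer that is accurate but cites few sources is penalised far more than one that is merely slow, which matches the intent of a compliance-oriented profile.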
[Figure: benchmark score comparison (selected queries)]
Read the full technical analysis on the blog: Benchmarking Basic vs Agentic RAG with LLM-as-a-Judge →
Whether it's compliance, legal, policy, or internal knowledge, I can design a RAG system that gives you complete, reliable answers, not just confident-sounding ones.