Banking Compliance · Agentic RAG · LLM-as-a-Judge

Compliance Intelligence Assistant

A benchmarking study demonstrating how agentic multi-step RAG outperforms single-shot retrieval on complex banking compliance queries, with measurable, evaluator-graded results.

8/10

Benchmark queries won by Agentic RAG

+12%

Improvement in composite quality score (0.831 → 0.929)

+29pt

Score improvement on hardest multi-part queries

🏦 NCUA Compliance Domain 🤖 LangGraph Multi-Agent 📊 LLM-as-a-Judge Evaluation 🏛 Anthropic Claude + voyage-law-2

The Challenge

Compliance Queries That Basic RAG Can't Answer Completely

Compliance professionals in banking and financial services routinely ask complex, multi-part questions that span multiple regulatory documents. A question like "What are the capital adequacy requirements for community banks, and how do they interact with lending limits?" requires synthesizing information from several semantically disconnected policy sections.

Traditional single-shot RAG systems fail here. They retrieve based on the phrasing of the initial query, which rarely matches the terminology used across all of the relevant regulatory documents. The result is confident-sounding answers that are missing critical context, a dangerous failure mode in compliance-heavy environments.

"The most dangerous failure mode is not hallucination—it is confident incompleteness."
— From the benchmarking analysis

The Solution

Agentic Multi-Step Retrieval with Iterative Refinement

I designed and implemented an agentic RAG workflow using LangGraph that iteratively refines its retrieval strategy until the retrieved context demonstrably covers all dimensions of the question. The system was benchmarked against a standard single-shot RAG pipeline on the same knowledge base and evaluated by an LLM judge.

User Query
    ↓
Query Rewriter (corpus-native regulatory terminology)
    ↓
Retriever (ChromaDB vector search, voyage-law-2 embeddings)
    ↓
Relevance Grader (explicit quality gate: does this cover all dimensions?)
    ↓ [not relevant]            ↓ [relevant]
Decision Router            Generator → Answer
    ↑
Retry Counter (budget enforcement, graceful fallback)

Agent	Responsibility
Query Rewriter	Rewrites raw user query into semantically richer form using corpus-native regulatory terminology before retrieval
Retriever	Fetches semantically relevant documents from the NCUA compliance knowledge base
Relevance Grader	Quality gate: explicitly judges whether retrieved docs actually cover all dimensions of the question
Decision Router	Routes to the Generator if the context is relevant; otherwise routes back to the Query Rewriter while retry budget remains
Retry Counter	Enforces the retry budget; triggers a graceful fallback to the Generator when the budget is exhausted
Generator	Synthesizes final answer from original question and best retrieved context
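
The control flow in the diagram and table above can be sketched without any framework. This is a minimal illustration, not the LangGraph implementation used in the study: the names `RagState` and `run_agentic_rag`, the callable signatures, and the default budget of two retries are all assumptions made for the example. The four agents are passed in as plain functions so the loop itself stays visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RagState:
    """Hypothetical state object passed between the agents."""
    question: str                                   # original question, never overwritten
    query: str = ""                                 # current rewritten retrieval query
    docs: List[str] = field(default_factory=list)   # most recently retrieved context
    retries: int = 0                                # rewrites performed after the first attempt
    answer: str = ""
    fallback: bool = False                          # True if the retry budget ran out


def run_agentic_rag(
    question: str,
    rewrite: Callable[[str, int], str],             # Query Rewriter
    retrieve: Callable[[str], List[str]],           # Retriever
    grade: Callable[[str, List[str]], bool],        # Relevance Grader
    generate: Callable[[str, List[str]], str],      # Generator
    max_retries: int = 2,                           # assumed budget, not from the study
) -> RagState:
    state = RagState(question=question)
    while True:
        # Rewrite from the original question each time; the attempt number
        # lets the rewriter vary its strategy between retries.
        state.query = rewrite(state.question, state.retries)
        state.docs = retrieve(state.query)

        # Quality gate: does the context cover every dimension of the question?
        if grade(state.question, state.docs):
            break

        # Retry Counter: enforce the budget and fall back gracefully.
        if state.retries >= max_retries:
            state.fallback = True
            break
        state.retries += 1

    # The Generator always sees the original question, not the rewritten query.
    # Simplification: this sketch keeps the latest context rather than tracking
    # the best context across attempts.
    state.answer = generate(state.question, state.docs)
    return state
```

In a LangGraph version each callable would typically become a node, and the two `break` conditions would become a conditional edge out of the grader. The `fallback` flag is kept on the state so that a UI can tell the user when an answer was produced from context that failed the quality gate.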

Knowledge base: ~33 publicly available NCUA compliance documents (~304K tokens, ~739 chunks). Embeddings were generated with voyage-law-2, an embedding model optimized for legal and regulatory text. Evaluation profile: Compliance-Grade (50% weight on accuracy, 30% on source coverage).

LangGraph · Anthropic Claude · voyage-law-2 · ChromaDB · Python 3.12 · Streamlit · LLM-as-a-Judge

The Results

Agentic RAG Wins on the Queries That Matter Most

The benchmarking study evaluated both pipelines on 10 representative compliance queries. Results were scored by an LLM judge across five dimensions: accuracy, completeness, source coverage, consistency, and latency.
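
A weighted composite over the five judged dimensions might look like the sketch below. Only two weights are stated in this write-up (50% accuracy and 30% source coverage under the Compliance-Grade profile). How the remaining 20% is split across completeness, consistency, and latency is not stated, so the values used here are placeholders, as is the assumption that every dimension (including latency) has already been normalized to a 0 to 1 score. The function name `composite_score` is hypothetical.

```python
from typing import Dict

# Accuracy and source-coverage weights come from the Compliance-Grade profile.
# The remaining 0.20 is split arbitrarily for illustration only.
COMPLIANCE_GRADE_WEIGHTS: Dict[str, float] = {
    "accuracy": 0.50,
    "source_coverage": 0.30,
    "completeness": 0.10,   # assumed placeholder
    "consistency": 0.05,    # assumed placeholder
    "latency": 0.05,        # assumed placeholder
}


def composite_score(
    scores: Dict[str, float],
    weights: Dict[str, float] = COMPLIANCE_GRADE_WEIGHTS,
) -> float:
    """Weighted sum of per-dimension judge scores, each expected in [0, 1]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
    return sum(weights[name] * scores[name] for name in weights)
```

With accuracy and source coverage carrying 80% of the weight, a pipeline that answers quickly but misses a relevant source is penalized far more than one that is slow but complete, which is the intent of a compliance-oriented profile.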

8/10

Benchmark queries won by Agentic RAG

0.929

Agentic composite score vs 0.831 for Basic RAG

+29pt

Improvement on asset threshold compliance query (Q4)

Benchmark score comparison (selected queries), Agentic RAG improvement over Basic RAG:

Query	Improvement
Q4: Asset Thresholds	+29pt
Q8: PCA Obligations	+24pt
Q10: CRE Limits	+17pt
Overall Average	+10pt
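
The headline +12% and the +10pt overall average describe the same gap between the two composite scores, expressed two ways. The quick check below assumes that "pt" means hundredths of the 0 to 1 composite score, which is consistent with the reported figures.

```python
basic, agentic = 0.831, 0.929  # composite scores reported in this study

# Absolute gap, expressed in points on a 0-100 scale.
absolute_points = (agentic - basic) * 100

# Relative gap, expressed as a percentage of the Basic RAG score.
relative_pct = (agentic - basic) / basic * 100

print(f"{absolute_points:.1f} points")  # 9.8 points, reported as +10pt
print(f"{relative_pct:.1f}% relative")  # 11.8% relative, reported as +12%
```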

Agentic RAG's iterative query rewriting ensures multi-part questions are answered completely, not just partially
Relevance grading eliminates the "confident incompleteness" failure mode that plagues single-shot RAG
Plateau detection prevents runaway retries on questions the knowledge base genuinely cannot answer
Tradeoff acknowledged: Agentic RAG costs 3.4× more per query and takes 2.7× longer to respond, which makes it appropriate for board-level compliance decisions rather than routine FAQ lookup
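
This write-up does not describe how plateau detection is implemented. One plausible form is sketched below, under the assumption that the Relevance Grader can emit a numeric coverage score per attempt rather than only a relevant/not-relevant verdict. The function name `has_plateaued` and the `min_gain` and `patience` parameters are hypothetical.

```python
from typing import Sequence


def has_plateaued(
    scores: Sequence[float],
    min_gain: float = 0.02,   # assumed: smallest improvement worth another retry
    patience: int = 2,        # assumed: number of recent attempts to inspect
) -> bool:
    """Return True when the last `patience` attempts failed to beat the
    earlier best grader score by at least `min_gain`."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_gain
```

A check like this would sit alongside the retry budget: the budget caps the worst case, while the plateau test stops earlier when further rewrites are clearly not surfacing better context, for example because the knowledge base does not contain the answer.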

Read the full technical analysis on the blog: Benchmarking Basic vs Agentic RAG with LLM-as-a-Judge →

