What is retrieval-augmented generation (RAG)?

RAG is an architecture that connects a large language model to an external knowledge base at query time. Instead of relying on what the model learned during training, RAG retrieves relevant documents from your data, passes them to the LLM as context, and generates answers grounded in your specific knowledge. This reduces hallucinations and keeps answers current without retraining the model.

How much does a production RAG system cost to build?

A basic RAG system (single data source, managed vector DB, standard retrieval) runs $15,000 to $30,000 to build. A production system with hybrid retrieval, reranking, evaluation pipelines, and multi-source ingestion typically costs $40,000 to $80,000. Ongoing infrastructure costs range from $300 to $2,000 per month depending on query volume and whether you self-host the vector database.

RAG vs fine-tuning: which should I choose?

Use RAG when your knowledge changes frequently, when you need source citations, or when your data is too large to fit in a context window. Use fine-tuning when your use case requires a specific response style or format, when you need high-volume inference at low cost, and when the knowledge is stable. Hybrid approaches work too: RAG for retrieval plus fine-tuning for format. They achieve 96% accuracy vs 89% for RAG alone.

What chunking strategy works best for RAG?

For most production systems, 512-token chunks with 50-100 token overlap are the baseline. This setup scores 69% on typical document retrieval benchmarks. Semantic chunking (splitting at natural topic boundaries rather than fixed tokens) improves recall by 8-12% for long documents. For structured data like tables and code, use specialized chunkers rather than recursive character splitting.

How do you reduce hallucinations in RAG?

Three layers: retrieval quality (better chunks and hybrid search reduce irrelevant context), reranking (cross-encoder rerankers cut hallucinations 15-20% by filtering low-confidence retrievals), and generation guardrails (instruct the LLM to cite sources and say 'I don't know' when the retrieved context doesn't support an answer). Hallucination detection at eval time using LLM-based judges catches remaining failures.

Which vector database should I use for RAG?

For most teams starting out: Qdrant or Pinecone. Qdrant is open-source, self-hostable, and handles complex metadata filtering well. Pinecone is managed and scales with zero ops overhead. If you're already on PostgreSQL, pgvector with binary quantization is now competitive for under 5 million vectors. Avoid Chroma for production. It lacks the replication and multi-tenancy features enterprise workloads require.

Build a Production RAG System in 2026

70% of enterprise GenAI deployments use RAG. The RAG market is growing at 49.1% CAGR, from USD 2.33B in 2025 to a projected USD 11B by 2030. Most deployments fail before generation even runs. The retrieval layer is where they die.

Here's the math nobody tells you upfront. A typical RAG pipeline has three core stages: retrieval, reranking, and generation. If each stage runs at 95% reliability and the stages fail independently, your end-to-end success rate is 0.95 × 0.95 × 0.95 ≈ 0.857. That means roughly one in seven queries fails. In production at 10,000 queries a day, that's about 1,400 broken answers delivered to real users.
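
The arithmetic is easy to check and to rerun with your own numbers. The following is a minimal sketch that assumes stage failures are independent; the function names are illustrative, not from any library.

```python
from math import prod

def end_to_end_success(stage_reliabilities):
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stage_reliabilities)

def expected_failures(stage_reliabilities, queries_per_day):
    """Expected number of failed queries per day."""
    return queries_per_day * (1 - end_to_end_success(stage_reliabilities))

stages = [0.95, 0.95, 0.95]  # retrieval, reranking, generation
success = end_to_end_success(stages)
failures = expected_failures(stages, 10_000)

print(f"{success:.3f}")  # 0.857
print(round(failures))   # 1426
```

Adding a fourth stage at the same 95% drops the product further, which is why each extra component in the pipeline needs its own reliability budget.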

We've built RAG systems for healthcare (My Medical Records AI), fintech (ShinyLoans), and real estate (Highline Residential). Each one taught us something different about where RAG breaks. This guide covers the full architecture, the seven failure points, and the cost model most teams ignore until the bill arrives.

What RAG Actually Is (And Why Most Explanations Get It Wrong)

RAG is foundational to modern generative AI applications. It connects an LLM to an external knowledge base at query time. Instead of asking the model to recall facts from training, you retrieve relevant documents from your own data, pass them as context, and generate answers grounded in your specific knowledge. The model becomes a reasoning engine over your data rather than a static knowledge repository.

Most explanations stop there. They skip the part that actually matters: RAG is not a single technique. It's a pipeline with multiple components, each of which can fail independently. A well-tuned embedding model with a poor chunking strategy still returns bad results. A perfect retrieval layer with a weak reranker still buries the right documents. You have to get all of it right.

The RAG Pipeline at a Glance

1. Ingest & Chunk

2. Embed

3. Store

4. Retrieve

5. Rerank & Generate

The Seven Failure Points in Production RAG

A 2024 arXiv study identified seven failure modes in RAG systems (Barnett et al., 2024). These map exactly to what we see in production. Every broken RAG system we've inherited hits at least two of them.

Missing Content

The knowledge base doesn't contain the answer. This sounds obvious but it's the most common failure. Teams embed marketing copy and FAQ pages, then wonder why the system can't answer technical support questions. Audit your corpus before you build.

Missed Top Documents

The right documents exist but don't rank in the top-k results. This is a retrieval problem. Dense vector search alone misses exact matches. Keyword search alone misses semantic meaning. The fix is hybrid retrieval: BM25 plus dense vectors, combined with Reciprocal Rank Fusion.

Context Truncation

The answer is retrieved but cut off at the context window limit. Longer documents get trimmed. Fix: smaller, more focused chunks. Track average chunk length and set a max that preserves meaning.

Wrong Format Extraction

Tables, code blocks, and structured data don't survive recursive text splitting. A financial table becomes garbled tokens. Use specialized parsers for PDFs, spreadsheets, and code. LlamaParse and unstructured.io handle most cases.

Incorrect Specificity

The system retrieves something relevant but not specific enough. A question about Q3 2025 revenue gets a Q3 2024 answer because the embedding similarity is high. Add metadata filtering on date, document type, and source to enforce specificity.
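
To make the metadata filter concrete, here is a small in-memory sketch. It assumes each chunk carries a `meta` dictionary and a precomputed vector; the field names (`period`, `type`) and the two-dimensional vectors are hypothetical. A real vector database would apply the filter inside the query rather than in application code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filtered_search(query_vec, chunks, filters, top_k=5):
    """Keep only chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return candidates[:top_k]

chunks = [
    {"id": "rev-q3-2024", "vector": [0.9, 0.1], "meta": {"period": "2024-Q3", "type": "earnings"}},
    {"id": "rev-q3-2025", "vector": [0.8, 0.2], "meta": {"period": "2025-Q3", "type": "earnings"}},
]
query_vec = [1.0, 0.0]

unfiltered = filtered_search(query_vec, chunks, {})
filtered = filtered_search(query_vec, chunks, {"period": "2025-Q3"})
```

In this toy data the 2024 chunk is slightly closer to the query, so it wins without a filter. With the `period` filter it is never a candidate.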

Data Freshness Lag

The knowledge base goes stale. A document updated yesterday still serves last month's version. You need an incremental ingestion pipeline: detect changes, re-embed only the updated chunks, update the index without full re-ingestion. This is the piece most RAG tutorials skip entirely.

LLM Hallucination Over Retrieved Context

The model ignores the retrieved context and generates from its training data instead. This happens most often when the context is noisy or contradictory. Fix: constrain the system prompt to only answer from provided context, and add a hallucination detector at eval time.
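
One way to constrain the prompt is sketched below. The wording of the instructions is an assumption for illustration, not a tested prompt; tune it against your own evaluation set.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the numbered context passages."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered context below. "
        "Cite the passages you used, for example [1]. "
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase.", "Shipping takes 3-5 business days."],
)
```

Numbering the passages gives the model something specific to cite, and gives an eval-time judge something to check each claim against.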

Chunking Strategy: The Part That Determines Everything Else

Bad chunking is the #1 cause of poor retrieval. Get chunking wrong and no amount of model tuning will save you. Here are the strategies that actually work in production.

Strategy	Best For	Recall Score	Complexity
Fixed-size (512 tokens, 10% overlap)	General docs, FAQs	69%	Low
Recursive character splitting	Mixed content, articles	72%	Low
Semantic chunking	Long documents, reports	81%	Medium
Document-aware (headers, sections)	PDFs, technical docs	84%	Medium
Hierarchical (parent/child chunks)	Complex knowledge bases	88%	High

Start with fixed-size chunking at 512 tokens with 50-100 token overlap. It's the fastest baseline to validate. Once retrieval quality looks reasonable, switch to semantic or document-aware chunking. The recall jump from 69% to 81-84% is worth the added complexity for most production systems.
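
A fixed-size baseline with overlap fits in a few lines. This sketch assumes the text is already tokenized; the whitespace split in the example is only a stand-in for the tokenizer of whichever embedding model you use, and the 64-token overlap is one arbitrary value inside the 50-100 range.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

text = "word " * 1000
tokens = text.split()  # stand-in tokenizer: real systems count model tokens
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
```

Each chunk starts 448 tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.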

Retrieval: Why Dense Vectors Alone Aren't Enough

Dense vector search excels at semantic similarity. It finds documents that mean the same thing even when they use different words. But it struggles with exact matches: product codes, names, IDs, and specific technical terms. A user searching for "error code E-4021" gets back documents about general error handling instead.

Hybrid retrieval solves this. Combine dense vector search (using embeddings like OpenAI's text-embedding-3-small or open-source models like BGE-M3) with BM25 keyword search. Use Reciprocal Rank Fusion (RRF) to merge the result lists. This approach consistently outperforms either method alone, especially on technical and structured queries.
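
RRF itself is simple: each document scores the sum of 1 / (k + rank) across the result lists it appears in. The sketch below assumes the two retrievers have already returned ranked lists of document IDs; k=60 is a commonly used constant, and the IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc IDs; higher fused score means better rank."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_results = ["d3", "d1", "d2"]  # from vector search
bm25_results = ["d1", "d4", "d3"]   # from keyword search

fused = reciprocal_rank_fusion([dense_results, bm25_results])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.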

Embedding model costs in 2026: OpenAI text-embedding-3-small costs $0.02/1M tokens. text-embedding-3-large costs $0.13/1M tokens. BGE-M3 self-hosted adds ops overhead but eliminates per-token costs. For most enterprise workloads, text-embedding-3-small plus Qdrant self-hosted hits the right cost-performance balance. Use our free AI token counter to estimate costs before your first API call.

Reranking: The 15-20% Accuracy Boost Most Teams Skip

After retrieval returns the top-k documents, a reranker re-scores each one against the original query using a cross-encoder model. Cross-encoders are more accurate than bi-encoders but too slow for full-corpus search. Running them over just the top 20-50 retrieved candidates is the right trade-off.
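
The reranking step can be sketched with a pluggable scoring function. The `toy_overlap_score` below is a deliberately crude stand-in so the example runs without a model; in a real system `score_fn` would call a cross-encoder such as the ones named in this section.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score retrieved candidates against the query and keep the best top_n."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [doc for _, doc in scored[:top_n]]

def toy_overlap_score(query, doc):
    """Stand-in scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

candidates = [
    "General error handling guide",
    "Resolving error code E-4021 on startup",
    "Billing FAQ",
]
top = rerank("error code E-4021", candidates, toy_overlap_score, top_n=2)
```

Keeping the scorer behind a function argument makes it easy to swap a managed reranker for a self-hosted one without touching the rest of the pipeline.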

We consistently see 15-20% accuracy improvements from adding a reranker. Cohere Rerank is the managed option. BGE-Reranker-v2 is the open-source equivalent. Latency cost is 50-150ms per query. For most conversational use cases, that's an acceptable trade for significantly better results.

Evaluation: How to Know if Your RAG System Actually Works

Most teams eyeball a few example queries and call the system good. That's not evaluation. You need a test set of 50-200 question-answer pairs generated from your actual documents. Run every retrieval change against this set and track NDCG (Normalized Discounted Cumulative Gain) or MRR (Mean Reciprocal Rank) at k=5.
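
Both metrics are short enough to implement directly. This sketch assumes binary relevance (a document is either relevant or not) and that each retrieved list contains unique document IDs; the test-set format, a list of (retrieved, relevant) pairs, is an assumption for illustration.

```python
import math

def reciprocal_rank(retrieved, relevant, k=5):
    """1 / rank of the first relevant doc in the top k, or 0 if none appears."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k=5):
    """NDCG@k with binary relevance: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def mean_reciprocal_rank(runs, k=5):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel, k) for ret, rel in runs) / len(runs)

runs = [
    (["a", "b", "c"], {"b"}),  # first relevant doc at rank 2
    (["x", "y", "z"], {"x"}),  # first relevant doc at rank 1
    (["p", "q"], {"r"}),       # relevant doc never retrieved
]
mrr = mean_reciprocal_rank(runs)
ndcg_first = ndcg_at_k(*runs[0])
```

Run this over the full test set after every chunking or retrieval change and log the numbers, so a regression shows up as a drop rather than as a user complaint.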

For end-to-end evaluation, use RAGAs (Retrieval Augmented Generation Assessment). It measures four dimensions automatically: answer faithfulness (is the answer grounded in context?), answer relevancy, context precision, and context recall. Pair this with your data analytics stack to track query quality trends over time.

Cost Model: What Production RAG Actually Costs

Component	Managed ($/month)	Self-Hosted ($/month)
Vector DB (10M vectors)	$70-$200 (Pinecone)	$20-$50 (Qdrant on VPS)
Embeddings (1M queries)	$20 (text-embedding-3-small)	$0 + $30 GPU compute
LLM generation (1M tokens)	$15 (GPT-4o-mini)	$5-10 (Llama 3 hosted)
Reranker (1M queries)	$10 (Cohere)	$15 GPU compute
Typical total	$115-$245	$70-$105 + 8-12 eng hrs/mo

Self-hosted saves money above roughly 60-80 million queries per month. Below that, managed infrastructure is usually cheaper when you factor in engineering time. For most enterprise RAG systems in early production, start managed. Migrate to self-hosted when the savings justify the ops overhead.

Data Freshness: The Problem Nobody Talks About

A RAG system is only as good as its knowledge base. If your documents update daily but your ingestion pipeline runs weekly, users get stale answers. This is particularly painful for support systems, internal wikis, and policy documents.

Build an event-driven ingestion pipeline from day one. When a document is created, updated, or deleted, trigger a re-ingestion job that deletes the old chunks from the vector DB and inserts the new ones. Tools like n8n, Airflow, and AWS Lambda handle the orchestration. Index freshness should be a first-class SLA, not an afterthought.
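
The change-detection part of such a pipeline can be as simple as comparing content hashes. This is a minimal sketch, assuming you store one hash per document at ingestion time; `plan_sync` and the dictionary shapes are hypothetical, and the actual delete and upsert calls depend on your vector DB client.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(indexed_hashes, current_docs):
    """Decide which documents to re-embed and which to remove from the index.

    indexed_hashes: doc_id -> hash recorded at the last ingestion
    current_docs:   doc_id -> current document text
    """
    to_upsert = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_upsert, to_delete

indexed = {
    "policy": content_hash("v1 of the policy"),
    "faq": content_hash("unchanged faq"),
    "old-page": content_hash("retired content"),
}
current = {
    "policy": "v2 of the policy",
    "faq": "unchanged faq",
    "pricing": "brand new page",
}
to_upsert, to_delete = plan_sync(indexed, current)
```

For each ID in `to_upsert`, delete its old chunks before inserting the new ones so stale text never sits next to fresh text in the index. The same idea works at chunk granularity if whole-document re-embedding is too expensive.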

Security and Compliance in Production RAG

For healthcare and fintech, RAG introduces real compliance surface. Patient records, loan documents, and financial reports can't be exposed across tenant boundaries. Multi-tenant RAG needs namespace isolation at the vector DB level (Pinecone namespaces, Qdrant payload filters), not just at the application level.

For My Medical Records AI, we built tenant-level namespaces in Qdrant with document-level ACLs enforced before retrieval. Every query is scoped to the requesting user's organization. The retrieved context never crosses organizational boundaries. HIPAA compliance isn't optional, and retrofitting it after launch costs more than building it in.
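
The scoping rule can be illustrated in plain Python. This is a simplified sketch, not the implementation described above: the `User` type, the `org_id` and `allowed_groups` fields, and the in-memory list are all assumptions. In a real deployment the same conditions would be pushed into the vector DB query as a namespace or payload filter, and the user context would come from the authenticated session, never from request parameters.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class User:
    user_id: str
    org_id: str
    groups: frozenset = field(default_factory=frozenset)

def visible_chunks(chunks, user):
    """Return only chunks in the user's organization that one of their groups may read."""
    return [
        c for c in chunks
        if c["org_id"] == user.org_id and (c["allowed_groups"] & user.groups)
    ]

chunks = [
    {"id": "c1", "org_id": "org-a", "allowed_groups": {"clinicians"}},
    {"id": "c2", "org_id": "org-b", "allowed_groups": {"clinicians"}},  # other tenant
    {"id": "c3", "org_id": "org-a", "allowed_groups": {"billing"}},     # wrong group
]
user = User(user_id="u1", org_id="org-a", groups=frozenset({"clinicians"}))
allowed = visible_chunks(chunks, user)
```

Applying the filter before similarity search, rather than trimming results afterwards, means a cross-tenant chunk can never occupy a top-k slot or leak through a ranking side channel.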

Production RAG Checklist

Before You Launch

✓ Corpus audit: every answer your system needs exists in the knowledge base
✓ Chunking validated: recall at k=5 above 80% on your test set
✓ Hybrid retrieval: dense + BM25 combined with RRF
✓ Reranker in place: cross-encoder reranking top 20 retrieved docs
✓ Evaluation suite: 100+ question-answer pairs, RAGAs tracking faithfulness
✓ Freshness pipeline: automated re-ingestion on document updates
✓ Multi-tenant isolation: namespace or payload-level ACLs
✓ Hallucination guardrails: system prompt constraints + LLM-based detection
✓ Cost monitoring: alerts on per-query spend and total monthly burn
✓ Latency SLA: p99 under 2 seconds for conversational interfaces

When RAG Isn't the Right Answer

RAG works best when your knowledge changes frequently, when you need citations, and when your data is too large for the context window. It's overkill for simple tasks where a well-tuned prompt and a stable system prompt would work fine.

For high-volume, low-latency use cases where the knowledge is stable, fine-tuning is often cheaper and faster. A customer support bot that handles 50,000 repetitive queries a day about a fixed product catalog should probably use a fine-tuned GPT-4o-mini rather than run RAG at scale. The retrieval overhead adds latency and cost for little benefit when the answers don't change.

We cover the full decision framework in our article on fine-tuning vs RAG vs prompt engineering. The short version: start with prompting, add RAG when the knowledge base grows, fine-tune only when both options hit a ceiling.

Sources

Barnett, S. et al. (2024). Seven Failure Points When Engineering a Retrieval Augmented Generation System. arXiv.
RAGAs (2024). RAGAs: Retrieval Augmented Generation Assessment Framework. Documentation.
OpenAI (2024). Embeddings: OpenAI API Documentation.
BAAI (2024). BGE-M3: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings. Hugging Face.
Qdrant (2025). Qdrant Vector Database Documentation.