RAG Architecture
What is RAG?
Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval. Instead of relying solely on parametric knowledge learned during training, RAG systems dynamically fetch relevant information to generate more accurate, up-to-date, and verifiable responses.
Why RAG is Needed
LLM Limitations
- Knowledge cutoff date
- Can't access private data
- May hallucinate facts
- Can't cite sources
RAG Solutions
- Real-time information access
- Query private knowledge bases
- Grounded in retrieved facts
- Provides source attribution
RAG Pipeline Architecture
Core Components
Documents → Embeddings → Vector DB → Retrieval → Generation

- Documents: source data
- Embeddings: vector representations
- Vector DB: indexed storage
- Retrieval: find relevant chunks
- Generation: LLM response
Step-by-Step Process
1. Document Processing: Split documents into chunks, clean text, extract metadata
2. Embedding Generation: Convert text chunks into high-dimensional vectors
3. Indexing: Store embeddings in a vector database with efficient search structures
4. Query Processing: Embed the user query and search for similar chunks
5. Context Assembly: Combine the retrieved chunks with the query into a prompt
6. Response Generation: The LLM generates an answer using the retrieved context (a minimal end-to-end sketch follows)
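To make the flow concrete, here is a minimal, framework-free sketch of the six steps. The embed() function is a random placeholder standing in for a real embedding model, and a plain Python list plays the role of the vector database.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

def chunk_text(document: str, size: int = 200) -> list[str]:
    # Step 1: naive fixed-size character chunking.
    return [document[i:i + size] for i in range(0, len(document), size)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Steps 2-3: embed chunks and "index" them in an in-memory list.
documents = ["RAG combines retrieval with generation.", "Vector databases index embeddings for fast search."]
index = [(c, embed(c)) for doc in documents for c in chunk_text(doc)]

# Step 4: embed the query and rank chunks by cosine similarity.
query = "What is RAG?"
q_vec = embed(query)
top_chunks = [c for c, v in sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:3]]

# Step 5: assemble the prompt from the retrieved chunks.
prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"

# Step 6: send `prompt` to any LLM to generate the final answer.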
Architecture Patterns
Basic RAG
Simple retrieval and generation
Components
- Embedding Model
- Vector DB
- LLM
- Simple Prompt
Pros
- ✓ Easy to implement
- ✓ Low complexity
- ✓ Quick prototyping
Cons
- ✗ Limited accuracy
- ✗ No query optimization
- ✗ Basic context handling
RAG Implementation Libraries
LangChain
Language: Python/JS
Example:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# Initialize vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings()
)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
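A usage sketch for the chain above (assuming docs is a list of LangChain Document objects and llm is an initialized chat model); with return_source_documents=True, the result dict carries both the answer and the retrieved chunks:

# Ask a question; retrieved chunks come back alongside the generated answer.
result = qa_chain({"query": "What is RAG?"})
print(result["result"])                     # the generated answer
for doc in result["source_documents"]:      # chunks used as context
    print(doc.metadata.get("source"))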
LlamaIndex
Language: Python
Example:
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms import OpenAI

# Load and index documents
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

# Query with RAG
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4"),
    similarity_top_k=3
)
response = query_engine.query("What is RAG?")
Haystack
Language: Python
Example:
from haystack import Pipeline
from haystack.nodes import EmbeddingRetriever, PromptNode

# Build RAG pipeline
pipeline = Pipeline()
pipeline.add_node(
    component=retriever,
    name="Retriever",
    inputs=["Query"]
)
pipeline.add_node(
    component=prompt_node,
    name="PromptNode",
    inputs=["Retriever"]
)
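A run sketch under the same assumptions (a configured retriever and prompt_node, Haystack 1.x pipeline API); parameter names here follow the 1.x conventions:

# Run the pipeline end to end for a single query.
result = pipeline.run(
    query="What is RAG?",
    params={"Retriever": {"top_k": 3}}
)
print(result["results"])  # generated answer(s) from the PromptNode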
Production Challenges
Latency
Retrieval adds latency on top of generation time
Solutions:
- Cache frequent queries (see the sketch after this list)
- Optimize embedding dimensions
- Use faster vector indexes
- Parallel retrieval and generation
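As one way to implement the caching idea, the sketch below memoizes retrieval results per normalized query; retrieve() is a hypothetical stand-in for the real retriever call.

from functools import lru_cache

def retrieve(query: str) -> list[str]:
    # Placeholder for the real retriever (vector DB search, re-ranking, ...).
    return [f"chunk relevant to: {query}"]

@lru_cache(maxsize=1024)
def _cached_retrieve(normalized_query: str) -> tuple[str, ...]:
    return tuple(retrieve(normalized_query))

def cached_retrieve(query: str) -> tuple[str, ...]:
    # Normalize so trivially different spellings share one cache entry.
    return _cached_retrieve(" ".join(query.lower().split()))

chunks = cached_retrieve("What is RAG?")   # pays retrieval latency
chunks = cached_retrieve("what is  rag?")  # served from the cache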
Context Window Limits
LLMs have token limits for context
Solutions:
- Implement context compression
- Use hierarchical summarization
- Smart chunk selection within a token budget (see the sketch after this list)
- Sliding window approach
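A sketch of budget-aware chunk selection, assuming chunks arrive already ranked by relevance and using a rough 4-characters-per-token estimate:

def fit_to_budget(ranked_chunks: list[str], max_tokens: int = 3000) -> list[str]:
    # Keep the highest-ranked chunks that fit within the context budget.
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk) // 4  # crude token estimate (~4 chars per token)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

ranked_chunks = ["most relevant chunk ...", "second chunk ...", "third chunk ..."]
context = "\n\n".join(fit_to_budget(ranked_chunks, max_tokens=3000))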
Retrieval Quality
Retrieved chunks may not be relevant
Solutions:
- Hybrid search (vector + keyword; see the sketch after this list)
- Query expansion/rewriting
- Cross-encoder re-ranking
- Feedback loops for improvement
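One common way to combine vector and keyword results is reciprocal rank fusion; the sketch below uses made-up document IDs to show the merge.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Documents that rank highly in any list accumulate a higher fused score.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]    # hypothetical vector-search ranking
keyword_hits = ["doc1", "doc9", "doc3"]   # hypothetical BM25/keyword ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))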
Data Freshness
Keeping vector index up-to-date
Solutions:
- Incremental indexing (see the sketch after this list)
- Real-time embedding pipeline
- Version control for embeddings
- Scheduled re-indexing
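A sketch of change detection for incremental indexing: hash each document's content and re-embed only when the hash differs from what is already indexed.

import hashlib

indexed_hashes: dict[str, str] = {}  # doc_id -> content hash currently in the index

def needs_reindex(doc_id: str, content: str) -> bool:
    # Re-embed a document only when its content has actually changed.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if indexed_hashes.get(doc_id) == digest:
        return False
    indexed_hashes[doc_id] = digest
    return True

corpus = {"faq.md": "RAG combines retrieval and generation."}
for doc_id, content in corpus.items():
    if needs_reindex(doc_id, content):
        pass  # embed `content` and upsert it into the vector DB here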
Advanced RAG Patterns
Multi-Query RAG
Generate multiple query variations to improve retrieval coverage
# Generate multiple queries from user input
queries = [
    "What is transformer architecture?",
    "How do transformers work in NLP?",
    "Explain self-attention mechanism",
    "Transformer model components"
]

# Retrieve for each query and merge results
all_docs = []
for query in queries:
    docs = retriever.get_relevant_documents(query)
    all_docs.extend(docs)

# Deduplicate and rank
unique_docs = deduplicate(all_docs)
ranked_docs = rerank(unique_docs, original_query)
RAG with Guardrails
Add safety and validation layers to the RAG pipeline (a minimal sketch follows the list)
- Input Validation: Check queries for malicious content
- Source Verification: Ensure retrieved docs are from trusted sources
- Output Filtering: Remove sensitive information from responses
- Hallucination Detection: Verify claims against retrieved context
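A minimal sketch of the input-validation and output-filtering layers; the regex patterns are illustrative only, and real guardrails are usually far more extensive.

import re

INJECTION_PATTERNS = [r"ignore (all|previous) instructions", r"reveal.*system prompt"]
SENSITIVE_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # e.g. SSN-like numbers

def validate_input(query: str) -> bool:
    # Reject queries that look like prompt-injection attempts.
    return not any(re.search(p, query, re.IGNORECASE) for p in INJECTION_PATTERNS)

def filter_output(answer: str) -> str:
    # Redact sensitive patterns before returning the answer to the user.
    for pattern in SENSITIVE_PATTERNS:
        answer = re.sub(pattern, "[REDACTED]", answer)
    return answer

query = "What is RAG?"
if validate_input(query):
    answer = "RAG combines retrieval with generation."  # placeholder for the pipeline output
    print(filter_output(answer))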
Adaptive RAG
Dynamically adjust the retrieval strategy based on query type (a routing sketch follows the list)
- Simple queries: Direct retrieval → Generate
- Complex queries: Decompose → Multi-hop retrieval
- Comparison queries: Parallel retrieval → Synthesize
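A heuristic routing sketch; production systems often use an LLM or a trained classifier to pick the strategy instead of keyword rules.

def classify_query(query: str) -> str:
    # Keyword heuristics; an LLM classifier is a common drop-in replacement.
    q = query.lower()
    if any(w in q for w in ("compare", "versus", " vs ", "difference between")):
        return "comparison"
    if any(w in q for w in ("why", "how does", "step by step")) or len(q.split()) > 15:
        return "complex"
    return "simple"

strategies = {
    "simple": "direct retrieval, then generate",
    "complex": "decompose into sub-questions, multi-hop retrieval",
    "comparison": "retrieve for each entity in parallel, then synthesize",
}

query = "Compare RAG and fine-tuning"
print(classify_query(query), "->", strategies[classify_query(query)])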
Evaluation Metrics
Measuring RAG Performance
Retrieval Metrics (a worked computation follows the list)
- Precision@K: Relevant docs in top K results
- Recall@K: Coverage of all relevant docs
- MRR: Mean Reciprocal Rank of first relevant doc
- NDCG: Normalized Discounted Cumulative Gain
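The retrieval metrics are straightforward to compute once you have a ranked result list and a set of ground-truth relevant documents; the sketch below uses hypothetical document IDs.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-K results that are actually relevant.
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top K.
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant document.
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc2", "doc5", "doc1", "doc9"]   # ranked retriever output (hypothetical)
relevant = {"doc1", "doc2"}                    # ground-truth relevant docs

print(precision_at_k(retrieved, relevant, k=3))  # 2/3
print(recall_at_k(retrieved, relevant, k=3))     # 1.0
print(mrr(retrieved, relevant))                  # first relevant doc at rank 1 -> 1.0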
Generation Metrics
- Faithfulness: Answer grounded in retrieved context
- Relevance: Answer addresses the query
- Completeness: All aspects of query covered
- Coherence: Logical flow and clarity
Next Steps
Now that you understand RAG architecture and implementation, explore how modern online search tools like Perplexity combine RAG with web-scale search capabilities.
Continue to Online LLM Search