
RAG Systems: Best Practices and Common Pitfalls

By ML Team · 15 min read
Tags: RAG, Vector Databases, LLM, Production

Introduction

Retrieval-Augmented Generation (RAG) has become the go-to approach for grounding LLMs with external knowledge. However, building a production-ready RAG system requires careful consideration of multiple factors. This guide covers the essential best practices and common pitfalls to avoid.

1. Chunking Strategies

The way you split your documents into chunks can significantly impact retrieval quality. Let's explore different strategies:

Interactive Chunking Strategy Comparison

One strategy's trade-offs, as shown in the comparison:

Pros

  • Simple to implement
  • Fast processing

Cons

  • May break context
  • Poor semantic boundaries

Retrieval F1 Score: 0.65
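
To make the overlap idea concrete, here is a minimal sketch of fixed-size chunking with overlap; the 500-character window and 75-character overlap are illustrative values, not recommendations drawn from the comparison above.

def chunk_text(text, chunk_size=500, overlap=75):
    """Split text into fixed-size character chunks with overlapping boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # chunks start every (chunk_size - overlap) characters
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks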

2. Vector Database Selection

Choosing the right vector database is crucial for performance and scalability. Here's a comparison of popular options:

Database | Best For         | Pros                     | Cons
Pinecone | Production SaaS  | Fully managed, scalable  | Vendor lock-in, cost
Weaviate | Hybrid search    | GraphQL API, modules     | Complex setup
ChromaDB | Development      | Simple, lightweight      | Limited scale
Qdrant   | High performance | Fast, Rust-based         | Smaller ecosystem
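
As a starting point for the "Development" row, here is a minimal ChromaDB sketch; the collection name and documents are placeholders, and the in-memory client is only meant for local experimentation.

import chromadb

# In-memory client; data is lost when the process exits.
client = chromadb.Client()
collection = client.create_collection(name="rag_docs")

# ChromaDB embeds documents with its default embedding function unless one is supplied.
collection.add(
    documents=[
        "RAG grounds LLM answers in retrieved documents.",
        "Chunk overlap preserves context across boundaries.",
    ],
    ids=["doc-1", "doc-2"],
)

results = collection.query(query_texts=["How does RAG reduce hallucinations?"], n_results=1)
print(results["documents"][0])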

3. Embedding Model Selection

The quality of your embeddings directly impacts retrieval accuracy. Consider these factors:

  • Domain specificity: Fine-tuned models often outperform general-purpose ones
  • Dimension size: Balance between accuracy (higher dims) and speed/cost (lower dims)
  • Multilingual support: Essential for international applications
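
As a rough illustration of the dimension trade-off, the sketch below encodes the same text with two common open-source sentence-transformers checkpoints; the model choices are examples, not recommendations.

from sentence_transformers import SentenceTransformer

texts = ["Retrieval-Augmented Generation grounds LLMs in external knowledge."]

# Smaller checkpoint: 384 dimensions, faster and cheaper to store.
small = SentenceTransformer("all-MiniLM-L6-v2")
print(small.encode(texts).shape)   # (1, 384)

# Larger checkpoint: 768 dimensions, slower but typically more accurate.
large = SentenceTransformer("all-mpnet-base-v2")
print(large.encode(texts).shape)   # (1, 768)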

4. F1 Scores in RAG Evaluation: Beyond Binary Answers

Unlike traditional QA systems where answers are binary (correct/incorrect), RAG systems present unique evaluation challenges. The same information can be expressed in many different ways while remaining accurate. Here's how F1 scores adapt to this complexity:

Traditional vs RAG F1 Score Calculation

Traditional F1 (Binary)

  • Answer: "Paris"
  • Ground Truth: "Paris"
  • Exact match → F1 = 1.0

RAG F1 (Token-based)

  • Answer: "The capital of France is Paris"
  • Ground Truth: "Paris is France's capital city"
  • Token overlap and semantic similarity → F1 ≈ 0.67

In both cases, F1 = 2 × (Precision × Recall) / (Precision + Recall); what changes is whether precision and recall are computed over whole answers (exact match) or over individual tokens.

How RAG F1 Scores Work

RAG evaluation typically uses token-level F1 scores rather than exact match metrics:

  1. Tokenization: Both the generated answer and the reference answers are tokenized into individual words or subwords.

  2. Token Matching: Calculate precision (the fraction of generated-answer tokens that appear in the reference) and recall (the fraction of reference tokens that appear in the generated answer).

  3. F1 Computation: The F1 score is the harmonic mean of precision and recall, giving a balanced measure of answer quality.
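
A minimal sketch of these three steps, using plain whitespace tokenization and lowercasing as simplifying assumptions (standard QA evaluators also strip punctuation and articles); this is the same calculate_f1 helper used in the multi-reference example below.

from collections import Counter

def calculate_f1(generated, reference):
    """Token-level F1: harmonic mean of token precision and recall."""
    gen_tokens = generated.lower().split()               # step 1: tokenization
    ref_tokens = reference.lower().split()
    common = Counter(gen_tokens) & Counter(ref_tokens)   # step 2: token matching
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(gen_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)  # step 3: harmonic mean

# ≈ 0.55 with this naive tokenizer; scores shift with normalization choices.
print(round(calculate_f1("The capital of France is Paris",
                         "Paris is France's capital city"), 2))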

Advanced RAG Evaluation Metrics

Modern RAG systems often combine multiple metrics for comprehensive evaluation:

Metric            | What It Measures                           | Best For
Token F1          | Word overlap between answer and reference  | Factual accuracy
ROUGE-L           | Longest common subsequence                 | Fluency and order preservation
BERTScore         | Semantic similarity using embeddings       | Meaning preservation
Answer Relevance  | How well the answer addresses the question | User satisfaction
Context Relevance | Quality of retrieved documents             | Retrieval effectiveness
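
A hedged sketch of computing two of these metrics with Hugging Face's evaluate library (assumes the rouge_score and bert-score packages are installed; the relevance metrics in the table typically rely on an LLM judge and are not shown):

import evaluate  # Hugging Face evaluation library

predictions = ["Paris is France's capital"]
references = ["The capital of France is Paris"]

# ROUGE-L: longest common subsequence overlap.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references)["rougeL"])

# BERTScore: embedding-based semantic similarity.
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])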

💡 Key Insight

Unlike binary evaluation, RAG F1 scores account for partial correctness. An answer that includes all correct information plus some extra context might score 0.8-0.9, while an answer missing key facts might score 0.3-0.5, providing nuanced feedback for system improvement.

Handling Multiple Valid Answers

RAG evaluation often uses multiple reference answers to handle variation:

# Example: Multiple valid answers for "What is the capital of France?"
reference_answers = [
    "Paris",
    "The capital of France is Paris",
    "Paris is the capital city of France",
    "France's capital is Paris, located in the north"
]

# F1 score calculated against the best-matching reference,
# using the token-level calculate_f1 helper sketched above
generated = "Paris is France's capital"
f1_scores = [calculate_f1(generated, ref) for ref in reference_answers]
final_f1 = max(f1_scores)  # take the highest-scoring reference (≈ 0.67 with the naive tokenizer above)

5. Common Pitfalls to Avoid

❌ Pitfall: Ignoring chunk overlap

Without overlap, you may lose important context at chunk boundaries. Use 10-20% overlap for better retrieval.

❌ Pitfall: Not handling document updates

Implement versioning and update strategies to keep your vector store synchronized with source documents.

❌ Pitfall: Over-relying on similarity scores

Combine semantic search with keyword matching and metadata filtering for more robust retrieval.
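
For the last pitfall, here is a minimal sketch of blending keyword and semantic scores; the rank_bm25 and sentence-transformers libraries, the example documents, and the 50/50 weighting are all illustrative assumptions.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "Chunk overlap preserves context at chunk boundaries.",
    "Vector databases store dense embeddings for similarity search.",
]
query = "Why use chunk overlap?"

# Keyword signal: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
keyword_scores = bm25.get_scores(query.lower().split())

# Semantic signal: cosine similarity between query and document embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
semantic_scores = util.cos_sim(query_emb, doc_emb)[0].tolist()

# Naive 50/50 blend; BM25 scores are unnormalized, so real systems rescale before mixing.
hybrid = [0.5 * k + 0.5 * s for k, s in zip(keyword_scores, semantic_scores)]
print(max(zip(hybrid, docs))[1])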

6. Production Checklist


Conclusion

Building a production RAG system is an iterative process. Start with a simple implementation, measure performance, and gradually optimize based on your specific use case. Remember that the best configuration depends on your data, users, and requirements.
