Challenges: RAG Systems

Hands-on challenges to master Retrieval-Augmented Generation

🚀 Challenge 1: The Chunking Optimization Game

Difficulty: ⭐⭐ Beginner-Intermediate
Time: 45-60 minutes
Concepts: Text chunking, retrieval accuracy, semantic boundaries

The Problem

Chunking is critical for RAG - bad chunks = bad retrieval = bad answers. Find the optimal chunking strategy!

Your Task

Take a long technical document (e.g., Python documentation, research paper)
Create 10 test questions that require specific passages
Try 5 different chunking strategies:
- Fixed size (256, 512, 1024 tokens)
- Sentence-based
- Paragraph-based
- Semantic (embeddings-based)
- Hierarchical (sections → paragraphs → sentences)
Measure which strategy retrieves the right passages most often
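
As a starting point for the first strategies in the list, here is a minimal sketch of fixed-size and paragraph-based chunking. It assumes whitespace-separated words are an acceptable stand-in for model tokens; a real run would use the tokenizer of your embedding model. The function names and the `overlap` parameter are illustrative additions, not part of the challenge statement.

```python
def chunk_fixed(text, size=256, overlap=32):
    """Split text into windows of `size` whitespace tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def chunk_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Running both over the same document gives you two chunk sets to index and compare with the metrics in the next section.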

Evaluation Metrics


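from statistics import mean

# Hit rate@3: fraction of questions whose correct passage appears in the top-3 results
hit_rate = correct_chunks_retrieved / total_questions

# Mean reciprocal rank (MRR): average of 1/rank of the correct chunk, with rank starting at 1
mrr = mean([1 / rank for rank in chunk_positions])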
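
The two metrics can be wrapped into small helpers. This sketch assumes you record, for each question, the 1-based rank of the correct chunk, or `None` when it was not retrieved at all; that convention and the function names are assumptions for illustration.

```python
from statistics import mean


def hit_rate_at_k(ranks, k=3):
    """Fraction of questions whose correct chunk appears within the top k results."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; questions with no correct chunk retrieved contribute 0."""
    return mean([1 / r if r is not None else 0.0 for r in ranks])


# Example: five questions, the third one never retrieved the correct chunk.
ranks = [1, 3, None, 2, 5]
print(hit_rate_at_k(ranks, k=3))
print(mean_reciprocal_rank(ranks))
```

Counting misses as 0 in MRR keeps strategies comparable even when they fail on different questions.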

Success Criteria

Test all 5 chunking methods
Create visualization comparing methods
Identify when each method works best
Provide recommendations

💡 Hint

Different content types need different strategies:

Code documentation: Semantic chunking works well
Narrative text: Paragraph-based is often good
Q&A: Sentence-based can work

🚀 Challenge 2: Query Expansion Techniques

Difficulty: ⭐⭐⭐ Intermediate
Time: 1-2 hours
Concepts: Query understanding, multi-query retrieval, HyDE

The Problem

User queries are often vague or poorly worded. Expand them to improve retrieval!

Your Task

Implement 3 query expansion techniques:

Technique 1: Multi-Query Generation


# Original: "How to use python lists?"
# Expanded:
# - "Python list operations tutorial"
# - "Add items to Python list"
# - "List methods in Python"
# - "Python array vs list"
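
Once each expanded query has its own ranked result list, the lists have to be merged. One common option is reciprocal rank fusion (RRF), sketched below. RRF is a technique swapped in here for illustration; the challenge does not prescribe a merging method, and `k=60` is simply a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank is 1-based).
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Documents that appear in several lists rise to the top, which is usually the behavior you want from multi-query retrieval.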

Technique 2: Hypothetical Document Embeddings (HyDE)


# Original query: "What causes climate change?"
# Generate hypothetical answer, then search for it:
generated_answer = llm("Write a detailed answer about climate change causes...")
search_embedding = embed(generated_answer)
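
To make the HyDE flow concrete, here is a small end-to-end sketch. It assumes `llm` is any callable that maps a prompt string to text, and it uses a toy bag-of-words `embed` so the example runs without a model; in practice you would swap in a real embedding model and vector store.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(query, documents, llm, top_k=3):
    """Search with the embedding of a generated answer instead of the raw query."""
    hypothetical = llm(f"Write a short passage that answers: {query}")
    query_vec = embed(hypothetical)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]
```

The hypothetical answer does not need to be correct; it only needs to resemble the vocabulary and shape of the passages you hope to retrieve.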

Technique 3: Query Decomposition


# Complex: "Compare Python and JavaScript for web development"
# Decompose:
# - "Python for web development features"
# - "JavaScript for web development features"
# - "Python vs JavaScript comparison"

Comparison Task

Test on 20 diverse questions
Compare retrieval accuracy for each method
Analyze latency and cost tradeoffs
Identify best use cases

💡 Hint

Multi-query can be parallelized for speed. HyDE works great when you know the answer format. Query decomposition is powerful for complex questions.

🚀 Challenge 3: The Hallucination Hunter

Difficulty: ⭐⭐⭐⭐ Advanced
Time: 2-3 hours
Concepts: Faithfulness, fact verification, hallucination detection

The Problem

LLMs sometimes “hallucinate” - generate plausible-sounding but incorrect information. Catch them!

Your Task

Build a hallucination detection system:

Faithfulness Scoring
- Check if answer is supported by retrieved context
- Use entailment model or LLM-as-judge
- Score 0-1 for how well grounded the answer is
Citation Verification
- Extract claims from answer
- Verify each claim against source documents
- Flag unsupported claims
Confidence Calibration
- Estimate answer confidence
- Compare with actual correctness
- Calibrate model to be more honest
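
For the confidence calibration step, one way to compare estimated confidence with actual correctness is expected calibration error (ECE). ECE is not named in the challenge; it is offered here as one reasonable choice, with equal-width bins as an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A system that reports 0.9 confidence but is right only 75% of the time in that bin contributes a gap of 0.15, which is the kind of overconfidence this metric surfaces.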

Implementation


class HallucinationDetector:
    def check_faithfulness(self, answer, context):
        """Score how well answer is supported by context."""
        # TODO: Implement
        pass
    
    def verify_citations(self, answer, sources):
        """Verify each claim in answer."""
        claims = self.extract_claims(answer)
        verified = []
        for claim in claims:
            is_supported = self.verify_claim(claim, sources)
            verified.append({
                "claim": claim,
                "supported": is_supported,
                "confidence": ...
            })
        return verified
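
Before wiring in an entailment model or an LLM judge, a crude lexical baseline can help you test the plumbing. The sketch below treats each sentence as a claim and scores it by word overlap with the context. This is only a rough proxy for entailment, and the helper names, the word-length filter, and the 0.6 threshold are all assumptions for illustration.

```python
import re


def _words(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_claims(answer):
    """Treat each sentence as one claim (a simplification)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def support_score(claim, context):
    """Fraction of the claim's longer words that also occur in the context."""
    claim_words = {w for w in _words(claim) if len(w) > 3}
    if not claim_words:
        return 0.0
    return len(claim_words & _words(context)) / len(claim_words)


def check_faithfulness(answer, context, threshold=0.6):
    """Return (mean support score in [0, 1], list of claims below the threshold)."""
    scored = [(c, support_score(c, context)) for c in extract_claims(answer)]
    unsupported = [c for c, s in scored if s < threshold]
    score = sum(s for _, s in scored) / len(scored) if scored else 0.0
    return score, unsupported
```

Word overlap cannot detect negation or paraphrase, so treat its output as a baseline to beat rather than a detector to trust.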

Test Dataset

Create 30 questions with known hallucination triggers:

Questions outside knowledge base
Ambiguous questions
Questions with conflicting information
Questions requiring calculation/reasoning

💡 Hint

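Use an entailment (NLI) model, for example a DeBERTa checkpoint fine-tuned on MNLI such as “microsoft/deberta-large-mnli”; the base “microsoft/deberta-v3-large” checkpoint is not fine-tuned for entailment and would need NLI training first. Compare multiple answer generations: consistent answers are more likely to be correct. Prompt engineering: “Only answer if you’re certain. Otherwise say ‘I don’t know.’”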

🚀 Challenge 4: Conversational RAG

Difficulty: ⭐⭐⭐⭐ Advanced
Time: 3-4 hours
Concepts: Dialogue management, context tracking, memory

The Problem

Most RAG systems handle single questions. Build one that handles multi-turn conversations!

Your Task

Handle conversation like this:


User: "What are the benefits of Python?"
Bot: "Python offers readability, extensive libraries..." [uses RAG]

User: "What about performance?"  # Implicit: Python performance
Bot: "Python is slower than compiled languages..." [understands context]

User: "Compare it to Java"  # Implicit: Python vs Java performance
Bot: "Java is generally faster because..." [maintains full context]

Requirements

Track conversation history
Rewrite queries with context (coreference resolution)
Maintain entity tracking
Handle follow-up questions
Know when to retrieve vs use previous context
Manage token budget (conversation history grows!)

Conversation Management


class ConversationalRAG:
    def __init__(self):
        self.conversation_history = []
        self.entity_tracker = {}
    
    def rewrite_query_with_context(self, current_query, history):
        """Rewrite query to be standalone using conversation context."""
        # "What about performance?" → "What about Python performance?"
        pass
    
    def should_retrieve(self, query, history):
        """Decide if we need new retrieval or can use context."""
        # Avoid unnecessary retrievals for clarification questions
        pass
    
    def chat(self, user_message):
        # Rewrite query
        # Retrieve if needed
        # Generate with conversation context
        # Update history
        pass

💡 Hint

Use an LLM to rewrite queries: “Given the conversation history, rewrite this query to be standalone.” Keep a sliding window of the last N turns to manage tokens. Detect whether a query is a clarification or a new topic.
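
The sliding window and the rewrite prompt can be sketched together. The class and method names below are hypothetical, and the window counts turns rather than tokens, which is a simplification; a production version would budget by token count.

```python
from collections import deque


class ConversationMemory:
    """Keeps only the most recent turns and builds a query-rewrite prompt from them."""

    def __init__(self, max_turns=3):
        # deque with maxlen silently drops the oldest turn when full
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user, bot):
        self.turns.append((user, bot))

    def build_rewrite_prompt(self, query):
        history = "\n".join(f"User: {u}\nBot: {b}" for u, b in self.turns)
        return (
            "Given the conversation history, rewrite the final query "
            "so that it is standalone.\n\n"
            f"History:\n{history}\n\nQuery: {query}\nStandalone query:"
        )
```

The prompt string would then be sent to whichever LLM you use, and its output becomes the retrieval query.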

🚀 Challenge 5: Multi-Modal RAG

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: Multi-modal embeddings, vision-language models, hybrid retrieval

The Problem

Real documents have images, tables, charts - not just text. Build RAG that handles it all!

Your Task

Build a system that processes:

Text: Standard RAG
Images: Visual search with CLIP
Tables: Structured data retrieval
Diagrams: Caption extraction + visual search
Code: Syntax-aware chunking

Example Use Case: Technical Documentation


User: "Show me the architecture diagram and explain the components"

System should:
1. Retrieve relevant diagram (image similarity)
2. Extract/generate diagram description
3. Retrieve text about components
4. Combine image + text in answer

💡 Hint

Start with caption generation and text retrieval before attempting end-to-end multimodal embeddings.

🚀 Challenge 6: Corrective RAG Loop

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 3-5 hours
Concepts: CRAG, retrieval grading, retry policies, abstention

The Problem

Many RAG failures are not generation failures. They are retrieval failures that should have been caught before the model answered.

Your Task

Build a corrective loop that evaluates retrieval quality before the final answer is generated.

Requirements

Grade retrieved evidence for relevance and coverage
Retry with a rewritten query if retrieval quality is weak
Compress or filter noisy chunks before generation
Abstain when no trustworthy evidence is found
Log which step fixed the failure, if any

Suggested pipeline


def corrective_rag(query):
    candidates = retrieve(query)
    grade = grade_retrieval(query, candidates)
 
    if grade < 0.5:
        better_query = rewrite_query(query)
        candidates = retrieve(better_query)
        grade = grade_retrieval(better_query, candidates)

    if grade < 0.5:
        return {"answer": "I don't have enough reliable evidence.", "status": "abstain"}
 
    context = compress_context(query, candidates)
    return generate_answer(query, context)
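
To cover the requirement of logging which step fixed a failure, the pipeline can return a trace alongside the answer. This variant is a sketch under assumptions: the components are passed in as plain callables so it can be tested with stubs, context compression is optional, and the `fixed_by` and `trace` fields are hypothetical names.

```python
def corrective_rag_logged(query, retrieve, grade_retrieval, rewrite_query,
                          generate_answer, compress_context=None, threshold=0.5):
    """Corrective loop that records the grade after each retrieval attempt."""
    trace = []
    candidates = retrieve(query)
    grade = grade_retrieval(query, candidates)
    trace.append(("initial_retrieval", grade))
    fixed_by = None

    if grade < threshold:
        better_query = rewrite_query(query)
        candidates = retrieve(better_query)
        grade = grade_retrieval(better_query, candidates)
        trace.append(("rewrite_retry", grade))
        if grade >= threshold:
            fixed_by = "rewrite_retry"

    if grade < threshold:
        return {"answer": "I don't have enough reliable evidence.",
                "status": "abstain", "fixed_by": None, "trace": trace}

    context = compress_context(query, candidates) if compress_context else candidates
    return {"answer": generate_answer(query, context),
            "status": "answered", "fixed_by": fixed_by, "trace": trace}
```

Aggregating `fixed_by` over a failure set gives the before/after evidence the success criteria ask for.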

Success Criteria

Retrieval failures are explicitly detected
Retry logic improves at least some failed cases
Unsupported questions do not produce confident hallucinations
You can show before/after examples from a failure set

💡 Hint

Keep the grading simple first: use a small rubric for topical relevance, evidence coverage, and answerability.

🚀 Challenge 7: Hierarchical or Graph Retrieval

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: RAPTOR, parent-child retrieval, GraphRAG, multi-hop reasoning

The Problem

Flat chunk retrieval breaks down when the answer is spread across sections, entities, or long reports.

Your Task

Implement one structured retrieval approach:

Option A: Parent-Child / Hierarchical Retrieval

Retrieve fine-grained chunks
Expand to their parent section or source document
Generate the final answer using both local evidence and larger context
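
The three steps of Option A can be sketched as follows. The data layout (children as `(text, parent_id)` pairs, parents in a dict) and the word-overlap scorer are assumptions made so the example runs without a vector store; replace `score` with your embedding similarity.

```python
def overlap_score(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def parent_child_retrieve(query, child_chunks, parents, score=overlap_score, top_k=2):
    """Rank fine-grained chunks, then group the winners under their parent section.

    child_chunks: list of (child_text, parent_id)
    parents: dict mapping parent_id -> full parent text
    """
    ranked = sorted(child_chunks, key=lambda c: score(query, c[0]), reverse=True)[:top_k]
    grouped = {}
    for child_text, parent_id in ranked:
        entry = grouped.setdefault(
            parent_id,
            {"parent_id": parent_id, "parent": parents[parent_id], "children": []},
        )
        entry["children"].append(child_text)
    # dicts preserve insertion order, so parents come out in order of their best child
    return list(grouped.values())
```

Grouping by parent avoids sending the same section to the generator twice when several of its children match.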

Option B: RAPTOR-style Summarization Tree

Create chunk summaries recursively
Retrieve from summaries first, then drill down to leaves
Compare quality and latency against flat retrieval

Option C: GraphRAG Prototype

Extract entities and relations from documents
Build a lightweight graph
Retrieve by entity neighborhood plus semantic search
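
For Option C, a lightweight graph can be as simple as an adjacency map built from extracted triples. The sketch below assumes triples of the form `(head, relation, tail)` have already been extracted, ignores relation types, and treats edges as undirected; all of these are simplifications.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples."""
    graph = defaultdict(set)
    for head, _relation, tail in triples:
        graph[head].add(tail)
        graph[tail].add(head)
    return graph


def neighborhood(graph, seeds, hops=1):
    """Return the seed entities plus everything reachable within `hops` edges."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for node in frontier for n in graph.get(node, ())} - visited
        visited |= frontier
    return visited
```

The returned entity set can then be used to filter or boost chunks from semantic search, for example by keeping chunks that mention at least one entity in the neighborhood.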

Success Criteria

Show at least 10 questions that require cross-section reasoning
Compare flat retrieval vs. your structured approach
Explain where the structured approach helps and where it adds overhead
Include failure cases, not just wins

💡 Hint

If full GraphRAG is too heavy, parent-child retrieval is the best structured upgrade to implement first.

Challenge 5 (Multi-Modal RAG): Implementation Components

Multi-Modal Embeddings:
- Text: sentence-transformers
- Images: CLIP
- Tables: Table-specific embedders
Hybrid Retrieval:
- Combine results from different modalities
- Weight by relevance and modality type
Multi-Modal Generation:
- GPT-4 Vision for image understanding
- Generate answers referencing both text and images

Challenge 5 (Multi-Modal RAG): Success Criteria

Process PDFs with images/tables
Retrieve relevant visuals for queries
Generate answers combining modalities
Handle queries like “show me”, “diagram of”, “table showing”

💡 Hint (Challenge 5)

Use GPT-4 Vision or LLaVA for image understanding and CLIP for image-text similarity. Keep separate vector stores per modality, then merge results.
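
Merging results from separate per-modality stores can be sketched as a weighted score merge. This assumes each store returns `(item_id, score)` pairs with scores already normalized to a comparable range such as [0, 1]; raw similarity scores from different embedding models are generally not comparable without that step. The function name and weight values are illustrative.

```python
def merge_modalities(results_by_modality, weights, top_k=5):
    """Combine per-modality results into one ranking using per-modality weights.

    results_by_modality: dict of modality -> list of (item_id, normalized_score)
    weights: dict of modality -> multiplier (missing modalities default to 1.0)
    """
    merged = []
    for modality, results in results_by_modality.items():
        weight = weights.get(modality, 1.0)
        merged.extend((item_id, modality, weight * score) for item_id, score in results)
    merged.sort(key=lambda r: r[2], reverse=True)
    return merged[:top_k]
```

For queries such as “show me” or “diagram of”, raising the image weight is one simple way to weight by modality type.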

🏆 Meta Challenge: RAG Optimization Competition

Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 8-12 hours
Concepts: End-to-end optimization, systematic evaluation

The Ultimate Challenge

Build the best RAG system for a specific domain and prove it!

Competition Format

Choose Domain: Medical, legal, technical docs, customer support, etc.
Build System: Full RAG pipeline
Create Benchmark: 100+ test questions with ground truth
Optimize Everything:
- Chunking strategy
- Embedding model
- Retrieval method
- Re-ranking
- Generation prompts
- Cost/latency tradeoffs

Leaderboard Metrics

Accuracy: % of correct answers
Faithfulness: % of answers supported by context
Latency: Average response time
Cost: $ per 1000 queries
User Satisfaction: Human evaluation (1-5)
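
The automatic leaderboard metrics can be computed from per-question records. The record fields below (`correct`, `faithful`, `latency_s`, `cost_usd`) are a hypothetical schema; how you decide correctness and faithfulness is up to your evaluation setup.

```python
def summarize_run(records):
    """Aggregate per-question records into leaderboard metrics."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    return {
        "accuracy": sum(1 for r in records if r["correct"]) / n,
        "faithfulness": sum(1 for r in records if r["faithful"]) / n,
        "avg_latency_s": sum(r["latency_s"] for r in records) / n,
        # mean cost per query scaled to 1000 queries
        "cost_per_1000_usd": 1000 * sum(r["cost_usd"] for r in records) / n,
    }
```

User satisfaction is left out because it comes from human ratings rather than logged fields.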

Deliverables

Complete RAG system (code)
Benchmark dataset (questions + answers)
Evaluation results (metrics + analysis)
Technical report (methodology + findings)
Demo (Gradio/Streamlit app)

Optional Stretch

Open-source your solution
Deploy publicly
Write blog post about optimizations
Beat baseline by >20% accuracy

📊 Challenge Progress Tracker

Challenge 1: Chunking Optimization
Challenge 2: Query Expansion
Challenge 3: Hallucination Hunter
Challenge 4: Conversational RAG
Challenge 5: Multi-Modal RAG
Challenge 6: Corrective RAG Loop
Challenge 7: Hierarchical or Graph Retrieval
Meta Challenge: RAG Optimization Competition

🏅 Share Your Work

Post your challenge solutions:

GitHub: Share your repos
Discussions: Challenges Category
Blog: Write about your learnings
Twitter: Tag #ZeroToAI #RAGChallenge

💡 Tips for Success

Start Simple: Get basic version working first
Measure Everything: Metrics guide optimization
Error Analysis: Study failures to improve
Read Papers: Many techniques have research backing
Use Tools: LangChain, LlamaIndex can speed things up
Iterate: First version won’t be perfect

📚 Helpful Resources

Happy building! 🚀

Remember: RAG is about the journey of optimization, not just the destination!
