Challenges: RAG Systems
Hands-on challenges to master Retrieval-Augmented Generation
🚀 Challenge 1: The Chunking Optimization Game
Difficulty: ⭐⭐ Beginner-Intermediate
Time: 45-60 minutes
Concepts: Text chunking, retrieval accuracy, semantic boundaries
The Problem
Chunking is critical for RAG: bad chunks lead to bad retrieval, and bad retrieval leads to bad answers. Find the optimal chunking strategy!
Your Task
- Take a long technical document (e.g., Python documentation, research paper)
- Create 10 test questions that require specific passages
- Try 5 different chunking strategies:
  - Fixed size (256, 512, 1024 tokens)
  - Sentence-based
  - Paragraph-based
  - Semantic (embeddings-based)
  - Hierarchical (sections → paragraphs → sentences)
- Measure which strategy retrieves the right passages most often
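As a starting point for the simpler strategies in the list above, here is a minimal sketch of fixed-size, sentence-based, and paragraph-based chunkers. It assumes whitespace-separated words as a stand-in for model tokens and a naive punctuation-based sentence splitter; the function names and the `overlap` parameter are illustrative, not part of the challenge.

```python
import re

def fixed_size_chunks(text, size=256, overlap=32):
    """Split text into chunks of `size` whitespace tokens, overlapping by `overlap`.

    Assumption: whitespace tokens approximate model tokens; swap in a real
    tokenizer for accurate counts.
    """
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks

def sentence_chunks(text, max_sentences=3):
    """Group consecutive sentences into chunks of at most `max_sentences`."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

def paragraph_chunks(text):
    """Split on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Semantic and hierarchical chunking need an embedding model or document structure, so they are left out of this sketch.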
Evaluation Metrics
# Hit rate@3: for each question, check whether the correct passage is in the top-3 results
hit_rate = correct_chunks_retrieved / total_questions
# Mean reciprocal rank (MRR): average of 1/rank of the correct chunk (rank 1 = top result)
mrr = sum(1 / rank for rank in chunk_positions) / len(chunk_positions)
Success Criteria
- Test all 5 chunking methods
- Create visualization comparing methods
- Identify when each method works best
- Provide recommendations
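To compare methods on the same footing, the hit rate and MRR from the Evaluation Metrics section can be computed from ranked result lists. The sketch below assumes each question has exactly one correct chunk ID and that a chunk missing from the results contributes 0 to MRR; the function names are illustrative.

```python
def hit_rate_at_k(ranked_ids_per_question, gold_ids, k=3):
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = sum(
        1 for ranked, gold in zip(ranked_ids_per_question, gold_ids)
        if gold in ranked[:k]
    )
    return hits / len(gold_ids)

def mean_reciprocal_rank(ranked_ids_per_question, gold_ids):
    """Average of 1/rank of the gold chunk; a missing chunk counts as 0."""
    total = 0.0
    for ranked, gold in zip(ranked_ids_per_question, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)  # ranks start at 1
    return total / len(gold_ids)
```

Running both functions once per chunking strategy gives the numbers needed for the comparison visualization.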
💡 Hint
- Code documentation: Semantic chunking works well
- Narrative text: Paragraph-based is often good
- Q&A: Sentence-based can work
🚀 Challenge 2: Query Expansion Techniques
Difficulty: ⭐⭐⭐ Intermediate
Time: 1-2 hours
Concepts: Query understanding, multi-query retrieval, HyDE
The Problem
User queries are often vague or poorly worded. Expand them to improve retrieval!
Your Task
Implement 3 query expansion techniques:
Technique 1: Multi-Query Generation
# Original: "How to use python lists?"
# Expanded:
# - "Python list operations tutorial"
# - "Add items to Python list"
# - "List methods in Python"
# - "Python array vs list"
Technique 2: Hypothetical Document Embeddings (HyDE)
# Original query: "What causes climate change?"
# Generate hypothetical answer, then search for it:
generated_answer = llm("Write a detailed answer about climate change causes...")
search_embedding = embed(generated_answer)
Technique 3: Query Decomposition
# Complex: "Compare Python and JavaScript for web development"
# Decompose:
# - "Python for web development features"
# - "JavaScript for web development features"
# - "Python vs JavaScript comparison"
Comparison Task
- Test on 20 diverse questions
- Compare retrieval accuracy for each method
- Analyze latency and cost tradeoffs
- Identify best use cases
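Multi-query generation and query decomposition both produce several queries, so their result lists have to be merged into one ranking. The task does not prescribe a merging method; one common option is reciprocal rank fusion (RRF), sketched here under the assumption that `retrieve` is your own function returning a ranked list of document IDs. The constant `k=60` is a conventional default, not a requirement.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def multi_query_retrieve(queries, retrieve, top_k=3):
    """Run every expanded query through `retrieve` and fuse the results.

    `retrieve` is a hypothetical callable: query string -> ranked list of IDs.
    """
    ranked_lists = [retrieve(query) for query in queries]
    return reciprocal_rank_fusion(ranked_lists)[:top_k]
```

Each extra query costs one more retrieval call (plus an LLM call to generate it), which is the latency and cost tradeoff the comparison task asks you to measure.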
🚀 Challenge 3: The Hallucination Hunter
Difficulty: ⭐⭐⭐⭐ Advanced
Time: 2-3 hours
Concepts: Faithfulness, fact verification, hallucination detection
The Problem
LLMs sometimes “hallucinate”: they generate plausible-sounding but incorrect information. Catch these hallucinations!
Your Task
Build a hallucination detection system:
- Faithfulness Scoring
  - Check if answer is supported by retrieved context
  - Use entailment model or LLM-as-judge
  - Score 0-1 for how well grounded the answer is
- Citation Verification
  - Extract claims from answer
  - Verify each claim against source documents
  - Flag unsupported claims
- Confidence Calibration
  - Estimate answer confidence
  - Compare with actual correctness
  - Calibrate model to be more honest
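Before wiring in an entailment model or an LLM judge, it can help to have a cheap baseline for the faithfulness and claim-flagging steps above. The sketch below is only that baseline: it treats each sentence as one claim and uses token overlap with the context as a crude proxy for support. The helper names and the `threshold` value are assumptions for illustration.

```python
import re

def _tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def split_claims(answer):
    """Naive claim extraction: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def claim_support(claim, context):
    """Fraction of the claim's tokens that also occur in the context (0 to 1)."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(context)) / len(claim_tokens)

def faithfulness_score(answer, context, threshold=0.7):
    """Return (share of supported claims, list of flagged claims)."""
    claims = split_claims(answer)
    if not claims:
        return 0.0, []
    flagged = [c for c in claims if claim_support(c, context) < threshold]
    return 1 - len(flagged) / len(claims), flagged
```

Lexical overlap misses paraphrases and cannot detect contradictions, so treat its score as a floor to beat with the entailment or LLM-as-judge version.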
Implementation
class HallucinationDetector:
    def check_faithfulness(self, answer, context):
        """Score how well answer is supported by context."""
        # TODO: Implement
        pass

    def verify_citations(self, answer, sources):
        """Verify each claim in answer."""
        claims = self.extract_claims(answer)
        verified = []
        for claim in claims:
            is_supported = self.verify_claim(claim, sources)
            verified.append({
                "claim": claim,
                "supported": is_supported,
                "confidence": ...
            })
        return verified
Test Dataset
Create 30 questions with known hallucination triggers:
- Questions outside knowledge base
- Ambiguous questions
- Questions with conflicting information
- Questions requiring calculation/reasoning
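Once the test set has been run, the Confidence Calibration step needs a number that compares stated confidence with actual correctness. One widely used choice is expected calibration error (ECE); the challenge does not mandate it, so take this as one possible sketch. It assumes confidences in the range 0 to 1 and equal-width bins.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near 0 means the system's confidence tracks its accuracy; a large value on the out-of-knowledge-base questions suggests overconfident hallucinations.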
🚀 Challenge 4: Conversational RAG
Difficulty: ⭐⭐⭐⭐ Advanced
Time: 3-4 hours
Concepts: Dialogue management, context tracking, memory
The Problem
Most RAG systems handle single questions. Build one that handles multi-turn conversations!
Your Task
Handle conversation like this:
User: "What are the benefits of Python?"
Bot: "Python offers readability, extensive libraries..." [uses RAG]
User: "What about performance?" # Implicit: Python performance
Bot: "Python is slower than compiled languages..." [understands context]
User: "Compare it to Java" # Implicit: Python vs Java performance
Bot: "Java is generally faster because..." [maintains full context]
Requirements
- Track conversation history
- Rewrite queries with context (coreference resolution)
- Maintain entity tracking
- Handle follow-up questions
- Know when to retrieve vs use previous context
- Manage token budget (conversation history grows!)
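For the token-budget requirement, the simplest policy is to keep only the most recent turns that fit. The sketch below assumes history entries are dicts with a `content` key and counts whitespace-separated words instead of real tokens; both are illustrative choices, and summarizing older turns is a reasonable alternative.

```python
def count_tokens(text):
    """Assumption: whitespace words approximate tokens; use a real tokenizer in practice."""
    return len(text.split())

def trim_history(history, budget):
    """Keep the most recent turns whose combined size fits within `budget` tokens."""
    kept = []
    used = 0
    for turn in reversed(history):  # walk from newest to oldest
        cost = count_tokens(turn["content"])
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Note that a single turn larger than the whole budget results in an empty history, so very long retrieved passages are better stored outside the conversation turns.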
Conversation Management
class ConversationalRAG:
    def __init__(self):
        self.conversation_history = []
        self.entity_tracker = {}

    def rewrite_query_with_context(self, current_query, history):
        """Rewrite query to be standalone using conversation context."""
        # "What about performance?" → "What about Python performance?"
        pass

    def should_retrieve(self, query, history):
        """Decide if we need new retrieval or can use context."""
        # Avoid unnecessary retrievals for clarification questions
        pass

    def chat(self, user_message):
        # Rewrite query
        # Retrieve if needed
        # Generate with conversation context
        # Update history
        pass
🚀 Challenge 5: Multi-Modal RAG
Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: Multi-modal embeddings, vision-language models, hybrid retrieval
The Problem
Real documents have images, tables, charts - not just text. Build RAG that handles it all!
Your Task
Build a system that processes:
- Text: Standard RAG
- Images: Visual search with CLIP
- Tables: Structured data retrieval
- Diagrams: Caption extraction + visual search
- Code: Syntax-aware chunking
Example Use Case: Technical Documentation
User: "Show me the architecture diagram and explain the components"
System should:
1. Retrieve relevant diagram (image similarity)
2. Extract/generate diagram description
3. Retrieve text about components
4. Combine image + text in answer
🚀 Challenge 6: Corrective RAG Loop
Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 3-5 hours
Concepts: CRAG, retrieval grading, retry policies, abstention
The Problem
Many RAG failures are not generation failures. They are retrieval failures that should have been caught before the model answered.
Your Task
Build a corrective loop that evaluates retrieval quality before the final answer is generated.
Requirements
- Grade retrieved evidence for relevance and coverage
- Retry with a rewritten query if retrieval quality is weak
- Compress or filter noisy chunks before generation
- Abstain when no trustworthy evidence is found
- Log which step fixed the failure, if any
Suggested pipeline
def corrective_rag(query):
    candidates = retrieve(query)
    grade = grade_retrieval(query, candidates)
    if grade < 0.5:
        better_query = rewrite_query(query)
        candidates = retrieve(better_query)
        grade = grade_retrieval(better_query, candidates)
    if grade < 0.5:
        return {"answer": "I don't have enough reliable evidence.", "status": "abstain"}
    context = compress_context(query, candidates)
    return generate_answer(query, context)
Success Criteria
- Retrieval failures are explicitly detected
- Retry logic improves at least some failed cases
- Unsupported questions do not produce confident hallucinations
- You can show before/after examples from a failure set
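The suggested pipeline calls `grade_retrieval` without defining it, and the requirements ask for a log of which step fixed a failure. The sketch below fills both gaps in the simplest way I could think of: the grader is a term-coverage heuristic (a stand-in for an LLM or cross-encoder grader), and `retrieve` and `rewrite_query` are hypothetical callables you supply.

```python
import re

def _terms(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grade_retrieval(query, candidates):
    """Best fraction of query terms covered by any single candidate (0 to 1)."""
    query_terms = _terms(query)
    if not query_terms or not candidates:
        return 0.0
    return max(len(query_terms & _terms(c)) / len(query_terms) for c in candidates)

def corrective_rag_with_trace(query, retrieve, rewrite_query, threshold=0.5):
    """Retrieve, retry once with a rewritten query, and record each step's grade."""
    trace = []
    candidates = retrieve(query)
    grade = grade_retrieval(query, candidates)
    trace.append(("initial", grade))

    if grade < threshold:
        query = rewrite_query(query)
        candidates = retrieve(query)
        grade = grade_retrieval(query, candidates)
        trace.append(("rewrite", grade))

    if grade < threshold:
        return {"status": "abstain", "candidates": [], "trace": trace}
    return {"status": "answer", "candidates": candidates, "trace": trace}
```

Collecting the `trace` for every question in your failure set makes the before/after comparison straightforward: a run whose trace ends with a passing `"rewrite"` entry is a case the retry fixed.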
🚀 Challenge 7: Hierarchical or Graph Retrieval
Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 4-6 hours
Concepts: RAPTOR, parent-child retrieval, GraphRAG, multi-hop reasoning
The Problem
Flat chunk retrieval breaks down when the answer is spread across sections, entities, or long reports.
Your Task
Implement one structured retrieval approach:
Option A: Parent-Child / Hierarchical Retrieval
- Retrieve fine-grained chunks
- Expand to their parent section or source document
- Generate the final answer using both local evidence and larger context
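The three steps of Option A can be sketched without any vector store. This version assumes the corpus is a dict mapping a section ID to its list of chunks and scores chunks by word overlap with the query; in a real system the scoring would come from your embedding retriever. All names are illustrative.

```python
def build_index(sections):
    """Flatten {section_id: [chunk, ...]} into a list of (chunk, section_id) pairs."""
    return [(chunk, sid) for sid, chunks in sections.items() for chunk in chunks]

def retrieve_with_parents(query, sections, top_k=2):
    """Retrieve fine-grained chunks, then expand each hit to its parent section."""
    query_terms = set(query.lower().split())
    scored = []
    for chunk, sid in build_index(sections):
        overlap = len(query_terms & set(chunk.lower().split()))
        scored.append((overlap, chunk, sid))
    scored.sort(key=lambda item: item[0], reverse=True)

    # Keep only top-k chunks that actually share a term with the query.
    hits = [(chunk, sid) for overlap, chunk, sid in scored[:top_k] if overlap > 0]

    # Expand to parent sections, preserving the order of first appearance.
    parent_ids = []
    for _, sid in hits:
        if sid not in parent_ids:
            parent_ids.append(sid)
    parents = {sid: " ".join(sections[sid]) for sid in parent_ids}
    return hits, parents
```

The generation prompt can then include both `hits` (local evidence) and `parents` (larger context), which is where the extra token cost of this approach comes from.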
Option B: RAPTOR-style Summarization Tree
- Create chunk summaries recursively
- Retrieve from summaries first, then drill down to leaves
- Compare quality and latency against flat retrieval
Option C: GraphRAG Prototype
- Extract entities and relations from documents
- Build a lightweight graph
- Retrieve by entity neighborhood plus semantic search
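For Option C, the graph itself can start as a plain adjacency map. The sketch below assumes entity and relation extraction has already produced `(head, relation, tail)` triples (the extraction step, typically LLM-based, is not shown) and covers only the neighborhood lookup; combining it with semantic search is left to your retriever. The entity names in the usage note are hypothetical.

```python
from collections import defaultdict

def build_graph(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        graph[head].add((relation, tail))
        graph[tail].add((relation, head))
    return graph

def neighborhood(graph, entity, hops=1):
    """Return all entities reachable from `entity` within `hops` edges."""
    seen = {entity}
    frontier = {entity}
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for _relation, neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return seen
```

For example, with triples such as `("ServiceA", "depends_on", "ServiceB")` and `("ServiceB", "depends_on", "ServiceC")`, a two-hop neighborhood of `ServiceA` reaches `ServiceC`, which is the kind of multi-hop link flat chunk retrieval tends to miss.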
Success Criteria
- Show at least 10 questions that require cross-section reasoning
- Compare flat retrieval vs. your structured approach
- Explain where the structured approach helps and where it adds overhead
- Include failure cases, not just wins
Implementation Components (Challenge 5: Multi-Modal RAG)
- Multi-Modal Embeddings:
  - Text: sentence-transformers
  - Images: CLIP
  - Tables: Table-specific embedders
- Hybrid Retrieval:
  - Combine results from different modalities
  - Weight by relevance and modality type
- Multi-Modal Generation:
  - GPT-4 Vision for image understanding
  - Generate answers referencing both text and images
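Scores from different embedders are not on the same scale, so the hybrid retrieval step needs some normalization before weighting. The sketch below is one simple way to do it, assuming each modality returns `(item_id, score)` pairs where higher is better; the min-max scaling and the per-modality weights are illustrative choices, not a prescribed method.

```python
def min_max_normalize(results):
    """Rescale the scores of one modality's results to the range 0 to 1."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        return [(item, 1.0) for item, _ in results]
    return [(item, (score - low) / (high - low)) for item, score in results]

def merge_modalities(results_by_modality, weights):
    """Combine per-modality results into one list sorted by weighted score.

    results_by_modality: {"text": [(id, score), ...], "image": [...], ...}
    weights: {"text": 1.0, "image": 0.5, ...}; missing modalities default to 1.0.
    """
    merged = []
    for modality, results in results_by_modality.items():
        weight = weights.get(modality, 1.0)
        for item, score in min_max_normalize(results):
            merged.append((item, modality, weight * score))
    return sorted(merged, key=lambda row: row[2], reverse=True)
```

A query router could raise the image weight when the query contains phrases like "show me" or "diagram of".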
Success Criteria (Challenge 5: Multi-Modal RAG)
- Process PDFs with images/tables
- Retrieve relevant visuals for queries
- Generate answers combining modalities
- Handle queries like “show me”, “diagram of”, “table showing”
🏆 Meta Challenge: RAG Optimization Competition
Difficulty: ⭐⭐⭐⭐⭐ Expert
Time: 8-12 hours
Concepts: End-to-end optimization, systematic evaluation
The Ultimate Challenge
Build the best RAG system for a specific domain and prove it!
Competition Format
- Choose Domain: Medical, legal, technical docs, customer support, etc.
- Build System: Full RAG pipeline
- Create Benchmark: 100+ test questions with ground truth
- Optimize Everything:
- Chunking strategy
- Embedding model
- Retrieval method
- Re-ranking
- Generation prompts
- Cost/latency tradeoffs
Leaderboard Metrics
- Accuracy: % of correct answers
- Faithfulness: % of answers supported by context
- Latency: Average response time
- Cost: $ per 1000 queries
- User Satisfaction: Human evaluation (1-5)
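The automated leaderboard metrics above can be aggregated from one record per benchmark question. This sketch assumes each record is a dict with the keys `correct`, `supported`, `latency_s`, and `cost_usd`; those field names are hypothetical, and user satisfaction is omitted because it comes from human evaluation.

```python
def summarize(records):
    """Aggregate per-query records into leaderboard metrics."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "faithfulness": sum(r["supported"] for r in records) / n,
        "avg_latency_s": sum(r["latency_s"] for r in records) / n,
        # Average cost per query, scaled to 1000 queries.
        "cost_per_1000_usd": 1000 * sum(r["cost_usd"] for r in records) / n,
    }
```

Reporting all four numbers side by side for each configuration makes the cost and latency tradeoffs of an accuracy gain visible.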
Deliverables
- Complete RAG system (code)
- Benchmark dataset (questions + answers)
- Evaluation results (metrics + analysis)
- Technical report (methodology + findings)
- Demo (Gradio/Streamlit app)
Optional Stretch
- Open-source your solution
- Deploy publicly
- Write blog post about optimizations
- Beat baseline by >20% accuracy
📊 Challenge Progress Tracker
- Challenge 1: Chunking Optimization
- Challenge 2: Query Expansion
- Challenge 3: Hallucination Hunter
- Challenge 4: Conversational RAG
- Challenge 5: Multi-Modal RAG
- Challenge 6: Corrective RAG Loop
- Challenge 7: Hierarchical or Graph Retrieval
- Meta Challenge: RAG Optimization Competition
🏅 Share Your Work
Post your challenge solutions:
- GitHub: Share your repos
- Discussions: Challenges Category
- Blog: Write about your learnings
- Twitter: Tag #ZeroToAI #RAGChallenge
💡 Tips for Success
- Start Simple: Get basic version working first
- Measure Everything: Metrics guide optimization
- Error Analysis: Study failures to improve
- Read Papers: Many techniques have research backing
- Use Tools: LangChain, LlamaIndex can speed things up
- Iterate: First version won’t be perfect
Happy building! 🚀
Remember: RAG is about the journey of optimization, not just the destination!