AI Agents - Challenges

Test your understanding of AI agents with these hands-on challenges! Each challenge builds on the concepts from the notebooks.

🎯 Challenge 1: Calculator Agent

Difficulty: ⭐⭐ (Beginner)
Time: 30-45 minutes

Objective

Build a calculator agent that can handle complex math expressions using function calling.

Requirements

Create an agent with these tools:

add(a, b) - Addition
subtract(a, b) - Subtraction
multiply(a, b) - Multiplication
divide(a, b) - Division
power(base, exponent) - Exponentiation
sqrt(n) - Square root

Test Cases


queries = [
    "What is 15 plus 27?",
    "Calculate 144 divided by 12",
    "What's the square root of 256?",
    "What is 2 to the power of 10?",
    "Calculate (15 + 27) * 3",  # Multi-step
]

Success Criteria

✅ Handles all basic operations
✅ Can chain multiple operations
✅ Proper error handling (divide by zero, negative sqrt)
✅ Returns clear, formatted answers

Hints

Start with simple operations first
Test each tool individually before integration
Handle edge cases (division by zero, etc.)

🎯 Challenge 2: Weather Agent with API

Difficulty: ⭐⭐⭐ (Intermediate)
Time: 1-2 hours

Objective

Build an agent that fetches real weather data and answers questions about it.

Requirements

Create tools for:

get_current_weather(city) - Current conditions
get_forecast(city, days) - Future forecast
compare_weather(city1, city2) - Compare two locations
get_weather_alerts(city) - Severe weather warnings

API Options

OpenWeatherMap (free tier)
Weather API
MeteoStat

Test Cases


queries = [
    "What's the weather like in London today?",
    "Will it rain in Seattle this week?",
    "Is it warmer in Miami or Los Angeles right now?",
    "Any weather alerts for San Francisco?",
    "Should I bring an umbrella in New York tomorrow?",
]

Success Criteria

✅ Makes real API calls
✅ Handles API errors gracefully
✅ Caches results (avoid redundant API calls)
✅ Provides natural language responses
✅ Includes relevant details (temp, humidity, conditions)

Optional Stretch

Add temperature unit conversion (C ↔ F)
Historical weather data
Weather recommendations (clothing, activities)

🎯 Challenge 3: Multi-Tool Research Agent

Difficulty: ⭐⭐⭐ (Intermediate)
Time: 2-3 hours

Objective

Build an agent that can research topics by combining multiple information sources.

Requirements

Implement these tools:

wikipedia_search(topic) - Search Wikipedia
web_search(query) - DuckDuckGo or similar
arxiv_search(topic) - Academic papers
summarize_text(text, max_words) - Summarization

Test Cases


queries = [
    "What is machine learning?",
    "Summarize the latest research on quantum computing",
    "Explain the history of artificial intelligence",
    "What are the applications of neural networks?",
]

Success Criteria

✅ Searches multiple sources
✅ Synthesizes information from different sources
✅ Cites sources properly
✅ Handles “no results found” gracefully
✅ Summarizes long content effectively

Optional Stretch

Fact-checking across sources
Include images/diagrams
Generate bibliography

Libraries to Use


import wikipedia
import requests
from duckduckgo_search import DDGS
import arxiv

🎯 Challenge 4: Code Review Agent

Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 3-4 hours

Objective

Create an agent that reviews Python code and provides feedback.

Requirements

Build tools for:

check_syntax(code) - Syntax validation
check_style(code) - PEP 8 compliance
find_bugs(code) - Static analysis
suggest_improvements(code) - Optimization tips
calculate_complexity(code) - Cyclomatic complexity

Test Cases


# Test with various code samples
buggy_code = """
def divide(a, b):
    return a / b  # No zero check!
"""
 
inefficient_code = """
def find_max(numbers):
    for i in range(len(numbers)):
        for j in range(len(numbers)):
            if numbers[i] > numbers[j]:
                ...
"""
 
messy_code = """
def x(a,b,c):
 if a>b:
  if b>c:
   return a
  else:return c
"""

Success Criteria

✅ Identifies syntax errors
✅ Detects common bugs (division by zero, off-by-one, etc.)
✅ Suggests style improvements
✅ Provides specific line numbers
✅ Explains WHY each issue matters
✅ Prioritizes issues (critical → minor)

Libraries to Use


import ast  # Parse Python code
import pylint
from radon.complexity import cc_visit  # Complexity
from autopep8 import fix_code  # Style fixes

Optional Stretch

Suggest specific fixes (not just identify issues)
Security vulnerability detection
Performance profiling
Generate unit tests for the code

🎯 Challenge 5: Memory-Enhanced Chatbot

Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 3-4 hours

Objective

Build a chatbot that remembers previous conversations and user preferences.

Requirements

Implement memory systems:

Short-term memory: Last 10 messages
Long-term memory: User facts stored in vector DB
Summarization: Compress old conversations

Tools needed:

remember_fact(fact) - Store user information
recall_facts(query) - Retrieve relevant facts
summarize_conversation() - Compress history
get_user_profile() - Return known preferences

Test Scenario


# Session 1
User: "Hi, my name is Alice"
Bot: "Nice to meet you, Alice!"
 
User: "I love pizza and hiking"
Bot: "Great! I'll remember that you enjoy pizza and hiking."
 
# Session 2 (new session, should remember)
User: "What do you know about me?"
Bot: "Your name is Alice, and you enjoy pizza and hiking."
 
User: "Recommend a weekend activity"
Bot: "Based on your interest in hiking, how about exploring a nearby trail?"

Success Criteria

✅ Stores facts from conversation
✅ Retrieves relevant facts when needed
✅ Persists between sessions (file/DB storage)
✅ Handles contradictions (“I don’t like pizza anymore”)
✅ Summarizes when context gets too long

Implementation Options

Option A: Simple JSON storage


import json
 
class SimpleMemory:
    def __init__(self):
        self.facts = {}
    
    def remember(self, key, value):
        self.facts[key] = value
        self.save()
    
    def save(self):
        with open('memory.json', 'w') as f:
            json.dump(self.facts, f)
    
    def load(self):
        with open('memory.json', 'r') as f:
            self.facts = json.load(f)

Option B: Vector database


from sentence_transformers import SentenceTransformer
import chromadb
 
class VectorMemory:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.client = chromadb.Client()
        self.collection = self.client.create_collection("memories")
    
    def remember(self, fact):
        embedding = self.model.encode([fact])[0]
        self.collection.add(
            embeddings=[embedding.tolist()],
            documents=[fact],
            ids=[str(time.time())]
        )
    
    def recall(self, query, n=5):
        query_embedding = self.model.encode([query])[0]
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n
        )
        return results['documents'][0]

Optional Stretch

Semantic search over memories
Automatic fact extraction from conversation
Memory importance scoring (forget trivial facts)
Export conversation history

🎯 Challenge 6: Multi-Agent System

Difficulty: ⭐⭐⭐⭐⭐ (Expert)
Time: 5-8 hours

Objective

Build a system where multiple specialized agents collaborate to solve complex tasks.

System Design

Create 3 specialized agents:

Planner Agent - Breaks down tasks into steps
Executor Agent - Performs individual steps
Reviewer Agent - Checks quality and accuracy

Example Task Flow


User: "Research climate change and write a 500-word summary"

Planner:
  Step 1: Search for climate change information
  Step 2: Extract key facts
  Step 3: Write summary
  Step 4: Review and edit

Executor (Step 1):
  [Searches web, Wikipedia, academic sources]
  [Returns: List of facts]

Executor (Step 3):
  [Writes draft summary]

Reviewer (Step 4):
  Issues found:
  - Summary is only 350 words (need 500)
  - Missing citation for statistic
  Action: Request revision

Executor (Revision):
  [Expands summary to 500 words, adds citation]

Reviewer:
  ✅ All requirements met
  Final output ready

Requirements

✅ Clear separation of responsibilities
✅ Inter-agent communication protocol
✅ Shared memory/context
✅ Error handling and recovery
✅ Feedback loops (reviewer → executor)

Test Cases


tasks = [
    "Research and summarize quantum computing in 500 words",
    "Find the best Italian restaurant in Seattle and make a reservation",
    "Debug this Python code and write test cases for it",
    "Plan a 3-day trip to Tokyo with budget under $2000",
]

Success Criteria

✅ Agents collaborate effectively
✅ Work is distributed appropriately
✅ Handles complex multi-step tasks
✅ Quality control via reviewer
✅ Graceful failure recovery

Architecture Example


class MultiAgentSystem:
    def __init__(self):
        self.planner = PlannerAgent()
        self.executor = ExecutorAgent()
        self.reviewer = ReviewerAgent()
        self.shared_memory = SharedMemory()
    
    def execute_task(self, task):
        # 1. Planner creates plan
        plan = self.planner.create_plan(task)
        self.shared_memory.store_plan(plan)
        
        # 2. Executor performs steps
        for step in plan.steps:
            result = self.executor.execute(step)
            self.shared_memory.store_result(step.id, result)
            
            # 3. Reviewer checks quality
            review = self.reviewer.review(step, result)
            
            if not review.approved:
                # Retry with feedback
                result = self.executor.execute(
                    step, 
                    feedback=review.feedback
                )
        
        # 4. Final compilation
        return self.compile_results()

Optional Stretch Challenges

Add a 4th “Coordinator” agent to manage others
Implement voting mechanism for disagreements
Parallel execution of independent steps
Real-time progress tracking UI
Agent specialization (ResearchAgent, WritingAgent, etc.)

🎯 Challenge 7: Autonomous Task Scheduler

Difficulty: ⭐⭐⭐⭐⭐ (Expert)
Time: 6-10 hours

Objective

Build an agent that autonomously manages and executes scheduled tasks.

Features Required

Task Management
- Add/remove/update tasks
- Priority levels (high, medium, low)
- Dependencies between tasks
- Recurring tasks (daily, weekly, etc.)
Intelligent Scheduling
- Optimize task order based on:
  - Dependencies
  - Deadlines
  - Estimated duration
  - Resource availability
Autonomous Execution
- Run tasks automatically at scheduled times
- Retry failed tasks
- Send notifications
- Generate reports

Example Usage


scheduler = TaskSchedulerAgent()
 
# Add tasks
scheduler.add_task(
    name="Daily Backup",
    action="run_backup",
    schedule="daily at 2am",
    priority="high"
)
 
scheduler.add_task(
    name="Generate Weekly Report",
    action="create_report",
    schedule="every Monday at 9am",
    dependencies=["collect_data", "analyze_data"],
    priority="medium"
)
 
# Agent runs autonomously
scheduler.start()

Tools to Implement

add_task(task_config) - Create new task
run_task(task_id) - Execute specific task
check_dependencies(task_id) - Verify prerequisites
estimate_duration(task_id) - Predict runtime
send_notification(message, channel) - Alerts
generate_schedule() - Optimize task order

Success Criteria

✅ Handles task dependencies correctly
✅ Executes tasks at scheduled times
✅ Retries with exponential backoff
✅ Sends success/failure notifications
✅ Generates execution reports
✅ Optimizes schedule to meet deadlines
✅ Handles concurrent task execution

Advanced Features

Conflict detection (two tasks need same resource)
Dynamic rescheduling when tasks run long
Learning from past executions (improve estimates)
Resource allocation (CPU, memory, API quotas)

Libraries to Use


import schedule
import asyncio
from apscheduler.schedulers.background import BackgroundScheduler
import networkx as nx  # For dependency graphs

Optional Stretch

Web UI for managing tasks
Integration with calendar APIs
ML-based duration estimation
Multi-agent delegation (distribute work)

🎯 Challenge 8: Real-Time Monitoring Agent

Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 4-5 hours

Objective

Build an agent that monitors systems/services and takes action when issues are detected.

What to Monitor

Choose ONE or build multiple:

Website uptime - Check if sites are accessible
API health - Monitor response times and errors
System resources - CPU, memory, disk usage
Log files - Detect errors/warnings
Social media - Track mentions or hashtags

Required Tools

check_health(target) - Perform health check
analyze_metrics(data) - Identify anomalies
send_alert(severity, message) - Notify on issues
take_action(issue) - Auto-remediation
generate_report() - Status summary

Example: Website Monitor


monitor = MonitoringAgent(
    targets=["https://example.com", "https://api.example.com"],
    check_interval=60  # seconds
)
 
monitor.start()
 
# When issue detected:
# 1. Check health → Site down
# 2. Analyze → 503 error, server overload
# 3. Alert → Send Slack notification
# 4. Action → Restart server, scale resources
# 5. Report → Log incident details

Success Criteria

✅ Continuous monitoring (background process)
✅ Configurable check intervals
✅ Anomaly detection (what’s unusual?)
✅ Multi-channel alerts (email, Slack, SMS)
✅ Auto-remediation for common issues
✅ Detailed incident reports

Alert Levels


class AlertLevel:
    INFO = "info"        # FYI, no action needed
    WARNING = "warning"  # Attention required
    ERROR = "error"      # Immediate action needed
    CRITICAL = "critical" # System down, escalate

Auto-Remediation Examples

Website down → Restart web server
API slow → Scale up instances
Disk full → Clean temp files
Memory leak → Restart process

Optional Stretch

Anomaly detection with ML
Predictive alerts (issue likely soon)
Dashboard visualization
Integration with PagerDuty/OpsGenie
Historical trend analysis

� Challenge 9: Agent Evaluation Pipeline

Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 3-5 hours

Objective

Build an evaluation pipeline that systematically measures agent quality across task success, trajectory efficiency, tool correctness, and safety.

Requirements

Create an evaluation framework with:

EvalCase - dataclass with task, expected_outcome, required_tools
AgentRunner - runs the agent and captures the full trajectory
LLMJudge - scores each trajectory on a 0-5 rubric via a judge prompt
ReportGenerator - computes pass@1, avg tool calls, avg cost, and failure analysis

Test Cases


eval_cases = [
    EvalCase(task="What is the capital of France?", expected="Paris", required_tools=[]),
    EvalCase(task="Search for recent AI papers on RAG", expected="list of papers", required_tools=["web_search"]),
    EvalCase(task="Calculate compound interest on $10k at 5% for 10 years", expected="~$16,288.95", required_tools=["calculator"]),
    EvalCase(task="Summarize this 3-page document", expected="concise summary", required_tools=["summarize"]),
]

Success Criteria

✅ Runs ≥10 test cases end-to-end
✅ LLM-as-Judge scoring with structured rubric
✅ Generates markdown report with per-case results
✅ Computes aggregate metrics (pass rate, avg cost, avg latency)
✅ Identifies failure patterns (which tool / step fails most?)

Optional Stretch

Regression detection: compare two agent versions
Red-team safety suite: test for prompt injection, data leakage
Cost-accuracy Pareto analysis across different models
Integration with promptfoo or LangSmith for tracing

Reference

See Notebook 10: Agent Evaluation for implementation patterns and the TrajectoryEvaluator class.

�🏆 Completion Checklist

Track your progress:

Challenge 1: Calculator Agent ⭐⭐
Challenge 2: Weather Agent ⭐⭐⭐
Challenge 3: Research Agent ⭐⭐⭐
Challenge 4: Code Review Agent ⭐⭐⭐⭐
Challenge 5: Memory-Enhanced Chatbot ⭐⭐⭐⭐
Challenge 6: Multi-Agent System ⭐⭐⭐⭐⭐
Challenge 7: Task Scheduler ⭐⭐⭐⭐⭐
Challenge 8: Monitoring Agent ⭐⭐⭐⭐
Challenge 9: Agent Evaluation Pipeline ⭐⭐⭐⭐

💡 General Tips

Starting Out

Read the requirements carefully - Understand what’s needed
Plan before coding - Sketch out tool designs
Start simple - Get basic version working first
Test incrementally - Don’t build everything then test

Tool Design Best Practices


# ✅ GOOD: Focused, single responsibility
def search_web(query: str, num_results: int = 5) -> list:
    """Search web and return results"""
    pass
 
# ❌ BAD: Too many responsibilities
def do_research(topic, summarize=True, translate=False, save_file=True):
    """Does too much, hard to test and debug"""
    pass

Error Handling


# Always validate inputs
def divide(a: float, b: float) -> float:
    if not isinstance(a, (int, float)):
        raise TypeError(f"Expected number, got {type(a)}")
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b
 
# Graceful degradation
def fetch_weather(city: str) -> dict:
    try:
        response = requests.get(f"api.weather.com/{city}")
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Weather API failed: {e}")
        return {"error": "Weather data unavailable", "city": city}

Testing Strategy


# Test each tool independently first
def test_search_web():
    results = search_web("python programming")
    assert len(results) > 0
    assert "title" in results[0]
    assert "url" in results[0]
 
# Then test agent integration
def test_agent_uses_search():
    agent = ResearchAgent()
    response = agent.run("What is Python?")
    assert agent.tools_called["search_web"] >= 1

Debugging

Log everything: Tool calls, LLM responses, errors
Use verbose mode: See agent’s reasoning
Test tools solo: Before integrating with agent
Check API limits: Don’t exceed rate limits

📚 Resources

APIs (Free Tiers)

Weather: OpenWeatherMap, Weather API
Web Search: DuckDuckGo, SerpAPI
Knowledge: Wikipedia API, Wolfram Alpha
Communication: Twilio (SMS), SendGrid (email)

Libraries


pip install openai langchain requests
pip install wikipedia-api arxiv duckduckgo-search
pip install chromadb sentence-transformers  # Vector memory
pip install schedule apscheduler  # Task scheduling
pip install pytest pytest-cov  # Testing

Documentation

🤝 Getting Help

Stuck? Try these steps:

Re-read the challenge requirements
Review notebook examples
Check your tool schemas (proper JSON?)
Test tools individually
Check agent logs for errors
Ask in GitHub Discussions with:
- What you’re trying to do
- What error you’re getting
- Code snippet (relevant parts)

Common Issues:

“Agent not calling tools” → Check tool schemas
“API errors” → Verify API key, check rate limits
“Context too long” → Reduce message history
“Agent loops infinitely” → Add max iterations limit

🎓 Learning Outcomes

By completing these challenges, you will:

✅ Master tool design and implementation
✅ Build robust error handling
✅ Implement agent memory systems
✅ Create multi-agent architectures
✅ Deploy production-ready agents
✅ Debug complex agent behaviors
✅ Optimize for performance and cost
✅ Build real-world agent applications

Ready to build? Start with Challenge 1 and work your way up! 🚀