AI Agents - Challenges
Test your understanding of AI agents with these hands-on challenges! Each challenge builds on the concepts from the notebooks.
🎯 Challenge 1: Calculator Agent
Difficulty: ⭐⭐ (Beginner)
Time: 30-45 minutes
Objective
Build a calculator agent that can handle complex math expressions using function calling.
Requirements
Create an agent with these tools:
- add(a, b) - Addition
- subtract(a, b) - Subtraction
- multiply(a, b) - Multiplication
- divide(a, b) - Division
- power(base, exponent) - Exponentiation
- sqrt(n) - Square root
Test Cases
queries = [
"What is 15 plus 27?",
"Calculate 144 divided by 12",
"What's the square root of 256?",
"What is 2 to the power of 10?",
"Calculate (15 + 27) * 3", # Multi-step
]
Success Criteria
- ✅ Handles all basic operations
- ✅ Can chain multiple operations
- ✅ Proper error handling (divide by zero, negative sqrt)
- ✅ Returns clear, formatted answers
Hints
- Start with simple operations first
- Test each tool individually before integration
- Handle edge cases (division by zero, etc.)
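A minimal sketch of how the six tools and their error handling might fit together. The `TOOLS` registry and the `call_tool` dispatcher are hypothetical names used for illustration; your function-calling framework will have its own way of registering tools.

```python
import math

def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def multiply(a, b):
    return a * b

def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def power(base, exponent):
    return base ** exponent

def sqrt(n):
    if n < 0:
        raise ValueError("Cannot take the square root of a negative number")
    return math.sqrt(n)

# Hypothetical registry: maps the tool name the model emits to a function
TOOLS = {f.__name__: f for f in (add, subtract, multiply, divide, power, sqrt)}

def call_tool(name, **kwargs):
    """Run one tool and return a dict instead of raising, so the agent can report the error."""
    if name not in TOOLS:
        return {"ok": False, "error": f"Unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**kwargs)}
    except (ValueError, TypeError, ArithmeticError) as exc:
        return {"ok": False, "error": str(exc)}

# "(15 + 27) * 3" expressed as two chained tool calls
step1 = call_tool("add", a=15, b=27)
step2 = call_tool("multiply", a=step1["result"], b=3)
```

Returning a structured error rather than raising lets the model see what went wrong and explain it to the user, which is one way to meet the error-handling criterion.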
🎯 Challenge 2: Weather Agent with API
Difficulty: ⭐⭐⭐ (Intermediate)
Time: 1-2 hours
Objective
Build an agent that fetches real weather data and answers questions about it.
Requirements
Create tools for:
- get_current_weather(city) - Current conditions
- get_forecast(city, days) - Future forecast
- compare_weather(city1, city2) - Compare two locations
- get_weather_alerts(city) - Severe weather warnings
API Options
- OpenWeatherMap (free tier)
- Weather API
- MeteoStat
Test Cases
queries = [
"What's the weather like in London today?",
"Will it rain in Seattle this week?",
"Is it warmer in Miami or Los Angeles right now?",
"Any weather alerts for San Francisco?",
"Should I bring an umbrella in New York tomorrow?",
]
Success Criteria
- ✅ Makes real API calls
- ✅ Handles API errors gracefully
- ✅ Caches results (avoid redundant API calls)
- ✅ Provides natural language responses
- ✅ Includes relevant details (temp, humidity, conditions)
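One way to satisfy the caching criterion is a small time-to-live cache in front of the API call. This is a sketch under assumptions: `TTLCache`, `get_or_fetch`, and `fake_fetch_weather` are hypothetical names, and the stub stands in for a real HTTP request so the example runs offline.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple

class TTLCache:
    """Caches values for ttl_seconds; the clock is injectable to make tests easy."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]            # still fresh: no API call
        value = fetch()              # missing or stale: call the API once
        self._store[key] = (now, value)
        return value

# Stub standing in for the real weather request (assumption for illustration)
calls = []
def fake_fetch_weather(city: str) -> dict:
    calls.append(city)
    return {"city": city, "temp_c": 18.0, "conditions": "cloudy"}

cache = TTLCache(ttl_seconds=600)
first = cache.get_or_fetch(("current", "london"), lambda: fake_fetch_weather("London"))
second = cache.get_or_fetch(("current", "london"), lambda: fake_fetch_weather("London"))
```

Keying on both the tool name and a normalized city keeps current conditions and forecasts from overwriting each other.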
Optional Stretch
- Add temperature unit conversion (C ↔ F)
- Historical weather data
- Weather recommendations (clothing, activities)
🎯 Challenge 3: Multi-Tool Research Agent
Difficulty: ⭐⭐⭐ (Intermediate)
Time: 2-3 hours
Objective
Build an agent that can research topics by combining multiple information sources.
Requirements
Implement these tools:
- wikipedia_search(topic) - Search Wikipedia
- web_search(query) - DuckDuckGo or similar
- arxiv_search(topic) - Academic papers
- summarize_text(text, max_words) - Summarization
Test Cases
queries = [
"What is machine learning?",
"Summarize the latest research on quantum computing",
"Explain the history of artificial intelligence",
"What are the applications of neural networks?",
]
Success Criteria
- ✅ Searches multiple sources
- ✅ Synthesizes information from different sources
- ✅ Cites sources properly
- ✅ Handles “no results found” gracefully
- ✅ Summarizes long content effectively
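The criteria above (multiple sources, citations, graceful "no results") can be sketched without any network access. Everything here is an assumption for illustration: `SourceResult`, `gather`, `format_answer`, and the two stub searchers are hypothetical, and the URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SourceResult:
    source: str   # e.g. "wikipedia", "arxiv"
    title: str
    url: str
    snippet: str

Searcher = Callable[[str], List[SourceResult]]

def gather(query: str, searchers: Dict[str, Searcher]) -> List[SourceResult]:
    """Query every source; a failing source is skipped rather than aborting the run."""
    results: List[SourceResult] = []
    for name, search in searchers.items():
        try:
            results.extend(search(query))
        except Exception as exc:
            print(f"[warn] {name} search failed: {exc}")
    return results

def format_answer(query: str, results: List[SourceResult]) -> str:
    """Attach a numbered citation to each finding and list the sources at the end."""
    if not results:
        return f"No results found for '{query}'. Try a broader or rephrased query."
    lines = [f"Findings for '{query}':"]
    lines += [f"- {r.snippet} [{i}]" for i, r in enumerate(results, start=1)]
    lines.append("Sources:")
    lines += [f"[{i}] {r.title} ({r.source}) {r.url}" for i, r in enumerate(results, start=1)]
    return "\n".join(lines)

# Stub searchers standing in for the real tools
def fake_wikipedia(query: str) -> List[SourceResult]:
    return [SourceResult("wikipedia", "Machine learning", "https://example.org/ml",
                         "ML studies algorithms that improve with data.")]

def fake_arxiv(query: str) -> List[SourceResult]:
    raise TimeoutError("simulated outage")

found = gather("machine learning", {"wikipedia": fake_wikipedia, "arxiv": fake_arxiv})
answer = format_answer("machine learning", found)
```

In a real agent the snippets would be passed through `summarize_text` before formatting; the numbering scheme stays the same.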
Optional Stretch
- Fact-checking across sources
- Include images/diagrams
- Generate bibliography
Libraries to Use
import wikipedia
import requests
from duckduckgo_search import DDGS
import arxiv
🎯 Challenge 4: Code Review Agent
Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 3-4 hours
Objective
Create an agent that reviews Python code and provides feedback.
Requirements
Build tools for:
- check_syntax(code) - Syntax validation
- check_style(code) - PEP 8 compliance
- find_bugs(code) - Static analysis
- suggest_improvements(code) - Optimization tips
- calculate_complexity(code) - Cyclomatic complexity
Test Cases
# Test with various code samples
buggy_code = """
def divide(a, b):
    return a / b  # No zero check!
"""
inefficient_code = """
def find_max(numbers):
    for i in range(len(numbers)):
        for j in range(len(numbers)):
            if numbers[i] > numbers[j]:
                ...
"""
messy_code = """
def x(a,b,c):
    if a>b:
        if b>c:
            return a
        else:return c
"""
Success Criteria
- ✅ Identifies syntax errors
- ✅ Detects common bugs (division by zero, off-by-one, etc.)
- ✅ Suggests style improvements
- ✅ Provides specific line numbers
- ✅ Explains WHY each issue matters
- ✅ Prioritizes issues (critical → minor)
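As a starting point for line-numbered findings, the standard-library `ast` module is enough for two of the tools. This is a rough sketch: `find_unguarded_division` is a hypothetical helper with a deliberately naive heuristic (it flags every division by a bare variable), not a real static analyzer.

```python
import ast

def check_syntax(code: str) -> dict:
    """Report whether the code parses, with the line of the first syntax error."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"valid": False, "line": exc.lineno, "message": exc.msg}
    return {"valid": True, "line": None, "message": "OK"}

def find_unguarded_division(code: str) -> list:
    """Naive heuristic: flag /, // and % whose right-hand side is a plain variable."""
    issues = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Div, ast.FloorDiv, ast.Mod)):
            if isinstance(node.right, ast.Name):
                issues.append({
                    "line": node.lineno,
                    "severity": "critical",
                    "message": f"Division by variable '{node.right.id}'; confirm it cannot be zero",
                })
    return issues

sample = "def divide(a, b):\n    return a / b\n"
syntax_report = check_syntax(sample)
division_issues = find_unguarded_division(sample)
```

Each issue carries a line number and a severity, so the agent can sort findings from critical to minor and explain why each one matters.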
Libraries to Use
import ast # Parse Python code
import pylint
from radon.complexity import cc_visit # Complexity
from autopep8 import fix_code # Style fixes
Optional Stretch
- Suggest specific fixes (not just identify issues)
- Security vulnerability detection
- Performance profiling
- Generate unit tests for the code
🎯 Challenge 5: Memory-Enhanced Chatbot
Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 3-4 hours
Objective
Build a chatbot that remembers previous conversations and user preferences.
Requirements
Implement memory systems:
- Short-term memory: Last 10 messages
- Long-term memory: User facts stored in vector DB
- Summarization: Compress old conversations
Tools needed:
- remember_fact(fact) - Store user information
- recall_facts(query) - Retrieve relevant facts
- summarize_conversation() - Compress history
- get_user_profile() - Return known preferences
Test Scenario
# Session 1
User: "Hi, my name is Alice"
Bot: "Nice to meet you, Alice!"
User: "I love pizza and hiking"
Bot: "Great! I'll remember that you enjoy pizza and hiking."
# Session 2 (new session, should remember)
User: "What do you know about me?"
Bot: "Your name is Alice, and you enjoy pizza and hiking."
User: "Recommend a weekend activity"
Bot: "Based on your interest in hiking, how about exploring a nearby trail?"
Success Criteria
- ✅ Stores facts from conversation
- ✅ Retrieves relevant facts when needed
- ✅ Persists between sessions (file/DB storage)
- ✅ Handles contradictions (“I don’t like pizza anymore”)
- ✅ Summarizes when context gets too long
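Two of these criteria can be prototyped with the standard library alone: a bounded short-term window and contradiction handling by overwriting keyed facts. `ShortTermMemory` and `FactStore` are hypothetical names for illustration, and how you derive a key such as `likes_pizza` from free text is left open.

```python
from collections import deque

class ShortTermMemory:
    """Keeps only the most recent messages; older ones fall off automatically."""

    def __init__(self, max_messages: int = 10):
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def as_context(self) -> list:
        return list(self.messages)

class FactStore:
    """Keyed facts: a newer statement about the same key replaces the older one."""

    def __init__(self):
        self.facts = {}

    def remember(self, key, value):
        previous = self.facts.get(key)
        self.facts[key] = value
        return previous  # lets the bot acknowledge a changed preference

stm = ShortTermMemory(max_messages=10)
for i in range(12):
    stm.add("user", f"message {i}")

facts = FactStore()
facts.remember("likes_pizza", True)
old = facts.remember("likes_pizza", False)  # "I don't like pizza anymore"
```

Before messages fall out of the window, a fuller implementation would pass them to `summarize_conversation()` so their gist is not lost.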
Implementation Options
Option A: Simple JSON storage
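import json
import os

class SimpleMemory:
    def __init__(self, path='memory.json'):
        self.path = path
        self.facts = {}
        self.load()  # Restore facts saved by an earlier session

    def remember(self, key, value):
        self.facts[key] = value
        self.save()

    def save(self):
        with open(self.path, 'w') as f:
            json.dump(self.facts, f)

    def load(self):
        # A first run has no file yet, so start with empty memory
        if not os.path.exists(self.path):
            return
        with open(self.path, 'r') as f:
            self.facts = json.load(f)
Option B: Vector database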
import time
from sentence_transformers import SentenceTransformer
import chromadb

class VectorMemory:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        # In-memory client: swap in chromadb.PersistentClient(path=...) to persist between sessions
        self.client = chromadb.Client()
        # get_or_create avoids an error when the collection already exists
        self.collection = self.client.get_or_create_collection("memories")

    def remember(self, fact):
        embedding = self.model.encode([fact])[0]
        self.collection.add(
            embeddings=[embedding.tolist()],
            documents=[fact],
            ids=[str(time.time())]
        )

    def recall(self, query, n=5):
        query_embedding = self.model.encode([query])[0]
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n
        )
        return results['documents'][0]
Optional Stretch
- Semantic search over memories
- Automatic fact extraction from conversation
- Memory importance scoring (forget trivial facts)
- Export conversation history
🎯 Challenge 6: Multi-Agent System
Difficulty: ⭐⭐⭐⭐⭐ (Expert)
Time: 5-8 hours
Objective
Build a system where multiple specialized agents collaborate to solve complex tasks.
System Design
Create 3 specialized agents:
- Planner Agent - Breaks down tasks into steps
- Executor Agent - Performs individual steps
- Reviewer Agent - Checks quality and accuracy
Example Task Flow
User: "Research climate change and write a 500-word summary"
Planner:
Step 1: Search for climate change information
Step 2: Extract key facts
Step 3: Write summary
Step 4: Review and edit
Executor (Step 1):
[Searches web, Wikipedia, academic sources]
[Returns: List of facts]
Executor (Step 3):
[Writes draft summary]
Reviewer (Step 4):
Issues found:
- Summary is only 350 words (need 500)
- Missing citation for statistic
Action: Request revision
Executor (Revision):
[Expands summary to 500 words, adds citation]
Reviewer:
✅ All requirements met
Final output ready
Requirements
- ✅ Clear separation of responsibilities
- ✅ Inter-agent communication protocol
- ✅ Shared memory/context
- ✅ Error handling and recovery
- ✅ Feedback loops (reviewer → executor)
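A communication protocol can start as one message type plus a bounded review loop. The sketch below is illustrative only: `Message`, `run_with_review`, and the stub executor and reviewer are hypothetical, and a real system would call an LLM where the stubs return canned strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Message:
    sender: str
    recipient: str
    kind: str                        # "result" or "review"
    content: str
    approved: Optional[bool] = None  # only set on review messages

def run_with_review(step: str,
                    execute: Callable[[str, Optional[str]], str],
                    review: Callable[[str, str], Message],
                    max_revisions: int = 2) -> Tuple[str, bool, List[Message]]:
    """Execute a step, let the reviewer check it, and revise at most max_revisions times."""
    log: List[Message] = []
    feedback = None
    draft = ""
    for _ in range(max_revisions + 1):
        draft = execute(step, feedback)
        log.append(Message("executor", "reviewer", "result", draft))
        verdict = review(step, draft)
        log.append(verdict)
        if verdict.approved:
            return draft, True, log
        feedback = verdict.content   # reviewer notes feed the next attempt
    return draft, False, log         # give up gracefully; the caller decides what to do

# Stubs standing in for LLM-backed agents
def fake_execute(step: str, feedback: Optional[str]) -> str:
    return "Draft summary." if feedback is None else "Draft summary with citation."

def fake_review(step: str, draft: str) -> Message:
    ok = "citation" in draft
    note = "Looks good" if ok else "Missing citation for statistic"
    return Message("reviewer", "executor", "review", note, approved=ok)

draft, approved, log = run_with_review("Write summary", fake_execute, fake_review)
```

Capping the number of revisions keeps the reviewer and executor from bouncing a step back and forth forever, and the message log doubles as shared context.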
Test Cases
tasks = [
"Research and summarize quantum computing in 500 words",
"Find the best Italian restaurant in Seattle and make a reservation",
"Debug this Python code and write test cases for it",
"Plan a 3-day trip to Tokyo with budget under $2000",
]
Success Criteria
- ✅ Agents collaborate effectively
- ✅ Work is distributed appropriately
- ✅ Handles complex multi-step tasks
- ✅ Quality control via reviewer
- ✅ Graceful failure recovery
Architecture Example
class MultiAgentSystem:
    """Skeleton only: the agent classes, SharedMemory, and compile_results() are yours to implement."""
    def __init__(self):
        self.planner = PlannerAgent()
        self.executor = ExecutorAgent()
        self.reviewer = ReviewerAgent()
        self.shared_memory = SharedMemory()

    def execute_task(self, task):
        # 1. Planner creates plan
        plan = self.planner.create_plan(task)
        self.shared_memory.store_plan(plan)
        # 2. Executor performs steps
        for step in plan.steps:
            result = self.executor.execute(step)
            # 3. Reviewer checks quality
            review = self.reviewer.review(step, result)
            if not review.approved:
                # Retry once with feedback
                result = self.executor.execute(
                    step,
                    feedback=review.feedback
                )
            # Store the final result so a revised draft replaces the rejected one
            self.shared_memory.store_result(step.id, result)
        # 4. Final compilation
        return self.compile_results()
Optional Stretch Challenges
- Add a 4th “Coordinator” agent to manage others
- Implement voting mechanism for disagreements
- Parallel execution of independent steps
- Real-time progress tracking UI
- Agent specialization (ResearchAgent, WritingAgent, etc.)
🎯 Challenge 7: Autonomous Task Scheduler
Difficulty: ⭐⭐⭐⭐⭐ (Expert)
Time: 6-10 hours
Objective
Build an agent that autonomously manages and executes scheduled tasks.
Features Required
- Task Management
  - Add/remove/update tasks
  - Priority levels (high, medium, low)
  - Dependencies between tasks
  - Recurring tasks (daily, weekly, etc.)
- Intelligent Scheduling
  - Optimize task order based on:
    - Dependencies
    - Deadlines
    - Estimated duration
    - Resource availability
- Autonomous Execution
  - Run tasks automatically at scheduled times
  - Retry failed tasks
  - Send notifications
  - Generate reports
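Ordering tasks by their dependencies is a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+) as a lightweight alternative to a full graph library; `execution_order` is a hypothetical helper, and the task names are borrowed from the example usage in this challenge.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies: dict) -> list:
    """dependencies maps each task to the set of tasks that must finish first."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the tasks that form the cycle
        raise ValueError(f"Circular dependency: {exc.args[1]}") from exc

deps = {
    "create_report": {"collect_data", "analyze_data"},
    "analyze_data": {"collect_data"},
}
order = execution_order(deps)
```

Priorities and deadlines can then break ties among tasks that are ready at the same time; the dependency order only says what must not run early.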
Example Usage
scheduler = TaskSchedulerAgent()
# Add tasks
scheduler.add_task(
name="Daily Backup",
action="run_backup",
schedule="daily at 2am",
priority="high"
)
scheduler.add_task(
name="Generate Weekly Report",
action="create_report",
schedule="every Monday at 9am",
dependencies=["collect_data", "analyze_data"],
priority="medium"
)
# Agent runs autonomously
scheduler.start()
Tools to Implement
- add_task(task_config) - Create new task
- run_task(task_id) - Execute specific task
- check_dependencies(task_id) - Verify prerequisites
- estimate_duration(task_id) - Predict runtime
- send_notification(message, channel) - Alerts
- generate_schedule() - Optimize task order
Success Criteria
- ✅ Handles task dependencies correctly
- ✅ Executes tasks at scheduled times
- ✅ Retries with exponential backoff
- ✅ Sends success/failure notifications
- ✅ Generates execution reports
- ✅ Optimizes schedule to meet deadlines
- ✅ Handles concurrent task execution
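Retrying with exponential backoff means doubling the wait after each failure up to a cap. A minimal sketch, assuming a hypothetical `run_with_retry` helper; the `sleep` parameter is injectable so the logic can be tested without actually waiting.

```python
import time
from typing import Any, Callable

def run_with_retry(action: Callable[[], Any],
                   max_attempts: int = 4,
                   base_delay: float = 1.0,
                   max_delay: float = 60.0,
                   sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call action(); on failure wait base_delay, 2x, 4x, ... (capped) and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise                      # out of attempts: surface the last error
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)

# Demo: an action that fails twice, then succeeds
attempts = {"count": 0}
def flaky_backup() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "backup complete"

delays = []
outcome = run_with_retry(flaky_backup, sleep=delays.append)
```

In production you would usually add random jitter to each delay so that many failed tasks do not all retry at the same instant.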
Advanced Features
- Conflict detection (two tasks need same resource)
- Dynamic rescheduling when tasks run long
- Learning from past executions (improve estimates)
- Resource allocation (CPU, memory, API quotas)
Libraries to Use
import schedule
import asyncio
from apscheduler.schedulers.background import BackgroundScheduler
import networkx as nx # For dependency graphs
Optional Stretch
- Web UI for managing tasks
- Integration with calendar APIs
- ML-based duration estimation
- Multi-agent delegation (distribute work)
🎯 Challenge 8: Real-Time Monitoring Agent
Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 4-5 hours
Objective
Build an agent that monitors systems/services and takes action when issues are detected.
What to Monitor
Choose ONE or build multiple:
- Website uptime - Check if sites are accessible
- API health - Monitor response times and errors
- System resources - CPU, memory, disk usage
- Log files - Detect errors/warnings
- Social media - Track mentions or hashtags
Required Tools
- check_health(target) - Perform health check
- analyze_metrics(data) - Identify anomalies
- send_alert(severity, message) - Notify on issues
- take_action(issue) - Auto-remediation
- generate_report() - Status summary
Example: Website Monitor
monitor = MonitoringAgent(
targets=["https://example.com", "https://api.example.com"],
check_interval=60 # seconds
)
monitor.start()
# When issue detected:
# 1. Check health → Site down
# 2. Analyze → 503 error, server overload
# 3. Alert → Send Slack notification
# 4. Action → Restart server, scale resources
# 5. Report → Log incident details
Success Criteria
- ✅ Continuous monitoring (background process)
- ✅ Configurable check intervals
- ✅ Anomaly detection (what’s unusual?)
- ✅ Multi-channel alerts (email, Slack, SMS)
- ✅ Auto-remediation for common issues
- ✅ Detailed incident reports
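For the anomaly-detection criterion, a z-score against recent history is a reasonable first pass before reaching for ML. This is a sketch under assumptions: `is_anomalous` is a hypothetical helper, and the threshold of 3 standard deviations and the minimum of 5 samples are arbitrary defaults to tune.

```python
from statistics import mean, stdev
from typing import Sequence

def is_anomalous(history: Sequence[float], latest: float,
                 threshold: float = 3.0, min_samples: int = 5) -> bool:
    """Flag latest if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < min_samples:
        return False                 # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu          # perfectly flat history: any change is unusual
    return abs(latest - mu) / sigma > threshold

# Response times in milliseconds from previous checks (made-up sample data)
response_times_ms = [100, 110, 95, 105, 102, 98]
spike = is_anomalous(response_times_ms, 900)
normal = is_anomalous(response_times_ms, 104)
```

A flagged value could map to WARNING on first occurrence and escalate if it repeats, which keeps one slow request from paging anyone.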
Alert Levels
class AlertLevel:
    INFO = "info"          # FYI, no action needed
    WARNING = "warning"    # Attention required
    ERROR = "error"        # Immediate action needed
    CRITICAL = "critical"  # System down, escalate
Auto-Remediation Examples
- Website down → Restart web server
- API slow → Scale up instances
- Disk full → Clean temp files
- Memory leak → Restart process
Optional Stretch
- Anomaly detection with ML
- Predictive alerts (issue likely soon)
- Dashboard visualization
- Integration with PagerDuty/OpsGenie
- Historical trend analysis
🎯 Challenge 9: Agent Evaluation Pipeline
Difficulty: ⭐⭐⭐⭐ (Advanced)
Time: 3-5 hours
Objective
Build an evaluation pipeline that systematically measures agent quality across task success, trajectory efficiency, tool correctness, and safety.
Requirements
Create an evaluation framework with:
- EvalCase - dataclass with task, expected_outcome, required_tools
- AgentRunner - runs the agent and captures the full trajectory
- LLMJudge - scores each trajectory on a 0-5 rubric via a judge prompt
- ReportGenerator - computes pass@1, avg tool calls, avg cost, and failure analysis
Test Cases
eval_cases = [
    EvalCase(task="What is the capital of France?", expected_outcome="Paris", required_tools=[]),
    EvalCase(task="Search for recent AI papers on RAG", expected_outcome="list of papers", required_tools=["web_search"]),
    EvalCase(task="Calculate compound interest on $10k at 5% for 10 years", expected_outcome="~$16,288.95", required_tools=["calculator"]),
    EvalCase(task="Summarize this 3-page document", expected_outcome="concise summary", required_tools=["summarize"]),
]
Success Criteria
- ✅ Runs ≥10 test cases end-to-end
- ✅ LLM-as-Judge scoring with structured rubric
- ✅ Generates markdown report with per-case results
- ✅ Computes aggregate metrics (pass rate, avg cost, avg latency)
- ✅ Identifies failure patterns (which tool / step fails most?)
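The aggregate metrics can be computed from plain dataclasses once each run is recorded. In this sketch `EvalResult`, `passed`, and `summarize` are hypothetical names, the pass threshold of 4 on the 0-5 rubric is an assumption, and the scores and costs are made-up sample data rather than real results. With one run per case, the pass rate is the same thing as pass@1.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EvalCase:
    task: str
    expected_outcome: str
    required_tools: List[str] = field(default_factory=list)

@dataclass
class EvalResult:
    case: EvalCase
    judge_score: int          # 0-5 rubric score from the LLM judge
    tools_called: List[str]
    cost_usd: float
    latency_s: float

def passed(result: EvalResult, min_score: int = 4) -> bool:
    """A case passes if the judge score is high enough and every required tool was used."""
    used_required = all(t in result.tools_called for t in result.case.required_tools)
    return result.judge_score >= min_score and used_required

def summarize(results: List[EvalResult]) -> Dict[str, float]:
    if not results:
        raise ValueError("No results to summarize")
    n = len(results)
    return {
        "pass_rate": sum(passed(r) for r in results) / n,
        "avg_tool_calls": sum(len(r.tools_called) for r in results) / n,
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
        "avg_latency_s": sum(r.latency_s for r in results) / n,
    }

results = [
    EvalResult(EvalCase("What is the capital of France?", "Paris"), 5, [], 0.001, 0.8),
    # Good answer, but the required web_search tool was never called, so this case fails
    EvalResult(EvalCase("Search for recent AI papers on RAG", "list of papers", ["web_search"]),
               5, [], 0.004, 2.1),
]
report = summarize(results)
```

Grouping the failed results by missing tool or by the step where the trajectory stopped gives the failure-pattern analysis.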
Optional Stretch
- Regression detection: compare two agent versions
- Red-team safety suite: test for prompt injection, data leakage
- Cost-accuracy Pareto analysis across different models
- Integration with promptfoo or LangSmith for tracing
Reference
See Notebook 10: Agent Evaluation for implementation patterns and the TrajectoryEvaluator class.
🏆 Completion Checklist
Track your progress:
- Challenge 1: Calculator Agent ⭐⭐
- Challenge 2: Weather Agent ⭐⭐⭐
- Challenge 3: Research Agent ⭐⭐⭐
- Challenge 4: Code Review Agent ⭐⭐⭐⭐
- Challenge 5: Memory-Enhanced Chatbot ⭐⭐⭐⭐
- Challenge 6: Multi-Agent System ⭐⭐⭐⭐⭐
- Challenge 7: Task Scheduler ⭐⭐⭐⭐⭐
- Challenge 8: Monitoring Agent ⭐⭐⭐⭐
- Challenge 9: Agent Evaluation Pipeline ⭐⭐⭐⭐
💡 General Tips
Starting Out
- Read the requirements carefully - Understand what’s needed
- Plan before coding - Sketch out tool designs
- Start simple - Get basic version working first
- Test incrementally - Don’t build everything then test
Tool Design Best Practices
# ✅ GOOD: Focused, single responsibility
def search_web(query: str, num_results: int = 5) -> list:
    """Search web and return results"""
    pass
# ❌ BAD: Too many responsibilities
def do_research(topic, summarize=True, translate=False, save_file=True):
    """Does too much, hard to test and debug"""
    pass
Error Handling
import logging
import requests

logger = logging.getLogger(__name__)

# Always validate inputs
def divide(a: float, b: float) -> float:
    for value in (a, b):
        if not isinstance(value, (int, float)):
            raise TypeError(f"Expected number, got {type(value).__name__}")
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

# Graceful degradation
def fetch_weather(city: str) -> dict:
    try:
        # Placeholder URL: substitute your provider's real endpoint and API key
        response = requests.get(f"https://api.weather.com/{city}", timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Weather API failed: {e}")
        return {"error": "Weather data unavailable", "city": city}
Testing Strategy
# Test each tool independently first
def test_search_web():
    results = search_web("python programming")
    assert len(results) > 0
    assert "title" in results[0]
    assert "url" in results[0]
# Then test agent integration
def test_agent_uses_search():
    agent = ResearchAgent()
    response = agent.run("What is Python?")
    assert agent.tools_called["search_web"] >= 1
Debugging
- Log everything: Tool calls, LLM responses, errors
- Use verbose mode: See agent’s reasoning
- Test tools solo: Before integrating with agent
- Check API limits: Don’t exceed rate limits
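One low-effort way to "log everything" is a decorator that wraps each tool. The sketch below uses only the standard library; `logged_tool` and the `agent.tools` logger name are hypothetical choices for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("agent.tools")

def logged_tool(func):
    """Log arguments, duration, and failures for every call to the wrapped tool."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        logger.info("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("tool %s failed", func.__name__)
            raise                     # re-raise so the agent still sees the error
        logger.info("done %s in %.3fs", func.__name__, time.perf_counter() - start)
        return result
    return wrapper

@logged_tool
def add(a, b):
    return a + b

logging.basicConfig(level=logging.INFO)
total = add(2, 3)
```

Because `functools.wraps` preserves the function name and docstring, tool schemas generated from the function signature keep working after decoration.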
📚 Resources
APIs (Free Tiers)
- Weather: OpenWeatherMap, Weather API
- Web Search: DuckDuckGo, SerpAPI
- Knowledge: Wikipedia API, Wolfram Alpha
- Communication: Twilio (SMS), SendGrid (email)
Libraries
pip install openai langchain requests
pip install wikipedia arxiv duckduckgo-search
pip install chromadb sentence-transformers # Vector memory
pip install schedule apscheduler # Task scheduling
pip install pytest pytest-cov # Testing
Documentation
🤝 Getting Help
Stuck? Try these steps:
- Re-read the challenge requirements
- Review notebook examples
- Check your tool schemas (proper JSON?)
- Test tools individually
- Check agent logs for errors
- Ask in GitHub Discussions with:
  - What you’re trying to do
  - What error you’re getting
  - Code snippet (relevant parts)
Common Issues:
- “Agent not calling tools” → Check tool schemas
- “API errors” → Verify API key, check rate limits
- “Context too long” → Reduce message history
- “Agent loops infinitely” → Add max iterations limit
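The last fix, a maximum-iterations limit, can be as small as a bounded loop around the agent's step function. This is a sketch: `run_agent` and the `step(history) -> (done, output)` contract are assumptions for illustration, not the interface of any particular framework.

```python
from typing import Any, Callable, List, Tuple

def run_agent(step: Callable[[List[Any]], Tuple[bool, Any]],
              max_iterations: int = 10) -> dict:
    """Call step() until it reports done, but never more than max_iterations times."""
    history: List[Any] = []
    for i in range(max_iterations):
        done, output = step(history)
        history.append(output)
        if done:
            return {"status": "finished", "iterations": i + 1, "output": output}
    # Limit reached: stop and report instead of looping forever
    last = history[-1] if history else None
    return {"status": "stopped", "iterations": max_iterations, "output": last}

# A step that never finishes, to show the guard working
def stuck_step(history):
    return False, f"thinking... ({len(history) + 1})"

# A step that finishes on its third call
def three_step(history):
    return len(history) == 2, f"step {len(history) + 1}"

stuck = run_agent(stuck_step, max_iterations=5)
finished = run_agent(three_step)
```

Returning a "stopped" status, rather than raising, lets the caller decide whether to show a partial answer or ask the user for guidance.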
🎓 Learning Outcomes
By completing these challenges, you will:
- ✅ Master tool design and implementation
- ✅ Build robust error handling
- ✅ Implement agent memory systems
- ✅ Create multi-agent architectures
- ✅ Deploy production-ready agents
- ✅ Debug complex agent behaviors
- ✅ Optimize for performance and cost
- ✅ Build real-world agent applications
Ready to build? Start with Challenge 1 and work your way up! 🚀