AI Agents - Assignment
Build a Production-Ready AI Agent System
📋 Assignment Overview
Objective: Build a fully functional AI agent that can autonomously accomplish complex tasks using multiple tools and reasoning.
Estimated Time: 10-15 hours
Scope: Capstone-style build with optional extensions
Suggested Pace: 1-2 weeks after completing the main agent notebooks
🎯 Learning Objectives
After completing this assignment, you will be able to:
- ✅ Design and implement tool schemas for AI agents
- ✅ Build agents that use multiple tools effectively
- ✅ Implement error handling and validation
- ✅ Add memory and state management
- ✅ Evaluate agent performance
- ✅ Deploy a production-ready agent
📦 Deliverables
- Agent Implementation (Python code)
- Tool Definitions (JSON schemas + implementations)
- Test Suite (Unit tests + integration tests)
- Documentation (README + API docs)
- Demo Video or Live Demo (3-5 minutes)
- Report (2-3 pages analyzing your agent)
🏗️ Part 1: Agent Design & Implementation
Choose ONE Agent Type:
Option A: SQL Agent (Recommended for beginners)
Purpose: Natural language → SQL queries → Execute → Present results
Required Tools:
generate_sql_query(question, schema)- Convert NL to SQLexecute_query(sql)- Run SQL on databaseexplain_results(data)- Interpret resultsvisualize_data(data, chart_type)- Create charts
Example Interaction:
User: "Show me the top 5 customers by revenue in 2024"
Agent:
1. Generates SQL: SELECT customer_id, SUM(revenue) ...
2. Executes query
3. Returns: "Here are your top 5 customers:
1. Acme Corp - $1.2M
2. TechStart - $980K
..."
4. Creates bar chart visualizationBonus: Handle follow-up questions, query optimization suggestions
Option B: Research Agent
Purpose: Topic → Search → Summarize → Compile → Report
Required Tools:
web_search(query, num_results)- Search the webscrape_webpage(url)- Extract contentsummarize_text(text, max_length)- Create summariesgenerate_report(sections)- Compile final report
Example Interaction:
User: "Research the latest developments in quantum computing"
Agent:
1. Searches for "quantum computing 2024 breakthroughs"
2. Scrapes top 5 articles
3. Summarizes each article
4. Compiles comprehensive report with citationsBonus: Fact-checking, multi-source verification, citation formatting
Option C: Code Debugging Agent
Purpose: Buggy code → Analyze → Identify issues → Fix → Test
Required Tools:
analyze_code(code, language)- Static analysisrun_tests(code, tests)- Execute test suitesuggest_fixes(errors)- Propose solutionsapply_fix(code, fix)- Implement fix
Example Interaction:
User: "Debug this Python function that's failing tests"
Agent:
1. Analyzes code structure
2. Runs test suite
3. Identifies: "Index out of bounds error on line 15"
4. Suggests fix: "Add bounds checking"
5. Applies fix
6. Re-runs tests → All pass ✅Bonus: Performance optimization, code quality improvements
Option D: Personal Assistant Agent
Purpose: Manage calendar, emails, tasks, reminders
Required Tools:
check_calendar(date_range)- View eventsschedule_meeting(title, time, attendees)- Create eventssend_email(to, subject, body)- Send emailsset_reminder(task, time)- Create remindersweb_search(query)- Research information
Example Interaction:
User: "Schedule a meeting with John next Tuesday at 2pm to discuss Q1 planning"
Agent:
1. Checks calendar for conflicts
2. Finds available slot
3. Creates meeting event
4. Sends email invitation to John
5. Sets reminder for 1 hour beforeBonus: Smart scheduling (avoid lunch hours, respect time zones), meeting prep
Option E: Agent Evaluation Pipeline (NEW - see Notebook 10)
Purpose: Build → Evaluate → Improve an agent using structured eval methods
Required Tools:
run_agent(task)- Execute agent on a test casescore_trajectory(trace)- LLM-as-Judge trajectory scoringcompare_runs(run_a, run_b)- A/B comparison of agent variantsgenerate_report(results)- Eval dashboard with pass-rates and cost
Example Interaction:
User: "Evaluate my research agent on 20 test cases"
Pipeline:
1. Loads test suite from eval_cases.json
2. Runs agent on each case, records trajectory + tool calls
3. LLM-as-Judge scores each trajectory (0-5)
4. Computes pass@1 rate, avg tool calls, avg cost
5. Generates markdown report with failure analysisBonus: Regression detection (compare to previous run), safety red-team suite
Requirements (All Options):
1. Agent Architecture
- Clean separation of concerns (agent logic, tools, utilities)
- Configurable (system prompts, tool selection, parameters)
- Logging of all agent actions
- Error recovery mechanisms
2. Tool Implementation
- At least 4 tools implemented
- Proper JSON schemas for all tools
- Input validation and error handling
- Tool execution logging
3. Agent Reasoning
- Intelligent tool selection
- Multi-step planning for complex tasks
- Ability to self-correct when errors occur
- Clear reasoning traces (what, why, how)
🧠 Part 2: Memory & State Management
Implement memory systems for your agent:
2.1 Conversation History
class ConversationMemory:
def __init__(self, max_messages=10):
self.messages = []
self.max_messages = max_messages
def add_message(self, role, content):
"""Add message to history"""
pass
def get_context(self):
"""Return recent context for LLM"""
pass
def summarize_old_messages(self):
"""Compress old messages"""
passRequirements:
- Store conversation history
- Limit context window (token management)
- Summarize old messages to save tokens
- Clear context on user request
2.2 Task Memory
class TaskMemory:
def __init__(self):
self.completed_steps = []
self.pending_steps = []
def record_step(self, step, result):
"""Record completed step"""
pass
def get_progress(self):
"""Return task progress"""
passRequirements:
- Track completed vs. pending steps
- Resume from failures
- Progress reporting
2.3 Long-Term Memory (Optional)
- Vector database for facts/knowledge
- Retrieve relevant past interactions
- Personalization based on history
🧪 Part 3: Testing & Evaluation
3.1 Unit Tests
Test each tool individually:
def test_tool_name():
"""Test tool with valid inputs"""
result = my_tool(valid_input)
assert result == expected_output
def test_tool_error_handling():
"""Test tool with invalid inputs"""
with pytest.raises(ValueError):
my_tool(invalid_input)Requirements:
- Test all tools with valid inputs
- Test error cases
- Test edge cases
- Achieve >80% code coverage
3.2 Integration Tests
Test agent end-to-end:
def test_agent_simple_query():
"""Test agent with straightforward query"""
response = agent.run("simple query")
assert "expected" in response.lower()
def test_agent_multi_step():
"""Test agent with complex multi-step task"""
response = agent.run("complex task requiring multiple tools")
assert agent.tools_used >= 2
assert response.success == TrueRequirements:
- Test simple queries
- Test multi-step tasks
- Test error recovery
- Test with real/mocked APIs
3.3 Evaluation Metrics
Measure agent performance:
metrics = {
"task_success_rate": 0.85, # % of tasks completed successfully
"avg_tool_calls": 3.2, # Average tools used per task
"avg_response_time": 5.4, # Seconds
"token_usage": 1500, # Average tokens per interaction
"error_rate": 0.05 # % of errors
}Requirements:
- Success rate on test cases
- Average response time
- Token efficiency
- Error recovery rate
📝 Part 4: Documentation & Demo
4.1 README.md
# [Your Agent Name]
## Overview
Brief description of what your agent does
## Features
- Feature 1
- Feature 2
## Installation
```bash
pip install -r requirements.txtUsage
from my_agent import Agent
agent = Agent()
result = agent.run("your query")Architecture
Diagram showing components
API Reference
Tool descriptions and parameters
Examples
5+ example interactions
### 4.2 Code Documentation
- [ ] Docstrings for all functions
- [ ] Type hints
- [ ] Inline comments for complex logic
- [ ] API reference (auto-generated)
### 4.3 Demo
**Option 1: Video Demo (3-5 minutes)**
- Show agent handling 3+ different queries
- Explain tool selection decisions
- Demonstrate error handling
**Option 2: Live Demo + Gradio UI**
- Build web interface
- Demo during presentation
- Include example queries
---
## 🎁 Optional Extensions
### Optional Extension 1: Advanced Reasoning
Implement **ReAct** (Reasoning + Acting) pattern:Thought: I need to find the revenue data Action: execute_query(“SELECT SUM(revenue) FROM sales WHERE year=2024”) Observation: Total revenue is 4.1M Thought: Growth is 26.8%, I can now respond Final Answer: Revenue grew by 26.8% from 5.2M
### Optional Extension 2: Parallel Tool Execution
- Execute multiple independent tools concurrently
- Aggregate results efficiently
- Handle parallel errors gracefully
### Optional Extension 3: Agent Optimization
- Cache frequent API calls
- Optimize token usage
- Reduce latency with streaming
- Smart tool selection (skip unnecessary tools)
### Optional Extension 4: Production Deployment
- Deploy as REST API (FastAPI/Flask)
- Add authentication
- Rate limiting
- Monitoring dashboard
- Docker containerization
### Optional Extension 5: MCP Integration
- Expose your agent's tools via MCP server
- Connect to your agent from an MCP-compatible client (Claude Desktop, Cursor)
- Demonstrate cross-runtime tool sharing
### Optional Extension 6: Agent Evaluation Pipeline
- Build an LLM-as-Judge scorer for trajectory quality
- Run eval suite of ≥10 test cases with pass/fail tracking
- Generate a markdown report with failure analysis
- See **Notebook 10: Agent Evaluation** for patterns
---
## 📊 Self-Review Guide
### Part 1: Agent Design & Implementation
| Criteria | Relative Emphasis | Description |
|----------|-------------------|-------------|
| **Architecture** | High | Clean code, separation of concerns, configurability |
| **Tools** | High | All tools work correctly, proper schemas, error handling |
| **Reasoning** | Medium | Intelligent tool selection, multi-step planning |
### Part 2: Memory & State
| Criteria | Relative Emphasis | Description |
|----------|-------------------|-------------|
| **Conversation History** | High | Properly stores and retrieves context |
| **Task Memory** | High | Tracks progress, resumes from failures |
| **Implementation** | Medium | Clean code, efficient storage |
### Part 3: Testing & Evaluation
| Criteria | Relative Emphasis | Description |
|----------|-------------------|-------------|
| **Unit Tests** | High | Comprehensive coverage, edge cases |
| **Integration Tests** | High | End-to-end scenarios, error cases |
| **Metrics** | Medium | Proper evaluation methodology |
### Part 4: Documentation & Demo
| Criteria | Relative Emphasis | Description |
|----------|-------------------|-------------|
| **README** | High | Clear, comprehensive, examples |
| **Code Docs** | Medium | Docstrings, type hints, comments |
| **Demo** | High | Shows key features, explains decisions |
### Optional Extensions
- ReAct pattern: +5
- Parallel execution: +5
- Optimization: +5
- Deployment: +5
- MCP integration: +5
- Agent evaluation: +5
---
## 💡 Hints & Tips
### Getting Started
1. **Start simple:** Build basic agent with 1-2 tools first
2. **Test early:** Write tests as you build tools
3. **Iterate:** Add features incrementally
4. **Use frameworks:** LangChain can simplify development
### Tool Design
- Keep tools focused (single responsibility)
- Validate inputs rigorously
- Return structured data (JSON)
- Include helpful error messages
### Debugging
- Log all LLM calls and tool executions
- Test tools independently before agent integration
- Use `print` statements liberally
- Check token usage to avoid context overflow
### Common Pitfalls
- ❌ Tools that do too much (break into smaller tools)
- ❌ Poor error handling (always validate inputs)
- ❌ No logging (impossible to debug)
- ❌ Ignoring context limits (manage tokens carefully)
---
## 📚 Resources
### Code Examples
- [OpenAI Function Calling Examples](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models)
- [LangChain Agent Templates](https://python.langchain.com/docs/modules/agents/agent_types/)
- [Agent Design Patterns](https://github.com/microsoft/ai-agents-for-beginners)
### Testing
- [Pytest Documentation](https://docs.pytest.org/)
- [Unit Testing Best Practices](https://realpython.com/python-testing/)
### Deployment
- [FastAPI Tutorial](https://fastapi.tiangolo.com/tutorial/)
- [Docker for Python](https://docs.docker.com/language/python/)
---
## 🤝 Collaboration Policy
- **Recommended default:** Build this project independently
- **Getting help:** GitHub Discussions, documentation, and targeted implementation questions
- **Code sharing:** Don't share solutions, but discuss approaches
- **AI assistance:** OK to use for debugging, not for writing entire agent
---
## 📅 Project Packaging
**Suggested GitHub workflow:**
1. Create repo: `ai-agent-[your-name]`
2. Include all deliverables
3. Add comprehensive README
4. Save the repo link with your project notes or portfolio materials
---
## ❓ FAQ
**Q: Can I use LangChain or must I build from scratch?**
A: You can use frameworks, but you must understand and explain the code.
**Q: How many tools are required?**
A: Minimum 4 tools. More is better if they're all useful.
**Q: Can I use mock/fake APIs for testing?**
A: Yes for testing, but include at least one real API integration.
**Q: What if my agent makes mistakes?**
A: That's OK! Document the failure cases and explain why they occur.
**Q: Can I work in a team?**
A: No, this is individual. But you can discuss ideas with classmates.
---
**Good luck building your AI agent! 🚀🤖**