Skip to Content
15 AI Agents12 Assignment

AI Agents - Assignment

Build a Production-Ready AI Agent System


📋 Assignment Overview

Objective: Build a fully functional AI agent that can autonomously accomplish complex tasks using multiple tools and reasoning.

Estimated Time: 10-15 hours
Scope: Capstone-style build with optional extensions
Suggested Pace: 1-2 weeks after completing the main agent notebooks


🎯 Learning Objectives

After completing this assignment, you will be able to:

  • ✅ Design and implement tool schemas for AI agents
  • ✅ Build agents that use multiple tools effectively
  • ✅ Implement error handling and validation
  • ✅ Add memory and state management
  • ✅ Evaluate agent performance
  • ✅ Deploy a production-ready agent

📦 Deliverables

  1. Agent Implementation (Python code)
  2. Tool Definitions (JSON schemas + implementations)
  3. Test Suite (Unit tests + integration tests)
  4. Documentation (README + API docs)
  5. Demo Video or Live Demo (3-5 minutes)
  6. Report (2-3 pages analyzing your agent)

🏗️ Part 1: Agent Design & Implementation

Choose ONE Agent Type:

Purpose: Natural language → SQL queries → Execute → Present results

Required Tools:

  • generate_sql_query(question, schema) - Convert NL to SQL
  • execute_query(sql) - Run SQL on database
  • explain_results(data) - Interpret results
  • visualize_data(data, chart_type) - Create charts

Example Interaction:

User: "Show me the top 5 customers by revenue in 2024" Agent: 1. Generates SQL: SELECT customer_id, SUM(revenue) ... 2. Executes query 3. Returns: "Here are your top 5 customers: 1. Acme Corp - $1.2M 2. TechStart - $980K ..." 4. Creates bar chart visualization

Bonus: Handle follow-up questions, query optimization suggestions


Option B: Research Agent

Purpose: Topic → Search → Summarize → Compile → Report

Required Tools:

  • web_search(query, num_results) - Search the web
  • scrape_webpage(url) - Extract content
  • summarize_text(text, max_length) - Create summaries
  • generate_report(sections) - Compile final report

Example Interaction:

User: "Research the latest developments in quantum computing" Agent: 1. Searches for "quantum computing 2024 breakthroughs" 2. Scrapes top 5 articles 3. Summarizes each article 4. Compiles comprehensive report with citations

Bonus: Fact-checking, multi-source verification, citation formatting


Option C: Code Debugging Agent

Purpose: Buggy code → Analyze → Identify issues → Fix → Test

Required Tools:

  • analyze_code(code, language) - Static analysis
  • run_tests(code, tests) - Execute test suite
  • suggest_fixes(errors) - Propose solutions
  • apply_fix(code, fix) - Implement fix

Example Interaction:

User: "Debug this Python function that's failing tests" Agent: 1. Analyzes code structure 2. Runs test suite 3. Identifies: "Index out of bounds error on line 15" 4. Suggests fix: "Add bounds checking" 5. Applies fix 6. Re-runs tests → All pass ✅

Bonus: Performance optimization, code quality improvements


Option D: Personal Assistant Agent

Purpose: Manage calendar, emails, tasks, reminders

Required Tools:

  • check_calendar(date_range) - View events
  • schedule_meeting(title, time, attendees) - Create events
  • send_email(to, subject, body) - Send emails
  • set_reminder(task, time) - Create reminders
  • web_search(query) - Research information

Example Interaction:

User: "Schedule a meeting with John next Tuesday at 2pm to discuss Q1 planning" Agent: 1. Checks calendar for conflicts 2. Finds available slot 3. Creates meeting event 4. Sends email invitation to John 5. Sets reminder for 1 hour before

Bonus: Smart scheduling (avoid lunch hours, respect time zones), meeting prep


Option E: Agent Evaluation Pipeline (NEW - see Notebook 10)

Purpose: Build → Evaluate → Improve an agent using structured eval methods

Required Tools:

  • run_agent(task) - Execute agent on a test case
  • score_trajectory(trace) - LLM-as-Judge trajectory scoring
  • compare_runs(run_a, run_b) - A/B comparison of agent variants
  • generate_report(results) - Eval dashboard with pass-rates and cost

Example Interaction:

User: "Evaluate my research agent on 20 test cases" Pipeline: 1. Loads test suite from eval_cases.json 2. Runs agent on each case, records trajectory + tool calls 3. LLM-as-Judge scores each trajectory (0-5) 4. Computes pass@1 rate, avg tool calls, avg cost 5. Generates markdown report with failure analysis

Bonus: Regression detection (compare to previous run), safety red-team suite


Requirements (All Options):

1. Agent Architecture

  • Clean separation of concerns (agent logic, tools, utilities)
  • Configurable (system prompts, tool selection, parameters)
  • Logging of all agent actions
  • Error recovery mechanisms

2. Tool Implementation

  • At least 4 tools implemented
  • Proper JSON schemas for all tools
  • Input validation and error handling
  • Tool execution logging

3. Agent Reasoning

  • Intelligent tool selection
  • Multi-step planning for complex tasks
  • Ability to self-correct when errors occur
  • Clear reasoning traces (what, why, how)

🧠 Part 2: Memory & State Management

Implement memory systems for your agent:

2.1 Conversation History

class ConversationMemory: def __init__(self, max_messages=10): self.messages = [] self.max_messages = max_messages def add_message(self, role, content): """Add message to history""" pass def get_context(self): """Return recent context for LLM""" pass def summarize_old_messages(self): """Compress old messages""" pass

Requirements:

  • Store conversation history
  • Limit context window (token management)
  • Summarize old messages to save tokens
  • Clear context on user request

2.2 Task Memory

class TaskMemory: def __init__(self): self.completed_steps = [] self.pending_steps = [] def record_step(self, step, result): """Record completed step""" pass def get_progress(self): """Return task progress""" pass

Requirements:

  • Track completed vs. pending steps
  • Resume from failures
  • Progress reporting

2.3 Long-Term Memory (Optional)

  • Vector database for facts/knowledge
  • Retrieve relevant past interactions
  • Personalization based on history

🧪 Part 3: Testing & Evaluation

3.1 Unit Tests

Test each tool individually:

def test_tool_name(): """Test tool with valid inputs""" result = my_tool(valid_input) assert result == expected_output def test_tool_error_handling(): """Test tool with invalid inputs""" with pytest.raises(ValueError): my_tool(invalid_input)

Requirements:

  • Test all tools with valid inputs
  • Test error cases
  • Test edge cases
  • Achieve >80% code coverage

3.2 Integration Tests

Test agent end-to-end:

def test_agent_simple_query(): """Test agent with straightforward query""" response = agent.run("simple query") assert "expected" in response.lower() def test_agent_multi_step(): """Test agent with complex multi-step task""" response = agent.run("complex task requiring multiple tools") assert agent.tools_used >= 2 assert response.success == True

Requirements:

  • Test simple queries
  • Test multi-step tasks
  • Test error recovery
  • Test with real/mocked APIs

3.3 Evaluation Metrics

Measure agent performance:

metrics = { "task_success_rate": 0.85, # % of tasks completed successfully "avg_tool_calls": 3.2, # Average tools used per task "avg_response_time": 5.4, # Seconds "token_usage": 1500, # Average tokens per interaction "error_rate": 0.05 # % of errors }

Requirements:

  • Success rate on test cases
  • Average response time
  • Token efficiency
  • Error recovery rate

📝 Part 4: Documentation & Demo

4.1 README.md

# [Your Agent Name] ## Overview Brief description of what your agent does ## Features - Feature 1 - Feature 2 ## Installation ```bash pip install -r requirements.txt

Usage

from my_agent import Agent agent = Agent() result = agent.run("your query")

Architecture

Diagram showing components

API Reference

Tool descriptions and parameters

Examples

5+ example interactions

### 4.2 Code Documentation - [ ] Docstrings for all functions - [ ] Type hints - [ ] Inline comments for complex logic - [ ] API reference (auto-generated) ### 4.3 Demo **Option 1: Video Demo (3-5 minutes)** - Show agent handling 3+ different queries - Explain tool selection decisions - Demonstrate error handling **Option 2: Live Demo + Gradio UI** - Build web interface - Demo during presentation - Include example queries --- ## 🎁 Optional Extensions ### Optional Extension 1: Advanced Reasoning Implement **ReAct** (Reasoning + Acting) pattern:

Thought: I need to find the revenue data Action: execute_query(“SELECT SUM(revenue) FROM sales WHERE year=2024”) Observation: Total revenue is 5.2MThought:NowIshouldcompareto2023Action:executequery("SELECTSUM(revenue)FROMsalesWHEREyear=2023")Observation:2023revenuewas5.2M Thought: Now I should compare to 2023 Action: execute_query("SELECT SUM(revenue) FROM sales WHERE year=2023") Observation: 2023 revenue was 4.1M Thought: Growth is 26.8%, I can now respond Final Answer: Revenue grew by 26.8% from 4.1Mto4.1M to 5.2M

### Optional Extension 2: Parallel Tool Execution - Execute multiple independent tools concurrently - Aggregate results efficiently - Handle parallel errors gracefully ### Optional Extension 3: Agent Optimization - Cache frequent API calls - Optimize token usage - Reduce latency with streaming - Smart tool selection (skip unnecessary tools) ### Optional Extension 4: Production Deployment - Deploy as REST API (FastAPI/Flask) - Add authentication - Rate limiting - Monitoring dashboard - Docker containerization ### Optional Extension 5: MCP Integration - Expose your agent's tools via MCP server - Connect to your agent from an MCP-compatible client (Claude Desktop, Cursor) - Demonstrate cross-runtime tool sharing ### Optional Extension 6: Agent Evaluation Pipeline - Build an LLM-as-Judge scorer for trajectory quality - Run eval suite of ≥10 test cases with pass/fail tracking - Generate a markdown report with failure analysis - See **Notebook 10: Agent Evaluation** for patterns --- ## 📊 Self-Review Guide ### Part 1: Agent Design & Implementation | Criteria | Relative Emphasis | Description | |----------|-------------------|-------------| | **Architecture** | High | Clean code, separation of concerns, configurability | | **Tools** | High | All tools work correctly, proper schemas, error handling | | **Reasoning** | Medium | Intelligent tool selection, multi-step planning | ### Part 2: Memory & State | Criteria | Relative Emphasis | Description | |----------|-------------------|-------------| | **Conversation History** | High | Properly stores and retrieves context | | **Task Memory** | High | Tracks progress, resumes from failures | | **Implementation** | Medium | Clean code, efficient storage | ### Part 3: Testing & Evaluation | Criteria | Relative Emphasis | Description | |----------|-------------------|-------------| | **Unit Tests** | High | Comprehensive coverage, edge cases | | **Integration Tests** | High | End-to-end scenarios, error cases | | **Metrics** | Medium | Proper evaluation methodology | ### Part 4: Documentation & Demo | Criteria | Relative Emphasis | Description | |----------|-------------------|-------------| | **README** | High | Clear, comprehensive, examples | | **Code Docs** | Medium | Docstrings, type hints, comments | | **Demo** | High | Shows key features, explains decisions | ### Optional Extensions - ReAct pattern: +5 - Parallel execution: +5 - Optimization: +5 - Deployment: +5 - MCP integration: +5 - Agent evaluation: +5 --- ## 💡 Hints & Tips ### Getting Started 1. **Start simple:** Build basic agent with 1-2 tools first 2. **Test early:** Write tests as you build tools 3. **Iterate:** Add features incrementally 4. **Use frameworks:** LangChain can simplify development ### Tool Design - Keep tools focused (single responsibility) - Validate inputs rigorously - Return structured data (JSON) - Include helpful error messages ### Debugging - Log all LLM calls and tool executions - Test tools independently before agent integration - Use `print` statements liberally - Check token usage to avoid context overflow ### Common Pitfalls - ❌ Tools that do too much (break into smaller tools) - ❌ Poor error handling (always validate inputs) - ❌ No logging (impossible to debug) - ❌ Ignoring context limits (manage tokens carefully) --- ## 📚 Resources ### Code Examples - [OpenAI Function Calling Examples](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models) - [LangChain Agent Templates](https://python.langchain.com/docs/modules/agents/agent_types/) - [Agent Design Patterns](https://github.com/microsoft/ai-agents-for-beginners) ### Testing - [Pytest Documentation](https://docs.pytest.org/) - [Unit Testing Best Practices](https://realpython.com/python-testing/) ### Deployment - [FastAPI Tutorial](https://fastapi.tiangolo.com/tutorial/) - [Docker for Python](https://docs.docker.com/language/python/) --- ## 🤝 Collaboration Policy - **Recommended default:** Build this project independently - **Getting help:** GitHub Discussions, documentation, and targeted implementation questions - **Code sharing:** Don't share solutions, but discuss approaches - **AI assistance:** OK to use for debugging, not for writing entire agent --- ## 📅 Project Packaging **Suggested GitHub workflow:** 1. Create repo: `ai-agent-[your-name]` 2. Include all deliverables 3. Add comprehensive README 4. Save the repo link with your project notes or portfolio materials --- ## ❓ FAQ **Q: Can I use LangChain or must I build from scratch?** A: You can use frameworks, but you must understand and explain the code. **Q: How many tools are required?** A: Minimum 4 tools. More is better if they're all useful. **Q: Can I use mock/fake APIs for testing?** A: Yes for testing, but include at least one real API integration. **Q: What if my agent makes mistakes?** A: That's OK! Document the failure cases and explain why they occur. **Q: Can I work in a team?** A: No, this is individual. But you can discuss ideas with classmates. --- **Good luck building your AI agent! 🚀🤖**
Last updated on