Real-Time & Streaming AI
Status: This phase is currently an introduction, not one of the repo’s deepest modules yet. Use it to learn the core patterns, then pair it with MLOps, local LLMs, or multimodal work for stronger projects.
Overview
Learn how to build real-time AI applications with streaming responses, WebSocket connections, progressive loading, and live voice or multimodal interaction patterns.
Duration: 8 hours (5 notebooks + materials)
Topics Covered:
- Streaming LLM Responses
- WebSocket Connections
- Real-Time RAG
- Production Streaming Systems
- Realtime voice and multimodal interactions
Learning Objectives
By the end of this phase, you will be able to:
- Implement Server-Sent Events (SSE) for streaming
- Build WebSocket-based real-time chat applications
- Understand when WebRTC is a better fit than SSE/WebSockets
- Handle progressive loading and chunked responses
- Create streaming RAG pipelines
- Design production streaming systems with backpressure and observability in mind
- Optimize for latency and throughput
- Design interruption-safe realtime voice loops
Prerequisites
- Strong Python programming skills
- Understanding of LLMs and API-based workflows
- Basic knowledge of async/await
- Familiarity with web technologies
- Prior exposure to prompting, embeddings, or simple app integration work is helpful
How To Use This Phase Right Now
- Learn the transport patterns first: SSE, WebSockets, and when WebRTC enters the picture.
- Build one small streaming project before optimizing everything.
- Add 06_realtime_voice_multimodal.ipynb when you move from text streaming to turn-taking voice or live multimodal assistants.
- Treat this phase as a systems-pattern module that complements ../09-mlops/09-mlops.ipynb, ../13-multimodal/13-multimodal.ipynb, and ../14-local-llms/14-local-llms.ipynb.
Course Content
1. Streaming Responses (90 minutes)
File: 02_streaming_responses.ipynb
Topics:
- OpenAI Responses API event streaming
- Server-Sent Events (SSE) protocol
- Handling semantic stream events and text deltas
- Measuring TTFT and output rate
- Error handling in streams
- Progress indicators
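The SSE protocol item above comes down to a small text framing: each event is a group of `field: value` lines ended by a blank line. The sketch below is one possible encoder plus a matching decoder, assuming UTF-8 text payloads. The `parse_sse` helper is hypothetical and only understands frames produced by this `format_sse`; it is not a full spec-compliant parser.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one SSE frame; every line of `data` gets its own data: field."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A raw newline inside a data: line would terminate the field early,
    # so multi-line payloads are split across several data: lines.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line marks the end of the frame.
    return "\n".join(lines) + "\n\n"


def parse_sse(stream: str) -> list:
    """Decode frames written by format_sse (illustrative, not spec-complete)."""
    events = []
    for block in stream.split("\n\n"):
        if not block:
            continue
        name = "message"  # default event name when no event: field is sent
        data_lines = []
        for line in block.split("\n"):
            field, _, value = line.partition(": ")
            if field == "event":
                name = value
            elif field == "data":
                data_lines.append(value)
        # Consecutive data: lines are rejoined with newlines.
        events.append({"event": name, "data": "\n".join(data_lines)})
    return events
```

Splitting on newlines matters for LLM output in particular, because text deltas often contain line breaks that would otherwise cut a frame short.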
Key Code:
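# OpenAI Responses API streaming
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for event in client.responses.create(
    model="gpt-4.1",
    input="Tell me a story",
    stream=True,
):
    # Text arrives as incremental delta events
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)

# SSE frame format: one data: line followed by a blank line
def format_sse(delta: str) -> str:
    return f"data: {delta}\n\n"

2. WebSocket Connections (90 minutes)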
File: 03_websocket_connections.ipynb
Topics:
- WebSocket protocol basics
- Bidirectional communication
- FastAPI WebSocket endpoints
- Client-side WebSocket handling
- Connection management
- Heartbeat and reconnection
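Connection management from the list above can be sketched without any web framework. The example below keeps an in-memory registry of clients and drops any client whose send fails during a broadcast. `ConnectionManager`, `TextSocket`, and `FakeSocket` are hypothetical names for illustration; the only assumption about a real socket object is that it exposes an async `send_text` method, as FastAPI's `WebSocket` does.

```python
import asyncio
from typing import Protocol


class TextSocket(Protocol):
    async def send_text(self, data: str) -> None: ...


class ConnectionManager:
    """Tracks live sockets by client id (single-process, in-memory only)."""

    def __init__(self) -> None:
        self._clients: dict = {}

    def connect(self, client_id: str, socket: TextSocket) -> None:
        self._clients[client_id] = socket

    def disconnect(self, client_id: str) -> None:
        self._clients.pop(client_id, None)

    def client_ids(self) -> list:
        return list(self._clients)

    async def broadcast(self, message: str) -> list:
        """Send to every client; remove and return the ids whose send failed."""
        ids = list(self._clients)
        results = await asyncio.gather(
            *(self._clients[i].send_text(message) for i in ids),
            return_exceptions=True,
        )
        dropped = [i for i, r in zip(ids, results) if isinstance(r, Exception)]
        for client_id in dropped:
            self.disconnect(client_id)
        return dropped


class FakeSocket:
    """Test double standing in for a real WebSocket connection."""

    def __init__(self, broken: bool = False) -> None:
        self.broken = broken
        self.sent: list = []

    async def send_text(self, data: str) -> None:
        if self.broken:
            raise ConnectionError("peer went away")
        self.sent.append(data)
```

A multi-process deployment would need a shared channel (the Redis entry in the stack below is one option) because this registry only sees sockets owned by its own process.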
Key Code:
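import asyncio

async def process_message(data: str) -> str:
    # Stand-in for the model call that would handle one incoming message
    return f"assistant: {data}"

async def websocket_style_roundtrip(messages: list[str]) -> list[str]:
    # Mimics a receive/reply loop without opening a real socket
    replies = []
    for data in messages:
        replies.append(await process_message(data))
    return replies

print(asyncio.run(websocket_style_roundtrip(["hello"])))  # ['assistant: hello']

3. Real-Time RAG (90 minutes)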
File: 04_real_time_rag.ipynb
Topics:
- Streaming search results
- Progressive context loading
- Incremental vector search
- Streaming summarization
- Real-time document processing
- Hybrid search streaming
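One way to read "streaming search results" and "hybrid search streaming" from the list above: start several retrievers at once and forward each result set as soon as it is ready, instead of waiting for the slowest one. The sketch below assumes two hypothetical retrievers, `keyword_search` and `vector_search`, simulated with sleeps; the event shape is also an assumption.

```python
import asyncio
from typing import AsyncIterator


async def keyword_search(query: str) -> list:
    await asyncio.sleep(0.01)  # stand-in for a fast lexical index
    return [f"keyword hit for {query!r}"]


async def vector_search(query: str) -> list:
    await asyncio.sleep(0.05)  # stand-in for a slower embedding lookup
    return [f"vector hit for {query!r}"]


async def stream_hybrid_results(query: str) -> AsyncIterator[dict]:
    """Yield one 'sources' event per retriever, in completion order."""
    tasks = {
        asyncio.create_task(keyword_search(query)): "keyword",
        asyncio.create_task(vector_search(query)): "vector",
    }
    pending = set(tasks)
    while pending:
        # Wake up as soon as any retriever finishes.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            yield {
                "type": "sources",
                "retriever": tasks[task],
                "data": task.result(),
            }
```

The client can render the fast keyword hits immediately and merge the vector hits when they arrive, which shortens the time before the user sees anything.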
4. Production Streaming (120 minutes)
File: 05_production_streaming.ipynb
Topics:
- Load balancing streaming connections
- Connection pooling
- Rate limiting
- Backpressure handling
- Monitoring and metrics
- Error recovery
- Scaling strategies
Production Considerations:
- Connection limits
- Timeout management
- Memory management
- Graceful degradation
- Observability
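Backpressure and memory management from the lists above are closely linked: an unbounded buffer between a fast model and a slow client grows until memory runs out. A minimal sketch, assuming an in-process pipeline, uses a bounded `asyncio.Queue` so the producer is suspended whenever the consumer falls behind. `run_pipeline` and its parameters are illustrative names.

```python
import asyncio


async def run_pipeline(chunks: list, maxsize: int = 2, delay_s: float = 0.001):
    """Return (received chunks, deepest queue size observed)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    depths: list = []

    async def producer() -> None:
        for chunk in chunks:
            # put() suspends while the queue is full, which slows the
            # producer to the consumer's pace instead of buffering freely.
            await queue.put(chunk)
            depths.append(queue.qsize())
        await queue.put(None)  # sentinel: no more chunks

    async def slow_consumer() -> list:
        received = []
        while True:
            chunk = await queue.get()
            if chunk is None:
                return received
            await asyncio.sleep(delay_s)  # stand-in for a slow client socket
            received.append(chunk)

    _, received = await asyncio.gather(producer(), slow_consumer())
    return received, max(depths, default=0)
```

The observed depth never exceeds `maxsize`, so memory per connection stays bounded no matter how slow the client is.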
5. Realtime Voice and Multimodal Patterns (90 minutes)
File: 06_realtime_voice_multimodal.ipynb
Topics:
- Turn-taking state machines
- Interruption and cancellation semantics
- Audio frame chunking
- WebRTC vs WebSocket transport choices
- Live session state for voice and multimodal assistants
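The turn-taking and interruption topics above can be modelled as a small state machine. The sketch below is one possible design, not the notebook's implementation: the state names, event names, and the `VoiceSession` class are all assumptions. User speech that starts while the assistant is thinking or speaking (a barge-in) cancels the reply and returns the session to listening.

```python
from enum import Enum


class Turn(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


TRANSITIONS = {
    (Turn.IDLE, "speech_start"): Turn.LISTENING,
    (Turn.LISTENING, "speech_end"): Turn.THINKING,
    (Turn.THINKING, "first_audio"): Turn.SPEAKING,
    (Turn.SPEAKING, "playback_done"): Turn.IDLE,
    # Barge-in: the user talks over the assistant, so the reply is abandoned.
    (Turn.THINKING, "speech_start"): Turn.LISTENING,
    (Turn.SPEAKING, "speech_start"): Turn.LISTENING,
}


class VoiceSession:
    def __init__(self) -> None:
        self.state = Turn.IDLE
        self.cancelled_replies = 0

    def handle(self, event: str) -> Turn:
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            # Events that make no sense in the current state are ignored.
            return self.state
        if event == "speech_start" and self.state in (Turn.THINKING, Turn.SPEAKING):
            # A real system would also cancel generation and flush queued audio.
            self.cancelled_replies += 1
        self.state = next_state
        return self.state
```

Keeping the transitions in a table makes it easy to check that every interruption path ends in a state where the microphone is being read again.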
Technical Stack
Backend:
- FastAPI
- OpenAI Python SDK
- WebSockets library
- asyncio
Frontend:
- HTML/CSS/JavaScript
- EventSource API
- WebSocket API
- React (optional)
Infrastructure:
- Nginx (reverse proxy)
- Redis (connection management)
- Prometheus (monitoring)
- Docker
- WebRTC / LiveKit-style realtime media transport
2026 Realtime Topics To Know
- Realtime APIs for voice and multimodal assistants
- Turn-taking, interruption, and low-latency audio streaming
- WebRTC for browser-to-browser media and live copilot experiences
- Disaggregated retrieval + generation pipelines to keep end-to-end latency low
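For the low-latency audio streaming topic above, audio is usually sent in short fixed-duration frames. The sketch below works through the arithmetic for raw PCM, assuming 16-bit samples by default. The function names are illustrative, and the 20 ms frame length in the usage note is a common choice rather than a requirement of any particular API.

```python
def frame_size_bytes(
    sample_rate_hz: int,
    frame_ms: int,
    channels: int = 1,
    sample_width: int = 2,
) -> int:
    """Bytes in one PCM frame: samples per frame x channels x bytes per sample."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * channels * sample_width


def chunk_pcm(pcm: bytes, frame_bytes: int) -> list:
    """Split raw PCM into equal frames, zero-padding the final short frame."""
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            # Zero bytes are silence in signed 16-bit PCM.
            frame = frame + b"\x00" * (frame_bytes - len(frame))
        frames.append(frame)
    return frames
```

For example, 16 kHz mono 16-bit audio in 20 ms frames gives 320 samples, or 640 bytes, per frame. Shorter frames lower the latency added by buffering but increase per-message overhead.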
Best Practices
Performance
- Use connection pooling
- Implement backpressure
- Buffer appropriately
- Monitor latency
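Monitoring latency for a stream usually means tracking two numbers: time to first token (TTFT) and output rate after the first chunk. The sketch below measures both over any iterable of text deltas. `fake_stream` is a hypothetical stand-in for a model stream, and the rate is reported in characters per second because counting tokens would need a tokenizer.

```python
import time
from typing import Iterable, Iterator


def fake_stream(chunks: list, delay_s: float = 0.005) -> Iterator[str]:
    """Simulated model stream that pauses before each chunk."""
    for chunk in chunks:
        time.sleep(delay_s)
        yield chunk


def measure_stream(deltas: Iterable[str]) -> dict:
    """Consume a stream and report TTFT plus output rate."""
    start = time.perf_counter()
    first = None
    chunks = 0
    chars = 0
    for delta in deltas:
        if first is None:
            first = time.perf_counter()  # first chunk seen by the caller
        chunks += 1
        chars += len(delta)
    end = time.perf_counter()
    ttft = None if first is None else first - start
    generation_s = 0.0 if first is None else end - first
    return {
        "ttft_s": ttft,
        "chunks": chunks,
        "chars": chars,
        # Rate is measured after the first chunk so TTFT does not skew it.
        "chars_per_s": chars / generation_s if generation_s > 0 else None,
    }
```

Calling `measure_stream(fake_stream(["Hel", "lo", " world"]))` returns a dictionary with three chunks and eleven characters; the timing fields depend on the machine.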
Reliability
- Handle disconnections gracefully
- Implement retry logic
- Timeout management
- Circuit breakers
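A circuit breaker, the last item above, stops calling an upstream service after repeated failures and fails fast until a cool-down has passed. The sketch below is a minimal synchronous version with an injectable clock. The class name, thresholds, and the single-trial "half-open" behavior are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("upstream marked unhealthy; failing fast")
            # Cool-down elapsed: let this one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In a streaming service this keeps a failing model endpoint from tying up connections that are waiting on timeouts.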
Security
- Rate limiting per user
- Input validation
- Authentication tokens
- CORS configuration
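Per-user rate limiting, the first item above, is often implemented as a token bucket: each user gets a small burst allowance that refills at a steady rate. The sketch below is an in-memory, single-process version. `TokenBucketLimiter` and its parameters are illustrative names, and a shared store would be needed once several server processes are involved.

```python
import time


class TokenBucketLimiter:
    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic) -> None:
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock
        self._buckets: dict = {}  # user_id -> (tokens, last_seen)

    def allow(self, user_id: str) -> bool:
        """Return True and spend one token if the user still has budget."""
        now = self._clock()
        # Unknown users start with a full bucket.
        tokens, last_seen = self._buckets.get(user_id, (float(self._burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self._burst), tokens + (now - last_seen) * self._rate)
        if tokens >= 1.0:
            self._buckets[user_id] = (tokens - 1.0, now)
            return True
        self._buckets[user_id] = (tokens, now)
        return False
```

For streaming endpoints the check typically runs once when the stream is opened, since a single request can hold a connection for a long time.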
User Experience
- Loading indicators
- Smooth animations
- Error messages
- Offline support
Common Patterns
Pattern 1: Simple SSE Streaming
def sse_stream(client, prompt: str):
    # Generator that yields one SSE frame per text delta
    for event in client.responses.create(model="gpt-4.1-mini", input=prompt, stream=True):
        if event.type == "response.output_text.delta":
            yield f"data: {event.delta}\n\n"

Pattern 2: WebSocket with Heartbeat
import asyncio

async def heartbeat(websocket):
    # Send an application-level ping every 30 seconds to keep idle connections alive
    while True:
        await asyncio.sleep(30)
        await websocket.send_json({"type": "ping"})

Pattern 3: Streaming RAG
async def streaming_rag(query):
    # Search
    docs = await vector_search(query)
    yield {"type": "sources", "data": docs}
    # Generate
    async for chunk in llm_generate(query, docs):
        yield {"type": "text", "data": chunk}

Real-World Examples
- ChatGPT-style Interface
  - Streaming responses
  - Typing indicators
  - Stop generation
  - Copy/retry
- Live Document Q&A
  - Upload and index
  - Real-time search
  - Streaming answers
  - Source citations
- Multi-User Chat
  - WebSocket rooms
  - Broadcast messages
  - User presence
  - Typing indicators
Resources
Libraries
- fastapi - Modern Python web framework
- websockets - WebSocket client/server
- sse-starlette - SSE for Starlette/FastAPI
- httpx - Async HTTP client
Tools
- Postman - API testing with WebSocket support
- k6 - Load testing
- WebSocket King - WebSocket client tester
Troubleshooting
Issue: Stream stops unexpectedly
Solution: Check timeout settings, implement heartbeat
Issue: High latency
Solution: Optimize chunk size, reduce buffering, check network
Issue: Connection drops
Solution: Implement reconnection logic, use exponential backoff
Issue: Memory leaks
Solution: Close connections properly, clean up event listeners
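The reconnection advice for dropped connections above can be sketched as exponential backoff with jitter. The example assumes a hypothetical async `connect` callable that raises `ConnectionError` on failure. The "full jitter" variant shown, where each delay is drawn uniformly between zero and the capped exponential value, is one common choice among several.

```python
import asyncio
import random


def backoff_delays(base_s: float = 0.5, cap_s: float = 30.0,
                   attempts: int = 6, rng=None) -> list:
    """Delays drawn uniformly from [0, min(cap, base * 2**n)] for each retry n."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap_s, base_s * 2 ** n)) for n in range(attempts)]


async def connect_with_retry(connect, delays: list, sleep=asyncio.sleep):
    """Try once immediately, then once after each delay; re-raise if all fail."""
    last_error = None
    for delay in [0.0, *delays]:
        if delay:
            await sleep(delay)
        try:
            return await connect()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and back off
    raise ConnectionError("gave up after all retries") from last_error
```

Jitter spreads reconnect attempts out so that many clients dropped at the same moment do not all return at once.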
What Comes Next
After completing this phase:
- Review Phase 19 (AI Safety) for securing streaming apps
- Explore Phase 15 (AI Agents) for multi-agent streaming
- Check Phase 18 (Low-Code) for Gradio/Streamlit streaming
- Build your own production streaming application
Time Estimates
- Total Duration: 8 hours
- Notebooks: 6-7 hours
- Assignment: 4-6 hours
- Challenges: 6-8 hours
- Total with Practice: 16-20 hours
Success Criteria
- ✅ Implement SSE and WebSocket endpoints
- ✅ Build real-time chat interface
- ✅ Create streaming RAG pipeline
- ✅ Handle 100+ concurrent connections
- ✅ Deploy production streaming app
- ✅ Monitor and optimize performance
Note: This is a foundational module for building modern AI applications. Master these concepts to create responsive, real-time user experiences.