Audio & Speech

Speech recognition with Whisper and text-to-speech generation.

Use this subtrack when you want voice interfaces, transcription pipelines, and speech-driven assistants. It is best treated as a practical systems branch rather than a deep speech-research path.

How To Use This Subtrack Well

Start with speech-to-text and text-to-speech before tackling full-duplex voice systems.
Measure end-to-end latency and transcription accuracy (for example, word error rate) on your own audio, not just model quality on benchmarks.
Pair this work with ../../20-real-time-streaming/README.md if you want conversational audio products.
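
The measurement advice above can be made concrete with a small harness. The following is a minimal sketch, not a reference implementation: it computes word error rate as word-level edit distance divided by the reference length, and times a single transcription call. The names `word_error_rate`, `timed_transcribe`, and `fake_transcribe`, as well as the file name `sample.wav`, are hypothetical and used only for illustration; `fake_transcribe` is a stand-in that you would replace with a call into your actual speech-to-text model.

```python
import time
from typing import Callable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def timed_transcribe(transcribe: Callable[[str], str], audio_path: str) -> Tuple[str, float]:
    """Run one transcription call and return (text, wall-clock seconds)."""
    start = time.perf_counter()
    text = transcribe(audio_path)
    return text, time.perf_counter() - start


def fake_transcribe(audio_path: str) -> str:
    # Hypothetical stand-in for a real speech-to-text call; it ignores the path.
    return "turn on the kitchen light"


if __name__ == "__main__":
    text, seconds = timed_transcribe(fake_transcribe, "sample.wav")
    wer = word_error_rate("turn on the kitchen lights", text)
    print(f"latency={seconds:.4f}s wer={wer:.2f}")
```

This sketch lowercases and splits on whitespace only. Real evaluations usually also normalize punctuation and numerals before scoring, and report latency as a distribution over many clips rather than a single call.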

What Comes Next

Continue to ../README.md for the broader multimodal roadmap.
Continue to ../../20-real-time-streaming/README.md for realtime transport and interaction patterns.
Continue to ../../15-ai-agents/README.md if you want tool-using voice agents.

Last updated on May 24, 2026
