Audio & Speech
Speech recognition with Whisper and text-to-speech generation.
Use this subtrack when you want voice interfaces, transcription pipelines, and speech-driven assistants. It is best treated as a practical systems branch rather than a deep speech-research path.
How To Use This Subtrack Well
- Start with speech-to-text and text-to-speech before tackling full duplex voice systems.
- Measure end-to-end latency and transcription accuracy (for example, word error rate) on your own audio, rather than relying on published model benchmarks alone.
- Pair this work with ../../20-real-time-streaming/README.md if you want conversational audio products.
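As a rough illustration of the measurement advice above, the sketch below computes word error rate (WER) and wall-clock latency for a single transcription call. It is a minimal example under stated assumptions, not a reference implementation: `word_error_rate`, `timed_transcribe`, and `fake_transcribe` are hypothetical names introduced here, and the stub stands in for whatever speech-to-text backend you actually use (a Whisper wrapper, for instance). Text normalization is limited to lowercasing and whitespace splitting, which is simpler than what production evaluations typically apply.

```python
import time
from typing import Callable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def timed_transcribe(
    transcribe: Callable[[str], str], audio_path: str
) -> Tuple[str, float]:
    """Run a transcription callable and return (text, seconds elapsed)."""
    start = time.perf_counter()
    text = transcribe(audio_path)
    return text, time.perf_counter() - start


def fake_transcribe(audio_path: str) -> str:
    # Hypothetical stand-in for a real speech-to-text call.
    return "the cat sat on mat"


if __name__ == "__main__":
    # "sample.wav" is a placeholder path; the stub never opens it.
    text, seconds = timed_transcribe(fake_transcribe, "sample.wav")
    wer = word_error_rate("the cat sat on the mat", text)
    print(f"WER: {wer:.3f}, latency: {seconds * 1000:.1f} ms")
```

Passing the transcription function in as a callable keeps the measurement code independent of any particular model, so the same harness can compare backends or model sizes on the same audio set.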
What Comes Next
- Continue to ../README.md for the broader multimodal roadmap.
- Continue to ../../20-real-time-streaming/README.md for real-time transport and interaction patterns.
- Continue to ../../15-ai-agents/README.md if you want tool-using voice agents.