Vision-Language Models

CLIP, vision-language models, and multimodal RAG pipelines.

Use this subtrack when you want to build image-text retrieval, image understanding, or multimodal search systems. It is the natural follow-on if you care about visual embeddings, document understanding, or image-grounded assistants.

How To Use This Subtrack Well

Start with CLIP-style embeddings before building multimodal RAG systems.
Evaluate retrieval quality and grounding, not just the quality of the generated answers.
Pair this work with ../../08-rag/README.md if you want stronger retrieval-system intuition.
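
To make the advice above concrete, here is a minimal sketch of how CLIP-style retrieval can be scored separately from answer generation. It assumes text and image embeddings already live in one shared vector space, which is what a CLIP-style encoder produces. The toy 3-dimensional vectors, the file names, and the helper names (cosine, rank_images, recall_at_k) are all hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_vec, image_vecs):
    """Return image ids sorted from most to least similar to the text query."""
    scored = [(cosine(text_vec, vec), image_id) for image_id, vec in image_vecs.items()]
    return [image_id for _, image_id in sorted(scored, reverse=True)]

def recall_at_k(queries, image_vecs, k):
    """Fraction of queries whose relevant image appears in the top k results."""
    hits = sum(
        1
        for text_vec, relevant_id in queries
        if relevant_id in rank_images(text_vec, image_vecs)[:k]
    )
    return hits / len(queries)

# Hypothetical embeddings: placeholders for real image-encoder outputs.
image_vecs = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "invoice.png": [0.0, 0.1, 0.9],
}

# Each query pairs a hypothetical text embedding with the id of its relevant image.
queries = [
    ([1.0, 0.0, 0.0], "cat.jpg"),
    ([0.0, 1.0, 0.1], "dog.jpg"),
    ([0.5, 0.6, 0.0], "cat.jpg"),  # ambiguous query that lands closer to dog.jpg
]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"recall@{k} = {recall_at_k(queries, image_vecs, k):.2f}")
```

Tracking a retrieval metric such as recall@k alongside answer quality helps separate retrieval failures from generation failures: if the relevant image never reaches the top k, no amount of prompt tuning downstream will ground the answer.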

What Comes Next

Continue to ../README.md for the broader multimodal roadmap.
Continue to ../../10-specializations/computer-vision/README.md if you want deeper vision-specific work.
Continue to ../../20-real-time-streaming/README.md if you want live multimodal interaction patterns.

Last updated on May 24, 2026
