Skip to Content
09 MLOps

MLOps

Goal: Learn to deploy, monitor, and maintain ML models as production systems. This is what separates a data scientist from a machine learning engineer.

This phase is one of the main bridges from learning notebooks to building production-ready systems. If you want hiring leverage, this is one of the most important folders in the repo.


Why MLOps Matters for Your Career

80% of ML projects never reach production. The ones that do succeed because of solid MLOps practices. Employers specifically look for:

  • Can you deploy a model beyond a Jupyter notebook?
  • Can you reproduce an experiment from 3 months ago?
  • Do you know how to detect when a model starts degrading?
  • Can you build a CI/CD pipeline for ML?

MLOps is consistently one of the top hiring criteria for ML Engineer roles.


Notebooks - Work in This Order

#NotebookWhat You LearnTime
101_START_HERE.ipynbMLOps overview and the full lifecycle30 min
202_experiment_tracking.ipynbMLflow: log metrics, params, artifacts60 min
303_fastapi_basics.ipynbBuild REST API endpoints for model serving60 min
404_model_deployment.ipynbPackage and deploy a model end-to-end90 min
505_docker_ml.ipynbContainerize ML models with Docker90 min
606_monitoring.ipynbDetect data drift and model degradation60 min
707_ci_cd_pipeline.ipynbGitHub Actions for automated ML testing60 min
808_cloud_deployment.ipynbDeploy to AWS/GCP/Azure90 min
909_llm_infrastructure.ipynbvLLM, TGI, and LLM serving at scale60 min

How To Use This Phase Well

  • Complete at least one full path: experiment tracking -> serving -> containerization -> monitoring.
  • Build one deployable project instead of only reading notebook examples.
  • Treat monitoring and reproducibility as part of the product, not as post-launch cleanup.
  • Pair this phase with a model-building phase you already care about so the operational work stays concrete.

Key Concepts

The ML Lifecycle (What MLOps Manages)

Experiment Tracking (MLflow)

Every training run should be tracked. Track:

  • Parameters: learning rate, batch size, model architecture choices
  • Metrics: loss, accuracy, F1, AUC - over time, not just final values
  • Artifacts: the trained model file, tokenizer, feature scaler
  • Environment: Python version, library versions (requirements.txt)

MLflow quick start:

import mlflow mlflow.start_run() mlflow.log_param("learning_rate", 0.001) mlflow.log_metric("accuracy", 0.94) mlflow.log_artifact("model.pkl") mlflow.end_run()

Model Serving Patterns

PatternToolWhen to Use
REST APIFastAPIStandard models, <100ms latency needed
Batch inferenceCelery/RayLarge datasets, overnight jobs
StreamingvLLM + SSELLM text generation
Managed foundation model APIBedrock / Vertex AI / Azure AI FoundryFastest path to production without running GPUs
GPU inference serverTriton / vLLM / TGIHigh-throughput production serving
Edge deploymentONNX RuntimeMobile/embedded devices

The MLOps Stack (What to Learn)

CategoryToolPriority
Experiment trackingMLflow or W&BMust know
Model servingFastAPIMust know
ContainerizationDockerMust know
CI/CDGitHub ActionsMust know
MonitoringPrometheus + GrafanaKnow basics
LLM servingvLLMKnow if doing LLM work
OrchestrationKubeflow / AirflowNice to have
Cloud MLSageMaker / Azure ML / Vertex AINice to have

Deployment Matrix: AWS, Azure, Google, and Open Source

Different deployment targets solve different problems. ONNX Runtime is a runtime and model format choice. Bedrock, Vertex AI, and Azure AI Foundry are managed platforms. vLLM, TGI, Triton, Ollama, and llama.cpp are open-source serving stacks.

NeedAWSAzureGoogle CloudOpen SourceBest Fit
Managed LLM APIBedrockAzure AI Foundry / Azure OpenAIVertex AI GeminiOpenAI-compatible gateway over hosted OSS is possible, but not truly managedTeams that want minimal infra
Train and deploy custom ML modelSageMakerAzure MLVertex AIFastAPI + Docker + KubernetesClassical ML and custom DL models
Self-host open-weight LLMs on GPUEKS/ECS + vLLM or TGIAKS + vLLM or TGIGKE + vLLM or TGIvLLM / TGI / Triton / SGLangHigh-volume LLM inference
Multi-model inference serverSageMaker endpoints or ECS/EKS + TritonAzure ML managed endpoints or AKS + TritonVertex endpoints or GKE + TritonTriton Inference ServerMixed PyTorch / TensorRT / ONNX workloads
Edge or mobile deploymentGreengrass + ONNX RuntimeAzure IoT Edge + ONNX RuntimeEdge TPU / Vertex Edge + ONNX RuntimeONNX Runtime / TensorFlow Lite / llama.cppLow-latency local inference
Local developer workflowBedrock local emulation is limitedAzure-hosted onlyVertex-hosted onlyOllama / llama.cpp / LM StudioFast iteration and privacy

How to Choose a Deployment Path

  1. Use ONNX Runtime when you own the model artifact and want portable, optimized inference across CPU, GPU, and edge devices.
  2. Use Bedrock / Azure AI Foundry / Vertex AI when you want managed foundation model access and do not want to run your own inference cluster.
  3. Use SageMaker / Azure ML / Vertex AI custom endpoints when you need managed training plus deployment for your own models.
  4. Use vLLM / TGI / Triton / SGLang when you want open-source control, custom batching, lower cost at scale, or open-weight LLM hosting.
  5. Use Ollama or llama.cpp for local development, offline demos, CPU-friendly inference, or privacy-sensitive prototyping.

Practical Defaults

ScenarioRecommended Path
MVP chatbot with lowest ops burdenBedrock, Azure AI Foundry, Vertex AI, or OpenAI/Anthropic API
Enterprise app with strict cloud standardMatch the platform to your cloud: Bedrock, Azure AI Foundry, or Vertex AI
Open-weight LLM in productionvLLM or TGI on Kubernetes / cloud GPU
Mixed model fleet with TensorRT/ONNX/PyTorchTriton Inference Server
Mobile / embedded / offlineONNX Runtime or TensorFlow Lite
Local-first developmentOllama or llama.cpp

Docker for ML - The Essential Pattern

FROM python:3.11-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . EXPOSE 8000 CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Model Monitoring - What to Watch

  • Data drift: Input feature distributions shift from training distribution
  • Concept drift: The relationship between features and labels changes
  • Performance degradation: Accuracy/F1 drops on recent data
  • Latency: Response time increases (often due to memory pressure)
  • Error rates: HTTP 5xx errors in your API

LLM Infrastructure (09_llm_infrastructure.ipynb)

This newer notebook covers production LLM serving:

  • vLLM: PagedAttention for high-throughput LLM inference (10-30x faster than naive serving)
  • TGI (Text Generation Inference): HuggingFace’s production LLM server
  • Ollama: Easy local LLM serving with OpenAI-compatible API
  • llama.cpp: CPU inference for quantized models

When to use what:

ScenarioTool
Local developmentOllama
Production, high throughputvLLM
HuggingFace models in prodTGI
CPU-only inferencellama.cpp
Managed cloud FM accessBedrock / Azure AI Foundry / Vertex AI

Bedrock vs ONNX vs vLLM in One Sentence

  • Bedrock: managed API for foundation models.
  • ONNX Runtime: portable runtime for your own exported model.
  • vLLM: high-throughput open-source LLM server for self-hosting.

Practice Projects (Put These on GitHub)

Project 1: Model API with Full MLOps

  • Train any classifier (e.g., sentiment analysis)
  • Track experiment with MLflow
  • Serve with FastAPI
  • Containerize with Docker
  • Add GitHub Actions to run tests on every push

Project 2: LLM Serving Setup

  • Set up vLLM with a small model (Qwen2.5-1.5B)
  • Create OpenAI-compatible endpoints
  • Load test with Locust
  • Monitor with basic Prometheus metrics

Project 3: Model Monitoring Pipeline

  • Deploy a model
  • Generate artificial drift in incoming data
  • Detect and alert on drift
  • Trigger retraining pipeline

Interview Questions for MLOps

  1. How do you detect data drift? What would you do when you detect it?
  2. What’s the difference between a model registry and an artifact store?
  3. How does vLLM’s PagedAttention improve throughput?
  4. Walk me through how you’d deploy a new model version with zero downtime.
  5. What’s the difference between online and batch inference? When would you use each?

External Resources

ResourceTypeLink
Made With MLFree Coursehttps://madewithml.com 
Full Stack Deep LearningFree Coursehttps://fullstackdeeplearning.com 
MLflow DocsDocshttps://mlflow.org/docs/latest/index.html 
vLLM DocsDocshttps://docs.vllm.ai 
FastAPI DocsDocshttps://fastapi.tiangolo.com 
mlflow/mlflowGitHubhttps://github.com/mlflow/mlflow 
vllm-project/vllmGitHubhttps://github.com/vllm-project/vllm 

What to Learn Next

After MLOps, choose your specialization path:

Last updated on