
AI Toolkit for VS Code

The AI Toolkit extension (formerly Windows AI Studio) is Microsoft’s VS Code extension for browsing, downloading, running, fine-tuning, and evaluating local models - all from the editor.


1. What AI Toolkit Does

| Capability | Details |
| --- | --- |
| Model catalog | Browse models from Hugging Face and Azure AI Foundry directly in the sidebar |
| Local playground | Chat with downloaded models in an interactive panel inside VS Code |
| ONNX Runtime | Run models locally via ONNX Runtime GenAI (CPU, CUDA, DirectML, Apple Silicon) |
| Fine-tuning | QLoRA fine-tuning with a guided UI - dataset prep, hyperparameters, training |
| Evaluation | Run promptflow-evals evaluators and view results in the extension |
| Model conversion | Convert Hugging Face models to ONNX format for local inference |
| Multi-runtime | Supports ONNX, GGUF (via llama.cpp), and cloud-hosted endpoints |

2. Installation

  1. Open VS Code
  2. Go to Extensions (Cmd+Shift+X on macOS, Ctrl+Shift+X on Windows/Linux)
  3. Search for “AI Toolkit” (publisher: Microsoft)
  4. Click Install

The extension adds an AI Toolkit icon to the Activity Bar (left sidebar).


3. Browsing and Downloading Models

From the Model Catalog

  1. Click the AI Toolkit icon in the sidebar
  2. Browse Popular Models or search by name
  3. Click a model card to see details: size, architecture, quantization options
  4. Click Download to pull the model locally

Where Models Are Stored

Models are downloaded to the AI Toolkit working directory:

| Platform | Path |
| --- | --- |
| macOS/Linux | ~/.aitk/models/ |
| Windows | %USERPROFILE%\.aitk\models\ |

Directory structure follows a 4-layer convention:

~/.aitk/models/{publisher}/{model-name}/{runtime}/{display-name}

Example:

~/.aitk/models/microsoft/Phi-4-mini/cpu/phi4-mini-int4
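
The four-layer convention can be sketched as a pair of small helpers: one that builds a model directory from its parts and one that splits an existing path back into them. This is only an illustration of the layout described above; the function names `model_dir` and `split_model_dir` are hypothetical and not part of AI Toolkit.

```python
from pathlib import Path

# Root of the AI Toolkit model store, as listed in the table above.
MODELS_ROOT = Path.home() / ".aitk" / "models"


def model_dir(publisher: str, model_name: str, runtime: str, display_name: str,
              root: Path = MODELS_ROOT) -> Path:
    """Build {root}/{publisher}/{model-name}/{runtime}/{display-name}."""
    parts = (publisher, model_name, runtime, display_name)
    for part in parts:
        # Each layer must be a single, non-empty path component.
        if not part or "/" in part or "\\" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return root.joinpath(*parts)


def split_model_dir(path: Path, root: Path = MODELS_ROOT) -> dict:
    """Return the four layers of a model directory, or raise if the depth is wrong."""
    layers = path.relative_to(root).parts  # raises ValueError if path is outside root
    if len(layers) != 4:
        raise ValueError(f"expected 4 layers under {root}, found {len(layers)}")
    keys = ("publisher", "model_name", "runtime", "display_name")
    return dict(zip(keys, layers))
```

Calling `model_dir("microsoft", "Phi-4-mini", "cpu", "phi4-mini-int4")` reproduces the example path, and `split_model_dir` raises if a directory sits at any other depth.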

Model Formats

| Format | Runtime | When to use |
| --- | --- | --- |
| ONNX | ONNX Runtime GenAI | Best integration with AI Toolkit, broadest hardware support |
| GGUF | llama.cpp | Already have GGUF models from Ollama or LM Studio |
| Cloud | API endpoint | Model too large for local hardware |

4. Using the Local Playground

Once a model is downloaded:

  1. Select it in the AI Toolkit sidebar
  2. Click Load in Playground
  3. Chat with the model in the interactive panel

Playground Features

  • System prompt: Set a custom system message
  • Temperature / Top-p: Adjust generation parameters
  • Token limit: Control response length
  • Multi-turn: Maintains conversation context

Playground vs. Ollama

| Feature | AI Toolkit Playground | Ollama |
| --- | --- | --- |
| UI | Built into VS Code | Terminal or separate UI |
| Format | ONNX (primary) | GGUF |
| Runtime | ONNX Runtime GenAI | llama.cpp |
| API | Not exposed by default | OpenAI-compatible API |
| Fine-tuning | Built-in UI | Requires separate workflow |
| Model catalog | Integrated in sidebar | ollama pull <model> |

When to use each: choose Ollama if you want a local OpenAI-compatible API (see 04_llm_server_and_api.ipynb); choose AI Toolkit if you want a GUI-first experience with fine-tuning and evaluation built in.
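
For the API-first route, a minimal sketch of a chat request against Ollama's OpenAI-compatible endpoint might look like the following. It assumes Ollama is running on its default address (`http://localhost:11434`) and that the model name you pass has already been pulled; the helper names are hypothetical. The parameters mirror the playground settings listed above (system prompt, temperature, top-p, token limit).

```python
import json
import urllib.request

# Assumption: Ollama's default local address and its OpenAI-compatible chat route.
OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(model, messages, temperature=0.7, top_p=0.9,
                       max_tokens=256, url=OLLAMA_CHAT_URL):
    """Build (but do not send) a POST request in the OpenAI chat-completions shape."""
    payload = {
        "model": model,
        "messages": messages,       # full history, which is what keeps multi-turn context
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,   # caps the response length
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages, **options):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, messages, **options)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

To continue a conversation, append the returned reply to `messages` as an `assistant` message, add the next `user` message, and call `chat` again.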


5. Model Conversion to ONNX

To run a Hugging Face model in AI Toolkit, convert it to ONNX format:

Setup

conda create -n model_builder python==3.11 -y
conda activate model_builder
pip install onnx torch onnxruntime_genai transformers

Convert

python -m onnxruntime_genai.models.builder \
    -m /path/to/hf/model \
    -p int4 \
    -e cpu \
    -o ~/.aitk/models/publisher/model-name/cpu/display-name \
    -c /tmp/conversion-cache \
    --extra_options include_prompt_templates=1

Precision × Runtime Combinations

| Precision | Runtime | Use case |
| --- | --- | --- |
| INT4 | CPU | Laptops, low memory |
| FP16 | CUDA | NVIDIA GPUs |
| FP16 | DirectML | AMD/Intel GPUs on Windows |
| FP32 | CPU | Maximum accuracy, slow |
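
If you convert models often, the command can be assembled programmatically so that the output directory always follows the four-layer layout. The sketch below uses only the flags shown in the Convert step. The helper name `builder_command` is hypothetical, the check is limited to the combinations listed in the table (other pairs may well be valid), and `dml` as the `-e` value for DirectML is an assumption you should confirm against the builder's `--help` output.

```python
import shlex
from pathlib import Path

# Precision/runtime pairs from the table above. "dml" as the flag value for
# DirectML is an assumption; other combinations may also be supported.
DOCUMENTED_COMBINATIONS = {
    ("int4", "cpu"),
    ("fp16", "cuda"),
    ("fp16", "dml"),
    ("fp32", "cpu"),
}


def builder_command(hf_model, publisher, model_name, display_name,
                    precision="int4", runtime="cpu",
                    cache_dir="/tmp/conversion-cache", models_root=None):
    """Return the conversion command as an argument list (nothing is executed)."""
    if (precision, runtime) not in DOCUMENTED_COMBINATIONS:
        raise ValueError(f"combination not in the table: {precision}/{runtime}")
    root = Path(models_root) if models_root else Path.home() / ".aitk" / "models"
    # Output path follows {publisher}/{model-name}/{runtime}/{display-name}.
    output = root / publisher / model_name / runtime / display_name
    return [
        "python", "-m", "onnxruntime_genai.models.builder",
        "-m", str(hf_model),
        "-p", precision,
        "-e", runtime,
        "-o", str(output),
        "-c", str(cache_dir),
        "--extra_options", "include_prompt_templates=1",
    ]


if __name__ == "__main__":
    cmd = builder_command("/path/to/hf/model", "publisher", "model-name", "display-name")
    print(shlex.join(cmd))  # inspect the command before running it
```

Pass the returned list to `subprocess.run(cmd, check=True)` inside the `model_builder` environment once the printed command looks right.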

6. Fine-Tuning with QLoRA

AI Toolkit provides a guided fine-tuning workflow directly in VS Code.

Steps

  1. Select a model in the sidebar → Fine-tune
  2. Prepare dataset: Upload JSONL training data
  3. Configure: Set LoRA rank, learning rate, epochs, batch size
  4. Train: Runs locally using your GPU (or CPU with longer times)
  5. Evaluate: Compare base model vs. fine-tuned model side by side
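
To get a feel for what the LoRA rank setting controls, the arithmetic below follows the standard LoRA formulation: a frozen weight matrix of shape d_out x d_in gains two small trainable matrices, B (d_out x r) and A (r x d_in). It is a back-of-the-envelope sketch, not how AI Toolkit reports parameter counts, and the 4096 x 4096 layer size is a hypothetical example.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_out x d_in weight matrix."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    # B has d_out * rank entries, A has rank * d_in entries.
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters as a fraction of the frozen matrix's parameters."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with rank 16.
    print(lora_trainable_params(4096, 4096, 16))  # 131072
    print(trainable_fraction(4096, 4096, 16))     # 0.0078125
```

Doubling the rank doubles the trainable parameters per adapted layer, which is why lower ranks help when memory is tight.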

Dataset Format

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is RAG?"}, {"role": "assistant", "content": "RAG stands for Retrieval-Augmented Generation..."}]}
{"messages": [{"role": "user", "content": "Explain embeddings"}, {"role": "assistant", "content": "Embeddings are dense vector representations..."}]}
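
Malformed lines are an easy way to lose a training run, so it can help to check the file before uploading it. The sketch below validates the shape shown above: one JSON object per line with a non-empty `messages` list of role/content pairs. The rule that each example should end with an `assistant` message is an assumption based on the samples, and `validate_chat_jsonl` is a hypothetical helper rather than an AI Toolkit function.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) pairs; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, 'missing or empty "messages" list'))
            continue
        for message in messages:
            if (not isinstance(message, dict)
                    or message.get("role") not in ALLOWED_ROLES
                    or not isinstance(message.get("content"), str)):
                problems.append((number, "each message needs a known role and string content"))
                break
        else:
            # Runs only when every message passed. Assumption: the final turn is the target reply.
            if messages[-1]["role"] != "assistant":
                problems.append((number, "last message should be the assistant reply"))
    return problems
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.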

When to Fine-Tune Locally

| Scenario | Fine-tune locally? |
| --- | --- |
| Small model (< 7B params) | Yes |
| Large model (> 13B params) | Only with a strong GPU (24GB+ VRAM) |
| Quick experiment / proof of concept | Yes |
| Production training | Use cloud (Azure AI Foundry) |
| Sensitive data that can’t leave the machine | Yes - main advantage |

7. Evaluation

AI Toolkit integrates with promptflow-evals to evaluate model quality.

Built-in Evaluators

| Evaluator | Measures |
| --- | --- |
| GroundednessEvaluator | Are responses grounded in provided context? |
| RelevanceEvaluator | Are responses relevant to the question? |
| CoherenceEvaluator | Is the response logically coherent? |
| FluencyEvaluator | Is the language fluent and natural? |
| SimilarityEvaluator | How similar is the response to ground truth? |
| F1ScoreEvaluator | Token-level F1 against ground truth |
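
Token-level F1 is the one metric in this list that needs no judge model, and it is simple enough to compute by hand. The sketch below shows the usual definition (harmonic mean of token precision and recall over the overlapping tokens). The lowercase-and-split tokenization is an assumption, so scores from `F1ScoreEvaluator` may differ slightly from this illustration.

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a perfect match; one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # 4 shared tokens, 6 predicted, 5 in the reference: F1 = 8/11, about 0.727
    print(token_f1("RAG stands for retrieval augmented generation",
                   "RAG means retrieval augmented generation"))
```

The score rewards word overlap only, so a correct paraphrase can score low; that is the gap the model-judged evaluators are meant to cover.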

Running an Evaluation

  1. Prepare a JSONL dataset with questions and ground truth
  2. Select a model in AI Toolkit → Evaluate
  3. Choose evaluators and configure the judge model
  4. View results in the extension panel

Example Evaluation Script

from promptflow.evals.evaluate import evaluate
from promptflow.evals.evaluators import F1ScoreEvaluator, SimilarityEvaluator

# model_config (judge model settings) and my_model_function (the callable
# under test) must be defined before this runs.
results = evaluate(
    data="test_data.jsonl",
    evaluators={
        "similarity": SimilarityEvaluator(model_config=model_config),
        "f1": F1ScoreEvaluator(),
    },
    target=my_model_function,
)
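
The script above expects a `test_data.jsonl` file and a `my_model_function` target. A minimal sketch of both follows. The field names `question`, `ground_truth`, and `answer` are assumptions about what the evaluators read, so check them against the promptflow-evals version you have installed; the target here is a stub that you would replace with a real call to your model.

```python
import json
from pathlib import Path


def write_eval_dataset(rows, path="test_data.jsonl"):
    """Write one JSON object per line and return the file path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


def my_model_function(question: str) -> dict:
    """Stub target: replace the body with a call to the model being evaluated."""
    # Assumption: the evaluators read the model output from an "answer" key.
    return {"answer": f"(stub answer for: {question})"}


if __name__ == "__main__":
    rows = [
        {"question": "What is RAG?",
         "ground_truth": "RAG stands for Retrieval-Augmented Generation."},
        {"question": "Explain embeddings",
         "ground_truth": "Embeddings are dense vector representations."},
    ]
    print(write_eval_dataset(rows))
```

Keeping the target as a plain function makes it easy to swap between the base and fine-tuned models while reusing the same dataset.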

8. AI Toolkit vs. Other Local LLM Tools

| Feature | AI Toolkit | Ollama | LM Studio | vLLM |
| --- | --- | --- | --- | --- |
| UI | VS Code sidebar | CLI | Desktop app | CLI/API |
| Primary format | ONNX | GGUF | GGUF | HF/AWQ |
| Fine-tuning | Built-in (QLoRA) | No | No | No |
| Evaluation | Built-in | No | No | No |
| API server | No | Yes (OpenAI-compatible) | Yes (OpenAI-compatible) | Yes (OpenAI-compatible) |
| Model catalog | HF + Azure AI | Ollama library | HF | HF |
| Best for | Experiment + fine-tune | Quick local API | GUI exploration | Production serving |

9. Practical Workflow

Exploring a New Model

1. Browse catalog in AI Toolkit sidebar
2. Download a quantized ONNX variant
3. Test in playground with sample prompts
4. If quality is good → serve via Ollama or vLLM for your app
5. If quality is poor → fine-tune with QLoRA in AI Toolkit
6. Evaluate fine-tuned model against baseline
7. Export and serve the improved model

Connecting AI Toolkit to Copilot

AI Toolkit models don’t directly replace GitHub Copilot’s backend. However, you can:

  • Use AI Toolkit to explore and evaluate models before deploying them as an API
  • Serve a local model via Ollama or vLLM, then connect it as a custom endpoint
  • Use fine-tuned models for domain-specific tasks (code review, doc generation) alongside Copilot for general coding

10. Troubleshooting

| Problem | Fix |
| --- | --- |
| Model doesn’t appear in sidebar | Check ~/.aitk/models/ directory structure - must be exactly 4 layers deep |
| Slow inference on macOS | Ensure the ONNX model targets cpu or use MLX models via Ollama instead |
| Out of memory during fine-tuning | Reduce batch size, use INT4 quantization, or switch to a smaller model |
| Conversion fails | Install latest onnxruntime_genai and transformers from git main |
| Playground shows gibberish | Model may need include_prompt_templates=1 during conversion |

Next Steps

  • Try 01_ollama_quickstart.ipynb for an API-first approach to local models
  • See 03_local_rag_with_ollama.ipynb for building a local RAG system
  • See 05_speculative_decoding.ipynb for inference optimization
  • For connecting local models to your coding workflow, see 31-ai-powered-dev-tools/04_mcp_deep_dive.md