Claude Code supports Bring Your Own Model (BYOM) — you can point it at any API that speaks the Anthropic Messages format (/v1/messages). Ollama serves this API natively, so you can deploy an open-source model on a Vast.ai GPU instance and connect Claude Code directly to it. No proxy, no API translation layer, no Anthropic account required.
This guide covers deploying two models and connecting Claude Code to them:
| Model | Parameters | VRAM Used | Best For |
|---|---|---|---|
| Qwen3-Coder-Next | 80B MoE (3B active) | ~57 GB | State-of-the-art coding, tool calling |
| GPT-OSS-20B | 20B (4-bit quantized) | ~14 GB | Lightweight, fast responses, fine-tuned for Claude Code |
Qwen3-Coder-Next is a Mixture of Experts model from Alibaba — 80 billion total parameters but only 3 billion active per token, giving strong coding ability at efficient inference cost. GPT-OSS-20B is fine-tuned specifically for Claude Code’s tool-calling format.
## Prerequisites

### Hardware Requirements
| Model | Min GPU VRAM | Recommended GPU | Disk |
|---|---|---|---|
| Qwen3-Coder-Next | 80 GB | A100 80GB, H100 | 200 GB |
| GPT-OSS-20B | 16 GB | RTX 3090, RTX 4090 | 100 GB |
Qwen3-Coder-Next uses ~48 GB for model weights and ~8 GB for KV cache in Ollama’s default Q4 quantization — totaling ~57 GB, which requires an 80 GB GPU like the A100 or H100. GPT-OSS-20B uses ~12 GB for weights and ~1 GB for KV cache. Disk space is needed for the Ollama image plus model downloads.
## Step 1: Install the Vast.ai CLI

Install the CLI and set your API key. You can find your API key in the Vast.ai console under Account → API Key:

```bash
pip install vastai
vastai set api-key <YOUR_VAST_API_KEY>
```
## Step 2: Choose a Model, Find a GPU, and Deploy

### Qwen3-Coder-Next

Search for a GPU with at least 80 GB VRAM. Look for an A100 or H100 in the results — these offer the best performance for this model:

```bash
vastai search offers \
  'gpu_ram>=80 num_gpus=1 reliability>0.9 disk_space>=200 inet_down>200 dph<2.0' \
  -o 'dph'
```
Pick an offer ID from the first column, then create the instance:

```bash
vastai create instance <OFFER_ID> \
  --image ollama/ollama:latest \
  --env "-p 11434:11434" \
  --disk 200 \
  --onstart-cmd "ollama serve & sleep 5 && ollama pull qwen3-coder-next"
```
### GPT-OSS-20B

This model fits on smaller GPUs. Search for instances with at least 16 GB VRAM:

```bash
vastai search offers \
  'gpu_ram>=16 num_gpus=1 reliability>0.9 disk_space>=100 inet_down>200 dph<1.0' \
  -o 'dph'
```
Pick an offer ID from the first column, then create the instance:

```bash
vastai create instance <OFFER_ID> \
  --image ollama/ollama:latest \
  --env "-p 11434:11434" \
  --disk 100 \
  --onstart-cmd "ollama serve & sleep 5 && ollama pull gpt-oss:20b"
```
The command starts the Ollama server, waits for it to initialize, then downloads the model weights. Save the instance ID from the output — you’ll need it in the next steps.
### What the flags do

| Flag | Purpose |
|---|---|
| `--image ollama/ollama:latest` | Official Ollama Docker image with GPU support |
| `-p 11434:11434` | Exposes Ollama's default port to the internet |
| `--disk 200` | Allocates enough disk for the Docker image plus model weights |
| `ollama serve &` | Starts the Ollama server in the background |
| `ollama pull <model>` | Downloads the model weights (runs once on first boot) |
## Step 3: Wait for the Model to Download

Monitor the instance logs to track the download progress:

```bash
vastai logs <INSTANCE_ID> --tail 20
```
Look for success in the output, which confirms the model finished downloading:
```
pulling 30e51a7cb1cf: 100% ▏████████████████████ 51 GB
verifying sha256 digest
writing manifest
success
```
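Rather than re-running the logs command by hand, the check can be scripted. A minimal polling sketch — the `pull_finished` helper and the loop are illustrative, not part of the Vast.ai CLI:

```bash
# Returns success when a chunk of log output contains the final
# "success" line that Ollama prints after a completed pull.
pull_finished() {
  printf '%s\n' "$1" | grep -qx 'success'
}

# Usage sketch (assumes the vastai CLI is installed and you have your instance ID):
#   while ! pull_finished "$(vastai logs <INSTANCE_ID> --tail 20)"; do
#     sleep 30
#   done
```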
## Step 4: Get Your Endpoint

Retrieve the public IP and mapped port for your instance:

```bash
vastai show instance <INSTANCE_ID> --raw | \
  jq -r '"\(.public_ipaddr):\(.ports["11434/tcp"][0].HostPort)"'
```
This outputs your endpoint in <IP>:<PORT> format. Save this — you’ll use it to verify and connect Claude Code.
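To see what the jq expression is doing, here it is run against an abridged sample of the `--raw` JSON (field values are illustrative; the real output contains many more fields):

```bash
# Abridged sample of `vastai show instance --raw` output (values illustrative)
raw='{"public_ipaddr":"203.0.113.7","ports":{"11434/tcp":[{"HostPort":"40123"}]}}'

# Same expression as above: join the public IP and the mapped host port
endpoint=$(printf '%s' "$raw" | jq -r '"\(.public_ipaddr):\(.ports["11434/tcp"][0].HostPort)"')
echo "$endpoint"   # 203.0.113.7:40123
```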
## Step 5: Verify the Endpoint

Before connecting Claude Code, confirm the model is running and responding correctly.

### Check model availability

List the models loaded in Ollama:

```bash
curl -s http://<IP>:<PORT>/v1/models \
  -H "x-api-key: ollama" | jq .
```
You should see your model listed in the response.
### Test basic chat

Send a simple message using the Anthropic Messages API format:

```bash
curl -s http://<IP>:<PORT>/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: ollama" \
  -d '{
    "model": "qwen3-coder-next",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hello in one sentence"}]
  }' | jq .
```
For GPT-OSS-20B, replace the model name with `gpt-oss:20b`.
Expected output:
```json
{
  "id": "msg_f2419f865f0ab7866135d9f2",
  "type": "message",
  "role": "assistant",
  "model": "qwen3-coder-next",
  "content": [
    {
      "type": "text",
      "text": "Hello!"
    }
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 13,
    "output_tokens": 3
  }
}
```
### Test tool calling

Claude Code relies on tool calling to edit files, run commands, and navigate your codebase. Verify the model handles tool calls correctly:

```bash
curl -s http://<IP>:<PORT>/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: ollama" \
  -d '{
    "model": "qwen3-coder-next",
    "max_tokens": 1024,
    "tools": [
      {
        "name": "Write",
        "description": "Write content to a file",
        "input_schema": {
          "type": "object",
          "properties": {
            "file_path": {"type": "string"},
            "content": {"type": "string"}
          },
          "required": ["file_path", "content"]
        }
      }
    ],
    "messages": [{"role": "user", "content": "Create hello.py that prints hello world"}]
  }' | jq .
```
A successful response includes `"stop_reason": "tool_use"` and a `tool_use` content block with the file path and content:

```json
{
  "id": "msg_00789b0ea0df023942763847",
  "type": "message",
  "role": "assistant",
  "model": "qwen3-coder-next",
  "content": [
    {
      "type": "tool_use",
      "id": "call_spd57315",
      "name": "Write",
      "input": {
        "file_path": "hello.py",
        "content": "print(\"hello world\")\n"
      }
    }
  ],
  "stop_reason": "tool_use",
  "usage": {
    "input_tokens": 306,
    "output_tokens": 36
  }
}
```
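In a real session, Claude Code executes the tool locally and sends the result back in a follow-up request so the model can continue. A sketch of that continuation payload in the Anthropic Messages format — the `tool_use_id` and input values carry over from the example above, and the `tool_result` text is illustrative:

```json
{
  "model": "qwen3-coder-next",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "Write",
      "description": "Write content to a file",
      "input_schema": {
        "type": "object",
        "properties": {
          "file_path": {"type": "string"},
          "content": {"type": "string"}
        },
        "required": ["file_path", "content"]
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "Create hello.py that prints hello world"},
    {
      "role": "assistant",
      "content": [
        {
          "type": "tool_use",
          "id": "call_spd57315",
          "name": "Write",
          "input": {"file_path": "hello.py", "content": "print(\"hello world\")\n"}
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "call_spd57315",
          "content": "File created successfully"
        }
      ]
    }
  ]
}
```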
## Step 6: Connect Claude Code

Set the environment variables that tell Claude Code to use your self-hosted model instead of Anthropic’s API. Replace <IP>:<PORT> with the endpoint from step 4.
### Qwen3-Coder-Next

```bash
export ANTHROPIC_BASE_URL="http://<IP>:<PORT>"
export ANTHROPIC_API_KEY="ollama"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_MODEL="qwen3-coder-next"

claude --model qwen3-coder-next
```

### GPT-OSS-20B

```bash
export ANTHROPIC_BASE_URL="http://<IP>:<PORT>"
export ANTHROPIC_API_KEY="ollama"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_MODEL="gpt-oss:20b"

claude --model "gpt-oss:20b"
```
Claude Code launches and connects to your model. Try asking it to create a file, edit code, or run a command to confirm tool calling works end-to-end.
### What the environment variables do

| Variable | Purpose |
|---|---|
| `ANTHROPIC_BASE_URL` | Points Claude Code at your Ollama instance instead of api.anthropic.com |
| `ANTHROPIC_API_KEY` | Required by Claude Code but can be any value — Ollama doesn’t enforce auth |
| `ANTHROPIC_AUTH_TOKEN` | Same as above — set to any non-empty string |
| `ANTHROPIC_MODEL` | The model name to request from Ollama |
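The exports can be wrapped in a small helper so switching endpoints or models is a single command. A convenience sketch — `use_ollama_endpoint` is not part of Claude Code or Ollama:

```bash
# Hypothetical helper: point the current shell at a self-hosted endpoint.
use_ollama_endpoint() {
  export ANTHROPIC_BASE_URL="http://$1"
  export ANTHROPIC_API_KEY="ollama"
  export ANTHROPIC_AUTH_TOKEN="ollama"
  export ANTHROPIC_MODEL="$2"
}

# Usage: use_ollama_endpoint <IP>:<PORT> qwen3-coder-next
#        claude --model "$ANTHROPIC_MODEL"
```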
### Persistent Configuration (optional)

To avoid setting environment variables every time, add the configuration to ~/.claude/settings.json:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://<IP>:<PORT>",
    "ANTHROPIC_API_KEY": "ollama",
    "ANTHROPIC_AUTH_TOKEN": "ollama"
  }
}
```
Then launch with:

```bash
claude --model qwen3-coder-next
```
The settings.json approach stores your endpoint persistently. If you destroy the Vast.ai instance, you’ll need to update the IP and port or remove the configuration to use Anthropic’s API again.
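When the instance changes, the stored endpoint can be rewritten with jq rather than edited by hand. A sketch — the sample JSON mirrors the settings.json above, and the sample addresses are illustrative:

```bash
# Rewrite the stored base URL in a settings document (sample values)
settings='{"env":{"ANTHROPIC_BASE_URL":"http://198.51.100.1:40000","ANTHROPIC_API_KEY":"ollama","ANTHROPIC_AUTH_TOKEN":"ollama"}}'
updated=$(printf '%s' "$settings" | jq --arg url "http://203.0.113.7:40123" '.env.ANTHROPIC_BASE_URL = $url')
printf '%s\n' "$updated" | jq -r '.env.ANTHROPIC_BASE_URL'   # http://203.0.113.7:40123

# Against the real file (writes via a temp file):
#   jq --arg url "http://<IP>:<PORT>" '.env.ANTHROPIC_BASE_URL = $url' \
#     ~/.claude/settings.json > /tmp/settings.json && mv /tmp/settings.json ~/.claude/settings.json
```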
## Cleanup

Destroy your instance when you’re done to stop billing:

```bash
vastai destroy instance <INSTANCE_ID>
```
## Next Steps

- Try other models: Ollama supports hundreds of models. Any model with tool-calling support works with Claude Code — try qwen3-coder (30B) for a middle ground between the two options above.
- Secure your endpoint: The default setup has no authentication. For production use, add a reverse proxy with TLS and API key validation.
- Scale up: An H100 offers faster inference than an A100 for Qwen3-Coder-Next, with more headroom for longer context windows and concurrent requests.
## Resources