Claude Code supports Bring Your Own Model (BYOM) — you can point it at any API that speaks the Anthropic Messages format (/v1/messages). Ollama serves this API natively, so you can deploy an open-source model on a Vast.ai GPU instance and connect Claude Code directly to it. No proxy, no API translation layer, no Anthropic account required. This guide covers deploying two models and connecting Claude Code to them:
| Model | Parameters | VRAM Used | Best For |
|---|---|---|---|
| Qwen3-Coder-Next | 80B MoE (3B active) | ~57 GB | State-of-the-art coding, tool calling |
| GPT-OSS-20B | 20B (4-bit quantized) | ~14 GB | Lightweight, fast responses, fine-tuned for Claude Code |
Qwen3-Coder-Next is a Mixture of Experts model from Alibaba — 80 billion total parameters but only 3 billion active per token, giving strong coding ability at efficient inference cost. GPT-OSS-20B is fine-tuned specifically for Claude Code’s tool-calling format.
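The MoE trade-off can be made concrete with a line of arithmetic. This is a rough first-order sketch using the numbers from the table above: per-token compute scales with the parameters that are active, not the total.

```python
# Why an 80B-total MoE can be cheap to run: per-token compute scales with
# *active* parameters (rough first-order model; numbers from the table above).
total_params = 80e9    # Qwen3-Coder-Next, total parameters
active_params = 3e9    # parameters activated per token
dense_20b = 20e9       # GPT-OSS-20B is dense: all 20B active per token

active_fraction = active_params / total_params   # 0.0375
relative_cost = active_params / dense_20b        # 0.15

print(f"Active fraction: {active_fraction:.1%}")              # Active fraction: 3.8%
print(f"Per-token compute vs dense 20B: {relative_cost:.2f}x")  # 0.15x
```

In other words, the 80B model does less per-token arithmetic than the dense 20B model, while the full 80B of weights still has to fit in VRAM.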

Prerequisites

Hardware Requirements

| Model | Min GPU VRAM | Recommended GPU | Disk |
|---|---|---|---|
| Qwen3-Coder-Next | 80 GB | A100 80GB, H100 | 200 GB |
| GPT-OSS-20B | 16 GB | RTX 3090, RTX 4090 | 100 GB |
Qwen3-Coder-Next uses ~48 GB for model weights and ~8 GB for KV cache in Ollama’s default Q4 quantization; with runtime overhead the total comes to ~57 GB, which requires an 80 GB GPU such as the A100 or H100. GPT-OSS-20B uses ~12 GB for weights and ~1 GB for KV cache. Disk space is needed for the Ollama image plus model downloads.
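A quick way to sanity-check whether a model fits a given card is to add weights, KV cache, and a small overhead allowance. The 2 GB overhead figure below is an illustrative assumption, not a measured value:

```python
# Rough VRAM budget check using the per-model estimates quoted above
# (Q4 weights + KV cache, plus an assumed ~2 GB of runtime overhead).
models = {
    "qwen3-coder-next": {"weights_gb": 48, "kv_cache_gb": 8},
    "gpt-oss:20b": {"weights_gb": 12, "kv_cache_gb": 1},
}

def fits(model: str, gpu_vram_gb: int, overhead_gb: int = 2) -> bool:
    m = models[model]
    return m["weights_gb"] + m["kv_cache_gb"] + overhead_gb <= gpu_vram_gb

print(fits("qwen3-coder-next", 80))  # A100 80GB / H100: True
print(fits("qwen3-coder-next", 48))  # 48 GB card: False
print(fits("gpt-oss:20b", 24))       # RTX 3090 / 4090: True
```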

Step 1: Install the Vast.ai CLI

Install the CLI and set your API key. You can find your API key in the Vast.ai console under Account → API Key:
pip install vastai
vastai set api-key <YOUR_VAST_API_KEY>

Step 2: Choose a Model, Find a GPU, and Deploy

For Qwen3-Coder-Next, search for a GPU with at least 80 GB VRAM. Look for an A100 or H100 in the results; these offer the best performance for this model. (For GPT-OSS-20B, relax the filters to gpu_ram>=16 and disk_space>=100, per the hardware table above.)
vastai search offers \
  'gpu_ram>=80 num_gpus=1 reliability>0.9 disk_space>=200 inet_down>200 dph<2.0' \
  -o 'dph'
Pick an offer ID from the first column, then create the instance:
vastai create instance <OFFER_ID> \
  --image ollama/ollama:latest \
  --env "-p 11434:11434" \
  --disk 200 \
  --onstart-cmd "ollama serve & sleep 5 && ollama pull qwen3-coder-next"
The command starts the Ollama server, waits for it to initialize, then downloads the model weights (for GPT-OSS-20B, pull gpt-oss:20b instead of qwen3-coder-next). Save the instance ID from the output — you’ll need it in the next steps.

What the flags do

| Flag | Purpose |
|---|---|
| `--image ollama/ollama:latest` | Official Ollama Docker image with GPU support |
| `-p 11434:11434` | Exposes Ollama’s default port to the internet |
| `--disk 200` | Allocates enough disk for the Docker image plus model weights |
| `ollama serve &` | Starts the Ollama server in the background |
| `ollama pull <model>` | Downloads the model weights (runs once on first boot) |

Step 3: Wait for the Model to Download

Monitor the instance logs to track the download progress:
vastai logs <INSTANCE_ID> --tail 20
Look for success in the output, which confirms the model finished downloading:
pulling 30e51a7cb1cf: 100% ▏████████████████████  51 GB
verifying sha256 digest
writing manifest
success
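If you prefer not to watch logs, a short script can poll the instance until the model is pulled. This is a sketch assuming Ollama’s /api/tags endpoint, which lists pulled models as {"models": [{"name": ...}, ...]}; the endpoint placeholder is yours from step 4.

```python
# Poll the Ollama server until the model finishes downloading, instead of
# tailing `vastai logs` by hand. Assumes Ollama's /api/tags listing endpoint.
import json
import time
import urllib.request

def model_names(tags_json: str) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

def wait_for_model(endpoint: str, model: str, timeout_s: int = 3600) -> bool:
    """Return True once `model` appears in the tag list, False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"http://{endpoint}/api/tags", timeout=10) as r:
                if any(n.startswith(model) for n in model_names(r.read().decode())):
                    return True
        except OSError:
            pass  # server may still be starting up
        time.sleep(30)
    return False

# wait_for_model("<IP>:<PORT>", "qwen3-coder-next")
```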

Step 4: Get Your Endpoint

Retrieve the public IP and mapped port for your instance:
vastai show instance <INSTANCE_ID> --raw | \
  jq -r '"\(.public_ipaddr):\(.ports["11434/tcp"][0].HostPort)"'
This outputs your endpoint in <IP>:<PORT> format. Save this — you’ll use it to verify and connect Claude Code.
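If you don’t have jq installed, the same extraction works in a few lines of Python, reading the identical fields from the raw instance JSON:

```python
# Equivalent of the jq one-liner: pull <IP>:<PORT> out of the JSON printed by
# `vastai show instance <INSTANCE_ID> --raw`.
import json

def endpoint_from_instance(raw_json: str, container_port: str = "11434/tcp") -> str:
    inst = json.loads(raw_json)
    host_port = inst["ports"][container_port][0]["HostPort"]
    return f"{inst['public_ipaddr']}:{host_port}"
```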

Step 5: Verify the Endpoint

Before connecting Claude Code, confirm the model is running and responding correctly.

Check model availability

List the models loaded in Ollama:
curl -s http://<IP>:<PORT>/v1/models \
  -H "x-api-key: ollama" | jq .
You should see your model listed in the response.

Test basic chat

Send a simple message using the Anthropic Messages API format:
curl -s http://<IP>:<PORT>/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: ollama" \
  -d '{
    "model": "qwen3-coder-next",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hello in one sentence"}]
  }' | jq .
For GPT-OSS-20B, replace the model name with gpt-oss:20b. Expected output:
{
  "id": "msg_f2419f865f0ab7866135d9f2",
  "type": "message",
  "role": "assistant",
  "model": "qwen3-coder-next",
  "content": [
    {
      "type": "text",
      "text": "Hello!"
    }
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 13,
    "output_tokens": 3
  }
}
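The same smoke test can be scripted in Python (stdlib only). The request and response shapes follow the Anthropic Messages format shown above; the endpoint and model name are placeholders for your own values.

```python
# Minimal Messages-API smoke test against a self-hosted Ollama endpoint.
import json
import urllib.request

def extract_text(resp: dict) -> str:
    """Concatenate the text blocks of a Messages-format response."""
    return "".join(b["text"] for b in resp["content"] if b["type"] == "text")

def chat(endpoint: str, model: str, prompt: str, max_tokens: int = 256) -> str:
    req = urllib.request.Request(
        f"http://{endpoint}/v1/messages",
        data=json.dumps({
            "model": model,
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={
            "content-type": "application/json",
            "anthropic-version": "2023-06-01",
            "x-api-key": "ollama",
        },
    )
    with urllib.request.urlopen(req) as r:
        return extract_text(json.loads(r.read()))

# chat("<IP>:<PORT>", "qwen3-coder-next", "Say hello in one sentence")
```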

Test tool calling

Claude Code relies on tool calling to edit files, run commands, and navigate your codebase. Verify the model handles tool calls correctly:
curl -s http://<IP>:<PORT>/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: ollama" \
  -d '{
    "model": "qwen3-coder-next",
    "max_tokens": 1024,
    "tools": [
      {
        "name": "Write",
        "description": "Write content to a file",
        "input_schema": {
          "type": "object",
          "properties": {
            "file_path": {"type": "string"},
            "content": {"type": "string"}
          },
          "required": ["file_path", "content"]
        }
      }
    ],
    "messages": [{"role": "user", "content": "Create hello.py that prints hello world"}]
  }' | jq .
A successful response includes "stop_reason": "tool_use" and a tool_use content block with the file path and content:
{
  "id": "msg_00789b0ea0df023942763847",
  "type": "message",
  "role": "assistant",
  "model": "qwen3-coder-next",
  "content": [
    {
      "type": "tool_use",
      "id": "call_spd57315",
      "name": "Write",
      "input": {
        "file_path": "hello.py",
        "content": "print(\"hello world\")\n"
      }
    }
  ],
  "stop_reason": "tool_use",
  "usage": {
    "input_tokens": 306,
    "output_tokens": 36
  }
}
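To check responses like this programmatically, a small helper can pull the tool calls out of a Messages-format response, so you can assert the model emitted a well-formed Write call before wiring up Claude Code:

```python
# Extract (tool_name, tool_input) pairs from a Messages-format response.
def tool_calls(resp: dict) -> list:
    """Return tool calls when stop_reason is tool_use, else an empty list."""
    if resp.get("stop_reason") != "tool_use":
        return []
    return [(b["name"], b["input"]) for b in resp["content"] if b["type"] == "tool_use"]
```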

Step 6: Connect Claude Code

Set the environment variables that tell Claude Code to use your self-hosted model instead of Anthropic’s API. Replace <IP>:<PORT> with the endpoint from step 4.
export ANTHROPIC_BASE_URL="http://<IP>:<PORT>"
export ANTHROPIC_API_KEY="ollama"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_MODEL="qwen3-coder-next"
claude --model qwen3-coder-next
Claude Code launches and connects to your model. Try asking it to create a file, edit code, or run a command to confirm tool calling works end-to-end.

What the environment variables do

| Variable | Purpose |
|---|---|
| `ANTHROPIC_BASE_URL` | Points Claude Code at your Ollama instance instead of api.anthropic.com |
| `ANTHROPIC_API_KEY` | Required by Claude Code but can be any value — Ollama doesn’t enforce auth |
| `ANTHROPIC_AUTH_TOKEN` | Same as above — set to any non-empty string |
| `ANTHROPIC_MODEL` | The model name to request from Ollama |

Persistent Configuration (optional)

To avoid setting environment variables every time, add the configuration to ~/.claude/settings.json:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://<IP>:<PORT>",
    "ANTHROPIC_API_KEY": "ollama",
    "ANTHROPIC_AUTH_TOKEN": "ollama"
  }
}
Then launch with:
claude --model qwen3-coder-next
The settings.json approach stores your endpoint persistently. If you destroy the Vast.ai instance, you’ll need to update the IP and port or remove the configuration to use Anthropic’s API again.

Cleanup

Destroy your instance when you’re done to stop billing:
vastai destroy instance <INSTANCE_ID>

Next Steps

  • Try other models: Ollama supports hundreds of models. Any model with tool-calling support works with Claude Code — try qwen3-coder (30B) for a middle ground between the two options above.
  • Secure your endpoint: The default setup has no authentication. For production use, add a reverse proxy with TLS and API key validation.
  • Scale up: An H100 offers faster inference than an A100 for Qwen3-Coder-Next, with more headroom for longer context windows and concurrent requests.

Resources