Help · Updated 2026-05-19

Enable vision (image input) for an agent.

OpenClaw doesn't have a global "vision on/off" switch. Vision capability is decided per-model via the input array, and a sidecar fallback handles agents whose primary model is text-only. This article shows the two knobs and where to find them.

01 · Background

Why one Ollama model handles photos and another doesn't

When an inbound message has an image attachment (Telegram, WhatsApp, etc.), the OpenClaw gateway makes a decision per model:

  • If the active reply model is marked vision-capable in openclaw.json, the image is passed straight to it.
  • If the active reply model is text-only, the image is preserved as an offloaded media://inbound/* ref and OpenClaw routes the description request through agents.defaults.imageModel.primary instead (the "sidecar").

For local Ollama installs, "marked vision-capable" means the model row in models.providers.ollama.models[] has input: ["text", "image"]. Two agents using two different models will get different behavior on the same inbound photo.
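The routing rule above can be sketched in a few lines of Python. This is an illustration only, not the gateway's actual code: the function names are hypothetical, the config is a plain dict shaped like the openclaw.json paths named in this article, and a row with no input array is simply treated as text-only here.

```python
def is_marked_vision(config: dict, provider: str, model_id: str) -> bool:
    """True when the model row's input array lists "image"."""
    rows = (
        config.get("models", {})
        .get("providers", {})
        .get(provider, {})
        .get("models", [])
    )
    for row in rows:
        if row.get("id") == model_id:
            # Simplification: a missing input array counts as text-only here.
            return "image" in (row.get("input") or [])
    return False


def route_inbound_image(config: dict, provider: str, model_id: str) -> tuple:
    """Return (route, model_ref) for an inbound image attachment."""
    if is_marked_vision(config, provider, model_id):
        return ("direct", f"{provider}/{model_id}")
    # Text-only reply model: hand the description request to the sidecar.
    sidecar = (
        config.get("agents", {})
        .get("defaults", {})
        .get("imageModel", {})
        .get("primary")
    )
    # sidecar is None when no fallback has been configured.
    return ("sidecar", sidecar)
```

Calling route_inbound_image for two agents that pin different model ids against the same config shows why the same photo can take two different paths.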

The upstream model must actually be vision-capable

Flipping the input array on a model row tells OpenClaw the model accepts images. It does not add image support to a model that doesn't have it. Verify with Ollama:

curl -s -X POST http://<host>:11434/api/show \
  -H 'Content-Type: application/json' \
  -d '{"name":"<model>"}' | jq '.capabilities'

Look for "vision" in the array. If it's missing, pull a vision build instead (e.g. ollama pull qwen2.5vl:7b) or use the sidecar fallback described in section 3.
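If you would rather script the check than read jq output, the same probe can be done from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, the request body mirrors the curl command above, and the response is assumed to carry a capabilities list as that command implies.

```python
import json
import urllib.request


def has_vision(show_response: dict) -> bool:
    """True when the /api/show payload lists "vision" in capabilities."""
    return "vision" in (show_response.get("capabilities") or [])


def probe_vision(host: str, model: str, timeout: float = 10.0) -> bool:
    """POST to Ollama's /api/show and report whether the model is vision-capable."""
    request = urllib.request.Request(
        f"http://{host}:11434/api/show",
        data=json.dumps({"name": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return has_vision(json.load(response))
```

Keeping has_vision separate from the network call makes it easy to run against a saved response when the Ollama host is not reachable.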

02 · Per-model toggle

Mark a model vision-capable from the agent's Model tab

Open Agents → <agent> → Model tab in the Pro dashboard. The Image input (vision) section lets you flip the agent's current primary model between:

  • Auto-detect (gateway decides): clears the input array. The gateway falls back to whatever Ollama's /api/show probe reports.
  • Vision enabled (text + image): writes input: ["text", "image"].
  • Text only: writes input: ["text"]. Use this when you know the model can't actually see images and want to force routing through the sidecar.

The toggle writes to models.providers.<provider>.models[].input. Because catalog rows are shared across agents, flipping the flag for one agent's model affects every agent that pins the same model id.
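To make the three states and the shared-row effect concrete, here is a small Python sketch of what the toggle does to a config dict. The function and the state keys ("auto", "vision", "text") are hypothetical names for illustration, and modelling "clears the input array" as removing the key is an assumption; the dashboard may represent the cleared state differently.

```python
# Hypothetical state names mapped to the input value each toggle writes.
TOGGLE_INPUT = {
    "auto": None,                  # Auto-detect: no declared input array
    "vision": ["text", "image"],   # Vision enabled (text + image)
    "text": ["text"],              # Text only
}


def apply_vision_toggle(config: dict, provider: str, model_id: str, state: str) -> dict:
    """Update models.providers.<provider>.models[].input for one model id."""
    value = TOGGLE_INPUT[state]
    for row in config["models"]["providers"][provider]["models"]:
        if row["id"] == model_id:
            if value is None:
                row.pop("input", None)  # assumption: clearing removes the key
            else:
                row["input"] = list(value)
            return row
    raise KeyError(f"no model row with id {model_id!r} under provider {provider!r}")
```

Because the function edits the single catalog row for that model id, any other agent that pins the same id sees the new value on its next inbound image, which is the shared-row behaviour described above.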

What the connector writes for a brand-new model row

{
  "id": "qwen3.5:9b",
  "name": "qwen3.5:9b",
  "reasoning": false,
  "input": ["text", "image"],
  "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
  "contextWindow": 128000,
  "maxTokens": 8192
}

Tune contextWindow and maxTokens right above the vision toggle if those defaults are wrong for your model.

03 · Global fallback

Sidecar vision model for text-only agents

Some agents must run on a text-only reasoning model for cost or latency reasons but still need to read inbound photos. For those, configure a single sidecar vision model on the Agents page:

  1. Go to /koko-dashboard/pro/agents.
  2. In the Global vision fallback card, pick a vision-capable model (e.g. ollama/gemma4:31b or ollama/qwen2.5vl:7b).
  3. Click Save fallback.

The dropdown writes to agents.defaults.imageModel.primary. With it set, every agent whose primary model is text-only routes inbound images through this sidecar, and the agent receives a short text description of the image in its context.
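For readers who want to see where that key sits in the config, the following Python sketch builds the nested path on a plain dict. The function name is hypothetical and this is not the dashboard's implementation; only the key path agents.defaults.imageModel.primary and the provider/model reference format come from this article.

```python
import json


def set_vision_fallback(config: dict, model_ref: str) -> dict:
    """Set agents.defaults.imageModel.primary, creating parent keys as needed."""
    agents = config.setdefault("agents", {})
    defaults = agents.setdefault("defaults", {})
    image_model = defaults.setdefault("imageModel", {})
    image_model["primary"] = model_ref
    return config


# Example: start from an empty config and print the resulting fragment.
fragment = set_vision_fallback({}, "ollama/qwen2.5vl:7b")
print(json.dumps(fragment, indent=2))
```

Using setdefault leaves any sibling keys under agents.defaults untouched, so the sketch only changes the fallback entry.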

04 · Operator checklist

Quick checklist when an agent refuses to see images

  1. Confirm the model is vision-capable upstream. Use the /api/show probe in section 1. If the capability isn't there, pull a vision build; the OpenClaw flag alone won't add image support.
  2. Open the agent's Model tab. Set Image input (vision) to Vision enabled (text + image). Save.
  3. If the model truly is text-only, set a sidecar on the Agents page (Global vision fallback) so the agent still receives a description of the image.
  4. Re-test with a real inbound photo. If the agent still refuses, check /status on the gateway — the media line summarises why a capability was skipped (file too large, provider missing, etc.).
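The checklist can also be read as a small decision function. The sketch below is one hedged interpretation of the steps above, not a tool that ships with OpenClaw: the function name and the returned labels are hypothetical, and the inputs are values you collect yourself (the /api/show result, the model row's input array, and the configured sidecar, if any).

```python
from typing import Optional


def next_step(
    upstream_vision: bool,
    declared_input: Optional[list],
    sidecar: Optional[str],
) -> str:
    """Suggest which checklist step to act on next."""
    marked_image = declared_input is not None and "image" in declared_input
    if upstream_vision:
        # Model can see images: an unset array (auto-detect) or an explicit
        # image flag should already work, so move on to re-testing.
        if declared_input is None or marked_image:
            return "retest"
        return "enable-vision-toggle"
    # Upstream model is text-only from here on.
    if marked_image:
        # The flag claims image support the model lacks; Text only forces
        # routing through the sidecar.
        return "set-text-only"
    if sidecar is None:
        return "set-sidecar-or-pull-vision-build"
    return "retest"
```

A "retest" result corresponds to step 4: send a real inbound photo and, if the agent still refuses, read the media line in /status on the gateway.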