Enable vision (image input) for an agent.
OpenClaw doesn't have a global "vision on/off" switch. Vision capability is decided per-model via the input array, and a sidecar fallback handles agents whose primary model is text-only. This article shows the two knobs and where to find them.
01 · Background
Why one Ollama model handles photos and another doesn't
When an inbound message has an image attachment (Telegram, WhatsApp, etc.), the OpenClaw gateway makes a decision per model:
- If the active reply model is marked vision-capable in openclaw.json, the image is passed straight to it.
- If the active reply model is text-only, the image is preserved as an offloaded media://inbound/* ref and OpenClaw routes the description request through agents.defaults.imageModel.primary instead (the "sidecar").
For local Ollama installs, "marked vision-capable" means the model row in models.providers.ollama.models[] has input: ["text", "image"]. Two agents using two different models will get different behavior on the same inbound photo.
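The two rules above can be sketched as a small decision function. This is an illustration of the documented behavior, not OpenClaw's actual code: the function name, the return labels, and the "unhandled" case are assumptions, and the sketch only covers rows with an explicit input array.

```python
from typing import Optional


def route_image(model_row: dict, sidecar_model: Optional[str]) -> str:
    """Decide where an inbound image goes for the active reply model.

    model_row is a catalog row such as {"id": ..., "input": [...]}.
    sidecar_model stands in for agents.defaults.imageModel.primary.
    """
    inputs = model_row.get("input") or []
    if "image" in inputs:
        # Row is marked vision-capable: the image goes straight to the model.
        return "direct"
    if sidecar_model:
        # Text-only row: the description request goes to the sidecar.
        return f"sidecar:{sidecar_model}"
    # Hypothetical label for "text-only model and no sidecar configured".
    return "unhandled"


vision_row = {"id": "qwen2.5vl:7b", "input": ["text", "image"]}
text_row = {"id": "text-only-model", "input": ["text"]}

print(route_image(vision_row, "ollama/qwen2.5vl:7b"))  # direct
print(route_image(text_row, "ollama/qwen2.5vl:7b"))    # sidecar:ollama/qwen2.5vl:7b
print(route_image(text_row, None))                     # unhandled
```

Two agents pinned to vision_row and text_row respectively would take different branches for the same photo, which is the behavior difference described above.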
The upstream model must actually be vision-capable
Flipping the input array on a model row tells OpenClaw the model accepts images. It does not add image support to a model that doesn't have it. Verify with Ollama:
curl -s -X POST http://<host>:11434/api/show \
-H 'Content-Type: application/json' \
-d '{"name":"<model>"}' | jq '.capabilities'
Look for "vision" in the array. If it's missing, pull a vision build instead (e.g. ollama pull qwen2.5vl:7b) or use the sidecar fallback described in section 3.
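If you want to run the same check from a script, a minimal Python sketch could look like the following. It sends the same request as the curl probe above and assumes the response carries a "capabilities" array, as described here; the function names and the timeout value are illustrative choices, not part of OpenClaw or Ollama.

```python
import json
import urllib.request


def has_vision(show_response: dict) -> bool:
    """True if the /api/show payload lists "vision" among its capabilities."""
    return "vision" in (show_response.get("capabilities") or [])


def probe_vision(host: str, model: str, timeout: float = 10.0) -> bool:
    """POST to /api/show on the given Ollama host and check for vision support."""
    request = urllib.request.Request(
        f"http://{host}:11434/api/show",
        data=json.dumps({"name": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return has_vision(json.load(response))
```

Calling probe_vision("<host>", "<model>") needs a reachable Ollama instance; has_vision can be exercised offline against a saved response.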
02 · Per-model toggle
Mark a model vision-capable from the agent's Model tab
Open Agents → <agent> → Model tab in the Pro dashboard. The Image input (vision) section lets you flip the agent's current primary model between:
- Auto-detect (gateway decides): clears the input array. The gateway falls back to whatever Ollama's /api/show probe reports.
- Vision enabled (text + image): sets input to ["text", "image"].
- Text only: sets input to ["text"]. Use this when you know the model can't actually see images and want to force routing through the sidecar.
The toggle writes to models.providers.<provider>.models[].input. Because catalog rows are shared across agents, flipping the flag for one agent's model affects every agent that pins the same model id.
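The shared-row effect is easier to see on a toy config. The sketch below is an assumption-heavy illustration: the mode names ("auto", "vision", "text"), the agent-to-model mapping, and the choice to model "clears the input array" as removing the key are all hypothetical, and only the models.providers.<provider>.models[].input path comes from this article.

```python
import copy

# Hypothetical mode names for the three dashboard options.
INPUT_MODES = {
    "auto": None,                 # Auto-detect: no input array on the row
    "vision": ["text", "image"],  # Vision enabled (text + image)
    "text": ["text"],             # Text only
}


def set_model_input(config: dict, provider: str, model_id: str, mode: str) -> dict:
    """Return a copy of config with the catalog row's input array set for mode."""
    if mode not in INPUT_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    updated = copy.deepcopy(config)
    for row in updated["models"]["providers"][provider]["models"]:
        if row["id"] == model_id:
            if INPUT_MODES[mode] is None:
                row.pop("input", None)
            else:
                row["input"] = list(INPUT_MODES[mode])
            return updated
    raise KeyError(f"no row {model_id!r} under provider {provider!r}")


def agents_affected(agent_models: dict, model_id: str) -> list:
    """List agents (hypothetical name -> model id mapping) that pin model_id."""
    return sorted(name for name, pinned in agent_models.items() if pinned == model_id)


config = {"models": {"providers": {"ollama": {"models": [
    {"id": "qwen3.5:9b", "input": ["text"]},
]}}}}
agent_models = {"support": "qwen3.5:9b", "triage": "qwen3.5:9b", "notes": "other-model"}

updated = set_model_input(config, "ollama", "qwen3.5:9b", "vision")
print(updated["models"]["providers"]["ollama"]["models"][0]["input"])  # ['text', 'image']
print(agents_affected(agent_models, "qwen3.5:9b"))                     # ['support', 'triage']
```

One write to the row changes image handling for both "support" and "triage", so check which agents pin a model id before flipping its flag.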
What the connector writes for a brand-new model row
{
"id": "qwen3.5:9b",
"name": "qwen3.5:9b",
"reasoning": false,
"input": ["text", "image"],
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"contextWindow": 128000,
"maxTokens": 8192
}
Tune contextWindow and maxTokens right above the vision toggle if those defaults are wrong for your model.
03 · Global fallback
Sidecar vision model for text-only agents
Some agents must run on a text-only reasoning model for cost or latency reasons but still need to read inbound photos. For those, configure a single sidecar vision model on the Agents page:
- Go to /koko-dashboard/pro/agents.
- In the Global vision fallback card, pick a vision-capable model (e.g. ollama/gemma4:31b or ollama/qwen2.5vl:7b).
- Click Save fallback.
The dropdown writes to agents.defaults.imageModel.primary. With it set, every agent whose primary is text-only routes inbound images through this sidecar; the agent receives a short text description of the image in its context.
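For orientation, here is a rough sketch of what that write amounts to in openclaw.json, expressed as a nested-key update. Only the agents.defaults.imageModel.primary path is taken from this article; the helper name is hypothetical and the real file will contain many other keys.

```python
import json


def set_sidecar(config: dict, model_ref: str) -> dict:
    """Set agents.defaults.imageModel.primary, creating parent keys if needed."""
    image_model = (
        config.setdefault("agents", {})
        .setdefault("defaults", {})
        .setdefault("imageModel", {})
    )
    image_model["primary"] = model_ref
    return config


# Start from an empty config purely for illustration.
sidecar_config = set_sidecar({}, "ollama/qwen2.5vl:7b")
print(json.dumps(sidecar_config, indent=2))
```

The printed structure shows the nesting to look for when inspecting the file by hand: an agents object, containing defaults, containing imageModel with a primary field.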
04 · Operator checklist
Quick checklist when an agent refuses to see images
- Confirm the model is vision-capable upstream. Use the /api/show probe in section 1. If the capability isn't there, pull a vision build; the OpenClaw flag alone won't add image support.
- Open the agent's Model tab. Set Image input (vision) to Vision enabled (text + image). Save.
- If the model truly is text-only, set a sidecar on the Agents page (Global vision fallback) so the agent still receives a description of the image.
- Re-test with a real inbound photo. If the agent still refuses, check /status on the gateway; the media line summarises why a capability was skipped (file too large, provider missing, etc.).
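The checklist order can also be written down as a small triage helper. This is a hedged sketch rather than a tool that ships with OpenClaw: the function name and the returned strings are made up for illustration, and it only encodes the decisions described in this article (it does not parse /status output).

```python
from typing import Optional, Sequence


def next_step(capabilities: Sequence[str], model_row: dict, sidecar: Optional[str]) -> str:
    """Suggest the next checklist action.

    capabilities: the array reported by the /api/show probe.
    model_row: the catalog row for the agent's primary model.
    sidecar: stands in for agents.defaults.imageModel.primary (None if unset).
    """
    upstream_vision = "vision" in capabilities
    inputs = model_row.get("input")  # None means Auto-detect

    if upstream_vision:
        if inputs is not None and "image" not in inputs:
            return "set Image input (vision) to Vision enabled"
        return "re-test with a real photo, then check /status"

    # The model is text-only upstream from here on.
    if inputs is not None and "image" in inputs:
        return "pull a vision build, or set the row to Text only"
    if not sidecar:
        return "set a Global vision fallback"
    return "re-test with a real photo, then check /status"


print(next_step(["completion", "vision"], {"id": "m", "input": ["text"]}, None))
print(next_step(["completion"], {"id": "m", "input": ["text"]}, None))
```

The first call points at the Model tab toggle, the second at the Global vision fallback card.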