Multimodal Input
Vision-capable chat models accept images inside a user message. The content becomes an array of typed parts instead of a single string. This guide covers the two ways to attach an image — by URL or as a base64 data URI — and how to mix text and images in a single turn.Two ways to attach an image
Public URL
Reference an image that the gateway can fetch over HTTPS. Best for static assets, CDN-hosted images, and pre-signed URLs.
Base64 data URI
Inline the image bytes as a base64 string. Best for ephemeral content, user uploads, and private assets.
URL example
Pass the image as a content part withtype: image_url:
Base64 example
Inline the bytes as a data URI:The
detail parameter controls how much compute the model spends on the image. low is faster and cheaper; high reads small text and dense details; auto lets the model pick.Multiple images in one turn
Add as many content parts as you need. The model reads them in order:Limits and gotchas
- Each model has a maximum number of images per request and a maximum resolution. Excess is rejected with
400 invalid_request_error. - Total request size — text plus encoded images — must stay under the per-request byte cap. For base64, encode the image first and check
len(b64) * 3 / 4against the limit. - When you mix text and images, keep the text part concise. The model reads both, but verbose prompts dilute the focus on the visual.
Streaming and tools
Vision works with streaming and with tool calls. The same content array works instream: true requests and in requests that include tools. The model emits text and tool deltas the same way it does for a text-only message, with the image already understood as part of the context.