Model Selection
No single model wins on every axis. The right choice depends on what you are building, how fast it must respond, and how much you can spend per request. This guide walks through the trade-offs, provides a curated capability table, and covers routing strategies.
The three axes
Latency
Time to first token (TTFT) and time to last token. Critical for chat UIs and voice agents.
Cost
Price per million input and output tokens. High-volume pipelines are usually cost-bound. See the live pricing page for current numbers.
Capability
Reasoning quality, instruction following, vision, tool use, JSON adherence, multilingual coverage.
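Latency and cost can be measured directly. The sketch below is one way to do it, assuming the reply arrives as an iterable of text chunks and that token counts and per-million prices are supplied by the caller. The names `measure_stream` and `request_cost` are hypothetical helpers, not part of any SDK, and any prices passed in are placeholders rather than real rates.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, full_text) for one streamed reply.

    `chunks` is assumed to be whatever iterable of text pieces your client
    yields while streaming; the clock is injectable so it can be faked in tests.
    """
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None and chunk:
            # The first non-empty chunk marks time to first token.
            ttft = clock() - start
        parts.append(chunk)
    total = clock() - start  # time to last token
    # If nothing non-empty arrived, fall back to the total duration.
    return (ttft if ttft is not None else total, total, "".join(parts))


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request, given prices per million input and output tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
```

Collect these numbers per candidate model over many requests and compare percentiles rather than single runs, since one measurement says little about tail latency.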
Decision flow
Pin the use case
Write down the user-visible behavior. “Extract structured fields from invoices” is a use case. “Use AI” is not.
Identify the constraint
Is latency, cost, or quality the binding constraint? You can usually only optimize one at a time.
Pick the smallest model that meets it
Start with the smallest model that satisfies the constraint, then escalate only if evals show quality gaps.
Run an offline eval
Build a 50-200 example eval set that mirrors production. Measure quality on every candidate model with the same prompt and the same judging criteria.
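The last two steps can be sketched as a small harness. This is a minimal illustration, assuming each candidate is wrapped in a plain function from prompt to output and that the judging criterion is a boolean function. `score_model`, `smallest_passing`, and the threshold are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Example = Tuple[str, str]          # (input text, expected output)
Generate = Callable[[str], str]    # wraps one model behind a fixed prompt
Judge = Callable[[str, str], bool]  # (actual, expected) -> pass or fail


def score_model(generate: Generate, examples: Sequence[Example], judge: Judge) -> float:
    """Fraction of eval examples that the judge accepts for one model."""
    if not examples:
        raise ValueError("eval set is empty")
    passed = sum(1 for text, expected in examples if judge(generate(text), expected))
    return passed / len(examples)


def smallest_passing(
    candidates: Sequence[Tuple[str, Generate]],
    examples: Sequence[Example],
    judge: Judge,
    threshold: float,
) -> Tuple[Optional[str], Dict[str, float]]:
    """Score every candidate and pick the first one that meets the threshold.

    `candidates` must be ordered smallest (cheapest) first. Every model is
    scored with the same examples and the same judge so results are comparable.
    """
    scores: Dict[str, float] = {}
    choice: Optional[str] = None
    for name, generate in candidates:
        scores[name] = score_model(generate, examples, judge)
        if choice is None and scores[name] >= threshold:
            choice = name
    return choice, scores
```

A return value of `None` for the choice means no candidate cleared the bar, which usually points at the prompt or the eval set rather than at model size alone.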
Model capability reference
Prices are not listed here because they change frequently. Always check the live pricing page for current per-token rates.
OpenAI (GPT) models
| Model | Best for | Notes |
|---|---|---|
| gpt-5.4 | Default chat, vision, tool use | Flagship model; best overall quality across tasks |
| gpt-5.4-mini | High-volume classification, routing, extraction | Fastest and most cost-efficient in the GPT line |
| gpt-5.5 | Cutting-edge quality requirements | Latest generation; highest capability ceiling |
| gpt-5.3-codex | Code generation, debugging, refactoring | Optimised for programming tasks |
Google Gemini models
| Model | Best for | Notes |
|---|---|---|
| gemini-3.1-pro-preview | Long-context analysis, careful instruction following | Excellent for large document understanding |
| gemini-2.5-pro | High-quality multimodal tasks | Strong vision and reasoning combination |
| gemini-2.5-flash | Fast multimodal at lower cost | Good balance of speed and quality |
| gemini-3-flash-preview | Lowest-latency Gemini tasks | Latest Flash generation; fast and affordable |
DeepSeek models
| Model | Best for | Notes |
|---|---|---|
| deepseek-v4-pro | Multi-step reasoning, planning, hard math | Advanced reasoning; good alternative to heavy frontier models |
| deepseek-v4-flash | Cost-sensitive reasoning tasks | Lighter-weight DeepSeek; faster and cheaper than Pro |
Routing strategies
A single application often needs more than one model. Common patterns:
- Tiered routing. Use a small, cheap model to classify the request, then escalate to a larger model only when the small model is not confident. Cuts cost without hurting quality on the hard cases.
- Cascading fallback. Try the cheapest model first. If it returns a low confidence score, retry with the next model up. Useful for unpredictable traffic.
- Specialist per task. Different models for different capabilities: gpt-5.4 for chat, gemini-2.5-pro for multimodal analysis, deepseek-v4-pro for reasoning-heavy tasks. The gateway dispatches each call to the right upstream.
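The cascading and specialist patterns can be sketched in a few lines. This is an illustration under stated assumptions: each model is wrapped in a function returning a `(text, confidence)` pair, and how that confidence is produced (a classifier score, a self-rating, a validation check) is left to the application. `Reply`, `cascade`, `pick_specialist`, the 0.8 threshold, and the fallback default are all hypothetical; only the task-to-model mapping comes from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A wrapped model: prompt -> (text, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class Reply:
    model: str
    text: str
    confidence: float


def cascade(
    prompt: str,
    tiers: Sequence[Tuple[str, ModelFn]],
    min_confidence: float = 0.8,
) -> Reply:
    """Try tiers cheapest first and stop at the first confident reply.

    If no tier is confident, the reply from the last (largest) tier is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    reply: Optional[Reply] = None
    for name, call in tiers:
        text, confidence = call(prompt)
        reply = Reply(name, text, confidence)
        if confidence >= min_confidence:
            break
    assert reply is not None  # guaranteed because tiers is non-empty
    return reply


# Specialist per task, using the mapping described in the list above.
SPECIALISTS = {
    "chat": "gpt-5.4",
    "multimodal": "gemini-2.5-pro",
    "reasoning": "deepseek-v4-pro",
}


def pick_specialist(task: str, default: str = "gpt-5.4") -> str:
    """Return the model name for a task label; the default is an assumption."""
    return SPECIALISTS.get(task, default)
```

Tiered routing differs only in that the first tier classifies the request instead of answering it. In both cases the escalation threshold is a tunable that should be set from eval results, not guessed.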
What to do when evals disagree
- If two models tie on quality, pick the cheaper one.
- If a model wins on a synthetic benchmark but loses on your real eval, trust your eval.
- If a vendor publishes a flashy new model, wait for third-party confirmation before adopting it for production.
- If you must move fast, shadow the new model on 5-10% of traffic and compare user-level outcomes before fully migrating.
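One way to pick the shadow slice is to hash a stable user identifier, so the same users stay in the slice across requests and outcomes can be compared per user. This is a minimal sketch. `in_shadow` and the salt value are hypothetical, and it assumes a user ID is available on every request.

```python
import hashlib


def in_shadow(user_id: str, percent: float, salt: str = "shadow-v1") -> bool:
    """Return True if this user falls inside the shadow slice.

    `percent` is a percentage of users (for example 5 or 10). The assignment is
    deterministic: the same user_id and salt always give the same answer.
    Change the salt to draw a fresh slice for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < round(percent * 100)  # 5% -> buckets 0..499
```

For users in the slice, send the request to both models, serve the current model's reply, and log the new model's reply for offline comparison.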