Model Selection

No single model wins on every axis. The right choice depends on what you are building, how fast it must respond, and how much you can spend per request. This guide walks through the trade-offs and ends with a curated capability table.

The three axes

Latency

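Measured as time to first token (TTFT) and time to last token (total generation time). TTFT governs how responsive a chat UI or voice agent feels; time to last token governs how long the full answer takes.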
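As a rough illustration, both numbers can be measured by timing any iterable of streamed tokens. The sketch below assumes the response arrives as a plain Python iterable; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only stands in for a real streaming client.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream."""
    start = clock()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # The first token has arrived: record time to first token.
            ttft = clock() - start
        count += 1
    total = clock() - start
    # An empty stream has no first token, so fall back to the total time.
    return (ttft if ttft is not None else total, total, count)


def fake_stream(n: int, delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API response."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_stream(fake_stream(5, 0.01))
    print(f"TTFT={ttft:.3f}s total={total:.3f}s tokens={count}")
```

The `clock` parameter is injectable so the measurement logic can be checked without real waiting.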

Cost

Price per million input and output tokens. High-volume pipelines are usually cost-bound. See the live pricing page for current numbers.
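The per-request arithmetic is simple enough to sketch. The prices and traffic volume below are placeholders chosen for illustration, not real rates for any model; substitute the numbers from the live pricing page.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request in USD, given prices per million tokens."""
    return (
        input_tokens * input_price_per_m
        + output_tokens * output_price_per_m
    ) / 1_000_000


# Placeholder prices (USD per million tokens); NOT real rates.
PLACEHOLDER_INPUT_PRICE = 1.00
PLACEHOLDER_OUTPUT_PRICE = 4.00

per_request = request_cost_usd(
    2_000, 500, PLACEHOLDER_INPUT_PRICE, PLACEHOLDER_OUTPUT_PRICE
)
# Hypothetical volume of 100,000 requests per day.
per_day = per_request * 100_000
print(f"per request: ${per_request:.4f}, per day: ${per_day:.2f}")
```

Because output tokens are typically priced separately from input tokens, it helps to track both counts rather than a single total.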

Capability

Reasoning quality, instruction following, vision, tool use, JSON adherence, multilingual coverage.

Most production systems optimize a weighted blend of the three. A classification job that runs in batch overnight is cost-bound; a customer-facing assistant is latency-bound; an agent that plans multi-step actions is capability-bound.
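One way to make the weighted blend concrete is a simple weighted average over normalized axis scores. This is a minimal sketch under stated assumptions: each axis is scored from 0 to 1 (higher is better), and the candidate names, scores, and weights are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AxisScores:
    """Normalized scores in [0, 1]; higher is better on every axis."""
    latency: float     # 1.0 = fastest candidate
    cost: float        # 1.0 = cheapest candidate
    capability: float  # 1.0 = best eval quality


def blended_score(scores: AxisScores, weights: Dict[str, float]) -> float:
    """Weighted average of the three axes; weights need not sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (
        weights["latency"] * scores.latency
        + weights["cost"] * scores.cost
        + weights["capability"] * scores.capability
    ) / total


# Hypothetical candidates and workload profiles.
CANDIDATES = {
    "small-model": AxisScores(latency=0.9, cost=0.9, capability=0.5),
    "large-model": AxisScores(latency=0.4, cost=0.3, capability=0.95),
}
BATCH_JOB = {"latency": 0.1, "cost": 0.7, "capability": 0.2}
AGENT = {"latency": 0.1, "cost": 0.1, "capability": 0.8}


def best_for(weights: Dict[str, float]) -> str:
    """Name of the candidate with the highest blended score."""
    return max(CANDIDATES, key=lambda n: blended_score(CANDIDATES[n], weights))
```

With these made-up numbers, the cost-heavy batch profile favors the small model and the capability-heavy agent profile favors the large one, which mirrors the examples in the paragraph above.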

Decision flow

Pin the use case

Write down the user-visible behavior. “Extract structured fields from invoices” is a use case. “Use AI” is not.

Identify the constraint

Decide which of latency, cost, or quality is the binding constraint. Optimize for that one and treat the other two as thresholds to meet; optimizing all three at once rarely works.

Pick the smallest model that meets it

Start with the smallest model that satisfies the constraint, then escalate only if evals show quality gaps.
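The rule can be sketched as a scan over candidates ordered from smallest to largest. The quality numbers are assumed to come from your own eval, and the model names here are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def smallest_passing(
    candidates: Sequence[Tuple[str, float]],
    min_quality: float,
) -> Optional[str]:
    """Return the first (smallest) model whose eval quality meets the bar.

    `candidates` must be ordered smallest/cheapest first. Returns None
    when no candidate passes, which signals that the bar or the
    candidate list needs revisiting.
    """
    for name, quality in candidates:
        if quality >= min_quality:
            return name
    return None


# Hypothetical eval results, ordered smallest first.
results = [("small-model", 0.81), ("medium-model", 0.90), ("large-model", 0.94)]
print(smallest_passing(results, min_quality=0.88))
```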

Run an offline eval

Build a 50-200 example eval set that mirrors production. Measure quality on every candidate model with the same prompt and the same judging criteria.
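A minimal harness for this step might look like the following. It assumes each model is wrapped as a plain callable from prompt to text and uses exact match as the judging criterion; both are simplifications, and the stub models stand in for real API clients.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (input, expected output)
Model = Callable[[str], str]       # prompt -> model output
Judge = Callable[[str, str], bool]


def exact_match(output: str, expected: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    models: Dict[str, Model],
    examples: List[Example],
    prompt_template: str,
    judge: Judge = exact_match,
) -> Dict[str, float]:
    """Score every model with the same prompt and the same judge."""
    if not examples:
        raise ValueError("eval set is empty")
    scores = {}
    for name, model in models.items():
        passed = sum(
            judge(model(prompt_template.format(input=x)), expected)
            for x, expected in examples
        )
        scores[name] = passed / len(examples)
    return scores


# Stub models standing in for real API calls.
def always_four(prompt: str) -> str:
    return "4"


def lookup(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"


EXAMPLES = [("2+2", "4"), ("capital of France", "Paris")]
scores = run_eval(
    {"stub-a": always_four, "stub-b": lookup},
    EXAMPLES,
    prompt_template="Answer briefly: {input}",
)
```

Keeping the prompt template and judge as arguments makes it harder to accidentally compare models under different conditions.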

Re-evaluate on a schedule

New model revisions land every few weeks. Re-run the same eval on each release, and promote the new revision only when it improves quality at a similar cost.
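The promotion rule can be written down explicitly so it is applied the same way on every release. The thresholds below (a 1-point minimum quality gain and a 10% cost tolerance) are assumptions to tune, not recommendations from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    quality: float               # fraction of eval examples passed
    cost_per_1k_requests: float  # USD, measured on the same eval set


def should_promote(
    current: EvalResult,
    candidate: EvalResult,
    min_quality_gain: float = 0.01,   # assumed threshold
    max_cost_increase: float = 0.10,  # assumed tolerance (10%)
) -> bool:
    """Promote only if quality improves and cost stays roughly similar."""
    quality_ok = candidate.quality - current.quality >= min_quality_gain
    cost_ok = candidate.cost_per_1k_requests <= (
        current.cost_per_1k_requests * (1 + max_cost_increase)
    )
    return quality_ok and cost_ok
```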

Model capability reference

Prices are not listed here because they change frequently. Always check the live pricing page for current per-token rates.

OpenAI (GPT) models

Model	Best for	Notes
`gpt-5.4`	Default chat, vision, tool use	Flagship model; best overall quality across tasks
`gpt-5.4-mini`	High-volume classification, routing, extraction	Fastest and most cost-efficient in the GPT line
`gpt-5.5`	Cutting-edge quality requirements	Latest generation; highest capability ceiling
`gpt-5.3-codex`	Code generation, debugging, refactoring	Optimised for programming tasks

Google Gemini models

Model	Best for	Notes
`gemini-3.1-pro-preview`	Long-context analysis, careful instruction following	Excellent for large document understanding
`gemini-2.5-pro`	High-quality multimodal tasks	Strong vision and reasoning combination
`gemini-2.5-flash`	Fast multimodal at lower cost	Good balance of speed and quality
`gemini-3-flash-preview`	Lowest-latency Gemini tasks	Latest Flash generation; fast and affordable

DeepSeek models

Model	Best for	Notes
`deepseek-v4-pro`	Multi-step reasoning, planning, hard math	Advanced reasoning; good alternative to heavy frontier models
`deepseek-v4-flash`	Cost-sensitive reasoning tasks	Lighter-weight DeepSeek; faster and cheaper than Pro

Routing strategies

A single application often needs more than one model. Common patterns:

Tiered routing. Use a small, cheap model to classify the request, then escalate to a larger model only when the small model is not confident. Cuts cost without hurting quality on the hard cases.
Cascading fallback. Try the cheapest model first. If it returns a low confidence score, retry with the next model up. Useful for unpredictable traffic.
Specialist per task. Different models for different capabilities: `gpt-5.4` for chat, `gemini-2.5-pro` for multimodal analysis, `deepseek-v4-pro` for reasoning-heavy tasks. The gateway dispatches each call to the right upstream.
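The cascading and specialist patterns above can be sketched in a few lines. This assumes each model call returns an answer together with a confidence score between 0 and 1; how that score is obtained is left open, since it depends on your setup. `cascade`, `pick_specialist`, and the stub tiers are hypothetical names, while the specialist mapping reuses the models named in the list above.

```python
from typing import Callable, Dict, List, Tuple

# A model call returns (answer, confidence in [0, 1]). The confidence
# signal is an assumption; real APIs may not expose one directly.
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(
    request: str,
    tiers: List[Tuple[str, ModelFn]],
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Try tiers cheapest-first; stop at the first confident answer.

    Returns (model_name, answer). If no tier is confident, the answer
    from the last (largest) tier is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer, confidence = model(request)
        if confidence >= min_confidence:
            break
    return name, answer


# Specialist-per-task dispatch table, using the models named above.
SPECIALISTS: Dict[str, str] = {
    "chat": "gpt-5.4",
    "multimodal": "gemini-2.5-pro",
    "reasoning": "deepseek-v4-pro",
}


def pick_specialist(task: str, default: str = "gpt-5.4") -> str:
    """Map a task label to a model name, falling back to a default."""
    return SPECIALISTS.get(task, default)


# Stub tiers standing in for real model calls.
def small_stub(request: str) -> Tuple[str, float]:
    return ("short answer", 0.9 if len(request) < 40 else 0.3)


def large_stub(request: str) -> Tuple[str, float]:
    return ("detailed answer", 0.95)


TIERS = [("small", small_stub), ("large", large_stub)]
```

In this sketch a short request is answered by the small tier, while a long one falls through to the large tier. Tiered routing differs only in that the first model classifies the request instead of attempting an answer.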

Teams that revisit model selection quarterly often find 2-3x cost savings without sacrificing quality, simply by retiring stale defaults that were never updated.

What to do when evals disagree

If two models tie on quality, pick the cheaper one.
If a model wins on a synthetic benchmark but loses on your real eval, trust your eval.
If a vendor publishes a flashy new model, wait for third-party confirmation before adopting it for production.
If you must move fast, shadow the new model on 5-10% of traffic and compare user-level outcomes before fully migrating.
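For the shadow-traffic rule, one common approach is to assign users to the shadow group by hashing a stable identifier, so the same user always lands in the same group. This is a sketch under that assumption; `in_shadow` and the salt value are hypothetical.

```python
import hashlib


def in_shadow(user_id: str, percent: float, salt: str = "shadow-v1") -> bool:
    """Deterministically place roughly `percent`% of users in the shadow group.

    Hashing (salt + user_id) keeps assignment stable across requests;
    changing the salt reshuffles the groups for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Example: route about 5% of users to the new model in shadow.
shadowed = sum(in_shadow(f"user-{i}", 5) for i in range(10_000))
print(f"{shadowed} of 10000 users in shadow")
```

Assigning by user rather than by request keeps each user's experience consistent, which makes user-level outcome comparisons easier to interpret.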
