Model Selection

No single model wins on every axis. The right choice depends on what you are building, how fast it must respond, and how much you can spend per request. This guide walks through the trade-offs, a decision flow, a curated capability reference, and common routing strategies.

The three axes

Latency

Time to first token (TTFT) and time to last token. Critical for chat UIs and voice agents.
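Both numbers can be captured by timing a streamed response. The sketch below is a minimal illustration, assuming the client exposes the response as an iterable of tokens; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only stands in for a real streaming client.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed response."""
    start = clock()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = clock() - start  # the first token has just arrived
        count += 1
    total = clock() - start  # time to last token
    # An empty stream has no first token, so report the total time for both.
    return (ttft if ttft is not None else total, total, count)


def fake_stream(n: int, first_delay: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: waits, then yields n tokens."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_stream(fake_stream(5, first_delay=0.05, gap=0.01))
    print(f"TTFT={ttft:.3f}s total={total:.3f}s tokens={n}")
```

For a chat UI, TTFT is usually the number users feel first; for a voice agent, both matter because speech cannot finish before the last token arrives.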

Cost

Price per million input and output tokens. High-volume pipelines are usually cost-bound. See the live pricing page for current numbers.
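Per-request cost follows directly from the per-million-token prices. The sketch below shows the arithmetic; the `Price` values in the usage example are placeholders chosen for illustration, not real rates for any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_million: float   # currency units per 1M input tokens
    output_per_million: float  # currency units per 1M output tokens


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in the same currency units as the price."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def monthly_cost(
    price: Price,
    input_tokens: int,
    output_tokens: int,
    requests_per_day: int,
    days: int = 30,
) -> float:
    """Projected spend for a steady workload of identical requests."""
    return request_cost(price, input_tokens, output_tokens) * requests_per_day * days


if __name__ == "__main__":
    placeholder = Price(input_per_million=1.0, output_per_million=4.0)  # not a real rate
    print(request_cost(placeholder, input_tokens=2_000, output_tokens=500))
    print(monthly_cost(placeholder, 2_000, 500, requests_per_day=100_000))
```

Output tokens are often priced higher than input tokens, so pipelines that generate long responses can be dominated by the output term even when prompts are large.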

Capability

Reasoning quality, instruction following, vision, tool use, JSON adherence, multilingual coverage.

Most production systems optimize a weighted blend of the three. A classification job that runs in batch overnight is cost-bound; a customer-facing assistant is latency-bound; an agent that plans multi-step actions is capability-bound.
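One way to make the weighted blend concrete is to score each candidate on the three axes and combine the scores with workload-specific weights. The sketch below assumes every axis has already been normalized to 0..1 with higher meaning better (so a cheap model gets a high cost score). The weights, profile names, and candidate scores are all illustrative assumptions.

```python
from typing import Dict

AxisScores = Dict[str, float]  # axis name -> normalized score in [0, 1]

# Hypothetical weight profiles for the three workloads described above.
PROFILES: Dict[str, AxisScores] = {
    "batch_classification": {"latency": 0.1, "cost": 0.7, "capability": 0.2},
    "customer_assistant": {"latency": 0.6, "cost": 0.2, "capability": 0.2},
    "planning_agent": {"latency": 0.1, "cost": 0.2, "capability": 0.7},
}


def blended_score(scores: AxisScores, weights: AxisScores) -> float:
    """Weighted average of the axis scores; weights need not sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[axis] * scores[axis] for axis in weights) / total


def best_model(candidates: Dict[str, AxisScores], weights: AxisScores) -> str:
    """Name of the candidate with the highest blended score."""
    return max(candidates, key=lambda name: blended_score(candidates[name], weights))


if __name__ == "__main__":
    # Made-up scores for two hypothetical candidates.
    candidates = {
        "small": {"latency": 0.9, "cost": 0.9, "capability": 0.5},
        "large": {"latency": 0.4, "cost": 0.3, "capability": 0.95},
    }
    for workload, weights in PROFILES.items():
        print(workload, "->", best_model(candidates, weights))
```

The scores themselves should come from your own measurements; the blend only makes the trade-off explicit and repeatable.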

Decision flow

1. Pin the use case

Write down the user-visible behavior. “Extract structured fields from invoices” is a use case. “Use AI” is not.

2. Identify the constraint

Is latency, cost, or capability the binding constraint? You can usually optimize for only one at a time.

3. Pick the smallest model that meets it

Start with the smallest model that satisfies the constraint, then escalate only if evals show quality gaps.

4. Run an offline eval

Build an eval set of 50 to 200 examples that mirrors production. Measure quality on every candidate model with the same prompt and the same judging criteria.

5. Re-evaluate on a schedule

New model revisions land every few weeks. Re-run the eval on each release and promote a new revision when quality improves at a similar cost.
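Steps 3 and 4 can be sketched as a small harness: score every candidate on the same examples with the same judge, then take the first model, ordered smallest to largest, that clears a quality bar. The `Model` callable, the exact-match judge, and the threshold are assumptions for illustration; a real harness would call your provider's client and use whatever judging criteria fit the task.

```python
from typing import Callable, Optional, Sequence, Tuple

Example = Tuple[str, str]           # (input, expected output)
Model = Callable[[str], str]        # hypothetical interface: prompt in, completion out
Judge = Callable[[str, str], bool]  # (actual, expected) -> pass or fail


def exact_match(actual: str, expected: str) -> bool:
    """Simplest possible judge; swap in task-specific criteria as needed."""
    return actual.strip() == expected.strip()


def pass_rate(model: Model, examples: Sequence[Example], judge: Judge = exact_match) -> float:
    """Fraction of examples the model gets right under the given judge."""
    if not examples:
        raise ValueError("eval set is empty")
    passed = sum(judge(model(prompt), expected) for prompt, expected in examples)
    return passed / len(examples)


def smallest_passing(
    models_small_to_large: Sequence[Tuple[str, Model]],
    examples: Sequence[Example],
    threshold: float,
    judge: Judge = exact_match,
) -> Optional[str]:
    """Return the first (smallest) model whose pass rate meets the threshold."""
    for name, model in models_small_to_large:
        if pass_rate(model, examples, judge) >= threshold:
            return name
    return None  # no candidate is good enough; revisit the prompt or the bar


if __name__ == "__main__":
    examples = [("2+2", "4"), ("3+3", "6"), ("capital of France", "Paris"), ("5*5", "25")]
    small_answers = {"2+2": "4", "3+3": "6"}
    large_answers = dict(examples)
    # Stub models standing in for real API calls.
    candidates = [
        ("small", lambda p: small_answers.get(p, "?")),
        ("large", lambda p: large_answers.get(p, "?")),
    ]
    print(smallest_passing(candidates, examples, threshold=0.75))
```

Re-running the same harness on each new model revision covers step 5: the examples and judge stay fixed, and only the candidate list changes.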

Model capability reference

Prices are not listed here because they change frequently. Always check the live pricing page for current per-token rates.

OpenAI (GPT) models

Model | Best for | Notes
gpt-5.4 | Default chat, vision, tool use | Flagship model; best overall quality across tasks
gpt-5.4-mini | High-volume classification, routing, extraction | Fastest and most cost-efficient in the GPT line
gpt-5.5 | Cutting-edge quality requirements | Latest generation; highest capability ceiling
gpt-5.3-codex | Code generation, debugging, refactoring | Optimised for programming tasks

Google Gemini models

Model | Best for | Notes
gemini-3.1-pro-preview | Long-context analysis, careful instruction following | Excellent for large document understanding
gemini-2.5-pro | High-quality multimodal tasks | Strong vision and reasoning combination
gemini-2.5-flash | Fast multimodal at lower cost | Good balance of speed and quality
gemini-3-flash-preview | Lowest-latency Gemini tasks | Latest Flash generation; fast and affordable

DeepSeek models

Model | Best for | Notes
deepseek-v4-pro | Multi-step reasoning, planning, hard math | Advanced reasoning; good alternative to heavy frontier models
deepseek-v4-flash | Cost-sensitive reasoning tasks | Lighter-weight DeepSeek; faster and cheaper than Pro

Routing strategies

A single application often needs more than one model. Common patterns:
  • Tiered routing. Use a small, cheap model to classify the request, then escalate to a larger model only when the small model is not confident. Cuts cost without hurting quality on the hard cases.
  • Cascading fallback. Try the cheapest model first. If it returns a low confidence score, retry with the next model up. Useful for unpredictable traffic.
  • Specialist per task. Different models for different capabilities: gpt-5.4 for chat, gemini-2.5-pro for multimodal analysis, deepseek-v4-pro for reasoning-heavy tasks. The gateway dispatches each call to the right upstream.

Teams that revisit model selection quarterly often cut costs by 2-3x without sacrificing quality, simply by retiring older defaults that were never updated.
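The tiered-routing and cascading-fallback patterns share one core loop: try the cheapest tier first and escalate only while confidence stays below a threshold. The sketch below assumes each model is wrapped in a callable that returns an answer together with a confidence score in 0..1. How that score is obtained (log probabilities, a self-rating, a separate verifier) is left open, and `cascade` is a hypothetical helper, not a gateway API.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: prompt in, (answer, confidence in [0, 1]) out.
ScoredModel = Callable[[str], Tuple[str, float]]


def cascade(
    prompt: str,
    tiers: Sequence[Tuple[str, ScoredModel]],
    min_confidence: float = 0.8,
) -> Tuple[str, str, List[str]]:
    """Try tiers from cheapest to most capable.

    Returns (name of the model that answered, its answer, names of all tiers tried).
    If no tier is confident enough, the last tier's answer is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    tried: List[str] = []
    name, answer = "", ""
    for name, model in tiers:
        tried.append(name)
        answer, confidence = model(prompt)
        if confidence >= min_confidence:
            break  # confident enough, so stop escalating
    return name, answer, tried


if __name__ == "__main__":
    # Stub models: the small one is only confident on short prompts.
    def small(prompt: str) -> Tuple[str, float]:
        return ("spam", 0.95) if len(prompt) < 10 else ("unsure", 0.4)

    def large(prompt: str) -> Tuple[str, float]:
        return ("ham", 0.9)

    tiers = [("small", small), ("large", large)]
    print(cascade("hi", tiers))
    print(cascade("a long ambiguous message", tiers))
```

Logging the `tried` list per request shows how often escalation happens, which is the number that determines whether the cascade actually saves money.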

What to do when evals disagree

  • If two models tie on quality, pick the cheaper one.
  • If a model wins on a synthetic benchmark but loses on your real eval, trust your eval.
  • If a vendor publishes a flashy new model, wait for third-party confirmation before adopting it for production.
  • If you must move fast, shadow the new model on 5-10% of traffic and compare user-level outcomes before fully migrating.
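A traffic shadow needs a stable way to decide which requests also go to the new model. One common approach, sketched below under the assumption that every request carries a stable ID, is to hash the ID into a bucket so the same request always gets the same decision. `in_shadow` and the salt value are hypothetical names for illustration.

```python
import hashlib


def in_shadow(request_id: str, percent: float, salt: str = "shadow-v1") -> bool:
    """Deterministically select roughly `percent`% of requests for shadowing."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # e.g. 5% selects buckets 0..499


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    selected = sum(in_shadow(r, percent=5) for r in ids)
    print(f"{selected / len(ids):.1%} of requests shadowed")
```

Because the bucket depends only on the salt and the ID, raising the percentage from 5 to 10 keeps every previously shadowed request in the sample, and changing the salt draws a fresh sample.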