Streaming

Streaming lets your UI render tokens as soon as the model produces them, instead of waiting for the full response. The InfinityBlue gateway streams chat completions over Server-Sent Events (SSE) using the same format as the OpenAI streaming API.

Why stream

  • Lower perceived latency. Users see the first word in ~300 ms instead of waiting 2-5 seconds.
  • Faster cancellation. Stop paying for in-flight tokens the moment the user navigates away.
  • Better for long answers. A 1,000-token response feels sluggish when it arrives as a single block but responsive when it streams.

Enable streaming

Set "stream": true in the request body. The server responds with Content-Type: text/event-stream and emits an event for each chunk as it is generated, typically one token at a time.
curl https://api.getinfinityblue.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"gpt-5.4","stream":true,"messages":[{"role":"user","content":"Write a haiku about streaming."}]}'
Each event is a single line that starts with data: followed by a JSON chunk. The stream ends with a literal data: [DONE].
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Whisper"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" of"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
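
To make the wire format concrete, here is a minimal Python sketch that turns a sequence of SSE lines like the ones above into text. The helper name iter_content is hypothetical, and the sketch assumes the event layout shown in the sample (one data: line per event, a trailing [DONE] sentinel); it is not the gateway's or the SDK's own parser.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a sequence of SSE lines (hypothetical helper)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel, deliberately not JSON
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Trimmed copies of the sample events above (id and object fields omitted).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Whisper"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" of"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
print("".join(iter_content(sample)))  # prints: Whisper of
```

The first event carries only the role and the last only a finish_reason, so neither contributes text; the output is the concatenation of the content deltas in between.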

Client examples

The OpenAI SDK handles the parsing for you. Stream chunks arrive as delta objects you append to your local state.
from openai import OpenAI
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.getinfinityblue.com/v1")
stream = client.chat.completions.create(
    model="gpt-5.4", stream=True, messages=[{"role": "user", "content": "Write a haiku about streaming."}],
)
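for chunk in stream:
    if not chunk.choices:  # skip chunks that carry no choices
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

// The same request with the Node.js SDK:
import OpenAI from "openai";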
const client = new OpenAI({ apiKey: "YOUR_API_KEY", baseURL: "https://api.getinfinityblue.com/v1" });
const stream = await client.chat.completions.create({
  model: "gpt-5.4", stream: true, messages: [{ role: "user", content: "Write a haiku about streaming." }],
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content ?? "";
  if (content) process.stdout.write(content);
}

Use fetch in the browser

If you cannot ship an SDK, the raw fetch API works. Parse each data: line as a separate JSON object and append the delta.
const response = await fetch("https://api.getinfinityblue.com/v1/chat/completions", {
  method: "POST",
  headers: { "Authorization": `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-5.4", stream: true,
    messages: [{ role: "user", content: "Write a haiku about streaming." }],
  }),
});
if (!response.ok || !response.body) {
  throw new Error(`Request failed with status ${response.status}`);
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
const output = document.getElementById("output");
let buffer = "";
let finished = false;
while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? ""; // keep the last (possibly incomplete) line
  for (const rawLine of lines) {
    const line = rawLine.trimEnd(); // tolerate \r\n line endings
    if (!line.startsWith("data: ")) continue; // skip blank separator lines
    const payload = line.slice(6);
    if (payload === "[DONE]") { finished = true; break; } // end-of-stream sentinel, not JSON
    const chunk = JSON.parse(payload);
    const delta = chunk.choices?.[0]?.delta?.content ?? "";
    if (delta) output.textContent += delta;
  }
}

Stream with tool calls and structured output

When the model produces a tool call, its JSON arguments arrive as fragments spread across several deltas. Concatenate the fragments for each tool call until they form the same object a non-streaming response would have returned, and execute the tool only once finish_reason is tool_calls. Structured output also streams token by token, so parse it only after the stream finishes; the final parsed object matches what response_format declared.
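
The accumulation step can be sketched in Python as follows. This assumes the gateway mirrors the OpenAI chunk layout for tool calls (a delta.tool_calls list whose entries carry an index, an id and function.name on the first fragment, and function.arguments string pieces afterwards). The helper collect_tool_calls, the get_weather tool, and the sample chunks are hypothetical and serve only as illustration.

```python
import json
from typing import Any, Dict, Iterable, List


def collect_tool_calls(chunks: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge streamed tool-call fragments; return [] unless the stream ended with tool_calls."""
    calls: Dict[int, Dict[str, str]] = {}
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        choice = choices[0]
        for frag in choice.get("delta", {}).get("tool_calls") or []:
            # Fragments of the same call share an index (assumed layout).
            call = calls.setdefault(frag["index"], {"id": "", "name": "", "arguments": ""})
            fn = frag.get("function") or {}
            call["id"] = frag.get("id") or call["id"]
            call["name"] = fn.get("name") or call["name"]
            call["arguments"] += fn.get("arguments") or ""
        finish_reason = choice.get("finish_reason") or finish_reason
    if finish_reason != "tool_calls":
        return []  # never execute a tool from a partial or unfinished stream
    return [
        {
            "id": calls[i]["id"],
            "name": calls[i]["name"],
            # The arguments are valid JSON only after every fragment has arrived.
            "arguments": json.loads(calls[i]["arguments"] or "{}"),
        }
        for i in sorted(calls)
    ]


def _chunk(delta: Dict[str, Any], finish_reason: Any = None) -> Dict[str, Any]:
    """Build a hypothetical chunk dict for the demo below."""
    return {"choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}]}


sample = [
    _chunk({"tool_calls": [{"index": 0, "id": "call_1", "type": "function",
                            "function": {"name": "get_weather", "arguments": ""}}]}),
    _chunk({"tool_calls": [{"index": 0, "function": {"arguments": '{"city": "Os'}}]}),
    _chunk({"tool_calls": [{"index": 0, "function": {"arguments": 'lo"}'}}]}),
    _chunk({}, finish_reason="tool_calls"),
]
print(collect_tool_calls(sample))
# [{'id': 'call_1', 'name': 'get_weather', 'arguments': {'city': 'Oslo'}}]
```

Deferring json.loads until finish_reason arrives is the point of the design: any earlier prefix of the arguments string, such as {"city": "Os, would fail to parse.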
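If you abort a stream mid-response, the model may already have consumed the prompt tokens, and that input is still billed. Cancelling only saves the output tokens that have not been generated yet, so cancel as soon as you know the response is unwanted, not after a long pause.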
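
One way to cancel early from Python is to stop iterating and close the stream as soon as a cancellation flag flips. The sketch below is illustrative only: read_until_cancelled and FakeStream are hypothetical names, and it assumes the stream object exposes a close() method (it falls back to simply stopping iteration when it does not).

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable


def read_until_cancelled(stream: Iterable[Any], cancelled: Callable[[], bool]) -> str:
    """Collect streamed text, closing the stream as soon as cancelled() returns True."""
    parts = []
    try:
        for chunk in stream:
            if cancelled():
                break  # stop reading immediately instead of draining the response
            if chunk.choices and chunk.choices[0].delta.content:
                parts.append(chunk.choices[0].delta.content)
    finally:
        close = getattr(stream, "close", None)
        if callable(close):
            close()  # release the connection if the stream object supports it
    return "".join(parts)


class FakeStream:
    """Stand-in for an SDK stream, used only to exercise the helper offline."""

    def __init__(self, words):
        self._words = words
        self.closed = False

    def __iter__(self):
        for word in self._words:
            delta = SimpleNamespace(content=word)
            yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

    def close(self):
        self.closed = True


fake = FakeStream(["a", "b", "c", "d"])
flags = iter([False, False, True])  # the "user" cancels while the third chunk arrives
text = read_until_cancelled(fake, cancelled=lambda: next(flags))
print(text, fake.closed)  # prints: ab True
```

In a real application the cancelled callback would typically read a flag set by the UI, for example when the user closes the view.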