Streaming

Streaming lets your UI render tokens as soon as the model produces them, instead of waiting for the full response. The InfinityBlue gateway streams chat completions over Server-Sent Events (SSE) using the same format as the OpenAI streaming API.

Why stream

Lower perceived latency. Users see the first word in ~300 ms instead of waiting 2-5 seconds.
Faster cancellation. Stop paying for in-flight tokens the moment the user navigates away.
Better for long answers. A 1,000-token response can take many seconds to arrive in one piece, but it feels responsive when it streams.

Enable streaming

Set "stream": true in the request body. The server responds with Content-Type: text/event-stream and sends a series of events, each carrying a small chunk of the response (usually one or a few tokens).

# -N (--no-buffer) makes curl print each event as soon as it arrives.
curl -N https://api.getinfinityblue.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-5.4","stream":true,"messages":[{"role":"user","content":"Write a haiku about streaming."}]}'
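Without an SDK or curl, the same request can be sent with Python's standard library. This is an illustrative sketch, not an official client: build_stream_request and iter_sse_lines are hypothetical helper names, and the key and model are the placeholders used above. It relies only on urllib.request, which decodes chunked transfer encoding for you.

```python
import json
import urllib.request
from typing import Iterator

API_URL = "https://api.getinfinityblue.com/v1/chat/completions"


def build_stream_request(api_key: str, prompt: str, model: str = "gpt-5.4") -> urllib.request.Request:
    """Build the same POST request as the curl example, with "stream": true in the body."""
    body = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_sse_lines(request: urllib.request.Request) -> Iterator[str]:
    """Yield decoded text/event-stream lines as they arrive (hypothetical helper)."""
    with urllib.request.urlopen(request) as response:
        for raw in response:  # HTTPResponse iterates one line at a time
            # A newline byte never occurs inside a multi-byte UTF-8 character,
            # so each complete line can be decoded on its own.
            yield raw.decode("utf-8").rstrip("\r\n")
```

Calling iter_sse_lines(build_stream_request("YOUR_API_KEY", "Write a haiku about streaming.")) yields the raw data: lines shown below, including the blank separators.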

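Each event is a line that starts with data: followed by one JSON chunk. Per the SSE format, consecutive events are separated by a blank line (omitted in the sample below). The first chunk sets delta.role, content chunks carry delta.content, and the final chunk has an empty delta plus a finish_reason. The stream ends with the literal sentinel data: [DONE], which is not JSON, so check for it before parsing.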

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Whisper"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" of"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
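A minimal Python sketch of parsing that wire format by hand, assuming (as in the sample) that every event fits on a single data: line. parse_sse_chunks and collect_text are hypothetical names; a complete SSE parser would also join multi-line data: fields and handle event: and id: fields.

```python
import json
from typing import Iterable, Iterator


def parse_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield each JSON chunk from text/event-stream lines, stopping at data: [DONE].

    Assumes every event fits on one data: line, as in the sample above.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators between events, ":" comments, other fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE strips a single leading space from the value
        if payload == "[DONE]":
            return  # sentinel: not JSON, and nothing meaningful follows
        yield json.loads(payload)


def collect_text(chunks: Iterable[dict]) -> str:
    """Concatenate choices[0].delta.content across chunks."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            parts.append(content)
    return "".join(parts)
```

Fed the five sample events above, with or without the blank separator lines, collect_text(parse_sse_chunks(lines)) returns "Whisper of".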

Client examples

The official OpenAI SDKs parse the event stream for you. Point them at the gateway with base_url (Python) or baseURL (JavaScript), then read each chunk's delta and append its content to your local state.

Python:

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.getinfinityblue.com/v1")
stream = client.chat.completions.create(
    model="gpt-5.4",
    stream=True,
    messages=[{"role": "user", "content": "Write a haiku about streaming."}],
)
for chunk in stream:
    if not chunk.choices:  # skip chunks without choices (e.g. a usage-only chunk)
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()

JavaScript (Node.js, ES module):

import OpenAI from "openai";

const client = new OpenAI({ apiKey: "YOUR_API_KEY", baseURL: "https://api.getinfinityblue.com/v1" });
const stream = await client.chat.completions.create({
  model: "gpt-5.4",
  stream: true,
  messages: [{ role: "user", content: "Write a haiku about streaming." }],
});
for await (const chunk of stream) {
  // Optional chaining covers chunks whose choices array is empty.
  const content = chunk.choices[0]?.delta?.content ?? "";
  if (content) process.stdout.write(content);
}
process.stdout.write("\n");
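The examples above print each delta and discard it. To keep the finished reply, for example to append it to your conversation history for the next turn, fold the deltas into local state. This sketch assumes the Python SDK's chunk shape (chunk.choices[0].delta with role and content attributes); StreamedMessage and accumulate are hypothetical helpers, not part of the SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamedMessage:
    """Local state rebuilt from streamed deltas (hypothetical helper type)."""
    role: Optional[str] = None
    content: str = ""
    finish_reason: Optional[str] = None

    def as_message(self) -> dict:
        # Chat-message shape you can append to the messages list of the next request.
        return {"role": self.role or "assistant", "content": self.content}


def accumulate(stream: Iterable) -> StreamedMessage:
    """Fold SDK-style chunks (objects with .choices[0].delta) into one message."""
    state = StreamedMessage()
    for chunk in stream:
        if not chunk.choices:
            continue  # e.g. a usage-only chunk
        choice = chunk.choices[0]
        delta = choice.delta
        if getattr(delta, "role", None):
            state.role = delta.role  # sent once, in the first chunk
        if getattr(delta, "content", None):
            state.content += delta.content
        if choice.finish_reason is not None:
            state.finish_reason = choice.finish_reason
    return state
```

With the Python client above, msg = accumulate(stream) gives you the assistant turn (use it instead of the print loop, since both consume the stream). Check msg.finish_reason to tell a normal ending ("stop") from a reply cut off by the token limit ("length").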

Use fetch in the browser

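If you cannot ship an SDK, the standard fetch API works. Read the response body incrementally, split it into lines, parse each data: line as its own JSON object, append the delta, and stop at data: [DONE]. Calling the gateway directly from the browser exposes your API key to anyone who opens the developer tools, so in production send the request from your own backend.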

const response = await fetch("https://api.getinfinityblue.com/v1/chat/completions", {
  method: "POST",
  headers: { "Authorization": `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-5.4", stream: true,
    messages: [{ role: "user", content: "Write a haiku about streaming." }],
  }),
});
if (!response.ok) throw new Error(`Request failed: ${response.status} ${await response.text()}`);

const output = document.getElementById("output");
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let finished = false;
while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps a multi-byte character split across reads intact.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? ""; // keep the last (possibly incomplete) line for the next read
  for (const rawLine of lines) {
    const line = rawLine.replace(/\r$/, ""); // tolerate CRLF line endings
    if (!line.startsWith("data:")) continue; // skip blank separators and SSE comments
    const payload = line.slice(5).trimStart();
    if (payload === "[DONE]") { finished = true; break; } // end-of-stream sentinel, not JSON
    const chunk = JSON.parse(payload);
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) output.textContent += delta;
  }
}
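The subtle part of that loop is the buffer: one reader.read() can end in the middle of a line, or even in the middle of a multi-byte UTF-8 character. The same technique as a Python sketch: the codecs module's incremental UTF-8 decoder plays the role of TextDecoder with { stream: true }, and only complete lines are released. iter_lines_from_bytes is a hypothetical name.

```python
import codecs
from typing import Iterable, Iterator


def iter_lines_from_bytes(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield complete text lines from byte chunks split at arbitrary points."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for raw in byte_chunks:
        # Like TextDecoder's { stream: true }: a partial multi-byte character at
        # the end of `raw` is held back until the next chunk completes it.
        buffer += decoder.decode(raw)
        *complete, buffer = buffer.split("\n")  # last piece may be an incomplete line
        for line in complete:
            yield line.rstrip("\r")  # tolerate CRLF line endings
    # Flush the decoder (raises UnicodeDecodeError on truncated input) and any last line.
    buffer += decoder.decode(b"", final=True)
    if buffer:
        yield buffer.rstrip("\r")
```

Splitting on "\n" and keeping the final piece in the buffer is exactly what lines.pop() does in the fetch loop; the difference is only where the partial-character bookkeeping lives.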

Stream with tool calls and structured output

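When the model calls a tool, the call streams as fragments in delta.tool_calls. Each fragment has an index identifying the call; the first fragment for a call carries its id and function.name, and later fragments append pieces of the function.arguments JSON string. Group fragments by index, concatenate the argument strings to rebuild the same object a non-streaming response would return, and parse and execute the tool only once finish_reason is tool_calls. Structured output also streams token by token as delta.content; the partial text is not valid JSON, but once the stream finishes, the concatenated text parses into an object matching the schema declared in response_format.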
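A Python sketch of that accumulation step, assuming the gateway follows the OpenAI chunk shape for delta.tool_calls (index on every fragment; id and function.name on the first; function.arguments split across the rest). It works on raw JSON chunks as parsed from the data: lines; collect_tool_calls is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional


def collect_tool_calls(chunks: Iterable[dict]) -> Optional[list[dict]]:
    """Merge streamed tool-call fragments into complete calls.

    Returns None unless the stream ended with finish_reason "tool_calls".
    """
    calls = {}  # fragment index -> partially assembled call
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        choice = choices[0]
        for frag in choice.get("delta", {}).get("tool_calls") or []:
            call = calls.setdefault(frag["index"], {"id": None, "name": "", "arguments": ""})
            if frag.get("id"):
                call["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                call["name"] = fn["name"]  # sent whole in the first fragment
            call["arguments"] += fn.get("arguments") or ""  # JSON arrives in pieces
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
    if finish_reason != "tool_calls":
        return None  # not safe to execute anything yet
    # Only now is each arguments string complete enough to parse.
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
        for _, c in sorted(calls.items())
    ]
```

Structured output follows the same pattern with delta.content: concatenate the text and call json.loads only after finish_reason is "stop", because a prefix of a JSON document will not parse.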

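If you abort a stream mid-response, you are still charged for the input: the model processes the entire prompt before it emits the first token, so those prompt tokens appear on your next bill even if you cancel immediately. Cancelling does stop further output, so cancel as soon as the user no longer needs the response rather than after a long pause.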
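One way to cancel early from Python, as a sketch: stop iterating as soon as you have what you need, then close the stream. The OpenAI Python SDK's Stream object has a close() method that closes the underlying HTTP response; whether the gateway then stops generating upstream is an assumption here, not something this page specifies. read_until_limit and its max_chars cutoff are hypothetical stand-ins for your app's real "user navigated away" signal.

```python
from typing import Iterable


def read_until_limit(stream: Iterable, max_chars: int) -> str:
    """Consume SDK-style chunks until max_chars of content arrive, then close the stream.

    max_chars is a hypothetical cutoff standing in for a real cancellation signal.
    """
    parts: list[str] = []
    received = 0
    try:
        for chunk in stream:
            if not chunk.choices:
                continue
            content = chunk.choices[0].delta.content
            if content:
                parts.append(content)
                received += len(content)
                if received >= max_chars:
                    break  # stop reading as soon as we have enough
    finally:
        # openai.Stream exposes close(); calling it releases the HTTP connection.
        close = getattr(stream, "close", None)
        if callable(close):
            close()
    return "".join(parts)
```

In an interactive app, replace the character limit with a check of your own cancellation flag; the try/finally releases the connection even if your handler raises mid-stream.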
