Streaming
Streaming lets your UI render tokens as soon as the model produces them, instead of waiting for the full response. The InfinityBlue gateway streams chat completions over Server-Sent Events (SSE) using the same format as the OpenAI streaming API.
Why stream
- Lower perceived latency. Users see the first word in ~300 ms instead of waiting 2-5 seconds.
- Faster cancellation. Close the stream as soon as the user navigates away, and the model stops generating output tokens you would otherwise pay for.
- Better for long answers. A 1,000-token response delivered as a single chunk means a long blank wait; streamed, the user can start reading right away.
Enable streaming
Set "stream": true in the request body. The server responds with Content-Type: text/event-stream and sends one event per chunk of generated text (usually a single token) as soon as it is produced.
curl -N https://api.getinfinityblue.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-5.4","stream":true,"messages":[{"role":"user","content":"Write a haiku about streaming."}]}'
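Each event is a line that starts with data: followed by a JSON chunk, and a blank line separates consecutive events (the blank lines are omitted below for brevity). The first chunk sets the role, the following chunks carry content fragments, and the final chunk has an empty delta plus a finish_reason. The stream then ends with the literal line data: [DONE].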
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Whisper"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" of"},"finish_reason":null}]}
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
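To make the wire format concrete, here is a minimal Python sketch that parses event lines like the ones above and reassembles the reply. It assumes the response body has already been split into text lines (blank separator lines included) and only tracks the first choice; parse_sse_lines is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_sse_lines(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Reassemble streamed chat content from SSE lines.

    Returns (content, finish_reason). Hypothetical helper for illustration.
    """
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()  # the space after "data:" is optional
        if payload == "[DONE]":
            break  # end-of-stream sentinel: no more chunks follow
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # this sketch only tracks the first choice
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


events = [
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Whisper"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" of"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(parse_sse_lines(events))  # ('Whisper of', 'stop')
```

The role-only first chunk and the empty-delta last chunk contribute no text; the last one only supplies the finish_reason.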
Client examples
The official OpenAI SDKs parse the event stream for you; point them at the gateway by setting the base URL. Each chunk carries a delta whose content you append to your local state. In Python:
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.getinfinityblue.com/v1")

stream = client.chat.completions.create(
    model="gpt-5.4",
    stream=True,
    messages=[{"role": "user", "content": "Write a haiku about streaming."}],
)
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. usage-only ones) carry no choices
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
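If you also need the complete reply once the stream ends (for logging, or to store in the chat history), accumulate the deltas and keep the finish_reason from the last chunk. A minimal sketch, assuming chunk objects shaped like the SDK's (chunk.choices[0].delta.content and chunk.choices[0].finish_reason); collect_stream and fake_chunk are hypothetical helpers, and fake_chunk only exists so the sketch runs offline.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional, Tuple


def collect_stream(stream: Iterable[Any]) -> Tuple[str, Optional[str]]:
    """Print each content delta as it arrives; return (full_text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in stream:
        if not chunk.choices:
            continue  # e.g. a usage-only chunk, if the server sends one
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
            print(choice.delta.content, end="", flush=True)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    print()  # end the printed line once the stream is done
    return "".join(parts), finish_reason


def fake_chunk(content=None, finish_reason=None):
    """Stand-in for an SDK chunk object, for trying the helper without a network call."""
    choice = SimpleNamespace(delta=SimpleNamespace(content=content), finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


text, reason = collect_stream(
    [fake_chunk(), fake_chunk("Whisper"), fake_chunk(" of"), fake_chunk(finish_reason="stop")]
)
# prints "Whisper of"; text == "Whisper of", reason == "stop"
```

With the real SDK, pass the stream from the Python example above: text, reason = collect_stream(stream). A finish_reason of "length" means the reply hit the token limit and was cut short.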
In Node.js (as an ES module, so top-level await is allowed):
import OpenAI from "openai";

const client = new OpenAI({ apiKey: "YOUR_API_KEY", baseURL: "https://api.getinfinityblue.com/v1" });

const stream = await client.chat.completions.create({
  model: "gpt-5.4",
  stream: true,
  messages: [{ role: "user", content: "Write a haiku about streaming." }],
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content ?? "";
  if (content) process.stdout.write(content);
}
Use fetch in the browser
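If you cannot ship an SDK, the raw fetch API works: read the response body as a stream, split it into lines, parse each data: line as a separate JSON object, and append its delta. In production, keep the API key on your server (for example, by proxying this request through your backend), because anything sent to the browser is visible to its users.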
const response = await fetch("https://api.getinfinityblue.com/v1/chat/completions", {
  method: "POST",
  headers: { "Authorization": `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-5.4",
    stream: true,
    messages: [{ role: "user", content: "Write a haiku about streaming." }],
  }),
});
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}: ${await response.text()}`);
}

const output = document.getElementById("output");
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let finished = false;
while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split(/\r?\n/);
  buffer = lines.pop() ?? ""; // keep the last (possibly incomplete) line for the next read
  for (const line of lines) {
    if (!line.startsWith("data:")) continue; // skip blank separator lines and comments
    const payload = line.slice(5).trim(); // the space after "data:" is optional in SSE
    if (payload === "[DONE]") {
      finished = true; // end-of-stream sentinel
      break;
    }
    const chunk = JSON.parse(payload);
    const delta = chunk.choices?.[0]?.delta?.content ?? "";
    if (delta) output.textContent += delta;
  }
}
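The buffer in the fetch loop above matters because network reads do not line up with event boundaries: one read can end in the middle of a line, or even in the middle of a multi-byte UTF-8 character. Here is the same buffering idea as a Python sketch, using the standard library's incremental UTF-8 decoder in place of TextDecoder with { stream: true }; LineBuffer is a hypothetical name.

```python
import codecs
from typing import List


class LineBuffer:
    """Turn arbitrary byte chunks into complete text lines (hypothetical helper)."""

    def __init__(self) -> None:
        # The incremental decoder holds back incomplete multi-byte sequences between feeds.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._pending = ""

    def feed(self, data: bytes) -> List[str]:
        """Add one network read; return every line it completed."""
        self._pending += self._decoder.decode(data)
        # The last piece has no newline yet, so keep it for the next feed.
        *complete, self._pending = self._pending.split("\n")
        # Drop a trailing "\r" so CRLF line endings work too.
        return [line.rstrip("\r") for line in complete]

    def finish(self) -> List[str]:
        """Flush the decoder at end of stream and return any unterminated last line."""
        self._pending += self._decoder.decode(b"", final=True)
        rest, self._pending = self._pending, ""
        return [rest.rstrip("\r")] if rest else []


# "\u00e9" (e acute) is two bytes in UTF-8, so byte-by-byte feeding splits it.
raw = 'data: {"content":"caf\u00e9"}\r\n\r\ndata: [DONE]\r\n\r\n'.encode("utf-8")
buf = LineBuffer()
lines = []
for i in range(len(raw)):  # worst case: every network read is a single byte
    lines.extend(buf.feed(raw[i:i + 1]))
lines.extend(buf.finish())
# lines holds the two data lines plus their blank separators, "\r" removed, characters intact
```

Each completed line can then be handed to a data: parser; the pending tail is never parsed until its newline arrives.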
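When the model calls a tool, the call also streams in pieces: each entry in delta.tool_calls carries an index, and the function.arguments string arrives in fragments. Concatenate the fragments per index to rebuild the same tool_calls array a non-streaming response would return, and execute the tool only once finish_reason is "tool_calls", since the arguments are not valid JSON until then.

Structured output streams token by token as well. The partial text is not valid JSON until the stream completes; once it does, the assembled content matches the schema declared in response_format.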
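A minimal sketch of that accumulation, assuming the OpenAI chunk layout that the gateway mirrors: each delta.tool_calls entry has an index, the id and function.name arrive in the first fragment for that index, and function.arguments arrives as string pieces. Chunks are shown as plain dicts (the parsed JSON of each data: line); accumulate_tool_calls, call_1, and get_weather are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def accumulate_tool_calls(chunks: Iterable[dict]) -> List[dict]:
    """Merge streamed tool-call fragments for the first choice.

    Returns the complete calls, ordered by index, only if the stream ended with
    finish_reason "tool_calls"; otherwise returns an empty list.
    """
    calls: Dict[int, dict] = {}
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        choice = choices[0]
        for frag in choice.get("delta", {}).get("tool_calls") or []:
            call = calls.setdefault(
                frag["index"],
                {"id": None, "type": "function", "function": {"name": "", "arguments": ""}},
            )
            if frag.get("id"):
                call["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                call["function"]["name"] = fn["name"]
            # Argument text arrives in pieces; concatenate them in order.
            call["function"]["arguments"] += fn.get("arguments") or ""
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
    if finish_reason != "tool_calls":
        return []  # the model finished for another reason; nothing to execute
    return [calls[i] for i in sorted(calls)]


# Hypothetical fragments for a get_weather call, shaped like parsed data: payloads.
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "tool_calls": [
        {"index": 0, "id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": ""}}]}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '{"city": '}}]}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '"Oslo"}'}}]}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}]},
]
for call in accumulate_tool_calls(chunks):
    args = json.loads(call["function"]["arguments"])  # valid JSON only after the stream ends
    print(call["function"]["name"], args)  # get_weather {'city': 'Oslo'}
```

Neither fragment of the arguments is valid JSON on its own, which is why parsing waits for the "tool_calls" finish_reason.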
Aborting a stream mid-response stops further output, but the model has already processed the prompt, so you are still charged for the input tokens. Cancel as soon as you know you don't need the rest of the response, not after a long pause.
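With the Python SDK, stopping early means breaking out of the loop and closing the stream, which releases the HTTP connection. A minimal sketch, assuming the stream object has a close() method (the SDK's stream object does, and so do Python generators) and that your UI signals cancellation through a threading.Event; read_until_cancelled is a hypothetical helper.

```python
import threading
from typing import Any


def read_until_cancelled(stream: Any, cancelled: threading.Event) -> str:
    """Collect content deltas until the stream ends or `cancelled` is set.

    `stream` is anything iterable with a close() method, such as the object
    returned by client.chat.completions.create(..., stream=True).
    """
    parts = []
    try:
        for chunk in stream:
            if cancelled.is_set():
                break  # stop reading; later chunks are never consumed
            if chunk.choices and chunk.choices[0].delta.content:
                parts.append(chunk.choices[0].delta.content)
    finally:
        stream.close()  # close the underlying response (or generator), also on break or error
    return "".join(parts)
```

Run the helper on a worker thread and call cancel_event.set() from your UI handler when the user navigates away. The flag is checked only when the next chunk arrives, so cancellation takes effect at the next token rather than instantly.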