Designing streaming UX for AI tools

Streaming is the first thing that makes an AI product feel fast. It's also where most AI products have subtly broken UX. Here's a collection of things I've thought hard about while shipping streaming interfaces.

Partial render is a feature, not a bug

The obvious use of streaming is "show text as it arrives." But you can do more. If you're generating structured output — a JSON object, a list, a form — you can parse partial output speculatively and render meaningful UI before the stream completes.

For a list-generation use case, this means the first item can appear as a proper UI card before items 2-10 exist. The user gets immediate feedback. The experience feels faster than it actually is.

The tradeoff: partial state can confuse users if a list item appears and then mutates significantly as more tokens arrive. I've found that showing a subtle "typing" indicator on the in-progress item helps set expectations.
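
As a rough sketch of speculative parsing for the list case: the function below assumes the model was asked for a JSON array of objects, and it returns every item that is already complete in the text received so far. The name parseCompleteItems and the item shape are my own placeholders, not part of any library.

```typescript
// Sketch only: assumes the streamed output is a JSON array of objects,
// e.g. [{"title": "..."}, {"title": "..."}]. Names here are hypothetical.
export function parseCompleteItems<T = unknown>(partial: string): T[] {
  const items: T[] = [];
  let depth = 0; // current nesting depth of {} and []
  let inString = false; // true while scanning inside a JSON string
  let escaped = false; // true right after a backslash inside a string
  let itemStart = -1; // index where the current top-level object began

  for (let i = 0; i < partial.length; i++) {
    const ch = partial[i];

    if (inString) {
      // Braces inside strings must not affect depth tracking.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }

    if (ch === '"') {
      inString = true;
    } else if (ch === "{" || ch === "[") {
      if (depth === 1 && ch === "{") itemStart = i; // a new list item begins
      depth++;
    } else if (ch === "}" || ch === "]") {
      depth--;
      if (depth === 1 && ch === "}" && itemStart !== -1) {
        // The item just closed, so this slice is valid JSON on its own.
        items.push(JSON.parse(partial.slice(itemStart, i + 1)) as T);
        itemStart = -1;
      }
    }
  }
  return items;
}
```

Calling this on the accumulated text after each chunk gives you the cards that are safe to render as final; whatever follows the last complete item is the in-progress one that gets the typing indicator. It rescans from the start on every call, which is fine for short lists but worth making incremental for long outputs.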

Cancellation is user respect

If a user stops a generation, the operation should actually stop. This sounds obvious. In practice, many implementations just hide the in-progress state on the client but let the server-side generation run to completion. The user pays credits for output they never see. The model burns compute. The UX is a lie.

Proper cancellation requires an AbortController on the fetch request, a cancellation signal on the backend, and — if you're paying per token — a rollback or partial refund on the credit side.
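
The client half of that can be sketched as follows. This assumes a runtime with global fetch and AbortController (modern browsers, Node 18+); the endpoint URL, the request body shape, and the startGeneration name are placeholders I made up for illustration.

```typescript
export interface Generation {
  cancel(): void;
  finished: Promise<"completed" | "cancelled">;
}

export function startGeneration(
  url: string, // hypothetical endpoint, e.g. "/api/generate"
  prompt: string,
  onText: (text: string) => void,
  fetchImpl: typeof fetch = fetch, // injectable so the sketch can run offline
): Generation {
  const controller = new AbortController();

  const run = async (): Promise<"completed" | "cancelled"> => {
    try {
      const res = await fetchImpl(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
        signal: controller.signal, // aborting rejects the fetch and the body reads
      });
      if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      for (;;) {
        const { done, value } = await reader.read();
        if (done) return "completed";
        onText(decoder.decode(value, { stream: true }));
      }
    } catch (err) {
      // A user-initiated stop is not an error; anything else still is.
      if (controller.signal.aborted) return "cancelled";
      throw err;
    }
  };

  return { cancel: () => controller.abort(), finished: run() };
}
```

Wiring cancel() to the stop button closes the connection, but that only helps if the backend notices the disconnect and stops the model call. The credit side is a separate piece of bookkeeping that this sketch does not cover.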

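I use SSE (Server-Sent Events) for streaming. When the client closes the connection, the server can treat the disconnect as its stop signal, but only if something is listening for it. In NestJS, an @Sse() handler returns an Observable, and Nest unsubscribes from it when the client disconnects, so cleanup placed in the Observable's teardown function (or in a 'close' listener on the underlying request) works well for this.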
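
Here is a framework-agnostic sketch of that server-side stop signal, written against plain Node rather than NestJS so it stays self-contained. The SseResponse interface is a minimal shape that Node's http.ServerResponse should satisfy, and Generate stands in for whatever model client you use; both names are assumptions of mine.

```typescript
// Minimal structural type: just the parts of a response this handler needs.
export interface SseResponse {
  writeHead(status: number, headers: Record<string, string>): unknown;
  write(chunk: string): unknown;
  end(): unknown;
  on(event: "close", listener: () => void): unknown;
}

// Hypothetical upstream: anything that yields tokens and honours an AbortSignal.
export type Generate = (signal: AbortSignal) => AsyncIterable<string>;

// Streams tokens as SSE events and resolves with the number actually sent.
export async function streamTokens(res: SseResponse, generate: Generate): Promise<number> {
  const upstream = new AbortController();

  // "close" fires when the response finishes or the client goes away early.
  res.on("close", () => upstream.abort());

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  let sent = 0;
  for await (const token of generate(upstream.signal)) {
    if (upstream.signal.aborted) break; // client left: stop consuming tokens
    // JSON-encode so newlines inside a token cannot break SSE framing.
    res.write(`data: ${JSON.stringify(token)}\n\n`);
    sent++;
  }

  if (!upstream.signal.aborted) res.end();
  return sent;
}
```

The returned count is the number of tokens the client was actually sent, which is a reasonable starting point for the partial-refund logic. The important part is that the same signal reaches the model call, so the generation itself stops instead of only the response.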

Error handling mid-stream

What happens when a model API returns an error after 200 tokens have already streamed? Most interfaces I've seen either stop silently or show a generic error that replaces the partial output. Both feel jarring.

A better pattern: keep the partial output visible, append an inline error state ("Generation stopped — the model returned an error"), and offer a retry button. The user sees what they got and can choose what to do next. Don't erase work.
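
One way to make "don't erase work" hard to get wrong is to model the stream as a small state machine where the error transition keeps the text. This is a sketch with state and event names I picked for illustration; it is not tied to any particular UI framework.

```typescript
export interface StreamState {
  status: "idle" | "streaming" | "done" | "error";
  text: string; // everything received so far
  error?: string; // set only when status is "error"
}

export type StreamEvent =
  | { type: "start" } // first attempt or a retry
  | { type: "token"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

export const initialState: StreamState = { status: "idle", text: "" };

export function reduceStream(state: StreamState, event: StreamEvent): StreamState {
  switch (event.type) {
    case "start":
      // A retry begins a fresh attempt; the old text is replaced only now.
      return { status: "streaming", text: "" };
    case "token":
      // Ignore stray tokens that arrive after the stream has ended.
      return state.status === "streaming"
        ? { ...state, text: state.text + event.text }
        : state;
    case "done":
      return { status: "done", text: state.text };
    case "error":
      // Keep the partial output and attach the error beside it.
      return { status: "error", text: state.text, error: event.message };
  }
}
```

The view can then render text unconditionally and show the inline error plus a retry button whenever status is "error". Whether a retry should clear the partial output or continue from it is a product decision; this sketch clears it when the retry starts.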

The thinking state

The gap between "user submits" and "first token arrives" is where you lose people. Even if that gap is 800ms, it needs a meaningful indicator. A spinning icon doesn't communicate that a large language model is doing something non-trivial.

I've had good results with a subtle pulsing dot next to a "Generating..." label, combined with a realistic estimate of the time to first response ("typically 1-3 seconds"). Setting expectations is half the UX work.

Markdown rendering during streaming

Rendering markdown incrementally creates visual churn. Headers and bold text that haven't "closed" yet render as partial syntax. There are two approaches:

1. Buffer tokens until a meaningful unit completes (end of sentence, end of paragraph), then render.
2. Use a streaming-aware markdown parser that handles unclosed syntax gracefully.

I've used a buffering approach on some projects and the streaming-aware approach on others. The streaming-aware parser gives a better experience at the cost of implementation complexity.
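
For reference, the buffering approach can be as small as this. It is a minimal sketch that treats a blank line as the paragraph boundary; the ParagraphBuffer name is mine, and it deliberately ignores edge cases such as fenced code blocks that contain blank lines.

```typescript
// Sketch: hold back streamed text until a paragraph is complete, so the
// markdown renderer never sees half-open syntax like "**bo".
export class ParagraphBuffer {
  private pending = "";

  // Add a chunk; returns any paragraphs that are now complete.
  push(chunk: string): string[] {
    this.pending += chunk;
    const parts = this.pending.split(/\n{2,}/);
    // The last part has no terminating blank line yet, so keep it buffered.
    this.pending = parts.pop() ?? "";
    return parts.filter((p) => p.trim().length > 0);
  }

  // Call when the stream ends to release whatever is left.
  flush(): string {
    const rest = this.pending;
    this.pending = "";
    return rest;
  }
}
```

Each returned paragraph can go straight to a regular markdown renderer. The cost is latency: nothing from a paragraph appears until the whole paragraph has arrived, which is the main reason the streaming-aware parser feels better.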