Production Systems
12 min · Lesson 13 of 14
Streaming Responses
Deliver faster perceived performance with streaming in production LLM interfaces
Learning goals
- Understand why streaming improves UX
- Learn to implement streaming in different frameworks
- Handle streaming edge cases
Why Stream?
Without streaming:
- User waits for the entire response
- Long responses mean long waits
- No feedback during generation

With streaming:
- First tokens appear in ~200ms
- Response builds in real time
- Better perceived performance
For a 500-token response, streaming can make the experience feel 10x faster.
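The gap between time-to-first-token and total generation time is what drives that perception. A minimal sketch with a mock token stream (the generator and its 20 ms delay are illustrative assumptions, not real model latencies):

```typescript
// Mock of a streamed response: yields one token every 20 ms, 10 tokens total.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function* mockStream(): AsyncGenerator<string> {
  for (let i = 0; i < 10; i++) {
    await sleep(20);
    yield `token${i} `;
  }
}

// Measure when the first token lands vs. when the full text is done.
async function measure() {
  const start = Date.now();
  let firstToken = 0;
  let text = '';
  for await (const chunk of mockStream()) {
    if (!firstToken) firstToken = Date.now() - start;
    text += chunk;
  }
  const total = Date.now() - start;
  // With streaming the user sees output at `firstToken`; without it,
  // nothing appears until `total`.
  return { firstToken, total, text };
}
```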
Implementing Streaming
Using the AI SDK:

```typescript
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = streamText({
  model: openai('gpt-4'),
  prompt: 'Write a short story',
});

// Print each chunk as it arrives instead of waiting for the full response.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

Server-Sent Events (SSE) or WebSockets deliver chunks to the browser.
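One way the SSE side can look, as a sketch: a plain Node HTTP server that writes each chunk in the `data: <payload>\n\n` wire format. The `getStream` parameter is a stand-in for any async iterable of text, such as the `textStream` above; the `[DONE]` sentinel is a common convention, not part of the SSE spec.

```typescript
import { createServer } from 'node:http';

// Wrap an async iterable of text chunks in an SSE endpoint.
function sseServer(getStream: () => AsyncIterable<string>) {
  return createServer(async (req, res) => {
    res.writeHead(200, {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    });
    try {
      for await (const chunk of getStream()) {
        // Each SSE message is "data: <payload>" followed by a blank line.
        res.write(`data: ${JSON.stringify(chunk)}\n\n`);
      }
      res.write('data: [DONE]\n\n');
    } finally {
      res.end();
    }
  });
}
```

In the browser, an `EventSource` (or a `fetch` reader) consumes these messages and appends each chunk to the UI.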
Streaming Considerations
Error Handling
- Connection drops
- Rate limits
- Model errors
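All three failure modes surface mid-iteration, so the consuming loop needs a try/catch that preserves whatever text already arrived. A sketch (the `consumeStream` helper and `onChunk` callback are assumptions for illustration, not AI SDK API):

```typescript
// Consume a stream defensively, keeping partial text if it fails mid-response.
async function consumeStream(
  stream: AsyncIterable<string>,
  onChunk: (text: string) => void,
): Promise<{ text: string; error?: Error }> {
  let text = '';
  try {
    for await (const chunk of stream) {
      text += chunk;
      onChunk(text); // update the UI as tokens arrive
    }
    return { text };
  } catch (err) {
    // Connection drops, rate limits, and model errors all land here;
    // return the partial text so the UI can show it alongside the error.
    return { text, error: err as Error };
  }
}
```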
Parsing Structured Output
If streaming JSON, wait for complete object before parsing.
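A simple way to do that is to accumulate chunks in a buffer and attempt a parse on each arrival: partial JSON throws, so the parse only succeeds once the value is complete. A minimal sketch:

```typescript
// Buffer streamed JSON chunks; parse only once the value is complete.
class JsonStreamBuffer {
  private buffer = '';

  // Returns the parsed value once the buffer is complete JSON, else null.
  push(chunk: string): unknown | null {
    this.buffer += chunk;
    try {
      return JSON.parse(this.buffer);
    } catch {
      return null; // incomplete so far; keep buffering
    }
  }
}
```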
UI/UX
- Show typing indicator during generation
- Handle rapid content updates efficiently
- Consider smooth scroll behavior
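Handling rapid updates efficiently usually means coalescing token arrivals rather than re-rendering on every one. A framework-agnostic sketch (`render` is a placeholder for your UI state update; the 50 ms interval is an assumed default):

```typescript
// Coalesce rapid chunk arrivals into at most one render per interval.
function makeBatchedRenderer(render: (text: string) => void, intervalMs = 50) {
  let pending: string | null = null;
  let timer: ReturnType<typeof setTimeout> | null = null;

  return (text: string) => {
    pending = text;
    if (timer) return; // a render is already scheduled for this window
    timer = setTimeout(() => {
      timer = null;
      if (pending !== null) render(pending);
      pending = null;
    }, intervalMs);
  };
}
```

In a browser you might schedule with `requestAnimationFrame` instead of `setTimeout` to align renders with frames.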
Common mistakes
× Not handling connection errors—streams can fail mid-response
× Parsing partial JSON—wait for complete structures
× Ignoring rate limits—streaming doesn't prevent rate limiting
× No loading states—users need feedback while waiting for the first token
Key takeaways
+ Streaming dramatically improves perceived performance
+ First tokens appear in ~200ms regardless of total response length
+ Handle errors gracefully—streams can fail mid-response
+ Buffer structured output until it's complete