Advanced Techniques
Lesson 10 of 14 · 15 min
Managing the Context Window
Work effectively within token limits when designing real conversational and assistant flows
Learning goals
- Understand context window limitations
- Learn strategies for long conversations
- Implement effective context management
Context Window Basics
The context window is the total amount of text (in tokens) the model can consider at once. This includes:
- System prompt
- Conversation history
- Current user message
- Model's response
Context window sizes vary widely by model; always check the provider's current documentation:
- GPT-4o: 128K tokens
- GPT-4.1: up to ~1M tokens
- GPT-5.4: ~1.05M tokens
- Claude 3.5 Sonnet: 200K tokens
- Claude Opus/Sonnet 4.5–4.6: 200K by default; up to 1M in beta (4.6)
- Llama 3: 8K officially; later variants (e.g. Llama 3.1) support up to 128K
- Llama 4 (Maverick/Scout): 1M–10M depending on variant, though effective performance degrades well before those limits in long contexts
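Before sending a request, it helps to check whether your messages will fit. The sketch below uses a rough characters-per-token heuristic; the function names (`estimate_tokens`, `fits_in_context`) and the ~4-chars-per-token ratio are illustrative assumptions. For exact counts, use the provider's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    For exact counts, use the provider's tokenizer instead."""
    return max(1, len(text) // 4)

def fits_in_context(messages: list[dict], limit: int = 128_000,
                    reserve_for_response: int = 4_096) -> bool:
    """Check whether a message list likely fits, leaving room for the reply.
    The context window must hold the prompt AND the model's response."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    return used + reserve_for_response <= limit

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize our project plan."},
]
print(fits_in_context(history))  # True for a short conversation
```

Reserving tokens for the response matters: a prompt that exactly fills the window leaves the model no room to answer.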
The Middle Problem
Research shows models pay less attention to information in the middle of long contexts:
- Beginning: High attention (primacy effect)
- Middle: Lower attention (lost in the middle)
- End: High attention (recency effect)
For critical information, place it at the beginning or end of your context.
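One way to apply this placement advice is to state the critical instruction first and restate it last, so it benefits from both the primacy and recency effects. This is a minimal sketch; the `build_prompt` function and its parameters are illustrative, not a standard API.

```python
def build_prompt(critical: str, background: list[str], question: str) -> str:
    """Assemble a prompt with the critical instruction at both ends."""
    parts = [critical]                      # beginning: high attention
    parts.extend(background)                # middle: bulk reference material
    parts.append(question)
    parts.append(f"Reminder: {critical}")   # end: high attention
    return "\n\n".join(parts)

prompt = build_prompt(
    critical="Answer only from the provided documents.",
    background=["Doc 1: ...", "Doc 2: ..."],
    question="What does Doc 1 say about pricing?",
)
print(prompt)
```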
Context Management Strategies
Summarization
Periodically summarize older messages:
[Summary of previous conversation: User asked about X, we discussed Y, agreed on Z]
Sliding Window
Keep only the N most recent messages.
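A sliding window can be implemented as a simple trim before each request. This sketch assumes the common role/content message format and keeps the system prompt out of the window, since dropping it would change the assistant's behavior mid-conversation.

```python
def sliding_window(messages: list[dict], max_messages: int = 10) -> list[dict]:
    """Keep the system prompt plus the N most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-max_messages:]
    return system + recent

# Example: a long conversation trimmed to the last 10 turns
msgs = [{"role": "system", "content": "Be concise."}]
msgs += [{"role": "user", "content": f"message {i}"} for i in range(20)]
trimmed = sliding_window(msgs, max_messages=10)
print(len(trimmed))  # 11: the system prompt plus the 10 most recent messages
```

A drawback worth noting: anything outside the window is forgotten entirely, which is why sliding windows are often combined with summarization.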
Selective Inclusion
Only include messages relevant to the current query.
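Relevance filtering can be sketched with simple word overlap, as below. This naive approach is for illustration only; production systems typically score relevance with embedding similarity instead.

```python
def select_relevant(messages: list[dict], query: str,
                    min_overlap: int = 1) -> list[dict]:
    """Keep system messages plus messages sharing words with the query.
    A stand-in for embedding-based relevance scoring."""
    query_words = set(query.lower().split())
    kept = []
    for m in messages:
        overlap = len(query_words & set(m["content"].lower().split()))
        if m["role"] == "system" or overlap >= min_overlap:
            kept.append(m)
    return kept

history = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "tell me about pricing plans"},
    {"role": "user", "content": "what is the weather today"},
]
relevant = select_relevant(history, "pricing details")
print(len(relevant))  # 2: the system prompt and the pricing message
```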
Hierarchical Memory
Store detailed information externally, include summaries in context.
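The pattern above can be sketched as a small store that keeps full text outside the context and exposes only one-line summaries for the prompt. The class and method names are illustrative; in practice the archive would be a database or vector store, and summaries would come from an LLM call.

```python
class HierarchicalMemory:
    """Full detail lives externally; only summaries go into the prompt."""

    def __init__(self) -> None:
        self.archive: dict[str, str] = {}    # id -> full text (stand-in for a DB)
        self.summaries: dict[str, str] = {}  # id -> one-line summary

    def store(self, msg_id: str, full_text: str, summary: str) -> None:
        self.archive[msg_id] = full_text
        self.summaries[msg_id] = summary

    def context_block(self) -> str:
        """Compact block of summaries to include in the prompt."""
        return "\n".join(f"[{i}] {s}" for i, s in self.summaries.items())

    def recall(self, msg_id: str) -> str:
        """Fetch full detail on demand, e.g. when the model requests it."""
        return self.archive[msg_id]

mem = HierarchicalMemory()
mem.store("m1", "Long discussion about the Q3 budget across forty messages...",
          "Agreed Q3 budget: $10k")
print(mem.context_block())  # [m1] Agreed Q3 budget: $10k
```

Tool-use setups often pair this with a retrieval function the model can call, so it can pull full detail only when a summary isn't enough.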
Common mistakes
× Including all conversation history: leads to context overflow and higher costs
× Putting critical info in the middle, where it may be overlooked
× Having no context management strategy, so conversations degrade as they grow
× Ignoring context limits: silent truncation can cause incoherent responses
Key takeaways
+ Context windows have hard limits; plan for them
+ Place critical information at the beginning and end
+ Use summarization and selective inclusion to manage long conversations
+ Monitor token usage and implement overflow strategies