Advanced Techniques
Lesson 10 of 14 · 15 min
Managing the Context Window
Work effectively within token limits when designing real conversational and assistant flows
Learning goals
- Understand context window limitations
- Learn strategies for long conversations
- Implement effective context management
Context Window Basics
The context window is the total amount of text (in tokens) the model can consider at once. This includes:
- System prompt
- Conversation history
- Current user message
- Model's response
Context window sizes vary widely by model; always check the provider's current documentation:
- GPT-4o: 128K tokens
- GPT-4.1: up to ~1M tokens
- GPT-5.4: ~1.05M tokens
- Claude 3.5 Sonnet: 200K tokens
- Claude Opus/Sonnet 4.5–4.6: 200K by default; up to 1M in beta (4.6)
- Llama 3: 8K officially; later variants (e.g. Llama 3.1) support up to 128K
- Llama 4 (Maverick/Scout): 1M–10M depending on variant, though effective performance degrades well before those limits in long contexts
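Before sending a request, it helps to check whether your messages will fit. The sketch below uses a rough characters-per-token heuristic; the function names (`estimate_tokens`, `fits_in_context`) and the ~4-chars-per-token ratio are illustrative assumptions. For exact counts, use the provider's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    For exact counts, use the provider's tokenizer instead."""
    return max(1, len(text) // 4)

def fits_in_context(messages: list[dict], limit: int = 128_000,
                    reserve_for_response: int = 4_096) -> bool:
    """Check whether a message list likely fits, leaving room for the reply.
    The context window must hold the prompt AND the model's response."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    return used + reserve_for_response <= limit

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize our project plan."},
]
print(fits_in_context(history))  # True for a short conversation
```

Reserving tokens for the response matters: a prompt that exactly fills the window leaves the model no room to answer.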
The Middle Problem
Research shows models pay less attention to information in the middle of long contexts:
- Beginning: High attention (primacy effect)
- Middle: Lower attention (lost in the middle)
- End: High attention (recency effect)
For critical information, place it at the beginning or end of your context.
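One way to apply this placement advice is to state the critical instruction first and restate it last, so it benefits from both the primacy and recency effects. This is a minimal sketch; the `build_prompt` function and its parameters are illustrative, not a standard API.

```python
def build_prompt(critical: str, background: list[str], question: str) -> str:
    """Assemble a prompt with the critical instruction at both ends."""
    parts = [critical]                      # beginning: high attention
    parts.extend(background)                # middle: bulk reference material
    parts.append(question)
    parts.append(f"Reminder: {critical}")   # end: high attention
    return "\n\n".join(parts)

prompt = build_prompt(
    critical="Answer only from the provided documents.",
    background=["Doc 1: ...", "Doc 2: ..."],
    question="What does Doc 1 say about pricing?",
)
print(prompt)
```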
Context Management Strategies
Summarization
Periodically summarize older messages:
[Summary of previous conversation: User asked about X, we discussed Y, agreed on Z]
Sliding Window
Keep only the N most recent messages.
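A sliding window can be implemented as a simple trim before each request. This sketch assumes the common role/content message format and keeps the system prompt out of the window, since dropping it would change the assistant's behavior mid-conversation.

```python
def sliding_window(messages: list[dict], max_messages: int = 10) -> list[dict]:
    """Keep the system prompt plus the N most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-max_messages:]
    return system + recent

# Example: a long conversation trimmed to the last 10 turns
msgs = [{"role": "system", "content": "Be concise."}]
msgs += [{"role": "user", "content": f"message {i}"} for i in range(20)]
trimmed = sliding_window(msgs, max_messages=10)
print(len(trimmed))  # 11: the system prompt plus the 10 most recent messages
```

A drawback worth noting: anything outside the window is forgotten entirely, which is why sliding windows are often combined with summarization.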
Selective Inclusion
Only include messages relevant to the current query.
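Relevance filtering can be sketched with simple word overlap, as below. This naive approach is for illustration only; production systems typically score relevance with embedding similarity instead.

```python
def select_relevant(messages: list[dict], query: str,
                    min_overlap: int = 1) -> list[dict]:
    """Keep system messages plus messages sharing words with the query.
    A stand-in for embedding-based relevance scoring."""
    query_words = set(query.lower().split())
    kept = []
    for m in messages:
        overlap = len(query_words & set(m["content"].lower().split()))
        if m["role"] == "system" or overlap >= min_overlap:
            kept.append(m)
    return kept

history = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "tell me about pricing plans"},
    {"role": "user", "content": "what is the weather today"},
]
relevant = select_relevant(history, "pricing details")
print(len(relevant))  # 2: the system prompt and the pricing message
```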
Hierarchical Memory
Store detailed information externally, include summaries in context.
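The pattern above can be sketched as a small store that keeps full text outside the context and exposes only one-line summaries for the prompt. The class and method names are illustrative; in practice the archive would be a database or vector store, and summaries would come from an LLM call.

```python
class HierarchicalMemory:
    """Full detail lives externally; only summaries go into the prompt."""

    def __init__(self) -> None:
        self.archive: dict[str, str] = {}    # id -> full text (stand-in for a DB)
        self.summaries: dict[str, str] = {}  # id -> one-line summary

    def store(self, msg_id: str, full_text: str, summary: str) -> None:
        self.archive[msg_id] = full_text
        self.summaries[msg_id] = summary

    def context_block(self) -> str:
        """Compact block of summaries to include in the prompt."""
        return "\n".join(f"[{i}] {s}" for i, s in self.summaries.items())

    def recall(self, msg_id: str) -> str:
        """Fetch full detail on demand, e.g. when the model requests it."""
        return self.archive[msg_id]

mem = HierarchicalMemory()
mem.store("m1", "Long discussion about the Q3 budget across forty messages...",
          "Agreed Q3 budget: $10k")
print(mem.context_block())  # [m1] Agreed Q3 budget: $10k
```

Tool-use setups often pair this with a retrieval function the model can call, so it can pull full detail only when a summary isn't enough.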
Common mistakes
× Including all conversation history: leads to context overflow and higher costs
× Putting critical info in the middle, where it may be overlooked
× Having no context management strategy, so conversations degrade as they grow
× Ignoring context limits: silent truncation can cause incoherent responses
Key takeaways
+ Context windows have hard limits; plan for them
+ Place critical information at the beginning and end
+ Use summarization and selective inclusion to manage long conversations
+ Monitor token usage and implement overflow strategies