Context Compaction
Forge includes powerful automatic context management capabilities that optimize AI conversations while preserving important information.
What is Context Compaction?
As conversations with AI agents grow longer, they can exceed token limits and become inefficient. Context compaction automatically summarizes older parts of conversations when they reach configurable thresholds, allowing you to maintain longer, more productive interactions without hitting model context limits.
Key benefits include:
- Extended Conversations: Continue conversations beyond normal token limits
- Optimized Performance: Reduce token usage and improve response times
- Preserved Context: Keep critical information while summarizing less important details
- Cost Efficiency: Reduce token usage in API calls
- Reasoning Preservation: Maintains reasoning chains for extended thinking models
How It Works
Context compaction uses an intelligent multi-step process:
1. Trigger Detection: Compaction activates when ANY of these conditions are met (a minimal sketch of this check appears after this list):
   - Token count exceeds `token_threshold`
   - Total message count exceeds `message_threshold`
   - User turn count exceeds `turn_threshold`
   - Last message is from the user AND `on_turn_end` is enabled
2. Message Selection: The system identifies which messages to compact:
   - Preserves recent messages based on `retention_window` and `eviction_window`
   - Starts from the first assistant message in the sequence
   - Maintains tool call/result pairs atomically (never splits them)
   - Uses a conservative approach to prevent over-compaction
3. Summarization: Older messages are sent to the configured model for summarization:
   - Detects whether a plan was being executed and uses the appropriate format
   - Extracts file operations, action logs, and task status
   - Consolidates multiple older summaries chronologically
4. Context Replacement: The summary replaces the compacted messages as a user message:
   - Includes any user feedback from the compacted sequence
   - Preserves the natural conversation flow
5. Reasoning Preservation: For extended thinking models:
   - Extracts the most recent reasoning from the compacted messages
   - Injects it into the first remaining assistant message
   - Prevents breaking reasoning chains while avoiding accumulation
This process runs in parallel with your main request to minimize latency impact.
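To make the trigger logic concrete, here is a minimal sketch of the "ANY condition" check. This is illustrative Python pseudocode, not Forge's actual implementation; the `CompactSettings` class, its field names, and the `should_compact` function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompactSettings:
    """Illustrative mirror of the compact options in forge.yaml (not Forge's real types)."""
    token_threshold: Optional[int] = None
    message_threshold: Optional[int] = None
    turn_threshold: Optional[int] = None
    on_turn_end: bool = False


def should_compact(
    settings: CompactSettings,
    token_count: int,
    message_count: int,
    user_turn_count: int,
    last_message_is_user: bool,
) -> bool:
    """Return True as soon as ANY configured threshold is exceeded."""
    if settings.token_threshold is not None and token_count > settings.token_threshold:
        return True
    if settings.message_threshold is not None and message_count > settings.message_threshold:
        return True
    if settings.turn_threshold is not None and user_turn_count > settings.turn_threshold:
        return True
    if settings.on_turn_end and last_message_is_user:
        return True
    return False


# Example: only a token threshold is configured, and the context has outgrown it.
settings = CompactSettings(token_threshold=80_000)
print(should_compact(settings, token_count=95_000, message_count=40,
                     user_turn_count=12, last_message_is_user=False))  # True
```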
Configuration Options
Add the following to your `forge.yaml` file under an agent configuration:

```yaml
agents:
  - id: assistant
    model: anthropic/claude-3.5-sonnet
    compact:
      # Triggering thresholds (ANY condition triggers compaction)
      token_threshold: 80000      # Trigger when context exceeds this many tokens
      message_threshold: 200      # Trigger after this many total messages
      turn_threshold: 50          # Trigger after this many user turns (optional)
      on_turn_end: false          # Trigger on user message (default: false, use with caution)

      # Compaction strategy (controls what to preserve)
      retention_window: 6         # Number of recent messages to preserve unchanged
      eviction_window: 0.2        # Percentage (0.0-1.0) of context to compact

      # Summarization settings
      max_tokens: 2000            # Maximum tokens for the generated summary
      model: google/gemini-2.0-flash-001  # Model to use for summarization
      prompt: "{{> forge-system-prompt-context-summarizer.hbs }}"  # Optional custom prompt
```
Configuration Parameters
| Parameter | Required | Description |
|---|---|---|
| `token_threshold` | No | Trigger compaction when the token count exceeds this value |
| `message_threshold` | No | Trigger compaction when the total message count exceeds this value |
| `turn_threshold` | No | Trigger compaction when the number of user turns exceeds this value |
| `on_turn_end` | No | Trigger compaction on user messages (default: false) |
| `retention_window` | No | Number of recent messages to preserve unchanged |
| `eviction_window` | No | Percentage (0.0-1.0) of context that can be summarized |
| `max_tokens` | No | Maximum token count for the generated summary |
| `model` | No | AI model to use for generating the summary |
| `prompt` | No | Custom prompt template for summarization |
Compaction triggers when ANY condition is met, so you can use one or more thresholds based on your needs. The system uses a conservative strategy: if both `eviction_window` and `retention_window` apply, it follows whichever setting preserves more context.
Setting `on_turn_end: true` triggers compaction after every user message, which is very aggressive and may remove important context. Use this option carefully, and only in scenarios where you need frequent context reduction.
Best Practices
Selecting Appropriate Thresholds
Set thresholds based on your model's context window and usage patterns:
Token-based triggering (recommended for most cases):
- For Claude 3.7 Sonnet (~200K token window): 150,000 to 180,000 tokens
- For Claude 3.5 Haiku (~200K token window): 120,000 to 160,000 tokens
- Leave headroom for the model to work with full context (see the short arithmetic sketch at the end of this subsection)
Message/Turn-based triggering (useful for specific scenarios):
- Use `message_threshold` for very long conversations regardless of token count
- Use `turn_threshold` when you want to limit based on user interactions
- Combine with token thresholds for multi-condition triggering
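As a rough rule of thumb, the token-based recommendations above amount to reserving part of the model's context window as headroom. A small sketch of that arithmetic, using assumed numbers rather than Forge defaults:

```python
# Back-of-the-envelope headroom calculation (assumed values, not Forge defaults).
context_window = 200_000     # e.g. a ~200K-token model such as Claude 3.7 Sonnet
headroom_fraction = 0.10     # keep ~10% of the window free for the model to work in

token_threshold = int(context_window * (1 - headroom_fraction))
print(token_threshold)       # 180000, the upper end of the 150,000-180,000 range above
```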
Choosing Summarization Models
For the summarization model, balance speed and quality:
- Fast models (like Gemini Flash): Provide quicker summaries with lower cost
- Powerful models (like Claude Sonnet): Better context preservation but higher cost and latency
- Consider using a different model than your main agent for cost optimization
Retention and Eviction Windows
These settings work together to determine what gets compacted:
- Retention Window (fixed count): Preserves the last N messages unchanged
  - Typical value: 6-10 messages
  - Guarantees recent context is never compacted
- Eviction Window (percentage): Controls what portion of the context can be summarized
  - Range: 0.0 (nothing) to 1.0 (everything eligible)
  - Typical value: 0.2 (20% of older context)
  - Whichever window is more conservative takes precedence
Example: With `retention_window: 6` and `eviction_window: 0.2`, if you have 100 messages:
- Retention says: preserve last 6, can compact first 94
- Eviction says: can compact 20% of 100 = 20 messages
- Result: Compacts the first 20 messages (more conservative)
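Here is a minimal sketch of how the two windows could combine under this conservative rule. It is illustrative only; the function name and signature are not Forge's actual API.

```python
def messages_to_compact(total_messages: int, retention_window: int,
                        eviction_window: float) -> int:
    """Pick the smaller (more conservative) of the two candidate compaction counts."""
    by_retention = max(total_messages - retention_window, 0)  # everything except the last N
    by_eviction = int(total_messages * eviction_window)       # a fixed fraction of the context
    return min(by_retention, by_eviction)


print(messages_to_compact(100, retention_window=6, eviction_window=0.2))  # 20
```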
Example Use Cases
Long Debugging Sessions
When debugging complex issues, conversations can become lengthy. Context compaction allows the agent to remember key debugging steps while summarizing earlier diagnostics.
```yaml
compact:
  token_threshold: 100000
  retention_window: 8
  eviction_window: 0.3
  model: google/gemini-2.0-flash-001
```
Multi-Stage Project Development
For projects developed over multiple sessions, context compaction enables the agent to maintain awareness of project requirements and previous decisions while focusing on current tasks.
```yaml
compact:
  token_threshold: 150000
  message_threshold: 150
  retention_window: 10
  eviction_window: 0.2
  model: anthropic/claude-3-haiku
```
Interactive Learning and Tutorials
When using Forge for learning or following tutorials, compaction helps maintain the thread of the lesson while summarizing earlier explanations.
```yaml
compact:
  token_threshold: 80000
  retention_window: 6
  eviction_window: 0.25
  model: google/gemini-2.0-flash-001
```
Performance Considerations
Context compaction is designed to minimize performance impact:
- Parallel Execution: Compaction runs alongside your main request without blocking responses
- Summarization Latency: More powerful models may take longer but provide better summaries
- Cost Impact: Each compaction requires an LLM call, adding to usage costs
- Token Reduction: Effective compaction can reduce overall token usage significantly
Optimization tips:
- Use faster models (like Gemini Flash) for summarization to reduce latency
- Set higher thresholds to compact less frequently
- Balance `retention_window` and `eviction_window` for your use case
Troubleshooting
Issue: Context Seems Lost After Compaction
Possible causes:
- Summary model not preserving key details
- Eviction window too aggressive
- Retention window too small
Solutions:
- Increase `max_tokens` to allow for more detailed summaries
- Use a more capable summarization model
- Increase `retention_window` to preserve more recent messages
- Reduce `eviction_window` to compact less aggressively
- Customize the summarization `prompt` to emphasize important details
Issue: Slow Responses After Threshold is Reached
Possible causes:
- Slow summarization model
- Very large context to summarize
Solutions:
- Choose a faster summarization model (e.g., Gemini Flash)
- Reduce `token_threshold` to trigger compaction earlier, while the context is smaller
- Increase `eviction_window` to compact more at once (fewer compactions overall)
Related Features
- Agent Configuration - Learn about other agent configuration options
- Operating Agents - Understand how context works in different operation modes
- Commands Reference - Use `/compact` to manually trigger compaction
By effectively using context compaction, you can maintain longer, more productive AI conversations while optimizing for performance and cost efficiency. The system intelligently balances context preservation with token optimization, ensuring your agents have the information they need without exceeding limits.