
Context Compaction

Forge includes automatic context management that keeps long AI conversations efficient while preserving important information.

What is Context Compaction?

As conversations with AI agents grow longer, they can exceed token limits and become inefficient. Context compaction automatically summarizes older parts of conversations when they reach configurable thresholds, allowing you to maintain longer, more productive interactions without hitting model context limits.

Key benefits include:

  • Extended Conversations: Continue conversations beyond normal token limits
  • Optimized Performance: Smaller contexts mean faster responses
  • Preserved Context: Keep critical information while summarizing less important details
  • Cost Efficiency: Fewer tokens per API call reduce usage costs
  • Reasoning Preservation: Maintain reasoning chains for extended thinking models

How It Works

Context compaction uses an intelligent multi-step process:

  1. Trigger Detection: Compaction activates when ANY of these conditions are met:

    • Token count exceeds token_threshold
    • Total message count exceeds message_threshold
    • User turn count exceeds turn_threshold
    • Last message is from user AND on_turn_end is enabled
  2. Message Selection: The system identifies which messages to compact:

    • Preserves recent messages based on retention_window and eviction_window
    • Starts from the first assistant message in the sequence
    • Maintains tool call/result pairs atomically (never splits them)
    • Uses a conservative approach to prevent over-compaction
  3. Summarization: Older messages are sent to the configured model for summarization:

    • Detects if a plan was being executed and uses appropriate format
    • Extracts file operations, action logs, and task status
    • Consolidates multiple older summaries chronologically
  4. Context Replacement: The summary replaces compacted messages as a user message:

    • Includes any user feedback from the compacted sequence
    • Preserves the natural conversation flow
  5. Reasoning Preservation: For extended thinking models:

    • Extracts the most recent reasoning from compacted messages
    • Injects it into the first remaining assistant message
    • Prevents breaking reasoning chains while avoiding accumulation

This process runs in parallel with your main request to minimize latency impact.
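
To make steps 1 and 2 above concrete, here is a minimal Python sketch of the trigger check and the tool-pair boundary adjustment. It is illustrative only: the names (CompactionConfig, should_compact, adjust_cut_for_tool_pairs) and the simplified role labels are hypothetical, not Forge's internal API.

from dataclasses import dataclass
from typing import Optional

# Hypothetical mirror of the `compact` settings in forge.yaml;
# not Forge's actual internal types.
@dataclass
class CompactionConfig:
    token_threshold: Optional[int] = None
    message_threshold: Optional[int] = None
    turn_threshold: Optional[int] = None
    on_turn_end: bool = False

def should_compact(cfg: CompactionConfig, tokens: int, messages: int,
                   turns: int, last_is_user: bool) -> bool:
    """Step 1: compaction fires when ANY configured condition is met."""
    return any([
        cfg.token_threshold is not None and tokens > cfg.token_threshold,
        cfg.message_threshold is not None and messages > cfg.message_threshold,
        cfg.turn_threshold is not None and turns > cfg.turn_threshold,
        cfg.on_turn_end and last_is_user,
    ])

def adjust_cut_for_tool_pairs(messages: list, cut: int) -> int:
    """Step 2: move the cut point earlier so a tool call and its
    result are never split across the summary boundary."""
    while 0 < cut < len(messages) \
            and messages[cut - 1]["role"] == "tool_call" \
            and messages[cut]["role"] == "tool_result":
        cut -= 1
    return cut

cfg = CompactionConfig(token_threshold=80000, message_threshold=200)
print(should_compact(cfg, tokens=85000, messages=120,
                     turns=10, last_is_user=False))  # True: token limit hit

history = [{"role": "user"}, {"role": "assistant"},
           {"role": "tool_call"}, {"role": "tool_result"}]
print(adjust_cut_for_tool_pairs(history, cut=3))  # 2: pair kept intact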

Configuration Options

Add the following to your forge.yaml file under an agent configuration:

agents:
  - id: assistant
    model: anthropic/claude-3.5-sonnet
    compact:
      # Triggering thresholds (ANY condition triggers compaction)
      token_threshold: 80000 # Trigger when context exceeds this many tokens
      message_threshold: 200 # Trigger after this many total messages
      turn_threshold: 50 # Trigger after this many user turns (optional)
      on_turn_end: false # Trigger on user message (default: false, use with caution)

      # Compaction strategy (controls what to preserve)
      retention_window: 6 # Number of recent messages to preserve unchanged
      eviction_window: 0.2 # Percentage (0.0-1.0) of context to compact

      # Summarization settings
      max_tokens: 2000 # Maximum tokens for the generated summary
      model: google/gemini-2.0-flash-001 # Model to use for summarization
      prompt: "{{> forge-system-prompt-context-summarizer.hbs }}" # Optional custom prompt

Configuration Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| token_threshold | No | Trigger compaction when the token count exceeds this value |
| message_threshold | No | Trigger compaction when the total message count exceeds this value |
| turn_threshold | No | Trigger compaction when the user turn count exceeds this value |
| on_turn_end | No | Trigger compaction on user messages (default: false) |
| retention_window | No | Number of recent messages to preserve unchanged |
| eviction_window | No | Percentage (0.0-1.0) of context that can be summarized |
| max_tokens | No | Maximum token count for the generated summary |
| model | No | AI model to use for generating the summary |
| prompt | No | Custom prompt template for summarization |

Triggering Logic

Compaction triggers when ANY condition is met; you can set one or more thresholds to suit your needs. The system uses a conservative strategy: when both eviction_window and retention_window apply, whichever limit preserves more context wins (see the worked example under Retention and Eviction Windows below).

on_turn_end Usage

Setting on_turn_end: true will trigger compaction after every user message, which can be very aggressive and may remove important context. Use this option carefully and only in specific scenarios where you need frequent context reduction.

Best Practices

Selecting Appropriate Thresholds

Set thresholds based on your model's context window and usage patterns:

Token-based triggering (recommended for most cases):

  • For Claude 3.7 Sonnet (~200K token window): 150,000 to 180,000 tokens
  • For Claude 3.5 Haiku (~200K token window): 120,000 to 160,000 tokens
  • Leave headroom for the model to work with full context (see the sketch after this list)
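
As a rough way to think about headroom, the sketch below derives a threshold from the context window. The helper name, the default output reservation, and the 10% safety margin are assumptions for illustration, not Forge defaults.

def suggest_token_threshold(context_window: int,
                            max_output_tokens: int = 8192,
                            safety_margin: float = 0.10) -> int:
    """Reserve space for the model's output plus a safety margin."""
    usable = context_window - max_output_tokens
    return int(usable * (1 - safety_margin))

# A ~200K-token window yields roughly 172K, inside the ranges above.
print(suggest_token_threshold(200_000))  # 172627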

Message/Turn-based triggering (useful for specific scenarios):

  • Use message_threshold for very long conversations regardless of token count
  • Use turn_threshold when you want to limit based on user interactions
  • Combine with token thresholds for multi-condition triggering

Choosing Summarization Models

For the summarization model, balance speed and quality:

  • Fast models (like Gemini Flash): Provide quicker summaries with lower cost
  • Powerful models (like Claude Sonnet): Better context preservation but higher cost and latency
  • Consider using a different model than your main agent for cost optimization

Retention and Eviction Windows

These settings work together to determine what gets compacted:

  • Retention Window (fixed count): Preserves the last N messages unchanged
    • Typical value: 6-10 messages
    • Guarantees recent context is never compacted
  • Eviction Window (percentage): Controls what portion can be summarized
    • Range: 0.0 (nothing) to 1.0 (everything eligible)
    • Typical value: 0.2 (20% of older context)
    • When it conflicts with the retention window, the more conservative limit wins

Example: With retention_window: 6 and eviction_window: 0.2, if you have 100 messages (the sketch after this list checks the arithmetic):

  • Retention says: preserve last 6, can compact first 94
  • Eviction says: can compact 20% of 100 = 20 messages
  • Result: Compacts the first 20 messages (more conservative)
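
The same arithmetic in a few lines of Python. This is an illustrative model of the rule described above, not Forge's implementation:

def compactable_count(total: int, retention_window: int,
                      eviction_window: float) -> int:
    """Return how many of the oldest messages may be summarized,
    taking the more conservative (smaller) of the two limits."""
    by_retention = max(total - retention_window, 0)  # keep the last N intact
    by_eviction = int(total * eviction_window)       # cap the evicted fraction
    return min(by_retention, by_eviction)

# The example above: retention allows 94, eviction allows 20 -> 20 wins.
print(compactable_count(100, retention_window=6, eviction_window=0.2))  # 20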

Example Use Cases

Long Debugging Sessions

When debugging complex issues, conversations can become lengthy. Context compaction allows the agent to remember key debugging steps while summarizing earlier diagnostics.

compact:
  token_threshold: 100000
  retention_window: 8
  eviction_window: 0.3
  model: google/gemini-2.0-flash-001

Multi-Stage Project Development

For projects developed over multiple sessions, context compaction enables the agent to maintain awareness of project requirements and previous decisions while focusing on current tasks.

compact:
  token_threshold: 150000
  message_threshold: 150
  retention_window: 10
  eviction_window: 0.2
  model: anthropic/claude-3-haiku

Interactive Learning and Tutorials

When using Forge for learning or following tutorials, compaction helps maintain the thread of the lesson while summarizing earlier explanations.

compact:
  token_threshold: 80000
  retention_window: 6
  eviction_window: 0.25
  model: google/gemini-2.0-flash-001

Performance Considerations

Context compaction is designed to minimize performance impact:

  • Parallel Execution: Compaction runs alongside your main request rather than blocking responses (see the sketch after this list)
  • Summarization Latency: More powerful models may take longer but provide better summaries
  • Cost Impact: Each compaction requires an LLM call, adding to usage costs
  • Token Reduction: Effective compaction can reduce overall token usage significantly
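
The sketch below shows the general shape of running compaction concurrently with the main request using Python's asyncio. The function names are stand-ins for illustration, not Forge's API.

import asyncio

async def answer_request(prompt: str) -> str:
    """Stand-in for the main model call."""
    await asyncio.sleep(1.0)  # simulate model latency
    return f"answer to: {prompt}"

async def summarize_history(messages: list) -> str:
    """Stand-in for the summarization call made during compaction."""
    await asyncio.sleep(0.5)  # a faster summarizer model
    return f"summary of {len(messages)} messages"

async def main() -> None:
    # Both calls run concurrently; compaction adds little extra latency.
    answer, summary = await asyncio.gather(
        answer_request("fix the failing test"),
        summarize_history(["m1", "m2", "m3"]),
    )
    print(answer, summary, sep="\n")

asyncio.run(main())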

Optimization tips:

  • Use faster models (like Gemini Flash) for summarization to reduce latency
  • Set higher thresholds to compact less frequently
  • Balance retention_window and eviction_window for your use case

Troubleshooting

Issue: Context Seems Lost After Compaction

Possible causes:

  • Summary model not preserving key details
  • Eviction window too aggressive
  • Retention window too small

Solutions:

  • Increase max_tokens to allow for more detailed summaries
  • Use a more capable summarization model
  • Increase retention_window to preserve more recent messages
  • Reduce eviction_window to compact less aggressively
  • Customize the summarization prompt to emphasize important details

Issue: Slow Responses After Threshold Is Reached

Possible causes:

  • Slow summarization model
  • Very large context to summarize

Solutions:

  • Choose a faster summarization model (e.g., Gemini Flash)
  • Reduce token_threshold to trigger earlier compaction with smaller context
  • Increase eviction_window to compact more at once (fewer compactions)

By effectively using context compaction, you can maintain longer, more productive AI conversations while optimizing for performance and cost efficiency. The system intelligently balances context preservation with token optimization, ensuring your agents have the information they need without exceeding limits.