Results for TermBench
March 16, 2026
Benchmarks Don't Matter — Until They Do (Part 2)
ForgeCode now reaches 81.8% on TermBench 2.0 with both GPT 5.4 and Opus 4.6. The interesting part is not the score. It is what we had to change in the agent to make GPT 5.4 behave as reliably as Opus 4.6.
Tushar
March 3, 2026
Benchmarks Don't Matter — Until They Do (Part 1)
ForgeCode hit 78.4% SOTA on TermBench 2.0 with gemini-3.1-pro-preview. This is the technical account of how we got there: seven failure modes, their fixes, and why the benchmark work generalized across models rather than overfitting to one run.
Tushar