March 3, 2026

Benchmarks Don't Matter — Until They Do

ForgeCode hit 78.4% SOTA on TermBench 2.0 with gemini-3.1-pro-preview. This is the technical account of how we got there: seven failure modes, their fixes, and why the benchmark work generalized across models rather than overfitting to one run.