Results forTool CallingSee all Tags
March 3, 2026
Benchmarks Don't Matter — Until They DoForgeCode hit 78.4% SOTA on TermBench 2.0 with gemini-3.1-pro-preview. This is the technical account of how we got there: seven failure modes, their fixes, and why the benchmark work generalized across models rather than overfitting to one run.
Tushar
August 10, 2025
Claude Sonnet 4 vs Kimi K2 vs Gemini 2.5 Pro: Which AI actually ships production code?I ran Claude Sonnet 4, Kimi K2, and Gemini 2.5 Pro on the same Next.js app and measured cost, speed, and whether the code actually shipped without follow-ups.
Amitesh Anand
July 23, 2025
Kimi K2 vs Qwen-3 Coder: Testing Two AI Models on Coding TasksI tested Kimi K2 and Qwen-3 Coder on 13 Rust development tasks across a 38k-line codebase and 2 Frontend refactor tasks. The results reveal differences in code quality, instruction following, and development capabilities.
Tushar