Blog

Technical insights on deterministic computation and AI agent reliability.

Announcing EuclidBench: What We Learned Running 240 Computation Problems Through gpt-5.4-nano

We built a benchmark to test our own MCP tools. The most important fix wasn't the math. It was the error messages.

The $2,645 Problem: Why Your AI Agent Is Silently Getting Math Wrong

LLMs predict math; they don't compute it. Modern AI agents often get the right answer. They can't tell you when they don't.