Announcing EuclidBench: What We Learned Running 240 Computation Problems Through gpt-5.4-nano
We build deterministic computation tools for AI agents: ten domains (calculate, convert, statistics, datetime, finance, geo, color, regex, validate, and encode), exposed via MCP so any agent can use them.
We believed these tools would make agents more accurate. But believing it and proving it are different things. So we built a benchmark.
This post announces EuclidBench, explains why we built it, and shares our early results: the failures, the fixes, and the single most important lesson we've learned so far.
Why We Built EuclidBench
The short answer: we needed to test our own tools.
Existing benchmarks like GSM8K and BIG-Bench test whether LLMs can solve math problems. They don't test whether LLMs can use tools to solve math problems. And they definitely don't test whether the tools themselves are doing their job.
We needed a benchmark that could answer three questions:
- How often does the raw model get computation wrong? (The baseline.)
- Do provider tools like code_interpreter fix it? (The status quo.)
- Does Euclid fix what's left? (Our claim.)
So we built EuclidBench: 240 problems across all 10 Euclid domains, including 80 multi-tool chain problems that require orchestrating multiple tool calls to reach an answer. Every ground truth answer is verified by running it through Euclid itself. The benchmark doesn't trust the model or a human to provide the correct answer.
Three scenarios. Same problems. Same model. The only variable is what tools are available.
The Early Results
We ran gpt-5.4-nano, a small, cheap model, across all 240 problems under each scenario.
Overall accuracy:
| Scenario | Accuracy | Errors |
|---|---|---|
| Raw LLM (no tools) | 37.9% | 149 / 240 |
| Provider tools (code_interpreter) | 56.7% | 104 / 240 |
| Euclid MCP | 74.2% | 62 / 240 |
Provider tools cut nearly a third of the errors. Euclid cut another 40% of what remained.
But the overall number hides the more interesting story. Here's the domain breakdown:
*(Chart: accuracy by domain, showing where tools matter most.)*
Some things that stood out:
Calculate, statistics, and validate all hit 100% with Euclid. These are the domains where the tool does exactly what the model can't: deterministic arithmetic, statistical computation, and format validation. The raw model scored 89.5%, 28.6%, and 100% respectively. (Yes, nano is already perfect at validation. It knows what a valid email looks like. It just can't compute a standard deviation.)
Finance is where the gap is widest. The raw model scored 4.8% on financial calculations: compound interest, loan payments, NPV, amortisation. Provider tools brought that to 38.1%. Euclid brought it to 76.2%. Still not perfect, but a 16x improvement over the raw model.
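To make concrete why the raw model struggles here, this is the kind of deterministic calculation the finance domain performs. A minimal sketch of the standard amortised loan payment formula, not the actual Euclid implementation:

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Amortised loan payment: P * r / (1 - (1 + r)^-n)."""
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # total number of payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

# A $250,000 loan at 6% over 30 years:
print(round(monthly_payment(250_000, 0.06, 30), 2))  # → 1498.88
```

The formula is trivial for a calculator and brutal for next-token prediction: one slip in the exponent and the answer is off by hundreds of dollars.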
Chain problems are the frontier. Multi-tool problems like "calculate the distance between two cities, convert to miles, then compute fuel cost at $3.45/gallon for a 28mpg car" scored 20% raw, 35% with provider tools, 56.2% with Euclid. Orchestrating multiple tool calls in sequence is where every approach struggles, and it's where we're focusing next.
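The chain above can be sketched as three dependent steps, where each output feeds the next call. The function names here are illustrative stand-ins for tool calls, not Euclid's API:

```python
import math

def geo_distance_km(a: tuple, b: tuple) -> float:
    """Great-circle (haversine) distance between (lat, lon) pairs, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def km_to_miles(km: float) -> float:
    return km / 1.609344

def fuel_cost(a: tuple, b: tuple, mpg: float = 28.0, price: float = 3.45) -> float:
    """Chain: distance -> unit conversion -> fuel cost."""
    miles = km_to_miles(geo_distance_km(a, b))   # steps 1 and 2
    return round(miles / mpg * price, 2)         # step 3

# e.g. New York (40.71, -74.01) to Boston (42.36, -71.06)
fuel_cost((40.71, -74.01), (42.36, -71.06))
```

An error at any step propagates, which is why chain accuracy is lower than single-call accuracy under every scenario.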
Regex is an interesting outlier. The raw model scored 100% on regex problems. Euclid scored 58.3%. This tells us something important: regex is fundamentally a pattern language, and LLMs are pattern machines. For some domains, the model doesn't need a tool, and forcing tool use can actually hurt. That's a useful signal.
The 0% Run
When we first ran the full 240-problem dataset with Euclid, accuracy was 0%.
Not a single correct answer. The model's tool calls never even succeeded.
This was the most important moment in building EuclidBench, and it had nothing to do with the benchmark itself.
The problem: gpt-5.4-nano sends all schema parameters to every tool call. If the encode tool has parameters for operation, input, key, algorithm, and output_encoding, the model sends all five, even when calling base64_encode, which only needs operation and input.
Our server was rejecting these calls with "Invalid parameter for this operation." The model received the error and retried with the same parameters, looping 6 to 10 times per problem until it gave up.
Every problem. 240 times. Zero correct.
The Fix: Postel's Law for MCP
The fix was not making the model smarter. The fix was making our tools more forgiving.
We applied Postel's Law ("be liberal in what you accept") to our MCP tool handlers. Instead of rejecting calls with irrelevant parameters, the server now:
- Silently strips parameters that don't apply to the current operation
- Returns the result alongside ignored_params, a list of what was stripped
- Includes a param_hint explaining what the operation actually accepts
The model gets its answer. It also gets a gentle correction: "By the way, base64_encode only accepts operation and input. The key and algorithm you sent were ignored."
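The pattern looks roughly like this. A minimal sketch with hypothetical operation names, not the actual Euclid server code: each operation declares what it accepts, and everything else is stripped and reported back rather than rejected.

```python
import base64

# Which parameters each operation actually accepts (illustrative subset).
ACCEPTED_PARAMS = {
    "base64_encode": {"operation", "input"},
    "hmac_sign": {"operation", "input", "key", "algorithm"},
}

def dispatch(operation: str, params: dict) -> str:
    if operation == "base64_encode":
        return base64.b64encode(params["input"].encode()).decode()
    raise ValueError(f"unknown operation: {operation}")

def handle_call(operation: str, params: dict) -> dict:
    allowed = ACCEPTED_PARAMS[operation]
    used = {k: v for k, v in params.items() if k in allowed}
    ignored = sorted(params.keys() - allowed)
    response = {"result": dispatch(operation, used)}
    if ignored:
        # Postel's Law: accept the call anyway, then teach the signature.
        response["ignored_params"] = ignored
        response["param_hint"] = f"{operation} accepts: {sorted(allowed)}"
    return response
```

The call succeeds even when the model over-sends, and the hint gives the agent something to learn from on the next call instead of a dead-end error.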
After this change: 0% to 74.2%.
The math engine didn't change. The computation was always correct. What changed was how we communicated with the agent when it got the invocation wrong.
What We're Learning
Error messages are not an afterthought. For MCP tools, the error response matters more than the success response. An agent will misuse your tool. The question is whether your tool helps it recover or sends it into a retry loop.
The benchmark itself needs debugging. We fixed 34 false-failures in our scoring system: tolerance issues on geographic calculations, case sensitivity on hex colour values, coordinate format parsing. Building a benchmark that correctly evaluates tool-assisted computation is its own engineering challenge. We're being transparent about this because it matters: if the benchmark is wrong, the results are meaningless.
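The comparison logic we converged on looks roughly like this. An illustrative sketch, not the actual EuclidBench scorer: numeric answers compare within a relative tolerance, and hex colour values compare case-insensitively.

```python
import math

def answers_match(expected, actual, rel_tol: float = 1e-4) -> bool:
    """Tolerance-aware answer comparison for tool-assisted computation."""
    if isinstance(expected, float) and isinstance(actual, float):
        # Geographic distances etc. differ slightly by method; allow tolerance.
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-9)
    if isinstance(expected, str) and expected.startswith("#"):
        # Hex colours: #FF8800 and #ff8800 are the same answer.
        return expected.lower() == str(actual).lower()
    return expected == actual

answers_match(343.556, 343.5561)     # distance, within tolerance
answers_match("#FF8800", "#ff8800")  # hex colour, case-insensitive
```

Exact string equality would have scored both of those as failures, which is exactly the class of false-failure we had to fix.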
Not every domain needs a tool. The regex result (100% raw, 58.3% Euclid) is a useful signal. LLMs are already good at pattern-matching tasks. Routing everything through a tool can add friction without adding accuracy. EuclidBench is helping us understand where tools add value and where they don't.
Small models with the right tools can punch above their weight. gpt-5.4-nano is cheap. With Euclid, it's achieving 100% on domains where the raw model scores under 30%. We'll be testing this further: cheap model + deterministic tools vs. expensive model alone, across more models.
What's Next
This is an early result. One model, one run, a dataset we're still refining. Here's what's coming:
More models. We're running EuclidBench across Claude Sonnet, Claude Haiku, GPT-5.4-mini, Gemini Flash, and others. Each model tells a different story about where tool use helps most.
More problems. 240 is a start. We're expanding the dataset, particularly in the domains where results are most interesting: finance, datetime, and multi-tool chains.
Open dataset. We plan to publish the EuclidBench dataset openly so others can run it, extend it, and verify our results. If we're going to claim that Euclid improves computation accuracy, the evidence should be reproducible.
Better error responses. The Postel's Law fix was a breakthrough, but 74.2% isn't 100%. We're studying the remaining failures to understand what information agents need in tool responses to self-correct more effectively.
We built EuclidBench because we needed to know if our tools actually work. The answer is: they help significantly, but there's real work left to do. We'd rather share that honestly than wait for a perfect number.
Model: gpt-5.4-nano (OpenAI). Dataset: euclidbench-full, 240 problems, 10 domains. Run date: April 9, 2026. Full methodology and dataset details will be published with the open release.