
Announcing EuclidBench: What We Learned Running 240 Computation Problems Through gpt-5.4-nano

We build deterministic computation tools for AI agents: ten domains (calculate, convert, statistics, datetime, finance, geo, color, regex, validate, and encode), exposed via MCP so any agent can use them.

We believed these tools would make agents more accurate. But believing it and proving it are different things. So we built a benchmark.

This post announces EuclidBench, explains why we built it, and shares our early results: the failures, the fixes, and the single most important lesson we've learned so far.


Why We Built EuclidBench

The short answer: we needed to test our own tools.

Existing benchmarks like GSM8K and BIG-Bench test whether LLMs can solve math problems. They don't test whether LLMs can use tools to solve math problems. And they definitely don't test whether the tools themselves are doing their job.

We needed a benchmark that could answer three questions:

  1. How often does the raw model get computation wrong? (The baseline.)
  2. Do provider tools like code_interpreter fix it? (The status quo.)
  3. Does Euclid fix what's left? (Our claim.)

So we built EuclidBench: 240 problems across all 10 Euclid domains, including 80 multi-tool chain problems that require orchestrating multiple tool calls to reach an answer. Every ground truth answer is verified by running it through Euclid itself. The benchmark doesn't trust the model or a human to provide the correct answer.
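To make that rule concrete, the expected answer for each problem is generated by calling the tool under test, never hand-written. A minimal sketch of the idea, where euclid_call and the problem fields are hypothetical stand-ins for the real harness:

```python
def euclid_call(tool: str, operation: str, **args):
    """Stand-in for invoking a Euclid MCP tool and returning its result."""
    ...

def ground_truth(problem: dict):
    # The expected answer is whatever the deterministic tool returns;
    # the benchmark never trusts a model- or human-written answer.
    return euclid_call(problem["tool"], problem["operation"], **problem["args"])
```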

Three scenarios. Same problems. Same model. The only variable is what tools are available.
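In harness terms, the run matrix looks something like this (names illustrative, not our actual config):

```python
# Same model, same 240 problems; only the toolset varies per scenario.
MODEL = "gpt-5.4-nano"
SCENARIOS = {
    "raw": [],                          # no tools at all
    "provider": ["code_interpreter"],   # the provider's built-in tool
    "euclid": ["euclid_mcp"],           # all ten Euclid domains via MCP
}
```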


The Early Results

We ran gpt-5.4-nano, a small, cheap model, across all 240 problems under each scenario.

Overall accuracy:

Scenario                            Accuracy   Errors
Raw LLM (no tools)                  37.9%      149 / 240
Provider tools (code_interpreter)   56.7%      104 / 240
Euclid MCP                          74.2%       62 / 240

Provider tools cut nearly a third of the errors. Euclid cut another 40% of what remained.

But the overall number hides the more interesting story. Here's the domain breakdown:


Accuracy by domain: where tools matter most

Domain       Raw LLM   Provider Tools   Euclid MCP
calculate      89.5%       84.2%          100.0%
statistics     28.6%       90.5%          100.0%
validate      100.0%      100.0%          100.0%
encode         73.3%       86.7%           93.3%
color          26.7%       40.0%           86.7%
geo            20.0%       53.3%           80.0%
finance         4.8%       38.1%           76.2%
convert        46.7%       53.3%           66.7%
datetime       13.3%       40.0%           60.0%
regex         100.0%      100.0%           58.3%
chain          20.0%       35.0%           56.2%

Key finding: the error message is the product

The first full run scored 0%. Not because the math was wrong, but because gpt-5.4-nano sends all schema parameters to every tool call. The server rejected them, the model retried the same call 6 to 10 times, and nothing worked. The fix: apply Postel's Law. Strip irrelevant params silently. Return ignored_params and param_hint so the agent learns the correct signature. After that change: 0% to 74.2%.

Some things that stood out:

Calculate, statistics, and validate all hit 100% with Euclid. These are the domains where the tool does exactly what the model can't: deterministic arithmetic, statistical computation, and format validation. The raw model scored 89.5%, 28.6%, and 100% respectively. (Yes, nano is already perfect at validation. It knows what a valid email looks like. It just can't compute a standard deviation.)
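For a sense of what "can't compute a standard deviation" means in practice: with a tool it's a one-liner, but for a model it's exact multi-step arithmetic. The sample values below are made up:

```python
import statistics

samples = [12.4, 15.1, 9.8, 11.2, 14.6]   # hypothetical data
print(statistics.pstdev(samples))          # deterministic population std dev
```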

Finance is where the gap is widest. The raw model scored 4.8% on financial calculations: compound interest, loan payments, NPV, amortisation. Provider tools brought that to 38.1%. Euclid brought it to 76.2%. Still not perfect, but a 16x improvement over the raw model.
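For context on why finance is so unforgiving, here's the standard amortised-payment formula behind a typical loan problem; the figures are hypothetical:

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    r = annual_rate / 12                   # monthly interest rate
    n = years * 12                         # number of payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

print(round(monthly_payment(250_000, 0.065, 30), 2))  # ~1580.17
```

Exponentiation over 360 periods is exactly the kind of arithmetic a model can't reliably do in its head, which is why the raw score collapses here.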

Chain problems are the frontier. Multi-tool problems like "calculate the distance between two cities, convert to miles, then compute fuel cost at $3.45/gallon for a 28mpg car" scored 20% raw, 35% with provider tools, 56.2% with Euclid. Orchestrating multiple tool calls in sequence is where every approach struggles, and it's where we're focusing next.
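Written out as arithmetic, that example problem is three dependent steps, each of which maps to a separate tool call in the benchmark; the distance below is hypothetical:

```python
distance_km = 542.0                      # step 1: geo (city-to-city distance)
distance_mi = distance_km * 0.621371     # step 2: convert (km -> miles)
fuel_cost = (distance_mi / 28) * 3.45    # step 3: calculate (28 mpg, $3.45/gal)
print(round(fuel_cost, 2))               # ~41.50
```

An error at any step propagates to the final answer, so chain accuracy behaves roughly like the product of the per-step accuracies.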

Regex is an interesting outlier. The raw model scored 100% on regex problems. Euclid scored 58.3%. This tells us something important: regex is fundamentally a pattern language, and LLMs are pattern machines. For some domains, the model doesn't need a tool, and forcing tool use can actually hurt. That's a useful signal.


The 0% Run

When we first ran the full 240-problem dataset with Euclid, accuracy was 0%.

Not a single correct answer. The model wasn't even making tool calls.

This was the most important moment in building EuclidBench, and it had nothing to do with the benchmark itself.

The problem: gpt-5.4-nano sends all schema parameters to every tool call. If the encode tool has parameters for operation, input, key, algorithm, and output_encoding, the model sends all five, even when calling base64_encode, which only needs operation and input.
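An illustrative sketch of the shape of these calls (the parameter names come from the encode tool's schema as described above; the values are made up):

```python
# What base64_encode actually needs:
#   {"operation": "base64_encode", "input": "hello world"}
# What the model sends:
call = {
    "operation": "base64_encode",
    "input": "hello world",
    "key": None,              # irrelevant to base64_encode
    "algorithm": None,        # irrelevant
    "output_encoding": None,  # irrelevant
}
```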

Our server was rejecting these calls with "Invalid parameter for this operation." The model received the error and retried with the same parameters. This looped 6 to 10 times per problem until the model gave up.

Every problem. 240 times. Zero correct.


The Fix: Postel's Law for MCP

The fix was not making the model smarter. The fix was making our tools more forgiving.

We applied Postel's Law ("be liberal in what you accept") to our MCP tool handlers. Instead of rejecting calls with irrelevant parameters, the server now:

  1. Silently strips parameters that don't apply to the current operation
  2. Returns the result alongside ignored_params, a list of what was stripped
  3. Includes a param_hint explaining what the operation actually accepts

The model gets its answer. It also gets a gentle correction: "By the way, base64_encode only accepts operation and input. The key and algorithm you sent were ignored."
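A minimal sketch of the handler pattern, assuming a per-operation allowlist; run_operation stands in for the unchanged math engine, and none of this is Euclid's literal implementation:

```python
OPERATION_PARAMS = {
    "base64_encode": {"operation", "input"},
    # ...one entry per operation
}

def run_operation(operation: str, params: dict):
    """Stand-in for the deterministic computation engine."""
    ...

def handle_call(operation: str, params: dict) -> dict:
    allowed = OPERATION_PARAMS[operation]
    accepted = {k: v for k, v in params.items() if k in allowed}
    ignored = sorted(set(params) - allowed)

    response = {"result": run_operation(operation, accepted)}
    if ignored:
        # Answer anyway, but teach the agent the correct signature.
        response["ignored_params"] = ignored
        response["param_hint"] = f"{operation} accepts: {', '.join(sorted(allowed))}"
    return response
```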

After this change: 0% to 74.2%.

The math engine didn't change. The computation was always correct. What changed was how we communicated with the agent when it got the invocation wrong.


What We're Learning

Error messages are not an afterthought. For MCP tools, the error response matters more than the success response. An agent will misuse your tool. The question is whether your tool helps it recover or sends it into a retry loop.

The benchmark itself needs debugging. We fixed 34 false failures in our scoring system: tolerance issues on geographic calculations, case sensitivity on hex colour values, coordinate format parsing. Building a benchmark that correctly evaluates tool-assisted computation is its own engineering challenge. We're being transparent about this because it matters: if the benchmark is wrong, the results are meaningless.
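Two of those scorer fixes, sketched with illustrative tolerances:

```python
import math

def geo_matches(expected_km: float, got_km: float, rel_tol: float = 1e-3) -> bool:
    # Great-circle distances legitimately differ slightly across earth
    # models, so exact equality was producing false failures.
    return math.isclose(expected_km, got_km, rel_tol=rel_tol)

def hex_matches(expected: str, got: str) -> bool:
    # "#FF8800" and "ff8800" are the same colour.
    return expected.strip().lstrip("#").lower() == got.strip().lstrip("#").lower()
```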

Not every domain needs a tool. The regex result (100% raw, 58.3% Euclid) is a useful signal. LLMs are already good at pattern-matching tasks. Routing everything through a tool can add friction without adding accuracy. EuclidBench is helping us understand where tools add value and where they don't.

Small models with the right tools can punch above their weight. gpt-5.4-nano is cheap. With Euclid, it's achieving 100% on domains where the raw model scores under 30%. We'll be testing this further: cheap model + deterministic tools vs. expensive model alone, across more models.


What's Next

This is an early result. One model, one run, a dataset we're still refining. Here's what's coming:

More models. We're running EuclidBench across Claude Sonnet, Claude Haiku, GPT-5.4-mini, Gemini Flash, and others. Each model tells a different story about where tool use helps most.

More problems. 240 is a start. We're expanding the dataset, particularly in the domains where results are most interesting: finance, datetime, and multi-tool chains.

Open dataset. We plan to publish the EuclidBench dataset openly so others can run it, extend it, and verify our results. If we're going to claim that Euclid improves computation accuracy, the evidence should be reproducible.

Better error responses. The Postel's Law fix was a breakthrough, but 74.2% isn't 100%. We're studying the remaining failures to understand what information agents need in tool responses to self-correct more effectively.

We built EuclidBench because we needed to know if our tools actually work. The answer is: they help significantly, but there's real work left to do. We'd rather share that honestly than wait for a perfect number.


Model: gpt-5.4-nano (OpenAI). Dataset: euclidbench-full, 240 problems, 10 domains. Run date: April 9, 2026. Full methodology and dataset details will be published with the open release.
