The $2,645 Problem: Why Your AI Agent Is Silently Getting Math Wrong
Your AI agent just processed 2,645 invoices.
Each one included a standard price calculation: $2,450 x 1.08. The correct answer is $2,646.00. In many runs, your agent returns exactly that. But in some — under a different context window, a different prompt structure, a slightly different temperature setting — it returns $2,645.00.
That's $1 wrong per invoice. $2,645 in billing errors. And your agent had no idea.
This is the $2,645 Problem. Not that LLMs always get math wrong. That they can't tell you when they do.
Testing Modern LLMs on Math
Before writing this post, we ran a set of real calculations against GPT-4o — not illustrative examples, actual queries to the model.
For simple multiplication (2450 x 1.08), it returned 2646. Correct.
For a mortgage calculation — $247,500 at 6.875% interest, 30-year term — it returned $1,627.54, working through the formula step by step.
For sample standard deviation across a 10-value dataset, it returned 29.7769, with each step shown.
Three for three. In these runs, GPT-4o with chain-of-thought reasoning performed exactly as you'd hope.
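For reference, none of these checks requires a model at all. Sample standard deviation, for instance, is a few lines of TypeScript; a minimal sketch (the 10-value dataset from our test is not reproduced here, so substitute any values):

// Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))
function sampleStdDev(values: number[]): number {
  const n = values.length;
  if (n < 2) throw new Error("need at least two values");
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const sumSq = values.reduce((acc, x) => acc + (x - mean) ** 2, 0);
  return Math.sqrt(sumSq / (n - 1));
}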
Here is the problem with that result.
What the Test Doesn't Show
When GPT-4o answered correctly, it triggered an extended reasoning process — "Thought for 13s" visible in the interface. The model checked its own work, enumerated steps, and verified intermediate values.
Production AI agents rarely do this.
In a typical agentic pipeline, the model is juggling tool calls, context retrieval, response formatting, and dozens of other demands competing for attention. Math is handled inline, as one step in a larger task. There is no "Thought for 13s." There is no step-by-step enumeration. There is a prompt, and a predicted completion, and whatever number comes out.
The same GPT-4o that got 29.7769 in a focused math session will, in a different context, produce a different answer — not because it is "worse," but because it is doing something fundamentally different from calculation. It is predicting what a correct answer should look like, and the prediction quality degrades with context complexity.
This is not a criticism of GPT-4o, or any LLM. It is a description of what they are: next-token predictors trained on text. They are extraordinarily capable at that task. Exact computation is a different task entirely.
The Non-Determinism Problem
Run the same math query through an LLM ten times. You may get ten identical answers. You may get nine correct and one slightly off. You may get two different "correct" answers depending on temperature.
There is no way to know which answer is right without checking the math externally.
This would be fine if the model flagged uncertainty. If it said "I computed 29.7769, but verify this before using it in production" — that would be a tool behaving honestly. LLMs do not do this by default. They return answers with the same surface confidence whether they computed them correctly or confabulated them.
The outputs look identical. The downstream system cannot distinguish between 29.7769 (correct) and 29.7768 (wrong). The only way to know is to compute the answer independently — which defeats the purpose of delegating the calculation to the model.
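If you want to see the spread in your own stack, the check is straightforward to script. A sketch, reusing the llm.complete placeholder that appears later in this post (it stands in for whatever client your agent already uses, not a specific SDK call):

// Sample the same math prompt repeatedly and compare each answer
// against a locally computed baseline.
async function sampleDrift(llm: { complete: (prompt: string) => Promise<string> }) {
  const expected = 2450 * 1.08; // computed locally, not predicted
  const answers = new Map<string, number>();
  for (let i = 0; i < 10; i++) {
    const raw = await llm.complete("What is 2450 * 1.08? Reply with the number only.");
    const key = raw.trim();
    answers.set(key, (answers.get(key) ?? 0) + 1);
  }
  for (const [answer, count] of answers) {
    const ok = Math.abs(parseFloat(answer) - expected) < 0.005;
    console.log(`${answer} (x${count}): ${ok ? "matches" : "DOES NOT MATCH"}`);
  }
}

Any run in which the distinct-answer count is greater than one is the non-determinism problem made visible.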
The Scale Problem
A $1 error sounds trivial. Here is what happens when it is embedded in a system:
Pricing engine generating 10,000 quotes per day: a $1 error on each quote is $10,000 in daily billing variance. Invisible until reconciliation. By then, months of systematic undercharging.
Financial reconciliation agent checking 50,000 transactions per month: A 0.03% error rate produces 15 incorrect entries. Auditors find them; your team explains them.
Multi-jurisdiction tax calculation across an e-commerce platform: Rounding errors applied to intermediate values in the wrong order. Compliance risk that scales with transaction volume.
Trading algorithm executing position sizing: Compounding errors in capital allocation. The math is wrong; the position is wrong; the risk profile is wrong. The portfolio tracks the error, not reality.
In none of these cases is the agent producing obviously wrong output. The numbers look plausible. They are close. They pass format validation. They fail precision validation — which you did not implement, because you trusted the model.
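Precision validation is not hard to write; it just never gets written, because the model's output already looks finished. A sketch of the difference between the two checks, with recompute standing in for whatever independent calculation path you trust:

// Format validation: the answer *looks* like a dollar amount.
function isWellFormed(answer: string): boolean {
  return /^\$?\d{1,3}(,\d{3})*(\.\d{2})?$/.test(answer);
}

// Precision validation: the answer *equals* an independently computed value.
// `recompute` is a placeholder for your deterministic calculation path.
function isExact(answer: string, recompute: () => number): boolean {
  const parsed = parseFloat(answer.replace(/[$,]/g, ""));
  return Math.abs(parsed - recompute()) < 0.005; // within half a cent
}

// "$2,645.00" passes the first check and fails the second.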
Why LLMs Get Math Wrong
The mechanism is worth understanding.
When you send 2450 x 1.08 to an LLM, the model does not run a multiplication operation. It has no multiplication operation. It tokenizes the expression, passes it through billions of trained parameters, and produces tokens that statistically resemble what correct math outputs look like in its training data.
For common, well-represented calculations, this works well. The training corpus includes enough correct arithmetic that the prediction is usually right.
For less common calculations — unusual decimal places, multi-step formulas, statistical functions, financial math with edge cases — the training data is thinner. The prediction confidence decreases. The error probability increases. And the model's output looks exactly the same either way.
Floating point behavior is particularly unreliable. LLMs routinely round intermediate values at the wrong stage, apply floating point arithmetic inconsistently, and produce different precision levels depending on how the question is phrased.
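Here is the rounding-order problem written deterministically, so the two conventions are visible side by side (the line items and tax rate are made up for illustration):

const round2 = (x: number) => Math.round(x * 100) / 100;

const lineItems = [12.45, 12.45, 12.45]; // illustrative prices
const taxRate = 0.06;

// Convention A: round the tax on each line item, then sum.
const perLineTax = round2(
  lineItems.map((p) => round2(p * taxRate)).reduce((a, b) => a + b, 0)
); // 2.25

// Convention B: sum the items first, then round the total tax once.
const totalTax = round2(lineItems.reduce((a, b) => a + b, 0) * taxRate); // 2.24

Both conventions are defensible. The failure mode is an agent that silently switches between them from one invoice to the next.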
The Verification Gap
Here is what a reliable math result looks like:
{
  "expression": "2450 * 1.08",
  "result": 2646,
  "verified": true,
  "hash": "a3f8c2...",
  "timestamp": "2026-03-28T06:55:00Z"
}
You know the expression that was evaluated. You know the result. You know it was computed, not predicted. You have a hash that proves the result was not modified in transit. You have a timestamp for audit.
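When a result arrives in this shape, the downstream check is mechanical. A sketch of a consumer-side integrity check; the hash scheme shown here (SHA-256 over the expression and result) is an assumption for illustration, not the actual format behind the truncated hash above:

import { createHash } from "node:crypto";

// The shape of a verifiable result, matching the fields above.
interface VerifiedResult {
  expression: string;
  result: number;
  verified: boolean;
  hash: string;
  timestamp: string;
}

// Assumed scheme: SHA-256 over "expression=result". Substitute whatever
// scheme your computation engine actually documents.
function checkIntegrity(r: VerifiedResult): boolean {
  const digest = createHash("sha256").update(`${r.expression}=${r.result}`).digest("hex");
  return digest === r.hash;
}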
Here is what an LLM math result looks like:
2646
One number. No expression log. No hash. No indication of whether this was computed or predicted. No way to distinguish a correct result from an incorrect one without recalculating externally.
For low-stakes, one-off math, this is fine. For production systems handling money, measurements, or dates at scale — this is a gap.
The Fix
The fix is not better prompting. The fix is not chain-of-thought. Both improve accuracy; neither provides determinism or verifiability.
The fix is to stop asking LLMs to do math.
LLMs are built for language: understanding intent, generating text, reasoning over context. They are extraordinarily good at those tasks. Math — arithmetic, financial calculations, statistical functions, date arithmetic — should be handled by a deterministic computation engine.
// Before: agent predicts math
const total = await llm.complete(`What is 2450 * 1.08?`);
// Returns: "2646" — probably correct, unverified
// After: agent computes math
const total = await euclid.calculate("2450 * 1.08");
// Returns: { result: 2646, verified: true, hash: "a3f...", expression: "2450 * 1.08" }
One call. Exact result. Verifiable. Auditable.
Euclid is the deterministic computation engine for AI agents. It exposes six tools via MCP — calculate, convert, statistics, datetime, encode, finance — that handle every class of numeric operation an agent is likely to need. Every result is exact. Every result is verified. No token prediction involved.
Add Euclid to Your Agent
Euclid is available as an MCP server. Add it to Claude Code, Cursor, Windsurf, or any MCP-compatible agent in one command:
claude mcp add --transport http euclid https://mcp.euclidtools.com
Free tier: 1,000 lifetime calls, no credit card required. Base rate: $1 per 10,000 calls.
The $2,645 Problem is real. It is running in production systems right now — not because every calculation is wrong, but because no one can tell which ones are.
Tested: GPT-4o (March 2026). Results vary by model, temperature, and context.