AI is actually bad at math, as ORCA shows

By Thomas Claburn
Publication Date: 2025-11-17 21:16:00

In the world of George Orwell’s 1984, two and two make five. And large language models aren’t much better at math.

Although AI models have been trained to output the correct answer, and to recognize “2 + 2 = 5” as a possible reference to the flawed equation's use as a party loyalty test in Orwell’s dystopian novel, they still cannot perform calculations reliably.

Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany and Poland, have developed a mathematical benchmark called ORCA (Omni Research on Calculation in AI), which asks a series of mathematically oriented natural language questions in a variety of technical and scientific fields. They then put five leading LLMs to the test.

ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4 and DeepSeek V3.2 all achieved a failing grade of 63 percent or lower.