AI model makers love to flex their benchmark scores. But how trustworthy are these numbers? What if the tests themselves are rigged, biased, or just plain meaningless?
OpenAI’s o3 debuted with claims that, having been trained on a publicly available ARC-AGI dataset, the LLM scored a “breakthrough 75.7 percent” on ARC-AGI’s semi-private evaluation dataset with a $10K compute limit. ARC-AGI is a set of puzzle-like inputs that AI models try to solve as a measure of intelligence.
Google’s…
Article Source
https://www.theregister.com/2025/02/15/boffins_question_ai_model_test/