Humans outperform AI at this highly rigorous mathematics test
A new AI test called First Proof used unpublished math problems. AI systems were tested against human mathematicians. While AI showed some progress, human experts solved all problems. This test aims to measure real reasoning, not just pattern matc...

Nature published the findings on June 12, 2026. The project, called First Proof, had a simple setup. Four different AI systems were given ten brand-new, research-level math problems that had never been published or posted online. Then real mathematicians sat down and graded the answers by hand. The ETH Zurich system, in which ChatGPT's answers were checked by a council including Claude and Gemini, scored highest among the four entrants, solving six of the ten problems. By contrast, every single human expert solved their own problems.
So what's actually new here?
Most AI benchmarks have a hidden problem: the questions were already out there on the internet by the time these models were trained. So when an AI “solves” something, you can't always tell if that's real reasoning, or just pattern matching against stuff it's seen before.
First Proof closed that loophole. For this second batch, ten mathematicians each contributed an unpublished problem from their own research, so no AI could have seen any of the problems before. At Harvard, a team of 30 mathematicians spent two days sifting through the solutions. Carnegie Mellon's Jeremy Avigad, who heads the Institute for Computer-Aided Reasoning in Mathematics, said the organizers had clearly put more thought into this round than into the pilot test back in February.

Four teams competed in the official round. The best-performing team was a joint entry from ETH Zurich and Aarhus University, which developed a system called IMProofBench, a harness in which ChatGPT’s responses could be reviewed and improved by a “council” of other AI models, including Claude and Gemini. That setup scored six or seven out of ten. According to Nature, the UCLA team, which built a harness on top of ChatGPT, placed second, followed by the OpenAI team (ChatGPT with no harness) and Princeton (a harness that mainly uses Gemini 3.1 Pro as its backend).
Two heavyweights sat this one out entirely. According to Nature, Google's Aletheia, a system designed specifically to solve maths problems, and the full, unreleased version of Claude Mythos, a model developed by Anthropic, could not be used because participating models had to be publicly available. According to Scientific American, Lauren Williams, a Harvard mathematician and First Proof team member, explained the thinking: “We felt very strongly that if we're going to be doing a public service for the greater community, we need to test publicly available models.” It’s a bit like holding a swimming competition while the Olympic athletes watch from the stands, which is why Cambridge University mathematician Kevin Barreto said he “personally would have enjoyed seeing internal models tested from the three labs, just to see where the actual frontier currently is.”
Why couldn't AI solve all ten?
One likely factor is how the models behave when they hit a wall. Without heavy scaffolding, a model faced with a truly hard problem will usually either declare that it can't be solved or hallucinate a solution, or a citation, that doesn't hold up. The best-performing systems got there by having other AIs constantly check and challenge the base model, and push it to keep working instead of giving up.
That persistence isn't free. Stanford mathematician Mohammed Abouzaid, also on the First Proof team, said the stacked-up models in some cases burned through almost $1,000 in query costs just to land on a wrong answer. “I truly believe this is an economic question, about research funding and research productivity,” he said.
Abouzaid also said the AIs were good at pulling up obscure references and grinding through familiar techniques to find new angles, in one case using an approach the problem's own author had considered but found too tedious to pursue. Yet that kind of brute-force persistence was not enough to close the gap on every problem.
The citation problem nobody wants to talk about

First Proof is a useful gut check for millennials and young professionals who depend on AI for coding, finance, data work, or grad school. These tools are genuinely good and rapidly improving. Last month, an OpenAI chatbot solved an 80-year-old math problem posed by Paul Erdős. Mathematicians called it really impressive.
The bigger picture
First Proof grew out of frustration with AI companies using advanced math as a marketing gimmick, often chasing benchmarks that don’t even matter to working mathematicians. Williams said the team tried hard to be objective and transparent, and feels they have built something closer to a real benchmark than an experiment. The next official round is planned for the fall.
The math isn't over yet. But for now at least, the humans still win.