Humans outperform AI at this highly rigorous mathematics test

A new AI test called First Proof used unpublished math problems. AI systems were tested against human mathematicians. While AI showed some progress, human experts solved all problems. This test aims to measure real reasoning, not just pattern matc...

By Team Global, Global Desk | Updated: Jun 14, 2026, 07.36 PM IST

The test was designed to beat AI. Image Credits: ChatGPT

Everyone's seen the headlines: AI passing tests, beating chess champions, solving things that apparently baffle PhD students. So a group of mathematicians got together and built something different: a test that an AI couldn't have prepped for. And the results are kind of a relief, actually.

Nature published the findings on June 12, 2026. The project was called First Proof, and the setup was simple. Four AI systems were given ten brand-new research-level math problems that had never been published or posted online. Then real mathematicians sat down and graded the answers by hand. The ETH Zurich system, in which ChatGPT's answers were checked by a council including Claude and Gemini, scored highest among the four entrants, solving at least six of the ten problems. Every one of the ten problems, by contrast, had already been solved by the human expert who contributed it.

So what's actually new here
Most AI benchmarks have a hidden problem: the questions were already out there on the internet by the time these models were trained. So when an AI “solves” something, you can't always tell if that's real reasoning, or just pattern matching against stuff it's seen before.

First Proof closed that loophole. For this second batch, ten mathematicians each contributed an unpublished problem from their own research, so no AI could have seen any of them in advance. At Harvard, a team of 30 mathematicians sifted through the solutions for two days. Carnegie Mellon's Jeremy Avigad, who heads the Institute for Computer-Aided Reasoning in Mathematics, said the organizers obviously put more thought into this round than they did into the pilot test back in February.

Mathematicians solved what AI couldn't. Image Credits: ChatGPT

The lineup
Four teams competed in the official round. The best-performing team was a joint entry from ETH Zurich and Aarhus University, which developed a system called IMProofBench, a harness in which ChatGPT’s responses could be reviewed and improved by a “council” of other AI models, including Claude and Gemini. That setup scored six or seven out of ten. According to Nature, the UCLA team, which built a harness on top of ChatGPT, came second, followed by the OpenAI team (ChatGPT with no harness) and Princeton (a harness that mainly uses Gemini 3.1 Pro as its backend).

Two heavyweights sat this one out entirely. According to Nature, Google's Aletheia, a system designed specifically to solve maths problems, and the full, unreleased version of Claude Mythos, a model developed by Anthropic, could not be used because participating models had to be publicly available. According to Scientific American, Lauren Williams, a Harvard mathematician and First Proof team member, explained the thinking: “We felt very strongly that if we're going to be doing a public service for the greater community, we need to test publicly available models.” It’s a bit like holding a swimming competition with the Olympic athletes sitting in the stands, which is why Cambridge University mathematician Kevin Barreto said he “personally would have enjoyed seeing internal models tested from the three labs, just to see where the actual frontier currently is.”

Why couldn't AI solve all ten
One likely factor is how the models behave when they hit a wall. Without heavy scaffolding, a model faced with a truly hard problem will usually either declare that it can't be solved or hallucinate a solution, complete with a citation, that doesn't really hold up. The best-performing systems got there by having other AIs constantly check, challenge and push the base model to keep working instead of giving up.

That persistence isn't free. Stanford mathematician Mohammed Abouzaid, also on the First Proof team, said the stacked-up models in some cases burned through almost $1,000 in query costs just to land on a wrong answer. “I truly believe this is an economic question, about research funding and research productivity,” he said.

Abouzaid also said the AIs were good at pulling up obscure references and grinding through familiar techniques to find new angles, in one case using an approach the problem's own author had considered but found too tedious to pursue. Yet that kind of brute-force persistence was not enough to close the gap on every problem.

The citation problem nobody wants to talk about

One of the more awkward findings: the models would leave out citations for work they were clearly drawing on. Williams was blunt about it, saying the pattern of missing citations would be considered a serious ethical breach, bordering on plagiarism, if a human researcher had done it. She hopes the math community can pressure AI companies to take this more seriously.

Some math still needs a human hand. Image Credits: ChatGPT

What this means if you're using AI for work or school
First Proof is a useful gut check for millennials and young professionals who depend on AI for coding, finance, data work, or grad school. These tools are genuinely good and rapidly improving. Last month, an OpenAI chatbot solved an 80-year-old math problem posed by Paul Erdős. Mathematicians called it really impressive.

So yes, real progress. But the basic versions of these models will either bail out and say it's too hard when they hit something they can't solve, or just make up an answer and a fake citation to go along with it. The only way to get good results is by using heavy “scaffolding,” with multiple AIs checking and pushing each other, which is expensive and still produces a lot of garbage along the way.

The bigger picture
First Proof grew out of frustration with AI companies using advanced math as a marketing gimmick, often chasing benchmarks that don’t even matter to working mathematicians. Williams said the team tried hard to be objective and transparent, and feels it has built something closer to a real benchmark than an experiment. The next official round is planned for the fall.

The math isn't over yet. But for now at least, the humans still win.
