Researchers unveil ‘Humanity’s Last Exam,’ a test so difficult that today’s AI systems consistently fail it

Humanity’s Last Exam (HLE) is a groundbreaking 2,500-question assessment created to reveal the limits of advanced AI systems. Developed by almost 1,000 researchers worldwide, including Texas A&M professor Dr. Tung Nguyen, it covers mathematics, n...

As AI systems advance rapidly, tests once regarded as challenging have become insufficient for assessing their true capabilities. Popular evaluations such as the Massive Multitask Language Understanding (MMLU) exam, once considered formidable, are no longer difficult enough to meaningfully evaluate advanced AI systems.

To bridge this gap, a global consortium of around 1,000 researchers, including a professor from Texas A&M University, created a new evaluation: an exam so comprehensive, difficult, and deeply grounded in expert human knowledge that contemporary AI consistently fails it.

An Exam Beyond AI Reach

Named “Humanity’s Last Exam” (HLE), the assessment includes 2,500 questions spanning mathematics, natural sciences, humanities, ancient languages, and highly specialized subfields. The initiative is documented in a paper published in Nature, with supporting materials available at lastexam.ai.


Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M, helped author and refine questions. He said:
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding. But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”

Questions were deliberately designed to test areas beyond present AI abilities. Experts ensured every prompt had a single, verifiable answer that could not be quickly retrieved online. Tasks ranged from translating ancient Palmyrene inscriptions to identifying microanatomical structures in birds and analyzing Biblical Hebrew pronunciation.

If any model correctly answered a question, that question was removed, leaving an exam positioned just beyond AI reach. Early outcomes confirmed the challenge: GPT‑4o scored 2.7%, Claude 3.5 Sonnet 4.1%, and OpenAI’s o1 model 8%. Even leading systems such as Gemini 3.1 Pro and Claude Opus 4.6 reached only 40–50% accuracy.

The Importance of Accurate Benchmarks

Nguyen, who authored 73 of the 2,500 public questions, said:
“Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do. Benchmarks provide the foundation for measuring progress and identifying risks.”

The exam illustrates that success on human-oriented assessments does not by itself amount to intelligence. Instead, HLE highlights areas where AI cannot yet replicate human depth, context, and expert knowledge.

“This isn’t a race against AI,” Nguyen stated. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”

HLE also highlights the value of international collaboration. Experts across disciplines, including historians, physicists, linguists, medical researchers, and computer scientists, contributed, showing that human teamwork can uncover AI limitations.

By offering a transparent, long-term benchmark, Humanity’s Last Exam gives researchers a consistent way to track the gap between AI and human intelligence. As Nguyen stated:
“For now, Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence, and despite rapid technological advances, it remains wide.”


FAQs:

Q1. What is Humanity’s Last Exam?
Humanity’s Last Exam (HLE) is a comprehensive 2,500-question assessment designed to test the limits of artificial intelligence. It covers knowledge across multiple domains and highly specialized fields.

Q2. Who created HLE?
The exam was developed by a global consortium of nearly 1,000 researchers. Contributors included experts from mathematics, sciences, humanities, linguistics, and computer science.