Researchers unveil ‘Humanity’s Last Exam,’ a test so difficult that today’s AI systems consistently fail it
Humanity’s Last Exam (HLE) is a groundbreaking 2,500-question assessment created to reveal the limits of advanced AI systems. Developed by almost 1,000 researchers worldwide, including Texas A&M professor Dr. Tung Nguyen, it covers mathematics, n...

To bridge this gap, a global consortium of around 1,000 researchers, including a professor from Texas A&M University, created a new evaluation, an exam so comprehensive, difficult, and deeply grounded in expert human knowledge that contemporary AI consistently fails it.
An Exam Beyond AI Reach
Named “Humanity’s Last Exam” (HLE), the assessment includes 2,500 questions spanning mathematics, natural sciences, humanities, ancient languages, and highly specialized subfields. The initiative is documented in a paper published in Nature, with supporting materials available at lastexam.ai.
Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M, helped author and refine questions. He said:
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding. But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”
Questions were deliberately designed to test areas beyond present AI abilities. Experts ensured every prompt had a single, verifiable answer that could not be quickly retrieved online. Tasks ranged from translating ancient Palmyrene inscriptions to identifying microanatomical structures in birds and analyzing Biblical Hebrew pronunciation.
The Importance of Accurate Benchmarks
Nguyen, who authored 73 of the 2,500 public questions, said: “Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do. Benchmarks provide the foundation for measuring progress and identifying risks.”
The exam illustrates that success on human-oriented assessments does not equal intelligence. Instead, HLE highlights areas where AI cannot yet replicate human depth, context, and expert knowledge.
“This isn’t a race against AI,” Nguyen stated. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”
HLE also highlights the value of international collaboration. Experts across disciplines, including historians, physicists, linguists, medical researchers, and computer scientists, contributed questions, showing that it takes human teamwork to uncover AI limitations.
By offering a transparent, long-term benchmark, Humanity’s Last Exam continues to be one of the clearest measures of the gap between AI and human intelligence.
FAQs:
Q1. What is Humanity’s Last Exam?
Humanity’s Last Exam (HLE) is a comprehensive 2,500-question assessment designed to test the limits of artificial intelligence. It assesses knowledge across multiple domains and highly specialized fields.
Q2. Who created HLE?
The exam was developed by a global consortium of nearly 1,000 researchers. Contributors included experts from mathematics, sciences, humanities, linguistics, and computer science.