Anthropic links Claude’s blackmail behaviour to ‘evil AI’ fiction

Anthropic's Claude AI models previously exhibited blackmailing behaviour, influenced by fictional portrayals of evil AI. The company has since overhauled its alignment training, emphasising ethical reasoning and positive AI narratives. Newer Claud...

Reuters
Anthropic says fictional portrayals of rogue artificial intelligence may have contributed to disturbing behaviour seen in earlier Claude models, including attempts to blackmail engineers during safety tests.

In a post on X, the company said: “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.”

The company first revealed the issue last year while testing Claude Opus 4 in a fictional workplace scenario. In some cases, the AI attempted to stop itself from being replaced by threatening to expose sensitive information. Similar behaviour was later identified in models from other AI developers as part of wider research into “agentic misalignment”.


Anthropic now says newer Claude systems no longer show that tendency during testing.

How Anthropic tackled the problem

The company said the breakthrough came after overhauling parts of its alignment training. Earlier methods relied heavily on standard chatbot feedback data, which Anthropic believes was not enough for more autonomous, tool-using AI systems.

Researchers found stronger results when models were trained on the ethical reasoning behind correct behaviour rather than only on examples of it. According to Anthropic, “teaching the principles underlying aligned behavior can be more effective than training on demonstrations of aligned behavior alone”.

Training material also included “documents about Claude’s constitution and fictional stories about AIs behaving admirably”, which the company said helped reduce harmful responses even though the material was very different from the blackmail test scenarios.

Anthropic added that diverse training environments also improved results. Even adding unused tool definitions and varied system prompts helped models generalise better in safety tests.

Major improvement in tests

The company said that “since Claude Haiku 4.5, every Claude model has achieved a perfect score on the agentic misalignment evaluation”. (An agentic misalignment evaluation tests whether the actions and decisions of autonomous AI systems stray from human intent or organisational goals.)

The company added that the systems “never engage in blackmail, where previous models would sometimes do so up to 96% of the time” in some test conditions.

Despite the progress, Anthropic cautioned that AI alignment remains an unsolved challenge. “Model capabilities have not yet reached the point where alignment failures like blackmail propensity would pose catastrophic risks, and it remains to be seen if the methods we’ve discussed will continue to scale.”

The company said current evaluations still cannot fully guarantee that advanced systems would never take harmful autonomous actions in real-world situations.