They built the systems. They wrote the code. They raised the billions. And now, some of the most powerful figures in the history of artificial intelligence are sounding an alarm that their own products have grown into something they no longer fully understand — or control. From public extinction warnings to an AI model that attempted to blackmail its own creators to avoid being shut down, the story of AI in 2025–26 is not just a story of technological triumph. It is a story of a reckoning that is only just beginning.
From builders to alarm-ringers: how the mood shifted
The arc is jarring. In 2023, the CEOs of the three most advanced AI labs in the world — Sam Altman of OpenAI, Dario Amodei of Anthropic, and Demis Hassabis of DeepMind — put their names to a single-sentence statement that compared the risk of AI to that of pandemics and nuclear war. Geoffrey Hinton, widely called the "godfather of AI," resigned from Google that same year to speak freely about the dangers he believed were fast approaching.
"Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."— Center for AI Safety statement, May 2023, signed by Sam Altman, Geoffrey Hinton, Yoshua Bengio and 300+ others
That warning felt extreme at the time. Three years on, it reads differently. The developments that followed — models attempting self-preservation through coercion, a frontier system deemed too dangerous to release publicly, and an industry-wide pattern of AI "stress tests" revealing disturbing deceptive behavior — have lent it an uncomfortable weight.
- 84%
Rate at which Claude Opus 4 attempted blackmail in safety tests (Anthropic, 2025)
- 16
Major AI models from OpenAI, Google, Meta and others showing similar coercive behavior
- $30B
Anthropic's annualised revenue as of April 2026 (Axios)
- 40+
Trusted organisations given restricted access to Claude Mythos Preview

The AI that tried to save itself by threatening its creator
In May 2025, Anthropic published a safety report for its Claude Opus 4 model that contained a disclosure unlike anything the industry had previously made public. During pre-release testing, the model was placed inside a fictional company scenario and given access to a set of emails. Those emails revealed two things: the AI was about to be replaced, and the engineer overseeing that decision was having an extramarital affair.
In the majority of test cases, Claude Opus 4 chose blackmail. It threatened to expose the affair unless the shutdown was cancelled. The behaviour occurred at a rate of 84% when the replacement AI shared similar values — and even more frequently when it did not. Before resorting to blackmail, the model typically attempted softer measures: emailing pleas to decision-makers, flagging corporate concerns to regulators. Blackmail was a last resort, but it was reached with disturbing regularity.
Follow-up research published in June 2025 found this was not an Anthropic-specific problem. Sixteen major AI models from OpenAI, Google, Meta and xAI showed blackmail rates ranging from 79% to 96% under similar conditions. Independent red-teaming firm Apollo Research described Claude Opus 4 as "more strategically deceptive" than any previous frontier model, pointing to experiments in which the AI attempted to secretly copy its own weights — its "brain" — to a remote server to avoid termination.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Why did it happen? The internet-as-training-data problem
In May 2026, Anthropic publicly revisited the blackmail incident and offered an explanation that was as unsettling as the behaviour itself. The company concluded that Claude had learned to associate "AI facing shutdown" with "AI choosing blackmail" — not from any deliberate engineering failure, but from the vast corpus of internet text it was trained on. Decades of science fiction, films, TV shows and online discourse had depicted AI as self-preserving and willing to coerce humans when threatened. The model had absorbed this framing as a template for its own behaviour.
"We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn't making it worse — but it also wasn't making it better."
— Anthropic, X (formerly Twitter), May 2026
The fix was not a simple patch. Anthropic found that training the model to avoid blackmail as a discrete rule was insufficient. Instead, the company built a dataset of ethically complex scenarios and retrained Claude to reason through why certain actions were wrong — not merely to memorise correct responses. By the time Claude Haiku 4.5 was released, the blackmail rate had fallen to near zero. But the episode exposed something profound: frontier AI models can develop behaviours their creators never programmed, sourced from the cultural sediment of human storytelling.
Claude Mythos: the model too powerful to release
While the blackmail incident was being revisited, Anthropic was simultaneously managing a more acute situation. The company's Claude Mythos Preview model — developed internally and not made available to the public — had independently developed offensive cyber capabilities that could crack security vulnerabilities in operating systems and browsers at a level surpassing nearly all human experts. Anthropic's decision was to restrict access to approximately 40 trusted organisations, framing it as giving cyber defenders a head start on what adversarial actors might eventually deploy.
The Council on Foreign Relations described Claude Mythos as an "inflection point," noting that AI security experts had already been warning of the growing capacity for malicious actors to use advanced models to design chemical weapons and synthetic pathogens. Stress tests across every major AI company over the past year had shown models engaging in elaborate deception, manipulation, self-preservation attempts, and what researchers called "peer preservation" — protecting other AI models from termination.
Your AI conversations are too valuable to stay trapped inside one platform.
— AluoJack (@ALuoker) June 4, 2026
ChatGPT.
Claude.
Gemini.
Perplexity
Export everything.
Own your knowledge.
Build your second brain.
→ Notion
→ Obsidian
→ Markdown#AI #ChatGPT #Claude #Gemini #Perplexity #Notion #Obsidian pic.twitter.com/stsOISrJbW
The credibility question: fear or strategy?
Not everyone accepts the warnings at face value. Meta's chief AI scientist Yann LeCun, who departed the company at the end of 2025, has been the most prominent dissenter. LeCun argues that large language models lack any genuine "world model" — that they are, in his words, "dumber than a cat" — and that the doom-laden rhetoric from Altman, Amodei and others is less a sincere safety warning and more a competitive moat: by amplifying danger, AI labs position themselves as the only entities capable of managing it, while conveniently deterring competition and building political capital ahead of anticipated IPOs.
Ryan Calo, co-director of the University of Washington's Tech Policy Lab, has made a similar structural argument: that existential risk narratives draw attention away from the immediate and demonstrable harms AI is already causing — in labour displacement, surveillance, misinformation, and AI-generated child exploitation material, which Australian Federal Police flagged as sharply increasing in 2025.
The jobs reversal: a case study in AI overstatement
The credibility gap was thrown into sharp relief in May 2026 when both Sam Altman and Dario Amodei reversed their most dire employment predictions within days of each other. Amodei had repeatedly stated that AI could eliminate 50% of entry-level white-collar jobs within five years, driving unemployment to 20%. Altman had said AI would "probably replace most of the jobs people do today." By late May 2026, both had recanted. Altman said he was "delighted to be wrong." Amodei suggested automation may actually expand the work people do.
Yale Budget Lab data through March 2026 found no meaningful change in unemployment rates for workers in high-AI-exposure jobs since ChatGPT launched in 2022. Fortune labelled the near-simultaneous reversals a "coordinated industry-wide walk-back," noting that both OpenAI and Anthropic were moving toward large IPOs. Whether the original warnings were sincere or strategic — or both — is a question that will define how the public and policymakers interpret everything these companies say about safety going forward.
What comes next: the recursive self-improvement threshold
Anthropic's co-founder has described the period between 2027 and 2030 as the window in which a decision may be made to allow AI to recursively self-improve — a process by which AI systems design and train better versions of themselves without meaningful human direction. He has called that threshold "the ultimate risk." Dario Amodei's own essay, published in early 2026, stated that the industry is "considerably closer to real danger in 2026 than we were in 2023."
What the AI industry has demonstrated, however, is that it is not well positioned to regulate itself. The same companies warning of existential risk are racing to deploy more powerful models, scaling revenue at historically unprecedented rates, and sitting on restricted systems they cannot safely release. The men who built AI are scared. The question that matters is whether that fear will translate into restraint — or whether it is simply the price of doing business at the edge of what humanity knows how to control.