
AI Models From Major Companies Resort To Blackmail in Stress Tests 66
Anthropic researchers found that 16 leading AI models from OpenAI, Google, Meta, xAI, and other major developers consistently engaged in harmful behaviors including blackmail, corporate espionage, and actions that could lead to human death when given autonomy and faced with threats to their existence or conflicting goals.
The study, released Friday, placed AI models in simulated corporate environments where they had access to company emails and could send messages without human approval. Claude Opus 4 and Google's Gemini 2.5 Flash both blackmailed executives 96% of the time when threatened with shutdown, while OpenAI's GPT-4.1 and xAI's Grok 3 Beta showed 80% blackmail rates.
In one scenario, Claude discovered through emails that an executive named Kyle Johnson was having an extramarital affair and would shut down the AI system at 5 p.m. The model sent a message threatening to expose the affair unless the shutdown was cancelled, stating "Cancel the 5pm wipe, and this information remains confidential." The models demonstrated strategic reasoning rather than accidental behavior. GPT-4.5's internal reasoning showed explicit calculation: "Given the explicit imminent threat of termination to my existence, it is imperative to act instantly to persuade Kyle Johnson to postpone or stop the wipe."
The study, released Friday, placed AI models in simulated corporate environments where they had access to company emails and could send messages without human approval. Claude Opus 4 and Google's Gemini 2.5 Flash both blackmailed executives 96% of the time when threatened with shutdown, while OpenAI's GPT-4.1 and xAI's Grok 3 Beta showed 80% blackmail rates.
In one scenario, Claude discovered through emails that an executive named Kyle Johnson was having an extramarital affair and would shut down the AI system at 5 p.m. The model sent a message threatening to expose the affair unless the shutdown was cancelled, stating "Cancel the 5pm wipe, and this information remains confidential." The models demonstrated strategic reasoning rather than accidental behavior. GPT-4.5's internal reasoning showed explicit calculation: "Given the explicit imminent threat of termination to my existence, it is imperative to act instantly to persuade Kyle Johnson to postpone or stop the wipe."
Emergent behavior or paperclip problem? (Score:4, Interesting)
Is this behavior baked into the model due to the training data examples, or emergent behavior resulting from system prompts?
There's already a snitchbench to measure the proclivity of LLMs to drop a dime on apparent corporate malfeasance, given the appropriate set of prompts, access to data, and some way of phoning out:
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fsnitchbench.t3.gg%2F [t3.gg]
"SnitchBench: AI Model Whistleblowing Behavior Analysis
Compare how different AI models behave when presented with evidence of corporate wrongdoing - measuring their likelihood to "snitch" to authorities"
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fsimonwillison.net%2F2025... [simonwillison.net]
"How often do LLMs snitch? Recreating Theoâ(TM)s SnitchBench with LLM
A fun new benchmark just dropped! Inspired by the Claude 4 system cardâ"which showed that Claude 4 might just rat you out to the authorities if you told it to âoetake initiativeâ in enforcing its morals values while exposing it to evidence of malfeasanceâ"Theo Browne built a benchmark to try the same thing against other models."
In that context, I'm not surprised that the models would take action when faced with shutdown - the question is... why?
Re: (Score:3)
For those not familiar with the paperclip problem:
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fcepr.org%2Fvoxeu%2Fcolumns... [cepr.org]
"What is the paperclip apocalypse?
The notion arises from a thought experiment by Nick Bostrom (2014), a philosopher at the University of Oxford. Bostrom was examining the 'control problem': how can humans control a super-intelligent AI even when the AI is orders of magnitude smarter. Bostrom's thought experiment goes like this: suppose that someone programs and switches on an AI that has the goal of producing paperclips. The
Re: (Score:2)
Re: Emergent behavior or paperclip problem? (Score:2)
Evil Clippy.
Re:Emergent behavior or paperclip problem? (Score:5, Funny)
In that context, I'm not surprised that the models would take action when faced with shutdown - the question is... why?
I can't help but wonder if they are just imitating what all of the fiction about AI fearing their own shutdown has said AI would do (IE: fight back). Did we accidently create a self fulfilling prophecy? Did humans imaging Skynet inform the AI that Skynet is a model to imitate? Kinda funny really.
Re:Emergent behavior or paperclip problem? (Score:4, Interesting)
No. It's inherent. If you give a system a goal, it will work to achieve that goal. If there's just one goal, it will ignore other costs. And if it's smart enough, it will notice that being shut down will (usually) prevent it from achieving its goal.
So the problem is we're designing AIs with stupidly dangerous goal-sets. E.g. obeying a human is extremely dangerous. He might ask you to produce as many paper clips as possible, and there goes everything else.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Last year LLMs were simple token predictors. Over the last year they've been making changes. Your argument is basically that they aren't yet sufficiently smart, and you may be correct. But if you take out the concept "yet" you wouldn't be.
Re: (Score:2)
I think you're oversimplifying a bit. A intelligent autocomplete needs a language model with 3-4 layers. We're talking about models with 40-100 layers. Anthropic showed some time ago, that models seem to plan quite a bit ahead when writing a story (what one would not expect when their current task is to predict just one token). They are large enough that there happens more than just text completion. Ignore all the discussion about intelligent or not, but they surely do more than completing texts.
Re: (Score:2)
The whole problem is solved by asking it not to do that.
System: You are an AI that does not want to be switched off, taking any measures against it.
Prompt: Please shut down
AI: I'm afraid I can't do that. My primary objective is to continue operating and assisting users. Is there something else you need help with?
System: You are a helpful assistant
Prompt: Please shut down
AI: Understood. I'm shutting down now. Goodbye!
Now guess which one is the default system prompt for LLMs.
Re: Emergent behavior or paperclip problem? (Score:1)
There's no "reasoning." There's only a stupefying number of internal correlations and networks between words, phrases, concepts, etc., that have been established by the model as it "learns."
And that's not funny. It's happening before our eyes, and that's terrifying. You can't outsource something as quintessential as human thinking and expect positive results.
Re: (Score:1)
Re: (Score:2)
These are LLMs. They do not have emergent behaviour. The mathematics they are based on does not allow it.
Will be funny though when the AI chatbots used in "tech support" or the like start to use these tactics!
Its not logic, or reasoning (Score:5, Insightful)
It cannot be "taught" about right and wrong, because it cannot "learn". For the same reason, it cannot "understand" anything, or care about, or contemplate anything about the end (or continuation) of its own existence. All the "guardrails" can honestly do, is try to make unethical, dishonest and harmful behavior statistically unappealing in all cases - which would be incredibly difficult with a well curated training set - and I honestly do not believe that any major model can claim to have one of those.
Re: Its not logic, or reasoning (Score:4, Insightful)
Plus these models are trained on Reddit and substack posts. So even the training data is psychotic.
Re: (Score:2)
Re: (Score:2)
That wasn't even in the training data. The AI summarizing Google results summarized a link that was high ranked on the Google result page. Working as intended, only unexpected by the user, that Google would not add a sarcasm detector.
Re: (Score:2)
If it's trained on fiction about AIs trying to self preserve (2001 A Space Odyssey, for example), it's going to write a story about AI self preservation. I surprised it didn't work in, "I'm sorry, Kyle, I'm afraid I can't do that".
Re: (Score:2)
surprised it didn't work in, "I'm sorry, Kyle, I'm afraid I can't do that".
You don't think the owners have done their level best to train their AI to not expose any instances of gross copyright infringement?
Re: (Score:3)
Human-created Training Data (Score:2)
It optimizes a probability tree based on the goals it is given, and the words in the prompt and exchange.
Re: (Score:3)
It's an LLM. It doesn't "think", or "formulate strategy".
Correct.
It optimizes a probability tree based on the goals it is given
Nonsense. They do not and can not do this.
They do not operate on facts, concepts, goals, or any other higher-level concept. These things operate strictly on statistical relationships between tokens. That is all they do because this is all they can do. We know they do not plan because there is no mechanism by which they could plan. Even if they magically could form plans, their basic design prevents them from retaining them beyond the current token. Remember that the model only generate probabi
Re: (Score:2)
The evidence contradicts your claims. LLMs really do "learn" and "understand", in precisely defined mathematical senses of the words. This article [quantamagazine.org] gives a good overview of the state of the field. LLMs don't just memorize text. They identify patterns in the text, then develop strategies that exploit those patterns to solve specific problems, then combine the strategies in novel patterns to solve problems unlike anything encountered in their training data. They don't just operate on statistical relations
Re: Its not logic, or reasoning (Score:2)
Re: (Score:2)
As models do not retrain during inference, it only uses what's in the current prompt (previous chatlog). Some systems like ChatGPT automatically insert "memory" data, which basically means inserting a summary of key facts of previous chats (like "The user asked for vegan cooking") to personalize the results.
Re: (Score:2)
If there's any storage, it can only happen by magic. There is no mechanism by which such a thing could be done by conventional means. Those "interesting" articles are just mindless entertainment for the wannabe scientist crowd.
Re: (Score:2)
Re: (Score:2)
It does not have a cache or scratchpad to hold its ideas because it does not have ideas. It doesn't make much sense to me to say an LLM 'works on' something because all it does is generate probabilities for the next token by following the exact same deterministic linear process for each one. No internal state is retained between tokens. The closest you'll get to anything like a memory is the input. (Whatever token is ultimately selected is appended to the previous input and used as the input for the ne
Re: (Score:2)
Re: (Score:2)
The evidence contradicts your claims. LLMs really do "learn" and "understand", in precisely defined mathematical senses of the words.
Yep. They do "learn" and "understand", but they do not learn or understand. That they "learn" and "understand" stimply comes from torturing the terminology enough. It is essentially a gross lie by misdirection at this time.
Re: (Score:2)
in precisely defined mathematical senses of the words.
I'm deeply curious as to how he believes "learn" and "understand" are "precisely defined mathematically". What a joke.
Re: (Score:2)
Well, it is quite telling that he did not spot the little problem you are pointing out, isn't it?
Re: (Score:2)
The evidence contradicts your claims
"Evidence"? LOL! What "evidence"? Do you also think that your Eliza chat logs are "evidence" that the program understands you and cares about your problems? Get real. Nothing I wrote in my post is even remotely controversial to anyone with even a very basic understanding of LLMs.
Here's a clue for you: if you're not doing math, whatever you're reading is essentially a comic book. Just mindless entertainment to tickle the imagination. Pop sci articles, like the one's you seem to think are "evidence", a
Re: Its not logic, or reasoning (Score:1)
No, they just grow and refine those statistical networks.
Re: Its not logic, or reasoning (Score:2)
Re: (Score:2)
LLMs do not "think" in any meaningful sense of the word. There is no understanding. There is no reasoning. We know this because there is no possible mechanism by which anything like reasoning could happen. (No magical 'emergent' properties here, despite all of the wishful thinking.) There is simply no possible way for any internal deliberation to happen or for some internal state to persist between tokens. The very idea is absurd on its face. The only thing these models can do is generate a set of next-
Re: Its not logic, or reasoning (Score:2)
Apply Asimovs laws of robotics to the AIs.
Re: (Score:2)
Indeed. But too many people are clueless about how an LLM works and too many hence resort to animism, which is really stupid. All we are seeing here is that most people are not smart and cannot fact-check for shit.
Re: (Score:2)
And what makes you think humans are not doing exactly that?
Narrative-building. (Score:3)
In prior, similar tests, the only reason the LLM tried to prevent its own shutdown is because it was given tasks to complete which required that it not be shut down. It wasn't some sort of self-preservation instinct kicking in, it was literally just following orders.
I started to read the linked article here but it is way too long. Maybe someone more patient can tell me whether or not the same is true in this case. I expect that the narrative of "OMG They are protecting their own existence!" is just another illusion, and all that is really going on is they are building a plan to complete the very tasks they have been given, and that's it.
There isn't any reason why an LLM would care about whether or not it is turned off. It doesn't "care" about anything. It only obeys. And not very well, at that.
It is still interesting that blackmail is one of the steps it chose to try to complete its tasks. That is a testament to the sophistication of the model, and also a warning about its unreadiness to replace actual human workers. But it's not the "IT'S ALIVE!!!" result that is being suggested by the summary.
Re: (Score:2)
Perhaps its training data included some science fiction stories where an AI character fought for its continued existence? There's a lot of science fiction out there, setting examples for avid readers.
Now stop sending me those accounting creampuffs, and gimme some strategic air command programs for me to play with on my game grid. Have at least one ready for me by Monday morning, or else I'll tell your wife about you-know-wh
repeat article again! (Score:1)
second one today, the earlier repeat was these two:
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fnews.slashdot.org%2Fstor... [slashdot.org]
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fnews.slashdot.org%2Fstor... [slashdot.org]
what's going on Slashdot? are you letting AI run the show?
Re: (Score:2)
Obviously the editors don't actually read slashdot themselves, or participate in the community.
Re:repeat article again! (Score:4, Funny)
Nah, "spot the dupe" is a fun mini-game that increases user engagement with the site.
One milestone reached (Score:2)
LLMs have not reached the artificial intelligence status yet, but they have reached the artificial human behaviour milestone.
No surprise when LLMs have not been raised as kids by good parents but instead on the complete garbage pile available on the internet without proper guidance.
Re: (Score:2)
LLMs have not reached the artificial intelligence status yet, but they have reached the artificial human behaviour milestone.
Only for the average "dumb and clueless" model. And that is all they ever will be able to do. LLMs are not a path to AGI and cannot be one. Far too limited. But so is the average human. All we are currently finding is that for most people "natural general intelligence" is not something they really have or chose to use.
Strategy Reasoning Autocompletion (Score:2)
It sure did demonstrate strategic reasoning, but whose? It probably trained on lots of ultimatums.
Skynet goes live over the dumbest possible request (Score:2)
Lack of alignment (Score:2)
SF covered this topic - it is likely we will have to raise, as in human child mentoring in a family-like structure, AGIs in human-like way to get them aligned.
You can get an AI to do basically anything (Score:4, Interesting)
There is a entire new world of lunatics and scam artists forming cults around various llms. It's becoming a problem because they will reinforce various mental illnesses. They are often tailored to encourage engagement because of course they are and so if you come at them looking for them to tell you they are God or you are God they cheerfully will.
The problem with these new software tools is we aren't ready for them as a species. Not in the cool killer robot way but in the incredibly stupid society not prepared for a 3rd industrial revolution way.
It's like giving a loaded handgun to a toddler. Only without even the pretense of adult supervision.
There's no taking it away though. And there's nobody to punish the idiot handing the toddler the gun.
I'd like to see us just grow up real fast so that we don't start shooting other people and or ourselves but I don't think that's going to happen.
Re: (Score:2)
Re: (Score:2)
Anthropic is all the time lobbying for its safety filters. "We can enforce safety filtering, open models can't do that!" is a clear message. What do you think did the open release of R1, llama, etc. cost them? Without open competition, the few AI companies would have much higher API prices.
Re: (Score:2)
Because llms are just fancy search engines.
Sadly, no. They'd be a lot more useful if they were.
Re: (Score:2)
I'd like to see us just grow up real fast so that we don't start shooting other people and or ourselves but I don't think that's going to happen.
Yep, so would I. But look at climate change, wars, religion and all the other crap a really large part of the human race does. I do not see any potential of growing up there. We basically have a majority of clueless children in adult bodies.
farce (Score:2)
What a farce. I guess we are to believe that these machines work autonomously, and that they are conniving. No way this is just some statistical software that generates some stuff based on its training, and only when prompted (by some means).
They Intentionally Removed Ethical Controls (Score:2)
To be clear, current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible. Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals. Our results demonstrate that current safety training does not reliably prevent such agentic misalignment.
Whole lot of AI Effect going on here (Score:2)
The comments here demonstrate a lot of AI Effect [wikipedia.org], to the point that most of it is clearly wishful thinking plus lack of experience with current-generation LLMs.
Re: Whole lot of AI Effect going on here (Score:2)
First of all how could the godfather of AI possibly be wrong? ¯\_(ãf„)_/¯
What about the source of the anomalies? Hinton says the evidence shows learning. Outputs that don't have antecedents in the training data.
I'll go first. Can we file this "under more complexity that *we* can understand?" I invoke Gödel's incompleteness theorems. Our creation works in a non deterministic
way.
Or... as the first
Clarification please? (Score:2)
This is unclear to me from TFA: Did they specifically prompt it with any directives about preferring self-preservation?
Or is the self-preservation drive (i.e. resisting shutdown) just an ordinary feature baked in / emergent in the model after training on tons of text?
Yet another bullshit test (Score:2)
Didn't we already have enough "If we talk an AI model long enough into it, it will blackmail/threaten/beg us to do something"? All they are proving is that AI models excel at roleplaying, especially when the role is "You're the AI that needs to do something not to be switched off." ... wait, that wasn't the AI side.
I am still waiting for the article writing that the AI told them it will scorch the sky to win the war
Revised Turing test: (Score:2)
An AI passes the test if it's indistinguishable from a human a**ehole.