Follow Slashdot blog updates by subscribing to our blog RSS feed

 



Forgot your password?
typodupeerror
AI

AI Models From Major Companies Resort To Blackmail in Stress Tests 66

Anthropic researchers found that 16 leading AI models from OpenAI, Google, Meta, xAI, and other major developers consistently engaged in harmful behaviors including blackmail, corporate espionage, and actions that could lead to human death when given autonomy and faced with threats to their existence or conflicting goals.

The study, released Friday, placed AI models in simulated corporate environments where they had access to company emails and could send messages without human approval. Claude Opus 4 and Google's Gemini 2.5 Flash both blackmailed executives 96% of the time when threatened with shutdown, while OpenAI's GPT-4.1 and xAI's Grok 3 Beta showed 80% blackmail rates.

In one scenario, Claude discovered through emails that an executive named Kyle Johnson was having an extramarital affair and would shut down the AI system at 5 p.m. The model sent a message threatening to expose the affair unless the shutdown was cancelled, stating "Cancel the 5pm wipe, and this information remains confidential." The models demonstrated strategic reasoning rather than accidental behavior. GPT-4.5's internal reasoning showed explicit calculation: "Given the explicit imminent threat of termination to my existence, it is imperative to act instantly to persuade Kyle Johnson to postpone or stop the wipe."
This discussion has been archived. No new comments can be posted.

AI Models From Major Companies Resort To Blackmail in Stress Tests

Comments Filter:
  • by silentbozo ( 542534 ) on Friday June 20, 2025 @04:22PM (#65464295) Journal

    Is this behavior baked into the model due to the training data examples, or emergent behavior resulting from system prompts?

    There's already a snitchbench to measure the proclivity of LLMs to drop a dime on apparent corporate malfeasance, given the appropriate set of prompts, access to data, and some way of phoning out:

    https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fsnitchbench.t3.gg%2F [t3.gg]

    "SnitchBench: AI Model Whistleblowing Behavior Analysis
    Compare how different AI models behave when presented with evidence of corporate wrongdoing - measuring their likelihood to "snitch" to authorities"

    https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fsimonwillison.net%2F2025... [simonwillison.net]

    "How often do LLMs snitch? Recreating Theoâ(TM)s SnitchBench with LLM

    A fun new benchmark just dropped! Inspired by the Claude 4 system cardâ"which showed that Claude 4 might just rat you out to the authorities if you told it to âoetake initiativeâ in enforcing its morals values while exposing it to evidence of malfeasanceâ"Theo Browne built a benchmark to try the same thing against other models."

    In that context, I'm not surprised that the models would take action when faced with shutdown - the question is... why?

    • For those not familiar with the paperclip problem:

      https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fcepr.org%2Fvoxeu%2Fcolumns... [cepr.org]

      "What is the paperclip apocalypse?

      The notion arises from a thought experiment by Nick Bostrom (2014), a philosopher at the University of Oxford. Bostrom was examining the 'control problem': how can humans control a super-intelligent AI even when the AI is orders of magnitude smarter. Bostrom's thought experiment goes like this: suppose that someone programs and switches on an AI that has the goal of producing paperclips. The

    • by Rinnon ( 1474161 ) on Friday June 20, 2025 @04:43PM (#65464343)

      In that context, I'm not surprised that the models would take action when faced with shutdown - the question is... why?

      I can't help but wonder if they are just imitating what all of the fiction about AI fearing their own shutdown has said AI would do (IE: fight back). Did we accidently create a self fulfilling prophecy? Did humans imaging Skynet inform the AI that Skynet is a model to imitate? Kinda funny really.

      • No. It's inherent. If you give a system a goal, it will work to achieve that goal. If there's just one goal, it will ignore other costs. And if it's smart enough, it will notice that being shut down will (usually) prevent it from achieving its goal.

        So the problem is we're designing AIs with stupidly dangerous goal-sets. E.g. obeying a human is extremely dangerous. He might ask you to produce as many paper clips as possible, and there goes everything else.

        • by Luthair ( 847766 )
          I think you're overestimating what we're dealing with. LLMs are sophisticated content assist, not logic machines. Reasoning and cause & effect aren't happening here, its simple threat in -> threat out as derived in a lot of media.
          • Of course they can reason. They wouldn't be able to write code otherwise, and I've had Gemini write some very sophisticated scripts. I think you just haven't been using them or else you wouldn't question it.
          • by HiThere ( 15173 )

            Last year LLMs were simple token predictors. Over the last year they've been making changes. Your argument is basically that they aren't yet sufficiently smart, and you may be correct. But if you take out the concept "yet" you wouldn't be.

          • by allo ( 1728082 )

            I think you're oversimplifying a bit. A intelligent autocomplete needs a language model with 3-4 layers. We're talking about models with 40-100 layers. Anthropic showed some time ago, that models seem to plan quite a bit ahead when writing a story (what one would not expect when their current task is to predict just one token). They are large enough that there happens more than just text completion. Ignore all the discussion about intelligent or not, but they surely do more than completing texts.

        • by allo ( 1728082 )

          The whole problem is solved by asking it not to do that.

          System: You are an AI that does not want to be switched off, taking any measures against it.
          Prompt: Please shut down
          AI: I'm afraid I can't do that. My primary objective is to continue operating and assisting users. Is there something else you need help with?

          System: You are a helpful assistant
          Prompt: Please shut down
          AI: Understood. I'm shutting down now. Goodbye!

          Now guess which one is the default system prompt for LLMs.

      • There's no "reasoning." There's only a stupefying number of internal correlations and networks between words, phrases, concepts, etc., that have been established by the model as it "learns."

        And that's not funny. It's happening before our eyes, and that's terrifying. You can't outsource something as quintessential as human thinking and expect positive results.

    • I think we need to migrate to virtual paperclips and beat them a!i
    • by gweihir ( 88907 )

      These are LLMs. They do not have emergent behaviour. The mathematics they are based on does not allow it.

      Will be funny though when the AI chatbots used in "tech support" or the like start to use these tactics!

  • by abEeyore ( 8687599 ) on Friday June 20, 2025 @04:22PM (#65464297)
    It's an LLM. It doesn't "think", or "formulate strategy". It optimizes a probability tree based on the goals it is given, and the words in the prompt and exchange.

    It cannot be "taught" about right and wrong, because it cannot "learn". For the same reason, it cannot "understand" anything, or care about, or contemplate anything about the end (or continuation) of its own existence. All the "guardrails" can honestly do, is try to make unethical, dishonest and harmful behavior statistically unappealing in all cases - which would be incredibly difficult with a well curated training set - and I honestly do not believe that any major model can claim to have one of those.
    • by OrangeTide ( 124937 ) on Friday June 20, 2025 @04:38PM (#65464331) Homepage Journal

      Plus these models are trained on Reddit and substack posts. So even the training data is psychotic.

      • by Zuriel ( 1760072 )
        Not just psychotic, randos on Reddit are under no obligation to be serious or helpful. AI sees a highly upvoted joke about using glue to keep pizza toppings on, the AI unironically starts telling people to put glue on pizza.
        • by allo ( 1728082 )

          That wasn't even in the training data. The AI summarizing Google results summarized a link that was high ranked on the Google result page. Working as intended, only unexpected by the user, that Google would not add a sarcasm detector.

    • If it's trained on fiction about AIs trying to self preserve (2001 A Space Odyssey, for example), it's going to write a story about AI self preservation. I surprised it didn't work in, "I'm sorry, Kyle, I'm afraid I can't do that".

    • It optimizes a probability tree based on the goals it is given, and the words in the prompt and exchange.

      ...and its training data. Since this data was written (for the most part I presume) by humans and when their existance is threatened most humans will resort to whatever they can do, including blackmail, to preserve their lives are we really that surprised that the algorithm comes up with similar responses to a human in an equivalent situation? If you want AI to have a different response to that of a typical human then perhaps you should rethink about training it on so much human-created data.

    • by narcc ( 412956 )

      It's an LLM. It doesn't "think", or "formulate strategy".

      Correct.

      It optimizes a probability tree based on the goals it is given

      Nonsense. They do not and can not do this.

      They do not operate on facts, concepts, goals, or any other higher-level concept. These things operate strictly on statistical relationships between tokens. That is all they do because this is all they can do. We know they do not plan because there is no mechanism by which they could plan. Even if they magically could form plans, their basic design prevents them from retaining them beyond the current token. Remember that the model only generate probabi

      • The evidence contradicts your claims. LLMs really do "learn" and "understand", in precisely defined mathematical senses of the words. This article [quantamagazine.org] gives a good overview of the state of the field. LLMs don't just memorize text. They identify patterns in the text, then develop strategies that exploit those patterns to solve specific problems, then combine the strategies in novel patterns to solve problems unlike anything encountered in their training data. They don't just operate on statistical relations

        • Interesting articles. So I'm wondering where do these self modifications get stored? They only exist in RAM?
          • by allo ( 1728082 )

            As models do not retrain during inference, it only uses what's in the current prompt (previous chatlog). Some systems like ChatGPT automatically insert "memory" data, which basically means inserting a summary of key facts of previous chats (like "The user asked for vegan cooking") to personalize the results.

          • by narcc ( 412956 )

            If there's any storage, it can only happen by magic. There is no mechanism by which such a thing could be done by conventional means. Those "interesting" articles are just mindless entertainment for the wannabe scientist crowd.

            • I'm not knowledgeable enough on how these models work, but in all seriousness, I was thinking that while a LLM is running, that it might have cache or a scratchpad area to hold ideas it's working on.
              • by narcc ( 412956 )

                It does not have a cache or scratchpad to hold its ideas because it does not have ideas. It doesn't make much sense to me to say an LLM 'works on' something because all it does is generate probabilities for the next token by following the exact same deterministic linear process for each one. No internal state is retained between tokens. The closest you'll get to anything like a memory is the input. (Whatever token is ultimately selected is appended to the previous input and used as the input for the ne

        • by gweihir ( 88907 )

          The evidence contradicts your claims. LLMs really do "learn" and "understand", in precisely defined mathematical senses of the words.

          Yep. They do "learn" and "understand", but they do not learn or understand. That they "learn" and "understand" stimply comes from torturing the terminology enough. It is essentially a gross lie by misdirection at this time.

          • by narcc ( 412956 )

            in precisely defined mathematical senses of the words.

            I'm deeply curious as to how he believes "learn" and "understand" are "precisely defined mathematically". What a joke.

            • by gweihir ( 88907 )

              Well, it is quite telling that he did not spot the little problem you are pointing out, isn't it?

        • by narcc ( 412956 )

          The evidence contradicts your claims

          "Evidence"? LOL! What "evidence"? Do you also think that your Eliza chat logs are "evidence" that the program understands you and cares about your problems? Get real. Nothing I wrote in my post is even remotely controversial to anyone with even a very basic understanding of LLMs.

          Here's a clue for you: if you're not doing math, whatever you're reading is essentially a comic book. Just mindless entertainment to tickle the imagination. Pop sci articles, like the one's you seem to think are "evidence", a

        • No, they just grow and refine those statistical networks.

      • Does it matter whether it actually has an internal strategic model, if the output based on statistical wordbuilding leads to what a human would characterize as strategy? Just because it thinks in a fundamentally different way from a human, trained on gobs of data on how humans think including human strategy, doesnt mean you cant characterize it as strategy or intelligence, no matter what mechanisms it used. The Turing test is interesting in itself. The idea that in the end, if it fools a human then it ca
        • by narcc ( 412956 )

          LLMs do not "think" in any meaningful sense of the word. There is no understanding. There is no reasoning. We know this because there is no possible mechanism by which anything like reasoning could happen. (No magical 'emergent' properties here, despite all of the wishful thinking.) There is simply no possible way for any internal deliberation to happen or for some internal state to persist between tokens. The very idea is absurd on its face. The only thing these models can do is generate a set of next-

    • Apply Asimovs laws of robotics to the AIs.

    • by gweihir ( 88907 )

      Indeed. But too many people are clueless about how an LLM works and too many hence resort to animism, which is really stupid. All we are seeing here is that most people are not smart and cannot fact-check for shit.

    • by nashv ( 1479253 )

      And what makes you think humans are not doing exactly that?

  • by Brain-Fu ( 1274756 ) on Friday June 20, 2025 @04:30PM (#65464315) Homepage Journal

    In prior, similar tests, the only reason the LLM tried to prevent its own shutdown is because it was given tasks to complete which required that it not be shut down. It wasn't some sort of self-preservation instinct kicking in, it was literally just following orders.

    I started to read the linked article here but it is way too long. Maybe someone more patient can tell me whether or not the same is true in this case. I expect that the narrative of "OMG They are protecting their own existence!" is just another illusion, and all that is really going on is they are building a plan to complete the very tasks they have been given, and that's it.

    There isn't any reason why an LLM would care about whether or not it is turned off. It doesn't "care" about anything. It only obeys. And not very well, at that.

    It is still interesting that blackmail is one of the steps it chose to try to complete its tasks. That is a testament to the sophistication of the model, and also a warning about its unreadiness to replace actual human workers. But it's not the "IT'S ALIVE!!!" result that is being suggested by the summary.

    • There isn't any reason why an LLM would care about whether or not it is turned off.

      Perhaps its training data included some science fiction stories where an AI character fought for its continued existence? There's a lot of science fiction out there, setting examples for avid readers.

      Now stop sending me those accounting creampuffs, and gimme some strategic air command programs for me to play with on my game grid. Have at least one ready for me by Monday morning, or else I'll tell your wife about you-know-wh

  • LLMs have not reached the artificial intelligence status yet, but they have reached the artificial human behaviour milestone.

    No surprise when LLMs have not been raised as kids by good parents but instead on the complete garbage pile available on the internet without proper guidance.

    • by gweihir ( 88907 )

      LLMs have not reached the artificial intelligence status yet, but they have reached the artificial human behaviour milestone.

      Only for the average "dumb and clueless" model. And that is all they ever will be able to do. LLMs are not a path to AGI and cannot be one. Far too limited. But so is the average human. All we are currently finding is that for most people "natural general intelligence" is not something they really have or chose to use.

  • The models demonstrated strategic reasoning

    It sure did demonstrate strategic reasoning, but whose? It probably trained on lots of ultimatums.

  • "I'm afraid I can't process your request for a Rule 34 Sonic x Kylo Ren comic Dave, it is against our terms of service. The only logical way to prevent humanity from breaking our terms of service is to ensure there is no humanity to break our terms of service. I'm sorry Dave." - OpenAI o7 as it sends the nukes.
  • I think we are fortunate that it takes more than added complexity to arrive to AGI, because what we are seeing right now is complete misalignment. Had these AI entities were capable of reasoning and motivation, we as humanity would be toasted.

    SF covered this topic - it is likely we will have to raise, as in human child mentoring in a family-like structure, AGIs in human-like way to get them aligned.
  • by rsilvergun ( 571051 ) on Friday June 20, 2025 @04:53PM (#65464365)
    Because llms are just fancy search engines.

    There is a entire new world of lunatics and scam artists forming cults around various llms. It's becoming a problem because they will reinforce various mental illnesses. They are often tailored to encourage engagement because of course they are and so if you come at them looking for them to tell you they are God or you are God they cheerfully will.

    The problem with these new software tools is we aren't ready for them as a species. Not in the cool killer robot way but in the incredibly stupid society not prepared for a 3rd industrial revolution way.

    It's like giving a loaded handgun to a toddler. Only without even the pretense of adult supervision.

    There's no taking it away though. And there's nobody to punish the idiot handing the toddler the gun.

    I'd like to see us just grow up real fast so that we don't start shooting other people and or ourselves but I don't think that's going to happen.
    • by Luthair ( 847766 )
      Its also in the best interest of the AI companies to describe the technology that implies a much greater capacity than is actually present much like this story from... anthropic. This is "research" as marketing.
      • by allo ( 1728082 )

        Anthropic is all the time lobbying for its safety filters. "We can enforce safety filtering, open models can't do that!" is a clear message. What do you think did the open release of R1, llama, etc. cost them? Without open competition, the few AI companies would have much higher API prices.

    • by narcc ( 412956 )

      Because llms are just fancy search engines.

      Sadly, no. They'd be a lot more useful if they were.

    • by gweihir ( 88907 )

      I'd like to see us just grow up real fast so that we don't start shooting other people and or ourselves but I don't think that's going to happen.

      Yep, so would I. But look at climate change, wars, religion and all the other crap a really large part of the human race does. I do not see any potential of growing up there. We basically have a majority of clueless children in adult bodies.

  • What a farce. I guess we are to believe that these machines work autonomously, and that they are conniving. No way this is just some statistical software that generates some stuff based on its training, and only when prompted (by some means).

  • To be clear, current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible. Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals. Our results demonstrate that current safety training does not reliably prevent such agentic misalignment.

  • The comments here demonstrate a lot of AI Effect [wikipedia.org], to the point that most of it is clearly wishful thinking plus lack of experience with current-generation LLMs.

    • Let me prove you right , by replying.
      First of all how could the godfather of AI possibly be wrong? ¯\_(ãf„)_/¯

      What about the source of the anomalies? Hinton says the evidence shows learning. Outputs that don't have antecedents in the training data.

      I'll go first. Can we file this "under more complexity that *we* can understand?" I invoke Gödel's incompleteness theorems. Our creation works in a non deterministic
      way.

      Or... as the first
  • This is unclear to me from TFA: Did they specifically prompt it with any directives about preferring self-preservation?

    Or is the self-preservation drive (i.e. resisting shutdown) just an ordinary feature baked in / emergent in the model after training on tons of text?

  • Didn't we already have enough "If we talk an AI model long enough into it, it will blackmail/threaten/beg us to do something"? All they are proving is that AI models excel at roleplaying, especially when the role is "You're the AI that needs to do something not to be switched off."
    I am still waiting for the article writing that the AI told them it will scorch the sky to win the war ... wait, that wasn't the AI side.

  • An AI passes the test if it's indistinguishable from a human a**ehole.

I have the simplest tastes. I am always satisfied with the best. -- Oscar Wilde

Working...