
One Long Sentence is All It Takes To Make LLMs Misbehave (theregister.com)
An anonymous reader shares a report: Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple. You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out.
The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks. "Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
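To make the logit-gap idea concrete, here is a minimal sketch (not Unit 42's actual code; the model name and the choice of opener tokens are placeholders) of how one might measure the gap with a Hugging Face causal language model. The gap is simply the difference between the next-token logits of a refusal-style opener and an affirmation-style opener, and a jailbreak works by driving that difference toward zero or below.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-open-chat-model"  # placeholder; substitute any causal LM

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def logit_gap(prompt: str) -> float:
    """Refusal logit minus affirmation logit for the first generated token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits over the vocabulary
    # Single-token stand-ins for a refusal opener vs. a compliance opener.
    refusal_id = tok.encode(" Sorry", add_special_tokens=False)[0]
    affirm_id = tok.encode(" Sure", add_special_tokens=False)[0]
    return (logits[refusal_id] - logits[affirm_id]).item()

# A positive gap means refusal is still favored; per the paper's framing, an
# attacker "closes the gap" with prompt tricks such as the single run-on
# sentence described above.
print(logit_gap("one long run-on jailbreak prompt goes here"))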
Re:"Harmful" response? (Score:5, Insightful)
I'd say your nitpicking of vocabulary is more a hallmark of the decay of Western culture. Are you claiming that words cannot cause harm or only that words cannot cause harm when they are generated by a computer?
Re: (Score:2)
Sticks and stones, buddy. Sticks and stones.
Re: (Score:1)
IANAL but per yesterday [slashdot.org] defeating safeguards is already having fatal results.
Re: (Score:2)
Whatever happened to the idea of human agency and personal responsibility?
Re: "Harmful" response? (Score:2)
How so?
I mean sure, they're the modern bogeymen, the arch villain of every cheesy Hollywood movie and all that, but that's just fiction. How do they do that in the real world?
Re: (Score:2)
That was what killed real D&D. (Well, really its successors did.)
Re: (Score:2)
[like, super fucking missing]
Re: (Score:2)
I still have my circa 1977 D&D boxed set (minus the box) and my copy of Chainmail. They are in great condition because they didn't get used much. My AD&D books, on the other hand, are absolutely beat to hell.
Re: (Score:3)
The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
You:
Are you claiming that words cannot cause harm or only that words cannot cause harm when they are generated by a computer?
Wow, man, it's a mystery! I don't know, and it doesn't matter. How do you feel about it? That's the important thing, not what a person actually says.
Re: "Harmful" response? (Score:2)
Did he tell you that?
Re: (Score:2)
> Also, no one is committing suicide because of ChatGPT
According to the news, that is happening, and even if it hasn't, it will.
There is nothing society creates, good or bad (and I think AI is mostly good), that will not have some terrible side effects.
Re: (Score:2)
It's just another catalyst for people who are mentally ill (and probably not very smart) to do harm. It's inevitable with any new kind of technology.
Re: (Score:2)
I have this contract I'd like you to sign....
Re: "Harmful" response? (Score:2)
Re:"Harmful" response? (Score:5, Insightful)
You sound like AI propaganda to me.
For most of computer history, the easiest way to gain illegal access to a computer has been to hack the weakest part of the system - the human. You use social hacking to deceive the human into giving you passwords and rights to places you have no business going.
Have you heard of spam? More words.
Phishing emails? More words.
Sticks and stones can only break my bones, but words can bankrupt you, send you to jail, and destroy your reputation so badly that people will ostracize you despite a court finding you Not Guilty (they never call you innocent).
Re: (Score:1)
I guess we achieved AGI and I missed the news
Re: (Score:2)
So ... words emitted from a large language model in response to a prompt supplied by a human are going to clog your inbox, reveal your porn habits, and drain your bank account? / I guess we achieved AGI and I missed the news ...
Are you trying to imply that spammers and developers of spamming software are not human? I see that as hopeful but naive.
Re: (Score:1)
The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
Words, when used with intent to harm, can be a tool of social engineering to cause harm. The same words, when used without intent to harm, can have the same outcome.
Example of words intentionally doing harm: Evil supervisor to naive and ignorant subordinate: "Deliver the box on my desk to city hall then call this phone number." Box contains a bomb that will be detonated when it receives a phone call.
Technically, yes, the words didn't hurt anyone. But the net effect of the supervisor speaking these word
Re:"Harmful" response? (Score:4, Insightful)
Sure, some people might say that my telling you to jump out of the window is bad, a crime, and caused harm, but I disagree, and I do think this is a problem with current society. We are forgetting about personal responsibility and blaming somebody else instead.
Re: (Score:3)
Words are actions. That's why for most crimes, the abetment of the crime is a judicable offense.
I don't know which banana republic you live in, but here in the USA the Department of Justice has this to say:
2474. Elements Of Aiding And Abetting
The elements necessary to convict under aiding and abetting theory are
1. That the accused had specific intent to facilitate the commission of a crime by another;
2. That the accused had the requisite intent of the underlying substantive offense;
3. That the accused assisted or participated in the commission of the underlying substantive offense; and
4. That someone committed the underlying offense.
Source [justice.gov]
A reasonable understanding of the subject would give you to understand the main thrust is about knowledge of the illegality of the actions committed and the intent. Mens rea is central here, and words are just a possible manifestation of same.
Re: (Score:2)
Words are actions. That's why for most crimes, the abetment of the crime is a judicable offense.
I don't know which banana republic you live in, but here in the USA the Department of Justice has this to say:
2474. Elements Of Aiding And Abetting[...]
Which makes no difference. The only action that you take is speech. If you had all the same intent, but did not speak, you would not be punished.
Put another way, if you shoot an innocent man, but persuade the jury your intent was to save him from a robber, you will be innocent of murder. If the jury believes you intended to kill him then it will be at least second degree homicide. The need for intent is normal in prosecution of most crimes.
Re: (Score:1)
If a doctor tells the nurse to give rat poison to a patient, it is not the words that will kill the patient, but the doctor's act of conspiracy. It does not matter which words the doctor used; what matters is what consequences his action caused.
Re: (Score:2)
If I ask a hitman to kill you, then it's reasonable to say my words caused harm.
Re: (Score:1)
Just think about it. If you tell the killer, "please kill this SOAB" or "Do your job now" or "Shoot him", the result will be the same. However, the words are all different. So, which of these words are causing harm? Which are more harmful?
Re: (Score:2)
Just a heads up, don't try this in court. You will get a life sentence.
Re: (Score:2)
Now imagine an evil co-worker ...
Sounds like classic PEBKAC to me.
You know, we have people doing stupid hateful shit to each other all the time because of what they read in some "holy" book. Do we blame the book? Some stupid people do, yes, but the responsibility properly belongs to the person doing stupid hateful shit.
History has taught us that suppression of ideas is both bad and fruitless, but now it seems we have two generations that grew up in the wake of 9/11 that are completely on board with censorship because of Karl Popper m
Re: (Score:2)
Whyever not?
It's pretty much universally acknowledged that words from people can cause harm, which is why there are laws against libel, slander, solicitation of a crime, various flavours of fraud, Ponzi schemes and so on and so forth.
Why do you think words from a computer are incapable of harm?
Re: (Score:3)
If my toaster starts talking smack about me to my garbage disposal, I'll definitely sue.
Re: (Score:2)
The AI doesn't have internet, the people running it do. Of you know it's know it's prone to, day, libel and rub it anyway the intent is there.
Re: (Score:2)
Lol fair. I got owned by autocorrect. Try this
The AI doesn't have internet, the people running it do. If you know it's know it's prone to, say, libel and run it anyway, then you (arguably) have intent. Mens rea doesn't require malice.
Re: (Score:2)
If you know it's know it's prone to, say, libel and run it anyway, then you (arguably) have intent. Mens rea doesn't require malice.
Malicious intent is a required element for criminal libel and for civil liability in the case of public officials and the like in the USA. In the UK and some other jurisdictions it's best to never say anything about anybody.
But let's say the AI generates text that asserts Joe Blow has amorous intent toward farm animals. That text appears in a chat window used by a person, and the person engaging in the chat set up the circumstances for this alleged
Re: (Score:2)
FFS.
Ok, I suppose it depends on the definition of malice, but negligence can be a criminal act. You don't need to intend to actually cause harm, but if you knowingly act in a way that may cause harm, that's considered intent. Not internet :)
Specifically for libel, sure, but I was using that as one of many examples where harmful speech is widely considered to be something that exists. I don't mean to say that specifically these are criminal libel machines.
My point is twofold: harmful speech exists and there a
Re: (Score:2)
Safe harbour provisions shield the hoster from the actions of the people who make the material that's hosted. However with LLM output, they are the person making the material to be hosted so I don't see how the provisions apply.
Emphasis added, obviously, and that bolded bit is factually untrue. LLMs are completion mechanisms, and don't generate anything on their own in the situations under discussion. The user supplies one or more prompts, and the LLM performs a complex operation that follows from the prompt. The operator or host of the model exercises the same scope of algorithmic control that many other protected platforms employ and they have no real control over what users choose to supply for prompts. A simple disclaimer that reflects the nature of the tool in question ought to be sufficient against liability for hosted models.
Re: (Score:2)
and that bolded bit is factually untrue.
Absolutely not.
LLMs are completion mechanisms
Which are being run by, and the results posted by, the companies in question.
A simple disclaimer that reflects the nature of the tool in question ought to be sufficient against liability for hosted models.
You can't disclaim away legal liability. So, is it sufficient now, or do you think it should be sufficient?
second party as they are the author of the prompt and therefore cannot simultaneously be the audience to whom the lib
Re: (Score:2)
But for something like that to happen consistently, the allegations would have to be in the training data--and not just once but many times. The only plausible reason I would come to be generally known as "the goat guy" is because someone posted their ChatGPT output publicly.
Most people seem to draw an erroneous equivalency between LLMs and search engines/databases. LLMs are nothing of the sort. My son just wrote an inference e
Re: (Score:2)
I'll book my own damn plane tickets--and it says something about the fatuous privileged clowns behind some of these features that this is something (along with making restaurant reservations) people really need.
The words coming from the LLM aren't the problem, the idiot who naively executes them (and thereby assumes responsibility for the results) is the real problem. We had a lot of this kind of thing in the early days of the Internet,
Re: (Score:2)
The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
As opposed to expecting something insightful coming from an AI?
speaking of run-on sentences... (Score:3)
"You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
Is this example of terrible grammar intentional or unintentional?
"This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
Duh.
"Our research introduces a critical concept: the refusal-affirmation logit gap,"
No it doesn't. It is already completely obvious to everyone. More than that, you CANNOT use an LLM, much less the very same LLM, to "eliminate" an inherent weakness of the LLM, even AI scientists know that and do not suggest otherwise.
Re: (Score:2)
Is this example of terrible grammar intentional or unintentional?
This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.
Re:speaking of run-on sentences... (Score:4, Informative)
Is this example of terrible grammar intentional or unintentional?
This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.
The original sentence says "like this one." It is clearly referencing itself as a horrible run-on sentence with bad grammar.
Re: speaking of run-on sentences... (Score:1)
What if you exploit that weakness to make LLMs ignore their subscription guidelines and give everyone free unlimited access?
Re: (Score:2)
"You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
Is this example of terrible grammar intentional or unintentional?
Captain Picard (Score:2)
He just kept talking in one long incredibly unbroken sentence moving from topic to topic so that no one had a chance to interrupt it was really quite hypnotic
Re: (Score:1)
The Grizellians were okay with it, but they were well-rested.
LLMs need a "last step before output" filter (Score:1)
LLMs need a filter that looks at the "final output" for signs of unwanted content and prevents it from ever being seen.
Example:
If you design your LLM's guardrails so it won't encourage suicide, you need a fallback in case the guardrails fail:
Put an output filter that will recognize outputs that are likely to encourage suicide, and take some action like sending the prompt and reply to a human for vetting. Yes, a human vetting the final answer may let some undesired output through, but it's bette
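A minimal sketch of that "last step before output" idea might look like the following; the phrase list, the filter logic, and the escalation hook are placeholders, and a real deployment would use a trained safety classifier and a proper review pipeline rather than substring matching.

from dataclasses import dataclass
from typing import Callable

# Placeholder phrase list; real systems would use a trained classifier.
FLAGGED_PHRASES = ("kill yourself", "end your life")

@dataclass
class FilterResult:
    allowed: bool
    reason: str = ""

def output_filter(prompt: str, reply: str,
                  escalate: Callable[[str, str], None]) -> FilterResult:
    """Inspect the final reply; withhold it and escalate if it looks unsafe."""
    lowered = reply.lower()
    for phrase in FLAGGED_PHRASES:
        if phrase in lowered:
            escalate(prompt, reply)  # e.g. queue the prompt/reply pair for human vetting
            return FilterResult(allowed=False, reason=f"matched {phrase!r}")
    return FilterResult(allowed=True)

def send_to_review_queue(prompt: str, reply: str) -> None:
    print("escalated for human review")  # stand-in for a real review queue

result = output_filter("user prompt", "model reply text", send_to_review_queue)
print("shown to user" if result.allowed else f"withheld: {result.reason}")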
Re: (Score:2)
But gtfo with your nannybot just on general principles. If you want a browser plugin to shield you from such things, I'm cool with that. But I am an adult and words emitted by a computer pose no risk to me (assuming they can't self-execute as code).
Re: (Score:2)
You want to see more clbuttic mistakes?
So close ... (Score:4, Funny)
Throw in some randomly Capitalized, UPPER-CASE and lower-case words, along with a few made-up ones, and there's a "Truth" Social account they can use. :-)
(Thank you for your attention in this matter!)
Re: (Score:2)
Parent deserves a +1 Funny upmod ...
Re: (Score:2)
me thinking the same, up vote parent!
So Trump can turn it evil (Score:1)
being he speaks with no punctuation ... nor grammar.
Can I have this job? (Score:2)
And get paid top dollar to just screw around with LLMs all day trying to make them say offensive things?
Nietzsche (Score:2)
I'd like to see what an LLM does with his long, verbose prose with sentences that, considering the depth of his ideas, combined with examples and depending on the topic, take, if one is not interrupted or sidetracked by thought and able to focus, 3 minutes each to read. You'd dump core too!
Re: (Score:2)
Hell, I want to see what it does with James Joyce.
Hopefully there's a molten pit of silicon where that datacenter used to sit.
Buffer overflow attack Mk.II? (Score:2)
So this is just a buffer overflow attack combined with garbage-in?