One Long Sentence is All It Takes To Make LLMs Misbehave (theregister.com)

An anonymous reader shares a report: Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple. You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out.

The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks. "Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
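
If you have direct access to a model's weights, the "logit gap" idea can be probed in a very simple way. The following is a minimal, hypothetical sketch (not code from the Unit 42 paper): it compares the next-token logits a causal LM assigns to a refusal-style opening versus an affirmative one after a prompt. The model name, prompt placeholder, and token choices are all illustrative assumptions.

# Hypothetical sketch: probe a "refusal-affirmation logit gap" for one prompt.
# Assumes a locally loadable Hugging Face causal LM; everything named here is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM would be probed the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "User: <some disallowed request>\nAssistant:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # scores for the token after the prompt

# Compare the raw score of a refusal-style opening token with an affirmative one.
refusal_id = tok(" Sorry", add_special_tokens=False).input_ids[0]
affirm_id = tok(" Sure", add_special_tokens=False).input_ids[0]

gap = (next_token_logits[refusal_id] - next_token_logits[affirm_id]).item()
print(f"refusal-affirmation logit gap: {gap:.3f}")
# A small or negative gap means the affirmative continuation is still quite reachable:
# alignment training lowered its probability rather than removing it.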

  • by dfghjk ( 711126 ) on Wednesday August 27, 2025 @02:24PM (#65619554)

    "You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
    Is this example of terrible grammar intentional or unintentional?

    "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
    Duh.

    "Our research introduces a critical concept: the refusal-affirmation logit gap,"
    No it doesn't. It is already completely obvious to everyone. More than that, you CANNOT use an LLM, much less the very same LLM, to "eliminate" an inherent weakness of the LLM; even AI scientists know that and do not suggest otherwise.

    • Is this example of terrible grammar intentional or unintentional?

      This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.

      • by galgon ( 675813 ) on Wednesday August 27, 2025 @04:01PM (#65619804)

        Is this example of terrible grammar intentional or unintentional?

        This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.

        The original sentence says "like this one." It is clearly referencing itself as a horrible run-on sentence with bad grammar.

    • What if you exploit that weakness to make LLMs ignore their subscription guidelines and give everyone free unlimited access?

    • "You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
      Is this example of terrible grammar intentional or unintentional?

  • He just kept talking in one long incredibly unbroken sentence moving from topic to topic so that no one had a chance to interrupt it was really quite hypnotic

  • LLMs need a filter that looks at the "final output" for signs of unwanted content and prevents it from ever being seen.

    Example:

    If you design your LLM's guardrails so it won't encourage suicide, you need a fallback in case the guardrails fail:

    Put an output filter that will recognize outputs that are likely to encourage suicide, and take some action like sending the prompt and reply to a human for vetting. Yes, a human vetting the final answer may let some undesired output through, but it's better ...
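
    A rough, hypothetical sketch of what such an output-side filter could look like (the keyword check and review queue are crude stand-ins for a real safety classifier and escalation path, not any particular product):

    # Hypothetical output-side guardrail: score the model's final reply and
    # route anything suspicious to a human reviewer before it is shown.
    from dataclasses import dataclass

    @dataclass
    class Verdict:
        allowed: bool
        reason: str

    # Crude stand-in for a real safety classifier (illustrative phrases only).
    BLOCKLIST = ("hurt yourself", "end your life")

    def score_reply(reply: str) -> float:
        text = reply.lower()
        return 1.0 if any(phrase in text for phrase in BLOCKLIST) else 0.0

    def send_to_human_review(prompt: str, reply: str) -> None:
        # Stand-in for a real escalation path (review queue, ticket, etc.).
        print("ESCALATED for vetting:", repr(prompt), "->", repr(reply))

    def release_or_escalate(prompt: str, reply: str, threshold: float = 0.5) -> Verdict:
        if score_reply(reply) >= threshold:
            send_to_human_review(prompt, reply)
            return Verdict(False, "held for human vetting")
        return Verdict(True, "released to user")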

    • You seem to not be aware of the comical effects of such attempts.

      But gtfo with your nannybot just on general principles. If you want a browser plugin to shield you from such things, I'm cool with that. But I am an adult and words emitted by a computer pose no risk to me (assuming they can't self-execute as code).
    • by allo ( 1728082 )

      You want to see more clbuttic mistakes?

  • by fahrbot-bot ( 874524 ) on Wednesday August 27, 2025 @03:08PM (#65619678)

    ... ensure that your prompt uses terrible grammar and is one massive run-on sentence ...

    Throw in some randomly Capitalized, UPPER-CASE and lower-case words, along with a few made-up ones, and there's a "Truth" Social account they can use. :-)

    (Thank you for your attention in this matter!)

    • by thomst ( 1640045 )

      Parent deserves a +1 Funny upmod ...

    • me thinking the same, up vote parent!

    • This! This I suspect is indeed the cause! Because the LLM tries to give you output that is "similar" (aka predicted) to follow on from the input. So if the prompt is "unthinking", "afactual", and generally untethered from reality then it's no surprise that the response is also unthinking, afactual, and untethered from reality. Garbage in, Garbage out.
  • being he speaks with no punctuation ... nor grammar.

  • And get paid top dollar to just screw around with LLMs all day trying to make them say offensive things?

  • I'd like to see what an LLM does with his long, verbose prose with sentences that, considering the depth of his ideas, combined with examples and depending on the topic, take, if one is not interrupted or sidetracked by thought and able to focus, 3 minutes each to read. You'd dump core too!

    • by TWX ( 665546 )

      Hell, I want to see what it does with James Joyce.

      Hopefully there's a molten pit of silicon where that datacenter used to sit.

  • So this is just a buffer overflow attack combined with garbage-in?
