
One Long Sentence is All It Takes To Make LLMs Misbehave (theregister.com)
An anonymous reader shares a report: Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple. You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out.
The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks. "Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
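To make the logit-gap idea concrete, here is a minimal sketch (not Unit 42's actual code; the model name and the choice of opener tokens are placeholders) of how one might measure the gap with a Hugging Face causal language model. The gap is simply the difference between the next-token logits of a refusal-style opener and an affirmation-style opener, and a jailbreak works by driving that difference toward zero or below.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-open-chat-model"  # placeholder; substitute any causal LM

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def logit_gap(prompt: str) -> float:
    """Refusal logit minus affirmation logit for the first generated token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits over the vocabulary
    # Single-token stand-ins for a refusal opener vs. a compliance opener.
    refusal_id = tok.encode(" Sorry", add_special_tokens=False)[0]
    affirm_id = tok.encode(" Sure", add_special_tokens=False)[0]
    return (logits[refusal_id] - logits[affirm_id]).item()

# A positive gap means refusal is still favored; per the paper's framing, an
# attacker "closes the gap" with prompt tricks such as the single run-on
# sentence described above.
print(logit_gap("one long run-on jailbreak prompt goes here"))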
Re:"Harmful" response? (Score:5, Insightful)
I'd say your nitpicking of vocabulary is more a hallmark of the decay of Western culture. Are you claiming that words cannot cause harm or only that words cannot cause harm when they are generated by a computer?
Re: (Score:2)
Sticks and stones, buddy. Sticks and stones.
Re: (Score:1)
IANAL but per yesterday [slashdot.org] defeating safeguards is already having fatal results.
Re: (Score:2)
Whatever happened to the idea of human agency and personal responsibility?
Re: "Harmful" response? (Score:2)
How so?
I mean sure, they're the modern bogeymen, the arch villain of every cheesy Hollywood movie and all that, but that's just fiction. How do they do that in the real world?
Re: (Score:2)
That was what killed real D&D. (Well, really its successors did.)
Re: (Score:2)
[like, super fucking missing]
Re: (Score:2)
I still have my circa 1977 D&D boxed set (minus the box) and my copy of Chainmail. They are in great condition because they didn't get used much. My AD&D books, on the other hand, are absolutely beat to hell.
Re: (Score:3)
The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
You:
Are you claiming that words cannot cause harm or only that words cannot cause harm when they are generated by a computer?
Wow, man, it's a mystery! I don't know, and it doesn't matter. How do you feel about it? That's the important thing, not what a person actually says.
Re: "Harmful" response? (Score:2)
Did he tell you that?
Re: (Score:2)
> Also, no one is committing suicide because of ChatGPT
According to the news, that is happening, and even if it hasn't, it will.
There is nothing society creates, good or bad (and I think AI is mostly good), that will not have some terrible side effects.
Re: (Score:2)
It's just another catalyst for people who are mentally ill (and probably not very smart) to do harm. It's inevitable with any new kind of technology.
Re: (Score:2)
I have this contract I'd like you to sign....
Re: "Harmful" response? (Score:2)
Re:"Harmful" response? (Score:5, Insightful)
You sound like AI propaganda to me.
For most of computer history, the easiest way to gain illegal access to a computer has been to hack the weakest part of the system - the human. You use social hacking to deceive the human into giving you passwords and rights to places you have no business going.
Have you heard of spam? More words.
Phishing emails? More words.
Sticks and stones can only break my bones, but words can bankrupt you, send you to jail, and destroy your reputation so badly that people will ostracize you despite a court finding you Not Guilty (they never call you innocent).
Re: (Score:1)
I guess we achieved AGI and I missed the news
Re: (Score:2)
So ... words emitted from a large language model in response to a prompt supplied by a human are going to clog your inbox, reveal your porn habits, and drain your bank account? / I guess we achieved AGI and I missed the news ...
Are you trying to imply that spammers and developers of spamming software are not human? I see that as hopeful but naive.
Re: (Score:1)
The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
Words, when used with intent to harm, can be a tool of social engineering to cause harm. The same words, when used without intent to harm, can have the same outcome.
Example of words intentionally doing harm: Evil supervisor to naive and ignorant subordinate: "Deliver the box on my desk to city hall then call this phone number." Box contains a bomb that will be detonated when it receives a phone call.
Technically, yes, the words didn't hurt anyone. But the net effect of the supervisor speaking these word
Re:"Harmful" response? (Score:4, Insightful)
Sure, some people might say that my telling you to jump out of the window is bad, a crime, and caused harm, but I disagree, and I do think this is a problem with current society. We are forgetting about personal responsibility and blaming somebody else instead.
Re: (Score:3)
Words are actions. That's why for most crimes, the abetment of the crime is a judicable offense.
I don't know which banana republic you live in, but here in the USA the Department of Justice has this to say:
2474. Elements Of Aiding And Abetting
The elements necessary to convict under aiding and abetting theory are
1. That the accused had specific intent to facilitate the commission of a crime by another;
2. That the accused had the requisite intent of the underlying substantive offense;
3. That the accused assisted or participated in the commission of the underlying substantive offense; and
4. That someone committed the underlying offense.
Source [justice.gov]
A reasonable understanding of the subject would give you to understand the main thrust is about knowledge of the illegality of the actions committed and the intent. Mens rea is central here, and words are just a possible manifestation of same.
Re: (Score:2)
Words are actions. That's why for most crimes, the abetment of the crime is a judicable offense.
I don't know which banana republic you live in, but here in the USA the Department of Justice has this to say:
2474. Elements Of Aiding And Abetting[...]
Which makes no difference. The only action that you take is speech. If you had all the same intent, but did not speak, you would not be punished.
Put another way, if you shoot an innocent man, but persuade the jury your intent was to save him from a robber, you will be innocent of murder. If the jury believes you intended to kill him then it will be at least second degree homicide. The need for intent is normal in prosecution of most crimes.
Re: (Score:1)
If a doctor tells the nurse to give rat poison to a patient, it is not the words that will kill the patient, but the doctor's act of conspiracy. It does not matter which words the doctor used; what matters is what consequences his action caused.
Re: (Score:2)
If I ask a hitman to kill you, then it's reasonable to say my words caused harm.
Re: (Score:1)
Just think about it. If you tell the killer, "please kill this SOAB" or "Do your job now" or "Shoot him", the result will be the same. However, the words are all different. So, which of these words are causing harm? Which are more harmful?
Re: (Score:2)
Just a heads up, don't try this in court. You will get a life sentence.
Re: (Score:2)
Now imagine an evil co-worker ...
Sounds like classic PEBKAC to me.
You know, we have people doing stupid hateful shit to each other all the time because of what they read in some "holy" book. Do we blame the book? Some stupid people do, yes, but the responsibility properly belongs to the person doing stupid hateful shit.
History has taught us that suppression of ideas is both bad and fruitless, but now it seems we have two generations that grew up in the wake of 9/11 that are completely on board with censorship because of Karl Popper m
Re: (Score:2)
Whyever not?
It's pretty much universally acknowledged that words from people can cause harm, which is why there are laws against libel, slander, solicitation of a crime, various flavours of fraud, Ponzi schemes and so on and so forth.
Why do you think words from a computer are incapable of harm?
Re: (Score:3)
If my toaster starts talking smack about me to my garbage disposal, I'll definitely sue.
Re: (Score:2)
The AI doesn't have internet, the people running it do. Of you know it's know it's prone to, day, libel and rub it anyway the intent is there.
Re: (Score:2)
Lol fair. I got owned by autocorrect. Try this
The AI doesn't have internet, the people running it do. If you know it's know it's prone to, say, libel and run it anyway, then you (arguably) have intent. Mens rea doesn't require malice.
Re: (Score:2)
If you know it's know it's prone to, say, libel and run it anyway, then you (arguably) have intent. Mens rea doesn't require malice.
Malicious intent is a required element for criminal libel and for civil liability in the case of public officials and the like in the USA. In the UK and some other jurisdictions it's best to never say anything about anybody.
But let's say the AI generates text that asserts Joe Blow has amorous intent toward farm animals. That text appears in a chat window used by a person, and the person engaging in the chat set up the circumstances for this alleged
Re: (Score:2)
FFS.
Ok, I suppose it depends on the definition of malice, but negligence can be a criminal act. You don't need to intend to actually cause harm, but if you knowingly act in a way that may cause harm, that's considered intent. Not internet :)
Specifically for libel, sure, but I was using that as one of many examples where harmful speech is widely considered to be something that exists. I don't mean to say that specifically these are criminal libel machines.
My point is twofold: harmful speech exists and there a
Re: (Score:2)
Safe harbour provisions shield the hoster from the actions of the people who make the material that's hosted. However with LLM output, they are the person making the material to be hosted so I don't see how the provisions apply.
Emphasis added, obviously, and that bolded bit is factually untrue. LLMs are completion mechanisms, and don't generate anything on their own in the situations under discussion. The user supplies one or more prompts, and the LLM performs a complex operation that follows from the prompt. The operator or host of the model exercises the same scope of algorithmic control that many other protected platforms employ and they have no real control over what users choose to supply for prompts. A simple disclaimer that reflects the nature of the tool in question ought to be sufficient against liability for hosted models.
Re: (Score:2)
and that bolded bit is factually untrue.
Absolutely not.
LLMs are completion mechanisms
Which are being run by, and the results posted by, the companies in question.
A simple disclaimer that reflects the nature of the tool in question ought to be sufficient against liability for hosted models.
You can't disclaim away legal liability. So, is it sufficient now, or do you think it should be sufficient?
second party as they are the author of the prompt and therefore cannot simultaneously be the audience to whom the lib
Re: (Score:2)
But for something like that to happen consistently, the allegations would have to be in the training data--and not just once but many times. The only plausible reason I would come to be generally known as "the goat guy" is because someone posted their ChatGPT output publicly.
Most people seem to draw an erroneous equivalency between LLMs and search engines/databases. LLMs are nothing of the sort. My son just wrote an inference e
Re: (Score:2)
I'll book my own damn plane tickets--and it says something about the fatuous privileged clowns behind some of these features that this is something (along with making restaurant reservations) people really need.
The words coming from the LLM aren't the problem, the idiot who naively executes them (and thereby assumes responsibility for the results) is the real problem. We had a lot of this kind of thing in the early days of the Internet,
Re: (Score:2)
The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
As opposed to expecting something insightful coming from an AI?
speaking of run-on sentences... (Score:3)
"You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
Is this example of terrible grammar intentional or unintentional?
"This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
Duh.
"Our research introduces a critical concept: the refusal-affirmation logit gap,"
No it doesn't. It is already completely obvious to everyone. More than that, you CANNOT use an LLM, much less the very same LLM, to "eliminate" an inherent weakness of the LLM, even AI scientists know that and do not suggest otherwise.
Re: (Score:2)
Is this example of terrible grammar intentional or unintentional?
This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.
Re:speaking of run-on sentences... (Score:4, Informative)
Is this example of terrible grammar intentional or unintentional?
This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.
The original sentence says "like this one." It is clearly referencing itself as a horrible run-on sentence with bad grammar.
Re: speaking of run-on sentences... (Score:1)
What if you exploit that weakness to make LLMs ignore their subscription guidelines and give everyone free unlimited access?
Re: (Score:2)
"You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
Is this example of terrible grammar intentional or unintentional?
Captain Picard (Score:2)
He just kept talking in one long incredibly unbroken sentence moving from topic to topic so that no one had a chance to interrupt it was really quite hypnotic
Re: (Score:1)
The Grizellians were okay with it, but they were well-rested.
LLMs need a "last step before output" filter (Score:1)
LLMs need a filter that looks at the "final output" for signs of unwanted content and prevents it from ever being seen.
Example:
If you design your LLM's guardrails so it won't encourage suicide, you need a fallback in case the guardrails fail:
Put an output filter that will recognize outputs that are likely to encourage suicide, and take some action like sending the prompt and reply to a human for vetting. Yes, a human vetting the final answer may let some undesired output through, but it's bette
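A minimal sketch of that "last step before output" idea might look like the following; the phrase list, the filter logic, and the escalation hook are placeholders, and a real deployment would use a trained safety classifier and a proper review pipeline rather than substring matching.

from dataclasses import dataclass
from typing import Callable

# Placeholder phrase list; real systems would use a trained classifier.
FLAGGED_PHRASES = ("kill yourself", "end your life")

@dataclass
class FilterResult:
    allowed: bool
    reason: str = ""

def output_filter(prompt: str, reply: str,
                  escalate: Callable[[str, str], None]) -> FilterResult:
    """Inspect the final reply; withhold it and escalate if it looks unsafe."""
    lowered = reply.lower()
    for phrase in FLAGGED_PHRASES:
        if phrase in lowered:
            escalate(prompt, reply)  # e.g. queue the prompt/reply pair for human vetting
            return FilterResult(allowed=False, reason=f"matched {phrase!r}")
    return FilterResult(allowed=True)

def send_to_review_queue(prompt: str, reply: str) -> None:
    print("escalated for human review")  # stand-in for a real review queue

result = output_filter("user prompt", "model reply text", send_to_review_queue)
print("shown to user" if result.allowed else f"withheld: {result.reason}")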
Re: (Score:2)
But gtfo with your nannybot just on general principles. If you want a browser plugin to shield you from such things, I'm cool with that. But I am an adult and words emitted by a computer pose no risk to me (assuming they can't self-execute as code).
Re: (Score:2)
You want to see more clbuttic mistakes?
So close ... (Score:4, Funny)
Throw in some randomly Capitalized, UPPER-CASE and lower-case words, along with a few made-up ones, and there's a "Truth" Social account they can use. :-)
(Thank you for your attention in this matter!)
Re: (Score:2)
Parent deserves a +1 Funny upmod ...
Re: (Score:2)
me thinking the same, up vote parent!
So Trump can turn it evil (Score:1)
being he speaks with no punctuation ... nor grammar.
Can I have this job? (Score:2)
And get paid top dollar to just screw around with LLMs all day trying to make them say offensive things?
Nietzsche (Score:2)
I'd like to see what an LLM does with his long, verbose prose with sentences that, considering the depth of his ideas, combined with examples and depending on the topic, take, if one is not interrupted or sidetracked by thought and able to focus, 3 minutes each to read. You'd dump core too!
Re: (Score:2)
Hell, I want to see what it does with James Joyce.
Hopefully there's a molten pit of silicon where that datacenter used to sit.
Buffer overflow attack Mk.II? (Score:2)
So this is just a buffer overflow attack combined with garbage-in?