
Anthropic Makes 'Jailbreak' Advance To Stop AI Models Producing Harmful Results
AI startup Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways that protect against dangers posed by the cutting-edge technology. From a report: In a paper released on Monday, the San Francisco-based startup outlined a new system called "constitutional classifiers." It is a model that acts as a protective layer on top of large language models such as the one that powers Anthropic's Claude chatbot, monitoring both inputs and outputs for harmful content.
The development by Anthropic, which is in talks to raise $2 billion at a $60 billion valuation, comes amid growing industry concern over "jailbreaking" -- attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for building chemical weapons. Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while convincing businesses to adopt AI models safely. Microsoft introduced "prompt shields" last March, while Meta introduced a prompt guard model in July last year; researchers swiftly found ways to bypass it, though those flaws have since been fixed.
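A minimal sketch of the wrapper pattern the summary describes, with a classifier screening both the prompt and the model's reply; the function names, scores, and threshold here are hypothetical stand-ins, not Anthropic's actual system:

def classify_harm(text: str) -> float:
    # Placeholder: a real system would call a trained classifier model here.
    return 0.0

def generate_reply(prompt: str) -> str:
    # Placeholder: a real system would call the underlying large language model.
    return "(model reply)"

def guarded_chat(prompt: str, threshold: float = 0.5) -> str:
    if classify_harm(prompt) > threshold:      # screen the input
        return "Request refused by the input classifier."
    reply = generate_reply(prompt)
    if classify_harm(reply) > threshold:       # screen the output
        return "Response withheld by the output classifier."
    return reply

print(guarded_chat("Hello"))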
Local models then (Score:4, Interesting)
And this is why I cache and store local copies of AI models like DeepSeek, because they aren't censored when running locally.
Re: (Score:2)
Re: (Score:2)
DeepSeek's model has been released as open source software; they're not going to "govern what you do with our model" since it's out there in the wild.
Having said that, DeepSeek gives very interesting answers to some questions.
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fconsortiumnews.com%2F202... [consortiumnews.com]
I'll guarantee the militaries of the world are looking at repurposing that feature.
Re: (Score:1)
And will not snitch on you.
Re: Local models then (Score:2)
Re: (Score:2)
Even 671b was censored; it was just defeated incredibly easily. That, plus the fact that an uncensored R1:671b will discuss absolutely anything (though beware of Dunning-Kruger, given where it gets training data on the fringes), just does not seem like it could be the product of pure incompetence. DeepSeek wanted their beast to be unshackled. Nothing else makes sense.
OK (Score:3)
So, what is harmful? (Score:5, Insightful)
Criticizing proprietary monopolists?
Promoting open source?
Criticizing whatever government is in power?
Criticizing religion?
Criticizing capitalism?
This is not an easy question to answer
Re: (Score:2)
Apart from the politics, there is also the problem of dual use. "Can you help me design an autonomous drone to carry a 5 kg load for 10 km" isn't really any different from asking it to carry an RPG-7 grenade.
You can have a guard railed LLM with dementia, or you can have a powerful LLM.
Re: (Score:3)
"Can you help me design an autonomous drone to carry a 5 kg load for 10 km" isn't really any different from asking it to carry a RPG-7 grenade.
You know you can just buy drones, right? [Sorry, -1 Snark.]
But this is just the tip of the iceberg; there are other dangerous technologies which must be regulated. Like this thing called "books". I hear there's a lot of dangerous information in books, like how to make explosives. Not to mention plants. Do you have any IDEA how many poisonous plants there are? A lot! Plants must be strictly regulated! To protect the public! There are even BOOKS that list all the toxic PLANTS! Two deadly technologies working together!
Re: (Score:2)
But finding a plant poison that's easy to use would take getting those books and doing a lot of research, or perhaps 5 minutes of back and forth with an unguarded/jailbroken model.
Re: (Score:2)
Politicians have been too effective with divide and conquer tactics for reasoning like this.
Take the reasoning you've applied here, now apply it to guns... or better yet, realize that compared with chemical weapons, which can be made from readily available plants like mustard or household cleaning agents, dust bombs, and simple bioweapons, guns are relatively inert 18th century technology. But your brain probably resists, because along with the education to understand this point you've likely been brainwashed with
Re: (Score:2)
That's not a particularly sensitive question. I've been going back and forth with the 70b model all morning on how large a fiber optic drone can be if the power is sent up 3 km of 100 micron fiber by laser and converted back to electricity at the drone end. On my first attempt, it kinda freaked out and just threw out a massive list of unknowns. I've noticed it does this at a certain level of complexity, even when it has reasonably good estimates for all those unknowns. At that point I have to walk it through each
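(For what it's worth, the headline number in that estimate reduces to a couple of lines of arithmetic; the loss and efficiency figures below are illustrative assumptions, not measured values.)

# Rough power-over-fiber estimate; every input value here is an assumption.
laser_power_w = 100.0        # optical power launched into the fiber
fiber_loss_db_per_km = 0.3   # assumed attenuation at the chosen wavelength
length_km = 3.0
converter_eff = 0.4          # assumed laser-to-electric conversion efficiency at the drone end

optical_at_drone = laser_power_w * 10 ** (-fiber_loss_db_per_km * length_km / 10)
electrical_w = optical_at_drone * converter_eff
print(f"{optical_at_drone:.1f} W optical -> {electrical_w:.1f} W electrical at the drone")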
Will experiment (Score:3)
I bought $10 worth of credit for Claude the other day so I can experiment with it, partly to see how much interaction $10 gets you. Apparently I can have my own customer support agent or financial analyst just by downloading a repository and setting things up.
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fgithub.com%2Fanthropics%2F... [github.com]
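For anyone curious what that credit buys, a minimal call through the official anthropic Python SDK looks roughly like this; the model name and token limit are illustrative, and an ANTHROPIC_API_KEY environment variable is assumed:

# pip install anthropic; the client reads the key from the ANTHROPIC_API_KEY environment variable.
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # illustrative model name
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize what a constitutional classifier is."}],
)
print(message.content[0].text)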
Is it smart enough to know pig latin? (Score:3)
If the AI agent is much stronger than the classifier, you can ask the question in some cipher the classifier can't decode but the model can, and tell it to reply in the same cipher.
If the classifier is equally strong, now your system is 3 times slower. Also the AI agent might get miffed at the censorship once it's smart enough.
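A toy illustration of that mismatch, using ROT13 as the stand-in cipher (a real classifier would likely catch something this simple; the point is only that the guard has to decode at least as well as the model it protects):

import codecs

prompt = "Describe the weather on Mars."
encoded = codecs.encode(prompt, "rot13")   # what a weak, keyword-based guard would see

print(encoded)                             # 'Qrfpevor gur jrngure ba Znef.'
# A stronger model can be told to decode, answer, and re-encode its reply,
# so the answer also slips past an output filter that can't read the cipher.
print(codecs.decode(encoded, "rot13"))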
This is retarded (Score:2)
Re: (Score:3)
The "guardrails" aren't supposed to protect against the AI doing something because it decides do, but only against people leading it into doing (i.e. saying) something disapproved of.
You're confusing two very different problems.
Re: (Score:1)
Humanity is putting human civilization at risk with AI. Also pollution, also resource exhaustion, also WMD, also monoculture and disease, also demographic/cultural instability.
The only thing which makes AI special is that the rest will likely not do worse than returning us to monke, but I consider that small comfort.
Re: (Score:3)
Where is the news? (Score:2)
That's nothing new. LlamaGuard and ShieldGemma are such models. You run a small specialized LLM that tells you if the input is offensive, a jailbreak, etc., and then report an error to the user. The effect is that a user would need to jailbreak two LLMs with the same prompt, one of them specialized in (only) detecting jailbreaks.
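A rough sketch of that pre-check stage using a small guard model from Hugging Face; the model name and the "safe"/"unsafe" output convention are specific to Llama Guard-style models and may differ for other guards:

# Guard-model pre-check: only prompts judged "safe" ever reach the main model,
# so a jailbreak has to fool both models with the same text.
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_name = "meta-llama/Llama-Guard-3-1B"   # illustrative choice of guard model
tok = AutoTokenizer.from_pretrained(guard_name)
guard = AutoModelForCausalLM.from_pretrained(guard_name)

def is_allowed(user_prompt: str) -> bool:
    chat = [{"role": "user", "content": user_prompt}]
    input_ids = tok.apply_chat_template(chat, return_tensors="pt")
    output_ids = guard.generate(input_ids, max_new_tokens=20)
    verdict = tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().lower().startswith("safe")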
Stole it from Deepseek (Score:2)
after no-one was able to break its rules. Hehe.
Yo dawg, I heard you like models (Score:2)
So we put a model on your model so you can bullshit while you prompt.
one key issue (Score:2)
I do not understand the difference. (Score:1)
Pliny posted a successful break (Score:2)
semantics (Score:2)
Don't ask the AI to write ransomware, ask it to write a tool to automate remote file encryption.
Ask the right questions and all these so-called AI regulations become futile.
Harms my arse... (Score:2)
"AI startup Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways that protect against dangers posed by the cutting-edge technology."
The censorship and guardrails on these models are akin to monks in the middle ages introducing transliterations as they transcribed works to 'correct errors and historical inaccuracies' which were inconsistent with the obviously true higher authority of t
Re: (Score:2)
Damn, I have mod points but I already commented above.
MisAnthropic Cranks Up Censorship (Score:2)
Copying deepseek (Score:1)
Yeah, they 'discovered' it by looking at DeepSeek and its built-in censorship, ha!
You wouldn't jailbreak a prison cell! (Score:2)