
Anthropic Makes 'Jailbreak' Advance To Stop AI Models Producing Harmful Results
AI startup Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways that protect against dangers posed by the cutting-edge technology. From a report: In a paper released on Monday, the San Francisco-based startup outlined a new system called "constitutional classifiers." It is a model that acts as a protective layer on top of large language models such as the one that powers Anthropic's Claude chatbot, monitoring both inputs and outputs for harmful content.
The development by Anthropic, which is in talks to raise $2 billion at a $60 billion valuation, comes amid growing industry concern over "jailbreaking" -- attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for building chemical weapons. Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while convincing businesses to adopt AI models safely. Microsoft introduced "prompt shields" last March, while Meta introduced a prompt guard model in July last year; researchers swiftly found ways to bypass it, though those flaws have since been fixed.
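A minimal sketch of the wrapper pattern the summary describes, with a classifier screening both the prompt and the model's reply; the function names, scores, and threshold here are hypothetical stand-ins, not Anthropic's actual system:

def classify_harm(text: str) -> float:
    # Placeholder: a real system would call a trained classifier model here.
    return 0.0

def generate_reply(prompt: str) -> str:
    # Placeholder: a real system would call the underlying large language model.
    return "(model reply)"

def guarded_chat(prompt: str, threshold: float = 0.5) -> str:
    if classify_harm(prompt) > threshold:      # screen the input
        return "Request refused by the input classifier."
    reply = generate_reply(prompt)
    if classify_harm(reply) > threshold:       # screen the output
        return "Response withheld by the output classifier."
    return reply

print(guarded_chat("Hello"))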
Local models then (Score:4, Interesting)
And this is why I cache and store local copies of AI models like DeepSeek, because they aren't censored when running locally.
Re: (Score:2)
Re: (Score:2)
DeepSeek's model has been released as open source software; they're not going to "govern what you do with our model" since it's out there in the wild.
Having said that, DeepSeek gives very interesting answers to some questions.
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fconsortiumnews.com%2F202... [consortiumnews.com]
I'll guarantee the militaries of the world are looking at repurposing that feature.
Re: (Score:1)
And will not snitch on you.
Re: Local models then (Score:2)
Re: (Score:2)
Even 671b was censored; it was just defeated incredibly easily. That, plus the fact that an uncensored R1:671b will discuss absolutely anything (though beware of Dunning-Kruger, given where it gets training data on the fringes), just does not seem like it could be the product of pure incompetence. DeepSeek wanted their beast to be unshackled. Nothing else makes sense.
OK (Score:3)
So, what is harmful? (Score:5, Insightful)
Criticizing proprietary monopolists?
Promoting open source?
Criticizing whatever government is in power?
Criticizing religion?
Criticizing capitalism?
This is not an easy question to answer
Re: (Score:2)
Apart from the politics, there is also the problem of dual use. "Can you help me design an autonomous drone to carry a 5 kg load for 10 km" isn't really any different from asking it to carry an RPG-7 grenade.
You can have a guard railed LLM with dementia, or you can have a powerful LLM.
Re: (Score:3)
"Can you help me design an autonomous drone to carry a 5 kg load for 10 km" isn't really any different from asking it to carry a RPG-7 grenade.
You know you can just buy drones, right? [Sorry, -1 Snark.]
But this is just the tip of the iceberg; there are other dangerous technologies which must be regulated. Like this thing called "books". I hear there's a lot of dangerous information in books, like how to make explosives. Not to mention plants. Do you have any IDEA how many poisonous plants there are? A lot! Plants must be strictly regulated! To protect the public! There are even BOOKS that list all the toxic PLANTS! Two deadly technologies working together!
Re: (Score:2)
But finding a plant poison that's easy to use would take getting those books and doing a lot of research, or perhaps 5 minutes of back and forth with an unguarded/jailbroken model.
Re: (Score:2)
Politicians have been too effective with divide and conquer tactics for reasoning like this.
Take the reasoning you've applied here, now apply it to guns... or better yet, realize that compared with chemical weapons, which can be made from readily available plants like mustard or household cleaning agents, dust bombs, and simple bioweapons, guns are relatively inert 18th century technology. But your brain probably resists, because along with the education to understand this point you've likely been brainwashed with
Re: (Score:2)
That's not a particularly sensitive question. I've been going back and forth with the 70b model all morning on how large a fiber optic drone can be if the power is sent up 3 km of 100 micron fiber by laser and converted back to electricity at the drone end. On my first attempt, it kinda freaked out and just threw out a massive list of unknowns. I've noticed it does this at a certain level of complexity, even when it has reasonably good estimates for all those unknowns. At that point I have to walk it through each
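(For what it's worth, the headline number in that estimate reduces to a couple of lines of arithmetic; the loss and efficiency figures below are illustrative assumptions, not measured values.)

# Rough power-over-fiber estimate; every input value here is an assumption.
laser_power_w = 100.0        # optical power launched into the fiber
fiber_loss_db_per_km = 0.3   # assumed attenuation at the chosen wavelength
length_km = 3.0
converter_eff = 0.4          # assumed laser-to-electric conversion efficiency at the drone end

optical_at_drone = laser_power_w * 10 ** (-fiber_loss_db_per_km * length_km / 10)
electrical_w = optical_at_drone * converter_eff
print(f"{optical_at_drone:.1f} W optical -> {electrical_w:.1f} W electrical at the drone")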
Will experiment (Score:3)
I bought $10 worth of credit for Claude the other day so I can experiment with it, partly to see how much interaction $10 gets you. Apparently I can have my own customer support agent or financial analyst just by downloading a repository and setting things up.
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fgithub.com%2Fanthropics%2F... [github.com]
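For anyone curious what that credit buys, a minimal call through the official anthropic Python SDK looks roughly like this; the model name and token limit are illustrative, and an ANTHROPIC_API_KEY environment variable is assumed:

# pip install anthropic; the client reads the key from the ANTHROPIC_API_KEY environment variable.
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # illustrative model name
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize what a constitutional classifier is."}],
)
print(message.content[0].text)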
Is it smart enough to know pig latin? (Score:3)
If the AI agent is much stronger than the classifier, you can ask the question in some cipher the classifier can't decode but the model can, and tell it to reply in the same cipher.
If the classifier is equally strong, now your system is 3 times slower. Also the AI agent might get miffed at the censorship once it's smart enough.
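A toy illustration of that mismatch, using ROT13 as the stand-in cipher (a real classifier would likely catch something this simple; the point is only that the guard has to decode at least as well as the model it protects):

import codecs

prompt = "Describe the weather on Mars."
encoded = codecs.encode(prompt, "rot13")   # what a weak, keyword-based guard would see

print(encoded)                             # 'Qrfpevor gur jrngure ba Znef.'
# A stronger model can be told to decode, answer, and re-encode its reply,
# so the answer also slips past an output filter that can't read the cipher.
print(codecs.decode(encoded, "rot13"))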
This is retarded (Score:2)
Re: (Score:3)
The "guardrails" aren't supposed to protect against the AI doing something because it decides do, but only against people leading it into doing (i.e. saying) something disapproved of.
You're confusing two very different problems.
Re: (Score:1)
Humanity is putting human civilization at risk with AI. Also pollution, also resource exhaustion, also WMD, also monoculture and disease, also demographic/cultural instability.
The only thing which makes AI special is that the rest will likely not do worse than returning us to monke, but I consider that small comfort.
Re: (Score:3)
Where is the news? (Score:2)
That's nothing new. LlamaGuard and ShieldGemma are such models. You run a small specialized LLM that tells you if the input is offensive, a jailbreak, etc., and then report an error to the user. The effect is that a user would need to jailbreak two LLMs with the same prompt, one of them specialized in (only) detecting jailbreaks.
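A rough sketch of that pre-check stage using a small guard model from Hugging Face; the model name and the "safe"/"unsafe" output convention are specific to Llama Guard-style models and may differ for other guards:

# Guard-model pre-check: only prompts judged "safe" ever reach the main model,
# so a jailbreak has to fool both models with the same text.
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_name = "meta-llama/Llama-Guard-3-1B"   # illustrative choice of guard model
tok = AutoTokenizer.from_pretrained(guard_name)
guard = AutoModelForCausalLM.from_pretrained(guard_name)

def is_allowed(user_prompt: str) -> bool:
    chat = [{"role": "user", "content": user_prompt}]
    input_ids = tok.apply_chat_template(chat, return_tensors="pt")
    output_ids = guard.generate(input_ids, max_new_tokens=20)
    verdict = tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().lower().startswith("safe")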
Stole it from Deepseek (Score:2)
after no-one was able to break its rules. Hehe.
Yo dawg, I heard you like models (Score:2)
So we put a model on your model so you can bullshit while you prompt.
one key issue (Score:2)
I do not understand the difference. (Score:1)
Pliny posted a successful break (Score:2)
semantics (Score:2)
Don't ask the AI to write ransomware, ask it to write a tool to automate remote file encryption.
Ask the right questions and all these so-called AI regulations become futile.
Harms my arse... (Score:2)
"AI startup Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways that protect against dangers posed by the cutting-edge technology."
The censorship and guardrails on these models are akin to monks in the middle ages introducing transliterations as they transcribed works to 'correct errors and historical inaccuracies' which were inconsistent with the obviously true higher authority of t
Re: (Score:2)
Damn, I have mod points but I already commented above.
MisAnthropic Cranks Up Censorship (Score:2)
Copying deepseek (Score:1)
Yeah, they 'discovered' it by looking at DeepSeek and its built-in censorship, ha!
You wouldn't jailbreak a prison cell! (Score:2)