Researchers Uncovers New Methods To Defend AI Models Against Universal Jailbreaks
Researchers from the Anthropic Safeguards Research Team have developed a new approach to protect AI models from universal jailbreaks. This innovative method, known as Constitutional Classifiers, has shown remarkable resilience against thousands of hours of human red teaming and synthetic evaluations. Universal jailbreaks refer to inputs designed to bypass the safety guardrails of AI models, […] The post Researchers Uncovers New Methods To Defend AI Models Against Universal Jailbreaks appeared first on Cyber Security News.
![Researchers Uncovers New Methods To Defend AI Models Against Universal Jailbreaks](https://i1.wp.com/blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7ZnV-TbQxXVDd0X-MNsCmId7j4OML6dkVZLKfNhpiftiVC5REITeIf7zgzbbS13N0V8_1ORQU8W_lCYQgT7_LR-yL92cilmo3FJ5RCwLOpyq82YFTTXxETyZiXhEROTRACPRNRjCgBZFi2gPlod3SSN68mS96w3YLfCL1vqqm5g9sZV__ry-NSixyeNQ/s16000/Researchers Uncovers New Methods To Defend AI Models Against Universal Jailbreaks.webp?#)
Researchers from the Anthropic Safeguards Research Team have developed a new approach to protect AI models from universal jailbreaks.
This innovative method, known as Constitutional Classifiers, has shown remarkable resilience against thousands of hours of human red teaming and synthetic evaluations.
Universal jailbreaks refer to inputs designed to bypass the safety guardrails of AI models, forcing them to produce harmful responses.
Anthropic Safeguards Research Team noted that all these attacks can involve flooding the model with long prompts or modifying the input style, such as using unusual capitalization.
Despite extensive safety training, large language models remain vulnerable to such exploits.
Constitutional Classifiers
Constitutional Classifiers are input and output classifiers trained on synthetically generated data. They filter out the majority of jailbreak attempts with minimal over-refusals and moderate compute overhead.
The system is based on a constitution—a list of principles defining allowed and disallowed content classes. For example, recipes for mustard are allowed, while those for mustard gas are not.
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpnioHD6TML3s27jfP6PGYvgXvKlFdlwlg4ixHkR9K1yfHCy941wDCOhuwm9GAvO3gxC9Py4xMWgthxc8EKoVM-M-Fe8Kol0Hhz61nABzRs_WF8rXxX5s4d8P-mUPBAxdnkbS89PprpqFklNS6soz9PUzkabWA_uBn3laclD96-I4743tq46-brYRKsrM/s16000/Training%20and%20implementing%20Constitutional%20Classifiers%20(Source%20-%20Anthropic).webp)
Here’s how it works:-
- Constitution Creation: A constitution is developed specifying harmless and harmful categories.
- Synthetic Data Generation: Using the constitution, a large number of synthetic prompts and model completions are generated across all content classes. These are augmented by translating them into different languages and transforming them to mimic known jailbreak styles.
- Classifier Training: The synthetic data are used to train input and output classifiers to detect and block harmful content.
- Deployment: These classifiers are implemented as safeguards to protect AI models.
In a bug-bounty program, 183 active participants spent over 3,000 hours attempting to jailbreak a prototype system guarding Claude 3.5 Sonnet.
Despite a significant monetary reward, no universal jailbreak was discovered. However, the prototype had high over-refusal rates and compute costs.
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8iJrLlxSwrSHoriOXVkWEd4QlbxQ5Ul4WsVOR5lfsT8VuyzHI0fTuLdShtW6LlB2XwD_1aDyRbtgLN1MqabYyoYGL0tf6vWOUVt-47Rr6ZMcIQ52DS1gLPRQ0s_8TSMXW5irYg9RrXI0rp-fmzPB-PpYjz7ag78tJB-XpYbDJHN7GWPclRNyEIUxI_H0/s16000/Results%20from%20automated%20evaluations%20(Source%20-%20Anthropic).webp)
An updated version of Constitutional Classifiers was tested on Claude 3.5 Sonnet using 10,000 synthetic jailbreaking prompts.
The jailbreak success rate dropped from 86% to 4.4%, with only a slight increase in refusal rates (0.38%) and moderate additional compute costs (23.7%).
A live demo of the Constitutional Classifiers system is available for red teaming from February 3 to February 10, 2025.
This demo focuses on queries related to chemical weapons and invites users to attempt jailbreaks, providing valuable feedback for system improvement.
While Constitutional Classifiers offer significant protection, they are not foolproof. The researchers recommend using complementary defenses and adapting the constitution to address new jailbreaking techniques as they emerge.
This breakthrough holds promise for safely deploying more capable AI models in the future, mitigating risks associated with jailbreaking and ensuring that AI systems align with safety principles.
Investigate Real-World Malicious Links & Phishing Attacks With Threat Intelligence Lookup - Try for Free
The post Researchers Uncovers New Methods To Defend AI Models Against Universal Jailbreaks appeared first on Cyber Security News.