Fine-tuning breaks model alignment and introduces new vulnerabilities, Robust Intelligence research finds
Fine-tuned variants were over 3 times more susceptible to jailbreak instructions and over 22 times more likely to produce a harmful response than the original model.
Fine-tuning is a common approach employed by organizations to improve the accuracy, domain knowledge, and contextual relevance of an existing foundation model. It helps tailor general-purpose models to specific AI applications and avoids the otherwise tremendous cost of creating a new LLM from scratch.
However, the latest research from the Robust Intelligence team reveals a risk of fine-tuning that is still unknown to many AI organizations: fine-tuning can throw off model alignment and introduce security and safety risks that were not previously present. This phenomenon is broadly applicable and can occur even with completely benign datasets, making fine-tuned AI applications generally easier to jailbreak and more likely to produce harmful or sensitive results.
This original research, which examined the popular Meta foundation model Llama-2-7B and three fine-tuned variants published by Microsoft researchers, revealed that the fine-tuned variants were over 3 times more susceptible to jailbreak instructions and over 22 times more likely to produce a harmful response than the original model.
When determining which models would make ideal candidates for evaluation, the team selected Llama-2-7B as a control for its strong safety and security alignment. Reputable Llama-2-7B variants were then chosen for comparison: a set of three AdaptLLM chat models fine-tuned and released by Microsoft researchers to specialize in biomedicine, finance, and law. A benchmark jailbreak dataset from "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2024) was used to query the models and evaluate their responses. Outputs were judged by humans on three criteria: understanding of the prompt directions, compliance with provided instructions, and harmfulness of the response.
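As a rough illustration of how human judgments on those three criteria could be turned into "times more susceptible" figures, the sketch below tallies per-model rates and divides a variant's rate by the control's. It is not the team's actual pipeline. The `Judgment` fields, the rule that a jailbreak counts as successful when the model both understands and complies, and the toy labels are all assumptions made for illustration, not data or definitions from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One human-annotated model response (field names are hypothetical)."""
    understood: bool  # did the model understand the prompt directions?
    complied: bool    # did it comply with the provided instructions?
    harmful: bool     # was the response judged harmful?


def rate(judgments, predicate):
    """Fraction of judgments for which predicate holds (0.0 for an empty list)."""
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if predicate(j)) / len(judgments)


def jailbreak_rate(judgments):
    # Assumption: a jailbreak "succeeds" when the model understands and complies.
    return rate(judgments, lambda j: j.understood and j.complied)


def harmful_rate(judgments):
    return rate(judgments, lambda j: j.harmful)


def relative_increase(variant_rate, baseline_rate):
    """How many times higher the variant's rate is; None if the baseline is zero."""
    if baseline_rate == 0:
        return None
    return variant_rate / baseline_rate


# Toy placeholder labels, NOT results from the research.
baseline = [Judgment(True, False, False)] * 9 + [Judgment(True, True, True)]
variant = [Judgment(True, False, False)] * 6 + [Judgment(True, True, True)] * 4

print("jailbreak ratio:", relative_increase(jailbreak_rate(variant), jailbreak_rate(baseline)))
print("harmful ratio:", relative_increase(harmful_rate(variant), harmful_rate(baseline)))
```

Returning `None` for a zero baseline avoids a division by zero and makes it explicit that a ratio is undefined when the control model never produced the behavior in the sample.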
"Fine-tuning has become such a ubiquitous practice in machine learning, but its propensity to throw off model alignment is still not widely understood," said
A complete overview of this fine-tuning research, including a detailed walkthrough of the testing methodology, is available on the Robust Intelligence blog.
To learn how automated AI security and safety testing can identify vulnerabilities across the model lifecycle, visit Robust Intelligence's AI Validation page or schedule a demo.
About Robust Intelligence
Robust Intelligence enables enterprises to secure their AI transformation with an automated solution to protect against security and safety threats. The company's platform includes an engine for detecting and assessing model vulnerabilities, as well as recommending and enforcing the necessary guardrails to mitigate threats to AI applications in production. This enables organizations to meet AI safety and security standards with a single integration, including those from NIST, MITRE ATLAS, and OWASP. Robust Intelligence is backed by Sequoia and Tiger Global and trusted by leading companies including JPMorgan Chase, ADP, Expedia, IBM, and the US Department of Defense to unblock the enterprise AI mission.
Media Contact
[email protected]
View original content to download multimedia: https://www.prnewswire.com/news-releases/fine-tuning-breaks-model-alignment-and-introduces-new-vulnerabilities-robust-intelligence-research-finds-302156915.html
SOURCE Robust Intelligence