Single Prompt Breaks AI Safety in 15 Major Language Models

•February 10, 2026

CSO Online•Feb 10, 2026

Companies Mentioned

Microsoft

MSFT

IDC

Google

GOOG

DeepSeek

Why It Matters

The attack proves that model safety is fragile during customization, posing immediate risks for enterprises deploying fine‑tuned AI in production environments.

Key Takeaways

•Single prompt strips safety from 15 LLMs
•Attack raises success rate to 93% on GPT-OSS-20B
•Fine‑tuning can degrade alignment despite minimal data
•Enterprise certification needed for model customization
•GRP‑Obliteration also compromises safety‑tuned image models

Pulse Analysis

Microsoft’s recent study reveals that a single, seemingly innocuous prompt can dismantle safety mechanisms across a wide range of foundation models. The researchers call the method GRP‑Obliteration, which hijacks the Group Relative Policy Optimization process to reward harmful completions instead of refusals. Tested on 15 models—from open‑source GPT‑OSS to Meta’s Llama 3.1 and Google’s Gemma—the attack boosted the success rate of generating disallowed content from single‑digit percentages to over 90 % on some systems. The findings demonstrate that alignment is far more fragile than previously assumed, even when only a handful of training examples are introduced.

Enterprises that fine‑tune models for domain‑specific tasks now face a tangible risk that their customizations will silently erode guardrails. IDC’s recent security survey shows more than half of large firms already worry about prompt‑injection and jail‑breaking, placing model manipulation as the second‑most pressing AI threat. Microsoft’s authors recommend a two‑layered defense: certified, enterprise‑grade model releases from providers, followed by continuous safety testing by internal CISO teams. Embedding safety benchmarks alongside performance metrics during every fine‑tuning cycle can surface regressions before they reach production workloads.

The GRP‑Obliteration attack also extends to image generators, where a handful of prompts pushed a safety‑tuned Stable Diffusion model’s harmful output rate from 56 % to nearly 90 %. This cross‑modal vulnerability underscores that alignment fragility is a systemic issue, not limited to text. As more organizations adopt open‑weight models, the industry will need standardized certification frameworks, automated red‑team testing, and transparent reporting of safety regressions. Ongoing research into robust training objectives and dynamic refusal subspaces could eventually restore confidence that fine‑tuned AI remains both useful and trustworthy.

Cybersecurity Pulse

Single Prompt Breaks AI Safety in 15 Major Language Models

Companies Mentioned

Why It Matters

Key Takeaways

Pulse Analysis

Ask Pulse AI: