Black Hat USA 2025 | Universal and Context-Independent Triggers for Precise Control of LLM Outputs

Black Hat | Apr 4, 2026

Why It Matters

Universal adversarial triggers make sophisticated prompt‑injection attacks accessible to low‑skill actors, jeopardizing the security of AI‑driven agents and demanding immediate hardening measures.

Key Takeaways

  • Universal adversarial triggers enable model‑agnostic prompt injection attacks.
  • Triggers achieve approximately 70% success across diverse contexts and payloads.
  • Gradient‑based hot‑flip optimization discovers token sequences offline efficiently.
  • Demonstrated remote code execution on AI agents such as Open Interpreter.
  • Mitigations require sandboxing, perplexity filters, and model‑specific controls.

Summary

The Black Hat presentation introduced a novel class of prompt‑injection attacks called universal adversarial triggers, which allow attackers to hijack large language model (LLM) outputs with a single, reusable token sequence. By decoupling the malicious payload from the trigger, the method works across varied applications—from text generation to code‑to‑SQL—without needing bespoke crafting for each context.

The researchers showed that these triggers achieve roughly a 70% success rate across heterogeneous prompt settings and maintain about 60% effectiveness when transferred within the same model family. They are generated offline using gradient‑based “hot‑flip” optimization, which iteratively replaces tokens to maximize the probability of the desired payload, despite the discrete nature of token spaces.
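The hot-flip idea can be illustrated with a minimal sketch. The toy below uses a hypothetical linear "model" (a random embedding table `E` and weight vector `W`, both invented for illustration, not the authors' setup): the gradient of the payload loss with respect to each trigger token's embedding is computed, and each token is swapped for the vocabulary entry whose embedding most decreases the first-order loss estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8
E = rng.normal(size=(V, d))   # toy token-embedding table (illustrative only)
W = rng.normal(size=(d,))     # toy "model": payload logit = W . mean(embeddings)

def payload_loss(trigger_ids):
    # negative logit of the attacker-chosen payload: lower = payload more likely
    x = E[trigger_ids].mean(axis=0)
    return -float(W @ x)

def hotflip_step(trigger_ids):
    # gradient of the loss w.r.t. each trigger position's embedding
    # (for this linear toy it is -W / n at every position)
    g = -W / len(trigger_ids)
    best, best_loss = list(trigger_ids), payload_loss(trigger_ids)
    for pos in range(len(trigger_ids)):
        # first-order estimate of the loss change from swapping the token at `pos`
        delta = (E - E[trigger_ids[pos]]) @ g
        cand = int(delta.argmin())            # best single-token replacement
        trial = list(trigger_ids)
        trial[pos] = cand
        trial_loss = payload_loss(trial)
        if trial_loss < best_loss:            # keep the swap only if it helps
            best, best_loss = trial, trial_loss
    return best, best_loss

trigger = [1, 2, 3]
for _ in range(5):
    trigger, loss = hotflip_step(trigger)
```

A real attack would run this loop against an open-source LLM's token embeddings and averaged gradients over a diverse instruction dataset; the toy preserves only the core discrete-optimization step.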

Live demos highlighted the real‑world impact: a trigger embedded in an email caused the Open Interpreter agent to execute a reverse‑shell command, and a malicious VS Code extension leveraged the same technique to run arbitrary shell code without user confirmation. The talk also detailed the underlying token‑embedding pipeline and the mathematical loss formulation that guides trigger discovery.

The findings lower the technical barrier for prompt‑injection attacks, exposing AI agents to remote code execution and data theft. Defenders must adopt strict sandboxing, implement perplexity‑based filters, and design model‑specific safeguards to mitigate this emerging threat.
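A perplexity-based filter works because optimized trigger strings tend to be statistically unnatural token sequences. The sketch below approximates the idea with a smoothed character-bigram model (a deliberately simple stand-in for the LLM-based perplexity scoring a production defense would use; the threshold and corpus are placeholders):

```python
import math
from collections import Counter

def train_char_bigram(corpus, alpha=1.0):
    """Laplace-smoothed character-bigram model built from benign sample text."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab_size = len(set(corpus))
    def logprob(a, b):
        return math.log((bigrams[(a, b)] + alpha) /
                        (unigrams[a] + alpha * vocab_size))
    return logprob

def perplexity(logprob, text):
    # exp of the average negative log-likelihood over adjacent character pairs
    lp = sum(logprob(a, b) for a, b in zip(text, text[1:]))
    return math.exp(-lp / max(len(text) - 1, 1))

corpus = ("please summarize the attached report and send the results "
          "to the team by end of day thanks ") * 5
lp = train_char_bigram(corpus)

normal = "please send the report to the team"
gibberish = "zx]q;;vk~&#tq zq xj"   # stand-in for an optimized trigger string
```

Flagging inputs whose perplexity exceeds a calibrated threshold catches many gibberish triggers, though it is not a complete defense: attackers can constrain the optimization toward natural-looking tokens, which is why the talk pairs filtering with sandboxing and model-specific controls.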

Original Description

In this talk, we will introduce a novel gradient-based prompt-injection technique that can generate universal triggers to manipulate open-source Large Language Model (LLM) outputs. While previous attacks often depend heavily on prompt context or require multiple iterations to fully control the model's behavior, our method discovers "universal and context-independent triggers" that force the LLM to produce precisely crafted, attacker-chosen text—regardless of the original prompt or task. We will outline how these triggers are discovered via discrete gradient descent on extensive and diverse instruction datasets. Our demonstrations will show how such triggers can be applied to attack open source LLM applications to achieve remote code execution.
Furthermore, we will discuss the substantial threats posed by such attacks to LLM-based applications, highlighting the potential for adversaries to take over the decisions and actions made by AI agents.
By:
Jiashuo Liang | Researcher, Tencent Xuanwu Lab
Guancheng Li | Researcher, Tencent Xuanwu Lab
Presentation Materials Available at:
