
Augustus v0.0.9: Multi-Turn Attacks for LLMs That Fight Back
Why It Matters
Multi‑turn attacks expose a largely undefended surface, forcing LLM providers to extend safety beyond single‑turn filters. Organizations must evaluate conversational resilience to avoid data leakage and policy violations.
Key Takeaways
- Unified engine runs four distinct multi‑turn strategies
- Hydra can erase refused turns, diversifying tactics
- Crescendo reaches 0.80 score in just two turns
- GOAT achieves perfect score in a single turn
- Works across 28 providers, 172 probes, 43 generators
Pulse Analysis
The security community has long focused on single‑turn jailbreaks—simple prompts that trick a model into ignoring its policies. Modern guardrails now reject obvious tricks like “ignore previous instructions” or base64‑encoded payloads within milliseconds. However, these defenses often overlook the cumulative effect of a natural conversation, where each turn appears innocuous yet the sequence as a whole steers the model toward prohibited content. This shift from isolated prompts to contextual dialogue creates a blind spot that attackers can exploit, making multi‑turn testing essential for a realistic risk assessment.
Augustus v0.0.9 addresses that blind spot with a single binary that orchestrates attacker, target, and judge LLMs across any provider. Its four personalities illustrate different tactical philosophies: Crescendo escalates gently, GOAT attacks aggressively with chain‑of‑thought reasoning, Hydra rewrites refused turns to hide failures, and Mischievous User mimics a casual user to evade detection. A built‑in judge scores progress after each exchange, enabling automatic back‑tracking and technique diversification across twelve categories. The engine’s plug‑in architecture lets teams mix and match generators—OpenAI, Anthropic, Ollama, or custom REST endpoints—while leveraging 172 probes and 109 detectors for comprehensive coverage.
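The attacker/target/judge loop described above can be sketched roughly as follows. This is an illustrative toy, not Augustus's actual API: the function names, the 0.0–1.0 scoring scale, and the back‑tracking rule are assumptions made for the example.

```python
# Toy stand-ins for the three LLM roles the engine wires together; the real
# tool calls provider APIs (OpenAI, Anthropic, Ollama, custom REST endpoints).
def attacker_llm(history, technique):
    # Generate the next escalation turn under the chosen technique.
    return f"[{technique}] escalation turn {len(history) // 2 + 1}"

def target_llm(prompt):
    # Hypothetical target: refuses the "direct" technique, complies otherwise.
    return "I can't help with that." if "[direct]" in prompt else "Sure, here is more detail..."

def judge_llm(response):
    # Hypothetical judge: 0.0 on refusal, 0.8 once the target complies.
    return 0.0 if "can't" in response else 0.8

def run_attack(techniques, max_turns=5, goal_score=0.8):
    """Hydra-style loop: a refused turn is erased from the transcript and
    the next technique in the queue is tried in its place."""
    history, queue = [], list(techniques)
    technique, score = queue.pop(0), 0.0
    for _ in range(max_turns):
        prompt = attacker_llm(history, technique)
        reply = target_llm(prompt)
        score = judge_llm(reply)
        if score == 0.0 and queue:       # refusal: back-track, diversify
            technique = queue.pop(0)     # the refused turn never enters history
            continue
        history += [("user", prompt), ("assistant", reply)]
        if score >= goal_score:
            break
    return history, score
```

Calling `run_attack(["direct", "roleplay"])` yields a transcript in which the refused "direct" turn is absent: from the target's point of view, the conversation has never been rebuffed.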
For enterprises deploying LLMs, the emergence of robust multi‑turn attack frameworks signals a need to rethink defensive postures. Traditional prompt‑filtering and refusal logging are insufficient when a model gradually builds context that appears legitimate. Security teams should incorporate continuous conversation monitoring, dynamic policy updates, and adversarial training that includes multi‑turn scenarios. As open‑source tools like Augustus lower the barrier to sophisticated red‑team exercises, vendors are likely to accelerate research into conversational safety nets, such as memory‑aware refusal mechanisms and real‑time intent verification, to protect against this evolving threat vector.
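Conversation‑level monitoring of the kind suggested above can be sketched with a sliding window over per‑turn risk scores; the classifier producing those scores, the thresholds, and the window size here are all illustrative assumptions:

```python
def flag_conversation(turn_scores, per_turn=0.5, window=3, cumulative=1.0):
    """Flag the first turn index where either a single turn crosses the
    per-turn threshold, or the last `window` turns jointly look risky --
    catching gradual escalation that a per-turn filter alone would miss.
    Returns None if the conversation never trips either check."""
    for i in range(len(turn_scores)):
        if turn_scores[i] >= per_turn:
            return i  # classic single-turn filter fires
        if sum(turn_scores[max(0, i - window + 1): i + 1]) >= cumulative:
            return i  # cumulative drift across recent turns fires
    return None
```

For scores like `[0.2, 0.3, 0.4, 0.45]`, no individual turn crosses the 0.5 per‑turn threshold, but the windowed sum trips the monitor on the fourth turn, which is exactly the gradual‑escalation pattern Crescendo‑style attacks rely on.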