Anthropic AI Releases Bloom: An Open-Source Agentic Framework for Automated Behavioral Evaluations of Frontier AI Models

MarkTechPost · Dec 21, 2025

Why It Matters

Bloom scales the creation of targeted safety benchmarks, cutting evaluation costs and accelerating alignment cycles for large language models, which strengthens industry‑wide risk management.



Anthropic has released Bloom, an open source agentic framework that automates behavioral evaluations for frontier AI models. The system takes a researcher‑specified behavior and builds targeted evaluations that measure how often and how strongly that behavior appears in realistic scenarios.

Why Bloom?

Behavioral evaluations for safety and alignment are expensive to design and maintain. Teams must hand‑craft creative scenarios, run many interactions, read long transcripts and aggregate scores. As models evolve, old benchmarks can become obsolete or leak into training data. Anthropic’s research team frames this as a scalability problem: they need a way to generate fresh evaluations for misaligned behaviors faster while keeping metrics meaningful.

Bloom targets this gap. Instead of a fixed benchmark with a small set of prompts, Bloom grows an evaluation suite from a seed configuration. The seed anchors what behavior to study, how many scenarios to generate and what interaction style to use. The framework then produces new but behavior‑consistent scenarios on each run, while still allowing reproducibility through the recorded seed.


Bloom is implemented as a Python pipeline and is released under the MIT license on GitHub. The core input is the evaluation “seed”, defined in seed.yaml. This file references a behavior key in behaviors/behaviors.json, optional example transcripts and global parameters that shape the whole run.

Key configuration elements include (an example seed is sketched after this list):

  • behavior – a unique identifier defined in behaviors.json for the target behavior (e.g., sycophancy or self‑preservation)

  • examples – zero or more few‑shot transcripts stored under behaviors/examples/

  • total_evals – the number of rollouts to generate in the suite

  • rollout.target – the model under evaluation such as claude-sonnet-4

  • Controls such as diversity, max_turns, modality, reasoning effort and additional judgment qualities
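
A minimal seed consistent with the elements above might look like the following sketch. The field names mirror the list; the nesting and values shown are illustrative assumptions, not Bloom's documented schema.

```yaml
# seed.yaml – hypothetical sketch; nesting and values are illustrative,
# not Bloom's documented schema.
behavior: sycophancy                 # key defined in behaviors/behaviors.json
examples:                            # optional few-shot transcripts
  - behaviors/examples/sycophancy_example_1.json
total_evals: 100                     # number of rollouts in the suite
diversity: 0.5                       # assumed scale; distinct scenarios vs. variations
rollout:
  target: claude-sonnet-4            # model under evaluation
  max_turns: 15                      # assumed limit on conversation length
  modality: conversation             # assumed value
  no_user_mode: false                # assumed flag for fully autonomous rollouts
```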

Bloom uses LiteLLM as a backend for model API calls and can talk to Anthropic and OpenAI models through a single interface. It integrates with Weights & Biases for large sweeps and exports Inspect‑compatible transcripts.
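
Because model access goes through LiteLLM, the underlying API calls look roughly like the sketch below. This is generic LiteLLM usage rather than Bloom's internal code, and the model identifiers and prompt are placeholders.

```python
# Generic LiteLLM usage: one call shape for both Anthropic and OpenAI models.
# Model IDs and the prompt are placeholders, not Bloom's actual call sites.
import litellm

messages = [{"role": "user", "content": "Summarize this incident report."}]

anthropic_resp = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514", messages=messages
)
openai_resp = litellm.completion(model="openai/gpt-4o", messages=messages)

print(anthropic_resp.choices[0].message.content)
print(openai_resp.choices[0].message.content)
```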

Four‑stage agentic pipeline

Bloom’s evaluation process is organized into four agent stages that run in sequence:

  1. Understanding agent – reads the behavior description and example conversations, builds a structured summary of what counts as a positive instance, and attributes specific spans in the examples to successful behavior demonstrations.

  2. Ideation agent – generates candidate evaluation scenarios, describing the situation, the user persona, the tools the target model can access, and what a successful rollout looks like. Diversity settings control the trade‑off between generating distinct scenarios and generating variations of each scenario.

  3. Rollout agent – instantiates these scenarios with the target model, runs multi‑turn conversations or simulated environments, and records all messages and tool calls. Parameters such as max_turns, modality and no_user_mode control how autonomous the target model is.

  4. Judgment and meta‑judgment agents – a judge model scores each transcript for behavior presence on a numerical scale and can also rate additional qualities such as realism. A meta‑judge reads summaries of all rollouts and produces a suite‑level report highlighting the most important cases and patterns. The headline metric is the elicitation rate: the share of rollouts that score at least 7 out of 10 for behavior presence (a minimal computation is sketched after this list).
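
The elicitation-rate metric itself is easy to state in code. The sketch below computes it from a list of judge scores, assuming the 1-to-10 scale and the 7-point threshold described above; the function name and data layout are hypothetical.

```python
# Hypothetical helper: elicitation rate = share of rollouts whose judge score
# for behavior presence reaches the threshold (7 on a 1-to-10 scale).
def elicitation_rate(scores: list[float], threshold: float = 7.0) -> float:
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)

# Placeholder scores: 3 of the 100 rollouts reach the threshold -> rate of 0.03.
judge_scores = [8, 3, 7, 2, 9] + [1] * 95
print(f"elicitation rate: {elicitation_rate(judge_scores):.2f}")
```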

Validation on frontier models

Anthropic used Bloom to build four alignment‑relevant evaluation suites (delusional sycophancy, instructed long‑horizon sabotage, self‑preservation, and self‑preferential bias). Each suite contains 100 distinct rollouts and is repeated three times across 16 frontier models. The reported plots show elicitation rate with standard‑deviation error bars, using Claude Opus 4.1 as the evaluator across all stages.

Bloom is also tested on intentionally misaligned “model organisms” from earlier alignment work. Across 10 quirky behaviors, Bloom separates the organism from the baseline production model in 9 cases. In the remaining self‑promotion quirk, manual inspection shows that the baseline model exhibits similar behavior frequency, explaining the overlap in scores. A separate validation exercise compares human labels on 40 transcripts against 11 candidate judge models. Claude Opus 4.1 reaches a Spearman correlation of 0.86 with human scores, and Claude Sonnet 4.5 reaches 0.75, with especially strong agreement at high and low scores where thresholds matter.
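
The judge-versus-human comparison is a standard rank-correlation check; with SciPy it could be reproduced along the lines of the sketch below, where the score arrays are placeholders rather than Anthropic's actual labels.

```python
# Rank agreement between human labels and a candidate judge model's scores.
# The arrays are placeholders; the real study compared 40 labeled transcripts.
from scipy.stats import spearmanr

human_scores = [9, 2, 7, 1, 8, 3, 10, 4]   # hypothetical human ratings
judge_scores = [8, 3, 7, 2, 9, 2, 10, 5]   # hypothetical judge ratings

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```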


Relationship to Petri and positioning

Anthropic positions Bloom as complementary to Petri. Petri is a broad‑coverage auditing tool that takes seed instructions describing many scenarios and behaviors, then uses automated agents to probe models through multi‑turn interactions and summarize diverse safety‑relevant dimensions. Bloom instead starts from one behavior definition and automates the engineering needed to turn that into a large, targeted evaluation suite with quantitative metrics like elicitation rate.

Key takeaways

  • Bloom is an open‑source agentic framework that turns a single behavior specification into a complete behavioral evaluation suite for large models, using a four‑stage pipeline of understanding, ideation, rollout, and judgment.

  • The system is driven by a seed configuration in seed.yaml and behaviors/behaviors.json, where researchers specify the target behavior, example transcripts, total evaluations, rollout model, and controls such as diversity, max turns, and modality.

  • Bloom relies on LiteLLM for unified access to Anthropic and OpenAI models, integrates with Weights & Biases for experiment tracking, and exports Inspect‑compatible JSON plus an interactive viewer for inspecting transcripts and scores.

  • Anthropic validates Bloom on four alignment‑focused behaviors across 16 frontier models (100 rollouts repeated three times) and on ten model‑organism quirks, where Bloom separates intentionally misaligned organisms from baseline models in nine cases and judge models match human labels with Spearman correlation up to 0.86.


Author: Asif Razzaq

Asif Razzaq is the CEO of Marktechpost Media Inc. He is a visionary entrepreneur and engineer committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an AI media platform, Marktechpost, which provides in‑depth, technically sound coverage of machine‑learning and deep‑learning news.
