FLUX.2 lowers the barrier to high‑quality diffusion generation, enabling enterprises to deploy state‑of‑the‑art visual AI on more modest hardware and accelerate customized content creation.
FLUX.2 represents a significant architectural shift in open-source diffusion models. By consolidating text encoding into a single Mistral Small 3.1 encoder, the pipeline simplifies prompt processing while supporting a maximum sequence length of 512 tokens. The DiT now leans heavily on single-stream blocks, which hold roughly three-quarters of its parameters, and the design pairs well with kernel-level optimizations such as Flash Attention 3, positioning FLUX.2 as a competitive alternative to proprietary offerings.
From an operational perspective, the model's raw footprint exceeds 80 GB of VRAM, a hurdle for most on-premise deployments. Diffusers mitigates this with flexible strategies: CPU offloading trims GPU demand to around 62 GB, while 4-bit quantization via bitsandbytes brings the requirement down to roughly 20-24 GB, making consumer-grade GPUs viable. The modular pipeline design also lets the heavy text encoder run on a remote inference endpoint, freeing local memory for the diffusion core and lowering the footprint to about 18 GB, well within reach of mid-range workstations.
The broader impact on the AI industry is twofold. First, the lowered hardware barrier democratizes access to high‑resolution, text‑guided image synthesis, encouraging startups and creative studios to integrate visual AI into products and workflows. Second, the inclusion of LoRA fine‑tuning guidance empowers developers to tailor FLUX.2 to niche domains—be it medical illustration or brand‑specific art—without exhaustive retraining. As open‑source diffusion models continue to mature, FLUX.2’s blend of architectural efficiency and deployment flexibility sets a new benchmark for scalable, customizable generative AI.
Published: November 25, 2025
Authors: YiYi Xu, Daniel Gu, Sayak Paul, Alvaro Somoza, Dhruv Nair, Aritra Roy Gosthipaty, Linoy Tsaban, Apolinário from multimodal AI art
FLUX.2 is the latest series of image-generation models from Black Forest Labs, following the Flux.1 series. It is an entirely new model with a new architecture and pre-training done from scratch! In this post we discuss the key changes introduced in FLUX.2, how to perform inference with it under various setups, and LoRA fine-tuning.
🚨 FLUX.2 is not meant to be a drop-in replacement for FLUX.1, but a new generation of models
FLUX.2 can be used for both text-guided and image-guided image generation, and it can take multiple images as reference inputs when producing the final output. A rough usage sketch is shown below, followed by a brief discussion of the key changes introduced in FLUX.2.
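To make the two modes concrete, here is a hedged sketch: the reference-image argument name and the file names are assumptions rather than something taken from the FLUX.2 documentation, and memory-friendly inference setups are covered in detail later in the post.

```python
import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Text-guided generation: prompt only.
t2i = pipe(prompt="a watercolor fox in a snowy forest", num_inference_steps=28).images[0]

# Image-guided generation: pass one or more reference images alongside the prompt.
# The argument name `image` is assumed here; check the Flux2Pipeline docs for the exact API.
refs = [load_image("reference_subject.png"), load_image("reference_style.png")]  # hypothetical local files
edited = pipe(
    prompt="place the subject from the first image into the scene of the second",
    image=refs,
    num_inference_steps=28,
).images[0]
```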
Instead of two text encoders as in Flux.1, FLUX.2 uses a single text encoder – Mistral Small 3.1. Using a single encoder greatly simplifies the process of computing prompt embeddings. The pipeline allows for a max_sequence_length of 512.
FLUX.2 follows the same general multimodal diffusion transformer (MM-DiT) + parallel DiT architecture as Flux.1. As a refresher, MM-DiT blocks first process the image latents and conditioning text in separate streams, only joining the two together for the attention operation, and are thus referred to as “double-stream” blocks. The parallel blocks then operate on the concatenated image and text streams and can be regarded as “single-stream” blocks.
Key DiT changes from Flux.1 to FLUX.2:
Time and guidance information (in the form of AdaLayerNorm-Zero modulation parameters) is shared across all double-stream blocks and across all single-stream blocks, rather than each block having its own modulation parameters as in Flux.1.
None of the layers in the model use bias parameters. In particular, neither the attention nor feed‑forward (FF) sub‑blocks of either transformer block use bias parameters.
In Flux.1, the single‑stream transformer blocks fused the attention output projection with the FF output projection. FLUX.2 single‑stream blocks also fuse the attention QKV projections with the FF input projection, creating a fully parallel transformer block:
*Figure: the parallel transformer block design, taken from the ViT-22B paper.*
Compared to the ViT-22B block, FLUX.2 uses a SwiGLU-style MLP activation rather than GELU (and also doesn't use bias parameters); a minimal sketch of such a block is shown after this list.
A larger proportion of the transformer blocks in FLUX.2 are single‑stream blocks (8 double‑stream blocks to 48 single‑stream blocks, compared to 19/38 for Flux.1). Consequently, single‑stream blocks make up a larger share of the DiT parameters: Flux.1[dev]-12B has ~54 % of its total parameters in double‑stream blocks, whereas FLUX.2[dev]-32B has ~24 % in double‑stream blocks (and ~73 % in single‑stream blocks).
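To make the single-stream design concrete, here is a minimal PyTorch sketch of a parallel transformer block in the spirit described above. This is an illustration, not FLUX.2's actual implementation: the dimensions are made up, and the shared AdaLayerNorm-Zero modulation is reduced to a plain LayerNorm placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelTransformerBlock(nn.Module):
    """Illustrative single-stream ("parallel") block: the attention and MLP branches
    share one fused input projection and one fused output projection, with no biases."""

    def __init__(self, dim: int = 3072, num_heads: int = 24, mlp_ratio: float = 4.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.mlp_hidden = int(dim * mlp_ratio)
        # In FLUX.2, shared AdaLayerNorm-Zero modulation would be applied here; we use a plain LayerNorm.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Fused input projection: attention Q, K, V plus the SwiGLU gate/up projections.
        self.proj_in = nn.Linear(dim, 3 * dim + 2 * self.mlp_hidden, bias=False)
        # Fused output projection over the concatenated attention and MLP outputs.
        self.proj_out = nn.Linear(dim + self.mlp_hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- the concatenated text + image token stream.
        h = self.norm(x)
        dim = x.shape[-1]
        q, k, v, gate, up = self.proj_in(h).split([dim, dim, dim, self.mlp_hidden, self.mlp_hidden], dim=-1)

        def to_heads(t: torch.Tensor) -> torch.Tensor:
            return t.unflatten(-1, (self.num_heads, self.head_dim)).transpose(1, 2)

        # Attention branch.
        attn = F.scaled_dot_product_attention(to_heads(q), to_heads(k), to_heads(v))
        attn = attn.transpose(1, 2).flatten(-2)
        # SwiGLU-style MLP branch (no GELU, no bias).
        mlp = F.silu(gate) * up
        # Both branches go through a single fused output projection, then a residual connection.
        return x + self.proj_out(torch.cat([attn, mlp], dim=-1))
```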
Other key changes include a new autoencoder and a better way to incorporate resolution-dependent timestep schedules.
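The post doesn't spell out the new schedule, but for context, this is roughly how the FLUX.1 pipelines in Diffusers already derive a resolution-dependent shift for the flow-matching scheduler: the shift parameter `mu` is interpolated from the number of image tokens. The constants below follow the FLUX.1 scheduler configuration and are illustrative; FLUX.2's exact mechanism may differ.

```python
def calculate_shift(
    image_seq_len: int,
    base_seq_len: int = 256,
    max_seq_len: int = 4096,
    base_shift: float = 0.5,
    max_shift: float = 1.15,
) -> float:
    """Linearly interpolate the scheduler shift `mu` from the number of image tokens,
    so that larger images get a stronger timestep shift (FLUX.1-style)."""
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b


# A 1024x1024 image -> 128x128 latent -> 64x64 tokens after 2x2 patchification = 4096 tokens.
print(calculate_shift(64 * 64))  # ~1.15 with these constants; smaller resolutions get a smaller shift
```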
FLUX.2 uses a larger DiT and Mistral Small 3.1 as its text encoder. When used together without any kind of offloading, inference requires more than 80 GB VRAM. Below we show how to perform inference with FLUX.2 in more accessible ways, under various system-level constraints.
Before running the code snippets, install the latest diffusers from the main branch and log in to the Hugging Face Hub:
```bash
pip uninstall diffusers -y && pip install git+https://github.com/huggingface/diffusers -U
hf auth login
```
```python
from diffusers import Flux2Pipeline
import torch

repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="dog dancing near the sun",
    num_inference_steps=50,  # 28 is a good trade-off
    guidance_scale=4,
    height=1024,
    width=1024,
).images[0]
```
The snippet was tested on an H100; without CPU offloading, the model does not fit in the GPU's 80 GB of memory. With CPU offloading enabled, the setup uses ~62 GB VRAM.
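If ~62 GB is still too much, Diffusers offers a more aggressive offloading mode that keeps weights on the CPU and streams them to the GPU submodule by submodule. This is a generic pipeline feature rather than something specific to FLUX.2, so treat the following as a sketch and measure the VRAM/latency trade-off on your own hardware:

```python
from diffusers import Flux2Pipeline
import torch

repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
# Much lower peak VRAM than enable_model_cpu_offload(), at the cost of slower inference.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="dog dancing near the sun",
    num_inference_steps=28,
    guidance_scale=4,
    height=1024,
    width=1024,
).images[0]
```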
Hopper‑series GPUs can leverage Flash Attention 3 for faster inference:
```python
from diffusers import Flux2Pipeline
import torch

repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
# Use the Flash Attention 3 kernels from the Hugging Face Hub for the DiT.
pipe.transformer.set_attention_backend("_flash_3_hub")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="dog dancing near the sun",
    num_inference_steps=50,
    guidance_scale=2.5,
    height=1024,
    width=1024,
).images[0]
```
You can check out the list of supported attention backends in the Diffusers documentation on attention backends.
Using bitsandbytes, we can load the transformer and text encoder in 4‑bit, allowing 24 GB GPUs to run the model locally. The following snippet runs on a GPU with ~20 GB free VRAM.
```python
import torch
from transformers import Mistral3ForConditionalGeneration
from diffusers import Flux2Pipeline, Flux2Transformer2DModel

# Pre-quantized 4-bit (NF4) checkpoint of FLUX.2-dev.
repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
device = "cuda:0"
torch_dtype = torch.bfloat16

transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
)
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    repo_id, subfolder="text_encoder", dtype=torch_dtype, device_map="cpu"
)
pipe = Flux2Pipeline.from_pretrained(
    repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)
pipe.enable_model_cpu_offload()

prompt = (
    "Realistic macro photograph of a hermit crab using a soda can as its shell, "
    "partially emerging from the can, captured with sharp detail and natural colors, "
    "on a sunlit beach with soft shadows and a shallow depth of field, with blurred "
    "ocean waves in the background. The can has the text `BFL Diffusers` on it and "
    "it has a color gradient that starts with #FF5733 at the top and transitions to "
    "#33FF57 at the bottom."
)
image = pipe(
    prompt=prompt,
    generator=torch.Generator(device=device).manual_seed(42),
    num_inference_steps=50,
    guidance_scale=4,
).images[0]
image.save("flux2_t2i_nf4.png")
```
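The snippet above relies on a pre-quantized checkpoint. If you would rather quantize the original black-forest-labs/FLUX.2-dev weights yourself, you can pass bitsandbytes configs at load time; the NF4 settings below are illustrative, and the resulting memory use should be in the same ballpark as the pre-quantized checkpoint.

```python
import torch
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import Mistral3ForConditionalGeneration
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import Flux2Pipeline, Flux2Transformer2DModel

repo_id = "black-forest-labs/FLUX.2-dev"
torch_dtype = torch.bfloat16

# 4-bit NF4 quantization for the DiT (Diffusers config) and the text encoder (Transformers config).
dit_quant = DiffusersBitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype
)
te_quant = TransformersBitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype
)

transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", quantization_config=dit_quant, torch_dtype=torch_dtype
)
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    repo_id, subfolder="text_encoder", quantization_config=te_quant, dtype=torch_dtype
)
pipe = Flux2Pipeline.from_pretrained(
    repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)
pipe.enable_model_cpu_offload()
```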
Because a Diffusers pipeline is modular, we can run the text encoder remotely (via an Inference Endpoint) and keep the DiT and VAE local. Offloading the text encoder this way frees local VRAM for the diffusion transformer, the heaviest remaining component.
⚠️ To use the remote text encoder you need a valid Hugging Face token. If you are already authenticated, no further action is required.
```python
import io

import requests
import torch
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from huggingface_hub import get_token


def remote_text_encoder(prompts: str | list[str]):
    """Compute FLUX.2 prompt embeddings on a remote Inference Endpoint."""

    def _encode_single(prompt: str):
        response = requests.post(
            "https://remote-text-encoder-flux-2.huggingface.co/predict",
            json={"prompt": prompt},
            headers={
                "Authorization": f"Bearer {get_token()}",
                "Content-Type": "application/json",
            },
        )
        assert response.status_code == 200, f"{response.status_code=}"
        return torch.load(io.BytesIO(response.content))

    if isinstance(prompts, (list, tuple)):
        embeds = [_encode_single(p) for p in prompts]
        return torch.cat(embeds, dim=0).to("cuda")
    return _encode_single(prompts).to("cuda")


repo_id = "black-forest-labs/FLUX.2-dev"
quantized_dit_id = "diffusers/FLUX.2-dev-bnb-4bit"
torch_dtype = torch.bfloat16

dit = Flux2Transformer2DModel.from_pretrained(
    quantized_dit_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
)
pipe = Flux2Pipeline.from_pretrained(
    repo_id,
    text_encoder=None,  # will be provided remotely
    transformer=dit,
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()

print("Running remote text encoder ☁️")
prompt1 = "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"
prompt2 = "a photo of a dense forest with rain. The word 'FLUX.2' is painted over it in big, red brush strokes with"

# Encode prompts remotely
embeds = remote_text_encoder([prompt1, prompt2])

# Generate images locally
images = pipe(
    prompt_embeds=embeds,
    num_inference_steps=50,
    guidance_scale=4,
).images
```
This setup runs on a GPU with ~18 GB VRAM.
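The call returns one image per prompt; a small follow-up to write both to disk:

```python
for i, image in enumerate(images):
    image.save(f"flux2_remote_text_encoder_{i}.png")
```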