FLUX.2 lowers the barrier to high‑quality diffusion generation, enabling enterprises to deploy state‑of‑the‑art visual AI on more modest hardware and accelerate customized content creation.
FLUX.2 represents a significant architectural shift in open-source diffusion models. By consolidating text encoding into a single Mistral Small 3.1 encoder, the pipeline simplifies prompt processing while supporting a maximum sequence length of 512 tokens. The DiT now leans heavily on single-stream blocks, which hold roughly three-quarters of its parameters, and the design pairs well with kernel-level optimizations such as Flash Attention 3, positioning FLUX.2 as a competitive alternative to proprietary offerings.
From an operational perspective, the model's raw footprint exceeds 80 GB of VRAM, a hurdle for most on-premise deployments. Diffusers mitigates this with flexible strategies: CPU offloading trims GPU demand to around 62 GB, while 4-bit quantization via bitsandbytes brings the requirement down to roughly 20-24 GB, making consumer-grade GPUs viable. The modular pipeline design also lets the heavy text encoder run on a remote inference endpoint, freeing local memory for the diffusion core and lowering the footprint to about 18 GB, well within reach of mid-range workstations.
The broader impact on the AI industry is twofold. First, the lowered hardware barrier democratizes access to high‑resolution, text‑guided image synthesis, encouraging startups and creative studios to integrate visual AI into products and workflows. Second, the inclusion of LoRA fine‑tuning guidance empowers developers to tailor FLUX.2 to niche domains—be it medical illustration or brand‑specific art—without exhaustive retraining. As open‑source diffusion models continue to mature, FLUX.2’s blend of architectural efficiency and deployment flexibility sets a new benchmark for scalable, customizable generative AI.
Published: November 25, 2025
Authors: YiYi Xu, Daniel Gu, Sayak Paul, Alvaro Somoza, Dhruv Nair, Aritra Roy Gosthipaty, Linoy Tsaban, Apolinário from multimodal AI art
FLUX.2 is the latest series of image-generation models from Black Forest Labs, following the Flux.1 series. It is an entirely new model with a new architecture and pre-training done from scratch! In this post we discuss the key changes introduced in FLUX.2, how to perform inference with it under various setups, and LoRA fine-tuning.
🚨 FLUX.2 is not meant to be a drop-in replacement for FLUX.1, but a new generation of models
FLUX.2 can be used for both text-guided and image-guided image generation, and it can take multiple images as reference inputs when producing the final output. A rough usage sketch is shown below, followed by a brief discussion of the key changes introduced in FLUX.2.
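To make the two modes concrete, here is a hedged sketch: the reference-image argument name and the file names are assumptions rather than something taken from the FLUX.2 documentation, and memory-friendly inference setups are covered in detail later in the post.

```python
import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Text-guided generation: prompt only.
t2i = pipe(prompt="a watercolor fox in a snowy forest", num_inference_steps=28).images[0]

# Image-guided generation: pass one or more reference images alongside the prompt.
# The argument name `image` is assumed here; check the Flux2Pipeline docs for the exact API.
refs = [load_image("reference_subject.png"), load_image("reference_style.png")]  # hypothetical local files
edited = pipe(
    prompt="place the subject from the first image into the scene of the second",
    image=refs,
    num_inference_steps=28,
).images[0]
```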
Instead of two text encoders as in Flux.1, FLUX.2 uses a single text encoder – Mistral Small 3.1. Using a single encoder greatly simplifies the process of computing prompt embeddings. The pipeline allows for a max_sequence_length of 512.
FLUX.2 follows the same general multimodal diffusion transformer (MM-DiT) + parallel DiT architecture as Flux.1. As a refresher, MM-DiT blocks first process the image latents and conditioning text in separate streams, only joining the two together for the attention operation, and are thus referred to as “double-stream” blocks. The parallel blocks then operate on the concatenated image and text streams and can be regarded as “single-stream” blocks.
Key DiT changes from Flux.1 to FLUX.2:
Time and guidance information (in the form of AdaLayerNorm-Zero modulation parameters) is shared across all double-stream blocks and across all single-stream blocks, rather than each block having its own modulation parameters as in Flux.1.
None of the layers in the model use bias parameters. In particular, neither the attention nor feed‑forward (FF) sub‑blocks of either transformer block use bias parameters.
In Flux.1, the single‑stream transformer blocks fused the attention output projection with the FF output projection. FLUX.2 single‑stream blocks also fuse the attention QKV projections with the FF input projection, creating a fully parallel transformer block:
*Figure: the parallel transformer block design, taken from the ViT-22B paper.*
Compared to the ViT-22B block, FLUX.2 uses a SwiGLU-style MLP activation rather than GELU (and also doesn't use bias parameters); a minimal sketch of such a block is shown after this list.
A larger proportion of the transformer blocks in FLUX.2 are single‑stream blocks (8 double‑stream blocks to 48 single‑stream blocks, compared to 19/38 for Flux.1). Consequently, single‑stream blocks make up a larger share of the DiT parameters: Flux.1[dev]-12B has ~54 % of its total parameters in double‑stream blocks, whereas FLUX.2[dev]-32B has ~24 % in double‑stream blocks (and ~73 % in single‑stream blocks).
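To make the single-stream design concrete, here is a minimal PyTorch sketch of a parallel transformer block in the spirit described above. This is an illustration, not FLUX.2's actual implementation: the dimensions are made up, and the shared AdaLayerNorm-Zero modulation is reduced to a plain LayerNorm placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelTransformerBlock(nn.Module):
    """Illustrative single-stream ("parallel") block: the attention and MLP branches
    share one fused input projection and one fused output projection, with no biases."""

    def __init__(self, dim: int = 3072, num_heads: int = 24, mlp_ratio: float = 4.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.mlp_hidden = int(dim * mlp_ratio)
        # In FLUX.2, shared AdaLayerNorm-Zero modulation would be applied here; we use a plain LayerNorm.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Fused input projection: attention Q, K, V plus the SwiGLU gate/up projections.
        self.proj_in = nn.Linear(dim, 3 * dim + 2 * self.mlp_hidden, bias=False)
        # Fused output projection over the concatenated attention and MLP outputs.
        self.proj_out = nn.Linear(dim + self.mlp_hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- the concatenated text + image token stream.
        h = self.norm(x)
        dim = x.shape[-1]
        q, k, v, gate, up = self.proj_in(h).split([dim, dim, dim, self.mlp_hidden, self.mlp_hidden], dim=-1)

        def to_heads(t: torch.Tensor) -> torch.Tensor:
            return t.unflatten(-1, (self.num_heads, self.head_dim)).transpose(1, 2)

        # Attention branch.
        attn = F.scaled_dot_product_attention(to_heads(q), to_heads(k), to_heads(v))
        attn = attn.transpose(1, 2).flatten(-2)
        # SwiGLU-style MLP branch (no GELU, no bias).
        mlp = F.silu(gate) * up
        # Both branches go through a single fused output projection, then a residual connection.
        return x + self.proj_out(torch.cat([attn, mlp], dim=-1))
```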
Other key changes include a new autoencoder and a better way to incorporate resolution-dependent timestep schedules.
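The post doesn't spell out the new schedule, but for context, this is roughly how the FLUX.1 pipelines in Diffusers already derive a resolution-dependent shift for the flow-matching scheduler: the shift parameter `mu` is interpolated from the number of image tokens. The constants below follow the FLUX.1 scheduler configuration and are illustrative; FLUX.2's exact mechanism may differ.

```python
def calculate_shift(
    image_seq_len: int,
    base_seq_len: int = 256,
    max_seq_len: int = 4096,
    base_shift: float = 0.5,
    max_shift: float = 1.15,
) -> float:
    """Linearly interpolate the scheduler shift `mu` from the number of image tokens,
    so that larger images get a stronger timestep shift (FLUX.1-style)."""
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b


# A 1024x1024 image -> 128x128 latent -> 64x64 tokens after 2x2 patchification = 4096 tokens.
print(calculate_shift(64 * 64))  # ~1.15 with these constants; smaller resolutions get a smaller shift
```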
FLUX.2 uses a larger DiT and Mistral Small 3.1 as its text encoder. When used together without any kind of offloading, inference requires more than 80 GB VRAM. Below we show how to perform inference with FLUX.2 in more accessible ways, under various system-level constraints.
Before running the code snippets, install the latest diffusers from the main branch and log in to the Hugging Face Hub:
```bash
pip uninstall diffusers -y && pip install git+https://github.com/huggingface/diffusers -U
hf auth login
```
```python
from diffusers import Flux2Pipeline
import torch

repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="dog dancing near the sun",
    num_inference_steps=50,  # 28 is a good trade-off
    guidance_scale=4,
    height=1024,
    width=1024,
).images[0]
```
The snippet was tested on an H100; without CPU offloading, the model does not fit in the GPU's 80 GB of memory. With CPU offloading enabled, the setup uses ~62 GB VRAM.
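If ~62 GB is still too much, Diffusers offers a more aggressive offloading mode that keeps weights on the CPU and streams them to the GPU submodule by submodule. This is a generic pipeline feature rather than something specific to FLUX.2, so treat the following as a sketch and measure the VRAM/latency trade-off on your own hardware:

```python
from diffusers import Flux2Pipeline
import torch

repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
# Much lower peak VRAM than enable_model_cpu_offload(), at the cost of slower inference.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="dog dancing near the sun",
    num_inference_steps=28,
    guidance_scale=4,
    height=1024,
    width=1024,
).images[0]
```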
Hopper‑series GPUs can leverage Flash Attention 3 for faster inference:
```python
from diffusers import Flux2Pipeline
import torch

repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
# Use the Flash Attention 3 kernels from the Hugging Face Hub for the DiT.
pipe.transformer.set_attention_backend("_flash_3_hub")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="dog dancing near the sun",
    num_inference_steps=50,
    guidance_scale=2.5,
    height=1024,
    width=1024,
).images[0]
```
You can check out the list of supported attention backends in the Diffusers documentation on attention backends.
Using bitsandbytes, we can load the transformer and text encoder in 4‑bit, allowing 24 GB GPUs to run the model locally. The following snippet runs on a GPU with ~20 GB free VRAM.
```python
import torch
from transformers import Mistral3ForConditionalGeneration
from diffusers import Flux2Pipeline, Flux2Transformer2DModel

# Pre-quantized 4-bit (NF4) checkpoint of FLUX.2-dev.
repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
device = "cuda:0"
torch_dtype = torch.bfloat16

transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
)
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    repo_id, subfolder="text_encoder", dtype=torch_dtype, device_map="cpu"
)
pipe = Flux2Pipeline.from_pretrained(
    repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)
pipe.enable_model_cpu_offload()

prompt = (
    "Realistic macro photograph of a hermit crab using a soda can as its shell, "
    "partially emerging from the can, captured with sharp detail and natural colors, "
    "on a sunlit beach with soft shadows and a shallow depth of field, with blurred "
    "ocean waves in the background. The can has the text `BFL Diffusers` on it and "
    "it has a color gradient that starts with #FF5733 at the top and transitions to "
    "#33FF57 at the bottom."
)
image = pipe(
    prompt=prompt,
    generator=torch.Generator(device=device).manual_seed(42),
    num_inference_steps=50,
    guidance_scale=4,
).images[0]
image.save("flux2_t2i_nf4.png")
```
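The snippet above relies on a pre-quantized checkpoint. If you would rather quantize the original black-forest-labs/FLUX.2-dev weights yourself, you can pass bitsandbytes configs at load time; the NF4 settings below are illustrative, and the resulting memory use should be in the same ballpark as the pre-quantized checkpoint.

```python
import torch
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import Mistral3ForConditionalGeneration
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import Flux2Pipeline, Flux2Transformer2DModel

repo_id = "black-forest-labs/FLUX.2-dev"
torch_dtype = torch.bfloat16

# 4-bit NF4 quantization for the DiT (Diffusers config) and the text encoder (Transformers config).
dit_quant = DiffusersBitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype
)
te_quant = TransformersBitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype
)

transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", quantization_config=dit_quant, torch_dtype=torch_dtype
)
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    repo_id, subfolder="text_encoder", quantization_config=te_quant, dtype=torch_dtype
)
pipe = Flux2Pipeline.from_pretrained(
    repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)
pipe.enable_model_cpu_offload()
```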
Because a Diffusers pipeline is modular, we can run the text encoder remotely (via an Inference Endpoint) and keep the DiT and VAE local. Offloading the text encoder this way frees local VRAM for the diffusion transformer, the heaviest remaining component.
⚠️ To use the remote text encoder you need a valid Hugging Face token. If you are already authenticated, no further action is required.
```python
import io

import requests
import torch
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from huggingface_hub import get_token


def remote_text_encoder(prompts: str | list[str]):
    """Compute FLUX.2 prompt embeddings on a remote Inference Endpoint."""

    def _encode_single(prompt: str):
        response = requests.post(
            "https://remote-text-encoder-flux-2.huggingface.co/predict",
            json={"prompt": prompt},
            headers={
                "Authorization": f"Bearer {get_token()}",
                "Content-Type": "application/json",
            },
        )
        assert response.status_code == 200, f"{response.status_code=}"
        return torch.load(io.BytesIO(response.content))

    if isinstance(prompts, (list, tuple)):
        embeds = [_encode_single(p) for p in prompts]
        return torch.cat(embeds, dim=0).to("cuda")
    return _encode_single(prompts).to("cuda")


repo_id = "black-forest-labs/FLUX.2-dev"
quantized_dit_id = "diffusers/FLUX.2-dev-bnb-4bit"
torch_dtype = torch.bfloat16

dit = Flux2Transformer2DModel.from_pretrained(
    quantized_dit_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
)
pipe = Flux2Pipeline.from_pretrained(
    repo_id,
    text_encoder=None,  # will be provided remotely
    transformer=dit,
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()

print("Running remote text encoder ☁️")
prompt1 = "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"
prompt2 = "a photo of a dense forest with rain. The word 'FLUX.2' is painted over it in big, red brush strokes with"

# Encode prompts remotely
embeds = remote_text_encoder([prompt1, prompt2])

# Generate images locally
images = pipe(
    prompt_embeds=embeds,
    num_inference_steps=50,
    guidance_scale=4,
).images
```
This setup runs on a GPU with ~18 GB VRAM.
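The call returns one image per prompt; a small follow-up to write both to disk:

```python
for i, image in enumerate(images):
    image.save(f"flux2_remote_text_encoder_{i}.png")
```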