Training Design for Text-to-Image Models: Lessons From Ablations

AI • Hugging Face • February 3, 2026

Companies Mentioned

Pexels • Midjourney • Discord
Why It Matters

These findings provide a practical recipe for reducing compute cost and accelerating convergence of large text‑to‑image models, a critical advantage for commercial AI developers seeking competitive quality without prohibitive training budgets.

Key Takeaways

  • REPA alignment cuts FID by ~3 points with a modest speed loss
  • REPA‑E and Flux2‑AE cut FID by ~6 points, trading throughput
  • TREAD token routing boosts high‑res speed and quality
  • Long captions are essential; short captions double FID
  • Muon optimizer improves metrics over AdamW

Pulse Analysis

Training large text‑to‑image diffusion models remains a resource‑intensive endeavor, yet the PRX Part 2 ablation study demonstrates that strategic loss‑function tweaks can yield outsized quality gains. By coupling flow‑matching with a frozen visual encoder (REPA), early representation learning accelerates, delivering a three‑point FID reduction while only slightly slowing batch throughput. Extending alignment to the latent space (REPA‑E) or swapping the tokenizer for Flux2‑AE pushes the FID down by roughly six points, though practitioners must balance this against a noticeable drop in samples per second. These techniques illustrate how modest algorithmic changes can replace raw compute in the race for higher fidelity.

Beyond loss functions, the study explores compute‑saving architectures such as token routing. Methods like TREAD, which bypass expensive layers for a subset of tokens, show limited benefit at low resolutions but become a game‑changer at 1024 × 1024, delivering both faster training and better image quality. Coupled with the Muon optimizer—a preconditioner‑style alternative to AdamW—training converges more quickly, while careful BF16 handling avoids precision‑related quality regressions. Together, these engineering choices form a cohesive efficiency stack that trims training time without sacrificing output fidelity.

Data curation also proves pivotal. Long, descriptive captions provide richer supervisory signals, whereas short captions dramatically inflate FID, underscoring the importance of textual richness. A two‑phase data regimen—starting with synthetic MidJourneyV6 images for rapid structural learning, then fine‑tuning on real‑world Pexels data for texture fidelity—optimizes both speed and realism. Adding a small, high‑impact fine‑tuning set (the Alchemist dataset) further polishes the model. For enterprises, these insights translate into lower cloud costs, faster time‑to‑market, and the ability to iterate on generative AI products with confidence.


PRX Part 2: Training Efficient Text‑to‑Image Models

Welcome back! This is the second part of our series on training efficient text‑to‑image models from scratch.

In the first post of this series we introduced our goal: training a competitive text‑to‑image foundation model entirely from scratch, in the open, and at scale. We focused primarily on architectural choices and motivated the core design decisions behind our model PRX. We also released an early, small (1.2 B parameters) version of the model as a preview of what we are building (go try it if you haven’t already 😉).

In this post, we shift our focus from architecture to training. The goal is to document what actually moved the needle for us when trying to make models train faster, converge more reliably, and learn better representations. The field is moving quickly and the list of “training tricks” keeps growing, so rather than attempting an exhaustive survey, we structured this as an experimental logbook: we reproduce (or adapt) a set of recent ideas, implement them in a consistent setup, and report how they affect optimization and convergence in practice. Finally, we do not only report these techniques in isolation; we also explore which ones remain useful when combined.

In the next post, we will publish the full training recipe as code, including the experiments in this post. We will also run and report on a public “speedrun” where we put the best pieces together into a single configuration and stress‑test it end‑to‑end. This exercise will serve both as a stress test of our current training pipeline and as a concrete demonstration of how far careful training design can go under tight constraints. If you haven’t already, we invite you to join our Discord to continue the discussion. A significant part of this project has been shaped by exchanges with community members, and we place a high value on external feedback, ablations, and alternative interpretations of the results.


The Baseline

Before introducing any training‑efficiency techniques, we first establish a clean reference run. This baseline is intentionally simple. It uses standard components, avoids auxiliary objectives, and does not rely on architectural shortcuts or tricks to save compute resources. Its role is to serve as a stable point of comparison for all subsequent experiments.

Concretely, this is a pure Flow Matching (Lipman et al., 2022) training setup (as introduced in Part 1) with no extra objectives and no architectural speed hacks. We will use the small PRX‑1.2 B model we presented in the first post of this series (single‑stream architecture with global attention for the image tokens and text tokens) as our baseline and train it in Flux VAE latent space, keeping the configuration fixed across all comparisons unless stated otherwise.
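To make the reference point concrete, here is a minimal sketch of what that baseline objective looks like in code, assuming the rectified‑flow convention x_t = (1 − t)·x₀ + t·ε (so the velocity target is ε − x₀); `model` and the tensor names are placeholders rather than our actual training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, text_emb):
    """Baseline objective: plain flow matching, no auxiliary losses.

    x0: clean image latents (B, C, H, W); text_emb: text-encoder outputs.
    Assumes x_t = (1 - t) * x0 + t * eps, so the velocity target is eps - x0.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)   # uniform timesteps
    eps = torch.randn_like(x0)                              # Gaussian noise
    x_t = (1 - t) * x0 + t * eps                            # interpolated sample
    target_v = eps - x0                                     # flow-matching target
    pred_v = model(x_t, t.flatten(), text_emb)              # predicted velocity
    return F.mse_loss(pred_v, target_v)
```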

Baseline training setup

| Setting | Value |
|---|---|
| Steps | 100 k |
| Dataset | Public 1 M synthetic images generated with MidJourneyV6 |
| Resolution | 256 × 256 |
| Global batch size | 256 |
| Optimizer | AdamW |
| lr | 1e‑4 |
| weight_decay | 0.0 |
| eps | 1e‑15 |
| betas | (0.9, 0.95) |
| Text encoder | GemmaT5 |
| Positional encoding | Rotary (RoPE) |
| Attention mask | Padding mask |
| EMA | Disabled |
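The optimizer row of this table maps directly onto a standard PyTorch call (a sketch; `model` is a placeholder for the PRX network):

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # lr from the baseline table
    betas=(0.9, 0.95),  # baseline betas
    eps=1e-15,          # small eps, as in the table
    weight_decay=0.0,   # no weight decay in the baseline
)
```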

For every technique below, we ask the same question: does this modification improve convergence or training efficiency relative to the baseline?

Examples of baseline model generations after 100 k training steps.


Benchmarking Metrics

To keep this post grounded, we rely on a small set of metrics to monitor checkpoints over time. None of them is a perfect proxy for perceived image quality, but together they provide a practical scoreboard while we iterate.

  • Fréchet Inception Distance (FID) – measures how close the distributions of generated and real images are (lower is better).

  • CLIP Maximum Mean Discrepancy (CMMD) – distance between real and generated image distributions using CLIP embeddings (lower is better).

  • DINOv2 Maximum Mean Discrepancy (DINO‑MMD) – same idea as CMMD, but using DINOv2 embeddings (a generic embedding‑MMD sketch follows this list).

  • Network throughput – average number of samples processed per second (samples / s).
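For readers who want to reproduce the scoreboard, here is a minimal sketch of an embedding‑space MMD of the kind CMMD and DINO‑MMD use; the RBF kernel and bandwidth below are our own assumptions for illustration, not the exact settings behind the reported numbers.

```python
import torch

def mmd_rbf(real_emb: torch.Tensor, fake_emb: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
    """Squared MMD between two embedding sets with an RBF kernel (lower is better).

    real_emb, fake_emb: (N, D) and (M, D) image embeddings (e.g. CLIP or DINOv2).
    sigma is an assumed kernel bandwidth.
    """
    def rbf(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))

    k_rr = rbf(real_emb, real_emb).mean()
    k_ff = rbf(fake_emb, fake_emb).mean()
    k_rf = rbf(real_emb, fake_emb).mean()
    return k_rr + k_ff - 2 * k_rf
```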


Representation Alignment

Diffusion and flow models are typically trained with a single objective: predict a noise‑like target (or vector field) from a corrupted input. Early in training that one objective does two jobs at once: it must build a useful internal representation and learn to denoise on top of it. Representation alignment makes this explicit by keeping the denoising objective and adding an auxiliary loss that directly supervises intermediate features using a strong, frozen vision encoder. This tends to speed up early learning and bring the model’s features closer to those of modern self‑supervised encoders.

REPA (Yu et al., 2024)

Representation alignment with a pre‑trained visual encoder.

Loss formulation

$$
\mathcal{L}_{\text{REPA}}(\theta,\phi) = -\,\mathbb{E}_{x_0,x_1,t}\Big[\tfrac{1}{N}\sum_{n=1}^{N} \operatorname{sim}\big(y_{0,[n]},\, h_\phi(h_{t,[n]})\big)\Big]
$$

The total loss is

$$
\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda\,\mathcal{L}_{\text{REPA}}
$$
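In code, the alignment term boils down to a cosine similarity between projected intermediate features and frozen teacher features, averaged over patch tokens. The sketch below assumes the student and teacher token grids already match (in practice the teacher features may need interpolation) and uses placeholder shapes:

```python
import torch
import torch.nn.functional as F

def repa_loss(student_feats, teacher_feats, proj_head):
    """REPA-style alignment: negative cosine similarity averaged over patch tokens.

    student_feats: (B, N, D_s) intermediate features of the diffusion transformer at x_t.
    teacher_feats: (B, N, D_t) frozen encoder features (e.g. DINOv2/DINOv3) of the clean image.
    proj_head: trainable projection h_phi mapping D_s -> D_t (an MLP in the paper).
    """
    proj = proj_head(student_feats)                          # (B, N, D_t)
    sim = F.cosine_similarity(proj, teacher_feats, dim=-1)   # (B, N)
    return -sim.mean()                                       # maximize similarity

# Total objective, as in the equations above (lambda_repa weights the auxiliary term):
# loss = flow_matching_loss(...) + lambda_repa * repa_loss(...)
```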

What we observed

We ran REPA on top of our baseline PRX training, using two frozen teachers: DINOv2 and DINOv3. Adding alignment improves quality metrics, and the stronger teacher helps more, at the cost of a modest speed drop.

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| REPA‑DINOv3 | 14.64 | 0.35 | 0.30 | 3.46 |
| REPA‑DINOv2 | 16.60 | 0.39 | 0.31 | 3.66 |

Qualitatively, after ~100 k steps samples trained with alignment show cleaner global structure and more coherent layouts.

iREPA (Singh et al., 2025)

A follow‑up that focuses on preserving spatial structure rather than global semantics. It replaces the MLP projection head with a lightweight 3 × 3 convolution and applies spatial normalization to teacher patch tokens.
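A rough sketch of those two changes; the exact normalization and head layout are our own assumptions based on the description above:

```python
import torch
import torch.nn as nn

class ConvProjectionHead(nn.Module):
    """iREPA-style projection: a lightweight 3x3 conv instead of an MLP head (sketch)."""

    def __init__(self, dim_student: int, dim_teacher: int):
        super().__init__()
        self.conv = nn.Conv2d(dim_student, dim_teacher, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, D) patch tokens with N = h * w; reshape to a 2D grid for the conv.
        b, n, d = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        out = self.conv(grid)
        return out.flatten(2).transpose(1, 2)               # back to (B, N, D_teacher)

def spatially_normalize(teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Normalize teacher patch tokens along the spatial (token) axis, as iREPA suggests."""
    mean = teacher_tokens.mean(dim=1, keepdim=True)
    std = teacher_tokens.std(dim=1, keepdim=True)
    return (teacher_tokens - mean) / (std + 1e-6)
```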

What we observed

Applying iREPA on top of DINOv2 gave smoother convergence and slightly better metrics, but the same tweaks degraded performance when using a DINOv3 teacher. Because of this inconsistency we will not include iREPA in our default recipe.

REPA‑E (Leng et al., 2025)

Instead of aligning only intermediate features, REPA‑E aligns the latent space itself. It stops the diffusion loss gradient from updating the VAE while still applying a REPA alignment loss to both the VAE and the diffusion model.
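Schematically, the gradient routing looks like the sketch below: the latents are detached for the flow‑matching term but kept attached for the alignment term. The feature hook (`intermediate_features`) and the weight `lam` are hypothetical names, and the two forward passes are separated only for readability; this illustrates the gradient flow, not the exact REPA‑E implementation.

```python
import torch

def repa_e_step(vae, diffusion_model, teacher, proj_head, images, text_emb, lam=0.5):
    """REPA-E-style update (sketch): the flow-matching loss does not backprop into the VAE,
    while the alignment loss updates both the VAE and the diffusion model."""
    z = vae.encode(images)                      # latents, still attached to the VAE graph

    # Flow-matching loss on detached latents: no gradient reaches the VAE from this term.
    fm_loss = flow_matching_loss(diffusion_model, z.detach(), text_emb)

    # Alignment loss keeps the graph through the VAE, so both the VAE and the diffusion
    # model receive gradients from it.
    with torch.no_grad():
        teacher_feats = teacher(images)
    # Hypothetical hook; in practice these features come from the same forward pass
    # on the noised latents.
    student_feats = diffusion_model.intermediate_features(z, text_emb)
    align_loss = repa_loss(student_feats, teacher_feats, proj_head)

    return fm_loss + lam * align_loss
```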

What we observed

Comparing three tokenizers:

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Flux2‑AE | 12.07 | 0.09 | 0.08 | 1.79 |
| REPA‑E‑VAE | 12.08 | 0.26 | 0.18 | 3.39 |

Both latent‑space interventions lower FID by ~6 points. Flux2‑AE achieves the best metrics but at a large throughput penalty; REPA‑E‑VAE offers a balanced trade‑off.


Training Objectives: Beyond Vanilla Flow Matching

Small changes to the loss often have outsized effects on convergence speed, conditional fidelity, and how quickly a model “locks in” global structure.

Contrastive Flow Matching (Stoica et al., 2025)

Adds a contrastive term that pushes conditional flows away from other flows in the batch.

$$
\mathcal{L}_{\Delta\text{FM}}(\theta) = \mathbb{E}\Big[\, \big\lVert v_\theta(x_t,t,y)-(\dot{\alpha}_t x+\dot{\sigma}_t\varepsilon)\big\rVert^2 \;-\; \lambda\, \big\lVert v_\theta(x_t,t,y)-(\dot{\alpha}_t \tilde{x}+\dot{\sigma}_t\tilde{\varepsilon})\big\rVert^2 \,\Big]
$$
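A minimal sketch of how the extra term can be formed by pairing each sample with another element of the batch; the roll‑based negative and the weight `lam` are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_fm_loss(model, x0, text_emb, lam=0.05):
    """Contrastive flow matching (sketch): the standard FM term minus a weighted term
    that pushes the prediction away from the velocity target of another batch sample."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps
    pred_v = model(x_t, t.flatten(), text_emb)

    target_v = eps - x0                                         # positive (own) target
    neg_target_v = eps.roll(1, dims=0) - x0.roll(1, dims=0)     # target of a different sample

    pos = F.mse_loss(pred_v, target_v)
    neg = F.mse_loss(pred_v, neg_target_v)
    return pos - lam * neg
```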

What we observed

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Contrastive‑FM | 20.03 | 0.40 | 0.36 | 3.75 |

The effect is mixed: FID regresses slightly while the representation‑driven metrics (CMMD, DINO‑MMD) improve a little at negligible throughput cost, so we keep it only as a low‑cost regularizer.

JiT (Li & He, 2025)

Predicts the clean image instead of noise/velocity, then converts the prediction back to a velocity for the flow loss. This “back‑to‑basics” formulation makes learning easier, especially at high resolution.
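Under the interpolation convention used earlier, converting an x‑prediction back to a velocity is a one‑line algebraic step, v = (x_t − x̂₀)/t, as in this sketch (the clamp on t is our own numerical safeguard, not necessarily what JiT does):

```python
import torch
import torch.nn.functional as F

def x_pred_fm_loss(model, x0, text_emb, eps_t=1e-4):
    """x-prediction variant (sketch): the network predicts the clean image, which is
    converted to a velocity so the usual flow-matching loss still applies."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).clamp_min(eps_t).view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps

    x_hat = model(x_t, t.flatten(), text_emb)   # model outputs a clean-image estimate
    pred_v = (x_t - x_hat) / t                  # convert x-prediction to velocity
    target_v = eps - x0
    return F.mse_loss(pred_v, target_v)
```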

What we observed

In the 256 × 256 latent setting:

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| X‑Pred | 16.80 | 0.54 | 0.49 | 3.95 |

The benefit is unclear in latent space, but the formulation enables stable training directly on 1024 × 1024 images with 32 × 32 patches, achieving FID 17.42, DINO‑MMD 0.56, CMMD 0.71 at 1.33 batches/s.


Token Routing and Sparsification to Reduce Compute Costs

Instead of discarding tokens, routing methods let a subset of tokens bypass expensive layers, preserving information while saving compute.

TREAD (Krause et al., 2025)

Randomly selects a fraction of tokens and temporarily bypasses a contiguous chunk of layers, re‑injecting them later.
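A simplified sketch of that routing pattern for a stack of transformer blocks; the keep ratio, the routed block range, and the gather/scatter bookkeeping are illustrative choices, not the exact TREAD implementation:

```python
import torch
import torch.nn as nn

def tread_forward(blocks: nn.ModuleList, tokens: torch.Tensor,
                  skip_start: int, skip_end: int, keep_ratio: float = 0.5) -> torch.Tensor:
    """TREAD-style routing (sketch): a random subset of tokens bypasses a contiguous
    range of blocks and is re-injected afterwards.

    tokens: (B, N, D). keep_ratio is the fraction of tokens that still pass through the
    expensive middle blocks; the rest skip them unchanged.
    """
    b, n, d = tokens.shape

    # Blocks before the routed region run densely.
    for blk in blocks[:skip_start]:
        tokens = blk(tokens)

    # Randomly choose which tokens keep going through the middle blocks.
    n_keep = max(1, int(n * keep_ratio))
    perm = torch.rand(b, n, device=tokens.device).argsort(dim=1)
    keep_idx = perm[:, :n_keep]                                # (B, n_keep)
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
    routed = tokens.gather(1, gather_idx)

    for blk in blocks[skip_start:skip_end]:
        routed = blk(routed)

    # Re-inject the processed tokens at their original positions; bypassed tokens are untouched.
    tokens = tokens.scatter(1, gather_idx, routed)

    # Remaining blocks run densely on the full token set.
    for blk in blocks[skip_end:]:
        tokens = blk(tokens)
    return tokens
```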

SPRINT (Park et al., 2025)

Runs dense early layers, sparsifies the middle (most expensive) layers, then fuses sparse deep features with a dense residual stream.

What we observed (256 × 256)

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| TREAD | 21.61 | 0.55 | 0.41 | 4.11 |
| SPRINT | 22.56 | 0.72 | 0.42 | 4.20 |

Both give modest throughput gains (≈ 4–6 % in this setup) but degrade quality at this resolution.

High‑resolution (1024 × 1024) results

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 17.42 | 0.71 | 0.56 | 1.33 |
| TREAD | 14.10 | 0.46 | 0.37 | 1.64 |
| SPRINT | 16.90 | 0.51 | 0.41 | 1.89 |

At high resolution, routing dramatically improves both speed and quality; TREAD especially shines with a large FID drop.


Data

The choice of training data, including caption style, can influence the training trajectory as much as optimization tricks.

Long vs. Short Captions

Long, descriptive captions (our baseline) contain detailed composition, lighting, and attribute information.

Short, one‑line captions provide minimal description.

What we observed

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline (long) | 18.20 | 0.41 | 0.39 | 3.95 |
| Short‑captions | 36.84 | 0.98 | 1.14 | 3.95 |

Short captions severely hurt convergence. Long captions give richer supervision, making early learning easier. A practical strategy is to fine‑tune later on a mix of long and short captions.

Bootstrapping with Synthetic Images

We compared training on synthetic images (MidJourneyV6) vs. real images (Pexels), both ≈ 1 M samples.

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Synthetic | 18.20 | 0.41 | 0.39 | 3.95 |
| Real | 16.60 | 0.50 | 0.46 | 3.95 |

Synthetic data yields better CMMD/DINO‑MMD (global structure), while real data gives lower FID (texture realism). A useful recipe: start with synthetic data for rapid structure learning, then switch to real data for fine‑grained texture quality.

SFT with Alchemist (small, high‑impact dataset)

Fine‑tuning for 20 k steps on the 3 350‑pair Alchemist dataset adds a distinct “style layer” with better composition and polish, without harming generalization.


More Useful Tips for Training

Muon Optimizer

Muon (Jordan et al., 2024) is a preconditioner‑style optimizer that can accelerate convergence.
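At its core, Muon applies momentum SGD to 2‑D weight matrices and orthogonalizes the update with a few Newton‑Schulz iterations before applying it; non‑matrix parameters (embeddings, norms, biases) typically stay on AdamW. The sketch below follows the publicly described algorithm but omits practical details (bf16 casting, Nesterov momentum, per‑shape learning‑rate scaling), and the hyperparameters shown are assumptions:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix via a Newton-Schulz iteration,
    the core operation behind Muon (coefficients follow the public reference)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(params, momentum_buffers, lr: float = 0.02, beta: float = 0.95):
    """One Muon-style update for 2D weight matrices (sketch): momentum SGD whose
    update direction is orthogonalized before being applied."""
    for p, buf in zip(params, momentum_buffers):
        buf.mul_(beta).add_(p.grad)           # classic momentum accumulation
        update = newton_schulz_orthogonalize(buf)
        p.data.add_(update, alpha=-lr)
```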

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ |
|---|---|---|---|
| Baseline (AdamW) | 18.20 | 0.41 | 0.39 |
| Muon | 15.55 | 0.36 | 0.35 |

Precision Gotcha: Casting vs. Storing Weights in BF16

Storing model parameters in BF16 (instead of FP32) harms numerically sensitive operations (LayerNorm, attention softmax, RoPE, optimizer state).

| Method | FID ↓ | CMMD ↓ | DINO‑MMD ↓ |
|---|---|---|---|
| Baseline (FP32 weights) | 18.20 | 0.41 | 0.39 |
| BF16‑stored weights (bug) | 21.87 | 0.61 | 0.57 |

Rule: use BF16 autocast for compute, but keep weights (and optimizer state) in FP32.
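In PyTorch terms, the safe pattern looks like this sketch (reusing the `flow_matching_loss` placeholder from earlier; `latents` and `text_emb` stand in for a real batch):

```python
import torch

# Recommended pattern (sketch): parameters and optimizer state stay in FP32,
# while the forward/backward compute runs under BF16 autocast.
model = model.float()                                    # FP32 master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = flow_matching_loss(model, latents, text_emb)  # compute in BF16
loss.backward()                                          # grads land on FP32 params
optimizer.step()

# The "bug" variant would instead do: model = model.to(torch.bfloat16),
# which stores weights (and optimizer state) in BF16 and degrades quality.
```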


Summary

We ran a systematic set of ablations on PRX training, comparing a range of optimization, representation, efficiency, and data choices against a clean flow‑matching baseline using both quality metrics and throughput.

  • Alignment (REPA) gives the biggest early gains; use it as a burn‑in and drop it later.

  • Latent/tokenizer improvements (REPA‑E, Flux2‑AE) provide large quality jumps with clear speed trade‑offs.

  • Objective tweaks are mixed: contrastive FM helps slightly; x‑prediction enables stable 1024² pixel training.

  • Token routing (TREAD/SPRINT) is minor at 256² but a major win at high resolution.

  • Data matters: long captions are critical; synthetic data accelerates structure learning; a tiny high‑impact SFT set adds polish.

  • Practical details such as optimizer choice (Muon) and correct precision handling (avoid BF16‑stored weights) also have noticeable impact.

That’s it for Part 2! If you want to play with an earlier public checkpoint from this series, the PRX‑1024 T2I beta is still available here.

We’re really excited about what’s next: in the coming weeks we will release the full source code of the PRX training framework, and we will do a public 24‑hour “speedrun” where we combine the best ideas from this post into a single run and see how far the full recipe can go in one day.

If you made it this far, thank you very much for your interest. We would love to have you join our Discord community where we discuss PRX progress and results, along with everything related to diffusion and text‑to‑image models.
