Ando 8ec0a5ccad feat: implement rae autoencoder. (#13046)
2026-03-05 20:17:14 +05:30
Training AutoencoderRAE

This example trains the decoder of AutoencoderRAE (stage-1 style), while keeping the representation encoder frozen.

It follows the same high-level training recipe as the official RAE stage-1 setup:

  • frozen encoder
  • train decoder
  • pixel reconstruction loss
  • optional encoder feature consistency loss
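As a rough illustration of how the two loss terms in the recipe above could combine, here is a minimal framework-free sketch. The function names are hypothetical, the "mse" option and the use of mean squared error as the feature distance are assumptions, and the training script itself works on tensors rather than flat lists. Only the `l1` loss type, the `--use_encoder_loss` switch, and the `0.1` weight come from the commands in this README.

```python
def mean_abs_error(pred, target):
    """Mean absolute error over two equal-length sequences of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def mean_sq_error(pred, target):
    """Mean squared error over two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def stage1_loss(
    recon_pixels,
    target_pixels,
    recon_features=None,
    target_features=None,
    reconstruction_loss_type="l1",
    use_encoder_loss=False,
    encoder_loss_weight=0.1,  # value taken from the example command, not a known default
):
    """Pixel reconstruction loss plus an optional weighted feature term."""
    if len(recon_pixels) != len(target_pixels):
        raise ValueError("reconstruction and target must have the same size")

    if reconstruction_loss_type == "l1":
        loss = mean_abs_error(recon_pixels, target_pixels)
    elif reconstruction_loss_type == "mse":  # ASSUMPTION: only "l1" appears in this README
        loss = mean_sq_error(recon_pixels, target_pixels)
    else:
        raise ValueError(f"unknown reconstruction_loss_type: {reconstruction_loss_type}")

    if use_encoder_loss:
        # ASSUMPTION: features are the frozen encoder's outputs for the
        # reconstruction and for the input image, compared with MSE.
        if recon_features is None or target_features is None:
            raise ValueError("encoder features are required when use_encoder_loss=True")
        loss += encoder_loss_weight * mean_sq_error(recon_features, target_features)

    return loss
```

Because the encoder is frozen, the feature term in this sketch only influences the decoder: it pushes reconstructions toward images that the encoder maps to the same representation as the input.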

Quickstart

Resume or finetune from pretrained weights

accelerate launch examples/research_projects/autoencoder_rae/train_autoencoder_rae.py \
  --pretrained_model_name_or_path nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08 \
  --train_data_dir /path/to/imagenet_like_folder \
  --output_dir /tmp/autoencoder-rae \
  --resolution 256 \
  --train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 10 \
  --report_to wandb \
  --reconstruction_loss_type l1 \
  --use_encoder_loss \
  --encoder_loss_weight 0.1

Train from scratch with a pretrained encoder

The following command trains a new decoder from scratch on top of the frozen "facebook/dinov2-with-registers-base" encoder.

accelerate launch examples/research_projects/autoencoder_rae/train_autoencoder_rae.py \
  --train_data_dir /path/to/imagenet_like_folder \
  --output_dir /tmp/autoencoder-rae \
  --resolution 256 \
  --encoder_type dinov2 \
  --encoder_name_or_path facebook/dinov2-with-registers-base \
  --encoder_input_size 224 \
  --patch_size 16 \
  --image_size 256 \
  --decoder_hidden_size 1152 \
  --decoder_num_hidden_layers 28 \
  --decoder_num_attention_heads 16 \
  --decoder_intermediate_size 4096 \
  --train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 10 \
  --report_to wandb \
  --reconstruction_loss_type l1 \
  --use_encoder_loss \
  --encoder_loss_weight 0.1

Note: the stage-1 reconstruction loss compares the decoder output with the input image pixel by pixel, so both must have the same spatial size. --resolution must therefore equal --image_size.
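The size constraint can be checked before launching a run. The sketch below uses a hypothetical helper (it is not part of the training script) that validates the flag values and also reports the token grids on both sides. The encoder patch size of 14 is an assumption based on DINOv2 ViT models. That the decoder emits one patch per encoder token is likewise an assumption, suggested by the values in the command above (224 / 14 = 16 and 256 / 16 = 16).

```python
def check_stage1_geometry(
    resolution,
    image_size,
    patch_size,
    encoder_input_size,
    encoder_patch_size=14,  # ASSUMPTION: DINOv2 patch size; adjust for other encoders
):
    """Validate the size flags and return (encoder_grid, decoder_grid)."""
    # Stated in this README: target and output must have the same spatial size.
    if resolution != image_size:
        raise ValueError(
            f"--resolution ({resolution}) must equal --image_size ({image_size})"
        )
    if image_size % patch_size != 0:
        raise ValueError("--image_size must be divisible by --patch_size")
    if encoder_input_size % encoder_patch_size != 0:
        raise ValueError("--encoder_input_size must be divisible by the encoder patch size")

    # Tokens per side produced by the encoder and consumed by the decoder.
    encoder_grid = encoder_input_size // encoder_patch_size
    decoder_grid = image_size // patch_size
    return encoder_grid, decoder_grid
```

For the command above, `check_stage1_geometry(256, 256, 16, 224)` returns a 16 by 16 grid on both sides. If the two numbers differ for your own settings, check the flag values against the model configuration before training.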

The dataset is expected to follow an ImageFolder-compatible layout:

train_data_dir/
  class_a/
    img_0001.jpg
  class_b/
    img_0002.jpg
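To sanity-check a folder against the layout above before training, a small standard-library helper like the following can count images per class directory. The helper name and the extension list are assumptions made for illustration. The training script may accept a different set of file types.

```python
from pathlib import Path

# ASSUMPTION: a typical set of image extensions, not the script's actual list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def summarize_image_folder(root):
    """Return {class_name: image_count} for each subdirectory of root."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")

    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count image files anywhere below the class directory.
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `summarize_image_folder("/path/to/imagenet_like_folder")` on the example tree would return one entry per class directory. An empty result or classes with zero images usually mean the path points one level too high or too low.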