# AutoencoderRAE
The Representation Autoencoder (RAE) model was introduced in *Diffusion Transformers with Representation Autoencoders* by Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie from NYU VISIONx.
RAE combines a frozen pretrained vision encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT-MAE-style decoder. In the two-stage RAE training recipe, the autoencoder is trained in stage 1 (reconstruction), and then a diffusion model is trained on the resulting latent space in stage 2 (generation).
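The stage-2 idea can be sketched in a few lines. This is an illustrative toy, not the paper's training code: the denoiser is a placeholder `Conv2d` standing in for a diffusion transformer, the corruption schedule is a simple linear interpolation, and the latents are random stand-ins with the RAE-base shape (in practice they would come from `model.encode(images).latent` with the autoencoder frozen):

```python
import torch
import torch.nn as nn

# Stand-in latents with the RAE-base shape (batch, 768, 16, 16).
# In real stage-2 training these come from the frozen stage-1 autoencoder.
latents = torch.randn(4, 768, 16, 16)

denoiser = nn.Conv2d(768, 768, 3, padding=1)  # placeholder for a real DiT

noise = torch.randn_like(latents)
t = torch.rand(latents.shape[0], 1, 1, 1)  # per-sample timestep in [0, 1]
noisy = (1 - t) * latents + t * noise      # simple interpolation-style corruption

# Train the denoiser to predict the noise; only the denoiser gets gradients.
loss = nn.functional.mse_loss(denoiser(noisy), noise)
loss.backward()
```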
The following RAE models are released and supported in Diffusers:
| Model | Encoder | Latent shape |
|---|---|---|
| `nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08` | DINOv2-base (224px) | 768 x 16 x 16 |
| `nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512` | DINOv2-base (512px) | 768 x 32 x 32 |
| `nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08` | DINOv2-small (224px) | 384 x 16 x 16 |
| `nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08` | DINOv2-large (224px) | 1024 x 16 x 16 |
| `nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08` | SigLIP2-base (224px) | 768 x 16 x 16 |
| `nyu-visionx/RAE-mae-base-p16-ViTXL-n08` | MAE-base (224px) | 768 x 16 x 16 |
## Loading a pretrained model
```python
from diffusers import AutoencoderRAE

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
```
## Encoding and decoding a real image
```python
import torch
from diffusers import AutoencoderRAE
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = image.convert("RGB").resize((224, 224))
x = to_tensor(image).unsqueeze(0).to("cuda")  # (1, 3, 224, 224), values in [0, 1]

with torch.no_grad():
    latents = model.encode(x).latent  # (1, 768, 16, 16)
    recon = model.decode(latents).sample  # (1, 3, 256, 256)

recon_image = to_pil_image(recon[0].clamp(0, 1).cpu())
recon_image.save("recon.png")
```
## Latent normalization
Some pretrained checkpoints include per-channel `latents_mean` and `latents_std` statistics for normalizing the latent space. When present, `encode` and `decode` automatically apply the normalization and denormalization, respectively.
```python
model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# Latent normalization is handled automatically inside encode/decode
# when the checkpoint config includes latents_mean/latents_std.
with torch.no_grad():
    latents = model.encode(x).latent  # normalized latents; x as in the example above
    recon = model.decode(latents).sample
```
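The normalization itself follows the usual per-channel convention, sketched below with plain tensors. The stats here are made-up stand-ins (zeros and ones) with an assumed broadcastable shape; only the round-trip arithmetic is the point:

```python
import torch

# Assumed per-channel stats, broadcast over (B, C, H, W) latents.
latents = torch.randn(1, 768, 16, 16)
latents_mean = torch.zeros(768).view(1, -1, 1, 1)
latents_std = torch.ones(768).view(1, -1, 1, 1)

normalized = (latents - latents_mean) / latents_std      # applied by encode
denormalized = normalized * latents_std + latents_mean   # applied by decode

# Normalization and denormalization are exact inverses.
assert torch.allclose(denormalized, latents, atol=1e-6)
```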
## AutoencoderRAE

[[autodoc]] AutoencoderRAE
  - encode
  - decode
  - all
## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput