Compare commits

...

8 Commits

Author SHA1 Message Date
yiyi@huggingface.co
921a7e77c8 revert change in guider 2026-03-06 18:39:37 +00:00
yiyi@huggingface.co
eb79d7b93a upup 2026-03-06 09:41:08 +00:00
yiyi@huggingface.co
de1ae4ef08 Merge branch 'main' into helios-modular 2026-03-05 22:23:27 +00:00
yiyi@huggingface.co
40c0bd1fa0 add helios modular 2026-03-05 22:21:53 +00:00
Ando
8ec0a5ccad feat: implement rae autoencoder. (#13046)
* feat: implement three RAE encoders(dinov2, siglip2, mae)

* feat: finish first version of autoencoder_rae

* fix formatting

* make fix-copies

* initial doc

* fix latent_mean / latent_var init types to accept config-friendly inputs

* use mean and std convention

* cleanup

* add rae to diffusers script

* use imports

* use attention

* remove unneeded class

* example training script

* input and ground truth sizes have to be the same

* fix argument

* move loss to training script

* cleanup

* simplify mixins

* fix training script

* fix entrypoint for instantiating the AutoencoderRAE

* added encoder_image_size config

* undo last change

* fixes from pretrained weights

* cleanups

* address reviews

* fix train script to use pretrained

* fix conversion script review

* latent normalization buffers are now always registered with no-op defaults

* Update examples/research_projects/autoencoder_rae/README.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update src/diffusers/models/autoencoders/autoencoder_rae.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* use image url

* Encoder is frozen

* fix slow test

* remove config

* use ModelTesterMixin and AutoencoderTesterMixin

* make quality

* strip final layernorm when converting

* _strip_final_layernorm_affine for training script

* fix test

* add dispatch forward and update conversion script

* update training script

* error out as soon as possible and add comments

* Update src/diffusers/models/autoencoders/autoencoder_rae.py

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

* use buffer

* inline

* Update src/diffusers/models/autoencoders/autoencoder_rae.py

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

* remove optional

* _noising takes a generator

* Update src/diffusers/models/autoencoders/autoencoder_rae.py

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

* fix api

* rename

* remove unittest

* use randn_tensor

* fix device map on multigpu

* check if the key is missing in the original state dict and only then add to the allow_missing set

* remove initialize_weights

---------

Co-authored-by: wangyuqi <wangyuqi@MBP-FJDQNJTWYN-0208.local>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
2026-03-05 20:17:14 +05:30
Sayak Paul
29b91098f6 [attention backends] change to updated repo and version. (#13161)
* change to updated repo and version.

* fix version and force updated kernels.

* propagate version.
2026-03-05 19:23:07 +05:30
Shenghai Yuan
ae5881ba77 Fix Helios paper link in documentation (#13213)
* Fix Helios paper link in documentation

Updated the link to the Helios paper for accuracy.

* Fix reference link in HeliosTransformer3DModel documentation

Updated the reference link for the Helios Transformer model paper.

* Update Helios research paper link in documentation

* Update Helios research paper link in documentation
2026-03-05 18:58:13 +05:30
dg845
ab6040ab2d Add LTX2 Condition Pipeline (#13058)
* LTX2 condition pipeline initial commit

* Fix pipeline import error

* Implement LTX-2-style general image conditioning

* Blend denoising output and clean latents in sample space instead of velocity space

* make style and make quality

* make fix-copies

* Rename LTX2VideoCondition image to frames

* Update LTX2ConditionPipeline example

* Remove support for image and video in __call__

* Put latent_idx_from_index logic inline

* Improve comment on using the conditioning mask in denoising loop

* Apply suggestions from code review

Co-authored-by: Álvaro Somoza <asomoza@users.noreply.github.com>

* make fix-copies

* Migrate to Python 3.9+ style type annotations without explicit typing imports

* Forward kwargs from preprocess/postprocess_video to preprocess/postprocess resp.

* Center crop LTX-2 conditions following original code

* Duplicate video and audio position ids if using CFG

* make style and make quality

* Remove unused index_type arg to preprocess_conditions

* Add # Copied from for _normalize_latents

* Fix _normalize_latents # Copied from statement

* Add LTX-2 condition pipeline docs

* Remove TODOs

* Support only unpacked latents (5D for video, 4D for audio)

* Remove # Copied from for prepare_audio_latents

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Álvaro Somoza <asomoza@users.noreply.github.com>
2026-03-05 00:42:55 -08:00
37 changed files with 8102 additions and 19 deletions

View File

@@ -460,6 +460,8 @@
title: AutoencoderKLQwenImage
- local: api/models/autoencoder_kl_wan
title: AutoencoderKLWan
- local: api/models/autoencoder_rae
title: AutoencoderRAE
- local: api/models/consistency_decoder_vae
title: ConsistencyDecoderVAE
- local: api/models/autoencoder_oobleck

View File

@@ -0,0 +1,89 @@
<!-- Copyright 2026 The NYU Vision-X and HuggingFace Teams. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AutoencoderRAE
The Representation Autoencoder (RAE) model introduced in [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690) by Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie from NYU VISIONx.
RAE combines a frozen pretrained vision encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT-MAE-style decoder. In the two-stage RAE training recipe, the autoencoder is trained in stage 1 (reconstruction), and then a diffusion model is trained on the resulting latent space in stage 2 (generation).
The following RAE models are released and supported in Diffusers:
| Model | Encoder | Latent shape (224px input) |
|:------|:--------|:---------------------------|
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08) | DINOv2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512) | DINOv2-base (512px) | 768 x 32 x 32 |
| [`nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08) | DINOv2-small | 384 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08) | DINOv2-large | 1024 x 16 x 16 |
| [`nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08) | SigLIP2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-mae-base-p16-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-mae-base-p16-ViTXL-n08) | MAE-base | 768 x 16 x 16 |
## Loading a pretrained model
```python
from diffusers import AutoencoderRAE
model = AutoencoderRAE.from_pretrained(
"nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
```
## Encoding and decoding a real image
```python
import torch
from diffusers import AutoencoderRAE
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image
model = AutoencoderRAE.from_pretrained(
"nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = image.convert("RGB").resize((224, 224))
x = to_tensor(image).unsqueeze(0).to("cuda") # (1, 3, 224, 224), values in [0, 1]
with torch.no_grad():
latents = model.encode(x).latent # (1, 768, 16, 16)
recon = model.decode(latents).sample # (1, 3, 256, 256)
recon_image = to_pil_image(recon[0].clamp(0, 1).cpu())
recon_image.save("recon.png")
```
## Latent normalization
Some pretrained checkpoints include per-channel `latents_mean` and `latents_std` statistics for normalizing the latent space. When present, `encode` and `decode` automatically apply the normalization and denormalization, respectively.
```python
model = AutoencoderRAE.from_pretrained(
"nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
# Latent normalization is handled automatically inside encode/decode
# when the checkpoint config includes latents_mean/latents_std.
with torch.no_grad():
latents = model.encode(x).latent # normalized latents
recon = model.decode(latents).sample
```
## AutoencoderRAE
[[autodoc]] AutoencoderRAE
- encode
- decode
- all
## DecoderOutput
[[autodoc]] models.autoencoders.vae.DecoderOutput

View File

@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->
# HeliosTransformer3DModel
A 14B Real-Time Autogressive Diffusion Transformer model (support T2V, I2V and V2V) for 3D video-like data from [Helios](https://github.com/PKU-YuanGroup/Helios) was introduced in [Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/) by Peking University & ByteDance & etc.
A 14B Real-Time Autogressive Diffusion Transformer model (support T2V, I2V and V2V) for 3D video-like data from [Helios](https://github.com/PKU-YuanGroup/Helios) was introduced in [Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/2603.04379) by Peking University & ByteDance & etc.
The model can be loaded with the following code snippet.

View File

@@ -22,7 +22,7 @@
# Helios
[Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/) from Peking University & ByteDance & etc, by Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, Li Yuan.
[Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/2603.04379) from Peking University & ByteDance & etc, by Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, Li Yuan.
* <u>We introduce Helios, the first 14B video generation model that runs at 17 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching a strong baseline in quality.</u> We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drift heuristics such as self-forcing, error banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, causal masking, or sparse attention; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize its typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to—or lower than—those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. All the code and models are available at [this https URL](https://pku-yuangroup.github.io/Helios-Page).

View File

@@ -193,6 +193,179 @@ encode_video(
)
```
## Condition Pipeline Generation
You can use `LTX2ConditionPipeline` to specify image and/or video conditions at arbitrary latent indices. For example, we can specify both a first-frame and last-frame condition to perform first-last-frame-to-video (FLF2V) generation:
```py
import torch
from diffusers import LTX2ConditionPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image
device = "cuda"
width = 768
height = 512
random_seed = 42
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "rootonchair/LTX-2-19b-distilled"
pipe = LTX2ConditionPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload(device=device)
pipe.vae.enable_tiling()
prompt = (
"CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are "
"delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright "
"sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, "
"low-angle perspective."
)
first_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png",
)
last_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png",
)
first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]
frame_rate = 24.0
video_latent, audio_latent = pipe(
conditions=conditions,
prompt=prompt,
width=width,
height=height,
num_frames=121,
frame_rate=frame_rate,
num_inference_steps=8,
sigmas=DISTILLED_SIGMA_VALUES,
guidance_scale=1.0,
generator=generator,
output_type="latent",
return_dict=False,
)
latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
model_path,
subfolder="latent_upsampler",
torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
latents=video_latent,
output_type="latent",
return_dict=False,
)[0]
video, audio = pipe(
latents=upscaled_video_latent,
audio_latents=audio_latent,
prompt=prompt,
width=width * 2,
height=height * 2,
num_inference_steps=3,
sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
generator=generator,
guidance_scale=1.0,
output_type="np",
return_dict=False,
)
encode_video(
video[0],
fps=frame_rate,
audio=audio[0].float().cpu(),
audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
output_path="ltx2_distilled_flf2v.mp4",
)
```
You can use both image and video conditions:
```py
import torch
from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image, load_video
device = "cuda"
width = 768
height = 512
random_seed = 42
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "rootonchair/LTX-2-19b-distilled"
pipe = LTX2ConditionPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload(device=device)
pipe.vae.enable_tiling()
prompt = (
"The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is "
"divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features "
"dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered "
"clouds, suggesting a bright, sunny day. And then the camera switch to a winding mountain road covered in snow, "
"with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The "
"landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the "
"solitude and beauty of a winter drive through a mountainous region."
)
negative_prompt = (
"blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
"grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
"deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
"wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
"field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
"lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
"valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
"mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
"off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
"pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
"inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
cond_video = load_video(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
)
cond_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input.jpg"
)
video_cond = LTX2VideoCondition(frames=cond_video, index=0, strength=1.0)
image_cond = LTX2VideoCondition(frames=cond_image, index=8, strength=1.0)
conditions = [video_cond, image_cond]
frame_rate = 24.0
video, audio = pipe(
conditions=conditions,
prompt=prompt,
negative_prompt=negative_prompt,
width=width,
height=height,
num_frames=121,
frame_rate=frame_rate,
num_inference_steps=40,
guidance_scale=4.0,
generator=generator,
output_type="np",
return_dict=False,
)
encode_video(
video[0],
fps=frame_rate,
audio=audio[0].float().cpu(),
audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
output_path="ltx2_cond_video.mp4",
)
```
Because the conditioning is done via latent frames, the 8 data space frames corresponding to the specified latent frame for an image condition will tend to be static.
## LTX2Pipeline
[[autodoc]] LTX2Pipeline
@@ -205,6 +378,12 @@ encode_video(
- all
- __call__
## LTX2ConditionPipeline
[[autodoc]] LTX2ConditionPipeline
- all
- __call__
## LTX2LatentUpsamplePipeline
[[autodoc]] LTX2LatentUpsamplePipeline

View File

@@ -130,4 +130,4 @@ pipe.to("cuda")
Learn more about Helios with the following resources.
- Watch [video1](https://www.youtube.com/watch?v=vd_AgHtOUFQ) and [video2](https://www.youtube.com/watch?v=1GeIU2Dn7UY) for a demonstration of Helios's key features.
- The research paper, [Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/) for more details.
- The research paper, [Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/2603.04379) for more details.

View File

@@ -131,4 +131,4 @@ pipe.to("cuda")
通过以下资源了解有关 Helios 的更多信息:
- [视频1](https://www.youtube.com/watch?v=vd_AgHtOUFQ)和[视频2](https://www.youtube.com/watch?v=1GeIU2Dn7UY)演示了 Helios 的主要功能;
- 有关更多详细信息,请参阅研究论文 [Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/)。
- 有关更多详细信息,请参阅研究论文 [Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/2603.04379)。

View File

@@ -0,0 +1,66 @@
# Training AutoencoderRAE
This example trains the decoder of `AutoencoderRAE` (stage-1 style), while keeping the representation encoder frozen.
It follows the same high-level training recipe as the official RAE stage-1 setup:
- frozen encoder
- train decoder
- pixel reconstruction loss
- optional encoder feature consistency loss
## Quickstart
### Resume or finetune from pretrained weights
```bash
accelerate launch examples/research_projects/autoencoder_rae/train_autoencoder_rae.py \
--pretrained_model_name_or_path nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08 \
--train_data_dir /path/to/imagenet_like_folder \
--output_dir /tmp/autoencoder-rae \
--resolution 256 \
--train_batch_size 8 \
--learning_rate 1e-4 \
--num_train_epochs 10 \
--report_to wandb \
--reconstruction_loss_type l1 \
--use_encoder_loss \
--encoder_loss_weight 0.1
```
### Train from scratch with a pretrained encoder
The following command launches RAE training with "facebook/dinov2-with-registers-base" as the base.
```bash
accelerate launch examples/research_projects/autoencoder_rae/train_autoencoder_rae.py \
--train_data_dir /path/to/imagenet_like_folder \
--output_dir /tmp/autoencoder-rae \
--resolution 256 \
--encoder_type dinov2 \
--encoder_name_or_path facebook/dinov2-with-registers-base \
--encoder_input_size 224 \
--patch_size 16 \
--image_size 256 \
--decoder_hidden_size 1152 \
--decoder_num_hidden_layers 28 \
--decoder_num_attention_heads 16 \
--decoder_intermediate_size 4096 \
--train_batch_size 8 \
--learning_rate 1e-4 \
--num_train_epochs 10 \
--report_to wandb \
--reconstruction_loss_type l1 \
--use_encoder_loss \
--encoder_loss_weight 0.1
```
Note: stage-1 reconstruction loss assumes matching target/output spatial size, so `--resolution` must equal `--image_size`.
Dataset format is expected to be `ImageFolder`-compatible:
```text
train_data_dir/
class_a/
img_0001.jpg
class_b/
img_0002.jpg
```

View File

@@ -0,0 +1,405 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import math
import os
from pathlib import Path
import torch
import torch.nn.functional as F
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import ProjectConfiguration, set_seed
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder
from tqdm.auto import tqdm
from diffusers import AutoencoderRAE
from diffusers.optimization import get_scheduler
logger = get_logger(__name__)
def parse_args():
    """Build and parse the command-line arguments for stage-1 RAE decoder training."""
    p = argparse.ArgumentParser(description="Train a stage-1 Representation Autoencoder (RAE) decoder.")
    # Data and output locations.
    p.add_argument("--train_data_dir", type=str, required=True, help="Path to an ImageFolder-style dataset root.")
    p.add_argument("--output_dir", type=str, default="autoencoder-rae", help="Directory to save checkpoints/model.")
    p.add_argument("--logging_dir", type=str, default="logs", help="Accelerate logging directory.")
    # Reproducibility and image preprocessing.
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--resolution", type=int, default=256)
    p.add_argument("--center_crop", action="store_true")
    p.add_argument("--random_flip", action="store_true")
    # Dataloader and training schedule.
    p.add_argument("--train_batch_size", type=int, default=8)
    p.add_argument("--dataloader_num_workers", type=int, default=4)
    p.add_argument("--num_train_epochs", type=int, default=10)
    p.add_argument("--max_train_steps", type=int, default=None)
    p.add_argument("--gradient_accumulation_steps", type=int, default=1)
    p.add_argument("--max_grad_norm", type=float, default=1.0)
    # Optimizer and LR schedule.
    p.add_argument("--learning_rate", type=float, default=1e-4)
    p.add_argument("--adam_beta1", type=float, default=0.9)
    p.add_argument("--adam_beta2", type=float, default=0.999)
    p.add_argument("--adam_weight_decay", type=float, default=1e-2)
    p.add_argument("--adam_epsilon", type=float, default=1e-8)
    p.add_argument("--lr_scheduler", type=str, default="cosine")
    p.add_argument("--lr_warmup_steps", type=int, default=500)
    p.add_argument("--checkpointing_steps", type=int, default=1000)
    p.add_argument("--validation_steps", type=int, default=500)
    # Sources for pretrained weights.
    p.add_argument(
        "--pretrained_model_name_or_path",
        type=str,
        default=None,
        help="Path to a pretrained AutoencoderRAE model (or HF Hub id) to resume training from.",
    )
    p.add_argument(
        "--encoder_name_or_path",
        type=str,
        default=None,
        help=(
            "HF Hub id or local path of the pretrained encoder (e.g. 'facebook/dinov2-with-registers-base'). "
            "When --pretrained_model_name_or_path is not set, the encoder weights are loaded from this path "
            "into a freshly constructed AutoencoderRAE. Ignored when --pretrained_model_name_or_path is set."
        ),
    )
    # Architecture hyper-parameters (used when constructing a fresh AutoencoderRAE).
    p.add_argument("--encoder_type", type=str, choices=["dinov2", "siglip2", "mae"], default="dinov2")
    p.add_argument("--encoder_hidden_size", type=int, default=768)
    p.add_argument("--encoder_patch_size", type=int, default=14)
    p.add_argument("--encoder_num_hidden_layers", type=int, default=12)
    p.add_argument("--encoder_input_size", type=int, default=224)
    p.add_argument("--patch_size", type=int, default=16)
    p.add_argument("--image_size", type=int, default=256)
    p.add_argument("--num_channels", type=int, default=3)
    p.add_argument("--decoder_hidden_size", type=int, default=1152)
    p.add_argument("--decoder_num_hidden_layers", type=int, default=28)
    p.add_argument("--decoder_num_attention_heads", type=int, default=16)
    p.add_argument("--decoder_intermediate_size", type=int, default=4096)
    p.add_argument("--noise_tau", type=float, default=0.0)
    p.add_argument("--scaling_factor", type=float, default=1.0)
    p.add_argument("--reshape_to_2d", action=argparse.BooleanOptionalAction, default=True)
    # Loss configuration.
    p.add_argument(
        "--reconstruction_loss_type",
        type=str,
        choices=["l1", "mse"],
        default="l1",
        help="Pixel reconstruction loss.",
    )
    p.add_argument(
        "--encoder_loss_weight",
        type=float,
        default=0.0,
        help="Weight for encoder feature consistency loss in the training loop.",
    )
    p.add_argument(
        "--use_encoder_loss",
        action="store_true",
        help="Enable encoder feature consistency loss term in the training loop.",
    )
    # Experiment-tracking backend.
    p.add_argument("--report_to", type=str, default="tensorboard")
    return p.parse_args()
def build_transforms(args):
    """Assemble the torchvision preprocessing pipeline from the parsed CLI arguments.

    Resizes to ``args.resolution`` with bicubic interpolation, crops (center crop
    when ``args.center_crop`` is set, otherwise random crop), optionally applies a
    random horizontal flip, and finally converts PIL images to tensors.
    """
    if args.center_crop:
        crop = transforms.CenterCrop(args.resolution)
    else:
        crop = transforms.RandomCrop(args.resolution)
    pipeline = [
        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BICUBIC),
        crop,
    ]
    if args.random_flip:
        pipeline.append(transforms.RandomHorizontalFlip())
    pipeline.append(transforms.ToTensor())
    return transforms.Compose(pipeline)
def compute_losses(
    model, pixel_values, reconstruction_loss_type: str, use_encoder_loss: bool, encoder_loss_weight: float
):
    """Run a full autoencoder forward pass and compute the stage-1 training losses.

    Args:
        model: The (possibly DDP-wrapped) ``AutoencoderRAE``.
        pixel_values: Target images, shape ``(B, C, H, W)``.
        reconstruction_loss_type: ``"l1"`` for L1 loss, anything else falls back to MSE.
        use_encoder_loss: Whether to add the encoder feature consistency term.
        encoder_loss_weight: Weight applied to the encoder consistency term; the
            term is skipped entirely unless this is > 0.

    Returns:
        Tuple of ``(decoded, loss, reconstruction_loss, encoder_loss)``.

    Raises:
        ValueError: If the reconstruction and target spatial sizes differ.
    """
    decoded = model(pixel_values).sample
    # Stage-1 pixel losses are only meaningful when output and target sizes match.
    if decoded.shape[-2:] != pixel_values.shape[-2:]:
        raise ValueError(
            "Training requires matching reconstruction and target sizes, got "
            f"decoded={tuple(decoded.shape[-2:])}, target={tuple(pixel_values.shape[-2:])}."
        )
    if reconstruction_loss_type == "l1":
        reconstruction_loss = F.l1_loss(decoded.float(), pixel_values.float())
    else:
        reconstruction_loss = F.mse_loss(decoded.float(), pixel_values.float())
    encoder_loss = torch.zeros_like(reconstruction_loss)
    if use_encoder_loss and encoder_loss_weight > 0:
        # Unwrap DDP (if wrapped) to reach the model's private encoder helpers.
        base_model = model.module if hasattr(model, "module") else model
        target_encoder_input = base_model._resize_and_normalize(pixel_values)
        reconstructed_encoder_input = base_model._resize_and_normalize(decoded)
        encoder_forward_kwargs = {"model": base_model.encoder}
        if base_model.config.encoder_type == "mae":
            encoder_forward_kwargs["patch_size"] = base_model.config.encoder_patch_size
        # Target features are a fixed regression target: no gradient needed.
        with torch.no_grad():
            target_tokens = base_model._encoder_forward_fn(images=target_encoder_input, **encoder_forward_kwargs)
        # The reconstructed features must stay inside the autograd graph so the
        # consistency loss can backpropagate through ``decoded`` into the decoder.
        reconstructed_tokens = base_model._encoder_forward_fn(
            images=reconstructed_encoder_input, **encoder_forward_kwargs
        )
        encoder_loss = F.mse_loss(reconstructed_tokens.float(), target_tokens.float())
    loss = reconstruction_loss + float(encoder_loss_weight) * encoder_loss
    return decoded, loss, reconstruction_loss, encoder_loss
def _strip_final_layernorm_affine(state_dict, prefix=""):
"""Remove final layernorm weight/bias so the model keeps its default init (identity)."""
keys_to_strip = {f"{prefix}weight", f"{prefix}bias"}
return {k: v for k, v in state_dict.items() if k not in keys_to_strip}
def _load_pretrained_encoder_weights(model, encoder_type, encoder_name_or_path):
    """Load pretrained HF transformers encoder weights into the model's encoder."""
    if encoder_type == "dinov2":
        from transformers import Dinov2WithRegistersModel

        source = Dinov2WithRegistersModel.from_pretrained(encoder_name_or_path)
        state_dict = _strip_final_layernorm_affine(source.state_dict(), prefix="layernorm.")
    elif encoder_type == "siglip2":
        from transformers import SiglipModel

        vision_tower = SiglipModel.from_pretrained(encoder_name_or_path).vision_model
        # Re-key under the "vision_model." namespace expected by the RAE encoder.
        remapped = {f"vision_model.{name}": tensor for name, tensor in vision_tower.state_dict().items()}
        state_dict = _strip_final_layernorm_affine(remapped, prefix="vision_model.post_layernorm.")
    elif encoder_type == "mae":
        from transformers import ViTMAEForPreTraining

        source = ViTMAEForPreTraining.from_pretrained(encoder_name_or_path).vit
        state_dict = _strip_final_layernorm_affine(source.state_dict(), prefix="layernorm.")
    else:
        raise ValueError(f"Unknown encoder_type: {encoder_type}")
    # strict=False: the final-layernorm affine params were deliberately stripped above,
    # so the load must tolerate those missing keys.
    model.encoder.load_state_dict(state_dict, strict=False)
def main():
    """Train the AutoencoderRAE decoder (stage 1) on an `ImageFolder` dataset.

    The encoder is kept frozen; only the decoder parameters are optimized
    against a pixel reconstruction loss (plus an optional encoder-space loss).
    """
    args = parse_args()
    # Stage-1 reconstruction compares input and output pixels directly, so the
    # dataloader resolution must equal the model's configured image size.
    if args.resolution != args.image_size:
        raise ValueError(
            f"`--resolution` ({args.resolution}) must match `--image_size` ({args.image_size}) "
            "for stage-1 reconstruction loss."
        )
    logging_dir = Path(args.output_dir, args.logging_dir)
    accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        project_config=accelerator_project_config,
        log_with=args.report_to,
    )
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )
    # Log the accelerator state from every process to aid debugging distributed runs.
    logger.info(accelerator.state, main_process_only=False)
    if args.seed is not None:
        set_seed(args.seed)
    # Only the main process creates the output dir; other ranks wait for it.
    if accelerator.is_main_process:
        os.makedirs(args.output_dir, exist_ok=True)
    accelerator.wait_for_everyone()
    dataset = ImageFolder(args.train_data_dir, transform=build_transforms(args))

    def collate_fn(examples):
        # ImageFolder yields (image, label) tuples; class labels are unused here.
        pixel_values = torch.stack([example[0] for example in examples]).float()
        return {"pixel_values": pixel_values}

    train_dataloader = DataLoader(
        dataset,
        shuffle=True,
        collate_fn=collate_fn,
        batch_size=args.train_batch_size,
        num_workers=args.dataloader_num_workers,
        pin_memory=True,
        drop_last=True,
    )
    if args.pretrained_model_name_or_path is not None:
        # Resume from a full pretrained AutoencoderRAE (encoder weights included).
        model = AutoencoderRAE.from_pretrained(args.pretrained_model_name_or_path)
        logger.info(f"Loaded pretrained AutoencoderRAE from {args.pretrained_model_name_or_path}")
    else:
        model = AutoencoderRAE(
            encoder_type=args.encoder_type,
            encoder_hidden_size=args.encoder_hidden_size,
            encoder_patch_size=args.encoder_patch_size,
            encoder_num_hidden_layers=args.encoder_num_hidden_layers,
            decoder_hidden_size=args.decoder_hidden_size,
            decoder_num_hidden_layers=args.decoder_num_hidden_layers,
            decoder_num_attention_heads=args.decoder_num_attention_heads,
            decoder_intermediate_size=args.decoder_intermediate_size,
            patch_size=args.patch_size,
            encoder_input_size=args.encoder_input_size,
            image_size=args.image_size,
            num_channels=args.num_channels,
            noise_tau=args.noise_tau,
            reshape_to_2d=args.reshape_to_2d,
            use_encoder_loss=args.use_encoder_loss,
            scaling_factor=args.scaling_factor,
        )
        # Optionally seed the freshly built encoder with pretrained HF weights.
        if args.encoder_name_or_path is not None:
            _load_pretrained_encoder_weights(model, args.encoder_type, args.encoder_name_or_path)
            logger.info(f"Loaded pretrained encoder weights from {args.encoder_name_or_path}")
    # Stage 1: the encoder stays frozen; only the decoder is trained.
    model.encoder.requires_grad_(False)
    model.decoder.requires_grad_(True)
    model.train()
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad),
        lr=args.learning_rate,
        betas=(args.adam_beta1, args.adam_beta2),
        weight_decay=args.adam_weight_decay,
        eps=args.adam_epsilon,
    )
    overrode_max_train_steps = False
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    if args.max_train_steps is None:
        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
        overrode_max_train_steps = True
    lr_scheduler = get_scheduler(
        args.lr_scheduler,
        optimizer=optimizer,
        # Scheduler steps are counted per process, hence the multiplication.
        num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
        num_training_steps=args.max_train_steps * accelerator.num_processes,
    )
    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )
    # `prepare` may shard the dataloader across processes, so recompute the
    # schedule length before deriving the final epoch count.
    if overrode_max_train_steps:
        num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
    if accelerator.is_main_process:
        accelerator.init_trackers("train_autoencoder_rae", config=vars(args))
    total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
    logger.info("***** Running training *****")
    logger.info(f" Num examples = {len(dataset)}")
    logger.info(f" Num Epochs = {args.num_train_epochs}")
    logger.info(f" Instantaneous batch size per device = {args.train_batch_size}")
    logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
    logger.info(f" Total optimization steps = {args.max_train_steps}")
    progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
    progress_bar.set_description("Steps")
    global_step = 0
    for epoch in range(args.num_train_epochs):
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(model):
                pixel_values = batch["pixel_values"]
                _, loss, reconstruction_loss, encoder_loss = compute_losses(
                    model,
                    pixel_values,
                    reconstruction_loss_type=args.reconstruction_loss_type,
                    use_encoder_loss=args.use_encoder_loss,
                    encoder_loss_weight=args.encoder_loss_weight,
                )
                accelerator.backward(loss)
                # Clip only on real optimizer steps (when gradients are synced).
                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
            # One "global step" corresponds to one completed optimizer update.
            if accelerator.sync_gradients:
                progress_bar.update(1)
                global_step += 1
                logs = {
                    "loss": loss.detach().item(),
                    "reconstruction_loss": reconstruction_loss.detach().item(),
                    "encoder_loss": encoder_loss.detach().item(),
                    "lr": lr_scheduler.get_last_lr()[0],
                }
                progress_bar.set_postfix(**logs)
                accelerator.log(logs, step=global_step)
                # NOTE(review): "validation" re-evaluates the current training
                # batch without grad; it is not a held-out validation metric.
                if global_step % args.validation_steps == 0:
                    with torch.no_grad():
                        _, val_loss, val_reconstruction_loss, val_encoder_loss = compute_losses(
                            model,
                            pixel_values,
                            reconstruction_loss_type=args.reconstruction_loss_type,
                            use_encoder_loss=args.use_encoder_loss,
                            encoder_loss_weight=args.encoder_loss_weight,
                        )
                    accelerator.log(
                        {
                            "val/loss": val_loss.detach().item(),
                            "val/reconstruction_loss": val_reconstruction_loss.detach().item(),
                            "val/encoder_loss": val_encoder_loss.detach().item(),
                        },
                        step=global_step,
                    )
                if global_step % args.checkpointing_steps == 0:
                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        unwrapped_model = accelerator.unwrap_model(model)
                        unwrapped_model.save_pretrained(save_path)
                        logger.info(f"Saved checkpoint to {save_path}")
            if global_step >= args.max_train_steps:
                break
        if global_step >= args.max_train_steps:
            break
    accelerator.wait_for_everyone()
    # Final save from the main process only.
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained(args.output_dir)
    accelerator.end_training()
# Script entry point.
if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,406 @@
import argparse
from pathlib import Path
from typing import Any
import torch
from huggingface_hub import HfApi, hf_hub_download
from diffusers import AutoencoderRAE
# Decoder transformer presets, keyed by ViT size name (see --decoder_config_name).
DECODER_CONFIGS = {
    "ViTB": {
        "decoder_hidden_size": 768,
        "decoder_intermediate_size": 3072,
        "decoder_num_attention_heads": 12,
        "decoder_num_hidden_layers": 12,
    },
    "ViTL": {
        "decoder_hidden_size": 1024,
        "decoder_intermediate_size": 4096,
        "decoder_num_attention_heads": 16,
        "decoder_num_hidden_layers": 24,
    },
    "ViTXL": {
        "decoder_hidden_size": 1152,
        "decoder_intermediate_size": 4096,
        "decoder_num_attention_heads": 16,
        "decoder_num_hidden_layers": 28,
    },
}
# Default Hub checkpoints used when --encoder_name_or_path is not provided.
ENCODER_DEFAULT_NAME_OR_PATH = {
    "dinov2": "facebook/dinov2-with-registers-base",
    "siglip2": "google/siglip2-base-patch16-256",
    "mae": "facebook/vit-mae-base",
}
# Fallback hidden sizes (base variants); the HF config overrides these when readable.
ENCODER_HIDDEN_SIZE = {
    "dinov2": 768,
    "siglip2": 768,
    "mae": 768,
}
# Fallback patch sizes; the HF config overrides these when readable.
ENCODER_PATCH_SIZE = {
    "dinov2": 14,
    "siglip2": 16,
    "mae": 16,
}
# Default decoder-weight subdirectories inside the RAE collections repo.
DEFAULT_DECODER_SUBDIR = {
    "dinov2": "decoders/dinov2/wReg_base",
    "mae": "decoders/mae/base_p16",
    "siglip2": "decoders/siglip2/base_p16_i256",
}
# Default latent-stats subdirectories inside the RAE collections repo.
DEFAULT_STATS_SUBDIR = {
    "dinov2": "stats/dinov2/wReg_base",
    "mae": "stats/mae/base_p16",
    "siglip2": "stats/siglip2/base_p16_i256",
}
# File names probed, in order, when resolving checkpoints.
DECODER_FILE_CANDIDATES = ("dinov2_decoder.pt", "model.pt")
STATS_FILE_CANDIDATES = ("stat.pt",)
def dataset_case_candidates(name: str) -> tuple[str, ...]:
    """Return dataset folder-name spellings to probe, most specific first."""
    cased_variants = [name, name.lower(), name.upper(), name.title()]
    fallbacks = ["imagenet1k", "ImageNet1k"]
    return tuple(cased_variants + fallbacks)
class RepoAccessor:
def __init__(self, repo_or_path: str, cache_dir: str | None = None):
self.repo_or_path = repo_or_path
self.cache_dir = cache_dir
self.local_root: Path | None = None
self.repo_id: str | None = None
self.repo_files: set[str] | None = None
root = Path(repo_or_path)
if root.exists() and root.is_dir():
self.local_root = root
else:
self.repo_id = repo_or_path
self.repo_files = set(HfApi().list_repo_files(repo_or_path))
def exists(self, relative_path: str) -> bool:
relative_path = relative_path.replace("\\", "/")
if self.local_root is not None:
return (self.local_root / relative_path).is_file()
return relative_path in self.repo_files
def fetch(self, relative_path: str) -> Path:
relative_path = relative_path.replace("\\", "/")
if self.local_root is not None:
return self.local_root / relative_path
downloaded = hf_hub_download(repo_id=self.repo_id, filename=relative_path, cache_dir=self.cache_dir)
return Path(downloaded)
def unwrap_state_dict(maybe_wrapped: dict[str, Any]) -> dict[str, Any]:
    """Unwrap common checkpoint containers and strip uniform key prefixes.

    Descends through `model`/`module`/`state_dict` wrapper keys (in that
    order), then drops a leading `module.` and/or `decoder.` prefix whenever
    every remaining key carries it.
    """
    inner = maybe_wrapped
    for wrapper_key in ("model", "module", "state_dict"):
        if isinstance(inner, dict) and isinstance(inner.get(wrapper_key), dict):
            inner = inner[wrapper_key]
    result = dict(inner)
    for prefix in ("module.", "decoder."):
        if result and all(key.startswith(prefix) for key in result):
            result = {key[len(prefix):]: value for key, value in result.items()}
    return result
def remap_decoder_attention_keys_for_diffusers(state_dict: dict[str, Any]) -> dict[str, Any]:
    """
    Map official RAE decoder attention key layout to diffusers Attention layout used by AutoencoderRAE decoder.
    Example mappings:
    - `...attention.attention.query.*` -> `...attention.to_q.*`
    - `...attention.attention.key.*` -> `...attention.to_k.*`
    - `...attention.attention.value.*` -> `...attention.to_v.*`
    - `...attention.output.dense.*` -> `...attention.to_out.0.*`
    """
    substitutions = (
        (".attention.attention.query.", ".attention.to_q."),
        (".attention.attention.key.", ".attention.to_k."),
        (".attention.attention.value.", ".attention.to_v."),
        (".attention.output.dense.", ".attention.to_out.0."),
    )

    def _rename(key: str) -> str:
        for old, new in substitutions:
            key = key.replace(old, new)
        return key

    return {_rename(key): value for key, value in state_dict.items()}
def resolve_decoder_file(
    accessor: RepoAccessor, encoder_type: str, variant: str, decoder_checkpoint: str | None
) -> str:
    """Resolve the decoder checkpoint path inside the repo.

    An explicitly supplied `decoder_checkpoint` must exist; otherwise the
    default per-encoder subdirectory is probed for each known file name.
    """
    if decoder_checkpoint is not None:
        if not accessor.exists(decoder_checkpoint):
            raise FileNotFoundError(f"Decoder checkpoint not found: {decoder_checkpoint}")
        return decoder_checkpoint
    base = f"{DEFAULT_DECODER_SUBDIR[encoder_type]}/{variant}"
    for file_name in DECODER_FILE_CANDIDATES:
        candidate = f"{base}/{file_name}"
        if accessor.exists(candidate):
            return candidate
    raise FileNotFoundError(
        f"Could not find decoder checkpoint under `{base}`. Tried: {list(DECODER_FILE_CANDIDATES)}"
    )
def resolve_stats_file(
    accessor: RepoAccessor,
    encoder_type: str,
    dataset_name: str,
    stats_checkpoint: str | None,
) -> str | None:
    """Resolve the latent-stats checkpoint path, or None when unavailable.

    An explicitly supplied `stats_checkpoint` must exist; otherwise every
    dataset-name casing candidate is probed under the default stats subdir.
    Missing stats are not an error — the caller converts without them.
    """
    if stats_checkpoint is not None:
        if not accessor.exists(stats_checkpoint):
            raise FileNotFoundError(f"Stats checkpoint not found: {stats_checkpoint}")
        return stats_checkpoint
    base = DEFAULT_STATS_SUBDIR[encoder_type]
    for dataset_dir in dataset_case_candidates(dataset_name):
        for file_name in STATS_FILE_CANDIDATES:
            relative = f"{base}/{dataset_dir}/{file_name}"
            if accessor.exists(relative):
                return relative
    return None
def extract_latent_stats(stats_obj: Any) -> tuple[Any | None, Any | None]:
if not isinstance(stats_obj, dict):
return None, None
if "latents_mean" in stats_obj or "latents_std" in stats_obj:
return stats_obj.get("latents_mean", None), stats_obj.get("latents_std", None)
mean = stats_obj.get("mean", None)
var = stats_obj.get("var", None)
if mean is None and var is None:
return None, None
latents_std = None
if var is not None:
if isinstance(var, torch.Tensor):
latents_std = torch.sqrt(var + 1e-5)
else:
latents_std = torch.sqrt(torch.tensor(var) + 1e-5)
return mean, latents_std
def _strip_final_layernorm_affine(state_dict: dict[str, Any], prefix: str = "") -> dict[str, Any]:
"""Remove final layernorm weight/bias from encoder state dict.
RAE uses non-affine layernorm (weight=1, bias=0 is the default identity).
Stripping these keys means the model keeps its default init values, which
is functionally equivalent to setting elementwise_affine=False.
"""
keys_to_strip = {f"{prefix}weight", f"{prefix}bias"}
return {k: v for k, v in state_dict.items() if k not in keys_to_strip}
def _load_hf_encoder_state_dict(encoder_type: str, encoder_name_or_path: str) -> dict[str, Any]:
    """Download the HF encoder and extract the state dict for the inner model."""
    if encoder_type == "dinov2":
        from transformers import Dinov2WithRegistersModel

        backbone = Dinov2WithRegistersModel.from_pretrained(encoder_name_or_path)
        raw_state_dict = backbone.state_dict()
        final_ln_prefix = "layernorm."
    elif encoder_type == "siglip2":
        from transformers import SiglipModel

        # SiglipModel.vision_model is a SiglipVisionTransformer.
        # Our Siglip2Encoder wraps it inside SiglipVisionModel which nests it
        # under .vision_model, so we add the prefix to match the diffusers key layout.
        backbone = SiglipModel.from_pretrained(encoder_name_or_path).vision_model
        raw_state_dict = {f"vision_model.{key}": value for key, value in backbone.state_dict().items()}
        final_ln_prefix = "vision_model.post_layernorm."
    elif encoder_type == "mae":
        from transformers import ViTMAEForPreTraining

        backbone = ViTMAEForPreTraining.from_pretrained(encoder_name_or_path).vit
        raw_state_dict = backbone.state_dict()
        final_ln_prefix = "layernorm."
    else:
        raise ValueError(f"Unknown encoder_type: {encoder_type}")
    # RAE's final layernorm is non-affine, so its weight/bias are removed.
    return _strip_final_layernorm_affine(raw_state_dict, prefix=final_ln_prefix)
def convert(args: argparse.Namespace) -> None:
    """Convert an official RAE decoder checkpoint into a diffusers `AutoencoderRAE`.

    Resolves the decoder/stats files (locally or on the Hub), loads the HF
    encoder weights, assembles the full state dict, and saves the result with
    `save_pretrained` (optionally verifying it loads back).
    """
    accessor = RepoAccessor(args.repo_or_path, cache_dir=args.cache_dir)
    encoder_name_or_path = args.encoder_name_or_path or ENCODER_DEFAULT_NAME_OR_PATH[args.encoder_type]
    decoder_relpath = resolve_decoder_file(accessor, args.encoder_type, args.variant, args.decoder_checkpoint)
    stats_relpath = resolve_stats_file(accessor, args.encoder_type, args.dataset_name, args.stats_checkpoint)
    print(f"Using decoder checkpoint: {decoder_relpath}")
    if stats_relpath is not None:
        print(f"Using stats checkpoint: {stats_relpath}")
    else:
        print("No stats checkpoint found; conversion will proceed without latent stats.")
    # Dry run only reports which files would be used.
    if args.dry_run:
        return
    decoder_path = accessor.fetch(decoder_relpath)
    # NOTE(review): torch.load on an untrusted checkpoint can execute arbitrary
    # code via pickle; only run this against trusted repos.
    decoder_obj = torch.load(decoder_path, map_location="cpu")
    decoder_state_dict = unwrap_state_dict(decoder_obj)
    decoder_state_dict = remap_decoder_attention_keys_for_diffusers(decoder_state_dict)
    latents_mean, latents_std = None, None
    if stats_relpath is not None:
        stats_path = accessor.fetch(stats_relpath)
        stats_obj = torch.load(stats_path, map_location="cpu")
        latents_mean, latents_std = extract_latent_stats(stats_obj)
    decoder_cfg = DECODER_CONFIGS[args.decoder_config_name]
    # Read encoder normalization stats from the HF image processor (only place that downloads encoder info)
    from transformers import AutoConfig, AutoImageProcessor

    proc = AutoImageProcessor.from_pretrained(encoder_name_or_path)
    encoder_norm_mean = list(proc.image_mean)
    encoder_norm_std = list(proc.image_std)
    # Read encoder hidden size and patch size from HF config
    encoder_hidden_size = ENCODER_HIDDEN_SIZE[args.encoder_type]
    encoder_patch_size = ENCODER_PATCH_SIZE[args.encoder_type]
    try:
        hf_config = AutoConfig.from_pretrained(encoder_name_or_path)
        # For models like SigLIP that nest vision config
        if hasattr(hf_config, "vision_config"):
            hf_config = hf_config.vision_config
        encoder_hidden_size = hf_config.hidden_size
        encoder_patch_size = hf_config.patch_size
    except Exception:
        # Best-effort: fall back to the hard-coded per-encoder defaults above.
        pass
    # Load the actual encoder weights from HF to include in the saved model
    encoder_state_dict = _load_hf_encoder_state_dict(args.encoder_type, encoder_name_or_path)
    # Build model on meta device to avoid double init overhead
    with torch.device("meta"):
        model = AutoencoderRAE(
            encoder_type=args.encoder_type,
            encoder_hidden_size=encoder_hidden_size,
            encoder_patch_size=encoder_patch_size,
            encoder_input_size=args.encoder_input_size,
            patch_size=args.patch_size,
            image_size=args.image_size,
            num_channels=args.num_channels,
            encoder_norm_mean=encoder_norm_mean,
            encoder_norm_std=encoder_norm_std,
            decoder_hidden_size=decoder_cfg["decoder_hidden_size"],
            decoder_num_hidden_layers=decoder_cfg["decoder_num_hidden_layers"],
            decoder_num_attention_heads=decoder_cfg["decoder_num_attention_heads"],
            decoder_intermediate_size=decoder_cfg["decoder_intermediate_size"],
            latents_mean=latents_mean,
            latents_std=latents_std,
            scaling_factor=args.scaling_factor,
        )
    # Assemble full state dict and load with assign=True
    full_state_dict = {}
    # Encoder weights (prefixed with "encoder.")
    for k, v in encoder_state_dict.items():
        full_state_dict[f"encoder.{k}"] = v
    # Decoder weights (prefixed with "decoder.")
    for k, v in decoder_state_dict.items():
        full_state_dict[f"decoder.{k}"] = v
    # Buffers from config
    full_state_dict["encoder_mean"] = torch.tensor(encoder_norm_mean, dtype=torch.float32).view(1, 3, 1, 1)
    full_state_dict["encoder_std"] = torch.tensor(encoder_norm_std, dtype=torch.float32).view(1, 3, 1, 1)
    if latents_mean is not None:
        latents_mean_t = latents_mean if isinstance(latents_mean, torch.Tensor) else torch.tensor(latents_mean)
        full_state_dict["_latents_mean"] = latents_mean_t
    else:
        # Without stats, register no-op normalization buffers (mean=0, std=1).
        full_state_dict["_latents_mean"] = torch.zeros(1)
    if latents_std is not None:
        latents_std_t = latents_std if isinstance(latents_std, torch.Tensor) else torch.tensor(latents_std)
        full_state_dict["_latents_std"] = latents_std_t
    else:
        full_state_dict["_latents_std"] = torch.ones(1)
    # `assign=True` replaces the meta-device parameters with the loaded tensors.
    model.load_state_dict(full_state_dict, strict=False, assign=True)
    # Verify no critical keys are missing
    model_keys = {name for name, _ in model.named_parameters()}
    model_keys |= {name for name, _ in model.named_buffers()}
    loaded_keys = set(full_state_dict.keys())
    missing = model_keys - loaded_keys
    # decoder_pos_embed is initialized in-model. trainable_cls_token is only
    # allowed to be missing if it was absent in the source decoder checkpoint.
    allowed_missing = {"decoder.decoder_pos_embed"}
    if "trainable_cls_token" not in decoder_state_dict:
        allowed_missing.add("decoder.trainable_cls_token")
    if missing - allowed_missing:
        print(f"Warning: missing keys after conversion: {sorted(missing - allowed_missing)}")
    output_path = Path(args.output_path)
    output_path.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(output_path)
    if args.verify_load:
        print("Verifying converted checkpoint with AutoencoderRAE.from_pretrained(low_cpu_mem_usage=False)...")
        loaded_model = AutoencoderRAE.from_pretrained(output_path, low_cpu_mem_usage=False)
        if not isinstance(loaded_model, AutoencoderRAE):
            raise RuntimeError("Verification failed: loaded object is not AutoencoderRAE.")
        print("Verification passed.")
    print(f"Saved converted AutoencoderRAE to: {output_path}")
def parse_args() -> argparse.Namespace:
    """Build and parse the CLI arguments for the conversion script."""
    arg_parser = argparse.ArgumentParser(
        description="Convert RAE decoder checkpoints to diffusers AutoencoderRAE format"
    )
    # Source repo and destination.
    arg_parser.add_argument(
        "--repo_or_path", type=str, required=True, help="Hub repo id (e.g. nyu-visionx/RAE-collections) or local path"
    )
    arg_parser.add_argument("--output_path", type=str, required=True, help="Directory to save converted model")
    # Encoder selection.
    arg_parser.add_argument("--encoder_type", type=str, choices=["dinov2", "mae", "siglip2"], required=True)
    arg_parser.add_argument(
        "--encoder_name_or_path", type=str, default=None, help="Optional encoder HF model id or local path override"
    )
    # Decoder / stats checkpoint resolution.
    arg_parser.add_argument("--variant", type=str, default="ViTXL_n08", help="Decoder variant folder name")
    arg_parser.add_argument("--dataset_name", type=str, default="imagenet1k", help="Stats dataset folder name")
    arg_parser.add_argument(
        "--decoder_checkpoint", type=str, default=None, help="Relative path to decoder checkpoint inside repo/path"
    )
    arg_parser.add_argument(
        "--stats_checkpoint", type=str, default=None, help="Relative path to stats checkpoint inside repo/path"
    )
    # Model configuration.
    arg_parser.add_argument("--decoder_config_name", type=str, choices=list(DECODER_CONFIGS.keys()), default="ViTXL")
    arg_parser.add_argument("--encoder_input_size", type=int, default=224)
    arg_parser.add_argument("--patch_size", type=int, default=16)
    arg_parser.add_argument("--image_size", type=int, default=None)
    arg_parser.add_argument("--num_channels", type=int, default=3)
    arg_parser.add_argument("--scaling_factor", type=float, default=1.0)
    # Misc.
    arg_parser.add_argument("--cache_dir", type=str, default=None)
    arg_parser.add_argument("--dry_run", action="store_true", help="Only resolve and print selected files")
    arg_parser.add_argument(
        "--verify_load",
        action="store_true",
        help="After conversion, load back with AutoencoderRAE.from_pretrained(low_cpu_mem_usage=False).",
    )
    return arg_parser.parse_args()
def main() -> None:
    """CLI entry point: parse arguments and run the conversion."""
    args = parse_args()
    convert(args)


if __name__ == "__main__":
    main()

View File

@@ -202,6 +202,7 @@ else:
"AutoencoderKLTemporalDecoder",
"AutoencoderKLWan",
"AutoencoderOobleck",
"AutoencoderRAE",
"AutoencoderTiny",
"AutoModel",
"BriaFiboTransformer2DModel",
@@ -432,6 +433,12 @@ else:
"FluxKontextAutoBlocks",
"FluxKontextModularPipeline",
"FluxModularPipeline",
"HeliosAutoBlocks",
"HeliosModularPipeline",
"HeliosPyramidAutoBlocks",
"HeliosPyramidDistilledAutoBlocks",
"HeliosPyramidDistilledModularPipeline",
"HeliosPyramidModularPipeline",
"QwenImageAutoBlocks",
"QwenImageEditAutoBlocks",
"QwenImageEditModularPipeline",
@@ -571,6 +578,7 @@ else:
"LEditsPPPipelineStableDiffusionXL",
"LongCatImageEditPipeline",
"LongCatImagePipeline",
"LTX2ConditionPipeline",
"LTX2ImageToVideoPipeline",
"LTX2LatentUpsamplePipeline",
"LTX2Pipeline",
@@ -974,6 +982,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AutoencoderKLTemporalDecoder,
AutoencoderKLWan,
AutoencoderOobleck,
AutoencoderRAE,
AutoencoderTiny,
AutoModel,
BriaFiboTransformer2DModel,
@@ -1183,6 +1192,12 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
FluxKontextAutoBlocks,
FluxKontextModularPipeline,
FluxModularPipeline,
HeliosAutoBlocks,
HeliosModularPipeline,
HeliosPyramidAutoBlocks,
HeliosPyramidDistilledAutoBlocks,
HeliosPyramidDistilledModularPipeline,
HeliosPyramidModularPipeline,
QwenImageAutoBlocks,
QwenImageEditAutoBlocks,
QwenImageEditModularPipeline,
@@ -1318,6 +1333,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
LEditsPPPipelineStableDiffusionXL,
LongCatImageEditPipeline,
LongCatImagePipeline,
LTX2ConditionPipeline,
LTX2ImageToVideoPipeline,
LTX2LatentUpsamplePipeline,
LTX2Pipeline,

View File

@@ -49,6 +49,7 @@ if is_torch_available():
_import_structure["autoencoders.autoencoder_kl_temporal_decoder"] = ["AutoencoderKLTemporalDecoder"]
_import_structure["autoencoders.autoencoder_kl_wan"] = ["AutoencoderKLWan"]
_import_structure["autoencoders.autoencoder_oobleck"] = ["AutoencoderOobleck"]
_import_structure["autoencoders.autoencoder_rae"] = ["AutoencoderRAE"]
_import_structure["autoencoders.autoencoder_tiny"] = ["AutoencoderTiny"]
_import_structure["autoencoders.consistency_decoder_vae"] = ["ConsistencyDecoderVAE"]
_import_structure["autoencoders.vq_model"] = ["VQModel"]
@@ -168,6 +169,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AutoencoderKLTemporalDecoder,
AutoencoderKLWan,
AutoencoderOobleck,
AutoencoderRAE,
AutoencoderTiny,
ConsistencyDecoderVAE,
VQModel,

View File

@@ -38,6 +38,7 @@ from ..utils import (
is_flash_attn_available,
is_flash_attn_version,
is_kernels_available,
is_kernels_version,
is_sageattention_available,
is_sageattention_version,
is_torch_npu_available,
@@ -318,6 +319,7 @@ class _HubKernelConfig:
repo_id: str
function_attr: str
revision: str | None = None
version: int | None = None
kernel_fn: Callable | None = None
wrapped_forward_attr: str | None = None
wrapped_backward_attr: str | None = None
@@ -327,31 +329,34 @@ class _HubKernelConfig:
# Registry for hub-based attention kernels
_HUB_KERNELS_REGISTRY: dict["AttentionBackendName", _HubKernelConfig] = {
# TODO: temporary revision for now. Remove when merged upstream into `main`.
AttentionBackendName._FLASH_3_HUB: _HubKernelConfig(
repo_id="kernels-community/flash-attn3",
function_attr="flash_attn_func",
revision="fake-ops-return-probs",
wrapped_forward_attr="flash_attn_interface._flash_attn_forward",
wrapped_backward_attr="flash_attn_interface._flash_attn_backward",
version=1,
),
AttentionBackendName._FLASH_3_VARLEN_HUB: _HubKernelConfig(
repo_id="kernels-community/flash-attn3",
function_attr="flash_attn_varlen_func",
# revision="fake-ops-return-probs",
version=1,
),
AttentionBackendName.FLASH_HUB: _HubKernelConfig(
repo_id="kernels-community/flash-attn2",
function_attr="flash_attn_func",
revision=None,
wrapped_forward_attr="flash_attn_interface._wrapped_flash_attn_forward",
wrapped_backward_attr="flash_attn_interface._wrapped_flash_attn_backward",
version=1,
),
AttentionBackendName.FLASH_VARLEN_HUB: _HubKernelConfig(
repo_id="kernels-community/flash-attn2", function_attr="flash_attn_varlen_func", revision=None
repo_id="kernels-community/flash-attn2",
function_attr="flash_attn_varlen_func",
version=1,
),
AttentionBackendName.SAGE_HUB: _HubKernelConfig(
repo_id="kernels-community/sage_attention", function_attr="sageattn", revision=None
repo_id="kernels-community/sage-attention",
function_attr="sageattn",
version=1,
),
}
@@ -521,6 +526,10 @@ def _check_attention_backend_requirements(backend: AttentionBackendName) -> None
raise RuntimeError(
f"Backend '{backend.value}' is not usable because the `kernels` package isn't available. Please install it with `pip install kernels`."
)
if not is_kernels_version(">=", "0.12"):
raise RuntimeError(
f"Backend '{backend.value}' needs to be used with a `kernels` version of at least 0.12. Please update with `pip install -U kernels`."
)
elif backend == AttentionBackendName.AITER:
if not _CAN_USE_AITER_ATTN:
@@ -694,7 +703,7 @@ def _maybe_download_kernel_for_backend(backend: AttentionBackendName) -> None:
try:
from kernels import get_kernel
kernel_module = get_kernel(config.repo_id, revision=config.revision)
kernel_module = get_kernel(config.repo_id, revision=config.revision, version=config.version)
if needs_kernel:
config.kernel_fn = _resolve_kernel_attr(kernel_module, config.function_attr)

View File

@@ -18,6 +18,7 @@ from .autoencoder_kl_qwenimage import AutoencoderKLQwenImage
from .autoencoder_kl_temporal_decoder import AutoencoderKLTemporalDecoder
from .autoencoder_kl_wan import AutoencoderKLWan
from .autoencoder_oobleck import AutoencoderOobleck
from .autoencoder_rae import AutoencoderRAE
from .autoencoder_tiny import AutoencoderTiny
from .consistency_decoder_vae import ConsistencyDecoderVAE
from .vq_model import VQModel

View File

@@ -0,0 +1,689 @@
# Copyright 2026 The NYU Vision-X and HuggingFace Teams. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from math import sqrt
from typing import Any
import torch
import torch.nn as nn
import torch.nn.functional as F
from ...configuration_utils import ConfigMixin, register_to_config
from ...utils import BaseOutput, logging
from ...utils.accelerate_utils import apply_forward_hook
from ...utils.import_utils import is_transformers_available
from ...utils.torch_utils import randn_tensor
if is_transformers_available():
from transformers import (
Dinov2WithRegistersConfig,
Dinov2WithRegistersModel,
SiglipVisionConfig,
SiglipVisionModel,
ViTMAEConfig,
ViTMAEModel,
)
from ..activations import get_activation
from ..attention import AttentionMixin
from ..attention_processor import Attention
from ..embeddings import get_2d_sincos_pos_embed
from ..modeling_utils import ModelMixin
from .vae import AutoencoderMixin, DecoderOutput, EncoderOutput
logger = logging.get_logger(__name__)
# ---------------------------------------------------------------------------
# Per-encoder forward functions
# ---------------------------------------------------------------------------
# Each function takes the raw transformers model + images and returns patch
# tokens of shape (B, N, C), stripping CLS / register tokens as needed.
def _dinov2_encoder_forward(model: nn.Module, images: torch.Tensor) -> torch.Tensor:
outputs = model(images, output_hidden_states=True)
unused_token_num = 5 # 1 CLS + 4 register tokens
return outputs.last_hidden_state[:, unused_token_num:]
def _siglip2_encoder_forward(model: nn.Module, images: torch.Tensor) -> torch.Tensor:
outputs = model(images, output_hidden_states=True, interpolate_pos_encoding=True)
return outputs.last_hidden_state
def _mae_encoder_forward(model: nn.Module, images: torch.Tensor, patch_size: int) -> torch.Tensor:
h, w = images.shape[2], images.shape[3]
patch_num = int(h * w // patch_size**2)
if patch_num * patch_size**2 != h * w:
raise ValueError("Image size should be divisible by patch size.")
noise = torch.arange(patch_num).unsqueeze(0).expand(images.shape[0], -1).to(images.device).to(images.dtype)
outputs = model(images, noise, interpolate_pos_encoding=True)
return outputs.last_hidden_state[:, 1:] # remove cls token
# ---------------------------------------------------------------------------
# Encoder construction helpers
# ---------------------------------------------------------------------------
def _build_encoder(
    encoder_type: str, hidden_size: int, patch_size: int, num_hidden_layers: int, head_dim: int = 64
) -> nn.Module:
    """Build a frozen encoder from config (no pretrained download)."""
    # All supported encoders use 64-dim attention heads.
    num_attention_heads = hidden_size // head_dim
    if encoder_type == "dinov2":
        model = Dinov2WithRegistersModel(
            Dinov2WithRegistersConfig(
                hidden_size=hidden_size,
                patch_size=patch_size,
                image_size=518,
                num_attention_heads=num_attention_heads,
                num_hidden_layers=num_hidden_layers,
            )
        )
        final_layernorm = model.layernorm
    elif encoder_type == "siglip2":
        model = SiglipVisionModel(
            SiglipVisionConfig(
                hidden_size=hidden_size,
                patch_size=patch_size,
                image_size=256,
                num_attention_heads=num_attention_heads,
                num_hidden_layers=num_hidden_layers,
            )
        )
        final_layernorm = model.vision_model.post_layernorm
    elif encoder_type == "mae":
        model = ViTMAEModel(
            ViTMAEConfig(
                hidden_size=hidden_size,
                patch_size=patch_size,
                image_size=224,
                num_attention_heads=num_attention_heads,
                num_hidden_layers=num_hidden_layers,
                mask_ratio=0.0,
            )
        )
        final_layernorm = model.layernorm
    else:
        raise ValueError(f"Unknown encoder_type='{encoder_type}'. Available: dinov2, siglip2, mae")
    # RAE strips the final layernorm affine params (identity LN). Remove them from
    # the architecture so `from_pretrained` doesn't leave them on the meta device.
    final_layernorm.weight = None
    final_layernorm.bias = None
    # The encoder is frozen for RAE training.
    model.requires_grad_(False)
    return model
# Dispatch table: encoder_type -> forward function returning (B, N, C) patch tokens.
_ENCODER_FORWARD_FNS = {
    "dinov2": _dinov2_encoder_forward,
    "siglip2": _siglip2_encoder_forward,
    "mae": _mae_encoder_forward,
}
@dataclass
class RAEDecoderOutput(BaseOutput):
    """
    Output of `RAEDecoder`.

    Args:
        logits (`torch.Tensor`):
            Patch reconstruction logits of shape `(batch_size, num_patches, patch_size**2 * num_channels)`.
    """

    # Flattened per-patch pixel logits; see the docstring for the exact shape.
    logits: torch.Tensor
class ViTMAEIntermediate(nn.Module):
    """MLP up-projection plus activation of a ViT-MAE transformer block.

    Attribute names (`dense`, `intermediate_act_fn`) match the ViT-MAE
    checkpoint key layout and must not be renamed.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str = "gelu"):
        super().__init__()
        self.dense = nn.Linear(hidden_size, intermediate_size)
        self.intermediate_act_fn = get_activation(hidden_act)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Project to the intermediate width, then apply the activation."""
        return self.intermediate_act_fn(self.dense(hidden_states))
class ViTMAEOutput(nn.Module):
    """MLP down-projection of a ViT-MAE block with dropout and a residual add.

    Attribute names (`dense`, `dropout`) match the ViT-MAE checkpoint key
    layout and must not be renamed.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, hidden_dropout_prob: float = 0.0):
        super().__init__()
        self.dense = nn.Linear(intermediate_size, hidden_size)
        self.dropout = nn.Dropout(hidden_dropout_prob)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        """Project back to `hidden_size`, apply dropout, then add the residual."""
        projected = self.dropout(self.dense(hidden_states))
        return projected + input_tensor
class ViTMAELayer(nn.Module):
    """
    Pre-norm transformer block matching the naming/parameter structure used in
    RAE-main (ViTMAE decoder block), so checkpoints load without remapping.
    """

    def __init__(
        self,
        *,
        hidden_size: int,
        num_attention_heads: int,
        intermediate_size: int,
        qkv_bias: bool = True,
        layer_norm_eps: float = 1e-12,
        hidden_dropout_prob: float = 0.0,
        attention_probs_dropout_prob: float = 0.0,
        hidden_act: str = "gelu",
    ):
        super().__init__()
        if hidden_size % num_attention_heads != 0:
            raise ValueError(
                f"hidden_size={hidden_size} must be divisible by num_attention_heads={num_attention_heads}"
            )
        head_dim = hidden_size // num_attention_heads
        self.attention = Attention(
            query_dim=hidden_size,
            heads=num_attention_heads,
            dim_head=head_dim,
            dropout=attention_probs_dropout_prob,
            bias=qkv_bias,
        )
        self.intermediate = ViTMAEIntermediate(
            hidden_size=hidden_size, intermediate_size=intermediate_size, hidden_act=hidden_act
        )
        self.output = ViTMAEOutput(
            hidden_size=hidden_size, intermediate_size=intermediate_size, hidden_dropout_prob=hidden_dropout_prob
        )
        self.layernorm_before = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.layernorm_after = nn.LayerNorm(hidden_size, eps=layer_norm_eps)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Self-attention then MLP, each pre-normed, each with a residual add."""
        residual = hidden_states
        hidden_states = self.attention(self.layernorm_before(hidden_states)) + residual
        mlp_hidden = self.intermediate(self.layernorm_after(hidden_states))
        # ViTMAEOutput performs the down-projection and the second residual add.
        return self.output(mlp_hidden, hidden_states)
class RAEDecoder(nn.Module):
    """
    Decoder implementation ported from RAE-main to keep checkpoint compatibility.

    Maps encoder tokens to per-patch pixel logits through a stack of ViT-MAE style transformer blocks;
    `unpatchify` folds those logits back into images.

    Key attributes (must match checkpoint keys):
    - decoder_embed
    - decoder_pos_embed
    - decoder_layers
    - decoder_norm
    - decoder_pred
    - trainable_cls_token
    """

    def __init__(
        self,
        hidden_size: int = 768,
        decoder_hidden_size: int = 512,
        decoder_num_hidden_layers: int = 8,
        decoder_num_attention_heads: int = 16,
        decoder_intermediate_size: int = 2048,
        num_patches: int = 256,
        patch_size: int = 16,
        num_channels: int = 3,
        image_size: int = 256,
        qkv_bias: bool = True,
        layer_norm_eps: float = 1e-12,
        hidden_dropout_prob: float = 0.0,
        attention_probs_dropout_prob: float = 0.0,
        hidden_act: str = "gelu",
    ):
        super().__init__()
        self.decoder_hidden_size = decoder_hidden_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.image_size = image_size
        self.num_patches = num_patches
        # Projects encoder tokens (width `hidden_size`) into the decoder width.
        self.decoder_embed = nn.Linear(hidden_size, decoder_hidden_size, bias=True)
        grid_size = int(num_patches**0.5)
        # Fixed 2D sin-cos positional table with one leading CLS slot; registered non-persistent because it
        # is fully determined by the config and therefore not stored in checkpoints.
        pos_embed = get_2d_sincos_pos_embed(
            decoder_hidden_size, grid_size, cls_token=True, extra_tokens=1, output_type="pt"
        )
        self.register_buffer("decoder_pos_embed", pos_embed.unsqueeze(0).float(), persistent=False)
        self.decoder_layers = nn.ModuleList(
            [
                ViTMAELayer(
                    hidden_size=decoder_hidden_size,
                    num_attention_heads=decoder_num_attention_heads,
                    intermediate_size=decoder_intermediate_size,
                    qkv_bias=qkv_bias,
                    layer_norm_eps=layer_norm_eps,
                    hidden_dropout_prob=hidden_dropout_prob,
                    attention_probs_dropout_prob=attention_probs_dropout_prob,
                    hidden_act=hidden_act,
                )
                for _ in range(decoder_num_hidden_layers)
            ]
        )
        self.decoder_norm = nn.LayerNorm(decoder_hidden_size, eps=layer_norm_eps)
        # Head that maps each token to a flattened pixel patch (patch_size**2 * num_channels values).
        self.decoder_pred = nn.Linear(decoder_hidden_size, patch_size**2 * num_channels, bias=True)
        # NOTE: flag exists but gradient checkpointing is not wired into `forward` yet.
        self.gradient_checkpointing = False
        # Learned CLS token prepended to the latent tokens in `forward`.
        self.trainable_cls_token = nn.Parameter(torch.zeros(1, 1, decoder_hidden_size))

    def interpolate_pos_encoding(self, embeddings: torch.Tensor) -> torch.Tensor:
        """Resize the positional table (excluding its CLS slot) to the token count of `embeddings`."""
        embeddings_positions = embeddings.shape[1] - 1
        num_positions = self.decoder_pos_embed.shape[1] - 1
        class_pos_embed = self.decoder_pos_embed[:, 0, :]
        patch_pos_embed = self.decoder_pos_embed[:, 1:, :]
        dim = self.decoder_pos_embed.shape[-1]
        # Treat the patch positions as a 1 x N "image" and resize along N with bicubic interpolation.
        patch_pos_embed = patch_pos_embed.reshape(1, 1, -1, dim).permute(0, 3, 1, 2)
        patch_pos_embed = F.interpolate(
            patch_pos_embed,
            scale_factor=(1, embeddings_positions / num_positions),
            mode="bicubic",
            align_corners=False,
        )
        patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
        # Re-attach the untouched CLS position at the front.
        return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1)

    def interpolate_latent(self, x: torch.Tensor) -> torch.Tensor:
        """Spatially resize a (B, L, C) token grid so that L equals `self.num_patches`."""
        b, l, c = x.shape
        if l == self.num_patches:
            return x
        # NOTE(review): assumes l is a perfect square (square token grid) — no explicit check here.
        h = w = int(l**0.5)
        x = x.reshape(b, h, w, c).permute(0, 3, 1, 2)
        target_size = (int(self.num_patches**0.5), int(self.num_patches**0.5))
        x = F.interpolate(x, size=target_size, mode="bilinear", align_corners=False)
        x = x.permute(0, 2, 3, 1).contiguous().view(b, self.num_patches, c)
        return x

    def unpatchify(self, patchified_pixel_values: torch.Tensor, original_image_size: tuple[int, int] | None = None):
        """Fold per-patch pixel logits of shape (B, N, P*P*C) back into images of shape (B, C, H, W)."""
        patch_size, num_channels = self.patch_size, self.num_channels
        original_image_size = (
            original_image_size if original_image_size is not None else (self.image_size, self.image_size)
        )
        original_height, original_width = original_image_size
        num_patches_h = original_height // patch_size
        num_patches_w = original_width // patch_size
        if num_patches_h * num_patches_w != patchified_pixel_values.shape[1]:
            raise ValueError(
                f"The number of patches in the patchified pixel values {patchified_pixel_values.shape[1]}, does not match the number of patches on original image {num_patches_h}*{num_patches_w}"
            )
        batch_size = patchified_pixel_values.shape[0]
        patchified_pixel_values = patchified_pixel_values.reshape(
            batch_size,
            num_patches_h,
            num_patches_w,
            patch_size,
            patch_size,
            num_channels,
        )
        # Rearrange (B, Hp, Wp, p, p, C) -> (B, C, Hp, p, Wp, p) so patches tile the output image.
        patchified_pixel_values = torch.einsum("nhwpqc->nchpwq", patchified_pixel_values)
        pixel_values = patchified_pixel_values.reshape(
            batch_size,
            num_channels,
            num_patches_h * patch_size,
            num_patches_w * patch_size,
        )
        return pixel_values

    def forward(
        self,
        hidden_states: torch.Tensor,
        *,
        interpolate_pos_encoding: bool = False,
        drop_cls_token: bool = False,
        return_dict: bool = True,
    ) -> RAEDecoderOutput | tuple[torch.Tensor]:
        """
        Decode encoder tokens into per-patch pixel logits.

        Args:
            hidden_states (`torch.Tensor`): Encoder tokens of shape `(batch, seq_len, hidden_size)`.
            interpolate_pos_encoding (`bool`): Resize the positional table to the input length.
                Only supported together with `drop_cls_token=True`.
            drop_cls_token (`bool`): Drop the first input token before decoding (used when the encoder
                output includes its own CLS token).
            return_dict (`bool`): Return a `RAEDecoderOutput` instead of a plain tuple.
        """
        x = self.decoder_embed(hidden_states)
        if drop_cls_token:
            x_ = x[:, 1:, :]
            x_ = self.interpolate_latent(x_)
        else:
            x_ = self.interpolate_latent(x)
        # Always prepend the decoder's own learned CLS token.
        cls_token = self.trainable_cls_token.expand(x_.shape[0], -1, -1)
        x = torch.cat([cls_token, x_], dim=1)
        if interpolate_pos_encoding:
            if not drop_cls_token:
                raise ValueError("interpolate_pos_encoding only supports drop_cls_token=True")
            decoder_pos_embed = self.interpolate_pos_encoding(x)
        else:
            decoder_pos_embed = self.decoder_pos_embed
        hidden_states = x + decoder_pos_embed.to(device=x.device, dtype=x.dtype)
        for layer_module in self.decoder_layers:
            hidden_states = layer_module(hidden_states)
        hidden_states = self.decoder_norm(hidden_states)
        logits = self.decoder_pred(hidden_states)
        # Strip the CLS position; only patch tokens carry pixel predictions.
        logits = logits[:, 1:, :]
        if not return_dict:
            return (logits,)
        return RAEDecoderOutput(logits=logits)
class AutoencoderRAE(ModelMixin, AttentionMixin, AutoencoderMixin, ConfigMixin):
    r"""
    Representation Autoencoder (RAE) model for encoding images to latents and decoding latents to images.

    This model uses a frozen pretrained encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT decoder to reconstruct
    images from learned representations.

    This model inherits from [`ModelMixin`]. Check the superclass documentation for its generic methods implemented for
    all models (such as downloading or saving).

    Args:
        encoder_type (`str`, *optional*, defaults to `"dinov2"`):
            Type of frozen encoder to use. One of `"dinov2"`, `"siglip2"`, or `"mae"`.
        encoder_hidden_size (`int`, *optional*, defaults to `768`):
            Hidden size of the encoder model.
        encoder_patch_size (`int`, *optional*, defaults to `14`):
            Patch size of the encoder model.
        encoder_num_hidden_layers (`int`, *optional*, defaults to `12`):
            Number of hidden layers in the encoder model.
        patch_size (`int`, *optional*, defaults to `16`):
            Decoder patch size (used for unpatchify and decoder head).
        encoder_input_size (`int`, *optional*, defaults to `224`):
            Input size expected by the encoder.
        image_size (`int`, *optional*):
            Decoder output image size. If `None`, it is derived from encoder token count and `patch_size` like
            RAE-main: `image_size = patch_size * sqrt(num_patches)`, where `num_patches = (encoder_input_size //
            encoder_patch_size) ** 2`.
        num_channels (`int`, *optional*, defaults to `3`):
            Number of input/output channels.
        encoder_norm_mean (`list`, *optional*, defaults to `[0.485, 0.456, 0.406]`):
            Channel-wise mean for encoder input normalization (ImageNet defaults).
        encoder_norm_std (`list`, *optional*, defaults to `[0.229, 0.224, 0.225]`):
            Channel-wise std for encoder input normalization (ImageNet defaults).
        latents_mean (`list` or `tuple`, *optional*):
            Optional mean for latent normalization. Tensor inputs are accepted and converted to config-serializable
            lists.
        latents_std (`list` or `tuple`, *optional*):
            Optional standard deviation for latent normalization. Tensor inputs are accepted and converted to
            config-serializable lists.
        noise_tau (`float`, *optional*, defaults to `0.0`):
            Noise level for training (adds noise to latents during training).
        reshape_to_2d (`bool`, *optional*, defaults to `True`):
            Whether to reshape latents to 2D (B, C, H, W) format.
        use_encoder_loss (`bool`, *optional*, defaults to `False`):
            Whether to use encoder hidden states in the loss (for advanced training).
        scaling_factor (`float`, *optional*, defaults to `1.0`):
            Multiplier applied to normalized latents in `encode` and undone in `decode`, following the diffusers
            latent-scaling convention.
    """

    # NOTE: gradient checkpointing is not wired up for this model yet.
    _supports_gradient_checkpointing = False
    _no_split_modules = ["ViTMAELayer"]
    # The decoder's positional table is a deterministic, non-persistent buffer; tolerate it in old checkpoints.
    _keys_to_ignore_on_load_unexpected = ["decoder.decoder_pos_embed"]

    @register_to_config
    def __init__(
        self,
        encoder_type: str = "dinov2",
        encoder_hidden_size: int = 768,
        encoder_patch_size: int = 14,
        encoder_num_hidden_layers: int = 12,
        decoder_hidden_size: int = 512,
        decoder_num_hidden_layers: int = 8,
        decoder_num_attention_heads: int = 16,
        decoder_intermediate_size: int = 2048,
        patch_size: int = 16,
        encoder_input_size: int = 224,
        image_size: int | None = None,
        num_channels: int = 3,
        encoder_norm_mean: list | None = None,
        encoder_norm_std: list | None = None,
        latents_mean: list | tuple | torch.Tensor | None = None,
        latents_std: list | tuple | torch.Tensor | None = None,
        noise_tau: float = 0.0,
        reshape_to_2d: bool = True,
        use_encoder_loss: bool = False,
        scaling_factor: float = 1.0,
    ):
        super().__init__()
        if encoder_type not in _ENCODER_FORWARD_FNS:
            raise ValueError(
                f"Unknown encoder_type='{encoder_type}'. Available: {sorted(_ENCODER_FORWARD_FNS.keys())}"
            )

        def _to_config_compatible(value: Any) -> Any:
            # Recursively convert tensors/tuples to plain lists so the value is JSON-serializable.
            if isinstance(value, torch.Tensor):
                return value.detach().cpu().tolist()
            if isinstance(value, tuple):
                return [_to_config_compatible(v) for v in value]
            if isinstance(value, list):
                return [_to_config_compatible(v) for v in value]
            return value

        def _as_optional_tensor(value: torch.Tensor | list | tuple | None) -> torch.Tensor | None:
            # Normalize list/tuple/tensor input to a detached float tensor (or None).
            if value is None:
                return None
            if isinstance(value, torch.Tensor):
                return value.detach().clone()
            return torch.tensor(value, dtype=torch.float32)

        latents_std_tensor = _as_optional_tensor(latents_std)
        # Ensure config values are JSON-serializable (list/None), even if caller passes torch.Tensors.
        self.register_to_config(
            latents_mean=_to_config_compatible(latents_mean),
            latents_std=_to_config_compatible(latents_std),
        )
        self.encoder_input_size = encoder_input_size
        self.noise_tau = float(noise_tau)
        self.reshape_to_2d = bool(reshape_to_2d)
        self.use_encoder_loss = bool(use_encoder_loss)
        # Validate early, before building the (potentially large) encoder/decoder.
        encoder_patch_size = int(encoder_patch_size)
        if self.encoder_input_size % encoder_patch_size != 0:
            raise ValueError(
                f"encoder_input_size={self.encoder_input_size} must be divisible by encoder_patch_size={encoder_patch_size}."
            )
        decoder_patch_size = int(patch_size)
        if decoder_patch_size <= 0:
            raise ValueError("patch_size must be a positive integer (this is decoder_patch_size).")
        # Frozen representation encoder (built from config, no downloads)
        self.encoder: nn.Module = _build_encoder(
            encoder_type=encoder_type,
            hidden_size=encoder_hidden_size,
            patch_size=encoder_patch_size,
            num_hidden_layers=encoder_num_hidden_layers,
        )
        self._encoder_forward_fn = _ENCODER_FORWARD_FNS[encoder_type]
        num_patches = (self.encoder_input_size // encoder_patch_size) ** 2
        grid = int(sqrt(num_patches))
        if grid * grid != num_patches:
            raise ValueError(f"Computed num_patches={num_patches} must be a perfect square.")
        derived_image_size = decoder_patch_size * grid
        if image_size is None:
            image_size = derived_image_size
        else:
            image_size = int(image_size)
            if image_size != derived_image_size:
                raise ValueError(
                    f"image_size={image_size} must equal decoder_patch_size*sqrt(num_patches)={derived_image_size} "
                    f"for patch_size={decoder_patch_size} and computed num_patches={num_patches}."
                )
        # Encoder input normalization stats (ImageNet defaults)
        if encoder_norm_mean is None:
            encoder_norm_mean = [0.485, 0.456, 0.406]
        if encoder_norm_std is None:
            encoder_norm_std = [0.229, 0.224, 0.225]
        encoder_mean_tensor = torch.tensor(encoder_norm_mean, dtype=torch.float32).view(1, 3, 1, 1)
        encoder_std_tensor = torch.tensor(encoder_norm_std, dtype=torch.float32).view(1, 3, 1, 1)
        self.register_buffer("encoder_mean", encoder_mean_tensor, persistent=True)
        self.register_buffer("encoder_std", encoder_std_tensor, persistent=True)
        # Latent normalization buffers (defaults are no-ops; actual values come from checkpoint)
        latents_mean_tensor = _as_optional_tensor(latents_mean)
        if latents_mean_tensor is None:
            latents_mean_tensor = torch.zeros(1)
        self.register_buffer("_latents_mean", latents_mean_tensor, persistent=True)
        if latents_std_tensor is None:
            latents_std_tensor = torch.ones(1)
        self.register_buffer("_latents_std", latents_std_tensor, persistent=True)
        # ViT-MAE style decoder
        self.decoder = RAEDecoder(
            hidden_size=int(encoder_hidden_size),
            decoder_hidden_size=int(decoder_hidden_size),
            decoder_num_hidden_layers=int(decoder_num_hidden_layers),
            decoder_num_attention_heads=int(decoder_num_attention_heads),
            decoder_intermediate_size=int(decoder_intermediate_size),
            num_patches=int(num_patches),
            patch_size=int(decoder_patch_size),
            num_channels=int(num_channels),
            image_size=int(image_size),
        )
        self.num_patches = int(num_patches)
        self.decoder_patch_size = int(decoder_patch_size)
        self.decoder_image_size = int(image_size)
        # Slicing support (batch dimension) similar to other diffusers autoencoders
        self.use_slicing = False

    def _noising(self, x: torch.Tensor, generator: torch.Generator | None = None) -> torch.Tensor:
        """Add training-time Gaussian noise with a per-sample random sigma drawn from `[0, noise_tau]`."""
        # Per-sample random sigma in [0, noise_tau]
        noise_sigma = self.noise_tau * torch.rand(
            (x.size(0),) + (1,) * (x.ndim - 1), device=x.device, dtype=x.dtype, generator=generator
        )
        return x + noise_sigma * randn_tensor(x.shape, generator=generator, device=x.device, dtype=x.dtype)

    def _resize_and_normalize(self, x: torch.Tensor) -> torch.Tensor:
        """Resize images to the encoder's expected input size and apply channel-wise normalization."""
        _, _, h, w = x.shape
        if h != self.encoder_input_size or w != self.encoder_input_size:
            x = F.interpolate(
                x, size=(self.encoder_input_size, self.encoder_input_size), mode="bicubic", align_corners=False
            )
        mean = self.encoder_mean.to(device=x.device, dtype=x.dtype)
        std = self.encoder_std.to(device=x.device, dtype=x.dtype)
        return (x - mean) / std

    def _denormalize_image(self, x: torch.Tensor) -> torch.Tensor:
        """Invert the channel normalization applied in `_resize_and_normalize` (no resizing)."""
        mean = self.encoder_mean.to(device=x.device, dtype=x.dtype)
        std = self.encoder_std.to(device=x.device, dtype=x.dtype)
        return x * std + mean

    def _normalize_latents(self, z: torch.Tensor) -> torch.Tensor:
        """Standardize latents with the registered mean/std buffers; epsilon guards against division by zero."""
        latents_mean = self._latents_mean.to(device=z.device, dtype=z.dtype)
        latents_std = self._latents_std.to(device=z.device, dtype=z.dtype)
        return (z - latents_mean) / (latents_std + 1e-5)

    def _denormalize_latents(self, z: torch.Tensor) -> torch.Tensor:
        """Invert `_normalize_latents` (uses the same epsilon so the round trip is exact)."""
        latents_mean = self._latents_mean.to(device=z.device, dtype=z.dtype)
        latents_std = self._latents_std.to(device=z.device, dtype=z.dtype)
        return z * (latents_std + 1e-5) + latents_mean

    def _encode(self, x: torch.Tensor, generator: torch.Generator | None = None) -> torch.Tensor:
        """Run the frozen encoder and post-process tokens into (optionally 2D) normalized latents."""
        x = self._resize_and_normalize(x)
        if self.config.encoder_type == "mae":
            # The MAE forward helper takes the encoder patch size as an extra argument.
            tokens = self._encoder_forward_fn(self.encoder, x, self.config.encoder_patch_size)
        else:
            tokens = self._encoder_forward_fn(self.encoder, x) # (B, N, C)
        if self.training and self.noise_tau > 0:
            tokens = self._noising(tokens, generator=generator)
        if self.reshape_to_2d:
            b, n, c = tokens.shape
            side = int(sqrt(n))
            if side * side != n:
                raise ValueError(f"Token length n={n} is not a perfect square; cannot reshape to 2D.")
            z = tokens.transpose(1, 2).contiguous().view(b, c, side, side) # (B, C, h, w)
        else:
            z = tokens
        z = self._normalize_latents(z)
        # Follow diffusers convention: optionally scale latents for diffusion
        if self.config.scaling_factor != 1.0:
            z = z * self.config.scaling_factor
        return z

    @apply_forward_hook
    def encode(
        self, x: torch.Tensor, return_dict: bool = True, generator: torch.Generator | None = None
    ) -> EncoderOutput | tuple[torch.Tensor]:
        """
        Encode a batch of images into latents.

        Args:
            x (`torch.Tensor`): Input image batch of shape `(batch, channels, height, width)`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return an [`EncoderOutput`] instead of a plain tuple.
            generator (`torch.Generator`, *optional*): RNG used for the training-time noising step.
        """
        if self.use_slicing and x.shape[0] > 1:
            # Process samples one at a time to bound peak memory.
            latents = torch.cat([self._encode(x_slice, generator=generator) for x_slice in x.split(1)], dim=0)
        else:
            latents = self._encode(x, generator=generator)
        if not return_dict:
            return (latents,)
        return EncoderOutput(latent=latents)

    def _decode(self, z: torch.Tensor) -> torch.Tensor:
        """Map latents back to a denormalized image via the ViT decoder."""
        # Undo scaling factor if applied at encode time
        if self.config.scaling_factor != 1.0:
            z = z / self.config.scaling_factor
        z = self._denormalize_latents(z)
        if self.reshape_to_2d:
            b, c, h, w = z.shape
            tokens = z.view(b, c, h * w).transpose(1, 2).contiguous() # (B, N, C)
        else:
            tokens = z
        logits = self.decoder(tokens, return_dict=True).logits
        x_rec = self.decoder.unpatchify(logits)
        x_rec = self._denormalize_image(x_rec)
        return x_rec.to(device=z.device)

    @apply_forward_hook
    def decode(self, z: torch.Tensor, return_dict: bool = True) -> DecoderOutput | tuple[torch.Tensor]:
        """
        Decode latents into images.

        Args:
            z (`torch.Tensor`): Latents as produced by [`~AutoencoderRAE.encode`].
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`DecoderOutput`] instead of a plain tuple.
        """
        if self.use_slicing and z.shape[0] > 1:
            decoded = torch.cat([self._decode(z_slice) for z_slice in z.split(1)], dim=0)
        else:
            decoded = self._decode(z)
        if not return_dict:
            return (decoded,)
        return DecoderOutput(sample=decoded)

    def forward(
        self, sample: torch.Tensor, return_dict: bool = True, generator: torch.Generator | None = None
    ) -> DecoderOutput | tuple[torch.Tensor]:
        """Full encode -> decode round trip used for reconstruction."""
        latents = self.encode(sample, return_dict=False, generator=generator)[0]
        decoded = self.decode(latents, return_dict=False)[0]
        if not return_dict:
            return (decoded,)
        return DecoderOutput(sample=decoded)

View File

@@ -56,6 +56,14 @@ else:
"WanImage2VideoModularPipeline",
"Wan22Image2VideoModularPipeline",
]
_import_structure["helios"] = [
"HeliosAutoBlocks",
"HeliosModularPipeline",
"HeliosPyramidAutoBlocks",
"HeliosPyramidDistilledAutoBlocks",
"HeliosPyramidDistilledModularPipeline",
"HeliosPyramidModularPipeline",
]
_import_structure["flux"] = [
"FluxAutoBlocks",
"FluxModularPipeline",
@@ -103,6 +111,14 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
Flux2KleinModularPipeline,
Flux2ModularPipeline,
)
from .helios import (
HeliosAutoBlocks,
HeliosModularPipeline,
HeliosPyramidAutoBlocks,
HeliosPyramidDistilledAutoBlocks,
HeliosPyramidDistilledModularPipeline,
HeliosPyramidModularPipeline,
)
from .modular_pipeline import (
AutoPipelineBlocks,
BlockState,

View File

@@ -0,0 +1,59 @@
from typing import TYPE_CHECKING

from ...utils import (
    DIFFUSERS_SLOW_IMPORT,
    OptionalDependencyNotAvailable,
    _LazyModule,
    get_objects_from_module,
    is_torch_available,
    is_transformers_available,
)

# Dummy placeholders registered when torch/transformers are unavailable.
_dummy_objects = {}
# Lazy-import map: submodule name -> list of public names it exports.
_import_structure = {}
try:
    if not (is_transformers_available() and is_torch_available()):
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from ...utils import dummy_torch_and_transformers_objects # noqa F403
    _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
    _import_structure["modular_blocks_helios"] = ["HeliosAutoBlocks"]
    _import_structure["modular_blocks_helios_pyramid"] = ["HeliosPyramidAutoBlocks"]
    _import_structure["modular_blocks_helios_pyramid_distilled"] = ["HeliosPyramidDistilledAutoBlocks"]
    _import_structure["modular_pipeline"] = [
        "HeliosModularPipeline",
        "HeliosPyramidDistilledModularPipeline",
        "HeliosPyramidModularPipeline",
    ]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
    # Eager imports for static type checkers, or when slow imports are explicitly requested.
    try:
        if not (is_transformers_available() and is_torch_available()):
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        from ...utils.dummy_torch_and_transformers_objects import * # noqa F403
    else:
        from .modular_blocks_helios import HeliosAutoBlocks
        from .modular_blocks_helios_pyramid import HeliosPyramidAutoBlocks
        from .modular_blocks_helios_pyramid_distilled import HeliosPyramidDistilledAutoBlocks
        from .modular_pipeline import (
            HeliosModularPipeline,
            HeliosPyramidDistilledModularPipeline,
            HeliosPyramidModularPipeline,
        )
else:
    import sys

    # Replace this module with a lazy proxy that imports submodules on first attribute access.
    sys.modules[__name__] = _LazyModule(
        __name__,
        globals()["__file__"],
        _import_structure,
        module_spec=__spec__,
    )
    for name, value in _dummy_objects.items():
        setattr(sys.modules[__name__], name, value)

View File

@@ -0,0 +1,836 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import torch
from ...models import HeliosTransformer3DModel
from ...schedulers import HeliosScheduler
from ...utils import logging
from ...utils.torch_utils import randn_tensor
from ..modular_pipeline import ModularPipelineBlocks, PipelineState
from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
from .modular_pipeline import HeliosModularPipeline
# Module-level logger shared by all blocks in this file.
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# Copied from diffusers.pipelines.flux.pipeline_flux.calculate_shift
def calculate_shift(
    image_seq_len,
    base_seq_len: int = 256,
    max_seq_len: int = 4096,
    base_shift: float = 0.5,
    max_shift: float = 1.15,
):
    """Linearly interpolate the timestep-shift parameter `mu` from the image sequence length.

    At `base_seq_len` this returns `base_shift`; at `max_seq_len` it returns `max_shift`;
    values in between (and outside) follow the same straight line.
    """
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    intercept = base_shift - slope * base_seq_len
    return image_seq_len * slope + intercept
class HeliosTextInputStep(ModularPipelineBlocks):
    # Standardizes text-embedding inputs and expands them to the final batch size.
    model_name = "helios"

    @property
    def description(self) -> str:
        return (
            "Input processing step that:\n"
            " 1. Determines `batch_size` and `dtype` based on `prompt_embeds`\n"
            " 2. Adjusts input tensor shapes based on `batch_size` (number of prompts) and `num_videos_per_prompt`\n\n"
            "All input tensors are expected to have either batch_size=1 or match the batch_size\n"
            "of prompt_embeds. The tensors will be duplicated across the batch dimension to\n"
            "have a final batch_size of batch_size * num_videos_per_prompt."
        )

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam(
                "num_videos_per_prompt",
                default=1,
                type_hint=int,
                description="Number of videos to generate per prompt.",
            ),
            InputParam.template("prompt_embeds"),
            InputParam.template("negative_prompt_embeds"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam(
                "batch_size",
                type_hint=int,
                description="Number of prompts, the final batch size of model inputs should be batch_size * num_videos_per_prompt",
            ),
            OutputParam(
                "dtype",
                type_hint=torch.dtype,
                description="Data type of model tensor inputs (determined by `prompt_embeds.dtype`)",
            ),
        ]

    def check_inputs(self, components, block_state):
        # Positive/negative embeddings must be shape-compatible when both are supplied.
        if block_state.prompt_embeds is not None and block_state.negative_prompt_embeds is not None:
            if block_state.prompt_embeds.shape != block_state.negative_prompt_embeds.shape:
                raise ValueError(
                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `prompt_embeds` {block_state.prompt_embeds.shape} != `negative_prompt_embeds`"
                    f" {block_state.negative_prompt_embeds.shape}."
                )

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        block_state = self.get_block_state(state)
        self.check_inputs(components, block_state)
        block_state.batch_size = block_state.prompt_embeds.shape[0]
        block_state.dtype = block_state.prompt_embeds.dtype
        # Duplicate prompt embeddings along the batch axis: (B, L, D) -> (B * num_videos_per_prompt, L, D).
        _, seq_len, _ = block_state.prompt_embeds.shape
        block_state.prompt_embeds = block_state.prompt_embeds.repeat(1, block_state.num_videos_per_prompt, 1)
        block_state.prompt_embeds = block_state.prompt_embeds.view(
            block_state.batch_size * block_state.num_videos_per_prompt, seq_len, -1
        )
        if block_state.negative_prompt_embeds is not None:
            # Apply the same expansion to the negative embeddings, if provided.
            _, seq_len, _ = block_state.negative_prompt_embeds.shape
            block_state.negative_prompt_embeds = block_state.negative_prompt_embeds.repeat(
                1, block_state.num_videos_per_prompt, 1
            )
            block_state.negative_prompt_embeds = block_state.negative_prompt_embeds.view(
                block_state.batch_size * block_state.num_videos_per_prompt, seq_len, -1
            )
        self.set_block_state(state, block_state)
        return components, state
# Copied from diffusers.modular_pipelines.wan.before_denoise.repeat_tensor_to_batch_size
def repeat_tensor_to_batch_size(
    input_name: str,
    input_tensor: torch.Tensor,
    batch_size: int,
    num_videos_per_prompt: int = 1,
) -> torch.Tensor:
    """Repeat tensor elements to match the final batch size.

    This function expands a tensor's batch dimension to match the final batch size (batch_size * num_videos_per_prompt)
    by repeating each element along dimension 0.

    The input tensor must have batch size 1 or batch_size. The function will:
    - If batch size is 1: repeat each element (batch_size * num_videos_per_prompt) times
    - If batch size equals batch_size: repeat each element num_videos_per_prompt times

    Args:
        input_name (str): Name of the input tensor (used for error messages)
        input_tensor (torch.Tensor): The tensor to repeat. Must have batch size 1 or batch_size.
        batch_size (int): The base batch size (number of prompts)
        num_videos_per_prompt (int, optional): Number of videos to generate per prompt. Defaults to 1.

    Returns:
        torch.Tensor: The repeated tensor with final batch size (batch_size * num_videos_per_prompt)

    Raises:
        ValueError: If input_tensor is not a torch.Tensor or has invalid batch size

    Examples:
        tensor = torch.tensor([[1, 2, 3]]) # shape: [1, 3] repeated = repeat_tensor_to_batch_size("image", tensor,
        batch_size=2, num_videos_per_prompt=2) repeated # tensor([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]) - shape:
        [4, 3]

        tensor = torch.tensor([[1, 2, 3], [4, 5, 6]]) # shape: [2, 3] repeated = repeat_tensor_to_batch_size("image",
        tensor, batch_size=2, num_videos_per_prompt=2) repeated # tensor([[1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6]])
        - shape: [4, 3]
    """
    # make sure input is a tensor
    if not isinstance(input_tensor, torch.Tensor):
        raise ValueError(f"`{input_name}` must be a tensor")
    # make sure input tensor e.g. image_latents has batch size 1 or batch_size same as prompts
    if input_tensor.shape[0] == 1:
        repeat_by = batch_size * num_videos_per_prompt
    elif input_tensor.shape[0] == batch_size:
        repeat_by = num_videos_per_prompt
    else:
        # Fix: removed the duplicated word ("have have") from the original error message.
        raise ValueError(
            f"`{input_name}` must have batch size 1 or {batch_size}, but got {input_tensor.shape[0]}"
        )
    # expand the tensor to match the batch_size * num_videos_per_prompt
    input_tensor = input_tensor.repeat_interleave(repeat_by, dim=0)
    return input_tensor
# Copied from diffusers.modular_pipelines.wan.before_denoise.calculate_dimension_from_latents
def calculate_dimension_from_latents(
    latents: torch.Tensor, vae_scale_factor_temporal: int, vae_scale_factor_spatial: int
) -> tuple[int, int, int]:
    """Calculate video dimensions from latent tensor dimensions.
    This function converts latent temporal and spatial dimensions to pixel-space dimensions by scaling the latent
    num_frames/height/width with the corresponding VAE scale factor.
    Args:
        latents (torch.Tensor): The latent tensor. Must have exactly 5 dimensions:
            [batch, channels, frames, height, width]
        vae_scale_factor_temporal (int): The scale factor used by the VAE to compress temporal dimension.
            Typically 4 for most VAEs (video is 4x larger than latents in temporal dimension)
        vae_scale_factor_spatial (int): The scale factor used by the VAE to compress spatial dimension.
            Typically 8 for most VAEs (image is 8x larger than latents in each dimension)
    Returns:
        tuple[int, int, int]: The calculated dimensions as (num_frames, height, width)
    Raises:
        ValueError: If latents tensor doesn't have 5 dimensions
    """
    if latents.ndim != 5:
        raise ValueError(f"latents must have 5 dimensions, but got {latents.ndim}")
    _, _, num_latent_frames, latent_height, latent_width = latents.shape
    # The first latent frame maps to a single pixel frame; each additional one expands by the temporal factor.
    num_frames = (num_latent_frames - 1) * vae_scale_factor_temporal + 1
    height = latent_height * vae_scale_factor_spatial
    width = latent_width * vae_scale_factor_spatial
    return num_frames, height, width
class HeliosAdditionalInputsStep(ModularPipelineBlocks):
"""Configurable step that standardizes inputs for the denoising step.
This step handles:
1. For encoded image latents: Computes height/width from latents and expands batch size
2. For additional_batch_inputs: Expands batch dimensions to match final batch size
"""
model_name = "helios"
def __init__(
self,
image_latent_inputs: list[InputParam] | None = None,
additional_batch_inputs: list[InputParam] | None = None,
):
if image_latent_inputs is None:
image_latent_inputs = [InputParam.template("image_latents")]
if additional_batch_inputs is None:
additional_batch_inputs = []
if not isinstance(image_latent_inputs, list):
raise ValueError(f"image_latent_inputs must be a list, but got {type(image_latent_inputs)}")
else:
for input_param in image_latent_inputs:
if not isinstance(input_param, InputParam):
raise ValueError(f"image_latent_inputs must be a list of InputParam, but got {type(input_param)}")
if not isinstance(additional_batch_inputs, list):
raise ValueError(f"additional_batch_inputs must be a list, but got {type(additional_batch_inputs)}")
else:
for input_param in additional_batch_inputs:
if not isinstance(input_param, InputParam):
raise ValueError(
f"additional_batch_inputs must be a list of InputParam, but got {type(input_param)}"
)
self._image_latent_inputs = image_latent_inputs
self._additional_batch_inputs = additional_batch_inputs
super().__init__()
@property
def description(self) -> str:
summary_section = (
"Input processing step that:\n"
" 1. For image latent inputs: Computes height/width from latents and expands batch size\n"
" 2. For additional batch inputs: Expands batch dimensions to match final batch size"
)
inputs_info = ""
if self._image_latent_inputs or self._additional_batch_inputs:
inputs_info = "\n\nConfigured inputs:"
if self._image_latent_inputs:
inputs_info += f"\n - Image latent inputs: {[p.name for p in self._image_latent_inputs]}"
if self._additional_batch_inputs:
inputs_info += f"\n - Additional batch inputs: {[p.name for p in self._additional_batch_inputs]}"
placement_section = "\n\nThis block should be placed after the encoder steps and the text input step."
return summary_section + inputs_info + placement_section
@property
def inputs(self) -> list[InputParam]:
inputs = [
InputParam(name="num_videos_per_prompt", default=1),
InputParam(name="batch_size", required=True),
]
inputs += self._image_latent_inputs + self._additional_batch_inputs
return inputs
@property
def intermediate_outputs(self) -> list[OutputParam]:
outputs = [
OutputParam("height", type_hint=int),
OutputParam("width", type_hint=int),
]
for input_param in self._image_latent_inputs:
outputs.append(OutputParam(input_param.name, type_hint=torch.Tensor))
for input_param in self._additional_batch_inputs:
outputs.append(OutputParam(input_param.name, type_hint=torch.Tensor))
return outputs
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        """Derive height/width from the image latents and expand all configured inputs to the final batch size.

        NOTE(review): if several image latent inputs are configured, each non-None one overwrites
        block_state.height/width, so the last one wins — confirm that is intended.
        """
        block_state = self.get_block_state(state)
        for input_param in self._image_latent_inputs:
            image_latent_tensor = getattr(block_state, input_param.name)
            if image_latent_tensor is None:
                # Optional input not provided: nothing to derive or expand.
                continue
            # Calculate height/width from latents
            _, height, width = calculate_dimension_from_latents(
                image_latent_tensor, components.vae_scale_factor_temporal, components.vae_scale_factor_spatial
            )
            block_state.height = height
            block_state.width = width
            # Expand batch size
            image_latent_tensor = repeat_tensor_to_batch_size(
                input_name=input_param.name,
                input_tensor=image_latent_tensor,
                num_videos_per_prompt=block_state.num_videos_per_prompt,
                batch_size=block_state.batch_size,
            )
            setattr(block_state, input_param.name, image_latent_tensor)
        for input_param in self._additional_batch_inputs:
            input_tensor = getattr(block_state, input_param.name)
            if input_tensor is None:
                continue
            # Additional inputs only need batch expansion — no dimension inference.
            input_tensor = repeat_tensor_to_batch_size(
                input_name=input_param.name,
                input_tensor=input_tensor,
                num_videos_per_prompt=block_state.num_videos_per_prompt,
                batch_size=block_state.batch_size,
            )
            setattr(block_state, input_param.name, input_tensor)
        self.set_block_state(state, block_state)
        return components, state
class HeliosAddNoiseToImageLatentsStep(ModularPipelineBlocks):
    """Adds noise to image_latents and fake_image_latents for I2V conditioning.

    Applies single-sigma noise to image_latents (using image_noise_sigma range) and single-sigma noise to
    fake_image_latents (using video_noise_sigma range).
    """

    model_name = "helios"

    @property
    def description(self) -> str:
        return (
            "Adds noise to image_latents and fake_image_latents for I2V conditioning. "
            "Uses random sigma from configured ranges for each."
        )

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam.template("image_latents"),
            InputParam(
                "fake_image_latents",
                required=True,
                type_hint=torch.Tensor,
                description="Fake image latents used as history seed for I2V generation.",
            ),
            InputParam(
                "image_noise_sigma_min",
                default=0.111,
                type_hint=float,
                description="Minimum sigma for image latent noise.",
            ),
            InputParam(
                "image_noise_sigma_max",
                default=0.135,
                type_hint=float,
                description="Maximum sigma for image latent noise.",
            ),
            InputParam(
                "video_noise_sigma_min",
                default=0.111,
                type_hint=float,
                description="Minimum sigma for video/fake-image latent noise.",
            ),
            InputParam(
                "video_noise_sigma_max",
                default=0.135,
                type_hint=float,
                description="Maximum sigma for video/fake-image latent noise.",
            ),
            InputParam.template("generator"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam.template("image_latents"),
            OutputParam("fake_image_latents", type_hint=torch.Tensor, description="Noisy fake image latents"),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        """Blend random Gaussian noise into image_latents and fake_image_latents.

        For each tensor a single sigma is drawn uniformly from its configured [min, max] range
        and the latents become `sigma * noise + (1 - sigma) * latents`.
        """
        block_state = self.get_block_state(state)
        device = components._execution_device
        image_latents = block_state.image_latents
        fake_image_latents = block_state.fake_image_latents
        # Add noise to image_latents
        # NOTE(review): torch.rand(..., device=device, generator=...) requires the generator to live on
        # `device`; a CPU generator with a CUDA execution device would raise — confirm generator placement.
        image_noise_sigma = (
            torch.rand(1, device=device, generator=block_state.generator)
            * (block_state.image_noise_sigma_max - block_state.image_noise_sigma_min)
            + block_state.image_noise_sigma_min
        )
        # Convex blend: sigma * N(0, I) + (1 - sigma) * latents.
        image_latents = (
            image_noise_sigma * randn_tensor(image_latents.shape, generator=block_state.generator, device=device)
            + (1 - image_noise_sigma) * image_latents
        )
        # Add noise to fake_image_latents
        fake_image_noise_sigma = (
            torch.rand(1, device=device, generator=block_state.generator)
            * (block_state.video_noise_sigma_max - block_state.video_noise_sigma_min)
            + block_state.video_noise_sigma_min
        )
        fake_image_latents = (
            fake_image_noise_sigma
            * randn_tensor(fake_image_latents.shape, generator=block_state.generator, device=device)
            + (1 - fake_image_noise_sigma) * fake_image_latents
        )
        # Downstream blocks expect float32 latents on the execution device.
        block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
        block_state.fake_image_latents = fake_image_latents.to(device=device, dtype=torch.float32)
        self.set_block_state(state, block_state)
        return components, state
class HeliosAddNoiseToVideoLatentsStep(ModularPipelineBlocks):
    """Adds noise to image_latents and video_latents for V2V conditioning.

    Applies single-sigma noise to image_latents (using image_noise_sigma range) and per-frame noise to video_latents in
    chunks (using video_noise_sigma range).
    """

    model_name = "helios"

    @property
    def description(self) -> str:
        return (
            "Adds noise to image_latents and video_latents for V2V conditioning. "
            "Uses single-sigma noise for image_latents and per-frame noise for video chunks."
        )

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam.template("image_latents"),
            InputParam(
                "video_latents",
                required=True,
                type_hint=torch.Tensor,
                description="Encoded video latents for V2V generation.",
            ),
            InputParam(
                "num_latent_frames_per_chunk",
                default=9,
                type_hint=int,
                description="Number of latent frames per temporal chunk.",
            ),
            InputParam(
                "image_noise_sigma_min",
                default=0.111,
                type_hint=float,
                description="Minimum sigma for image latent noise.",
            ),
            InputParam(
                "image_noise_sigma_max",
                default=0.135,
                type_hint=float,
                description="Maximum sigma for image latent noise.",
            ),
            InputParam(
                "video_noise_sigma_min",
                default=0.111,
                type_hint=float,
                description="Minimum sigma for video latent noise.",
            ),
            InputParam(
                "video_noise_sigma_max",
                default=0.135,
                type_hint=float,
                description="Maximum sigma for video latent noise.",
            ),
            InputParam.template("generator"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam.template("image_latents"),
            OutputParam("video_latents", type_hint=torch.Tensor, description="Noisy video latents"),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        """Blend noise into image_latents (one sigma) and into video_latents per frame, chunk by chunk."""
        block_state = self.get_block_state(state)
        device = components._execution_device
        image_latents = block_state.image_latents
        video_latents = block_state.video_latents
        num_latent_frames_per_chunk = block_state.num_latent_frames_per_chunk
        # Add noise to first frame (single sigma)
        image_noise_sigma = (
            torch.rand(1, device=device, generator=block_state.generator)
            * (block_state.image_noise_sigma_max - block_state.image_noise_sigma_min)
            + block_state.image_noise_sigma_min
        )
        image_latents = (
            image_noise_sigma * randn_tensor(image_latents.shape, generator=block_state.generator, device=device)
            + (1 - image_noise_sigma) * image_latents
        )
        # Add per-frame noise to video chunks
        # NOTE(review): frames beyond num_latent_chunks * num_latent_frames_per_chunk are silently dropped
        # by the floor division below — confirm upstream guarantees divisibility.
        noisy_latents_chunks = []
        num_latent_chunks = video_latents.shape[2] // num_latent_frames_per_chunk
        for i in range(num_latent_chunks):
            chunk_start = i * num_latent_frames_per_chunk
            chunk_end = chunk_start + num_latent_frames_per_chunk
            latent_chunk = video_latents[:, :, chunk_start:chunk_end, :, :]
            chunk_frames = latent_chunk.shape[2]
            # Draw an independent sigma for every frame in the chunk.
            frame_sigmas = (
                torch.rand(chunk_frames, device=device, generator=block_state.generator)
                * (block_state.video_noise_sigma_max - block_state.video_noise_sigma_min)
                + block_state.video_noise_sigma_min
            )
            # Broadcast sigmas over (batch, channel, frame, H, W).
            frame_sigmas = frame_sigmas.view(1, 1, chunk_frames, 1, 1)
            noisy_chunk = (
                frame_sigmas * randn_tensor(latent_chunk.shape, generator=block_state.generator, device=device)
                + (1 - frame_sigmas) * latent_chunk
            )
            noisy_latents_chunks.append(noisy_chunk)
        video_latents = torch.cat(noisy_latents_chunks, dim=2)
        # Downstream blocks expect float32 latents on the execution device.
        block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
        block_state.video_latents = video_latents.to(device=device, dtype=torch.float32)
        self.set_block_state(state, block_state)
        return components, state
class HeliosPrepareHistoryStep(ModularPipelineBlocks):
    """Prepares chunk/history indices and initializes history state for the chunk loop."""

    model_name = "helios"

    @property
    def description(self) -> str:
        return (
            "Prepares the chunk loop by computing latent dimensions, number of chunks, "
            "history indices, and initializing history state (history_latents, image_latents, latent_chunks)."
        )

    @property
    def expected_components(self) -> list[ComponentSpec]:
        return [
            ComponentSpec("transformer", HeliosTransformer3DModel),
        ]

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam.template("height", default=384),
            InputParam.template("width", default=640),
            InputParam(
                "num_frames", default=132, type_hint=int, description="Total number of video frames to generate."
            ),
            InputParam("batch_size", required=True, type_hint=int),
            InputParam(
                "num_latent_frames_per_chunk",
                default=9,
                type_hint=int,
                description="Number of latent frames per temporal chunk.",
            ),
            InputParam(
                "history_sizes",
                default=[16, 2, 1],
                type_hint=list,
                description="Sizes of long/mid/short history buffers for temporal context.",
            ),
            InputParam(
                "keep_first_frame",
                default=True,
                type_hint=bool,
                description="Whether to keep the first frame as a prefix in history.",
            ),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam("num_latent_chunk", type_hint=int, description="Number of temporal chunks"),
            OutputParam("latent_shape", type_hint=tuple, description="Shape of latent tensor per chunk"),
            OutputParam("history_sizes", type_hint=list, description="Adjusted history sizes (sorted, descending)"),
            OutputParam("indices_hidden_states", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
            OutputParam("indices_latents_history_short", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
            OutputParam("indices_latents_history_mid", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
            OutputParam("indices_latents_history_long", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
            OutputParam("history_latents", type_hint=torch.Tensor, description="Initialized zero history latents"),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        """Compute the chunk count, per-chunk latent shape, frame-index layout, and a zeroed history buffer."""
        block_state = self.get_block_state(state)
        batch_size = block_state.batch_size
        device = components._execution_device
        # Guard against non-positive frame counts.
        block_state.num_frames = max(block_state.num_frames, 1)
        # Largest buffer first: long, mid, short history.
        history_sizes = sorted(block_state.history_sizes, reverse=True)
        num_channels_latents = components.num_channels_latents
        h_latent = block_state.height // components.vae_scale_factor_spatial
        w_latent = block_state.width // components.vae_scale_factor_spatial
        # Compute number of chunks
        # Pixel frames produced by one chunk of latent frames.
        block_state.window_num_frames = (
            block_state.num_latent_frames_per_chunk - 1
        ) * components.vae_scale_factor_temporal + 1
        # Ceil-divide total frames by the window size; always at least one chunk.
        block_state.num_latent_chunk = max(
            1, (block_state.num_frames + block_state.window_num_frames - 1) // block_state.window_num_frames
        )
        # Modify history_sizes for non-keep_first_frame (matching pipeline behavior)
        if not block_state.keep_first_frame:
            # Without a dedicated first-frame prefix, the short history absorbs one extra slot.
            history_sizes = history_sizes.copy()
            history_sizes[-1] = history_sizes[-1] + 1
        # Compute indices ONCE (same structure for all chunks)
        if block_state.keep_first_frame:
            # Frame layout: [first-frame prefix | long | mid | short(1x) | current chunk].
            indices = torch.arange(0, sum([1, *history_sizes, block_state.num_latent_frames_per_chunk]))
            (
                indices_prefix,
                indices_latents_history_long,
                indices_latents_history_mid,
                indices_latents_history_1x,
                indices_hidden_states,
            ) = indices.split([1, *history_sizes, block_state.num_latent_frames_per_chunk], dim=0)
            # The prefix frame is folded into the short-history index set.
            indices_latents_history_short = torch.cat([indices_prefix, indices_latents_history_1x], dim=0)
        else:
            # Frame layout: [long | mid | short | current chunk].
            indices = torch.arange(0, sum([*history_sizes, block_state.num_latent_frames_per_chunk]))
            (
                indices_latents_history_long,
                indices_latents_history_mid,
                indices_latents_history_short,
                indices_hidden_states,
            ) = indices.split([*history_sizes, block_state.num_latent_frames_per_chunk], dim=0)
        # Latent shape per chunk
        block_state.latent_shape = (
            batch_size,
            num_channels_latents,
            block_state.num_latent_frames_per_chunk,
            h_latent,
            w_latent,
        )
        # Set outputs
        block_state.history_sizes = history_sizes
        # Index tensors get a leading dim of 1 (shape (1, n)) for the denoiser.
        block_state.indices_hidden_states = indices_hidden_states.unsqueeze(0)
        block_state.indices_latents_history_short = indices_latents_history_short.unsqueeze(0)
        block_state.indices_latents_history_mid = indices_latents_history_mid.unsqueeze(0)
        block_state.indices_latents_history_long = indices_latents_history_long.unsqueeze(0)
        # History starts as zeros spanning the combined long/mid/short buffers.
        block_state.history_latents = torch.zeros(
            batch_size,
            num_channels_latents,
            sum(history_sizes),
            h_latent,
            w_latent,
            device=device,
            dtype=torch.float32,
        )
        self.set_block_state(state, block_state)
        return components, state
class HeliosI2VSeedHistoryStep(ModularPipelineBlocks):
    """Appends fake_image_latents to the freshly initialized history buffer for I2V pipelines.

    Runs right after HeliosPrepareHistoryStep so the history tensor starts out seeded with the
    encoded still image.
    """

    model_name = "helios"

    @property
    def description(self) -> str:
        return "I2V history seeding: appends fake_image_latents to history_latents."

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam("history_latents", required=True, type_hint=torch.Tensor),
            InputParam("fake_image_latents", required=True, type_hint=torch.Tensor),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam(
                "history_latents", type_hint=torch.Tensor, description="History latents seeded with fake_image_latents"
            ),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        block_state = self.get_block_state(state)
        # Concatenate along the frame axis so the seed frame sits at the end of history.
        seeded_history = torch.cat((block_state.history_latents, block_state.fake_image_latents), dim=2)
        block_state.history_latents = seeded_history
        self.set_block_state(state, block_state)
        return components, state
class HeliosV2VSeedHistoryStep(ModularPipelineBlocks):
    """Seeds history_latents with video_latents for V2V pipelines.

    Runs after HeliosPrepareHistoryStep; the tail of history_latents is overwritten with
    video_latents, and when the video is shorter than the history, the head of history is kept.
    """

    model_name = "helios"

    @property
    def description(self) -> str:
        return "V2V history seeding: replaces the tail of history_latents with video_latents."

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam("history_latents", required=True, type_hint=torch.Tensor),
            InputParam("video_latents", required=True, type_hint=torch.Tensor),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam(
                "history_latents", type_hint=torch.Tensor, description="History latents seeded with video_latents"
            ),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        block_state = self.get_block_state(state)
        history = block_state.history_latents
        video = block_state.video_latents
        num_history = history.shape[2]
        num_video = video.shape[2]
        if num_video >= num_history:
            # The video covers the entire history window: replace it wholesale.
            seeded = video
        else:
            # Preserve the head of history and overwrite only the tail with the video frames.
            prefix = history[:, :, : num_history - num_video, :, :]
            seeded = torch.cat((prefix, video), dim=2)
        block_state.history_latents = seeded
        self.set_block_state(state, block_state)
        return components, state
class HeliosSetTimestepsStep(ModularPipelineBlocks):
    """Computes scheduler parameters (mu and the default sigma schedule) for the chunk loop."""

    model_name = "helios"

    @property
    def description(self) -> str:
        return "Computes scheduler shift parameter (mu) and default sigmas for the Helios chunk loop."

    @property
    def expected_components(self) -> list[ComponentSpec]:
        return [
            ComponentSpec("transformer", HeliosTransformer3DModel),
            ComponentSpec("scheduler", HeliosScheduler),
        ]

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam("latent_shape", required=True, type_hint=tuple),
            InputParam.template("num_inference_steps"),
            InputParam.template("sigmas"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam("mu", type_hint=float, description="Scheduler shift parameter"),
            OutputParam("sigmas", type_hint=list, description="Sigma schedule for diffusion"),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        block_state = self.get_block_state(state)
        shape = block_state.latent_shape
        patch = components.transformer.config.patch_size
        # Tokens per chunk after 3D patchification: (frames * height * width) / patch volume.
        image_seq_len = (shape[-1] * shape[-2] * shape[-3]) // (patch[0] * patch[1] * patch[2])
        if block_state.sigmas is None:
            # Default: linearly decreasing sigmas starting at 0.999, dropping the trailing 0.0.
            block_state.sigmas = np.linspace(0.999, 0.0, block_state.num_inference_steps + 1)[:-1]
        scheduler_config = components.scheduler.config
        block_state.mu = calculate_shift(
            image_seq_len,
            scheduler_config.get("base_image_seq_len", 256),
            scheduler_config.get("max_image_seq_len", 4096),
            scheduler_config.get("base_shift", 0.5),
            scheduler_config.get("max_shift", 1.15),
        )
        self.set_block_state(state, block_state)
        return components, state

View File

@@ -0,0 +1,110 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import PIL
import torch
from ...configuration_utils import FrozenDict
from ...models import AutoencoderKLWan
from ...utils import logging
from ...video_processor import VideoProcessor
from ..modular_pipeline import ModularPipelineBlocks, PipelineState
from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class HeliosDecodeStep(ModularPipelineBlocks):
    """Decode all chunk latents with VAE, trim frames, and postprocess into final video output."""

    model_name = "helios"

    @property
    def description(self) -> str:
        return (
            "Decodes all chunk latents with the VAE, concatenates them, "
            "trims to the target frame count, and postprocesses into the final video output."
        )

    @property
    def expected_components(self) -> list[ComponentSpec]:
        return [
            ComponentSpec("vae", AutoencoderKLWan),
            ComponentSpec(
                "video_processor",
                VideoProcessor,
                config=FrozenDict({"vae_scale_factor": 8}),
                default_creation_method="from_config",
            ),
        ]

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam(
                "latent_chunks", required=True, type_hint=list, description="List of per-chunk denoised latent tensors"
            ),
            InputParam("num_frames", required=True, type_hint=int, description="The target number of output frames"),
            InputParam.template("output_type", default="np"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam(
                "videos",
                type_hint=list[list[PIL.Image.Image]] | list[torch.Tensor] | list[np.ndarray],
                description="The generated videos, can be a PIL.Image.Image, torch.Tensor or a numpy array",
            ),
        ]

    @torch.no_grad()
    def __call__(self, components, state: PipelineState) -> PipelineState:
        """Decode each latent chunk, concatenate along time, trim, and postprocess to the output type."""
        block_state = self.get_block_state(state)
        vae = components.vae
        # latents_mean is additive; latents_std holds the RECIPROCAL std, so the division below
        # inverts the encoder-side normalization (x - mean) * (1 / std).
        latents_mean = (
            torch.tensor(vae.config.latents_mean).view(1, vae.config.z_dim, 1, 1, 1).to(vae.device, vae.dtype)
        )
        latents_std = 1.0 / torch.tensor(vae.config.latents_std).view(1, vae.config.z_dim, 1, 1, 1).to(
            vae.device, vae.dtype
        )
        history_video = None
        for chunk_latents in block_state.latent_chunks:
            # Denormalize back into the VAE latent space before decoding.
            current_latents = chunk_latents.to(vae.dtype) / latents_std + latents_mean
            current_video = vae.decode(current_latents, return_dict=False)[0]
            if history_video is None:
                history_video = current_video
            else:
                # Stitch decoded chunks together along the frame (time) dimension.
                history_video = torch.cat([history_video, current_video], dim=2)
        # Trim to proper frame count
        # Snap the frame count down onto the VAE's temporal grid: k * vae_scale_factor_temporal + 1.
        generated_frames = history_video.size(2)
        generated_frames = (
            generated_frames - 1
        ) // components.vae_scale_factor_temporal * components.vae_scale_factor_temporal + 1
        history_video = history_video[:, :, :generated_frames]
        # NOTE(review): the required `num_frames` input is declared but never read here — confirm
        # whether the trim above should use it instead of the temporal-grid snap.
        block_state.videos = components.video_processor.postprocess_video(
            history_video, output_type=block_state.output_type
        )
        self.set_block_state(state, block_state)
        return components, state

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,392 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import html
import regex as re
import torch
from transformers import AutoTokenizer, UMT5EncoderModel
from ...configuration_utils import FrozenDict
from ...guiders import ClassifierFreeGuidance
from ...models import AutoencoderKLWan
from ...utils import is_ftfy_available, logging
from ...video_processor import VideoProcessor
from ..modular_pipeline import ModularPipelineBlocks, PipelineState
from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
from .modular_pipeline import HeliosModularPipeline
if is_ftfy_available():
import ftfy
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
def basic_clean(text):
    """Fix mojibake with ftfy, unescape doubly-escaped HTML entities, and strip surrounding whitespace."""
    fixed = ftfy.fix_text(text)
    # Two passes of unescape handle double-encoded entities like "&amp;amp;".
    fixed = html.unescape(html.unescape(fixed))
    return fixed.strip()
def whitespace_clean(text):
    """Collapse every run of whitespace to a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()
def prompt_clean(text):
    """Normalize a prompt: fix encoding/HTML artifacts, then collapse whitespace."""
    return whitespace_clean(basic_clean(text))
def get_t5_prompt_embeds(
    text_encoder: UMT5EncoderModel,
    tokenizer: AutoTokenizer,
    prompt: str | list[str],
    max_sequence_length: int,
    device: torch.device,
    dtype: torch.dtype | None = None,
):
    """Encode text prompts into T5 embeddings for Helios.

    Args:
        text_encoder: The T5 text encoder model.
        tokenizer: The tokenizer for the text encoder.
        prompt: The prompt or prompts to encode.
        max_sequence_length: Maximum sequence length for tokenization.
        device: Device to place tensors on.
        dtype: Optional dtype override. Defaults to `text_encoder.dtype`.

    Returns:
        A tuple of `(prompt_embeds, attention_mask)` where `prompt_embeds` is the encoded text embeddings and
        `attention_mask` is a boolean mask.
    """
    dtype = dtype or text_encoder.dtype
    prompt = [prompt] if isinstance(prompt, str) else prompt
    prompt = [prompt_clean(u) for u in prompt]
    text_inputs = tokenizer(
        prompt,
        padding="max_length",
        max_length=max_sequence_length,
        truncation=True,
        add_special_tokens=True,
        return_attention_mask=True,
        return_tensors="pt",
    )
    text_input_ids, mask = text_inputs.input_ids, text_inputs.attention_mask
    # Actual (unpadded) token count per sequence.
    seq_lens = mask.gt(0).sum(dim=1).long()
    prompt_embeds = text_encoder(text_input_ids.to(device), mask.to(device)).last_hidden_state
    prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
    # Drop embeddings at padded positions, then zero-pad back to max_sequence_length so that
    # every padded position is exactly zero (rather than whatever the encoder emitted).
    prompt_embeds = [u[:v] for u, v in zip(prompt_embeds, seq_lens)]
    prompt_embeds = torch.stack(
        [torch.cat([u, u.new_zeros(max_sequence_length - u.size(0), u.size(1))]) for u in prompt_embeds], dim=0
    )
    return prompt_embeds, text_inputs.attention_mask.bool()
class HeliosTextEncoderStep(ModularPipelineBlocks):
    """Text encoder step producing T5 prompt (and, when required, negative prompt) embeddings."""

    model_name = "helios"

    @property
    def description(self) -> str:
        return "Text Encoder step that generates text embeddings to guide the video generation"

    @property
    def expected_components(self) -> list[ComponentSpec]:
        return [
            ComponentSpec("text_encoder", UMT5EncoderModel),
            ComponentSpec("tokenizer", AutoTokenizer),
            ComponentSpec(
                "guider",
                ClassifierFreeGuidance,
                config=FrozenDict({"guidance_scale": 5.0}),
                default_creation_method="from_config",
            ),
        ]

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam.template("prompt"),
            InputParam.template("negative_prompt"),
            InputParam.template("max_sequence_length"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam.template("prompt_embeds"),
            OutputParam.template("negative_prompt_embeds"),
        ]

    @staticmethod
    def check_inputs(prompt, negative_prompt):
        """Validate prompt/negative_prompt types and matching batch sizes.

        Raises:
            ValueError: if a prompt has the wrong type or the batch sizes differ.
        """
        if prompt is not None and not isinstance(prompt, (str, list)):
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
        if negative_prompt is not None and not isinstance(negative_prompt, (str, list)):
            raise ValueError(f"`negative_prompt` has to be of type `str` or `list` but is {type(negative_prompt)}")
        if prompt is not None and negative_prompt is not None:
            # Normalize both to lists so batch sizes can be compared uniformly.
            # (A previous `type(prompt_list) is not type(neg_list)` check was removed: after this
            # normalization both values are plain lists, so that branch was unreachable.)
            prompt_list = [prompt] if isinstance(prompt, str) else prompt
            neg_list = [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
            if len(prompt_list) != len(neg_list):
                raise ValueError(
                    f"`negative_prompt` has batch size {len(neg_list)}, but `prompt` has batch size"
                    f" {len(prompt_list)}. Please make sure that passed `negative_prompt` matches"
                    " the batch size of `prompt`."
                )

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        """Encode the prompt (and the negative prompt when the guider requires unconditional embeds)."""
        block_state = self.get_block_state(state)
        prompt = block_state.prompt
        negative_prompt = block_state.negative_prompt
        max_sequence_length = block_state.max_sequence_length
        device = components._execution_device
        self.check_inputs(prompt, negative_prompt)
        # Encode prompt
        block_state.prompt_embeds, _ = get_t5_prompt_embeds(
            text_encoder=components.text_encoder,
            tokenizer=components.tokenizer,
            prompt=prompt,
            max_sequence_length=max_sequence_length,
            device=device,
        )
        # Encode negative prompt
        block_state.negative_prompt_embeds = None
        if components.requires_unconditional_embeds:
            # Default to the empty prompt, broadcast a single negative prompt across the batch.
            negative_prompt = negative_prompt or ""
            if isinstance(prompt, list) and isinstance(negative_prompt, str):
                negative_prompt = len(prompt) * [negative_prompt]
            block_state.negative_prompt_embeds, _ = get_t5_prompt_embeds(
                text_encoder=components.text_encoder,
                tokenizer=components.tokenizer,
                prompt=negative_prompt,
                max_sequence_length=max_sequence_length,
                device=device,
            )
        self.set_block_state(state, block_state)
        return components, state
class HeliosImageVaeEncoderStep(ModularPipelineBlocks):
    """Encodes an input image into VAE latent space for image-to-video generation."""

    model_name = "helios"

    @property
    def description(self) -> str:
        return (
            "Image Encoder step that encodes an input image into VAE latent space, "
            "producing image_latents (first frame prefix) and fake_image_latents (history seed) "
            "for image-to-video generation."
        )

    @property
    def expected_components(self) -> list[ComponentSpec]:
        return [
            ComponentSpec("vae", AutoencoderKLWan),
            ComponentSpec(
                "video_processor",
                VideoProcessor,
                config=FrozenDict({"vae_scale_factor": 8}),
                default_creation_method="from_config",
            ),
        ]

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam.template("image"),
            InputParam.template("height", default=384),
            InputParam.template("width", default=640),
            InputParam(
                "num_latent_frames_per_chunk",
                default=9,
                type_hint=int,
                description="Number of latent frames per temporal chunk.",
            ),
            InputParam.template("generator"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam.template("image_latents"),
            OutputParam(
                "fake_image_latents", type_hint=torch.Tensor, description="Fake image latents for history seeding"
            ),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        """Encode the still image into `image_latents` and a one-frame `fake_image_latents` history seed."""
        block_state = self.get_block_state(state)
        vae = components.vae
        device = components._execution_device
        # latents_mean is additive; latents_std holds the RECIPROCAL std, so encoding normalizes
        # via (x - mean) * latents_std.
        latents_mean = (
            torch.tensor(vae.config.latents_mean).view(1, vae.config.z_dim, 1, 1, 1).to(vae.device, vae.dtype)
        )
        latents_std = 1.0 / torch.tensor(vae.config.latents_std).view(1, vae.config.z_dim, 1, 1, 1).to(
            vae.device, vae.dtype
        )
        # Preprocess image to 4D tensor (B, C, H, W)
        image = components.video_processor.preprocess(
            block_state.image, height=block_state.height, width=block_state.width
        )
        image_5d = image.unsqueeze(2).to(device=device, dtype=vae.dtype)  # (B, C, 1, H, W)
        # Encode image to get image_latents
        image_latents = vae.encode(image_5d).latent_dist.sample(generator=block_state.generator)
        image_latents = (image_latents - latents_mean) * latents_std
        # Encode fake video to get fake_image_latents
        # Repeat the still image to the minimum pixel-frame count of one chunk, encode it, and keep
        # only the last latent frame as the history seed.
        min_frames = (block_state.num_latent_frames_per_chunk - 1) * components.vae_scale_factor_temporal + 1
        fake_video = image_5d.repeat(1, 1, min_frames, 1, 1)  # (B, C, min_frames, H, W)
        fake_latents_full = vae.encode(fake_video).latent_dist.sample(generator=block_state.generator)
        fake_latents_full = (fake_latents_full - latents_mean) * latents_std
        fake_image_latents = fake_latents_full[:, :, -1:, :, :]
        # Downstream blocks expect float32 latents on the execution device.
        block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
        block_state.fake_image_latents = fake_image_latents.to(device=device, dtype=torch.float32)
        self.set_block_state(state, block_state)
        return components, state
class HeliosVideoVaeEncoderStep(ModularPipelineBlocks):
    """Encodes an input video into VAE latent space for video-to-video generation.

    Produces `image_latents` (first frame) and `video_latents` (remaining frames encoded in chunks).
    """

    model_name = "helios"

    @property
    def description(self) -> str:
        return (
            "Video Encoder step that encodes an input video into VAE latent space, "
            "producing image_latents (first frame) and video_latents (chunked video frames) "
            "for video-to-video generation."
        )

    @property
    def expected_components(self) -> list[ComponentSpec]:
        return [
            ComponentSpec("vae", AutoencoderKLWan),
            ComponentSpec(
                "video_processor",
                VideoProcessor,
                config=FrozenDict({"vae_scale_factor": 8}),
                default_creation_method="from_config",
            ),
        ]

    @property
    def inputs(self) -> list[InputParam]:
        return [
            InputParam("video", required=True, description="Input video for video-to-video generation"),
            InputParam.template("height", default=384),
            InputParam.template("width", default=640),
            InputParam(
                "num_latent_frames_per_chunk",
                default=9,
                type_hint=int,
                description="Number of latent frames per temporal chunk.",
            ),
            InputParam.template("generator"),
        ]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [
            OutputParam.template("image_latents"),
            OutputParam("video_latents", type_hint=torch.Tensor, description="Encoded video latents (chunked)"),
        ]

    @torch.no_grad()
    def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
        """Encode the first frame into `image_latents` and the trailing whole chunks into `video_latents`."""
        block_state = self.get_block_state(state)
        vae = components.vae
        device = components._execution_device
        num_latent_frames_per_chunk = block_state.num_latent_frames_per_chunk
        # latents_mean is additive; latents_std holds the RECIPROCAL std, so encoding normalizes
        # via (x - mean) * latents_std.
        latents_mean = (
            torch.tensor(vae.config.latents_mean).view(1, vae.config.z_dim, 1, 1, 1).to(vae.device, vae.dtype)
        )
        latents_std = 1.0 / torch.tensor(vae.config.latents_std).view(1, vae.config.z_dim, 1, 1, 1).to(
            vae.device, vae.dtype
        )
        # Preprocess video
        video = components.video_processor.preprocess_video(
            block_state.video, height=block_state.height, width=block_state.width
        )
        video = video.to(device=device, dtype=vae.dtype)
        # Encode video into latents
        num_frames = video.shape[2]
        # Pixel frames needed to produce one chunk of latent frames. This was previously hard-coded
        # as (n - 1) * 4 + 1; use the pipeline's temporal VAE scale factor instead, for consistency
        # with HeliosImageVaeEncoderStep and HeliosPrepareHistoryStep.
        min_frames = (num_latent_frames_per_chunk - 1) * components.vae_scale_factor_temporal + 1
        num_chunks = num_frames // min_frames
        if num_chunks == 0:
            # Error out as early as possible when the video is too short for even one chunk.
            raise ValueError(
                f"Video must have at least {min_frames} frames "
                f"(got {num_frames} frames). "
                f"Required: (num_latent_frames_per_chunk - 1) * vae_scale_factor_temporal + 1 = "
                f"({num_latent_frames_per_chunk} - 1) * {components.vae_scale_factor_temporal} + 1 = {min_frames}"
            )
        # Only the trailing num_chunks * min_frames frames are encoded; leading extras are skipped.
        total_valid_frames = num_chunks * min_frames
        start_frame = num_frames - total_valid_frames
        # Encode first frame
        first_frame = video[:, :, 0:1, :, :]
        image_latents = vae.encode(first_frame).latent_dist.sample(generator=block_state.generator)
        image_latents = (image_latents - latents_mean) * latents_std
        # Encode remaining frames in chunks
        latents_chunks = []
        for i in range(num_chunks):
            chunk_start = start_frame + i * min_frames
            chunk_end = chunk_start + min_frames
            video_chunk = video[:, :, chunk_start:chunk_end, :, :]
            chunk_latents = vae.encode(video_chunk).latent_dist.sample(generator=block_state.generator)
            chunk_latents = (chunk_latents - latents_mean) * latents_std
            latents_chunks.append(chunk_latents)
        video_latents = torch.cat(latents_chunks, dim=2)
        # Downstream blocks expect float32 latents on the execution device.
        block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
        block_state.video_latents = video_latents.to(device=device, dtype=torch.float32)
        self.set_block_state(state, block_state)
        return components, state

View File

@@ -0,0 +1,542 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from ...utils import logging
from ..modular_pipeline import AutoPipelineBlocks, ConditionalPipelineBlocks, SequentialPipelineBlocks
from ..modular_pipeline_utils import InputParam, InsertableDict, OutputParam
from .before_denoise import (
HeliosAdditionalInputsStep,
HeliosAddNoiseToImageLatentsStep,
HeliosAddNoiseToVideoLatentsStep,
HeliosI2VSeedHistoryStep,
HeliosPrepareHistoryStep,
HeliosSetTimestepsStep,
HeliosTextInputStep,
HeliosV2VSeedHistoryStep,
)
from .decoders import HeliosDecodeStep
from .denoise import HeliosChunkDenoiseStep, HeliosI2VChunkDenoiseStep
from .encoders import HeliosImageVaeEncoderStep, HeliosTextEncoderStep, HeliosVideoVaeEncoderStep
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# ====================
# 1. Vae Encoder
# ====================
# auto_docstring
class HeliosAutoVaeEncoderStep(AutoPipelineBlocks):
    """
    Encoder step that encodes video or image inputs. This is an auto pipeline block.

      - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.
      - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.
      - If neither is provided, step will be skipped.

    Components:
        vae (`AutoencoderKLWan`) video_processor (`VideoProcessor`)

    Inputs:
        video (`None`, *optional*):
            Input video for video-to-video generation
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        image (`Image | list`, *optional*):
            Reference image(s) for denoising. Can be a single image or list of images.

    Outputs:
        image_latents (`Tensor`):
            The latent representation of the input image.
        video_latents (`Tensor`):
            Encoded video latents (chunked)
        fake_image_latents (`Tensor`):
            Fake image latents for history seeding
    """

    # Candidate sub-blocks; AutoPipelineBlocks dispatches to the first block whose
    # trigger input (same index in block_trigger_inputs) is present at runtime.
    block_classes = [HeliosVideoVaeEncoderStep, HeliosImageVaeEncoderStep]
    block_names = ["video_encoder", "image_encoder"]
    # Order matters: `video` takes precedence over `image` when both are supplied.
    block_trigger_inputs = ["video", "image"]

    @property
    def description(self):
        # Human-readable summary surfaced by the modular-pipeline introspection tools.
        return (
            "Encoder step that encodes video or image inputs. This is an auto pipeline block.\n"
            " - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.\n"
            " - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.\n"
            " - If neither is provided, step will be skipped."
        )
# ====================
# 2. DENOISE
# ====================
# DENOISE (T2V)
# auto_docstring
class HeliosCoreDenoiseStep(SequentialPipelineBlocks):
    """
    Denoise block that takes encoded conditions and runs the chunk-based denoising process.

    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)

    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps.
        sigmas (`list`, *optional*):
            Custom sigmas for the denoising process.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        timesteps (`Tensor`, *optional*):
            Timesteps for the denoising process.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.

    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios"
    # Sub-blocks executed sequentially for the text-to-video workflow; names below
    # are parallel to this list.
    block_classes = [
        HeliosTextInputStep,
        HeliosPrepareHistoryStep,
        HeliosSetTimestepsStep,
        HeliosChunkDenoiseStep,
    ]
    block_names = ["input", "prepare_history", "set_timesteps", "chunk_denoise"]

    @property
    def description(self):
        return "Denoise block that takes encoded conditions and runs the chunk-based denoising process."

    @property
    def outputs(self):
        # Final output of the sequential chain exposed to downstream blocks (decode step).
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (I2V)
# auto_docstring
class HeliosI2VCoreDenoiseStep(SequentialPipelineBlocks):
    """
    I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.

    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)

    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video/fake-image latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video/fake-image latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps.
        sigmas (`list`, *optional*):
            Custom sigmas for the denoising process.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        timesteps (`Tensor`, *optional*):
            Timesteps for the denoising process.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.

    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios"
    # Sequential sub-blocks for image-to-video: the additional-inputs block is
    # instantiated (not a bare class) so it registers `image_latents` and the
    # I2V-specific `fake_image_latents` as batchable inputs.
    block_classes = [
        HeliosTextInputStep,
        HeliosAdditionalInputsStep(
            image_latent_inputs=[InputParam.template("image_latents")],
            additional_batch_inputs=[
                InputParam(
                    "fake_image_latents",
                    type_hint=torch.Tensor,
                    description="Fake image latents used as history seed for I2V generation.",
                ),
            ],
        ),
        HeliosAddNoiseToImageLatentsStep,
        HeliosPrepareHistoryStep,
        HeliosI2VSeedHistoryStep,
        HeliosSetTimestepsStep,
        HeliosI2VChunkDenoiseStep,
    ]
    block_names = [
        "input",
        "additional_inputs",
        "add_noise_image",
        "prepare_history",
        "seed_history",
        "set_timesteps",
        "chunk_denoise",
    ]

    @property
    def description(self):
        return "I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation."

    @property
    def outputs(self):
        # Final output of the sequential chain exposed to downstream blocks (decode step).
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (V2V)
# auto_docstring
class HeliosV2VCoreDenoiseStep(SequentialPipelineBlocks):
    """
    V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.

    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)

    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps.
        sigmas (`list`, *optional*):
            Custom sigmas for the denoising process.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        timesteps (`Tensor`, *optional*):
            Timesteps for the denoising process.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.

    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios"
    # Sequential sub-blocks for video-to-video: mirrors the I2V chain but noises
    # `video_latents` and seeds history from the encoded input video. The final
    # chunk-denoise block is shared with I2V (I2V-aware chunk preparation).
    block_classes = [
        HeliosTextInputStep,
        HeliosAdditionalInputsStep(
            image_latent_inputs=[InputParam.template("image_latents")],
            additional_batch_inputs=[
                InputParam(
                    "video_latents", type_hint=torch.Tensor, description="Encoded video latents for V2V generation."
                ),
            ],
        ),
        HeliosAddNoiseToVideoLatentsStep,
        HeliosPrepareHistoryStep,
        HeliosV2VSeedHistoryStep,
        HeliosSetTimestepsStep,
        HeliosI2VChunkDenoiseStep,
    ]
    block_names = [
        "input",
        "additional_inputs",
        "add_noise_video",
        "prepare_history",
        "seed_history",
        "set_timesteps",
        "chunk_denoise",
    ]

    @property
    def description(self):
        return "V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation."

    @property
    def outputs(self):
        # Final output of the sequential chain exposed to downstream blocks (decode step).
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# AUTO DENOISE
# auto_docstring
class HeliosAutoCoreDenoiseStep(ConditionalPipelineBlocks):
    """
    Core denoise step that selects the appropriate denoising block.

      - `HeliosV2VCoreDenoiseStep` (video2video) for video-to-video tasks.
      - `HeliosI2VCoreDenoiseStep` (image2video) for image-to-video tasks.
      - `HeliosCoreDenoiseStep` (text2video) for text-to-video tasks.

    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)

    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps.
        sigmas (`list`):
            Custom sigmas for the denoising process.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        timesteps (`Tensor`, *optional*):
            Timesteps for the denoising process.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.

    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    block_classes = [HeliosV2VCoreDenoiseStep, HeliosI2VCoreDenoiseStep, HeliosCoreDenoiseStep]
    block_names = ["video2video", "image2video", "text2video"]
    block_trigger_inputs = ["video_latents", "fake_image_latents"]
    default_block_name = "text2video"

    def select_block(self, video_latents=None, fake_image_latents=None):
        """Pick the workflow name from which conditioning latents are present.

        V2V takes precedence over I2V; returning None defers to `default_block_name`.
        """
        for trigger_value, workflow in (
            (video_latents, "video2video"),
            (fake_image_latents, "image2video"),
        ):
            if trigger_value is not None:
                return workflow
        return None

    @property
    def description(self):
        # Joined line-by-line; identical text to the concatenated-literal form.
        return "\n".join(
            [
                "Core denoise step that selects the appropriate denoising block.",
                " - `HeliosV2VCoreDenoiseStep` (video2video) for video-to-video tasks.",
                " - `HeliosI2VCoreDenoiseStep` (image2video) for image-to-video tasks.",
                " - `HeliosCoreDenoiseStep` (text2video) for text-to-video tasks.",
            ]
        )
# Default block assembly for the Helios auto pipeline, in execution order:
# text encoding -> VAE encoding (video/image) -> chunked denoising -> latent decoding.
AUTO_BLOCKS = InsertableDict(
    [
        ("text_encoder", HeliosTextEncoderStep()),
        ("vae_encoder", HeliosAutoVaeEncoderStep()),
        ("denoise", HeliosAutoCoreDenoiseStep()),
        ("decode", HeliosDecodeStep()),
    ]
)
# ====================
# 3. Auto Blocks
# ====================
# auto_docstring
class HeliosAutoBlocks(SequentialPipelineBlocks):
    """
    Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.

    Supported workflows:
      - `text2video`: requires `prompt`
      - `image2video`: requires `prompt`, `image`
      - `video2video`: requires `prompt`, `video`

    Components:
        text_encoder (`UMT5EncoderModel`) tokenizer (`AutoTokenizer`) guider (`ClassifierFreeGuidance`) vae
        (`AutoencoderKLWan`) video_processor (`VideoProcessor`) transformer (`HeliosTransformer3DModel`) scheduler
        (`HeliosScheduler`)

    Inputs:
        prompt (`str`):
            The prompt or prompts to guide image generation.
        negative_prompt (`str`, *optional*):
            The prompt or prompts not to guide the image generation.
        max_sequence_length (`int`, *optional*, defaults to 512):
            Maximum sequence length for prompt encoding.
        video (`None`, *optional*):
            Input video for video-to-video generation
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        image (`Image | list`, *optional*):
            Reference image(s) for denoising. Can be a single image or list of images.
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps.
        sigmas (`list`):
            Custom sigmas for the denoising process.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        timesteps (`Tensor`, *optional*):
            Timesteps for the denoising process.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        output_type (`str`, *optional*, defaults to np):
            Output format: 'pil', 'np', 'pt'.

    Outputs:
        videos (`list`):
            The generated videos.
    """

    model_name = "helios"
    # Sub-block classes/names come from the module-level AUTO_BLOCKS registry so
    # the assembly order is defined in exactly one place.
    block_classes = AUTO_BLOCKS.values()
    block_names = AUTO_BLOCKS.keys()
    # Declares which user inputs activate each workflow (used for docs/introspection).
    _workflow_map = {
        "text2video": {"prompt": True},
        "image2video": {"prompt": True, "image": True},
        "video2video": {"prompt": True, "video": True},
    }

    @property
    def description(self):
        return "Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios."

    @property
    def outputs(self):
        # Final pipeline output: decoded videos from the decode step.
        return [OutputParam.template("videos")]

View File

@@ -0,0 +1,520 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from ...utils import logging
from ..modular_pipeline import AutoPipelineBlocks, ConditionalPipelineBlocks, SequentialPipelineBlocks
from ..modular_pipeline_utils import InputParam, InsertableDict, OutputParam
from .before_denoise import (
HeliosAdditionalInputsStep,
HeliosAddNoiseToImageLatentsStep,
HeliosAddNoiseToVideoLatentsStep,
HeliosI2VSeedHistoryStep,
HeliosPrepareHistoryStep,
HeliosTextInputStep,
HeliosV2VSeedHistoryStep,
)
from .decoders import HeliosDecodeStep
from .denoise import HeliosPyramidChunkDenoiseStep, HeliosPyramidI2VChunkDenoiseStep
from .encoders import HeliosImageVaeEncoderStep, HeliosTextEncoderStep, HeliosVideoVaeEncoderStep
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# ====================
# 1. Vae Encoder
# ====================
# auto_docstring
class HeliosPyramidAutoVaeEncoderStep(AutoPipelineBlocks):
    """
    Encoder step that encodes video or image inputs. This is an auto pipeline block.

      - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.
      - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.
      - If neither is provided, step will be skipped.

    Components:
        vae (`AutoencoderKLWan`) video_processor (`VideoProcessor`)

    Inputs:
        video (`None`, *optional*):
            Input video for video-to-video generation
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        image (`Image | list`, *optional*):
            Reference image(s) for denoising. Can be a single image or list of images.

    Outputs:
        image_latents (`Tensor`):
            The latent representation of the input image.
        video_latents (`Tensor`):
            Encoded video latents (chunked)
        fake_image_latents (`Tensor`):
            Fake image latents for history seeding
    """

    # Same encoder sub-blocks as the non-pyramid variant; the pyramid pipeline
    # only differs downstream in the denoise stage.
    block_classes = [HeliosVideoVaeEncoderStep, HeliosImageVaeEncoderStep]
    block_names = ["video_encoder", "image_encoder"]
    # Order matters: `video` takes precedence over `image` when both are supplied.
    block_trigger_inputs = ["video", "image"]

    @property
    def description(self):
        # Human-readable summary surfaced by the modular-pipeline introspection tools.
        return (
            "Encoder step that encodes video or image inputs. This is an auto pipeline block.\n"
            " - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.\n"
            " - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.\n"
            " - If neither is provided, step will be skipped."
        )
# ====================
# 2. DENOISE
# ====================
# DENOISE (T2V)
# auto_docstring
class HeliosPyramidCoreDenoiseStep(SequentialPipelineBlocks):
    """
    T2V pyramid denoise block with progressive multi-resolution denoising.

    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
        (`ClassifierFreeZeroStarGuidance`)

    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.

    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios-pyramid"
    # Shorter chain than the non-pyramid T2V block: the pyramid chunk-denoise
    # block manages its own per-stage timesteps, so there is no set_timesteps step.
    block_classes = [
        HeliosTextInputStep,
        HeliosPrepareHistoryStep,
        HeliosPyramidChunkDenoiseStep,
    ]
    block_names = ["input", "prepare_history", "pyramid_chunk_denoise"]

    @property
    def description(self):
        return "T2V pyramid denoise block with progressive multi-resolution denoising."

    @property
    def outputs(self):
        # Final output of the sequential chain exposed to downstream blocks (decode step).
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (I2V)
# auto_docstring
class HeliosPyramidI2VCoreDenoiseStep(SequentialPipelineBlocks):
    """
    I2V pyramid denoise block with progressive multi-resolution denoising.

    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
        (`ClassifierFreeZeroStarGuidance`)

    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video/fake-image latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video/fake-image latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.

    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios-pyramid"
    # I2V chain: additional-inputs block is instantiated so it registers
    # `image_latents` and the I2V-specific `fake_image_latents`; the pyramid
    # chunk-denoise block manages its own per-stage timesteps.
    block_classes = [
        HeliosTextInputStep,
        HeliosAdditionalInputsStep(
            image_latent_inputs=[InputParam.template("image_latents")],
            additional_batch_inputs=[
                InputParam(
                    "fake_image_latents",
                    type_hint=torch.Tensor,
                    description="Fake image latents used as history seed for I2V generation.",
                ),
            ],
        ),
        HeliosAddNoiseToImageLatentsStep,
        HeliosPrepareHistoryStep,
        HeliosI2VSeedHistoryStep,
        HeliosPyramidI2VChunkDenoiseStep,
    ]
    block_names = [
        "input",
        "additional_inputs",
        "add_noise_image",
        "prepare_history",
        "seed_history",
        "pyramid_chunk_denoise",
    ]

    @property
    def description(self):
        return "I2V pyramid denoise block with progressive multi-resolution denoising."

    @property
    def outputs(self):
        # Final output of the sequential chain exposed to downstream blocks (decode step).
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (V2V)
# auto_docstring
class HeliosPyramidV2VCoreDenoiseStep(SequentialPipelineBlocks):
    """
    V2V pyramid denoise block with progressive multi-resolution denoising.

    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
        (`ClassifierFreeZeroStarGuidance`)

    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.

    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios-pyramid"
    # V2V chain: mirrors the pyramid I2V chain but noises `video_latents` and
    # seeds history from the encoded input video. The final chunk-denoise block
    # is shared with pyramid I2V (I2V-aware chunk preparation).
    block_classes = [
        HeliosTextInputStep,
        HeliosAdditionalInputsStep(
            image_latent_inputs=[InputParam.template("image_latents")],
            additional_batch_inputs=[
                InputParam(
                    "video_latents", type_hint=torch.Tensor, description="Encoded video latents for V2V generation."
                ),
            ],
        ),
        HeliosAddNoiseToVideoLatentsStep,
        HeliosPrepareHistoryStep,
        HeliosV2VSeedHistoryStep,
        HeliosPyramidI2VChunkDenoiseStep,
    ]
    block_names = [
        "input",
        "additional_inputs",
        "add_noise_video",
        "prepare_history",
        "seed_history",
        "pyramid_chunk_denoise",
    ]

    @property
    def description(self):
        return "V2V pyramid denoise block with progressive multi-resolution denoising."

    @property
    def outputs(self):
        # Final output of the sequential chain exposed to downstream blocks (decode step).
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# AUTO DENOISE
# auto_docstring
class HeliosPyramidAutoCoreDenoiseStep(ConditionalPipelineBlocks):
    """
    Pyramid core denoise step that selects the appropriate denoising block.
    - `HeliosPyramidV2VCoreDenoiseStep` (video2video) for video-to-video tasks.
    - `HeliosPyramidI2VCoreDenoiseStep` (image2video) for image-to-video tasks.
    - `HeliosPyramidCoreDenoiseStep` (text2video) for text-to-video tasks.
    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
        (`ClassifierFreeZeroStarGuidance`)
    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    # Candidate sub-blocks; matched positionally with `block_names`.
    block_classes = [HeliosPyramidV2VCoreDenoiseStep, HeliosPyramidI2VCoreDenoiseStep, HeliosPyramidCoreDenoiseStep]
    block_names = ["video2video", "image2video", "text2video"]
    # Runtime inputs whose presence determines which sub-block is executed.
    block_trigger_inputs = ["video_latents", "fake_image_latents"]
    default_block_name = "text2video"

    def select_block(self, video_latents=None, fake_image_latents=None):
        # V2V takes precedence over I2V when both trigger inputs are supplied.
        if video_latents is not None:
            return "video2video"
        elif fake_image_latents is not None:
            return "image2video"
        # None presumably falls back to `default_block_name` ("text2video") —
        # selection semantics are defined by ConditionalPipelineBlocks.
        return None

    @property
    def description(self):
        return (
            "Pyramid core denoise step that selects the appropriate denoising block.\n"
            " - `HeliosPyramidV2VCoreDenoiseStep` (video2video) for video-to-video tasks.\n"
            " - `HeliosPyramidI2VCoreDenoiseStep` (image2video) for image-to-video tasks.\n"
            " - `HeliosPyramidCoreDenoiseStep` (text2video) for text-to-video tasks."
        )
# ====================
# 3. Auto Blocks
# ====================
# Ordered stages of the pyramid auto pipeline: encode text, optionally encode
# image/video, denoise with the task-appropriate pyramid block, then decode.
PYRAMID_AUTO_BLOCKS = InsertableDict(
    [
        ("text_encoder", HeliosTextEncoderStep()),
        ("vae_encoder", HeliosPyramidAutoVaeEncoderStep()),
        ("denoise", HeliosPyramidAutoCoreDenoiseStep()),
        ("decode", HeliosDecodeStep()),
    ]
)
# auto_docstring
class HeliosPyramidAutoBlocks(SequentialPipelineBlocks):
    """
    Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios.
    Supported workflows:
    - `text2video`: requires `prompt`
    - `image2video`: requires `prompt`, `image`
    - `video2video`: requires `prompt`, `video`
    Components:
        text_encoder (`UMT5EncoderModel`) tokenizer (`AutoTokenizer`) guider (`ClassifierFreeGuidance`) vae
        (`AutoencoderKLWan`) video_processor (`VideoProcessor`) transformer (`HeliosTransformer3DModel`) scheduler
        (`HeliosScheduler`)
    Inputs:
        prompt (`str`):
            The prompt or prompts to guide image generation.
        negative_prompt (`str`, *optional*):
            The prompt or prompts not to guide the image generation.
        max_sequence_length (`int`, *optional*, defaults to 512):
            Maximum sequence length for prompt encoding.
        video (`None`, *optional*):
            Input video for video-to-video generation
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        image (`Image | list`, *optional*):
            Reference image(s) for denoising. Can be a single image or list of images.
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        output_type (`str`, *optional*, defaults to np):
            Output format: 'pil', 'np', 'pt'.
    Outputs:
        videos (`list`):
            The generated videos.
    """

    model_name = "helios-pyramid"
    # Stage classes/names come from the shared PYRAMID_AUTO_BLOCKS registry so the
    # sequential pipeline and the registry stay in sync.
    block_classes = PYRAMID_AUTO_BLOCKS.values()
    block_names = PYRAMID_AUTO_BLOCKS.keys()
    # Maps each supported workflow to the inputs that must be provided to enable it.
    _workflow_map = {
        "text2video": {"prompt": True},
        "image2video": {"prompt": True, "image": True},
        "video2video": {"prompt": True, "video": True},
    }

    @property
    def description(self):
        return "Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios."

    @property
    def outputs(self):
        return [OutputParam.template("videos")]

View File

@@ -0,0 +1,530 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from ...utils import logging
from ..modular_pipeline import AutoPipelineBlocks, ConditionalPipelineBlocks, SequentialPipelineBlocks
from ..modular_pipeline_utils import InputParam, InsertableDict, OutputParam
from .before_denoise import (
HeliosAdditionalInputsStep,
HeliosAddNoiseToImageLatentsStep,
HeliosAddNoiseToVideoLatentsStep,
HeliosI2VSeedHistoryStep,
HeliosPrepareHistoryStep,
HeliosTextInputStep,
HeliosV2VSeedHistoryStep,
)
from .decoders import HeliosDecodeStep
from .denoise import HeliosPyramidDistilledChunkDenoiseStep, HeliosPyramidDistilledI2VChunkDenoiseStep
from .encoders import HeliosImageVaeEncoderStep, HeliosTextEncoderStep, HeliosVideoVaeEncoderStep
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# ====================
# 1. Vae Encoder
# ====================
# auto_docstring
class HeliosPyramidDistilledAutoVaeEncoderStep(AutoPipelineBlocks):
    """
    Encoder step for distilled pyramid pipeline.
    - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.
    - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.
    - If neither is provided, step will be skipped.
    Components:
        vae (`AutoencoderKLWan`) video_processor (`VideoProcessor`)
    Inputs:
        video (`None`, *optional*):
            Input video for video-to-video generation
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        image (`Image | list`, *optional*):
            Reference image(s) for denoising. Can be a single image or list of images.
    Outputs:
        image_latents (`Tensor`):
            The latent representation of the input image.
        video_latents (`Tensor`):
            Encoded video latents (chunked)
        fake_image_latents (`Tensor`):
            Fake image latents for history seeding
    """

    # Candidate encoders; the matching entry in `block_trigger_inputs` decides
    # which one runs ("video" trigger is listed first, so it wins over "image").
    block_classes = [HeliosVideoVaeEncoderStep, HeliosImageVaeEncoderStep]
    block_names = ["video_encoder", "image_encoder"]
    block_trigger_inputs = ["video", "image"]

    @property
    def description(self):
        return (
            "Encoder step for distilled pyramid pipeline.\n"
            " - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.\n"
            " - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.\n"
            " - If neither is provided, step will be skipped."
        )
# ====================
# 2. DENOISE
# ====================
# DENOISE (T2V)
# auto_docstring
class HeliosPyramidDistilledCoreDenoiseStep(SequentialPipelineBlocks):
    """
    T2V distilled pyramid denoise block with DMD scheduler and no CFG.
    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        is_amplify_first_chunk (`bool`, *optional*, defaults to True):
            Whether to double the first chunk's timesteps via the scheduler for amplified generation.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios-pyramid"
    # Sub-blocks run sequentially; T2V needs no latent-conditioning stages, so the
    # chain is just input prep -> history prep -> chunked pyramid denoise.
    block_classes = [
        HeliosTextInputStep,
        HeliosPrepareHistoryStep,
        HeliosPyramidDistilledChunkDenoiseStep,
    ]
    block_names = ["input", "prepare_history", "pyramid_chunk_denoise"]

    @property
    def description(self):
        return "T2V distilled pyramid denoise block with DMD scheduler and no CFG."

    @property
    def outputs(self):
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (I2V)
# auto_docstring
class HeliosPyramidDistilledI2VCoreDenoiseStep(SequentialPipelineBlocks):
    """
    I2V distilled pyramid denoise block with DMD scheduler and no CFG.
    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video/fake-image latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video/fake-image latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        is_amplify_first_chunk (`bool`, *optional*, defaults to True):
            Whether to double the first chunk's timesteps via the scheduler for amplified generation.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios-pyramid"
    # Sequential sub-blocks: text/input prep, image-latent conditioning, noise
    # injection on the image latents, history prep/seeding, then chunked denoise.
    block_classes = [
        HeliosTextInputStep,
        HeliosAdditionalInputsStep(
            image_latent_inputs=[InputParam.template("image_latents")],
            additional_batch_inputs=[
                InputParam(
                    "fake_image_latents",
                    type_hint=torch.Tensor,
                    description="Fake image latents used as history seed for I2V generation.",
                ),
            ],
        ),
        HeliosAddNoiseToImageLatentsStep,
        HeliosPrepareHistoryStep,
        HeliosI2VSeedHistoryStep,
        HeliosPyramidDistilledI2VChunkDenoiseStep,
    ]
    block_names = [
        "input",
        "additional_inputs",
        "add_noise_image",
        "prepare_history",
        "seed_history",
        "pyramid_chunk_denoise",
    ]

    @property
    def description(self):
        return "I2V distilled pyramid denoise block with DMD scheduler and no CFG."

    @property
    def outputs(self):
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (V2V)
# auto_docstring
class HeliosPyramidDistilledV2VCoreDenoiseStep(SequentialPipelineBlocks):
    """
    V2V distilled pyramid denoise block with DMD scheduler and no CFG.
    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        is_amplify_first_chunk (`bool`, *optional*, defaults to True):
            Whether to double the first chunk's timesteps via the scheduler for amplified generation.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    model_name = "helios-pyramid"
    # Sequential sub-blocks: text/input prep, video-latent conditioning, noise
    # injection on the video latents, history prep/seeding, then chunked denoise.
    block_classes = [
        HeliosTextInputStep,
        HeliosAdditionalInputsStep(
            image_latent_inputs=[InputParam.template("image_latents")],
            additional_batch_inputs=[
                InputParam(
                    "video_latents", type_hint=torch.Tensor, description="Encoded video latents for V2V generation."
                ),
            ],
        ),
        HeliosAddNoiseToVideoLatentsStep,
        HeliosPrepareHistoryStep,
        HeliosV2VSeedHistoryStep,
        # NOTE(review): the V2V chain reuses the *I2V* chunk-denoise step; the
        # non-distilled V2V block follows the same pattern, so this appears
        # intentional (no dedicated V2V chunk step is imported) — confirm.
        HeliosPyramidDistilledI2VChunkDenoiseStep,
    ]
    block_names = [
        "input",
        "additional_inputs",
        "add_noise_video",
        "prepare_history",
        "seed_history",
        "pyramid_chunk_denoise",
    ]

    @property
    def description(self):
        return "V2V distilled pyramid denoise block with DMD scheduler and no CFG."

    @property
    def outputs(self):
        return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# AUTO DENOISE
# auto_docstring
class HeliosPyramidDistilledAutoCoreDenoiseStep(ConditionalPipelineBlocks):
    """
    Distilled pyramid core denoise step that selects the appropriate denoising block.
    - `HeliosPyramidDistilledV2VCoreDenoiseStep` (video2video) for video-to-video tasks.
    - `HeliosPyramidDistilledI2VCoreDenoiseStep` (image2video) for image-to-video tasks.
    - `HeliosPyramidDistilledCoreDenoiseStep` (text2video) for text-to-video tasks.
    Components:
        transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
    Inputs:
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        prompt_embeds (`Tensor`):
            text embeddings used to guide the image generation. Can be generated from text_encoder step.
        negative_prompt_embeds (`Tensor`, *optional*):
            negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        is_amplify_first_chunk (`bool`, *optional*, defaults to True):
            Whether to double the first chunk's timesteps via the scheduler for amplified generation.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
    Outputs:
        latent_chunks (`list`):
            List of per-chunk denoised latent tensors
    """

    # Candidate sub-blocks; matched positionally with `block_names`.
    block_classes = [
        HeliosPyramidDistilledV2VCoreDenoiseStep,
        HeliosPyramidDistilledI2VCoreDenoiseStep,
        HeliosPyramidDistilledCoreDenoiseStep,
    ]
    block_names = ["video2video", "image2video", "text2video"]
    # Runtime inputs whose presence determines which sub-block is executed.
    block_trigger_inputs = ["video_latents", "fake_image_latents"]
    default_block_name = "text2video"

    def select_block(self, video_latents=None, fake_image_latents=None):
        # V2V takes precedence over I2V when both trigger inputs are supplied.
        if video_latents is not None:
            return "video2video"
        elif fake_image_latents is not None:
            return "image2video"
        # None presumably falls back to `default_block_name` ("text2video") —
        # selection semantics are defined by ConditionalPipelineBlocks.
        return None

    @property
    def description(self):
        return (
            "Distilled pyramid core denoise step that selects the appropriate denoising block.\n"
            " - `HeliosPyramidDistilledV2VCoreDenoiseStep` (video2video) for video-to-video tasks.\n"
            " - `HeliosPyramidDistilledI2VCoreDenoiseStep` (image2video) for image-to-video tasks.\n"
            " - `HeliosPyramidDistilledCoreDenoiseStep` (text2video) for text-to-video tasks."
        )
# ====================
# 3. Auto Blocks
# ====================
# Ordered stages of the distilled pyramid auto pipeline: encode text, optionally
# encode image/video, denoise with the task-appropriate distilled block, decode.
DISTILLED_PYRAMID_AUTO_BLOCKS = InsertableDict(
    [
        ("text_encoder", HeliosTextEncoderStep()),
        ("vae_encoder", HeliosPyramidDistilledAutoVaeEncoderStep()),
        ("denoise", HeliosPyramidDistilledAutoCoreDenoiseStep()),
        ("decode", HeliosDecodeStep()),
    ]
)
# auto_docstring
class HeliosPyramidDistilledAutoBlocks(SequentialPipelineBlocks):
    """
    Auto Modular pipeline for distilled pyramid progressive generation (T2V/I2V/V2V) using Helios.
    Supported workflows:
    - `text2video`: requires `prompt`
    - `image2video`: requires `prompt`, `image`
    - `video2video`: requires `prompt`, `video`
    Components:
        text_encoder (`UMT5EncoderModel`) tokenizer (`AutoTokenizer`) guider (`ClassifierFreeGuidance`) vae
        (`AutoencoderKLWan`) video_processor (`VideoProcessor`) transformer (`HeliosTransformer3DModel`) scheduler
        (`HeliosScheduler`)
    Inputs:
        prompt (`str`):
            The prompt or prompts to guide image generation.
        negative_prompt (`str`, *optional*):
            The prompt or prompts not to guide the image generation.
        max_sequence_length (`int`, *optional*, defaults to 512):
            Maximum sequence length for prompt encoding.
        video (`None`, *optional*):
            Input video for video-to-video generation
        height (`int`, *optional*, defaults to 384):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to 640):
            The width in pixels of the generated image.
        num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
            Number of latent frames per temporal chunk.
        generator (`Generator`, *optional*):
            Torch generator for deterministic generation.
        image (`Image | list`, *optional*):
            Reference image(s) for denoising. Can be a single image or list of images.
        num_videos_per_prompt (`int`, *optional*, defaults to 1):
            Number of videos to generate per prompt.
        image_latents (`Tensor`, *optional*):
            image latents used to guide the image generation. Can be generated from vae_encoder step.
        video_latents (`Tensor`, *optional*):
            Encoded video latents for V2V generation.
        image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for image latent noise.
        image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for image latent noise.
        video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
            Minimum sigma for video latent noise.
        video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
            Maximum sigma for video latent noise.
        num_frames (`int`, *optional*, defaults to 132):
            Total number of video frames to generate.
        history_sizes (`list`):
            Sizes of long/mid/short history buffers for temporal context.
        keep_first_frame (`bool`, *optional*, defaults to True):
            Whether to keep the first frame as a prefix in history.
        pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
            Number of denoising steps per pyramid stage.
        latents (`Tensor`, *optional*):
            Pre-generated noisy latents for image generation.
        **denoiser_input_fields (`None`, *optional*):
            conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
        is_amplify_first_chunk (`bool`, *optional*, defaults to True):
            Whether to double the first chunk's timesteps via the scheduler for amplified generation.
        attention_kwargs (`dict`, *optional*):
            Additional kwargs for attention processors.
        fake_image_latents (`Tensor`, *optional*):
            Fake image latents used as history seed for I2V generation.
        output_type (`str`, *optional*, defaults to np):
            Output format: 'pil', 'np', 'pt'.
    Outputs:
        videos (`list`):
            The generated videos.
    """

    model_name = "helios-pyramid"
    # Stage classes/names come from the shared DISTILLED_PYRAMID_AUTO_BLOCKS
    # registry so the sequential pipeline and the registry stay in sync.
    block_classes = DISTILLED_PYRAMID_AUTO_BLOCKS.values()
    block_names = DISTILLED_PYRAMID_AUTO_BLOCKS.keys()
    # Maps each supported workflow to the inputs that must be provided to enable it.
    _workflow_map = {
        "text2video": {"prompt": True},
        "image2video": {"prompt": True, "image": True},
        "video2video": {"prompt": True, "video": True},
    }

    @property
    def description(self):
        return "Auto Modular pipeline for distilled pyramid progressive generation (T2V/I2V/V2V) using Helios."

    @property
    def outputs(self):
        return [OutputParam.template("videos")]

View File

@@ -0,0 +1,87 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...loaders import HeliosLoraLoaderMixin
from ...utils import logging
from ..modular_pipeline import ModularPipeline
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class HeliosModularPipeline(
    ModularPipeline,
    HeliosLoraLoaderMixin,
):
    """
    A ModularPipeline for Helios text-to-video generation.

    > [!WARNING]
    > This is an experimental feature and is likely to change in the future.
    """

    default_blocks_name = "HeliosAutoBlocks"

    @property
    def vae_scale_factor_spatial(self):
        """Spatial downscale factor of the VAE; 8 when no VAE is loaded."""
        vae = getattr(self, "vae", None)
        if vae is None:
            return 8
        return vae.config.scale_factor_spatial

    @property
    def vae_scale_factor_temporal(self):
        """Temporal downscale factor of the VAE; 4 when no VAE is loaded."""
        vae = getattr(self, "vae", None)
        if vae is None:
            return 4
        return vae.config.scale_factor_temporal

    @property
    def num_channels_latents(self):
        """Latent channel count from the transformer config; 16 when no transformer is loaded."""
        # YiYi TODO: find out default value
        transformer = getattr(self, "transformer", None)
        if transformer is None:
            return 16
        return transformer.config.in_channels

    @property
    def requires_unconditional_embeds(self):
        """Whether the configured guider needs unconditional (negative) embeddings."""
        guider = getattr(self, "guider", None)
        if guider is None:
            return False
        return guider._enabled and guider.num_conditions > 1
class HeliosPyramidModularPipeline(HeliosModularPipeline):
    """
    A ModularPipeline for Helios pyramid (progressive resolution) video generation.

    > [!WARNING]
    > This is an experimental feature and is likely to change in the future.
    """

    # Only the default blocks differ from the base pipeline.
    default_blocks_name = "HeliosPyramidAutoBlocks"
class HeliosPyramidDistilledModularPipeline(HeliosModularPipeline):
    """
    A ModularPipeline for Helios distilled pyramid video generation using DMD scheduler.

    Uses guidance_scale=1.0 (no CFG) and supports is_amplify_first_chunk for the DMD scheduler.

    > [!WARNING]
    > This is an experimental feature and is likely to change in the future.
    """

    # Only the default blocks differ from the base pipeline.
    default_blocks_name = "HeliosPyramidDistilledAutoBlocks"

View File

@@ -106,6 +106,16 @@ def _wan_i2v_map_fn(config_dict=None):
return "WanImage2VideoModularPipeline"
def _helios_pyramid_map_fn(config_dict=None):
if config_dict is None:
return "HeliosPyramidModularPipeline"
if config_dict.get("is_distilled", False):
return "HeliosPyramidDistilledModularPipeline"
else:
return "HeliosPyramidModularPipeline"
MODULAR_PIPELINE_MAPPING = OrderedDict(
[
("stable-diffusion-xl", _create_default_map_fn("StableDiffusionXLModularPipeline")),
@@ -120,6 +130,8 @@ MODULAR_PIPELINE_MAPPING = OrderedDict(
("qwenimage-edit-plus", _create_default_map_fn("QwenImageEditPlusModularPipeline")),
("qwenimage-layered", _create_default_map_fn("QwenImageLayeredModularPipeline")),
("z-image", _create_default_map_fn("ZImageModularPipeline")),
("helios", _create_default_map_fn("HeliosModularPipeline")),
("helios-pyramid", _helios_pyramid_map_fn),
]
)

View File

@@ -292,7 +292,12 @@ else:
"LTXLatentUpsamplePipeline",
"LTXI2VLongMultiPromptPipeline",
]
_import_structure["ltx2"] = ["LTX2Pipeline", "LTX2ImageToVideoPipeline", "LTX2LatentUpsamplePipeline"]
_import_structure["ltx2"] = [
"LTX2Pipeline",
"LTX2ConditionPipeline",
"LTX2ImageToVideoPipeline",
"LTX2LatentUpsamplePipeline",
]
_import_structure["lumina"] = ["LuminaPipeline", "LuminaText2ImgPipeline"]
_import_structure["lumina2"] = ["Lumina2Pipeline", "Lumina2Text2ImgPipeline"]
_import_structure["lucy"] = ["LucyEditPipeline"]
@@ -731,7 +736,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
LTXLatentUpsamplePipeline,
LTXPipeline,
)
from .ltx2 import LTX2ImageToVideoPipeline, LTX2LatentUpsamplePipeline, LTX2Pipeline
from .ltx2 import LTX2ConditionPipeline, LTX2ImageToVideoPipeline, LTX2LatentUpsamplePipeline, LTX2Pipeline
from .lucy import LucyEditPipeline
from .lumina import LuminaPipeline, LuminaText2ImgPipeline
from .lumina2 import Lumina2Pipeline, Lumina2Text2ImgPipeline

View File

@@ -25,6 +25,7 @@ else:
_import_structure["connectors"] = ["LTX2TextConnectors"]
_import_structure["latent_upsampler"] = ["LTX2LatentUpsamplerModel"]
_import_structure["pipeline_ltx2"] = ["LTX2Pipeline"]
_import_structure["pipeline_ltx2_condition"] = ["LTX2ConditionPipeline"]
_import_structure["pipeline_ltx2_image2video"] = ["LTX2ImageToVideoPipeline"]
_import_structure["pipeline_ltx2_latent_upsample"] = ["LTX2LatentUpsamplePipeline"]
_import_structure["vocoder"] = ["LTX2Vocoder"]
@@ -40,6 +41,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .connectors import LTX2TextConnectors
from .latent_upsampler import LTX2LatentUpsamplerModel
from .pipeline_ltx2 import LTX2Pipeline
from .pipeline_ltx2_condition import LTX2ConditionPipeline
from .pipeline_ltx2_image2video import LTX2ImageToVideoPipeline
from .pipeline_ltx2_latent_upsample import LTX2LatentUpsamplePipeline
from .vocoder import LTX2Vocoder

File diff suppressed because it is too large Load Diff

View File

@@ -86,6 +86,7 @@ from .import_utils import (
is_inflect_available,
is_invisible_watermark_available,
is_kernels_available,
is_kernels_version,
is_kornia_available,
is_librosa_available,
is_matplotlib_available,

View File

@@ -656,6 +656,21 @@ class AutoencoderOobleck(metaclass=DummyObject):
requires_backends(cls, ["torch"])
class AutoencoderRAE(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch is missing.
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch"])
class AutoencoderTiny(metaclass=DummyObject):
_backends = ["torch"]

View File

@@ -152,6 +152,96 @@ class FluxModularPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class HeliosAutoBlocks(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch/transformers are missing.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
class HeliosModularPipeline(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch/transformers are missing.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidAutoBlocks(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch/transformers are missing.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidDistilledAutoBlocks(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch/transformers are missing.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidDistilledModularPipeline(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch/transformers are missing.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidModularPipeline(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch/transformers are missing.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
class QwenImageAutoBlocks(metaclass=DummyObject):
_backends = ["torch", "transformers"]
@@ -2147,6 +2237,21 @@ class LongCatImagePipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class LTX2ConditionPipeline(metaclass=DummyObject):
    # Placeholder that raises an informative ImportError when torch/transformers are missing.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
class LTX2ImageToVideoPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]

View File

@@ -724,6 +724,22 @@ def is_transformers_version(operation: str, version: str):
return compare_versions(parse(_transformers_version), operation, version)
@cache
def is_kernels_version(operation: str, version: str):
    """
    Compares the installed `kernels` package version against a reference version.

    Args:
        operation (`str`):
            A string representation of an operator, such as `">"` or `"<="`
        version (`str`):
            A version string

    Returns `False` when the `kernels` package is not available.
    """
    if _kernels_available:
        return compare_versions(parse(_kernels_version), operation, version)
    return False
@cache
def is_hf_hub_version(operation: str, version: str):
"""

View File

@@ -25,9 +25,9 @@ from .image_processor import VaeImageProcessor, is_valid_image, is_valid_image_i
class VideoProcessor(VaeImageProcessor):
r"""Simple video processor."""
def preprocess_video(self, video, height: int | None = None, width: int | None = None) -> torch.Tensor:
def preprocess_video(self, video, height: int | None = None, width: int | None = None, **kwargs) -> torch.Tensor:
r"""
Preprocesses input video(s).
Preprocesses input video(s). Keyword arguments will be forwarded to `VaeImageProcessor.preprocess`.
Args:
video (`list[PIL.Image]`, `list[list[PIL.Image]]`, `torch.Tensor`, `np.array`, `list[torch.Tensor]`, `list[np.array]`):
@@ -49,6 +49,10 @@ class VideoProcessor(VaeImageProcessor):
width (`int`, *optional*`, defaults to `None`):
The width in preprocessed frames of the video. If `None`, will use get_default_height_width()` to get
the default width.
Returns:
`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`:
A 5D tensor holding the batched channels-first video(s).
"""
if isinstance(video, list) and isinstance(video[0], np.ndarray) and video[0].ndim == 5:
warnings.warn(
@@ -79,7 +83,7 @@ class VideoProcessor(VaeImageProcessor):
"Input is in incorrect format. Currently, we only support numpy.ndarray, torch.Tensor, PIL.Image.Image"
)
video = torch.stack([self.preprocess(img, height=height, width=width) for img in video], dim=0)
video = torch.stack([self.preprocess(img, height=height, width=width, **kwargs) for img in video], dim=0)
# move the number of channels before the number of frames.
video = video.permute(0, 2, 1, 3, 4)
@@ -87,10 +91,11 @@ class VideoProcessor(VaeImageProcessor):
return video
def postprocess_video(
self, video: torch.Tensor, output_type: str = "np"
self, video: torch.Tensor, output_type: str = "np", **kwargs
) -> np.ndarray | torch.Tensor | list[PIL.Image.Image]:
r"""
Converts a video tensor to a list of frames for export.
Converts a video tensor to a list of frames for export. Keyword arguments will be forwarded to
`VaeImageProcessor.postprocess`.
Args:
video (`torch.Tensor`): The video as a tensor.
@@ -100,7 +105,7 @@ class VideoProcessor(VaeImageProcessor):
outputs = []
for batch_idx in range(batch_size):
batch_vid = video[batch_idx].permute(1, 0, 2, 3)
batch_output = self.postprocess(batch_vid, output_type)
batch_output = self.postprocess(batch_vid, output_type, **kwargs)
outputs.append(batch_output)
if output_type == "np":

View File

@@ -0,0 +1,300 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import pytest
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import to_tensor
import diffusers.models.autoencoders.autoencoder_rae as _rae_module
from diffusers.models.autoencoders.autoencoder_rae import (
_ENCODER_FORWARD_FNS,
AutoencoderRAE,
_build_encoder,
)
from diffusers.utils import load_image
from ...testing_utils import (
backend_empty_cache,
enable_full_determinism,
slow,
torch_all_close,
torch_device,
)
from ..testing_utils import BaseModelTesterConfig, ModelTesterMixin
from .testing_utils import AutoencoderTesterMixin
enable_full_determinism()
# ---------------------------------------------------------------------------
# Tiny test encoder for fast unit tests (no transformers dependency)
# ---------------------------------------------------------------------------
class _TinyTestEncoderModule(torch.nn.Module):
"""Minimal encoder that mimics the patch-token interface without any HF model."""
def __init__(self, hidden_size: int = 16, patch_size: int = 8, **kwargs):
super().__init__()
self.patch_size = patch_size
self.hidden_size = hidden_size
def forward(self, images: torch.Tensor) -> torch.Tensor:
pooled = F.avg_pool2d(images.mean(dim=1, keepdim=True), kernel_size=self.patch_size, stride=self.patch_size)
tokens = pooled.flatten(2).transpose(1, 2).contiguous()
return tokens.repeat(1, 1, self.hidden_size)
def _tiny_test_encoder_forward(model, images):
return model(images)
def _build_tiny_test_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers):
    """Factory matching `_build_encoder`'s signature; `encoder_type` and
    `num_hidden_layers` are accepted for compatibility but ignored."""
    encoder = _TinyTestEncoderModule(hidden_size=hidden_size, patch_size=patch_size)
    return encoder
# Monkey-patch the dispatch tables so "tiny_test" is recognised by AutoencoderRAE
_ENCODER_FORWARD_FNS["tiny_test"] = _tiny_test_encoder_forward

# Keep a handle on the real builder so non-test encoder types still resolve.
_original_build_encoder = _build_encoder


def _patched_build_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers):
    # Route the synthetic "tiny_test" type to the lightweight builder; all
    # other encoder types fall through to the library's original builder.
    if encoder_type == "tiny_test":
        return _build_tiny_test_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers)
    return _original_build_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers)


# NOTE(review): patches the module attribute — assumes AutoencoderRAE looks up
# `_build_encoder` through the module rather than a from-import; confirm.
_rae_module._build_encoder = _patched_build_encoder
# ---------------------------------------------------------------------------
# Test config
# ---------------------------------------------------------------------------
class AutoencoderRAETesterConfig(BaseModelTesterConfig):
    """Shared configuration and fixtures for the AutoencoderRAE test classes."""

    @property
    def model_class(self):
        return AutoencoderRAE

    @property
    def output_shape(self):
        # Reconstructions come back as (num_channels, image_size, image_size).
        return (3, 16, 16)

    def get_init_dict(self):
        # Tiny config; "tiny_test" is registered above so no transformers
        # backbone is instantiated during fast tests.
        return dict(
            encoder_type="tiny_test",
            encoder_hidden_size=16,
            encoder_patch_size=8,
            encoder_input_size=32,
            patch_size=4,
            image_size=16,
            decoder_hidden_size=32,
            decoder_num_hidden_layers=1,
            decoder_num_attention_heads=4,
            decoder_intermediate_size=64,
            num_channels=3,
            encoder_norm_mean=[0.5, 0.5, 0.5],
            encoder_norm_std=[0.5, 0.5, 0.5],
            noise_tau=0.0,
            reshape_to_2d=True,
            scaling_factor=1.0,
        )

    @property
    def generator(self):
        # Fresh, identically-seeded generator on every access for reproducibility.
        return torch.Generator("cpu").manual_seed(0)

    def get_dummy_inputs(self):
        sample = torch.randn(2, 3, 32, 32, generator=self.generator, device="cpu")
        return {"sample": sample.to(torch_device)}

    # Bridge for AutoencoderTesterMixin which still uses the old interface
    def prepare_init_args_and_inputs_for_common(self):
        return self.get_init_dict(), self.get_dummy_inputs()

    def _make_model(self, **overrides) -> AutoencoderRAE:
        # Build a model from the default config with per-test overrides applied.
        config = {**self.get_init_dict(), **overrides}
        return AutoencoderRAE(**config).to(torch_device)
class TestAutoEncoderRAE(AutoencoderRAETesterConfig, ModelTesterMixin):
    """Core model tests for AutoencoderRAE."""

    @pytest.mark.skip(reason="AutoencoderRAE does not support torch dynamo yet")
    def test_from_save_pretrained_dynamo(self): ...

    def test_fast_encode_decode_and_forward_shapes(self):
        # encode: (2, 3, 32, 32) -> (2, 16, 4, 4) latents; decode/forward
        # reconstruct at the configured image_size of 16.
        model = self._make_model().eval()
        x = torch.rand(2, 3, 32, 32, device=torch_device)
        with torch.no_grad():
            z = model.encode(x).latent
            decoded = model.decode(z).sample
            recon = model(x).sample
        assert z.shape == (2, 16, 4, 4)
        assert decoded.shape == (2, 3, 16, 16)
        assert recon.shape == (2, 3, 16, 16)
        assert torch.isfinite(recon).all().item()

    def test_fast_scaling_factor_encode_and_decode_consistency(self):
        # Seed before each construction so both models share identical weights
        # and differ only in scaling_factor.
        torch.manual_seed(0)
        model_base = self._make_model(scaling_factor=1.0).eval()
        torch.manual_seed(0)
        model_scaled = self._make_model(scaling_factor=2.0).eval()
        x = torch.rand(2, 3, 32, 32, device=torch_device)
        with torch.no_grad():
            z_base = model_base.encode(x).latent
            z_scaled = model_scaled.encode(x).latent
            recon_base = model_base.decode(z_base).sample
            recon_scaled = model_scaled.decode(z_scaled).sample
        # Latents scale linearly with scaling_factor; decode divides it back
        # out, so reconstructions should match.
        assert torch.allclose(z_scaled, z_base * 2.0, atol=1e-5, rtol=1e-4)
        assert torch.allclose(recon_scaled, recon_base, atol=1e-5, rtol=1e-4)

    def test_fast_latents_normalization_matches_formula(self):
        # NOTE(review): unlike the scaling test above, the two models are not
        # re-seeded identically — this relies on encode() being independent of
        # randomly-initialized weights (frozen parameter-free test encoder);
        # confirm, or seed both constructions.
        latents_mean = torch.full((1, 16, 1, 1), 0.25, dtype=torch.float32)
        latents_std = torch.full((1, 16, 1, 1), 2.0, dtype=torch.float32)
        model_raw = self._make_model().eval()
        model_norm = self._make_model(latents_mean=latents_mean, latents_std=latents_std).eval()
        x = torch.rand(1, 3, 32, 32, device=torch_device)
        with torch.no_grad():
            z_raw = model_raw.encode(x).latent
            z_norm = model_norm.encode(x).latent
        # Expected normalization: (z - mean) / (std + eps) with eps = 1e-5.
        expected = (z_raw - latents_mean.to(z_raw.device, z_raw.dtype)) / (
            latents_std.to(z_raw.device, z_raw.dtype) + 1e-5
        )
        assert torch.allclose(z_norm, expected, atol=1e-5, rtol=1e-4)

    def test_fast_slicing_matches_non_slicing(self):
        # Batch-sliced encode/decode must produce the same tensors as the
        # unsliced path.
        model = self._make_model().eval()
        x = torch.rand(3, 3, 32, 32, device=torch_device)
        with torch.no_grad():
            model.use_slicing = False
            z_no_slice = model.encode(x).latent
            out_no_slice = model.decode(z_no_slice).sample
            model.use_slicing = True
            z_slice = model.encode(x).latent
            out_slice = model.decode(z_slice).sample
        assert torch.allclose(z_slice, z_no_slice, atol=1e-6, rtol=1e-5)
        assert torch.allclose(out_slice, out_no_slice, atol=1e-6, rtol=1e-5)

    def test_fast_noise_tau_applies_only_in_train(self):
        # With noise_tau > 0, encode should be stochastic in train mode
        # (different seeds -> different latents) and deterministic in eval.
        model = self._make_model(noise_tau=0.5).to(torch_device)
        x = torch.rand(2, 3, 32, 32, device=torch_device)
        model.train()
        torch.manual_seed(0)
        z_train_1 = model.encode(x).latent
        torch.manual_seed(1)
        z_train_2 = model.encode(x).latent
        model.eval()
        torch.manual_seed(0)
        z_eval_1 = model.encode(x).latent
        torch.manual_seed(1)
        z_eval_2 = model.encode(x).latent
        assert z_train_1.shape == z_eval_1.shape
        assert not torch.allclose(z_train_1, z_train_2)
        assert torch.allclose(z_eval_1, z_eval_2, atol=1e-6, rtol=1e-5)
class TestAutoEncoderRAESlicingTiling(AutoencoderRAETesterConfig, AutoencoderTesterMixin):
    """Slicing and tiling tests for AutoencoderRAE; all cases are inherited
    from AutoencoderTesterMixin using the shared tester config."""
@slow
@pytest.mark.skip(reason="Not enough model usage to justify slow tests yet.")
class AutoencoderRAEEncoderIntegrationTests:
    """Shape checks for the real pretrained encoder backbones (dinov2, siglip2, mae)."""

    def teardown_method(self):
        # Free accelerator memory between tests; the real backbones are large.
        gc.collect()
        backend_empty_cache(torch_device)

    def test_dinov2_encoder_forward_shape(self):
        encoder = _build_encoder("dinov2", hidden_size=768, patch_size=14, num_hidden_layers=12).to(torch_device)
        x = torch.rand(1, 3, 224, 224, device=torch_device)
        y = _ENCODER_FORWARD_FNS["dinov2"](encoder, x)
        assert y.ndim == 3
        assert y.shape[0] == 1
        # (224/14)^2 = 256 patch tokens; CLS/register tokens are presumably
        # stripped by the dinov2 forward fn — confirm against its implementation.
        assert y.shape[1] == 256
        assert y.shape[2] == 768

    def test_siglip2_encoder_forward_shape(self):
        encoder = _build_encoder("siglip2", hidden_size=768, patch_size=16, num_hidden_layers=12).to(torch_device)
        x = torch.rand(1, 3, 224, 224, device=torch_device)
        y = _ENCODER_FORWARD_FNS["siglip2"](encoder, x)
        assert y.ndim == 3
        assert y.shape[0] == 1
        assert y.shape[1] == 196  # (224/16)^2
        assert y.shape[2] == 768

    def test_mae_encoder_forward_shape(self):
        encoder = _build_encoder("mae", hidden_size=768, patch_size=16, num_hidden_layers=12).to(torch_device)
        x = torch.rand(1, 3, 224, 224, device=torch_device)
        # The mae forward fn additionally takes the patch size as a kwarg.
        y = _ENCODER_FORWARD_FNS["mae"](encoder, x, patch_size=16)
        assert y.ndim == 3
        assert y.shape[0] == 1
        assert y.shape[1] == 196  # (224/16)^2
        assert y.shape[2] == 768
@slow
@pytest.mark.skip(reason="Not enough model usage to justify slow tests yet.")
class AutoencoderRAEIntegrationTests:
    """End-to-end encode/decode checks against a real pretrained RAE checkpoint."""

    def teardown_method(self):
        # Release accelerator memory between tests.
        gc.collect()
        backend_empty_cache(torch_device)

    def test_autoencoder_rae_from_pretrained_dinov2(self):
        model = AutoencoderRAE.from_pretrained("nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08").to(torch_device)
        model.eval()
        image = load_image(
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
        )
        image = image.convert("RGB").resize((224, 224))
        x = to_tensor(image).unsqueeze(0).to(torch_device)
        with torch.no_grad():
            latents = model.encode(x).latent
            # 224 / 14 = 16 token grid with 768 latent channels.
            assert latents.shape == (1, 768, 16, 16)
            recon = model.decode(latents).sample
        assert recon.shape == (1, 3, 256, 256)
        assert torch.isfinite(recon).all().item()
        # Reference slices recorded from this checkpoint; guard against
        # numerical regressions.
        # fmt: off
        expected_latent_slice = torch.tensor([0.7617, 0.8824, -0.4891])
        expected_recon_slice = torch.tensor([0.1263, 0.1355, 0.1435])
        # fmt: on
        assert torch_all_close(latents[0, :3, 0, 0].float().cpu(), expected_latent_slice, atol=1e-3)
        assert torch_all_close(recon[0, 0, 0, :3].float().cpu(), expected_recon_slice, atol=1e-3)

View File

@@ -0,0 +1,166 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pytest
from diffusers.modular_pipelines import (
HeliosAutoBlocks,
HeliosModularPipeline,
HeliosPyramidAutoBlocks,
HeliosPyramidModularPipeline,
)
from ..test_modular_pipelines_common import ModularPipelineTesterMixin
HELIOS_WORKFLOWS = {
"text2video": [
("text_encoder", "HeliosTextEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.set_timesteps", "HeliosSetTimestepsStep"),
("denoise.chunk_denoise", "HeliosChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
"image2video": [
("text_encoder", "HeliosTextEncoderStep"),
("vae_encoder", "HeliosImageVaeEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
("denoise.add_noise_image", "HeliosAddNoiseToImageLatentsStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.seed_history", "HeliosI2VSeedHistoryStep"),
("denoise.set_timesteps", "HeliosSetTimestepsStep"),
("denoise.chunk_denoise", "HeliosI2VChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
"video2video": [
("text_encoder", "HeliosTextEncoderStep"),
("vae_encoder", "HeliosVideoVaeEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
("denoise.add_noise_video", "HeliosAddNoiseToVideoLatentsStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.seed_history", "HeliosV2VSeedHistoryStep"),
("denoise.set_timesteps", "HeliosSetTimestepsStep"),
("denoise.chunk_denoise", "HeliosI2VChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
}
class TestHeliosModularPipelineFast(ModularPipelineTesterMixin):
    """Fast (tiny-checkpoint) tests for the standard Helios modular pipeline."""

    pipeline_class = HeliosModularPipeline
    pipeline_blocks_class = HeliosAutoBlocks
    pretrained_model_name_or_path = "hf-internal-testing/tiny-helios-modular-pipe"
    params = frozenset({"prompt", "height", "width", "num_frames"})
    batch_params = frozenset({"prompt"})
    optional_params = frozenset({"num_inference_steps", "num_videos_per_prompt", "latents"})
    output_name = "videos"
    expected_workflow_blocks = HELIOS_WORKFLOWS

    def get_dummy_inputs(self, seed=0):
        # Minimal resolution/steps keep the tiny pipeline fast while still
        # exercising every block in the workflow.
        return {
            "prompt": "A painting of a squirrel eating a burger",
            "generator": self.get_generator(seed),
            "num_inference_steps": 2,
            "height": 16,
            "width": 16,
            "num_frames": 9,
            "max_sequence_length": 16,
            "output_type": "pt",
        }

    @pytest.mark.skip(reason="num_videos_per_prompt")
    def test_num_images_per_prompt(self):
        pass
# Expected (sub-block path, block class name) sequences for the pyramid
# variant; the pyramid denoise step replaces set_timesteps + chunk_denoise.
HELIOS_PYRAMID_WORKFLOWS = {
    "text2video": [
        ("text_encoder", "HeliosTextEncoderStep"),
        ("denoise.input", "HeliosTextInputStep"),
        ("denoise.prepare_history", "HeliosPrepareHistoryStep"),
        ("denoise.pyramid_chunk_denoise", "HeliosPyramidChunkDenoiseStep"),
        ("decode", "HeliosDecodeStep"),
    ],
    "image2video": [
        ("text_encoder", "HeliosTextEncoderStep"),
        ("vae_encoder", "HeliosImageVaeEncoderStep"),
        ("denoise.input", "HeliosTextInputStep"),
        ("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
        ("denoise.add_noise_image", "HeliosAddNoiseToImageLatentsStep"),
        ("denoise.prepare_history", "HeliosPrepareHistoryStep"),
        ("denoise.seed_history", "HeliosI2VSeedHistoryStep"),
        ("denoise.pyramid_chunk_denoise", "HeliosPyramidI2VChunkDenoiseStep"),
        ("decode", "HeliosDecodeStep"),
    ],
    "video2video": [
        ("text_encoder", "HeliosTextEncoderStep"),
        ("vae_encoder", "HeliosVideoVaeEncoderStep"),
        ("denoise.input", "HeliosTextInputStep"),
        ("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
        ("denoise.add_noise_video", "HeliosAddNoiseToVideoLatentsStep"),
        ("denoise.prepare_history", "HeliosPrepareHistoryStep"),
        ("denoise.seed_history", "HeliosV2VSeedHistoryStep"),
        # NOTE(review): video2video reuses the pyramid I2V chunk-denoise step —
        # confirm this is intentional.
        ("denoise.pyramid_chunk_denoise", "HeliosPyramidI2VChunkDenoiseStep"),
        ("decode", "HeliosDecodeStep"),
    ],
}
class TestHeliosPyramidModularPipelineFast(ModularPipelineTesterMixin):
    """Fast (tiny-checkpoint) tests for the pyramid Helios modular pipeline."""

    pipeline_class = HeliosPyramidModularPipeline
    pipeline_blocks_class = HeliosPyramidAutoBlocks
    pretrained_model_name_or_path = "hf-internal-testing/tiny-helios-pyramid-modular-pipe"
    params = frozenset({"prompt", "height", "width", "num_frames"})
    batch_params = frozenset({"prompt"})
    optional_params = frozenset({"pyramid_num_inference_steps_list", "num_videos_per_prompt", "latents"})
    output_name = "videos"
    expected_workflow_blocks = HELIOS_PYRAMID_WORKFLOWS

    def get_dummy_inputs(self, seed=0):
        # Two pyramid stages of two steps each; resolution matches what the
        # tiny pyramid checkpoint expects.
        return {
            "prompt": "A painting of a squirrel eating a burger",
            "generator": self.get_generator(seed),
            "pyramid_num_inference_steps_list": [2, 2],
            "height": 64,
            "width": 64,
            "num_frames": 9,
            "max_sequence_length": 16,
            "output_type": "pt",
        }

    def test_inference_batch_single_identical(self):
        # Pyramid pipeline injects noise at each stage, so batch vs single can differ more
        super().test_inference_batch_single_identical(expected_max_diff=5e-1)

    @pytest.mark.skip(reason="Pyramid multi-stage noise makes offload comparison unreliable with tiny models")
    def test_components_auto_cpu_offload_inference_consistent(self):
        pass

    @pytest.mark.skip(reason="Pyramid multi-stage noise makes save/load comparison unreliable with tiny models")
    def test_save_from_pretrained(self):
        pass

    @pytest.mark.skip(reason="num_videos_per_prompt")
    def test_num_images_per_prompt(self):
        pass