Compare commits

..

5 Commits

Author SHA1 Message Date
Álvaro Somoza
5bf248ddd8 [SkyReelsV2] Fix ftfy import (#13113)
fix
2026-02-10 12:56:13 +05:30
Dhruv Nair
bedc67c75f [Docs] Add guide for AutoModel with custom code (#13099)
update
2026-02-10 12:19:44 +05:30
Sayak Paul
20efb79d49 [modular] add modular tests for Z-Image and Wan (#13078)
* add wan modular tests

* style.

* add z-image tests and other fixes.

* style.

* increase tolerance for zimage

* style

* address reviewer feedback.

* address reviewer feedback.

* remove unneeded func

* simplify even more.
2026-02-09 08:27:59 -10:00
Linoy Tsaban
8933686770 Z image lora training (#13056)
* initial commit

* initial commit

* initial commit

* initial commit

* initial commit

* initial commit

* initial commit

* fix vae

* fix prompts

* Apply style fixes

* fix license

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-02-09 15:45:59 +02:00
dg845
baaa8d040b LTX 2 Improve encode_video by Accepting More Input Types (#13057)
* Support different pipeline outputs for LTX 2 encode_video

* Update examples to use improved encode_video function

* Fix comment

* Address review comments

* make style and make quality

* Have non-iterator video inputs respect video_chunks_number

* make style and make quality

* Add warning when encode_video receives a non-denormalized np.ndarray

* make style and make quality

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2026-02-08 19:40:34 -08:00
22 changed files with 2848 additions and 523 deletions

View File

@@ -106,8 +106,6 @@ video, audio = pipe(
output_type="np",
return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)
encode_video(
video[0],
@@ -185,8 +183,6 @@ video, audio = pipe(
output_type="np",
return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)
encode_video(
video[0],

View File

@@ -25,7 +25,9 @@ This guide explains how states work and how they connect blocks.
The [`~modular_pipelines.PipelineState`] is a global state container for all blocks. It maintains the complete runtime state of the pipeline and provides a structured way for blocks to read from and write to shared data.
[`~modular_pipelines.PipelineState`] stores all data in a `values` dict, which is a **mutable** state containing user provided input values and intermediate output values generated by blocks. If a block modifies an `input`, it will be reflected in the `values` dict after calling `set_block_state`.
There are two dicts in [`~modular_pipelines.PipelineState`] for structuring data.
- The `values` dict is a **mutable** state containing a copy of user provided input values and intermediate output values generated by blocks. If a block modifies an `input`, it will be reflected in the `values` dict after calling `set_block_state`.
```py
PipelineState(

View File

@@ -12,28 +12,27 @@ specific language governing permissions and limitations under the License.
# ModularPipeline
[`ModularPipeline`] converts [`~modular_pipelines.ModularPipelineBlocks`] into an executable pipeline that loads models and performs the computation steps defined in the blocks. It is the main interface for running a pipeline and the API is very similar to [`DiffusionPipeline`] but with a few key differences.
[`ModularPipeline`] converts [`~modular_pipelines.ModularPipelineBlocks`]'s into an executable pipeline that loads models and performs the computation steps defined in the block. It is the main interface for running a pipeline and it is very similar to the [`DiffusionPipeline`] API.
- **Loading is lazy.** With [`DiffusionPipeline`], [`~DiffusionPipeline.from_pretrained`] creates the pipeline and loads all models at the same time. With [`ModularPipeline`], creating and loading are two separate steps: [`~ModularPipeline.from_pretrained`] reads the configuration and knows where to load each component from, but doesn't actually load the model weights. You load the models later with [`~ModularPipeline.load_components`], which is where you pass loading arguments like `torch_dtype` and `quantization_config`.
- **Two ways to create a pipeline.** You can use [`~ModularPipeline.from_pretrained`] with an existing diffusers model repository — it automatically maps to the default pipeline blocks and then converts to a [`ModularPipeline`] with no extra setup. Currently supported models include SDXL, Wan, Qwen, Z-Image, Flux, and Flux2. You can also assemble your own pipeline from [`ModularPipelineBlocks`] and convert it with the [`~ModularPipelineBlocks.init_pipeline`] method (see [Creating a pipeline](#creating-a-pipeline) for more details).
- **Running the pipeline is the same.** Once loaded, you call the pipeline with the same arguments you're used to. A single [`ModularPipeline`] can support multiple workflows (text-to-image, image-to-image, inpainting, etc.) when the pipeline blocks use [`AutoPipelineBlocks`](./auto_pipeline_blocks) to automatically select the workflow based on your inputs.
Below are complete examples for text-to-image, image-to-image, and inpainting with SDXL.
The main difference is to include an expected `output` argument in the pipeline.
<hfoptions id="example">
<hfoption id="text-to-image">
```py
import torch
from diffusers import ModularPipeline
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS
blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
pipeline = blocks.init_pipeline(modular_repo_id)
pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")
image = pipeline(prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
image = pipeline(prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", output="images")[0]
image.save("modular_t2i_out.png")
```
@@ -42,17 +41,21 @@ image.save("modular_t2i_out.png")
```py
import torch
from diffusers import ModularPipeline
from diffusers.utils import load_image
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import IMAGE2IMAGE_BLOCKS
blocks = SequentialPipelineBlocks.from_blocks_dict(IMAGE2IMAGE_BLOCKS)
modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
pipeline = blocks.init_pipeline(modular_repo_id)
pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
init_image = load_image(url)
prompt = "a dog catching a frisbee in the jungle"
image = pipeline(prompt=prompt, image=init_image, strength=0.8).images[0]
image = pipeline(prompt=prompt, image=init_image, strength=0.8, output="images")[0]
image.save("modular_i2i_out.png")
```
@@ -61,10 +64,15 @@ image.save("modular_i2i_out.png")
```py
import torch
from diffusers import ModularPipeline
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import INPAINT_BLOCKS
from diffusers.utils import load_image
pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
blocks = SequentialPipelineBlocks.from_blocks_dict(INPAINT_BLOCKS)
modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
pipeline = blocks.init_pipeline(modular_repo_id)
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")
@@ -75,353 +83,276 @@ init_image = load_image(img_url)
mask_image = load_image(mask_url)
prompt = "A deep sea diver floating"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85).images[0]
image.save("modular_inpaint_out.png")
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, output="images")[0]
image.save("moduar_inpaint_out.png")
```
</hfoption>
</hfoptions>
This guide will show you how to create a [`ModularPipeline`], manage the components in it, and run it.
This guide will show you how to create a [`ModularPipeline`] and manage the components in it.
## Adding blocks
Blocks are [`InsertableDict`] objects that can be inserted at specific positions, providing a flexible way to mix-and-match blocks.
Use [`~modular_pipelines.modular_pipeline_utils.InsertableDict.insert`] on either the block class or `sub_blocks` attribute to add a block.
```py
# BLOCKS is a dict of block classes; add a class to it
BLOCKS.insert("block_name", BlockClass, index)
# the sub_blocks attribute contains block instances; add an instance to it
t2i_blocks.sub_blocks.insert("block_name", block_instance, index)
```
Use [`~modular_pipelines.modular_pipeline_utils.InsertableDict.pop`] on either the block class or `sub_blocks` attribute to remove a block.
```py
# remove a block class from preset
BLOCKS.pop("text_encoder")
# split out a block instance on its own
text_encoder_block = t2i_blocks.sub_blocks.pop("text_encoder")
```
Swap blocks by setting the existing block to the new block.
```py
# Replace block class in preset
BLOCKS["prepare_latents"] = CustomPrepareLatents
# Replace in the sub_blocks attribute using a block instance
t2i_blocks.sub_blocks["prepare_latents"] = CustomPrepareLatents()
```
## Creating a pipeline
There are two ways to create a [`ModularPipeline`]. Assemble and create a pipeline from [`ModularPipelineBlocks`] with [`~ModularPipelineBlocks.init_pipeline`], or load an existing pipeline with [`~ModularPipeline.from_pretrained`].
There are two ways to create a [`ModularPipeline`]. Assemble and create a pipeline from [`ModularPipelineBlocks`] or load an existing pipeline with [`~ModularPipeline.from_pretrained`].
You can also initialize a [`ComponentsManager`](./components_manager) to handle device placement and memory management. If you don't need automatic offloading, you can skip this and move the pipeline to your device manually with `pipeline.to("cuda")`.
You should also initialize a [`ComponentsManager`] to handle device placement and memory and component management.
> [!TIP]
> Refer to the [ComponentsManager](./components_manager) doc for more details about how it can help manage components across different workflows.
### init_pipeline
<hfoptions id="create">
<hfoption id="ModularPipelineBlocks">
[`~ModularPipelineBlocks.init_pipeline`] converts any [`ModularPipelineBlocks`] into a [`ModularPipeline`].
Let's define a minimal block to see how it works:
Use the [`~ModularPipelineBlocks.init_pipeline`] method to create a [`ModularPipeline`] from the component and configuration specifications. This method loads the *specifications* from a `modular_model_index.json` file, but it doesn't load the *models* yet.
```py
from transformers import CLIPTextModel
from diffusers.modular_pipelines import (
ComponentSpec,
ModularPipelineBlocks,
PipelineState,
)
from diffusers import ComponentsManager
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS
class MyBlock(ModularPipelineBlocks):
@property
def expected_components(self):
return [
ComponentSpec(
name="text_encoder",
type_hint=CLIPTextModel,
pretrained_model_name_or_path="openai/clip-vit-large-patch14",
),
]
t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
def __call__(self, components, state: PipelineState) -> PipelineState:
return components, state
modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
components = ComponentsManager()
t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id, components_manager=components)
```
Call [`~ModularPipelineBlocks.init_pipeline`] to convert it into a pipeline. The `blocks` attribute on the pipeline is the blocks it was created from — it determines the expected inputs, outputs, and computation logic.
</hfoption>
<hfoption id="from_pretrained">
```py
block = MyBlock()
pipe = block.init_pipeline()
pipe.blocks
```
```
MyBlock {
"_class_name": "MyBlock",
"_diffusers_version": "0.37.0.dev0"
}
```
> [!WARNING]
> Blocks are mutable — you can freely add, remove, or swap blocks before creating a pipeline. However, once a pipeline is created, modifying `pipeline.blocks` won't affect the pipeline because it returns a copy. If you want a different block structure, create a new pipeline after modifying the blocks.
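As a quick sketch of what this means in practice, using the SDXL preset blocks and repository shown in the examples above:

```py
from diffusers.modular_pipelines import SequentialPipelineBlocks
from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS

blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
blocks.sub_blocks.pop("text_encoder")  # fine: the pipeline hasn't been created yet

pipe = blocks.init_pipeline("stabilityai/stable-diffusion-xl-base-1.0")
# from here on, pipe.blocks returns a copy, so edits to it won't change the pipeline;
# rebuild with blocks.init_pipeline(...) if you need a different block structure
```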
When you call [`~ModularPipelineBlocks.init_pipeline`] without a repository, it uses the `pretrained_model_name_or_path` defined in the block's [`ComponentSpec`] to determine where to load each component from. Printing the pipeline shows the component loading configuration.
```py
pipe
ModularPipeline {
"_blocks_class_name": "MyBlock",
"_class_name": "ModularPipeline",
"_diffusers_version": "0.37.0.dev0",
"text_encoder": [
null,
null,
{
"pretrained_model_name_or_path": "openai/clip-vit-large-patch14",
"revision": null,
"subfolder": "",
"type_hint": [
"transformers",
"CLIPTextModel"
],
"variant": null
}
]
}
```
If you pass a repository to [`~ModularPipelineBlocks.init_pipeline`], it overrides the loading path by matching your block's components against the pipeline config in that repository (`model_index.json` or `modular_model_index.json`).
In the example below, the `pretrained_model_name_or_path` will be updated to `"stabilityai/stable-diffusion-xl-base-1.0"`.
```py
pipe = block.init_pipeline("stabilityai/stable-diffusion-xl-base-1.0")
pipe
ModularPipeline {
"_blocks_class_name": "MyBlock",
"_class_name": "ModularPipeline",
"_diffusers_version": "0.37.0.dev0",
"text_encoder": [
null,
null,
{
"pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0",
"revision": null,
"subfolder": "text_encoder",
"type_hint": [
"transformers",
"CLIPTextModel"
],
"variant": null
}
]
}
```
If a component in your block doesn't exist in the repository, it remains `null` and is skipped during [`~ModularPipeline.load_components`].
### from_pretrained
[`~ModularPipeline.from_pretrained`] is a convenient way to create a [`ModularPipeline`] without defining blocks yourself.
It works with three types of repositories.
**A regular diffusers repository.** Pass any supported model repository and it automatically maps to the default pipeline blocks. Currently supported models include SDXL, Wan, Qwen, Z-Image, Flux, and Flux2.
The [`~ModularPipeline.from_pretrained`] method creates a [`ModularPipeline`] from a modular repository on the Hub.
```py
from diffusers import ModularPipeline, ComponentsManager
components = ComponentsManager()
pipeline = ModularPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", components_manager=components
)
pipeline = ModularPipeline.from_pretrained("YiYiXu/modular-loader-t2i-0704", components_manager=components)
```
**A modular repository.** These repositories contain a `modular_model_index.json` that specifies where to load each component from — the components can come from different repositories and the modular repository itself may not contain any model weights. For example, [diffusers/flux2-bnb-4bit-modular](https://huggingface.co/diffusers/flux2-bnb-4bit-modular) loads a quantized transformer from one repository and the remaining components from another. See [Modular repository](#modular-repository) for more details on the format.
Add the `trust_remote_code` argument to load a custom [`ModularPipeline`].
```py
from diffusers import ModularPipeline, ComponentsManager
components = ComponentsManager()
pipeline = ModularPipeline.from_pretrained(
"diffusers/flux2-bnb-4bit-modular", components_manager=components
)
modular_repo_id = "YiYiXu/modular-diffdiff-0704"
diffdiff_pipeline = ModularPipeline.from_pretrained(modular_repo_id, trust_remote_code=True, components_manager=components)
```
**A modular repository with custom code.** Some repositories include custom pipeline blocks alongside the loading configuration. Add `trust_remote_code=True` to load them. See [Custom blocks](./custom_blocks) for how to create your own.
```py
from diffusers import ModularPipeline, ComponentsManager
components = ComponentsManager()
pipeline = ModularPipeline.from_pretrained(
"diffusers/Florence2-image-Annotator", trust_remote_code=True, components_manager=components
)
```
</hfoption>
</hfoptions>
## Loading components
A [`ModularPipeline`] doesn't automatically instantiate with components. It only loads the configuration and component specifications. You can load components with [`~ModularPipeline.load_components`].
A [`ModularPipeline`] doesn't automatically instantiate with components. It only loads the configuration and component specifications. You can load all components with [`~ModularPipeline.load_components`] or only specific components by passing their names to [`~ModularPipeline.load_components`].
This will load all the components that have a valid loading spec.
<hfoptions id="load">
<hfoption id="load_components">
```py
import torch
pipeline.load_components(torch_dtype=torch.float16)
t2i_pipeline.load_components(torch_dtype=torch.float16)
t2i_pipeline.to("cuda")
```
You can also load specific components by name. The example below only loads the text_encoder.
</hfoption>
<hfoption id="load_components">
The example below only loads the UNet and VAE.
```py
pipeline.load_components(names=["text_encoder"], torch_dtype=torch.float16)
import torch
t2i_pipeline.load_components(names=["unet", "vae"], torch_dtype=torch.float16)
```
After loading, printing the pipeline shows which components are loaded — the first two fields change from `null` to the component's library and class.
</hfoption>
</hfoptions>
Print the pipeline to inspect the loaded pretrained components.
```py
pipeline
t2i_pipeline
```
```
# text_encoder is loaded - shows library and class
"text_encoder": [
"transformers",
"CLIPTextModel",
{ ... }
]
This should match the `modular_model_index.json` file from the modular repository a pipeline is initialized from. If a pipeline doesn't need a component, it won't be included even if it exists in the modular repository.
# unet is not loaded yet - still null
To modify where components are loaded from, edit the `modular_model_index.json` file in the repository and change it to your desired loading path. The example below loads a UNet from a different repository.
```json
# original
"unet": [
null,
null,
{ ... }
null, null,
{
"repo": "stabilityai/stable-diffusion-xl-base-1.0",
"subfolder": "unet",
"variant": "fp16"
}
]
# modified
"unet": [
null, null,
{
"repo": "RunDiffusion/Juggernaut-XL-v9",
"subfolder": "unet",
"variant": "fp16"
}
]
```
Loading keyword arguments like `torch_dtype`, `variant`, `revision`, and `quantization_config` are passed through to `from_pretrained()` for each component. You can pass a single value to apply to all components, or a dict to set per-component values.
### Component loading status
The pipeline properties below provide more information about which components are loaded.
Use `component_names` to return all expected components.
```py
# apply bfloat16 to all components
pipeline.load_components(torch_dtype=torch.bfloat16)
# different dtypes per component
pipeline.load_components(torch_dtype={"transformer": torch.bfloat16, "default": torch.float32})
t2i_pipeline.component_names
['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'guider', 'scheduler', 'unet', 'vae', 'image_processor']
```
Note that [`~ModularPipeline.load_components`] only loads components that haven't been loaded yet and have a valid loading spec. This means if you've already set a component on the pipeline, calling [`~ModularPipeline.load_components`] again won't reload it.
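For example, a minimal sketch reusing the pipeline above:

```py
# load only the text encoder first
pipeline.load_components(names=["text_encoder"], torch_dtype=torch.float16)
# a second call only loads what is still missing; the already-loaded text_encoder is skipped
pipeline.load_components(torch_dtype=torch.float16)
```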
Use `null_component_names` to return components that aren't loaded yet. Load these components with [`~ModularPipeline.load_components`].
```py
t2i_pipeline.null_component_names
['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'scheduler']
```
Use `pretrained_component_names` to return components that will be loaded from pretrained models.
```py
t2i_pipeline.pretrained_component_names
['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'scheduler', 'unet', 'vae']
```
Use `config_component_names` to return components that are created with the default config (not loaded from a modular repository). Components from a config aren't included because they are already initialized during pipeline creation. This is why they aren't listed in `null_component_names`.
```py
t2i_pipeline.config_component_names
['guider', 'image_processor']
```
## Updating components
[`~ModularPipeline.update_components`] replaces a component on the pipeline with a new one. When a component is updated, the loading specifications are also updated in the pipeline config and [`~ModularPipeline.load_components`] will skip it on subsequent calls.
Components are updated differently depending on whether they are a *pretrained component* or a *config component*.
### From AutoModel
> [!WARNING]
> A component may change from pretrained to config when updating a component. The component type is initially defined in a block's `expected_components` field.
You can pass a model object loaded with `AutoModel.from_pretrained()`. Models loaded this way are automatically tagged with their loading information.
A pretrained component is updated with [`ComponentSpec`] whereas a config component is updated by either passing the object directly or with [`ComponentSpec`].
The [`ComponentSpec`] shows `default_creation_method="from_pretrained"` for a pretrained component and `default_creation_method="from_config"` for a config component.
To update a pretrained component, create a [`ComponentSpec`] with the name of the component and where to load it from. Use the [`~ComponentSpec.load`] method to load the component.
```py
from diffusers import AutoModel
from diffusers import ComponentSpec, UNet2DConditionModel
unet = AutoModel.from_pretrained(
"RunDiffusion/Juggernaut-XL-v9", subfolder="unet", variant="fp16", torch_dtype=torch.float16
)
pipeline.update_components(unet=unet)
unet_spec = ComponentSpec(name="unet", type_hint=UNet2DConditionModel, repo="stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", variant="fp16")
unet = unet_spec.load(torch_dtype=torch.float16)
```
### From ComponentSpec
Use [`~ModularPipeline.get_component_spec`] to get a copy of the current component specification, modify it, and load a new component.
The [`~ModularPipeline.update_components`] method replaces the component with a new one.
```py
unet_spec = pipeline.get_component_spec("unet")
t2i_pipeline.update_components(unet=unet2)
```
When a component is updated, the loading specifications are also updated in the pipeline config.
### Component extraction and modification
When you use [`~ComponentSpec.load`], the new component maintains its loading specifications. This makes it possible to extract the specification and recreate the component.
```py
spec = ComponentSpec.from_component("unet", unet2)
spec
ComponentSpec(name='unet', type_hint=<class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>, description=None, config=None, repo='stabilityai/stable-diffusion-xl-base-1.0', subfolder='unet', variant='fp16', revision=None, default_creation_method='from_pretrained')
unet2_recreated = spec.load(torch_dtype=torch.float16)
```
The [`~ModularPipeline.get_component_spec`] method gets a copy of the current component specification to modify or update.
```py
unet_spec = t2i_pipeline.get_component_spec("unet")
unet_spec
ComponentSpec(
name='unet',
type_hint=<class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>,
pretrained_model_name_or_path='RunDiffusion/Juggernaut-XL-v9',
subfolder='unet',
variant='fp16',
default_creation_method='from_pretrained'
)
# modify to load from a different repository
unet_spec.pretrained_model_name_or_path = "RunDiffusion/Juggernaut-XL-v9"
unet_spec.pretrained_model_name_or_path = "stabilityai/stable-diffusion-xl-base-1.0"
# load and update
# load component with modified spec
unet = unet_spec.load(torch_dtype=torch.float16)
pipeline.update_components(unet=unet)
```
You can also create a [`ComponentSpec`] from scratch.
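For example, a minimal sketch, assuming the `pretrained_model_name_or_path` field shown in the printed specs above:

```py
import torch
from diffusers import UNet2DConditionModel
from diffusers.modular_pipelines import ComponentSpec

unet_spec = ComponentSpec(
    name="unet",
    type_hint=UNet2DConditionModel,
    pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="unet",
    variant="fp16",
)
unet = unet_spec.load(torch_dtype=torch.float16)
pipeline.update_components(unet=unet)
```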
Not all components are loaded from pretrained weights — some are created from a config (listed under `pipeline.config_component_names`). For these, use [`~ComponentSpec.create`] instead of [`~ComponentSpec.load`].
```py
guider_spec = pipeline.get_component_spec("guider")
guider_spec.config = {"guidance_scale": 5.0}
guider = guider_spec.create()
pipeline.update_components(guider=guider)
```
Or simply pass the object directly.
```py
from diffusers.guiders import ClassifierFreeGuidance
guider = ClassifierFreeGuidance(guidance_scale=5.0)
pipeline.update_components(guider=guider)
```
See the [Guiders](./guiders) guide for more details on available guiders and how to configure them.
## Splitting a pipeline into stages
Since blocks are composable, you can take a pipeline apart and reconstruct it into separate pipelines for each stage. The example below shows how we can separate the text encoder block from the rest of the pipeline, so you can encode the prompt independently and pass the embeddings to the main pipeline.
```py
from diffusers import ModularPipeline, ComponentsManager
import torch
device = "cuda"
dtype = torch.bfloat16
repo_id = "black-forest-labs/FLUX.2-klein-4B"
# get the blocks and separate out the text encoder
blocks = ModularPipeline.from_pretrained(repo_id).blocks
text_block = blocks.sub_blocks.pop("text_encoder")
# use ComponentsManager to handle offloading across multiple pipelines
manager = ComponentsManager()
manager.enable_auto_cpu_offload(device=device)
# create separate pipelines for each stage
text_encoder_pipeline = text_block.init_pipeline(repo_id, components_manager=manager)
pipeline = blocks.init_pipeline(repo_id, components_manager=manager)
# encode text
text_encoder_pipeline.load_components(torch_dtype=dtype)
text_embeddings = text_encoder_pipeline(prompt="a cat").get_by_kwargs("denoiser_input_fields")
# denoise and decode
pipeline.load_components(torch_dtype=dtype)
output = pipeline(
**text_embeddings,
num_inference_steps=4,
).images[0]
```
[`ComponentsManager`] handles memory across multiple pipelines. Unlike the offloading strategies in [`DiffusionPipeline`] that follow a fixed order, [`ComponentsManager`] makes offloading decisions dynamically each time a model forward pass runs, based on the current memory situation. This means it works regardless of how many pipelines you create or what order you run them in. See the [ComponentsManager](./components_manager) guide for more details.
If pipeline stages share components (e.g., the same VAE used for encoding and decoding), you can use [`~ModularPipeline.update_components`] to pass an already-loaded component to another pipeline instead of loading it again.
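For example, a sketch assuming a hypothetical `decode_pipeline` stage that also expects a `vae` component:

```py
# reuse the VAE that `pipeline` already loaded instead of loading it a second time
decode_pipeline.update_components(vae=pipeline.vae)
```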
## Modular repository
A repository is required if the pipeline blocks use *pretrained components*. The repository supplies loading specifications and metadata.
[`ModularPipeline`] works with regular diffusers repositories out of the box. However, you can also create a *modular repository* for more flexibility. A modular repository contains a `modular_model_index.json` file containing the following 3 elements.
[`ModularPipeline`] specifically requires *modular repositories* (see [example repository](https://huggingface.co/YiYiXu/modular-diffdiff)) which are more flexible than a typical repository. It contains a `modular_model_index.json` file containing the following 3 elements.
- `library` and `class` shows which library the component was loaded from and its class. If `null`, the component hasn't been loaded yet.
- `library` and `class` shows which library the component was loaded from and it's class. If `null`, the component hasn't been loaded yet.
- `loading_specs_dict` contains the information required to load the component such as the repository and subfolder it is loaded from.
The key advantage of a modular repository is that components can be loaded from different repositories. For example, [diffusers/flux2-bnb-4bit-modular](https://huggingface.co/diffusers/flux2-bnb-4bit-modular) loads a quantized transformer from `diffusers/FLUX.2-dev-bnb-4bit` while loading the remaining components from `black-forest-labs/FLUX.2-dev`.
Unlike standard repositories, a modular repository can fetch components from different repositories based on the `loading_specs_dict`. Components don't need to exist in the same repository.
To convert a regular diffusers repository into a modular one, create the pipeline using the regular repository, and then push to the Hub. The saved repository will contain a `modular_model_index.json` with all the loading specifications.
```py
from diffusers import ModularPipeline
# load from a regular repo
pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
# push as a modular repository
pipeline.save_pretrained("local/path", repo_id="my-username/sdxl-modular", push_to_hub=True)
```
A modular repository can also include custom pipeline blocks as Python code. This allows you to share specialized blocks that aren't native to Diffusers. For example, [diffusers/Florence2-image-Annotator](https://huggingface.co/diffusers/Florence2-image-Annotator) contains custom blocks alongside the loading configuration:
A modular repository may contain custom code for loading a [`ModularPipeline`]. This allows you to use specialized blocks that aren't native to Diffusers.
```
Florence2-image-Annotator/
modular-diffdiff-0704/
├── block.py # Custom pipeline blocks implementation
├── config.json # Pipeline configuration and auto_map
├── mellon_config.json # UI configuration for Mellon
└── modular_model_index.json # Component loading specifications
```
The `config.json` file contains an `auto_map` key that tells [`ModularPipeline`] where to find the custom blocks:
The [config.json](https://huggingface.co/YiYiXu/modular-diffdiff-0704/blob/main/config.json) file contains an `auto_map` key that points to where a custom block is defined in `block.py`.
```json
{
"_class_name": "Florence2AnnotatorBlocks",
"_class_name": "DiffDiffBlocks",
"auto_map": {
"ModularPipelineBlocks": "block.Florence2AnnotatorBlocks"
"ModularPipelineBlocks": "block.DiffDiffBlocks"
}
}
```
Load custom code repositories with `trust_remote_code=True` as shown in [from_pretrained](#from_pretrained). See [Custom blocks](./custom_blocks) for how to create and share your own.

View File

@@ -25,42 +25,56 @@ This guide will show you how to create a [`~modular_pipelines.ModularPipelineBlo
A [`~modular_pipelines.ModularPipelineBlocks`] requires `inputs` and `intermediate_outputs`.
- `inputs` are values a block reads from the [`~modular_pipelines.PipelineState`] to perform its computation. These can be values provided by a user (like a prompt or image) or values produced by a previous block (like encoded image_latents).
- `inputs` are values provided by a user and retrieved from the [`~modular_pipelines.PipelineState`]. This is useful because some workflows resize an image, but the original image is still required. The [`~modular_pipelines.PipelineState`] maintains the original image.
Use `InputParam` to define `inputs`.
```py
class ImageEncodeStep(ModularPipelineBlocks):
...
```py
from diffusers.modular_pipelines import InputParam
@property
def inputs(self):
return [
InputParam(name="image", type_hint="PIL.Image", required=True, description="raw input image to process"),
]
...
```
user_inputs = [
InputParam(name="image", type_hint="PIL.Image", description="raw input image to process")
]
```
-- `intermediate_outputs` are new values created by a block and added to the [`~modular_pipelines.PipelineState`]. The `intermediate_outputs` are available as `inputs` for subsequent blocks or available as the final output from running the pipeline.
- `intermediate_outputs` are new values created by a block and added to the [`~modular_pipelines.PipelineState`]. The `intermediate_outputs` are available as `inputs` for subsequent blocks or available as the final output from running the pipeline.
Use `OutputParam` to define `intermediate_outputs`.
```py
class ImageEncodeStep(ModularPipelineBlocks):
...
```py
from diffusers.modular_pipelines import OutputParam
@property
def intermediate_outputs(self):
return [
OutputParam(name="image_latents", description="latents representing the image"),
]
...
```
user_intermediate_outputs = [
OutputParam(name="image_latents", description="latents representing the image")
]
```
The intermediate inputs and outputs share data to connect blocks. They are accessible at any point, allowing you to track the workflow's progress.
## Components and configs
## Computation logic
The computation a block performs is defined in the `__call__` method and it follows a specific structure.
1. Retrieve the [`~modular_pipelines.BlockState`] to get a local view of the `inputs`
2. Implement the computation logic on the `inputs`.
3. Update [`~modular_pipelines.PipelineState`] to push changes from the local [`~modular_pipelines.BlockState`] back to the global [`~modular_pipelines.PipelineState`].
4. Return the components and state which becomes available to the next block.
```py
def __call__(self, components, state):
# Get a local view of the state variables this block needs
block_state = self.get_block_state(state)
# Your computation logic here
# block_state contains all your inputs
# Access them like: block_state.image, block_state.processed_image
# Update the pipeline state with your updated block_states
self.set_block_state(state, block_state)
return components, state
```
### Components and configs
The components and pipeline-level configs a block needs are specified in [`ComponentSpec`] and [`~modular_pipelines.ConfigSpec`].
@@ -68,108 +82,24 @@ The components and pipeline-level configs a block needs are specified in [`Compo
- [`~modular_pipelines.ConfigSpec`] contains pipeline-level settings that control behavior across all blocks.
```py
class ImageEncodeStep(ModularPipelineBlocks):
...
from diffusers import ComponentSpec, ConfigSpec
@property
def expected_components(self):
return [
ComponentSpec(name="vae", type_hint=AutoencoderKL),
]
expected_components = [
ComponentSpec(name="unet", type_hint=UNet2DConditionModel),
ComponentSpec(name="scheduler", type_hint=EulerDiscreteScheduler)
]
@property
def expected_configs(self):
return [
ConfigSpec("force_zeros_for_empty_prompt", True),
]
...
expected_config = [
ConfigSpec("force_zeros_for_empty_prompt", True)
]
```
When the blocks are converted into a pipeline, the components become available to the block as the first argument in `__call__`.
## Computation logic
The computation a block performs is defined in the `__call__` method and it follows a specific structure.
1. Retrieve the [`~modular_pipelines.BlockState`] to get a local view of the `inputs`.
2. Implement the computation logic on the `inputs`.
3. Update [`~modular_pipelines.PipelineState`] to push changes from the local [`~modular_pipelines.BlockState`] back to the global [`~modular_pipelines.PipelineState`].
4. Return the components and state which becomes available to the next block.
```py
class ImageEncodeStep(ModularPipelineBlocks):
def __call__(self, components, state):
# Get a local view of the state variables this block needs
block_state = self.get_block_state(state)
# Your computation logic here
# block_state contains all your inputs
# Access them like: block_state.image, block_state.processed_image
# Update the pipeline state with your updated block_states
self.set_block_state(state, block_state)
return components, state
def __call__(self, components, state):
# Access components using dot notation
unet = components.unet
vae = components.vae
scheduler = components.scheduler
```
## Putting it all together
Here is the complete block with all the pieces connected.
```py
from diffusers import ComponentSpec, AutoencoderKL
from diffusers.modular_pipelines import InputParam, ModularPipelineBlocks, OutputParam
class ImageEncodeStep(ModularPipelineBlocks):
@property
def description(self):
return "Encode an image into latent space."
@property
def expected_components(self):
return [
ComponentSpec(name="vae", type_hint=AutoencoderKL),
]
@property
def inputs(self):
return [
InputParam(name="image", type_hint="PIL.Image", required=True, description="raw input image to process"),
]
@property
def intermediate_outputs(self):
return [
OutputParam(name="image_latents", type_hint="torch.Tensor", description="latents representing the image"),
]
def __call__(self, components, state):
block_state = self.get_block_state(state)
block_state.image_latents = components.vae.encode(block_state.image)
self.set_block_state(state, block_state)
return components, state
```
Every block has a `doc` property that is automatically generated from the properties you defined above. It provides a summary of the block's description, components, inputs, and outputs.
```py
block = ImageEncodeStep()
print(block.doc)
class ImageEncodeStep
Encode an image into latent space.
Components:
vae (`AutoencoderKL`)
Inputs:
image (`PIL.Image`):
raw input image to process
Outputs:
image_latents (`torch.Tensor`):
latents representing the image
```

View File

@@ -39,44 +39,17 @@ image
[`~ModularPipeline.from_pretrained`] uses lazy loading - it reads the configuration to learn where to load each component from, but doesn't actually load the model weights until you call [`~ModularPipeline.load_components`]. This gives you control over when and how components are loaded.
> [!TIP]
> `ComponentsManager` with `enable_auto_cpu_offload` automatically moves models between CPU and GPU as needed, reducing memory usage for large models like Qwen-Image. Learn more in the [ComponentsManager](./components_manager) guide.
>
> If you don't need offloading, simply remove the `components_manager` argument and move the pipeline to your device manually with `pipe.to("cuda")`.
> [`ComponentsManager`] with `enable_auto_cpu_offload` automatically moves models between CPU and GPU as needed, reducing memory usage for large models like Qwen-Image. Learn more in the [ComponentsManager](./components_manager) guide.
Learn more about creating and loading pipelines in the [Creating a pipeline](https://huggingface.co/docs/diffusers/modular_diffusers/modular_pipeline#creating-a-pipeline) and [Loading components](https://huggingface.co/docs/diffusers/modular_diffusers/modular_pipeline#loading-components) guides.
## Understand the structure
A [`ModularPipeline`] has two parts: a **definition** (the blocks) and a **state** (the loaded components and configs).
A [`ModularPipeline`] has two parts:
- **State**: the loaded components (models, schedulers, processors) and configuration
- **Definition**: the [`ModularPipelineBlocks`] that specify inputs, outputs, expected components and computation logic
Print the pipeline to see its state — the components and their loading status and configuration.
```py
print(pipe)
```
```
QwenImageModularPipeline {
"_blocks_class_name": "QwenImageAutoBlocks",
"_class_name": "QwenImageModularPipeline",
"_diffusers_version": "0.37.0.dev0",
"transformer": [
"diffusers",
"QwenImageTransformer2DModel",
{
"pretrained_model_name_or_path": "Qwen/Qwen-Image",
"revision": null,
"subfolder": "transformer",
"type_hint": [
"diffusers",
"QwenImageTransformer2DModel"
],
"variant": null
}
],
...
}
```
Access the definition through `pipe.blocks` — this is the [`~modular_pipelines.ModularPipelineBlocks`] that defines the pipeline's workflows, inputs, outputs, and computation logic.
The blocks define *what* the pipeline does. Access them through `pipe.blocks`.
```py
print(pipe.blocks)
```
@@ -114,9 +87,7 @@ The output returns:
### Workflows
This pipeline supports multiple workflows and adapts its behavior based on the inputs you provide. For example, if you pass `image` to the pipeline, it runs an image-to-image workflow instead of text-to-image. Learn more about how this works under the hood in the [AutoPipelineBlocks](https://huggingface.co/docs/diffusers/modular_diffusers/auto_pipeline_blocks) guide.
Let's see this in action with an example.
`QwenImageAutoBlocks` is a [`ConditionalPipelineBlocks`], so this pipeline supports multiple workflows and adapts its behavior based on the inputs you provide. For example, if you pass `image` to the pipeline, it runs an image-to-image workflow instead of text-to-image. Let's see this in action with an example.
```py
from diffusers.utils import load_image
@@ -128,21 +99,20 @@ image = pipe(
).images[0]
```
Use `get_workflow()` to extract the blocks for a specific workflow. Pass the workflow name (e.g., `"image2image"`, `"inpainting"`, `"controlnet_text2image"`) to get only the blocks relevant to that workflow. This is useful when you want to customize or debug a specific workflow.
Use `get_workflow()` to extract the blocks for a specific workflow. Pass the workflow name (e.g., `"image2image"`, `"inpainting"`, `"controlnet_text2image"`) to get only the blocks relevant to that workflow.
```py
img2img_blocks = pipe.blocks.get_workflow("image2image")
```
Conditional blocks are convenient for users, but their conditional logic adds complexity when customizing or debugging. Extracting a workflow gives you the specific blocks relevant to your workflow, making it easier to work with. Learn more in the [AutoPipelineBlocks](https://huggingface.co/docs/diffusers/modular_diffusers/auto_pipeline_blocks) guide.
### Sub-blocks
Blocks can contain other blocks. `pipe.blocks` gives you the top-level block definition (here, `QwenImageAutoBlocks`), while `sub_blocks` lets you access the smaller blocks inside it.
`QwenImageAutoBlocks` is composed of: `text_encoder`, `vae_encoder`, `controlnet_vae_encoder`, `denoise`, and `decode`.
`QwenImageAutoBlocks` is composed of: `text_encoder`, `vae_encoder`, `controlnet_vae_encoder`, `denoise`, and `decode`. Access them through the `sub_blocks` property.
These sub-blocks run one after another and data flows linearly from one block to the next — each block's `intermediate_outputs` become available as `inputs` to the next block. This is how [`SequentialPipelineBlocks`](./sequential_pipeline_blocks) work.
You can access them through the `sub_blocks` property. The `doc` property is useful for seeing the full documentation of any block, including its inputs, outputs, and components.
The `doc` property is useful for seeing the full documentation of any block, including its inputs, outputs, and components.
```py
vae_encoder_block = pipe.blocks.sub_blocks["vae_encoder"]
print(vae_encoder_block.doc)
@@ -195,7 +165,7 @@ class CannyBlock
Canny map for input image
```
Use `get_workflow` to extract the ControlNet workflow from [`QwenImageAutoBlocks`].
UUse `get_workflow` to extract the ControlNet workflow from [`QwenImageAutoBlocks`].
```py
# Get the controlnet workflow that we want to work with
blocks = pipe.blocks.get_workflow("controlnet_text2image")
@@ -212,8 +182,9 @@ class SequentialPipelineBlocks
...
```
The extracted workflow is a [`SequentialPipelineBlocks`](./sequential_pipeline_blocks) - a multi-block type where blocks run one after another and data flows linearly from one block to the next. Each block's `intermediate_outputs` become available as `inputs` to subsequent blocks.
The extracted workflow is a [`SequentialPipelineBlocks`](./sequential_pipeline_blocks) and it currently requires `control_image` as input. Let's insert the canny block at the beginning so the pipeline accepts a regular image instead.
Currently this workflow requires `control_image` as input. Let's insert the canny block at the beginning so the pipeline accepts a regular image instead.
```py
# Insert canny at the beginning
blocks.sub_blocks.insert("canny", canny_block, 0)
@@ -240,7 +211,7 @@ class SequentialPipelineBlocks
Now the pipeline takes `image` as input instead of `control_image`. Because blocks in a sequence share data automatically, the canny block's output (`control_image`) flows to the denoise block that needs it, and the canny block's input (`image`) becomes a pipeline input since no earlier block provides it.
Create a pipeline from the modified blocks and load a ControlNet model. The ControlNet isn't part of the original model repository, so we load it separately and add it with [`~ModularPipeline.update_components`].
Create a pipeline from the modified blocks and load a ControlNet model.
```py
pipeline = blocks.init_pipeline("Qwen/Qwen-Image", components_manager=manager)
@@ -270,16 +241,6 @@ output
## Next steps
<hfoptions id="next">
<hfoption id="Learn the basics">
Understand the core building blocks of Modular Diffusers:
- [ModularPipelineBlocks](./pipeline_block): The basic unit for defining a step in a pipeline.
- [SequentialPipelineBlocks](./sequential_pipeline_blocks): Chain blocks to run in sequence.
- [AutoPipelineBlocks](./auto_pipeline_blocks): Create pipelines that support multiple workflows.
- [States](./modular_diffusers_states): How data is shared between blocks.
</hfoption>
<hfoption id="Build custom blocks">
Learn how to create your own blocks with custom logic in the [Building Custom Blocks](./custom_blocks) guide.

View File

@@ -91,42 +91,23 @@ class ImageEncoderBlock(ModularPipelineBlocks):
</hfoption>
</hfoptions>
Connect the two blocks by defining a [`~modular_pipelines.SequentialPipelineBlocks`]. List the block instances in `block_classes` and their corresponding names in `block_names`. The blocks are executed in the order they appear in `block_classes`, and data flows from one block to the next through [`~modular_pipelines.PipelineState`].
Connect the two blocks by defining an [`InsertableDict`] to map the block names to the block instances. Blocks are executed in the order they're registered in `blocks_dict`.
Use [`~modular_pipelines.SequentialPipelineBlocks.from_blocks_dict`] to create a [`~modular_pipelines.SequentialPipelineBlocks`].
```py
class ImageProcessingStep(SequentialPipelineBlocks):
"""
# auto_docstring
"""
model_name = "my_model"
block_classes = [InputBlock(), ImageEncoderBlock()]
block_names = ["input", "image_encoder"]
from diffusers.modular_pipelines import SequentialPipelineBlocks, InsertableDict
@property
def description(self):
return (
"Process text prompts and images for the pipeline. It:\n"
" - Determines the batch size from the prompts.\n"
" - Encodes the image into latent space."
)
blocks_dict = InsertableDict()
blocks_dict["input"] = input_block
blocks_dict["image_encoder"] = image_encoder_block
blocks = SequentialPipelineBlocks.from_blocks_dict(blocks_dict)
```
When you create a [`~modular_pipelines.SequentialPipelineBlocks`], properties like `inputs`, `intermediate_outputs`, and `expected_components` are automatically aggregated from the sub-blocks, so there is no need to define them again.
There are a few properties you should set:
- `description`: We recommend adding a description for the assembled block to explain what the combined step does.
- `model_name`: This is automatically derived from the sub-blocks but isn't always correct, so you may need to override it.
- `outputs`: By default this is the same as `intermediate_outputs`, but you can manually set it to control which values appear in the doc. This is useful for showing only the final outputs instead of all intermediate values.
These properties, together with the aggregated `inputs`, `intermediate_outputs`, and `expected_components`, are used to automatically generate the `doc` property.
Inspect the sub-blocks through the `sub_blocks` property, and use `doc` for a full summary of the block's inputs, outputs, and components.
Inspect the sub-blocks in [`~modular_pipelines.SequentialPipelineBlocks`] by calling `blocks`, and for more details about the inputs and outputs, access the `docs` attribute.
```py
blocks = ImageProcessingStep()
print(blocks)
print(blocks.doc)
```
```

View File

@@ -29,8 +29,31 @@ text_encoder = AutoModel.from_pretrained(
)
```
## Custom models
[`AutoModel`] also loads models from the [Hub](https://huggingface.co/models) that aren't included in Diffusers. Set `trust_remote_code=True` in [`AutoModel.from_pretrained`] to load custom models.
A custom model repository needs a Python module with the model class, and a `config.json` with an `auto_map` entry that maps `"AutoModel"` to `"module_file.ClassName"`.
```
custom/custom-transformer-model/
├── config.json
├── my_model.py
└── diffusion_pytorch_model.safetensors
```
The `config.json` includes the `auto_map` field pointing to the custom class.
```json
{
"auto_map": {
"AutoModel": "my_model.MyCustomModel"
}
}
```
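For illustration, `my_model.py` could contain a minimal custom class like the sketch below (hypothetical; inheriting from [`ModelMixin`] is what lets it plug into `AutoModel` and `from_pretrained`/`save_pretrained`).

```py
# my_model.py -- a minimal, hypothetical custom model
import torch
from diffusers import ModelMixin
from diffusers.configuration_utils import ConfigMixin, register_to_config


class MyCustomModel(ModelMixin, ConfigMixin):
    @register_to_config
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)
```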
Then load it with `trust_remote_code=True`.
```py
import torch
from diffusers import AutoModel
@@ -40,7 +63,39 @@ transformer = AutoModel.from_pretrained(
)
```
For a real-world example, [Overworld/Waypoint-1-Small](https://huggingface.co/Overworld/Waypoint-1-Small/tree/main/transformer) hosts a custom `WorldModel` class across several modules in its `transformer` subfolder.
```
transformer/
├── config.json # auto_map: "model.WorldModel"
├── model.py
├── attn.py
├── nn.py
├── cache.py
├── quantize.py
├── __init__.py
└── diffusion_pytorch_model.safetensors
```
```py
import torch
from diffusers import AutoModel
transformer = AutoModel.from_pretrained(
"Overworld/Waypoint-1-Small", subfolder="transformer", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda"
)
```
If the custom model inherits from the [`ModelMixin`] class, it gets access to the same features as Diffusers model classes, like [regional compilation](../optimization/fp16#regional-compilation) and [group offloading](../optimization/memory#group-offloading).
> [!WARNING]
> As a precaution with `trust_remote_code=True`, pass a commit hash to the `revision` argument in [`AutoModel.from_pretrained`] to make sure the code hasn't been updated with new malicious code (unless you fully trust the model owners).
>
> ```py
> transformer = AutoModel.from_pretrained(
> "Overworld/Waypoint-1-Small", subfolder="transformer", trust_remote_code=True, revision="a3d8cb2"
> )
> ```
> [!NOTE]
> Learn more about implementing custom models in the [Community components](../using-diffusers/custom_pipeline_overview#community-components) guide.

View File

@@ -0,0 +1,347 @@
# DreamBooth training example for Z-Image
[DreamBooth](https://huggingface.co/papers/2208.12242) is a method to personalize image generation models given just a few (3~5) images of a subject/concept.
[LoRA](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) is a popular parameter-efficient fine-tuning technique that allows you to achieve full-finetuning-like performance with only a fraction of learnable parameters.
The `train_dreambooth_lora_z_image.py` script shows how to implement the training procedure for [LoRAs](https://huggingface.co/blog/lora) and adapt it for [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image).
> [!NOTE]
> **About Z-Image**
>
> Z-Image is a high-quality text-to-image generation model from Alibaba's Tongyi Lab. It uses a DiT (Diffusion Transformer) architecture with Qwen3 as the text encoder. The model excels at generating images with accurate text rendering, especially for Chinese characters.
> [!NOTE]
> **Memory consumption**
>
> Z-Image is relatively memory efficient compared to other large-scale diffusion models. Below we provide some tips and tricks to further reduce memory consumption during training.
## Running locally with PyTorch
### Installing the dependencies
Before running the scripts, make sure to install the library's training dependencies:
**Important**
To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```
Then cd into the `examples/dreambooth` folder and run
```bash
pip install -r requirements_z_image.txt
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
```bash
accelerate config
```
Or for a default accelerate configuration without answering questions about your environment
```bash
accelerate config default
```
Or if your environment doesn't support an interactive shell (e.g., a notebook)
```python
from accelerate.utils import write_basic_config
write_basic_config()
```
When running `accelerate config`, specifying torch compile mode as True can give dramatic speedups.
Note also that we use the PEFT library as the backend for LoRA training, so make sure to have `peft>=0.6.0` installed in your environment.
### Dog toy example
Now let's get our dataset. For this example we will use some dog images: https://huggingface.co/datasets/diffusers/dog-example.
Let's first download it locally:
```python
from huggingface_hub import snapshot_download
local_dir = "./dog"
snapshot_download(
"diffusers/dog-example",
local_dir=local_dir, repo_type="dataset",
ignore_patterns=".gitattributes",
)
```
Logging in to the Hugging Face Hub (with `huggingface-cli login`) will also allow us to push the trained LoRA parameters to the Hub.
## Memory Optimizations
> [!NOTE]
> Many of these techniques complement each other and can be used together to further reduce memory consumption. However some techniques may be mutually exclusive so be sure to check before launching a training run.
### CPU Offloading
To offload parts of the model to CPU memory, you can use the `--offload` flag. This offloads the VAE and text encoder to CPU memory and only moves them to the GPU when needed.
### Latent Caching
Pre-encode the training images with the VAE, and then delete it to free up some memory. To enable latent caching, simply pass `--cache_latents`.
### QLoRA: Low Precision Training with Quantization
Perform low precision training using 8-bit or 4-bit quantization to reduce memory usage. You can use the following flags:
- **FP8 training** with `torchao`:
Enable FP8 training by passing `--do_fp8_training`.
> [!IMPORTANT]
> Since we are utilizing FP8 tensor cores, we need CUDA GPUs with a compute capability of at least 8.9. If you're looking for memory-efficient training on relatively older cards, we encourage you to check out other trainers.
- **NF4 training** with `bitsandbytes`:
Alternatively, you can use `bitsandbytes` quantization by passing `--bnb_quantization_config_path` with a config that enables, for example, 4-bit NF4 quantization.
### Gradient Checkpointing and Accumulation
* `--gradient_accumulation_steps` refers to the number of update steps to accumulate before performing a backward/update pass. By passing a value > 1 you can reduce the number of backward/update passes and hence also the memory requirements.
* With `--gradient_checkpointing` we can save memory by not storing all intermediate activations during the forward pass. Instead, only a subset of these activations (the checkpoints) are stored and the rest is recomputed as needed during the backward pass. Note that this comes at the expense of a slower backward pass.
### 8-bit-Adam Optimizer
When training with `AdamW` (doesn't apply to `prodigy`) you can pass `--use_8bit_adam` to reduce the memory requirements of training. Make sure to install `bitsandbytes` if you want to do so.
### Image Resolution
An easy way to mitigate some of the memory requirements is through `--resolution`. `--resolution` refers to the resolution of the input images; all the images in the train/validation dataset are resized to it.
Note that by default, images are resized to a resolution of 1024; keep this in mind if you're training at higher resolutions.
### Precision of saved LoRA layers
By default, trained transformer layers are saved in the precision in which training was performed, e.g. when mixed-precision training is enabled with `--mixed_precision="bf16"`, the final finetuned layers will be saved in `torch.bfloat16` as well.
This reduces memory requirements significantly without a noticeable quality loss. Note that if you do wish to save the final layers in float32 at the expense of more memory usage, you can do so by passing `--upcast_before_saving`.
## Training Examples
### Z-Image Training
To perform DreamBooth with LoRA on Z-Image, run:
```bash
export MODEL_NAME="Tongyi-MAI/Z-Image"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="trained-z-image-lora"
accelerate launch train_dreambooth_lora_z_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--mixed_precision="bf16" \
--gradient_checkpointing \
--cache_latents \
--instance_prompt="a photo of sks dog" \
--resolution=1024 \
--train_batch_size=1 \
--guidance_scale=5.0 \
--use_8bit_adam \
--gradient_accumulation_steps=4 \
--optimizer="adamW" \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=100 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
```
To better track our training experiments, we're using the following flags in the command above:
* `report_to="wandb"` will ensure the training runs are tracked on [Weights and Biases](https://wandb.ai/site). To use it, be sure to install `wandb` with `pip install wandb`. Don't forget to call `wandb login <your_api_key>` before training if you haven't done it before.
* `validation_prompt` and `validation_epochs` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
> [!NOTE]
> If you want to train using long prompts, you can use `--max_sequence_length` to set the token limit. The default is 512. Note that this will use more resources and may slow down the training in some cases.
### Training with FP8 Quantization
For reduced memory usage with FP8 training:
```bash
export MODEL_NAME="Tongyi-MAI/Z-Image"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="trained-z-image-lora-fp8"
accelerate launch train_dreambooth_lora_z_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--do_fp8_training \
--gradient_checkpointing \
--cache_latents \
--instance_prompt="a photo of sks dog" \
--resolution=1024 \
--train_batch_size=1 \
--guidance_scale=5.0 \
--use_8bit_adam \
--gradient_accumulation_steps=4 \
--optimizer="adamW" \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=100 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
```
### FSDP on the transformer
If you configure Accelerate to use FSDP, the transformer blocks will be wrapped automatically. For example, set the configuration to:
```yaml
distributed_type: FSDP
fsdp_config:
fsdp_version: 2
fsdp_offload_params: false
fsdp_sharding_strategy: HYBRID_SHARD
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_transformer_layer_cls_to_wrap: ZImageTransformerBlock
fsdp_forward_prefetch: true
fsdp_sync_module_states: false
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_use_orig_params: false
fsdp_activation_checkpointing: true
fsdp_reshard_after_forward: true
fsdp_cpu_ram_efficient_loading: false
```
### Prodigy Optimizer
Prodigy is an adaptive optimizer that dynamically adjusts the learning rate of the learned parameters based on past gradients, allowing for more efficient convergence.
By using Prodigy we can "eliminate" the need for manual learning rate tuning. Read more [here](https://huggingface.co/blog/sdxl_lora_advanced_script#adaptive-optimizers).
To use prodigy, first make sure to install the prodigyopt library: `pip install prodigyopt`, and then specify:
```bash
--optimizer="prodigy"
```
> [!TIP]
> When using prodigy it's generally good practice to set `--learning_rate=1.0`
```bash
export MODEL_NAME="Tongyi-MAI/Z-Image"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="trained-z-image-lora-prodigy"
accelerate launch train_dreambooth_lora_z_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--mixed_precision="bf16" \
--gradient_checkpointing \
--cache_latents \
--instance_prompt="a photo of sks dog" \
--resolution=1024 \
--train_batch_size=1 \
--guidance_scale=5.0 \
--gradient_accumulation_steps=4 \
--optimizer="prodigy" \
--learning_rate=1.0 \
--report_to="wandb" \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
```
### LoRA Rank and Alpha
Two key LoRA hyperparameters are LoRA rank and LoRA alpha:
- `--rank`: Defines the dimension of the trainable LoRA matrices. A higher rank means more expressiveness and capacity to learn (and more parameters).
- `--lora_alpha`: A scaling factor for the LoRA's output. The LoRA update is scaled by `lora_alpha / lora_rank`.
**lora_alpha vs. rank:**
This ratio dictates the LoRA's effective strength:
- `lora_alpha == rank`: Scaling factor is 1. The LoRA is applied with its learned strength. (e.g., alpha=16, rank=16)
- `lora_alpha < rank`: Scaling factor < 1. Reduces the LoRA's impact. Useful for subtle changes or to prevent overpowering the base model. (e.g., alpha=8, rank=16)
- `lora_alpha > rank`: Scaling factor > 1. Amplifies the LoRA's impact. Allows a lower rank LoRA to have a stronger effect. (e.g., alpha=32, rank=16)
> [!TIP]
> A common starting point is to set `lora_alpha` equal to `rank`.
> Some also set `lora_alpha` to be twice the `rank` (e.g., lora_alpha=32 for lora_rank=16)
> to give the LoRA updates more influence without increasing parameter count.
> If you find your LoRA is "overcooking" or learning too aggressively, consider setting `lora_alpha` to half of `rank`
> (e.g., lora_alpha=8 for rank=16). Experimentation is often key to finding the optimal balance for your use case.
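To make the scaling concrete, here is a small numerical sketch (illustrative only; PEFT applies this scaling internally during training and inference):
```python
import torch

rank, lora_alpha = 16, 32
d_out, d_in = 64, 64

W = torch.randn(d_out, d_in)        # frozen base weight
A = torch.randn(rank, d_in) * 0.01  # trainable LoRA down-projection
B = torch.zeros(d_out, rank)        # trainable LoRA up-projection (initialized to zero)

scaling = lora_alpha / rank         # 2.0 here, so the learned update is amplified
W_effective = W + scaling * (B @ A)
```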
### Target Modules
When LoRA was first adapted from language models to diffusion models, it was applied to the cross-attention layers in the UNet that relate the image representations with the prompts that describe them.
More recently, SOTA text-to-image diffusion models replaced the UNet with a diffusion Transformer (DiT). With this change, we may also want to explore applying LoRA training to different types of layers and blocks.
To allow more flexibility and control over the targeted modules, we added `--lora_layers`, in which you can specify the exact modules for LoRA training as a comma-separated string. Here are some examples of target modules you can provide:
- For attention only layers: `--lora_layers="to_k,to_q,to_v,to_out.0"`
- For attention and feed-forward layers: `--lora_layers="to_k,to_q,to_v,to_out.0,ff.net.0.proj,ff.net.2"`
> [!NOTE]
> `--lora_layers` can also be used to specify which **blocks** to apply LoRA training to. To do so, simply add a block prefix to each layer in the comma separated string.
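For illustration, here is a hedged sketch of building such a string programmatically. The `transformer_blocks` prefix and the block indices are assumptions, so inspect the model's `named_modules()` to confirm the actual names before using it:
```python
# Hypothetical example: restrict LoRA to the attention layers of a few specific blocks.
layers = ["attn.to_k", "attn.to_q", "attn.to_v", "attn.to_out.0"]
blocks = [7, 12, 16]  # arbitrary block indices for illustration
lora_layers = ",".join(f"transformer_blocks.{b}.{layer}" for b in blocks for layer in layers)
print(lora_layers)  # pass this string to --lora_layers
```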
> [!NOTE]
> Keep in mind that while training more layers can improve quality and expressiveness, it also increases the size of the output LoRA weights.
### Aspect Ratio Bucketing
We've added aspect ratio bucketing support, which allows training on images with different aspect ratios without cropping them to a single square resolution. This technique helps preserve the original composition of training images and can improve training efficiency.
To enable aspect ratio bucketing, pass the `--aspect_ratio_buckets` argument with a semicolon-separated list of `height,width` pairs, such as:
```bash
--aspect_ratio_buckets="672,1568;688,1504;720,1456;752,1392;800,1328;832,1248;880,1184;944,1104;1024,1024;1104,944;1184,880;1248,832;1328,800;1392,752;1456,720;1504,688;1568,672"
```
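To illustrate how an image is typically matched to one of these buckets, here is a hedged sketch; the helper below is illustrative and the script's exact assignment and resizing logic may differ:
```python
# Illustrative bucket assignment: pick the bucket whose aspect ratio is closest to the
# image's, then resize/crop the image to that bucket's height and width.
buckets = [(672, 1568), (1024, 1024), (1568, 672)]

def closest_bucket(height: int, width: int) -> tuple[int, int]:
    aspect = height / width
    return min(buckets, key=lambda hw: abs(hw[0] / hw[1] - aspect))

print(closest_bucket(720, 1280))  # a landscape image lands in the wide (672, 1568) bucket
```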
### Bilingual Prompts
Z-Image has strong support for both Chinese and English prompts. When training with Chinese prompts, ensure your dataset captions are properly encoded in UTF-8:
```bash
--instance_prompt="一只sks狗的照片"
--validation_prompt="一只sks狗在桶里的照片"
```
> [!TIP]
> Z-Image excels at text rendering in generated images, especially for Chinese characters. If your use case involves generating images with text, consider including text-related examples in your training data.
## Inference
Once you have trained a LoRA, you can load it for inference:
```python
import torch
from diffusers import ZImagePipeline
pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Load your trained LoRA
pipe.load_lora_weights("path/to/your/trained-z-image-lora")
# Generate an image
image = pipe(
prompt="A photo of sks dog in a bucket",
height=1024,
width=1024,
num_inference_steps=50,
guidance_scale=5.0,
generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("output.png")
```
---
Since Z-Image finetuning is still in an experimental phase, we encourage you to explore different settings and share your insights! 🤗

File diff suppressed because it is too large Load Diff

View File

@@ -2067,29 +2067,58 @@ class ModularPipeline(ConfigMixin, PushToHubMixin):
- the `config` dict, which will be saved as `modular_model_index.json` during `save_pretrained`
Args:
**kwargs: Component objects or configuration values to update:
- Component objects: Models loaded with `AutoModel.from_pretrained()` or `ComponentSpec.load()`
are automatically tagged with loading information. ConfigMixin objects without weights (e.g.,
schedulers, guiders) can be passed directly.
- Configuration values: Simple values to update configuration settings
(e.g., `requires_safety_checker=False`)
**kwargs: Component objects, ComponentSpec objects, or configuration values to update:
- Component objects: Only supports components we can extract specs using
`ComponentSpec.from_component()` method i.e. components created with ComponentSpec.load() or
ConfigMixin subclasses that aren't nn.Modules (e.g., `unet=new_unet, text_encoder=new_encoder`)
- ComponentSpec objects: Only supports default_creation_method == "from_config", will call create()
method to create a new component (e.g., `guider=ComponentSpec(name="guider",
type_hint=ClassifierFreeGuidance, config={...}, default_creation_method="from_config")`)
- Configuration values: Simple values to update configuration settings (e.g.,
`requires_safety_checker=False`)
Raises:
ValueError: If a component object is not supported in ComponentSpec.from_component() method:
- nn.Module components without a valid `_diffusers_load_id` attribute
- Non-ConfigMixin components without a valid `_diffusers_load_id` attribute
Examples:
```python
# Update pretrrained model
# Update multiple components at once
pipeline.update_components(unet=new_unet_model, text_encoder=new_text_encoder)
# Update configuration values
pipeline.update_components(requires_safety_checker=False)
# Update both components and configs together
pipeline.update_components(unet=new_unet_model, requires_safety_checker=False)
# Update with ComponentSpec objects (from_config only)
pipeline.update_components(
guider=ComponentSpec(
name="guider",
type_hint=ClassifierFreeGuidance,
config={"guidance_scale": 5.0},
default_creation_method="from_config",
)
)
```
Notes:
- Components with trained weights should be loaded with `AutoModel.from_pretrained()` or
`ComponentSpec.load()` so that loading specs are preserved for serialization.
- ConfigMixin objects without weights (e.g., schedulers, guiders) can be passed directly.
- Components with trained weights must be created using ComponentSpec.load(). If the component has not been
shared in huggingface hub and you don't have loading specs, you can upload it using `push_to_hub()`
- ConfigMixin objects without weights (e.g., schedulers, guiders) can be passed directly
- ComponentSpec objects with default_creation_method="from_pretrained" are not supported in
update_components()
"""
passed_components = {k: kwargs.pop(k) for k in self._component_specs if k in kwargs}
# extract component_specs_updates & config_specs_updates from `specs`
passed_component_specs = {
k: kwargs.pop(k) for k in self._component_specs if k in kwargs and isinstance(kwargs[k], ComponentSpec)
}
passed_components = {
k: kwargs.pop(k) for k in self._component_specs if k in kwargs and not isinstance(kwargs[k], ComponentSpec)
}
passed_config_values = {k: kwargs.pop(k) for k in self._config_specs if k in kwargs}
for name, component in passed_components.items():
@@ -2128,14 +2157,33 @@ class ModularPipeline(ConfigMixin, PushToHubMixin):
if len(kwargs) > 0:
logger.warning(f"Unexpected keyword arguments, will be ignored: {kwargs.keys()}")
self.register_components(**passed_components)
created_components = {}
for name, component_spec in passed_component_specs.items():
if component_spec.default_creation_method == "from_pretrained":
raise ValueError(
"ComponentSpec object with default_creation_method == 'from_pretrained' is not supported in update_components() method"
)
created_components[name] = component_spec.create()
current_component_spec = self._component_specs[name]
# warn if type changed
if current_component_spec.type_hint is not None and not isinstance(
created_components[name], current_component_spec.type_hint
):
logger.info(
f"ModularPipeline.update_components: adding {name} with new type: {created_components[name].__class__.__name__}, previous type: {current_component_spec.type_hint.__name__}"
)
# update _component_specs based on the user passed component_spec
self._component_specs[name] = component_spec
self.register_components(**passed_components, **created_components)
config_to_register = {}
for name, new_value in passed_config_values.items():
# e.g. requires_aesthetics_score = False
self._config_specs[name].default = new_value
config_to_register[name] = new_value
self.register_to_config(**config_to_register)
# YiYi TODO: support map for additional from_pretrained kwargs
def load_components(self, names: Optional[Union[List[str], str]] = None, **kwargs):
"""
Load selected components from specs.

View File

@@ -13,12 +13,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from collections.abc import Iterator
from fractions import Fraction
from typing import Optional
from itertools import chain
from typing import List, Optional, Union
import numpy as np
import PIL.Image
import torch
from tqdm import tqdm
from ...utils import is_av_available
from ...utils import get_logger, is_av_available
logger = get_logger(__name__) # pylint: disable=invalid-name
_CAN_USE_AV = is_av_available()
@@ -101,11 +109,59 @@ def _write_audio(
def encode_video(
video: torch.Tensor, fps: int, audio: Optional[torch.Tensor], audio_sample_rate: Optional[int], output_path: str
video: Union[List[PIL.Image.Image], np.ndarray, torch.Tensor, Iterator[torch.Tensor]],
fps: int,
audio: Optional[torch.Tensor],
audio_sample_rate: Optional[int],
output_path: str,
video_chunks_number: int = 1,
) -> None:
video_np = video.cpu().numpy()
"""
Encodes a video with audio using the PyAV library. Based on code from the original LTX-2 repo:
https://github.com/Lightricks/LTX-2/blob/4f410820b198e05074a1e92de793e3b59e9ab5a0/packages/ltx-pipelines/src/ltx_pipelines/utils/media_io.py#L182
_, height, width, _ = video_np.shape
Args:
video (`List[PIL.Image.Image]` or `np.ndarray` or `torch.Tensor`):
A video tensor of shape [frames, height, width, channels] with integer pixel values in [0, 255]. If the
input is a `np.ndarray`, it is expected to be a float array with values in [0, 1] (which is what pipelines
usually return with `output_type="np"`).
fps (`int`)
The frames per second (FPS) of the encoded video.
audio (`torch.Tensor`, *optional*):
An audio waveform of shape [audio_channels, samples].
audio_sample_rate: (`int`, *optional*):
The sampling rate of the audio waveform. For LTX 2, this is typically 24000 (24 kHz).
output_path (`str`):
The path to save the encoded video to.
video_chunks_number (`int`, *optional*, defaults to `1`):
The number of chunks to split the video into for encoding. Each chunk will be encoded separately. The
number of chunks to use often depends on the tiling config for the video VAE.
"""
if isinstance(video, list) and isinstance(video[0], PIL.Image.Image):
# Pipeline output_type="pil"; assumes each image is in "RGB" mode
video_frames = [np.array(frame) for frame in video]
video = np.stack(video_frames, axis=0)
video = torch.from_numpy(video)
elif isinstance(video, np.ndarray):
# Pipeline output_type="np"
is_denormalized = np.logical_and(np.zeros_like(video) <= video, video <= np.ones_like(video))
if np.all(is_denormalized):
video = (video * 255).round().astype("uint8")
else:
logger.warning(
"Supplied `numpy.ndarray` does not have values in [0, 1]. The values will be assumed to be pixel "
"values in [0, ..., 255] and will be used as is."
)
video = torch.from_numpy(video)
if isinstance(video, torch.Tensor):
# Split into video_chunks_number along the frame dimension
video = torch.tensor_split(video, video_chunks_number, dim=0)
video = iter(video)
first_chunk = next(video)
_, height, width, _ = first_chunk.shape
container = av.open(output_path, mode="w")
stream = container.add_stream("libx264", rate=int(fps))
@@ -119,10 +175,12 @@ def encode_video(
audio_stream = _prepare_audio_stream(container, audio_sample_rate)
for frame_array in video_np:
frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
for packet in stream.encode(frame):
container.mux(packet)
for video_chunk in tqdm(chain([first_chunk], video), total=video_chunks_number, desc="Encoding video chunks"):
video_chunk_cpu = video_chunk.to("cpu").numpy()
for frame_array in video_chunk_cpu:
frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
for packet in stream.encode(frame):
container.mux(packet)
# Flush encoder
for packet in stream.encode():

View File

@@ -69,8 +69,6 @@ EXAMPLE_DOC_STRING = """
... output_type="np",
... return_dict=False,
... )
>>> video = (video * 255).round().astype("uint8")
>>> video = torch.from_numpy(video)
>>> encode_video(
... video[0],

View File

@@ -75,8 +75,6 @@ EXAMPLE_DOC_STRING = """
... output_type="np",
... return_dict=False,
... )
>>> video = (video * 255).round().astype("uint8")
>>> video = torch.from_numpy(video)
>>> encode_video(
... video[0],

View File

@@ -76,8 +76,6 @@ EXAMPLE_DOC_STRING = """
... output_type="np",
... return_dict=False,
... )[0]
>>> video = (video * 255).round().astype("uint8")
>>> video = torch.from_numpy(video)
>>> encode_video(
... video[0],

View File

@@ -18,7 +18,6 @@ import re
from copy import deepcopy
from typing import Any, Callable, Dict, List, Optional, Union
import ftfy
import torch
from transformers import AutoTokenizer, UMT5EncoderModel

View File

@@ -18,7 +18,6 @@ import re
from copy import deepcopy
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
import ftfy
import PIL
import torch
from transformers import AutoTokenizer, UMT5EncoderModel

View File

@@ -19,7 +19,6 @@ import re
from copy import deepcopy
from typing import Any, Callable, Dict, List, Optional, Union
import ftfy
import torch
from PIL import Image
from transformers import AutoTokenizer, UMT5EncoderModel

View File

@@ -37,6 +37,9 @@ class ModularPipelineTesterMixin:
optional_params = frozenset(["num_inference_steps", "num_images_per_prompt", "latents", "output_type"])
# this is modular specific: generator needs to be a intermediate input because it's mutable
intermediate_params = frozenset(["generator"])
# Output type for the pipeline (e.g., "images" for image pipelines, "videos" for video pipelines)
# Subclasses can override this to change the expected output type
output_name = "images"
def get_generator(self, seed=0):
generator = torch.Generator("cpu").manual_seed(seed)
@@ -163,7 +166,7 @@ class ModularPipelineTesterMixin:
logger.setLevel(level=diffusers.logging.WARNING)
for batch_size, batched_input in zip(batch_sizes, batched_inputs):
output = pipe(**batched_input, output="images")
output = pipe(**batched_input, output=self.output_name)
assert len(output) == batch_size, "Output is different from expected batch size"
def test_inference_batch_single_identical(
@@ -197,12 +200,16 @@ class ModularPipelineTesterMixin:
if "batch_size" in inputs:
batched_inputs["batch_size"] = batch_size
output = pipe(**inputs, output="images")
output_batch = pipe(**batched_inputs, output="images")
output = pipe(**inputs, output=self.output_name)
output_batch = pipe(**batched_inputs, output=self.output_name)
assert output_batch.shape[0] == batch_size
max_diff = torch.abs(output_batch[0] - output[0]).max()
# For batch comparison, we only need to compare the first item
if output_batch.shape[0] == batch_size and output.shape[0] == 1:
output_batch = output_batch[0:1]
max_diff = torch.abs(output_batch - output).max()
assert max_diff < expected_max_diff, "Batch inference results different from single inference results"
@require_accelerator
@@ -217,19 +224,32 @@ class ModularPipelineTesterMixin:
# Reset generator in case it is used inside dummy inputs
if "generator" in inputs:
inputs["generator"] = self.get_generator(0)
output = pipe(**inputs, output="images")
output = pipe(**inputs, output=self.output_name)
fp16_inputs = self.get_dummy_inputs()
# Reset generator in case it is used inside dummy inputs
if "generator" in fp16_inputs:
fp16_inputs["generator"] = self.get_generator(0)
output_fp16 = pipe_fp16(**fp16_inputs, output="images")
output = output.cpu()
output_fp16 = output_fp16.cpu()
output_fp16 = pipe_fp16(**fp16_inputs, output=self.output_name)
max_diff = numpy_cosine_similarity_distance(output.flatten(), output_fp16.flatten())
assert max_diff < expected_max_diff, "FP16 inference is different from FP32 inference"
output_tensor = output.float().cpu()
output_fp16_tensor = output_fp16.float().cpu()
# Check for NaNs in outputs (can happen with tiny models in FP16)
if torch.isnan(output_tensor).any() or torch.isnan(output_fp16_tensor).any():
pytest.skip("FP16 inference produces NaN values - this is a known issue with tiny models")
max_diff = numpy_cosine_similarity_distance(
output_tensor.flatten().numpy(), output_fp16_tensor.flatten().numpy()
)
# Check if cosine similarity is NaN (which can happen if vectors are zero or very small)
if torch.isnan(torch.tensor(max_diff)):
pytest.skip("Cosine similarity is NaN - outputs may be too small for reliable comparison")
assert max_diff < expected_max_diff, f"FP16 inference is different from FP32 inference (max_diff: {max_diff})"
@require_accelerator
def test_to_device(self):
@@ -251,14 +271,16 @@ class ModularPipelineTesterMixin:
def test_inference_is_not_nan_cpu(self):
pipe = self.get_pipeline().to("cpu")
output = pipe(**self.get_dummy_inputs(), output="images")
inputs = self.get_dummy_inputs()
output = pipe(**inputs, output=self.output_name)
assert torch.isnan(output).sum() == 0, "CPU Inference returns NaN"
@require_accelerator
def test_inference_is_not_nan(self):
pipe = self.get_pipeline().to(torch_device)
output = pipe(**self.get_dummy_inputs(), output="images")
inputs = self.get_dummy_inputs()
output = pipe(**inputs, output=self.output_name)
assert torch.isnan(output).sum() == 0, "Accelerator Inference returns NaN"
def test_num_images_per_prompt(self):
@@ -278,7 +300,7 @@ class ModularPipelineTesterMixin:
if key in self.batch_params:
inputs[key] = batch_size * [inputs[key]]
images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt, output="images")
images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt, output=self.output_name)
assert images.shape[0] == batch_size * num_images_per_prompt
@@ -293,8 +315,7 @@ class ModularPipelineTesterMixin:
image_slices = []
for pipe in [base_pipe, offload_pipe]:
inputs = self.get_dummy_inputs()
image = pipe(**inputs, output="images")
image = pipe(**inputs, output=self.output_name)
image_slices.append(image[0, -3:, -3:, -1].flatten())
assert torch.abs(image_slices[0] - image_slices[1]).max() < 1e-3
@@ -315,8 +336,7 @@ class ModularPipelineTesterMixin:
image_slices = []
for pipe in pipes:
inputs = self.get_dummy_inputs()
image = pipe(**inputs, output="images")
image = pipe(**inputs, output=self.output_name)
image_slices.append(image[0, -3:, -3:, -1].flatten())
assert torch.abs(image_slices[0] - image_slices[1]).max() < 1e-3
@@ -331,13 +351,13 @@ class ModularGuiderTesterMixin:
pipe.update_components(guider=guider)
inputs = self.get_dummy_inputs()
out_no_cfg = pipe(**inputs, output="images")
out_no_cfg = pipe(**inputs, output=self.output_name)
# forward pass with CFG applied
guider = ClassifierFreeGuidance(guidance_scale=7.5)
pipe.update_components(guider=guider)
inputs = self.get_dummy_inputs()
out_cfg = pipe(**inputs, output="images")
out_cfg = pipe(**inputs, output=self.output_name)
assert out_cfg.shape == out_no_cfg.shape
max_diff = torch.abs(out_cfg - out_no_cfg).max()

View File

View File

@@ -0,0 +1,49 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pytest
from diffusers.modular_pipelines import WanBlocks, WanModularPipeline
from ..test_modular_pipelines_common import ModularPipelineTesterMixin
class TestWanModularPipelineFast(ModularPipelineTesterMixin):
pipeline_class = WanModularPipeline
pipeline_blocks_class = WanBlocks
pretrained_model_name_or_path = "hf-internal-testing/tiny-wan-modular-pipe"
params = frozenset(["prompt", "height", "width", "num_frames"])
batch_params = frozenset(["prompt"])
optional_params = frozenset(["num_inference_steps", "num_videos_per_prompt", "latents"])
output_name = "videos"
def get_dummy_inputs(self, seed=0):
generator = self.get_generator(seed)
inputs = {
"prompt": "A painting of a squirrel eating a burger",
"generator": generator,
"num_inference_steps": 2,
"height": 16,
"width": 16,
"num_frames": 9,
"max_sequence_length": 16,
"output_type": "pt",
}
return inputs
@pytest.mark.skip(reason="num_videos_per_prompt")
def test_num_images_per_prompt(self):
pass

View File

@@ -0,0 +1,44 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from diffusers.modular_pipelines import ZImageAutoBlocks, ZImageModularPipeline
from ..test_modular_pipelines_common import ModularPipelineTesterMixin
class TestZImageModularPipelineFast(ModularPipelineTesterMixin):
pipeline_class = ZImageModularPipeline
pipeline_blocks_class = ZImageAutoBlocks
pretrained_model_name_or_path = "hf-internal-testing/tiny-zimage-modular-pipe"
params = frozenset(["prompt", "height", "width"])
batch_params = frozenset(["prompt"])
def get_dummy_inputs(self, seed=0):
generator = self.get_generator(seed)
inputs = {
"prompt": "A painting of a squirrel eating a burger",
"generator": generator,
"num_inference_steps": 2,
"height": 32,
"width": 32,
"max_sequence_length": 16,
"output_type": "pt",
}
return inputs
def test_inference_batch_single_identical(self):
super().test_inference_batch_single_identical(expected_max_diff=5e-3)