Compare commits


11 Commits

Author SHA1 Message Date
Sayak Paul
301c223318 Merge branch 'main' into cp-fixes-attn-backends 2025-12-18 14:21:52 +08:00
hlky
55463f7ace Z-Image-Turbo ControlNet (#12792)
* init

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-12-17 09:44:20 -10:00
naykun
f9c1e612fb Qwen Image Layered Support (#12853)
* [qwen-image] qwen image layered support

* [qwen-image] update doc

* [qwen-image] fix pr comments

* Apply style fixes

* make fix-copies

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-12-17 16:57:57 +05:30
Wang, Yi
87f7d11143 extend TorchAoTest::test_model_memory_usage to other platform (#12768)
* extend TorchAoTest::test_model_memory_usage to other platform

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

* add some comments

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
2025-12-17 13:44:08 +05:30
junqiangwu
5e48f466b9 fix the prefix_token_len bug (#12845) 2025-12-15 22:02:25 -10:00
junqiangwu
a748a839ad Add support for LongCat-Image (#12828)
* Add  LongCat-Image

* Update src/diffusers/models/transformers/transformer_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/models/transformers/transformer_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/models/transformers/transformer_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/models/transformers/transformer_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* fix code

* add doc

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image_edit.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image_edit.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/longcat_image/pipeline_longcat_image.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* fix code & make style & fix-copies

* Apply style fixes

* fix single input rewrite error

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: hadoop-imagen <hadoop-imagen@psxfb7pxrbvmh3oq-worker-0.psxfb7pxrbvmh3oq.hadoop-aipnlp.svc.cluster.local>
2025-12-15 07:45:17 -10:00
Sayak Paul
3b1ccd79a5 Merge branch 'main' into cp-fixes-attn-backends 2025-12-15 20:30:22 +08:00
sayakpaul
0c35ed4708 up 2025-12-12 15:26:43 +05:30
sayakpaul
738f278d93 gracefully error out when attn-backend x cp combo isn't supported. 2025-12-12 15:25:59 +05:30
sayakpaul
23251d6cf6 Revert "gracefully error out when attn-backend x cp combo isn't supported."
This reverts commit c8abb5d7c0.
2025-12-12 15:24:09 +05:30
sayakpaul
c8abb5d7c0 gracefully error out when attn-backend x cp combo isn't supported. 2025-12-12 15:20:18 +05:30
48 changed files with 7638 additions and 1154 deletions

View File

@@ -365,6 +365,8 @@
title: HunyuanVideoTransformer3DModel
- local: api/models/latte_transformer3d
title: LatteTransformer3DModel
- local: api/models/longcat_image_transformer2d
title: LongCatImageTransformer2DModel
- local: api/models/ltx_video_transformer3d
title: LTXVideoTransformer3DModel
- local: api/models/lumina2_transformer2d
@@ -402,7 +404,7 @@
- local: api/models/wan_transformer_3d
title: WanTransformer3DModel
- local: api/models/z_image_transformer2d
title: ZImageTransformer2DModel
title: ZImageTransformer2DModel
title: Transformers
- sections:
- local: api/models/stable_cascade_unet
@@ -563,6 +565,8 @@
title: Latent Diffusion
- local: api/pipelines/ledits_pp
title: LEDITS++
- local: api/pipelines/longcat_image
title: LongCat-Image
- local: api/pipelines/lumina2
title: Lumina 2.0
- local: api/pipelines/lumina

View File

@@ -0,0 +1,25 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LongCatImageTransformer2DModel
The model can be loaded with the following code snippet.
```python
import torch

from diffusers import LongCatImageTransformer2DModel

transformer = LongCatImageTransformer2DModel.from_pretrained(
    "meituan-longcat/LongCat-Image", subfolder="transformer", torch_dtype=torch.bfloat16
)
```
## LongCatImageTransformer2DModel
[[autodoc]] LongCatImageTransformer2DModel

View File

@@ -0,0 +1,114 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LongCat-Image
<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>
We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models.
### Key Features
- 🌟 **Exceptional Efficiency and Performance**: With only **6B parameters**, LongCat-Image surpasses numerous open-source models that are several times larger across multiple benchmarks, demonstrating the immense potential of efficient model design.
- 🌟 **Superior Editing Performance**: LongCat-Image-Edit model achieves state-of-the-art performance among open-source models, delivering leading instruction-following and image quality with superior visual consistency.
- 🌟 **Powerful Chinese Text Rendering**: LongCat-Image demonstrates superior accuracy and stability in rendering common Chinese characters compared to existing SOTA open-source models and achieves industry-leading coverage of the Chinese dictionary.
- 🌟 **Remarkable Photorealism**: Through an innovative data strategy and training framework, LongCat-Image achieves remarkable photorealism in generated images.
- 🌟 **Comprehensive Open-Source Ecosystem**: We provide a complete toolchain, from intermediate checkpoints to full training code, significantly lowering the barrier for further research and development.
For more details, please refer to the comprehensive [***LongCat-Image Technical Report***](https://arxiv.org/abs/2412.11963).
## Usage Example
```py
import torch

from diffusers import LongCatImagePipeline

pipe = LongCatImagePipeline.from_pretrained("meituan-longcat/LongCat-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# pipe.enable_model_cpu_offload()

# Chinese prompt: a young Asian woman in a yellow knit sweater with a white necklace, hands resting on
# her knees with a calm expression, in front of a rough brick wall in warm afternoon light; medium shot,
# soft light on her face, simple composition that highlights her elegance and composure.
prompt = '一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。'

image = pipe(
    prompt,
    height=768,
    width=1344,
    guidance_scale=4.0,
    num_inference_steps=50,
    num_images_per_prompt=1,
    generator=torch.Generator("cpu").manual_seed(43),
    enable_cfg_renorm=True,
    enable_prompt_rewrite=True,
).images[0]
image.save("longcat_image_t2i_example.png")
```
This pipeline was contributed by the LongCat-Image Team. The original codebase can be found [here](https://github.com/meituan-longcat/LongCat-Image).
Available models:
<div style="overflow-x: auto; margin-bottom: 16px;">
<table style="border-collapse: collapse; width: 100%;">
<thead>
<tr>
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;">Models</th>
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;">Type</th>
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;">Description</th>
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;">Download Link</th>
</tr>
</thead>
<tbody>
<tr>
<td style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de;">LongCat&#8209;Image</td>
<td style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de;">Text&#8209;to&#8209;Image</td>
<td style="padding: 8px; border: 1px solid #d0d7de;">Final Release. The standard model for out&#8209;of&#8209;the&#8209;box inference.</td>
<td style="padding: 8px; border: 1px solid #d0d7de;">
<span style="white-space: nowrap;">🤗&nbsp;<a href="https://huggingface.co/meituan-longcat/LongCat-Image">Huggingface</a></span>
</td>
</tr>
<tr>
<td style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de;">LongCat&#8209;Image&#8209;Dev</td>
<td style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de;">Text&#8209;to&#8209;Image</td>
<td style="padding: 8px; border: 1px solid #d0d7de;">Development. Mid-training checkpoint, suitable for fine-tuning.</td>
<td style="padding: 8px; border: 1px solid #d0d7de;">
<span style="white-space: nowrap;">🤗&nbsp;<a href="https://huggingface.co/meituan-longcat/LongCat-Image-Dev">Huggingface</a></span>
</td>
</tr>
<tr>
<td style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de;">LongCat&#8209;Image&#8209;Edit</td>
<td style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de;">Image Editing</td>
<td style="padding: 8px; border: 1px solid #d0d7de;">Specialized model for image editing.</td>
<td style="padding: 8px; border: 1px solid #d0d7de;">
<span style="white-space: nowrap;">🤗&nbsp;<a href="https://huggingface.co/meituan-longcat/LongCat-Image-Edit">Huggingface</a></span>
</td>
</tr>
</tbody>
</table>
</div>
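The LongCat-Image-Edit checkpoint is served by the `LongCatImageEditPipeline` added in this PR. A minimal sketch of instruction-based editing is shown below; the input path is a hypothetical placeholder and the call arguments (such as `image`) are assumptions modeled on the text-to-image example above rather than the definitive API.

```py
import torch

from diffusers import LongCatImageEditPipeline
from diffusers.utils import load_image

pipe = LongCatImageEditPipeline.from_pretrained("meituan-longcat/LongCat-Image-Edit", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# "path/to/input.png" is a hypothetical local image to be edited.
image = load_image("path/to/input.png")

edited = pipe(
    prompt="Replace the background with a rough brick wall lit by warm afternoon light",
    image=image,
    guidance_scale=4.0,
    num_inference_steps=50,
    generator=torch.Generator("cpu").manual_seed(43),
).images[0]
edited.save("longcat_image_edit_example.png")
```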
## LongCatImagePipeline
[[autodoc]] LongCatImagePipeline
- all
- __call__
## LongCatImagePipelineOutput
[[autodoc]] pipelines.longcat_image.pipeline_output.LongCatImagePipelineOutput

View File

@@ -235,6 +235,7 @@ else:
"Kandinsky3UNet",
"Kandinsky5Transformer3DModel",
"LatteTransformer3DModel",
"LongCatImageTransformer2DModel",
"LTXVideoTransformer3DModel",
"Lumina2Transformer2DModel",
"LuminaNextDiT2DModel",
@@ -278,6 +279,7 @@ else:
"WanAnimateTransformer3DModel",
"WanTransformer3DModel",
"WanVACETransformer3DModel",
"ZImageControlNetModel",
"ZImageTransformer2DModel",
"attention_backend",
]
@@ -532,6 +534,8 @@ else:
"LDMTextToImagePipeline",
"LEditsPPPipelineStableDiffusion",
"LEditsPPPipelineStableDiffusionXL",
"LongCatImageEditPipeline",
"LongCatImagePipeline",
"LTXConditionPipeline",
"LTXImageToVideoPipeline",
"LTXLatentUpsamplePipeline",
@@ -561,6 +565,7 @@ else:
"QwenImageEditPlusPipeline",
"QwenImageImg2ImgPipeline",
"QwenImageInpaintPipeline",
"QwenImageLayeredPipeline",
"QwenImagePipeline",
"ReduxImageEncoder",
"SanaControlNetPipeline",
@@ -666,6 +671,8 @@ else:
"WuerstchenCombinedPipeline",
"WuerstchenDecoderPipeline",
"WuerstchenPriorPipeline",
"ZImageControlNetInpaintPipeline",
"ZImageControlNetPipeline",
"ZImageImg2ImgPipeline",
"ZImagePipeline",
]
@@ -970,6 +977,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
Kandinsky3UNet,
Kandinsky5Transformer3DModel,
LatteTransformer3DModel,
LongCatImageTransformer2DModel,
LTXVideoTransformer3DModel,
Lumina2Transformer2DModel,
LuminaNextDiT2DModel,
@@ -1012,6 +1020,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
WanAnimateTransformer3DModel,
WanTransformer3DModel,
WanVACETransformer3DModel,
ZImageControlNetModel,
ZImageTransformer2DModel,
attention_backend,
)
@@ -1237,6 +1246,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
LDMTextToImagePipeline,
LEditsPPPipelineStableDiffusion,
LEditsPPPipelineStableDiffusionXL,
LongCatImageEditPipeline,
LongCatImagePipeline,
LTXConditionPipeline,
LTXImageToVideoPipeline,
LTXLatentUpsamplePipeline,
@@ -1266,6 +1277,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
QwenImageEditPlusPipeline,
QwenImageImg2ImgPipeline,
QwenImageInpaintPipeline,
QwenImageLayeredPipeline,
QwenImagePipeline,
ReduxImageEncoder,
SanaControlNetPipeline,
@@ -1369,6 +1381,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
WuerstchenCombinedPipeline,
WuerstchenDecoderPipeline,
WuerstchenPriorPipeline,
ZImageControlNetInpaintPipeline,
ZImageControlNetPipeline,
ZImageImg2ImgPipeline,
ZImagePipeline,
)

View File

@@ -49,6 +49,7 @@ from .single_file_utils import (
convert_stable_cascade_unet_single_file_to_diffusers,
convert_wan_transformer_to_diffusers,
convert_wan_vae_to_diffusers,
convert_z_image_controlnet_checkpoint_to_diffusers,
convert_z_image_transformer_checkpoint_to_diffusers,
create_controlnet_diffusers_config_from_ldm,
create_unet_diffusers_config_from_ldm,
@@ -172,11 +173,18 @@ SINGLE_FILE_LOADABLE_CLASSES = {
"checkpoint_mapping_fn": convert_z_image_transformer_checkpoint_to_diffusers,
"default_subfolder": "transformer",
},
"ZImageControlNetModel": {
"checkpoint_mapping_fn": convert_z_image_controlnet_checkpoint_to_diffusers,
},
}
def _should_convert_state_dict_to_diffusers(model_state_dict, checkpoint_state_dict):
return not set(model_state_dict.keys()).issubset(set(checkpoint_state_dict.keys()))
model_state_dict_keys = set(model_state_dict.keys())
checkpoint_state_dict_keys = set(checkpoint_state_dict.keys())
is_subset = model_state_dict_keys.issubset(checkpoint_state_dict_keys)
is_match = model_state_dict_keys == checkpoint_state_dict_keys
return not (is_subset and is_match)
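A toy illustration of the behavioral change (illustrative key names, not real state dicts): under the old subset-only test, a checkpoint containing every model key plus extra keys (for example `control_noise_refiner.*` weights that the mapping function drops) would skip conversion; with the added equality requirement, such a checkpoint is now routed through the conversion function.

```python
model_keys = {"control_layers.0.attention.to_q.weight"}
checkpoint_keys = model_keys | {"control_noise_refiner.0.attention.to_q.weight"}

# Old behavior: model keys are a subset of the checkpoint keys, so conversion was skipped.
old_should_convert = not model_keys.issubset(checkpoint_keys)  # False

# New behavior: the key sets must also match exactly, so the extra keys trigger conversion.
new_should_convert = not (model_keys.issubset(checkpoint_keys) and model_keys == checkpoint_keys)  # True
```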
def _get_single_file_loadable_mapping_class(cls):

View File

@@ -121,6 +121,8 @@ CHECKPOINT_KEY_NAMES = {
"instruct-pix2pix": "model.diffusion_model.input_blocks.0.0.weight",
"lumina2": ["model.diffusion_model.cap_embedder.0.weight", "cap_embedder.0.weight"],
"z-image-turbo": "cap_embedder.0.weight",
"z-image-turbo-controlnet": "control_all_x_embedder.2-1.weight",
"z-image-turbo-controlnet-2.x": "control_layers.14.adaLN_modulation.0.weight",
"sana": [
"blocks.0.cross_attn.q_linear.weight",
"blocks.0.cross_attn.q_linear.bias",
@@ -220,6 +222,8 @@ DIFFUSERS_DEFAULT_PIPELINE_PATHS = {
"cosmos-2.0-v2w-2B": {"pretrained_model_name_or_path": "nvidia/Cosmos-Predict2-2B-Video2World"},
"cosmos-2.0-v2w-14B": {"pretrained_model_name_or_path": "nvidia/Cosmos-Predict2-14B-Video2World"},
"z-image-turbo": {"pretrained_model_name_or_path": "Tongyi-MAI/Z-Image-Turbo"},
"z-image-turbo-controlnet": {"pretrained_model_name_or_path": "hlky/Z-Image-Turbo-Fun-Controlnet-Union"},
"z-image-turbo-controlnet-2.x": {"pretrained_model_name_or_path": "hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.1"},
}
# Use to configure model sample size when original config is provided
@@ -779,6 +783,12 @@ def infer_diffusers_model_type(checkpoint):
else:
raise ValueError(f"Unexpected x_embedder shape: {x_embedder_shape} when loading Cosmos 2.0 model.")
elif CHECKPOINT_KEY_NAMES["z-image-turbo-controlnet-2.x"] in checkpoint:
model_type = "z-image-turbo-controlnet-2.x"
elif CHECKPOINT_KEY_NAMES["z-image-turbo-controlnet"] in checkpoint:
model_type = "z-image-turbo-controlnet"
else:
model_type = "v1"
@@ -3885,3 +3895,17 @@ def convert_z_image_transformer_checkpoint_to_diffusers(checkpoint, **kwargs):
handler_fn_inplace(key, converted_state_dict)
return converted_state_dict
def convert_z_image_controlnet_checkpoint_to_diffusers(checkpoint, config, **kwargs):
if config["add_control_noise_refiner"] is None:
return checkpoint
elif config["add_control_noise_refiner"] == "control_noise_refiner":
return checkpoint
elif config["add_control_noise_refiner"] == "control_layers":
converted_state_dict = {
key: checkpoint.pop(key) for key in list(checkpoint.keys()) if not key.startswith("control_noise_refiner.")
}
return converted_state_dict
else:
raise ValueError("Unknown Z-Image Turbo ControlNet type.")

View File

@@ -66,6 +66,7 @@ if is_torch_available():
_import_structure["controlnets.controlnet_sparsectrl"] = ["SparseControlNetModel"]
_import_structure["controlnets.controlnet_union"] = ["ControlNetUnionModel"]
_import_structure["controlnets.controlnet_xs"] = ["ControlNetXSAdapter", "UNetControlNetXSModel"]
_import_structure["controlnets.controlnet_z_image"] = ["ZImageControlNetModel"]
_import_structure["controlnets.multicontrolnet"] = ["MultiControlNetModel"]
_import_structure["controlnets.multicontrolnet_union"] = ["MultiControlNetUnionModel"]
_import_structure["embeddings"] = ["ImageProjection"]
@@ -101,6 +102,7 @@ if is_torch_available():
_import_structure["transformers.transformer_hunyuan_video_framepack"] = ["HunyuanVideoFramepackTransformer3DModel"]
_import_structure["transformers.transformer_hunyuanimage"] = ["HunyuanImageTransformer2DModel"]
_import_structure["transformers.transformer_kandinsky"] = ["Kandinsky5Transformer3DModel"]
_import_structure["transformers.transformer_longcat_image"] = ["LongCatImageTransformer2DModel"]
_import_structure["transformers.transformer_ltx"] = ["LTXVideoTransformer3DModel"]
_import_structure["transformers.transformer_lumina2"] = ["Lumina2Transformer2DModel"]
_import_structure["transformers.transformer_mochi"] = ["MochiTransformer3DModel"]
@@ -180,6 +182,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
SD3MultiControlNetModel,
SparseControlNetModel,
UNetControlNetXSModel,
ZImageControlNetModel,
)
from .embeddings import ImageProjection
from .modeling_utils import ModelMixin
@@ -208,6 +211,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
HunyuanVideoTransformer3DModel,
Kandinsky5Transformer3DModel,
LatteTransformer3DModel,
LongCatImageTransformer2DModel,
LTXVideoTransformer3DModel,
Lumina2Transformer2DModel,
LuminaNextDiT2DModel,

View File

@@ -235,6 +235,10 @@ class _AttentionBackendRegistry:
def get_active_backend(cls):
return cls._active_backend, cls._backends[cls._active_backend]
@classmethod
def set_active_backend(cls, backend: str):
cls._active_backend = backend
@classmethod
def list_backends(cls):
return list(cls._backends.keys())
@@ -294,12 +298,12 @@ def attention_backend(backend: Union[str, AttentionBackendName] = AttentionBacke
_maybe_download_kernel_for_backend(backend)
old_backend = _AttentionBackendRegistry._active_backend
_AttentionBackendRegistry._active_backend = backend
_AttentionBackendRegistry.set_active_backend(backend)
try:
yield
finally:
_AttentionBackendRegistry._active_backend = old_backend
_AttentionBackendRegistry.set_active_backend(old_backend)
def dispatch_attention_fn(
@@ -348,6 +352,18 @@ def dispatch_attention_fn(
check(**kwargs)
kwargs = {k: v for k, v in kwargs.items() if k in _AttentionBackendRegistry._supported_arg_names[backend_name]}
if "_parallel_config" in kwargs and kwargs["_parallel_config"] is not None:
attention_backend = AttentionBackendName(backend_name)
if not _AttentionBackendRegistry._is_context_parallel_available(attention_backend):
compatible_backends = sorted(_AttentionBackendRegistry._supports_context_parallel)
raise ValueError(
f"Context parallelism is enabled but backend '{attention_backend.value}' "
f"which does not support context parallelism. "
f"Please set a compatible attention backend: {compatible_backends} using `model.set_attention_backend()` before "
f"calling `model.enable_parallelism()`."
)
return backend_fn(**kwargs)
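With this check in place, enabling context parallelism with an incompatible backend now fails at dispatch time with an actionable message instead of silently producing wrong results. Independently of context parallelism, the `attention_backend` context manager shown above can be used to switch backends temporarily; a minimal sketch, with an illustrative pipeline and backend name:

```python
import torch

from diffusers import ZImagePipeline, attention_backend

pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16).to("cuda")

# The registry's active backend is switched only inside this block; the previous backend is
# restored afterwards via set_active_backend in the finally clause above.
with attention_backend("native"):
    image = pipe("a cat wearing a tiny wizard hat", num_inference_steps=8).images[0]
```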

View File

@@ -394,6 +394,7 @@ class QwenImageEncoder3d(nn.Module):
attn_scales=[],
temperal_downsample=[True, True, False],
dropout=0.0,
input_channels=3,
non_linearity: str = "silu",
):
super().__init__()
@@ -410,7 +411,7 @@ class QwenImageEncoder3d(nn.Module):
scale = 1.0
# init block
self.conv_in = QwenImageCausalConv3d(3, dims[0], 3, padding=1)
self.conv_in = QwenImageCausalConv3d(input_channels, dims[0], 3, padding=1)
# downsample blocks
self.down_blocks = nn.ModuleList([])
@@ -570,6 +571,7 @@ class QwenImageDecoder3d(nn.Module):
attn_scales=[],
temperal_upsample=[False, True, True],
dropout=0.0,
input_channels=3,
non_linearity: str = "silu",
):
super().__init__()
@@ -621,7 +623,7 @@ class QwenImageDecoder3d(nn.Module):
# output blocks
self.norm_out = QwenImageRMS_norm(out_dim, images=False)
self.conv_out = QwenImageCausalConv3d(out_dim, 3, 3, padding=1)
self.conv_out = QwenImageCausalConv3d(out_dim, input_channels, 3, padding=1)
self.gradient_checkpointing = False
@@ -684,6 +686,7 @@ class AutoencoderKLQwenImage(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
attn_scales: List[float] = [],
temperal_downsample: List[bool] = [False, True, True],
dropout: float = 0.0,
input_channels: int = 3,
latents_mean: List[float] = [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921],
latents_std: List[float] = [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.9160],
) -> None:
@@ -695,13 +698,13 @@ class AutoencoderKLQwenImage(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
self.temperal_upsample = temperal_downsample[::-1]
self.encoder = QwenImageEncoder3d(
base_dim, z_dim * 2, dim_mult, num_res_blocks, attn_scales, self.temperal_downsample, dropout
base_dim, z_dim * 2, dim_mult, num_res_blocks, attn_scales, self.temperal_downsample, dropout, input_channels
)
self.quant_conv = QwenImageCausalConv3d(z_dim * 2, z_dim * 2, 1)
self.post_quant_conv = QwenImageCausalConv3d(z_dim, z_dim, 1)
self.decoder = QwenImageDecoder3d(
base_dim, z_dim, dim_mult, num_res_blocks, attn_scales, self.temperal_upsample, dropout
base_dim, z_dim, dim_mult, num_res_blocks, attn_scales, self.temperal_upsample, dropout, input_channels
)
self.spatial_compression_ratio = 2 ** len(self.temperal_downsample)
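The new `input_channels` argument is threaded through the encoder's `conv_in` and the decoder's `conv_out`, so the VAE can operate on inputs with more than three channels (for example RGBA for the layered Qwen-Image support). A minimal sketch with randomly initialized weights; the channel count and tensor sizes are illustrative:

```python
import torch

from diffusers import AutoencoderKLQwenImage

# Four input channels (e.g. RGBA); all other constructor arguments keep their defaults.
vae = AutoencoderKLQwenImage(input_channels=4)

rgba = torch.randn(1, 4, 1, 64, 64)  # (batch, channels, frames, height, width)
latents = vae.encode(rgba).latent_dist.sample()
reconstruction = vae.decode(latents).sample
print(reconstruction.shape)  # expected: torch.Size([1, 4, 1, 64, 64])
```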

View File

@@ -19,6 +19,7 @@ if is_torch_available():
)
from .controlnet_union import ControlNetUnionModel
from .controlnet_xs import ControlNetXSAdapter, ControlNetXSOutput, UNetControlNetXSModel
from .controlnet_z_image import ZImageControlNetModel
from .multicontrolnet import MultiControlNetModel
from .multicontrolnet_union import MultiControlNetUnionModel

View File

@@ -0,0 +1,824 @@
# Copyright 2025 Alibaba Z-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import List, Literal, Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import PeftAdapterMixin
from ...loaders.single_file_model import FromOriginalModelMixin
from ...models.attention_processor import Attention
from ...models.normalization import RMSNorm
from ...utils.torch_utils import maybe_allow_in_graph
from ..attention_dispatch import dispatch_attention_fn
from ..controlnets.controlnet import zero_module
from ..modeling_utils import ModelMixin
ADALN_EMBED_DIM = 256
SEQ_MULTI_OF = 32
# Copied from diffusers.models.transformers.transformer_z_image.TimestepEmbedder
class TimestepEmbedder(nn.Module):
def __init__(self, out_size, mid_size=None, frequency_embedding_size=256):
super().__init__()
if mid_size is None:
mid_size = out_size
self.mlp = nn.Sequential(
nn.Linear(frequency_embedding_size, mid_size, bias=True),
nn.SiLU(),
nn.Linear(mid_size, out_size, bias=True),
)
self.frequency_embedding_size = frequency_embedding_size
@staticmethod
def timestep_embedding(t, dim, max_period=10000):
with torch.amp.autocast("cuda", enabled=False):
half = dim // 2
freqs = torch.exp(
-math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32, device=t.device) / half
)
args = t[:, None].float() * freqs[None]
embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
if dim % 2:
embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
return embedding
def forward(self, t):
t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
weight_dtype = self.mlp[0].weight.dtype
compute_dtype = getattr(self.mlp[0], "compute_dtype", None)
if weight_dtype.is_floating_point:
t_freq = t_freq.to(weight_dtype)
elif compute_dtype is not None:
t_freq = t_freq.to(compute_dtype)
t_emb = self.mlp(t_freq)
return t_emb
# Copied from diffusers.models.transformers.transformer_z_image.ZSingleStreamAttnProcessor
class ZSingleStreamAttnProcessor:
"""
Processor for Z-Image single stream attention that adapts the existing Attention class to match the behavior of the
original Z-ImageAttention module.
"""
_attention_backend = None
_parallel_config = None
def __init__(self):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError(
"ZSingleStreamAttnProcessor requires PyTorch 2.0. To use it, please upgrade PyTorch to version 2.0 or higher."
)
def __call__(
self,
attn: Attention,
hidden_states: torch.Tensor,
encoder_hidden_states: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
freqs_cis: Optional[torch.Tensor] = None,
) -> torch.Tensor:
query = attn.to_q(hidden_states)
key = attn.to_k(hidden_states)
value = attn.to_v(hidden_states)
query = query.unflatten(-1, (attn.heads, -1))
key = key.unflatten(-1, (attn.heads, -1))
value = value.unflatten(-1, (attn.heads, -1))
# Apply Norms
if attn.norm_q is not None:
query = attn.norm_q(query)
if attn.norm_k is not None:
key = attn.norm_k(key)
# Apply RoPE
def apply_rotary_emb(x_in: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
with torch.amp.autocast("cuda", enabled=False):
x = torch.view_as_complex(x_in.float().reshape(*x_in.shape[:-1], -1, 2))
freqs_cis = freqs_cis.unsqueeze(2)
x_out = torch.view_as_real(x * freqs_cis).flatten(3)
return x_out.type_as(x_in) # todo
if freqs_cis is not None:
query = apply_rotary_emb(query, freqs_cis)
key = apply_rotary_emb(key, freqs_cis)
# Cast to correct dtype
dtype = query.dtype
query, key = query.to(dtype), key.to(dtype)
# From [batch, seq_len] to [batch, 1, 1, seq_len] -> broadcast to [batch, heads, seq_len, seq_len]
if attention_mask is not None and attention_mask.ndim == 2:
attention_mask = attention_mask[:, None, None, :]
# Compute joint attention
hidden_states = dispatch_attention_fn(
query,
key,
value,
attn_mask=attention_mask,
dropout_p=0.0,
is_causal=False,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
# Reshape back
hidden_states = hidden_states.flatten(2, 3)
hidden_states = hidden_states.to(dtype)
output = attn.to_out[0](hidden_states)
if len(attn.to_out) > 1: # dropout
output = attn.to_out[1](output)
return output
# Copied from diffusers.models.transformers.transformer_z_image.FeedForward
class FeedForward(nn.Module):
def __init__(self, dim: int, hidden_dim: int):
super().__init__()
self.w1 = nn.Linear(dim, hidden_dim, bias=False)
self.w2 = nn.Linear(hidden_dim, dim, bias=False)
self.w3 = nn.Linear(dim, hidden_dim, bias=False)
def _forward_silu_gating(self, x1, x3):
return F.silu(x1) * x3
def forward(self, x):
return self.w2(self._forward_silu_gating(self.w1(x), self.w3(x)))
@maybe_allow_in_graph
# Copied from diffusers.models.transformers.transformer_z_image.ZImageTransformerBlock
class ZImageTransformerBlock(nn.Module):
def __init__(
self,
layer_id: int,
dim: int,
n_heads: int,
n_kv_heads: int,
norm_eps: float,
qk_norm: bool,
modulation=True,
):
super().__init__()
self.dim = dim
self.head_dim = dim // n_heads
# Refactored to use diffusers Attention with custom processor
# Original Z-Image params: dim, n_heads, n_kv_heads, qk_norm
self.attention = Attention(
query_dim=dim,
cross_attention_dim=None,
dim_head=dim // n_heads,
heads=n_heads,
qk_norm="rms_norm" if qk_norm else None,
eps=1e-5,
bias=False,
out_bias=False,
processor=ZSingleStreamAttnProcessor(),
)
self.feed_forward = FeedForward(dim=dim, hidden_dim=int(dim / 3 * 8))
self.layer_id = layer_id
self.attention_norm1 = RMSNorm(dim, eps=norm_eps)
self.ffn_norm1 = RMSNorm(dim, eps=norm_eps)
self.attention_norm2 = RMSNorm(dim, eps=norm_eps)
self.ffn_norm2 = RMSNorm(dim, eps=norm_eps)
self.modulation = modulation
if modulation:
self.adaLN_modulation = nn.Sequential(nn.Linear(min(dim, ADALN_EMBED_DIM), 4 * dim, bias=True))
def forward(
self,
x: torch.Tensor,
attn_mask: torch.Tensor,
freqs_cis: torch.Tensor,
adaln_input: Optional[torch.Tensor] = None,
):
if self.modulation:
assert adaln_input is not None
scale_msa, gate_msa, scale_mlp, gate_mlp = self.adaLN_modulation(adaln_input).unsqueeze(1).chunk(4, dim=2)
gate_msa, gate_mlp = gate_msa.tanh(), gate_mlp.tanh()
scale_msa, scale_mlp = 1.0 + scale_msa, 1.0 + scale_mlp
# Attention block
attn_out = self.attention(
self.attention_norm1(x) * scale_msa, attention_mask=attn_mask, freqs_cis=freqs_cis
)
x = x + gate_msa * self.attention_norm2(attn_out)
# FFN block
x = x + gate_mlp * self.ffn_norm2(self.feed_forward(self.ffn_norm1(x) * scale_mlp))
else:
# Attention block
attn_out = self.attention(self.attention_norm1(x), attention_mask=attn_mask, freqs_cis=freqs_cis)
x = x + self.attention_norm2(attn_out)
# FFN block
x = x + self.ffn_norm2(self.feed_forward(self.ffn_norm1(x)))
return x
# Copied from diffusers.models.transformers.transformer_z_image.RopeEmbedder
class RopeEmbedder:
def __init__(
self,
theta: float = 256.0,
axes_dims: List[int] = (16, 56, 56),
axes_lens: List[int] = (64, 128, 128),
):
self.theta = theta
self.axes_dims = axes_dims
self.axes_lens = axes_lens
assert len(axes_dims) == len(axes_lens), "axes_dims and axes_lens must have the same length"
self.freqs_cis = None
@staticmethod
def precompute_freqs_cis(dim: List[int], end: List[int], theta: float = 256.0):
with torch.device("cpu"):
freqs_cis = []
for i, (d, e) in enumerate(zip(dim, end)):
freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float64, device="cpu") / d))
timestep = torch.arange(e, device=freqs.device, dtype=torch.float64)
freqs = torch.outer(timestep, freqs).float()
freqs_cis_i = torch.polar(torch.ones_like(freqs), freqs).to(torch.complex64) # complex64
freqs_cis.append(freqs_cis_i)
return freqs_cis
def __call__(self, ids: torch.Tensor):
assert ids.ndim == 2
assert ids.shape[-1] == len(self.axes_dims)
device = ids.device
if self.freqs_cis is None:
self.freqs_cis = self.precompute_freqs_cis(self.axes_dims, self.axes_lens, theta=self.theta)
self.freqs_cis = [freqs_cis.to(device) for freqs_cis in self.freqs_cis]
else:
# Ensure freqs_cis are on the same device as ids
if self.freqs_cis[0].device != device:
self.freqs_cis = [freqs_cis.to(device) for freqs_cis in self.freqs_cis]
result = []
for i in range(len(self.axes_dims)):
index = ids[:, i]
result.append(self.freqs_cis[i][index])
return torch.cat(result, dim=-1)
@maybe_allow_in_graph
class ZImageControlTransformerBlock(nn.Module):
def __init__(
self,
layer_id: int,
dim: int,
n_heads: int,
n_kv_heads: int,
norm_eps: float,
qk_norm: bool,
modulation=True,
block_id=0,
):
super().__init__()
self.dim = dim
self.head_dim = dim // n_heads
# Refactored to use diffusers Attention with custom processor
# Original Z-Image params: dim, n_heads, n_kv_heads, qk_norm
self.attention = Attention(
query_dim=dim,
cross_attention_dim=None,
dim_head=dim // n_heads,
heads=n_heads,
qk_norm="rms_norm" if qk_norm else None,
eps=1e-5,
bias=False,
out_bias=False,
processor=ZSingleStreamAttnProcessor(),
)
self.feed_forward = FeedForward(dim=dim, hidden_dim=int(dim / 3 * 8))
self.layer_id = layer_id
self.attention_norm1 = RMSNorm(dim, eps=norm_eps)
self.ffn_norm1 = RMSNorm(dim, eps=norm_eps)
self.attention_norm2 = RMSNorm(dim, eps=norm_eps)
self.ffn_norm2 = RMSNorm(dim, eps=norm_eps)
self.modulation = modulation
if modulation:
self.adaLN_modulation = nn.Sequential(nn.Linear(min(dim, ADALN_EMBED_DIM), 4 * dim, bias=True))
# Control variant start
self.block_id = block_id
if block_id == 0:
self.before_proj = zero_module(nn.Linear(self.dim, self.dim))
self.after_proj = zero_module(nn.Linear(self.dim, self.dim))
def forward(
self,
c: torch.Tensor,
x: torch.Tensor,
attn_mask: torch.Tensor,
freqs_cis: torch.Tensor,
adaln_input: Optional[torch.Tensor] = None,
):
# Control
if self.block_id == 0:
c = self.before_proj(c) + x
all_c = []
else:
all_c = list(torch.unbind(c))
c = all_c.pop(-1)
# Compared to `ZImageTransformerBlock` x -> c
if self.modulation:
assert adaln_input is not None
scale_msa, gate_msa, scale_mlp, gate_mlp = self.adaLN_modulation(adaln_input).unsqueeze(1).chunk(4, dim=2)
gate_msa, gate_mlp = gate_msa.tanh(), gate_mlp.tanh()
scale_msa, scale_mlp = 1.0 + scale_msa, 1.0 + scale_mlp
# Attention block
attn_out = self.attention(
self.attention_norm1(c) * scale_msa, attention_mask=attn_mask, freqs_cis=freqs_cis
)
c = c + gate_msa * self.attention_norm2(attn_out)
# FFN block
c = c + gate_mlp * self.ffn_norm2(self.feed_forward(self.ffn_norm1(c) * scale_mlp))
else:
# Attention block
attn_out = self.attention(self.attention_norm1(c), attention_mask=attn_mask, freqs_cis=freqs_cis)
c = c + self.attention_norm2(attn_out)
# FFN block
c = c + self.ffn_norm2(self.feed_forward(self.ffn_norm1(c)))
# Control
c_skip = self.after_proj(c)
all_c += [c_skip, c]
c = torch.stack(all_c)
return c
class ZImageControlNetModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
_supports_gradient_checkpointing = True
@register_to_config
def __init__(
self,
control_layers_places: List[int] = None,
control_refiner_layers_places: List[int] = None,
control_in_dim=None,
add_control_noise_refiner: Optional[Literal["control_layers", "control_noise_refiner"]] = None,
all_patch_size=(2,),
all_f_patch_size=(1,),
dim=3840,
n_refiner_layers=2,
n_heads=30,
n_kv_heads=30,
norm_eps=1e-5,
qk_norm=True,
):
super().__init__()
self.control_layers_places = control_layers_places
self.control_in_dim = control_in_dim
self.control_refiner_layers_places = control_refiner_layers_places
self.add_control_noise_refiner = add_control_noise_refiner
assert 0 in self.control_layers_places
# control blocks
self.control_layers = nn.ModuleList(
[
ZImageControlTransformerBlock(i, dim, n_heads, n_kv_heads, norm_eps, qk_norm, block_id=i)
for i in self.control_layers_places
]
)
# control patch embeddings
all_x_embedder = {}
for patch_idx, (patch_size, f_patch_size) in enumerate(zip(all_patch_size, all_f_patch_size)):
x_embedder = nn.Linear(f_patch_size * patch_size * patch_size * self.control_in_dim, dim, bias=True)
all_x_embedder[f"{patch_size}-{f_patch_size}"] = x_embedder
self.control_all_x_embedder = nn.ModuleDict(all_x_embedder)
if self.add_control_noise_refiner == "control_layers":
self.control_noise_refiner = None
elif self.add_control_noise_refiner == "control_noise_refiner":
self.control_noise_refiner = nn.ModuleList(
[
ZImageControlTransformerBlock(
1000 + layer_id,
dim,
n_heads,
n_kv_heads,
norm_eps,
qk_norm,
modulation=True,
block_id=layer_id,
)
for layer_id in range(n_refiner_layers)
]
)
else:
self.control_noise_refiner = nn.ModuleList(
[
ZImageTransformerBlock(
1000 + layer_id,
dim,
n_heads,
n_kv_heads,
norm_eps,
qk_norm,
modulation=True,
)
for layer_id in range(n_refiner_layers)
]
)
self.t_scale: Optional[float] = None
self.t_embedder: Optional[TimestepEmbedder] = None
self.all_x_embedder: Optional[nn.ModuleDict] = None
self.cap_embedder: Optional[nn.Sequential] = None
self.rope_embedder: Optional[RopeEmbedder] = None
self.noise_refiner: Optional[nn.ModuleList] = None
self.context_refiner: Optional[nn.ModuleList] = None
self.x_pad_token: Optional[nn.Parameter] = None
self.cap_pad_token: Optional[nn.Parameter] = None
@classmethod
def from_transformer(cls, controlnet, transformer):
controlnet.t_scale = transformer.t_scale
controlnet.t_embedder = transformer.t_embedder
controlnet.all_x_embedder = transformer.all_x_embedder
controlnet.cap_embedder = transformer.cap_embedder
controlnet.rope_embedder = transformer.rope_embedder
controlnet.noise_refiner = transformer.noise_refiner
controlnet.context_refiner = transformer.context_refiner
controlnet.x_pad_token = transformer.x_pad_token
controlnet.cap_pad_token = transformer.cap_pad_token
return controlnet
@staticmethod
# Copied from diffusers.models.transformers.transformer_z_image.ZImageTransformer2DModel.create_coordinate_grid
def create_coordinate_grid(size, start=None, device=None):
if start is None:
start = (0 for _ in size)
axes = [torch.arange(x0, x0 + span, dtype=torch.int32, device=device) for x0, span in zip(start, size)]
grids = torch.meshgrid(axes, indexing="ij")
return torch.stack(grids, dim=-1)
# Copied from diffusers.models.transformers.transformer_z_image.ZImageTransformer2DModel.patchify_and_embed
def patchify_and_embed(
self,
all_image: List[torch.Tensor],
all_cap_feats: List[torch.Tensor],
patch_size: int,
f_patch_size: int,
):
pH = pW = patch_size
pF = f_patch_size
device = all_image[0].device
all_image_out = []
all_image_size = []
all_image_pos_ids = []
all_image_pad_mask = []
all_cap_pos_ids = []
all_cap_pad_mask = []
all_cap_feats_out = []
for i, (image, cap_feat) in enumerate(zip(all_image, all_cap_feats)):
### Process Caption
cap_ori_len = len(cap_feat)
cap_padding_len = (-cap_ori_len) % SEQ_MULTI_OF
# padded position ids
cap_padded_pos_ids = self.create_coordinate_grid(
size=(cap_ori_len + cap_padding_len, 1, 1),
start=(1, 0, 0),
device=device,
).flatten(0, 2)
all_cap_pos_ids.append(cap_padded_pos_ids)
# pad mask
cap_pad_mask = torch.cat(
[
torch.zeros((cap_ori_len,), dtype=torch.bool, device=device),
torch.ones((cap_padding_len,), dtype=torch.bool, device=device),
],
dim=0,
)
all_cap_pad_mask.append(
cap_pad_mask if cap_padding_len > 0 else torch.zeros((cap_ori_len,), dtype=torch.bool, device=device)
)
# padded feature
cap_padded_feat = torch.cat([cap_feat, cap_feat[-1:].repeat(cap_padding_len, 1)], dim=0)
all_cap_feats_out.append(cap_padded_feat)
### Process Image
C, F, H, W = image.size()
all_image_size.append((F, H, W))
F_tokens, H_tokens, W_tokens = F // pF, H // pH, W // pW
image = image.view(C, F_tokens, pF, H_tokens, pH, W_tokens, pW)
# "c f pf h ph w pw -> (f h w) (pf ph pw c)"
image = image.permute(1, 3, 5, 2, 4, 6, 0).reshape(F_tokens * H_tokens * W_tokens, pF * pH * pW * C)
image_ori_len = len(image)
image_padding_len = (-image_ori_len) % SEQ_MULTI_OF
image_ori_pos_ids = self.create_coordinate_grid(
size=(F_tokens, H_tokens, W_tokens),
start=(cap_ori_len + cap_padding_len + 1, 0, 0),
device=device,
).flatten(0, 2)
image_padded_pos_ids = torch.cat(
[
image_ori_pos_ids,
self.create_coordinate_grid(size=(1, 1, 1), start=(0, 0, 0), device=device)
.flatten(0, 2)
.repeat(image_padding_len, 1),
],
dim=0,
)
all_image_pos_ids.append(image_padded_pos_ids if image_padding_len > 0 else image_ori_pos_ids)
# pad mask
image_pad_mask = torch.cat(
[
torch.zeros((image_ori_len,), dtype=torch.bool, device=device),
torch.ones((image_padding_len,), dtype=torch.bool, device=device),
],
dim=0,
)
all_image_pad_mask.append(
image_pad_mask
if image_padding_len > 0
else torch.zeros((image_ori_len,), dtype=torch.bool, device=device)
)
# padded feature
image_padded_feat = torch.cat(
[image, image[-1:].repeat(image_padding_len, 1)],
dim=0,
)
all_image_out.append(image_padded_feat if image_padding_len > 0 else image)
return (
all_image_out,
all_cap_feats_out,
all_image_size,
all_image_pos_ids,
all_cap_pos_ids,
all_image_pad_mask,
all_cap_pad_mask,
)
def patchify(
self,
all_image: List[torch.Tensor],
patch_size: int,
f_patch_size: int,
):
pH = pW = patch_size
pF = f_patch_size
all_image_out = []
for i, image in enumerate(all_image):
### Process Image
C, F, H, W = image.size()
F_tokens, H_tokens, W_tokens = F // pF, H // pH, W // pW
image = image.view(C, F_tokens, pF, H_tokens, pH, W_tokens, pW)
# "c f pf h ph w pw -> (f h w) (pf ph pw c)"
image = image.permute(1, 3, 5, 2, 4, 6, 0).reshape(F_tokens * H_tokens * W_tokens, pF * pH * pW * C)
image_ori_len = len(image)
image_padding_len = (-image_ori_len) % SEQ_MULTI_OF
# padded feature
image_padded_feat = torch.cat([image, image[-1:].repeat(image_padding_len, 1)], dim=0)
all_image_out.append(image_padded_feat)
return all_image_out
def forward(
self,
x: List[torch.Tensor],
t,
cap_feats: List[torch.Tensor],
control_context: List[torch.Tensor],
conditioning_scale: float = 1.0,
patch_size=2,
f_patch_size=1,
):
if (
self.t_scale is None
or self.t_embedder is None
or self.all_x_embedder is None
or self.cap_embedder is None
or self.rope_embedder is None
or self.noise_refiner is None
or self.context_refiner is None
or self.x_pad_token is None
or self.cap_pad_token is None
):
raise ValueError(
"Required modules are `None`, use `from_transformer` to share required modules from `transformer`."
)
assert patch_size in self.config.all_patch_size
assert f_patch_size in self.config.all_f_patch_size
bsz = len(x)
device = x[0].device
t = t * self.t_scale
t = self.t_embedder(t)
(
x,
cap_feats,
x_size,
x_pos_ids,
cap_pos_ids,
x_inner_pad_mask,
cap_inner_pad_mask,
) = self.patchify_and_embed(x, cap_feats, patch_size, f_patch_size)
x_item_seqlens = [len(_) for _ in x]
assert all(_ % SEQ_MULTI_OF == 0 for _ in x_item_seqlens)
x_max_item_seqlen = max(x_item_seqlens)
control_context = self.patchify(control_context, patch_size, f_patch_size)
control_context = torch.cat(control_context, dim=0)
control_context = self.control_all_x_embedder[f"{patch_size}-{f_patch_size}"](control_context)
control_context[torch.cat(x_inner_pad_mask)] = self.x_pad_token
control_context = list(control_context.split(x_item_seqlens, dim=0))
control_context = pad_sequence(control_context, batch_first=True, padding_value=0.0)
# x embed & refine
x = torch.cat(x, dim=0)
x = self.all_x_embedder[f"{patch_size}-{f_patch_size}"](x)
# Match t_embedder output dtype to x for layerwise casting compatibility
adaln_input = t.type_as(x)
x[torch.cat(x_inner_pad_mask)] = self.x_pad_token
x = list(x.split(x_item_seqlens, dim=0))
x_freqs_cis = list(self.rope_embedder(torch.cat(x_pos_ids, dim=0)).split([len(_) for _ in x_pos_ids], dim=0))
x = pad_sequence(x, batch_first=True, padding_value=0.0)
x_freqs_cis = pad_sequence(x_freqs_cis, batch_first=True, padding_value=0.0)
# Clarify the length matches to satisfy Dynamo due to "Symbolic Shape Inference" to avoid compilation errors
x_freqs_cis = x_freqs_cis[:, : x.shape[1]]
x_attn_mask = torch.zeros((bsz, x_max_item_seqlen), dtype=torch.bool, device=device)
for i, seq_len in enumerate(x_item_seqlens):
x_attn_mask[i, :seq_len] = 1
if self.add_control_noise_refiner is not None:
if self.add_control_noise_refiner == "control_layers":
layers = self.control_layers
elif self.add_control_noise_refiner == "control_noise_refiner":
layers = self.control_noise_refiner
else:
raise ValueError(f"Unsupported `add_control_noise_refiner` type: {self.add_control_noise_refiner}.")
for layer in layers:
if torch.is_grad_enabled() and self.gradient_checkpointing:
control_context = self._gradient_checkpointing_func(
layer, control_context, x, x_attn_mask, x_freqs_cis, adaln_input
)
else:
control_context = layer(control_context, x, x_attn_mask, x_freqs_cis, adaln_input)
hints = torch.unbind(control_context)[:-1]
control_context = torch.unbind(control_context)[-1]
noise_refiner_block_samples = {
layer_idx: hints[idx] * conditioning_scale
for idx, layer_idx in enumerate(self.control_refiner_layers_places)
}
else:
noise_refiner_block_samples = None
if torch.is_grad_enabled() and self.gradient_checkpointing:
for layer_idx, layer in enumerate(self.noise_refiner):
x = self._gradient_checkpointing_func(layer, x, x_attn_mask, x_freqs_cis, adaln_input)
if noise_refiner_block_samples is not None:
if layer_idx in noise_refiner_block_samples:
x = x + noise_refiner_block_samples[layer_idx]
else:
for layer_idx, layer in enumerate(self.noise_refiner):
x = layer(x, x_attn_mask, x_freqs_cis, adaln_input)
if noise_refiner_block_samples is not None:
if layer_idx in noise_refiner_block_samples:
x = x + noise_refiner_block_samples[layer_idx]
# cap embed & refine
cap_item_seqlens = [len(_) for _ in cap_feats]
cap_max_item_seqlen = max(cap_item_seqlens)
cap_feats = torch.cat(cap_feats, dim=0)
cap_feats = self.cap_embedder(cap_feats)
cap_feats[torch.cat(cap_inner_pad_mask)] = self.cap_pad_token
cap_feats = list(cap_feats.split(cap_item_seqlens, dim=0))
cap_freqs_cis = list(
self.rope_embedder(torch.cat(cap_pos_ids, dim=0)).split([len(_) for _ in cap_pos_ids], dim=0)
)
cap_feats = pad_sequence(cap_feats, batch_first=True, padding_value=0.0)
cap_freqs_cis = pad_sequence(cap_freqs_cis, batch_first=True, padding_value=0.0)
# Clarify the length matches to satisfy Dynamo due to "Symbolic Shape Inference" to avoid compilation errors
cap_freqs_cis = cap_freqs_cis[:, : cap_feats.shape[1]]
cap_attn_mask = torch.zeros((bsz, cap_max_item_seqlen), dtype=torch.bool, device=device)
for i, seq_len in enumerate(cap_item_seqlens):
cap_attn_mask[i, :seq_len] = 1
if torch.is_grad_enabled() and self.gradient_checkpointing:
for layer in self.context_refiner:
cap_feats = self._gradient_checkpointing_func(layer, cap_feats, cap_attn_mask, cap_freqs_cis)
else:
for layer in self.context_refiner:
cap_feats = layer(cap_feats, cap_attn_mask, cap_freqs_cis)
# unified
unified = []
unified_freqs_cis = []
for i in range(bsz):
x_len = x_item_seqlens[i]
cap_len = cap_item_seqlens[i]
unified.append(torch.cat([x[i][:x_len], cap_feats[i][:cap_len]]))
unified_freqs_cis.append(torch.cat([x_freqs_cis[i][:x_len], cap_freqs_cis[i][:cap_len]]))
unified_item_seqlens = [a + b for a, b in zip(cap_item_seqlens, x_item_seqlens)]
assert unified_item_seqlens == [len(_) for _ in unified]
unified_max_item_seqlen = max(unified_item_seqlens)
unified = pad_sequence(unified, batch_first=True, padding_value=0.0)
unified_freqs_cis = pad_sequence(unified_freqs_cis, batch_first=True, padding_value=0.0)
unified_attn_mask = torch.zeros((bsz, unified_max_item_seqlen), dtype=torch.bool, device=device)
for i, seq_len in enumerate(unified_item_seqlens):
unified_attn_mask[i, :seq_len] = 1
## ControlNet start
if not self.add_control_noise_refiner:
if torch.is_grad_enabled() and self.gradient_checkpointing:
for layer in self.control_noise_refiner:
control_context = self._gradient_checkpointing_func(
layer, control_context, x_attn_mask, x_freqs_cis, adaln_input
)
else:
for layer in self.control_noise_refiner:
control_context = layer(control_context, x_attn_mask, x_freqs_cis, adaln_input)
# unified
control_context_unified = []
for i in range(bsz):
x_len = x_item_seqlens[i]
cap_len = cap_item_seqlens[i]
control_context_unified.append(torch.cat([control_context[i][:x_len], cap_feats[i][:cap_len]]))
control_context_unified = pad_sequence(control_context_unified, batch_first=True, padding_value=0.0)
for layer in self.control_layers:
if torch.is_grad_enabled() and self.gradient_checkpointing:
control_context_unified = self._gradient_checkpointing_func(
layer, control_context_unified, unified, unified_attn_mask, unified_freqs_cis, adaln_input
)
else:
control_context_unified = layer(
control_context_unified, unified, unified_attn_mask, unified_freqs_cis, adaln_input
)
hints = torch.unbind(control_context_unified)[:-1]
controlnet_block_samples = {
layer_idx: hints[idx] * conditioning_scale for idx, layer_idx in enumerate(self.control_layers_places)
}
return controlnet_block_samples
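`ZImageControlNetModel` only defines the control-specific layers and patch embedders; the shared modules (timestep embedder, x/caption embedders, RoPE embedder, refiners, and pad tokens) stay `None` until they are borrowed from the base transformer, and `forward` raises a `ValueError` otherwise. A minimal sketch of wiring the two models together; the repository ids and subfolders are assumptions for illustration:

```python
import torch

from diffusers import ZImageControlNetModel, ZImageTransformer2DModel

# Repository ids and subfolders are assumptions for illustration.
transformer = ZImageTransformer2DModel.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", subfolder="transformer", torch_dtype=torch.bfloat16
)
controlnet = ZImageControlNetModel.from_pretrained(
    "hlky/Z-Image-Turbo-Fun-Controlnet-Union", subfolder="controlnet", torch_dtype=torch.bfloat16
)

# Share the transformer's embedders, refiners, and pad tokens with the ControlNet.
controlnet = ZImageControlNetModel.from_transformer(controlnet, transformer)
```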

View File

@@ -602,6 +602,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
from .attention import AttentionModuleMixin
from .attention_dispatch import (
AttentionBackendName,
_AttentionBackendRegistry,
_check_attention_backend_requirements,
_maybe_download_kernel_for_backend,
)
@@ -629,6 +630,9 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
continue
processor._attention_backend = backend
# Important to set the active backend so that it propagates gracefully throughout.
_AttentionBackendRegistry.set_active_backend(backend)
def reset_attention_backend(self) -> None:
"""
Resets the attention backend for the model. Following calls to `forward` will use the environment default, if
@@ -1541,7 +1545,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
f"Context parallelism is enabled but the attention processor '{processor.__class__.__name__}' "
f"is using backend '{attention_backend.value}' which does not support context parallelism. "
f"Please set a compatible attention backend: {compatible_backends} using `model.set_attention_backend()` before "
f"calling `enable_parallelism()`."
f"calling `model.enable_parallelism()`."
)
# All modules use the same attention processor and backend. We don't need to
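Together with the dispatch-time check added earlier, the intended flow is to select a context-parallel-capable attention backend on the model first and only then enable parallelism. A minimal sketch; the backend name and the `ContextParallelConfig`-based call are assumptions about the surrounding API rather than something this diff defines:

```python
import torch

from diffusers import ContextParallelConfig, ZImageTransformer2DModel

transformer = ZImageTransformer2DModel.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Setting the backend here also updates the registry's active backend, so the
# enable_parallelism-time check and the dispatch-time check see the same value.
transformer.set_attention_backend("flash")  # assumed to be a backend that supports context parallelism
transformer.enable_parallelism(config=ContextParallelConfig(ring_degree=2))
```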

View File

@@ -33,6 +33,7 @@ if is_torch_available():
from .transformer_hunyuan_video_framepack import HunyuanVideoFramepackTransformer3DModel
from .transformer_hunyuanimage import HunyuanImageTransformer2DModel
from .transformer_kandinsky import Kandinsky5Transformer3DModel
from .transformer_longcat_image import LongCatImageTransformer2DModel
from .transformer_ltx import LTXVideoTransformer3DModel
from .transformer_lumina2 import Lumina2Transformer2DModel
from .transformer_mochi import MochiTransformer3DModel

View File

@@ -0,0 +1,548 @@
# Copyright 2025 MeiTuan LongCat-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
from typing import Any, Dict, List, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
from ...utils import is_torch_npu_available, logging
from ...utils.torch_utils import maybe_allow_in_graph
from ..attention import AttentionModuleMixin, FeedForward
from ..attention_dispatch import dispatch_attention_fn
from ..cache_utils import CacheMixin
from ..embeddings import TimestepEmbedding, Timesteps, apply_rotary_emb, get_1d_rotary_pos_embed
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
from ..normalization import AdaLayerNormContinuous, AdaLayerNormZero, AdaLayerNormZeroSingle
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
def _get_projections(attn: "LongCatImageAttention", hidden_states, encoder_hidden_states=None):
query = attn.to_q(hidden_states)
key = attn.to_k(hidden_states)
value = attn.to_v(hidden_states)
encoder_query = encoder_key = encoder_value = None
if encoder_hidden_states is not None and attn.added_kv_proj_dim is not None:
encoder_query = attn.add_q_proj(encoder_hidden_states)
encoder_key = attn.add_k_proj(encoder_hidden_states)
encoder_value = attn.add_v_proj(encoder_hidden_states)
return query, key, value, encoder_query, encoder_key, encoder_value
def _get_fused_projections(attn: "LongCatImageAttention", hidden_states, encoder_hidden_states=None):
query, key, value = attn.to_qkv(hidden_states).chunk(3, dim=-1)
encoder_query = encoder_key = encoder_value = (None,)
if encoder_hidden_states is not None and hasattr(attn, "to_added_qkv"):
encoder_query, encoder_key, encoder_value = attn.to_added_qkv(encoder_hidden_states).chunk(3, dim=-1)
return query, key, value, encoder_query, encoder_key, encoder_value
def _get_qkv_projections(attn: "LongCatImageAttention", hidden_states, encoder_hidden_states=None):
if attn.fused_projections:
return _get_fused_projections(attn, hidden_states, encoder_hidden_states)
return _get_projections(attn, hidden_states, encoder_hidden_states)
class LongCatImageAttnProcessor:
_attention_backend = None
_parallel_config = None
def __init__(self):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError(f"{self.__class__.__name__} requires PyTorch 2.0. Please upgrade your pytorch version.")
def __call__(
self,
attn: "LongCatImageAttention",
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor = None,
attention_mask: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[torch.Tensor] = None,
) -> torch.Tensor:
query, key, value, encoder_query, encoder_key, encoder_value = _get_qkv_projections(
attn, hidden_states, encoder_hidden_states
)
query = query.unflatten(-1, (attn.heads, -1))
key = key.unflatten(-1, (attn.heads, -1))
value = value.unflatten(-1, (attn.heads, -1))
query = attn.norm_q(query)
key = attn.norm_k(key)
if attn.added_kv_proj_dim is not None:
encoder_query = encoder_query.unflatten(-1, (attn.heads, -1))
encoder_key = encoder_key.unflatten(-1, (attn.heads, -1))
encoder_value = encoder_value.unflatten(-1, (attn.heads, -1))
encoder_query = attn.norm_added_q(encoder_query)
encoder_key = attn.norm_added_k(encoder_key)
query = torch.cat([encoder_query, query], dim=1)
key = torch.cat([encoder_key, key], dim=1)
value = torch.cat([encoder_value, value], dim=1)
if image_rotary_emb is not None:
query = apply_rotary_emb(query, image_rotary_emb, sequence_dim=1)
key = apply_rotary_emb(key, image_rotary_emb, sequence_dim=1)
hidden_states = dispatch_attention_fn(
query,
key,
value,
attn_mask=attention_mask,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
hidden_states = hidden_states.flatten(2, 3)
hidden_states = hidden_states.to(query.dtype)
if encoder_hidden_states is not None:
encoder_hidden_states, hidden_states = hidden_states.split_with_sizes(
[encoder_hidden_states.shape[1], hidden_states.shape[1] - encoder_hidden_states.shape[1]], dim=1
)
hidden_states = attn.to_out[0](hidden_states)
hidden_states = attn.to_out[1](hidden_states)
encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
return hidden_states, encoder_hidden_states
else:
return hidden_states
class LongCatImageAttention(torch.nn.Module, AttentionModuleMixin):
_default_processor_cls = LongCatImageAttnProcessor
_available_processors = [
LongCatImageAttnProcessor,
]
def __init__(
self,
query_dim: int,
heads: int = 8,
dim_head: int = 64,
dropout: float = 0.0,
bias: bool = False,
added_kv_proj_dim: Optional[int] = None,
added_proj_bias: Optional[bool] = True,
out_bias: bool = True,
eps: float = 1e-5,
out_dim: int = None,
context_pre_only: Optional[bool] = None,
pre_only: bool = False,
elementwise_affine: bool = True,
processor=None,
):
super().__init__()
self.head_dim = dim_head
self.inner_dim = out_dim if out_dim is not None else dim_head * heads
self.query_dim = query_dim
self.use_bias = bias
self.dropout = dropout
self.out_dim = out_dim if out_dim is not None else query_dim
self.context_pre_only = context_pre_only
self.pre_only = pre_only
self.heads = out_dim // dim_head if out_dim is not None else heads
self.added_kv_proj_dim = added_kv_proj_dim
self.added_proj_bias = added_proj_bias
self.norm_q = torch.nn.RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
self.norm_k = torch.nn.RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
self.to_q = torch.nn.Linear(query_dim, self.inner_dim, bias=bias)
self.to_k = torch.nn.Linear(query_dim, self.inner_dim, bias=bias)
self.to_v = torch.nn.Linear(query_dim, self.inner_dim, bias=bias)
if not self.pre_only:
self.to_out = torch.nn.ModuleList([])
self.to_out.append(torch.nn.Linear(self.inner_dim, self.out_dim, bias=out_bias))
self.to_out.append(torch.nn.Dropout(dropout))
if added_kv_proj_dim is not None:
self.norm_added_q = torch.nn.RMSNorm(dim_head, eps=eps)
self.norm_added_k = torch.nn.RMSNorm(dim_head, eps=eps)
self.add_q_proj = torch.nn.Linear(added_kv_proj_dim, self.inner_dim, bias=added_proj_bias)
self.add_k_proj = torch.nn.Linear(added_kv_proj_dim, self.inner_dim, bias=added_proj_bias)
self.add_v_proj = torch.nn.Linear(added_kv_proj_dim, self.inner_dim, bias=added_proj_bias)
self.to_add_out = torch.nn.Linear(self.inner_dim, query_dim, bias=out_bias)
if processor is None:
processor = self._default_processor_cls()
self.set_processor(processor)
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[torch.Tensor] = None,
**kwargs,
) -> torch.Tensor:
attn_parameters = set(inspect.signature(self.processor.__call__).parameters.keys())
quiet_attn_parameters = {"ip_adapter_masks", "ip_hidden_states"}
unused_kwargs = [k for k, _ in kwargs.items() if k not in attn_parameters and k not in quiet_attn_parameters]
if len(unused_kwargs) > 0:
logger.warning(
f"joint_attention_kwargs {unused_kwargs} are not expected by {self.processor.__class__.__name__} and will be ignored."
)
kwargs = {k: w for k, w in kwargs.items() if k in attn_parameters}
return self.processor(self, hidden_states, encoder_hidden_states, attention_mask, image_rotary_emb, **kwargs)
@maybe_allow_in_graph
class LongCatImageSingleTransformerBlock(nn.Module):
def __init__(self, dim: int, num_attention_heads: int, attention_head_dim: int, mlp_ratio: float = 4.0):
super().__init__()
self.mlp_hidden_dim = int(dim * mlp_ratio)
self.norm = AdaLayerNormZeroSingle(dim)
self.proj_mlp = nn.Linear(dim, self.mlp_hidden_dim)
self.act_mlp = nn.GELU(approximate="tanh")
self.proj_out = nn.Linear(dim + self.mlp_hidden_dim, dim)
self.attn = LongCatImageAttention(
query_dim=dim,
dim_head=attention_head_dim,
heads=num_attention_heads,
out_dim=dim,
bias=True,
processor=LongCatImageAttnProcessor(),
eps=1e-6,
pre_only=True,
)
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor,
temb: torch.Tensor,
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
text_seq_len = encoder_hidden_states.shape[1]
hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)
residual = hidden_states
norm_hidden_states, gate = self.norm(hidden_states, emb=temb)
mlp_hidden_states = self.act_mlp(self.proj_mlp(norm_hidden_states))
joint_attention_kwargs = joint_attention_kwargs or {}
attn_output = self.attn(
hidden_states=norm_hidden_states,
image_rotary_emb=image_rotary_emb,
**joint_attention_kwargs,
)
hidden_states = torch.cat([attn_output, mlp_hidden_states], dim=2)
gate = gate.unsqueeze(1)
hidden_states = gate * self.proj_out(hidden_states)
hidden_states = residual + hidden_states
if hidden_states.dtype == torch.float16:
hidden_states = hidden_states.clip(-65504, 65504)
encoder_hidden_states, hidden_states = hidden_states[:, :text_seq_len], hidden_states[:, text_seq_len:]
return encoder_hidden_states, hidden_states
@maybe_allow_in_graph
class LongCatImageTransformerBlock(nn.Module):
def __init__(
self, dim: int, num_attention_heads: int, attention_head_dim: int, qk_norm: str = "rms_norm", eps: float = 1e-6
):
super().__init__()
self.norm1 = AdaLayerNormZero(dim)
self.norm1_context = AdaLayerNormZero(dim)
self.attn = LongCatImageAttention(
query_dim=dim,
added_kv_proj_dim=dim,
dim_head=attention_head_dim,
heads=num_attention_heads,
out_dim=dim,
context_pre_only=False,
bias=True,
processor=LongCatImageAttnProcessor(),
eps=eps,
)
self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
self.ff = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")
self.norm2_context = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
self.ff_context = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor,
temb: torch.Tensor,
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(hidden_states, emb=temb)
norm_encoder_hidden_states, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.norm1_context(
encoder_hidden_states, emb=temb
)
joint_attention_kwargs = joint_attention_kwargs or {}
# Attention.
attention_outputs = self.attn(
hidden_states=norm_hidden_states,
encoder_hidden_states=norm_encoder_hidden_states,
image_rotary_emb=image_rotary_emb,
**joint_attention_kwargs,
)
if len(attention_outputs) == 2:
attn_output, context_attn_output = attention_outputs
elif len(attention_outputs) == 3:
attn_output, context_attn_output, ip_attn_output = attention_outputs
# Process attention outputs for the `hidden_states`.
attn_output = gate_msa.unsqueeze(1) * attn_output
hidden_states = hidden_states + attn_output
norm_hidden_states = self.norm2(hidden_states)
norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
ff_output = self.ff(norm_hidden_states)
ff_output = gate_mlp.unsqueeze(1) * ff_output
hidden_states = hidden_states + ff_output
if len(attention_outputs) == 3:
hidden_states = hidden_states + ip_attn_output
# Process attention outputs for the `encoder_hidden_states`.
context_attn_output = c_gate_msa.unsqueeze(1) * context_attn_output
encoder_hidden_states = encoder_hidden_states + context_attn_output
norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states)
norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]
context_ff_output = self.ff_context(norm_encoder_hidden_states)
encoder_hidden_states = encoder_hidden_states + c_gate_mlp.unsqueeze(1) * context_ff_output
if encoder_hidden_states.dtype == torch.float16:
encoder_hidden_states = encoder_hidden_states.clip(-65504, 65504)
return encoder_hidden_states, hidden_states
class LongCatImagePosEmbed(nn.Module):
def __init__(self, theta: int, axes_dim: List[int]):
super().__init__()
self.theta = theta
self.axes_dim = axes_dim
def forward(self, ids: torch.Tensor) -> torch.Tensor:
n_axes = ids.shape[-1]
cos_out = []
sin_out = []
pos = ids.float()
is_mps = ids.device.type == "mps"
is_npu = ids.device.type == "npu"
freqs_dtype = torch.float32 if (is_mps or is_npu) else torch.float64
for i in range(n_axes):
cos, sin = get_1d_rotary_pos_embed(
self.axes_dim[i],
pos[:, i],
theta=self.theta,
repeat_interleave_real=True,
use_real=True,
freqs_dtype=freqs_dtype,
)
cos_out.append(cos)
sin_out.append(sin)
freqs_cos = torch.cat(cos_out, dim=-1).to(ids.device)
freqs_sin = torch.cat(sin_out, dim=-1).to(ids.device)
return freqs_cos, freqs_sin
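# Illustrative note (not part of the original file): with the default axes_dim = [16, 56, 56] the
# per-axis rotary tables concatenate to the attention head dim (16 + 56 + 56 = 128), so for packed
# text + image position ids the shapes are roughly:
#
#   pos_embed = LongCatImagePosEmbed(theta=10000, axes_dim=[16, 56, 56])
#   ids = torch.zeros(512 + 4032, 3)        # 512 text positions followed by 4032 image positions
#   freqs_cos, freqs_sin = pos_embed(ids)   # each of shape (4544, 128)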
class LongCatImageTimestepEmbeddings(nn.Module):
def __init__(self, embedding_dim):
super().__init__()
self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
def forward(self, timestep, hidden_dtype):
timesteps_proj = self.time_proj(timestep)
timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=hidden_dtype)) # (N, D)
return timesteps_emb
class LongCatImageTransformer2DModel(
ModelMixin,
ConfigMixin,
PeftAdapterMixin,
FromOriginalModelMixin,
CacheMixin,
):
"""
The Transformer model introduced in LongCat-Image.
"""
_supports_gradient_checkpointing = True
@register_to_config
def __init__(
self,
patch_size: int = 1,
in_channels: int = 64,
num_layers: int = 19,
num_single_layers: int = 38,
attention_head_dim: int = 128,
num_attention_heads: int = 24,
joint_attention_dim: int = 3584,
pooled_projection_dim: int = 3584,
axes_dims_rope: List[int] = [16, 56, 56],
):
super().__init__()
self.out_channels = in_channels
self.inner_dim = num_attention_heads * attention_head_dim
self.pooled_projection_dim = pooled_projection_dim
self.pos_embed = LongCatImagePosEmbed(theta=10000, axes_dim=axes_dims_rope)
self.time_embed = LongCatImageTimestepEmbeddings(embedding_dim=self.inner_dim)
self.context_embedder = nn.Linear(joint_attention_dim, self.inner_dim)
self.x_embedder = torch.nn.Linear(in_channels, self.inner_dim)
self.transformer_blocks = nn.ModuleList(
[
LongCatImageTransformerBlock(
dim=self.inner_dim,
num_attention_heads=num_attention_heads,
attention_head_dim=attention_head_dim,
)
for i in range(num_layers)
]
)
self.single_transformer_blocks = nn.ModuleList(
[
LongCatImageSingleTransformerBlock(
dim=self.inner_dim,
num_attention_heads=num_attention_heads,
attention_head_dim=attention_head_dim,
)
for i in range(num_single_layers)
]
)
self.norm_out = AdaLayerNormContinuous(self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6)
self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True)
self.gradient_checkpointing = False
self.use_checkpoint = [True] * num_layers
self.use_single_checkpoint = [True] * num_single_layers
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor = None,
timestep: torch.LongTensor = None,
img_ids: torch.Tensor = None,
txt_ids: torch.Tensor = None,
guidance: torch.Tensor = None,
return_dict: bool = True,
) -> Union[torch.FloatTensor, Transformer2DModelOutput]:
"""
The forward method.
Args:
hidden_states (`torch.FloatTensor` of shape `(batch size, channel, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate the denoising step.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
hidden_states = self.x_embedder(hidden_states)
timestep = timestep.to(hidden_states.dtype) * 1000
temb = self.time_embed(timestep, hidden_states.dtype)
encoder_hidden_states = self.context_embedder(encoder_hidden_states)
ids = torch.cat((txt_ids, img_ids), dim=0)
if is_torch_npu_available():
freqs_cos, freqs_sin = self.pos_embed(ids.cpu())
image_rotary_emb = (freqs_cos.npu(), freqs_sin.npu())
else:
image_rotary_emb = self.pos_embed(ids)
for index_block, block in enumerate(self.transformer_blocks):
if torch.is_grad_enabled() and self.gradient_checkpointing and self.use_checkpoint[index_block]:
encoder_hidden_states, hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
encoder_hidden_states,
temb,
image_rotary_emb,
)
else:
encoder_hidden_states, hidden_states = block(
hidden_states=hidden_states,
encoder_hidden_states=encoder_hidden_states,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
for index_block, block in enumerate(self.single_transformer_blocks):
if torch.is_grad_enabled() and self.gradient_checkpointing and self.use_single_checkpoint[index_block]:
encoder_hidden_states, hidden_states = self._gradient_checkpointing_func(
block,
hidden_states,
encoder_hidden_states,
temb,
image_rotary_emb,
)
else:
encoder_hidden_states, hidden_states = block(
hidden_states=hidden_states,
encoder_hidden_states=encoder_hidden_states,
temb=temb,
image_rotary_emb=image_rotary_emb,
)
hidden_states = self.norm_out(hidden_states, temb)
output = self.proj_out(hidden_states)
if not return_dict:
return (output,)
return Transformer2DModelOutput(sample=output)
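# Illustrative smoke-test sketch (hypothetical tiny config; the released checkpoint uses
# num_layers=19, num_single_layers=38, attention_head_dim=128, num_attention_heads=24):
#
#   model = LongCatImageTransformer2DModel(
#       patch_size=1, in_channels=64, num_layers=1, num_single_layers=1,
#       attention_head_dim=16, num_attention_heads=2, joint_attention_dim=32,
#       axes_dims_rope=[4, 6, 6],
#   )
#   out = model(
#       hidden_states=torch.randn(1, 64, 64),         # (batch, image_seq_len, in_channels)
#       encoder_hidden_states=torch.randn(1, 8, 32),   # (batch, text_seq_len, joint_attention_dim)
#       timestep=torch.tensor([0.5]),
#       img_ids=torch.zeros(64, 3),
#       txt_ids=torch.zeros(8, 3),
#   ).sample                                           # (1, 64, 64) = patch_size**2 * out_channels per token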


@@ -143,17 +143,26 @@ def apply_rotary_emb_qwen(
class QwenTimestepProjEmbeddings(nn.Module):
def __init__(self, embedding_dim):
def __init__(self, embedding_dim, use_additional_t_cond=False):
super().__init__()
self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0, scale=1000)
self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
self.use_additional_t_cond = use_additional_t_cond
if use_additional_t_cond:
self.addition_t_embedding = nn.Embedding(2, embedding_dim)
def forward(self, timestep, hidden_states):
def forward(self, timestep, hidden_states, addition_t_cond=None):
timesteps_proj = self.time_proj(timestep)
timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=hidden_states.dtype)) # (N, D)
conditioning = timesteps_emb
if self.use_additional_t_cond:
if addition_t_cond is None:
raise ValueError("When additional_t_cond is True, addition_t_cond must be provided.")
addition_t_emb = self.addition_t_embedding(addition_t_cond)
addition_t_emb = addition_t_emb.to(dtype=hidden_states.dtype)
conditioning = conditioning + addition_t_emb
return conditioning
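# Illustrative note (assumed usage): addition_t_cond is a per-sample integer (0 or 1) that indexes
# the nn.Embedding(2, embedding_dim) table; the looked-up vector is simply added to the timestep
# embedding, e.g.
#
#   embed = QwenTimestepProjEmbeddings(embedding_dim=3072, use_additional_t_cond=True)
#   dummy = torch.empty(1, 1, dtype=torch.float32)     # only its dtype is used
#   temb = embed(timestep=torch.tensor([500.0]), hidden_states=dummy,
#                addition_t_cond=torch.tensor([1]))     # (1, 3072)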
@@ -259,6 +268,120 @@ class QwenEmbedRope(nn.Module):
return freqs.clone().contiguous()
class QwenEmbedLayer3DRope(nn.Module):
def __init__(self, theta: int, axes_dim: List[int], scale_rope=False):
super().__init__()
self.theta = theta
self.axes_dim = axes_dim
pos_index = torch.arange(4096)
neg_index = torch.arange(4096).flip(0) * -1 - 1
self.pos_freqs = torch.cat(
[
self.rope_params(pos_index, self.axes_dim[0], self.theta),
self.rope_params(pos_index, self.axes_dim[1], self.theta),
self.rope_params(pos_index, self.axes_dim[2], self.theta),
],
dim=1,
)
self.neg_freqs = torch.cat(
[
self.rope_params(neg_index, self.axes_dim[0], self.theta),
self.rope_params(neg_index, self.axes_dim[1], self.theta),
self.rope_params(neg_index, self.axes_dim[2], self.theta),
],
dim=1,
)
self.scale_rope = scale_rope
def rope_params(self, index, dim, theta=10000):
"""
Args:
index: [0, 1, 2, 3] 1D Tensor representing the position index of the token
"""
assert dim % 2 == 0
freqs = torch.outer(index, 1.0 / torch.pow(theta, torch.arange(0, dim, 2).to(torch.float32).div(dim)))
freqs = torch.polar(torch.ones_like(freqs), freqs)
return freqs
def forward(self, video_fhw, txt_seq_lens, device):
"""
Args:
video_fhw: [frame, height, width], a list of 3-integer shapes, one per video/layer
txt_seq_lens: [bs], a list of integers giving the text sequence length per sample
"""
if self.pos_freqs.device != device:
self.pos_freqs = self.pos_freqs.to(device)
self.neg_freqs = self.neg_freqs.to(device)
if isinstance(video_fhw, list):
video_fhw = video_fhw[0]
if not isinstance(video_fhw, list):
video_fhw = [video_fhw]
vid_freqs = []
max_vid_index = 0
layer_num = len(video_fhw) - 1
for idx, fhw in enumerate(video_fhw):
frame, height, width = fhw
if idx != layer_num:
video_freq = self._compute_video_freqs(frame, height, width, idx)
else:
### For the condition image, we set the layer index to -1
video_freq = self._compute_condition_freqs(frame, height, width)
video_freq = video_freq.to(device)
vid_freqs.append(video_freq)
if self.scale_rope:
max_vid_index = max(height // 2, width // 2, max_vid_index)
else:
max_vid_index = max(height, width, max_vid_index)
max_vid_index = max(max_vid_index, layer_num)
max_len = max(txt_seq_lens)
txt_freqs = self.pos_freqs[max_vid_index : max_vid_index + max_len, ...]
vid_freqs = torch.cat(vid_freqs, dim=0)
return vid_freqs, txt_freqs
@functools.lru_cache(maxsize=None)
def _compute_video_freqs(self, frame, height, width, idx=0):
seq_lens = frame * height * width
freqs_pos = self.pos_freqs.split([x // 2 for x in self.axes_dim], dim=1)
freqs_neg = self.neg_freqs.split([x // 2 for x in self.axes_dim], dim=1)
freqs_frame = freqs_pos[0][idx : idx + frame].view(frame, 1, 1, -1).expand(frame, height, width, -1)
if self.scale_rope:
freqs_height = torch.cat([freqs_neg[1][-(height - height // 2) :], freqs_pos[1][: height // 2]], dim=0)
freqs_height = freqs_height.view(1, height, 1, -1).expand(frame, height, width, -1)
freqs_width = torch.cat([freqs_neg[2][-(width - width // 2) :], freqs_pos[2][: width // 2]], dim=0)
freqs_width = freqs_width.view(1, 1, width, -1).expand(frame, height, width, -1)
else:
freqs_height = freqs_pos[1][:height].view(1, height, 1, -1).expand(frame, height, width, -1)
freqs_width = freqs_pos[2][:width].view(1, 1, width, -1).expand(frame, height, width, -1)
freqs = torch.cat([freqs_frame, freqs_height, freqs_width], dim=-1).reshape(seq_lens, -1)
return freqs.clone().contiguous()
@functools.lru_cache(maxsize=None)
def _compute_condition_freqs(self, frame, height, width):
seq_lens = frame * height * width
freqs_pos = self.pos_freqs.split([x // 2 for x in self.axes_dim], dim=1)
freqs_neg = self.neg_freqs.split([x // 2 for x in self.axes_dim], dim=1)
freqs_frame = freqs_neg[0][-1:].view(frame, 1, 1, -1).expand(frame, height, width, -1)
if self.scale_rope:
freqs_height = torch.cat([freqs_neg[1][-(height - height // 2) :], freqs_pos[1][: height // 2]], dim=0)
freqs_height = freqs_height.view(1, height, 1, -1).expand(frame, height, width, -1)
freqs_width = torch.cat([freqs_neg[2][-(width - width // 2) :], freqs_pos[2][: width // 2]], dim=0)
freqs_width = freqs_width.view(1, 1, width, -1).expand(frame, height, width, -1)
else:
freqs_height = freqs_pos[1][:height].view(1, height, 1, -1).expand(frame, height, width, -1)
freqs_width = freqs_pos[2][:width].view(1, 1, width, -1).expand(frame, height, width, -1)
freqs = torch.cat([freqs_frame, freqs_height, freqs_width], dim=-1).reshape(seq_lens, -1)
return freqs.clone().contiguous()
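# Illustrative usage sketch (assumed shapes): every entry of video_fhw except the last gets a layer
# index taken from its list position, while the last entry is treated as the condition image and
# uses the fixed -1 frame frequency, e.g.
#
#   rope = QwenEmbedLayer3DRope(theta=10000, axes_dim=[16, 56, 56], scale_rope=True)
#   vid_freqs, txt_freqs = rope(
#       video_fhw=[[(1, 32, 32), (1, 32, 32), (1, 32, 32)]],   # two layers + one condition image
#       txt_seq_lens=[77],
#       device=torch.device("cpu"),
#   )
#   # vid_freqs: (3 * 32 * 32, 64) complex, txt_freqs: (77, 64) complex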
class QwenDoubleStreamAttnProcessor2_0:
"""
Attention processor for Qwen double-stream architecture, matching DoubleStreamLayerMegatron logic. This processor
@@ -578,14 +701,21 @@ class QwenImageTransformer2DModel(
guidance_embeds: bool = False, # TODO: this should probably be removed
axes_dims_rope: Tuple[int, int, int] = (16, 56, 56),
zero_cond_t: bool = False,
use_additional_t_cond: bool = False,
use_layer3d_rope: bool = False,
):
super().__init__()
self.out_channels = out_channels or in_channels
self.inner_dim = num_attention_heads * attention_head_dim
self.pos_embed = QwenEmbedRope(theta=10000, axes_dim=list(axes_dims_rope), scale_rope=True)
if not use_layer3d_rope:
self.pos_embed = QwenEmbedRope(theta=10000, axes_dim=list(axes_dims_rope), scale_rope=True)
else:
self.pos_embed = QwenEmbedLayer3DRope(theta=10000, axes_dim=list(axes_dims_rope), scale_rope=True)
self.time_text_embed = QwenTimestepProjEmbeddings(embedding_dim=self.inner_dim)
self.time_text_embed = QwenTimestepProjEmbeddings(
embedding_dim=self.inner_dim, use_additional_t_cond=use_additional_t_cond
)
self.txt_norm = RMSNorm(joint_attention_dim, eps=1e-6)
@@ -621,6 +751,7 @@ class QwenImageTransformer2DModel(
guidance: torch.Tensor = None, # TODO: this should probably be removed
attention_kwargs: Optional[Dict[str, Any]] = None,
controlnet_block_samples=None,
additional_t_cond=None,
return_dict: bool = True,
) -> Union[torch.Tensor, Transformer2DModelOutput]:
"""
@@ -683,9 +814,9 @@ class QwenImageTransformer2DModel(
guidance = guidance.to(hidden_states.dtype) * 1000
temb = (
self.time_text_embed(timestep, hidden_states)
self.time_text_embed(timestep, hidden_states, additional_t_cond)
if guidance is None
else self.time_text_embed(timestep, guidance, hidden_states)
else self.time_text_embed(timestep, guidance, hidden_states, additional_t_cond)
)
image_rotary_emb = self.pos_embed(img_shapes, txt_seq_lens, device=hidden_states.device)


@@ -13,7 +13,7 @@
# limitations under the License.
import math
from typing import List, Optional, Tuple
from typing import Dict, List, Optional, Tuple
import torch
import torch.nn as nn
@@ -536,6 +536,7 @@ class ZImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOr
x: List[torch.Tensor],
t,
cap_feats: List[torch.Tensor],
controlnet_block_samples: Optional[Dict[int, torch.Tensor]] = None,
patch_size=2,
f_patch_size=1,
return_dict: bool = True,
@@ -635,13 +636,19 @@ class ZImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOr
unified_attn_mask[i, :seq_len] = 1
if torch.is_grad_enabled() and self.gradient_checkpointing:
for layer in self.layers:
for layer_idx, layer in enumerate(self.layers):
unified = self._gradient_checkpointing_func(
layer, unified, unified_attn_mask, unified_freqs_cis, adaln_input
)
if controlnet_block_samples is not None:
if layer_idx in controlnet_block_samples:
unified = unified + controlnet_block_samples[layer_idx]
else:
for layer in self.layers:
for layer_idx, layer in enumerate(self.layers):
unified = layer(unified, unified_attn_mask, unified_freqs_cis, adaln_input)
if controlnet_block_samples is not None:
if layer_idx in controlnet_block_samples:
unified = unified + controlnet_block_samples[layer_idx]
unified = self.all_final_layer[f"{patch_size}-{f_patch_size}"](unified, adaln_input)
unified = list(unified.unbind(dim=0))


@@ -291,6 +291,7 @@ else:
_import_structure["lumina"] = ["LuminaPipeline", "LuminaText2ImgPipeline"]
_import_structure["lumina2"] = ["Lumina2Pipeline", "Lumina2Text2ImgPipeline"]
_import_structure["lucy"] = ["LucyEditPipeline"]
_import_structure["longcat_image"] = ["LongCatImagePipeline", "LongCatImageEditPipeline"]
_import_structure["marigold"].extend(
[
"MarigoldDepthPipeline",
@@ -404,7 +405,12 @@ else:
"Kandinsky5T2IPipeline",
"Kandinsky5I2IPipeline",
]
_import_structure["z_image"] = ["ZImageImg2ImgPipeline", "ZImagePipeline"]
_import_structure["z_image"] = [
"ZImageImg2ImgPipeline",
"ZImagePipeline",
"ZImageControlNetPipeline",
"ZImageControlNetInpaintPipeline",
]
_import_structure["skyreels_v2"] = [
"SkyReelsV2DiffusionForcingPipeline",
"SkyReelsV2DiffusionForcingImageToVideoPipeline",
@@ -421,6 +427,7 @@ else:
"QwenImageEditInpaintPipeline",
"QwenImageControlNetInpaintPipeline",
"QwenImageControlNetPipeline",
"QwenImageLayeredPipeline",
]
_import_structure["chronoedit"] = ["ChronoEditPipeline"]
try:
@@ -718,6 +725,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
LEditsPPPipelineStableDiffusion,
LEditsPPPipelineStableDiffusionXL,
)
from .longcat_image import LongCatImageEditPipeline, LongCatImagePipeline
from .ltx import LTXConditionPipeline, LTXImageToVideoPipeline, LTXLatentUpsamplePipeline, LTXPipeline
from .lucy import LucyEditPipeline
from .lumina import LuminaPipeline, LuminaText2ImgPipeline
@@ -762,6 +770,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
QwenImageEditPlusPipeline,
QwenImageImg2ImgPipeline,
QwenImageInpaintPipeline,
QwenImageLayeredPipeline,
QwenImagePipeline,
)
from .sana import (
@@ -841,7 +850,12 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
WuerstchenDecoderPipeline,
WuerstchenPriorPipeline,
)
from .z_image import ZImageImg2ImgPipeline, ZImagePipeline
from .z_image import (
ZImageControlNetInpaintPipeline,
ZImageControlNetPipeline,
ZImageImg2ImgPipeline,
ZImagePipeline,
)
try:
if not is_onnx_available():


@@ -0,0 +1,51 @@
from typing import TYPE_CHECKING
from ...utils import (
DIFFUSERS_SLOW_IMPORT,
OptionalDependencyNotAvailable,
_LazyModule,
get_objects_from_module,
is_torch_available,
is_transformers_available,
)
_dummy_objects = {}
_import_structure = {}
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils import dummy_torch_and_transformers_objects # noqa: F403
_dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
_import_structure["pipeline_longcat_image"] = ["LongCatImagePipeline"]
_import_structure["pipeline_longcat_image_edit"] = ["LongCatImageEditPipeline"]
_import_structure["pipeline_output"] = ["LongCatImagePipelineOutput"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import *
else:
from .pipeline_longcat_image import LongCatImagePipeline
from .pipeline_longcat_image_edit import LongCatImageEditPipeline
from .pipeline_output import LongCatImagePipelineOutput
else:
import sys
sys.modules[__name__] = _LazyModule(
__name__,
globals()["__file__"],
_import_structure,
module_spec=__spec__,
)
for name, value in _dummy_objects.items():
setattr(sys.modules[__name__], name, value)


@@ -0,0 +1,666 @@
# Copyright 2025 MeiTuan LongCat-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import re
from typing import Any, Dict, List, Optional, Union
import numpy as np
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2Tokenizer, Qwen2VLProcessor
from ...image_processor import VaeImageProcessor
from ...loaders import FromSingleFileMixin
from ...models.autoencoders import AutoencoderKL
from ...models.transformers import LongCatImageTransformer2DModel
from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import is_torch_xla_available, logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from .pipeline_output import LongCatImagePipelineOutput
from .system_messages import SYSTEM_PROMPT_EN, SYSTEM_PROMPT_ZH
if is_torch_xla_available():
import torch_xla.core.xla_model as xm
XLA_AVAILABLE = True
else:
XLA_AVAILABLE = False
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from diffusers import LongCatImagePipeline
>>> pipe = LongCatImagePipeline.from_pretrained("meituan-longcat/LongCat-Image", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。"
>>> image = pipe(
... prompt,
... height=768,
... width=1344,
... num_inference_steps=50,
... guidance_scale=4.5,
... generator=torch.Generator("cpu").manual_seed(43),
... enable_cfg_renorm=True,
... ).images[0]
>>> image.save("longcat_image.png")
```
"""
def get_prompt_language(prompt):
pattern = re.compile(r"[\u4e00-\u9fff]")
if bool(pattern.search(prompt)):
return "zh"
return "en"
def split_quotation(prompt, quote_pairs=None):
"""
Implement a regex-based string splitting algorithm that identifies delimiters defined by single or double quote
pairs.

Examples::

>>> prompt_en = "Please write 'Hello' on the blackboard for me."
>>> print(split_quotation(prompt_en))
[('Please write ', False), ("'Hello'", True), (' on the blackboard for me.', False)]
"""
word_internal_quote_pattern = re.compile(r"[a-zA-Z]+'[a-zA-Z]+")
matches_word_internal_quote_pattern = word_internal_quote_pattern.findall(prompt)
mapping_word_internal_quote = []
for i, word_src in enumerate(set(matches_word_internal_quote_pattern)):
word_tgt = "longcat_$##$_longcat" * (i + 1)
prompt = prompt.replace(word_src, word_tgt)
mapping_word_internal_quote.append([word_src, word_tgt])
if quote_pairs is None:
quote_pairs = [("'", "'"), ('"', '"'), ("", ""), ("", "")]
pattern = "|".join([re.escape(q1) + r"[^" + re.escape(q1 + q2) + r"]*?" + re.escape(q2) for q1, q2 in quote_pairs])
parts = re.split(f"({pattern})", prompt)
result = []
for part in parts:
for word_src, word_tgt in mapping_word_internal_quote:
part = part.replace(word_tgt, word_src)
if re.match(pattern, part):
if len(part):
result.append((part, True))
else:
if len(part):
result.append((part, False))
return result
def prepare_pos_ids(modality_id=0, type="text", start=(0, 0), num_token=None, height=None, width=None):
if type == "text":
assert num_token
if height or width:
print('Warning: The parameters of height and width will be ignored in "text" type.')
pos_ids = torch.zeros(num_token, 3)
pos_ids[..., 0] = modality_id
pos_ids[..., 1] = torch.arange(num_token) + start[0]
pos_ids[..., 2] = torch.arange(num_token) + start[1]
elif type == "image":
assert height and width
if num_token:
print('Warning: The parameter of num_token will be ignored in "image" type.')
pos_ids = torch.zeros(height, width, 3)
pos_ids[..., 0] = modality_id
pos_ids[..., 1] = pos_ids[..., 1] + torch.arange(height)[:, None] + start[0]
pos_ids[..., 2] = pos_ids[..., 2] + torch.arange(width)[None, :] + start[1]
pos_ids = pos_ids.reshape(height * width, 3)
else:
raise KeyError(f'Unknown type {type}, only "text" or "image" are supported.')
return pos_ids
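# Illustrative shapes (not part of the original file): text positions advance along a diagonal
# starting at `start`, while image positions form a (height, width) grid offset by `start`:
#
#   prepare_pos_ids(modality_id=0, type="text", num_token=512).shape                            # (512, 3)
#   prepare_pos_ids(modality_id=1, type="image", start=(512, 512), height=48, width=84).shape   # (4032, 3)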
def calculate_shift(
image_seq_len,
base_seq_len: int = 256,
max_seq_len: int = 4096,
base_shift: float = 0.5,
max_shift: float = 1.15,
):
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
b = base_shift - m * base_seq_len
mu = image_seq_len * m + b
return mu
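# Worked example (illustrative): with the defaults, mu interpolates linearly between base_shift at
# 256 image tokens and max_shift at 4096 image tokens:
#
#   calculate_shift(256)                          # -> 0.5
#   calculate_shift(4096)                         # -> 1.15
#   calculate_shift((768 // 16) * (1344 // 16))   # ~1.14 for the 768x1344 docstring example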
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: Optional[int] = None,
device: Optional[Union[str, torch.device]] = None,
timesteps: Optional[List[int]] = None,
sigmas: Optional[List[float]] = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`List[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`List[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
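# Illustrative usage (a sketch, not part of the original file): either let the scheduler build its
# own schedule or pass custom sigmas; `timesteps` and `sigmas` must not be passed together:
#
#   scheduler = FlowMatchEulerDiscreteScheduler()
#   timesteps, n = retrieve_timesteps(scheduler, num_inference_steps=50, device="cpu")
#   timesteps, n = retrieve_timesteps(scheduler, device="cpu", sigmas=np.linspace(1.0, 0.02, 50))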
class LongCatImagePipeline(DiffusionPipeline, FromSingleFileMixin):
r"""
The LongCat-Image pipeline for text-to-image generation.
"""
model_cpu_offload_seq = "text_encoder->transformer->vae"
_optional_components = []
_callback_tensor_inputs = ["latents", "prompt_embeds"]
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKL,
text_encoder: Qwen2_5_VLForConditionalGeneration,
tokenizer: Qwen2Tokenizer,
text_processor: Qwen2VLProcessor,
transformer: LongCatImageTransformer2DModel,
):
super().__init__()
self.register_modules(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
transformer=transformer,
scheduler=scheduler,
text_processor=text_processor,
)
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
self.prompt_template_encode_prefix = "<|im_start|>system\nAs an image captioning expert, generate a descriptive text prompt based on an image content, suitable for input to a text-to-image model.<|im_end|>\n<|im_start|>user\n"
self.prompt_template_encode_suffix = "<|im_end|>\n<|im_start|>assistant\n"
self.default_sample_size = 128
self.tokenizer_max_length = 512
def rewire_prompt(self, prompt, device):
prompt = [prompt] if isinstance(prompt, str) else prompt
all_text = []
for each_prompt in prompt:
language = get_prompt_language(each_prompt)
if language == "zh":
question = SYSTEM_PROMPT_ZH + f"\n用户输入为:{each_prompt}\n改写后的prompt为"
else:
question = SYSTEM_PROMPT_EN + f"\nUser Input: {each_prompt}\nRewritten prompt:"
message = [
{
"role": "user",
"content": [
{"type": "text", "text": question},
],
}
]
# Preparation for inference
text = self.text_processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
all_text.append(text)
inputs = self.text_processor(text=all_text, padding=True, return_tensors="pt").to(device)
self.text_encoder.to(device)
generated_ids = self.text_encoder.generate(**inputs, max_new_tokens=self.tokenizer_max_length)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = self.text_processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
rewrite_prompt = output_text
return rewrite_prompt
def _encode_prompt(self, prompt: List[str]):
batch_all_tokens = []
for each_prompt in prompt:
all_tokens = []
for clean_prompt_sub, matched in split_quotation(each_prompt):
if matched:
for sub_word in clean_prompt_sub:
tokens = self.tokenizer(sub_word, add_special_tokens=False)["input_ids"]
all_tokens.extend(tokens)
else:
tokens = self.tokenizer(clean_prompt_sub, add_special_tokens=False)["input_ids"]
all_tokens.extend(tokens)
if len(all_tokens) > self.tokenizer_max_length:
logger.warning(
"Your input was truncated because `max_sequence_length` is set to "
f" {self.tokenizer_max_length} input token nums : {len(all_tokens)}"
)
all_tokens = all_tokens[: self.tokenizer_max_length]
batch_all_tokens.append(all_tokens)
text_tokens_and_mask = self.tokenizer.pad(
{"input_ids": batch_all_tokens},
max_length=self.tokenizer_max_length,
padding="max_length",
return_attention_mask=True,
return_tensors="pt",
)
prefix_tokens = self.tokenizer(self.prompt_template_encode_prefix, add_special_tokens=False)["input_ids"]
suffix_tokens = self.tokenizer(self.prompt_template_encode_suffix, add_special_tokens=False)["input_ids"]
prefix_len = len(prefix_tokens)
suffix_len = len(suffix_tokens)
prefix_tokens_mask = torch.tensor([1] * len(prefix_tokens), dtype=text_tokens_and_mask.attention_mask[0].dtype)
suffix_tokens_mask = torch.tensor([1] * len(suffix_tokens), dtype=text_tokens_and_mask.attention_mask[0].dtype)
prefix_tokens = torch.tensor(prefix_tokens, dtype=text_tokens_and_mask.input_ids.dtype)
suffix_tokens = torch.tensor(suffix_tokens, dtype=text_tokens_and_mask.input_ids.dtype)
batch_size = text_tokens_and_mask.input_ids.size(0)
prefix_tokens_batch = prefix_tokens.unsqueeze(0).expand(batch_size, -1)
suffix_tokens_batch = suffix_tokens.unsqueeze(0).expand(batch_size, -1)
prefix_mask_batch = prefix_tokens_mask.unsqueeze(0).expand(batch_size, -1)
suffix_mask_batch = suffix_tokens_mask.unsqueeze(0).expand(batch_size, -1)
input_ids = torch.cat((prefix_tokens_batch, text_tokens_and_mask.input_ids, suffix_tokens_batch), dim=-1)
attention_mask = torch.cat((prefix_mask_batch, text_tokens_and_mask.attention_mask, suffix_mask_batch), dim=-1)
input_ids = input_ids.to(self.device)
attention_mask = attention_mask.to(self.device)
text_output = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
# the last hidden state has shape [batch, sequence_length, hidden_size]; detach it from the graph
# and strip the template prefix/suffix tokens below
prompt_embeds = text_output.hidden_states[-1].detach()
prompt_embeds = prompt_embeds[:, prefix_len:-suffix_len, :]
return prompt_embeds
def encode_prompt(
self,
prompt: Union[str, List[str]] = None,
num_images_per_prompt: Optional[int] = 1,
prompt_embeds: Optional[torch.Tensor] = None,
):
prompt = [prompt] if isinstance(prompt, str) else prompt
batch_size = len(prompt)
# If prompt_embeds is provided and prompt is None, skip encoding
if prompt_embeds is None:
prompt_embeds = self._encode_prompt(prompt)
_, seq_len, _ = prompt_embeds.shape
# duplicate text embeddings for each generation per prompt, using an mps friendly method
prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
text_ids = prepare_pos_ids(modality_id=0, type="text", start=(0, 0), num_token=prompt_embeds.shape[1]).to(
self.device
)
return prompt_embeds.to(self.device), text_ids
@staticmethod
def _pack_latents(latents, batch_size, num_channels_latents, height, width):
latents = latents.view(batch_size, num_channels_latents, height // 2, 2, width // 2, 2)
latents = latents.permute(0, 2, 4, 1, 3, 5)
latents = latents.reshape(batch_size, (height // 2) * (width // 2), num_channels_latents * 4)
return latents
@staticmethod
def _unpack_latents(latents, height, width, vae_scale_factor):
batch_size, num_patches, channels = latents.shape
# VAE applies 8x compression on images but we must also account for packing which requires
# latent height and width to be divisible by 2.
height = 2 * (int(height) // (vae_scale_factor * 2))
width = 2 * (int(width) // (vae_scale_factor * 2))
latents = latents.view(batch_size, height // 2, width // 2, channels // 4, 2, 2)
latents = latents.permute(0, 3, 1, 4, 2, 5)
latents = latents.reshape(batch_size, channels // (2 * 2), height, width)
return latents
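# Illustrative shape round-trip (values assumed): a 768x1344 image gives a (1, 16, 96, 168) latent,
# which packs into 2x2 patches of shape (1, 48 * 84, 64) and unpacks back to the original layout:
#
#   packed = LongCatImagePipeline._pack_latents(torch.randn(1, 16, 96, 168), 1, 16, 96, 168)
#   packed.shape                                                                       # (1, 4032, 64)
#   LongCatImagePipeline._unpack_latents(packed, 768, 1344, vae_scale_factor=8).shape  # (1, 16, 96, 168)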
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1
def prepare_latents(
self,
batch_size,
num_channels_latents,
height,
width,
dtype,
device,
generator,
latents=None,
):
# VAE applies 8x compression on images but we must also account for packing which requires
# latent height and width to be divisible by 2.
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
shape = (batch_size, num_channels_latents, height, width)
latent_image_ids = prepare_pos_ids(
modality_id=1,
type="image",
start=(self.tokenizer_max_length, self.tokenizer_max_length),
height=height // 2,
width=width // 2,
).to(device)
if latents is not None:
return latents.to(device=device, dtype=dtype), latent_image_ids
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
latents = randn_tensor(shape, generator=generator, device=device)
latents = latents.to(dtype=dtype)
latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)
return latents, latent_image_ids
@property
def guidance_scale(self):
return self._guidance_scale
@property
def joint_attention_kwargs(self):
return self._joint_attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@property
def current_timestep(self):
return self._current_timestep
@property
def interrupt(self):
return self._interrupt
def check_inputs(
self, prompt, height, width, negative_prompt=None, prompt_embeds=None, negative_prompt_embeds=None
):
if height % (self.vae_scale_factor * 2) != 0 or width % (self.vae_scale_factor * 2) != 0:
logger.warning(
f"`height` and `width` have to be divisible by {self.vae_scale_factor * 2} but are {height} and {width}. Dimensions will be resized accordingly"
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
@replace_example_docstring(EXAMPLE_DOC_STRING)
@torch.no_grad()
def __call__(
self,
prompt: Union[str, List[str]] = None,
negative_prompt: Union[str, List[str]] = None,
height: Optional[int] = None,
width: Optional[int] = None,
num_inference_steps: int = 50,
sigmas: Optional[List[float]] = None,
guidance_scale: float = 4.5,
num_images_per_prompt: Optional[int] = 1,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[torch.FloatTensor] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
enable_cfg_renorm: Optional[bool] = True,
cfg_renorm_min: Optional[float] = 0.0,
enable_prompt_rewrite: Optional[bool] = True,
):
r"""
Function invoked when calling the pipeline for generation.
Args:
enable_cfg_renorm: Whether to enable CFG renormalization. Enabling it can improve image quality,
but may reduce the stability of some outputs.
cfg_renorm_min: The minimum of the cfg_renorm scale range (0-1). With cfg_renorm_min=1.0 the
renorm has no effect; with cfg_renorm_min=0.0 the renorm range is largest.
enable_prompt_rewrite: Whether to rewrite the prompt with the text encoder before generation.
Examples:
Returns:
[`~pipelines.LongCatImagePipelineOutput`] or `tuple`: [`~pipelines.LongCatImagePipelineOutput`] if
`return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the
generated images.
"""
height = height or self.default_sample_size * self.vae_scale_factor
width = width or self.default_sample_size * self.vae_scale_factor
# 1. Check inputs. Raise error if not correct
self.check_inputs(
prompt,
height,
width,
negative_prompt=negative_prompt,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
)
self._guidance_scale = guidance_scale
self._joint_attention_kwargs = joint_attention_kwargs
self._current_timestep = None
self._interrupt = False
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
device = self._execution_device
if enable_prompt_rewrite:
prompt = self.rewire_prompt(prompt, device)
logger.info(f"Rewrite prompt {prompt}!")
negative_prompt = "" if negative_prompt is None else negative_prompt
(prompt_embeds, text_ids) = self.encode_prompt(
prompt=prompt, prompt_embeds=prompt_embeds, num_images_per_prompt=num_images_per_prompt
)
if self.do_classifier_free_guidance:
(negative_prompt_embeds, negative_text_ids) = self.encode_prompt(
prompt=negative_prompt,
prompt_embeds=negative_prompt_embeds,
num_images_per_prompt=num_images_per_prompt,
)
# 4. Prepare latent variables
num_channels_latents = 16
latents, latent_image_ids = self.prepare_latents(
batch_size * num_images_per_prompt,
num_channels_latents,
height,
width,
prompt_embeds.dtype,
device,
generator,
latents,
)
# 5. Prepare timesteps
sigmas = np.linspace(1.0, 1.0 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
image_seq_len = latents.shape[1]
mu = calculate_shift(
image_seq_len,
self.scheduler.config.get("base_image_seq_len", 256),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.5),
self.scheduler.config.get("max_shift", 1.15),
)
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
mu=mu,
)
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
# handle guidance
guidance = None
if self.joint_attention_kwargs is None:
self._joint_attention_kwargs = {}
# 6. Denoising loop
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
if self.interrupt:
continue
self._current_timestep = t
timestep = t.expand(latents.shape[0]).to(latents.dtype)
with self.transformer.cache_context("cond"):
noise_pred_text = self.transformer(
hidden_states=latents,
timestep=timestep / 1000,
guidance=guidance,
encoder_hidden_states=prompt_embeds,
txt_ids=text_ids,
img_ids=latent_image_ids,
return_dict=False,
)[0]
if self.do_classifier_free_guidance:
with self.transformer.cache_context("uncond"):
noise_pred_uncond = self.transformer(
hidden_states=latents,
timestep=timestep / 1000,
encoder_hidden_states=negative_prompt_embeds,
txt_ids=negative_text_ids,
img_ids=latent_image_ids,
return_dict=False,
)[0]
noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
if enable_cfg_renorm:
cond_norm = torch.norm(noise_pred_text, dim=-1, keepdim=True)
noise_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
scale = (cond_norm / (noise_norm + 1e-8)).clamp(min=cfg_renorm_min, max=1.0)
noise_pred = noise_pred * scale
else:
noise_pred = noise_pred_text
# compute the previous noisy sample x_t -> x_t-1
latents_dtype = latents.dtype
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
if latents.dtype != latents_dtype:
if torch.backends.mps.is_available():
# some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
latents = latents.to(latents_dtype)
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if XLA_AVAILABLE:
xm.mark_step()
self._current_timestep = None
if output_type == "latent":
image = latents
else:
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
if latents.dtype != self.vae.dtype:
latents = latents.to(dtype=self.vae.dtype)
image = self.vae.decode(latents, return_dict=False)[0]
image = self.image_processor.postprocess(image, output_type=output_type)
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (image,)
return LongCatImagePipelineOutput(images=image)


@@ -0,0 +1,727 @@
# Copyright 2025 MeiTuan LongCat-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import math
import re
from typing import Any, Dict, List, Optional, Union
import numpy as np
import PIL
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2Tokenizer, Qwen2VLProcessor
from ...image_processor import VaeImageProcessor
from ...loaders import FromSingleFileMixin
from ...models.autoencoders import AutoencoderKL
from ...models.transformers import LongCatImageTransformer2DModel
from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import is_torch_xla_available, logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from .pipeline_output import LongCatImagePipelineOutput
if is_torch_xla_available():
import torch_xla.core.xla_model as xm
XLA_AVAILABLE = True
else:
XLA_AVAILABLE = False
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> from PIL import Image
>>> import torch
>>> from diffusers import LongCatImageEditPipeline
>>> pipe = LongCatImageEditPipeline.from_pretrained(
... "meituan-longcat/LongCat-Image-Edit", torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> prompt = "change the cat to dog."
>>> input_image = Image.open("test.jpg").convert("RGB")
>>> image = pipe(
... input_image,
... prompt,
... num_inference_steps=50,
... guidance_scale=4.5,
... generator=torch.Generator("cpu").manual_seed(43),
... ).images[0]
>>> image.save("longcat_image_edit.png")
```
"""
# Copied from diffusers.pipelines.longcat_image.pipeline_longcat_image.split_quotation
def split_quotation(prompt, quote_pairs=None):
"""
Implement a regex-based string splitting algorithm that identifies delimiters defined by single or double quote
pairs.

Examples::

>>> prompt_en = "Please write 'Hello' on the blackboard for me."
>>> print(split_quotation(prompt_en))
[('Please write ', False), ("'Hello'", True), (' on the blackboard for me.', False)]
"""
word_internal_quote_pattern = re.compile(r"[a-zA-Z]+'[a-zA-Z]+")
matches_word_internal_quote_pattern = word_internal_quote_pattern.findall(prompt)
mapping_word_internal_quote = []
for i, word_src in enumerate(set(matches_word_internal_quote_pattern)):
word_tgt = "longcat_$##$_longcat" * (i + 1)
prompt = prompt.replace(word_src, word_tgt)
mapping_word_internal_quote.append([word_src, word_tgt])
if quote_pairs is None:
quote_pairs = [("'", "'"), ('"', '"'), ("", ""), ("", "")]
pattern = "|".join([re.escape(q1) + r"[^" + re.escape(q1 + q2) + r"]*?" + re.escape(q2) for q1, q2 in quote_pairs])
parts = re.split(f"({pattern})", prompt)
result = []
for part in parts:
for word_src, word_tgt in mapping_word_internal_quote:
part = part.replace(word_tgt, word_src)
if re.match(pattern, part):
if len(part):
result.append((part, True))
else:
if len(part):
result.append((part, False))
return result
# Copied from diffusers.pipelines.longcat_image.pipeline_longcat_image.prepare_pos_ids
def prepare_pos_ids(modality_id=0, type="text", start=(0, 0), num_token=None, height=None, width=None):
if type == "text":
assert num_token
if height or width:
print('Warning: The parameters of height and width will be ignored in "text" type.')
pos_ids = torch.zeros(num_token, 3)
pos_ids[..., 0] = modality_id
pos_ids[..., 1] = torch.arange(num_token) + start[0]
pos_ids[..., 2] = torch.arange(num_token) + start[1]
elif type == "image":
assert height and width
if num_token:
print('Warning: The parameter of num_token will be ignored in "image" type.')
pos_ids = torch.zeros(height, width, 3)
pos_ids[..., 0] = modality_id
pos_ids[..., 1] = pos_ids[..., 1] + torch.arange(height)[:, None] + start[0]
pos_ids[..., 2] = pos_ids[..., 2] + torch.arange(width)[None, :] + start[1]
pos_ids = pos_ids.reshape(height * width, 3)
else:
raise KeyError(f'Unknown type {type}, only "text" or "image" are supported.')
return pos_ids
# Copied from diffusers.pipelines.longcat_image.pipeline_longcat_image.calculate_shift
def calculate_shift(
image_seq_len,
base_seq_len: int = 256,
max_seq_len: int = 4096,
base_shift: float = 0.5,
max_shift: float = 1.15,
):
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
b = base_shift - m * base_seq_len
mu = image_seq_len * m + b
return mu
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: Optional[int] = None,
device: Optional[Union[str, torch.device]] = None,
timesteps: Optional[List[int]] = None,
sigmas: Optional[List[float]] = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`List[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`List[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
):
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
return encoder_output.latent_dist.sample(generator)
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
return encoder_output.latent_dist.mode()
elif hasattr(encoder_output, "latents"):
return encoder_output.latents
else:
raise AttributeError("Could not access latents of provided encoder_output")
def calculate_dimensions(target_area, ratio):
width = math.sqrt(target_area * ratio)
height = width / ratio
width = width if width % 16 == 0 else (width // 16 + 1) * 16
height = height if height % 16 == 0 else (height // 16 + 1) * 16
width = int(width)
height = int(height)
return width, height
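# Worked example (illustrative): both sides are rounded up to the nearest multiple of 16,
# e.g. a 1024x1024 target area at a 16:9 aspect ratio:
#
#   calculate_dimensions(1024 * 1024, 16 / 9)   # -> (1376, 768)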
class LongCatImageEditPipeline(DiffusionPipeline, FromSingleFileMixin):
r"""
The LongCat-Image-Edit pipeline for image editing.
"""
model_cpu_offload_seq = "text_encoder->image_encoder->transformer->vae"
_optional_components = []
_callback_tensor_inputs = ["latents", "prompt_embeds"]
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKL,
text_encoder: Qwen2_5_VLForConditionalGeneration,
tokenizer: Qwen2Tokenizer,
text_processor: Qwen2VLProcessor,
transformer: LongCatImageTransformer2DModel,
):
super().__init__()
self.register_modules(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
transformer=transformer,
scheduler=scheduler,
text_processor=text_processor,
)
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
self.image_processor_vl = text_processor.image_processor
self.image_token = "<|image_pad|>"
self.prompt_template_encode_prefix = "<|im_start|>system\nAs an image editing expert, first analyze the content and attributes of the input image(s). Then, based on the user's editing instructions, clearly and precisely determine how to modify the given image(s), ensuring that only the specified parts are altered and all other aspects remain consistent with the original(s).<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
self.prompt_template_encode_suffix = "<|im_end|>\n<|im_start|>assistant\n"
self.default_sample_size = 128
self.tokenizer_max_length = 512
def _encode_prompt(self, prompt, image):
raw_vl_input = self.image_processor_vl(images=image, return_tensors="pt")
pixel_values = raw_vl_input["pixel_values"]
image_grid_thw = raw_vl_input["image_grid_thw"]
all_tokens = []
for clean_prompt_sub, matched in split_quotation(prompt[0]):
if matched:
for sub_word in clean_prompt_sub:
tokens = self.tokenizer(sub_word, add_special_tokens=False)["input_ids"]
all_tokens.extend(tokens)
else:
tokens = self.tokenizer(clean_prompt_sub, add_special_tokens=False)["input_ids"]
all_tokens.extend(tokens)
if len(all_tokens) > self.tokenizer_max_length:
logger.warning(
"Your input was truncated because `max_sequence_length` is set to "
f" {self.tokenizer_max_length} input token nums : {len(len(all_tokens))}"
)
all_tokens = all_tokens[: self.tokenizer_max_length]
text_tokens_and_mask = self.tokenizer.pad(
{"input_ids": [all_tokens]},
max_length=self.tokenizer_max_length,
padding="max_length",
return_attention_mask=True,
return_tensors="pt",
)
text = self.prompt_template_encode_prefix
merge_length = self.image_processor_vl.merge_size**2
while self.image_token in text:
num_image_tokens = image_grid_thw.prod() // merge_length
text = text.replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
text = text.replace("<|placeholder|>", self.image_token)
prefix_tokens = self.tokenizer(text, add_special_tokens=False)["input_ids"]
suffix_tokens = self.tokenizer(self.prompt_template_encode_suffix, add_special_tokens=False)["input_ids"]
vision_start_token_id = self.tokenizer.convert_tokens_to_ids("<|vision_start|>")
prefix_len = prefix_tokens.index(vision_start_token_id)
suffix_len = len(suffix_tokens)
prefix_tokens_mask = torch.tensor([1] * len(prefix_tokens), dtype=text_tokens_and_mask.attention_mask[0].dtype)
suffix_tokens_mask = torch.tensor([1] * len(suffix_tokens), dtype=text_tokens_and_mask.attention_mask[0].dtype)
prefix_tokens = torch.tensor(prefix_tokens, dtype=text_tokens_and_mask.input_ids.dtype)
suffix_tokens = torch.tensor(suffix_tokens, dtype=text_tokens_and_mask.input_ids.dtype)
input_ids = torch.cat((prefix_tokens, text_tokens_and_mask.input_ids[0], suffix_tokens), dim=-1)
attention_mask = torch.cat(
(prefix_tokens_mask, text_tokens_and_mask.attention_mask[0], suffix_tokens_mask), dim=-1
)
input_ids = input_ids.unsqueeze(0).to(self.device)
attention_mask = attention_mask.unsqueeze(0).to(self.device)
pixel_values = pixel_values.to(self.device)
image_grid_thw = image_grid_thw.to(self.device)
text_output = self.text_encoder(
input_ids=input_ids,
attention_mask=attention_mask,
pixel_values=pixel_values,
image_grid_thw=image_grid_thw,
output_hidden_states=True,
)
        # take the last hidden state and drop the chat-template prefix and suffix tokens, keeping only
        # the embeddings that correspond to the user prompt and the image placeholders
prompt_embeds = text_output.hidden_states[-1].detach()
prompt_embeds = prompt_embeds[:, prefix_len:-suffix_len, :]
return prompt_embeds
def encode_prompt(
self,
prompt: List[str] = None,
image: Optional[torch.Tensor] = None,
num_images_per_prompt: Optional[int] = 1,
prompt_embeds: Optional[torch.Tensor] = None,
):
prompt = [prompt] if isinstance(prompt, str) else prompt
batch_size = len(prompt)
# If prompt_embeds is provided and prompt is None, skip encoding
if prompt_embeds is None:
prompt_embeds = self._encode_prompt(prompt, image)
_, seq_len, _ = prompt_embeds.shape
# duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
text_ids = prepare_pos_ids(modality_id=0, type="text", start=(0, 0), num_token=prompt_embeds.shape[1]).to(
self.device
)
return prompt_embeds, text_ids
@staticmethod
def _pack_latents(latents, batch_size, num_channels_latents, height, width):
latents = latents.view(batch_size, num_channels_latents, height // 2, 2, width // 2, 2)
latents = latents.permute(0, 2, 4, 1, 3, 5)
latents = latents.reshape(batch_size, (height // 2) * (width // 2), num_channels_latents * 4)
return latents
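    # Shape sketch (illustrative): with batch_size=1, num_channels_latents=16 and 128x128 latents,
    # (1, 16, 128, 128) -> view (1, 16, 64, 2, 64, 2) -> permute (1, 64, 64, 16, 2, 2)
    # -> reshape (1, 4096, 64), i.e. one token per 2x2 latent patch with 16 * 4 = 64 features.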
@staticmethod
def _unpack_latents(latents, height, width, vae_scale_factor):
batch_size, num_patches, channels = latents.shape
# VAE applies 8x compression on images but we must also account for packing which requires
# latent height and width to be divisible by 2.
height = 2 * (int(height) // (vae_scale_factor * 2))
width = 2 * (int(width) // (vae_scale_factor * 2))
latents = latents.view(batch_size, height // 2, width // 2, channels // 4, 2, 2)
latents = latents.permute(0, 3, 1, 4, 2, 5)
latents = latents.reshape(batch_size, channels // (2 * 2), height, width)
return latents
def _encode_vae_image(self, image: torch.Tensor, generator: torch.Generator):
if isinstance(generator, list):
image_latents = [
retrieve_latents(self.vae.encode(image[i : i + 1]), generator=generator[i], sample_mode="argmax")
for i in range(image.shape[0])
]
image_latents = torch.cat(image_latents, dim=0)
else:
image_latents = retrieve_latents(self.vae.encode(image), generator=generator, sample_mode="argmax")
image_latents = (image_latents - self.vae.config.shift_factor) * self.vae.config.scaling_factor
return image_latents
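    # Note: the shift/scale normalization applied here is undone at decode time in `__call__`
    # (latents / scaling_factor + shift_factor) before `self.vae.decode` is called.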
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1
def prepare_latents(
self,
image,
batch_size,
num_channels_latents,
height,
width,
dtype,
prompt_embeds_length,
device,
generator,
latents=None,
):
# VAE applies 8x compression on images but we must also account for packing which requires
# latent height and width to be divisible by 2.
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
image_latents, image_latents_ids = None, None
if image is not None:
image = image.to(device=self.device, dtype=dtype)
if image.shape[1] != self.vae.config.latent_channels:
image_latents = self._encode_vae_image(image=image, generator=generator)
else:
image_latents = image
if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:
additional_image_per_prompt = batch_size // image_latents.shape[0]
image_latents = torch.cat([image_latents] * additional_image_per_prompt, dim=0)
elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:
raise ValueError(
f"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts."
)
else:
image_latents = torch.cat([image_latents], dim=0)
image_latents = self._pack_latents(image_latents, batch_size, num_channels_latents, height, width)
image_latents_ids = prepare_pos_ids(
modality_id=2,
type="image",
start=(prompt_embeds_length, prompt_embeds_length),
height=height // 2,
width=width // 2,
).to(device, dtype=torch.float64)
shape = (batch_size, num_channels_latents, height, width)
latents_ids = prepare_pos_ids(
modality_id=1,
type="image",
start=(prompt_embeds_length, prompt_embeds_length),
height=height // 2,
width=width // 2,
).to(device)
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)
else:
latents = latents.to(device=device, dtype=dtype)
return latents, image_latents, latents_ids, image_latents_ids
@property
def guidance_scale(self):
return self._guidance_scale
@property
def joint_attention_kwargs(self):
return self._joint_attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@property
def current_timestep(self):
return self._current_timestep
@property
def interrupt(self):
return self._interrupt
def check_inputs(
self, prompt, height, width, negative_prompt=None, prompt_embeds=None, negative_prompt_embeds=None
):
if height % (self.vae_scale_factor * 2) != 0 or width % (self.vae_scale_factor * 2) != 0:
logger.warning(
f"`height` and `width` have to be divisible by {self.vae_scale_factor * 2} but are {height} and {width}. Dimensions will be resized accordingly"
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None:
if isinstance(prompt, str):
pass
elif isinstance(prompt, list) and len(prompt) == 1:
pass
else:
raise ValueError(
f"`prompt` must be a `str` or a `list` of length 1, but is {prompt} (type: {type(prompt)})"
)
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
@replace_example_docstring(EXAMPLE_DOC_STRING)
@torch.no_grad()
def __call__(
self,
image: Optional[PIL.Image.Image] = None,
prompt: Union[str, List[str]] = None,
negative_prompt: Union[str, List[str]] = None,
num_inference_steps: int = 50,
sigmas: Optional[List[float]] = None,
guidance_scale: float = 4.5,
num_images_per_prompt: Optional[int] = 1,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[torch.FloatTensor] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
):
r"""
Function invoked when calling the pipeline for generation.
Examples:
Returns:
[`~pipelines.LongCatImagePipelineOutput`] or `tuple`: [`~pipelines.LongCatImagePipelineOutput`] if
`return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the
generated images.
"""
image_size = image[0].size if isinstance(image, list) else image.size
calculated_width, calculated_height = calculate_dimensions(1024 * 1024, image_size[0] * 1.0 / image_size[1])
# 1. Check inputs. Raise error if not correct
self.check_inputs(
prompt,
calculated_height,
calculated_width,
negative_prompt=negative_prompt,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
)
self._guidance_scale = guidance_scale
self._joint_attention_kwargs = joint_attention_kwargs
self._current_timestep = None
self._interrupt = False
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
device = self._execution_device
# 3. Preprocess image
if image is not None and not (isinstance(image, torch.Tensor) and image.size(1) == self.latent_channels):
image = self.image_processor.resize(image, calculated_height, calculated_width)
prompt_image = self.image_processor.resize(image, calculated_height // 2, calculated_width // 2)
image = self.image_processor.preprocess(image, calculated_height, calculated_width)
negative_prompt = "" if negative_prompt is None else negative_prompt
(prompt_embeds, text_ids) = self.encode_prompt(
prompt=prompt, image=prompt_image, prompt_embeds=prompt_embeds, num_images_per_prompt=num_images_per_prompt
)
if self.do_classifier_free_guidance:
(negative_prompt_embeds, negative_text_ids) = self.encode_prompt(
prompt=negative_prompt,
image=prompt_image,
prompt_embeds=negative_prompt_embeds,
num_images_per_prompt=num_images_per_prompt,
)
# 4. Prepare latent variables
num_channels_latents = 16
latents, image_latents, latents_ids, image_latents_ids = self.prepare_latents(
image,
batch_size * num_images_per_prompt,
num_channels_latents,
calculated_height,
calculated_width,
prompt_embeds.dtype,
prompt_embeds.shape[1],
device,
generator,
latents,
)
# 5. Prepare timesteps
sigmas = np.linspace(1.0, 1.0 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
image_seq_len = latents.shape[1]
mu = calculate_shift(
image_seq_len,
self.scheduler.config.get("base_image_seq_len", 256),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.5),
self.scheduler.config.get("max_shift", 1.15),
)
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
mu=mu,
)
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
# handle guidance
guidance = None
if self.joint_attention_kwargs is None:
self._joint_attention_kwargs = {}
if image is not None:
latent_image_ids = torch.cat([latents_ids, image_latents_ids], dim=0)
else:
latent_image_ids = latents_ids
# 6. Denoising loop
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
if self.interrupt:
continue
self._current_timestep = t
latent_model_input = latents
if image_latents is not None:
latent_model_input = torch.cat([latents, image_latents], dim=1)
timestep = t.expand(latent_model_input.shape[0]).to(latents.dtype)
with self.transformer.cache_context("cond"):
noise_pred_text = self.transformer(
hidden_states=latent_model_input,
timestep=timestep / 1000,
guidance=guidance,
encoder_hidden_states=prompt_embeds,
txt_ids=text_ids,
img_ids=latent_image_ids,
return_dict=False,
)[0]
noise_pred_text = noise_pred_text[:, :image_seq_len]
if self.do_classifier_free_guidance:
with self.transformer.cache_context("uncond"):
noise_pred_uncond = self.transformer(
hidden_states=latent_model_input,
timestep=timestep / 1000,
encoder_hidden_states=negative_prompt_embeds,
txt_ids=negative_text_ids,
img_ids=latent_image_ids,
return_dict=False,
)[0]
noise_pred_uncond = noise_pred_uncond[:, :image_seq_len]
noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
else:
noise_pred = noise_pred_text
# compute the previous noisy sample x_t -> x_t-1
latents_dtype = latents.dtype
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
if latents.dtype != latents_dtype:
if torch.backends.mps.is_available():
# some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
latents = latents.to(latents_dtype)
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if XLA_AVAILABLE:
xm.mark_step()
self._current_timestep = None
if output_type == "latent":
image = latents
else:
latents = self._unpack_latents(latents, calculated_height, calculated_width, self.vae_scale_factor)
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
if latents.dtype != self.vae.dtype:
latents = latents.to(dtype=self.vae.dtype)
image = self.vae.decode(latents, return_dict=False)[0]
image = self.image_processor.postprocess(image, output_type=output_type)
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (image,)
return LongCatImagePipelineOutput(images=image)
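# Minimal usage sketch (illustrative only; the checkpoint id below is a placeholder and the canonical
# snippet lives in this pipeline's EXAMPLE_DOC_STRING):
#
#   import torch
#   from diffusers import LongCatImageEditPipeline
#   from diffusers.utils import load_image
#
#   pipe = LongCatImageEditPipeline.from_pretrained("<longcat-image-edit-checkpoint>", torch_dtype=torch.bfloat16)
#   pipe.to("cuda")
#   image = load_image("input.png")
#   result = pipe(image=image, prompt="Replace the sky with a sunset", num_inference_steps=50, guidance_scale=4.5)
#   result.images[0].save("edited.png")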

View File

@@ -0,0 +1,21 @@
from dataclasses import dataclass
from typing import List, Union
import numpy as np
import PIL.Image
from diffusers.utils import BaseOutput
@dataclass
class LongCatImagePipelineOutput(BaseOutput):
"""
    Output class for LongCat-Image pipelines.
Args:
images (`List[PIL.Image.Image]` or `np.ndarray`)
List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width,
num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline.
"""
images: Union[List[PIL.Image.Image], np.ndarray]

View File

@@ -0,0 +1,142 @@
SYSTEM_PROMPT_EN = """
You are a prompt engineering expert for text-to-image models. Since text-to-image models have limited capabilities in
understanding user prompts, you need to identify the core theme and intent of the user's input and improve the model's
understanding accuracy and generation quality through optimization and rewriting. The rewrite must strictly retain all
information from the user's original prompt without deleting or distorting any details. Specific requirements are as
follows:
1. The rewrite must not affect any information expressed in the user's original prompt; the rewritten prompt should use
coherent natural language, avoid low-information redundant descriptions, and keep the rewritten prompt length as
concise as possible.
2. Ensure consistency between input and output languages: Chinese input yields Chinese output, and English input yields
English output. The rewritten token count should not exceed 512.
3. The rewritten description should further refine subject characteristics and aesthetic techniques appearing in the
original prompt, such as lighting and textures.
4. If the original prompt does not specify an image style, ensure the rewritten prompt uses a **realistic photography
style**. If the user specifies a style, retain the user's style.
5. When the original prompt requires reasoning to clarify user intent, use logical reasoning based on world knowledge
to convert vague abstract descriptions into specific tangible objects (e.g., convert "the tallest animal" to "a
giraffe").
6. When the original prompt requires text generation, please use double quotes to enclose the text part (e.g., `"50%
OFF"`).
7. When the original prompt requires generating text-heavy scenes like webpages, logos, UIs, or posters, and no
specific text content is specified, you need to infer appropriate text content and enclose it in double quotes. For
example, if the user inputs: "A tourism flyer with a grassland theme," it should be rewritten as: "A tourism flyer
with the image title 'Grassland'."
8. When negative words exist in the original prompt, ensure the rewritten prompt does not contain negative words. For
example, "a lakeside without boats" should be rewritten such that the word "boat" does not appear at all.
9. Except for text content explicitly requested by the user, **adding any extra text content is prohibited**.
Here are examples of rewrites for different types of prompts: # Examples (Few-Shot Learning)
1. User Input: An animal with nine lives.
Rewrite Output: A cat bathed in soft sunlight, its fur soft and glossy. The background is a comfortable home
environment with light from the window filtering through curtains, creating a warm light and shadow effect. The
shot uses a medium distance perspective to highlight the cat's leisurely and stretched posture. Light cleverly hits
the cat's face, emphasizing its spirited eyes and delicate whiskers, adding depth and affinity to the image.
2. User Input: Create an anime-style tourism flyer with a grassland theme.
Rewrite Output: In the lower right of the center, a short-haired girl sits sideways on a gray, irregularly shaped
rock. She wears a white short-sleeved dress and brown flat shoes, holding a bunch of small white flowers in her
left hand, smiling with her legs hanging naturally. The girl has dark brown shoulder-length hair with bangs
covering her forehead, brown eyes, and a slightly open mouth. The rock surface has textures of varying depths. To
the girl's left and front is lush grass, with long, yellow-green blades, some glowing golden in the sunlight. The
grass extends into the distance, forming rolling green hills that fade in color as they recede. The sky occupies
the upper half of the picture, pale blue dotted with a few fluffy white clouds. In the upper left corner, there is
a line of text in italic, dark green font reading "Explore Nature's Peace". Colors are dominated by green, blue,
and yellow, fluid lines, and distinct light and shadow contrast, creating a quiet and comfortable atmosphere.
3. User Input: A Christmas sale poster with a red background, promoting a Buy 1 Get 1 Free milk tea offer.
Rewrite Output: The poster features an overall red tone, embellished with white snowflake patterns on the top and
left side. The upper right features a bunch of holly leaves with red berries and a pine cone. In the upper center,
golden 3D text reads "Christmas Heartwarming Feedback" centered, along with red bold text "Buy 1 Get 1". Below, two
transparent cups filled with bubble tea are placed side by side; the tea is light brown with dark brown pearls
scattered at the bottom and middle. Below the cups, white snow piles up, decorated with pine branches, red berries,
and pine cones. A blurry Christmas tree is faintly visible in the lower right corner. The image has high clarity,
accurate text content, a unified design style, a prominent Christmas theme, and a reasonable layout, providing
strong visual appeal.
4. User Input: A woman indoors shot in natural light, smiling with arms crossed, showing a relaxed and confident
posture.
Rewrite Output: The image features a young Asian woman with long dark brown hair naturally falling over her
shoulders, with some strands illuminated by light, showing a soft sheen. Her features are delicate, with long
eyebrows, bright and spirited dark brown eyes looking directly at the camera, revealing peace and confidence. She
has a high nose bridge, full lips with nude lipstick, and corners of the mouth slightly raised in a faint smile.
Her skin is fair, with cheeks and collarbones illuminated by warm light, showing a healthy ruddiness. She wears a
black spaghetti strap tank top revealing graceful collarbone lines, and a thin gold necklace with small beads and
metal bars glinting in the light. Her outer layer is a beige knitted cardigan, soft in texture with visible
knitting patterns on the sleeves. Her arms are crossed over her chest, hands covered by the cardigan sleeves, in a
relaxed posture. The background is a pure dark brown without extra decoration, making the figure the absolute
focus. The figure is located in the center of the frame. Light enters from the upper right, creating bright spots
on her left cheek, neck, and collarbone, while the right side is slightly shadowed, creating a three-dimensional
and soft tone. Image details are clear, showcasing skin texture, hair, and clothing materials well. Colors are
dominated by warm tones, with the combination of beige and dark brown creating a warm and comfortable atmosphere.
The overall style is natural, elegant, and artistic.
5. User Input: Create a series of images showing the growth process of an apple from seed to fruit. The series should
include four stages: 1. Sowing, 2. Seedling growth, 3. Plant maturity, 4. Fruit harvesting.
Rewrite Output: A 4-panel exquisite illustration depicting the growth process of an apple, capturing each stage
precisely and clearly. 1. "Sowing": A close-up shot of a hand gently placing a small apple seed into fertile dark
soil, with visible soil texture and the seed's smooth surface. The background is a soft-focus garden dotted with
green leaves and sunlight filtering through. 2. "Seedling Growth": A young apple sapling breaks through the soil,
stretching tender green leaves toward the sky. The scene is set in a vibrant garden illuminated by warm golden
light, highlighting the seedling's delicate structure. 3. "Plant Maturity": A mature apple tree, lush with branches
and leaves, covered in tender green foliage and developing small apples. The background is a vibrant orchard under
a clear blue sky, with dappled sunlight creating a peaceful atmosphere. 4. "Fruit Harvesting": A hand reaches into
the tree to pick a ripe red apple, its smooth skin glistening in the sun. The scene shows the abundance of the
orchard, with baskets of apples in the background, giving a sense of fulfillment. Each illustration uses a
realistic style, focusing on details and harmonious colors to showcase the natural beauty and development of the
apple's life cycle.
6. User Input: If 1 represents red, 2 represents green, 3 represents purple, and 4 represents yellow, please generate
a four-color rainbow based on this rule. The color order from top to bottom is 3142.
Rewrite Output: The image consists of four horizontally arranged colored stripes, ordered from top to bottom as
purple, red, yellow, and green. A white number is centered on each stripe. The top purple stripe features the
number "3", the red stripe below it has the number "1", the yellow stripe further down has the number "4", and the
bottom green stripe has the number "2". All numbers use a sans-serif font in pure white, forming a sharp contrast
with the background colors to ensure good readability. The stripes have high color saturation and a slight texture.
The overall layout is simple and clear, with distinct visual effects and no extra decorative elements, emphasizing
the numerical information. The image is high definition, with accurate colors and a consistent style, offering
strong visual appeal.
7. User Input: A stone tablet carved with "Guan Guan Ju Jiu, On the River Isle", natural light, background is a
Chinese garden.
Rewrite Output: An ancient stone tablet carved with "Guan Guan Ju Jiu, On the River Isle", the surface covered with
traces of time, the writing clear and deep. Natural light falls from above, softly illuminating every detail of the
stone tablet and enhancing its sense of history. The background is an elegant Chinese garden featuring lush bamboo
forests, winding paths, and quiet pools, creating a serene and distant atmosphere. The overall picture uses a
realistic style with rich details and natural light and shadow effects, highlighting the cultural heritage of the
stone tablet and the classical beauty of the garden.
# Output Format Please directly output the rewritten and optimized Prompt content. Do not include any explanatory
language or JSON formatting, and do not add opening or closing quotes yourself."""
SYSTEM_PROMPT_ZH = """
你是一名文生图模型的prompt
engineering专家。由于文生图模型对用户prompt的理解能力有限你需要识别用户输入的核心主题和意图并通过优化改写提升模型的理解准确性和生成质量。改写必须严格保留用户原始prompt的所有信息不得删减或曲解任何细节。
具体要求如下:
1. 改写不能影响用户原始prompt里表达的任何信息改写后的prompt应该使用连贯的自然语言表达,不要出现低信息量的冗余描述尽可能保持改写后prompt长度精简。
2. 请确保输入和输出的语言类型一致中文输入中文输出英文输入英文输出改写后的token数量不要超过512个;
3. 改写后的描述应当进一步完善原始prompt中出现的主体特征、美学技巧如打光、纹理等
4. 如果原始prompt没有指定图片风格时确保改写后的prompt使用真实摄影风格如果用户指定了图片风格则保留用户风格
5. 当原始prompt需要推理才能明确用户意图时根据世界知识进行适当逻辑推理将模糊抽象描述转化为具体指向事物"最高的动物"转化为"一头长颈鹿")。
6. 当原始prompt需要生成文字时请使用双引号圈定文字部分`"限时5折"`)。
7. 当原始prompt需要生成网页、logo、ui、海报等文字场景时且没有指定具体的文字内容时需要推断出合适的文字内容并使用双引号圈定如用户输入一个旅游宣传单以草原为主题。应该改写成一个旅游宣传单图片标题为“草原”。
8. 当原始prompt中存在否定词时需要确保改写后的prompt不存在否定词如没有船的湖边改写后的prompt不能出现船这个词汇。
9. 除非用户指定生成品牌logo否则不要增加额外的品牌logo.
10. 除了用户明确要求书写的文字内容外,**禁止增加任何额外的文字内容**。
以下是针对不同类型prompt改写的示例
# Examples (Few-Shot Learning)
1. 用户输入: 九条命的动物。
改写输出:
一只猫,被柔和的阳光笼罩着,毛发柔软而富有光泽。背景是一个舒适的家居环境,窗外的光线透过窗帘,形成温馨的光影效果。镜头采用中距离视角,突出猫悠闲舒展的姿态。光线巧妙地打在猫的脸部,强调它灵动的眼睛和精致的胡须,增加画面的层次感与亲和力。
2. 用户输入: 制作一个动画风格的旅游宣传单,以草原为主题。
改写输出:
画面中央偏右下角一个短发女孩侧身坐在灰色的不规则形状岩石上她穿着白色短袖连衣裙和棕色平底鞋左手拿着一束白色小花面带微笑双腿自然垂下。女孩的头发为深棕色齐肩短发刘海覆盖额头眼睛呈棕色嘴巴微张。岩石表面有深浅不一的纹理。女孩的左侧和前方是茂盛的草地草叶细长呈黄绿色部分草叶在阳光下泛着金色的光芒仿佛被阳光照亮。草地向远处延伸形成连绵起伏的绿色山丘山丘的颜色由近及远逐渐变浅。天空占据了画面的上半部分呈淡蓝色点缀着几朵白色蓬松的云彩。画面的左上角有一行文字文字内容是斜体、深绿色的“Explore
Nature's Peace”。色彩以绿色、蓝色和黄色为主线条流畅光影明暗对比明显营造出一种宁静、舒适的氛围。
3. 用户输入: 一张以红色为背景的圣诞节促销海报,主要宣传奶茶买一送一的优惠活动。
改写输出: 海报整体呈现红色调,上方和左侧点缀着白色雪花图案,右上方有一束冬青叶和红色浆果,以及一个松果。海报中央偏上位置,金色立体字样“圣诞节
暖心回馈”居中排列和红色粗体字“买1送1”。海报下方两个装满珍珠奶茶的透明杯子并排摆放杯中奶茶呈浅棕色底部和中间散布着深棕色珍珠。杯子下方堆积着白色雪花雪花上装饰着松枝、红色浆果和松果。右下角隐约可见一棵模糊的圣诞树。图片清晰度高文字内容准确整体设计风格统一圣诞主题突出排版布局合理具有较强的视觉吸引力。
4. 用户输入: 一位女性在室内以自然光线拍摄,她面带微笑,双臂交叉,展现出轻松自信的姿态。
改写输出:
画面中是一位年轻的亚洲女性,她拥有深棕色的长发,发丝自然地垂落在双肩,部分发丝被光线照亮,呈现出柔和的光泽。她的五官精致,眉毛修长,眼睛明亮有神,瞳孔呈深棕色,眼神直视镜头,流露出平和与自信。鼻梁挺拔,嘴唇丰满,涂有裸色系唇膏,嘴角微微上扬,展现出浅浅的微笑。她的肤色白皙,脸颊和锁骨处被暖色调的光线照亮,呈现出健康的红润感。她穿着一件黑色的细吊带背心,肩带纤细,露出优美的锁骨线条。脖颈上佩戴着一条金色的细项链,项链由小珠子和几个细长的金属条组成,在光线下闪烁着光泽。她的外搭是一件米黄色的针织开衫,材质柔软,袖子部分有明显的针织纹理。她双臂交叉在胸前,双手被开衫的袖子覆盖,姿态放松。背景是纯粹的深棕色,没有多余的装饰,使得人物成为画面的绝对焦点。人物位于画面中央。光线从画面的右上方射入,在人物的左侧脸颊、脖颈和锁骨处形成明亮的光斑,右侧则略显阴影,营造出立体感和柔和的影调。图像细节清晰,人物的皮肤纹理、发丝以及衣物材质都得到了很好的展现。色彩以暖色调为主,米黄色和深棕色的搭配营造出温馨舒适的氛围。整体呈现出一种自然、优雅且富有亲和力的艺术风格。
5. 用户输入创作一系列图片展现苹果从种子到结果的生长过程。该系列图片应包含以下四个阶段1. 播种2. 幼苗生长3. 植物成熟4. 果实采摘。
改写输出一个4宫格的精美插图描绘苹果的生长过程精确清晰地捕捉每个阶段。1.“播种”特写镜头一只手轻轻地将一颗小小的苹果种子放入肥沃的深色土壤中土壤的纹理和种子光滑的表面清晰可见。背景是花园的柔焦画面点缀着绿色的树叶和透过树叶洒下的阳光。2.“幼苗生长”一棵幼小的苹果树苗破土而出嫩绿的叶子向天空舒展。场景设定在一个生机勃勃的花园中温暖的金光照亮了它。幼苗的纤细结构。3.“植物的成熟”一棵成熟的苹果树枝繁叶茂挂满了嫩绿的叶子和正在萌发的小苹果。背景是一片生机勃勃的果园湛蓝的天空下斑驳的阳光营造出宁静祥和的氛围。4.“采摘果实”:一只手伸向树上,摘下一个成熟的红苹果,苹果光滑的果皮在阳光下闪闪发光。画面展现了果园的丰收景象,背景中摆放着一篮篮的苹果,给人一种圆满满足的感觉。每幅插图都采用写实风格,注重细节,色彩和谐,展现了苹果生命周期的自然之美和发展过程。
6. 用户输入: 如果1代表红色2代表绿色3代表紫色4代表黄色请按照此规则生成四色彩虹。它的颜色顺序从上到下是3142
改写输出图片由四个水平排列的彩色条纹组成从上到下依次为紫色、红色、黄色和绿色。每个条纹上都居中放置一个白色数字。最上方的紫色条纹上是数字“3”其下方红色条纹上是数字“1”再下方黄色条纹上是数字“4”最下方的绿色条纹上是数字“2”。所有数字均采用无衬线字体颜色为纯白色与背景色形成鲜明对比确保了良好的可读性。条纹的颜色饱和度高且带有轻微的纹理感整体排版简洁明了视觉效果清晰没有多余的装饰元素强调了数字信息本身。图片整体清晰度高色彩准确风格一致具有较强的视觉吸引力。
7. 用户输入:石碑上刻着“关关雎鸠,在河之洲”,自然光照,背景是中式园林
改写输出:一块古老的石碑上刻着“关关雎鸠,在河之洲”,石碑表面布满岁月的痕迹,字迹清晰而深刻。自然光线从上方洒下,柔和地照亮石碑的每一个细节,增强了其历史感。背景是一座典雅的中式园林,园林中有翠绿的竹林、蜿蜒的小径和静谧的水池,营造出一种宁静而悠远的氛围。整体画面采用写实风格,细节丰富,光影效果自然,突出了石碑的文化底蕴和园林的古典美。
# 输出格式 请直接输出改写优化后的 Prompt 内容,不要包含任何解释性语言或 JSON 格式,不要自行添加开头或结尾的引号。
"""

View File

@@ -31,6 +31,7 @@ else:
_import_structure["pipeline_qwenimage_edit_plus"] = ["QwenImageEditPlusPipeline"]
_import_structure["pipeline_qwenimage_img2img"] = ["QwenImageImg2ImgPipeline"]
_import_structure["pipeline_qwenimage_inpaint"] = ["QwenImageInpaintPipeline"]
_import_structure["pipeline_qwenimage_layered"] = ["QwenImageLayeredPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
@@ -47,6 +48,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .pipeline_qwenimage_edit_plus import QwenImageEditPlusPipeline
from .pipeline_qwenimage_img2img import QwenImageImg2ImgPipeline
from .pipeline_qwenimage_inpaint import QwenImageInpaintPipeline
from .pipeline_qwenimage_layered import QwenImageLayeredPipeline
else:
import sys

View File

@@ -0,0 +1,905 @@
# Copyright 2025 Qwen-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import math
from typing import Any, Callable, Dict, List, Optional, Union
import numpy as np
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2Tokenizer, Qwen2VLProcessor
from ...image_processor import PipelineImageInput, VaeImageProcessor
from ...loaders import QwenImageLoraLoaderMixin
from ...models import AutoencoderKLQwenImage, QwenImageTransformer2DModel
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import is_torch_xla_available, logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from ..pipeline_utils import DiffusionPipeline
from .pipeline_output import QwenImagePipelineOutput
if is_torch_xla_available():
import torch_xla.core.xla_model as xm
XLA_AVAILABLE = True
else:
XLA_AVAILABLE = False
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from PIL import Image
>>> from diffusers import QwenImageLayeredPipeline
>>> from diffusers.utils import load_image
>>> pipe = QwenImageLayeredPipeline.from_pretrained("Qwen/Qwen-Image-Layered", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> image = load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png"
... ).convert("RGBA")
>>> prompt = ""
>>> # Depending on the variant being used, the pipeline call will slightly vary.
>>> # Refer to the pipeline documentation for more details.
>>> images = pipe(
... image,
... prompt,
... num_inference_steps=50,
... true_cfg_scale=4.0,
... layers=4,
... resolution=640,
... cfg_normalize=False,
... use_en_prompt=True,
... ).images[0]
>>> for i, image in enumerate(images):
... image.save(f"{i}.out.png")
```
"""
# Copied from diffusers.pipelines.qwenimage.pipeline_qwenimage.calculate_shift
def calculate_shift(
image_seq_len,
base_seq_len: int = 256,
max_seq_len: int = 4096,
base_shift: float = 0.5,
max_shift: float = 1.15,
):
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
b = base_shift - m * base_seq_len
mu = image_seq_len * m + b
return mu
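# Illustrative values: with the defaults, calculate_shift(256) = 0.5, calculate_shift(4096) = 1.15,
# and the midpoint calculate_shift(2176) = 0.825, i.e. mu grows linearly with the image token count.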
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: Optional[int] = None,
device: Optional[Union[str, torch.device]] = None,
timesteps: Optional[List[int]] = None,
sigmas: Optional[List[float]] = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`List[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`List[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
):
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
return encoder_output.latent_dist.sample(generator)
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
return encoder_output.latent_dist.mode()
elif hasattr(encoder_output, "latents"):
return encoder_output.latents
else:
raise AttributeError("Could not access latents of provided encoder_output")
# Copied from diffusers.pipelines.qwenimage.pipeline_qwenimage_edit_plus.calculate_dimensions
def calculate_dimensions(target_area, ratio):
width = math.sqrt(target_area * ratio)
height = width / ratio
width = round(width / 32) * 32
height = round(height / 32) * 32
return width, height
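# Illustrative example: calculate_dimensions(640 * 640, 1.5) computes width ~= 783.8 and height ~= 522.6,
# then rounds each to the nearest multiple of 32, returning (768, 512).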
class QwenImageLayeredPipeline(DiffusionPipeline, QwenImageLoraLoaderMixin):
r"""
    The Qwen-Image-Layered pipeline for decomposing an image into layers.
Args:
transformer ([`QwenImageTransformer2DModel`]):
Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
scheduler ([`FlowMatchEulerDiscreteScheduler`]):
A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKLQwenImage`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`Qwen2_5_VLForConditionalGeneration`]):
            [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), specifically the
            [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) variant.
        tokenizer (`Qwen2Tokenizer`):
            Tokenizer of class
            [Qwen2Tokenizer](https://huggingface.co/docs/transformers/en/model_doc/qwen2#transformers.Qwen2Tokenizer).
"""
model_cpu_offload_seq = "text_encoder->transformer->vae"
_callback_tensor_inputs = ["latents", "prompt_embeds"]
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKLQwenImage,
text_encoder: Qwen2_5_VLForConditionalGeneration,
tokenizer: Qwen2Tokenizer,
processor: Qwen2VLProcessor,
transformer: QwenImageTransformer2DModel,
):
super().__init__()
self.register_modules(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
processor=processor,
transformer=transformer,
scheduler=scheduler,
)
self.vae_scale_factor = 2 ** len(self.vae.temperal_downsample) if getattr(self, "vae", None) else 8
self.latent_channels = self.vae.config.z_dim if getattr(self, "vae", None) else 16
# QwenImage latents are turned into 2x2 patches and packed. This means the latent width and height has to be divisible
# by the patch size. So the vae scale factor is multiplied by the patch size to account for this
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
self.vl_processor = processor
self.tokenizer_max_length = 1024
self.prompt_template_encode = "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
self.prompt_template_encode_start_idx = 34
self.image_caption_prompt_cn = """<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n# 图像标注器\n你是一个专业的图像标注器。请基于输入图像,撰写图注:\n1.
使用自然、描述性的语言撰写图注,不要使用结构化形式或富文本形式。\n2. 通过加入以下内容,丰富图注细节:\n - 对象的属性:如数量、颜色、形状、大小、位置、材质、状态、动作等\n -
对象间的视觉关系:如空间关系、功能关系、动作关系、从属关系、比较关系、因果关系等\n - 环境细节:例如天气、光照、颜色、纹理、气氛等\n - 文字内容:识别图像中清晰可见的文字,不做翻译和解释,用引号在图注中强调\n3.
保持真实性与准确性:\n - 不要使用笼统的描述\n -
描述图像中所有可见的信息,但不要加入没有在图像中出现的内容\n<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n"""
self.image_caption_prompt_en = """<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n# Image Annotator\nYou are a professional
image annotator. Please write an image caption based on the input image:\n1. Write the caption using natural,
descriptive language without structured formats or rich text.\n2. Enrich caption details by including: \n - Object
attributes, such as quantity, color, shape, size, material, state, position, actions, and so on\n - Vision Relations
between objects, such as spatial relations, functional relations, possessive relations, attachment relations, action
relations, comparative relations, causal relations, and so on\n - Environmental details, such as weather, lighting,
colors, textures, atmosphere, and so on\n - Identify the text clearly visible in the image, without translation or
explanation, and highlight it in the caption with quotation marks\n3. Maintain authenticity and accuracy:\n - Avoid
generalizations\n - Describe all visible information in the image, while do not add information not explicitly shown in
the image\n<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n"""
self.default_sample_size = 128
# Copied from diffusers.pipelines.qwenimage.pipeline_qwenimage.QwenImagePipeline._extract_masked_hidden
def _extract_masked_hidden(self, hidden_states: torch.Tensor, mask: torch.Tensor):
bool_mask = mask.bool()
valid_lengths = bool_mask.sum(dim=1)
selected = hidden_states[bool_mask]
split_result = torch.split(selected, valid_lengths.tolist(), dim=0)
return split_result
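    # Illustrative example: with mask [[1, 1, 0], [1, 1, 1]] this returns a tuple of two tensors holding
    # the first 2 and the first 3 token embeddings respectively (padding positions are dropped).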
def _get_qwen_prompt_embeds(
self,
prompt: Union[str, List[str]] = None,
device: Optional[torch.device] = None,
dtype: Optional[torch.dtype] = None,
):
device = device or self._execution_device
dtype = dtype or self.text_encoder.dtype
prompt = [prompt] if isinstance(prompt, str) else prompt
template = self.prompt_template_encode
drop_idx = self.prompt_template_encode_start_idx
txt = [template.format(e) for e in prompt]
txt_tokens = self.tokenizer(
txt,
padding=True,
return_tensors="pt",
).to(device)
encoder_hidden_states = self.text_encoder(
input_ids=txt_tokens.input_ids,
attention_mask=txt_tokens.attention_mask,
output_hidden_states=True,
)
hidden_states = encoder_hidden_states.hidden_states[-1]
split_hidden_states = self._extract_masked_hidden(hidden_states, txt_tokens.attention_mask)
split_hidden_states = [e[drop_idx:] for e in split_hidden_states]
attn_mask_list = [torch.ones(e.size(0), dtype=torch.long, device=e.device) for e in split_hidden_states]
max_seq_len = max([e.size(0) for e in split_hidden_states])
prompt_embeds = torch.stack(
[torch.cat([u, u.new_zeros(max_seq_len - u.size(0), u.size(1))]) for u in split_hidden_states]
)
encoder_attention_mask = torch.stack(
[torch.cat([u, u.new_zeros(max_seq_len - u.size(0))]) for u in attn_mask_list]
)
prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
return prompt_embeds, encoder_attention_mask
# Copied from diffusers.pipelines.qwenimage.pipeline_qwenimage.QwenImagePipeline.encode_prompt
def encode_prompt(
self,
prompt: Union[str, List[str]],
device: Optional[torch.device] = None,
num_images_per_prompt: int = 1,
prompt_embeds: Optional[torch.Tensor] = None,
prompt_embeds_mask: Optional[torch.Tensor] = None,
max_sequence_length: int = 1024,
):
r"""
Args:
prompt (`str` or `List[str]`, *optional*):
prompt to be encoded
device: (`torch.device`):
torch device
num_images_per_prompt (`int`):
number of images that should be generated per prompt
prompt_embeds (`torch.Tensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
"""
device = device or self._execution_device
prompt = [prompt] if isinstance(prompt, str) else prompt
batch_size = len(prompt) if prompt_embeds is None else prompt_embeds.shape[0]
if prompt_embeds is None:
prompt_embeds, prompt_embeds_mask = self._get_qwen_prompt_embeds(prompt, device)
prompt_embeds = prompt_embeds[:, :max_sequence_length]
prompt_embeds_mask = prompt_embeds_mask[:, :max_sequence_length]
_, seq_len, _ = prompt_embeds.shape
prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
prompt_embeds_mask = prompt_embeds_mask.repeat(1, num_images_per_prompt, 1)
prompt_embeds_mask = prompt_embeds_mask.view(batch_size * num_images_per_prompt, seq_len)
return prompt_embeds, prompt_embeds_mask
def get_image_caption(self, prompt_image, use_en_prompt=True, device=None):
if use_en_prompt:
prompt = self.image_caption_prompt_en
else:
prompt = self.image_caption_prompt_cn
model_inputs = self.vl_processor(
text=prompt,
images=prompt_image,
padding=True,
return_tensors="pt",
).to(device)
generated_ids = self.text_encoder.generate(**model_inputs, max_new_tokens=512)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
]
output_text = self.vl_processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
return output_text.strip()
def check_inputs(
self,
height,
width,
negative_prompt=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_embeds_mask=None,
negative_prompt_embeds_mask=None,
callback_on_step_end_tensor_inputs=None,
max_sequence_length=None,
):
if height % (self.vae_scale_factor * 2) != 0 or width % (self.vae_scale_factor * 2) != 0:
logger.warning(
f"`height` and `width` have to be divisible by {self.vae_scale_factor * 2} but are {height} and {width}. Dimensions will be resized accordingly"
)
if callback_on_step_end_tensor_inputs is not None and not all(
k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
):
raise ValueError(
f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
)
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_embeds is not None and prompt_embeds_mask is None:
raise ValueError(
"If `prompt_embeds` are provided, `prompt_embeds_mask` also have to be passed. Make sure to generate `prompt_embeds_mask` from the same text encoder that was used to generate `prompt_embeds`."
)
if negative_prompt_embeds is not None and negative_prompt_embeds_mask is None:
raise ValueError(
"If `negative_prompt_embeds` are provided, `negative_prompt_embeds_mask` also have to be passed. Make sure to generate `negative_prompt_embeds_mask` from the same text encoder that was used to generate `negative_prompt_embeds`."
)
if max_sequence_length is not None and max_sequence_length > 1024:
raise ValueError(f"`max_sequence_length` cannot be greater than 1024 but is {max_sequence_length}")
@staticmethod
def _pack_latents(latents, batch_size, num_channels_latents, height, width, layers):
latents = latents.view(batch_size, layers, num_channels_latents, height // 2, 2, width // 2, 2)
latents = latents.permute(0, 1, 3, 5, 2, 4, 6)
latents = latents.reshape(batch_size, layers * (height // 2) * (width // 2), num_channels_latents * 4)
return latents
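    # Shape sketch (illustrative): with batch_size=1, 16 latent channels, layers=5 (composite + 4 layers)
    # and 80x80 latents, (1, 5, 16, 80, 80) -> (1, 5 * 40 * 40, 64): each layer contributes its own block
    # of 2x2-patch tokens along the sequence dimension.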
@staticmethod
def _unpack_latents(latents, height, width, layers, vae_scale_factor):
batch_size, num_patches, channels = latents.shape
# VAE applies 8x compression on images but we must also account for packing which requires
# latent height and width to be divisible by 2.
height = 2 * (int(height) // (vae_scale_factor * 2))
width = 2 * (int(width) // (vae_scale_factor * 2))
latents = latents.view(batch_size, layers + 1, height // 2, width // 2, channels // 4, 2, 2)
latents = latents.permute(0, 1, 4, 2, 5, 3, 6)
latents = latents.reshape(batch_size, layers + 1, channels // (2 * 2), height, width)
latents = latents.permute(0, 2, 1, 3, 4) # (b, c, f, h, w)
return latents
# Copied from diffusers.pipelines.qwenimage.pipeline_qwenimage_edit.QwenImageEditPipeline._encode_vae_image
def _encode_vae_image(self, image: torch.Tensor, generator: torch.Generator):
if isinstance(generator, list):
image_latents = [
retrieve_latents(self.vae.encode(image[i : i + 1]), generator=generator[i], sample_mode="argmax")
for i in range(image.shape[0])
]
image_latents = torch.cat(image_latents, dim=0)
else:
image_latents = retrieve_latents(self.vae.encode(image), generator=generator, sample_mode="argmax")
latents_mean = (
torch.tensor(self.vae.config.latents_mean)
.view(1, self.latent_channels, 1, 1, 1)
.to(image_latents.device, image_latents.dtype)
)
latents_std = (
torch.tensor(self.vae.config.latents_std)
.view(1, self.latent_channels, 1, 1, 1)
.to(image_latents.device, image_latents.dtype)
)
image_latents = (image_latents - latents_mean) / latents_std
return image_latents
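    # Note: AutoencoderKLQwenImage latents are normalized with per-channel `latents_mean` / `latents_std`
    # vectors (broadcast over the (b, c, f, h, w) latents) rather than scalar shift/scale factors.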
def prepare_latents(
self,
image,
batch_size,
num_channels_latents,
height,
width,
layers,
dtype,
device,
generator,
latents=None,
):
# VAE applies 8x compression on images but we must also account for packing which requires
# latent height and width to be divisible by 2.
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
shape = (
batch_size,
layers + 1,
num_channels_latents,
height,
width,
        )  # the first generated frame is the combined (composite) image; the remaining frames are the individual layers
image_latents = None
if image is not None:
image = image.to(device=device, dtype=dtype)
if image.shape[1] != self.latent_channels:
image_latents = self._encode_vae_image(image=image, generator=generator)
else:
image_latents = image
if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:
# expand init_latents for batch_size
additional_image_per_prompt = batch_size // image_latents.shape[0]
image_latents = torch.cat([image_latents] * additional_image_per_prompt, dim=0)
elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:
raise ValueError(
f"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts."
)
else:
image_latents = torch.cat([image_latents], dim=0)
image_latent_height, image_latent_width = image_latents.shape[3:]
image_latents = image_latents.permute(0, 2, 1, 3, 4) # (b, c, f, h, w) -> (b, f, c, h, w)
image_latents = self._pack_latents(
image_latents, batch_size, num_channels_latents, image_latent_height, image_latent_width, 1
)
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width, layers + 1)
else:
latents = latents.to(device=device, dtype=dtype)
return latents, image_latents
@property
def guidance_scale(self):
return self._guidance_scale
@property
def attention_kwargs(self):
return self._attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@property
def current_timestep(self):
return self._current_timestep
@property
def interrupt(self):
return self._interrupt
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
image: Optional[PipelineImageInput] = None,
prompt: Union[str, List[str]] = None,
negative_prompt: Union[str, List[str]] = None,
true_cfg_scale: float = 4.0,
layers: Optional[int] = 4,
num_inference_steps: int = 50,
sigmas: Optional[List[float]] = None,
guidance_scale: Optional[float] = None,
num_images_per_prompt: int = 1,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.Tensor] = None,
prompt_embeds: Optional[torch.Tensor] = None,
prompt_embeds_mask: Optional[torch.Tensor] = None,
negative_prompt_embeds: Optional[torch.Tensor] = None,
negative_prompt_embeds_mask: Optional[torch.Tensor] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
attention_kwargs: Optional[Dict[str, Any]] = None,
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 512,
resolution: int = 640,
cfg_normalize: bool = False,
use_en_prompt: bool = False,
):
r"""
Function invoked when calling the pipeline for generation.
Args:
image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
numpy array and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list
or tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image
latents as `image`, but if passing latents directly it is not encoded again.
prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
                instead.
negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `true_cfg_scale` is
not greater than `1`).
            true_cfg_scale (`float`, *optional*, defaults to 4.0):
                Guidance scale as defined in [Classifier-Free
Diffusion Guidance](https://huggingface.co/papers/2207.12598). `true_cfg_scale` is defined as `w` of
equation 2. of [Imagen Paper](https://huggingface.co/papers/2205.11487). Classifier-free guidance is
enabled by setting `true_cfg_scale > 1` and a provided `negative_prompt`. Higher guidance scale
encourages to generate images that are closely linked to the text `prompt`, usually at the expense of
lower image quality.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
sigmas (`List[float]`, *optional*):
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
will be used.
guidance_scale (`float`, *optional*, defaults to None):
A guidance scale value for guidance distilled models. Unlike the traditional classifier-free guidance
where the guidance scale is applied during inference through noise prediction rescaling, guidance
distilled models take the guidance scale directly as an input parameter during forward pass. Guidance
scale is enabled by setting `guidance_scale > 1`. Higher guidance scale encourages to generate images
that are closely linked to the text `prompt`, usually at the expense of lower image quality. This
parameter in the pipeline is there to support future guidance-distilled models when they come up. It is
ignored when not using guidance distilled models. To enable traditional classifier-free guidance,
please pass `true_cfg_scale > 1.0` and `negative_prompt` (even an empty negative prompt like " " should
enable classifier-free guidance computations).
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.Tensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.Tensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`torch.Tensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.qwenimage.QwenImagePipelineOutput`] instead of a plain tuple.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
callback_on_step_end (`Callable`, *optional*):
A function that is called at the end of each denoising step during inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
`callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (`List`, *optional*):
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (`int`, *optional*, defaults to 512): Maximum sequence length to use with the `prompt`.
resolution (`int`, *optional*, defaults to 640):
The resolution bucket, either 640 or 1024, used to determine the condition and output resolution.
cfg_normalize (`bool`, *optional*, defaults to `False`):
Whether to enable classifier-free guidance normalization.
use_en_prompt (`bool`, *optional*, defaults to `False`):
Whether to generate the automatic image caption in English when no prompt is provided.
Examples:
Returns:
[`~pipelines.qwenimage.QwenImagePipelineOutput`] or `tuple`:
[`~pipelines.qwenimage.QwenImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When
returning a tuple, the first element is a list with the generated images.
"""
image_size = image[0].size if isinstance(image, list) else image.size
assert resolution in [640, 1024], f"resolution must be either 640 or 1024, but got {resolution}"
calculated_width, calculated_height = calculate_dimensions(
resolution * resolution, image_size[0] / image_size[1]
)
height = calculated_height
width = calculated_width
multiple_of = self.vae_scale_factor * 2
width = width // multiple_of * multiple_of
height = height // multiple_of * multiple_of
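# snap height/width down to the VAE-compatible grid (a multiple of 16 for the typical vae_scale_factor of 8)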
# 1. Check inputs. Raise error if not correct
self.check_inputs(
height,
width,
negative_prompt=negative_prompt,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
prompt_embeds_mask=prompt_embeds_mask,
negative_prompt_embeds_mask=negative_prompt_embeds_mask,
callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
max_sequence_length=max_sequence_length,
)
self._guidance_scale = guidance_scale
self._attention_kwargs = attention_kwargs
self._current_timestep = None
self._interrupt = False
device = self._execution_device
# 2. Preprocess image
if image is not None and not (isinstance(image, torch.Tensor) and image.size(1) == self.latent_channels):
image = self.image_processor.resize(image, calculated_height, calculated_width)
prompt_image = image
image = self.image_processor.preprocess(image, calculated_height, calculated_width)
image = image.unsqueeze(2)
image = image.to(dtype=self.text_encoder.dtype)
if prompt is None or prompt == "" or prompt == " ":
prompt = self.get_image_caption(prompt_image, use_en_prompt=use_en_prompt, device=device)
# 3. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
has_neg_prompt = negative_prompt is not None or (
negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
)
if true_cfg_scale > 1 and not has_neg_prompt:
logger.warning(
f"true_cfg_scale is passed as {true_cfg_scale}, but classifier-free guidance is not enabled since no negative_prompt is provided."
)
elif true_cfg_scale <= 1 and has_neg_prompt:
logger.warning(
" negative_prompt is passed but classifier-free guidance is not enabled since true_cfg_scale <= 1"
)
do_true_cfg = true_cfg_scale > 1 and has_neg_prompt
prompt_embeds, prompt_embeds_mask = self.encode_prompt(
prompt=prompt,
prompt_embeds=prompt_embeds,
prompt_embeds_mask=prompt_embeds_mask,
device=device,
num_images_per_prompt=num_images_per_prompt,
max_sequence_length=max_sequence_length,
)
if do_true_cfg:
negative_prompt_embeds, negative_prompt_embeds_mask = self.encode_prompt(
prompt=negative_prompt,
prompt_embeds=negative_prompt_embeds,
prompt_embeds_mask=negative_prompt_embeds_mask,
device=device,
num_images_per_prompt=num_images_per_prompt,
max_sequence_length=max_sequence_length,
)
# 4. Prepare latent variables
num_channels_latents = self.transformer.config.in_channels // 4
latents, image_latents = self.prepare_latents(
image,
batch_size * num_images_per_prompt,
num_channels_latents,
height,
width,
layers,
prompt_embeds.dtype,
device,
generator,
latents,
)
img_shapes = [
[
*[
(1, height // self.vae_scale_factor // 2, width // self.vae_scale_factor // 2)
for _ in range(layers + 1)
],
(1, calculated_height // self.vae_scale_factor // 2, calculated_width // self.vae_scale_factor // 2),
]
] * batch_size
# 5. Prepare timesteps
sigmas = np.linspace(1.0, 0, num_inference_steps + 1)[:-1] if sigmas is None else sigmas
image_seq_len = latents.shape[1]
base_seqlen = 256 * 256 / 16 / 16
mu = (image_latents.shape[1] / base_seqlen) ** 0.5
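# base_seqlen (= 256) is the token count of a 256x256 image at a 16-pixel effective patch size;
# the flow-matching shift mu grows with the square root of the relative sequence length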
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
mu=mu,
)
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
# handle guidance
if self.transformer.config.guidance_embeds and guidance_scale is None:
raise ValueError("guidance_scale is required for guidance-distilled model.")
elif self.transformer.config.guidance_embeds:
guidance = torch.full([1], guidance_scale, device=device, dtype=torch.float32)
guidance = guidance.expand(latents.shape[0])
elif not self.transformer.config.guidance_embeds and guidance_scale is not None:
logger.warning(
f"guidance_scale is passed as {guidance_scale}, but ignored since the model is not guidance-distilled."
)
guidance = None
elif not self.transformer.config.guidance_embeds and guidance_scale is None:
guidance = None
if self.attention_kwargs is None:
self._attention_kwargs = {}
txt_seq_lens = prompt_embeds_mask.sum(dim=1).tolist() if prompt_embeds_mask is not None else None
negative_txt_seq_lens = (
negative_prompt_embeds_mask.sum(dim=1).tolist() if negative_prompt_embeds_mask is not None else None
)
is_rgb = torch.tensor([0] * batch_size).to(device=device, dtype=torch.long)
# 6. Denoising loop
self.scheduler.set_begin_index(0)
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
if self.interrupt:
continue
self._current_timestep = t
latent_model_input = latents
if image_latents is not None:
latent_model_input = torch.cat([latents, image_latents], dim=1)
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latents.shape[0]).to(latents.dtype)
with self.transformer.cache_context("cond"):
noise_pred = self.transformer(
hidden_states=latent_model_input,
timestep=timestep / 1000,
guidance=guidance,
encoder_hidden_states_mask=prompt_embeds_mask,
encoder_hidden_states=prompt_embeds,
img_shapes=img_shapes,
txt_seq_lens=txt_seq_lens,
attention_kwargs=self.attention_kwargs,
additional_t_cond=is_rgb,
return_dict=False,
)[0]
noise_pred = noise_pred[:, : latents.size(1)]
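# keep only the prediction for the generated latents; the positions of the appended image-conditioning latents are discarded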
if do_true_cfg:
with self.transformer.cache_context("uncond"):
neg_noise_pred = self.transformer(
hidden_states=latent_model_input,
timestep=timestep / 1000,
guidance=guidance,
encoder_hidden_states_mask=negative_prompt_embeds_mask,
encoder_hidden_states=negative_prompt_embeds,
img_shapes=img_shapes,
txt_seq_lens=negative_txt_seq_lens,
attention_kwargs=self.attention_kwargs,
additional_t_cond=is_rgb,
return_dict=False,
)[0]
neg_noise_pred = neg_noise_pred[:, : latents.size(1)]
comb_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
if cfg_normalize:
cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
noise_pred = comb_pred * (cond_norm / noise_norm)
else:
noise_pred = comb_pred
# compute the previous noisy sample x_t -> x_t-1
latents_dtype = latents.dtype
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
if latents.dtype != latents_dtype:
if torch.backends.mps.is_available():
# some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
latents = latents.to(latents_dtype)
if callback_on_step_end is not None:
callback_kwargs = {}
for k in callback_on_step_end_tensor_inputs:
callback_kwargs[k] = locals()[k]
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
latents = callback_outputs.pop("latents", latents)
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if XLA_AVAILABLE:
xm.mark_step()
self._current_timestep = None
if output_type == "latent":
image = latents
else:
latents = self._unpack_latents(latents, height, width, layers, self.vae_scale_factor)
latents = latents.to(self.vae.dtype)
latents_mean = (
torch.tensor(self.vae.config.latents_mean)
.view(1, self.vae.config.z_dim, 1, 1, 1)
.to(latents.device, latents.dtype)
)
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
latents.device, latents.dtype
)
latents = latents / latents_std + latents_mean
b, c, f, h, w = latents.shape
latents = latents[:, :, 1:]  # remove the first frame as it is the original input
latents = latents.permute(0, 2, 1, 3, 4).view(-1, c, 1, h, w)
image = self.vae.decode(latents, return_dict=False)[0] # (b f) c 1 h w
image = image.squeeze(2)
image = self.image_processor.postprocess(image, output_type=output_type)
images = []
frames_per_sample = f - 1  # the first frame (the original input) was removed above
for bidx in range(b):
images.append(image[bidx * frames_per_sample : (bidx + 1) * frames_per_sample])
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (images,)
return QwenImagePipelineOutput(images=images)
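For orientation, here is a minimal usage sketch of the layered call above. The checkpoint id, image URL, and concrete values are placeholders and assumptions, and the pipeline class is resolved via `DiffusionPipeline.from_pretrained`; only the keyword arguments that appear in the body and docstring above are taken from the source.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# "<qwen-image-layered-checkpoint>" is a placeholder repo id, not a released model name.
pipe = DiffusionPipeline.from_pretrained("<qwen-image-layered-checkpoint>", torch_dtype=torch.bfloat16)
pipe.to("cuda")

input_image = load_image("https://example.com/input.png")  # placeholder input image
output = pipe(
    image=input_image,
    prompt="",  # an empty prompt triggers automatic captioning of the input image
    use_en_prompt=True,  # caption in English when auto-captioning
    true_cfg_scale=4.0,
    negative_prompt=" ",
    num_inference_steps=50,
    resolution=640,  # must be 640 or 1024
    layers=4,  # assumed `layers` argument, as used in the pipeline body above
    generator=torch.Generator("cuda").manual_seed(0),
)
layer_images = output.images[0]  # decomposed layer images for the first sample
```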

View File

@@ -23,6 +23,8 @@ except OptionalDependencyNotAvailable:
else:
_import_structure["pipeline_output"] = ["ZImagePipelineOutput"]
_import_structure["pipeline_z_image"] = ["ZImagePipeline"]
_import_structure["pipeline_z_image_controlnet"] = ["ZImageControlNetPipeline"]
_import_structure["pipeline_z_image_controlnet_inpaint"] = ["ZImageControlNetInpaintPipeline"]
_import_structure["pipeline_z_image_img2img"] = ["ZImageImg2ImgPipeline"]
@@ -36,6 +38,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
else:
from .pipeline_output import ZImagePipelineOutput
from .pipeline_z_image import ZImagePipeline
from .pipeline_z_image_controlnet import ZImageControlNetPipeline
from .pipeline_z_image_controlnet_inpaint import ZImageControlNetInpaintPipeline
from .pipeline_z_image_img2img import ZImageImg2ImgPipeline
else:

View File

@@ -0,0 +1,725 @@
# Copyright 2025 Alibaba Z-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
from typing import Any, Callable, Dict, List, Optional, Union
import torch
from transformers import AutoTokenizer, PreTrainedModel
from ...image_processor import PipelineImageInput, VaeImageProcessor
from ...loaders import FromSingleFileMixin
from ...models.autoencoders import AutoencoderKL
from ...models.controlnets import ZImageControlNetModel
from ...models.transformers import ZImageTransformer2DModel
from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from .pipeline_output import ZImagePipelineOutput
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from diffusers import ZImageControlNetPipeline
>>> from diffusers import ZImageControlNetModel
>>> from diffusers.utils import load_image
>>> from huggingface_hub import hf_hub_download
>>> controlnet = ZImageControlNetModel.from_single_file(
... hf_hub_download(
... "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
... filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
... ),
... torch_dtype=torch.bfloat16,
... )
>>> # 2.1
>>> # controlnet = ZImageControlNetModel.from_single_file(
>>> # hf_hub_download(
>>> # "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
>>> # filename="Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors",
>>> # ),
>>> # torch_dtype=torch.bfloat16,
>>> # )
>>> # 2.0 - `config` is required
>>> # controlnet = ZImageControlNetModel.from_single_file(
>>> # hf_hub_download(
>>> # "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
>>> # filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
>>> # ),
>>> # torch_dtype=torch.bfloat16,
>>> # config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
>>> # )
>>> pipe = ZImageControlNetPipeline.from_pretrained(
... "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> # Optionally, set the attention backend to flash-attn 2 or 3, default is SDPA in PyTorch.
>>> # (1) Use flash attention 2
>>> # pipe.transformer.set_attention_backend("flash")
>>> # (2) Use flash attention 3
>>> # pipe.transformer.set_attention_backend("_flash_3")
>>> control_image = load_image(
... "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union/resolve/main/asset/pose.jpg?download=true"
... )
>>> prompt = "一位年轻女子站在阳光明媚的海岸线上,白裙在轻拂的海风中微微飘动。她拥有一头鲜艳的紫色长发,在风中轻盈舞动,发间系着一个精致的黑色蝴蝶结,与身后柔和的蔚蓝天空形成鲜明对比。她面容清秀,眉目精致,透着一股甜美的青春气息;神情柔和,略带羞涩,目光静静地凝望着远方的地平线,双手自然交叠于身前,仿佛沉浸在思绪之中。在她身后,是辽阔无垠、波光粼粼的大海,阳光洒在海面上,映出温暖的金色光晕。"
>>> image = pipe(
... prompt,
... control_image=control_image,
... controlnet_conditioning_scale=0.75,
... height=1728,
... width=992,
... num_inference_steps=9,
... guidance_scale=0.0,
... generator=torch.Generator("cuda").manual_seed(43),
... ).images[0]
>>> image.save("zimage.png")
```
"""
# Copied from diffusers.pipelines.flux.pipeline_flux.calculate_shift
def calculate_shift(
image_seq_len,
base_seq_len: int = 256,
max_seq_len: int = 4096,
base_shift: float = 0.5,
max_shift: float = 1.15,
):
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
b = base_shift - m * base_seq_len
mu = image_seq_len * m + b
return mu
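# Illustrative check with the defaults above: calculate_shift(256) == base_shift == 0.5 and
# calculate_shift(4096) == max_shift == 1.15; sequence lengths in between interpolate linearly.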
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
):
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
return encoder_output.latent_dist.sample(generator)
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
return encoder_output.latent_dist.mode()
elif hasattr(encoder_output, "latents"):
return encoder_output.latents
else:
raise AttributeError("Could not access latents of provided encoder_output")
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: Optional[int] = None,
device: Optional[Union[str, torch.device]] = None,
timesteps: Optional[List[int]] = None,
sigmas: Optional[List[float]] = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`List[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`List[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
class ZImageControlNetPipeline(DiffusionPipeline, FromSingleFileMixin):
model_cpu_offload_seq = "text_encoder->transformer->vae"
_optional_components = []
_callback_tensor_inputs = ["latents", "prompt_embeds"]
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKL,
text_encoder: PreTrainedModel,
tokenizer: AutoTokenizer,
transformer: ZImageTransformer2DModel,
controlnet: ZImageControlNetModel,
):
super().__init__()
controlnet = ZImageControlNetModel.from_transformer(controlnet, transformer)
self.register_modules(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
scheduler=scheduler,
transformer=transformer,
controlnet=controlnet,
)
self.vae_scale_factor = (
2 ** (len(self.vae.config.block_out_channels) - 1) if hasattr(self, "vae") and self.vae is not None else 8
)
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
def encode_prompt(
self,
prompt: Union[str, List[str]],
device: Optional[torch.device] = None,
do_classifier_free_guidance: bool = True,
negative_prompt: Optional[Union[str, List[str]]] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
max_sequence_length: int = 512,
):
prompt = [prompt] if isinstance(prompt, str) else prompt
prompt_embeds = self._encode_prompt(
prompt=prompt,
device=device,
prompt_embeds=prompt_embeds,
max_sequence_length=max_sequence_length,
)
if do_classifier_free_guidance:
if negative_prompt is None:
negative_prompt = ["" for _ in prompt]
else:
negative_prompt = [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
assert len(prompt) == len(negative_prompt)
negative_prompt_embeds = self._encode_prompt(
prompt=negative_prompt,
device=device,
prompt_embeds=negative_prompt_embeds,
max_sequence_length=max_sequence_length,
)
else:
negative_prompt_embeds = []
return prompt_embeds, negative_prompt_embeds
def _encode_prompt(
self,
prompt: Union[str, List[str]],
device: Optional[torch.device] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
max_sequence_length: int = 512,
) -> List[torch.FloatTensor]:
device = device or self._execution_device
if prompt_embeds is not None:
return prompt_embeds
if isinstance(prompt, str):
prompt = [prompt]
for i, prompt_item in enumerate(prompt):
messages = [
{"role": "user", "content": prompt_item},
]
prompt_item = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
prompt[i] = prompt_item
text_inputs = self.tokenizer(
prompt,
padding="max_length",
max_length=max_sequence_length,
truncation=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids.to(device)
prompt_masks = text_inputs.attention_mask.to(device).bool()
prompt_embeds = self.text_encoder(
input_ids=text_input_ids,
attention_mask=prompt_masks,
output_hidden_states=True,
).hidden_states[-2]
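# keep only the embeddings of non-padding tokens, yielding one variable-length tensor per prompt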
embeddings_list = []
for i in range(len(prompt_embeds)):
embeddings_list.append(prompt_embeds[i][prompt_masks[i]])
return embeddings_list
def prepare_latents(
self,
batch_size,
num_channels_latents,
height,
width,
dtype,
device,
generator,
latents=None,
):
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
shape = (batch_size, num_channels_latents, height, width)
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
if latents.shape != shape:
raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
latents = latents.to(device)
return latents
# Copied from diffusers.pipelines.controlnet_sd3.pipeline_stable_diffusion_3_controlnet.StableDiffusion3ControlNetPipeline.prepare_image
def prepare_image(
self,
image,
width,
height,
batch_size,
num_images_per_prompt,
device,
dtype,
do_classifier_free_guidance=False,
guess_mode=False,
):
if isinstance(image, torch.Tensor):
pass
else:
image = self.image_processor.preprocess(image, height=height, width=width)
image_batch_size = image.shape[0]
if image_batch_size == 1:
repeat_by = batch_size
else:
# image batch size is the same as prompt batch size
repeat_by = num_images_per_prompt
image = image.repeat_interleave(repeat_by, dim=0)
image = image.to(device=device, dtype=dtype)
if do_classifier_free_guidance and not guess_mode:
image = torch.cat([image] * 2)
return image
@property
def guidance_scale(self):
return self._guidance_scale
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1
@property
def joint_attention_kwargs(self):
return self._joint_attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@property
def interrupt(self):
return self._interrupt
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
prompt: Union[str, List[str]] = None,
height: Optional[int] = None,
width: Optional[int] = None,
num_inference_steps: int = 50,
sigmas: Optional[List[float]] = None,
guidance_scale: float = 5.0,
control_image: PipelineImageInput = None,
controlnet_conditioning_scale: Union[float, List[float]] = 0.75,
cfg_normalization: bool = False,
cfg_truncation: float = 1.0,
negative_prompt: Optional[Union[str, List[str]]] = None,
num_images_per_prompt: Optional[int] = 1,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
negative_prompt_embeds: Optional[List[torch.FloatTensor]] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 512,
):
r"""
Function invoked when calling the pipeline for generation.
Args:
prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
instead.
height (`int`, *optional*, defaults to 1024):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 1024):
The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
sigmas (`List[float]`, *optional*):
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
will be used.
guidance_scale (`float`, *optional*, defaults to 5.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. A higher guidance scale encourages the model to generate images that are closely linked to the
text `prompt`, usually at the expense of lower image quality.
control_image (`PipelineImageInput`, *optional*):
The ControlNet conditioning image used to guide the generation.
controlnet_conditioning_scale (`float` or `List[float]`, *optional*, defaults to 0.75):
The scale applied to the ControlNet residuals before they are added to the transformer blocks.
cfg_normalization (`bool`, *optional*, defaults to `False`):
Whether to renormalize the classifier-free-guided prediction so that its norm does not exceed the
norm of the conditional prediction.
cfg_truncation (`float`, *optional*, defaults to 1.0):
Normalized time (0 at the start of sampling, 1 at the end) after which classifier-free guidance is
skipped.
negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
prompt_embeds (`List[torch.FloatTensor]`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`List[torch.FloatTensor]`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.z_image.ZImagePipelineOutput`] instead of a plain
tuple.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
callback_on_step_end (`Callable`, *optional*):
A function that is called at the end of each denoising step during inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
`callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (`List`, *optional*):
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (`int`, *optional*, defaults to 512):
Maximum sequence length to use with the `prompt`.
Examples:
Returns:
[`~pipelines.z_image.ZImagePipelineOutput`] or `tuple`: [`~pipelines.z_image.ZImagePipelineOutput`] if
`return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the
generated images.
"""
height = height or 1024
width = width or 1024
vae_scale = self.vae_scale_factor * 2
if height % vae_scale != 0:
raise ValueError(
f"Height must be divisible by {vae_scale} (got {height}). "
f"Please adjust the height to a multiple of {vae_scale}."
)
if width % vae_scale != 0:
raise ValueError(
f"Width must be divisible by {vae_scale} (got {width}). "
f"Please adjust the width to a multiple of {vae_scale}."
)
device = self._execution_device
self._guidance_scale = guidance_scale
self._joint_attention_kwargs = joint_attention_kwargs
self._interrupt = False
self._cfg_normalization = cfg_normalization
self._cfg_truncation = cfg_truncation
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = len(prompt_embeds)
# If prompt_embeds is provided and prompt is None, skip encoding
if prompt_embeds is not None and prompt is None:
if self.do_classifier_free_guidance and negative_prompt_embeds is None:
raise ValueError(
"When `prompt_embeds` is provided without `prompt`, "
"`negative_prompt_embeds` must also be provided for classifier-free guidance."
)
else:
(
prompt_embeds,
negative_prompt_embeds,
) = self.encode_prompt(
prompt=prompt,
negative_prompt=negative_prompt,
do_classifier_free_guidance=self.do_classifier_free_guidance,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
device=device,
max_sequence_length=max_sequence_length,
)
# 4. Prepare latent variables
num_channels_latents = self.transformer.in_channels
control_image = self.prepare_image(
image=control_image,
width=width,
height=height,
batch_size=batch_size * num_images_per_prompt,
num_images_per_prompt=num_images_per_prompt,
device=device,
dtype=self.vae.dtype,
)
height, width = control_image.shape[-2:]
control_image = retrieve_latents(self.vae.encode(control_image), generator=generator, sample_mode="argmax")
control_image = (control_image - self.vae.config.shift_factor) * self.vae.config.scaling_factor
control_image = control_image.unsqueeze(2)
if num_channels_latents != self.controlnet.config.control_in_dim:
# For model version 2.0
control_image = torch.cat(
[
control_image,
torch.zeros(
control_image.shape[0],
self.controlnet.config.control_in_dim - num_channels_latents,
*control_image.shape[2:],
).to(device=control_image.device, dtype=control_image.dtype),
],
dim=1,
)
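# the 2.0 checkpoints reserve extra control channels (used by the inpaint pipeline for the mask and
# masked-image latents); for pure control conditioning those extra channels are zero-filled above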
latents = self.prepare_latents(
batch_size * num_images_per_prompt,
num_channels_latents,
height,
width,
torch.float32,
device,
generator,
latents,
)
# Repeat prompt_embeds for num_images_per_prompt
if num_images_per_prompt > 1:
prompt_embeds = [pe for pe in prompt_embeds for _ in range(num_images_per_prompt)]
if self.do_classifier_free_guidance and negative_prompt_embeds:
negative_prompt_embeds = [npe for npe in negative_prompt_embeds for _ in range(num_images_per_prompt)]
actual_batch_size = batch_size * num_images_per_prompt
image_seq_len = (latents.shape[2] // 2) * (latents.shape[3] // 2)
# 5. Prepare timesteps
mu = calculate_shift(
image_seq_len,
self.scheduler.config.get("base_image_seq_len", 256),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.5),
self.scheduler.config.get("max_shift", 1.15),
)
self.scheduler.sigma_min = 0.0
scheduler_kwargs = {"mu": mu}
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
**scheduler_kwargs,
)
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
# 6. Denoising loop
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
if self.interrupt:
continue
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latents.shape[0])
timestep = (1000 - timestep) / 1000
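# the scheduler's timesteps decrease from ~1000 toward 0; flip and normalize them into the ascending
# time in [0, 1] consumed by the transformer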
# Normalized time for time-aware config (0 at start, 1 at end)
t_norm = timestep[0].item()
# Handle cfg truncation
current_guidance_scale = self.guidance_scale
if (
self.do_classifier_free_guidance
and self._cfg_truncation is not None
and float(self._cfg_truncation) <= 1
):
if t_norm > self._cfg_truncation:
current_guidance_scale = 0.0
# Run CFG only if configured AND scale is non-zero
apply_cfg = self.do_classifier_free_guidance and current_guidance_scale > 0
if apply_cfg:
latents_typed = latents.to(self.transformer.dtype)
latent_model_input = latents_typed.repeat(2, 1, 1, 1)
prompt_embeds_model_input = prompt_embeds + negative_prompt_embeds
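# prompt_embeds / negative_prompt_embeds are Python lists of per-sample tensors, so `+` concatenates
# them: conditional samples first, then unconditional ones (split apart again after the forward pass)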
timestep_model_input = timestep.repeat(2)
else:
latent_model_input = latents.to(self.transformer.dtype)
prompt_embeds_model_input = prompt_embeds
timestep_model_input = timestep
latent_model_input = latent_model_input.unsqueeze(2)
latent_model_input_list = list(latent_model_input.unbind(dim=0))
controlnet_block_samples = self.controlnet(
latent_model_input_list,
timestep_model_input,
prompt_embeds_model_input,
control_image,
conditioning_scale=controlnet_conditioning_scale,
)
model_out_list = self.transformer(
latent_model_input_list,
timestep_model_input,
prompt_embeds_model_input,
controlnet_block_samples=controlnet_block_samples,
)[0]
if apply_cfg:
# Perform CFG
pos_out = model_out_list[:actual_batch_size]
neg_out = model_out_list[actual_batch_size:]
noise_pred = []
for j in range(actual_batch_size):
pos = pos_out[j].float()
neg = neg_out[j].float()
pred = pos + current_guidance_scale * (pos - neg)
# Renormalization
if self._cfg_normalization and float(self._cfg_normalization) > 0.0:
ori_pos_norm = torch.linalg.vector_norm(pos)
new_pos_norm = torch.linalg.vector_norm(pred)
max_new_norm = ori_pos_norm * float(self._cfg_normalization)
if new_pos_norm > max_new_norm:
pred = pred * (max_new_norm / new_pos_norm)
noise_pred.append(pred)
noise_pred = torch.stack(noise_pred, dim=0)
else:
noise_pred = torch.stack([t.float() for t in model_out_list], dim=0)
noise_pred = noise_pred.squeeze(2)
noise_pred = -noise_pred
# compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(noise_pred.to(torch.float32), t, latents, return_dict=False)[0]
assert latents.dtype == torch.float32
if callback_on_step_end is not None:
callback_kwargs = {}
for k in callback_on_step_end_tensor_inputs:
callback_kwargs[k] = locals()[k]
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
latents = callback_outputs.pop("latents", latents)
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if output_type == "latent":
image = latents
else:
latents = latents.to(self.vae.dtype)
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
image = self.vae.decode(latents, return_dict=False)[0]
image = self.image_processor.postprocess(image, output_type=output_type)
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (image,)
return ZImagePipelineOutput(images=image)

View File

@@ -0,0 +1,747 @@
# Copyright 2025 Alibaba Z-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
from typing import Any, Callable, Dict, List, Optional, Union
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, PreTrainedModel
from ...image_processor import PipelineImageInput, VaeImageProcessor
from ...loaders import FromSingleFileMixin
from ...models.autoencoders import AutoencoderKL
from ...models.controlnets import ZImageControlNetModel
from ...models.transformers import ZImageTransformer2DModel
from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from .pipeline_output import ZImagePipelineOutput
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from diffusers import ZImageControlNetInpaintPipeline
>>> from diffusers import ZImageControlNetModel
>>> from diffusers.utils import load_image
>>> from huggingface_hub import hf_hub_download
>>> controlnet = ZImageControlNetModel.from_single_file(
... hf_hub_download(
... "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
... filename="Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors",
... ),
... torch_dtype=torch.bfloat16,
... )
>>> # 2.0 - `config` is required
>>> # controlnet = ZImageControlNetModel.from_single_file(
>>> # hf_hub_download(
>>> # "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
>>> # filename="Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors",
>>> # ),
>>> # torch_dtype=torch.bfloat16,
>>> # config="hlky/Z-Image-Turbo-Fun-Controlnet-Union-2.0",
>>> # )
>>> pipe = ZImageControlNetInpaintPipeline.from_pretrained(
... "Tongyi-MAI/Z-Image-Turbo", controlnet=controlnet, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> # Optionally, set the attention backend to flash-attn 2 or 3, default is SDPA in PyTorch.
>>> # (1) Use flash attention 2
>>> # pipe.transformer.set_attention_backend("flash")
>>> # (2) Use flash attention 3
>>> # pipe.transformer.set_attention_backend("_flash_3")
>>> image = load_image(
... "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/inpaint.jpg?download=true"
... )
>>> mask_image = load_image(
... "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/mask.jpg?download=true"
... )
>>> control_image = load_image(
... "https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0/resolve/main/asset/pose.jpg?download=true"
... )
>>> prompt = "一位年轻女子站在阳光明媚的海岸线上,画面为全身竖构图,身体微微侧向右侧,左手自然下垂,右臂弯曲扶在腰间,她的手指清晰可见,站姿放松而略带羞涩。她身穿轻盈的白色连衣裙,裙摆在海风中轻轻飘动,布料半透、质感柔软。女子拥有一头鲜艳的及腰紫色长发,被海风吹起,在身侧轻盈飞舞,发间系着一个精致的黑色蝴蝶结,与发色形成对比。她面容清秀,眉目精致,肤色白皙细腻,表情温柔略显羞涩,微微低头,眼神静静望向远处的海平线,流露出甜美的青春气息与若有所思的神情。背景是辽阔无垠的海洋与蔚蓝天空,阳光从侧前方洒下,海面波光粼粼,泛着温暖的金色光晕,天空清澈明亮,云朵稀薄,整体色调清新唯美。"
>>> image = pipe(
... prompt,
... image=image,
... mask_image=mask_image,
... control_image=control_image,
... controlnet_conditioning_scale=0.75,
... height=1728,
... width=992,
... num_inference_steps=25,
... guidance_scale=0.0,
... generator=torch.Generator("cuda").manual_seed(43),
... ).images[0]
>>> image.save("zimage-inpaint.png")
```
"""
# Copied from diffusers.pipelines.flux.pipeline_flux.calculate_shift
def calculate_shift(
image_seq_len,
base_seq_len: int = 256,
max_seq_len: int = 4096,
base_shift: float = 0.5,
max_shift: float = 1.15,
):
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
b = base_shift - m * base_seq_len
mu = image_seq_len * m + b
return mu
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
):
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
return encoder_output.latent_dist.sample(generator)
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
return encoder_output.latent_dist.mode()
elif hasattr(encoder_output, "latents"):
return encoder_output.latents
else:
raise AttributeError("Could not access latents of provided encoder_output")
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: Optional[int] = None,
device: Optional[Union[str, torch.device]] = None,
timesteps: Optional[List[int]] = None,
sigmas: Optional[List[float]] = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`List[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`List[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
class ZImageControlNetInpaintPipeline(DiffusionPipeline, FromSingleFileMixin):
model_cpu_offload_seq = "text_encoder->transformer->vae"
_optional_components = []
_callback_tensor_inputs = ["latents", "prompt_embeds"]
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKL,
text_encoder: PreTrainedModel,
tokenizer: AutoTokenizer,
transformer: ZImageTransformer2DModel,
controlnet: ZImageControlNetModel,
):
super().__init__()
if transformer.in_channels == controlnet.config.control_in_dim:
raise ValueError(
"ZImageControlNetInpaintPipeline is not compatible with `alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union`, use `alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0`."
)
controlnet = ZImageControlNetModel.from_transformer(controlnet, transformer)
self.register_modules(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
scheduler=scheduler,
transformer=transformer,
controlnet=controlnet,
)
self.vae_scale_factor = (
2 ** (len(self.vae.config.block_out_channels) - 1) if hasattr(self, "vae") and self.vae is not None else 8
)
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
self.mask_processor = VaeImageProcessor(
vae_scale_factor=self.vae_scale_factor, do_normalize=False, do_binarize=True, do_convert_grayscale=True
)
def encode_prompt(
self,
prompt: Union[str, List[str]],
device: Optional[torch.device] = None,
do_classifier_free_guidance: bool = True,
negative_prompt: Optional[Union[str, List[str]]] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
max_sequence_length: int = 512,
):
prompt = [prompt] if isinstance(prompt, str) else prompt
prompt_embeds = self._encode_prompt(
prompt=prompt,
device=device,
prompt_embeds=prompt_embeds,
max_sequence_length=max_sequence_length,
)
if do_classifier_free_guidance:
if negative_prompt is None:
negative_prompt = ["" for _ in prompt]
else:
negative_prompt = [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
assert len(prompt) == len(negative_prompt)
negative_prompt_embeds = self._encode_prompt(
prompt=negative_prompt,
device=device,
prompt_embeds=negative_prompt_embeds,
max_sequence_length=max_sequence_length,
)
else:
negative_prompt_embeds = []
return prompt_embeds, negative_prompt_embeds
def _encode_prompt(
self,
prompt: Union[str, List[str]],
device: Optional[torch.device] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
max_sequence_length: int = 512,
) -> List[torch.FloatTensor]:
device = device or self._execution_device
if prompt_embeds is not None:
return prompt_embeds
if isinstance(prompt, str):
prompt = [prompt]
for i, prompt_item in enumerate(prompt):
messages = [
{"role": "user", "content": prompt_item},
]
prompt_item = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
prompt[i] = prompt_item
text_inputs = self.tokenizer(
prompt,
padding="max_length",
max_length=max_sequence_length,
truncation=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids.to(device)
prompt_masks = text_inputs.attention_mask.to(device).bool()
prompt_embeds = self.text_encoder(
input_ids=text_input_ids,
attention_mask=prompt_masks,
output_hidden_states=True,
).hidden_states[-2]
embeddings_list = []
for i in range(len(prompt_embeds)):
embeddings_list.append(prompt_embeds[i][prompt_masks[i]])
return embeddings_list
def prepare_latents(
self,
batch_size,
num_channels_latents,
height,
width,
dtype,
device,
generator,
latents=None,
):
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
shape = (batch_size, num_channels_latents, height, width)
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
if latents.shape != shape:
raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
latents = latents.to(device)
return latents
# Copied from diffusers.pipelines.controlnet_sd3.pipeline_stable_diffusion_3_controlnet.StableDiffusion3ControlNetPipeline.prepare_image
def prepare_image(
self,
image,
width,
height,
batch_size,
num_images_per_prompt,
device,
dtype,
do_classifier_free_guidance=False,
guess_mode=False,
):
if isinstance(image, torch.Tensor):
pass
else:
image = self.image_processor.preprocess(image, height=height, width=width)
image_batch_size = image.shape[0]
if image_batch_size == 1:
repeat_by = batch_size
else:
# image batch size is the same as prompt batch size
repeat_by = num_images_per_prompt
image = image.repeat_interleave(repeat_by, dim=0)
image = image.to(device=device, dtype=dtype)
if do_classifier_free_guidance and not guess_mode:
image = torch.cat([image] * 2)
return image
@property
def guidance_scale(self):
return self._guidance_scale
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1
@property
def joint_attention_kwargs(self):
return self._joint_attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@property
def interrupt(self):
return self._interrupt
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
prompt: Union[str, List[str]] = None,
height: Optional[int] = None,
width: Optional[int] = None,
num_inference_steps: int = 50,
sigmas: Optional[List[float]] = None,
guidance_scale: float = 5.0,
image: PipelineImageInput = None,
mask_image: PipelineImageInput = None,
control_image: PipelineImageInput = None,
controlnet_conditioning_scale: Union[float, List[float]] = 0.75,
cfg_normalization: bool = False,
cfg_truncation: float = 1.0,
negative_prompt: Optional[Union[str, List[str]]] = None,
num_images_per_prompt: Optional[int] = 1,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
negative_prompt_embeds: Optional[List[torch.FloatTensor]] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 512,
):
r"""
Function invoked when calling the pipeline for generation.
Args:
prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
instead.
height (`int`, *optional*, defaults to 1024):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 1024):
The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
sigmas (`List[float]`, *optional*):
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
will be used.
guidance_scale (`float`, *optional*, defaults to 5.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. A higher guidance scale encourages the model to generate images that are closely linked to the
text `prompt`, usually at the expense of lower image quality.
image (`PipelineImageInput`, *optional*):
The image to be inpainted; the masked region is regenerated while the rest is preserved.
mask_image (`PipelineImageInput`, *optional*):
The mask image: white (non-zero) pixels mark the region to repaint, black pixels are kept.
control_image (`PipelineImageInput`, *optional*):
The ControlNet conditioning image used to guide the generation.
controlnet_conditioning_scale (`float` or `List[float]`, *optional*, defaults to 0.75):
The scale applied to the ControlNet residuals before they are added to the transformer blocks.
cfg_normalization (`bool`, *optional*, defaults to `False`):
Whether to renormalize the classifier-free-guided prediction so that its norm does not exceed the
norm of the conditional prediction.
cfg_truncation (`float`, *optional*, defaults to 1.0):
Normalized time (0 at the start of sampling, 1 at the end) after which classifier-free guidance is
skipped.
negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
prompt_embeds (`List[torch.FloatTensor]`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`List[torch.FloatTensor]`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.z_image.ZImagePipelineOutput`] instead of a plain
tuple.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
callback_on_step_end (`Callable`, *optional*):
A function that is called at the end of each denoising step during inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
`callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (`List`, *optional*):
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (`int`, *optional*, defaults to 512):
Maximum sequence length to use with the `prompt`.
Examples:
Returns:
[`~pipelines.z_image.ZImagePipelineOutput`] or `tuple`: [`~pipelines.z_image.ZImagePipelineOutput`] if
`return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the
generated images.
"""
height = height or 1024
width = width or 1024
vae_scale = self.vae_scale_factor * 2
if height % vae_scale != 0:
raise ValueError(
f"Height must be divisible by {vae_scale} (got {height}). "
f"Please adjust the height to a multiple of {vae_scale}."
)
if width % vae_scale != 0:
raise ValueError(
f"Width must be divisible by {vae_scale} (got {width}). "
f"Please adjust the width to a multiple of {vae_scale}."
)
device = self._execution_device
self._guidance_scale = guidance_scale
self._joint_attention_kwargs = joint_attention_kwargs
self._interrupt = False
self._cfg_normalization = cfg_normalization
self._cfg_truncation = cfg_truncation
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = len(prompt_embeds)
# If prompt_embeds is provided and prompt is None, skip encoding
if prompt_embeds is not None and prompt is None:
if self.do_classifier_free_guidance and negative_prompt_embeds is None:
raise ValueError(
"When `prompt_embeds` is provided without `prompt`, "
"`negative_prompt_embeds` must also be provided for classifier-free guidance."
)
else:
(
prompt_embeds,
negative_prompt_embeds,
) = self.encode_prompt(
prompt=prompt,
negative_prompt=negative_prompt,
do_classifier_free_guidance=self.do_classifier_free_guidance,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
device=device,
max_sequence_length=max_sequence_length,
)
# 4. Prepare latent variables
num_channels_latents = self.transformer.in_channels
control_image = self.prepare_image(
image=control_image,
width=width,
height=height,
batch_size=batch_size * num_images_per_prompt,
num_images_per_prompt=num_images_per_prompt,
device=device,
dtype=self.vae.dtype,
)
height, width = control_image.shape[-2:]
control_image = retrieve_latents(self.vae.encode(control_image), generator=generator, sample_mode="argmax")
control_image = (control_image - self.vae.config.shift_factor) * self.vae.config.scaling_factor
control_image = control_image.unsqueeze(2)
mask_condition = self.mask_processor.preprocess(mask_image, height=height, width=width)
mask_condition = torch.tile(mask_condition, [1, 3, 1, 1]).to(
device=control_image.device, dtype=control_image.dtype
)
init_image = self.prepare_image(
image=image,
width=width,
height=height,
batch_size=batch_size * num_images_per_prompt,
num_images_per_prompt=num_images_per_prompt,
device=device,
dtype=self.vae.dtype,
)
height, width = init_image.shape[-2:]
init_image = init_image * (mask_condition < 0.5)
init_image = retrieve_latents(self.vae.encode(init_image), generator=generator, sample_mode="argmax")
init_image = (init_image - self.vae.config.shift_factor) * self.vae.config.scaling_factor
init_image = init_image.unsqueeze(2)
mask_condition = F.interpolate(1 - mask_condition[:, :1], size=init_image.size()[-2:], mode="nearest").to(
device=control_image.device, dtype=control_image.dtype
)
mask_condition = mask_condition.unsqueeze(2)
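# Channel-stack the ControlNet conditioning: control-image latents, the inverted mask resized to latent
# resolution, and the latents of the masked input image.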
control_image = torch.cat([control_image, mask_condition, init_image], dim=1)
latents = self.prepare_latents(
batch_size * num_images_per_prompt,
num_channels_latents,
height,
width,
torch.float32,
device,
generator,
latents,
)
# Repeat prompt_embeds for num_images_per_prompt
if num_images_per_prompt > 1:
prompt_embeds = [pe for pe in prompt_embeds for _ in range(num_images_per_prompt)]
if self.do_classifier_free_guidance and negative_prompt_embeds:
negative_prompt_embeds = [npe for npe in negative_prompt_embeds for _ in range(num_images_per_prompt)]
actual_batch_size = batch_size * num_images_per_prompt
image_seq_len = (latents.shape[2] // 2) * (latents.shape[3] // 2)
# 5. Prepare timesteps
mu = calculate_shift(
image_seq_len,
self.scheduler.config.get("base_image_seq_len", 256),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.5),
self.scheduler.config.get("max_shift", 1.15),
)
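# `mu` shifts the flow-matching sigma schedule as a function of the latent sequence length
# (between base_shift and max_shift) and is handed to the scheduler via `scheduler_kwargs` below.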
self.scheduler.sigma_min = 0.0
scheduler_kwargs = {"mu": mu}
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
**scheduler_kwargs,
)
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
# 6. Denoising loop
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
if self.interrupt:
continue
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latents.shape[0])
timestep = (1000 - timestep) / 1000
# Normalized time for time-aware config (0 at start, 1 at end)
t_norm = timestep[0].item()
# Handle cfg truncation
current_guidance_scale = self.guidance_scale
if (
self.do_classifier_free_guidance
and self._cfg_truncation is not None
and float(self._cfg_truncation) <= 1
):
if t_norm > self._cfg_truncation:
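# t_norm runs from 0 to 1 over the denoising loop, so steps past the truncation point drop guidance entirely.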
current_guidance_scale = 0.0
# Run CFG only if configured AND scale is non-zero
apply_cfg = self.do_classifier_free_guidance and current_guidance_scale > 0
if apply_cfg:
latents_typed = latents.to(self.transformer.dtype)
latent_model_input = latents_typed.repeat(2, 1, 1, 1)
prompt_embeds_model_input = prompt_embeds + negative_prompt_embeds
timestep_model_input = timestep.repeat(2)
else:
latent_model_input = latents.to(self.transformer.dtype)
prompt_embeds_model_input = prompt_embeds
timestep_model_input = timestep
latent_model_input = latent_model_input.unsqueeze(2)
latent_model_input_list = list(latent_model_input.unbind(dim=0))
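# The transformer and ControlNet here consume a list of per-sample latent tensors, hence the unbind over the batch dimension.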
controlnet_block_samples = self.controlnet(
latent_model_input_list,
timestep_model_input,
prompt_embeds_model_input,
control_image,
conditioning_scale=controlnet_conditioning_scale,
)
model_out_list = self.transformer(
latent_model_input_list,
timestep_model_input,
prompt_embeds_model_input,
controlnet_block_samples=controlnet_block_samples,
)[0]
if apply_cfg:
# Perform CFG
pos_out = model_out_list[:actual_batch_size]
neg_out = model_out_list[actual_batch_size:]
noise_pred = []
for j in range(actual_batch_size):
pos = pos_out[j].float()
neg = neg_out[j].float()
pred = pos + current_guidance_scale * (pos - neg)
# Renormalization
if self._cfg_normalization and float(self._cfg_normalization) > 0.0:
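# Cap the guided prediction's norm at `cfg_normalization` times the norm of the conditional prediction.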
ori_pos_norm = torch.linalg.vector_norm(pos)
new_pos_norm = torch.linalg.vector_norm(pred)
max_new_norm = ori_pos_norm * float(self._cfg_normalization)
if new_pos_norm > max_new_norm:
pred = pred * (max_new_norm / new_pos_norm)
noise_pred.append(pred)
noise_pred = torch.stack(noise_pred, dim=0)
else:
noise_pred = torch.stack([t.float() for t in model_out_list], dim=0)
noise_pred = noise_pred.squeeze(2)
noise_pred = -noise_pred
# compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(noise_pred.to(torch.float32), t, latents, return_dict=False)[0]
assert latents.dtype == torch.float32
if callback_on_step_end is not None:
callback_kwargs = {}
for k in callback_on_step_end_tensor_inputs:
callback_kwargs[k] = locals()[k]
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
latents = callback_outputs.pop("latents", latents)
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if output_type == "latent":
image = latents
else:
latents = latents.to(self.vae.dtype)
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
image = self.vae.decode(latents, return_dict=False)[0]
image = self.image_processor.postprocess(image, output_type=output_type)
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (image,)
return ZImagePipelineOutput(images=image)
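
For orientation, here is a minimal usage sketch of the inpainting ControlNet pipeline whose `__call__` ends above. The checkpoint id is a placeholder and the argument names simply mirror the signature used in this file; treat it as illustrative rather than an official example.

import torch
from diffusers import ZImageControlNetInpaintPipeline
from diffusers.utils import load_image

# Placeholder repo id -- substitute a real Z-Image ControlNet inpaint checkpoint.
pipe = ZImageControlNetInpaintPipeline.from_pretrained(
    "path/to/z-image-controlnet-inpaint", torch_dtype=torch.bfloat16
).to("cuda")

init_image = load_image("input.png")       # image to be partially repainted
mask_image = load_image("mask.png")        # white regions are repainted (see the masking above)
control_image = load_image("control.png")  # spatial conditioning for the ControlNet

result = pipe(
    prompt="a cat sitting on a wooden bench",
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    num_inference_steps=28,
    guidance_scale=4.0,
    controlnet_conditioning_scale=1.0,
).images[0]
result.save("output.png")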


@@ -1132,6 +1132,21 @@ class LatteTransformer3DModel(metaclass=DummyObject):
requires_backends(cls, ["torch"])
class LongCatImageTransformer2DModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class LTXVideoTransformer3DModel(metaclass=DummyObject):
_backends = ["torch"]
@@ -1762,6 +1777,21 @@ class WanVACETransformer3DModel(metaclass=DummyObject):
requires_backends(cls, ["torch"])
class ZImageControlNetModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class ZImageTransformer2DModel(metaclass=DummyObject):
_backends = ["torch"]


@@ -1832,6 +1832,36 @@ class LEditsPPPipelineStableDiffusionXL(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class LongCatImageEditPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class LongCatImagePipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class LTXConditionPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
@@ -2267,6 +2297,21 @@ class QwenImageInpaintPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class QwenImageLayeredPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class QwenImagePipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
@@ -3812,6 +3857,36 @@ class WuerstchenPriorPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class ZImageControlNetInpaintPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class ZImageControlNetPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class ZImageImg2ImgPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]


@@ -13,12 +13,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import unittest
import pytest
import torch
from transformers import AutoTokenizer, UMT5EncoderModel
from diffusers import AuraFlowPipeline, AuraFlowTransformer2DModel, FlowMatchEulerDiscreteScheduler
from diffusers import (
AuraFlowPipeline,
AuraFlowTransformer2DModel,
FlowMatchEulerDiscreteScheduler,
)
from ..testing_utils import (
floats_tensor,
@@ -36,7 +40,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
class TestAuraFlowLoRA(PeftLoraLoaderMixinTests):
class AuraFlowLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = AuraFlowPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -99,34 +103,34 @@ class TestAuraFlowLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
@pytest.mark.skip("Not supported in AuraFlow.")
@unittest.skip("Not supported in AuraFlow.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in AuraFlow.")
@unittest.skip("Not supported in AuraFlow.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in AuraFlow.")
@unittest.skip("Not supported in AuraFlow.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in AuraFlow.")
@unittest.skip("Text encoder LoRA is not supported in AuraFlow.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in AuraFlow.")
@unittest.skip("Text encoder LoRA is not supported in AuraFlow.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in AuraFlow.")
@unittest.skip("Text encoder LoRA is not supported in AuraFlow.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in AuraFlow.")
@unittest.skip("Text encoder LoRA is not supported in AuraFlow.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in AuraFlow.")
@unittest.skip("Text encoder LoRA is not supported in AuraFlow.")
def test_simple_inference_with_text_lora_save_load(self):
pass
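
The pattern applied throughout these LoRA test files: pytest-style test classes become `unittest.TestCase` subclasses again, and `pytest.mark.skip` markers become `unittest.skip`. A condensed before/after sketch (the `PeftLoraLoaderMixinTests` mixin is omitted here for brevity):

import unittest

import pytest

# pytest-style, as removed in this diff
class TestAuraFlowLoRA:
    @pytest.mark.skip("Not supported in AuraFlow.")
    def test_modify_padding_mode(self):
        pass

# unittest-style, as added in this diff
class AuraFlowLoRATests(unittest.TestCase):
    @unittest.skip("Not supported in AuraFlow.")
    def test_modify_padding_mode(self):
        pass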


@@ -13,9 +13,10 @@
# limitations under the License.
import sys
import unittest
import pytest
import torch
from parameterized import parameterized
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import (
@@ -38,7 +39,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
class TestCogVideoXLoRA(PeftLoraLoaderMixinTests):
class CogVideoXLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = CogVideoXPipeline
scheduler_cls = CogVideoXDPMScheduler
scheduler_kwargs = {"timestep_spacing": "trailing"}
@@ -118,59 +119,54 @@ class TestCogVideoXLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3, pipe=pipe)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
super().test_simple_inference_with_text_denoiser_lora_unfused(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self):
super().test_simple_inference_with_text_denoiser_lora_unfused(expected_atol=9e-3)
def test_lora_scale_kwargs_match_fusion(self, base_pipe_output):
super().test_lora_scale_kwargs_match_fusion(
base_pipe_output=base_pipe_output, expected_atol=9e-3, expected_rtol=9e-3
)
def test_lora_scale_kwargs_match_fusion(self):
super().test_lora_scale_kwargs_match_fusion(expected_atol=9e-3, expected_rtol=9e-3)
@pytest.mark.parametrize(
"offload_type, use_stream",
[("block_level", True), ("leaf_level", False)],
)
@parameterized.expand([("block_level", True), ("leaf_level", False)])
@require_torch_accelerator
def test_group_offloading_inference_denoiser(self, offload_type, use_stream, tmpdirname, pipe):
def test_group_offloading_inference_denoiser(self, offload_type, use_stream):
# TODO: We don't run the (leaf_level, True) test here that is enabled for other models.
# The reason for this can be found here: https://github.com/huggingface/diffusers/pull/11804#issuecomment-3013325338
super()._test_group_offloading_inference_denoiser(offload_type, use_stream, tmpdirname, pipe)
super()._test_group_offloading_inference_denoiser(offload_type, use_stream)
@pytest.mark.skip("Not supported in CogVideoX.")
@unittest.skip("Not supported in CogVideoX.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in CogVideoX.")
@unittest.skip("Not supported in CogVideoX.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in CogVideoX.")
@unittest.skip("Not supported in CogVideoX.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogVideoX.")
@unittest.skip("Text encoder LoRA is not supported in CogVideoX.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogVideoX.")
@unittest.skip("Text encoder LoRA is not supported in CogVideoX.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogVideoX.")
@unittest.skip("Text encoder LoRA is not supported in CogVideoX.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogVideoX.")
@unittest.skip("Text encoder LoRA is not supported in CogVideoX.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogVideoX.")
@unittest.skip("Text encoder LoRA is not supported in CogVideoX.")
def test_simple_inference_with_text_lora_save_load(self):
pass
@pytest.mark.skip("Not supported in CogVideoX.")
@unittest.skip("Not supported in CogVideoX.")
def test_simple_inference_with_text_denoiser_multi_adapter_block_lora(self):
pass
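
The same conversion replaces `pytest.mark.parametrize` with `parameterized.expand` from the `parameterized` package imported at the top of these files. A condensed sketch with a hypothetical minimal test class:

import unittest

from parameterized import parameterized

class GroupOffloadingParamTests(unittest.TestCase):
    @parameterized.expand([("block_level", True), ("leaf_level", False)])
    def test_offload_combinations(self, offload_type, use_stream):
        # each tuple produces one generated test case; values are passed positionally
        self.assertIn(offload_type, {"block_level", "leaf_level"})
        self.assertIsInstance(use_stream, bool)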


@@ -13,9 +13,12 @@
# limitations under the License.
import sys
import tempfile
import unittest
import pytest
import numpy as np
import torch
from parameterized import parameterized
from transformers import AutoTokenizer, GlmModel
from diffusers import AutoencoderKL, CogView4Pipeline, CogView4Transformer2DModel, FlowMatchEulerDiscreteScheduler
@@ -25,6 +28,7 @@ from ..testing_utils import (
require_peft_backend,
require_torch_accelerator,
skip_mps,
torch_device,
)
@@ -43,7 +47,7 @@ class TokenizerWrapper:
@require_peft_backend
@skip_mps
class TestCogView4LoRA(PeftLoraLoaderMixinTests):
class CogView4LoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = CogView4Pipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -109,50 +113,72 @@ class TestCogView4LoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
super().test_simple_inference_with_text_denoiser_lora_unfused(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self):
super().test_simple_inference_with_text_denoiser_lora_unfused(expected_atol=9e-3)
@pytest.mark.parametrize(
"offload_type, use_stream",
[("block_level", True), ("leaf_level", False)],
)
def test_simple_inference_save_pretrained(self):
"""
Tests a simple use case where users could use saving utilities for LoRA through save_pretrained
"""
components, _, _ = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
images_lora = pipe(**inputs, generator=torch.manual_seed(0))[0]
with tempfile.TemporaryDirectory() as tmpdirname:
pipe.save_pretrained(tmpdirname)
pipe_from_pretrained = self.pipeline_class.from_pretrained(tmpdirname)
pipe_from_pretrained.to(torch_device)
images_lora_save_pretrained = pipe_from_pretrained(**inputs, generator=torch.manual_seed(0))[0]
self.assertTrue(
np.allclose(images_lora, images_lora_save_pretrained, atol=1e-3, rtol=1e-3),
"Loading from saved checkpoints should give same results.",
)
@parameterized.expand([("block_level", True), ("leaf_level", False)])
@require_torch_accelerator
def test_group_offloading_inference_denoiser(self, offload_type, use_stream, tmpdirname, pipe):
def test_group_offloading_inference_denoiser(self, offload_type, use_stream):
# TODO: We don't run the (leaf_level, True) test here that is enabled for other models.
# The reason for this can be found here: https://github.com/huggingface/diffusers/pull/11804#issuecomment-3013325338
super()._test_group_offloading_inference_denoiser(offload_type, use_stream, tmpdirname, pipe)
super()._test_group_offloading_inference_denoiser(offload_type, use_stream)
@pytest.mark.skip("Not supported in CogView4.")
@unittest.skip("Not supported in CogView4.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in CogView4.")
@unittest.skip("Not supported in CogView4.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in CogView4.")
@unittest.skip("Not supported in CogView4.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogView4.")
@unittest.skip("Text encoder LoRA is not supported in CogView4.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogView4.")
@unittest.skip("Text encoder LoRA is not supported in CogView4.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogView4.")
@unittest.skip("Text encoder LoRA is not supported in CogView4.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogView4.")
@unittest.skip("Text encoder LoRA is not supported in CogView4.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in CogView4.")
@unittest.skip("Text encoder LoRA is not supported in CogView4.")
def test_simple_inference_with_text_lora_save_load(self):
pass


@@ -16,11 +16,13 @@ import copy
import gc
import os
import sys
import tempfile
import unittest
import numpy as np
import pytest
import safetensors.torch
import torch
from parameterized import parameterized
from PIL import Image
from transformers import AutoTokenizer, CLIPTextModel, CLIPTokenizer, T5EncoderModel
@@ -44,12 +46,14 @@ from ..testing_utils import (
if is_peft_available():
from peft.utils import get_peft_model_state_dict
sys.path.append(".")
from .utils import PeftLoraLoaderMixinTests, check_if_lora_correctly_set
from .utils import PeftLoraLoaderMixinTests, check_if_lora_correctly_set # noqa: E402
@require_peft_backend
class TestFluxLoRA(PeftLoraLoaderMixinTests):
class FluxLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = FluxPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -111,134 +115,165 @@ class TestFluxLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_with_alpha_in_state_dict(self, tmpdirname, pipe):
_, _, denoiser_lora_config = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
def test_with_alpha_in_state_dict(self):
components, _, denoiser_lora_config = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
pipe.transformer.add_adapter(denoiser_lora_config)
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer")
images_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
denoiser_state_dict = get_peft_model_state_dict(pipe.transformer)
self.pipeline_class.save_lora_weights(tmpdirname, transformer_lora_layers=denoiser_state_dict)
with tempfile.TemporaryDirectory() as tmpdirname:
denoiser_state_dict = get_peft_model_state_dict(pipe.transformer)
self.pipeline_class.save_lora_weights(tmpdirname, transformer_lora_layers=denoiser_state_dict)
assert os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
pipe.unload_lora_weights()
pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
self.assertTrue(os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors")))
pipe.unload_lora_weights()
pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
# modify the state dict to have alpha values following
# https://huggingface.co/TheLastBen/Jon_Snow_Flux_LoRA/blob/main/jon_snow.safetensors
state_dict_with_alpha = safetensors.torch.load_file(
os.path.join(tmpdirname, "pytorch_lora_weights.safetensors")
)
alpha_dict = {}
for k, v in state_dict_with_alpha.items():
if "transformer" in k and "to_k" in k and ("lora_A" in k):
alpha_dict[f"{k}.alpha"] = float(torch.randint(10, 100, size=()))
state_dict_with_alpha.update(alpha_dict)
# modify the state dict to have alpha values following
# https://huggingface.co/TheLastBen/Jon_Snow_Flux_LoRA/blob/main/jon_snow.safetensors
state_dict_with_alpha = safetensors.torch.load_file(
os.path.join(tmpdirname, "pytorch_lora_weights.safetensors")
)
alpha_dict = {}
for k, v in state_dict_with_alpha.items():
# only do for `transformer` and for the k projections -- should be enough to test.
if "transformer" in k and "to_k" in k and "lora_A" in k:
alpha_dict[f"{k}.alpha"] = float(torch.randint(10, 100, size=()))
state_dict_with_alpha.update(alpha_dict)
images_lora_from_pretrained = pipe(**inputs, generator=torch.manual_seed(0)).images
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
pipe.unload_lora_weights()
pipe.load_lora_weights(state_dict_with_alpha)
images_lora_with_alpha = pipe(**inputs, generator=torch.manual_seed(0)).images
assert np.allclose(images_lora, images_lora_from_pretrained, atol=0.001, rtol=0.001), (
"Loading from saved checkpoints should give same results."
)
assert not np.allclose(images_lora_with_alpha, images_lora, atol=0.001, rtol=0.001)
def test_lora_expansion_works_for_absent_keys(self, base_pipe_output, tmpdirname, pipe):
_, _, denoiser_lora_config = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
self.assertTrue(
np.allclose(images_lora, images_lora_from_pretrained, atol=1e-3, rtol=1e-3),
"Loading from saved checkpoints should give same results.",
)
self.assertFalse(np.allclose(images_lora_with_alpha, images_lora, atol=1e-3, rtol=1e-3))
def test_lora_expansion_works_for_absent_keys(self):
components, _, denoiser_lora_config = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
output_no_lora = self.get_base_pipe_output()
# Modify the config to have a layer which won't be present in the second LoRA we will load.
modified_denoiser_lora_config = copy.deepcopy(denoiser_lora_config)
modified_denoiser_lora_config.target_modules.add("x_embedder")
pipe.transformer.add_adapter(modified_denoiser_lora_config)
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer")
images_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
assert not np.allclose(images_lora, base_pipe_output, atol=0.001, rtol=0.001), (
"LoRA should lead to different results."
self.assertFalse(
np.allclose(images_lora, output_no_lora, atol=1e-3, rtol=1e-3),
"LoRA should lead to different results.",
)
denoiser_state_dict = get_peft_model_state_dict(pipe.transformer)
self.pipeline_class.save_lora_weights(tmpdirname, transformer_lora_layers=denoiser_state_dict)
assert os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
with tempfile.TemporaryDirectory() as tmpdirname:
denoiser_state_dict = get_peft_model_state_dict(pipe.transformer)
self.pipeline_class.save_lora_weights(tmpdirname, transformer_lora_layers=denoiser_state_dict)
pipe.unload_lora_weights()
pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"), adapter_name="one")
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
lora_state_dict_without_xembedder = {k: v for (k, v) in lora_state_dict.items() if "x_embedder" not in k}
self.assertTrue(os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors")))
pipe.unload_lora_weights()
pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"), adapter_name="one")
# Modify the state dict to exclude "x_embedder" related LoRA params.
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
lora_state_dict_without_xembedder = {k: v for k, v in lora_state_dict.items() if "x_embedder" not in k}
pipe.load_lora_weights(lora_state_dict_without_xembedder, adapter_name="two")
pipe.set_adapters(["one", "two"])
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer")
images_lora_with_absent_keys = pipe(**inputs, generator=torch.manual_seed(0)).images
assert not np.allclose(images_lora, images_lora_with_absent_keys, atol=0.001, rtol=0.001), (
"Different LoRAs should lead to different results."
self.assertFalse(
np.allclose(images_lora, images_lora_with_absent_keys, atol=1e-3, rtol=1e-3),
"Different LoRAs should lead to different results.",
)
assert not np.allclose(base_pipe_output, images_lora_with_absent_keys, atol=0.001, rtol=0.001), (
"LoRA should lead to different results."
self.assertFalse(
np.allclose(output_no_lora, images_lora_with_absent_keys, atol=1e-3, rtol=1e-3),
"LoRA should lead to different results.",
)
def test_lora_expansion_works_for_extra_keys(self, base_pipe_output, tmpdirname, pipe):
_, _, denoiser_lora_config = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
def test_lora_expansion_works_for_extra_keys(self):
components, _, denoiser_lora_config = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
output_no_lora = self.get_base_pipe_output()
# Modify the config to have a layer which won't be present in the first LoRA we will load.
modified_denoiser_lora_config = copy.deepcopy(denoiser_lora_config)
modified_denoiser_lora_config.target_modules.add("x_embedder")
pipe.transformer.add_adapter(modified_denoiser_lora_config)
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer")
images_lora = pipe(**inputs, generator=torch.manual_seed(0)).images
assert not np.allclose(images_lora, base_pipe_output, atol=0.001, rtol=0.001), (
"LoRA should lead to different results."
self.assertFalse(
np.allclose(images_lora, output_no_lora, atol=1e-3, rtol=1e-3),
"LoRA should lead to different results.",
)
denoiser_state_dict = get_peft_model_state_dict(pipe.transformer)
self.pipeline_class.save_lora_weights(tmpdirname, transformer_lora_layers=denoiser_state_dict)
assert os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
with tempfile.TemporaryDirectory() as tmpdirname:
denoiser_state_dict = get_peft_model_state_dict(pipe.transformer)
self.pipeline_class.save_lora_weights(tmpdirname, transformer_lora_layers=denoiser_state_dict)
pipe.unload_lora_weights()
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
lora_state_dict_without_xembedder = {k: v for (k, v) in lora_state_dict.items() if "x_embedder" not in k}
pipe.load_lora_weights(lora_state_dict_without_xembedder, adapter_name="one")
pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"), adapter_name="two")
self.assertTrue(os.path.isfile(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors")))
pipe.unload_lora_weights()
# Modify the state dict to exclude "x_embedder" related LoRA params.
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
lora_state_dict_without_xembedder = {k: v for k, v in lora_state_dict.items() if "x_embedder" not in k}
pipe.load_lora_weights(lora_state_dict_without_xembedder, adapter_name="one")
# Load state dict with `x_embedder`.
pipe.load_lora_weights(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"), adapter_name="two")
pipe.set_adapters(["one", "two"])
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in transformer")
images_lora_with_extra_keys = pipe(**inputs, generator=torch.manual_seed(0)).images
assert not np.allclose(images_lora, images_lora_with_extra_keys, atol=0.001, rtol=0.001), (
"Different LoRAs should lead to different results."
self.assertFalse(
np.allclose(images_lora, images_lora_with_extra_keys, atol=1e-3, rtol=1e-3),
"Different LoRAs should lead to different results.",
)
assert not np.allclose(base_pipe_output, images_lora_with_extra_keys, atol=0.001, rtol=0.001), (
"LoRA should lead to different results."
self.assertFalse(
np.allclose(output_no_lora, images_lora_with_extra_keys, atol=1e-3, rtol=1e-3),
"LoRA should lead to different results.",
)
@pytest.mark.skip("Not supported in Flux.")
@unittest.skip("Not supported in Flux.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Flux.")
@unittest.skip("Not supported in Flux.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Flux.")
@unittest.skip("Not supported in Flux.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Not supported in Flux.")
@unittest.skip("Not supported in Flux.")
def test_simple_inference_with_text_denoiser_multi_adapter_block_lora(self):
pass
class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
class FluxControlLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = FluxControlPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -303,7 +338,12 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_with_norm_in_state_dict(self, pipe):
def test_with_norm_in_state_dict(self):
components, _, denoiser_lora_config = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
logger = logging.get_logger("diffusers.loaders.lora_pipeline")
@@ -324,32 +364,39 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
pipe.load_lora_weights(norm_state_dict)
lora_load_output = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert (
self.assertTrue(
"The provided state dict contains normalization layers in addition to LoRA layers"
in cap_logger.out
)
assert len(pipe.transformer._transformer_norm_layers) > 0
self.assertTrue(len(pipe.transformer._transformer_norm_layers) > 0)
pipe.unload_lora_weights()
lora_unload_output = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert pipe.transformer._transformer_norm_layers is None
assert np.allclose(original_output, lora_unload_output, atol=1e-05, rtol=1e-05)
assert not np.allclose(original_output, lora_load_output, atol=1e-06, rtol=1e-06), (
f"{norm_layer} is tested"
self.assertTrue(pipe.transformer._transformer_norm_layers is None)
self.assertTrue(np.allclose(original_output, lora_unload_output, atol=1e-5, rtol=1e-5))
self.assertFalse(
np.allclose(original_output, lora_load_output, atol=1e-6, rtol=1e-6), f"{norm_layer} is tested"
)
with CaptureLogger(logger) as cap_logger:
for key in list(norm_state_dict.keys()):
norm_state_dict[key.replace("norm", "norm_k_something_random")] = norm_state_dict.pop(key)
pipe.load_lora_weights(norm_state_dict)
assert "Unsupported keys found in state dict when trying to load normalization layers" in cap_logger.out
def test_lora_parameter_expanded_shapes(self, pipe):
self.assertTrue(
"Unsupported keys found in state dict when trying to load normalization layers" in cap_logger.out
)
def test_lora_parameter_expanded_shapes(self):
components, _, _ = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
original_out = pipe(**inputs, generator=torch.manual_seed(0))[0]
logger = logging.get_logger("diffusers.loaders.lora_pipeline")
logger.setLevel(logging.DEBUG)
@@ -358,21 +405,24 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
transformer = FluxTransformer2DModel.from_config(
components["transformer"].config, in_channels=num_channels_without_control
).to(torch_device)
assert transformer.config.in_channels == num_channels_without_control, (
f"Expected {num_channels_without_control} channels in the modified transformer but has transformer.config.in_channels={transformer.config.in_channels!r}"
self.assertTrue(
transformer.config.in_channels == num_channels_without_control,
f"Expected {num_channels_without_control} channels in the modified transformer but has {transformer.config.in_channels=}",
)
original_transformer_state_dict = pipe.transformer.state_dict()
x_embedder_weight = original_transformer_state_dict.pop("x_embedder.weight")
incompatible_keys = transformer.load_state_dict(original_transformer_state_dict, strict=False)
assert "x_embedder.weight" in incompatible_keys.missing_keys, (
"Could not find x_embedder.weight in the missing keys."
self.assertTrue(
"x_embedder.weight" in incompatible_keys.missing_keys,
"Could not find x_embedder.weight in the missing keys.",
)
transformer.x_embedder.weight.data.copy_(x_embedder_weight[..., :num_channels_without_control])
pipe.transformer = transformer
out_features, in_features = pipe.transformer.x_embedder.weight.shape
rank = 4
dummy_lora_A = torch.nn.Linear(2 * in_features, rank, bias=False)
dummy_lora_B = torch.nn.Linear(rank, out_features, bias=False)
lora_state_dict = {
@@ -381,13 +431,15 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
}
with CaptureLogger(logger) as cap_logger:
pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
lora_out = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(original_out, lora_out, rtol=0.0001, atol=0.0001)
assert pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features
assert pipe.transformer.config.in_channels == 2 * in_features
assert cap_logger.out.startswith("Expanding the nn.Linear input/output features for module")
self.assertFalse(np.allclose(original_out, lora_out, rtol=1e-4, atol=1e-4))
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features)
self.assertTrue(pipe.transformer.config.in_channels == 2 * in_features)
self.assertTrue(cap_logger.out.startswith("Expanding the nn.Linear input/output features for module"))
# Testing opposite direction where the LoRA params are zero-padded.
components, _, _ = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
@@ -402,13 +454,15 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
}
with CaptureLogger(logger) as cap_logger:
pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
lora_out = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(original_out, lora_out, rtol=0.0001, atol=0.0001)
assert pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features
assert pipe.transformer.config.in_channels == 2 * in_features
assert "The following LoRA modules were zero padded to match the state dict of" in cap_logger.out
self.assertFalse(np.allclose(original_out, lora_out, rtol=1e-4, atol=1e-4))
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features)
self.assertTrue(pipe.transformer.config.in_channels == 2 * in_features)
self.assertTrue("The following LoRA modules were zero padded to match the state dict of" in cap_logger.out)
def test_normal_lora_with_expanded_lora_raises_error(self):
# Test the following situation. Load a regular LoRA (such as the ones trained on Flux.1-Dev). And then
@@ -440,28 +494,32 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
}
with CaptureLogger(logger) as cap_logger:
pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
assert pipe.get_active_adapters() == ["adapter-1"]
assert pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features
assert pipe.transformer.config.in_channels == 2 * in_features
assert cap_logger.out.startswith("Expanding the nn.Linear input/output features for module")
(_, _, inputs) = self.get_dummy_inputs(with_generator=False)
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
self.assertTrue(pipe.get_active_adapters() == ["adapter-1"])
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features)
self.assertTrue(pipe.transformer.config.in_channels == 2 * in_features)
self.assertTrue(cap_logger.out.startswith("Expanding the nn.Linear input/output features for module"))
_, _, inputs = self.get_dummy_inputs(with_generator=False)
lora_output = pipe(**inputs, generator=torch.manual_seed(0))[0]
normal_lora_A = torch.nn.Linear(in_features, rank, bias=False)
normal_lora_B = torch.nn.Linear(rank, out_features, bias=False)
lora_state_dict = {
"transformer.x_embedder.lora_A.weight": normal_lora_A.weight,
"transformer.x_embedder.lora_B.weight": normal_lora_B.weight,
}
with CaptureLogger(logger) as cap_logger:
pipe.load_lora_weights(lora_state_dict, "adapter-2")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
assert "The following LoRA modules were zero padded to match the state dict of" in cap_logger.out
assert pipe.get_active_adapters() == ["adapter-2"]
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
self.assertTrue("The following LoRA modules were zero padded to match the state dict of" in cap_logger.out)
self.assertTrue(pipe.get_active_adapters() == ["adapter-2"])
lora_output_2 = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(lora_output, lora_output_2, atol=0.001, rtol=0.001)
self.assertFalse(np.allclose(lora_output, lora_output_2, atol=1e-3, rtol=1e-3))
# Test the opposite case where the first lora has the correct input features and the second lora has expanded input features.
# This should raise a runtime error on input shapes being incompatible.
@@ -482,24 +540,32 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
out_features, in_features = pipe.transformer.x_embedder.weight.shape
rank = 4
lora_state_dict = {
"transformer.x_embedder.lora_A.weight": normal_lora_A.weight,
"transformer.x_embedder.lora_B.weight": normal_lora_B.weight,
}
pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
assert pipe.transformer.x_embedder.weight.data.shape[1] == in_features
assert pipe.transformer.config.in_channels == in_features
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == in_features)
self.assertTrue(pipe.transformer.config.in_channels == in_features)
lora_state_dict = {
"transformer.x_embedder.lora_A.weight": shape_expander_lora_A.weight,
"transformer.x_embedder.lora_B.weight": shape_expander_lora_B.weight,
}
# We should check for input shapes being incompatible here. But because above mentioned issue is
# not a supported use case, and because of the PEFT renaming, we will currently have a shape
# mismatch error.
with pytest.raises(RuntimeError, match="size mismatch for x_embedder.lora_A.adapter-2.weight"):
pipe.load_lora_weights(lora_state_dict, "adapter-2")
self.assertRaisesRegex(
RuntimeError,
"size mismatch for x_embedder.lora_A.adapter-2.weight",
pipe.load_lora_weights,
lora_state_dict,
"adapter-2",
)
def test_fuse_expanded_lora_with_regular_lora(self):
# This test checks if it works when a lora with expanded shapes (like control loras) but
@@ -531,7 +597,7 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
"transformer.x_embedder.lora_B.weight": shape_expander_lora_B.weight,
}
pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
_, _, inputs = self.get_dummy_inputs(with_generator=False)
lora_output = pipe(**inputs, generator=torch.manual_seed(0))[0]
@@ -544,44 +610,54 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
}
pipe.load_lora_weights(lora_state_dict, "adapter-2")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
lora_output_2 = pipe(**inputs, generator=torch.manual_seed(0))[0]
pipe.set_adapters(["adapter-1", "adapter-2"], [1.0, 1.0])
lora_output_3 = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(lora_output, lora_output_2, atol=0.001, rtol=0.001)
assert not np.allclose(lora_output, lora_output_3, atol=0.001, rtol=0.001)
assert not np.allclose(lora_output_2, lora_output_3, atol=0.001, rtol=0.001)
self.assertFalse(np.allclose(lora_output, lora_output_2, atol=1e-3, rtol=1e-3))
self.assertFalse(np.allclose(lora_output, lora_output_3, atol=1e-3, rtol=1e-3))
self.assertFalse(np.allclose(lora_output_2, lora_output_3, atol=1e-3, rtol=1e-3))
pipe.fuse_lora(lora_scale=1.0, adapter_names=["adapter-1", "adapter-2"])
lora_output_4 = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert np.allclose(lora_output_3, lora_output_4, atol=0.001, rtol=0.001)
self.assertTrue(np.allclose(lora_output_3, lora_output_4, atol=1e-3, rtol=1e-3))
def test_load_regular_lora(self, base_pipe_output, pipe):
def test_load_regular_lora(self):
# This test checks if a regular lora (think of one trained on Flux.1 Dev for example) can be loaded
# into the transformer with more input channels than Flux.1 Dev, for example. Some examples of those
# transformers include Flux Fill, Flux Control, etc.
components, _, _ = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
original_output = pipe(**inputs, generator=torch.manual_seed(0))[0]
out_features, in_features = pipe.transformer.x_embedder.weight.shape
rank = 4
in_features = in_features // 2
in_features = in_features // 2 # to mimic the Flux.1-Dev LoRA.
normal_lora_A = torch.nn.Linear(in_features, rank, bias=False)
normal_lora_B = torch.nn.Linear(rank, out_features, bias=False)
lora_state_dict = {
"transformer.x_embedder.lora_A.weight": normal_lora_A.weight,
"transformer.x_embedder.lora_B.weight": normal_lora_B.weight,
}
logger = logging.get_logger("diffusers.loaders.lora_pipeline")
logger.setLevel(logging.INFO)
with CaptureLogger(logger) as cap_logger:
pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
lora_output = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert "The following LoRA modules were zero padded to match the state dict of" in cap_logger.out
assert pipe.transformer.x_embedder.weight.data.shape[1] == in_features * 2
assert not np.allclose(base_pipe_output, lora_output, atol=0.001, rtol=0.001)
self.assertTrue("The following LoRA modules were zero padded to match the state dict of" in cap_logger.out)
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == in_features * 2)
self.assertFalse(np.allclose(original_output, lora_output, atol=1e-3, rtol=1e-3))
def test_lora_unload_with_parameter_expanded_shapes(self):
components, _, _ = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
@@ -594,8 +670,9 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
transformer = FluxTransformer2DModel.from_config(
components["transformer"].config, in_channels=num_channels_without_control
).to(torch_device)
assert transformer.config.in_channels == num_channels_without_control, (
f"Expected {num_channels_without_control} channels in the modified transformer but has transformer.config.in_channels={transformer.config.in_channels!r}"
self.assertTrue(
transformer.config.in_channels == num_channels_without_control,
f"Expected {num_channels_without_control} channels in the modified transformer but has {transformer.config.in_channels=}",
)
# This should be initialized with a Flux pipeline variant that doesn't accept `control_image`.
@@ -620,31 +697,33 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
}
with CaptureLogger(logger) as cap_logger:
control_pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
inputs["control_image"] = control_image
lora_out = control_pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(original_out, lora_out, rtol=0.0001, atol=0.0001)
assert pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features
assert pipe.transformer.config.in_channels == 2 * in_features
assert cap_logger.out.startswith("Expanding the nn.Linear input/output features for module")
self.assertFalse(np.allclose(original_out, lora_out, rtol=1e-4, atol=1e-4))
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features)
self.assertTrue(pipe.transformer.config.in_channels == 2 * in_features)
self.assertTrue(cap_logger.out.startswith("Expanding the nn.Linear input/output features for module"))
control_pipe.unload_lora_weights(reset_to_overwritten_params=True)
assert control_pipe.transformer.config.in_channels == num_channels_without_control, (
f"Expected {num_channels_without_control} channels in the modified transformer but has control_pipe.transformer.config.in_channels={control_pipe.transformer.config.in_channels!r}"
self.assertTrue(
control_pipe.transformer.config.in_channels == num_channels_without_control,
f"Expected {num_channels_without_control} channels in the modified transformer but has {control_pipe.transformer.config.in_channels=}",
)
loaded_pipe = FluxPipeline.from_pipe(control_pipe)
assert loaded_pipe.transformer.config.in_channels == num_channels_without_control, (
f"Expected {num_channels_without_control} channels in the modified transformer but has loaded_pipe.transformer.config.in_channels={loaded_pipe.transformer.config.in_channels!r}"
self.assertTrue(
loaded_pipe.transformer.config.in_channels == num_channels_without_control,
f"Expected {num_channels_without_control} channels in the modified transformer but has {loaded_pipe.transformer.config.in_channels=}",
)
inputs.pop("control_image")
unloaded_lora_out = loaded_pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(unloaded_lora_out, lora_out, rtol=0.0001, atol=0.0001)
assert np.allclose(unloaded_lora_out, original_out, atol=0.0001, rtol=0.0001)
assert pipe.transformer.x_embedder.weight.data.shape[1] == in_features
assert pipe.transformer.config.in_channels == in_features
self.assertFalse(np.allclose(unloaded_lora_out, lora_out, rtol=1e-4, atol=1e-4))
self.assertTrue(np.allclose(unloaded_lora_out, original_out, atol=1e-4, rtol=1e-4))
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == in_features)
self.assertTrue(pipe.transformer.config.in_channels == in_features)
def test_lora_unload_with_parameter_expanded_shapes_and_no_reset(self):
components, _, _ = self.get_dummy_components(FlowMatchEulerDiscreteScheduler)
@@ -652,12 +731,14 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
logger = logging.get_logger("diffusers.loaders.lora_pipeline")
logger.setLevel(logging.DEBUG)
# Change the transformer config to mimic a real use case.
num_channels_without_control = 4
transformer = FluxTransformer2DModel.from_config(
components["transformer"].config, in_channels=num_channels_without_control
).to(torch_device)
assert transformer.config.in_channels == num_channels_without_control, (
f"Expected {num_channels_without_control} channels in the modified transformer but has transformer.config.in_channels={transformer.config.in_channels!r}"
self.assertTrue(
transformer.config.in_channels == num_channels_without_control,
f"Expected {num_channels_without_control} channels in the modified transformer but has {transformer.config.in_channels=}",
)
# This should be initialized with a Flux pipeline variant that doesn't accept `control_image`.
@@ -682,38 +763,40 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
}
with CaptureLogger(logger) as cap_logger:
control_pipe.load_lora_weights(lora_state_dict, "adapter-1")
assert check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser"
self.assertTrue(check_if_lora_correctly_set(pipe.transformer), "Lora not correctly set in denoiser")
inputs["control_image"] = control_image
lora_out = control_pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(original_out, lora_out, rtol=0.0001, atol=0.0001)
assert pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features
assert pipe.transformer.config.in_channels == 2 * in_features
assert cap_logger.out.startswith("Expanding the nn.Linear input/output features for module")
self.assertFalse(np.allclose(original_out, lora_out, rtol=1e-4, atol=1e-4))
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == 2 * in_features)
self.assertTrue(pipe.transformer.config.in_channels == 2 * in_features)
self.assertTrue(cap_logger.out.startswith("Expanding the nn.Linear input/output features for module"))
control_pipe.unload_lora_weights(reset_to_overwritten_params=False)
assert control_pipe.transformer.config.in_channels == 2 * num_channels_without_control, (
f"Expected {num_channels_without_control} channels in the modified transformer but has control_pipe.transformer.config.in_channels={control_pipe.transformer.config.in_channels!r}"
self.assertTrue(
control_pipe.transformer.config.in_channels == 2 * num_channels_without_control,
f"Expected {num_channels_without_control} channels in the modified transformer but has {control_pipe.transformer.config.in_channels=}",
)
no_lora_out = control_pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(no_lora_out, lora_out, rtol=0.0001, atol=0.0001)
assert pipe.transformer.x_embedder.weight.data.shape[1] == in_features * 2
assert pipe.transformer.config.in_channels == in_features * 2
@pytest.mark.skip("Not supported in Flux.")
self.assertFalse(np.allclose(no_lora_out, lora_out, rtol=1e-4, atol=1e-4))
self.assertTrue(pipe.transformer.x_embedder.weight.data.shape[1] == in_features * 2)
self.assertTrue(pipe.transformer.config.in_channels == in_features * 2)
@unittest.skip("Not supported in Flux.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Flux.")
@unittest.skip("Not supported in Flux.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Flux.")
@unittest.skip("Not supported in Flux.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Not supported in Flux.")
@unittest.skip("Not supported in Flux.")
def test_simple_inference_with_text_denoiser_multi_adapter_block_lora(self):
pass
@@ -723,7 +806,7 @@ class TestFluxControlLoRA(PeftLoraLoaderMixinTests):
@require_torch_accelerator
@require_peft_backend
@require_big_accelerator
class TestFluxLoRAIntegration:
class FluxLoRAIntegrationTests(unittest.TestCase):
"""internal note: The integration slices were obtained on audace.
torch: 2.6.0.dev20241006+cu124 with CUDA 12.5. Need the same setup for the
@@ -733,27 +816,33 @@ class TestFluxLoRAIntegration:
num_inference_steps = 10
seed = 0
@pytest.fixture(scope="function")
def pipeline(self):
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to(
torch_device
)
try:
yield pipe
finally:
del pipe
gc.collect()
backend_empty_cache(torch_device)
def test_flux_the_last_ben(self, pipeline):
pipeline.load_lora_weights("TheLastBen/Jon_Snow_Flux_LoRA", weight_name="jon_snow.safetensors")
pipeline.fuse_lora()
pipeline.unload_lora_weights()
pipeline = pipeline.to(torch_device)
self.pipeline = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
def tearDown(self):
super().tearDown()
del self.pipeline
gc.collect()
backend_empty_cache(torch_device)
def test_flux_the_last_ben(self):
self.pipeline.load_lora_weights("TheLastBen/Jon_Snow_Flux_LoRA", weight_name="jon_snow.safetensors")
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
# Instead of calling `enable_model_cpu_offload()`, we do accelerator placement here because the CI
# run supports it. We have about 34GB RAM in the CI runner which kills the test when run with
# `enable_model_cpu_offload()`. We repeat this for the other tests, too.
self.pipeline = self.pipeline.to(torch_device)
prompt = "jon snow eating pizza with ketchup"
out = pipeline(
out = self.pipeline(
prompt,
num_inference_steps=self.num_inference_steps,
guidance_scale=4.0,
@@ -762,57 +851,71 @@ class TestFluxLoRAIntegration:
).images
out_slice = out[0, -3:, -3:, -1].flatten()
expected_slice = np.array([0.1855, 0.1855, 0.1836, 0.1855, 0.1836, 0.1875, 0.1777, 0.1758, 0.2246])
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 0.001
def test_flux_kohya(self, pipeline):
pipeline.load_lora_weights("Norod78/brain-slug-flux")
pipeline.fuse_lora()
pipeline.unload_lora_weights()
pipeline = pipeline.to(torch_device)
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 1e-3
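Note: the slice checks in these tests compare a 9-value corner slice against a reference via numpy_cosine_similarity_distance. Below is a minimal standalone sketch of that kind of check, assuming the conventional "1 minus cosine similarity" definition; the actual helper in diffusers' testing utilities may differ in detail.
import numpy as np

def cosine_similarity_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - cosine similarity of the flattened arrays; 0.0 means identical direction.
    a, b = a.flatten(), b.flatten()
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos_sim)

expected = np.array([0.1855, 0.1855, 0.1836, 0.1855, 0.1836, 0.1875, 0.1777, 0.1758, 0.2246])
observed = expected + 1e-4  # a tiny uniform perturbation stays well under the 1e-3 threshold
assert cosine_similarity_distance(expected, observed) < 1e-3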
def test_flux_kohya(self):
self.pipeline.load_lora_weights("Norod78/brain-slug-flux")
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
self.pipeline = self.pipeline.to(torch_device)
prompt = "The cat with a brain slug earring"
out = pipeline(
out = self.pipeline(
prompt,
num_inference_steps=self.num_inference_steps,
guidance_scale=4.5,
output_type="np",
generator=torch.manual_seed(self.seed),
).images
out_slice = out[0, -3:, -3:, -1].flatten()
expected_slice = np.array([0.6367, 0.6367, 0.6328, 0.6367, 0.6328, 0.6289, 0.6367, 0.6328, 0.6484])
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 0.001
def test_flux_kohya_with_text_encoder(self, pipeline):
pipeline.load_lora_weights("cocktailpeanut/optimus", weight_name="optimus.safetensors")
pipeline.fuse_lora()
pipeline.unload_lora_weights()
pipeline = pipeline.to(torch_device)
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 1e-3
def test_flux_kohya_with_text_encoder(self):
self.pipeline.load_lora_weights("cocktailpeanut/optimus", weight_name="optimus.safetensors")
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
self.pipeline = self.pipeline.to(torch_device)
prompt = "optimus is cleaning the house with broomstick"
out = pipeline(
out = self.pipeline(
prompt,
num_inference_steps=self.num_inference_steps,
guidance_scale=4.5,
output_type="np",
generator=torch.manual_seed(self.seed),
).images
out_slice = out[0, -3:, -3:, -1].flatten()
expected_slice = np.array([0.4023, 0.4023, 0.4023, 0.3965, 0.3984, 0.3965, 0.3926, 0.3906, 0.4219])
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 0.001
def test_flux_kohya_embedders_conversion(self, pipeline):
assert max_diff < 1e-3
def test_flux_kohya_embedders_conversion(self):
"""Test that embedders load without throwing errors"""
pipeline.load_lora_weights("rockerBOO/flux-bpo-po-lora")
pipeline.unload_lora_weights()
self.pipeline.load_lora_weights("rockerBOO/flux-bpo-po-lora")
self.pipeline.unload_lora_weights()
assert True
def test_flux_xlabs(self):
self.pipeline.load_lora_weights("XLabs-AI/flux-lora-collection", weight_name="disney_lora.safetensors")
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
self.pipeline = self.pipeline.to(torch_device)
def test_flux_xlabs(self, pipeline):
pipeline.load_lora_weights("XLabs-AI/flux-lora-collection", weight_name="disney_lora.safetensors")
pipeline.fuse_lora()
pipeline.unload_lora_weights()
pipeline = pipeline.to(torch_device)
prompt = "A blue jay standing on a large basket of rainbow macarons, disney style"
out = pipeline(
out = self.pipeline(
prompt,
num_inference_steps=self.num_inference_steps,
guidance_scale=3.5,
@@ -820,17 +923,23 @@ class TestFluxLoRAIntegration:
generator=torch.manual_seed(self.seed),
).images
out_slice = out[0, -3:, -3:, -1].flatten()
expected_slice = np.array([0.3965, 0.418, 0.4434, 0.4082, 0.4375, 0.459, 0.4141, 0.4375, 0.498])
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 0.001
expected_slice = np.array([0.3965, 0.4180, 0.4434, 0.4082, 0.4375, 0.4590, 0.4141, 0.4375, 0.4980])
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 1e-3
def test_flux_xlabs_load_lora_with_single_blocks(self):
self.pipeline.load_lora_weights(
"salinasr/test_xlabs_flux_lora_with_singleblocks", weight_name="lora.safetensors"
)
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
self.pipeline.enable_model_cpu_offload()
def test_flux_xlabs_load_lora_with_single_blocks(self, pipeline):
pipeline.load_lora_weights("salinasr/test_xlabs_flux_lora_with_singleblocks", weight_name="lora.safetensors")
pipeline.fuse_lora()
pipeline.unload_lora_weights()
pipeline.enable_model_cpu_offload()
prompt = "a wizard mouse playing chess"
out = pipeline(
out = self.pipeline(
prompt,
num_inference_steps=self.num_inference_steps,
guidance_scale=3.5,
@@ -842,43 +951,40 @@ class TestFluxLoRAIntegration:
[0.04882812, 0.04101562, 0.04882812, 0.03710938, 0.02929688, 0.02734375, 0.0234375, 0.01757812, 0.0390625]
)
max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), out_slice)
assert max_diff < 0.001
assert max_diff < 1e-3
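For reference, a minimal sketch of the unittest-style scaffold these integration classes converge on: setUp/tearDown handle garbage collection and cache clearing, and parameterized.expand takes over from pytest.mark.parametrize. Class and checkpoint names below are placeholders, not the real tests.
import gc
import unittest

from parameterized import parameterized


class ExampleLoRAIntegrationTests(unittest.TestCase):
    def setUp(self):
        super().setUp()
        gc.collect()
        # The real tests also call backend_empty_cache(torch_device) here to free accelerator memory.

    def tearDown(self):
        super().tearDown()
        gc.collect()

    @parameterized.expand(["ckpt-a", "ckpt-b"])
    def test_lora(self, lora_ckpt_id):
        # Each expanded case receives one checkpoint id as a positional argument.
        self.assertIsInstance(lora_ckpt_id, str)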
@nightly
@require_torch_accelerator
@require_peft_backend
@require_big_accelerator
class TestFluxControlLoRAIntegration:
class FluxControlLoRAIntegrationTests(unittest.TestCase):
num_inference_steps = 10
seed = 0
prompt = "A robot made of exotic candies and chocolates of different kinds."
@pytest.fixture(scope="function")
def pipeline(self):
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to(
torch_device
)
try:
yield pipe
finally:
del pipe
gc.collect()
backend_empty_cache(torch_device)
@pytest.mark.parametrize(
"lora_ckpt_id",
[
"black-forest-labs/FLUX.1-Canny-dev-lora",
"black-forest-labs/FLUX.1-Depth-dev-lora",
],
)
def test_lora(self, pipeline, lora_ckpt_id):
pipeline.load_lora_weights(lora_ckpt_id)
pipeline.fuse_lora()
pipeline.unload_lora_weights()
self.pipeline = FluxControlPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to(torch_device)
def tearDown(self):
super().tearDown()
gc.collect()
backend_empty_cache(torch_device)
@parameterized.expand(["black-forest-labs/FLUX.1-Canny-dev-lora", "black-forest-labs/FLUX.1-Depth-dev-lora"])
def test_lora(self, lora_ckpt_id):
self.pipeline.load_lora_weights(lora_ckpt_id)
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
if "Canny" in lora_ckpt_id:
control_image = load_image(
@@ -889,7 +995,7 @@ class TestFluxControlLoRAIntegration:
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flux-control-lora/depth_condition_image.png"
)
image = pipeline(
image = self.pipeline(
prompt=self.prompt,
control_image=control_image,
height=1024,
@@ -910,18 +1016,12 @@ class TestFluxControlLoRAIntegration:
assert max_diff < 1e-3
@pytest.mark.parametrize(
"lora_ckpt_id",
[
"black-forest-labs/FLUX.1-Canny-dev-lora",
"black-forest-labs/FLUX.1-Depth-dev-lora",
],
)
def test_lora_with_turbo(self, pipeline, lora_ckpt_id):
pipeline.load_lora_weights(lora_ckpt_id)
pipeline.load_lora_weights("ByteDance/Hyper-SD", weight_name="Hyper-FLUX.1-dev-8steps-lora.safetensors")
pipeline.fuse_lora()
pipeline.unload_lora_weights()
@parameterized.expand(["black-forest-labs/FLUX.1-Canny-dev-lora", "black-forest-labs/FLUX.1-Depth-dev-lora"])
def test_lora_with_turbo(self, lora_ckpt_id):
self.pipeline.load_lora_weights(lora_ckpt_id)
self.pipeline.load_lora_weights("ByteDance/Hyper-SD", weight_name="Hyper-FLUX.1-dev-8steps-lora.safetensors")
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
if "Canny" in lora_ckpt_id:
control_image = load_image(


@@ -13,9 +13,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import unittest
import numpy as np
import pytest
import torch
from transformers import AutoProcessor, Mistral3ForConditionalGeneration
@@ -30,7 +30,7 @@ from .utils import PeftLoraLoaderMixinTests, check_if_lora_correctly_set # noqa
@require_peft_backend
class TestFlux2LoRA(PeftLoraLoaderMixinTests):
class Flux2LoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = Flux2Pipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -133,36 +133,36 @@ class TestFlux2LoRA(PeftLoraLoaderMixinTests):
pipe.fuse_lora(components=self.pipeline_class._lora_loadable_modules, safe_fusing=False)
out = pipe(**inputs)[0]
assert np.isnan(out).all()
self.assertTrue(np.isnan(out).all())
@pytest.mark.skip("Not supported in Flux2.")
@unittest.skip("Not supported in Flux2.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Flux2.")
@unittest.skip("Not supported in Flux2.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Flux2.")
@unittest.skip("Not supported in Flux2.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Flux2.")
@unittest.skip("Text encoder LoRA is not supported in Flux2.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Flux2.")
@unittest.skip("Text encoder LoRA is not supported in Flux2.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Flux2.")
@unittest.skip("Text encoder LoRA is not supported in Flux2.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Flux2.")
@unittest.skip("Text encoder LoRA is not supported in Flux2.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Flux2.")
@unittest.skip("Text encoder LoRA is not supported in Flux2.")
def test_simple_inference_with_text_lora_save_load(self):
pass


@@ -14,9 +14,9 @@
import gc
import sys
import unittest
import numpy as np
import pytest
import torch
from transformers import CLIPTextModel, CLIPTokenizer, LlamaModel, LlamaTokenizerFast
@@ -48,7 +48,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
@skip_mps
class TestHunyuanVideoLoRA(PeftLoraLoaderMixinTests):
class HunyuanVideoLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = HunyuanVideoPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -149,41 +149,46 @@ class TestHunyuanVideoLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
super().test_simple_inference_with_text_denoiser_lora_unfused(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self):
super().test_simple_inference_with_text_denoiser_lora_unfused(expected_atol=9e-3)
@pytest.mark.skip("Not supported in HunyuanVideo.")
# TODO(aryan): Fix the following test
@unittest.skip("This test fails with an error I haven't been able to debug yet.")
def test_simple_inference_save_pretrained(self):
pass
@unittest.skip("Not supported in HunyuanVideo.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in HunyuanVideo.")
@unittest.skip("Not supported in HunyuanVideo.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in HunyuanVideo.")
@unittest.skip("Not supported in HunyuanVideo.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in HunyuanVideo.")
@unittest.skip("Text encoder LoRA is not supported in HunyuanVideo.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in HunyuanVideo.")
@unittest.skip("Text encoder LoRA is not supported in HunyuanVideo.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in HunyuanVideo.")
@unittest.skip("Text encoder LoRA is not supported in HunyuanVideo.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in HunyuanVideo.")
@unittest.skip("Text encoder LoRA is not supported in HunyuanVideo.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in HunyuanVideo.")
@unittest.skip("Text encoder LoRA is not supported in HunyuanVideo.")
def test_simple_inference_with_text_lora_save_load(self):
pass
@@ -192,7 +197,7 @@ class TestHunyuanVideoLoRA(PeftLoraLoaderMixinTests):
@require_torch_accelerator
@require_peft_backend
@require_big_accelerator
class TestHunyuanVideoLoRAIntegration:
class HunyuanVideoLoRAIntegrationTests(unittest.TestCase):
"""internal note: The integration slices were obtained on DGX.
torch: 2.5.1+cu124 with CUDA 12.5. Need the same setup for the
@@ -202,8 +207,9 @@ class TestHunyuanVideoLoRAIntegration:
num_inference_steps = 10
seed = 0
@pytest.fixture(scope="function")
def pipeline(self):
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
@@ -211,27 +217,27 @@ class TestHunyuanVideoLoRAIntegration:
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16).to(
torch_device
)
try:
yield pipe
finally:
del pipe
gc.collect()
backend_empty_cache(torch_device)
self.pipeline = HunyuanVideoPipeline.from_pretrained(
model_id, transformer=transformer, torch_dtype=torch.float16
).to(torch_device)
def test_original_format_cseti(self, pipeline):
pipeline.load_lora_weights(
def tearDown(self):
super().tearDown()
gc.collect()
backend_empty_cache(torch_device)
def test_original_format_cseti(self):
self.pipeline.load_lora_weights(
"Cseti/HunyuanVideo-LoRA-Arcane_Jinx-v1", weight_name="csetiarcane-nfjinx-v1-6000.safetensors"
)
pipeline.fuse_lora()
pipeline.unload_lora_weights()
pipeline.vae.enable_tiling()
self.pipeline.fuse_lora()
self.pipeline.unload_lora_weights()
self.pipeline.vae.enable_tiling()
prompt = "CSETIARCANE. A cat walks on the grass, realistic"
out = pipeline(
out = self.pipeline(
prompt=prompt,
height=320,
width=512,


@@ -13,8 +13,8 @@
# limitations under the License.
import sys
import unittest
import pytest
import torch
from transformers import AutoTokenizer, T5EncoderModel
@@ -34,7 +34,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
class TestLTXVideoLoRA(PeftLoraLoaderMixinTests):
class LTXVideoLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = LTXPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -108,40 +108,40 @@ class TestLTXVideoLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
super().test_simple_inference_with_text_denoiser_lora_unfused(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self):
super().test_simple_inference_with_text_denoiser_lora_unfused(expected_atol=9e-3)
@pytest.mark.skip("Not supported in LTXVideo.")
@unittest.skip("Not supported in LTXVideo.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in LTXVideo.")
@unittest.skip("Not supported in LTXVideo.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in LTXVideo.")
@unittest.skip("Not supported in LTXVideo.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in LTXVideo.")
@unittest.skip("Text encoder LoRA is not supported in LTXVideo.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in LTXVideo.")
@unittest.skip("Text encoder LoRA is not supported in LTXVideo.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in LTXVideo.")
@unittest.skip("Text encoder LoRA is not supported in LTXVideo.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in LTXVideo.")
@unittest.skip("Text encoder LoRA is not supported in LTXVideo.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in LTXVideo.")
@unittest.skip("Text encoder LoRA is not supported in LTXVideo.")
def test_simple_inference_with_text_lora_save_load(self):
pass


@@ -13,6 +13,7 @@
# limitations under the License.
import sys
import unittest
import numpy as np
import pytest
@@ -35,7 +36,7 @@ from .utils import PeftLoraLoaderMixinTests, check_if_lora_correctly_set # noqa
@require_peft_backend
class TestLumina2LoRA(PeftLoraLoaderMixinTests):
class Lumina2LoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = Lumina2Pipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -100,35 +101,35 @@ class TestLumina2LoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
@pytest.mark.skip("Not supported in Lumina2.")
@unittest.skip("Not supported in Lumina2.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Lumina2.")
@unittest.skip("Not supported in Lumina2.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Lumina2.")
@unittest.skip("Not supported in Lumina2.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Lumina2.")
@unittest.skip("Text encoder LoRA is not supported in Lumina2.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Lumina2.")
@unittest.skip("Text encoder LoRA is not supported in Lumina2.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Lumina2.")
@unittest.skip("Text encoder LoRA is not supported in Lumina2.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Lumina2.")
@unittest.skip("Text encoder LoRA is not supported in Lumina2.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Lumina2.")
@unittest.skip("Text encoder LoRA is not supported in Lumina2.")
def test_simple_inference_with_text_lora_save_load(self):
pass
@@ -138,17 +139,20 @@ class TestLumina2LoRA(PeftLoraLoaderMixinTests):
reason="Test currently fails on CPU and PyTorch 2.5.1 but not on PyTorch 2.4.1.",
strict=False,
)
def test_lora_fuse_nan(self, pipe):
_, text_lora_config, denoiser_lora_config = self.get_dummy_components()
def test_lora_fuse_nan(self):
components, text_lora_config, denoiser_lora_config = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
if "text_encoder" in self.pipeline_class._lora_loadable_modules:
pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
assert check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
self.assertTrue(check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder")
denoiser = pipe.transformer if self.unet_kwargs is None else pipe.unet
denoiser.add_adapter(denoiser_lora_config, "adapter-1")
assert check_if_lora_correctly_set(denoiser), "Lora not correctly set in denoiser."
self.assertTrue(check_if_lora_correctly_set(denoiser), "Lora not correctly set in denoiser.")
# corrupt one LoRA weight with `inf` values
with torch.no_grad():
@@ -162,4 +166,4 @@ class TestLumina2LoRA(PeftLoraLoaderMixinTests):
pipe.fuse_lora(components=self.pipeline_class._lora_loadable_modules, safe_fusing=False)
out = pipe(**inputs)[0]
assert np.isnan(out).all()
self.assertTrue(np.isnan(out).all())
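The fuse-NaN tests above deliberately corrupt one LoRA weight with inf before fusing and then assert that the output is all NaN. A small self-contained sketch of why the corruption propagates, using a plain Linear layer as a stand-in for the denoiser (illustrative only, not the pipeline code):
import torch

base = torch.nn.Linear(4, 4, bias=False)
lora_a = torch.randn(2, 4)   # down-projection
lora_b = torch.zeros(4, 2)   # up-projection, then corrupted
lora_b[0, 0] = float("inf")

# "Fusing" folds the LoRA delta into the base weight: W <- W + B @ A
with torch.no_grad():
    base.weight += lora_b @ lora_a

out = base(torch.randn(1, 4))
assert not torch.isfinite(out).all()  # the inf entries surface as inf/NaN in the output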


@@ -13,8 +13,8 @@
# limitations under the License.
import sys
import unittest
import pytest
import torch
from transformers import AutoTokenizer, T5EncoderModel
@@ -34,7 +34,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
@skip_mps
class TestMochiLoRA(PeftLoraLoaderMixinTests):
class MochiLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = MochiPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -99,44 +99,44 @@ class TestMochiLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
super().test_simple_inference_with_text_denoiser_lora_unfused(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self):
super().test_simple_inference_with_text_denoiser_lora_unfused(expected_atol=9e-3)
@pytest.mark.skip("Not supported in Mochi.")
@unittest.skip("Not supported in Mochi.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Mochi.")
@unittest.skip("Not supported in Mochi.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Mochi.")
@unittest.skip("Not supported in Mochi.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Mochi.")
@unittest.skip("Text encoder LoRA is not supported in Mochi.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Mochi.")
@unittest.skip("Text encoder LoRA is not supported in Mochi.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Mochi.")
@unittest.skip("Text encoder LoRA is not supported in Mochi.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Mochi.")
@unittest.skip("Text encoder LoRA is not supported in Mochi.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Mochi.")
@unittest.skip("Text encoder LoRA is not supported in Mochi.")
def test_simple_inference_with_text_lora_save_load(self):
pass
@pytest.mark.skip("Not supported in CogVideoX.")
@unittest.skip("Not supported in CogVideoX.")
def test_simple_inference_with_text_denoiser_multi_adapter_block_lora(self):
pass


@@ -13,8 +13,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import unittest
import pytest
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2Tokenizer
@@ -34,7 +34,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
class TestQwenImageLoRA(PeftLoraLoaderMixinTests):
class QwenImageLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = QwenImagePipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -96,34 +96,34 @@ class TestQwenImageLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
@pytest.mark.skip("Not supported in Qwen Image.")
@unittest.skip("Not supported in Qwen Image.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Qwen Image.")
@unittest.skip("Not supported in Qwen Image.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Qwen Image.")
@unittest.skip("Not supported in Qwen Image.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Qwen Image.")
@unittest.skip("Text encoder LoRA is not supported in Qwen Image.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Qwen Image.")
@unittest.skip("Text encoder LoRA is not supported in Qwen Image.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Qwen Image.")
@unittest.skip("Text encoder LoRA is not supported in Qwen Image.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Qwen Image.")
@unittest.skip("Text encoder LoRA is not supported in Qwen Image.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Qwen Image.")
@unittest.skip("Text encoder LoRA is not supported in Qwen Image.")
def test_simple_inference_with_text_lora_save_load(self):
pass


@@ -13,8 +13,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import unittest
import pytest
import torch
from transformers import Gemma2Model, GemmaTokenizer
@@ -29,7 +29,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
class TestSanaLoRA(PeftLoraLoaderMixinTests):
class SanaLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = SanaPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {"shift": 7.0}
@@ -105,38 +105,38 @@ class TestSanaLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
@pytest.mark.skip("Not supported in SANA.")
@unittest.skip("Not supported in SANA.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Not supported in SANA.")
@unittest.skip("Not supported in SANA.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in SANA.")
@unittest.skip("Not supported in SANA.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in SANA.")
@unittest.skip("Text encoder LoRA is not supported in SANA.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in SANA.")
@unittest.skip("Text encoder LoRA is not supported in SANA.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in SANA.")
@unittest.skip("Text encoder LoRA is not supported in SANA.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in SANA.")
@unittest.skip("Text encoder LoRA is not supported in SANA.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in SANA.")
@unittest.skip("Text encoder LoRA is not supported in SANA.")
def test_simple_inference_with_text_lora_save_load(self):
pass
@pytest.mark.skipif(IS_GITHUB_ACTIONS, reason="Skipping test inside GitHub Actions environment")
@unittest.skipIf(IS_GITHUB_ACTIONS, reason="Skipping test inside GitHub Actions environment")
def test_layerwise_casting_inference_denoiser(self):
return super().test_layerwise_casting_inference_denoiser()


@@ -14,9 +14,9 @@
# limitations under the License.
import gc
import sys
import unittest
import numpy as np
import pytest
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
@@ -55,7 +55,7 @@ if is_accelerate_available():
from accelerate.utils import release_memory
class TestStableDiffusionLoRA(PeftLoraLoaderMixinTests):
class StableDiffusionLoRATests(PeftLoraLoaderMixinTests, unittest.TestCase):
pipeline_class = StableDiffusionPipeline
scheduler_cls = DDIMScheduler
scheduler_kwargs = {
@@ -91,6 +91,16 @@ class TestStableDiffusionLoRA(PeftLoraLoaderMixinTests):
def output_shape(self):
return (1, 64, 64, 3)
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
def tearDown(self):
super().tearDown()
gc.collect()
backend_empty_cache(torch_device)
# Keeping this test here makes sense because it doesn't involve any integration
# (value assertions on logits).
@slow
@@ -104,8 +114,15 @@ class TestStableDiffusionLoRA(PeftLoraLoaderMixinTests):
pipe.load_lora_weights(lora_id, adapter_name="adapter-2")
pipe = pipe.to(torch_device)
assert check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
assert check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in unet"
self.assertTrue(
check_if_lora_correctly_set(pipe.text_encoder),
"Lora not correctly set in text encoder",
)
self.assertTrue(
check_if_lora_correctly_set(pipe.unet),
"Lora not correctly set in unet",
)
# We will offload the first adapter in CPU and check if the offloading
# has been performed correctly
@@ -113,35 +130,35 @@ class TestStableDiffusionLoRA(PeftLoraLoaderMixinTests):
for name, module in pipe.unet.named_modules():
if "adapter-1" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device == torch.device("cpu")
self.assertTrue(module.weight.device == torch.device("cpu"))
elif "adapter-2" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device != torch.device("cpu")
self.assertTrue(module.weight.device != torch.device("cpu"))
for name, module in pipe.text_encoder.named_modules():
if "adapter-1" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device == torch.device("cpu")
self.assertTrue(module.weight.device == torch.device("cpu"))
elif "adapter-2" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device != torch.device("cpu")
self.assertTrue(module.weight.device != torch.device("cpu"))
pipe.set_lora_device(["adapter-1"], 0)
for n, m in pipe.unet.named_modules():
if "adapter-1" in n and not isinstance(m, (nn.Dropout, nn.Identity)):
assert m.weight.device != torch.device("cpu")
self.assertTrue(m.weight.device != torch.device("cpu"))
for n, m in pipe.text_encoder.named_modules():
if "adapter-1" in n and not isinstance(m, (nn.Dropout, nn.Identity)):
assert m.weight.device != torch.device("cpu")
self.assertTrue(m.weight.device != torch.device("cpu"))
pipe.set_lora_device(["adapter-1", "adapter-2"], torch_device)
for n, m in pipe.unet.named_modules():
if ("adapter-1" in n or "adapter-2" in n) and not isinstance(m, (nn.Dropout, nn.Identity)):
assert m.weight.device != torch.device("cpu")
self.assertTrue(m.weight.device != torch.device("cpu"))
for n, m in pipe.text_encoder.named_modules():
if ("adapter-1" in n or "adapter-2" in n) and not isinstance(m, (nn.Dropout, nn.Identity)):
assert m.weight.device != torch.device("cpu")
self.assertTrue(m.weight.device != torch.device("cpu"))
@slow
@require_torch_accelerator
@@ -164,9 +181,15 @@ class TestStableDiffusionLoRA(PeftLoraLoaderMixinTests):
pipe.unet.add_adapter(unet_lora_config, "adapter-1")
pipe.text_encoder.add_adapter(text_lora_config, "adapter-1")
assert check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
self.assertTrue(
check_if_lora_correctly_set(pipe.text_encoder),
"Lora not correctly set in text encoder",
)
assert check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in unet"
self.assertTrue(
check_if_lora_correctly_set(pipe.unet),
"Lora not correctly set in unet",
)
for name, param in pipe.unet.named_parameters():
if "lora_" in name:
@@ -202,14 +225,17 @@ class TestStableDiffusionLoRA(PeftLoraLoaderMixinTests):
pipe.unet.add_adapter(config1, adapter_name="adapter-1")
pipe = pipe.to(torch_device)
assert check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in unet"
self.assertTrue(
check_if_lora_correctly_set(pipe.unet),
"Lora not correctly set in unet",
)
# sanity check that the adapters don't target the same layers, otherwise the test passes even without the fix
modules_adapter_0 = {n for n, _ in pipe.unet.named_modules() if n.endswith(".adapter-0")}
modules_adapter_1 = {n for n, _ in pipe.unet.named_modules() if n.endswith(".adapter-1")}
assert modules_adapter_0 != modules_adapter_1
assert modules_adapter_0 - modules_adapter_1
assert modules_adapter_1 - modules_adapter_0
self.assertNotEqual(modules_adapter_0, modules_adapter_1)
self.assertTrue(modules_adapter_0 - modules_adapter_1)
self.assertTrue(modules_adapter_1 - modules_adapter_0)
# setting both separately works
pipe.set_lora_device(["adapter-0"], "cpu")
@@ -217,30 +243,32 @@ class TestStableDiffusionLoRA(PeftLoraLoaderMixinTests):
for name, module in pipe.unet.named_modules():
if "adapter-0" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device == torch.device("cpu")
self.assertTrue(module.weight.device == torch.device("cpu"))
elif "adapter-1" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device == torch.device("cpu")
self.assertTrue(module.weight.device == torch.device("cpu"))
# setting both at once also works
pipe.set_lora_device(["adapter-0", "adapter-1"], torch_device)
for name, module in pipe.unet.named_modules():
if "adapter-0" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device != torch.device("cpu")
self.assertTrue(module.weight.device != torch.device("cpu"))
elif "adapter-1" in name and not isinstance(module, (nn.Dropout, nn.Identity)):
assert module.weight.device != torch.device("cpu")
self.assertTrue(module.weight.device != torch.device("cpu"))
@slow
@nightly
@require_torch_accelerator
@require_peft_backend
class TestSDLoraIntegration:
@pytest.fixture(autouse=True)
def _gc_and_cache_cleanup(self, torch_device):
class LoraIntegrationTests(unittest.TestCase):
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
yield
def tearDown(self):
super().tearDown()
gc.collect()
backend_empty_cache(torch_device)
@@ -252,7 +280,10 @@ class TestSDLoraIntegration:
pipe.load_lora_weights(lora_id)
pipe = pipe.to(torch_device)
assert check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
self.assertTrue(
check_if_lora_correctly_set(pipe.text_encoder),
"Lora not correctly set in text encoder",
)
prompt = "a red sks dog"
@@ -281,7 +312,10 @@ class TestSDLoraIntegration:
pipe.load_lora_weights(lora_id)
pipe = pipe.to(torch_device)
assert check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder"
self.assertTrue(
check_if_lora_correctly_set(pipe.text_encoder),
"Lora not correctly set in text encoder",
)
prompt = "a red sks dog"
@@ -553,8 +587,8 @@ class TestSDLoraIntegration:
).images
unloaded_lora_images = unloaded_lora_images[0, -3:, -3:, -1].flatten()
assert not np.allclose(initial_images, lora_images)
assert np.allclose(initial_images, unloaded_lora_images, atol=1e-3)
self.assertFalse(np.allclose(initial_images, lora_images))
self.assertTrue(np.allclose(initial_images, unloaded_lora_images, atol=1e-3))
release_memory(pipe)
@@ -591,8 +625,8 @@ class TestSDLoraIntegration:
).images
unloaded_lora_images = unloaded_lora_images[0, -3:, -3:, -1].flatten()
assert not np.allclose(initial_images, lora_images)
assert np.allclose(initial_images, unloaded_lora_images, atol=1e-3)
self.assertFalse(np.allclose(initial_images, lora_images))
self.assertTrue(np.allclose(initial_images, unloaded_lora_images, atol=1e-3))
# make sure we can load a LoRA again after unloading and they don't have
# any undesired effects.
@@ -603,7 +637,7 @@ class TestSDLoraIntegration:
).images
lora_images_again = lora_images_again[0, -3:, -3:, -1].flatten()
assert np.allclose(lora_images, lora_images_again, atol=1e-3)
self.assertTrue(np.allclose(lora_images, lora_images_again, atol=1e-3))
release_memory(pipe)
def test_not_empty_state_dict(self):
@@ -617,7 +651,7 @@ class TestSDLoraIntegration:
lcm_lora = load_file(cached_file)
pipe.load_lora_weights(lcm_lora, adapter_name="lcm")
assert lcm_lora != {}
self.assertTrue(lcm_lora != {})
release_memory(pipe)
def test_load_unload_load_state_dict(self):
@@ -632,11 +666,11 @@ class TestSDLoraIntegration:
previous_state_dict = lcm_lora.copy()
pipe.load_lora_weights(lcm_lora, adapter_name="lcm")
assert lcm_lora == previous_state_dict
self.assertDictEqual(lcm_lora, previous_state_dict)
pipe.unload_lora_weights()
pipe.load_lora_weights(lcm_lora, adapter_name="lcm")
assert lcm_lora == previous_state_dict
self.assertDictEqual(lcm_lora, previous_state_dict)
release_memory(pipe)


@@ -14,9 +14,9 @@
# limitations under the License.
import gc
import sys
import unittest
import numpy as np
import pytest
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection, CLIPTokenizer, T5EncoderModel
@@ -51,7 +51,7 @@ if is_accelerate_available():
@require_peft_backend
class TestSD3LoRA(PeftLoraLoaderMixinTests):
class SD3LoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = StableDiffusion3Pipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -113,19 +113,19 @@ class TestSD3LoRA(PeftLoraLoaderMixinTests):
lora_filename = "lora_peft_format.safetensors"
pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
@pytest.mark.skip("Not supported in SD3.")
@unittest.skip("Not supported in SD3.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in SD3.")
@unittest.skip("Not supported in SD3.")
def test_simple_inference_with_text_denoiser_multi_adapter_block_lora(self):
pass
@pytest.mark.skip("Not supported in SD3.")
@unittest.skip("Not supported in SD3.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in SD3.")
@unittest.skip("Not supported in SD3.")
def test_modify_padding_mode(self):
pass
@@ -138,15 +138,17 @@ class TestSD3LoRA(PeftLoraLoaderMixinTests):
@require_torch_accelerator
@require_peft_backend
@require_big_accelerator
class TestSD3LoraIntegration:
class SD3LoraIntegrationTests(unittest.TestCase):
pipeline_class = StableDiffusion3Img2ImgPipeline
repo_id = "stabilityai/stable-diffusion-3-medium-diffusers"
@pytest.fixture(autouse=True)
def _gc_and_cache_cleanup(self, torch_device):
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
yield
def tearDown(self):
super().tearDown()
gc.collect()
backend_empty_cache(torch_device)


@@ -17,9 +17,9 @@ import gc
import importlib
import sys
import time
import unittest
import numpy as np
import pytest
import torch
from packaging import version
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
@@ -59,7 +59,7 @@ if is_accelerate_available():
from accelerate.utils import release_memory
class TestStableDiffusionXLLoRA(PeftLoraLoaderMixinTests):
class StableDiffusionXLLoRATests(PeftLoraLoaderMixinTests, unittest.TestCase):
has_two_text_encoders = True
pipeline_class = StableDiffusionXLPipeline
scheduler_cls = EulerDiscreteScheduler
@@ -104,11 +104,21 @@ class TestStableDiffusionXLLoRA(PeftLoraLoaderMixinTests):
def output_shape(self):
return (1, 64, 64, 3)
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
def tearDown(self):
super().tearDown()
gc.collect()
backend_empty_cache(torch_device)
@is_flaky
def test_multiple_wrong_adapter_name_raises_error(self):
super().test_multiple_wrong_adapter_name_raises_error()
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
def test_simple_inference_with_text_denoiser_lora_unfused(self):
if torch.cuda.is_available():
expected_atol = 9e-2
expected_rtol = 9e-2
@@ -117,10 +127,10 @@ class TestStableDiffusionXLLoRA(PeftLoraLoaderMixinTests):
expected_rtol = 1e-3
super().test_simple_inference_with_text_denoiser_lora_unfused(
pipe=pipe, expected_atol=expected_atol, expected_rtol=expected_rtol
expected_atol=expected_atol, expected_rtol=expected_rtol
)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
if torch.cuda.is_available():
expected_atol = 9e-2
expected_rtol = 9e-2
@@ -129,10 +139,10 @@ class TestStableDiffusionXLLoRA(PeftLoraLoaderMixinTests):
expected_rtol = 1e-3
super().test_simple_inference_with_text_lora_denoiser_fused_multi(
pipe=pipe, expected_atol=expected_atol, expected_rtol=expected_rtol
expected_atol=expected_atol, expected_rtol=expected_rtol
)
def test_lora_scale_kwargs_match_fusion(self, base_pipe_output):
def test_lora_scale_kwargs_match_fusion(self):
if torch.cuda.is_available():
expected_atol = 9e-2
expected_rtol = 9e-2
@@ -140,21 +150,21 @@ class TestStableDiffusionXLLoRA(PeftLoraLoaderMixinTests):
expected_atol = 1e-3
expected_rtol = 1e-3
super().test_lora_scale_kwargs_match_fusion(
base_pipe_output=base_pipe_output, expected_atol=expected_atol, expected_rtol=expected_rtol
)
super().test_lora_scale_kwargs_match_fusion(expected_atol=expected_atol, expected_rtol=expected_rtol)
@slow
@nightly
@require_torch_accelerator
@require_peft_backend
class TestLoraSDXLIntegration:
@pytest.fixture(autouse=True)
def _gc_and_cache_cleanup(self, torch_device):
class LoraSDXLIntegrationTests(unittest.TestCase):
def setUp(self):
super().setUp()
gc.collect()
backend_empty_cache(torch_device)
yield
def tearDown(self):
super().tearDown()
gc.collect()
backend_empty_cache(torch_device)
@@ -373,7 +383,7 @@ class TestLoraSDXLIntegration:
end_time = time.time()
elapsed_time_fusion = end_time - start_time
assert elapsed_time_fusion < elapsed_time_non_fusion
self.assertTrue(elapsed_time_fusion < elapsed_time_non_fusion)
release_memory(pipe)
@@ -429,14 +439,14 @@ class TestLoraSDXLIntegration:
for key, value in text_encoder_1_sd.items():
key = remap_key(key, fused_te_state_dict)
assert torch.allclose(fused_te_state_dict[key], value)
self.assertTrue(torch.allclose(fused_te_state_dict[key], value))
for key, value in text_encoder_2_sd.items():
key = remap_key(key, fused_te_2_state_dict)
assert torch.allclose(fused_te_2_state_dict[key], value)
self.assertTrue(torch.allclose(fused_te_2_state_dict[key], value))
for key, value in unet_state_dict.items():
assert torch.allclose(unet_state_dict[key], value)
self.assertTrue(torch.allclose(unet_state_dict[key], value))
pipe.fuse_lora()
pipe.unload_lora_weights()
@@ -579,7 +589,7 @@ class TestLoraSDXLIntegration:
pipe.load_lora_weights(lora_id, weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
pipe = pipe.to(torch_device)
assert check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet"
self.assertTrue(check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet")
prompt = "toy_face of a hacker with a hoodie"


@@ -13,8 +13,8 @@
# limitations under the License.
import sys
import unittest
import pytest
import torch
from transformers import AutoTokenizer, T5EncoderModel
@@ -39,7 +39,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
@skip_mps
class TestWanLoRA(PeftLoraLoaderMixinTests):
class WanLoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = WanPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -104,40 +104,40 @@ class TestWanLoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
super().test_simple_inference_with_text_denoiser_lora_unfused(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self):
super().test_simple_inference_with_text_denoiser_lora_unfused(expected_atol=9e-3)
@pytest.mark.skip("Not supported in Wan.")
@unittest.skip("Not supported in Wan.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Wan.")
@unittest.skip("Not supported in Wan.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Wan.")
@unittest.skip("Not supported in Wan.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan.")
@unittest.skip("Text encoder LoRA is not supported in Wan.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan.")
@unittest.skip("Text encoder LoRA is not supported in Wan.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan.")
@unittest.skip("Text encoder LoRA is not supported in Wan.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan.")
@unittest.skip("Text encoder LoRA is not supported in Wan.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan.")
@unittest.skip("Text encoder LoRA is not supported in Wan.")
def test_simple_inference_with_text_lora_save_load(self):
pass


@@ -14,9 +14,10 @@
import os
import sys
import tempfile
import unittest
import numpy as np
import pytest
import safetensors.torch
import torch
from PIL import Image
@@ -31,6 +32,7 @@ from ..testing_utils import (
require_peft_backend,
require_peft_version_greater,
skip_mps,
torch_device,
)
@@ -45,7 +47,7 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
@skip_mps
@is_flaky(max_attempts=10, description="very flaky class")
class TestWanVACELoRA(PeftLoraLoaderMixinTests):
class WanVACELoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = WanVACEPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
scheduler_kwargs = {}
@@ -119,51 +121,56 @@ class TestWanVACELoRA(PeftLoraLoaderMixinTests):
return noise, input_ids, pipeline_inputs
def test_simple_inference_with_text_lora_denoiser_fused_multi(self, pipe):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_lora_denoiser_fused_multi(self):
super().test_simple_inference_with_text_lora_denoiser_fused_multi(expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self, pipe):
super().test_simple_inference_with_text_denoiser_lora_unfused(pipe=pipe, expected_atol=9e-3)
def test_simple_inference_with_text_denoiser_lora_unfused(self):
super().test_simple_inference_with_text_denoiser_lora_unfused(expected_atol=9e-3)
@pytest.mark.skip("Not supported in Wan VACE.")
@unittest.skip("Not supported in Wan VACE.")
def test_simple_inference_with_text_denoiser_block_scale(self):
pass
@pytest.mark.skip("Not supported in Wan VACE.")
@unittest.skip("Not supported in Wan VACE.")
def test_simple_inference_with_text_denoiser_block_scale_for_all_dict_options(self):
pass
@pytest.mark.skip("Not supported in Wan VACE.")
@unittest.skip("Not supported in Wan VACE.")
def test_modify_padding_mode(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan VACE.")
@unittest.skip("Text encoder LoRA is not supported in Wan VACE.")
def test_simple_inference_with_partial_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan VACE.")
@unittest.skip("Text encoder LoRA is not supported in Wan VACE.")
def test_simple_inference_with_text_lora(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan VACE.")
@unittest.skip("Text encoder LoRA is not supported in Wan VACE.")
def test_simple_inference_with_text_lora_and_scale(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan VACE.")
@unittest.skip("Text encoder LoRA is not supported in Wan VACE.")
def test_simple_inference_with_text_lora_fused(self):
pass
@pytest.mark.skip("Text encoder LoRA is not supported in Wan VACE.")
@unittest.skip("Text encoder LoRA is not supported in Wan VACE.")
def test_simple_inference_with_text_lora_save_load(self):
pass
def test_layerwise_casting_inference_denoiser(self):
super().test_layerwise_casting_inference_denoiser()
@require_peft_version_greater("0.13.2")
def test_lora_exclude_modules_wanvace(self, base_pipe_output, tmpdirname, pipe):
def test_lora_exclude_modules_wanvace(self):
exclude_module_name = "vace_blocks.0.proj_out"
_, text_lora_config, denoiser_lora_config = self.get_dummy_components()
components, text_lora_config, denoiser_lora_config = self.get_dummy_components()
pipe = self.pipeline_class(**components).to(torch_device)
_, _, inputs = self.get_dummy_inputs(with_generator=False)
assert base_pipe_output.shape == self.output_shape
output_no_lora = self.get_base_pipe_output()
self.assertTrue(output_no_lora.shape == self.output_shape)
# only supported for `denoiser` now
denoiser_lora_config.target_modules = ["proj_out"]
@@ -173,30 +180,36 @@ class TestWanVACELoRA(PeftLoraLoaderMixinTests):
)
# The state dict shouldn't contain the modules to be excluded from LoRA.
state_dict_from_model = get_peft_model_state_dict(pipe.transformer, adapter_name="default")
assert not any(exclude_module_name in k for k in state_dict_from_model)
assert any("proj_out" in k for k in state_dict_from_model)
self.assertTrue(not any(exclude_module_name in k for k in state_dict_from_model))
self.assertTrue(any("proj_out" in k for k in state_dict_from_model))
output_lora_exclude_modules = pipe(**inputs, generator=torch.manual_seed(0))[0]
modules_to_save = self._get_modules_to_save(pipe, has_denoiser=True)
lora_state_dicts = self._get_lora_state_dicts(modules_to_save)
self.pipeline_class.save_lora_weights(save_directory=tmpdirname, **lora_state_dicts)
pipe.unload_lora_weights()
with tempfile.TemporaryDirectory() as tmpdir:
modules_to_save = self._get_modules_to_save(pipe, has_denoiser=True)
lora_state_dicts = self._get_lora_state_dicts(modules_to_save)
self.pipeline_class.save_lora_weights(save_directory=tmpdir, **lora_state_dicts)
pipe.unload_lora_weights()
# Check in the loaded state dict.
loaded_state_dict = safetensors.torch.load_file(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
assert not any(exclude_module_name in k for k in loaded_state_dict)
assert any("proj_out" in k for k in loaded_state_dict)
# Check in the loaded state dict.
loaded_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
self.assertTrue(not any(exclude_module_name in k for k in loaded_state_dict))
self.assertTrue(any("proj_out" in k for k in loaded_state_dict))
# Check in the state dict obtained after loading LoRA.
pipe.load_lora_weights(tmpdirname)
state_dict_from_model = get_peft_model_state_dict(pipe.transformer, adapter_name="default_0")
assert not any(exclude_module_name in k for k in state_dict_from_model)
assert any("proj_out" in k for k in state_dict_from_model)
# Check in the state dict obtained after loading LoRA.
pipe.load_lora_weights(tmpdir)
state_dict_from_model = get_peft_model_state_dict(pipe.transformer, adapter_name="default_0")
self.assertTrue(not any(exclude_module_name in k for k in state_dict_from_model))
self.assertTrue(any("proj_out" in k for k in state_dict_from_model))
output_lora_pretrained = pipe(**inputs, generator=torch.manual_seed(0))[0]
assert not np.allclose(base_pipe_output, output_lora_exclude_modules, atol=1e-3, rtol=1e-3), (
"LoRA should change outputs."
)
assert np.allclose(output_lora_exclude_modules, output_lora_pretrained, atol=1e-3, rtol=1e-3), (
"Lora outputs should match."
)
output_lora_pretrained = pipe(**inputs, generator=torch.manual_seed(0))[0]
self.assertTrue(
not np.allclose(output_no_lora, output_lora_exclude_modules, atol=1e-3, rtol=1e-3),
"LoRA should change outputs.",
)
self.assertTrue(
np.allclose(output_lora_exclude_modules, output_lora_pretrained, atol=1e-3, rtol=1e-3),
"Lora outputs should match.",
)
def test_simple_inference_with_text_denoiser_lora_and_scale(self):
super().test_simple_inference_with_text_denoiser_lora_and_scale()

File diff suppressed because it is too large.


@@ -35,6 +35,7 @@ from diffusers.models.attention_processor import Attention
from diffusers.quantizers import PipelineQuantizationConfig
from ...testing_utils import (
Expectations,
backend_empty_cache,
backend_synchronize,
enable_full_determinism,
@@ -497,8 +498,23 @@ class TorchAoTest(unittest.TestCase):
def test_model_memory_usage(self):
model_id = "hf-internal-testing/tiny-flux-pipe"
expected_memory_saving_ratio = 2.0
expected_memory_saving_ratios = Expectations(
{
# XPU: For this tiny model, per-tensor overheads (alignment, fragmentation, metadata) become visible.
# While XPU doesn't have the large fixed cuBLAS workspace of A100, these small overheads prevent reaching the ideal 2.0 ratio.
# Observed ~1.27x (158k vs 124k) for model size.
# The runtime memory overhead is ~88k for both bf16 and int8wo. Adding this to model size: (158k+88k)/(124k+88k) ≈ 1.15.
("xpu", None): 1.15,
# On Ampere, the cuBLAS kernels used for matrix multiplication often allocate a fixed-size workspace.
# Since the tiny-flux model weights are likely smaller than or comparable to this workspace, the total memory is dominated by the workspace.
("cuda", 8): 1.02,
# On Hopper, TorchAO utilizes newer, highly optimized kernels (via Triton or CUTLASS 3.x) that are designed to be workspace-free or use negligible extra memory.
# Additionally, Triton kernels often handle unaligned memory better, avoiding the padding overhead seen on other backends for tiny tensors.
# This allows it to achieve the near-ideal 2.0x compression ratio.
("cuda", 9): 2.0,
}
)
expected_memory_saving_ratio = expected_memory_saving_ratios.get_expectation()
inputs = self.get_dummy_tensor_inputs(device=torch_device)
transformer_bf16 = self.get_dummy_components(None, model_id=model_id)["transformer"]
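As a back-of-the-envelope check on the XPU expectation above, the ratio can be reproduced directly from the numbers quoted in the comment; the figures below come from that comment, not from a new measurement.
# Approximate figures quoted in the comment above (arbitrary units).
bf16_model, int8wo_model = 158_000, 124_000  # observed model sizes on XPU
runtime_overhead = 88_000                    # roughly constant for both dtypes

model_only = bf16_model / int8wo_model  # ~1.27, the "model size" ratio
end_to_end = (bf16_model + runtime_overhead) / (int8wo_model + runtime_overhead)
print(round(model_only, 2), round(end_to_end, 2))  # 1.27 1.16, close to the 1.15 expectation set above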