Compare commits

...

15 Commits

Author SHA1 Message Date
Celina Hanouti
88382d72f7 update style bot workflow 2025-04-03 10:38:14 +02:00
Abhipsha Das
d9023a671a [Model Card] standardize advanced diffusion training sdxl lora (#7615)
* model card gen code

* push modelcard creation

* remove optional from params

* add import

* add use_dora check

* correct lora var use in tags

* make style && make quality

---------

Co-authored-by: Aryan <aryan@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-04-03 07:43:01 +05:30
Eliseu Silva
c4646a3931 feat: [Community Pipeline] - FaithDiff Stable Diffusion XL Pipeline (#11188)
* feat: [Community Pipeline] - FaithDiff Stable Diffusion XL Pipeline for Image SR.

* added pipeline
2025-04-02 11:33:19 -10:00
Dhruv Nair
c97b709afa Add CacheMixin to Wan and LTX Transformers (#11187)
* update

* update

* update
2025-04-02 10:16:31 -10:00
lakshay sharma
b0ff822ed3 Update import_utils.py (#10329)
added onnxruntime-vitisai for custom build onnxruntime pkg
2025-04-02 20:47:10 +01:00
hlky
78c2fdc52e SchedulerMixin from_pretrained and ConfigMixin Self type annotation (#11192) 2025-04-02 08:24:02 -10:00
hlky
54dac3a87c Fix enable_sequential_cpu_offload in CogView4Pipeline (#11195)
* Fix enable_sequential_cpu_offload in CogView4Pipeline

* make fix-copies
2025-04-02 16:51:23 +01:00
hlky
e5c6027ef8 [docs] torch_dtype map (#11194) 2025-04-02 12:46:28 +01:00
hlky
da857bebb6 Revert save_model in ModelMixin save_pretrained and use safe_serialization=False in test (#11196) 2025-04-02 12:45:36 +01:00
Fanli Lin
52b460feb9 [tests] HunyuanDiTControlNetPipeline inference precision issue on XPU (#11197)
* add xpu part

* fix more cases

* remove some cases

* no canny

* format fix
2025-04-02 12:45:02 +01:00
hlky
d8c617ccb0 allow models to run with a user-provided dtype map instead of a single dtype (#10301)
* allow models to run with a user-provided dtype map instead of a single dtype

* make style

* Add warning, change `_` to `default`

* make style

* add test

* handle shared tensors

* remove warning

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2025-04-02 09:05:46 +01:00
Bruno Magalhaes
fe2b397426 remove unnecessary call to F.pad (#10620)
* rewrite memory count without implicitly using dimensions by @ic-synth

* replace F.pad by built-in padding in Conv3D

* in-place sums to reduce memory allocations

* fixed trailing whitespace

* file reformatted

* in-place sums

* simpler in-place expressions

* removed in-place sum, may affect backward propagation logic

* removed in-place sum, may affect backward propagation logic

* removed in-place sum, may affect backward propagation logic

* reverted change
2025-04-02 08:19:51 +01:00
Eliseu Silva
be0b7f55cc fix: for checking mandatory and optional pipeline components (#11189)
fix: optional componentes verification on load
2025-04-02 08:07:24 +01:00
jiqing-feng
4d5a96e40a fix autocast (#11190)
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-04-02 07:26:27 +01:00
Yao Matrix
a7f07c1ef5 map BACKEND_RESET_MAX_MEMORY_ALLOCATED to reset_peak_memory_stats on XPU (#11191)
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-04-02 07:25:48 +01:00
22 changed files with 2515 additions and 92 deletions

View File

@@ -13,39 +13,5 @@ jobs:
uses: huggingface/huggingface_hub/.github/workflows/style-bot-action.yml@main
with:
python_quality_dependencies: "[quality]"
pre_commit_script_name: "Download and Compare files from the main branch"
pre_commit_script: |
echo "Downloading the files from the main branch"
curl -o main_Makefile https://raw.githubusercontent.com/huggingface/diffusers/main/Makefile
curl -o main_setup.py https://raw.githubusercontent.com/huggingface/diffusers/refs/heads/main/setup.py
curl -o main_check_doc_toc.py https://raw.githubusercontent.com/huggingface/diffusers/refs/heads/main/utils/check_doc_toc.py
echo "Compare the files and raise error if needed"
diff_failed=0
if ! diff -q main_Makefile Makefile; then
echo "Error: The Makefile has changed. Please ensure it matches the main branch."
diff_failed=1
fi
if ! diff -q main_setup.py setup.py; then
echo "Error: The setup.py has changed. Please ensure it matches the main branch."
diff_failed=1
fi
if ! diff -q main_check_doc_toc.py utils/check_doc_toc.py; then
echo "Error: The utils/check_doc_toc.py has changed. Please ensure it matches the main branch."
diff_failed=1
fi
if [ $diff_failed -eq 1 ]; then
echo "❌ Error happened as we detected changes in the files that should not be changed ❌"
exit 1
fi
echo "No changes in the files. Proceeding..."
rm -rf main_Makefile main_setup.py main_check_doc_toc.py
style_command: "make style && make quality"
secrets:
bot_token: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -95,6 +95,23 @@ Use the Space below to gauge a pipeline's memory requirements before you downloa
></iframe>
</div>
### Specifying Component-Specific Data Types
You can customize the data types of individual sub-models by passing a dictionary to the `torch_dtype` parameter, which lets you load different components of a pipeline in different floating-point precisions. For instance, to load the transformer in `torch.bfloat16` and all other components in `torch.float16`, pass a dictionary that maps component names to dtypes:
```python
from diffusers import HunyuanVideoPipeline
import torch
pipe = HunyuanVideoPipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo",
torch_dtype={'transformer': torch.bfloat16, 'default': torch.float16},
)
print(pipe.transformer.dtype, pipe.vae.dtype) # (torch.bfloat16, torch.float16)
```
If a component is not explicitly specified in the dictionary and no `default` is provided, it will be loaded with `torch.float32`.
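A minimal follow-on sketch of that fallback behavior, using the same checkpoint as above: with no `default` entry, only the transformer dtype is overridden and every other component loads in `torch.float32`.
```python
from diffusers import HunyuanVideoPipeline
import torch

# No `default` entry: only the transformer dtype is overridden,
# all other components fall back to torch.float32.
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    torch_dtype={"transformer": torch.bfloat16},
)
print(pipe.transformer.dtype, pipe.vae.dtype)  # torch.bfloat16 torch.float32
```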
### Local pipeline
To load a pipeline locally, use [git-lfs](https://git-lfs.github.com/) to manually download a checkpoint to your local disk.

View File

@@ -71,6 +71,7 @@ from diffusers.utils import (
convert_unet_state_dict_to_peft,
is_wandb_available,
)
from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
from diffusers.utils.import_utils import is_xformers_available
from diffusers.utils.torch_utils import is_compiled_module
@@ -101,7 +102,7 @@ def determine_scheduler_type(pretrained_model_name_or_path, revision):
def save_model_card(
repo_id: str,
use_dora: bool,
images=None,
images: list = None,
base_model: str = None,
train_text_encoder=False,
train_text_encoder_ti=False,
@@ -111,20 +112,17 @@ def save_model_card(
repo_folder=None,
vae_path=None,
):
img_str = "widget:\n"
lora = "lora" if not use_dora else "dora"
for i, image in enumerate(images):
image.save(os.path.join(repo_folder, f"image_{i}.png"))
img_str += f"""
- text: '{validation_prompt if validation_prompt else ' ' }'
output:
url:
"image_{i}.png"
"""
if not images:
img_str += f"""
- text: '{instance_prompt}'
"""
widget_dict = []
if images is not None:
for i, image in enumerate(images):
image.save(os.path.join(repo_folder, f"image_{i}.png"))
widget_dict.append(
{"text": validation_prompt if validation_prompt else " ", "output": {"url": f"image_{i}.png"}}
)
else:
widget_dict.append({"text": instance_prompt})
embeddings_filename = f"{repo_folder}_emb"
instance_prompt_webui = re.sub(r"<s\d+>", "", re.sub(r"<s\d+>", embeddings_filename, instance_prompt, count=1))
ti_keys = ", ".join(f'"{match}"' for match in re.findall(r"<s\d+>", instance_prompt))
@@ -169,23 +167,7 @@ pipeline.load_textual_inversion(state_dict["clip_g"], token=[{ti_keys}], text_en
to trigger concept `{key}` → use `{tokens}` in your prompt \n
"""
yaml = f"""---
tags:
- stable-diffusion-xl
- stable-diffusion-xl-diffusers
- diffusers-training
- text-to-image
- diffusers
- {lora}
- template:sd-lora
{img_str}
base_model: {base_model}
instance_prompt: {instance_prompt}
license: openrail++
---
"""
model_card = f"""
model_description = f"""
# SDXL LoRA DreamBooth - {repo_id}
<Gallery />
@@ -234,8 +216,25 @@ Special VAE used for training: {vae_path}.
{license}
"""
with open(os.path.join(repo_folder, "README.md"), "w") as f:
f.write(yaml + model_card)
model_card = load_or_create_model_card(
repo_id_or_path=repo_id,
from_training=True,
license="openrail++",
base_model=base_model,
prompt=instance_prompt,
model_description=model_description,
widget=widget_dict,
)
tags = [
"text-to-image",
"stable-diffusion-xl",
"stable-diffusion-xl-diffusers",
"text-to-image",
"diffusers",
lora,
"template:sd-lora",
]
model_card = populate_model_card(model_card, tags=tags)
def log_validation(

View File

@@ -85,7 +85,7 @@ PIXART-α Controlnet pipeline | Implementation of the controlnet model for pixar
| Stable Diffusion XL Attentive Eraser Pipeline |[[AAAI2025 Oral] Attentive Eraser](https://github.com/Anonym0u3/AttentiveEraser) is a novel tuning-free method that enhances object removal capabilities in pre-trained diffusion models.|[Stable Diffusion XL Attentive Eraser Pipeline](#stable-diffusion-xl-attentive-eraser-pipeline)|-|[Wenhao Sun](https://github.com/Anonym0u3) and [Benlei Cui](https://github.com/Benny079)|
| Perturbed-Attention Guidance |StableDiffusionPAGPipeline is a modification of StableDiffusionPipeline to support Perturbed-Attention Guidance (PAG).|[Perturbed-Attention Guidance](#perturbed-attention-guidance)|[Notebook](https://github.com/huggingface/notebooks/blob/main/diffusers/perturbed_attention_guidance.ipynb)|[Hyoungwon Cho](https://github.com/HyoungwonCho)|
| CogVideoX DDIM Inversion Pipeline | Implementation of DDIM inversion and guided attention-based editing denoising process on CogVideoX. | [CogVideoX DDIM Inversion Pipeline](#cogvideox-ddim-inversion-pipeline) | - | [LittleNyima](https://github.com/LittleNyima) |
| FaithDiff Stable Diffusion XL Pipeline | Implementation of [(CVPR 2025) FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution](https://arxiv.org/abs/2411.18824) - FaithDiff is a faithful image super-resolution method that leverages latent diffusion models by actively adapting the diffusion prior and jointly fine-tuning its components (encoder and diffusion model) with an alignment module to ensure high fidelity and structural consistency. | [FaithDiff Stable Diffusion XL Pipeline](#faithdiff-stable-diffusion-xl-pipeline) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/jychen9811/FaithDiff) | [Junyang Chen, Jinshan Pan, Jiangxin Dong, IMAG Lab, (Adapted by Eliseu Silva)](https://github.com/JyChen9811/FaithDiff) |
To load a custom pipeline, pass the name of one of the files in `diffusers/examples/community` as the `custom_pipeline` argument to `DiffusionPipeline`. Feel free to send a PR with your own pipelines; we will merge them quickly.
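A minimal sketch of that loading pattern; the checkpoint and community file name (`lpw_stable_diffusion`) are illustrative choices, not part of this diff.
```py
import torch
from diffusers import DiffusionPipeline

# Pass the community file name (without .py) via `custom_pipeline`.
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")
```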
@@ -5334,3 +5334,103 @@ output = pipeline_for_inversion(
pipeline.export_latents_to_video(output.inverse_latents[-1], "path/to/inverse_video.mp4", fps=8)
pipeline.export_latents_to_video(output.recon_latents[-1], "path/to/recon_video.mp4", fps=8)
```
# FaithDiff Stable Diffusion XL Pipeline
[Project](https://jychen9811.github.io/FaithDiff_page/) / [GitHub](https://github.com/JyChen9811/FaithDiff/)
This is the implementation of the FaithDiff pipeline for SDXL, adapted to use Hugging Face Diffusers.
For more details see the project links above.
## Example Usage
This example upscales and restores a low-quality image. The input image has a resolution of 512x512 and is upscaled by 2x to a final resolution of 1024x1024. Upscaling by a larger factor is possible, but in that case the input image should be at least 1024x1024. To upscale this image by 4x, for example, it is recommended to feed the 2x result back into another 2x pass, i.e. progressive upscaling.
````py
import random
import numpy as np
import torch
from diffusers import DiffusionPipeline, AutoencoderKL, UniPCMultistepScheduler
from huggingface_hub import hf_hub_download
from diffusers.utils import load_image
from PIL import Image
device = "cuda"
dtype = torch.float16
MAX_SEED = np.iinfo(np.int32).max
# Download weights for additional unet layers
model_file = hf_hub_download(
"jychen9811/FaithDiff",
filename="FaithDiff.bin", local_dir="./proc_data/faithdiff", local_dir_use_symlinks=False
)
# Initialize the models and pipeline
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=dtype)
model_id = "SG161222/RealVisXL_V4.0"
pipe = DiffusionPipeline.from_pretrained(
model_id,
torch_dtype=dtype,
vae=vae,
unet=None, # <- Do not load the UNet with the original model.
custom_pipeline="pipeline_faithdiff_stable_diffusion_xl",
use_safetensors=True,
variant="fp16",
).to(device)
# Here we need to use the pipeline's internal UNet model
pipe.unet = pipe.unet_model.from_pretrained(model_id, subfolder="unet", variant="fp16", use_safetensors=True)
# Load additional layers into the model
pipe.unet.load_additional_layers(weight_path="proc_data/faithdiff/FaithDiff.bin", dtype=dtype)
# Enable VAE tiling
pipe.set_encoder_tile_settings()
pipe.enable_vae_tiling()
# Optimization
pipe.enable_model_cpu_offload()
# Set selected scheduler
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
# Input parameters
prompt = "The image features a woman in her 55s with blonde hair and a white shirt, smiling at the camera. She appears to be in a good mood and is wearing a white scarf around her neck. "
upscale = 2 # scale here
start_point = "lr" # or "noise"
latent_tiled_overlap = 0.5
latent_tiled_size = 1024
# Load image
lq_image = load_image("https://huggingface.co/datasets/DEVAIEXP/assets/resolve/main/woman.png")
original_height = lq_image.height
original_width = lq_image.width
print(f"Current resolution: H:{original_height} x W:{original_width}")
width = original_width * int(upscale)
height = original_height * int(upscale)
print(f"Final resolution: H:{height} x W:{width}")
# Restoration
image = lq_image.resize((width, height), Image.LANCZOS)
input_image, width_init, height_init, width_now, height_now = pipe.check_image_size(image)
generator = torch.Generator(device=device).manual_seed(random.randint(0, MAX_SEED))
gen_image = pipe(
    lr_img=input_image,
    prompt=prompt,
    num_inference_steps=20,
    guidance_scale=5,
    generator=generator,
    start_point=start_point,
    height=height_now,
    width=width_now,
    overlap=latent_tiled_overlap,
    target_size=(latent_tiled_size, latent_tiled_size),
).images[0]
cropped_image = gen_image.crop((0, 0, width_init, height_init))
cropped_image.save("data/result.png")
````
### Result
[<img src="https://huggingface.co/datasets/DEVAIEXP/assets/resolve/main/faithdiff_restored.PNG" width="512px" height="512px"/>](https://imgsli.com/MzY1NzE2)

File diff suppressed because it is too large

View File

@@ -35,6 +35,7 @@ from huggingface_hub.utils import (
validate_hf_hub_args,
)
from requests import HTTPError
from typing_extensions import Self
from . import __version__
from .utils import (
@@ -185,7 +186,9 @@ class ConfigMixin:
)
@classmethod
def from_config(cls, config: Union[FrozenDict, Dict[str, Any]] = None, return_unused_kwargs=False, **kwargs):
def from_config(
cls, config: Union[FrozenDict, Dict[str, Any]] = None, return_unused_kwargs=False, **kwargs
) -> Union[Self, Tuple[Self, Dict[str, Any]]]:
r"""
Instantiate a Python class from a config dictionary.

View File

@@ -105,6 +105,7 @@ class CogVideoXCausalConv3d(nn.Module):
self.width_pad = width_pad
self.time_pad = time_pad
self.time_causal_padding = (width_pad, width_pad, height_pad, height_pad, time_pad, 0)
self.const_padding_conv3d = (0, self.width_pad, self.height_pad)
self.temporal_dim = 2
self.time_kernel_size = time_kernel_size
@@ -117,6 +118,8 @@ class CogVideoXCausalConv3d(nn.Module):
kernel_size=kernel_size,
stride=stride,
dilation=dilation,
padding=0 if self.pad_mode == "replicate" else self.const_padding_conv3d,
padding_mode="zeros",
)
def fake_context_parallel_forward(
@@ -137,9 +140,7 @@ class CogVideoXCausalConv3d(nn.Module):
if self.pad_mode == "replicate":
conv_cache = None
else:
padding_2d = (self.width_pad, self.width_pad, self.height_pad, self.height_pad)
conv_cache = inputs[:, :, -self.time_kernel_size + 1 :].clone()
inputs = F.pad(inputs, padding_2d, mode="constant", value=0)
output = self.conv(inputs)
return output, conv_cache
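A small self-contained check (not the CogVideoX module itself) of why the explicit `F.pad` call can be dropped: `nn.Conv3d` applies the same constant zero padding when given a per-dimension `padding` tuple in (depth, height, width) order.
```python
import torch
import torch.nn.functional as F
from torch import nn

x = torch.randn(1, 4, 3, 16, 16)  # (batch, channels, depth, height, width)

conv_manual = nn.Conv3d(4, 8, kernel_size=3, padding=0)
conv_builtin = nn.Conv3d(4, 8, kernel_size=3, padding=(0, 1, 1), padding_mode="zeros")
conv_builtin.load_state_dict(conv_manual.state_dict())  # identical weights

# F.pad takes pairs in reverse dimension order: (W_left, W_right, H_top, H_bottom, D_front, D_back)
out_manual = conv_manual(F.pad(x, (1, 1, 1, 1, 0, 0), mode="constant", value=0))
out_builtin = conv_builtin(x)
print(torch.allclose(out_manual, out_builtin))  # True
```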

View File

@@ -26,6 +26,7 @@ from ...utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_
from ...utils.torch_utils import maybe_allow_in_graph
from ..attention import FeedForward
from ..attention_processor import Attention
from ..cache_utils import CacheMixin
from ..embeddings import PixArtAlphaTextProjection
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
@@ -298,7 +299,7 @@ class LTXVideoTransformerBlock(nn.Module):
@maybe_allow_in_graph
class LTXVideoTransformer3DModel(ModelMixin, ConfigMixin, FromOriginalModelMixin, PeftAdapterMixin):
class LTXVideoTransformer3DModel(ModelMixin, ConfigMixin, FromOriginalModelMixin, PeftAdapterMixin, CacheMixin):
r"""
A Transformer model for video-like data used in [LTX](https://huggingface.co/Lightricks/LTX-Video).

View File

@@ -24,6 +24,7 @@ from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
from ...utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_layers
from ..attention import FeedForward
from ..attention_processor import Attention
from ..cache_utils import CacheMixin
from ..embeddings import PixArtAlphaTextProjection, TimestepEmbedding, Timesteps, get_1d_rotary_pos_embed
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
@@ -288,7 +289,7 @@ class WanTransformerBlock(nn.Module):
return hidden_states
class WanTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
class WanTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin, CacheMixin):
r"""
A Transformer model for video-like data used in the Wan model.

View File

@@ -213,9 +213,7 @@ class CogView4Pipeline(DiffusionPipeline, CogView4LoraLoaderMixin):
device=text_input_ids.device,
)
text_input_ids = torch.cat([pad_ids, text_input_ids], dim=1)
prompt_embeds = self.text_encoder(
text_input_ids.to(self.text_encoder.device), output_hidden_states=True
).hidden_states[-2]
prompt_embeds = self.text_encoder(text_input_ids.to(device), output_hidden_states=True).hidden_states[-2]
prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
return prompt_embeds

View File

@@ -216,9 +216,7 @@ class CogView4ControlPipeline(DiffusionPipeline):
device=text_input_ids.device,
)
text_input_ids = torch.cat([pad_ids, text_input_ids], dim=1)
prompt_embeds = self.text_encoder(
text_input_ids.to(self.text_encoder.device), output_hidden_states=True
).hidden_states[-2]
prompt_embeds = self.text_encoder(text_input_ids.to(device), output_hidden_states=True).hidden_states[-2]
prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
return prompt_embeds
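A hedged usage sketch of the path this fixes: with sequential CPU offload the text encoder reports an offload device, so moving the token ids to the pipeline's execution `device` (as the change above does) presumably keeps prompt encoding working. The checkpoint id below is an assumption for illustration.
```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload()  # prompt encoding previously failed under this offload mode

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=20).images[0]
```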

View File

@@ -489,6 +489,10 @@ class LTXPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLoraLoaderMixi
def num_timesteps(self):
return self._num_timesteps
@property
def current_timestep(self):
return self._current_timestep
@property
def attention_kwargs(self):
return self._attention_kwargs
@@ -622,6 +626,7 @@ class LTXPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLoraLoaderMixi
self._guidance_scale = guidance_scale
self._attention_kwargs = attention_kwargs
self._interrupt = False
self._current_timestep = None
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
@@ -706,6 +711,8 @@ class LTXPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLoraLoaderMixi
if self.interrupt:
continue
self._current_timestep = t
latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
latent_model_input = latent_model_input.to(prompt_embeds.dtype)

View File

@@ -774,6 +774,10 @@ class LTXConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLoraL
def num_timesteps(self):
return self._num_timesteps
@property
def current_timestep(self):
return self._current_timestep
@property
def attention_kwargs(self):
return self._attention_kwargs
@@ -933,6 +937,7 @@ class LTXConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLoraL
self._guidance_scale = guidance_scale
self._attention_kwargs = attention_kwargs
self._interrupt = False
self._current_timestep = None
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
@@ -1066,6 +1071,8 @@ class LTXConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLoraL
if self.interrupt:
continue
self._current_timestep = t
if image_cond_noise_scale > 0:
# Add timestep-dependent noise to the hard-conditioning latents
# This helps with motion continuity, especially when conditioned on a single frame

View File

@@ -550,6 +550,10 @@ class LTXImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLo
def num_timesteps(self):
return self._num_timesteps
@property
def current_timestep(self):
return self._current_timestep
@property
def attention_kwargs(self):
return self._attention_kwargs
@@ -686,6 +690,7 @@ class LTXImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLo
self._guidance_scale = guidance_scale
self._attention_kwargs = attention_kwargs
self._interrupt = False
self._current_timestep = None
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
@@ -778,6 +783,8 @@ class LTXImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTXVideoLo
if self.interrupt:
continue
self._current_timestep = t
latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
latent_model_input = latent_model_input.to(prompt_embeds.dtype)

View File

@@ -592,6 +592,11 @@ def _get_final_device_map(device_map, pipeline_class, passed_class_obj, init_dic
loaded_sub_model = passed_class_obj[name]
else:
sub_model_dtype = (
torch_dtype.get(name, torch_dtype.get("default", torch.float32))
if isinstance(torch_dtype, dict)
else torch_dtype
)
loaded_sub_model = _load_empty_model(
library_name=library_name,
class_name=class_name,
@@ -600,7 +605,7 @@ def _get_final_device_map(device_map, pipeline_class, passed_class_obj, init_dic
is_pipeline_module=is_pipeline_module,
pipeline_class=pipeline_class,
name=name,
torch_dtype=torch_dtype,
torch_dtype=sub_model_dtype,
cached_folder=kwargs.get("cached_folder", None),
force_download=kwargs.get("force_download", None),
proxies=kwargs.get("proxies", None),
@@ -616,7 +621,12 @@ def _get_final_device_map(device_map, pipeline_class, passed_class_obj, init_dic
# Obtain a sorted dictionary for mapping the model-level components
# to their sizes.
module_sizes = {
module_name: compute_module_sizes(module, dtype=torch_dtype)[""]
module_name: compute_module_sizes(
module,
dtype=torch_dtype.get(module_name, torch_dtype.get("default", torch.float32))
if isinstance(torch_dtype, dict)
else torch_dtype,
)[""]
for module_name, module in init_empty_modules.items()
if isinstance(module, torch.nn.Module)
}
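The same resolution expression appears twice above; as a reading aid, here is a minimal restatement of it (the helper name is ours, not part of the diff).
```python
import torch

def _resolve_component_dtype(torch_dtype, name):
    # Per-component dtype from a dict, falling back to "default", then torch.float32;
    # a plain torch.dtype is applied to every component unchanged.
    if isinstance(torch_dtype, dict):
        return torch_dtype.get(name, torch_dtype.get("default", torch.float32))
    return torch_dtype

assert _resolve_component_dtype({"transformer": torch.bfloat16}, "vae") is torch.float32
assert _resolve_component_dtype(torch.float16, "vae") is torch.float16
```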

View File

@@ -552,9 +552,12 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
saved using
[`~DiffusionPipeline.save_pretrained`].
- A path to a *directory* (for example `./my_pipeline_directory/`) containing a dduf file
torch_dtype (`str` or `torch.dtype`, *optional*):
torch_dtype (`str` or `torch.dtype` or `dict[str, Union[str, torch.dtype]]`, *optional*):
Override the default `torch.dtype` and load the model with another dtype. If "auto" is passed, the
dtype is automatically derived from the model's weights.
dtype is automatically derived from the model's weights. To load submodels with different dtype pass a
`dict` (for example `{'transformer': torch.bfloat16, 'vae': torch.float16}`). Set the default dtype for
unspecified components with `default` (for example `{'transformer': torch.bfloat16, 'default':
torch.float16}`). If a component is not specified and no default is set, `torch.float32` is used.
custom_pipeline (`str`, *optional*):
<Tip warning={true}>
@@ -703,7 +706,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
use_onnx = kwargs.pop("use_onnx", None)
load_connected_pipeline = kwargs.pop("load_connected_pipeline", False)
if torch_dtype is not None and not isinstance(torch_dtype, torch.dtype):
if torch_dtype is not None and not isinstance(torch_dtype, dict) and not isinstance(torch_dtype, torch.dtype):
torch_dtype = torch.float32
logger.warning(
f"Passed `torch_dtype` {torch_dtype} is not a `torch.dtype`. Defaulting to `torch.float32`."
@@ -950,6 +953,11 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
loaded_sub_model = passed_class_obj[name]
else:
# load sub model
sub_model_dtype = (
torch_dtype.get(name, torch_dtype.get("default", torch.float32))
if isinstance(torch_dtype, dict)
else torch_dtype
)
loaded_sub_model = load_sub_model(
library_name=library_name,
class_name=class_name,
@@ -957,7 +965,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
pipelines=pipelines,
is_pipeline_module=is_pipeline_module,
pipeline_class=pipeline_class,
torch_dtype=torch_dtype,
torch_dtype=sub_model_dtype,
provider=provider,
sess_options=sess_options,
device_map=current_device_map,
@@ -998,7 +1006,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
for module in missing_modules:
init_kwargs[module] = passed_class_obj.get(module, None)
elif len(missing_modules) > 0:
passed_modules = set(list(init_kwargs.keys()) + list(passed_class_obj.keys())) - optional_kwargs
passed_modules = set(list(init_kwargs.keys()) + list(passed_class_obj.keys())) - set(optional_kwargs)
raise ValueError(
f"Pipeline {pipeline_class} expected {expected_modules}, but only {passed_modules} were passed."
)

View File

@@ -19,6 +19,7 @@ from typing import Optional, Union
import torch
from huggingface_hub.utils import validate_hf_hub_args
from typing_extensions import Self
from ..utils import BaseOutput, PushToHubMixin
@@ -99,7 +100,7 @@ class SchedulerMixin(PushToHubMixin):
subfolder: Optional[str] = None,
return_unused_kwargs=False,
**kwargs,
):
) -> Self:
r"""
Instantiate a scheduler from a pre-defined JSON configuration file in a local directory or Hub repository.

View File

@@ -109,6 +109,7 @@ if _onnx_available:
"onnxruntime-rocm",
"onnxruntime-migraphx",
"onnxruntime-training",
"onnxruntime-vitisai",
)
_onnxruntime_version = None
# For the metadata, we have to look for both onnxruntime and onnxruntime-gpu
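A hedged sketch of the detection pattern this list feeds (candidate list abridged to the names visible in the hunk): each candidate distribution is tried until one is installed, so custom `onnxruntime-vitisai` builds are now recognized as providing onnxruntime.
```python
import importlib.metadata as importlib_metadata

candidates = (
    "onnxruntime",
    "onnxruntime-gpu",
    "onnxruntime-rocm",
    "onnxruntime-migraphx",
    "onnxruntime-training",
    "onnxruntime-vitisai",
)
_onnxruntime_version = None
for pkg in candidates:
    try:
        _onnxruntime_version = importlib_metadata.version(pkg)
        break  # first installed distribution wins
    except importlib_metadata.PackageNotFoundError:
        pass
print(_onnxruntime_version)
```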

View File

@@ -1161,7 +1161,7 @@ if is_torch_available():
}
BACKEND_RESET_MAX_MEMORY_ALLOCATED = {
"cuda": torch.cuda.reset_max_memory_allocated,
"xpu": None,
"xpu": getattr(torch.xpu, "reset_peak_memory_stats", None),
"cpu": None,
"mps": None,
"default": None,

View File

@@ -153,9 +153,14 @@ class HunyuanDiTControlNetPipelineFastTests(unittest.TestCase, PipelineTesterMix
image_slice = image[0, -3:, -3:, -1]
assert image.shape == (1, 16, 16, 3)
expected_slice = np.array(
[0.6953125, 0.89208984, 0.59375, 0.5078125, 0.5786133, 0.6035156, 0.5839844, 0.53564453, 0.52246094]
)
if torch_device == "xpu":
expected_slice = np.array(
[0.6376953, 0.84375, 0.58691406, 0.48046875, 0.43652344, 0.5517578, 0.54248047, 0.5644531, 0.48217773]
)
else:
expected_slice = np.array(
[0.6953125, 0.89208984, 0.59375, 0.5078125, 0.5786133, 0.6035156, 0.5839844, 0.53564453, 0.52246094]
)
assert (
np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
@@ -351,6 +356,7 @@ class HunyuanDiTControlNetPipelineSlowTests(unittest.TestCase):
assert image.shape == (1024, 1024, 3)
original_image = image[-3:, -3:, -1].flatten()
expected_image = np.array(
[0.43652344, 0.44018555, 0.4494629, 0.44995117, 0.45654297, 0.44848633, 0.43603516, 0.4404297, 0.42626953]
)

View File

@@ -2283,6 +2283,29 @@ class PipelineTesterMixin:
self.assertTrue(np.allclose(output_without_group_offloading, output_with_group_offloading1, atol=1e-4))
self.assertTrue(np.allclose(output_without_group_offloading, output_with_group_offloading2, atol=1e-4))
def test_torch_dtype_dict(self):
components = self.get_dummy_components()
if not components:
self.skipTest("No dummy components defined.")
pipe = self.pipeline_class(**components)
specified_key = next(iter(components.keys()))
with tempfile.TemporaryDirectory(ignore_cleanup_errors=True) as tmpdirname:
pipe.save_pretrained(tmpdirname, safe_serialization=False)
torch_dtype_dict = {specified_key: torch.bfloat16, "default": torch.float16}
loaded_pipe = self.pipeline_class.from_pretrained(tmpdirname, torch_dtype=torch_dtype_dict)
for name, component in loaded_pipe.components.items():
if isinstance(component, torch.nn.Module) and hasattr(component, "dtype"):
expected_dtype = torch_dtype_dict.get(name, torch_dtype_dict.get("default", torch.float32))
self.assertEqual(
component.dtype,
expected_dtype,
f"Component '{name}' has dtype {component.dtype} but expected {expected_dtype}",
)
@is_staging_test
class PipelinePushToHubTester(unittest.TestCase):

View File

@@ -221,7 +221,7 @@ class BnB8bitBasicTests(Base8bitTests):
self.assertTrue(module.weight.dtype == torch.int8)
# test if inference works.
with torch.no_grad() and torch.amp.autocast("cuda", dtype=torch.float16):
with torch.no_grad() and torch.autocast(model.device.type, dtype=torch.float16):
input_dict_for_transformer = self.get_dummy_inputs()
model_inputs = {
k: v.to(device=torch_device) for k, v in input_dict_for_transformer.items() if not isinstance(v, bool)