mirror of https://github.com/huggingface/diffusers.git
synced 2026-01-28 22:44:45 +08:00

Compare commits
3 Commits
integratio ... pipeline-f

| Author | SHA1 | Date |
|---|---|---|
|  | 6d0d52d46c |  |
|  | 9e56d656df |  |
|  | 7d7e18b9cc |  |
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Text-to-Video Generation with AnimateDiff

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

## Overview

[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai.

@@ -15,10 +15,6 @@

# CogVideoX

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.

The abstract from the paper is:

@@ -15,10 +15,6 @@

# ConsisID

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) from Peking University, the University of Rochester, and others, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan.

The abstract from the paper is:
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# FluxControlInpaint

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

FluxControlInpaintPipeline is an implementation of inpainting for the Flux.1 Depth/Canny models: it takes an image and a mask as input and returns the inpainted image.

FLUX.1 Depth and Canny [dev] is a 12 billion parameter rectified flow transformer capable of generating an image based on a text description while following the structure of a given input image. **This is not a ControlNet model**.
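As a quick orientation for the pipeline documented above, here is a minimal usage sketch. It assumes the Depth variant checkpoint (`black-forest-labs/FLUX.1-Depth-dev`) and that an init image, a mask, and a matching depth map already exist on disk; the file names and parameter values are illustrative and not part of this diff.

```python
import torch
from diffusers import FluxControlInpaintPipeline
from diffusers.utils import load_image

# Load the Depth variant of FLUX.1 [dev]; the Canny variant is used the same way.
pipe = FluxControlInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Depth-dev", torch_dtype=torch.bfloat16
).to("cuda")

# init_image: the image to edit; mask_image: white where the image should be
# repainted; control_image: a depth (or Canny edge) map of the same scene.
init_image = load_image("init.png")      # placeholder paths -- substitute your own
mask_image = load_image("mask.png")
control_image = load_image("depth.png")

result = pipe(
    prompt="a red sports car parked in front of a modern glass building",
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    strength=0.9,
    num_inference_steps=30,
    guidance_scale=10.0,
).images[0]
result.save("inpainted.png")
```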
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# ControlNet

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.

With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
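A minimal sketch of the depth-map workflow described above. The checkpoints (`lllyasviel/sd-controlnet-depth`, `stable-diffusion-v1-5/stable-diffusion-v1-5`) and the local depth-map path are illustrative assumptions, not part of this diff.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a depth-conditioned ControlNet and attach it to a Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The control image is a depth map; the generated image follows its spatial layout.
depth_map = load_image("depth.png")  # placeholder path -- substitute your own depth map
image = pipe(
    "a futuristic living room, soft lighting",
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("controlnet_depth.png")
```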
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# ControlNet with Flux.1

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

FluxControlNetPipeline is an implementation of ControlNet for Flux.1.

ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# ControlNet with Stable Diffusion 3

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

StableDiffusion3ControlNetPipeline is an implementation of ControlNet for Stable Diffusion 3.

ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# ControlNet with Stable Diffusion XL

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.

With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# ControlNetUnion

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL.

The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation.

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# ControlNet-XS

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results.

Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# DeepFloyd IF

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

## Overview

DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding.

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Flux

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

Flux is a series of text-to-image generation models based on diffusion transformers. To know more about Flux, check out the original [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/) by the creators of Flux, Black Forest Labs.

Original model checkpoints for Flux can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux).

@@ -14,10 +14,6 @@

# HunyuanVideo

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.

*Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/tencent/HunyuanVideo).*

@@ -9,10 +9,6 @@ specific language governing permissions and limitations under the License.

# Kandinsky 3

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

Kandinsky 3 is created by [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Anastasia Maltseva](https://github.com/NastyaMittseva), [Igor Pavlov](https://github.com/boomb0om), [Andrei Filatov](https://github.com/anvilarth), [Arseniy Shakhmatov](https://github.com/cene555), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), [Denis Dimitrov](https://github.com/denndimitrov), [Zein Shaheen](https://github.com/zeinsh)

The description from its GitHub page:
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>



Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](https://github.com/Kwai-Kolors/Kolors). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf).

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Latent Consistency Models

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://huggingface.co/papers/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao.

The abstract of the paper is as follows:
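A short sketch of few-step sampling with an LCM, assuming the community `SimianLuo/LCM_Dreamshaper_v7` checkpoint; the prompt and step counts are illustrative.

```python
import torch
from diffusers import LatentConsistencyModelPipeline

# LCMs trade a little quality for very few denoising steps.
pipe = LatentConsistencyModelPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
).to("cuda")

# 4 steps is typical for an LCM; guidance_scale usually stays in the ~7-8 range.
image = pipe(
    "a close-up photo of a golden retriever puppy in autumn leaves",
    num_inference_steps=4,
    guidance_scale=8.0,
).images[0]
image.save("lcm.png")
```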
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# LEDITS++

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

LEDITS++ was proposed in [LEDITS++: Limitless Image Editing using Text-to-Image Models](https://huggingface.co/papers/2311.16711) by Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos.

The abstract from the paper is:

@@ -14,10 +14,6 @@

# LTX Video

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide a model for both text-to-video and image + text-to-video use cases.

<Tip>
@@ -14,10 +14,6 @@

# Lumina2

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[Lumina Image 2.0: A Unified and Efficient Image Generative Model](https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0) is a 2 billion parameter flow-based diffusion transformer capable of generating diverse images from text descriptions.

The abstract from the paper is:

@@ -15,10 +15,6 @@

# Mochi 1 Preview

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

> [!TIP]
> Only a research preview of the model weights is available at the moment.

@@ -54,7 +54,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an

| [DiT](dit) | text2image |
| [Flux](flux) | text2image |
| [Hunyuan-DiT](hunyuandit) | text2image |
| [I2VGen-XL](i2vgenxl) | image2video |
| [I2VGen-XL](i2vgenxl) | text2video |
| [InstructPix2Pix](pix2pix) | image editing |
| [Kandinsky 2.1](kandinsky) | text2image, image2image, inpainting, interpolation |
| [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting |
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Perturbed-Attention Guidance

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[Perturbed-Attention Guidance (PAG)](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules.

PAG was introduced in [Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance](https://huggingface.co/papers/2403.17377) by Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin and Seungryong Kim.
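A hedged sketch of enabling PAG through the auto pipeline, assuming the SDXL base checkpoint; `enable_pag`, `pag_applied_layers`, and `pag_scale` follow the library's PAG integration, and the values here are illustrative.

```python
import torch
from diffusers import AutoPipelineForText2Image

# enable_pag wires perturbed-attention guidance into the loaded pipeline.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    pag_applied_layers=["mid"],
    torch_dtype=torch.float16,
).to("cuda")

# pag_scale controls the strength of the perturbed-attention term;
# it can be combined with, or used instead of, classifier-free guidance.
image = pipe(
    "an insect robot preparing a delicious meal",
    guidance_scale=7.0,
    pag_scale=3.0,
).images[0]
image.save("pag.png")
```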
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# MultiDiffusion

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.

The abstract from the paper is:

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Image-to-Video Generation with PIA (Personalized Image Animator)

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

## Overview

[PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://arxiv.org/abs/2312.13964) by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# InstructPix2Pix

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/papers/2211.09800) is by Tim Brooks, Aleksander Holynski and Alexei A. Efros.

The abstract from the paper is:
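A minimal sketch of instruction-based editing with this pipeline, assuming the publicly released `timbrooks/instruct-pix2pix` checkpoint and a local input image (placeholder path).

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("portrait.png")  # placeholder path -- substitute your own image

# The prompt is an edit instruction, not a description of the final image.
edited = pipe(
    "make it look like a watercolor painting",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited.save("edited.png")
```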
@@ -14,10 +14,6 @@

# SanaPipeline

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.

The abstract from the paper is:
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Depth-to-image

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure.

<Tip>
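A minimal sketch of the depth-conditioned workflow above, assuming the `stabilityai/stable-diffusion-2-depth` checkpoint and a local input image; when no `depth_map` is passed, depth is estimated with MiDaS as described.

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("room.png")  # placeholder path -- substitute your own image

# Without an explicit depth_map, the pipeline estimates depth from init_image.
image = pipe(
    prompt="a cozy cabin interior, warm light",
    image=init_image,
    negative_prompt="blurry, low quality",
    strength=0.7,
).images[0]
image.save("depth2img.png")
```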
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Image-to-image

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images.

The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon.
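A minimal sketch of the SDEdit-style image-to-image workflow, assuming an SD 1.5-class checkpoint and a local input image (both illustrative).

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png")  # placeholder path -- substitute your own image

# strength controls how far the denoising moves away from init_image:
# values near 0 return the input almost unchanged, values near 1 mostly ignore it.
image = pipe(
    prompt="a fantasy landscape, detailed oil painting",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
image.save("img2img.png")
```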
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Inpainting

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

The Stable Diffusion model can also be applied to inpainting, which lets you edit specific parts of an image by providing a mask and a text prompt.

## Tips
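A minimal inpainting sketch, assuming the `stabilityai/stable-diffusion-2-inpainting` checkpoint and local image/mask files (placeholders).

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# mask_image is white where the image should be repainted and black elsewhere.
init_image = load_image("photo.png")  # placeholder paths -- substitute your own
mask_image = load_image("mask.png")

image = pipe(
    prompt="a small wooden bench in a garden",
    image=init_image,
    mask_image=mask_image,
).images[0]
image.save("inpainted.png")
```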
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Text-to-(RGB, depth)

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt, unlike existing text-to-image diffusion models such as [Stable Diffusion](./overview), which only generate an image. With almost the same number of parameters, LDM3D creates a latent space that can compress both the RGB images and the depth maps.

Two checkpoints are available for use:
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Stable Diffusion pipelines

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.

Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs.

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Stable Diffusion 3

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

Stable Diffusion 3 (SD3) was proposed in [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/pdf/2403.03206.pdf) by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.

The abstract from the paper is:

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Stable Diffusion XL

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.

The abstract from the paper is:
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Text-to-image

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. It's trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.

The abstract from the paper is:
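A minimal text-to-image sketch for the pipeline described above; the checkpoint name is illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A plain text prompt is the only required input.
image = pipe(
    "an astronaut riding a horse on the moon, photorealistic",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("text2img.png")
```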
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Super-resolution

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

The Stable Diffusion upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/). It is used to enhance the resolution of input images by a factor of 4.

<Tip>
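A minimal sketch of 4x upscaling, assuming the `stabilityai/stable-diffusion-x4-upscaler` checkpoint and a local low-resolution image (placeholder path).

```python
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = load_image("low_res.png")  # placeholder path -- substitute your own image

# The prompt describes the content so the upscaler can add plausible detail.
upscaled = pipe(prompt="a white cat", image=low_res).images[0]
upscaled.save("upscaled_4x.png")
```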
@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Stable unCLIP

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion/stable_diffusion_2) checkpoints to condition on CLIP image embeddings. Stable unCLIP still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used for text guided image variation. When combined with an unCLIP prior, it can also be used for full text to image generation.
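A minimal sketch of text-guided image variation with stable unCLIP, assuming the `stabilityai/stable-diffusion-2-1-unclip` checkpoint and a local input image (placeholder path).

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("input.png")  # placeholder path -- substitute your own image

# The image is encoded with CLIP and used as conditioning alongside the text prompt.
variation = pipe(init_image, prompt="a vibrant watercolor rendition").images[0]
variation.save("unclip_variation.png")
```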
@@ -18,10 +18,6 @@ specific language governing permissions and limitations under the License.

# Text-to-video

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang.

The abstract from the paper is:

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Text2Video-Zero

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).

Text2Video-Zero enables zero-shot video generation using either:

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# UniDiffuser

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.

The abstract from the paper is:

@@ -12,10 +12,6 @@ specific language governing permissions and limitations under the License.

# Würstchen

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

<img src="https://github.com/dome272/Wuerstchen/assets/61938694/0617c863-165a-43ee-9303-2a17299a0cf9">

[Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville.
@@ -1,65 +0,0 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
import torch.distributed as dist

from ..utils import get_logger
from ._common import _BATCHED_INPUT_IDENTIFIERS
from .hooks import HookRegistry, ModelHook


logger = get_logger(__name__)  # pylint: disable=invalid-name

_CFG_PARALLEL = "cfg_parallel"


class CFGParallelHook(ModelHook):
    def initialize_hook(self, module):
        if not dist.is_initialized():
            raise RuntimeError("Distributed environment not initialized.")
        return module

    def new_forward(self, module: torch.nn.Module, *args, **kwargs):
        if len(args) > 0:
            logger.warning(
                "CFGParallelHook is an example hook that does not work with batched positional arguments. Please use with caution."
            )

        world_size = dist.get_world_size()
        rank = dist.get_rank()

        assert world_size == 2, "This is an example hook designed to only work with 2 processes."

        for key in list(kwargs.keys()):
            if key not in _BATCHED_INPUT_IDENTIFIERS or kwargs[key] is None:
                continue
            kwargs[key] = torch.chunk(kwargs[key], world_size, dim=0)[rank].contiguous()

        output = self.fn_ref.original_forward(*args, **kwargs)
        sample = output[0]
        sample_list = [torch.empty_like(sample) for _ in range(world_size)]
        dist.all_gather(sample_list, sample)
        sample = torch.cat(sample_list, dim=0).contiguous()

        return_dict = kwargs.get("return_dict", False)
        if not return_dict:
            return (sample, *output[1:])
        return output.__class__(sample, *output[1:])


def apply_cfg_parallel(module: torch.nn.Module) -> None:
    registry = HookRegistry.check_if_exists_or_initialize(module)
    hook = CFGParallelHook()
    registry.register_hook(hook, _CFG_PARALLEL)
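For context, a rough sketch of how the removed example hook would be wired up. It assumes two processes launched with `torchrun --nproc_per_node=2`, an SD3-style transformer whose keyword inputs match `_BATCHED_INPUT_IDENTIFIERS`, and that `apply_cfg_parallel` from the removed module above is available to import; the checkpoint name is illustrative and this is not a supported API.

```python
import torch
import torch.distributed as dist
from diffusers import StableDiffusion3Pipeline

# Launch with: torchrun --nproc_per_node=2 cfg_parallel_demo.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # illustrative checkpoint
    torch_dtype=torch.float16,
).to(f"cuda:{rank}")

# apply_cfg_parallel is the helper defined in the removed module above: with
# classifier-free guidance active, the transformer receives a (cond, uncond)
# batch, each rank keeps one half of it, and the outputs are all-gathered.
apply_cfg_parallel(pipe.transformer)

image = pipe("a photo of a cat wearing sunglasses", guidance_scale=7.0).images[0]
if rank == 0:
    image.save("cfg_parallel.png")
dist.destroy_process_group()
```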
@@ -1,26 +0,0 @@
from ..models.attention_processor import Attention, MochiAttention


_ATTENTION_CLASSES = (Attention, MochiAttention)

_SPATIAL_TRANSFORMER_BLOCK_IDENTIFIERS = ("blocks", "transformer_blocks", "single_transformer_blocks", "layers")
_TEMPORAL_TRANSFORMER_BLOCK_IDENTIFIERS = ("temporal_transformer_blocks",)
_CROSS_TRANSFORMER_BLOCK_IDENTIFIERS = ("blocks", "transformer_blocks", "layers")

_ALL_TRANSFORMER_BLOCK_IDENTIFIERS = tuple(
    {
        *_SPATIAL_TRANSFORMER_BLOCK_IDENTIFIERS,
        *_TEMPORAL_TRANSFORMER_BLOCK_IDENTIFIERS,
        *_CROSS_TRANSFORMER_BLOCK_IDENTIFIERS,
    }
)

_BATCHED_INPUT_IDENTIFIERS = (
    "hidden_states",
    "encoder_hidden_states",
    "pooled_projections",
    "timestep",
    "attention_mask",
    "encoder_attention_mask",
    "guidance",
)
@@ -20,18 +20,19 @@ import torch

from ..models.attention_processor import Attention, MochiAttention
from ..utils import logging
from ._common import (
    _ATTENTION_CLASSES,
    _CROSS_TRANSFORMER_BLOCK_IDENTIFIERS,
    _SPATIAL_TRANSFORMER_BLOCK_IDENTIFIERS,
    _TEMPORAL_TRANSFORMER_BLOCK_IDENTIFIERS,
)
from .hooks import HookRegistry, ModelHook


logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


_ATTENTION_CLASSES = (Attention, MochiAttention)

_SPATIAL_ATTENTION_BLOCK_IDENTIFIERS = ("blocks", "transformer_blocks", "single_transformer_blocks")
_TEMPORAL_ATTENTION_BLOCK_IDENTIFIERS = ("temporal_transformer_blocks",)
_CROSS_ATTENTION_BLOCK_IDENTIFIERS = ("blocks", "transformer_blocks")


@dataclass
class PyramidAttentionBroadcastConfig:
    r"""

@@ -75,9 +76,9 @@ class PyramidAttentionBroadcastConfig:
    temporal_attention_timestep_skip_range: Tuple[int, int] = (100, 800)
    cross_attention_timestep_skip_range: Tuple[int, int] = (100, 800)

    spatial_attention_block_identifiers: Tuple[str, ...] = _SPATIAL_TRANSFORMER_BLOCK_IDENTIFIERS
    temporal_attention_block_identifiers: Tuple[str, ...] = _TEMPORAL_TRANSFORMER_BLOCK_IDENTIFIERS
    cross_attention_block_identifiers: Tuple[str, ...] = _CROSS_TRANSFORMER_BLOCK_IDENTIFIERS
    spatial_attention_block_identifiers: Tuple[str, ...] = _SPATIAL_ATTENTION_BLOCK_IDENTIFIERS
    temporal_attention_block_identifiers: Tuple[str, ...] = _TEMPORAL_ATTENTION_BLOCK_IDENTIFIERS
    cross_attention_block_identifiers: Tuple[str, ...] = _CROSS_ATTENTION_BLOCK_IDENTIFIERS

    current_timestep_callback: Callable[[], int] = None
@@ -17,7 +17,7 @@ from ..models.embeddings import (
    ImageProjection,
    MultiIPAdapterImageProjection,
)
from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
from ..models.modeling_utils import load_model_dict_into_meta
from ..utils import (
    is_accelerate_available,
    is_torch_version,

@@ -36,7 +36,7 @@ class FluxTransformer2DLoadersMixin:
    Load layers into a [`FluxTransformer2DModel`].
    """

    def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
    def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict, low_cpu_mem_usage=False):
        if low_cpu_mem_usage:
            if is_accelerate_available():
                from accelerate import init_empty_weights

@@ -82,12 +82,11 @@ class FluxTransformer2DLoadersMixin:
        if not low_cpu_mem_usage:
            image_projection.load_state_dict(updated_state_dict, strict=True)
        else:
            device_map = {"": self.device}
            load_model_dict_into_meta(image_projection, updated_state_dict, device_map=device_map, dtype=self.dtype)
            load_model_dict_into_meta(image_projection, updated_state_dict, device=self.device, dtype=self.dtype)

        return image_projection

    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=False):
        from ..models.attention_processor import (
            FluxIPAdapterJointAttnProcessor2_0,
        )

@@ -152,15 +151,15 @@ class FluxTransformer2DLoadersMixin:
            if not low_cpu_mem_usage:
                attn_procs[name].load_state_dict(value_dict)
            else:
                device_map = {"": self.device}
                device = self.device
                dtype = self.dtype
                load_model_dict_into_meta(attn_procs[name], value_dict, device_map=device_map, dtype=dtype)
                load_model_dict_into_meta(attn_procs[name], value_dict, device=device, dtype=dtype)

            key_id += 1

        return attn_procs

    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=False):
        if not isinstance(state_dicts, list):
            state_dicts = [state_dicts]
@@ -75,9 +75,8 @@ class SD3Transformer2DLoadersMixin:
            if not low_cpu_mem_usage:
                attn_procs[name].load_state_dict(layer_state_dict[idx], strict=True)
            else:
                device_map = {"": self.device}
                load_model_dict_into_meta(
                    attn_procs[name], layer_state_dict[idx], device_map=device_map, dtype=self.dtype
                    attn_procs[name], layer_state_dict[idx], device=self.device, dtype=self.dtype
                )

        return attn_procs

@@ -145,8 +144,7 @@ class SD3Transformer2DLoadersMixin:
        if not low_cpu_mem_usage:
            image_proj.load_state_dict(updated_state_dict, strict=True)
        else:
            device_map = {"": self.device}
            load_model_dict_into_meta(image_proj, updated_state_dict, device_map=device_map, dtype=self.dtype)
            load_model_dict_into_meta(image_proj, updated_state_dict, device=self.device, dtype=self.dtype)

        return image_proj
@@ -30,7 +30,7 @@ from ..models.embeddings import (
    IPAdapterPlusImageProjection,
    MultiIPAdapterImageProjection,
)
from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta, load_state_dict
from ..models.modeling_utils import load_model_dict_into_meta, load_state_dict
from ..utils import (
    USE_PEFT_BACKEND,
    _get_model_file,

@@ -143,7 +143,7 @@ class UNet2DConditionLoadersMixin:
        adapter_name = kwargs.pop("adapter_name", None)
        _pipeline = kwargs.pop("_pipeline", None)
        network_alphas = kwargs.pop("network_alphas", None)
        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", False)
        allow_pickle = False

        if low_cpu_mem_usage and is_peft_version("<=", "0.13.0"):

@@ -540,7 +540,7 @@ class UNet2DConditionLoadersMixin:

        return state_dict

    def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
    def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict, low_cpu_mem_usage=False):
        if low_cpu_mem_usage:
            if is_accelerate_available():
                from accelerate import init_empty_weights

@@ -753,12 +753,11 @@ class UNet2DConditionLoadersMixin:
        if not low_cpu_mem_usage:
            image_projection.load_state_dict(updated_state_dict, strict=True)
        else:
            device_map = {"": self.device}
            load_model_dict_into_meta(image_projection, updated_state_dict, device_map=device_map, dtype=self.dtype)
            load_model_dict_into_meta(image_projection, updated_state_dict, device=self.device, dtype=self.dtype)

        return image_projection

    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=False):
        from ..models.attention_processor import (
            IPAdapterAttnProcessor,
            IPAdapterAttnProcessor2_0,

@@ -847,14 +846,13 @@ class UNet2DConditionLoadersMixin:
            else:
                device = next(iter(value_dict.values())).device
                dtype = next(iter(value_dict.values())).dtype
                device_map = {"": device}
                load_model_dict_into_meta(attn_procs[name], value_dict, device_map=device_map, dtype=dtype)
                load_model_dict_into_meta(attn_procs[name], value_dict, device=device, dtype=dtype)

            key_id += 2

        return attn_procs

    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=False):
        if not isinstance(state_dicts, list):
            state_dicts = [state_dicts]
@@ -134,6 +134,19 @@ def _fetch_remapped_cls_from_config(config, old_class):
    return old_class


def _check_archive_and_maybe_raise_error(checkpoint_file, format_list):
    """
    Check format of the archive
    """
    with safetensors.safe_open(checkpoint_file, framework="pt") as f:
        metadata = f.metadata()
    if metadata is not None and metadata.get("format") not in format_list:
        raise OSError(
            f"The safetensors archive passed at {checkpoint_file} does not contain the valid metadata. Make sure "
            "you save your model with the `save_pretrained` method."
        )


def _determine_param_device(param_name: str, device_map: Optional[Dict[str, Union[int, str, torch.device]]]):
    """
    Find the device of param_name from the device_map.

@@ -170,6 +183,7 @@ def load_state_dict(
            # tensors are loaded on cpu
            with dduf_entries[checkpoint_file].as_mmap() as mm:
                return safetensors.torch.load(mm)
        _check_archive_and_maybe_raise_error(checkpoint_file, format_list=["pt", "flax"])
        if disable_mmap:
            return safetensors.torch.load(open(checkpoint_file, "rb").read())
        else:
@@ -224,7 +224,7 @@ class AnimateDiffVideoToVideoPipeline(
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        tokenizer: CLIPTokenizer,
        unet: Union[UNet2DConditionModel, UNetMotionModel],
        unet: UNet2DConditionModel,
        motion_adapter: MotionAdapter,
        scheduler: Union[
            DDIMScheduler,

@@ -246,7 +246,7 @@ class AnimateDiffVideoToVideoControlNetPipeline(
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        tokenizer: CLIPTokenizer,
        unet: Union[UNet2DConditionModel, UNetMotionModel],
        unet: UNet2DConditionModel,
        motion_adapter: MotionAdapter,
        controlnet: Union[ControlNetModel, List[ControlNetModel], Tuple[ControlNetModel], MultiControlNetModel],
        scheduler: Union[
            DDIMScheduler,
@@ -232,8 +232,8 @@ class HunyuanDiTControlNetPipeline(DiffusionPipeline):
            Tuple[HunyuanDiT2DControlNetModel],
            HunyuanDiT2DMultiControlNetModel,
        ],
        text_encoder_2: Optional[T5EncoderModel] = None,
        tokenizer_2: Optional[MT5Tokenizer] = None,
        text_encoder_2=T5EncoderModel,
        tokenizer_2=MT5Tokenizer,
        requires_safety_checker: bool = True,
    ):
        super().__init__()
@@ -17,10 +17,10 @@ from typing import Any, Callable, Dict, List, Optional, Tuple, Union

import torch
from transformers import (
    BaseImageProcessor,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
    SiglipImageProcessor,
    SiglipVisionModel,
    PreTrainedModel,
    T5EncoderModel,
    T5TokenizerFast,
)

@@ -178,9 +178,9 @@ class StableDiffusion3ControlNetPipeline(
            Provides additional conditioning to the `unet` during the denoising process. If you set multiple
            ControlNets as a list, the outputs from each ControlNet are added together to create one combined
            additional conditioning.
        image_encoder (`SiglipVisionModel`, *optional*):
        image_encoder (`PreTrainedModel`, *optional*):
            Pre-trained Vision Model for IP Adapter.
        feature_extractor (`SiglipImageProcessor`, *optional*):
        feature_extractor (`BaseImageProcessor`, *optional*):
            Image processor for IP Adapter.
    """

@@ -202,8 +202,8 @@ class StableDiffusion3ControlNetPipeline(
        controlnet: Union[
            SD3ControlNetModel, List[SD3ControlNetModel], Tuple[SD3ControlNetModel], SD3MultiControlNetModel
        ],
        image_encoder: Optional[SiglipVisionModel] = None,
        feature_extractor: Optional[SiglipImageProcessor] = None,
        image_encoder: PreTrainedModel = None,
        feature_extractor: BaseImageProcessor = None,
    ):
        super().__init__()
        if isinstance(controlnet, (list, tuple)):
@@ -17,10 +17,10 @@ from typing import Any, Callable, Dict, List, Optional, Tuple, Union

import torch
from transformers import (
    BaseImageProcessor,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
    SiglipImageProcessor,
    SiglipModel,
    PreTrainedModel,
    T5EncoderModel,
    T5TokenizerFast,
)

@@ -223,8 +223,8 @@ class StableDiffusion3ControlNetInpaintingPipeline(
        controlnet: Union[
            SD3ControlNetModel, List[SD3ControlNetModel], Tuple[SD3ControlNetModel], SD3MultiControlNetModel
        ],
        image_encoder: SiglipModel = None,
        feature_extractor: Optional[SiglipImageProcessor] = None,
        image_encoder: PreTrainedModel = None,
        feature_extractor: BaseImageProcessor = None,
    ):
        super().__init__()
@@ -17,8 +17,6 @@ from typing import List, Optional, Tuple, Union

import torch

from ...models import UNet1DModel
from ...schedulers import SchedulerMixin
from ...utils import is_torch_xla_available, logging
from ...utils.torch_utils import randn_tensor
from ..pipeline_utils import AudioPipelineOutput, DiffusionPipeline

@@ -51,7 +49,7 @@ class DanceDiffusionPipeline(DiffusionPipeline):

    model_cpu_offload_seq = "unet"

    def __init__(self, unet: UNet1DModel, scheduler: SchedulerMixin):
    def __init__(self, unet, scheduler):
        super().__init__()
        self.register_modules(unet=unet, scheduler=scheduler)

@@ -16,7 +16,6 @@ from typing import List, Optional, Tuple, Union

import torch

from ...models import UNet2DModel
from ...schedulers import DDIMScheduler
from ...utils import is_torch_xla_available
from ...utils.torch_utils import randn_tensor

@@ -48,7 +47,7 @@ class DDIMPipeline(DiffusionPipeline):

    model_cpu_offload_seq = "unet"

    def __init__(self, unet: UNet2DModel, scheduler: DDIMScheduler):
    def __init__(self, unet, scheduler):
        super().__init__()

        # make sure scheduler can always be converted to DDIM

@@ -17,8 +17,6 @@ from typing import List, Optional, Tuple, Union

import torch

from ...models import UNet2DModel
from ...schedulers import DDPMScheduler
from ...utils import is_torch_xla_available
from ...utils.torch_utils import randn_tensor
from ..pipeline_utils import DiffusionPipeline, ImagePipelineOutput

@@ -49,7 +47,7 @@ class DDPMPipeline(DiffusionPipeline):

    model_cpu_offload_seq = "unet"

    def __init__(self, unet: UNet2DModel, scheduler: DDPMScheduler):
    def __init__(self, unet, scheduler):
        super().__init__()
        self.register_modules(unet=unet, scheduler=scheduler)

@@ -91,7 +91,7 @@ class RePaintPipeline(DiffusionPipeline):
    scheduler: RePaintScheduler
    model_cpu_offload_seq = "unet"

    def __init__(self, unet: UNet2DModel, scheduler: RePaintScheduler):
    def __init__(self, unet, scheduler):
        super().__init__()
        self.register_modules(unet=unet, scheduler=scheduler)
@@ -207,8 +207,8 @@ class HunyuanDiTPipeline(DiffusionPipeline):
        safety_checker: StableDiffusionSafetyChecker,
        feature_extractor: CLIPImageProcessor,
        requires_safety_checker: bool = True,
        text_encoder_2: Optional[T5EncoderModel] = None,
        tokenizer_2: Optional[MT5Tokenizer] = None,
        text_encoder_2=T5EncoderModel,
        tokenizer_2=MT5Tokenizer,
    ):
        super().__init__()
@@ -20,7 +20,7 @@ import urllib.parse as ul
from typing import Callable, Dict, List, Optional, Tuple, Union

import torch
from transformers import GemmaPreTrainedModel, GemmaTokenizer, GemmaTokenizerFast
from transformers import AutoModel, AutoTokenizer

from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...image_processor import VaeImageProcessor

@@ -144,10 +144,13 @@ class LuminaText2ImgPipeline(DiffusionPipeline):
    Args:
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`GemmaPreTrainedModel`]):
            Frozen Gemma text-encoder.
        tokenizer (`GemmaTokenizer` or `GemmaTokenizerFast`):
            Gemma tokenizer.
        text_encoder ([`AutoModel`]):
            Frozen text-encoder. Lumina-T2I uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.AutoModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/Alpha-VLLM/tree/main/t5-v1_1-xxl) variant.
        tokenizer (`AutoModel`):
            Tokenizer of class
            [AutoModel](https://huggingface.co/docs/transformers/model_doc/t5#transformers.AutoModel).
        transformer ([`Transformer2DModel`]):
            A text conditioned `Transformer2DModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):

@@ -182,8 +185,8 @@ class LuminaText2ImgPipeline(DiffusionPipeline):
        transformer: LuminaNextDiT2DModel,
        scheduler: FlowMatchEulerDiscreteScheduler,
        vae: AutoencoderKL,
        text_encoder: GemmaPreTrainedModel,
        tokenizer: Union[GemmaTokenizer, GemmaTokenizerFast],
        text_encoder: AutoModel,
        tokenizer: AutoTokenizer,
    ):
        super().__init__()

@@ -17,7 +17,7 @@ from typing import Any, Callable, Dict, List, Optional, Tuple, Union

import numpy as np
import torch
from transformers import Gemma2PreTrainedModel, GemmaTokenizer, GemmaTokenizerFast
from transformers import AutoModel, AutoTokenizer

from ...image_processor import VaeImageProcessor
from ...loaders import Lumina2LoraLoaderMixin

@@ -143,10 +143,13 @@ class Lumina2Text2ImgPipeline(DiffusionPipeline, Lumina2LoraLoaderMixin):
    Args:
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`Gemma2PreTrainedModel`]):
            Frozen Gemma2 text-encoder.
        tokenizer (`GemmaTokenizer` or `GemmaTokenizerFast`):
            Gemma tokenizer.
        text_encoder ([`AutoModel`]):
            Frozen text-encoder. Lumina-T2I uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.AutoModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/Alpha-VLLM/tree/main/t5-v1_1-xxl) variant.
        tokenizer (`AutoModel`):
            Tokenizer of class
            [AutoModel](https://huggingface.co/docs/transformers/model_doc/t5#transformers.AutoModel).
        transformer ([`Transformer2DModel`]):
            A text conditioned `Transformer2DModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):

@@ -162,8 +165,8 @@ class Lumina2Text2ImgPipeline(DiffusionPipeline, Lumina2LoraLoaderMixin):
        transformer: Lumina2Transformer2DModel,
        scheduler: FlowMatchEulerDiscreteScheduler,
        vae: AutoencoderKL,
        text_encoder: Gemma2PreTrainedModel,
        tokenizer: Union[GemmaTokenizer, GemmaTokenizerFast],
        text_encoder: AutoModel,
        tokenizer: AutoTokenizer,
    ):
        super().__init__()
@@ -20,7 +20,7 @@ import warnings
from typing import Callable, Dict, List, Optional, Tuple, Union

import torch
from transformers import Gemma2PreTrainedModel, GemmaTokenizer, GemmaTokenizerFast
from transformers import AutoModelForCausalLM, AutoTokenizer

from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...image_processor import PixArtImageProcessor

@@ -160,8 +160,8 @@ class SanaPAGPipeline(DiffusionPipeline, PAGMixin):

    def __init__(
        self,
        tokenizer: Union[GemmaTokenizer, GemmaTokenizerFast],
        text_encoder: Gemma2PreTrainedModel,
        tokenizer: AutoTokenizer,
        text_encoder: AutoModelForCausalLM,
        vae: AutoencoderDC,
        transformer: SanaTransformer2DModel,
        scheduler: FlowMatchEulerDiscreteScheduler,
@@ -17,7 +17,7 @@ import os
import re
import warnings
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Set, Tuple, Type, Union, get_args, get_origin
from typing import Any, Callable, Dict, List, Optional, Union

import requests
import torch

@@ -1059,76 +1059,3 @@ def _maybe_raise_error_for_incorrect_transformers(config_dict):
            break
    if has_transformers_component and not is_transformers_version(">", "4.47.1"):
        raise ValueError("Please upgrade your `transformers` installation to the latest version to use DDUF.")


def _is_valid_type(obj: Any, class_or_tuple: Union[Type, Tuple[Type, ...]]) -> bool:
    """
    Checks if an object is an instance of any of the provided types. For collections, it checks if every element is of
    the correct type as well.
    """
    if not isinstance(class_or_tuple, tuple):
        class_or_tuple = (class_or_tuple,)

    # Unpack unions
    unpacked_class_or_tuple = []
    for t in class_or_tuple:
        if get_origin(t) is Union:
            unpacked_class_or_tuple.extend(get_args(t))
        else:
            unpacked_class_or_tuple.append(t)
    class_or_tuple = tuple(unpacked_class_or_tuple)

    if Any in class_or_tuple:
        return True

    obj_type = type(obj)
    # Classes with obj's type
    class_or_tuple = {t for t in class_or_tuple if isinstance(obj, get_origin(t) or t)}

    # Singular types (e.g. int, ControlNet, ...)
    # Untyped collections (e.g. List, but not List[int])
    elem_class_or_tuple = {get_args(t) for t in class_or_tuple}
    if () in elem_class_or_tuple:
        return True
    # Typed lists or sets
    elif obj_type in (list, set):
        return any(all(_is_valid_type(x, t) for x in obj) for t in elem_class_or_tuple)
    # Typed tuples
    elif obj_type is tuple:
        return any(
            # Tuples with any length and single type (e.g. Tuple[int, ...])
            (len(t) == 2 and t[-1] is Ellipsis and all(_is_valid_type(x, t[0]) for x in obj))
            or
            # Tuples with fixed length and any types (e.g. Tuple[int, str])
            (len(obj) == len(t) and all(_is_valid_type(x, tt) for x, tt in zip(obj, t)))
            for t in elem_class_or_tuple
        )
    # Typed dicts
    elif obj_type is dict:
        return any(
            all(_is_valid_type(k, kt) and _is_valid_type(v, vt) for k, v in obj.items())
            for kt, vt in elem_class_or_tuple
        )

    else:
        return False


def _get_detailed_type(obj: Any) -> Type:
    """
    Gets a detailed type for an object, including nested types for collections.
    """
    obj_type = type(obj)

    if obj_type in (list, set):
        obj_origin_type = List if obj_type is list else Set
        elems_type = Union[tuple({_get_detailed_type(x) for x in obj})]
        return obj_origin_type[elems_type]
    elif obj_type is tuple:
        return Tuple[tuple(_get_detailed_type(x) for x in obj)]
    elif obj_type is dict:
        keys_type = Union[tuple({_get_detailed_type(k) for k in obj.keys()})]
        values_type = Union[tuple({_get_detailed_type(k) for k in obj.values()})]
        return Dict[keys_type, values_type]
    else:
        return obj_type
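To illustrate what the two removed helpers above do (inputs chosen purely for illustration, calling the functions shown in the removed block):

```python
from typing import Dict, List, Tuple, Union

# _is_valid_type recurses into collections and unpacks Unions.
_is_valid_type(3, int)                     # True
_is_valid_type([1, 2, 3], List[int])       # True
_is_valid_type((1, "a"), Tuple[int, str])  # True
_is_valid_type({"a": 1}, Dict[str, str])   # False -- the values are ints, not strs
_is_valid_type(None, Union[int, str])      # False

# _get_detailed_type reports the nested type it actually sees, which is what the
# type-checking warning in DiffusionPipeline.from_pretrained used to print.
_get_detailed_type([1, 2])      # typing.List[int]
_get_detailed_type({"a": 1.0})  # typing.Dict[str, float]
```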
@@ -13,6 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import enum
import fnmatch
import importlib
import inspect
@@ -78,12 +79,10 @@ from .pipeline_loading_utils import (
    _fetch_class_library_tuple,
    _get_custom_components_and_folders,
    _get_custom_pipeline_class,
    _get_detailed_type,
    _get_final_device_map,
    _get_ignore_patterns,
    _get_pipeline_class,
    _identify_model_variants,
    _is_valid_type,
    _maybe_raise_error_for_incorrect_transformers,
    _maybe_raise_warning_for_inpainting,
    _resolve_custom_pipeline_and_cls,
@@ -877,6 +876,26 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):

        init_dict = {k: v for k, v in init_dict.items() if load_module(k, v)}

        for key in init_dict.keys():
            if key not in passed_class_obj:
                continue
            if "scheduler" in key:
                continue

            class_obj = passed_class_obj[key]
            _expected_class_types = []
            for expected_type in expected_types[key]:
                if isinstance(expected_type, enum.EnumMeta):
                    _expected_class_types.extend(expected_type.__members__.keys())
                else:
                    _expected_class_types.append(expected_type.__name__)

            _is_valid_type = class_obj.__class__.__name__ in _expected_class_types
            if not _is_valid_type:
                logger.warning(
                    f"Expected types for {key}: {_expected_class_types}, got {class_obj.__class__.__name__}."
                )

        # Special case: safety_checker must be loaded separately when using `from_flax`
        if from_flax and "safety_checker" in init_dict and "safety_checker" not in passed_class_obj:
            raise NotImplementedError(
@@ -996,26 +1015,10 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
                f"Pipeline {pipeline_class} expected {expected_modules}, but only {passed_modules} were passed."
            )

        # 10. Type checking init arguments
        for kw, arg in init_kwargs.items():
            # Too complex to validate with type annotation alone
            if "scheduler" in kw:
                continue
            # Many tokenizer annotations don't include its "Fast" variant, so skip this
            # e.g T5Tokenizer but not T5TokenizerFast
            elif "tokenizer" in kw:
                continue
            elif (
                arg is not None  # Skip if None
                and not expected_types[kw] == (inspect.Signature.empty,)  # Skip if no type annotations
                and not _is_valid_type(arg, expected_types[kw])  # Check type
            ):
                logger.warning(f"Expected types for {kw}: {expected_types[kw]}, got {_get_detailed_type(arg)}.")

        # 11. Instantiate the pipeline
        # 10. Instantiate the pipeline
        model = pipeline_class(**init_kwargs)

        # 12. Save where the model was instantiated from
        # 11. Save where the model was instantiated from
        model.register_to_config(_name_or_path=pretrained_model_name_or_path)
        if device_map is not None:
            setattr(model, "hf_device_map", final_device_map)

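The net effect of the checks above is that `DiffusionPipeline.from_pretrained` only logs a warning, and never raises, when a passed component does not match the annotated type. A hypothetical sketch from the caller's side; the checkpoint id and the deliberately mismatched component are assumptions for illustration, and the exact warning text depends on the pipeline's annotations:

# Hypothetical illustration: override a component with the wrong type on purpose.
# The repo id and the component choice are examples, not part of this diff.
from diffusers import DDPMScheduler, DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    text_encoder=DDPMScheduler(),  # deliberately the wrong type
)
# Pipeline construction still succeeds; a warning similar to the following is logged instead of
# an error (actual inference with this pipeline would of course fail later):
#   Expected types for text_encoder: ['CLIPTextModel'], got DDPMScheduler.
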
@@ -20,7 +20,7 @@ import warnings
from typing import Any, Callable, Dict, List, Optional, Tuple, Union

import torch
from transformers import Gemma2PreTrainedModel, GemmaTokenizer, GemmaTokenizerFast
from transformers import AutoModelForCausalLM, AutoTokenizer

from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...image_processor import PixArtImageProcessor
@@ -200,8 +200,8 @@ class SanaPipeline(DiffusionPipeline, SanaLoraLoaderMixin):

    def __init__(
        self,
        tokenizer: Union[GemmaTokenizer, GemmaTokenizerFast],
        text_encoder: Gemma2PreTrainedModel,
        tokenizer: AutoTokenizer,
        text_encoder: AutoModelForCausalLM,
        vae: AutoencoderDC,
        transformer: SanaTransformer2DModel,
        scheduler: DPMSolverMultistepScheduler,

@@ -15,7 +15,7 @@
from typing import Callable, Dict, List, Optional, Union

import torch
from transformers import CLIPTextModelWithProjection, CLIPTokenizer
from transformers import CLIPTextModel, CLIPTokenizer

from ...models import StableCascadeUNet
from ...schedulers import DDPMWuerstchenScheduler
@@ -65,7 +65,7 @@ class StableCascadeDecoderPipeline(DiffusionPipeline):
    Args:
        tokenizer (`CLIPTokenizer`):
            The CLIP tokenizer.
        text_encoder (`CLIPTextModelWithProjection`):
        text_encoder (`CLIPTextModel`):
            The CLIP text encoder.
        decoder ([`StableCascadeUNet`]):
            The Stable Cascade decoder unet.
@@ -93,7 +93,7 @@ class StableCascadeDecoderPipeline(DiffusionPipeline):
        self,
        decoder: StableCascadeUNet,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModelWithProjection,
        text_encoder: CLIPTextModel,
        scheduler: DDPMWuerstchenScheduler,
        vqgan: PaellaVQModel,
        latent_dim_scale: float = 10.67,

@@ -15,7 +15,7 @@ from typing import Callable, Dict, List, Optional, Union

import PIL
import torch
from transformers import CLIPImageProcessor, CLIPTextModelWithProjection, CLIPTokenizer, CLIPVisionModelWithProjection
from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer, CLIPVisionModelWithProjection

from ...models import StableCascadeUNet
from ...schedulers import DDPMWuerstchenScheduler
@@ -52,7 +52,7 @@ class StableCascadeCombinedPipeline(DiffusionPipeline):
    Args:
        tokenizer (`CLIPTokenizer`):
            The decoder tokenizer to be used for text inputs.
        text_encoder (`CLIPTextModelWithProjection`):
        text_encoder (`CLIPTextModel`):
            The decoder text encoder to be used for text inputs.
        decoder (`StableCascadeUNet`):
            The decoder model to be used for decoder image generation pipeline.
@@ -60,18 +60,14 @@ class StableCascadeCombinedPipeline(DiffusionPipeline):
            The scheduler to be used for decoder image generation pipeline.
        vqgan (`PaellaVQModel`):
            The VQGAN model to be used for decoder image generation pipeline.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            Model that extracts features from generated images to be used as inputs for the `image_encoder`.
        image_encoder ([`CLIPVisionModelWithProjection`]):
            Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        prior_prior (`StableCascadeUNet`):
            The prior model to be used for prior pipeline.
        prior_text_encoder (`CLIPTextModelWithProjection`):
            The prior text encoder to be used for text inputs.
        prior_tokenizer (`CLIPTokenizer`):
            The prior tokenizer to be used for text inputs.
        prior_scheduler (`DDPMWuerstchenScheduler`):
            The scheduler to be used for prior pipeline.
        prior_feature_extractor ([`~transformers.CLIPImageProcessor`]):
            Model that extracts features from generated images to be used as inputs for the `image_encoder`.
        prior_image_encoder ([`CLIPVisionModelWithProjection`]):
            Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
    """

    _load_connected_pipes = True
@@ -80,12 +76,12 @@ class StableCascadeCombinedPipeline(DiffusionPipeline):
    def __init__(
        self,
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModelWithProjection,
        text_encoder: CLIPTextModel,
        decoder: StableCascadeUNet,
        scheduler: DDPMWuerstchenScheduler,
        vqgan: PaellaVQModel,
        prior_prior: StableCascadeUNet,
        prior_text_encoder: CLIPTextModelWithProjection,
        prior_text_encoder: CLIPTextModel,
        prior_tokenizer: CLIPTokenizer,
        prior_scheduler: DDPMWuerstchenScheduler,
        prior_feature_extractor: Optional[CLIPImageProcessor] = None,

@@ -141,7 +141,7 @@ class StableUnCLIPPipeline(
        image_noising_scheduler: KarrasDiffusionSchedulers,
        # regular denoising components
        tokenizer: CLIPTokenizer,
        text_encoder: CLIPTextModel,
        text_encoder: CLIPTextModelWithProjection,
        unet: UNet2DConditionModel,
        scheduler: KarrasDiffusionSchedulers,
        # vae

@@ -17,10 +17,10 @@ from typing import Any, Callable, Dict, List, Optional, Union

import torch
from transformers import (
    BaseImageProcessor,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
    SiglipImageProcessor,
    SiglipVisionModel,
    PreTrainedModel,
    T5EncoderModel,
    T5TokenizerFast,
)
@@ -176,9 +176,9 @@ class StableDiffusion3Pipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingle
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
        image_encoder (`SiglipVisionModel`, *optional*):
        image_encoder (`PreTrainedModel`, *optional*):
            Pre-trained Vision Model for IP Adapter.
        feature_extractor (`SiglipImageProcessor`, *optional*):
        feature_extractor (`BaseImageProcessor`, *optional*):
            Image processor for IP Adapter.
    """

@@ -197,8 +197,8 @@ class StableDiffusion3Pipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingle
        tokenizer_2: CLIPTokenizer,
        text_encoder_3: T5EncoderModel,
        tokenizer_3: T5TokenizerFast,
        image_encoder: SiglipVisionModel = None,
        feature_extractor: SiglipImageProcessor = None,
        image_encoder: PreTrainedModel = None,
        feature_extractor: BaseImageProcessor = None,
    ):
        super().__init__()

@@ -18,10 +18,10 @@ from typing import Any, Callable, Dict, List, Optional, Union
import PIL.Image
import torch
from transformers import (
    BaseImageProcessor,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
    SiglipImageProcessor,
    SiglipVisionModel,
    PreTrainedModel,
    T5EncoderModel,
    T5TokenizerFast,
)
@@ -197,10 +197,6 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
        image_encoder (`SiglipVisionModel`, *optional*):
            Pre-trained Vision Model for IP Adapter.
        feature_extractor (`SiglipImageProcessor`, *optional*):
            Image processor for IP Adapter.
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->image_encoder->transformer->vae"
@@ -218,8 +214,8 @@ class StableDiffusion3Img2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
        tokenizer_2: CLIPTokenizer,
        text_encoder_3: T5EncoderModel,
        tokenizer_3: T5TokenizerFast,
        image_encoder: Optional[SiglipVisionModel] = None,
        feature_extractor: Optional[SiglipImageProcessor] = None,
        image_encoder: PreTrainedModel = None,
        feature_extractor: BaseImageProcessor = None,
    ):
        super().__init__()

@@ -17,10 +17,10 @@ from typing import Any, Callable, Dict, List, Optional, Union

import torch
from transformers import (
    BaseImageProcessor,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
    SiglipImageProcessor,
    SiglipVisionModel,
    PreTrainedModel,
    T5EncoderModel,
    T5TokenizerFast,
)
@@ -196,9 +196,9 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
        image_encoder (`SiglipVisionModel`, *optional*):
        image_encoder (`PreTrainedModel`, *optional*):
            Pre-trained Vision Model for IP Adapter.
        feature_extractor (`SiglipImageProcessor`, *optional*):
        feature_extractor (`BaseImageProcessor`, *optional*):
            Image processor for IP Adapter.
    """

@@ -217,8 +217,8 @@ class StableDiffusion3InpaintPipeline(DiffusionPipeline, SD3LoraLoaderMixin, Fro
        tokenizer_2: CLIPTokenizer,
        text_encoder_3: T5EncoderModel,
        tokenizer_3: T5TokenizerFast,
        image_encoder: Optional[SiglipVisionModel] = None,
        feature_extractor: Optional[SiglipImageProcessor] = None,
        image_encoder: PreTrainedModel = None,
        feature_extractor: BaseImageProcessor = None,
    ):
        super().__init__()

@@ -19,31 +19,15 @@ from typing import Callable, List, Optional, Union
import torch
from k_diffusion.external import CompVisDenoiser, CompVisVDenoiser
from k_diffusion.sampling import BrownianTreeNoiseSampler, get_sigmas_karras
from transformers import (
    CLIPImageProcessor,
    CLIPTextModel,
    CLIPTokenizer,
    CLIPTokenizerFast,
)

from ...image_processor import VaeImageProcessor
from ...loaders import (
    StableDiffusionLoraLoaderMixin,
    TextualInversionLoaderMixin,
)
from ...models import AutoencoderKL, UNet2DConditionModel
from ...loaders import StableDiffusionLoraLoaderMixin, TextualInversionLoaderMixin
from ...models.lora import adjust_lora_scale_text_encoder
from ...schedulers import KarrasDiffusionSchedulers, LMSDiscreteScheduler
from ...utils import (
    USE_PEFT_BACKEND,
    deprecate,
    logging,
    scale_lora_layers,
    unscale_lora_layers,
)
from ...schedulers import LMSDiscreteScheduler
from ...utils import USE_PEFT_BACKEND, deprecate, logging, scale_lora_layers, unscale_lora_layers
from ...utils.torch_utils import randn_tensor
from ..pipeline_utils import DiffusionPipeline, StableDiffusionMixin
from ..stable_diffusion import StableDiffusionPipelineOutput, StableDiffusionSafetyChecker
from ..stable_diffusion import StableDiffusionPipelineOutput


logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
@@ -111,13 +95,13 @@ class StableDiffusionKDiffusionPipeline(

    def __init__(
        self,
        vae: AutoencoderKL,
        text_encoder: CLIPTextModel,
        tokenizer: Union[CLIPTokenizer, CLIPTokenizerFast],
        unet: UNet2DConditionModel,
        scheduler: KarrasDiffusionSchedulers,
        safety_checker: StableDiffusionSafetyChecker,
        feature_extractor: CLIPImageProcessor,
        vae,
        text_encoder,
        tokenizer,
        unet,
        scheduler,
        safety_checker,
        feature_extractor,
        requires_safety_checker: bool = True,
    ):
        super().__init__()

@@ -815,7 +815,7 @@ def is_peft_version(operation: str, version: str):
        version (`str`):
            A version string
    """
    if not _peft_available:
    if not _peft_version:
        return False
    return compare_versions(parse(_peft_version), operation, version)

@@ -829,7 +829,7 @@ def is_bitsandbytes_version(operation: str, version: str):
        version (`str`):
            A version string
    """
    if not _bitsandbytes_available:
    if not _bitsandbytes_version:
        return False
    return compare_versions(parse(_bitsandbytes_version), operation, version)

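The two guards above belong to the version comparison helpers in `diffusers.utils.import_utils`. Checking the cached version string instead of the availability flag presumably skips the comparison whenever no version could be resolved, not only when the package is missing, so `parse()` is never called on an empty value. A short usage sketch; the version thresholds are arbitrary examples, not values from this diff:

# Example guards built on the helpers patched above; thresholds are illustrative only.
from diffusers.utils.import_utils import is_bitsandbytes_version, is_peft_version

if is_peft_version(">=", "0.10.0"):
    # Safe to rely on newer PEFT APIs here.
    pass

if not is_bitsandbytes_version(">=", "0.43.0"):
    raise RuntimeError("Please upgrade bitsandbytes to use this feature.")
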
tests/fixtures/custom_pipeline/pipeline.py
@@ -18,7 +18,7 @@ from typing import Optional, Tuple, Union

import torch

from diffusers import DiffusionPipeline, ImagePipelineOutput, SchedulerMixin, UNet2DModel
from diffusers import DiffusionPipeline, ImagePipelineOutput


class CustomLocalPipeline(DiffusionPipeline):
@@ -33,7 +33,7 @@ class CustomLocalPipeline(DiffusionPipeline):
        [`DDPMScheduler`], or [`DDIMScheduler`].
    """

    def __init__(self, unet: UNet2DModel, scheduler: SchedulerMixin):
    def __init__(self, unet, scheduler):
        super().__init__()
        self.register_modules(unet=unet, scheduler=scheduler)

tests/fixtures/custom_pipeline/what_ever.py
@@ -18,7 +18,6 @@ from typing import Optional, Tuple, Union

import torch

from diffusers import SchedulerMixin, UNet2DModel
from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput


@@ -34,7 +33,7 @@ class CustomLocalPipeline(DiffusionPipeline):
        [`DDPMScheduler`], or [`DDIMScheduler`].
    """

    def __init__(self, unet: UNet2DModel, scheduler: SchedulerMixin):
    def __init__(self, unet, scheduler):
        super().__init__()
        self.register_modules(unet=unet, scheduler=scheduler)

@@ -91,10 +91,10 @@ class Lumina2Text2ImgPipelinePipelineFastTests(unittest.TestCase, PipelineTester
        text_encoder = Gemma2Model(config)

        components = {
            "transformer": transformer,
            "transformer": transformer.eval(),
            "vae": vae.eval(),
            "scheduler": scheduler,
            "text_encoder": text_encoder,
            "text_encoder": text_encoder.eval(),
            "tokenizer": tokenizer,
        }
        return components