Compare commits

...

43 Commits

Author SHA1 Message Date
Dhruv Nair
05df661f57 test differences in controlnet with max instead of sum 2023-08-24 09:32:21 +00:00
Dhruv Nair
731dd5f1d1 use max to test difference 2023-08-24 09:31:00 +00:00
Dhruv Nair
493757b52a use max to test difference instead of sum 2023-08-24 09:26:49 +00:00
Dhruv Nair
3901942d1b fix precision tolerance and change max diff from sum to max 2023-08-24 09:24:57 +00:00
Dhruv Nair
b36081acf6 clean up 2023-08-23 06:32:18 +00:00
Dhruv Nair
4a6a03a98b clean up sd onnx upscale 2023-08-22 14:21:32 +00:00
Dhruv Nair
9a393a8b10 initial commit to fix inheritance issue 2023-08-17 12:49:27 +00:00
Jacqui Wei
7c3e7fedcd Fix use_onnx parameter usage in from_pretrained func and update test_download_no_onnx_by_default test (#4508)
* add missing use_onnx in from_pretrained func

* fix test_download_no_onnx_by_default test func

* address comments

* split test cases
2023-08-17 11:49:32 +05:30
Patrick von Platen
029fb41695 [Safetensors] Make safetensors the default way of saving weights (#4235)
* make safetensors default

* set default save method as safetensors

* update tests

* update to support saving safetensors

* update test to account for safetensors default

* update example tests to use safetensors

* update example to support safetensors

* update unet tests for safetensors

* fix failing loader tests

* fix qc issues

* fix pipeline tests

* fix example test

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-08-17 10:54:28 +05:30
Batuhan Taskaya
852dc76d6d Support higher dimension LoRAs (#4625)
* Support higher dimension LoRAs

* add: tests

* fix: assertion values.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-17 10:07:07 +05:30
Scott Lessans
064f150813 Fix UnboundLocalError during LoRA loading (#4523)
* fixed

* add: tests

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-17 09:33:35 +05:30
Sayak Paul
5333f4c0ec make things clear in the controlnet sdxl doc. (#4644) 2023-08-17 09:04:28 +05:30
Dhruv Nair
3d08d8dc4e fix loading custom text encoder when using from_single_file (#4571)
fix loading custom text encoder when using from_single_file
2023-08-17 08:41:09 +05:30
Steven Liu
bdc4c3265f [docs] MultiControlNet (#4635)
multicontrolnet docs
2023-08-17 08:14:20 +05:30
Steven Liu
4ff7264d9b [docs] PushToHubMixin (#4622)
* push to hub docs

* fix typo

* feedback

* make style
2023-08-16 13:20:59 -06:00
Sayak Paul
5049599143 [Core] feat: MultiControlNet support for SDXL ControlNet pipeline (#4597)
* core: add multicontrolnet support to sdxl controlnet

* modify checks.

* fix: original_size determination

* add: tests for multi controlnet sdxl.

* remove unnecessary prints.
2023-08-16 20:30:39 +05:30
Suraj Patil
7b93c2a882 [research_projects] SDXL controlnet script (#4633)
add controlnet script
2023-08-16 18:27:08 +05:30
Dirk Morris
a7de96505b Fix unipc use_karras_sigmas exception - fixes huggingface/diffusers#4580 (#4581)
* Fix unipc karras sigmas exception - fixes huggingface/diffusers#4580

* Add unipc scheduler tests for karras sigmas
2023-08-16 10:01:53 +05:30
Sayak Paul
351aab60e9 Update text2image.md to fix the links (#4626) 2023-08-16 09:53:10 +05:30
nikhil-masterful
da5ab51d54 Add GLIGEN implementation (#4441)
* Add GLIGEN implementation

* GLIGEN: Fix code quality check failures

* GLIGEN: Fix Import block un-sorted or un-formatted failures

* GLIGEN: Fix check_repository_consistency failures

* GLIGEN: Add 'PositionNet' to versatile_diffusion/modeling_text_unet.py

* GLIGEN: check_repository_consistency: fix 'copy does not match' error

* GLIGEN: Fix review comments (1)

* GLIGEN: Fix E721 Do not compare types, use `isinstance()` failures

* GLIGEN : Ensure _encode_prompt() copy matches to StableDiffusionPipeline

* GLIGEN: Fix ruff E721 failure in unidiffuser/test_unidiffuser.py

* GLIGEN: doc_builder: restyle pipeline_stable_diffusion_gligen.py

* GLIGEN: reset files unrelated to gligen

* GLIGEN: Fix documentation comments (1)

* GLIGEN: Fix review comments (2)

* GLIGEN: Added FastTest

* GLIGEN: Fix review comments (3)
2023-08-16 09:34:17 +05:30
Sayak Paul
5175d3d7a5 add: train to text image with sdxl script. (#4505)
* add: train to text image with sdxl script.

Co-authored-by: CaptnSeraph <s3raph1m@gmail.com>

* fix: partial func.

* fix: default value of output_dir.

* make style

* set num inference steps to 25.

* remove mentions of LoRA.

* up min version

* add: ema cli arg

* run device placement while running step.

* precompute vae encodings too.

* fix

* debug

* should work now.

* debug

* debug

* goes alright?

* style

* debugging

* debugging

* debugging

* debugging

* fix

* reinit scheduler if prediction_type was passed.

* always cast vae in float32

* better handling of snr.

Co-authored-by: bghira <bghira@users.github.com>

* the vae should be also passed

* add: docs.

* add: sdxl t2i tests

* save the pipeline

* autocast.

* fix: save_model_card

* fix: save_model_card.

---------

Co-authored-by: CaptnSeraph <s3raph1m@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: bghira <bghira@users.github.com>
2023-08-16 09:02:49 +05:30
Sayak Paul
a7508a76f0 add: pushtohubmixin to pipelines and schedulers docs overview. (#4607)
* add: pushtohubmixin to pipelines and schedulers docs overview.

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-08-15 22:23:17 +05:30
Sayak Paul
aaef41b5fe [Docs] fix links in the controlling generation doc. (#4612)
* fix links in the controlling generation doc.

* more fixes.
2023-08-15 20:27:13 +05:30
Wang Qiang
078df46bc9 An invalid clerical error in sdxl finetune (#4608) 2023-08-15 10:41:51 +05:30
Sayak Paul
15782fd506 [Pipeline utils] feat: implement push_to_hub for standalone models, schedulers as well as pipelines (#4128)
* feat: implement push_to_hub for standalone models.

* address PR feedback.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* remove max_shard_size.

* add: support for scheduler push_to_hub

* enable push_to_hub support for flax schedulers.

* enable push_to_hub for pipelines.

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* reflect pr feedback.

* address another round of feedback.

* better handling of kwargs.

* add: tests

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* setting hub staging to False for now.

* incorporate staging test as a separate job.

Co-authored-by: ydshieh <2521628+ydshieh@users.noreply.github.com>

* fix: tokenizer loading.

* fix: json dumping.

* move is_staging_test to a better location.

* better treatment to tokens.

* define repo_id to better handle concurrency

* style

* explicitly set token

* Empty-Commit

* move USER, TOKEN to test

* collate org_repo_id

* delete repo

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: ydshieh <2521628+ydshieh@users.noreply.github.com>
2023-08-15 07:39:22 +05:30
Sayak Paul
d93ca26893 [Examples] Update InstructPix2Pix README_sdxl.md to fix mentions (#4574)
* Update README_sdxl.md to fix mentions

* add --push_to_hub

* add --push_to_hub

* fix: mention
2023-08-14 17:48:13 +05:30
Claire Froelich
32963c24c5 Fix git-lfs command typo in docs (#4586)
fix typo in git-lfs command

added missing hyphen. "git lfs" is not a command
2023-08-14 17:21:45 +05:30
AisingioroHao
1b739e7344 Fixed invalid pipeline_class_name parameter. (#4590)
* Fixed invalid pipeline_class_name parameter.

* Fix the format
2023-08-14 17:21:17 +05:30
Sayak Paul
d67eba0f31 [Utility] adds an image grid utility (#4576)
* add: utility for image grid.

* add: return type.

* change necessary places.

* add to utility page.
2023-08-12 10:34:51 +05:30
Steven Liu
714bfed859 [docs] Fix ControlNet SDXL docstring (#4582)
fix
2023-08-11 10:43:40 -07:00
Sayak Paul
d5983a6779 [Examples] fix: network_alpha -> network_alphas (#4572)
network_alpha
2023-08-11 14:18:49 +05:30
Mystfit
796c01534d Fixing repo_id regex validation error on windows platforms (#4358)
* Fixing repo_id regex validation error on windows platforms

* Validating correct URL with prefix is provided

If we are loading a URL then we don't need to use os.path.join and array slicing to split out a repo_id and file path from an absolute filepath. 

We check if the URL prefix is valid before doing any URL splitting; otherwise we raise a ValueError since neither a valid filepath nor a URL was provided.

* Style fixes

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-11 11:11:43 +05:30
Abhipsha Das
c8d86e9f0a Remove code snippets containing is_safetensors_available() (#4521)
* [WIP] Remove code snippets containing `is_safetensors_available()`

* Modifying `import_utils.py`

* update pipeline tests for safetensor default

* fix test related to cached requests

* address import nits

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-08-11 11:05:22 +05:30
dotieuthien
b28cd3fba0 Convert Stable Diffusion ControlNet to TensorRT (#4465)
* convert tensorrt controlnet

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix number controlnet condition

* Add convert SD XL to onnx

* Add convert SD XL to tensorrt

* Add convert SD XL to tensorrt

* Add examples in comments

* Add examples in comments

* Add test onnx controlnet

* Add tensorrt test

* Remove copied

* Move file test to examples/community

* Remove script

* Remove script

* Remove text

---------

Co-authored-by: dotieuthien <thien.do@mservice.com.vn>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-11 08:12:26 +05:30
Steven Liu
cd7071e750 [docs] Add safetensors flag (#4245)
* add safetensors flag

* apply review
2023-08-10 12:37:23 -07:00
Steven Liu
e31f38b5d6 [docs] Remove attention slicing (#4518)
* remove attention slicing

* apply feedback
2023-08-10 11:00:03 -07:00
Steven Liu
3bd5e073cb [docs] Expand prompt weighting (#4516)
* add more weighting/blend/conjunction

* finish blend/conjunction

* add textual inversion example

* add dreambooth
2023-08-10 10:56:53 -07:00
YiYi Xu
3df52ba8dc [Doc] update sdxl-controlnet repo name (#4564)
* rename

* style

---------

Co-authored-by: yiyixuxu <yixu310@gmail.com>
2023-08-10 22:02:32 +05:30
Sayak Paul
c697c5ab57 improve controlnet sdxl docs now that we have a good checkpoint. (#4556) 2023-08-10 08:21:36 +05:30
VV-A-VV
3fd45eb10f fix some typo error (#4546)
* fix some typo error

* Undo changes to capitalization
2023-08-10 06:49:25 +05:30
Patrick von Platen
5cbcbe3c63 Revert "introduce minimalistic reimplementation of SDXL on the SDXL doc" (#4548)
Revert "introduce minimalistic reimplementation of SDXL on the SDXL doc (#4532)"

This reverts commit e7e3749498.
2023-08-10 06:49:06 +05:30
Dhruv Nair
7b07f9812a Move controlnet load local tests to nightly (#4543)
move controlnet load local tests to nightly
2023-08-09 23:00:42 +05:30
Steven Liu
16ad13b61d [docs] Clean scheduler api (#4204)
* clean scheduler mixin

* up to dpmsolvermultistep

* finish cleaning

* first draft

* fix overview table

* apply feedback

* update reference code
2023-08-09 09:00:35 -07:00
170 changed files with 11201 additions and 2589 deletions

View File

@@ -113,3 +113,60 @@ jobs:
with:
name: pr_${{ matrix.config.report }}_test_reports
path: reports
run_staging_tests:
strategy:
fail-fast: false
matrix:
config:
- name: Hub tests for models, schedulers, and pipelines
framework: hub_tests_pytorch
runner: docker-cpu
image: diffusers/diffusers-pytorch-cpu
report: torch_hub
name: ${{ matrix.config.name }}
runs-on: ${{ matrix.config.runner }}
container:
image: ${{ matrix.config.image }}
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
defaults:
run:
shell: bash
steps:
- name: Checkout diffusers
uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Install dependencies
run: |
apt-get update && apt-get install libsndfile1-dev libgl1 -y
python -m pip install -e .[quality,test]
- name: Environment
run: |
python utils/print_env.py
- name: Run Hub tests for models, schedulers, and pipelines on a staging env
if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}
run: |
HUGGINGFACE_CO_STAGING=true python -m pytest \
-m "is_staging_test" \
--make-reports=tests_${{ matrix.config.report }} \
tests
- name: Failure short reports
if: ${{ failure() }}
run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: pr_${{ matrix.config.report }}_test_reports
path: reports

View File

@@ -32,6 +32,8 @@
title: Load safetensors
- local: using-diffusers/other-formats
title: Load different Stable Diffusion formats
- local: using-diffusers/push_to_hub
title: Push files to the Hub
title: Loading & Hub
- sections:
- local: using-diffusers/pipeline_overview
@@ -65,7 +67,7 @@
- local: using-diffusers/stable_diffusion_jax_how_to
title: Stable Diffusion in JAX/Flax
- local: using-diffusers/weighted_prompts
title: Weighting Prompts
title: Prompt weighting
title: Pipelines for Inference
- sections:
- local: training/overview
@@ -194,6 +196,8 @@
title: Consistency Models
- local: api/pipelines/controlnet
title: ControlNet
- local: api/pipelines/controlnet_sdxl
title: ControlNet with Stable Diffusion XL
- local: api/pipelines/cycle_diffusion
title: Cycle Diffusion
- local: api/pipelines/dance_diffusion
@@ -265,6 +269,8 @@
title: LDM3D Text-to-(RGB, Depth)
- local: api/pipelines/stable_diffusion/adapter
title: Stable Diffusion T2I-adapter
- local: api/pipelines/stable_diffusion/gligen
title: GLIGEN (Grounded Language-to-Image Generation)
title: Stable Diffusion
- local: api/pipelines/stable_unclip
title: Stable unCLIP
@@ -293,49 +299,49 @@
- local: api/schedulers/overview
title: Overview
- local: api/schedulers/cm_stochastic_iterative
title: Consistency Model Multistep Scheduler
- local: api/schedulers/ddim
title: DDIM
title: CMStochasticIterativeScheduler
- local: api/schedulers/ddim_inverse
title: DDIMInverse
title: DDIMInverseScheduler
- local: api/schedulers/ddim
title: DDIMScheduler
- local: api/schedulers/ddpm
title: DDPM
title: DDPMScheduler
- local: api/schedulers/deis
title: DEIS
- local: api/schedulers/dpm_discrete
title: DPM Discrete Scheduler
- local: api/schedulers/dpm_discrete_ancestral
title: DPM Discrete Scheduler with ancestral sampling
title: DEISMultistepScheduler
- local: api/schedulers/multistep_dpm_solver_inverse
title: DPMSolverMultistepInverse
- local: api/schedulers/multistep_dpm_solver
title: DPMSolverMultistepScheduler
- local: api/schedulers/dpm_sde
title: DPMSolverSDEScheduler
- local: api/schedulers/euler_ancestral
title: Euler Ancestral Scheduler
- local: api/schedulers/euler
title: Euler scheduler
- local: api/schedulers/heun
title: Heun Scheduler
- local: api/schedulers/multistep_dpm_solver_inverse
title: Inverse Multistep DPM-Solver
- local: api/schedulers/ipndm
title: IPNDM
- local: api/schedulers/lms_discrete
title: Linear Multistep
- local: api/schedulers/multistep_dpm_solver
title: Multistep DPM-Solver
- local: api/schedulers/pndm
title: PNDM
- local: api/schedulers/repaint
title: RePaint Scheduler
- local: api/schedulers/singlestep_dpm_solver
title: Singlestep DPM-Solver
title: DPMSolverSinglestepScheduler
- local: api/schedulers/euler_ancestral
title: EulerAncestralDiscreteScheduler
- local: api/schedulers/euler
title: EulerDiscreteScheduler
- local: api/schedulers/heun
title: HeunDiscreteScheduler
- local: api/schedulers/ipndm
title: IPNDMScheduler
- local: api/schedulers/stochastic_karras_ve
title: Stochastic Kerras VE
title: KarrasVeScheduler
- local: api/schedulers/dpm_discrete_ancestral
title: KDPM2AncestralDiscreteScheduler
- local: api/schedulers/dpm_discrete
title: KDPM2DiscreteScheduler
- local: api/schedulers/lms_discrete
title: LMSDiscreteScheduler
- local: api/schedulers/pndm
title: PNDMScheduler
- local: api/schedulers/repaint
title: RePaintScheduler
- local: api/schedulers/score_sde_ve
title: ScoreSdeVeScheduler
- local: api/schedulers/score_sde_vp
title: ScoreSdeVpScheduler
- local: api/schedulers/unipc
title: UniPCMultistepScheduler
- local: api/schedulers/score_sde_ve
title: VE-SDE
- local: api/schedulers/score_sde_vp
title: VP-SDE
- local: api/schedulers/vq_diffusion
title: VQDiffusionScheduler
title: Schedulers

View File

@@ -9,4 +9,8 @@ All models are built from the base [`ModelMixin`] class which is a [`torch.nn.mo
## FlaxModelMixin
[[autodoc]] FlaxModelMixin
## PushToHubMixin
[[autodoc]] utils.PushToHubMixin
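For example, any model inheriting from [`ModelMixin`] also picks up `push_to_hub` from this mixin. A minimal sketch (the repository name is hypothetical, and you need to be authenticated with the Hub):

```py
from diffusers import UNet2DConditionModel

# Load a model component, then upload its weights and config to a repo under your namespace.
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
unet.push_to_hub("my-finetuned-unet")  # hypothetical repository name
```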

View File

@@ -0,0 +1,162 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ControlNet with Stable Diffusion XL
[Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
Using a pretrained model, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.
The abstract from the paper is:
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
We provide support for using ControlNets with [Stable Diffusion XL](./stable_diffusion/stable_diffusion_xl.md) (SDXL).
You can find numerous SDXL ControlNet checkpoints from [this link](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet). There are some smaller ControlNet checkpoints too:
* [controlnet-canny-sdxl-1.0-small](https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0-small)
* [controlnet-canny-sdxl-1.0-mid](https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0-mid)
* [controlnet-depth-sdxl-1.0-small](https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0-small)
* [controlnet-depth-sdxl-1.0-mid](https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0-mid)
We also encourage you to train custom ControlNets; we provide a [training script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md) for this.
You can find some results below:
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/sdxl_controlnet_canny_grid.png" width=600/>
🚨 At the time of this writing, many of these SDXL ControlNet checkpoints are experimental and there is a lot of room for improvement. We encourage our users to provide feedback. 🚨
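Before the MultiControlNet walkthrough below, here is a minimal single-ControlNet sketch for orientation; it reuses the same checkpoints and control image as the full example that follows and is meant as an illustration rather than a canonical recipe:
```py
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL, ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Build a canny edge map to use as the control image
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
edges = cv2.Canny(np.array(image), 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1)).resize((1024, 1024))

# Load one ControlNet and plug it into the SDXL pipeline
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.enable_model_cpu_offload()

image = pipe(
    "aerial view of a futuristic city, best quality", image=canny_image, num_inference_steps=25
).images[0]
```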
## MultiControlNet
You can compose multiple ControlNet conditionings from different image inputs to create a *MultiControlNet*. To get better results, it is often helpful to:
1. mask conditionings such that they don't overlap (for example, mask the area of a canny image where the pose conditioning is located)
2. experiment with the [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter to determine how much weight to assign to each conditioning input
In this example, you'll combine a canny image and a human pose estimation image to generate a new image.
Prepare the canny image conditioning:
```py
from diffusers.utils import load_image
from PIL import Image
import numpy as np
import cv2
canny_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
canny_image = np.array(canny_image)
low_threshold = 100
high_threshold = 200
canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
# zero out middle columns of image where pose will be overlayed
zero_start = canny_image.shape[1] // 4
zero_end = zero_start + canny_image.shape[1] // 2
canny_image[:, zero_start:zero_end] = 0
canny_image = canny_image[:, :, None]
canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
canny_image = Image.fromarray(canny_image).resize((1024, 1024))
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/landscape_canny_masked.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">canny image</figcaption>
</div>
</div>
Prepare the human pose estimation conditioning:
```py
from controlnet_aux import OpenposeDetector
from diffusers.utils import load_image
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
openpose_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
openpose_image = openpose(openpose_image).resize((1024, 1024))
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/person_pose.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">human pose image</figcaption>
</div>
</div>
Load a list of ControlNet models that correspond to each conditioning, and pass them to the [`StableDiffusionXLControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to reduce memory usage.
```py
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
import torch
controlnets = [
ControlNetModel.from_pretrained(
"thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
),
ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True),
]
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
```
Now you can pass your prompt (and an optional negative prompt if you're using one), canny image, and pose image to the pipeline:
```py
prompt = "a giant standing in a fantasy landscape, best quality"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
generator = torch.manual_seed(1)
images = [openpose_image, canny_image]
images = pipe(
prompt,
image=images,
num_inference_steps=25,
generator=generator,
negative_prompt=negative_prompt,
num_images_per_prompt=3,
controlnet_conditioning_scale=[1.0, 0.8],
).images[0]
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/multicontrolnet.png"/>
</div>
## StableDiffusionXLControlNetPipeline
[[autodoc]] StableDiffusionXLControlNetPipeline
- all
- __call__

View File

@@ -34,3 +34,7 @@ Pipelines do not offer any training functionality. You'll notice PyTorch's autog
## FlaxDiffusionPipeline
[[autodoc]] pipelines.pipeline_flax_utils.FlaxDiffusionPipeline
## PushToHubMixin
[[autodoc]] utils.PushToHubMixin

View File

@@ -0,0 +1,46 @@
<!--Copyright 2023 The GLIGEN Authors and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# GLIGEN (Grounded Language-to-Image Generation)
The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes, if input images are given, this pipeline can insert objects described by text at the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It's trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
The abstract from the [paper](https://huggingface.co/papers/2301.07093) is:
*Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGENs zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.*
<Tip>
Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently!
If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organization!
</Tip>
This pipeline was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful).
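As a quick illustration of the grounded generation workflow (a sketch; the `masterful/gligen-1-4-generation-text-box` checkpoint and the exact box coordinates are assumptions chosen for demonstration):
```py
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# Each phrase is generated inside its bounding box (normalized [x0, y0, x1, y1] coordinates)
image = pipe(
    prompt="a waterfall and a modern high speed train in a beautiful forest",
    gligen_phrases=["a waterfall", "a modern high speed train"],
    gligen_boxes=[[0.14, 0.2, 0.43, 0.8], [0.45, 0.27, 0.93, 0.74]],
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
```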
## StableDiffusionGLIGENPipeline
[[autodoc]] StableDiffusionGLIGENPipeline
- all
- __call__
- enable_vae_slicing
- disable_vae_slicing
- enable_vae_tiling
- disable_vae_tiling
- enable_model_cpu_offload
- prepare_latents
- enable_fuser
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -425,7 +425,3 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
prompt_2 = "monet painting"
image = pipe(prompt=prompt, prompt_2=prompt_2).images[0]
```
## Single-file Implementation of SDXL Unet Model
If you are curious about how SDXL Unet is implemented and would like to make quick modifications / experimentations, you can alternatively head to [`minSDXL`](https://github.com/cloneofsimo/minSDXL) that is very `diffusers` friendly. It is a single-file implementation of SDXL Unet model that is written in PyTorch with exact same model naming and structure as in `diffusers`.

View File

@@ -1,11 +1,15 @@
# Consistency Model Multistep Scheduler
# CMStochasticIterativeScheduler
## Overview
[Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever introduced a multistep and onestep scheduler (Algorithm 1) that is capable of generating good samples in one or a small number of steps.
Multistep and onestep scheduler (Algorithm 1) introduced alongside consistency models in the paper [Consistency Models](https://arxiv.org/abs/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
Based on the [original consistency models implementation](https://github.com/openai/consistency_models).
Should generate good samples from [`ConsistencyModelPipeline`] in one or a small number of steps.
The abstract from the paper is:
*Diffusion models have made significant breakthroughs in image, audio, and video generation, but they depend on an iterative generation process that causes slow sampling speed and caps their potential for real-time applications. To overcome this limitation, we propose consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to trade compute for sample quality. They also support zero-shot data editing, like image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either as a way to distill pre-trained diffusion models, or as standalone generative models. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step generation. For example, we achieve the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained as standalone generative models, consistency models also outperform single-step, non-adversarial generative models on standard benchmarks like CIFAR-10, ImageNet 64x64 and LSUN 256x256.*
The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models).
## CMStochasticIterativeScheduler
[[autodoc]] CMStochasticIterativeScheduler
## CMStochasticIterativeSchedulerOutput
[[autodoc]] schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput

View File

@@ -10,13 +10,11 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Denoising Diffusion Implicit Models (DDIM)
# DDIMScheduler
## Overview
[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
[Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
The abstract of the paper is the following:
The abstract from the paper is:
*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training,
yet they require simulating a Markov chain for many steps to produce a sample.
@@ -26,50 +24,43 @@ We construct a class of non-Markovian diffusion processes that lead to the same
We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off
computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim).
For questions, feel free to contact the author on [tsong.me](https://tsong.me/).
The original codebase of this paper can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author on [tsong.me](https://tsong.me/).
### Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed":
## Tips
The paper **[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891)**
claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion.
The paper [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion. To fix this, the authors propose:
The abstract reads as follows:
<Tip warning={true}>
*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR),
and some implementations of diffusion samplers do not start from the last timestep.
Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference.
We show that the flawed design causes real problems in existing implementations.
In Stable Diffusion, it severely limits the model to only generate images with medium brightness and
prevents it from generating very bright and dark samples. We propose a few simple fixes:
- (1) rescale the noise schedule to enforce zero terminal SNR;
- (2) train the model with v prediction;
- (3) change the sampler to always start from the last timestep;
- (4) rescale classifier-free guidance to prevent over-exposure.
These simple changes ensure the diffusion process is congruent between training and inference and
allow the model to generate samples more faithful to the original data distribution.*
🧪 This is an experimental feature!
</Tip>
1. rescale the noise schedule to enforce zero terminal signal-to-noise ratio (SNR)
You can apply all of these changes in `diffusers` when using [`DDIMScheduler`]:
- (1) rescale the noise schedule to enforce zero terminal SNR;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
```
- (2) train the model with v prediction;
Continue fine-tuning a checkpoint with [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)
and `--prediction_type="v_prediction"`.
- (3) change the sampler to always start from the last timestep;
2. train a model with `v_prediction` (add the following argument to the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [train_text_to_image_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts)
```bash
--prediction_type="v_prediction"
```
3. change the sampler to always start from the last timestep
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
```
- (4) rescale classifier-free guidance to prevent over-exposure.
4. rescale classifier-free guidance to prevent over-exposure
```py
pipe(..., guidance_rescale=0.7)
image = pipeline(prompt, guidance_rescale=0.7).images[0]
```
An example is to use [this checkpoint](https://huggingface.co/ptx0/pseudo-journey-v2)
which has been fine-tuned using the `"v_prediction"`.
The checkpoint can then be run in inference as follows:
For example:
```py
from diffusers import DiffusionPipeline, DDIMScheduler
@@ -86,3 +77,6 @@ image = pipeline(prompt, guidance_rescale=0.7).images[0]
## DDIMScheduler
[[autodoc]] DDIMScheduler
## DDIMSchedulerOutput
[[autodoc]] schedulers.scheduling_ddim.DDIMSchedulerOutput

View File

@@ -10,12 +10,10 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Inverse Denoising Diffusion Implicit Models (DDIMInverse)
# DDIMInverseScheduler
## Overview
This scheduler is the inverted scheduler of [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://arxiv.org/pdf/2211.09794.pdf)
`DDIMInverseScheduler` is the inverted scheduler from [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794.pdf).
## DDIMInverseScheduler
[[autodoc]] DDIMInverseScheduler

View File

@@ -10,18 +10,16 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Denoising Diffusion Probabilistic Models (DDPM)
# DDPMScheduler
## Overview
[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. In the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline.
[Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
(DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes the diffusion based model of the same name, but in the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline.
The abstract from the paper is:
The abstract of the paper is the following:
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.
The original paper can be found [here](https://arxiv.org/abs/2010.02502).
*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.*
## DDPMScheduler
[[autodoc]] DDPMScheduler
## DDPMSchedulerOutput
[[autodoc]] schedulers.scheduling_ddpm.DDPMSchedulerOutput

View File

@@ -10,13 +10,27 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# DEIS
# DEISMultistepScheduler
Fast Sampling of Diffusion Models with Exponential Integrator.
Diffusion Exponential Integrator Sampler (DEIS) is proposed in [Fast Sampling of Diffusion Models with Exponential Integrator](https://huggingface.co/papers/2204.13902) by Qinsheng Zhang and Yongxin Chen. `DEISMultistepScheduler` is a fast high order solver for diffusion ordinary differential equations (ODEs).
## Overview
This implementation modifies the polynomial fitting formula in log-rho space instead of the original linear `t` space in the DEIS paper. The modification enjoys closed-form coefficients for exponential multistep update instead of relying on the numerical solver.
Original paper can be found [here](https://arxiv.org/abs/2204.13902). The original implementation can be found [here](https://github.com/qsh-zh/deis).
The abstract from the paper is:
*The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with a much less number of steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps. In our experiments, it takes about 3 minutes on one A6000 GPU to generate 50k images from CIFAR10. Moreover, by directly using pre-trained DMs, we achieve the state-of-art sampling performance when the number of score function evaluation~(NFE) is limited, e.g., 4.17 FID with 10 NFEs, 3.37 FID, and 9.74 IS with only 15 NFEs on CIFAR10. Code is available at [this https URL](https://github.com/qsh-zh/deis).*
The original codebase can be found at [qsh-zh/deis](https://github.com/qsh-zh/deis).
## Tips
It is recommended to set `solver_order` to 2 or 3, while `solver_order=1` is equivalent to [`DDIMScheduler`].
Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
diffusion models, you can set `thresholding=True` to use the dynamic thresholding.
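As a brief sketch of these tips applied to a pipeline (the `runwayml/stable-diffusion-v1-5` checkpoint is just an example and is not part of this page):
```py
import torch
from diffusers import DEISMultistepScheduler, DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
# solver_order=2 (or 3) is recommended; solver_order=1 falls back to DDIM-like behavior.
# Leave thresholding=False here: dynamic thresholding is only meant for pixel-space models,
# not latent-space models like Stable Diffusion.
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config, solver_order=2)
image = pipe("an astronaut riding a horse", num_inference_steps=20).images[0]
```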
## DEISMultistepScheduler
[[autodoc]] DEISMultistepScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,13 +10,14 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# DPM Discrete Scheduler inspired by Karras et. al paper
# KDPM2DiscreteScheduler
## Overview
The `KDPM2DiscreteScheduler` is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/).
Inspired by [Karras et. al](https://arxiv.org/abs/2206.00364). Scheduler ported from @crowsonkb's https://github.com/crowsonkb/k-diffusion library:
All credit for making this scheduler work goes to [Katherine Crowson](https://github.com/crowsonkb/)
The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion).
## KDPM2DiscreteScheduler
[[autodoc]] KDPM2DiscreteScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,13 +10,14 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# DPM Discrete Scheduler with ancestral sampling inspired by Karras et. al paper
# KDPM2AncestralDiscreteScheduler
## Overview
The `KDPM2DiscreteScheduler` with ancestral sampling is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/).
Inspired by [Karras et. al](https://arxiv.org/abs/2206.00364). Scheduler ported from @crowsonkb's https://github.com/crowsonkb/k-diffusion library:
All credit for making this scheduler work goes to [Katherine Crowson](https://github.com/crowsonkb/)
The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion).
## KDPM2AncestralDiscreteScheduler
[[autodoc]] KDPM2AncestralDiscreteScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,14 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# DPM Stochastic Scheduler inspired by Karras et. al paper
# DPMSolverSDEScheduler
## Overview
Inspired by Stochastic Sampler from [Karras et. al](https://arxiv.org/abs/2206.00364).
Scheduler ported from @crowsonkb's https://github.com/crowsonkb/k-diffusion library:
All credit for making this scheduler work goes to [Katherine Crowson](https://github.com/crowsonkb/)
The `DPMSolverSDEScheduler` is inspired by the stochastic sampler from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/).
## DPMSolverSDEScheduler
[[autodoc]] DPMSolverSDEScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,12 +10,13 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Euler scheduler
# EulerDiscreteScheduler
## Overview
The Euler scheduler (Algorithm 2) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/).
Euler scheduler (Algorithm 2) from the paper [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) by Karras et al. (2022). Based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by Katherine Crowson.
Fast scheduler which often times generates good outputs with 20-30 steps.
## EulerDiscreteScheduler
[[autodoc]] EulerDiscreteScheduler
## EulerDiscreteSchedulerOutput
[[autodoc]] schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput

View File

@@ -10,12 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Euler Ancestral scheduler
# EulerAncestralDiscreteScheduler
## Overview
Ancestral sampling with Euler method steps. Based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) implementation by Katherine Crowson.
Fast scheduler which often times generates good outputs with 20-30 steps.
A scheduler that uses ancestral sampling with Euler method steps. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) implementation by [Katherine Crowson](https://github.com/crowsonkb/).
## EulerAncestralDiscreteScheduler
[[autodoc]] EulerAncestralDiscreteScheduler
## EulerAncestralDiscreteSchedulerOutput
[[autodoc]] schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput

View File

@@ -10,14 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Heun scheduler inspired by Karras et. al paper
# HeunDiscreteScheduler
## Overview
Algorithm 1 of [Karras et. al](https://arxiv.org/abs/2206.00364).
Scheduler ported from @crowsonkb's https://github.com/crowsonkb/k-diffusion library:
All credit for making this scheduler work goes to [Katherine Crowson](https://github.com/crowsonkb/)
The Heun scheduler (Algorithm 1) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. The scheduler is ported from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library and created by [Katherine Crowson](https://github.com/crowsonkb/).
## HeunDiscreteScheduler
[[autodoc]] HeunDiscreteScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,11 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# improved pseudo numerical methods for diffusion models (iPNDM)
# IPNDMScheduler
## Overview
Original implementation can be found [here](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296).
`IPNDMScheduler` is a fourth-order Improved Pseudo Linear Multistep scheduler. The original implementation can be found at [crowsonkb/v-diffusion-pytorch](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296).
## IPNDMScheduler
[[autodoc]] IPNDMScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,11 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Linear multistep scheduler for discrete beta schedules
# LMSDiscreteScheduler
## Overview
Original implementation can be found [here](https://arxiv.org/abs/2206.00364).
`LMSDiscreteScheduler` is a linear multistep scheduler for discrete beta schedules. The scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/), and the original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181).
## LMSDiscreteScheduler
[[autodoc]] LMSDiscreteScheduler
## LMSDiscreteSchedulerOutput
[[autodoc]] schedulers.scheduling_lms_discrete.LMSDiscreteSchedulerOutput

View File

@@ -10,11 +10,26 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Multistep DPM-Solver
# DPMSolverMultistepScheduler
## Overview
`DPMSolverMultistepScheduler` is a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
Original paper can be found [here](https://arxiv.org/abs/2206.00927) and the [improved version](https://arxiv.org/abs/2211.01095). The original implementation can be found [here](https://github.com/LuChengTHU/dpm-solver).
DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality
samples, and it can generate quite good samples even in 10 steps.
## Tips
It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.
Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic
thresholding. This thresholding method is unsuitable for latent-space diffusion models such as
Stable Diffusion.
The SDE variant of DPMSolver and DPM-Solver++ is also supported, but only for the first and second-order solvers. This is a fast SDE solver for the reverse diffusion SDE. It is recommended to use the second-order `sde-dpmsolver++`.
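For example, enabling the recommended second-order SDE variant might look like this (a sketch; the checkpoint is only illustrative):
```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
# The SDE variants only support solver_order 1 or 2; "sde-dpmsolver++" is the recommended choice
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="sde-dpmsolver++", solver_order=2
)
image = pipe("a photo of a cat", num_inference_steps=20).images[0]
```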
## DPMSolverMultistepScheduler
[[autodoc]] DPMSolverMultistepScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,13 +10,21 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Inverse Multistep DPM-Solver (DPMSolverMultistepInverse)
# DPMSolverMultistepInverse
## Overview
`DPMSolverMultistepInverse` is the inverted scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
This scheduler is the inverted scheduler of [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://arxiv.org/abs/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://arxiv.org/abs/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://arxiv.org/pdf/2211.09794.pdf) and the ad-hoc notebook implementation for DiffEdit latent inversion [here](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb).
The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794.pdf) and notebook implementation of the [`DiffEdit`] latent inversion from [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb).
## Tips
Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic
thresholding. This thresholding method is unsuitable for latent-space diffusion models such as
Stable Diffusion.
## DPMSolverMultistepInverseScheduler
[[autodoc]] DPMSolverMultistepInverseScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -12,81 +12,53 @@ specific language governing permissions and limitations under the License.
# Schedulers
Diffusers contains multiple pre-built schedule functions for the diffusion process.
🤗 Diffusers provides many scheduler functions for the diffusion process. A scheduler takes a model's output (the sample which the diffusion process is iterating on) and a timestep to return a denoised sample. The timestep is important because it dictates where in the diffusion process the step is; data is generated by iterating forward *n* timesteps and inference occurs by propagating backward through the timesteps. Based on the timestep, a scheduler may be *discrete* in which case the timestep is an `int` or *continuous* in which case the timestep is a `float`.
## What is a scheduler?
Depending on the context, a scheduler defines how to iteratively add noise to an image or how to update a sample based on a model's output:
The schedule functions, denoted *Schedulers* in the library, take in the output of a trained model, a sample which the diffusion process is iterating on, and a timestep to return a denoised sample. That's why schedulers may also be called *Samplers* in other diffusion model implementations.
- during *training*, a scheduler adds noise (there are different algorithms for how to add noise) to a sample to train a diffusion model
- during *inference*, a scheduler defines how to update a sample based on a pretrained model's output
- Schedulers define the methodology for iteratively adding noise to an image or for updating a sample based on model outputs.
- adding noise in different manners represents the algorithmic processes used to train a diffusion model by adding noise to images.
- for inference, the scheduler defines how to update a sample based on an output from a pretrained model.
- Schedulers are often defined by a *noise schedule* and an *update rule* to solve the differential equation.
Many schedulers are implemented from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library by [Katherine Crowson](https://github.com/crowsonkb/), and they're also widely used in A1111. To help you map the schedulers from k-diffusion and A1111 to the schedulers in 🤗 Diffusers, take a look at the table below:
### Discrete versus continuous schedulers
| A1111/k-diffusion | 🤗 Diffusers | Usage |
|---------------------|-------------------------------------|---------------------------------------------------------------------------------------------------------------|
| DPM++ 2M | [`DPMSolverMultistepScheduler`] | |
| DPM++ 2M Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` |
| DPM++ 2M SDE | [`DPMSolverMultistepScheduler`] | init with `algorithm_type="sde-dpmsolver++"` |
| DPM++ 2M SDE Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` and `algorithm_type="sde-dpmsolver++"` |
| DPM++ 2S a | N/A | very similar to `DPMSolverSinglestepScheduler` |
| DPM++ 2S a Karras | N/A | very similar to `DPMSolverSinglestepScheduler(use_karras_sigmas=True, ...)` |
| DPM++ SDE | [`DPMSolverSinglestepScheduler`] | |
| DPM++ SDE Karras | [`DPMSolverSinglestepScheduler`] | init with `use_karras_sigmas=True` |
| DPM2 | [`KDPM2DiscreteScheduler`] | |
| DPM2 Karras | [`KDPM2DiscreteScheduler`] | init with `use_karras_sigmas=True` |
| DPM2 a | [`KDPM2AncestralDiscreteScheduler`] | |
| DPM2 a Karras | [`KDPM2AncestralDiscreteScheduler`] | init with `use_karras_sigmas=True` |
| DPM adaptive | N/A | |
| DPM fast | N/A | |
| Euler | [`EulerDiscreteScheduler`] | |
| Euler a | [`EulerAncestralDiscreteScheduler`] | |
| Heun | [`HeunDiscreteScheduler`] | |
| LMS | [`LMSDiscreteScheduler`] | |
| LMS Karras | [`LMSDiscreteScheduler`] | init with `use_karras_sigmas=True` |
| N/A | [`DEISMultistepScheduler`] | |
| N/A | [`UniPCMultistepScheduler`] | |
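For example, the "DPM++ 2M Karras" row above maps to the following sketch (the model id is illustrative):

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
```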
All schedulers take in a timestep to predict the updated version of the sample being diffused.
The timesteps dictate where in the diffusion process the step is, where data is generated by iterating forward in time and inference is executed by propagating backwards through timesteps.
Different algorithms use timesteps that can be discrete (accepting `int` inputs), such as the [`DDPMScheduler`] or [`PNDMScheduler`], or continuous (accepting `float` inputs), such as the score-based schedulers [`ScoreSdeVeScheduler`] or [`ScoreSdeVpScheduler`].
All schedulers are built from the base [`SchedulerMixin`] class which implements low level utilities shared by all schedulers.
## Designing Re-usable schedulers
The core design principle between the schedule functions is to be model, system, and framework independent.
This allows for rapid experimentation and cleaner abstractions in the code, where the model prediction is separated from the sample update.
To this end, the design of schedulers is such that:
- Schedulers can be used interchangeably between diffusion models in inference to find the preferred trade-off between speed and generation quality.
- Schedulers are currently by default in PyTorch, but are designed to be framework independent (partial Jax support currently exists).
- Many diffusion pipelines, such as [`StableDiffusionPipeline`] and [`DiTPipeline`] can use any of [`KarrasDiffusionSchedulers`]
## Schedulers Summary
The following table summarizes all officially supported schedulers and their corresponding papers:
| Scheduler | Paper |
|---|---|
| [ddim](./ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) |
| [ddim_inverse](./ddim_inverse) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) |
| [ddpm](./ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) |
| [deis](./deis) | [**DEISMultistepScheduler**](https://arxiv.org/abs/2204.13902) |
| [singlestep_dpm_solver](./singlestep_dpm_solver) | [**Singlestep DPM-Solver**](https://arxiv.org/abs/2206.00927) |
| [multistep_dpm_solver](./multistep_dpm_solver) | [**Multistep DPM-Solver**](https://arxiv.org/abs/2206.00927) |
| [heun](./heun) | [**Heun scheduler inspired by Karras et. al paper**](https://arxiv.org/abs/2206.00364) |
| [dpm_discrete](./dpm_discrete) | [**DPM Discrete Scheduler inspired by Karras et. al paper**](https://arxiv.org/abs/2206.00364) |
| [dpm_discrete_ancestral](./dpm_discrete_ancestral) | [**DPM Discrete Scheduler with ancestral sampling inspired by Karras et. al paper**](https://arxiv.org/abs/2206.00364) |
| [stochastic_karras_ve](./stochastic_karras_ve) | [**Variance exploding, stochastic sampling from Karras et. al**](https://arxiv.org/abs/2206.00364) |
| [lms_discrete](./lms_discrete) | [**Linear multistep scheduler for discrete beta schedules**](https://arxiv.org/abs/2206.00364) |
| [pndm](./pndm) | [**Pseudo numerical methods for diffusion models (PNDM)**](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181) |
| [score_sde_ve](./score_sde_ve) | [**variance exploding stochastic differential equation (VE-SDE) scheduler**](https://arxiv.org/abs/2011.13456) |
| [ipndm](./ipndm) | [**improved pseudo numerical methods for diffusion models (iPNDM)**](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296) |
| [score_sde_vp](./score_sde_vp) | [**Variance preserving stochastic differential equation (VP-SDE) scheduler**](https://arxiv.org/abs/2011.13456) |
| [euler](./euler) | [**Euler scheduler**](https://arxiv.org/abs/2206.00364) |
| [euler_ancestral](./euler_ancestral) | [**Euler Ancestral scheduler**](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) |
| [vq_diffusion](./vq_diffusion) | [**VQDiffusionScheduler**](https://arxiv.org/abs/2111.14822) |
| [unipc](./unipc) | [**UniPCMultistepScheduler**](https://arxiv.org/abs/2302.04867) |
| [repaint](./repaint) | [**RePaint scheduler**](https://arxiv.org/abs/2201.09865) |
## API
The core API for any new scheduler must follow a limited structure.
- Schedulers should provide one or more `def step(...)` functions that should be called to update the generated sample iteratively.
- Schedulers should provide a `set_timesteps(...)` method that configures the parameters of a schedule function for a specific inference task.
- Schedulers should be framework-specific.
The base class [`SchedulerMixin`] implements low level utilities used by multiple schedulers.
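A minimal sketch of the `set_timesteps(...)`/`step(...)` contract described above, using [`DDPMScheduler`] and a random tensor as a stand-in for a real model prediction:

```py
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps=50)

sample = torch.randn(1, 3, 64, 64)  # start from pure noise
for t in scheduler.timesteps:
    model_output = torch.randn_like(sample)  # placeholder for a real `unet(sample, t).sample` call
    sample = scheduler.step(model_output, t, sample).prev_sample
```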
### SchedulerMixin
## SchedulerMixin
[[autodoc]] SchedulerMixin
### SchedulerOutput
The class [`SchedulerOutput`] contains the outputs from any scheduler's `step(...)` call.
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
### KarrasDiffusionSchedulers
## KarrasDiffusionSchedulers
`KarrasDiffusionSchedulers` encompasses the main generalization of schedulers in Diffusers. The schedulers in this class are distinguished, at a high level, by their noise sampling strategy; the type of network and scaling; and finally the training strategy or how the loss is weighed.
[`KarrasDiffusionSchedulers`] are a broad generalization of schedulers in 🤗 Diffusers. The schedulers in this class are distinguished at a high level by their noise sampling strategy, the type of network and scaling, the training strategy, and how the loss is weighed.
The different schedulers, depending on the type of ODE solver, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in Diffusers. The schedulers in this class are given below:
The different schedulers in this class, depending on the ordinary differential equations (ODE) solver type, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in 🤗 Diffusers. The schedulers in this class are given [here](https://github.com/huggingface/diffusers/blob/a69754bb879ed55b9b6dc9dd0b3cf4fa4124c765/src/diffusers/schedulers/scheduling_utils.py#L32).
[[autodoc]] schedulers.scheduling_utils.KarrasDiffusionSchedulers
## PushToHubMixin
[[autodoc]] utils.PushToHubMixin

View File

@@ -10,11 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Pseudo numerical methods for diffusion models (PNDM)
# PNDMScheduler
## Overview
Original implementation can be found [here](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181).
`PNDMScheduler`, or pseudo numerical methods for diffusion models, uses more advanced ODE integration techniques like the Runge-Kutta and linear multi-step method. The original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181).
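As a sketch (the model id is illustrative), the scheduler can be loaded from an existing pipeline config; the `skip_prk_steps` flag controls whether the Runge-Kutta warm-up steps are skipped:

```py
from diffusers import DiffusionPipeline, PNDMScheduler

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
# `skip_prk_steps=True` skips the Runge-Kutta warm-up and uses only linear multi-step (PLMS) updates
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config, skip_prk_steps=True)
```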
## PNDMScheduler
[[autodoc]] PNDMScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,14 +10,18 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# RePaint scheduler
# RePaintScheduler
## Overview
`RePaintScheduler` is a DDPM-based inpainting scheduler for unsupervised inpainting with extreme masks. It is designed to be used with the [`RePaintPipeline`], and it is based on the paper [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) by Andreas Lugmayr et al.
DDPM-based inpainting scheduler for unsupervised inpainting with extreme masks.
Intended for use with [`RePaintPipeline`].
Based on the paper [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2201.09865)
and the original implementation by Andreas Lugmayr et al.: https://github.com/andreas128/RePaint
The abstract from the paper is:
*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. Github Repository: git.io/RePaint*.
The original implementation can be found at [andreas128/RePaint](https://github.com/andreas128/RePaint).
## RePaintScheduler
[[autodoc]] RePaintScheduler
## RePaintSchedulerOutput
[[autodoc]] schedulers.scheduling_repaint.RePaintSchedulerOutput

View File

@@ -10,11 +10,16 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Variance Exploding Stochastic Differential Equation (VE-SDE) scheduler
# ScoreSdeVeScheduler
## Overview
`ScoreSdeVeScheduler` is a variance exploding stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole.
Original paper can be found [here](https://arxiv.org/abs/2011.13456).
The abstract from the paper is:
*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model*.
## ScoreSdeVeScheduler
[[autodoc]] ScoreSdeVeScheduler
## SdeVeOutput
[[autodoc]] schedulers.scheduling_sde_ve.SdeVeOutput

View File

@@ -10,15 +10,17 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Variance Preserving Stochastic Differential Equation (VP-SDE) scheduler
# ScoreSdeVpScheduler
## Overview
`ScoreSdeVpScheduler` is a variance preserving stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole.
Original paper can be found [here](https://arxiv.org/abs/2011.13456).
The abstract from the paper is:
*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model*.
<Tip warning={true}>
Score SDE-VP is under construction.
🚧 This scheduler is under construction!
</Tip>

View File

@@ -10,11 +10,26 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Singlestep DPM-Solver
# DPMSolverSinglestepScheduler
## Overview
`DPMSolverSinglestepScheduler` is a single step scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
Original paper can be found [here](https://arxiv.org/abs/2206.00927) and the [improved version](https://arxiv.org/abs/2211.01095). The original implementation can be found [here](https://github.com/LuChengTHU/dpm-solver).
DPMSolver (and the improved version DPMSolver++) is a fast, dedicated high-order solver for diffusion ODEs with a convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality
samples, and it can generate quite good samples even in 10 steps.
The original implementation can be found at [LuChengTHU/dpm-solver](https://github.com/LuChengTHU/dpm-solver).
## Tips
It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.
Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic
thresholding. This thresholding method is unsuitable for latent-space diffusion models such as
Stable Diffusion.
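A short sketch of the tips above, assuming a pixel-space model and guided sampling (the argument values are illustrative):

```py
from diffusers import DPMSolverSinglestepScheduler

scheduler = DPMSolverSinglestepScheduler(
    solver_order=2,                # recommended for guided sampling
    algorithm_type="dpmsolver++",
    thresholding=True,             # only for pixel-space diffusion models
)
```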
## DPMSolverSinglestepScheduler
[[autodoc]] DPMSolverSinglestepScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -10,11 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Variance exploding, stochastic sampling from Karras et. al
# KarrasVeScheduler
## Overview
Original paper can be found [here](https://arxiv.org/abs/2206.00364).
`KarrasVeScheduler` is a stochastic sampler tailored to variance-exploding (VE) models. It is based on the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) and [Score-based generative modeling through stochastic differential equations](https://huggingface.co/papers/2011.13456) papers.
## KarrasVeScheduler
[[autodoc]] KarrasVeScheduler
## KarrasVeOutput
[[autodoc]] schedulers.scheduling_karras_ve.KarrasVeOutput

View File

@@ -10,15 +10,28 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# UniPC
# UniPCMultistepScheduler
## Overview
`UniPCMultistepScheduler` is a training-free framework designed for fast sampling of diffusion models. It was introduced in [UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models](https://huggingface.co/papers/2302.04867) by Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu.
UniPC is a training-free framework designed for the fast sampling of diffusion models, which consists of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders.
It consists of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders.
UniPC is by design model-agnostic, supporting pixel-space/latent-space DPMs for unconditional/conditional sampling. It can also be applied to both noise prediction and data prediction models. The corrector UniC can also be applied after any off-the-shelf solver to increase the order of accuracy.
For more details about the method, please refer to the [paper](https://arxiv.org/abs/2302.04867) and the [code](https://github.com/wl-zhao/UniPC).
The abstract from the paper is:
*Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM usually requires hundreds of model evaluations, which is computationally expensive. Despite recent progress in designing high-order solvers for DPMs, there still exists room for further speedup, especially in extremely few steps (e.g., 5~10 steps). Inspired by the predictor-corrector for ODE solvers, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256×256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC*.
The original codebase can be found at [wl-zhao/UniPC](https://github.com/wl-zhao/UniPC).
## Tips
It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.
Dynamic thresholding from Imagen (https://huggingface.co/papers/2205.11487) is supported, and for pixel-space
diffusion models, you can set both `predict_x0=True` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion.
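A short sketch of the tips above, again assuming a pixel-space model and guided sampling (the argument values are illustrative):

```py
from diffusers import UniPCMultistepScheduler

scheduler = UniPCMultistepScheduler(
    solver_order=2,      # recommended for guided sampling
    predict_x0=True,
    thresholding=True,   # only for pixel-space diffusion models
)
```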
## UniPCMultistepScheduler
[[autodoc]] UniPCMultistepScheduler
## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

View File

@@ -12,9 +12,14 @@ specific language governing permissions and limitations under the License.
# VQDiffusionScheduler
## Overview
`VQDiffusionScheduler` converts the transformer model's output into a sample for the unnoised image at the previous diffusion timestep. It was introduced in [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://huggingface.co/papers/2111.14822) by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo.
Original paper can be found [here](https://arxiv.org/abs/2111.14822)
The abstract from the paper is:
*We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.*
## VQDiffusionScheduler
[[autodoc]] VQDiffusionScheduler
## VQDiffusionSchedulerOutput
[[autodoc]] schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput

View File

@@ -20,4 +20,8 @@ Utility and helper functions for working with 🤗 Diffusers.
## export_to_video
[[autodoc]] utils.testing_utils.export_to_video
## make_image_grid
[[autodoc]] utils.pil_utils.make_image_grid

View File

@@ -51,6 +51,7 @@ from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe = pipe.to("cuda")
@@ -65,42 +66,11 @@ image = pipe(prompt).images[0]
</Tip>
## Sliced attention for additional memory savings
For even more memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.
<Tip>
Attention slicing is useful even if a batch size of just 1 is used - as long
as the model uses more than one attention head. If there is more than one
attention head the *QK^T* attention matrix can be computed sequentially for
each head which can save a significant amount of memory.
</Tip>
To perform the attention computation sequentially over each head, you only need to invoke [`~DiffusionPipeline.enable_attention_slicing`] in your pipeline before inference, like here:
```Python
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]
```
There's a small performance penalty of about 10% slower inference times, but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!
## Sliced VAE decode for larger batches
To decode large batches of images with limited VRAM, or to enable batches with 32 images or more, you can use sliced VAE decode that decodes the batch latents one image at a time.
You likely want to couple this with [`~StableDiffusionPipeline.enable_attention_slicing`] or [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
You likely want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
To perform the VAE decode one image at a time, invoke [`~StableDiffusionPipeline.enable_vae_slicing`] in your pipeline before inference. For example:
@@ -111,6 +81,7 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe = pipe.to("cuda")
@@ -126,7 +97,7 @@ You may see a small performance boost in VAE decode on multi-image batches. Ther
Tiled VAE processing makes it possible to work with large images on limited VRAM. For example, generating 4k images in 8GB of VRAM. Tiled VAE decoder splits the image into overlapping tiles, decodes the tiles, and blends the outputs to make the final image.
You want to couple this with [`~StableDiffusionPipeline.enable_attention_slicing`] or [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
You want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
To use tiled VAE processing, invoke [`~StableDiffusionPipeline.enable_vae_tiling`] in your pipeline before inference. For example:
@@ -137,6 +108,7 @@ from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
@@ -164,6 +136,7 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
)
prompt = "a photo of an astronaut riding a horse on mars"
@@ -188,11 +161,11 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
)
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)
image = pipe(prompt).images[0]
```
@@ -221,6 +194,7 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
)
prompt = "a photo of an astronaut riding a horse on mars"
@@ -237,11 +211,11 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
)
prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing(1)
image = pipe(prompt).images[0]
```
@@ -300,6 +274,7 @@ def generate_inputs():
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
).to("cuda")
unet = pipe.unet
unet.eval()
@@ -363,6 +338,7 @@ class UNet2DConditionOutput:
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
).to("cuda")
# use jitted unet
@@ -422,6 +398,7 @@ import torch
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()

View File

@@ -39,7 +39,7 @@ pip install --upgrade torch diffusers
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
@@ -53,7 +53,7 @@ pip install --upgrade torch diffusers
from diffusers import DiffusionPipeline
+ from diffusers.models.attention_processor import AttnProcessor2_0
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+ pipe.unet.set_attn_processor(AttnProcessor2_0())
prompt = "a photo of an astronaut riding a horse on mars"
@@ -69,7 +69,7 @@ pip install --upgrade torch diffusers
from diffusers import DiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipe.unet.set_default_attn_processor()
prompt = "a photo of an astronaut riding a horse on mars"
@@ -107,7 +107,7 @@ path = "runwayml/stable-diffusion-v1-5"
run_compile = True # Set True / False
pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
@@ -140,7 +140,7 @@ path = "runwayml/stable-diffusion-v1-5"
run_compile = True # Set True / False
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
@@ -180,7 +180,7 @@ path = "runwayml/stable-diffusion-inpainting"
run_compile = True # Set True / False
pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
@@ -212,9 +212,9 @@ init_image = init_image.resize((512, 512))
path = "runwayml/stable-diffusion-v1-5"
run_compile = True # Set True / False
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
path, controlnet=controlnet, torch_dtype=torch.float16
path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)
pipe = pipe.to("cuda")
@@ -240,11 +240,11 @@ import torch
run_compile = True # Set True / False
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
pipe.to("cuda")
pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
pipe_2.to("cuda")
pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True)
pipe_3.to("cuda")

View File

@@ -67,7 +67,7 @@ Load the model with the [`~DiffusionPipeline.from_pretrained`] method:
```python
>>> from diffusers import DiffusionPipeline
>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
```
The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`] among other things:
@@ -130,7 +130,7 @@ You can also use the pipeline locally. The only difference is you need to downlo
Then load the saved weights into the pipeline:
```python
>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5")
>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True)
```
Now you can run the pipeline as you would in the section above.
@@ -142,7 +142,7 @@ Different schedulers come with different denoising speeds and quality trade-offs
```py
>>> from diffusers import EulerDiscreteScheduler
>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
```
@@ -160,7 +160,7 @@ Models are initiated with the [`~ModelMixin.from_pretrained`] method which also
>>> from diffusers import UNet2DModel
>>> repo_id = "google/ddpm-cat-256"
>>> model = UNet2DModel.from_pretrained(repo_id)
>>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
```
To access the model parameters, call `model.config`:

View File

@@ -26,7 +26,7 @@ Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/r
from diffusers import DiffusionPipeline
model_id = "runwayml/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(model_id)
pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
```
The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt:
@@ -75,7 +75,7 @@ Let's start by loading the model in `float16` and generate an image:
```python
import torch
pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)
pipeline = pipeline.to("cuda")
generator = torch.Generator("cuda").manual_seed(0)
image = pipeline(prompt, generator=generator).images[0]
@@ -152,26 +152,13 @@ def get_inputs(batch_size=1):
return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
```
You'll also need a function that'll display each batch of images:
```python
from PIL import Image
def image_grid(imgs, rows=2, cols=2):
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
```
Start with `batch_size=4` and see how much memory you've consumed:
```python
from diffusers.utils import make_image_grid
images = pipeline(**get_inputs(batch_size=4)).images
image_grid(images)
make_image_grid(images, 2, 2)
```
Unless you have a GPU with more RAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function:
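A short sketch of that call, reusing the `get_inputs` and `make_image_grid` helpers from above and assuming the pipeline defined earlier in this guide:

```python
pipeline.enable_attention_slicing()
images = pipeline(**get_inputs(batch_size=4)).images
make_image_grid(images, 2, 2)
```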
@@ -184,7 +171,7 @@ Now try increasing the `batch_size` to 8!
```python
images = pipeline(**get_inputs(batch_size=8)).images
image_grid(images, rows=2, cols=4)
make_image_grid(images, rows=2, cols=4)
```
<div class="flex justify-center">
@@ -213,7 +200,7 @@ from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
pipeline.vae = vae
images = pipeline(**get_inputs(batch_size=8)).images
image_grid(images, rows=2, cols=4)
make_image_grid(images, rows=2, cols=4)
```
<div class="flex justify-center">
@@ -238,7 +225,7 @@ Generate a batch of images with the new prompt:
```python
images = pipeline(**get_inputs(batch_size=8)).images
image_grid(images, rows=2, cols=4)
make_image_grid(images, rows=2, cols=4)
```
<div class="flex justify-center">
@@ -257,7 +244,7 @@ prompts = [
generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
image_grid(images)
make_image_grid(images, 2, 2)
```
<div class="flex justify-center">

View File

@@ -11,7 +11,7 @@ A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](h
```py
from diffusers import StableDiffusionPipeline
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
pipeline.unet.config["in_channels"]
4
```
@@ -21,7 +21,7 @@ Inpainting requires 9 channels in the input sample. You can check this value in
```py
from diffusers import StableDiffusionPipeline
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-inpainting", use_safetensors=True)
pipeline.unet.config["in_channels"]
9
```
@@ -35,7 +35,12 @@ from diffusers import UNet2DConditionModel
model_id = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(
model_id, subfolder="unet", in_channels=9, low_cpu_mem_usage=False, ignore_mismatched_sizes=True
model_id,
subfolder="unet",
in_channels=9,
low_cpu_mem_usage=False,
ignore_mismatched_sizes=True,
use_safetensors=True,
)
```

View File

@@ -306,9 +306,9 @@ import torch
base_model_path = "path to model"
controlnet_path = "path to controlnet"
controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)
controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
base_model_path, controlnet=controlnet, torch_dtype=torch.float16
base_model_path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)
# speed up diffusion process with faster scheduler and memory optimization

View File

@@ -222,7 +222,9 @@ Once you have trained a model using the above command, you can run inference usi
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
pipe = DiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipe.unet.load_attn_procs("path-to-save-model", weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion("path-to-save-model", weight_name="<new1>.bin")
@@ -246,7 +248,7 @@ model_id = "sayakpaul/custom-diffusion-cat"
card = RepoCard.load(model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
@@ -270,7 +272,7 @@ model_id = "sayakpaul/custom-diffusion-cat-wooden-pot"
card = RepoCard.load(model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
pipe.load_textual_inversion(model_id, weight_name="<new2>.bin")

View File

@@ -16,7 +16,9 @@ Now use the [`~accelerate.PartialState.split_between_processes`] utility as a co
from accelerate import PartialState
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
distributed_state = PartialState()
pipeline.to(distributed_state.device)
@@ -50,7 +52,9 @@ import torch.multiprocessing as mp
from diffusers import DiffusionPipeline
sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
sd = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
```
You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` is 2.
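A minimal sketch of such a function (backend, prompts, and file names are illustrative; it reuses the `sd` pipeline created in the snippet above):

```py
import torch.distributed as dist

def run_inference(rank, world_size):
    # set up the distributed environment for `world_size` processes
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    sd.to(rank)  # `sd` is the pipeline loaded above; move it to this process's GPU

    # give each process its own prompt
    prompt = "a dog" if rank == 0 else "a cat"
    image = sd(prompt).images[0]
    image.save(f"./{'_'.join(prompt.split())}.png")
```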

View File

@@ -303,7 +303,9 @@ unet = UNet2DConditionModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/chec
# if you have trained with `--args.train_text_encoder` make sure to also load the text encoder
text_encoder = CLIPTextModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/text_encoder")
pipeline = DiffusionPipeline.from_pretrained(model_id, unet=unet, text_encoder=text_encoder, dtype=torch.float16)
pipeline = DiffusionPipeline.from_pretrained(
model_id, unet=unet, text_encoder=text_encoder, dtype=torch.float16, use_safetensors=True
)
pipeline.to("cuda")
# Perform inference, or save, or push to the hub
@@ -318,7 +320,7 @@ from diffusers import DiffusionPipeline
# Load the pipeline with the same arguments (model, revision) that were used for training
model_id = "CompVis/stable-diffusion-v1-4"
pipeline = DiffusionPipeline.from_pretrained(model_id)
pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
accelerator = Accelerator()
@@ -333,6 +335,7 @@ pipeline = DiffusionPipeline.from_pretrained(
model_id,
unet=accelerator.unwrap_model(unet),
text_encoder=accelerator.unwrap_model(text_encoder),
use_safetensors=True,
)
# Perform inference, or save, or push to the hub
@@ -488,7 +491,7 @@ from diffusers import DiffusionPipeline
import torch
model_id = "path_to_saved_model"
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
prompt = "A photo of sks dog in a bucket"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
@@ -510,7 +513,7 @@ must also update the pipeline's scheduler config.
```py
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", use_safetensors=True)
pipe.load_lora_weights("<lora weights path>")
@@ -704,4 +707,4 @@ accelerate launch train_dreambooth.py \
## Stable Diffusion XL
We support fine-tuning of the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md).
We support fine-tuning of the UNet and text encoders shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md).

View File

@@ -165,7 +165,9 @@ import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
model_id = "your_model_id" # <- replace this
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
model_id, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
generator = torch.Generator("cuda").manual_seed(0)
url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png"

View File

@@ -98,7 +98,7 @@ Now you can use the model for inference by loading the base model in the [`Stabl
>>> model_base = "runwayml/stable-diffusion-v1-5"
>>> pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16)
>>> pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16, use_safetensors=True)
>>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```
@@ -137,7 +137,7 @@ lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4"
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True)
...
```
@@ -211,7 +211,7 @@ Now you can use the model for inference by loading the base model in the [`Stabl
>>> model_base = "runwayml/stable-diffusion-v1-5"
>>> pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16)
>>> pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16, use_safetensors=True)
```
Load the LoRA weights from your finetuned DreamBooth model *on top of the base model weights*, and then move the pipeline to a GPU for faster inference. When you merge the LoRA weights with the frozen pretrained model weights, you can optionally adjust how much of the weights to merge with the `scale` parameter:
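A sketch of what that might look like (the weights path is hypothetical; `scale` is passed through `cross_attention_kwargs`):

```py
pipe.load_lora_weights("path/to/dreambooth-lora")  # hypothetical local path or Hub id
pipe = pipe.to("cuda")

# merge only half of the LoRA weights into the frozen base weights
image = pipe(
    "A picture of a sks dog in a bucket",
    num_inference_steps=25,
    cross_attention_kwargs={"scale": 0.5},
).images[0]
```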
@@ -251,7 +251,7 @@ lora_model_id = "sayakpaul/dreambooth-text-encoder-test"
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True)
pipe = pipe.to("cuda")
pipe.load_lora_weights(lora_model_id)
image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
@@ -307,7 +307,7 @@ import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
pipeline = StableDiffusionPipeline.from_pretrained(
"gsdf/Counterfeit-V2.5", torch_dtype=torch.float16, safety_checker=None
"gsdf/Counterfeit-V2.5", torch_dtype=torch.float16, safety_checker=None, use_safetensors=True
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
pipeline.scheduler.config, use_karras_sigmas=True

View File

@@ -238,7 +238,7 @@ Now you can load the fine-tuned model for inference by passing the model path or
from diffusers import StableDiffusionPipeline
model_path = "path_to_saved_model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, use_safetensors=True)
pipe.to("cuda")
image = pipe(prompt="yoda").images[0]
@@ -275,3 +275,9 @@ image.save("yoda-pokemon.png")
```
</jax>
</frameworkcontent>
## Stable Diffusion XL
* We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md).
* We also support fine-tuning of the UNet and Text Encoder shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with LoRA via the `train_text_to_image_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md).

View File

@@ -204,7 +204,7 @@ from diffusers import StableDiffusionPipeline
import torch
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
```
Next, we need to load the textual inversion embedding vector which can be done via the [`TextualInversionLoaderMixin.load_textual_inversion`]
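A short sketch of that call (the concept repository and its `<cat-toy>` placeholder token are illustrative):

```py
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

prompt = "a <cat-toy> sitting on a bench"
image = pipe(prompt, num_inference_steps=50).images[0]
```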

View File

@@ -252,18 +252,11 @@ Then, you'll need a way to evaluate the model. For evaluation, you can use the [
```py
>>> from diffusers import DDPMPipeline
>>> from diffusers.utils import make_image_grid
>>> import math
>>> import os
>>> def make_grid(images, rows, cols):
... w, h = images[0].size
... grid = Image.new("RGB", size=(cols * w, rows * h))
... for i, image in enumerate(images):
... grid.paste(image, box=(i % cols * w, i // cols * h))
... return grid
>>> def evaluate(config, epoch, pipeline):
... # Sample some images from random noise (this is the backward diffusion process).
... # The default pipeline output type is `List[PIL.Image]`
@@ -273,7 +266,7 @@ Then, you'll need a way to evaluate the model. For evaluation, you can use the [
... ).images
... # Make a grid out of the images
... image_grid = make_grid(images, rows=4, cols=4)
... image_grid = make_image_grid(images, rows=4, cols=4)
... # Save the images
... test_dir = os.path.join(config.output_dir, "samples")

View File

@@ -25,7 +25,7 @@ In this guide, you'll use [`DiffusionPipeline`] for text-to-image generation wit
```python
>>> from diffusers import DiffusionPipeline
>>> generator = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
>>> generator = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
```
The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components.

View File

@@ -94,7 +94,7 @@ output = pipeline()
But what's even better is you can load pre-existing weights into the pipeline if the pipeline structure is identical. For example, you can load the [`google/ddpm-cifar10-32`](https://huggingface.co/google/ddpm-cifar10-32) weights into the one-step pipeline:
```python
pipeline = UnetSchedulerOneForwardPipeline.from_pretrained("google/ddpm-cifar10-32")
pipeline = UnetSchedulerOneForwardPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
output = pipeline()
```
@@ -108,7 +108,9 @@ Once it is merged, anyone with `diffusers >= 0.4.0` installed can use this pipel
```python
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="one_step_unet")
pipe = DiffusionPipeline.from_pretrained(
"google/ddpm-cifar10-32", custom_pipeline="one_step_unet", use_safetensors=True
)
pipe()
```
@@ -117,7 +119,9 @@ Another way to share your community pipeline is to upload the `one_step_unet.py`
```python
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="stevhliu/one_step_unet")
pipeline = DiffusionPipeline.from_pretrained(
"google/ddpm-cifar10-32", custom_pipeline="stevhliu/one_step_unet", use_safetensors=True
)
```
Take a look at the following table to compare the two sharing workflows to help you decide the best option for you:
@@ -161,6 +165,7 @@ pipeline = DiffusionPipeline.from_pretrained(
feature_extractor=feature_extractor,
scheduler=scheduler,
torch_dtype=torch.float16,
use_safetensors=True,
)
```

View File

@@ -24,7 +24,7 @@ Next, configure the following parameters in the [`DDIMScheduler`]:
```py
>>> from diffusers import DiffusionPipeline, DDIMScheduler
>>> pipeline = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2")
>>> pipeline = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", use_safetensors=True)
# switch the scheduler in the pipeline to use the DDIMScheduler
>>> pipeline.scheduler = DDIMScheduler.from_config(

View File

@@ -66,16 +66,16 @@ For convenience, we provide a table to denote which methods are inference-only a
[Paper](https://arxiv.org/abs/2211.09800)
[Instruct Pix2Pix](../api/pipelines/stable_diffusion/pix2pix) is fine-tuned from stable diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
[Instruct Pix2Pix](../api/pipelines/pix2pix) is fine-tuned from stable diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
Instruct Pix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.
See [here](../api/pipelines/stable_diffusion/pix2pix) for more information on how to use it.
See [here](../api/pipelines/pix2pix) for more information on how to use it.
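As a rough usage sketch (assuming the [`StableDiffusionInstructPix2PixPipeline`] class and the `timbrooks/instruct-pix2pix` checkpoint; the image path is a placeholder), an edit looks roughly like this:
```py
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Placeholder input image; the prompt describes the edit to apply
init_image = Image.open("my_photo.png").convert("RGB")
image = pipe("make the sky stormy", image=init_image, image_guidance_scale=1.5).images[0]
```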
## Pix2Pix Zero
[Paper](https://arxiv.org/abs/2302.03027)
[Pix2Pix Zero](../api/pipelines/stable_diffusion/pix2pix_zero) allows modifying an image so that one concept or subject is translated to another one while preserving general image semantics.
[Pix2Pix Zero](../api/pipelines/pix2pix_zero) allows modifying an image so that one concept or subject is translated to another one while preserving general image semantics.
The denoising process is guided from one conceptual embedding towards another conceptual embedding. The intermediate latents are optimized during the denoising process to push the attention maps towards reference attention maps. The reference attention maps are from the denoising process of the input image and are used to encourage semantic preservation.
@@ -88,26 +88,26 @@ Pix2Pix Zero can be used both to edit synthetic images as well as real images.
<Tip>
Pix2Pix Zero is the first model that allows "zero-shot" image editing. This means that the model
can edit an image in less than a minute on a consumer GPU as shown [here](../api/pipelines/stable_diffusion/pix2pix_zero#usage-example).
can edit an image in less than a minute on a consumer GPU as shown [here](../api/pipelines/pix2pix_zero#usage-example).
</Tip>
As mentioned above, Pix2Pix Zero includes optimizing the latents (and not any of the UNet, VAE, or the text encoder) to steer the generation toward a specific concept. This means that the overall
pipeline might require more memory than a standard [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).
See [here](../api/pipelines/stable_diffusion/pix2pix_zero) for more information on how to use it.
See [here](../api/pipelines/pix2pix_zero) for more information on how to use it.
## Attend and Excite
[Paper](https://arxiv.org/abs/2301.13826)
[Attend and Excite](../api/pipelines/stable_diffusion/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.
[Attend and Excite](../api/pipelines/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.
A set of token indices are given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each token index is guaranteed to have a minimum attention threshold for at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.
Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop (leaving the pre-trained weights untouched) in its pipeline and can require more memory than the usual `StableDiffusionPipeline`.
Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop (leaving the pre-trained weights untouched) in its pipeline and can require more memory than the usual [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).
See [here](../api/pipelines/stable_diffusion/attend_and_excite) for more information on how to use it.
See [here](../api/pipelines/attend_and_excite) for more information on how to use it.
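A minimal sketch of the idea, assuming the [`StableDiffusionAttendAndExcitePipeline`] class and its `token_indices` argument (the indices below are illustrative and depend on the tokenizer):
```py
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# Token indices for "cat" and "frog"; the exact values depend on the tokenizer and are illustrative
image = pipe(prompt, token_indices=[2, 5], num_inference_steps=50).images[0]
```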
## Semantic Guidance (SEGA)
@@ -125,11 +125,11 @@ See [here](../api/pipelines/semantic_stable_diffusion) for more information on h
[Paper](https://arxiv.org/abs/2210.00939)
[Self-attention Guidance](../api/pipelines/stable_diffusion/self_attention_guidance) improves the general quality of images.
[Self-attention Guidance](../api/pipelines/self_attention_guidance) improves the general quality of images.
SAG provides guidance from predictions not conditioned on high-frequency details to fully conditioned images. The high frequency details are extracted out of the UNet self-attention maps.
See [here](../api/pipelines/stable_diffusion/self_attention_guidance) for more information on how to use it.
See [here](../api/pipelines/self_attention_guidance) for more information on how to use it.
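A minimal usage sketch, assuming the [`StableDiffusionSAGPipeline`] class and its `sag_scale` argument (the checkpoint and scale value are illustrative):
```py
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A higher sag_scale applies stronger self-attention guidance; 0.75 is an illustrative value
image = pipe("a photo of an astronaut riding a horse on mars", sag_scale=0.75).images[0]
```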
## Depth2Image
@@ -154,9 +154,9 @@ apply Pix2Pix Zero to any of the available Stable Diffusion models.
[Paper](https://arxiv.org/abs/2302.08113)
MultiDiffusion defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation methods that can be readily applied to generate high quality and diverse images. Results adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
[MultiDiffusion Panorama](../api/pipelines/stable_diffusion/panorama) allows to generate high-quality images at arbitrary aspect ratios (e.g., panoramas).
[MultiDiffusion Panorama](../api/pipelines/panorama) allows you to generate high-quality images at arbitrary aspect ratios (e.g., panoramas).
See [here](../api/pipelines/stable_diffusion/panorama) for more information on how to use it to generate panoramic images.
See [here](../api/pipelines/panorama) for more information on how to use it to generate panoramic images.
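A minimal sketch of panorama generation, assuming the [`StableDiffusionPanoramaPipeline`] class (the checkpoint, prompt, and resolution are illustrative):
```py
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A wide `width` yields a panoramic image; 512x2048 is an illustrative aspect ratio
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
```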
## Fine-tuning your own models
@@ -206,20 +206,20 @@ For more details, check out our [official doc](../training/custom_diffusion).
[Paper](https://arxiv.org/abs/2303.08084)
The [text-to-image model editing pipeline](../api/pipelines/stable_diffusion/model_editing) helps you mitigate some of the incorrect implicit assumptions a pre-trained text-to-image
The [text-to-image model editing pipeline](../api/pipelines/model_editing) helps you mitigate some of the incorrect implicit assumptions a pre-trained text-to-image
diffusion model might make about the subjects present in the input prompt. For example, if you prompt Stable Diffusion to generate images for "A pack of roses", the roses in the generated images
are more likely to be red. This pipeline helps you change that assumption.
To know more details, check out the [official doc](../api/pipelines/stable_diffusion/model_editing).
To know more details, check out the [official doc](../api/pipelines/model_editing).
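A heavily simplified sketch, assuming the [`StableDiffusionModelEditingPipeline`] class and its `edit_model` method (treat the exact arguments as illustrative and defer to the linked doc):
```py
from diffusers import StableDiffusionModelEditingPipeline

pipe = StableDiffusionModelEditingPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Edit the implicit assumption so "a pack of roses" maps to blue roses instead of red ones
pipe.edit_model(source_prompt="A pack of roses", destination_prompt="A pack of blue roses")
image = pipe("A pack of roses").images[0]
```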
## DiffEdit
[Paper](https://arxiv.org/abs/2210.11427)
[DiffEdit](../api/pipelines/stable_diffusion/diffedit) allows for semantic editing of input images along with
[DiffEdit](../api/pipelines/diffedit) allows for semantic editing of input images along with
input prompts while preserving the original input images as much as possible.
To know more details, check out the [official doc](../api/pipelines/stable_diffusion/model_editing).
To know more details, check out the [official doc](../api/pipelines/diffedit).
## T2I-Adapter

View File

@@ -32,7 +32,7 @@ If a community doesn't work as expected, please open an issue and ping the autho
To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines; we will merge them quickly.
```py
pipe = DiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder"
"CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder", use_safetensors=True
)
```
@@ -61,6 +61,7 @@ guided_pipeline = DiffusionPipeline.from_pretrained(
clip_model=clip_model,
feature_extractor=feature_extractor,
torch_dtype=torch.float16,
use_safetensors=True,
)
guided_pipeline.enable_attention_slicing()
guided_pipeline = guided_pipeline.to("cuda")
@@ -117,6 +118,7 @@ pipe = DiffusionPipeline.from_pretrained(
torch_dtype=torch.float16,
safety_checker=None, # Very important for videos...lots of false positives while interpolating
custom_pipeline="interpolate_stable_diffusion",
use_safetensors=True,
).to("cuda")
pipe.enable_attention_slicing()
@@ -159,6 +161,7 @@ pipe = DiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
custom_pipeline="stable_diffusion_mega",
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe.to("cuda")
pipe.enable_attention_slicing()
@@ -203,7 +206,7 @@ from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained(
"hakurei/waifu-diffusion", custom_pipeline="lpw_stable_diffusion", torch_dtype=torch.float16
"hakurei/waifu-diffusion", custom_pipeline="lpw_stable_diffusion", torch_dtype=torch.float16, use_safetensors=True
)
pipe = pipe.to("cuda")
@@ -224,6 +227,7 @@ pipe = DiffusionPipeline.from_pretrained(
custom_pipeline="lpw_stable_diffusion_onnx",
revision="onnx",
provider="CUDAExecutionProvider",
use_safetensors=True,
)
prompt = "a photo of an astronaut riding a horse on mars, best quality"
@@ -267,8 +271,8 @@ diffuser_pipeline = DiffusionPipeline.from_pretrained(
custom_pipeline="speech_to_image_diffusion",
speech_model=model,
speech_processor=processor,
torch_dtype=torch.float16,
use_safetensors=True,
)
diffuser_pipeline.enable_attention_slicing()

View File

@@ -30,7 +30,7 @@ To load any community pipeline on the Hub, pass the repository id of the communi
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained(
"google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline"
"google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline", use_safetensors=True
)
```
@@ -50,6 +50,7 @@ pipeline = DiffusionPipeline.from_pretrained(
custom_pipeline="clip_guided_stable_diffusion",
clip_model=clip_model,
feature_extractor=feature_extractor,
use_safetensors=True,
)
```

View File

@@ -28,6 +28,7 @@ from diffusers import StableDiffusionDepth2ImgPipeline
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-depth",
torch_dtype=torch.float16,
use_safetensors=True,
).to("cuda")
```
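As a rough sketch of the next step (the image path and prompts are placeholders), you pass an initial image together with a prompt and an optional negative prompt:
```py
from PIL import Image

# `pipe` is the StableDiffusionDepth2ImgPipeline loaded above; the image path is a placeholder
init_image = Image.open("two_cats.png").convert("RGB")
image = pipe(prompt="two tigers", image=init_image, negative_prompt="bad, deformed").images[0]
```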

View File

@@ -33,9 +33,9 @@ from io import BytesIO
from diffusers import StableDiffusionImg2ImgPipeline
device = "cuda"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("nitrosocke/Ghibli-Diffusion", torch_dtype=torch.float16).to(
device
)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
"nitrosocke/Ghibli-Diffusion", torch_dtype=torch.float16, use_safetensors=True
).to(device)
```
Download and preprocess an initial image so you can pass it to the pipeline:
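A minimal sketch of that step (the URL is illustrative; any RGB image works as the starting point):
```py
import requests
from io import BytesIO
from PIL import Image

# Illustrative source image; replace with your own starting image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
init_image.thumbnail((768, 768))
```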

View File

@@ -29,6 +29,7 @@ from diffusers import StableDiffusionInpaintPipeline
pipeline = StableDiffusionInpaintPipeline.from_pretrained(
"runwayml/stable-diffusion-inpainting",
torch_dtype=torch.float16,
use_safetensors=True,
)
pipeline = pipeline.to("cuda")
```
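As a rough sketch of the next step (the file names are placeholders), the pipeline takes the original image, a mask marking the region to repaint, and a prompt:
```py
from PIL import Image

# `pipeline` is the StableDiffusionInpaintPipeline loaded above; file names are placeholders.
# The mask should be white where the image will be repainted and black elsewhere.
init_image = Image.open("overture-creations.png").convert("RGB")
mask_image = Image.open("overture-creations-mask.png").convert("RGB")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
```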

View File

@@ -39,7 +39,7 @@ The [`DiffusionPipeline`] class is the simplest and most generic way to load any
from diffusers import DiffusionPipeline
repo_id = "runwayml/stable-diffusion-v1-5"
pipe = DiffusionPipeline.from_pretrained(repo_id)
pipe = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
```
You can also load a checkpoint with its specific pipeline class. The example above loaded a Stable Diffusion model; to get the same result, use the [`StableDiffusionPipeline`] class:
@@ -48,7 +48,7 @@ You can also load a checkpoint with it's specific pipeline class. The example ab
from diffusers import StableDiffusionPipeline
repo_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(repo_id)
pipe = StableDiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
```
A checkpoint (such as [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) or [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)) may also be used for more than one task, like text-to-image or image-to-image. To differentiate what task you want to use the checkpoint for, you have to load it directly with its corresponding task-specific pipeline class:
@@ -65,7 +65,7 @@ pipe = StableDiffusionImg2ImgPipeline.from_pretrained(repo_id)
To load a diffusion pipeline locally, use [`git-lfs`](https://git-lfs.github.com/) to manually download the checkpoint (in this case, [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)) to your local disk. This creates a local folder, `./stable-diffusion-v1-5`, on your disk:
```bash
git lfs install
git-lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
```
@@ -75,7 +75,7 @@ Then pass the local path to [`~DiffusionPipeline.from_pretrained`]:
from diffusers import DiffusionPipeline
repo_id = "./stable-diffusion-v1-5"
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id)
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
```
The [`~DiffusionPipeline.from_pretrained`] method won't download any files from the Hub when it detects a local path, but this also means it won't download and cache the latest changes to a checkpoint.
@@ -94,7 +94,7 @@ To find out which schedulers are compatible for customization, you can use the `
from diffusers import DiffusionPipeline
repo_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id)
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
stable_diffusion.scheduler.compatibles
```
@@ -109,7 +109,7 @@ repo_id = "runwayml/stable-diffusion-v1-5"
scheduler = EulerDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, scheduler=scheduler)
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, scheduler=scheduler, use_safetensors=True)
```
### Safety checker
@@ -120,7 +120,7 @@ Diffusion models like Stable Diffusion can generate harmful content, which is wh
from diffusers import DiffusionPipeline
repo_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, safety_checker=None)
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, safety_checker=None, use_safetensors=True)
```
### Reuse components across pipelines
@@ -131,7 +131,7 @@ You can also reuse the same components in multiple pipelines to avoid loading th
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
model_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
components = stable_diffusion_txt2img.components
```
@@ -148,7 +148,7 @@ You can also pass the components individually to the pipeline if you want more f
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
model_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True)
stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(
vae=stable_diffusion_txt2img.vae,
text_encoder=stable_diffusion_txt2img.text_encoder,
@@ -194,10 +194,12 @@ import torch
# load fp16 variant
stable_diffusion = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16
"runwayml/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
)
# load non_ema variant
stable_diffusion = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", variant="non_ema")
stable_diffusion = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", variant="non_ema", use_safetensors=True
)
```
To save a checkpoint stored in a different floating point type or as a non-EMA variant, use the [`DiffusionPipeline.save_pretrained`] method and specify the `variant` argument. You should try and save a variant to the same folder as the original checkpoint, so you can load both from the same folder:
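A minimal sketch of what that looks like (the local folder path is illustrative):
```py
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
)
# Save the fp16 variant into the same local folder as the original checkpoint
pipeline.save_pretrained("./stable-diffusion-v1-5", variant="fp16")
```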
@@ -215,10 +217,12 @@ If you don't save the variant to an existing folder, you must specify the `varia
```python
# 👎 this won't work
stable_diffusion = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", torch_dtype=torch.float16)
stable_diffusion = DiffusionPipeline.from_pretrained(
"./stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
# 👍 this works
stable_diffusion = DiffusionPipeline.from_pretrained(
"./stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16
"./stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
)
```
@@ -233,7 +237,7 @@ load model variants, e.g.:
```python
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16")
pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", use_safetensors=True)
```
However, this behavior is now deprecated: the `revision` argument should instead be used (just as on GitHub) to load model checkpoints from a specific commit or branch in development.
@@ -259,7 +263,7 @@ Models can be loaded from a subfolder with the `subfolder` argument. For example
from diffusers import UNet2DConditionModel
repo_id = "runwayml/stable-diffusion-v1-5"
model = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet")
model = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet", use_safetensors=True)
```
Or directly from a repository's [directory](https://huggingface.co/google/ddpm-cifar10-32/tree/main):
@@ -268,7 +272,7 @@ Or directly from a repository's [directory](https://huggingface.co/google/ddpm-c
from diffusers import UNet2DModel
repo_id = "google/ddpm-cifar10-32"
model = UNet2DModel.from_pretrained(repo_id)
model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True)
```
You can also load and save model variants by specifying the `variant` argument in [`ModelMixin.from_pretrained`] and [`ModelMixin.save_pretrained`]:
@@ -276,7 +280,9 @@ You can also load and save model variants by specifying the `variant` argument i
```python
from diffusers import UNet2DConditionModel
model = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet", variant="non-ema")
model = UNet2DConditionModel.from_pretrained(
"runwayml/stable-diffusion-v1-5", subfolder="unet", variant="non-ema", use_safetensors=True
)
model.save_pretrained("./local-unet", variant="non-ema")
```
@@ -310,7 +316,7 @@ euler = EulerDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
dpm = DPMSolverMultistepScheduler.from_pretrained(repo_id, subfolder="scheduler")
# replace `dpm` with any of `ddpm`, `ddim`, `pndm`, `lms`, `euler_anc`, `euler`
pipeline = StableDiffusionPipeline.from_pretrained(repo_id, scheduler=dpm)
pipeline = StableDiffusionPipeline.from_pretrained(repo_id, scheduler=dpm, use_safetensors=True)
```
## DiffusionPipeline explained
@@ -326,7 +332,7 @@ The pipelines underlying folder structure corresponds directly with their class
from diffusers import DiffusionPipeline
repo_id = "runwayml/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(repo_id)
pipeline = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
print(pipeline)
```
@@ -460,4 +466,4 @@ Every pipeline expects a `model_index.json` file that tells the [`DiffusionPipel
"AutoencoderKL"
]
}
```
```

View File

@@ -111,7 +111,9 @@ If you prefer to run inference with code, click on the **Use in Diffusers** butt
```py
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline")
pipeline = DiffusionPipeline.from_pretrained(
"sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline", use_safetensors=True
)
```
Then you can generate an image like:
@@ -119,7 +121,9 @@ Then you can generate an image like:
```py
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline")
pipeline = DiffusionPipeline.from_pretrained(
"sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline", use_safetensors=True
)
pipeline.to("cuda")
placeholder_token = "<my-funny-cat-token>"
@@ -171,22 +175,12 @@ images = pipeline(
).images
```
Finally, create a helper function to display the images:
Display the images:
```py
from PIL import Image
from diffusers.utils import make_image_grid
def image_grid(imgs, rows=2, cols=2):
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
image_grid(images)
make_image_grid(images, 2, 2)
```
<div class="flex justify-center">

View File

@@ -0,0 +1,171 @@
# Push files to the Hub
[[open-in-colab]]
🤗 Diffusers provides a [`~diffusers.utils.PushToHubMixin`] for uploading your model, scheduler, or pipeline to the Hub. It is an easy way to store your files on the Hub, and also allows you to share your work with others. Under the hood, the [`~diffusers.utils.PushToHubMixin`]:
1. creates a repository on the Hub
2. saves your model, scheduler, or pipeline files so they can be reloaded later
3. uploads the folder containing these files to the Hub
This guide will show you how to use the [`~diffusers.utils.PushToHubMixin`] to upload your files to the Hub.
You'll need to log in to your Hub account with your access [token](https://huggingface.co/settings/tokens) first:
```py
from huggingface_hub import notebook_login
notebook_login()
```
## Models
To push a model to the Hub, call [`~diffusers.utils.PushToHubMixin.push_to_hub`] and specify the repository id of the model to be stored on the Hub:
```py
from diffusers import ControlNetModel
controlnet = ControlNetModel(
block_out_channels=(32, 64),
layers_per_block=2,
in_channels=4,
down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
cross_attention_dim=32,
conditioning_embedding_out_channels=(16, 32),
)
controlnet.push_to_hub("my-controlnet-model")
```
For models, you can also specify the [*variant*](loading#checkpoint-variants) of the weights to push to the Hub. For example, to push `fp16` weights:
```py
controlnet.push_to_hub("my-controlnet-model", variant="fp16")
```
The [`~diffusers.utils.PushToHubMixin.push_to_hub`] function saves the model's `config.json` file, and the weights are automatically saved in the `safetensors` format.
Now you can reload the model from your repository on the Hub:
```py
model = ControlNetModel.from_pretrained("your-namespace/my-controlnet-model")
```
## Scheduler
To push a scheduler to the Hub, call [`~diffusers.utils.PushToHubMixin.push_to_hub`] and specify the repository id of the scheduler to be stored on the Hub:
```py
from diffusers import DDIMScheduler
scheduler = DDIMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
)
scheduler.push_to_hub("my-controlnet-scheduler")
```
The [`~diffusers.utils.PushToHubMixin.push_to_hub`] function saves the scheduler's `scheduler_config.json` file to the specified repository.
Now you can reload the scheduler from your repository on the Hub:
```py
scheduler = DDIMScheduler.from_pretrained("your-namepsace/my-controlnet-scheduler")
```
## Pipeline
You can also push an entire pipeline with all its components to the Hub. For example, initialize the components of a [`StableDiffusionPipeline`] with the parameters you want:
```py
from diffusers import (
UNet2DConditionModel,
AutoencoderKL,
DDIMScheduler,
StableDiffusionPipeline,
)
from transformers import CLIPTextModel, CLIPTextConfig, CLIPTokenizer
unet = UNet2DConditionModel(
block_out_channels=(32, 64),
layers_per_block=2,
sample_size=32,
in_channels=4,
out_channels=4,
down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
cross_attention_dim=32,
)
scheduler = DDIMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
)
vae = AutoencoderKL(
block_out_channels=[32, 64],
in_channels=3,
out_channels=3,
down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
latent_channels=4,
)
text_encoder_config = CLIPTextConfig(
bos_token_id=0,
eos_token_id=2,
hidden_size=32,
intermediate_size=37,
layer_norm_eps=1e-05,
num_attention_heads=4,
num_hidden_layers=5,
pad_token_id=1,
vocab_size=1000,
)
text_encoder = CLIPTextModel(text_encoder_config)
tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
```
Pass all of the components to the [`StableDiffusionPipeline`] and call [`~diffusers.utils.PushToHubMixin.push_to_hub`] to push the pipeline to the Hub:
```py
components = {
"unet": unet,
"scheduler": scheduler,
"vae": vae,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
"safety_checker": None,
"feature_extractor": None,
}
pipeline = StableDiffusionPipeline(**components)
pipeline.push_to_hub("my-pipeline")
```
The [`~diffusers.utils.PushToHubMixin.push_to_hub`] function saves each component to a subfolder in the repository. Now you can reload the pipeline from your repository on the Hub:
```py
pipeline = StableDiffusionPipeline.from_pretrained("your-namespace/my-pipeline")
```
## Privacy
Set `private=True` in the [`~diffusers.utils.PushToHubMixin.push_to_hub`] function to keep your model, scheduler, or pipeline files private:
```py
controlnet.push_to_hub("my-controlnet-model", private=True)
```
Private repositories are only visible to you; other users won't be able to clone the repository, and it won't appear in search results. Even if a user has the URL to your private repository, they'll receive a `404 - Repo not found` error.
To load a model, scheduler, or pipeline from private or gated repositories, set `use_auth_token=True`:
```py
model = ControlNetModel.from_pretrained("your-namespace/my-controlnet-model", use_auth_token=True)
```

View File

@@ -40,7 +40,7 @@ import numpy as np
model_id = "google/ddpm-cifar10-32"
# load model and scheduler
ddim = DDIMPipeline.from_pretrained(model_id)
ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
# run pipeline for just two steps and return numpy tensor
image = ddim(num_inference_steps=2, output_type="np").images
@@ -65,7 +65,7 @@ import numpy as np
model_id = "google/ddpm-cifar10-32"
# load model and scheduler
ddim = DDIMPipeline.from_pretrained(model_id)
ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
# create a generator for reproducibility
generator = torch.Generator(device="cpu").manual_seed(0)
@@ -100,7 +100,7 @@ import numpy as np
model_id = "google/ddpm-cifar10-32"
# load model and scheduler
ddim = DDIMPipeline.from_pretrained(model_id)
ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
ddim.to("cuda")
# create a generator for reproducibility
@@ -125,7 +125,7 @@ import numpy as np
model_id = "google/ddpm-cifar10-32"
# load model and scheduler
ddim = DDIMPipeline.from_pretrained(model_id)
ddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)
ddim.to("cuda")
# create a generator for reproducibility; notice you don't place it on the GPU!
@@ -174,7 +174,7 @@ from diffusers import DDIMScheduler, StableDiffusionPipeline
import numpy as np
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id).to("cuda")
pipe = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
g = torch.Generator(device="cuda")

View File

@@ -27,7 +27,9 @@ Instantiate a pipeline with [`DiffusionPipeline.from_pretrained`] and place it o
```python
>>> from diffusers import DiffusionPipeline
>>> pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
>>> pipe = DiffusionPipeline.from_pretrained(
... "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
... )
>>> pipe = pipe.to("cuda")
```

View File

@@ -39,7 +39,9 @@ import torch
login()
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
```
Next, we move it to GPU:

View File

@@ -153,19 +153,10 @@ images = pipeline.numpy_to_pil(images)
### Visualization
Let's create a helper function to display images in a grid.
```python
def image_grid(imgs, rows, cols):
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
```
from diffusers import make_image_grid
```python
image_grid(images, 2, 4)
make_image_grid(images, 2, 4)
```
![img](https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/stable_diffusion_jax_how_to_cell_38_output_0.jpeg)
@@ -198,7 +189,7 @@ images = pipeline(prompt_ids, p_params, rng, jit=True).images
images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
images = pipeline.numpy_to_pil(images)
image_grid(images, 2, 4)
make_image_grid(images, 2, 4)
```
![img](https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/stable_diffusion_jax_how_to_cell_43_output_0.jpeg)

View File

@@ -14,7 +14,7 @@ from huggingface_hub import notebook_login
notebook_login()
```
Import the necessary libraries, and create a helper function to visualize the generated images:
Import the necessary libraries:
```py
import os
@@ -24,19 +24,8 @@ import PIL
from PIL import Image
from diffusers import StableDiffusionPipeline
from diffusers.utils import make_image_grid
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
def image_grid(imgs, rows, cols):
assert len(imgs) == rows * cols
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
grid_w, grid_h = grid.size
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
```
Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer):
@@ -49,7 +38,9 @@ repo_id_embeds = "sd-concepts-library/cat-toy"
Now you can load a pipeline, and pass the pre-learned concept to it:
```py
pipeline = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path, torch_dtype=torch.float16).to("cuda")
pipeline = StableDiffusionPipeline.from_pretrained(
pretrained_model_name_or_path, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipeline.load_textual_inversion(repo_id_embeds)
```
@@ -71,7 +62,7 @@ for _ in range(num_rows):
images = pipe(prompt, num_images_per_prompt=num_samples, num_inference_steps=50, guidance_scale=7.5).images
all_images.extend(images)
grid = image_grid(all_images, num_samples, num_rows)
grid = make_image_grid(all_images, num_samples, num_rows)
grid
```

View File

@@ -32,7 +32,7 @@ In this guide, you'll use [`DiffusionPipeline`] for unconditional image generati
```python
>>> from diffusers import DiffusionPipeline
>>> generator = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128")
>>> generator = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128", use_safetensors=True)
```
The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components.
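A minimal sketch of the next step: once loaded, you sample an image by simply calling the pipeline.
```python
>>> # `generator` is the DiffusionPipeline loaded above
>>> image = generator().images[0]
>>> image.save("generated_image.png")
```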

View File

@@ -40,7 +40,9 @@ You can use the model with the new `.safetensors` weights by specifying the refe
```py
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", revision="refs/pr/22")
pipeline = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1", revision="refs/pr/22", use_safetensors=True
)
```
## Why use safetensors?
@@ -55,7 +57,7 @@ There are several reasons for using safetensors:
```py
from diffusers import StableDiffusionPipeline
pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipeline = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", use_safetensors=True)
"Loaded in safetensors 0:00:02.033658"
"Loaded in PyTorch 0:00:02.663379"
```

View File

@@ -10,34 +10,36 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Weighting prompts
# Prompt weighting
[[open-in-colab]]
Text-guided diffusion models generate images based on a given text prompt. The text prompt
can include multiple concepts that the model should generate and it's often desirable to weight
certain parts of the prompt more or less.
Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image. A prompt can include several concepts, which are turned into contextualized text embeddings. The embeddings are used by the model to condition its cross-attention layers to generate an image (read the Stable Diffusion [blog post](https://huggingface.co/blog/stable_diffusion) to learn more about how it works).
Diffusion models work by conditioning the cross attention layers of the diffusion model with contextualized text embeddings (see the [Stable Diffusion Guide for more information](../stable-diffusion)).
Thus a simple way to emphasize (or de-emphasize) certain parts of the prompt is by increasing or reducing the scale of the text embedding vector that corresponds to the relevant part of the prompt.
This is called "prompt-weighting" and has been a highly demanded feature by the community (see issue [here](https://github.com/huggingface/diffusers/issues/2431)).
Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to its concept in the prompt because you may not necessarily want the model to focus on all concepts equally. The easiest way to prepare the prompt-weighted embeddings is to use [Compel](https://github.com/damian0815/compel), a text prompt-weighting and blending library. Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a [`prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.prompt_embeds) (and optionally [`negative_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.negative_prompt_embeds)) parameter, such as [`StableDiffusionPipeline`], [`StableDiffusionControlNetPipeline`], and [`StableDiffusionXLPipeline`].
## How to do prompt-weighting in Diffusers
<Tip>
We believe the role of `diffusers` is to be a toolbox that provides essential features that enable other projects, such as [InvokeAI](https://github.com/invoke-ai/InvokeAI) or [diffuzers](https://github.com/abhishekkrthakur/diffuzers), to build powerful UIs. In order to support arbitrary methods to manipulate prompts, `diffusers` exposes a [`prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.prompt_embeds) function argument and an optional [`negative_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.negative_prompt_embeds) function argument to many pipelines such as [`StableDiffusionPipeline`], [`StableDiffusionControlNetPipeline`], [`StableDiffusionXLPipeline`], allowing to directly pass the "prompt-weighted"/scaled text embeddings to the pipeline.
If your favorite pipeline doesn't have a `prompt_embeds` parameter, please open an [issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can add it!
The [compel library](https://github.com/damian0815/compel) provides an easy way to emphasize or de-emphasize portions of the prompt for you. We strongly recommend it instead of preparing the embeddings yourself.
</Tip>
Let's look at a simple example. Imagine you want to generate an image of `"a red cat playing with a ball"` as
follows:
This guide will show you how to weight and blend your prompts with Compel in 🤗 Diffusers.
Before you begin, make sure you have the latest version of Compel installed:
### StableDiffusionPipeline
```py
# uncomment to install in Colab
#!pip install compel --upgrade
```
For this guide, let's generate an image with the prompt `"a red cat playing with a ball"` using the [`StableDiffusionPipeline`]:
```py
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
import torch
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_safetensors=True)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
prompt = "a red cat playing with a ball"
@@ -48,19 +50,13 @@ image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]
image
```
This gives you:
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/compel/forest_0.png"/>
</div>
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/compel/forest_0.png)
## Weighting
As you can see, there is no "ball" in the image. Let's emphasize this part!
For this we should install the `compel` library:
```py
pip install compel --upgrade
```
and then create a `Compel` object:
You'll notice there is no "ball" in the image! Let's use compel to upweight the concept of "ball" in the prompt. Create a [`Compel`](https://github.com/damian0815/compel/blob/main/doc/compel.md#compel-objects) object, and pass it a tokenizer and text encoder:
```py
from compel import Compel
@@ -68,40 +64,114 @@ from compel import Compel
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
```
Now we emphasize the part "ball" with the `"++"` syntax:
compel uses `+` or `-` to increase or decrease the weight of a word in the prompt. To increase the weight of "ball":
<Tip>
`+` corresponds to the value `1.1`, `++` corresponds to `1.1^2`, and so on. Similarly, `-` corresponds to `0.9` and `--` corresponds to `0.9^2`. Feel free to experiment with adding more `+` or `-` in your prompt!
</Tip>
```py
prompt = "a red cat playing with a ball++"
```
and instead of passing this to the pipeline directly, we have to process it using `compel_proc`:
Pass the prompt to `compel_proc` to create the new prompt embeddings which are passed to the pipeline:
```py
prompt_embeds = compel_proc(prompt)
```
generator = torch.manual_seed(33)
Now we can pass `prompt_embeds` directly to the pipeline:
```py
generator = torch.Generator(device="cpu").manual_seed(33)
images = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image
```
We now get the following image which has a "ball"!
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/compel/forest_1.png"/>
</div>
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/compel/forest_1.png)
To downweight parts of the prompt, use the `-` suffix:
Similarly, we de-emphasize parts of the sentence by using the `--` suffix for words, feel free to give it
a try!
```py
prompt = "a red------- cat playing with a ball"
prompt_embeds = compel_proc(prompt)
If your favorite pipeline does not have a `prompt_embeds` input, please make sure to open an issue, the
diffusers team tries to be as responsive as possible.
Compel 1.1.6 adds a utility class to simplify using textual inversions. Instantiate a `DiffusersTextualInversionManager` and pass it to Compel init:
generator = torch.manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"/>
</div>
You can even up or downweight multiple concepts in the same prompt:
```py
prompt = "a red cat++ playing with a ball----"
prompt_embeds = compel_proc(prompt)
generator = torch.manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-pos-neg.png"/>
</div>
## Blending
You can also create a weighted *blend* of prompts by adding `.blend()` to a list of prompts and passing it some weights. Your blend may not always produce the result you expect because it breaks some assumptions about how the text encoder functions, so just have fun and experiment with it!
```py
prompt_embeds = compel_proc('("a red cat playing with a ball", "jungle").blend(0.7, 0.8)')
generator = torch.Generator(device="cuda").manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-blend.png"/>
</div>
## Conjunction
A conjunction diffuses each prompt independently and concatenates their results by their weighted sum. Add `.and()` to the end of a list of prompts to create a conjunction:
```py
prompt_embeds = compel_proc('("a red cat, playing with a, ball").and()')
generator = torch.Generator(device="cuda").manual_seed(33)
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-conj.png"/>
</div>
## Textual inversion
[Textual inversion](../training/text_inversion) is a technique for learning a specific concept from some images which you can use to generate new images conditioned on that concept.
Create a pipeline and use the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] function to load the textual inversion embeddings (feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer) for 100+ trained concepts):
```py
import torch
from diffusers import StableDiffusionPipeline
from compel import Compel, DiffusersTextualInversionManager
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")
pipe.load_textual_inversion("sd-concepts-library/midjourney-style")
```
Compel provides a `DiffusersTextualInversionManager` class to simplify prompt weighting with textual inversion. Instantiate `DiffusersTextualInversionManager` and pass it to the `Compel` class:
```py
textual_inversion_manager = DiffusersTextualInversionManager(pipe)
compel = Compel(
tokenizer=pipe.tokenizer,
@@ -109,15 +179,48 @@ compel = Compel(
textual_inversion_manager=textual_inversion_manager)
```
Also, please check out the documentation of the [compel](https://github.com/damian0815/compel) library for
more information.
Incorporate the concept into a prompt using the `<concept>` syntax:
### StableDiffusionXLPipeline
```py
prompt_embeds = compel_proc('("A red cat++ playing with a ball <midjourney-style>")')
For StableDiffusionXL we need to not only pass `prompt_embeds` (and optionally `negative_prompt_embeds`), but also [`pooled_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLInpaintPipeline.__call__.pooled_prompt_embeds) and optionally [`negative_pooled_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLInpaintPipeline.__call__.negative_pooled_prompt_embeds).
In addition, [`StableDiffusionXLPipeline`] has two tokenizers and two text encoders which both need to be used to weight the prompt.
Luckily, [`compel`](https://github.com/damian0815/compel) takes care of SDXL's special needs - all we have to do is to pass both tokenizers and text encoders to the `Compel` class.
image = pipe(prompt_embeds=prompt_embeds).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-text-inversion.png"/>
</div>
## DreamBooth
[DreamBooth](../training/dreambooth) is a technique for generating contextualized images of a subject given just a few images of the subject to train on. It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings. This means you should use [`~DiffusionPipeline.from_pretrained`] to load the DreamBooth model (feel free to browse the [Stable Diffusion Dreambooth Concepts Library](https://huggingface.co/sd-dreambooth-library) for 100+ trained models):
```py
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler
from compel import Compel
pipe = DiffusionPipeline.from_pretrained("sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```
Create a `Compel` class with a tokenizer and text encoder, and pass your prompt to it. Depending on the model you use, you'll need to incorporate the model's unique identifier into your prompt. For example, the `dndcoverart-v1` model uses the identifier `dndcoverart`:
```py
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
prompt_embeds = compel_proc('("magazine cover of a dndcoverart dragon, high quality, intricate details, larry elmore art style").and()')
image = pipe(prompt_embeds=prompt_embeds).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-dreambooth.png"/>
</div>
## Stable Diffusion XL
Stable Diffusion XL (SDXL) has two tokenizers and text encoders so its usage is a bit different. To address this, you should pass both tokenizers and encoders to the `Compel` class:
```py
from compel import Compel, ReturnedEmbeddingsType
@@ -138,22 +241,18 @@ compel = Compel(
)
```
Let's try our example from above again. We use the same seed for both prompts and upweight ball by a factor of 1.5 for the first
prompt and downweight ball by 40% for the second prompt.
This time, let's upweight "ball" by a factor of 1.5 for the first prompt, and downweight "ball" by 0.6 for the second prompt. The [`StableDiffusionXLPipeline`] also requires [`pooled_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLInpaintPipeline.__call__.pooled_prompt_embeds) (and optionally [`negative_pooled_prompt_embeds`](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLInpaintPipeline.__call__.negative_pooled_prompt_embeds)) so you should pass those to the pipeline along with the conditioning tensors:
```py
# upweight "ball"
# apply weights
prompt = ["a red cat playing with a (ball)1.5", "a red cat playing with a (ball)0.6"]
conditioning, pooled = compel(prompt)
# generate image
generator = [torch.Generator().manual_seed(33) for _ in range(len(prompt))]
images = pipeline(prompt_embeds=conditioning, pooled_prompt_embeds=pooled, generator=generator, num_inference_steps=30).images
```
Let's have a look at the result.
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/compel/sdxl_ball1.png"/>
@@ -161,9 +260,6 @@ Let's have a look at the result.
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/compel/sdxl_ball2.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">a red cat playing with a (ball)0.6</figcaption>
<figcaption class="mt-2 text-center text-sm text-gray-500">"a red cat playing with a (ball)0.6"</figcaption>
</div>
</div>
We can see that the ball is almost completely gone on the right image while it's clearly visible on the left image.
For more information and more tricks you can use `compel` with, please have a look at the [compel docs](https://github.com/damian0815/compel/blob/main/doc/syntax.md) as well.
</div>

View File

@@ -25,7 +25,7 @@ A pipeline is a quick and easy way to run a model for inference, requiring no mo
```py
>>> from diffusers import DDPMPipeline
>>> ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256").to("cuda")
>>> ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
>>> image = ddpm(num_inference_steps=25).images[0]
>>> image
```
@@ -46,7 +46,7 @@ To recreate the pipeline with the model and scheduler separately, let's write ou
>>> from diffusers import DDPMScheduler, UNet2DModel
>>> scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
>>> model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
>>> model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
```
2. Set the number of timesteps to run the denoising process for:
@@ -124,10 +124,14 @@ Now that you know what you need for the Stable Diffusion pipeline, load all thes
>>> from transformers import CLIPTextModel, CLIPTokenizer
>>> from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
>>> vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
>>> vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
>>> tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
>>> text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")
>>> unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
>>> text_encoder = CLIPTextModel.from_pretrained(
... "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True
... )
>>> unet = UNet2DConditionModel.from_pretrained(
... "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True
... )
```
Instead of the default [`PNDMScheduler`], exchange it for the [`UniPCMultistepScheduler`] to see how easy it is to plug a different scheduler in:
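A minimal sketch of that swap (loading the scheduler configuration from the repository's `scheduler` subfolder):
```py
>>> from diffusers import UniPCMultistepScheduler

>>> # Load the UniPC scheduler configuration from the same repository's scheduler subfolder
>>> scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
```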

View File

@@ -2,14 +2,8 @@ import glob
import os
from typing import Dict, List, Union
import safetensors.torch
import torch
from diffusers.utils import is_safetensors_available
if is_safetensors_available():
import safetensors.torch
from huggingface_hub import snapshot_download
from diffusers import DiffusionPipeline, __version__
@@ -229,14 +223,14 @@ class CheckpointMergerPipeline(DiffusionPipeline):
update_theta_0 = getattr(module, "load_state_dict")
theta_1 = (
safetensors.torch.load_file(checkpoint_path_1)
if (is_safetensors_available() and checkpoint_path_1.endswith(".safetensors"))
if (checkpoint_path_1.endswith(".safetensors"))
else torch.load(checkpoint_path_1, map_location="cpu")
)
theta_2 = None
if checkpoint_path_2:
theta_2 = (
safetensors.torch.load_file(checkpoint_path_2)
if (is_safetensors_available() and checkpoint_path_2.endswith(".safetensors"))
if (checkpoint_path_2.endswith(".safetensors"))
else torch.load(checkpoint_path_2, map_location="cpu")
)

View File

@@ -0,0 +1,909 @@
import argparse
import inspect
import os
import time
import warnings
from typing import Any, Callable, Dict, List, Optional, Union
import numpy as np
import PIL.Image
import torch
from PIL import Image
from transformers import CLIPTokenizer
from diffusers import OnnxRuntimeModel, StableDiffusionImg2ImgPipeline, UniPCMultistepScheduler
from diffusers.image_processor import VaeImageProcessor
from diffusers.pipelines.pipeline_utils import DiffusionPipeline
from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
from diffusers.schedulers import KarrasDiffusionSchedulers
from diffusers.utils import (
deprecate,
logging,
randn_tensor,
replace_example_docstring,
)
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> # !pip install opencv-python transformers accelerate
>>> from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
>>> from diffusers.utils import load_image
>>> import numpy as np
>>> import torch
>>> import cv2
>>> from PIL import Image
>>> # download an image
>>> image = load_image(
... "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
... )
>>> np_image = np.array(image)
>>> # get canny image
>>> np_image = cv2.Canny(np_image, 100, 200)
>>> np_image = np_image[:, :, None]
>>> np_image = np.concatenate([np_image, np_image, np_image], axis=2)
>>> canny_image = Image.fromarray(np_image)
>>> # load control net and stable diffusion v1-5
>>> controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
>>> pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
... "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
... )
>>> # speed up diffusion process with faster scheduler and memory optimization
>>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
>>> pipe.enable_model_cpu_offload()
>>> # generate image
>>> generator = torch.manual_seed(0)
>>> image = pipe(
... "futuristic-looking woman",
... num_inference_steps=20,
... generator=generator,
... image=image,
... control_image=canny_image,
... ).images[0]
```
"""
def prepare_image(image):
if isinstance(image, torch.Tensor):
# Batch single image
if image.ndim == 3:
image = image.unsqueeze(0)
image = image.to(dtype=torch.float32)
else:
# preprocess image
if isinstance(image, (PIL.Image.Image, np.ndarray)):
image = [image]
if isinstance(image, list) and isinstance(image[0], PIL.Image.Image):
image = [np.array(i.convert("RGB"))[None, :] for i in image]
image = np.concatenate(image, axis=0)
elif isinstance(image, list) and isinstance(image[0], np.ndarray):
image = np.concatenate([i[None, :] for i in image], axis=0)
image = image.transpose(0, 3, 1, 2)
image = torch.from_numpy(image).to(dtype=torch.float32) / 127.5 - 1.0
return image
class OnnxStableDiffusionControlNetImg2ImgPipeline(DiffusionPipeline):
vae_encoder: OnnxRuntimeModel
vae_decoder: OnnxRuntimeModel
text_encoder: OnnxRuntimeModel
tokenizer: CLIPTokenizer
unet: OnnxRuntimeModel
scheduler: KarrasDiffusionSchedulers
def __init__(
self,
vae_encoder: OnnxRuntimeModel,
vae_decoder: OnnxRuntimeModel,
text_encoder: OnnxRuntimeModel,
tokenizer: CLIPTokenizer,
unet: OnnxRuntimeModel,
scheduler: KarrasDiffusionSchedulers,
):
super().__init__()
self.register_modules(
vae_encoder=vae_encoder,
vae_decoder=vae_decoder,
text_encoder=text_encoder,
tokenizer=tokenizer,
unet=unet,
scheduler=scheduler,
)
self.vae_scale_factor = 2 ** (4 - 1)
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor, do_convert_rgb=True)
self.control_image_processor = VaeImageProcessor(
vae_scale_factor=self.vae_scale_factor, do_convert_rgb=True, do_normalize=False
)
def _encode_prompt(
self,
prompt: Union[str, List[str]],
num_images_per_prompt: Optional[int],
do_classifier_free_guidance: bool,
negative_prompt: Optional[str],
prompt_embeds: Optional[np.ndarray] = None,
negative_prompt_embeds: Optional[np.ndarray] = None,
):
r"""
Encodes the prompt into text encoder hidden states.
Args:
prompt (`str` or `List[str]`):
prompt to be encoded
num_images_per_prompt (`int`):
number of images that should be generated per prompt
do_classifier_free_guidance (`bool`):
whether to use classifier free guidance or not
negative_prompt (`str` or `List[str]`):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
if `guidance_scale` is less than `1`).
prompt_embeds (`np.ndarray`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`np.ndarray`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
"""
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
if prompt_embeds is None:
# get prompt text embeddings
text_inputs = self.tokenizer(
prompt,
padding="max_length",
max_length=self.tokenizer.model_max_length,
truncation=True,
return_tensors="np",
)
text_input_ids = text_inputs.input_ids
untruncated_ids = self.tokenizer(prompt, padding="max_length", return_tensors="np").input_ids
if not np.array_equal(text_input_ids, untruncated_ids):
removed_text = self.tokenizer.batch_decode(
untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
)
logger.warning(
"The following part of your input was truncated because CLIP can only handle sequences up to"
f" {self.tokenizer.model_max_length} tokens: {removed_text}"
)
prompt_embeds = self.text_encoder(input_ids=text_input_ids.astype(np.int32))[0]
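        # duplicate the text embeddings so that each prompt produces num_images_per_prompt images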
prompt_embeds = np.repeat(prompt_embeds, num_images_per_prompt, axis=0)
# get unconditional embeddings for classifier free guidance
if do_classifier_free_guidance and negative_prompt_embeds is None:
uncond_tokens: List[str]
if negative_prompt is None:
uncond_tokens = [""] * batch_size
elif type(prompt) is not type(negative_prompt):
raise TypeError(
f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
f" {type(prompt)}."
)
elif isinstance(negative_prompt, str):
uncond_tokens = [negative_prompt] * batch_size
elif batch_size != len(negative_prompt):
raise ValueError(
f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
" the batch size of `prompt`."
)
else:
uncond_tokens = negative_prompt
max_length = prompt_embeds.shape[1]
uncond_input = self.tokenizer(
uncond_tokens,
padding="max_length",
max_length=max_length,
truncation=True,
return_tensors="np",
)
negative_prompt_embeds = self.text_encoder(input_ids=uncond_input.input_ids.astype(np.int32))[0]
if do_classifier_free_guidance:
negative_prompt_embeds = np.repeat(negative_prompt_embeds, num_images_per_prompt, axis=0)
# For classifier free guidance, we need to do two forward passes.
# Here we concatenate the unconditional and text embeddings into a single batch
# to avoid doing two forward passes
prompt_embeds = np.concatenate([negative_prompt_embeds, prompt_embeds])
return prompt_embeds
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.decode_latents
def decode_latents(self, latents):
warnings.warn(
"The decode_latents method is deprecated and will be removed in a future version. Please"
" use VaeImageProcessor instead",
FutureWarning,
)
latents = 1 / self.vae.config.scaling_factor * latents
image = self.vae.decode(latents, return_dict=False)[0]
image = (image / 2 + 0.5).clamp(0, 1)
# we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
image = image.cpu().permute(0, 2, 3, 1).float().numpy()
return image
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
def prepare_extra_step_kwargs(self, generator, eta):
# prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
# eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
# eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
# and should be between [0, 1]
accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
extra_step_kwargs = {}
if accepts_eta:
extra_step_kwargs["eta"] = eta
# check if the scheduler accepts generator
accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
if accepts_generator:
extra_step_kwargs["generator"] = generator
return extra_step_kwargs
def check_inputs(
self,
num_controlnet,
prompt,
image,
callback_steps,
negative_prompt=None,
prompt_embeds=None,
negative_prompt_embeds=None,
controlnet_conditioning_scale=1.0,
control_guidance_start=0.0,
control_guidance_end=1.0,
):
if (callback_steps is None) or (
callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
):
raise ValueError(
f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
f" {type(callback_steps)}."
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if negative_prompt is not None and negative_prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
)
if prompt_embeds is not None and negative_prompt_embeds is not None:
if prompt_embeds.shape != negative_prompt_embeds.shape:
raise ValueError(
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
f" {negative_prompt_embeds.shape}."
)
# Check `image`
if num_controlnet == 1:
self.check_image(image, prompt, prompt_embeds)
elif num_controlnet > 1:
if not isinstance(image, list):
raise TypeError("For multiple controlnets: `image` must be type `list`")
# When `image` is a nested list:
# (e.g. [[canny_image_1, pose_image_1], [canny_image_2, pose_image_2]])
elif any(isinstance(i, list) for i in image):
raise ValueError("A single batch of multiple conditionings are supported at the moment.")
elif len(image) != num_controlnet:
raise ValueError(
f"For multiple controlnets: `image` must have the same length as the number of controlnets, but got {len(image)} images and {num_controlnet} ControlNets."
)
for image_ in image:
self.check_image(image_, prompt, prompt_embeds)
else:
assert False
# Check `controlnet_conditioning_scale`
if num_controlnet == 1:
if not isinstance(controlnet_conditioning_scale, float):
raise TypeError("For single controlnet: `controlnet_conditioning_scale` must be type `float`.")
elif num_controlnet > 1:
if isinstance(controlnet_conditioning_scale, list):
if any(isinstance(i, list) for i in controlnet_conditioning_scale):
raise ValueError("A single batch of multiple conditionings are supported at the moment.")
elif (
isinstance(controlnet_conditioning_scale, list)
and len(controlnet_conditioning_scale) != num_controlnet
):
raise ValueError(
"For multiple controlnets: When `controlnet_conditioning_scale` is specified as `list`, it must have"
" the same length as the number of controlnets"
)
else:
assert False
if len(control_guidance_start) != len(control_guidance_end):
raise ValueError(
f"`control_guidance_start` has {len(control_guidance_start)} elements, but `control_guidance_end` has {len(control_guidance_end)} elements. Make sure to provide the same number of elements to each list."
)
if num_controlnet > 1:
if len(control_guidance_start) != num_controlnet:
raise ValueError(
f"`control_guidance_start`: {control_guidance_start} has {len(control_guidance_start)} elements but there are {num_controlnet} controlnets available. Make sure to provide {num_controlnet}."
)
for start, end in zip(control_guidance_start, control_guidance_end):
if start >= end:
raise ValueError(
f"control guidance start: {start} cannot be larger or equal to control guidance end: {end}."
)
if start < 0.0:
raise ValueError(f"control guidance start: {start} can't be smaller than 0.")
if end > 1.0:
raise ValueError(f"control guidance end: {end} can't be larger than 1.0.")
# Copied from diffusers.pipelines.controlnet.pipeline_controlnet.StableDiffusionControlNetPipeline.check_image
def check_image(self, image, prompt, prompt_embeds):
image_is_pil = isinstance(image, PIL.Image.Image)
image_is_tensor = isinstance(image, torch.Tensor)
image_is_np = isinstance(image, np.ndarray)
image_is_pil_list = isinstance(image, list) and isinstance(image[0], PIL.Image.Image)
image_is_tensor_list = isinstance(image, list) and isinstance(image[0], torch.Tensor)
image_is_np_list = isinstance(image, list) and isinstance(image[0], np.ndarray)
if (
not image_is_pil
and not image_is_tensor
and not image_is_np
and not image_is_pil_list
and not image_is_tensor_list
and not image_is_np_list
):
raise TypeError(
f"image must be passed and be one of PIL image, numpy array, torch tensor, list of PIL images, list of numpy arrays or list of torch tensors, but is {type(image)}"
)
if image_is_pil:
image_batch_size = 1
else:
image_batch_size = len(image)
if prompt is not None and isinstance(prompt, str):
prompt_batch_size = 1
elif prompt is not None and isinstance(prompt, list):
prompt_batch_size = len(prompt)
elif prompt_embeds is not None:
prompt_batch_size = prompt_embeds.shape[0]
if image_batch_size != 1 and image_batch_size != prompt_batch_size:
raise ValueError(
f"If image batch size is not 1, image batch size must be same as prompt batch size. image batch size: {image_batch_size}, prompt batch size: {prompt_batch_size}"
)
# Copied from diffusers.pipelines.controlnet.pipeline_controlnet.StableDiffusionControlNetPipeline.prepare_image
def prepare_control_image(
self,
image,
width,
height,
batch_size,
num_images_per_prompt,
device,
dtype,
do_classifier_free_guidance=False,
guess_mode=False,
):
image = self.control_image_processor.preprocess(image, height=height, width=width).to(dtype=torch.float32)
image_batch_size = image.shape[0]
if image_batch_size == 1:
repeat_by = batch_size
else:
# image batch size is the same as prompt batch size
repeat_by = num_images_per_prompt
image = image.repeat_interleave(repeat_by, dim=0)
image = image.to(device=device, dtype=dtype)
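        # duplicate the conditioning image so it covers both the unconditional and conditional CFG passes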
if do_classifier_free_guidance and not guess_mode:
image = torch.cat([image] * 2)
return image
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.StableDiffusionImg2ImgPipeline.get_timesteps
def get_timesteps(self, num_inference_steps, strength, device):
# get the original timestep using init_timestep
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
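        # e.g. num_inference_steps=20 and strength=0.75 give init_timestep=15 and t_start=5, so 15 denoising steps run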
timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
return timesteps, num_inference_steps - t_start
def prepare_latents(self, image, timestep, batch_size, num_images_per_prompt, dtype, device, generator=None):
if not isinstance(image, (torch.Tensor, PIL.Image.Image, list)):
raise ValueError(
f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image` or list but is {type(image)}"
)
image = image.to(device=device, dtype=dtype)
batch_size = batch_size * num_images_per_prompt
if image.shape[1] == 4:
init_latents = image
else:
_image = image.cpu().detach().numpy()
init_latents = self.vae_encoder(sample=_image)[0]
init_latents = torch.from_numpy(init_latents).to(device=device, dtype=dtype)
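            # 0.18215 is the Stable Diffusion VAE scaling factor applied to latents at training time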
init_latents = 0.18215 * init_latents
if batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] == 0:
# expand init_latents for batch_size
deprecation_message = (
f"You have passed {batch_size} text prompts (`prompt`), but only {init_latents.shape[0]} initial"
" images (`image`). Initial images are now duplicating to match the number of text prompts. Note"
" that this behavior is deprecated and will be removed in a version 1.0.0. Please make sure to update"
" your script to pass as many initial images as text prompts to suppress this warning."
)
deprecate("len(prompt) != len(image)", "1.0.0", deprecation_message, standard_warn=False)
additional_image_per_prompt = batch_size // init_latents.shape[0]
init_latents = torch.cat([init_latents] * additional_image_per_prompt, dim=0)
elif batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] != 0:
raise ValueError(
f"Cannot duplicate `image` of batch size {init_latents.shape[0]} to {batch_size} text prompts."
)
else:
init_latents = torch.cat([init_latents], dim=0)
shape = init_latents.shape
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
# get latents
init_latents = self.scheduler.add_noise(init_latents, noise, timestep)
latents = init_latents
return latents
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
num_controlnet: int,
fp16: bool = True,
prompt: Union[str, List[str]] = None,
image: Union[
torch.FloatTensor,
PIL.Image.Image,
np.ndarray,
List[torch.FloatTensor],
List[PIL.Image.Image],
List[np.ndarray],
] = None,
control_image: Union[
torch.FloatTensor,
PIL.Image.Image,
np.ndarray,
List[torch.FloatTensor],
List[PIL.Image.Image],
List[np.ndarray],
] = None,
height: Optional[int] = None,
width: Optional[int] = None,
strength: float = 0.8,
num_inference_steps: int = 50,
guidance_scale: float = 7.5,
negative_prompt: Optional[Union[str, List[str]]] = None,
num_images_per_prompt: Optional[int] = 1,
eta: float = 0.0,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[torch.FloatTensor] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
callback_steps: int = 1,
cross_attention_kwargs: Optional[Dict[str, Any]] = None,
controlnet_conditioning_scale: Union[float, List[float]] = 0.8,
guess_mode: bool = False,
control_guidance_start: Union[float, List[float]] = 0.0,
control_guidance_end: Union[float, List[float]] = 1.0,
):
r"""
Function invoked when calling the pipeline for generation.
Args:
prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, `List[np.ndarray]`,:
`List[List[torch.FloatTensor]]`, `List[List[np.ndarray]]` or `List[List[PIL.Image.Image]]`):
                The initial image to be used as the starting point for the image generation process. Image latents can
                also be passed as `image`; if latents are passed directly, they will not be encoded again.
control_image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, `List[np.ndarray]`,:
`List[List[torch.FloatTensor]]`, `List[List[np.ndarray]]` or `List[List[PIL.Image.Image]]`):
                The ControlNet input condition. ControlNet uses this input condition to generate guidance for the
                UNet. If the type is specified as `torch.FloatTensor`, it is passed to ControlNet as is.
                `PIL.Image.Image` can also be accepted as an image. The dimensions of the output image default to
                `image`'s dimensions. If height and/or width are passed, `image` is resized accordingly. If multiple
                ControlNets are specified in init, images must be passed as a list such that each element of the list
                can be correctly batched for input to a single ControlNet.
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                1`. A higher guidance scale encourages the model to generate images closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
[`schedulers.DDIMScheduler`], will be ignored for others.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple.
callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be
called at every step.
cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            controlnet_conditioning_scale (`float` or `List[float]`, *optional*, defaults to 0.8):
                The outputs of the ControlNet are multiplied by `controlnet_conditioning_scale` before they are added
                to the residual in the original UNet. If multiple ControlNets are specified in init, you can set the
                corresponding scale as a list. Note that, by default, a smaller conditioning scale is used here than
                in [`~StableDiffusionControlNetPipeline.__call__`].
guess_mode (`bool`, *optional*, defaults to `False`):
                In this mode, the ControlNet encoder will try its best to recognize the content of the input image even
                if you remove all prompts. A `guidance_scale` between 3.0 and 5.0 is recommended.
control_guidance_start (`float` or `List[float]`, *optional*, defaults to 0.0):
The percentage of total steps at which the controlnet starts applying.
control_guidance_end (`float` or `List[float]`, *optional*, defaults to 1.0):
The percentage of total steps at which the controlnet stops applying.
Examples:
Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
            [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple`.
When returning a tuple, the first element is a list with the generated images, and the second element is a
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
(nsfw) content, according to the `safety_checker`.
"""
if fp16:
torch_dtype = torch.float16
np_dtype = np.float16
else:
torch_dtype = torch.float32
np_dtype = np.float32
# align format for control guidance
if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
control_guidance_start = len(control_guidance_end) * [control_guidance_start]
elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
control_guidance_end = len(control_guidance_start) * [control_guidance_end]
elif not isinstance(control_guidance_start, list) and not isinstance(control_guidance_end, list):
mult = num_controlnet
control_guidance_start, control_guidance_end = mult * [control_guidance_start], mult * [
control_guidance_end
]
# 1. Check inputs. Raise error if not correct
self.check_inputs(
num_controlnet,
prompt,
control_image,
callback_steps,
negative_prompt,
prompt_embeds,
negative_prompt_embeds,
controlnet_conditioning_scale,
control_guidance_start,
control_guidance_end,
)
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
device = self._execution_device
# here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
# of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
# corresponds to doing no classifier free guidance.
do_classifier_free_guidance = guidance_scale > 1.0
if num_controlnet > 1 and isinstance(controlnet_conditioning_scale, float):
controlnet_conditioning_scale = [controlnet_conditioning_scale] * num_controlnet
# 3. Encode input prompt
prompt_embeds = self._encode_prompt(
prompt,
num_images_per_prompt,
do_classifier_free_guidance,
negative_prompt,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
)
# 4. Prepare image
image = self.image_processor.preprocess(image).to(dtype=torch.float32)
# 5. Prepare controlnet_conditioning_image
if num_controlnet == 1:
control_image = self.prepare_control_image(
image=control_image,
width=width,
height=height,
batch_size=batch_size * num_images_per_prompt,
num_images_per_prompt=num_images_per_prompt,
device=device,
dtype=torch_dtype,
do_classifier_free_guidance=do_classifier_free_guidance,
guess_mode=guess_mode,
)
elif num_controlnet > 1:
control_images = []
for control_image_ in control_image:
control_image_ = self.prepare_control_image(
image=control_image_,
width=width,
height=height,
batch_size=batch_size * num_images_per_prompt,
num_images_per_prompt=num_images_per_prompt,
device=device,
dtype=torch_dtype,
do_classifier_free_guidance=do_classifier_free_guidance,
guess_mode=guess_mode,
)
control_images.append(control_image_)
control_image = control_images
else:
assert False
        # 6. Prepare timesteps
self.scheduler.set_timesteps(num_inference_steps, device=device)
timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, strength, device)
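        # the first retained (noisiest) timestep is used to noise the encoded init image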
latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
        # 7. Prepare latent variables
latents = self.prepare_latents(
image,
latent_timestep,
batch_size,
num_images_per_prompt,
torch_dtype,
device,
generator,
)
        # 8. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
        # 8.1 Create tensor stating which controlnets to keep
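        # keeps[j] is 1.0 while the current step lies inside [control_guidance_start[j], control_guidance_end[j]], else 0.0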
controlnet_keep = []
for i in range(len(timesteps)):
keeps = [
1.0 - float(i / len(timesteps) < s or (i + 1) / len(timesteps) > e)
for s, e in zip(control_guidance_start, control_guidance_end)
]
controlnet_keep.append(keeps[0] if num_controlnet == 1 else keeps)
        # 9. Denoising loop
num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
# expand the latents if we are doing classifier free guidance
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
if isinstance(controlnet_keep[i], list):
cond_scale = [c * s for c, s in zip(controlnet_conditioning_scale, controlnet_keep[i])]
else:
controlnet_cond_scale = controlnet_conditioning_scale
if isinstance(controlnet_cond_scale, list):
controlnet_cond_scale = controlnet_cond_scale[0]
cond_scale = controlnet_cond_scale * controlnet_keep[i]
# predict the noise residual
_latent_model_input = latent_model_input.cpu().detach().numpy()
_prompt_embeds = np.array(prompt_embeds, dtype=np_dtype)
_t = np.array([t.cpu().detach().numpy()], dtype=np_dtype)
if num_controlnet == 1:
control_images = np.array([control_image], dtype=np_dtype)
else:
control_images = []
for _control_img in control_image:
_control_img = _control_img.cpu().detach().numpy()
control_images.append(_control_img)
control_images = np.array(control_images, dtype=np_dtype)
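                # one conditioning scale per ControlNet, reshaped to (num_controlnet, 1) for the ONNX UNet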
control_scales = np.array(cond_scale, dtype=np_dtype)
control_scales = np.resize(control_scales, (num_controlnet, 1))
noise_pred = self.unet(
sample=_latent_model_input,
timestep=_t,
encoder_hidden_states=_prompt_embeds,
controlnet_conds=control_images,
conditioning_scales=control_scales,
)[0]
noise_pred = torch.from_numpy(noise_pred).to(device)
# perform guidance
if do_classifier_free_guidance:
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
# compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if callback is not None and i % callback_steps == 0:
callback(i, t, latents)
if not output_type == "latent":
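            # undo the VAE scaling factor (0.18215) before decoding latents to pixels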
_latents = latents.cpu().detach().numpy() / 0.18215
_latents = np.array(_latents, dtype=np_dtype)
image = self.vae_decoder(latent_sample=_latents)[0]
image = torch.from_numpy(image).to(device, dtype=torch.float32)
has_nsfw_concept = None
else:
image = latents
has_nsfw_concept = None
if has_nsfw_concept is None:
do_denormalize = [True] * image.shape[0]
else:
do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
if not return_dict:
return (image, has_nsfw_concept)
return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--sd_model",
type=str,
required=True,
help="Path to the `diffusers` checkpoint to convert (either a local directory or on the Hub).",
)
parser.add_argument(
"--onnx_model_dir",
type=str,
required=True,
help="Path to the ONNX directory",
)
parser.add_argument("--qr_img_path", type=str, required=True, help="Path to the qr code image")
args = parser.parse_args()
qr_image = Image.open(args.qr_img_path)
qr_image = qr_image.resize((512, 512))
# init stable diffusion pipeline
pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(args.sd_model)
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config)
provider = ["CUDAExecutionProvider", "CPUExecutionProvider"]
onnx_pipeline = OnnxStableDiffusionControlNetImg2ImgPipeline(
vae_encoder=OnnxRuntimeModel.from_pretrained(
os.path.join(args.onnx_model_dir, "vae_encoder"), provider=provider
),
vae_decoder=OnnxRuntimeModel.from_pretrained(
os.path.join(args.onnx_model_dir, "vae_decoder"), provider=provider
),
text_encoder=OnnxRuntimeModel.from_pretrained(
os.path.join(args.onnx_model_dir, "text_encoder"), provider=provider
),
tokenizer=pipeline.tokenizer,
unet=OnnxRuntimeModel.from_pretrained(os.path.join(args.onnx_model_dir, "unet"), provider=provider),
scheduler=pipeline.scheduler,
)
onnx_pipeline = onnx_pipeline.to("cuda")
prompt = "a cute cat fly to the moon"
negative_prompt = "paintings, sketches, worst quality, low quality, normal quality, lowres, normal quality, monochrome, grayscale, skin spots, acnes, skin blemishes, age spot, glans, nsfw, nipples, necklace, worst quality, low quality, watermark, username, signature, multiple breasts, lowres, bad anatomy, bad hands, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, bad feet, single color, ugly, duplicate, morbid, mutilated, tranny, trans, trannsexual, hermaphrodite, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, ugly, blurry, bad anatomy, bad proportions, extra limbs, disfigured, bad anatomy, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, mutated hands, fused fingers, too many fingers, long neck, bad body perspect"
for i in range(10):
start_time = time.time()
image = onnx_pipeline(
num_controlnet=2,
prompt=prompt,
negative_prompt=negative_prompt,
image=qr_image,
control_image=[qr_image, qr_image],
width=512,
height=512,
strength=0.75,
num_inference_steps=20,
num_images_per_prompt=1,
controlnet_conditioning_scale=[0.8, 0.8],
control_guidance_start=[0.3, 0.3],
control_guidance_end=[0.9, 0.9],
).images[0]
print(time.time() - start_time)
image.save("output_qr_code.png")

File diff suppressed because it is too large

View File

@@ -47,7 +47,7 @@ from diffusers import (
FlaxStableDiffusionControlNetPipeline,
FlaxUNet2DConditionModel,
)
from diffusers.utils import check_min_version, is_wandb_available
from diffusers.utils import check_min_version, is_wandb_available, make_image_grid
# To prevent an error that occurs when there are abnormally large compressed data chunks in the PNG image
@@ -64,18 +64,6 @@ check_min_version("0.20.0.dev0")
logger = logging.getLogger(__name__)
def image_grid(imgs, rows, cols):
assert len(imgs) == rows * cols
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
grid_w, grid_h = grid.size
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
def log_validation(pipeline, pipeline_params, controlnet_params, tokenizer, args, rng, weight_dtype):
logger.info("Running validation...")
@@ -154,7 +142,7 @@ def save_model_card(repo_id: str, image_logs=None, base_model=str, repo_folder=N
validation_image.save(os.path.join(repo_folder, "image_control.png"))
img_str += f"prompt: {validation_prompt}\n"
images = [validation_image] + images
image_grid(images, 1, len(images)).save(os.path.join(repo_folder, f"images_{i}.png"))
make_image_grid(images, 1, len(images)).save(os.path.join(repo_folder, f"images_{i}.png"))
img_str += f"![images_{i})](./images_{i}.png)\n"
yaml = f"""

View File

@@ -50,7 +50,7 @@ from diffusers import (
UniPCMultistepScheduler,
)
from diffusers.optimization import get_scheduler
from diffusers.utils import check_min_version, is_wandb_available
from diffusers.utils import check_min_version, is_wandb_available, make_image_grid
from diffusers.utils.import_utils import is_xformers_available
@@ -63,17 +63,6 @@ check_min_version("0.20.0.dev0")
logger = get_logger(__name__)
def image_grid(imgs, rows, cols):
assert len(imgs) == rows * cols
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
def log_validation(vae, unet, controlnet, args, accelerator, weight_dtype, step):
logger.info("Running validation... ")
@@ -205,7 +194,7 @@ def save_model_card(repo_id: str, image_logs=None, base_model=str, repo_folder=N
validation_image.save(os.path.join(repo_folder, "image_control.png"))
img_str += f"prompt: {validation_prompt}\n"
images = [validation_image] + images
image_grid(images, 1, len(images)).save(os.path.join(repo_folder, f"images_{i}.png"))
make_image_grid(images, 1, len(images)).save(os.path.join(repo_folder, f"images_{i}.png"))
img_str += f"![images_{i})](./images_{i}.png)\n"
yaml = f"""

View File

@@ -26,6 +26,7 @@ import warnings
from pathlib import Path
import numpy as np
import safetensors
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
@@ -296,14 +297,19 @@ class CustomDiffusionDataset(Dataset):
return example
def save_new_embed(text_encoder, modifier_token_id, accelerator, args, output_dir):
def save_new_embed(text_encoder, modifier_token_id, accelerator, args, output_dir, safe_serialization=True):
"""Saves the new token embeddings from the text encoder."""
logger.info("Saving embeddings")
learned_embeds = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight
for x, y in zip(modifier_token_id, args.modifier_token):
learned_embeds_dict = {}
learned_embeds_dict[y] = learned_embeds[x]
torch.save(learned_embeds_dict, f"{output_dir}/{y}.bin")
filename = f"{output_dir}/{y}.bin"
if safe_serialization:
safetensors.torch.save_file(learned_embeds_dict, filename, metadata={"format": "pt"})
else:
torch.save(learned_embeds_dict, filename)
def parse_args(input_args=None):
@@ -605,6 +611,11 @@ def parse_args(input_args=None):
action="store_true",
help="Dont apply augmentation during data augmentation when this flag is enabled.",
)
parser.add_argument(
"--no_safe_serialization",
action="store_true",
help="If specified save the checkpoint not in `safetensors` format, but in original PyTorch format instead.",
)
if input_args is not None:
args = parser.parse_args(input_args)
@@ -1244,8 +1255,15 @@ def main(args):
accelerator.wait_for_everyone()
if accelerator.is_main_process:
unet = unet.to(torch.float32)
unet.save_attn_procs(args.output_dir)
save_new_embed(text_encoder, modifier_token_id, accelerator, args, args.output_dir)
unet.save_attn_procs(args.output_dir, safe_serialization=not args.no_safe_serialization)
save_new_embed(
text_encoder,
modifier_token_id,
accelerator,
args,
args.output_dir,
safe_serialization=not args.no_safe_serialization,
)
# Final inference
# Load previous pipeline
@@ -1256,9 +1274,15 @@ def main(args):
pipeline = pipeline.to(accelerator.device)
# load attention processors
pipeline.unet.load_attn_procs(args.output_dir, weight_name="pytorch_custom_diffusion_weights.bin")
weight_name = (
"pytorch_custom_diffusion_weights.safetensors"
if not args.no_safe_serialization
else "pytorch_custom_diffusion_weights.bin"
)
pipeline.unet.load_attn_procs(args.output_dir, weight_name=weight_name)
for token in args.modifier_token:
pipeline.load_textual_inversion(args.output_dir, weight_name=f"{token}.bin")
token_weight_name = f"{token}.safetensors" if not args.no_safe_serialization else f"{token}.bin"
pipeline.load_textual_inversion(args.output_dir, weight_name=token_weight_name)
# run inference
if args.validation_prompt and args.num_validation_images > 0:

View File

@@ -1374,7 +1374,7 @@ def main(args):
pipeline = pipeline.to(accelerator.device)
# load attention processors
pipeline.load_lora_weights(args.output_dir, weight_name="pytorch_lora_weights.bin")
pipeline.load_lora_weights(args.output_dir, weight_name="pytorch_lora_weights.safetensors")
# run inference
images = []

View File

@@ -83,7 +83,8 @@ accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42
--seed=42 \
--push_to_hub
```
Additionally, we support performing validation inference to monitor training progress
@@ -104,7 +105,8 @@ accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
--validation_prompt="make the mountains snowy" \
--seed=42 \
--report_to=wandb
--report_to=wandb \
--push_to_hub
```
We recommend this type of validation as it can be useful for model debugging. Note that you need `wandb` installed to use this. You can install `wandb` by running `pip install wandb`.
@@ -131,7 +133,8 @@ accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py
--learning_rate=5e-05 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42
--seed=42 \
--push_to_hub
```
## Inference
@@ -190,4 +193,4 @@ If you're looking for some interesting ways to use the InstructPix2Pix training
## Stable Diffusion XL
We support fine-tuning of the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md).
There's an equivalent `train_instruct_pix2pix_sdxl.py` script for [Stable Diffusion XL](https://huggingface.co/papers/2307.01952). Please refer to the docs [here](./README_sdxl.md) to learn more.

View File

@@ -33,7 +33,7 @@ export DATASET_ID="fusing/instructpix2pix-1000-samples"
Now, we can launch training:
```bash
python train_instruct_pix2pix_sdxl.py \
accelerate launch train_instruct_pix2pix_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_ID \
--enable_xformers_memory_efficient_attention \
@@ -43,14 +43,15 @@ python train_instruct_pix2pix_sdxl.py \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--seed=42
--seed=42 \
--push_to_hub
```
Additionally, we support performing validation inference to monitor training progress
with Weights and Biases. You can enable this feature with `report_to="wandb"`:
```bash
python train_instruct_pix2pix_sdxl.py \
accelerate launch train_instruct_pix2pix_sdxl.py \
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--dataset_name=$DATASET_ID \
--use_ema \
@@ -64,7 +65,8 @@ python train_instruct_pix2pix_sdxl.py \
--seed=42 \
--val_image_url_or_path="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
--validation_prompt="make it in japan" \
--report_to=wandb
--report_to=wandb \
--push_to_hub
```
We recommend this type of validation as it can be useful for model debugging. Note that you need `wandb` installed to use this. You can install `wandb` by running `pip install wandb`.
@@ -79,7 +81,7 @@ python train_instruct_pix2pix_sdxl.py \
for running distributed training with `accelerate`. Here is an example command:
```bash
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py \
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix_sdxl.py \
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--dataset_name=$DATASET_ID \
--use_ema \
@@ -93,7 +95,8 @@ accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py
--seed=42 \
--val_image_url_or_path="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
--validation_prompt="make it in japan" \
--report_to=wandb
--report_to=wandb \
--push_to_hub
```
## Inference
@@ -155,7 +158,7 @@ We aim to understand the differences resulting from the use of SD-1.5 and SDXL-0
export MODEL_NAME="runwayml/stable-diffusion-v1-5" or "stabilityai/stable-diffusion-xl-base-0.9"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
CUDA_VISIBLE_DEVICES=1 python train_instruct_pix2pix.py \
accelerate launch train_instruct_pix2pix.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_ID \
--use_ema \
@@ -169,7 +172,8 @@ CUDA_VISIBLE_DEVICES=1 python train_instruct_pix2pix.py \
--seed=42 \
--val_image_url="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
--validation_prompt="make it in Japan" \
--report_to=wandb
--report_to=wandb \
--push_to_hub
```
We discovered that compared to training with SD-1.5 as the pretrained model, SDXL-0.9 results in a lower training loss value (SD-1.5 yields 0.0599, SDXL scores 0.0254). Moreover, from a visual perspective, the results obtained using SDXL demonstrated fewer artifacts and richer detail. Notably, SDXL starts to preserve the structure of the original image earlier on.

File diff suppressed because it is too large

View File

@@ -24,6 +24,7 @@ from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, DDPMScheduler, StableDiffusionPipeline, UNet2DConditionModel
from diffusers.optimization import get_scheduler
from diffusers.utils import make_image_grid
if version.parse(version.parse(PIL.__version__).base_version) >= version.parse("9.1.0"):
@@ -427,19 +428,6 @@ def freeze_params(params):
param.requires_grad = False
def image_grid(imgs, rows, cols):
if not len(imgs) == rows * cols:
raise ValueError("The specified number of rows and columns are not correct.")
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
grid_w, grid_h = grid.size
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
def generate_images(pipeline, prompt="", guidance_scale=7.5, num_inference_steps=50, num_images_per_prompt=1, seed=42):
generator = torch.Generator(pipeline.device).manual_seed(seed)
images = pipeline(
@@ -450,7 +438,7 @@ def generate_images(pipeline, prompt="", guidance_scale=7.5, num_inference_steps
num_images_per_prompt=num_images_per_prompt,
).images
_rows = int(math.sqrt(num_images_per_prompt))
grid = image_grid(images, rows=_rows, cols=num_images_per_prompt // _rows)
grid = make_image_grid(images, rows=_rows, cols=num_images_per_prompt // _rows)
return grid

View File

@@ -23,7 +23,7 @@ import tempfile
import unittest
from typing import List
import torch
import safetensors
from accelerate.utils import write_basic_config
from diffusers import DiffusionPipeline, UNet2DConditionModel
@@ -93,7 +93,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args, return_stdout=True)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.safetensors")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "scheduler", "scheduler_config.json")))
def test_textual_inversion(self):
@@ -144,7 +144,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.safetensors")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "scheduler", "scheduler_config.json")))
def test_dreambooth_if(self):
@@ -170,7 +170,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.safetensors")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "scheduler", "scheduler_config.json")))
def test_dreambooth_checkpointing(self):
@@ -272,10 +272,10 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
# make sure the state_dict has the correct naming in the parameters.
lora_state_dict = torch.load(os.path.join(tmpdir, "pytorch_lora_weights.bin"))
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
is_lora = all("lora" in k for k in lora_state_dict.keys())
self.assertTrue(is_lora)
@@ -305,10 +305,10 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
# check `text_encoder` is present at all.
lora_state_dict = torch.load(os.path.join(tmpdir, "pytorch_lora_weights.bin"))
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
keys = lora_state_dict.keys()
is_text_encoder_present = any(k.startswith("text_encoder") for k in keys)
self.assertTrue(is_text_encoder_present)
@@ -341,10 +341,10 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
# make sure the state_dict has the correct naming in the parameters.
lora_state_dict = torch.load(os.path.join(tmpdir, "pytorch_lora_weights.bin"))
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
is_lora = all("lora" in k for k in lora_state_dict.keys())
self.assertTrue(is_lora)
@@ -373,10 +373,10 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
# make sure the state_dict has the correct naming in the parameters.
lora_state_dict = torch.load(os.path.join(tmpdir, "pytorch_lora_weights.bin"))
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
is_lora = all("lora" in k for k in lora_state_dict.keys())
self.assertTrue(is_lora)
@@ -406,10 +406,10 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
# make sure the state_dict has the correct naming in the parameters.
lora_state_dict = torch.load(os.path.join(tmpdir, "pytorch_lora_weights.bin"))
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
is_lora = all("lora" in k for k in lora_state_dict.keys())
self.assertTrue(is_lora)
@@ -437,6 +437,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
--lr_scheduler constant
--lr_warmup_steps 0
--modifier_token <new1>
--no_safe_serialization
--output_dir {tmpdir}
""".split()
@@ -466,7 +467,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.safetensors")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "scheduler", "scheduler_config.json")))
def test_text_to_image_checkpointing(self):
@@ -757,6 +758,30 @@ class ExamplesTestsAccelerate(unittest.TestCase):
{"checkpoint-6", "checkpoint-8", "checkpoint-10"},
)
def test_text_to_image_sdxl(self):
with tempfile.TemporaryDirectory() as tmpdir:
test_args = f"""
examples/text_to_image/train_text_to_image_sdxl.py
--pretrained_model_name_or_path hf-internal-testing/tiny-stable-diffusion-xl-pipe
--dataset_name hf-internal-testing/dummy_image_text_data
--resolution 64
--center_crop
--random_flip
--train_batch_size 1
--gradient_accumulation_steps 1
--max_train_steps 2
--learning_rate 5.0e-04
--scale_lr
--lr_scheduler constant
--lr_warmup_steps 0
--output_dir {tmpdir}
""".split()
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "unet", "diffusion_pytorch_model.safetensors")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "scheduler", "scheduler_config.json")))
def test_text_to_image_lora_checkpointing_checkpoints_total_limit(self):
pretrained_model_name_or_path = "hf-internal-testing/tiny-stable-diffusion-pipe"
prompt = "a prompt"
@@ -1349,7 +1374,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "diffusion_pytorch_model.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "diffusion_pytorch_model.safetensors")))
def test_custom_diffusion_checkpointing_checkpoints_total_limit(self):
with tempfile.TemporaryDirectory() as tmpdir:
@@ -1366,6 +1391,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
--max_train_steps=6
--checkpoints_total_limit=2
--checkpointing_steps=2
--no_safe_serialization
""".split()
run_command(self._launch_args + test_args)
@@ -1389,6 +1415,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
--dataloader_num_workers=0
--max_train_steps=9
--checkpointing_steps=2
--no_safe_serialization
""".split()
run_command(self._launch_args + test_args)
@@ -1412,6 +1439,7 @@ class ExamplesTestsAccelerate(unittest.TestCase):
--checkpointing_steps=2
--resume_from_checkpoint=checkpoint-8
--checkpoints_total_limit=3
--no_safe_serialization
""".split()
run_command(self._launch_args + resume_run_args)
@@ -1440,10 +1468,10 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
# make sure the state_dict has the correct naming in the parameters.
lora_state_dict = torch.load(os.path.join(tmpdir, "pytorch_lora_weights.bin"))
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
is_lora = all("lora" in k for k in lora_state_dict.keys())
self.assertTrue(is_lora)
@@ -1467,10 +1495,10 @@ class ExamplesTestsAccelerate(unittest.TestCase):
run_command(self._launch_args + test_args)
# save_pretrained smoke test
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.bin")))
self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
# make sure the state_dict has the correct naming in the parameters.
lora_state_dict = torch.load(os.path.join(tmpdir, "pytorch_lora_weights.bin"))
lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
is_lora = all("lora" in k for k in lora_state_dict.keys())
self.assertTrue(is_lora)

View File

@@ -319,4 +319,5 @@ According to [this issue](https://github.com/huggingface/diffusers/issues/2234#i
## Stable Diffusion XL
We support fine-tuning of the UNet and Text Encoder shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with LoRA via the `train_text_to_image_lora_xl.py` script. Please refer to the docs [here](./README_sdxl.md).
* We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md).
* We also support fine-tuning of the UNet and Text Encoder shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with LoRA via the `train_text_to_image_lora_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md).

View File

@@ -1,17 +1,8 @@
# LoRA training example for Stable Diffusion XL (SDXL)
# Stable Diffusion XL text-to-image fine-tuning
Low-Rank Adaptation of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*.
The `train_text_to_image_sdxl.py` script shows how to fine-tune Stable Diffusion XL (SDXL) on your own dataset.
In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and **only** training those newly added weights. This has a couple of advantages:
- Previous pretrained weights are kept frozen so that the model is not prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114).
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
- LoRA attention layers allow controlling the extent to which the model is adapted toward new training images via a `scale` parameter.
[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.
With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption pair dataset
on consumer GPUs like Tesla T4, Tesla V100.
🚨 This script is experimental. The script fine-tunes the whole model, and oftentimes the model overfits and runs into issues like catastrophic forgetting. It's recommended to try different hyperparameters to get the best result on your dataset. 🚨
## Running locally with PyTorch
@@ -57,6 +48,69 @@ When running `accelerate config`, if we specify torch compile mode to True there
### Training
```bash
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch train_text_to_image_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path=$VAE \
--dataset_name=$DATASET_NAME \
--enable_xformers_memory_efficient_attention \
--resolution=512 --center_crop --random_flip \
--proportion_empty_prompts=0.2 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=10000 \
--use_8bit_adam \
--learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--report_to="wandb" \
--validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
--checkpointing_steps=5000 \
--output_dir="sdxl-pokemon-model" \
--push_to_hub
```
**Notes**:
* The `train_text_to_image_sdxl.py` script pre-computes text embeddings and the VAE encodings and keeps them in memory. While this might not be a problem for smaller datasets like [`lambdalabs/pokemon-blip-captions`](https://hf.co/datasets/lambdalabs/pokemon-blip-captions), it can definitely lead to memory problems when the script is used on a larger dataset. In that case, you would want to serialize these pre-computed representations to disk separately and load them during the fine-tuning process. Refer to [this PR](https://github.com/huggingface/diffusers/pull/4505) for a more in-depth discussion.
* The training script is compute-intensive and may not run on a consumer GPU like Tesla T4.
* The training command shown above performs intermediate quality validation in between the training epochs and logs the results to Weights and Biases. `--report_to`, `--validation_prompt`, and `--validation_epochs` are the relevant CLI arguments here.
### Inference
```python
from diffusers import DiffusionPipeline
import torch
model_path = "you-model-id-goes-here" # <-- change this
pipe = DiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")
prompt = "A pokemon with green eyes and red legs."
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("pokemon.png")
```
## LoRA training example for Stable Diffusion XL (SDXL)
Low-Rank Adaptation of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*.
In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and **only** training those newly added weights. This has a couple of advantages:
- Previous pretrained weights are kept frozen so that the model is not prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114).
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
- LoRA attention layers allow you to control the extent to which the model is adapted toward new training images via a `scale` parameter.
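To make the rank-decomposition idea concrete, here is a minimal sketch of a LoRA-style linear layer (an illustration only, not the `diffusers` implementation; the class name and defaults are made up):
```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op so training begins from the base model
        self.scale = scale  # controls how strongly the adaptation is applied

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


layer = LoRALinear(nn.Linear(768, 768), rank=4)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the low-rank weights train
```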
[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.
With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption pair dataset
on consumer GPUs like Tesla T4, Tesla V100.
### Training
First, you need to set up your development environment as explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion XL 1.0-base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and the [Pokémon dataset](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions).
**___Note: It is quite useful to monitor the training progress by regularly generating sample images during training. [Weights and Biases](https://docs.wandb.ai/quickstart) is a nice solution to easily see the generated images during training. All you need to do is run `pip install wandb` before training to automatically log images.___**
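As a rough illustration of what that logging amounts to (a sketch only; the training script wires this up for you via `--report_to=wandb`, and the project name, step, and placeholder image below are made up):
```python
import wandb
from PIL import Image

# Assumed inputs: a list of validation images from a pipeline run at some training step.
images = [Image.new("RGB", (256, 256))]  # placeholder for generated samples
global_step = 100

wandb.init(project="sdxl-pokemon-lora")
wandb.log(
    {"validation": [wandb.Image(img, caption=f"step {global_step}") for img in images]},
    step=global_step,
)
```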

View File

@@ -35,7 +35,6 @@ from accelerate.utils import ProjectConfiguration, set_seed
from datasets import load_dataset
from huggingface_hub import create_repo, upload_folder
from packaging import version
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm
from transformers import CLIPTextModel, CLIPTokenizer
@@ -45,7 +44,7 @@ import diffusers
from diffusers import AutoencoderKL, DDPMScheduler, StableDiffusionPipeline, UNet2DConditionModel
from diffusers.optimization import get_scheduler
from diffusers.training_utils import EMAModel
from diffusers.utils import check_min_version, deprecate, is_wandb_available
from diffusers.utils import check_min_version, deprecate, is_wandb_available, make_image_grid
from diffusers.utils.import_utils import is_xformers_available
@@ -63,17 +62,6 @@ DATASET_NAME_MAPPING = {
}
def make_image_grid(imgs, rows, cols):
assert len(imgs) == rows * cols
w, h = imgs[0].size
grid = Image.new("RGB", size=(cols * w, rows * h))
for i, img in enumerate(imgs):
grid.paste(img, box=(i % cols * w, i // cols * h))
return grid
def save_model_card(
args,
repo_id: str,

View File

@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fine-tuning script for Stable Diffusion for text2image with support for LoRA."""
"""Fine-tuning script for Stable Diffusion XL for text2image with support for LoRA."""
import argparse
import itertools
@@ -722,13 +722,13 @@ def main(args):
else:
raise ValueError(f"unexpected save model: {model.__class__}")
lora_state_dict, network_alpha = LoraLoaderMixin.lora_state_dict(input_dir)
LoraLoaderMixin.load_lora_into_unet(lora_state_dict, network_alpha=network_alpha, unet=unet_)
lora_state_dict, network_alphas = LoraLoaderMixin.lora_state_dict(input_dir)
LoraLoaderMixin.load_lora_into_unet(lora_state_dict, network_alphas=network_alphas, unet=unet_)
LoraLoaderMixin.load_lora_into_text_encoder(
lora_state_dict, network_alpha=network_alpha, text_encoder=text_encoder_one_
lora_state_dict, network_alphas=network_alphas, text_encoder=text_encoder_one_
)
LoraLoaderMixin.load_lora_into_text_encoder(
lora_state_dict, network_alpha=network_alpha, text_encoder=text_encoder_two_
lora_state_dict, network_alphas=network_alphas, text_encoder=text_encoder_two_
)
accelerator.register_save_state_pre_hook(save_model_hook)
@@ -1051,7 +1051,6 @@ def main(args):
text_input_ids_list=[batch["input_ids_one"], batch["input_ids_two"]],
)
unet_added_conditions.update({"text_embeds": pooled_prompt_embeds})
prompt_embeds = prompt_embeds
model_pred = unet(
noisy_model_input, timesteps, prompt_embeds, added_cond_kwargs=unet_added_conditions
).sample

File diff suppressed because it is too large

View File

@@ -24,6 +24,7 @@ from pathlib import Path
import numpy as np
import PIL
import safetensors
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
@@ -157,7 +158,7 @@ def log_validation(text_encoder, tokenizer, unet, vae, args, accelerator, weight
return images
def save_progress(text_encoder, placeholder_token_ids, accelerator, args, save_path):
def save_progress(text_encoder, placeholder_token_ids, accelerator, args, save_path, safe_serialization=True):
logger.info("Saving embeddings")
learned_embeds = (
accelerator.unwrap_model(text_encoder)
@@ -165,7 +166,11 @@ def save_progress(text_encoder, placeholder_token_ids, accelerator, args, save_p
.weight[min(placeholder_token_ids) : max(placeholder_token_ids) + 1]
)
learned_embeds_dict = {args.placeholder_token: learned_embeds.detach().cpu()}
torch.save(learned_embeds_dict, save_path)
if safe_serialization:
safetensors.torch.save_file(learned_embeds_dict, save_path, metadata={"format": "pt"})
else:
torch.save(learned_embeds_dict, save_path)
def parse_args():
@@ -409,6 +414,11 @@ def parse_args():
parser.add_argument(
"--enable_xformers_memory_efficient_attention", action="store_true", help="Whether or not to use xformers."
)
parser.add_argument(
"--no_safe_serialization",
action="store_true",
help="If specified save the checkpoint not in `safetensors` format, but in original PyTorch format instead.",
)
args = parser.parse_args()
env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
@@ -878,7 +888,14 @@ def main():
global_step += 1
if global_step % args.save_steps == 0:
save_path = os.path.join(args.output_dir, f"learned_embeds-steps-{global_step}.bin")
save_progress(text_encoder, placeholder_token_ids, accelerator, args, save_path)
save_progress(
text_encoder,
placeholder_token_ids,
accelerator,
args,
save_path,
safe_serialization=not args.no_safe_serialization,
)
if accelerator.is_main_process:
if global_step % args.checkpointing_steps == 0:
@@ -936,7 +953,14 @@ def main():
pipeline.save_pretrained(args.output_dir)
# Save the newly trained embeddings
save_path = os.path.join(args.output_dir, "learned_embeds.bin")
save_progress(text_encoder, placeholder_token_ids, accelerator, args, save_path)
save_progress(
text_encoder,
placeholder_token_ids,
accelerator,
args,
save_path,
safe_serialization=not args.no_safe_serialization,
)
if args.push_to_hub:
save_model_card(

View File

@@ -38,7 +38,7 @@ from diffusers import (
PNDMScheduler,
UNet2DConditionModel,
)
from diffusers.utils import is_omegaconf_available, is_safetensors_available
from diffusers.utils import is_omegaconf_available
from diffusers.utils.import_utils import BACKENDS_MAPPING
@@ -824,9 +824,6 @@ def load_pipeline_from_original_audioldm_ckpt(
from omegaconf import OmegaConf
if from_safetensors:
if not is_safetensors_available():
raise ValueError(BACKENDS_MAPPING["safetensors"][1])
from safetensors import safe_open
checkpoint = {}

View File

@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Conversion script for stable diffusion checkpoints which _only_ contain a contrlnet. """
""" Conversion script for stable diffusion checkpoints which _only_ contain a controlnet. """
import argparse

View File

@@ -147,6 +147,7 @@ if __name__ == "__main__":
if args.pipeline_class_name is not None:
library = importlib.import_module("diffusers")
class_obj = getattr(library, args.pipeline_class_name)
pipeline_class = class_obj
else:
pipeline_class = None

View File

@@ -0,0 +1,505 @@
import argparse
import os
import shutil
from pathlib import Path
import onnx
import onnx_graphsurgeon as gs
import torch
from onnx import shape_inference
from packaging import version
from polygraphy.backend.onnx.loader import fold_constants
from torch.onnx import export
from diffusers import (
ControlNetModel,
StableDiffusionControlNetImg2ImgPipeline,
)
from diffusers.models.attention_processor import AttnProcessor
from diffusers.pipelines.controlnet.pipeline_controlnet_sd_xl import StableDiffusionXLControlNetPipeline
is_torch_less_than_1_11 = version.parse(version.parse(torch.__version__).base_version) < version.parse("1.11")
is_torch_2_0_1 = version.parse(version.parse(torch.__version__).base_version) == version.parse("2.0.1")
class Optimizer:
def __init__(self, onnx_graph, verbose=False):
self.graph = gs.import_onnx(onnx_graph)
self.verbose = verbose
def info(self, prefix):
if self.verbose:
print(
f"{prefix} .. {len(self.graph.nodes)} nodes, {len(self.graph.tensors().keys())} tensors, {len(self.graph.inputs)} inputs, {len(self.graph.outputs)} outputs"
)
def cleanup(self, return_onnx=False):
self.graph.cleanup().toposort()
if return_onnx:
return gs.export_onnx(self.graph)
def select_outputs(self, keep, names=None):
self.graph.outputs = [self.graph.outputs[o] for o in keep]
if names:
for i, name in enumerate(names):
self.graph.outputs[i].name = name
def fold_constants(self, return_onnx=False):
onnx_graph = fold_constants(gs.export_onnx(self.graph), allow_onnxruntime_shape_inference=True)
self.graph = gs.import_onnx(onnx_graph)
if return_onnx:
return onnx_graph
def infer_shapes(self, return_onnx=False):
onnx_graph = gs.export_onnx(self.graph)
if onnx_graph.ByteSize() > 2147483648:
raise TypeError("ERROR: model size exceeds supported 2GB limit")
else:
onnx_graph = shape_inference.infer_shapes(onnx_graph)
self.graph = gs.import_onnx(onnx_graph)
if return_onnx:
return onnx_graph
def optimize(onnx_graph, name, verbose):
opt = Optimizer(onnx_graph, verbose=verbose)
opt.info(name + ": original")
opt.cleanup()
opt.info(name + ": cleanup")
opt.fold_constants()
opt.info(name + ": fold constants")
# opt.infer_shapes()
# opt.info(name + ': shape inference')
onnx_opt_graph = opt.cleanup(return_onnx=True)
opt.info(name + ": finished")
return onnx_opt_graph
class UNet2DConditionControlNetModel(torch.nn.Module):
def __init__(
self,
unet,
controlnets: ControlNetModel,
):
super().__init__()
self.unet = unet
self.controlnets = controlnets
def forward(
self,
sample,
timestep,
encoder_hidden_states,
controlnet_conds,
controlnet_scales,
):
for i, (controlnet_cond, conditioning_scale, controlnet) in enumerate(
zip(controlnet_conds, controlnet_scales, self.controlnets)
):
down_samples, mid_sample = controlnet(
sample,
timestep,
encoder_hidden_states=encoder_hidden_states,
controlnet_cond=controlnet_cond,
conditioning_scale=conditioning_scale,
return_dict=False,
)
# merge samples
if i == 0:
down_block_res_samples, mid_block_res_sample = down_samples, mid_sample
else:
down_block_res_samples = [
samples_prev + samples_curr
for samples_prev, samples_curr in zip(down_block_res_samples, down_samples)
]
mid_block_res_sample += mid_sample
noise_pred = self.unet(
sample,
timestep,
encoder_hidden_states=encoder_hidden_states,
down_block_additional_residuals=down_block_res_samples,
mid_block_additional_residual=mid_block_res_sample,
return_dict=False,
)[0]
return noise_pred
class UNet2DConditionXLControlNetModel(torch.nn.Module):
def __init__(
self,
unet,
controlnets: ControlNetModel,
):
super().__init__()
self.unet = unet
self.controlnets = controlnets
def forward(
self,
sample,
timestep,
encoder_hidden_states,
controlnet_conds,
controlnet_scales,
text_embeds,
time_ids,
):
added_cond_kwargs = {"text_embeds": text_embeds, "time_ids": time_ids}
for i, (controlnet_cond, conditioning_scale, controlnet) in enumerate(
zip(controlnet_conds, controlnet_scales, self.controlnets)
):
down_samples, mid_sample = controlnet(
sample,
timestep,
encoder_hidden_states=encoder_hidden_states,
controlnet_cond=controlnet_cond,
conditioning_scale=conditioning_scale,
added_cond_kwargs=added_cond_kwargs,
return_dict=False,
)
# merge samples
if i == 0:
down_block_res_samples, mid_block_res_sample = down_samples, mid_sample
else:
down_block_res_samples = [
samples_prev + samples_curr
for samples_prev, samples_curr in zip(down_block_res_samples, down_samples)
]
mid_block_res_sample += mid_sample
noise_pred = self.unet(
sample,
timestep,
encoder_hidden_states=encoder_hidden_states,
down_block_additional_residuals=down_block_res_samples,
mid_block_additional_residual=mid_block_res_sample,
added_cond_kwargs=added_cond_kwargs,
return_dict=False,
)[0]
return noise_pred
def onnx_export(
model,
model_args: tuple,
output_path: Path,
ordered_input_names,
output_names,
dynamic_axes,
opset,
use_external_data_format=False,
):
output_path.parent.mkdir(parents=True, exist_ok=True)
# PyTorch deprecated the `enable_onnx_checker` and `use_external_data_format` arguments in v1.11,
# so we check the torch version for backwards compatibility
with torch.inference_mode(), torch.autocast("cuda"):
if is_torch_less_than_1_11:
export(
model,
model_args,
f=output_path.as_posix(),
input_names=ordered_input_names,
output_names=output_names,
dynamic_axes=dynamic_axes,
do_constant_folding=True,
use_external_data_format=use_external_data_format,
enable_onnx_checker=True,
opset_version=opset,
)
else:
export(
model,
model_args,
f=output_path.as_posix(),
input_names=ordered_input_names,
output_names=output_names,
dynamic_axes=dynamic_axes,
do_constant_folding=True,
opset_version=opset,
)
@torch.no_grad()
def convert_models(
model_path: str, controlnet_path: list, output_path: str, opset: int, fp16: bool = False, sd_xl: bool = False
):
"""
Function to convert models in stable diffusion controlnet pipeline into ONNX format
Example:
python convert_stable_diffusion_controlnet_to_onnx.py
--model_path danbrown/RevAnimated-v1-2-2
--controlnet_path lllyasviel/control_v11f1e_sd15_tile ioclab/brightness-controlnet
--output_path path-to-models-stable_diffusion/RevAnimated-v1-2-2
--fp16
Example for SD XL:
python convert_stable_diffusion_controlnet_to_onnx.py
--model_path stabilityai/stable-diffusion-xl-base-1.0
--controlnet_path SargeZT/sdxl-controlnet-seg
--output_path path-to-models-stable_diffusion/stable-diffusion-xl-base-1.0
--fp16
--sd_xl
Returns:
create 4 onnx models in output path
text_encoder/model.onnx
unet/model.onnx + unet/weights.pb
vae_encoder/model.onnx
vae_decoder/model.onnx
run test script in diffusers/examples/community
python test_onnx_controlnet.py
--sd_model danbrown/RevAnimated-v1-2-2
--onnx_model_dir path-to-models-stable_diffusion/RevAnimated-v1-2-2
--qr_img_path path-to-qr-code-image
"""
dtype = torch.float16 if fp16 else torch.float32
if fp16 and torch.cuda.is_available():
device = "cuda"
elif fp16 and not torch.cuda.is_available():
raise ValueError("`float16` model export is only supported on GPUs with CUDA")
else:
device = "cpu"
# init controlnet
controlnets = []
for path in controlnet_path:
controlnet = ControlNetModel.from_pretrained(path, torch_dtype=dtype).to(device)
if is_torch_2_0_1:
controlnet.set_attn_processor(AttnProcessor())
controlnets.append(controlnet)
if sd_xl:
if len(controlnets) == 1:
controlnet = controlnets[0]
else:
raise ValueError("MultiControlNet is not yet supported.")
pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
model_path, controlnet=controlnet, torch_dtype=dtype, variant="fp16", use_safetensors=True
).to(device)
else:
pipeline = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
model_path, controlnet=controlnets, torch_dtype=dtype
).to(device)
output_path = Path(output_path)
if is_torch_2_0_1:
pipeline.unet.set_attn_processor(AttnProcessor())
pipeline.vae.set_attn_processor(AttnProcessor())
# # TEXT ENCODER
num_tokens = pipeline.text_encoder.config.max_position_embeddings
text_hidden_size = pipeline.text_encoder.config.hidden_size
text_input = pipeline.tokenizer(
"A sample prompt",
padding="max_length",
max_length=pipeline.tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
onnx_export(
pipeline.text_encoder,
# casting to torch.int32 until the CLIP fix is released: https://github.com/huggingface/transformers/pull/18515/files
model_args=(text_input.input_ids.to(device=device, dtype=torch.int32)),
output_path=output_path / "text_encoder" / "model.onnx",
ordered_input_names=["input_ids"],
output_names=["last_hidden_state", "pooler_output"],
dynamic_axes={
"input_ids": {0: "batch", 1: "sequence"},
},
opset=opset,
)
del pipeline.text_encoder
# # UNET
if sd_xl:
controlnets = torch.nn.ModuleList(controlnets)
unet_controlnet = UNet2DConditionXLControlNetModel(pipeline.unet, controlnets)
unet_in_channels = pipeline.unet.config.in_channels
unet_sample_size = pipeline.unet.config.sample_size
text_hidden_size = 2048
img_size = 8 * unet_sample_size
unet_path = output_path / "unet" / "model.onnx"
onnx_export(
unet_controlnet,
model_args=(
torch.randn(2, unet_in_channels, unet_sample_size, unet_sample_size).to(device=device, dtype=dtype),
torch.tensor([1.0]).to(device=device, dtype=dtype),
torch.randn(2, num_tokens, text_hidden_size).to(device=device, dtype=dtype),
torch.randn(len(controlnets), 2, 3, img_size, img_size).to(device=device, dtype=dtype),
torch.randn(len(controlnets), 1).to(device=device, dtype=dtype),
torch.randn(2, 1280).to(device=device, dtype=dtype),
torch.rand(2, 6).to(device=device, dtype=dtype),
),
output_path=unet_path,
ordered_input_names=[
"sample",
"timestep",
"encoder_hidden_states",
"controlnet_conds",
"conditioning_scales",
"text_embeds",
"time_ids",
],
output_names=["noise_pred"], # has to be different from "sample" for correct tracing
dynamic_axes={
"sample": {0: "2B", 2: "H", 3: "W"},
"encoder_hidden_states": {0: "2B"},
"controlnet_conds": {1: "2B", 3: "8H", 4: "8W"},
"text_embeds": {0: "2B"},
"time_ids": {0: "2B"},
},
opset=opset,
use_external_data_format=True, # UNet is > 2GB, so the weights need to be split
)
unet_model_path = str(unet_path.absolute().as_posix())
unet_dir = os.path.dirname(unet_model_path)
# optimize onnx
shape_inference.infer_shapes_path(unet_model_path, unet_model_path)
unet_opt_graph = optimize(onnx.load(unet_model_path), name="Unet", verbose=True)
# clean up existing tensor files
shutil.rmtree(unet_dir)
os.mkdir(unet_dir)
# collate external tensor files into one
onnx.save_model(
unet_opt_graph,
unet_model_path,
save_as_external_data=True,
all_tensors_to_one_file=True,
location="weights.pb",
convert_attribute=False,
)
del pipeline.unet
else:
controlnets = torch.nn.ModuleList(controlnets)
unet_controlnet = UNet2DConditionControlNetModel(pipeline.unet, controlnets)
unet_in_channels = pipeline.unet.config.in_channels
unet_sample_size = pipeline.unet.config.sample_size
img_size = 8 * unet_sample_size
unet_path = output_path / "unet" / "model.onnx"
onnx_export(
unet_controlnet,
model_args=(
torch.randn(2, unet_in_channels, unet_sample_size, unet_sample_size).to(device=device, dtype=dtype),
torch.tensor([1.0]).to(device=device, dtype=dtype),
torch.randn(2, num_tokens, text_hidden_size).to(device=device, dtype=dtype),
torch.randn(len(controlnets), 2, 3, img_size, img_size).to(device=device, dtype=dtype),
torch.randn(len(controlnets), 1).to(device=device, dtype=dtype),
),
output_path=unet_path,
ordered_input_names=[
"sample",
"timestep",
"encoder_hidden_states",
"controlnet_conds",
"conditioning_scales",
],
output_names=["noise_pred"], # has to be different from "sample" for correct tracing
dynamic_axes={
"sample": {0: "2B", 2: "H", 3: "W"},
"encoder_hidden_states": {0: "2B"},
"controlnet_conds": {1: "2B", 3: "8H", 4: "8W"},
},
opset=opset,
use_external_data_format=True, # UNet is > 2GB, so the weights need to be split
)
unet_model_path = str(unet_path.absolute().as_posix())
unet_dir = os.path.dirname(unet_model_path)
# optimize onnx
shape_inference.infer_shapes_path(unet_model_path, unet_model_path)
unet_opt_graph = optimize(onnx.load(unet_model_path), name="Unet", verbose=True)
# clean up existing tensor files
shutil.rmtree(unet_dir)
os.mkdir(unet_dir)
# collate external tensor files into one
onnx.save_model(
unet_opt_graph,
unet_model_path,
save_as_external_data=True,
all_tensors_to_one_file=True,
location="weights.pb",
convert_attribute=False,
)
del pipeline.unet
# VAE ENCODER
vae_encoder = pipeline.vae
vae_in_channels = vae_encoder.config.in_channels
vae_sample_size = vae_encoder.config.sample_size
# need to get the raw tensor output (sample) from the encoder
vae_encoder.forward = lambda sample: vae_encoder.encode(sample).latent_dist.sample()
onnx_export(
vae_encoder,
model_args=(torch.randn(1, vae_in_channels, vae_sample_size, vae_sample_size).to(device=device, dtype=dtype),),
output_path=output_path / "vae_encoder" / "model.onnx",
ordered_input_names=["sample"],
output_names=["latent_sample"],
dynamic_axes={
"sample": {0: "batch", 1: "channels", 2: "height", 3: "width"},
},
opset=opset,
)
# VAE DECODER
vae_decoder = pipeline.vae
vae_latent_channels = vae_decoder.config.latent_channels
# forward only through the decoder part
vae_decoder.forward = vae_encoder.decode
onnx_export(
vae_decoder,
model_args=(
torch.randn(1, vae_latent_channels, unet_sample_size, unet_sample_size).to(device=device, dtype=dtype),
),
output_path=output_path / "vae_decoder" / "model.onnx",
ordered_input_names=["latent_sample"],
output_names=["sample"],
dynamic_axes={
"latent_sample": {0: "batch", 1: "channels", 2: "height", 3: "width"},
},
opset=opset,
)
del pipeline.vae
del pipeline
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--sd_xl", action="store_true", default=False, help="SD XL pipeline")
parser.add_argument(
"--model_path",
type=str,
required=True,
help="Path to the `diffusers` checkpoint to convert (either a local directory or on the Hub).",
)
parser.add_argument(
"--controlnet_path",
nargs="+",
required=True,
help="Path to the `controlnet` checkpoint to convert (either a local directory or on the Hub).",
)
parser.add_argument("--output_path", type=str, required=True, help="Path to the output model.")
parser.add_argument(
"--opset",
default=14,
type=int,
help="The version of the ONNX operator set to use.",
)
parser.add_argument("--fp16", action="store_true", default=False, help="Export the models in `float16` mode")
args = parser.parse_args()
convert_models(args.model_path, args.controlnet_path, args.output_path, args.opset, args.fp16, args.sd_xl)

View File

@@ -0,0 +1,121 @@
import argparse
import sys
import tensorrt as trt
def convert_models(onnx_path: str, num_controlnet: int, output_path: str, fp16: bool = False, sd_xl: bool = False):
"""
Function to convert models in stable diffusion controlnet pipeline into TensorRT format
Example:
python convert_stable_diffusion_controlnet_to_tensorrt.py
--onnx_path path-to-models-stable_diffusion/RevAnimated-v1-2-2/unet/model.onnx
--output_path path-to-models-stable_diffusion/RevAnimated-v1-2-2/unet/model.engine
--fp16
--num_controlnet 2
Example for SD XL:
python convert_stable_diffusion_controlnet_to_tensorrt.py
--onnx_path path-to-models-stable_diffusion/stable-diffusion-xl-base-1.0/unet/model.onnx
--output_path path-to-models-stable_diffusion/stable-diffusion-xl-base-1.0/unet/model.engine
--fp16
--num_controlnet 1
--sd_xl
Returns:
unet/model.engine
run test script in diffusers/examples/community
python test_onnx_controlnet.py
--sd_model danbrown/RevAnimated-v1-2-2
--onnx_model_dir path-to-models-stable_diffusion/RevAnimated-v1-2-2
--unet_engine_path path-to-models-stable_diffusion/stable-diffusion-xl-base-1.0/unet/model.engine
--qr_img_path path-to-qr-code-image
"""
# UNET
if sd_xl:
batch_size = 1
unet_in_channels = 4
unet_sample_size = 64
num_tokens = 77
text_hidden_size = 2048
img_size = 512
text_embeds_shape = (2 * batch_size, 1280)
time_ids_shape = (2 * batch_size, 6)
else:
batch_size = 1
unet_in_channels = 4
unet_sample_size = 64
num_tokens = 77
text_hidden_size = 768
img_size = 512
batch_size = 1
latents_shape = (2 * batch_size, unet_in_channels, unet_sample_size, unet_sample_size)
embed_shape = (2 * batch_size, num_tokens, text_hidden_size)
controlnet_conds_shape = (num_controlnet, 2 * batch_size, 3, img_size, img_size)
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
TRT_BUILDER = trt.Builder(TRT_LOGGER)
TRT_RUNTIME = trt.Runtime(TRT_LOGGER)
network = TRT_BUILDER.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
onnx_parser = trt.OnnxParser(network, TRT_LOGGER)
parse_success = onnx_parser.parse_from_file(onnx_path)
for idx in range(onnx_parser.num_errors):
print(onnx_parser.get_error(idx))
if not parse_success:
sys.exit("ONNX model parsing failed")
print("Load Onnx model done")
profile = TRT_BUILDER.create_optimization_profile()
profile.set_shape("sample", latents_shape, latents_shape, latents_shape)
profile.set_shape("encoder_hidden_states", embed_shape, embed_shape, embed_shape)
profile.set_shape("controlnet_conds", controlnet_conds_shape, controlnet_conds_shape, controlnet_conds_shape)
if sd_xl:
profile.set_shape("text_embeds", text_embeds_shape, text_embeds_shape, text_embeds_shape)
profile.set_shape("time_ids", time_ids_shape, time_ids_shape, time_ids_shape)
config = TRT_BUILDER.create_builder_config()
config.add_optimization_profile(profile)
config.set_preview_feature(trt.PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, True)
if fp16:
config.set_flag(trt.BuilderFlag.FP16)
plan = TRT_BUILDER.build_serialized_network(network, config)
if plan is None:
sys.exit("Failed building engine")
print("Succeeded building engine")
engine = TRT_RUNTIME.deserialize_cuda_engine(plan)
## save TRT engine
with open(output_path, "wb") as f:
f.write(engine.serialize())
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--sd_xl", action="store_true", default=False, help="SD XL pipeline")
parser.add_argument(
"--onnx_path",
type=str,
required=True,
help="Path to the onnx checkpoint to convert",
)
parser.add_argument("--num_controlnet", type=int)
parser.add_argument("--output_path", type=str, required=True, help="Path to the output model.")
parser.add_argument("--fp16", action="store_true", default=False, help="Export the models in `float16` mode")
args = parser.parse_args()
convert_models(args.onnx_path, args.num_controlnet, args.output_path, args.fp16, args.sd_xl)

View File

@@ -1,12 +1,6 @@
import argparse
from diffusers.utils import is_safetensors_available
if is_safetensors_available():
import safetensors.torch
else:
raise ImportError("Please install `safetensors`.")
import safetensors.torch
from diffusers import AutoencoderTiny

View File

@@ -171,6 +171,7 @@ else:
StableDiffusionControlNetPipeline,
StableDiffusionDepth2ImgPipeline,
StableDiffusionDiffEditPipeline,
StableDiffusionGLIGENPipeline,
StableDiffusionImageVariationPipeline,
StableDiffusionImg2ImgPipeline,
StableDiffusionInpaintPipeline,

View File

@@ -27,7 +27,7 @@ import torch
from huggingface_hub import hf_hub_download
from packaging import version
from ..utils import is_safetensors_available, logging
from ..utils import logging
from . import BaseDiffusersCLICommand
@@ -68,12 +68,7 @@ class FP16SafetensorsCommand(BaseDiffusersCLICommand):
self.local_ckpt_dir = f"/tmp/{ckpt_id}"
self.fp16 = fp16
if is_safetensors_available():
self.use_safetensors = use_safetensors
else:
raise ImportError(
"When `use_safetensors` is set to True, the `safetensors` library needs to be installed. Install it via `pip install safetensors`."
)
self.use_safetensors = use_safetensors
if not self.use_safetensors and not self.fp16:
raise NotImplementedError(

View File

@@ -26,7 +26,7 @@ from pathlib import PosixPath
from typing import Any, Dict, Tuple, Union
import numpy as np
from huggingface_hub import hf_hub_download
from huggingface_hub import create_repo, hf_hub_download
from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError, RevisionNotFoundError
from requests import HTTPError
@@ -144,6 +144,12 @@ class ConfigMixin:
Args:
save_directory (`str` or `os.PathLike`):
Directory where the configuration JSON file is saved (will be created if it does not exist).
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether or not to push your model to the Hugging Face Hub after saving it. You can specify the
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
namespace).
kwargs (`Dict[str, Any]`, *optional*):
Additional keyword arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
"""
if os.path.isfile(save_directory):
raise AssertionError(f"Provided path ({save_directory}) should be a directory, not a file")
@@ -156,6 +162,22 @@ class ConfigMixin:
self.to_json_file(output_config_file)
logger.info(f"Configuration saved in {output_config_file}")
if push_to_hub:
commit_message = kwargs.pop("commit_message", None)
private = kwargs.pop("private", False)
create_pr = kwargs.pop("create_pr", False)
token = kwargs.pop("token", None)
repo_id = kwargs.pop("repo_id", save_directory.split(os.path.sep)[-1])
repo_id = create_repo(repo_id, exist_ok=True, private=private, token=token).repo_id
self._upload_folder(
save_directory,
repo_id,
token=token,
commit_message=commit_message,
create_pr=create_pr,
)
@classmethod
def from_config(cls, config: Union[FrozenDict, Dict[str, Any]] = None, return_unused_kwargs=False, **kwargs):
r"""

View File

@@ -22,6 +22,7 @@ from pathlib import Path
from typing import Callable, Dict, List, Optional, Union
import requests
import safetensors
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
@@ -34,16 +35,12 @@ from .utils import (
deprecate,
is_accelerate_available,
is_omegaconf_available,
is_safetensors_available,
is_transformers_available,
logging,
)
from .utils.import_utils import BACKENDS_MAPPING
if is_safetensors_available():
import safetensors
if is_transformers_available():
from transformers import CLIPTextModel, CLIPTextModelWithProjection, PreTrainedModel, PreTrainedTokenizer
@@ -261,14 +258,10 @@ class UNet2DConditionLoadersMixin:
network_alphas = kwargs.pop("network_alphas", None)
is_network_alphas_none = network_alphas is None
if use_safetensors and not is_safetensors_available():
raise ValueError(
"`use_safetensors`=True but safetensors is not installed. Please install safetensors with `pip install safetensors"
)
allow_pickle = False
if use_safetensors is None:
use_safetensors = is_safetensors_available()
use_safetensors = True
allow_pickle = True
user_agent = {
@@ -504,7 +497,8 @@ class UNet2DConditionLoadersMixin:
is_main_process: bool = True,
weight_name: str = None,
save_function: Callable = None,
safe_serialization: bool = False,
safe_serialization: bool = True,
**kwargs,
):
r"""
Save an attention processor to a directory so that it can be reloaded using the
@@ -521,7 +515,8 @@ class UNet2DConditionLoadersMixin:
The function to use to save the state dictionary. Useful during distributed training when you need to
replace `torch.save` with another method. Can be configured with the environment variable
`DIFFUSERS_SAVE_MODE`.
safe_serialization (`bool`, *optional*, defaults to `True`):
Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`.
"""
from .models.attention_processor import (
CustomDiffusionAttnProcessor,
@@ -757,14 +752,9 @@ class TextualInversionLoaderMixin:
weight_name = kwargs.pop("weight_name", None)
use_safetensors = kwargs.pop("use_safetensors", None)
if use_safetensors and not is_safetensors_available():
raise ValueError(
"`use_safetensors`=True but safetensors is not installed. Please install safetensors with `pip install safetensors"
)
allow_pickle = False
if use_safetensors is None:
use_safetensors = is_safetensors_available()
use_safetensors = True
allow_pickle = True
user_agent = {
@@ -1014,14 +1004,9 @@ class LoraLoaderMixin:
unet_config = kwargs.pop("unet_config", None)
use_safetensors = kwargs.pop("use_safetensors", None)
if use_safetensors and not is_safetensors_available():
raise ValueError(
"`use_safetensors`=True but safetensors is not installed. Please install safetensors with `pip install safetensors"
)
allow_pickle = False
if use_safetensors is None:
use_safetensors = is_safetensors_available()
use_safetensors = True
allow_pickle = True
user_agent = {
@@ -1054,6 +1039,7 @@ class LoraLoaderMixin:
if not allow_pickle:
raise e
# try loading non-safetensors weights
model_file = None
pass
if model_file is None:
model_file = _get_model_file(
@@ -1430,7 +1416,7 @@ class LoraLoaderMixin:
is_main_process: bool = True,
weight_name: str = None,
save_function: Callable = None,
safe_serialization: bool = False,
safe_serialization: bool = True,
):
r"""
Save the LoRA parameters corresponding to the UNet and text encoder.
@@ -1451,6 +1437,8 @@ class LoraLoaderMixin:
The function to use to save the state dictionary. Useful during distributed training when you need to
replace `torch.save` with another method. Can be configured with the environment variable
`DIFFUSERS_SAVE_MODE`.
safe_serialization (`bool`, *optional*, defaults to `True`):
Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`.
"""
# Create a flat dictionary.
state_dict = {}
@@ -1853,7 +1841,7 @@ class FromSingleFileMixin:
torch_dtype = kwargs.pop("torch_dtype", None)
use_safetensors = kwargs.pop("use_safetensors", None if is_safetensors_available() else False)
use_safetensors = kwargs.pop("use_safetensors", None)
pipeline_name = cls.__name__
file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1]
@@ -1894,16 +1882,24 @@ class FromSingleFileMixin:
raise ValueError(f"Unhandled pipeline class: {pipeline_name}")
# remove huggingface url
for prefix in ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]:
has_valid_url_prefix = False
valid_url_prefixes = ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]
for prefix in valid_url_prefixes:
if pretrained_model_link_or_path.startswith(prefix):
pretrained_model_link_or_path = pretrained_model_link_or_path[len(prefix) :]
has_valid_url_prefix = True
# Code based on diffusers.pipelines.pipeline_utils.DiffusionPipeline.from_pretrained
ckpt_path = Path(pretrained_model_link_or_path)
if not ckpt_path.is_file():
if not has_valid_url_prefix:
raise ValueError(
f"The provided path is either not a file or a valid huggingface URL was not provided. Valid URLs begin with {', '.join(valid_url_prefixes)}"
)
# get repo_id and (potentially nested) file path of ckpt in repo
repo_id = os.path.join(*ckpt_path.parts[:2])
file_path = os.path.join(*ckpt_path.parts[2:])
repo_id = "/".join(ckpt_path.parts[:2])
file_path = "/".join(ckpt_path.parts[2:])
if file_path.startswith("blob/"):
file_path = file_path[len("blob/") :]
@@ -2050,7 +2046,7 @@ class FromOriginalVAEMixin:
torch_dtype = kwargs.pop("torch_dtype", None)
use_safetensors = kwargs.pop("use_safetensors", None if is_safetensors_available() else False)
use_safetensors = kwargs.pop("use_safetensors", None)
file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1]
from_safetensors = file_extension == "safetensors"
@@ -2223,7 +2219,7 @@ class FromOriginalControlnetMixin:
torch_dtype = kwargs.pop("torch_dtype", None)
use_safetensors = kwargs.pop("use_safetensors", None if is_safetensors_available() else False)
use_safetensors = kwargs.pop("use_safetensors", None)
file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1]
from_safetensors = file_extension == "safetensors"

View File

@@ -24,6 +24,38 @@ from .embeddings import CombinedTimestepLabelEmbeddings
from .lora import LoRACompatibleLinear
@maybe_allow_in_graph
class GatedSelfAttentionDense(nn.Module):
def __init__(self, query_dim, context_dim, n_heads, d_head):
super().__init__()
# we need a linear projection since we need cat visual feature and obj feature
self.linear = nn.Linear(context_dim, query_dim)
self.attn = Attention(query_dim=query_dim, heads=n_heads, dim_head=d_head)
self.ff = FeedForward(query_dim, activation_fn="geglu")
self.norm1 = nn.LayerNorm(query_dim)
self.norm2 = nn.LayerNorm(query_dim)
self.register_parameter("alpha_attn", nn.Parameter(torch.tensor(0.0)))
self.register_parameter("alpha_dense", nn.Parameter(torch.tensor(0.0)))
self.enabled = True
def forward(self, x, objs):
if not self.enabled:
return x
n_visual = x.shape[1]
objs = self.linear(objs)
x = x + self.alpha_attn.tanh() * self.attn(self.norm1(torch.cat([x, objs], dim=1)))[:, :n_visual, :]
x = x + self.alpha_dense.tanh() * self.ff(self.norm2(x))
return x
@maybe_allow_in_graph
class BasicTransformerBlock(nn.Module):
r"""
@@ -62,6 +94,7 @@ class BasicTransformerBlock(nn.Module):
norm_elementwise_affine: bool = True,
norm_type: str = "layer_norm",
final_dropout: bool = False,
attention_type: str = "default",
):
super().__init__()
self.only_cross_attention = only_cross_attention
@@ -120,6 +153,10 @@ class BasicTransformerBlock(nn.Module):
self.norm3 = nn.LayerNorm(dim, elementwise_affine=norm_elementwise_affine)
self.ff = FeedForward(dim, dropout=dropout, activation_fn=activation_fn, final_dropout=final_dropout)
# 4. Fuser
if attention_type == "gated":
self.fuser = GatedSelfAttentionDense(dim, cross_attention_dim, num_attention_heads, attention_head_dim)
# let chunk size default to None
self._chunk_size = None
self._chunk_dim = 0
@@ -150,7 +187,9 @@ class BasicTransformerBlock(nn.Module):
else:
norm_hidden_states = self.norm1(hidden_states)
cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
# 0. Prepare GLIGEN inputs
cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
gligen_kwargs = cross_attention_kwargs.pop("gligen", None)
attn_output = self.attn1(
norm_hidden_states,
@@ -162,6 +201,11 @@ class BasicTransformerBlock(nn.Module):
attn_output = gate_msa.unsqueeze(1) * attn_output
hidden_states = attn_output + hidden_states
# 1.5 GLIGEN Control
if gligen_kwargs is not None:
hidden_states = self.fuser(hidden_states, gligen_kwargs["objs"])
# 1.5 ends
# 2. Cross-Attention
if self.attn2 is not None:
norm_hidden_states = (

View File

@@ -544,3 +544,59 @@ class AttentionPooling(nn.Module):
a = a.reshape(bs, -1, 1).transpose(1, 2)
return a[:, 0, :] # cls_token
class FourierEmbedder(nn.Module):
def __init__(self, num_freqs=64, temperature=100):
super().__init__()
self.num_freqs = num_freqs
self.temperature = temperature
freq_bands = temperature ** (torch.arange(num_freqs) / num_freqs)
freq_bands = freq_bands[None, None, None]
self.register_buffer("freq_bands", freq_bands, persistent=False)
def __call__(self, x):
x = self.freq_bands * x.unsqueeze(-1)
return torch.stack((x.sin(), x.cos()), dim=-1).permute(0, 1, 3, 4, 2).reshape(*x.shape[:2], -1)
class PositionNet(nn.Module):
def __init__(self, positive_len, out_dim, fourier_freqs=8):
super().__init__()
self.positive_len = positive_len
self.out_dim = out_dim
self.fourier_embedder = FourierEmbedder(num_freqs=fourier_freqs)
self.position_dim = fourier_freqs * 2 * 4 # 2: sin/cos, 4: xyxy
if isinstance(out_dim, tuple):
out_dim = out_dim[0]
self.linears = nn.Sequential(
nn.Linear(self.positive_len + self.position_dim, 512),
nn.SiLU(),
nn.Linear(512, 512),
nn.SiLU(),
nn.Linear(512, out_dim),
)
self.null_positive_feature = torch.nn.Parameter(torch.zeros([self.positive_len]))
self.null_position_feature = torch.nn.Parameter(torch.zeros([self.position_dim]))
def forward(self, boxes, masks, positive_embeddings):
masks = masks.unsqueeze(-1)
# embedding position (it may include padding as a placeholder)
xyxy_embedding = self.fourier_embedder(boxes) # B*N*4 -> B*N*C
# learnable null embedding
positive_null = self.null_positive_feature.view(1, 1, -1)
xyxy_null = self.null_position_feature.view(1, 1, -1)
# replace padding with learnable null embedding
positive_embeddings = positive_embeddings * masks + (1 - masks) * positive_null
xyxy_embedding = xyxy_embedding * masks + (1 - masks) * xyxy_null
objs = self.linears(torch.cat([positive_embeddings, xyxy_embedding], dim=-1))
return objs

View File

@@ -22,9 +22,6 @@ class LoRALinearLayer(nn.Module):
def __init__(self, in_features, out_features, rank=4, network_alpha=None, device=None, dtype=None):
super().__init__()
if rank > min(in_features, out_features):
raise ValueError(f"LoRA rank {rank} must be less or equal than {min(in_features, out_features)}")
self.down = nn.Linear(in_features, rank, bias=False, device=device, dtype=dtype)
self.up = nn.Linear(rank, out_features, bias=False, device=device, dtype=dtype)
# This value has the same meaning as the `--network_alpha` option in the kohya-ss trainer script.
@@ -54,9 +51,6 @@ class LoRAConv2dLayer(nn.Module):
):
super().__init__()
if rank > min(in_features, out_features):
raise ValueError(f"LoRA rank {rank} must be less or equal than {min(in_features, out_features)}")
self.down = nn.Conv2d(in_features, rank, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)
# according to the official kohya_ss trainer kernel_size are always fixed for the up layer
# # see: https://github.com/bmaltais/kohya_ss/blob/2accb1305979ba62f5077a23aabac23b4c37e935/networks/lora_diffusers.py#L129

View File

@@ -23,7 +23,7 @@ import msgpack.exceptions
from flax.core.frozen_dict import FrozenDict, unfreeze
from flax.serialization import from_bytes, to_bytes
from flax.traverse_util import flatten_dict, unflatten_dict
from huggingface_hub import hf_hub_download
from huggingface_hub import create_repo, hf_hub_download
from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError, RevisionNotFoundError
from requests import HTTPError
@@ -34,6 +34,7 @@ from ..utils import (
FLAX_WEIGHTS_NAME,
HUGGINGFACE_CO_RESOLVE_ENDPOINT,
WEIGHTS_NAME,
PushToHubMixin,
logging,
)
from .modeling_flax_pytorch_utils import convert_pytorch_state_dict_to_flax
@@ -42,7 +43,7 @@ from .modeling_flax_pytorch_utils import convert_pytorch_state_dict_to_flax
logger = logging.get_logger(__name__)
class FlaxModelMixin:
class FlaxModelMixin(PushToHubMixin):
r"""
Base class for all Flax models.
@@ -497,6 +498,8 @@ class FlaxModelMixin:
save_directory: Union[str, os.PathLike],
params: Union[Dict, FrozenDict],
is_main_process: bool = True,
push_to_hub: bool = False,
**kwargs,
):
"""
Save a model and its configuration file to a directory so that it can be reloaded using the
@@ -511,6 +514,12 @@ class FlaxModelMixin:
Whether the process calling this is the main process or not. Useful during distributed training and you
need to call this function on all processes. In this case, set `is_main_process=True` only on the main
process to avoid race conditions.
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
namespace).
kwargs (`Dict[str, Any]`, *optional*):
Additional key word arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
"""
if os.path.isfile(save_directory):
logger.error(f"Provided path ({save_directory}) should be a directory, not a file")
@@ -518,6 +527,14 @@ class FlaxModelMixin:
os.makedirs(save_directory, exist_ok=True)
if push_to_hub:
commit_message = kwargs.pop("commit_message", None)
private = kwargs.pop("private", False)
create_pr = kwargs.pop("create_pr", False)
token = kwargs.pop("token", None)
repo_id = kwargs.pop("repo_id", save_directory.split(os.path.sep)[-1])
repo_id = create_repo(repo_id, exist_ok=True, private=private, token=token).repo_id
model_to_save = self
# Attach architecture to the config
@@ -532,3 +549,12 @@ class FlaxModelMixin:
f.write(model_bytes)
logger.info(f"Model weights saved in {output_model_file}")
if push_to_hub:
self._upload_folder(
save_directory,
repo_id,
token=token,
commit_message=commit_message,
create_pr=create_pr,
)

View File

@@ -21,7 +21,9 @@ import re
from functools import partial
from typing import Any, Callable, List, Optional, Tuple, Union
import safetensors
import torch
from huggingface_hub import create_repo
from torch import Tensor, device, nn
from .. import __version__
@@ -36,10 +38,10 @@ from ..utils import (
_get_model_file,
deprecate,
is_accelerate_available,
is_safetensors_available,
is_torch_version,
logging,
)
from ..utils.hub_utils import PushToHubMixin
logger = logging.get_logger(__name__)
@@ -56,9 +58,6 @@ if is_accelerate_available():
from accelerate.utils import set_module_tensor_to_device
from accelerate.utils.versions import is_torch_version
if is_safetensors_available():
import safetensors
def get_parameter_device(parameter: torch.nn.Module):
try:
@@ -150,7 +149,7 @@ def _load_state_dict_into_model(model_to_load, state_dict):
return error_msgs
class ModelMixin(torch.nn.Module):
class ModelMixin(torch.nn.Module, PushToHubMixin):
r"""
Base class for all models.
@@ -273,8 +272,10 @@ class ModelMixin(torch.nn.Module):
save_directory: Union[str, os.PathLike],
is_main_process: bool = True,
save_function: Callable = None,
safe_serialization: bool = False,
safe_serialization: bool = True,
variant: Optional[str] = None,
push_to_hub: bool = False,
**kwargs,
):
"""
Save a model and its configuration file to a directory so that it can be reloaded using the
@@ -291,20 +292,32 @@ class ModelMixin(torch.nn.Module):
The function to use to save the state dictionary. Useful during distributed training when you need to
replace `torch.save` with another method. Can be configured with the environment variable
`DIFFUSERS_SAVE_MODE`.
safe_serialization (`bool`, *optional*, defaults to `False`):
safe_serialization (`bool`, *optional*, defaults to `True`):
Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`.
variant (`str`, *optional*):
If specified, weights are saved in the format `pytorch_model.<variant>.bin`.
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether or not to push your model to the Hugging Face Hub after saving it. You can specify the
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
namespace).
kwargs (`Dict[str, Any]`, *optional*):
Additional keyword arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
"""
if safe_serialization and not is_safetensors_available():
raise ImportError("`safe_serialization` requires the `safetensors library: `pip install safetensors`.")
if os.path.isfile(save_directory):
logger.error(f"Provided path ({save_directory}) should be a directory, not a file")
return
os.makedirs(save_directory, exist_ok=True)
if push_to_hub:
commit_message = kwargs.pop("commit_message", None)
private = kwargs.pop("private", False)
create_pr = kwargs.pop("create_pr", False)
token = kwargs.pop("token", None)
repo_id = kwargs.pop("repo_id", save_directory.split(os.path.sep)[-1])
repo_id = create_repo(repo_id, exist_ok=True, private=private, token=token).repo_id
# Only save the model itself if we are using distributed training
model_to_save = self
# Attach architecture to the config
@@ -328,6 +341,15 @@ class ModelMixin(torch.nn.Module):
logger.info(f"Model weights saved in {os.path.join(save_directory, weights_name)}")
if push_to_hub:
self._upload_folder(
save_directory,
repo_id,
token=token,
commit_message=commit_message,
create_pr=create_pr,
)
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], **kwargs):
r"""
@@ -454,14 +476,9 @@ class ModelMixin(torch.nn.Module):
variant = kwargs.pop("variant", None)
use_safetensors = kwargs.pop("use_safetensors", None)
if use_safetensors and not is_safetensors_available():
raise ValueError(
"`use_safetensors`=True but safetensors is not installed. Please install safetensors with `pip install safetensors"
)
allow_pickle = False
if use_safetensors is None:
use_safetensors = is_safetensors_available()
use_safetensors = True
allow_pickle = True
if low_cpu_mem_usage and not is_accelerate_available():

Some files were not shown because too many files have changed in this diff Show More