skip test

merge upstream
Move slow tests to nightly (#5093)
2026-04-25 09:21:30 +08:00 · 2023-09-19 12:39:40 +00:00 · 2023-09-19 11:27:08 +00:00 · 2023-09-19 16:04:26 +05:30 · 2023-09-19 16:02:21 +05:30 · 2023-09-19 10:51:36 +01:00
715 changed files with 74219 additions and 13873 deletions
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -41,7 +41,7 @@ Core library:
 - Schedulers: @williamberman and @patrickvonplaten
 - Pipelines:  @patrickvonplaten and @sayakpaul
 - Training examples: @sayakpaul and @patrickvonplaten
- Docs: @stevenliu and @yiyixu
+- Docs: @stevhliu and @yiyixuxu
 - JAX and MPS: @pcuenca
 - Audio: @sanchit-gandhi
 - General functionalities: @patrickvonplaten and @sayakpaul
--- a/.github/workflows/pr_tests.yml
+++ b/.github/workflows/pr_tests.yml
@@ -34,6 +34,11 @@ jobs:
            runner: docker-cpu
            image: diffusers/diffusers-pytorch-cpu
            report: torch_cpu_models_schedulers
+          - name: LoRA
+            framework: lora
+            runner: docker-cpu
+            image: diffusers/diffusers-pytorch-cpu
+            report: torch_cpu_lora
          - name: Fast Flax CPU tests
            framework: flax
            runner: docker-cpu
@@ -67,6 +72,7 @@ jobs:
      run: |
        apt-get update && apt-get install libsndfile1-dev libgl1 -y
        python -m pip install -e .[quality,test]
+        python -m pip install git+https://github.com/huggingface/accelerate.git

    - name: Environment
      run: |
@@ -88,6 +94,14 @@ jobs:
          --make-reports=tests_${{ matrix.config.report }} \
          tests/models tests/schedulers tests/others

+    - name: Run fast PyTorch LoRA CPU tests
+      if: ${{ matrix.config.framework == 'lora' }}
+      run: |
+        python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
+          -s -v -k "not Flax and not Onnx and not Dependency" \
+          --make-reports=tests_${{ matrix.config.report }} \
+          tests/lora
+
    - name: Run fast Flax TPU tests
      if: ${{ matrix.config.framework == 'flax' }}
      run: |
@@ -113,3 +127,60 @@ jobs:
      with:
        name: pr_${{ matrix.config.report }}_test_reports
        path: reports
+
+  run_staging_tests:
+    strategy:
+      fail-fast: false
+      matrix:
+        config:
+          - name: Hub tests for models, schedulers, and pipelines
+            framework: hub_tests_pytorch
+            runner: docker-cpu
+            image: diffusers/diffusers-pytorch-cpu
+            report: torch_hub
+
+    name: ${{ matrix.config.name }}
+
+    runs-on: ${{ matrix.config.runner }}
+
+    container:
+      image: ${{ matrix.config.image }}
+      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
+
+    defaults:
+      run:
+        shell: bash
+
+    steps:
+    - name: Checkout diffusers
+      uses: actions/checkout@v3
+      with:
+        fetch-depth: 2
+
+    - name: Install dependencies
+      run: |
+        apt-get update && apt-get install libsndfile1-dev libgl1 -y
+        python -m pip install -e .[quality,test]
+
+    - name: Environment
+      run: |
+        python utils/print_env.py
+
+    - name: Run Hub tests for models, schedulers, and pipelines on a staging env
+      if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}
+      run: |
+        HUGGINGFACE_CO_STAGING=true python -m pytest \
+          -m "is_staging_test" \
+          --make-reports=tests_${{ matrix.config.report }} \
+          tests
+
+    - name: Failure short reports
+      if: ${{ failure() }}
+      run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt
+
+    - name: Test suite reports artifacts
+      if: ${{ always() }}
+      uses: actions/upload-artifact@v2
+      with:
+        name: pr_${{ matrix.config.report }}_test_reports
+        path: reports
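
Review note: the LoRA step added in this workflow passes `-k "not Flax and not Onnx and not Dependency"` to pytest. The sketch below is a simplified model of that filter, assuming plain case-insensitive substring matching against a test ID. It is not pytest's actual implementation, and the function name and test IDs are hypothetical.

```python
# Simplified, hypothetical model of the keyword expression
# `-k "not Flax and not Onnx and not Dependency"` used in the LoRA step.
# Assumption: a test is kept only if none of the excluded words appears
# (case-insensitively) anywhere in its ID.

EXCLUDED_KEYWORDS = ("Flax", "Onnx", "Dependency")


def is_selected(test_id, excluded=EXCLUDED_KEYWORDS):
    """Return True if the test ID survives the `not A and not B and not C` filter."""
    lowered = test_id.lower()
    return not any(word.lower() in lowered for word in excluded)


if __name__ == "__main__":
    # Hypothetical test IDs, for illustration only.
    for test_id in (
        "test_lora_layers.py::LoraLoaderMixinTests::test_simple_inference",
        "test_lora_layers.py::FlaxLoraTests::test_forward",
        "test_dependencies.py::DependencyTester::test_backend_registration",
    ):
        print(test_id, "->", "run" if is_selected(test_id) else "skip")
```

Under this model, only tests whose IDs mention none of the three words would run in the `tests/lora` job.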
--- a/.github/workflows/push_tests.yml
+++ b/.github/workflows/push_tests.yml
@@ -63,6 +63,7 @@ jobs:
      run: |
        apt-get update && apt-get install libsndfile1-dev libgl1 -y
        python -m pip install -e .[quality,test]
+        python -m pip install git+https://github.com/huggingface/accelerate.git

    - name: Environment
      run: |
--- a/.github/workflows/push_tests_mps.yml
+++ b/.github/workflows/push_tests_mps.yml
@@ -40,7 +40,7 @@ jobs:
        ${CONDA_RUN} python -m pip install --upgrade pip
        ${CONDA_RUN} python -m pip install -e .[quality,test]
        ${CONDA_RUN} python -m pip install torch torchvision torchaudio
-        ${CONDA_RUN} python -m pip install accelerate --upgrade
+        ${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate.git
        ${CONDA_RUN} python -m pip install transformers --upgrade

    - name: Environment
--- a/2
+++ b/2
@@ -78,7 +78,7 @@ test:
 # Run tests for examples

 test-examples:
-	python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/
+	python -m pytest -n auto --dist=loadfile -s -v ./examples/


 # Release stuff
--- a/PHILOSOPHY.md
+++ b/PHILOSOPHY.md
@@ -90,7 +90,7 @@ The following design principles are followed:
 - To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
 - Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
 - The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
-readable longterm, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
+readable longterm, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

 ### Schedulers

@@ -102,7 +102,7 @@ The following design principles are followed:
 - One scheduler python file corresponds to one scheduler algorithm (as might be defined in a paper).
 - If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
 - Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./using-diffusers/schedulers.mdx).
+- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./using-diffusers/schedulers.md).
 - Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
 - Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
 - The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
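
Review note: the scheduler contract listed in this hunk (`set_num_inference_steps`, a `timesteps` attribute, and `step`) can be illustrated with a minimal sketch. `ToyScheduler`, `toy_model`, `denoise`, and the update rule are hypothetical stand-ins invented for illustration. They are not diffusers classes or a real sampling algorithm; only the three member names are taken from the text above.

```python
class ToyScheduler:
    """Hypothetical toy scheduler that follows the interface described above."""

    def __init__(self, num_train_timesteps=1000):
        self.num_train_timesteps = num_train_timesteps
        self.timesteps = []

    def set_num_inference_steps(self, num_inference_steps):
        # Must be called before the denoising loop: it fills `timesteps`
        # with evenly spaced values in descending order.
        stride = self.num_train_timesteps // num_inference_steps
        self.timesteps = [i * stride for i in reversed(range(num_inference_steps))]

    def step(self, model_output, timestep, sample):
        # Toy update rule (an assumption, not a real algorithm): subtract an
        # equal share of the predicted noise and return the "previous" sample.
        return sample - model_output / len(self.timesteps)


def toy_model(sample, timestep):
    # Stand-in for a denoising network: it "predicts" the noise as the sample itself.
    return sample


def denoise(scheduler, sample, num_inference_steps):
    scheduler.set_num_inference_steps(num_inference_steps)
    for t in scheduler.timesteps:  # the attribute that is "looped over"
        model_output = toy_model(sample, t)
        sample = scheduler.step(model_output, t, sample)
    return sample


if __name__ == "__main__":
    scheduler = ToyScheduler()
    print(denoise(scheduler, 1.0, num_inference_steps=4))
    print(scheduler.timesteps)
```

The loop shape is the point here: set the number of steps once, iterate over `timesteps`, and feed each model output back through `step`.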
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 <p align="center">
    <br>
-    <img src="https://github.com/huggingface/diffusers/blob/main/docs/source/en/imgs/diffusers_library.jpg" width="400"/>
+    <img src="https://raw.githubusercontent.com/huggingface/diffusers/main/docs/source/en/imgs/diffusers_library.jpg" width="400"/>
    <br>
 <p>
 <p align="center">
@@ -10,6 +10,9 @@
    <a href="https://github.com/huggingface/diffusers/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
    </a>
+    <a href="https://pepy.tech/project/diffusers">
+        <img alt="GitHub release" src="https://static.pepy.tech/badge/diffusers/month">
+    </a>
    <a href="CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
    </a>
--- a/docs/README.md
+++ b/docs/README.md
@@ -68,7 +68,7 @@ The `preview` command only works with existing doc files. When you add a complet

 ## Adding a new element to the navigation bar

-Accepted files are Markdown (.md or .mdx).
+Accepted files are Markdown (.md).

 Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
 the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/diffusers/blob/main/docs/source/_toctree.yml) file.
@@ -96,7 +96,7 @@ Sections that were moved:

 Use the relative style to link to the new file so that the versioned docs continue to work.

-For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.mdx).
+For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).


 ## Writing Documentation - Specification
@@ -119,8 +119,8 @@ depending on the intended targets (beginners, more advanced users, or researcher

 When adding a new pipeline:

- create a file `xxx.mdx` under `docs/source/api/pipelines` (don't hesitate to copy an existing file as template).
- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.mdx`, along with the link to the paper, and a colab notebook (if available).
+- create a file `xxx.md` under `docs/source/api/pipelines` (don't hesitate to copy an existing file as template).
+- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.md`, along with the link to the paper, and a colab notebook (if available).
 - Write a short overview of the diffusion model:
    - Overview with paper & authors
    - Paper abstract
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -13,6 +13,8 @@
    title: Overview
  - local: using-diffusers/write_own_pipeline
    title: Understanding models and schedulers
+  - local: tutorials/autopipeline
+    title: AutoPipeline
  - local: tutorials/basic_training
    title: Train a diffusion model
  title: Tutorials
@@ -30,20 +32,22 @@
      title: Load safetensors
    - local: using-diffusers/other-formats
      title: Load different Stable Diffusion formats
+    - local: using-diffusers/push_to_hub
+      title: Push files to the Hub
    title: Loading & Hub
  - sections:
-    - local: using-diffusers/pipeline_overview
-      title: Overview
    - local: using-diffusers/unconditional_image_generation
      title: Unconditional image generation
    - local: using-diffusers/conditional_image_generation
-      title: Text-to-image generation
+      title: Text-to-image
    - local: using-diffusers/img2img
-      title: Text-guided image-to-image
+      title: Image-to-image
    - local: using-diffusers/inpaint
-      title: Text-guided image-inpainting
+      title: Inpainting
    - local: using-diffusers/depth2img
-      title: Text-guided depth-to-image
+      title: Depth-to-image
+    title: Tasks
+  - sections:
    - local: using-diffusers/textual_inversion_inference
      title: Textual inversion
    - local: training/distributed_inference
@@ -52,16 +56,28 @@
      title: Improve image quality with deterministic generation
    - local: using-diffusers/control_brightness
      title: Control image brightness
+    - local: using-diffusers/weighted_prompts
+      title: Prompt weighting
+    title: Techniques
+  - sections:
+    - local: using-diffusers/pipeline_overview
+      title: Overview
+    - local: using-diffusers/sdxl
+      title: Stable Diffusion XL
+    - local: using-diffusers/controlnet
+      title: ControlNet
+    - local: using-diffusers/shap-e
+      title: Shap-E
+    - local: using-diffusers/diffedit
+      title: DiffEdit
+    - local: using-diffusers/distilled_sd
+      title: Distilled Stable Diffusion inference
    - local: using-diffusers/reproducibility
      title: Create reproducible pipelines
    - local: using-diffusers/custom_pipeline_examples
      title: Community pipelines
    - local: using-diffusers/contribute_pipeline
      title: How to contribute a community pipeline
-    - local: using-diffusers/stable_diffusion_jax_how_to
-      title: Stable Diffusion in JAX/Flax
-    - local: using-diffusers/weighted_prompts
-      title: Weighting Prompts
    title: Pipelines for Inference
  - sections:
    - local: training/overview
@@ -86,12 +102,10 @@
      title: InstructPix2Pix Training
    - local: training/custom_diffusion
      title: Custom Diffusion
+    - local: training/t2i_adapters
+      title: T2I-Adapters
    title: Training
  - sections:
-    - local: using-diffusers/rl
-      title: Reinforcement Learning
-    - local: using-diffusers/audio
-      title: Audio
    - local: using-diffusers/other-modalities
      title: Other Modalities
    title: Taking Diffusers Beyond Images
@@ -99,25 +113,35 @@
 - sections:
  - local: optimization/opt_overview
    title: Overview
-  - local: optimization/fp16
-    title: Memory and Speed
-  - local: optimization/torch2.0
-    title: Torch2.0 support
-  - local: optimization/xformers
-    title: xFormers
-  - local: optimization/onnx
-    title: ONNX
-  - local: optimization/open_vino
-    title: OpenVINO
-  - local: optimization/coreml
-    title: Core ML
-  - local: optimization/mps
-    title: MPS
-  - local: optimization/habana
-    title: Habana Gaudi
-  - local: optimization/tome
-    title: Token Merging
-  title: Optimization/Special Hardware
+  - sections:
+    - local: optimization/fp16
+      title: Speed up inference
+    - local: optimization/memory
+      title: Reduce memory usage
+    - local: optimization/torch2.0
+      title: Torch 2.0
+    - local: optimization/xformers
+      title: xFormers
+    - local: optimization/tome
+      title: Token merging
+    title: General optimizations
+  - sections:
+    - local: using-diffusers/stable_diffusion_jax_how_to
+      title: JAX/Flax
+    - local: optimization/onnx
+      title: ONNX
+    - local: optimization/open_vino
+      title: OpenVINO
+    - local: optimization/coreml
+      title: Core ML
+    title: Optimized model types
+  - sections:
+    - local: optimization/mps
+      title: Metal Performance Shaders (MPS)
+    - local: optimization/habana
+      title: Habana Gaudi
+    title: Optimized hardware
+  title: Optimization
 - sections:
  - local: conceptual/philosophy
    title: Philosophy
@@ -166,6 +190,8 @@
      title: AutoencoderKL
    - local: api/models/asymmetricautoencoderkl
      title: AsymmetricAutoencoderKL
+    - local: api/models/autoencoder_tiny
+      title: Tiny AutoEncoder
    - local: api/models/transformer2d
      title: Transformer2D
    - local: api/models/transformer_temporal
@@ -186,10 +212,16 @@
      title: Audio Diffusion
    - local: api/pipelines/audioldm
      title: AudioLDM
+    - local: api/pipelines/audioldm2
+      title: AudioLDM 2
+    - local: api/pipelines/auto_pipeline
+      title: AutoPipeline
    - local: api/pipelines/consistency_models
      title: Consistency Models
    - local: api/pipelines/controlnet
      title: ControlNet
+    - local: api/pipelines/controlnet_sdxl
+      title: ControlNet with Stable Diffusion XL
    - local: api/pipelines/cycle_diffusion
      title: Cycle Diffusion
    - local: api/pipelines/dance_diffusion
@@ -208,10 +240,14 @@
      title: InstructPix2Pix
    - local: api/pipelines/kandinsky
      title: Kandinsky
+    - local: api/pipelines/kandinsky_v22
+      title: Kandinsky 2.2
    - local: api/pipelines/latent_diffusion
      title: Latent Diffusion
    - local: api/pipelines/panorama
      title: MultiDiffusion
+    - local: api/pipelines/musicldm
+      title: MusicLDM
    - local: api/pipelines/paint_by_example
      title: PaintByExample
    - local: api/pipelines/paradigms
@@ -259,6 +295,8 @@
        title: LDM3D Text-to-(RGB, Depth)
      - local: api/pipelines/stable_diffusion/adapter
        title: Stable Diffusion T2I-adapter
+      - local: api/pipelines/stable_diffusion/gligen
+        title: GLIGEN (Grounded Language-to-Image Generation)
      title: Stable Diffusion
    - local: api/pipelines/stable_unclip
      title: Stable unCLIP
@@ -276,63 +314,63 @@
      title: Unconditional Latent Diffusion
    - local: api/pipelines/unidiffuser
      title: UniDiffuser
+    - local: api/pipelines/value_guided_sampling
+      title: Value-guided sampling
    - local: api/pipelines/versatile_diffusion
      title: Versatile Diffusion
    - local: api/pipelines/vq_diffusion
      title: VQ Diffusion
+    - local: api/pipelines/wuerstchen
+      title: Wuerstchen
    title: Pipelines
  - sections:
    - local: api/schedulers/overview
      title: Overview
    - local: api/schedulers/cm_stochastic_iterative
-      title: Consistency Model Multistep Scheduler
-    - local: api/schedulers/ddim
-      title: DDIM
+      title: CMStochasticIterativeScheduler
    - local: api/schedulers/ddim_inverse
-      title: DDIMInverse
+      title: DDIMInverseScheduler
+    - local: api/schedulers/ddim
+      title: DDIMScheduler
    - local: api/schedulers/ddpm
-      title: DDPM
+      title: DDPMScheduler
    - local: api/schedulers/deis
-      title: DEIS
-    - local: api/schedulers/dpm_discrete
-      title: DPM Discrete Scheduler
-    - local: api/schedulers/dpm_discrete_ancestral
-      title: DPM Discrete Scheduler with ancestral sampling
+      title: DEISMultistepScheduler
+    - local: api/schedulers/multistep_dpm_solver_inverse
+      title: DPMSolverMultistepInverse
+    - local: api/schedulers/multistep_dpm_solver
+      title: DPMSolverMultistepScheduler
    - local: api/schedulers/dpm_sde
      title: DPMSolverSDEScheduler
-    - local: api/schedulers/euler_ancestral
-      title: Euler Ancestral Scheduler
-    - local: api/schedulers/euler
-      title: Euler scheduler
-    - local: api/schedulers/heun
-      title: Heun Scheduler
-    - local: api/schedulers/multistep_dpm_solver_inverse
-      title: Inverse Multistep DPM-Solver
-    - local: api/schedulers/ipndm
-      title: IPNDM
-    - local: api/schedulers/lms_discrete
-      title: Linear Multistep
-    - local: api/schedulers/multistep_dpm_solver
-      title: Multistep DPM-Solver
-    - local: api/schedulers/pndm
-      title: PNDM
-    - local: api/schedulers/repaint
-      title: RePaint Scheduler
    - local: api/schedulers/singlestep_dpm_solver
-      title: Singlestep DPM-Solver
+      title: DPMSolverSinglestepScheduler
+    - local: api/schedulers/euler_ancestral
+      title: EulerAncestralDiscreteScheduler
+    - local: api/schedulers/euler
+      title: EulerDiscreteScheduler
+    - local: api/schedulers/heun
+      title: HeunDiscreteScheduler
+    - local: api/schedulers/ipndm
+      title: IPNDMScheduler
    - local: api/schedulers/stochastic_karras_ve
-      title: Stochastic Kerras VE
+      title: KarrasVeScheduler
+    - local: api/schedulers/dpm_discrete_ancestral
+      title: KDPM2AncestralDiscreteScheduler
+    - local: api/schedulers/dpm_discrete
+      title: KDPM2DiscreteScheduler
+    - local: api/schedulers/lms_discrete
+      title: LMSDiscreteScheduler
+    - local: api/schedulers/pndm
+      title: PNDMScheduler
+    - local: api/schedulers/repaint
+      title: RePaintScheduler
+    - local: api/schedulers/score_sde_ve
+      title: ScoreSdeVeScheduler
+    - local: api/schedulers/score_sde_vp
+      title: ScoreSdeVpScheduler
    - local: api/schedulers/unipc
      title: UniPCMultistepScheduler
-    - local: api/schedulers/score_sde_ve
-      title: VE-SDE
-    - local: api/schedulers/score_sde_vp
-      title: VP-SDE
    - local: api/schedulers/vq_diffusion
      title: VQDiffusionScheduler
    title: Schedulers
-  - sections:
-    - local: api/experimental/rl
-      title: RL Planning
-    title: Experimental Features
  title: API
--- a/docs/source/en/api/attnprocessor.mdx
+++ b/docs/source/en/api/attnprocessor.mdx
@@ -17,6 +17,9 @@ An attention processor is a class for applying different types of attention mech
 ## CustomDiffusionAttnProcessor
 [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor

+## CustomDiffusionAttnProcessor2_0
+[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0
+
 ## AttnAddedKVProcessor
 [[autodoc]] models.attention_processor.AttnAddedKVProcessor

@@ -39,4 +42,4 @@ An attention processor is a class for applying different types of attention mech
 [[autodoc]] models.attention_processor.SlicedAttnProcessor

 ## SlicedAttnAddedKVProcessor
-[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
+[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
--- a/docs/source/en/api/configuration.mdx
+++ b/docs/source/en/api/configuration.mdx
--- a/docs/source/en/api/diffusion_pipeline.mdx
+++ b/docs/source/en/api/diffusion_pipeline.mdx
--- a/docs/source/en/api/experimental/rl.mdx
+++ b/docs/source/en/api/experimental/rl.mdx
@@ -1,15 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# TODO
-
-Coming soon!
--- a/docs/source/en/api/image_processor.mdx
+++ b/docs/source/en/api/image_processor.mdx
--- a/docs/source/en/api/loaders.mdx
+++ b/docs/source/en/api/loaders.mdx
@@ -28,6 +28,10 @@ Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusio

 [[autodoc]] loaders.TextualInversionLoaderMixin

+## StableDiffusionXLLoraLoaderMixin
+
+[[autodoc]] loaders.StableDiffusionXLLoraLoaderMixin
+
 ## LoraLoaderMixin

 [[autodoc]] loaders.LoraLoaderMixin
--- a/docs/source/en/api/logging.mdx
+++ b/docs/source/en/api/logging.mdx
--- a/docs/source/en/api/models/asymmetricautoencoderkl.mdx
+++ b/docs/source/en/api/models/asymmetricautoencoderkl.mdx
--- a/docs/source/en/api/models/autoencoder_tiny.md
+++ b/docs/source/en/api/models/autoencoder_tiny.md
@@ -0,0 +1,45 @@
+# Tiny AutoEncoder
+
+Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in [madebyollin/taesd](https://github.com/madebyollin/taesd) by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion's VAE that can decode the latents in a [`StableDiffusionPipeline`] or [`StableDiffusionXLPipeline`] almost instantly.
+
+To use with Stable Diffusion v2.1:
+
+```python
+import torch
+from diffusers import DiffusionPipeline, AutoencoderTiny
+
+pipe = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
+)
+pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+prompt = "slice of delicious New York-style berry cheesecake"
+image = pipe(prompt, num_inference_steps=25).images[0]
+image.save("cheesecake.png")
+```
+
+To use with Stable Diffusion XL 1.0:
+
+```python
+import torch
+from diffusers import DiffusionPipeline, AutoencoderTiny
+
+pipe = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+)
+pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+prompt = "slice of delicious New York-style berry cheesecake"
+image = pipe(prompt, num_inference_steps=25).images[0]
+image.save("cheesecake_sdxl.png")
+```
+
+## AutoencoderTiny
+
+[[autodoc]] AutoencoderTiny
+
+## AutoencoderTinyOutput
+
+[[autodoc]] models.autoencoder_tiny.AutoencoderTinyOutput
--- a/docs/source/en/api/models/autoencoderkl.mdx
+++ b/docs/source/en/api/models/autoencoderkl.mdx
--- a/docs/source/en/api/models/controlnet.mdx
+++ b/docs/source/en/api/models/controlnet.mdx
--- a/docs/source/en/api/models/overview.mdx
+++ b/docs/source/en/api/models/overview.mdx
@@ -9,4 +9,8 @@ All models are built from the base [`ModelMixin`] class which is a [`torch.nn.mo

 ## FlaxModelMixin

-[[autodoc]] FlaxModelMixin
+[[autodoc]] FlaxModelMixin
+
+## PushToHubMixin
+
+[[autodoc]] utils.PushToHubMixin
--- a/docs/source/en/api/models/prior_transformer.mdx
+++ b/docs/source/en/api/models/prior_transformer.mdx
--- a/docs/source/en/api/models/transformer2d.mdx
+++ b/docs/source/en/api/models/transformer2d.mdx
--- a/docs/source/en/api/models/transformer_temporal.mdx
+++ b/docs/source/en/api/models/transformer_temporal.mdx
--- a/docs/source/en/api/models/unet.mdx
+++ b/docs/source/en/api/models/unet.mdx
--- a/docs/source/en/api/models/unet2d-cond.mdx
+++ b/docs/source/en/api/models/unet2d-cond.mdx
--- a/docs/source/en/api/models/unet2d.mdx
+++ b/docs/source/en/api/models/unet2d.mdx
--- a/docs/source/en/api/models/unet3d-cond.mdx
+++ b/docs/source/en/api/models/unet3d-cond.mdx
--- a/docs/source/en/api/models/vq.mdx
+++ b/docs/source/en/api/models/vq.mdx
--- a/docs/source/en/api/outputs.mdx
+++ b/docs/source/en/api/outputs.mdx
--- a/docs/source/en/api/pipelines/alt_diffusion.mdx
+++ b/docs/source/en/api/pipelines/alt_diffusion.mdx
--- a/docs/source/en/api/pipelines/attend_and_excite.mdx
+++ b/docs/source/en/api/pipelines/attend_and_excite.mdx
--- a/docs/source/en/api/pipelines/audio_diffusion.mdx
+++ b/docs/source/en/api/pipelines/audio_diffusion.mdx
--- a/docs/source/en/api/pipelines/audioldm.mdx
+++ b/docs/source/en/api/pipelines/audioldm.mdx
@@ -46,6 +46,5 @@ Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to le
 	- all
 	- __call__

-## StableDiffusionPipelineOutput
-
-[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+## AudioPipelineOutput
+[[autodoc]] pipelines.AudioPipelineOutput
--- a/docs/source/en/api/pipelines/audioldm2.md
+++ b/docs/source/en/api/pipelines/audioldm2.md
@@ -0,0 +1,93 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# AudioLDM 2
+
+AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) 
+by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate 
+text-conditional sound effects, human speech and music.
+
+Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2
+is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two 
+text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap)
+and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings 
+are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). 
+A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively 
+predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding 
+vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) 
+of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention 
+conditioning, as in most other LDMs.
+
+The abstract of the paper is the following:
+
+*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.*
+
+This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be 
+found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2). 
+
+## Tips
+
+### Choosing a checkpoint
+
+AudioLDM 2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio
+generation. The third checkpoint is trained exclusively on text-to-music generation.
+
+All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. 
+See the table below for details on the three checkpoints:
+
+| Checkpoint                                                      | Task          | UNet Model Size | Total Model Size | Training Data (hours) |
+|-----------------------------------------------------------------|---------------|-----------------|------------------|-----------------------|
+| [audioldm2](https://huggingface.co/cvssp/audioldm2)             | Text-to-audio | 350M            | 1.1B             | 1150k                 |
+| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M            | 1.5B             | 1150k                 |
+| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M            | 1.1B             | 665k                  |
+
+### Constructing a prompt
+
+* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context-specific (e.g. "water stream in a forest" instead of "stream").
+* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
+* Using a **negative prompt** can significantly improve the quality of the generated waveform by guiding the generation away from terms that correspond to poor-quality audio. Try using a negative prompt of "Low quality."
+
+### Controlling inference
+
+* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; more steps give higher-quality audio at the expense of slower inference.
+* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
+
+### Evaluating generated waveforms
+
+* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
+* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. The generated waveforms are automatically scored against the prompt text and ranked from best to worst.
+
+The following example demonstrates how to generate good-quality music using the tips above: [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example).
+
+<Tip>
+
+Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between 
+scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) 
+section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
+## AudioLDM2Pipeline
+[[autodoc]] AudioLDM2Pipeline
+	- all
+	- __call__
+
+## AudioLDM2ProjectionModel
+[[autodoc]] AudioLDM2ProjectionModel
+	- forward
+
+## AudioLDM2UNet2DConditionModel
+[[autodoc]] AudioLDM2UNet2DConditionModel
+	- forward
+
+## AudioPipelineOutput
+[[autodoc]] pipelines.AudioPipelineOutput
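
The tips in the AudioLDM2 doc above can be sketched as follows. This is a minimal sketch and not part of the diff: the helper names `build_call_kwargs` and `generate` are hypothetical, the default values are illustrative assumptions, and `generate` assumes `torch` and `diffusers` are installed and that the `cvssp/audioldm2` checkpoint from the table can be downloaded.

```python
def build_call_kwargs(
    prompt,
    negative_prompt="Low quality.",
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
):
    """Collect the pipeline call arguments discussed in the tips (hypothetical helper)."""
    if num_waveforms_per_prompt < 1:
        raise ValueError("num_waveforms_per_prompt must be at least 1")
    return {
        "prompt": prompt,
        # steer generation away from terms associated with poor-quality audio
        "negative_prompt": negative_prompt,
        # more steps: higher quality, slower inference
        "num_inference_steps": num_inference_steps,
        # length of the generated sample in seconds
        "audio_length_in_s": audio_length_in_s,
        # values above 1 enable automatic ranking against the prompt text
        "num_waveforms_per_prompt": num_waveforms_per_prompt,
    }


def generate(call_kwargs, checkpoint="cvssp/audioldm2", seed=0):
    """Run the pipeline once and return the top-ranked waveform (hypothetical helper)."""
    # imported lazily so build_call_kwargs works without the heavy dependencies
    import torch
    from diffusers import AudioLDM2Pipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = AudioLDM2Pipeline.from_pretrained(checkpoint, torch_dtype=dtype).to(device)

    # output quality varies with the seed, so keep it explicit and easy to change
    generator = torch.Generator(device=device).manual_seed(seed)
    audios = pipe(generator=generator, **call_kwargs).audios
    return audios[0]


kwargs = build_call_kwargs("Water stream in a forest, high quality, clear")
```

Calling `generate(kwargs)` with a few different `seed` values is one way to follow the seed tip; swapping `checkpoint` for `cvssp/audioldm2-music` would target text-to-music instead.
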
--- a/docs/source/en/api/pipelines/auto_pipeline.md
+++ b/docs/source/en/api/pipelines/auto_pipeline.md
@@ -0,0 +1,74 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# AutoPipeline
+
+`AutoPipeline` is designed to:
+
+1. make it easy for you to load a checkpoint for a task without knowing the specific pipeline class to use
+2. let you use multiple pipelines in your workflow
+
+Based on the task, the `AutoPipeline` class automatically retrieves the relevant pipeline given the name or path to the pretrained weights with the `from_pretrained()` method.
+
+To seamlessly switch between tasks with the same checkpoint without reallocating additional memory, use the `from_pipe()` method to transfer the components from the original pipeline to the new one.
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+).to("cuda")
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+image = pipeline(prompt, num_inference_steps=25).images[0]
+```
+
+<Tip>
+
+Check out the [AutoPipeline](/tutorials/autopipeline) tutorial to learn how to use this API!
+
+</Tip>
+
+`AutoPipeline` supports text-to-image, image-to-image, and inpainting for the following diffusion models:
+
+- [Stable Diffusion](./stable_diffusion)
+- [ControlNet](./controlnet)
+- [Stable Diffusion XL (SDXL)](./stable_diffusion/stable_diffusion_xl)
+- [DeepFloyd IF](./if) 
+- [Kandinsky](./kandinsky)
+- [Kandinsky 2.2](./kandinsky#kandinsky-22)
+
+
+## AutoPipelineForText2Image
+
+[[autodoc]] AutoPipelineForText2Image
+	- all
+	- from_pretrained
+	- from_pipe
+
+
+## AutoPipelineForImage2Image
+
+[[autodoc]] AutoPipelineForImage2Image
+	- all
+	- from_pretrained
+	- from_pipe
+
+## AutoPipelineForInpainting
+
+[[autodoc]] AutoPipelineForInpainting
+	- all
+	- from_pretrained
+	- from_pipe
+
+
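
The `from_pipe()` workflow described in the AutoPipeline doc above can be sketched as follows. This is a minimal sketch and not part of the diff: the mapping `TASK_TO_AUTO_CLASS` and the helpers `auto_class_name` and `switch_task` are hypothetical names introduced for illustration, and `switch_task` assumes `diffusers` is installed and that you already hold a loaded pipeline.

```python
# Task names mapped to the AutoPipeline classes documented above.
TASK_TO_AUTO_CLASS = {
    "text-to-image": "AutoPipelineForText2Image",
    "image-to-image": "AutoPipelineForImage2Image",
    "inpainting": "AutoPipelineForInpainting",
}


def auto_class_name(task):
    """Return the AutoPipeline class name for a task (hypothetical helper)."""
    try:
        return TASK_TO_AUTO_CLASS[task]
    except KeyError:
        supported = ", ".join(sorted(TASK_TO_AUTO_CLASS))
        raise ValueError(f"unsupported task {task!r}; expected one of: {supported}") from None


def switch_task(pipeline, task):
    """Build a pipeline for `task` from an already loaded pipeline (hypothetical helper).

    from_pipe() transfers the components of the existing pipeline to the new
    one, so the checkpoint is not loaded a second time.
    """
    # imported lazily so the mapping above is usable without diffusers installed
    import diffusers

    auto_cls = getattr(diffusers, auto_class_name(task))
    return auto_cls.from_pipe(pipeline)
```

For example, given the text-to-image `pipeline` from the snippet in the doc, `switch_task(pipeline, "image-to-image")` would return an image-to-image pipeline that shares its components.
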
--- a/docs/source/en/api/pipelines/consistency_models.mdx
+++ b/docs/source/en/api/pipelines/consistency_models.mdx
--- a/docs/source/en/api/pipelines/controlnet.md
+++ b/docs/source/en/api/pipelines/controlnet.md
@@ -0,0 +1,80 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# ControlNet
+
+ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
+
+With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that preserves the spatial information from the depth map. This gives you a more flexible and accurate way to control the image generation process than a text prompt alone.
+
+The abstract from the paper is:
+
+*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
+
+This model was contributed by [takuma104](https://huggingface.co/takuma104). ❤️
+
+The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet), and you can find official ControlNet checkpoints on [lllyasviel's](https://huggingface.co/lllyasviel) Hub profile.
+
+<Tip>
+
+Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
+## StableDiffusionControlNetPipeline
+[[autodoc]] StableDiffusionControlNetPipeline
+	- all
+	- __call__
+	- enable_attention_slicing
+	- disable_attention_slicing
+	- enable_vae_slicing
+	- disable_vae_slicing
+	- enable_xformers_memory_efficient_attention
+	- disable_xformers_memory_efficient_attention
+	- load_textual_inversion
+
+## StableDiffusionControlNetImg2ImgPipeline
+[[autodoc]] StableDiffusionControlNetImg2ImgPipeline
+	- all
+	- __call__
+	- enable_attention_slicing
+	- disable_attention_slicing
+	- enable_vae_slicing
+	- disable_vae_slicing
+	- enable_xformers_memory_efficient_attention
+	- disable_xformers_memory_efficient_attention
+	- load_textual_inversion
+
+## StableDiffusionControlNetInpaintPipeline
+[[autodoc]] StableDiffusionControlNetInpaintPipeline
+	- all
+	- __call__
+	- enable_attention_slicing
+	- disable_attention_slicing
+	- enable_vae_slicing
+	- disable_vae_slicing
+	- enable_xformers_memory_efficient_attention
+	- disable_xformers_memory_efficient_attention
+	- load_textual_inversion
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+
+## FlaxStableDiffusionControlNetPipeline
+[[autodoc]] FlaxStableDiffusionControlNetPipeline
+	- all
+	- __call__
+
+## FlaxStableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/controlnet.mdx
+++ b/docs/source/en/api/pipelines/controlnet.mdx
@@ -1,350 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# ControlNet
-
-[Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
-
-Using a pretrained model, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.
-
-The abstract from the paper is:
-
-*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
-
-This model was contributed by [takuma104](https://huggingface.co/takuma104). ❤️
-
-The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet).
-
-## Usage example
-
-In the following we give a simple example of how to use a *ControlNet* checkpoint with Diffusers for inference.
-The inference pipeline is the same for all pipelines:
-
-* 1. Take an image and run it through a pre-conditioning processor.
-* 2. Run the pre-processed image through the [`StableDiffusionControlNetPipeline`].
-
-Let's have a look at a simple example using the [Canny Edge ControlNet](https://huggingface.co/lllyasviel/sd-controlnet-canny).
-
-```python
-from diffusers import StableDiffusionControlNetPipeline
-from diffusers.utils import load_image
-
-# Let's load the popular vermeer image
-image = load_image(
-    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
-)
-```
-
-![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png)
-
-Next, we process the image to get the canny image. This is step *1.* - running the pre-conditioning processor. The pre-conditioning processor is different for every ControlNet. Please see the model cards of the [official checkpoints](#controlnet-with-stable-diffusion-1.5) for more information about other models.
-
-First, we need to install opencv:
-
-```
-pip install opencv-contrib-python
-```
-
-Next, let's also install all required Hugging Face libraries:
-
-```
-pip install diffusers transformers git+https://github.com/huggingface/accelerate.git
-```
-
-Then we can retrieve the canny edges of the image.
-
-```python
-import cv2
-from PIL import Image
-import numpy as np
-
-image = np.array(image)
-
-low_threshold = 100
-high_threshold = 200
-
-image = cv2.Canny(image, low_threshold, high_threshold)
-image = image[:, :, None]
-image = np.concatenate([image, image, image], axis=2)
-canny_image = Image.fromarray(image)
-```
-
-Let's take a look at the processed image.
-
-![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vermeer_canny_edged.png)
-
-Now, we load the official [Stable Diffusion 1.5 Model](https://huggingface.co/runwayml/stable-diffusion-v1-5) as well as the ControlNet for canny edges.
-
-```py
-from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
-import torch
-
-controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
-pipe = StableDiffusionControlNetPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
-)
-```
-
-To speed things up and reduce memory usage, let's enable model offloading and use the fast [`UniPCMultistepScheduler`].
-
-```py
-from diffusers import UniPCMultistepScheduler
-
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-
-# this command loads the individual model components on GPU on-demand.
-pipe.enable_model_cpu_offload()
-```
-
-Finally, we can run the pipeline:
-
-```py
-generator = torch.manual_seed(0)
-
-out_image = pipe(
-    "disco dancer with colorful lights", num_inference_steps=20, generator=generator, image=canny_image
-).images[0]
-```
-
-This should take only around 3-4 seconds on GPU (depending on hardware). The output image then looks as follows:
-
-![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vermeer_disco_dancing.png)
-
-
-**Note**: To see how to run all other ControlNet checkpoints, please have a look at [ControlNet with Stable Diffusion 1.5](#controlnet-with-stable-diffusion-1.5).
-
-<!-- TODO: add space -->
-
-## Combining multiple conditionings
-
-Multiple ControlNet conditionings can be combined for a single image generation. Pass a list of ControlNets to the pipeline's constructor and a corresponding list of conditionings to `__call__`.
-
-When combining conditionings, it is helpful to mask conditionings such that they do not overlap. In the example, we mask the middle of the canny map where the pose conditioning is located.
-
-It can also be helpful to vary the `controlnet_conditioning_scale` values to emphasize one conditioning over the other.
-
-### Canny conditioning
-
-The original image:
-
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"/>
-
-Prepare the conditioning:
-
-```python 
-from diffusers.utils import load_image
-from PIL import Image
-import cv2
-import numpy as np
-from diffusers.utils import load_image
-
-canny_image = load_image(
-    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
-)
-canny_image = np.array(canny_image)
-
-low_threshold = 100
-high_threshold = 200
-
-canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
-
-# zero out middle columns of image where pose will be overlaid
-zero_start = canny_image.shape[1] // 4
-zero_end = zero_start + canny_image.shape[1] // 2
-canny_image[:, zero_start:zero_end] = 0
-
-canny_image = canny_image[:, :, None]
-canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
-canny_image = Image.fromarray(canny_image)
-```
-
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/landscape_canny_masked.png"/>
-
-### Openpose conditioning
-
-The original image:
-
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png" width=600/>
-
-Prepare the conditioning:
-
-```python
-from controlnet_aux import OpenposeDetector
-from diffusers.utils import load_image
-
-openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
-
-openpose_image = load_image(
-    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
-)
-openpose_image = openpose(openpose_image)
-```
-
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/person_pose.png" width=600/>
-
-### Running ControlNet with multiple conditionings
-
-```python
-from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
-import torch
-
-controlnet = [
-    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
-    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
-]
-
-pipe = StableDiffusionControlNetPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
-)
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-
-pipe.enable_xformers_memory_efficient_attention()
-pipe.enable_model_cpu_offload()
-
-prompt = "a giant standing in a fantasy landscape, best quality"
-negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
-
-generator = torch.Generator(device="cpu").manual_seed(1)
-
-images = [openpose_image, canny_image]
-
-image = pipe(
-    prompt,
-    images,
-    num_inference_steps=20,
-    generator=generator,
-    negative_prompt=negative_prompt,
-    controlnet_conditioning_scale=[1.0, 0.8],
-).images[0]
-
-image.save("./multi_controlnet_output.png")
-```
-
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/multi_controlnet_output.png" width=600/>
-
-### Guess Mode
-
-Guess Mode is [a ControlNet feature that was implemented](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) after the publication of [the paper](https://arxiv.org/abs/2302.05543). The description states:
-
->In this mode, the ControlNet encoder will try best to recognize the content of the input control map, like depth map, edge map, scribbles, etc, even if you remove all prompts.
-
-#### The core implementation:
-
-It adjusts the scale of the output residuals from ControlNet by a fixed ratio depending on the block depth. The shallowest DownBlock corresponds to `0.1`. As the blocks get deeper, the scale increases exponentially, and the scale for the output of the MidBlock becomes `1.0`. 
-
-Since the core implementation is just this, **it does not have any impact on prompt conditioning**. While it is common to use it without specifying any prompts, it is also possible to provide prompts if desired.
-
-#### Usage:
-
-Just pass `guess_mode=True` when calling `pipe()`. A `guidance_scale` between 3.0 and 5.0 is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode).
-```py
-from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
-import torch
-
-controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
-pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet).to(
-    "cuda"
-)
-image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
-image.save("guess_mode_generated.png")
-```
-
-#### Output image comparison:
-Canny Control Example
-
-|no guess_mode with prompt|guess_mode without prompt|
-|---|---|
-|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"/></a>|
-
-
-## Available checkpoints
-
-ControlNet requires a *control image* in addition to the text-to-image *prompt*. 
-Each pretrained model is trained using a different conditioning method that requires different images for conditioning the generated outputs. For example, Canny edge conditioning requires the control image to be the output of a Canny filter, while depth conditioning requires the control image to be a depth map. See the overview and image examples below to learn more.
-
-All checkpoints can be found under the authors' namespace [lllyasviel](https://huggingface.co/lllyasviel).
-
-**13.04.2023 Update**: The author has released improved ControlNet checkpoints v1.1 - see [here](#controlnet-v1.1).
-
-### ControlNet v1.0
-
-| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
-|---|---|---|---|
-|[lllyasviel/sd-controlnet-canny](https://huggingface.co/lllyasviel/sd-controlnet-canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_canny.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_canny.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"/></a>|
-|[lllyasviel/sd-controlnet-depth](https://huggingface.co/lllyasviel/sd-controlnet-depth)<br/> *Trained with Midas depth estimation*  |A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_depth.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_depth.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"/></a>|
-|[lllyasviel/sd-controlnet-hed](https://huggingface.co/lllyasviel/sd-controlnet-hed)<br/> *Trained with HED edge detection (soft edge)*  |A monochrome image with white soft edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_hed.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_hed.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"/></a> |
-|[lllyasviel/sd-controlnet-mlsd](https://huggingface.co/lllyasviel/sd-controlnet-mlsd)<br/> *Trained with M-LSD line detection*  |A monochrome image composed only of white straight lines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_mlsd.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_mlsd.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"/></a>|
-|[lllyasviel/sd-controlnet-normal](https://huggingface.co/lllyasviel/sd-controlnet-normal)<br/> *Trained with normal map*  |A [normal mapped](https://en.wikipedia.org/wiki/Normal_mapping) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_normal.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_normal.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"/></a>|
-|[lllyasviel/sd-controlnet-openpose](https://huggingface.co/lllyasviel/sd-controlnet-openpose)<br/> *Trained with OpenPose bone image*  |An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_openpose.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_openpose.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"/></a>|
-|[lllyasviel/sd-controlnet-scribble](https://huggingface.co/lllyasviel/sd-controlnet-scribble)<br/> *Trained with human scribbles*  |A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_scribble.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_scribble.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"/></a> |
-|[lllyasviel/sd-controlnet-seg](https://huggingface.co/lllyasviel/sd-controlnet-seg)<br/>*Trained with semantic segmentation*  |An [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/)'s segmentation protocol image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_seg.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_seg.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"/></a> |
-
-### ControlNet v1.1
-
-| Model Name | Control Image Overview| Condition Image | Control Image Example | Generated Image Example |
-|---|---|---|---|---|
-|[lllyasviel/control_v11p_sd15_canny](https://huggingface.co/lllyasviel/control_v11p_sd15_canny)<br/> | *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11e_sd15_ip2p](https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p)<br/> | *Trained with pixel to pixel instruction* | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15_inpaint](https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint)<br/> | Trained with image inpainting | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"/></a>|
-|[lllyasviel/control_v11p_sd15_mlsd](https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd)<br/> | Trained with multi-level line segment detection | An image with annotated line segments.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11f1p_sd15_depth](https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth)<br/> | Trained with depth estimation | An image with depth information, usually represented as a grayscale image.|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15_normalbae](https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae)<br/> | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15_seg](https://huggingface.co/lllyasviel/control_v11p_sd15_seg)<br/> | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15_lineart](https://huggingface.co/lllyasviel/control_v11p_sd15_lineart)<br/> | Trained with line art generation | An image with line art, usually black lines on a white background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15s2_lineart_anime](https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime)<br/> | Trained with anime line art generation | An image with anime-style line art.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15_openpose](https://huggingface.co/lllyasviel/control_v11p_sd15_openpose)<br/> | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15_scribble](https://huggingface.co/lllyasviel/control_v11p_sd15_scribble)<br/> | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11p_sd15_softedge](https://huggingface.co/lllyasviel/control_v11p_sd15_softedge)<br/> | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11e_sd15_shuffle](https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle)<br/> | Trained with image shuffling | An image with shuffled patches or regions.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"/></a>|
-|[lllyasviel/control_v11f1e_sd15_tile](https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile)<br/> | Trained with image tiling | A blurry image or part of an image .|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"/></a>|
-
-## StableDiffusionControlNetPipeline
-[[autodoc]] StableDiffusionControlNetPipeline
-	- all
-	- __call__
-	- enable_attention_slicing
-	- disable_attention_slicing
-	- enable_vae_slicing
-	- disable_vae_slicing
-	- enable_xformers_memory_efficient_attention
-	- disable_xformers_memory_efficient_attention
-	- load_textual_inversion
-
-## StableDiffusionControlNetImg2ImgPipeline
-[[autodoc]] StableDiffusionControlNetImg2ImgPipeline
-	- all
-	- __call__
-	- enable_attention_slicing
-	- disable_attention_slicing
-	- enable_vae_slicing
-	- disable_vae_slicing
-	- enable_xformers_memory_efficient_attention
-	- disable_xformers_memory_efficient_attention
-	- load_textual_inversion
-
-## StableDiffusionControlNetInpaintPipeline
-[[autodoc]] StableDiffusionControlNetInpaintPipeline
-	- all
-	- __call__
-	- enable_attention_slicing
-	- disable_attention_slicing
-	- enable_vae_slicing
-	- disable_vae_slicing
-	- enable_xformers_memory_efficient_attention
-	- disable_xformers_memory_efficient_attention
-	- load_textual_inversion
-
-## FlaxStableDiffusionControlNetPipeline
-[[autodoc]] FlaxStableDiffusionControlNetPipeline
-	- all
-	- __call__
-
--- a/docs/source/en/api/pipelines/controlnet_sdxl.md
+++ b/docs/source/en/api/pipelines/controlnet_sdxl.md
@@ -0,0 +1,46 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# ControlNet with Stable Diffusion XL
+
+ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
+
+With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
+
+The abstract from the paper is:
+
+*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
+
+You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub.
+
+<Tip warning={true}>
+
+🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve!
+
+</Tip>
+
+If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).
+
+<Tip>
+
+Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
+## StableDiffusionXLControlNetPipeline
+[[autodoc]] StableDiffusionXLControlNetPipeline
+	- all
+	- __call__
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/cycle_diffusion.mdx
+++ b/docs/source/en/api/pipelines/cycle_diffusion.mdx
--- a/docs/source/en/api/pipelines/dance_diffusion.mdx
+++ b/docs/source/en/api/pipelines/dance_diffusion.mdx
--- a/docs/source/en/api/pipelines/ddim.mdx
+++ b/docs/source/en/api/pipelines/ddim.mdx
--- a/docs/source/en/api/pipelines/ddpm.mdx
+++ b/docs/source/en/api/pipelines/ddpm.mdx
--- a/docs/source/en/api/pipelines/deepfloyd_if.mdx
+++ b/docs/source/en/api/pipelines/deepfloyd_if.mdx
--- a/docs/source/en/api/pipelines/diffedit.md
+++ b/docs/source/en/api/pipelines/diffedit.md
@@ -0,0 +1,55 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# DiffEdit
+
+[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.
+
+The abstract from the paper is:
+
+*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*
+
+The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
+
+This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️
+
+## Tips 
+
+* The pipeline can generate masks that can be fed into other inpainting pipelines.
+* To generate an image with this pipeline, you _must_ provide both an image mask (generated with [`~StableDiffusionDiffEditPipeline.generate_mask`] from source and target prompts that are either manually specified or automatically generated)
+and a set of partially inverted latents (generated with [`~StableDiffusionDiffEditPipeline.invert`]) as arguments when calling the pipeline to generate the final edited image.
+* The function [`~StableDiffusionDiffEditPipeline.generate_mask`] exposes two prompt arguments, `source_prompt` and `target_prompt`,
+that let you control the locations of the semantic edits in the final image to be generated. Let's say
+you want to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
+this in the generated mask, you simply have to pass a phrase including "cat" to
+`source_prompt` and one including "dog" to `target_prompt`.
+* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the
+overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
+source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives.
+* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
+and the target concept to `prompt`. Taking the above example, you simply have to pass a phrase including
+"cat" to `negative_prompt` and one including "dog" to `prompt`.
+* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
+    * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
+    * Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog".
+    * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
+* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](/using-diffusers/diffedit) guide for more details.
+
+## StableDiffusionDiffEditPipeline
+[[autodoc]] StableDiffusionDiffEditPipeline
+    - all
+    - generate_mask
+    - invert
+    - __call__
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
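
The prompt-direction tips in the added `diffedit.md` can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `EditDirection` and its methods are hypothetical helpers that do not exist in diffusers, and only the argument names (`source_prompt`, `target_prompt`, `prompt`, `negative_prompt`) are taken from the tips. The sketch shows which concept goes to which argument for a "source -> target" edit, and how reversing the direction swaps them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditDirection:
    """Hypothetical helper (not part of diffusers): one "source -> target" edit."""

    source: str  # concept currently in the image, e.g. "cat"
    target: str  # concept to edit towards, e.g. "dog"

    def reversed(self) -> "EditDirection":
        # "cat -> dog" becomes "dog -> cat".
        return EditDirection(source=self.target, target=self.source)

    def mask_kwargs(self) -> dict:
        # Prompt arguments for `generate_mask`, as described in the tips.
        return {"source_prompt": self.source, "target_prompt": self.target}

    def invert_kwargs(self) -> dict:
        # `invert` is guided by a description of the input image;
        # the tips note that the source concept is usually descriptive enough.
        return {"prompt": self.source}

    def call_kwargs(self) -> dict:
        # Final pipeline call: target concept as `prompt`,
        # source concept as `negative_prompt`.
        return {"prompt": self.target, "negative_prompt": self.source}


cat_to_dog = EditDirection(source="cat", target="dog")
dog_to_cat = cat_to_dog.reversed()
```

With a loaded pipeline, these dictionaries would presumably be unpacked into the corresponding calls, for example `pipeline.generate_mask(image=raw_image, **cat_to_dog.mask_kwargs())`, where `raw_image` stands for an input image you have already loaded.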
--- a/docs/source/en/api/pipelines/diffedit.mdx
+++ b/docs/source/en/api/pipelines/diffedit.mdx
@@ -1,348 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# DiffEdit
-
-[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.
-
-The abstract from the paper is:
-
-*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*
-
-The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
-
-This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️
-
-## Tips 
-
-* The pipeline can generate masks that can be fed into other inpainting pipelines. Check out the code examples below to know more.
-* In order to generate an image using this pipeline, both an image mask (manually specified or generated using `generate_mask`)
-and a set of partially inverted latents (generated using `invert`) _must_ be provided as arguments when calling the pipeline to generate the final edited image.
-Refer to the code examples below for more details.
-* The function `generate_mask` exposes two prompt arguments, `source_prompt` and `target_prompt`,
-that let you control the locations of the semantic edits in the final image to be generated. Let's say,
-you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
-this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to
-`source_prompt_embeds` and "dog" to `target_prompt_embeds`. Refer to the code example below for more details.
-* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the
-overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
-source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives.
-Please refer to [this code example](#generating-image-captions-for-inversion) for more details.
-* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
-and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to
-the phrases including "cat" to `negative_prompt_embeds` and "dog" to `prompt_embeds`. Refer to the code example
-below for more details.
-* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
-    * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
-    * Change the input prompt for `invert` to include "dog".
-    * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
-* Note that the source and target prompts, or their corresponding embeddings, can also be automatically generated. Please, refer to [this discussion](#generating-source-and-target-embeddings) for more details.
-
-## Usage example
-
-### Based on an input image with a caption
-
-When the pipeline is conditioned on an input image, we first obtain partially inverted latents from the input image using a
-`DDIMInverseScheduler` with the help of a caption. Then we generate an editing mask to identify relevant regions in the image using the source and target prompts. Finally, 
-the inverted noise and generated mask are used to start the generation process. 
-
-First, let's load our pipeline: 
-
-```py
-import torch
-from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
-
-sd_model_ckpt = "stabilityai/stable-diffusion-2-1"
-pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
-    sd_model_ckpt,
-    torch_dtype=torch.float16,
-    safety_checker=None,
-)
-pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
-pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
-pipeline.enable_model_cpu_offload()
-pipeline.enable_vae_slicing()
-generator = torch.manual_seed(0)
-```
-
-Then, we load an input image to edit using our method: 
-
-```py
-from diffusers.utils import load_image
-
-img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
-```
-
-Then, we employ the source and target prompts to generate the editing mask:
-
-```py
-# See the "Generating source and target embeddings" section below to
-# automate the generation of these captions with a pre-trained model like Flan-T5 as explained below.
-
-source_prompt = "a bowl of fruits"
-target_prompt = "a basket of fruits"
-mask_image = pipeline.generate_mask(
-    image=raw_image,
-    source_prompt=source_prompt,
-    target_prompt=target_prompt,
-    generator=generator,
-)
-```
-
-Then, we employ the caption and the input image to get the inverted latents: 
-
-```py 
-inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image, generator=generator).latents
-```
-
-Now, generate the image with the inverted latents and semantically generated mask: 
-
-```py
-image = pipeline(
-    prompt=target_prompt,
-    mask_image=mask_image,
-    image_latents=inv_latents,
-    generator=generator,
-    negative_prompt=source_prompt,
-).images[0]
-image.save("edited_image.png")
-```
-
-## Generating image captions for inversion
-
-The authors originally used the source concept prompt as the caption for generating the partially inverted latents. However, we can also leverage open source and public image captioning models for the same purpose.
-Below, we provide an end-to-end example with the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model
-for generating captions.
-
-First, let's load our automatic image captioning model:
-
-```py
-import torch
-from transformers import BlipForConditionalGeneration, BlipProcessor
-
-captioner_id = "Salesforce/blip-image-captioning-base"
-processor = BlipProcessor.from_pretrained(captioner_id)
-model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)
-```
-
-Then, we define a utility to generate captions from an input image using the model:
-
-```py
-@torch.no_grad()
-def generate_caption(images, caption_generator, caption_processor):
-    text = "a photograph of"
-
-    inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
-    caption_generator.to("cuda")
-    outputs = caption_generator.generate(**inputs, max_new_tokens=128)
-
-    # offload caption generator
-    caption_generator.to("cpu")
-
-    caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
-    return caption
-```
-
-Then, we load an input image for conditioning and obtain a suitable caption for it: 
-
-```py
-from diffusers.utils import load_image
-
-img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
-caption = generate_caption(raw_image, model, processor)
-```
-
-Then, we employ the generated caption and the input image to get the inverted latents: 
-
-```py
-from diffusers import DDIMInverseScheduler, DDIMScheduler
-
-pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
-)
-pipeline = pipeline.to("cuda")
-pipeline.enable_model_cpu_offload()
-pipeline.enable_vae_slicing()
-
-pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
-pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
-
-generator = torch.manual_seed(0)
-inv_latents = pipeline.invert(prompt=caption, image=raw_image, generator=generator).latents
-```
-
-Now, generate the image with the inverted latents and semantically generated mask from our source and target prompts: 
-
-```py
-source_prompt = "a bowl of fruits"
-target_prompt = "a basket of fruits"
-
-mask_image = pipeline.generate_mask(
-    image=raw_image,
-    source_prompt=source_prompt,
-    target_prompt=target_prompt,
-    generator=generator,
-)
-
-image = pipeline(
-    prompt=target_prompt,
-    mask_image=mask_image,
-    image_latents=inv_latents,
-    generator=generator,
-    negative_prompt=source_prompt,
-).images[0]
-image.save("edited_image.png")
-```
-
-## Generating source and target embeddings 
-
-The authors originally required the user to manually provide the source and target prompts for discovering
-edit directions. However, we can also leverage open source and public models for the same purpose.
-Below, we provide an end-to-end example with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model
-for generating source and target embeddings.
-
-**1. Load the generation model**:
-
-```py
-import torch
-from transformers import AutoTokenizer, T5ForConditionalGeneration
-
-tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
-model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
-```
-
-**2. Construct a starting prompt**: 
-
-```py
-source_concept = "bowl"
-target_concept = "basket"
-
-# Parentheses are needed so the two string literals are concatenated into one prompt.
-source_text = (
-    f"Provide a caption for images containing a {source_concept}. "
-    "The captions should be in English and should be no longer than 150 characters."
-)
-
-target_text = (
-    f"Provide a caption for images containing a {target_concept}. "
-    "The captions should be in English and should be no longer than 150 characters."
-)
-```
-
-Here, we're interested in the "bowl -> basket" direction. 
-
-**3. Generate prompts**:
-
-We can use a utility like so for this purpose. 
-
-```py
-@torch.no_grad()
-def generate_prompts(input_prompt):
-    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
-
-    outputs = model.generate(
-        input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
-    )
-    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
-```
-
-And then we just call it to generate our prompts:
-
-```py
-source_prompts = generate_prompts(source_text)
-target_prompts = generate_prompts(target_text)
-```
-
-We encourage you to play around with the different parameters supported by the
-`generate()` method ([documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.generation_tf_utils.TFGenerationMixin.generate)) for the generation quality you are looking for.
-
-**4. Load the embedding model**: 
-
-Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.
-
-```py 
-from diffusers import StableDiffusionDiffEditPipeline 
-
-pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
-)
-pipeline = pipeline.to("cuda")
-pipeline.enable_model_cpu_offload()
-pipeline.enable_vae_slicing()
-
-generator = torch.manual_seed(0)
-```
-
-**5. Compute embeddings**:
-
-```py 
-import torch 
-
-@torch.no_grad()
-def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
-    embeddings = []
-    for sent in sentences:
-        text_inputs = tokenizer(
-            sent,
-            padding="max_length",
-            max_length=tokenizer.model_max_length,
-            truncation=True,
-            return_tensors="pt",
-        )
-        text_input_ids = text_inputs.input_ids
-        prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
-        embeddings.append(prompt_embeds)
-    return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)
-
-source_embeddings = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
-target_embeddings = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder)
-```
-
-And you're done! Now, you can use these embeddings directly while calling the pipeline: 
-
-```py
-from diffusers import DDIMInverseScheduler, DDIMScheduler
-from diffusers.utils import load_image
-
-pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
-pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
-
-img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
-raw_image = load_image(img_url).convert("RGB").resize((768, 768))
-
-
-mask_image = pipeline.generate_mask(
-    image=raw_image,
-    source_prompt_embeds=source_embeddings,
-    target_prompt_embeds=target_embeddings,
-    generator=generator,
-)
-
-inv_latents = pipeline.invert(
-    prompt_embeds=source_embeddings,
-    image=raw_image,
-    generator=generator,
-).latents
-
-images = pipeline(
-    mask_image=mask_image,
-    image_latents=inv_latents,
-    prompt_embeds=target_embeddings,
-    negative_prompt_embeds=source_embeddings,
-    generator=generator,
-).images
-images[0].save("edited_image.png")
-```
-
-## StableDiffusionDiffEditPipeline
-[[autodoc]] StableDiffusionDiffEditPipeline
-    - all
-    - generate_mask
-    - invert
-    - __call__
--- a/docs/source/en/api/pipelines/dit.mdx
+++ b/docs/source/en/api/pipelines/dit.mdx
--- a/docs/source/en/api/pipelines/kandinsky.mdx
+++ b/docs/source/en/api/pipelines/kandinsky.mdx
@@ -105,6 +105,30 @@ One cheeseburger monster coming up! Enjoy!

 ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/cheeseburger.png)

+<Tip>
+
+We also provide an end-to-end Kandinsky pipeline, [`KandinskyCombinedPipeline`], which combines the prior pipeline and the text-to-image pipeline and lets you perform inference in a single step. You can create the combined pipeline with the [`~AutoPipelineForText2Image.from_pretrained`] method:
+
+```python
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipe = AutoPipelineForText2Image.from_pretrained(
+    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
+)
+pipe.enable_model_cpu_offload()
+```
+
+Under the hood, it will automatically load both [`KandinskyPriorPipeline`] and [`KandinskyPipeline`]. To generate images, you no longer need to call both pipelines and pass the outputs from one to another. You only need to call the combined pipeline once. You can set different `guidance_scale` and `num_inference_steps` for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` arguments.
+
+```python
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality"
+
+image = pipe(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
+```
+</Tip>
+
 The Kandinsky model works extremely well with creative prompts. Here is some of the amazing art that can be created using the exact same process but with different prompts.

 ```python
@@ -187,6 +211,34 @@ out.images[0].save("fantasy_land.png")
 ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/img2img_fantasyland.png)


+<Tip>
+
+You can also use the [`KandinskyImg2ImgCombinedPipeline`] for end-to-end image-to-image generation with Kandinsky 2.1:
+
+```python
+from diffusers import AutoPipelineForImage2Image
+import torch
+import requests
+from io import BytesIO
+from PIL import Image
+import os
+
+pipe = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+prompt = "A fantasy landscape, Cinematic lighting"
+negative_prompt = "low quality, bad quality"
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+ 
+response = requests.get(url)
+original_image = Image.open(BytesIO(response.content)).convert("RGB")
+original_image.thumbnail((768, 768))
+
+image = pipe(prompt=prompt, image=original_image, strength=0.3).images[0]
+```
+</Tip>
+
 ### Text Guided Inpainting Generation

 You can use [`KandinskyInpaintPipeline`] to edit images. In this example, we will add a hat to the portrait of a cat.
@@ -212,9 +264,9 @@ init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
 )

-mask = np.ones((768, 768), dtype=np.float32)
+mask = np.zeros((768, 768), dtype=np.float32)
 # Let's mask out an area above the cat's head
-mask[:250, 250:-250] = 0
+mask[:250, 250:-250] = 1

 out = pipe(
    prompt,
@@ -231,6 +283,33 @@ image.save("cat_with_hat.png")
 ```
 ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/inpaint_cat_hat.png)

+<Tip>
+
+To use the [`KandinskyInpaintCombinedPipeline`] for end-to-end image inpainting, you can run the code below instead:
+
+```python
+from diffusers import AutoPipelineForInpainting
+
+pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
+```
+</Tip>
+
+🚨🚨🚨 __Breaking change for Kandinsky Mask Inpainting__ 🚨🚨🚨
+
+We introduced a breaking change for the Kandinsky inpainting pipeline in the following pull request: https://github.com/huggingface/diffusers/pull/4207. Previously we accepted a mask format where black pixels represent the masked-out area. This is inconsistent with all other pipelines in diffusers. We have changed the mask format in Kandinsky and now use white pixels instead.
+Please upgrade your inpainting code to follow the above. If you are using Kandinsky Inpaint in production, you now need to change the mask to:
+
+```python
+# For PIL input
+import PIL.ImageOps
+mask = PIL.ImageOps.invert(mask)
+
+# For PyTorch and Numpy input
+mask = 1 - mask
+```
+
 ### Interpolate 

 The [`KandinskyPriorPipeline`] also comes with a cool utility function that will allow you to interpolate the latent space of different images and texts super easily. Here is an example of how you can create an Impressionist-style portrait for your pet based on "The Starry Night". 
@@ -276,208 +355,6 @@ image.save("starry_cat.png")
 ```
 ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/starry_cat.png)

-
-### Text-to-Image Generation with ControlNet Conditioning
-
-In the following, we give a simple example of how to use [`KandinskyV22ControlnetPipeline`] to add control to the text-to-image generation with a depth image.
-
-First, let's take an image and extract its depth map.
-
-```python
-from diffusers.utils import load_image
-
-img = load_image(
-    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
-).resize((768, 768))
-```
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png)
-
-We can use the `depth-estimation` pipeline from transformers to process the image and retrieve its depth map.
-
-```python
-import torch
-import numpy as np
-
-from transformers import pipeline
-from diffusers.utils import load_image
-
-
-def make_hint(image, depth_estimator):
-    image = depth_estimator(image)["depth"]
-    image = np.array(image)
-    image = image[:, :, None]
-    image = np.concatenate([image, image, image], axis=2)
-    detected_map = torch.from_numpy(image).float() / 255.0
-    hint = detected_map.permute(2, 0, 1)
-    return hint
-
-
-depth_estimator = pipeline("depth-estimation")
-hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
-```
-Now, we load the prior pipeline and the text-to-image controlnet pipeline
-
-```python
-from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
-
-pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
-    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
-)
-pipe_prior = pipe_prior.to("cuda")
-
-pipe = KandinskyV22ControlnetPipeline.from_pretrained(
-    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
-)
-pipe = pipe.to("cuda")
-```
-
-We pass the prompt and negative prompt through the prior to generate image embeddings
-
-```python
-prompt = "A robot, 4k photo"
-
-negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
-
-generator = torch.Generator(device="cuda").manual_seed(43)
-image_emb, zero_image_emb = pipe_prior(
-    prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
-).to_tuple()
-```
-
-Now we can pass the image embeddings and the depth image we extracted to the controlnet pipeline. With Kandinsky 2.2, only prior pipelines accept `prompt` input. You do not need to pass the prompt to the controlnet pipeline.
-
-```python
-images = pipe(
-    image_embeds=image_emb,
-    negative_image_embeds=zero_image_emb,
-    hint=hint,
-    num_inference_steps=50,
-    generator=generator,
-    height=768,
-    width=768,
-).images
-
-images[0].save("robot_cat.png")
-```
-
-The output image looks as follow:
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat_text2img.png)
-
-### Image-to-Image Generation with ControlNet Conditioning
-
-Kandinsky 2.2 also includes a [`KandinskyV22ControlnetImg2ImgPipeline`] that will allow you to add control to the image generation process with both the image and its depth map. This pipeline works really well with [`KandinskyV22PriorEmb2EmbPipeline`], which generates image embeddings based on both a text prompt and an image. 
-
-For our robot cat example, we will pass the prompt and cat image together to the prior pipeline to generate an image embedding. We will then use that image embedding and the depth map of the cat to further control the image generation process. 
-
-We can use the same cat image and its depth map from the last example.
-
-```python
-import torch
-import numpy as np
-
-from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
-from diffusers.utils import load_image
-from transformers import pipeline
-
-img = load_image(
-    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png"
-).resize((768, 768))
-
-
-def make_hint(image, depth_estimator):
-    image = depth_estimator(image)["depth"]
-    image = np.array(image)
-    image = image[:, :, None]
-    image = np.concatenate([image, image, image], axis=2)
-    detected_map = torch.from_numpy(image).float() / 255.0
-    hint = detected_map.permute(2, 0, 1)
-    return hint
-
-
-depth_estimator = pipeline("depth-estimation")
-hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
-
-pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
-    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
-)
-pipe_prior = pipe_prior.to("cuda")
-
-pipe = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
-    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
-)
-pipe = pipe.to("cuda")
-
-prompt = "A robot, 4k photo"
-negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
-
-generator = torch.Generator(device="cuda").manual_seed(43)
-
-# run prior pipeline
-
-img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator)
-negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
-
-# run controlnet img2img pipeline
-images = pipe(
-    image=img,
-    strength=0.5,
-    image_embeds=img_emb.image_embeds,
-    negative_image_embeds=negative_emb.image_embeds,
-    hint=hint,
-    num_inference_steps=50,
-    generator=generator,
-    height=768,
-    width=768,
-).images
-
-images[0].save("robot_cat.png")
-```
-
-Here is the output. Compared with the output from our text-to-image controlnet example, it kept a lot more cat facial details from the original image and worked into the robot style we asked for.
-
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat.png)
-
-## Kandinsky 2.2
-
-The Kandinsky 2.2 release includes robust new text-to-image models that support text-to-image generation, image-to-image generation, image interpolation, and text-guided image inpainting. The general workflow to perform these tasks using Kandinsky 2.2 is the same as in Kandinsky 2.1. First, you will need to use a prior pipeline to generate image embeddings based on your text prompt, and then use one of the image decoding pipelines to generate the output image. The only difference is that in Kandinsky 2.2, all of the decoding pipelines no longer accept the `prompt` input, and the image generation process is conditioned with only `image_embeds` and `negative_image_embeds`.
-
-Let's look at an example of how to perform text-to-image generation using Kandinsky 2.2.
-
-First, let's create the prior pipeline and text-to-image pipeline with Kandinsky 2.2 checkpoints.
-
-```python
-from diffusers import DiffusionPipeline
-import torch
-
-pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16)
-pipe_prior.to("cuda")
-
-t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
-t2i_pipe.to("cuda")
-```
-
-You can then use `pipe_prior` to generate image embeddings.
-
-```python
-prompt = "portrait of a women, blue eyes, cinematic"
-negative_prompt = "low quality, bad quality"
-
-image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
-```
-
-Now you can pass these embeddings to the text-to-image pipeline. When using Kandinsky 2.2 you don't need to pass the `prompt` (but you do with the previous version, Kandinsky 2.1).
-
-```
-image = t2i_pipe(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[
-    0
-]
-image.save("portrait.png")
-```
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/%20blue%20eyes.png)
-
-We used the text-to-image pipeline as an example, but the same process applies to all decoding pipelines in Kandinsky 2.2. For more information, please refer to our API section for each pipeline.
-
-
 ## Optimization

 Running Kandinsky in inference requires running both a first prior pipeline: [`KandinskyPriorPipeline`]
@@ -530,64 +407,24 @@ t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=T
 After compilation you should see a very fast inference time. For more information,
 feel free to have a look at [Our PyTorch 2.0 benchmark](https://huggingface.co/docs/diffusers/main/en/optimization/torch2.0).

+<Tip>
+
+To generate images directly from a single pipeline, you can use [`KandinskyCombinedPipeline`], [`KandinskyImg2ImgCombinedPipeline`], or [`KandinskyInpaintCombinedPipeline`].
+Each combined pipeline wraps the [`KandinskyPriorPipeline`] together with [`KandinskyPipeline`], [`KandinskyImg2ImgPipeline`], or [`KandinskyInpaintPipeline`], respectively, in a single
+pipeline for a simpler user experience.
+
+</Tip>
+
 ## Available Pipelines:

 | Pipeline | Tasks |
 |---|---|
-| [pipeline_kandinsky2_2.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2.py) | *Text-to-Image Generation* |
 | [pipeline_kandinsky.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py) | *Text-to-Image Generation* |
-| [pipeline_kandinsky2_2_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_inpaint.py) | *Image-Guided Image Generation* |
+| [pipeline_kandinsky_combined.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_combined.py) | *End-to-End Text-to-Image, Image-to-Image, and Inpainting Generation* |
 | [pipeline_kandinsky_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py) | *Image-Guided Image Generation* |
-| [pipeline_kandinsky2_2_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_img2img.py) | *Image-Guided Image Generation* |
 | [pipeline_kandinsky_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_img2img.py) | *Image-Guided Image Generation* |
-| [pipeline_kandinsky2_2_controlnet.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet.py) | *Image-Guided Image Generation* |
-| [pipeline_kandinsky2_2_controlnet_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet_img2img.py) | *Image-Guided Image Generation* |


-### KandinskyV22Pipeline
-
-[[autodoc]] KandinskyV22Pipeline
-	- all
-	- __call__
-
-### KandinskyV22ControlnetPipeline
-
-[[autodoc]] KandinskyV22ControlnetPipeline
-	- all
-	- __call__
-
-### KandinskyV22ControlnetImg2ImgPipeline
-
-[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
-	- all
-	- __call__
-
-### KandinskyV22Img2ImgPipeline
-
-[[autodoc]] KandinskyV22Img2ImgPipeline
-	- all
-	- __call__
-
-### KandinskyV22InpaintPipeline
-
-[[autodoc]] KandinskyV22InpaintPipeline
-	- all
-	- __call__
-
-### KandinskyV22PriorPipeline
-
-[[autodoc]] ## KandinskyV22PriorPipeline
-	- all
-	- __call__
-	- interpolate
-
-### KandinskyV22PriorEmb2EmbPipeline
-
-[[autodoc]] KandinskyV22PriorEmb2EmbPipeline
-	- all
-	- __call__
-	- interpolate
-
 ### KandinskyPriorPipeline

 [[autodoc]] KandinskyPriorPipeline
@@ -612,3 +449,21 @@ feel free to have a look at [Our PyTorch 2.0 benchmark](https://huggingface.co/d
 [[autodoc]] KandinskyInpaintPipeline
 	- all
 	- __call__
+
+### KandinskyCombinedPipeline
+
+[[autodoc]] KandinskyCombinedPipeline
+	- all
+	- __call__
+
+### KandinskyImg2ImgCombinedPipeline
+
+[[autodoc]] KandinskyImg2ImgCombinedPipeline
+	- all
+	- __call__
+
+### KandinskyInpaintCombinedPipeline
+
+[[autodoc]] KandinskyInpaintCombinedPipeline
+	- all
+	- __call__
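The mask convention change described in the Kandinsky inpainting hunk above can be checked with a few lines of NumPy. This is a minimal sketch, not part of the PR: the helper name `to_white_is_masked` is hypothetical, and the array shape and masked region are simply the ones used in the cat-with-hat example. It shows that applying `1 - mask` to a mask built under the old convention gives the same array as building the mask directly under the new convention.

```python
import numpy as np


def to_white_is_masked(old_mask: np.ndarray) -> np.ndarray:
    """Hypothetical helper: flip a float mask so that 1 marks the area to repaint."""
    return 1.0 - old_mask


# Old convention: ones everywhere, zeros over the area to repaint.
old_mask = np.ones((768, 768), dtype=np.float32)
old_mask[:250, 250:-250] = 0

# New convention: zeros everywhere, ones over the area to repaint.
new_mask = to_white_is_masked(old_mask)
```

The same inversion applies to PyTorch tensors; for PIL images the diff uses `PIL.ImageOps.invert` instead.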
--- a/docs/source/en/api/pipelines/kandinsky_v22.md
+++ b/docs/source/en/api/pipelines/kandinsky_v22.md
@@ -0,0 +1,357 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Kandinsky 2.2
+
+The Kandinsky 2.2 release includes robust new text-to-image models that support text-to-image generation, image-to-image generation, image interpolation, and text-guided image inpainting. The general workflow to perform these tasks using Kandinsky 2.2 is the same as in Kandinsky 2.1. First, you will need to use a prior pipeline to generate image embeddings based on your text prompt, and then use one of the image decoding pipelines to generate the output image. The only difference is that in Kandinsky 2.2, all of the decoding pipelines no longer accept the `prompt` input, and the image generation process is conditioned with only `image_embeds` and `negative_image_embeds`.
+
+As with Kandinsky 2.1, the easiest way to perform text-to-image generation is to use the combined Kandinsky pipeline. The process is the same; all you need to do is replace the Kandinsky 2.1 checkpoint with the 2.2 checkpoint.
+
+```python
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality"
+
+image = pipe(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, height=768, width=768).images[0]
+```
+
+Now, let's look at an example where we take separate steps to run the prior pipeline and text-to-image pipeline. This way, we can understand what's happening under the hood and how Kandinsky 2.2 differs from Kandinsky 2.1.
+
+First, let's create the prior pipeline and text-to-image pipeline with Kandinsky 2.2 checkpoints.
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16)
+pipe_prior.to("cuda")
+
+t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
+t2i_pipe.to("cuda")
+```
+
+You can then use `pipe_prior` to generate image embeddings.
+
+```python
+prompt = "portrait of a women, blue eyes, cinematic"
+negative_prompt = "low quality, bad quality"
+
+image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
+```
+
+Now you can pass these embeddings to the text-to-image pipeline. When using Kandinsky 2.2 you don't need to pass the `prompt` (but you do with the previous version, Kandinsky 2.1).
+
+```python
+image = t2i_pipe(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[
+    0
+]
+image.save("portrait.png")
+```
+![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/%20blue%20eyes.png)
+
+We used the text-to-image pipeline as an example, but the same process applies to all decoding pipelines in Kandinsky 2.2. For more information, please refer to our API section for each pipeline.
+
+### Text-to-Image Generation with ControlNet Conditioning
+
+In the following, we give a simple example of how to use [`KandinskyV22ControlnetPipeline`] to add control to the text-to-image generation with a depth image.
+
+First, let's take an image and extract its depth map.
+
+```python
+from diffusers.utils import load_image
+
+img = load_image(
+    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
+).resize((768, 768))
+```
+![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png)
+
+We can use the `depth-estimation` pipeline from transformers to process the image and retrieve its depth map.
+
+```python
+import torch
+import numpy as np
+
+from transformers import pipeline
+from diffusers.utils import load_image
+
+
+def make_hint(image, depth_estimator):
+    image = depth_estimator(image)["depth"]
+    image = np.array(image)
+    image = image[:, :, None]
+    image = np.concatenate([image, image, image], axis=2)
+    detected_map = torch.from_numpy(image).float() / 255.0
+    hint = detected_map.permute(2, 0, 1)
+    return hint
+
+
+depth_estimator = pipeline("depth-estimation")
+hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
+```
+Now, we load the prior pipeline and the text-to-image ControlNet pipeline:
+
+```python
+from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
+
+pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
+    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+)
+pipe_prior = pipe_prior.to("cuda")
+
+pipe = KandinskyV22ControlnetPipeline.from_pretrained(
+    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
+)
+pipe = pipe.to("cuda")
+```
+
+We pass the prompt and negative prompt through the prior to generate image embeddings:
+
+```python
+prompt = "A robot, 4k photo"
+
+negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
+
+generator = torch.Generator(device="cuda").manual_seed(43)
+image_emb, zero_image_emb = pipe_prior(
+    prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
+).to_tuple()
+```
+
+Now we can pass the image embeddings and the depth image we extracted to the controlnet pipeline. With Kandinsky 2.2, only prior pipelines accept `prompt` input. You do not need to pass the prompt to the controlnet pipeline.
+
+```python
+images = pipe(
+    image_embeds=image_emb,
+    negative_image_embeds=zero_image_emb,
+    hint=hint,
+    num_inference_steps=50,
+    generator=generator,
+    height=768,
+    width=768,
+).images
+
+images[0].save("robot_cat.png")
+```
+
+The output image looks as follows:
+![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat_text2img.png)
+
+### Image-to-Image Generation with ControlNet Conditioning
+
+Kandinsky 2.2 also includes a [`KandinskyV22ControlnetImg2ImgPipeline`] that will allow you to add control to the image generation process with both the image and its depth map. This pipeline works really well with [`KandinskyV22PriorEmb2EmbPipeline`], which generates image embeddings based on both a text prompt and an image. 
+
+For our robot cat example, we will pass the prompt and cat image together to the prior pipeline to generate an image embedding. We will then use that image embedding and the depth map of the cat to further control the image generation process. 
+
+We can use the same cat image and its depth map from the last example.
+
+```python
+import torch
+import numpy as np
+
+from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
+from diffusers.utils import load_image
+from transformers import pipeline
+
+img = load_image(
+    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png"
+).resize((768, 768))
+
+
+def make_hint(image, depth_estimator):
+    image = depth_estimator(image)["depth"]
+    image = np.array(image)
+    image = image[:, :, None]
+    image = np.concatenate([image, image, image], axis=2)
+    detected_map = torch.from_numpy(image).float() / 255.0
+    hint = detected_map.permute(2, 0, 1)
+    return hint
+
+
+depth_estimator = pipeline("depth-estimation")
+hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
+
+pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
+    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
+)
+pipe_prior = pipe_prior.to("cuda")
+
+pipe = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
+    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
+)
+pipe = pipe.to("cuda")
+
+prompt = "A robot, 4k photo"
+negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
+
+generator = torch.Generator(device="cuda").manual_seed(43)
+
+# run prior pipeline
+
+img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator)
+negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
+
+# run controlnet img2img pipeline
+images = pipe(
+    image=img,
+    strength=0.5,
+    image_embeds=img_emb.image_embeds,
+    negative_image_embeds=negative_emb.image_embeds,
+    hint=hint,
+    num_inference_steps=50,
+    generator=generator,
+    height=768,
+    width=768,
+).images
+
+images[0].save("robot_cat.png")
+```
+
+Here is the output. Compared with the output from our text-to-image ControlNet example, it keeps many more of the cat's facial details from the original image while working them into the robot style we asked for.
+
+![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat.png)
+
+## Optimization
+
+Running Kandinsky in inference requires running both a first prior pipeline: [`KandinskyPriorPipeline`]
+and a second image decoding pipeline which is one of [`KandinskyPipeline`], [`KandinskyImg2ImgPipeline`], or [`KandinskyInpaintPipeline`].
+
+The bulk of the computation time is always spent in the second, image decoding pipeline, so when 
+optimizing the model, focus on that second pipeline.
+
+When running with PyTorch < 2.0, we strongly recommend making use of [`xformers`](https://github.com/facebookresearch/xformers)
+to speed up inference. This can be done by simply running:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+t2i_pipe.enable_xformers_memory_efficient_attention()
+```
+
+When running on PyTorch >= 2.0, PyTorch's SDPA attention will automatically be used. For more information on 
+PyTorch's SDPA, feel free to have a look at [this blog post](https://pytorch.org/blog/accelerated-diffusers-pt-20/).
+
+To have explicit control, you can also manually set the pipeline to use PyTorch 2.0's efficient attention:
+
+```py
+from diffusers.models.attention_processor import AttnAddedKVProcessor2_0
+
+t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
+```
+
+The slowest and most memory-intensive attention processor is the default `AttnAddedKVProcessor` processor.
+We do **not** recommend using it except for testing purposes or cases where highly deterministic behaviour is desired. 
+You can set it with:
+
+```py
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+
+t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+```
+
+With PyTorch >= 2.0, you can also use Kandinsky with `torch.compile`, which, depending 
+on your hardware, can significantly speed up your inference time once the model is compiled.
+To use Kandinsky with `torch.compile`, you can do:
+
+```py
+t2i_pipe.unet.to(memory_format=torch.channels_last)
+t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+After compilation you should see a very fast inference time. For more information,
+feel free to have a look at [Our PyTorch 2.0 benchmark](https://huggingface.co/docs/diffusers/main/en/optimization/torch2.0).
+
+<Tip>
+
+To generate images directly from a single pipeline, you can use [`KandinskyV22CombinedPipeline`], [`KandinskyV22Img2ImgCombinedPipeline`], or [`KandinskyV22InpaintCombinedPipeline`].
+Each combined pipeline wraps the [`KandinskyV22PriorPipeline`] together with [`KandinskyV22Pipeline`], [`KandinskyV22Img2ImgPipeline`], or [`KandinskyV22InpaintPipeline`], respectively, in a single 
+pipeline for a simpler user experience.
+
+</Tip>
+
+## Available Pipelines:
+
+| Pipeline | Tasks |
+|---|---|
+| [pipeline_kandinsky2_2.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2.py) | *Text-to-Image Generation* |
+| [pipeline_kandinsky2_2_combined.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_combined.py) | *End-to-End Text-to-Image, Image-to-Image, and Inpainting Generation* |
+| [pipeline_kandinsky2_2_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_inpaint.py) | *Image-Guided Image Generation* |
+| [pipeline_kandinsky2_2_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_img2img.py) | *Image-Guided Image Generation* |
+| [pipeline_kandinsky2_2_controlnet.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet.py) | *Image-Guided Image Generation* |
+| [pipeline_kandinsky2_2_controlnet_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet_img2img.py) | *Image-Guided Image Generation* |
+
+
+### KandinskyV22Pipeline
+
+[[autodoc]] KandinskyV22Pipeline
+	- all
+	- __call__
+
+### KandinskyV22ControlnetPipeline
+
+[[autodoc]] KandinskyV22ControlnetPipeline
+	- all
+	- __call__
+
+### KandinskyV22ControlnetImg2ImgPipeline
+
+[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
+	- all
+	- __call__
+
+### KandinskyV22Img2ImgPipeline
+
+[[autodoc]] KandinskyV22Img2ImgPipeline
+	- all
+	- __call__
+
+### KandinskyV22InpaintPipeline
+
+[[autodoc]] KandinskyV22InpaintPipeline
+	- all
+	- __call__
+
+### KandinskyV22PriorPipeline
+
+[[autodoc]] KandinskyV22PriorPipeline
+	- all
+	- __call__
+	- interpolate
+
+### KandinskyV22PriorEmb2EmbPipeline
+
+[[autodoc]] KandinskyV22PriorEmb2EmbPipeline
+	- all
+	- __call__
+	- interpolate
+
+### KandinskyV22CombinedPipeline
+
+[[autodoc]] KandinskyV22CombinedPipeline
+	- all
+	- __call__
+
+### KandinskyV22Img2ImgCombinedPipeline
+
+[[autodoc]] KandinskyV22Img2ImgCombinedPipeline
+	- all
+	- __call__
+
+### KandinskyV22InpaintCombinedPipeline
+
+[[autodoc]] KandinskyV22InpaintCombinedPipeline
+	- all
+	- __call__
--- a/docs/source/en/api/pipelines/latent_diffusion.mdx
+++ b/docs/source/en/api/pipelines/latent_diffusion.mdx
--- a/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx
+++ b/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx
--- a/docs/source/en/api/pipelines/model_editing.mdx
+++ b/docs/source/en/api/pipelines/model_editing.mdx
--- a/docs/source/en/api/pipelines/musicldm.md
+++ b/docs/source/en/api/pipelines/musicldm.md
@@ -0,0 +1,57 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# MusicLDM
+
+MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
+MusicLDM takes a text prompt as input and predicts the corresponding music sample. 
+
+Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
+MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
+latents.
+
+MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to 
+the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies 
+encourages the model to interpolate between the training samples, but stay within the domain of the training data. The 
+result is generated music that is more diverse while staying faithful to the corresponding style.
+
+The abstract of the paper is the following:
+
+*In this paper, we present MusicLDM, a state-of-the-art text-to-music model that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, to encourage the model to generate music more diverse while still staying faithful to the corresponding style.*
+
+This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi).
+
+## Tips
+
+When constructing a prompt, keep in mind:
+
+* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
+* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
+
+During inference:
+
+* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; more steps give higher-quality audio at the expense of slower inference.
+* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. The generated waveforms are then scored automatically against the prompt text and ranked from best to worst.
+* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.
+
+<Tip>
+
+Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between 
+scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) 
+section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
+## MusicLDMPipeline
+[[autodoc]] MusicLDMPipeline
+	- all
+	- __call__
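The MusicLDM tips above name four call arguments: a negative prompt, `num_inference_steps`, `audio_length_in_s`, and `num_waveforms_per_prompt`. The following is a minimal sketch of how they might be collected and sanity-checked before a pipeline call. The helper names `build_musicldm_kwargs` and `generate_music` are hypothetical, the default values are illustrative rather than the library's defaults, and the checkpoint id `ucsd-reach/musicldm` and the `.audios` output attribute are assumptions that should be checked against the `MusicLDMPipeline` reference.

```python
def build_musicldm_kwargs(
    prompt,
    negative_prompt="low quality, average quality",
    num_inference_steps=200,  # illustrative value, not the library default
    audio_length_in_s=10.0,   # illustrative value, not the library default
    num_waveforms_per_prompt=1,
):
    """Collect MusicLDM call arguments and reject obviously invalid values."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    if audio_length_in_s <= 0:
        raise ValueError("audio_length_in_s must be positive")
    if num_waveforms_per_prompt < 1:
        raise ValueError("num_waveforms_per_prompt must be at least 1")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": num_inference_steps,
        "audio_length_in_s": audio_length_in_s,
        "num_waveforms_per_prompt": num_waveforms_per_prompt,
    }


def generate_music(call_kwargs, repo_id="ucsd-reach/musicldm", device="cuda"):
    """Run the pipeline; repo_id and the .audios attribute are assumptions."""
    # Imported lazily so the helper above works without torch or diffusers.
    import torch
    from diffusers import MusicLDMPipeline

    pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)
    return pipe(**call_kwargs).audios


# Ask for three candidate waveforms so they can be ranked against the prompt.
kwargs = build_musicldm_kwargs(
    "melodic techno with a fast beat and synths",
    num_waveforms_per_prompt=3,
)
print(kwargs)
```

Keeping the argument validation separate from the pipeline call means the settings can be checked on a machine without a GPU; `generate_music(kwargs)` is only needed where the model weights can actually be loaded.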
--- a/docs/source/en/api/pipelines/overview.mdx
+++ b/docs/source/en/api/pipelines/overview.mdx
@@ -34,3 +34,7 @@ Pipelines do not offer any training functionality. You'll notice PyTorch's autog
 ## FlaxDiffusionPipeline

 [[autodoc]] pipelines.pipeline_flax_utils.FlaxDiffusionPipeline
+
+## PushToHubMixin
+
+[[autodoc]] utils.PushToHubMixin
--- a/docs/source/en/api/pipelines/paint_by_example.mdx
+++ b/docs/source/en/api/pipelines/paint_by_example.mdx
--- a/docs/source/en/api/pipelines/panorama.mdx
+++ b/docs/source/en/api/pipelines/panorama.mdx
--- a/docs/source/en/api/pipelines/paradigms.mdx
+++ b/docs/source/en/api/pipelines/paradigms.mdx
--- a/docs/source/en/api/pipelines/pix2pix.mdx
+++ b/docs/source/en/api/pipelines/pix2pix.mdx
@@ -35,4 +35,12 @@ Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to le
 	- save_lora_weights

 ## StableDiffusionPipelineOutput
-[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+
+## StableDiffusionXLInstructPix2PixPipeline
+[[autodoc]] StableDiffusionXLInstructPix2PixPipeline
+	- __call__
+	- all
+
+## StableDiffusionXLPipelineOutput
+[[autodoc]] pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput
--- a/docs/source/en/api/pipelines/pix2pix_zero.mdx
+++ b/docs/source/en/api/pipelines/pix2pix_zero.mdx
--- a/docs/source/en/api/pipelines/pndm.mdx
+++ b/docs/source/en/api/pipelines/pndm.mdx
--- a/docs/source/en/api/pipelines/repaint.mdx
+++ b/docs/source/en/api/pipelines/repaint.mdx
--- a/docs/source/en/api/pipelines/score_sde_ve.mdx
+++ b/docs/source/en/api/pipelines/score_sde_ve.mdx
--- a/docs/source/en/api/pipelines/self_attention_guidance.mdx
+++ b/docs/source/en/api/pipelines/self_attention_guidance.mdx
--- a/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx
+++ b/docs/source/en/api/pipelines/semantic_stable_diffusion.mdx
--- a/docs/source/en/api/pipelines/shap_e.md
+++ b/docs/source/en/api/pipelines/shap_e.md
@@ -0,0 +1,37 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Shap-E
+
+The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai).
+
+The abstract from the paper is:
+
+*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.*
+
+The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e).
+
+<Tip>
+
+See the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
+## ShapEPipeline
+[[autodoc]] ShapEPipeline
+	- all
+	- __call__
+
+## ShapEImg2ImgPipeline
+[[autodoc]] ShapEImg2ImgPipeline
+	- all
+	- __call__
+
+## ShapEPipelineOutput
+[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput
--- a/docs/source/en/api/pipelines/shap_e.mdx
+++ b/docs/source/en/api/pipelines/shap_e.mdx
@@ -1,190 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Shap-E
-
-The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai). 
-
-The abstract from the paper is:
-
-*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.*
-
-The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e).
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
-
-## Usage Examples
-
-In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in gif format.
-
-### Text-to-3D image generation 
-
-We can use [`ShapEPipeline`] to create 3D object based on a text prompt. In this example, we will make a birthday cupcake for :firecracker: diffusers library's 1 year birthday. The workflow to use the Shap-E text-to-image pipeline is same as how you would use other text-to-image pipelines in diffusers.
-
-```python
-import torch
-
-from diffusers import DiffusionPipeline
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
-repo = "openai/shap-e"
-pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
-pipe = pipe.to(device)
-
-guidance_scale = 15.0
-prompt = ["A firecracker", "A birthday cupcake"]
-
-images = pipe(
-    prompt,
-    guidance_scale=guidance_scale,
-    num_inference_steps=64,
-    frame_size=256,
-).images
-```
-
-The output of [`ShapEPipeline`] is a list of lists of images frames. Each list of frames can be used to create a 3D object. Let's use the `export_to_gif` utility function in diffusers to make a 3D cupcake!
-
-```python
-from diffusers.utils import export_to_gif
-
-export_to_gif(images[0], "firecracker_3d.gif")
-export_to_gif(images[1], "cake_3d.gif")
-```
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif)
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif)
-
-
-### Image-to-Image generation
-
-You can use [`ShapEImg2ImgPipeline`] along with other text-to-image pipelines in diffusers and turn your 2D generation into 3D. 
-
-In this example, We will first genrate a cheeseburger with a simple prompt "A cheeseburger, white background" 
-
-```python
-from diffusers import DiffusionPipeline
-import torch
-
-pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
-pipe_prior.to("cuda")
-
-t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
-t2i_pipe.to("cuda")
-
-prompt = "A cheeseburger, white background"
-
-image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
-image = t2i_pipe(
-    prompt,
-    image_embeds=image_embeds,
-    negative_image_embeds=negative_image_embeds,
-).images[0]
-
-image.save("burger.png")
-```
-
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png)
-
-we will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger :)
-
-```python
-from PIL import Image
-from diffusers.utils import export_to_gif
-
-repo = "openai/shap-e-img2img"
-pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
-pipe = pipe.to("cuda")
-
-guidance_scale = 3.0
-image = Image.open("burger.png").resize((256, 256))
-
-images = pipe(
-    image,
-    guidance_scale=guidance_scale,
-    num_inference_steps=64,
-    frame_size=256,
-).images
-
-gif_path = export_to_gif(images[0], "burger_3d.gif")
-```
-![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif)
-
-### Generate mesh
-
-For both [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`], you can generate mesh output by passing `output_type` as `mesh` to the pipeline, and then use the [`ShapEPipeline.export_to_ply`] utility function to save the output as a `ply` file. We also provide a [`ShapEPipeline.export_to_obj`] function that you can use to save mesh outputs as `obj` files.
-
-```python
-import torch
-
-from diffusers import DiffusionPipeline
-from diffusers.utils import export_to_ply
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
-repo = "openai/shap-e"
-pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16, variant="fp16")
-pipe = pipe.to(device)
-
-guidance_scale = 15.0
-prompt = "A birthday cupcake"
-
-images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
-
-ply_path = export_to_ply(images[0], "3d_cake.ply")
-print(f"saved to folder: {ply_path}")
-```
-
-Huggingface Datasets supports mesh visualization for mesh files in `glb` format. Below we will show you how to convert your mesh file into `glb` format so that you can use the Dataset viewer to render 3D objects. 
-
-We need to install `trimesh` library.
-
-```
-pip install trimesh
-```
-
-To convert the mesh file into `glb` format, 
-
-```python
-import trimesh
-
-mesh = trimesh.load("3d_cake.ply")
-mesh.export("3d_cake.glb", file_type="glb")
-```
-
-By default, the mesh output of Shap-E is from the bottom viewpoint; you can change the default viewpoint by applying a rotation transformation
-
-```python
-import trimesh
-import numpy as np
-
-mesh = trimesh.load("3d_cake.ply")
-rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
-mesh = mesh.apply_transform(rot)
-mesh.export("3d_cake.glb", file_type="glb")
-```
-
-Now you can upload your mesh file to your dataset and visualize it! Here is the link to the 3D cake we just generated
-https://huggingface.co/datasets/hf-internal-testing/diffusers-images/blob/main/shap_e/3d_cake.glb
-
-## ShapEPipeline
-[[autodoc]] ShapEPipeline
-	- all
-	- __call__
-
-## ShapEImg2ImgPipeline
-[[autodoc]] ShapEImg2ImgPipeline
-	- all
-	- __call__
-
-## ShapEPipelineOutput
-[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput
--- a/docs/source/en/api/pipelines/spectrogram_diffusion.mdx
+++ b/docs/source/en/api/pipelines/spectrogram_diffusion.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/adapter.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/adapter.mdx
@@ -29,10 +29,11 @@ This model was contributed by the community contributor [HimariO](https://github
 | Pipeline | Tasks | Demo
 |---|---|:---:|
 | [StableDiffusionAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning* | -
+| [StableDiffusionXLAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_xl_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning on StableDiffusion-XL* | -

-## Usage example
+## Usage example with the base model of StableDiffusion-1.4/1.5

-In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference.
+In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
 All adapters use the same pipeline.

 1. Images are first converted into the appropriate *control image* format.
@@ -69,7 +70,7 @@ Next, create the adapter pipeline
 import torch
 from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

-adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1")
+adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
 pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    adapter=adapter,
@@ -93,6 +94,62 @@ out_image = pipe(

 ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png)

+## Usage example with the base model of StableDiffusion-XL
+
+In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
+All adapters use the same pipeline.
+
+ 1. Images are first downloaded and converted into the appropriate *control image* format.
+ 2. The *control image* and *prompt* are passed to the [`StableDiffusionXLAdapterPipeline`].
+
+Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
+
+```python
+from diffusers.utils import load_image
+
+sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
+```
+
+![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png)
+
+Then, create the adapter pipeline
+
+```py
+import torch
+from diffusers import (
+    T2IAdapter,
+    StableDiffusionXLAdapterPipeline,
+    DDPMScheduler
+)
+
+model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+# Load the SDXL sketch adapter in half precision to match the pipeline below.
+adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl")
+scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
+
+pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+    model_id, adapter=adapter, safety_checker=None, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
+)
+
+pipe.to("cuda")
+```
+
+Finally, pass the prompt and control image to the pipeline
+
+```py
+# fix the random seed, so you will get the same result as the example
+generator = torch.Generator().manual_seed(42)
+
+sketch_image_out = pipe(
+    prompt="a photo of a dog in real world, high quality", 
+    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality", 
+    image=sketch_image, 
+    generator=generator, 
+    guidance_scale=7.5
+).images[0]
+```
+
+![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png)

 ## Available checkpoints

@@ -113,6 +170,9 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
 |[TencentARC/t2iadapter_depth_sd15v2](https://huggingface.co/TencentARC/t2iadapter_depth_sd15v2)||
 |[TencentARC/t2iadapter_sketch_sd15v2](https://huggingface.co/TencentARC/t2iadapter_sketch_sd15v2)||
 |[TencentARC/t2iadapter_zoedepth_sd15v1](https://huggingface.co/TencentARC/t2iadapter_zoedepth_sd15v1)||
+|[Adapter/t2iadapter, subfolder='sketch_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0)||
+|[Adapter/t2iadapter, subfolder='canny_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/canny_sdxl_1.0)||
+|[Adapter/t2iadapter, subfolder='openpose_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/openpose_sdxl_1.0)||

 ## Combining multiple adapters

@@ -185,3 +245,14 @@ However, T2I-Adapter performs slightly worse than ControlNet.
 	- disable_vae_slicing
 	- enable_xformers_memory_efficient_attention
 	- disable_xformers_memory_efficient_attention
+
+## StableDiffusionXLAdapterPipeline
+[[autodoc]] StableDiffusionXLAdapterPipeline
+	- all
+	- __call__
+	- enable_attention_slicing
+	- disable_attention_slicing
+	- enable_vae_slicing
+	- disable_vae_slicing
+	- enable_xformers_memory_efficient_attention
+	- disable_xformers_memory_efficient_attention
--- a/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/depth2img.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/gligen.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/gligen.md
@@ -0,0 +1,59 @@
+<!--Copyright 2023 The GLIGEN Authors and The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# GLIGEN (Grounded Language-to-Image Generation)
+
+The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] and [`StableDiffusionGLIGENTextImagePipeline`] can generate photorealistic images conditioned on grounding inputs. [`StableDiffusionGLIGENPipeline`] takes text and bounding boxes as grounding inputs. If input images are also given, [`StableDiffusionGLIGENTextImagePipeline`] can insert the objects described by text at the regions defined by the bounding boxes; otherwise, it generates an image described by the caption/prompt and inserts the objects described by text at the regions defined by the bounding boxes. The model is trained on the COCO2014D and COCO2014CD datasets and uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
+
+The abstract from the [paper](https://huggingface.co/papers/2301.07093) is:
+
+*Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN’s zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.*
+
+<Tip>
+
+Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently!
+
+If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organization!
+
+</Tip>
+
+[`StableDiffusionGLIGENPipeline`] was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful) and [`StableDiffusionGLIGENTextImagePipeline`] was contributed by [Nguyễn Công Tú Anh](https://github.com/tuanh123789).
+
+## StableDiffusionGLIGENPipeline
+
+[[autodoc]] StableDiffusionGLIGENPipeline
+	- all
+	- __call__
+	- enable_vae_slicing
+	- disable_vae_slicing
+	- enable_vae_tiling
+	- disable_vae_tiling
+	- enable_model_cpu_offload
+	- prepare_latents
+	- enable_fuser
+
+## StableDiffusionGLIGENTextImagePipeline
+
+[[autodoc]] StableDiffusionGLIGENTextImagePipeline
+	- all
+	- __call__
+	- enable_vae_slicing
+	- disable_vae_slicing
+	- enable_vae_tiling
+	- disable_vae_tiling
+	- enable_model_cpu_offload
+	- prepare_latents
+	- enable_fuser
+
+## StableDiffusionPipelineOutput
+
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
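The GLIGEN description above says objects are placed "at the region defined by bounding boxes" but does not spell out the box format. A minimal sketch of preparing such boxes follows, assuming each box is given as `[xmin, ymin, xmax, ymax]` in fractions of the image size and is paired with one text phrase. That convention, and the argument names `gligen_phrases` and `gligen_boxes` mentioned in the comments, are assumptions not stated in this diff; the helper `normalize_boxes` is hypothetical.

```python
def normalize_boxes(pixel_boxes, image_width, image_height):
    """Convert (xmin, ymin, xmax, ymax) pixel boxes to fractions of the image size."""
    normalized = []
    for xmin, ymin, xmax, ymax in pixel_boxes:
        inside_x = 0 <= xmin < xmax <= image_width
        inside_y = 0 <= ymin < ymax <= image_height
        if not (inside_x and inside_y):
            raise ValueError(f"box {(xmin, ymin, xmax, ymax)} does not fit the image")
        normalized.append(
            [
                xmin / image_width,
                ymin / image_height,
                xmax / image_width,
                ymax / image_height,
            ]
        )
    return normalized


# One phrase per box; these would be passed as the (assumed) gligen_phrases
# and gligen_boxes arguments of the pipeline call.
phrases = ["a birthday cake", "a candle"]
boxes = normalize_boxes([(128, 256, 384, 512), (256, 64, 320, 192)], 512, 512)

if len(phrases) != len(boxes):
    raise ValueError("each grounding phrase needs exactly one bounding box")
print(boxes)
```

Working in pixel coordinates and normalizing at the last step keeps the layout readable while still producing resolution-independent boxes.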
--- a/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/image_variation.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/img2img.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx
@@ -30,8 +30,8 @@ Make sure to check out the Stable Diffusion [Tips](overview#tips) section to lea
 	- all
 	- __call__

-## StableDiffusionPipelineOutput
+## LDM3DPipelineOutput

-[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.pipeline_stable_diffusion_ldm3d.LDM3DPipelineOutput
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/stable_diffusion/overview.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/overview.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md
@@ -0,0 +1,52 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Stable Diffusion XL
+
+Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.
+
+The abstract from the paper is:
+
+*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.*
+
+## Tips
+
+- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't work for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
+- SDXL accepts a different prompt for each of the text encoders it was trained on. You can even pass different parts of the same prompt to the text encoders.
+- SDXL output images can be improved by making use of a refiner model in an image-to-image setting.
+- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters.
+
+<Tip>
+
+To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide.
+
+Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! 
+
+</Tip>
+
+## StableDiffusionXLPipeline
+
+[[autodoc]] StableDiffusionXLPipeline
+	- all
+	- __call__
+
+## StableDiffusionXLImg2ImgPipeline
+
+[[autodoc]] StableDiffusionXLImg2ImgPipeline
+	- all
+	- __call__
+
+## StableDiffusionXLInpaintPipeline
+
+[[autodoc]] StableDiffusionXLInpaintPipeline
+	- all
+	- __call__
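The SDXL tips above cover image size, per-encoder prompts, and negative size conditioning. The sketch below shows one way those settings might be gathered into call arguments. The helper `build_sdxl_kwargs` is hypothetical, and the `prompt_2`, `height`, and `width` argument names and the divisible-by-8 check are assumptions about `StableDiffusionXLPipeline.__call__` that this diff does not state; only the three `negative_*` names come from the tips.

```python
import warnings


def build_sdxl_kwargs(
    prompt,
    prompt_2=None,
    height=1024,
    width=1024,
    negative_original_size=None,
    negative_crops_coords_top_left=None,
    negative_target_size=None,
):
    """Collect SDXL call arguments, dropping optional ones that were not set."""
    if height % 8 != 0 or width % 8 != 0:
        # Assumed constraint: latent-space models usually need multiples of 8.
        raise ValueError("height and width must be divisible by 8")
    if height < 512 or width < 512:
        warnings.warn("sizes below 512x512 are not recommended for SDXL")

    kwargs = {"prompt": prompt, "height": height, "width": width}
    optional = {
        "prompt_2": prompt_2,  # assumed name of the second text encoder's prompt
        "negative_original_size": negative_original_size,
        "negative_crops_coords_top_left": negative_crops_coords_top_left,
        "negative_target_size": negative_target_size,
    }
    # Only forward the optional arguments the caller actually provided.
    kwargs.update({name: value for name, value in optional.items() if value is not None})
    return kwargs


# Split one idea across the two text encoders and steer the model away from
# the look of low-resolution training images.
sdxl_kwargs = build_sdxl_kwargs(
    prompt="an astronaut riding a horse",
    prompt_2="oil painting, dramatic lighting",
    negative_original_size=(512, 512),
    negative_target_size=(1024, 1024),
)
print(sdxl_kwargs)
```

The resulting dictionary would be unpacked into the pipeline call, for example `pipe(**sdxl_kwargs)`, once a pipeline has been loaded.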
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.mdx
@@ -1,387 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Stable diffusion XL
-
-Stable Diffusion XL was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://arxiv.org/abs/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
-
-The abstract of the paper is the following:
-
-*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.*
-
-## Tips
-
- Stable Diffusion XL works especially well with images between 768 and 1024.
- Stable Diffusion XL can pass a different prompt for each of the text encoders it was trained on as shown below. We can even pass different parts of the same prompt to the text encoders.
- Stable Diffusion XL output image can be improved by making use of a refiner as shown below.
-
-### Available checkpoints:
-
- *Text-to-Image (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-base-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9) with [`StableDiffusionXLPipeline`]
- *Image-to-Image / Refiner (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-refiner-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9) with [`StableDiffusionXLImg2ImgPipeline`]
-
-## Usage Example
-
-Before using SDXL make sure to have `transformers`, `accelerate`, `safetensors` and `invisible_watermark` installed. 
-You can install the libraries as follows:
-
-```
-pip install transformers
-pip install accelerate
-pip install safetensors
-pip install invisible-watermark>=0.2.0
-```
-
-### Text-to-Image
-
-You can use SDXL as follows for *text-to-image*:
-
-```py
-from diffusers import StableDiffusionXLPipeline
-import torch
-
-pipe = StableDiffusionXLPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-pipe.to("cuda")
-
-prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
-image = pipe(prompt=prompt).images[0]
-```
-
-### Image-to-image 
-
-You can use SDXL as follows for *image-to-image*:
-
-```py 
-import torch
-from diffusers import StableDiffusionXLImg2ImgPipeline
-from diffusers.utils import load_image
-
-pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-pipe = pipe.to("cuda")
-url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"
-
-init_image = load_image(url).convert("RGB")
-prompt = "a photo of an astronaut riding a horse on mars"
-image = pipe(prompt, image=init_image).images[0]
-```
-
-### Inpainting
-
-You can use SDXL as follows for *inpainting*:
-
-```py 
-import torch
-from diffusers import StableDiffusionXLInpaintPipeline
-from diffusers.utils import load_image
-
-pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-pipe.to("cuda")
-
-img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
-mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
-
-init_image = load_image(img_url).convert("RGB")
-mask_image = load_image(mask_url).convert("RGB")
-
-prompt = "A majestic tiger sitting on a bench"
-image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0]
-```
-
-### Refining the image output
-
-In addition to the [base model checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9), 
-StableDiffusion-XL also includes a [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9)
-that is specialized in denoising low-noise stage images to generate images of improved high-frequency quality.
-This refiner checkpoint can be used as a "second-step" pipeline after having run the base checkpoint to improve
-image quality.
-
-When using the refiner, one can easily 
- 1.) employ the base model and refiner as an *Ensemble of Expert Denoisers* as first proposed in [eDiff-I](https://research.nvidia.com/labs/dir/eDiff-I/) or
- 2.) simply run the refiner in [SDEdit](https://arxiv.org/abs/2108.01073) fashion after the base model.
-
-**Note**: The idea of using SD-XL base & refiner as an ensemble of experts was first brought forward by 
-a couple of community contributors who also helped shape the following `diffusers` implementation, namely:
- [SytanSD](https://github.com/SytanSD)
- [bghira](https://github.com/bghira)
- [Birch-san](https://github.com/Birch-san)
-
-#### 1.) Ensemble of Expert Denoisers
-
-When using the base and refiner models as an ensemble of expert denoisers, the base model serves as the 
-expert for the high-noise diffusion stage and the refiner serves as the expert for the low-noise diffusion stage.
-
-The advantage of 1.) over 2.) is that it requires fewer overall denoising steps and therefore should be significantly
-faster. The drawback is that one cannot really inspect the output of the base model; it will still be heavily noised.
-
-To use the base model and refiner as an ensemble of expert denoisers, make sure to define the fraction
-of timesteps which should be run through the high-noise denoising stage (*i.e.* the base model) and the low-noise
-denoising stage (*i.e.* the refiner model) respectively. This fraction should be set as the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) of the base model 
-and as the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) of the refiner model.
-
-Let's look at an example.
-First, we import the two pipelines. Since the text encoders and variational autoencoder are the same
-you don't have to load those again for the refiner.
-
-```py
-from diffusers import DiffusionPipeline
-import torch
-
-base = DiffusionPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-base.to("cuda")
-
-refiner = DiffusionPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-refiner-0.9",
-    text_encoder_2=base.text_encoder_2,
-    vae=base.vae,
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-    variant="fp16",
-)
-refiner.to("cuda")
-```
-
-Now we define the number of inference steps and the fraction at which the model shall be run through the 
-high-noise denoising stage (*i.e.* the base model).
-
-```py
-n_steps = 40
-high_noise_frac = 0.7
-```
-
-A fraction of 0.7 means that 70% of the 40 inference steps (28 steps) are run through the base model
-and the remaining 12 steps are run through the refiner. Let's run the two pipelines now.
-Make sure to set `denoising_end` and `denoising_start` to the same values and keep `num_inference_steps`
-constant. Also remember that the output of the base model should be in latent space:
-
-```py
-prompt = "A majestic lion jumping from a big stone at night"
-
-image = base(prompt=prompt, num_inference_steps=n_steps, denoising_end=high_noise_frac, output_type="latent").images
-image = refiner(prompt=prompt, num_inference_steps=n_steps, denoising_start=high_noise_frac, image=image).images[0]
-```
-
-Let's have a look at the image
-
-| Original Image | Ensemble of Denoisers Experts |
-|---|---|
-| ![lion_base](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png) | ![lion_ref](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png)
-
-If we would have just run the base model on the same 40 steps, the image would have been arguably less detailed (e.g. the lion eyes and nose):
-
-<Tip>
-
-The ensemble-of-experts method works well on all available schedulers!
-
-</Tip>
-
-#### 2.) Refining the image output from fully denoised base image
-
-In standard [`StableDiffusionImg2ImgPipeline`]-fashion, the fully-denoised image generated by the base model 
-can be further improved using the [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9).
-
-For this, you simply run the refiner as a normal image-to-image pipeline after the "base" text-to-image 
-pipeline. You can leave the outputs of the base model in latent space.
-
-```py
-from diffusers import DiffusionPipeline
-import torch
-
-pipe = DiffusionPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-pipe.to("cuda")
-
-refiner = DiffusionPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-refiner-0.9",
-    text_encoder_2=pipe.text_encoder_2,
-    vae=pipe.vae,
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-    variant="fp16",
-)
-refiner.to("cuda")
-
-prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
-
-image = pipe(prompt=prompt, output_type="latent").images[0]
-image = refiner(prompt=prompt, image=image[None, :]).images[0]
-```
-
-| Original Image | Refined Image |
-|---|---|
-| ![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/init_image.png) | ![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/refined_image.png) |
-
-<Tip>
-
-The refiner can also very well be used in an in-painting setting. To do so just make
-  sure you use the [`StableDiffusionXLInpaintPipeline`] classes as shown below
-
-</Tip>
-
-To use the refiner for inpainting in the Ensemble of Expert Denoisers setting you can do the following:
-
-```py
-from diffusers import StableDiffusionXLInpaintPipeline
-from diffusers.utils import load_image
-
-pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-pipe.to("cuda")
-
-refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-refiner-0.9",
-    text_encoder_2=pipe.text_encoder_2,
-    vae=pipe.vae,
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-    variant="fp16",
-)
-refiner.to("cuda")
-
-img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
-mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
-
-init_image = load_image(img_url).convert("RGB")
-mask_image = load_image(mask_url).convert("RGB")
-
-prompt = "A majestic tiger sitting on a bench"
-num_inference_steps = 75
-high_noise_frac = 0.7
-
-image = pipe(
-    prompt=prompt,
-    image=init_image,
-    mask_image=mask_image,
-    num_inference_steps=num_inference_steps,
-    strength=0.80,
-    denoising_end=high_noise_frac,
-    output_type="latent",
-).images
-image = refiner(
-    prompt=prompt,
-    image=image,
-    mask_image=mask_image,
-    num_inference_steps=num_inference_steps,
-    denoising_start=high_noise_frac,
-).images[0]
-```
-
-To use the refiner for inpainting in the standard SDE-style setting, simply remove `denoising_end` and `denoising_start` and choose a smaller
-number of inference steps for the refiner.
-
-### Loading single file checkpoints / original file format
-
-By making use of [`~diffusers.loaders.FromSingleFileMixin.from_single_file`] you can also load the 
-original file format into `diffusers`:
-
-```py
-from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
-import torch
-
-pipe = StableDiffusionXLPipeline.from_single_file(
-    "./sd_xl_base_0.9.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-pipe.to("cuda")
-
-refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
-    "./sd_xl_refiner_0.9.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
-)
-refiner.to("cuda")
-```
-
-### Memory optimization via model offloading 
-
-If you are seeing out-of-memory errors, we recommend making use of [`StableDiffusionXLPipeline.enable_model_cpu_offload`].
-
-```diff
- pipe.to("cuda")
-+ pipe.enable_model_cpu_offload()
-```
-
-and 
-
-```diff
- refiner.to("cuda")
-+ refiner.enable_model_cpu_offload()
-```
-
-### Speed-up inference with `torch.compile`
-
-You can speed up inference by making use of `torch.compile`. This should give you **ca.** 20% speed-up.
-
-```diff
-+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
-```
-
-### Running with `torch < 2.0`
-
-**Note** that if you want to run Stable Diffusion XL with `torch` < 2.0, please make sure to enable xformers 
-attention:
-
-```
-pip install xformers
-```
-
-```diff
-+pipe.enable_xformers_memory_efficient_attention()
-+refiner.enable_xformers_memory_efficient_attention()
-```
-
-## StableDiffusionXLPipeline
-
-[[autodoc]] StableDiffusionXLPipeline
-	- all
-	- __call__
-
-## StableDiffusionXLImg2ImgPipeline
-
-[[autodoc]] StableDiffusionXLImg2ImgPipeline
-	- all
-	- __call__
-
-## StableDiffusionXLInpaintPipeline
-
-[[autodoc]] StableDiffusionXLInpaintPipeline
-	- all
-	- __call__
-
-### Passing different prompts to each text-encoder
-
-Stable Diffusion XL was trained on two text encoders. The default behavior is to pass the same prompt to each. But it is possible to pass a different prompt for each text-encoder, as [some users](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201) noted that it can boost quality.
-To do so, you can pass `prompt_2` and `negative_prompt_2` in addition to `prompt` and `negative_prompt`. By doing that, you will pass the original prompts and negative prompts (as in `prompt` and `negative_prompt`) to `text_encoder` (in official SDXL 0.9/1.0 that is [OpenAI CLIP-ViT/L-14](https://huggingface.co/openai/clip-vit-large-patch14)),
-and `prompt_2` and `negative_prompt_2` to `text_encoder_2` (in official SDXL 0.9/1.0 that is [OpenCLIP-ViT/bigG-14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
-
-```py
-from diffusers import StableDiffusionXLPipeline
-import torch
-
-pipe = StableDiffusionXLPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-)
-pipe.to("cuda")
-
-# prompt will be passed to OAI CLIP-ViT/L-14
-prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
-# prompt_2 will be passed to OpenCLIP-ViT/bigG-14
-prompt_2 = "monet painting"
-image = pipe(prompt=prompt, prompt_2=prompt_2).images[0]
-```
--- a/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.mdx
--- a/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/upscale.mdx
--- a/docs/source/en/api/pipelines/stable_unclip.mdx
+++ b/docs/source/en/api/pipelines/stable_unclip.mdx
--- a/docs/source/en/api/pipelines/stochastic_karras_ve.mdx
+++ b/docs/source/en/api/pipelines/stochastic_karras_ve.mdx
--- a/docs/source/en/api/pipelines/text_to_video.mdx
+++ b/docs/source/en/api/pipelines/text_to_video.mdx
--- a/docs/source/en/api/pipelines/text_to_video_zero.mdx
+++ b/docs/source/en/api/pipelines/text_to_video_zero.mdx
--- a/docs/source/en/api/pipelines/unclip.mdx
+++ b/docs/source/en/api/pipelines/unclip.mdx
--- a/docs/source/en/api/pipelines/unidiffuser.mdx
+++ b/docs/source/en/api/pipelines/unidiffuser.mdx
@@ -20,6 +20,12 @@ The abstract from the [paper](https://arxiv.org/abs/2303.06555) is:

 You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml).

+<Tip warning={true}>
+
+There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become `NaNs`. This issue can be mitigated by switching to PyTorch 2.X.
+
+</Tip>
+
 This pipeline was contributed by [dg845](https://github.com/dg845). ❤️

 ## Usage Examples
--- a/docs/source/en/api/pipelines/value_guided_sampling.md
+++ b/docs/source/en/api/pipelines/value_guided_sampling.md
@@ -0,0 +1,32 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Value-guided planning
+
+<Tip warning={true}>
+
+🧪 This is an experimental pipeline for reinforcement learning!
+
+</Tip>
+
+This pipeline is based on the [Planning with Diffusion for Flexible Behavior Synthesis](https://huggingface.co/papers/2205.09991) paper by Michael Janner, Yilun Du, Joshua B. Tenenbaum, Sergey Levine.
+
+The abstract from the paper is:
+
+*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility*.
+
+You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb). 
+
+The script to run the model is available [here](https://github.com/huggingface/diffusers/tree/main/examples/reinforcement_learning).
+
+## ValueGuidedRLPipeline
+[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline
--- a/docs/source/en/api/pipelines/versatile_diffusion.mdx
+++ b/docs/source/en/api/pipelines/versatile_diffusion.mdx
--- a/docs/source/en/api/pipelines/vq_diffusion.mdx
+++ b/docs/source/en/api/pipelines/vq_diffusion.mdx
--- a/docs/source/en/api/pipelines/wuerstchen.md
+++ b/docs/source/en/api/pipelines/wuerstchen.md
@@ -0,0 +1,136 @@
+# Würstchen
+
+<img src="https://github.com/dome272/Wuerstchen/assets/61938694/0617c863-165a-43ee-9303-2a17299a0cf9">
+
+[Würstchen: Efficient Pretraining of Text-to-Image Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, and Marc Aubreville.
+
+The abstract from the paper is:
+
+*We introduce Würstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach, which utilizes latent diffusion strategies at strong latent image compression rates, significantly reduces the computational burden, typically associated with state-of-the-art models, while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, thereby rendering real-time applications more viable. One of the key advantages of our method lies in its modest training requirements of only 9,200 GPU hours, slashing the usual costs significantly without compromising the end performance. In a comparison against the state-of-the-art, we found the approach to yield strong competitiveness. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, hence democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in the realm of text-to-image synthesis, offering an innovative path to explore in future research.*
+
+## Würstchen Overview
+Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Other works usually use a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, while also allowing cheaper and faster inference.
+
+## Würstchen v2 comes to Diffusers
+
+After the initial paper release, we have improved numerous things in the architecture, training and sampling, making Würstchen competitive to current state-of-the-art models in many ways. We are excited to release this new version together with Diffusers. Here is a list of the improvements.
+
+- Higher resolution (1024x1024 up to 2048x2048)
+- Faster inference
+- Multi Aspect Resolution Sampling
+- Better quality
+
+
+We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are: 
+
+- v2-base
+- v2-aesthetic
+- **(default)** v2-interpolated (50% interpolation between v2-base and v2-aesthetic)
+
+We recommend using v2-interpolated, as it has a nice touch of both photorealism and aesthetics. Use v2-base for fine-tuning as it does not have a style bias, and use v2-aesthetic for very artistic generations.
+A comparison can be seen here:
+
+<img src="https://github.com/dome272/Wuerstchen/assets/61938694/2914830f-cbd3-461c-be64-d50734f4b49d" width=500>
+
+## Text-to-Image Generation
+
+For the sake of usability, Würstchen can be used with a single pipeline. This pipeline can be used as follows:
+
+```python
+import torch
+from diffusers import AutoPipelineForText2Image
+from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
+
+pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")
+
+caption = "Anthropomorphic cat dressed as a fire fighter"
+images = pipe(
+    caption, 
+    width=1024,
+    height=1536,
+    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
+    prior_guidance_scale=4.0,
+    num_images_per_prompt=2,
+).images
+```
+
+For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, and Stage A. They all have different jobs and only work together. When generating text-conditional images, Stage C first generates the latents in a very compressed latent space. This is what happens in the `prior_pipeline`. Afterwards, the generated latents are passed to Stage B, which decompresses the latents into a bigger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into the pixel-space. Stage B and Stage A are both encapsulated in the `decoder_pipeline`. For more details, take a look at the [paper](https://huggingface.co/papers/2306.00637).
+
+```python
+import torch
+from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
+from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
+
+device = "cuda"
+dtype = torch.float16
+num_images_per_prompt = 2
+
+prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
+    "warp-ai/wuerstchen-prior", torch_dtype=dtype
+).to(device)
+decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
+    "warp-ai/wuerstchen", torch_dtype=dtype
+).to(device)
+
+caption = "Anthropomorphic cat dressed as a fire fighter"
+negative_prompt = ""
+
+prior_output = prior_pipeline(
+    prompt=caption,
+    height=1024,
+    width=1536,
+    timesteps=DEFAULT_STAGE_C_TIMESTEPS,
+    negative_prompt=negative_prompt,
+    guidance_scale=4.0,
+    num_images_per_prompt=num_images_per_prompt,
+)
+decoder_output = decoder_pipeline(
+    image_embeddings=prior_output.image_embeddings,
+    prompt=caption,
+    negative_prompt=negative_prompt,
+    guidance_scale=0.0,
+    output_type="pil",
+).images
+```
+
+## Speed-Up Inference
+You can use the `torch.compile` function to gain a speed-up of about 2-3x:
+
+```python
+prior_pipeline.prior = torch.compile(prior_pipeline.prior, mode="reduce-overhead", fullgraph=True)
+decoder_pipeline.decoder = torch.compile(decoder_pipeline.decoder, mode="reduce-overhead", fullgraph=True)
+```
+
+## Limitations
+
+- Due to the high compression employed by Würstchen, generations can lack a good amount
+of detail. To the human eye, this is especially noticeable in faces, hands, etc.
+- **Images can only be generated in 128-pixel steps**, e.g. the next higher resolution
+after 1024x1024 is 1152x1152
+- The model lacks the ability to render correct text in images
+- The model often does not achieve photorealism
+- Difficult compositional prompts are hard for the model
+
+The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen).
+
+## WuerstchenCombinedPipeline
+
+[[autodoc]] WuerstchenCombinedPipeline
+	- all
+	- __call__
+
+## WuerstchenPriorPipeline
+
+[[autodoc]] WuerstchenPriorPipeline
+	- all
+	- __call__
+
+## WuerstchenPriorPipelineOutput
+
+[[autodoc]] pipelines.wuerstchen.pipeline_wuerstchen_prior.WuerstchenPriorPipelineOutput
+
+## WuerstchenDecoderPipeline
+
+[[autodoc]] WuerstchenDecoderPipeline
+	- all
+	- __call__
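The Würstchen limitations in the added file above state that images can only be generated in 128-pixel steps. A minimal sketch of how a caller might snap a requested size to a valid one before passing `height` and `width` to the pipeline follows; `snap_to_multiple` is a hypothetical helper, not part of `diffusers`, and rounding to the nearest step (ties rounding up) is an assumption of this sketch rather than documented pipeline behavior.

```python
def snap_to_multiple(size: int, multiple: int = 128) -> int:
    """Round `size` to the nearest multiple of `multiple`, with a floor of one step."""
    # Integer arithmetic avoids banker's rounding: exact ties round up.
    snapped = (size + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


requested_width, requested_height = 1100, 1500
width = snap_to_multiple(requested_width)    # 1152
height = snap_to_multiple(requested_height)  # 1536
print(width, height)
```

The snapped values can then be used as the `width` and `height` arguments in the text-to-image examples above.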
--- a/docs/source/en/api/schedulers/cm_stochastic_iterative.md
+++ b/docs/source/en/api/schedulers/cm_stochastic_iterative.md
@@ -0,0 +1,15 @@
+# CMStochasticIterativeScheduler
+
+[Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever introduced a multistep and onestep scheduler (Algorithm 1) that is capable of generating good samples in one or a small number of steps.
+
+The abstract from the paper is:
+
+*Diffusion models have made significant breakthroughs in image, audio, and video generation, but they depend on an iterative generation process that causes slow sampling speed and caps their potential for real-time applications. To overcome this limitation, we propose consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to trade compute for sample quality. They also support zero-shot data editing, like image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either as a way to distill pre-trained diffusion models, or as standalone generative models. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step generation. For example, we achieve the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained as standalone generative models, consistency models also outperform single-step, non-adversarial generative models on standard benchmarks like CIFAR-10, ImageNet 64x64 and LSUN 256x256.*
+
+The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models).
+
+## CMStochasticIterativeScheduler
+[[autodoc]] CMStochasticIterativeScheduler
+
+## CMStochasticIterativeSchedulerOutput
+[[autodoc]] schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput
--- a/docs/source/en/api/schedulers/cm_stochastic_iterative.mdx
+++ b/docs/source/en/api/schedulers/cm_stochastic_iterative.mdx
@@ -1,11 +0,0 @@
-# Consistency Model Multistep Scheduler
-
-## Overview
-
-Multistep and onestep scheduler (Algorithm 1) introduced alongside consistency models in the paper [Consistency Models](https://arxiv.org/abs/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
-Based on the [original consistency models implementation](https://github.com/openai/consistency_models).
-Should generate good samples from [`ConsistencyModelPipeline`] in one or a small number of steps.
-
-## CMStochasticIterativeScheduler
-[[autodoc]] CMStochasticIterativeScheduler
-
--- a/docs/source/en/api/schedulers/ddim.md
+++ b/docs/source/en/api/schedulers/ddim.md
@@ -0,0 +1,82 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# DDIMScheduler
+
+[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
+
+The abstract from the paper is:
+
+*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, 
+yet they require simulating a Markov chain for many steps to produce a sample. 
+To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models
+with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. 
+We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from.
+We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off 
+computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
+
+The original codebase of this paper can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author on [tsong.me](https://tsong.me/).
+
+## Tips
+
+The paper [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion. To fix this, the authors propose:
+
+<Tip warning={true}>
+
+🧪 This is an experimental feature!
+
+</Tip>
+
+1. rescale the noise schedule to enforce zero terminal signal-to-noise ratio (SNR)
+
+```py
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
+```
+
+2. train a model with `v_prediction` (add the following argument to the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [train_text_to_image_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts)
+
+```bash
+--prediction_type="v_prediction"
+```
+
+3. change the sampler to always start from the last timestep
+
+```py
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+```
+
+4. rescale classifier-free guidance to prevent over-exposure
+
+```py
+image = pipe(prompt, guidance_rescale=0.7).images[0]
+```
+
+For example:
+
+```py
+import torch
+from diffusers import DiffusionPipeline, DDIMScheduler
+pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
+pipe.scheduler = DDIMScheduler.from_config(
+    pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
+)
+pipe.to("cuda")
+
+prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
+image = pipe(prompt, guidance_rescale=0.7).images[0]
+```
+
+## DDIMScheduler
+[[autodoc]] DDIMScheduler
+
+## DDIMSchedulerOutput
+[[autodoc]] schedulers.scheduling_ddim.DDIMSchedulerOutput
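Note on fix 1 in the tips above: the following is a minimal pure-Python sketch of what rescaling a noise schedule to zero terminal SNR can look like. It is an illustration under stated assumptions, not the library's implementation: the helper name `rescale_betas_zero_terminal_snr` is hypothetical, and the linear schedule endpoints (`1e-4` to `2e-2` over 1000 steps) are illustrative values only.

```python
import math


def rescale_betas_zero_terminal_snr(betas):
    """Hypothetical helper: rescale betas so the last timestep has zero SNR."""
    # Build sqrt(alpha_bar_t), where alpha_bar_t is the cumulative product of (1 - beta).
    alphas_bar_sqrt = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar_sqrt.append(math.sqrt(running))

    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # Shift so the last value becomes zero, then scale so the first value is preserved.
    shifted = [(s - last) * first / (first - last) for s in alphas_bar_sqrt]

    # Convert sqrt(alpha_bar) back to per-step betas.
    alphas_bar = [s * s for s in shifted]
    new_betas = [1.0 - alphas_bar[0]]
    for prev, cur in zip(alphas_bar, alphas_bar[1:]):
        new_betas.append(1.0 - cur / prev)
    return new_betas


# Illustrative linear schedule (assumed values, not taken from the docs above).
num_steps = 1000
betas = [1e-4 + (2e-2 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
rescaled = rescale_betas_zero_terminal_snr(betas)
```

After rescaling, the last beta equals 1, so the cumulative alpha product at the final timestep is zero and the model input there is pure noise, which matches what the sampler feeds the model at inference.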
--- a/docs/source/en/api/schedulers/ddim.mdx
+++ b/docs/source/en/api/schedulers/ddim.mdx
@@ -1,88 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Denoising Diffusion Implicit Models (DDIM)
-
-## Overview
-
-[Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
-
-The abstract of the paper is the following:
-
-*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, 
-yet they require simulating a Markov chain for many steps to produce a sample. 
-To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models
-with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. 
-We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from.
-We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off 
-computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
-
-The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim).
-For questions, feel free to contact the author on [tsong.me](https://tsong.me/).
-
-### Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed": 
-
-The paper **[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891)** 
-claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion.
-
-The abstract reads as follows:
-
-*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR),
-and some implementations of diffusion samplers do not start from the last timestep.
-Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference.
-We show that the flawed design causes real problems in existing implementations. 
-In Stable Diffusion, it severely limits the model to only generate images with medium brightness and 
-prevents it from generating very bright and dark samples. We propose a few simple fixes: 
- (1) rescale the noise schedule to enforce zero terminal SNR; 
- (2) train the model with v prediction; 
- (3) change the sampler to always start from the last timestep; 
- (4) rescale classifier-free guidance to prevent over-exposure. 
-These simple changes ensure the diffusion process is congruent between training and inference and 
-allow the model to generate samples more faithful to the original data distribution.*
-
-You can apply all of these changes in `diffusers` when using [`DDIMScheduler`]:
- (1) rescale the noise schedule to enforce zero terminal SNR; 
-```py
-pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
-```
- (2) train the model with v prediction; 
-Continue fine-tuning a checkpoint with [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)
-and `--prediction_type="v_prediction"`.
- (3) change the sampler to always start from the last timestep; 
-```py
-pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
-```
- (4) rescale classifier-free guidance to prevent over-exposure. 
-```py
-pipe(..., guidance_rescale=0.7)
-```
-
-An example is to use [this checkpoint](https://huggingface.co/ptx0/pseudo-journey-v2) 
-which has been fine-tuned using the `"v_prediction"`.
-
-The checkpoint can then be run in inference as follows:
-
-```py
-from diffusers import DiffusionPipeline, DDIMScheduler
-
-pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
-pipe.scheduler = DDIMScheduler.from_config(
-    pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
-)
-pipe.to("cuda")
-
-prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
-image = pipeline(prompt, guidance_rescale=0.7).images[0]
-```
-
-## DDIMScheduler
-[[autodoc]] DDIMScheduler
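Note on fix 3 (`timestep_spacing="trailing"`): a small sketch of why the spacing choice decides whether sampling starts from the last training timestep. Both helper names are hypothetical, the "leading" variant is assumed to be the spacing that `"trailing"` replaces, and details such as step offsets are deliberately ignored.

```python
def leading_timesteps(num_train_timesteps, num_inference_steps):
    # Evenly spaced steps counted up from 0, so the largest one falls short of the end.
    step = num_train_timesteps // num_inference_steps
    return [i * step for i in reversed(range(num_inference_steps))]


def trailing_timesteps(num_train_timesteps, num_inference_steps):
    # Evenly spaced steps counted down from the end, so sampling starts at the last timestep.
    step = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * step) - 1 for i in range(num_inference_steps)]


leading = leading_timesteps(1000, 10)
trailing = trailing_timesteps(1000, 10)
```

With 1000 training timesteps and 10 inference steps, the leading variant starts at timestep 900, while the trailing variant starts at 999, the last timestep seen during training.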
--- a/docs/source/en/api/schedulers/ddim_inverse.mdx
+++ b/docs/source/en/api/schedulers/ddim_inverse.mdx
@@ -10,12 +10,10 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Inverse Denoising Diffusion Implicit Models (DDIMInverse)
+# DDIMInverseScheduler

-## Overview
-
-This scheduler is the inverted scheduler of [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
-The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://arxiv.org/pdf/2211.09794.pdf)
+`DDIMInverseScheduler` is the inverted scheduler from [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
+The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794).

 ## DDIMInverseScheduler
 [[autodoc]] DDIMInverseScheduler
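Note on what "inverted" means here: a scalar sketch of the deterministic DDIM update, run once toward more noise (inversion) and once back (denoising). This is an illustration, not the code of `DDIMInverseScheduler`: the function name `ddim_move` and all numeric values are assumptions, and the same noise prediction `eps` is reused in both directions, which a real model only approximately satisfies.

```python
import math


def ddim_move(x, eps, alpha_bar_from, alpha_bar_to):
    """Hypothetical helper: deterministic DDIM update between two noise levels."""
    # Predict the clean sample from the current sample and the noise estimate.
    x0_pred = (x - math.sqrt(1.0 - alpha_bar_from) * eps) / math.sqrt(alpha_bar_from)
    # Re-noise the prediction to the target noise level using the same noise estimate.
    return math.sqrt(alpha_bar_to) * x0_pred + math.sqrt(1.0 - alpha_bar_to) * eps


x = 0.3      # sample at the cleaner noise level (illustrative value)
eps = -1.2   # noise estimate, assumed identical at both levels
alpha_bar_clean, alpha_bar_noisy = 0.9, 0.5

# Inversion: move from the cleaner level to the noisier one.
inverted = ddim_move(x, eps, alpha_bar_clean, alpha_bar_noisy)
# Denoising: move back; with the same eps this recovers the starting sample.
recovered = ddim_move(inverted, eps, alpha_bar_noisy, alpha_bar_clean)
```

Because the update is deterministic, inversion and denoising are the same formula with the noise levels swapped. In practice the model's noise prediction changes from step to step, so the round trip is only approximate.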
--- a/docs/source/en/api/schedulers/ddpm.md
+++ b/docs/source/en/api/schedulers/ddpm.md
@@ -0,0 +1,25 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# DDPMScheduler
+
+[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion-based model of the same name. In the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline.
+
+The abstract from the paper is:
+
+*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.*
+
+## DDPMScheduler
+[[autodoc]] DDPMScheduler
+
+## DDPMSchedulerOutput
+[[autodoc]] schedulers.scheduling_ddpm.DDPMSchedulerOutput
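Note on the discrete schedule behind DDPM: a minimal sketch of the closed-form forward noising step, `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps`, using scalars instead of tensors. The helper names and the linear schedule endpoints are illustrative assumptions, not the library's API.

```python
import math


def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    # Evenly spaced betas; the endpoint values are illustrative assumptions.
    return [beta_start + (beta_end - beta_start) * i / (num_steps - 1) for i in range(num_steps)]


def cumulative_alphas(betas):
    # alpha_bar_t is the running product of (1 - beta) up to step t.
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, eps, alpha_bar_t):
    # Closed-form sample of x_t given x_0 and a noise draw eps.
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps


alphas_bar = cumulative_alphas(linear_betas(1000))
slightly_noisy = add_noise(1.0, 0.5, alphas_bar[0])   # stays close to x0
mostly_noise = add_noise(1.0, 0.5, alphas_bar[-1])    # ends up close to eps
```

Early timesteps keep the sample close to the data, while the last timesteps are dominated by noise; the scheduler's job at inference is to walk this schedule in reverse.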
--- a/docs/source/en/api/schedulers/ddpm.mdx
+++ b/docs/source/en/api/schedulers/ddpm.mdx
@@ -1,27 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Denoising Diffusion Probabilistic Models (DDPM)
-
-## Overview
-
-[Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) 
- (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes the diffusion based model of the same name, but in the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline.
-
-The abstract of the paper is the following:
-
-We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.
-
-The original paper can be found [here](https://arxiv.org/abs/2010.02502).
-
-## DDPMScheduler
-[[autodoc]] DDPMScheduler