[Don't Merge] Check

2025-12-09 05:54:24 +08:00 · 2023-04-20 15:53:26 +02:00
961 changed files with 21935 additions and 195797 deletions
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -13,9 +13,8 @@ body:
             *Give your issue a fitting title. Assume that someone which very limited knowledge of diffusers can understand your issue. Add links to the source code, documentation other issues, pull requests etc...*
        - 2. If your issue is about something not working, **always** provide a reproducible code snippet. The reader should be able to reproduce your issue by **only copy-pasting your code snippet into a Python shell**.
             *The community cannot solve your issue if it cannot reproduce it. If your bug is related to training, add your training script and make everything needed to train public. Otherwise, just add a simple Python code snippet.*
-        - 3. Add the **minimum** amount of code / context that is needed to understand, reproduce your issue.
+        - 3. Add the **minimum amount of code / context that is needed to understand, reproduce your issue**.
             *Make the life of maintainers easy. `diffusers` is getting many issues every day. Make sure your issue is about one bug and one bug only. Make sure you add only the context, code needed to understand your issues - nothing more. Generally, every issue is a way of documenting this library, try to make it a good documentation entry.*
        - 4. For issues related to community pipelines (i.e., the pipelines located in the `examples/community` folder), please tag the author of the pipeline in your issue thread as those pipelines are not maintained.
  - type: markdown
    attributes:
      value: |
@@ -50,57 +49,3 @@ body:
      placeholder: diffusers version, platform, python version, ...
    validations:
      required: true
  - type: textarea
    id: who-can-help
    attributes:
      label: Who can help?
      description: |
        Your issue will be replied to more quickly if you can figure out the right person to tag with @
        If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
        All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and
        a core maintainer will ping the right person.
        Please tag a maximum of 2 people.
        Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...):
        Questions on pipelines:
        - Stable Diffusion @yiyixuxu @DN6 @patrickvonplaten @sayakpaul @patrickvonplaten
        - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 @patrickvonplaten
        - Kandinsky @yiyixuxu @patrickvonplaten
        - ControlNet @sayakpaul @yiyixuxu @DN6 @patrickvonplaten
        - T2I Adapter @sayakpaul @yiyixuxu @DN6 @patrickvonplaten
        - IF @DN6 @patrickvonplaten
        - Text-to-Video / Video-to-Video @DN6 @sayakpaul @patrickvonplaten
        - Wuerstchen @DN6 @patrickvonplaten
        - Other: @yiyixuxu @DN6
        Questions on models:
        - UNet @DN6 @yiyixuxu @sayakpaul @patrickvonplaten
        - VAE @sayakpaul @DN6 @yiyixuxu @patrickvonplaten
        - Transformers/Attention @DN6 @yiyixuxu @sayakpaul @DN6 @patrickvonplaten
        Questions on Schedulers: @yiyixuxu @patrickvonplaten
        Questions on LoRA: @sayakpaul @patrickvonplaten
        Questions on Textual Inversion: @sayakpaul @patrickvonplaten
        Questions on Training: 
        - DreamBooth @sayakpaul @patrickvonplaten
        - Text-to-Image Fine-tuning @sayakpaul @patrickvonplaten
        - Textual Inversion @sayakpaul @patrickvonplaten
        - ControlNet @sayakpaul @patrickvonplaten
        Questions on Tests: @DN6 @sayakpaul @yiyixuxu 
        Questions on Documentation: @stevhliu
        Questions on JAX- and MPS-related things: @pcuenca
        Questions on audio pipelines: @DN6 @patrickvonplaten
      placeholder: "@Username ..."
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,60 +0,0 @@
 # What does this PR do?
 <!--
 Congratulations! You've made it this far! You're not quite done yet though.
 Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.
 Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.
 Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost.
 -->
 <!-- Remove if not applicable -->
 Fixes # (issue)
 ## Before submitting
 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
 - [ ] Did you read the [contributor guideline](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md)?
 - [ ] Did you read our [philosophy doc](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) (important for complex PRs)?
 - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
 - [ ] Did you make sure to update the documentation with your changes? Here are the
      [documentation guidelines](https://github.com/huggingface/diffusers/tree/main/docs), and
      [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
 - [ ] Did you write any new necessary tests?
 ## Who can review?
 Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
 members/contributors who may be interested in your PR.
 <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @
 If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
 Please tag fewer than 3 people.
 Core library:
 - Schedulers: @williamberman and @patrickvonplaten
 - Pipelines:  @patrickvonplaten and @sayakpaul
 - Training examples: @sayakpaul and @patrickvonplaten
 - Docs: @stevhliu and @yiyixuxu
 - JAX and MPS: @pcuenca
 - Audio: @sanchit-gandhi
 - General functionalities: @patrickvonplaten and @sayakpaul
 Integrations:
 - deepspeed: HF Trainer/Accelerate: @pacman100
 HF projects:
 - accelerate: [different repo](https://github.com/huggingface/accelerate)
 - datasets: [different repo](https://github.com/huggingface/datasets)
 - transformers: [different repo](https://github.com/huggingface/transformers)
 - safetensors: [different repo](https://github.com/huggingface/safetensors)
 -->
--- a/.github/actions/setup-miniconda/action.yml
+++ b/.github/actions/setup-miniconda/action.yml
@@ -27,7 +27,7 @@ runs:
      - name: Get date
        id: get-date
        shell: bash
-        run: echo "today=$(/bin/date -u '+%Y%m%d')d" >> $GITHUB_OUTPUT
+        run: echo "::set-output name=today::$(/bin/date -u '+%Y%m%d')d"
      - name: Setup miniconda cache
        id: miniconda-cache
        uses: actions/cache@v2
@@ -143,4 +143,4 @@ runs:
                echo "There is ${AVAIL}KB free space left in $MOUNT, continue"
              fi
            fi
-          done
+          done
--- a/.github/workflows/build_docker_images.yml
+++ b/.github/workflows/build_docker_images.yml
@@ -26,8 +26,6 @@ jobs:
        image-name:
          - diffusers-pytorch-cpu
          - diffusers-pytorch-cuda
          - diffusers-pytorch-compile-cuda
          - diffusers-pytorch-xformers-cuda
          - diffusers-flax-cpu
          - diffusers-flax-tpu
          - diffusers-onnxruntime-cpu
--- a/.github/workflows/build_documentation.yml
+++ b/.github/workflows/build_documentation.yml
@@ -6,18 +6,14 @@ on:
      - main
      - doc-builder*
      - v*-release
      - v*-patch
 jobs:
-  build:
+   build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      install_libgl1: true
      package: diffusers
      notebook_folder: diffusers_doc
-      languages: en ko zh ja pt
+      languages: en ko
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
--- a/.github/workflows/build_pr_documentation.yml
+++ b/.github/workflows/build_pr_documentation.yml
@@ -13,6 +13,5 @@ jobs:
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      install_libgl1: true
      package: diffusers
-      languages: en ko zh ja pt
+      languages: en ko
--- a/.github/workflows/delete_doc_comment.yml
+++ b/.github/workflows/delete_doc_comment.yml
@@ -1,14 +1,13 @@
-name: Delete doc comment
+name: Delete dev documentation
 on:
-  workflow_run:
+  pull_request:
-    workflows: ["Delete doc comment trigger"]
+    types: [ closed ]
    types:
      - completed
 jobs:
  delete:
    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
-    secrets:
+    with:
-      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
+      pr_number: ${{ github.event.number }}
      package: diffusers
--- a/.github/workflows/delete_doc_comment_trigger.yml
+++ b/.github/workflows/delete_doc_comment_trigger.yml
@@ -1,12 +0,0 @@
 name: Delete doc comment trigger
 on:
  pull_request:
    types: [ closed ]
 jobs:
  delete:
    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
    with:
      pr_number: ${{ github.event.number }}
--- a/.github/workflows/pr_dependency_test.yml
+++ b/.github/workflows/pr_dependency_test.yml
@@ -1,32 +0,0 @@
 name: Run dependency tests
 on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main
 concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
 jobs:
  check_dependencies:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .
          pip install pytest
      - name: Check for soft dependencies
        run: |
          pytest tests/others/test_dependencies.py
--- a/.github/workflows/pr_quality.yml
+++ b/.github/workflows/pr_quality.yml
@@ -20,7 +20,7 @@ jobs:
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
-          python-version: "3.8"
+          python-version: "3.7"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
@@ -38,7 +38,7 @@ jobs:
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
-          python-version: "3.8"
+          python-version: "3.7"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
--- a/.github/workflows/pr_test_peft_backend.yml
+++ b/.github/workflows/pr_test_peft_backend.yml
@@ -1,67 +0,0 @@
 name: Fast tests for PRs - PEFT backend
 on:
  pull_request:
    branches:
      - main
 concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
 env:
  DIFFUSERS_IS_CI: yes
  OMP_NUM_THREADS: 4
  MKL_NUM_THREADS: 4
  PYTEST_TIMEOUT: 60
 jobs:
  run_fast_tests:
    strategy:
      fail-fast: false
      matrix:
        config:
          - name: LoRA
            framework: lora
            runner: docker-cpu
            image: diffusers/diffusers-pytorch-cpu
            report: torch_cpu_lora
    name: ${{ matrix.config.name }}
    runs-on: ${{ matrix.config.runner }}
    container:
      image: ${{ matrix.config.image }}
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
    defaults:
      run:
        shell: bash
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: Install dependencies
      run: |
        apt-get update && apt-get install libsndfile1-dev libgl1 -y
        python -m pip install -e .[quality,test]
        python -m pip install git+https://github.com/huggingface/accelerate.git
        python -m pip install -U git+https://github.com/huggingface/transformers.git
        python -m pip install -U git+https://github.com/huggingface/peft.git
    - name: Environment
      run: |
        python utils/print_env.py
    - name: Run fast PyTorch LoRA CPU tests with PEFT backend
      if: ${{ matrix.config.framework == 'lora' }}
      run: |
        python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
          -s -v \
          --make-reports=tests_${{ matrix.config.report }} \
          tests/lora/test_lora_layers_peft.py
--- a/.github/workflows/pr_tests.yml
+++ b/.github/workflows/pr_tests.yml
@@ -4,9 +4,6 @@ on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - ci-*
 concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
@@ -34,16 +31,16 @@ jobs:
            runner: docker-cpu
            image: diffusers/diffusers-pytorch-cpu
            report: torch_cpu_models_schedulers
          - name: LoRA
            framework: lora
            runner: docker-cpu
            image: diffusers/diffusers-pytorch-cpu
            report: torch_cpu_lora
          - name: Fast Flax CPU tests
            framework: flax
            runner: docker-cpu
            image: diffusers/diffusers-flax-cpu
            report: flax_cpu
          - name: Fast ONNXRuntime CPU tests
            framework: onnxruntime
            runner: docker-cpu
            image: diffusers/diffusers-onnxruntime-cpu
            report: onnx_cpu
          - name: PyTorch Example CPU tests
            framework: pytorch_examples
            runner: docker-cpu
@@ -70,9 +67,10 @@ jobs:
    - name: Install dependencies
      run: |
-        apt-get update && apt-get install libsndfile1-dev libgl1 -y
+        apt-get update && apt-get install libsndfile1-dev -y
        python -m pip install -e .[quality,test]
-        python -m pip install git+https://github.com/huggingface/accelerate.git
+        python -m pip install -U git+https://github.com/huggingface/transformers
        python -m pip install git+https://github.com/huggingface/accelerate
    - name: Environment
      run: |
@@ -90,18 +88,10 @@ jobs:
      if: ${{ matrix.config.framework == 'pytorch_models' }}
      run: |
        python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-          -s -v -k "not Flax and not Onnx and not Dependency" \
+          -s -v -k "not Flax and not Onnx" \
          --make-reports=tests_${{ matrix.config.report }} \
          tests/models tests/schedulers tests/others
    - name: Run fast PyTorch LoRA CPU tests
      if: ${{ matrix.config.framework == 'lora' }}
      run: |
        python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
          -s -v -k "not Flax and not Onnx and not Dependency" \
          --make-reports=tests_${{ matrix.config.report }} \
          tests/lora
    - name: Run fast Flax TPU tests
      if: ${{ matrix.config.framework == 'flax' }}
      run: |
@@ -110,6 +100,14 @@ jobs:
          --make-reports=tests_${{ matrix.config.report }} \
          tests
    - name: Run fast ONNXRuntime CPU tests
      if: ${{ matrix.config.framework == 'onnxruntime' }}
      run: |
        python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
          -s -v -k "Onnx" \
          --make-reports=tests_${{ matrix.config.report }} \
          tests/
    - name: Run example PyTorch CPU tests
      if: ${{ matrix.config.framework == 'pytorch_examples' }}
      run: |
@@ -128,28 +126,9 @@ jobs:
        name: pr_${{ matrix.config.report }}_test_reports
        path: reports
-  run_staging_tests:
+  run_fast_tests_apple_m1:
-    strategy:
+    name: Fast PyTorch MPS tests on MacOS
-      fail-fast: false
+    runs-on: [ self-hosted, apple-m1 ]
      matrix:
        config:
          - name: Hub tests for models, schedulers, and pipelines
            framework: hub_tests_pytorch
            runner: docker-cpu
            image: diffusers/diffusers-pytorch-cpu
            report: torch_hub
    name: ${{ matrix.config.name }}
    runs-on: ${{ matrix.config.runner }}
    container:
      image: ${{ matrix.config.image }}
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
    defaults:
      run:
        shell: bash
    steps:
    - name: Checkout diffusers
@@ -157,30 +136,45 @@ jobs:
      with:
        fetch-depth: 2
-    - name: Install dependencies
+    - name: Clean checkout
      shell: arch -arch arm64 bash {0}
      run: |
-        apt-get update && apt-get install libsndfile1-dev libgl1 -y
+        git clean -fxd
-        python -m pip install -e .[quality,test]
+
    - name: Setup miniconda
      uses: ./.github/actions/setup-miniconda
      with:
        python-version: 3.9
    - name: Install dependencies
      shell: arch -arch arm64 bash {0}
      run: |
        ${CONDA_RUN} python -m pip install --upgrade pip
        ${CONDA_RUN} python -m pip install -e .[quality,test]
        ${CONDA_RUN} python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
        ${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate
        ${CONDA_RUN} python -m pip install -U git+https://github.com/huggingface/transformers
    - name: Environment
      shell: arch -arch arm64 bash {0}
      run: |
-        python utils/print_env.py
+        ${CONDA_RUN} python utils/print_env.py
-    - name: Run Hub tests for models, schedulers, and pipelines on a staging env
+    - name: Run fast PyTorch tests on M1 (MPS)
-      if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}
+      shell: arch -arch arm64 bash {0}
      env:
        HF_HOME: /System/Volumes/Data/mnt/cache
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
      run: |
-        HUGGINGFACE_CO_STAGING=true python -m pytest \
+        ${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
          -m "is_staging_test" \
          --make-reports=tests_${{ matrix.config.report }} \
          tests
    - name: Failure short reports
      if: ${{ failure() }}
-      run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt
+      run: cat reports/tests_torch_mps_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
-        name: pr_${{ matrix.config.report }}_test_reports
+        name: pr_torch_mps_test_reports
        path: reports
--- a/.github/workflows/push_tests.yml
+++ b/.github/workflows/push_tests.yml
@@ -1,11 +1,10 @@
-name: Slow Tests on main
+name: Slow tests on main
 on:
  push:
    branches:
      - main
 env:
  DIFFUSERS_IS_CI: yes
  HF_HOME: /mnt/cache
@@ -13,371 +12,101 @@ env:
  MKL_NUM_THREADS: 8
  PYTEST_TIMEOUT: 600
  RUN_SLOW: yes
  PIPELINE_USAGE_CUTOFF: 50000
 jobs:
-  setup_torch_cuda_pipeline_matrix:
+  run_slow_tests:
    name: Setup Torch Pipelines CUDA Slow Tests Matrix
    runs-on: docker-gpu
    container:
      image: diffusers/diffusers-pytorch-cpu # this is a CPU image, but we need it to fetch the matrix
      options: --shm-size "16gb" --ipc host
    outputs:
      pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }}
    steps:
      - name: Checkout diffusers
        uses: actions/checkout@v3
        with:
          fetch-depth: 2
      - name: Install dependencies
        run: |
          apt-get update && apt-get install libsndfile1-dev libgl1 -y
          python -m pip install -e .[quality,test]
          python -m pip install git+https://github.com/huggingface/accelerate.git
      - name: Environment
        run: |
          python utils/print_env.py
      - name: Fetch Pipeline Matrix
        id: fetch_pipeline_matrix
        run: |
          matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py)
          echo $matrix
          echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT
      - name: Pipeline Tests Artifacts
        if: ${{ always() }}
        uses: actions/upload-artifact@v2
        with:
          name: test-pipelines.json
          path: reports
  torch_pipelines_cuda_tests:
    name: Torch Pipelines CUDA Slow Tests
    needs: setup_torch_cuda_pipeline_matrix
    strategy:
      fail-fast: false
      max-parallel: 1
      matrix:
-        module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }}
+        config:
-    runs-on: docker-gpu
+          - name: Slow PyTorch CUDA tests on Ubuntu
-    container:
+            framework: pytorch
-      image: diffusers/diffusers-pytorch-cuda
+            runner: docker-gpu
-      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
+            image: diffusers/diffusers-pytorch-cuda
-    steps:
+            report: torch_cuda
-      - name: Checkout diffusers
+          - name: Slow Flax TPU tests on Ubuntu
-        uses: actions/checkout@v3
+            framework: flax
-        with:
+            runner: docker-tpu
-          fetch-depth: 2
+            image: diffusers/diffusers-flax-tpu
-      - name: NVIDIA-SMI
+            report: flax_tpu
-        run: |
+          - name: Slow ONNXRuntime CUDA tests on Ubuntu
-          nvidia-smi
+            framework: onnxruntime
-      - name: Install dependencies
+            runner: docker-gpu
-        run: |
+            image: diffusers/diffusers-onnxruntime-cuda
-          apt-get update && apt-get install libsndfile1-dev libgl1 -y
+            report: onnx_cuda
          python -m pip install -e .[quality,test]
          python -m pip install git+https://github.com/huggingface/accelerate.git
      - name: Environment
        run: |
          python utils/print_env.py
      - name: Slow PyTorch CUDA checkpoint tests on Ubuntu
        env:
          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
          # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
          CUBLAS_WORKSPACE_CONFIG: :16:8
        run: |
          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
            -s -v -k "not Flax and not Onnx" \
            --make-reports=tests_pipeline_${{ matrix.module }}_cuda \
            tests/pipelines/${{ matrix.module }}
      - name: Failure short reports
        if: ${{ failure() }}
        run: |
          cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt
          cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt
-      - name: Test suite reports artifacts
+    name: ${{ matrix.config.name }}
-        if: ${{ always() }}
+
-        uses: actions/upload-artifact@v2
+    runs-on: ${{ matrix.config.runner }}
        with:
          name: pipeline_${{ matrix.module }}_test_reports
          path: reports
  torch_cuda_tests:
    name: Torch CUDA Tests
    runs-on: docker-gpu
    container:
-      image: diffusers/diffusers-pytorch-cuda
+      image: ${{ matrix.config.image }}
-      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
+      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ ${{ matrix.config.runner == 'docker-tpu' && '--privileged' || '--gpus 0'}}
    defaults:
      run:
        shell: bash
-    strategy:
+
      matrix:
        module: [models, schedulers, lora, others]
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: NVIDIA-SMI
      if : ${{ matrix.config.runner == 'docker-gpu' }}
      run: |
        nvidia-smi
    - name: Install dependencies
      run: |
        apt-get update && apt-get install libsndfile1-dev libgl1 -y
        python -m pip install -e .[quality,test]
-        python -m pip install git+https://github.com/huggingface/accelerate.git
+        python -m pip install -U git+https://github.com/huggingface/transformers
        python -m pip install git+https://github.com/huggingface/accelerate
    - name: Environment
      run: |
        python utils/print_env.py
    - name: Run slow PyTorch CUDA tests
      if: ${{ matrix.config.framework == 'pytorch' }}
      env:
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
        CUBLAS_WORKSPACE_CONFIG: :16:8
      run: |
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
          -s -v -k "not Flax and not Onnx" \
-          --make-reports=tests_torch_cuda \
+          --make-reports=tests_${{ matrix.config.report }} \
-          tests/${{ matrix.module }}
+          tests/
    - name: Failure short reports
      if: ${{ failure() }}
      run: |
        cat reports/tests_torch_cuda_stats.txt
        cat reports/tests_torch_cuda_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: torch_cuda_test_reports
        path: reports
  peft_cuda_tests:
    name: PEFT CUDA Tests
    runs-on: docker-gpu
    container:
      image: diffusers/diffusers-pytorch-cuda
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
    defaults:
      run:
        shell: bash
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: Install dependencies
      run: |
        apt-get update && apt-get install libsndfile1-dev libgl1 -y
        python -m pip install -e .[quality,test]
        python -m pip install git+https://github.com/huggingface/accelerate.git
        python -m pip install git+https://github.com/huggingface/peft.git
    - name: Environment
      run: |
        python utils/print_env.py
    - name: Run slow PEFT CUDA tests
      env:
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
        CUBLAS_WORKSPACE_CONFIG: :16:8
      run: |
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
          -s -v -k "not Flax and not Onnx" \
          --make-reports=tests_peft_cuda \
          tests/lora/
    - name: Failure short reports
      if: ${{ failure() }}
      run: |
        cat reports/tests_peft_cuda_stats.txt
        cat reports/tests_peft_cuda_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: torch_peft_test_reports
        path: reports
  flax_tpu_tests:
    name: Flax TPU Tests
    runs-on: docker-tpu
    container:
      image: diffusers/diffusers-flax-tpu
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --privileged
    defaults:
      run:
        shell: bash
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: Install dependencies
      run: |
        apt-get update && apt-get install libsndfile1-dev libgl1 -y
        python -m pip install -e .[quality,test]
        python -m pip install git+https://github.com/huggingface/accelerate.git
    - name: Environment
      run: |
        python utils/print_env.py
    - name: Run slow Flax TPU tests
      if: ${{ matrix.config.framework == 'flax' }}
      env:
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
      run: |
        python -m pytest -n 0 \
          -s -v -k "Flax" \
-          --make-reports=tests_flax_tpu \
+          --make-reports=tests_${{ matrix.config.report }} \
          tests/
    - name: Failure short reports
      if: ${{ failure() }}
      run: |
        cat reports/tests_flax_tpu_stats.txt
        cat reports/tests_flax_tpu_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: flax_tpu_test_reports
        path: reports
  onnx_cuda_tests:
    name: ONNX CUDA Tests
    runs-on: docker-gpu
    container:
      image: diffusers/diffusers-onnxruntime-cuda
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
    defaults:
      run:
        shell: bash
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: Install dependencies
      run: |
        apt-get update && apt-get install libsndfile1-dev libgl1 -y
        python -m pip install -e .[quality,test]
        python -m pip install git+https://github.com/huggingface/accelerate.git
    - name: Environment
      run: |
        python utils/print_env.py
    - name: Run slow ONNXRuntime CUDA tests
      if: ${{ matrix.config.framework == 'onnxruntime' }}
      env:
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
      run: |
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
          -s -v -k "Onnx" \
-          --make-reports=tests_onnx_cuda \
+          --make-reports=tests_${{ matrix.config.report }} \
          tests/
    - name: Failure short reports
      if: ${{ failure() }}
-      run: |
+      run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt
        cat reports/tests_onnx_cuda_stats.txt
        cat reports/tests_onnx_cuda_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
-        name: onnx_cuda_test_reports
+        name: ${{ matrix.config.report }}_test_reports
        path: reports
  run_torch_compile_tests:
    name: PyTorch Compile CUDA tests
    runs-on: docker-gpu
    container:
      image: diffusers/diffusers-pytorch-compile-cuda
      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: NVIDIA-SMI
      run: |
        nvidia-smi
    - name: Install dependencies
      run: |
        python -m pip install -e .[quality,test,training]
    - name: Environment
      run: |
        python utils/print_env.py
    - name: Run example tests on GPU
      env:
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
      run: |
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
    - name: Failure short reports
      if: ${{ failure() }}
      run: cat reports/tests_torch_compile_cuda_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: torch_compile_test_reports
        path: reports
  run_xformers_tests:
    name: PyTorch xformers CUDA tests
    runs-on: docker-gpu
    container:
      image: diffusers/diffusers-pytorch-xformers-cuda
      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: NVIDIA-SMI
      run: |
        nvidia-smi
    - name: Install dependencies
      run: |
        python -m pip install -e .[quality,test,training]
    - name: Environment
      run: |
        python utils/print_env.py
    - name: Run example tests on GPU
      env:
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
      run: |
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/
    - name: Failure short reports
      if: ${{ failure() }}
      run: cat reports/tests_torch_xformers_cuda_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: torch_xformers_test_reports
        path: reports
  run_examples_tests:
@@ -402,6 +131,8 @@ jobs:
    - name: Install dependencies
      run: |
        python -m pip install -e .[quality,test,training]
        python -m pip install git+https://github.com/huggingface/accelerate
        python -m pip install -U git+https://github.com/huggingface/transformers
    - name: Environment
      run: |
@@ -415,13 +146,11 @@ jobs:
    - name: Failure short reports
      if: ${{ failure() }}
-      run: |
+      run: cat reports/examples_torch_cuda_failures_short.txt
-        cat reports/examples_torch_cuda_stats.txt
-        cat reports/examples_torch_cuda_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: examples_test_reports
-        path: reports
+        path: reports
--- a/.github/workflows/push_tests_fast.yml
+++ b/.github/workflows/push_tests_fast.yml
@@ -1,4 +1,4 @@
-name: Fast tests on main
+name: Slow tests on main
 on:
  push:
@@ -60,8 +60,10 @@ jobs:
    - name: Install dependencies
      run: |
-        apt-get update && apt-get install libsndfile1-dev libgl1 -y
+        apt-get update && apt-get install libsndfile1-dev -y
        python -m pip install -e .[quality,test]
        python -m pip install -U git+https://github.com/huggingface/transformers
        python -m pip install git+https://github.com/huggingface/accelerate
    - name: Environment
      run: |
@@ -108,3 +110,56 @@ jobs:
      with:
        name: pr_${{ matrix.config.report }}_test_reports
        path: reports
  run_fast_tests_apple_m1:
    name: Fast PyTorch MPS tests on MacOS
    runs-on: [ self-hosted, apple-m1 ]
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: Clean checkout
      shell: arch -arch arm64 bash {0}
      run: |
        git clean -fxd
    - name: Setup miniconda
      uses: ./.github/actions/setup-miniconda
      with:
        python-version: 3.9
    - name: Install dependencies
      shell: arch -arch arm64 bash {0}
      run: |
        ${CONDA_RUN} python -m pip install --upgrade pip
        ${CONDA_RUN} python -m pip install -e .[quality,test]
        ${CONDA_RUN} python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
        ${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate
        ${CONDA_RUN} python -m pip install -U git+https://github.com/huggingface/transformers
    - name: Environment
      shell: arch -arch arm64 bash {0}
      run: |
        ${CONDA_RUN} python utils/print_env.py
    - name: Run fast PyTorch tests on M1 (MPS)
      shell: arch -arch arm64 bash {0}
      env:
        HF_HOME: /System/Volumes/Data/mnt/cache
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
      run: |
        ${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
    - name: Failure short reports
      if: ${{ failure() }}
      run: cat reports/tests_torch_mps_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: pr_torch_mps_test_reports
        path: reports
--- a/.github/workflows/push_tests_mps.yml
+++ b/.github/workflows/push_tests_mps.yml
@@ -1,68 +0,0 @@
 name: Fast mps tests on main
 on:
  push:
    branches:
      - main
 env:
  DIFFUSERS_IS_CI: yes
  HF_HOME: /mnt/cache
  OMP_NUM_THREADS: 8
  MKL_NUM_THREADS: 8
  PYTEST_TIMEOUT: 600
  RUN_SLOW: no
 jobs:
  run_fast_tests_apple_m1:
    name: Fast PyTorch MPS tests on MacOS
    runs-on: [ self-hosted, apple-m1 ]
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
    - name: Clean checkout
      shell: arch -arch arm64 bash {0}
      run: |
        git clean -fxd
    - name: Setup miniconda
      uses: ./.github/actions/setup-miniconda
      with:
        python-version: 3.9
    - name: Install dependencies
      shell: arch -arch arm64 bash {0}
      run: |
        ${CONDA_RUN} python -m pip install --upgrade pip
        ${CONDA_RUN} python -m pip install -e .[quality,test]
        ${CONDA_RUN} python -m pip install torch torchvision torchaudio
        ${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate.git
        ${CONDA_RUN} python -m pip install transformers --upgrade
    - name: Environment
      shell: arch -arch arm64 bash {0}
      run: |
        ${CONDA_RUN} python utils/print_env.py
    - name: Run fast PyTorch tests on M1 (MPS)
      shell: arch -arch arm64 bash {0}
      env:
        HF_HOME: /System/Volumes/Data/mnt/cache
        HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
      run: |
        ${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
    - name: Failure short reports
      if: ${{ failure() }}
      run: cat reports/tests_torch_mps_failures_short.txt
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: pr_torch_mps_test_reports
        path: reports
--- a/.github/workflows/stale.yml
+++ b/.github/workflows/stale.yml
@@ -17,7 +17,7 @@ jobs:
    - name: Setup Python
      uses: actions/setup-python@v1
      with:
-        python-version: 3.8
+        python-version: 3.7
    - name: Install requirements
      run: |
--- a/.github/workflows/upload_pr_documentation.yml
+++ b/.github/workflows/upload_pr_documentation.yml
@@ -1,16 +0,0 @@
 name: Upload PR Documentation
 on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed
 jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: diffusers
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -40,7 +40,7 @@ In the following, we give an overview of different ways to contribute, ranked by
 As said before, **all contributions are valuable to the community**.
 In the following, we will explain each contribution a bit more in detail.
-For all contributions 4.-9. you will need to open a PR. It is explained in detail how to do so in [Opening a pull request](#how-to-open-a-pr)
+For all contributions 4.-9. you will need to open a PR. It is explained in detail how to do so in [Opening a pull requst](#how-to-open-a-pr)
 ### 1. Asking and answering questions on the Diffusers discussion forum or on the Diffusers Discord
@@ -63,7 +63,7 @@ In the same spirit, you are of immense help to the community by answering such q
 **Please** keep in mind that the more effort you put into asking or answering a question, the higher
 the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database.
-In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formated/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
+In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accesible*, and *well-formated/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
 **NOTE about channels**:
 [*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago.
@@ -125,14 +125,14 @@ Awesome! Tell us what problem it solved for you.
 You can open a feature request [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=).
-#### 2.3 Feedback.
+#### 2.3 Feedback. 
 Feedback about the library design and why it is good or not good helps the core maintainers immensely to build a user-friendly library. To understand the philosophy behind the current design philosophy, please have a look [here](https://huggingface.co/docs/diffusers/conceptual/philosophy). If you feel like a certain design choice does not fit with the current design philosophy, please explain why and how it should be changed. If a certain design choice follows the design philosophy too much, hence restricting use cases, explain why and how it should be changed.
 If a certain design choice is very useful for you, please also leave a note as this is great feedback for future design decisions.
 You can open an issue about feedback [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=).
-#### 2.4 Technical questions.
+#### 2.4 Technical questions. 
 Technical questions are mainly about why certain code of the library was written in a certain way, or what a certain part of the code does. Please make sure to link to the code in question and please provide detail on
 why this part of the code is difficult to understand.
@@ -168,7 +168,7 @@ more precise, provide the link to a duplicated issue or redirect them to [the fo
 If you have verified that the issued bug report is correct and requires a correction in the source code,
 please have a look at the next sections.
-For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull request](#how-to-open-a-pr) section.
+For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull requst](#how-to-open-a-pr) section.
 ### 4. Fixing a "Good first issue"
@@ -297,7 +297,7 @@ if you don't know yet what specific component you would like to add:
 - [Model or pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22)
 - [Scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)
-Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) a read to better understand the design of any of the three components. Please be aware that
+Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) a read to better understand the design of any of the three components. Please be aware that
 we cannot merge model, scheduler, or pipeline additions that strongly diverge from our design philosophy
 as it will lead to API inconsistencies. If you fundamentally disagree with a design choice, please
 open a [Feedback issue](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=) instead so that it can be discussed whether a certain design
@@ -394,8 +394,8 @@ passes. You should run the tests impacted by your changes like this:
 ```bash
 $ pytest tests/<TEST_TO_RUN>.py
 ```
-
+ 
-Before you run the tests, please make sure you install the dependencies required for testing. You can do so
+Before you run the tests, please make sure you install the dependencies required for testing. You can do so 
 with this command:
 ```bash
--- a/2
+++ b/2
@@ -78,7 +78,7 @@ test:
 # Run tests for examples
 test-examples:
-	python -m pytest -n auto --dist=loadfile -s -v ./examples/
+	python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/
 # Release stuff
--- a/PHILOSOPHY.md
+++ b/PHILOSOPHY.md
@@ -27,18 +27,18 @@ In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefor
 ## Simple over easy
-As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library:
+As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library: 
 - We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management.
 - Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible.
 - Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers.
- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the unet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. Dreambooth or textual inversion training
+- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the unet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. Dreambooth or textual inversion training 
 is very simple thanks to diffusers' ability to separate single components of the diffusion pipeline.
 ## Tweakable, contributor-friendly over abstraction
-For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself).
+For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself). 
 In short, just like Transformers does for modeling files, diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers.
-Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable.
+Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable. 
 **However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because:
 - Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions.
 - Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over one that contains many abstractions.
@@ -47,10 +47,10 @@ Functions, long code blocks, and even classes can be copied across multiple file
 At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look
 at [this blog post](https://huggingface.co/blog/transformers-design-philosophy).
-In diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such
+In diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such 
 as [DDPM](https://huggingface.co/docs/diffusers/v0.12.0/en/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/v0.12.0/en/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [UnCLIP (Dalle-2)](https://huggingface.co/docs/diffusers/v0.12.0/en/api/pipelines/unclip#overview) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models#diffusers.UNet2DConditionModel).
-Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗.
+Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗. 
 We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. If you have feedback regarding the design, we would ❤️  to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=).
 ## Design Philosophy in Details
@@ -70,7 +70,7 @@ The following design principles are followed:
 - Pipelines should be used **only** for inference.
 - Pipelines should be very readable, self-explanatory, and easy to tweak.
 - Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs.
- Pipelines are **not** intended to be feature-complete user interfaces. For future complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
+- Pipelines are **not** intended to be feature-complete user interfaces. For future complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner)
 - Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines.
 - Pipelines should be named after the task they are intended to solve.
 - In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file.
@@ -89,22 +89,22 @@ The following design principles are followed:
 - Models should by default have the highest precision and lowest performance setting.
 - To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
 - Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
+- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and 
-readable longterm, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
+readable longterm, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
 ### Schedulers
 Schedulers are responsible to guide the denoising process for inference as well as to define a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**.
 The following design principles are followed:
- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
+- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). 
- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
+- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained. 
- One scheduler python file corresponds to one scheduler algorithm (as might be defined in a paper).
+- One scheduler python file corresponds to one scheduler algorithm (as might be defined in a paper). 
 - If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
 - Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./using-diffusers/schedulers.md).
+- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./using-diffusers/schedulers.mdx).
 - Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
+- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon
 - The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
 - Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box".
 - In almost all cases, novel schedulers shall be implemented in a new scheduling file.
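Reviewer note on the scheduler bullets in the hunk above: the contract they describe (call `set_num_inference_steps(...)` once, loop over `timesteps`, call `step(...)` on each iteration) can be sketched as follows. This is a minimal illustration, not diffusers code. `ToyScheduler`, `toy_model`, and `denoise` are hypothetical names, the update rule is invented for the example, and the method names mirror the wording of PHILOSOPHY.md rather than any released API.

```python
class ToyScheduler:
    """Hypothetical scheduler used only to illustrate the call order; not part of diffusers."""

    def __init__(self, num_train_timesteps=1000):
        self.num_train_timesteps = num_train_timesteps
        self.timesteps = []  # filled in by set_num_inference_steps

    def set_num_inference_steps(self, num_inference_steps):
        # Evenly spaced timesteps in descending order (most noisy first).
        stride = self.num_train_timesteps // num_inference_steps
        self.timesteps = [i * stride for i in reversed(range(num_inference_steps))]

    def step(self, model_output, timestep, sample):
        # Invented update rule: subtract a fixed fraction of the predicted noise.
        # Takes x_t and returns the "previous", slightly more denoised x_t-1.
        scale = 1.0 / len(self.timesteps)
        return [x - scale * eps for x, eps in zip(sample, model_output)]


def toy_model(sample, timestep):
    # Stand-in for a UNet: pretends the whole sample is noise.
    return list(sample)


def denoise(scheduler, model, sample, num_inference_steps):
    # The "unrolled denoising loop" that the user writes explicitly.
    scheduler.set_num_inference_steps(num_inference_steps)  # must come before step()
    for t in scheduler.timesteps:
        model_output = model(sample, t)
        sample = scheduler.step(model_output, t, sample)
    return sample


if __name__ == "__main__":
    result = denoise(ToyScheduler(), toy_model, [1.0, -2.0], num_inference_steps=4)
    print(result)
```

Because the loop only touches `set_num_inference_steps`, `timesteps`, and `step`, any object exposing those three members could be swapped in without changing `denoise`, which is the separation of model and scheduler that the bullets argue for.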
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 <p align="center">
    <br>
-    <img src="https://raw.githubusercontent.com/huggingface/diffusers/main/docs/source/en/imgs/diffusers_library.jpg" width="400"/>
+    <img src="./docs/source/en/imgs/diffusers_library.jpg" width="400"/>
    <br>
 <p>
 <p align="center">
@@ -10,9 +10,6 @@
    <a href="https://github.com/huggingface/diffusers/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
    </a>
    <a href="https://pepy.tech/project/diffusers">
        <img alt="GitHub release" src="https://static.pepy.tech/badge/diffusers/month">
    </a>
    <a href="CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
    </a>
@@ -28,12 +25,12 @@
 ## Installation
-We recommend installing 🤗 Diffusers in a virtual environment from PyPi or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/#installation), please refer to their official documentation.
+We recommend installing 🤗 Diffusers in a virtual environment from PyPi or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/installation.html), please refer to their official documentation.
 ### PyTorch
 With `pip` (official package):
-
+    
 ```bash
 pip install --upgrade diffusers[torch]
 ```
@@ -62,9 +59,8 @@ Generating outputs is super easy with 🤗 Diffusers. To generate an image from
 ```python
 from diffusers import DiffusionPipeline
 import torch
-pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
 pipeline.to("cuda")
 pipeline("An image of a squirrel in Picasso style").images[0]
 ```
@@ -103,14 +99,58 @@ Check out the [Quickstart](https://huggingface.co/docs/diffusers/quicktour) to l
 | **Documentation**                                                   | **What can I learn?**                                                                                                                                                                           |
 |---------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [Tutorial](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview)                                                            | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model.  |
+| Tutorial                                                            | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model.  |
-| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading_overview)                                                             | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers.                                         |
+| Loading                                                             | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers.                                         |
-| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/pipeline_overview)                                             | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library.               |
+| Pipelines for inference                                             | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library.               |
-| [Optimization](https://huggingface.co/docs/diffusers/optimization/opt_overview)                                                        | Guides for how to optimize your diffusion model to run faster and consume less memory.                                                                                                          |
+| Optimization                                                        | Guides for how to optimize your diffusion model to run faster and consume less memory.                                                                                                          |
 | [Training](https://huggingface.co/docs/diffusers/training/overview) | Guides for how to train a diffusion model for different tasks with different training techniques.                                                                                               |
 ## Supported pipelines
 | Pipeline | Paper | Tasks |
 |---|---|:---:|
 | [alt_diffusion](./api/pipelines/alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation |
 | [audio_diffusion](./api/pipelines/audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation |
 | [controlnet](./api/pipelines/stable_diffusion/controlnet) | [**ControlNet with Stable Diffusion**](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation |
 | [cycle_diffusion](./api/pipelines/cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
 | [dance_diffusion](./api/pipelines/dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
 | [ddpm](./api/pipelines/ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
 | [ddim](./api/pipelines/ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
 | [latent_diffusion](./api/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
 | [latent_diffusion](./api/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
 | [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
 | [paint_by_example](./api/pipelines/paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
 | [pndm](./api/pipelines/pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
 | [score_sde_ve](./api/pipelines/score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
 | [score_sde_vp](./api/pipelines/score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
 | [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [**Semantic Guidance**](https://arxiv.org/abs/2301.12247) | Text-Guided Generation |
 | [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation |
 | [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation |
 | [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting |
 | [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [**MultiDiffusion**](https://multidiffusion.github.io/) | Text-to-Panorama Generation |
 | [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [**InstructPix2Pix**](https://github.com/timothybrooks/instruct-pix2pix) | Text-Guided Image Editing|
 | [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [**Zero-shot Image-to-Image Translation**](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
 | [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [**Attend and Excite for Stable Diffusion**](https://attendandexcite.github.io/Attend-and-Excite/) | Text-to-Image Generation |
 | [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [**Self-Attention Guidance**](https://ku-cvlab.github.io/Self-Attention-Guidance) | Text-to-Image Generation |
 | [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [**Stable Diffusion Image Variations**](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
 | [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [**Stable Diffusion Latent Upscaler**](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
 | [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation |
 | [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
 | [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Depth-Conditional Stable Diffusion**](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation |
 | [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
 | [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation |
 | [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation |
 | [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation |
 | [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
 | [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
 | [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
 ## Contribution
-We ❤️  contributions from the open-source community!
+We ❤️  contributions from the open-source community! 
 If you want to contribute to this library, please check out our [Contribution guide](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md).
 You can look out for [issues](https://github.com/huggingface/diffusers/issues) you'd like to tackle to contribute to the library.
 - See [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) for general opportunities to contribute
@@ -120,92 +160,6 @@ You can look out for [issues](https://github.com/huggingface/diffusers/issues) y
 Also, say 👋 in our public Discord channel <a href="https://discord.gg/G7tWnz98XR"><img alt="Join us on Discord" src="https://img.shields.io/discord/823813159592001537?color=5865F2&logo=discord&logoColor=white"></a>. We discuss the hottest trends about diffusion models, help each other with contributions and personal projects, or
 just hang out ☕.
 ## Popular Tasks & Pipelines
 <table>
  <tr>
    <th>Task</th>
    <th>Pipeline</th>
    <th>🤗 Hub</th>
  </tr>
  <tr style="border-top: 2px solid black">
    <td>Unconditional Image Generation</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/ddpm"> DDPM </a></td>
    <td><a href="https://huggingface.co/google/ddpm-ema-church-256"> google/ddpm-ema-church-256 </a></td>
  </tr>
  <tr style="border-top: 2px solid black">
    <td>Text-to-Image</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img">Stable Diffusion Text-to-Image</a></td>
      <td><a href="https://huggingface.co/runwayml/stable-diffusion-v1-5"> runwayml/stable-diffusion-v1-5 </a></td>
  </tr>
  <tr>
    <td>Text-to-Image</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/unclip">unclip</a></td>
      <td><a href="https://huggingface.co/kakaobrain/karlo-v1-alpha"> kakaobrain/karlo-v1-alpha </a></td>
  </tr>
  <tr>
    <td>Text-to-Image</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/if">DeepFloyd IF</a></td>
      <td><a href="https://huggingface.co/DeepFloyd/IF-I-XL-v1.0"> DeepFloyd/IF-I-XL-v1.0 </a></td>
  </tr>
  <tr>
    <td>Text-to-Image</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/kandinsky">Kandinsky</a></td>
      <td><a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder"> kandinsky-community/kandinsky-2-2-decoder </a></td>
  </tr>
  <tr style="border-top: 2px solid black">
    <td>Text-guided Image-to-Image</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/controlnet">Controlnet</a></td>
      <td><a href="https://huggingface.co/lllyasviel/sd-controlnet-canny"> lllyasviel/sd-controlnet-canny </a></td>
  </tr>
  <tr>
    <td>Text-guided Image-to-Image</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/pix2pix">Instruct Pix2Pix</a></td>
      <td><a href="https://huggingface.co/timbrooks/instruct-pix2pix"> timbrooks/instruct-pix2pix </a></td>
  </tr>
  <tr>
    <td>Text-guided Image-to-Image</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img">Stable Diffusion Image-to-Image</a></td>
      <td><a href="https://huggingface.co/runwayml/stable-diffusion-v1-5"> runwayml/stable-diffusion-v1-5 </a></td>
  </tr>
  <tr style="border-top: 2px solid black">
    <td>Text-guided Image Inpainting</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/inpaint">Stable Diffusion Inpaint</a></td>
      <td><a href="https://huggingface.co/runwayml/stable-diffusion-inpainting"> runwayml/stable-diffusion-inpainting </a></td>
  </tr>
  <tr style="border-top: 2px solid black">
    <td>Image Variation</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/image_variation">Stable Diffusion Image Variation</a></td>
      <td><a href="https://huggingface.co/lambdalabs/sd-image-variations-diffusers"> lambdalabs/sd-image-variations-diffusers </a></td>
  </tr>
  <tr style="border-top: 2px solid black">
    <td>Super Resolution</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/upscale">Stable Diffusion Upscale</a></td>
      <td><a href="https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler"> stabilityai/stable-diffusion-x4-upscaler </a></td>
  </tr>
  <tr>
    <td>Super Resolution</td>
    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/latent_upscale">Stable Diffusion Latent Upscale</a></td>
      <td><a href="https://huggingface.co/stabilityai/sd-x2-latent-upscaler"> stabilityai/sd-x2-latent-upscaler </a></td>
  </tr>
 </table>
 ## Popular libraries using 🧨 Diffusers
 - https://github.com/microsoft/TaskMatrix
 - https://github.com/invoke-ai/InvokeAI
 - https://github.com/apple/ml-stable-diffusion
 - https://github.com/Sanster/lama-cleaner
 - https://github.com/IDEA-Research/Grounded-Segment-Anything
 - https://github.com/ashawkey/stable-dreamfusion
 - https://github.com/deep-floyd/IF
 - https://github.com/bentoml/BentoML
 - https://github.com/bmaltais/kohya_ss
 - +3000 other amazing GitHub repositories 💪
 Thank you for using us ❤️
 ## Credits
 This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today:
--- a/docker/diffusers-pytorch-compile-cuda/Dockerfile
+++ b/docker/diffusers-pytorch-compile-cuda/Dockerfile
@@ -1,46 +0,0 @@
 FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
 LABEL maintainer="Hugging Face"
 LABEL repository="diffusers"
 ENV DEBIAN_FRONTEND=noninteractive
 RUN apt update && \
    apt install -y bash \
    build-essential \
    git \
    git-lfs \
    curl \
    ca-certificates \
    libsndfile1-dev \
    libgl1 \
    python3.9 \
    python3.9-dev \
    python3-pip \
    python3.9-venv && \
    rm -rf /var/lib/apt/lists
 # make sure to use venv
 RUN python3.9 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.9 -m pip install --no-cache-dir --upgrade pip && \
    python3.9 -m pip install --no-cache-dir \
    torch \
    torchvision \
    torchaudio \
    invisible_watermark && \
    python3.9 -m pip install --no-cache-dir \
    accelerate \
    datasets \
    hf-doc-builder \
    huggingface-hub \
    Jinja2 \
    librosa \
    numpy \
    scipy \
    tensorboard \
    transformers \
    omegaconf
 CMD ["/bin/bash"]
--- a/docker/diffusers-pytorch-cpu/Dockerfile
+++ b/docker/diffusers-pytorch-cpu/Dockerfile
@@ -14,7 +14,6 @@ RUN apt update && \
                   libsndfile1-dev \
                   python3.8 \
                   python3-pip \
                   libgl1 \
                   python3.8-venv && \
    rm -rf /var/lib/apt/lists
@@ -28,7 +27,6 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
        torch \
        torchvision \
        torchaudio \
        invisible_watermark \
        --extra-index-url https://download.pytorch.org/whl/cpu && \
    python3 -m pip install --no-cache-dir \
        accelerate \
@@ -42,4 +40,4 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
        tensorboard \
        transformers
-CMD ["/bin/bash"]
+CMD ["/bin/bash"]
--- a/docker/diffusers-pytorch-cuda/Dockerfile
+++ b/docker/diffusers-pytorch-cuda/Dockerfile
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
+FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04
 LABEL maintainer="Hugging Face"
 LABEL repository="diffusers"
@@ -6,16 +6,15 @@ ENV DEBIAN_FRONTEND=noninteractive
 RUN apt update && \
    apt install -y bash \
-    build-essential \
-    git \
-    git-lfs \
-    curl \
-    ca-certificates \
-    libsndfile1-dev \
-    libgl1 \
-    python3.8 \
-    python3-pip \
-    python3.8-venv && \
+                   build-essential \
+                   git \
+                   git-lfs \
+                   curl \
+                   ca-certificates \
+                   libsndfile1-dev \
+                   python3.8 \
+                   python3-pip \
+                   python3.8-venv && \
    rm -rf /var/lib/apt/lists
 # make sure to use venv
@@ -25,22 +24,19 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    python3 -m pip install --no-cache-dir \
-    torch \
-    torchvision \
-    torchaudio \
-    invisible_watermark && \
+        torch \
+        torchvision \
+        torchaudio \
     python3 -m pip install --no-cache-dir \
-    accelerate \
-    datasets \
-    hf-doc-builder \
-    huggingface-hub \
-    Jinja2 \
-    librosa \
-    numpy \
-    scipy \
-    tensorboard \
-    transformers \
-    omegaconf \
-    pytorch-lightning
+        accelerate \
+        datasets \
+        hf-doc-builder \
+        huggingface-hub \
+        Jinja2 \
+        librosa \
+        numpy \
+        scipy \
+        tensorboard \
+        transformers
 CMD ["/bin/bash"]
--- a/docker/diffusers-pytorch-xformers-cuda/Dockerfile
+++ b/docker/diffusers-pytorch-xformers-cuda/Dockerfile
@@ -1,46 +0,0 @@
 FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
 LABEL maintainer="Hugging Face"
 LABEL repository="diffusers"
 ENV DEBIAN_FRONTEND=noninteractive
 RUN apt update && \
    apt install -y bash \
                   build-essential \
                   git \
                   git-lfs \
                   curl \
                   ca-certificates \
                   libsndfile1-dev \
                   libgl1 \
                   python3.8 \
                   python3-pip \
                   python3.8-venv && \
    rm -rf /var/lib/apt/lists
 # make sure to use venv
 RUN python3 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    python3 -m pip install --no-cache-dir \
        torch \
        torchvision \
        torchaudio \
        invisible_watermark && \
    python3 -m pip install --no-cache-dir \
        accelerate \
        datasets \
        hf-doc-builder \
        huggingface-hub \
        Jinja2 \
        librosa \
        numpy \
        scipy \
        tensorboard \
        transformers \
        omegaconf \
        xformers
 CMD ["/bin/bash"]
--- a/docs/README.md
+++ b/docs/README.md
@@ -68,10 +68,10 @@ The `preview` command only works with existing doc files. When you add a complet
 ## Adding a new element to the navigation bar
-Accepted files are Markdown (.md).
+Accepted files are Markdown (.md or .mdx).
 Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
-the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml) file.
+the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/diffusers/blob/main/docs/source/_toctree.yml) file.
 ## Renaming section headers and moving sections
@@ -81,14 +81,14 @@ Therefore, we simply keep a little map of moved sections at the end of the docum
 So if you renamed a section from: "Section A" to "Section B", then you can add at the end of the file:
-```md
+```
 Sections that were moved:
 [ <a href="#section-b">Section A</a><a id="section-a"></a> ]
 ```
 and of course, if you moved it to another file, then:
-```md
+```
 Sections that were moved:
 [ <a href="../new-file#section-b">Section A</a><a id="section-a"></a> ]
@@ -96,7 +96,7 @@ Sections that were moved:
 Use the relative style to link to the new file so that the versioned docs continue to work.
-For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).
+For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.mdx).
 ## Writing Documentation - Specification
@@ -109,8 +109,8 @@ although we can write them directly in Markdown.
 Adding a new tutorial or section is done in two steps:
-- Add a new Markdown (.md) file under `docs/source/<languageCode>`.
+- Add a new file under `docs/source`. This file can either be ReStructuredText (.rst) or Markdown (.md).
-- Link that file in `docs/source/<languageCode>/_toctree.yml` on the correct toc-tree.
+- Link that file in `docs/source/_toctree.yml` on the correct toc-tree.
 Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so
 depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or four.
@@ -119,8 +119,8 @@ depending on the intended targets (beginners, more advanced users, or researcher
 When adding a new pipeline:
-- Create a file `xxx.md` under `docs/source/<languageCode>/api/pipelines` (don't hesitate to copy an existing file as template).
+- create a file `xxx.mdx` under `docs/source/api/pipelines` (don't hesitate to copy an existing file as template).
-- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.md`, along with the link to the paper, and a colab notebook (if available).
+- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.mdx`, along with the link to the paper, and a colab notebook (if available).
 - Write a short overview of the diffusion model:
    - Overview with paper & authors
    - Paper abstract
@@ -129,6 +129,8 @@ When adding a new pipeline:
 - Add all the pipeline classes that should be linked in the diffusion model. These classes should be added using our Markdown syntax. By default as follows:
 ```
 ## XXXPipeline
 [[autodoc]] XXXPipeline
    - all
 	- __call__
@@ -146,7 +148,7 @@ This will include every public method of the pipeline that is documented, as wel
    - disable_xformers_memory_efficient_attention
 ```
-You can follow the same process to create a new scheduler under the `docs/source/<languageCode>/api/schedulers` folder.
+You can follow the same process to create a new scheduler under the `docs/source/api/schedulers` folder
 ### Writing source documentation
@@ -162,7 +164,7 @@ provide its path. For instance: \[\`pipelines.ImagePipelineOutput\`\]. This will
 `pipelines.ImagePipelineOutput` in the description. To get rid of the path and only keep the name of the object you are
 linking to in the description, add a ~: \[\`~pipelines.ImagePipelineOutput\`\] will generate a link with `ImagePipelineOutput` in the description.
-The same works for methods so you can either use \[\`XXXClass.method\`\] or \[\`~XXXClass.method\`\].
+The same works for methods so you can either use \[\`XXXClass.method\`\] or \[~\`XXXClass.method\`\].
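A minimal sketch of how the link styles described above might look inside a docstring. `describe_output` is a hypothetical helper and not part of diffusers. The bracketed references are plain text to Python and would only be turned into links by the doc builder. The tilde is placed inside the backticks here, matching the `ImagePipelineOutput` example.

```python
def describe_output() -> str:
    """
    Hypothetical helper used only to show the two link styles.

    With the full path as link text: [`pipelines.ImagePipelineOutput`].
    With only the object name as link text: [`~pipelines.ImagePipelineOutput`].
    For methods, the same pattern applies: [`~XXXClass.method`].
    """
    # The return value is irrelevant; the docstring carries the example.
    return "see docstring"
```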
 #### Defining arguments in a method
@@ -194,8 +196,8 @@ Here's an example showcasing everything so far:
 For optional arguments or arguments with defaults we follow the following syntax: imagine we have a function with the
 following signature:
-```py
+```
-def my_function(x: str=None, a: float=3.14):
+def my_function(x: str = None, a: float = 1):
 ```
 then its documentation should look like this:
@@ -204,7 +206,7 @@ then its documentation should look like this:
    Args:
        x (`str`, *optional*):
            This argument controls ...
-        a (`float`, *optional*, defaults to `3.14`):
+        a (`float`, *optional*, defaults to 1):
            This argument is used to ...
 ```
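As a rough sketch, the signature and the `Args:` block above can be combined into one runnable function. The function body and the argument descriptions are invented for illustration. Only the docstring layout follows the guide, and it uses the `3.14` default with backticks, which is one of the two variants shown in this diff.

```python
from typing import Optional


def my_function(x: Optional[str] = None, a: float = 3.14) -> str:
    """
    Illustrative function; only the docstring layout matters here.

    Args:
        x (`str`, *optional*):
            Hypothetical label to prepend to the output.
        a (`float`, *optional*, defaults to `3.14`):
            Hypothetical value to format into the output.
    """
    # Optional argument without a stated default: handle the None case explicitly.
    prefix = "" if x is None else f"{x}: "
    return f"{prefix}{a}"
```

An argument that defaults to `None` is documented as *optional* with no "defaults to" clause, while an argument with a concrete default states that value.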
@@ -266,3 +268,4 @@ We have an automatic script running with the `make style` command that will make
 This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's
 recommended to commit your changes before running `make style`, so you can revert the changes done by that script
 easily.
--- a/docs/source/_config.py
+++ b/docs/source/_config.py
@@ -6,4 +6,4 @@ INSTALL_CONTENT = """
 # ! pip install git+https://github.com/huggingface/diffusers.git
 """
-notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
+notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -12,13 +12,9 @@
  - local: tutorials/tutorial_overview
    title: Overview
  - local: using-diffusers/write_own_pipeline
-    title: Understanding pipelines, models and schedulers
+    title: Understanding models and schedulers
  - local: tutorials/autopipeline
    title: AutoPipeline
  - local: tutorials/basic_training
    title: Train a diffusion model
  - local: tutorials/using_peft_for_inference
    title: Inference with PEFT
  title: Tutorials
 - sections:
  - sections:
@@ -29,15 +25,9 @@
    - local: using-diffusers/schedulers
      title: Load and compare different schedulers
    - local: using-diffusers/custom_pipeline_overview
-      title: Load community pipelines and components
+      title: Load community pipelines
-    - local: using-diffusers/using_safetensors
+    - local: using-diffusers/kerascv
-      title: Load safetensors
+      title: Load KerasCV Stable Diffusion checkpoints
    - local: using-diffusers/other-formats
      title: Load different Stable Diffusion formats
    - local: using-diffusers/loading_adapters
      title: Load adapters
    - local: using-diffusers/push_to_hub
      title: Push files to the Hub
    title: Loading & Hub
  - sections:
    - local: using-diffusers/pipeline_overview
@@ -45,59 +35,31 @@
    - local: using-diffusers/unconditional_image_generation
      title: Unconditional image generation
    - local: using-diffusers/conditional_image_generation
-      title: Text-to-image
+      title: Text-to-image generation
    - local: using-diffusers/img2img
-      title: Image-to-image
+      title: Text-guided image-to-image
    - local: using-diffusers/inpaint
-      title: Inpainting
+      title: Text-guided image-inpainting
    - local: using-diffusers/depth2img
-      title: Depth-to-image
+      title: Text-guided depth-to-image
    title: Tasks
  - sections:
    - local: using-diffusers/textual_inversion_inference
      title: Textual inversion
    - local: training/distributed_inference
      title: Distributed inference with multiple GPUs
    - local: using-diffusers/reusing_seeds
      title: Improve image quality with deterministic generation
    - local: using-diffusers/control_brightness
      title: Control image brightness
    - local: using-diffusers/weighted_prompts
      title: Prompt weighting
    - local: using-diffusers/freeu
      title: Improve generation quality with FreeU
    title: Techniques
  - sections:
    - local: using-diffusers/pipeline_overview
      title: Overview
    - local: using-diffusers/sdxl
      title: Stable Diffusion XL
    - local: using-diffusers/kandinsky
      title: Kandinsky
    - local: using-diffusers/controlnet
      title: ControlNet
    - local: using-diffusers/callback
      title: Callback
    - local: using-diffusers/shap-e
      title: Shap-E
    - local: using-diffusers/diffedit
      title: DiffEdit
    - local: using-diffusers/distilled_sd
      title: Distilled Stable Diffusion inference
    - local: using-diffusers/reproducibility
      title: Create reproducible pipelines
    - local: using-diffusers/custom_pipeline_examples
      title: Community pipelines
    - local: using-diffusers/contribute_pipeline
-      title: Contribute a community pipeline
+      title: How to contribute a community pipeline
-    title: Specific pipeline examples
+    - local: using-diffusers/using_safetensors
      title: Using safetensors
    - local: using-diffusers/stable_diffusion_jax_how_to
      title: Stable Diffusion in JAX/Flax
    - local: using-diffusers/weighted_prompts
      title: Weighting Prompts
    title: Pipelines for Inference
  - sections:
    - local: training/overview
      title: Overview
    - local: training/create_dataset
      title: Create a dataset for training
    - local: training/adapt_a_model
      title: Adapt a model to a new task
    - local: training/unconditional_training
      title: Unconditional image generation
    - local: training/text_inversion
@@ -114,12 +76,12 @@
      title: InstructPix2Pix Training
    - local: training/custom_diffusion
      title: Custom Diffusion
    - local: training/t2i_adapters
      title: T2I-Adapters
    - local: training/ddpo
      title: Reinforcement learning training with DDPO
    title: Training
  - sections:
    - local: using-diffusers/rl
      title: Reinforcement Learning
    - local: using-diffusers/audio
      title: Audio
    - local: using-diffusers/other-modalities
      title: Other Modalities
    title: Taking Diffusers Beyond Images
@@ -127,35 +89,23 @@
 - sections:
  - local: optimization/opt_overview
    title: Overview
-  - sections:
-    - local: optimization/fp16
-      title: Speed up inference
-    - local: optimization/memory
-      title: Reduce memory usage
-    - local: optimization/torch2.0
-      title: Torch 2.0
-    - local: optimization/xformers
-      title: xFormers
-    - local: optimization/tome
-      title: Token merging
-    title: General optimizations
-  - sections:
-    - local: using-diffusers/stable_diffusion_jax_how_to
-      title: JAX/Flax
-    - local: optimization/onnx
-      title: ONNX
-    - local: optimization/open_vino
-      title: OpenVINO
-    - local: optimization/coreml
-      title: Core ML
-    title: Optimized model types
-  - sections:
-    - local: optimization/mps
-      title: Metal Performance Shaders (MPS)
-    - local: optimization/habana
-      title: Habana Gaudi
-    title: Optimized hardware
-  title: Optimization
+  - local: optimization/fp16
+    title: Memory and Speed
+  - local: optimization/torch2.0
+    title: Torch2.0 support
+  - local: optimization/xformers
+    title: xFormers
+  - local: optimization/onnx
+    title: ONNX
+  - local: optimization/open_vino
+    title: OpenVINO
+  - local: optimization/coreml
+    title: Core ML
+  - local: optimization/mps
+    title: MPS
+  - local: optimization/habana
+    title: Habana Gaudi
+  title: Optimization/Special Hardware
 - sections:
  - local: conceptual/philosophy
    title: Philosophy
@@ -170,70 +120,28 @@
  title: Conceptual Guides
 - sections:
  - sections:
-    - local: api/configuration
+    - local: api/models
-      title: Configuration
+      title: Models
-    - local: api/loaders
+    - local: api/diffusion_pipeline
-      title: Loaders
+      title: Diffusion Pipeline
    - local: api/logging
      title: Logging
    - local: api/configuration
      title: Configuration
    - local: api/outputs
      title: Outputs
    - local: api/loaders
      title: Loaders
    title: Main Classes
  - sections:
    - local: api/models/overview
      title: Overview
    - local: api/models/unet
      title: UNet1DModel
    - local: api/models/unet2d
      title: UNet2DModel
    - local: api/models/unet2d-cond
      title: UNet2DConditionModel
    - local: api/models/unet3d-cond
      title: UNet3DConditionModel
    - local: api/models/unet-motion
      title: UNetMotionModel
    - local: api/models/vq
      title: VQModel
    - local: api/models/autoencoderkl
      title: AutoencoderKL
    - local: api/models/asymmetricautoencoderkl
      title: AsymmetricAutoencoderKL
    - local: api/models/autoencoder_tiny
      title: Tiny AutoEncoder
    - local: api/models/transformer2d
      title: Transformer2D
    - local: api/models/transformer_temporal
      title: Transformer Temporal
    - local: api/models/prior_transformer
      title: Prior Transformer
    - local: api/models/controlnet
      title: ControlNet
    title: Models
  - sections:
    - local: api/pipelines/overview
      title: Overview
    - local: api/pipelines/alt_diffusion
      title: AltDiffusion
    - local: api/pipelines/animatediff
      title: AnimateDiff
    - local: api/pipelines/attend_and_excite
      title: Attend-and-Excite
    - local: api/pipelines/audio_diffusion
      title: Audio Diffusion
    - local: api/pipelines/audioldm
      title: AudioLDM
    - local: api/pipelines/audioldm2
      title: AudioLDM 2
    - local: api/pipelines/auto_pipeline
      title: AutoPipeline
    - local: api/pipelines/blip_diffusion
      title: BLIP Diffusion
    - local: api/pipelines/consistency_models
      title: Consistency Models
    - local: api/pipelines/controlnet
      title: ControlNet
    - local: api/pipelines/controlnet_sdxl
      title: ControlNet with Stable Diffusion XL
    - local: api/pipelines/cycle_diffusion
      title: Cycle Diffusion
    - local: api/pipelines/dance_diffusion
@@ -242,167 +150,121 @@
      title: DDIM
    - local: api/pipelines/ddpm
      title: DDPM
    - local: api/pipelines/deepfloyd_if
      title: DeepFloyd IF
    - local: api/pipelines/diffedit
      title: DiffEdit
    - local: api/pipelines/dit
      title: DiT
    - local: api/pipelines/pix2pix
      title: InstructPix2Pix
    - local: api/pipelines/kandinsky
      title: Kandinsky 2.1
    - local: api/pipelines/kandinsky_v22
      title: Kandinsky 2.2
    - local: api/pipelines/latent_consistency_models
      title: Latent Consistency Models
    - local: api/pipelines/latent_diffusion
      title: Latent Diffusion
    - local: api/pipelines/panorama
      title: MultiDiffusion
    - local: api/pipelines/musicldm
      title: MusicLDM
    - local: api/pipelines/paint_by_example
-      title: Paint By Example
+      title: PaintByExample
    - local: api/pipelines/paradigms
      title: Parallel Sampling of Diffusion Models
    - local: api/pipelines/pix2pix_zero
      title: Pix2Pix Zero
    - local: api/pipelines/pixart
      title: PixArt
    - local: api/pipelines/pndm
      title: PNDM
    - local: api/pipelines/repaint
      title: RePaint
    - local: api/pipelines/stable_diffusion_safe
      title: Safe Stable Diffusion
    - local: api/pipelines/score_sde_ve
      title: Score SDE VE
    - local: api/pipelines/self_attention_guidance
      title: Self-Attention Guidance
    - local: api/pipelines/semantic_stable_diffusion
      title: Semantic Guidance
    - local: api/pipelines/shap_e
      title: Shap-E
    - local: api/pipelines/spectrogram_diffusion
-      title: Spectrogram Diffusion
+      title: "Spectrogram Diffusion"
    - sections:
      - local: api/pipelines/stable_diffusion/overview
        title: Overview
      - local: api/pipelines/stable_diffusion/text2img
-        title: Text-to-image
+        title: Text-to-Image
      - local: api/pipelines/stable_diffusion/img2img
-        title: Image-to-image
+        title: Image-to-Image
      - local: api/pipelines/stable_diffusion/inpaint
-        title: Inpainting
+        title: Inpaint
      - local: api/pipelines/stable_diffusion/depth2img
-        title: Depth-to-image
+        title: Depth-to-Image
      - local: api/pipelines/stable_diffusion/image_variation
-        title: Image variation
+        title: Image-Variation
      - local: api/pipelines/stable_diffusion/stable_diffusion_safe
        title: Safe Stable Diffusion
      - local: api/pipelines/stable_diffusion/stable_diffusion_2
        title: Stable Diffusion 2
      - local: api/pipelines/stable_diffusion/stable_diffusion_xl
        title: Stable Diffusion XL
      - local: api/pipelines/stable_diffusion/latent_upscale
        title: Latent upscaler
      - local: api/pipelines/stable_diffusion/upscale
-        title: Super-resolution
+        title: Super-Resolution
-      - local: api/pipelines/stable_diffusion/ldm3d_diffusion
+      - local: api/pipelines/stable_diffusion/latent_upscale
-        title: LDM3D Text-to-(RGB, Depth)
+        title: Stable-Diffusion-Latent-Upscaler
-      - local: api/pipelines/stable_diffusion/adapter
+      - local: api/pipelines/stable_diffusion/pix2pix
-        title: Stable Diffusion T2I-Adapter
+        title: InstructPix2Pix
-      - local: api/pipelines/stable_diffusion/gligen
+      - local: api/pipelines/stable_diffusion/attend_and_excite
-        title: GLIGEN (Grounded Language-to-Image Generation)
+        title: Attend and Excite
      - local: api/pipelines/stable_diffusion/pix2pix_zero
        title: Pix2Pix Zero
      - local: api/pipelines/stable_diffusion/self_attention_guidance
        title: Self-Attention Guidance
      - local: api/pipelines/stable_diffusion/panorama
        title: MultiDiffusion Panorama
      - local: api/pipelines/stable_diffusion/controlnet
        title: Text-to-Image Generation with ControlNet Conditioning
      - local: api/pipelines/stable_diffusion/model_editing
        title: Text-to-Image Model Editing
      title: Stable Diffusion
    - local: api/pipelines/stable_diffusion_2
      title: Stable Diffusion 2
    - local: api/pipelines/stable_unclip
      title: Stable unCLIP
    - local: api/pipelines/stochastic_karras_ve
      title: Stochastic Karras VE
    - local: api/pipelines/model_editing
      title: Text-to-image model editing
    - local: api/pipelines/text_to_video
-      title: Text-to-video
+      title: Text-to-Video
    - local: api/pipelines/text_to_video_zero
-      title: Text2Video-Zero
+      title: Text-to-Video Zero
    - local: api/pipelines/unclip
-      title: unCLIP
+      title: UnCLIP
    - local: api/pipelines/latent_diffusion_uncond
      title: Unconditional Latent Diffusion
    - local: api/pipelines/unidiffuser
      title: UniDiffuser
    - local: api/pipelines/value_guided_sampling
      title: Value-guided sampling
    - local: api/pipelines/versatile_diffusion
      title: Versatile Diffusion
    - local: api/pipelines/vq_diffusion
      title: VQ Diffusion
    - local: api/pipelines/wuerstchen
      title: Wuerstchen
    title: Pipelines
  - sections:
    - local: api/schedulers/overview
      title: Overview
    - local: api/schedulers/cm_stochastic_iterative
      title: CMStochasticIterativeScheduler
    - local: api/schedulers/ddim_inverse
      title: DDIMInverseScheduler
    - local: api/schedulers/ddim
-      title: DDIMScheduler
+      title: DDIM
    - local: api/schedulers/ddim_inverse
      title: DDIMInverse
    - local: api/schedulers/ddpm
-      title: DDPMScheduler
+      title: DDPM
    - local: api/schedulers/deis
-      title: DEISMultistepScheduler
+      title: DEIS
    - local: api/schedulers/multistep_dpm_solver_inverse
      title: DPMSolverMultistepInverse
    - local: api/schedulers/multistep_dpm_solver
      title: DPMSolverMultistepScheduler
    - local: api/schedulers/dpm_sde
      title: DPMSolverSDEScheduler
    - local: api/schedulers/singlestep_dpm_solver
      title: DPMSolverSinglestepScheduler
    - local: api/schedulers/euler_ancestral
      title: EulerAncestralDiscreteScheduler
    - local: api/schedulers/euler
      title: EulerDiscreteScheduler
    - local: api/schedulers/heun
      title: HeunDiscreteScheduler
    - local: api/schedulers/ipndm
      title: IPNDMScheduler
    - local: api/schedulers/stochastic_karras_ve
      title: KarrasVeScheduler
    - local: api/schedulers/dpm_discrete_ancestral
      title: KDPM2AncestralDiscreteScheduler
    - local: api/schedulers/dpm_discrete
-      title: KDPM2DiscreteScheduler
+      title: DPM Discrete Scheduler
-    - local: api/schedulers/lcm
+    - local: api/schedulers/dpm_discrete_ancestral
-      title: LCMScheduler
+      title: DPM Discrete Scheduler with ancestral sampling
    - local: api/schedulers/euler_ancestral
      title: Euler Ancestral Scheduler
    - local: api/schedulers/euler
      title: Euler scheduler
    - local: api/schedulers/heun
      title: Heun Scheduler
    - local: api/schedulers/ipndm
      title: IPNDM
    - local: api/schedulers/lms_discrete
-      title: LMSDiscreteScheduler
+      title: Linear Multistep
    - local: api/schedulers/multistep_dpm_solver
      title: Multistep DPM-Solver
    - local: api/schedulers/pndm
-      title: PNDMScheduler
+      title: PNDM
    - local: api/schedulers/repaint
-      title: RePaintScheduler
+      title: RePaint Scheduler
-    - local: api/schedulers/score_sde_ve
+    - local: api/schedulers/singlestep_dpm_solver
-      title: ScoreSdeVeScheduler
+      title: Singlestep DPM-Solver
-    - local: api/schedulers/score_sde_vp
+    - local: api/schedulers/stochastic_karras_ve
-      title: ScoreSdeVpScheduler
+      title: Stochastic Karras VE
    - local: api/schedulers/unipc
      title: UniPCMultistepScheduler
    - local: api/schedulers/score_sde_ve
      title: VE-SDE
    - local: api/schedulers/score_sde_vp
      title: VP-SDE
    - local: api/schedulers/vq_diffusion
      title: VQDiffusionScheduler
    title: Schedulers
  - sections:
-    - local: api/internal_classes_overview
+    - local: api/experimental/rl
-      title: Overview
+      title: RL Planning
-    - local: api/attnprocessor
+    title: Experimental Features
      title: Attention Processor
    - local: api/activations
      title: Custom activation functions
    - local: api/normalization
      title: Custom normalization layers
    - local: api/utilities
      title: Utilities
    - local: api/image_processor
      title: VAE Image Processor
    title: Internal classes
  title: API
--- a/docs/source/en/api/activations.md
+++ b/docs/source/en/api/activations.md
@@ -1,15 +0,0 @@
 # Activation functions
 Customized activation functions for supporting various models in 🤗 Diffusers.
 ## GELU
 [[autodoc]] models.activations.GELU
 ## GEGLU
 [[autodoc]] models.activations.GEGLU
 ## ApproximateGELU
 [[autodoc]] models.activations.ApproximateGELU
--- a/docs/source/en/api/attnprocessor.md
+++ b/docs/source/en/api/attnprocessor.md
@@ -1,45 +0,0 @@
 # Attention Processor
 An attention processor is a class for applying different types of attention mechanisms.
 ## AttnProcessor
 [[autodoc]] models.attention_processor.AttnProcessor
 ## AttnProcessor2_0
 [[autodoc]] models.attention_processor.AttnProcessor2_0
 ## LoRAAttnProcessor
 [[autodoc]] models.attention_processor.LoRAAttnProcessor
 ## LoRAAttnProcessor2_0
 [[autodoc]] models.attention_processor.LoRAAttnProcessor2_0
 ## CustomDiffusionAttnProcessor
 [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor
 ## CustomDiffusionAttnProcessor2_0
 [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0
 ## AttnAddedKVProcessor
 [[autodoc]] models.attention_processor.AttnAddedKVProcessor
 ## AttnAddedKVProcessor2_0
 [[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0
 ## LoRAAttnAddedKVProcessor
 [[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor
 ## XFormersAttnProcessor
 [[autodoc]] models.attention_processor.XFormersAttnProcessor
 ## LoRAXFormersAttnProcessor
 [[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor
 ## CustomDiffusionXFormersAttnProcessor
 [[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor
 ## SlicedAttnProcessor
 [[autodoc]] models.attention_processor.SlicedAttnProcessor
 ## SlicedAttnAddedKVProcessor
 [[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
--- a/docs/source/en/api/configuration.mdx
+++ b/docs/source/en/api/configuration.mdx
@@ -12,13 +12,8 @@ specific language governing permissions and limitations under the License.
 # Configuration
-Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`] which stores all the parameters that are passed to their respective `__init__` methods in a JSON-configuration file.
+Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`] which conveniently takes care of storing all the parameters that are 
-
+passed to their respective `__init__` methods in a JSON-configuration file.
 <Tip>
 To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`.
 </Tip>
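To make the idea concrete, here is a minimal sketch of how a mixin could capture `__init__` arguments and round-trip them through a JSON file. It uses only the standard library; `TinyConfigMixin`, `ToyScheduler`, `register_init_args`, and the `config.json` file name are hypothetical names chosen for illustration, and this is not how [`ConfigMixin`] is actually implemented.

```python
import inspect
import json
import os
import tempfile


class TinyConfigMixin:
    """Hypothetical stand-in for the behavior described above (not diffusers' ConfigMixin)."""

    config_name = "config.json"  # assumed file name, for illustration only

    def register_init_args(self, **kwargs):
        # Remember every argument that was passed to __init__.
        self.config = dict(kwargs)

    def save_config(self, directory):
        # Write the stored arguments to a JSON file inside `directory`.
        path = os.path.join(directory, self.config_name)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.config, f, indent=2, sort_keys=True)
        return path

    @classmethod
    def from_config(cls, directory):
        # Read the JSON file back and keep only the keys that __init__ accepts.
        path = os.path.join(directory, cls.config_name)
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
        accepted = inspect.signature(cls.__init__).parameters
        return cls(**{k: v for k, v in raw.items() if k in accepted})


class ToyScheduler(TinyConfigMixin):
    def __init__(self, num_train_timesteps=1000, beta_start=0.0001):
        self.register_init_args(
            num_train_timesteps=num_train_timesteps, beta_start=beta_start
        )


with tempfile.TemporaryDirectory() as tmp:
    ToyScheduler(num_train_timesteps=50).save_config(tmp)
    restored = ToyScheduler.from_config(tmp)

print(restored.config)
```

The point of the sketch is the round trip: whatever was passed to `__init__` ends up in the JSON file, and the same file is enough to rebuild an equivalent object.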
 ## ConfigMixin
--- a/docs/source/en/api/diffusion_pipeline.mdx
+++ b/docs/source/en/api/diffusion_pipeline.mdx
@@ -0,0 +1,47 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Pipelines
 The [`DiffusionPipeline`] is the easiest way to load any pretrained diffusion pipeline from the [Hub](https://huggingface.co/models?library=diffusers) and to use it in inference.
 <Tip>
 	Do not use the [`DiffusionPipeline`] class for training or fine-tuning a diffusion model. Individual 
 	components of diffusion pipelines are usually trained individually, so we suggest working directly 
 	with [`UNet2DModel`] and [`UNet2DConditionModel`].
 </Tip>
 [`~DiffusionPipeline.from_pretrained`] automatically detects the pipeline type, *e.g.* [`StableDiffusionPipeline`], 
 loads each component of the pipeline, and passes the components to the pipeline's `__init__` function, 
 *e.g.* [`~StableDiffusionPipeline.__init__`].
 Any pipeline object can be saved locally with [`~DiffusionPipeline.save_pretrained`].
 ## DiffusionPipeline
 [[autodoc]] DiffusionPipeline
 	- all
 	- __call__
 	- device
 	- to
 	- components
 ## ImagePipelineOutput
 By default diffusion pipelines return an object of class
 [[autodoc]] pipelines.ImagePipelineOutput
 ## AudioPipelineOutput
 By default diffusion pipelines return an object of class
 [[autodoc]] pipelines.AudioPipelineOutput
--- a/docs/source/en/api/experimental/rl.mdx
+++ b/docs/source/en/api/experimental/rl.mdx
@@ -0,0 +1,15 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # TODO
 Coming soon!
--- a/docs/source/en/api/image_processor.md
+++ b/docs/source/en/api/image_processor.md
@@ -1,27 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # VAE Image Processor
 The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]s to prepare image inputs for VAE encoding and to post-process outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays. 
 All pipelines with a [`VaeImageProcessor`] accept PIL Image, PyTorch tensor, or NumPy arrays as image inputs and return outputs based on the `output_type` argument set by the user. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="pt"`). This allows you to take the generated latents from one pipeline and pass them to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between different pipelines. 
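As a rough illustration of the normalization step mentioned above, the sketch below maps 8-bit pixel values to the `[-1, 1]` range commonly used for VAE inputs and back again. It is plain Python on lists, not the [`VaeImageProcessor`] implementation; the helper names `normalize` and `denormalize` and the exact value ranges are assumptions made for this example.

```python
def normalize(pixels):
    # Map 8-bit values in [0, 255] to floats in [-1.0, 1.0].
    return [p / 255.0 * 2.0 - 1.0 for p in pixels]


def denormalize(values):
    # Map floats back to [0, 255], clamping anything outside [-1.0, 1.0].
    out = []
    for v in values:
        unit = min(max(v / 2.0 + 0.5, 0.0), 1.0)
        out.append(round(unit * 255.0))
    return out


row = [0, 128, 255]
encoded = normalize(row)
decoded = denormalize(encoded)
print(encoded[0], encoded[-1], decoded)
```

Clamping on the way back matters because decoded values can overshoot the nominal range slightly, and an unclamped conversion would produce invalid pixel values.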
 ## VaeImageProcessor
 [[autodoc]] image_processor.VaeImageProcessor
 ## VaeImageProcessorLDM3D
 The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs.
 [[autodoc]] image_processor.VaeImageProcessorLDM3D
--- a/docs/source/en/api/internal_classes_overview.md
+++ b/docs/source/en/api/internal_classes_overview.md
@@ -1,3 +0,0 @@
 # Overview
 The APIs in this section are more experimental and prone to breaking changes. Most of them are used internally for development, but they may also be useful to you if you're interested in building a diffusion model with some custom parts or if you're interested in some of our helper utilities for working with 🤗 Diffusers.
--- a/docs/source/en/api/loaders.md
+++ b/docs/source/en/api/loaders.md
@@ -1,49 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Loaders
 Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusion model to generate images in a specific style without training or finetuning the entire model. The adapter weights are typically only a tiny fraction of the pretrained model's, which makes them very portable. 🤗 Diffusers provides an easy-to-use `LoaderMixin` API to load adapter weights.
 <Tip warning={true}>
 🧪 The `LoaderMixins` are highly experimental and prone to future changes. To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`.
 </Tip>
 ## UNet2DConditionLoadersMixin
 [[autodoc]] loaders.UNet2DConditionLoadersMixin
 ## TextualInversionLoaderMixin
 [[autodoc]] loaders.TextualInversionLoaderMixin
 ## StableDiffusionXLLoraLoaderMixin
 [[autodoc]] loaders.StableDiffusionXLLoraLoaderMixin
 ## LoraLoaderMixin
 [[autodoc]] loaders.LoraLoaderMixin
 ## FromSingleFileMixin
 [[autodoc]] loaders.FromSingleFileMixin
 ## FromOriginalControlnetMixin
 [[autodoc]] loaders.FromOriginalControlnetMixin
 ## FromOriginalVAEMixin
 [[autodoc]] loaders.FromOriginalVAEMixin
--- a/docs/source/en/api/loaders.mdx
+++ b/docs/source/en/api/loaders.mdx
@@ -0,0 +1,42 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Loaders
 There are many ways to train adapter neural networks for diffusion models, such as 
 - [Textual Inversion](./training/text_inversion.mdx)
 - [LoRA](https://github.com/cloneofsimo/lora)
 - [Hypernetworks](https://arxiv.org/abs/1609.09106)
 Such adapter neural networks often only consist of a fraction of the number of weights compared 
 to the pretrained model and as such are very portable. The Diffusers library offers an easy-to-use
 API to load such adapter neural networks via the [`loaders.py` module](https://github.com/huggingface/diffusers/blob/main/src/diffusers/loaders.py). 
 **Note**: This module is still highly experimental and prone to future changes.
 ## LoaderMixins
 ### UNet2DConditionLoadersMixin
 [[autodoc]] loaders.UNet2DConditionLoadersMixin
 ### TextualInversionLoaderMixin
 [[autodoc]] loaders.TextualInversionLoaderMixin
 ### LoraLoaderMixin
 [[autodoc]] loaders.LoraLoaderMixin
 ### FromCkptMixin
 [[autodoc]] loaders.FromCkptMixin
--- a/docs/source/en/api/logging.md
+++ b/docs/source/en/api/logging.md
@@ -1,96 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Logging
 🤗 Diffusers has a centralized logging system to easily manage the verbosity of the library. The default verbosity is set to `WARNING`.
 To change the verbosity level, use one of the direct setters. For instance, to change the verbosity to the `INFO` level:
 ```python
 import diffusers
 diffusers.logging.set_verbosity_info()
 ```
 You can also use the environment variable `DIFFUSERS_VERBOSITY` to override the default verbosity. You can set it
 to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example:
 ```bash
 DIFFUSERS_VERBOSITY=error ./myprogram.py
 ```
 Additionally, some `warnings` can be disabled by setting the environment variable
 `DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like `1`. This disables any warning logged by
 [`logger.warning_advice`]. For example:
 ```bash
 DIFFUSERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py
 ```
 Here is an example of how to use the same logger as the library in your own module or script:
 ```python
 from diffusers.utils import logging
 logging.set_verbosity_info()
 logger = logging.get_logger("diffusers")
 logger.info("INFO")
 logger.warning("WARN")
 ```
 All methods of the logging module are documented below. The main methods are
 [`logging.get_verbosity`] to get the current level of verbosity in the logger and
 [`logging.set_verbosity`] to set the verbosity to the level of your choice. 
 In order from the least verbose to the most verbose:
 |                                                    Method | Integer value |                                         Description |
 |----------------------------------------------------------:|--------------:|----------------------------------------------------:|
 | `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` |            50 |                only report the most critical errors |
 |                                 `diffusers.logging.ERROR` |            40 |                                  only report errors |
 |   `diffusers.logging.WARNING` or `diffusers.logging.WARN` |            30 |           only report errors and warnings (default) |
 |                                  `diffusers.logging.INFO` |            20 | only report errors, warnings, and basic information |
 |                                 `diffusers.logging.DEBUG` |            10 |                              report all information |
 By default, `tqdm` progress bars are displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] are used to disable or enable this behavior.
 ## Base setters
 [[autodoc]] utils.logging.set_verbosity_error
 [[autodoc]] utils.logging.set_verbosity_warning
 [[autodoc]] utils.logging.set_verbosity_info
 [[autodoc]] utils.logging.set_verbosity_debug
 ## Other functions
 [[autodoc]] utils.logging.get_verbosity
 [[autodoc]] utils.logging.set_verbosity
 [[autodoc]] utils.logging.get_logger
 [[autodoc]] utils.logging.enable_default_handler
 [[autodoc]] utils.logging.disable_default_handler
 [[autodoc]] utils.logging.enable_explicit_format
 [[autodoc]] utils.logging.reset_format
 [[autodoc]] utils.logging.enable_progress_bar
 [[autodoc]] utils.logging.disable_progress_bar
--- a/docs/source/en/api/logging.mdx
+++ b/docs/source/en/api/logging.mdx
@@ -0,0 +1,98 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Logging
 🧨 Diffusers has a centralized logging system, so that you can set up the verbosity of the library easily.
 Currently the default verbosity of the library is `WARNING`.
 To change the level of verbosity, use one of the direct setters. For instance, here is how to change the verbosity
 to the `INFO` level:
 ```python
 import diffusers
 diffusers.logging.set_verbosity_info()
 ```
 You can also use the environment variable `DIFFUSERS_VERBOSITY` to override the default verbosity. You can set it
 to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example:
 ```bash
 DIFFUSERS_VERBOSITY=error ./myprogram.py
 ```
 Additionally, some `warnings` can be disabled by setting the environment variable
 `DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like *1*. This will disable any warning that is logged using
 [`logger.warning_advice`]. For example:
 ```bash
 DIFFUSERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py
 ```
 Here is an example of how to use the same logger as the library in your own module or script:
 ```python
 from diffusers.utils import logging
 logging.set_verbosity_info()
 logger = logging.get_logger("diffusers")
 logger.info("INFO")
 logger.warning("WARN")
 ```
 All the methods of this logging module are documented below. The main ones are
 [`logging.get_verbosity`] to get the current level of verbosity in the logger and
 [`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least
 verbose to the most verbose), those levels (with their corresponding int values in parentheses) are:
 - `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` (int value, 50): only report the most
  critical errors.
 - `diffusers.logging.ERROR` (int value, 40): only report errors.
 - `diffusers.logging.WARNING` or `diffusers.logging.WARN` (int value, 30): only report errors and
  warnings. This is the default level used by the library.
 - `diffusers.logging.INFO` (int value, 20): report errors, warnings, and basic information.
 - `diffusers.logging.DEBUG` (int value, 10): report all information.
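The level names and integer values listed above line up with those of Python's standard `logging` module. The sketch below shows one way a `DIFFUSERS_VERBOSITY` value could be resolved to such an integer. The helper `resolve_verbosity` is hypothetical, and falling back to `WARNING` for a missing or unrecognized value is an assumption of this example rather than documented library behavior.

```python
import logging
import os

# Integer values as listed above; they match Python's standard `logging` constants.
LEVELS = {
    "debug": logging.DEBUG,        # 10
    "info": logging.INFO,          # 20
    "warning": logging.WARNING,    # 30
    "error": logging.ERROR,        # 40
    "critical": logging.CRITICAL,  # 50
}

DEFAULT_LEVEL = logging.WARNING


def resolve_verbosity(env=os.environ):
    """Hypothetical helper: pick a level from DIFFUSERS_VERBOSITY, else the default."""
    name = env.get("DIFFUSERS_VERBOSITY", "").strip().lower()
    return LEVELS.get(name, DEFAULT_LEVEL)


print(resolve_verbosity({"DIFFUSERS_VERBOSITY": "error"}))
print(resolve_verbosity({}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.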
 By default, `tqdm` progress bars will be displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to disable or re-enable this behavior.
 ## Base setters
 [[autodoc]] logging.set_verbosity_error
 [[autodoc]] logging.set_verbosity_warning
 [[autodoc]] logging.set_verbosity_info
 [[autodoc]] logging.set_verbosity_debug
 ## Other functions
 [[autodoc]] logging.get_verbosity
 [[autodoc]] logging.set_verbosity
 [[autodoc]] logging.get_logger
 [[autodoc]] logging.enable_default_handler
 [[autodoc]] logging.disable_default_handler
 [[autodoc]] logging.enable_explicit_format
 [[autodoc]] logging.reset_format
 [[autodoc]] logging.enable_progress_bar
 [[autodoc]] logging.disable_progress_bar
--- a/docs/source/en/api/models.mdx
+++ b/docs/source/en/api/models.mdx
@@ -0,0 +1,107 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Models
 Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models.
 The primary function of these models is to denoise an input sample, by modeling the distribution $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$.
 The models are built on the base class [`ModelMixin`], which is a `torch.nn.Module` with basic functionality for saving and loading models both locally and from the Hugging Face Hub.
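As a point of reference, in the common DDPM-style formulation this reverse transition is modeled as a Gaussian whose mean, and optionally covariance, is predicted by the network. This is an assumption made here for illustration; individual models and schedulers may parameterize the distribution differently.

$$
p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)
$$

Under this reading, "denoising" means predicting the parameters of the distribution over the slightly less noisy sample $\mathbf{x}_{t-1}$ given the current sample $\mathbf{x}_t$ and the timestep $t$.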
 ## ModelMixin
 [[autodoc]] ModelMixin
 ## UNet2DOutput
 [[autodoc]] models.unet_2d.UNet2DOutput
 ## UNet2DModel
 [[autodoc]] UNet2DModel
 ## UNet1DOutput
 [[autodoc]] models.unet_1d.UNet1DOutput
 ## UNet1DModel
 [[autodoc]] UNet1DModel
 ## UNet2DConditionOutput
 [[autodoc]] models.unet_2d_condition.UNet2DConditionOutput
 ## UNet2DConditionModel
 [[autodoc]] UNet2DConditionModel
 ## UNet3DConditionOutput
 [[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
 ## UNet3DConditionModel
 [[autodoc]] UNet3DConditionModel
 ## DecoderOutput
 [[autodoc]] models.vae.DecoderOutput
 ## VQEncoderOutput
 [[autodoc]] models.vq_model.VQEncoderOutput
 ## VQModel
 [[autodoc]] VQModel
 ## AutoencoderKLOutput
 [[autodoc]] models.autoencoder_kl.AutoencoderKLOutput
 ## AutoencoderKL
 [[autodoc]] AutoencoderKL
 ## Transformer2DModel
 [[autodoc]] Transformer2DModel
 ## Transformer2DModelOutput
 [[autodoc]] models.transformer_2d.Transformer2DModelOutput
 ## TransformerTemporalModel
 [[autodoc]] models.transformer_temporal.TransformerTemporalModel
 ## TransformerTemporalModelOutput
 [[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput
 ## PriorTransformer
 [[autodoc]] models.prior_transformer.PriorTransformer
 ## PriorTransformerOutput
 [[autodoc]] models.prior_transformer.PriorTransformerOutput
 ## ControlNetOutput
 [[autodoc]] models.controlnet.ControlNetOutput
 ## ControlNetModel
 [[autodoc]] ControlNetModel
 ## FlaxModelMixin
 [[autodoc]] FlaxModelMixin
 ## FlaxUNet2DConditionOutput
 [[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput
 ## FlaxUNet2DConditionModel
 [[autodoc]] FlaxUNet2DConditionModel
 ## FlaxDecoderOutput
 [[autodoc]] models.vae_flax.FlaxDecoderOutput
 ## FlaxAutoencoderKLOutput
 [[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput
 ## FlaxAutoencoderKL
 [[autodoc]] FlaxAutoencoderKL
 ## FlaxControlNetOutput
 [[autodoc]] models.controlnet_flax.FlaxControlNetOutput
 ## FlaxControlNetModel
 [[autodoc]] FlaxControlNetModel
--- a/docs/source/en/api/models/asymmetricautoencoderkl.md
+++ b/docs/source/en/api/models/asymmetricautoencoderkl.md
@@ -1,55 +0,0 @@
 # AsymmetricAutoencoderKL
 An improved, larger variational autoencoder (VAE) model with KL loss for the inpainting task, introduced in [Designing a Better Asymmetric VQGAN for StableDiffusion](https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.
 The abstract from the paper is:
 *StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN*
 Evaluation results can be found in section 4.1 of the original paper. 
 ## Available checkpoints
 * [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5)
 * [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2)
 ## Example Usage
 ```python
 from io import BytesIO
 from PIL import Image
 import requests
 from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
 def download_image(url: str) -> Image.Image:
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")
 prompt = "a photo of a person"
 img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
 mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
 image = download_image(img_url).resize((256, 256))
 mask_image = download_image(mask_url).resize((256, 256))
 pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
 pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
 pipe.to("cuda")
 image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
 image.save("image.jpeg")
 ```
 ## AsymmetricAutoencoderKL
 [[autodoc]] models.autoencoder_asym_kl.AsymmetricAutoencoderKL
 ## AutoencoderKLOutput
 [[autodoc]] models.autoencoder_kl.AutoencoderKLOutput
 ## DecoderOutput
 [[autodoc]] models.vae.DecoderOutput
--- a/docs/source/en/api/models/autoencoder_tiny.md
+++ b/docs/source/en/api/models/autoencoder_tiny.md
@@ -1,45 +0,0 @@
 # Tiny AutoEncoder
 Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in [madebyollin/taesd](https://github.com/madebyollin/taesd) by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion's VAE that can decode the latents in a [`StableDiffusionPipeline`] or [`StableDiffusionXLPipeline`] almost instantly. 
 To use with Stable Diffusion v2.1:
 ```python
 import torch
 from diffusers import DiffusionPipeline, AutoencoderTiny
 pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
 )
 pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
 pipe = pipe.to("cuda")
 prompt = "slice of delicious New York-style berry cheesecake"
 image = pipe(prompt, num_inference_steps=25).images[0]
 image.save("cheesecake.png")
 ```
 To use with Stable Diffusion XL 1.0:
 ```python
 import torch
 from diffusers import DiffusionPipeline, AutoencoderTiny
 pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
 )
 pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
 pipe = pipe.to("cuda")
 prompt = "slice of delicious New York-style berry cheesecake"
 image = pipe(prompt, num_inference_steps=25).images[0]
 image.save("cheesecake_sdxl.png")
 ```
 ## AutoencoderTiny
 [[autodoc]] AutoencoderTiny
 ## AutoencoderTinyOutput
 [[autodoc]] models.autoencoder_tiny.AutoencoderTinyOutput
--- a/docs/source/en/api/models/autoencoderkl.md
+++ b/docs/source/en/api/models/autoencoderkl.md
@@ -1,43 +0,0 @@
 # AutoencoderKL
 The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.
 The abstract from the paper is:
 *How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.*
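The reparameterization mentioned in the abstract can be sketched in a few lines. This is a minimal, pure-Python illustration that assumes a diagonal Gaussian posterior described by a mean and a log-variance per latent dimension; the function name `reparameterize` and the example values are hypothetical and are not part of the 🤗 Diffusers API.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Sample z = mean + std * eps, where eps is drawn from a standard normal.

    `rng` is any object with a `gauss(mu, sigma)` method, e.g. random.Random.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


# Illustrative values only: a 2-dimensional latent.
rng = random.Random(0)
latent = reparameterize(mean=[0.5, -1.0], log_var=[-2.0, 0.0], rng=rng)
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mean` and `log_var` once the noise is fixed, which is what lets gradients flow through the sampling step.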
 ## Loading from the original format
 By default the [`AutoencoderKL`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
 from the original format using [`FromOriginalVAEMixin.from_single_file`] as follows:
 ```py
 from diffusers import AutoencoderKL
 url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors"  # can also be local file
 model = AutoencoderKL.from_single_file(url)
 ```
 ## AutoencoderKL
 [[autodoc]] AutoencoderKL
 ## AutoencoderKLOutput
 [[autodoc]] models.autoencoder_kl.AutoencoderKLOutput
 ## DecoderOutput
 [[autodoc]] models.vae.DecoderOutput
 ## FlaxAutoencoderKL
 [[autodoc]] FlaxAutoencoderKL
 ## FlaxAutoencoderKLOutput
 [[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput
 ## FlaxDecoderOutput
 [[autodoc]] models.vae_flax.FlaxDecoderOutput
--- a/docs/source/en/api/models/controlnet.md
+++ b/docs/source/en/api/models/controlnet.md
@@ -1,38 +0,0 @@
 # ControlNet
 The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.
 The abstract from the paper is:
 *We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
 ## Loading from the original format
 By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
 from the original format using [`FromOriginalControlnetMixin.from_single_file`] as follows:
 ```py
 from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
 url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"  # can also be a local path
 controlnet = ControlNetModel.from_single_file(url)
 url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors"  # can also be a local path
 pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet)
 ```
 ## ControlNetModel
 [[autodoc]] ControlNetModel
 ## ControlNetOutput
 [[autodoc]] models.controlnet.ControlNetOutput
 ## FlaxControlNetModel
 [[autodoc]] FlaxControlNetModel
 ## FlaxControlNetOutput
 [[autodoc]] models.controlnet_flax.FlaxControlNetOutput
--- a/docs/source/en/api/models/overview.md
+++ b/docs/source/en/api/models/overview.md
@@ -1,16 +0,0 @@
 # Models
 🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).
 All models are built from the base [`ModelMixin`] class, which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub.
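As a point of reference for the distribution above, and assuming the common DDPM-style formulation (the text here does not commit to a specific one), the reverse step is usually written as a Gaussian whose mean, and optionally variance, is predicted by the model:

\\(p_{\theta}(x_{t-1}|x_{t}) = \mathcal{N}\left(x_{t-1};\ \mu_{\theta}(x_{t}, t),\ \Sigma_{\theta}(x_{t}, t)\right)\\)

Here \\(\mu_{\theta}\\) and \\(\Sigma_{\theta}\\) are the quantities the denoising model is trained to produce for timestep \\(t\\).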
 ## ModelMixin
 [[autodoc]] ModelMixin
 ## FlaxModelMixin
 [[autodoc]] FlaxModelMixin
 ## PushToHubMixin
 [[autodoc]] utils.PushToHubMixin
--- a/docs/source/en/api/models/prior_transformer.md
+++ b/docs/source/en/api/models/prior_transformer.md
@@ -1,16 +0,0 @@
 # Prior Transformer
 The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.
 The abstract from the paper is:
 *Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
 ## PriorTransformer
 [[autodoc]] PriorTransformer
 ## PriorTransformerOutput
 [[autodoc]] models.prior_transformer.PriorTransformerOutput
--- a/docs/source/en/api/models/transformer2d.md
+++ b/docs/source/en/api/models/transformer2d.md
@@ -1,29 +0,0 @@
 # Transformer2D
 A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.
 When the input is **continuous**:
 1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
 2. Apply the Transformer blocks in the standard way.
 3. Reshape to image.
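The reshape steps for continuous inputs can be sketched without any tensor library. The snippet below is a hypothetical pure-Python illustration for a single sample (the batch dimension and the learned projection are left out); the helper names `image_to_sequence` and `sequence_to_image` are made up for this example and do not exist in 🤗 Diffusers.

```python
def image_to_sequence(image):
    """Flatten [channels][height][width] into [height * width][channels]."""
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    return [
        [image[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def sequence_to_image(sequence, height, width):
    """Inverse of image_to_sequence: rebuild [channels][height][width]."""
    channels = len(sequence[0])
    return [
        [[sequence[y * width + x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# One sample with 2 channels on a 2x3 spatial grid (values are illustrative).
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
]
sequence = image_to_sequence(image)  # 6 tokens, each with 2 features
restored = sequence_to_image(sequence, height=2, width=3)
```

Each spatial position becomes one token whose features are the channel values at that position, so the Transformer blocks see a sequence of length `height * width`.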
 When the input is **discrete**:
 <Tip>
 It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.
 </Tip>
 1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings.
 2. Apply the Transformer blocks in the standard way.
 3. Predict classes of unnoised image.
 ## Transformer2DModel
 [[autodoc]] Transformer2DModel
 ## Transformer2DModelOutput
 [[autodoc]] models.transformer_2d.Transformer2DModelOutput
--- a/docs/source/en/api/models/transformer_temporal.md
+++ b/docs/source/en/api/models/transformer_temporal.md
@@ -1,11 +0,0 @@
 # Transformer Temporal
 A Transformer model for video-like data.
 ## TransformerTemporalModel
 [[autodoc]] models.transformer_temporal.TransformerTemporalModel
 ## TransformerTemporalModelOutput
 [[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput
--- a/docs/source/en/api/models/unet-motion.md
+++ b/docs/source/en/api/models/unet-motion.md
@@ -1,13 +0,0 @@
 # UNetMotionModel
 The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet model.
 The abstract from the paper is:
 *There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
 ## UNetMotionModel
 [[autodoc]] UNetMotionModel
 ## UNet3DConditionOutput
 [[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
--- a/docs/source/en/api/models/unet.md
+++ b/docs/source/en/api/models/unet.md
@@ -1,13 +0,0 @@
 # UNet1DModel
 The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 1D UNet model.
 The abstract from the paper is:
 *There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
 ## UNet1DModel
 [[autodoc]] UNet1DModel
 ## UNet1DOutput
 [[autodoc]] models.unet_1d.UNet1DOutput
--- a/docs/source/en/api/models/unet2d-cond.md
+++ b/docs/source/en/api/models/unet2d-cond.md
@@ -1,19 +0,0 @@
 # UNet2DConditionModel
 The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model.
 The abstract from the paper is:
 *There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
 ## UNet2DConditionModel
 [[autodoc]] UNet2DConditionModel
 ## UNet2DConditionOutput
 [[autodoc]] models.unet_2d_condition.UNet2DConditionOutput
 ## FlaxUNet2DConditionModel
 [[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionModel
 ## FlaxUNet2DConditionOutput
 [[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput
--- a/docs/source/en/api/models/unet2d.md
+++ b/docs/source/en/api/models/unet2d.md
@@ -1,13 +0,0 @@
 # UNet2DModel
 The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet model.
 The abstract from the paper is:
 *There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
 ## UNet2DModel
 [[autodoc]] UNet2DModel
 ## UNet2DOutput
 [[autodoc]] models.unet_2d.UNet2DOutput
--- a/docs/source/en/api/models/unet3d-cond.md
+++ b/docs/source/en/api/models/unet3d-cond.md
@@ -1,13 +0,0 @@
 # UNet3DConditionModel
 The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 3D UNet conditional model.
 The abstract from the paper is:
 *There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
 ## UNet3DConditionModel
 [[autodoc]] UNet3DConditionModel
 ## UNet3DConditionOutput
 [[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
--- a/docs/source/en/api/models/vq.md
+++ b/docs/source/en/api/models/vq.md
@@ -1,15 +0,0 @@
 # VQModel
 The VQ-VAE model was introduced in [Neural Discrete Representation Learning](https://huggingface.co/papers/1711.00937) by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike [`AutoencoderKL`], the [`VQModel`] works in a quantized latent space.
 The abstract from the paper is:
 *Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.*
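To make the idea of a quantized latent space concrete, here is a simplified sketch of the vector quantisation step: each continuous vector is replaced by its nearest entry in a codebook. This is not the [`VQModel`] implementation; the function names and the codebook values are assumptions chosen purely for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# A made-up codebook with three 2-dimensional entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, entry = quantize([0.9, 1.2], codebook)
```

The discrete `index` is what makes the latent space quantized: downstream components work with a finite set of codes rather than arbitrary continuous vectors.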
 ## VQModel
 [[autodoc]] VQModel
 ## VQEncoderOutput
 [[autodoc]] models.vq_model.VQEncoderOutput
--- a/docs/source/en/api/normalization.md
+++ b/docs/source/en/api/normalization.md
@@ -1,15 +0,0 @@
 # Normalization layers
 Customized normalization layers for supporting various models in 🤗 Diffusers.
 ## AdaLayerNorm
 [[autodoc]] models.normalization.AdaLayerNorm
 ## AdaLayerNormZero
 [[autodoc]] models.normalization.AdaLayerNormZero
 ## AdaGroupNorm
 [[autodoc]] models.normalization.AdaGroupNorm
--- a/docs/source/en/api/outputs.md
+++ b/docs/source/en/api/outputs.md
@@ -1,67 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Outputs
 All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries.
 For example:
 ```python
 from diffusers import DDIMPipeline
 pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")
 outputs = pipeline()
 ```
 The `outputs` object is a [`~pipelines.ImagePipelineOutput`], which means it has an `images` attribute.
 You can access each attribute as you normally would or with a keyword lookup, and if that attribute is not returned by the model, you will get `None`:
 ```python
 outputs.images
 outputs["images"]
 ```
 When the `outputs` object is treated as a tuple, only the attributes that don't have `None` values are included.
 For instance, slicing the output returns the one-element tuple `(outputs.images,)`:
 ```python
 outputs[:1]
 ```
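The access patterns described above can be mimicked with a small stand-in class. This is a hypothetical sketch, not the [`~utils.BaseOutput`] implementation: the class name `SketchOutput` and the `extra` field are invented for illustration, and the real class may differ in details such as how missing keys are handled.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class SketchOutput:
    """Hypothetical stand-in that mimics attribute, key and tuple access."""

    images: Optional[Any] = None
    extra: Optional[Any] = None  # made-up second field, left as None below

    def to_tuple(self):
        # Only attributes that are not None take part in tuple-style access.
        values = (getattr(self, f.name) for f in fields(self))
        return tuple(v for v in values if v is not None)

    def __getitem__(self, key):
        if isinstance(key, str):
            # Keyword lookup, as in outputs["images"].
            return getattr(self, key)
        # Integer or slice lookup goes through the tuple view.
        return self.to_tuple()[key]


outputs = SketchOutput(images=["image_0"])
first = outputs[:1]  # one-element tuple holding the images
```

Skipping `None` values in `to_tuple` is what keeps positional access stable when a pipeline does not return every optional field.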
 <Tip>
 To check a specific pipeline or model output, refer to its corresponding API documentation.
 </Tip>
 ## BaseOutput
 [[autodoc]] utils.BaseOutput
    - to_tuple
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
 ## FlaxImagePipelineOutput
 [[autodoc]] pipelines.pipeline_flax_utils.FlaxImagePipelineOutput
 ## AudioPipelineOutput
 [[autodoc]] pipelines.AudioPipelineOutput
 ## ImageTextPipelineOutput
 [[autodoc]] ImageTextPipelineOutput
--- a/docs/source/en/api/outputs.mdx
+++ b/docs/source/en/api/outputs.mdx
@@ -0,0 +1,55 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BaseOutputs
 All models have outputs that are instances of subclasses of [`~utils.BaseOutput`]. These are
 data structures containing all the information returned by the model, and they can also be used as tuples or
 dictionaries.
 Here is an example:
 ```python
 from diffusers import DDIMPipeline
 pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")
 outputs = pipeline()
 ```
 The `outputs` object is a [`~pipelines.ImagePipelineOutput`], which means it has an `images` attribute.
 You can access each attribute as you normally would, and if that attribute has not been returned by the model, you will get `None`:
 ```python
 outputs.images
 ```
 or via keyword lookup:
 ```python
 outputs["images"]
 ```
 When the `outputs` object is treated as a tuple, only the attributes that don't have `None` values are included.
 For instance, the images can be retrieved by slicing:
 ```python
 outputs[:1]
 ```
 which returns the one-element tuple `(outputs.images,)`.
 ## BaseOutput
 [[autodoc]] utils.BaseOutput
    - to_tuple
--- a/docs/source/en/api/pipelines/alt_diffusion.md
+++ b/docs/source/en/api/pipelines/alt_diffusion.md
@@ -1,47 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # AltDiffusion
 AltDiffusion was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://huggingface.co/papers/2211.06679) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu.
 The abstract from the paper is:
 *In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*
 ## Tips
 `AltDiffusion` is conceptually the same as [Stable Diffusion](./stable_diffusion/overview).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## AltDiffusionPipeline
 [[autodoc]] AltDiffusionPipeline
 	- all
 	- __call__
 ## AltDiffusionImg2ImgPipeline
 [[autodoc]] AltDiffusionImg2ImgPipeline
 	- all
 	- __call__
 ## AltDiffusionPipelineOutput
 [[autodoc]] pipelines.alt_diffusion.AltDiffusionPipelineOutput
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/alt_diffusion.mdx
+++ b/docs/source/en/api/pipelines/alt_diffusion.mdx
@@ -0,0 +1,83 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # AltDiffusion
 AltDiffusion was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu.
 The abstract of the paper is the following:
 *In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*
 *Overview*:
 | Pipeline | Tasks | Colab | Demo |
 |---|---|:---:|:---:|
 | [pipeline_alt_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py) | *Text-to-Image Generation* | - | - |
 | [pipeline_alt_diffusion_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py) | *Image-to-Image Text-Guided Generation* | - | - |
 ## Tips
 - AltDiffusion is conceptually exactly the same as [Stable Diffusion](./stable_diffusion/overview).
 - *Run AltDiffusion*
 AltDiffusion can be tested with the [`AltDiffusionPipeline`] and [`AltDiffusionImg2ImgPipeline`] and the `"BAAI/AltDiffusion-m9"` checkpoint in exactly the same way as shown in the [Conditional Image Generation Guide](../../using-diffusers/conditional_image_generation) and the [Image-to-Image Generation Guide](../../using-diffusers/img2img).
 - *How to load and use different schedulers.*
 The AltDiffusion pipeline uses the [`DDIMScheduler`] by default, but `diffusers` provides many other schedulers that can be used with it, such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], and [`EulerAncestralDiscreteScheduler`].
 To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following:
 ```python
 >>> from diffusers import AltDiffusionPipeline, EulerDiscreteScheduler
 >>> pipeline = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9")
 >>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
 >>> # or
 >>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("BAAI/AltDiffusion-m9", subfolder="scheduler")
 >>> pipeline = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", scheduler=euler_scheduler)
 ```
 - *How to cover all use cases with a single pipeline or multiple pipelines*
 If you want to cover all possible use cases without loading the weights more than once, we recommend using the `components` functionality to instantiate all components in the most memory-efficient way:
 ```python
 >>> from diffusers import (
 ...     AltDiffusionPipeline,
 ...     AltDiffusionImg2ImgPipeline,
 ... )
 >>> text2img = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9")
 >>> img2img = AltDiffusionImg2ImgPipeline(**text2img.components)
 >>> # now you can use text2img(...) and img2img(...) just like the call methods of each respective pipeline
 ```
 ## AltDiffusionPipelineOutput
 [[autodoc]] pipelines.alt_diffusion.AltDiffusionPipelineOutput
 	- all
 	- __call__
 ## AltDiffusionPipeline
 [[autodoc]] AltDiffusionPipeline
 	- all
 	- __call__
 ## AltDiffusionImg2ImgPipeline
 [[autodoc]] AltDiffusionImg2ImgPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -1,230 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Text-to-Video Generation with AnimateDiff
 ## Overview
 [AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang*, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai
 The abstract of the paper is the following:
 With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at this https URL .
 ## Available Pipelines
 | Pipeline | Tasks | Demo |
 |---|---|:---:|
 | [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* | - |
 ## Available checkpoints
 Motion Adapter checkpoints can be found under [guoyww](https://huggingface.co/guoyww/). These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5.
 ## Usage example
 AnimateDiff works with a MotionAdapter checkpoint and a Stable Diffusion model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet.
 The following example demonstrates how to use a *MotionAdapter* checkpoint with Diffusers for inference with a model based on Stable Diffusion 1.4/1.5.
 ```python
 import torch
 from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
 from diffusers.utils import export_to_gif
 # Load the motion adapter
 adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
 # load SD 1.5 based finetuned model
 model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
 pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
 scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
 )
 pipe.scheduler = scheduler
 # enable memory savings
 pipe.enable_vae_slicing()
 pipe.enable_model_cpu_offload()
 output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
 )
 frames = output.frames[0]
 export_to_gif(frames, "animation.gif")
 ```
 Here are some sample outputs:
 <table>
    <tr>
        <td><center>
        masterpiece, bestquality, sunset.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-realistic-doc.gif"
            alt="masterpiece, bestquality, sunset"
            style="width: 300px;" />
        </center></td>
    </tr>
 </table>
 <Tip>
 AnimateDiff tends to work better with finetuned Stable Diffusion models. If you plan on using a scheduler that can clip samples, make sure to disable clipping by setting `clip_sample=False` in the scheduler, as clipping can have an adverse effect on generated samples.
 </Tip>
 ## Using Motion LoRAs
 Motion LoRAs are a collection of LoRAs that work with the `guoyww/animatediff-motion-adapter-v1-5-2` checkpoint. These LoRAs are responsible for adding specific types of motion to the animations.
 ```python
 import torch
 from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
 from diffusers.utils import export_to_gif
 # Load the motion adapter
 adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
 # load SD 1.5 based finetuned model
 model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
 pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
 pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
 scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
 )
 pipe.scheduler = scheduler
 # enable memory savings
 pipe.enable_vae_slicing()
 pipe.enable_model_cpu_offload()
 output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
 )
 frames = output.frames[0]
 export_to_gif(frames, "animation.gif")
 ```
 <table>
    <tr>
        <td><center>
        masterpiece, bestquality, sunset.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-zoom-out-lora.gif"
            alt="masterpiece, bestquality, sunset"
            style="width: 300px;" />
        </center></td>
    </tr>
 </table>
 ## Using Motion LoRAs with PEFT
 You can also leverage the [PEFT](https://github.com/huggingface/peft) backend to combine Motion LoRAs and create more complex animations.
 First install PEFT with:
 ```shell
 pip install peft
 ```
 Then you can use the following code to combine Motion LoRAs.
 ```python
 import torch
 from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
 from diffusers.utils import export_to_gif
 # Load the motion adapter
 adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
 # load SD 1.5 based finetuned model
 model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
 pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
 pipe.load_lora_weights("diffusers/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
 pipe.load_lora_weights("diffusers/animatediff-motion-lora-pan-left", adapter_name="pan-left")
 pipe.set_adapters(["zoom-out", "pan-left"], adapter_weights=[1.0, 1.0])
 scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
 )
 pipe.scheduler = scheduler
 # enable memory savings
 pipe.enable_vae_slicing()
 pipe.enable_model_cpu_offload()
 output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
 )
 frames = output.frames[0]
 export_to_gif(frames, "animation.gif")
 ```
 <table>
    <tr>
        <td><center>
        masterpiece, bestquality, sunset.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-zoom-out-pan-left-lora.gif"
            alt="masterpiece, bestquality, sunset"
            style="width: 300px;" />
        </center></td>
    </tr>
 </table>
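As a rough mental model of what combining adapters means (assuming the standard LoRA formulation; the symbols below are introduced here for illustration and are not part of the Diffusers API), each active adapter $i$ adds a low-rank update to a frozen weight $W_0$, scaled by the matching entry of `adapter_weights`:

```latex
h = W_0 x + \sum_{i=1}^{N} w_i \, \frac{\alpha_i}{r_i} \, B_i A_i x
```

Here $A_i$ and $B_i$ are the low-rank matrices of adapter $i$, $r_i$ is its rank, $\alpha_i$ is its LoRA alpha, and $w_i$ is the weight passed for that adapter. With `adapter_weights=[1.0, 1.0]`, both the zoom-out and pan-left updates are applied at full strength; lowering one weight should weaken the corresponding motion.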
 ## AnimateDiffPipeline
 [[autodoc]] AnimateDiffPipeline
 	- all
 	- __call__
    - enable_freeu
    - disable_freeu
    - enable_vae_slicing
    - disable_vae_slicing
    - enable_vae_tiling
    - disable_vae_tiling
 ## AnimateDiffPipelineOutput
 [[autodoc]] pipelines.animatediff.AnimateDiffPipelineOutput
--- a/docs/source/en/api/pipelines/attend_and_excite.md
+++ b/docs/source/en/api/pipelines/attend_and_excite.md
@@ -1,37 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Attend-and-Excite
 Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation.
 The abstract from the paper is:
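 *Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), and seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.*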
 You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## StableDiffusionAttendAndExcitePipeline
 [[autodoc]] StableDiffusionAttendAndExcitePipeline
 	- all
 	- __call__
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/audio_diffusion.md
+++ b/docs/source/en/api/pipelines/audio_diffusion.md
@@ -1,37 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Audio Diffusion
 [Audio Diffusion](https://github.com/teticio/audio-diffusion) is by Robert Dargavel Smith, and it leverages the recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images.
 The original codebase, training scripts and example notebooks can be found at [teticio/audio-diffusion](https://github.com/teticio/audio-diffusion).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## AudioDiffusionPipeline
 [[autodoc]] AudioDiffusionPipeline
 	- all
 	- __call__
 ## AudioPipelineOutput
 [[autodoc]] pipelines.AudioPipelineOutput
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
 ## Mel
 [[autodoc]] Mel
--- a/docs/source/en/api/pipelines/audio_diffusion.mdx
+++ b/docs/source/en/api/pipelines/audio_diffusion.mdx
@@ -0,0 +1,98 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Audio Diffusion
 ## Overview
 [Audio Diffusion](https://github.com/teticio/audio-diffusion) by Robert Dargavel Smith.
 Audio Diffusion leverages the recent advances in image generation using diffusion models by converting audio samples to
 and from mel spectrogram images.
 The original codebase of this implementation can be found [here](https://github.com/teticio/audio-diffusion), including
 training scripts and example notebooks.
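To make the "mel spectrogram" idea above a little more concrete, here is a minimal sketch of the mel scale itself, assuming the common HTK-style formula. The function names are hypothetical and this is not the code used by the pipeline's `Mel` class, which handles the real audio-to-image conversion.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mels: float) -> float:
    """Invert hz_to_mel."""
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)


def mel_band_edges(num_bands: int, f_min: float, f_max: float) -> list[float]:
    """Return num_bands + 2 band-edge frequencies in Hz, evenly spaced on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (num_bands + 1)
    return [mel_to_hz(mel_min + i * step) for i in range(num_bands + 2)]


# Illustrative values only: 8 bands between 0 Hz and 8 kHz.
edges = mel_band_edges(num_bands=8, f_min=0.0, f_max=8000.0)
print([round(edge) for edge in edges])
```

The band edges are evenly spaced in mels but grow wider in Hz toward high frequencies, which is why a mel spectrogram image devotes more rows to low frequencies than a linear spectrogram would.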
 ## Available Pipelines
 | Pipeline | Tasks | Colab |
 |---|---|:---:|
 | [pipeline_audio_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py) | *Unconditional Audio Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/audio_diffusion_pipeline.ipynb) |
 ## Examples
 ### Audio Diffusion
 ```python
 import torch
 from IPython.display import Audio
 from diffusers import DiffusionPipeline
 device = "cuda" if torch.cuda.is_available() else "cpu"
 pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)
 output = pipe()
 display(output.images[0])
 display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
 ```
 ### Latent Audio Diffusion
 ```python
 import torch
 from IPython.display import Audio
 from diffusers import DiffusionPipeline
 device = "cuda" if torch.cuda.is_available() else "cpu"
 pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)
 output = pipe()
 display(output.images[0])
 display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
 ```
 ### Audio Diffusion with DDIM (faster)
 ```python
 import torch
 from IPython.display import Audio
 from diffusers import DiffusionPipeline
 device = "cuda" if torch.cuda.is_available() else "cpu"
 pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device)
 output = pipe()
 display(output.images[0])
 display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
 ```
 ### Variations, in-painting, out-painting etc.
 ```python
 output = pipe(
    raw_audio=output.audios[0, 0],
    start_step=int(pipe.get_default_steps() / 2),
    mask_start_secs=1,
    mask_end_secs=1,
 )
 display(output.images[0])
 display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
 ```
 ## AudioDiffusionPipeline
 [[autodoc]] AudioDiffusionPipeline
 	- all
 	- __call__
 ## Mel
 [[autodoc]] Mel
--- a/docs/source/en/api/pipelines/audioldm.md
+++ b/docs/source/en/api/pipelines/audioldm.md
@@ -1,50 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # AudioLDM
 AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
 latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
 sound effects, human speech and music.
 The abstract from the paper is:
 *Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.*
 The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM). 
 ## Tips
 When constructing a prompt, keep in mind:
 * Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific (for example, "water stream in a forest" instead of "stream").
 * It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
 During inference:
 * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
 * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## AudioLDMPipeline
 [[autodoc]] AudioLDMPipeline
 	- all
 	- __call__
 ## AudioPipelineOutput
 [[autodoc]] pipelines.AudioPipelineOutput
--- a/docs/source/en/api/pipelines/audioldm.mdx
+++ b/docs/source/en/api/pipelines/audioldm.mdx
@@ -0,0 +1,82 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # AudioLDM
 ## Overview
 AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.
 Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
 latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
 sound effects, human speech and music.
 This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found [here](https://github.com/haoheliu/AudioLDM).
 ## Text-to-Audio
 The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm](https://huggingface.co/cvssp/audioldm) and generate text-conditional audio outputs:
 ```python
 from diffusers import AudioLDMPipeline
 import torch
 import scipy.io.wavfile
 repo_id = "cvssp/audioldm"
 pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
 pipe = pipe.to("cuda")
 prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
 audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
 # save the audio sample as a .wav file
 scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
 ```
 ### Tips
 Prompts:
 * Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g., "high quality" or "clear") and make the prompt context specific (e.g., "water stream in a forest" instead of "stream").
 * It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects that the model may not be familiar with.
 Inference:
 * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference.
 * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
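To relate `audio_length_in_s` to the saved waveform, the sketch below estimates how many samples to expect, assuming the 16 kHz rate used in the `scipy.io.wavfile.write` call of the Text-to-Audio snippet above. `expected_num_samples` is a hypothetical helper written for illustration, not a Diffusers function.

```python
SAMPLING_RATE_HZ = 16000  # assumption: matches rate=16000 in the Text-to-Audio snippet


def expected_num_samples(audio_length_in_s: float, sampling_rate_hz: int = SAMPLING_RATE_HZ) -> int:
    """Estimate the number of samples for a clip of the requested length."""
    return int(audio_length_in_s * sampling_rate_hz)


for length_s in (2.5, 5.0, 10.0):
    print(f"{length_s} s -> {expected_num_samples(length_s)} samples")
```

Comparing this estimate with `audio.shape` is a quick sanity check that the generated clip has the length you asked for.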
 ### How to load and use different schedulers
 The AudioLDM pipeline uses the [`DDIMScheduler`] by default, but `diffusers` provides many other schedulers
 that can be used with it, such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], and
 [`EulerAncestralDiscreteScheduler`]. We recommend using the [`DPMSolverMultistepScheduler`], as it is currently the fastest
 scheduler available.
 To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`]
 method, or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the
 [`DPMSolverMultistepScheduler`], you can do the following:
 ```python
 >>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
 >>> import torch
 >>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
 >>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
 >>> # or
 >>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm", subfolder="scheduler")
 >>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", scheduler=dpm_scheduler, torch_dtype=torch.float16)
 ```
 ## AudioLDMPipeline
 [[autodoc]] AudioLDMPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/audioldm2.md
+++ b/docs/source/en/api/pipelines/audioldm2.md
@@ -1,91 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # AudioLDM 2
 AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) 
 by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate 
 text-conditional sound effects, human speech and music.
 Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2
 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two 
 text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap)
 and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings 
 are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). 
 A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively 
 predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding 
 vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) 
 of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention 
 conditioning, as in most other LDMs.
 The abstract of the paper is the following:
 *Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.*
 This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be 
 found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2). 
 ## Tips
 ### Choosing a checkpoint
 AudioLDM 2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio 
 generation. The third checkpoint is trained exclusively on text-to-music generation.
 All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. 
 See the table below for details on the three checkpoints:
 | Checkpoint                                                      | Task          | UNet Model Size | Total Model Size | Training Data (hours) |
 |-----------------------------------------------------------------|---------------|-----------------|------------------|-----------------------|
 | [audioldm2](https://huggingface.co/cvssp/audioldm2)             | Text-to-audio | 350M            | 1.1B             | 1150k                 |
 | [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M            | 1.5B             | 1150k                 |
 | [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M            | 1.1B             | 665k                  |
 ### Constructing a prompt
 * Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
 * It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
 * Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality." 
 ### Controlling inference
 * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
 * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
 ### Evaluating generated waveforms
 * The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
 * Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
 The following example demonstrates how to generate good-quality music using the aforementioned tips: [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example).
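The ranking that happens when `num_waveforms_per_prompt` is greater than 1 can be pictured as a sort by score. The sketch below is a simplified stand-in, not the pipeline's implementation: `rank_waveforms` is a hypothetical helper, the string labels stand in for audio arrays, and the scores are made-up numbers representing how well each waveform matches the prompt.

```python
def rank_waveforms(waveforms, scores):
    """Return the waveforms ordered from highest to lowest score."""
    if len(waveforms) != len(scores):
        raise ValueError("need exactly one score per waveform")
    # Sort indices by score, best first, then reorder the waveforms.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [waveforms[i] for i in order]


# Placeholder labels stand in for audio arrays; the scores are illustrative.
candidates = ["take_a", "take_b", "take_c"]
similarity = [0.21, 0.67, 0.45]
ranked = rank_waveforms(candidates, similarity)
print(ranked[0])
```

With this ordering, the first returned waveform is the one the scoring step considers the closest match to the prompt text.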
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## AudioLDM2Pipeline
 [[autodoc]] AudioLDM2Pipeline
 	- all
 	- __call__
 ## AudioLDM2ProjectionModel
 [[autodoc]] AudioLDM2ProjectionModel
 	- forward
 ## AudioLDM2UNet2DConditionModel
 [[autodoc]] AudioLDM2UNet2DConditionModel
 	- forward
 ## AudioPipelineOutput
 [[autodoc]] pipelines.AudioPipelineOutput
--- a/docs/source/en/api/pipelines/auto_pipeline.md
+++ b/docs/source/en/api/pipelines/auto_pipeline.md
@@ -1,74 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # AutoPipeline
 `AutoPipeline` is designed to:
 1. make it easy for you to load a checkpoint for a task without knowing the specific pipeline class to use
 2. use multiple pipelines in your workflow
 Based on the task, the `AutoPipeline` class automatically retrieves the relevant pipeline given the name or path to the pretrained weights with the `from_pretrained()` method.
 To seamlessly switch between tasks with the same checkpoint without reallocating additional memory, use the `from_pipe()` method to transfer the components from the original pipeline to the new one.
 ```py
 from diffusers import AutoPipelineForText2Image
 import torch
 pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
 ).to("cuda")
 prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
 image = pipeline(prompt, num_inference_steps=25).images[0]
 ```
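The reason `from_pipe()` avoids allocating additional memory is that the new pipeline holds references to the same component objects rather than copies. The toy sketch below illustrates that idea only; `ToyPipeline`, `ToyText2Image`, and `ToyImage2Image` are hypothetical classes, not part of the library, and `bytearray` objects stand in for model weights.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: a bag of named components."""

    def __init__(self, **components):
        self.components = components

    @classmethod
    def from_pipe(cls, other):
        # Hand the same objects to the new pipeline; nothing is copied or reloaded.
        return cls(**other.components)


class ToyText2Image(ToyPipeline):
    pass


class ToyImage2Image(ToyPipeline):
    pass


# bytearrays stand in for large weight tensors.
text2img = ToyText2Image(unet=bytearray(1024), vae=bytearray(256))
img2img = ToyImage2Image.from_pipe(text2img)

# Every component in the new pipeline is the very same object as in the old one.
shared = all(
    img2img.components[name] is text2img.components[name]
    for name in text2img.components
)
print(shared)
```

Because both pipelines point at the same objects, switching tasks costs only the small wrapper object, not a second copy of the weights.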
 <Tip>
 Check out the [AutoPipeline](/tutorials/autopipeline) tutorial to learn how to use this API!
 </Tip>
 `AutoPipeline` supports text-to-image, image-to-image, and inpainting for the following diffusion models:
 - [Stable Diffusion](./stable_diffusion)
 - [ControlNet](./controlnet)
 - [Stable Diffusion XL (SDXL)](./stable_diffusion/stable_diffusion_xl)
 - [DeepFloyd IF](./if) 
 - [Kandinsky](./kandinsky)
 - [Kandinsky 2.2](./kandinsky#kandinsky-22)
 ## AutoPipelineForText2Image
 [[autodoc]] AutoPipelineForText2Image
 	- all
 	- from_pretrained
 	- from_pipe
 ## AutoPipelineForImage2Image
 [[autodoc]] AutoPipelineForImage2Image
 	- all
 	- from_pretrained
 	- from_pipe
 ## AutoPipelineForInpainting
 [[autodoc]] AutoPipelineForInpainting
 	- all
 	- from_pretrained
 	- from_pipe
--- a/docs/source/en/api/pipelines/blip_diffusion.md
+++ b/docs/source/en/api/pipelines/blip_diffusion.md
@@ -1,29 +0,0 @@
 # Blip Diffusion
 Blip Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation. 
 The abstract from the paper is:
 *Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications.*
 The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization.
 `BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## BlipDiffusionPipeline
 [[autodoc]] BlipDiffusionPipeline
    - all
    - __call__
 ## BlipDiffusionControlNetPipeline
 [[autodoc]] BlipDiffusionControlNetPipeline
    - all
    - __call__
--- a/docs/source/en/api/pipelines/consistency_models.md
+++ b/docs/source/en/api/pipelines/consistency_models.md
@@ -1,43 +0,0 @@
 # Consistency Models
 Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
 The abstract from the paper is:
 *Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256. *
 The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models), and additional checkpoints are available at [openai](https://huggingface.co/openai).
 The pipeline was contributed by [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues). ❤️
 ## Tips
 For an additional speed-up, use `torch.compile` to generate multiple images in <1 second:
 ```diff
  import torch
  from diffusers import ConsistencyModelPipeline
  device = "cuda"
  # Load the cd_bedroom256_lpips checkpoint.
  model_id_or_path = "openai/diffusers-cd_bedroom256_lpips"
  pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
  pipe.to(device)
 + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
  # Multistep sampling
  # Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo:
  # https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83
  for _ in range(10):
      image = pipe(timesteps=[17, 0]).images[0]
      image.show()
 ```
 ## ConsistencyModelPipeline
 [[autodoc]] ConsistencyModelPipeline
    - all
    - __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/controlnet.md
+++ b/docs/source/en/api/pipelines/controlnet.md
@@ -1,80 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # ControlNet
 ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
 With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
 The abstract from the paper is:
 *We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
 This model was contributed by [takuma104](https://huggingface.co/takuma104). ❤️
 The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet), and you can find official ControlNet checkpoints on [lllyasviel's](https://huggingface.co/lllyasviel) Hub profile.
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## StableDiffusionControlNetPipeline
 [[autodoc]] StableDiffusionControlNetPipeline
 	- all
 	- __call__
 	- enable_attention_slicing
 	- disable_attention_slicing
 	- enable_vae_slicing
 	- disable_vae_slicing
 	- enable_xformers_memory_efficient_attention
 	- disable_xformers_memory_efficient_attention
 	- load_textual_inversion
 ## StableDiffusionControlNetImg2ImgPipeline
 [[autodoc]] StableDiffusionControlNetImg2ImgPipeline
 	- all
 	- __call__
 	- enable_attention_slicing
 	- disable_attention_slicing
 	- enable_vae_slicing
 	- disable_vae_slicing
 	- enable_xformers_memory_efficient_attention
 	- disable_xformers_memory_efficient_attention
 	- load_textual_inversion
 ## StableDiffusionControlNetInpaintPipeline
 [[autodoc]] StableDiffusionControlNetInpaintPipeline
 	- all
 	- __call__
 	- enable_attention_slicing
 	- disable_attention_slicing
 	- enable_vae_slicing
 	- disable_vae_slicing
 	- enable_xformers_memory_efficient_attention
 	- disable_xformers_memory_efficient_attention
 	- load_textual_inversion
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
 ## FlaxStableDiffusionControlNetPipeline
 [[autodoc]] FlaxStableDiffusionControlNetPipeline
 	- all
 	- __call__
 ## FlaxStableDiffusionControlNetPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/controlnet_sdxl.md
+++ b/docs/source/en/api/pipelines/controlnet_sdxl.md
@@ -1,55 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # ControlNet with Stable Diffusion XL
 ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
 With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
 The abstract from the paper is:
 *We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
 You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub.
 <Tip warning={true}>
 🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve!
 </Tip>
 If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## StableDiffusionXLControlNetPipeline
 [[autodoc]] StableDiffusionXLControlNetPipeline
 	- all
 	- __call__
 ## StableDiffusionXLControlNetImg2ImgPipeline
 [[autodoc]] StableDiffusionXLControlNetImg2ImgPipeline
 	- all
 	- __call__
 ## StableDiffusionXLControlNetInpaintPipeline
 [[autodoc]] StableDiffusionXLControlNetInpaintPipeline
 	- all
 	- __call__
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/cycle_diffusion.md
+++ b/docs/source/en/api/pipelines/cycle_diffusion.md
@@ -1,33 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Cycle Diffusion
 Cycle Diffusion is a text-guided image-to-image generation model proposed in [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://huggingface.co/papers/2210.05559) by Chen Henry Wu and Fernando De la Torre.
 The abstract from the paper is:
 *Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.*
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## CycleDiffusionPipeline
 [[autodoc]] CycleDiffusionPipeline
 	- all
 	- __call__
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/cycle_diffusion.mdx
+++ b/docs/source/en/api/pipelines/cycle_diffusion.mdx
@@ -0,0 +1,100 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Cycle Diffusion
 ## Overview
 Cycle Diffusion is a text-guided image-to-image generation model proposed in [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) by Chen Henry Wu and Fernando De la Torre.
 The abstract of the paper is the following:
 *Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.*
 *Tips*:
 - The Cycle Diffusion pipeline is fully compatible with any [Stable Diffusion](./stable_diffusion) checkpoint.
 - Currently Cycle Diffusion only works with the [`DDIMScheduler`].
 *Example*:
 The following example shows how to best use the [`CycleDiffusionPipeline`]:
 ```python
 import requests
 import torch
 from PIL import Image
 from io import BytesIO
 from diffusers import CycleDiffusionPipeline, DDIMScheduler
 # load the pipeline
 # make sure you're logged in with `huggingface-cli login`
 model_id_or_path = "CompVis/stable-diffusion-v1-4"
 scheduler = DDIMScheduler.from_pretrained(model_id_or_path, subfolder="scheduler")
 pipe = CycleDiffusionPipeline.from_pretrained(model_id_or_path, scheduler=scheduler).to("cuda")
 # let's download an initial image
 url = "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/An%20astronaut%20riding%20a%20horse.png"
 response = requests.get(url)
 init_image = Image.open(BytesIO(response.content)).convert("RGB")
 init_image = init_image.resize((512, 512))
 init_image.save("horse.png")
 # let's specify a prompt
 source_prompt = "An astronaut riding a horse"
 prompt = "An astronaut riding an elephant"
 # call the pipeline
 image = pipe(
    prompt=prompt,
    source_prompt=source_prompt,
    image=init_image,
    num_inference_steps=100,
    eta=0.1,
    strength=0.8,
    guidance_scale=2,
    source_guidance_scale=1,
 ).images[0]
 image.save("horse_to_elephant.png")
 # let's try another example
 # See more samples at the original repo: https://github.com/ChenWu98/cycle-diffusion
 url = "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/A%20black%20colored%20car.png"
 response = requests.get(url)
 init_image = Image.open(BytesIO(response.content)).convert("RGB")
 init_image = init_image.resize((512, 512))
 init_image.save("black.png")
 source_prompt = "A black colored car"
 prompt = "A blue colored car"
 # call the pipeline
 torch.manual_seed(0)
 image = pipe(
    prompt=prompt,
    source_prompt=source_prompt,
    image=init_image,
    num_inference_steps=100,
    eta=0.1,
    strength=0.85,
    guidance_scale=3,
    source_guidance_scale=1,
 ).images[0]
 image.save("black_to_blue.png")
 ```
 ## CycleDiffusionPipeline
 [[autodoc]] CycleDiffusionPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/dance_diffusion.mdx
+++ b/docs/source/en/api/pipelines/dance_diffusion.mdx
@@ -12,22 +12,23 @@ specific language governing permissions and limitations under the License.
 # Dance Diffusion
-[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) is by Zach Evans.
+## Overview
-Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org).
+[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) by Zach Evans.
-The original codebase of this implementation can be found at [Harmonai-org](https://github.com/Harmonai-org/sample-generator).
+Dance Diffusion is the first in a suite of generative audio tools for producers and musicians to be released by Harmonai.
 For more info or to get involved in the development of these tools, please visit https://harmonai.org and fill out the form on the front page.
-<Tip>
+The original codebase of this implementation can be found [here](https://github.com/Harmonai-org/sample-generator).
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+## Available Pipelines:
 | Pipeline | Tasks | Colab
 |---|---|:---:|
 | [pipeline_dance_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py) | *Unconditional Audio Generation* | - |
 </Tip>
 ## DanceDiffusionPipeline
 [[autodoc]] DanceDiffusionPipeline
 	- all
 	- __call__
 ## AudioPipelineOutput
 [[autodoc]] pipelines.AudioPipelineOutput
--- a/docs/source/en/api/pipelines/ddim.md
+++ b/docs/source/en/api/pipelines/ddim.md
@@ -1,29 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DDIM
 [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
 The abstract from the paper is:
 *Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
 The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim).
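The speed-up described in the abstract comes from sampling on a short subsequence of the training timesteps instead of visiting every one. The sketch below shows one common way to pick such a subsequence (an evenly spaced stride); it is an illustration under stated assumptions, not the pipeline's scheduler code. The helper name `evenly_spaced_timesteps` is hypothetical, and the 1000 training steps and 50 inference steps are example values.

```python
def evenly_spaced_timesteps(num_train_timesteps, num_inference_steps):
    """Pick an evenly strided subset of training timesteps, highest first."""
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("num_inference_steps must be in (0, num_train_timesteps]")
    stride = num_train_timesteps // num_inference_steps
    # Ascending multiples of the stride, reversed so sampling runs from noisy to clean.
    return [i * stride for i in range(num_inference_steps)][::-1]


steps = evenly_spaced_timesteps(1000, 50)
print(len(steps), steps[0], steps[-1])
```

Here 50 denoising steps stand in for 1000, which is the kind of reduction behind the reported 10× to 50× wall-clock gains; fewer steps trade sample quality for speed.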
 ## DDIMPipeline
 [[autodoc]] DDIMPipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/ddim.mdx
+++ b/docs/source/en/api/pipelines/ddim.mdx
@@ -0,0 +1,36 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DDIM
 ## Overview
 [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
 The abstract of the paper is the following:
 Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
 The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim).
 For questions, feel free to contact the author on [tsong.me](https://tsong.me/).
 ## Available Pipelines:
 | Pipeline | Tasks | Colab
 |---|---|:---:|
 | [pipeline_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddim/pipeline_ddim.py) | *Unconditional Image Generation* | - |
 ## DDIMPipeline
 [[autodoc]] DDIMPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/ddpm.md
+++ b/docs/source/en/api/pipelines/ddpm.md
@@ -1,35 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DDPM
 [Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. In the 🤗 Diffusers library, DDPM refers to the *discrete denoising scheduler* from the paper as well as the pipeline.
 The abstract from the paper is:
 *We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.*
 The original codebase can be found at [hojonathanho/diffusion](https://github.com/hojonathanho/diffusion).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 # DDPMPipeline
 [[autodoc]] DDPMPipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/ddpm.mdx
+++ b/docs/source/en/api/pipelines/ddpm.mdx
@@ -0,0 +1,37 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DDPM
 ## Overview
 [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) 
 (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes the diffusion based model of the same name, but in the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline.
 The abstract of the paper is the following:
 We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.
 The original codebase of this paper can be found [here](https://github.com/hojonathanho/diffusion).
 ## Available Pipelines:
 | Pipeline | Tasks | Colab
 |---|---|:---:|
 | [pipeline_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddpm/pipeline_ddpm.py) | *Unconditional Image Generation* | - |
 # DDPMPipeline
 [[autodoc]] DDPMPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/deepfloyd_if.md
+++ b/docs/source/en/api/pipelines/deepfloyd_if.md
@@ -1,523 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DeepFloyd IF 
 ## Overview
 DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. 
 The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules: 
 - Stage 1: a base model that generates a 64x64 px image from a text prompt,
 - Stage 2: a 64x64 px => 256x256 px super-resolution model, and
 - Stage 3: a 256x256 px => 1024x1024 px super-resolution model.
 Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, 
 which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. 
 Stage 3 is [Stability's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler).
 The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. 
 Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.
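The cascade described above can be summarized with a small, library-free sketch. It only tracks the image side length through the three stages; the `Stage` class and the stage names are hypothetical and exist purely for illustration, they are not part of `diffusers`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one stage in the cascade (not a diffusers class)."""

    name: str
    upscale: int  # factor applied to the side length produced by the previous stage


BASE_RESOLUTION = 64  # Stage 1 generates a 64x64 px image from the text prompt

CASCADE = [
    Stage("stage_1_base", 1),
    Stage("stage_2_super_resolution", 4),  # 64 px -> 256 px
    Stage("stage_3_super_resolution", 4),  # 256 px -> 1024 px
]


def cascade_resolutions(base=BASE_RESOLUTION, stages=CASCADE):
    """Return the square output side length after each stage."""
    sizes = []
    side = base
    for stage in stages:
        side *= stage.upscale
        sizes.append(side)
    return sizes


print(cascade_resolutions())  # [64, 256, 1024]
```

Each stage consumes the output of the previous one, which is why the usage examples pass `image` from one pipeline call to the next.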
 ## Usage
 Before you can use IF, you need to accept its usage conditions. To do so:
 1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in
 2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). Accepting the license on the stage I model card will auto accept for the other IF models.
 3. Make sure to login locally. Install `huggingface_hub`
 ```sh
 pip install huggingface_hub --upgrade
 ```
 run the login function in a Python shell
 ```py
 from huggingface_hub import login
 login()
 ```
 and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/security-tokens#what-are-user-access-tokens).
 Next we install `diffusers` and dependencies:
 ```sh
 pip install diffusers accelerate transformers safetensors
 ```
 The following sections give more in-detail examples of how to use IF. Specifically:
 - [Text-to-Image Generation](#text-to-image-generation)
 - [Image-to-Image Generation](#text-guided-image-to-image-generation)
 - [Inpainting](#text-guided-inpainting-generation)
 - [Reusing model weights](#converting-between-different-pipelines)
 - [Speed optimization](#optimizing-for-speed)
 - [Memory optimization](#optimizing-for-memory)
 **Available checkpoints**
 - *Stage-1*
  - [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0)
  - [DeepFloyd/IF-I-L-v1.0](https://huggingface.co/DeepFloyd/IF-I-L-v1.0)
  - [DeepFloyd/IF-I-M-v1.0](https://huggingface.co/DeepFloyd/IF-I-M-v1.0)
 - *Stage-2*
  - [DeepFloyd/IF-II-L-v1.0](https://huggingface.co/DeepFloyd/IF-II-L-v1.0)
  - [DeepFloyd/IF-II-M-v1.0](https://huggingface.co/DeepFloyd/IF-II-M-v1.0)
 - *Stage-3*
  - [stabilityai/stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler)
 **Demo**
 [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/DeepFloyd/IF)
 **Google Colab**
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
 ### Text-to-Image Generation
 By default diffusers makes use of [model cpu offloading](https://huggingface.co/docs/diffusers/optimization/fp16#model-offloading-for-fast-inference-and-memory-savings)
 to run the whole IF pipeline with as little as 14 GB of VRAM.
 ```python
 from diffusers import DiffusionPipeline
 from diffusers.utils import pt_to_pil
 import torch
 # stage 1
 stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 stage_1.enable_model_cpu_offload()
 # stage 2
 stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
 )
 stage_2.enable_model_cpu_offload()
 # stage 3
 safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
 }
 stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
 )
 stage_3.enable_model_cpu_offload()
 prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
 generator = torch.manual_seed(1)
 # text embeds
 prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
 # stage 1
 image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
 ).images
 pt_to_pil(image)[0].save("./if_stage_I.png")
 # stage 2
 image = stage_2(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
 ).images
 pt_to_pil(image)[0].save("./if_stage_II.png")
 # stage 3
 image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images
 image[0].save("./if_stage_III.png")
 ```
 ### Text Guided Image-to-Image Generation
 The same IF model weights can be used for text-guided image-to-image translation or image variation.
 In this case just make sure to load the weights using the [`IFImg2ImgPipeline`] and [`IFImg2ImgSuperResolutionPipeline`] pipelines.
 **Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines
 without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).
 ```python
 from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
 from diffusers.utils import pt_to_pil
 import torch
 from PIL import Image
 import requests
 from io import BytesIO
 # download image
 url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
 response = requests.get(url)
 original_image = Image.open(BytesIO(response.content)).convert("RGB")
 original_image = original_image.resize((768, 512))
 # stage 1
 stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 stage_1.enable_model_cpu_offload()
 # stage 2
 stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
 )
 stage_2.enable_model_cpu_offload()
 # stage 3
 safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
 }
 stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
 )
 stage_3.enable_model_cpu_offload()
 prompt = "A fantasy landscape in style minecraft"
 generator = torch.manual_seed(1)
 # text embeds
 prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
 # stage 1
 image = stage_1(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
 ).images
 pt_to_pil(image)[0].save("./if_stage_I.png")
 # stage 2
 image = stage_2(
    image=image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
 ).images
 pt_to_pil(image)[0].save("./if_stage_II.png")
 # stage 3
 image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
 image[0].save("./if_stage_III.png")
 ```
 ### Text Guided Inpainting Generation
 The same IF model weights can be used for text-guided inpainting.
 In this case just make sure to load the weights using the [`IFInpaintingPipeline`] and [`IFInpaintingSuperResolutionPipeline`] pipelines.
 **Note**: You can also directly move the weights of the text-to-image pipelines to the inpainting pipelines
 without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).
 ```python
 from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
 from diffusers.utils import pt_to_pil
 import torch
 from PIL import Image
 import requests
 from io import BytesIO
 # download image
 url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
 response = requests.get(url)
 original_image = Image.open(BytesIO(response.content)).convert("RGB")
 # download mask
 url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
 response = requests.get(url)
 mask_image = Image.open(BytesIO(response.content))
 # stage 1
 stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 stage_1.enable_model_cpu_offload()
 # stage 2
 stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
 )
 stage_2.enable_model_cpu_offload()
 # stage 3
 safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
 }
 stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
 )
 stage_3.enable_model_cpu_offload()
 prompt = "blue sunglasses"
 generator = torch.manual_seed(1)
 # text embeds
 prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
 # stage 1
 image = stage_1(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
 ).images
 pt_to_pil(image)[0].save("./if_stage_I.png")
 # stage 2
 image = stage_2(
    image=image,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
 ).images
 pt_to_pil(image)[0].save("./if_stage_II.png")
 # stage 3
 image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
 image[0].save("./if_stage_III.png")
 ```
 ### Converting between different pipelines
 In addition to being loaded with `from_pretrained`, Pipelines can also be loaded directly from each other.
 ```python
 from diffusers import IFPipeline, IFSuperResolutionPipeline
 pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
 pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")
 from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline
 pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
 pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)
 from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline
 pipe_1 = IFInpaintingPipeline(**pipe_1.components)
 pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
 ```
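The conversion above works because a pipeline exposes its already-loaded parts as a mapping that can be unpacked into another pipeline's constructor. The following is a minimal sketch of that pattern using toy classes; `ToyPipeline` and `ToyImg2ImgPipeline` are hypothetical stand-ins, not `diffusers` classes, and the placeholder objects stand in for real models.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline class; not part of diffusers."""

    def __init__(self, text_encoder, unet, scheduler):
        self.text_encoder = text_encoder
        self.unet = unet
        self.scheduler = scheduler

    @property
    def components(self):
        # Map each constructor argument name to the object that is already loaded.
        return {
            "text_encoder": self.text_encoder,
            "unet": self.unet,
            "scheduler": self.scheduler,
        }


class ToyImg2ImgPipeline(ToyPipeline):
    """Same components as ToyPipeline; the task-specific logic is omitted."""


# Placeholder objects stand in for the real (large) model components.
text_to_image = ToyPipeline(text_encoder=object(), unet=object(), scheduler=object())

# Unpacking the mapping reuses the same objects instead of loading them again.
image_to_image = ToyImg2ImgPipeline(**text_to_image.components)

shares_weights = image_to_image.unet is text_to_image.unet
print(shares_weights)  # True
```

Because the components are shared rather than copied, no additional memory is needed for the second pipeline in this sketch.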
 ### Optimizing for speed
 The simplest optimization to run IF faster is to move all model components to the GPU.
 ```py
 pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 pipe.to("cuda")
 ```
 You can also run the diffusion process for fewer timesteps.
 This can be done either with the `num_inference_steps` argument
 ```py
 pipe("<prompt>", num_inference_steps=30)
 ```
 Or with the `timesteps` argument
 ```py
 from diffusers.pipelines.deepfloyd_if import fast27_timesteps
 pipe("<prompt>", timesteps=fast27_timesteps)
 ```
 When doing image variation or inpainting, you can also decrease the number of timesteps
 with the `strength` argument. It sets the amount of noise added to
 the input image, which also determines how many steps run in the denoising process.
 A smaller value varies the image less but runs faster.
 ```py
 pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 pipe.to("cuda")
 image = pipe(image=image, prompt="<prompt>", strength=0.3).images
 ```
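As a rough illustration of how `strength` relates to the number of denoising steps, the sketch below assumes the convention commonly used by image-to-image pipelines, where the number of steps actually run is `num_inference_steps` scaled by `strength`. The helper `steps_for_strength` is hypothetical, and the exact rounding inside the IF pipelines may differ.

```python
def steps_for_strength(num_inference_steps, strength):
    """Hypothetical helper: return (index of the first step run, number of steps run).

    Assumes the number of denoising steps is num_inference_steps scaled by strength.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    first_step_index = max(num_inference_steps - steps_to_run, 0)
    return first_step_index, steps_to_run


# With 100 scheduled steps and strength=0.3, only the last 30 steps are run.
print(steps_for_strength(100, 0.3))  # (70, 30)
```

Under this assumption, `strength=1.0` runs the full schedule and is closest to generating from pure noise.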
 You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile`
 with IF and it might not give expected results.
 ```py
 import torch
 pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 pipe.to("cuda")
 pipe.text_encoder = torch.compile(pipe.text_encoder)
 pipe.unet = torch.compile(pipe.unet)
 ```
 ### Optimizing for memory
 When optimizing for GPU memory, we can use the standard diffusers cpu offloading APIs.
 Either the model based CPU offloading,
 ```py
 pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 pipe.enable_model_cpu_offload()
 ```
 or the more aggressive layer based CPU offloading.
 ```py
 pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
 pipe.enable_sequential_cpu_offload()
 ```
 Additionally, T5 can be loaded in 8bit precision
 ```py
 from transformers import T5EncoderModel
 text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
 )
 from diffusers import DiffusionPipeline
 pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
 )
 prompt_embeds, negative_embeds = pipe.encode_prompt("<prompt>")
 ```
 For CPU RAM constrained machines like the Google Colab free tier, where we can't load all 
 model components to the CPU at once, we can manually load the pipeline with only
 the text encoder or only the UNet when the respective model component is needed.
 ```py
 from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
 import torch
 import gc
 from transformers import T5EncoderModel
 from diffusers.utils import pt_to_pil
 text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
 )
 # text to image
 pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
 )
 prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
 prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
 # Remove the pipeline so we can re-load the pipeline with the unet
 del text_encoder
 del pipe
 gc.collect()
 torch.cuda.empty_cache()
 pipe = IFPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
 )
 generator = torch.Generator().manual_seed(0)
 image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
 ).images
 pt_to_pil(image)[0].save("./if_stage_I.png")
 # Remove the pipeline so we can load the super-resolution pipeline
 del pipe
 gc.collect()
 torch.cuda.empty_cache()
 # First super resolution
 pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
 )
 generator = torch.Generator().manual_seed(0)
 image = pipe(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
 ).images
 pt_to_pil(image)[0].save("./if_stage_II.png")
 ```
 ## Available Pipelines:
 | Pipeline | Tasks | Colab
 |---|---|:---:|
 | [pipeline_if.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - |
 | [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py) | *Text-to-Image Generation* | - |
 | [pipeline_if_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py) | *Image-to-Image Generation* | - |
 | [pipeline_if_img2img_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py) | *Image-to-Image Generation* | - |
 | [pipeline_if_inpainting.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py) | *Image-to-Image Generation* | - |
 | [pipeline_if_inpainting_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py) | *Image-to-Image Generation* | - |
 ## IFPipeline
 [[autodoc]] IFPipeline
 	- all
 	- __call__
 ## IFSuperResolutionPipeline
 [[autodoc]] IFSuperResolutionPipeline
 	- all
 	- __call__
 ## IFImg2ImgPipeline
 [[autodoc]] IFImg2ImgPipeline
 	- all
 	- __call__
 ## IFImg2ImgSuperResolutionPipeline
 [[autodoc]] IFImg2ImgSuperResolutionPipeline
 	- all
 	- __call__
 ## IFInpaintingPipeline
 [[autodoc]] IFInpaintingPipeline
 	- all
 	- __call__
 ## IFInpaintingSuperResolutionPipeline
 [[autodoc]] IFInpaintingSuperResolutionPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/diffedit.md
+++ b/docs/source/en/api/pipelines/diffedit.md
@@ -1,55 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DiffEdit
 [DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.
 The abstract from the paper is:
 *Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*
 The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
 This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️
 ## Tips 
 * The pipeline can generate masks that can be fed into other inpainting pipelines.
 * To generate an image with this pipeline, both an image mask (generated with [`~StableDiffusionDiffEditPipeline.generate_mask`] from source and target prompts that are either specified manually or generated)
 and a set of partially inverted latents (generated using [`~StableDiffusionDiffEditPipeline.invert`]) _must_ be provided as arguments when calling the pipeline to generate the final edited image.
 * The function [`~StableDiffusionDiffEditPipeline.generate_mask`] exposes two prompt arguments, `source_prompt` and `target_prompt`
 that let you control the locations of the semantic edits in the final image to be generated. Let's say,
 you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
 this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to
 `source_prompt` and "dog" to `target_prompt`.
 * When generating partially inverted latents using `invert`, assign a caption or text embedding describing the
 overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
 source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives.
 * When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
 and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to
 the phrases including "cat" to `negative_prompt` and "dog" to `prompt`.
 * If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
    * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
    * Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog".
    * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
 * The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](/using-diffusers/diffedit) guide for more details.
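The tips above boil down to a fixed mapping from an edit direction to the prompt arguments of the three calls. The sketch below makes that mapping explicit without loading any model. `diffedit_call_arguments` is a hypothetical helper written for illustration; only the argument names (`source_prompt`, `target_prompt`, `prompt`, `negative_prompt`) come from the tips.

```python
def diffedit_call_arguments(source, target):
    """Hypothetical helper: prompt arguments for the edit direction `source -> target`."""
    return {
        # generate_mask contrasts the source and target concepts.
        "generate_mask": {"source_prompt": source, "target_prompt": target},
        # invert is guided by a description of the original image (the source concept).
        "invert": {"prompt": source},
        # The final call steers toward the target and away from the source.
        "__call__": {"prompt": target, "negative_prompt": source},
    }


forward = diffedit_call_arguments("cat", "dog")  # edit direction "cat -> dog"
backward = diffedit_call_arguments("dog", "cat")  # reversed direction "dog -> cat"

print(forward["__call__"])  # {'prompt': 'dog', 'negative_prompt': 'cat'}
print(backward["invert"])  # {'prompt': 'cat'} would be wrong; see the value printed
```

Reversing the direction swaps every pair at once, which matches the three bullet points for the "dog -> cat" case: `backward["invert"]` holds `{"prompt": "dog"}`.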
 ## StableDiffusionDiffEditPipeline
 [[autodoc]] StableDiffusionDiffEditPipeline
    - all
    - generate_mask
    - invert
    - __call__
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/dit.mdx
+++ b/docs/source/en/api/pipelines/dit.mdx
@@ -10,26 +10,50 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
-# DiT
+# Scalable Diffusion Models with Transformers (DiT)
-[Scalable Diffusion Models with Transformers](https://huggingface.co/papers/2212.09748) (DiT) is by William Peebles and Saining Xie.
+## Overview
-The abstract from the paper is:
+[Scalable Diffusion Models with Transformers](https://arxiv.org/abs/2212.09748) (DiT) by William Peebles and Saining Xie.
 The abstract of the paper is the following:
 *We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.*
-The original codebase can be found at [facebookresearch/dit](https://github.com/facebookresearch/dit).
+The original codebase of this paper can be found here: [facebookresearch/dit](https://github.com/facebookresearch/dit).
-<Tip>
+## Available Pipelines:
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+| Pipeline | Tasks | Colab
 |---|---|:---:|
 | [pipeline_dit.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/dit/pipeline_dit.py) | *Conditional Image Generation* | - |
-</Tip>
+
 ## Usage example
 ```python
 from diffusers import DiTPipeline, DPMSolverMultistepScheduler
 import torch
 pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
 pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
 pipe = pipe.to("cuda")
 # print all available ImageNet class labels
 pipe.labels
 # pick words that exist in ImageNet
 words = ["white shark", "umbrella"]
 class_ids = pipe.get_label_ids(words)
 generator = torch.manual_seed(33)
 output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)
 image = output.images[0]  # label 'white shark'
 ```
 ## DiTPipeline
 [[autodoc]] DiTPipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/kandinsky.md
+++ b/docs/source/en/api/pipelines/kandinsky.md
@@ -1,67 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Kandinsky 2.1
 Kandinsky 2.1 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov).
 The description from its GitHub page is:
 *Kandinsky 2.1 inherits best practicies from Dall-E 2 and Latent diffusion, while introducing some new ideas. As text and image encoder it uses CLIP model and diffusion image prior (mapping) between latent spaces of CLIP modalities. This approach increases the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.*
 The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2).
 <Tip>
 Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
 </Tip>
 ## KandinskyPriorPipeline
 [[autodoc]] KandinskyPriorPipeline
 	- all
 	- __call__
 	- interpolate
 ## KandinskyPipeline
 [[autodoc]] KandinskyPipeline
 	- all
 	- __call__
 ## KandinskyCombinedPipeline
 [[autodoc]] KandinskyCombinedPipeline
 	- all
 	- __call__
 ## KandinskyImg2ImgPipeline
 [[autodoc]] KandinskyImg2ImgPipeline
 	- all
 	- __call__
 ## KandinskyImg2ImgCombinedPipeline
 [[autodoc]] KandinskyImg2ImgCombinedPipeline
 	- all
 	- __call__
 ## KandinskyInpaintPipeline
 [[autodoc]] KandinskyInpaintPipeline
 	- all
 	- __call__
 ## KandinskyInpaintCombinedPipeline
 [[autodoc]] KandinskyInpaintCombinedPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/kandinsky_v22.md
+++ b/docs/source/en/api/pipelines/kandinsky_v22.md
@@ -1,86 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Kandinsky 2.2
 Kandinsky 2.2 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov).
 The description from its GitHub page is:
 *Kandinsky 2.2 brings substantial improvements upon its predecessor, Kandinsky 2.1, by introducing a new, more powerful image encoder - CLIP-ViT-G and the ControlNet support. The switch to CLIP-ViT-G as the image encoder significantly increases the model's capability to generate more aesthetic pictures and better understand text, thus enhancing the model's overall performance. The addition of the ControlNet mechanism allows the model to effectively control the process of generating images. This leads to more accurate and visually appealing outputs and opens new possibilities for text-guided image manipulation.*
 The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2).
 <Tip>
 Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
 </Tip>
 ## KandinskyV22PriorPipeline
 [[autodoc]] KandinskyV22PriorPipeline
 	- all
 	- __call__
 	- interpolate
 ## KandinskyV22Pipeline
 [[autodoc]] KandinskyV22Pipeline
 	- all
 	- __call__
 ## KandinskyV22CombinedPipeline
 [[autodoc]] KandinskyV22CombinedPipeline
 	- all
 	- __call__
 ## KandinskyV22ControlnetPipeline
 [[autodoc]] KandinskyV22ControlnetPipeline
 	- all
 	- __call__
 ## KandinskyV22PriorEmb2EmbPipeline
 [[autodoc]] KandinskyV22PriorEmb2EmbPipeline
 	- all
 	- __call__
 	- interpolate
 ## KandinskyV22Img2ImgPipeline
 [[autodoc]] KandinskyV22Img2ImgPipeline
 	- all
 	- __call__
 ## KandinskyV22Img2ImgCombinedPipeline
 [[autodoc]] KandinskyV22Img2ImgCombinedPipeline
 	- all
 	- __call__
 ## KandinskyV22ControlnetImg2ImgPipeline
 [[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
 	- all
 	- __call__
 ## KandinskyV22InpaintPipeline
 [[autodoc]] KandinskyV22InpaintPipeline
 	- all
 	- __call__
 ## KandinskyV22InpaintCombinedPipeline
 [[autodoc]] KandinskyV22InpaintCombinedPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/latent_consistency_models.md
+++ b/docs/source/en/api/pipelines/latent_consistency_models.md
@@ -1,40 +0,0 @@
 # Latent Consistency Models
 Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao.
 The abstract of the [paper](https://arxiv.org/pdf/2310.04378.pdf) is as follows:
 *Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference.*
 A demo for the [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) checkpoint can be found [here](https://huggingface.co/spaces/SimianLuo/Latent_Consistency_Model).
 The pipelines were contributed by [luosiallen](https://luosiallen.github.io/), [nagolinc](https://github.com/nagolinc), and [dg845](https://github.com/dg845).
 ## LatentConsistencyModelPipeline
 [[autodoc]] LatentConsistencyModelPipeline
    - all
    - __call__
    - enable_freeu
    - disable_freeu
    - enable_vae_slicing
    - disable_vae_slicing
    - enable_vae_tiling
    - disable_vae_tiling
 ## LatentConsistencyModelImg2ImgPipeline
 [[autodoc]] LatentConsistencyModelImg2ImgPipeline
    - all
    - __call__
    - enable_freeu
    - disable_freeu
    - enable_vae_slicing
    - disable_vae_slicing
    - enable_vae_tiling
    - disable_vae_tiling
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/latent_diffusion.mdx
+++ b/docs/source/en/api/pipelines/latent_diffusion.mdx
@@ -12,19 +12,31 @@ specific language governing permissions and limitations under the License.
 # Latent Diffusion
-Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
+## Overview
-The abstract from the paper is:
+Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
 The abstract of the paper is the following:
 *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.*
-The original codebase can be found at [Compvis/latent-diffusion](https://github.com/CompVis/latent-diffusion).
+The original codebase can be found [here](https://github.com/CompVis/latent-diffusion).
-<Tip>
+## Tips:
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+- 
 - 
 - 
 ## Available Pipelines:
 | Pipeline | Tasks | Colab
 |---|---|:---:|
 | [pipeline_latent_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) | *Text-to-Image Generation* | - |
 | [pipeline_latent_diffusion_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py) | *Super Resolution* | - |
 ## Examples:
 </Tip>
 ## LDMTextToImagePipeline
 [[autodoc]] LDMTextToImagePipeline
@@ -35,6 +47,3 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)
 [[autodoc]] LDMSuperResolutionPipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx
+++ b/docs/source/en/api/pipelines/latent_diffusion_uncond.mdx
@@ -12,24 +12,31 @@ specific language governing permissions and limitations under the License.
 # Unconditional Latent Diffusion
-Unconditional Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
+## Overview
-The abstract from the paper is:
+Unconditional Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
 The abstract of the paper is the following:
 *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.*
-The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion).
+The original codebase can be found [here](https://github.com/CompVis/latent-diffusion).
-<Tip>
+## Tips:
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+- 
 - 
 - 
-</Tip>
+## Available Pipelines:
 | Pipeline | Tasks | Colab
 |---|---|:---:|
 | [pipeline_latent_diffusion_uncond.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py) | *Unconditional Image Generation* | - |
 ## Examples:
 ## LDMPipeline
 [[autodoc]] LDMPipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/musicldm.md
+++ b/docs/source/en/api/pipelines/musicldm.md
@@ -1,55 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # MusicLDM
 MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
 MusicLDM takes a text prompt as input and predicts the corresponding music sample. 
 Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
 MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
 latents.
 MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to 
 the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies 
 encourages the model to interpolate between the training samples, but stay within the domain of the training data. The 
 result is generated music that is more diverse while staying faithful to the corresponding style.
 The abstract of the paper is the following:
 *In this paper, we present MusicLDM, a state-of-the-art text-to-music model that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, to encourage the model to generate music more diverse while still staying faithful to the corresponding style.*
 This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi).
 ## Tips
 When constructing a prompt, keep in mind:
 * Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
 * Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
 During inference:
 * The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
 * Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
 * The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.
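The tips above can be combined into one call, sketched below. This is an illustrative sketch rather than the official example: the helper names `load_musicldm` and `generate_music` are hypothetical, the checkpoint name `ucsd-reach/musicldm` is an assumption, and the default values are arbitrary starting points.

```python
def load_musicldm(checkpoint="ucsd-reach/musicldm"):
    """Load the pipeline (downloads weights on first use)."""
    # Imported lazily so the rest of this sketch runs without diffusers installed.
    from diffusers import MusicLDMPipeline

    # Assumption: the checkpoint name is not given in the text above.
    return MusicLDMPipeline.from_pretrained(checkpoint)


def generate_music(pipe, prompt, seconds=10.0, steps=200, candidates=3):
    """Generate several candidate waveforms and return the top-ranked one."""
    output = pipe(
        prompt,
        negative_prompt="low quality, average quality",  # suggested in the tips
        num_inference_steps=steps,            # more steps: higher quality, slower
        audio_length_in_s=seconds,            # length of the generated clip
        num_waveforms_per_prompt=candidates,  # > 1 enables automatic scoring
    )
    # The waveforms are ranked from best to worst, so index 0 is the best match.
    return output.audios[0]
```

A typical call would be `generate_music(load_musicldm(), "melodic techno with a fast beat and synths")`.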
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## MusicLDMPipeline
 [[autodoc]] MusicLDMPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/overview.md
+++ b/docs/source/en/api/pipelines/overview.md
@@ -1,98 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Pipelines
 Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different schedulers or even model components.
 All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. Specific pipeline types (for example [`StableDiffusionPipeline`]) loaded with [`~DiffusionPipeline.from_pretrained`] are automatically detected and the pipeline components are loaded and passed to the `__init__` function of the pipeline.
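The detection step described above can be pictured with a toy model. The sketch below is not the library's implementation: `ToyDiffusionPipeline`, `from_index`, and the `load_component` callback are hypothetical names, and the `index` dictionary with its `_class_name` key is only a stand-in for the saved pipeline configuration. It shows the general idea of looking up the concrete class by name and passing the loaded components to its `__init__`.

```python
class ToyDiffusionPipeline:
    """Toy base class: subclasses register themselves under their class name."""

    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        ToyDiffusionPipeline.registry[cls.__name__] = cls

    @classmethod
    def from_index(cls, index, load_component):
        # Detect the concrete pipeline class from the stored class name.
        pipeline_cls = cls.registry[index["_class_name"]]
        # Load every non-private entry and hand it to __init__ by keyword.
        components = {
            name: load_component(name, spec)
            for name, spec in index.items()
            if not name.startswith("_")
        }
        return pipeline_cls(**components)


class ToyTextToImagePipeline(ToyDiffusionPipeline):
    def __init__(self, unet, scheduler):
        self.unet = unet
        self.scheduler = scheduler


index = {"_class_name": "ToyTextToImagePipeline", "unet": "UNet", "scheduler": "PNDM"}
pipe = ToyDiffusionPipeline.from_index(index, lambda name, spec: f"{name}:{spec}")
```

Calling the loader on the base class still yields a `ToyTextToImagePipeline`, which mirrors how a specific pipeline type is detected when loading through the base class.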
 <Tip warning={true}>
 You shouldn't use the [`DiffusionPipeline`] class for training. Individual components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead.
 <br>
 Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../../training/overview) guides instead!
 </Tip>
 The table below lists all the pipelines currently available in 🤗 Diffusers and the tasks they support. Click on a pipeline to view its abstract and published paper.
 | Pipeline | Tasks |
 |---|---|
 | [AltDiffusion](alt_diffusion) | image2image |
 | [Attend-and-Excite](attend_and_excite) | text2image |
 | [Audio Diffusion](audio_diffusion) | image2audio |
 | [AudioLDM](audioldm) | text2audio |
 | [AudioLDM2](audioldm2) | text2audio |
 | [BLIP Diffusion](blip_diffusion) | text2image |
 | [Consistency Models](consistency_models) | unconditional image generation |
 | [ControlNet](controlnet) | text2image, image2image, inpainting |
 | [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image |
 | [Cycle Diffusion](cycle_diffusion) | image2image |
 | [Dance Diffusion](dance_diffusion) | unconditional audio generation |
 | [DDIM](ddim) | unconditional image generation |
 | [DDPM](ddpm) | unconditional image generation |
 | [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution |
 | [DiffEdit](diffedit) | inpainting |
 | [DiT](dit) | text2image |
 | [GLIGEN](gligen) | text2image |
 | [InstructPix2Pix](pix2pix) | image editing |
 | [Kandinsky](kandinsky) | text2image, image2image, inpainting, interpolation |
 | [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting |
 | [Latent Diffusion](latent_diffusion) | text2image, super-resolution |
 | [LDM3D](ldm3d_diffusion) | text2image, text-to-3D |
 | [MultiDiffusion](panorama) | text2image |
 | [MusicLDM](musicldm) | text2audio |
 | [PaintByExample](paint_by_example) | inpainting |
 | [ParaDiGMS](paradigms) | text2image |
 | [Pix2Pix Zero](pix2pix_zero) | image editing |
 | [PNDM](pndm) | unconditional image generation |
 | [RePaint](repaint) | inpainting |
 | [ScoreSdeVe](score_sde_ve) | unconditional image generation |
 | [Self-Attention Guidance](self_attention_guidance) | text2image |
 | [Semantic Guidance](semantic_stable_diffusion) | text2image |
 | [Shap-E](shap_e) | text-to-3D, image-to-3D |
 | [Spectrogram Diffusion](spectrogram_diffusion) |  |
 | [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution |
 | [Stable Diffusion Model Editing](model_editing) | model editing |
 | [Stable Diffusion XL](stable_diffusion_xl) | text2image, image2image, inpainting |
 | [Stable unCLIP](stable_unclip) | text2image, image variation |
 | [KarrasVe](karras_ve) | unconditional image generation |
 | [T2I Adapter](adapter) | text2image |
 | [Text2Video](text_to_video) | text2video, video2video |
 | [Text2Video Zero](text_to_video_zero) | text2video |
 | [UnCLIP](unclip) | text2image, image variation |
 | [Unconditional Latent Diffusion](latent_diffusion_uncond) | unconditional image generation |
 | [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional audio generation |
 | [Value-guided planning](value_guided_sampling) | value guided sampling |
 | [Versatile Diffusion](versatile_diffusion) | text2image, image variation |
 | [VQ Diffusion](vq_diffusion) | text2image |
 | [Wuerstchen](wuerstchen) | text2image |
 ## DiffusionPipeline
 [[autodoc]] DiffusionPipeline
 	- all
 	- __call__
 	- device
 	- to
 	- components
 ## FlaxDiffusionPipeline
 [[autodoc]] pipelines.pipeline_flax_utils.FlaxDiffusionPipeline
 ## PushToHubMixin
 [[autodoc]] utils.PushToHubMixin
--- a/docs/source/en/api/pipelines/overview.mdx
+++ b/docs/source/en/api/pipelines/overview.mdx
@@ -0,0 +1,214 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Pipelines
 Pipelines provide a simple way to run state-of-the-art diffusion models in inference.
 Most diffusion systems consist of multiple independently-trained models and highly adaptable scheduler 
 components - all of which are needed to have a functioning end-to-end diffusion system.
 As an example, [Stable Diffusion](https://huggingface.co/blog/stable_diffusion) has three independently trained models:
 - [Autoencoder](./api/models#vae)
 - [Conditional Unet](./api/models#UNet2DConditionModel)
 - [CLIP text encoder](https://huggingface.co/docs/transformers/v4.27.1/en/model_doc/clip#transformers.CLIPTextModel)
 - a scheduler component, [scheduler](./api/scheduler#pndm), 
 - a [CLIPImageProcessor](https://huggingface.co/docs/transformers/v4.27.1/en/model_doc/clip#transformers.CLIPImageProcessor),
 - as well as a [safety checker](./stable_diffusion#safety_checker).
 All of these components are necessary to run stable diffusion in inference even though they were trained 
 or created independently from each other.
 To that end, we strive to offer all open-sourced, state-of-the-art diffusion systems under a unified API. 
 More specifically, we strive to provide pipelines that
 - 1. can load the officially published weights and yield 1-to-1 the same outputs as the original implementation according to the corresponding paper (*e.g.* [LDMTextToImagePipeline](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/latent_diffusion), uses the officially released weights of [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)),
 - 2. have a simple user interface to run the model in inference (see the [Pipelines API](#pipelines-api) section), 
 - 3. are easy to understand with code that is self-explanatory and can be read alongside the official paper (see [Pipelines summary](#pipelines-summary)),
 - 4. can easily be contributed by the community (see the [Contribution](#contribution) section).
 **Note** that pipelines do not (and should not) offer any training functionality. 
 If you are looking for *official* training examples, please have a look at [examples](https://github.com/huggingface/diffusers/tree/main/examples).
 ## 🧨 Diffusers Summary
 The following table summarizes all officially supported pipelines, their corresponding paper, and, if 
 available, a Colab notebook to directly try them out.
 | Pipeline | Paper | Tasks | Colab
 |---|---|:---:|:---:|
 | [alt_diffusion](./alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | -
 | [audio_diffusion](./audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio_diffusion.git) | Unconditional Audio Generation |
 | [controlnet](./api/pipelines/stable_diffusion/controlnet) | [**ControlNet with Stable Diffusion**](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb)
 | [cycle_diffusion](./cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
 | [dance_diffusion](./dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
 | [ddpm](./ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
 | [ddim](./ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
 | [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation | 
 | [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image | 
 | [latent_diffusion_uncond](./latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation | 
 | [paint_by_example](./paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting | 
 | [pndm](./pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation | 
 | [score_sde_ve](./score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | 
 | [score_sde_vp](./score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | 
 | [semantic_stable_diffusion](./semantic_stable_diffusion) | [**SEGA: Instructing Diffusion using Semantic Dimensions**](https://arxiv.org/abs/2301.12247) | Text-to-Image Generation |
 | [stable_diffusion_text2img](./stable_diffusion/text2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb)
 | [stable_diffusion_img2img](./stable_diffusion/img2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb)
 | [stable_diffusion_inpaint](./stable_diffusion/inpaint) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb)
 | [stable_diffusion_panorama](./stable_diffusion/panorama) | [**MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation**](https://arxiv.org/abs/2302.08113) | Text-Guided Panorama View Generation |
 | [stable_diffusion_pix2pix](./stable_diffusion/pix2pix) | [**InstructPix2Pix: Learning to Follow Image Editing Instructions**](https://arxiv.org/abs/2211.09800) | Text-Based Image Editing |
 | [stable_diffusion_pix2pix_zero](./stable_diffusion/pix2pix_zero) | [**Zero-shot Image-to-Image Translation**](https://arxiv.org/abs/2302.03027) | Text-Based Image Editing |
 | [stable_diffusion_attend_and_excite](./stable_diffusion/attend_and_excite) | [**Attend and Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models**](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
 | [stable_diffusion_self_attention_guidance](./stable_diffusion/self_attention_guidance) | [**Self-Attention Guidance**](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
 | [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [**Stable Diffusion Image Variations**](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
 | [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [**Stable Diffusion Latent Upscaler**](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
 | [stable_diffusion_2](./stable_diffusion_2/) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation | 
 | [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting | 
 | [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Depth-to-Image Text-Guided Generation |
 | [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
 | [stable_diffusion_safe](./stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb)
 | [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation |
 | [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation |
 | [stochastic_karras_ve](./stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
 | [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
 | [unclip](./unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
 | [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation | 
 | [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation | 
 | [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation | 
 | [vq_diffusion](./vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation | 
 | [text_to_video_zero](./text_to_video_zero) | [Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) | Text-to-Video Generation |
 **Note**: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers. 
 However, most of them can be adapted to use different scheduler components or even different model components. Some pipeline examples are shown in the [Examples](#examples) below.
 ## Pipelines API
 Diffusion models often consist of multiple independently-trained models or other previously existing components. 
 Each model has been trained independently on a different task and the scheduler can easily be swapped out and replaced with a different one. 
 During inference, however, we want to be able to easily load all components and use them together, even if one component, *e.g.* CLIP's text encoder, originates from a different library, such as [Transformers](https://github.com/huggingface/transformers). To that end, all pipelines provide the following functionality:
 - [`from_pretrained` method](../diffusion_pipeline) that accepts a Hugging Face Hub repository id, *e.g.* [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) or a path to a local directory, *e.g.*
 "./stable-diffusion". To correctly retrieve which models and components should be loaded, one has to provide a `model_index.json` file, *e.g.* [runwayml/stable-diffusion-v1-5/model_index.json](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json), which defines all components that should be
 loaded into the pipelines. More specifically, for each model/component one needs to define the format `<name>: ["<library>", "<class name>"]`. `<name>` is the attribute name given to the loaded instance of `<class name>` which can be found in the library or pipeline folder called `"<library>"`.
 - [`save_pretrained`](../diffusion_pipeline) that accepts a local path, *e.g.* `./stable-diffusion`, under which all models/components of the pipeline will be saved. For each component/model a folder is created inside the local path that is named after the given attribute name, *e.g.* `./stable-diffusion/unet`. 
 In addition, a `model_index.json` file is created at the root of the local path, *e.g.* `./stable-diffusion/model_index.json`, so that the complete pipeline can again be instantiated 
 from the local path.
 - [`to`](../diffusion_pipeline) which accepts a `string` or `torch.device` to move all models that are of type `torch.nn.Module` to the passed device. The behavior is fully analogous to [PyTorch's `to` method](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to).
 - [`__call__`] method to use the pipeline in inference. `__call__` defines inference logic of the pipeline and should ideally encompass all aspects of it, from pre-processing to forwarding tensors to the different models and schedulers, as well as post-processing. The API of the `__call__` method can strongly vary from pipeline to pipeline. *E.g.* a text-to-image pipeline, such as [`StableDiffusionPipeline`](./stable_diffusion) should accept among other things the text prompt to generate the image. A pure image generation pipeline, such as [DDPMPipeline](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/ddpm) on the other hand can be run without providing any inputs. To better understand what inputs can be adapted for 
 each pipeline, one should look directly into the respective pipeline.
 **Note**: All pipelines have PyTorch's autograd disabled by decorating the `__call__` method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should
 not be used for training. If you want to store the gradients during the forward pass, we recommend writing your own pipeline, see also our [community-examples](https://github.com/huggingface/diffusers/tree/main/examples/community).
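As a rough illustration of the `<name>: ["<library>", "<class name>"]` format described above, the sketch below parses a small hand-written index using only the standard library. The JSON content and the helper name `list_components` are assumptions made for illustration, as is the rule that keys starting with an underscore hold metadata rather than components. The actual loading logic lives in `from_pretrained` and does considerably more.

```python
import json

# Hand-written example in the `<name>: ["<library>", "<class name>"]` format.
# The entries are illustrative, not copied from a specific checkpoint.
MODEL_INDEX_JSON = """
{
  "_class_name": "StableDiffusionPipeline",
  "unet": ["diffusers", "UNet2DConditionModel"],
  "vae": ["diffusers", "AutoencoderKL"],
  "text_encoder": ["transformers", "CLIPTextModel"],
  "tokenizer": ["transformers", "CLIPTokenizer"]
}
"""


def list_components(index_text):
    """Return {attribute name: (library, class name)} for every component entry."""
    index = json.loads(index_text)
    components = {}
    for name, value in index.items():
        # Assumption: keys starting with "_" are metadata, not components.
        if name.startswith("_"):
            continue
        library, class_name = value
        components[name] = (library, class_name)
    return components


components = list_components(MODEL_INDEX_JSON)
for name, (library, class_name) in components.items():
    print(f"pipe.{name} -> {class_name} from {library}")
```

Each key becomes the attribute name on the pipeline, which is also the folder name that `save_pretrained` uses for that component.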
 ## Contribution
 We are more than happy about any contribution to the officially supported pipelines 🤗. We aspire
 for all of our pipelines to be **self-contained**, **easy-to-use**, **easy-to-tweak** and for **one-purpose-only**.
 - **Self-contained**: A pipeline shall be as self-contained as possible. More specifically, this means that all functionality should be either directly defined in the pipeline file itself, should be inherited from (and only from) the [`DiffusionPipeline` class](../diffusion_pipeline) or be directly attached to the model and scheduler components of the pipeline. 
 - **Easy-to-use**: Pipelines should be extremely easy to use - one should be able to load the pipeline and 
 use it for its designated task, *e.g.* text-to-image generation, in just a couple of lines of code. Most 
 logic including pre-processing, an unrolled diffusion loop, and post-processing should all happen inside the `__call__` method.
 - **Easy-to-tweak**: Certain pipelines will not be able to handle all use cases and tasks that you might like them to. If you want to use a certain pipeline for a specific use case that is not yet supported, you might have to copy the pipeline file and tweak the code to your needs. We try to make the pipeline code as readable as possible so that each part –from pre-processing to diffusing to post-processing– can easily be adapted. If you would like the community to benefit from your customized pipeline, we would love to see a contribution to our [community-examples](https://github.com/huggingface/diffusers/tree/main/examples/community). If you feel that an important pipeline should be part of the official pipelines but isn't, a contribution to the [official pipelines](./overview) would be even better.
 - **One-purpose-only**: Pipelines should be used for one task and one task only. Even if two tasks are very similar from a modeling point of view, *e.g.* image2image translation and in-painting, pipelines shall be used for one task only to keep them *easy-to-tweak* and *readable*.
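The principles above can be sketched with a toy example. This is not a real `diffusers` pipeline: `ToyModel`, `ToyScheduler` and `ToyPipeline` are hypothetical names, and plain Python lists stand in for tensors. It only shows the shape the guidelines ask for, with pre-processing, the unrolled loop and post-processing all visible inside `__call__`.

```python
import random


class ToyModel:
    """Stand-in for a denoising model: predicts the 'noise' in a sample."""

    def __call__(self, sample, timestep):
        # Pretend that a tenth of every value is noise.
        return [0.1 * value for value in sample]


class ToyScheduler:
    """Stand-in for a scheduler: owns the timesteps and the update rule."""

    def __init__(self, num_steps=10):
        self.timesteps = list(range(num_steps - 1, -1, -1))

    def step(self, noise_pred, sample):
        return [value - noise for value, noise in zip(sample, noise_pred)]


class ToyPipeline:
    """Single-purpose pipeline whose whole inference logic sits in __call__."""

    def __init__(self, model, scheduler):
        self.model = model
        self.scheduler = scheduler

    def __call__(self, length=4, seed=0):
        # 1. Pre-processing: draw the initial noise reproducibly.
        rng = random.Random(seed)
        sample = [rng.gauss(0.0, 1.0) for _ in range(length)]

        # 2. Unrolled denoising loop.
        for timestep in self.scheduler.timesteps:
            noise_pred = self.model(sample, timestep)
            sample = self.scheduler.step(noise_pred, sample)

        # 3. Post-processing: clamp to the output range [-1, 1].
        return [max(-1.0, min(1.0, value)) for value in sample]


pipe = ToyPipeline(ToyModel(), ToyScheduler(num_steps=10))
output = pipe(length=4, seed=0)
print(output)
```

Because every stage is written out in one method, adapting the pipeline to a new use case means copying one file and editing the relevant step.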
 ## Examples
 ### Text-to-Image generation with Stable Diffusion
 ```python
 # make sure you're logged in with `huggingface-cli login`
 from diffusers import StableDiffusionPipeline
 pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
 pipe = pipe.to("cuda")
 prompt = "a photo of an astronaut riding a horse on mars"
 image = pipe(prompt).images[0]
 image.save("astronaut_rides_horse.png")
 ```
 ### Image-to-Image text-guided generation with Stable Diffusion
 The `StableDiffusionImg2ImgPipeline` lets you pass a text prompt and an initial image to condition the generation of new images.
 ```python
 import requests
 import torch
 from PIL import Image
 from io import BytesIO
 from diffusers import StableDiffusionImg2ImgPipeline
 # load the pipeline
 device = "cuda"
 pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to(
    device
 )
 # let's download an initial image
 url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
 response = requests.get(url)
 init_image = Image.open(BytesIO(response.content)).convert("RGB")
 init_image = init_image.resize((768, 512))
 prompt = "A fantasy landscape, trending on artstation"
 images = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images
 images[0].save("fantasy_landscape.png")
 ```
 You can also run this example on colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb)
 ### Tweak prompts reusing seeds and latents
 You can generate your own latents to reproduce results, or tweak your prompt on a specific result you liked. [This notebook](https://github.com/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb) shows how to do it step by step. You can also run it in Google Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb)
 ### In-painting using Stable Diffusion
 The `StableDiffusionInpaintPipeline` lets you edit specific parts of an image by providing a mask and text prompt.
 ```python
 import PIL.Image
 import requests
 import torch
 from io import BytesIO
 from diffusers import StableDiffusionInpaintPipeline
 def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")
 img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
 mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
 init_image = download_image(img_url).resize((512, 512))
 mask_image = download_image(mask_url).resize((512, 512))
 pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
 )
 pipe = pipe.to("cuda")
 prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
 image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
 ```
 You can also run this example on colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb)
--- a/docs/source/en/api/pipelines/paint_by_example.md
+++ b/docs/source/en/api/pipelines/paint_by_example.md
@@ -1,39 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Paint By Example
 [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen.
 The abstract from the paper is:
 *Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.*
 The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and you can try it out in a [demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example).
 ## Tips
 PaintByExample is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images.
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## PaintByExamplePipeline
 [[autodoc]] PaintByExamplePipeline
 	- all
 	- __call__
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/paint_by_example.mdx
+++ b/docs/source/en/api/pipelines/paint_by_example.mdx
@@ -0,0 +1,74 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # PaintByExample
 ## Overview
 [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen.
 The abstract of the paper is the following:
 *Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.*
 The original codebase can be found [here](https://github.com/Fantasy-Studio/Paint-by-Example).
 ## Available Pipelines:
 | Pipeline | Tasks | Colab |
 |---|---|:---:|
 | [pipeline_paint_by_example.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py) | *Image-Guided Image Painting* | - |
 ## Tips
 - PaintByExample is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint has been warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) with the objective of inpainting partly masked images conditioned on example / reference images.
 - To quickly demo *PaintByExample*, please have a look at [this demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example).
 - You can run the following code snippet as an example:
 ```python
 # !pip install diffusers transformers
 import PIL.Image
 import requests
 import torch
 from io import BytesIO
 from diffusers import DiffusionPipeline
 def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")
 img_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/image/example_1.png"
 mask_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/mask/example_1.png"
 example_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/reference/example_1.jpg"
 init_image = download_image(img_url).resize((512, 512))
 mask_image = download_image(mask_url).resize((512, 512))
 example_image = download_image(example_url).resize((512, 512))
 pipe = DiffusionPipeline.from_pretrained(
    "Fantasy-Studio/Paint-by-Example",
    torch_dtype=torch.float16,
 )
 pipe = pipe.to("cuda")
 image = pipe(image=init_image, mask_image=mask_image, example_image=example_image).images[0]
 image
 ```
 ## PaintByExamplePipeline
 [[autodoc]] PaintByExamplePipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/panorama.md
+++ b/docs/source/en/api/pipelines/panorama.md
@@ -1,57 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # MultiDiffusion
 [MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.
 The abstract from the paper is:
 *Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.*
 You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion).
 ## Tips
 While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1. 
 For some GPUs with high performance, this can speed up the generation process at the cost of increased VRAM usage.
 To generate panorama-like images make sure you pass the width parameter accordingly. We recommend a width value of 2048 which is the default.
 Circular padding is applied so that there are no stitching artifacts when working with 
 panoramas, giving a seamless transition from the rightmost part to the leftmost part. 
 By enabling circular padding (set `circular_padding=True`), the operation applies additional 
 crops after the rightmost point of the image, allowing the model to "see" the transition 
 from the rightmost part to the leftmost part. This helps maintain visual consistency in 
 a 360-degree sense and creates a proper "panorama" that can be viewed using 360-degree 
 panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied 
 to ensure that the decoded latents match in the RGB space.
 For example, without circular padding, there is a stitching artifact (default):
 ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20no_circular_padding.png)
 But with circular padding, the right and the left parts are matching (`circular_padding=True`):
 ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20circular_padding.png)
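To make the "additional crops after the rightmost point" more concrete, here is a minimal sketch of sliding-window crops over the image columns, with and without wrap-around. The helper names `window_spans` and `columns`, and the small sizes used, are assumptions for illustration only; this is not the pipeline's actual implementation, and it assumes the image is at least as wide as one window.

```python
def window_spans(width, window, stride, circular_padding=False):
    """Return (start, end) column spans of sliding windows over `width` columns."""
    if circular_padding:
        # One window per stride step, so the last windows run past the right edge.
        starts = range(0, width, stride)
    else:
        # Stop at the last window that still fits entirely inside the image.
        starts = range(0, width - window + 1, stride)
    return [(start, start + window) for start in starts]


def columns(span, width):
    """Map a span to concrete column indices, wrapping around past the right edge."""
    start, end = span
    return [column % width for column in range(start, end)]


WIDTH, WINDOW, STRIDE = 16, 8, 4
plain = window_spans(WIDTH, WINDOW, STRIDE)
circular = window_spans(WIDTH, WINDOW, STRIDE, circular_padding=True)

print(plain)                         # windows that stay inside the image
print(circular)                      # one extra window crossing the right edge
print(columns(circular[-1], WIDTH))  # columns covered by that extra window
```

The extra window covers the last columns followed by the first ones, which is how a model can be shown the seam between the right and left edges.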
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## StableDiffusionPanoramaPipeline
 [[autodoc]] StableDiffusionPanoramaPipeline
 	- __call__
 	- all
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/paradigms.md
+++ b/docs/source/en/api/pipelines/paradigms.md
@@ -1,54 +0,0 @@
 <!--Copyright 2023 ParaDiGMS authors and The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Parallel Sampling of Diffusion Models
 [Parallel Sampling of Diffusion Models](https://huggingface.co/papers/2305.16317) is by Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari.
 The abstract from the paper is:
 *Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*
 The original codebase can be found at [AndyShih12/paradigms](https://github.com/AndyShih12/paradigms), and the pipeline was contributed by [AndyShih12](https://github.com/AndyShih12). ❤️
 ## Tips
 This pipeline improves sampling speed by running denoising steps in parallel, at the cost of increased total FLOPs.
 Therefore, it is better to call this pipeline when running on multiple GPUs. Otherwise, without enough GPU bandwidth,
 sampling may be even slower than sequential sampling.
 The two parameters to play with are `parallel` (batch size) and `tolerance`. 
 - If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 
 (for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size
 may not fit in memory, and lower batch size gives less parallelism. 
 - For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. 
 If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`.
 For a 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from [`StableDiffusionParadigmsPipeline`] compared to the [`StableDiffusionPipeline`]
 by setting `parallel=80` and `tolerance=0.1`.
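The trade-off behind `tolerance` can be sketched with a toy Picard iteration, the idea named in the abstract. Everything here is an assumption for illustration: `drift` is a made-up scalar update standing in for a denoising step, the whole trajectory is refined as one window (the pipeline limits the window with `parallel`), and the stopping rule is a simple maximum-change threshold. It is not the pipeline's implementation.

```python
def drift(x):
    """Toy per-step update; stands in for one denoising step of a real model."""
    return -0.1 * x


def sequential_sample(x0, num_steps):
    """Baseline: each step needs the result of the previous one."""
    xs = [x0]
    for _ in range(num_steps):
        xs.append(xs[-1] + drift(xs[-1]))
    return xs


def picard_sample(x0, num_steps, tolerance):
    """Refine a guess for the whole trajectory until it stops changing."""
    xs = [x0] * (num_steps + 1)  # initial guess: every step equals x0
    for iteration in range(1, num_steps + 1):
        # These evaluations are independent of each other, so they could be
        # batched across devices; here they run in a plain loop.
        drifts = [drift(x) for x in xs[:-1]]
        new_xs = [x0]
        for d in drifts:
            new_xs.append(new_xs[-1] + d)  # running sum of the drifts
        change = max(abs(a - b) for a, b in zip(new_xs, xs))
        xs = new_xs
        if change < tolerance:
            break
    return xs, iteration


NUM_STEPS = 20
reference = sequential_sample(1.0, NUM_STEPS)
approx, iterations = picard_sample(1.0, NUM_STEPS, tolerance=1e-3)
max_error = max(abs(a - b) for a, b in zip(approx, reference))

print(f"{iterations} parallel rounds instead of {NUM_STEPS} sequential steps")
print(f"max error vs. sequential: {max_error:.2e}")
```

A looser tolerance stops after fewer rounds but leaves a larger gap to the sequential result, which mirrors the advice above to lower the tolerance if sample quality degrades.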
 🤗 Diffusers offers [distributed inference support](../training/distributed_inference) for generating multiple prompts
 in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs.
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## StableDiffusionParadigmsPipeline
 [[autodoc]] StableDiffusionParadigmsPipeline
 	- __call__
 	- all
 ## StableDiffusionPipelineOutput
 [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
--- a/docs/source/en/api/pipelines/pixart.md
+++ b/docs/source/en/api/pipelines/pixart.md
@@ -1,36 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # PixArt
 ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage.png)
 [PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis](https://huggingface.co/papers/2310.00426) is by Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li.
 The abstract from the paper is:
 *The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.*
 You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha).
 Some notes about this pipeline:
 * It uses a Transformer backbone (instead of a UNet) for denoising. As such, it has an architecture similar to [DiT](./dit.md).
 * It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. 
 * It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py).
 * It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.
 ## PixArtAlphaPipeline
 [[autodoc]] PixArtAlphaPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/pndm.md
+++ b/docs/source/en/api/pipelines/pndm.md
@@ -1,35 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # PNDM
 [Pseudo Numerical methods for Diffusion Models on manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.
 The abstract from the paper is:
 *Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules.*
 The original codebase can be found at [luping-liu/PNDM](https://github.com/luping-liu/PNDM).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## PNDMPipeline
 [[autodoc]] PNDMPipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/pndm.mdx
+++ b/docs/source/en/api/pipelines/pndm.mdx
@@ -0,0 +1,35 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # PNDM
 ## Overview
 [Pseudo Numerical methods for Diffusion Models on manifolds](https://arxiv.org/abs/2202.09778) (PNDM) by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.
 The abstract of the paper is the following:
 Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. 
 The original codebase can be found [here](https://github.com/luping-liu/PNDM).
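The pseudo linear multi-step method mentioned in the abstract can be illustrated with a small sketch. It assumes the method blends the four most recent noise predictions with fourth-order Adams-Bashforth weights; plain floats stand in for tensors, and the function name `combine_noise_predictions` is hypothetical rather than part of `diffusers`.

```python
from typing import Sequence

# Fourth-order Adams-Bashforth weights, newest prediction first (assumed here).
AB4_WEIGHTS = (55 / 24, -59 / 24, 37 / 24, -9 / 24)


def combine_noise_predictions(history: Sequence[float]) -> float:
    """Blend the four most recent noise predictions (oldest first, newest last)."""
    if len(history) < 4:
        raise ValueError("need at least four stored predictions")
    # Reverse the last four entries so the newest prediction gets the first weight.
    newest_first = list(history[-4:])[::-1]
    return sum(w * e for w, e in zip(AB4_WEIGHTS, newest_first))


# A constant history is returned unchanged because the weights sum to 1.
constant = combine_noise_predictions([2.0, 2.0, 2.0, 2.0])
# A linearly growing history is extrapolated half a step ahead.
linear = combine_noise_predictions([0.0, 1.0, 2.0, 3.0])
```

Because each step reuses stored predictions, a multi-step scheme of this kind needs only one new model evaluation per step once the history is filled.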
 ## Available Pipelines:
 | Pipeline | Tasks | Colab |
 |---|---|:---:|
 | [pipeline_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pndm/pipeline_pndm.py) | *Unconditional Image Generation* | - |
 ## PNDMPipeline
 [[autodoc]] PNDMPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/repaint.md
+++ b/docs/source/en/api/pipelines/repaint.md
@@ -1,37 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # RePaint
 [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) is by Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool.
 The abstract from the paper is:
 *Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks.
 RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions.*
 The original codebase can be found at [andreas128/RePaint](https://github.com/andreas128/RePaint).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## RePaintPipeline
 [[autodoc]] RePaintPipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/repaint.mdx
+++ b/docs/source/en/api/pipelines/repaint.mdx
@@ -0,0 +1,77 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # RePaint
 ## Overview
 [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2201.09865) by Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool.
 The abstract of the paper is the following:
 Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks.
 RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions.
 The original codebase can be found [here](https://github.com/andreas128/RePaint).
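The conditioning step described in the abstract can be written as a per-step blend. This is a sketch using notation assumed from the paper, not symbols defined on this page: $m$ is the binary mask of known pixels, $x^{\text{known}}_{t-1}$ is the given image noised to step $t-1$, and $x^{\text{unknown}}_{t-1}$ is the model's reverse-diffusion sample.

```latex
x_{t-1} = m \odot x^{\text{known}}_{t-1} + (1 - m) \odot x^{\text{unknown}}_{t-1}
```

Only the sampling loop changes; the pretrained DDPM network itself is left untouched.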
 ## Available Pipelines:
 | Pipeline                                                                                                                      | Tasks              | Colab |
 |-------------------------------------------------------------------------------------------------------------------------------|--------------------|:---:|
 | [pipeline_repaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/repaint/pipeline_repaint.py) | *Image Inpainting* | - |
 ## Usage example
 ```python
 from io import BytesIO
 import torch
 from PIL import Image
 import requests
 from diffusers import RePaintPipeline, RePaintScheduler
 def download_image(url):
     # Fetch the image over HTTP and decode it as an RGB PIL image
     response = requests.get(url)
     response.raise_for_status()
     return Image.open(BytesIO(response.content)).convert("RGB")
 img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
 mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
 # Load the original image and the mask as PIL images
 original_image = download_image(img_url).resize((256, 256))
 mask_image = download_image(mask_url).resize((256, 256))
 # Load the RePaint scheduler and pipeline based on a pretrained DDPM model
 scheduler = RePaintScheduler.from_pretrained("google/ddpm-ema-celebahq-256")
 pipe = RePaintPipeline.from_pretrained("google/ddpm-ema-celebahq-256", scheduler=scheduler)
 pipe = pipe.to("cuda")
 generator = torch.Generator(device="cuda").manual_seed(0)
 output = pipe(
    original_image=original_image,
    mask_image=mask_image,
    num_inference_steps=250,
    eta=0.0,
    jump_length=10,
    jump_n_sample=10,
    generator=generator,
 )
 inpainted_image = output.images[0]
 ```
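The `jump_length` and `jump_n_sample` arguments used above control RePaint's resampling: after denoising for a while, the sampler jumps back to a noisier timestep and denoises again. The sketch below shows one way such a timestep schedule could be built. It is modeled on the resampling idea from the paper, the function name `resampling_timesteps` is hypothetical, and the exact schedule produced by `RePaintScheduler` may differ.

```python
from typing import List


def resampling_timesteps(num_steps: int, jump_length: int, jump_n_sample: int) -> List[int]:
    """Build a denoising schedule that periodically jumps back to noisier steps."""
    # Number of remaining jumps allowed at each jump point.
    jumps = {j: jump_n_sample - 1 for j in range(0, num_steps - jump_length, jump_length)}
    timesteps: List[int] = []
    t = num_steps
    while t >= 1:
        t -= 1
        timesteps.append(t)  # one ordinary denoising step
        if jumps.get(t, 0) > 0:
            jumps[t] -= 1
            # Walk forward (re-noise) for jump_length steps before denoising again.
            for _ in range(jump_length):
                t += 1
                timesteps.append(t)
    return timesteps


schedule = resampling_timesteps(num_steps=4, jump_length=2, jump_n_sample=2)
print(schedule)
```

With larger values of `jump_n_sample` the schedule grows quickly, so the total number of model evaluations can be well above `num_inference_steps`.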
 ## RePaintPipeline
 [[autodoc]] RePaintPipeline
 	- all
 	- __call__
--- a/docs/source/en/api/pipelines/score_sde_ve.md
+++ b/docs/source/en/api/pipelines/score_sde_ve.md
@@ -1,35 +0,0 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Score SDE VE
 [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) (Score SDE) is by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole. This pipeline implements the variance expanding (VE) variant of the stochastic differential equation method.
 The abstract from the paper is:
 *Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a. score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.*
 The original codebase can be found at [yang-song/score_sde_pytorch](https://github.com/yang-song/score_sde_pytorch).
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
 </Tip>
 ## ScoreSdeVePipeline
 [[autodoc]] ScoreSdeVePipeline
 	- all
 	- __call__
 ## ImagePipelineOutput
 [[autodoc]] pipelines.ImagePipelineOutput
--- a/docs/source/en/api/pipelines/score_sde_ve.mdx
+++ b/docs/source/en/api/pipelines/score_sde_ve.mdx
@@ -0,0 +1,36 @@
 <!--Copyright 2023 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Score SDE VE
 ## Overview
 [Score-Based Generative Modeling through Stochastic Differential Equations](https://arxiv.org/abs/2011.13456) (Score SDE) by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole.
 The abstract of the paper is the following:
 Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a. score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.
 The original codebase can be found [here](https://github.com/yang-song/score_sde_pytorch).
 This pipeline implements the Variance Expanding (VE) variant of the method.
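In the VE formulation, the noise scale grows geometrically from a small value to a large one as time runs from 0 to 1. The sketch below illustrates that schedule under stated assumptions: the function name `ve_sigma` is hypothetical, and the default `sigma_min` and `sigma_max` values are illustrative rather than the settings of any particular checkpoint.

```python
def ve_sigma(t: float, sigma_min: float = 0.01, sigma_max: float = 50.0) -> float:
    """Geometric noise scale for a variance-expanding process, with t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return sigma_min * (sigma_max / sigma_min) ** t


# The noise scale starts at sigma_min and ends at sigma_max.
start, middle, end = ve_sigma(0.0), ve_sigma(0.5), ve_sigma(1.0)
```

Sampling runs this schedule in reverse, starting from noise at the largest scale and stepping down toward `sigma_min`.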
 ## Available Pipelines:
 | Pipeline | Tasks | Colab |
 |---|---|:---:|
 | [pipeline_score_sde_ve.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py) | *Unconditional Image Generation* | - |
 ## ScoreSdeVePipeline
 [[autodoc]] ScoreSdeVePipeline
 	- all
 	- __call__
--- a/Show More
+++ b/Show More
@@ -1,3 +0,0 @@
 # Overview
 The APIs in this section are more experimental and prone to breaking changes. Most of them are used internally for development, but they may also be useful to you if you're interested in building a diffusion model with some custom parts or if you're interested in some of our helper utilities for working with 🤗 Diffusers.