Fix mirror_community_pipeline.yml name

2026-04-26 01:41:29 +08:00 · 2024-06-07 10:59:16 +02:00
631 changed files with 9094 additions and 80874 deletions
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -57,54 +57,50 @@ body:
      description: |
        Your issue will be replied to more quickly if you can figure out the right person to tag with @.
        If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
-
+        
        All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and
        a core maintainer will ping the right person.
-
+        
        Please tag a maximum of 2 people.

-        Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...): @sayakpaul @DN6
+        Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...):

        Questions on pipelines:
-        - Stable Diffusion @yiyixuxu @asomoza
-        - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6
-        - Stable Diffusion 3: @yiyixuxu @sayakpaul @DN6 @asomoza
-        - Kandinsky @yiyixuxu
-        - ControlNet @sayakpaul @yiyixuxu @DN6
-        - T2I Adapter @sayakpaul @yiyixuxu @DN6
-        - IF @DN6
-        - Text-to-Video / Video-to-Video @DN6 @a-r-r-o-w
-        - Wuerstchen @DN6
+        - Stable Diffusion @yiyixuxu @DN6 @sayakpaul 
+        - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 
+        - Kandinsky @yiyixuxu 
+        - ControlNet @sayakpaul @yiyixuxu @DN6 
+        - T2I Adapter @sayakpaul @yiyixuxu @DN6 
+        - IF @DN6 
+        - Text-to-Video / Video-to-Video @DN6 @sayakpaul 
+        - Wuerstchen @DN6 
        - Other: @yiyixuxu @DN6
-        - Improving generation quality: @asomoza

        Questions on models:
-        - UNet @DN6 @yiyixuxu @sayakpaul
-        - VAE @sayakpaul @DN6 @yiyixuxu
-        - Transformers/Attention @DN6 @yiyixuxu @sayakpaul
+        - UNet @DN6 @yiyixuxu @sayakpaul 
+        - VAE @sayakpaul @DN6 @yiyixuxu 
+        - Transformers/Attention @DN6 @yiyixuxu @sayakpaul @DN6 

-        Questions on single file checkpoints: @DN6
+        Questions on Schedulers: @yiyixuxu 

-        Questions on Schedulers: @yiyixuxu
+        Questions on LoRA: @sayakpaul 

-        Questions on LoRA: @sayakpaul
+        Questions on Textual Inversion: @sayakpaul 

-        Questions on Textual Inversion: @sayakpaul
+        Questions on Training: 
+        - DreamBooth @sayakpaul 
+        - Text-to-Image Fine-tuning @sayakpaul 
+        - Textual Inversion @sayakpaul 
+        - ControlNet @sayakpaul 

-        Questions on Training:
-        - DreamBooth @sayakpaul
-        - Text-to-Image Fine-tuning @sayakpaul
-        - Textual Inversion @sayakpaul
-        - ControlNet @sayakpaul
-
-        Questions on Tests: @DN6 @sayakpaul @yiyixuxu
+        Questions on Tests: @DN6 @sayakpaul @yiyixuxu 

        Questions on Documentation: @stevhliu

        Questions on JAX- and MPS-related things: @pcuenca

-        Questions on audio pipelines: @sanchit-gandhi
-
-
+        Questions on audio pipelines: @DN6 
+        

+        
      placeholder: "@Username ..."
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -38,9 +38,9 @@ members/contributors who may be interested in your PR.

 Core library:

- Schedulers: @yiyixuxu
- Pipelines and pipeline callbacks: @yiyixuxu and @asomoza
- Training examples: @sayakpaul
+- Schedulers: @yiyixuxu 
+- Pipelines:  @sayakpaul @yiyixuxu @DN6
+- Training examples: @sayakpaul 
 - Docs: @stevhliu and @sayakpaul
 - JAX and MPS: @pcuenca
 - Audio: @sanchit-gandhi
@@ -48,8 +48,7 @@ Core library:

 Integrations:

- deepspeed: HF Trainer/Accelerate: @SunMarc
- PEFT: @sayakpaul @BenjaminBossan
+- deepspeed: HF Trainer/Accelerate: @pacman100

 HF projects:

--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -13,17 +13,14 @@ env:

 jobs:
  torch_pipelines_cuda_benchmark_tests:
-    env:
-      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_BENCHMARK }}
    name: Torch Core Pipelines CUDA Benchmarking Tests
    strategy:
      fail-fast: false
      max-parallel: 1
-    runs-on:
-      group: aws-g6-4xlarge-plus
+    runs-on: [single-gpu, nvidia-gpu, a10, ci]
    container:
-      image: diffusers/diffusers-pytorch-compile-cuda
-      options: --shm-size "16gb" --ipc host --gpus 0
+      image: diffusers/diffusers-pytorch-cuda
+      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
    steps:
      - name: Checkout diffusers
        uses: actions/checkout@v3
@@ -53,14 +50,4 @@ jobs:
        uses: actions/upload-artifact@v2
        with:
          name: benchmark_test_reports
-          path: benchmarks/benchmark_outputs
-
-      - name: Report success status
-        if: ${{ success() }}
-        run: |
-          pip install requests && python utils/notify_benchmarking_status.py --status=success
-
-      - name: Report failure status
-        if: ${{ failure() }}
-        run: |
-          pip install requests && python utils/notify_benchmarking_status.py --status=failure
+          path: benchmarks/benchmark_outputs
--- a/.github/workflows/build_docker_images.yml
+++ b/.github/workflows/build_docker_images.yml
@@ -20,8 +20,7 @@ env:

 jobs:
  test-build-docker-images:
-    runs-on:
-      group: aws-general-8-plus
+    runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ]
    if: github.event_name == 'pull_request'
    steps:
      - name: Set up Docker Buildx
@@ -51,8 +50,7 @@ jobs:
        if: steps.file_changes.outputs.all != ''

  build-and-push-docker-images:
-    runs-on:
-      group: aws-general-8-plus
+    runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ]
    if: github.event_name != 'pull_request'

    permissions:
@@ -100,4 +98,4 @@ jobs:
          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
          title: "🤗 Results of the ${{ matrix.image-name }} Docker Image build"
          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
--- a/.github/workflows/mirror_community_pipeline.yml
+++ b/.github/workflows/mirror_community_pipeline.yml
@@ -22,9 +22,6 @@ on:

 jobs:
  mirror_community_pipeline:
-    env:
-      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_COMMUNITY_MIRROR }}
-
    runs-on: ubuntu-latest
    steps:
      # Checkout to correct ref
@@ -39,7 +36,7 @@ jobs:
      #     If ref is 'refs/heads/main' => set 'main'
      #     Else it must be a tag => set {tag}
      - name: Set checkout_ref and path_in_repo
-        run: |
+        run: | 
          if [ "${{ github.event_name }}" == "workflow_dispatch" ]; then
            if [ -z "${{ github.event.inputs.ref }}" ]; then
              echo "Error: Missing ref input"
@@ -57,12 +54,8 @@ jobs:
          else
            # e.g. refs/tags/v0.28.1 -> v0.28.1
            echo "CHECKOUT_REF=${{ github.ref }}" >> $GITHUB_ENV
-            echo "PATH_IN_REPO=$(echo ${{ github.ref }} | sed 's/^refs\/tags\///')" >> $GITHUB_ENV
+            echo "PATH_IN_REPO=${${{ github.ref }}#refs/tags/}" >> $GITHUB_ENV
          fi
-      - name: Print env vars
-        run: |
-          echo "CHECKOUT_REF: ${{ env.CHECKOUT_REF }}"
-          echo "PATH_IN_REPO: ${{ env.PATH_IN_REPO }}"
      - uses: actions/checkout@v3
        with:
          ref: ${{ env.CHECKOUT_REF }}
@@ -74,8 +67,8 @@ jobs:
          python-version: "3.10"
      - name: Install dependencies
        run: |
-          python -m pip install --upgrade pip
-          pip install --upgrade huggingface_hub
+          python -m pip install uv
+          uv pip install --upgrade huggingface_hub

      # Check secret is set
      - name: whoami
@@ -89,14 +82,4 @@ jobs:
        run: huggingface-cli upload diffusers/community-pipelines-mirror ./examples/community ${PATH_IN_REPO} --repo-type dataset
        env:
            PATH_IN_REPO: ${{ env.PATH_IN_REPO }}
-            HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }}
-
-      - name: Report success status
-        if: ${{ success() }}
-        run: |
-          pip install requests && python utils/notify_community_pipelines_mirror.py --status=success
-
-      - name: Report failure status
-        if: ${{ failure() }}
-        run: |
-          pip install requests && python utils/notify_community_pipelines_mirror.py --status=failure
+            HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }}
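
Review note on the `PATH_IN_REPO` change in this file: GitHub Actions substitutes `${{ github.ref }}` into the script text before the shell runs. The `+` form `${${{ github.ref }}#refs/tags/}` therefore reaches bash as `${refs/tags/v0.28.1#refs/tags/}`, which is not a prefix strip on the ref value. As far as I can tell, bash reads it as a pattern substitution on a variable named `refs`. A minimal sketch of the intended behaviour assigns the ref to a variable first and then uses `${var#prefix}`. The helper name `strip_tag_prefix` is hypothetical and appears nowhere in the workflow.

```shell
# Hypothetical helper (not part of the workflow): strip a leading
# "refs/tags/" from a Git ref using POSIX parameter expansion.
strip_tag_prefix() {
  # ${1#refs/tags/} removes the shortest match of the prefix, if present;
  # a ref without that prefix is returned unchanged.
  printf '%s\n' "${1#refs/tags/}"
}

strip_tag_prefix "refs/tags/v0.28.1"   # prints v0.28.1
```

In a workflow step the same idea would be to copy the expression into a shell variable (for example through `env:`) and apply `${VAR#refs/tags/}` to that variable. This avoids the extra `sed` process used on the `-` side.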
--- a/.github/workflows/nightly_tests.yml
+++ b/.github/workflows/nightly_tests.yml
@@ -7,7 +7,7 @@ on:

 env:
  DIFFUSERS_IS_CI: yes
-  HF_HUB_ENABLE_HF_TRANSFER: 1
+  HF_HOME: /mnt/cache
  OMP_NUM_THREADS: 8
  MKL_NUM_THREADS: 8
  PYTEST_TIMEOUT: 600
@@ -18,11 +18,8 @@ env:

 jobs:
  setup_torch_cuda_pipeline_matrix:
-    name: Setup Torch Pipelines CUDA Slow Tests Matrix
-    runs-on:
-      group: aws-general-8-plus
-    container:
-      image: diffusers/diffusers-pytorch-cpu
+    name: Setup Torch Pipelines Matrix
+    runs-on: diffusers/diffusers-pytorch-cpu
    outputs:
      pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }}
    steps:
@@ -30,9 +27,13 @@ jobs:
        uses: actions/checkout@v3
        with:
          fetch-depth: 2
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.8"
      - name: Install dependencies
        run: |
-          pip install -e .[test]
+          pip install -e .
          pip install huggingface_hub
      - name: Fetch Pipeline Matrix
        id: fetch_pipeline_matrix
@@ -49,18 +50,16 @@ jobs:
          path: reports

  run_nightly_tests_for_torch_pipelines:
-    name: Nightly Torch Pipelines CUDA Tests
+    name: Torch Pipelines CUDA Nightly Tests
    needs: setup_torch_cuda_pipeline_matrix
    strategy:
      fail-fast: false
-      max-parallel: 8
      matrix:
        module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }}
-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-pytorch-cuda
-      options: --shm-size "16gb" --ipc host --gpus 0
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0
    steps:
      - name: Checkout diffusers
        uses: actions/checkout@v3
@@ -68,16 +67,19 @@ jobs:
          fetch-depth: 2
      - name: NVIDIA-SMI
        run: nvidia-smi
+
      - name: Install dependencies
        run: |
          python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
          python -m uv pip install -e [quality,test]
          python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
          python -m uv pip install pytest-reportlog
+
      - name: Environment
        run: |
          python utils/print_env.py
-      - name: Pipeline CUDA Test
+
+      - name: Nightly PyTorch CUDA checkpoint (pipelines) tests
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
          # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
@@ -88,37 +90,38 @@ jobs:
            --make-reports=tests_pipeline_${{ matrix.module }}_cuda \
            --report-log=tests_pipeline_${{ matrix.module }}_cuda.log \
            tests/pipelines/${{ matrix.module }}
+
      - name: Failure short reports
        if: ${{ failure() }}
        run: |
          cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt
          cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt
+
      - name: Test suite reports artifacts
        if: ${{ always() }}
        uses: actions/upload-artifact@v2
        with:
          name: pipeline_${{ matrix.module }}_test_reports
          path: reports
+
      - name: Generate Report and Notify Channel
        if: always()
        run: |
          pip install slack_sdk tabulate
-          python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+          python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY

  run_nightly_tests_for_other_torch_modules:
-    name: Nightly Torch CUDA Tests
-    runs-on:
-      group: aws-g4dn-2xlarge
+    name: Torch Non-Pipelines CUDA Nightly Tests
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-pytorch-cuda
-      options: --shm-size "16gb" --ipc host --gpus 0
+      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
    defaults:
      run:
        shell: bash
    strategy:
-      max-parallel: 2
      matrix:
-        module: [models, schedulers, lora, others, single_file, examples]
+        module: [models, schedulers, others, examples]
    steps:
    - name: Checkout diffusers
      uses: actions/checkout@v3
@@ -130,8 +133,8 @@ jobs:
        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
        python -m uv pip install -e [quality,test]
        python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
-        python -m uv pip install peft@git+https://github.com/huggingface/peft.git
        python -m uv pip install pytest-reportlog
+
    - name: Environment
      run: python utils/print_env.py

@@ -155,6 +158,7 @@ jobs:
        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
        CUBLAS_WORKSPACE_CONFIG: :16:8
      run: |
+        python -m uv pip install peft@git+https://github.com/huggingface/peft.git
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
          -s -v --make-reports=examples_torch_cuda \
          --report-log=examples_torch_cuda.log \
@@ -177,7 +181,64 @@ jobs:
      if: always()
      run: |
        pip install slack_sdk tabulate
-        python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+        python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY
+
+  run_lora_nightly_tests:
+    name: Nightly LoRA Tests with PEFT and TORCH
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
+    container:
+      image: diffusers/diffusers-pytorch-cuda
+      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
+    defaults:
+      run:
+        shell: bash
+    steps:
+    - name: Checkout diffusers
+      uses: actions/checkout@v3
+      with:
+        fetch-depth: 2
+
+    - name: Install dependencies
+      run: |
+        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
+        python -m uv pip install -e [quality,test]
+        python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
+        python -m uv pip install peft@git+https://github.com/huggingface/peft.git
+        python -m uv pip install pytest-reportlog
+
+    - name: Environment
+      run: python utils/print_env.py
+
+    - name: Run nightly LoRA tests with PEFT and Torch
+      env:
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
+        CUBLAS_WORKSPACE_CONFIG: :16:8
+      run: |
+        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
+          -s -v -k "not Flax and not Onnx" \
+          --make-reports=tests_torch_lora_cuda \
+          --report-log=tests_torch_lora_cuda.log \
+          tests/lora
+
+    - name: Failure short reports
+      if: ${{ failure() }}
+      run: |
+        cat reports/tests_torch_lora_cuda_stats.txt
+        cat reports/tests_torch_lora_cuda_failures_short.txt
+
+    - name: Test suite reports artifacts
+      if: ${{ always() }}
+      uses: actions/upload-artifact@v2
+      with:
+        name: torch_lora_cuda_test_reports
+        path: reports
+
+    - name: Generate Report and Notify Channel
+      if: always()
+      run: |
+        pip install slack_sdk tabulate
+        python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY

  run_flax_tpu_tests:
    name: Nightly Flax TPU Tests
@@ -233,15 +294,14 @@ jobs:
      if: always()
      run: |
        pip install slack_sdk tabulate
-        python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+        python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY

  run_nightly_onnx_tests:
    name: Nightly ONNXRuntime CUDA tests on Ubuntu
-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-onnxruntime-cuda
-      options: --gpus 0 --shm-size "16gb" --ipc host
+      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/

    steps:
    - name: Checkout diffusers
@@ -258,10 +318,11 @@ jobs:
        python -m uv pip install -e [quality,test]
        python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
        python -m uv pip install pytest-reportlog
+
    - name: Environment
      run: python utils/print_env.py

-    - name: Run Nightly ONNXRuntime CUDA tests
+    - name: Run nightly ONNXRuntime CUDA tests
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
@@ -288,7 +349,7 @@ jobs:
      if: always()
      run: |
        pip install slack_sdk tabulate
-        python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+        python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY

  run_nightly_tests_apple_m1:
    name: Nightly PyTorch MPS tests on MacOS
@@ -350,4 +411,4 @@ jobs:
        if: always()
        run: |
          pip install slack_sdk tabulate
-          python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+          python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY
--- a/.github/workflows/notify_slack_about_release.yml
+++ b/.github/workflows/notify_slack_about_release.yml
@@ -11,12 +11,12 @@ jobs:

    steps:
    - uses: actions/checkout@v3
-
+    
    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.8'
-
+    
    - name: Notify Slack about the release
      env:
        SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
--- a/.github/workflows/pr_dependency_test.yml
+++ b/.github/workflows/pr_dependency_test.yml
@@ -33,3 +33,4 @@ jobs:
        run: |
          python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
          pytest tests/others/test_dependencies.py
+      
--- a/.github/workflows/pr_test_fetcher.yml
+++ b/.github/workflows/pr_test_fetcher.yml
@@ -15,8 +15,7 @@ concurrency:
 jobs:
  setup_pr_tests:
    name: Setup PR Tests
-    runs-on:
-      group: aws-general-8-plus
+    runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ]
    container:
      image: diffusers/diffusers-pytorch-cpu
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
@@ -74,8 +73,7 @@ jobs:
      max-parallel: 2
      matrix:
        modules: ${{ fromJson(needs.setup_pr_tests.outputs.matrix) }}
-    runs-on:
-      group: aws-general-8-plus
+    runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ]
    container:
      image: diffusers/diffusers-pytorch-cpu
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
@@ -125,13 +123,12 @@ jobs:
        config:
          - name: Hub tests for models, schedulers, and pipelines
            framework: hub_tests_pytorch
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-pytorch-cpu
            report: torch_hub

    name: ${{ matrix.config.name }}
-    runs-on:
-      group: ${{ matrix.config.runner }}
+    runs-on: ${{ matrix.config.runner }}
    container:
      image: ${{ matrix.config.image }}
      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
--- a/.github/workflows/pr_test_peft_backend.yml
+++ b/.github/workflows/pr_test_peft_backend.yml
@@ -71,8 +71,7 @@ jobs:

    name: LoRA - ${{ matrix.lib-versions }}

-    runs-on:
-      group: aws-general-8-plus
+    runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ]

    container:
      image: diffusers/diffusers-pytorch-cpu
@@ -116,17 +115,17 @@ jobs:
          -s -v \
          --make-reports=tests_models_lora_${{ matrix.config.report }} \
          tests/models/ -k "lora"
-
-
+        
+    
    - name: Failure short reports
      if: ${{ failure() }}
      run: |
        cat reports/tests_${{ matrix.config.report }}_failures_short.txt
        cat reports/tests_models_lora_${{ matrix.config.report }}_failures_short.txt
-
+    
    - name: Test suite reports artifacts
      if: ${{ always() }}
      uses: actions/upload-artifact@v2
      with:
        name: pr_${{ matrix.config.report }}_test_reports
-        path: reports
+        path: reports
--- a/.github/workflows/pr_tests.yml
+++ b/.github/workflows/pr_tests.yml
@@ -77,29 +77,28 @@ jobs:
        config:
          - name: Fast PyTorch Pipeline CPU tests
            framework: pytorch_pipelines
-            runner: aws-highmemory-32-plus
+            runner: [ self-hosted, intel-cpu, 32-cpu, 256-ram, ci ]
            image: diffusers/diffusers-pytorch-cpu
            report: torch_cpu_pipelines
          - name: Fast PyTorch Models & Schedulers CPU tests
            framework: pytorch_models
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-pytorch-cpu
            report: torch_cpu_models_schedulers
          - name: Fast Flax CPU tests
            framework: flax
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-flax-cpu
            report: flax_cpu
          - name: PyTorch Example CPU tests
            framework: pytorch_examples
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-pytorch-cpu
            report: torch_example_cpu

    name: ${{ matrix.config.name }}

-    runs-on:
-      group: ${{ matrix.config.runner }}
+    runs-on: ${{ matrix.config.runner }}

    container:
      image: ${{ matrix.config.image }}
@@ -181,8 +180,7 @@ jobs:
        config:
          - name: Hub tests for models, schedulers, and pipelines
            framework: hub_tests_pytorch
-            runner:
-              group: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-pytorch-cpu
            report: torch_hub

--- a/.github/workflows/push_tests.yml
+++ b/.github/workflows/push_tests.yml
@@ -11,16 +11,17 @@ on:

 env:
  DIFFUSERS_IS_CI: yes
+  HF_HOME: /mnt/cache
  OMP_NUM_THREADS: 8
  MKL_NUM_THREADS: 8
  PYTEST_TIMEOUT: 600
+  RUN_SLOW: yes
  PIPELINE_USAGE_CUTOFF: 50000

 jobs:
  setup_torch_cuda_pipeline_matrix:
    name: Setup Torch Pipelines CUDA Slow Tests Matrix
-    runs-on:
-      group: aws-general-8-plus
+    runs-on: [ self-hosted, intel-cpu, 8-cpu, ci ]
    container:
      image: diffusers/diffusers-pytorch-cpu
    outputs:
@@ -51,18 +52,17 @@ jobs:
          path: reports

  torch_pipelines_cuda_tests:
-    name: Torch Pipelines CUDA Tests
+    name: Torch Pipelines CUDA Slow Tests
    needs: setup_torch_cuda_pipeline_matrix
    strategy:
      fail-fast: false
      max-parallel: 8
      matrix:
        module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }}
-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-pytorch-cuda
-      options: --shm-size "16gb" --ipc host --gpus 0
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0
    steps:
      - name: Checkout diffusers
        uses: actions/checkout@v3
@@ -103,11 +103,10 @@ jobs:

  torch_cuda_tests:
    name: Torch CUDA Tests
-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-pytorch-cuda
-      options: --shm-size "16gb" --ipc host --gpus 0
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0
    defaults:
      run:
        shell: bash
@@ -125,13 +124,12 @@ jobs:
        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
        python -m uv pip install -e [quality,test]
        python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
-        python -m uv pip install peft@git+https://github.com/huggingface/peft.git

    - name: Environment
      run: |
        python utils/print_env.py

-    - name: Run PyTorch CUDA tests
+    - name: Run slow PyTorch CUDA tests
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
@@ -155,6 +153,61 @@ jobs:
        name: torch_cuda_test_reports
        path: reports

+  peft_cuda_tests:
+    name: PEFT CUDA Tests
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
+    container:
+      image: diffusers/diffusers-pytorch-cuda
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0
+    defaults:
+      run:
+        shell: bash
+    steps:
+    - name: Checkout diffusers
+      uses: actions/checkout@v3
+      with:
+        fetch-depth: 2
+
+    - name: Install dependencies
+      run: |
+        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
+        python -m uv pip install -e [quality,test]
+        python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
+        python -m pip install -U peft@git+https://github.com/huggingface/peft.git
+
+    - name: Environment
+      run: |
+        python utils/print_env.py
+
+    - name: Run slow PEFT CUDA tests
+      env:
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
+        CUBLAS_WORKSPACE_CONFIG: :16:8
+      run: |
+        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
+          -s -v -k "not Flax and not Onnx and not PEFTLoRALoading" \
+          --make-reports=tests_peft_cuda \
+          tests/lora/
+        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
+          -s -v -k "lora and not Flax and not Onnx and not PEFTLoRALoading" \
+          --make-reports=tests_peft_cuda_models_lora \
+          tests/models/
+
+    - name: Failure short reports
+      if: ${{ failure() }}
+      run: |
+        cat reports/tests_peft_cuda_stats.txt
+        cat reports/tests_peft_cuda_failures_short.txt
+        cat reports/tests_peft_cuda_models_lora_failures_short.txt
+
+    - name: Test suite reports artifacts
+      if: ${{ always() }}
+      uses: actions/upload-artifact@v2
+      with:
+        name: torch_peft_test_reports
+        path: reports
+
  flax_tpu_tests:
    name: Flax TPU Tests
    runs-on: docker-tpu
@@ -204,8 +257,7 @@ jobs:

  onnx_cuda_tests:
    name: ONNX CUDA Tests
-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: diffusers/diffusers-onnxruntime-cuda
      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ --gpus 0
@@ -253,12 +305,11 @@ jobs:
  run_torch_compile_tests:
    name: PyTorch Compile CUDA tests

-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]

    container:
      image: diffusers/diffusers-pytorch-compile-cuda
-      options: --gpus 0 --shm-size "16gb" --ipc host
+      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/

    steps:
    - name: Checkout diffusers
@@ -279,7 +330,6 @@ jobs:
    - name: Run example tests on GPU
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
-        RUN_COMPILE: yes
      run: |
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
    - name: Failure short reports
@@ -296,12 +346,11 @@ jobs:
  run_xformers_tests:
    name: PyTorch xformers CUDA tests

-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]

    container:
      image: diffusers/diffusers-pytorch-xformers-cuda
-      options: --gpus 0 --shm-size "16gb" --ipc host
+      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/

    steps:
    - name: Checkout diffusers
@@ -338,12 +387,11 @@ jobs:
  run_examples_tests:
    name: Examples PyTorch CUDA tests on Ubuntu

-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]

    container:
      image: diffusers/diffusers-pytorch-cuda
-      options: --gpus 0 --shm-size "16gb" --ipc host
+      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/

    steps:
    - name: Checkout diffusers
--- a/.github/workflows/push_tests_fast.yml
+++ b/.github/workflows/push_tests_fast.yml
@@ -29,29 +29,28 @@ jobs:
        config:
          - name: Fast PyTorch CPU tests on Ubuntu
            framework: pytorch
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-pytorch-cpu
            report: torch_cpu
          - name: Fast Flax CPU tests on Ubuntu
            framework: flax
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-flax-cpu
            report: flax_cpu
          - name: Fast ONNXRuntime CPU tests on Ubuntu
            framework: onnxruntime
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-onnxruntime-cpu
            report: onnx_cpu
          - name: PyTorch Example CPU tests on Ubuntu
            framework: pytorch_examples
-            runner: aws-general-8-plus
+            runner: [ self-hosted, intel-cpu, 8-cpu, ci ]
            image: diffusers/diffusers-pytorch-cpu
            report: torch_example_cpu

    name: ${{ matrix.config.name }}

-    runs-on:
-      group: ${{ matrix.config.runner }}
+    runs-on: ${{ matrix.config.runner }}

    container:
      image: ${{ matrix.config.image }}
--- a/.github/workflows/pypi_publish.yaml
+++ b/.github/workflows/pypi_publish.yaml
@@ -29,7 +29,7 @@ jobs:
          LATEST_BRANCH=$(python utils/fetch_latest_release_branch.py)
          echo "Latest branch: $LATEST_BRANCH"
          echo "latest_branch=$LATEST_BRANCH" >> $GITHUB_ENV
-
+          
      - name: Set latest branch output
        id: set_latest_branch
        run: echo "::set-output name=latest_branch::${{ env.latest_branch }}"
@@ -43,27 +43,27 @@ jobs:
        uses: actions/checkout@v3
        with:
          ref: ${{ needs.find-and-checkout-latest-branch.outputs.latest_branch }}
-
+          
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
-
+      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -U setuptools wheel twine
          pip install -U torch --index-url https://download.pytorch.org/whl/cpu
          pip install -U transformers
-
+      
      - name: Build the dist files
        run: python setup.py bdist_wheel && python setup.py sdist
-
+      
      - name: Publish to the test PyPI
        env:
          TWINE_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }}
-        run: twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/
+        run: twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/    

      - name: Test installing diffusers and importing
        run: |
--- a/.github/workflows/run_tests_from_a_pr.yml
+++ b/.github/workflows/run_tests_from_a_pr.yml
@@ -7,7 +7,7 @@ on:
        default: 'diffusers/diffusers-pytorch-cuda'
        description: 'Name of the Docker image'
        required: true
-      branch:
+      branch: 
        description: 'PR Branch to test on'
        required: true
      test:
@@ -26,8 +26,7 @@ env:
 jobs:
  run_tests:
    name: "Run a test on our runner from a PR"
-    runs-on:
-      group: aws-g4dn-2xlarge
+    runs-on: [single-gpu, nvidia-gpu, t4, ci]
    container:
      image: ${{ github.event.inputs.docker_image }}
      options: --gpus 0 --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@@ -35,19 +34,19 @@ jobs:
    steps:
      - name: Validate test files input
        id: validate_test_files
-        env:
+        env: 
          PY_TEST: ${{ github.event.inputs.test }}
        run: |
          if [[ ! "$PY_TEST" =~ ^tests/ ]]; then
            echo "Error: The input string must start with 'tests/'."
            exit 1
          fi
-
+          
          if [[ ! "$PY_TEST" =~ ^tests/(models|pipelines) ]]; then
            echo "Error: The input string must contain either 'models' or 'pipelines' after 'tests/'."
            exit 1
          fi
-
+          
          if [[ "$PY_TEST" == *";"* ]]; then
            echo "Error: The input string must not contain ';'."
            exit 1
@@ -61,14 +60,14 @@ jobs:
          repository: ${{ github.event.pull_request.head.repo.full_name }}


-      - name: Install pytest
-        run: |
+      - name: Install pytest 
+        run: | 
          python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
          python -m uv pip install -e [quality,test]
          python -m uv pip install peft
-
+      
      - name: Run tests
-        env:
+        env: 
            PY_TEST: ${{ github.event.inputs.test }}
        run: |
-          pytest "$PY_TEST"
+          pytest "$PY_TEST"
--- a/.github/workflows/ssh-pr-runner.yml
+++ b/.github/workflows/ssh-pr-runner.yml
@@ -1,40 +0,0 @@
-name: SSH into PR runners
-
-on:
-  workflow_dispatch:
-    inputs:
-      docker_image:
-        description: 'Name of the Docker image'
-        required: true
-
-env:
-  IS_GITHUB_CI: "1"
-  HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
-  HF_HOME: /mnt/cache
-  DIFFUSERS_IS_CI: yes
-  OMP_NUM_THREADS: 8
-  MKL_NUM_THREADS: 8
-  RUN_SLOW: yes
-
-jobs:
-  ssh_runner:
-    name: "SSH"
-    runs-on:
-      group: aws-highmemory-32-plus
-    container:
-      image: ${{ github.event.inputs.docker_image }}
-      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --privileged
-
-    steps:
-      - name: Checkout diffusers
-        uses: actions/checkout@v3
-        with:
-          fetch-depth: 2
-
-      - name: Tailscale # In order to be able to SSH when a test fails
-        uses: huggingface/tailscale-action@main
-        with:
-          authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
-          slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
-          slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-          waitForSSH: true
--- a/.github/workflows/ssh-runner.yml
+++ b/.github/workflows/ssh-runner.yml
@@ -1,4 +1,4 @@
-name: SSH into GPU runners
+name: SSH into runners

 on:
  workflow_dispatch:
@@ -22,8 +22,7 @@ env:
 jobs:
  ssh_runner:
    name: "SSH"
-    runs-on:
-      group: "${{ github.event.inputs.runner_type }}"
+    runs-on: [single-gpu, nvidia-gpu, "${{ github.event.inputs.runner_type }}", ci]
    container:
      image: ${{ github.event.inputs.docker_image }}
      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged
--- a/.github/workflows/trufflehog.yml
+++ b/.github/workflows/trufflehog.yml
@@ -1,15 +0,0 @@
-on:
-  push:
-
-name: Secret Leaks
-
-jobs:
-  trufflehog:
-    runs-on: ubuntu-latest
-    steps:
-    - name: Checkout code
-      uses: actions/checkout@v4
-      with:
-        fetch-depth: 0
-    - name: Secret Scanning
-      uses: trufflesecurity/trufflehog@main
--- a/.gitignore
+++ b/.gitignore
@@ -175,4 +175,4 @@ tags
 .ruff_cache

 # wandb
-wandb
+wandb
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -63,7 +63,7 @@ In the same spirit, you are of immense help to the community by answering such q

 **Please** keep in mind that the more effort you put into asking or answering a question, the higher
 the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database.
-In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
+In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formated/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.

 **NOTE about channels**:
 [*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago.
@@ -245,7 +245,7 @@ The official training examples are maintained by the Diffusers' core maintainers
 This is because of the same reasons put forward in [6. Contribute a community pipeline](#6-contribute-a-community-pipeline) for official pipelines vs. community pipelines: It is not feasible for the core maintainers to maintain all possible training methods for diffusion models.
 If the Diffusers core maintainers and the community consider a certain training paradigm to be too experimental or not popular enough, the corresponding training code should be put in the `research_projects` folder and maintained by the author.

-Both official training and research examples consist of a directory that contains one or more training scripts, a `requirements.txt` file, and a `README.md` file. In order for the user to make use of the
+Both official training and research examples consist of a directory that contains one or more training scripts, a requirements.txt file, and a README.md file. In order for the user to make use of the
 training examples, it is required to clone the repository:

 ```bash
@@ -255,8 +255,7 @@ git clone https://github.com/huggingface/diffusers
 as well as to install all additional dependencies required for training:

 ```bash
-cd diffusers
-pip install -r examples/<your-example-folder>/requirements.txt
+pip install -r /examples/<your-example-folder>/requirements.txt
 ```

 Therefore when adding an example, the `requirements.txt` file shall define all pip dependencies required for your training example so that once all those are installed, the user can run the example's training script. See, for example, the [DreamBooth `requirements.txt` file](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/requirements.txt).
@@ -503,4 +502,4 @@ $ git push --set-upstream origin your-branch-for-syncing

 ### Style guide

-For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
+For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
--- a/PHILOSOPHY.md
+++ b/PHILOSOPHY.md
@@ -63,14 +63,14 @@ Let's walk through more detailed design decisions for each class.
 Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference.

 The following design principles are followed:
- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [# Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
+- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [#Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
 - Pipelines all inherit from [`DiffusionPipeline`].
 - Every pipeline consists of different model and scheduler components, that are documented in the [`model_index.json` file](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline and can be shared between pipelines with [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function.
 - Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function.
 - Pipelines should be used **only** for inference.
 - Pipelines should be very readable, self-explanatory, and easy to tweak.
 - Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs.
- Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
+- Pipelines are **not** intended to be feature-complete user interfaces. For future complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
 - Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines.
 - Pipelines should be named after the task they are intended to solve.
 - In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file.
@@ -81,7 +81,7 @@ Models are designed as configurable toolboxes that are natural extensions of [Py

 The following design principles are followed:
 - Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc...
+- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py), [`transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformer_2d.py), etc...
 - Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
 - Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages.
 - Models all inherit from `ModelMixin` and `ConfigMixin`.
@@ -90,7 +90,7 @@ The following design principles are followed:
 - To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
 - Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
 - The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
-readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
+readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

 ### Schedulers

@@ -100,11 +100,11 @@ The following design principles are followed:
 - All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
 - Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
 - One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
- If schedulers share similar functionalities, we can make use of the `# Copied from` mechanism.
+- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
 - Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
 - Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./docs/source/en/using-diffusers/schedulers.md).
 - Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
 - Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
 - The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
 - Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box".
- In almost all cases, novel schedulers shall be implemented in a new scheduling file.
+- In almost all cases, novel schedulers shall be implemented in a new scheduling file.
--- a/README.md
+++ b/README.md
@@ -20,11 +20,21 @@ limitations under the License.
    <br>
 <p>
 <p align="center">
-    <a href="https://github.com/huggingface/diffusers/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue"></a>
-    <a href="https://github.com/huggingface/diffusers/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg"></a>
-    <a href="https://pepy.tech/project/diffusers"><img alt="GitHub release" src="https://static.pepy.tech/badge/diffusers/month"></a>
-    <a href="CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg"></a>
-    <a href="https://twitter.com/diffuserslib"><img alt="X account" src="https://img.shields.io/twitter/url/https/twitter.com/diffuserslib.svg?style=social&label=Follow%20%40diffuserslib"></a>
+    <a href="https://github.com/huggingface/diffusers/blob/main/LICENSE">
+        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue">
+    </a>
+    <a href="https://github.com/huggingface/diffusers/releases">
+        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
+    </a>
+    <a href="https://pepy.tech/project/diffusers">
+        <img alt="GitHub release" src="https://static.pepy.tech/badge/diffusers/month">
+    </a>
+    <a href="CODE_OF_CONDUCT.md">
+        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg">
+    </a>
+    <a href="https://twitter.com/diffuserslib">
+        <img alt="X account" src="https://img.shields.io/twitter/url/https/twitter.com/diffuserslib.svg?style=social&label=Follow%20%40diffuserslib">
+    </a>
 </p>

 🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://huggingface.co/docs/diffusers/conceptual/philosophy#usability-over-performance), [simple over easy](https://huggingface.co/docs/diffusers/conceptual/philosophy#simple-over-easy), and [customizability over abstractions](https://huggingface.co/docs/diffusers/conceptual/philosophy#tweakable-contributorfriendly-over-abstraction).
@@ -67,7 +77,7 @@ Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggi

 ## Quickstart

-Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 30,000+ checkpoints):
+Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 25.000+ checkpoints):

 ```python
 from diffusers import DiffusionPipeline
@@ -209,7 +219,7 @@ Also, say 👋 in our public Discord channel <a href="https://discord.gg/G7tWnz9
 - https://github.com/deep-floyd/IF
 - https://github.com/bentoml/BentoML
 - https://github.com/bmaltais/kohya_ss
- +14,000 other amazing GitHub repositories 💪
+- +11.000 other amazing GitHub repositories 💪

 Thank you for using us ❤️.

--- a/benchmarks/run_all.py
+++ b/benchmarks/run_all.py
@@ -40,7 +40,7 @@ def main():
        print(f"****** Running file: {file} ******")

        # Run with canonical settings.
-        if file != "benchmark_text_to_image.py" and file != "benchmark_ip_adapters.py":
+        if file != "benchmark_text_to_image.py":
            command = f"python {file}"
            run_command(command.split())

@@ -49,10 +49,6 @@ def main():

    # Run variants.
    for file in python_files:
-        # See: https://github.com/pytorch/pytorch/issues/129637
-        if file == "benchmark_ip_adapters.py":
-            continue
-
        if file == "benchmark_text_to_image.py":
            for ckpt in ALL_T2I_CKPTS:
                command = f"python {file} --ckpt {ckpt}"
--- a/docker/diffusers-doc-builder/Dockerfile
+++ b/docker/diffusers-doc-builder/Dockerfile
@@ -42,7 +42,7 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        huggingface-hub \
        Jinja2 \
        librosa \
-        numpy==1.26.4 \
+        numpy \
        scipy \
        tensorboard \
        transformers \
--- a/docker/diffusers-flax-cpu/Dockerfile
+++ b/docker/diffusers-flax-cpu/Dockerfile
@@ -40,7 +40,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        huggingface-hub \
        Jinja2 \
        librosa \
-        numpy==1.26.4 \
+        numpy \
        scipy \
        tensorboard \
        transformers
--- a/docker/diffusers-flax-tpu/Dockerfile
+++ b/docker/diffusers-flax-tpu/Dockerfile
@@ -41,8 +41,8 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        hf-doc-builder \
        huggingface-hub \
        Jinja2 \
-        librosa \
-        numpy==1.26.4 \
+        librosa \        
+        numpy \
        scipy \
        tensorboard \
        transformers
--- a/docker/diffusers-onnxruntime-cpu/Dockerfile
+++ b/docker/diffusers-onnxruntime-cpu/Dockerfile
@@ -40,7 +40,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        huggingface-hub \
        Jinja2 \
        librosa \
-        numpy==1.26.4 \
+        numpy \
        scipy \
        tensorboard \
        transformers
--- a/docker/diffusers-onnxruntime-cuda/Dockerfile
+++ b/docker/diffusers-onnxruntime-cuda/Dockerfile
@@ -38,10 +38,9 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        datasets \
        hf-doc-builder \
        huggingface-hub \
-        hf_transfer \
        Jinja2 \
        librosa \
-        numpy==1.26.4 \
+        numpy \
        scipy \
        tensorboard \
        transformers
--- a/docker/diffusers-pytorch-compile-cuda/Dockerfile
+++ b/docker/diffusers-pytorch-compile-cuda/Dockerfile
@@ -17,7 +17,6 @@ RUN apt install -y bash \
    libsndfile1-dev \
    libgl1 \
    python3.10 \
-    python3.10-dev \
    python3-pip \
    python3.10-venv && \
    rm -rf /var/lib/apt/lists
@@ -38,10 +37,9 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
    datasets \
    hf-doc-builder \
    huggingface-hub \
-    hf_transfer \
    Jinja2 \
    librosa \
-    numpy==1.26.4 \
+    numpy \
    scipy \
    tensorboard \
    transformers
--- a/docker/diffusers-pytorch-cpu/Dockerfile
+++ b/docker/diffusers-pytorch-cpu/Dockerfile
@@ -16,7 +16,6 @@ RUN apt install -y bash \
                   ca-certificates \
                   libsndfile1-dev \
                   python3.10 \
-                   python3.10-dev \
                   python3-pip \
                   libgl1 \
                   python3.10-venv && \
@@ -41,7 +40,7 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        huggingface-hub \
        Jinja2 \
        librosa \
-        numpy==1.26.4 \
+        numpy \
        scipy \
        tensorboard \
        transformers matplotlib
--- a/docker/diffusers-pytorch-cuda/Dockerfile
+++ b/docker/diffusers-pytorch-cuda/Dockerfile
@@ -17,7 +17,6 @@ RUN apt install -y bash \
    libsndfile1-dev \
    libgl1 \
    python3.10 \
-    python3.10-dev \
    python3-pip \
    python3.10-venv && \
    rm -rf /var/lib/apt/lists
@@ -38,10 +37,9 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
    datasets \
    hf-doc-builder \
    huggingface-hub \
-    hf_transfer \
    Jinja2 \
    librosa \
-    numpy==1.26.4 \
+    numpy \
    scipy \
    tensorboard \
    transformers \
--- a/docker/diffusers-pytorch-xformers-cuda/Dockerfile
+++ b/docker/diffusers-pytorch-xformers-cuda/Dockerfile
@@ -17,7 +17,6 @@ RUN apt install -y bash \
                   libsndfile1-dev \
                   libgl1 \
                   python3.10 \
-                   python3.10-dev \
                   python3-pip \
                   python3.10-venv && \
    rm -rf /var/lib/apt/lists
@@ -38,10 +37,9 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        datasets \
        hf-doc-builder \
        huggingface-hub \
-        hf_transfer \
        Jinja2 \
        librosa \
-        numpy==1.26.4 \
+        numpy \
        scipy \
        tensorboard \
        transformers \
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -21,8 +21,6 @@
    title: Load LoRAs for inference
  - local: tutorials/fast_diffusion
    title: Accelerate inference of text-to-image diffusion models
-  - local: tutorials/inference_with_big_models
-    title: Working with big models
  title: Tutorials
 - sections:
  - local: using-diffusers/loading
@@ -83,8 +81,6 @@
    title: Kandinsky
  - local: using-diffusers/ip_adapter
    title: IP-Adapter
-  - local: using-diffusers/pag
-    title: PAG
  - local: using-diffusers/controlnet
    title: ControlNet
  - local: using-diffusers/t2i_adapter
@@ -111,8 +107,7 @@
    title: Create a dataset for training
  - local: training/adapt_a_model
    title: Adapt a model to a new task
-  - isExpanded: false
-    sections:
+  - sections:
    - local: training/unconditional_training
      title: Unconditional image generation
    - local: training/text2image
@@ -130,8 +125,8 @@
    - local: training/instructpix2pix
      title: InstructPix2Pix
    title: Models
-  - isExpanded: false
-    sections:
+    isExpanded: false
+  - sections:
    - local: training/text_inversion
      title: Textual Inversion
    - local: training/dreambooth
@@ -145,6 +140,7 @@
    - local: training/ddpo
      title: Reinforcement learning training with DDPO
    title: Methods
+    isExpanded: false
  title: Training
 - sections:
  - local: optimization/fp16
@@ -191,12 +187,7 @@
    title: Evaluating Diffusion Models
  title: Conceptual Guides
 - sections:
-  - local: community_projects
-    title: Projects built with Diffusers
-  title: Community Projects
- sections:
-  - isExpanded: false
-    sections:
+  - sections:
    - local: api/configuration
      title: Configuration
    - local: api/logging
@@ -204,8 +195,8 @@
    - local: api/outputs
      title: Outputs
    title: Main Classes
-  - isExpanded: false
-    sections:
+    isExpanded: false
+  - sections:
    - local: api/loaders/ip_adapter
      title: IP-Adapter
    - local: api/loaders/lora
@@ -219,8 +210,8 @@
    - local: api/loaders/peft
      title: PEFT
    title: Loaders
-  - isExpanded: false
-    sections:
+    isExpanded: false
+  - sections:
    - local: api/models/overview
      title: Overview
    - local: api/models/unet
@@ -239,16 +230,10 @@
      title: VQModel
    - local: api/models/autoencoderkl
      title: AutoencoderKL
-    - local: api/models/autoencoderkl_cogvideox
-      title: AutoencoderKLCogVideoX
    - local: api/models/asymmetricautoencoderkl
      title: AsymmetricAutoencoderKL
-    - local: api/models/stable_cascade_unet
-      title: StableCascadeUNet
    - local: api/models/autoencoder_tiny
      title: Tiny AutoEncoder
-    - local: api/models/autoencoder_oobleck
-      title: Oobleck AutoEncoder
    - local: api/models/consistency_decoder_vae
      title: ConsistencyDecoderVAE
    - local: api/models/transformer2d
@@ -259,35 +244,15 @@
      title: DiTTransformer2DModel
    - local: api/models/hunyuan_transformer2d
      title: HunyuanDiT2DModel
-    - local: api/models/aura_flow_transformer2d
-      title: AuraFlowTransformer2DModel
-    - local: api/models/flux_transformer
-      title: FluxTransformer2DModel
-    - local: api/models/latte_transformer3d
-      title: LatteTransformer3DModel
-    - local: api/models/cogvideox_transformer3d
-      title: CogVideoXTransformer3DModel
-    - local: api/models/lumina_nextdit2d
-      title: LuminaNextDiT2DModel
    - local: api/models/transformer_temporal
      title: TransformerTemporalModel
-    - local: api/models/sd3_transformer2d
-      title: SD3Transformer2DModel
-    - local: api/models/stable_audio_transformer
-      title: StableAudioDiTModel
    - local: api/models/prior_transformer
      title: PriorTransformer
    - local: api/models/controlnet
      title: ControlNetModel
-    - local: api/models/controlnet_hunyuandit
-      title: HunyuanDiT2DControlNetModel
-    - local: api/models/controlnet_sd3
-      title: SD3ControlNetModel
-    - local: api/models/controlnet_sparsectrl
-      title: SparseControlNetModel
    title: Models
-  - isExpanded: false
-    sections:
+    isExpanded: false
+  - sections:
    - local: api/pipelines/overview
      title: Overview
    - local: api/pipelines/amused
@@ -300,22 +265,14 @@
      title: AudioLDM
    - local: api/pipelines/audioldm2
      title: AudioLDM 2
-    - local: api/pipelines/aura_flow
-      title: AuraFlow
    - local: api/pipelines/auto_pipeline
      title: AutoPipeline
    - local: api/pipelines/blip_diffusion
      title: BLIP-Diffusion
-    - local: api/pipelines/cogvideox
-      title: CogVideoX
    - local: api/pipelines/consistency_models
      title: Consistency Models
    - local: api/pipelines/controlnet
      title: ControlNet
-    - local: api/pipelines/controlnet_hunyuandit
-      title: ControlNet with Hunyuan-DiT
-    - local: api/pipelines/controlnet_sd3
-      title: ControlNet with Stable Diffusion 3
    - local: api/pipelines/controlnet_sdxl
      title: ControlNet with Stable Diffusion XL
    - local: api/pipelines/controlnetxs
@@ -334,8 +291,6 @@
      title: DiffEdit
    - local: api/pipelines/dit
      title: DiT
-    - local: api/pipelines/flux
-      title: Flux
    - local: api/pipelines/hunyuandit
      title: Hunyuan-DiT
    - local: api/pipelines/i2vgenxl
@@ -348,26 +303,18 @@
      title: Kandinsky 2.2
    - local: api/pipelines/kandinsky3
      title: Kandinsky 3
-    - local: api/pipelines/kolors
-      title: Kolors
    - local: api/pipelines/latent_consistency_models
      title: Latent Consistency Models
    - local: api/pipelines/latent_diffusion
      title: Latent Diffusion
-    - local: api/pipelines/latte
-      title: Latte
    - local: api/pipelines/ledits_pp
      title: LEDITS++
-    - local: api/pipelines/lumina
-      title: Lumina-T2X
    - local: api/pipelines/marigold
      title: Marigold
    - local: api/pipelines/panorama
      title: MultiDiffusion
    - local: api/pipelines/musicldm
      title: MusicLDM
-    - local: api/pipelines/pag
-      title: PAG
    - local: api/pipelines/paint_by_example
      title: Paint by Example
    - local: api/pipelines/pia
@@ -382,8 +329,6 @@
      title: Semantic Guidance
    - local: api/pipelines/shap_e
      title: Shap-E
-    - local: api/pipelines/stable_audio
-      title: Stable Audio
    - local: api/pipelines/stable_cascade
      title: Stable Cascade
    - sections:
@@ -405,8 +350,6 @@
        title: Safe Stable Diffusion
      - local: api/pipelines/stable_diffusion/stable_diffusion_2
        title: Stable Diffusion 2
-      - local: api/pipelines/stable_diffusion/stable_diffusion_3
-        title: Stable Diffusion 3
      - local: api/pipelines/stable_diffusion/stable_diffusion_xl
        title: Stable Diffusion XL
      - local: api/pipelines/stable_diffusion/sdxl_turbo
@@ -439,16 +382,14 @@
    - local: api/pipelines/wuerstchen
      title: Wuerstchen
    title: Pipelines
-  - isExpanded: false
-    sections:
+    isExpanded: false
+  - sections:
    - local: api/schedulers/overview
      title: Overview
    - local: api/schedulers/cm_stochastic_iterative
      title: CMStochasticIterativeScheduler
    - local: api/schedulers/consistency_decoder
      title: ConsistencyDecoderScheduler
-    - local: api/schedulers/cosine_dpm
-      title: CosineDPMSolverMultistepScheduler
    - local: api/schedulers/ddim_inverse
      title: DDIMInverseScheduler
    - local: api/schedulers/ddim
@@ -473,10 +414,6 @@
      title: EulerAncestralDiscreteScheduler
    - local: api/schedulers/euler
      title: EulerDiscreteScheduler
-    - local: api/schedulers/flow_match_euler_discrete
-      title: FlowMatchEulerDiscreteScheduler
-    - local: api/schedulers/flow_match_heun_discrete
-      title: FlowMatchHeunDiscreteScheduler
    - local: api/schedulers/heun
      title: HeunDiscreteScheduler
    - local: api/schedulers/ipndm
@@ -506,8 +443,8 @@
    - local: api/schedulers/vq_diffusion
      title: VQDiffusionScheduler
    title: Schedulers
-  - isExpanded: false
-    sections:
+    isExpanded: false
+  - sections:
    - local: api/internal_classes_overview
      title: Overview
    - local: api/attnprocessor
@@ -523,4 +460,5 @@
    - local: api/video_processor
      title: Video Processor
    title: Internal classes
+    isExpanded: false
  title: API
--- a/docs/source/en/api/attnprocessor.md
+++ b/docs/source/en/api/attnprocessor.md
@@ -41,6 +41,12 @@ An attention processor is a class for applying different types of attention mech
 ## FusedAttnProcessor2_0
 [[autodoc]] models.attention_processor.FusedAttnProcessor2_0

+## LoRAAttnAddedKVProcessor
+[[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor
+
+## LoRAXFormersAttnProcessor
+[[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor
+
 ## SlicedAttnProcessor
 [[autodoc]] models.attention_processor.SlicedAttnProcessor

--- a/docs/source/en/api/loaders/lora.md
+++ b/docs/source/en/api/loaders/lora.md
@@ -12,13 +12,10 @@ specific language governing permissions and limitations under the License.

 # LoRA

-LoRA is a fast and lightweight training method that inserts and trains a significantly smaller number of parameters instead of all the model parameters. This produces a smaller file (~100 MBs) and makes it easier to quickly train a model to learn a new concept. LoRA weights are typically loaded into the denoiser, text encoder or both. The denoiser usually corresponds to a UNet ([`UNet2DConditionModel`], for example) or a Transformer ([`SD3Transformer2DModel`], for example). There are several classes for loading LoRA weights:
+LoRA is a fast and lightweight training method that inserts and trains a significantly smaller number of parameters instead of all the model parameters. This produces a smaller file (~100 MBs) and makes it easier to quickly train a model to learn a new concept. LoRA weights are typically loaded into the UNet, text encoder or both. There are two classes for loading LoRA weights:

-- [`StableDiffusionLoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model.
-- [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`StableDiffusionLoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model.
-- [`SD3LoraLoaderMixin`] provides similar functions for [Stable Diffusion 3](https://huggingface.co/blog/sd3).
-- [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`].
-- [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, and unload LoRAs, and more.
+- [`LoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model.
+- [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`LoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model.

 <Tip>

@@ -26,22 +23,10 @@ To learn more about how to load LoRA weights, see the [LoRA](../../using-diffuse

 </Tip>

-## StableDiffusionLoraLoaderMixin
+## LoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.StableDiffusionLoraLoaderMixin
+[[autodoc]] loaders.lora.LoraLoaderMixin

 ## StableDiffusionXLLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.StableDiffusionXLLoraLoaderMixin
-
-## SD3LoraLoaderMixin
-
-[[autodoc]] loaders.lora_pipeline.SD3LoraLoaderMixin
-
-## AmusedLoraLoaderMixin
-
-[[autodoc]] loaders.lora_pipeline.AmusedLoraLoaderMixin
-
-## LoraBaseMixin
-
-[[autodoc]] loaders.lora_base.LoraBaseMixin
+[[autodoc]] loaders.lora.StableDiffusionXLLoraLoaderMixin
--- a/docs/source/en/api/loaders/peft.md
+++ b/docs/source/en/api/loaders/peft.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # PEFT

-Diffusers supports loading adapters such as [LoRA](../../using-diffusers/loading_adapters) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`], [`SD3Transformer2DModel`] to operate with an adapter.
+Diffusers supports loading adapters such as [LoRA](../../using-diffusers/loading_adapters) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`] to load an adapter.

 <Tip>

--- a/docs/source/en/api/loaders/single_file.md
+++ b/docs/source/en/api/loaders/single_file.md
@@ -22,7 +22,6 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:

 ## Supported pipelines

-- [`CogVideoXPipeline`]
 - [`StableDiffusionPipeline`]
 - [`StableDiffusionImg2ImgPipeline`]
 - [`StableDiffusionInpaintPipeline`]
@@ -36,7 +35,6 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
 - [`StableDiffusionXLInstructPix2PixPipeline`]
 - [`StableDiffusionXLControlNetPipeline`]
 - [`StableDiffusionXLKDiffusionPipeline`]
-- [`StableDiffusion3Pipeline`]
 - [`LatentConsistencyModelPipeline`]
 - [`LatentConsistencyModelImg2ImgPipeline`]
 - [`StableDiffusionControlNetXSPipeline`]
@@ -50,10 +48,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
 - [`UNet2DConditionModel`]
 - [`StableCascadeUNet`]
 - [`AutoencoderKL`]
-- [`AutoencoderKLCogVideoX`]
 - [`ControlNetModel`]
-- [`SD3Transformer2DModel`]
-- [`FluxTransformer2DModel`]

 ## FromSingleFileMixin

--- a/docs/source/en/api/loaders/unet.md
+++ b/docs/source/en/api/loaders/unet.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # UNet

-Some training methods - like LoRA and Custom Diffusion - typically target the UNet's attention layers, but these training methods can also target other non-attention layers. Instead of training all of a model's parameters, only a subset of the parameters are trained, which is faster and more efficient. This class is useful if you're *only* loading weights into a UNet. If you need to load weights into the text encoder or a text encoder and UNet, try using the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] function instead.
+Some training methods - like LoRA and Custom Diffusion - typically target the UNet's attention layers, but these training methods can also target other non-attention layers. Instead of training all of a model's parameters, only a subset of the parameters are trained, which is faster and more efficient. This class is useful if you're *only* loading weights into a UNet. If you need to load weights into the text encoder or a text encoder and UNet, try using the [`~loaders.LoraLoaderMixin.load_lora_weights`] function instead.

 The [`UNet2DConditionLoadersMixin`] class provides functions for loading and saving weights, fusing and unfusing LoRAs, disabling and enabling LoRAs, and setting and deleting adapters.

--- a/docs/source/en/api/models/aura_flow_transformer2d.md
+++ b/docs/source/en/api/models/aura_flow_transformer2d.md
@@ -1,19 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# AuraFlowTransformer2DModel
-
-A Transformer model for image-like data from [AuraFlow](https://blog.fal.ai/auraflow/).
-
-## AuraFlowTransformer2DModel
-
-[[autodoc]] AuraFlowTransformer2DModel
--- a/docs/source/en/api/models/autoencoder_oobleck.md
+++ b/docs/source/en/api/models/autoencoder_oobleck.md
@@ -1,38 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# AutoencoderOobleck
-
-The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms.
-
-The abstract from the paper is:
-
-*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*
-
-## AutoencoderOobleck
-
-[[autodoc]] AutoencoderOobleck
-    - decode
-    - encode
-    - all
-
-## OobleckDecoderOutput
-
-[[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput
-
-## OobleckDecoderOutput
-
-[[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput
-
-## AutoencoderOobleckOutput
-
-[[autodoc]] models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput
--- a/docs/source/en/api/models/autoencoderkl.md
+++ b/docs/source/en/api/models/autoencoderkl.md
@@ -21,7 +21,7 @@ The abstract from the paper is:
 ## Loading from the original format

 By default the [`AutoencoderKL`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
-from the original format using [`FromOriginalModelMixin.from_single_file`] as follows:
+from the original format using [`FromOriginalVAEMixin.from_single_file`] as follows:

 ```py
 from diffusers import AutoencoderKL
--- a/docs/source/en/api/models/autoencoderkl_cogvideox.md
+++ b/docs/source/en/api/models/autoencoderkl_cogvideox.md
@@ -1,37 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License. -->
-
-# AutoencoderKLCogVideoX
-
-The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.
-
-The model can be loaded with the following code snippet.
-
-```python
-from diffusers import AutoencoderKLCogVideoX
-
-vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda")
-```
-
-## AutoencoderKLCogVideoX
-
-[[autodoc]] AutoencoderKLCogVideoX
-    - decode
-    - encode
-    - all
-
-## AutoencoderKLOutput
-
-[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
-
-## DecoderOutput
-
-[[autodoc]] models.autoencoders.vae.DecoderOutput
--- a/docs/source/en/api/models/cogvideox_transformer3d.md
+++ b/docs/source/en/api/models/cogvideox_transformer3d.md
@@ -1,30 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License. -->
-
-# CogVideoXTransformer3DModel
-
-A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.
-
-The model can be loaded with the following code snippet.
-
-```python
-from diffusers import CogVideoXTransformer3DModel
-
-transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
-```
-
-## CogVideoXTransformer3DModel
-
-[[autodoc]] CogVideoXTransformer3DModel
-
-## Transformer2DModelOutput
-
-[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
--- a/docs/source/en/api/models/controlnet.md
+++ b/docs/source/en/api/models/controlnet.md
@@ -21,7 +21,7 @@ The abstract from the paper is:
 ## Loading from the original format

 By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
-from the original format using [`FromOriginalModelMixin.from_single_file`] as follows:
+from the original format using [`FromOriginalControlnetMixin.from_single_file`] as follows:

 ```py
 from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
--- a/docs/source/en/api/models/controlnet_hunyuandit.md
+++ b/docs/source/en/api/models/controlnet_hunyuandit.md
@@ -1,37 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team and Tencent Hunyuan Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# HunyuanDiT2DControlNetModel
-
-HunyuanDiT2DControlNetModel is an implementation of ControlNet for [Hunyuan-DiT](https://arxiv.org/abs/2405.08748).
-
-ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
-
-With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
-
-The abstract from the paper is:
-
-*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
-
-This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan).
-
-## Example For Loading HunyuanDiT2DControlNetModel
-
-```py
-from diffusers import HunyuanDiT2DControlNetModel
-import torch
-controlnet = HunyuanDiT2DControlNetModel.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Pose", torch_dtype=torch.float16)
-```
-
-## HunyuanDiT2DControlNetModel
-
-[[autodoc]] HunyuanDiT2DControlNetModel
--- a/docs/source/en/api/models/controlnet_sd3.md
+++ b/docs/source/en/api/models/controlnet_sd3.md
@@ -1,42 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team and The InstantX Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# SD3ControlNetModel
-
-SD3ControlNetModel is an implementation of ControlNet for Stable Diffusion 3.
-
-The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.
-
-The abstract from the paper is:
-
-*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
-
-## Loading from the original format
-
-By default the [`SD3ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`].
-
-```py
-from diffusers import StableDiffusion3ControlNetPipeline
-from diffusers.models import SD3ControlNetModel, SD3MultiControlNetModel
-
-controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny")
-pipe = StableDiffusion3ControlNetPipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet)
-```
-
-## SD3ControlNetModel
-
-[[autodoc]] SD3ControlNetModel
-
-## SD3ControlNetOutput
-
-[[autodoc]] models.controlnet_sd3.SD3ControlNetOutput
-
--- a/docs/source/en/api/models/controlnet_sparsectrl.md
+++ b/docs/source/en/api/models/controlnet_sparsectrl.md
@@ -1,46 +0,0 @@
-<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License. -->
-
-# SparseControlNetModel
-
-SparseControlNetModel is an implementation of ControlNet for [AnimateDiff](https://arxiv.org/abs/2307.04725).
-
-ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
-
-The SparseCtrl version of ControlNet was introduced in [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai.
-
-The abstract from the paper is:
-
-*The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).*
-
-## Example for loading SparseControlNetModel
-
-```python
-import torch
-from diffusers import SparseControlNetModel
-
-# fp32 variant in float16
-# 1. Scribble checkpoint
-controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16)
-
-# 2. RGB checkpoint
-controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-rgb", torch_dtype=torch.float16)
-
-# For loading fp16 variant, pass `variant="fp16"` as an additional parameter
-```
-
-## SparseControlNetModel
-
-[[autodoc]] SparseControlNetModel
-
-## SparseControlNetOutput
-
-[[autodoc]] models.controlnet_sparsectrl.SparseControlNetOutput
--- a/docs/source/en/api/models/flux_transformer.md
+++ b/docs/source/en/api/models/flux_transformer.md
@@ -1,19 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# FluxTransformer2DModel
-
-A Transformer model for image-like data from [Flux](https://blackforestlabs.ai/announcing-black-forest-labs/).
-
-## FluxTransformer2DModel
-
-[[autodoc]] FluxTransformer2DModel
--- a/docs/source/en/api/models/latte_transformer3d.md
+++ b/docs/source/en/api/models/latte_transformer3d.md
@@ -1,19 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-## LatteTransformer3DModel
-
-A Diffusion Transformer model for 3D data from [Latte](https://github.com/Vchitect/Latte).
-
-## LatteTransformer3DModel
-
-[[autodoc]] LatteTransformer3DModel
--- a/docs/source/en/api/models/lumina_nextdit2d.md
+++ b/docs/source/en/api/models/lumina_nextdit2d.md
@@ -1,20 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# LuminaNextDiT2DModel
-
-A Next Version of Diffusion Transformer model for 2D data from [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X).
-
-## LuminaNextDiT2DModel
-
-[[autodoc]] LuminaNextDiT2DModel
-
--- a/docs/source/en/api/models/pixart_transformer2d.md
+++ b/docs/source/en/api/models/pixart_transformer2d.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # PixArtTransformer2DModel

-A Transformer model for image-like data from [PixArt-Alpha](https://huggingface.co/papers/2310.00426) and [PixArt-Sigma](https://huggingface.co/papers/2403.04692).
+A Transformer model for image-like data from [PixArt-Alpha](https://huggingface.co/papers/2310.00426) and [PixArt-Sigma](https://huggingface.co/papers/2403.04692). 

 ## PixArtTransformer2DModel

--- a/docs/source/en/api/models/sd3_transformer2d.md
+++ b/docs/source/en/api/models/sd3_transformer2d.md
@@ -1,19 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# SD3 Transformer Model
-
-The Transformer model introduced in [Stable Diffusion 3](https://hf.co/papers/2403.03206). Its novelty lies in the MMDiT transformer block.
-
-## SD3Transformer2DModel
-
-[[autodoc]] SD3Transformer2DModel
--- a/docs/source/en/api/models/stable_audio_transformer.md
+++ b/docs/source/en/api/models/stable_audio_transformer.md
@@ -1,19 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# StableAudioDiTModel
-
-A Transformer model for audio waveforms from [Stable Audio Open](https://huggingface.co/papers/2407.14358).
-
-## StableAudioDiTModel
-
-[[autodoc]] StableAudioDiTModel
--- a/docs/source/en/api/models/stable_cascade_unet.md
+++ b/docs/source/en/api/models/stable_cascade_unet.md
@@ -1,19 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# StableCascadeUNet
-
-A UNet model from the [Stable Cascade pipeline](../pipelines/stable_cascade.md).
-
-## StableCascadeUNet
-
-[[autodoc]] models.unets.unet_stable_cascade.StableCascadeUNet
--- a/docs/source/en/api/models/transformer2d.md
+++ b/docs/source/en/api/models/transformer2d.md
@@ -38,4 +38,4 @@ It is assumed one of the input classes is the masked latent pixel. The predicted

 ## Transformer2DModelOutput

-[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
+[[autodoc]] models.transformers.transformer_2d.Transformer2DModelOutput
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -25,9 +25,6 @@ The abstract of the paper is the following:
 | Pipeline | Tasks | Demo
 |---|---|:---:|
 | [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* |
-| [AnimateDiffControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_controlnet.py) | *Controlled Video-to-Video Generation with AnimateDiff using ControlNet* |
-| [AnimateDiffSparseControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sparsectrl.py) | *Controlled Video-to-Video Generation with AnimateDiff using SparseCtrl* |
-| [AnimateDiffSDXLPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sdxl.py) | *Video-to-Video Generation with AnimateDiff* |
 | [AnimateDiffVideoToVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py) | *Video-to-Video Generation with AnimateDiff* |

 ## Available checkpoints
@@ -81,6 +78,7 @@ output = pipe(
 )
 frames = output.frames[0]
 export_to_gif(frames, "animation.gif")
+
 ```

 Here are some sample outputs:
@@ -103,266 +101,6 @@ AnimateDiff tends to work better with finetuned Stable Diffusion models. If you

 </Tip>

-### AnimateDiffControlNetPipeline
-
-AnimateDiff can also be used with ControlNets. ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide depth maps, the ControlNet model generates a video that'll preserve the spatial information from the depth maps. It is a more flexible and accurate way to control the video generation process.
-
-```python
-import torch
-from diffusers import AnimateDiffControlNetPipeline, AutoencoderKL, ControlNetModel, MotionAdapter, LCMScheduler
-from diffusers.utils import export_to_gif, load_video
-
-# Additionally, you will need to preprocess videos before they can be used with the ControlNet
-# HF maintains just the right package for it: `pip install controlnet_aux`
-from controlnet_aux.processor import ZoeDetector
-
-# Download controlnets from https://huggingface.co/lllyasviel/ControlNet-v1-1 to use .from_single_file
-# Download Diffusers-format controlnets, such as https://huggingface.co/lllyasviel/sd-controlnet-depth, to use .from_pretrained()
-controlnet = ControlNetModel.from_single_file("control_v11f1p_sd15_depth.pth", torch_dtype=torch.float16)
-
-# We use AnimateLCM for this example but one can use the original motion adapters as well (for example, https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-3)
-motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")
-
-vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
-pipe: AnimateDiffControlNetPipeline = AnimateDiffControlNetPipeline.from_pretrained(
-    "SG161222/Realistic_Vision_V5.1_noVAE",
-    motion_adapter=motion_adapter,
-    controlnet=controlnet,
-    vae=vae,
-).to(device="cuda", dtype=torch.float16)
-pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
-pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora")
-pipe.set_adapters(["lcm-lora"], [0.8])
-
-depth_detector = ZoeDetector.from_pretrained("lllyasviel/Annotators").to("cuda")
-video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif")
-conditioning_frames = []
-
-with pipe.progress_bar(total=len(video)) as progress_bar:
-    for frame in video:
-        conditioning_frames.append(depth_detector(frame))
-        progress_bar.update()
-
-prompt = "a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality"
-negative_prompt = "bad quality, worst quality"
-
-video = pipe(
-    prompt=prompt,
-    negative_prompt=negative_prompt,
-    num_frames=len(video),
-    num_inference_steps=10,
-    guidance_scale=2.0,
-    conditioning_frames=conditioning_frames,
-    generator=torch.Generator().manual_seed(42),
-).frames[0]
-
-export_to_gif(video, "animatediff_controlnet.gif", fps=8)
-```
-
-Here are some sample outputs:
-
-<table align="center">
-    <tr>
-      <th align="center">Source Video</th>
-      <th align="center">Output Video</th>
-    </tr>
-    <tr>
-        <td align="center">
-          raccoon playing a guitar
-          <br />
-          <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif" alt="raccoon playing a guitar" />
-        </td>
-        <td align="center">
-          a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality
-          <br/>
-          <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-controlnet-output.gif" alt="a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality" />
-        </td>
-    </tr>
-</table>
-
-### AnimateDiffSparseControlNetPipeline
-
-[SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933), by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai, is a method for achieving controlled generation in text-to-video diffusion models.
-
-The abstract from the paper is:
-
-*The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).*
-
-SparseCtrl introduces the following checkpoints for controlled text-to-video generation:
-
- [SparseCtrl Scribble](https://huggingface.co/guoyww/animatediff-sparsectrl-scribble)
- [SparseCtrl RGB](https://huggingface.co/guoyww/animatediff-sparsectrl-rgb)
-
-#### Using SparseCtrl Scribble
-
-```python
-import torch
-
-from diffusers import AnimateDiffSparseControlNetPipeline
-from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
-from diffusers.schedulers import DPMSolverMultistepScheduler
-from diffusers.utils import export_to_gif, load_image
-
-
-model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
-motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
-controlnet_id = "guoyww/animatediff-sparsectrl-scribble"
-lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
-vae_id = "stabilityai/sd-vae-ft-mse"
-device = "cuda"
-
-motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
-controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
-vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
-scheduler = DPMSolverMultistepScheduler.from_pretrained(
-    model_id,
-    subfolder="scheduler",
-    beta_schedule="linear",
-    algorithm_type="dpmsolver++",
-    use_karras_sigmas=True,
-)
-pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
-    model_id,
-    motion_adapter=motion_adapter,
-    controlnet=controlnet,
-    vae=vae,
-    scheduler=scheduler,
-    torch_dtype=torch.float16,
-).to(device)
-pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
-pipe.fuse_lora(lora_scale=1.0)
-
-prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
-negative_prompt = "low quality, worst quality, letterboxed"
-
-image_files = [
-    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
-    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
-    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png"
-]
-condition_frame_indices = [0, 8, 15]
-conditioning_frames = [load_image(img_file) for img_file in image_files]
-
-video = pipe(
-    prompt=prompt,
-    negative_prompt=negative_prompt,
-    num_inference_steps=25,
-    conditioning_frames=conditioning_frames,
-    controlnet_conditioning_scale=1.0,
-    controlnet_frame_indices=condition_frame_indices,
-    generator=torch.Generator().manual_seed(1337),
-).frames[0]
-export_to_gif(video, "output.gif")
-```
-
-Here are some sample outputs:
-
-<table align="center">
-    <tr>
-        <center>
-          <b>an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality</b>
-        </center>
-    </tr>
-    <tr>
-        <td>
-          <center>
-            <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png" alt="scribble-1" />
-          </center>
-        </td>
-        <td>
-          <center>
-            <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png" alt="scribble-2" />
-          </center>
-        </td>
-        <td>
-          <center>
-            <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png" alt="scribble-3" />
-          </center>
-        </td>
-    </tr>
-    <tr>
-        <td colspan=3>
-          <center>
-            <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-sparsectrl-scribble-results.gif" alt="an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality" />
-          </center>
-        </td>
-    </tr>
-</table>
-
-#### Using SparseCtrl RGB
-
-```python
-import torch
-
-from diffusers import AnimateDiffSparseControlNetPipeline
-from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
-from diffusers.schedulers import DPMSolverMultistepScheduler
-from diffusers.utils import export_to_gif, load_image
-
-
-model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
-motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
-controlnet_id = "guoyww/animatediff-sparsectrl-rgb"
-lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
-vae_id = "stabilityai/sd-vae-ft-mse"
-device = "cuda"
-
-motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
-controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
-vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
-scheduler = DPMSolverMultistepScheduler.from_pretrained(
-    model_id,
-    subfolder="scheduler",
-    beta_schedule="linear",
-    algorithm_type="dpmsolver++",
-    use_karras_sigmas=True,
-)
-pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
-    model_id,
-    motion_adapter=motion_adapter,
-    controlnet=controlnet,
-    vae=vae,
-    scheduler=scheduler,
-    torch_dtype=torch.float16,
-).to(device)
-pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
-
-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png")
-
-video = pipe(
-    prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
-    negative_prompt="low quality, worst quality",
-    num_inference_steps=25,
-    conditioning_frames=image,
-    controlnet_frame_indices=[0],
-    controlnet_conditioning_scale=1.0,
-    generator=torch.Generator().manual_seed(42),
-).frames[0]
-export_to_gif(video, "output.gif")
-```
-
-Here are some sample outputs:
-
-<table align="center">
-    <tr>
-        <center>
-          <b>closeup face photo of man in black clothes, night city street, bokeh, fireworks in background</b>
-        </center>
-    </tr>
-    <tr>
-        <td>
-          <center>
-            <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png" alt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background" />
-          </center>
-        </td>
-        <td>
-          <center>
-            <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-sparsectrl-rgb-result.gif" alt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background" />
-          </center>
-        </td>
-    </tr>
-</table>
-
 ### AnimateDiffSDXLPipeline

 AnimateDiff can also be used with SDXL models. This is currently an experimental feature as only a beta release of the motion adapter checkpoint is available.
@@ -565,6 +303,7 @@ output = pipe(
 )
 frames = output.frames[0]
 export_to_gif(frames, "animation.gif")
+
 ```

 <table>
@@ -639,6 +378,7 @@ output = pipe(
 )
 frames = output.frames[0]
 export_to_gif(frames, "animation.gif")
+
 ```

 <table>
@@ -823,37 +563,12 @@ export_to_gif(frames, "animatelcm-motion-lora.gif")
 </table>


-## Using `from_single_file` with the MotionAdapter
-
-`diffusers>=0.30.0` supports loading the AnimateDiff checkpoints into the `MotionAdapter` in their original format via `from_single_file`.
-
-```python
-import torch
-from diffusers import AnimateDiffPipeline, MotionAdapter
-ckpt_path = "https://huggingface.co/Lightricks/LongAnimateDiff/blob/main/lt_long_mm_32_frames.ckpt"
-
-adapter = MotionAdapter.from_single_file(ckpt_path, torch_dtype=torch.float16)
-pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)
-```
-
 ## AnimateDiffPipeline

 [[autodoc]] AnimateDiffPipeline
  - all
  - __call__

-## AnimateDiffControlNetPipeline
-
-[[autodoc]] AnimateDiffControlNetPipeline
-  - all
-  - __call__
-
-## AnimateDiffSparseControlNetPipeline
-
-[[autodoc]] AnimateDiffSparseControlNetPipeline
-  - all
-  - __call__
-
 ## AnimateDiffSDXLPipeline

 [[autodoc]] AnimateDiffSDXLPipeline
--- a/docs/source/en/api/pipelines/audioldm2.md
+++ b/docs/source/en/api/pipelines/audioldm2.md
@@ -20,8 +20,8 @@ The abstract of the paper is the following:

 *Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at [this https URL](https://audioldm.github.io/audioldm2).*

-This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi) and [Nguyễn Công Tú Anh](https://github.com/tuanh123789). The original codebase can be
-found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).
+This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi) and [Nguyễn Công Tú Anh](https://github.com/tuanh123789). The original codebase can be 
+found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2). 

 ## Tips

--- a/docs/source/en/api/pipelines/aura_flow.md
+++ b/docs/source/en/api/pipelines/aura_flow.md
@@ -1,29 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# AuraFlow
-
-AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3.md) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark.
-
-It was developed by the Fal team and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/).
-
-<Tip>
-
-AuraFlow can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details.
-
-</Tip>
-
-## AuraFlowPipeline
-
-[[autodoc]] AuraFlowPipeline
-	- all
-	- __call__
--- a/docs/source/en/api/pipelines/blip_diffusion.md
+++ b/docs/source/en/api/pipelines/blip_diffusion.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # BLIP-Diffusion

-BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
+BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation. 


 The abstract from the paper is:
--- a/docs/source/en/api/pipelines/cogvideox.md
+++ b/docs/source/en/api/pipelines/cogvideox.md
@@ -1,91 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-->
-
-# CogVideoX
-
-<!-- TODO: update paper with ArXiv link when ready. -->
-
-[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI.
-
-The abstract from the paper is:
-
-*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at https://github.com/THUDM/CogVideo.*
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
-
-This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).
-
-## Inference
-
-Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
-
-First, load the pipeline:
-
-```python
-import torch
-from diffusers import CogVideoXPipeline
-from diffusers.utils import export_to_video
-
-pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
-prompt = (
-    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
-    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
-    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
-    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
-    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
-    "atmosphere of this unique musical performance."
-)
-video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
-export_to_video(video, "output.mp4", fps=8)
-```
-
-Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:
-
-```python
-pipe.transformer.to(memory_format=torch.channels_last)
-pipe.vae.to(memory_format=torch.channels_last)
-```
-
-Finally, compile the components and run inference:
-
-```python
-pipe.transformer = torch.compile(pipe.transformer)
-pipe.vae.decode = torch.compile(pipe.vae.decode)
-
-# CogVideoX works very well with long and well-described prompts
-prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
-video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
-```
-
-The [benchmark](TODO: link) results on an 80GB A100 machine are:
-
-```
-Without torch.compile(): Average inference time: TODO seconds.
-With torch.compile(): Average inference time: TODO seconds.
-```
-
-## CogVideoXPipeline
-
-[[autodoc]] CogVideoXPipeline
-  - all
-  - __call__
-
-## CogVideoXPipelineOutput
-
-[[autodoc]] pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipelineOutput
--- a/docs/source/en/api/pipelines/controlnet_hunyuandit.md
+++ b/docs/source/en/api/pipelines/controlnet_hunyuandit.md
@@ -1,36 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team and Tencent Hunyuan Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# ControlNet with Hunyuan-DiT
-
-HunyuanDiTControlNetPipeline is an implementation of ControlNet for [Hunyuan-DiT](https://arxiv.org/abs/2405.08748).
-
-ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
-
-With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
-
-The abstract from the paper is:
-
-*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
-
-This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan).
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
-
-## HunyuanDiTControlNetPipeline
-[[autodoc]] HunyuanDiTControlNetPipeline
-	- all
-	- __call__
--- a/docs/source/en/api/pipelines/controlnet_sd3.md
+++ b/docs/source/en/api/pipelines/controlnet_sd3.md
@@ -1,39 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team and The InstantX Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# ControlNet with Stable Diffusion 3
-
-StableDiffusion3ControlNetPipeline is an implementation of ControlNet for Stable Diffusion 3.
-
-ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
-
-With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
-
-The abstract from the paper is:
-
-*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
-
-This code is implemented by [The InstantX Team](https://huggingface.co/InstantX). You can find pre-trained checkpoints for SD3-ControlNet on [The InstantX Team](https://huggingface.co/InstantX) Hub profile.
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
-
-## StableDiffusion3ControlNetPipeline
-[[autodoc]] StableDiffusion3ControlNetPipeline
-	- all
-	- __call__
-
-## StableDiffusion3PipelineOutput
-[[autodoc]] pipelines.stable_diffusion_3.pipeline_output.StableDiffusion3PipelineOutput
--- a/docs/source/en/api/pipelines/flux.md
+++ b/docs/source/en/api/pipelines/flux.md
@@ -1,165 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Flux
-
-Flux is a series of text-to-image generation models based on diffusion transformers. To know more about Flux, check out the original [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/) by the creators of Flux, Black Forest Labs.
-
-Original model checkpoints for Flux can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux).
-
-<Tip>
-
-Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more.  For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c).
-
-</Tip>
-
-Flux comes in two variants:
-
-* Timestep-distilled (`black-forest-labs/FLUX.1-schnell`)
-* Guidance-distilled (`black-forest-labs/FLUX.1-dev`)
-
-Both checkpoints have slightly different usage, which we detail below.
-
-### Timestep-distilled
-
-* `max_sequence_length` cannot be more than 256.
-* `guidance_scale` needs to be 0.
-* As this is a timestep-distilled model, it benefits from fewer sampling steps.
-
-```python
-import torch
-from diffusers import FluxPipeline
-
-pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
-pipe.enable_model_cpu_offload()
-
-prompt = "A cat holding a sign that says hello world"
-out = pipe(
-    prompt=prompt,
-    guidance_scale=0.,
-    height=768,
-    width=1360,
-    num_inference_steps=4,
-    max_sequence_length=256,
-).images[0]
-out.save("image.png")
-```
-
-### Guidance-distilled
-
-* The guidance-distilled variant takes about 50 sampling steps for good-quality generation.
-* It doesn't have any limitations around the `max_sequence_length`.
-
-```python
-import torch
-from diffusers import FluxPipeline
-
-pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
-pipe.enable_model_cpu_offload()
-
-prompt = "a tiny astronaut hatching from an egg on the moon"
-out = pipe(
-    prompt=prompt,
-    guidance_scale=3.5,
-    height=768,
-    width=1360,
-    num_inference_steps=50,
-).images[0]
-out.save("image.png")
-```
-
-## Running FP16 inference
-Flux can generate high-quality images with FP16 (e.g. to accelerate inference on Turing/Volta GPUs), but it produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing the text encoders to run in FP32 removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details.
-
-FP16 inference code:
-```python
-import torch
-from diffusers import FluxPipeline
-
-pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) # can replace schnell with dev
-# to run on low vram GPUs (i.e. between 4 and 32 GB VRAM)
-pipe.enable_sequential_cpu_offload()
-pipe.vae.enable_slicing()
-pipe.vae.enable_tiling()
-
-pipe.to(torch.float16) # casting here instead of in the pipeline constructor because doing so in the constructor loads all models into CPU memory at once
-
-prompt = "A cat holding a sign that says hello world"
-out = pipe(
-    prompt=prompt,
-    guidance_scale=0.,
-    height=768,
-    width=1360,
-    num_inference_steps=4,
-    max_sequence_length=256,
-).images[0]
-out.save("image.png")
-```
-
-## Single File Loading for the `FluxTransformer2DModel`
-
-The `FluxTransformer2DModel` supports loading checkpoints in the original format shipped by Black Forest Labs. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community.
-
-<Tip>
-`FP8` inference can be brittle depending on the GPU type, CUDA version, and `torch` version that you are using. It is recommended that you use the `optimum-quanto` library in order to run FP8 inference on your machine.
-</Tip>
-
-The following example demonstrates how to run Flux with less than 16GB of VRAM.
-
-First, install `optimum-quanto`:
-
-```shell
-pip install optimum-quanto
-```
-
-Then run the following example:
-
-```python
-import torch
-from diffusers import FluxTransformer2DModel, FluxPipeline
-from transformers import T5EncoderModel, CLIPTextModel
-from optimum.quanto import freeze, qfloat8, quantize
-
-bfl_repo = "black-forest-labs/FLUX.1-dev"
-dtype = torch.bfloat16
-
-transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype)
-quantize(transformer, weights=qfloat8)
-freeze(transformer)
-
-text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
-quantize(text_encoder_2, weights=qfloat8)
-freeze(text_encoder_2)
-
-pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype)
-pipe.transformer = transformer
-pipe.text_encoder_2 = text_encoder_2
-
-pipe.enable_model_cpu_offload()
-
-prompt = "A cat holding a sign that says hello world"
-image = pipe(
-    prompt,
-    guidance_scale=3.5,
-    output_type="pil",
-    num_inference_steps=20,
-    generator=torch.Generator("cpu").manual_seed(0)
-).images[0]
-
-image.save("flux-fp8-dev.png")
-```
-
-## FluxPipeline
-
-[[autodoc]] FluxPipeline
-	- all
-	- __call__
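The removed "Running FP16 inference" section above says that some text-encoder activations have to be clipped in FP16. As a minimal sketch of why clipping is needed at all (this only illustrates the FP16 numeric range using the standard library; it is not what Diffusers or the text encoders do internally, and the helper names are hypothetical):

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_in_fp16(value):
    """Report whether the value can be stored as a finite FP16 number."""
    try:
        to_fp16(value)
        return True
    except OverflowError:
        return False


def clip_to_fp16(value):
    """Clamp the value into the finite FP16 range, then convert it."""
    return to_fp16(max(-FP16_MAX, min(FP16_MAX, value)))


# An activation of this size is representable in FP32/BF16 but not in FP16.
large_activation = 70000.0
clipped = clip_to_fp16(large_activation)
```

Clipping changes the value that flows into later layers, which is consistent with the section's remark that FP16 outputs differ from FP32/BF16 outputs.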
--- a/docs/source/en/api/pipelines/hunyuandit.md
+++ b/docs/source/en/api/pipelines/hunyuandit.md
@@ -1,4 +1,4 @@
-<!--Copyright 2024 The HuggingFace Team and Tencent Hunyuan Team. All rights reserved.
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.

 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -34,15 +34,9 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m

 </Tip>

-<Tip>
-
-You can further improve generation quality by passing the generated image from [`HunyuanDiTPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model.
-
-</Tip>
-
 ## Optimization

-You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the [Speed up inference](../../optimization/fp16) and [Reduce memory usage](../../optimization/memory) guides.
+You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the [Speed up inference](../../optimization/fp16) and [Reduce memory usage](../../optimization/memory) guides. 

 ### Inference

@@ -52,7 +46,7 @@ First, load the pipeline:

 ```python
 from diffusers import HunyuanDiTPipeline
-import torch
+import torch 

 pipeline = HunyuanDiTPipeline.from_pretrained(
 	"Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
@@ -84,7 +78,7 @@ Without torch.compile(): Average inference time: 20.570 seconds.

 ### Memory optimization

-By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GBs of GPU VRAM. Refer to [this script](https://gist.github.com/sayakpaul/3154605f6af05b98a41081aaba5ca43e) for details.
+By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GBs of GPU VRAM. Refer to [this script](https://gist.github.com/sayakpaul/3154605f6af05b98a41081aaba5ca43e) for details. 

 Furthermore, you can use the [`~HunyuanDiT2DModel.enable_forward_chunking`] method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime.

@@ -98,4 +92,4 @@ Furthermore, you can use the [`~HunyuanDiT2DModel.enable_forward_chunking`] meth
 [[autodoc]] HunyuanDiTPipeline
 	- all
 	- __call__
-
+	
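The hunyuandit.md context above describes feed-forward chunking as running the feed-forward layers "in a loop instead of all at once". A minimal sketch of that idea in plain Python follows; the toy `feed_forward` function and the chunk size are assumptions for illustration and do not mirror the `HunyuanDiT2DModel` implementation:

```python
def feed_forward(rows):
    """Toy position-wise layer: every row (token) is processed independently."""
    return [[max(0.0, 2.0 * x + 1.0) for x in row] for row in rows]


def chunked_feed_forward(rows, chunk_size):
    """Apply feed_forward to chunk_size rows at a time and join the results."""
    outputs = []
    for start in range(0, len(rows), chunk_size):
        # Only chunk_size rows are in flight here, which is where the
        # memory saving comes from in a real model.
        outputs.extend(feed_forward(rows[start:start + chunk_size]))
    return outputs


tokens = [[-1.0, 0.5], [2.0, -3.0], [0.0, 1.0], [4.0, 4.0], [-0.5, 0.25]]
full = feed_forward(tokens)
chunked = chunked_feed_forward(tokens, chunk_size=2)
```

Because the layer treats each token independently, both paths produce the same output; the loop trades extra iterations for a smaller peak working set.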
--- a/docs/source/en/api/pipelines/kandinsky3.md
+++ b/docs/source/en/api/pipelines/kandinsky3.md
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License.

 Kandinsky 3 is created by [Vladimir Arkhipkin](https://github.com/oriBetelgeuse),[Anastasia Maltseva](https://github.com/NastyaMittseva),[Igor Pavlov](https://github.com/boomb0om),[Andrei Filatov](https://github.com/anvilarth),[Arseniy Shakhmatov](https://github.com/cene555),[Andrey Kuznetsov](https://github.com/kuznetsoffandrey),[Denis Dimitrov](https://github.com/denndimitrov), [Zein Shaheen](https://github.com/zeinsh)

-The description from it's GitHub page:
+The description from it's Github page:

 *Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.*

--- a/docs/source/en/api/pipelines/kolors.md
+++ b/docs/source/en/api/pipelines/kolors.md
@@ -1,107 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis
-
-![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/kolors_header_collage.png)
-
-Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](kwai-kolors@kuaishou.com). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf).
-
-The abstract from the technical report is:
-
-*We present Kolors, a latent diffusion model for text-to-image synthesis, characterized by its profound understanding of both English and Chinese, as well as an impressive degree of photorealism. There are three key insights contributing to the development of Kolors. Firstly, unlike large language model T5 used in Imagen and Stable Diffusion 3, Kolors is built upon the General Language Model (GLM), which enhances its comprehension capabilities in both English and Chinese. Moreover, we employ a multimodal large language model to recaption the extensive training dataset for fine-grained text understanding. These strategies significantly improve Kolors’ ability to comprehend intricate semantics, particularly those involving multiple entities, and enable its advanced text rendering capabilities. Secondly, we divide the training of Kolors into two phases: the concept learning phase with broad knowledge and the quality improvement phase with specifically curated high-aesthetic data. Furthermore, we investigate the critical role of the noise schedule and introduce a novel schedule to optimize high-resolution image generation. These strategies collectively enhance the visual appeal of the generated high-resolution images. Lastly, we propose a category-balanced benchmark KolorsPrompts, which serves as a guide for the training and evaluation of Kolors. Consequently, even when employing the commonly used U-Net backbone, Kolors has demonstrated remarkable performance in human evaluations, surpassing the existing open-source models and achieving Midjourney-v6 level performance, especially in terms of visual appeal. We will release the code and weights of Kolors at <https://github.com/Kwai-Kolors/Kolors>, and hope that it will benefit future research and applications in the visual generation community.*
-
-## Usage Example
-
-```python
-import torch
-
-from diffusers import DPMSolverMultistepScheduler, KolorsPipeline
-
-pipe = KolorsPipeline.from_pretrained("Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16")
-pipe.to("cuda")
-pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
-
-image = pipe(
-    prompt='一张瓢虫的照片，微距，变焦，高质量，电影，拿着一个牌子，写着"可图"',
-    negative_prompt="",
-    guidance_scale=6.5,
-    num_inference_steps=25,
-).images[0]
-
-image.save("kolors_sample.png")
-```
-
-### IP Adapter
-
-Kolors needs a different IP Adapter to work, and it uses [Openai-CLIP-336](https://huggingface.co/openai/clip-vit-large-patch14-336) as an image encoder.
-
-<Tip>
-
-Using an IP Adapter with Kolors requires more than 24GB of VRAM. To use it, we recommend using [`~DiffusionPipeline.enable_model_cpu_offload`] on consumer GPUs.
-
-</Tip>
-
-<Tip>
-
-While Kolors is integrated in Diffusers, you need to load the image encoder from a revision to use the safetensor files. You can still use the main branch of the original repository if you're comfortable loading pickle checkpoints.
-
-</Tip>
-
-```python
-import torch
-from transformers import CLIPVisionModelWithProjection
-
-from diffusers import DPMSolverMultistepScheduler, KolorsPipeline
-from diffusers.utils import load_image
-
-image_encoder = CLIPVisionModelWithProjection.from_pretrained(
-    "Kwai-Kolors/Kolors-IP-Adapter-Plus",
-    subfolder="image_encoder",
-    low_cpu_mem_usage=True,
-    torch_dtype=torch.float16,
-    revision="refs/pr/4",
-)
-
-pipe = KolorsPipeline.from_pretrained(
-    "Kwai-Kolors/Kolors-diffusers", image_encoder=image_encoder, torch_dtype=torch.float16, variant="fp16"
-).to("cuda")
-pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
-
-pipe.load_ip_adapter(
-    "Kwai-Kolors/Kolors-IP-Adapter-Plus",
-    subfolder="",
-    weight_name="ip_adapter_plus_general.safetensors",
-    revision="refs/pr/4",
-    image_encoder_folder=None,
-)
-pipe.enable_model_cpu_offload()
-
-ipa_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/cat_square.png")
-
-image = pipe(
-    prompt="best quality, high quality",
-    negative_prompt="",
-    guidance_scale=6.5,
-    num_inference_steps=25,
-    ip_adapter_image=ipa_image,
-).images[0]
-
-image.save("kolors_ipa_sample.png")
-```
-
-## KolorsPipeline
-
-[[autodoc]] KolorsPipeline
-
- all
- __call__
--- a/docs/source/en/api/pipelines/latte.md
+++ b/docs/source/en/api/pipelines/latte.md
@@ -1,77 +0,0 @@
-<!-- # Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License. -->
-
-# Latte
-
-![latte text-to-video](https://github.com/Vchitect/Latte/blob/52bc0029899babbd6e9250384c83d8ed2670ff7a/visuals/latte.gif?raw=true)
-
-[Latte: Latent Diffusion Transformer for Video Generation](https://arxiv.org/abs/2401.03048) from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University.
-
-The abstract from the paper is:
-
-*We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.*
-
-**Highlights**: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks - [FaceForensics](https://arxiv.org/abs/1803.09179), [SkyTimelapse](https://arxiv.org/abs/1709.07592), [UCF101](https://arxiv.org/abs/1212.0402) and [Taichi-HD](https://arxiv.org/abs/2003.00196). To prepare and download the datasets for evaluation, please refer to [this https URL](https://github.com/Vchitect/Latte/blob/main/docs/datasets_evaluation.md).
-
-This pipeline was contributed by [maxin-cn](https://github.com/maxin-cn). The original codebase can be found [here](https://github.com/Vchitect/Latte). The original weights can be found under [hf.co/maxin-cn](https://huggingface.co/maxin-cn).
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
-
-### Inference
-
-Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
-
-First, load the pipeline:
-
-```python
-import torch
-from diffusers import LattePipeline
-
-pipeline = LattePipeline.from_pretrained(
-	"maxin-cn/Latte-1", torch_dtype=torch.float16
-).to("cuda")
-```
-
-Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:
-
-```python
-pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.vae.to(memory_format=torch.channels_last)
-```
-
-Finally, compile the components and run inference:
-
-```python
-pipeline.transformer = torch.compile(pipeline.transformer)
-pipeline.vae.decode = torch.compile(pipeline.vae.decode)
-
-video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
-```
-
-The [benchmark](https://gist.github.com/a-r-r-o-w/4e1694ca46374793c0361d740a99ff19) results on an 80GB A100 machine are:
-
-```
-Without torch.compile(): Average inference time: 16.246 seconds.
-With torch.compile(): Average inference time: 14.573 seconds.
-```
-
-## LattePipeline
-
-[[autodoc]] LattePipeline
-  - all
-  - __call__
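The benchmark figures quoted in the removed latte.md text (16.246 s without `torch.compile()`, 14.573 s with it) can be turned into a relative saving. A small sketch of that arithmetic, using only the two numbers from the text:

```python
# Average inference times quoted in the removed latte.md benchmark block.
baseline_seconds = 16.246   # without torch.compile()
compiled_seconds = 14.573   # with torch.compile()

# Fraction of wall-clock time saved, and the equivalent speedup factor.
time_saved_fraction = (baseline_seconds - compiled_seconds) / baseline_seconds
speedup_factor = baseline_seconds / compiled_seconds
```

That works out to a saving of roughly 10% per run on the 80GB A100 used for the benchmark; other hardware may behave differently.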
--- a/docs/source/en/api/pipelines/lumina.md
+++ b/docs/source/en/api/pipelines/lumina.md
@@ -1,90 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Lumina-T2X
-![concepts](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9f52eabb-07dc-4881-8257-6d8a5f2a0a5a)
-
-[Lumina-Next : Making Lumina-T2X Stronger and Faster with Next-DiT](https://github.com/Alpha-VLLM/Lumina-T2X/blob/main/assets/lumina-next.pdf) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory.
-
-The abstract from the paper is:
-
-*Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers (Flag-DiT) that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling.*
-
-**Highlights**: Lumina-Next is a next-generation Diffusion Transformer that significantly enhances text-to-image generation, multilingual generation, and multitask performance by introducing the Next-DiT architecture, 3D RoPE, and frequency- and time-aware RoPE, among other improvements.
-
-Lumina-Next has the following components:
-* It improves sampling efficiency with fewer and faster steps.
-* It uses a Next-DiT as a transformer backbone with sandwich normalization, 3D RoPE, and Grouped-Query Attention.
-* It uses a Frequency- and Time-Aware Scaled RoPE.
-
---
-
-[Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers](https://arxiv.org/abs/2405.05945) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory.
-
-The abstract from the paper is:
-
-*Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.*
-
-
-You can find the original codebase at [Alpha-VLLM](https://github.com/Alpha-VLLM/Lumina-T2X) and all the available checkpoints at [Alpha-VLLM Lumina Family](https://huggingface.co/collections/Alpha-VLLM/lumina-family-66423205bedb81171fd0644b).
-
-**Highlights**: Lumina-T2X supports Any Modality, Resolution, and Duration.
-
-Lumina-T2X has the following components:
-* It uses a Flow-based Large Diffusion Transformer as the backbone.
-* It supports any modality with one backbone and the corresponding encoder and decoder.
-
-This pipeline was contributed by [PommesPeter](https://github.com/PommesPeter). The original codebase can be found [here](https://github.com/Alpha-VLLM/Lumina-T2X). The original weights can be found under [hf.co/Alpha-VLLM](https://huggingface.co/Alpha-VLLM).
-
-<Tip>
-
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
-
-</Tip>
-
-### Inference (Text-to-Image)
-
-Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
-
-First, load the pipeline:
-
-```python
-from diffusers import LuminaText2ImgPipeline
-import torch
-
-pipeline = LuminaText2ImgPipeline.from_pretrained(
-	"Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
-).to("cuda")
-```
-
-Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:
-
-```python
-pipeline.transformer.to(memory_format=torch.channels_last)
-pipeline.vae.to(memory_format=torch.channels_last)
-```
-
-Finally, compile the components and run inference:
-
-```python
-pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
-pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)
-
-image = pipeline(prompt="Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. Background shows an industrial revolution cityscape with smoky skies and tall, metal structures").images[0]
-```
-
-## LuminaText2ImgPipeline
-
-[[autodoc]] LuminaText2ImgPipeline
-	- all
-	- __call__
-
--- a/docs/source/en/api/pipelines/overview.md
+++ b/docs/source/en/api/pipelines/overview.md
@@ -71,7 +71,6 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
 | [Semantic Guidance](semantic_stable_diffusion) | text2image |
 | [Shap-E](shap_e) | text-to-3D, image-to-3D |
 | [Spectrogram Diffusion](spectrogram_diffusion) |  |
-| [Stable Audio](stable_audio) | text2audio |
 | [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution |
 | [Stable Diffusion Model Editing](model_editing) | model editing |
 | [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting |
--- a/docs/source/en/api/pipelines/pag.md
+++ b/docs/source/en/api/pipelines/pag.md
@@ -1,91 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Perturbed-Attention Guidance
-
-[Perturbed-Attention Guidance (PAG)](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules.
-
-PAG was introduced in [Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance](https://huggingface.co/papers/2403.17377) by Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin and Seungryong Kim.
-
-The abstract from the paper is:
-
-*Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.*
-
-PAG can be used by specifying the `pag_applied_layers` as a parameter when instantiating a PAG pipeline. It can be a single string or a list of strings. Each string can be a unique layer identifier or a regular expression to identify one or more layers. 
-
- Full identifier as a normal string: `down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor`
- Full identifier as a RegEx: `down_blocks.2.(attentions|motion_modules).0.transformer_blocks.0.attn1.processor`
- Partial identifier as a RegEx: `down_blocks.2`, or `attn1`
- List of identifiers (can be a combination of plain strings and RegEx): `["blocks.1", "blocks.(14|20)", r"down_blocks\.(2|3)"]`
-
-<Tip warning={true}>
-
-Since RegEx is supported as a way of matching layer identifiers, it is crucial to use it correctly; otherwise, you may see unexpected behavior. The recommended way to use PAG is to specify layers as `blocks.{layer_index}` and `blocks.({layer_index_1|layer_index_2|...})`. Using it in any other way, while doable, may bypass our basic validation checks and give you unexpected results.
-
-</Tip>
-
-## AnimateDiffPAGPipeline
-[[autodoc]] AnimateDiffPAGPipeline
-  - all
-  - __call__
-
-## HunyuanDiTPAGPipeline
-[[autodoc]] HunyuanDiTPAGPipeline
-  - all
-  - __call__
-
-## KolorsPAGPipeline
-[[autodoc]] KolorsPAGPipeline
-  - all
-  - __call__  
-
-## StableDiffusionPAGPipeline
-[[autodoc]] StableDiffusionPAGPipeline
-	- all
-	- __call__
-
-## StableDiffusionControlNetPAGPipeline
-[[autodoc]] StableDiffusionControlNetPAGPipeline
-	- all
-	- __call__
-
-## StableDiffusionXLPAGPipeline
-[[autodoc]] StableDiffusionXLPAGPipeline
-	- all
-	- __call__
-
-## StableDiffusionXLPAGImg2ImgPipeline
-[[autodoc]] StableDiffusionXLPAGImg2ImgPipeline
-	- all
-	- __call__
-
-## StableDiffusionXLPAGInpaintPipeline
-[[autodoc]] StableDiffusionXLPAGInpaintPipeline
-	- all
-	- __call__
-
-## StableDiffusionXLControlNetPAGPipeline
-[[autodoc]] StableDiffusionXLControlNetPAGPipeline
-	- all
-	- __call__
-
-
-## StableDiffusion3PAGPipeline
-[[autodoc]] StableDiffusion3PAGPipeline
-	- all
-	- __call__
-
-
-## PixArtSigmaPAGPipeline
-[[autodoc]] PixArtSigmaPAGPipeline
-	- all
-	- __call__
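
Note on the removed `pag_applied_layers` description above: the snippet below is a minimal sketch of how string and RegEx identifiers could select layers. It assumes matching behaves like Python's `re.search` against the full layer name; `LAYER_NAMES` and `select_layers` are hypothetical names used for illustration only and are not part of the Diffusers API.

```python
import re

# Hypothetical layer names, shaped like the identifiers quoted in the removed docs.
LAYER_NAMES = [
    "down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor",
    "down_blocks.3.attentions.0.transformer_blocks.0.attn1.processor",
    "mid_block.attentions.0.transformer_blocks.0.attn1.processor",
    "up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor",
]


def select_layers(patterns, names):
    """Return the names matched by at least one pattern (assumed re.search semantics)."""
    if isinstance(patterns, str):
        patterns = [patterns]  # a single string is treated as a one-element list
    return [name for name in names if any(re.search(p, name) for p in patterns)]


# Alternation uses "|": this selects down_blocks.2 and down_blocks.3.
print(select_layers(r"down_blocks\.(2|3)", LAYER_NAMES))

# A comma is a literal character in RegEx, so "(2,3)" only matches the text "2,3".
print(select_layers(r"down_blocks\.(2,3)", LAYER_NAMES))
```

A partial identifier such as `attn1` selects every layer whose name contains it, which is why the warning in the removed text recommends narrow, index-based patterns.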
--- a/docs/source/en/api/pipelines/pixart_sigma.md
+++ b/docs/source/en/api/pipelines/pixart_sigma.md
@@ -37,12 +37,6 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)

 </Tip>

-<Tip>
-
-You can further improve generation quality by passing the generated image from [`PixArtSigmaPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model.
-
-</Tip>
-
 ## Inference with under 8GB GPU VRAM

 Run the [`PixArtSigmaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example.
--- a/docs/source/en/api/pipelines/stable_audio.md
+++ b/docs/source/en/api/pipelines/stable_audio.md
@@ -1,42 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Stable Audio
-
-Stable Audio was proposed in [Stable Audio Open](https://arxiv.org/abs/2407.14358) by Zach Evans et al. It takes a text prompt as input and predicts the corresponding sound or music sample.
-
-Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.
-
-Stable Audio is trained on a corpus of around 48k audio recordings, where around 47k are from Freesound and the rest are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train the autoencoder and the DiT.
-
-The abstract of the paper is the following:
-*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*
-
-This pipeline was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe). The original codebase can be found at [Stability-AI/stable-audio-tool](https://github.com/Stability-AI/stable-audio-tool).
-
-## Tips
-
-When constructing a prompt, keep in mind:
-
-* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
-* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
-
-During inference:
-
-* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
-* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
-
-
-## StableAudioPipeline
-[[autodoc]] StableAudioPipeline
-	- all
-	- __call__
--- a/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # K-Diffusion

-[k-diffusion](https://github.com/crowsonkb/k-diffusion) is a popular library created by [Katherine Crowson](https://github.com/crowsonkb/). We provide `StableDiffusionKDiffusionPipeline` and `StableDiffusionXLKDiffusionPipeline` that allow you to run Stable Diffusion with samplers from k-diffusion.
+[k-diffusion](https://github.com/crowsonkb/k-diffusion) is a popular library created by [Katherine Crowson](https://github.com/crowsonkb/). We provide `StableDiffusionKDiffusionPipeline` and `StableDiffusionXLKDiffusionPipeline` that allow you to run Stable Diffusion with samplers from k-diffusion. 

 Note that most of the samplers from k-diffusion are implemented in Diffusers and we recommend using existing schedulers. You can find a mapping between k-diffusion samplers and schedulers in Diffusers [here](https://huggingface.co/docs/diffusers/api/schedulers/overview)

--- a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md
@@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License.

 # Text-to-(RGB, depth)

-LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps.
+LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. 

 Two checkpoints are available for use:
 - [ldm3d-original](https://huggingface.co/Intel/ldm3d). The original checkpoint used in the [paper](https://arxiv.org/pdf/2305.10853.pdf)
- [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c). The new version of LDM3D using 4 channels inputs instead of 6-channels inputs and finetuned on higher resolution images.
+- [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c). The new version of LDM3D using 4 channels inputs instead of 6-channels inputs and finetuned on higher resolution images. 


 The abstract from the paper is:
@@ -44,7 +44,7 @@ Make sure to check out the Stable Diffusion [Tips](overview#tips) section to lea

 # Upscaler

-[LDM3D-VR](https://arxiv.org/pdf/2311.03226.pdf) is an extended version of LDM3D.
+[LDM3D-VR](https://arxiv.org/pdf/2311.03226.pdf) is an extended version of LDM3D. 

 The abstract from the paper is:
 *Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods*
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
@@ -48,7 +48,7 @@ from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
 import torch

 repo_id = "stabilityai/stable-diffusion-2-base"
-pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16")
+pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

 pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
 pipe = pipe.to("cuda")
@@ -72,7 +72,7 @@ init_image = load_image(img_url).resize((512, 512))
 mask_image = load_image(mask_url).resize((512, 512))

 repo_id = "stabilityai/stable-diffusion-2-inpainting"
-pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16")
+pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

 pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
 pipe = pipe.to("cuda")
--- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md
@@ -1,315 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Stable Diffusion 3
-
-Stable Diffusion 3 (SD3) was proposed in [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/pdf/2403.03206.pdf) by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.
-
-The abstract from the paper is:
-
-*Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.*
-
-
-## Usage Example
-
-_As the model is gated, before using it with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form and accept the gate. Once you are in, you need to login so that your system knows you’ve accepted the gate._
-
-Use the command below to log in:
-
-```bash
-huggingface-cli login
-```
-
-<Tip>
-
-The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary in order for it to run on most commodity hardware. Please use the `torch.float16` data type for additional memory savings.
-
-</Tip>
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline
-
-pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
-pipe.to("cuda")
-
-image = pipe(
-    prompt="a photo of a cat holding a sign that says hello world",
-    negative_prompt="",
-    num_inference_steps=28,
-    height=1024,
-    width=1024,
-    guidance_scale=7.0,
-).images[0]
-
-image.save("sd3_hello_world.png")
-```
-
-## Memory Optimisations for SD3
-
-SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.
-
-### Running Inference with Model Offloading
-
-The most basic memory optimization available in Diffusers allows you to offload the components of the model to CPU during inference in order to save memory, while seeing a slight increase in inference latency. Model offloading will only move a model component onto the GPU when it needs to be executed, while keeping the remaining components on the CPU.
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline
-
-pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
-pipe.enable_model_cpu_offload()
-
-image = pipe(
-    prompt="a photo of a cat holding a sign that says hello world",
-    negative_prompt="",
-    num_inference_steps=28,
-    height=1024,
-    width=1024,
-    guidance_scale=7.0,
-).images[0]
-
-image.save("sd3_hello_world.png")
-```
-
-### Dropping the T5 Text Encoder during Inference
-
-Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance.
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline
-
-pipe = StableDiffusion3Pipeline.from_pretrained(
-    "stabilityai/stable-diffusion-3-medium-diffusers",
-    text_encoder_3=None,
-    tokenizer_3=None,
-    torch_dtype=torch.float16
-)
-pipe.to("cuda")
-
-image = pipe(
-    prompt="a photo of a cat holding a sign that says hello world",
-    negative_prompt="",
-    num_inference_steps=28,
-    height=1024,
-    width=1024,
-    guidance_scale=7.0,
-).images[0]
-
-image.save("sd3_hello_world-no-T5.png")
-```
-
-### Using a Quantized Version of the T5 Text Encoder
-
-We can leverage the `bitsandbytes` library to load and quantize the T5-XXL text encoder to 8-bit precision. This allows you to keep using all three text encoders while only slightly impacting performance.
-
-First install the `bitsandbytes` library.
-
-```shell
-pip install bitsandbytes
-```
-
-Then load the T5-XXL model using the `BitsAndBytesConfig`.
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline
-from transformers import T5EncoderModel, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-
-model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
-text_encoder = T5EncoderModel.from_pretrained(
-    model_id,
-    subfolder="text_encoder_3",
-    quantization_config=quantization_config,
-)
-pipe = StableDiffusion3Pipeline.from_pretrained(
-    model_id,
-    text_encoder_3=text_encoder,
-    device_map="balanced",
-    torch_dtype=torch.float16
-)
-
-image = pipe(
-    prompt="a photo of a cat holding a sign that says hello world",
-    negative_prompt="",
-    num_inference_steps=28,
-    height=1024,
-    width=1024,
-    guidance_scale=7.0,
-).images[0]
-
-image.save("sd3_hello_world-8bit-T5.png")
-```
-
-You can find the end-to-end script [here](https://gist.github.com/sayakpaul/82acb5976509851f2db1a83456e504f1).
-
-## Performance Optimizations for SD3
-
-### Using Torch Compile to Speed Up Inference
-
-Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. The following code snippet demonstrates how to compile the Transformer and VAE components of the SD3 pipeline.
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline
-
-torch.set_float32_matmul_precision("high")
-
-torch._inductor.config.conv_1x1_as_mm = True
-torch._inductor.config.coordinate_descent_tuning = True
-torch._inductor.config.epilogue_fusion = False
-torch._inductor.config.coordinate_descent_check_all_directions = True
-
-pipe = StableDiffusion3Pipeline.from_pretrained(
-    "stabilityai/stable-diffusion-3-medium-diffusers",
-    torch_dtype=torch.float16
-).to("cuda")
-pipe.set_progress_bar_config(disable=True)
-
-pipe.transformer.to(memory_format=torch.channels_last)
-pipe.vae.to(memory_format=torch.channels_last)
-
-pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
-pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)
-
-# Warm Up
-prompt = "a photo of a cat holding a sign that says hello world"
-for _ in range(3):
-    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))
-
-# Run Inference
-image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
-image.save("sd3_hello_world.png")
-```
-
-Check out the full script [here](https://gist.github.com/sayakpaul/508d89d7aad4f454900813da5d42ca97).
-
-## Using Long Prompts with the T5 Text Encoder
-
-By default, the T5 Text Encoder prompt uses a maximum sequence length of `256`. This can be adjusted by setting the `max_sequence_length` to accept fewer or more tokens. Keep in mind that longer sequences require additional resources and result in longer generation times, such as during batch inference.
-
-```python
-prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"
-
-image = pipe(
-    prompt=prompt,
-    negative_prompt="",
-    num_inference_steps=28,
-    guidance_scale=4.5,
-    max_sequence_length=512,
-).images[0]
-```
-
-### Sending a different prompt to the T5 Text Encoder
-
-You can send a different prompt to the CLIP Text Encoders and the T5 Text Encoder to prevent the prompt from being truncated by the CLIP Text Encoders and to improve generation.
-
-<Tip>
-
-The prompt with the CLIP Text Encoders is still truncated to the 77 token limit.
-
-</Tip>
-
-```python
-prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. A river of warm, melted butter, pancake-like foliage in the background, a towering pepper mill standing in for a tree."
-
-prompt_3 = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"
-
-image = pipe(
-    prompt=prompt,
-    prompt_3=prompt_3,
-    negative_prompt="",
-    num_inference_steps=28,
-    guidance_scale=4.5,
-    max_sequence_length=512,
-).images[0]
-```
-
-## Tiny AutoEncoder for Stable Diffusion 3
-
-Tiny AutoEncoder for Stable Diffusion (TAESD3) is a tiny distilled version of Stable Diffusion 3's VAE by [Ollin Boer Bohan](https://github.com/madebyollin/taesd) that can decode [`StableDiffusion3Pipeline`] latents almost instantly.
-
-To use with Stable Diffusion 3:
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline, AutoencoderTiny
-
-pipe = StableDiffusion3Pipeline.from_pretrained(
-    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
-)
-pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16)
-pipe = pipe.to("cuda")
-
-prompt = "slice of delicious New York-style berry cheesecake"
-image = pipe(prompt, num_inference_steps=25).images[0]
-image.save("cheesecake.png")
-```
-
-## Loading the original checkpoints via `from_single_file`
-
-The `SD3Transformer2DModel` and `StableDiffusion3Pipeline` classes support loading the original checkpoints via the `from_single_file` method. This method allows you to load the original checkpoint files that were used to train the models.
-
-## Loading the original checkpoints for the `SD3Transformer2DModel`
-
-```python
-from diffusers import SD3Transformer2DModel
-
-model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors")
-```
-
-## Loading the single checkpoint for the `StableDiffusion3Pipeline`
-
-### Loading the single file checkpoint without T5
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline
-
-pipe = StableDiffusion3Pipeline.from_single_file(
-    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors",
-    torch_dtype=torch.float16,
-    text_encoder_3=None
-)
-pipe.enable_model_cpu_offload()
-
-image = pipe("a picture of a cat holding a sign that says hello world").images[0]
-image.save('sd3-single-file.png')
-```
-
-### Loading the single file checkpoint with T5
-
-> [!TIP]
-> The following example loads a checkpoint stored in an 8-bit floating point format which requires PyTorch 2.3 or later.
-
-```python
-import torch
-from diffusers import StableDiffusion3Pipeline
-
-pipe = StableDiffusion3Pipeline.from_single_file(
-    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
-    torch_dtype=torch.float16,
-)
-pipe.enable_model_cpu_offload()
-
-image = pipe("a picture of a cat holding a sign that says hello world").images[0]
-image.save('sd3-single-file-t5-fp8.png')
-```
-
-## StableDiffusion3Pipeline
-
-[[autodoc]] StableDiffusion3Pipeline
-	- all
-	- __call__
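
Note on the removed SD3 memory sections above: the removed text gives the T5-XXL text encoder as 4.7B parameters. The sketch below is rough arithmetic for the weight storage alone at different precisions. It ignores activations, buffers, and any quantization bookkeeping, so real usage will be higher; `weight_gb` is a hypothetical helper, not a Diffusers function.

```python
T5_XXL_PARAMS = 4.7e9  # parameter count quoted in the removed documentation


def weight_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp16_gb = weight_gb(T5_XXL_PARAMS, 2)  # torch.float16 stores 2 bytes per parameter
int8_gb = weight_gb(T5_XXL_PARAMS, 1)  # 8-bit quantization stores about 1 byte per parameter

print(f"fp16 weights: ~{fp16_gb:.1f} GB")
print(f"8-bit weights: ~{int8_gb:.1f} GB")
```

Under these assumptions, dropping the encoder frees roughly the full fp16 figure, while 8-bit loading frees about half of it.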
--- a/docs/source/en/api/pipelines/text_to_video_zero.md
+++ b/docs/source/en/api/pipelines/text_to_video_zero.md
@@ -155,28 +155,28 @@ To generate a video from prompt with additional pose control
    imageio.mimsave("video.mp4", result, fps=4)
    ```
 - #### SDXL Support
-
+	
 	Since our attention processor also works with SDXL, it can be utilized to generate a video from prompt using ControlNet models powered by SDXL:
 	```python
 	import torch
 	from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
 	from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
-
+	
 	controlnet_model_id = 'thibaud/controlnet-openpose-sdxl-1.0'
 	model_id = 'stabilityai/stable-diffusion-xl-base-1.0'
-
+	
 	controlnet = ControlNetModel.from_pretrained(controlnet_model_id, torch_dtype=torch.float16)
 	pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
 		model_id, controlnet=controlnet, torch_dtype=torch.float16
 	).to('cuda')
-
+	
 	# Set the attention processor
 	pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
 	pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
-
+	
 	# fix latents for all frames
 	latents = torch.randn((1, 4, 128, 128), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
-
+	
 	prompt = "Darth Vader dancing in a desert"
 	result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
 	imageio.mimsave("video.mp4", result, fps=4)
--- a/docs/source/en/api/schedulers/cosine_dpm.md
+++ b/docs/source/en/api/schedulers/cosine_dpm.md
@@ -1,24 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# CosineDPMSolverMultistepScheduler
-
-The [`CosineDPMSolverMultistepScheduler`] is a variant of [`DPMSolverMultistepScheduler`] with cosine schedule, proposed by Nichol and Dhariwal (2021).
-It is being used in the [Stable Audio Open](https://arxiv.org/abs/2407.14358) paper and the [Stability-AI/stable-audio-tool](https://github.com/Stability-AI/stable-audio-tool) codebase.
-
-This scheduler was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe).
-
-## CosineDPMSolverMultistepScheduler
-[[autodoc]] CosineDPMSolverMultistepScheduler
-
-## SchedulerOutput
-[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
--- a/docs/source/en/api/schedulers/flow_match_euler_discrete.md
+++ b/docs/source/en/api/schedulers/flow_match_euler_discrete.md
@@ -1,18 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# FlowMatchEulerDiscreteScheduler
-
-`FlowMatchEulerDiscreteScheduler` is based on the flow-matching sampling introduced in [Stable Diffusion 3](https://arxiv.org/abs/2403.03206).
-
-## FlowMatchEulerDiscreteScheduler
-[[autodoc]] FlowMatchEulerDiscreteScheduler
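
Note on the removed scheduler page above: as an illustration of what an Euler step for flow matching can look like, here is a minimal scalar sketch. It assumes the rectified-flow parameterization `x_sigma = (1 - sigma) * data + sigma * noise`, whose velocity is `noise - data`. The function names are hypothetical, and this is not the scheduler's actual implementation (which works on tensors and takes the velocity from a model).

```python
def euler_flow_step(sample, velocity, sigma, sigma_next):
    """One explicit Euler step along the flow from sigma to sigma_next."""
    return sample + (sigma_next - sigma) * velocity


def sample_with_known_velocity(noise, data, sigmas):
    """Integrate from pure noise (sigma = 1) toward data (sigma = 0).

    The exact velocity is used here in place of a model prediction,
    so the straight-line path is recovered without error.
    """
    velocity = noise - data  # constant along a straight rectified-flow path
    x = noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = euler_flow_step(x, velocity, sigma, sigma_next)
    return x


result = sample_with_known_velocity(noise=-1.0, data=2.0, sigmas=[1.0, 0.5, 0.0])
print(result)
```

With a learned velocity the path is only approximately straight, so the number of steps (the length of `sigmas`) trades speed against accuracy.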
--- a/docs/source/en/api/schedulers/flow_match_heun_discrete.md
+++ b/docs/source/en/api/schedulers/flow_match_heun_discrete.md
@@ -1,18 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# FlowMatchHeunDiscreteScheduler
-
-`FlowMatchHeunDiscreteScheduler` is based on the flow-matching sampling introduced in [EDM](https://arxiv.org/abs/2403.03206).
-
-## FlowMatchHeunDiscreteScheduler
-[[autodoc]] FlowMatchHeunDiscreteScheduler
--- a/docs/source/en/api/schedulers/tcd.md
+++ b/docs/source/en/api/schedulers/tcd.md
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# TCDScheduler
+# TCDScheduler 

 [Trajectory Consistency Distillation](https://huggingface.co/papers/2402.19159) by Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao and Tat-Jen Cham introduced a Strategic Stochastic Sampling (Algorithm 4) that is capable of generating good samples in a small number of steps. Distinguishing it as an advanced iteration of the multistep scheduler (Algorithm 1) in the [Consistency Models](https://huggingface.co/papers/2303.01469), Strategic Stochastic Sampling specifically tailored for the trajectory consistency function.

--- a/docs/source/en/community_projects.md
+++ b/docs/source/en/community_projects.md
@@ -1,78 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Community Projects
-
-Welcome to Community Projects. This space is dedicated to showcasing the incredible work and innovative applications created by our vibrant community using the `diffusers` library.
-
-This section aims to:
-
- Highlight diverse and inspiring projects built with `diffusers`
- Foster knowledge sharing within our community
- Provide real-world examples of how `diffusers` can be leveraged
-
-Happy exploring, and thank you for being part of the Diffusers community!
-
-<table>
-    <tr>
-        <th>Project Name</th>
-        <th>Description</th>
-    </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/carson-katri/dream-textures"> dream-textures </a></td>
-    <td>Stable Diffusion built-in to Blender</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/megvii-research/HiDiffusion"> HiDiffusion </a></td>
-    <td>Increases the resolution and speed of your diffusion model by only adding a single line of code</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/lllyasviel/IC-Light"> IC-Light </a></td>
-    <td>IC-Light is a project to manipulate the illumination of images</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/InstantID/InstantID"> InstantID </a></td>
-    <td>InstantID : Zero-shot Identity-Preserving Generation in Seconds</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/Sanster/IOPaint"> IOPaint </a></td>
-    <td>Image inpainting tool powered by SOTA AI Model. Remove any unwanted object, defect, people from your pictures or erase and replace(powered by stable diffusion) any thing on your pictures.</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/bmaltais/kohya_ss"> Kohya </a></td>
-    <td>Gradio GUI for Kohya's Stable Diffusion trainers</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/magic-research/magic-animate"> MagicAnimate </a></td>
-    <td>MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/levihsu/OOTDiffusion"> OOTDiffusion </a></td>
-    <td>Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/vladmandic/automatic"> SD.Next </a></td>
-    <td>SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/ashawkey/stable-dreamfusion"> stable-dreamfusion </a></td>
-    <td>Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/HVision-NKU/StoryDiffusion"> StoryDiffusion </a></td>
-    <td>StoryDiffusion can create a magic story by generating consistent images and videos.</td>
-  </tr>
-  <tr style="border-top: 2px solid black">
-    <td><a href="https://github.com/cumulo-autumn/StreamDiffusion"> StreamDiffusion </a></td>
-    <td>A Pipeline-Level Solution for Real-Time Interactive Generation</td>
-  </tr>
-</table>
--- a/docs/source/en/conceptual/contribution.md
+++ b/docs/source/en/conceptual/contribution.md
@@ -22,13 +22,14 @@ We enormously value feedback from the community, so please do not be afraid to s

 ## Overview

-You can contribute in many ways ranging from answering questions on issues and discussions to adding new diffusion models to the core library.
+You can contribute in many ways ranging from answering questions on issues to adding new diffusion models to
+the core library.

 In the following, we give an overview of different ways to contribute, ranked by difficulty in ascending order. All of them are valuable to the community.

 * 1. Asking and answering questions on [the Diffusers discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers) or on [Discord](https://discord.gg/G7tWnz98XR).
-* 2. Opening new issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues/new/choose) or new discussions on [the GitHub Discussions tab](https://github.com/huggingface/diffusers/discussions/new/choose).
-* 3. Answering issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues) or discussions on [the GitHub Discussions tab](https://github.com/huggingface/diffusers/discussions).
+* 2. Opening new issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues/new/choose).
+* 3. Answering issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues).
 * 4. Fix a simple issue, marked by the "Good first issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).
 * 5. Contribute to the [documentation](https://github.com/huggingface/diffusers/tree/main/docs/source).
 * 6. Contribute a [Community Pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3Acommunity-examples).
@@ -62,7 +63,7 @@ In the same spirit, you are of immense help to the community by answering such q

 **Please** keep in mind that the more effort you put into asking or answering a question, the higher
 the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database.
-In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
+In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formated/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.

 **NOTE about channels**:
 [*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago.
@@ -98,7 +99,7 @@ This means in more detail:
 - Format your code.
 - Do not include any external libraries except for those that Diffusers depends on.
 - **Always** provide all necessary information about your environment; for this, you can run: `diffusers-cli env` in your shell and copy-paste the displayed information to the issue.
- Explain the issue. If the reader doesn't know what the issue is and why it is an issue, (s)he cannot solve it.
+- Explain the issue. If the reader doesn't know what the issue is and why it is an issue, she cannot solve it.
 - **Always** make sure the reader can reproduce your issue with as little effort as possible. If your code snippet cannot be run because of missing libraries or undefined variables, the reader cannot help you. Make sure your reproducible code snippet is as minimal as possible and can be copy-pasted into a simple Python shell.
 - If in order to reproduce your issue a model and/or dataset is required, make sure the reader has access to that model or dataset. You can always upload your model or dataset to the [Hub](https://huggingface.co) to make it easily downloadable. Try to keep your model and dataset as small as possible, to make the reproduction of your issue as effortless as possible.

@@ -287,7 +288,7 @@ The official training examples are maintained by the Diffusers' core maintainers
 This is because of the same reasons put forward in [6. Contribute a community pipeline](#6-contribute-a-community-pipeline) for official pipelines vs. community pipelines: It is not feasible for the core maintainers to maintain all possible training methods for diffusion models.
 If the Diffusers core maintainers and the community consider a certain training paradigm to be too experimental or not popular enough, the corresponding training code should be put in the `research_projects` folder and maintained by the author.

-Both official training and research examples consist of a directory that contains one or more training scripts, a `requirements.txt` file, and a `README.md` file. In order for the user to make use of the
+Both official training and research examples consist of a directory that contains one or more training scripts, a requirements.txt file, and a README.md file. In order for the user to make use of the
 training examples, it is required to clone the repository:

 ```bash
@@ -297,8 +298,7 @@ git clone https://github.com/huggingface/diffusers
 as well as to install all additional dependencies required for training:

 ```bash
-cd diffusers
-pip install -r examples/<your-example-folder>/requirements.txt
+pip install -r /examples/<your-example-folder>/requirements.txt
 ```

 Therefore when adding an example, the `requirements.txt` file shall define all pip dependencies required for your training example so that once all those are installed, the user can run the example's training script. See, for example, the [DreamBooth `requirements.txt` file](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/requirements.txt).
@@ -316,7 +316,7 @@ Once an example script works, please make sure to add a comprehensive `README.md
 - A link to some training results (logs, models, etc.) that show what the user can expect as shown [here](https://api.wandb.ai/report/patrickvonplaten/xm6cd5q5).
 - If you are adding a non-official/research training example, **please don't forget** to add a sentence that you are maintaining this training example which includes your git handle as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/research_projects/intel_opts#diffusers-examples-with-intel-optimizations).

-If you are contributing to the official training examples, please also make sure to add a test to its folder such as [examples/dreambooth/test_dreambooth.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/test_dreambooth.py). This is not necessary for non-official training examples.
+If you are contributing to the official training examples, please also make sure to add a test to [examples/test_examples.py](https://github.com/huggingface/diffusers/blob/main/examples/test_examples.py). This is not necessary for non-official training examples.

 ### 8. Fixing a "Good second issue"

@@ -418,7 +418,7 @@ You will need basic `git` proficiency to be able to contribute to
 manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
 Git](https://git-scm.com/book/en/v2) is a very good reference.

-Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/83bc6c94eaeb6f7704a2a428931cf2d9ad973ae9/setup.py#L270)):
+Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/main/setup.py#L244)):

 1. Fork the [repository](https://github.com/huggingface/diffusers) by
 clicking on the 'Fork' button on the repository's page. This creates a copy of the code
@@ -565,4 +565,4 @@ $ git push --set-upstream origin your-branch-for-syncing

 ### Style guide

-For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
+For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
--- a/docs/source/en/conceptual/philosophy.md
+++ b/docs/source/en/conceptual/philosophy.md
@@ -63,7 +63,7 @@ Let's walk through more in-detail design decisions for each class.
 Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference.

 The following design principles are followed:
- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [# Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
+- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [#Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
 - Pipelines all inherit from [`DiffusionPipeline`].
 - Every pipeline consists of different model and scheduler components, that are documented in the [`model_index.json` file](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline and can be shared between pipelines with [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function.
 - Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function.
@@ -81,7 +81,7 @@ Models are designed as configurable toolboxes that are natural extensions of [Py

 The following design principles are followed:
 - Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc...
+- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py), [`transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformer_2d.py), etc...
 - Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
 - Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages.
 - Models all inherit from `ModelMixin` and `ConfigMixin`.
@@ -90,7 +90,7 @@ The following design principles are followed:
 - To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
 - Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
 - The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
-readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
+readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

 ### Schedulers

@@ -100,11 +100,11 @@ The following design principles are followed:
 - All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
 - Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
 - One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
- If schedulers share similar functionalities, we can make use of the `# Copied from` mechanism.
+- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
 - Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](../using-diffusers/schedulers).
+- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](../using-diffusers/schedulers.md).
 - Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
 - Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
 - The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
 - Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box".
- In almost all cases, novel schedulers shall be implemented in a new scheduling file.
+- In almost all cases, novel schedulers shall be implemented in a new scheduling file.
--- a/docs/source/en/training/adapt_a_model.md
+++ b/docs/source/en/training/adapt_a_model.md
@@ -26,7 +26,7 @@ pipeline.unet.config["in_channels"]
 9
 ```

-To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9.
+To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9. 

 Initialize a [`UNet2DConditionModel`] with the pretrained text-to-image model weights, and change `in_channels` to 9. Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now.

--- a/docs/source/en/training/controlnet.md
+++ b/docs/source/en/training/controlnet.md
@@ -349,7 +349,7 @@ control_image = load_image("./conditioning_image_1.png")
 prompt = "pale golden rod circle with old lace background"

 generator = torch.manual_seed(0)
-image = pipeline(prompt, num_inference_steps=20, generator=generator, image=control_image).images[0]
+image = pipe(prompt, num_inference_steps=20, generator=generator, image=control_image).images[0]
 image.save("./output.png")
 ```

@@ -363,4 +363,4 @@ The SDXL training script is discussed in more detail in the [SDXL training](sdxl

 Congratulations on training your own ControlNet! To learn more about how to use your new model, the following guides may be helpful:

- Learn how to [use a ControlNet](../using-diffusers/controlnet) for inference on a variety of tasks.
+- Learn how to [use a ControlNet](../using-diffusers/controlnet) for inference on a variety of tasks.
--- a/docs/source/en/training/create_dataset.md
+++ b/docs/source/en/training/create_dataset.md
@@ -9,7 +9,7 @@ This guide will show you two ways to create a dataset to finetune on:

 <Tip>

-💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide.
+💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide. 

 </Tip>

@@ -39,7 +39,7 @@ accelerate launch train_unconditional.py \

 </Tip>

-Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images.
+Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images. 

 You can use the `data_dir` or `data_files` parameters to specify the location of the dataset. The `data_files` parameter supports mapping specific files to dataset splits like `train` or `test`:

--- a/docs/source/en/training/distributed_inference.md
+++ b/docs/source/en/training/distributed_inference.md
@@ -52,6 +52,76 @@ To learn more, take a look at the [Distributed Inference with 🤗 Accelerate](h

 </Tip>

+### Device placement
+
+> [!WARNING]
+> This feature is experimental and its APIs might change in the future. 
+
+With Accelerate, you can use the `device_map` to determine how to distribute the models of a pipeline across multiple devices. This is useful in situations where you have more than one GPU.
+
+For example, if you have two 8GB GPUs, then using [`~DiffusionPipeline.enable_model_cpu_offload`] may not work so well because:
+
+* it only works on a single GPU
+* a single model might not fit on a single GPU ([`~DiffusionPipeline.enable_sequential_cpu_offload`] might work but it will be extremely slow and it is also limited to a single GPU)
+
+To make use of both GPUs, you can use the "balanced" device placement strategy which splits the models across all available GPUs.
+
+> [!WARNING]
+> Only the "balanced" strategy is supported at the moment, and we plan to support additional mapping strategies in the future.
+
+```diff
+from diffusers import DiffusionPipeline
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(
+-    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
++    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, device_map="balanced"
+)
+image = pipeline("a dog").images[0]
+image
+```
+
+You can also pass a dictionary to enforce the maximum GPU memory that can be used on each device:
+
+```diff
+from diffusers import DiffusionPipeline
+import torch
+
+max_memory = {0:"1GB", 1:"1GB"}
+pipeline = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    torch_dtype=torch.float16, 
+    use_safetensors=True, 
+    device_map="balanced",
++   max_memory=max_memory
+)
+image = pipeline("a dog").images[0]
+image
+```
+
+If a device is not present in `max_memory`, then it will be completely ignored and will not participate in the device placement. 
+
+By default, Diffusers uses the maximum memory of all devices. If the models don't fit on the GPUs, they are offloaded to the CPU. If the CPU doesn't have enough memory, then you might see an error. In that case, you could fall back to [`~DiffusionPipeline.enable_sequential_cpu_offload`] or [`~DiffusionPipeline.enable_model_cpu_offload`].
+
+Call [`~DiffusionPipeline.reset_device_map`] to reset the `device_map` of a pipeline. This is also necessary if you want to use methods like `to()`, [`~DiffusionPipeline.enable_sequential_cpu_offload`], and [`~DiffusionPipeline.enable_model_cpu_offload`] on a pipeline that was device-mapped.
+
+```py
+pipeline.reset_device_map()
+```
+
+Once a pipeline has been device-mapped, you can also access its device map via `hf_device_map`:
+
+```py
+print(pipeline.hf_device_map)
+```
+
+An example device map would look like so:
+
+
+```bash
+{'unet': 1, 'vae': 1, 'safety_checker': 0, 'text_encoder': 0}
+```
+
 ## PyTorch Distributed

 PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables data parallelism.
@@ -106,6 +176,3 @@ Once you've completed the inference script, use the `--nproc_per_node` argument
 ```bash
 torchrun run_distributed.py --nproc_per_node=2
 ```
-
-> [!TIP]
-> You can use `device_map` within a [`DiffusionPipeline`] to distribute its model-level components on multiple devices. Refer to the [Device placement](../tutorials/inference_with_big_models#device-placement) guide to learn more.
--- a/docs/source/en/training/dreambooth.md
+++ b/docs/source/en/training/dreambooth.md
@@ -533,7 +533,7 @@ python train_dreambooth_lora.py \
    --resolution=256 \
    --train_batch_size=4 \
    --gradient_accumulation_steps=1 \
-    --learning_rate=1e-6 \
+    --learning_rate=1e-6 \ 
    --max_train_steps=2000 \
    --validation_prompt="a sks dog" \
    --validation_epochs=100 \
--- a/docs/source/en/training/instructpix2pix.md
+++ b/docs/source/en/training/instructpix2pix.md
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.

 [InstructPix2Pix](https://hf.co/papers/2211.09800) is a Stable Diffusion model trained to edit images from human-provided instructions. For example, your prompt can be "turn the clouds rainy" and the model will edit the input image accordingly. This model is conditioned on the text prompt (or editing instruction) and the input image.

-This guide will explore the [train_instruct_pix2pix.py](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) training script to help you become familiar with it, and how you can adapt it for your own use case.
+This guide will explore the [train_instruct_pix2pix.py](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.

 Before running the script, make sure you install the library from source:

@@ -117,7 +117,7 @@ optimizer = optimizer_cls(
 )
 ```

-Next, the edited images and edit instructions are [preprocessed](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L624) and [tokenized](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L610C24-L610C24). It is important the same image transformations are applied to the original and edited images.
+Next, the edited images and and edit instructions are [preprocessed](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L624) and [tokenized](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L610C24-L610C24). It is important the same image transformations are applied to the original and edited images.

 ```py
 def preprocess_train(examples):
@@ -249,4 +249,4 @@ The SDXL training script is discussed in more detail in the [SDXL training](sdxl

 Congratulations on training your own InstructPix2Pix model! 🥳 To learn more about the model, it may be helpful to:

- Read the [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd) blog post to learn more about some experiments we've done with InstructPix2Pix, dataset preparation, and results for different instructions.
+- Read the [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd) blog post to learn more about some experiments we've done with InstructPix2Pix, dataset preparation, and results for different instructions.
--- a/docs/source/en/training/text2image.md
+++ b/docs/source/en/training/text2image.md
@@ -181,7 +181,7 @@ accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
-  --enable_xformers_memory_efficient_attention \
+  --enable_xformers_memory_efficient_attention
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-naruto-model" \
  --push_to_hub
--- a/docs/source/en/tutorials/basic_training.md
+++ b/docs/source/en/tutorials/basic_training.md
@@ -340,8 +340,7 @@ Now you can wrap all these components together in a training loop with 🤗 Acce
 ...                 loss = F.mse_loss(noise_pred, noise)
 ...                 accelerator.backward(loss)

-...                 if accelerator.sync_gradients:
-...                     accelerator.clip_grad_norm_(model.parameters(), 1.0)
+...                 accelerator.clip_grad_norm_(model.parameters(), 1.0)
 ...                 optimizer.step()
 ...                 lr_scheduler.step()
 ...                 optimizer.zero_grad()
--- a/docs/source/en/tutorials/fast_diffusion.md
+++ b/docs/source/en/tutorials/fast_diffusion.md
@@ -34,10 +34,13 @@ Install [PyTorch nightly](https://pytorch.org/) to benefit from the latest and f
 pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
 ```

-> [!TIP]
-> The results reported below are from a 80GB 400W A100 with its clock rate set to the maximum.
-> If you're interested in the full benchmarking code, take a look at [huggingface/diffusion-fast](https://github.com/huggingface/diffusion-fast).
+<Tip>

+The results reported below are from a 80GB 400W A100 with its clock rate set to the maximum. <br>
+
+If you're interested in the full benchmarking code, take a look at [huggingface/diffusion-fast](https://github.com/huggingface/diffusion-fast).
+
+</Tip>

 ## Baseline

@@ -167,9 +170,6 @@ Using SDPA attention and compiling both the UNet and VAE cuts the latency from 3
    <img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_3.png" width=500>
 </div>

-> [!TIP]
-> From PyTorch 2.3.1, you can control the caching behavior of `torch.compile()`. This is particularly beneficial for compilation modes like `"max-autotune"` which performs a grid-search over several compilation flags to find the optimal configuration. Learn more in the [Compile Time Caching in torch.compile](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html) tutorial.
-
 ### Prevent graph breaks

 Specifying `fullgraph=True` ensures there are no graph breaks in the underlying model to take full advantage of `torch.compile` without any performance degradation. For the UNet and VAE, this means changing how you access the return variables.
@@ -222,7 +222,7 @@ First, configure all the compiler tags:

 ```python
 from diffusers import StableDiffusionXLPipeline
-import torch
+import torch 

 # Notice the two new flags at the end.
 torch._inductor.config.conv_1x1_as_mm = True
--- a/docs/source/en/tutorials/inference_with_big_models.md
+++ b/docs/source/en/tutorials/inference_with_big_models.md
@@ -1,139 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Working with big models
-
-A modern diffusion model, like [Stable Diffusion XL (SDXL)](../using-diffusers/sdxl), is not just a single model, but a collection of multiple models. SDXL has four different model-level components:
-
-* A variational autoencoder (VAE)
-* Two text encoders
-* A UNet for denoising
-
-Usually, the text encoders and the denoiser are much larger compared to the VAE.
-
-As models get bigger and better, it’s possible your model is so big that even a single copy won’t fit in memory. But that doesn’t mean it can’t be loaded. If you have more than one GPU, there is more memory available to store your model. In this case, it’s better to split your model checkpoint into several smaller *checkpoint shards*.
-
-When a text encoder checkpoint has multiple shards, like [T5-xxl for SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers/tree/main/text_encoder_3), it is automatically handled by the [Transformers](https://huggingface.co/docs/transformers/index) library as it is a required dependency of Diffusers when using the [`StableDiffusion3Pipeline`]. More specifically, Transformers will automatically handle the loading of multiple shards within the requested model class and get it ready so that inference can be performed.
-
-The denoiser checkpoint can also have multiple shards and supports inference thanks to the [Accelerate](https://huggingface.co/docs/accelerate/index) library.
-
-> [!TIP]
-> Refer to the [Handling big models for inference](https://huggingface.co/docs/accelerate/main/en/concept_guides/big_model_inference) guide for general guidance when working with big models that are hard to fit into memory.
-
-For example, let's save a sharded checkpoint for the [SDXL UNet](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/unet):
-
-```python
-from diffusers import UNet2DConditionModel
-
-unet = UNet2DConditionModel.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
-)
-unet.save_pretrained("sdxl-unet-sharded", max_shard_size="5GB")
-```
-
-The size of the fp32 variant of the SDXL UNet checkpoint is ~10.4GB. Set the `max_shard_size` parameter to 5GB to create 3 shards. After saving, you can load them in [`StableDiffusionXLPipeline`]:
-
-```python
-from diffusers import UNet2DConditionModel, StableDiffusionXLPipeline
-import torch
-
-unet = UNet2DConditionModel.from_pretrained(
-    "sayakpaul/sdxl-unet-sharded", torch_dtype=torch.float16
-)
-pipeline = StableDiffusionXLPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16
-).to("cuda")
-
-image = pipeline("a cute dog running on the grass", num_inference_steps=30).images[0]
-image.save("dog.png")
-```
-
-If placing all the model-level components on the GPU at once is not feasible, use [`~DiffusionPipeline.enable_model_cpu_offload`] to help you:
-
-```diff
- pipeline.to("cuda")
-+ pipeline.enable_model_cpu_offload()
-```
-
-In general, we recommend sharding when a checkpoint is more than 5GB (in fp32).
-
-## Device placement
-
-On distributed setups, you can run inference across multiple GPUs with Accelerate.
-
-> [!WARNING]
-> This feature is experimental and its APIs might change in the future.
-
-With Accelerate, you can use the `device_map` to determine how to distribute the models of a pipeline across multiple devices. This is useful in situations where you have more than one GPU.
-
-For example, if you have two 8GB GPUs, then using [`~DiffusionPipeline.enable_model_cpu_offload`] may not work so well because:
-
-* it only works on a single GPU
-* a single model might not fit on a single GPU ([`~DiffusionPipeline.enable_sequential_cpu_offload`] might work but it will be extremely slow and it is also limited to a single GPU)
-
-To make use of both GPUs, you can use the "balanced" device placement strategy which splits the models across all available GPUs.
-
-> [!WARNING]
-> Only the "balanced" strategy is supported at the moment, and we plan to support additional mapping strategies in the future.
-
-```diff
-from diffusers import DiffusionPipeline
-import torch
-
-pipeline = DiffusionPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
-+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, device_map="balanced"
-)
-image = pipeline("a dog").images[0]
-image
-```
-
-You can also pass a dictionary to enforce the maximum GPU memory that can be used on each device:
-
-```diff
-from diffusers import DiffusionPipeline
-import torch
-
-max_memory = {0:"1GB", 1:"1GB"}
-pipeline = DiffusionPipeline.from_pretrained(
-    "runwayml/stable-diffusion-v1-5",
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-    device_map="balanced",
-+   max_memory=max_memory
-)
-image = pipeline("a dog").images[0]
-image
-```
-
-If a device is not present in `max_memory`, then it will be completely ignored and will not participate in the device placement.
-
-By default, Diffusers uses the maximum memory of all devices. If the models don't fit on the GPUs, they are offloaded to the CPU. If the CPU doesn't have enough memory, then you might see an error. In that case, you could defer to using [`~DiffusionPipeline.enable_sequential_cpu_offload`] and [`~DiffusionPipeline.enable_model_cpu_offload`].
-
-Call [`~DiffusionPipeline.reset_device_map`] to reset the `device_map` of a pipeline. This is also necessary if you want to use methods like `to()`, [`~DiffusionPipeline.enable_sequential_cpu_offload`], and [`~DiffusionPipeline.enable_model_cpu_offload`] on a pipeline that was device-mapped.
-
-```py
-pipeline.reset_device_map()
-```
-
-Once a pipeline has been device-mapped, you can also access its device map via `hf_device_map`:
-
-```py
-print(pipeline.hf_device_map)
-```
-
-An example device map would look like so:
-
-
-```bash
-{'unet': 1, 'vae': 1, 'safety_checker': 0, 'text_encoder': 0}
-```
--- a/docs/source/en/tutorials/using_peft_for_inference.md
+++ b/docs/source/en/tutorials/using_peft_for_inference.md
@@ -34,7 +34,7 @@ pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
 pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
 ```

-Next, load a [CiroN2022/toy-face](https://huggingface.co/CiroN2022/toy-face) adapter with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. With the 🤗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which lets you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`.
+Next, load a [CiroN2022/toy-face](https://huggingface.co/CiroN2022/toy-face) adapter with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. With the 🤗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which let's you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`.

 ```python
 pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
@@ -191,7 +191,7 @@ image

 ## Manage active adapters

-You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`~diffusers.loaders.StableDiffusionLoraLoaderMixin.get_active_adapters`] method to check the list of active adapters:
+You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`~diffusers.loaders.LoraLoaderMixin.get_active_adapters`] method to check the list of active adapters:

 ```py
 active_adapters = pipe.get_active_adapters()
@@ -199,7 +199,7 @@ active_adapters
 ["toy", "pixel"]
 ```

-You can also get the active adapters of each pipeline component with [`~diffusers.loaders.StableDiffusionLoraLoaderMixin.get_list_adapters`]:
+You can also get the active adapters of each pipeline component with [`~diffusers.loaders.LoraLoaderMixin.get_list_adapters`]:

 ```py
 list_adapters_component_wise = pipe.get_list_adapters()
--- a/docs/source/en/using-diffusers/callback.md
+++ b/docs/source/en/using-diffusers/callback.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # Pipeline callbacks

-The denoising loop of a pipeline can be modified with custom defined functions using the `callback_on_step_end` parameter. The callback function is executed at the end of each step, and modifies the pipeline attributes and variables for the next step. This is really useful for *dynamically* adjusting certain pipeline attributes or modifying tensor variables. This versatility allows for interesting use cases such as changing the prompt embeddings at each timestep, assigning different weights to the prompt embeddings, and editing the guidance scale. With callbacks, you can implement new features without modifying the underlying code!
+The denoising loop of a pipeline can be modified with custom defined functions using the `callback_on_step_end` parameter. The callback function is executed at the end of each step, and modifies the pipeline attributes and variables for the next step. This is really useful for *dynamically* adjusting certain pipeline attributes or modifying tensor variables. This versatility allows for interesting use-cases such as changing the prompt embeddings at each timestep, assigning different weights to the prompt embeddings, and editing the guidance scale. With callbacks, you can implement new features without modifying the underlying code!

 > [!TIP]
 > 🤗 Diffusers currently only supports `callback_on_step_end`, but feel free to open a [feature request](https://github.com/huggingface/diffusers/issues/new/choose) if you have a cool use-case and require a callback function with a different execution point!
@@ -75,7 +75,7 @@ out.images[0].save("official_callback.png")
    <figcaption class="mt-2 text-center text-sm text-gray-500">without SDXLCFGCutoffCallback</figcaption>
  </div>
  <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/with_cfg_callback.png" alt="generated image of a sports car at the road with cfg callback" />
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/with_cfg_callback.png" alt="generated image of a a sports car at the road with cfg callback" />
    <figcaption class="mt-2 text-center text-sm text-gray-500">with SDXLCFGCutoffCallback</figcaption>
  </div>
 </div>
--- a/docs/source/en/using-diffusers/controlnet.md
+++ b/docs/source/en/using-diffusers/controlnet.md
@@ -256,7 +256,7 @@ make_image_grid([init_image, mask_image, output], rows=1, cols=3)

 ## Guess mode

-[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) does not require supplying a prompt to a ControlNet at all! This forces the ControlNet encoder to do its best to "guess" the contents of the input control map (depth map, pose estimation, canny edge, etc.).
+[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) does not require supplying a prompt to a ControlNet at all! This forces the ControlNet encoder to do it's best to "guess" the contents of the input control map (depth map, pose estimation, canny edge, etc.).

 Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest `DownBlock` corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the `MidBlock` output becomes 1.0.

@@ -506,7 +506,7 @@ make_image_grid([original_image, canny_image], rows=1, cols=2)
 </div>

 For human pose estimation, install [controlnet_aux](https://github.com/patrickvonplaten/controlnet_aux):
-
+  
 ```py
 # uncomment to install the necessary library in Colab
 #!pip install -q controlnet-aux
--- a/docs/source/en/using-diffusers/custom_pipeline_overview.md
+++ b/docs/source/en/using-diffusers/custom_pipeline_overview.md
@@ -147,11 +147,11 @@ prompt = "cat, hiding in the leaves, ((rain)), zazie rainyday, beautiful eyes, m
 neg_prompt = "(deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, mutated hands and fingers:1.4), (deformed, distorted, disfigured:1.3), poorly drawn, bad anatomy, wrong anatomy, extra limb, missing limb, floating limbs, disconnected limbs, mutation, mutated, ugly, disgusting, amputation"
 generator = torch.Generator(device="cpu").manual_seed(20)
 out_lpw = pipe_lpw(
-    prompt,
-    negative_prompt=neg_prompt,
+    prompt, 
+    negative_prompt=neg_prompt, 
    width=512,
    height=512,
-    max_embeddings_multiples=3,
+    max_embeddings_multiples=3, 
    num_inference_steps=50,
    generator=generator,
    ).images[0]
@@ -289,9 +289,9 @@ scheduler = DPMSolverMultistepScheduler.from_pretrained(pipe_id, subfolder="sche
 3. Load an image processor:

 ```python
-from transformers import CLIPImageProcessor
+from transformers import CLIPFeatureExtractor

-feature_extractor = CLIPImageProcessor.from_pretrained(pipe_id, subfolder="feature_extractor")
+feature_extractor = CLIPFeatureExtractor.from_pretrained(pipe_id, subfolder="feature_extractor")
 ```

 <Tip warning={true}>
--- a/docs/source/en/using-diffusers/inference_with_lcm.md
+++ b/docs/source/en/using-diffusers/inference_with_lcm.md
@@ -64,7 +64,7 @@ image
 </hfoption>
 <hfoption id="LCM-LoRA">

-To use LCM-LoRAs, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps.
+To use LCM-LoRAs, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt to generate an image in just 4 steps.

 A couple of notes to keep in mind when using LCM-LoRAs are:

@@ -156,7 +156,7 @@ image
 </hfoption>
 <hfoption id="LCM-LoRA">

-To use LCM-LoRAs for image-to-image, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps.
+To use LCM-LoRAs for image-to-image, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt and initial image to generate an image in just 4 steps.

 > [!TIP]
 > Experiment with different values for `num_inference_steps`, `strength`, and `guidance_scale` to get the best results.
@@ -207,7 +207,7 @@ image

 ## Inpainting

-To use LCM-LoRAs for inpainting, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt, initial image, and mask image to generate an image in just 4 steps.
+To use LCM-LoRAs for inpainting, you need to replace the scheduler with the [`LCMScheduler`] and load the LCM-LoRA weights with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method. Then you can use the pipeline as usual, and pass a text prompt, initial image, and mask image to generate an image in just 4 steps.

 ```py
 import torch
@@ -235,7 +235,7 @@ image = pipe(
    mask_image=mask_image,
    generator=generator,
    num_inference_steps=4,
-    guidance_scale=4,
+    guidance_scale=4, 
 ).images[0]
 image
 ```
@@ -262,7 +262,7 @@ LCMs are compatible with adapters like LoRA, ControlNet, T2I-Adapter, and Animat
 <hfoptions id="lcm-lora">
 <hfoption id="LCM">

-Load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LoRA weights into the LCM and generate a styled image in a few steps.
+Load the LCM checkpoint for your supported model into [`UNet2DConditionModel`] and replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.LoraLoaderMixin.load_lora_weights`] method to load the LoRA weights into the LCM and generate a styled image in a few steps.

 ```python
 from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler
@@ -294,7 +294,7 @@ image
 </hfoption>
 <hfoption id="LCM-LoRA">

-Replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights and the style LoRA you want to use. Combine both LoRA adapters with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method and generate a styled image in a few steps.
+Replace the scheduler with the [`LCMScheduler`]. Then you can use the [`~loaders.LoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights and the style LoRA you want to use. Combine both LoRA adapters with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method and generate a styled image in a few steps.

 ```py
 import torch
@@ -389,7 +389,7 @@ make_image_grid([canny_image, image], rows=1, cols=2)
 </hfoption>
 <hfoption id="LCM-LoRA">

-Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`]. Then you can load a Stable Diffusion v1.5 model into [`StableDiffusionControlNetPipeline`] and replace the scheduler with the [`LCMScheduler`]. Use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights, and pass the canny image to the pipeline and generate an image.
+Load a ControlNet model trained on canny images and pass it to the [`ControlNetModel`]. Then you can load a Stable Diffusion v1.5 model into [`StableDiffusionControlNetPipeline`] and replace the scheduler with the [`LCMScheduler`]. Use the [`~loaders.LoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights, and pass the canny image to the pipeline and generate an image.

 > [!TIP]
 > Experiment with different values for `num_inference_steps`, `controlnet_conditioning_scale`, `cross_attention_kwargs`, and `guidance_scale` to get the best results.
@@ -497,7 +497,7 @@ pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    unet=unet,
    adapter=adapter,
    torch_dtype=torch.float16,
-    variant="fp16",
+    variant="fp16", 
 ).to("cuda")

 pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
@@ -512,7 +512,7 @@ image = pipe(
    image=canny_image,
    num_inference_steps=4,
    guidance_scale=5,
-    adapter_conditioning_scale=0.8,
+    adapter_conditioning_scale=0.8, 
    adapter_conditioning_factor=1,
    generator=generator,
 ).images[0]
@@ -525,7 +525,7 @@ image = pipe(
 </hfoption>
 <hfoption id="LCM-LoRA">

-Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`]. Replace the scheduler with the [`LCMScheduler`], and use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights. Pass the canny image to the pipeline and generate an image.
+Load a T2IAdapter trained on canny images and pass it to the [`StableDiffusionXLAdapterPipeline`]. Replace the scheduler with the [`LCMScheduler`], and use the [`~loaders.LoraLoaderMixin.load_lora_weights`] method to load the LCM-LoRA weights. Pass the canny image to the pipeline and generate an image.

 ```py
 import torch
@@ -554,10 +554,10 @@ canny_image = Image.fromarray(image).resize((1024, 1024))
 adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, varient="fp16").to("cuda")

 pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0",
+    "stabilityai/stable-diffusion-xl-base-1.0", 
    adapter=adapter,
    torch_dtype=torch.float16,
-    variant="fp16",
+    variant="fp16", 
 ).to("cuda")

 pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
@@ -573,8 +573,8 @@ image = pipe(
    negative_prompt=negative_prompt,
    image=canny_image,
    num_inference_steps=4,
-    guidance_scale=1.5,
-    adapter_conditioning_scale=0.8,
+    guidance_scale=1.5, 
+    adapter_conditioning_scale=0.8, 
    adapter_conditioning_factor=1,
    generator=generator,
 ).images[0]
--- a/docs/source/en/using-diffusers/inference_with_tcd_lora.md
+++ b/docs/source/en/using-diffusers/inference_with_tcd_lora.md
@@ -212,14 +212,14 @@ TCD-LoRA is very versatile, and it can be combined with other adapter types like
 import torch
 import numpy as np
 from PIL import Image
-from transformers import DPTImageProcessor, DPTForDepthEstimation
+from transformers import DPTFeatureExtractor, DPTForDepthEstimation
 from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
 from diffusers.utils import load_image, make_image_grid
 from scheduling_tcd import TCDScheduler

 device = "cuda"
 depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to(device)
-feature_extractor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")
+feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-hybrid-midas")

 def get_depth_map(image):
    image = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device)