Mirror of https://github.com/huggingface/diffusers.git (synced 2025-12-09 22:14:43 +08:00)

Compare commits: 29 commits (debug-add-...t2iadapter)
| SHA1 |
|---|
| e2e633635e |
| 616c3ff2d7 |
| 1ad175c064 |
| 33e2e96f11 |
| 732cb3e2b5 |
| b3d6a5eda8 |
| d5933c2603 |
| 52bc39c997 |
| 54e683f773 |
| 66955beba6 |
| 5c98d93767 |
| 99788b247f |
| 6c5d3766d7 |
| a893a43231 |
| d44c0b0182 |
| 5eead70b93 |
| 8ba5c7600e |
| 43e9415e80 |
| dda2cdeaa5 |
| 22f82a07a5 |
| a818aeb50a |
| 9da50cabb4 |
| b10475b3fa |
| 88c842d3d5 |
| e262231ed0 |
| 2189e8a4e3 |
| 329053892a |
| a554f5a7ee |
| af66e4819b |
46 .github/ISSUE_TEMPLATE/bug-report.yml
@@ -13,9 +13,8 @@ body:
|
||||
*Give your issue a fitting title. Assume that someone with very limited knowledge of diffusers can understand your issue. Add links to the source code, documentation, other issues, pull requests, etc.*
|
||||
- 2. If your issue is about something not working, **always** provide a reproducible code snippet. The reader should be able to reproduce your issue by **only copy-pasting your code snippet into a Python shell**.
|
||||
*The community cannot solve your issue if it cannot reproduce it. If your bug is related to training, add your training script and make everything needed to train public. Otherwise, just add a simple Python code snippet.*
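For example, a minimal, copy-pasteable snippet of this kind might look as follows (the checkpoint and prompt are arbitrary placeholders, not a prescribed reproduction):

```python
import torch
from diffusers import DiffusionPipeline

# minimal reproduction: load a pipeline and run a single prompt
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("repro.png")
```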
|
||||
- 3. Add the **minimum** amount of code / context that is needed to understand, reproduce your issue.
|
||||
- 3. Add the **minimum amount of code / context that is needed to understand, reproduce your issue**.
|
||||
*Make the life of maintainers easy. `diffusers` is getting many issues every day. Make sure your issue is about one bug and one bug only. Make sure you add only the context, code needed to understand your issues - nothing more. Generally, every issue is a way of documenting this library, try to make it a good documentation entry.*
|
||||
- 4. For issues related to community pipelines (i.e., the pipelines located in the `examples/community` folder), please tag the author of the pipeline in your issue thread as those pipelines are not maintained.
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
@@ -61,46 +60,21 @@ body:
|
||||
All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and
|
||||
a core maintainer will ping the right person.
|
||||
|
||||
Please tag a maximum of 2 people.
|
||||
Please tag fewer than 3 people.
|
||||
|
||||
General library related questions: @patrickvonplaten and @sayakpaul
|
||||
|
||||
Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...):
|
||||
Questions on the training examples: @williamberman, @sayakpaul, @yiyixuxu
|
||||
|
||||
Questions on pipelines:
|
||||
- Stable Diffusion @yiyixuxu @DN6 @patrickvonplaten @sayakpaul
|
||||
- Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 @patrickvonplaten
|
||||
- Kandinsky @yiyixuxu @patrickvonplaten
|
||||
- ControlNet @sayakpaul @yiyixuxu @DN6 @patrickvonplaten
|
||||
- T2I Adapter @sayakpaul @yiyixuxu @DN6 @patrickvonplaten
|
||||
- IF @DN6 @patrickvonplaten
|
||||
- Text-to-Video / Video-to-Video @DN6 @sayakpaul @patrickvonplaten
|
||||
- Wuerstchen @DN6 @patrickvonplaten
|
||||
- Other: @yiyixuxu @DN6
|
||||
Questions on memory optimizations, LoRA, float16, etc.: @williamberman, @patrickvonplaten, and @sayakpaul
|
||||
|
||||
Questions on models:
|
||||
- UNet @DN6 @yiyixuxu @sayakpaul @patrickvonplaten
|
||||
- VAE @sayakpaul @DN6 @yiyixuxu @patrickvonplaten
|
||||
- Transformers/Attention @DN6 @yiyixuxu @sayakpaul @patrickvonplaten
|
||||
Questions on schedulers: @patrickvonplaten and @williamberman
|
||||
|
||||
Questions on Schedulers: @yiyixuxu @patrickvonplaten
|
||||
|
||||
Questions on LoRA: @sayakpaul @patrickvonplaten
|
||||
|
||||
Questions on Textual Inversion: @sayakpaul @patrickvonplaten
|
||||
|
||||
Questions on Training:
|
||||
- DreamBooth @sayakpaul @patrickvonplaten
|
||||
- Text-to-Image Fine-tuning @sayakpaul @patrickvonplaten
|
||||
- Textual Inversion @sayakpaul @patrickvonplaten
|
||||
- ControlNet @sayakpaul @patrickvonplaten
|
||||
|
||||
Questions on Tests: @DN6 @sayakpaul @yiyixuxu
|
||||
|
||||
Questions on Documentation: @stevhliu
|
||||
Questions on models and pipelines: @patrickvonplaten, @sayakpaul, and @williamberman
|
||||
|
||||
Questions on JAX- and MPS-related things: @pcuenca
|
||||
|
||||
Questions on audio pipelines: @DN6 @patrickvonplaten
|
||||
|
||||
|
||||
Questions on audio pipelines: @patrickvonplaten, @kashif, and @sanchit-gandhi
|
||||
|
||||
Documentation: @stevhliu and @yiyixuxu
|
||||
placeholder: "@Username ..."
|
||||
|
||||
2 .github/PULL_REQUEST_TEMPLATE.md
@@ -41,7 +41,7 @@ Core library:
|
||||
- Schedulers: @williamberman and @patrickvonplaten
|
||||
- Pipelines: @patrickvonplaten and @sayakpaul
|
||||
- Training examples: @sayakpaul and @patrickvonplaten
|
||||
- Docs: @stevhliu and @yiyixuxu
|
||||
- Docs: @stevenliu and @yiyixu
|
||||
- JAX and MPS: @pcuenca
|
||||
- Audio: @sanchit-gandhi
|
||||
- General functionalities: @patrickvonplaten and @sayakpaul
|
||||
|
||||
2 .github/workflows/build_docker_images.yml
@@ -26,8 +26,6 @@ jobs:
|
||||
image-name:
|
||||
- diffusers-pytorch-cpu
|
||||
- diffusers-pytorch-cuda
|
||||
- diffusers-pytorch-compile-cuda
|
||||
- diffusers-pytorch-xformers-cuda
|
||||
- diffusers-flax-cpu
|
||||
- diffusers-flax-tpu
|
||||
- diffusers-onnxruntime-cpu
|
||||
|
||||
2 .github/workflows/pr_dependency_test.yml
@@ -20,7 +20,7 @@ jobs:
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: "3.8"
|
||||
python-version: "3.7"
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install --upgrade pip
|
||||
|
||||
4 .github/workflows/pr_quality.yml
@@ -20,7 +20,7 @@ jobs:
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: "3.8"
|
||||
python-version: "3.7"
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install --upgrade pip
|
||||
@@ -38,7 +38,7 @@ jobs:
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v4
|
||||
with:
|
||||
python-version: "3.8"
|
||||
python-version: "3.7"
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install --upgrade pip
|
||||
|
||||
67 .github/workflows/pr_test_peft_backend.yml
@@ -1,67 +0,0 @@
|
||||
name: Fast tests for PRs - PEFT backend
|
||||
|
||||
on:
|
||||
pull_request:
|
||||
branches:
|
||||
- main
|
||||
|
||||
concurrency:
|
||||
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
|
||||
cancel-in-progress: true
|
||||
|
||||
env:
|
||||
DIFFUSERS_IS_CI: yes
|
||||
OMP_NUM_THREADS: 4
|
||||
MKL_NUM_THREADS: 4
|
||||
PYTEST_TIMEOUT: 60
|
||||
|
||||
jobs:
|
||||
run_fast_tests:
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
config:
|
||||
- name: LoRA
|
||||
framework: lora
|
||||
runner: docker-cpu
|
||||
image: diffusers/diffusers-pytorch-cpu
|
||||
report: torch_cpu_lora
|
||||
|
||||
|
||||
name: ${{ matrix.config.name }}
|
||||
|
||||
runs-on: ${{ matrix.config.runner }}
|
||||
|
||||
container:
|
||||
image: ${{ matrix.config.image }}
|
||||
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
|
||||
|
||||
defaults:
|
||||
run:
|
||||
shell: bash
|
||||
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
apt-get update && apt-get install libsndfile1-dev libgl1 -y
|
||||
python -m pip install -e .[quality,test]
|
||||
python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
python -m pip install -U git+https://github.com/huggingface/transformers.git
|
||||
python -m pip install -U git+https://github.com/huggingface/peft.git
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
|
||||
- name: Run fast PyTorch LoRA CPU tests with PEFT backend
|
||||
if: ${{ matrix.config.framework == 'lora' }}
|
||||
run: |
|
||||
python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
|
||||
-s -v \
|
||||
--make-reports=tests_${{ matrix.config.report }} \
|
||||
tests/lora/test_lora_layers_peft.py
|
||||
16 .github/workflows/pr_tests.yml
@@ -34,11 +34,6 @@ jobs:
|
||||
runner: docker-cpu
|
||||
image: diffusers/diffusers-pytorch-cpu
|
||||
report: torch_cpu_models_schedulers
|
||||
- name: LoRA
|
||||
framework: lora
|
||||
runner: docker-cpu
|
||||
image: diffusers/diffusers-pytorch-cpu
|
||||
report: torch_cpu_lora
|
||||
- name: Fast Flax CPU tests
|
||||
framework: flax
|
||||
runner: docker-cpu
|
||||
@@ -72,7 +67,6 @@ jobs:
|
||||
run: |
|
||||
apt-get update && apt-get install libsndfile1-dev libgl1 -y
|
||||
python -m pip install -e .[quality,test]
|
||||
python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
@@ -94,14 +88,6 @@ jobs:
|
||||
--make-reports=tests_${{ matrix.config.report }} \
|
||||
tests/models tests/schedulers tests/others
|
||||
|
||||
- name: Run fast PyTorch LoRA CPU tests
|
||||
if: ${{ matrix.config.framework == 'lora' }}
|
||||
run: |
|
||||
python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
|
||||
-s -v -k "not Flax and not Onnx and not Dependency" \
|
||||
--make-reports=tests_${{ matrix.config.report }} \
|
||||
tests/lora
|
||||
|
||||
- name: Run fast Flax TPU tests
|
||||
if: ${{ matrix.config.framework == 'flax' }}
|
||||
run: |
|
||||
@@ -183,4 +169,4 @@ jobs:
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: pr_${{ matrix.config.report }}_test_reports
|
||||
path: reports
|
||||
path: reports
|
||||
306 .github/workflows/push_tests.yml
@@ -1,11 +1,10 @@
|
||||
name: Slow Tests on main
|
||||
name: Slow tests on main
|
||||
|
||||
on:
|
||||
push:
|
||||
branches:
|
||||
- main
|
||||
|
||||
|
||||
env:
|
||||
DIFFUSERS_IS_CI: yes
|
||||
HF_HOME: /mnt/cache
|
||||
@@ -13,321 +12,104 @@ env:
|
||||
MKL_NUM_THREADS: 8
|
||||
PYTEST_TIMEOUT: 600
|
||||
RUN_SLOW: yes
|
||||
PIPELINE_USAGE_CUTOFF: 50000
|
||||
|
||||
jobs:
|
||||
setup_torch_cuda_pipeline_matrix:
|
||||
name: Setup Torch Pipelines CUDA Slow Tests Matrix
|
||||
runs-on: docker-gpu
|
||||
container:
|
||||
image: diffusers/diffusers-pytorch-cpu # this is a CPU image, but we need it to fetch the matrix
|
||||
options: --shm-size "16gb" --ipc host
|
||||
outputs:
|
||||
pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }}
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
apt-get update && apt-get install libsndfile1-dev libgl1 -y
|
||||
python -m pip install -e .[quality,test]
|
||||
python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
|
||||
- name: Fetch Pipeline Matrix
|
||||
id: fetch_pipeline_matrix
|
||||
run: |
|
||||
matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py)
|
||||
echo $matrix
|
||||
echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT
|
||||
|
||||
- name: Pipeline Tests Artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: test-pipelines.json
|
||||
path: reports
|
||||
|
||||
torch_pipelines_cuda_tests:
|
||||
name: Torch Pipelines CUDA Slow Tests
|
||||
needs: setup_torch_cuda_pipeline_matrix
|
||||
run_slow_tests:
|
||||
strategy:
|
||||
fail-fast: false
|
||||
max-parallel: 1
|
||||
matrix:
|
||||
module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }}
|
||||
runs-on: docker-gpu
|
||||
container:
|
||||
image: diffusers/diffusers-pytorch-cuda
|
||||
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
- name: NVIDIA-SMI
|
||||
run: |
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
apt-get update && apt-get install libsndfile1-dev libgl1 -y
|
||||
python -m pip install -e .[quality,test]
|
||||
python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
- name: Slow PyTorch CUDA checkpoint tests on Ubuntu
|
||||
env:
|
||||
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
|
||||
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
|
||||
CUBLAS_WORKSPACE_CONFIG: :16:8
|
||||
run: |
|
||||
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
|
||||
-s -v -k "not Flax and not Onnx" \
|
||||
--make-reports=tests_pipeline_${{ matrix.module }}_cuda \
|
||||
tests/pipelines/${{ matrix.module }}
|
||||
- name: Failure short reports
|
||||
if: ${{ failure() }}
|
||||
run: |
|
||||
cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt
|
||||
cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt
|
||||
config:
|
||||
- name: Slow PyTorch CUDA tests on Ubuntu
|
||||
framework: pytorch
|
||||
runner: docker-gpu
|
||||
image: diffusers/diffusers-pytorch-cuda
|
||||
report: torch_cuda
|
||||
- name: Slow Flax TPU tests on Ubuntu
|
||||
framework: flax
|
||||
runner: docker-tpu
|
||||
image: diffusers/diffusers-flax-tpu
|
||||
report: flax_tpu
|
||||
- name: Slow ONNXRuntime CUDA tests on Ubuntu
|
||||
framework: onnxruntime
|
||||
runner: docker-gpu
|
||||
image: diffusers/diffusers-onnxruntime-cuda
|
||||
report: onnx_cuda
|
||||
|
||||
- name: Test suite reports artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: pipeline_${{ matrix.module }}_test_reports
|
||||
path: reports
|
||||
name: ${{ matrix.config.name }}
|
||||
|
||||
runs-on: ${{ matrix.config.runner }}
|
||||
|
||||
torch_cuda_tests:
|
||||
name: Torch CUDA Tests
|
||||
runs-on: docker-gpu
|
||||
container:
|
||||
image: diffusers/diffusers-pytorch-cuda
|
||||
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
|
||||
image: ${{ matrix.config.image }}
|
||||
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ ${{ matrix.config.runner == 'docker-tpu' && '--privileged' || '--gpus 0'}}
|
||||
|
||||
defaults:
|
||||
run:
|
||||
shell: bash
|
||||
strategy:
|
||||
matrix:
|
||||
module: [models, schedulers, lora, others]
|
||||
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
|
||||
- name: NVIDIA-SMI
|
||||
if: ${{ matrix.config.runner == 'docker-gpu' }}
|
||||
run: |
|
||||
nvidia-smi
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
apt-get update && apt-get install libsndfile1-dev libgl1 -y
|
||||
python -m pip install -e .[quality,test]
|
||||
python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
|
||||
- name: Run slow PyTorch CUDA tests
|
||||
if: ${{ matrix.config.framework == 'pytorch' }}
|
||||
env:
|
||||
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
|
||||
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
|
||||
CUBLAS_WORKSPACE_CONFIG: :16:8
|
||||
CUBLAS_WORKSPACE_CONFIG: :16:8
|
||||
|
||||
run: |
|
||||
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
|
||||
-s -v -k "not Flax and not Onnx" \
|
||||
--make-reports=tests_torch_cuda \
|
||||
tests/${{ matrix.module }}
|
||||
|
||||
- name: Failure short reports
|
||||
if: ${{ failure() }}
|
||||
run: |
|
||||
cat reports/tests_torch_cuda_stats.txt
|
||||
cat reports/tests_torch_cuda_failures_short.txt
|
||||
|
||||
- name: Test suite reports artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: torch_cuda_test_reports
|
||||
path: reports
|
||||
|
||||
flax_tpu_tests:
|
||||
name: Flax TPU Tests
|
||||
runs-on: docker-tpu
|
||||
container:
|
||||
image: diffusers/diffusers-flax-tpu
|
||||
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --privileged
|
||||
defaults:
|
||||
run:
|
||||
shell: bash
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
apt-get update && apt-get install libsndfile1-dev libgl1 -y
|
||||
python -m pip install -e .[quality,test]
|
||||
python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
--make-reports=tests_${{ matrix.config.report }} \
|
||||
tests/
|
||||
|
||||
- name: Run slow Flax TPU tests
|
||||
if: ${{ matrix.config.framework == 'flax' }}
|
||||
env:
|
||||
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
|
||||
run: |
|
||||
python -m pytest -n 0 \
|
||||
-s -v -k "Flax" \
|
||||
--make-reports=tests_flax_tpu \
|
||||
--make-reports=tests_${{ matrix.config.report }} \
|
||||
tests/
|
||||
|
||||
- name: Failure short reports
|
||||
if: ${{ failure() }}
|
||||
run: |
|
||||
cat reports/tests_flax_tpu_stats.txt
|
||||
cat reports/tests_flax_tpu_failures_short.txt
|
||||
|
||||
- name: Test suite reports artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: flax_tpu_test_reports
|
||||
path: reports
|
||||
|
||||
onnx_cuda_tests:
|
||||
name: ONNX CUDA Tests
|
||||
runs-on: docker-gpu
|
||||
container:
|
||||
image: diffusers/diffusers-onnxruntime-cuda
|
||||
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --gpus 0
|
||||
defaults:
|
||||
run:
|
||||
shell: bash
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
apt-get update && apt-get install libsndfile1-dev libgl1 -y
|
||||
python -m pip install -e .[quality,test]
|
||||
python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
|
||||
- name: Run slow ONNXRuntime CUDA tests
|
||||
if: ${{ matrix.config.framework == 'onnxruntime' }}
|
||||
env:
|
||||
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
|
||||
run: |
|
||||
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
|
||||
-s -v -k "Onnx" \
|
||||
--make-reports=tests_onnx_cuda \
|
||||
--make-reports=tests_${{ matrix.config.report }} \
|
||||
tests/
|
||||
|
||||
- name: Failure short reports
|
||||
if: ${{ failure() }}
|
||||
run: |
|
||||
cat reports/tests_onnx_cuda_stats.txt
|
||||
cat reports/tests_onnx_cuda_failures_short.txt
|
||||
run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt
|
||||
|
||||
- name: Test suite reports artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: onnx_cuda_test_reports
|
||||
path: reports
|
||||
|
||||
run_torch_compile_tests:
|
||||
name: PyTorch Compile CUDA tests
|
||||
|
||||
runs-on: docker-gpu
|
||||
|
||||
container:
|
||||
image: diffusers/diffusers-pytorch-compile-cuda
|
||||
options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
|
||||
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
|
||||
- name: NVIDIA-SMI
|
||||
run: |
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install -e .[quality,test,training]
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
|
||||
run: |
|
||||
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
|
||||
- name: Failure short reports
|
||||
if: ${{ failure() }}
|
||||
run: cat reports/tests_torch_compile_cuda_failures_short.txt
|
||||
|
||||
- name: Test suite reports artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: torch_compile_test_reports
|
||||
path: reports
|
||||
|
||||
run_xformers_tests:
|
||||
name: PyTorch xformers CUDA tests
|
||||
|
||||
runs-on: docker-gpu
|
||||
|
||||
container:
|
||||
image: diffusers/diffusers-pytorch-xformers-cuda
|
||||
options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
|
||||
|
||||
steps:
|
||||
- name: Checkout diffusers
|
||||
uses: actions/checkout@v3
|
||||
with:
|
||||
fetch-depth: 2
|
||||
|
||||
- name: NVIDIA-SMI
|
||||
run: |
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install -e .[quality,test,training]
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
|
||||
run: |
|
||||
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/
|
||||
- name: Failure short reports
|
||||
if: ${{ failure() }}
|
||||
run: cat reports/tests_torch_xformers_cuda_failures_short.txt
|
||||
|
||||
- name: Test suite reports artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: torch_xformers_test_reports
|
||||
name: ${{ matrix.config.report }}_test_reports
|
||||
path: reports
|
||||
|
||||
run_examples_tests:
|
||||
@@ -365,13 +147,11 @@ jobs:
|
||||
|
||||
- name: Failure short reports
|
||||
if: ${{ failure() }}
|
||||
run: |
|
||||
cat reports/examples_torch_cuda_stats.txt
|
||||
cat reports/examples_torch_cuda_failures_short.txt
|
||||
run: cat reports/examples_torch_cuda_failures_short.txt
|
||||
|
||||
- name: Test suite reports artifacts
|
||||
if: ${{ always() }}
|
||||
uses: actions/upload-artifact@v2
|
||||
with:
|
||||
name: examples_test_reports
|
||||
path: reports
|
||||
path: reports
|
||||
|
||||
2 .github/workflows/push_tests_mps.yml
@@ -40,7 +40,7 @@ jobs:
|
||||
${CONDA_RUN} python -m pip install --upgrade pip
|
||||
${CONDA_RUN} python -m pip install -e .[quality,test]
|
||||
${CONDA_RUN} python -m pip install torch torchvision torchaudio
|
||||
${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate.git
|
||||
${CONDA_RUN} python -m pip install accelerate --upgrade
|
||||
${CONDA_RUN} python -m pip install transformers --upgrade
|
||||
|
||||
- name: Environment
|
||||
|
||||
2 .github/workflows/stale.yml
@@ -17,7 +17,7 @@ jobs:
|
||||
- name: Setup Python
|
||||
uses: actions/setup-python@v1
|
||||
with:
|
||||
python-version: 3.8
|
||||
python-version: 3.7
|
||||
|
||||
- name: Install requirements
|
||||
run: |
|
||||
|
||||
@@ -40,7 +40,7 @@ In the following, we give an overview of different ways to contribute, ranked by
|
||||
As said before, **all contributions are valuable to the community**.
|
||||
In the following, we will explain each contribution a bit more in detail.
|
||||
|
||||
For all contributions 4.-9. you will need to open a PR. It is explained in detail how to do so in [Opening a pull request](#how-to-open-a-pr)
|
||||
For all contributions 4.-9. you will need to open a PR. It is explained in detail how to do so in [Opening a pull requst](#how-to-open-a-pr)
|
||||
|
||||
### 1. Asking and answering questions on the Diffusers discussion forum or on the Diffusers Discord
|
||||
|
||||
@@ -63,7 +63,7 @@ In the same spirit, you are of immense help to the community by answering such q
|
||||
|
||||
**Please** keep in mind that the more effort you put into asking or answering a question, the higher
|
||||
the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database.
|
||||
In short, a high-quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
|
||||
In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accesible*, and *well-formated/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
|
||||
|
||||
**NOTE about channels**:
|
||||
[*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago.
|
||||
@@ -168,7 +168,7 @@ more precise, provide the link to a duplicated issue or redirect them to [the fo
|
||||
If you have verified that the issued bug report is correct and requires a correction in the source code,
|
||||
please have a look at the next sections.
|
||||
|
||||
For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull request](#how-to-open-a-pr) section.
|
||||
For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull requst](#how-to-open-a-pr) section.
|
||||
|
||||
### 4. Fixing a "Good first issue"
|
||||
|
||||
|
||||
@@ -10,9 +10,6 @@
|
||||
<a href="https://github.com/huggingface/diffusers/releases">
|
||||
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
|
||||
</a>
|
||||
<a href="https://pepy.tech/project/diffusers">
|
||||
<img alt="GitHub release" src="https://static.pepy.tech/badge/diffusers/month">
|
||||
</a>
|
||||
<a href="CODE_OF_CONDUCT.md">
|
||||
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
|
||||
</a>
|
||||
|
||||
@@ -1,46 +0,0 @@
|
||||
FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
|
||||
LABEL maintainer="Hugging Face"
|
||||
LABEL repository="diffusers"
|
||||
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
|
||||
RUN apt update && \
|
||||
apt install -y bash \
|
||||
build-essential \
|
||||
git \
|
||||
git-lfs \
|
||||
curl \
|
||||
ca-certificates \
|
||||
libsndfile1-dev \
|
||||
libgl1 \
|
||||
python3.9 \
|
||||
python3.9-dev \
|
||||
python3-pip \
|
||||
python3.9-venv && \
|
||||
rm -rf /var/lib/apt/lists
|
||||
|
||||
# make sure to use venv
|
||||
RUN python3.9 -m venv /opt/venv
|
||||
ENV PATH="/opt/venv/bin:$PATH"
|
||||
|
||||
# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
|
||||
RUN python3.9 -m pip install --no-cache-dir --upgrade pip && \
|
||||
python3.9 -m pip install --no-cache-dir \
|
||||
torch \
|
||||
torchvision \
|
||||
torchaudio \
|
||||
invisible_watermark && \
|
||||
python3.9 -m pip install --no-cache-dir \
|
||||
accelerate \
|
||||
datasets \
|
||||
hf-doc-builder \
|
||||
huggingface-hub \
|
||||
Jinja2 \
|
||||
librosa \
|
||||
numpy \
|
||||
scipy \
|
||||
tensorboard \
|
||||
transformers \
|
||||
omegaconf
|
||||
|
||||
CMD ["/bin/bash"]
|
||||
@@ -1,4 +1,4 @@
|
||||
FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
|
||||
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04
|
||||
LABEL maintainer="Hugging Face"
|
||||
LABEL repository="diffusers"
|
||||
|
||||
@@ -6,16 +6,16 @@ ENV DEBIAN_FRONTEND=noninteractive
|
||||
|
||||
RUN apt update && \
|
||||
apt install -y bash \
|
||||
build-essential \
|
||||
git \
|
||||
git-lfs \
|
||||
curl \
|
||||
ca-certificates \
|
||||
libsndfile1-dev \
|
||||
libgl1 \
|
||||
python3.8 \
|
||||
python3-pip \
|
||||
python3.8-venv && \
|
||||
build-essential \
|
||||
git \
|
||||
git-lfs \
|
||||
curl \
|
||||
ca-certificates \
|
||||
libsndfile1-dev \
|
||||
libgl1 \
|
||||
python3.8 \
|
||||
python3-pip \
|
||||
python3.8-venv && \
|
||||
rm -rf /var/lib/apt/lists
|
||||
|
||||
# make sure to use venv
|
||||
@@ -25,21 +25,23 @@ ENV PATH="/opt/venv/bin:$PATH"
|
||||
# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
|
||||
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
|
||||
python3 -m pip install --no-cache-dir \
|
||||
torch \
|
||||
torchvision \
|
||||
torchaudio \
|
||||
invisible_watermark && \
|
||||
torch \
|
||||
torchvision \
|
||||
torchaudio \
|
||||
invisible_watermark && \
|
||||
python3 -m pip install --no-cache-dir \
|
||||
accelerate \
|
||||
datasets \
|
||||
hf-doc-builder \
|
||||
huggingface-hub \
|
||||
Jinja2 \
|
||||
librosa \
|
||||
numpy \
|
||||
scipy \
|
||||
tensorboard \
|
||||
transformers \
|
||||
omegaconf
|
||||
accelerate \
|
||||
datasets \
|
||||
hf-doc-builder \
|
||||
huggingface-hub \
|
||||
Jinja2 \
|
||||
librosa \
|
||||
numpy \
|
||||
scipy \
|
||||
tensorboard \
|
||||
transformers \
|
||||
omegaconf \
|
||||
pytorch-lightning \
|
||||
xformers
|
||||
|
||||
CMD ["/bin/bash"]
|
||||
|
||||
@@ -1,46 +0,0 @@
|
||||
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04
|
||||
LABEL maintainer="Hugging Face"
|
||||
LABEL repository="diffusers"
|
||||
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
|
||||
RUN apt update && \
|
||||
apt install -y bash \
|
||||
build-essential \
|
||||
git \
|
||||
git-lfs \
|
||||
curl \
|
||||
ca-certificates \
|
||||
libsndfile1-dev \
|
||||
libgl1 \
|
||||
python3.8 \
|
||||
python3-pip \
|
||||
python3.8-venv && \
|
||||
rm -rf /var/lib/apt/lists
|
||||
|
||||
# make sure to use venv
|
||||
RUN python3 -m venv /opt/venv
|
||||
ENV PATH="/opt/venv/bin:$PATH"
|
||||
|
||||
# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
|
||||
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
|
||||
python3 -m pip install --no-cache-dir \
|
||||
torch==2.0.1 \
|
||||
torchvision==0.15.2 \
|
||||
torchaudio \
|
||||
invisible_watermark && \
|
||||
python3 -m pip install --no-cache-dir \
|
||||
accelerate \
|
||||
datasets \
|
||||
hf-doc-builder \
|
||||
huggingface-hub \
|
||||
Jinja2 \
|
||||
librosa \
|
||||
numpy \
|
||||
scipy \
|
||||
tensorboard \
|
||||
transformers \
|
||||
omegaconf \
|
||||
xformers
|
||||
|
||||
CMD ["/bin/bash"]
|
||||
@@ -128,7 +128,7 @@ When adding a new pipeline:
|
||||
- Possibly an end-to-end example of how to use it
|
||||
- Add all the pipeline classes that should be linked in the diffusion model. These classes should be added using our Markdown syntax. By default as follows:
|
||||
|
||||
```py
|
||||
```
|
||||
## XXXPipeline
|
||||
|
||||
[[autodoc]] XXXPipeline
|
||||
@@ -138,7 +138,7 @@ When adding a new pipeline:
|
||||
|
||||
This will include every public method of the pipeline that is documented, as well as the `__call__` method that is not documented by default. If you just want to add additional methods that are not documented, you can put the list of all methods to add in a list that contains `all`.
|
||||
|
||||
```py
|
||||
```
|
||||
[[autodoc]] XXXPipeline
|
||||
- all
|
||||
- __call__
|
||||
@@ -172,7 +172,7 @@ Arguments should be defined with the `Args:` (or `Arguments:` or `Parameters:`)
|
||||
an indentation. The argument should be followed by its type, with its shape if it is a tensor, a colon, and its
|
||||
description:
|
||||
|
||||
```py
|
||||
```
|
||||
Args:
|
||||
n_layers (`int`): The number of layers of the model.
|
||||
```
|
||||
@@ -182,7 +182,7 @@ after the argument.
|
||||
|
||||
Here's an example showcasing everything so far:
|
||||
|
||||
```py
|
||||
```
|
||||
Args:
|
||||
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
|
||||
Indices of input sequence tokens in the vocabulary.
|
||||
@@ -196,13 +196,13 @@ Here's an example showcasing everything so far:
|
||||
For optional arguments or arguments with defaults we follow the following syntax: imagine we have a function with the
|
||||
following signature:
|
||||
|
||||
```py
|
||||
```
|
||||
def my_function(x: str = None, a: float = 1):
|
||||
```
|
||||
|
||||
then its documentation should look like this:
|
||||
|
||||
```py
|
||||
```
|
||||
Args:
|
||||
x (`str`, *optional*):
|
||||
This argument controls ...
|
||||
@@ -235,14 +235,14 @@ building the return.
|
||||
|
||||
Here's an example of a single value return:
|
||||
|
||||
```py
|
||||
```
|
||||
Returns:
|
||||
`List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.
|
||||
```
|
||||
|
||||
Here's an example of a tuple return, comprising several objects:
|
||||
|
||||
```py
|
||||
```
|
||||
Returns:
|
||||
`tuple(torch.FloatTensor)` comprising various elements depending on the configuration ([`BertConfig`]) and inputs:
|
||||
- **loss** (*optional*, returned when `masked_lm_labels` is provided) `torch.FloatTensor` of shape `(1,)` --
|
||||
|
||||
@@ -17,8 +17,6 @@
|
||||
title: AutoPipeline
|
||||
- local: tutorials/basic_training
|
||||
title: Train a diffusion model
|
||||
- local: tutorials/using_peft_for_inference
|
||||
title: Inference with PEFT
|
||||
title: Tutorials
|
||||
- sections:
|
||||
- sections:
|
||||
@@ -38,50 +36,38 @@
|
||||
title: Push files to the Hub
|
||||
title: Loading & Hub
|
||||
- sections:
|
||||
- local: using-diffusers/pipeline_overview
|
||||
title: Overview
|
||||
- local: using-diffusers/unconditional_image_generation
|
||||
title: Unconditional image generation
|
||||
- local: using-diffusers/conditional_image_generation
|
||||
title: Text-to-image
|
||||
title: Text-to-image generation
|
||||
- local: using-diffusers/img2img
|
||||
title: Image-to-image
|
||||
title: Text-guided image-to-image
|
||||
- local: using-diffusers/inpaint
|
||||
title: Inpainting
|
||||
title: Text-guided image-inpainting
|
||||
- local: using-diffusers/depth2img
|
||||
title: Depth-to-image
|
||||
title: Tasks
|
||||
- sections:
|
||||
title: Text-guided depth-to-image
|
||||
- local: using-diffusers/textual_inversion_inference
|
||||
title: Textual inversion
|
||||
- local: training/distributed_inference
|
||||
title: Distributed inference with multiple GPUs
|
||||
- local: using-diffusers/distilled_sd
|
||||
title: Distilled Stable Diffusion inference
|
||||
- local: using-diffusers/reusing_seeds
|
||||
title: Improve image quality with deterministic generation
|
||||
- local: using-diffusers/control_brightness
|
||||
title: Control image brightness
|
||||
- local: using-diffusers/weighted_prompts
|
||||
title: Prompt weighting
|
||||
- local: using-diffusers/freeu
|
||||
title: Improve generation quality with FreeU
|
||||
title: Techniques
|
||||
- sections:
|
||||
- local: using-diffusers/pipeline_overview
|
||||
title: Overview
|
||||
- local: using-diffusers/sdxl
|
||||
title: Stable Diffusion XL
|
||||
- local: using-diffusers/controlnet
|
||||
title: ControlNet
|
||||
- local: using-diffusers/shap-e
|
||||
title: Shap-E
|
||||
- local: using-diffusers/diffedit
|
||||
title: DiffEdit
|
||||
- local: using-diffusers/distilled_sd
|
||||
title: Distilled Stable Diffusion inference
|
||||
- local: using-diffusers/reproducibility
|
||||
title: Create reproducible pipelines
|
||||
- local: using-diffusers/custom_pipeline_examples
|
||||
title: Community pipelines
|
||||
- local: using-diffusers/contribute_pipeline
|
||||
title: How to contribute a community pipeline
|
||||
- local: using-diffusers/stable_diffusion_jax_how_to
|
||||
title: Stable Diffusion in JAX/Flax
|
||||
- local: using-diffusers/weighted_prompts
|
||||
title: Prompt weighting
|
||||
title: Pipelines for Inference
|
||||
- sections:
|
||||
- local: training/overview
|
||||
@@ -106,10 +92,6 @@
|
||||
title: InstructPix2Pix Training
|
||||
- local: training/custom_diffusion
|
||||
title: Custom Diffusion
|
||||
- local: training/t2i_adapters
|
||||
title: T2I-Adapters
|
||||
- local: training/ddpo
|
||||
title: Reinforcement learning training with DDPO
|
||||
title: Training
|
||||
- sections:
|
||||
- local: using-diffusers/other-modalities
|
||||
@@ -119,35 +101,25 @@
|
||||
- sections:
|
||||
- local: optimization/opt_overview
|
||||
title: Overview
|
||||
- sections:
|
||||
- local: optimization/fp16
|
||||
title: Speed up inference
|
||||
- local: optimization/memory
|
||||
title: Reduce memory usage
|
||||
- local: optimization/torch2.0
|
||||
title: Torch 2.0
|
||||
- local: optimization/xformers
|
||||
title: xFormers
|
||||
- local: optimization/tome
|
||||
title: Token merging
|
||||
title: General optimizations
|
||||
- sections:
|
||||
- local: using-diffusers/stable_diffusion_jax_how_to
|
||||
title: JAX/Flax
|
||||
- local: optimization/onnx
|
||||
title: ONNX
|
||||
- local: optimization/open_vino
|
||||
title: OpenVINO
|
||||
- local: optimization/coreml
|
||||
title: Core ML
|
||||
title: Optimized model types
|
||||
- sections:
|
||||
- local: optimization/mps
|
||||
title: Metal Performance Shaders (MPS)
|
||||
- local: optimization/habana
|
||||
title: Habana Gaudi
|
||||
title: Optimized hardware
|
||||
title: Optimization
|
||||
- local: optimization/fp16
|
||||
title: Memory and Speed
|
||||
- local: optimization/torch2.0
|
||||
title: Torch2.0 support
|
||||
- local: optimization/xformers
|
||||
title: xFormers
|
||||
- local: optimization/onnx
|
||||
title: ONNX
|
||||
- local: optimization/open_vino
|
||||
title: OpenVINO
|
||||
- local: optimization/coreml
|
||||
title: Core ML
|
||||
- local: optimization/mps
|
||||
title: MPS
|
||||
- local: optimization/habana
|
||||
title: Habana Gaudi
|
||||
- local: optimization/tome
|
||||
title: Token Merging
|
||||
title: Optimization/Special Hardware
|
||||
- sections:
|
||||
- local: conceptual/philosophy
|
||||
title: Philosophy
|
||||
@@ -222,8 +194,6 @@
|
||||
title: AudioLDM 2
|
||||
- local: api/pipelines/auto_pipeline
|
||||
title: AutoPipeline
|
||||
- local: api/pipelines/blip_diffusion
|
||||
title: BLIP Diffusion
|
||||
- local: api/pipelines/consistency_models
|
||||
title: Consistency Models
|
||||
- local: api/pipelines/controlnet
|
||||
@@ -254,8 +224,6 @@
|
||||
title: Latent Diffusion
|
||||
- local: api/pipelines/panorama
|
||||
title: MultiDiffusion
|
||||
- local: api/pipelines/musicldm
|
||||
title: MusicLDM
|
||||
- local: api/pipelines/paint_by_example
|
||||
title: PaintByExample
|
||||
- local: api/pipelines/paradigms
|
||||
@@ -328,8 +296,6 @@
|
||||
title: Versatile Diffusion
|
||||
- local: api/pipelines/vq_diffusion
|
||||
title: VQ Diffusion
|
||||
- local: api/pipelines/wuerstchen
|
||||
title: Wuerstchen
|
||||
title: Pipelines
|
||||
- sections:
|
||||
- local: api/schedulers/overview
|
||||
|
||||
@@ -17,9 +17,6 @@ An attention processor is a class for applying different types of attention mech
|
||||
## CustomDiffusionAttnProcessor
|
||||
[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor
|
||||
|
||||
## CustomDiffusionAttnProcessor2_0
|
||||
[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0
|
||||
|
||||
## AttnAddedKVProcessor
|
||||
[[autodoc]] models.attention_processor.AttnAddedKVProcessor
|
||||
|
||||
@@ -42,4 +39,4 @@ An attention processor is a class for applying different types of attention mech
|
||||
[[autodoc]] models.attention_processor.SlicedAttnProcessor
|
||||
|
||||
## SlicedAttnAddedKVProcessor
|
||||
[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
|
||||
[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
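As a brief usage sketch (the checkpoint id is only an example), a processor is typically swapped in on a model's attention modules like this, here using the PyTorch 2.0 scaled-dot-product attention processor:

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# replace every attention processor in the UNet with the PyTorch 2.0 SDPA variant
pipe.unet.set_attn_processor(AttnProcessor2_0())
```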
|
||||
@@ -28,10 +28,6 @@ Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusio
|
||||
|
||||
[[autodoc]] loaders.TextualInversionLoaderMixin
|
||||
|
||||
## StableDiffusionXLLoraLoaderMixin
|
||||
|
||||
[[autodoc]] loaders.StableDiffusionXLLoraLoaderMixin
|
||||
|
||||
## LoraLoaderMixin
|
||||
|
||||
[[autodoc]] loaders.LoraLoaderMixin
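As an illustration of what these mixins provide (the LoRA repository id below is a hypothetical placeholder):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# LoraLoaderMixin is mixed into the pipeline class, so LoRA weights load directly on it
pipe.load_lora_weights("your-username/your-lora-checkpoint")  # hypothetical repo id

# TextualInversionLoaderMixin adds the analogous method for learned text embeddings
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
```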
|
||||
|
||||
@@ -67,30 +67,30 @@ By default, `tqdm` progress bars are displayed during model download. [`logging.
|
||||
|
||||
## Base setters
|
||||
|
||||
[[autodoc]] utils.logging.set_verbosity_error
|
||||
[[autodoc]] logging.set_verbosity_error
|
||||
|
||||
[[autodoc]] utils.logging.set_verbosity_warning
|
||||
[[autodoc]] logging.set_verbosity_warning
|
||||
|
||||
[[autodoc]] utils.logging.set_verbosity_info
|
||||
[[autodoc]] logging.set_verbosity_info
|
||||
|
||||
[[autodoc]] utils.logging.set_verbosity_debug
|
||||
[[autodoc]] logging.set_verbosity_debug
|
||||
|
||||
## Other functions
|
||||
|
||||
[[autodoc]] utils.logging.get_verbosity
|
||||
[[autodoc]] logging.get_verbosity
|
||||
|
||||
[[autodoc]] utils.logging.set_verbosity
|
||||
[[autodoc]] logging.set_verbosity
|
||||
|
||||
[[autodoc]] utils.logging.get_logger
|
||||
[[autodoc]] logging.get_logger
|
||||
|
||||
[[autodoc]] utils.logging.enable_default_handler
|
||||
[[autodoc]] logging.enable_default_handler
|
||||
|
||||
[[autodoc]] utils.logging.disable_default_handler
|
||||
[[autodoc]] logging.disable_default_handler
|
||||
|
||||
[[autodoc]] utils.logging.enable_explicit_format
|
||||
[[autodoc]] logging.enable_explicit_format
|
||||
|
||||
[[autodoc]] utils.logging.reset_format
|
||||
[[autodoc]] logging.reset_format
|
||||
|
||||
[[autodoc]] utils.logging.enable_progress_bar
|
||||
[[autodoc]] logging.enable_progress_bar
|
||||
|
||||
[[autodoc]] utils.logging.disable_progress_bar
|
||||
[[autodoc]] logging.disable_progress_bar
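For reference, a short usage sketch of the setters listed above:

```python
from diffusers.utils import logging

# only show errors and hide the tqdm download bars
logging.set_verbosity_error()
logging.disable_progress_bar()

# later, restore warnings and the progress bars
logging.set_verbosity_warning()
logging.enable_progress_bar()
```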
|
||||
|
||||
@@ -46,5 +46,6 @@ Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to le
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## AudioPipelineOutput
|
||||
[[autodoc]] pipelines.AudioPipelineOutput
|
||||
## StableDiffusionPipelineOutput
|
||||
|
||||
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
|
||||
@@ -20,10 +20,10 @@ Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelin
|
||||
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two
|
||||
text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap)
|
||||
and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings
|
||||
are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel).
|
||||
are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2ProjectionModel).
|
||||
A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively
|
||||
predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding
|
||||
vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel)
|
||||
vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2UNet2DConditionModel)
|
||||
of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention
|
||||
conditioning, as in most other LDMs.
|
||||
|
||||
@@ -38,17 +38,13 @@ found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).
|
||||
|
||||
### Choosing a checkpoint
|
||||
|
||||
AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio
|
||||
generation. The third checkpoint is trained exclusively on text-to-music generation.
|
||||
AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation. See table below for details on the three official checkpoints:
|
||||
|
||||
All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet.
|
||||
See table below for details on the three checkpoints:
|
||||
|
||||
| Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h |
|
||||
|-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------|
|
||||
| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 350M | 1.1B | 1150k |
|
||||
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M | 1.5B | 1150k |
|
||||
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M | 1.1B | 665k |
|
||||
| Checkpoint | Task | Model Size | Training Data / h |
|
||||
|-----------------------------------------------------------------|---------------|------------|-------------------|
|
||||
| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 1.1B | 1150k |
|
||||
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 1.1B | 665k |
|
||||
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 1.5B | 1150k |
|
||||
|
||||
### Constructing a prompt
|
||||
|
||||
@@ -66,7 +62,37 @@ See table below for details on the three checkpoints:
|
||||
* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation
|
||||
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
|
||||
|
||||
The following example demonstrates how to construct good music generation using the aforementioned tips: [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example).
|
||||
The following example demonstrates how to construct good music generation using the aforementioned tips:
|
||||
|
||||
```python
|
||||
import scipy
|
||||
import torch
|
||||
from diffusers import AudioLDM2Pipeline
|
||||
|
||||
# load the best weights for music generation
|
||||
repo_id = "cvssp/audioldm2-music"
|
||||
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
# define the prompts
|
||||
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
|
||||
negative_prompt = "Low quality."
|
||||
|
||||
# set the seed
|
||||
generator = torch.Generator("cuda").manual_seed(0)
|
||||
|
||||
# run the generation
|
||||
audio = pipe(
|
||||
prompt,
|
||||
negative_prompt=negative_prompt,
|
||||
num_inference_steps=200,
|
||||
audio_length_in_s=10.0,
|
||||
num_waveforms_per_prompt=3,
|
||||
).audios
|
||||
|
||||
# save the best audio sample (index 0) as a .wav file
|
||||
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
@@ -88,6 +114,3 @@ section to learn how to efficiently load the same components into multiple pipel
|
||||
## AudioLDM2UNet2DConditionModel
|
||||
[[autodoc]] AudioLDM2UNet2DConditionModel
|
||||
- forward
|
||||
|
||||
## AudioPipelineOutput
|
||||
[[autodoc]] pipelines.AudioPipelineOutput
|
||||
@@ -42,7 +42,7 @@ Check out the [AutoPipeline](/tutorials/autopipeline) tutorial to learn how to u
|
||||
`AutoPipeline` supports text-to-image, image-to-image, and inpainting for the following diffusion models:
|
||||
|
||||
- [Stable Diffusion](./stable_diffusion)
|
||||
- [ControlNet](./controlnet)
|
||||
- [ControlNet](./api/pipelines/controlnet)
|
||||
- [Stable Diffusion XL (SDXL)](./stable_diffusion/stable_diffusion_xl)
|
||||
- [DeepFloyd IF](./if)
|
||||
- [Kandinsky](./kandinsky)
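As a quick usage sketch (the checkpoint id is only an example), the auto classes resolve the concrete pipeline class from the checkpoint's configuration:

```python
import torch
from diffusers import AutoPipelineForText2Image

# resolves the appropriate pipeline class (e.g. StableDiffusionXLPipeline) from model_index.json
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
```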
|
||||
|
||||
@@ -1,29 +0,0 @@
|
||||
# Blip Diffusion
|
||||
|
||||
Blip Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
|
||||
|
||||
|
||||
The abstract from the paper is:
|
||||
|
||||
*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications.*
|
||||
|
||||
The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization.
|
||||
|
||||
`BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/).
|
||||
|
||||
<Tip>
|
||||
|
||||
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
|
||||
|
||||
</Tip>
|
||||
|
||||
|
||||
## BlipDiffusionPipeline
|
||||
[[autodoc]] BlipDiffusionPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## BlipDiffusionControlNetPipeline
|
||||
[[autodoc]] BlipDiffusionControlNetPipeline
|
||||
- all
|
||||
- __call__
|
||||
@@ -12,9 +12,9 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
# ControlNet
|
||||
|
||||
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
|
||||
[Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
|
||||
|
||||
With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
|
||||
Using a pretrained model, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.
|
||||
|
||||
The abstract from the paper is:
|
||||
|
||||
@@ -22,13 +22,290 @@ The abstract from the paper is:
|
||||
|
||||
This model was contributed by [takuma104](https://huggingface.co/takuma104). ❤️
|
||||
|
||||
The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet), and you can find official ControlNet checkpoints on [lllyasviel's](https://huggingface.co/lllyasviel) Hub profile.
|
||||
The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet).
|
||||
|
||||
<Tip>
|
||||
## Usage example
|
||||
|
||||
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
|
||||
In the following we give a simple example of how to use a *ControlNet* checkpoint with Diffusers for inference.
|
||||
The inference pipeline is the same for all pipelines:
|
||||
|
||||
</Tip>
|
||||
1. Take an image and run it through a pre-conditioning processor.
2. Run the pre-processed image through the [`StableDiffusionControlNetPipeline`].
|
||||
|
||||
Let's have a look at a simple example using the [Canny Edge ControlNet](https://huggingface.co/lllyasviel/sd-controlnet-canny).
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionControlNetPipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
# Let's load the popular vermeer image
|
||||
image = load_image(
|
||||
"https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
|
||||
)
|
||||
```
|
||||
|
||||

|
||||
|
||||
Next, we process the image to get the canny image. This is step *1.* - running the pre-conditioning processor. The pre-conditioning processor is different for every ControlNet. Please see the model cards of the [official checkpoints](#controlnet-with-stable-diffusion-1.5) for more information about other models.
|
||||
|
||||
First, we need to install opencv:
|
||||
|
||||
```
|
||||
pip install opencv-contrib-python
|
||||
```
|
||||
|
||||
Next, let's also install all required Hugging Face libraries:
|
||||
|
||||
```
|
||||
pip install diffusers transformers git+https://github.com/huggingface/accelerate.git
|
||||
```
|
||||
|
||||
Then we can retrieve the canny edges of the image.
|
||||
|
||||
```python
|
||||
import cv2
|
||||
from PIL import Image
|
||||
import numpy as np
|
||||
|
||||
image = np.array(image)
|
||||
|
||||
low_threshold = 100
|
||||
high_threshold = 200
|
||||
|
||||
image = cv2.Canny(image, low_threshold, high_threshold)
|
||||
image = image[:, :, None]
|
||||
image = np.concatenate([image, image, image], axis=2)
|
||||
canny_image = Image.fromarray(image)
|
||||
```
|
||||
|
||||
Let's take a look at the processed image.
|
||||
|
||||

|
||||
|
||||
Now, we load the official [Stable Diffusion 1.5 Model](https://huggingface.co/runwayml/stable-diffusion-v1-5) as well as the ControlNet for canny edges.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
|
||||
)
|
||||
```
|
||||
|
||||
To speed-up things and reduce memory, let's enable model offloading and use the fast [`UniPCMultistepScheduler`].
|
||||
|
||||
```py
|
||||
from diffusers import UniPCMultistepScheduler
|
||||
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
|
||||
# this command loads the individual model components on GPU on-demand.
|
||||
pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
Finally, we can run the pipeline:
|
||||
|
||||
```py
|
||||
generator = torch.manual_seed(0)
|
||||
|
||||
out_image = pipe(
|
||||
"disco dancer with colorful lights", num_inference_steps=20, generator=generator, image=canny_image
|
||||
).images[0]
|
||||
```
|
||||
|
||||
This should take only around 3-4 seconds on GPU (depending on hardware). The output image then looks as follows:
|
||||
|
||||

|
||||
|
||||
|
||||
**Note**: To see how to run all other ControlNet checkpoints, please have a look at [ControlNet with Stable Diffusion 1.5](#controlnet-with-stable-diffusion-1.5).
|
||||
|
||||
<!-- TODO: add space -->
|
||||
|
||||
## Combining multiple conditionings
|
||||
|
||||
Multiple ControlNet conditionings can be combined for a single image generation. Pass a list of ControlNets to the pipeline's constructor and a corresponding list of conditionings to `__call__`.
|
||||
|
||||
When combining conditionings, it is helpful to mask conditionings such that they do not overlap. In the example, we mask the middle of the canny map where the pose conditioning is located.
|
||||
|
||||
It can also be helpful to vary the `controlnet_conditioning_scale` values to emphasize one conditioning over the other.
|
||||
|
||||
### Canny conditioning
|
||||
|
||||
The original image:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"/>
|
||||
|
||||
Prepare the conditioning:
|
||||
|
||||
```python
|
||||
from diffusers.utils import load_image
from PIL import Image
import cv2
import numpy as np
|
||||
|
||||
canny_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
|
||||
)
|
||||
canny_image = np.array(canny_image)
|
||||
|
||||
low_threshold = 100
|
||||
high_threshold = 200
|
||||
|
||||
canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
|
||||
|
||||
# zero out middle columns of image where pose will be overlaid
|
||||
zero_start = canny_image.shape[1] // 4
|
||||
zero_end = zero_start + canny_image.shape[1] // 2
|
||||
canny_image[:, zero_start:zero_end] = 0
|
||||
|
||||
canny_image = canny_image[:, :, None]
|
||||
canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
|
||||
canny_image = Image.fromarray(canny_image)
|
||||
```
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/landscape_canny_masked.png"/>
|
||||
|
||||
### Openpose conditioning
|
||||
|
||||
The original image:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png" width=600/>
|
||||
|
||||
Prepare the conditioning:
|
||||
|
||||
```python
|
||||
from controlnet_aux import OpenposeDetector
|
||||
from diffusers.utils import load_image
|
||||
|
||||
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
|
||||
|
||||
openpose_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
|
||||
)
|
||||
openpose_image = openpose(openpose_image)
|
||||
```
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/person_pose.png" width=600/>
|
||||
|
||||
### Running ControlNet with multiple conditionings
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
|
||||
import torch
|
||||
|
||||
controlnet = [
|
||||
ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
|
||||
ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
|
||||
]
|
||||
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
|
||||
)
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
prompt = "a giant standing in a fantasy landscape, best quality"
|
||||
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
|
||||
|
||||
generator = torch.Generator(device="cpu").manual_seed(1)
|
||||
|
||||
images = [openpose_image, canny_image]
|
||||
|
||||
image = pipe(
|
||||
prompt,
|
||||
images,
|
||||
num_inference_steps=20,
|
||||
generator=generator,
|
||||
negative_prompt=negative_prompt,
|
||||
controlnet_conditioning_scale=[1.0, 0.8],
|
||||
).images[0]
|
||||
|
||||
image.save("./multi_controlnet_output.png")
|
||||
```
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/multi_controlnet_output.png" width=600/>
|
||||
|
||||
### Guess Mode
|
||||
|
||||
Guess Mode is [a ControlNet feature that was implemented](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) after the publication of [the paper](https://arxiv.org/abs/2302.05543). The description states:
|
||||
|
||||
>In this mode, the ControlNet encoder will try best to recognize the content of the input control map, like depth map, edge map, scribbles, etc, even if you remove all prompts.
|
||||
|
||||
#### The core implementation:
|
||||
|
||||
It adjusts the scale of the output residuals from ControlNet by a fixed ratio depending on the block depth. The shallowest DownBlock corresponds to `0.1`. As the blocks get deeper, the scale increases exponentially, and the scale for the output of the MidBlock becomes `1.0`.
|
||||
|
||||
Since this is all the core implementation does, **it does not have any impact on prompt conditioning**. While it is common to use Guess Mode without specifying any prompt, it is also possible to provide prompts if desired.
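As a rough illustration of that scaling schedule (a sketch of the idea, not necessarily the exact library code), the per-block scales can be generated with a geometric ramp from `0.1` to `1.0`:

```py
import torch

# 12 down-block residuals + 1 mid-block residual, scaled geometrically
# from 0.1 (shallowest block) up to 1.0 (mid block). Illustrative only.
scales = torch.logspace(-1, 0, steps=13)
print([round(s.item(), 3) for s in scales])
```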
|
||||
|
||||
#### Usage:
|
||||
|
||||
Just pass `guess_mode=True` when calling the pipeline. A `guidance_scale` between 3.0 and 5.0 is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode).
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet).to(
|
||||
"cuda"
|
||||
)
|
||||
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
|
||||
image.save("guess_mode_generated.png")
|
||||
```
|
||||
|
||||
#### Output image comparison:
|
||||
Canny Control Example
|
||||
|
||||
|no guess_mode with prompt|guess_mode without prompt|
|
||||
|---|---|
|
||||
|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"/></a>|
|
||||
|
||||
|
||||
## Available checkpoints
|
||||
|
||||
ControlNet requires a *control image* in addition to the text-to-image *prompt*.
|
||||
Each pretrained model is trained using a different conditioning method that requires different images for conditioning the generated outputs. For example, Canny edge conditioning requires the control image to be the output of a Canny filter, while depth conditioning requires the control image to be a depth map. See the overview and image examples below to learn more, and the short depth-preparation sketch right after this paragraph.
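For instance, preparing a control image for the depth checkpoint might look like the sketch below. This is an illustration rather than text from the model card; it assumes a generic `transformers` depth-estimation pipeline is acceptable for producing the depth map.

```py
# A hedged sketch: turn an RGB image into a depth control image for
# a depth-conditioned ControlNet using a generic depth-estimation pipeline.
import numpy as np
from PIL import Image
from transformers import pipeline
from diffusers.utils import load_image

depth_estimator = pipeline("depth-estimation")  # assumes the default depth model is fine

image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
depth = np.array(depth_estimator(image)["depth"])[:, :, None]
depth_image = Image.fromarray(np.concatenate([depth, depth, depth], axis=2).astype(np.uint8))
```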
|
||||
|
||||
All checkpoints can be found under the authors' namespace [lllyasviel](https://huggingface.co/lllyasviel).
|
||||
|
||||
**13.04.2023 Update**: The author has released improved ControlNet v1.1 checkpoints - see [here](#controlnet-v1.1).
|
||||
|
||||
### ControlNet v1.0
|
||||
|
||||
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|
||||
|---|---|---|---|
|
||||
|[lllyasviel/sd-controlnet-canny](https://huggingface.co/lllyasviel/sd-controlnet-canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_canny.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_canny.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"/></a>|
|
||||
|[lllyasviel/sd-controlnet-depth](https://huggingface.co/lllyasviel/sd-controlnet-depth)<br/> *Trained with Midas depth estimation* |A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_depth.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_depth.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"/></a>|
|
||||
|[lllyasviel/sd-controlnet-hed](https://huggingface.co/lllyasviel/sd-controlnet-hed)<br/> *Trained with HED edge detection (soft edge)* |A monochrome image with white soft edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_hed.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_hed.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"/></a> |
|
||||
|[lllyasviel/sd-controlnet-mlsd](https://huggingface.co/lllyasviel/sd-controlnet-mlsd)<br/> *Trained with M-LSD line detection* |A monochrome image composed only of white straight lines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_mlsd.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_mlsd.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"/></a>|
|
||||
|[lllyasviel/sd-controlnet-normal](https://huggingface.co/lllyasviel/sd-controlnet-normal)<br/> *Trained with normal map* |A [normal mapped](https://en.wikipedia.org/wiki/Normal_mapping) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_normal.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_normal.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"/></a>|
|
||||
|[lllyasviel/sd-controlnet-openpose](https://huggingface.co/lllyasviel/sd-controlnet_openpose)<br/> *Trained with OpenPose bone image* |A [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_openpose.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_openpose.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"/></a>|
|
||||
|[lllyasviel/sd-controlnet-scribble](https://huggingface.co/lllyasviel/sd-controlnet_scribble)<br/> *Trained with human scribbles* |A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_scribble.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_scribble.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"/></a> |
|
||||
|[lllyasviel/sd-controlnet-seg](https://huggingface.co/lllyasviel/sd-controlnet_seg)<br/>*Trained with semantic segmentation* |An [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/)'s segmentation protocol image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_seg.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_seg.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"/></a> |
|
||||
|
||||
### ControlNet v1.1
|
||||
|
||||
| Model Name | Control Image Overview| Condition Image | Control Image Example | Generated Image Example |
|
||||
|---|---|---|---|---|
|
||||
|[lllyasviel/control_v11p_sd15_canny](https://huggingface.co/lllyasviel/control_v11p_sd15_canny)<br/> | *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11e_sd15_ip2p](https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p)<br/> | *Trained with pixel to pixel instruction* | No condition .|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_inpaint](https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint)<br/> | Trained with image inpainting | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_mlsd](https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd)<br/> | Trained with multi-level line segment detection | An image with annotated line segments.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11f1p_sd15_depth](https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth)<br/> | Trained with depth estimation | An image with depth information, usually represented as a grayscale image.|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_normalbae](https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae)<br/> | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_seg](https://huggingface.co/lllyasviel/control_v11p_sd15_seg)<br/> | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_lineart](https://huggingface.co/lllyasviel/control_v11p_sd15_lineart)<br/> | Trained with line art generation | An image with line art, usually black lines on a white background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15s2_lineart_anime](https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime)<br/> | Trained with anime line art generation | An image with anime-style line art.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_openpose](https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime)<br/> | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_scribble](https://huggingface.co/lllyasviel/control_v11p_sd15_scribble)<br/> | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11p_sd15_softedge](https://huggingface.co/lllyasviel/control_v11p_sd15_softedge)<br/> | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11e_sd15_shuffle](https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle)<br/> | Trained with image shuffling | An image with shuffled patches or regions.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"/></a>|
|
||||
|[lllyasviel/control_v11f1e_sd15_tile](https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile)<br/> | Trained with image tiling | A blurry image or part of an image .|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"/></a>|
|
||||
|
||||
## StableDiffusionControlNetPipeline
|
||||
[[autodoc]] StableDiffusionControlNetPipeline
|
||||
@@ -66,15 +343,8 @@ Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to le
|
||||
- disable_xformers_memory_efficient_attention
|
||||
- load_textual_inversion
|
||||
|
||||
## StableDiffusionPipelineOutput
|
||||
|
||||
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
|
||||
|
||||
## FlaxStableDiffusionControlNetPipeline
|
||||
[[autodoc]] FlaxStableDiffusionControlNetPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## FlaxStableDiffusionControlNetPipelineOutput
|
||||
|
||||
[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput
|
||||
@@ -12,35 +12,151 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
# ControlNet with Stable Diffusion XL
|
||||
|
||||
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.

With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
|
||||
|
||||
The abstract from the paper is:
|
||||
|
||||
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
|
||||
|
||||
You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub. Some of the smaller checkpoints include:

* [controlnet-canny-sdxl-1.0-small](https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0-small)
* [controlnet-canny-sdxl-1.0-mid](https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0-mid)
* [controlnet-depth-sdxl-1.0-small](https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0-small)
* [controlnet-depth-sdxl-1.0-mid](https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0-mid)

<Tip warning={true}>

🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve!

</Tip>

If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md). You can find some example results below:

<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/sdxl_controlnet_canny_grid.png" width=600/>

<Tip>

Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
|
||||
## MultiControlNet
|
||||
|
||||
You can compose multiple ControlNet conditionings from different image inputs to create a *MultiControlNet*. To get better results, it is often helpful to:
|
||||
|
||||
1. mask conditionings such that they don't overlap (for example, mask the area of a canny image where the pose conditioning is located)
|
||||
2. experiment with the [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter to determine how much weight to assign to each conditioning input
|
||||
|
||||
In this example, you'll combine a canny image and a human pose estimation image to generate a new image.
|
||||
|
||||
Prepare the canny image conditioning:
|
||||
|
||||
```py
|
||||
from diffusers.utils import load_image
|
||||
from PIL import Image
|
||||
import numpy as np
|
||||
import cv2
|
||||
|
||||
canny_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
|
||||
)
|
||||
canny_image = np.array(canny_image)
|
||||
|
||||
low_threshold = 100
|
||||
high_threshold = 200
|
||||
|
||||
canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
|
||||
|
||||
# zero out middle columns of image where pose will be overlaid
|
||||
zero_start = canny_image.shape[1] // 4
|
||||
zero_end = zero_start + canny_image.shape[1] // 2
|
||||
canny_image[:, zero_start:zero_end] = 0
|
||||
|
||||
canny_image = canny_image[:, :, None]
|
||||
canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
|
||||
canny_image = Image.fromarray(canny_image).resize((1024, 1024))
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/landscape_canny_masked.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">canny image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Prepare the human pose estimation conditioning:
|
||||
|
||||
```py
|
||||
from controlnet_aux import OpenposeDetector
|
||||
from diffusers.utils import load_image
|
||||
|
||||
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
|
||||
|
||||
openpose_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
|
||||
)
|
||||
openpose_image = openpose(openpose_image).resize((1024, 1024))
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/person_pose.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">human pose image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Load a list of ControlNet models that correspond to each conditioning, and pass them to the [`StableDiffusionXLControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to reduce memory usage.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
|
||||
import torch
|
||||
|
||||
controlnets = [
|
||||
ControlNetModel.from_pretrained(
|
||||
"thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
|
||||
),
|
||||
ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True),
|
||||
]
|
||||
|
||||
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
|
||||
)
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
Now you can pass your prompt (and an optional negative prompt if you're using one), canny image, and pose image to the pipeline:
|
||||
|
||||
```py
|
||||
prompt = "a giant standing in a fantasy landscape, best quality"
|
||||
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
|
||||
|
||||
generator = torch.manual_seed(1)
|
||||
|
||||
images = [openpose_image, canny_image]
|
||||
|
||||
images = pipe(
|
||||
prompt,
|
||||
image=images,
|
||||
num_inference_steps=25,
|
||||
generator=generator,
|
||||
negative_prompt=negative_prompt,
|
||||
num_images_per_prompt=3,
|
||||
controlnet_conditioning_scale=[1.0, 0.8],
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/multicontrolnet.png"/>
|
||||
</div>
|
||||
|
||||
## StableDiffusionXLControlNetPipeline
|
||||
[[autodoc]] StableDiffusionXLControlNetPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## StableDiffusionPipelineOutput
|
||||
|
||||
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
|
||||
- __call__
|
||||
@@ -24,32 +24,325 @@ This pipeline was contributed by [clarencechen](https://github.com/clarencechen)
|
||||
|
||||
## Tips
|
||||
|
||||
* The pipeline can generate masks that can be fed into other inpainting pipelines. Check out the code examples below to learn more.
* In order to generate an image using this pipeline, both an image mask (manually specified or generated using [`~StableDiffusionDiffEditPipeline.generate_mask`])
and a set of partially inverted latents (generated using [`~StableDiffusionDiffEditPipeline.invert`]) _must_ be provided as arguments when calling the pipeline to generate the final edited image.
* The function [`~StableDiffusionDiffEditPipeline.generate_mask`] exposes two prompt arguments, `source_prompt` and `target_prompt`,
that let you control the locations of the semantic edits in the final image to be generated. Let's say,
you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to
`source_prompt` and "dog" to `target_prompt`.
* When generating partially inverted latents using [`~StableDiffusionDiffEditPipeline.invert`], assign a caption or text embedding describing the
overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives.
Please refer to [this code example](#generating-image-captions-for-inversion) for more details.
* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to
the phrases including "cat" to `negative_prompt` and "dog" to `prompt`.
* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to (see the sketch after this list):
    * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
    * Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog".
    * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to [this discussion](#generating-source-and-target-embeddings) for more details.
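For example, reversing the direction to "dog -> cat" might look like the following sketch. This is illustrative only; it assumes `pipeline`, `raw_image`, and `generator` are set up as in the usage example below.

```py
# Hypothetical reverse edit ("dog -> cat"); assumes pipeline, raw_image and
# generator are prepared as in the usage example below.
mask_image = pipeline.generate_mask(
    image=raw_image,
    source_prompt="a photo of a dog",
    target_prompt="a photo of a cat",
    generator=generator,
)
inv_latents = pipeline.invert(prompt="a photo of a dog", image=raw_image, generator=generator).latents
image = pipeline(
    prompt="a photo of a cat",
    negative_prompt="a photo of a dog",
    mask_image=mask_image,
    image_latents=inv_latents,
    generator=generator,
).images[0]
```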
|
||||
|
||||
## Usage example
|
||||
|
||||
### Based on an input image with a caption
|
||||
|
||||
When the pipeline is conditioned on an input image, we first obtain partially inverted latents from the input image using a
|
||||
`DDIMInverseScheduler` with the help of a caption. Then we generate an editing mask to identify relevant regions in the image using the source and target prompts. Finally,
|
||||
the inverted noise and generated mask are used to start the generation process.
|
||||
|
||||
First, let's load our pipeline:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
|
||||
|
||||
sd_model_ckpt = "stabilityai/stable-diffusion-2-1"
|
||||
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
|
||||
sd_model_ckpt,
|
||||
torch_dtype=torch.float16,
|
||||
safety_checker=None,
|
||||
)
|
||||
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
|
||||
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_vae_slicing()
|
||||
generator = torch.manual_seed(0)
|
||||
```
|
||||
|
||||
Then, we load an input image to edit using our method:
|
||||
|
||||
```py
|
||||
from diffusers.utils import load_image
|
||||
|
||||
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
|
||||
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
|
||||
```
|
||||
|
||||
Then, we employ the source and target prompts to generate the editing mask:
|
||||
|
||||
```py
|
||||
# See the "Generating source and target embeddings" section below to
|
||||
# automate the generation of these captions with a pre-trained model like Flan-T5 as explained below.
|
||||
|
||||
source_prompt = "a bowl of fruits"
|
||||
target_prompt = "a basket of fruits"
|
||||
mask_image = pipeline.generate_mask(
|
||||
image=raw_image,
|
||||
source_prompt=source_prompt,
|
||||
target_prompt=target_prompt,
|
||||
generator=generator,
|
||||
)
|
||||
```
|
||||
|
||||
Then, we employ the caption and the input image to get the inverted latents:
|
||||
|
||||
```py
|
||||
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image, generator=generator).latents
|
||||
```
|
||||
|
||||
Now, generate the image with the inverted latents and semantically generated mask:
|
||||
|
||||
```py
|
||||
image = pipeline(
|
||||
prompt=target_prompt,
|
||||
mask_image=mask_image,
|
||||
image_latents=inv_latents,
|
||||
generator=generator,
|
||||
negative_prompt=source_prompt,
|
||||
).images[0]
|
||||
image.save("edited_image.png")
|
||||
```
|
||||
|
||||
## Generating image captions for inversion
|
||||
|
||||
The authors originally used the source concept prompt as the caption for generating the partially inverted latents. However, we can also leverage open source and public image captioning models for the same purpose.
|
||||
Below, we provide an end-to-end example with the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model
|
||||
for generating captions.
|
||||
|
||||
First, let's load our automatic image captioning model:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import BlipForConditionalGeneration, BlipProcessor
|
||||
|
||||
captioner_id = "Salesforce/blip-image-captioning-base"
|
||||
processor = BlipProcessor.from_pretrained(captioner_id)
|
||||
model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)
|
||||
```
|
||||
|
||||
Then, we define a utility to generate captions from an input image using the model:
|
||||
|
||||
```py
|
||||
@torch.no_grad()
|
||||
def generate_caption(images, caption_generator, caption_processor):
|
||||
text = "a photograph of"
|
||||
|
||||
inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
|
||||
caption_generator.to("cuda")
|
||||
outputs = caption_generator.generate(**inputs, max_new_tokens=128)
|
||||
|
||||
# offload caption generator
|
||||
caption_generator.to("cpu")
|
||||
|
||||
caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
|
||||
return caption
|
||||
```
|
||||
|
||||
Then, we load an input image for conditioning and obtain a suitable caption for it:
|
||||
|
||||
```py
|
||||
from diffusers.utils import load_image
|
||||
|
||||
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
|
||||
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
|
||||
caption = generate_caption(raw_image, model, processor)
|
||||
```
|
||||
|
||||
Then, we employ the generated caption and the input image to get the inverted latents:
|
||||
|
||||
```py
|
||||
from diffusers import DDIMInverseScheduler, DDIMScheduler
|
||||
|
||||
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
|
||||
)
|
||||
pipeline = pipeline.to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_vae_slicing()
|
||||
|
||||
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
|
||||
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
|
||||
|
||||
generator = torch.manual_seed(0)
|
||||
inv_latents = pipeline.invert(prompt=caption, image=raw_image, generator=generator).latents
|
||||
```
|
||||
|
||||
Now, generate the image with the inverted latents and semantically generated mask from our source and target prompts:
|
||||
|
||||
```py
|
||||
source_prompt = "a bowl of fruits"
|
||||
target_prompt = "a basket of fruits"
|
||||
|
||||
mask_image = pipeline.generate_mask(
|
||||
image=raw_image,
|
||||
source_prompt=source_prompt,
|
||||
target_prompt=target_prompt,
|
||||
generator=generator,
|
||||
)
|
||||
|
||||
image = pipeline(
|
||||
prompt=target_prompt,
|
||||
mask_image=mask_image,
|
||||
image_latents=inv_latents,
|
||||
generator=generator,
|
||||
negative_prompt=source_prompt,
|
||||
).images[0]
|
||||
image.save("edited_image.png")
|
||||
```
|
||||
|
||||
## Generating source and target embeddings
|
||||
|
||||
The authors originally required the user to manually provide the source and target prompts for discovering
|
||||
edit directions. However, we can also leverage open source and public models for the same purpose.
|
||||
Below, we provide an end-to-end example with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model
|
||||
for generating source and target embeddings.
|
||||
|
||||
**1. Load the generation model**:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
|
||||
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
|
||||
```
|
||||
|
||||
**2. Construct a starting prompt**:
|
||||
|
||||
```py
|
||||
source_concept = "bowl"
|
||||
target_concept = "basket"
|
||||
|
||||
source_text = f"Provide a caption for images containing a {source_concept}. "
|
||||
"The captions should be in English and should be no longer than 150 characters."
|
||||
|
||||
target_text = f"Provide a caption for images containing a {target_concept}. "
|
||||
"The captions should be in English and should be no longer than 150 characters."
|
||||
```
|
||||
|
||||
Here, we're interested in the "bowl -> basket" direction.
|
||||
|
||||
**3. Generate prompts**:
|
||||
|
||||
We can use a utility like so for this purpose.
|
||||
|
||||
```py
|
||||
@torch.no_grad
|
||||
def generate_prompts(input_prompt):
|
||||
input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
|
||||
|
||||
outputs = model.generate(
|
||||
input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
|
||||
)
|
||||
return tokenizer.batch_decode(outputs, skip_special_tokens=True)
|
||||
```
|
||||
|
||||
And then we just call it to generate our prompts:
|
||||
|
||||
```py
|
||||
source_prompts = generate_prompts(source_text)
|
||||
target_prompts = generate_prompts(target_text)
|
||||
```
|
||||
|
||||
We encourage you to play around with the different parameters supported by the
|
||||
`generate()` method ([documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.generation_tf_utils.TFGenerationMixin.generate)) for the generation quality you are looking for.
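For example, a hypothetical beam-search variant of `generate_prompts` (shown only as an illustration of alternative settings, not a recommendation) could look like:

```py
@torch.no_grad()
def generate_prompts_beam(input_prompt):
    # Beam search instead of sampling; purely illustrative settings.
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(
        input_ids, num_beams=8, num_return_sequences=8, max_new_tokens=128
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```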
|
||||
|
||||
**4. Load the embedding model**:
|
||||
|
||||
Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionDiffEditPipeline
|
||||
|
||||
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
|
||||
)
|
||||
pipeline = pipeline.to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_vae_slicing()
|
||||
|
||||
generator = torch.manual_seed(0)
|
||||
```
|
||||
|
||||
**5. Compute embeddings**:
|
||||
|
||||
```py
|
||||
import torch
|
||||
|
||||
@torch.no_grad()
|
||||
def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
|
||||
embeddings = []
|
||||
for sent in sentences:
|
||||
text_inputs = tokenizer(
|
||||
sent,
|
||||
padding="max_length",
|
||||
max_length=tokenizer.model_max_length,
|
||||
truncation=True,
|
||||
return_tensors="pt",
|
||||
)
|
||||
text_input_ids = text_inputs.input_ids
|
||||
prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
|
||||
embeddings.append(prompt_embeds)
|
||||
return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)
|
||||
|
||||
source_embeddings = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
|
||||
target_embeddings = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder)
|
||||
```
|
||||
|
||||
And you're done! Now, you can use these embeddings directly while calling the pipeline:
|
||||
|
||||
```py
|
||||
from diffusers import DDIMInverseScheduler, DDIMScheduler
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
|
||||
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
|
||||
|
||||
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
|
||||
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
|
||||
|
||||
|
||||
mask_image = pipeline.generate_mask(
|
||||
image=raw_image,
|
||||
    source_prompt_embeds=source_embeddings,
    target_prompt_embeds=target_embeddings,
|
||||
generator=generator,
|
||||
)
|
||||
|
||||
inv_latents = pipeline.invert(
|
||||
    prompt_embeds=source_embeddings,
|
||||
image=raw_image,
|
||||
generator=generator,
|
||||
).latents
|
||||
|
||||
images = pipeline(
|
||||
mask_image=mask_image,
|
||||
image_latents=inv_latents,
|
||||
prompt_embeds=target_embeddings,
|
||||
negative_prompt_embeds=source_embeddings,
|
||||
generator=generator,
|
||||
).images
|
||||
images[0].save("edited_image.png")
|
||||
```
|
||||
|
||||
## StableDiffusionDiffEditPipeline
|
||||
[[autodoc]] StableDiffusionDiffEditPipeline
|
||||
- all
|
||||
- generate_mask
|
||||
- invert
|
||||
- __call__
|
||||
|
||||
## StableDiffusionPipelineOutput
|
||||
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
|
||||
- __call__
|
||||
@@ -396,7 +396,7 @@ t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor())
|
||||
```
|
||||
|
||||
With PyTorch >= 2.0, you can also use Kandinsky with `torch.compile`, which, depending
on your hardware, can significantly speed up your inference time once the model is compiled.
To use Kandinsky with `torch.compile`, you can do:
|
||||
|
||||
```py
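# The docs hunk is truncated here; the lines below are an assumed minimal sketch
# of the usual torch.compile recipe, not taken verbatim from this diff.
t2i_pipe.unet.to(memory_format=torch.channels_last)
t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=True)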
|
||||
|
||||
@@ -263,7 +263,7 @@ t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor())
|
||||
```
|
||||
|
||||
With PyTorch >= 2.0, you can also use Kandinsky with `torch.compile`, which, depending
on your hardware, can significantly speed up your inference time once the model is compiled.
To use Kandinsky with `torch.compile`, you can do:
|
||||
|
||||
```py
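# The docs hunk is truncated here; the lines below are an assumed minimal sketch
# of the usual torch.compile recipe, not taken verbatim from this diff.
t2i_pipe.unet.to(memory_format=torch.channels_last)
t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=True)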
|
||||
|
||||
@@ -1,57 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# MusicLDM
|
||||
|
||||
MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
|
||||
MusicLDM takes a text prompt as input and predicts the corresponding music sample.
|
||||
|
||||
Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
|
||||
MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
|
||||
latents.
|
||||
|
||||
MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to
|
||||
the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies
|
||||
encourages the model to interpolate between the training samples, but stay within the domain of the training data. The
|
||||
result is generated music that is more diverse while staying faithful to the corresponding style.
|
||||
|
||||
The abstract of the paper is the following:
|
||||
|
||||
*In this paper, we present MusicLDM, a state-of-the-art text-to-music model that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, to encourage the model to generate music more diverse while still staying faithful to the corresponding style.*
|
||||
|
||||
This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi).
|
||||
|
||||
## Tips
|
||||
|
||||
When constructing a prompt, keep in mind:
|
||||
|
||||
* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
|
||||
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
|
||||
|
||||
During inference:
|
||||
|
||||
* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
|
||||
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
|
||||
* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument. A minimal usage sketch combining these arguments follows below.
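Putting these tips together, a minimal usage sketch could look as follows. The checkpoint name `ucsd-reach/musicldm` and the 16 kHz output sampling rate are assumptions for illustration, not taken from this page.

```py
import torch
from scipy.io import wavfile
from diffusers import MusicLDMPipeline

# Assumed checkpoint name; swap in the MusicLDM checkpoint you want to use.
pipe = MusicLDMPipeline.from_pretrained("ucsd-reach/musicldm", torch_dtype=torch.float16).to("cuda")

audio = pipe(
    "melodic techno with a fast beat and synths",
    negative_prompt="low quality, average quality",
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,  # generated waveforms are ranked best to worst
).audios[0]

# Assumed 16 kHz output sampling rate.
wavfile.write("techno.wav", rate=16000, data=audio)
```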
|
||||
|
||||
<Tip>
|
||||
|
||||
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between
|
||||
scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines)
|
||||
section to learn how to efficiently load the same components into multiple pipelines.
|
||||
|
||||
</Tip>
|
||||
|
||||
## MusicLDMPipeline
|
||||
[[autodoc]] MusicLDMPipeline
|
||||
- all
|
||||
- __call__
|
||||
@@ -34,7 +34,5 @@ Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to le
|
||||
- load_lora_weights
|
||||
- save_lora_weights
|
||||
|
||||
## StableDiffusionXLInstructPix2PixPipeline
|
||||
[[autodoc]] StableDiffusionXLInstructPix2PixPipeline
|
||||
- __call__
|
||||
- all
|
||||
## StableDiffusionPipelineOutput
|
||||
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
|
||||
@@ -31,5 +31,5 @@ Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to le
|
||||
- __call__
|
||||
|
||||
## StableDiffusionSafePipelineOutput
|
||||
[[autodoc]] pipelines.semantic_stable_diffusion.pipeline_output.SemanticStableDiffusionPipelineOutput
|
||||
- all
|
||||
[[autodoc]] pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput
|
||||
- all
|
||||
@@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
# Shap-E
|
||||
|
||||
The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai).
|
||||
|
||||
The abstract from the paper is:
|
||||
|
||||
@@ -19,10 +19,163 @@ The original codebase can be found at [openai/shap-e](https://github.com/openai/
|
||||
|
||||
<Tip>
|
||||
|
||||
See the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
|
||||
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Usage Examples
|
||||
|
||||
In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in gif format.
|
||||
|
||||
### Text-to-3D image generation
|
||||
|
||||
We can use [`ShapEPipeline`] to create a 3D object based on a text prompt. In this example, we will make a birthday cupcake for the 🧨 diffusers library's first birthday. The workflow for using the Shap-E text-to-image pipeline is the same as for other text-to-image pipelines in diffusers.
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
|
||||
repo = "openai/shap-e"
|
||||
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
|
||||
pipe = pipe.to(device)
|
||||
|
||||
guidance_scale = 15.0
|
||||
prompt = ["A firecracker", "A birthday cupcake"]
|
||||
|
||||
images = pipe(
|
||||
prompt,
|
||||
guidance_scale=guidance_scale,
|
||||
num_inference_steps=64,
|
||||
frame_size=256,
|
||||
).images
|
||||
```
|
||||
|
||||
The output of [`ShapEPipeline`] is a list of lists of image frames. Each list of frames can be used to create a 3D object. Let's use the `export_to_gif` utility function in diffusers to make a 3D cupcake!
|
||||
|
||||
```python
|
||||
from diffusers.utils import export_to_gif
|
||||
|
||||
export_to_gif(images[0], "firecracker_3d.gif")
|
||||
export_to_gif(images[1], "cake_3d.gif")
|
||||
```
|
||||

|
||||

|
||||
|
||||
|
||||
### Image-to-Image generation
|
||||
|
||||
You can use [`ShapEImg2ImgPipeline`] along with other text-to-image pipelines in diffusers to turn your 2D generations into 3D.

In this example, we will first generate a cheeseburger with the simple prompt "A cheeseburger, white background".
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
|
||||
pipe_prior.to("cuda")
|
||||
|
||||
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
|
||||
t2i_pipe.to("cuda")
|
||||
|
||||
prompt = "A cheeseburger, white background"
|
||||
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
|
||||
image = t2i_pipe(
|
||||
prompt,
|
||||
image_embeds=image_embeds,
|
||||
negative_image_embeds=negative_image_embeds,
|
||||
).images[0]
|
||||
|
||||
image.save("burger.png")
|
||||
```
|
||||
|
||||

|
||||
|
||||
We will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger.
|
||||
|
||||
```python
|
||||
from PIL import Image
|
||||
from diffusers.utils import export_to_gif
|
||||
|
||||
repo = "openai/shap-e-img2img"
|
||||
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
guidance_scale = 3.0
|
||||
image = Image.open("burger.png").resize((256, 256))
|
||||
|
||||
images = pipe(
|
||||
image,
|
||||
guidance_scale=guidance_scale,
|
||||
num_inference_steps=64,
|
||||
frame_size=256,
|
||||
).images
|
||||
|
||||
gif_path = export_to_gif(images[0], "burger_3d.gif")
|
||||
```
|
||||

|
||||
|
||||
### Generate mesh
|
||||
|
||||
For both [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`], you can generate mesh output by passing `output_type` as `mesh` to the pipeline, and then use the [`ShapEPipeline.export_to_ply`] utility function to save the output as a `ply` file. We also provide a [`ShapEPipeline.export_to_obj`] function that you can use to save mesh outputs as `obj` files.
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
from diffusers import DiffusionPipeline
|
||||
from diffusers.utils import export_to_ply
|
||||
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
|
||||
repo = "openai/shap-e"
|
||||
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16, variant="fp16")
|
||||
pipe = pipe.to(device)
|
||||
|
||||
guidance_scale = 15.0
|
||||
prompt = "A birthday cupcake"
|
||||
|
||||
images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
|
||||
|
||||
ply_path = export_to_ply(images[0], "3d_cake.ply")
|
||||
print(f"saved to folder: {ply_path}")
|
||||
```
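If you prefer an `obj` mesh instead, the same output can be saved with `export_to_obj` (assuming it is exposed from `diffusers.utils` alongside `export_to_ply`):

```python
from diffusers.utils import export_to_obj

# reuse the mesh output generated above and save it as an .obj file
obj_path = export_to_obj(images[0], "3d_cake.obj")
print(f"saved to: {obj_path}")
```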
|
||||
|
||||
Hugging Face Datasets supports mesh visualization for mesh files in `glb` format. Below we will show you how to convert your mesh file into `glb` format so that you can use the Dataset viewer to render 3D objects.

We need to install the `trimesh` library.
|
||||
|
||||
```
|
||||
pip install trimesh
|
||||
```
|
||||
|
||||
To convert the mesh file into `glb` format,
|
||||
|
||||
```python
|
||||
import trimesh
|
||||
|
||||
mesh = trimesh.load("3d_cake.ply")
|
||||
mesh.export("3d_cake.glb", file_type="glb")
|
||||
```
|
||||
|
||||
By default, the mesh output of Shap-E is from the bottom viewpoint; you can change the default viewpoint by applying a rotation transformation:
|
||||
|
||||
```python
|
||||
import trimesh
|
||||
import numpy as np
|
||||
|
||||
mesh = trimesh.load("3d_cake.ply")
|
||||
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
|
||||
mesh = mesh.apply_transform(rot)
|
||||
mesh.export("3d_cake.glb", file_type="glb")
|
||||
```
|
||||
|
||||
Now you can upload your mesh file to your dataset and visualize it! Here is a link to the 3D cake we just generated:
https://huggingface.co/datasets/hf-internal-testing/diffusers-images/blob/main/shap_e/3d_cake.glb
|
||||
|
||||
## ShapEPipeline
|
||||
[[autodoc]] ShapEPipeline
|
||||
- all
|
||||
|
||||
@@ -28,12 +28,11 @@ This model was contributed by the community contributor [HimariO](https://github
|
||||
|
||||
| Pipeline | Tasks | Demo |
|---|---|:---:|
| [StableDiffusionAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning* | - |
| [StableDiffusionXLAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning on StableDiffusion-XL* | - |
|
||||
|
||||
## Usage example with the base model of StableDiffusion-1.4/1.5

In the following, we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
|
||||
All adapters use the same pipeline.
|
||||
|
||||
1. Images are first converted into the appropriate *control image* format.
|
||||
@@ -94,62 +93,6 @@ out_image = pipe(
|
||||
|
||||

|
||||
|
||||
## Usage example with the base model of StableDiffusion-XL
|
||||
|
||||
In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
|
||||
All adapters use the same pipeline.
|
||||
|
||||
1. Images are first downloaded into the appropriate *control image* format.
|
||||
2. The *control image* and *prompt* are passed to the [`StableDiffusionXLAdapterPipeline`].
|
||||
|
||||
Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
|
||||
|
||||
```python
|
||||
from diffusers.utils import load_image
|
||||
|
||||
sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
|
||||
```
|
||||
|
||||

|
||||
|
||||
Then, create the adapter pipeline
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import (
|
||||
T2IAdapter,
|
||||
StableDiffusionXLAdapterPipeline,
|
||||
DDPMScheduler
|
||||
)
|
||||
from diffusers.models.unet_2d_condition import UNet2DConditionModel
|
||||
|
||||
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
|
||||
adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0",torch_dtype=torch.float16, adapter_type="full_adapter_xl")
|
||||
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
|
||||
|
||||
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
|
||||
model_id, adapter=adapter, safety_checker=None, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
|
||||
)
|
||||
|
||||
pipe.to("cuda")
|
||||
```
|
||||
|
||||
Finally, pass the prompt and control image to the pipeline
|
||||
|
||||
```py
|
||||
# fix the random seed, so you will get the same result as the example
|
||||
generator = torch.Generator().manual_seed(42)
|
||||
|
||||
sketch_image_out = pipe(
|
||||
prompt="a photo of a dog in real world, high quality",
|
||||
negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
|
||||
image=sketch_image,
|
||||
generator=generator,
|
||||
guidance_scale=7.5
|
||||
).images[0]
|
||||
```
|
||||
|
||||

|
||||
|
||||
## Available checkpoints
|
||||
|
||||
@@ -170,9 +113,6 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
|
||||
|[TencentARC/t2iadapter_depth_sd15v2](https://huggingface.co/TencentARC/t2iadapter_depth_sd15v2)||
|
||||
|[TencentARC/t2iadapter_sketch_sd15v2](https://huggingface.co/TencentARC/t2iadapter_sketch_sd15v2)||
|
||||
|[TencentARC/t2iadapter_zoedepth_sd15v1](https://huggingface.co/TencentARC/t2iadapter_zoedepth_sd15v1)||
|
||||
|[Adapter/t2iadapter, subfolder='sketch_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0)||
|
||||
|[Adapter/t2iadapter, subfolder='canny_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/canny_sdxl_1.0)||
|
||||
|[Adapter/t2iadapter, subfolder='openpose_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/openpose_sdxl_1.0)||
|
||||
|
||||
## Combining multiple adapters
|
||||
|
||||
@@ -245,14 +185,3 @@ However, T2I-Adapter performs slightly worse than ControlNet.
|
||||
- disable_vae_slicing
|
||||
- enable_xformers_memory_efficient_attention
|
||||
- disable_xformers_memory_efficient_attention
|
||||
|
||||
## StableDiffusionXLAdapterPipeline
|
||||
[[autodoc]] StableDiffusionXLAdapterPipeline
|
||||
- all
|
||||
- __call__
|
||||
- enable_attention_slicing
|
||||
- disable_attention_slicing
|
||||
- enable_vae_slicing
|
||||
- disable_vae_slicing
|
||||
- enable_xformers_memory_efficient_attention
|
||||
- disable_xformers_memory_efficient_attention
|
||||
|
||||
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
# GLIGEN (Grounded Language-to-Image Generation)
|
||||
|
||||
The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] and [`StableDiffusionGLIGENTextImagePipeline`] can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes, if input images are given, [`StableDiffusionGLIGENTextImagePipeline`] can insert objects described by text at regions defined by the bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at regions defined by the bounding boxes. It's trained on the COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
|
||||
|
||||
The abstract from the [paper](https://huggingface.co/papers/2301.07093) is:
|
||||
|
||||
@@ -26,7 +26,7 @@ If you want to use one of the official checkpoints for a task, explore the [glig
|
||||
|
||||
</Tip>
|
||||
|
||||
[`StableDiffusionGLIGENPipeline`] was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful) and [`StableDiffusionGLIGENTextImagePipeline`] was contributed by [Nguyễn Công Tú Anh](https://github.com/tuanh123789).
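As a minimal sketch of the grounded text-to-image generation described above, the snippet below conditions the pipeline on a caption plus phrase/bounding-box pairs. The checkpoint name is an assumption and the boxes are illustrative normalized `[xmin, ymin, xmax, ymax]` coordinates:

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

# assumed text-box grounded generation checkpoint
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a waterfall and a modern high speed train in a beautiful forest with fall foliage"
# each phrase is grounded at the corresponding normalized [xmin, ymin, xmax, ymax] box
phrases = ["a waterfall", "a modern high speed train"]
boxes = [[0.1387, 0.2051, 0.4277, 0.7090], [0.4980, 0.4355, 0.8516, 0.7266]]

image = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1,
    num_inference_steps=50,
).images[0]

image.save("gligen_text_box_generation.png")
```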
|
||||
|
||||
## StableDiffusionGLIGENPipeline
|
||||
|
||||
@@ -41,19 +41,6 @@ If you want to use one of the official checkpoints for a task, explore the [glig
|
||||
- prepare_latents
|
||||
- enable_fuser
|
||||
|
||||
## StableDiffusionGLIGENTextImagePipeline
|
||||
|
||||
[[autodoc]] StableDiffusionGLIGENTextImagePipeline
|
||||
- all
|
||||
- __call__
|
||||
- enable_vae_slicing
|
||||
- disable_vae_slicing
|
||||
- enable_vae_tiling
|
||||
- disable_vae_tiling
|
||||
- enable_model_cpu_offload
|
||||
- prepare_latents
|
||||
- enable_fuser
|
||||
|
||||
## StableDiffusionPipelineOutput
|
||||
|
||||
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
|
||||
|
||||
@@ -10,29 +10,382 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Stable Diffusion XL

Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.

The abstract from the paper is:

*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.*
|
||||
|
||||
## Tips
|
||||
|
||||
- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't work for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders.
- SDXL output images can be improved by making use of a refiner model in an image-to-image setting.
- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters (see the sketch after this list).
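The sketch below shows the negative size/crop conditioning arguments in use; the values are illustrative rather than tuned recommendations, and the arguments are assumed to be available in your installed `diffusers` version:

```py
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(
    prompt=prompt,
    # steer the model away from low-resolution, heavily cropped training examples
    negative_original_size=(512, 512),
    negative_crops_coords_top_left=(0, 0),
    negative_target_size=(1024, 1024),
).images[0]
```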
|
||||
|
||||
### Available checkpoints:
|
||||
|
||||
- *Text-to-Image (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with [`StableDiffusionXLPipeline`]
|
||||
- *Image-to-Image / Refiner (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) with [`StableDiffusionXLImg2ImgPipeline`]
|
||||
|
||||
## Usage Example
|
||||
|
||||
Before using SDXL, make sure to have `transformers`, `accelerate`, `safetensors`, and `invisible_watermark` installed.
You can install the libraries as follows:
|
||||
|
||||
```
|
||||
pip install transformers
|
||||
pip install accelerate
|
||||
pip install safetensors
|
||||
```
|
||||
|
||||
### Watermarker
|
||||
|
||||
We recommend adding an invisible watermark to images generated by Stable Diffusion XL; this can help identify whether an image is machine-synthesized for downstream applications. To do so, please install
the [invisible-watermark library](https://pypi.org/project/invisible-watermark/) via:
|
||||
|
||||
```
|
||||
pip install invisible-watermark>=0.2.0
|
||||
```
|
||||
|
||||
If the `invisible-watermark` library is installed, the watermarker will be used **by default**.
|
||||
|
||||
If you have other provisions for generating or deploying images safely, you can disable the watermarker as follows:
|
||||
|
||||
```py
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)
|
||||
```
|
||||
|
||||
### Text-to-Image
|
||||
|
||||
You can use SDXL as follows for *text-to-image*:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLPipeline
|
||||
import torch
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
image = pipe(prompt=prompt).images[0]
|
||||
```
|
||||
|
||||
### Image-to-image
|
||||
|
||||
You can use SDXL as follows for *image-to-image*:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import StableDiffusionXLImg2ImgPipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe = pipe.to("cuda")
|
||||
url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"
|
||||
|
||||
init_image = load_image(url).convert("RGB")
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt, image=init_image).images[0]
|
||||
```
|
||||
|
||||
### Inpainting
|
||||
|
||||
You can use SDXL as follows for *inpainting*
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import StableDiffusionXLInpaintPipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
|
||||
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
|
||||
|
||||
init_image = load_image(img_url).convert("RGB")
|
||||
mask_image = load_image(mask_url).convert("RGB")
|
||||
|
||||
prompt = "A majestic tiger sitting on a bench"
|
||||
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0]
|
||||
```
|
||||
|
||||
### Refining the image output
|
||||
|
||||
In addition to the [base model checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0),
|
||||
StableDiffusion-XL also includes a [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0)
|
||||
that is specialized in denoising low-noise stage images to generate images of improved high-frequency quality.
|
||||
This refiner checkpoint can be used as a "second-step" pipeline after having run the base checkpoint to improve
|
||||
image quality.
|
||||
|
||||
When using the refiner, one can easily
|
||||
- 1.) employ the base model and refiner as an *Ensemble of Expert Denoisers* as first proposed in [eDiff-I](https://research.nvidia.com/labs/dir/eDiff-I/) or
|
||||
- 2.) simply run the refiner in [SDEdit](https://arxiv.org/abs/2108.01073) fashion after the base model.
|
||||
|
||||
**Note**: The idea of using SD-XL base & refiner as an ensemble of experts was first brought forward by
|
||||
a couple community contributors which also helped shape the following `diffusers` implementation, namely:
|
||||
- [SytanSD](https://github.com/SytanSD)
|
||||
- [bghira](https://github.com/bghira)
|
||||
- [Birch-san](https://github.com/Birch-san)
|
||||
- [AmericanPresidentJimmyCarter](https://github.com/AmericanPresidentJimmyCarter)
|
||||
|
||||
#### 1.) Ensemble of Expert Denoisers
|
||||
|
||||
When using the base and refiner model as an ensemble of expert denoisers, the base model should serve as the
|
||||
expert for the high-noise diffusion stage and the refiner serves as the expert for the low-noise diffusion stage.
|
||||
|
||||
The advantage of 1.) over 2.) is that it requires fewer overall denoising steps and therefore should be significantly
faster. The drawback is that one cannot really inspect the output of the base model; it will still be heavily noised.
||||
|
||||
To use the base model and refiner as an ensemble of expert denoisers, make sure to define the span
|
||||
of timesteps which should be run through the high-noise denoising stage (*i.e.* the base model) and the low-noise
|
||||
denoising stage (*i.e.* the refiner model) respectively. We can set the intervals using the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) of the base model
|
||||
and [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) of the refiner model.
|
||||
|
||||
For both `denoising_end` and `denoising_start` a float value between 0 and 1 should be passed.
|
||||
When passed, the end and start of denoising will be defined by proportions of discrete timesteps as
|
||||
defined by the model schedule.
|
||||
Note that this will override `strength` if it is also declared, since the number of denoising steps
|
||||
is determined by the discrete timesteps the model was trained on and the declared fractional cutoff.
|
||||
|
||||
Let's look at an example.
|
||||
First, we import the two pipelines. Since the text encoders and variational autoencoder are the same
|
||||
you don't have to load those again for the refiner.
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
base = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
base.to("cuda")
|
||||
|
||||
refiner = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0",
|
||||
text_encoder_2=base.text_encoder_2,
|
||||
vae=base.vae,
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
variant="fp16",
|
||||
)
|
||||
refiner.to("cuda")
|
||||
```
|
||||
|
||||
Now we define the number of inference steps and the point at which the model shall be run through the
|
||||
high-noise denoising stage (*i.e.* the base model).
|
||||
|
||||
```py
|
||||
n_steps = 40
|
||||
high_noise_frac = 0.8
|
||||
```
|
||||
|
||||
Stable Diffusion XL base is trained on timesteps 0-999 and Stable Diffusion XL refiner is finetuned
|
||||
from the base model on low noise timesteps 0-199 inclusive, so we use the base model for the first
|
||||
800 timesteps (high noise) and the refiner for the last 200 timesteps (low noise). Hence, `high_noise_frac`
|
||||
is set to 0.8, so that all steps 200-999 (the first 80% of denoising timesteps) are performed by the
|
||||
base model and steps 0-199 (the last 20% of denoising timesteps) are performed by the refiner model.
|
||||
|
||||
Remember, the denoising process starts at **high value** (high noise) timesteps and ends at
|
||||
**low value** (low noise) timesteps.
|
||||
|
||||
Let's run the two pipelines now. Make sure to set `denoising_end` and
|
||||
`denoising_start` to the same values and keep `num_inference_steps` constant. Also remember that
|
||||
the output of the base model should be in latent space:
|
||||
|
||||
```py
|
||||
prompt = "A majestic lion jumping from a big stone at night"
|
||||
|
||||
image = base(
|
||||
prompt=prompt,
|
||||
num_inference_steps=n_steps,
|
||||
denoising_end=high_noise_frac,
|
||||
output_type="latent",
|
||||
).images
|
||||
image = refiner(
|
||||
prompt=prompt,
|
||||
num_inference_steps=n_steps,
|
||||
denoising_start=high_noise_frac,
|
||||
image=image,
|
||||
).images[0]
|
||||
```
|
||||
|
||||
Let's have a look at the images
|
||||
|
||||
| Original Image | Ensemble of Expert Denoisers |
|
||||
|---|---|
|
||||
|  | 
|
||||
|
||||
If we had just run the base model on the same 40 steps, the image would arguably have been less detailed (e.g. the lion's eyes and nose):
|
||||
|
||||
<Tip>
|
||||
|
||||
To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide.
|
||||
|
||||
Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints!
|
||||
The ensemble-of-experts method works well on all available schedulers!
|
||||
|
||||
</Tip>
|
||||
|
||||
#### 2.) Refining the image output from fully denoised base image
|
||||
|
||||
In standard [`StableDiffusionImg2ImgPipeline`] fashion, the fully-denoised image generated by the base model
can be further improved using the [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0).
|
||||
|
||||
For this, you simply run the refiner as a normal image-to-image pipeline after the "base" text-to-image
|
||||
pipeline. You can leave the outputs of the base model in latent space.
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
refiner = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0",
|
||||
text_encoder_2=pipe.text_encoder_2,
|
||||
vae=pipe.vae,
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
variant="fp16",
|
||||
)
|
||||
refiner.to("cuda")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
image = pipe(prompt=prompt, output_type="latent").images[0]
|
||||
image = refiner(prompt=prompt, image=image[None, :]).images[0]
|
||||
```
|
||||
|
||||
| Original Image | Refined Image |
|
||||
|---|---|
|
||||
|  |  |
|
||||
|
||||
<Tip>
|
||||
|
||||
The refiner can also very well be used in an in-painting setting. To do so just make
|
||||
sure you use the [`StableDiffusionXLInpaintPipeline`] classes as shown below
|
||||
|
||||
</Tip>
|
||||
|
||||
To use the refiner for inpainting in the Ensemble of Expert Denoisers setting you can do the following:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLInpaintPipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0",
|
||||
text_encoder_2=pipe.text_encoder_2,
|
||||
vae=pipe.vae,
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
variant="fp16",
|
||||
)
|
||||
refiner.to("cuda")
|
||||
|
||||
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
|
||||
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
|
||||
|
||||
init_image = load_image(img_url).convert("RGB")
|
||||
mask_image = load_image(mask_url).convert("RGB")
|
||||
|
||||
prompt = "A majestic tiger sitting on a bench"
|
||||
num_inference_steps = 75
|
||||
high_noise_frac = 0.7
|
||||
|
||||
image = pipe(
|
||||
prompt=prompt,
|
||||
image=init_image,
|
||||
mask_image=mask_image,
|
||||
num_inference_steps=num_inference_steps,
|
||||
denoising_start=high_noise_frac,
|
||||
output_type="latent",
|
||||
).images
|
||||
image = refiner(
|
||||
prompt=prompt,
|
||||
image=image,
|
||||
mask_image=mask_image,
|
||||
num_inference_steps=num_inference_steps,
|
||||
denoising_start=high_noise_frac,
|
||||
).images[0]
|
||||
```
|
||||
|
||||
To use the refiner for inpainting in the standard SDEdit-style setting, simply remove `denoising_end` and `denoising_start` and choose a smaller
|
||||
number of inference steps for the refiner.
|
||||
|
||||
### Loading single file checkpoints / original file format
|
||||
|
||||
By making use of [`~diffusers.loaders.FromSingleFileMixin.from_single_file`] you can also load the
|
||||
original file format into `diffusers`:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
|
||||
import torch
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_single_file(
|
||||
"./sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
|
||||
"./sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
|
||||
)
|
||||
refiner.to("cuda")
|
||||
```
|
||||
|
||||
### Memory optimization via model offloading
|
||||
|
||||
If you are seeing out-of-memory errors, we recommend making use of [`StableDiffusionXLPipeline.enable_model_cpu_offload`].
|
||||
|
||||
```diff
|
||||
- pipe.to("cuda")
|
||||
+ pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
and
|
||||
|
||||
```diff
|
||||
- refiner.to("cuda")
|
||||
+ refiner.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
### Speed-up inference with `torch.compile`
|
||||
|
||||
You can speed up inference by making use of `torch.compile`. This should give you roughly a 20% speed-up.
|
||||
|
||||
```diff
|
||||
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
|
||||
```
|
||||
|
||||
### Running with `torch < 2.0`
|
||||
|
||||
**Note** that if you want to run Stable Diffusion XL with `torch` < 2.0, please make sure to enable xformers
|
||||
attention:
|
||||
|
||||
```
|
||||
pip install xformers
|
||||
```
|
||||
|
||||
```diff
|
||||
+pipe.enable_xformers_memory_efficient_attention()
|
||||
+refiner.enable_xformers_memory_efficient_attention()
|
||||
```
|
||||
|
||||
## StableDiffusionXLPipeline
|
||||
|
||||
[[autodoc]] StableDiffusionXLPipeline
|
||||
@@ -50,3 +403,25 @@ Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organizatio
|
||||
[[autodoc]] StableDiffusionXLInpaintPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
### Passing different prompts to each text-encoder
|
||||
|
||||
Stable Diffusion XL was trained on two text encoders. The default behavior is to pass the same prompt to each. But it is possible to pass a different prompt for each text-encoder, as [some users](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201) noted that it can boost quality.
|
||||
To do so, you can pass `prompt_2` and `negative_prompt_2` in addition to `prompt` and `negative_prompt`. By doing that, you will pass the original prompts and negative prompts (as in `prompt` and `negative_prompt`) to `text_encoder` (in official SDXL 0.9/1.0 that is [OpenAI CLIP-ViT/L-14](https://huggingface.co/openai/clip-vit-large-patch14)),
|
||||
and `prompt_2` and `negative_prompt_2` to `text_encoder_2` (in official SDXL 0.9/1.0 that is [OpenCLIP-ViT/bigG-14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLPipeline
|
||||
import torch
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
# prompt will be passed to OAI CLIP-ViT/L-14
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
# prompt_2 will be passed to OpenCLIP-ViT/bigG-14
|
||||
prompt_2 = "monet painting"
|
||||
image = pipe(prompt=prompt, prompt_2=prompt_2).images[0]
|
||||
```
|
||||
|
||||
@@ -20,12 +20,6 @@ The abstract from the [paper](https://arxiv.org/abs/2303.06555) is:
|
||||
|
||||
You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml).
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become `NaNs`. This issue can be mitigated by switching to PyTorch 2.X.
|
||||
|
||||
</Tip>
|
||||
|
||||
This pipeline was contributed by [dg845](https://github.com/dg845). ❤️
|
||||
|
||||
## Usage Examples
|
||||
|
||||
@@ -1,149 +0,0 @@
|
||||
# Würstchen
|
||||
|
||||
<img src="https://github.com/dome272/Wuerstchen/assets/61938694/0617c863-165a-43ee-9303-2a17299a0cf9">
|
||||
|
||||
[Würstchen: Efficient Pretraining of Text-to-Image Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville.
|
||||
|
||||
The abstract from the paper is:
|
||||
|
||||
*We introduce Würstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach, which utilizes latent diffusion strategies at strong latent image compression rates, significantly reduces the computational burden, typically associated with state-of-the-art models, while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, thereby rendering real-time applications more viable. One of the key advantages of our method lies in its modest training requirements of only 9,200 GPU hours, slashing the usual costs significantly without compromising the end performance. In a comparison against the state-of-the-art, we found the approach to yield strong competitiveness. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, hence democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in the realm of text-to-image synthesis, offering an innovative path to explore in future research.*
|
||||
|
||||
## Würstchen Overview
|
||||
Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by magnitudes. Training on 1024x1024 images is way more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637) ). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference.
|
||||
|
||||
## Würstchen v2 comes to Diffusers
|
||||
|
||||
After the initial paper release, we have improved numerous things in the architecture, training and sampling, making Würstchen competitive to current state-of-the-art models in many ways. We are excited to release this new version together with Diffusers. Here is a list of the improvements.
|
||||
|
||||
- Higher resolution (1024x1024 up to 2048x2048)
|
||||
- Faster inference
|
||||
- Multi Aspect Resolution Sampling
|
||||
- Better quality
|
||||
|
||||
|
||||
We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:
|
||||
|
||||
- v2-base
|
||||
- v2-aesthetic
|
||||
- **(default)** v2-interpolated (50% interpolation between v2-base and v2-aesthetic)
|
||||
|
||||
We recommend using v2-interpolated, as it has a nice touch of both photorealism and aesthetics. Use v2-base for finetunings as it does not have a style bias and use v2-aesthetic for very artistic generations.
|
||||
A comparison can be seen here:
|
||||
|
||||
<img src="https://github.com/dome272/Wuerstchen/assets/61938694/2914830f-cbd3-461c-be64-d50734f4b49d" width=500>
|
||||
|
||||
## Text-to-Image Generation
|
||||
|
||||
For the sake of usability, Würstchen can be used with a single pipeline. This pipeline can be used as follows:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
|
||||
|
||||
pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")
|
||||
|
||||
caption = "Anthropomorphic cat dressed as a fire fighter"
|
||||
images = pipe(
|
||||
caption,
|
||||
width=1024,
|
||||
height=1536,
|
||||
prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
|
||||
prior_guidance_scale=4.0,
|
||||
num_images_per_prompt=2,
|
||||
).images
|
||||
```
|
||||
|
||||
For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, and Stage A. They each have a different job and only work together. When generating text-conditional images, Stage C first generates the latents in a very compressed latent space. This is what happens in the `prior_pipeline`. Afterwards, the generated latents are passed to Stage B, which decompresses them into the bigger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into pixel space. Stage B & Stage A are both encapsulated in the `decoder_pipeline`. For more details, take a look at the [paper](https://huggingface.co/papers/2306.00637).
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
|
||||
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
|
||||
|
||||
device = "cuda"
|
||||
dtype = torch.float16
|
||||
num_images_per_prompt = 2
|
||||
|
||||
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
|
||||
"warp-ai/wuerstchen-prior", torch_dtype=dtype
|
||||
).to(device)
|
||||
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
|
||||
"warp-ai/wuerstchen", torch_dtype=dtype
|
||||
).to(device)
|
||||
|
||||
caption = "Anthropomorphic cat dressed as a fire fighter"
|
||||
negative_prompt = ""
|
||||
|
||||
prior_output = prior_pipeline(
|
||||
prompt=caption,
|
||||
height=1024,
|
||||
width=1536,
|
||||
timesteps=DEFAULT_STAGE_C_TIMESTEPS,
|
||||
negative_prompt=negative_prompt,
|
||||
guidance_scale=4.0,
|
||||
num_images_per_prompt=num_images_per_prompt,
|
||||
)
|
||||
decoder_output = decoder_pipeline(
|
||||
image_embeddings=prior_output.image_embeddings,
|
||||
prompt=caption,
|
||||
negative_prompt=negative_prompt,
|
||||
guidance_scale=0.0,
|
||||
output_type="pil",
|
||||
).images
|
||||
```
|
||||
|
||||
## Speed-Up Inference
|
||||
You can make use of the `torch.compile` function and gain a speed-up of about 2-3x:
|
||||
|
||||
```python
|
||||
prior_pipeline.prior = torch.compile(prior_pipeline.prior, mode="reduce-overhead", fullgraph=True)
|
||||
decoder_pipeline.decoder = torch.compile(decoder_pipeline.decoder, mode="reduce-overhead", fullgraph=True)
|
||||
```
|
||||
|
||||
## Limitations
|
||||
|
||||
- Due to the high compression employed by Würstchen, generations can lack a good amount
of detail. To the human eye, this is especially noticeable in faces, hands, etc.
- **Images can only be generated in 128-pixel steps**, e.g. the next higher resolution
after 1024x1024 is 1152x1152.
|
||||
- The model lacks the ability to render correct text in images
|
||||
- The model often does not achieve photorealism
|
||||
- Difficult compositional prompts are hard for the model
|
||||
|
||||
The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen).
|
||||
|
||||
## WuerstchenCombinedPipeline
|
||||
|
||||
[[autodoc]] WuerstchenCombinedPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## WuerstchenPriorPipeline
|
||||
|
||||
[[autodoc]] WuerstchenPriorPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## WuerstchenPriorPipelineOutput
|
||||
|
||||
[[autodoc]] pipelines.wuerstchen.pipeline_wuerstchen_prior.WuerstchenPriorPipelineOutput
|
||||
|
||||
## WuerstchenDecoderPipeline
|
||||
|
||||
[[autodoc]] WuerstchenDecoderPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
|
||||
@misc{pernias2023wuerstchen,
|
||||
title={Wuerstchen: Efficient Pretraining of Text-to-Image Models},
|
||||
author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville},
|
||||
year={2023},
|
||||
eprint={2306.00637},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CV}
|
||||
}
|
||||
```
|
||||
@@ -2,26 +2,26 @@
|
||||
|
||||
Utility and helper functions for working with 🤗 Diffusers.
|
||||
|
||||
## randn_tensor
|
||||
|
||||
[[autodoc]] diffusers.utils.randn_tensor
|
||||
|
||||
## numpy_to_pil
|
||||
|
||||
[[autodoc]] utils.numpy_to_pil
|
||||
[[autodoc]] utils.pil_utils.numpy_to_pil
|
||||
|
||||
## pt_to_pil
|
||||
|
||||
[[autodoc]] utils.pt_to_pil
|
||||
[[autodoc]] utils.pil_utils.pt_to_pil
|
||||
|
||||
## load_image
|
||||
|
||||
[[autodoc]] utils.load_image
|
||||
|
||||
## export_to_gif
|
||||
|
||||
[[autodoc]] utils.export_to_gif
|
||||
[[autodoc]] utils.testing_utils.load_image
|
||||
|
||||
## export_to_video
|
||||
|
||||
[[autodoc]] utils.export_to_video
|
||||
[[autodoc]] utils.testing_utils.export_to_video
|
||||
|
||||
## make_image_grid
|
||||
|
||||
[[autodoc]] utils.pil_utils.make_image_grid
|
||||
|
||||
@@ -40,7 +40,7 @@ In the following, we give an overview of different ways to contribute, ranked by
|
||||
As said before, **all contributions are valuable to the community**.
|
||||
In the following, we will explain each contribution a bit more in detail.
|
||||
|
||||
For all contributions 4.-9. you will need to open a PR. It is explained in detail how to do so in [Opening a pull request](#how-to-open-a-pr)
|
||||
|
||||
|
||||
### 1. Asking and answering questions on the Diffusers discussion forum or on the Diffusers Discord
|
||||
|
||||
@@ -63,7 +63,7 @@ In the same spirit, you are of immense help to the community by answering such q
|
||||
|
||||
**Please** keep in mind that the more effort you put into asking or answering a question, the higher
|
||||
the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database.
|
||||
In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formated/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
|
||||
|
||||
|
||||
**NOTE about channels**:
|
||||
[*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago.
|
||||
@@ -168,7 +168,7 @@ more precise, provide the link to a duplicated issue or redirect them to [the fo
|
||||
If you have verified that the issued bug report is correct and requires a correction in the source code,
|
||||
please have a look at the next sections.
|
||||
|
||||
For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull request](#how-to-open-a-pr) section.
|
||||
|
||||
|
||||
### 4. Fixing a `Good first issue`
|
||||
|
||||
|
||||
@@ -334,7 +334,7 @@ image_processor = CLIPImageProcessor.from_pretrained(clip_id)
|
||||
image_encoder = CLIPVisionModelWithProjection.from_pretrained(clip_id).to(device)
|
||||
```
|
||||
|
||||
Notice that we are using a particular CLIP checkpoint, i.e., `openai/clip-vit-large-patch14`. This is because the Stable Diffusion pre-training was performed with this CLIP variant. For more details, refer to the [documentation](https://huggingface.co/docs/transformers/model_doc/clip).
|
||||
Notice that we are using a particular CLIP checkpoint, i.e., `openai/clip-vit-large-patch14`. This is because the Stable Diffusion pre-training was performed with this CLIP variant. For more details, refer to the [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/pix2pix#diffusers.StableDiffusionInstructPix2PixPipeline.text_encoder).
|
||||
|
||||
Next, we prepare a PyTorch `nn.Module` to compute directional similarity:
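A sketch of such a module is shown below. It assumes that `tokenizer` and `text_encoder` (a `CLIPTokenizer` and `CLIPTextModelWithProjection` loaded from the same CLIP checkpoint) were created alongside the `image_processor` and `image_encoder` above; the class itself is an illustrative implementation, not a library API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirectionalSimilarity(nn.Module):
    def __init__(self, tokenizer, text_encoder, image_processor, image_encoder):
        super().__init__()
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        self.image_processor = image_processor
        self.image_encoder = image_encoder

    def encode_image(self, image):
        # preprocess the image and project it into the shared CLIP embedding space
        inputs = self.image_processor(image, return_tensors="pt").to(self.image_encoder.device)
        image_features = self.image_encoder(**inputs).image_embeds
        return F.normalize(image_features, dim=-1)

    def encode_text(self, text):
        # tokenize the caption and project it into the shared CLIP embedding space
        inputs = self.tokenizer(
            text, max_length=77, padding="max_length", truncation=True, return_tensors="pt"
        ).to(self.text_encoder.device)
        text_features = self.text_encoder(**inputs).text_embeds
        return F.normalize(text_features, dim=-1)

    @torch.no_grad()
    def forward(self, image_one, image_two, caption_one, caption_two):
        # directional similarity: how well the edit direction in image space
        # matches the edit direction in text space
        image_direction = self.encode_image(image_two) - self.encode_image(image_one)
        text_direction = self.encode_text(caption_two) - self.encode_text(caption_one)
        return F.cosine_similarity(image_direction, text_direction)
```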
|
||||
|
||||
|
||||
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
Install 🤗 Diffusers for whichever deep learning library you're working with.
|
||||
|
||||
🤗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+ and Flax. Follow the installation instructions below for the deep learning library you are using:
|
||||
|
||||
|
||||
- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions.
|
||||
- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions.
|
||||
@@ -106,7 +106,7 @@ pip install -e ".[flax]"
|
||||
|
||||
These commands will link the folder you cloned the repository to and your Python library paths.
|
||||
Python will now look inside the folder you cloned to in addition to the normal library paths.
|
||||
For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.8/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to.
|
||||
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
|
||||
@@ -10,19 +10,13 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Speed up inference

There are several ways to optimize 🤗 Diffusers for inference speed. As a general rule of thumb, we recommend using either [xFormers](xformers) or `torch.nn.functional.scaled_dot_product_attention` in PyTorch 2.0 for their memory-efficient attention.
|
||||
|
||||
<Tip>
|
||||
In many cases, optimizing for speed or memory leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about preserving memory in the [Reduce memory usage](memory) guide.
|
||||
|
||||
</Tip>
|
||||
|
||||
The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect.
|
||||
|
||||
| | latency | speed-up |
| ---------------- | ------- | ------- |
|
||||
| original | 9.50s | x1 |
|
||||
| fp16 | 3.61s | x2.63 |
|
||||
@@ -30,9 +24,15 @@ The results below are obtained from generating a single 512x512 image from the p
|
||||
| traced UNet | 3.21s | x2.96 |
|
||||
| memory efficient attention | 2.63s | x3.61 |
|
||||
|
||||
## Use TensorFloat-32

On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speed up computations with typically negligible loss in numerical accuracy. All you need to do is add this before your inference:
|
||||
|
||||
```python
|
||||
import torch
|
||||
@@ -40,11 +40,9 @@ import torch
|
||||
torch.backends.cuda.matmul.allow_tf32 = True
|
||||
```
|
||||
|
||||
You can learn more about TF32 in the [Mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32) guide.
|
||||
## Half-precision weights

To save GPU memory and get more speed, try loading and running the model weights directly in half-precision or float16:
|
||||
|
||||
```Python
import torch
from diffusers import DiffusionPipeline

# load the weights in float16 to halve the memory footprint and speed up inference
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```
@@ -63,6 +61,351 @@ image = pipe(prompt).images[0]
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Don't use [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than pure float16 precision.

</Tip>
|
||||
|
||||
## Sliced VAE decode for larger batches
|
||||
|
||||
To decode large batches of images with limited VRAM, or to enable batches with 32 images or more, you can use sliced VAE decode that decodes the batch latents one image at a time.
|
||||
|
||||
You likely want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
|
||||
|
||||
To perform the VAE decode one image at a time, invoke [`~StableDiffusionPipeline.enable_vae_slicing`] in your pipeline before inference. For example:
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_vae_slicing()
|
||||
images = pipe([prompt] * 32).images
|
||||
```
|
||||
|
||||
You may see a small performance boost in VAE decode on multi-image batches. There should be no performance impact on single-image batches.
|
||||
|
||||
|
||||
## Tiled VAE decode and encode for large images
|
||||
|
||||
Tiled VAE processing makes it possible to work with large images on limited VRAM. For example, generating 4k images in 8GB of VRAM. Tiled VAE decoder splits the image into overlapping tiles, decodes the tiles, and blends the outputs to make the final image.
|
||||
|
||||
You want to couple this with [`~StableDiffusionPipeline.enable_xformers_memory_efficient_attention`] to further minimize memory use.
|
||||
|
||||
To use tiled VAE processing, invoke [`~StableDiffusionPipeline.enable_vae_tiling`] in your pipeline before inference. For example:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe = pipe.to("cuda")
|
||||
prompt = "a beautiful landscape photograph"
|
||||
pipe.enable_vae_tiling()
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
|
||||
image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0]
|
||||
```
|
||||
|
||||
The output image will have some tile-to-tile tone variation because the tiles are decoded separately, but you shouldn't see sharp seams between the tiles. Tiling is turned off for images that are 512x512 or smaller.
|
||||
|
||||
|
||||
<a name="sequential_offloading"></a>
|
||||
## Offloading to CPU with accelerate for memory savings
|
||||
|
||||
For additional memory savings, you can offload the weights to CPU and only load them to GPU when performing the forward pass.
|
||||
|
||||
To perform CPU offloading, all you have to do is invoke [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
With this, memory consumption drops to less than 3GB.
|
||||
|
||||
Note that this method works at the submodule level, not on whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different submodules of the UNet are sequentially onloaded and then offloaded as they are needed, so the number of memory transfers is large.
|
||||
|
||||
<Tip>
|
||||
Consider using <a href="#model_offloading">model offloading</a> as another point in the optimization space: it will be much faster, but memory savings won't be as large.
|
||||
</Tip>
|
||||
|
||||
It is also possible to chain offloading with attention slicing for minimal memory consumption (< 2GB).
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)  # chain attention slicing as described above
|
||||
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
**Note**: When using `enable_sequential_cpu_offload()`, it is important to **not** move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal. See [this issue](https://github.com/huggingface/diffusers/issues/1934) for more information.
|
||||
|
||||
**Note**: `enable_sequential_cpu_offload()` is a stateful operation that installs hooks on the models.
|
||||
|
||||
|
||||
<a name="model_offloading"></a>
|
||||
## Model offloading for fast inference and memory savings
|
||||
|
||||
[Sequential CPU offloading](#sequential_offloading), as discussed in the previous section, preserves a lot of memory but makes inference slower, because submodules are moved to GPU as needed, and immediately returned to CPU when a new module runs.
|
||||
|
||||
Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent _modules_. This results in a negligible impact on inference time (compared with moving the pipeline to `cuda`), while still providing some memory savings.
|
||||
|
||||
In this scenario, only one of the main components of the pipeline (typically the text encoder, UNet, and VAE)
|
||||
will be on the GPU while the others wait on the CPU. Components like the UNet that run for multiple iterations stay on the GPU until they are no longer needed.
|
||||
|
||||
This feature can be enabled by invoking `enable_model_cpu_offload()` on the pipeline, as shown below.
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_model_cpu_offload()
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
This is also compatible with attention slicing for additional memory savings.
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing(1)  # chain attention slicing as described above
|
||||
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
<Tip>
|
||||
This feature requires `accelerate` version 0.17.0 or higher.
|
||||
</Tip>
|
||||
|
||||
**Note**: `enable_model_cpu_offload()` is a stateful operation that installs hooks on the models and state on the pipeline. In order to properly offload
|
||||
models after they are called, it is required that the entire pipeline is run and models are called in the order the pipeline expects them to be. Exercise caution
|
||||
if models are re-used outside the context of the pipeline after hooks have been installed. See [accelerate](https://huggingface.co/docs/accelerate/v0.18.0/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module)
|
||||
for further docs on removing hooks.
|
||||
|
||||
## Using Channels Last memory format
|
||||
|
||||
Channels last memory format is an alternative way of ordering NCHW tensors in memory that preserves the dimension ordering. Channels last tensors are ordered in such a way that the channels become the densest dimension (i.e., storing images pixel-per-pixel). Since not all operators currently support the channels last format, it may result in worse performance, so it's best to try it and see if it works for your model.
|
||||
|
||||
For example, in order to set the UNet model in our pipeline to use channels last format, we can use the following:
|
||||
|
||||
```python
|
||||
print(pipe.unet.conv_out.state_dict()["weight"].stride()) # (2880, 9, 3, 1)
|
||||
pipe.unet.to(memory_format=torch.channels_last) # in-place operation
|
||||
print(
|
||||
pipe.unet.conv_out.state_dict()["weight"].stride()
|
||||
) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
|
||||
```
|
||||
|
||||
## Tracing
|
||||
|
||||
Tracing runs an example input tensor through your model and captures the operations that are invoked as that input makes its way through the model's layers. The executable or `ScriptFunction` that is returned is optimized with just-in-time compilation.
|
||||
|
||||
To trace our UNet model, we can use the following:
|
||||
|
||||
```python
|
||||
import time
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
import functools
|
||||
|
||||
# torch disable grad
|
||||
torch.set_grad_enabled(False)
|
||||
|
||||
# set variables
|
||||
n_experiments = 2
|
||||
unet_runs_per_experiment = 50
|
||||
|
||||
|
||||
# load inputs
|
||||
def generate_inputs():
|
||||
sample = torch.randn(2, 4, 64, 64).half().cuda()
|
||||
timestep = torch.rand(1).half().cuda() * 999
|
||||
encoder_hidden_states = torch.randn(2, 77, 768).half().cuda()
|
||||
return sample, timestep, encoder_hidden_states
|
||||
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
).to("cuda")
|
||||
unet = pipe.unet
|
||||
unet.eval()
|
||||
unet.to(memory_format=torch.channels_last) # use channels_last memory format
|
||||
unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default
|
||||
|
||||
# warmup
|
||||
for _ in range(3):
|
||||
with torch.inference_mode():
|
||||
inputs = generate_inputs()
|
||||
orig_output = unet(*inputs)
|
||||
|
||||
# trace
|
||||
print("tracing..")
|
||||
unet_traced = torch.jit.trace(unet, inputs)
|
||||
unet_traced.eval()
|
||||
print("done tracing")
|
||||
|
||||
|
||||
# warmup and optimize graph
|
||||
for _ in range(5):
|
||||
with torch.inference_mode():
|
||||
inputs = generate_inputs()
|
||||
orig_output = unet_traced(*inputs)
|
||||
|
||||
|
||||
# benchmarking
|
||||
with torch.inference_mode():
|
||||
for _ in range(n_experiments):
|
||||
torch.cuda.synchronize()
|
||||
start_time = time.time()
|
||||
for _ in range(unet_runs_per_experiment):
|
||||
orig_output = unet_traced(*inputs)
|
||||
torch.cuda.synchronize()
|
||||
print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
|
||||
for _ in range(n_experiments):
|
||||
torch.cuda.synchronize()
|
||||
start_time = time.time()
|
||||
for _ in range(unet_runs_per_experiment):
|
||||
orig_output = unet(*inputs)
|
||||
torch.cuda.synchronize()
|
||||
print(f"unet inference took {time.time() - start_time:.2f} seconds")
|
||||
|
||||
# save the model
|
||||
unet_traced.save("unet_traced.pt")
|
||||
```
|
||||
|
||||
Then we can replace the `unet` attribute of the pipeline with the traced model like the following
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionPipeline
|
||||
import torch
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class UNet2DConditionOutput:
|
||||
sample: torch.FloatTensor
|
||||
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
).to("cuda")
|
||||
|
||||
# use jitted unet
|
||||
unet_traced = torch.jit.load("unet_traced.pt")
|
||||
|
||||
|
||||
# del pipe.unet
|
||||
class TracedUNet(torch.nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.in_channels = pipe.unet.in_channels
|
||||
self.device = pipe.unet.device
|
||||
|
||||
def forward(self, latent_model_input, t, encoder_hidden_states):
|
||||
sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
|
||||
return UNet2DConditionOutput(sample=sample)
|
||||
|
||||
|
||||
pipe.unet = TracedUNet()
|
||||
|
||||
with torch.inference_mode():
|
||||
    prompt = "a photo of an astronaut riding a horse on mars"  # define the prompt used for generation
    image = pipe([prompt] * 1, num_inference_steps=50).images[0]
|
||||
```
|
||||
|
||||
|
||||
## Memory Efficient Attention
|
||||
|
||||
Recent work on optimizing the bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent is Flash Attention from @tridao: [code](https://github.com/HazyResearch/flash-attention), [paper](https://arxiv.org/pdf/2205.14135.pdf).
|
||||
|
||||
Here are the speedups we obtain on a few Nvidia GPUs when running the inference at 512x512 with a batch size of 1 (one prompt):
|
||||
|
||||
| GPU | Base Attention FP16 | Memory Efficient Attention FP16 |
|
||||
|------------------ |--------------------- |--------------------------------- |
|
||||
| NVIDIA Tesla T4 | 3.5it/s | 5.5it/s |
|
||||
| NVIDIA 3060 RTX | 4.6it/s | 7.8it/s |
|
||||
| NVIDIA A10G | 8.88it/s | 15.6it/s |
|
||||
| NVIDIA RTX A6000 | 11.7it/s | 21.09it/s |
|
||||
| NVIDIA TITAN RTX | 12.51it/s | 18.22it/s |
|
||||
| A100-SXM4-40GB | 18.6it/s | 29it/s |
|
||||
| A100-SXM-80GB | 18.7it/s | 29.5it/s |
|
||||
|
||||
To leverage it, just make sure you have:
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
If you have PyTorch 2.0 installed, you shouldn't use xFormers!
|
||||
|
||||
</Tip>
|
||||
|
||||
- PyTorch > 1.12
|
||||
- CUDA available
|
||||
- [Installed the xformers library](xformers).
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
).to("cuda")
|
||||
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
|
||||
with torch.inference_mode():
|
||||
sample = pipe("a small cat")
|
||||
|
||||
# optional: You can disable it via
|
||||
# pipe.disable_xformers_memory_efficient_attention()
|
||||
```
|
||||
|
||||
@@ -10,22 +10,25 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Habana Gaudi
|
||||
# How to use Stable Diffusion on Habana Gaudi
|
||||
|
||||
🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). Follow the [installation](https://docs.habana.ai/en/latest/Installation_Guide/index.html) guide to install the SynapseAI and Gaudi drivers, and then install Optimum Habana:
|
||||
🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum Habana](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion).
|
||||
|
||||
```bash
|
||||
python -m pip install --upgrade-strategy eager optimum[habana]
|
||||
```
|
||||
## Requirements
|
||||
|
||||
- Optimum Habana 1.6 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it.
|
||||
- SynapseAI 1.10.
|
||||
|
||||
|
||||
## Inference Pipeline
|
||||
|
||||
To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two components:
|
||||
- A pipeline with [`GaudiStableDiffusionPipeline`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline). This pipeline supports *text-to-image generation*.
|
||||
- A scheduler with [`GaudiDDIMScheduler`](https://huggingface.co/docs/optimum/habana/package_reference/stable_diffusion_pipeline#optimum.habana.diffusers.GaudiDDIMScheduler). This scheduler has been optimized for Habana Gaudi.
|
||||
|
||||
- [`~optimum.habana.diffusers.GaudiStableDiffusionPipeline`], a pipeline for text-to-image generation.
|
||||
- [`~optimum.habana.diffusers.GaudiDDIMScheduler`], a Gaudi-optimized scheduler.
|
||||
|
||||
When you initialize the pipeline, you have to specify `use_habana=True` to deploy it on HPUs, and to get the fastest possible generation, you should enable **HPU graphs** with `use_hpu_graphs=True`.
|
||||
|
||||
Finally, specify a [`~optimum.habana.GaudiConfig`] which can be downloaded from the [Habana](https://huggingface.co/Habana) organization on the Hub.
|
||||
When initializing the pipeline, you have to specify `use_habana=True` to deploy it on HPUs.
|
||||
Furthermore, in order to get the fastest possible generations you should enable **HPU graphs** with `use_hpu_graphs=True`.
|
||||
Finally, you will need to specify a [Gaudi configuration](https://huggingface.co/docs/optimum/habana/package_reference/gaudi_config) which can be downloaded from the [Hugging Face Hub](https://huggingface.co/Habana).
|
||||
|
||||
```python
|
||||
from optimum.habana import GaudiConfig
|
||||
@@ -42,8 +45,7 @@ pipeline = GaudiStableDiffusionPipeline.from_pretrained(
|
||||
)
|
||||
```
|
||||
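Because the snippet above is truncated by the diff, here is a minimal sketch of instantiating the pipeline with the options just described, following the Optimum Habana documentation. The checkpoint and the `Habana/stable-diffusion` Gaudi configuration are illustrative; any compatible Stable Diffusion checkpoint and Gaudi configuration from the Hub should work.

```python
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

model_name = "runwayml/stable-diffusion-v1-5"

# Gaudi-optimized scheduler
scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,       # deploy the pipeline on HPUs
    use_hpu_graphs=True,   # enable HPU graphs for faster generation
    gaudi_config="Habana/stable-diffusion",  # Gaudi configuration from the Hub
)
```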
|
||||
Now you can call the pipeline to generate images by batches from one or several prompts:
|
||||
|
||||
You can then call the pipeline to generate images by batches from one or several prompts:
|
||||
```python
|
||||
outputs = pipeline(
|
||||
prompt=[
|
||||
@@ -55,21 +57,21 @@ outputs = pipeline(
|
||||
)
|
||||
```
|
||||
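Since the call above is truncated by the diff, here is a hedged sketch of what a batched call can look like; the prompts, `num_images_per_prompt`, and `batch_size` are illustrative values, not prescribed ones.

```python
outputs = pipeline(
    prompt=[
        "High quality photo of an astronaut riding a horse in space",
        "Face of a yellow cat, high resolution, sitting on a park bench",
    ],
    num_images_per_prompt=10,  # images generated per prompt
    batch_size=4,              # images processed per forward pass
)
```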
|
||||
For more information, check out 🤗 Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository.
|
||||
For more information, check out Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official Github repository.
|
||||
|
||||
|
||||
## Benchmark
|
||||
|
||||
We benchmarked Habana's first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32) to demonstrate their performance.
|
||||
Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32):
|
||||
|
||||
For [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) on 512x512 images:
|
||||
- [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) (512x512 resolution):
|
||||
|
||||
| | Latency (batch size = 1) | Throughput |
|
||||
| | Latency (batch size = 1) | Throughput (batch size = 8) |
|
||||
| ---------------------- |:------------------------:|:---------------------------:|
|
||||
| first-generation Gaudi | 3.80s | 0.308 images/s (batch size = 8) |
|
||||
| Gaudi2 | 1.33s | 1.081 images/s (batch size = 8) |
|
||||
| first-generation Gaudi | 3.80s | 0.308 images/s |
|
||||
| Gaudi2 | 1.33s | 1.081 images/s |
|
||||
|
||||
For [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on 768x768 images:
|
||||
- [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) (768x768 resolution):
|
||||
|
||||
| | Latency (batch size = 1) | Throughput |
|
||||
| ---------------------- |:------------------------:|:-------------------------------:|
|
||||
|
||||
@@ -1,357 +0,0 @@
|
||||
# Reduce memory usage
|
||||
|
||||
A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can even be combined to further reduce memory usage.
|
||||
|
||||
<Tip>
|
||||
|
||||
In many cases, optimizing for memory or speed leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on minimizing memory usage, but you can also learn more about how to [Speed up inference](fp16).
|
||||
|
||||
</Tip>
|
||||
|
||||
The results below are obtained from generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps on an Nvidia Titan RTX, demonstrating the speed-up you can expect as a result of reduced memory consumption.
|
||||
|
||||
| | latency | speed-up |
|
||||
| ---------------- | ------- | ------- |
|
||||
| original | 9.50s | x1 |
|
||||
| fp16 | 3.61s | x2.63 |
|
||||
| channels last | 3.30s | x2.88 |
|
||||
| traced UNet | 3.21s | x2.96 |
|
||||
| memory-efficient attention | 2.63s | x3.61 |
|
||||
|
||||
|
||||
## Sliced VAE
|
||||
|
||||
Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use.
|
||||
|
||||
To use sliced VAE, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_vae_slicing()
|
||||
images = pipe([prompt] * 32).images
|
||||
```
|
||||
|
||||
You may see a small performance boost in VAE decoding on multi-image batches, and there should be no performance impact on single-image batches.
|
||||
|
||||
## Tiled VAE
|
||||
|
||||
Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. You should also use tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use.
|
||||
|
||||
To use tiled VAE processing, call [`~StableDiffusionPipeline.enable_vae_tiling`] on your pipeline before inference:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe = pipe.to("cuda")
|
||||
prompt = "a beautiful landscape photograph"
|
||||
pipe.enable_vae_tiling()
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
|
||||
image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0]
|
||||
```
|
||||
|
||||
The output image has some tile-to-tile tone variation because the tiles are decoded separately, but you shouldn't see any sharp and obvious seams between the tiles. Tiling is turned off for images that are 512x512 or smaller.
|
||||
|
||||
## CPU offloading
|
||||
|
||||
Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can also save memory. Often, this technique can reduce memory consumption to less than 3GB.
|
||||
|
||||
To perform CPU offloading, call [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different UNet submodules are sequentially onloaded and offloaded as needed, resulting in a large number of memory transfers.
|
||||
|
||||
<Tip>
|
||||
|
||||
Consider using [model offloading](#model-offloading) if you want to optimize for speed because it is much faster. The tradeoff is your memory savings won't be as large.
|
||||
|
||||
</Tip>
|
||||
|
||||
CPU offloading can also be chained with attention slicing to reduce memory consumption to less than 2GB.
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)  # chain attention slicing as described above
|
||||
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
When using [`~StableDiffusionPipeline.enable_sequential_cpu_offload`], don't move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal (see this [issue](https://github.com/huggingface/diffusers/issues/1934) for more information).
|
||||
|
||||
[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Model offloading
|
||||
|
||||
<Tip>
|
||||
|
||||
Model offloading requires 🤗 Accelerate version 0.17.0 or higher.
|
||||
|
||||
</Tip>
|
||||
|
||||
[Sequential CPU offloading](#cpu-offloading) preserves a lot of memory but it makes inference slower because submodules are moved to GPU as needed, and they're immediately returned to the CPU when a new module runs.
|
||||
|
||||
Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent *submodules*. There is a negligible impact on inference time (compared with moving the pipeline to `cuda`), and it still provides some memory savings.
|
||||
|
||||
During model offloading, only one of the main components of the pipeline (typically the text encoder, UNet and VAE)
|
||||
is placed on the GPU while the others wait on the CPU. Components like the UNet that run for multiple iterations stay on the GPU until they're no longer needed.
|
||||
|
||||
Enable model offloading by calling [`~StableDiffusionPipeline.enable_model_cpu_offload`] on the pipeline:
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_model_cpu_offload()
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
Model offloading can also be combined with attention slicing for additional memory savings.
|
||||
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
)
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing(1)  # combine with attention slicing as described above
|
||||
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
In order to properly offload models after they're called, the entire pipeline must be run and the models must be called in the pipeline's expected order. Exercise caution if models are reused outside the context of the pipeline after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module)
|
||||
for more information.
|
||||
|
||||
[`~StableDiffusionPipeline.enable_model_cpu_offload`] is a stateful operation that installs hooks on the models and state on the pipeline.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Channels-last memory format
|
||||
|
||||
The channels-last memory format is an alternative way of ordering NCHW tensors in memory to preserve dimension ordering. Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). Since not all operators currently support the channels-last format, it may result in worse performance, but you should still try and see if it works for your model.
|
||||
|
||||
For example, to set the pipeline's UNet to use the channels-last format:
|
||||
|
||||
```python
|
||||
print(pipe.unet.conv_out.state_dict()["weight"].stride()) # (2880, 9, 3, 1)
|
||||
pipe.unet.to(memory_format=torch.channels_last) # in-place operation
|
||||
print(
|
||||
pipe.unet.conv_out.state_dict()["weight"].stride()
|
||||
) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
|
||||
```
|
||||
|
||||
## Tracing
|
||||
|
||||
Tracing runs an example input tensor through the model and captures the operations that are performed on it as that input makes its way through the model's layers. The executable or `ScriptFunction` that is returned is optimized with just-in-time compilation.
|
||||
|
||||
To trace a UNet:
|
||||
|
||||
```python
|
||||
import time
|
||||
import torch
|
||||
from diffusers import StableDiffusionPipeline
|
||||
import functools
|
||||
|
||||
# torch disable grad
|
||||
torch.set_grad_enabled(False)
|
||||
|
||||
# set variables
|
||||
n_experiments = 2
|
||||
unet_runs_per_experiment = 50
|
||||
|
||||
|
||||
# load inputs
|
||||
def generate_inputs():
|
||||
sample = torch.randn(2, 4, 64, 64).half().cuda()
|
||||
timestep = torch.rand(1).half().cuda() * 999
|
||||
encoder_hidden_states = torch.randn(2, 77, 768).half().cuda()
|
||||
return sample, timestep, encoder_hidden_states
|
||||
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
).to("cuda")
|
||||
unet = pipe.unet
|
||||
unet.eval()
|
||||
unet.to(memory_format=torch.channels_last) # use channels_last memory format
|
||||
unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default
|
||||
|
||||
# warmup
|
||||
for _ in range(3):
|
||||
with torch.inference_mode():
|
||||
inputs = generate_inputs()
|
||||
orig_output = unet(*inputs)
|
||||
|
||||
# trace
|
||||
print("tracing..")
|
||||
unet_traced = torch.jit.trace(unet, inputs)
|
||||
unet_traced.eval()
|
||||
print("done tracing")
|
||||
|
||||
|
||||
# warmup and optimize graph
|
||||
for _ in range(5):
|
||||
with torch.inference_mode():
|
||||
inputs = generate_inputs()
|
||||
orig_output = unet_traced(*inputs)
|
||||
|
||||
|
||||
# benchmarking
|
||||
with torch.inference_mode():
|
||||
for _ in range(n_experiments):
|
||||
torch.cuda.synchronize()
|
||||
start_time = time.time()
|
||||
for _ in range(unet_runs_per_experiment):
|
||||
orig_output = unet_traced(*inputs)
|
||||
torch.cuda.synchronize()
|
||||
print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
|
||||
for _ in range(n_experiments):
|
||||
torch.cuda.synchronize()
|
||||
start_time = time.time()
|
||||
for _ in range(unet_runs_per_experiment):
|
||||
orig_output = unet(*inputs)
|
||||
torch.cuda.synchronize()
|
||||
print(f"unet inference took {time.time() - start_time:.2f} seconds")
|
||||
|
||||
# save the model
|
||||
unet_traced.save("unet_traced.pt")
|
||||
```
|
||||
|
||||
Replace the `unet` attribute of the pipeline with the traced model:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionPipeline
|
||||
import torch
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class UNet2DConditionOutput:
|
||||
sample: torch.FloatTensor
|
||||
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
).to("cuda")
|
||||
|
||||
# use jitted unet
|
||||
unet_traced = torch.jit.load("unet_traced.pt")
|
||||
|
||||
|
||||
# del pipe.unet
|
||||
class TracedUNet(torch.nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.in_channels = pipe.unet.in_channels
|
||||
self.device = pipe.unet.device
|
||||
|
||||
def forward(self, latent_model_input, t, encoder_hidden_states):
|
||||
sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
|
||||
return UNet2DConditionOutput(sample=sample)
|
||||
|
||||
|
||||
pipe.unet = TracedUNet()
|
||||
|
||||
with torch.inference_mode():
|
||||
    prompt = "a photo of an astronaut riding a horse on mars"  # define the prompt used for generation
    image = pipe([prompt] * 1, num_inference_steps=50).images[0]
|
||||
```
|
||||
|
||||
## Memory-efficient attention
|
||||
|
||||
Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/pdf/2205.14135.pdf) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)).
|
||||
|
||||
<Tip>
|
||||
|
||||
If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling `xformers`.
|
||||
|
||||
</Tip>
|
||||
|
||||
To use Flash Attention, install the following:
|
||||
|
||||
- PyTorch > 1.12
|
||||
- CUDA available
|
||||
- [xFormers](xformers)
|
||||
|
||||
Then call [`~ModelMixin.enable_xformers_memory_efficient_attention`] on the pipeline:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
).to("cuda")
|
||||
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
|
||||
with torch.inference_mode():
|
||||
sample = pipe("a small cat")
|
||||
|
||||
# optional: You can disable it via
|
||||
# pipe.disable_xformers_memory_efficient_attention()
|
||||
```
|
||||
|
||||
The iteration speed when using `xformers` should match the iteration speed of Torch 2.0 as described [here](torch2.0).
|
||||
@@ -10,16 +10,29 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Metal Performance Shaders (MPS)
|
||||
# How to use Stable Diffusion in Apple Silicon (M1/M2)
|
||||
|
||||
🤗 Diffusers is compatible with Apple silicon (M1/M2 chips) using the PyTorch [`mps`](https://pytorch.org/docs/stable/notes/mps.html) device, which uses the Metal framework to leverage the GPU on MacOS devices. You'll need to have:
|
||||
🤗 Diffusers is compatible with Apple silicon for Stable Diffusion inference, using the PyTorch `mps` device. These are the steps you need to follow to use your M1 or M2 computer with Stable Diffusion.
|
||||
|
||||
- macOS computer with Apple silicon (M1/M2) hardware
|
||||
- macOS 12.6 or later (13.0 or later recommended)
|
||||
- arm64 version of Python
|
||||
- [PyTorch 2.0](https://pytorch.org/get-started/locally/) (recommended) or 1.13 (minimum version supported for `mps`)
|
||||
## Requirements
|
||||
|
||||
The `mps` backend uses PyTorch's `.to()` interface to move the Stable Diffusion pipeline on to your M1 or M2 device:
|
||||
- Mac computer with Apple silicon (M1/M2) hardware.
|
||||
- macOS 12.6 or later (13.0 or later recommended).
|
||||
- arm64 version of Python.
|
||||
- PyTorch 2.0 (recommended) or 1.13 (minimum version supported for `mps`). You can install it with `pip` or `conda` using the instructions in https://pytorch.org/get-started/locally/.
|
||||
|
||||
|
||||
## Inference Pipeline
|
||||
|
||||
The snippet below demonstrates how to use the `mps` backend using the familiar `to()` interface to move the Stable Diffusion pipeline to your M1 or M2 device.
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
**If you are using PyTorch 1.13** you need to "prime" the pipeline using an additional one-time pass through it. This is a temporary workaround for a weird issue we detected: the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and it's ok to use just one inference step and discard the result.
|
||||
|
||||
</Tip>
|
||||
|
||||
We strongly recommend you use PyTorch 2 or better, as it solves a number of problems like the one described in the previous tip.
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
@@ -31,41 +44,24 @@ pipe = pipe.to("mps")
|
||||
pipe.enable_attention_slicing()
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
```
|
||||
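Because the block above is truncated by the diff, here is a minimal end-to-end sketch of moving a Stable Diffusion pipeline to the `mps` device; the attention slicing call is included because it is recommended further down this page for machines with less than 64GB of RAM.

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("mps")

# Recommended if your computer has < 64 GB of RAM
pipe.enable_attention_slicing()

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```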
|
||||
<Tip warning={true}>
|
||||
|
||||
Generating multiple prompts in a batch can [crash](https://github.com/huggingface/diffusers/issues/363) or fail to work reliably. We believe this is related to the [`mps`](https://github.com/pytorch/pytorch/issues/84039) backend in PyTorch. While this is being investigated, you should iterate instead of batching.
|
||||
|
||||
</Tip>
|
||||
|
||||
If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an additional one-time pass through it. This is a temporary workaround for an issue where the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and after just one inference step you can discard the result.
|
||||
|
||||
```diff
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("mps")
|
||||
pipe.enable_attention_slicing()
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
# First-time "warmup" pass if PyTorch version is 1.13
|
||||
+ _ = pipe(prompt, num_inference_steps=1)
|
||||
# First-time "warmup" pass if PyTorch version is 1.13 (see explanation above)
|
||||
_ = pipe(prompt, num_inference_steps=1)
|
||||
|
||||
# Results match those from the CPU device after the warmup pass.
|
||||
image = pipe(prompt).images[0]
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
## Troubleshoot
|
||||
## Performance Recommendations
|
||||
|
||||
M1/M2 performance is very sensitive to memory pressure. When memory pressure is too high, the system automatically swaps if it needs to, which significantly degrades performance.
|
||||
M1/M2 performance is very sensitive to memory pressure. The system will automatically swap if it needs to, but performance will degrade significantly when it does.
|
||||
|
||||
To prevent this from happening, we recommend *attention slicing* to reduce memory pressure during inference and prevent swapping. This is especially relevant if your computer has less than 64GB of system RAM, or if you generate images at non-standard resolutions larger than 512×512 pixels. Call the [`~DiffusionPipeline.enable_attention_slicing`] function on your pipeline:
|
||||
We recommend you use _attention slicing_ to reduce memory pressure during inference and prevent swapping, particularly if your computer has less than 64 GB of system RAM, or if you generate images at non-standard resolutions larger than 512 × 512 pixels. Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually has a performance impact of ~20% in computers without universal memory, but we have observed _better performance_ in most Apple Silicon computers, unless you have 64 GB or more.
|
||||
|
||||
```py
|
||||
import torch
from diffusers import DiffusionPipeline
|
||||
|
||||
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("mps")
|
||||
```python
|
||||
pipeline.enable_attention_slicing()
|
||||
```
|
||||
|
||||
Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually improves performance by ~20% in computers without universal memory, but we've observed *better performance* in most Apple silicon computers unless you have 64GB of RAM or more.
|
||||
## Known Issues
|
||||
|
||||
- Generating multiple prompts in a batch [crashes or doesn't work reliably](https://github.com/huggingface/diffusers/issues/363). We believe this is related to the [`mps` backend in PyTorch](https://github.com/pytorch/pytorch/issues/84039). This is being resolved, but for now we recommend iterating instead of batching.
|
||||
|
||||
@@ -11,19 +11,23 @@ specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
|
||||
# ONNX Runtime
|
||||
# How to use ONNX Runtime for inference
|
||||
|
||||
🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install 🤗 Optimum with the following command for ONNX Runtime support:
|
||||
🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime.
|
||||
|
||||
```bash
|
||||
## Installation
|
||||
|
||||
Install 🤗 Optimum with the following command for ONNX Runtime support:
|
||||
|
||||
```
|
||||
pip install optimum["onnxruntime"]
|
||||
```
|
||||
|
||||
This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime.
|
||||
|
||||
## Stable Diffusion
|
||||
|
||||
To load and run inference, use the [`~optimum.onnxruntime.ORTStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`:
|
||||
### Inference
|
||||
|
||||
To load an ONNX model and run inference with ONNX Runtime, you need to replace [`StableDiffusionPipeline`] with `ORTStableDiffusionPipeline`. In case you want to load a PyTorch model and convert it to the ONNX format on-the-fly, you can set `export=True`.
|
||||
|
||||
```python
|
||||
from optimum.onnxruntime import ORTStableDiffusionPipeline
|
||||
@@ -35,20 +39,14 @@ image = pipeline(prompt).images[0]
|
||||
pipeline.save_pretrained("./onnx-stable-diffusion-v1-5")
|
||||
```
|
||||
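For reference, here is a minimal sketch of the full snippet that the truncated block above corresponds to; the checkpoint and output directory are illustrative.

```python
from optimum.onnxruntime import ORTStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
# export=True converts the PyTorch weights to the ONNX format on-the-fly
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)

prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]

# Save the exported ONNX model for later use
pipeline.save_pretrained("./onnx-stable-diffusion-v1-5")
```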
|
||||
<Tip warning={true}>
|
||||
|
||||
Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching.
|
||||
|
||||
</Tip>
|
||||
|
||||
To export the pipeline in the ONNX format offline and use it later for inference,
|
||||
use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
|
||||
If you want to export the pipeline in the ONNX format offline and later use it for inference,
|
||||
you can use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
|
||||
|
||||
```bash
|
||||
optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/
|
||||
```
|
||||
|
||||
Then to perform inference (you don't have to specify `export=True` again):
|
||||
Then perform inference:
|
||||
|
||||
```python
|
||||
from optimum.onnxruntime import ORTStableDiffusionPipeline
|
||||
@@ -59,15 +57,36 @@ prompt = "sailing ship in storm by Leonardo da Vinci"
|
||||
image = pipeline(prompt).images[0]
|
||||
```
|
||||
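A hedged sketch of that inference step, assuming the model was exported to `sd_v15_onnx/` with the CLI command above:

```python
from optimum.onnxruntime import ORTStableDiffusionPipeline

# Load the exported ONNX model; export=True is not needed this time
pipeline = ORTStableDiffusionPipeline.from_pretrained("sd_v15_onnx")

prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]
```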
|
||||
Notice that we didn't have to specify `export=True` above.
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/optimum/documentation-images/resolve/main/onnxruntime/stable_diffusion_v1_5_ort_sail_boat.png">
|
||||
</div>
|
||||
|
||||
You can find more examples in 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting.
|
||||
You can find more examples in [optimum documentation](https://huggingface.co/docs/optimum/).
|
||||
|
||||
|
||||
### Supported tasks
|
||||
|
||||
| Task | Loading Class |
|
||||
|--------------------------------------|--------------------------------------|
|
||||
| `text-to-image` | `ORTStableDiffusionPipeline` |
|
||||
| `image-to-image` | `ORTStableDiffusionImg2ImgPipeline` |
|
||||
| `inpaint` | `ORTStableDiffusionInpaintPipeline` |
|
||||
|
||||
## Stable Diffusion XL
|
||||
|
||||
To load and run inference with SDXL, use the [`~optimum.onnxruntime.ORTStableDiffusionXLPipeline`]:
|
||||
### Export
|
||||
|
||||
To export your model to ONNX, you can use the [Optimum CLI](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) as follows:
|
||||
|
||||
```bash
|
||||
optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/
|
||||
```
|
||||
|
||||
### Inference
|
||||
|
||||
Here is an example of how you can load an SDXL ONNX model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference with ONNX Runtime:
|
||||
|
||||
```python
|
||||
from optimum.onnxruntime import ORTStableDiffusionXLPipeline
|
||||
@@ -78,10 +97,13 @@ prompt = "sailing ship in storm by Leonardo da Vinci"
|
||||
image = pipeline(prompt).images[0]
|
||||
```
|
||||
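Because the block above is truncated, here is a minimal sketch, assuming the SDXL checkpoint was exported to `sd_xl_onnx/` with the `optimum-cli` command shown earlier on this page:

```python
from optimum.onnxruntime import ORTStableDiffusionXLPipeline

# Load the SDXL model exported to sd_xl_onnx/ with the CLI command above
pipeline = ORTStableDiffusionXLPipeline.from_pretrained("sd_xl_onnx")

prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]
```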
|
||||
To export the pipeline in the ONNX format and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command:
|
||||
### Supported tasks
|
||||
|
||||
```bash
|
||||
optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/
|
||||
```
|
||||
| Task | Loading Class |
|
||||
|--------------------------------------|--------------------------------------|
|
||||
| `text-to-image` | `ORTStableDiffusionXLPipeline` |
|
||||
| `image-to-image` | `ORTStableDiffusionXLImg2ImgPipeline`|
|
||||
|
||||
SDXL in the ONNX format is supported for text-to-image and image-to-image.
|
||||
## Known Issues
|
||||
|
||||
- Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching.
|
||||
|
||||
@@ -11,21 +11,26 @@ specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
|
||||
# OpenVINO
|
||||
# How to use OpenVINO for inference
|
||||
|
||||
🤗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) of supported devices).
|
||||
🤗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO. You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices).
|
||||
|
||||
You'll need to install 🤗 Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version:
|
||||
## Installation
|
||||
|
||||
Install 🤗 Optimum Intel with the following command:
|
||||
|
||||
```
|
||||
pip install --upgrade-strategy eager optimum["openvino"]
|
||||
```
|
||||
|
||||
This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with OpenVINO.
|
||||
The `--upgrade-strategy eager` option is needed to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is upgraded to its latest version.
|
||||
|
||||
|
||||
## Stable Diffusion
|
||||
|
||||
To load and run inference, use the [`~optimum.intel.OVStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, set `export=True`:
|
||||
### Inference
|
||||
|
||||
To load an OpenVINO model and run inference with OpenVINO Runtime, you need to replace `StableDiffusionPipeline` with `OVStableDiffusionPipeline`. In case you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, you can set `export=True`.
|
||||
|
||||
```python
|
||||
from optimum.intel import OVStableDiffusionPipeline
|
||||
@@ -39,7 +44,7 @@ image = pipeline(prompt).images[0]
|
||||
pipeline.save_pretrained("openvino-sd-v1-5")
|
||||
```
|
||||
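For reference, a minimal sketch of the complete snippet that the truncated block above corresponds to:

```python
from optimum.intel import OVStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
# export=True converts the PyTorch weights to the OpenVINO format on-the-fly
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)

prompt = "sailing ship in storm by Rembrandt"
image = pipeline(prompt).images[0]

# Save the exported OpenVINO model for later use
pipeline.save_pretrained("openvino-sd-v1-5")
```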
|
||||
To further speed-up inference, statically reshape the model. If you change any parameters such as the outputs height or width, you’ll need to statically reshape your model again.
|
||||
To further speed up inference, the model can be statically reshaped:
|
||||
|
||||
```python
|
||||
# Define the shapes related to the inputs and desired outputs
|
||||
@@ -57,15 +62,30 @@ image = pipeline(
|
||||
num_images_per_prompt=num_images,
|
||||
).images[0]
|
||||
```
|
||||
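A hedged sketch of the static reshaping step, continuing from the pipeline loaded above; the shapes are illustrative values, and `reshape()` and `compile()` are the Optimum Intel APIs referenced in its documentation.

```python
# Define the shapes related to the inputs and desired outputs
batch_size, num_images, height, width = 1, 1, 512, 512

# Statically reshape the model for the fixed shapes, then recompile it
pipeline.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images)
pipeline.compile()

image = pipeline(
    "sailing ship in storm by Rembrandt",
    height=height,
    width=width,
    num_images_per_prompt=num_images,
).images[0]
```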
|
||||
In case you want to change any parameters such as the outputs height or width, you’ll need to statically reshape your model once again.
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/stable_diffusion_v1_5_sail_boat_rembrandt.png">
|
||||
</div>
|
||||
|
||||
You can find more examples in the 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting.
|
||||
|
||||
### Supported tasks
|
||||
|
||||
| Task | Loading Class |
|
||||
|--------------------------------------|--------------------------------------|
|
||||
| `text-to-image` | `OVStableDiffusionPipeline` |
|
||||
| `image-to-image` | `OVStableDiffusionImg2ImgPipeline` |
|
||||
| `inpaint` | `OVStableDiffusionInpaintPipeline` |
|
||||
|
||||
You can find more examples in the optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion).
|
||||
|
||||
|
||||
## Stable Diffusion XL
|
||||
|
||||
To load and run inference with SDXL, use the [`~optimum.intel.OVStableDiffusionXLPipeline`]:
|
||||
### Inference
|
||||
|
||||
Here is an example of how you can load an SDXL OpenVINO model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference with OpenVINO Runtime:
|
||||
|
||||
```python
|
||||
from optimum.intel import OVStableDiffusionXLPipeline
|
||||
@@ -76,6 +96,15 @@ prompt = "sailing ship in storm by Rembrandt"
|
||||
image = pipeline(prompt).images[0]
|
||||
```
|
||||
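A minimal sketch of that SDXL example; `export=True` converts the PyTorch checkpoint to the OpenVINO format on-the-fly and can be dropped if you load an already-converted model.

```python
from optimum.intel import OVStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=True)

prompt = "sailing ship in storm by Rembrandt"
image = pipeline(prompt).images[0]
```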
|
||||
To further speed-up inference, [statically reshape](#stable-diffusion) the model as shown in the Stable Diffusion section.
|
||||
To further speed up inference, the model can be statically reshaped as shown above.
|
||||
You can find more examples in the optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl).
|
||||
|
||||
### Supported tasks
|
||||
|
||||
| Task | Loading Class |
|
||||
|--------------------------------------|--------------------------------------|
|
||||
| `text-to-image` | `OVStableDiffusionXLPipeline` |
|
||||
| `image-to-image` | `OVStableDiffusionXLImg2ImgPipeline` |
|
||||
|
||||
|
||||
|
||||
You can find more examples in the 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl), and running SDXL in OpenVINO is supported for text-to-image and image-to-image.
|
||||
|
||||
@@ -12,6 +12,6 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
# Overview
|
||||
|
||||
Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of 🤗 Diffusers' goals is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware.
|
||||
Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of 🧨 Diffusers' goals is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware.
|
||||
|
||||
This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory-consumption. You'll also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors.
|
||||
This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory-consumption. You can also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors.
|
||||
@@ -10,39 +10,35 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Token merging
|
||||
# Token Merging
|
||||
|
||||
[Token merging](https://huggingface.co/papers/2303.17604) (ToMe) merges redundant tokens/patches progressively in the forward pass of a Transformer-based network which can speed-up the inference latency of [`StableDiffusionPipeline`].
|
||||
Token Merging (introduced in [Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461)) works by merging the redundant tokens / patches progressively in the forward pass of a Transformer-based network. It can speed up the inference latency of the underlying network.
|
||||
|
||||
You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function:
|
||||
After Token Merging (ToMe) was released, the authors released [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604), which introduced a version of ToMe which is more compatible with Stable Diffusion. We can use ToMe to gracefully speed up the inference latency of a [`DiffusionPipeline`]. This doc discusses how to apply ToMe to the [`StableDiffusionPipeline`], the expected speedups, and the qualitative aspects of using ToMe on the [`StableDiffusionPipeline`].
|
||||
|
||||
## Using ToMe
|
||||
|
||||
The authors of ToMe released a convenient Python library called [`tomesd`](https://github.com/dbolya/tomesd) that lets us apply ToMe to a [`DiffusionPipeline`] like so:
|
||||
|
||||
```diff
|
||||
from diffusers import StableDiffusionPipeline
import torch
|
||||
import tomesd
|
||||
|
||||
pipeline = StableDiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
+ tomesd.apply_patch(pipeline, ratio=0.5)
|
||||
|
||||
image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
|
||||
```
|
||||
|
||||
The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass.
|
||||
And that’s it!
|
||||
|
||||
As reported in the [paper](https://huggingface.co/papers/2303.17604), ToMe can greatly preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed-up inference even further, but at the cost of some degraded image quality.
|
||||
`tomesd.apply_patch()` exposes [a number of arguments](https://github.com/dbolya/tomesd#usage) to let us strike a balance between the pipeline inference speed and the quality of the generated tokens. Amongst those arguments, the most important one is `ratio`, which controls the number of tokens that will be merged during the forward pass. For more details on `tomesd`, please refer to the original repository at [dbolya/tomesd](https://github.com/dbolya/tomesd) and [the paper](https://arxiv.org/abs/2303.17604).
|
||||
|
||||
To test the quality of the generated images, we sampled a few prompts from [Parti Prompts](https://parti.research.google/) and performed inference with the [`StableDiffusionPipeline`] with the following settings:
|
||||
## Benchmarking `tomesd` with `StableDiffusionPipeline`
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/tome/tome_samples.png">
|
||||
</div>
|
||||
|
||||
We didn’t notice any significant decrease in the quality of the generated samples, and you can check out the generated samples in this [WandB report](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). If you're interested in reproducing this experiment, use this [script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd).
|
||||
|
||||
## Benchmarks
|
||||
|
||||
We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with [xFormers](https://huggingface.co/docs/diffusers/optimization/xformers) enabled across several image resolutions. The results are obtained from A100 and V100 GPUs in the following development environment:
|
||||
We benchmarked the impact of using `tomesd` on [`StableDiffusionPipeline`] along with [xformers](https://huggingface.co/docs/diffusers/optimization/xformers) across different image resolutions. We used A100 and V100 as our test GPU devices with the following development environment (with Python 3.8.5):
|
||||
|
||||
```bash
|
||||
- `diffusers` version: 0.15.1
|
||||
@@ -55,35 +51,66 @@ We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] wi
|
||||
- tomesd version: 0.1.2
|
||||
```
|
||||
|
||||
To reproduce this benchmark, feel free to use this [script](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers.
|
||||
We used this script for benchmarking: [https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). Following are our findings:
|
||||
|
||||
| **GPU** | **Resolution** | **Batch size** | **Vanilla** | **ToMe** | **ToMe + xFormers** |
|
||||
|----------|----------------|----------------|-------------|----------------|---------------------|
|
||||
| **A100** | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) |
|
||||
| | 768 | 10 | OOM | 14.71 | 11 |
|
||||
| | | 8 | OOM | 11.56 | 8.84 |
|
||||
| | | 4 | OOM | 5.98 | 4.66 |
|
||||
| | | 2 | 4.99 | 3.24 (+35.07%) | 3.1 (+37.88%) |
|
||||
| | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) |
|
||||
| | 1024 | 10 | OOM | OOM | OOM |
|
||||
| | | 8 | OOM | OOM | OOM |
|
||||
| | | 4 | OOM | 12.51 | 9.09 |
|
||||
| | | 2 | OOM | 6.52 | 4.96 |
|
||||
| | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) |
|
||||
| **V100** | 512 | 10 | OOM | 10.03 | 9.29 |
|
||||
| | | 8 | OOM | 8.05 | 7.47 |
|
||||
| | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) |
|
||||
| | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) |
|
||||
| | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) |
|
||||
| | 768 | 10 | OOM | OOM | 23.67 |
|
||||
| | | 8 | OOM | OOM | 18.81 |
|
||||
| | | 4 | OOM | 11.81 | 9.7 |
|
||||
| | | 2 | OOM | 6.27 | 5.2 |
|
||||
| | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) |
|
||||
| | 1024 | 10 | OOM | OOM | OOM |
|
||||
| | | 8 | OOM | OOM | OOM |
|
||||
| | | 4 | OOM | OOM | 19.35 |
|
||||
| | | 2 | OOM | 13 | 10.78 |
|
||||
| | | 1 | OOM | 6.66 | 5.54 |
|
||||
### A100
|
||||
|
||||
As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0).
|
||||
| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe speedup (%) | ToMe + xFormers speedup (%) |
|
||||
| --- | --- | --- | --- | --- | --- | --- |
|
||||
| 512 | 10 | 6.88 | 5.26 | 4.69 | 23.54651163 | 31.83139535 |
|
||||
| | | | | | | |
|
||||
| 768 | 10 | OOM | 14.71 | 11 | | |
|
||||
| | 8 | OOM | 11.56 | 8.84 | | |
|
||||
| | 4 | OOM | 5.98 | 4.66 | | |
|
||||
| | 2 | 4.99 | 3.24 | 3.1 | 35.07014028 | 37.8757515 |
|
||||
| | 1 | 3.29 | 2.24 | 2.03 | 31.91489362 | 38.29787234 |
|
||||
| | | | | | | |
|
||||
| 1024 | 10 | OOM | OOM | OOM | | |
|
||||
| | 8 | OOM | OOM | OOM | | |
|
||||
| | 4 | OOM | 12.51 | 9.09 | | |
|
||||
| | 2 | OOM | 6.52 | 4.96 | | |
|
||||
| | 1 | 6.4 | 3.61 | 2.81 | 43.59375 | 56.09375 |
|
||||
|
||||
***The timings reported here are in seconds. Speedups are calculated over the `Vanilla` timings.***
|
||||
|
||||
### V100
|
||||
|
||||
| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe speedup (%) | ToMe + xFormers speedup (%) |
|
||||
| --- | --- | --- | --- | --- | --- | --- |
|
||||
| 512 | 10 | OOM | 10.03 | 9.29 | | |
|
||||
| | 8 | OOM | 8.05 | 7.47 | | |
|
||||
| | 4 | 5.7 | 4.3 | 3.98 | 24.56140351 | 30.1754386 |
|
||||
| | 2 | 3.14 | 2.43 | 2.27 | 22.61146497 | 27.70700637 |
|
||||
| | 1 | 1.88 | 1.57 | 1.57 | 16.4893617 | 16.4893617 |
|
||||
| | | | | | | |
|
||||
| 768 | 10 | OOM | OOM | 23.67 | | |
|
||||
| | 8 | OOM | OOM | 18.81 | | |
|
||||
| | 4 | OOM | 11.81 | 9.7 | | |
|
||||
| | 2 | OOM | 6.27 | 5.2 | | |
|
||||
| | 1 | 5.43 | 3.38 | 2.82 | 37.75322284 | 48.06629834 |
|
||||
| | | | | | | |
|
||||
| 1024 | 10 | OOM | OOM | OOM | | |
|
||||
| | 8 | OOM | OOM | OOM | | |
|
||||
| | 4 | OOM | OOM | 19.35 | | |
|
||||
| | 2 | OOM | 13 | 10.78 | | |
|
||||
| | 1 | OOM | 6.66 | 5.54 | | |
|
||||
|
||||
As seen in the tables above, the speedup with `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it becomes possible to run the pipeline on a higher resolution, like 1024x1024.
|
||||
|
||||
It might be possible to speed up inference even further with [`torch.compile()`](https://huggingface.co/docs/diffusers/optimization/torch2.0).
|
||||
|
||||
## Quality
|
||||
|
||||
As reported in [the paper](https://arxiv.org/abs/2303.17604), ToMe can preserve the quality of the generated images to a great extent while speeding up inference. By increasing the `ratio`, it is possible to further speed up inference, but that might come at the cost of a deterioration in the image quality.
|
||||
|
||||
To test the quality of the generated samples using our setup, we sampled a few prompts from the “Parti Prompts” (introduced in [Parti](https://parti.research.google/)) and performed inference with the [`StableDiffusionPipeline`] in the following settings:
|
||||
|
||||
- Vanilla [`StableDiffusionPipeline`]
|
||||
- [`StableDiffusionPipeline`] + ToMe
|
||||
- [`StableDiffusionPipeline`] + ToMe + xformers
|
||||
|
||||
We didn’t notice any significant decrease in the quality of the generated samples. Here are samples:
|
||||
|
||||

|
||||
|
||||
You can check out the generated samples [here](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). We used [this script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd) for conducting this experiment.
|
||||
@@ -10,83 +10,96 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Torch 2.0
|
||||
# Accelerated PyTorch 2.0 support in Diffusers
|
||||
|
||||
🤗 Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include:
|
||||
Starting from version `0.13.0`, Diffusers supports the latest optimization from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). These include:
|
||||
1. Support for accelerated transformers implementation with memory-efficient attention – no extra dependencies (such as `xformers`) required.
|
||||
2. [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) support for extra performance boost when individual models are compiled.
|
||||
|
||||
1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers.
|
||||
2. [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled.
|
||||
|
||||
Both of these optimizations require PyTorch 2.0 or later and 🤗 Diffusers > 0.13.0.
|
||||
## Installation
|
||||
|
||||
To benefit from the accelerated attention implementation and `torch.compile()`, you just need to install the latest versions of PyTorch 2.0 from pip, and make sure you are on diffusers 0.13.0 or later. As explained below, diffusers automatically uses the optimized attention processor ([`AttnProcessor2_0`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L798)) (but not `torch.compile()`)
|
||||
when PyTorch 2.0 is available.
|
||||
|
||||
```bash
|
||||
pip install --upgrade torch diffusers
|
||||
```
|
||||
|
||||
## Scaled dot product attention
|
||||
## Using accelerated transformers and `torch.compile`
|
||||
|
||||
[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of 🤗 Diffusers, so you don't need to add anything to your code.
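To make the underlying primitive concrete, here is a minimal, self-contained sketch of calling SDPA directly on random query/key/value tensors; the shapes are arbitrary and only for illustration:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, sequence_length, head_dim)
query = torch.randn(2, 8, 64, 40, device=device, dtype=dtype)
key = torch.randn(2, 8, 64, 40, device=device, dtype=dtype)
value = torch.randn(2, 8, 64, 40, device=device, dtype=dtype)

# PyTorch dispatches to the fastest available backend (Flash, memory-efficient, or math)
# depending on the inputs and the hardware.
out = F.scaled_dot_product_attention(query, key, value)
print(out.shape)  # torch.Size([2, 8, 64, 40])
```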
|
||||
|
||||
However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]:
|
||||
1. **Accelerated Transformers implementation**
|
||||
|
||||
```diff
|
||||
import torch
|
||||
from diffusers import DiffusionPipeline
|
||||
+ from diffusers.models.attention_processor import AttnProcessor2_0
|
||||
PyTorch 2.0 includes an optimized and memory-efficient attention implementation through the [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) function, which automatically enables several optimizations depending on the inputs and the GPU type. This is similar to the `memory_efficient_attention` from [xFormers](https://github.com/facebookresearch/xformers), but built natively into PyTorch.
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
|
||||
+ pipe.unet.set_attn_processor(AttnProcessor2_0())
|
||||
These optimizations will be enabled by default in Diffusers if PyTorch 2.0 is installed and if `torch.nn.functional.scaled_dot_product_attention` is available. To use it, just install `torch 2.0` as suggested above and simply use the pipeline. For example:
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details.
|
||||
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline:
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
```diff
|
||||
import torch
|
||||
from diffusers import DiffusionPipeline
|
||||
from diffusers.models.attention_processor import AttnProcessor
|
||||
If you want to enable it explicitly (which is not required), you can do so as shown below.
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
|
||||
+ pipe.unet.set_default_attn_processor()
|
||||
```diff
|
||||
import torch
|
||||
from diffusers import DiffusionPipeline
|
||||
+ from diffusers.models.attention_processor import AttnProcessor2_0
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
|
||||
+ pipe.unet.set_attn_processor(AttnProcessor2_0())
|
||||
|
||||
## torch.compile
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In 🤗 Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline.
|
||||
This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark).
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
It is possible to revert to the vanilla attention processor ([`AttnProcessor`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L402)), which can be helpful to make the pipeline more deterministic, or if you need to convert a fine-tuned model to other formats such as [Core ML](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/coreml#how-to-run-stable-diffusion-with-core-ml). To use the normal attention processor you can use the [`~diffusers.UNet2DConditionModel.set_default_attn_processor`] function:
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0]
|
||||
```
|
||||
```Python
|
||||
import torch
|
||||
from diffusers import DiffusionPipeline
|
||||
from diffusers.models.attention_processor import AttnProcessor
|
||||
|
||||
Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300%** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs.
|
||||
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
|
||||
pipe.unet.set_default_attn_processor()
|
||||
|
||||
Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive.
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt).images[0]
|
||||
```
|
||||
|
||||
2. **torch.compile**
|
||||
|
||||
To get an additional speedup, we can use the new `torch.compile` feature. Since the UNet of the pipeline is usually the most computationally expensive, we wrap the `unet` with `torch.compile`, leaving the rest of the sub-models (text encoder and VAE) as they are. For more information and different options, refer to the
|
||||
[torch compile docs](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
|
||||
|
||||
```python
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
|
||||
```
|
||||
|
||||
Depending on the type of GPU, `compile()` can yield between **5% - 300%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements out of more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
|
||||
|
||||
Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. Calling the compiled pipeline on a different image size will re-trigger compilation which can be expensive.
|
||||
|
||||
For more information and different options about `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial.
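To avoid paying that compilation cost repeatedly, keep the generation resolution fixed across calls. A minimal sketch of this warm-up pattern, reusing the compiled `pipe` from above (the resolution and step count are arbitrary):

```python
prompt = "a photo of an astronaut riding a horse on mars"

# The first call at a given resolution triggers compilation and is slow.
_ = pipe(prompt, height=768, width=768, num_inference_steps=30).images

# Subsequent calls at the same resolution reuse the compiled graph and are fast.
image = pipe(prompt, height=768, width=768, num_inference_steps=30).images[0]
```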
|
||||
|
||||
## Benchmark
|
||||
|
||||
We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code is benchmarked on 🤗 Diffusers v0.17.0.dev0 to optimize `torch.compile` usage (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details).
|
||||
We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. We used `diffusers 0.17.0.dev0`, which [makes sure `torch.compile()` is leveraged optimally](https://github.com/huggingface/diffusers/pull/3313).
|
||||
|
||||
Expand the dropdown below to find the code used to benchmark each pipeline:
|
||||
### Benchmarking code
|
||||
|
||||
<details>
|
||||
#### Stable Diffusion text-to-image
|
||||
|
||||
### Stable Diffusion text-to-image
|
||||
|
||||
```python
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
@@ -108,7 +121,7 @@ for _ in range(3):
|
||||
images = pipe(prompt=prompt).images
|
||||
```
|
||||
|
||||
### Stable Diffusion image-to-image
|
||||
#### Stable Diffusion image-to-image
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionImg2ImgPipeline
|
||||
@@ -141,7 +154,7 @@ for _ in range(3):
|
||||
image = pipe(prompt=prompt, image=init_image).images[0]
|
||||
```
|
||||
|
||||
### Stable Diffusion inpainting
|
||||
#### Stable Diffusion - inpainting
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionInpaintPipeline
|
||||
@@ -181,7 +194,7 @@ for _ in range(3):
|
||||
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
|
||||
```
|
||||
|
||||
### ControlNet
|
||||
#### ControlNet
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
|
||||
@@ -219,7 +232,7 @@ for _ in range(3):
|
||||
image = pipe(prompt=prompt, image=init_image).images[0]
|
||||
```
|
||||
|
||||
### DeepFloyd IF text-to-image + upscaling
|
||||
#### IF text-to-image + upscaling
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
@@ -254,18 +267,24 @@ for _ in range(3):
|
||||
image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
|
||||
image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images
|
||||
```
|
||||
</details>
|
||||
|
||||
The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*.
|
||||
To give you a pictorial overview of the possible speed-ups that can be obtained with PyTorch 2.0 and `torch.compile()`,
|
||||
here is a plot that shows relative speed-ups for the [Stable Diffusion text-to-image pipeline](StableDiffusionPipeline) across five
|
||||
different GPU families (with a batch size of 4):
|
||||
|
||||

|
||||
|
||||
To give you an even better idea of how this speed-up holds for the other pipelines, consider the following
|
||||
graph for an A100 with PyTorch 2.0 and `torch.compile`:
|
||||
To give you an even better idea of how this speed-up holds for the other pipelines presented above, consider the following
|
||||
plot that shows the benchmarking numbers from an A100 across three different batch sizes
|
||||
(with PyTorch 2.0 nightly and `torch.compile()`):
|
||||
|
||||

|
||||
|
||||
In the following tables, we report our findings in terms of the *number of iterations/second*.
|
||||
_(Our benchmarking metric for the plots above is **number of iterations/second**)_
|
||||
|
||||
But we reveal all the benchmarking numbers in the interest of transparency!
|
||||
|
||||
In the following tables, we report our findings in terms of the number of **_iterations processed per second_**.
|
||||
|
||||
### A100 (batch size: 1)
|
||||
|
||||
@@ -276,7 +295,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
|
||||
| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
|
||||
| IF | 20.21 / <br>13.84 / <br>24.00 | 20.12 / <br>13.70 / <br>24.03 | ❌ | 97.34 / <br>27.23 / <br>111.66 |
|
||||
| SDXL - txt2img | 8.64 | 9.9 | - | - |
|
||||
|
||||
### A100 (batch size: 4)
|
||||
|
||||
@@ -287,7 +305,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
|
||||
| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
|
||||
| IF | 25.02 | 18.04 | ❌ | 48.47 |
|
||||
| SDXL - txt2img | 2.44 | 2.74 | - | - |
|
||||
|
||||
### A100 (batch size: 16)
|
||||
|
||||
@@ -298,7 +315,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
|
||||
| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
|
||||
| IF | 8.78 | 9.82 | ❌ | 16.77 |
|
||||
| SDXL - txt2img | 0.64 | 0.72 | - | - |
|
||||
|
||||
### V100 (batch size: 1)
|
||||
|
||||
@@ -339,7 +355,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
|
||||
| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
|
||||
| IF | 17.42 / <br>2.47 / <br>18.52 | 16.96 / <br>2.45 / <br>18.69 | ❌ | 24.63 / <br>2.47 / <br>23.39 |
|
||||
| SDXL - txt2img | 1.15 | 1.16 | - | - |
|
||||
|
||||
### T4 (batch size: 4)
|
||||
|
||||
@@ -350,7 +365,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
|
||||
| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
|
||||
| IF | 5.79 | 5.61 | ❌ | 7.39 |
|
||||
| SDXL - txt2img | 0.288 | 0.289 | - | - |
|
||||
|
||||
### T4 (batch size: 16)
|
||||
|
||||
@@ -361,7 +375,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
|
||||
| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
|
||||
| IF * | 1.44 | 1.44 | ❌ | 1.94 |
|
||||
| SDXL - txt2img | OOM | OOM | - | - |
|
||||
|
||||
### RTX 3090 (batch size: 1)
|
||||
|
||||
@@ -402,7 +415,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
|
||||
| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
|
||||
| IF | 69.71 / <br>18.78 / <br>85.49 | 69.13 / <br>18.80 / <br>85.56 | ❌ | 124.60 / <br>26.37 / <br>138.79 |
|
||||
| SDXL - txt2img | 6.8 | 8.18 | - | - |
|
||||
|
||||
### RTX 4090 (batch size: 4)
|
||||
|
||||
@@ -413,7 +425,6 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
|
||||
| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
|
||||
| IF | 31.88 | 31.14 | ❌ | 43.92 |
|
||||
| SDXL - txt2img | 2.19 | 2.35 | - | - |
|
||||
|
||||
### RTX 4090 (batch size: 16)
|
||||
|
||||
@@ -424,11 +435,10 @@ In the following tables, we report our findings in terms of the *number of itera
|
||||
| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
|
||||
| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
|
||||
| IF | 9.26 | 9.2 | ❌ | 13.31 |
|
||||
| SDXL - txt2img | 0.52 | 0.53 | - | - |
|
||||
|
||||
## Notes
|
||||
|
||||
* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
|
||||
* For the DeepFloyd IF pipeline where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1.
|
||||
* Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
|
||||
* For the IF pipeline and batch sizes > 1, we only used a batch size of >1 in the first IF pipeline for text-to-image generation and NOT for upscaling. So, that means the two upscaling pipelines received a batch size of 1.
|
||||
|
||||
*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.*
|
||||
*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.*
|
||||
@@ -10,11 +10,11 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# xFormers
|
||||
# Installing xFormers
|
||||
|
||||
We recommend [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption.
|
||||
We recommend the use of [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption.
|
||||
|
||||
Install xFormers from `pip`:
|
||||
Starting from version `0.0.16` of xFormers, released in January 2023, installation can be easily performed using pre-built pip wheels:
|
||||
|
||||
```bash
|
||||
pip install xformers
|
||||
@@ -22,14 +22,14 @@ pip install xformers
|
||||
|
||||
<Tip>
|
||||
|
||||
The xFormers `pip` package requires the latest version of PyTorch. If you need to use a previous version of PyTorch, then we recommend [installing xFormers from the source](https://github.com/facebookresearch/xformers#installing-xformers).
|
||||
The xFormers pip package requires the latest version of PyTorch (1.13.1 as of xFormers 0.0.16). If you need to use a previous version of PyTorch, then we recommend you install xFormers from source using [the project instructions](https://github.com/facebookresearch/xformers#installing-xformers).
|
||||
|
||||
</Tip>
|
||||
|
||||
After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption as shown in this [section](memory#memory-efficient-attention).
|
||||
After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption, as discussed [here](fp16#memory-efficient-attention).
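As a quick illustration, here is a minimal sketch of toggling it on a pipeline; the checkpoint is only an example:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Route attention through xFormers' memory-efficient kernels.
pipe.enable_xformers_memory_efficient_attention()
image = pipe("a photo of an astronaut riding a horse on mars").images[0]

# Revert to the default attention implementation if needed.
pipe.disable_xformers_memory_efficient_attention()
```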
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
According to this [issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or DreamBooth) in some GPUs. If you observe this problem, please install a development version as indicated in the issue comments.
|
||||
According to [this issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or Dreambooth) in some GPUs. If you observe that problem, please install a development version as indicated in that comment.
|
||||
|
||||
</Tip>
|
||||
|
||||
@@ -265,7 +265,7 @@ distributed_type: DEEPSPEED
|
||||
|
||||
See [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more DeepSpeed configuration options.
|
||||
|
||||
</Tip>
|
||||
<Tip>
|
||||
|
||||
Changing the default Adam optimizer to DeepSpeed's Adam
|
||||
`deepspeed.ops.adam.DeepSpeedCPUAdam` gives a substantial speedup but
|
||||
@@ -330,4 +330,4 @@ image.save("./output.png")
|
||||
|
||||
## Stable Diffusion XL
|
||||
|
||||
Training with [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) is also supported via the `train_controlnet_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).
|
||||
Training with [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) is also supported via the `train_controlnet_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).
|
||||
@@ -87,4 +87,4 @@ accelerate launch --mixed_precision="fp16" train_text_to_image.py \
|
||||
|
||||
Now that you've created a dataset, you can plug it into the `train_data_dir` (if your dataset is local) or `dataset_name` (if your dataset is on the Hub) arguments of a training script.
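For example, a rough sketch of the two options when launching the text-to-image script could look like this (the model identifier, dataset names, and output directory are placeholders):

```bash
# Dataset hosted on the Hub
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="your-username/your-dataset" \
  --output_dir="sd-finetuned"

# Or, for a dataset stored locally
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --train_data_dir="./my_dataset" \
  --output_dir="sd-finetuned"
```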
|
||||
|
||||
For your next steps, feel free to try and use your dataset to train a model for [unconditional generation](unconditional_training) or [text-to-image generation](text2image)!
|
||||
For your next steps, feel free to try and use your dataset to train a model for [unconditional generation](unconditional_training) or [text-to-image generation](text2image)!
|
||||
@@ -69,7 +69,7 @@ write_basic_config()
|
||||
|
||||
Now let's get our dataset. Download dataset from [here](https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip) and unzip it. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
|
||||
|
||||
We also collect 200 real images using `clip-retrieval` which are combined with the target images in the training dataset as a regularization. This prevents overfitting to the given target image. The following flags enable the regularization `with_prior_preservation`, `real_prior` with `prior_loss_weight=1.`.
|
||||
We also collect 200 real images using `clip-retrieval` which are combined with the target images in the training dataset as a regularization. This prevents overfitting to the given target image. The following flags enable the regularization `with_prior_preservation`, `real_prior` with `prior_loss_weight=1.`.
|
||||
The `class_prompt` should be the same category name as the target image. The collected real images have text captions similar to the `class_prompt`. The retrieved images are saved in `class_data_dir`. You can disable `real_prior` to use generated images as regularization. To collect the real images, use this command first before training.
|
||||
|
||||
```bash
|
||||
@@ -106,7 +106,7 @@ accelerate launch train_custom_diffusion.py \
|
||||
|
||||
**Use `--enable_xformers_memory_efficient_attention` for faster training with lower VRAM requirement (16GB per GPU). Follow [this guide](https://github.com/facebookresearch/xformers) for installation instructions.**
|
||||
|
||||
To track your experiments using Weights and Biases (`wandb`) and to save intermediate results (which we HIGHLY recommend), follow these steps:
|
||||
To track your experiments using Weights and Biases (`wandb`) and to save intermediate results (which we HIGHLY recommend), follow these steps:
|
||||
|
||||
* Install `wandb`: `pip install wandb`.
|
||||
* Authorize: `wandb login`.
|
||||
|
||||
@@ -1,17 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Reinforcement learning training with DDPO
|
||||
|
||||
You can fine-tune Stable Diffusion on a reward function via reinforcement learning with the 🤗 TRL library and 🤗 Diffusers. This is done with the Denoising Diffusion Policy Optimization (DDPO) algorithm introduced by Black et al. in [Training Diffusion Models with Reinforcement Learning](https://arxiv.org/abs/2305.13301), which is implemented in 🤗 TRL with the [`~trl.DDPOTrainer`].
|
||||
|
||||
For more information, check out the [`~trl.DDPOTrainer`] API reference and the [Finetune Stable Diffusion Models with DDPO via TRL](https://huggingface.co/blog/trl-ddpo) blog post.
|
||||
@@ -34,7 +34,7 @@ the attention layers of a language model is sufficient to obtain good downstream
|
||||
|
||||
[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. 🧨 Diffusers now supports finetuning with LoRA for [text-to-image generation](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-lora) and [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#training-with-low-rank-adaptation-of-large-language-models-lora). This guide will show you how to do both.
|
||||
|
||||
If you'd like to store or share your model with the community, login to your Hugging Face account (create [one](https://hf.co/join) if you don't have one already):
|
||||
If you'd like to store or share your model with the community, login to your Hugging Face account (create [one](https://hf.co/join) if you don't have one already):
|
||||
|
||||
```bash
|
||||
huggingface-cli login
|
||||
@@ -276,167 +276,20 @@ Note that the use of [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] is
|
||||
|
||||
* LoRA parameters that have separate identifiers for the UNet and the text encoder such as: [`"sayakpaul/dreambooth"`](https://huggingface.co/sayakpaul/dreambooth).
|
||||
|
||||
<Tip>
|
||||
|
||||
You can also provide a local directory path to [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] as well as [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`].
|
||||
|
||||
</Tip>
|
||||
|
||||
## Stable Diffusion XL
|
||||
|
||||
We support fine-tuning with [Stable Diffusion XL](https://huggingface.co/papers/2307.01952). Please refer to the following docs:
|
||||
|
||||
* [text_to_image/README_sdxl.md](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md)
|
||||
* [dreambooth/README_sdxl.md](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md)
|
||||
**Note** that it is possible to provide a local directory path to [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] as well as [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`]. To know about the supported inputs,
|
||||
refer to the respective docstrings.
|
||||
|
||||
## Unloading LoRA parameters
|
||||
|
||||
You can call [`~diffusers.loaders.LoraLoaderMixin.unload_lora_weights`] on a pipeline to unload the LoRA parameters.
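For instance, a minimal sketch (the LoRA repository id here is just an example of a Diffusers-format LoRA checkpoint):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("sayakpaul/sd-model-finetuned-lora-t4")

# ... run inference with the LoRA applied ...

# Drop the LoRA parameters and restore the original model weights.
pipe.unload_lora_weights()
```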
|
||||
|
||||
## Fusing LoRA parameters
|
||||
## Supporting A1111 themed LoRA checkpoints from Diffusers
|
||||
|
||||
You can call [`~diffusers.loaders.LoraLoaderMixin.fuse_lora`] on a pipeline to merge the LoRA parameters with the original parameters of the underlying model(s). This can lead to a potential speedup in the inference latency.
|
||||
This support was made possible because of our amazing contributors: [@takuma104](https://github.com/takuma104) and [@isidentical](https://github.com/isidentical).
|
||||
|
||||
## Unfusing LoRA parameters
|
||||
|
||||
To undo `fuse_lora`, call [`~diffusers.loaders.LoraLoaderMixin.unfuse_lora`] on a pipeline.
|
||||
|
||||
## Working with different LoRA scales when using LoRA fusion
|
||||
|
||||
If you need to use `scale` when working with `fuse_lora()` to control the influence of the LoRA parameters on the outputs, you should specify `lora_scale` within `fuse_lora()`. Passing the `scale` parameter to `cross_attention_kwargs` when you call the pipeline won't work.
|
||||
|
||||
To use a different `lora_scale` with `fuse_lora()`, you should first call `unfuse_lora()` on the corresponding pipeline and call `fuse_lora()` again with the expected `lora_scale`.
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
|
||||
lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
|
||||
lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
|
||||
pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
|
||||
|
||||
# This uses a default `lora_scale` of 1.0.
|
||||
pipe.fuse_lora()
|
||||
|
||||
generator = torch.manual_seed(0)
|
||||
images_fusion = pipe(
|
||||
"masterpiece, best quality, mountain", generator=generator, num_inference_steps=2
|
||||
).images
|
||||
|
||||
# To work with a different `lora_scale`, first reverse the effects of `fuse_lora()`.
|
||||
pipe.unfuse_lora()
|
||||
|
||||
# Then proceed as follows.
|
||||
pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
|
||||
pipe.fuse_lora(lora_scale=0.5)
|
||||
|
||||
generator = torch.manual_seed(0)
|
||||
images_fusion = pipe(
|
||||
"masterpiece, best quality, mountain", generator=generator, num_inference_steps=2
|
||||
).images
|
||||
```
|
||||
|
||||
## Serializing pipelines with fused LoRA parameters
|
||||
|
||||
Let's say you want to load the pipeline above that has its UNet fused with the LoRA parameters. You can easily do so by simply calling the `save_pretrained()` method on `pipe`.
|
||||
|
||||
After loading the LoRA parameters into a pipeline, if you want to serialize the pipeline such that the affected model components are already fused with the LoRA parameters, you should:
|
||||
|
||||
* call `fuse_lora()` on the pipeline with the desired `lora_scale`, given you've already loaded the LoRA parameters into it.
|
||||
* call `save_pretrained()` on the pipeline.
|
||||
|
||||
Here is a complete example:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
|
||||
lora_model_id = "hf-internal-testing/sdxl-1.0-lora"
|
||||
lora_filename = "sd_xl_offset_example-lora_1.0.safetensors"
|
||||
pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
|
||||
|
||||
# First, fuse the LoRA parameters.
|
||||
pipe.fuse_lora()
|
||||
|
||||
# Then save.
|
||||
pipe.save_pretrained("my-pipeline-with-fused-lora")
|
||||
```
|
||||
|
||||
Now, you can load the pipeline and directly perform inference without having to load the LoRA parameters again:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("my-pipeline-with-fused-lora", torch_dtype=torch.float16).to("cuda")
|
||||
|
||||
generator = torch.manual_seed(0)
|
||||
images_fusion = pipe(
|
||||
"masterpiece, best quality, mountain", generator=generator, num_inference_steps=2
|
||||
).images
|
||||
```
|
||||
|
||||
## Working with multiple LoRA checkpoints
|
||||
|
||||
With the `fuse_lora()` method as described above, it's possible to load multiple LoRA checkpoints. Let's work through a complete example. First we load the base pipeline:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionXLPipeline, AutoencoderKL
|
||||
import torch
|
||||
|
||||
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
vae=vae,
|
||||
torch_dtype=torch.float16,
|
||||
)
|
||||
pipe.to("cuda")
|
||||
```
|
||||
|
||||
Then let's load two LoRA checkpoints and fuse them with specific `lora_scale` values:
|
||||
|
||||
```python
|
||||
# LoRA one.
|
||||
pipe.load_lora_weights("goofyai/cyborg_style_xl")
|
||||
pipe.fuse_lora(lora_scale=0.7)
|
||||
|
||||
# LoRA two.
|
||||
pipe.load_lora_weights("TheLastBen/Pikachu_SDXL")
|
||||
pipe.fuse_lora(lora_scale=0.7)
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
Play with the `lora_scale` parameter when working with multiple LoRAs to control the amount of their influence on the final outputs.
|
||||
|
||||
</Tip>
|
||||
|
||||
Let's see them in action:
|
||||
|
||||
```python
|
||||
prompt = "cyborg style pikachu"
|
||||
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
|
||||
```
|
||||
|
||||

|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Currently, unfusing multiple LoRA checkpoints is not possible.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Supporting different LoRA checkpoints from Diffusers
|
||||
|
||||
🤗 Diffusers supports loading checkpoints from popular LoRA trainers such as [Kohya](https://github.com/kohya-ss/sd-scripts/) and [TheLastBen](https://github.com/TheLastBen/fast-stable-diffusion). In this section, we outline the current API's details and limitations.
|
||||
|
||||
### Kohya
|
||||
|
||||
This support was made possible because of the amazing contributors: [@takuma104](https://github.com/takuma104) and [@isidentical](https://github.com/isidentical).
|
||||
|
||||
We support loading Kohya LoRA checkpoints using [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`]. In this section, we explain how to load such a checkpoint from [CivitAI](https://civitai.com/)
|
||||
To provide seamless interoperability with A1111 to our users, we support loading A1111 formatted
|
||||
LoRA checkpoints using [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] in a limited capacity.
|
||||
In this section, we explain how to load an A1111 formatted LoRA checkpoint from [CivitAI](https://civitai.com/)
|
||||
in Diffusers and perform inference with it.
|
||||
|
||||
First, download a checkpoint. We'll use
|
||||
@@ -503,9 +356,9 @@ lora_filename = "light_and_shadow.safetensors"
|
||||
pipeline.load_lora_weights(lora_model_id, weight_name=lora_filename)
|
||||
```
|
||||
|
||||
### Kohya + Stable Diffusion XL
|
||||
### Supporting Stable Diffusion XL LoRAs trained using the Kohya-trainer
|
||||
|
||||
After the release of [Stable Diffusion XL](https://huggingface.co/papers/2307.01952), the community contributed some amazing LoRA checkpoints trained on top of it with the Kohya trainer.
|
||||
With this [PR](https://github.com/huggingface/diffusers/pull/4287), there should now be better support for loading Kohya-style LoRAs trained on Stable Diffusion XL (SDXL).
|
||||
|
||||
Here are some example checkpoints we tried out:
|
||||
|
||||
@@ -527,8 +380,8 @@ base_model_id = "stabilityai/stable-diffusion-xl-base-0.9"
|
||||
pipeline = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
|
||||
pipeline.load_lora_weights(".", weight_name="Kamepan.safetensors")
|
||||
|
||||
prompt = "anime screencap, glint, drawing, best quality, light smile, shy, a full body of a girl wearing wedding dress in the middle of the forest beneath the trees, fireflies, big eyes, 2d, cute, anime girl, waifu, cel shading, magical girl, vivid colors, (outline:1.1), manga anime artstyle, masterpiece, official wallpaper, glint <lora:kame_sdxl_v2:1>"
|
||||
negative_prompt = "(deformed, bad quality, sketch, depth of field, blurry:1.1), grainy, bad anatomy, bad perspective, old, ugly, realistic, cartoon, disney, bad proportions"
|
||||
prompt = "anime screencap, glint, drawing, best quality, light smile, shy, a full body of a girl wearing wedding dress in the middle of the forest beneath the trees, fireflies, big eyes, 2d, cute, anime girl, waifu, cel shading, magical girl, vivid colors, (outline:1.1), manga anime artstyle, masterpiece, offical wallpaper, glint <lora:kame_sdxl_v2:1>"
|
||||
negative_prompt = "(deformed, bad quality, sketch, depth of field, blurry:1.1), grainy, bad anatomy, bad perspective, old, ugly, realistic, cartoon, disney, bad propotions"
|
||||
generator = torch.manual_seed(2947883060)
|
||||
num_inference_steps = 30
|
||||
guidance_scale = 7
|
||||
@@ -546,33 +399,14 @@ If you notice carefully, the inference UX is exactly identical to what we presen
|
||||
|
||||
Thanks to [@isidentical](https://github.com/isidentical) for helping us on integrating this feature.
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
**Known limitations specific to the Kohya LoRAs**:
|
||||
### Known limitations specific to the Kohya-styled LoRAs
|
||||
|
||||
* When images don't look similar to other UIs, such as ComfyUI, it can be because of multiple reasons, as explained [here](https://github.com/huggingface/diffusers/pull/4287/#issuecomment-1655110736).
|
||||
* We don't fully support [LyCORIS checkpoints](https://github.com/KohakuBlueleaf/LyCORIS). To the best of our knowledge, our current `load_lora_weights()` should support LyCORIS checkpoints that have LoRA and LoCon modules but not the other ones, such as Hada, LoKR, etc.
|
||||
|
||||
</Tip>
|
||||
## Stable Diffusion XL
|
||||
|
||||
### TheLastBen
|
||||
We support fine-tuning with [Stable Diffusion XL](https://huggingface.co/papers/2307.01952). Please refer to the following docs:
|
||||
|
||||
Here is an example:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipeline_id = "Lykon/dreamshaper-xl-1-0"
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
lora_model_id = "TheLastBen/Papercut_SDXL"
|
||||
lora_filename = "papercut.safetensors"
|
||||
pipe.load_lora_weights(lora_model_id, weight_name=lora_filename)
|
||||
|
||||
prompt = "papercut sonic"
|
||||
image = pipe(prompt=prompt, num_inference_steps=20, generator=torch.manual_seed(0)).images[0]
|
||||
image
|
||||
```
|
||||
* [text_to_image/README_sdxl.md](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md)
|
||||
* [dreambooth/README_sdxl.md](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md)
|
||||
|
||||
@@ -34,16 +34,13 @@ If you feel like another important example should exist, we are more than happy
|
||||
Training examples show how to pretrain or fine-tune diffusion models for a variety of tasks. Currently we support:
|
||||
|
||||
- [Unconditional Training](./unconditional_training)
|
||||
- [Text-to-Image Training](./text2image)<sup>*</sup>
|
||||
- [Text-to-Image Training](./text2image)
|
||||
- [Text Inversion](./text_inversion)
|
||||
- [Dreambooth](./dreambooth)<sup>*</sup>
|
||||
- [LoRA Support](./lora)<sup>*</sup>
|
||||
- [ControlNet](./controlnet)<sup>*</sup>
|
||||
- [InstructPix2Pix](./instructpix2pix)<sup>*</sup>
|
||||
- [Dreambooth](./dreambooth)
|
||||
- [LoRA Support](./lora)
|
||||
- [ControlNet](./controlnet)
|
||||
- [InstructPix2Pix](./instructpix2pix)
|
||||
- [Custom Diffusion](./custom_diffusion)
|
||||
- [T2I-Adapters](./t2i_adapters)<sup>*</sup>
|
||||
|
||||
<sup>*</sup>: Supports [Stable Diffusion XL](../api/pipelines/stable_diffusion/stable_diffusion_xl).
|
||||
|
||||
If possible, please [install xFormers](../optimization/xformers) for memory efficient attention. This could help make your training faster and less memory intensive.
|
||||
|
||||
@@ -57,7 +54,6 @@ If possible, please [install xFormers](../optimization/xformers) for memory effi
|
||||
| [**ControlNet**](./controlnet) | ✅ | ✅ | - |
|
||||
| [**InstructPix2Pix**](./instructpix2pix) | ✅ | ✅ | - |
|
||||
| [**Custom Diffusion**](./custom_diffusion) | ✅ | ✅ | - |
|
||||
| [**T2I Adapters**](./t2i_adapters) | ✅ | ✅ | - |
|
||||
|
||||
## Community
|
||||
|
||||
|
||||
@@ -1,143 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# T2I-Adapters for Stable Diffusion XL (SDXL)
|
||||
|
||||
The `train_t2i_adapter_sdxl.py` script (as shown below) shows how to implement the [T2I-Adapter training procedure](https://hf.co/papers/2302.08453) for [Stable Diffusion XL](https://huggingface.co/papers/2307.01952).
|
||||
|
||||
## Running locally with PyTorch
|
||||
|
||||
### Installing the dependencies
|
||||
|
||||
Before running the scripts, make sure to install the library's training dependencies:
|
||||
|
||||
**Important**
|
||||
|
||||
To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/huggingface/diffusers
|
||||
cd diffusers
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
Then cd into the `examples/t2i_adapter` folder and run
|
||||
```bash
|
||||
pip install -r requirements_sdxl.txt
|
||||
```
|
||||
|
||||
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
|
||||
|
||||
```bash
|
||||
accelerate config
|
||||
```
|
||||
|
||||
Or for a default accelerate configuration without answering questions about your environment
|
||||
|
||||
```bash
|
||||
accelerate config default
|
||||
```
|
||||
|
||||
Or if your environment doesn't support an interactive shell (e.g., a notebook)
|
||||
|
||||
```python
|
||||
from accelerate.utils import write_basic_config
|
||||
write_basic_config()
|
||||
```
|
||||
|
||||
When running `accelerate config`, setting the torch compile mode to True can give dramatic speedups.
|
||||
|
||||
## Circle filling dataset
|
||||
|
||||
The original dataset is hosted in the [ControlNet repo](https://huggingface.co/lllyasviel/ControlNet/blob/main/training/fill50k.zip). We re-uploaded it to be compatible with `datasets` [here](https://huggingface.co/datasets/fusing/fill50k). Note that `datasets` handles dataloading within the training script.
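If you want to inspect the data outside of the training script, a minimal sketch with the `datasets` library looks like this (column names may vary; check the dataset card):

```python
from datasets import load_dataset

# Downloads the circle-filling dataset from the Hub.
dataset = load_dataset("fusing/fill50k", split="train")
print(dataset)
print(dataset[0].keys())  # e.g. the image, the conditioning image, and a text prompt
```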
|
||||
|
||||
## Training
|
||||
|
||||
Our training examples use two test conditioning images. They can be downloaded by running
|
||||
|
||||
```sh
|
||||
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
|
||||
|
||||
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
|
||||
```
|
||||
|
||||
Then run `huggingface-cli login` to log into your Hugging Face account. This is needed to be able to push the trained T2IAdapter parameters to Hugging Face Hub.
|
||||
|
||||
```bash
|
||||
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
|
||||
export OUTPUT_DIR="path to save model"
|
||||
|
||||
accelerate launch train_t2i_adapter_sdxl.py \
|
||||
--pretrained_model_name_or_path=$MODEL_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--dataset_name=fusing/fill50k \
|
||||
--mixed_precision="fp16" \
|
||||
--resolution=1024 \
|
||||
--learning_rate=1e-5 \
|
||||
--max_train_steps=15000 \
|
||||
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
|
||||
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
|
||||
--validation_steps=100 \
|
||||
--train_batch_size=1 \
|
||||
--gradient_accumulation_steps=4 \
|
||||
--report_to="wandb" \
|
||||
--seed=42 \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
To better track our training experiments, we're using the following flags in the command above:
|
||||
|
||||
* `report_to="wandb` will ensure the training runs are tracked on Weights and Biases. To use it, be sure to install `wandb` with `pip install wandb`.
|
||||
* `validation_image`, `validation_prompt`, and `validation_steps` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
|
||||
|
||||
Our experiments were conducted on a single 40GB A100 GPU.
|
||||
|
||||
### Inference
|
||||
|
||||
Once training is done, we can perform inference like so:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteScheduler
|
||||
from diffusers.utils import load_image
|
||||
import torch
|
||||
|
||||
base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
|
||||
adapter_path = "path to adapter"
|
||||
|
||||
adapter = T2IAdapter.from_pretrained(adapter_path, torch_dtype=torch.float16)
|
||||
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
|
||||
base_model_path, adapter=adapter, torch_dtype=torch.float16
|
||||
)
|
||||
|
||||
# speed up diffusion process with faster scheduler and memory optimization
|
||||
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
|
||||
# remove following line if xformers is not installed or when using Torch 2.0.
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
# memory optimization.
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
control_image = load_image("./conditioning_image_1.png")
|
||||
prompt = "pale golden rod circle with old lace background"
|
||||
|
||||
# generate image
|
||||
generator = torch.manual_seed(0)
|
||||
image = pipe(
|
||||
prompt, num_inference_steps=20, generator=generator, image=control_image
|
||||
).images[0]
|
||||
image.save("./output.png")
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
### Specifying a better VAE
|
||||
|
||||
SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument, `--pretrained_vae_model_name_or_path`, that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)).
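For example, the training command shown earlier could be extended with this flag (the VAE repository is the one linked above; all other flags stay unchanged):

```bash
accelerate launch train_t2i_adapter_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --mixed_precision="fp16" \
  --resolution=1024 \
  --learning_rate=1e-5 \
  --max_train_steps=15000 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4
```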
|
||||
@@ -281,8 +281,3 @@ image.save("yoda-pokemon.png")
|
||||
|
||||
* We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md).
|
||||
* We also support fine-tuning of the UNet and Text Encoder shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with LoRA via the `train_text_to_image_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md).
|
||||
|
||||
|
||||
## Kandinsky 2.2
|
||||
|
||||
* We support fine-tuning both the decoder and prior in Kandinsky2.2 with the `train_text_to_image_prior.py` and `train_text_to_image_decoder.py` scripts. LoRA support is also included. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/README_sdxl.md).
|
||||
@@ -192,7 +192,7 @@ been added to the text encoder embedding matrix and consequently been trained.
|
||||
<Tip>
|
||||
|
||||
💡 The community has created a large library of different textual inversion embedding vectors, called [sd-concepts-library](https://huggingface.co/sd-concepts-library).
|
||||
Instead of training textual inversion embeddings from scratch you can also see whether a fitting textual inversion embedding has already been added to the library.
|
||||
|
||||
|
||||
</Tip>
|
||||
|
||||
|
||||
@@ -284,11 +284,22 @@ Now you can wrap all these components together in a training loop with 🤗 Acce
|
||||
|
||||
```py
|
||||
>>> from accelerate import Accelerator
|
||||
>>> from huggingface_hub import create_repo, upload_folder
|
||||
>>> from huggingface_hub import HfFolder, Repository, whoami
|
||||
>>> from tqdm.auto import tqdm
|
||||
>>> from pathlib import Path
|
||||
>>> import os
|
||||
|
||||
|
||||
>>> def get_full_repo_name(model_id: str, organization: str = None, token: str = None):
|
||||
... if token is None:
|
||||
... token = HfFolder.get_token()
|
||||
... if organization is None:
|
||||
... username = whoami(token)["name"]
|
||||
... return f"{username}/{model_id}"
|
||||
... else:
|
||||
... return f"{organization}/{model_id}"
|
||||
|
||||
|
||||
>>> def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
|
||||
... # Initialize accelerator and tensorboard logging
|
||||
... accelerator = Accelerator(
|
||||
@@ -298,12 +309,11 @@ Now you can wrap all these components together in a training loop with 🤗 Acce
|
||||
... project_dir=os.path.join(config.output_dir, "logs"),
|
||||
... )
|
||||
... if accelerator.is_main_process:
|
||||
... if config.output_dir is not None:
|
||||
... os.makedirs(config.output_dir, exist_ok=True)
|
||||
... if config.push_to_hub:
|
||||
... repo_id = create_repo(
|
||||
... repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True
|
||||
... ).repo_id
|
||||
... repo_name = get_full_repo_name(Path(config.output_dir).name)
|
||||
... repo = Repository(config.output_dir, clone_from=repo_name)
|
||||
... elif config.output_dir is not None:
|
||||
... os.makedirs(config.output_dir, exist_ok=True)
|
||||
... accelerator.init_trackers("train_example")
|
||||
|
||||
... # Prepare everything
|
||||
@@ -361,12 +371,7 @@ Now you can wrap all these components together in a training loop with 🤗 Acce
|
||||
|
||||
... if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
|
||||
... if config.push_to_hub:
|
||||
... upload_folder(
|
||||
... repo_id=repo_id,
|
||||
... folder_path=config.output_dir,
|
||||
... commit_message=f"Epoch {epoch}",
|
||||
... ignore_patterns=["step_*", "epoch_*"],
|
||||
... )
|
||||
... repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
|
||||
... else:
|
||||
... pipeline.save_pretrained(config.output_dir)
|
||||
```
|
||||
|
||||
@@ -1,165 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
# Inference with PEFT
|
||||
|
||||
There are many adapters trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. With the 🤗 [PEFT](https://huggingface.co/docs/peft/index) integration in 🤗 Diffusers, it is really easy to load and manage adapters for inference. In this guide, you'll learn how to use different adapters with [Stable Diffusion XL (SDXL)](./pipelines/stable_diffusion/stable_diffusion_xl) for inference.
|
||||
|
||||
Throughout this guide, you'll use LoRA as the main adapter technique, so we'll use the terms LoRA and adapter interchangeably. You should have some familiarity with LoRA, and if you don't, we welcome you to check out the [LoRA guide](https://huggingface.co/docs/peft/conceptual_guides/lora).
|
||||
|
||||
Let's first install all the required libraries.
|
||||
|
||||
```bash
|
||||
!pip install -q transformers accelerate
|
||||
# Will be updated once the stable releases are done.
|
||||
!pip install -q git+https://github.com/huggingface/peft.git
|
||||
!pip install -q git+https://github.com/huggingface/diffusers.git
|
||||
```
|
||||
|
||||
Now, let's load a pipeline with a SDXL checkpoint:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
|
||||
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")
|
||||
```
|
||||
|
||||
|
||||
Next, load a LoRA checkpoint with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method.
|
||||
|
||||
With the 🤗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which lets you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`.
|
||||
|
||||
```python
|
||||
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
|
||||
```
|
||||
|
||||
And then perform inference:
|
||||
|
||||
```python
|
||||
prompt = "toy_face of a hacker with a hoodie"
|
||||
|
||||
lora_scale = 0.9
|
||||
image = pipe(
|
||||
prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
|
||||
).images[0]
|
||||
image
|
||||
```
|
||||
|
||||

|
||||
|
||||
|
||||
With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images, and let's call it `"pixel"`.
|
||||
|
||||
The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter. But you can activate the `"pixel"` adapter with the [`~diffusers.loaders.set_adapters`] method as shown below:
|
||||
|
||||
```python
|
||||
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
|
||||
pipe.set_adapters("pixel")
|
||||
```
|
||||
|
||||
Let's now generate an image with the second adapter and check the result:
|
||||
|
||||
```python
|
||||
prompt = "a hacker with a hoodie, pixel art"
|
||||
image = pipe(
|
||||
prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
|
||||
).images[0]
|
||||
image
|
||||
```
|
||||
|
||||

|
||||
|
||||
## Combine multiple adapters
|
||||
|
||||
You can also perform multi-adapter inference where you combine different adapter checkpoints for inference.
|
||||
|
||||
Once again, use the [`~diffusers.loaders.set_adapters`] method to activate two LoRA checkpoints and specify the weight for how the checkpoints should be combined.
|
||||
|
||||
```python
|
||||
pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])
|
||||
```
|
||||
|
||||
Now that we have set these two adapters, let's generate an image from the combined adapters!
|
||||
|
||||
<Tip>
|
||||
|
||||
LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts.
|
||||
|
||||
</Tip>
|
||||
|
||||
The trigger words for [CiroN2022/toy-face](https://hf.co/CiroN2022/toy-face) and [nerijs/pixel-art-xl](https://hf.co/nerijs/pixel-art-xl) are found in their repositories.
|
||||
|
||||
|
||||
```python
|
||||
# Notice how the prompt is constructed.
|
||||
prompt = "toy_face of a hacker with a hoodie, pixel art"
|
||||
image = pipe(
|
||||
prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0)
|
||||
).images[0]
|
||||
image
|
||||
```
|
||||
|
||||

|
||||
|
||||
Impressive! As you can see, the model was able to generate an image that mixes the characteristics of both adapters.
|
||||
|
||||
If you want to go back to using only one adapter, use the [`~diffusers.loaders.set_adapters`] method to activate the `"toy"` adapter:
|
||||
|
||||
```python
|
||||
# First, set the adapter.
|
||||
pipe.set_adapters("toy")
|
||||
|
||||
# Then, run inference.
|
||||
prompt = "toy_face of a hacker with a hoodie"
|
||||
lora_scale = 0.9
|
||||
image = pipe(
|
||||
prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0)
|
||||
).images[0]
|
||||
image
|
||||
```
|
||||
|
||||

|
||||
|
||||
|
||||
If you want to switch to only the base model, disable all LoRAs with the [`~diffusers.loaders.disable_lora`] method.
|
||||
|
||||
|
||||
```python
|
||||
pipe.disable_lora()
|
||||
|
||||
prompt = "toy_face of a hacker with a hoodie"
|
||||
lora_scale = 0.9
|
||||
image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0]
|
||||
image
|
||||
```
|
||||
|
||||

|
||||
|
||||
## Monitoring active adapters
|
||||
|
||||
You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, you can easily check the list of active adapters using the [`~diffusers.loaders.get_active_adapters`] method:
|
||||
|
||||
```python
|
||||
active_adapters = pipe.get_active_adapters()
|
||||
>>> ["toy", "pixel"]
|
||||
```
|
||||
|
||||
You can also get the active adapters of each pipeline component with [`~diffusers.loaders.get_list_adapters`]:
|
||||
|
||||
```python
|
||||
list_adapters_component_wise = pipe.get_list_adapters()
|
||||
>>> {"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]}
|
||||
```
|
||||
@@ -10,297 +10,51 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Text-to-image
|
||||
# Conditional image generation
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
When you think of diffusion models, text-to-image is usually one of the first things that come to mind. Text-to-image generates an image from a text description (for example, "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k") which is also known as a *prompt*.
|
||||
Conditional image generation allows you to generate images from a text prompt. The text is converted into embeddings which are used to condition the model to generate an image from noise.
|
||||
|
||||
From a very high level, a diffusion model takes a prompt and some random initial noise, and iteratively removes the noise to construct an image. The *denoising* process is guided by the prompt, and once the denoising process ends after a predetermined number of time steps, the image representation is decoded into an image.
|
||||
The [`DiffusionPipeline`] is the easiest way to use a pre-trained diffusion system for inference.
|
||||
|
||||
<Tip>
|
||||
Start by creating an instance of [`DiffusionPipeline`] and specify which pipeline [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) you would like to download.
|
||||
|
||||
Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog post to learn more about how a latent diffusion model works.
|
||||
In this guide, you'll use [`DiffusionPipeline`] for text-to-image generation with [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5):
|
||||
|
||||
</Tip>
|
||||
```python
|
||||
>>> from diffusers import DiffusionPipeline
|
||||
|
||||
You can generate images from a prompt in 🤗 Diffusers in two steps:
|
||||
|
||||
1. Load a checkpoint into the [`AutoPipelineForText2Image`] class, which automatically detects the appropriate pipeline class to use based on the checkpoint:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
|
||||
).to("cuda")
|
||||
>>> generator = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
|
||||
```
|
||||
|
||||
2. Pass a prompt to the pipeline to generate an image:
|
||||
The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components.
|
||||
Because the model consists of roughly 1.4 billion parameters, we strongly recommend running it on a GPU.
|
||||
You can move the generator object to a GPU, just like you would in PyTorch:
|
||||
|
||||
```py
|
||||
image = pipeline(
|
||||
"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k"
|
||||
).images[0]
|
||||
```python
|
||||
>>> generator.to("cuda")
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-vader.png"/>
|
||||
</div>
|
||||
Now you can use the `generator` on your text prompt:
|
||||
|
||||
## Popular models
|
||||
|
||||
The most common text-to-image models are [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder). There are also ControlNet models or adapters that can be used with text-to-image models for more direct control in generating images. The results from each model are slightly different because of their architecture and training process, but no matter which model you choose, their usage is more or less the same. Let's use the same prompt for each model and compare their results.
|
||||
|
||||
### Stable Diffusion v1.5
|
||||
|
||||
[Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) is a latent diffusion model initialized from [Stable Diffusion v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4), and finetuned for 595K steps on 512x512 images from the LAION-Aesthetics V2 dataset. You can use this model like:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
|
||||
).to("cuda")
|
||||
generator = torch.Generator("cuda").manual_seed(31)
|
||||
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
|
||||
```python
|
||||
>>> image = generator("An image of a squirrel in Picasso style").images[0]
|
||||
```
|
||||
|
||||
### Stable Diffusion XL
|
||||
The output is by default wrapped into a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object.
|
||||
|
||||
SDXL is a much larger version of the previous Stable Diffusion models, and involves a two-stage model process that adds even more details to an image. It also includes some additional *micro-conditionings* to generate high-quality images with centered subjects. Take a look at the more comprehensive [SDXL](sdxl) guide to learn more about how to use it. In general, you can use SDXL like:
|
||||
You can save the image by calling:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
|
||||
).to("cuda")
|
||||
generator = torch.Generator("cuda").manual_seed(31)
|
||||
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
|
||||
```python
|
||||
>>> image.save("image_of_squirrel_painting.png")
|
||||
```
|
||||
|
||||
### Kandinsky 2.2
|
||||
Try out the Spaces below, and feel free to play around with the guidance scale parameter to see how it affects the image quality!
|
||||
|
||||
The Kandinsky model is a bit different from the Stable Diffusion models because it also uses an image prior model to create embeddings that are used to better align text and images in the diffusion model.
|
||||
|
||||
The easiest way to use Kandinsky 2.2 is:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, variant="fp16"
|
||||
).to("cuda")
|
||||
generator = torch.Generator("cuda").manual_seed(31)
|
||||
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
|
||||
```
|
||||
|
||||
### ControlNet
|
||||
|
||||
ControlNets are auxiliary models or adapters finetuned on top of text-to-image models, such as [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5). Using ControlNet models in combination with text-to-image models offers diverse options for more explicit control over how to generate an image. With a ControlNet, you add an additional conditioning input image to the model. For example, if you provide an image of a human pose (usually represented as multiple keypoints that are connected into a skeleton) as a conditioning input, the model generates an image that follows the pose of that image. Check out the more in-depth [ControlNet](controlnet) guide to learn more about other conditioning inputs and how to use them.
|
||||
|
||||
In this example, let's condition the ControlNet with a human pose estimation image. Load the ControlNet model pretrained on human pose estimations:
|
||||
|
||||
```py
|
||||
from diffusers import ControlNetModel, AutoPipelineForText2Image
|
||||
from diffusers.utils import load_image
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained(
|
||||
"lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16, variant="fp16"
|
||||
).to("cuda")
|
||||
pose_image = load_image("https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png")
|
||||
```
|
||||
|
||||
Pass the `controlnet` to the [`AutoPipelineForText2Image`], and provide the prompt and pose estimation image:
|
||||
|
||||
```py
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16"
|
||||
).to("cuda")
|
||||
generator = torch.Generator("cuda").manual_seed(31)
|
||||
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=pose_image, generator=generator).images[0]
|
||||
```
|
||||
|
||||
<div class="flex flex-row gap-4">
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-1.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">Stable Diffusion v1.5</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">Stable Diffusion XL</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-2.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">Kandinsky 2.2</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-3.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">ControlNet (pose conditioning)</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Configure pipeline parameters
|
||||
|
||||
There are a number of parameters that can be configured in the pipeline that affect how an image is generated. You can change the image's output size, specify a negative prompt to improve image quality, and more. This section dives deeper into how to use these parameters.
|
||||
|
||||
### Height and width
|
||||
|
||||
The `height` and `width` parameters control the height and width (in pixels) of the generated image. By default, the Stable Diffusion v1.5 model outputs 512x512 images, but you can change this to any size that is a multiple of 8. For example, to create a rectangular image:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
|
||||
).to("cuda")
|
||||
image = pipeline(
|
||||
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", height=768, width=512
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-hw.png"/>
|
||||
</div>
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
Other models may have different default image sizes depending on the image sizes in the training dataset. For example, SDXL's default image size is 1024x1024, and using lower `height` and `width` values may result in lower quality images. Make sure you check the model's API reference first!
|
||||
|
||||
</Tip>
|
||||
|
||||
### Guidance scale
|
||||
|
||||
The `guidance_scale` parameter affects how much the prompt influences image generation. A lower value gives the model "creativity" to generate images that are more loosely related to the prompt. Higher `guidance_scale` values push the model to follow the prompt more closely, and if this value is too high, you may observe some artifacts in the generated image.
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
image = pipeline(
|
||||
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", guidance_scale=3.5
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex flex-row gap-4">
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-guidance-scale-2.5.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 2.5</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-guidance-scale-7.5.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 7.5</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-guidance-scale-10.5.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 10.5</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
### Negative prompt
|
||||
|
||||
Just like how a prompt guides generation, a *negative prompt* steers the model away from things you don't want the model to generate. This is commonly used to improve overall image quality by removing poor or bad image features such as "low resolution" or "bad details". You can also use a negative prompt to remove or modify the content and style of an image.
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
image = pipeline(
|
||||
prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
|
||||
negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy",
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex flex-row gap-4">
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-neg-prompt-1.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">negative prompt = "ugly, deformed, disfigured, poor details, bad anatomy"</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-neg-prompt-2.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">negative prompt = "astronaut"</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
### Generator
|
||||
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html#generator) object enables reproducibility in a pipeline by setting a manual seed. You can use a `Generator` to generate batches of images and iteratively improve on an image generated from a seed as detailed in the [Improve image quality with deterministic generation](reusing_seeds) guide.
|
||||
|
||||
You can set a seed and `Generator` as shown below. Creating an image with a `Generator` should return the same result each time instead of randomly generating a new image.
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
generator = torch.Generator(device="cuda").manual_seed(30)
|
||||
image = pipeline(
|
||||
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
|
||||
generator=generator,
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Control image generation
|
||||
|
||||
There are several ways to exert more control over how an image is generated outside of configuring a pipeline's parameters, such as prompt weighting and ControlNet models.
|
||||
|
||||
### Prompt weighting
|
||||
|
||||
Prompt weighting is a technique for increasing or decreasing the importance of concepts in a prompt to emphasize or minimize certain features in an image. We recommend using the [Compel](https://github.com/damian0815/compel) library to help you generate the weighted prompt embeddings.
|
||||
|
||||
<Tip>
|
||||
|
||||
Learn how to create the prompt embeddings in the [Prompt weighting](weighted_prompts) guide. This example focuses on how to use the prompt embeddings in the pipeline.
|
||||
|
||||
</Tip>
|
||||
|
||||
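For instance, you might produce the embeddings with [Compel](https://github.com/damian0815/compel) roughly like this. This is a minimal sketch: it assumes a `pipeline` loaded as in the snippet below, and the `Compel` class, its callable interface, and the `++` weighting syntax are Compel's own API, not part of 🤗 Diffusers.

```py
# rough sketch of creating weighted prompt embeddings with Compel (see the Prompt weighting guide)
from compel import Compel

compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
prompt_embeds = compel("Astronaut in a jungle++, cold color palette, muted colors, detailed, 8k")
negative_prompt_embeds = compel("ugly, deformed, disfigured, poor details, bad anatomy")
```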
Once you've created the embeddings, you can pass them to the `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter in the pipeline.
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
image = pipeline(
|
||||
prompt_embeds=prompt_embeds, # generated from Compel
|
||||
negative_prompt_embeds=negative_prompt_embeds, # generated from Compel
|
||||
).images[0]
|
||||
```
|
||||
|
||||
### ControlNet
|
||||
|
||||
As you saw in the [ControlNet](#controlnet) section, these models offer a more flexible and accurate way to generate images by incorporating an additional conditioning image input. Each ControlNet model is pretrained on a particular type of conditioning image to generate new images that resemble it. For example, if you take a ControlNet pretrained on depth maps, you can give the model a depth map as a conditioning input and it'll generate an image that preserves the spatial information in it. This is quicker and easier than specifying the depth information in a prompt. You can even combine multiple conditioning inputs with a [MultiControlNet](controlnet#multicontrolnet)!
|
||||
|
||||
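As a rough sketch of what depth conditioning might look like (reusing the depth ControlNet checkpoint, example image, and 🤗 Transformers `depth-estimation` pipeline that appear in the ControlNet guide; treat it as illustrative rather than a tuned recipe):

```py
from diffusers import AutoPipelineForText2Image, ControlNetModel
from diffusers.utils import load_image
from transformers import pipeline as transformers_pipeline
from PIL import Image
import numpy as np
import torch

# estimate a depth map from an example image
depth_estimator = transformers_pipeline("depth-estimation")
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
)
depth = np.array(depth_estimator(init_image)["depth"])
depth_image = Image.fromarray(np.stack([depth] * 3, axis=-1))  # replicate to 3 channels

# condition a text-to-image ControlNet pipeline on the depth map
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
image = pipeline(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=depth_image
).images[0]
```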
There are many types of conditioning inputs you can use, and 🤗 Diffusers supports ControlNet for Stable Diffusion and SDXL models. Take a look at the more comprehensive [ControlNet](controlnet) guide to learn how you can use these models.
|
||||
|
||||
## Optimize
|
||||
|
||||
Diffusion models are large, and the iterative nature of denoising an image is computationally expensive and memory intensive. But this doesn't mean you need access to powerful - or even many - GPUs to use them. There are many optimization techniques for running diffusion models on consumer and free-tier resources. For example, you can load the model weights in half-precision to save GPU memory and increase speed, or offload model components to the CPU while they aren't in use to save even more memory.
|
||||
|
||||
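For example, a minimal sketch of the two ideas just mentioned, loading the weights in half-precision and offloading idle components to the CPU (same checkpoint as above):

```py
from diffusers import AutoPipelineForText2Image
import torch

# half-precision weights cut GPU memory use roughly in half
pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
)
# components are moved to the GPU only while they're needed, then back to the CPU
pipeline.enable_model_cpu_offload()
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
```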
PyTorch 2.0 also supports a more memory-efficient attention mechanism called [*scaled dot product attention*](../optimization/torch2.0#scaled-dot-product-attention) that is automatically enabled if you're using PyTorch 2.0. You can combine this with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) to speed your code up even more:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16").to("cuda")
|
||||
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
|
||||
```
|
||||
|
||||
For more tips on how to optimize your code to save memory and speed up inference, read the [Memory and speed](../optimization/fp16) and [Torch 2.0](../optimization/torch2.0) guides.
|
||||
<iframe
|
||||
src="https://stabilityai-stable-diffusion.hf.space"
|
||||
frameborder="0"
|
||||
width="850"
|
||||
height="500"
|
||||
></iframe>
|
||||
@@ -41,7 +41,6 @@ Unless otherwise mentioned, these are techniques that work with existing models
|
||||
13. [Model Editing](#model-editing)
|
||||
14. [DiffEdit](#diffedit)
|
||||
15. [T2I-Adapter](#t2i-adapter)
|
||||
16. [FABRIC](#fabric)
|
||||
|
||||
For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.
|
||||
|
||||
@@ -62,7 +61,7 @@ For convenience, we provide a table to denote which methods are inference-only a
|
||||
| [Model Editing](#model-editing) | ✅ | ❌ | |
|
||||
| [DiffEdit](#diffedit) | ✅ | ❌ | |
|
||||
| [T2I-Adapter](#t2i-adapter) | ✅ | ❌ | |
|
||||
| [Fabric](#fabric) | ✅ | ❌ | |
|
||||
|
||||
## Instruct Pix2Pix
|
||||
|
||||
[Paper](https://arxiv.org/abs/2211.09800)
|
||||
@@ -231,14 +230,3 @@ There are 8 canonical pre-trained adapters trained on different conditionings su
|
||||
depth maps, and semantic segmentations.
|
||||
|
||||
See [here](../api/pipelines/stable_diffusion/adapter) for more information on how to use it.
|
||||
|
||||
## Fabric
|
||||
|
||||
[Paper](https://arxiv.org/abs/2307.10159)
|
||||
|
||||
[Fabric](../api/pipelines/fabric) is a training-free
|
||||
approach applicable to a wide range of popular diffusion models, which exploits
|
||||
the self-attention layer present in the most widely used architectures to condition
|
||||
the diffusion process on a set of feedback images.
|
||||
|
||||
To know more details, check out the [official doc](../api/pipelines/fabric).
|
||||
|
||||
@@ -1,529 +0,0 @@
|
||||
# ControlNet
|
||||
|
||||
ControlNet is a type of model for controlling image diffusion models by conditioning the model with an additional input image. There are many types of conditioning inputs (canny edge, user sketching, human pose, depth, and more) you can use to control a diffusion model. This is hugely useful because it affords you greater control over image generation, making it easier to generate specific images without experimenting with different text prompts or denoising values as much.
|
||||
|
||||
<Tip>
|
||||
|
||||
Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub.
|
||||
|
||||
For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, or you can browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) ones on the Hub.
|
||||
|
||||
</Tip>
|
||||
|
||||
A ControlNet model has two sets of weights (or blocks) connected by a zero-convolution layer:
|
||||
|
||||
- a *locked copy* keeps everything a large pretrained diffusion model has learned
|
||||
- a *trainable copy* is trained on the additional conditioning input
|
||||
|
||||
Since the locked copy preserves the pretrained model, training and implementing a ControlNet on a new conditioning input is as fast as finetuning any other model because you aren't training the model from scratch.
|
||||
|
||||
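As a rough illustration of the zero-convolution idea (a conceptual sketch, not the actual 🤗 Diffusers implementation): a 1x1 convolution whose weights and bias start at zero, so at initialization the trainable copy contributes nothing and the locked copy's behavior is preserved.

```py
import torch
from torch import nn

# a 1x1 "zero convolution": weights and bias initialized to zero
zero_conv = nn.Conv2d(in_channels=320, out_channels=320, kernel_size=1)
nn.init.zeros_(zero_conv.weight)
nn.init.zeros_(zero_conv.bias)

trainable_out = torch.randn(1, 320, 64, 64)   # output of a trainable-copy block
residual = zero_conv(trainable_out)           # all zeros before any training step
locked_out = torch.randn(1, 320, 64, 64)      # output of the corresponding locked block
combined = locked_out + residual              # identical to the locked output at init
```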
This guide will show you how to use ControlNet for text-to-image, image-to-image, inpainting, and more! There are many types of ControlNet conditioning inputs to choose from, but in this guide we'll only focus on several of them. Feel free to experiment with other conditioning inputs!
|
||||
|
||||
Before you begin, make sure you have the following libraries installed:
|
||||
|
||||
```py
|
||||
# uncomment to install the necessary libraries in Colab
|
||||
#!pip install diffusers transformers accelerate safetensors opencv-python
|
||||
```
|
||||
|
||||
## Text-to-image
|
||||
|
||||
For text-to-image, you normally pass a text prompt to the model. But with ControlNet, you can specify an additional conditioning input. Let's condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline.
|
||||
|
||||
Load an image and use the [opencv-python](https://github.com/opencv/opencv-python) library to extract the canny image:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetPipeline
|
||||
from diffusers.utils import load_image
|
||||
from PIL import Image
|
||||
import cv2
|
||||
import numpy as np
|
||||
|
||||
image = load_image(
|
||||
"https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
|
||||
)
|
||||
|
||||
image = np.array(image)
|
||||
|
||||
low_threshold = 100
|
||||
high_threshold = 200
|
||||
|
||||
image = cv2.Canny(image, low_threshold, high_threshold)
|
||||
image = image[:, :, None]
|
||||
image = np.concatenate([image, image, image], axis=2)
|
||||
canny_image = Image.fromarray(image)
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vermeer_canny_edged.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">canny image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Next, load a ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
|
||||
).to("cuda")
|
||||
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
Now pass your prompt and canny image to the pipeline:
|
||||
|
||||
```py
|
||||
output = pipe(
|
||||
"the mona lisa", image=canny_image
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-text2img.png"/>
|
||||
</div>
|
||||
|
||||
## Image-to-image
|
||||
|
||||
For image-to-image, you'd typically pass an initial image and a prompt to the pipeline to generate a new image. With ControlNet, you can pass an additional conditioning input to guide the model. Let's condition the model with a depth map, an image which contains spatial information. This way, the ControlNet can use the depth map as a control to guide the model to generate an image that preserves spatial information.
|
||||
|
||||
You'll use the [`StableDiffusionControlNetImg2ImgPipeline`] for this task, which is different from the [`StableDiffusionControlNetPipeline`] because it allows you to pass an initial image as the starting point for the image generation process.
|
||||
|
||||
Load an image and use the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to extract the depth map of an image:
|
||||
|
||||
```py
|
||||
import torch
|
||||
import numpy as np
|
||||
|
||||
from transformers import pipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
|
||||
).resize((768, 768))
|
||||
|
||||
|
||||
def get_depth_map(image, depth_estimator):
|
||||
image = depth_estimator(image)["depth"]
|
||||
image = np.array(image)
|
||||
image = image[:, :, None]
|
||||
image = np.concatenate([image, image, image], axis=2)
|
||||
detected_map = torch.from_numpy(image).float() / 255.0
|
||||
depth_map = detected_map.permute(2, 0, 1)
|
||||
return depth_map
|
||||
|
||||
depth_estimator = pipeline("depth-estimation")
|
||||
depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
|
||||
```
|
||||
|
||||
Next, load a ControlNet model conditioned on depth maps and pass it to the [`StableDiffusionControlNetImg2ImgPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
|
||||
).to("cuda")
|
||||
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
Now pass your prompt, initial image, and depth map to the pipeline:
|
||||
|
||||
```py
|
||||
output = pipe(
|
||||
"lego batman and robin", image=image, control_image=depth_map,
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img-2.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
## Inpainting
|
||||
|
||||
For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition the model with. Let's condition the model with an inpainting control image created from the initial image and the mask image. This way, the ControlNet can use the control image to guide the model to fill in the masked area while leaving the rest of the image untouched.
|
||||
|
||||
Load an initial image and a mask image:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
|
||||
from diffusers.utils import load_image
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
init_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"
|
||||
)
|
||||
init_image = init_image.resize((512, 512))
|
||||
|
||||
mask_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
|
||||
)
|
||||
mask_image = mask_image.resize((512, 512))
|
||||
```
|
||||
|
||||
Create a function to prepare the control image from the initial and mask images. This'll create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold.
|
||||
|
||||
```py
|
||||
def make_inpaint_condition(image, image_mask):
|
||||
image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
|
||||
image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
|
||||
|
||||
assert image.shape[0:1] == image_mask.shape[0:1]
|
||||
image[image_mask > 0.5] = 1.0 # set as masked pixel
|
||||
image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
|
||||
image = torch.from_numpy(image)
|
||||
return image
|
||||
|
||||
control_image = make_inpaint_condition(init_image, mask_image)
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">mask image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Load a ControlNet model conditioned on inpainting and pass it to the [`StableDiffusionControlNetInpaintPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to speed up inference and reduce memory usage.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
|
||||
).to("cuda")
|
||||
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
Now pass your prompt, initial image, mask image, and control image to the pipeline:
|
||||
|
||||
```py
|
||||
output = pipe(
|
||||
"corgi face with large ears, detailed, pixar, animated, disney",
|
||||
num_inference_steps=20,
|
||||
eta=1.0,
|
||||
image=init_image,
|
||||
mask_image=mask_image,
|
||||
control_image=control_image,
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-result.png"/>
|
||||
</div>
|
||||
|
||||
## Guess mode
|
||||
|
||||
[Guess mode](https://github.com/lllyasviel/ControlNet/discussions/188) does not require supplying a prompt to a ControlNet at all! This forces the ControlNet encoder to do its best to "guess" the contents of the input control map (depth map, pose estimation, canny edge, etc.).
|
||||
|
||||
Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest `DownBlock` corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the `MidBlock` output becomes 1.0.
|
||||
|
||||
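A quick sketch of that schedule (assuming 12 down-block residuals plus the mid-block, and that the scales follow a log-spaced ramp from 0.1 to 1.0 as described above; the exact count and spacing are an assumption for illustration):

```py
import torch

# log-spaced ramp: ~0.1 for the shallowest DownBlock residual up to 1.0 for the MidBlock
scales = torch.logspace(-1, 0, steps=13)
print(scales[0].item(), scales[-1].item())  # 0.1 ... 1.0
```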
<Tip>
|
||||
|
||||
Guess mode does not have any impact on prompt conditioning and you can still provide a prompt if you want.
|
||||
|
||||
</Tip>
|
||||
|
||||
Set `guess_mode=True` in the pipeline, and it is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) to set the `guidance_scale` value between 3.0 and 5.0.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to(
|
||||
"cuda"
|
||||
)
|
||||
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">regular mode with prompt</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">guess mode without prompt</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## ControlNet with Stable Diffusion XL
|
||||
|
||||
There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so they're easier to run on resource-constrained hardware. You can find these checkpoints on the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization!
|
||||
|
||||
Let's use an SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and preparing the canny image:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
|
||||
from diffusers.utils import load_image
|
||||
from PIL import Image
|
||||
import cv2
|
||||
import numpy as np
import torch
|
||||
|
||||
image = load_image(
|
||||
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
|
||||
)
|
||||
|
||||
image = np.array(image)
|
||||
|
||||
low_threshold = 100
|
||||
high_threshold = 200
|
||||
|
||||
image = cv2.Canny(image, low_threshold, high_threshold)
|
||||
image = image[:, :, None]
|
||||
image = np.concatenate([image, image, image], axis=2)
|
||||
canny_image = Image.fromarray(image)
|
||||
canny_image
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hf-logo-canny.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">canny image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Load an SDXL ControlNet model conditioned on canny edge detection and pass it to the [`StableDiffusionXLControlNetPipeline`]. You can also enable model offloading to reduce memory usage.
|
||||
|
||||
```py
|
||||
controlnet = ControlNetModel.from_pretrained(
|
||||
"diffusers/controlnet-canny-sdxl-1.0",
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True
|
||||
)
|
||||
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
controlnet=controlnet,
|
||||
vae=vae,
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True
|
||||
)
|
||||
pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
Now pass your prompt (and optionally a negative prompt if you're using one) and canny image to the pipeline:
|
||||
|
||||
<Tip>
|
||||
|
||||
The [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter determines how much weight to assign to the conditioning inputs. A value of 0.5 is recommended for good generalization, but feel free to experiment with this number!
|
||||
|
||||
</Tip>
|
||||
|
||||
```py
|
||||
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
|
||||
negative_prompt = "low quality, bad quality, sketches"
|
||||
|
||||
image = pipe(
|
||||
prompt,
|
||||
negative_prompt=negative_prompt,
|
||||
image=image,
|
||||
controlnet_conditioning_scale=0.5,
|
||||
).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img class="rounded-xl" src="https://huggingface.co/diffusers/controlnet-canny-sdxl-1.0/resolve/main/out_hug_lab_7.png"/>
|
||||
</div>
|
||||
|
||||
You can use [`StableDiffusionXLControlNetPipeline`] in guess mode as well by setting the `guess_mode` parameter to `True`:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
|
||||
from diffusers.utils import load_image
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
import cv2
|
||||
from PIL import Image
|
||||
|
||||
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
|
||||
negative_prompt = "low quality, bad quality, sketches"
|
||||
|
||||
image = load_image(
|
||||
"https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
|
||||
)
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained(
|
||||
"diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
|
||||
)
|
||||
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
|
||||
)
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
image = np.array(image)
|
||||
image = cv2.Canny(image, 100, 200)
|
||||
image = image[:, :, None]
|
||||
image = np.concatenate([image, image, image], axis=2)
|
||||
canny_image = Image.fromarray(image)
|
||||
|
||||
image = pipe(
|
||||
prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
|
||||
).images[0]
|
||||
```
|
||||
|
||||
### MultiControlNet
|
||||
|
||||
<Tip>
|
||||
|
||||
Replace the SDXL model with a model like [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) to use multiple conditioning inputs with Stable Diffusion models.
|
||||
|
||||
</Tip>
|
||||
|
||||
You can compose multiple ControlNet conditionings from different image inputs to create a *MultiControlNet*. To get better results, it is often helpful to:
|
||||
|
||||
1. mask conditionings such that they don't overlap (for example, mask the area of a canny image where the pose conditioning is located)
|
||||
2. experiment with the [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet#diffusers.StableDiffusionControlNetPipeline.__call__.controlnet_conditioning_scale) parameter to determine how much weight to assign to each conditioning input
|
||||
|
||||
In this example, you'll combine a canny image and a human pose estimation image to generate a new image.
|
||||
|
||||
Prepare the canny image conditioning:
|
||||
|
||||
```py
|
||||
from diffusers.utils import load_image
|
||||
from PIL import Image
|
||||
import numpy as np
|
||||
import cv2
|
||||
|
||||
canny_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
|
||||
)
|
||||
canny_image = np.array(canny_image)
|
||||
|
||||
low_threshold = 100
|
||||
high_threshold = 200
|
||||
|
||||
canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
|
||||
|
||||
# zero out middle columns of image where pose will be overlaid
|
||||
zero_start = canny_image.shape[1] // 4
|
||||
zero_end = zero_start + canny_image.shape[1] // 2
|
||||
canny_image[:, zero_start:zero_end] = 0
|
||||
|
||||
canny_image = canny_image[:, :, None]
|
||||
canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
|
||||
canny_image = Image.fromarray(canny_image).resize((1024, 1024))
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/landscape_canny_masked.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">canny image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Prepare the human pose estimation conditioning:
|
||||
|
||||
```py
|
||||
from controlnet_aux import OpenposeDetector
|
||||
from diffusers.utils import load_image
|
||||
|
||||
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
|
||||
|
||||
openpose_image = load_image(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
|
||||
)
|
||||
openpose_image = openpose(openpose_image).resize((1024, 1024))
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/person_pose.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">human pose image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Load a list of ControlNet models that correspond to each conditioning, and pass them to the [`StableDiffusionXLControlNetPipeline`]. Use the faster [`UniPCMultistepScheduler`] and enable model offloading to reduce memory usage.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
|
||||
import torch
|
||||
|
||||
controlnets = [
|
||||
ControlNetModel.from_pretrained(
|
||||
"thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
|
||||
),
|
||||
ControlNetModel.from_pretrained(
|
||||
"diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
|
||||
),
|
||||
]
|
||||
|
||||
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
|
||||
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
|
||||
)
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe.enable_model_cpu_offload()
|
||||
```
|
||||
|
||||
Now you can pass your prompt (and a negative prompt if you're using one), the canny image, and the pose image to the pipeline:
|
||||
|
||||
```py
|
||||
prompt = "a giant standing in a fantasy landscape, best quality"
|
||||
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
|
||||
|
||||
generator = torch.manual_seed(1)
|
||||
|
||||
images = [openpose_image, canny_image]
|
||||
|
||||
image = pipe(
|
||||
prompt,
|
||||
image=images,
|
||||
num_inference_steps=25,
|
||||
generator=generator,
|
||||
negative_prompt=negative_prompt,
|
||||
num_images_per_prompt=3,
|
||||
controlnet_conditioning_scale=[1.0, 0.8],
|
||||
).images[0]
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/multicontrolnet.png"/>
|
||||
</div>
|
||||
@@ -1,262 +0,0 @@
|
||||
# DiffEdit
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
Image editing typically requires providing a mask of the area to be edited. DiffEdit automatically generates the mask for you based on a text query, making it easier overall to create a mask without image editing software. The DiffEdit algorithm works in three steps:
|
||||
|
||||
1. the diffusion model denoises an image conditioned on some query text and reference text which produces different noise estimates for different areas of the image; the difference is used to infer a mask to identify which area of the image needs to be changed to match the query text
|
||||
2. the input image is encoded into latent space with DDIM
|
||||
3. the latents are decoded with the diffusion model conditioned on the text query, using the mask as a guide such that pixels outside the mask remain the same as in the input image
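In 🤗 Diffusers, these three steps map onto three calls on the [`StableDiffusionDiffEditPipeline`]. Here's a rough sketch of that flow, using the variable names defined later in this guide (the full, runnable version follows below):

```py
# 1. infer a mask from the difference between source- and target-conditioned noise estimates
mask_image = pipeline.generate_mask(image=raw_image, source_prompt=source_prompt, target_prompt=target_prompt)

# 2. invert the input image into partially noised latents with DDIM inversion
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents

# 3. denoise the latents conditioned on the target prompt, editing only inside the mask
image = pipeline(prompt=target_prompt, mask_image=mask_image, image_latents=inv_latents, negative_prompt=source_prompt).images[0]
```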
|
||||
|
||||
This guide will show you how to use DiffEdit to edit images without manually creating a mask.
|
||||
|
||||
Before you begin, make sure you have the following libraries installed:
|
||||
|
||||
```py
|
||||
# uncomment to install the necessary libraries in Colab
|
||||
#!pip install diffusers transformers accelerate safetensors
|
||||
```
|
||||
|
||||
The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated by the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, which takes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then:
|
||||
|
||||
```py
|
||||
source_prompt = "a bowl of fruits"
|
||||
target_prompt = "a bowl of pears"
|
||||
```
|
||||
|
||||
The partially inverted latents are generated from the [`~StableDiffusionDiffEditPipeline.invert`] function, and it is generally a good idea to include a `prompt` or *caption* describing the image to help guide the inverse latent sampling process. The caption can often be your `source_prompt`, but feel free to experiment with other text descriptions!
|
||||
|
||||
Let's load the pipeline, scheduler, inverse scheduler, and enable some optimizations to reduce memory usage:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
|
||||
|
||||
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-2-1",
|
||||
torch_dtype=torch.float16,
|
||||
safety_checker=None,
|
||||
use_safetensors=True,
|
||||
)
|
||||
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
|
||||
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_vae_slicing()
|
||||
```
|
||||
|
||||
Load the image to edit:
|
||||
|
||||
```py
|
||||
from diffusers.utils import load_image
|
||||
|
||||
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
|
||||
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
|
||||
```
|
||||
|
||||
Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image:
|
||||
|
||||
```py
|
||||
source_prompt = "a bowl of fruits"
|
||||
target_prompt = "a basket of pears"
|
||||
mask_image = pipeline.generate_mask(
|
||||
image=raw_image,
|
||||
source_prompt=source_prompt,
|
||||
target_prompt=target_prompt,
|
||||
)
|
||||
```
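If you'd like to inspect the mask before moving on, `generate_mask` returns it as a NumPy array by default, which you can convert to a PIL image (a quick, illustrative visualization):

```py
from PIL import Image

# scale the binary mask to 0-255 and view it as a grayscale image
Image.fromarray((mask_image.squeeze() * 255).astype("uint8"), "L").resize((768, 768)).save("mask.png")
```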
|
||||
|
||||
Next, create the inverted latents by passing the [`~StableDiffusionDiffEditPipeline.invert`] function a caption describing the image:
|
||||
|
||||
```py
|
||||
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents
|
||||
```
|
||||
|
||||
Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`:
|
||||
|
||||
```py
|
||||
image = pipeline(
|
||||
prompt=target_prompt,
|
||||
mask_image=mask_image,
|
||||
image_latents=inv_latents,
|
||||
negative_prompt=source_prompt,
|
||||
).images[0]
|
||||
image.save("edited_image.png")
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/assets/target.png?raw=true"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">edited image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Generate source and target embeddings
|
||||
|
||||
The source and target embeddings can be automatically generated with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model instead of creating them manually.
|
||||
|
||||
Load the Flan-T5 model and tokenizer from the 🤗 Transformers library:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
|
||||
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
|
||||
```
|
||||
|
||||
Provide some initial text to prompt the model to generate the source and target prompts.
|
||||
|
||||
```py
|
||||
source_concept = "bowl"
|
||||
target_concept = "basket"
|
||||
|
||||
source_text = f"Provide a caption for images containing a {source_concept}. "
|
||||
"The captions should be in English and should be no longer than 150 characters."
|
||||
|
||||
target_text = f"Provide a caption for images containing a {target_concept}. "
|
||||
"The captions should be in English and should be no longer than 150 characters."
|
||||
```
|
||||
|
||||
Next, create a utility function to generate the prompts:
|
||||
|
||||
```py
|
||||
@torch.no_grad()
|
||||
def generate_prompts(input_prompt):
|
||||
input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
|
||||
|
||||
outputs = model.generate(
|
||||
input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
|
||||
)
|
||||
return tokenizer.batch_decode(outputs, skip_special_tokens=True)
|
||||
|
||||
source_prompts = generate_prompts(source_text)
|
||||
target_prompts = generate_prompts(target_text)
|
||||
print(source_prompts)
|
||||
print(target_prompts)
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
Check out the [generation strategy](https://huggingface.co/docs/transformers/main/en/generation_strategies) guide if you're interested in learning more about strategies for generating different quality text.
|
||||
|
||||
</Tip>
|
||||
|
||||
Load the text encoder model used by the [`StableDiffusionDiffEditPipeline`] to encode the text. You'll use the text encoder to compute the text embeddings:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import StableDiffusionDiffEditPipeline
|
||||
|
||||
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_vae_slicing()
|
||||
|
||||
@torch.no_grad()
|
||||
def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
|
||||
embeddings = []
|
||||
for sent in sentences:
|
||||
text_inputs = tokenizer(
|
||||
sent,
|
||||
padding="max_length",
|
||||
max_length=tokenizer.model_max_length,
|
||||
truncation=True,
|
||||
return_tensors="pt",
|
||||
)
|
||||
text_input_ids = text_inputs.input_ids
|
||||
prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
|
||||
embeddings.append(prompt_embeds)
|
||||
return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)
|
||||
|
||||
source_embeds = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
|
||||
target_embeds = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder)
|
||||
```
|
||||
|
||||
Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_mask`] and [`~StableDiffusionDiffEditPipeline.invert`] functions, and to the pipeline to generate the image:
|
||||
|
||||
```diff
|
||||
from diffusers import DDIMInverseScheduler, DDIMScheduler
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
|
||||
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
|
||||
|
||||
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
|
||||
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
|
||||
|
||||
|
||||
mask_image = pipeline.generate_mask(
|
||||
image=raw_image,
|
||||
+ source_prompt_embeds=source_embeds,
|
||||
+ target_prompt_embeds=target_embeds,
|
||||
)
|
||||
|
||||
inv_latents = pipeline.invert(
|
||||
+ prompt_embeds=source_embeds,
|
||||
image=raw_image,
|
||||
).latents
|
||||
|
||||
images = pipeline(
|
||||
mask_image=mask_image,
|
||||
image_latents=inv_latents,
|
||||
+ prompt_embeds=target_embeds,
|
||||
+ negative_prompt_embeds=source_embeds,
|
||||
).images
|
||||
images[0].save("edited_image.png")
|
||||
```
|
||||
|
||||
## Generate a caption for inversion
|
||||
|
||||
While you can use the `source_prompt` as a caption to help generate the partially inverted latents, you can also use the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model to automatically generate a caption.
|
||||
|
||||
Load the BLIP model and processor from the 🤗 Transformers library:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import BlipForConditionalGeneration, BlipProcessor
|
||||
|
||||
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
|
||||
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16, low_cpu_mem_usage=True)
|
||||
```
|
||||
|
||||
Create a utility function to generate a caption from the input image:
|
||||
|
||||
```py
|
||||
@torch.no_grad()
|
||||
def generate_caption(images, caption_generator, caption_processor):
|
||||
text = "a photograph of"
|
||||
|
||||
inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
|
||||
caption_generator.to("cuda")
|
||||
outputs = caption_generator.generate(**inputs, max_new_tokens=128)
|
||||
|
||||
# offload caption generator
|
||||
caption_generator.to("cpu")
|
||||
|
||||
caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
|
||||
return caption
|
||||
```
|
||||
|
||||
Load an input image and generate a caption for it using the `generate_caption` function:
|
||||
|
||||
```py
|
||||
from diffusers.utils import load_image
|
||||
|
||||
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
|
||||
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
|
||||
caption = generate_caption(raw_image, model, processor)
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<figure>
|
||||
<img class="rounded-xl" src="https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"/>
|
||||
<figcaption class="text-center">generated caption: "a photograph of a bowl of fruit on a table"</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
|
||||
Now you can drop the caption into the [`~StableDiffusionDiffEditPipeline.invert`] function to generate the partially inverted latents!
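Concretely, that just means using the generated caption as the `prompt` (assuming the `pipeline` and `raw_image` from earlier in this guide):

```py
inv_latents = pipeline.invert(prompt=caption, image=raw_image).latents
```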
|
||||
@@ -1,123 +0,0 @@
|
||||
# Improve generation quality with FreeU
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
The UNet is responsible for denoising during the reverse diffusion process, and there are two distinct features in its architecture:
|
||||
|
||||
1. Backbone features primarily contribute to the denoising process
|
||||
2. Skip features mainly introduce high-frequency features into the decoder module and can make the network overlook the semantics in the backbone features
|
||||
|
||||
However, the skip connection can sometimes introduce unnatural image details. [FreeU](https://hf.co/papers/2309.11497) is a technique for improving image quality by rebalancing the contributions from the UNet’s skip connections and backbone feature maps.
|
||||
|
||||
FreeU is applied during inference and it does not require any additional training. The technique works for different tasks such as text-to-image, image-to-image, and text-to-video.
|
||||
|
||||
In this guide, you will apply FreeU to the [`StableDiffusionPipeline`], [`StableDiffusionXLPipeline`], and [`TextToVideoSDPipeline`].
|
||||
|
||||
## StableDiffusionPipeline
|
||||
|
||||
Load the pipeline:
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipeline = DiffusionPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, safety_checker=None
|
||||
).to("cuda")
|
||||
```
|
||||
|
||||
Then enable the FreeU mechanism with the FreeU-specific hyperparameters. These values are scaling factors for the backbone and skip features.
|
||||
|
||||
```py
|
||||
pipeline.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
|
||||
```
|
||||
|
||||
The values above are from the official FreeU [code repository](https://github.com/ChenyangSi/FreeU) where you can also find [reference hyperparameters](https://github.com/ChenyangSi/FreeU#range-for-more-parameters) for different models.
|
||||
|
||||
<Tip>
|
||||
|
||||
Disable the FreeU mechanism by calling `disable_freeu()` on a pipeline.
|
||||
|
||||
</Tip>
|
||||
|
||||
And then run inference:
|
||||
|
||||
```py
|
||||
prompt = "A squirrel eating a burger"
|
||||
seed = 2023
|
||||
image = pipeline(prompt, generator=torch.manual_seed(seed)).images[0]
|
||||
```
|
||||
|
||||
The figure below compares non-FreeU and FreeU results respectively for the same hyperparameters used above (`prompt` and `seed`):
|
||||
|
||||

|
||||
|
||||
|
||||
Let's see how Stable Diffusion 2 results are impacted:
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipeline = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, safety_checker=None
|
||||
).to("cuda")
|
||||
|
||||
prompt = "A squirrel eating a burger"
|
||||
seed = 2023
|
||||
|
||||
pipeline.enable_freeu(s1=0.9, s2=0.2, b1=1.1, b2=1.2)
|
||||
image = pipeline(prompt, generator=torch.manual_seed(seed)).images[0]
|
||||
```
|
||||
|
||||
|
||||

|
||||
|
||||
## Stable Diffusion XL
|
||||
|
||||
Finally, let's take a look at how FreeU affects Stable Diffusion XL results:
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipeline = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16,
|
||||
).to("cuda")
|
||||
|
||||
prompt = "A squirrel eating a burger"
|
||||
seed = 2023
|
||||
|
||||
# Comes from
|
||||
# https://wandb.ai/nasirk24/UNET-FreeU-SDXL/reports/FreeU-SDXL-Optimal-Parameters--Vmlldzo1NDg4NTUw
|
||||
pipeline.enable_freeu(s1=0.6, s2=0.4, b1=1.1, b2=1.2)
|
||||
image = pipeline(prompt, generator=torch.manual_seed(seed)).images[0]
|
||||
```
|
||||
|
||||
|
||||

|
||||
|
||||
## Text-to-video generation
|
||||
|
||||
FreeU can also be used to improve video quality:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
from diffusers.utils import export_to_video
|
||||
import torch
|
||||
|
||||
model_id = "cerspense/zeroscope_v2_576w"
|
||||
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16).to("cuda")
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
prompt = "an astronaut riding a horse on mars"
|
||||
seed = 2023
|
||||
|
||||
# The values come from
|
||||
# https://github.com/lyn-rgb/FreeU_Diffusers#video-pipelines
|
||||
pipe.enable_freeu(b1=1.2, b2=1.4, s1=0.9, s2=0.2)
|
||||
video_frames = pipe(prompt, height=320, width=576, num_frames=30, generator=torch.manual_seed(seed)).frames
|
||||
export_to_video(video_frames, "astronaut_rides_horse.mp4")
|
||||
```
|
||||
|
||||
Thanks to [kadirnar](https://github.com/kadirnar/) for helping to integrate the feature, and to [justindujardin](https://github.com/justindujardin) for the helpful discussions.
|
||||
@@ -10,597 +10,91 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Image-to-image
|
||||
# Text-guided image-to-image generation
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
Image-to-image is similar to [text-to-image](conditional_image_generation), but in addition to a prompt, you can also pass an initial image as a starting point for the diffusion process. The initial image is encoded to latent space and noise is added to it. Then the latent diffusion model takes a prompt and the noisy latent image, predicts the added noise, and removes the predicted noise from the initial latent image to get the new latent image. Lastly, a decoder decodes the new latent image back into an image.
|
||||
The [`StableDiffusionImg2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images.
|
||||
|
||||
With 🤗 Diffusers, this is as easy as 1-2-3:
|
||||
|
||||
Before you begin, make sure you have all the necessary libraries installed:

1. Load a checkpoint into the [`AutoPipelineForImage2Image`] class; this pipeline automatically handles loading the correct pipeline class based on the checkpoint:

```py
# uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers ftfy accelerate

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()
```
|
||||
|
||||
Get started by creating a [`StableDiffusionImg2ImgPipeline`] with a pretrained Stable Diffusion model like [`nitrosocke/Ghibli-Diffusion`](https://huggingface.co/nitrosocke/Ghibli-Diffusion).
|
||||
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import StableDiffusionImg2ImgPipeline
|
||||
|
||||
device = "cuda"
|
||||
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
|
||||
"nitrosocke/Ghibli-Diffusion", torch_dtype=torch.float16, use_safetensors=True
|
||||
).to(device)
|
||||
```
|
||||
|
||||
Download and preprocess an initial image so you can pass it to the pipeline:
|
||||
|
||||
```python
|
||||
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
|
||||
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
init_image.thumbnail((768, 768))
|
||||
init_image
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/image_2_image_using_diffusers_cell_8_output_0.jpeg"/>
|
||||
</div>
|
||||
|
||||
<Tip>
|
||||
|
||||
You'll notice throughout the guide, we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`], to save memory and increase inference speed. If you're using PyTorch 2.0, then you don't need to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention).
|
||||
💡 `strength` is a value between 0.0 and 1.0 that controls the amount of noise added to the input image. Values that approach 1.0 allow for lots of variations but will also produce images that are not semantically consistent with the input.
|
||||
|
||||
</Tip>
|
||||
|
||||
2. Load an image to pass to the pipeline:
|
||||
Define the prompt (for this checkpoint finetuned on Ghibli-style art, you need to prefix the prompt with the `ghibli style` tokens) and run the pipeline:
|
||||
|
||||
```py
|
||||
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
|
||||
```
|
||||
|
||||
3. Pass a prompt and image to the pipeline to generate an image:
|
||||
|
||||
```py
|
||||
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
|
||||
image = pipeline(prompt, image=init_image).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Popular models
|
||||
|
||||
The most popular image-to-image models are [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder). The results from the Stable Diffusion and Kandinsky models vary due to their architecture differences and training process; you can generally expect SDXL to produce higher quality images than Stable Diffusion v1.5. Let's take a quick look at how to use each of these models and compare their results.
|
||||
|
||||
### Stable Diffusion v1.5
|
||||
|
||||
Stable Diffusion v1.5 is a latent diffusion model initialized from an earlier checkpoint, and further finetuned for 595K steps on 512x512 images. To use this pipeline for image-to-image, you'll need to prepare an initial image to pass to the pipeline. Then you can pass a prompt and the image to the pipeline to generate a new image:
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image = pipeline(prompt, image=init_image).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdv1.5.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
### Stable Diffusion XL (SDXL)
|
||||
|
||||
SDXL is a more powerful version of the Stable Diffusion model. It uses a larger base model, and an additional refiner model to increase the quality of the base model's output. Read the [SDXL](sdxl) guide for a more detailed walkthrough of how to use this model, and other techniques it uses to produce high quality images.
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image = pipeline(prompt, image=init_image, strength=0.5).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
### Kandinsky 2.2
|
||||
|
||||
The Kandinsky model is different from the Stable Diffusion models because it uses an image prior model to create image embeddings. The embeddings help create a better alignment between text and images, allowing the latent diffusion model to generate better images.
|
||||
|
||||
The simplest way to use Kandinsky 2.2 is:
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image = pipeline(prompt, image=init_image).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-kandinsky.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Configure pipeline parameters
|
||||
|
||||
There are several important parameters you can configure in the pipeline that'll affect the image generation process and image quality. Let's take a closer look at what these parameters do and how changing them affects the output.
|
||||
|
||||
### Strength
|
||||
|
||||
`strength` is one of the most important parameters to consider and it'll have a huge impact on your generated image. It determines how much the generated image resembles the initial image. In other words:
|
||||
|
||||
- 📈 a higher `strength` value gives the model more "creativity" to generate an image that's different from the initial image; a `strength` value of 1.0 means the initial image is more or less ignored
|
||||
- 📉 a lower `strength` value means the generated image is more similar to the initial image
|
||||
|
||||
The `strength` and `num_inference_steps` parameters are related because `strength` determines the number of noise steps to add. For example, if `num_inference_steps` is 50 and `strength` is 0.8, then 40 (50 * 0.8) steps of noise are added to the initial image, which is then denoised for 40 steps to produce the newly generated image.
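As a quick sketch of that bookkeeping (not the pipeline's internal code):

```py
num_inference_steps = 50
strength = 0.8

# number of denoising steps actually run on the noised initial image
denoising_steps = int(num_inference_steps * strength)
print(denoising_steps)  # 40
```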
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image = pipeline(prompt, image=init_image, strength=0.8).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex flex-row gap-4">
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-strength-0.4.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">strength = 0.4</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-strength-0.6.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">strength = 0.6</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-strength-1.0.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">strength = 1.0</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
### Guidance scale
|
||||
|
||||
The `guidance_scale` parameter is used to control how closely aligned the generated image and text prompt are. A higher `guidance_scale` value means your generated image is more aligned with the prompt, while a lower `guidance_scale` value means your generated image has more space to deviate from the prompt.
|
||||
|
||||
You can combine `guidance_scale` with `strength` for even more precise control over how expressive the model is. For example, combine a high `strength + guidance_scale` for maximum creativity or use a combination of low `strength` and low `guidance_scale` to generate an image that resembles the initial image but is not as strictly bound to the prompt.
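For example, assuming a `pipeline`, `prompt`, and `init_image` set up as in the snippet below, you might start from values like these (purely illustrative):

```py
# lots of freedom: high strength plus high guidance_scale
image = pipeline(prompt, image=init_image, strength=0.9, guidance_scale=12.0).images[0]

# stay close to the initial image and follow the prompt only loosely
image = pipeline(prompt, image=init_image, strength=0.3, guidance_scale=4.0).images[0]
```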
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image = pipeline(prompt, image=init_image, guidance_scale=8.0).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex flex-row gap-4">
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-guidance-0.1.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 0.1</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-guidance-3.0.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 5.0</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-guidance-7.5.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 10.0</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
### Negative prompt
|
||||
|
||||
A negative prompt conditions the model to *not* include things in an image, and it can be used to improve image quality or modify an image. For example, you can improve image quality by including negative prompts like "poor details" or "blurry" to encourage the model to generate a higher quality image. Or you can modify an image by specifying things to exclude from an image.
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image = pipeline(prompt, negative_prompt=negative_prompt, image=init_image).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex flex-row gap-4">
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-negative-1.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">negative prompt = "ugly, deformed, disfigured, poor details, bad anatomy"</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-negative-2.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">negative prompt = "jungle"</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Chained image-to-image pipelines
|
||||
|
||||
There are some other interesting ways you can use an image-to-image pipeline aside from just generating an image (although that is pretty cool too). You can take it a step further and chain it with other pipelines.
|
||||
|
||||
### Text-to-image-to-image
|
||||
|
||||
Chaining a text-to-image and image-to-image pipeline allows you to generate an image from text and use the generated image as the initial image for the image-to-image pipeline. This is useful if you want to generate an image entirely from scratch. For example, let's chain a Stable Diffusion and a Kandinsky model.
|
||||
|
||||
Start by generating an image with the text-to-image pipeline:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForText2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
|
||||
```
|
||||
|
||||
Now you can pass this generated image to the image-to-image pipeline:
|
||||
|
||||
```py
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=image).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
### Image-to-image-to-image
|
||||
|
||||
You can also chain multiple image-to-image pipelines together to create more interesting images. This can be useful for iteratively performing style transfer on an image, generating short GIFs, restoring color to an image, or restoring missing areas of an image.
|
||||
|
||||
Start by generating an image:
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image = pipeline(prompt, image=init_image, output_type="latent").images[0]
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in latent space to avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE.
|
||||
|
||||
</Tip>
|
||||
|
||||
Pass the latent output from this pipeline to the next pipeline to generate an image in a [comic book art style](https://huggingface.co/ogkalu/Comic-Diffusion):
|
||||
|
||||
```py
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"ogkalu/Comic-Diffusion", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# need to include the token "charliebo artstyle" in the prompt to use this checkpoint
|
||||
image = pipeline("Astronaut in a jungle, charliebo artstyle", image=image, output_type="latent").images[0]
|
||||
```
|
||||
|
||||
Repeat one more time to generate the final image in a [pixel art style](https://huggingface.co/kohbanye/pixel-art-style):
|
||||
|
||||
```py
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"kohbanye/pixel-art-style", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# need to include the token "pixelartstyle" in the prompt to use this checkpoint
|
||||
image = pipeline("Astronaut in a jungle, pixelartstyle", image=image).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
### Image-to-upscaler-to-super-resolution
|
||||
|
||||
Another way you can chain your image-to-image pipeline is with an upscaler and super-resolution pipeline to really increase the level of detail in an image.
|
||||
|
||||
Start with an image-to-image pipeline:
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
# pass prompt and image to pipeline
|
||||
image_1 = pipeline(prompt, image=init_image, output_type="latent").images[0]
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in *latent* space to avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE.
|
||||
|
||||
</Tip>
|
||||
|
||||
Chain it to an upscaler pipeline to increase the image resolution:
|
||||
|
||||
```py
|
||||
upscaler = AutoPipelineForImage2Image.from_pretrained(
|
||||
"stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
upscaler.enable_model_cpu_offload()
|
||||
upscaler.enable_xformers_memory_efficient_attention()
|
||||
|
||||
image_2 = upscaler(prompt, image=image_1, output_type="latent").images[0]
|
||||
```
|
||||
|
||||
Finally, chain it to a super-resolution pipeline to further enhance the resolution:
|
||||
|
||||
```py
|
||||
super_res = AutoPipelineForImage2Image.from_pretrained(
|
||||
"stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
super_res.enable_model_cpu_offload()
|
||||
super_res.enable_xformers_memory_efficient_attention()
|
||||
|
||||
image_3 = super_res(prompt, image=image_2).images[0]
|
||||
image_3
|
||||
```
|
||||
|
||||
## Control image generation
|
||||
|
||||
Trying to generate an image that looks exactly the way you want can be difficult, which is why controlled generation techniques and models are so useful. While you can use the `negative_prompt` to partially control image generation, there are more robust methods like prompt weighting and ControlNets.
|
||||
|
||||
### Prompt weighting
|
||||
|
||||
Prompt weighting allows you to scale the representation of each concept in a prompt. For example, in a prompt like "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", you can choose to increase or decrease the embeddings of "astronaut" and "jungle". The [Compel](https://github.com/damian0815/compel) library provides a simple syntax for adjusting prompt weights and generating the embeddings. You can learn how to create the embeddings in the [Prompt weighting](weighted_prompts) guide.
|
||||
|
||||
[`AutoPipelineForImage2Image`] has a `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter where you can pass the embeddings, which replace the `prompt` parameter.
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
import torch
|
||||
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
image = pipeline(
    prompt_embeds=prompt_embeds,  # generated from Compel
    negative_prompt_embeds=negative_prompt_embeds,  # generated from Compel
    image=init_image,
).images[0]
|
||||
```
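The embeddings themselves can be created with Compel along these lines (a sketch; the prompt text and `++` weighting are illustrative, and `pipeline` is the pipeline loaded above):

```py
from compel import Compel

# Compel wraps the pipeline's tokenizer and text encoder to build weighted embeddings
compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)

# "++" increases the weight of the word it follows
prompt_embeds = compel("Astronaut in a jungle++, cold color palette, muted colors, detailed, 8k")
negative_prompt_embeds = compel("ugly, deformed, disfigured")
```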
|
||||
|
||||
### ControlNet
|
||||
|
||||
ControlNets provide a more flexible and accurate way to control image generation because you can use an additional conditioning image. The conditioning image can be a canny image, depth map, image segmentation, and even scribbles! Whatever type of conditioning image you choose, the ControlNet generates an image that preserves the information in it.
|
||||
|
||||
For example, let's condition an image with a depth map to keep the spatial information in the image.
|
||||
|
||||
```py
|
||||
# prepare image
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
init_image = init_image.resize((958, 960)) # resize to depth image dimensions
|
||||
depth_image = load_image("https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png")
|
||||
```
|
||||
|
||||
Load a ControlNet model conditioned on depth maps and the [`AutoPipelineForImage2Image`]:
|
||||
|
||||
```py
|
||||
from diffusers import ControlNetModel, AutoPipelineForImage2Image
|
||||
from diffusers.utils import load_image
|
||||
import torch
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
```
|
||||
|
||||
Now generate a new image conditioned on the depth map, initial image, and prompt:
|
||||
|
||||
```py
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
image = pipeline(prompt, image=init_image, control_image=depth_image).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex flex-row gap-4">
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">depth image</figcaption>
|
||||
</div>
|
||||
<div class="flex-1">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-controlnet.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">ControlNet image</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
Let's apply a new [style](https://huggingface.co/nitrosocke/elden-ring-diffusion) to the image generated from the ControlNet by chaining it with an image-to-image pipeline:
|
||||
|
||||
```py
|
||||
pipeline = AutoPipelineForImage2Image.from_pretrained(
|
||||
"nitrosocke/elden-ring-diffusion", torch_dtype=torch.float16,
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
|
||||
prompt = "elden ring style astronaut in a jungle" # include the token "elden ring style" in the prompt
|
||||
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
|
||||
|
||||
image = pipeline(prompt, negative_prompt=negative_prompt, image=init_image, strength=0.45, guidance_scale=10.5).images[0]
```

```python
prompt = "ghibli style, a fantasy landscape with castles"

generator = torch.Generator(device=device).manual_seed(1024)
image = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5, generator=generator).images[0]
image
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-elden-ring.png">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ghibli-castles.png"/>
|
||||
</div>
|
||||
|
||||
## Optimize
|
||||
Running diffusion models is computationally expensive and intensive, but with a few optimization tricks, it is entirely possible to run them on consumer and free-tier GPUs. For example, you can use a more memory-efficient form of attention such as PyTorch 2.0's [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention) or [xFormers](../optimization/xformers) (you can use one or the other, but there's no need to use both). You can also offload the model to the GPU while the other pipeline components wait on the CPU.

```diff
+ pipeline.enable_model_cpu_offload()
+ pipeline.enable_xformers_memory_efficient_attention()
```

You can also try experimenting with a different scheduler to see how that affects the output:

```python
from diffusers import LMSDiscreteScheduler

lms = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.scheduler = lms

generator = torch.Generator(device=device).manual_seed(1024)
image = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5, generator=generator).images[0]
image
```
|
||||
|
||||
<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lms-ghibli.png"/>
</div>

With [`torch.compile`](../optimization/torch2.0#torch.compile), you can boost your inference speed even more by wrapping your UNet with it:
|
||||
|
||||
```py
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
```
|
||||
To learn more, take a look at the [Reduce memory usage](../optimization/memory) and [Torch 2.0](../optimization/torch2.0) guides.

Feel free to also switch the scheduler to the [`LMSDiscreteScheduler`] and see how that affects the output.

Check out the Space below, and try generating images with different values for `strength`. You'll notice that using lower values for `strength` produces images that are more similar to the original image.
|
||||
|
||||
<iframe
|
||||
src="https://stevhliu-ghibli-img2img.hf.space"
|
||||
frameborder="0"
|
||||
width="850"
|
||||
height="500"
|
||||
></iframe>
|
||||
|
||||
@@ -10,583 +10,68 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Inpainting
|
||||
# Text-guided image-inpainting
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
Inpainting replaces or edits specific areas of an image. This makes it a useful tool for image restoration like removing defects and artifacts, or even replacing an image area with something entirely new. Inpainting relies on a mask to determine which regions of an image to fill in; the area to inpaint is represented by white pixels and the area to keep is represented by black pixels. The white pixels are filled in by the prompt.
|
||||
The [`StableDiffusionInpaintPipeline`] allows you to edit specific parts of an image by providing a mask and a text prompt. It uses a version of Stable Diffusion, like [`runwayml/stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting) specifically trained for inpainting tasks.
|
||||
|
||||
With 🤗 Diffusers, here is how you can do inpainting:
|
||||
Get started by loading an instance of the [`StableDiffusionInpaintPipeline`]:
|
||||
|
||||
1. Load an inpainting checkpoint with the [`AutoPipelineForInpainting`] class. This'll automatically detect the appropriate pipeline class to load based on the checkpoint:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import AutoPipelineForInpainting
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipeline = AutoPipelineForInpainting.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
pipeline.enable_model_cpu_offload()
|
||||
pipeline.enable_xformers_memory_efficient_attention()
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
You'll notice throughout the guide, we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`], to save memory and increase inference speed. If you're using PyTorch 2.0, it's not necessary to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention).
|
||||
|
||||
</Tip>
|
||||
|
||||
2. Load the base and mask images:
|
||||
|
||||
```py
|
||||
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
|
||||
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")
|
||||
```
|
||||
|
||||
3. Create a prompt to inpaint the image with and pass it to the pipeline with the base and mask images:
|
||||
|
||||
```py
|
||||
prompt = "a black cat with glowing eyes, cute, adorable, disney, pixar, highly detailed, 8k"
|
||||
negative_prompt = "bad anatomy, deformed, ugly, disfigured"
|
||||
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image).images[0]
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">base image</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-cat.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
  </div>
</div>
## Create a mask image

Throughout this guide, the mask image is provided in all of the code examples for convenience. You can inpaint on your own images, but you'll need to create a mask image for them. Use the Space below to easily create a mask image.

Upload a base image to inpaint on and use the sketch tool to draw a mask. Once you're done, click **Run** to generate and download the mask image.

<iframe
	src="https://stevhliu-inpaint-mask-maker.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>
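If you'd rather create the mask programmatically, here is a minimal sketch with PIL (the file name and rectangle coordinates are placeholders you'd adjust for your own image):

```py
from PIL import Image, ImageDraw

# start from an all-black mask the same size as the base image (black = keep)
base_image = Image.open("my_image.png")  # hypothetical local file
mask = Image.new("L", base_image.size, 0)

# paint the region you want to inpaint in white (white = filled in by the prompt)
draw = ImageDraw.Draw(mask)
draw.rectangle((100, 100, 400, 400), fill=255)

mask.save("mask.png")
```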
## Popular models

[Stable Diffusion Inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting), [Stable Diffusion XL (SDXL) Inpainting](https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1), and [Kandinsky 2.2](https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder-inpaint) are among the most popular models for inpainting. SDXL typically produces higher resolution images than Stable Diffusion v1.5, and Kandinsky 2.2 is also capable of generating high-quality images.

### Stable Diffusion Inpainting

Stable Diffusion Inpainting is a latent diffusion model finetuned on 512x512 images for inpainting. It is a good starting point because it is relatively fast and generates good quality images. To use this model for inpainting, you'll need to pass a prompt, base and mask image to the pipeline:

```py
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

generator = torch.Generator("cuda").manual_seed(92)
prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
```

### Stable Diffusion XL (SDXL) Inpainting

SDXL is a larger and more powerful version of Stable Diffusion v1.5. It can follow a two-stage model process (though each model can also be used alone); the base model generates an image, and a refiner model takes that image and further enhances its details and quality. Take a look at the [SDXL](sdxl) guide for a more comprehensive guide on how to use SDXL and configure its parameters.

```py
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

generator = torch.Generator("cuda").manual_seed(92)
prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
```
### Kandinsky 2.2 Inpainting

The Kandinsky model family is similar to SDXL because it uses two models as well; the image prior model creates image embeddings, and the diffusion model generates images from them. You can load the image prior and diffusion model separately, but the easiest way to use Kandinsky 2.2 is to load it into the [`AutoPipelineForInpainting`] class which uses the [`KandinskyV22InpaintCombinedPipeline`] under the hood.

```py
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

generator = torch.Generator("cuda").manual_seed(92)
prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, generator=generator).images[0]
```

<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">base image</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-sdv1.5.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">Stable Diffusion Inpainting</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-sdxl.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">Stable Diffusion XL Inpainting</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-kandinsky.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">Kandinsky 2.2 Inpainting</figcaption>
  </div>
</div>
## Configure pipeline parameters

Image features - like quality and "creativity" - are dependent on pipeline parameters. Knowing what these parameters do is important for getting the results you want. Let's take a look at the most important parameters and see how changing them affects the output.

### Strength

`strength` is a measure of how much noise is added to the base image, which influences how similar the output is to the base image.

* 📈 a high `strength` value means more noise is added to an image and the denoising process takes longer, but you'll get higher quality images that differ more from the base image
* 📉 a low `strength` value means less noise is added to an image and the denoising process is faster, but the image quality may not be as great and the generated image resembles the base image more

```py
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.6).images[0]
```

<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-strength-0.6.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">strength = 0.6</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-strength-0.8.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">strength = 0.8</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-strength-1.0.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">strength = 1.0</figcaption>
  </div>
</div>
### Guidance scale

`guidance_scale` affects how aligned the text prompt and generated image are.

* 📈 a high `guidance_scale` value means the prompt and generated image are closely aligned, so the output is a stricter interpretation of the prompt
* 📉 a low `guidance_scale` value means the prompt and generated image are more loosely aligned, so the output may be more varied from the prompt

You can use `strength` and `guidance_scale` together for more control over how expressive the model is. For example, combining high `strength` and `guidance_scale` values gives the model the most creative freedom.

```py
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, guidance_scale=2.5).images[0]
```

<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-guidance-2.5.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 2.5</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-guidance-7.5.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 7.5</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-guidance-12.5.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 12.5</figcaption>
  </div>
</div>
### Negative prompt

A negative prompt assumes the opposite role of a prompt; it guides the model away from generating certain things in an image. This is useful for quickly improving image quality and preventing the model from generating things you don't want.

```py
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
negative_prompt = "bad architecture, unstable, poor details, blurry"
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image).images[0]
image
```

<div class="flex justify-center">
  <figure>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-negative.png" />
    <figcaption class="text-center">negative_prompt = "bad architecture, unstable, poor details, blurry"</figcaption>
  </figure>
</div>
## Preserve unmasked areas

The [`AutoPipelineForInpainting`] (and other inpainting pipelines) generally changes the unmasked parts of an image to create a more natural transition between the masked and unmasked region. If this behavior is undesirable, you can force the unmasked area to remain the same. However, forcing the unmasked portion of the image to remain the same may result in some unusual transitions between the unmasked and masked areas.

```py
import PIL
import numpy as np
import torch

from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

device = "cuda"
pipeline = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipeline = pipeline.to(device)
```

Download an image and a mask of a dog which you'll eventually replace:

```python
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))
```

Now you can create a prompt to replace the mask with something else:

```python
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
repainted_image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
repainted_image.save("repainted_image.png")

# Convert mask to grayscale NumPy array
mask_image_arr = np.array(mask_image.convert("L"))
# Add a channel dimension to the end of the grayscale mask
mask_image_arr = mask_image_arr[:, :, None]
# Binarize the mask: 1s correspond to the pixels which are repainted
mask_image_arr = mask_image_arr.astype(np.float32) / 255.0
mask_image_arr[mask_image_arr < 0.5] = 0
mask_image_arr[mask_image_arr >= 0.5] = 1

# Take the masked pixels from the repainted image and the unmasked pixels from the initial image
unmasked_unchanged_image_arr = (1 - mask_image_arr) * init_image + mask_image_arr * repainted_image
unmasked_unchanged_image = PIL.Image.fromarray(unmasked_unchanged_image_arr.round().astype("uint8"))
unmasked_unchanged_image.save("force_unmasked_unchanged.png")
```
## Chained inpainting pipelines

[`AutoPipelineForInpainting`] can be chained with other 🤗 Diffusers pipelines to edit their outputs. This is often useful for improving the output quality from your other diffusion pipelines, and if you're using multiple pipelines, it can be more memory-efficient to chain them together to keep the outputs in latent space and reuse the same pipeline components.

### Text-to-image-to-inpaint

Chaining a text-to-image and inpainting pipeline allows you to inpaint the generated image, and you don't have to provide a base image to begin with. This makes it convenient to edit your favorite text-to-image outputs without having to generate an entirely new image.

Start with the text-to-image pipeline to create a castle:

```py
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForInpainting
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

image = pipeline("concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k").images[0]
```

Load the mask image of the output from above:

```py
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_text-chain-mask.png").convert("RGB")
```

And let's inpaint the masked area with a waterfall:

```py
pipeline = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

prompt = "digital painting of a fantasy waterfall, cloudy"
image = pipeline(prompt=prompt, image=image, mask_image=mask_image).images[0]
image
```

<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-text-chain.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">text-to-image</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-text-chain-out.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">inpaint</figcaption>
  </div>
</div>
### Inpaint-to-image-to-image

You can also chain an inpainting pipeline before another pipeline like image-to-image or an upscaler to improve the quality.

Begin by inpainting an image:

```py
import torch
from diffusers import AutoPipelineForInpainting, AutoPipelineForImage2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]

# resize image to 1024x1024 for SDXL
image = image.resize((1024, 1024))
```

Now let's pass the image to another inpainting pipeline with SDXL's refiner model to enhance the image details and quality:

```py
pipeline = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

image = pipeline(prompt=prompt, image=image, mask_image=mask_image, output_type="latent").images[0]
```

<Tip warning={true}>

It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in latent space to avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE. For example, in the [Text-to-image-to-inpaint](#text-to-image-to-inpaint) section, Kandinsky 2.2 uses a different VAE class than the Stable Diffusion model so it won't work. But if you use Stable Diffusion v1.5 for both pipelines, then you can keep everything in latent space because they both use [`AutoencoderKL`].

</Tip>

Finally, you can pass this image to an image-to-image pipeline to put the finishing touches on it. It is more efficient to use the [`~AutoPipelineForImage2Image.from_pipe`] method to reuse the existing pipeline components, and avoid unnecessarily loading all the pipeline components into memory again.
```py
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline)
pipeline.enable_xformers_memory_efficient_attention()

image = pipeline(prompt=prompt, image=image).images[0]
```

<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-to-image-chain.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">inpaint</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-to-image-final.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">image-to-image</figcaption>
  </div>
</div>

Image-to-image and inpainting are actually very similar tasks. Image-to-image generates a new image that resembles the existing provided image. Inpainting does the same thing, but it only transforms the image area defined by the mask and the rest of the image is unchanged. You can think of inpainting as a more precise tool for making specific changes and image-to-image has a broader scope for making more sweeping changes.
## Control image generation

Getting an image to look exactly the way you want is challenging because the denoising process is random. While you can control certain aspects of generation by configuring parameters like `negative_prompt`, there are better and more efficient methods for controlling image generation.

### Prompt weighting

Prompt weighting provides a quantifiable way to scale the representation of concepts in a prompt. You can use it to increase or decrease the magnitude of the text embedding vector for each concept in the prompt, which subsequently determines how much of each concept is generated. The [Compel](https://github.com/damian0815/compel) library offers an intuitive syntax for scaling the prompt weights and generating the embeddings. Learn how to create the embeddings in the [Prompt weighting](../using-diffusers/weighted_prompts) guide.

Once you've generated the embeddings, pass them to the `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter in the [`AutoPipelineForInpainting`]. The embeddings replace the `prompt` parameter:

```py
import torch
from diffusers import AutoPipelineForInpainting

pipeline = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

image = pipeline(
    prompt_embeds=prompt_embeds,  # generated from Compel
    negative_prompt_embeds=negative_prompt_embeds,  # generated from Compel
    image=init_image,
    mask_image=mask_image,
).images[0]
```
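For reference, a minimal sketch of generating those embeddings with Compel might look like the following (the prompt text and the `++` up-weighting are only illustrative choices):

```py
from compel import Compel

# reuse the tokenizer and text encoder from the inpainting pipeline loaded above
compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)

# "++" increases the weight of "elven castle" in the prompt
prompt_embeds = compel("concept art digital painting of an (elven castle)++, highly detailed, 8k")
negative_prompt_embeds = compel("bad architecture, unstable, poor details, blurry")
```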
### ControlNet

ControlNet models are used with other diffusion models like Stable Diffusion, and they provide an even more flexible and accurate way to control how an image is generated. A ControlNet accepts an additional conditioning image input that guides the diffusion model to preserve the features in it.

For example, let's condition an image with a ControlNet pretrained on inpaint images:

```py
import torch
import numpy as np
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from diffusers.utils import load_image

# load ControlNet
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, variant="fp16")

# pass ControlNet to the pipeline
pipeline = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

# load base and mask image
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png").convert("RGB")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png").convert("RGB")

# prepare control image
def make_inpaint_condition(init_image, mask_image):
    init_image = np.array(init_image.convert("RGB")).astype(np.float32) / 255.0
    mask_image = np.array(mask_image.convert("L")).astype(np.float32) / 255.0

    assert init_image.shape[0:1] == mask_image.shape[0:1], "image and image_mask must have the same image size"
    init_image[mask_image > 0.5] = -1.0  # set as masked pixel
    init_image = np.expand_dims(init_image, 0).transpose(0, 3, 1, 2)
    init_image = torch.from_numpy(init_image)
    return init_image

control_image = make_inpaint_condition(init_image, mask_image)
```

Now generate an image from the base, mask and control images. You'll notice features of the base image are strongly preserved in the generated image.

```py
prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, control_image=control_image).images[0]
image
```
You can take this a step further and chain it with an image-to-image pipeline to apply a new [style](https://huggingface.co/nitrosocke/elden-ring-diffusion):

```py
from diffusers import AutoPipelineForImage2Image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "nitrosocke/elden-ring-diffusion", torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

prompt = "elden ring style castle"  # include the token "elden ring style" in the prompt
negative_prompt = "bad architecture, deformed, disfigured, poor details"

image = pipeline(prompt, negative_prompt=negative_prompt, image=image).images[0]
image
```

<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-controlnet.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">ControlNet inpaint</figcaption>
  </div>
  <div class="flex-1">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint-img2img.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">image-to-image</figcaption>
  </div>
</div>
## Optimize

It can be difficult and slow to run diffusion models if you're resource constrained, but it doesn't have to be with a few optimization tricks. One of the biggest (and easiest) optimizations you can enable is switching to memory-efficient attention. If you're using PyTorch 2.0, [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention) is automatically enabled and you don't need to do anything else. For non-PyTorch 2.0 users, you can install and use [xFormers](../optimization/xformers)'s implementation of memory-efficient attention. Both options reduce memory usage and accelerate inference.

You can also offload the model to the CPU to save even more memory:

```diff
+ pipeline.enable_xformers_memory_efficient_attention()
+ pipeline.enable_model_cpu_offload()
```

To speed-up your inference code even more, use [`torch.compile`](../optimization/torch2.0#torch.compile). You should wrap `torch.compile` around the most intensive component in the pipeline which is typically the UNet:

```py
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
```

Learn more in the [Reduce memory usage](../optimization/memory) and [Torch 2.0](../optimization/torch2.0) guides.

Check out the Space below to try out image inpainting yourself!

<iframe
	src="https://runwayml-stable-diffusion-inpainting.hf.space"
	frameborder="0"
	width="850"
	height="500"
></iframe>
# Overview

A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components.

This section introduces you to some of the more complex pipelines like Stable Diffusion XL, ControlNet, and DiffEdit, which require additional inputs. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to control randomness on your hardware when generating images, and how to create a community pipeline for a custom task like generating images from speech.
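As a minimal sketch of that auto-detection (the checkpoint name here is only an example), loading a checkpoint with the base class is enough to get the matching pipeline:

```python
import torch
from diffusers import DiffusionPipeline

# DiffusionPipeline inspects the checkpoint's config and returns the matching pipeline class
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
print(pipeline.__class__.__name__)  # e.g. StableDiffusionPipeline
```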
## Control randomness

During inference, pipelines rely heavily on random sampling operations which include creating the
Gaussian noise tensors to denoise and adding noise to the scheduling step.

Take a look at the tensor values in the [`DDIMPipeline`] after two inference steps:

```python
import numpy as np
from diffusers import DDIMPipeline

# load a small unconditional image generation pipeline (checkpoint assumed; any DDIM-compatible model works)
ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)

image = ddim(num_inference_steps=2, output_type="np").images
print(np.abs(image).sum())
```

Running the code above prints one value, but if you run it again you get a different value. What is going on here?

Every time the pipeline is run, [`torch.randn`](https://pytorch.org/docs/stable/generated/torch.randn.html) uses a different random seed to create Gaussian noise which is denoised stepwise. This leads to a different result each time it is run, which is great for diffusion pipelines since it generates a different random image each time.
<Tip>

💡 It might be a bit unintuitive at first to pass `Generator` objects to the pipeline instead of
just integer values representing the seed, but this is the recommended design when dealing with
probabilistic models in PyTorch, as `Generator`s are *random states* that can be
passed to multiple pipelines in a sequence.

</Tip>
### GPU

Writing a reproducible pipeline on a GPU is a bit trickier, and full reproducibility across different hardware is not guaranteed because matrix multiplication - which diffusion pipelines require a lot of - is less deterministic on a GPU than a CPU. For example, if you run the same code example above on a GPU:

```python
import torch
import numpy as np
from diffusers import DDIMPipeline

# same pipeline as before (checkpoint assumed), now on the GPU with a seeded generator
ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
print(np.abs(image).sum())
```

The result is not the same even though you're using an identical seed because the GPU uses a different random number generator than the CPU.

To circumvent this problem, 🧨 Diffusers has a [`~diffusers.utils.torch_utils.randn_tensor`] function for creating random noise on the CPU, and then moving the tensor to a GPU if necessary. The `randn_tensor` function is used everywhere inside the pipeline, allowing the user to **always** pass a CPU `Generator` even if the pipeline is run on a GPU.

You'll see the results are much closer now!
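A minimal sketch of that pattern (the checkpoint is assumed, as above) is to create the `Generator` on the CPU and hand it to a pipeline running on the GPU:

```python
import torch
import numpy as np
from diffusers import DDIMPipeline

ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True).to("cuda")

# the random state lives on the CPU, so the initial noise is reproducible across machines
generator = torch.Generator(device="cpu").manual_seed(0)
image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
print(np.abs(image).sum())
```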
<Tip>

💡 If reproducibility is important, we recommend always passing a CPU generator.
The performance loss is often negligible, and you'll generate much more similar
values than if the pipeline had been run on a GPU.

</Tip>

Finally, for more complex pipelines such as [`UnCLIPPipeline`], these are often extremely
susceptible to precision error propagation. Don't expect similar results across
different GPU hardware or PyTorch versions. In this case, you'll need to run
exactly the same hardware and PyTorch version for full reproducibility.

## Deterministic algorithms

You can also configure PyTorch to use deterministic algorithms to create a reproducible pipeline. However, you should be aware that deterministic algorithms may be slower than nondeterministic ones and you may observe a decrease in performance. But if reproducibility is important to you, then this is the way to go!

Nondeterministic behavior occurs when operations are launched in more than one CUDA stream. To avoid this, set the environment variable [`CUBLAS_WORKSPACE_CONFIG`](https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility) to `:16:8` to only use one buffer size during runtime.

PyTorch typically benchmarks multiple algorithms to select the fastest one, but if you want reproducibility, you should disable this feature because the benchmark may select different algorithms each time. Lastly, pass `True` to [`torch.use_deterministic_algorithms`](https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html) to enable deterministic algorithms.
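Put together, a short sketch of those three settings might look like this:

```python
import os
import torch

# use a single cuBLAS workspace size so results don't depend on workspace selection
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

# disable benchmarking so cuDNN doesn't pick a different (possibly nondeterministic) algorithm each run
torch.backends.cudnn.benchmark = False

# raise an error if an operation has no deterministic implementation
torch.use_deterministic_algorithms(True)
```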
# Stable Diffusion XL

[[open-in-colab]]

[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways:

1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters
2. introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped
3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details

This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting.

Before you begin, make sure you have the following libraries installed:

```py
# uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0
```
<Tip warning={true}>

We recommend installing the [invisible-watermark](https://pypi.org/project/invisible-watermark/) library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. To disable the watermarker:

```py
pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)
```

</Tip>

## Load model checkpoints

Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method:

```py
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
).to("cuda")
```
You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally:

```py
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

pipeline = StableDiffusionXLPipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
).to("cuda")
```
## Text-to-image

For text-to-image, pass a text prompt. By default, SDXL generates a 1024x1024 image for the best results. You can try setting the `height` and `width` parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work.

```py
from diffusers import AutoPipelineForText2Image
import torch

pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline_text2image(prompt=prompt).images[0]
```

<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" alt="generated image of an astronaut in a jungle"/>
</div>
## Image-to-image

For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with:

```py
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

# use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png"

init_image = load_image(url).convert("RGB")
prompt = "a dog catching a frisbee in the jungle"
image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0]
```

<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png" alt="generated image of a dog catching a frisbee in a jungle"/>
</div>
## Inpainting

For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with.

```py
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

# use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda")

img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A deep sea diver floating"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0]
```

<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint.png" alt="generated image of a deep sea diver in a jungle"/>
</div>
## Refine image quality

SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner:

1. use the base and refiner model together to produce a refined image
2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained)

### Base + refiner model

When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise.

As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model:

```py
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")
```
To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter.

<Tip>

The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff.

</Tip>

Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image.

```py
prompt = "A majestic lion jumping from a big stone at night"

image = base(
    prompt=prompt,
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    num_inference_steps=40,
    denoising_start=0.8,
    image=image,
).images[0]
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png" alt="generated image of a lion on a rock at night" />
    <figcaption class="mt-2 text-center text-sm text-gray-500">base model</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png" alt="generated image of a lion on a rock at night in higher quality" />
    <figcaption class="mt-2 text-center text-sm text-gray-500">ensemble of expert denoisers</figcaption>
  </div>
</div>
The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]:

```py
import torch
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

base = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"
num_inference_steps = 75
high_noise_frac = 0.7

image = base(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_start=high_noise_frac,
).images[0]
```

This ensemble of expert denoisers method works well for all available schedulers!
### Base to refiner model

SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting.

Load the base and refiner models:

```py
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")
```

Generate an image from the base model, and set the model output to **latent** space:

```py
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = base(prompt=prompt, output_type="latent").images[0]
```

Pass the generated image to the refiner model:

```py
image = refiner(prompt=prompt, image=image[None, :]).images[0]
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/init_image.png" alt="generated image of an astronaut riding a green horse on Mars" />
    <figcaption class="mt-2 text-center text-sm text-gray-500">base model</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/refined_image.png" alt="higher quality generated image of an astronaut riding a green horse on Mars" />
    <figcaption class="mt-2 text-center text-sm text-gray-500">base model + refiner model</figcaption>
  </div>
</div>
For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner.
|
||||
|
||||
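A minimal sketch of that inpainting setup, reusing the image and mask from the ensemble example above (the step counts here are illustrative assumptions, not tuned values):

```py
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image
import torch

base = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")
refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"

# fully denoise with the base model, then refine the result for a few extra steps
image = base(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=75).images[0]
image = refiner(prompt=prompt, image=image, mask_image=mask_image, num_inference_steps=30).images[0]
```
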
## Micro-conditioning

SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images.

<Tip>

You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`].

</Tip>

### Size conditioning

There are two types of size conditioning:

- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset.

- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options (see the sketch below)!

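For example, a minimal sketch that conditions generation on a low `original_size` (the values are only illustrative):

```py
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pretend the training image was a 256x256 picture upscaled to 1024x1024
image = pipe(prompt=prompt, original_size=(256, 256), target_size=(1024, 1024)).images[0]
```
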
🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions:

```py
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(
    prompt=prompt,
    negative_original_size=(512, 512),
    negative_target_size=(1024, 1024),
).images[0]
```

<div class="flex flex-col justify-center">
  <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/negative_conditions.png"/>
  <figcaption class="text-center">Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).</figcaption>
</div>

### Crop conditioning

Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🤗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions!

```py
from diffusers import StableDiffusionXLPipeline
import torch

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0]
```

<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-cropped.png" alt="generated image of an astronaut in a jungle, slightly cropped"/>
</div>

You can also specify negative cropping coordinates to steer generation away from certain cropping parameters:

```py
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(
    prompt=prompt,
    negative_original_size=(512, 512),
    negative_crops_coords_top_left=(0, 0),
    negative_target_size=(1024, 1024),
).images[0]
```

## Use a different prompt for each text-encoder

SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts):

```py
from diffusers import StableDiffusionXLPipeline
import torch

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

# prompt is passed to OAI CLIP-ViT/L-14
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# prompt_2 is passed to OpenCLIP-ViT/bigG-14
prompt_2 = "Van Gogh painting"
image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
```

<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-double-prompt.png" alt="generated image of an astronaut in a jungle in the style of a van gogh painting"/>
</div>

The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section.

## Optimizations

SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference.

1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors:

```diff
- base.to("cuda")
- refiner.to("cuda")
+ base.enable_model_cpu_offload()
+ refiner.enable_model_cpu_offload()
```

2. Use `torch.compile` for a ~20% speed-up (you need `torch>=2.0`):

```diff
+ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
```

3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`:

```diff
+ base.enable_xformers_memory_efficient_attention()
+ refiner.enable_xformers_memory_efficient_attention()
```

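Putting it together, a minimal sketch (assuming a pipeline loaded as in the earlier snippets; pick the option that matches your hardware):

```py
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)

# option 1: trade some speed for a much lower memory footprint
pipe.enable_model_cpu_offload()

# option 2 (torch >= 2.0): keep the pipeline on the GPU and compile the UNet instead
# pipe.to("cuda")
# pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
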
## Other resources

If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🤗 Diffusers.

@@ -1,179 +0,0 @@
|
||||
# Shap-E
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps:
|
||||
|
||||
1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of the implicit functions that represent the asset
|
||||
2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications
|
||||
|
||||
This guide will show you how to use Shap-E to start generating your own 3D assets!
|
||||
|
||||
Before you begin, make sure you have the following libraries installed:
|
||||
|
||||
```py
|
||||
# uncomment to install the necessary libraries in Colab
|
||||
#!pip install diffusers transformers accelerate safetensors trimesh
|
||||
```
|
||||
|
||||
## Text-to-3D
|
||||
|
||||
To generate a gif of a 3D object, pass a text prompt to the [`ShapEPipeline`]. The pipeline generates a list of image frames which are used to create the 3D object.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import ShapEPipeline
|
||||
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
|
||||
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
|
||||
pipe = pipe.to(device)
|
||||
|
||||
guidance_scale = 15.0
|
||||
prompt = ["A firecracker", "A birthday cupcake"]
|
||||
|
||||
images = pipe(
|
||||
prompt,
|
||||
guidance_scale=guidance_scale,
|
||||
num_inference_steps=64,
|
||||
frame_size=256,
|
||||
).images
|
||||
```
|
||||
|
||||
Now use the [`~utils.export_to_gif`] function to turn the list of image frames into a gif of the 3D object.
|
||||
|
||||
```py
|
||||
from diffusers.utils import export_to_gif
|
||||
|
||||
export_to_gif(images[0], "firecracker_3d.gif")
|
||||
export_to_gif(images[1], "cake_3d.gif")
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">firecracker</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">cupcake</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Image-to-3D
|
||||
|
||||
To generate a 3D object from another image, use the [`ShapEImg2ImgPipeline`]. You can use an existing image or generate an entirely new one. Let's use the [Kandinsky 2.1](../api/pipelines/kandinsky) model to generate a new image.
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
prior_pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
|
||||
pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
|
||||
|
||||
prompt = "A cheeseburger, white background"
|
||||
|
||||
image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple()
|
||||
image = pipeline(
|
||||
prompt,
|
||||
image_embeds=image_embeds,
|
||||
negative_image_embeds=negative_image_embeds,
|
||||
).images[0]
|
||||
|
||||
image.save("burger.png")
|
||||
```
|
||||
|
||||
Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D representation of it.
|
||||
|
||||
```py
|
||||
from PIL import Image
|
||||
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif
import torch
|
||||
|
||||
pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda")
|
||||
|
||||
guidance_scale = 3.0
|
||||
image = Image.open("burger.png").resize((256, 256))
|
||||
|
||||
images = pipe(
|
||||
image,
|
||||
guidance_scale=guidance_scale,
|
||||
num_inference_steps=64,
|
||||
frame_size=256,
|
||||
).images
|
||||
|
||||
gif_path = export_to_gif(images[0], "burger_3d.gif")
|
||||
```
|
||||
|
||||
<div class="flex gap-4">
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">cheeseburger</figcaption>
|
||||
</div>
|
||||
<div>
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif"/>
|
||||
<figcaption class="mt-2 text-center text-sm text-gray-500">3D cheeseburger</figcaption>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Generate mesh
|
||||
|
||||
Shap-E is a flexible model that can also generate textured mesh outputs to be rendered for downstream applications. In this example, you'll convert the output into a `glb` file because the 🤗 Datasets library supports mesh visualization of `glb` files which can be rendered by the [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer#dataset-preview).
|
||||
|
||||
You can generate mesh outputs for both the [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`] by specifying the `output_type` parameter as `"mesh"`:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import ShapEPipeline
|
||||
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
|
||||
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
|
||||
pipe = pipe.to(device)
|
||||
|
||||
guidance_scale = 15.0
|
||||
prompt = "A birthday cupcake"
|
||||
|
||||
images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
|
||||
```
|
||||
|
||||
Use the [`~utils.export_to_ply`] function to save the mesh output as a `ply` file:
|
||||
|
||||
<Tip>
|
||||
|
||||
You can optionally save the mesh output as an `obj` file with the [`~utils.export_to_obj`] function. The ability to save the mesh output in a variety of formats makes it more flexible for downstream usage!
|
||||
|
||||
</Tip>
|
||||
|
||||
```py
|
||||
from diffusers.utils import export_to_ply
|
||||
|
||||
ply_path = export_to_ply(images[0], "3d_cake.ply")
|
||||
print(f"saved to folder: {ply_path}")
|
||||
```
|
||||
|
||||
Then you can convert the `ply` file to a `glb` file with the trimesh library:
|
||||
|
||||
```py
|
||||
import trimesh
|
||||
|
||||
mesh = trimesh.load("3d_cake.ply")
|
||||
mesh.export("3d_cake.glb", file_type="glb")
|
||||
```
|
||||
|
||||
By default, the mesh output is viewed from the bottom, but you can change the default viewpoint by applying a rotation transform:
|
||||
|
||||
```py
|
||||
import trimesh
|
||||
import numpy as np
|
||||
|
||||
mesh = trimesh.load("3d_cake.ply")
|
||||
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
|
||||
mesh = mesh.apply_transform(rot)
|
||||
mesh.export("3d_cake.glb", file_type="glb")
|
||||
```
|
||||
|
||||
Upload the mesh file to your dataset repository to visualize it with the Dataset viewer!
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/3D-cake.gif"/>
|
||||
</div>
|
||||
@@ -1,41 +1,51 @@
|
||||
# JAX/Flax
|
||||
# 🧨 Stable Diffusion in JAX / Flax !
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
🤗 Diffusers supports Flax for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform. This guide shows you how to run inference with Stable Diffusion using JAX/Flax.
|
||||
🤗 Hugging Face [Diffusers](https://github.com/huggingface/diffusers) supports Flax since version `0.5.1`! This allows for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform.
|
||||
|
||||
Before you begin, make sure you have the necessary libraries installed:
|
||||
This notebook shows how to run inference using JAX / Flax. If you want more details about how Stable Diffusion works or want to run it in GPU, please refer to [this notebook](https://huggingface.co/docs/diffusers/stable_diffusion).
|
||||
|
||||
First, make sure you are using a TPU backend. If you are running this notebook in Colab, select `Runtime` in the menu above, then select the option "Change runtime type" and then select `TPU` under the `Hardware accelerator` setting.
|
||||
|
||||
Note that JAX is not exclusive to TPUs, but it shines on that hardware because each TPU server has 8 TPU accelerators working in parallel.
|
||||
|
||||
## Setup
|
||||
|
||||
First make sure diffusers is installed.
|
||||
|
||||
```py
|
||||
# uncomment to install the necessary libraries in Colab
|
||||
#!pip install -q jax==0.3.25 jaxlib==0.3.25 flax transformers ftfy
|
||||
#!pip install -q diffusers
|
||||
#!pip install jax==0.3.25 jaxlib==0.3.25 flax transformers ftfy
|
||||
#!pip install diffusers
|
||||
```
|
||||
|
||||
You should also make sure you're using a TPU backend. While JAX does not run exclusively on TPUs, you'll get the best performance on a TPU because each server has 8 TPU accelerators working in parallel.
|
||||
```python
|
||||
import jax.tools.colab_tpu
|
||||
|
||||
If you are running this guide in Colab, select *Runtime* in the menu above, select the option *Change runtime type*, and then select *TPU* under the *Hardware accelerator* setting. Import JAX and quickly check whether you're using a TPU:
|
||||
jax.tools.colab_tpu.setup_tpu()
|
||||
import jax
|
||||
```
|
||||
|
||||
```python
|
||||
import jax
|
||||
import jax.tools.colab_tpu
|
||||
jax.tools.colab_tpu.setup_tpu()
|
||||
|
||||
num_devices = jax.device_count()
|
||||
device_type = jax.devices()[0].device_kind
|
||||
|
||||
print(f"Found {num_devices} JAX devices of type {device_type}.")
|
||||
assert (
    "TPU" in device_type
), "Available device is not a TPU, please select TPU from Edit > Notebook settings > Hardware accelerator"
|
||||
```
|
||||
|
||||
Great, now you can import the rest of the dependencies you'll need:
|
||||
```python out
|
||||
Found 8 JAX devices of type Cloud TPU.
|
||||
```
|
||||
|
||||
Then we import all the dependencies.
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import jax
|
||||
import jax.numpy as jnp
|
||||
|
||||
from pathlib import Path
|
||||
@@ -48,12 +58,17 @@ from huggingface_hub import notebook_login
|
||||
from diffusers import FlaxStableDiffusionPipeline
|
||||
```
|
||||
|
||||
## Load a model
|
||||
## Model Loading
|
||||
|
||||
Flax is a functional framework, so models are stateless and parameters are stored outside of them. Loading a pretrained Flax pipeline returns *both* the pipeline and the model weights (or parameters). In this guide, you'll use `bfloat16`, a more efficient half-float type that is supported by TPUs (you can also use `float32` for full precision if you want).
|
||||
TPU devices support `bfloat16`, an efficient half-float type. We'll use it for our tests, but you can also use `float32` to use full precision instead.
|
||||
|
||||
```python
|
||||
dtype = jnp.bfloat16
|
||||
```
|
||||
|
||||
Flax is a functional framework, so models are stateless and parameters are stored outside them. Loading the pre-trained Flax pipeline will return both the pipeline itself and the model weights (or parameters). We are using a `bf16` version of the weights, which leads to type warnings that you can safely ignore.
|
||||
|
||||
```python
|
||||
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
|
||||
"CompVis/stable-diffusion-v1-4",
|
||||
revision="bf16",
|
||||
@@ -63,87 +78,95 @@ pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
|
||||
|
||||
## Inference
|
||||
|
||||
TPUs usually have 8 devices working in parallel, so let's use the same prompt for each device. This means you can perform inference on 8 devices at once, with each device generating one image. As a result, you'll get 8 images in the same amount of time it takes for one chip to generate a single image!
|
||||
Since TPUs usually have 8 devices working in parallel, we'll replicate our prompt as many times as devices we have. Then we'll perform inference on the 8 devices at once, each responsible for generating one image. Thus, we'll get 8 images in the same amount of time it takes for one chip to generate a single one.
|
||||
|
||||
<Tip>
|
||||
|
||||
Learn more details in the [How does parallelization work?](#how-does-parallelization-work) section.
|
||||
|
||||
</Tip>
|
||||
|
||||
After replicating the prompt, get the tokenized text ids by calling the `prepare_inputs` function on the pipeline. The length of the tokenized text is set to 77 tokens as required by the configuration of the underlying CLIP text model.
|
||||
After replicating the prompt, we obtain the tokenized text ids by invoking the `prepare_inputs` function of the pipeline. The length of the tokenized text is set to 77 tokens, as required by the configuration of the underlying CLIP Text model.
|
||||
|
||||
```python
|
||||
prompt = "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of field, close up, split lighting, cinematic"
|
||||
prompt = [prompt] * jax.device_count()
|
||||
prompt_ids = pipeline.prepare_inputs(prompt)
|
||||
prompt_ids.shape
|
||||
"(8, 77)"
|
||||
```
|
||||
|
||||
Model parameters and inputs have to be replicated across the 8 parallel devices. The parameters dictionary is replicated with [`flax.jax_utils.replicate`](https://flax.readthedocs.io/en/latest/api_reference/flax.jax_utils.html#flax.jax_utils.replicate) which traverses the dictionary and changes the shape of the weights so they are repeated 8 times. Arrays are replicated using `shard`.
|
||||
```python out
|
||||
(8, 77)
|
||||
```
|
||||
|
||||
### Replication and parallelization
|
||||
|
||||
Model parameters and inputs have to be replicated across the 8 parallel devices we have. The parameters dictionary is replicated using `flax.jax_utils.replicate`, which traverses the dictionary and changes the shape of the weights so they are repeated 8 times. Arrays are replicated using `shard`.
|
||||
|
||||
```python
|
||||
# parameters
|
||||
p_params = replicate(params)
|
||||
|
||||
# arrays
|
||||
prompt_ids = shard(prompt_ids)
|
||||
prompt_ids.shape
|
||||
"(8, 1, 77)"
|
||||
```
|
||||
|
||||
This shape means each one of the 8 devices receives as an input a `jnp` array with shape `(1, 77)`, where `1` is the batch size per device. On TPUs with sufficient memory, you could have a batch size larger than `1` if you want to generate multiple images (per chip) at once.
|
||||
```python
|
||||
prompt_ids = shard(prompt_ids)
|
||||
prompt_ids.shape
|
||||
```
|
||||
|
||||
Next, create a random number generator to pass to the generation function. This is standard procedure in Flax, which is very serious and opinionated about random numbers. All functions that deal with random numbers are expected to receive a generator to ensure reproducibility, even when you're training across multiple distributed devices.
|
||||
```python out
|
||||
(8, 1, 77)
|
||||
```
|
||||
|
||||
The helper function below uses a seed to initialize a random number generator. As long as you use the same seed, you'll get the exact same results. Feel free to use different seeds when exploring results later in the guide.
|
||||
That shape means that each one of the `8` devices will receive as an input a `jnp` array with shape `(1, 77)`. `1` is therefore the batch size per device. In TPUs with sufficient memory, it could be larger than `1` if we wanted to generate multiple images (per chip) at once.
|
||||
|
||||
We are almost ready to generate images! We just need to create a random number generator to pass to the generation function. This is the standard procedure in Flax, which is very serious and opinionated about random numbers – all functions that deal with random numbers are expected to receive a generator. This ensures reproducibility, even when we are training across multiple distributed devices.
|
||||
|
||||
The helper function below uses a seed to initialize a random number generator. As long as we use the same seed, we'll get the exact same results. Feel free to use different seeds when exploring results later in the notebook.
|
||||
|
||||
```python
|
||||
def create_key(seed=0):
|
||||
return jax.random.PRNGKey(seed)
|
||||
```
|
||||
|
||||
The helper function, or `rng`, is split 8 times so each device receives a different generator and generates a different image.
|
||||
We obtain a rng and then "split" it 8 times so each device receives a different generator. Therefore, each device will create a different image, and the full process is reproducible.
|
||||
|
||||
```python
|
||||
rng = create_key(0)
|
||||
rng = jax.random.split(rng, jax.device_count())
|
||||
```
|
||||
|
||||
To take advantage of JAX's optimized speed on a TPU, pass `jit=True` to the pipeline to compile the JAX code into an efficient representation and to ensure the model runs in parallel across the 8 devices.
|
||||
JAX code can be compiled to an efficient representation that runs very fast. However, we need to ensure that all inputs have the same shape in subsequent calls; otherwise, JAX will have to recompile the code, and we wouldn't be able to take advantage of the optimized speed.
|
||||
|
||||
<Tip warning={true}>
|
||||
The Flax pipeline can compile the code for us if we pass `jit = True` as an argument. It will also ensure that the model runs in parallel in the 8 available devices.
|
||||
|
||||
You need to ensure all your inputs have the same shape in subsequent calls, otherwise JAX will need to recompile the code, which is slower.
|
||||
The first time we run the following cell it will take a long time to compile, but subsequent calls (even with different inputs) will be much faster. For example, it took more than a minute to compile on a TPU v2-8 when I tested, but then it takes about **`7s`** for future inference runs.
|
||||
|
||||
</Tip>
|
||||
|
||||
The first inference run takes more time because it needs to compile the code, but subsequent calls (even with different inputs) are much faster. For example, it took more than a minute to compile on a TPU v2-8, but then it takes about **7s** on a future inference run!
|
||||
|
||||
```py
%%time
|
||||
images = pipeline(prompt_ids, p_params, rng, jit=True)[0]
|
||||
|
||||
"CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s"
|
||||
"Wall time: 1min 29s"
|
||||
```
|
||||
|
||||
The returned array has shape `(8, 1, 512, 512, 3)` which should be reshaped to remove the second dimension and get 8 images of `512 × 512 × 3`. Then you can use the [`~utils.numpy_to_pil`] function to convert the arrays into images.
|
||||
```python out
|
||||
CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s
|
||||
Wall time: 1min 29s
|
||||
```
|
||||
|
||||
The returned array has shape `(8, 1, 512, 512, 3)`. We reshape it to get rid of the second dimension and obtain 8 images of `512 × 512 × 3` and then convert them to PIL.
|
||||
|
||||
```python
|
||||
images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
|
||||
images = pipeline.numpy_to_pil(images)
|
||||
```
|
||||
|
||||
### Visualization
|
||||
|
||||
```python
|
||||
from diffusers import make_image_grid
|
||||
|
||||
images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
|
||||
images = pipeline.numpy_to_pil(images)
|
||||
make_image_grid(images, 2, 4)
|
||||
```
|
||||
|
||||

|
||||
|
||||
|
||||
## Using different prompts
|
||||
|
||||
You don't necessarily have to use the same prompt on all devices. For example, to generate 8 different prompts:
|
||||
We don't have to replicate the _same_ prompt in all the devices. We can do whatever we want: generate 2 prompts 4 times each, or even generate 8 different prompts at once. Let's do that!
|
||||
|
||||
First, we'll refactor the input preparation code into a handy function:
|
||||
|
||||
```python
|
||||
prompts = [
|
||||
@@ -156,7 +179,9 @@ prompts = [
|
||||
"Armchair in the shape of an avocado",
|
||||
"Clown astronaut in space, with Earth in the background",
|
||||
]
|
||||
```
|
||||
|
||||
```python
|
||||
prompt_ids = pipeline.prepare_inputs(prompts)
|
||||
prompt_ids = shard(prompt_ids)
|
||||
|
||||
@@ -172,41 +197,46 @@ make_image_grid(images, 2, 4)
|
||||
|
||||
## How does parallelization work?
|
||||
|
||||
The Flax pipeline in 🤗 Diffusers automatically compiles the model and runs it in parallel on all available devices. Let's take a closer look at how that process works.
|
||||
We said before that the `diffusers` Flax pipeline automatically compiles the model and runs it in parallel on all available devices. We'll now briefly look inside that process to show how it works.
|
||||
|
||||
JAX parallelization can be done in multiple ways. The easiest one revolves around using the [`jax.pmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.pmap.html) function to achieve single-program multiple-data (SPMD) parallelization. It means running several copies of the same code, each on different data inputs. More sophisticated approaches are possible, and you can go over to the JAX [documentation](https://jax.readthedocs.io/en/latest/index.html) to explore this topic in more detail if you are interested!
|
||||
JAX parallelization can be done in multiple ways. The easiest one revolves around using the `jax.pmap` function to achieve single-program, multiple-data (SPMD) parallelization. It means we'll run several copies of the same code, each on different data inputs. More sophisticated approaches are possible, we invite you to go over the [JAX documentation](https://jax.readthedocs.io/en/latest/index.html) and the [`pjit` pages](https://jax.readthedocs.io/en/latest/jax-101/08-pjit.html?highlight=pjit) to explore this topic if you are interested!
|
||||
|
||||
`jax.pmap` does two things:
|
||||
`jax.pmap` does two things for us:
|
||||
- Compiles (or `jit`s) the code, as if we had invoked `jax.jit()`. This does not happen when we call `pmap`, but the first time the pmapped function is invoked.
|
||||
- Ensures the compiled code runs in parallel in all the available devices.
|
||||
|
||||
1. Compiles (or "`jit`s") the code which is similar to `jax.jit()`. This does not happen when you call `pmap`, and only the first time the `pmap`ped function is called.
|
||||
2. Ensures the compiled code runs in parallel on all available devices.
|
||||
|
||||
To demonstrate, call `pmap` on the pipeline's `_generate` method (this is a private method that generates images and may be renamed or removed in future releases of 🤗 Diffusers):
|
||||
To show how it works, we `pmap` the `_generate` method of the pipeline, which is the private method that generates images. Please note that this method may be renamed or removed in future releases of `diffusers`.
|
||||
|
||||
```python
|
||||
p_generate = pmap(pipeline._generate)
|
||||
```
|
||||
|
||||
After calling `pmap`, the prepared function `p_generate` will:
|
||||
After we use `pmap`, the prepared function `p_generate` will conceptually do the following:
|
||||
* Invoke a copy of the underlying function `pipeline._generate` in each device.
|
||||
* Send each device a different portion of the input arguments. That's what sharding is used for. In our case, `prompt_ids` has shape `(8, 1, 77, 768)`. This array will be split in `8` and each copy of `_generate` will receive an input with shape `(1, 77, 768)`.
|
||||
|
||||
1. Make a copy of the underlying function, `pipeline._generate`, on each device.
|
||||
2. Send each device a different portion of the input arguments (this is why it's necessary to call the *shard* function). In this case, `prompt_ids` has shape `(8, 1, 77, 768)` so the array is split into 8 and each copy of `_generate` receives an input with shape `(1, 77, 768)`.
|
||||
We can code `_generate` completely ignoring the fact that it will be invoked in parallel. We just care about our batch size (`1` in this example) and the dimensions that make sense for our code, and don't have to change anything to make it work in parallel.
|
||||
|
||||
The most important thing to pay attention to here is the batch size (1 in this example), and the input dimensions that make sense for your code. You don't have to change anything else to make the code work in parallel.
|
||||
The same way as when we used the pipeline call, the first time we run the following cell it will take a while, but then it will be much faster.
|
||||
|
||||
The first time you call the pipeline takes more time, but the calls afterward are much faster. The `block_until_ready` function is used to correctly measure inference time because JAX uses asynchronous dispatch and returns control to the Python loop as soon as it can. You don't need to use that in your code; blocking occurs automatically when you want to use the result of a computation that has not yet been materialized.
|
||||
|
||||
```py
%%time
|
||||
images = p_generate(prompt_ids, p_params, rng)
|
||||
images = images.block_until_ready()
|
||||
"CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s"
|
||||
"Wall time: 1min 15s"
|
||||
images.shape
|
||||
```
|
||||
|
||||
Check your image dimensions to see if they're correct:
|
||||
```python out
|
||||
CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s
|
||||
Wall time: 1min 15s
|
||||
```
|
||||
|
||||
```python
|
||||
images.shape
|
||||
"(8, 1, 512, 512, 3)"
|
||||
```
|
||||
```
|
||||
|
||||
```python out
|
||||
(8, 1, 512, 512, 3)
|
||||
```
|
||||
|
||||
We use `block_until_ready()` to correctly measure inference time, because JAX uses asynchronous dispatch and returns control to the Python loop as soon as it can. You don't need to use that in your code; blocking will occur automatically when you want to use the result of a computation that has not yet been materialized.
|
||||
@@ -28,8 +28,6 @@ from diffusers.utils import make_image_grid
|
||||
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
|
||||
```
|
||||
|
||||
## Stable Diffusion 1 and 2
|
||||
|
||||
Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer):
|
||||
|
||||
```py
|
||||
@@ -71,50 +69,3 @@ grid
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/textual_inversion_inference.png">
|
||||
</div>
|
||||
|
||||
|
||||
## Stable Diffusion XL
|
||||
|
||||
Stable Diffusion XL (SDXL) can also use textual inversion vectors for inference. In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders so you'll need two textual inversion embeddings - one for each text encoder model.
|
||||
|
||||
Let's download the SDXL textual inversion embeddings and have a closer look at its structure:
|
||||
|
||||
```py
|
||||
from huggingface_hub import hf_hub_download
|
||||
from safetensors.torch import load_file
|
||||
|
||||
file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors")
|
||||
state_dict = load_file(file)
|
||||
state_dict
|
||||
```
|
||||
|
||||
```
|
||||
{'clip_g': tensor([[ 0.0077, -0.0112, 0.0065, ..., 0.0195, 0.0159, 0.0275],
|
||||
...,
|
||||
[-0.0170, 0.0213, 0.0143, ..., -0.0302, -0.0240, -0.0362]],
|
||||
'clip_l': tensor([[ 0.0023, 0.0192, 0.0213, ..., -0.0385, 0.0048, -0.0011],
|
||||
...,
|
||||
[ 0.0475, -0.0508, -0.0145, ..., 0.0070, -0.0089, -0.0163]],
|
||||
```
|
||||
|
||||
There are two tensors, `"clip_g"` and `"clip_l"`.
`"clip_g"` corresponds to the bigger text encoder in SDXL and refers to
`pipe.text_encoder_2`, and `"clip_l"` refers to `pipe.text_encoder`.
|
||||
|
||||
Now you can load each tensor separately by passing them along with the correct text encoder and tokenizer
|
||||
to [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`]:
|
||||
|
||||
```py
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16)
|
||||
pipe.to("cuda")
|
||||
|
||||
pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
|
||||
pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
|
||||
|
||||
# the embedding should be used as a negative embedding, so we pass it as a negative prompt
|
||||
generator = torch.Generator().manual_seed(33)
|
||||
image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0]
|
||||
```
|
||||
|
||||
19
docs/source/en/using-diffusers/using_safetensors
Normal file
@@ -0,0 +1,19 @@
# What is safetensors?

[safetensors](https://github.com/huggingface/safetensors) is a different format
from the classic `.bin` PyTorch format, which relies on pickle.

Pickle is notoriously unsafe, allowing a malicious file to execute arbitrary code.
The Hub itself tries to prevent issues with it, but it's not a silver bullet.

The first and foremost goal of `safetensors` is to make loading machine learning models *safe*,
in the sense that no takeover of your computer can occur.

# Why use safetensors?

**Safety** can be one reason, if you're attempting to use a model that is not well known and
you're not sure about the source of the file.

A secondary reason is **the speed of loading**. Safetensors can load models much faster
than regular pickle files. If you spend a lot of time switching models, this can be
a huge time saver.

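As a quick, illustrative sketch (the checkpoint name is just an example), you can ask 🤗 Diffusers to load only safetensors weights by passing `use_safetensors=True`:

```py
import torch
from diffusers import DiffusionPipeline

# loading fails if the repository only provides pickle (.bin) weights
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", use_safetensors=True, torch_dtype=torch.float16
)
```
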
@@ -143,8 +143,8 @@ image
|
||||
A conjunction diffuses each prompt independently and concatenates their results by their weighted sum. Add `.and()` to the end of a list of prompts to create a conjunction:
|
||||
|
||||
```py
|
||||
prompt_embeds = compel_proc('["a red cat", "playing with a", "ball"].and()')
|
||||
generator = torch.Generator(device="cuda").manual_seed(55)
|
||||
prompt_embeds = compel_proc('("a red cat, playing with a, ball").and()')
|
||||
generator = torch.Generator(device="cuda").manual_seed(33)
|
||||
|
||||
image = pipe(prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20).images[0]
|
||||
image
|
||||
|
||||
@@ -112,7 +112,7 @@ As you can see, this is already more complex than the DDPM pipeline which only c
|
||||
|
||||
<Tip>
|
||||
|
||||
💡 Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog for more details about how the VAE, UNet, and text encoder models work.
|
||||
💡 Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog for more details about how the VAE, UNet, and text encoder models.
|
||||
|
||||
</Tip>
|
||||
|
||||
@@ -169,7 +169,7 @@ Feel free to choose any prompt you like if you want to generate something else!
|
||||
>>> width = 512 # default width of Stable Diffusion
|
||||
>>> num_inference_steps = 25 # Number of denoising steps
|
||||
>>> guidance_scale = 7.5 # Scale for classifier-free guidance
|
||||
>>> generator = torch.manual_seed(0) # Seed generator to create the initial latent noise
|
||||
>>> generator = torch.manual_seed(0) # Seed generator to create the inital latent noise
|
||||
>>> batch_size = len(prompt)
|
||||
```
|
||||
|
||||
@@ -214,7 +214,7 @@ Next, generate some initial random noise as a starting point for the diffusion p
|
||||
|
||||
```py
|
||||
>>> latents = torch.randn(
|
||||
... (batch_size, unet.config.in_channels, height // 8, width // 8),
|
||||
... (batch_size, unet.in_channels, height // 8, width // 8),
|
||||
... generator=generator,
|
||||
... )
|
||||
>>> latents = latents.to(torch_device)
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
title: "🧨 Diffusers"
|
||||
- local: quicktour
|
||||
title: "훑어보기"
|
||||
- local: stable_diffusion
|
||||
- local: in_translation
|
||||
title: Stable Diffusion
|
||||
- local: installation
|
||||
title: "설치"
|
||||
@@ -13,14 +13,12 @@
|
||||
title: 개요
|
||||
- local: using-diffusers/write_own_pipeline
|
||||
title: 모델과 스케줄러 이해하기
|
||||
- local: in_translation
|
||||
title: AutoPipeline
|
||||
- local: tutorials/basic_training
|
||||
title: Diffusion 모델 학습하기
|
||||
title: Tutorials
|
||||
- sections:
|
||||
- sections:
|
||||
- local: using-diffusers/loading_overview
|
||||
- local: in_translation
|
||||
title: 개요
|
||||
- local: using-diffusers/loading
|
||||
title: 파이프라인, 모델, 스케줄러 불러오기
|
||||
@@ -32,15 +30,13 @@
|
||||
title: 세이프텐서 불러오기
|
||||
- local: using-diffusers/other-formats
|
||||
title: 다른 형식의 Stable Diffusion 불러오기
|
||||
- local: in_translation
|
||||
title: Hub에 파일 push하기
|
||||
title: 불러오기 & 허브
|
||||
- sections:
|
||||
- local: using-diffusers/pipeline_overview
|
||||
title: 개요
|
||||
- local: using-diffusers/unconditional_image_generation
|
||||
title: Unconditional 이미지 생성
|
||||
- local: using-diffusers/conditional_image_generation
|
||||
- local: in_translation
|
||||
title: Text-to-image 생성
|
||||
- local: using-diffusers/img2img
|
||||
title: Text-guided image-to-image
|
||||
@@ -48,31 +44,27 @@
|
||||
title: Text-guided 이미지 인페인팅
|
||||
- local: using-diffusers/depth2img
|
||||
title: Text-guided depth-to-image
|
||||
- local: using-diffusers/textual_inversion_inference
|
||||
title: Textual inversion
|
||||
- local: training/distributed_inference
|
||||
title: 여러 GPU를 사용한 분산 추론
|
||||
- local: in_translation
|
||||
title: Distilled Stable Diffusion 추론
|
||||
title: Textual inversion
|
||||
- local: in_translation
|
||||
title: 여러 GPU를 사용한 분산 추론
|
||||
- local: using-diffusers/reusing_seeds
|
||||
title: Deterministic 생성으로 이미지 퀄리티 높이기
|
||||
- local: using-diffusers/control_brightness
|
||||
title: 이미지 밝기 조정하기
|
||||
- local: using-diffusers/reproducibility
|
||||
- local: in_translation
|
||||
title: 재현 가능한 파이프라인 생성하기
|
||||
- local: using-diffusers/custom_pipeline_examples
|
||||
title: 커뮤니티 파이프라인들
|
||||
- local: using-diffusers/contribute_pipeline
|
||||
- local: in_translation
|
||||
title: 커뮤티니 파이프라인에 기여하는 방법
|
||||
- local: using-diffusers/stable_diffusion_jax_how_to
|
||||
- local: in_translation
|
||||
title: JAX/Flax에서의 Stable Diffusion
|
||||
- local: using-diffusers/weighted_prompts
|
||||
- local: in_translation
|
||||
title: Weighting Prompts
|
||||
title: 추론을 위한 파이프라인
|
||||
- sections:
|
||||
- local: training/overview
|
||||
title: 개요
|
||||
- local: training/create_dataset
|
||||
- local: in_translation
|
||||
title: 학습을 위한 데이터셋 생성하기
|
||||
- local: training/adapt_a_model
|
||||
title: 새로운 태스크에 모델 적용하기
|
||||
@@ -86,11 +78,11 @@
|
||||
title: Text-to-image
|
||||
- local: training/lora
|
||||
title: Low-Rank Adaptation of Large Language Models (LoRA)
|
||||
- local: training/controlnet
|
||||
- local: in_translation
|
||||
title: ControlNet
|
||||
- local: training/instructpix2pix
|
||||
- local: in_translation
|
||||
title: InstructPix2Pix 학습
|
||||
- local: training/custom_diffusion
|
||||
- local: in_translation
|
||||
title: Custom Diffusion
|
||||
title: Training
|
||||
title: Diffusers 사용하기
|
||||
@@ -107,26 +99,12 @@
|
||||
title: ONNX
|
||||
- local: optimization/open_vino
|
||||
title: OpenVINO
|
||||
- local: optimization/coreml
|
||||
- local: in_translation
|
||||
title: Core ML
|
||||
- local: optimization/mps
|
||||
title: MPS
|
||||
- local: optimization/habana
|
||||
title: Habana Gaudi
|
||||
- local: optimization/tome
|
||||
- local: in_translation
|
||||
title: Token Merging
|
||||
title: 최적화/특수 하드웨어
|
||||
- sections:
|
||||
- local: using-diffusers/controlling_generation
|
||||
title: 제어된 생성
|
||||
- local: in_translation
|
||||
title: Diffusion Models 평가하기
|
||||
title: 개념 가이드
|
||||
- sections:
|
||||
- sections:
|
||||
- sections:
|
||||
- local: api/pipelines/stable_diffusion/stable_diffusion_xl
|
||||
title: Stable Diffusion XL
|
||||
title: Stable Diffusion
|
||||
title: Pipelines
|
||||
title: API
|
||||
@@ -1,400 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Stable diffusion XL
|
||||
|
||||
Stable Diffusion XL은 Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach에 의해 [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://arxiv.org/abs/2307.01952)에서 제안되었습니다.
|
||||
|
||||
논문 초록은 다음을 따릅니다:
|
||||
|
||||
*text-to-image의 latent diffusion 모델인 SDXL을 소개합니다. 이전 버전의 Stable Diffusion과 비교하면, SDXL은 세 배 더큰 규모의 UNet 백본을 포함합니다: 모델 파라미터의 증가는 많은 attention 블럭을 사용하고 더 큰 cross-attention context를 SDXL의 두 번째 텍스트 인코더에 사용하기 때문입니다. 다중 종횡비에 다수의 새로운 conditioning 방법을 구성했습니다. 또한 후에 수정하는 image-to-image 기술을 사용함으로써 SDXL에 의해 생성된 시각적 품질을 향상하기 위해 정제된 모델을 소개합니다. SDXL은 이전 버전의 Stable Diffusion보다 성능이 향상되었고, 이러한 black-box 최신 이미지 생성자와 경쟁력있는 결과를 달성했습니다.*
|
||||
|
||||
## 팁
|
||||
|
||||
- Stable Diffusion XL은 특히 786과 1024사이의 이미지에 잘 작동합니다.
|
||||
- Stable Diffusion XL은 아래와 같이 학습된 각 텍스트 인코더에 대해 서로 다른 프롬프트를 전달할 수 있습니다. 동일한 프롬프트의 다른 부분을 텍스트 인코더에 전달할 수도 있습니다.
|
||||
- Stable Diffusion XL 결과 이미지는 아래에 보여지듯이 정제기(refiner)를 사용함으로써 향상될 수 있습니다.
|
||||
|
||||
### 이용가능한 체크포인트:
|
||||
|
||||
- *Text-to-Image (1024x1024 해상도)*: [`StableDiffusionXLPipeline`]을 사용한 [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
|
||||
- *Image-to-Image / 정제기(refiner) (1024x1024 해상도)*: [`StableDiffusionXLImg2ImgPipeline`]를 사용한 [stabilityai/stable-diffusion-xl-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0)
|
||||
|
||||
## 사용 예시
|
||||
|
||||
SDXL을 사용하기 전에 `transformers`, `accelerate`, `safetensors` 와 `invisible_watermark`를 설치하세요.
|
||||
다음과 같이 라이브러리를 설치할 수 있습니다:
|
||||
|
||||
```
|
||||
pip install transformers
|
||||
pip install accelerate
|
||||
pip install safetensors
|
||||
pip install invisible-watermark>=0.2.0
|
||||
```
|
||||
|
||||
### 워터마커
|
||||
|
||||
Stable Diffusion XL로 이미지를 생성할 때 워터마크가 보이지 않도록 추가하는 것을 권장하는데, 이는 다운스트림(downstream) 어플리케이션에서 기계에 합성되었는지를 식별하는데 도움을 줄 수 있습니다. 그렇게 하려면 [invisible_watermark 라이브러리](https://pypi.org/project/invisible-watermark/)를 통해 설치해주세요:
|
||||
|
||||
|
||||
```
|
||||
pip install invisible-watermark>=0.2.0
|
||||
```
|
||||
|
||||
`invisible-watermark` 라이브러리가 설치되면 워터마커가 **기본적으로** 사용될 것입니다.
|
||||
|
||||
생성 또는 안전하게 이미지를 배포하기 위해 다른 규정이 있다면, 다음과 같이 워터마커를 비활성화할 수 있습니다:
|
||||
|
||||
```py
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)
|
||||
```
|
||||
|
||||
### Text-to-Image
|
||||
|
||||
*text-to-image*를 위해 다음과 같이 SDXL을 사용할 수 있습니다:
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionXLPipeline
|
||||
import torch
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
image = pipe(prompt=prompt).images[0]
|
||||
```
|
||||
|
||||
### Image-to-image
|
||||
|
||||
*image-to-image*를 위해 다음과 같이 SDXL을 사용할 수 있습니다:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import StableDiffusionXLImg2ImgPipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe = pipe.to("cuda")
|
||||
url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"
|
||||
|
||||
init_image = load_image(url).convert("RGB")
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt, image=init_image).images[0]
|
||||
```
|
||||
|
||||
### 인페인팅
|
||||
|
||||
*inpainting*를 위해 다음과 같이 SDXL을 사용할 수 있습니다:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import StableDiffusionXLInpaintPipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
|
||||
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
|
||||
|
||||
init_image = load_image(img_url).convert("RGB")
|
||||
mask_image = load_image(mask_url).convert("RGB")
|
||||
|
||||
prompt = "A majestic tiger sitting on a bench"
|
||||
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0]
|
||||
```
|
||||
|
||||
### 이미지 결과물을 정제하기
|
||||
|
||||
[base 모델 체크포인트](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)에서, StableDiffusion-XL 또한 고주파 품질을 향상시키는 이미지를 생성하기 위해 낮은 노이즈 단계 이미지를 제거하는데 특화된 [refiner 체크포인트](huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0)를 포함하고 있습니다. 이 refiner 체크포인트는 이미지 품질을 향상시키기 위해 base 체크포인트를 실행한 후 "두 번째 단계" 파이프라인에 사용될 수 있습니다.
|
||||
|
||||
refiner를 사용할 때, 쉽게 사용할 수 있습니다
|
||||
- 1.) base 모델과 refiner을 사용하는데, 이는 *Denoisers의 앙상블*을 위한 첫 번째 제안된 [eDiff-I](https://research.nvidia.com/labs/dir/eDiff-I/)를 사용하거나
|
||||
- 2.) base 모델을 거친 후 [SDEdit](https://arxiv.org/abs/2108.01073) 방법으로 단순하게 refiner를 실행시킬 수 있습니다.
|
||||
|
||||
**참고**: SD-XL base와 refiner를 앙상블로 사용하는 아이디어는 커뮤니티 기여자들이 처음으로 제안했으며, 이는 다음과 같은 `diffusers`를 구현하는 데도 도움을 주셨습니다.
|
||||
- [SytanSD](https://github.com/SytanSD)
|
||||
- [bghira](https://github.com/bghira)
|
||||
- [Birch-san](https://github.com/Birch-san)
|
||||
- [AmericanPresidentJimmyCarter](https://github.com/AmericanPresidentJimmyCarter)
|
||||
|
||||
#### 1.) Denoisers의 앙상블
|
||||
|
||||
base와 refiner 모델을 denoiser의 앙상블로 사용할 때, base 모델은 고주파 diffusion 단계를 위한 전문가의 역할을 해야하고, refiner는 낮은 노이즈 diffusion 단계를 위한 전문가의 역할을 해야 합니다.
|
||||
|
||||
2.)에 비해 1.)의 장점은 전체적으로 denoising 단계가 덜 필요하므로 속도가 훨씬 더 빨라집니다. 단점은 base 모델의 결과를 검사할 수 없다는 것입니다. 즉, 여전히 노이즈가 심하게 제거됩니다.
|
||||
|
||||
base 모델과 refiner를 denoiser의 앙상블로 사용하기 위해 각각 고노이즈(high-nosise) (*즉* base 모델)와 저노이즈 (*즉* refiner 모델)의 노이즈를 제거하는 단계를 거쳐야하는 타임스텝의 기간을 정의해야 합니다.
|
||||
base 모델의 [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end)와 refiner 모델의 [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start)를 사용해 간격을 정합니다.
|
||||
|
||||
`denoising_end`와 `denoising_start` 모두 0과 1사이의 실수 값으로 전달되어야 합니다.
|
||||
전달되면 노이즈 제거의 끝과 시작은 모델 스케줄에 의해 정의된 이산적(discrete) 시간 간격의 비율로 정의됩니다.
|
||||
노이즈 제거 단계의 수는 모델이 학습된 불연속적인 시간 간격과 선언된 fractional cutoff에 의해 결정되므로 '강도' 또한 선언된 경우 이 값이 '강도'를 재정의합니다.
|
||||
|
||||
예시를 들어보겠습니다.
|
||||
우선, 두 개의 파이프라인을 가져옵니다. 텍스트 인코더와 variational autoencoder는 동일하므로 refiner를 위해 다시 불러오지 않아도 됩니다.
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
base = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
base.to("cuda")
|
||||
|
||||
refiner = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0",
|
||||
text_encoder_2=base.text_encoder_2,
|
||||
vae=base.vae,
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
variant="fp16",
|
||||
)
|
||||
refiner.to("cuda")
|
||||
```
|
||||
|
||||
이제 추론 단계의 수와 고노이즈에서 노이즈를 제거하는 단계(*즉* base 모델)를 거쳐 실행되는 지점을 정의합니다.
|
||||
|
||||
```py
|
||||
n_steps = 40
|
||||
high_noise_frac = 0.8
|
||||
```
|
||||
|
||||
Stable Diffusion XL base 모델은 타임스텝 0-999에 학습되며 Stable Diffusion XL refiner는 포괄적인 낮은 노이즈 타임스텝인 0-199에 base 모델로 부터 파인튜닝되어, 첫 800 타임스텝 (높은 노이즈)에 base 모델을 사용하고 마지막 200 타입스텝 (낮은 노이즈)에서 refiner가 사용됩니다. 따라서, `high_noise_frac`는 0.8로 설정하고, 모든 200-999 스텝(노이즈 제거 타임스텝의 첫 80%)은 base 모델에 의해 수행되며 0-199 스텝(노이즈 제거 타임스텝의 마지막 20%)은 refiner 모델에 의해 수행됩니다.
|
||||
|
||||
기억하세요, 노이즈 제거 절차는 **높은 값**(높은 노이즈) 타임스텝에서 시작되고, **낮은 값** (낮은 노이즈) 타임스텝에서 끝납니다.
|
||||
|
||||
이제 두 파이프라인을 실행해봅시다. `denoising_end`과 `denoising_start`를 같은 값으로 설정하고 `num_inference_steps`는 상수로 유지합니다. 또한 base 모델의 출력은 잠재 공간에 있어야 한다는 점을 기억하세요:
|
||||
|
||||
```py
|
||||
prompt = "A majestic lion jumping from a big stone at night"
|
||||
|
||||
image = base(
|
||||
prompt=prompt,
|
||||
num_inference_steps=n_steps,
|
||||
denoising_end=high_noise_frac,
|
||||
output_type="latent",
|
||||
).images
|
||||
image = refiner(
|
||||
prompt=prompt,
|
||||
num_inference_steps=n_steps,
|
||||
denoising_start=high_noise_frac,
|
||||
image=image,
|
||||
).images[0]
|
||||
```
|
||||
|
||||
이미지를 살펴보겠습니다.
|
||||
|
||||
| 원래의 이미지 | Denoiser들의 앙상블 |
|
||||
|---|---|
|
||||
|  | 
|
||||
|
||||
동일한 40 단계에서 base 모델을 실행한다면, 이미지의 디테일(예: 사자의 눈과 코)이 떨어졌을 것입니다:
|
||||
|
||||
<Tip>
|
||||
|
||||
앙상블 방식은 사용 가능한 모든 스케줄러에서 잘 작동합니다!
|
||||
|
||||
</Tip>
|
||||
|
||||
#### 2.) 노이즈가 완전히 제거된 기본 이미지에서 이미지 출력을 정제하기
|
||||
|
||||
일반적인 [`StableDiffusionImg2ImgPipeline`] 방식에서, 기본 모델에서 생성된 완전히 노이즈가 제거된 이미지는 [refiner checkpoint](huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0)를 사용해 더 향상시킬 수 있습니다.
|
||||
|
||||
이를 위해, 보통의 "base" text-to-image 파이프라인을 수행 후에 image-to-image 파이프라인으로써 refiner를 실행시킬 수 있습니다. base 모델의 출력을 잠재 공간에 남겨둘 수 있습니다.
|
||||
|
||||
```py
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
refiner = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0",
|
||||
text_encoder_2=pipe.text_encoder_2,
|
||||
vae=pipe.vae,
|
||||
torch_dtype=torch.float16,
|
||||
use_safetensors=True,
|
||||
variant="fp16",
|
||||
)
|
||||
refiner.to("cuda")
|
||||
|
||||
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
|
||||
|
||||
image = pipe(prompt=prompt, output_type="latent").images[0]
|
||||
image = refiner(prompt=prompt, image=image[None, :]).images[0]
|
||||
```
|
||||
|
||||
| 원래의 이미지 | 정제된 이미지 |
|
||||
|---|---|
|
||||
|  |  |
|
||||
|
||||
<Tip>
|
||||
|
||||
refiner는 또한 인페인팅 설정에 잘 사용될 수 있습니다. 아래에 보여지듯이 [`StableDiffusionXLInpaintPipeline`] 클래스를 사용해서 만들어보세요.
|
||||
|
||||
</Tip>
|
||||
|
||||
Denoiser 앙상블 설정에서 인페인팅에 refiner를 사용하려면 다음을 수행하면 됩니다:

```py
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=pipe.text_encoder_2,
    vae=pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"
num_inference_steps = 75
high_noise_frac = 0.7

image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_start=high_noise_frac,
).images[0]
```
|
||||
|
||||
일반적인 SDE 설정에서 인페인팅에 refiner를 사용하려면, `denoising_end`와 `denoising_start`를 제거하고 refiner의 추론 단계 수를 더 적게 선택하면 됩니다. 아래의 스케치를 참고하세요.
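
예를 들면 다음과 같은 형태가 될 수 있습니다. 아래는 base 인페인팅 결과를 일반적인 image-to-image 방식으로 한 번 더 정제하는 최소한의 스케치이며, `num_inference_steps=30`과 같은 구체적인 값은 임의로 가정한 예시입니다:

```py
# 위에서 만든 pipe, refiner, init_image, mask_image, prompt를 그대로 사용한다고 가정합니다
image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=75,
).images[0]

# denoising_end / denoising_start 없이, 더 적은 스텝으로 refiner를 실행합니다 (스텝 수는 가정한 값입니다)
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=30,
).images[0]
```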
|
||||
|
||||
### 단독 체크포인트 파일 / 원래의 파일 형식으로 불러오기

[`~diffusers.loaders.FromSingleFileMixin.from_single_file`]을 사용하여 원래의 단일 파일 형식의 체크포인트를 `diffusers` 형식으로 불러올 수 있습니다:

```py
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

pipe = StableDiffusionXLPipeline.from_single_file(
    "./sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "./sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
)
refiner.to("cuda")
```
|
||||
|
||||
### 모델 offloading을 통해 메모리 최적화하기

out-of-memory 에러가 발생한다면, [`StableDiffusionXLPipeline.enable_model_cpu_offload`]를 사용하는 것을 권장합니다.

```diff
- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()
```

그리고

```diff
- refiner.to("cuda")
+ refiner.enable_model_cpu_offload()
```
|
||||
|
||||
### `torch.compile`로 추론 속도 올리기

`torch.compile`을 사용하여 추론 속도를 올릴 수 있습니다. 약 20%의 속도 향상을 기대할 수 있습니다.

```diff
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
```
|
||||
|
||||
### `torch < 2.0`일 때 실행하기

**참고** Stable Diffusion XL을 2.0 미만 버전의 `torch`에서 실행하고 싶다면, xformers 어텐션을 사용해주세요:

```bash
pip install xformers
```

```diff
+pipe.enable_xformers_memory_efficient_attention()
+refiner.enable_xformers_memory_efficient_attention()
```
|
||||
|
||||
## StableDiffusionXLPipeline

[[autodoc]] StableDiffusionXLPipeline
	- all
	- __call__

## StableDiffusionXLImg2ImgPipeline

[[autodoc]] StableDiffusionXLImg2ImgPipeline
	- all
	- __call__

## StableDiffusionXLInpaintPipeline

[[autodoc]] StableDiffusionXLInpaintPipeline
	- all
	- __call__
|
||||
|
||||
### 각 텍스트 인코더에 다른 프롬프트를 전달하기

Stable Diffusion XL은 두 개의 텍스트 인코더로 학습되었습니다. 기본 동작은 각 텍스트 인코더에 동일한 프롬프트를 전달하는 것입니다. 그러나 [일부 사용자](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201)가 품질을 향상시킬 수 있다고 지적한 것처럼, 텍스트 인코더마다 다른 프롬프트를 전달할 수 있습니다. 그렇게 하려면 `prompt`와 `negative_prompt`에 더해 `prompt_2`와 `negative_prompt_2`를 전달하면 됩니다. 이 경우 원래의 프롬프트(`prompt`)와 부정 프롬프트(`negative_prompt`)는 `text_encoder`(공식 SDXL 0.9/1.0의 [OpenAI CLIP-ViT/L-14](https://huggingface.co/openai/clip-vit-large-patch14))에 전달되고, `prompt_2`와 `negative_prompt_2`는 `text_encoder_2`(공식 SDXL 0.9/1.0의 [OpenCLIP-ViT/bigG-14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k))에 전달됩니다.
|
||||
|
||||
```py
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

# OAI CLIP-ViT/L-14에 prompt가 전달됩니다
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# OpenCLIP-ViT/bigG-14에 prompt_2가 전달됩니다
prompt_2 = "monet painting"
image = pipe(prompt=prompt, prompt_2=prompt_2).images[0]
```
|
||||
@@ -16,82 +16,48 @@ specific language governing permissions and limitations under the License.
|
||||
<br>
|
||||
</p>
|
||||
|
||||
# 🧨 Diffusers
|
||||
|
||||
# Diffusers
|
||||
🤗 Diffusers는 사전학습된 비전 및 오디오 확산 모델을 제공하고, 추론 및 학습을 위한 모듈식 도구 상자 역할을 합니다.
|
||||
|
||||
🤗 Diffusers는 이미지, 오디오, 심지어 분자의 3D 구조를 생성하기 위한 최첨단 사전 훈련된 diffusion 모델을 위한 라이브러리입니다. 간단한 추론 솔루션을 찾고 있든, 자체 diffusion 모델을 훈련하고 싶든, 🤗 Diffusers는 두 가지 모두를 지원하는 모듈식 툴박스입니다. 저희 라이브러리는 [성능보다 사용성](conceptual/philosophy#usability-over-performance), [간편함보다 단순함](conceptual/philosophy#simple-over-easy), 그리고 [추상화보다 사용자 지정 가능성](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction)에 중점을 두고 설계되었습니다.
|
||||
보다 정확하게, 🤗 Diffusers는 다음을 제공합니다:
|
||||
|
||||
이 라이브러리에는 세 가지 주요 구성 요소가 있습니다:
|
||||
- 단 몇 줄의 코드로 추론을 실행할 수 있는 최신 확산 파이프라인을 제공합니다. ([**Using Diffusers**](./using-diffusers/conditional_image_generation)를 살펴보세요) 지원되는 모든 파이프라인과 해당 논문에 대한 개요를 보려면 [**Pipelines**](#pipelines)을 살펴보세요.
|
||||
- 추론에서 속도 vs 품질의 절충을 위해 상호교환적으로 사용할 수 있는 다양한 노이즈 스케줄러를 제공합니다. 자세한 내용은 [**Schedulers**](./api/schedulers/overview)를 참고하세요.
|
||||
- UNet과 같은 여러 유형의 모델을 end-to-end 확산 시스템의 구성 요소로 사용할 수 있습니다. 자세한 내용은 [**Models**](./api/models)을 참고하세요.
|
||||
- 가장 인기있는 확산 모델 테스크를 학습하는 방법을 보여주는 예제들을 제공합니다. 자세한 내용은 [**Training**](./training/overview)를 참고하세요.
|
||||
|
||||
- 몇 줄의 코드만으로 추론할 수 있는 최첨단 [diffusion 파이프라인](api/pipelines/overview).
|
||||
- 생성 속도와 품질 간의 균형을 맞추기 위해 상호교환적으로 사용할 수 있는 [노이즈 스케줄러](api/schedulers/overview).
|
||||
- 빌딩 블록으로 사용할 수 있고 스케줄러와 결합하여 자체적인 end-to-end diffusion 시스템을 만들 수 있는 사전 학습된 [모델](api/models).
|
||||
## 🧨 Diffusers 파이프라인
|
||||
|
||||
<div class="mt-10">
|
||||
<div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
|
||||
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./tutorials/tutorial_overview"
|
||||
><div class="w-full text-center bg-gradient-to-br from-blue-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Tutorials</div>
|
||||
<p class="text-gray-700">결과물을 생성하고, 나만의 diffusion 시스템을 구축하고, 확산 모델을 훈련하는 데 필요한 기본 기술을 배워보세요. 🤗 Diffusers를 처음 사용하는 경우 여기에서 시작하는 것이 좋습니다!</p>
|
||||
</a>
|
||||
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./using-diffusers/loading_overview"
|
||||
><div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">How-to guides</div>
|
||||
<p class="text-gray-700">파이프라인, 모델, 스케줄러를 로드하는 데 도움이 되는 실용적인 가이드입니다. 또한 특정 작업에 파이프라인을 사용하고, 출력 생성 방식을 제어하고, 추론 속도에 맞게 최적화하고, 다양한 학습 기법을 사용하는 방법도 배울 수 있습니다.</p>
|
||||
</a>
|
||||
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./conceptual/philosophy"
|
||||
><div class="w-full text-center bg-gradient-to-br from-pink-400 to-pink-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Conceptual guides</div>
|
||||
<p class="text-gray-700">라이브러리가 왜 이런 방식으로 설계되었는지 이해하고, 라이브러리 이용에 대한 윤리적 가이드라인과 안전 구현에 대해 자세히 알아보세요.</p>
|
||||
</a>
|
||||
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./api/models"
|
||||
><div class="w-full text-center bg-gradient-to-br from-purple-400 to-purple-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Reference</div>
|
||||
<p class="text-gray-700">🤗 Diffusers 클래스 및 메서드의 작동 방식에 대한 기술 설명.</p>
|
||||
</a>
|
||||
</div>
|
||||
</div>
|
||||
다음 표에는 공식적으로 지원되는 모든 파이프라인, 관련 논문, 직접 사용해 볼 수 있는 Colab 노트북(사용 가능한 경우)이 요약되어 있습니다.
|
||||
|
||||
## Supported pipelines
|
||||
| Pipeline | Paper | Tasks | Colab
|
||||
|---|---|:---:|:---:|
|
||||
| [alt_diffusion](./api/pipelines/alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation |
|
||||
| [audio_diffusion](./api/pipelines/audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation | [](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/audio_diffusion_pipeline.ipynb)
|
||||
| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
|
||||
| [dance_diffusion](./api/pipelines/dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
|
||||
| [ddpm](./api/pipelines/ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
|
||||
| [ddim](./api/pipelines/ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
|
||||
| [latent_diffusion](./api/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
|
||||
| [latent_diffusion](./api/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
|
||||
| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
|
||||
| [paint_by_example](./api/pipelines/paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
|
||||
| [pndm](./api/pipelines/pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
|
||||
| [score_sde_ve](./api/pipelines/score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
|
||||
| [score_sde_vp](./api/pipelines/score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
|
||||
| [stable_diffusion](./api/pipelines/stable_diffusion/text2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb)
|
||||
| [stable_diffusion](./api/pipelines/stable_diffusion/img2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb)
|
||||
| [stable_diffusion](./api/pipelines/stable_diffusion/inpaint) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb)
|
||||
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation |
|
||||
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
|
||||
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
|
||||
| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | [](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb)
|
||||
| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
|
||||
| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
|
||||
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
|
||||
| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
|
||||
|
||||
| Pipeline | Paper/Repository | Tasks |
|
||||
|---|---|:---:|
|
||||
| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation |
|
||||
| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation |
|
||||
| [controlnet](./api/pipelines/stable_diffusion/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation |
|
||||
| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
|
||||
| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
|
||||
| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
|
||||
| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
|
||||
| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation |
|
||||
| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation |
|
||||
| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation |
|
||||
| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
|
||||
| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
|
||||
| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
|
||||
| [paint_by_example](./api/pipelines/paint_by_example) | [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
|
||||
| [pndm](./api/pipelines/pndm) | [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
|
||||
| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
|
||||
| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
|
||||
| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation |
|
||||
| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation |
|
||||
| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation |
|
||||
| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting |
|
||||
| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [MultiDiffusion](https://multidiffusion.github.io/) | Text-to-Panorama Generation |
|
||||
| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing|
|
||||
| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
|
||||
| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
|
||||
| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation |
|
||||
| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
|
||||
| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
|
||||
| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing |
|
||||
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation |
|
||||
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
|
||||
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Depth-Conditional Stable Diffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation |
|
||||
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
|
||||
| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [Safe Stable Diffusion](https://arxiv.org/abs/2211.05105) | Text-Guided Generation |
|
||||
| [stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation |
|
||||
| [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation |
|
||||
| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
|
||||
| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
|
||||
| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
|
||||
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
|
||||
| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
|
||||
**참고**: 파이프라인은 해당 문서에 설명된 대로 확산 시스템을 사용한 방법에 대한 간단한 예입니다.
|
||||
|
||||
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
사용하시는 라이브러리에 맞는 🤗 Diffusers를 설치하세요.
|
||||
|
||||
🤗 Diffusers는 Python 3.8+, PyTorch 1.7.0+ 및 Flax에서 테스트되었습니다. 사용 중인 딥러닝 라이브러리에 대한 아래의 설치 안내를 따르세요.
|
||||
|
||||
- [PyTorch 설치 안내](https://pytorch.org/get-started/locally/)
|
||||
- [Flax 설치 안내](https://flax.readthedocs.io/en/latest/)
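
예를 들어 PyTorch와 함께 사용하는 경우, 일반적으로 다음과 같이 설치할 수 있습니다. 아래 명령어는 공식 설치 문서의 기본 설치 방식을 가정한 예시입니다:

```bash
pip install diffusers["torch"] transformers
```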
|
||||
@@ -105,7 +105,7 @@ pip install -e ".[flax]"
|
||||
|
||||
이러한 명령어들은 저장소를 복제한 폴더와 Python 라이브러리 경로를 연결합니다.
|
||||
Python은 이제 일반 라이브러리 경로에 더하여 복제한 폴더 내부를 살펴봅니다.
|
||||
예를 들어 Python 패키지가 `~/anaconda3/envs/main/lib/python3.8/site-packages/`에 설치되어 있는 경우 Python은 복제한 폴더인 `~/diffusers/`도 검색합니다.
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
|
||||
@@ -1,168 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Core ML로 Stable Diffusion을 실행하는 방법
|
||||
|
||||
[Core ML](https://developer.apple.com/documentation/coreml)은 Apple 프레임워크에서 지원하는 모델 형식 및 머신 러닝 라이브러리입니다. macOS 또는 iOS/iPadOS 앱 내에서 Stable Diffusion 모델을 실행하는 데 관심이 있는 경우, 이 가이드에서는 기존 PyTorch 체크포인트를 Core ML 형식으로 변환하고 이를 Python 또는 Swift로 추론에 사용하는 방법을 설명합니다.
|
||||
|
||||
Core ML 모델은 Apple 기기에서 사용할 수 있는 모든 컴퓨팅 엔진들, 즉 CPU, GPU, Apple Neural Engine(또는 Apple Silicon Mac 및 최신 iPhone/iPad에서 사용할 수 있는 텐서 최적화 가속기인 ANE)을 활용할 수 있습니다. 모델과 실행 중인 기기에 따라 Core ML은 컴퓨팅 엔진도 혼합하여 사용할 수 있으므로, 예를 들어 모델의 일부가 CPU에서 실행되는 반면 다른 부분은 GPU에서 실행될 수 있습니다.
|
||||
|
||||
<Tip>
|
||||
|
||||
PyTorch에 내장된 `mps` 가속기를 사용하여 Apple Silicon Mac에서 `diffusers` Python 코드베이스를 실행할 수도 있습니다. 이 방법은 [mps 가이드](./optimization/mps)에 자세히 설명되어 있지만, 네이티브 앱과는 호환되지 않습니다.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Stable Diffusion Core ML 체크포인트
|
||||
|
||||
Stable Diffusion 가중치(또는 체크포인트)는 PyTorch 형식으로 저장되기 때문에 네이티브 앱에서 사용하기 위해서는 Core ML 형식으로 변환해야 합니다.
|
||||
|
||||
다행히도 Apple 엔지니어들이 `diffusers`를 기반으로 한 [변환 툴](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml)을 개발하여 PyTorch 체크포인트를 Core ML로 변환할 수 있습니다.
|
||||
|
||||
모델을 변환하기 전에 잠시 시간을 내어 Hugging Face Hub를 살펴보세요. 관심 있는 모델이 이미 Core ML 형식으로 제공되고 있을 가능성이 높습니다:
|
||||
|
||||
- [Apple](https://huggingface.co/apple) organization에는 Stable Diffusion 버전 1.4, 1.5, 2.0 base 및 2.1 base가 포함되어 있습니다.
|
||||
- [coreml](https://huggingface.co/coreml) organization에는 커스텀 DreamBooth가 적용되거나, 파인튜닝된 모델이 포함되어 있습니다.
|
||||
- 이 [필터](https://huggingface.co/models?pipeline_tag=text-to-image&library=coreml&p=2&sort=likes)를 사용하여 사용 가능한 모든 Core ML 체크포인트들을 반환합니다.
|
||||
|
||||
원하는 모델을 찾을 수 없는 경우 Apple의 [모델을 Core ML로 변환하기](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) 지침을 따르는 것이 좋습니다.
|
||||
|
||||
## 사용할 Core ML 변형(Variant) 선택하기
|
||||
|
||||
Stable Diffusion 모델은 다양한 목적에 따라 다른 Core ML 변형으로 변환할 수 있습니다:
|
||||
|
||||
- 사용되는 어텐션 블록 유형. 어텐션 연산은 이미지 표현의 여러 영역 간의 관계에 '주의를 기울이고' 이미지와 텍스트 표현이 어떻게 연관되어 있는지 이해하는 데 사용됩니다. 어텐션 연산은 컴퓨팅 및 메모리 집약적이므로 다양한 장치의 하드웨어 특성을 고려한 여러 구현이 존재합니다. Core ML Stable Diffusion 모델의 경우 두 가지 어텐션 변형이 있습니다:
  * `split_einsum`([Apple에서 도입](https://machinelearning.apple.com/research/neural-engine-transformers))은 최신 iPhone, iPad 및 M 시리즈 컴퓨터에서 사용할 수 있는 ANE 장치에 최적화되어 있습니다.
  * "원본" 어텐션(`diffusers`에 사용되는 기본 구현)은 CPU/GPU와만 호환되며 ANE와는 호환되지 않습니다. "원본" 어텐션을 사용하여 CPU + GPU에서 모델을 실행하는 것이 ANE보다 *더* 빠를 수 있습니다. 자세한 내용은 [이 성능 벤치마크](https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks)와 커뮤니티에서 제공하는 일부 [추가 측정](https://github.com/huggingface/swift-coreml-diffusers/issues/31)을 참조하십시오.
|
||||
|
||||
- 지원되는 추론 프레임워크
  * `packages`는 Python 추론에 적합합니다. 네이티브 앱에 통합하기 전에 변환된 Core ML 모델을 테스트하거나, Core ML 성능을 알고 싶지만 네이티브 앱을 지원할 필요는 없는 경우에 사용할 수 있습니다. 예를 들어, 웹 UI가 있는 애플리케이션은 Python Core ML 백엔드를 완벽하게 사용할 수 있습니다.
  * Swift 코드에는 `compiled` 모델이 필요합니다. Hub의 `compiled` 모델은 iOS 및 iPadOS 기기와의 호환성을 위해 큰 UNet 모델 가중치를 여러 파일로 분할합니다. 이는 [`--chunk-unet` 변환 옵션](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml)에 해당합니다. 네이티브 앱을 지원하려면 `compiled` 변형을 선택해야 합니다.
|
||||
|
||||
공식 Core ML Stable Diffusion [모델](https://huggingface.co/apple/coreml-stable-diffusion-v1-4/tree/main)에는 이러한 변형이 포함되어 있지만 커뮤니티 버전은 다를 수 있습니다:

```
coreml-stable-diffusion-v1-4
├── README.md
├── original
│   ├── compiled
│   └── packages
└── split_einsum
    ├── compiled
    └── packages
```
|
||||
|
||||
아래와 같이 필요한 변형을 다운로드하여 사용할 수 있습니다.
|
||||
|
||||
## Python에서 Core ML 추론
|
||||
|
||||
Python에서 Core ML 추론을 실행하려면 다음 라이브러리를 설치하세요:

```bash
pip install huggingface_hub
pip install git+https://github.com/apple/ml-stable-diffusion
```
|
||||
|
||||
### 모델 체크포인트 다운로드하기
|
||||
|
||||
`compiled` 버전은 Swift와만 호환되므로, Python에서 추론을 실행하려면 `packages` 폴더에 저장된 버전 중 하나를 사용하세요. `original` 또는 `split_einsum` 어텐션 중 어느 것을 사용할지 선택할 수 있습니다.

다음은 Hub에서 `original` 어텐션 변형을 `models`라는 디렉토리로 다운로드하는 방법입니다:
|
||||
|
||||
```Python
from huggingface_hub import snapshot_download
from pathlib import Path

repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"

model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
print(f"Model downloaded at {model_path}")
```
|
||||
|
||||
|
||||
### 추론[[python-inference]]
|
||||
|
||||
모델의 snapshot을 다운로드한 후에는 Apple의 Python 스크립트를 사용하여 테스트할 수 있습니다.

```shell
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i models/coreml-stable-diffusion-v1-4_original_packages -o </path/to/output/image> --compute-unit CPU_AND_GPU --seed 93
```
|
||||
|
||||
`<output-mlpackages-directory>`는 위 단계에서 다운로드한 체크포인트를 가리켜야 하며, `--compute-unit`은 추론을 허용할 하드웨어를 나타냅니다. 이는 다음 옵션 중 하나이어야 합니다: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. 선택적 출력 경로와 재현성을 위한 시드를 제공할 수도 있습니다.
|
||||
|
||||
추론 스크립트에서는 Stable Diffusion 모델의 원래 버전인 `CompVis/stable-diffusion-v1-4`를 사용한다고 가정합니다. 다른 모델을 사용하는 경우 추론 명령줄에서 `--model-version` 옵션을 사용하여 해당 허브 ID를 *지정*해야 합니다. 이는 이미 지원되는 모델과 사용자가 직접 학습하거나 파인튜닝한 사용자 지정 모델에 적용됩니다.
|
||||
|
||||
예를 들어, [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)를 사용하려는 경우입니다:
|
||||
|
||||
```shell
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" --compute-unit ALL -o output --seed 93 -i models/coreml-stable-diffusion-v1-5_original_packages --model-version runwayml/stable-diffusion-v1-5
```
|
||||
|
||||
|
||||
## Swift에서 Core ML 추론하기
|
||||
|
||||
Swift에서 추론을 실행하는 것은 모델이 이미 `mlmodelc` 형식으로 컴파일되어 있기 때문에 Python보다 약간 빠릅니다. 이로 인한 차이는 앱이 시작되며 모델을 불러올 때는 눈에 띄지만, 이후 여러 번 생성을 실행할 때는 거의 체감되지 않습니다.
|
||||
|
||||
### 다운로드
|
||||
|
||||
Mac에서 Swift로 추론을 실행하려면 `compiled` 체크포인트 버전 중 하나가 필요합니다. 이전 예제와 유사하되 `compiled` 변형 중 하나를 사용하는 Python 코드로 로컬에 다운로드하는 것이 좋습니다:
|
||||
|
||||
```Python
from huggingface_hub import snapshot_download
from pathlib import Path

repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/compiled"

model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
print(f"Model downloaded at {model_path}")
```
|
||||
|
||||
### 추론[[swift-inference]]
|
||||
|
||||
추론을 실행하기 위해서, Apple의 리포지토리를 복제하세요:

```bash
git clone https://github.com/apple/ml-stable-diffusion
cd ml-stable-diffusion
```
|
||||
|
||||
그 다음 Apple의 명령줄 도구인 [Swift 패키지 관리자](https://www.swift.org/package-manager/#)를 사용합니다:

```bash
swift run StableDiffusionSample --resource-path models/coreml-stable-diffusion-v1-4_original_compiled --compute-units all "a photo of an astronaut riding a horse on mars"
```
|
||||
|
||||
`--resource-path`에 이전 단계에서 다운로드한 체크포인트 중 하나를 지정해야 하므로 확장자가 `.mlmodelc`인 컴파일된 Core ML 번들이 포함되어 있는지 확인하시기 바랍니다. `--compute-units`는 다음 값 중 하나이어야 합니다: `all`, `cpuOnly`, `cpuAndGPU`, `cpuAndNeuralEngine`.
|
||||
|
||||
자세한 내용은 [Apple의 리포지토리 안의 지침](https://github.com/apple/ml-stable-diffusion)을 참고하시기 바랍니다.
|
||||
|
||||
|
||||
## 지원되는 Diffusers 기능
|
||||
|
||||
Core ML 모델과 추론 코드는 🧨 Diffusers의 많은 기능, 옵션 및 유연성을 지원하지 않습니다. 다음은 유의해야 할 몇 가지 제한 사항입니다:
|
||||
|
||||
- Core ML 모델은 추론에만 적합합니다. 학습이나 파인튜닝에는 사용할 수 없습니다.
|
||||
- Swift에 포팅된 스케줄러는 Stable Diffusion에서 사용하는 기본 스케줄러와 `diffusers` 구현에서 Swift로 포팅한 `DPMSolverMultistepScheduler` 두 개뿐입니다. 이들 중 약 절반의 스텝으로 동일한 품질을 생성하는 `DPMSolverMultistepScheduler`를 사용하는 것이 좋습니다.
|
||||
- 추론 코드에서 네거티브 프롬프트, classifier-free guidance scale 및 image-to-image 작업을 사용할 수 있습니다. depth guidance, ControlNet, latent upscalers와 같은 고급 기능은 아직 사용할 수 없습니다.
|
||||
|
||||
Apple의 [변환 및 추론 리포지토리](https://github.com/apple/ml-stable-diffusion)와 자체 [swift-coreml-diffusers](https://github.com/huggingface/swift-coreml-diffusers) 리포지토리는 다른 개발자들이 구축할 수 있는 기술적인 데모입니다.
|
||||
|
||||
누락된 기능이 있다고 생각되면 언제든지 기능을 요청하거나, 더 좋은 방법은 기여 PR을 열어주세요. :)
|
||||
|
||||
|
||||
## 네이티브 Diffusers Swift 앱
|
||||
|
||||
자체 Apple 하드웨어에서 Stable Diffusion을 실행하는 쉬운 방법 중 하나는 `diffusers`와 Apple의 변환 및 추론 리포지토리를 기반으로 하는 [자체 오픈 소스 Swift 리포지토리](https://github.com/huggingface/swift-coreml-diffusers)를 사용하는 것입니다. 코드를 공부하고 [Xcode](https://developer.apple.com/xcode/)로 컴파일하여 필요에 맞게 조정할 수 있습니다. 편의를 위해 앱스토어에 [독립형 Mac 앱](https://apps.apple.com/app/diffusers/id1666309574)도 있으므로 코드나 IDE를 다루지 않고도 사용할 수 있습니다. 개발자로서 Core ML이 Stable Diffusion 앱을 구축하는 데 가장 적합한 솔루션이라고 판단했다면, 이 가이드의 나머지 부분을 사용하여 프로젝트를 시작할 수 있습니다. 여러분이 무엇을 빌드할지 기대됩니다. :)
|
||||
@@ -1,121 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Token Merging (토큰 병합)
|
||||
|
||||
Token Merging (introduced in [Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461))은 트랜스포머 기반 네트워크의 forward pass에서 중복 토큰이나 패치를 점진적으로 병합하는 방식으로 작동합니다. 이를 통해 기반 네트워크의 추론 지연 시간을 단축할 수 있습니다.
|
||||
|
||||
Token Merging(ToMe)이 출시된 후, 저자들은 [Fast Stable Diffusion을 위한 토큰 병합](https://arxiv.org/abs/2303.17604)을 발표하여 Stable Diffusion과 더 잘 호환되는 ToMe 버전을 소개했습니다. ToMe를 사용하면 [`DiffusionPipeline`]의 추론 지연 시간을 부드럽게 단축할 수 있습니다. 이 문서에서는 ToMe를 [`StableDiffusionPipeline`]에 적용하는 방법, 예상되는 속도 향상, [`StableDiffusionPipeline`]에서 ToMe를 사용할 때의 질적 측면에 대해 설명합니다.
|
||||
|
||||
## ToMe 사용하기
|
||||
|
||||
ToMe의 저자들은 [`tomesd`](https://github.com/dbolya/tomesd)라는 편리한 Python 라이브러리를 공개했는데, 이 라이브러리를 이용하면 [`DiffusionPipeline`]에 ToMe를 다음과 같이 적용할 수 있습니다:
|
||||
|
||||
```diff
from diffusers import StableDiffusionPipeline
+ import torch
import tomesd

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
+ tomesd.apply_patch(pipeline, ratio=0.5)

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```
|
||||
|
||||
이것이 다입니다!

`tomesd.apply_patch()`는 파이프라인 추론 속도와 생성된 토큰의 품질 사이의 균형을 맞출 수 있도록 [여러 개의 인자](https://github.com/dbolya/tomesd#usage)를 노출합니다. 이러한 인자 중 가장 중요한 것은 `ratio`(비율)입니다. `ratio`는 forward pass 중에 병합될 토큰의 수를 제어합니다. `tomesd`에 대한 자세한 내용은 해당 [리포지토리](https://github.com/dbolya/tomesd) 및 [논문](https://arxiv.org/abs/2303.17604)을 참고하시기 바랍니다.
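
예를 들어, 더 큰 `ratio` 값을 사용하면 속도는 더 빨라지지만 품질 손실이 커질 수 있습니다. 아래는 이를 가정한 간단한 스케치이며, `0.75`라는 값 자체는 임의의 예시입니다:

```py
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# ratio를 높일수록 더 많은 토큰이 병합되어 추론이 빨라지는 대신 품질이 저하될 수 있습니다
tomesd.apply_patch(pipeline, ratio=0.75)

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```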
|
||||
|
||||
## `StableDiffusionPipeline`으로 `tomesd` 벤치마킹하기
|
||||
|
||||
다양한 이미지 해상도에서 [xformers](https://huggingface.co/docs/diffusers/optimization/xformers)를 적용한 상태에서, [`StableDiffusionPipeline`]에 `tomesd`를 사용했을 때의 영향을 벤치마킹했습니다. 테스트 GPU 장치로 A100과 V100을 사용했으며 개발 환경은 다음과 같습니다(Python 3.8.5 사용):
|
||||
|
||||
```bash
- `diffusers` version: 0.15.1
- Python version: 3.8.16
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Huggingface_hub version: 0.13.2
- Transformers version: 4.27.2
- Accelerate version: 0.18.0
- xFormers version: 0.0.16
- tomesd version: 0.1.2
```
|
||||
|
||||
벤치마킹에는 다음 스크립트를 사용했습니다: [https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). 결과는 다음과 같습니다:
|
||||
|
||||
### A100

| 해상도 | 배치 크기 | Vanilla | ToMe | ToMe + xFormers | ToMe 속도 향상 (%) | ToMe + xFormers 속도 향상 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 6.88 | 5.26 | 4.69 | 23.54651163 | 31.83139535 |
| | | | | | | |
| 768 | 10 | OOM | 14.71 | 11 | | |
| | 8 | OOM | 11.56 | 8.84 | | |
| | 4 | OOM | 5.98 | 4.66 | | |
| | 2 | 4.99 | 3.24 | 3.1 | 35.07014028 | 37.8757515 |
| | 1 | 3.29 | 2.24 | 2.03 | 31.91489362 | 38.29787234 |
| | | | | | | |
| 1024 | 10 | OOM | OOM | OOM | | |
| | 8 | OOM | OOM | OOM | | |
| | 4 | OOM | 12.51 | 9.09 | | |
| | 2 | OOM | 6.52 | 4.96 | | |
| | 1 | 6.4 | 3.61 | 2.81 | 43.59375 | 56.09375 |

***결과는 초 단위입니다. 속도 향상은 `Vanilla`과 비교해 계산됩니다.***
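
표의 속도 향상(%) 값은 `Vanilla` 대비 지연 시간의 상대적 감소율로 계산된 것으로 보입니다. 아래는 표의 첫 행 값으로 이를 확인해 보는 간단한 예시입니다:

```py
def speedup_percent(vanilla_latency: float, optimized_latency: float) -> float:
    """Vanilla 대비 지연 시간 감소율(%)을 계산합니다."""
    return (vanilla_latency - optimized_latency) / vanilla_latency * 100

# A100, 해상도 512, 배치 크기 10인 행의 값으로 확인
print(speedup_percent(6.88, 5.26))  # 약 23.55 (표의 ToMe 속도 향상과 일치)
print(speedup_percent(6.88, 4.69))  # 약 31.83 (표의 ToMe + xFormers 속도 향상과 일치)
```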
|
||||
|
||||
### V100

| 해상도 | 배치 크기 | Vanilla | ToMe | ToMe + xFormers | ToMe 속도 향상 (%) | ToMe + xFormers 속도 향상 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | OOM | 10.03 | 9.29 | | |
| | 8 | OOM | 8.05 | 7.47 | | |
| | 4 | 5.7 | 4.3 | 3.98 | 24.56140351 | 30.1754386 |
| | 2 | 3.14 | 2.43 | 2.27 | 22.61146497 | 27.70700637 |
| | 1 | 1.88 | 1.57 | 1.57 | 16.4893617 | 16.4893617 |
| | | | | | | |
| 768 | 10 | OOM | OOM | 23.67 | | |
| | 8 | OOM | OOM | 18.81 | | |
| | 4 | OOM | 11.81 | 9.7 | | |
| | 2 | OOM | 6.27 | 5.2 | | |
| | 1 | 5.43 | 3.38 | 2.82 | 37.75322284 | 48.06629834 |
| | | | | | | |
| 1024 | 10 | OOM | OOM | OOM | | |
| | 8 | OOM | OOM | OOM | | |
| | 4 | OOM | OOM | 19.35 | | |
| | 2 | OOM | 13 | 10.78 | | |
| | 1 | OOM | 6.66 | 5.54 | | |
|
||||
|
||||
위의 표에서 볼 수 있듯이, 이미지 해상도가 높을수록 `tomesd`를 사용한 속도 향상이 더욱 두드러집니다. 또한 `tomesd`를 사용하면 1024x1024와 같은 더 높은 해상도에서 파이프라인을 실행할 수 있다는 점도 흥미롭습니다.
|
||||
|
||||
[`torch.compile()`](https://huggingface.co/docs/diffusers/optimization/torch2.0)을 사용하면 추론 속도를 더욱 높일 수 있습니다.
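
예를 들어 `tomesd` 패치를 적용한 뒤 UNet에 `torch.compile`을 함께 적용해 볼 수 있습니다. 아래는 PyTorch 2.0 이상을 가정한 간단한 스케치입니다:

```py
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

tomesd.apply_patch(pipeline, ratio=0.5)

# UNet을 컴파일하면 이후 호출부터 추론이 빨라질 수 있습니다 (첫 호출은 컴파일 때문에 느릴 수 있습니다)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```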
|
||||
|
||||
## 품질
|
||||
|
||||
[논문](https://arxiv.org/abs/2303.17604)에 보고된 바와 같이, ToMe는 생성된 이미지의 품질을 상당 부분 보존하면서 추론 속도를 높일 수 있습니다. `ratio`를 높이면 추론 속도를 더 높일 수 있지만, 이미지 품질이 저하될 수 있습니다.

해당 설정을 사용하여 생성된 샘플의 품질을 테스트하기 위해, "Parti 프롬프트"([Parti](https://parti.research.google/)에서 소개)에서 몇 가지 프롬프트를 샘플링하고 다음 설정에서 [`StableDiffusionPipeline`]을 사용하여 추론을 수행했습니다:
|
||||
|
||||
- Vanilla [`StableDiffusionPipeline`]
|
||||
- [`StableDiffusionPipeline`] + ToMe
|
||||
- [`StableDiffusionPipeline`] + ToMe + xformers
|
||||
|
||||
생성된 샘플의 품질이 크게 저하되는 것을 발견하지 못했습니다. 다음은 샘플입니다:
|
||||
|
||||

|
||||
|
||||
생성된 샘플은 [여기](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=)에서 확인할 수 있습니다. 이 실험을 수행하기 위해 [이 스크립트](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd)를 사용했습니다.
|
||||
@@ -9,59 +9,43 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
[[open-in-colab]]
|
||||
|
||||
# 훑어보기
|
||||
|
||||
Diffusion 모델은 이미지나 오디오와 같은 관심 샘플들을 생성하기 위해 랜덤 가우시안 노이즈를 단계별로 제거하도록 학습됩니다. 이로 인해 생성 AI에 대한 관심이 매우 높아졌으며, 인터넷에서 diffusion 생성 이미지의 예를 본 적이 있을 것입니다. 🧨 Diffusers는 누구나 diffusion 모델들을 널리 이용할 수 있도록 하기 위한 라이브러리입니다.
|
||||
🧨 Diffusers로 빠르게 시작하고 실행하세요!
|
||||
이 훑어보기는 여러분이 개발자, 일반사용자 상관없이 시작하는 데 도움을 주며, 추론을 위해 [`DiffusionPipeline`] 사용하는 방법을 보여줍니다.
|
||||
|
||||
개발자든 일반 사용자든 이 훑어보기를 통해 🧨 diffusers를 소개하고 빠르게 생성할 수 있도록 도와드립니다! 알아야 할 라이브러리의 주요 구성 요소는 크게 세 가지입니다:
|
||||
시작하기에 앞서서, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요:
|
||||
|
||||
* [`DiffusionPipeline`]은 추론을 위해 사전 학습된 diffusion 모델에서 샘플을 빠르게 생성하도록 설계된 높은 수준의 엔드투엔드 클래스입니다.
|
||||
* Diffusion 시스템 생성을 위한 빌딩 블록으로 사용할 수 있는 널리 사용되는 사전 학습된 [model](./api/models) 아키텍처 및 모듈.
|
||||
* 다양한 [schedulers](./api/schedulers/overview) - 학습을 위해 노이즈를 추가하는 방법과 추론 중에 노이즈 제거된 이미지를 생성하는 방법을 제어하는 알고리즘입니다.
|
||||
|
||||
훑어보기에서는 추론을 위해 [`DiffusionPipeline`]을 사용하는 방법을 보여준 다음, 모델과 스케줄러를 결합하여 [`DiffusionPipeline`] 내부에서 일어나는 일을 복제하는 방법을 안내합니다.
|
||||
|
||||
<Tip>
|
||||
|
||||
훑어보기는 간결한 버전의 🧨 Diffusers 소개로서 [노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) 빠르게 시작할 수 있도록 도와드립니다. 디퓨저의 목표, 디자인 철학, 핵심 API에 대한 추가 세부 정보를 자세히 알아보려면 노트북을 확인하세요!
|
||||
|
||||
</Tip>
|
||||
|
||||
시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요:
|
||||
|
||||
```bash
# Colab에서 실행하는 경우에도 아래 명령어로 필요한 라이브러리를 설치할 수 있습니다.
pip install --upgrade diffusers accelerate transformers
```
|
||||
|
||||
- [🤗 Accelerate](https://huggingface.co/docs/accelerate/index)는 추론 및 학습을 위한 모델 로딩 속도를 높여줍니다.
|
||||
- [🤗 Transformers](https://huggingface.co/docs/transformers/index)는 [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview)과 같이 가장 많이 사용되는 diffusion 모델을 실행하는 데 필요합니다.
|
||||
- [`accelerate`](https://huggingface.co/docs/accelerate/index)은 추론 및 학습을 위한 모델 불러오기 속도를 높입니다.
|
||||
- [`transformers`](https://huggingface.co/docs/transformers/index)는 [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview)과 같이 가장 널리 사용되는 확산 모델을 실행하기 위해 필요합니다.
|
||||
|
||||
## DiffusionPipeline
|
||||
|
||||
[`DiffusionPipeline`] 은 추론을 위해 사전 학습된 diffusion 시스템을 사용하는 가장 쉬운 방법입니다. 모델과 스케줄러를 포함하는 엔드 투 엔드 시스템입니다. 다양한 작업에 [`DiffusionPipeline`]을 바로 사용할 수 있습니다. 아래 표에서 지원되는 몇 가지 작업을 살펴보고, 지원되는 작업의 전체 목록은 [🧨 Diffusers Summary](./api/pipelines/overview#diffusers-summary) 표에서 확인할 수 있습니다.
|
||||
[`DiffusionPipeline`]은 추론을 위해 사전학습된 확산 시스템을 사용하는 가장 쉬운 방법입니다. 다양한 양식의 많은 작업에 [`DiffusionPipeline`]을 바로 사용할 수 있습니다. 지원되는 작업은 아래의 표를 참고하세요:
|
||||
|
||||
| **Task** | **Description** | **Pipeline**
|
||||
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|
|
||||
| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) |
|
||||
| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) |
|
||||
| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) |
|
||||
| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) |
|
||||
| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) |
|
||||
| Unconditional Image Generation | 가우시안 노이즈에서 이미지 생성 | [unconditional_image_generation](./using-diffusers/unconditional_image_generation`) |
|
||||
| Text-Guided Image Generation | 텍스트 프롬프트로 이미지 생성 | [conditional_image_generation](./using-diffusers/conditional_image_generation) |
|
||||
| Text-Guided Image-to-Image Translation | 텍스트 프롬프트에 따라 이미지 조정 | [img2img](./using-diffusers/img2img) |
|
||||
| Text-Guided Image-Inpainting | 마스크 및 텍스트 프롬프트가 주어진 이미지의 마스킹된 부분을 채우기 | [inpaint](./using-diffusers/inpaint) |
|
||||
| Text-Guided Depth-to-Image Translation | 깊이 추정을 통해 구조를 유지하면서 텍스트 프롬프트에 따라 이미지의 일부를 조정 | [depth2image](./using-diffusers/depth2image) |
|
||||
|
||||
먼저 [`DiffusionPipeline`]의 인스턴스를 생성하고 다운로드할 파이프라인 체크포인트를 지정합니다.
|
||||
허깅페이스 허브에 저장된 모든 [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads)에 대해 [`DiffusionPipeline`]을 사용할 수 있습니다.
|
||||
이 훑어보기에서는 text-to-image 생성을 위한 [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) 체크포인트를 로드합니다.
|
||||
확산 파이프라인이 다양한 작업에 대해 어떻게 작동하는지는 [**Using Diffusers**](./using-diffusers/overview)를 참고하세요.
|
||||
|
||||
<Tip warning={true}>
|
||||
예를들어, [`DiffusionPipeline`] 인스턴스를 생성하여 시작하고, 다운로드하려는 파이프라인 체크포인트를 지정합니다.
|
||||
모든 [Diffusers' checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads)에 대해 [`DiffusionPipeline`]을 사용할 수 있습니다.
|
||||
하지만, 이 가이드에서는 [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion)을 사용하여 text-to-image를 하는데 [`DiffusionPipeline`]을 사용합니다.
|
||||
|
||||
[Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) 모델의 경우, 모델을 실행하기 전에 [라이선스](https://huggingface.co/spaces/CompVis/stable-diffusion-license)를 먼저 주의 깊게 읽어주세요. 🧨 Diffusers는 불쾌하거나 유해한 콘텐츠를 방지하기 위해 [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py)를 구현하고 있지만, 모델의 향상된 이미지 생성 기능으로 인해 여전히 잠재적으로 유해한 콘텐츠가 생성될 수 있습니다.
|
||||
[Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) 기반 모델을 실행하기 전에 [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license)를 주의 깊게 읽으세요.
|
||||
이는 모델의 향상된 이미지 생성 기능과 이것으로 생성될 수 있는 유해한 콘텐츠 때문입니다. 선택한 Stable Diffusion 모델(*예*: [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5))로 이동하여 라이센스를 읽으세요.
|
||||
|
||||
</Tip>
|
||||
|
||||
[`~DiffusionPipeline.from_pretrained`] 방법으로 모델 로드하기:
|
||||
다음과 같이 모델을 로드할 수 있습니다:
|
||||
|
||||
```python
|
||||
>>> from diffusers import DiffusionPipeline
|
||||
@@ -69,245 +53,71 @@ Diffusion 모델은 이미지나 오디오와 같은 관심 샘플들을 생성
|
||||
>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
|
||||
```
|
||||
|
||||
[`DiffusionPipeline`]은 모든 모델링, 토큰화, 스케줄링 컴포넌트를 다운로드하고 캐시합니다. Stable Diffusion Pipeline은 무엇보다도 [`UNet2DConditionModel`]과 [`PNDMScheduler`]로 구성되어 있음을 알 수 있습니다:
|
||||
|
||||
```py
|
||||
>>> pipeline
|
||||
StableDiffusionPipeline {
|
||||
"_class_name": "StableDiffusionPipeline",
|
||||
"_diffusers_version": "0.13.1",
|
||||
...,
|
||||
"scheduler": [
|
||||
"diffusers",
|
||||
"PNDMScheduler"
|
||||
],
|
||||
...,
|
||||
"unet": [
|
||||
"diffusers",
|
||||
"UNet2DConditionModel"
|
||||
],
|
||||
"vae": [
|
||||
"diffusers",
|
||||
"AutoencoderKL"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
이 모델은 약 14억 개의 파라미터로 구성되어 있으므로 GPU에서 파이프라인을 실행할 것을 강력히 권장합니다.
|
||||
PyTorch에서와 마찬가지로 제너레이터 객체를 GPU로 이동할 수 있습니다:
|
||||
[`DiffusionPipeline`]은 모든 모델링, 토큰화 및 스케줄링 구성요소를 다운로드하고 캐시합니다.
|
||||
모델은 약 14억개의 매개변수로 구성되어 있으므로 GPU에서 실행하는 것이 좋습니다.
|
||||
PyTorch에서와 마찬가지로 생성기 객체를 GPU로 옮길 수 있습니다.
|
||||
|
||||
```python
|
||||
>>> pipeline.to("cuda")
|
||||
```
|
||||
|
||||
이제 `파이프라인`에 텍스트 프롬프트를 전달하여 이미지를 생성한 다음 노이즈가 제거된 이미지에 액세스할 수 있습니다. 기본적으로 이미지 출력은 [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) 객체로 감싸집니다.
|
||||
이제 `pipeline`을 사용할 수 있습니다:
|
||||
|
||||
```python
|
||||
>>> image = pipeline("An image of a squirrel in Picasso style").images[0]
|
||||
>>> image
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/image_of_squirrel_painting.png"/>
|
||||
</div>
|
||||
출력은 기본적으로 [PIL Image object](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class)로 래핑됩니다.
|
||||
|
||||
`save`를 호출하여 이미지를 저장합니다:
|
||||
다음과 같이 함수를 호출하여 이미지를 저장할 수 있습니다:
|
||||
|
||||
```python
|
||||
>>> image.save("image_of_squirrel_painting.png")
|
||||
```
|
||||
|
||||
### 로컬 파이프라인
|
||||
**참고**: 다음을 통해 가중치를 다운로드하여 로컬에서 파이프라인을 사용할 수도 있습니다:
|
||||
|
||||
파이프라인을 로컬에서 사용할 수도 있습니다. 유일한 차이점은 가중치를 먼저 다운로드해야 한다는 점입니다:
|
||||
|
||||
```bash
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
```
|
||||
|
||||
그런 다음 저장된 가중치를 파이프라인에 로드합니다:
|
||||
그리고 저장된 가중치를 파이프라인에 불러옵니다.
|
||||
|
||||
```python
|
||||
>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5")
|
||||
```
|
||||
|
||||
이제 위 섹션에서와 같이 파이프라인을 실행할 수 있습니다.
|
||||
파이프라인 실행은 동일한 모델 아키텍처이므로 위의 코드와 동일합니다.
|
||||
|
||||
### 스케줄러 교체
|
||||
|
||||
|
||||
스케줄러마다 노이즈 제거 속도와 품질이 서로 다릅니다. 자신에게 가장 적합한 스케줄러를 찾는 가장 좋은 방법은 직접 사용해 보는 것입니다! 🧨 Diffusers의 주요 기능 중 하나는 스케줄러 간에 쉽게 전환이 가능하다는 것입니다. 예를 들어, 기본 스케줄러인 [`PNDMScheduler`]를 [`EulerDiscreteScheduler`]로 바꾸려면, [`~diffusers.ConfigMixin.from_config`] 메서드를 사용하여 로드하세요:
|
||||
확산 시스템은 각각 장점이 있는 여러 다른 [schedulers](./api/schedulers/overview)와 함께 사용할 수 있습니다. 기본적으로 Stable Diffusion은 `PNDMScheduler`로 실행되지만 다른 스케줄러를 사용하는 방법은 매우 간단합니다. *예* [`EulerDiscreteScheduler`] 스케줄러를 사용하려는 경우, 다음과 같이 사용할 수 있습니다:
|
||||
|
||||
```py
|
||||
```python
|
||||
>>> from diffusers import EulerDiscreteScheduler
|
||||
|
||||
>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
|
||||
>>> pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
|
||||
|
||||
>>> # change scheduler to Euler
|
||||
>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
|
||||
```
|
||||
|
||||
새 스케줄러로 이미지를 생성해보고 어떤 차이가 있는지 확인해 보세요!
|
||||
스케줄러 변경 방법에 대한 자세한 내용은 [Using Schedulers](./using-diffusers/schedulers) 가이드를 참고하세요.
|
||||
|
||||
다음 섹션에서는 모델과 스케줄러라는 [`DiffusionPipeline`]을 구성하는 컴포넌트를 자세히 살펴보고 이러한 컴포넌트를 사용하여 고양이 이미지를 생성하는 방법을 배워보겠습니다.
|
||||
[Stability AI's](https://stability.ai/)의 Stable Diffusion 모델은 인상적인 이미지 생성 모델이며 텍스트에서 이미지를 생성하는 것보다 훨씬 더 많은 작업을 수행할 수 있습니다. 우리는 Stable Diffusion만을 위한 전체 문서 페이지를 제공합니다 [link](./conceptual/stable_diffusion).
|
||||
|
||||
## 모델
|
||||
만약 더 적은 메모리, 더 높은 추론 속도, Mac과 같은 특정 하드웨어 또는 ONNX 런타임에서 실행되도록 Stable Diffusion을 최적화하는 방법을 알고 싶다면 최적화 페이지를 살펴보세요:
|
||||
|
||||
대부분의 모델은 노이즈가 있는 샘플을 가져와 각 시간 간격마다 노이즈가 적은 이미지와 입력 이미지 사이의 차이인 *노이즈 잔차*(다른 모델은 이전 샘플을 직접 예측하거나 속도 또는 [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)을 예측하는 학습을 합니다)을 예측합니다. 모델을 믹스 앤 매치하여 다른 diffusion 시스템을 만들 수 있습니다.
|
||||
- [Optimized PyTorch on GPU](./optimization/fp16)
|
||||
- [Mac OS with PyTorch](./optimization/mps)
|
||||
- [ONNX](./optimization/onnx)
|
||||
- [OpenVINO](./optimization/open_vino)
|
||||
|
||||
모델은 [`~ModelMixin.from_pretrained`] 메서드로 시작되며, 이 메서드는 모델 가중치를 로컬에 캐시하여 다음에 모델을 로드할 때 더 빠르게 로드할 수 있습니다. 훑어보기에서는 고양이 이미지에 대해 학습된 체크포인트가 있는 기본적인 unconditional 이미지 생성 모델인 [`UNet2DModel`]을 로드합니다:
|
||||
확산 모델을 미세조정하거나 학습시키려면, [**training section**](./training/overview)을 살펴보세요.
|
||||
|
||||
```py
|
||||
>>> from diffusers import UNet2DModel
|
||||
|
||||
>>> repo_id = "google/ddpm-cat-256"
|
||||
>>> model = UNet2DModel.from_pretrained(repo_id)
|
||||
```
|
||||
|
||||
모델 매개변수에 액세스하려면 `model.config`를 호출합니다:
|
||||
|
||||
```py
|
||||
>>> model.config
|
||||
```
|
||||
|
||||
모델 구성은 🧊 고정된 🧊 딕셔너리로, 모델이 생성된 후에는 해당 매개 변수들을 변경할 수 없습니다. 이는 의도적인 것으로, 처음에 모델 아키텍처를 정의하는 데 사용된 매개변수는 동일하게 유지하면서 다른 매개변수는 추론 중에 조정할 수 있도록 하기 위한 것입니다.
|
||||
|
||||
가장 중요한 매개변수들은 다음과 같습니다:
|
||||
|
||||
* `sample_size`: 입력 샘플의 높이 및 너비 치수입니다.
|
||||
* `in_channels`: 입력 샘플의 입력 채널 수입니다.
|
||||
* `down_block_types` 및 `up_block_types`: UNet 아키텍처를 생성하는 데 사용되는 다운 및 업샘플링 블록의 유형.
|
||||
* `block_out_channels`: 다운샘플링 블록의 출력 채널 수. 업샘플링 블록의 입력 채널 수에 역순으로 사용되기도 합니다.
|
||||
* `layers_per_block`: 각 UNet 블록에 존재하는 ResNet 블록의 수입니다.
|
||||
|
||||
추론에 모델을 사용하려면 랜덤 가우시안 노이즈로 이미지 모양을 만듭니다. 모델이 여러 개의 무작위 노이즈를 수신할 수 있으므로 'batch' 축, 입력 채널 수에 해당하는 'channel' 축, 이미지의 높이와 너비를 나타내는 'sample_size' 축이 있어야 합니다:
|
||||
|
||||
```py
|
||||
>>> import torch
|
||||
|
||||
>>> torch.manual_seed(0)
|
||||
|
||||
>>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
|
||||
>>> noisy_sample.shape
|
||||
torch.Size([1, 3, 256, 256])
|
||||
```
|
||||
|
||||
추론을 위해 모델에 노이즈가 있는 이미지와 `timestep`을 전달합니다. 'timestep'은 입력 이미지의 노이즈 정도를 나타내며, 시작 부분에 더 많은 노이즈가 있고 끝 부분에 더 적은 노이즈가 있습니다. 이를 통해 모델이 diffusion 과정에서 시작 또는 끝에 더 가까운 위치를 결정할 수 있습니다. `sample` 메서드를 사용하여 모델 출력을 얻습니다:
|
||||
|
||||
```py
|
||||
>>> with torch.no_grad():
|
||||
... noisy_residual = model(sample=noisy_sample, timestep=2).sample
|
||||
```
|
||||
|
||||
하지만 실제 예를 생성하려면 노이즈 제거 프로세스를 안내할 스케줄러가 필요합니다. 다음 섹션에서는 모델을 스케줄러와 결합하는 방법에 대해 알아봅니다.
|
||||
|
||||
## 스케줄러
|
||||
|
||||
스케줄러는 모델 출력이 주어졌을 때 노이즈가 많은 샘플에서 노이즈가 적은 샘플로 전환하는 것을 관리합니다 - 이 경우 'noisy_residual'.
|
||||
|
||||
<Tip>
|
||||
|
||||
🧨 Diffusers는 Diffusion 시스템을 구축하기 위한 툴박스입니다. [`DiffusionPipeline`]을 사용하면 미리 만들어진 Diffusion 시스템을 편리하게 시작할 수 있지만, 모델과 스케줄러 구성 요소를 개별적으로 선택하여 사용자 지정 Diffusion 시스템을 구축할 수도 있습니다.
|
||||
|
||||
</Tip>
|
||||
|
||||
훑어보기의 경우, [`~diffusers.ConfigMixin.from_config`] 메서드를 사용하여 [`DDPMScheduler`]를 인스턴스화합니다:
|
||||
|
||||
```py
|
||||
>>> from diffusers import DDPMScheduler
|
||||
|
||||
>>> scheduler = DDPMScheduler.from_config(repo_id)
|
||||
>>> scheduler
|
||||
DDPMScheduler {
|
||||
"_class_name": "DDPMScheduler",
|
||||
"_diffusers_version": "0.13.1",
|
||||
"beta_end": 0.02,
|
||||
"beta_schedule": "linear",
|
||||
"beta_start": 0.0001,
|
||||
"clip_sample": true,
|
||||
"clip_sample_range": 1.0,
|
||||
"num_train_timesteps": 1000,
|
||||
"prediction_type": "epsilon",
|
||||
"trained_betas": null,
|
||||
"variance_type": "fixed_small"
|
||||
}
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
💡 스케줄러가 구성에서 어떻게 인스턴스화되는지 주목하세요. 모델과 달리 스케줄러에는 학습 가능한 가중치가 없으며 매개변수도 없습니다!
|
||||
|
||||
</Tip>
|
||||
|
||||
가장 중요한 매개변수는 다음과 같습니다:
|
||||
|
||||
* `num_train_timesteps`: 노이즈 제거 프로세스의 길이, 즉 랜덤 가우스 노이즈를 데이터 샘플로 처리하는 데 필요한 타임스텝 수입니다.
|
||||
* `beta_schedule`: 추론 및 학습에 사용할 노이즈 스케줄 유형입니다.
|
||||
* `beta_start` 및 `beta_end`: 노이즈 스케줄의 시작 및 종료 노이즈 값입니다.
|
||||
|
||||
노이즈가 약간 적은 이미지를 예측하려면 스케줄러의 [`~diffusers.DDPMScheduler.step`] 메서드에 모델 출력, `timestep`, 현재 `sample`을 전달하세요.
|
||||
|
||||
```py
|
||||
>>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample
|
||||
>>> less_noisy_sample.shape
|
||||
```
|
||||
|
||||
`less_noisy_sample`을 다음 `timestep`으로 넘기면 노이즈가 더 줄어듭니다! 이제 이 모든 것을 한데 모아 전체 노이즈 제거 과정을 시각화해 보겠습니다.
|
||||
|
||||
먼저 노이즈 제거된 이미지를 후처리하여 `PIL.Image`로 표시하는 함수를 만듭니다:
|
||||
|
||||
```py
|
||||
>>> import PIL.Image
|
||||
>>> import numpy as np
|
||||
|
||||
|
||||
>>> def display_sample(sample, i):
|
||||
... image_processed = sample.cpu().permute(0, 2, 3, 1)
|
||||
... image_processed = (image_processed + 1.0) * 127.5
|
||||
... image_processed = image_processed.numpy().astype(np.uint8)
|
||||
|
||||
... image_pil = PIL.Image.fromarray(image_processed[0])
|
||||
... display(f"Image at step {i}")
|
||||
... display(image_pil)
|
||||
```
|
||||
|
||||
노이즈 제거 프로세스의 속도를 높이려면 입력과 모델을 GPU로 옮기세요:
|
||||
|
||||
```py
|
||||
>>> model.to("cuda")
|
||||
>>> noisy_sample = noisy_sample.to("cuda")
|
||||
```
|
||||
|
||||
이제 노이즈가 적은 샘플의 잔차를 예측하고 스케줄러로 노이즈가 적은 샘플을 계산하는 노이즈 제거 루프를 생성합니다:
|
||||
|
||||
```py
|
||||
>>> import tqdm
|
||||
|
||||
>>> sample = noisy_sample
|
||||
|
||||
>>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)):
|
||||
... # 1. predict noise residual
|
||||
... with torch.no_grad():
|
||||
... residual = model(sample, t).sample
|
||||
|
||||
... # 2. compute less noisy image and set x_t -> x_t-1
|
||||
... sample = scheduler.step(residual, t, sample).prev_sample
|
||||
|
||||
... # 3. optionally look at image
|
||||
... if (i + 1) % 50 == 0:
|
||||
... display_sample(sample, i + 1)
|
||||
```
|
||||
|
||||
가만히 앉아서 고양이가 소음으로만 생성되는 것을 지켜보세요!😻
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/diffusion-quicktour.png"/>
|
||||
</div>
|
||||
|
||||
## 다음 단계
|
||||
|
||||
이번 훑어보기에서 🧨 Diffusers로 멋진 이미지를 만들어 보셨기를 바랍니다! 다음 단계로 넘어가세요:
|
||||
|
||||
* [training](./tutorials/basic_training) 튜토리얼에서 모델을 학습하거나 파인튜닝하여 나만의 이미지를 생성할 수 있습니다.
|
||||
* 다양한 사용 사례는 공식 및 커뮤니티 [학습 또는 파인튜닝 스크립트](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) 예시를 참조하세요.
|
||||
* 스케줄러 로드, 액세스, 변경 및 비교에 대한 자세한 내용은 [다른 스케줄러 사용](./using-diffusers/schedulers) 가이드에서 확인하세요.
|
||||
* [Stable Diffusion](./stable_diffusion) 가이드에서 프롬프트 엔지니어링, 속도 및 메모리 최적화, 고품질 이미지 생성을 위한 팁과 요령을 살펴보세요.
|
||||
* [GPU에서 파이토치 최적화](./optimization/fp16) 가이드와 [애플 실리콘(M1/M2)에서의 Stable Diffusion](./optimization/mps) 및 [ONNX 런타임](./optimization/onnx) 실행에 대한 추론 가이드를 통해 🧨 Diffusers 속도를 높이는 방법을 더 자세히 알아보세요.
|
||||
마지막으로, 생성된 이미지를 공개적으로 배포할 때 신중을 기해 주세요 🤗.
|
||||
@@ -1,279 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# 효과적이고 효율적인 Diffusion
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
특정 스타일로 이미지를 생성하거나 원하는 내용을 포함하도록 [`DiffusionPipeline`]을 설정하는 것은 까다로울 수 있습니다. 만족스러운 이미지를 얻기까지 [`DiffusionPipeline`]을 여러 번 다시 실행해야 하는 경우가 많습니다. 그러나 무에서 유를 창조하는 것은 특히 추론을 반복해서 실행하는 경우 계산 집약적인 프로세스입니다.
|
||||
|
||||
그렇기 때문에 파이프라인에서 *계산*(속도) 및 *메모리*(GPU RAM) 효율성을 극대화하여 추론 주기 사이의 시간을 단축하여 더 빠르게 반복할 수 있도록 하는 것이 중요합니다.
|
||||
|
||||
이 튜토리얼에서는 [`DiffusionPipeline`]을 사용하여 더 빠르고 효과적으로 생성하는 방법을 안내합니다.
|
||||
|
||||
[`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) 모델을 불러와서 시작합니다:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
model_id = "runwayml/stable-diffusion-v1-5"
|
||||
pipeline = DiffusionPipeline.from_pretrained(model_id)
|
||||
```
|
||||
|
||||
예제 프롬프트는 "portrait of an old warrior chief" 이지만, 자유롭게 자신만의 프롬프트를 사용해도 됩니다:
|
||||
|
||||
```python
|
||||
prompt = "portrait photo of a old warrior chief"
|
||||
```
|
||||
|
||||
## 속도
|
||||
|
||||
<Tip>
|
||||
|
||||
💡 GPU를 사용할 수 없다면 [Colab](https://colab.research.google.com/)과 같은 GPU 제공업체에서 무료로 GPU를 사용할 수 있습니다!
|
||||
|
||||
</Tip>
|
||||
|
||||
추론 속도를 높이는 가장 간단한 방법 중 하나는 PyTorch 모듈을 다룰 때와 같은 방식으로 파이프라인을 GPU에 올리는 것입니다:
|
||||
|
||||
```python
|
||||
pipeline = pipeline.to("cuda")
|
||||
```
|
||||
|
||||
동일한 이미지로 개선 여부를 확인할 수 있도록 [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html)를 사용하고 [재현성](./using-diffusers/reproducibility)을 위한 시드를 설정하세요:
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
generator = torch.Generator("cuda").manual_seed(0)
|
||||
```
|
||||
|
||||
이제 이미지를 생성할 수 있습니다:
|
||||
|
||||
```python
|
||||
image = pipeline(prompt, generator=generator).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_1.png">
|
||||
</div>
|
||||
|
||||
이 프로세스는 T4 GPU에서 약 30초가 소요되었습니다(할당된 GPU가 T4보다 나은 경우 더 빠를 수 있음). 기본적으로 [`DiffusionPipeline`]은 50개의 추론 단계에 대해 전체 `float32` 정밀도로 추론을 실행합니다. `float16`과 같은 더 낮은 정밀도로 전환하거나 추론 단계를 더 적게 실행하여 속도를 높일 수 있습니다.
|
||||
|
||||
`float16`으로 모델을 로드하고 이미지를 생성해 보겠습니다:
|
||||
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
|
||||
pipeline = pipeline.to("cuda")
|
||||
generator = torch.Generator("cuda").manual_seed(0)
|
||||
image = pipeline(prompt, generator=generator).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_2.png">
|
||||
</div>
|
||||
|
||||
이번에는 이미지를 생성하는 데 약 11초밖에 걸리지 않아 이전보다 3배 가까이 빨라졌습니다!
|
||||
|
||||
<Tip>
|
||||
|
||||
💡 파이프라인은 항상 `float16`에서 실행할 것을 강력히 권장하며, 지금까지 출력 품질이 저하되는 경우는 거의 없었습니다.
|
||||
|
||||
</Tip>
|
||||
|
||||
또 다른 옵션은 추론 단계의 수를 줄이는 것입니다. 보다 효율적인 스케줄러를 선택하면 출력 품질 저하 없이 단계 수를 줄이는 데 도움이 될 수 있습니다. 현재 모델과 호환되는 스케줄러는 `compatibles` 메서드를 호출하여 [`DiffusionPipeline`]에서 찾을 수 있습니다:
|
||||
|
||||
```python
|
||||
pipeline.scheduler.compatibles
|
||||
[
|
||||
diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,
|
||||
diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler,
|
||||
diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler,
|
||||
diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler,
|
||||
diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,
|
||||
diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,
|
||||
diffusers.schedulers.scheduling_ddpm.DDPMScheduler,
|
||||
diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,
|
||||
diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler,
|
||||
diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,
|
||||
diffusers.schedulers.scheduling_pndm.PNDMScheduler,
|
||||
diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,
|
||||
diffusers.schedulers.scheduling_ddim.DDIMScheduler,
|
||||
]
|
||||
```
|
||||
|
||||
Stable Diffusion 모델은 일반적으로 약 50개의 추론 단계가 필요한 [`PNDMScheduler`]를 기본으로 사용하지만, [`DPMSolverMultistepScheduler`]와 같이 성능이 더 뛰어난 스케줄러는 약 20개 또는 25개의 추론 단계만 필요로 합니다. 새 스케줄러를 로드하려면 [`ConfigMixin.from_config`] 메서드를 사용합니다:
|
||||
|
||||
```python
|
||||
from diffusers import DPMSolverMultistepScheduler
|
||||
|
||||
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
|
||||
```
|
||||
|
||||
`num_inference_steps`를 20으로 설정합니다:
|
||||
|
||||
```python
|
||||
generator = torch.Generator("cuda").manual_seed(0)
|
||||
image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0]
|
||||
image
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_3.png">
|
||||
</div>
|
||||
|
||||
추론시간을 4초로 단축할 수 있었습니다! ⚡️
|
||||
|
||||
## 메모리
|
||||
|
||||
파이프라인 성능 향상의 또 다른 핵심은 메모리 사용량을 줄이는 것입니다. 보통 초당 생성되는 이미지 수를 최대화하려고 하므로, 메모리를 줄이면 간접적으로 속도도 빨라집니다. 한 번에 생성할 수 있는 이미지 수를 확인하는 가장 쉬운 방법은 `OutOfMemoryError`(OOM)가 발생할 때까지 다양한 배치 크기를 시도해 보는 것입니다.
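예를 들어, 앞에서 로드한 `pipeline`과 `prompt`를 그대로 사용한다고 가정하면, 아래와 같이 OOM이 발생할 때까지 배치 크기를 늘려 보는 간단한 스케치를 생각해 볼 수 있습니다(시도할 배치 크기 목록은 임의로 정한 예시입니다):

```python
import torch

# OOM이 발생할 때까지 배치 크기를 늘려 보는 간단한 스케치
# (torch.cuda.OutOfMemoryError 는 PyTorch 1.13 이상을 가정합니다)
for batch_size in [1, 2, 4, 8, 16]:
    try:
        pipeline(prompt, num_images_per_prompt=batch_size, num_inference_steps=20)
        print(f"batch_size={batch_size} OK")
    except torch.cuda.OutOfMemoryError:
        print(f"batch_size={batch_size}에서 OOM 발생")
        torch.cuda.empty_cache()
        break
```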
|
||||
|
||||
프롬프트 목록과 `Generators`에서 이미지 배치를 생성하는 함수를 만듭니다. 좋은 결과를 생성하는 경우 재사용할 수 있도록 각 `Generator`에 시드를 할당해야 합니다.
|
||||
|
||||
```python
|
||||
def get_inputs(batch_size=1):
|
||||
generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
|
||||
prompts = batch_size * [prompt]
|
||||
num_inference_steps = 20
|
||||
|
||||
return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
|
||||
```
|
||||
|
||||
또한 각 이미지 배치를 보여주는 기능이 필요합니다:
|
||||
|
||||
```python
|
||||
from PIL import Image
|
||||
|
||||
|
||||
def image_grid(imgs, rows=2, cols=2):
|
||||
w, h = imgs[0].size
|
||||
grid = Image.new("RGB", size=(cols * w, rows * h))
|
||||
|
||||
for i, img in enumerate(imgs):
|
||||
grid.paste(img, box=(i % cols * w, i // cols * h))
|
||||
return grid
|
||||
```
|
||||
|
||||
`batch_size=4`부터 시작해 얼마나 많은 메모리를 소비했는지 확인합니다:
|
||||
|
||||
```python
|
||||
images = pipeline(**get_inputs(batch_size=4)).images
|
||||
image_grid(images)
|
||||
```
|
||||
|
||||
RAM이 더 많은 GPU가 아니라면 위의 코드에서 `OOM` 오류가 반환되었을 것입니다! 대부분의 메모리는 cross-attention 레이어가 차지합니다. 이 작업을 배치로 실행하는 대신 순차적으로 실행하면 상당한 양의 메모리를 절약할 수 있습니다. 파이프라인을 구성하여 [`~DiffusionPipeline.enable_attention_slicing`] 함수를 사용하기만 하면 됩니다:
|
||||
|
||||
|
||||
```python
|
||||
pipeline.enable_attention_slicing()
|
||||
```
|
||||
|
||||
이제 `batch_size`를 8로 늘려보세요!
|
||||
|
||||
```python
|
||||
images = pipeline(**get_inputs(batch_size=8)).images
|
||||
image_grid(images, rows=2, cols=4)
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_5.png">
|
||||
</div>
|
||||
|
||||
이전에는 4개의 이미지를 배치로 생성할 수도 없었지만, 이제는 이미지당 약 3.5초 만에 8개의 이미지를 배치로 생성할 수 있습니다! 이는 아마도 품질 저하 없이 T4 GPU에서 가장 빠른 속도일 것입니다.
|
||||
|
||||
## 품질
|
||||
|
||||
지난 두 섹션에서는 `fp16`을 사용하여 파이프라인의 속도를 최적화하고, 더 성능이 좋은 스케줄러를 사용하여 추론 단계의 수를 줄이고, attention slicing을 활성화하여 메모리 소비를 줄이는 방법을 배웠습니다. 이제 생성된 이미지의 품질을 개선하는 방법에 대해 집중적으로 알아보겠습니다.
|
||||
|
||||
|
||||
### 더 나은 체크포인트
|
||||
|
||||
가장 확실한 단계는 더 나은 체크포인트를 사용하는 것입니다. Stable Diffusion 모델은 좋은 출발점이며, 공식 출시 이후 몇 가지 개선된 버전도 출시되었습니다. 하지만 최신 버전을 사용한다고 해서 자동으로 더 나은 결과를 얻을 수 있는 것은 아닙니다. 여전히 다양한 체크포인트를 직접 실험해보고, [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/) 사용 등 약간의 조사를 통해 최상의 결과를 얻어야 합니다.
|
||||
|
||||
이 분야가 성장함에 따라 특정 스타일을 연출할 수 있도록 세밀하게 조정된 고품질 체크포인트가 점점 더 많아지고 있습니다. [Hub](https://huggingface.co/models?library=diffusers&sort=downloads)와 [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery)를 둘러보고 관심 있는 것을 찾아보세요!
|
||||
|
||||
|
||||
### 더 나은 파이프라인 구성 요소
|
||||
|
||||
현재 파이프라인 구성 요소를 최신 버전으로 교체해 볼 수도 있습니다. Stability AI의 최신 [autoencoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae)를 파이프라인에 로드하고 몇 가지 이미지를 생성해 보겠습니다:
|
||||
|
||||
|
||||
```python
|
||||
from diffusers import AutoencoderKL
|
||||
|
||||
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
|
||||
pipeline.vae = vae
|
||||
images = pipeline(**get_inputs(batch_size=8)).images
|
||||
image_grid(images, rows=2, cols=4)
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_6.png">
|
||||
</div>
|
||||
|
||||
### 더 나은 프롬프트 엔지니어링
|
||||
|
||||
이미지를 생성하는 데 사용하는 텍스트 프롬프트는 *prompt engineering*이라고 할 정도로 매우 중요합니다. 프롬프트 엔지니어링 시 고려해야 할 몇 가지 사항은 다음과 같습니다:
|
||||
|
||||
- 생성하려는 이미지 또는 유사한 이미지가 인터넷에 어떻게 저장되어 있는가?
|
||||
- 내가 원하는 스타일로 모델을 유도하기 위해 어떤 추가 세부 정보를 제공할 수 있는가?
|
||||
|
||||
이를 염두에 두고 색상과 더 높은 품질의 디테일을 포함하도록 프롬프트를 개선해 봅시다:
|
||||
|
||||
|
||||
```python
|
||||
prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
|
||||
prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta"
|
||||
```
|
||||
|
||||
새로운 프롬프트로 이미지 배치를 생성합니다:
|
||||
|
||||
```python
|
||||
images = pipeline(**get_inputs(batch_size=8)).images
|
||||
image_grid(images, rows=2, cols=4)
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_7.png">
|
||||
</div>
|
||||
|
||||
꽤 인상적입니다! `1`의 시드를 가진 `Generator`에 해당하는 두 번째 이미지에 피사체의 나이에 대한 텍스트를 추가하여 조금 더 조정해 보겠습니다:
|
||||
|
||||
```python
|
||||
prompts = [
|
||||
"portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
|
||||
"portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
|
||||
"portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
|
||||
"portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
|
||||
]
|
||||
|
||||
generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
|
||||
images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
|
||||
image_grid(images)
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_8.png">
|
||||
</div>
|
||||
|
||||
## 다음 단계
|
||||
|
||||
이 튜토리얼에서는 계산 및 메모리 효율을 높이고 생성된 출력의 품질을 개선하기 위해 [`DiffusionPipeline`]을 최적화하는 방법을 배웠습니다. 파이프라인을 더 빠르게 만드는 데 관심이 있다면 다음 리소스를 살펴보세요:
|
||||
|
||||
- [PyTorch 2.0](./optimization/torch2.0) 및 [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)이 어떻게 추론 속도를 5~300% 향상시킬 수 있는지 알아보세요. A100 GPU에서는 추론 속도가 최대 50%까지 빨라질 수 있습니다!
|
||||
- PyTorch 2를 사용할 수 없는 경우, [xFormers](./optimization/xformers)를 설치하는 것이 좋습니다. 메모리 효율적인 어텐션 메커니즘은 PyTorch 1.13.1과 함께 사용하면 속도가 빨라지고 메모리 소비가 줄어듭니다.
|
||||
- 모델 오프로딩과 같은 다른 최적화 기법은 [이 가이드](./optimization/fp16)에서 다루고 있습니다.
|
||||
@@ -1,331 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# ControlNet
|
||||
|
||||
[Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) (ControlNet)은 Lvmin Zhang과 Maneesh Agrawala가 저술한 논문입니다.
|
||||
|
||||
이 예시는 [원본 ControlNet 리포지토리의 예시 학습하기](https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md)에 기반합니다. 여기서는 원을 채우도록 [small synthetic dataset](https://huggingface.co/datasets/fusing/fill50k)을 사용해 ControlNet을 학습합니다.
|
||||
|
||||
## 의존성 설치하기
|
||||
|
||||
아래의 스크립트를 실행하기 전에, 라이브러리의 학습 의존성을 설치해야 합니다.
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
가장 최신 버전의 예시 스크립트를 성공적으로 실행하려면 **소스에서 설치**하고 설치를 최신 상태로 유지하는 것을 강력하게 추천합니다. 예시 스크립트는 자주 업데이트되며, 예시별 요구사항도 함께 설치해야 하기 때문입니다.
|
||||
|
||||
</Tip>
|
||||
|
||||
위 사항을 만족시키기 위해서, 새로운 가상환경에서 다음 일련의 스텝을 실행하세요:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/huggingface/diffusers
|
||||
cd diffusers
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
그 다음에는 [예시 폴더](https://github.com/huggingface/diffusers/tree/main/examples/controlnet)로 이동합니다.
|
||||
|
||||
```bash
|
||||
cd examples/controlnet
|
||||
```
|
||||
|
||||
이제 실행하세요:
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
[🤗Accelerate](https://github.com/huggingface/accelerate/) 환경을 초기화 합니다:
|
||||
|
||||
```bash
|
||||
accelerate config
|
||||
```
|
||||
|
||||
혹은 여러분의 환경이 무엇인지 몰라도 기본적인 🤗Accelerate 구성으로 초기화할 수 있습니다:
|
||||
|
||||
```bash
|
||||
accelerate config default
|
||||
```
|
||||
|
||||
혹은 당신의 환경이 노트북 같은 상호작용하는 쉘을 지원하지 않는다면, 아래의 코드로 초기화 할 수 있습니다:
|
||||
|
||||
```python
|
||||
from accelerate.utils import write_basic_config
|
||||
|
||||
write_basic_config()
|
||||
```
|
||||
|
||||
## 원을 채우는 데이터셋
|
||||
|
||||
원본 데이터셋은 ControlNet [repo](https://huggingface.co/lllyasviel/ControlNet/blob/main/training/fill50k.zip)에 공개되어 있지만, 🤗 Datasets과 호환되도록 [여기](https://huggingface.co/datasets/fusing/fill50k)에 다시 올려두었기 때문에 학습 스크립트에서 바로 데이터를 불러올 수 있습니다.
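예를 들어 다음과 같이 🤗 Datasets으로 바로 불러올 수 있습니다(열 이름은 학습 스크립트의 기본값인 `image`, `conditioning_image`, `text`를 가정한 예시입니다):

```python
from datasets import load_dataset

# Hub에 올라온 fill50k 데이터셋을 불러오는 예시
dataset = load_dataset("fusing/fill50k", split="train")
print(dataset)  # image, conditioning_image, text 열을 확인할 수 있습니다
```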
|
||||
|
||||
우리의 학습 예시는 원래 ControlNet 학습에 쓰였던 [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)를 사용합니다. 그렇지만 ControlNet은 [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4)나 [`stabilityai/stable-diffusion-2-1`](https://huggingface.co/stabilityai/stable-diffusion-2-1) 같은 어떤 Stable Diffusion 모델에 대해서도 학습시킬 수 있습니다.
|
||||
|
||||
자체 데이터셋을 사용하기 위해서는 [학습을 위한 데이터셋 생성하기](create_dataset) 가이드를 확인하세요.
|
||||
|
||||
## 학습
|
||||
|
||||
이 학습에 사용될 다음 이미지들을 다운로드하세요:
|
||||
|
||||
```sh
|
||||
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
|
||||
|
||||
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
|
||||
```
|
||||
|
||||
`MODEL_NAME` 환경 변수 (Hub 모델 리포지토리 아이디 혹은 모델 가중치가 있는 디렉토리로 가는 주소)를 명시하고 [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) 인자로 환경변수를 보냅니다.
|
||||
|
||||
학습 스크립트는 당신의 리포지토리에 `diffusion_pytorch_model.bin` 파일을 생성하고 저장합니다.
|
||||
|
||||
```bash
|
||||
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
|
||||
export OUTPUT_DIR="path to save model"
|
||||
|
||||
accelerate launch train_controlnet.py \
|
||||
--pretrained_model_name_or_path=$MODEL_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--dataset_name=fusing/fill50k \
|
||||
--resolution=512 \
|
||||
--learning_rate=1e-5 \
|
||||
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
|
||||
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
|
||||
--train_batch_size=4 \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
이 기본적인 설정으로는 ~38GB VRAM이 필요합니다.
|
||||
|
||||
기본적으로 학습 스크립트는 결과를 텐서보드에 기록합니다. Weights & Biases를 사용하려면 `--report_to wandb` 인자를 전달하세요.
|
||||
|
||||
더 작은 batch(배치) 크기로 gradient accumulation(기울기 누적)을 하면 학습 요구사항을 ~20 GB VRAM으로 줄일 수 있습니다.
|
||||
|
||||
```bash
|
||||
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
|
||||
export OUTPUT_DIR="path to save model"
|
||||
|
||||
accelerate launch train_controlnet.py \
|
||||
--pretrained_model_name_or_path=$MODEL_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--dataset_name=fusing/fill50k \
|
||||
--resolution=512 \
|
||||
--learning_rate=1e-5 \
|
||||
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
|
||||
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
|
||||
--train_batch_size=1 \
|
||||
--gradient_accumulation_steps=4 \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
## 여러개 GPU로 학습하기
|
||||
|
||||
`accelerate`는 원활한(seamless) 멀티 GPU 학습을 지원합니다. `accelerate`로 분산 학습을 실행하는 방법은 [여기](https://huggingface.co/docs/accelerate/basic_tutorials/launch)의 설명을 확인하세요. 아래는 예시 명령어입니다:
|
||||
|
||||
```bash
|
||||
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
|
||||
export OUTPUT_DIR="path to save model"
|
||||
|
||||
accelerate launch --mixed_precision="fp16" --multi_gpu train_controlnet.py \
|
||||
--pretrained_model_name_or_path=$MODEL_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--dataset_name=fusing/fill50k \
|
||||
--resolution=512 \
|
||||
--learning_rate=1e-5 \
|
||||
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
|
||||
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
|
||||
--train_batch_size=4 \
|
||||
--mixed_precision="fp16" \
|
||||
--tracker_project_name="controlnet-demo" \
|
||||
--report_to=wandb \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
## 예시 결과
|
||||
|
||||
#### 배치 사이즈 8로 300 스텝 이후:
|
||||
|
||||
| | |
|
||||
|-------------------|:-------------------------:|
|
||||
| | 푸른 배경과 빨간 원 |
|
||||
 |  |
|
||||
| | 갈색 꽃 배경과 청록색 원 |
|
||||
 |  |
|
||||
|
||||
#### 배치 사이즈 8로 6000 스텝 이후:
|
||||
|
||||
| | |
|
||||
|-------------------|:-------------------------:|
|
||||
| | 푸른 배경과 빨간 원 |
|
||||
 |  |
|
||||
| | 갈색 꽃 배경과 청록색 원 |
|
||||
 |  |
|
||||
|
||||
## 16GB GPU에서 학습하기
|
||||
|
||||
16GB GPU에서 학습하기 위해 다음의 최적화를 진행하세요:
|
||||
|
||||
- Gradient checkpointing(기울기 체크포인팅) 활성화하기
- bitsandbytes의 [8-bit optimizer](https://github.com/TimDettmers/bitsandbytes#requirements--installation) 사용하기 (설치되어 있지 않다면 링크의 설치 설명서를 참고하세요)
|
||||
|
||||
이제 학습 스크립트를 시작할 수 있습니다:
|
||||
|
||||
```bash
|
||||
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
|
||||
export OUTPUT_DIR="path to save model"
|
||||
|
||||
accelerate launch train_controlnet.py \
|
||||
--pretrained_model_name_or_path=$MODEL_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--dataset_name=fusing/fill50k \
|
||||
--resolution=512 \
|
||||
--learning_rate=1e-5 \
|
||||
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
|
||||
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
|
||||
--train_batch_size=1 \
|
||||
--gradient_accumulation_steps=4 \
|
||||
--gradient_checkpointing \
|
||||
--use_8bit_adam \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
## 12GB GPU에서 학습하기
|
||||
|
||||
12GB GPU에서 실행하기 위해 다음의 최적화를 진행하세요:
|
||||
|
||||
- Gradient checkpointing(기울기 체크포인팅) 활성화하기
- bitsandbytes의 [8-bit optimizer](https://github.com/TimDettmers/bitsandbytes#requirements--installation) 사용하기 (설치되어 있지 않다면 링크의 설치 설명서를 참고하세요)
- [xFormers](https://huggingface.co/docs/diffusers/training/optimization/xformers) 사용하기 (설치되어 있지 않다면 링크의 설치 설명서를 참고하세요)
- 기울기를 `None`으로 설정하기
|
||||
|
||||
```bash
|
||||
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
|
||||
export OUTPUT_DIR="path to save model"
|
||||
|
||||
accelerate launch train_controlnet.py \
|
||||
--pretrained_model_name_or_path=$MODEL_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--dataset_name=fusing/fill50k \
|
||||
--resolution=512 \
|
||||
--learning_rate=1e-5 \
|
||||
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
|
||||
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
|
||||
--train_batch_size=1 \
|
||||
--gradient_accumulation_steps=4 \
|
||||
--gradient_checkpointing \
|
||||
--use_8bit_adam \
|
||||
--enable_xformers_memory_efficient_attention \
|
||||
--set_grads_to_none \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
`pip install xformers`로 `xformers`를 반드시 설치한 뒤 `enable_xformers_memory_efficient_attention`을 사용하세요.
|
||||
|
||||
## 8GB GPU에서 학습하기
|
||||
|
||||
ControlNet에 대한 DeepSpeed 지원은 충분히 테스트되지 않았습니다. 아래 설정은 메모리를 절약해 주지만,
이 설정으로 학습이 성공적으로 진행되는지는 확인하지 못했습니다. 학습을 성공적으로 실행하려면 설정을 변경해야 할 가능성이 높습니다.
|
||||
|
||||
8GB GPU에서 실행하기 위해 다음의 최적화를 진행하세요:
|
||||
|
||||
- Gradient checkpointing(기울기 체크포인팅) 활성화하기
- bitsandbytes의 [8-bit optimizer](https://github.com/TimDettmers/bitsandbytes#requirements--installation) 사용하기 (설치되어 있지 않다면 링크의 설치 설명서를 참고하세요)
- [xFormers](https://huggingface.co/docs/diffusers/training/optimization/xformers) 사용하기 (설치되어 있지 않다면 링크의 설치 설명서를 참고하세요)
- 기울기를 `None`으로 설정하기
- DeepSpeed stage 2의 파라미터 및 optimizer 오프로딩
- fp16 혼합 정밀도(mixed precision)
|
||||
|
||||
[DeepSpeed](https://www.deepspeed.ai/)는 텐서를 VRAM에서 CPU 또는 NVMe로 오프로드할 수 있습니다.
이를 위해서는 훨씬 더 많은 RAM(약 25GB)이 필요합니다.
|
||||
|
||||
DeepSpeed stage 2를 활성화하기 위해서 `accelerate config`로 환경을 구성해야합니다.
|
||||
|
||||
구성(configuration) 파일은 이런 모습이어야 합니다:
|
||||
|
||||
```yaml
|
||||
compute_environment: LOCAL_MACHINE
|
||||
deepspeed_config:
|
||||
gradient_accumulation_steps: 4
|
||||
offload_optimizer_device: cpu
|
||||
offload_param_device: cpu
|
||||
zero3_init_flag: false
|
||||
zero_stage: 2
|
||||
distributed_type: DEEPSPEED
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
더 많은 DeepSpeed 설정 옵션은 [문서](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)를 참고하세요.
|
||||
|
||||
</Tip>
|
||||
|
||||
기본 Adam optimizer를 DeepSpeed의 Adam 구현인 `deepspeed.ops.adam.DeepSpeedCPUAdam`으로 바꾸면 상당한 속도 향상을 얻을 수 있지만, PyTorch와 동일한 버전의 CUDA toolchain이 필요합니다. 8-bit optimizer는 현재 DeepSpeed와 호환되지 않는 것으로 보입니다.
|
||||
|
||||
```bash
|
||||
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
|
||||
export OUTPUT_DIR="path to save model"
|
||||
|
||||
accelerate launch train_controlnet.py \
|
||||
--pretrained_model_name_or_path=$MODEL_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--dataset_name=fusing/fill50k \
|
||||
--resolution=512 \
|
||||
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
|
||||
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
|
||||
--train_batch_size=1 \
|
||||
--gradient_accumulation_steps=4 \
|
||||
--gradient_checkpointing \
|
||||
--enable_xformers_memory_efficient_attention \
|
||||
--set_grads_to_none \
|
||||
--mixed_precision fp16 \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
## 추론
|
||||
|
||||
학습된 모델은 [`StableDiffusionControlNetPipeline`]으로 실행할 수 있습니다.
`base_model_path`와 `controlnet_path`에는 학습 스크립트에서 각각 `--pretrained_model_name_or_path`와 `--output_dir`에 지정했던 값을 넣어주세요.
|
||||
|
||||
```py
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
|
||||
from diffusers.utils import load_image
|
||||
import torch
|
||||
|
||||
base_model_path = "path to model"
|
||||
controlnet_path = "path to controlnet"
|
||||
|
||||
controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained(
|
||||
base_model_path, controlnet=controlnet, torch_dtype=torch.float16
|
||||
)
|
||||
|
||||
# 더 빠른 스케줄러와 메모리 최적화로 diffusion 프로세스 속도 올리기
|
||||
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
# xformers가 설치되어 있지 않다면 아래 줄을 제거하세요
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
control_image = load_image("./conditioning_image_1.png")
|
||||
prompt = "pale golden rod circle with old lace background"
|
||||
|
||||
# 이미지 생성하기
|
||||
generator = torch.manual_seed(0)
|
||||
image = pipe(prompt, num_inference_steps=20, generator=generator, image=control_image).images[0]
|
||||
|
||||
image.save("./output.png")
|
||||
```
|
||||
@@ -1,98 +0,0 @@
|
||||
# 학습을 위한 데이터셋 만들기
|
||||
|
||||
[Hub](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) 에는 모델 교육을 위한 많은 데이터셋이 있지만,
|
||||
관심이 있거나 사용하고 싶은 데이터셋을 찾을 수 없는 경우 🤗 [Datasets](hf.co/docs/datasets) 라이브러리를 사용하여 데이터셋을 만들 수 있습니다.
|
||||
데이터셋 구조는 모델을 학습하려는 작업에 따라 달라집니다.
|
||||
가장 기본적인 데이터셋 구조는 unconditional 이미지 생성과 같은 작업을 위한 이미지 디렉토리입니다.
|
||||
또 다른 데이터셋 구조는 이미지 디렉토리와 text-to-image 생성과 같은 작업에 해당하는 텍스트 캡션이 포함된 텍스트 파일일 수 있습니다.
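예를 들어 text-to-image 학습용이라면, 🤗 Datasets의 ImageFolder 규칙을 따른다고 가정할 때 아래와 같이 이미지 옆에 캡션을 담은 `metadata.jsonl` 파일을 두는 구성을 생각해 볼 수 있습니다(파일 이름과 캡션은 임의의 예시입니다):

```bash
data_dir/metadata.jsonl
data_dir/0001.png
data_dir/0002.png
```

`metadata.jsonl`의 각 줄은 `{"file_name": "0001.png", "text": "a drawing of a green pokemon"}`처럼 이미지 파일 이름과 캡션을 담습니다.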
|
||||
|
||||
이 가이드에는 파인 튜닝할 데이터셋을 만드는 두 가지 방법을 소개합니다:
|
||||
|
||||
- 이미지 폴더를 `--train_data_dir` 인수에 제공합니다.
|
||||
- 데이터셋을 Hub에 업로드하고 데이터셋 리포지토리 id를 `--dataset_name` 인수에 전달합니다.
|
||||
|
||||
<Tip>
|
||||
|
||||
💡 학습에 사용할 이미지 데이터셋을 만드는 방법에 대한 자세한 내용은 [이미지 데이터셋 만들기](https://huggingface.co/docs/datasets/image_dataset) 가이드를 참고하세요.
|
||||
|
||||
</Tip>
|
||||
|
||||
## 폴더 형태로 데이터셋 구축하기
|
||||
|
||||
Unconditional 생성을 위해 이미지 폴더로 자신의 데이터셋을 구축할 수 있습니다.
|
||||
학습 스크립트는 🤗 Datasets의 [ImageFolder](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder) 빌더를 사용하여
|
||||
자동으로 폴더에서 데이터셋을 구축합니다. 디렉토리 구조는 다음과 같아야 합니다 :
|
||||
|
||||
```bash
|
||||
data_dir/xxx.png
|
||||
data_dir/xxy.png
|
||||
data_dir/[...]/xxz.png
|
||||
```
|
||||
|
||||
데이터셋 디렉터리의 경로를 `--train_data_dir` 인수로 전달한 다음 학습을 시작할 수 있습니다:
|
||||
|
||||
```bash
|
||||
accelerate launch train_unconditional.py \
|
||||
# argument로 폴더 지정하기 \
|
||||
--train_data_dir <path-to-train-directory> \
|
||||
<other-arguments>
|
||||
```
|
||||
|
||||
## Hub에 데이터 올리기
|
||||
|
||||
<Tip>
|
||||
|
||||
💡 데이터셋을 만들고 Hub에 업로드하는 것에 대한 자세한 내용은 [🤗 Datasets을 사용한 이미지 검색](https://huggingface.co/blog/image-search-datasets) 게시물을 참고하세요.
|
||||
|
||||
</Tip>
|
||||
|
||||
PIL로 인코딩된 이미지를 담은 `image` 열을 생성하는 [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) 기능을 사용하여 데이터셋 생성을 시작합니다.
|
||||
|
||||
`data_dir` 또는 `data_files` 매개 변수를 사용하여 데이터셋의 위치를 지정할 수 있습니다.
|
||||
`data_files` 매개변수는 특정 파일을 `train` 이나 `test` 로 분리한 데이터셋에 매핑하는 것을 지원합니다:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
# 예시 1: 로컬 폴더
|
||||
dataset = load_dataset("imagefolder", data_dir="path_to_your_folder")
|
||||
|
||||
# 예시 2: 로컬 파일 (지원 포맷 : tar, gzip, zip, xz, rar, zstd)
|
||||
dataset = load_dataset("imagefolder", data_files="path_to_zip_file")
|
||||
|
||||
# 예시 3: 원격 파일 (지원 포맷 : tar, gzip, zip, xz, rar, zstd)
|
||||
dataset = load_dataset(
|
||||
"imagefolder",
|
||||
data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip",
|
||||
)
|
||||
|
||||
# 예시 4: 여러개로 분할
|
||||
dataset = load_dataset(
|
||||
"imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]}
|
||||
)
|
||||
```
|
||||
|
||||
[`push_to_hub`](https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/main_classes#datasets.Dataset.push_to_hub)를 사용해서 Hub에 데이터셋을 업로드합니다:
|
||||
|
||||
```python
|
||||
# 터미널에서 huggingface-cli login 커맨드를 이미 실행했다고 가정합니다
|
||||
dataset.push_to_hub("name_of_your_dataset")
|
||||
|
||||
# 개인 repo로 push 하고 싶다면, `private=True` 을 추가하세요:
|
||||
dataset.push_to_hub("name_of_your_dataset", private=True)
|
||||
```
|
||||
|
||||
이제 데이터셋 이름을 `--dataset_name` 인수에 전달하여 데이터셋을 학습에 사용할 수 있습니다:
|
||||
|
||||
```bash
|
||||
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
|
||||
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
|
||||
--dataset_name="name_of_your_dataset" \
|
||||
<other-arguments>
|
||||
```
|
||||
|
||||
## 다음 단계
|
||||
|
||||
데이터셋을 생성했으니 이제 학습 스크립트의 `train_data_dir` (데이터셋이 로컬이면) 혹은 `dataset_name` (Hub에 데이터셋을 올렸으면) 인수에 연결할 수 있습니다.
|
||||
|
||||
다음 단계에서는 데이터셋을 사용하여 [unconditional 생성](https://huggingface.co/docs/diffusers/v0.18.2/en/training/unconditional_training) 또는 [텍스트-이미지 생성](https://huggingface.co/docs/diffusers/training/text2image)을 위한 모델을 학습시켜보세요!
|
||||
@@ -1,300 +0,0 @@
|
||||
<!--Copyright 2023 Custom Diffusion authors The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# 커스텀 Diffusion 학습 예제
|
||||
|
||||
[커스텀 Diffusion](https://arxiv.org/abs/2212.04488)은 피사체의 이미지 몇 장(4~5장)만 주어지면 Stable Diffusion처럼 text-to-image 모델을 커스터마이징하는 방법입니다.
|
||||
`train_custom_diffusion.py` 스크립트는 학습 과정을 구현하며 이를 Stable Diffusion에 맞게 적용하는 방법을 보여줍니다.
|
||||
|
||||
이 학습 예시는 Custom Diffusion의 저자 중 한 명인 [Nupur Kumari](https://nupurkmr9.github.io/)가 제공했습니다.
|
||||
|
||||
## 로컬에서 PyTorch로 실행하기
|
||||
|
||||
### Dependencies 설치하기
|
||||
|
||||
스크립트를 실행하기 전에 라이브러리의 학습 dependencies를 설치해야 합니다:
|
||||
|
||||
**중요**
|
||||
|
||||
예제 스크립트의 최신 버전을 성공적으로 실행하려면 **소스로부터 설치**하는 것을 매우 권장하며, 예제 스크립트를 자주 업데이트하는 만큼 일부 예제별 요구 사항을 설치하고 설치를 최신 상태로 유지하는 것이 좋습니다. 이를 위해 새 가상 환경에서 다음 단계를 실행하세요:
|
||||
|
||||
|
||||
```bash
|
||||
git clone https://github.com/huggingface/diffusers
|
||||
cd diffusers
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
[example folder](https://github.com/huggingface/diffusers/tree/main/examples/custom_diffusion)로 cd하여 이동하세요.
|
||||
|
||||
```bash
|
||||
cd examples/custom_diffusion
|
||||
```
|
||||
|
||||
이제 실행하세요:
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
pip install clip-retrieval
|
||||
```
|
||||
|
||||
그리고 [🤗Accelerate](https://github.com/huggingface/accelerate/) 환경을 초기화:
|
||||
|
||||
```bash
|
||||
accelerate config
|
||||
```
|
||||
|
||||
또는 사용자 환경에 대한 질문에 답하지 않고 기본 가속 구성을 사용하려면 다음과 같이 하세요.
|
||||
|
||||
```bash
|
||||
accelerate config default
|
||||
```
|
||||
|
||||
또는 사용 중인 환경이 대화형 셸을 지원하지 않는 경우(예: jupyter notebook)
|
||||
|
||||
```python
|
||||
from accelerate.utils import write_basic_config
|
||||
|
||||
write_basic_config()
|
||||
```
|
||||
### 고양이 예제 😺
|
||||
|
||||
이제 데이터셋을 가져옵니다. [여기](https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip)에서 데이터셋을 다운로드하고 압축을 풉니다. 직접 데이터셋을 사용하려면 [학습용 데이터셋 생성하기](create_dataset) 가이드를 참고하세요.
|
||||
|
||||
또한 `clip-retrieval`을 사용하여 200개의 실제 이미지를 수집하고, regularization을 위해 이를 학습 데이터셋의 타겟 이미지와 결합합니다. 이렇게 하면 주어진 타겟 이미지에 대한 과적합을 방지할 수 있습니다. 다음 플래그를 사용하면 `prior_loss_weight=1.`과 함께 `prior_preservation`, `real_prior` regularization을 활성화할 수 있습니다.
`class_prompt`는 대상 이미지와 동일한 카테고리 이름이어야 합니다. 수집된 실제 이미지에는 `class_prompt`와 유사한 텍스트 캡션이 있습니다. 검색된 이미지는 `class_data_dir`에 저장됩니다. 생성된 이미지를 regularization으로 사용하려면 `real_prior`를 비활성화하면 됩니다. 실제 이미지를 수집하려면 학습 전에 먼저 아래 명령을 실행하세요.
|
||||
|
||||
```bash
|
||||
pip install clip-retrieval
|
||||
python retrieve.py --class_prompt cat --class_data_dir real_reg/samples_cat --num_class_images 200
|
||||
```
|
||||
|
||||
**___참고: [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 모델을 사용하는 경우 `resolution`을 768로 변경하세요.___**
|
||||
|
||||
스크립트는 모델 체크포인트와 `pytorch_custom_diffusion_weights.bin` 파일을 생성하여 저장소에 저장합니다.
|
||||
|
||||
```bash
|
||||
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
|
||||
export OUTPUT_DIR="path-to-save-model"
|
||||
export INSTANCE_DIR="./data/cat"
|
||||
|
||||
accelerate launch train_custom_diffusion.py \
|
||||
--pretrained_model_name_or_path=$MODEL_NAME \
|
||||
--instance_data_dir=$INSTANCE_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--class_data_dir=./real_reg/samples_cat/ \
|
||||
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
|
||||
--class_prompt="cat" --num_class_images=200 \
|
||||
--instance_prompt="photo of a <new1> cat" \
|
||||
--resolution=512 \
|
||||
--train_batch_size=2 \
|
||||
--learning_rate=1e-5 \
|
||||
--lr_warmup_steps=0 \
|
||||
--max_train_steps=250 \
|
||||
--scale_lr --hflip \
|
||||
--modifier_token "<new1>" \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
**더 낮은 VRAM 요구 사항(GPU당 16GB)으로 더 빠르게 훈련하려면 `--enable_xformers_memory_efficient_attention`을 사용하세요. 설치 방법은 [가이드](https://github.com/facebookresearch/xformers)를 따르세요.**
|
||||
|
||||
Weights & Biases(`wandb`)를 사용하여 실험을 추적하고 중간 결과를 저장하려면(강력히 권장합니다) 다음 단계를 따르세요:
|
||||
|
||||
* `wandb` 설치: `pip install wandb`.
|
||||
* 로그인 : `wandb login`.
|
||||
* 그런 다음 트레이닝을 시작하는 동안 `validation_prompt`를 지정하고 `report_to`를 `wandb`로 설정합니다. 다음과 같은 관련 인수를 구성할 수도 있습니다:
|
||||
* `num_validation_images`
|
||||
* `validation_steps`
|
||||
|
||||
```bash
|
||||
accelerate launch train_custom_diffusion.py \
|
||||
--pretrained_model_name_or_path=$MODEL_NAME \
|
||||
--instance_data_dir=$INSTANCE_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--class_data_dir=./real_reg/samples_cat/ \
|
||||
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
|
||||
--class_prompt="cat" --num_class_images=200 \
|
||||
--instance_prompt="photo of a <new1> cat" \
|
||||
--resolution=512 \
|
||||
--train_batch_size=2 \
|
||||
--learning_rate=1e-5 \
|
||||
--lr_warmup_steps=0 \
|
||||
--max_train_steps=250 \
|
||||
--scale_lr --hflip \
|
||||
--modifier_token "<new1>" \
|
||||
--validation_prompt="<new1> cat sitting in a bucket" \
|
||||
--report_to="wandb" \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
다음은 [Weights and Biases page](https://wandb.ai/sayakpaul/custom-diffusion/runs/26ghrcau)의 예시이며, 여러 학습 세부 정보와 함께 중간 결과들을 확인할 수 있습니다.
|
||||
|
||||
`--push_to_hub`를 지정하면 학습된 파라미터가 허깅 페이스 허브의 리포지토리에 푸시됩니다. 다음은 [예제 리포지토리](https://huggingface.co/sayakpaul/custom-diffusion-cat)입니다.
|
||||
|
||||
### 멀티 컨셉에 대한 학습 🐱🪵
|
||||
|
||||
[this](https://github.com/ShivamShrirao/diffusers/blob/main/examples/dreambooth/train_dreambooth.py)와 유사하게 각 컨셉에 대한 정보가 포함된 [json](https://github.com/adobe-research/custom-diffusion/blob/main/assets/concept_list.json) 파일을 제공합니다.
|
||||
|
||||
실제 이미지를 수집하려면 json 파일의 각 컨셉에 대해 이 명령을 실행합니다.
|
||||
|
||||
```bash
|
||||
pip install clip-retrieval
|
||||
python retrieve.py --class_prompt {} --class_data_dir {} --num_class_images 200
|
||||
```
|
||||
|
||||
그럼 우리는 학습시킬 준비가 되었습니다!
|
||||
|
||||
```bash
|
||||
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
|
||||
export OUTPUT_DIR="path-to-save-model"
|
||||
|
||||
accelerate launch train_custom_diffusion.py \
|
||||
--pretrained_model_name_or_path=$MODEL_NAME \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--concepts_list=./concept_list.json \
|
||||
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
|
||||
--resolution=512 \
|
||||
--train_batch_size=2 \
|
||||
--learning_rate=1e-5 \
|
||||
--lr_warmup_steps=0 \
|
||||
--max_train_steps=500 \
|
||||
--num_class_images=200 \
|
||||
--scale_lr --hflip \
|
||||
--modifier_token "<new1>+<new2>" \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
다음은 [Weights and Biases page](https://wandb.ai/sayakpaul/custom-diffusion/runs/3990tzkg)의 예시이며, 다른 학습 세부 정보와 함께 중간 결과들을 확인할 수 있습니다.
|
||||
|
||||
### 사람 얼굴에 대한 학습
|
||||
|
||||
사람 얼굴에 대한 파인튜닝을 위해 다음과 같은 설정이 더 효과적이라는 것을 확인했습니다: `learning_rate=5e-6`, `max_train_steps=1000 to 2000`, `freeze_model=crossattn`을 최소 15~20개의 이미지로 설정합니다.
|
||||
|
||||
실제 이미지를 수집하려면 훈련 전에 이 명령을 먼저 사용하십시오.
|
||||
|
||||
```bash
|
||||
pip install clip-retrieval
|
||||
python retrieve.py --class_prompt person --class_data_dir real_reg/samples_person --num_class_images 200
|
||||
```
|
||||
|
||||
이제 학습을 시작하세요!
|
||||
|
||||
```bash
|
||||
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
|
||||
export OUTPUT_DIR="path-to-save-model"
|
||||
export INSTANCE_DIR="path-to-images"
|
||||
|
||||
accelerate launch train_custom_diffusion.py \
|
||||
--pretrained_model_name_or_path=$MODEL_NAME \
|
||||
--instance_data_dir=$INSTANCE_DIR \
|
||||
--output_dir=$OUTPUT_DIR \
|
||||
--class_data_dir=./real_reg/samples_person/ \
|
||||
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
|
||||
--class_prompt="person" --num_class_images=200 \
|
||||
--instance_prompt="photo of a <new1> person" \
|
||||
--resolution=512 \
|
||||
--train_batch_size=2 \
|
||||
--learning_rate=5e-6 \
|
||||
--lr_warmup_steps=0 \
|
||||
--max_train_steps=1000 \
|
||||
--scale_lr --hflip --noaug \
|
||||
--freeze_model crossattn \
|
||||
--modifier_token "<new1>" \
|
||||
--enable_xformers_memory_efficient_attention \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
## 추론
|
||||
|
||||
위 명령어로 모델을 학습시킨 후에는 아래 코드를 사용하여 추론을 실행할 수 있습니다. 프롬프트에는 학습 때 사용한 `modifier token`(예: 위 예제의 \<new1\>)을 반드시 포함해야 합니다.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
|
||||
pipe.unet.load_attn_procs("path-to-save-model", weight_name="pytorch_custom_diffusion_weights.bin")
|
||||
pipe.load_textual_inversion("path-to-save-model", weight_name="<new1>.bin")
|
||||
|
||||
image = pipe(
|
||||
"<new1> cat sitting in a bucket",
|
||||
num_inference_steps=100,
|
||||
guidance_scale=6.0,
|
||||
eta=1.0,
|
||||
).images[0]
|
||||
image.save("cat.png")
|
||||
```
|
||||
|
||||
허브 리포지토리에서 이러한 매개변수를 직접 로드할 수 있습니다:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from huggingface_hub.repocard import RepoCard
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
model_id = "sayakpaul/custom-diffusion-cat"
|
||||
card = RepoCard.load(model_id)
|
||||
base_model_id = card.data.to_dict()["base_model"]
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
|
||||
pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
|
||||
pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
|
||||
|
||||
image = pipe(
|
||||
"<new1> cat sitting in a bucket",
|
||||
num_inference_steps=100,
|
||||
guidance_scale=6.0,
|
||||
eta=1.0,
|
||||
).images[0]
|
||||
image.save("cat.png")
|
||||
```
|
||||
|
||||
다음은 여러 컨셉으로 추론을 수행하는 예제입니다:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from huggingface_hub.repocard import RepoCard
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
model_id = "sayakpaul/custom-diffusion-cat-wooden-pot"
|
||||
card = RepoCard.load(model_id)
|
||||
base_model_id = card.data.to_dict()["base_model"]
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
|
||||
pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
|
||||
pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
|
||||
pipe.load_textual_inversion(model_id, weight_name="<new2>.bin")
|
||||
|
||||
image = pipe(
|
||||
"the <new1> cat sculpture in the style of a <new2> wooden pot",
|
||||
num_inference_steps=100,
|
||||
guidance_scale=6.0,
|
||||
eta=1.0,
|
||||
).images[0]
|
||||
image.save("multi-subject.png")
|
||||
```
|
||||
|
||||
여기서 '고양이'와 '나무 냄비'는 여러 컨셉을 말합니다.
|
||||
|
||||
### 학습된 체크포인트에서 추론하기
|
||||
|
||||
`--checkpointing_steps` 인수를 사용한 경우 학습 과정에서 저장된 전체 체크포인트 중 하나에서 추론을 수행할 수도 있습니다.
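체크포인트 폴더의 구성은 학습 설정에 따라 다를 수 있으므로, 아래는 저장된 체크포인트 폴더에 위와 같은 가중치 파일(`pytorch_custom_diffusion_weights.bin`, `<new1>.bin`)이 들어 있다고 가정한 간단한 스케치입니다(`checkpoint-250` 경로는 설명을 위한 가상의 경로입니다):

```python
import torch
from diffusers import DiffusionPipeline

# 가정: --checkpointing_steps 로 저장된 체크포인트 폴더 (예시용 가상 경로)
checkpoint_dir = "path-to-save-model/checkpoint-250"

pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
pipe.unet.load_attn_procs(checkpoint_dir, weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion(checkpoint_dir, weight_name="<new1>.bin")

image = pipe(
    "<new1> cat sitting in a bucket",
    num_inference_steps=100,
    guidance_scale=6.0,
    eta=1.0,
).images[0]
image.save("cat-from-checkpoint.png")
```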
|
||||
|
||||
## Grads를 None으로 설정
|
||||
|
||||
더 많은 메모리를 절약하려면 스크립트에 `--set_grads_to_none` 인수를 전달하세요. 이렇게 하면 기울기(gradient)가 0이 아니라 `None`으로 설정됩니다. 다만 이로 인해 특정 동작이 달라지므로, 문제가 발생하면 이 인수를 제거하세요.
|
||||
|
||||
자세한 정보: https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html
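참고로 이 옵션은 PyTorch의 `Optimizer.zero_grad(set_to_none=True)`에 해당합니다. 동작을 보여주는 최소한의 예시입니다(모델과 optimizer는 설명을 위한 임의의 예시입니다):

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# --set_grads_to_none 은 기울기를 0 텐서로 채우는 대신 None으로 설정해 메모리를 절약합니다
optimizer.zero_grad(set_to_none=True)
```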
|
||||
|
||||
## 실험 결과
|
||||
|
||||
실험에 대한 자세한 내용은 [프로젝트 웹페이지](https://www.cs.cmu.edu/~custom-diffusion/)를 참조하세요.
|
||||
@@ -1,92 +0,0 @@
|
||||
# 여러 GPU를 사용한 분산 추론
|
||||
|
||||
분산 설정에서는 여러 개의 프롬프트를 동시에 생성할 때 유용한 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) 또는 [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html)를 사용하여 여러 GPU에서 추론을 실행할 수 있습니다.
|
||||
|
||||
이 가이드에서는 분산 추론을 위해 🤗 Accelerate와 PyTorch Distributed를 사용하는 방법을 보여드립니다.
|
||||
|
||||
## 🤗 Accelerate
|
||||
|
||||
🤗 [Accelerate](https://huggingface.co/docs/accelerate/index)는 분산 설정에서 추론을 쉽게 훈련하거나 실행할 수 있도록 설계된 라이브러리입니다. 분산 환경 설정 프로세스를 간소화하여 PyTorch 코드에 집중할 수 있도록 해줍니다.
|
||||
|
||||
시작하려면 Python 파일을 생성하고 [`accelerate.PartialState`]를 초기화하여 분산 환경을 생성하세요. 설정이 자동으로 감지되므로 `rank`나 `world_size`를 명시적으로 정의할 필요가 없습니다. [`DiffusionPipeline`]을 `distributed_state.device`로 이동하여 각 프로세스에 GPU를 할당합니다.
|
||||
|
||||
이제 컨텍스트 관리자로 [`~accelerate.PartialState.split_between_processes`] 유틸리티를 사용하여 프로세스 수에 따라 프롬프트를 자동으로 분배합니다.
|
||||
|
||||
|
||||
```py
|
||||
import torch
from accelerate import PartialState
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
|
||||
distributed_state = PartialState()
|
||||
pipeline.to(distributed_state.device)
|
||||
|
||||
with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt:
|
||||
result = pipeline(prompt).images[0]
|
||||
result.save(f"result_{distributed_state.process_index}.png")
|
||||
```
|
||||
|
||||
사용할 GPU 수를 지정하려면 `--num_processes` 인수를 사용하고, `accelerate launch`를 호출하여 스크립트를 실행하세요:
|
||||
|
||||
```bash
|
||||
accelerate launch run_distributed.py --num_processes=2
|
||||
```
|
||||
|
||||
<Tip>

자세한 내용은 [🤗 Accelerate를 사용한 분산 추론](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) 가이드를 참조하세요.
|
||||
|
||||
</Tip>
|
||||
|
||||
## PyTorch 분산
|
||||
|
||||
PyTorch는 데이터 병렬 처리를 가능하게 하는 [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)을 지원합니다.
|
||||
|
||||
시작하려면 Python 파일을 생성하고 `torch.distributed` 및 `torch.multiprocessing`을 임포트하여 분산 프로세스 그룹을 설정하고 각 GPU에서 추론용 프로세스를 생성합니다. 그리고 [`DiffusionPipeline`]도 초기화해야 합니다:
|
||||
|
||||
확산 파이프라인을 `rank`로 이동하고 `get_rank`를 사용하여 각 프로세스에 GPU를 할당하면 각 프로세스가 다른 프롬프트를 처리합니다:
|
||||
|
||||
```py
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
import torch.multiprocessing as mp
|
||||
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
|
||||
```
|
||||
|
||||
추론을 실행할 함수를 만들어야 합니다. [`init_process_group`]은 사용할 백엔드 유형, 현재 프로세스의 `rank`, 그리고 참여하는 프로세스 수인 `world_size`를 받아 분산 환경 생성을 처리합니다.
|
||||
|
||||
2개의 GPU에서 추론을 병렬로 실행하는 경우 `world_size`는 2입니다.
|
||||
|
||||
```py
|
||||
def run_inference(rank, world_size):
|
||||
dist.init_process_group("nccl", rank=rank, world_size=world_size)
|
||||
|
||||
sd.to(rank)
|
||||
|
||||
if torch.distributed.get_rank() == 0:
|
||||
prompt = "a dog"
|
||||
elif torch.distributed.get_rank() == 1:
|
||||
prompt = "a cat"
|
||||
|
||||
image = sd(prompt).images[0]
|
||||
image.save(f"./{'_'.join(prompt)}.png")
|
||||
```
|
||||
|
||||
분산 추론을 실행하려면 [`mp.spawn`](https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn)을 호출하여 `world_size`에 정의된 GPU 수에 대해 `run_inference` 함수를 실행합니다:
|
||||
|
||||
```py
|
||||
def main():
|
||||
world_size = 2
|
||||
mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
```
|
||||
|
||||
추론 스크립트를 완료했으면 `--nproc_per_node` 인수를 사용하여 사용할 GPU 수를 지정하고 `torchrun`을 호출하여 스크립트를 실행합니다:
|
||||
|
||||
```bash
|
||||
torchrun run_distributed.py --nproc_per_node=2
|
||||
```
|
||||
@@ -15,7 +15,8 @@ specific language governing permissions and limitations under the License.
|
||||
[DreamBooth](https://arxiv.org/abs/2208.12242)는 한 주제에 대한 적은 이미지(3~5개)만으로도 stable diffusion과 같이 text-to-image 모델을 개인화할 수 있는 방법입니다. 이를 통해 모델은 다양한 장면, 포즈 및 장면(뷰)에서 피사체에 대해 맥락화(contextualized)된 이미지를 생성할 수 있습니다.
|
||||
|
||||

|
||||
<small><a href="https://dreambooth.github.io">프로젝트 블로그</a>에서의 Dreambooth 예시</small>
|
||||
|
||||
|
||||
이 가이드는 다양한 GPU, Flax 사양에 대해 [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) 모델로 DreamBooth를 파인튜닝하는 방법을 보여줍니다. 더 깊이 파고들어 작동 방식을 확인하는 데 관심이 있는 경우, 이 가이드에 사용된 DreamBooth의 모든 학습 스크립트를 [여기](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth)에서 찾을 수 있습니다.
|
||||
@@ -471,4 +472,4 @@ image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
|
||||
image.save("dog-bucket.png")
|
||||
```
|
||||
|
||||
[저장된 학습 체크포인트](#inference-from-a-saved-checkpoint)에서도 추론을 실행할 수 있습니다.
|
||||
@@ -1,211 +0,0 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# InstructPix2Pix
|
||||
|
||||
[InstructPix2Pix](https://arxiv.org/abs/2211.09800)는 text-conditioned diffusion 모델이 한 이미지에 편집을 따를 수 있도록 파인튜닝하는 방법입니다. 이 방법을 사용하여 파인튜닝된 모델은 다음을 입력으로 사용합니다:
|
||||
|
||||
<p align="center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png" alt="instructpix2pix-inputs" width=600/>
|
||||
</p>
|
||||
|
||||
출력은 입력 이미지에 편집 지시가 반영된 "수정된" 이미지입니다:
|
||||
|
||||
<p align="center">
|
||||
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/output-gs%407-igs%401-steps%4050.png" alt="instructpix2pix-output" width=600/>
|
||||
</p>
|
||||
|
||||
`train_instruct_pix2pix.py` 스크립트([여기](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py)에서 찾을 수 있습니다.)는 학습 절차를 설명하고 Stable Diffusion에 적용할 수 있는 방법을 보여줍니다.
|
||||
|
||||
|
||||
***`train_instruct_pix2pix.py`는 [원래 구현](https://github.com/timothybrooks/instruct-pix2pix)에 충실하게 InstructPix2Pix 학습 절차를 구현하고 있지만, [소규모 데이터셋](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples)에서만 테스트했습니다. 이는 최종 결과에 영향을 끼칠 수 있습니다. 더 나은 결과를 위해, 더 큰 데이터셋에서 더 길게 학습하는 것을 권장합니다. InstructPix2Pix 학습을 위한 큰 데이터셋은 [여기](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered)에서 찾을 수 있습니다.***
|
||||
|
||||
## PyTorch로 로컬에서 실행하기
|
||||
|
||||
### 종속성(dependencies) 설치하기
|
||||
|
||||
이 스크립트를 실행하기 전에, 라이브러리의 학습 종속성을 설치하세요:
|
||||
|
||||
**중요**
|
||||
|
||||
최신 버전의 예제 스크립트를 성공적으로 실행하기 위해, **원본으로부터 설치**하는 것과 예제 스크립트를 자주 업데이트하고 예제별 요구사항을 설치하기 때문에 최신 상태로 유지하는 것을 권장합니다. 이를 위해, 새로운 가상 환경에서 다음 스텝을 실행하세요:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/huggingface/diffusers
|
||||
cd diffusers
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
cd 명령어로 예제 폴더로 이동하세요.
|
||||
```bash
|
||||
cd examples/instruct_pix2pix
|
||||
```
|
||||
|
||||
이제 실행하세요.
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
그리고 [🤗Accelerate](https://github.com/huggingface/accelerate/) 환경에서 초기화하세요:
|
||||
|
||||
```bash
|
||||
accelerate config
|
||||
```
|
||||
|
||||
혹은 환경에 대한 질문 없이 기본적인 accelerate 구성을 사용하려면 다음을 실행하세요.
|
||||
|
||||
```bash
|
||||
accelerate config default
|
||||
```
|
||||
|
||||
혹은 사용 중인 환경이 notebook과 같은 대화형 쉘은 지원하지 않는 경우는 다음 절차를 따라주세요.
|
||||
|
||||
```python
|
||||
from accelerate.utils import write_basic_config
|
||||
|
||||
write_basic_config()
|
||||
```
|
||||
|
||||
### 예시
|
||||
|
||||
이전에 언급했듯이, 학습을 위해 [작은 데이터셋](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples)을 사용할 것입니다. 그 데이터셋은 InstructPix2Pix 논문에서 사용된 [원래의 데이터셋](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered)보다 작은 버전입니다. 자신의 데이터셋을 사용하기 위해, [학습을 위한 데이터셋 만들기](create_dataset) 가이드를 참고하세요.
|
||||
|
||||
`MODEL_NAME` 환경 변수(허브 모델 레포지토리 또는 모델 가중치가 포함된 폴더 경로)를 지정하고 [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) 인수에 전달합니다. `DATASET_ID`에 데이터셋 이름을 지정해야 합니다:
|
||||
|
||||
|
||||
```bash
|
||||
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
|
||||
export DATASET_ID="fusing/instructpix2pix-1000-samples"
|
||||
```
|
||||
|
||||
지금, 학습을 실행할 수 있습니다. 스크립트는 레포지토리의 하위 폴더의 모든 구성요소(`feature_extractor`, `scheduler`, `text_encoder`, `unet` 등)를 저장합니다.
|
||||
|
||||
```bash
|
||||
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
|
||||
--pretrained_model_name_or_path=$MODEL_NAME \
|
||||
--dataset_name=$DATASET_ID \
|
||||
--enable_xformers_memory_efficient_attention \
|
||||
--resolution=256 --random_flip \
|
||||
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
|
||||
--max_train_steps=15000 \
|
||||
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
|
||||
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
|
||||
--conditioning_dropout_prob=0.05 \
|
||||
--mixed_precision=fp16 \
|
||||
--seed=42 \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
|
||||
추가적으로, Weights & Biases로 학습 과정을 모니터링하면서 검증 추론을 수행하는 것도 지원합니다. `report_to="wandb"`로 이 기능을 활성화할 수 있습니다:
|
||||
|
||||
```bash
|
||||
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
|
||||
--pretrained_model_name_or_path=$MODEL_NAME \
|
||||
--dataset_name=$DATASET_ID \
|
||||
--enable_xformers_memory_efficient_attention \
|
||||
--resolution=256 --random_flip \
|
||||
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
|
||||
--max_train_steps=15000 \
|
||||
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
|
||||
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
|
||||
--conditioning_dropout_prob=0.05 \
|
||||
--mixed_precision=fp16 \
|
||||
--val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
|
||||
--validation_prompt="make the mountains snowy" \
|
||||
--seed=42 \
|
||||
--report_to=wandb \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
이 검증 방식은 모델 디버깅에 유용하므로 사용을 권장합니다. 이를 사용하려면 `wandb`가 설치되어 있어야 하며, `pip install wandb`를 실행해 설치할 수 있습니다.
|
||||
|
||||
몇 가지 검증 결과와 학습 파라미터를 포함하는 예시는 [여기](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq)에서 볼 수 있습니다.
|
||||
|
||||
***참고: 원본 논문에서 저자들은 256x256 이미지 해상도로 학습한 모델이 512x512와 같은 더 큰 해상도에도 잘 일반화되는 것을 확인했습니다. 이는 학습에 큰 데이터셋을 사용했기 때문입니다.***
|
||||
|
||||
## 다수의 GPU로 학습하기
|
||||
|
||||
`accelerate`는 원활한 멀티 GPU 학습을 지원합니다. `accelerate`로 분산 학습을 실행하는 방법은 [여기](https://huggingface.co/docs/accelerate/basic_tutorials/launch)의 설명을 따라 주세요. 예시 명령어는 다음과 같습니다:
|
||||
|
||||
|
||||
```bash
|
||||
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py \
|
||||
--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
|
||||
--dataset_name=sayakpaul/instructpix2pix-1000-samples \
|
||||
--use_ema \
|
||||
--enable_xformers_memory_efficient_attention \
|
||||
--resolution=512 --random_flip \
|
||||
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
|
||||
--max_train_steps=15000 \
|
||||
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
|
||||
--learning_rate=5e-05 --lr_warmup_steps=0 \
|
||||
--conditioning_dropout_prob=0.05 \
|
||||
--mixed_precision=fp16 \
|
||||
--seed=42 \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
## 추론하기
|
||||
|
||||
일단 학습이 완료되면, 다음과 같이 추론할 수 있습니다:
|
||||
|
||||
```python
|
||||
import PIL
|
||||
import requests
|
||||
import torch
|
||||
from diffusers import StableDiffusionInstructPix2PixPipeline
|
||||
|
||||
model_id = "your_model_id" # <- 이를 수정하세요.
|
||||
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
|
||||
generator = torch.Generator("cuda").manual_seed(0)
|
||||
|
||||
url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png"
|
||||
|
||||
|
||||
def download_image(url):
|
||||
image = PIL.Image.open(requests.get(url, stream=True).raw)
|
||||
image = PIL.ImageOps.exif_transpose(image)
|
||||
image = image.convert("RGB")
|
||||
return image
|
||||
|
||||
|
||||
image = download_image(url)
|
||||
prompt = "wipe out the lake"
|
||||
num_inference_steps = 20
|
||||
image_guidance_scale = 1.5
|
||||
guidance_scale = 10
|
||||
|
||||
edited_image = pipe(
|
||||
prompt,
|
||||
image=image,
|
||||
num_inference_steps=num_inference_steps,
|
||||
image_guidance_scale=image_guidance_scale,
|
||||
guidance_scale=guidance_scale,
|
||||
generator=generator,
|
||||
).images[0]
|
||||
edited_image.save("edited_image.png")
|
||||
```
|
||||
|
||||
학습 스크립트를 사용해 얻은 예시의 모델 레포지토리는 여기 [sayakpaul/instruct-pix2pix](https://huggingface.co/sayakpaul/instruct-pix2pix)에서 확인할 수 있습니다.
|
||||
|
||||
추론 시 속도와 품질을 제어하기 위해 다음 세 가지 파라미터를 조절해 보는 것이 좋습니다:
|
||||
|
||||
* `num_inference_steps`
|
||||
* `image_guidance_scale`
|
||||
* `guidance_scale`
|
||||
|
||||
특히, `image_guidance_scale`과 `guidance_scale`은 생성된("수정된") 이미지에 큰 영향을 미칠 수 있습니다. ([여기](https://twitter.com/RisingSayak/status/1628392199196151808?s=20)에서 예시를 참고하세요.)
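예를 들어, 앞의 추론 예시에서 만든 `pipe`와 `image`(그리고 `torch` 임포트)를 그대로 사용한다고 가정하면, 아래와 같이 동일한 시드로 `image_guidance_scale` 값만 바꿔 가며 결과를 비교해 볼 수 있습니다(시도할 값 목록은 임의로 정한 예시입니다):

```python
# 앞의 추론 예시에서 로드한 pipe와 image를 재사용한다고 가정한 비교 스케치
prompt = "wipe out the lake"

for image_guidance_scale in [1.0, 1.5, 2.0]:
    edited_image = pipe(
        prompt,
        image=image,
        num_inference_steps=20,
        image_guidance_scale=image_guidance_scale,
        guidance_scale=10,
        generator=torch.Generator("cuda").manual_seed(0),  # 동일한 시드로 비교
    ).images[0]
    edited_image.save(f"edited_igs_{image_guidance_scale}.png")
```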
|
||||
|
||||
|
||||
만약 InstructPix2Pix 학습 방법을 사용해 몇 가지 흥미로운 방법을 찾고 있다면, 이 블로그 게시물[Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd)을 확인해주세요.
|
||||
@@ -47,7 +47,7 @@ huggingface-cli login
|
||||
수십억 개의 파라미터가 있는 Stable Diffusion과 같은 모델을 파인튜닝하는 것은 느리고 어려울 수 있습니다. LoRA를 사용하면 diffusion 모델을 파인튜닝하는 것이 훨씬 쉽고 빠릅니다. 8비트 옵티마이저와 같은 트릭에 의존하지 않고도 11GB의 GPU RAM을 가진 하드웨어에서 실행할 수 있습니다.
|
||||
|
||||
|
||||
### 학습[[dreambooth-training]]
|
||||
### 학습 [[text-to-image 학습]]
|
||||
|
||||
[Pokémon BLIP 캡션](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) 데이터셋으로 [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)를 파인튜닝해 나만의 포켓몬을 생성해 보겠습니다.
|
||||
|
||||
@@ -89,7 +89,7 @@ accelerate launch train_dreambooth_lora.py \
|
||||
--push_to_hub
|
||||
```
|
||||
|
||||
### 추론[[dreambooth-inference]]
|
||||
### 추론 [[dreambooth 추론]]
|
||||
|
||||
이제 [`StableDiffusionPipeline`]에서 기본 모델을 불러와 추론을 위해 모델을 사용할 수 있습니다:
|
||||
|
||||
|
||||
@@ -96,7 +96,7 @@ huggingface-cli login
|
||||
>>> dataset = load_dataset(config.dataset_name, split="train")
|
||||
```
|
||||
|
||||
💡[HugGan Community Event](https://huggingface.co/huggan) 에서 추가의 데이터셋을 찾거나 로컬의 [`ImageFolder`](https://huggingface.co/docs/datasets/image_dataset#imagefolder)를 만듦으로써 나만의 데이터셋을 사용할 수 있습니다. HugGan Community Event 에 가져온 데이터셋의 경우 리포지토리의 id로 `config.dataset_name` 을 설정하고, 나만의 이미지를 사용하는 경우 `imagefolder` 를 설정합니다.
|
||||
💡[HugGan Community Event](https://huggingface.co/huggan) 에서 추가의 데이터셋을 찾거나 로컬의 [`ImageFolder`](https://huggingface.co/docs/datasets/image_dataset#imagefolder)를 만듦으로써 나만의 데이터셋을 사용할 수 있습니다. HugGan Community Event 에 가져온 데이터셋의 경우 레포지토리의 id로 `config.dataset_name` 을 설정하고, 나만의 이미지를 사용하는 경우 `imagefolder` 를 설정합니다.
|
||||
|
||||
🤗 Datasets은 [`~datasets.Image`] 기능을 사용해 자동으로 이미지 데이터를 디코딩하고 [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html)로 불러옵니다. 이를 시각화 해보면:
|
||||
|
||||

@@ -277,33 +277,42 @@ Output shape: torch.Size([1, 3, 128, 128])
...         image_grid.save(f"{test_dir}/{epoch:04d}.png")
```

Now you can wrap all of these components together in a training loop with 🤗 Accelerate for easy TensorBoard logging, gradient accumulation, and mixed precision training. To upload the model to the Hub, write a function to get your repository name and information, and then push it to the Hub.

💡 The training loop below may look intimidating and long, but it'll be worth it later when you launch your training in just one line of code! If you can't wait and want to start generating images, feel free to copy and run the code below. 🤗

```py
>>> from accelerate import Accelerator
>>> from huggingface_hub import create_repo, upload_folder
>>> from huggingface_hub import HfFolder, Repository, whoami
>>> from tqdm.auto import tqdm
>>> from pathlib import Path
>>> import os


>>> def get_full_repo_name(model_id: str, organization: str = None, token: str = None):
...     if token is None:
...         token = HfFolder.get_token()
...     if organization is None:
...         username = whoami(token)["name"]
...         return f"{username}/{model_id}"
...     else:
...         return f"{organization}/{model_id}"


>>> def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
...     # Initialize accelerator and tensorboard logging
...     accelerator = Accelerator(
...         mixed_precision=config.mixed_precision,
...         gradient_accumulation_steps=config.gradient_accumulation_steps,
...         log_with="tensorboard",
...         project_dir=os.path.join(config.output_dir, "logs"),
...         logging_dir=os.path.join(config.output_dir, "logs"),
...     )
...     if accelerator.is_main_process:
...         if config.output_dir is not None:
...             os.makedirs(config.output_dir, exist_ok=True)
...         if config.push_to_hub:
...             repo_id = create_repo(
...                 repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True
...             ).repo_id
...             repo_name = get_full_repo_name(Path(config.output_dir).name)
...             repo = Repository(config.output_dir, clone_from=repo_name)
...         elif config.output_dir is not None:
...             os.makedirs(config.output_dir, exist_ok=True)
...         accelerator.init_trackers("train_example")

...     # Everything is ready now
@@ -360,12 +369,7 @@ Now you can wrap all of these components together in a training loop with 🤗 Accelerate

...         if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
...             if config.push_to_hub:
...                 upload_folder(
...                     repo_id=repo_id,
...                     folder_path=config.output_dir,
...                     commit_message=f"Epoch {epoch}",
...                     ignore_patterns=["step_*", "epoch_*"],
...                 )
...                 repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
...             else:
...                 pipeline.save_pretrained(config.output_dir)
```

@@ -1,60 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Conditional image generation

[[open-in-colab]]

Conditional image generation lets you generate images from a text prompt. The text is converted into embeddings, which are used to condition the model to generate an image from noise.

The [`DiffusionPipeline`] is the easiest way to use a pretrained diffusion system for inference.

Start by creating an instance of [`DiffusionPipeline`] and specify which pipeline [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) you would like to download.

In this guide, you'll use the [`DiffusionPipeline`] for text-to-image generation with [Latent Diffusion](https://huggingface.co/CompVis/ldm-text2im-large-256):

```python
>>> from diffusers import DiffusionPipeline

>>> generator = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256")
```

The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components.
Because the model consists of roughly 1.4 billion parameters, we strongly recommend running it on a GPU.
You can move the generator object to a GPU, just like you would in PyTorch:

```python
>>> generator.to("cuda")
```

Now you can use the `generator` on your text prompt:

```python
>>> image = generator("An image of a squirrel in Picasso style").images[0]
```

The output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object by default.

You can save the image by calling:

```python
>>> image.save("image_of_squirrel_painting.png")
```

Try out the Space below, and feel free to play around with the guidance scale parameter to see how it affects image quality!

<iframe
	src="https://stabilityai-stable-diffusion.hf.space"
	frameborder="0"
	width="850"
	height="500"
></iframe>
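
You can also pass a guidance scale directly when calling the pipeline. The sketch below reuses the `generator` pipeline created above; the particular `guidance_scale` value is only an illustration:

```python
>>> # a higher guidance_scale makes the image follow the prompt more closely
>>> image = generator("An image of a squirrel in Picasso style", guidance_scale=7.5).images[0]
>>> image.save("image_of_squirrel_painting_guided.png")
```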

@@ -1,182 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# How to contribute a community pipeline

<Tip>

💡 Take a look at GitHub Issue [#841](https://github.com/huggingface/diffusers/issues/841) for more context about why we're adding community pipelines to help everyone easily share their work without being slowed down.

</Tip>

Community pipelines let you add any additional features you'd like on top of the [`DiffusionPipeline`]. The biggest advantage of building on top of the `DiffusionPipeline` is that anyone can load and use your pipeline by only adding one additional argument, which makes it super easy for the community to access.

This guide shows you how to create a community pipeline and explains how it works.
To keep things simple, you'll create a "one-step" pipeline where the `UNet` does a single forward pass and the scheduler is called once.

## Initialize the pipeline

Start by creating a `one_step_unet.py` file for your community pipeline. In this file, create a pipeline class that inherits from [`DiffusionPipeline`] so that the model weights and scheduler configuration can be loaded from the Hub. The one-step pipeline needs a `UNet` and a scheduler, so add these as arguments to the `__init__` function:

```python
from diffusers import DiffusionPipeline
import torch


class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()
```

To make sure your pipeline and its components (`unet` and `scheduler`) can be saved with [`~DiffusionPipeline.save_pretrained`], add them to the `register_modules` function:

```diff
  from diffusers import DiffusionPipeline
  import torch

  class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
      def __init__(self, unet, scheduler):
          super().__init__()

+         self.register_modules(unet=unet, scheduler=scheduler)
```
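
As a quick illustration of what registering the modules buys you, here is a minimal sketch (not part of the original guide) that saves and reloads the pipeline locally; the import path and output directory name are placeholders:

```python
from diffusers import DDPMScheduler, UNet2DModel
from one_step_unet import UnetSchedulerOneForwardPipeline  # the file created above

pipeline = UnetSchedulerOneForwardPipeline(unet=UNet2DModel(), scheduler=DDPMScheduler())
pipeline.save_pretrained("my-one-step-unet")  # writes the UNet weights and scheduler config
reloaded = UnetSchedulerOneForwardPipeline.from_pretrained("my-one-step-unet")
```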

Now that the initialization step is done, you can move on to the forward pass! 🔥

## Define the forward pass

In the forward pass (which we recommend defining as `__call__`), you have complete creative freedom to add whatever feature you'd like. For our amazing one-step pipeline, create a random image and only call the `unet` and `scheduler` once by setting `timestep=1`:

```diff
  from diffusers import DiffusionPipeline
  import torch


  class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
      def __init__(self, unet, scheduler):
          super().__init__()

          self.register_modules(unet=unet, scheduler=scheduler)

+     def __call__(self):
+         image = torch.randn(
+             (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
+         )
+         timestep = 1

+         model_output = self.unet(image, timestep).sample
+         scheduler_output = self.scheduler.step(model_output, timestep, image).prev_sample

+         return scheduler_output
```

That's it! 🚀 You can now run this pipeline by passing a `unet` and a `scheduler` to it:

```python
from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler()
unet = UNet2DModel()

pipeline = UnetSchedulerOneForwardPipeline(unet=unet, scheduler=scheduler)

output = pipeline()
```

But what's even better is that you can load pre-existing weights into the pipeline if the pipeline structure is identical. For example, you can load the [`google/ddpm-cifar10-32`](https://huggingface.co/google/ddpm-cifar10-32) weights into the one-step pipeline:

```python
pipeline = UnetSchedulerOneForwardPipeline.from_pretrained("google/ddpm-cifar10-32")

output = pipeline()
```

## Share your pipeline

Open a Pull Request on the 🧨 Diffusers [repository](https://github.com/huggingface/diffusers) to add your awesome pipeline in `one_step_unet.py` to the [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) subfolder.

Once it is merged, anyone with `diffusers >= 0.4.0` installed can use this pipeline magically 🪄 by specifying it in the `custom_pipeline` argument:

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="one_step_unet")
pipe()
```

Another way to share your community pipeline is to upload the `one_step_unet.py` file directly to your preferred [model repository](https://huggingface.co/docs/hub/models-uploading) on the Hub. Instead of specifying the `one_step_unet.py` file, pass the model repository id to the `custom_pipeline` argument:

```python
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="stevhliu/one_step_unet")
```

Take a look at the following table to compare the two sharing workflows and decide which option is best for you:

|                | GitHub community pipeline                                                                                          | HF Hub community pipeline                                                                   |
|----------------|--------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| Usage          | Same                                                                                                                 | Same                                                                                          |
| Review process | Open a Pull Request on GitHub and undergo a review process from the Diffusers team before merging; may be slower.   | Upload directly to a Hub repository without any review; this is the fastest workflow.        |
| Visibility     | Included in the official Diffusers repository and documentation.                                                    | Included on your HF Hub profile; relies on your own usage/promotion to gain visibility.      |

<Tip>

💡 You can use whatever package you want in your community pipeline file - as long as the user has it installed, everything will work fine. Make sure you have one and only one pipeline class that inherits from `DiffusionPipeline`, because it is detected automatically.

</Tip>

## How do community pipelines work?

A community pipeline is a class that inherits from [`DiffusionPipeline`]:

- It can be loaded with the [`custom_pipeline`] argument.
- Its model weights and scheduler configuration are loaded from [`pretrained_model_name_or_path`].
- The code that implements the features of the community pipeline is defined in a `pipeline.py` file.

Sometimes you can't load all the pipeline components' weights from an official repository. In this case, the other components should be passed directly to the pipeline:

```python
import torch

from diffusers import DDIMScheduler, DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel

model_id = "CompVis/stable-diffusion-v1-4"
clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"

feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16)

# any scheduler compatible with the checkpoint works here; DDIMScheduler is one common choice
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    scheduler=scheduler,
    torch_dtype=torch.float16,
)
```

The magic behind community pipelines is contained in the following code. It allows the community pipeline to be loaded from GitHub or the Hub, and it will be available to all 🧨 Diffusers packages.

```python
# 2. Load the pipeline class, if using custom module then load it from the Hub
# if we load from explicit class, let's use it
if custom_pipeline is not None:
    pipeline_class = get_class_from_dynamic_module(
        custom_pipeline, module_file=CUSTOM_PIPELINE_FILE_NAME, cache_dir=custom_pipeline
    )
elif cls != DiffusionPipeline:
    pipeline_class = cls
else:
    diffusers_module = importlib.import_module(cls.__module__.split(".")[0])
    pipeline_class = getattr(diffusers_module, config_dict["_class_name"])
```

@@ -1,45 +0,0 @@
# Control image brightness

The Stable Diffusion pipeline is mediocre at generating images that are either very bright or dark, as explained in the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) paper. The solutions proposed in the paper are currently implemented in the [`DDIMScheduler`], which you can use to improve the lighting in your images.

<Tip>

💡 Take a look at the paper linked above for more details about the proposed solutions!

</Tip>

One of the solutions is to train a model with *v prediction* and *v loss*. Add the following flag to the [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts to enable `v_prediction`:

```bash
--prediction_type="v_prediction"
```

For example, let's use the [`ptx0/pseudo-journey-v2`](https://huggingface.co/ptx0/pseudo-journey-v2) checkpoint, which has been fine-tuned with `v_prediction`.

Next, configure the following parameters in the [`DDIMScheduler`]:

1. `rescale_betas_zero_snr=True`, rescales the noise schedule to a zero terminal signal-to-noise ratio (SNR)
2. `timestep_spacing="trailing"`, starts sampling from the last timestep

```py
>>> from diffusers import DiffusionPipeline, DDIMScheduler

>>> pipeline = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2")
# switch the scheduler in the pipeline to use the DDIMScheduler

>>> pipeline.scheduler = DDIMScheduler.from_config(
...     pipeline.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
... )
>>> pipeline.to("cuda")
```

Finally, set `guidance_rescale` in your call to the pipeline to prevent overexposure:

```py
prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipeline(prompt, guidance_rescale=0.7).images[0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/zero_snr.png"/>
</div>

@@ -1,226 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Controlled generation

Controlling the outputs generated by diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in the inputs, both images and text prompts, can drastically change the outputs. In an ideal world, we want to be able to control how semantics are preserved and changed.

Most examples of preserving semantics reduce to being able to accurately map a change in the input to a change in the output. I.e. adding an adjective to a subject in a prompt preserves the entire image, only modifying the changed subject. Or, image variation of a particular subject preserves the subject's pose.

Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. I.e. in general, we want our outputs to be of good quality, adhere to a particular style, or be realistic.

We document some of the techniques `diffusers` supports for controlling the generation of diffusion models. Much of it is cutting-edge research and can be quite nuanced. If something needs clarifying or you have a suggestion, don't hesitate to open a discussion on the [forum](https://discuss.huggingface.co/) or a [GitHub issue](https://github.com/huggingface/diffusers/issues).

We provide a high-level explanation of how generation can be controlled as well as an overview of each technique. For a more detailed explanation of a technique, the original paper linked from its pipeline is always the best resource.

Depending on your use case, you should choose a technique accordingly. In many cases, these techniques can be combined. For example, you can combine Textual Inversion with SEGA to provide more semantic guidance to outputs generated with Textual Inversion.

Unless otherwise mentioned, these techniques work with existing models and don't require their own weights.

1. [Instruct Pix2Pix](#instruct-pix2pix)
2. [Pix2Pix Zero](#pix2pixzero)
3. [Attend and Excite](#attend-and-excite)
4. [Semantic Guidance](#semantic-guidance)
5. [Self-attention Guidance](#self-attention-guidance)
6. [Depth2Image](#depth2image)
7. [MultiDiffusion Panorama](#multidiffusion-panorama)
8. [DreamBooth](#dreambooth)
9. [Textual Inversion](#textual-inversion)
10. [ControlNet](#controlnet)
11. [Prompt Weighting](#prompt-weighting)
12. [Custom Diffusion](#custom-diffusion)
13. [Model Editing](#model-editing)
14. [DiffEdit](#diffedit)
15. [T2I-Adapter](#t2i-adapter)

For convenience, we provide a table denoting which methods are inference-only and which require fine-tuning/training.

| **Method**                                          | **Inference only** | **Requires training /<br> fine-tuning** | **Comments**                                                                                      |
| :-------------------------------------------------: | :----------------: | :-------------------------------------: | :-----------------------------------------------------------------------------------------------: |
| [Instruct Pix2Pix](#instruct-pix2pix)               | ✅                 | ❌                                      | Can additionally be<br>fine-tuned for better <br>performance on specific <br>edit instructions.  |
| [Pix2Pix Zero](#pix2pixzero)                        | ✅                 | ❌                                      |                                                                                                   |
| [Attend and Excite](#attend-and-excite)             | ✅                 | ❌                                      |                                                                                                   |
| [Semantic Guidance](#semantic-guidance)             | ✅                 | ❌                                      |                                                                                                   |
| [Self-attention Guidance](#self-attention-guidance) | ✅                 | ❌                                      |                                                                                                   |
| [Depth2Image](#depth2image)                         | ✅                 | ❌                                      |                                                                                                   |
| [MultiDiffusion Panorama](#multidiffusion-panorama) | ✅                 | ❌                                      |                                                                                                   |
| [DreamBooth](#dreambooth)                           | ❌                 | ✅                                      |                                                                                                   |
| [Textual Inversion](#textual-inversion)             | ❌                 | ✅                                      |                                                                                                   |
| [ControlNet](#controlnet)                           | ✅                 | ❌                                      | A ControlNet can be <br>trained/fine-tuned on<br>a custom conditioning.                           |
| [Prompt Weighting](#prompt-weighting)               | ✅                 | ❌                                      |                                                                                                   |
| [Custom Diffusion](#custom-diffusion)               | ❌                 | ✅                                      |                                                                                                   |
| [Model Editing](#model-editing)                     | ✅                 | ❌                                      |                                                                                                   |
| [DiffEdit](#diffedit)                               | ✅                 | ❌                                      |                                                                                                   |
| [T2I-Adapter](#t2i-adapter)                         | ✅                 | ❌                                      |                                                                                                   |

## Instruct Pix2Pix

[Paper](https://arxiv.org/abs/2211.09800)

[Instruct Pix2Pix](../api/pipelines/stable_diffusion/pix2pix) is fine-tuned from Stable Diffusion to support editing input images. It takes an image and a prompt describing an edit as inputs, and it outputs the edited image.
Instruct Pix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.

See [here](../api/pipelines/stable_diffusion/pix2pix) for more information on how to use it.

## Pix2Pix Zero

[Paper](https://arxiv.org/abs/2302.03027)

[Pix2Pix Zero](../api/pipelines/stable_diffusion/pix2pix_zero) allows modifying an image so that one concept or subject is translated into another while preserving the overall image semantics.

The denoising process is guided from one conceptual embedding towards another conceptual embedding. The intermediate latents are optimized during the denoising process to push the attention maps towards reference attention maps. The reference attention maps come from the denoising process of the input image and are used to encourage semantic preservation.

Pix2Pix Zero can be used both to edit synthetic images and real images.

- To edit synthetic images, one first generates an image given a caption.
  Next, image captions are generated for the concept to be edited and for the new target concept. A model such as [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) can be used for this. Then, "mean" prompt embeddings for both the source and the target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
- To edit a real image, one first generates an image caption using a model such as [BLIP](https://huggingface.co/docs/transformers/model_doc/blip). Then, DDIM inversion is applied to the prompt and image to generate "inverse" latents. As before, "mean" prompt embeddings for both the source and target concepts are created, and finally the pix2pix-zero algorithm, combined with the "inverse" latents, is used to edit the image.

<Tip>

Pix2Pix Zero is the first model that allows "zero-shot" image editing.
This means the model can edit an image in less than a minute on a consumer GPU, as shown in [this example](../api/pipelines/stable_diffusion/pix2pix_zero#usage-example).

</Tip>

As mentioned above, Pix2Pix Zero optimizes the latents (and not any of the UNet, VAE, or text encoder) to steer the generation towards a specific concept. This means the overall pipeline may require more memory than a standard [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).

See [here](../api/pipelines/stable_diffusion/pix2pix_zero) for more information on how to use it.

## Attend and Excite

[Paper](https://arxiv.org/abs/2301.13826)

[Attend and Excite](../api/pipelines/stable_diffusion/attend_and_excite) makes sure the subjects in the prompt are faithfully represented in the final image.

A set of token indices is given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each token index is guaranteed to have a minimum attention threshold for at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.

Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop in its pipeline (leaving the pretrained weights untouched) and can require more memory than the usual `StableDiffusionPipeline`.

See [here](../api/pipelines/stable_diffusion/attend_and_excite) for more information on how to use it.
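
As a rough illustration, here is a minimal sketch of calling the Attend-and-Excite pipeline in diffusers; the checkpoint and the token indices (which must point at the subject tokens of your own prompt) are assumptions for illustration:

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# indices of the subject tokens ("cat" and "frog") in the tokenized prompt
image = pipe(prompt, token_indices=[2, 5], guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("cat_and_frog.png")
```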

## Semantic Guidance (SEGA)

[Paper](https://arxiv.org/abs/2301.12247)

Semantic Guidance (SEGA) allows applying or removing one or more concepts from an image. The strength of the concept can also be controlled. I.e. the "smile" concept can be used to incrementally increase or decrease the smile of a portrait.

Similar to how classifier-free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple of these conceptual prompts can be applied simultaneously. Each conceptual prompt can either add or remove its concept depending on whether the guidance is applied positively or negatively.

Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.

See [here](../api/pipelines/semantic_stable_diffusion) for more information on how to use it.
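
A minimal sketch of what a SEGA call can look like with the `SemanticStableDiffusionPipeline`; the checkpoint, concept, and strength values are illustrative assumptions:

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    editing_prompt=["smiling, smile"],      # concept(s) to guide towards
    reverse_editing_direction=[False],      # False adds the concept, True removes it
    edit_guidance_scale=[5.0],              # per-concept guidance strength
)
image = out.images[0]
```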

## Self-attention Guidance (SAG)

[Paper](https://arxiv.org/abs/2210.00939)

[Self-attention Guidance](../api/pipelines/stable_diffusion/self_attention_guidance) improves the general quality of images.

SAG provides guidance from predictions that are not conditioned on high-frequency details towards fully conditioned images. The high-frequency details are extracted from the UNet's self-attention maps.

See [here](../api/pipelines/stable_diffusion/self_attention_guidance) for more information on how to use it.
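
For reference, a minimal sketch of a SAG call via `StableDiffusionSAGPipeline`; the checkpoint and the `sag_scale` value are assumptions for illustration:

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# sag_scale controls the strength of the self-attention guidance
image = pipe("a photo of an astronaut riding a horse on mars", sag_scale=0.75).images[0]
```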

## Depth2Image

[Project](https://huggingface.co/stabilityai/stable-diffusion-2-depth)

[Depth2Image](../pipelines/stable_diffusion_2#depthtoimage) is fine-tuned from Stable Diffusion to better preserve semantics for text-guided image variation.

It conditions on a monocular depth estimate of the original image.

See [here](../api/pipelines/stable_diffusion_2#depthtoimage) for more information on how to use it.
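
A minimal sketch of a Depth2Image call; the input photo (a public COCO sample) and the prompts are only placeholders:

```python
import requests
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)

# the pipeline estimates a depth map from init_image and uses it as conditioning
image = pipe(prompt="two tigers", image=init_image, negative_prompt="bad, deformed, ugly", strength=0.7).images[0]
```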

<Tip>

An important distinction between methods like InstructPix2Pix and Pix2Pix Zero is that the former involves fine-tuning the pretrained weights while the latter does not. This means that you can apply Pix2Pix Zero to any of the available Stable Diffusion models.

</Tip>

## MultiDiffusion Panorama

[Paper](https://arxiv.org/abs/2302.08113)

MultiDiffusion defines a new generation process over a pretrained diffusion model. This process binds together multiple diffusion generation methods that can be readily applied to generate high-quality and diverse images. Results adhere to user-provided controls, such as a desired aspect ratio (e.g., panorama) and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
[MultiDiffusion Panorama](../api/pipelines/stable_diffusion/panorama) allows generating high-quality images at arbitrary aspect ratios (e.g., panoramas).

See [here](../api/pipelines/stable_diffusion/panorama) for more information on how to use it to generate panoramic images.
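
A minimal sketch of generating a panorama with `StableDiffusionPanoramaPipeline`; the base checkpoint, scheduler choice, and output size are illustrative assumptions:

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# a wide output resolution produces the panorama effect
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
image.save("dolomites_panorama.png")
```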

## Fine-tuning your own models

In addition to pre-trained models, Diffusers has training scripts for fine-tuning models on user-provided data.

## DreamBooth

[DreamBooth](../training/dreambooth) fine-tunes a model to teach it about a new subject. I.e. a few pictures of a person can be used to generate images of that person in different styles.

See [here](../training/dreambooth) for more information on how to use it.

## Textual Inversion

[Textual Inversion](../training/text_inversion) fine-tunes a model to teach it about a new concept. I.e. a few pictures of a style of artwork can be used to generate images in that style.

See [here](../training/text_inversion) for more information on how to use it.
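
Once trained, textual inversion embeddings can be loaded into a pipeline at inference time. A minimal sketch, assuming the publicly shared `sd-concepts-library/cat-toy` embedding and its `<cat-toy>` placeholder token:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# the placeholder token <cat-toy> now refers to the learned concept
image = pipe("a <cat-toy> sitting on a park bench", num_inference_steps=50).images[0]
```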

## ControlNet

[Paper](https://arxiv.org/abs/2302.05543)

[ControlNet](../api/pipelines/stable_diffusion/controlnet) is an auxiliary network that adds an extra conditioning input.
There are 8 canonical pre-trained ControlNets, trained on different conditionings such as edge detection, scribbles, depth maps, and semantic segmentation.

See [here](../api/pipelines/stable_diffusion/controlnet) for more information on how to use it.
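
A minimal sketch of using a canny-edge ControlNet with `StableDiffusionControlNetPipeline`; the input image URL, checkpoints, and Canny thresholds are assumptions for illustration (any edge map of the right size works):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# turn an input photo into a Canny edge map to use as the conditioning
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
edges = cv2.Canny(np.array(image), 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe("a futuristic-looking woman", image=canny_image, num_inference_steps=20).images[0]
```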

## Prompt Weighting

Prompt weighting is a simple technique that puts more attention weight on certain parts of the text input.

For a more in-depth explanation and examples, see [here](../using-diffusers/weighted_prompts).
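
One common way to do this in practice is the third-party [compel](https://github.com/damian0815/compel) library, which turns a weighted prompt string into prompt embeddings you can pass to a pipeline. A minimal sketch, assuming compel is installed and using an illustrative checkpoint and prompt:

```python
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
# "++" upweights "ball" relative to the rest of the prompt
prompt_embeds = compel_proc("a red cat playing with a ball++")

image = pipe(prompt_embeds=prompt_embeds).images[0]
```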

## Custom Diffusion

[Custom Diffusion](../training/custom_diffusion) only fine-tunes the cross-attention maps of a pre-trained text-to-image diffusion model.
It also allows for additionally performing Textual Inversion, and it supports multi-concept training by design.
Like DreamBooth and Textual Inversion, Custom Diffusion is used to teach a pre-trained text-to-image diffusion model about new concepts so it can generate outputs involving the concept(s) of interest.

For more details, check out the [official docs](../training/custom_diffusion).

## Model Editing

[Paper](https://arxiv.org/abs/2303.08084)

The [text-to-image model editing pipeline](../api/pipelines/model_editing) helps you mitigate some of the incorrect implicit assumptions a pre-trained text-to-image diffusion model might make about the subjects in the input prompt.
For example, if you prompt Stable Diffusion to generate an image of "A pack of roses", the roses in the generated image are most likely red. This pipeline helps you change that assumption.

For more details, check out the [official docs](../api/pipelines/model_editing).

## DiffEdit

[Paper](https://arxiv.org/abs/2210.11427)

[DiffEdit](../api/pipelines/diffedit) allows semantic editing of an input image along with an input prompt while preserving the original input image as much as possible.

For more details, check out the [official docs](../api/pipelines/diffedit).

## T2I-Adapter

[Paper](https://arxiv.org/abs/2302.08453)

[T2I-Adapter](../api/pipelines/stable_diffusion/adapter) is an auxiliary network that adds an extra conditioning input.
There are 8 canonical pre-trained adapters, trained on different conditionings such as edge detection, sketches, depth maps, and semantic segmentation.

See the [official docs](../api/pipelines/stable_diffusion/adapter) for information on how to use it.
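
A minimal sketch of using a sketch-conditioned adapter with `StableDiffusionAdapterPipeline`; the adapter and base checkpoints are assumptions for illustration, and `sketch.png` is a placeholder for your own conditioning image:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

# a sketch-conditioned adapter; other adapters cover edges, depth, segmentation, etc.
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_sketch_sd15v2", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# depending on the adapter, the conditioning image may need to be grayscale ("L") or RGB
sketch = Image.open("sketch.png").convert("L")  # placeholder conditioning image
image = pipe("a photo of a house by a lake, best quality", image=sketch).images[0]
```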

@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

# Text-guided image-to-image generation

[[open-in-colab]]

The [`StableDiffusionImg2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images.

Some files were not shown because too many files have changed in this diff.