mirror of https://github.com/huggingface/diffusers.git
synced 2025-12-28 23:39:39 +08:00
Compare commits
169 Commits
fix_custom...controlnet
Commits in this comparison (SHA1):

b13dfac9dd, 451631be51, 71d84a9ce1, cfd84dfc14, 0592773d90, c1bad6e488, b62104c737, 697594f635,
83d0aba6c0, 47b3346422, 07f1fbb18e, 2551b73670, 930c8fdcb7, 6b1abba18d, 470f51cd26, b7e35dc782,
c77ac246c1, ed2a3584ab, 3eb498e7b4, c6e56e92ed, 27062c3631, 6427aa995e, 8b18cd8e7f, a0597f33ac,
3929954613, f6ce323633, 6b33c11c5b, 5729829cd8, e27500b72c, fe5911bf3d, b024ebb965, ad8f985e81,
ee2f2775b2, 692b7a907d, 71c918b848, 83ca21f539, f3802eb805, bfe8b41315, 4535088cec, 2eceaaef0f,
ece55227ff, 92a57a8e84, d7280b7436, e9eb0938f4, a29ea36d62, af48bf2008, 4b50ecceb0, 99b540b072,
b9feed8795, f9cedfb75c, fc7aa64ea8, d0979f5274, fcb0da7f00, 8e8b046d24, 5e704a2c71, 8bff782354,
6632823690, f74d5e1c2f, 3d74dc2abd, dfd7eafbce, 8dd0ddc3c4, 080ecf01b3, 7a91ea6c2b, e4559f48c1,
d6b861401e, e4f6c3799d, 98c9aac1d5, e3d71ad89a, 68f61a07d6, 4a3e574807, c2a28c346c, 78922ed7c7,
6fde5a6dd6, d1d0b8afce, 04ddad484e, 03d829d59e, 8d8b4311b9, 1fbcc78d6e, 51593da25a, 38e563d0c7,
b8f089c5a3, 187ea539ae, 8bf80fc8d8, 45f6d52b10, 746215670a, bc9a8cef6f, b62d9a1fdc, 46af98267d,
de1426119d, 41ea88f38c, aed7499a8d, 07c9a08e67, 2837d49079, 1997614aa9, 4e898560ce, 332d2bbea3,
b8a5dda56e, 572d8e2002, 2e8668f0af, b298484fd0, f911287cc9, 62825064bf, 5439e917ca, 174dcd697f,
cdf2ae8a84, 49949f321d, c7469ebe74, 150013060e, 219636f7e4, 35bac5edec, 0bf6aeb885, 9a45d7fb76,
61916fefc4, fc6acb6b97, 5e3f8fff40, 5df2acf7d2, 88d269461c, 0c6d1bc985, 13e781f9a5, 0bab447670,
1f02087607, 95ea538c79, ef3844d3a8, 3ebbaf7c96, 73b125df68, 88eb04489d, 4870626728, 666743302f,
f7cc9adc05, 59aefe9ea6, 3ddc2b7395, d49e2dd54c, 7bfd2375c7, ea8ae8c639, 958d9ec723, 77f9137f10,
231bdf2e56, 75124fc91e, 908e5e9cc6, 2715079344, 1ae15fa64c, 027a365a62, f96b760658, 7761b89d7b,
ce5504934a, 34d14d7848, ef9590712a, a812fb6f5c, f46b22ba13, b2b13cd315, 38adcd21bd, 790212f4d9,
11aa105077, abbfe4b5b7, 1d50f47a58, e891b00dfc, 27af55d1b4, 05361960f2, c42f6ee43e, f523b11a10,
79fa94ea8b, a06317abea, 500a3ff9ef, 8caa530069, cd6186907c, 803d653748, cd9d0913d9, fdec23188a,
12a232efa9
29  .github/ISSUE_TEMPLATE/bug-report.yml (vendored)
@@ -49,3 +49,32 @@ body:
       placeholder: diffusers version, platform, python version, ...
     validations:
       required: true
+  - type: textarea
+    id: who-can-help
+    attributes:
+      label: Who can help?
+      description: |
+        Your issue will be replied to more quickly if you can figure out the right person to tag with @
+        If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
+
+        All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and
+        a core maintainer will ping the right person.
+
+        Please tag fewer than 3 people.
+
+        General library related questions: @patrickvonplaten and @sayakpaul
+
+        Questions on the training examples: @williamberman, @sayakpaul, @yiyixuxu
+
+        Questions on memory optimizations, LoRA, float16, etc.: @williamberman, @patrickvonplaten, and @sayakpaul
+
+        Questions on schedulers: @patrickvonplaten and @williamberman
+
+        Questions on models and pipelines: @patrickvonplaten, @sayakpaul, and @williamberman
+
+        Questions on JAX- and MPS-related things: @pcuenca
+
+        Questions on audio pipelines: @patrickvonplaten, @kashif, and @sanchit-gandhi
+
+        Documentation: @stevhliu and @yiyixuxu
+      placeholder: "@Username ..."
60  .github/PULL_REQUEST_TEMPLATE.md (vendored, new file)
@@ -0,0 +1,60 @@
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet though.

Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.

Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.

Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md)?
- [ ] Did you read our [philosophy doc](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the
      [documentation guidelines](https://github.com/huggingface/diffusers/tree/main/docs), and
      [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @

If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
Please tag fewer than 3 people.

Core library:

- Schedulers: @williamberman and @patrickvonplaten
- Pipelines: @patrickvonplaten and @sayakpaul
- Training examples: @sayakpaul and @patrickvonplaten
- Docs: @stevhliu and @yiyixuxu
- JAX and MPS: @pcuenca
- Audio: @sanchit-gandhi
- General functionalities: @patrickvonplaten and @sayakpaul

Integrations:

- deepspeed: HF Trainer/Accelerate: @pacman100

HF projects:

- accelerate: [different repo](https://github.com/huggingface/accelerate)
- datasets: [different repo](https://github.com/huggingface/datasets)
- transformers: [different repo](https://github.com/huggingface/transformers)
- safetensors: [different repo](https://github.com/huggingface/safetensors)

-->
8  .github/workflows/build_documentation.yml (vendored)
@@ -5,15 +5,19 @@ on:
     branches:
       - main
       - doc-builder*
       - v*-release
+      - v*-patch

 jobs:
   build:
     uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
     with:
       commit_sha: ${{ github.sha }}
+      install_libgl1: true
       package: diffusers
       notebook_folder: diffusers_doc
-      languages: en ko
+      languages: en ko zh
     secrets:
       token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
3  .github/workflows/build_pr_documentation.yml (vendored)
@@ -13,5 +13,6 @@ jobs:
     with:
       commit_sha: ${{ github.event.pull_request.head.sha }}
       pr_number: ${{ github.event.number }}
+      install_libgl1: true
       package: diffusers
-      languages: en ko
+      languages: en ko zh
13  .github/workflows/delete_doc_comment.yml (vendored)
@@ -1,13 +1,14 @@
-name: Delete dev documentation
+name: Delete doc comment

 on:
-  pull_request:
-    types: [ closed ]
+  workflow_run:
+    workflows: ["Delete doc comment trigger"]
+    types:
+      - completed

 jobs:
   delete:
     uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
     with:
-      pr_number: ${{ github.event.number }}
       package: diffusers
+    secrets:
+      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
12  .github/workflows/delete_doc_comment_trigger.yml (vendored, new file)
@@ -0,0 +1,12 @@
name: Delete doc comment trigger

on:
  pull_request:
    types: [ closed ]


jobs:
  delete:
    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
    with:
      pr_number: ${{ github.event.number }}
32  .github/workflows/pr_dependency_test.yml (vendored, new file)
@@ -0,0 +1,32 @@
name: Run dependency tests

on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  check_dependencies:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: "3.7"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -e .
        pip install pytest
    - name: Check for soft dependencies
      run: |
        pytest tests/others/test_dependencies.py
7  .github/workflows/pr_tests.yml (vendored)
@@ -4,6 +4,9 @@ on:
   pull_request:
     branches:
       - main
+  push:
+    branches:
+      - ci-*

 concurrency:
   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
@@ -62,7 +65,7 @@ jobs:

       - name: Install dependencies
         run: |
-          apt-get update && apt-get install libsndfile1-dev -y
+          apt-get update && apt-get install libsndfile1-dev libgl1 -y
           python -m pip install -e .[quality,test]

       - name: Environment
@@ -81,7 +84,7 @@ jobs:
         if: ${{ matrix.config.framework == 'pytorch_models' }}
         run: |
           python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-            -s -v -k "not Flax and not Onnx" \
+            -s -v -k "not Flax and not Onnx and not Dependency" \
             --make-reports=tests_${{ matrix.config.report }} \
             tests/models tests/schedulers tests/others
2  .github/workflows/push_tests.yml (vendored)
@@ -17,6 +17,7 @@ jobs:
   run_slow_tests:
     strategy:
       fail-fast: false
+      max-parallel: 1
       matrix:
         config:
           - name: Slow PyTorch CUDA tests on Ubuntu
@@ -60,6 +61,7 @@ jobs:

       - name: Install dependencies
         run: |
+          apt-get update && apt-get install libsndfile1-dev libgl1 -y
           python -m pip install -e .[quality,test]

       - name: Environment
2  .github/workflows/push_tests_fast.yml (vendored)
@@ -60,7 +60,7 @@ jobs:

       - name: Install dependencies
         run: |
-          apt-get update && apt-get install libsndfile1-dev -y
+          apt-get update && apt-get install libsndfile1-dev libgl1 -y
           python -m pip install -e .[quality,test]

       - name: Environment
16  .github/workflows/upload_pr_documentation.yml (vendored, new file)
@@ -0,0 +1,16 @@
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: diffusers
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
@@ -297,7 +297,7 @@ if you don't know yet what specific component you would like to add:
 - [Model or pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22)
 - [Scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)

-Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) a read to better understand the design of any of the three components. Please be aware that
+Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) a read to better understand the design of any of the three components. Please be aware that
 we cannot merge model, scheduler, or pipeline additions that strongly diverge from our design philosophy
 as it will lead to API inconsistencies. If you fundamentally disagree with a design choice, please
 open a [Feedback issue](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=) instead so that it can be discussed whether a certain design
11  README.md
@@ -25,7 +25,7 @@

 ## Installation

-We recommend installing 🤗 Diffusers in a virtual environment from PyPi or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/installation.html), please refer to their official documentation.
+We recommend installing 🤗 Diffusers in a virtual environment from PyPi or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/#installation), please refer to their official documentation.

 ### PyTorch

@@ -143,9 +143,14 @@ just hang out ☕.
   </tr>
   <tr>
     <td>Text-to-Image</td>
-    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/if">if</a></td>
+    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/if">DeepFloyd IF</a></td>
     <td><a href="https://huggingface.co/DeepFloyd/IF-I-XL-v1.0"> DeepFloyd/IF-I-XL-v1.0 </a></td>
   </tr>
+  <tr>
+    <td>Text-to-Image</td>
+    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/kandinsky">Kandinsky</a></td>
+    <td><a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder"> kandinsky-community/kandinsky-2-2-decoder </a></td>
+  </tr>
   <tr style="border-top: 2px solid black">
     <td>Text-guided Image-to-Image</td>
     <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/controlnet">Controlnet</a></td>
@@ -153,7 +158,7 @@ just hang out ☕.
   </tr>
   <tr>
     <td>Text-guided Image-to-Image</td>
-    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/pix2pix">Instruct Pix2Pix</a></td>
+    <td><a href="https://huggingface.co/docs/diffusers/api/pipelines/pix2pix">Instruct Pix2Pix</a></td>
     <td><a href="https://huggingface.co/timbrooks/instruct-pix2pix"> timbrooks/instruct-pix2pix </a></td>
   </tr>
   <tr>
@@ -14,6 +14,7 @@ RUN apt update && \
     libsndfile1-dev \
     python3.8 \
     python3-pip \
+    libgl1 \
     python3.8-venv && \
     rm -rf /var/lib/apt/lists

@@ -27,6 +28,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     torch \
     torchvision \
     torchaudio \
+    invisible_watermark \
     --extra-index-url https://download.pytorch.org/whl/cpu && \
     python3 -m pip install --no-cache-dir \
     accelerate \
@@ -40,4 +42,4 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     tensorboard \
     transformers

-CMD ["/bin/bash"]
+CMD ["/bin/bash"]
@@ -12,6 +12,7 @@ RUN apt update && \
     curl \
     ca-certificates \
     libsndfile1-dev \
+    libgl1 \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -26,7 +27,8 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     python3 -m pip install --no-cache-dir \
     torch \
     torchvision \
-    torchaudio && \
+    torchaudio \
+    invisible_watermark && \
     python3 -m pip install --no-cache-dir \
     accelerate \
     datasets \
@@ -50,6 +50,8 @@
       title: Distributed inference with multiple GPUs
     - local: using-diffusers/reusing_seeds
       title: Improve image quality with deterministic generation
+    - local: using-diffusers/control_brightness
+      title: Control image brightness
     - local: using-diffusers/reproducibility
      title: Create reproducible pipelines
     - local: using-diffusers/custom_pipeline_examples
@@ -115,6 +117,8 @@
       title: Habana Gaudi
     - local: optimization/tome
       title: Token Merging
+    - local: optimization/bentoml
+      title: BentoML Integration
   title: Optimization/Special Hardware
 - sections:
   - local: conceptual/philosophy
@@ -130,8 +134,6 @@
   title: Conceptual Guides
 - sections:
   - sections:
-    - local: api/models
-      title: Models
     - local: api/attnprocessor
       title: Attention Processor
     - local: api/diffusion_pipeline
@@ -146,7 +148,35 @@
       title: Loaders
     - local: api/utilities
       title: Utilities
+    - local: api/image_processor
+      title: VAE Image Processor
     title: Main Classes
+  - sections:
+    - local: api/models/overview
+      title: Overview
+    - local: api/models/unet
+      title: UNet1DModel
+    - local: api/models/unet2d
+      title: UNet2DModel
+    - local: api/models/unet2d-cond
+      title: UNet2DConditionModel
+    - local: api/models/unet3d-cond
+      title: UNet3DConditionModel
+    - local: api/models/vq
+      title: VQModel
+    - local: api/models/autoencoderkl
+      title: AutoencoderKL
+    - local: api/models/asymmetricautoencoderkl
+      title: AsymmetricAutoencoderKL
+    - local: api/models/transformer2d
+      title: Transformer2D
+    - local: api/models/transformer_temporal
+      title: Transformer Temporal
+    - local: api/models/prior_transformer
+      title: Prior Transformer
+    - local: api/models/controlnet
+      title: ControlNet
+    title: Models
   - sections:
     - local: api/pipelines/overview
       title: Overview
@@ -158,6 +188,8 @@
       title: Audio Diffusion
     - local: api/pipelines/audioldm
       title: AudioLDM
+    - local: api/pipelines/consistency_models
+      title: Consistency Models
     - local: api/pipelines/controlnet
       title: ControlNet
     - local: api/pipelines/cycle_diffusion
@@ -168,12 +200,12 @@
       title: DDIM
     - local: api/pipelines/ddpm
       title: DDPM
+    - local: api/pipelines/deepfloyd_if
+      title: DeepFloyd IF
     - local: api/pipelines/diffedit
       title: DiffEdit
     - local: api/pipelines/dit
       title: DiT
-    - local: api/pipelines/if
-      title: IF
     - local: api/pipelines/pix2pix
       title: InstructPix2Pix
     - local: api/pipelines/kandinsky
@@ -184,6 +216,8 @@
       title: MultiDiffusion Panorama
     - local: api/pipelines/paint_by_example
       title: PaintByExample
+    - local: api/pipelines/paradigms
+      title: Parallel Sampling of Diffusion Models
     - local: api/pipelines/pix2pix_zero
       title: Pix2Pix Zero
     - local: api/pipelines/pndm
@@ -196,6 +230,8 @@
       title: Self-Attention Guidance
     - local: api/pipelines/semantic_stable_diffusion
       title: Semantic Guidance
+    - local: api/pipelines/shap_e
+      title: Shap-E
     - local: api/pipelines/spectrogram_diffusion
       title: Spectrogram Diffusion
   - sections:
@@ -215,10 +251,16 @@
       title: Safe Stable Diffusion
     - local: api/pipelines/stable_diffusion/stable_diffusion_2
       title: Stable Diffusion 2
+    - local: api/pipelines/stable_diffusion/stable_diffusion_xl
+      title: Stable Diffusion XL
     - local: api/pipelines/stable_diffusion/latent_upscale
       title: Stable-Diffusion-Latent-Upscaler
     - local: api/pipelines/stable_diffusion/upscale
       title: Super-Resolution
+    - local: api/pipelines/stable_diffusion/ldm3d_diffusion
+      title: LDM3D Text-to-(RGB, Depth)
+    - local: api/pipelines/stable_diffusion/adapter
+      title: Stable Diffusion T2I-adapter
     title: Stable Diffusion
   - local: api/pipelines/stable_unclip
     title: Stable unCLIP
@@ -244,6 +286,8 @@
 - sections:
   - local: api/schedulers/overview
     title: Overview
+  - local: api/schedulers/cm_stochastic_iterative
+    title: Consistency Model Multistep Scheduler
   - local: api/schedulers/ddim
     title: DDIM
   - local: api/schedulers/ddim_inverse
@@ -12,8 +12,13 @@ specific language governing permissions and limitations under the License.

 # Configuration

-Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`] which conveniently takes care of storing all the parameters that are
-passed to their respective `__init__` methods in a JSON-configuration file.
+Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`], which stores all the parameters passed to their respective `__init__` methods in a JSON configuration file.
+
+<Tip>
+
+To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`.
+
+</Tip>

 ## ConfigMixin
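For reference, here is a minimal sketch of the behavior this page describes, using `DDIMScheduler` (the output directory name is arbitrary): every `__init__` argument is captured on `.config` and written out as JSON by `save_config`.

```python
from diffusers import DDIMScheduler

# All __init__ parameters are recorded on the config object.
scheduler = DDIMScheduler(num_train_timesteps=1000, beta_schedule="scaled_linear")
print(scheduler.config.num_train_timesteps)  # 1000

# Writes scheduler_config.json containing the parameters above.
scheduler.save_config("./ddim-config")
```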
@@ -12,12 +12,12 @@ specific language governing permissions and limitations under the License.

 # Pipelines

-The [`DiffusionPipeline`] is the easiest way to load any pretrained diffusion pipeline from the [Hub](https://huggingface.co/models?library=diffusers) and use it for inference.
+The [`DiffusionPipeline`] is the quickest way to load any pretrained diffusion pipeline from the [Hub](https://huggingface.co/models?library=diffusers) for inference.

 <Tip>

 You shouldn't use the [`DiffusionPipeline`] class for training or finetuning a diffusion model. Individual
-components (for example, [`UNetModel`] and [`UNetConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with instead.
+components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead.

 </Tip>
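To illustrate the loading flow this page describes, here is a minimal sketch (the repo id is one public Stable Diffusion checkpoint; any diffusion pipeline on the Hub works the same way):

```python
from diffusers import DiffusionPipeline

# The concrete pipeline class is resolved automatically from the repo's
# model_index.json; here it resolves to a Stable Diffusion pipeline.
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # optional: move to GPU for inference

image = pipe("An astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```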
27  docs/source/en/api/image_processor.mdx (new file)
@@ -0,0 +1,27 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# VAE Image Processor

The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]s to prepare image inputs for VAE encoding and to post-process outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays.

All pipelines with [`VaeImageProcessor`] accept PIL Image, PyTorch tensor, or NumPy array as image inputs and return outputs based on the `output_type` argument set by the user. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="pt"`). This allows you to take the generated latents from one pipeline and pass them to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between them.

## VaeImageProcessor

[[autodoc]] image_processor.VaeImageProcessor

## VaeImageProcessorLDM3D

The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs.

[[autodoc]] image_processor.VaeImageProcessorLDM3D
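As a hedged sketch of the latent-space hand-off described above (assuming a Stable Diffusion checkpoint and a CUDA device):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Skip VAE decoding and return the raw latents instead of PIL images.
latents = pipe("a photo of a cat", output_type="latent").images

# These latents can be handed to another pipeline without leaving latent space.
print(latents.shape)  # e.g. torch.Size([1, 4, 64, 64])
```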
@@ -12,31 +12,34 @@ specific language governing permissions and limitations under the License.

 # Loaders

-There are many ways to train adapter neural networks for diffusion models, such as
-- [Textual Inversion](./training/text_inversion.mdx)
-- [LoRA](https://github.com/cloneofsimo/lora)
-- [Hypernetworks](https://arxiv.org/abs/1609.09106)
-
-Such adapter neural networks often only consist of a fraction of the number of weights compared
-to the pretrained model and as such are very portable. The Diffusers library offers an easy-to-use
-API to load such adapter neural networks via the [`loaders.py` module](https://github.com/huggingface/diffusers/blob/main/src/diffusers/loaders.py).
-
-**Note**: This module is still highly experimental and prone to future changes.
-
-## LoaderMixins
+Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusion model to generate images in a specific style without training or finetuning the entire model. The adapter weights are typically only a tiny fraction of the pretrained model's, which makes them very portable. 🤗 Diffusers provides an easy-to-use `LoaderMixin` API to load adapter weights.
+
+<Tip warning={true}>
+
+🧪 The `LoaderMixins` are highly experimental and prone to future changes. To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`.
+
+</Tip>

-### UNet2DConditionLoadersMixin
+## UNet2DConditionLoadersMixin

 [[autodoc]] loaders.UNet2DConditionLoadersMixin

-### TextualInversionLoaderMixin
+## TextualInversionLoaderMixin

 [[autodoc]] loaders.TextualInversionLoaderMixin

-### LoraLoaderMixin
+## LoraLoaderMixin

 [[autodoc]] loaders.LoraLoaderMixin

-### FromCkptMixin
+## FromSingleFileMixin

-[[autodoc]] loaders.FromCkptMixin
+[[autodoc]] loaders.FromSingleFileMixin
+
+## FromOriginalControlnetMixin
+
+[[autodoc]] loaders.FromOriginalControlnetMixin
+
+## FromOriginalVAEMixin
+
+[[autodoc]] loaders.FromOriginalVAEMixin
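For illustration, here is roughly how the mixin entry points above are used from a pipeline (a sketch; the LoRA repo id is a placeholder, while `sd-concepts-library/cat-toy` is a public textual-inversion concept):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Pipelines inherit LoraLoaderMixin, so adapter weights load directly onto
# the UNet (and the text encoder, when present in the checkpoint).
pipe.load_lora_weights("some-user/some-lora-checkpoint")  # placeholder repo id

# TextualInversionLoaderMixin works the same way for learned embeddings.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
```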
@@ -12,12 +12,9 @@ specific language governing permissions and limitations under the License.

 # Logging

-🧨 Diffusers has a centralized logging system, so that you can setup the verbosity of the library easily.
-
-Currently the default verbosity of the library is `WARNING`.
+🤗 Diffusers has a centralized logging system to easily manage the verbosity of the library. The default verbosity is set to `WARNING`.

-To change the level of verbosity, just use one of the direct setters. For instance, here is how to change the verbosity
-to the INFO level.
+To change the verbosity level, use one of the direct setters. For instance, to change the verbosity to the `INFO` level:

 ```python
 import diffusers
@@ -33,7 +30,7 @@ DIFFUSERS_VERBOSITY=error ./myprogram.py
 ```

 Additionally, some `warnings` can be disabled by setting the environment variable
-`DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like *1*. This will disable any warning that is logged using
+`DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like `1`. This disables any warning logged by
 [`logger.warning_advice`]. For example:

 ```bash
@@ -52,20 +49,21 @@ logger.warning("WARN")
 ```

-All the methods of this logging module are documented below, the main ones are
+All methods of the logging module are documented below. The main methods are
 [`logging.get_verbosity`] to get the current level of verbosity in the logger and
-[`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least
-verbose to the most verbose), those levels (with their corresponding int values in parenthesis) are:
+[`logging.set_verbosity`] to set the verbosity to the level of your choice.

-- `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` (int value, 50): only report the most
-  critical errors.
-- `diffusers.logging.ERROR` (int value, 40): only report errors.
-- `diffusers.logging.WARNING` or `diffusers.logging.WARN` (int value, 30): only reports error and
-  warnings. This is the default level used by the library.
-- `diffusers.logging.INFO` (int value, 20): reports error, warnings and basic information.
-- `diffusers.logging.DEBUG` (int value, 10): report all information.
+In order from the least verbose to the most verbose:

-By default, `tqdm` progress bars will be displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to suppress or unsuppress this behavior.
+| Method                                                     | Integer value | Description                                          |
+|------------------------------------------------------------|---------------|------------------------------------------------------|
+| `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL`  | 50            | only report the most critical errors                 |
+| `diffusers.logging.ERROR`                                   | 40            | only report errors                                   |
+| `diffusers.logging.WARNING` or `diffusers.logging.WARN`    | 30            | only report errors and warnings (default)            |
+| `diffusers.logging.INFO`                                    | 20            | only report errors, warnings, and basic information  |
+| `diffusers.logging.DEBUG`                                   | 10            | report all information                               |
+
+By default, `tqdm` progress bars are displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] are used to enable or disable this behavior.
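For reference, the setters and progress-bar toggles this page documents can be combined as follows (a minimal sketch; `set_verbosity_info` and `disable_progress_bar` are the helpers referenced above):

```python
import diffusers

# Raise verbosity from the default WARNING to INFO.
diffusers.logging.set_verbosity_info()

# Equivalent explicit form, using a level constant from the table above.
diffusers.logging.set_verbosity(diffusers.logging.INFO)

# Silence the tqdm progress bars shown while model weights download.
diffusers.logging.disable_progress_bar()
```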
## Base setters
@@ -1,107 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Models

Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models.
The primary function of these models is to denoise an input sample, by modeling the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).
The models are built on the base class [`ModelMixin`], which is a `torch.nn.Module` with basic functionality for saving and loading models both locally and from the HuggingFace Hub.

## ModelMixin
[[autodoc]] ModelMixin

## UNet2DOutput
[[autodoc]] models.unet_2d.UNet2DOutput

## UNet2DModel
[[autodoc]] UNet2DModel

## UNet1DOutput
[[autodoc]] models.unet_1d.UNet1DOutput

## UNet1DModel
[[autodoc]] UNet1DModel

## UNet2DConditionOutput
[[autodoc]] models.unet_2d_condition.UNet2DConditionOutput

## UNet2DConditionModel
[[autodoc]] UNet2DConditionModel

## UNet3DConditionOutput
[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput

## UNet3DConditionModel
[[autodoc]] UNet3DConditionModel

## DecoderOutput
[[autodoc]] models.vae.DecoderOutput

## VQEncoderOutput
[[autodoc]] models.vq_model.VQEncoderOutput

## VQModel
[[autodoc]] VQModel

## AutoencoderKLOutput
[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput

## AutoencoderKL
[[autodoc]] AutoencoderKL

## Transformer2DModel
[[autodoc]] Transformer2DModel

## Transformer2DModelOutput
[[autodoc]] models.transformer_2d.Transformer2DModelOutput

## TransformerTemporalModel
[[autodoc]] models.transformer_temporal.TransformerTemporalModel

## TransformerTemporalModelOutput
[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput

## PriorTransformer
[[autodoc]] models.prior_transformer.PriorTransformer

## PriorTransformerOutput
[[autodoc]] models.prior_transformer.PriorTransformerOutput

## ControlNetOutput
[[autodoc]] models.controlnet.ControlNetOutput

## ControlNetModel
[[autodoc]] ControlNetModel

## FlaxModelMixin
[[autodoc]] FlaxModelMixin

## FlaxUNet2DConditionOutput
[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput

## FlaxUNet2DConditionModel
[[autodoc]] FlaxUNet2DConditionModel

## FlaxDecoderOutput
[[autodoc]] models.vae_flax.FlaxDecoderOutput

## FlaxAutoencoderKLOutput
[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput

## FlaxAutoencoderKL
[[autodoc]] FlaxAutoencoderKL

## FlaxControlNetOutput
[[autodoc]] models.controlnet_flax.FlaxControlNetOutput

## FlaxControlNetModel
[[autodoc]] FlaxControlNetModel
55  docs/source/en/api/models/asymmetricautoencoderkl.mdx (new file)
@@ -0,0 +1,55 @@
# AsymmetricAutoencoderKL

Improved larger variational autoencoder (VAE) model with KL loss for the inpainting task: [Designing a Better Asymmetric VQGAN for StableDiffusion](https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.

The abstract from the paper is:

*StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN*

Evaluation results can be found in section 4.1 of the original paper.

## Available checkpoints

* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5)
* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2)

## Example Usage

```python
from io import BytesIO
from PIL import Image
import requests
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline


def download_image(url: str) -> Image.Image:
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")


prompt = "a photo of a person"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"

image = download_image(img_url).resize((256, 256))
mask_image = download_image(mask_url).resize((256, 256))

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
pipe.to("cuda")

image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
image.save("image.jpeg")
```

## AsymmetricAutoencoderKL

[[autodoc]] models.autoencoder_asym_kl.AsymmetricAutoencoderKL

## AutoencoderKLOutput

[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.vae.DecoderOutput
43  docs/source/en/api/models/autoencoderkl.mdx (new file)
@@ -0,0 +1,43 @@
# AutoencoderKL

The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.

The abstract from the paper is:

*How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.*

## Loading from the original format

By default the [`AutoencoderKL`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
from the original format using [`FromOriginalVAEMixin.from_single_file`] as follows:

```py
from diffusers import AutoencoderKL

url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors"  # can also be a local file
model = AutoencoderKL.from_single_file(url)
```

## AutoencoderKL

[[autodoc]] AutoencoderKL

## AutoencoderKLOutput

[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.vae.DecoderOutput

## FlaxAutoencoderKL

[[autodoc]] FlaxAutoencoderKL

## FlaxAutoencoderKLOutput

[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput

## FlaxDecoderOutput

[[autodoc]] models.vae_flax.FlaxDecoderOutput
38  docs/source/en/api/models/controlnet.mdx (new file)
@@ -0,0 +1,38 @@
# ControlNet

The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.

The abstract from the paper is:

*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*

## Loading from the original format

By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
from the original format using [`FromOriginalControlnetMixin.from_single_file`] as follows:

```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"  # can also be a local path
controlnet = ControlNetModel.from_single_file(url)

url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors"  # can also be a local path
pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet)
```

## ControlNetModel

[[autodoc]] ControlNetModel

## ControlNetOutput

[[autodoc]] models.controlnet.ControlNetOutput

## FlaxControlNetModel

[[autodoc]] FlaxControlNetModel

## FlaxControlNetOutput

[[autodoc]] models.controlnet_flax.FlaxControlNetOutput
12  docs/source/en/api/models/overview.mdx (new file)
@@ -0,0 +1,12 @@
# Models

🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).

All models are built from the base [`ModelMixin`] class, which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub.

## ModelMixin
[[autodoc]] ModelMixin

## FlaxModelMixin

[[autodoc]] FlaxModelMixin
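As a small sketch of the save/load round trip that `ModelMixin` provides (`google/ddpm-cat-256` is a public checkpoint; the local directory name is arbitrary):

```python
from diffusers import UNet2DModel

# Download a pretrained model (weights + config) from the Hub.
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")

# Save it locally and reload it from disk.
model.save_pretrained("./ddpm-cat-256")
reloaded = UNet2DModel.from_pretrained("./ddpm-cat-256")
```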
16  docs/source/en/api/models/prior_transformer.mdx (new file)
@@ -0,0 +1,16 @@
# Prior Transformer

The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.

The abstract from the paper is:

*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*

## PriorTransformer

[[autodoc]] PriorTransformer

## PriorTransformerOutput

[[autodoc]] models.prior_transformer.PriorTransformerOutput
29  docs/source/en/api/models/transformer2d.mdx (new file)
@@ -0,0 +1,29 @@
# Transformer2D

A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.

When the input is **continuous**:

1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
2. Apply the Transformer blocks in the standard way.
3. Reshape to image.

When the input is **discrete**:

<Tip>

It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.

</Tip>

1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings.
2. Apply the Transformer blocks in the standard way.
3. Predict classes of unnoised image.

## Transformer2DModel

[[autodoc]] Transformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.transformer_2d.Transformer2DModelOutput
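A minimal sketch of the continuous-input path described above (sizes are illustrative; the inner dimension is `num_attention_heads * attention_head_dim`):

```python
import torch
from diffusers import Transformer2DModel

model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=32,  # inner dimension = 2 * 32 = 64
    in_channels=64,         # channel count of the continuous feature map
    norm_num_groups=32,     # must evenly divide in_channels
)

sample = torch.randn(1, 64, 16, 16)
output = model(sample).sample
print(output.shape)  # torch.Size([1, 64, 16, 16]): reshaped back to image form
```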
11  docs/source/en/api/models/transformer_temporal.mdx (new file)
@@ -0,0 +1,11 @@
# Transformer Temporal

A Transformer model for video-like data.

## TransformerTemporalModel

[[autodoc]] models.transformer_temporal.TransformerTemporalModel

## TransformerTemporalModelOutput

[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput
13  docs/source/en/api/models/unet.mdx (new file)
@@ -0,0 +1,13 @@
# UNet1DModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 1D UNet model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*

## UNet1DModel
[[autodoc]] UNet1DModel

## UNet1DOutput
[[autodoc]] models.unet_1d.UNet1DOutput
19  docs/source/en/api/models/unet2d-cond.mdx (new file)
@@ -0,0 +1,19 @@
# UNet2DConditionModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*

## UNet2DConditionModel
[[autodoc]] UNet2DConditionModel

## UNet2DConditionOutput
[[autodoc]] models.unet_2d_condition.UNet2DConditionOutput

## FlaxUNet2DConditionModel
[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionModel

## FlaxUNet2DConditionOutput
[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput
13  docs/source/en/api/models/unet2d.mdx (new file)
@@ -0,0 +1,13 @@
# UNet2DModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*

## UNet2DModel
[[autodoc]] UNet2DModel

## UNet2DOutput
[[autodoc]] models.unet_2d.UNet2DOutput
13
docs/source/en/api/models/unet3d-cond.mdx
Normal file
@@ -0,0 +1,13 @@
|
||||
# UNet3DConditionModel
|
||||
|
||||
The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 3D UNet conditional model.
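As a minimal sketch (the `damo-vilab/text-to-video-ms-1.7b` repository and its `unet` subfolder are assumptions used for illustration), the 3D UNet can be loaded from a text-to-video checkpoint:

```py
from diffusers import UNet3DConditionModel

# load only the 3D denoising UNet from a full text-to-video checkpoint
unet = UNet3DConditionModel.from_pretrained("damo-vilab/text-to-video-ms-1.7b", subfolder="unet")
```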
|
||||
|
||||
The abstract from the paper is:
|
||||
|
||||
*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
|
||||
|
||||
## UNet3DConditionModel
|
||||
[[autodoc]] UNet3DConditionModel
|
||||
|
||||
## UNet3DConditionOutput
|
||||
[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
|
||||
15
docs/source/en/api/models/vq.mdx
Normal file
@@ -0,0 +1,15 @@
|
||||
# VQModel
|
||||
|
||||
The VQ-VAE model was introduced in [Neural Discrete Representation Learning](https://huggingface.co/papers/1711.00937) by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike [`AutoencoderKL`], the [`VQModel`] works in a quantized latent space.
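As a minimal sketch (the `CompVis/ldm-celebahq-256` repository and its `vqvae` subfolder are assumptions used for illustration), an image-sized tensor can be round-tripped through the quantized latent space:

```py
import torch

from diffusers import VQModel

vqvae = VQModel.from_pretrained("CompVis/ldm-celebahq-256", subfolder="vqvae")

image = torch.randn(1, 3, 256, 256)
latents = vqvae.encode(image).latents  # continuous latents before quantization
reconstruction = vqvae.decode(latents).sample  # quantization is applied inside `decode`
```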
|
||||
|
||||
The abstract from the paper is:
|
||||
|
||||
*Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.*
|
||||
|
||||
## VQModel
|
||||
|
||||
[[autodoc]] VQModel
|
||||
|
||||
## VQEncoderOutput
|
||||
|
||||
[[autodoc]] models.vq_model.VQEncoderOutput
|
||||
@@ -10,11 +10,9 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# BaseOutputs
|
||||
# Outputs
|
||||
|
||||
All models have outputs that are subclasses of [`~utils.BaseOutput`]. Those are
|
||||
data structures containing all the information returned by the model, but they can also be used as tuples or
|
||||
dictionaries.
|
||||
All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries.
|
||||
|
||||
For example:
|
||||
|
||||
|
||||
@@ -43,7 +43,7 @@ pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(devic
|
||||
|
||||
output = pipe()
|
||||
display(output.images[0])
|
||||
display(Audio(output.audios[0], rate=mel.get_sample_rate()))
|
||||
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
|
||||
```
|
||||
|
||||
### Latent Audio Diffusion
|
||||
|
||||
87
docs/source/en/api/pipelines/consistency_models.mdx
Normal file
@@ -0,0 +1,87 @@
|
||||
# Consistency Models
|
||||
|
||||
Consistency Models were proposed in [Consistency Models](https://arxiv.org/abs/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
|
||||
|
||||
The abstract of the [paper](https://arxiv.org/pdf/2303.01469.pdf) is as follows:
|
||||
|
||||
*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.*
|
||||
|
||||
Resources:
|
||||
|
||||
* [Paper](https://arxiv.org/abs/2303.01469)
|
||||
* [Original Code](https://github.com/openai/consistency_models)
|
||||
|
||||
Available Checkpoints are:
|
||||
- *cd_imagenet64_l2 (64x64 resolution)* [openai/diffusers-cd_imagenet64_l2](https://huggingface.co/openai/diffusers-cd_imagenet64_l2)
|
||||
- *cd_imagenet64_lpips (64x64 resolution)* [openai/diffusers-cd_imagenet64_lpips](https://huggingface.co/openai/diffusers-cd_imagenet64_lpips)
|
||||
- *ct_imagenet64 (64x64 resolution)* [openai/diffusers-ct_imagenet64](https://huggingface.co/openai/diffusers-ct_imagenet64)
|
||||
- *cd_bedroom256_l2 (256x256 resolution)* [openai/diffusers-cd_bedroom256_l2](https://huggingface.co/openai/diffusers-cd_bedroom256_l2)
|
||||
- *cd_bedroom256_lpips (256x256 resolution)* [openai/diffusers-cd_bedroom256_lpips](https://huggingface.co/openai/diffusers-cd_bedroom256_lpips)
|
||||
- *ct_bedroom256 (256x256 resolution)* [openai/diffusers-ct_bedroom256](https://huggingface.co/openai/diffusers-ct_bedroom256)
|
||||
- *cd_cat256_l2 (256x256 resolution)* [openai/diffusers-cd_cat256_l2](https://huggingface.co/openai/diffusers-cd_cat256_l2)
|
||||
- *cd_cat256_lpips (256x256 resolution)* [openai/diffusers-cd_cat256_lpips](https://huggingface.co/openai/diffusers-cd_cat256_lpips)
|
||||
- *ct_cat256 (256x256 resolution)* [openai/diffusers-ct_cat256](https://huggingface.co/openai/diffusers-ct_cat256)
|
||||
|
||||
## Available Pipelines
|
||||
|
||||
| Pipeline | Tasks | Demo | Colab |
|
||||
|:---:|:---:|:---:|:---:|
|
||||
| [ConsistencyModelPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_consistency_models.py) | *Unconditional Image Generation* | | |
|
||||
|
||||
This pipeline was contributed by our community members [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues) ❤️
|
||||
|
||||
## Usage Example
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
from diffusers import ConsistencyModelPipeline
|
||||
|
||||
device = "cuda"
|
||||
# Load the cd_imagenet64_l2 checkpoint.
|
||||
model_id_or_path = "openai/diffusers-cd_imagenet64_l2"
|
||||
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
|
||||
pipe.to(device)
|
||||
|
||||
# Onestep Sampling
|
||||
image = pipe(num_inference_steps=1).images[0]
|
||||
image.save("consistency_model_onestep_sample.png")
|
||||
|
||||
# Onestep sampling, class-conditional image generation
|
||||
# ImageNet-64 class label 145 corresponds to king penguins
|
||||
image = pipe(num_inference_steps=1, class_labels=145).images[0]
|
||||
image.save("consistency_model_onestep_sample_penguin.png")
|
||||
|
||||
# Multistep sampling, class-conditional image generation
|
||||
# Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo.
|
||||
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L77
|
||||
image = pipe(timesteps=[22, 0], class_labels=145).images[0]
|
||||
image.save("consistency_model_multistep_sample_penguin.png")
|
||||
```
|
||||
|
||||
For an additional speed-up, one can also make use of `torch.compile`. Multiple images can be generated in <1 second as follows:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import ConsistencyModelPipeline
|
||||
|
||||
device = "cuda"
|
||||
# Load the cd_bedroom256_lpips checkpoint.
|
||||
model_id_or_path = "openai/diffusers-cd_bedroom256_lpips"
|
||||
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
|
||||
pipe.to(device)
|
||||
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
|
||||
# Multistep sampling
|
||||
# Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo:
|
||||
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83
|
||||
for _ in range(10):
|
||||
image = pipe(timesteps=[17, 0]).images[0]
|
||||
image.show()
|
||||
```
|
||||
|
||||
## ConsistencyModelPipeline
|
||||
[[autodoc]] ConsistencyModelPipeline
|
||||
- all
|
||||
- __call__
|
||||
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# IF
|
||||
# DeepFloyd IF
|
||||
|
||||
## Overview
|
||||
|
||||
@@ -71,7 +71,7 @@ First, let's load our pipeline:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline
|
||||
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
|
||||
|
||||
sd_model_ckpt = "stabilityai/stable-diffusion-2-1"
|
||||
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
|
||||
@@ -357,4 +357,4 @@ images[0].save("edited_image.png")
|
||||
- all
|
||||
- generate_mask
|
||||
- invert
|
||||
- __call__
|
||||
- __call__
|
||||
|
||||
@@ -11,19 +11,12 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
## Overview
|
||||
|
||||
Kandinsky 2.1 inherits best practices from [DALL-E 2](https://arxiv.org/abs/2204.06125) and [Latent Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/latent_diffusion), while introducing some new ideas.
|
||||
Kandinsky inherits best practices from [DALL-E 2](https://huggingface.co/papers/2204.06125) and [Latent Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/latent_diffusion), while introducing some new ideas.
|
||||
|
||||
It uses [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for encoding images and text, and a diffusion image prior (mapping) between latent spaces of CLIP modalities. This approach enhances the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.
|
||||
|
||||
The Kandinsky model is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov) and the original codebase can be found [here](https://github.com/ai-forever/Kandinsky-2)
|
||||
The Kandinsky model is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov). The original codebase can be found [here](https://github.com/ai-forever/Kandinsky-2)
|
||||
|
||||
## Available Pipelines:
|
||||
|
||||
| Pipeline | Tasks |
|
||||
|---|---|
|
||||
| [pipeline_kandinsky.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py) | *Text-to-Image Generation* |
|
||||
| [pipeline_kandinsky_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py) | *Image-Guided Image Generation* |
|
||||
| [pipeline_kandinsky_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_img2img.py) | *Image-Guided Image Generation* |
|
||||
|
||||
## Usage example
|
||||
|
||||
@@ -55,13 +48,26 @@ t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1"
|
||||
t2i_pipe.to("cuda")
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
By default, the text-to-image pipeline uses [`DDIMScheduler`]; you can change it to [`DDPMScheduler`]:
|
||||
|
||||
```py
|
||||
scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
|
||||
t2i_pipe = DiffusionPipeline.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16
|
||||
)
|
||||
t2i_pipe.to("cuda")
|
||||
```
|
||||
|
||||
</Tip>
|
||||
|
||||
Now we pass the prompt through the prior to generate image embeddings. The prior
|
||||
returns both the image embeddings corresponding to the prompt and negative/unconditional image
|
||||
embeddings corresponding to an empty string.
|
||||
|
||||
```py
|
||||
generator = torch.Generator(device="cuda").manual_seed(12)
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, generator=generator).to_tuple()
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
|
||||
```
|
||||
|
||||
<Tip warning={true}>
|
||||
@@ -78,7 +84,7 @@ of the prior by a factor of 2.
|
||||
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
|
||||
negative_prompt = "low quality, bad quality"
|
||||
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, generator=generator).to_tuple()
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
|
||||
```
|
||||
|
||||
</Tip>
|
||||
@@ -89,7 +95,9 @@ in case you are using a customized negative prompt, that you should pass this on
|
||||
with `negative_prompt=negative_prompt`:
|
||||
|
||||
```py
|
||||
image = t2i_pipe(prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds).images[0]
|
||||
image = t2i_pipe(
|
||||
prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768
|
||||
).images[0]
|
||||
image.save("cheeseburger_monster.png")
|
||||
```
|
||||
|
||||
@@ -120,6 +128,7 @@ prompt = "birds eye view of a quilted paper style alien planet landscape, vibran
|
||||

|
||||
|
||||
|
||||
|
||||
### Text Guided Image-to-Image Generation
|
||||
|
||||
The same Kandinsky model weights can be used for text-guided image-to-image translation. In this case, just make sure to load the weights using the [`KandinskyImg2ImgPipeline`] pipeline.
|
||||
@@ -160,8 +169,7 @@ pipe.to("cuda")
|
||||
prompt = "A fantasy landscape, Cinematic lighting"
|
||||
negative_prompt = "low quality, bad quality"
|
||||
|
||||
generator = torch.Generator(device="cuda").manual_seed(30)
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, generator=generator).to_tuple()
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt).to_tuple()
|
||||
|
||||
out = pipe(
|
||||
prompt,
|
||||
@@ -269,6 +277,207 @@ image.save("starry_cat.png")
|
||||

|
||||
|
||||
|
||||
### Text-to-Image Generation with ControlNet Conditioning
|
||||
|
||||
In the following, we give a simple example of how to use [`KandinskyV22ControlnetPipeline`] to add control to text-to-image generation with a depth image.
|
||||
|
||||
First, let's take an image and extract its depth map.
|
||||
|
||||
```python
|
||||
from diffusers.utils import load_image
|
||||
|
||||
img = load_image(
|
||||
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
|
||||
).resize((768, 768))
|
||||
```
|
||||

|
||||
|
||||
We can use the `depth-estimation` pipeline from 🤗 Transformers to process the image and retrieve its depth map.
|
||||
|
||||
```python
|
||||
import torch
|
||||
import numpy as np
|
||||
|
||||
from transformers import pipeline
|
||||
from diffusers.utils import load_image
|
||||
|
||||
|
||||
def make_hint(image, depth_estimator):
|
||||
image = depth_estimator(image)["depth"]
|
||||
image = np.array(image)
|
||||
image = image[:, :, None]
|
||||
image = np.concatenate([image, image, image], axis=2)
|
||||
detected_map = torch.from_numpy(image).float() / 255.0
|
||||
hint = detected_map.permute(2, 0, 1)
|
||||
return hint
|
||||
|
||||
|
||||
depth_estimator = pipeline("depth-estimation")
|
||||
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
|
||||
```
|
||||
Now, we load the prior pipeline and the text-to-image controlnet pipeline:
|
||||
|
||||
```python
|
||||
from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
|
||||
|
||||
pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
|
||||
)
|
||||
pipe_prior = pipe_prior.to("cuda")
|
||||
|
||||
pipe = KandinskyV22ControlnetPipeline.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
|
||||
)
|
||||
pipe = pipe.to("cuda")
|
||||
```
|
||||
|
||||
We pass the prompt and negative prompt through the prior to generate image embeddings:
|
||||
|
||||
```python
|
||||
prompt = "A robot, 4k photo"
|
||||
|
||||
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
|
||||
|
||||
generator = torch.Generator(device="cuda").manual_seed(43)
|
||||
image_emb, zero_image_emb = pipe_prior(
|
||||
prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
|
||||
).to_tuple()
|
||||
```
|
||||
|
||||
Now we can pass the image embeddings and the depth image we extracted to the controlnet pipeline. With Kandinsky 2.2, only the prior pipelines accept a `prompt` input; you do not need to pass the prompt to the controlnet pipeline.
|
||||
|
||||
```python
|
||||
images = pipe(
|
||||
image_embeds=image_emb,
|
||||
negative_image_embeds=zero_image_emb,
|
||||
hint=hint,
|
||||
num_inference_steps=50,
|
||||
generator=generator,
|
||||
height=768,
|
||||
width=768,
|
||||
).images
|
||||
|
||||
images[0].save("robot_cat.png")
|
||||
```
|
||||
|
||||
The output image looks as follows:
|
||||

|
||||
|
||||
### Image-to-Image Generation with ControlNet Conditioning
|
||||
|
||||
Kandinsky 2.2 also includes a [`KandinskyV22ControlnetImg2ImgPipeline`] that will allow you to add control to the image generation process with both the image and its depth map. This pipeline works really well with [`KandinskyV22PriorEmb2EmbPipeline`], which generates image embeddings based on both a text prompt and an image.
|
||||
|
||||
For our robot cat example, we will pass the prompt and cat image together to the prior pipeline to generate an image embedding. We will then use that image embedding and the depth map of the cat to further control the image generation process.
|
||||
|
||||
We can use the same cat image and its depth map from the last example.
|
||||
|
||||
```python
|
||||
import torch
|
||||
import numpy as np
|
||||
|
||||
from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
|
||||
from diffusers.utils import load_image
|
||||
from transformers import pipeline
|
||||
|
||||
img = load_image(
|
||||
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png"
|
||||
).resize((768, 768))
|
||||
|
||||
|
||||
def make_hint(image, depth_estimator):
|
||||
image = depth_estimator(image)["depth"]
|
||||
image = np.array(image)
|
||||
image = image[:, :, None]
|
||||
image = np.concatenate([image, image, image], axis=2)
|
||||
detected_map = torch.from_numpy(image).float() / 255.0
|
||||
hint = detected_map.permute(2, 0, 1)
|
||||
return hint
|
||||
|
||||
|
||||
depth_estimator = pipeline("depth-estimation")
|
||||
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
|
||||
|
||||
pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
|
||||
)
|
||||
pipe_prior = pipe_prior.to("cuda")
|
||||
|
||||
pipe = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
|
||||
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
|
||||
)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
prompt = "A robot, 4k photo"
|
||||
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
|
||||
|
||||
generator = torch.Generator(device="cuda").manual_seed(43)
|
||||
|
||||
# run prior pipeline
|
||||
|
||||
img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator)
|
||||
negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
|
||||
|
||||
# run controlnet img2img pipeline
|
||||
images = pipe(
|
||||
image=img,
|
||||
strength=0.5,
|
||||
image_embeds=img_emb.image_embeds,
|
||||
negative_image_embeds=negative_emb.image_embeds,
|
||||
hint=hint,
|
||||
num_inference_steps=50,
|
||||
generator=generator,
|
||||
height=768,
|
||||
width=768,
|
||||
).images
|
||||
|
||||
images[0].save("robot_cat.png")
|
||||
```
|
||||
|
||||
Here is the output. Compared with the output from our text-to-image controlnet example, it preserves many more of the cat's facial details from the original image while still adopting the robot style we asked for.
|
||||
|
||||

|
||||
|
||||
## Kandinsky 2.2
|
||||
|
||||
The Kandinsky 2.2 release includes robust new text-to-image models that support text-to-image generation, image-to-image generation, image interpolation, and text-guided image inpainting. The general workflow to perform these tasks with Kandinsky 2.2 is the same as in Kandinsky 2.1: first, use a prior pipeline to generate image embeddings based on your text prompt, and then use one of the image decoding pipelines to generate the output image. The only difference is that in Kandinsky 2.2, the decoding pipelines no longer accept a `prompt` input; image generation is conditioned only on `image_embeds` and `negative_image_embeds`.
|
||||
|
||||
Let's look at an example of how to perform text-to-image generation using Kandinsky 2.2.
|
||||
|
||||
First, let's create the prior pipeline and text-to-image pipeline with Kandinsky 2.2 checkpoints.
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16)
|
||||
pipe_prior.to("cuda")
|
||||
|
||||
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
|
||||
t2i_pipe.to("cuda")
|
||||
```
|
||||
|
||||
You can then use `pipe_prior` to generate image embeddings.
|
||||
|
||||
```python
|
||||
prompt = "portrait of a women, blue eyes, cinematic"
|
||||
negative_prompt = "low quality, bad quality"
|
||||
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
|
||||
```
|
||||
|
||||
Now you can pass these embeddings to the text-to-image pipeline. When using Kandinsky 2.2 you don't need to pass the `prompt` (but you do with the previous version, Kandinsky 2.1).
|
||||
|
||||
```py
|
||||
image = t2i_pipe(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[
|
||||
0
|
||||
]
|
||||
image.save("portrait.png")
|
||||
```
|
||||

|
||||
|
||||
We used the text-to-image pipeline as an example, but the same process applies to all decoding pipelines in Kandinsky 2.2. For more information, please refer to our API section for each pipeline.
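For instance, the image-to-image decoding pipeline follows the same embedding-first pattern. The sketch below reuses the `image_embeds` and `negative_image_embeds` produced by `pipe_prior` above; the `cat.png` path is a placeholder input image:

```py
import torch

from diffusers import KandinskyV22Img2ImgPipeline
from diffusers.utils import load_image

pipe = KandinskyV22Img2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.to("cuda")

init_image = load_image("cat.png").resize((768, 768))

# condition only on the image embeddings, plus the image to transform
image = pipe(
    image=init_image,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    strength=0.3,
    height=768,
    width=768,
).images[0]
```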
|
||||
|
||||
|
||||
## Optimization
|
||||
|
||||
Running Kandinsky in inference requires running both a first prior pipeline: [`KandinskyPriorPipeline`]
|
||||
@@ -321,30 +530,84 @@ t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=T
|
||||
After compilation, you should see a very fast inference time. For more information,
|
||||
feel free to have a look at [Our PyTorch 2.0 benchmark](https://huggingface.co/docs/diffusers/main/en/optimization/torch2.0).
|
||||
|
||||
## Available Pipelines:
|
||||
|
||||
| Pipeline | Tasks |
|
||||
|---|---|
|
||||
| [pipeline_kandinsky2_2.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2.py) | *Text-to-Image Generation* |
|
||||
| [pipeline_kandinsky.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py) | *Text-to-Image Generation* |
|
||||
| [pipeline_kandinsky2_2_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_inpaint.py) | *Image-Guided Image Generation* |
|
||||
| [pipeline_kandinsky_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py) | *Image-Guided Image Generation* |
|
||||
| [pipeline_kandinsky2_2_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_img2img.py) | *Image-Guided Image Generation* |
|
||||
| [pipeline_kandinsky_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_img2img.py) | *Image-Guided Image Generation* |
|
||||
| [pipeline_kandinsky2_2_controlnet.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet.py) | *Image-Guided Image Generation* |
|
||||
| [pipeline_kandinsky2_2_controlnet_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet_img2img.py) | *Image-Guided Image Generation* |
|
||||
|
||||
|
||||
### KandinskyV22Pipeline
|
||||
|
||||
[[autodoc]] KandinskyV22Pipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## KandinskyPriorPipeline
|
||||
### KandinskyV22ControlnetPipeline
|
||||
|
||||
[[autodoc]] KandinskyV22ControlnetPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
### KandinskyV22ControlnetImg2ImgPipeline
|
||||
|
||||
[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
### KandinskyV22Img2ImgPipeline
|
||||
|
||||
[[autodoc]] KandinskyV22Img2ImgPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
### KandinskyV22InpaintPipeline
|
||||
|
||||
[[autodoc]] KandinskyV22InpaintPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
### KandinskyV22PriorPipeline
|
||||
|
||||
[[autodoc]] KandinskyV22PriorPipeline
|
||||
- all
|
||||
- __call__
|
||||
- interpolate
|
||||
|
||||
### KandinskyV22PriorEmb2EmbPipeline
|
||||
|
||||
[[autodoc]] KandinskyV22PriorEmb2EmbPipeline
|
||||
- all
|
||||
- __call__
|
||||
- interpolate
|
||||
|
||||
### KandinskyPriorPipeline
|
||||
|
||||
[[autodoc]] KandinskyPriorPipeline
|
||||
- all
|
||||
- __call__
|
||||
- interpolate
|
||||
|
||||
## KandinskyPipeline
|
||||
### KandinskyPipeline
|
||||
|
||||
[[autodoc]] KandinskyPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## KandinskyImg2ImgPipeline
|
||||
### KandinskyImg2ImgPipeline
|
||||
|
||||
[[autodoc]] KandinskyImg2ImgPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## KandinskyInpaintPipeline
|
||||
### KandinskyInpaintPipeline
|
||||
|
||||
[[autodoc]] KandinskyInpaintPipeline
|
||||
- all
|
||||
|
||||
@@ -54,14 +54,19 @@ available a colab notebook to directly try them out.
|
||||
| [if](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image Generation | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
|
||||
| [if_img2img](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
|
||||
| [if_inpainting](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
|
||||
| [kandinsky](./kandinsky) | **Kandinsky** | Text-to-Image Generation |
|
||||
| [kandinsky_inpaint](./kandinsky) | **Kandinsky** | Image-to-Image Generation |
|
||||
| [kandinsky_img2img](./kandinsky) | **Kandinsky** | Image-to-Image Generation |
|
||||
| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
|
||||
| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
|
||||
| [latent_diffusion_uncond](./latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
|
||||
| [paint_by_example](./paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
|
||||
| [paradigms](./paradigms) | [**Parallel Sampling of Diffusion Models**](https://arxiv.org/abs/2305.16317) | Text-to-Image Generation |
|
||||
| [pndm](./pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
|
||||
| [score_sde_ve](./score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
|
||||
| [score_sde_vp](./score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
|
||||
| [semantic_stable_diffusion](./semantic_stable_diffusion) | [**SEGA: Instructing Diffusion using Semantic Dimensions**](https://arxiv.org/abs/2301.12247) | Text-to-Image Generation |
|
||||
| [stable_diffusion_adapter](./stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation with Adapters | -
|
||||
| [stable_diffusion_text2img](./stable_diffusion/text2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb)
|
||||
| [stable_diffusion_img2img](./stable_diffusion/img2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb)
|
||||
| [stable_diffusion_inpaint](./stable_diffusion/inpaint) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb)
|
||||
@@ -72,21 +77,20 @@ available a colab notebook to directly try them out.
|
||||
| [stable_diffusion_self_attention_guidance](./stable_diffusion/self_attention_guidance) | [**Self-Attention Guidance**](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
|
||||
| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [**Stable Diffusion Image Variations**](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
|
||||
| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [**Stable Diffusion Latent Upscaler**](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
|
||||
| [stable_diffusion_2](./stable_diffusion_2/) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation |
|
||||
| [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
|
||||
| [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Depth-to-Image Text-Guided Generation |
|
||||
| [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
|
||||
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
|
||||
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Depth-to-Image Text-Guided Generation |
|
||||
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
|
||||
| [stable_diffusion_safe](./stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | [](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb)
|
||||
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation |
|
||||
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation |
|
||||
| [stochastic_karras_ve](./stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
|
||||
| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
|
||||
| [unclip](./unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
|
||||
| [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
|
||||
| [vq_diffusion](./vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
|
||||
| [text_to_video_zero](./text_to_video_zero) | [Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) | Text-to-Video Generation |
|
||||
| [text_to_video_sd](./api/pipelines/text_to_video) | [**Modelscope's Text-to-video-synthesis Model in Open Domain**](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
|
||||
| [unclip](./unclip) | [**Hierarchical Text-Conditional Image Generation with CLIP Latents**](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
|
||||
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
|
||||
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
|
||||
| [vq_diffusion](./vq_diffusion) | [**Vector Quantized Diffusion Model for Text-to-Image Synthesis**](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
|
||||
| [text_to_video_zero](./text_to_video_zero) | [**Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators**](https://arxiv.org/abs/2303.13439) | Text-to-Video Generation |
|
||||
|
||||
|
||||
**Note**: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers.
|
||||
|
||||
@@ -60,6 +60,25 @@ and increase the VRAM usage.
|
||||
|
||||
</Tip>
|
||||
|
||||
<Tip>
|
||||
|
||||
Circular padding is applied to ensure there are no stitching artifacts when working with
|
||||
panoramas that need to seamlessly transition from the rightmost part to the leftmost part.
|
||||
By enabling circular padding (set `circular_padding=True`), the operation applies additional
|
||||
crops after the rightmost point of the image, allowing the model to "see" the transition
|
||||
from the rightmost part to the leftmost part. This helps maintain visual consistency in
|
||||
a 360-degree sense and creates a proper "panorama" that can be viewed using 360-degree
|
||||
panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied
|
||||
to ensure that the decoded latents match in the RGB space.
|
||||
|
||||
Without circular padding, there is a stitching artifact (default):
|
||||

|
||||
|
||||
With circular padding, the right and the left parts are matching (`circular_padding=True`):
|
||||

|
||||
|
||||
</Tip>
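As a minimal sketch of enabling it at call time (the `stabilityai/stable-diffusion-2-base` checkpoint and the prompt are assumptions used for illustration):

```py
import torch

from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# generate a panorama whose right edge wraps seamlessly back to its left edge
image = pipe("a photo of the dolomites", circular_padding=True).images[0]
```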
|
||||
|
||||
## StableDiffusionPanoramaPipeline
|
||||
[[autodoc]] StableDiffusionPanoramaPipeline
|
||||
- __call__
|
||||
|
||||
83
docs/source/en/api/pipelines/paradigms.mdx
Normal file
@@ -0,0 +1,83 @@
|
||||
<!--Copyright 2023 ParaDiGMS authors and The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Parallel Sampling of Diffusion Models (ParaDiGMS)
|
||||
|
||||
## Overview
|
||||
|
||||
[Parallel Sampling of Diffusion Models](https://arxiv.org/abs/2305.16317) by Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari.
|
||||
|
||||
The abstract of the paper is the following:
|
||||
|
||||
*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*
|
||||
|
||||
Resources:
|
||||
|
||||
* [Paper](https://arxiv.org/abs/2305.16317).
|
||||
* [Original Code](https://github.com/AndyShih12/paradigms).
|
||||
|
||||
## Available Pipelines:
|
||||
|
||||
| Pipeline | Tasks | Demo |
|
||||
|---|---|:---:|
|
||||
| [StableDiffusionParadigmsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py) | *Faster Text-to-Image Generation* | |
|
||||
|
||||
This pipeline was contributed by [`AndyShih12`](https://github.com/AndyShih12) in this [PR](https://github.com/huggingface/diffusers/pull/3716/).
|
||||
|
||||
## Usage example
|
||||
|
||||
```python
|
||||
import torch
|
||||
from diffusers import DDPMParallelScheduler
|
||||
from diffusers import StableDiffusionParadigmsPipeline
|
||||
|
||||
scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
|
||||
|
||||
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
|
||||
"runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16
|
||||
)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
ngpu, batch_per_device = torch.cuda.device_count(), 5
|
||||
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=list(range(ngpu)))
|
||||
|
||||
prompt = "a photo of an astronaut riding a horse on mars"
|
||||
image = pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=1000).images[0]
|
||||
```
|
||||
|
||||
<Tip>
|
||||
This pipeline improves sampling speed by running denoising steps in parallel, at the cost of increased total FLOPs.
|
||||
Therefore, it is better to call this pipeline when running on multiple GPUs. Otherwise, without enough GPU bandwidth,
|
||||
sampling may be even slower than sequential sampling.
|
||||
|
||||
The two parameters to play with are `parallel` (batch size) and `tolerance`.
|
||||
- If it fits in memory, for 1000-step DDPM you can aim for a batch size of around 100
|
||||
(e.g. 8 GPUs and batch_per_device=12 to get parallel=96). A higher batch size
|
||||
may not fit in memory, and a lower batch size gives less parallelism.
|
||||
- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation.
|
||||
If there is quality degradation with the default tolerance, then use a lower tolerance (e.g. 0.001).
|
||||
|
||||
For 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup by using StableDiffusionParadigmsPipeline instead of StableDiffusionPipeline
|
||||
by setting parallel=80 and tolerance=0.1.
|
||||
</Tip>
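Continuing the usage example above, both knobs can be set directly on the pipeline call (a hedged sketch of the settings mentioned in the tip):

```py
# larger parallel batch, plus the tolerance of 0.1 mentioned in the tip
# for the Picard-iteration stopping criterion
image = pipe(prompt, parallel=80, tolerance=0.1, num_inference_steps=1000).images[0]
```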
|
||||
|
||||
<Tip>
|
||||
Diffusers also offers distributed inference support for generating multiple prompts
|
||||
in parallel on multiple GPUs. Check out the docs [here](https://huggingface.co/docs/diffusers/main/en/training/distributed_inference).
|
||||
|
||||
In contrast, this pipeline is designed for speeding up sampling of a single prompt (by using multiple GPUs).
|
||||
</Tip>
|
||||
|
||||
## StableDiffusionParadigmsPipeline
|
||||
[[autodoc]] StableDiffusionParadigmsPipeline
|
||||
- __call__
|
||||
- all
|
||||
196
docs/source/en/api/pipelines/shap_e.mdx
Normal file
@@ -0,0 +1,196 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Shap-E
|
||||
|
||||
## Overview
|
||||
|
||||
|
||||
The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://arxiv.org/abs/2305.02463) by Heewoo Jun and Alex Nichol from [OpenAI](https://github.com/openai).
|
||||
|
||||
The abstract of the paper is the following:
|
||||
|
||||
*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.*
|
||||
|
||||
The original codebase can be found [here](https://github.com/openai/shap-e).
|
||||
|
||||
## Available Pipelines:
|
||||
|
||||
| Pipeline | Tasks |
|
||||
|---|---|
|
||||
| [pipeline_shap_e.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e.py) | *Text-to-Image Generation* |
|
||||
| [pipeline_shap_e_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py) | *Image-to-Image Generation* |
|
||||
|
||||
## Available checkpoints
|
||||
|
||||
* [`openai/shap-e`](https://huggingface.co/openai/shap-e)
|
||||
* [`openai/shap-e-img2img`](https://huggingface.co/openai/shap-e-img2img)
|
||||
|
||||
## Usage Examples
|
||||
|
||||
In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in GIF format.
|
||||
|
||||
### Text-to-3D image generation
|
||||
|
||||
We can use [`ShapEPipeline`] to create a 3D object based on a text prompt. In this example, we will make a birthday cupcake for the 🧨 diffusers library's first birthday. The workflow for the Shap-E text-to-image pipeline is the same as for other text-to-image pipelines in diffusers.
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
|
||||
repo = "openai/shap-e"
|
||||
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
|
||||
pipe = pipe.to(device)
|
||||
|
||||
guidance_scale = 15.0
|
||||
prompt = ["A firecracker", "A birthday cupcake"]
|
||||
|
||||
images = pipe(
|
||||
prompt,
|
||||
guidance_scale=guidance_scale,
|
||||
num_inference_steps=64,
|
||||
frame_size=256,
|
||||
).images
|
||||
```
|
||||
|
||||
The output of [`ShapEPipeline`] is a list of lists of image frames. Each list of frames can be used to create a 3D object. Let's use the `export_to_gif` utility function in diffusers to make a 3D cupcake!
|
||||
|
||||
```python
|
||||
from diffusers.utils import export_to_gif
|
||||
|
||||
export_to_gif(images[0], "firecracker_3d.gif")
|
||||
export_to_gif(images[1], "cake_3d.gif")
|
||||
```
|
||||

|
||||

|
||||
|
||||
|
||||
### Image-to-Image generation
|
||||
|
||||
You can use [`ShapEImg2ImgPipeline`] along with other text-to-image pipelines in diffusers and turn your 2D generation into 3D.
|
||||
|
||||
In this example, we will first generate a cheeseburger with a simple prompt, "A cheeseburger, white background".
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
|
||||
pipe_prior.to("cuda")
|
||||
|
||||
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
|
||||
t2i_pipe.to("cuda")
|
||||
|
||||
prompt = "A cheeseburger, white background"
|
||||
|
||||
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
|
||||
image = t2i_pipe(
|
||||
prompt,
|
||||
image_embeds=image_embeds,
|
||||
negative_image_embeds=negative_image_embeds,
|
||||
).images[0]
|
||||
|
||||
image.save("burger.png")
|
||||
```
|
||||
|
||||

|
||||
|
||||
We will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger :)
|
||||
|
||||
```python
|
||||
from PIL import Image
|
||||
from diffusers.utils import export_to_gif
|
||||
|
||||
repo = "openai/shap-e-img2img"
|
||||
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
guidance_scale = 3.0
|
||||
image = Image.open("burger.png").resize((256, 256))
|
||||
|
||||
images = pipe(
|
||||
image,
|
||||
guidance_scale=guidance_scale,
|
||||
num_inference_steps=64,
|
||||
frame_size=256,
|
||||
).images
|
||||
|
||||
gif_path = export_to_gif(images[0], "burger_3d.gif")
|
||||
```
|
||||

|
||||
|
||||
### Generate mesh
|
||||
|
||||
For both [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`], you can generate mesh output by passing `output_type` as `mesh` to the pipeline, and then use the [`~utils.export_to_ply`] utility function to save the output as a `ply` file. We also provide a [`~utils.export_to_obj`] function that you can use to save mesh outputs as `obj` files.
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
from diffusers import DiffusionPipeline
|
||||
from diffusers.utils import export_to_ply
|
||||
|
||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||
|
||||
repo = "openai/shap-e"
|
||||
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16, variant="fp16")
|
||||
pipe = pipe.to(device)
|
||||
|
||||
guidance_scale = 15.0
|
||||
prompt = "A birthday cupcake"
|
||||
|
||||
images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
|
||||
|
||||
ply_path = export_to_ply(images[0], "3d_cake.ply")
|
||||
print(f"saved to folder: {ply_path}")
|
||||
```
|
||||
|
||||
Hugging Face Datasets supports mesh visualization for mesh files in `glb` format. Below we will show you how to convert your mesh file into `glb` format so that you can use the Dataset viewer to render 3D objects.
|
||||
|
||||
We need to install the `trimesh` library:
|
||||
|
||||
```
|
||||
pip install trimesh
|
||||
```
|
||||
|
||||
To convert the mesh file into the `glb` format:
|
||||
|
||||
```python
|
||||
import trimesh
|
||||
|
||||
mesh = trimesh.load("3d_cake.ply")
|
||||
mesh.export("3d_cake.glb", file_type="glb")
|
||||
```
|
||||
|
||||
By default, the mesh output of Shap-E is from the bottom viewpoint; you can change the default viewpoint by applying a rotation transformation:
|
||||
|
||||
```python
|
||||
import trimesh
|
||||
import numpy as np
|
||||
|
||||
mesh = trimesh.load("3d_cake.ply")
|
||||
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
|
||||
mesh = mesh.apply_transform(rot)
|
||||
mesh.export("3d_cake.glb", file_type="glb")
|
||||
```
|
||||
|
||||
Now you can upload your mesh file to your dataset and visualize it! Here is the link to the 3D cake we just generated:
|
||||
https://huggingface.co/datasets/hf-internal-testing/diffusers-images/blob/main/shap_e/3d_cake.glb
|
||||
|
||||
## ShapEPipeline
|
||||
[[autodoc]] ShapEPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## ShapEImg2ImgPipeline
|
||||
[[autodoc]] ShapEImg2ImgPipeline
|
||||
- all
|
||||
- __call__
|
||||
187
docs/source/en/api/pipelines/stable_diffusion/adapter.mdx
Normal file
@@ -0,0 +1,187 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Text-to-Image Generation with Adapter Conditioning
|
||||
|
||||
## Overview
|
||||
|
||||
[T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.08453) by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.
|
||||
|
||||
Using the pretrained models, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.
|
||||
|
||||
The abstract of the paper is the following:
|
||||
|
||||
*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate structure control is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and small T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, and achieve rich control and editing effects. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.*
|
||||
|
||||
This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ .

## Available Pipelines:

| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning* | -

## Usage example

In the following, we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference.
All adapters use the same pipeline:

1. Images are first converted into the appropriate *control image* format.
2. The *control image* and *prompt* are passed to the [`StableDiffusionAdapterPipeline`].

Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1).

```python
from diffusers.utils import load_image

image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")
```



Then we can create our color palette by simply resizing it to 8 by 8 pixels and then scaling it back to the original size.

```python
from PIL import Image

color_palette = image.resize((8, 8))
color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)
```

Let's take a look at the processed image.



Next, create the adapter pipeline:

```py
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

# Load the adapter in the same dtype as the pipeline to avoid a dtype mismatch at inference.
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    adapter=adapter,
    torch_dtype=torch.float16,
)
pipe.to("cuda")
```

Finally, pass the prompt and control image to the pipeline:

```py
# fix the random seed, so you will get the same result as the example
generator = torch.manual_seed(7)

out_image = pipe(
    "At night, glowing cubes in front of the beach",
    image=color_palette,
    generator=generator,
).images[0]
```



## Available checkpoints

Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://huggingface.co/TencentARC/T2I-Adapter/tree/main/models).

### T2I-Adapter with Stable Diffusion 1.4

| Model Name | Control Image Overview | Control Image Example | Generated Image Example |
|---|---|---|---|
|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)<br/> *Trained with spatial color palette* | An image with an 8x8 color palette.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"/></a>|
|[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"/></a>|
|[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"/></a>|
|[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"/></a>|
|[TencentARC/t2iadapter_openpose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_openpose_sd14v1)<br/> *Trained with OpenPose bone image* | An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_output.png"/></a>|
|[TencentARC/t2iadapter_keypose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_keypose_sd14v1)<br/> *Trained with mmpose skeleton image* | A [mmpose skeleton](https://github.com/open-mmlab/mmpose) image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_output.png"/></a>|
|[TencentARC/t2iadapter_seg_sd14v1](https://huggingface.co/TencentARC/t2iadapter_seg_sd14v1)<br/>*Trained with semantic segmentation* | A [custom](https://github.com/TencentARC/T2I-Adapter/discussions/25) segmentation protocol image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_output.png"/></a> |
|[TencentARC/t2iadapter_canny_sd15v2](https://huggingface.co/TencentARC/t2iadapter_canny_sd15v2)||
|[TencentARC/t2iadapter_depth_sd15v2](https://huggingface.co/TencentARC/t2iadapter_depth_sd15v2)||
|[TencentARC/t2iadapter_sketch_sd15v2](https://huggingface.co/TencentARC/t2iadapter_sketch_sd15v2)||
|[TencentARC/t2iadapter_zoedepth_sd15v1](https://huggingface.co/TencentARC/t2iadapter_zoedepth_sd15v1)||
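
The SD 1.5 v2 checkpoints are used the same way as the SD 1.4 ones. As a minimal sketch (pairing the canny adapter with `runwayml/stable-diffusion-v1-5` is an assumption here, and the control image is reused from the table above):

```python
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

# Any monochrome canny edge map works as the control image.
canny_image = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"
)

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe("a futuristic city at dusk", image=canny_image).images[0]
```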

## Combining multiple adapters

[`MultiAdapter`] can be used for applying multiple conditionings at once.

Here we use the keypose adapter for the character posture and the depth adapter for creating the scene.

```py
import torch
from PIL import Image
from diffusers.utils import load_image

cond_keypose = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
)
cond_depth = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
)
cond = [[cond_keypose, cond_depth]]

prompt = ["A man walking in an office room with a nice view"]
```

The two control images look as such:




`MultiAdapter` combines keypose and depth adapters.

`adapter_conditioning_scale` balances the relative influence of the different adapters.

```py
from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter

adapters = MultiAdapter(
    [
        T2IAdapter.from_pretrained("TencentARC/t2iadapter_keypose_sd14v1"),
        T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd14v1"),
    ]
)
adapters = adapters.to(torch.float16)

pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    adapter=adapters,
)
pipe.to("cuda")

images = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]).images
```



## T2I-Adapter vs ControlNet

T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet), but it uses a smaller auxiliary network that is only run once for the entire diffusion process.
However, T2I-Adapter performs slightly worse than ControlNet.

## StableDiffusionAdapterPipeline
[[autodoc]] StableDiffusionAdapterPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention

@@ -31,7 +31,7 @@ proposed by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan

- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- from_ckpt
- from_single_file
- load_lora_weights
- save_lora_weights

@@ -0,0 +1,55 @@

<!--Copyright 2023 The Intel Labs Team Authors and HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LDM3D

LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal.
The abstract of the paper is the following:

*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).*

*Overview*:

| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [pipeline_stable_diffusion_ldm3d.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py) | *Text-to-Image Generation* | - | -

## Tips

- LDM3D generates both an image and a depth map from a given text prompt, whereas existing text-to-image diffusion models such as [Stable Diffusion](./stable_diffusion/overview) generate only an image.
- With almost the same number of parameters, LDM3D manages to create a latent space that can compress both the RGB images and the depth maps.

Running LDM3D is straightforward with the [`StableDiffusionLDM3DPipeline`]:

```python
from diffusers import StableDiffusionLDM3DPipeline

pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d")
prompt = "A picture of some lemons on a table"
output = pipe(prompt)
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save("lemons_ldm3d_rgb.jpg")
depth_image[0].save("lemons_ldm3d_depth.png")
```
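
Many image viewers cannot display the saved depth map directly. A minimal sketch for turning it into an 8-bit grayscale preview (assuming the pipeline saved the depth map as a 16-bit PNG, as in the snippet above):

```python
import numpy as np
from PIL import Image

# Assumption: the depth map was saved as a 16-bit PNG by the snippet above.
depth = np.array(Image.open("lemons_ldm3d_depth.png"), dtype=np.float32)
# Normalize to [0, 1], then rescale to the 8-bit range for viewing.
depth = (depth - depth.min()) / max(depth.max() - depth.min(), 1e-8)
Image.fromarray((depth * 255).astype(np.uint8)).save("lemons_ldm3d_depth_preview.png")
```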

## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
- all
- __call__

## StableDiffusionLDM3DPipeline
[[autodoc]] StableDiffusionLDM3DPipeline
- all
- __call__

@@ -26,19 +26,17 @@ For more details about how Stable Diffusion works and how it differs from the ba

| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [StableDiffusionPipeline](./text2img) | *Text-to-Image Generation* | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb) | [🤗 Stable Diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion)
| [StableDiffusionPipelineSafe](./stable_diffusion_safe) | *Text-to-Image Generation* | [](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb) | [](https://huggingface.co/spaces/AIML-TUDA/unsafe-vs-safe-stable-diffusion)
| [StableDiffusionImg2ImgPipeline](./img2img) | *Image-to-Image Text-Guided Generation* | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb) | [🤗 Diffuse the Rest](https://huggingface.co/spaces/huggingface/diffuse-the-rest)
| [StableDiffusionInpaintPipeline](./inpaint) | **Experimental** – *Text-Guided Image Inpainting* | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) | Coming soon
| [StableDiffusionDepth2ImgPipeline](./depth2img) | **Experimental** – *Depth-to-Image Text-Guided Generation * | | Coming soon
| [StableDiffusionImageVariationPipeline](./image_variation) | **Experimental** – *Image Variation Generation * | | [🤗 Stable Diffusion Image Variations](https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations)
| [StableDiffusionUpscalePipeline](./upscale) | **Experimental** – *Text-Guided Image Super-Resolution * | | Coming soon
| [StableDiffusionLatentUpscalePipeline](./latent_upscale) | **Experimental** – *Text-Guided Image Super-Resolution * | | Coming soon
| [StableDiffusionInstructPix2PixPipeline](./pix2pix) | **Experimental** – *Text-Based Image Editing * | | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/spaces/timbrooks/instruct-pix2pix)
| [StableDiffusionAttendAndExcitePipeline](./attend_and_excite) | **Experimental** – *Text-to-Image Generation * | | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite)
| [StableDiffusionPix2PixZeroPipeline](./pix2pix_zero) | **Experimental** – *Text-Based Image Editing * | | [Zero-shot Image-to-Image Translation](https://arxiv.org/abs/2302.03027)
| [StableDiffusionModelEditingPipeline](./model_editing) | **Experimental** – *Text-to-Image Model Editing * | | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://arxiv.org/abs/2303.08084)
| [StableDiffusionDiffEditPipeline](./diffedit) | **Experimental** – *Text-Based Image Editing * | | [DiffEdit: Diffusion-based semantic image editing with mask guidance](https://arxiv.org/abs/2210.11427)
| [StableDiffusionInpaintPipeline](./inpaint) | **Experimental** – *Text-Guided Image Inpainting* | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) |
| [StableDiffusionDepth2ImgPipeline](./depth2img) | **Experimental** – *Depth-to-Image Text-Guided Generation* | |
| [StableDiffusionImageVariationPipeline](./image_variation) | **Experimental** – *Image Variation Generation* | | [🤗 Stable Diffusion Image Variations](https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations)
| [StableDiffusionUpscalePipeline](./upscale) | **Experimental** – *Text-Guided Image Super-Resolution* | |
| [StableDiffusionLatentUpscalePipeline](./latent_upscale) | **Experimental** – *Text-Guided Image Super-Resolution* | |
| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Image Inpainting* |
| [Stable Diffusion 2](./stable_diffusion_2) | *Depth-to-Image Text-Guided Generation* |
| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Super Resolution Image-to-Image* |
| [StableDiffusionLDM3DPipeline](./ldm3d) | *Text-to-(RGB, Depth)* |

## Tips

@@ -71,6 +71,64 @@ image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")
```

#### Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed"

The paper **[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891)**
claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion.

The abstract reads as follows:

*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR),
and some implementations of diffusion samplers do not start from the last timestep.
Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference.
We show that the flawed design causes real problems in existing implementations.
In Stable Diffusion, it severely limits the model to only generate images with medium brightness and
prevents it from generating very bright and dark samples. We propose a few simple fixes:
- (1) rescale the noise schedule to enforce zero terminal SNR;
- (2) train the model with v prediction;
- (3) change the sampler to always start from the last timestep;
- (4) rescale classifier-free guidance to prevent over-exposure.
These simple changes ensure the diffusion process is congruent between training and inference and
allow the model to generate samples more faithful to the original data distribution.*

You can apply all of these changes in `diffusers` when using [`DDIMScheduler`]:
- (1) rescale the noise schedule to enforce zero terminal SNR;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
```
- (2) train the model with v prediction;
Continue fine-tuning a checkpoint with [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)
and `--prediction_type="v_prediction"`.
- (3) change the sampler to always start from the last timestep;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
```
- (4) rescale classifier-free guidance to prevent over-exposure.
```py
pipe(..., guidance_rescale=0.7)
```

An example is to use [this checkpoint](https://huggingface.co/ptx0/pseudo-journey-v2),
which has been fine-tuned with `"v_prediction"`.

The checkpoint can then be run in inference as follows:

```py
import torch
from diffusers import DiffusionPipeline, DDIMScheduler

pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
)
pipe.to("cuda")

prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipe(prompt, guidance_rescale=0.7).images[0]
```

## DDIMScheduler
[[autodoc]] DDIMScheduler

### Image Inpainting

- *Image Inpainting (512x512 resolution)*: [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) with [`StableDiffusionInpaintPipeline`]

@@ -0,0 +1,364 @@

<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Stable Diffusion XL

Stable Diffusion XL was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://arxiv.org/abs/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.

The abstract of the paper is the following:

*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.*

## Tips

- Stable Diffusion XL works especially well with images between 768 and 1024 pixels.
- The Stable Diffusion XL output image can be improved by making use of a refiner, as shown below.

### Available checkpoints:

- *Text-to-Image (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-base-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9) with [`StableDiffusionXLPipeline`]
- *Image-to-Image / Refiner (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-refiner-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9) with [`StableDiffusionXLImg2ImgPipeline`]

## Usage Example

Before using SDXL, make sure to have `transformers`, `accelerate`, `safetensors` and `invisible_watermark` installed.
You can install the libraries as follows:

```
pip install transformers
pip install accelerate
pip install safetensors
pip install "invisible-watermark>=0.2.0"
```

### Text-to-Image

You can use SDXL as follows for *text-to-image*:

```py
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]
```

### Image-to-image

You can use SDXL as follows for *image-to-image*:

```py
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe = pipe.to("cuda")
url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"

init_image = load_image(url).convert("RGB")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, image=init_image).images[0]
```

### Inpainting

You can use SDXL as follows for *inpainting*:

```py
import torch
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0]
```

### Refining the image output

In addition to the [base model checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9),
StableDiffusion-XL also includes a [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9)
that is specialized in denoising low-noise stage images to generate images of improved high-frequency quality.
This refiner checkpoint can be used as a "second-step" pipeline after having run the base checkpoint to improve
image quality.

When using the refiner, one can easily
- 1.) employ the base model and refiner as an *Ensemble of Expert Denoisers* as first proposed in [eDiff-I](https://research.nvidia.com/labs/dir/eDiff-I/), or
- 2.) simply run the refiner in [SDEdit](https://arxiv.org/abs/2108.01073) fashion after the base model.

**Note**: The idea of using the SD-XL base & refiner as an ensemble of experts was first brought forward by
a couple of community contributors who also helped shape the following `diffusers` implementation, namely:
- [SytanSD](https://github.com/SytanSD)
- [bghira](https://github.com/bghira)
- [Birch-san](https://github.com/Birch-san)

#### 1.) Ensemble of Expert Denoisers

When using the base and refiner model as an ensemble of expert denoisers, the base model should serve as the
expert for the high-noise diffusion stage and the refiner as the expert for the low-noise diffusion stage.

The advantage of 1.) over 2.) is that it requires fewer overall denoising steps and should therefore be significantly
faster. The drawback is that one cannot really inspect the output of the base model; it will still be heavily denoised.

To use the base model and refiner as an ensemble of expert denoisers, make sure to define the fraction
of timesteps which should be run through the high-noise denoising stage (*i.e.* the base model) and the low-noise
denoising stage (*i.e.* the refiner model) respectively. This fraction should be set as the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) of the base model
and as the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) of the refiner model.

Let's look at an example.
First, we import the two pipelines. Since the text encoders and variational autoencoder are the same,
you don't have to load those again for the refiner.

```py
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")
```

Now we define the number of inference steps and the fraction of steps that should be run through the
high-noise denoising stage (*i.e.* the base model).

```py
n_steps = 40
high_noise_frac = 0.7
```

A fraction of 0.7 means that 70% of the 40 inference steps (28 steps) are run through the base model
and the remaining 12 steps are run through the refiner. Let's run the two pipelines now.
Make sure to set `denoising_end` and `denoising_start` to the same value and keep `num_inference_steps`
constant. Also remember that the output of the base model should be in latent space:

```py
prompt = "A majestic lion jumping from a big stone at night"

image = base(prompt=prompt, num_inference_steps=n_steps, denoising_end=high_noise_frac, output_type="latent").images
image = refiner(prompt=prompt, num_inference_steps=n_steps, denoising_start=high_noise_frac, image=image).images[0]
```

Let's have a look at the image:

| Original Image | Ensemble of Expert Denoisers |
|---|---|
|  | 

Had we just run the base model on the same 40 steps, the image would arguably be less detailed (e.g. the lion's eyes and nose).

<Tip>

The ensemble-of-experts method works well on all available schedulers!

</Tip>

#### 2.) Refining the image output from the fully denoised base image

In standard [`StableDiffusionImg2ImgPipeline`] fashion, the fully denoised image generated by the base model
can be further improved using the [refiner checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9).

For this, you simply run the refiner as a normal image-to-image pipeline after the "base" text-to-image
pipeline. You can leave the outputs of the base model in latent space.
```py
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    text_encoder_2=pipe.text_encoder_2,
    vae=pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

# keep the base output in latent space so the refiner can consume it directly
image = pipe(prompt=prompt, output_type="latent").images[0]
image = refiner(prompt=prompt, image=image[None, :]).images[0]
```

| Original Image | Refined Image |
|---|---|
|  |  |

<Tip>

The refiner can also very well be used in an inpainting setting. To do so, just make
sure you use the [`StableDiffusionXLInpaintPipeline`] class as shown below.

</Tip>

To use the refiner for inpainting in the Ensemble of Expert Denoisers setting, you can do the following:

```py
import torch
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    text_encoder_2=pipe.text_encoder_2,
    vae=pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"
num_inference_steps = 75
high_noise_frac = 0.7

image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    strength=0.80,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_start=high_noise_frac,
).images[0]
```

To use the refiner for inpainting in the standard SDE-style setting, simply remove `denoising_end` and `denoising_start` and choose a smaller
number of inference steps for the refiner, as sketched below.
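
A minimal sketch of that plain SDE-style refiner pass, reusing `pipe`, `refiner`, and the images from the snippet above (the step counts here are illustrative, not tuned):

```py
# Base pass: fully denoise the inpainted image as usual.
image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=75,
    strength=0.80,
).images[0]

# Refiner pass: a plain image-to-image step with fewer inference steps.
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=30,
).images[0]
```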

### Loading single file checkpoints / original file format

By making use of [`~diffusers.loaders.FromSingleFileMixin.from_single_file`] you can also load the
original file format into `diffusers`:

```py
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

pipe = StableDiffusionXLPipeline.from_single_file(
    "./sd_xl_base_0.9.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "./sd_xl_refiner_0.9.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
)
refiner.to("cuda")
```

### Memory optimization via model offloading

If you are seeing out-of-memory errors, we recommend making use of [`StableDiffusionXLPipeline.enable_model_cpu_offload`]:

```diff
- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()
```

and

```diff
- refiner.to("cuda")
+ refiner.enable_model_cpu_offload()
```

### Speed-up inference with `torch.compile`

You can speed up inference by making use of `torch.compile`. This should give you roughly a 20% speed-up.

```diff
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
```
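
To verify the speed-up on your own hardware, a rough timing sketch (the helper below is hypothetical; the first call to a compiled UNet pays a one-time compilation cost, so we warm up before measuring):

```py
import time
import torch

def seconds_per_image(pipe, prompt):
    # Warm-up run; for a compiled UNet this also triggers compilation.
    pipe(prompt=prompt, num_inference_steps=25)
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt=prompt, num_inference_steps=25)
    torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"{seconds_per_image(pipe, 'a photo of a cat'):.2f} s per image")
```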

### Running with `torch < 2.0`

**Note** that if you want to run Stable Diffusion XL with `torch` < 2.0, please make sure to enable xformers
attention:

```
pip install xformers
```

```diff
+ pipe.enable_xformers_memory_efficient_attention()
+ refiner.enable_xformers_memory_efficient_attention()
```

## StableDiffusionXLPipeline

[[autodoc]] StableDiffusionXLPipeline
- all
- __call__

## StableDiffusionXLImg2ImgPipeline

[[autodoc]] StableDiffusionXLImg2ImgPipeline
- all
- __call__

## StableDiffusionXLInpaintPipeline

[[autodoc]] StableDiffusionXLInpaintPipeline
- all
- __call__
@@ -40,7 +40,7 @@ Available Checkpoints are:

- enable_vae_tiling
- disable_vae_tiling
- load_textual_inversion
- from_ckpt
- from_single_file
- load_lora_weights
- save_lora_weights

@@ -37,9 +37,12 @@ Resources:

| Pipeline | Tasks | Demo
|---|---|:---:|
| [TextToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [🤗 Spaces](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis)
| [VideoToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py) | *Text-Guided Video-to-Video Generation* | [(TODO)🤗 Spaces]()

## Usage example

### `text-to-video-ms-1.7b`

Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

```python
@@ -119,12 +122,98 @@ Here are some sample outputs:

</tr>
</table>

### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`

The Zeroscope checkpoints are watermark-free models that have been trained on specific sizes such as `576x320` and `1024x576`.
One should first generate a video using the lower-resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) with [`TextToVideoSDPipeline`],
which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames
video_path = export_to_video(video_frames)
video_path
```

Now the video can be upscaled:

```py
from PIL import Image
from diffusers import DPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

video_frames = pipe(prompt, video=video, strength=0.6).frames
video_path = export_to_video(video_frames)
video_path
```

Here are some sample outputs:

<table>
<tr>
<td><center>
Darth Vader surfing in waves.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/darthvader_cerpense.gif"
alt="Darth Vader surfing in waves."
style="width: 576px;" />
</center></td>
</tr>
</table>

### Memory optimizations

Text-guided video generation with [`~TextToVideoSDPipeline`] and [`~VideoToVideoSDPipeline`] is very memory intensive, both
when denoising with [`~UNet3DConditionModel`] and when decoding with [`~AutoencoderKL`]. It is possible, though, to reduce
memory usage at the cost of increased runtime and still achieve the exact same result. To do so, it is recommended to enable
**forward chunking** and **VAE slicing**.

Forward chunking via [`~UNet3DConditionModel.enable_forward_chunking`] is explained in [this blog post](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers) and
allows you to significantly reduce the memory required by the UNet. You can chunk the feed-forward layer over the `num_frames`
dimension by doing:

```py
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
```

VAE slicing via [`~TextToVideoSDPipeline.enable_vae_slicing`] and [`~VideoToVideoSDPipeline.enable_vae_slicing`] also
gives significant memory savings, since the two pipelines otherwise decode all image frames at once.

```py
pipe.enable_vae_slicing()
```

## Available checkpoints

* [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/)
* [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy)
* [cerspense/zeroscope_v2_576w](https://huggingface.co/cerspense/zeroscope_v2_576w)
* [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL)

## TextToVideoSDPipeline
[[autodoc]] TextToVideoSDPipeline
- all
- __call__

## VideoToVideoSDPipeline
[[autodoc]] VideoToVideoSDPipeline
- all
- __call__

@@ -80,6 +80,41 @@ You can change these parameters in the pipeline call:

* Video length:
    * `video_length`, the number of frames to be generated. Default: `video_length=8`

We can also generate longer videos by doing the processing in a chunk-by-chunk manner:

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline
import numpy as np

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
seed = 0
video_length = 8
chunk_size = 4
prompt = "A panda is playing guitar on times square"

# Generate the video chunk-by-chunk
result = []
chunk_ids = np.arange(0, video_length, chunk_size - 1)
generator = torch.Generator(device="cuda")
for i in range(len(chunk_ids)):
    print(f"Processing chunk {i + 1} / {len(chunk_ids)}")
    ch_start = chunk_ids[i]
    ch_end = video_length if i == len(chunk_ids) - 1 else chunk_ids[i + 1]
    # Attach the first frame for Cross Frame Attention
    frame_ids = [0] + list(range(ch_start, ch_end))
    # Fix the seed for the temporal consistency
    generator.manual_seed(seed)
    output = pipe(prompt=prompt, video_length=len(frame_ids), generator=generator, frame_ids=frame_ids)
    result.append(output.images[1:])

# Concatenate chunks and save
result = np.concatenate(result)
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("video.mp4", result, fps=4)
```
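
For instance, with `video_length=8` and `chunk_size=4`, `chunk_ids` becomes `[0, 3, 6]`, so the loop generates frames 0–2, then 3–5, then 6–7. Each chunk is prefixed with frame 0 so that cross-frame attention keeps the chunks visually consistent, and that duplicated anchor frame is dropped again via `output.images[1:]`.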


### Text-To-Video with Pose Control
To generate a video from prompt with additional pose control
@@ -202,7 +237,7 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below

reader = imageio.get_reader(video_path, "ffmpeg")
frame_count = 8
video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
canny_edges = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
```

3. Run `StableDiffusionControlNetPipeline` with custom trained DreamBooth model
@@ -223,10 +258,10 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below

pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))

# fix latents for all frames
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(canny_edges), 1, 1, 1)

prompt = "oil painting of a beautiful girl avatar style"
result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
result = pipe(prompt=[prompt] * len(canny_edges), image=canny_edges, latents=latents).images
imageio.mimsave("video.mp4", result, fps=4)
```

11
docs/source/en/api/schedulers/cm_stochastic_iterative.mdx
Normal file
@@ -0,0 +1,11 @@

# Consistency Model Multistep Scheduler

## Overview

A multistep and onestep scheduler (Algorithm 1) introduced alongside consistency models in the paper [Consistency Models](https://arxiv.org/abs/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
It is based on the [original consistency models implementation](https://github.com/openai/consistency_models), and
it should generate good samples from [`ConsistencyModelPipeline`] in one or a small number of steps.
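
As a minimal usage sketch of one-step sampling (the `openai/diffusers-cd_imagenet64_l2` ImageNet-64 consistency distillation checkpoint is assumed here as an example of a compatible model):

```python
import torch
from diffusers import ConsistencyModelPipeline

# Assumption: this class-conditional ImageNet-64 checkpoint is available on the Hub.
pipe = ConsistencyModelPipeline.from_pretrained("openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16)
pipe.to("cuda")

# Onestep sampling; `class_labels` picks the ImageNet class to condition on.
image = pipe(num_inference_steps=1, class_labels=145).images[0]
image.save("cd_imagenet64_sample.png")
```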

## CMStochasticIterativeScheduler
[[autodoc]] CMStochasticIterativeScheduler

@@ -18,10 +18,71 @@ specific language governing permissions and limitations under the License.

The abstract of the paper is the following:

*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training,
yet they require simulating a Markov chain for many steps to produce a sample.
To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models
with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process.
We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from.
We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off
computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*

The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim).
For questions, feel free to contact the author on [tsong.me](https://tsong.me/).
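
To try the scheduler, you can swap it into a compatible pipeline via `from_config`; a minimal sketch (the base model here, `runwayml/stable-diffusion-v1-5`, is just an example):

```py
import torch
from diffusers import DiffusionPipeline, DDIMScheduler

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
# Reuse the existing scheduler config so the beta schedule stays consistent.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=50).images[0]
```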

### Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed"

The paper **[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891)**
claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion.

The abstract reads as follows:

*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR),
and some implementations of diffusion samplers do not start from the last timestep.
Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference.
We show that the flawed design causes real problems in existing implementations.
In Stable Diffusion, it severely limits the model to only generate images with medium brightness and
prevents it from generating very bright and dark samples. We propose a few simple fixes:
- (1) rescale the noise schedule to enforce zero terminal SNR;
- (2) train the model with v prediction;
- (3) change the sampler to always start from the last timestep;
- (4) rescale classifier-free guidance to prevent over-exposure.
These simple changes ensure the diffusion process is congruent between training and inference and
allow the model to generate samples more faithful to the original data distribution.*

You can apply all of these changes in `diffusers` when using [`DDIMScheduler`]:
- (1) rescale the noise schedule to enforce zero terminal SNR;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
```
- (2) train the model with v prediction;
Continue fine-tuning a checkpoint with [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)
and `--prediction_type="v_prediction"`.
- (3) change the sampler to always start from the last timestep;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
```
- (4) rescale classifier-free guidance to prevent over-exposure.
```py
pipe(..., guidance_rescale=0.7)
```

An example is to use [this checkpoint](https://huggingface.co/ptx0/pseudo-journey-v2),
which has been fine-tuned with `"v_prediction"`.

The checkpoint can then be run in inference as follows:

```py
import torch
from diffusers import DiffusionPipeline, DDIMScheduler

pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
)
pipe.to("cuda")

prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipe(prompt, guidance_rescale=0.7).images[0]
```

## DDIMScheduler
[[autodoc]] DDIMScheduler

@@ -40,7 +40,7 @@ The library has three main components:

><div class="w-full text-center bg-gradient-to-br from-pink-400 to-pink-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Conceptual guides</div>
<p class="text-gray-700">Understand why the library was designed the way it was, and learn more about the ethical guidelines and safety implementations for using the library.</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./api/models"
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./api/models/overview"
><div class="w-full text-center bg-gradient-to-br from-purple-400 to-purple-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Reference</div>
<p class="text-gray-700">Technical descriptions of how 🤗 Diffusers classes and methods work.</p>
</a>

@@ -69,6 +69,7 @@ The library has three main components:
| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation |
| [stable_diffusion_adapter](./api/pipelines/stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation | -
| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation |
| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation |
| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting |
@@ -94,3 +95,4 @@ The library has three main components:
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
|
||||
| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation |
|
||||
|
||||
@@ -23,7 +23,7 @@ Install 🤗 Diffusers for whichever deep learning library you're working with.

 You should install 🤗 Diffusers in a [virtual environment](https://docs.python.org/3/library/venv.html).
 If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
-A virtual environment makes it easier to manage different projects, and avoid compatibility issues between dependencies.
+A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies.

 Start by creating a virtual environment in your project directory:

@@ -127,7 +127,7 @@ Your Python environment will find the `main` version of 🤗 Diffusers on the ne

 Our library gathers telemetry information during `from_pretrained()` requests.
 This data includes the version of Diffusers and PyTorch/Flax, the requested model or pipeline class,
-and the path to a pretrained checkpoint if it is hosted on the Hub.
+and the path to a pre-trained checkpoint if it is hosted on the Hub.
 This usage data helps us debug issues and prioritize new features.
 Telemetry is only sent when loading models and pipelines from the HuggingFace Hub,
 and is not collected during local usage.

@@ -143,4 +143,4 @@ export DISABLE_TELEMETRY=YES

 On Windows:
 ```bash
 set DISABLE_TELEMETRY=YES
 ```

docs/source/en/optimization/bentoml.mdx (new file, 200 lines)
@@ -0,0 +1,200 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# BentoML Integration Guide

[[open-in-colab]]

[BentoML](https://github.com/bentoml/BentoML/) is an open-source framework designed for building,
shipping, and scaling AI applications. It allows users to easily package and serve diffusion models
for production, ensuring reliable and efficient deployments. It features out-of-the-box operational
management tools like monitoring and tracing, and facilitates deployment to various cloud platforms.
BentoML's distributed architecture and its separation of API server logic from
model inference logic enable efficient scaling of deployments, even with budget constraints.
As a result, integrating it with Diffusers provides a valuable tool for real-world deployments.

This tutorial demonstrates how to integrate BentoML with Diffusers.

## Prerequisites

- Install [Diffusers](https://huggingface.co/docs/diffusers/installation).
- Install BentoML by running `pip install bentoml`. For more information, see the [BentoML documentation](https://docs.bentoml.com).

## Import a diffusion model

First, you need to prepare the model. BentoML has its own [Model Store](https://docs.bentoml.com/en/latest/concepts/model.html)
for model management. Create a `download_model.py` file as below to import a diffusion model into BentoML's Model
Store:

```py
import bentoml

bentoml.diffusers.import_model(
    "sd2.1",  # Model tag in the BentoML Model Store
    "stabilityai/stable-diffusion-2-1",  # Hugging Face model identifier
)
```

This code snippet downloads the Stable Diffusion 2.1 model (using its repo id
`stabilityai/stable-diffusion-2-1`) from the Hugging Face Hub (or uses the cached files
if the model has already been downloaded) and imports it into the BentoML Model
Store with the name `sd2.1`.

For models that are already fine-tuned and stored on disk, you can provide the path instead of
the repo id:

```py
import bentoml

bentoml.diffusers.import_model(
    "sd2.1-local",
    "./local_stable_diffusion_2.1/",
)
```

You can view the model in the Model Store:

```
bentoml models list

Tag                      Module             Size       Creation Time
sd2.1:ysrlmubascajwnry   bentoml.diffusers  33.85 GiB  2023-07-12 16:47:44
```

## Turn a diffusion model into a RESTful service with BentoML

Once the diffusion model is in BentoML's Model Store, you can implement a text-to-image
service with it. The Stable Diffusion model accepts various arguments
in addition to the required prompt to guide the image generation process.
To validate these input arguments, use BentoML's [pydantic](https://github.com/pydantic/pydantic) integration.
Create a `sdargs.py` file with an example pydantic model:

```py
import typing as t

from pydantic import BaseModel


class SDArgs(BaseModel):
    prompt: str
    negative_prompt: t.Optional[str] = None
    height: t.Optional[int] = 512
    width: t.Optional[int] = 512

    class Config:
        extra = "allow"
```

This pydantic model requires a string field `prompt` and three optional fields: `height`, `width`, and `negative_prompt`,
each with corresponding types. The `extra = "allow"` line supports adding additional fields not defined in the `SDArgs` class.
In a real-world scenario, you would typically define all the desired fields and not allow extra ones (a sketch of this follows below).
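As a hedged illustration that is not part of the original guide, a stricter variant of `SDArgs` (here called `StrictSDArgs`, a hypothetical name) might declare every accepted field and reject unknown ones:

```py
import typing as t

from pydantic import BaseModel


class StrictSDArgs(BaseModel):
    # every accepted field is declared explicitly
    prompt: str
    negative_prompt: t.Optional[str] = None
    height: t.Optional[int] = 512
    width: t.Optional[int] = 512
    num_inference_steps: t.Optional[int] = 50
    guidance_scale: t.Optional[float] = 7.5

    class Config:
        extra = "forbid"  # requests with undeclared fields now fail validation
```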
Next, create a BentoML Service file that defines a Stable Diffusion service:

```py
import bentoml
from bentoml.io import Image, JSON

from sdargs import SDArgs

bento_model = bentoml.diffusers.get("sd2.1:latest")
sd21_runner = bento_model.to_runner(name="sd21-runner")

svc = bentoml.Service("stable-diffusion-21", runners=[sd21_runner])


@svc.api(input=JSON(pydantic_model=SDArgs), output=Image())
async def txt2img(input_data):
    kwargs = input_data.dict()
    res = await sd21_runner.async_run(**kwargs)
    images = res[0]
    return images[0]
```

Save the file as `service.py`, and spin up a BentoML Service endpoint using:

```
bentoml serve service:svc
```

An HTTP server with a `/txt2img` endpoint that accepts a JSON dictionary should now be up at
port 3000. Go to <http://127.0.0.1:3000> in your web browser to access the Swagger UI.

You can also test the text-to-image generation using `curl` and write the returned image to
`output.jpg`:

```
curl -X POST http://127.0.0.1:3000/txt2img \
     -H 'Content-Type: application/json' \
     -d "{\"prompt\":\"a black cat\", \"height\":768, \"width\":768}" \
     --output output.jpg
```
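Equivalently, and as a hedged sketch rather than part of the original guide, the endpoint can be called from Python with the `requests` library:

```py
import requests

# mirrors the curl example above: POST a JSON payload, save the returned image
response = requests.post(
    "http://127.0.0.1:3000/txt2img",
    json={"prompt": "a black cat", "height": 768, "width": 768},
    timeout=300,  # image generation can take a while
)
response.raise_for_status()

with open("output.jpg", "wb") as f:
    f.write(response.content)
```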
## Package a BentoML Service for cloud deployment

To deploy a BentoML Service, you need to pack it into a BentoML
[Bento](https://docs.bentoml.com/en/latest/concepts/bento.html), a file archive with all the source code,
models, data files, and dependencies. This can be done by providing a `bentofile.yaml` file as follows:

```yaml
service: "service.py:svc"
include:
  - "service.py"
python:
  packages:
    - torch
    - transformers
    - accelerate
    - diffusers
    - triton
    - xformers
    - pydantic
docker:
  distro: debian
  cuda_version: "11.6"
```

The `bentofile.yaml` file contains [Bento build
options](https://docs.bentoml.com/en/latest/concepts/bento.html#bento-build-options),
such as package dependencies and Docker options.

Then you build a Bento using:

```
bentoml build
```

The output looks like:

```
Successfully built Bento(tag="stable-diffusion-21:crkuh7a7rw5bcasc").

Possible next steps:

 * Containerize your Bento with `bentoml containerize`:
    $ bentoml containerize stable-diffusion-21:crkuh7a7rw5bcasc

 * Push to BentoCloud with `bentoml push`:
    $ bentoml push stable-diffusion-21:crkuh7a7rw5bcasc
```

You can create a Docker image based on the Bento by running the following command, and then deploy it to a cloud provider:

```
bentoml containerize stable-diffusion-21:crkuh7a7rw5bcasc
```

If you want an end-to-end solution for deploying and managing models, you can push the Bento to [Yatai](https://github.com/bentoml/Yatai) or
[BentoCloud](https://bentoml.com/cloud) for a distributed deployment.

For more information about BentoML's integration with Diffusers, see the [BentoML Diffusers
Guide](https://docs.bentoml.com/en/latest/frameworks/diffusers.html).
@@ -16,8 +16,8 @@ specific language governing permissions and limitations under the License.

## Requirements

-- Optimum Habana 1.5 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it.
-- SynapseAI 1.9.
+- Optimum Habana 1.6 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it.
+- SynapseAI 1.10.


## Inference Pipeline

@@ -41,7 +41,7 @@ pipeline = GaudiStableDiffusionPipeline.from_pretrained(
     scheduler=scheduler,
     use_habana=True,
     use_hpu_graphs=True,
-    gaudi_config="Habana/stable-diffusion",
+    gaudi_config="Habana/stable-diffusion-2",
 )
```
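Because the diff only shows the tail of the `from_pretrained()` call, here is a hedged, self-contained sketch of what the full setup looks like with Optimum Habana (the model id and prompt are illustrative assumptions):

```py
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

model_name = "stabilityai/stable-diffusion-2-base"  # assumed checkpoint

scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion-2",
)

# run inference on the Gaudi device
outputs = pipeline(
    prompt=["High quality photo of an astronaut riding a horse in space"],
    num_images_per_prompt=1,
    batch_size=1,
)
image = outputs.images[0]
```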
@@ -62,18 +62,18 @@ For more information, check out Optimum Habana's [documentation](https://hugging

## Benchmark

-Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) Gaudi configuration (mixed precision bf16/fp32):
+Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32):

- [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) (512x512 resolution):

 |                        | Latency (batch size = 1) | Throughput (batch size = 8) |
 | ---------------------- |:------------------------:|:---------------------------:|
-| first-generation Gaudi | 4.22s                    | 0.29 images/s               |
-| Gaudi2                 | 1.70s                    | 0.925 images/s              |
+| first-generation Gaudi | 3.80s                    | 0.308 images/s              |
+| Gaudi2                 | 1.33s                    | 1.081 images/s              |

- [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) (768x768 resolution):

 |                        | Latency (batch size = 1) | Throughput                      |
 | ---------------------- |:------------------------:|:-------------------------------:|
-| first-generation Gaudi | 23.3s                    | 0.045 images/s (batch size = 2) |
-| Gaudi2                 | 7.75s                    | 0.14 images/s (batch size = 5)  |
+| first-generation Gaudi | 10.2s                    | 0.108 images/s (batch size = 4) |
+| Gaudi2                 | 3.17s                    | 0.379 images/s (batch size = 8) |
@@ -32,8 +32,9 @@ The quicktour is a simplified version of the introductory 🧨 Diffusers [notebo

 Before you begin, make sure you have all the necessary libraries installed:

-```bash
-!pip install --upgrade diffusers accelerate transformers
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install --upgrade diffusers accelerate transformers
 ```

 - [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training.
@@ -52,6 +52,8 @@ pipeline = pipeline.to("cuda")

To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility):

```python
import torch

generator = torch.Generator("cuda").manual_seed(0)
```
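The diff window stops here; as a small hedged sketch (the prompt is a placeholder, not from the document), the seeded generator is then passed to the pipeline call so repeated runs reproduce the same image:

```python
# hypothetical usage of the seeded generator from the snippet above
image = pipeline("An astronaut riding a horse on Mars", generator=generator).images[0]
```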
@@ -12,8 +12,6 @@ specific language governing permissions and limitations under the License.

# DreamBooth

-[[open-in-colab]]
-
[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views.

@@ -703,3 +701,7 @@ accelerate launch train_dreambooth.py \
   --class_labels_conditioning timesteps \
   --push_to_hub
 ```

+## Stable Diffusion XL
+
+We support fine-tuning of the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md).
@@ -12,8 +12,6 @@ specific language governing permissions and limitations under the License.

# Low-Rank Adaptation of Large Language Models (LoRA)

-[[open-in-colab]]
-
<Tip warning={true}>

Currently, LoRA is only supported for the attention layers of the [`UNet2DConditionModel`]. We also

@@ -282,6 +280,10 @@ Note that the use of [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] is

 **Note** that it is possible to provide a local directory path to [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] as well as [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`]. To know about the supported inputs,
 refer to the respective docstrings.

+## Unloading LoRA parameters
+
+You can call [`~diffusers.loaders.LoraLoaderMixin.unload_lora_weights`] on a pipeline to unload the LoRA parameters.
+
## Supporting A1111 themed LoRA checkpoints from Diffusers

To provide seamless interoperability with A1111 to our users, we support loading A1111 formatted
@@ -14,8 +14,6 @@ specific language governing permissions and limitations under the License.

# Textual Inversion

-[[open-in-colab]]
-
[Textual Inversion](https://arxiv.org/abs/2208.01618) is a technique for capturing novel concepts from a small number of example images. While the technique was originally demonstrated with a [latent diffusion model](https://github.com/CompVis/latent-diffusion), it has since been applied to other model variants like [Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/conceptual/stable_diffusion). The learned concepts can be used to better control the images generated from text-to-image pipelines. It learns new "words" in the text encoder's embedding space, which are used within text prompts for personalized image generation.
@@ -26,8 +26,9 @@ This tutorial will teach you how to train a [`UNet2DModel`] from scratch on a su

 Before you begin, make sure you have 🤗 Datasets installed to load and preprocess image datasets, and 🤗 Accelerate, to simplify training on any number of GPUs. The following command will also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training).

-```bash
-!pip install diffusers[training]
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install diffusers[training]
 ```

 We encourage you to share your model with the community, and in order to do that, you'll need to login to your Hugging Face account (create one [here](https://hf.co/join) if you don't already have one!). You can login from a notebook and enter your token when prompted:

@@ -312,7 +313,7 @@ Now you can wrap all these components together in a training loop with 🤗 Acce

 ...         mixed_precision=config.mixed_precision,
 ...         gradient_accumulation_steps=config.gradient_accumulation_steps,
 ...         log_with="tensorboard",
-...         logging_dir=os.path.join(config.output_dir, "logs"),
+...         project_dir=os.path.join(config.output_dir, "logs"),
 ...     )
 ...     if accelerator.is_main_process:
 ...         if config.push_to_hub:
docs/source/en/using-diffusers/control_brightness.mdx (new file, 45 lines)
@@ -0,0 +1,45 @@
# Control image brightness

The Stable Diffusion pipeline is mediocre at generating images that are either very bright or dark, as explained in the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) paper. The solutions proposed in the paper are currently implemented in the [`DDIMScheduler`], which you can use to improve the lighting in your images.

<Tip>

💡 Take a look at the paper linked above for more details about the proposed solutions!

</Tip>

One of the solutions is to train a model with *v prediction* and *v loss*. Add the following flag to the [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts to enable `v_prediction`:

```bash
--prediction_type="v_prediction"
```

For example, let's use the [`ptx0/pseudo-journey-v2`](https://huggingface.co/ptx0/pseudo-journey-v2) checkpoint, which has been finetuned with `v_prediction`.

Next, configure the following parameters in the [`DDIMScheduler`]:

1. `rescale_betas_zero_snr=True`, rescales the noise schedule to zero terminal signal-to-noise ratio (SNR)
2. `timestep_spacing="trailing"`, starts sampling from the last timestep

```py
>>> from diffusers import DiffusionPipeline, DDIMScheduler

>>> pipeline = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2")
# switch the scheduler in the pipeline to use the DDIMScheduler
>>> pipeline.scheduler = DDIMScheduler.from_config(
...     pipeline.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
... )
>>> pipeline.to("cuda")
```

Finally, in your call to the pipeline, set `guidance_rescale` to prevent overexposure:

```py
prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipeline(prompt, guidance_rescale=0.7).images[0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/zero_snr.png"/>
</div>
@@ -59,6 +59,7 @@ For convenience, we provide a table to denote which methods are inference-only a

 | [Custom Diffusion](#custom-diffusion) | ❌ | ✅ | |
 | [Model Editing](#model-editing)       | ✅ | ❌ | |
 | [DiffEdit](#diffedit)                 | ✅ | ❌ | |
+| [T2I-Adapter](#t2i-adapter)           | ✅ | ❌ | |

## Instruct Pix2Pix

@@ -215,4 +216,13 @@ To know more details, check out the [official doc](../api/pipelines/stable_diffu

 [DiffEdit](../api/pipelines/stable_diffusion/diffedit) allows for semantic editing of input images along with
 input prompts while preserving the original input images as much as possible.

 To know more details, check out the [official doc](../api/pipelines/stable_diffusion/model_editing).

+## T2I-Adapter
+
+[Paper](https://arxiv.org/abs/2302.08453)
+
+[T2I-Adapter](../api/pipelines/stable_diffusion/adapter) is an auxiliary network which adds an extra condition.
+There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch,
+depth maps, and semantic segmentations.
+
+See [here](../api/pipelines/stable_diffusion/adapter) for more information on how to use it, and the sketch below for a quick illustration.
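As a hedged sketch (not from the original document; the checkpoint id and the exact `StableDiffusionAdapterPipeline` call are assumptions based on the adapter API referenced above), a canny-edge adapter could be wired in like this:

```py
import torch
from PIL import Image
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

# a conditioning image; replace the blank placeholder with a real edge map
condition = Image.new("L", (512, 512))

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="a modern house, photorealistic", image=condition).images[0]
```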
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Community pipelines

+[[open-in-colab]]
+
> **For more information about community pipelines, please have a look at [this issue](https://github.com/huggingface/diffusers/issues/841).**

**Community** examples consist of both inference and training examples that have been added by the community.

@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Load community pipelines

+[[open-in-colab]]
+
Community pipelines are any [`DiffusionPipeline`] class that are different from the original implementation as specified in their paper (for example, the [`StableDiffusionControlNetPipeline`] corresponds to the [Text-to-Image Generation with ControlNet Conditioning](https://arxiv.org/abs/2302.05543) paper). They provide additional functionality or extend the original implementation of a pipeline.

There are many cool community pipelines like [Speech to Image](https://github.com/huggingface/diffusers/tree/main/examples/community#speech-to-image) or [Composable Stable Diffusion](https://github.com/huggingface/diffusers/tree/main/examples/community#composable-stable-diffusion), and you can find all the official community pipelines [here](https://github.com/huggingface/diffusers/tree/main/examples/community).
@@ -18,8 +18,9 @@ The [`StableDiffusionImg2ImgPipeline`] lets you pass a text prompt and an initia

 Before you begin, make sure you have all the necessary libraries installed:

-```bash
-!pip install diffusers transformers ftfy accelerate
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install diffusers transformers ftfy accelerate
 ```

 Get started by creating a [`StableDiffusionImg2ImgPipeline`] with a pretrained Stable Diffusion model like [`nitrosocke/Ghibli-Diffusion`](https://huggingface.co/nitrosocke/Ghibli-Diffusion).
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Load pipelines, models, and schedulers

+[[open-in-colab]]
+
Having an easy way to use a diffusion system for inference is essential to 🧨 Diffusers. Diffusion systems often consist of multiple components like parameterized models, tokenizers, and schedulers that interact in complex ways. That is why we designed the [`DiffusionPipeline`] to wrap the complexity of the entire diffusion system into an easy-to-use API, while remaining flexible enough to be adapted for other use cases, such as loading each component individually as building blocks to assemble your own diffusion system.

Everything you need for inference or training is accessible with the `from_pretrained()` method.

@@ -172,7 +174,7 @@ A checkpoint variant is usually a checkpoint whose weights are:

</Tip>

-Otherwise, a variant is **identical** to the original checkpoint. They have exactly the same serialization format (like [Safetensors](./using-diffusers/using_safetensors)), model structure, and weights have identical tensor shapes.
+Otherwise, a variant is **identical** to the original checkpoint. They have exactly the same serialization format (like [Safetensors](./using_safetensors)), model structure, and weights have identical tensor shapes.

| **checkpoint type** | **weight name**                     | **argument for loading weights** |
|---------------------|-------------------------------------|----------------------------------|

@@ -188,6 +190,7 @@ There are two important arguments to know for loading variants:

```python
from diffusers import DiffusionPipeline
import torch

# load fp16 variant
stable_diffusion = DiffusionPipeline.from_pretrained(
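    # The diff window cuts this snippet off here; the two lines below are a
    # hedged completion based on the surrounding "fp16 variant" discussion.
    "runwayml/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16
)
```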
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Load different Stable Diffusion formats

+[[open-in-colab]]
+
Stable Diffusion models are available in different formats depending on the framework they're trained and saved with, and where you download them from. Converting these formats for use in 🤗 Diffusers allows you to use all the features supported by the library, such as [using different schedulers](schedulers) for inference, [building your custom pipeline](write_own_pipeline), and a variety of techniques and methods for [optimizing inference speed](./optimization/opt_overview).

<Tip>

@@ -24,7 +26,7 @@ This guide will show you how to convert other Stable Diffusion formats to be com

## PyTorch .ckpt

-The checkpoint - or `.ckpt` - format is commonly used to store and save models. The `.ckpt` file contains the entire model and is typically several GBs in size. While you can load and use a `.ckpt` file directly with the [`~StableDiffusionPipeline.from_ckpt`] method, it is generally better to convert the `.ckpt` file to 🤗 Diffusers so both formats are available.
+The checkpoint - or `.ckpt` - format is commonly used to store and save models. The `.ckpt` file contains the entire model and is typically several GBs in size. While you can load and use a `.ckpt` file directly with the [`~StableDiffusionPipeline.from_single_file`] method, it is generally better to convert the `.ckpt` file to 🤗 Diffusers so both formats are available.

There are two options for converting a `.ckpt` file; use a Space to convert the checkpoint or convert the `.ckpt` file with a script.

@@ -141,8 +143,9 @@ pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.conf

 Download a LoRA checkpoint from Civitai; this example uses the [Howls Moving Castle,Interior/Scenery LoRA (Ghibli Stlye)](https://civitai.com/models/14605?modelVersionId=19998) checkpoint, but feel free to try out any LoRA checkpoint!

-```bash
-!wget https://civitai.com/api/download/models/19998 -O howls_moving_castle.safetensors
+```py
+# uncomment to download the safetensor weights
+#!wget https://civitai.com/api/download/models/19998 -O howls_moving_castle.safetensors
 ```

 Load the LoRA checkpoint into the pipeline with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method:
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Create reproducible pipelines

+[[open-in-colab]]
+
Reproducibility is important for testing, replicating results, and can even be used to [improve image quality](reusing_seeds). However, the randomness in diffusion models is a desired property because it allows the pipeline to generate different images every time it is run. While you can't expect to get the exact same results across platforms, you can expect results to be reproducible across releases and platforms within a certain tolerance range. Even then, tolerance varies depending on the diffusion pipeline and checkpoint.

This is why it's important to understand how to control sources of randomness in diffusion models or use deterministic algorithms.
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Improve image quality with deterministic generation

+[[open-in-colab]]
+
A common way to improve the quality of generated images is with *deterministic batch generation*: generate a batch of images and select one image to improve with a more detailed prompt in a second round of inference. The key is to pass a list of [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html#generator)'s to the pipeline for batched image generation, and tie each `Generator` to a seed so you can reuse it for an image.

Let's use [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5), for example, and generate several versions of the following prompt:
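The diff window ends before the example itself; the following is a hedged sketch of the technique just described (the prompt and seeds are placeholders, not the document's):

```py
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "Labrador in the style of Vermeer"

# one seeded Generator per image, so any image can be regenerated from its seed
generators = [torch.Generator(device="cuda").manual_seed(seed) for seed in range(4)]
images = pipe(prompt=[prompt] * 4, generator=generators).images

# to improve image i in a second round, reuse manual_seed(i) with a more detailed prompt
```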
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Schedulers

+[[open-in-colab]]
+
Diffusion pipelines are inherently a collection of diffusion models and schedulers that are partly independent from each other. This means that one is able to switch out parts of the pipeline to better customize
a pipeline to one's use case. The best example of this is the [Schedulers](../api/schedulers/overview.mdx).
@@ -14,9 +14,10 @@ Note that JAX is not exclusive to TPUs, but it shines on that hardware because e

 First make sure diffusers is installed.

-```bash
-!pip install jax==0.3.25 jaxlib==0.3.25 flax transformers ftfy
-!pip install diffusers
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install jax==0.3.25 jaxlib==0.3.25 flax transformers ftfy
+#!pip install diffusers
 ```

```python
@@ -1,11 +1,14 @@
# Load safetensors

+[[open-in-colab]]
+
[safetensors](https://github.com/huggingface/safetensors) is a safe and fast file format for storing and loading tensors. Typically, PyTorch model weights are saved or *pickled* into a `.bin` file with Python's [`pickle`](https://docs.python.org/3/library/pickle.html) utility. However, `pickle` is not secure and pickled files may contain malicious code that can be executed. safetensors is a secure alternative to `pickle`, making it ideal for sharing model weights.

This guide will show you how to load `.safetensor` files, and how to convert Stable Diffusion model weights stored in other formats to `.safetensor`. Before you start, make sure you have safetensors installed:

-```bash
-!pip install safetensors
+```py
+# uncomment to install the necessary libraries in Colab
+#!pip install safetensors
 ```

If you look at the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main) repository, you'll see weights inside the `text_encoder`, `unet` and `vae` subfolders are stored in the `.safetensors` format. By default, 🤗 Diffusers automatically loads these `.safetensors` files from their subfolders if they're available in the model repository.

@@ -18,12 +21,12 @@ from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
```

-However, model weights are not necessarily stored in separate subfolders like in the example above. Sometimes, all the weights are stored in a single `.safetensors` file. In this case, if the weights are Stable Diffusion weights, you can load the file directly with the [`~diffusers.loaders.FromCkptMixin.from_ckpt`] method:
+However, model weights are not necessarily stored in separate subfolders like in the example above. Sometimes, all the weights are stored in a single `.safetensors` file. In this case, if the weights are Stable Diffusion weights, you can load the file directly with the [`~diffusers.loaders.FromSingleFileMixin.from_single_file`] method:

```py
from diffusers import StableDiffusionPipeline

-pipeline = StableDiffusionPipeline.from_ckpt(
+pipeline = StableDiffusionPipeline.from_single_file(
    "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix.safetensors"
)
```
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.

# Weighting prompts

+[[open-in-colab]]
+
Text-guided diffusion models generate images based on a given text prompt. The text prompt
can include multiple concepts that the model should generate and it's often desirable to weight
certain parts of the prompt more or less.
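The diff ends here; as a hedged sketch of one common approach (the [compel](https://github.com/damian0815/compel) library and its `++` up-weighting syntax are assumptions, not necessarily what this guide goes on to use):

```py
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

# "++" increases the weight of "red" relative to the rest of the prompt
prompt_embeds = compel.build_conditioning_tensor("a red++ cat playing with a ball")
image = pipe(prompt_embeds=prompt_embeds).images[0]
```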
@@ -42,63 +42,63 @@ To recreate the pipeline with the model and scheduler separately, let's write ou

1. Load the model and scheduler:

    ```py
    >>> from diffusers import DDPMScheduler, UNet2DModel

    >>> scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
    >>> model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
    ```

2. Set the number of timesteps to run the denoising process for:

    ```py
    >>> scheduler.set_timesteps(50)
    ```

3. Setting the scheduler timesteps creates a tensor with evenly spaced elements in it, 50 in this example. Each element corresponds to a timestep at which the model denoises an image. When you create the denoising loop later, you'll iterate over this tensor to denoise an image:

    ```py
    >>> scheduler.timesteps
    tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720,
        700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440,
        420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160,
        140, 120, 100,  80,  60,  40,  20,   0])
    ```

4. Create some random noise with the same shape as the desired output:

    ```py
    >>> import torch

    >>> sample_size = model.config.sample_size
    >>> noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda")
    ```

5. Now write a loop to iterate over the timesteps. At each timestep, the model does a [`UNet2DModel.forward`] pass and returns the noisy residual. The scheduler's [`~DDPMScheduler.step`] method takes the noisy residual, timestep, and input and it predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, and it'll repeat until it reaches the end of the `timesteps` array.

    ```py
    >>> input = noise

    >>> for t in scheduler.timesteps:
    ...     with torch.no_grad():
    ...         noisy_residual = model(input, t).sample
    ...     previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    ...     input = previous_noisy_sample
    ```

    This is the entire denoising process, and you can use this same pattern to write any diffusion system.

6. The last step is to convert the denoised output into an image:

    ```py
    >>> from PIL import Image
    >>> import numpy as np

    >>> image = (input / 2 + 0.5).clamp(0, 1)
    >>> image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
    >>> image = Image.fromarray((image * 255).round().astype("uint8"))
    >>> image
    ```

In the next section, you'll put your skills to the test and break down the more complex Stable Diffusion pipeline. The steps are more or less the same: you'll initialize the necessary components and set the number of timesteps to create a `timestep` array. The `timestep` array is used in the denoising loop, and for each element in this array, the model predicts a less noisy image. The denoising loop iterates over the timesteps, and at each timestep, it outputs a noisy residual that the scheduler uses to predict a less noisy image at the previous timestep. This process is repeated until you reach the end of the `timestep` array.

@@ -286,5 +286,5 @@ This is really what 🧨 Diffusers is designed for: to make it intuitive and eas

 For your next steps, feel free to:

-* Learn how to [build and contribute a pipeline](using-diffusers/#contribute_pipeline) to 🧨 Diffusers. We can't wait to see what you'll come up with!
-* Explore [existing pipelines](./api/pipelines/overview) in the library, and see if you can deconstruct and build a pipeline from scratch using the models and schedulers separately.
+* Learn how to [build and contribute a pipeline](contribute_pipeline) to 🧨 Diffusers. We can't wait to see what you'll come up with!
+* Explore [existing pipelines](../api/pipelines/overview) in the library, and see if you can deconstruct and build a pipeline from scratch using the models and schedulers separately.
@@ -8,14 +8,69 @@
  - local: installation
    title: "Installation"
  title: "Get started"

- sections:
  - local: tutorials/tutorial_overview
    title: Overview
  - local: using-diffusers/write_own_pipeline
    title: Understanding models and schedulers
  - local: tutorials/basic_training
    title: Training a diffusion model
  title: Tutorials
- sections:
  - sections:
    - local: in_translation
      title: Overview
    - local: in_translation
    - local: using-diffusers/loading
      title: Loading pipelines, models, and schedulers
    - local: using-diffusers/schedulers
      title: Loading and comparing different schedulers
    - local: using-diffusers/custom_pipeline_overview
      title: Loading community pipelines
    - local: using-diffusers/using_safetensors
      title: Loading safetensors
    - local: using-diffusers/other-formats
      title: Loading different Stable Diffusion formats
    title: Loading & Hub
  - sections:
    - local: using-diffusers/pipeline_overview
      title: Overview
    - local: using-diffusers/unconditional_image_generation
      title: Unconditional image generation
    - local: in_translation
      title: Text-to-image generation
    - local: using-diffusers/img2img
      title: Text-guided image-to-image
    - local: using-diffusers/inpaint
      title: Text-guided image inpainting
    - local: using-diffusers/depth2img
      title: Text-guided depth-to-image
    - local: in_translation
      title: Textual inversion
    - local: in_translation
      title: Distributed inference with multiple GPUs
    - local: using-diffusers/reusing_seeds
      title: Improving image quality with deterministic generation
    - local: in_translation
      title: Creating reproducible pipelines
    - local: using-diffusers/custom_pipeline_examples
      title: Community pipelines
    - local: in_translation
      title: How to contribute a community pipeline
    - local: in_translation
      title: Stable Diffusion in JAX/Flax
    - local: in_translation
      title: Weighting Prompts
    title: Pipelines for inference
  - sections:
    - local: training/overview
      title: Overview
    - local: in_translation
      title: Creating a dataset for training
    - local: training/adapt_a_model
      title: Adapting a model to a new task
    - local: training/unconditional_training
      title: Unconditional image generation
    - local: training/text_inversion
      title: Textual Inversion
    - local: training/dreambooth
      title: DreamBooth
@@ -27,13 +82,16 @@
      title: ControlNet
    - local: in_translation
      title: InstructPix2Pix training
    title: Training
    - local: in_translation
      title: Custom Diffusion
    title: Training
  title: Using Diffusers
- sections:
  - local: in_translation
  - local: optimization/opt_overview
    title: Overview
  - local: optimization/fp16
    title: Memory and speed
  - local: in_translation
  - local: optimization/torch2.0
    title: Torch 2.0 support
  - local: optimization/xformers
    title: xFormers
@@ -41,8 +99,12 @@
    title: ONNX
  - local: optimization/open_vino
    title: OpenVINO
  - local: in_translation
    title: Core ML
  - local: optimization/mps
    title: MPS
  - local: optimization/habana
    title: Habana Gaudi
  title: Optimization/Special hardware
  - local: in_translation
    title: Token Merging
  title: Optimization/Special hardware
@@ -59,7 +59,7 @@ torch.backends.cuda.matmul.allow_tf32 = True

## Half-precision weights

To save more GPU memory and gain more speed, you can load and run the model weights directly in half precision.
This involves loading the float16 version of the weights, which are saved in a branch named `fp16`, and telling PyTorch to use the `float16` type when loading them:

```Python
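# The diff window cuts the example off here; below is a hedged reconstruction
# of the fp16 loading snippet the text describes (optionally add variant="fp16"
# to fetch the weights stored on the fp16 branch).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
```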
docs/source/ko/optimization/opt_overview.mdx (new file, 17 lines)
@@ -0,0 +1,17 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Overview

Generating high-quality output from a generative model is computationally intensive: each of the many iterative steps that takes a noisy output to a less noisy one requires a lot of computation. One of 🧨 Diffusers' goals is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware.

This section covers tips and tricks, such as half-precision weights and sliced attention, for optimizing inference speed and reducing memory consumption. You'll also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and how to enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware such as Apple Silicon, and Intel or Habana processors.
docs/source/ko/optimization/torch2.0.mdx (new file, 445 lines)
@@ -0,0 +1,445 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Accelerated PyTorch 2.0 support in Diffusers

Starting from version `0.13.0`, Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). These include:
1. Support for accelerated transformers with memory-efficient attention, with no extra dependencies (such as `xformers`) required
2. Support for compiling individual models with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for an extra performance boost


## Installation

To use the accelerated attention implementation and `torch.compile()`, make sure you have the latest version of PyTorch 2.0 installed from pip, and that you are on diffusers 0.13.0 or later. As explained below, diffusers uses the optimized attention processor ([`AttnProcessor2_0`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L798)) when PyTorch 2.0 is available.

```bash
pip install --upgrade torch diffusers
```

## Using accelerated transformers and `torch.compile`


1. **Accelerated transformers implementation**

   PyTorch 2.0 includes an optimized, memory-efficient attention implementation through the [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) function, which automatically enables several optimizations depending on the inputs and the GPU type. It is similar to `memory_efficient_attention` from [xFormers](https://github.com/facebookresearch/xformers), but built natively into PyTorch.

   These optimizations are enabled by default in Diffusers when PyTorch 2.0 is installed and `torch.nn.functional.scaled_dot_product_attention` is available. To use them, just install `torch 2.0` and use the pipeline. For example:

    ```Python
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    ```

    To enable it explicitly (which is not required), you can do the following:

    ```diff
    import torch
    from diffusers import DiffusionPipeline
    + from diffusers.models.attention_processor import AttnProcessor2_0

    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
    + pipe.unet.set_attn_processor(AttnProcessor2_0())

    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    ```

    This should be as fast and memory-efficient as `xFormers`. See the [benchmark](#benchmark) for details.

    If you need to make the pipeline more deterministic, or convert a fine-tuned model to other formats such as [Core ML](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/coreml#how-to-run-stable-diffusion-with-core-ml), you can revert to the vanilla attention processor ([`AttnProcessor`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L402)). To use the plain attention processor, call the [`~diffusers.UNet2DConditionModel.set_default_attn_processor`] function:

    ```Python
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.models.attention_processor import AttnProcessor

    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
    pipe.unet.set_default_attn_processor()

    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
    ```

2. **torch.compile**

   For an additional speed-up, you can use the new `torch.compile` feature. Since the UNet of a pipeline is usually the most computationally expensive part, we wrap the `unet` with `torch.compile` and leave the rest of the sub-models (the text encoder and VAE) as they are. For more details and other options, refer to the [torch compile docs](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).

    ```python
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
    ```

    Depending on the GPU type, `compile()` can provide an _additional performance boost_ of **5% - 300%** on top of the accelerated transformer optimizations. Note, however, that compilation tends to yield bigger gains on more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100).

    Compilation takes some time to complete, so it is best suited for situations where you prepare the pipeline once and then perform the same type of inference job many times. Calling the compiled pipeline at a different image size triggers recompilation, which can be costly.


## Benchmark

We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile`, across different GPUs and batch sizes, for the five most used pipelines. The benchmark used `diffusers 0.17.0.dev0`, which [makes sure `torch.compile()` is leveraged optimally](https://github.com/huggingface/diffusers/pull/3313).

### Benchmarking code

#### Stable Diffusion text-to-image

```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
path = "runwayml/stable-diffusion-v1-5"
|
||||
|
||||
run_compile = True # Set True / False
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16)
|
||||
pipe = pipe.to("cuda")
|
||||
pipe.unet.to(memory_format=torch.channels_last)
|
||||
|
||||
if run_compile:
|
||||
print("Run torch compile")
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
|
||||
prompt = "ghibli style, a fantasy landscape with castles"
|
||||
|
||||
for _ in range(3):
|
||||
images = pipe(prompt=prompt).images
|
||||
```
|
||||
|
||||
#### Stable Diffusion image-to-image
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionImg2ImgPipeline
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
|
||||
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
|
||||
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
init_image = init_image.resize((512, 512))
|
||||
|
||||
path = "runwayml/stable-diffusion-v1-5"
|
||||
|
||||
run_compile = True # Set True / False
|
||||
|
||||
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16)
|
||||
pipe = pipe.to("cuda")
|
||||
pipe.unet.to(memory_format=torch.channels_last)
|
||||
|
||||
if run_compile:
|
||||
print("Run torch compile")
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
|
||||
prompt = "ghibli style, a fantasy landscape with castles"
|
||||
|
||||
for _ in range(3):
|
||||
image = pipe(prompt=prompt, image=init_image).images[0]
|
||||
```
|
||||
|
||||
#### Stable Diffusion - inpainting
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionInpaintPipeline
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
|
||||
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
|
||||
|
||||
def download_image(url):
|
||||
response = requests.get(url)
|
||||
return Image.open(BytesIO(response.content)).convert("RGB")
|
||||
|
||||
|
||||
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
|
||||
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
|
||||
|
||||
init_image = download_image(img_url).resize((512, 512))
|
||||
mask_image = download_image(mask_url).resize((512, 512))
|
||||
|
||||
path = "runwayml/stable-diffusion-inpainting"
|
||||
|
||||
run_compile = True # Set True / False
|
||||
|
||||
pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16)
|
||||
pipe = pipe.to("cuda")
|
||||
pipe.unet.to(memory_format=torch.channels_last)
|
||||
|
||||
if run_compile:
|
||||
print("Run torch compile")
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
|
||||
prompt = "ghibli style, a fantasy landscape with castles"
|
||||
|
||||
for _ in range(3):
|
||||
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
|
||||
```
|
||||
|
||||
#### ControlNet
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
from io import BytesIO
|
||||
|
||||
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
|
||||
|
||||
response = requests.get(url)
|
||||
init_image = Image.open(BytesIO(response.content)).convert("RGB")
|
||||
init_image = init_image.resize((512, 512))
|
||||
|
||||
path = "runwayml/stable-diffusion-v1-5"
|
||||
|
||||
run_compile = True # Set True / False
|
||||
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained(
|
||||
path, controlnet=controlnet, torch_dtype=torch.float16
|
||||
)
|
||||
|
||||
pipe = pipe.to("cuda")
|
||||
pipe.unet.to(memory_format=torch.channels_last)
|
||||
pipe.controlnet.to(memory_format=torch.channels_last)
|
||||
|
||||
if run_compile:
|
||||
print("Run torch compile")
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)
|
||||
|
||||
prompt = "ghibli style, a fantasy landscape with castles"
|
||||
|
||||
for _ in range(3):
|
||||
image = pipe(prompt=prompt, image=init_image).images[0]
|
||||
```

#### IF text-to-image + upscaling

```python
from diffusers import DiffusionPipeline
import torch

run_compile = True  # Set True / False

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe.to("cuda")
pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe_2.to("cuda")
pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
pipe_3.to("cuda")

pipe.unet.to(memory_format=torch.channels_last)
pipe_2.unet.to(memory_format=torch.channels_last)
pipe_3.unet.to(memory_format=torch.channels_last)

if run_compile:
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
    pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)

prompt = "the blue hulk"

prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)

for _ in range(3):
    image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
    image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
    image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images
```

To give you an overview of the possible speed-ups obtainable with PyTorch 2.0 and `torch.compile()`, here is a chart that shows relative speed-ups for the [Stable Diffusion text-to-image pipeline](StableDiffusionPipeline) across five different GPU families (with a batch size of 4):

[Chart: relative speed-ups for the Stable Diffusion text-to-image pipeline across five GPU families, batch size 4]

To give you an even better idea of how this speed-up holds for the other pipelines presented above, consider the following plot that shows the benchmarking numbers from an A100 across three different batch sizes (with PyTorch 2.0 nightly and `torch.compile()`):

[Chart: A100 benchmarking numbers across three batch sizes]

_(The benchmarking metric for the charts above is **iterations per second**.)_

But in the interest of transparency, we disclose all the benchmarking numbers!

The following tables report the results in terms of the number of **_iterations processed per second_**.

### A100 (batch size: 1)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 |
| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 |
| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
| IF | 20.21 / <br>13.84 / <br>24.00 | 20.12 / <br>13.70 / <br>24.03 | ❌ | 97.34 / <br>27.23 / <br>111.66 |

### A100 (batch size: 4)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 |
| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 |
| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
| IF | 25.02 | 18.04 | ❌ | 48.47 |

### A100 (batch size: 16)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 |
| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 |
| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
| IF | 8.78 | 9.82 | ❌ | 16.77 |

### V100 (batch size: 1)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 |
| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
| IF | 20.01 / <br>9.08 / <br>23.34 | 19.79 / <br>8.98 / <br>24.10 | ❌ | 55.75 / <br>11.57 / <br>57.67 |

### V100 (batch size: 4)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 |
| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 |
| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 |
| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 |
| IF | 15.41 | 14.76 | ❌ | 22.95 |

### V100 (batch size: 16)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 |
| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
| IF | 5.43 | 5.29 | ❌ | 7.06 |

### T4 (batch size: 1)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 |
| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 |
| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
| IF | 17.42 / <br>2.47 / <br>18.52 | 16.96 / <br>2.45 / <br>18.69 | ❌ | 24.63 / <br>2.47 / <br>23.39 |

### T4 (batch size: 4)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 |
| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 |
| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
| IF | 5.79 | 5.61 | ❌ | 7.39 |

### T4 (batch size: 16)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s |
| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s |
| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
| IF * | 1.44 | 1.44 | ❌ | 1.94 |

### RTX 3090 (batch size: 1)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 |
| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 |
| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 |
| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 |
| IF | 27.08 / <br>9.07 / <br>31.23 | 26.75 / <br>8.92 / <br>31.47 | ❌ | 68.08 / <br>11.16 / <br>65.29 |

### RTX 3090 (batch size: 4)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 |
| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 |
| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 |
| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 |
| IF | 16.81 | 16.62 | ❌ | 21.57 |

### RTX 3090 (batch size: 16)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 |
| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 |
| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 |
| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 |
| IF | 5.01 | 5.00 | ❌ | 6.33 |

### RTX 4090 (batch size: 1)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 |
| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 |
| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
| IF | 69.71 / <br>18.78 / <br>85.49 | 69.13 / <br>18.80 / <br>85.56 | ❌ | 124.60 / <br>26.37 / <br>138.79 |

### RTX 4090 (batch size: 4)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
| IF | 31.88 | 31.14 | ❌ | 43.92 |

### RTX 4090 (batch size: 16)

| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 |
| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 |
| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
| IF | 9.26 | 9.2 | ❌ | 13.31 |

## Notes

* Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
* For the IF pipeline and batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1.

*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.*

54
docs/source/ko/training/adapt_a_model.mdx
Normal file
@@ -0,0 +1,54 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Adapt a model to a new task

Many diffusion systems share the same components, which lets you adapt a model pretrained for one task to an entirely different task.

This guide will show you how to adapt a pretrained text-to-image model for inpainting by initializing and modifying the architecture of a pretrained [`UNet2DConditionModel`].

## Configure UNet2DConditionModel parameters

A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](https://huggingface.co/docs/diffusers/v0.16.0/en/api/models#diffusers.UNet2DConditionModel.in_channels). For example, load a pretrained text-to-image model like [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) and check the number of `in_channels`:

```py
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.unet.config["in_channels"]
4
```

Inpainting requires 9 channels in the input sample. You can check this value in a pretrained inpainting model like [`runwayml/stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting):

```py
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipeline.unet.config["in_channels"]
9
```

To adapt the text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9.

Initialize a [`UNet2DConditionModel`] with the pretrained text-to-image model weights, and change `in_channels` to 9. Changing the number of `in_channels` changes the weight shapes, so you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size-mismatch error.

```py
from diffusers import UNet2DConditionModel

model_id = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(
    model_id, subfolder="unet", in_channels=9, low_cpu_mem_usage=False, ignore_mismatched_sizes=True
)
```

The pretrained weights of the other components of the text-to-image model are initialized from their checkpoints, but the input channel weights (`conv_in.weight`) of the `unet` are randomly initialized. It is important to finetune the model for inpainting, because otherwise the model returns noise.

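As a quick sanity check, here is an illustrative sketch (not part of the original guide; the `320` assumes the first UNet block of SD v1-5 has 320 output channels, and the zero-init trick is a common convention rather than something the guide prescribes) that confirms only the input convolution changed shape and optionally zeroes the newly added channels so the adapted model initially behaves like the text-to-image checkpoint:

```py
import torch

print(unet.conv_in.weight.shape)  # expected: torch.Size([320, 9, 3, 3])

# Optionally zero-initialize the 5 extra input channels (mask + masked-image latents)
# so the untrained channels don't inject noise before finetuning.
with torch.no_grad():
    unet.conv_in.weight[:, 4:] = 0
```
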
@@ -273,7 +273,7 @@ from diffusers import DiffusionPipeline, UNet2DConditionModel
from transformers import CLIPTextModel
import torch

# Load the pipeline with the same arguments (model, revision) that were used for training.
model_id = "CompVis/stable-diffusion-v1-4"

unet = UNet2DConditionModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/unet")

@@ -294,7 +294,7 @@ If you have **`"accelerate<0.16.0"`** installed, you need to convert it to an in
from accelerate import Accelerator
from diffusers import DiffusionPipeline

# Load the pipeline with the same arguments (model, revision) that were used for training.
model_id = "CompVis/stable-diffusion-v1-4"
pipeline = DiffusionPipeline.from_pretrained(model_id)

@@ -102,7 +102,7 @@ accelerate launch train_dreambooth_lora.py \
>>> pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16)
```

Load the LoRA weights from your fine-tuned DreamBooth model *on top of the base model weights*, and then move the pipeline to a GPU for faster inference. When merging the LoRA weights with the frozen pretrained model weights, you can optionally adjust how much of the weights to merge with the `scale` parameter:

<Tip>

73
docs/source/ko/training/overview.mdx
Normal file
@@ -0,0 +1,73 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# 🧨 Diffusers training examples

In this chapter, we'll look at how to use the `diffusers` library effectively through example code for a variety of use cases.

**Note**: If you're looking for official example code, take a look [here](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines)!

The examples covered here aim to be:

- **Self-contained**: every dependency of the example code can be installed with a `pip install` command. The dependencies are also listed in a `requirements.txt` file, so you can conveniently install them all with `pip install -r requirements.txt`. Example: [train_unconditional.py](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py), [requirements.txt](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/requirements.txt)
- **Easy-to-tweak**: We try to cover as many use cases as possible, but examples are just examples. Don't expect to solve the problem in front of you simply by copy-pasting them; to some extent you'll have to adapt the code to your situation and needs. To make this easier, most training examples ship the data preprocessing and the training loop together, so you can tweak them for your use case.
- **Beginner-friendly**: This chapter is written to help you build a general understanding of diffusion models and the `diffusers` library. Among the latest state-of-the-art methods for diffusion models, we deliberately leave out anything we judge too difficult for beginners.
- **One-purpose-only**: Each example covers exactly one task. Even when tasks such as image super-resolution and image modification share a similar modeling process, keeping one task per example makes them easier to understand.

We provide official examples covering the most representative tasks for diffusion models. The *official* examples are actively maintained by the `diffusers` maintainers, and we try to rigorously follow the philosophy defined above. If you think an example of this kind is indispensable, feel free to open a [Feature Request](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=) or submit a [Pull Request](https://github.com/huggingface/diffusers/compare) directly. We always welcome them!

The training examples show how to pretrain or fine-tune diffusion models for a variety of tasks. The following are currently supported:

- [Unconditional Training](./unconditional_training)
- [Text-to-Image Training](./text2image)
- [Text Inversion](./text_inversion)
- [Dreambooth](./dreambooth)

If possible, please install [xFormers](../optimization/xformers) for memory-efficient attention. Doing so can speed up training and reduce memory pressure.

| Task | 🤗 Accelerate | 🤗 Datasets | Colab |
|---|---|:---:|:---:|
| [**Unconditional Image Generation**](./unconditional_training) | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) |
| [**Text-to-Image fine-tuning**](./text2image) | ✅ | ✅ | - |
| [**Textual Inversion**](./text_inversion) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) |
| [**Dreambooth**](./dreambooth) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) |
| [**Training with LoRA**](./lora) | ✅ | - | - |
| [**ControlNet**](./controlnet) | ✅ | ✅ | - |
| [**InstructPix2Pix**](./instructpix2pix) | ✅ | ✅ | - |
| [**Custom Diffusion**](./custom_diffusion) | ✅ | ✅ | - |

## Community

In addition to the official examples, we also provide **community examples**, which are maintained by our community. A community example can be a training example or an inference pipeline. For community examples we apply the philosophy defined above more loosely, and we cannot guarantee maintenance for every issue.

Examples that are useful but not yet popular enough, or that don't fit our philosophy, live in the [community examples](https://github.com/huggingface/diffusers/tree/main/examples/community) folder.

**Note**: Community examples can be a [great first contribution](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) if you'd like to contribute to `diffusers`.

## Important notes

To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install the example-specific dependencies. To do this, create a new virtual environment and run:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

Then `cd` into the folder of the example of your choice and run:

```bash
pip install -r requirements.txt
```
275
docs/source/ko/training/text_inversion.mdx
Normal file
@@ -0,0 +1,275 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Textual Inversion

[[open-in-colab]]

[Textual inversion](https://arxiv.org/abs/2208.01618) is a technique for capturing new concepts from a small number of example images. It was originally demonstrated with [Latent Diffusion](https://github.com/CompVis/latent-diffusion), but it has since been applied to other similar models such as [Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/conceptual/stable_diffusion). The learned concepts can be used to better control the images generated by a text-to-image pipeline. The technique learns new "words" in the text encoder's embedding space, which can then be used in text prompts for personalized image generation.

[Image: examples of new concepts taught with a handful of images]
<small>By using just 3-5 images you can teach new concepts to a model such as Stable Diffusion for personalized image generation <a href="https://github.com/rinongal/textual_inversion">(image source)</a>.</small>

This guide will show you how to train a [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model with textual inversion. All of the textual inversion training scripts used in this guide can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion); take a look if you're interested in how things work under the hood.

<Tip>

There is a community-created collection of trained textual inversion models in the [Stable Diffusion Textual Inversion Concepts Library](https://huggingface.co/sd-concepts-library). More concepts are added over time, so it should grow into a useful resource!

</Tip>

Before you begin, install the training dependencies:

```bash
pip install diffusers accelerate transformers
```

After all the dependencies are set up, initialize a [🤗 Accelerate](https://github.com/huggingface/accelerate/) environment:

```bash
accelerate config
```

To set up a default 🤗 Accelerate environment without choosing any configurations:

```bash
accelerate config default
```

Or if your environment doesn't support an interactive shell like a notebook, you can use:

```py
from accelerate.utils import write_basic_config

write_basic_config()
```

Finally, install [xFormers](https://huggingface.co/docs/diffusers/main/en/training/optimization/xformers) to reduce memory usage with memory-efficient attention. Once xFormers is installed, add the `--enable_xformers_memory_efficient_attention` argument to the training script. xFormers is not supported for Flax.

## Upload the model to the Hub

To store your model on the Hub, add the following argument to the training script:

```bash
--push_to_hub
```

## Saving and loading checkpoints

It is a good idea to regularly save checkpoints of your model during training. This way you can resume training from a saved checkpoint if training is interrupted for any reason. Pass the following argument to the training script to save the full training state in a subfolder of `output_dir` as a checkpoint every 500 steps:

```bash
--checkpointing_steps=500
```

To resume training from a saved checkpoint, pass the following argument with the specific checkpoint you'd like to resume from:

```bash
--resume_from_checkpoint="checkpoint-1500"
```

## Fine-tuning

Download the [cat toy dataset](https://huggingface.co/datasets/diffusers/cat_toy_example) for training and store it in a directory. To use your own dataset, take a look at the [Create a dataset for training](https://huggingface.co/docs/diffusers/training/create_dataset) guide.

```py
from huggingface_hub import snapshot_download

local_dir = "./cat"
snapshot_download(
    "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
)
```

Set the `MODEL_NAME` environment variable to the repository id of the model (or the path to a directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument, and set the `DATA_DIR` environment variable to the path of the directory containing the images.

Now you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py). The script creates the following files and saves them to your repository:

- `learned_embeds.bin`
- `token_identifier.txt`
- `type_of_concept.txt`

<Tip>

💡 A full training run takes up to one hour on a single V100 GPU. While you wait for training to complete, feel free to check out [how textual inversion works](https://huggingface.co/docs/diffusers/training/text_inversion#how-it-works) in the section below!

</Tip>

<frameworkcontent>
<pt>
```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="./cat"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
```

<Tip>

💡 To improve the quality of the results, you can also consider representing the placeholder token (`<cat-toy>`) with multiple embedding vectors instead of a single one. This trick can help the model better capture the style of more complex images. To enable training multiple embedding vectors, pass:

```bash
--num_vectors=5
```

</Tip>
</pt>
<jax>

If you have access to TPUs, try the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py) to train even faster (it also works on GPUs). With the same configuration settings, the Flax training script should be at least 70% faster than the PyTorch training script! ⚡️

Before you begin, install the Flax dependencies:

```bash
pip install -U -r requirements_flax.txt
```

Set the `MODEL_NAME` environment variable to the repository id of the model (or the path to a directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument.

Then you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py):

```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export DATA_DIR="./cat"

python textual_inversion_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
```
</jax>
</frameworkcontent>

### Intermediate logging

If you're interested in following along with your model's training progress, you can save the images generated during training. Add the following arguments to the training script to enable intermediate logging:

- `validation_prompt`: the prompt used to generate samples (defaults to `None`, which disables intermediate logging)
- `num_validation_images`: the number of sample images to generate
- `validation_steps`: the number of steps between generating sample images from `validation_prompt`

```bash
--validation_prompt="A <cat-toy> backpack"
--num_validation_images=4
--validation_steps=100
```

## Inference

Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`].

By default, the textual inversion script only saves the embedding vectors learned through textual inversion. These embedding vectors are added to the text encoder's embedding matrix.

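As a rough sketch of what that saved file contains (this assumes the layout the PyTorch script uses at the time of writing, where `learned_embeds.bin` is a dictionary mapping the placeholder token to its learned vector; the `768` is an assumption based on the SD v1-5 text encoder's hidden size), you could inspect the embedding like this:

```py
import torch

learned_embeds = torch.load("textual_inversion_cat/learned_embeds.bin")
for token, embedding in learned_embeds.items():
    print(token, embedding.shape)  # e.g. "<cat-toy>" torch.Size([768])
```
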
<frameworkcontent>
<pt>
<Tip>

💡 The community has created a large library of textual inversion embedding vectors called [sd-concepts-library](https://huggingface.co/sd-concepts-library). Instead of training textual inversion embeddings from scratch, it's worth checking whether the embedding you're looking for has already been added to the library!

</Tip>

To load the textual inversion embedding vectors, you first need to load the model that was used when training them. Here we assume the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/docs/diffusers/training/runwayml/stable-diffusion-v1-5) model was used:

```python
from diffusers import StableDiffusionPipeline
import torch

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
```

Next, load the textual inversion embedding vectors with the `TextualInversionLoaderMixin.load_textual_inversion` function. Here we'll load the embeddings from the earlier `<cat-toy>` example:

```python
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
```

Now you can run the pipeline to check that the placeholder token (`<cat-toy>`) works:

```python
prompt = "A <cat-toy> backpack"

image = pipe(prompt, num_inference_steps=50).images[0]
image.save("cat-backpack.png")
```

`TextualInversionLoaderMixin.load_textual_inversion` can load not only textual embedding vectors saved in Diffusers' format, but also embedding vectors saved in the [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) format. To do this, first download an embedding vector from [civitAI](https://civitai.com/models/3036?modelVersionId=8387) and then load it locally:

```python
pipe.load_textual_inversion("./charturnerv2.pt")
```
</pt>
<jax>

There is currently no `load_textual_inversion` function for Flax, so after training you have to make sure the textual inversion embedding vectors were saved as part of the model. The model can then be run like any other Flax model:

```python
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

model_path = "path-to-your-trained-model"
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)

prompt = "A <cat-toy> backpack"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
images[0].save("cat-backpack.png")
```
</jax>
</frameworkcontent>

## How it works

[Figure: architecture overview]
<small>Architecture overview from the Textual Inversion <a href="https://textual-inversion.github.io/">blog post.</a></small>

Usually, text prompts are tokenized into embeddings before being passed to the model. Textual inversion does something similar, but it learns a new token embedding, `v*`, from the special token `S*` in the diagram above. The model output is used to condition the diffusion model, which helps the diffusion model understand new concepts from just a few example images.

To do this, textual inversion uses a generator model and noised versions of the training images. The generator tries to predict less noisy versions of the images, and the token embedding `v*` is optimized based on how well the generator does. If the token embedding successfully captures the new concept, it gives more useful information to the diffusion model and helps create clearer images with less noise. This optimization process typically requires thousands of exposures to a variety of prompt and image variants.

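To make the optimization concrete, here is a heavily simplified sketch of one training epoch. This is an illustration under stated assumptions rather than the actual `textual_inversion.py` script: `dataloader`, `placeholder_token_id`, and the pre-encoded VAE `latents` are hypothetical, and everything except the token embedding matrix is assumed frozen.

```py
import torch
import torch.nn.functional as F

# Assumes `text_encoder`, `unet`, and `noise_scheduler` are already loaded,
# and only the text encoder's input embedding matrix requires gradients.
token_embeds = text_encoder.get_input_embeddings().weight
optimizer = torch.optim.AdamW([token_embeds], lr=5e-4)

for input_ids, latents in dataloader:  # hypothetical batches of (token ids, VAE latents)
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    encoder_hidden_states = text_encoder(input_ids)[0]
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)  # the usual noise-prediction objective
    loss.backward()

    # Only the new token's embedding (v*) should be updated,
    # so zero the gradients of every other embedding row.
    keep = torch.arange(token_embeds.shape[0]) != placeholder_token_id
    token_embeds.grad[keep] = 0

    optimizer.step()
    optimizer.zero_grad()
```
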
144
docs/source/ko/training/unconditional_training.mdx
Normal file
@@ -0,0 +1,144 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Unconditional image generation

Unlike text-to-image or image-to-image models, unconditional image generation is not conditioned on text or images; it only generates images that resemble its training data distribution.

<iframe
	src="https://stevhliu-ddpm-butterflies-128.hf.space"
	frameborder="0"
	width="850"
	height="550"
></iframe>

This guide will show you how to train an unconditional image generation model on existing datasets as well as on your own custom dataset. All of the training scripts for unconditional image generation can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/unconditional_image_generation) if you're interested in learning more about the training details.

Before running the scripts, make sure to install the dependencies:

```bash
pip install diffusers[training] accelerate datasets
```

Next, initialize a 🤗 [Accelerate](https://github.com/huggingface/accelerate/) environment:

```bash
accelerate config
```

To set up a default 🤗 [Accelerate](https://github.com/huggingface/accelerate/) environment without choosing any configurations:

```bash
accelerate config default
```

Or if your environment doesn't support an interactive shell like a notebook, you can use:

```py
from accelerate.utils import write_basic_config

write_basic_config()
```

## Upload the model to the Hub

You can upload your model to the Hub by adding the following argument to the training script:

```bash
--push_to_hub
```

## Saving and loading checkpoints

It is a good idea to regularly save checkpoints in case anything goes wrong during training. To save a checkpoint, pass the following argument to the training script:

```bash
--checkpointing_steps=500
```

The full training state is saved in a subfolder of `output_dir` every 500 steps, and you can load a checkpoint and resume training by passing the `--resume_from_checkpoint` argument to the training script:

```bash
--resume_from_checkpoint="checkpoint-1500"
```

## Fine-tuning

You're ready to launch the training script now! Specify the dataset to fine-tune on with the `--dataset_name` argument, and the output path with the `--output_dir` argument. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.

The training script creates a `diffusion_pytorch_model.bin` file and saves it to your repository.

<Tip>

💡 A full training run takes 2 hours on 4 V100 GPUs.

</Tip>

For example, to fine-tune on the [Oxford Flowers](https://huggingface.co/datasets/huggan/flowers-102-categories) dataset:

```bash
accelerate launch train_unconditional.py \
  --dataset_name="huggan/flowers-102-categories" \
  --resolution=64 \
  --output_dir="ddpm-ema-flowers-64" \
  --train_batch_size=16 \
  --num_epochs=100 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision=no \
  --push_to_hub
```

<div class="flex justify-center">
<img src="https://user-images.githubusercontent.com/26864830/180248660-a0b143d0-b89a-42c5-8656-2ebf6ece7e52.png"/>
</div>

Or to fine-tune on the [Pokemon](https://huggingface.co/datasets/huggan/pokemon) dataset:

```bash
accelerate launch train_unconditional.py \
  --dataset_name="huggan/pokemon" \
  --resolution=64 \
  --output_dir="ddpm-ema-pokemon-64" \
  --train_batch_size=16 \
  --num_epochs=100 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision=no \
  --push_to_hub
```

<div class="flex justify-center">
<img src="https://user-images.githubusercontent.com/26864830/180248200-928953b4-db38-48db-b0c6-8b740fe6786f.png"/>
</div>

### Training with multiple GPUs

`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch) for running distributed training with `accelerate`. Here is an example command:

```bash
accelerate launch --mixed_precision="fp16" --multi_gpu train_unconditional.py \
  --dataset_name="huggan/pokemon" \
  --resolution=64 --center_crop --random_flip \
  --output_dir="ddpm-ema-pokemon-64" \
  --train_batch_size=16 \
  --num_epochs=100 \
  --gradient_accumulation_steps=1 \
  --use_ema \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision="fp16" \
  --logger="wandb" \
  --push_to_hub
```
405
docs/source/ko/tutorials/basic_training.mdx
Normal file
@@ -0,0 +1,405 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

[[open-in-colab]]

# Train a diffusion model

Unconditional image generation is a popular application of diffusion models that generates images resembling those in the training dataset. Typically, the best results are obtained by fine-tuning a pretrained model on a specific dataset. You can find many of these checkpoints on the [Hub](https://huggingface.co/search/full-text?q=unconditional-image-generation&type=model), but if you can't find one you like, you can always train your own!

This tutorial will teach you how to train a [`UNet2DModel`] on a subset of the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset to generate your own 🦋 butterflies 🦋.

<Tip>

💡 This training tutorial is based on the [Training with 🧨 Diffusers](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) notebook. For additional details about diffusion models and how they work, check out the notebook!

</Tip>

Before you begin, make sure you have 🤗 Datasets installed to load and preprocess image datasets, and 🤗 Accelerate to simplify training on multiple GPUs. Also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training).

```bash
!pip install diffusers[training]
```

We encourage you to share your model with the community. To do so, you'll need to log in to your Hugging Face account (create one [here](https://hf.co/join) if you don't already have one!). You can log in from a notebook and enter your token when prompted:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

Or log in from a terminal:

```bash
huggingface-cli login
```

Since the model checkpoints are quite large, install [Git-LFS](https://git-lfs.com/) to version these large files:

```bash
!sudo apt -qq install git-lfs
!git config --global credential.helper store
```

## Training configuration

For convenience, create a `TrainingConfig` class containing the training hyperparameters (feel free to adjust them):

```py
>>> from dataclasses import dataclass


>>> @dataclass
... class TrainingConfig:
...     image_size = 128  # the generated image resolution
...     train_batch_size = 16
...     eval_batch_size = 16  # how many images to sample during evaluation
...     num_epochs = 50
...     gradient_accumulation_steps = 1
...     learning_rate = 1e-4
...     lr_warmup_steps = 500
...     save_image_epochs = 10
...     save_model_epochs = 30
...     mixed_precision = "fp16"  # `no` for float32, `fp16` for automatic mixed precision
...     output_dir = "ddpm-butterflies-128"  # the model name locally and on the HF Hub

...     push_to_hub = True  # whether to upload the saved model to the HF Hub
...     hub_private_repo = False
...     overwrite_output_dir = True  # overwrite the old model when re-running the notebook
...     seed = 0


>>> config = TrainingConfig()
```

## Load the dataset

You can easily load the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset with the 🤗 Datasets library:

```py
>>> from datasets import load_dataset

>>> config.dataset_name = "huggan/smithsonian_butterflies_subset"
>>> dataset = load_dataset(config.dataset_name, split="train")
```

💡 You can find additional datasets from the [HugGan Community Event](https://huggingface.co/huggan), or you can use your own dataset by creating a local [`ImageFolder`](https://huggingface.co/docs/datasets/image_dataset#imagefolder). Set `config.dataset_name` to the repository id of the dataset if it is from the HugGan Community Event, or to `imagefolder` if you're using your own images, as sketched below.

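For example, a minimal sketch of loading your own images with the `imagefolder` loader (the directory path here is a placeholder):

```py
>>> from datasets import load_dataset

>>> # "path/to/your/images" is a placeholder for a local folder of images
>>> dataset = load_dataset("imagefolder", data_dir="path/to/your/images", split="train")
```
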
🤗 Datasets uses the [`~datasets.Image`] feature to automatically decode the image data and load it as a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html), which we can visualize:

```py
>>> import matplotlib.pyplot as plt

>>> fig, axs = plt.subplots(1, 4, figsize=(16, 4))
>>> for i, image in enumerate(dataset[:4]["image"]):
...     axs[i].imshow(image)
...     axs[i].set_axis_off()
>>> fig.show()
```

[Image grid: four sample images from the dataset]

The images are all different sizes, so you'll need to preprocess them first:

- `Resize` changes the image size to the one defined in `config.image_size`.
- `RandomHorizontalFlip` augments the dataset by randomly mirroring the images.
- `Normalize` is important for rescaling the pixel values into the [-1, 1] range the model expects.

```py
>>> from torchvision import transforms

>>> preprocess = transforms.Compose(
...     [
...         transforms.Resize((config.image_size, config.image_size)),
...         transforms.RandomHorizontalFlip(),
...         transforms.ToTensor(),
...         transforms.Normalize([0.5], [0.5]),
...     ]
... )
```

Use 🤗 Datasets' [`~datasets.Dataset.set_transform`] method to apply the `preprocess` function on the fly during training:

```py
>>> def transform(examples):
...     images = [preprocess(image.convert("RGB")) for image in examples["image"]]
...     return {"images": images}


>>> dataset.set_transform(transform)
```

Feel free to visualize the images again to confirm that they've been resized. Now you're ready to wrap the dataset in a [DataLoader](https://pytorch.org/docs/stable/data#torch.utils.data.DataLoader) for training!

```py
>>> import torch

>>> train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.train_batch_size, shuffle=True)
```

## Create a UNet2DModel

Pretrained models in 🧨 Diffusers are easily created from their model classes with the parameters you want. For example, to create a [`UNet2DModel`]:

```py
>>> from diffusers import UNet2DModel

>>> model = UNet2DModel(
...     sample_size=config.image_size,  # the target image resolution
...     in_channels=3,  # the number of input channels, 3 for RGB images
...     out_channels=3,  # the number of output channels
...     layers_per_block=2,  # how many ResNet layers to use per UNet block
...     block_out_channels=(128, 128, 256, 256, 512, 512),  # the number of output channels for each UNet block
...     down_block_types=(
...         "DownBlock2D",  # a regular ResNet downsampling block
...         "DownBlock2D",
...         "DownBlock2D",
...         "DownBlock2D",
...         "AttnDownBlock2D",  # a ResNet downsampling block with spatial self-attention
...         "DownBlock2D",
...     ),
...     up_block_types=(
...         "UpBlock2D",  # a regular ResNet upsampling block
...         "AttnUpBlock2D",  # a ResNet upsampling block with spatial self-attention
...         "UpBlock2D",
...         "UpBlock2D",
...         "UpBlock2D",
...         "UpBlock2D",
...     ),
... )
```

It's often a good idea to quickly check that the sample image shape matches the model output shape:

```py
>>> sample_image = dataset[0]["images"].unsqueeze(0)
>>> print("Input shape:", sample_image.shape)
Input shape: torch.Size([1, 3, 128, 128])

>>> print("Output shape:", model(sample_image, timestep=0).sample.shape)
Output shape: torch.Size([1, 3, 128, 128])
```

Great! Next, you'll need a scheduler to add some noise to the image.

## Create a scheduler

The scheduler behaves differently depending on whether you're using the model for training or inference. During inference, the scheduler generates images from the noise. During training, the scheduler takes a model output (or a sample) from a specific point in the diffusion process and applies noise to the image according to a *noise schedule* and an *update rule*.

Let's look at the `DDPMScheduler` and use its `add_noise` method to add some random noise to the `sample_image` from before:

```py
>>> import torch
>>> from PIL import Image
>>> from diffusers import DDPMScheduler

>>> noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
>>> noise = torch.randn(sample_image.shape)
>>> timesteps = torch.LongTensor([50])
>>> noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps)

>>> Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(torch.uint8).numpy()[0])
```

[Image: sample image with noise added]

The training objective of the model is to predict the noise added to the image. The loss at this step can be calculated by:

```py
>>> import torch.nn.functional as F

>>> noise_pred = model(noisy_image, timesteps).sample
>>> loss = F.mse_loss(noise_pred, noise)
```

## Train the model

By now, you have most of the pieces needed to start training the model; all that's left is putting everything together.

First, you'll need an optimizer and a learning rate scheduler:

```py
>>> from diffusers.optimization import get_cosine_schedule_with_warmup

>>> optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
>>> lr_scheduler = get_cosine_schedule_with_warmup(
...     optimizer=optimizer,
...     num_warmup_steps=config.lr_warmup_steps,
...     num_training_steps=(len(train_dataloader) * config.num_epochs),
... )
```

Then you'll need a way to evaluate the model. For evaluation, you can use the `DDPMPipeline` to generate a batch of sample images and save them as a grid:

```py
>>> from diffusers import DDPMPipeline
>>> import math
>>> import os


>>> def make_grid(images, rows, cols):
...     w, h = images[0].size
...     grid = Image.new("RGB", size=(cols * w, rows * h))
...     for i, image in enumerate(images):
...         grid.paste(image, box=(i % cols * w, i // cols * h))
...     return grid


>>> def evaluate(config, epoch, pipeline):
...     # Sample some images from random noise (this is the backward diffusion process).
...     # The default pipeline output type is `List[PIL.Image]`.
...     images = pipeline(
...         batch_size=config.eval_batch_size,
...         generator=torch.manual_seed(config.seed),
...     ).images

...     # Make a grid out of the images.
...     image_grid = make_grid(images, rows=4, cols=4)

...     # Save the images.
...     test_dir = os.path.join(config.output_dir, "samples")
...     os.makedirs(test_dir, exist_ok=True)
...     image_grid.save(f"{test_dir}/{epoch:04d}.png")
```

Now you can wrap all these components together in a training loop with 🤗 Accelerate for easy TensorBoard logging, gradient accumulation, and mixed precision training. To upload the model to the Hub, write a function to get your repository name and information, and then push it to the Hub.

💡 The training loop below may look intimidating and long, but it'll be worth it later when you launch your training in just one line of code! If you can't wait and just want to start generating images, feel free to copy and run the code below. 🤗

```py
>>> from accelerate import Accelerator
>>> from huggingface_hub import HfFolder, Repository, whoami
>>> from tqdm.auto import tqdm
>>> from pathlib import Path
>>> import os


>>> def get_full_repo_name(model_id: str, organization: str = None, token: str = None):
...     if token is None:
...         token = HfFolder.get_token()
...     if organization is None:
...         username = whoami(token)["name"]
...         return f"{username}/{model_id}"
...     else:
...         return f"{organization}/{model_id}"


>>> def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
...     # Initialize the accelerator and tensorboard logging.
...     accelerator = Accelerator(
...         mixed_precision=config.mixed_precision,
...         gradient_accumulation_steps=config.gradient_accumulation_steps,
...         log_with="tensorboard",
...         logging_dir=os.path.join(config.output_dir, "logs"),
...     )
...     if accelerator.is_main_process:
...         if config.push_to_hub:
...             repo_name = get_full_repo_name(Path(config.output_dir).name)
...             repo = Repository(config.output_dir, clone_from=repo_name)
...         elif config.output_dir is not None:
...             os.makedirs(config.output_dir, exist_ok=True)
...         accelerator.init_trackers("train_example")

...     # Prepare everything.
...     # There is no specific order to remember; you just need to unpack the
...     # objects in the same order you gave them to the prepare method.
...     model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
...         model, optimizer, train_dataloader, lr_scheduler
...     )

...     global_step = 0

...     # Now you train the model.
...     for epoch in range(config.num_epochs):
...         progress_bar = tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process)
...         progress_bar.set_description(f"Epoch {epoch}")

...         for step, batch in enumerate(train_dataloader):
...             clean_images = batch["images"]
...             # Sample noise to add to the images.
...             noise = torch.randn(clean_images.shape).to(clean_images.device)
...             bs = clean_images.shape[0]

...             # Sample a random timestep for each image.
...             timesteps = torch.randint(
...                 0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device
...             ).long()

...             # Add noise to the clean images according to the noise magnitude
...             # at each timestep (this is the forward diffusion process).
...             noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

...             with accelerator.accumulate(model):
...                 # Predict the noise residual.
...                 noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
...                 loss = F.mse_loss(noise_pred, noise)
...                 accelerator.backward(loss)

...                 accelerator.clip_grad_norm_(model.parameters(), 1.0)
...                 optimizer.step()
...                 lr_scheduler.step()
...                 optimizer.zero_grad()

...             progress_bar.update(1)
...             logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step}
...             progress_bar.set_postfix(**logs)
...             accelerator.log(logs, step=global_step)
...             global_step += 1

...         # After each epoch, optionally sample some demo images with evaluate() and save the model.
...         if accelerator.is_main_process:
...             pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

...             if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
...                 evaluate(config, epoch, pipeline)

...             if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
...                 if config.push_to_hub:
...                     repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
...                 else:
...                     pipeline.save_pretrained(config.output_dir)
```

Phew, that was quite a bit of code! But now you're ready to launch the training with 🤗 Accelerate's [`~accelerate.notebook_launcher`] function. Pass the function the training loop, all the training arguments, and the number of processes (you can change this value to the number of GPUs available to you) to use for training:

```py
>>> from accelerate import notebook_launcher

>>> args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)

>>> notebook_launcher(train_loop, args, num_processes=1)
```

Once training is complete, take a look at the final 🦋 images 🦋 generated by your diffusion model!

```py
>>> import glob

>>> sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png"))
>>> Image.open(sample_images[-1])
```

[Image grid: final generated butterfly samples]

## Next steps

Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques on the [🧨 Diffusers training examples](../training/overview) page. Here are some examples of what you can learn:

- [Textual Inversion](../training/text_inversion), an algorithm that teaches a model a specific visual concept and integrates it into the generated images.
- [DreamBooth](../training/dreambooth), a technique for generating personalized images of a subject given several input images of the subject.
- [Guide](../training/text2image) to fine-tuning a Stable Diffusion model on your own dataset.
- [Guide](../training/lora) to using LoRA, a memory-efficient technique for fine-tuning really large models faster.
23
docs/source/ko/tutorials/tutorial_overview.mdx
Normal file
@@ -0,0 +1,23 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Overview

Welcome to 🧨 Diffusers! If you're new to diffusion models and generative AI and want to learn more, you've come to the right place. These tutorials provide a gentle introduction to diffusion models and are designed to help you understand the library fundamentals: the core components and how 🧨 Diffusers is meant to be used.

You'll learn how to use a pipeline for inference to rapidly generate things, and then how to deconstruct that pipeline so you can use the library as a modular toolbox to build your own diffusion system. In the next unit, you'll learn how to train your own diffusion model to generate what you want.

After completing the tutorials, you'll have gained the skills to explore the library on your own and apply it to your own projects and applications.

Feel free to join our community on [Discord](https://discord.com/invite/JfAtkvEtRb) or the [forums](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) to connect and collaborate with other users and developers!

Let's start diffusing! 🧨
275
docs/source/ko/using-diffusers/custom_pipeline_examples.mdx
Normal file
@@ -0,0 +1,275 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# 커뮤니티 파이프라인
|
||||
|
||||
> **커뮤니티 파이프라인에 대한 자세한 내용은 [이 이슈](https://github.com/huggingface/diffusers/issues/841)를 참조하세요.
|
||||
|
||||
**커뮤니티** 예제는 커뮤니티에서 추가한 추론 및 훈련 예제로 구성되어 있습니다.
|
||||
다음 표를 참조하여 모든 커뮤니티 예제에 대한 개요를 확인하시기 바랍니다. **코드 예제**를 클릭하면 복사하여 붙여넣기할 수 있는 코드 예제를 확인할 수 있습니다.
|
||||
커뮤니티가 예상대로 작동하지 않는 경우 이슈를 개설하고 작성자에게 핑을 보내주세요.
|
||||
|
||||
| 예 | 설명 | 코드 예제 | 콜랩 |저자 |
|
||||
|:---------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------:|
|
||||
| CLIP Guided Stable Diffusion | CLIP 가이드 기반의 Stable Diffusion으로 텍스트에서 이미지로 생성하기 | [CLIP Guided Stable Diffusion](#clip-guided-stable-diffusion) | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/CLIP_Guided_Stable_diffusion_with_diffusers.ipynb) | [Suraj Patil](https://github.com/patil-suraj/) |
|
||||
| One Step U-Net (Dummy) | 커뮤니티 파이프라인을 어떻게 사용해야 하는지에 대한 예시(참고 https://github.com/huggingface/diffusers/issues/841) | [One Step U-Net](#one-step-unet) | - | [Patrick von Platen](https://github.com/patrickvonplaten/) |
|
||||
| Stable Diffusion Interpolation | 서로 다른 프롬프트/시드 간 Stable Diffusion의 latent space 보간 | [Stable Diffusion Interpolation](#stable-diffusion-interpolation) | - | [Nate Raw](https://github.com/nateraw/) |
|
||||
| Stable Diffusion Mega | 모든 기능을 갖춘 **하나의** Stable Diffusion 파이프라인 [Text2Image](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py), [Image2Image](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py) and [Inpainting](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py) | [Stable Diffusion Mega](#stable-diffusion-mega) | - | [Patrick von Platen](https://github.com/patrickvonplaten/) |
|
||||
| Long Prompt Weighting Stable Diffusion | 토큰 길이 제한이 없고 프롬프트에서 파싱 가중치 지원을 하는 **하나의** Stable Diffusion 파이프라인, | [Long Prompt Weighting Stable Diffusion](#long-prompt-weighting-stable-diffusion) |- | [SkyTNT](https://github.com/SkyTNT) |
|
||||
| Speech to Image | 자동 음성 인식을 사용하여 텍스트를 작성하고 Stable Diffusion을 사용하여 이미지를 생성합니다. | [Speech to Image](#speech-to-image) | - | [Mikail Duzenli](https://github.com/MikailINTech) |
|
||||
|
||||
커스텀 파이프라인을 불러오려면 `diffusers/examples/community`에 있는 파일 중 하나로서 `custom_pipeline` 인수를 `DiffusionPipeline`에 전달하기만 하면 됩니다. 자신만의 파이프라인이 있는 PR을 보내주시면 빠르게 병합해드리겠습니다.
|
||||
```py
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder"
|
||||
)
|
||||
```
|
||||
|
||||
## Usage examples

### CLIP Guided Stable Diffusion

CLIP Guided Stable Diffusion can generate more realistic images by guiding Stable Diffusion at every denoising step with an additional CLIP model.

The following code requires roughly 12GB of GPU RAM.

```python
import os

import torch
from diffusers import DiffusionPipeline
from transformers import CLIPImageProcessor, CLIPModel


feature_extractor = CLIPImageProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16)


guided_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    torch_dtype=torch.float16,
)
guided_pipeline.enable_attention_slicing()
guided_pipeline = guided_pipeline.to("cuda")

prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece"

generator = torch.Generator(device="cuda").manual_seed(0)
images = []
for i in range(4):
    image = guided_pipeline(
        prompt,
        num_inference_steps=50,
        guidance_scale=7.5,
        clip_guidance_scale=100,
        num_cutouts=4,
        use_cutouts=False,
        generator=generator,
    ).images[0]
    images.append(image)

# save the images locally (create the target directory first)
os.makedirs("./clip_guided_sd", exist_ok=True)
for i, img in enumerate(images):
    img.save(f"./clip_guided_sd/image_{i}.png")
```

The `images` list contains a list of PIL images that can be saved locally or displayed directly in a Google Colab. Generated images tend to be of higher quality than the ones produced by Stable Diffusion on its own. E.g. the above script generates the following images:

.
### One Step Unet

The dummy "one-step-unet" can be run as follows:

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="one_step_unet")
pipe()
```

**Note**: This community pipeline is not useful as a feature; it only serves as an example of how community pipelines can be added (see https://github.com/huggingface/diffusers/issues/841).
### Stable Diffusion Interpolation

The following code can be run on a GPU with at least 8GB of VRAM and should take approximately 5 minutes.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    safety_checker=None,  # Very important for videos...lots of false positives while interpolating
    custom_pipeline="interpolate_stable_diffusion",
).to("cuda")
pipe.enable_attention_slicing()

frame_filepaths = pipe.walk(
    prompts=["a dog", "a cat", "a horse"],
    seeds=[42, 1337, 1234],
    num_interpolation_steps=16,
    output_dir="./dreams",
    batch_size=4,
    height=512,
    width=512,
    guidance_scale=8.5,
    num_inference_steps=50,
)
```

The `walk(...)` function returns a list of images saved under the folder defined in `output_dir`. You can use these frames to create a video of Stable Diffusion interpolations.

> Please have a look at https://github.com/nateraw/stable-diffusion-videos for more in-detail information on how to create videos using Stable Diffusion, and for more features.
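
As a minimal sketch of that last step (assuming `imageio` and its ffmpeg plugin are installed, and that the frames were written as `.png` files under `./dreams`; the frame rate and output name here are arbitrary), the saved frames can be stitched into a video:

```python
import glob

import imageio  # assumed installed: pip install imageio imageio-ffmpeg

# Collect the interpolation frames written by `walk` above; the exact
# directory layout under `output_dir` may vary, so search recursively.
frame_paths = sorted(glob.glob("./dreams/**/*.png", recursive=True))

# Write the frames out as an mp4 at 10 frames per second.
with imageio.get_writer("dreams.mp4", fps=10) as writer:
    for path in frame_paths:
        writer.append_data(imageio.imread(path))
```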
### Stable Diffusion Mega

The Stable Diffusion Mega pipeline lets you use the main use cases of the Stable Diffusion pipeline in a single class.

```python
#!/usr/bin/env python3
from io import BytesIO

import PIL
import requests
import torch
from diffusers import DiffusionPipeline


def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")


pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="stable_diffusion_mega",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
pipe.enable_attention_slicing()


### Text-to-Image

images = pipe.text2img("An astronaut riding a horse").images

### Image-to-Image

init_image = download_image(
    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
)

prompt = "A fantasy landscape, trending on artstation"

images = pipe.img2img(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images

### Inpainting

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

prompt = "a cat sitting on a bench"
images = pipe.inpaint(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.75).images
```

As shown above, one single pipeline can run text-to-image, image-to-image, and inpainting.

### Long Prompt Weighting Stable Diffusion

This pipeline lets you input prompts without the 77-token length limit. It also lets you increase a word's weight with "()" and decrease it with "[]".
The pipeline also provides the main use cases of the Stable Diffusion pipeline in a single class.

#### pytorch

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "hakurei/waifu-diffusion", custom_pipeline="lpw_stable_diffusion", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "best_quality (1girl:1.3) bow bride brown_hair closed_mouth frilled_bow frilled_hair_tubes frills (full_body:1.3) fox_ear hair_bow hair_tubes happy hood japanese_clothes kimono long_sleeves red_bow smile solo tabi uchikake white_kimono wide_sleeves cherry_blossoms"
neg_prompt = "lowres, bad_anatomy, error_body, error_hair, error_arm, error_hands, bad_hands, error_fingers, bad_fingers, missing_fingers, error_legs, bad_legs, multiple_legs, missing_legs, error_lighting, error_shadow, error_reflection, text, error, extra_digit, fewer_digits, cropped, worst_quality, low_quality, normal_quality, jpeg_artifacts, signature, watermark, username, blurry"

pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0]
```
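
As a smaller illustration of the weighting syntax (a sketch reusing `pipe` from above; the words and weights here are arbitrary, and the `(word:1.5)` form for explicit weights is assumed from the community pipeline's README):

```python
# "()" raises a word's weight, "[]" lowers it, "(word:1.4)" sets it explicitly.
image = pipe.text2img("a (masterpiece:1.4) portrait of a fox, (sharp focus), [blurry]").images[0]
image.save("weighted_prompt.png")
```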

#### onnxruntime

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="lpw_stable_diffusion_onnx",
    revision="onnx",
    provider="CUDAExecutionProvider",
)

prompt = "a photo of an astronaut riding a horse on mars, best quality"
neg_prompt = "lowres, bad anatomy, error body, error hair, error arm, error hands, bad hands, error fingers, bad fingers, missing fingers, error legs, bad legs, multiple legs, missing legs, error lighting, error shadow, error reflection, text, error, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry"

pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0]
```

If you see the warning `Token indices sequence length is longer than the specified maximum sequence length for this model (*** > 77). Running this sequence through the model will result in indexing errors`, it's expected behavior; don't worry about it.
### Speech to Image

The following code can generate an image from an audio sample using the pretrained OpenAI whisper-small model and Stable Diffusion.

```python
import matplotlib.pyplot as plt
import torch
from datasets import load_dataset
from diffusers import DiffusionPipeline
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
)


device = "cuda" if torch.cuda.is_available() else "cpu"

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

audio_sample = ds[3]

text = audio_sample["text"].lower()
speech_data = audio_sample["audio"]["array"]

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

diffuser_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="speech_to_image_diffusion",
    speech_model=model,
    speech_processor=processor,
    torch_dtype=torch.float16,
)

diffuser_pipeline.enable_attention_slicing()
diffuser_pipeline = diffuser_pipeline.to(device)

output = diffuser_pipeline(speech_data)
plt.imshow(output.images[0])
plt.show()  # display the result when running as a script
```

The above example produces the following image:

![image](https://user-images.githubusercontent.com/45072645/218992097-07d60573-61e1-4662-8ae6-5801d9b24486.png)
56 docs/source/ko/using-diffusers/custom_pipeline_overview.mdx Normal file
@@ -0,0 +1,56 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Loading custom pipelines

[[open-in-colab]]

A community pipeline is any [`DiffusionPipeline`] class implemented differently from the original implementation specified in its paper (for example, [`StableDiffusionControlNetPipeline`] corresponds to ["Text-to-Image Generation with ControlNet Conditioning"](https://arxiv.org/abs/2302.05543)). Community pipelines provide additional functionality or extend the original implementation of a pipeline.

There are many cool community pipelines like [Speech to Image](https://github.com/huggingface/diffusers/tree/main/examples/community#speech-to-image) or [Composable Stable Diffusion](https://github.com/huggingface/diffusers/tree/main/examples/community#composable-stable-diffusion), and you can find all the official community pipelines [here](https://github.com/huggingface/diffusers/tree/main/examples/community).

To load a community pipeline from the Hub, pass the repository id of the community pipeline along with the repository id of the model you want to load the pipeline weights and components from. For example, the example below loads a dummy pipeline from `hf-internal-testing/diffusers-dummy-pipeline` and the pipeline weights and components from `google/ddpm-cifar10-32`.

<Tip warning={true}>

🔒 Loading a community pipeline from the Hugging Face Hub means you are trusting that the code it runs is safe. Make sure to inspect the code online before loading and running it automatically!

</Tip>

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "google/ddpm-cifar10-32", custom_pipeline="hf-internal-testing/diffusers-dummy-pipeline"
)
```

Loading an official community pipeline is similar, but in addition to loading the weights from an official repository id, you can also specify the components of the pipeline directly. The example below loads the community [CLIP Guided Stable Diffusion](https://github.com/huggingface/diffusers/tree/main/examples/community#clip-guided-stable-diffusion) pipeline, and directly sets the `clip_model` and `feature_extractor` components it should use.

```py
from diffusers import DiffusionPipeline
from transformers import CLIPImageProcessor, CLIPModel

clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"

feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(clip_model_id)

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
)
```

Take a look at the [Community pipelines](https://github.com/huggingface/diffusers/blob/main/docs/source/en/using-diffusers/custom_pipeline_examples) guide for more information about community pipelines. If you're interested in contributing a community pipeline, check out the [How to contribute a community pipeline](https://github.com/huggingface/diffusers/blob/main/docs/source/en/using-diffusers/contribute_pipeline) guide!
57 docs/source/ko/using-diffusers/depth2img.mdx Normal file
@@ -0,0 +1,57 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Text-guided depth-to-image generation

[[open-in-colab]]

The [`StableDiffusionDepth2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images. In addition, you can also pass a `depth_map` to preserve the structure of the image. If no `depth_map` is provided, the pipeline automatically predicts the depth via an integrated [depth-estimation model](https://github.com/isl-org/MiDaS).

Start by creating an instance of the [`StableDiffusionDepth2ImgPipeline`]:

```python
import torch
import requests
from PIL import Image

from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")
```

Now pass your prompt to the pipeline. You can also pass a `negative_prompt` to prevent certain words from guiding how the image is generated:

```python
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)
prompt = "two tigers"
n_prompt = "bad, deformed, ugly, bad anatomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=n_prompt, strength=0.7).images[0]
image
```

| Input | Output |
|-------|--------|
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/coco-cats.png" width="500"/> | <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/depth2img-tigers.png" width="500"/> |
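
If you want to control the structure yourself, you can compute a depth map with a standalone depth estimator and pass it in. The following is a minimal sketch, not the pipeline's documented recipe: the `Intel/dpt-large` model id and the assumption that `depth_map` accepts a `(batch, height, width)` float tensor are both assumptions to verify against the pipeline docs.

```python
import torch
from transformers import DPTForDepthEstimation, DPTImageProcessor

# Assumed model id; any DPT-style depth estimator should work similarly.
depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
depth_processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")

inputs = depth_processor(images=init_image, return_tensors="pt")
with torch.no_grad():
    # predicted_depth has shape (batch, height, width)
    depth_map = depth_estimator(**inputs).predicted_depth

image = pipe(
    prompt=prompt,
    image=init_image,
    negative_prompt=n_prompt,
    depth_map=depth_map.to("cuda", dtype=torch.float16),
    strength=0.7,
).images[0]
```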

Play around with the Space below and see if you notice a difference between images generated with and without a depth map!

<iframe
	src="https://radames-stable-diffusion-depth2img.hf.space"
	frameborder="0"
	width="850"
	height="500"
></iframe>
100 docs/source/ko/using-diffusers/img2img.mdx Normal file
@@ -0,0 +1,100 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Text-guided image-to-image generation

[[open-in-colab]]

The [`StableDiffusionImg2ImgPipeline`] lets you pass a text prompt and a starting image to condition the generation of new images.

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install diffusers transformers ftfy accelerate
```

Get started by creating a [`StableDiffusionImg2ImgPipeline`] with a pretrained Stable Diffusion model like [`nitrosocke/Ghibli-Diffusion`](https://huggingface.co/nitrosocke/Ghibli-Diffusion).

```python
import torch
import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("nitrosocke/Ghibli-Diffusion", torch_dtype=torch.float16).to(
    device
)
```

Download and preprocess an initial image so you can pass it to the pipeline:

```python
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image.thumbnail((768, 768))
init_image
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/image_2_image_using_diffusers_cell_8_output_0.jpeg"/>
</div>

<Tip>

💡 `strength` is a value between 0.0 and 1.0 that controls the amount of noise added to the input image. Values that approach 1.0 allow for lots of variation but will also produce images that are not semantically consistent with the input.

</Tip>

Define the prompt (for this checkpoint finetuned on Ghibli-style art, you need to prefix the prompt with the `ghibli style` token) and run the pipeline:

```python
prompt = "ghibli style, a fantasy landscape with castles"
generator = torch.Generator(device=device).manual_seed(1024)
image = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5, generator=generator).images[0]
image
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ghibli-castles.png"/>
</div>

You can also experiment with a different scheduler to see how that affects the output:

```python
from diffusers import LMSDiscreteScheduler

lms = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.scheduler = lms
generator = torch.Generator(device=device).manual_seed(1024)
image = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5, generator=generator).images[0]
image
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lms-ghibli.png"/>
</div>

Check out the Space below, and try generating images with different values of `strength`. You'll notice that using lower values of `strength` produces images that are more similar to the original image.
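
To run that comparison locally first, here is a quick sketch that reuses `pipe`, `prompt`, and `init_image` from above (the strength values and file names are arbitrary):

```python
# Lower strength stays closer to the initial image; higher strength drifts further.
for strength in [0.3, 0.5, 0.75]:
    generator = torch.Generator(device=device).manual_seed(1024)
    image = pipe(
        prompt=prompt,
        image=init_image,
        strength=strength,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    image.save(f"ghibli_strength_{strength}.png")
```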

Feel free to also switch the scheduler to the [`LMSDiscreteScheduler`] and see how that affects the output.

<iframe
	src="https://stevhliu-ghibli-img2img.hf.space"
	frameborder="0"
	width="850"
	height="500"
></iframe>
75 docs/source/ko/using-diffusers/inpaint.mdx Normal file
@@ -0,0 +1,75 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Text-guided image inpainting

[[open-in-colab]]

The [`StableDiffusionInpaintPipeline`] lets you edit specific parts of an image by providing a mask and a text prompt. It uses a version of Stable Diffusion specifically trained for inpainting tasks, such as [`runwayml/stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting).

Get started by loading an instance of the [`StableDiffusionInpaintPipeline`]:

```python
import PIL
import requests
import torch
from io import BytesIO

from diffusers import StableDiffusionInpaintPipeline

pipeline = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
)
pipeline = pipeline.to("cuda")
```

Download an image and a mask of a dog which you'll eventually replace:

```python
def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
```

Now you can create a prompt to replace the masked area with something else:

```python
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
```

`image` | `mask_image` | `prompt` | output |
:-------------------------:|:-------------------------:|:-------------------------:|-------------------------:|
<img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" alt="drawing" width="250"/> | <img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" alt="drawing" width="250"/> | ***Face of a yellow cat, high resolution, sitting on a park bench*** | <img src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/in_paint/yellow_cat_sitting_on_a_park_bench.png" alt="drawing" width="250"/> |

<Tip warning={true}>

A previous experimental implementation of inpainting used a different, lower-quality process. To ensure backwards compatibility, loading a pretrained pipeline that doesn't contain the new model will still apply the old inpainting method.

</Tip>

Try out image inpainting in the Space below!

<iframe
	src="https://runwayml-stable-diffusion-inpainting.hf.space"
	frameborder="0"
	width="850"
	height="500"
></iframe>
442 docs/source/ko/using-diffusers/loading.mdx Normal file
@@ -0,0 +1,442 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Load pipelines, models, and schedulers

Diffusion models are, at their core, built on complex interactions between multiple components (models, tokenizers, schedulers). 🤗 Diffusers is designed to make these models available through a simpler, more convenient API. The [`DiffusionPipeline`] bundles the complexity of a diffusion model into a single pipeline API while remaining flexible enough to let you customize each of its components for your task.

Everything you need for training and inference with a diffusion model is accessible through the [`DiffusionPipeline.from_pretrained`] method (we'll look at what that means in more detail in the next section).

This document covers how to:

* load a pipeline from the Hub or locally
* use different components in a pipeline
* load checkpoint variants instead of the original checkpoint (a variant is a checkpoint that uses a floating point type other than the default `fp32`, such as `fp16`, or non-EMA weights)
* load models and schedulers
## Diffusion pipelines

<Tip>

💡 If you're interested in a more detailed explanation of how the [`DiffusionPipeline`] class works, take a look at the [DiffusionPipeline explained](#diffusionpipeline-explained) section.

</Tip>

The [`DiffusionPipeline`] class is the simplest and most generic way to load a diffusion model from the [Hub](https://huggingface.co/models?library=diffusers). The [`DiffusionPipeline.from_pretrained`] method automatically detects the correct pipeline class, downloads and caches the required configuration and weight files, and returns a pipeline instance.

```python
from diffusers import DiffusionPipeline

repo_id = "runwayml/stable-diffusion-v1-5"
pipe = DiffusionPipeline.from_pretrained(repo_id)
```

Of course, you can also skip the [`DiffusionPipeline`] class and explicitly load the specific pipeline class instead. The example below returns the same instance as the example above.

```python
from diffusers import StableDiffusionPipeline

repo_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(repo_id)
```

Checkpoints such as [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) or [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) can be used for more than one task (for example, both of them can be used for text-to-image and image-to-image). If you want to use such a checkpoint for a task other than its default one, you have to load it with the corresponding task-specific pipeline.

```python
from diffusers import StableDiffusionImg2ImgPipeline

repo_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(repo_id)
```

### Local pipelines

To load a pipeline locally, use `git-lfs` to manually download a checkpoint to your local disk. Running the commands below creates a folder named `./stable-diffusion-v1-5` on your disk.

```bash
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
```

Then pass the local path to [`~DiffusionPipeline.from_pretrained`].

```python
from diffusers import DiffusionPipeline

repo_id = "./stable-diffusion-v1-5"
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id)
```

As in the example above, when `repo_id` is a local path, the [`~DiffusionPipeline.from_pretrained`] method detects this automatically and doesn't download any files from the Hub. This also means that if the pipeline checkpoint on your local disk is outdated, the method keeps using the local checkpoint instead of downloading the latest version.
### Swap components in a pipeline

The components inside a pipeline can be swapped for other compatible components. Swapping components matters because:

- the choice of scheduler is an important factor that defines the trade-off between generation speed and generation quality.
- the components of a diffusion model are typically trained independently of one another, so if another component performs better, you can swap it in to improve the results.
- during finetuning, usually only some of the components, such as the UNet or the text encoder, are trained.

You can check which schedulers are compatible through the `compatibles` attribute.

```python
from diffusers import DiffusionPipeline

repo_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id)
stable_diffusion.scheduler.compatibles
```

Now let's use the [`SchedulerMixin.from_pretrained`] method to replace the default [`PNDMScheduler`] with the higher-performing [`EulerDiscreteScheduler`]. When loading the scheduler, specify the `subfolder` argument to point to the [scheduler subfolder](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/scheduler) of the pipeline repository.

Then pass the newly created [`EulerDiscreteScheduler`] instance to the `scheduler` argument of [`DiffusionPipeline`].

```python
from diffusers import DiffusionPipeline, EulerDiscreteScheduler

repo_id = "runwayml/stable-diffusion-v1-5"

scheduler = EulerDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")

stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, scheduler=scheduler)
```

### Safety checker

Diffusion models like Stable Diffusion can generate harmful content. To help prevent this, 🤗 Diffusers includes a [safety checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) that screens generated images for harmful content. If you don't want to use the safety checker, pass `None` to the `safety_checker` argument.

```python
from diffusers import DiffusionPipeline

repo_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, safety_checker=None)
```

### Reuse components across pipelines

If the same model is used in multiple pipelines, there's no need to load its identical weights into RAM twice. The [`~DiffusionPipeline.components`] attribute gives you access to the components inside a pipeline; in this section, we use it to avoid loading the same weights into RAM more than once.

```python
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)

components = stable_diffusion_txt2img.components
```

Then you can pass the `components` variable declared above to another pipeline, reusing the same components without loading the model weights into RAM again.

```python
stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(**components)
```
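
A quick way to confirm the components are genuinely shared rather than copied (a small sketch; `is` checks object identity in Python):

```python
# Both pipelines reference the same UNet module in memory.
assert stable_diffusion_img2img.unet is stable_diffusion_txt2img.unet
```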

Of course, you can also pass the components to the pipeline individually. For example, you can reuse every component of the `stable_diffusion_txt2img` pipeline in the `stable_diffusion_img2img` pipeline except for the safety checker (`safety_checker`) and the feature extractor (`feature_extractor`).

```python
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"
stable_diffusion_txt2img = StableDiffusionPipeline.from_pretrained(model_id)
stable_diffusion_img2img = StableDiffusionImg2ImgPipeline(
    vae=stable_diffusion_txt2img.vae,
    text_encoder=stable_diffusion_txt2img.text_encoder,
    tokenizer=stable_diffusion_txt2img.tokenizer,
    unet=stable_diffusion_txt2img.unet,
    scheduler=stable_diffusion_txt2img.scheduler,
    safety_checker=None,
    feature_extractor=None,
    requires_safety_checker=False,
)
```
## Checkpoint variants

A variant generally refers to checkpoints that:

- use a different floating point type, such as `torch.float16`, which is lower-precision but also smaller on disk. *(Such variants cannot be used for additional training or on a CPU.)*
- contain non-EMA weights. *(Non-EMA weights are recommended for finetuning, but not for inference.)*

<Tip>

💡 When checkpoints have identical model structures but were trained with different setups on different datasets, they should be stored in separate repositories rather than as variants (for example, [`stable-diffusion-v1-4`] and [`stable-diffusion-v1-5`]).

</Tip>

| **checkpoint type** | **weight name**                     | **argument for loading weights** |
| ------------------- | ----------------------------------- | -------------------------------- |
| original            | diffusion_pytorch_model.bin         |                                  |
| floating point      | diffusion_pytorch_model.fp16.bin    | `variant`, `torch_dtype`         |
| non-EMA             | diffusion_pytorch_model.non_ema.bin | `variant`                        |

There are two important arguments for loading variants:

* `torch_dtype` defines the floating point type of the loaded checkpoint. For example, specifying `torch_dtype=torch.float16` converts the weights to `fp16` (if you don't set it, `fp32` weights are loaded by default). You can also load a checkpoint without specifying the `variant` argument and convert it to `fp16` with `torch_dtype=torch.float16`; in that case, the default `fp32` weights are downloaded first and converted after loading, as shown in the sketch after the code block below.
* `variant` defines which variant to load from the repository. For example, to load the `non_ema` checkpoint from the [`diffusers/stable-diffusion-variants`](https://huggingface.co/diffusers/stable-diffusion-variants/tree/main/unet) repository, pass `variant="non_ema"`.

```python
import torch
from diffusers import DiffusionPipeline

# load fp16 variant
stable_diffusion = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16
)
# load non_ema variant
stable_diffusion = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", variant="non_ema")
```
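
For comparison, the conversion path described above (downloading the default `fp32` weights and casting them after loading, without fetching a variant) is simply:

```python
# No `variant` argument: fp32 weights are downloaded, then cast to fp16 in memory.
stable_diffusion = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
```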

To save a checkpoint that uses a different floating point type or non-EMA weights, use the [`DiffusionPipeline.save_pretrained`] method and specify the `variant` argument. You should save the variant to the same folder as the original checkpoint, so you can load both from the same folder.

```python
from diffusers import DiffusionPipeline

# save as fp16 variant
stable_diffusion.save_pretrained("runwayml/stable-diffusion-v1-5", variant="fp16")
# save as non-ema variant
stable_diffusion.save_pretrained("runwayml/stable-diffusion-v1-5", variant="non_ema")
```

If you don't save the variant to an existing folder, you must specify the `variant` argument, otherwise an error is thrown because the original checkpoint can't be found.

```python
# 👎 this won't work
stable_diffusion = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", torch_dtype=torch.float16)
# 👍 this works
stable_diffusion = DiffusionPipeline.from_pretrained(
    "./stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16
)
```
## Models

Models are loaded via the [`ModelMixin.from_pretrained`] method, which downloads and caches the latest version of the model weights and configuration files. If the latest files are already in your local cache, [`ModelMixin.from_pretrained`] doesn't re-download them and simply reuses the cached files.

Models are loaded from the subfolder specified by the `subfolder` argument. For example, the UNet weights for `runwayml/stable-diffusion-v1-5` are stored in the [`unet`](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet) subfolder.

```python
from diffusers import UNet2DConditionModel

repo_id = "runwayml/stable-diffusion-v1-5"
model = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet")
```

You can also load a model directly from [its own repository](https://huggingface.co/google/ddpm-cifar10-32/tree/main).

```python
from diffusers import UNet2DModel

repo_id = "google/ddpm-cifar10-32"
model = UNet2DModel.from_pretrained(repo_id)
```

And, as shown earlier, you can load and save non-EMA or `fp16` weights by specifying the `variant` argument.

```python
from diffusers import UNet2DConditionModel

model = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet", variant="non-ema")
model.save_pretrained("./local-unet", variant="non-ema")
```

## Schedulers

Schedulers are loaded via the [`SchedulerMixin.from_pretrained`] method. Unlike models, schedulers have no weights of their own and accordingly don't require any training. They are defined by a configuration file (in the scheduler subfolder).

Loading several schedulers doesn't consume much memory, and the same scheduler configuration can be applied to a variety of schedulers. All of the schedulers loaded in the example below are compatible with [`StableDiffusionPipeline`], which means you can load the same scheduler configuration file in any of them.

```python
from diffusers import StableDiffusionPipeline
from diffusers import (
    DDPMScheduler,
    DDIMScheduler,
    PNDMScheduler,
    LMSDiscreteScheduler,
    EulerDiscreteScheduler,
    EulerAncestralDiscreteScheduler,
    DPMSolverMultistepScheduler,
)

repo_id = "runwayml/stable-diffusion-v1-5"

ddpm = DDPMScheduler.from_pretrained(repo_id, subfolder="scheduler")
ddim = DDIMScheduler.from_pretrained(repo_id, subfolder="scheduler")
pndm = PNDMScheduler.from_pretrained(repo_id, subfolder="scheduler")
lms = LMSDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
euler_anc = EulerAncestralDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
euler = EulerDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler")
dpm = DPMSolverMultistepScheduler.from_pretrained(repo_id, subfolder="scheduler")

# replace `dpm` with any of `ddpm`, `ddim`, `pndm`, `lms`, `euler_anc`, `euler`
pipeline = StableDiffusionPipeline.from_pretrained(repo_id, scheduler=dpm)
```
## DiffusionPipeline explained

As a class method, [`DiffusionPipeline.from_pretrained`] is responsible for two things:

- First, it downloads the latest version of the pipeline and caches it. If the latest pipeline is already in your local cache, [`DiffusionPipeline.from_pretrained`] reuses the cached files instead of re-downloading them.
- Second, it loads the checkpoint with the correct pipeline class, which it retrieves from the `model_index.json` file.

The folder structure of a pipeline corresponds directly to the structure of its pipeline class. For example, the [`StableDiffusionPipeline`] class corresponds to the folder structure of the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) repository.

```python
from diffusers import DiffusionPipeline

repo_id = "runwayml/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(repo_id)
print(pipeline)
```

Looking at the output of the code above, you'll see that `pipeline` is an instance of [`StableDiffusionPipeline`], which consists of the following seven components:

- `"feature_extractor"`: an instance of [`~transformers.CLIPImageProcessor`]
- `"safety_checker"`: a [component](https://github.com/huggingface/diffusers/blob/e55687e1e15407f60f32242027b7bb8170e58266/src/diffusers/pipelines/stable_diffusion/safety_checker.py#L32) for screening against harmful content
- `"scheduler"`: an instance of [`PNDMScheduler`]
- `"text_encoder"`: an instance of [`~transformers.CLIPTextModel`]
- `"tokenizer"`: an instance of [`~transformers.CLIPTokenizer`]
- `"unet"`: an instance of [`UNet2DConditionModel`]
- `"vae"`: an instance of [`AutoencoderKL`]

```json
StableDiffusionPipeline {
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
```

Compare the components of the pipeline instance to the folder structure of [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5), and you'll see there is a separate folder for each component in the repository:

```
.
├── feature_extractor
│   └── preprocessor_config.json
├── model_index.json
├── safety_checker
│   ├── config.json
│   └── pytorch_model.bin
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   ├── config.json
│   └── pytorch_model.bin
├── tokenizer
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── unet
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── vae
    ├── config.json
    └── diffusion_pytorch_model.bin
```

You can also access each component of the pipeline as an attribute of the instance:

```py
pipeline.tokenizer
```

```python
CLIPTokenizer(
    name_or_path="/root/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/39593d5650112b4cc580433f6b0435385882d819/tokenizer",
    vocab_size=49408,
    model_max_length=77,
    is_fast=False,
    padding_side="right",
    truncation_side="right",
    special_tokens={
        "bos_token": AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "eos_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "unk_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "pad_token": "<|endoftext|>",
    },
)
```

Every pipeline has a `model_index.json` file that tells the [`DiffusionPipeline`]:

- `_class_name`: which pipeline class to use.
- `_diffusers_version`: which version of 🧨 Diffusers was used to create the models in the pipeline.
- which library and class each component was created from (in the example below, `"feature_extractor": ["transformers", "CLIPImageProcessor"]` means the `feature_extractor` component was created from the `CLIPImageProcessor` class in the `transformers` library).

```json
{
  "_class_name": "StableDiffusionPipeline",
  "_diffusers_version": "0.6.0",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
```
191 docs/source/ko/using-diffusers/other-formats.mdx Normal file
@@ -0,0 +1,191 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Load different Stable Diffusion formats

Stable Diffusion models come in different formats depending on the framework they were trained and saved in, and where you downloaded them. Converting these formats for use in 🤗 Diffusers lets you use all of the features supported by the library, such as [using different schedulers](schedulers) for inference, building your own custom pipeline, and a variety of techniques and methods for optimizing inference speed.

<Tip>

We recommend the `.safetensors` format because it's much safer than traditional pickled files, which are vulnerable and can be exploited to execute code on your machine (learn more in the guide on loading safetensors).

</Tip>

This guide shows you how to convert other Stable Diffusion formats to be compatible with 🤗 Diffusers.

## PyTorch .ckpt

The checkpoint, or `.ckpt`, format is commonly used to store models. A `.ckpt` file contains the entire model and is typically several GBs in size. While you can load and use a `.ckpt` file directly with the [`~StableDiffusionPipeline.from_ckpt`] method, it's generally better to convert the `.ckpt` file to 🤗 Diffusers so both formats are available.

There are two options for converting a `.ckpt` file: use a Space to convert the checkpoint, or convert the `.ckpt` file with a script.

### Convert with a Space

The easiest and most convenient way to convert a `.ckpt` file is to use the [SD to Diffusers](https://huggingface.co/spaces/diffusers/sd-to-diffusers) Space. You can follow the instructions on the Space to convert a `.ckpt` file.

This approach works well for basic models, but it may struggle with more customized ones. You'll know the Space failed if it returns an empty pull request or an error.
In that case, you can try converting the `.ckpt` file with the script.

### Convert with a script

🤗 Diffusers provides a conversion script for converting `.ckpt` files. This approach is more reliable than the Space above.

Before you begin, make sure you have a local clone of 🤗 Diffusers to run the script, and log in to your Hugging Face account so you can open pull requests and push your converted model to the Hub.

```bash
huggingface-cli login
```

To use the script:

1. Git clone the repository containing the `.ckpt` file you want to convert.

For this example, let's convert the TemporalNet `.ckpt` file:

```bash
git lfs install
git clone https://huggingface.co/CiaraRowles/TemporalNet
```

2. Open a pull request on the repository where you're converting the checkpoint:

```bash
cd TemporalNet && git fetch origin refs/pr/13:pr/13
git checkout pr/13
```

3. There are several input arguments to configure in the conversion script, but the most important ones are:

- `checkpoint_path`: the path to the `.ckpt` file to convert.
- `original_config_file`: a YAML file defining the configuration of the original architecture. If you can't find this file, try searching for the YAML file in the GitHub repository where you found the `.ckpt` file.
- `dump_path`: the path for the converted model.

For example, you can take the `cldm_v15.yaml` file from the ControlNet repository because the TemporalNet model is a Stable Diffusion v1.5 and ControlNet model.

4. Now you can run the script to convert the `.ckpt` file:

```bash
python ../diffusers/scripts/convert_original_stable_diffusion_to_diffusers.py --checkpoint_path temporalnetv3.ckpt --original_config_file cldm_v15.yaml --dump_path ./ --controlnet
```

5. Once the conversion is done, upload your converted model and test out the resulting [pull request](https://huggingface.co/CiaraRowles/TemporalNet/discussions/13)!

```bash
git push origin pr/13:refs/pr/13
```
## Keras .pb or .h5

🧪 This is an experimental feature. Only Stable Diffusion v1 checkpoints are currently supported by the Convert KerasCV Space.

[KerasCV](https://keras.io/keras_cv/) supports training for [Stable Diffusion](https://github.com/keras-team/keras-cv/blob/master/keras_cv/models/stable_diffusion) v1 and v2. However, it offers limited support for experimenting with Stable Diffusion models for inference and deployment, whereas 🤗 Diffusers has a more complete set of features for this purpose, such as different [noise schedulers](https://huggingface.co/docs/diffusers/using-diffusers/schedulers), [flash attention](https://huggingface.co/docs/diffusers/optimization/xformers), and [other optimization techniques](https://huggingface.co/docs/diffusers/optimization/fp16).

The [Convert KerasCV](https://huggingface.co/spaces/sayakpaul/convert-kerascv-sd-diffusers) Space converts `.pb` or `.h5` files to PyTorch and then wraps them in a [`StableDiffusionPipeline`] so they're ready for inference. The converted checkpoint is stored in a repository on the Hugging Face Hub.

For this example, let's convert the [sayakpaul/textual-inversion-kerasio](https://huggingface.co/sayakpaul/textual-inversion-kerasio/tree/main) checkpoint, which was trained with Textual Inversion. It uses the special token `<my-funny-cat>` to personalize images with cats.

The Convert KerasCV Space allows you to input the following:

- Your Hugging Face token.
- Paths to download the UNet and text encoder weights from. Depending on how the model was trained, you don't necessarily need to provide the paths to both the UNet and the text encoder. For example, Textual Inversion only requires the embeddings from the text encoder, and a text-to-image model only requires the UNet weights.
- Placeholder token, only applicable for Textual Inversion models.
- `output_repo_prefix`, the name of the repository where the converted model is stored.

Click the **Submit** button to automatically convert the KerasCV checkpoint! Once the checkpoint is successfully converted, you'll see a link to the new repository containing the converted checkpoint. Follow the link to the new repository, and you'll see the Convert KerasCV Space generated a model card with an inference widget for trying out the converted model.

If you prefer to run inference with code, click the **Use in Diffusers** button in the upper-right corner of the model card to copy and paste the example code:

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline")
```

Then you can generate an image like:

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("sayakpaul/textual-inversion-cat-kerascv_sd_diffusers_pipeline")
pipeline.to("cuda")

placeholder_token = "<my-funny-cat-token>"
prompt = f"two {placeholder_token} getting married, photorealistic, high quality"
image = pipeline(prompt, num_inference_steps=50).images[0]
```

## A1111 LoRA files

[Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) (A1111) is a popular web UI for Stable Diffusion that supports model sharing platforms like [Civitai](https://civitai.com/). Models trained with the LoRA technique are especially popular because they're fast to train and have a much smaller file size than a fully finetuned model.

🤗 Diffusers supports loading A1111 LoRA checkpoints with [`~loaders.LoraLoaderMixin.load_lora_weights`]:

```py
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

pipeline = DiffusionPipeline.from_pretrained(
    "andite/anything-v4.0", torch_dtype=torch.float16, safety_checker=None
).to("cuda")
pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config)
```

Download a LoRA checkpoint from Civitai; this example uses the [Howls Moving Castle,Interior/Scenery LoRA (Ghibli Stlye)](https://civitai.com/models/14605?modelVersionId=19998) checkpoint, but feel free to try out any LoRA checkpoint!

```bash
wget https://civitai.com/api/download/models/19998 -O howls_moving_castle.safetensors
```

Load the LoRA checkpoint into the pipeline with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method:

```py
pipeline.load_lora_weights(".", weight_name="howls_moving_castle.safetensors")
```

Now you can use the pipeline to generate images:

```py
prompt = "masterpiece, illustration, ultra-detailed, cityscape, san francisco, golden gate bridge, california, bay area, in the snow, beautiful detailed starry sky"
negative_prompt = "lowres, cropped, worst quality, low quality, normal quality, artifacts, signature, watermark, username, blurry, more than one bridge, bad architecture"

images = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=512,
    height=512,
    num_inference_steps=25,
    num_images_per_prompt=4,
    generator=torch.manual_seed(0),
).images
```

Finally, create a helper function to display the images in a grid:

```py
from PIL import Image


def image_grid(imgs, rows=2, cols=2):
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid


image_grid(images)
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/a1111-lora-sf.png" />
</div>
17 docs/source/ko/using-diffusers/pipeline_overview.mdx Normal file
@@ -0,0 +1,17 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Overview

A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, such as [`StableDiffusionPipeline`] or [`StableDiffusionControlNetPipeline`], with specialized capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it automatically detects the pipeline type and loads the necessary components.

This section introduces the tasks supported by pipelines, including unconditional image generation and various techniques and variations of text-to-image generation. You'll also learn how to gain more control over the generation process by setting a seed for reproducibility and weighting prompts to adjust how much influence certain words in the prompt have over the output. Finally, you'll see how to create a community pipeline for a custom task like generating images from speech.
63 docs/source/ko/using-diffusers/reusing_seeds.mdx Normal file
@@ -0,0 +1,63 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Improve image quality with deterministic generation

A common way to improve the quality of generated images is with *deterministic batch generation*: generate a batch of images and select one image to improve with a more detailed prompt in a second round of inference. The key is to pass a list of [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html#generator)s to the pipeline for batched image generation, and tie each `Generator` to a seed so you can reuse it for a specific image.

For example, let's use [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) and generate several versions of the following prompt.

```py
prompt = "Labrador in the style of Vermeer"
```

Instantiate the pipeline with [`DiffusionPipeline.from_pretrained`] and place it on a GPU (if available).

```python
>>> import torch
>>> from diffusers import DiffusionPipeline

>>> pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
```

Now define four different `Generator`s and assign each `Generator` a seed (`0` to `3`) so you can reuse a `Generator` later for a specific image.

```python
>>> generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(4)]
```

Generate the images and have a look.

```python
>>> images = pipe(prompt, generator=generator, num_images_per_prompt=4).images
>>> images
```

![img](https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/reusabe_seeds.jpg)

In this example you'll improve upon the first image, but in reality you can use any image you want (even the image with double sets of eyes!). The first image used the `Generator` with seed `0`, so you'll reuse that `Generator` for the second round of inference. To improve the quality of the image, add some additional text to the prompt:

```python
prompt = [prompt + t for t in [", highly realistic", ", artsy", ", trending", ", colorful"]]
generator = [torch.Generator(device="cuda").manual_seed(0) for i in range(4)]
```

Create four generators with seed `0` and generate another batch of images, all of which should look like the first image from the previous round!

```python
>>> images = pipe(prompt, generator=generator).images
>>> images
```

![img](https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/reusabe_seeds_2.jpg)