Compare commits

..

447 Commits

Author SHA1 Message Date
Pedro Cuenca
88e6b7edbd UniPC: use same default for solver_type
Fixes a bug when switching to UniPC from another scheduler (e.g.
DEIS) that uses a different solver type. The solver type is now the same as
if we had instantiated the scheduler directly.
2023-07-04 14:46:55 +02:00
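
A minimal sketch of the scenario this fix addresses, assuming a standard Stable Diffusion checkpoint (the model id below is only an example): swapping schedulers through `from_config` should leave UniPC with its own `solver_type` default.

```python
from diffusers import DiffusionPipeline, DEISMultistepScheduler, UniPCMultistepScheduler

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)

# Before the fix, UniPC could pick up the incompatible solver_type carried over
# from the DEIS config; now it behaves the same as a directly instantiated
# UniPCMultistepScheduler.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
print(pipe.scheduler.config.solver_type)
```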
Patrick von Platen
2837d49079 Fix failing np tests (#3942)
* Fix failing np tests

* Apply suggestions from code review

* Update tests/pipelines/test_pipelines_common.py
2023-07-04 14:00:43 +02:00
Prathik Rao
1997614aa9 avoid upcasting by assigning dtype to noise tensor (#3713)
* avoid upcasting by assigning dtype to noise tensor

* make style

* Update train_unconditional.py

* Update train_unconditional.py

* make style

* add unit test for pickle

* revert change

---------

Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-07-04 07:19:49 +05:30
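
The idea behind the dtype change in #3713, as a minimal sketch (shapes and dtype are illustrative): draw the noise in the same dtype as the model inputs so mixed-precision training does not silently upcast.

```python
import torch

model_dtype = torch.float16
clean_images = torch.randn(4, 3, 64, 64, dtype=model_dtype)

# A bare torch.randn(clean_images.shape) would default to float32 and force an
# upcast; passing dtype/device (or using randn_like) keeps the training dtype.
noise = torch.randn(clean_images.shape, dtype=clean_images.dtype, device=clean_images.device)
assert noise.dtype == model_dtype
```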
Patrick von Platen
4e898560ce revert automatic chunking (#3934)
* revert automatic chunking

* Apply suggestions from code review

* revert automatic chunking
2023-07-03 23:12:41 +02:00
Patrick von Platen
332d2bbea3 Improve memory text to video (#3930)
* Improve memory text to video

* Apply suggestions from code review

* add test

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* finish test setup

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-07-03 18:17:34 +02:00
estelleafl
b8a5dda56e [ldm3d] Update code to be functional with the new checkpoints (#3875)
* fixed typo

* updated doc to be consistent in naming

* make style/quality

* preprocessing for 4 channels and not 6

* make style

* test for 4c

* make style/quality

* fixed test on cpu

---------

Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu33.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu38.rr.intel.com>
2023-07-03 18:15:46 +02:00
Andrés Mauricio Repetto Ferrero
572d8e2002 Adding a better way to define multiple concepts and validation capabilities. (#3807)
* - Added validation parameters
- Changed some parameter descriptions to better explain their use.
- Fixed a few typos.
- Added concept_list parameter for better management of multiple subjects.
- Changed logic for image validation.

* - Fixed bad logic for class data root directories

* Defaulting validation_steps to None for an easier logic

* Fixed multiple validation prompts

* Fixed bug on validation negative prompt

* Changed validation logic for tracker.

* Added uuid for validation image labeling

* Fix error when comparing validation prompts and validation negative prompts

* Improved error message when there are more validation negative prompts than validation prompts

* - Changed image tracking number from epoch to global_step
- Added Typing for functions

* Added more validations when using the concept_list parameter together with the regular ones.

* Fixed error message

* Added more validations for validation parameters

* Improved messaging for errors

* Fixed validation error for parameters with default values

* - Added train step to image name for validation
- reformatted code

* - Added train step to image's name for validation
- reformatted code

* Updated README.md file.

* reverted back original script of train_dreambooth.py

* reverted back original script of train_dreambooth.py

* left one blank line at the eof

* reverted back setup.py

* reverted back setup.py

* added the same logic for when prior-preservation parameters are used without enabling the flag while using the concept_list parameter.

* Ran black formatter.

* fixed a few strings

* fixed import sort with isort and removed fstrings without placeholder

* fixed import order with ruff (since with isort wasn't ok)

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-03 17:55:45 +02:00
Patrick von Platen
2e8668f0af Correct controlnet out of list error (#3928)
* Correct controlnet out of list error

* Apply suggestions from code review

* correct tests

* correct tests

* fix

* test all

* Apply suggestions from code review

* test all

* test all

* Apply suggestions from code review

* Apply suggestions from code review

* fix more tests

* Fix more

* Apply suggestions from code review

* finish

* Apply suggestions from code review

* Update src/diffusers/schedulers/scheduling_k_dpm_2_ancestral_discrete.py

* finish
2023-07-03 15:10:07 +02:00
Aisuko
b298484fd0 fix/doc: no import torch issue (#3923)
fix/doc: no import torch issue

Signed-off-by: GitHub <noreply@github.com>
2023-07-03 12:28:42 +02:00
Aisuko
f911287cc9 fix/doc-code: Updating to the latest version parameters (#3924)
fix/doc-code: update to use the new parameter

Signed-off-by: GitHub <noreply@github.com>
2023-07-03 12:28:05 +02:00
Patrick von Platen
62825064bf Add video img2img (#3900)
* Add image to image video

* Improve

* better naming

* make fix copies

* add docs

* finish tests

* trigger tests

* make style

* correct

* finish

* Fix more

* make style

* finish
2023-07-02 13:19:27 +02:00
Aisuko
5439e917ca fix/docs: Fix the broken doc links (#3897)
* fix/docs: Fix the broken doc links

Signed-off-by: GitHub <noreply@github.com>

* Update docs/source/en/using-diffusers/write_own_pipeline.mdx

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-07-01 08:07:59 +02:00
Steven Liu
174dcd697f [docs] Model API (#3562)
* add modelmixin and unets

* remove old model page

* minor fixes

* fix unet2dcondition

* add vqmodel and autoencoderkl

* add rest of models

* fix autoencoderkl path

* fix toctree

* fix toctree again

* apply feedback

* apply feedback

* fix copies

* fix controlnet copy

* fix copies
2023-06-29 17:24:39 -07:00
takuoko
cdf2ae8a84 [Enhance] Add LoRA rank args in train_text_to_image_lora (#3866)
* add rank args in lora finetune

* del network_alpha
2023-06-29 17:09:59 +05:30
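
Roughly what the new rank argument enables in the LoRA fine-tuning script, as a sketch (the `--rank` flag name and the hidden sizes below are illustrative):

```python
import argparse
from diffusers.models.attention_processor import LoRAAttnProcessor

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, default=4, help="Dimension of the LoRA update matrices.")
args = parser.parse_args(["--rank", "8"])

# Each attention processor is built with the user-chosen rank instead of a hard-coded 4.
lora_proc = LoRAAttnProcessor(hidden_size=320, cross_attention_dim=768, rank=args.rank)
```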
Sayak Paul
49949f321d [Tests] add test for checking soft dependencies. (#3847)
* add test for checking soft dependencies.

* address patrick's comments.

* dependency tests should not run twice.

* debugging.

* up.
2023-06-28 22:05:25 +05:30
Uranus
c7469ebe74 fix sde add noise typo (#3839)
* fix sde typo

* fix code style
2023-06-28 15:44:29 +02:00
Wadim Korablin
150013060e Support for manual CLIP loading in StableDiffusionPipeline - txt2img. (#3832)
* Support for manual CLIP loading in StableDiffusionPipeline - txt2img.

* Update src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py

* Update variables & corresponding docs to match previous style.

* Updated to match style & quality of 'diffusers'

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-28 15:29:48 +02:00
Patrick von Platen
219636f7e4 improve tolerance 2023-06-28 13:29:36 +00:00
Vincent Neemie
35bac5edec Fixing the global_step key not found (#3844)
* Fixing the global_step key not found

* Apply suggestions from code review

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-28 14:36:33 +02:00
Saurav Maheshkar
0bf6aeb885 feat: rename single-letter vars in resnet.py (#3868)
feat: rename single-letter vars
2023-06-28 13:31:32 +02:00
Joachim Blaafjell Holwech
9a45d7fb76 Add guidance start/stop (#3770)
* Add guidance start/stop

* Add guidance start/stop to inpaint class

* Black formatting

* Add support for guidance for multicontrolnet

* Add inclusive end

* Improve design

* correct imports

* Finish

* Finish all

* Correct more

* make style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-27 01:04:11 +02:00
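
A hedged usage sketch of the guidance window this PR adds (model ids are examples and the control image here is a blank placeholder; in practice it would be a Canny edge map or similar):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = Image.new("RGB", (512, 512))  # placeholder control image

# Apply ControlNet conditioning only between 20% and 80% of the denoising schedule.
image = pipe(
    "a bird perched on a branch",
    image=canny_image,
    control_guidance_start=0.2,
    control_guidance_end=0.8,
).images[0]
```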
regisss
61916fefc4 Update Habana Gaudi doc (#3863)
* Update Habana Gaudi doc

* Fix typo
2023-06-24 21:17:11 +02:00
Sayak Paul
fc6acb6b97 [Docs] add: contributor note in the paradigms docs. (#3852)
add: contributor note in the paradigms docs.
2023-06-22 17:54:35 +05:30
Patrick von Platen
5e3f8fff40 Fix some audio tests (#3841)
* Fix some audio tests

* make style

* fix

* make style
2023-06-22 13:53:27 +02:00
Patrick von Platen
5df2acf7d2 [Conversion] Small fixes (#3848)
* [Conversion] Small fixes

* Update src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py
2023-06-22 13:52:59 +02:00
Patrick von Platen
88d269461c Correct bad attn naming (#3797)
* relax tolerance slightly

* correct incorrect naming

* correct naming

* correct more

* Apply suggestions from code review

* Fix more

* Correct more

* correct incorrect naming

* Update src/diffusers/models/controlnet.py

* Correct flax

* Correct renaming

* Correct blocks

* Fix more

* Correct more

* make style

* make style

* make style

* make style

* make style

* Fix flax

* make style

* rename

* rename

* rename attn head dim to attention_head_dim

* correct flax

* make style

* improve

* Correct more

* make style

* fix more

* make style

* Update src/diffusers/models/controlnet_flax.py

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-06-22 13:52:48 +02:00
Robert Dargavel Smith
0c6d1bc985 fix audio_diffusion tests (#3850) 2023-06-22 12:27:39 +02:00
Sayak Paul
13e781f9a5 fix: random module seeding (#3846) 2023-06-22 12:26:55 +02:00
Will Berman
0bab447670 relax tol attention conversion test (#3842) 2023-06-21 12:35:38 -07:00
Steven Liu
1f02087607 [docs] More API stuff (#3835)
* clean up loaders

* clean up rest of main class apis

* apply feedback
2023-06-21 11:07:23 -07:00
YiYi Xu
95ea538c79 Add ddpm kandinsky (#3783)
* update doc

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-06-21 07:23:18 -10:00
Hans Brouwer
ef3844d3a8 Support ControlNet models with different number of channels in control images (#3815)
support ControlNet models with a different hint_channels value (e.g. TemporalNet2)
2023-06-21 13:11:45 +02:00
dqueue
3ebbaf7c96 Update control_brightness.mdx (#3825) 2023-06-20 14:09:51 +02:00
Andy Shih
73b125df68 [Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models (#3716)
* add paradigms parallel sampling pipeline

* linting

* ran make fix-copies

* add paradigms parallel sampling pipeline

* linting

* ran make fix-copies

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* changes based on review

* add docs for paradigms

* update docs with paradigms abstract

* improve documentation, and add tests for ddim/ddpm batch_step_no_noise

* fix docs and run make fix-copies

* minor changes to docs.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* move parallel scheduler to new classes for DDPMParallelScheduler and DDIMParallelScheduler

* remove changes for scheduling_ddim, adjust licenses, credits, and commented code

* fix tensor type that is breaking tests

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-20 15:04:26 +05:30
Sayak Paul
88eb04489d [Docs] add missing pipelines from the overview pages and minor fixes (#3795)
* add entry for safe stable diffusion to the sd overview page.

* add missing pipelines to the broader overview section in the pipelines.

* address PR feedback.
2023-06-20 11:15:21 +05:30
Sayak Paul
4870626728 [Examples] Improve the model card pushed from the train_text_to_image.py script (#3810)
* refactor: readme serialized from the example when push_to_hub is True.

* fix: batch size arg.

* a bit better formatting

* minor fixes.

* add note on env.

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* condition wandb info better

* make mixed_precision assignment in cli args explicit.

* separate inference block for sample images.

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* address more comments.

* autocast mode.

* correct none image type problem.

* fix: list assignment.

* minor fix.

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-06-20 08:59:41 +05:30
estelleafl
666743302f [ldm3d] Fixed small typo (#3820)
* fixed typo

* updated doc to be consistent in naming

* make style/quality

---------

Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com>
2023-06-19 17:38:02 +02:00
Steven Liu
f7cc9adc05 [docs] Zero SNR (#3776)
* add zero snr doc

* fix image link

* apply feedback

* separate page
2023-06-16 13:19:37 -07:00
Will Berman
59aefe9ea6 device map legacy attention block weight conversion (#3804) 2023-06-16 10:39:20 -07:00
Will Berman
3ddc2b7395 [train text to image] add note to loading from checkpoint (#3806)
add note to loading from checkpoint
2023-06-16 11:54:49 +05:30
Will Berman
d49e2dd54c manual check for checkpoints_total_limit instead of using accelerate (#3681)
* manual check for checkpoints_total_limit instead of using accelerate

* remove controlnet_conditioning_embedding_out_channels
2023-06-15 15:38:54 -07:00
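
A minimal sketch of the manual check described above (directory layout and names are assumptions): prune the oldest `checkpoint-*` folders before saving a new one so at most `--checkpoints_total_limit` remain.

```python
import os
import shutil

def prune_checkpoints(output_dir: str, total_limit: int) -> None:
    checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
    checkpoints = sorted(checkpoints, key=lambda name: int(name.split("-")[1]))
    if len(checkpoints) >= total_limit:
        # leave room for the checkpoint that is about to be written
        num_to_remove = len(checkpoints) - total_limit + 1
        for folder in checkpoints[:num_to_remove]:
            shutil.rmtree(os.path.join(output_dir, folder))
```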
Isotr0py
7bfd2375c7 fix typo (#3800) 2023-06-15 22:00:47 +05:30
Patrick von Platen
ea8ae8c639 Complete set_attn_processor for prior and vae (#3796)
* relax tolerance slightly

* Add more tests

* upload readme

* upload readme

* Apply suggestions from code review

* Improve API Autoencoder KL

* finalize

* finalize tests

* finalize tests

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* up

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-06-15 17:42:49 +02:00
estelleafl
958d9ec723 Ldm3d first PR (#3668)
* added ldm3d pipeline and updated image processor to support depth

* added description

* added paper reference

* added docs

* fixed bug

* added test

* Update tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* added reference in index.mdx

* reverted changes to image processor

* added LDM3DOutput

* Fixes with make style

* fix failing tests for make fix-copies

* aligned with our version

* Update pipeline_stable_diffusion_ldm3d.py

updated the guidance scale

* Fix for failing check_code_quality test

* Code review feedback

* Fix typo in ldm3d_diffusion.mdx

* updated the doc accordingly

* copyrights

* fixed test failure

* make style

* added image processor of LDM3D in the documentation:

* added ldm3d doc to toctree

* run make style && make quality

* run make fix-copies

* Update docs/source/en/api/image_processor.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* updated the safety checker to accept tuple

* make style and make quality

* Update src/diffusers/pipelines/stable_diffusion/__init__.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* LDM3D output

* up

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Aflalo <estellea@isl-gpu27.rr.intel.com>
Co-authored-by: Anahita Bhiwandiwalla <anahita.bhiwandiwalla@intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu26.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Aflalo <estellea@isl-gpu42.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu43.rr.intel.com>
2023-06-15 17:36:52 +02:00
Sayak Paul
77f9137f10 feat: add PR template. (#3786)
* feat: add PR template.

* address pr comments.

* Update .github/PULL_REQUEST_TEMPLATE.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-15 19:41:54 +05:30
Naga Sai Abhinay
231bdf2e56 UnCLIP Image Interpolation -> Keep same initial noise across interpolation steps (#3782)
* Maintain same decoder start noise for all interp steps

* Correct comment

* use batch_size for consistency
2023-06-15 15:15:40 +02:00
Arpan Tripathi
75124fc91e Added LoRA loading to StableDiffusionKDiffusionPipeline (#3751)
Added `LoraLoaderMixin` to `StableDiffusionKDiffusionPipeline`
2023-06-15 15:09:44 +02:00
Patrick von Platen
908e5e9cc6 Fix some bad comment in training scripts (#3798)
* relax tolerance slightly

* correct incorrect naming
2023-06-15 15:07:51 +02:00
cmdr2
2715079344 Fix broken cpu-offloading in legacy inpainting SD pipeline (#3773) 2023-06-15 14:56:40 +02:00
takuoko
1ae15fa64c [Enhance] Update reference (#3723)
* update reference pipeline

* update reference pipeline

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-15 14:34:12 +02:00
Sayak Paul
027a365a62 [Bug Report template] modify the issue template to include core maintainers. (#3785)
* modify the issue template to include core maintainers.

* add: entry for audio.

* Update .github/ISSUE_TEMPLATE/bug-report.yml

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-15 07:43:07 +05:30
Steven Liu
f96b760658 [docs] Fix Colab notebook cells (#3777)
fix colab notebook cells
2023-06-14 10:21:39 -07:00
YiYi Xu
7761b89d7b update conversion script for Kandinsky unet (#3766)
* update kandinsky conversion script

* style

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-06-14 06:57:53 -10:00
jfozard
ce5504934a Update pipeline_flax_stable_diffusion_controlnet.py (#3306)
Update pipeline_flax_controlnet.py

Change type of images array from jax.numpy.array to numpy.ndarray to permit in-place modification of the array when the safety checker detects an NSFW image.
2023-06-12 14:25:46 -10:00
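
Why the type change matters, sketched with toy shapes: JAX arrays are immutable, so blanking out flagged images in place requires converting to a NumPy array first.

```python
import numpy as np
import jax.numpy as jnp

images = jnp.ones((2, 512, 512, 3))            # JAX array: immutable
has_nsfw = np.array([False, True])

images = np.asarray(images)                    # now a mutable numpy.ndarray
images[has_nsfw] = np.zeros(images.shape[1:])  # in-place replacement of flagged images
```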
Patrick von Platen
34d14d7848 [MultiControlNet] Allow save and load (#3747)
* [MultiControlNet] Allow save and load

* Correct more

* [MultiControlNet] Allow save and load

* make style

* Apply suggestions from code review
2023-06-12 18:29:58 +02:00
Patrick von Platen
ef9590712a [Tests] Relax tolerance of flaky failing test (#3755)
relax tolerance slightly
2023-06-12 18:28:30 +02:00
Andranik Movsisyan
a812fb6f5c Text2video zero refinements (#3733)
* fix docs typos. add frame_ids argument to text2video-zero pipeline call

* make style && make quality

* add support of pytorch 2.0 scaled_dot_product_attention for CrossFrameAttnProcessor

* add chunk-by-chunk processing to text2video-zero docs

* make style && make quality

* Update docs/source/en/api/pipelines/text_to_video_zero.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-06-12 18:03:18 +02:00
Liam Swayne
f46b22ba13 [documentation] grammatical fixes in installation.mdx (#3735)
Update installation.mdx
2023-06-12 17:42:01 +02:00
JeLuF
b2b13cd315 [Documentation] Replace dead link to Flax install guide (#3739)
Replace dead link to Flax documentation

Replace the dead link to the Flax installation guide with a working one: https://flax.readthedocs.io/en/latest/#installation
2023-06-12 17:40:48 +02:00
Patrick von Platen
38adcd21bd [Stable Diffusion Inpaint & ControlNet inpaint] Correct timestep inpaint (#3749)
* Correct timestep inpaint

* make style

* Fix

* Apply suggestions from code review

* make style
2023-06-12 13:59:38 +02:00
Patrick von Platen
790212f4d9 Correct another push token (#3745)
clean up more
2023-06-12 10:29:23 +02:00
Patrick von Platen
11aa105077 Correct Token to upload docs (#3744)
clean up more
2023-06-12 10:04:45 +02:00
Patrick von Platen
abbfe4b5b7 fix zh 2023-06-10 17:54:55 +02:00
Patrick von Platen
1d50f47a58 Merge branch 'main' of https://github.com/huggingface/diffusers 2023-06-10 17:04:59 +02:00
Patrick von Platen
e891b00dfc build docs 2023-06-10 16:58:59 +02:00
Patrick von Platen
27af55d1b4 build docs 2023-06-10 16:56:41 +02:00
YiYi Xu
05361960f2 remove seed (#3734)
* remove seed

* style

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-06-09 08:27:02 -10:00
Patrick von Platen
c42f6ee43e Post 0.17.0 release (#3721)
* Post release

* Post release
2023-06-08 18:08:49 +02:00
Patrick von Platen
f523b11a10 Fix loading if unexpected keys are present (#3720)
* Fix loading

* make style
2023-06-08 16:48:06 +02:00
Zachary Mueller
79fa94ea8b Apply deprecations from Accelerate (#3714)
Apply deprecations
2023-06-08 16:44:22 +02:00
Patrick von Platen
a06317abea [Actions] Fix actions (#3712) 2023-06-07 18:57:28 +01:00
YiYi Xu
500a3ff9ef [docs] add image processor documentation (#3710)
add image processor

Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
2023-06-07 18:35:07 +01:00
Mishig
8caa530069 [doc build] Use secrets (#3707)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-07 18:21:16 +01:00
Kadir Nar
cd6186907c [Community] Support StableDiffusionCanvasPipeline (#3590)
* added StableDiffusionCanvasPipeline pipeline

* Added utils codes to pipe_utils file.

* make style

* delete mixture.py and Text2ImageRegion class

* make style

* Added the codes to the readme.md file.

* Moved functions from pipeline_utils to mix_canvas
2023-06-07 17:43:33 +01:00
Patrick von Platen
803d653748 Fix custom releases (#3708)
* Fix custom releases

* make style
2023-06-07 17:33:54 +01:00
Alex McKinney
cd9d0913d9 Fixes eval generator init in train_text_to_image_lora.py (#3678) 2023-06-07 15:37:13 +05:30
Pedro Cuenca
fdec23188a [Tests] Run slow matrix sequentially (#3500)
[tests] Run slow matrix sequentially.
2023-06-07 11:01:35 +01:00
Max-We
12a232efa9 Fix schedulers zero SNR and rescale classifier free guidance (#3664)
* Implement option for rescaling betas to zero terminal SNR

* Implement rescale classifier free guidance in pipeline_stable_diffusion.py

* focus on DDIM

* make style

* make style

* make style

* make style

* Apply suggestions from Peter Lin

* Apply suggestions from Peter Lin

* make style

* Apply suggestions from code review

* Apply suggestions from code review

* make style

* make style

---------

Co-authored-by: MaxWe00 <gitlab.9v1lq@slmail.me>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-07 10:57:10 +01:00
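
A sketch of the two options this PR introduces (model id is an example): `rescale_betas_zero_snr` on the scheduler and `guidance_rescale` at call time, following the "Common Diffusion Noise Schedules and Sample Steps are Flawed" recipe.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
pipe = pipe.to("cuda")

# guidance_rescale tempers the over-exposure that classifier-free guidance can
# cause once the noise schedule is rescaled to zero terminal SNR.
image = pipe("a photo of an astronaut riding a horse", guidance_rescale=0.7).images[0]
```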
Patrick von Platen
74fd735eb0 Add draft for lora text encoder scale (#3626)
* Add draft for lora text encoder scale

* Improve naming

* fix: training dreambooth lora script.

* Apply suggestions from code review

* Update examples/dreambooth/train_dreambooth_lora.py

* Apply suggestions from code review

* Apply suggestions from code review

* add lora mixin when fit

* add lora mixin when fit

* add lora mixin when fit

* fix more

* fix more

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-06-06 22:47:46 +01:00
Jason C.H
2de9e2df36 Fix from_ckpt for Stable Diffusion 2.x (#3662) 2023-06-06 22:39:11 +01:00
Isotr0py
11b3002b48 Support views batch for panorama (#3632)
* support views batch for panorama

* add entry for the new argument

* format entry for the new argument

* add view_batch_size test

* fix batch test and a boundary condition

* add more docstrings

* fix a typo

* fix typos

* add: entry to the doc about view_batch_size.

* Revert "add: entry to the doc about view_batch_size."

This reverts commit a36aeaa9ed.

* add a tip on .

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-06-07 02:50:02 +05:30
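
Usage sketch for the new `view_batch_size` argument (model id and sizes are examples): several sliding-window views are denoised per UNet call, trading memory for speed.

```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of the dolomites",
    height=512,
    width=2048,
    view_batch_size=4,  # denoise 4 panorama views per forward pass
).images[0]
```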
stano
10f4ecd177 Fix the Kandinsky docstring examples (#3695)
- use the correct Prior hub model id
- use the new names in KandinskyPriorPipelineOutput
2023-06-06 22:18:14 +01:00
Sayak Paul
de16f64667 feat: when using PT 2.0 use LoRAAttnProcessor2_0 for text enc LoRA. (#3691) 2023-06-06 21:20:53 +01:00
YiYi Xu
017ee1609b refactor Image processor for x4 upscaler (#3692)
* refactor x4 upscaler

* style

* copies

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-06-06 21:08:36 +01:00
Sayak Paul
8669e8313d [LoRA] feat: add lora attention processor for pt 2.0. (#3594)
* feat: add lora attention processor for pt 2.0.

* explicit context manager for SDPA.

* switch to flash attention

* make shapes compatible to work optimally with SDPA.

* fix: circular import problem.

* explicitly specify the flash attention kernel in sdpa

* fall back to efficient attention context manager.

* remove explicit dispatch.

* fix: removed processor.

* fix: remove optional from type annotation.

* feat: make changes regarding LoRAAttnProcessor2_0.

* remove confusing warning.

* formatting.

* relax tolerance for PT 2.0

* fix: loading message.

* remove unnecessary logging.

* add: entry to the docs.

* add: network_alpha argument.

* relax tolerance.
2023-06-06 14:56:05 +05:30
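
A small sketch of what the PT 2.0 LoRA processor relies on (sizes are illustrative): it dispatches to `torch.nn.functional.scaled_dot_product_attention` instead of the hand-rolled attention path.

```python
import torch.nn.functional as F
from diffusers.models.attention_processor import LoRAAttnProcessor2_0

# Only available on PyTorch 2.0+, where SDPA exists.
assert hasattr(F, "scaled_dot_product_attention")

proc = LoRAAttnProcessor2_0(hidden_size=320, cross_attention_dim=768, rank=4)
```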
Takuma Mori
b45204ea5a Add function to remove monkey-patch for text encoder LoRA (#3649)
* merge undoable-monkeypatch

* remove TEXT_ENCODER_TARGET_MODULES, refactoring

* move create_lora_weight_file
2023-06-06 14:06:13 +05:30
Steven Liu
a8b0f42c38 [docs] Fix link to loader method (#3680)
fix link to load_lora_weights
2023-06-06 13:37:47 +05:30
Will Berman
41ae670828 move activation dispatches into helper function (#3656)
* move activation dispatches into helper function

* tests
2023-06-05 12:30:48 -07:00
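
The helper introduced here, in a one-line sketch: activation names from model configs resolve through a single dispatch function rather than scattered if/elif chains.

```python
from diffusers.models.activations import get_activation

silu = get_activation("silu")  # returns torch.nn.SiLU()
gelu = get_activation("gelu")  # returns torch.nn.GELU()
```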
Will Berman
462956be7b small tweaks for parsing thibaudz controlnet checkpoints (#3657) 2023-06-05 10:24:31 -07:00
YiYi Xu
5990014700 [WIP]Vae preprocessor refactor (PR1) (#3557)
VaeImageProcessor.preprocess refactor

* refactored VaeImageProcessor
   - allow passing optional height and width arguments to resize()
   - add convert_to_rgb
* refactored the prepare_latents method for img2img pipelines so that latents passed directly as the image input are not encoded again
* added a test in test_pipelines_common.py to test latents as image inputs
* refactored the img2img pipelines that accept latents as image:
   - controlnet img2img, stable diffusion img2img, instruct_pix2pix

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-06-05 07:11:00 -10:00
Steven Liu
1a6a647e06 [docs] More API fixes (#3640)
* part 2 of api fixes

* move randn_tensor

* add to toctree

* apply feedback

* more feedback
2023-06-05 09:47:26 -07:00
Sayak Paul
995bbcb9aa [UniDiffuser test] fix one test so that it runs correctly on V100 (#3675)
* fix: assertion.

* assertion fix.
2023-06-05 17:42:31 +05:30
pdoane
d0416ab090 Update Compel documentation for textual inversions (#3663)
* Update Compel documentation for textual inversions

* Fix typo
2023-06-05 16:46:27 +05:30
Vladislav Lyubimov
1994dbcb5e Fix from_ckpt not working properly on windows (#3666) 2023-06-05 11:55:37 +01:00
Patrick von Platen
262d539a8a Correct multi gpu dreambooth (#3673)
Correct multi gpu
2023-06-05 11:03:11 +01:00
Will Berman
0fc2fb71c1 dreambooth upscaling fix added latents (#3659) 2023-06-05 10:32:16 +01:00
Steven Liu
523a50a8eb [docs] Load A1111 LoRA (#3629)
* load a1111 lora

* fix

* apply feedback

* fix
2023-06-05 11:05:42 +05:30
0x1355
de45af4a46 Allow setting num_cycles for cosine_with_restarts lr scheduler (#3606)
Expose num_cycles kwarg of get_schedule() through args.lr_num_cycles.
2023-06-05 10:18:29 +05:30
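
What `--lr_num_cycles` plumbs through, as a sketch (optimizer and step counts are placeholders): the `num_cycles` kwarg of `get_scheduler` for the cosine-with-restarts schedule.

```python
import torch
from torch.optim import AdamW
from diffusers.optimization import get_scheduler

optimizer = AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
lr_scheduler = get_scheduler(
    "cosine_with_restarts",
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=10_000,
    num_cycles=3,  # the value a training script would take from args.lr_num_cycles
)
```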
0x1355
b95cbdf6fc Set step_rules correctly for piecewise_constant scheduler (#3605)
So that schedule_func() calls get_piecewise_constant_schedule() with a correctly named kwarg.
2023-06-05 10:16:26 +05:30
Will Berman
7a39691362 linting fix (#3653) 2023-06-02 13:33:19 -07:00
Will Berman
5911a3aa47 dreambooth if docs - stage II, more info (#3628)
* dreambooth if docs - stage II, more info

* Update docs/source/en/training/dreambooth.mdx

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update docs/source/en/training/dreambooth.mdx

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update docs/source/en/training/dreambooth.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* download instructions for downsized images

* update source README to match docs

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-06-02 10:37:13 -07:00
Will Berman
b7af946138 set config from original module but set compiled module on class (#3650)
* set config from original module but set compiled module on class

* add test
2023-06-02 10:26:41 -07:00
asfiyab-nvidia
d3717e6368 add Stable Diffusion TensorRT Inpainting pipeline (#3642)
* add tensorrt inpaint pipeline

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* run make style

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-06-02 18:14:31 +01:00
Kadir Nar
0dbdc0cbae [Community Doc] Updated the filename and readme file. (#3634)
* Updated the filename and readme file.

* reformatter

* reformatter
2023-06-02 17:53:09 +01:00
YiYi Xu
0e8688113a fix inpainting pipeline when providing initial latents (#3641)
* fix latents

* fix copies

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-06-02 17:03:15 +01:00
Kashif Rasul
f1d4743394 fixed typo in example train_text_to_image.py (#3608)
fixed typo
2023-06-02 20:54:54 +05:30
Lachlan Nicholson
a6c7b5b6b7 Iterate over unique tokens to avoid duplicate replacements for multivector embeddings (#3588)
* iterate over unique tokens to avoid duplicate replacements

* added test for multiple references to multi embedding

* adhere to black formatting

* reorder test post-rebase
2023-06-02 16:10:22 +01:00
Takuma Mori
8e552bb4fe Support Kohya-ss style LoRA file format (in a limited capacity) (#3437)
* add _convert_kohya_lora_to_diffusers

* make style

* add scaffold

* match result: unet attention only

* fix monkey-patch for text_encoder

* with CLIPAttention

While the terrible images are no longer produced,
the results do not match those from the hook version.
This may be due to not setting the network_alpha value.

* add to support network_alpha

* generate diff image

* fix monkey-patch for text_encoder

* add test_text_encoder_lora_monkey_patch()

* verify that it's okay to release the attn_procs

* fix closure version

* add comment

* Revert "fix monkey-patch for text_encoder"

This reverts commit bb9c61e6fa.

* Fix to reuse utility functions

* make LoRAAttnProcessor targets to self_attn

* fix LoRAAttnProcessor target

* make style

* fix split key

* Update src/diffusers/loaders.py

* remove TEXT_ENCODER_TARGET_MODULES loop

* add print memory usage

* remove test_kohya_loras_scaffold.py

* add: doc on LoRA civitai

* remove print statement and refactor in the doc.

* fix state_dict test for kohya-ss style lora

* Apply suggestions from code review

Co-authored-by: Takuma Mori <takuma104@gmail.com>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-06-02 17:40:24 +05:30
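
A usage sketch for the Kohya-ss support (model id is an example and the weight file name is a placeholder for a civitai-style `.safetensors` LoRA on disk):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# "." stands in for the local directory holding the LoRA file
pipe.load_lora_weights(".", weight_name="my_kohya_lora.safetensors")
image = pipe("masterpiece, best quality, a portrait photo", num_inference_steps=30).images[0]
```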
Patrick von Platen
32ea2142c0 [Kandinsky] Improve kandinsky API a bit (#3636)
* Improve docs

* up

* Update docs/source/en/api/pipelines/kandinsky.mdx

* up

* up

* correct more

* further improve

* Update docs/source/en/api/pipelines/kandinsky.mdx

Co-authored-by: YiYi Xu <yixu310@gmail.com>

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>
2023-06-02 08:57:20 +01:00
Sayak Paul
55dbfa0229 [Docs] include the instruction-tuning blog link in the InstructPix2Pix docs (#3644)
include the instruction-tuning blog link.
2023-06-02 08:04:35 +05:30
Will Berman
4f14b36329 Full Dreambooth IF stage II upscaling (#3561)
* update dreambooth lora to work with IF stage II

* Update dreambooth script for IF stage II upscaler
2023-05-31 09:39:31 -07:00
Will Berman
f751b8844e update dreambooth lora to work with IF stage II (#3560) 2023-05-31 09:39:03 -07:00
Prathik Rao
abb89da4de update code to reflect latest changes as of May 30th (#3616)
* update code to reflect latest changes as of May 30th

* update text to image example

* reflect changes to textual inversion

* make style

* fix typo

* Revert unnecessary readme changes

---------

Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-05-31 11:29:04 +02:00
Will Berman
7d0ac4eeab goodbye frog (#3617) 2023-05-30 23:18:01 +01:00
Patrick von Platen
0cc3a7a123 Make sure we also change the config when setting encoder_hid_dim_type=="text_proj" and allow xformers (#3615)
* fix if

* make style

* make style

* add tests for xformers

* make style

* update
2023-05-30 20:47:14 +01:00
Patrick von Platen
9d3ff0794d fix tests (#3614) 2023-05-30 18:59:07 +01:00
Patrick von Platen
a359ab4e29 Update README.md 2023-05-30 18:26:32 +01:00
Patrick von Platen
160c377ddc Make style 2023-05-30 13:14:09 +01:00
Denis
bb22d546c0 [Community] CLIP Guided Images Mixing with Stable DIffusion Pipeline (#3587)
* added clip_guided_images_mixing_stable_diffusion file and readme description

* apply pre-commit

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-30 13:13:45 +01:00
Greg Hunkins
799f5b4e12 [Feat] Enable State Dict For Textual Inversion Loader (#3439)
* enable state dict for textual inversion loader

* Empty-Commit | restart CI

* Empty-Commit | restart CI

* Empty-Commit | restart CI

* Empty-Commit | restart CI

* add tests

* fix tests

* fix tests

* fix tests

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-30 13:13:34 +01:00
takuoko
07ef4855cd [Community, Enhancement] Add reference tricks in README (#3589)
add reference tricks
2023-05-30 12:38:16 +01:00
Kadir Nar
6cbddf558a [Community] Support StableDiffusionTilingPipeline (#3586)
* added mixture pipeline

* added docstring

* update docstring
2023-05-30 12:24:15 +01:00
Rupert Menneer
35a740427e #3487 Fix inpainting strength for various samplers (#3532)
* Throw an error if the strength-adjusted num_inference_steps < 1

* Added a new fast test to check that ValueError is raised when num_inference_steps < 1

when strength reduces num_inference_steps below 1, the inpainting pipeline should fail

* fix #3487 initial latents are now only scaled by init_noise_sigma when pure noise

updated this commit w.r.t the latest merge here: https://github.com/huggingface/diffusers/pull/3533

* fix

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-30 12:17:42 +01:00
Sayak Paul
0612f48cd0 [UniDiffuser Tests] Fix some tests (#3609)
* fix: unidiffuser test failures.

* living room.
2023-05-30 12:07:18 +01:00
Kadir Nar
c059cc0992 [docs] update the broken links (#3577) 2023-05-30 11:44:53 +01:00
Patrick von Platen
c0f867afd1 Fix temb attention (#3607)
* Fix temb attention

* Apply suggestions from code review

* make style

* Add tests and fix docker

* Apply suggestions from code review
2023-05-30 11:26:23 +01:00
Sayak Paul
c6ae883751 remove print statements from attention processor. (#3592) 2023-05-29 09:20:31 +05:30
Steven Liu
5559d04237 [docs] Working with different formats (#3534)
* add ckpt

* fix format

* apply feedback

* fix

* include pb

* rename file
2023-05-26 14:37:51 -07:00
Brandon
9917c32916 [docs] update the broken links (#3568)
update the broken links

update the broken links for training folder doc
2023-05-26 12:10:32 -07:00
Steven Liu
ab986769f1 [docs] Maintenance (#3552)
* doc fixes

* fix latex

* parenthesis on inside
2023-05-26 12:04:15 -07:00
Will Berman
bdc75e753d [IF super res] correctly normalize PIL input (#3536)
* [IF super res] correctly normalize PIL input

* 175 -> 127.5
2023-05-26 10:59:44 -07:00
Leon Lin
1d1f648c6b fix dreambooth attention mask (#3541) 2023-05-26 10:58:50 -07:00
Takuma Mori
67cf0445ef Fix to apply LoRAXFormersAttnProcessor instead of LoRAAttnProcessor when xFormers is enabled (#3556)
* fix to use LoRAXFormersAttnProcessor

* add test

* using new LoraLoaderMixin.save_lora_weights

* add test_lora_save_load_with_xformers
2023-05-26 17:33:25 +05:30
dg845
352ca3198c [WIP] Add UniDiffuser model and pipeline (#2963)
* Fix a bug of pano when not doing CFG (#3030)

* Fix a bug of pano when not doing CFG

* enhance code quality

* apply formatting.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Text2video zero refinements (#3070)

* fix progress bar issue in pipeline_text_to_video_zero.py. Copy scheduler after first backward

* fix tensor loading in test_text_to_video_zero.py

* make style && make quality

* Release: v0.15.0

* [Tests] Speed up panorama tests (#3067)

* fix: norm group test for UNet3D.

* chore: speed up the panorama tests (fast).

* set default value of _test_inference_batch_single_identical.

* fix: batch_sizes default value.

* [Post release] v0.16.0dev (#3072)

* Adds profiling flags, computes train metrics average. (#3053)

* WIP controlnet training

- bugfix --streaming
- bugfix running report_to!='wandb'
- adds memory profile before validation

* Adds final logging statement.

* Sets train epochs to 11.

Looking at a longer ~16ep run, we see only good validation images
after ~11ep:

https://wandb.ai/andsteing/controlnet_fill50k/runs/3j2hx6n8

* Removes --logging_dir (it's not used).

* Adds --profile flags.

* Updates --output_dir=runs/fill-circle-{timestamp}.

* Compute mean of `train_metrics`.

Previously `train_metrics[-1]` was logged, resulting in very bumpy train
metrics.

* Improves logging a bit.

- adds l2_grads gradient norm logging
- adds steps_per_sec
- sets walltime as x coordinate of train/step
- logs controlnet_params config

* Adds --ccache (doesn't really help though).

* minor fix in controlnet flax example (#2986)

* fix the error when push_to_hub is set but validation is not logged

* controlnet_from_pt & controlnet_revision

* add intermediate checkpointing to the guide

* Bugfix --profile_steps

* Sets `TRACKER_PROJECT_NAME='controlnet_fill50k'`.

* Logs fractional epoch.

* Adds relative `walltime` metric.

* Adds `StepTraceAnnotation` and uses `global_step` instead of `step`.

* Applied `black`.

* Streamlines commands in README a bit.

* Removes `--ccache`.

This makes only a very small difference (~1 min) with this model size, so removing
the option introduced in cdb3cc.

* Re-ran `black`.

* Update examples/controlnet/README.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Converts spaces to tab.

* Removes repeated args.

* Skips first step (compilation) in profiling

* Updates README with profiling instructions.

* Unifies tabs/spaces in README.

* Re-ran style & quality.

---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* [Pipelines] Make sure that None functions are correctly not saved (#3080)

* doc string example remove from_pt (#3083)

* [Tests] parallelize (#3078)

* [Tests] parallelize

* finish folder structuring

* Parallelize tests more

* Correct saving of pipelines

* make sure logging level is correct

* try again

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Throw deprecation warning for return_cached_folder (#3092)

Throw deprecation warning

* Allow SD attend and excite pipeline to work with any size output images (#2835)

Allow stable diffusion attend and excite pipeline to work with any size output image. Re: #2476, #2603

* [docs] Update community pipeline docs (#2989)

* update community pipeline docs

* fix formatting

* explain sharing workflows

* Add support for Guess Mode to StableDiffusionControlNetPipeline (#2998)

* add guess mode (WIP)

* fix uncond/cond order

* support guidance_scale=1.0 and batch != 1

* remove magic coeff

* add docstring

* add integration test

* add document to controlnet.mdx

* made the comments a bit more explanatory

* fix table

* fix default value for attend-and-excite (#3099)

* fix default

* remove one line as requested by gc team (#3077)

remove one line

* ddpm custom timesteps (#3007)

add custom timesteps test

add custom timesteps descending order check

docs

timesteps -> custom_timesteps

can only pass one of num_inference_steps and timesteps
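
A sketch of the custom-timesteps API described in the bullet above (scheduler settings are defaults): an explicit descending list replaces `num_inference_steps`, and passing both is rejected.

```python
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(timesteps=[999, 750, 500, 250, 0])  # must be in descending order
print(scheduler.timesteps)
```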

* Fix breaking change in `pipeline_stable_diffusion_controlnet.py` (#3118)

fix breaking change

* Add global pooling to controlnet (#3121)

* [Bug fix] Fix img2img processor with safety checker (#3127)

Fix img2img processor with safety checker

* [Bug fix] Make sure correct timesteps are chosen for img2img (#3128)

Make sure correct timesteps are chosen for img2img

* Improve deprecation warnings (#3131)

* Fix config deprecation (#3129)

* Better deprecation message

* Better deprecation message

* Better doc string

* Fixes

* fix more

* fix more

* Improve __getattr__

* correct more

* fix more

* fix

* Improve more

* more improvements

* fix more

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* make style

* Fix all rest & add tests & remove old deprecation fns

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* feat: verification of multi-gpu support for select examples. (#3126)

* feat: verification of multi-gpu support for select examples.

* add: multi-gpu training sections to the relevant doc pages.

* speed up attend-and-excite fast tests (#3079)

* Optimize log_validation in train_controlnet_flax (#3110)

extract pipeline from log_validation

* make style

* Correct textual inversion readme (#3145)

* Update README.md

* Apply suggestions from code review

* Add unet act fn to other model components (#3136)

Adding act fn config to the unet timestep class embedding and conv
activation.

The custom activation defaults to silu, which is the default
activation function for both the conv act and the timestep class
embeddings, so default behavior is not changed.

The only unet which uses the custom activation is the stable diffusion
latent upscaler https://huggingface.co/stabilityai/sd-x2-latent-upscaler/blob/main/unet/config.json
(I ran a script against the hub to confirm).
The latent upscaler does not use the conv activation nor the timestep
class embeddings, so we don't change its behavior.

* class labels timestep embeddings projection dtype cast (#3137)

This mimics the dtype cast for the standard time embeddings

* [ckpt loader] Allow loading the Inpaint and Img2Img pipelines, while loading a ckpt model (#2705)

* [ckpt loader] Allow loading the Inpaint and Img2Img pipelines, while loading a ckpt model

* Address review comment from PR

* PyLint formatting

* Some more pylint fixes, unrelated to our change

* Another pylint fix

* Styling fix

* add from_ckpt method as Mixin (#2318)

* add mixin class for pipeline from original sd ckpt

* Improve

* make style

* merge main into

* Improve more

* fix more

* up

* Apply suggestions from code review

* finish docs

* rename

* make style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add TensorRT SD/txt2img Community Pipeline to diffusers along with TensorRT utils (#2974)

* Add SD/txt2img Community Pipeline to diffusers along with TensorRT utils

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* update installation command

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* update tensorrt installation

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* changes
1. Update setting of cache directory
2. Address comments: merge utils and pipeline code.
3. Address comments: Add section in README

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* apply make style

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Correct `Transformer2DModel.forward` docstring (#3074)

⚙️chore(transformer_2d) update function signature for encoder_hidden_states

* Update pipeline_stable_diffusion_inpaint_legacy.py (#2903)

* Update pipeline_stable_diffusion_inpaint_legacy.py

* fix preprocessing of PIL images with an adequate batch size

* revert map

* add tests

* reformat

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* next try to fix the style

* wth is this

* Update testing_utils.py

* Update testing_utils.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Modified altdiffusion pipeline to support altdiffusion-m18 (#2993)

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

---------

Co-authored-by: root <fulong_ye@163.com>

* controlnet training resize inputs to multiple of 8 (#3135)

controlnet training center crop input images to multiple of 8

The pipeline code resizes inputs to multiples of 8.
Not doing this resizing in the training script is causing
the encoded image to have different height/width dimensions
than the encoded conditioning image (which uses a separate
encoder that's part of the controlnet model).

We resize and center crop the inputs to make sure they're the
same size (as well as all other images in the batch). We also
check that the initial resolution is a multiple of 8.
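
A sketch of the preprocessing described above (resolution value is illustrative): both the target image and the conditioning image go through the same resize and center crop to a multiple of 8.

```python
from torchvision import transforms

resolution = 512
assert resolution % 8 == 0, "resolution must be a multiple of 8"

image_transforms = transforms.Compose(
    [
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),
    ]
)
```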

* adding custom diffusion training to diffusers examples (#3031)

* diffusers==0.14.0 update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion

* custom diffusion

* custom diffusion

* custom diffusion

* custom diffusion

* apply formatting and get rid of bare except.

* refactor readme and other minor changes.

* misc refactor.

* fix: repo_id issue and loaders logging bug.

* fix: save_model_card.

* fix: save_model_card.

* fix: save_model_card.

* add: doc entry.

* refactor doc.

* custom diffusion

* custom diffusion

* custom diffusion

* apply style.

* remove trailing whitespace.

* fix: toctree entry.

* remove unnecessary print.

* custom diffusion

* custom diffusion

* custom diffusion test

* custom diffusion xformer update

* custom diffusion xformer update

* custom diffusion xformer update

---------

Co-authored-by: Nupur Kumari <nupurkumari@Nupurs-MacBook-Pro.local>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Nupur Kumari <nupurkumari@nupurs-mbp.wifi.local.cmu.edu>

* make style

* Update custom_diffusion.mdx (#3165)

Add missing newlines for rendering the links correctly

* Added distillation for quantization example on textual inversion. (#2760)

* Added distillation for quantization example on textual inversion.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* refined readme and code style.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* Update text2images.py

* refined code of model load and added compatibility check.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* fixed code style.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* fix C403 [*] Unnecessary `list` comprehension (rewrite as a `set` comprehension)

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

---------

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* Update Noise Autocorrelation Loss Function for Pix2PixZero Pipeline (#2942)

* Update Pix2PixZero Auto-correlation Loss

* Add fast inversion tests

* Clarify purpose and mark as deprecated

Fix inversion prompt broadcasting

* Register modules set to `None` in config for `test_save_load_optional_components`

* Update new tests to coordinate with #2953

* [DreamBooth] add text encoder LoRA support in the DreamBooth training script (#3130)

* add: LoRA text encoder support for DreamBooth example.

* fix initialization.

* fix: modification call.

* add: entry in the readme.

* use dog dataset from hub.

* fix: params to clip.

* add entry to the LoRA doc.

* add: tests for lora.

* remove unnecessary list comprehension.

* Update Habana Gaudi documentation (#3169)

* Update Habana Gaudi doc

* Fix tables

* Add model offload to x4 upscaler (#3187)

* Add model offload to x4 upscaler

* fix

* [docs] Deterministic algorithms (#3172)

deterministic algos

* Update custom_diffusion.mdx to credit the author (#3163)

* Update custom_diffusion.mdx

* fix: unnecessary list comprehension.

* Fix TensorRT community pipeline device set function (#3157)

pass silence_dtype_warnings as kwarg

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make `from_flax` work for controlnet (#3161)

fix from_flax

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [docs] Clarify training args (#3146)

* clarify training arg

* apply feedback

* Multi Vector Textual Inversion (#3144)

* Multi Vector

* Improve

* fix multi token

* improve test

* make style

* Update examples/test_examples.py

* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* update

* Finish

* Apply suggestions from code review

---------

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Add `Karras sigmas` to HeunDiscreteScheduler (#3160)

* Add karras pattern to discrete heun scheduler

* Add integration test

* Fix failing CI on pytorch test on M1 (mps)

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
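
A usage sketch for the Karras sigma option added to the Heun scheduler (model id is an example):

```python
from diffusers import StableDiffusionPipeline, HeunDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = HeunDiscreteScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)
```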

* [AudioLDM] Fix dtype of returned waveform (#3189)

* Fix bug in train_dreambooth_lora (#3183)

* Update train_dreambooth_lora.py

fix bug

* Update train_dreambooth_lora.py

* [Community Pipelines] Update lpw_stable_diffusion pipeline (#3197)

* Update lpw_stable_diffusion.py

* fix cpu offload

* Make sure VAE attention works with Torch 2_0 (#3200)

* Make sure attention works with Torch 2_0

* make style

* Fix more

* Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline" (#3201)

Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline (#3197)"

This reverts commit 9965cb50ea.

* [Bug fix] Fix batch size attention head size mismatch (#3214)

* fix mixed precision training on train_dreambooth_inpaint_lora (#3138)

cast to weight dtype

* adding enable_vae_tiling and disable_vae_tiling functions (#3225)

adding enable_vae_tiling and disable_vae_tiling functions

* Add ControlNet v1.1 docs (#3226)

Add v1.1 docs

* Fix issue in maybe_convert_prompt (#3188)

When the token used for textual inversion does not have any special symbols (e.g. it is not surrounded by <>), the tokenizer does not properly split the replacement tokens.  Adding a space for the padding tokens fixes this.

* Sync cache version check from transformers (#3179)

sync cache version check from transformers

* Fix docs text inversion (#3166)

* Fix docs text inversion

* Apply suggestions from code review

* add model (#3230)

* add

* clean

* up

* clean up more

* fix more tests

* Improve docs further

* improve

* more fixes docs

* Improve docs more

* Update src/diffusers/models/unet_2d_condition.py

* fix

* up

* update doc links

* make fix-copies

* add safety checker and watermarker to stage 3 doc page code snippets

* speed optimizations docs

* memory optimization docs

* make style

* add watermarking snippets to doc string examples

* make style

* use pt_to_pil helper functions in doc strings

* skip mps tests

* Improve safety

* make style

* new logic

* fix

* fix bad onnx design

* make new stable diffusion upscale pipeline model arguments optional

* define has_nsfw_concept when non-pil output type

* lowercase linked to notebook name

---------

Co-authored-by: William Berman <WLBberman@gmail.com>

* Allow return pt x4 (#3236)

* Add all files

* update

* Allow fp16 attn for x4 upscaler (#3239)

* Add all files

* update

* Make sure vae is memory efficient for PT 1

* make style

* fix fast test (#3241)

* Adds a document on token merging (#3208)

* add document on token merging.

* fix headline.

* fix: headline.

* add some samples for comparison.

* [AudioLDM] Update docs to use updated ckpt (#3240)

* [AudioLDM] Update docs to use updated ckpt

* make style

* Release: v0.16.0

* Post release for 0.16.0 (#3244)

* Post release

* fix more

* [docs] only mention one stage (#3246)

* [docs] only mention one stage

* add blurb on auto accepting

---------

Co-authored-by: William Berman <WLBberman@gmail.com>

* Write model card in controlnet training script (#3229)

Write model card in controlnet training script.

* [2064]: Add stochastic sampler (sample_dpmpp_sde) (#3020)

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* Review comments

* [Review comment]: Add is_torchsde_available()

* [Review comment]: Test and docs

* [Review comment]

* [Review comment]

* [Review comment]

* [Review comment]

* [Review comment]

---------

Co-authored-by: njindal <njindal@adobe.com>

* [Stochastic Sampler][Slow Test]: Cuda test fixes (#3257)

[Slow Test]: Cuda test fixes

Co-authored-by: njindal <njindal@adobe.com>

* Remove required from tracker_project_name (#3260)

Remove required from tracker_project_name.

As observed by https://github.com/off99555 in https://github.com/huggingface/diffusers/issues/2695#issuecomment-1470755050, it already has a default value.

* adding required parameters while calling the get_up_block and get_down_block  (#3210)

* removed unnecessary parameters from get_up_block and get_down_block functions

* adding resnet_skip_time_act, resnet_out_scale_factor and cross_attention_norm to get_up_block and get_down_block functions

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* [docs] Update interface in repaint.mdx (#3119)

Update repaint.mdx

accommodate #1701

* Update IF name to XL (#3262)

Co-authored-by: multimodalart <joaopaulo.passos+multimodal@gmail.com>

* fix typo in score sde pipeline (#3132)

* Fix typo in textual inversion JAX training script (#3123)

The pipeline is built as `pipe` but then used as `pipeline`.

* AudioDiffusionPipeline - fix encode method after config changes (#3114)

* config fixes

* deprecate get_input_dims

* Revert "Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline"" (#3265)

Revert "Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline" (#3201)"

This reverts commit 91a2a80eb2.

* Fix community pipelines (#3266)

* update notebook (#3259)

Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>

* [docs] add notes for stateful model changes (#3252)

* [docs] add notes for stateful model changes

* Update docs/source/en/optimization/fp16.mdx

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* link to accelerate docs for discarding hooks

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* [LoRA] quality of life improvements in the loading semantics and docs (#3180)

* 👽 qol improvements for LoRA.

* better function name?

* fix: LoRA weight loading with the new format.

* address Patrick's comments.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* change wording around encouraging the use of load_lora_weights().

* fix: function name.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [Community Pipelines] EDICT pipeline implementation (#3153)

* EDICT pipeline initial commit

- Starting point taken from https://github.com/Joqsan/edict-diffusion

* refactor __init__() method

* minor refactoring

* refactor scheduler code

- remove scheduler and move its methods to the EDICTPipeline class

* make CFG optional
- refactor encode_prompt().
- include optional generator for sampling with vae.
- minor variable renaming

* add EDICT pipeline description to README.md

* replace preprocess() with VaeImageProcessor

* run make style and make quality commands

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [Docs]zh translated docs update (#3245)

* zh translated docs update

* update _toctree

* Update logging.mdx (#2863)

Fix typos

* Add multiple conditions to StableDiffusionControlNetInpaintPipeline (#3125)

* try multi controlnet inpaint

* multi controlnet inpaint

* multi controlnet inpaint

* Let's make sure that dreambooth always uploads to the Hub (#3272)

* Update Dreambooth README

* Adapt all docs as well

* automatically write model card

* fix

* make style

* Diffedit Zero-Shot Inpainting Pipeline (#2837)

* Update Pix2PixZero Auto-correlation Loss

* Add Stable Diffusion DiffEdit pipeline

* Add draft documentation and import code

* Bugfixes and refactoring

* Add option to not decode latents in the inversion process

* Harmonize preprocessing

* Revert "Update Pix2PixZero Auto-correlation Loss"

This reverts commit b218062fed.

* Update annotations

* rename `compute_mask` to `generate_mask`

* Update documentation

* Update docs

* Update Docs

* Fix copy

* Change shape of output latents to batch first

* Update docs

* Add first draft for tests

* Bugfix and update tests

* Add `cross_attention_kwargs` support for all pipeline methods

* Fix Copies

* Add support for PIL image latents

Add support for mask broadcasting

Update docs and tests

Align `mask` argument to `mask_image`

Remove height and width arguments

* Enable MPS Tests

* Move example docstrings

* Fix test

* Fix test

* fix pipeline inheritance

* Harmonize `prepare_image_latents` with StableDiffusionPix2PixZeroPipeline

* Register modules set to `None` in config for `test_save_load_optional_components`

* Move fixed logic to specific test class

* Clean changes to other pipelines

* Update new tests to coordinate with #2953

* Update slow tests for better results

* Safety to avoid potential problems with torch.inference_mode

* Add reference in SD Pipeline Overview

* Fix tests again

* Enforce determinism in noise for generate_mask

* Fix copies

* Widen test tolerance for fp16 based on `test_stable_diffusion_upscale_pipeline_fp16`

* Add LoraLoaderMixin and update `prepare_image_latents`

* clean up repeat and reg

* bugfix

* Remove invalid args from docs

Suppress spurious warning by repeating image before latent to mask gen

* add constant learning rate with custom rule (#3133)

* add constant lr with rules

* add constant with rules in TYPE_TO_SCHEDULER_FUNCTION

* add constant lr rate with rule

* hotfix code quality

* fix doc style

* change name constant_with_rules to piecewise constant

* Allow disabling torch 2_0 attention (#3273)

* Allow disabling torch 2_0 attention

* make style

* Update src/diffusers/models/attention.py

* [doc] add link to training script (#3271)

add link to training script

Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>

* temp disable spectogram diffusion tests (#3278)

The note-seq package throws an error on import because the default installed version of IPython
is not compatible with Python 3.8, which we run in the CI.
https://github.com/huggingface/diffusers/actions/runs/4830121056/jobs/8605954838#step:7:9

* Changed sample[0] to images[0] (#3304)

A pipeline object stores the results in `images` not in `sample`.
Current code blocks don't work.
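
For reference, the corrected access pattern looks roughly like this:

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
result = pipe("a photo of a cat")
image = result.images[0]  # results live in `images`, not `sample`
```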

* Typo in tutorial (#3295)

* Torch compile graph fix (#3286)

* fix more

* Fix more

* fix more

* Apply suggestions from code review

* fix

* make style

* make fix-copies

* fix

* make sure torch compile

* Clean

* fix test

* Postprocessing refactor img2img (#3268)

* refactor img2img VaeImageProcessor.postprocess

* remove copy from for init, run_safety_checker, decode_latents

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

---------

Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* [Torch 2.0 compile] Fix more torch compile breaks (#3313)

* Fix more torch compile breaks

* add tests

* Fix all

* fix controlnet

* fix more

* Add Horace He as co-author.
Co-authored-by: Horace He <horacehe2007@yahoo.com>

* Add Horace He as co-author.

Co-authored-by: Horace He <horacehe2007@yahoo.com>

---------

Co-authored-by: Horace He <horacehe2007@yahoo.com>

* fix: scale_lr and sync example readme and docs. (#3299)

* fix: scale_lr and sync example readme and docs.

* fix doc link.

* Update stable_diffusion.mdx (#3310)

fixed import statement

* Fix missing variable assign in DeepFloyd-IF-II (#3315)

Fix missing variable assign

lol

* Correct doc build for patch releases (#3316)

Update build_documentation.yml

* Add Stable Diffusion RePaint to community pipelines (#3320)

* Add Stable Diffusion RePaint to community pipelines

- Adds Stable Diffusion RePaint to community pipelines
- Add README entry for pipeline

* Fix: Remove wrong import

- Remove wrong import
- Minor change in comments

* Fix: Code formatting of stable_diffusion_repaint

* Fix: ruff errors in stable_diffusion_repaint

* Fix multistep dpmsolver for cosine schedule (suitable for deepfloyd-if) (#3314)

* fix multistep dpmsolver for cosine schedule (deepfloyd-if)

* fix a typo

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update all dpmsolver (singlestep, multistep, dpm, dpm++) for cosine noise schedule

* add test, fix style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [docs] Improve LoRA docs (#3311)

* update docs

* add to toctree

* apply feedback

* Added input perturbation (#3292)

* Added input perturbation

* Fixed spelling

* Update write_own_pipeline.mdx (#3323)

* update controlling generation doc with latest goodies. (#3321)

* [Quality] Make style (#3341)

* Fix config dpm (#3343)

* Add the SDE variant of DPM-Solver and DPM-Solver++ (#3344)

* add SDE variant of DPM-Solver and DPM-Solver++

* add test

* fix typo

* fix typo
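
A minimal sketch of opting into the SDE variant, assuming it is selected through the scheduler's algorithm_type config value:

```python
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="sde-dpmsolver++"
)
```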

* Add upsample_size to AttnUpBlock2D, AttnDownBlock2D (#3275)

The argument `upsample_size` needs to be added to these modules to allow compatibility with other blocks that require this argument.

* Add UniDiffuser classes to __init__ files, modify transformer block to support pre- and post-LN, add fast default tests, fix some bugs.

* Update fast tests to use test checkpoints stored on the hub and to better match the reference UniDiffuser implementation.

* Fix code with make style.

* Revert "Fix code style with make style."

This reverts commit 10a174a12c.

* Add self.image_encoder, self.text_decoder to list of models to offload to CPU in the enable_sequential_cpu_offload(...)/enable_model_cpu_offload(...) methods to make test_cpu_offload_forward_pass pass.

* Fix code quality with make style.

* Support using a data type embedding for UniDiffuser-v1.

* Add fast test for checking UniDiffuser-v1 sampling.

* Make changes so that the repository consistency tests pass.

* Add UniDiffuser dummy objects via make fix-copies.

* Fix bugs and make improvements to the UniDiffuser pipeline:
	- Improve batch size inference and fix bugs when num_images_per_prompt or num_prompts_per_image > 1
	- Add tests for num_images_per_prompt, num_prompts_per_image > 1
	- Improve check_inputs, especially regarding checking supplied latents
	- Add reset_mode method so that mode inference can be re-enabled after mode is set manually
	- Fix some warnings related to accessing class members directly instead of through their config
	- Small amount of refactoring in pipeline_unidiffuser.py

* Fix code style with make style.

* Add/edit docstrings for added classes and public pipeline methods. Also do some light refactoring.

* Add documentation for UniDiffuser and fix some typos/formatting in docstrings.

* Fix code with make style.

* Refactor and improve the UniDiffuser convert_from_ckpt.py script.

* Move the UniDiffusers convert_from_ckpy.py script to diffusers/scripts/convert_unidiffuser_to_diffusers.py

* Fix code quality via make style.

* Improve UniDiffuser slow tests.

* make style

* Fix some typos in the UniDiffuser docs.

* Remove outdated logic based on transformers version in UniDiffuser pipeline __init__.py

* Remove dependency on einops by refactoring einops operations to pure torch operations.

* make style

* Add slow test on full checkpoint for joint mode and correct expected image slices/text prefixes.

* make style

* Fix mixed precision issue by wrapping the offending code with the torch.autocast context manager.

* Revert "Fix mixed precision issue by wrapping the offending code with the torch.autocast context manager."

This reverts commit 1a58958ab4.

* Add fast test for CUDA/fp16 model behavior (currently failing).

* Fix the mixed precision issue and add additional tests of the pipeline cuda/fp16 functionality.

* make style

* Use a CLIPVisionModelWithProjection instead of CLIPVisionModel for image_encoder to better match the original UniDiffuser implementation.

* Make style and remove some testing code.

* Fix shape errors for the 'joint' and 'img2text' modes.

* Fix tests and remove some testing code.

* Add option to use fixed latents for UniDiffuserPipelineSlowTests and fix issue in modeling_text_decoder.py.

* Improve UniDiffuser docs, particularly the usage examples, and improve slow tests with new expected outputs.

* make style

* Fix examples to load model in float16.

* In image-to-text mode, sample from the autoencoder moment distribution instead of always getting its mode.

* make style

* When encoding the image using the VAE, scale the image latents by the VAE's scaling factor.

* make style

* Clean up code and make slow tests pass.

* make fix-copies
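
A minimal usage sketch of the pipeline described above, assuming the converted thu-ml/unidiffuser-v1 checkpoint and the images/text fields of ImageTextPipelineOutput:

```python
import torch
from diffusers import UniDiffuserPipeline

pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1", torch_dtype=torch.float16
).to("cuda")

# Joint mode: with no prompt or image supplied, the pipeline samples an (image, text) pair.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
```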

* [docs] Fix docstring (#3334)

fix docstring

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* if dreambooth lora (#3360)

* update IF stage I pipelines

add fixed variance schedulers and lora loading

* added kv lora attn processor

* allow loading into alternative lora attn processor

* make vae optional

* throw away predicted variance

* allow loading into added kv lora layer

* allow load T5

* allow pre compute text embeddings

* set new variance type in schedulers

* fix copies

* refactor all prompt embedding code

class prompts are now included in pre-encoding code
max tokenizer length is now configurable
embedding attention mask is now configurable

* fix for when variance type is not defined on scheduler

* do not pre compute validation prompt if not present

* add example test for if lora dreambooth

* add check for train text encoder and pre compute text embeddings

* Postprocessing refactor all others (#3337)

* add text2img

* fix-copies

* add

* add all other pipelines

* add

* add

* add

* add

* add

* make style

* style + fix copies

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>

* [docs] Improve safetensors docstring (#3368)

* clarify safetensor docstring

* fix typo

* apply feedback

* add: a warning message when using xformers in a PT 2.0 env. (#3365)

* add: a warning message when using xformers in a PT 2.0 env.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* StableDiffusionInpaintingPipeline - resize image w.r.t height and width (#3322)

* StableDiffusionInpaintingPipeline now resizes input images and masks w.r.t. the passed input height and width. The default is already set to 512. This addresses the common tensor mismatch error. Also moved the type check into the relevant function to keep the main pipeline body tidy.

* Fixed StableDiffusionInpaintingPrepareMaskAndMaskedImageTests

Due to the previous commit these tests were failing, as height and width need to be passed into the prepare_mask_and_masked_image function. I have updated the code and added a height/width variable per unit test, as this seemed more appropriate than the current hard-coded solution

* Added a resolution test to StableDiffusionInpaintPipelineSlowTests

this unit test simply gets the input and resizes it into something that would fail (e.g. it would throw a tensor mismatch error because it is not a multiple of 8). It then passes it through the pipeline and verifies it produces output with the correct dims w.r.t. the passed height and width

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* [docs] Adapt a model (#3326)

* first draft

* apply feedback

* conv_in.weight thrown away

* [docs] Load safetensors (#3333)

* safetensors

* apply feedback

* apply feedback

* Apply suggestions from code review

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* [Docs] Fix stable_diffusion.mdx typo (#3398)

Fix typo in last code block. Correct "prommpts" to "prompt"

* Support ControlNet v1.1 shuffle properly (#3340)

* add inferring_controlnet_cond_batch

* Revert "add inferring_controlnet_cond_batch"

This reverts commit abe8d6311d.

* set guess_mode to True
whenever global_pool_conditions is True

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* nit

* add integration test

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [Tests] better determinism (#3374)

* enable deterministic pytorch and cuda operations.

* disable manual seeding.

* make style && make quality for unet_2d tests.

* enable determinism for the unet2dconditional model.

* add CUBLAS_WORKSPACE_CONFIG for better reproducibility.

* relax tolerance (very weird issue, though).

* revert to torch manual_seed() where needed.

* relax more tolerance.

* better placement of the cuda variable and relax more tolerance.

* enable determinism for 3d condition model.

* relax tolerance.

* add: determinism to alt_diffusion.

* relax tolerance for alt diffusion.

* dance diffusion.

* dance diffusion is flaky.

* test_dict_tuple_outputs_equivalent edit.

* fix two more tests.

* fix more ddim tests.

* fix: argument.

* change to diff in place of difference.

* fix: test_save_load call.

* test_save_load_float16 call.

* fix: expected_max_diff

* fix: paint by example.

* relax tolerance.

* add determinism to 1d unet model.

* torch 2.0 regressions seem to be brutal

* determinism to vae.

* add reason to skipping.

* up tolerance.

* determinism to vq.

* determinism to cuda.

* determinism to the generic test pipeline file.

* refactor general pipelines testing a bit.

* determinism to alt diffusion i2i

* up tolerance for alt diff i2i and audio diff

* up tolerance.

* determinism to audioldm

* increase tolerance for audioldm lms.

* increase tolerance for paint by example.

* increase tolerance for repaint.

* determinism to cycle diffusion and sd 1.

* relax tol for cycle diffusion 🚲

* relax tol for sd 1.0

* relax tol for controlnet.

* determinism to img var.

* relax tol for img variation.

* tolerance to i2i sd

* make style

* determinism to inpaint.

* relax tolerance for inpainting.

* determinism for inpainting legacy

* relax tolerance.

* determinism to instruct pix2pix

* determinism to model editing.

* model editing tolerance.

* panorama determinism

* determinism to pix2pix zero.

* determinism to sag.

* sd 2. determinism

* sd. tolerance

* disallow tf32 matmul.

* relax tolerance is all you need.

* make style and determinism to sd 2 depth

* relax tolerance for depth.

* tolerance to diffedit.

* tolerance to sd 2 inpaint.

* up tolerance.

* determinism in upscaling.

* tolerance in upscaler.

* more tolerance relaxation.

* determinism to v pred.

* up tol for v_pred

* unclip determinism

* determinism to unclip img2img

* determinism to text to video.

* determinism to last set of tests

* up tol.

* vq cumsum doesn't have a deterministic kernel

* relax tol

* relax tol
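
A short sketch of the determinism setup referenced throughout this entry (CUBLAS_WORKSPACE_CONFIG, deterministic algorithms, no TF32 matmuls); the exact values used in the test suite may differ:

```python
import os
import torch

# Set the env var before any CUDA work starts; ":4096:8" is the documented value
# required for deterministic cuBLAS kernels.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.backends.cuda.matmul.allow_tf32 = False  # "disallow tf32 matmul"
torch.backends.cudnn.allow_tf32 = False
torch.use_deterministic_algorithms(True)       # raise on ops without deterministic kernels
```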

* [docs] Add transformers to install (#3388)

add transformers to install

* [deepspeed] partial ZeRO-3 support (#3076)

* [deepspeed] partial ZeRO-3 support

* cleanup

* improve deepspeed fixes

* Improve

* make style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add omegaconf for tests (#3400)

Add omegaconf

* Fix various bugs with LoRA Dreambooth and Dreambooth script (#3353)

* Improve checkpointing lora

* fix more

* Improve doc string

* Update src/diffusers/loaders.py

* make style

* Apply suggestions from code review

* Update src/diffusers/loaders.py

* Apply suggestions from code review

* Apply suggestions from code review

* better

* Fix all

* Fix multi-GPU dreambooth

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Fix all

* make style

* make style

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Fix docker file (#3402)

* up

* up

* fix: deepspeed_plugin retrieval from accelerate state (#3410)

* [Docs] Add `sigmoid` beta_scheduler to docstrings of relevant Schedulers (#3399)

* Add `sigmoid` beta scheduler to `DDPMScheduler` docstring

* Add `sigmoid` beta scheduler to `RePaintScheduler` docstring

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Don't install accelerate and transformers from source (#3415)

* Don't install transformers and accelerate from source (#3414)

* Improve fast tests (#3416)

Update pr_tests.yml

* attention refactor: the trilogy  (#3387)

* Replace `AttentionBlock` with `Attention`

* use _from_deprecated_attn_block check re: @patrickvonplaten

* [Docs] update the PT 2.0 optimization doc with latest findings (#3370)

* add: benchmarking stats for A100 and V100.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* address patrick's comments.

* add: rtx 4090 stats

* ⚔ benchmark reports done

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* 3313 pr link.

* add: plots.

Co-authored-by: Pedro <pedro@huggingface.co>

* fix formatting

* update number percent.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Fix style rendering (#3433)

* Fix style rendering.

* Fix typo

* unCLIP scheduler do not use note (#3417)

* Replace deprecated command with environment file (#3409)

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* fix warning message pipeline loading (#3446)

* add stable diffusion tensorrt img2img pipeline (#3419)

* add stable diffusion tensorrt img2img pipeline

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* update docstrings

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* Refactor controlnet and add img2img and inpaint (#3386)

* refactor controlnet and add img2img and inpaint

* First draft to get pipelines to work

* make style

* Fix more

* Fix more

* More tests

* Fix more

* Make inpainting work

* make style and more tests

* Apply suggestions from code review

* up

* make style

* Fix imports

* Fix more

* Fix more

* Improve examples

* add test

* Make sure import is correctly deprecated

* Make sure everything works in compile mode

* make sure authorship is correctly attributed

* [Scheduler] DPM-Solver (++) Inverse Scheduler (#3335)

* Add DPM-Solver Multistep Inverse Scheduler

* Add draft tests for DiffEdit

* Add inverse sde-dpmsolver steps to tune image diversity from inverted latents

* Fix tests

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [Docs] Fix incomplete docstring for resnet.py (#3438)

Fix incomplete docstrings for resnet.py

* fix tiled vae blend extent range (#3384)

fix tiled vae blend extent range

* Small update to "Next steps" section (#3443)

Small update to "Next steps" section:

- PyTorch 2 is recommended.
- Updated improvement figures.

* Allow arbitrary aspect ratio in IFSuperResolutionPipeline (#3298)

* Update pipeline_if_superresolution.py

Allow arbitrary aspect ratio in IFSuperResolutionPipeline by using the input image shape

* IFSuperResolutionPipeline: allow the user to override the height and width through the arguments

* update IFSuperResolutionPipeline width/height doc string to match StableDiffusionInpaintPipeline conventions

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Adding 'strength' parameter to StableDiffusionInpaintingPipeline  (#3424)

* Added explanation of 'strength' parameter

* Added get_timesteps function which relies on new strength parameter

* Added `strength` parameter which defaults to 1.

* Swapped ordering so `noise_timestep` can be calculated before masking the image

this is required when you aren't applying 100% noise to the masked region, e.g. strength < 1.

* Added strength to check_inputs, throws error if out of range

* Changed `prepare_latents` to initialise latents w.r.t strength

Inspired by the stable diffusion img2img pipeline: init latents are initialised by converting the init image into a VAE latent and adding noise (based upon the strength parameter passed in), e.g. random when strength = 1, or the init image at strength = 0.

* WIP: Added a unit test for the new strength parameter in the StableDiffusionInpaintingPipeline

still need to add correct regression values

* Created an is_strength_max flag to initialise from pure random noise

* Updated unit tests w.r.t new strength parameter + fixed new strength unit test

* renamed parameter to avoid confusion with variable of same name

* Updated regression values for new strength test - now passes

* removed 'copied from' comment as this method is now different and divergent from the copy

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Ensure backwards compatibility for prepare_mask_and_masked_image

created a return_image boolean and initialised to false

* Ensure backwards compatibility for prepare_latents

* Fixed copy check typo

* Fixes w.r.t. backward compatibility changes

* make style

* keep the function argument ordering the same for backwards compatibility in callees with copied-from statements

* make fix-copies

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: William Berman <WLBberman@gmail.com>
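
A minimal sketch of the new parameter (file paths are placeholders); strength=1.0 keeps the previous behaviour of starting the masked region from pure noise, while lower values preserve more of the original image:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("path/to/image.png")  # placeholder path
mask_image = load_image("path/to/mask.png")   # placeholder path

image = pipe(
    prompt="a red sofa",
    image=init_image,
    mask_image=mask_image,
    strength=0.7,  # < 1.0 keeps part of the original content in the masked area
).images[0]
```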

* [WIP] Bugfix - Pipeline.from_pretrained is broken when the pipeline is partially downloaded (#3448)

Added bugfix using f-strings.

* Fix gradient checkpointing bugs in freezing part of models (requires_grad=False) (#3404)

* gradient checkpointing bug fix

* bug fix; changes for reviews

* reformat

* reformat

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Make dreambooth lora more robust to orig unet (#3462)

* Make dreambooth lora more robust to orig unet

* up

* Reduce peak VRAM by releasing large attention tensors (as soon as they're unnecessary) (#3463)

Release large tensors in attention (as soon as they're no longer required). Reduces peak VRAM by nearly 2 GB for 1024x1024 (even after slicing), and the savings scale up with image size.
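
A simplified sketch of the idea, not the exact diffusers code: drop each large intermediate as soon as it has been consumed so the peak allocation stays lower.

```python
import torch

def attention(query, key, value, scale):
    # query/key/value: (batch, tokens, dim) tensors already projected per head.
    scores = torch.bmm(query, key.transpose(-1, -2)) * scale
    probs = scores.softmax(dim=-1)
    del scores  # the raw scores are no longer needed once the softmax exists
    out = torch.bmm(probs, value)
    del probs   # release the attention map before returning
    return out
```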

* Add min snr to text2img lora training script (#3459)

add min snr to text2img lora training script

* Add inpaint lora scale support (#3460)

* add inpaint lora scale support

* add inpaint lora scale test

---------

Co-authored-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>

* [From ckpt] Fix from_ckpt (#3466)

* Correct from_ckpt

* make style

* Update full dreambooth script to work with IF (#3425)

* Add IF dreambooth docs (#3470)

* parameterize pass single args through tuple (#3477)

* attend and excite tests disable determinism on the class level (#3478)

* dreambooth docs torch.compile note (#3471)

* dreambooth docs torch.compile note

* Update examples/dreambooth/README.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update examples/dreambooth/README.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* add: if entry in the dreambooth training docs. (#3472)

* [docs] Textual inversion inference (#3473)

* add textual inversion inference to docs

* add to toctree

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* [docs] Distributed inference (#3376)

* distributed inference

* move to inference section

* apply feedback

* update with split_between_processes

* apply feedback
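
A minimal sketch of the split_between_processes pattern the doc describes, assuming Accelerate is installed and the script is started with `accelerate launch`:

```python
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
state = PartialState()
pipe.to(state.device)

# Each process receives its own slice of the prompts and runs inference independently.
with state.split_between_processes(["a cat", "a dog", "a frog"]) as prompts:
    images = pipe(prompts).images
```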

* [{Up,Down}sample1d] explicit view kernel size as number elements in flattened indices (#3479)

explicit view kernel size as number elements in flattened indices

* mps & onnx tests rework (#3449)

* Remove ONNX tests from PR.

They are already a part of push_tests.yml.

* Remove mps tests from PRs.

They are already performed on push.

* Fix workflow name for fast push tests.

* Extract mps tests to a workflow.

For better control/filtering.

* Remove --extra-index-url from mps tests

* Increase tolerance of mps test

This test passes on my Mac (Ventura 13.3) but fails on the CI hardware
(Ventura 13.2). I ran the local tests following the same steps that
exist in the CI workflow.

* Temporarily run mps tests on pr

So we can test.

* Revert "Temporarily run mps tests on pr"

Tests passed, go back to running on push.

* [Attention processor] Better warning message when shifting to `AttnProcessor2_0` (#3457)

* add: debugging to enabling memory efficient processing

* add: better warning message.

* [Docs] add note on local directory path. (#3397)

add note on local directory path.

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Refactor full determinism (#3485)

* up

* fix more

* Apply suggestions from code review

* fix more

* fix more

* Check it

* Remove 16:8

* fix more

* fix more

* fix more

* up

* up

* Test only stable diffusion

* Test only two files

* up

* Try out spinning up processes that can be killed

* up

* Apply suggestions from code review

* up

* up

* Fix DPM single (#3413)

* Fix DPM single

* add test

* fix one more bug

* Apply suggestions from code review

Co-authored-by: StAlKeR7779 <stalkek7779@yandex.ru>

---------

Co-authored-by: StAlKeR7779 <stalkek7779@yandex.ru>

* Add `use_Karras_sigmas` to DPMSolverSinglestepScheduler (#3476)

* add use_karras_sigmas

* add karras test

* add doc
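
A minimal usage sketch; the flag mirrors the existing use_karras_sigmas option on other schedulers:

```python
from diffusers import StableDiffusionPipeline, DPMSolverSinglestepScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DPMSolverSinglestepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
```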

* Adds local_files_only bool to prevent forced online connection (#3486)

* make style

* [Docs] Korean translation (optimization, training) (#3488)

* feat) optimization kr translation

* fix) typo, italic setting

* feat) dreambooth, text2image kr

* feat) lora kr

* fix) LoRA

* fix) fp16 fix

* fix) doc-builder style

* fix) fp16 doc: corrected some wording

* fix) fp16 style fix

* fix) opt, training docs update

* feat) toctree update

* feat) toctree update

---------

Co-authored-by: Chanran Kim <seriousran@gmail.com>

* DataLoader respecting EXIF data in Training Images (#3465)

* DataLoader will now bake in any transforms or image manipulations contained in the EXIF metadata

Images may have rotations stored in EXIF. Previously those rotations were ignored during training and thus produced unexpected results

* Fixed the Dataloading EXIF issue in main DreamBooth training as well

* Run make style (black & isort)
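
A minimal sketch of the EXIF handling described above, using Pillow's exif_transpose to bake any stored orientation into the pixels before the usual training transforms:

```python
from PIL import Image, ImageOps

def load_training_image(path):
    image = Image.open(path)
    # Apply the EXIF orientation so the model sees the image the way a viewer
    # would display it.
    image = ImageOps.exif_transpose(image)
    return image.convert("RGB")
```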

* make style

* feat: allow disk offload for diffuser models (#3285)

* allow disk offload for diffuser models

* sort import

* add max_memory argument

* Rename --only_save_embeds to --save_as_full_pipeline (#3206)

* Set --only_save_embeds to False by default

Due to how the option is named, it makes more sense to behave like this.

* Refactor only_save_embeds to save_as_full_pipeline

* [AudioLDM] Generalise conversion script (#3328)

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Fix TypeError when using prompt_embeds and negative_prompt (#2982)

* test: Added test case

* fix: fixed type checking issue on _encode_prompt

* fix: fixed copies consistency

* fix: one copy was not sufficient

* Fix pipeline class on README (#3345)

Update README.md

* Inpainting: typo in docs (#3331)

Typo in docs

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add `use_Karras_sigmas` to LMSDiscreteScheduler (#3351)

* add karras sigma to lms discrete scheduler

* add test for lms_scheduler karras

* reformat test lms

* Batched load of textual inversions (#3277)

* Batched load of textual inversions

- Only call resize_token_embeddings once per batch as it is the most expensive operation
- Allow pretrained_model_name_or_path and token to be an optional list
- Remove Dict from type annotation pretrained_model_name_or_path as it was not supported in this function
- Add comment that single files (e.g. .pt/.safetensors) are supported
- Add comment for token parameter
- Convert token override log message from warning to info

* Update src/diffusers/loaders.py

Check for duplicate tokens

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update condition for None tokens

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
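
A minimal sketch of the batched form (embedding files and token names are placeholders):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Lists are now accepted, so resize_token_embeddings runs once for the whole batch.
pipe.load_textual_inversion(
    ["./embeddings/concept-a.safetensors", "./embeddings/concept-b.pt"],  # placeholder files
    token=["<concept-a>", "<concept-b>"],                                 # placeholder tokens
)
```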

* make fix-copies

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: Ilia Larchenko <41329713+IliaLarchenko@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Horace He <horacehe2007@yahoo.com>
Co-authored-by: Umar <55330742+mu94-csl@users.noreply.github.com>
Co-authored-by: Mylo <36931363+gitmylo@users.noreply.github.com>
Co-authored-by: Markus Pobitzer <markuspobitzer@gmail.com>
Co-authored-by: Cheng Lu <lucheng.lc15@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Isamu Isozaki <isamu.website@gmail.com>
Co-authored-by: Cesar Aybar <csaybar@gmail.com>
Co-authored-by: Will Rice <will@spokestack.io>
Co-authored-by: Adrià Arrufat <1671644+arrufat@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: At-sushi <dkahw210@kyoto.zaq.ne.jp>
Co-authored-by: Lucca Zenóbio <luccazen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: pdoane <pdoane2@gmail.com>
Co-authored-by: Will Berman <wlbberman@gmail.com>
Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Rupert Menneer <71332436+rupertmenneer@users.noreply.github.com>
Co-authored-by: sudowind <wfpkueecs@163.com>
Co-authored-by: Takuma Mori <takuma104@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Laureηt <laurentfainsin@protonmail.com>
Co-authored-by: Jongwoo Han <jongwooo.han@gmail.com>
Co-authored-by: asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com>
Co-authored-by: clarencechen <clarencechenct@gmail.com>
Co-authored-by: Laureηt <laurent@fainsin.bzh>
Co-authored-by: superlabs-dev <133080491+superlabs-dev@users.noreply.github.com>
Co-authored-by: Dev Aggarwal <devxpy@gmail.com>
Co-authored-by: Vimarsh Chaturvedi <vimarsh.c@gmail.com>
Co-authored-by: 7eu7d7 <31194890+7eu7d7@users.noreply.github.com>
Co-authored-by: cmdr2 <shashank.shekhar.global@gmail.com>
Co-authored-by: wfng92 <43742196+wfng92@users.noreply.github.com>
Co-authored-by: Glaceon-Hyy <ffheyy0017@gmail.com>
Co-authored-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>

* [Community] reference only control (#3435)

* add reference only control

* add reference only control

* add reference only control

* fix lint

* fix lint

* reference adain

* bugfix EulerAncestralDiscreteScheduler

* fix style fidelity rule

* fix default output size

* del unused line

* fix deterministic

* Support for cross-attention bias / mask (#2634)

* Cross-attention masks

prefer qualified symbol, fix accidental Optional

prefer qualified symbol in AttentionProcessor

prefer qualified symbol in embeddings.py

qualified symbol in transformer_2d

qualify FloatTensor in unet_2d_blocks

move new transformer_2d params attention_mask, encoder_attention_mask to the end of the section which is assumed (e.g. by functions such as checkpoint()) to have a stable positional param interface. regard return_dict as a special-case which is assumed to be injected separately from positional params (e.g. by create_custom_forward()).

move new encoder_attention_mask param to end of CrossAttn block interfaces and Unet2DCondition interface, to maintain positional param interface.

regenerate modeling_text_unet.py

remove unused import

unet_2d_condition encoder_attention_mask docs

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

versatile_diffusion/modeling_text_unet.py encoder_attention_mask docs

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

transformer_2d encoder_attention_mask docs

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

unet_2d_blocks.py: add parameter name comments

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

revert description. bool-to-bias treatment happens in unet_2d_condition only.

comment parameter names

fix copies, style

* encoder_attention_mask for SimpleCrossAttnDownBlock2D, SimpleCrossAttnUpBlock2D

* encoder_attention_mask for UNetMidBlock2DSimpleCrossAttn

* support attention_mask, encoder_attention_mask in KCrossAttnDownBlock2D, KCrossAttnUpBlock2D, KAttentionBlock. fix binding of attention_mask, cross_attention_kwargs params in KCrossAttnDownBlock2D, KCrossAttnUpBlock2D checkpoint invocations.

* fix mistake made during merge conflict resolution

* regenerate versatile_diffusion

* pass time embedding into checkpointed attention invocation

* always assume encoder_attention_mask is a mask (i.e. not a bias).

* style, fix-copies

* add tests for cross-attention masks

* add test for padding of attention mask

* explain mask's query_tokens dim. fix explanation about broadcasting over channels; we actually broadcast over query tokens

* support both masks and biases in Transformer2DModel#forward. document behaviour

* fix-copies

* delete attention_mask docs on the basis I never tested self-attention masking myself. not comfortable explaining it, since I don't actually understand how a self-attn mask can work in its current form: the key length will be different in every ResBlock (we don't downsample the mask when we downsample the image).

* review feedback: the standard Unet blocks shouldn't pass temb to attn (only to resnet). remove from KCrossAttnDownBlock2D,KCrossAttnUpBlock2D#forward.

* remove encoder_attention_mask param from SimpleCrossAttn{Up,Down}Block2D,UNetMidBlock2DSimpleCrossAttn, and mask-choice in those blocks' #forward, on the basis that they only do one type of attention, so the consumer can pass whichever type of attention_mask is appropriate.

* put attention mask padding back to how it was (since the SD use-case it enabled wasn't important, and it breaks the original unclip use-case). disable the test which was added.

* fix-copies

* style

* fix-copies

* put encoder_attention_mask param back into Simple block forward interfaces, to ensure consistency of forward interface.

* restore passing of emb to KAttentionBlock#forward, on the basis that removal caused test failures. restore also the passing of emb to checkpointed calls to KAttentionBlock#forward.

* make simple unet2d blocks use encoder_attention_mask, but only when attention_mask is None. this should fix UnCLIP compatibility.

* fix copies

* do not scale the initial global step by gradient accumulation steps when loading from checkpoint (#3506)

* Remove CPU latents logic for UniDiffuserPipelineFastTests.

* make style

* Revert "Clean up code and make slow tests pass."

This reverts commit ec7fb8735b.

* Revert bad commit and clean up code.

* add: contributor note.

* Batched load of textual inversions (#3277)

* Batched load of textual inversions

- Only call resize_token_embeddings once per batch as it is the most expensive operation
- Allow pretrained_model_name_or_path and token to be an optional list
- Remove Dict from type annotation pretrained_model_name_or_path as it was not supported in this function
- Add comment that single files (e.g. .pt/.safetensors) are supported
- Add comment for token parameter
- Convert token override log message from warning to info

* Update src/diffusers/loaders.py

Check for duplicate tokens

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update condition for None tokens

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Revert "add: contributor note."

This reverts commit 302fde9409.

* Re-add contributor note and refactored fast tests fixed latents code to remove CPU specific logic.

* make style

* Refactored the code:
	- Updated the checkpoint ids to the new ids where appropriate
	- Refactored the UniDiffuserTextDecoder methods to return only tensors (and made other changes to support this)
	- Cleaned up the code following suggestions by patrickvonplaten

* make style

* Remove padding logic from UniDiffuserTextDecoder.generate_beam since the inputs are already padded to a consistent length.

* Update checkpoint id for small test v1 checkpoint to hf-internal-testing/unidiffuser-test-v1.

* make style

* Make improvements to the documentation.

* Move ImageTextPipelineOutput documentation from /api/pipelines/unidiffuser.mdx to /api/diffusion_pipeline.mdx.

* Change order of arguments for UniDiffuserTextDecoder.generate_beam.

* make style

* Update docs/source/en/api/pipelines/unidiffuser.mdx

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>
Co-authored-by: Ernie Chu <51432514+ernestchu@users.noreply.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Andranik Movsisyan <48154088+19and99@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Andreas Steiner <andstein@google.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Joseph Coffland <github@joe.coffland.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Takuma Mori <takuma104@gmail.com>
Co-authored-by: Will Berman <wlbberman@gmail.com>
Co-authored-by: Tommaso De Rossi <beats.by.morse@gmail.com>
Co-authored-by: Cristian Garcia <cgarcia.e88@gmail.com>
Co-authored-by: cmdr2 <secondary.cmdr2@gmail.com>
Co-authored-by: 1lint <105617163+1lint@users.noreply.github.com>
Co-authored-by: asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com>
Co-authored-by: Chanchana Sornsoontorn <off9955555@gmail.com>
Co-authored-by: hwuebben <wbben123@yahoo.de>
Co-authored-by: superhero-7 <57797766+superhero-7@users.noreply.github.com>
Co-authored-by: root <fulong_ye@163.com>
Co-authored-by: nupurkmr9 <nupurkmr9@gmail.com>
Co-authored-by: Nupur Kumari <nupurkumari@Nupurs-MacBook-Pro.local>
Co-authored-by: Nupur Kumari <nupurkumari@nupurs-mbp.wifi.local.cmu.edu>
Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>
Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Co-authored-by: clarencechen <clarencechenct@gmail.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Youssef Adarrab <104783077+youssefadr@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Chengrui Wang <80876977+crywang@users.noreply.github.com>
Co-authored-by: SkyTNT <SKYTNT@outlook.com>
Co-authored-by: Lucca Zenóbio <luccazen@gmail.com>
Co-authored-by: Isaac <34376531+init-22@users.noreply.github.com>
Co-authored-by: pdoane <pdoane2@gmail.com>
Co-authored-by: Yuchen Fan <fyc0624@gmail.com>
Co-authored-by: Nipun Jindal <jindal.nipun@gmail.com>
Co-authored-by: njindal <njindal@adobe.com>
Co-authored-by: apolinário <joaopaulo.passos@gmail.com>
Co-authored-by: multimodalart <joaopaulo.passos+multimodal@gmail.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Jair Trejo <jairtrejo@gmail.com>
Co-authored-by: Robert Dargavel Smith <teticio@gmail.com>
Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
Co-authored-by: Joqsan <6027118+Joqsan@users.noreply.github.com>
Co-authored-by: NimenDavid <312648004@qq.com>
Co-authored-by: M. Tolga Cangöz <46008593+standardAI@users.noreply.github.com>
Co-authored-by: timegate <timegate@kaist.ac.kr>
Co-authored-by: Jason Kuan <jason9075@users.noreply.github.com>
Co-authored-by: Ilia Larchenko <41329713+IliaLarchenko@users.noreply.github.com>
Co-authored-by: Horace He <horacehe2007@yahoo.com>
Co-authored-by: Umar <55330742+mu94-csl@users.noreply.github.com>
Co-authored-by: Mylo <36931363+gitmylo@users.noreply.github.com>
Co-authored-by: Markus Pobitzer <markuspobitzer@gmail.com>
Co-authored-by: Cheng Lu <lucheng.lc15@gmail.com>
Co-authored-by: Isamu Isozaki <isamu.website@gmail.com>
Co-authored-by: Cesar Aybar <csaybar@gmail.com>
Co-authored-by: Will Rice <will@spokestack.io>
Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Rupert Menneer <71332436+rupertmenneer@users.noreply.github.com>
Co-authored-by: sudowind <wfpkueecs@163.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Laureηt <laurentfainsin@protonmail.com>
Co-authored-by: Jongwoo Han <jongwooo.han@gmail.com>
Co-authored-by: Laureηt <laurent@fainsin.bzh>
Co-authored-by: superlabs-dev <133080491+superlabs-dev@users.noreply.github.com>
Co-authored-by: Dev Aggarwal <devxpy@gmail.com>
Co-authored-by: Vimarsh Chaturvedi <vimarsh.c@gmail.com>
Co-authored-by: 7eu7d7 <31194890+7eu7d7@users.noreply.github.com>
Co-authored-by: cmdr2 <shashank.shekhar.global@gmail.com>
Co-authored-by: wfng92 <43742196+wfng92@users.noreply.github.com>
Co-authored-by: Glaceon-Hyy <ffheyy0017@gmail.com>
Co-authored-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>
Co-authored-by: StAlKeR7779 <stalkek7779@yandex.ru>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: w4ffl35 <w4ffl35@ml1.net>
Co-authored-by: Seongsu Park <tjdtnsu@gmail.com>
Co-authored-by: Chanran Kim <seriousran@gmail.com>
Co-authored-by: Ambrosiussen <paul@ambrosiussen.com>
Co-authored-by: Hari Krishna <37787894+hari10599@users.noreply.github.com>
Co-authored-by: Adrià Arrufat <1671644+arrufat@users.noreply.github.com>
Co-authored-by: At-sushi <dkahw210@kyoto.zaq.ne.jp>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: takuoko <to78314910@gmail.com>
Co-authored-by: Birch-san <Birch-san@users.noreply.github.com>
2023-05-26 17:27:30 +05:30
Steven Liu
7948db81c5 [docs] Add AttnProcessor to docs (#3474)
* add attnprocessor to docs

* fix path to class

* create separate page for attnprocessors

* fix path

* fix path for real

* fill in docstrings

* apply feedback

* apply feedback
2023-05-26 17:11:42 +05:30
Patrick von Platen
bf16a97018 Fix controlnet guess mode euler (#3571)
* Fix guess mode controlnet for euler-like schedulers

* make style

* Co-authored-by: Chanchana Sornsoontorn <off.chanchana@gmail.com>

* Add co author Co-authored-by: Chanchana Sornsoontorn <off.chanchana@gmail.com>

* 2nd try
Co-authored-by: Chanchana Sornsoontorn <off.chanchana@gmail.com>
2023-05-26 11:31:51 +01:00
Patrick von Platen
66356e7dd5 Correct inpainting controlnet docs (#3572) 2023-05-26 11:02:30 +01:00
vikasmech
ffa33d631a renamed variable to input_ and output_ (#3507)
* renamed variable to input_ and output_

* changed input_ to inputs and output_ to outputs
2023-05-26 10:34:11 +01:00
Emin Demirci
d8ce53a8c4 Fix loaded_token reference before definition (#3523) 2023-05-26 10:31:02 +01:00
Patrick von Platen
d114d80fd2 [Stable Diffusion Inpainting] Allow standard text-to-img checkpoints to be useable for SD inpainting (#3533)
* Add default to inpaint

* Make sure controlnet also works with normal sd for inpaint

* Add tests

* improve

* Correct encode images function

* Correct inpaint controlnet

* Improve text2img inpaint

* make style

* up

* up

* up

* up

* fix more
2023-05-26 09:47:42 +01:00
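A minimal sketch of what the commit above enables, assuming the `runwayml/stable-diffusion-v1-5` text-to-image checkpoint and placeholder image URLs; this is an illustration, not code from the PR itself.

```python
# Sketch: run the inpainting pipeline from a standard text-to-image checkpoint.
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # regular text-to-img checkpoint, not an inpainting one
    torch_dtype=torch.float16,
).to("cuda")

init_image = load_image("https://example.com/photo.png")  # placeholder URL
mask_image = load_image("https://example.com/mask.png")   # placeholder URL, white = repaint

result = pipe(prompt="a red park bench", image=init_image, mask_image=mask_image).images[0]
result.save("inpainted.png")
```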
YiYi Xu
e5215dee9a fix broken change for vq pipeline (#3563)
fix vq_model

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-05-25 14:55:31 -10:00
YiYi Xu
03b7a84cbe Add Kandinsky 2.1 (#3308)
add kandinsky2.1

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Ayush Mangal <43698245+ayushtues@users.noreply.github.com>
Co-authored-by: ayushmangal <ayushmangal@microsoft.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-05-25 11:28:34 -10:00
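A hedged sketch of the two-stage Kandinsky 2.1 flow (prior then decoder) added here; the model ids and argument names follow the usual community checkpoints and should be treated as assumptions, not as text from the PR.

```python
# Sketch: prior produces image embeddings, decoder turns them into an image.
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a portrait of a fox in watercolor"
image_embeds, negative_image_embeds = prior(prompt, guidance_scale=1.0).to_tuple()
image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
).images[0]
```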
Patrick von Platen
f19f128735 Add open parti prompts to docs (#3549)
* Add open parti prompts

* More changes
2023-05-25 11:11:20 +01:00
Isotr0py
a94977b8b3 Fix panorama to support all schedulers (#3546)
* refactor blocks init

* refactor blocks loop

* remove unused function and warnings

* fix scheduler update location

* reformat code

* reformat code again

* fix PNDM test case

* reformat pndm test case
2023-05-24 17:58:08 +05:30
Sayak Paul
8e69708b0d [Examples/DreamBooth] refactor save_model_card utility in dreambooth examples (#3543)
refactor save_model_card utility in dreambooth examples.
2023-05-24 16:16:28 +05:30
Will Berman
db56f8a4f5 explicit broadcasts for assignments (#3535) 2023-05-24 11:17:41 +01:00
Will Berman
c13dbd5c3a fix attention mask pad check (#3531) 2023-05-23 13:11:53 -07:00
Pedro Cuenca
bde2cb5d9b Run torch.compile tests in separate subprocesses (#3503)
* Run ControlNet compile test in a separate subprocess

`torch.compile()` spawns several subprocesses and the GPU memory used
was not reclaimed after the test ran. This approach was taken from
`transformers`.

* Style

* Prepare a couple more compile tests to run in subprocess.

* Use require_torch_2 decorator.

* Test inpaint_compile in subprocess.

* Run img2img compile test in subprocess.

* Run stable diffusion compile test in subprocess.

* style

* Temporarily trigger on pr to test.

* Revert "Temporarily trigger on pr to test."

This reverts commit 82d76868dd.
2023-05-23 19:24:17 +02:00
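A rough sketch of the isolation pattern described above (helper names are illustrative, not the actual test utilities): the compile-heavy body runs in a spawned child process, so the GPU memory it allocates is returned to the OS when the process exits.

```python
# Sketch: run a torch.compile() test body in a spawned subprocess.
import multiprocessing


def _compile_test_body(queue):
    try:
        # ... build pipeline, call torch.compile(), run assertions ...
        queue.put(None)
    except Exception as exc:  # report failures back to the parent process
        queue.put(exc)


def test_stable_diffusion_compile():
    ctx = multiprocessing.get_context("spawn")
    queue = ctx.Queue()
    process = ctx.Process(target=_compile_test_body, args=(queue,))
    process.start()
    error = queue.get()
    process.join()
    if error is not None:
        raise error
```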
Patrick von Platen
abab61d49e Update README.md 2023-05-23 17:29:18 +01:00
Patrick von Platen
b402604de4 Update README.md (#3525) 2023-05-23 17:28:39 +01:00
Patrick von Platen
84ce50f08e Improve README (#3524)
Update README.md
2023-05-23 16:53:34 +01:00
Patrick von Platen
9e2734a710 Make sure Diffusers works even if Hub is down (#3447)
* Make sure Diffusers works even if Hub is down

* Make sure hub down is well tested
2023-05-23 14:22:43 +01:00
Patrick von Platen
d4197bf4d7 Allow custom pipeline loading (#3504) 2023-05-23 13:20:55 +01:00
takuoko
b134f6a8b6 [Community] ControlNet Reference (#3508)
add controlnet reference and bugfix

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-23 13:20:34 +01:00
yingjieh
edc6505193 [Community Pipelines]Accelerate inference of stable diffusion by IPEX on CPU (#3105)
* add stable_diffusion_ipex community pipeline

* Update readme.md

* reformat

* reformat

* Update examples/community/README.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update examples/community/README.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update examples/community/README.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update examples/community/README.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Update README.md

* Update README.md

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* style

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-05-23 10:55:14 +02:00
Isotr0py
2f997f30ab Fix bug in panorama pipeline when using dpmsolver scheduler (#3499)
fix panorama pipeline with dpmsolver scheduler
2023-05-23 08:55:15 +05:30
Will Berman
67cd460154 do not scale the initial global step by gradient accumulation steps when loading from checkpoint (#3506) 2023-05-22 15:19:56 -07:00
Birch-san
64bf5d33b7 Support for cross-attention bias / mask (#2634)
* Cross-attention masks

prefer qualified symbol, fix accidental Optional

prefer qualified symbol in AttentionProcessor

prefer qualified symbol in embeddings.py

qualified symbol in transformed_2d

qualify FloatTensor in unet_2d_blocks

move new transformer_2d params attention_mask, encoder_attention_mask to the end of the section which is assumed (e.g. by functions such as checkpoint()) to have a stable positional param interface. regard return_dict as a special-case which is assumed to be injected separately from positional params (e.g. by create_custom_forward()).

move new encoder_attention_mask param to end of CrossAttn block interfaces and Unet2DCondition interface, to maintain positional param interface.

regenerate modeling_text_unet.py

remove unused import

unet_2d_condition encoder_attention_mask docs

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

versatile_diffusion/modeling_text_unet.py encoder_attention_mask docs

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

transformer_2d encoder_attention_mask docs

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

unet_2d_blocks.py: add parameter name comments

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

revert description. bool-to-bias treatment happens in unet_2d_condition only.

comment parameter names

fix copies, style

* encoder_attention_mask for SimpleCrossAttnDownBlock2D, SimpleCrossAttnUpBlock2D

* encoder_attention_mask for UNetMidBlock2DSimpleCrossAttn

* support attention_mask, encoder_attention_mask in KCrossAttnDownBlock2D, KCrossAttnUpBlock2D, KAttentionBlock. fix binding of attention_mask, cross_attention_kwargs params in KCrossAttnDownBlock2D, KCrossAttnUpBlock2D checkpoint invocations.

* fix mistake made during merge conflict resolution

* regenerate versatile_diffusion

* pass time embedding into checkpointed attention invocation

* always assume encoder_attention_mask is a mask (i.e. not a bias).

* style, fix-copies

* add tests for cross-attention masks

* add test for padding of attention mask

* explain mask's query_tokens dim. fix explanation about broadcasting over channels; we actually broadcast over query tokens

* support both masks and biases in Transformer2DModel#forward. document behaviour

* fix-copies

* delete attention_mask docs on the basis I never tested self-attention masking myself. not comfortable explaining it, since I don't actually understand how a self-attn mask can work in its current form: the key length will be different in every ResBlock (we don't downsample the mask when we downsample the image).

* review feedback: the standard Unet blocks shouldn't pass temb to attn (only to resnet). remove from KCrossAttnDownBlock2D,KCrossAttnUpBlock2D#forward.

* remove encoder_attention_mask param from SimpleCrossAttn{Up,Down}Block2D,UNetMidBlock2DSimpleCrossAttn, and mask-choice in those blocks' #forward, on the basis that they only do one type of attention, so the consumer can pass whichever type of attention_mask is appropriate.

* put attention mask padding back to how it was (since the SD use-case it enabled wasn't important, and it breaks the original unclip use-case). disable the test which was added.

* fix-copies

* style

* fix-copies

* put encoder_attention_mask param back into Simple block forward interfaces, to ensure consistency of forward interface.

* restore passing of emb to KAttentionBlock#forward, on the basis that removal caused test failures. restore also the passing of emb to checkpointed calls to KAttentionBlock#forward.

* make simple unet2d blocks use encoder_attention_mask, but only when attention_mask is None. this should fix UnCLIP compatibility.

* fix copies
2023-05-22 17:27:15 +01:00
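A hedged sketch of the cross-attention mask support added above: a boolean `encoder_attention_mask` is passed alongside the text embeddings when calling the conditional UNet directly, and is converted to an additive bias internally (shapes and the model id are illustrative).

```python
# Sketch: pass an encoder_attention_mask to UNet2DConditionModel.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

sample = torch.randn(1, 4, 64, 64)                      # noisy latents
timestep = torch.tensor(10)
encoder_hidden_states = torch.randn(1, 77, 768)         # text embeddings
# True = keep token, False = masked out; bool-to-bias happens inside the UNet
encoder_attention_mask = torch.ones(1, 77, dtype=torch.bool)

out = unet(
    sample,
    timestep,
    encoder_hidden_states=encoder_hidden_states,
    encoder_attention_mask=encoder_attention_mask,
).sample
```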
takuoko
c4359d63e3 [Community] reference only control (#3435)
* add reference only control

* add reference only control

* add reference only control

* fix lint

* fix lint

* reference adain

* bugfix EulerAncestralDiscreteScheduler

* fix style fidelity rule

* fix default output size

* del unused line

* fix deterministic
2023-05-22 16:21:54 +01:00
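A heavily hedged sketch of loading the "reference only" community pipeline introduced above; the `custom_pipeline` id and the `ref_image` / `reference_attn` / `reference_adain` / `style_fidelity` argument names are assumptions based on this commit's description, not verified against the shipped example file.

```python
# Sketch: community reference-only control via a custom pipeline.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="stable_diffusion_reference",  # assumed community pipeline id
    torch_dtype=torch.float16,
).to("cuda")

ref = load_image("https://example.com/reference.png")  # placeholder URL
image = pipe(
    prompt="a cat in the same style",
    ref_image=ref,
    reference_attn=True,
    reference_adain=True,   # the "reference adain" mode from the commit message
    style_fidelity=0.5,     # assumed knob for the "style fidelity rule"
).images[0]
```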
Hari Krishna
f3d570c273 feat: allow disk offload for diffuser models (#3285)
* allow disk offload for diffuser models

* sort import

* add max_memory argument

* Changed sample[0] to images[0] (#3304)

A pipeline object stores the results in `images` not in `sample`.
Current code blocks don't work.

* Typo in tutorial (#3295)

* Torch compile graph fix (#3286)

* fix more

* Fix more

* fix more

* Apply suggestions from code review

* fix

* make style

* make fix-copies

* fix

* make sure torch compile

* Clean

* fix test

* Postprocessing refactor img2img (#3268)

* refactor img2img VaeImageProcessor.postprocess

* remove copy from for init, run_safety_checker, decode_latents

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

---------

Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* [Torch 2.0 compile] Fix more torch compile breaks (#3313)

* Fix more torch compile breaks

* add tests

* Fix all

* fix controlnet

* fix more

* Add Horace He as co-author.

Co-authored-by: Horace He <horacehe2007@yahoo.com>

* Add Horace He as co-author.

Co-authored-by: Horace He <horacehe2007@yahoo.com>

---------

Co-authored-by: Horace He <horacehe2007@yahoo.com>

* fix: scale_lr and sync example readme and docs. (#3299)

* fix: scale_lr and sync example readme and docs.

* fix doc link.

* Update stable_diffusion.mdx (#3310)

fixed import statement

* Fix missing variable assign in DeepFloyd-IF-II (#3315)

Fix missing variable assign

lol

* Correct doc build for patch releases (#3316)

Update build_documentation.yml

* Add Stable Diffusion RePaint to community pipelines (#3320)

* Add Stable Diffusion RePaint to community pipelines

- Adds Stable Diffusion RePaint to community pipelines
- Add Readme entry for pipeline

* Fix: Remove wrong import

- Remove wrong import
- Minor change in comments

* Fix: Code formatting of stable_diffusion_repaint

* Fix: ruff errors in stable_diffusion_repaint

* Fix multistep dpmsolver for cosine schedule (suitable for deepfloyd-if) (#3314)

* fix multistep dpmsolver for cosine schedule (deepfloyd-if)

* fix a typo

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update all dpmsolver (singlestep, multistep, dpm, dpm++) for cosine noise schedule

* add test, fix style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [docs] Improve LoRA docs (#3311)

* update docs

* add to toctree

* apply feedback

* Added input perturbation (#3292)

* Added input perturbation

* Fixed spelling

* Update write_own_pipeline.mdx (#3323)

* update controlling generation doc with latest goodies. (#3321)

* [Quality] Make style (#3341)

* Fix config dpm (#3343)

* Add the SDE variant of DPM-Solver and DPM-Solver++ (#3344)

* add SDE variant of DPM-Solver and DPM-Solver++

* add test

* fix typo

* fix typo

* Add upsample_size to AttnUpBlock2D, AttnDownBlock2D (#3275)

The argument `upsample_size` needs to be added to these modules to allow compatibility with other blocks that require this argument.

* Rename --only_save_embeds to --save_as_full_pipeline (#3206)

* Set --only_save_embeds to False by default

Due to how the option is named, it makes more sense to behave like this.

* Refactor only_save_embeds to save_as_full_pipeline

* [AudioLDM] Generalise conversion script (#3328)

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Fix TypeError when using prompt_embeds and negative_prompt (#2982)

* test: Added test case

* fix: fixed type checking issue on _encode_prompt

* fix: fixed copies consistency

* fix: one copy was not sufficient

* Fix pipeline class on README (#3345)

Update README.md

* Inpainting: typo in docs (#3331)

Typo in docs

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add `use_Karras_sigmas` to LMSDiscreteScheduler (#3351)

* add karras sigma to lms discrete scheduler

* add test for lms_scheduler karras

* reformat test lms

* Batched load of textual inversions (#3277)

* Batched load of textual inversions

- Only call resize_token_embeddings once per batch as it is the most expensive operation
- Allow pretrained_model_name_or_path and token to be an optional list
- Remove Dict from type annotation pretrained_model_name_or_path as it was not supported in this function
- Add comment that single files (e.g. .pt/.safetensors) are supported
- Add comment for token parameter
- Convert token override log message from warning to info

* Update src/diffusers/loaders.py

Check for duplicate tokens

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update condition for None tokens

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make fix-copies

* [docs] Fix docstring (#3334)

fix docstring

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* if dreambooth lora (#3360)

* update IF stage I pipelines

add fixed variance schedulers and lora loading

* added kv lora attn processor

* allow loading into alternative lora attn processor

* make vae optional

* throw away predicted variance

* allow loading into added kv lora layer

* allow load T5

* allow pre compute text embeddings

* set new variance type in schedulers

* fix copies

* refactor all prompt embedding code

class prompts are now included in pre-encoding code
max tokenizer length is now configurable
embedding attention mask is now configurable

* fix for when variance type is not defined on scheduler

* do not pre compute validation prompt if not present

* add example test for if lora dreambooth

* add check for train text encoder and pre compute text embeddings

* Postprocessing refactor all others (#3337)

* add text2img

* fix-copies

* add

* add all other pipelines

* add

* add

* add

* add

* add

* make style

* style + fix copies

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>

* [docs] Improve safetensors docstring (#3368)

* clarify safetensor docstring

* fix typo

* apply feedback

* add: a warning message when using xformers in a PT 2.0 env. (#3365)

* add: a warning message when using xformers in a PT 2.0 env.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* StableDiffusionInpaintingPipeline - resize image w.r.t height and width (#3322)

* StableDiffusionInpaintingPipeline now resizes input images and masks w.r.t. the passed input height and width. Default is already set to 512. This addresses the common tensor mismatch error. Also moved the type check into the relevant function to keep the main pipeline body tidy.

* Fixed StableDiffusionInpaintingPrepareMaskAndMaskedImageTests

Due to the previous commit these tests were failing, as height and width need to be passed into the prepare_mask_and_masked_image function. I have updated the code and added a height/width variable per unit test, as it seemed more appropriate than the current hard-coded solution

* Added a resolution test to StableDiffusionInpaintPipelineSlowTests

this unit test simply gets the input and resizes it into something that would fail (e.g. would throw a tensor mismatch error / not a multiple of 8), then passes it through the pipeline and verifies it produces output with correct dims w.r.t. the passed height and width

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* [docs] Adapt a model (#3326)

* first draft

* apply feedback

* conv_in.weight thrown away

* [docs] Load safetensors (#3333)

* safetensors

* apply feedback

* apply feedback

* Apply suggestions from code review

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* [Docs] Fix stable_diffusion.mdx typo (#3398)

Fix typo in last code block. Correct "prommpts" to "prompt"

* Support ControlNet v1.1 shuffle properly (#3340)

* add inferring_controlnet_cond_batch

* Revert "add inferring_controlnet_cond_batch"

This reverts commit abe8d6311d.

* set guess_mode to True
whenever global_pool_conditions is True

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* nit

* add integration test

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [Tests] better determinism (#3374)

* enable deterministic pytorch and cuda operations.

* disable manual seeding.

* make style && make quality for unet_2d tests.

* enable determinism for the unet2dconditional model.

* add CUBLAS_WORKSPACE_CONFIG for better reproducibility.

* relax tolerance (very weird issue, though).

* revert to torch manual_seed() where needed.

* relax more tolerance.

* better placement of the cuda variable and relax more tolerance.

* enable determinism for 3d condition model.

* relax tolerance.

* add: determinism to alt_diffusion.

* relax tolerance for alt diffusion.

* dance diffusion.

* dance diffusion is flaky.

* test_dict_tuple_outputs_equivalent edit.

* fix two more tests.

* fix more ddim tests.

* fix: argument.

* change to diff in place of difference.

* fix: test_save_load call.

* test_save_load_float16 call.

* fix: expected_max_diff

* fix: paint by example.

* relax tolerance.

* add determinism to 1d unet model.

* torch 2.0 regressions seem to be brutal

* determinism to vae.

* add reason to skipping.

* up tolerance.

* determinism to vq.

* determinism to cuda.

* determinism to the generic test pipeline file.

* refactor general pipelines testing a bit.

* determinism to alt diffusion i2i

* up tolerance for alt diff i2i and audio diff

* up tolerance.

* determinism to audioldm

* increase tolerance for audioldm lms.

* increase tolerance for paint by example.

* increase tolerance for repaint.

* determinism to cycle diffusion and sd 1.

* relax tol for cycle diffusion 🚲

* relax tol for sd 1.0

* relax tol for controlnet.

* determinism to img var.

* relax tol for img variation.

* tolerance to i2i sd

* make style

* determinism to inpaint.

* relax tolerance for inpainting.

* determinism for inpainting legacy

* relax tolerance.

* determinism to instruct pix2pix

* determinism to model editing.

* model editing tolerance.

* panorama determinism

* determinism to pix2pix zero.

* determinism to sag.

* sd 2. determinism

* sd. tolerance

* disallow tf32 matmul.

* relax tolerance is all you need.

* make style and determinism to sd 2 depth

* relax tolerance for depth.

* tolerance to diffedit.

* tolerance to sd 2 inpaint.

* up tolerance.

* determinism in upscaling.

* tolerance in upscaler.

* more tolerance relaxation.

* determinism to v pred.

* up tol for v_pred

* unclip determinism

* determinism to unclip img2img

* determinism to text to video.

* determinism to last set of tests

* up tol.

* vq cumsum doesn't have a deterministic kernel

* relax tol

* relax tol

* [docs] Add transformers to install (#3388)

add transformers to install

* [deepspeed] partial ZeRO-3 support (#3076)

* [deepspeed] partial ZeRO-3 support

* cleanup

* improve deepspeed fixes

* Improve

* make style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add omegaconf for tests (#3400)

Add omegaconf

* Fix various bugs with LoRA Dreambooth and Dreambooth script (#3353)

* Improve checkpointing lora

* fix more

* Improve doc string

* Update src/diffusers/loaders.py

* make style

* Apply suggestions from code review

* Update src/diffusers/loaders.py

* Apply suggestions from code review

* Apply suggestions from code review

* better

* Fix all

* Fix multi-GPU dreambooth

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Fix all

* make style

* make style

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Fix docker file (#3402)

* up

* up

* fix: deepspeed_plugin retrieval from accelerate state (#3410)

* [Docs] Add `sigmoid` beta_scheduler to docstrings of relevant Schedulers (#3399)

* Add `sigmoid` beta scheduler to `DDPMScheduler` docstring

* Add `sigmoid` beta scheduler to `RePaintScheduler` docstring

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Don't install accelerate and transformers from source (#3415)

* Don't install transformers and accelerate from source (#3414)

* Improve fast tests (#3416)

Update pr_tests.yml

* attention refactor: the trilogy  (#3387)

* Replace `AttentionBlock` with `Attention`

* use _from_deprecated_attn_block check re: @patrickvonplaten

* [Docs] update the PT 2.0 optimization doc with latest findings (#3370)

* add: benchmarking stats for A100 and V100.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* address patrick's comments.

* add: rtx 4090 stats

* ⚔ benchmark reports done

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* 3313 pr link.

* add: plots.

Co-authored-by: Pedro <pedro@huggingface.co>

* fix formatting

* update number percent.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Fix style rendering (#3433)

* Fix style rendering.

* Fix typo

* unCLIP scheduler do not use note (#3417)

* Replace deprecated command with environment file (#3409)

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* fix warning message pipeline loading (#3446)

* add stable diffusion tensorrt img2img pipeline (#3419)

* add stable diffusion tensorrt img2img pipeline

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* update docstrings

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* Refactor controlnet and add img2img and inpaint (#3386)

* refactor controlnet and add img2img and inpaint

* First draft to get pipelines to work

* make style

* Fix more

* Fix more

* More tests

* Fix more

* Make inpainting work

* make style and more tests

* Apply suggestions from code review

* up

* make style

* Fix imports

* Fix more

* Fix more

* Improve examples

* add test

* Make sure import is correctly deprecated

* Make sure everything works in compile mode

* make sure authorship is correctly attributed

* [Scheduler] DPM-Solver (++) Inverse Scheduler (#3335)

* Add DPM-Solver Multistep Inverse Scheduler

* Add draft tests for DiffEdit

* Add inverse sde-dpmsolver steps to tune image diversity from inverted latents

* Fix tests

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [Docs] Fix incomplete docstring for resnet.py (#3438)

Fix incomplete docstrings for resnet.py

* fix tiled vae blend extent range (#3384)

fix tiled vae blend extent range

* Small update to "Next steps" section (#3443)

Small update to "Next steps" section:

- PyTorch 2 is recommended.
- Updated improvement figures.

* Allow arbitrary aspect ratio in IFSuperResolutionPipeline (#3298)

* Update pipeline_if_superresolution.py

Allow arbitrary aspect ratio in IFSuperResolutionPipeline by using the input image shape

* IFSuperResolutionPipeline: allow the user to override the height and width through the arguments

* update IFSuperResolutionPipeline width/height doc string to match StableDiffusionInpaintPipeline conventions

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Adding 'strength' parameter to StableDiffusionInpaintingPipeline  (#3424)

* Added explanation of 'strength' parameter

* Added get_timesteps function which relies on new strength parameter

* Added `strength` parameter which defaults to 1.

* Swapped ordering so `noise_timestep` can be calculated before masking the image

this is required when you aren't applying 100% noise to the masked region, e.g. strength < 1.

* Added strength to check_inputs, throws error if out of range

* Changed `prepare_latents` to initialise latents w.r.t strength

Inspired by the stable diffusion img2img pipeline: init latents are initialised by converting the init image into a VAE latent and adding noise (based upon the strength parameter passed in), e.g. random when strength = 1, or the init image at strength = 0.

* WIP: Added a unit test for the new strength parameter in the StableDiffusionInpaintingPipeline

still need to add correct regression values

* Created a is_strength_max to initialise from pure random noise

* Updated unit tests w.r.t new strength parameter + fixed new strength unit test

* renamed parameter to avoid confusion with variable of same name

* Updated regression values for new strength test - now passes

* removed 'copied from' comment as this method is now different and divergent from the copy

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Ensure backwards compatibility for prepare_mask_and_masked_image

created a return_image boolean and initialised to false

* Ensure backwards compatibility for prepare_latents

* Fixed copy check typo

* Fixes w.r.t. backward compatibility changes

* make style

* keep function argument ordering same for backwards compatibility in callees with copied from statements

* make fix-copies

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: William Berman <WLBberman@gmail.com>

* [WIP] Bugfix - Pipeline.from_pretrained is broken when the pipeline is partially downloaded (#3448)

Added bugfix using f strings.

* Fix gradient checkpointing bugs in freezing part of models (requires_grad=False) (#3404)

* gradient checkpointing bug fix

* bug fix; changes for reviews

* reformat

* reformat

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Make dreambooth lora more robust to orig unet (#3462)

* Make dreambooth lora more robust to orig unet

* up

* Reduce peak VRAM by releasing large attention tensors (as soon as they're unnecessary) (#3463)

Release large tensors in attention (as soon as they're no longer required). Reduces peak VRAM by nearly 2 GB for 1024x1024 (even after slicing), and the savings scale up with image size.

* Add min snr to text2img lora training script (#3459)

add min snr to text2img lora training script

* Add inpaint lora scale support (#3460)

* add inpaint lora scale support

* add inpaint lora scale test

---------

Co-authored-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>

* [From ckpt] Fix from_ckpt (#3466)

* Correct from_ckpt

* make style

* Update full dreambooth script to work with IF (#3425)

* Add IF dreambooth docs (#3470)

* parameterize pass single args through tuple (#3477)

* attend and excite tests disable determinism on the class level (#3478)

* dreambooth docs torch.compile note (#3471)

* dreambooth docs torch.compile note

* Update examples/dreambooth/README.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update examples/dreambooth/README.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* add: if entry in the dreambooth training docs. (#3472)

* [docs] Textual inversion inference (#3473)

* add textual inversion inference to docs

* add to toctree

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* [docs] Distributed inference (#3376)

* distributed inference

* move to inference section

* apply feedback

* update with split_between_processes

* apply feedback

* [{Up,Down}sample1d] explicit view kernel size as number elements in flattened indices (#3479)

explicit view kernel size as number elements in flattened indices

* mps & onnx tests rework (#3449)

* Remove ONNX tests from PR.

They are already a part of push_tests.yml.

* Remove mps tests from PRs.

They are already performed on push.

* Fix workflow name for fast push tests.

* Extract mps tests to a workflow.

For better control/filtering.

* Remove --extra-index-url from mps tests

* Increase tolerance of mps test

This test passes on my Mac (Ventura 13.3) but fails on the CI hardware
(Ventura 13.2). I ran the local tests following the same steps that
exist in the CI workflow.

* Temporarily run mps tests on pr

So we can test.

* Revert "Temporarily run mps tests on pr"

Tests passed, go back to running on push.

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: Ilia Larchenko <41329713+IliaLarchenko@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Horace He <horacehe2007@yahoo.com>
Co-authored-by: Umar <55330742+mu94-csl@users.noreply.github.com>
Co-authored-by: Mylo <36931363+gitmylo@users.noreply.github.com>
Co-authored-by: Markus Pobitzer <markuspobitzer@gmail.com>
Co-authored-by: Cheng Lu <lucheng.lc15@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Isamu Isozaki <isamu.website@gmail.com>
Co-authored-by: Cesar Aybar <csaybar@gmail.com>
Co-authored-by: Will Rice <will@spokestack.io>
Co-authored-by: Adrià Arrufat <1671644+arrufat@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: At-sushi <dkahw210@kyoto.zaq.ne.jp>
Co-authored-by: Lucca Zenóbio <luccazen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: pdoane <pdoane2@gmail.com>
Co-authored-by: Will Berman <wlbberman@gmail.com>
Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Rupert Menneer <71332436+rupertmenneer@users.noreply.github.com>
Co-authored-by: sudowind <wfpkueecs@163.com>
Co-authored-by: Takuma Mori <takuma104@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Laureηt <laurentfainsin@protonmail.com>
Co-authored-by: Jongwoo Han <jongwooo.han@gmail.com>
Co-authored-by: asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com>
Co-authored-by: clarencechen <clarencechenct@gmail.com>
Co-authored-by: Laureηt <laurent@fainsin.bzh>
Co-authored-by: superlabs-dev <133080491+superlabs-dev@users.noreply.github.com>
Co-authored-by: Dev Aggarwal <devxpy@gmail.com>
Co-authored-by: Vimarsh Chaturvedi <vimarsh.c@gmail.com>
Co-authored-by: 7eu7d7 <31194890+7eu7d7@users.noreply.github.com>
Co-authored-by: cmdr2 <shashank.shekhar.global@gmail.com>
Co-authored-by: wfng92 <43742196+wfng92@users.noreply.github.com>
Co-authored-by: Glaceon-Hyy <ffheyy0017@gmail.com>
Co-authored-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>
2023-05-22 16:11:08 +01:00
Patrick von Platen
2b56e8ca68 make style 2023-05-22 16:49:46 +02:00
Ambrosiussen
b8b5daaee3 DataLoader respecting EXIF data in Training Images (#3465)
* DataLoader will now bake in any transforms or image manipulations contained in the EXIF

Images may have rotations stored in EXIF. Training using such images will cause those transforms to be ignored while training and thus produce unexpected results

* Fixed the Dataloading EXIF issue in main DreamBooth training as well

* Run make style (black & isort)
2023-05-22 15:49:35 +01:00
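A short sketch of the fix described above, assuming PIL-based image loading in the training dataset: apply the EXIF orientation before the training transforms so the tensor matches what a viewer would display.

```python
# Sketch: honour EXIF rotation flags when loading training images.
from PIL import Image, ImageOps


def load_training_image(path):
    image = Image.open(path)
    image = ImageOps.exif_transpose(image)  # bake in the EXIF orientation
    return image.convert("RGB")
```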
Seongsu Park
229fd8cbca [Docs] Korean translation (optimization, training) (#3488)
* feat) optimization kr translation

* fix) typo, italic setting

* feat) dreambooth, text2image kr

* feat) lora kr

* fix) LoRA

* fix) fp16 fix

* fix) doc-builder style

* fix) fp16 revised some wording

* fix) fp16 style fix

* fix) opt, training docs update

* feat) toctree update

* feat) toctree update

---------

Co-authored-by: Chanran Kim <seriousran@gmail.com>
2023-05-22 15:46:16 +01:00
Patrick von Platen
a2874af297 make style 2023-05-22 16:44:48 +02:00
w4ffl35
0160e5146f Adds local_files_only bool to prevent forced online connection (#3486) 2023-05-22 15:44:36 +01:00
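A minimal sketch of the flag above (the model id is only an example): load strictly from the local cache and skip any network call.

```python
# Sketch: prevent a forced online connection when loading a pipeline.
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", local_files_only=True
)
```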
Isotr0py
194b0a425d Add use_Karras_sigmas to DPMSolverSinglestepScheduler (#3476)
* add use_karras_sigmas

* add karras test

* add doc
2023-05-22 15:43:56 +01:00
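A minimal usage sketch for the new option (the model id is an example): swap a pipeline's scheduler for the single-step DPM-Solver with Karras sigmas enabled.

```python
# Sketch: enable Karras sigmas on DPMSolverSinglestepScheduler.
from diffusers import DiffusionPipeline, DPMSolverSinglestepScheduler

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DPMSolverSinglestepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)
```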
Patrick von Platen
6dd3871ae0 Fix DPM single (#3413)
* Fix DPM single

* add test

* fix one more bug

* Apply suggestions from code review

Co-authored-by: StAlKeR7779 <stalkek7779@yandex.ru>

---------

Co-authored-by: StAlKeR7779 <stalkek7779@yandex.ru>
2023-05-22 14:32:39 +01:00
Patrick von Platen
51843fd7d0 Refactor full determinism (#3485)
* up

* fix more

* Apply suggestions from code review

* fix more

* fix more

* Check it

* Remove 16:8

* fix more

* fix more

* fix more

* up

* up

* Test only stable diffusion

* Test only two files

* up

* Try out spinning up processes that can be killed

* up

* Apply suggestions from code review

* up

* up
2023-05-22 11:15:11 +01:00
Sayak Paul
49ad61c204 [Docs] add note on local directory path. (#3397)
add note on local directory path.

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-21 15:26:56 +05:30
Sayak Paul
4bbc51d94d [Attention processor] Better warning message when shifting to AttnProcessor2_0 (#3457)
* add: debugging to enabling memory efficient processing

* add: better warning message.
2023-05-21 15:26:47 +05:30
Pedro Cuenca
f7b4f51cc2 mps & onnx tests rework (#3449)
* Remove ONNX tests from PR.

They are already a part of push_tests.yml.

* Remove mps tests from PRs.

They are already performed on push.

* Fix workflow name for fast push tests.

* Extract mps tests to a workflow.

For better control/filtering.

* Remove --extra-index-url from mps tests

* Increase tolerance of mps test

This test passes on my Mac (Ventura 13.3) but fails on the CI hardware
(Ventura 13.2). I ran the local tests following the same steps that
exist in the CI workflow.

* Temporarily run mps tests on pr

So we can test.

* Revert "Temporarily run mps tests on pr"

Tests passed, go back to running on push.
2023-05-20 13:43:07 +02:00
Will Berman
85eff637aa [{Up,Down}sample1d] explicit view kernel size as number elements in flattened indices (#3479)
explicit view kernel size as number elements in flattened indices
2023-05-19 10:45:56 -07:00
Steven Liu
e589bdb956 [docs] Distributed inference (#3376)
* distributed inference

* move to inference section

* apply feedback

* update with split_between_processes

* apply feedback
2023-05-19 10:07:33 -07:00
Steven Liu
00c76f6ff1 [docs] Textual inversion inference (#3473)
* add textual inversion inference to docs

* add to toctree

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-05-19 09:47:27 -07:00
Sayak Paul
e343443565 add: if entry in the dreambooth training docs. (#3472) 2023-05-19 07:47:28 +05:30
Will Berman
8d646f2294 dreambooth docs torch.compile note (#3471)
* dreambooth docs torch.compile note

* Update examples/dreambooth/README.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update examples/dreambooth/README.md

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-05-19 07:40:14 +05:30
Will Berman
8917769499 attend and excite tests disable determinism on the class level (#3478) 2023-05-18 10:24:49 -07:00
Will Berman
49b7ccfb96 parameterize pass single args through tuple (#3477) 2023-05-18 10:14:29 -07:00
Will Berman
7200985eab Add IF dreambooth docs (#3470) 2023-05-17 11:56:10 -07:00
Will Berman
c9f939bf98 Update full dreambooth script to work with IF (#3425) 2023-05-17 10:42:20 -07:00
Patrick von Platen
2858d7e15e [From ckpt] Fix from_ckpt (#3466)
* Correct from_ckpt

* make style
2023-05-17 13:26:53 +01:00
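A minimal sketch of the `from_ckpt` entry point this fix touches (the checkpoint path is a placeholder): load a single-file `.ckpt`/`.safetensors` checkpoint directly into a pipeline.

```python
# Sketch: single-file checkpoint loading via from_ckpt.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_ckpt("path/to/model.safetensors")  # placeholder path
```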
Glaceon-Hyy
88295f92d9 Add inpaint lora scale support (#3460)
* add inpaint lora scale support

* add inpaint lora scale test

---------

Co-authored-by: yueyang.hyy <yueyang.hyy@alibaba-inc.com>
2023-05-17 16:58:19 +05:30
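A hedged usage sketch for the LoRA scale now honoured by the inpaint pipeline; the model id, LoRA path, and image URLs are placeholders, and the weights are assumed to be loadable via `load_lora_weights`.

```python
# Sketch: pass a LoRA scale through cross_attention_kwargs when inpainting.
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora")  # placeholder path

init_image = load_image("https://example.com/photo.png")  # placeholder URL
mask_image = load_image("https://example.com/mask.png")   # placeholder URL

result = pipe(
    prompt="a wooden bench",
    image=init_image,
    mask_image=mask_image,
    cross_attention_kwargs={"scale": 0.5},  # LoRA scale now respected by inpaint
).images[0]
```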
wfng92
2faf91dbde Add min snr to text2img lora training script (#3459)
add min snr to text2img lora training script
2023-05-17 16:37:45 +05:30
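A sketch of the Min-SNR-gamma weighting the script adds, assuming epsilon prediction and a scheduler exposing `alphas_cumprod`; `gamma=5.0` is a common default, not a value taken from this commit. The per-example MSE loss would then be multiplied by these weights before averaging.

```python
# Sketch: Min-SNR-gamma loss weights for a batch of timesteps.
import torch


def min_snr_weights(scheduler, timesteps, gamma=5.0):
    alphas_cumprod = scheduler.alphas_cumprod.to(timesteps.device)[timesteps]
    snr = alphas_cumprod / (1.0 - alphas_cumprod)
    # Clamp the SNR at gamma, then normalise so low-noise steps no longer
    # dominate the loss.
    return torch.clamp(snr, max=gamma) / snr
```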
cmdr2
bd78f63a54 Reduce peak VRAM by releasing large attention tensors (as soon as they're unnecessary) (#3463)
Release large tensors in attention (as soon as they're no longer required). Reduces peak VRAM by nearly 2 GB for 1024x1024 (even after slicing), and the savings scale up with image size.
2023-05-17 11:24:59 +01:00
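A rough sketch of the idea above (not the actual attention-processor code): drop references to the big intermediates as soon as they are consumed, so PyTorch can free them before the next large allocation.

```python
# Sketch: release large attention intermediates early.
import torch


def attention(query, key, value, scale):
    scores = torch.baddbmm(
        torch.empty(
            query.shape[0], query.shape[1], key.shape[1],
            device=query.device, dtype=query.dtype,
        ),
        query, key.transpose(-1, -2), beta=0, alpha=scale,
    )
    probs = scores.softmax(dim=-1)
    del scores  # release the pre-softmax logits before the value matmul
    out = torch.bmm(probs, value)
    del probs   # release the attention map before returning
    return out
```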
Patrick von Platen
3ebd2d1f9e Make dreambooth lora more robust to orig unet (#3462)
* Make dreambooth lora more robust to orig unet

* up
2023-05-17 11:20:13 +01:00
7eu7d7
15f1bab13b Fix gradient checkpointing bugs in freezing part of models (requires_grad=False) (#3404)
* gradient checkpointing bug fix

* bug fix; changes for reviews

* reformat

* reformat

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-17 11:06:04 +01:00
Vimarsh Chaturvedi
415c616712 [WIP] Bugfix - Pipeline.from_pretrained is broken when the pipeline is partially downloaded (#3448)
Added bugfix using f strings.
2023-05-17 11:05:33 +01:00
Rupert Menneer
c09c4f3ab7 Adding 'strength' parameter to StableDiffusionInpaintingPipeline (#3424)
* Added explanation of 'strength' parameter

* Added get_timesteps function which relies on new strength parameter

* Added `strength` parameter which defaults to 1.

* Swapped ordering so `noise_timestep` can be calculated before masking the image

this is required when you aren't applying 100% noise to the masked region, e.g. strength < 1.

* Added strength to check_inputs, throws error if out of range

* Changed `prepare_latents` to initialise latents w.r.t strength

Inspired by the stable diffusion img2img pipeline: init latents are initialised by converting the init image into a VAE latent and adding noise (based upon the strength parameter passed in), e.g. random when strength = 1, or the init image at strength = 0.

* WIP: Added a unit test for the new strength parameter in the StableDiffusionInpaintingPipeline

still need to add correct regression values

* Created a is_strength_max to initialise from pure random noise

* Updated unit tests w.r.t new strength parameter + fixed new strength unit test

* renamed parameter to avoid confusion with variable of same name

* Updated regression values for new strength test - now passes

* removed 'copied from' comment as this method is now different and divergent from the copy

* Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Ensure backwards compatibility for prepare_mask_and_masked_image

created a return_image boolean and initialised to false

* Ensure backwards compatibility for prepare_latents

* Fixed copy check typo

* Fixes w.r.t. backward compatibility changes

* make style

* keep function argument ordering same for backwards compatibility in callees with copied from statements

* make fix-copies

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: William Berman <WLBberman@gmail.com>
2023-05-17 11:05:16 +01:00
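A sketch of how `strength` maps onto the schedule, following the img2img convention the commit message refers to; the helper names are illustrative, not necessarily the exact ones added by the PR.

```python
# Sketch: strength controls how much of the schedule is run and how latents start.
import torch


def get_timesteps(scheduler, num_inference_steps, strength):
    # Keep only the last `strength` fraction of the denoising schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return scheduler.timesteps[t_start:], num_inference_steps - t_start


def prepare_latents(scheduler, image_latents, noise, timesteps, strength):
    if strength >= 1.0:
        # Pure noise: behaves like the original inpainting pipeline.
        return noise * scheduler.init_noise_sigma
    # Partial noise: start from the encoded image, noised to the first timestep.
    return scheduler.add_noise(image_latents, noise, timesteps[:1])
```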
Dev Aggarwal
6070b32fcf Allow arbitrary aspect ratio in IFSuperResolutionPipeline (#3298)
* Update pipeline_if_superresolution.py

Allow arbitrary aspect ratio in IFSuperResolutionPipeline by using the input image shape

* IFSuperResolutionPipeline: allow the user to override the height and width through the arguments

* update IFSuperResolutionPipeline width/height doc string to match StableDiffusionInpaintPipeline conventions

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-16 19:21:07 -07:00
Pedro Cuenca
0392eceba8 Small update to "Next steps" section (#3443)
Small update to "Next steps" section:

- PyTorch 2 is recommended.
- Updated improvement figures.
2023-05-16 19:35:47 +01:00
superlabs-dev
92ea5baca2 fix tiled vae blend extent range (#3384)
fix tiled vae blend extent range
2023-05-16 19:33:47 +01:00
Laureηt
754fac82d2 [Docs] Fix incomplete docstring for resnet.py (#3438)
Fix incomplete docstrings for resnet.py
2023-05-16 19:33:34 +01:00
clarencechen
17f9aed79c [Scheduler] DPM-Solver (++) Inverse Scheduler (#3335)
* Add DPM-Solver Multistep Inverse Scheduler

* Add draft tests for DiffEdit

* Add inverse sde-dpmsolver steps to tune image diversity from inverted latents

* Fix tests

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-16 19:26:53 +01:00
Patrick von Platen
886575ee43 Refactor controlnet and add img2img and inpaint (#3386)
* refactor controlnet and add img2img and inpaint

* First draft to get pipelines to work

* make style

* Fix more

* Fix more

* More tests

* Fix more

* Make inpainting work

* make style and more tests

* Apply suggestions from code review

* up

* make style

* Fix imports

* Fix more

* Fix more

* Improve examples

* add test

* Make sure import is correctly deprecated

* Make sure everything works in compile mode

* make sure authorship is correctly attributed
2023-05-16 19:07:21 +01:00
asfiyab-nvidia
9d44e2fb66 add stable diffusion tensorrt img2img pipeline (#3419)
* add stable diffusion tensorrt img2img pipeline

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* update docstrings

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
2023-05-16 14:28:01 +01:00
Patrick von Platen
d2285f5158 fix warning message pipeline loading (#3446) 2023-05-16 12:58:24 +01:00
Jongwoo Han
326f326e17 Replace deprecated command with environment file (#3409)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-16 12:51:10 +01:00
Will Berman
29b1325a5a unCLIP scheduler do not use note (#3417) 2023-05-15 09:47:14 -06:00
Pedro Cuenca
7a32b6beeb Fix style rendering (#3433)
* Fix style rendering.

* Fix typo
2023-05-15 14:32:34 +05:30
Sayak Paul
bdefabd1a8 [Docs] update the PT 2.0 optimization doc with latest findings (#3370)
* add: benchmarking stats for A100 and V100.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* address patrick's comments.

* add: rtx 4090 stats

* ⚔ benchmark reports done

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* 3313 pr link.

* add: plots.

Co-authored-by: Pedro <pedro@huggingface.co>

* fix formatting

* update number percent.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-05-13 15:12:01 +05:30
Will Berman
909742dbd6 attention refactor: the trilogy (#3387)
* Replace `AttentionBlock` with `Attention`

* use _from_deprecated_attn_block check re: @patrickvonplaten
2023-05-12 08:54:09 -06:00
Patrick von Platen
28f404349d Improve fast tests (#3416)
Update pr_tests.yml
2023-05-12 14:01:03 +01:00
Patrick von Platen
03e5126978 Don't install transformers and accelerate from source (#3414) 2023-05-12 13:15:23 +01:00
Patrick von Platen
b1b92f4a98 Don't install accelerate and transformers from source (#3415) 2023-05-12 13:14:04 +01:00
Laureηt
7f6373d264 [Docs] Add sigmoid beta_scheduler to docstrings of relevant Schedulers (#3399)
* Add `sigmoid` beta scheduler to `DDPMScheduler` docstring

* Add `sigmoid` beta scheduler to `RePaintScheduler` docstring

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-12 12:48:26 +01:00
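A one-line sketch of the option these docstrings now mention (all other scheduler arguments left at their defaults).

```python
# Sketch: select the sigmoid beta schedule.
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(beta_schedule="sigmoid")
```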
Sayak Paul
3a237f4fa2 fix: deepspeed_plugin retrieval from accelerate state (#3410) 2023-05-12 10:02:22 +01:00
Patrick von Platen
1a5797c6d4 Fix docker file (#3402)
* up

* up
2023-05-11 20:28:37 +01:00
Patrick von Platen
f92253015c Fix various bugs with LoRA Dreambooth and Dreambooth script (#3353)
* Improve checkpointing lora

* fix more

* Improve doc string

* Update src/diffusers/loaders.py

* make style

* Apply suggestions from code review

* Update src/diffusers/loaders.py

* Apply suggestions from code review

* Apply suggestions from code review

* better

* Fix all

* Fix multi-GPU dreambooth

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Fix all

* make style

* make style

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-05-11 19:28:09 +01:00
Patrick von Platen
58c6f9cb71 Add omegaconf for tests (#3400)
Add omegaconf
2023-05-11 18:03:27 +01:00
Stas Bekman
af2a237676 [deepspeed] partial ZeRO-3 support (#3076)
* [deepspeed] partial ZeRO-3 support

* cleanup

* improve deepspeed fixes

* Improve

* make style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-11 16:59:20 +01:00
Steven Liu
d71db894eb [docs] Add transformers to install (#3388)
add transformers to install
2023-05-11 08:52:28 -07:00
Sayak Paul
90f5f3c4d4 [Tests] better determinism (#3374)
* enable deterministic pytorch and cuda operations.

* disable manual seeding.

* make style && make quality for unet_2d tests.

* enable determinism for the unet2dconditional model.

* add CUBLAS_WORKSPACE_CONFIG for better reproducibility.

* relax tolerance (very weird issue, though).

* revert to torch manual_seed() where needed.

* relax more tolerance.

* better placement of the cuda variable and relax more tolerance.

* enable determinism for 3d condition model.

* relax tolerance.

* add: determinism to alt_diffusion.

* relax tolerance for alt diffusion.

* dance diffusion.

* dance diffusion is flaky.

* test_dict_tuple_outputs_equivalent edit.

* fix two more tests.

* fix more ddim tests.

* fix: argument.

* change to diff in place of difference.

* fix: test_save_load call.

* test_save_load_float16 call.

* fix: expected_max_diff

* fix: paint by example.

* relax tolerance.

* add determinism to 1d unet model.

* torch 2.0 regressions seem to be brutal

* determinism to vae.

* add reason to skipping.

* up tolerance.

* determinism to vq.

* determinism to cuda.

* determinism to the generic test pipeline file.

* refactor general pipelines testing a bit.

* determinism to alt diffusion i2i

* up tolerance for alt diff i2i and audio diff

* up tolerance.

* determinism to audioldm

* increase tolerance for audioldm lms.

* increase tolerance for paint by example.

* increase tolerance for repaint.

* determinism to cycle diffusion and sd 1.

* relax tol for cycle diffusion 🚲

* relax tol for sd 1.0

* relax tol for controlnet.

* determinism to img var.

* relax tol for img variation.

* tolerance to i2i sd

* make style

* determinism to inpaint.

* relax tolerance for inpainting.

* determinism for inpainting legacy

* relax tolerance.

* determinism to instruct pix2pix

* determinism to model editing.

* model editing tolerance.

* panorama determinism

* determinism to pix2pix zero.

* determinism to sag.

* sd 2. determinism

* sd. tolerance

* disallow tf32 matmul.

* relax tolerance is all you need.

* make style and determinism to sd 2 depth

* relax tolerance for depth.

* tolerance to diffedit.

* tolerance to sd 2 inpaint.

* up tolerance.

* determinism in upscaling.

* tolerance in upscaler.

* more tolerance relaxation.

* determinism to v pred.

* up tol for v_pred

* unclip determinism

* determinism to unclip img2img

* determinism to text to video.

* determinism to last set of tests

* up tol.

* vq cumsum doesn't have a deterministic kernel

* relax tol

* relax tol
2023-05-11 16:38:14 +01:00
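
The determinism work above relies on a handful of standard PyTorch reproducibility switches. A minimal sketch of what enabling them typically looks like, assuming a CUDA environment; the exact placement inside the diffusers test suite may differ:

```python
import os

import torch

# CUBLAS needs a workspace config for deterministic GEMMs; set it before the first CUDA call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Raise an error whenever an op would fall back to a non-deterministic kernel.
torch.use_deterministic_algorithms(True)

# Disallow TF32 matmuls, which trade bit-exact reproducibility for speed on Ampere+ GPUs.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

With these set, ops without a deterministic implementation raise an error instead of silently varying between runs, which is why many tolerances above are relaxed rather than removed.
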
Takuma Mori
01c056f094 Support ControlNet v1.1 shuffle properly (#3340)
* add inferring_controlnet_cond_batch

* Revert "add inferring_controlnet_cond_batch"

This reverts commit abe8d6311d.

* set guess_mode to True
whenever global_pool_conditions is True

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* nit

* add integration test

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-11 14:58:07 +01:00
sudowind
e0b56d2b18 [Docs] Fix stable_diffusion.mdx typo (#3398)
Fix typo in last code block. Correct "prommpts" to "prompt"
2023-05-11 15:10:16 +02:00
Patrick von Platen
f740d357c9 make style 2023-05-11 11:31:49 +02:00
Steven Liu
5e746753d6 [docs] Load safetensors (#3333)
* safetensors

* apply feedback

* apply feedback

* Apply suggestions from code review

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-11 10:31:27 +01:00
Steven Liu
c49e9ede4d [docs] Adapt a model (#3326)
* first draft

* apply feedback

* conv_in.weight thrown away
2023-05-10 16:02:48 -07:00
Patrick von Platen
82e6fa56f0 make style 2023-05-10 20:16:18 +02:00
Rupert Menneer
edb087a217 StableDiffusionInpaintingPipeline - resize image w.r.t height and width (#3322)
* StableDiffusionInpaintingPipeline now resizes input images and masks w.r.t. the passed input height and width. Default is already set to 512. This addresses the common tensor mismatch error. Also moved the type check into the relevant function to keep the main pipeline body tidy.

* Fixed StableDiffusionInpaintingPrepareMaskAndMaskedImageTests

Due to the previous commit, these tests were failing because height and width now need to be passed into the prepare_mask_and_masked_image function. I have updated the code and added a height/width variable per unit test, as that seemed more appropriate than the previous hard-coded solution.

* Added a resolution test to StableDiffusionInpaintPipelineSlowTests

This unit test simply takes the input and resizes it to something that would otherwise fail (e.g. throw a tensor mismatch error / not a multiple of 8), then passes it through the pipeline and verifies that the output has the correct dimensions w.r.t. the passed height and width.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-10 19:14:25 +01:00
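
For reference, a minimal usage sketch of the resizing behavior described above; the checkpoint name and file paths are illustrative:

```python
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe = pipe.to("cuda")

init_image = load_image("input.png")  # illustrative paths; inputs can be any size
mask_image = load_image("mask.png")   # both are resized to the requested height/width internally

image = pipe(
    prompt="a white cat sitting on a park bench",
    image=init_image,
    mask_image=mask_image,
    height=512,
    width=512,
).images[0]
image.save("inpainted.png")
```
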
Sayak Paul
94a0c644a8 add: a warning message when using xformers in a PT 2.0 env. (#3365)
* add: a warning message when using xformers in a PT 2.0 env.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-10 07:22:04 +05:30
Steven Liu
26832aa5ef [docs] Improve safetensors docstring (#3368)
* clarify safetensor docstring

* fix typo

* apply feedback
2023-05-09 16:15:05 -07:00
YiYi Xu
c559479592 Postprocessing refactor all others (#3337)
* add text2img

* fix-copies

* add

* add all other pipelines

* add

* add

* add

* add

* add

* make style

* style + fix copies

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-05-09 22:28:30 +01:00
Will Berman
a757b2db6e if dreambooth lora (#3360)
* update IF stage I pipelines

add fixed variance schedulers and lora loading

* added kv lora attn processor

* allow loading into alternative lora attn processor

* make vae optional

* throw away predicted variance

* allow loading into added kv lora layer

* allow load T5

* allow pre compute text embeddings

* set new variance type in schedulers

* fix copies

* refactor all prompt embedding code

class prompts are now included in pre-encoding code
max tokenizer length is now configurable
embedding attention mask is now configurable

* fix for when variance type is not defined on scheduler

* do not pre compute validation prompt if not present

* add example test for if lora dreambooth

* add check for train text encoder and pre compute text embeddings
2023-05-09 10:24:36 -07:00
Steven Liu
571bc1ea11 [docs] Fix docstring (#3334)
fix docstring

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-08 12:08:23 -07:00
Patrick von Platen
f381402ec8 make fix-copies 2023-05-08 10:55:02 +02:00
pdoane
3d8b3d7cd8 Batched load of textual inversions (#3277)
* Batched load of textual inversions

- Only call resize_token_embeddings once per batch as it is the most expensive operation
- Allow pretrained_model_name_or_path and token to be an optional list
- Remove Dict from type annotation pretrained_model_name_or_path as it was not supported in this function
- Add comment that single files (e.g. .pt/.safetensors) are supported
- Add comment for token parameter
- Convert token override log message from warning to info

* Update src/diffusers/loaders.py

Check for duplicate tokens

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update condition for None tokens

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-08 09:54:30 +01:00
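
A short sketch of the batched form this change enables; the embedding sources and tokens below are illustrative:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Lists load several embeddings in one call, so resize_token_embeddings runs only once
# for the whole batch. Single .pt/.safetensors files are also accepted as sources.
pipe.load_textual_inversion(
    ["sd-concepts-library/cat-toy", "./my-embedding.safetensors"],  # illustrative sources
    token=["<cat-toy>", "<my-style>"],                              # one token per source
)

image = pipe("a <cat-toy> figurine painted in <my-style>").images[0]
```
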
Isotr0py
0ffac97933 Add use_Karras_sigmas to LMSDiscreteScheduler (#3351)
* add karras sigma to lms discrete scheduler

* add test for lms_scheduler karras

* reformat test lms
2023-05-06 12:19:27 +01:00
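
Switching an existing pipeline to Karras sigmas is essentially a one-liner; a hedged usage sketch:

```python
from diffusers import LMSDiscreteScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Rebuild the scheduler from the pipeline's config with Karras sigma spacing enabled.
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```
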
Lysandre Debut
b0966f5801 Inpainting: typo in docs (#3331)
Typo in docs

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-06 12:13:33 +01:00
Lucca Zenóbio
0407c3e7d0 Fix pipeline class on README (#3345)
Update README.md
2023-05-06 12:06:52 +01:00
At-sushi
7ce3fa010a Fix TypeError when using prompt_embeds and negative_prompt (#2982)
* test: Added test case

* fix: fixed type checking issue on _encode_prompt

* fix: fixed copies consistency

* fix: one copy was not sufficient
2023-05-06 12:04:07 +01:00
Sanchit Gandhi
abd86d1c17 [AudioLDM] Generalise conversion script (#3328)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-06 12:00:42 +01:00
Adrià Arrufat
e9aa0925a8 Rename --only_save_embeds to --save_as_full_pipeline (#3206)
* Set --only_save_embeds to False by default

Due to how the option is named, it makes more sense to behave like this.

* Refactor only_save_embeds to save_as_full_pipeline
2023-05-06 12:00:30 +01:00
Will Rice
36f43ea75a Add upsample_size to AttnUpBlock2D, AttnDownBlock2D (#3275)
The argument `upsample_size` needs to be added to these modules to allow compatibility with other blocks that require this argument.
2023-05-05 19:50:41 +01:00
Cheng Lu
27522b585b Add the SDE variant of DPM-Solver and DPM-Solver++ (#3344)
* add SDE variant of DPM-Solver and DPM-Solver++

* add test

* fix typo

* fix typo
2023-05-05 16:03:47 +01:00
Patrick von Platen
8d4c7d0ea0 Fix config dpm (#3343) 2023-05-05 12:02:33 +01:00
Patrick von Platen
29ad75dc3b [Quality] Make style (#3341) 2023-05-05 10:06:09 +01:00
Sayak Paul
379197a2f0 update controlling generation doc with latest goodies. (#3321) 2023-05-05 11:22:29 +05:30
Cesar Aybar
79c0e24a14 Update write_own_pipeline.mdx (#3323) 2023-05-04 10:58:27 -07:00
Isamu Isozaki
fa9e35fca4 Added input perturbation (#3292)
* Added input perturbation

* Fixed spelling
2023-05-04 18:12:32 +05:30
Steven Liu
4bae76e453 [docs] Improve LoRA docs (#3311)
* update docs

* add to toctree

* apply feedback
2023-05-04 11:28:44 +05:30
Cheng Lu
022479416f Fix multistep dpmsolver for cosine schedule (suitable for deepfloyd-if) (#3314)
* fix multistep dpmsolver for cosine schedule (deepfloyd-if)

* fix a typo

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/schedulers/scheduling_dpmsolver_multistep.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update all dpmsolver (singlestep, multistep, dpm, dpm++) for cosine noise schedule

* add test, fix style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-05-03 18:00:59 +01:00
Markus Pobitzer
2dd408504a Add Stable Diffusion RePaint to community pipelines (#3320)
* Add Stable Diffusion RePaint to community pipelines

- Adds Stable Diffusion RePaint to community pipelines
- Add README entry for the pipeline

* Fix: Remove wrong import

- Remove wrong import
- Minor change in comments

* Fix: Code formatting of stable_diffusion_repaint

* Fix: ruff errors in stable_diffusion_repaint
2023-05-03 17:59:49 +01:00
Patrick von Platen
79bd909dbd Correct doc build for patch releases (#3316)
Update build_documentation.yml
2023-05-03 17:33:41 +01:00
Mylo
63a8ef7b73 Fix missing variable assign in DeepFloyd-IF-II (#3315)
Fix missing variable assign

lol
2023-05-03 17:31:04 +01:00
Umar
0ccad2ad2d Update stable_diffusion.mdx (#3310)
fixed import statement
2023-05-03 15:53:14 +01:00
Sayak Paul
efc48da23b fix: scale_lr and sync example readme and docs. (#3299)
* fix: scale_lr and sync example readme and docs.

* fix doc link.
2023-05-03 10:13:05 +05:30
Patrick von Platen
5c7a35a259 [Torch 2.0 compile] Fix more torch compile breaks (#3313)
* Fix more torch compile breaks

* add tests

* Fix all

* fix controlnet

* fix more

* Add Horace He as co-author.

Co-authored-by: Horace He <horacehe2007@yahoo.com>

* Add Horace He as co-author.

Co-authored-by: Horace He <horacehe2007@yahoo.com>

---------

Co-authored-by: Horace He <horacehe2007@yahoo.com>
2023-05-02 18:51:00 +01:00
YiYi Xu
a7f25b4a88 Postprocessing refactor img2img (#3268)
* refactor img2img VaeImageProcessor.postprocess

* remove copy from for init, run_safety_checker, decode_latents

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

---------

Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-05-01 07:54:09 -10:00
Patrick von Platen
0e82fb19e1 Torch compile graph fix (#3286)
* fix more

* Fix more

* fix more

* Apply suggestions from code review

* fix

* make style

* make fix-copies

* fix

* make sure torch compile

* Clean

* fix test
2023-05-01 16:45:43 +02:00
Ilia Larchenko
709cf554f6 Typo in tutorial (#3295) 2023-05-01 15:44:30 +02:00
Ilia Larchenko
536684eb2f Changed sample[0] to images[0] (#3304)
A pipeline object stores the results in `images` not in `sample`.
Current code blocks don't work.
2023-05-01 15:33:51 +02:00
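
In other words, results are read from the pipeline output's `images` attribute:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

output = pipe("a photo of an astronaut riding a horse")
image = output.images[0]  # not output.sample[0]
image.save("astronaut.png")
```
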
Will Berman
384c83aa9a temp disable spectrogram diffusion tests (#3278)
The note-seq package throws an error on import because the default installed version of IPython
is not compatible with Python 3.8, which we run in the CI.
https://github.com/huggingface/diffusers/actions/runs/4830121056/jobs/8605954838#step:7:9
2023-04-28 12:05:53 -07:00
YiYi Xu
14b460614b [doc] add link to training script (#3271)
add link to training script

Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
2023-04-28 07:14:30 -10:00
Patrick von Platen
4d35d7fea3 Allow disabling torch 2_0 attention (#3273)
* Allow disabling torch 2_0 attention

* make style

* Update src/diffusers/models/attention.py
2023-04-28 13:31:11 +02:00
Jason Kuan
a7b0671c07 add constant learning rate with custom rule (#3133)
* add constant lr with rules

* add constant with rules in TYPE_TO_SCHEDULER_FUNCTION

* add constant lr rate with rule

* hotfix code quality

* fix doc style

* change name constant_with_rules to piecewise constant
2023-04-28 16:29:56 +05:30
clarencechen
be0bfcec4d Diffedit Zero-Shot Inpainting Pipeline (#2837)
* Update Pix2PixZero Auto-correlation Loss

* Add Stable Diffusion DiffEdit pipeline

* Add draft documentation and import code

* Bugfixes and refactoring

* Add option to not decode latents in the inversion process

* Harmonize preprocessing

* Revert "Update Pix2PixZero Auto-correlation Loss"

This reverts commit b218062fed.

* Update annotations

* rename `compute_mask` to `generate_mask`

* Update documentation

* Update docs

* Update Docs

* Fix copy

* Change shape of output latents to batch first

* Update docs

* Add first draft for tests

* Bugfix and update tests

* Add `cross_attention_kwargs` support for all pipeline methods

* Fix Copies

* Add support for PIL image latents

Add support for mask broadcasting

Update docs and tests

Align `mask` argument to `mask_image`

Remove height and width arguments

* Enable MPS Tests

* Move example docstrings

* Fix test

* Fix test

* fix pipeline inheritance

* Harmonize `prepare_image_latents` with StableDiffusionPix2PixZeroPipeline

* Register modules set to `None` in config for `test_save_load_optional_components`

* Move fixed logic to specific test class

* Clean changes to other pipelines

* Update new tests to coordinate with #2953

* Update slow tests for better results

* Safety to avoid potential problems with torch.inference_mode

* Add reference in SD Pipeline Overview

* Fix tests again

* Enforce determinism in noise for generate_mask

* Fix copies

* Widen test tolerance for fp16 based on `test_stable_diffusion_upscale_pipeline_fp16`

* Add LoraLoaderMixin and update `prepare_image_latents`

* clean up repeat and reg

* bugfix

* Remove invalid args from docs

Suppress spurious warning by repeating image before latent to mask gen
2023-04-28 16:28:26 +05:30
Patrick von Platen
d464214464 Let's make sure that dreambooth always uploads to the Hub (#3272)
* Update Dreambooth README

* Adapt all docs as well

* automatically write model card

* fix

* make style
2023-04-28 11:39:50 +01:00
timegate
6290668254 Add multiple conditions to StableDiffusionControlNetInpaintPipeline (#3125)
* try multi controlnet inpaint

* multi controlnet inpaint

* multi controlnet inpaint
2023-04-28 10:58:10 +01:00
M. Tolga Cangöz
73cc43109b Update logging.mdx (#2863)
Fix typos
2023-04-28 10:57:27 +01:00
NimenDavid
0614fd2038 [Docs] zh translated docs update (#3245)
* zh translated docs update

* update _toctree
2023-04-28 10:23:02 +01:00
Joqsan
462b4edd31 [Community Pipelines] EDICT pipeline implementation (#3153)
* EDICT pipeline initial commit

- Starting point taking from https://github.com/Joqsan/edict-diffusion

* refactor __init__() method

* minor refactoring

* refactor scheduler code

- remove scheduler and move its methods to the EDICTPipeline class

* make CFG optional
- refactor encode_prompt().
- include optional generator for sampling with vae.
- minor variable renaming

* add EDICT pipeline description to README.md

* replace preprocess() with VaeImageProcessor

* run make style and make quality commands

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-28 10:11:29 +01:00
Sayak Paul
71de5b7051 [LoRA] quality of life improvements in the loading semantics and docs (#3180)
* 👽 qol improvements for LoRA.

* better function name?

* fix: LoRA weight loading with the new format.

* address Patrick's comments.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* change wording around encouraging the use of load_lora_weights().

* fix: function name.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-28 11:36:49 +05:30
Will Berman
256e6960cb [docs] add notes for stateful model changes (#3252)
* [docs] add notes for stateful model changes

* Update docs/source/en/optimization/fp16.mdx

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* link to accelerate docs for discarding hooks

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-04-27 11:05:08 -07:00
YiYi Xu
329d1df8f2 update notebook (#3259)
Co-authored-by: yiyixuxu <yixu@yis-macbook-pro.lan>
2023-04-27 07:03:56 -10:00
Patrick von Platen
364d59d13b Fix community pipelines (#3266) 2023-04-27 17:12:08 +01:00
Patrick von Platen
2ced899cc7 Revert "Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline"" (#3265)
Revert "Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline" (#3201)"

This reverts commit 91a2a80eb2.
2023-04-27 16:45:37 +01:00
Robert Dargavel Smith
b63419a28a AudioDiffusionPipeline - fix encode method after config changes (#3114)
* config fixes

* deprecate get_input_dims
2023-04-27 16:27:41 +01:00
Jair Trejo
eb29dbad17 Fix typo in textual inversion JAX training script (#3123)
The pipeline is built as `pipe` but then used as `pipeline`.
2023-04-27 16:24:12 +01:00
Xie Zejian
d92c4d5ab7 fix typo in score sde pipeline (#3132) 2023-04-27 15:39:14 +01:00
apolinário
eade4308da Update IF name to XL (#3262)
Co-authored-by: multimodalart <joaopaulo.passos+multimodal@gmail.com>
2023-04-27 14:26:58 +01:00
Ernie Chu
fa31da29e5 [docs] Update interface in repaint.mdx (#3119)
Update repaint.mdx

accommodate #1701
2023-04-27 13:24:51 +01:00
Isaac
77bfb56241 adding required parameters while calling the get_up_block and get_down_block (#3210)
* removed unnecessary parameters from get_up_block and get_down_block functions

* adding resnet_skip_time_act, resnet_out_scale_factor and cross_attention_norm to get_up_block and get_down_block functions

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-04-27 17:01:43 +05:30
Pedro Cuenca
70ef774fa0 Remove required from tracker_project_name (#3260)
Remove required from tracker_project_name.

As observed by https://github.com/off99555 in https://github.com/huggingface/diffusers/issues/2695#issuecomment-1470755050, it already has a default value.
2023-04-27 16:59:18 +05:30
Nipun Jindal
0b64c2c6c3 [Stochastic Sampler][Slow Test]: Cuda test fixes (#3257)
[Slow Test]: Cuda test fixes

Co-authored-by: njindal <njindal@adobe.com>
2023-04-27 14:52:38 +05:30
Nipun Jindal
fd512d7461 [2064]: Add stochastic sampler (sample_dpmpp_sde) (#3020)
* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* [2064]: Add stochastic sampler

* Review comments

* [Review comment]: Add is_torchsde_available()

* [Review comment]: Test and docs

* [Review comment]

* [Review comment]

* [Review comment]

* [Review comment]

* [Review comment]

---------

Co-authored-by: njindal <njindal@adobe.com>
2023-04-27 11:18:38 +05:30
Pedro Cuenca
e0a2bd15f9 Write model card in controlnet training script (#3229)
Write model card in controlnet training script.
2023-04-26 21:22:27 +02:00
Pedro Cuenca
c399de396d [docs] only mention one stage (#3246)
* [docs] only mention one stage

* add blurb on auto accepting

---------

Co-authored-by: William Berman <WLBberman@gmail.com>
2023-04-26 12:06:50 -07:00
Patrick von Platen
f842396367 Post release for 0.16.0 (#3244)
* Post release

* fix more
2023-04-26 17:43:09 +01:00
Patrick von Platen
6ba0efb9a1 Release: v0.16.0 2023-04-26 13:35:01 +02:00
Sanchit Gandhi
46ceba5b35 [AudioLDM] Update docs to use updated ckpt (#3240)
* [AudioLDM] Update docs to use updated ckpt

* make style
2023-04-26 12:33:08 +01:00
Sayak Paul
977162c02b Adds a document on token merging (#3208)
* add document on token merging.

* fix headline.

* fix: headline.

* add some samples for comparison.
2023-04-26 16:25:48 +05:30
Patrick von Platen
744663f8dc fix fast test (#3241) 2023-04-26 11:44:19 +01:00
Patrick von Platen
abbf3c1adf Allow fp16 attn for x4 upscaler (#3239)
* Add all files

* update

* Make sure vae is memory efficient for PT 1

* make style
2023-04-26 11:16:06 +01:00
Patrick von Platen
da2ce1a6b9 Allow return pt x4 (#3236)
* Add all files

* update
2023-04-26 09:34:34 +01:00
Patrick von Platen
e51f19aee8 add model (#3230)
* add

* clean

* up

* clean up more

* fix more tests

* Improve docs further

* improve

* more fixes docs

* Improve docs more

* Update src/diffusers/models/unet_2d_condition.py

* fix

* up

* update doc links

* make fix-copies

* add safety checker and watermarker to stage 3 doc page code snippets

* speed optimizations docs

* memory optimization docs

* make style

* add watermarking snippets to doc string examples

* make style

* use pt_to_pil helper functions in doc strings

* skip mps tests

* Improve safety

* make style

* new logic

* fix

* fix bad onnx design

* make new stable diffusion upscale pipeline model arguments optional

* define has_nsfw_concept when non-pil output type

* lowercase linked to notebook name

---------

Co-authored-by: William Berman <WLBberman@gmail.com>
2023-04-25 14:20:43 -07:00
Patrick von Platen
1ffcc924bc Fix docs text inversion (#3166)
* Fix docs text inversion

* Apply suggestions from code review
2023-04-25 14:18:40 +01:00
Yuchen Fan
730e01ec93 Sync cache version check from transformers (#3179)
sync cache version check from transformers
2023-04-25 14:18:25 +01:00
pdoane
0d196f9f45 Fix issue in maybe_convert_prompt (#3188)
When the token used for textual inversion does not have any special symbols (e.g. it is not surrounded by <>), the tokenizer does not properly split the replacement tokens.  Adding a space for the padding tokens fixes this.
2023-04-25 14:17:57 +01:00
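
An illustrative (not verbatim) sketch of the expansion that `maybe_convert_prompt` performs, showing why the replacement pieces must be space-separated even for tokens without `<...>` markers:

```python
# Illustrative only -- the real logic lives in TextualInversionLoaderMixin.maybe_convert_prompt.
def expand_multi_vector_token(prompt: str, token: str, num_vectors: int) -> str:
    # A token loaded as num_vectors embeddings expands to: token, token_1, ..., token_{n-1}.
    replacement = " ".join([token] + [f"{token}_{i}" for i in range(1, num_vectors)])
    return prompt.replace(token, replacement)

# Works for "<cat-toy>" but also for a plain token like "cattoy": because the pieces are
# separated by spaces, the tokenizer splits them into the individual placeholder tokens.
print(expand_multi_vector_token("a photo of cattoy", "cattoy", 3))
# a photo of cattoy cattoy_1 cattoy_2
```
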
Patrick von Platen
131312caba Add ControlNet v1.1 docs (#3226)
Add v1.1 docs
2023-04-25 14:12:35 +01:00
Isaac
e9edbfc251 adding enable_vae_tiling and disable_vae_tiling functions (#3225)
adding enable_vae_tiling and disable_vae_tiling functions
2023-04-25 14:12:21 +01:00
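
A short usage sketch of the new toggles, which decode latents tile by tile to keep VRAM usage bounded for large images:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

pipe.enable_vae_tiling()   # decode latents tile by tile to bound VRAM usage
image = pipe("a wide panorama of a mountain range", height=1024, width=2048).images[0]
pipe.disable_vae_tiling()  # back to single-pass decoding
```
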
Lucca Zenóbio
0ddc5bf7b9 fix mixed precision training on train_dreambooth_inpaint_lora (#3138)
cast to weight dtype
2023-04-25 15:22:57 +05:30
Patrick von Platen
c5933c9c89 [Bug fix] Fix batch size attention head size mismatch (#3214) 2023-04-25 00:44:00 +02:00
Will Berman
91a2a80eb2 Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline" (#3201)
Revert "[Community Pipelines] Update lpw_stable_diffusion pipeline (#3197)"

This reverts commit 9965cb50ea.
2023-04-22 12:36:55 -07:00
Patrick von Platen
425192fe15 Make sure VAE attention works with Torch 2_0 (#3200)
* Make sure attention works with Torch 2_0

* make style

* Fix more
2023-04-22 17:29:29 +01:00
SkyTNT
9965cb50ea [Community Pipelines] Update lpw_stable_diffusion pipeline (#3197)
* Update lpw_stable_diffusion.py

* fix cpu offload
2023-04-22 15:07:45 +01:00
Chengrui Wang
20e426cb5d Fix bug in train_dreambooth_lora (#3183)
* Update train_dreambooth_lora.py

fix bug

* Update train_dreambooth_lora.py
2023-04-22 09:04:28 +05:30
Sanchit Gandhi
90eac14f72 [AudioLDM] Fix dtype of returned waveform (#3189) 2023-04-21 19:24:37 +01:00
Youssef Adarrab
11f527ac0f Add Karras sigmas to HeunDiscreteScheduler (#3160)
* Add karras pattern to discrete heun scheduler

* Add integration test

* Fix failing CI on pytorch test on M1 (mps)

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-21 19:21:04 +01:00
Patrick von Platen
2c04e5855c Multi Vector Textual Inversion (#3144)
* Multi Vector

* Improve

* fix multi token

* improve test

* make style

* Update examples/test_examples.py

* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* update

* Finish

* Apply suggestions from code review

---------

Co-authored-by: Suraj Patil <surajp815@gmail.com>
2023-04-21 19:06:19 +01:00
Steven Liu
391cfcd7d7 [docs] Clarify training args (#3146)
* clarify training arg

* apply feedback
2023-04-21 11:03:44 -07:00
YiYi Xu
bc0392a0cb make from_flax work for controlnet (#3161)
fix from_flax

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-21 19:01:36 +01:00
asfiyab-nvidia
05d9baeacd Fix TensorRT community pipeline device set function (#3157)
pass silence_dtype_warnings as kwarg

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-21 18:53:10 +01:00
Sayak Paul
e573ae06e2 Update custom_diffusion.mdx to credit the author (#3163)
* Update custom_diffusion.mdx

* fix: unnecessary list comprehension.
2023-04-21 18:44:08 +01:00
Steven Liu
2f6351b001 [docs] Deterministic algorithms (#3172)
deterministic algos
2023-04-21 10:38:34 -07:00
Patrick von Platen
9c856118c7 Add model offload to x4 upscaler (#3187)
* Add model offload to x4 upscaler

* fix
2023-04-21 17:47:33 +01:00
regisss
9bce375f77 Update Habana Gaudi documentation (#3169)
* Update Habana Gaudi doc

* Fix tables
2023-04-21 17:24:43 +01:00
Sayak Paul
3045fb2763 [DreamBooth] add text encoder LoRA support in the DreamBooth training script (#3130)
* add: LoRA text encoder support for DreamBooth example.

* fix initialization.

* fix: modification call.

* add: entry in the readme.

* use dog dataset from hub.

* fix: params to clip.

* add entry to the LoRA doc.

* add: tests for lora.

* remove unnecessary list comprehension./
2023-04-20 17:25:17 +05:30
clarencechen
7b0ba4820a Update Noise Autocorrelation Loss Function for Pix2PixZero Pipeline (#2942)
* Update Pix2PixZero Auto-correlation Loss

* Add fast inversion tests

* Clarify purpose and mark as deprecated

Fix inversion prompt broadcasting

* Register modules set to `None` in config for `test_save_load_optional_components`

* Update new tests to coordinate with #2953
2023-04-20 12:13:47 +01:00
Patrick von Platen
8d5906a331 Merge branch 'main' of https://github.com/huggingface/diffusers 2023-04-20 13:09:33 +02:00
Patrick von Platen
17470057d2 make style 2023-04-20 13:09:20 +02:00
XinyuYe-Intel
a5b242d30d Added distillation for quantization example on textual inversion. (#2760)
* Added distillation for quantization example on textual inversion.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* refined readme and code style.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* Update text2images.py

* refined code of model load and added compatibility check.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* fixed code style.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

* fix C403 [*] Unnecessary `list` comprehension (rewrite as a `set` comprehension)

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>

---------

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>
2023-04-20 11:55:42 +01:00
Mishig
a121e05feb Update custom_diffusion.mdx (#3165)
Add missing newlines for rendering the links correctly
2023-04-20 11:04:06 +02:00
nupurkmr9
3979aac996 adding custom diffusion training to diffusers examples (#3031)
* diffusers==0.14.0 update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion update

* custom diffusion

* custom diffusion

* custom diffusion

* custom diffusion

* custom diffusion

* apply formatting and get rid of bare except.

* refactor readme and other minor changes.

* misc refactor.

* fix: repo_id issue and loaders logging bug.

* fix: save_model_card.

* fix: save_model_card.

* fix: save_model_card.

* add: doc entry.

* refactor doc,.

* custom diffusion

* custom diffusion

* custom diffusion

* apply style.

* remove trailing whitespace.

* fix: toctree entry.

* remove unnecessary print.

* custom diffusion

* custom diffusion

* custom diffusion test

* custom diffusion xformer update

* custom diffusion xformer update

* custom diffusion xformer update

---------

Co-authored-by: Nupur Kumari <nupurkumari@Nupurs-MacBook-Pro.local>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Nupur Kumari <nupurkumari@nupurs-mbp.wifi.local.cmu.edu>
2023-04-20 09:31:42 +02:00
Will Berman
7e6886f5e9 controlnet training resize inputs to multiple of 8 (#3135)
controlnet training center crop input images to multiple of 8

The pipeline code resizes inputs to multiples of 8.
Not doing this resizing in the training script is causing
the encoded image to have different height/width dimensions
than the encoded conditioning image (which uses a separate
encoder that's part of the controlnet model).

We resize and center crop the inputs to make sure they're the
same size (as well as all other images in the batch). We also
check that the initial resolution is a multiple of 8.
2023-04-19 10:46:51 -07:00
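
A minimal sketch of the preprocessing constraint described above, using torchvision transforms; the actual training script's transform chain may differ slightly:

```python
from torchvision import transforms

resolution = 512
# The pipeline later resizes inputs to multiples of 8, so enforce it up front and apply
# the exact same resize + center crop to both the image and the conditioning image.
assert resolution % 8 == 0, "resolution must be a multiple of 8"

image_transforms = transforms.Compose(
    [
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)
conditioning_transforms = transforms.Compose(
    [
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(resolution),
        transforms.ToTensor(),
    ]
)
```
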
superhero-7
a4c91be73b Modified altdiffusion pipeline to support altdiffusion-m18 (#2993)
* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

* Modified altdiffusion pipeline to support altdiffusion-m18

---------

Co-authored-by: root <fulong_ye@163.com>
2023-04-19 18:00:29 +01:00
hwuebben
3becd368b1 Update pipeline_stable_diffusion_inpaint_legacy.py (#2903)
* Update pipeline_stable_diffusion_inpaint_legacy.py

* fix preprocessing of PIL images with adequate batch size

* revert map

* add tests

* reformat

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* next try to fix the style

* wth is this

* Update testing_utils.py

* Update testing_utils.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

* Update test_stable_diffusion_inpaint_legacy.py

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-19 17:58:13 +01:00
Chanchana Sornsoontorn
c8fdfe4572 Correct Transformer2DModel.forward docstring (#3074)
⚙️chore(transformer_2d) update function signature for encoder_hidden_states
2023-04-19 17:51:58 +01:00
asfiyab-nvidia
bba1c1de15 Add TensorRT SD/txt2img Community Pipeline to diffusers along with TensorRT utils (#2974)
* Add SD/txt2img Community Pipeline to diffusers along with TensorRT utils

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* update installation command

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* update tensorrt installation

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* changes
1. Update setting of cache directory
2. Address comments: merge utils and pipeline code.
3. Address comments: Add section in README

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

* apply make style

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>

---------

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-19 17:51:03 +01:00
1lint
86ecd4b795 add from_ckpt method as Mixin (#2318)
* add mixin class for pipeline from original sd ckpt

* Improve

* make style

* merge main into

* Improve more

* fix more

* up

* Apply suggestions from code review

* finish docs

* rename

* make style

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-19 17:07:36 +01:00
cmdr2
bdeff4d64a [ckpt loader] Allow loading the Inpaint and Img2Img pipelines, while loading a ckpt model (#2705)
* [ckpt loader] Allow loading the Inpaint and Img2Img pipelines, while loading a ckpt model

* Address review comment from PR

* PyLint formatting

* Some more pylint fixes, unrelated to our change

* Another pylint fix

* Styling fix
2023-04-19 13:37:07 +01:00
Will Berman
fc1883918f class labels timestep embeddings projection dtype cast (#3137)
This mimics the dtype cast for the standard time embeddings
2023-04-18 15:05:41 -07:00
Will Berman
f0c74e9a75 Add unet act fn to other model components (#3136)
Adding act fn config to the unet timestep class embedding and conv
activation.

The custom activation defaults to silu, which is the default
activation function for both the conv act and the timestep class
embeddings, so default behavior is not changed.

The only unet which use the custom activation is the stable diffusion
latent upscaler https://huggingface.co/stabilityai/sd-x2-latent-upscaler/blob/main/unet/config.json
(I ran a script against the hub to confirm).
The latent upscaler does not use the conv activation nor the timestep
class embeddings so we don't change its behavior.
2023-04-18 14:13:16 -07:00
Patrick von Platen
4bc157ffa9 Correct textual inversion readme (#3145)
* Update README.md

* Apply suggestions from code review
2023-04-18 16:35:12 +01:00
Patrick von Platen
f2df39fa0e make style 2023-04-18 14:03:17 +02:00
Cristian Garcia
8ecdd3ef65 Optimize log_validation in train_controlnet_flax (#3110)
extract pipeline from log_validation
2023-04-18 13:03:00 +01:00
YiYi Xu
cd8b7507c2 speed up attend-and-excite fast tests (#3079) 2023-04-18 13:02:25 +01:00
Sayak Paul
3b641eabe9 feat: verification of multi-gpu support for select examples. (#3126)
* feat: verification of multi-gpu support for select examples.

* add: multi-gpu training sections to the relevant doc pages.
2023-04-18 08:36:13 +05:30
Patrick von Platen
703307efcc Fix config deprecation (#3129)
* Better deprecation message

* Better deprecation message

* Better doc string

* Fixes

* fix more

* fix more

* Improve __getattr__

* correct more

* fix more

* fix

* Improve more

* more improvements

* fix more

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* make style

* Fix all rest & add tests & remove old deprecation fns

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-04-17 17:16:28 +01:00
Patrick von Platen
ed8fd38337 Improve deprecation warnings (#3131) 2023-04-17 16:19:11 +01:00
Patrick von Platen
ca783a0f1f [Bug fix] Make sure correct timesteps are chosen for img2img (#3128)
Make sure correct timesteps are chosen for img2img
2023-04-17 11:52:40 +01:00
Patrick von Platen
beb848e2b6 [Bug fix] Fix img2img processor with safety checker (#3127)
Fix img2img processor with safety checker
2023-04-17 10:53:04 +01:00
Patrick von Platen
cfc99adf0f Add global pooling to controlnet (#3121) 2023-04-16 19:07:23 +02:00
Tommaso De Rossi
807f69b328 Fix breaking change in pipeline_stable_diffusion_controlnet.py (#3118)
fix breaking change
2023-04-16 19:04:11 +02:00
Will Berman
b811964a7b ddpm custom timesteps (#3007)
add custom timesteps test

add custom timesteps descending order check

docs

timesteps -> custom_timesteps

can only pass one of num_inference_steps and timesteps
2023-04-14 12:39:38 -07:00
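
A hedged usage sketch of the custom-timesteps option: pass an explicit, strictly descending list to `set_timesteps` instead of `num_inference_steps` (never both):

```python
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

# Custom timesteps must be strictly descending; passing num_inference_steps
# at the same time is rejected.
scheduler.set_timesteps(timesteps=[999, 750, 500, 250, 100, 10, 0])
print(scheduler.timesteps)
```
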
YiYi Xu
1bd4c9e93d remove one line as requested by gc team (#3077)
remove one line
2023-04-14 06:39:25 -10:00
YiYi Xu
eb2ef31606 fix default value for attend-and-excite (#3099)
* fix default
2023-04-13 17:54:54 -10:00
Takuma Mori
5c9dd0af95 Add to support Guess Mode for StableDiffusionControlnetPipleline (#2998)
* add guess mode (WIP)

* fix uncond/cond order

* support guidance_scale=1.0 and batch != 1

* remove magic coeff

* add docstring

* add integration test

* add document to controlnet.mdx

* made the comments a bit more explanatory

* fix table
2023-04-14 08:37:34 +05:30
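
A hedged usage sketch of guess mode; the checkpoint names and conditioning image are illustrative:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("canny_edges.png")  # illustrative pre-computed conditioning image

# In guess mode the ControlNet infers the content from the conditioning image alone,
# so an empty prompt plus a low guidance_scale works well.
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
```
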
Steven Liu
d0f258206d [docs] Update community pipeline docs (#2989)
* update community pipeline docs

* fix formatting

* explain sharing workflows
2023-04-13 13:46:28 -07:00
Joseph Coffland
3eaead0c4a Allow SD attend and excite pipeline to work with any size output images (#2835)
Allow stable diffusion attend and excite pipeline to work with any size output image. Re: #2476, #2603
2023-04-13 05:54:16 -10:00
Patrick von Platen
3bf5ce21ad Throw deprecation warning for return_cached_folder (#3092)
Throw deprecation warning
2023-04-13 13:33:11 +01:00
Patrick von Platen
3a9d7d9758 [Tests] parallelize (#3078)
* [Tests] parallelize

* finish folder structuring

* Parallelize tests more

* Correct saving of pipelines

* make sure logging level is correct

* try again

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-04-13 13:32:57 +01:00
YiYi Xu
e748b3c6e1 doc string example remove from_pt (#3083) 2023-04-13 09:45:23 +02:00
Patrick von Platen
46c52f9b96 [Pipelines] Make sure that None functions are correctly not saved (#3080) 2023-04-13 00:25:10 +02:00
Andreas Steiner
d06e06940b Adds profiling flags, computes train metrics average. (#3053)
* WIP controlnet training

- bugfix --streaming
- bugfix running report_to!='wandb'
- adds memory profile before validation

* Adds final logging statement.

* Sets train epochs to 11.

Looking at a longer ~16ep run, we see only good validation images
after ~11ep:

https://wandb.ai/andsteing/controlnet_fill50k/runs/3j2hx6n8

* Removes --logging_dir (it's not used).

* Adds --profile flags.

* Updates --output_dir=runs/fill-circle-{timestamp}.

* Compute mean of `train_metrics`.

Previously `train_metrics[-1]` was logged, resulting in very bumpy train
metrics.

* Improves logging a bit.

- adds l2_grads gradient norm logging
- adds steps_per_sec
- sets walltime as x coordinate of train/step
- logs controlnet_params config

* Adds --ccache (doesn't really help though).

* minor fix in controlnet flax example (#2986)

* fix the error when push_to_hub but not log validation

* contronet_from_pt & controlnet_revision

* add intermediate checkpointing to the guide

* Bugfix --profile_steps

* Sets `TRACKER_PROJECT_NAME='controlnet_fill50k'`.

* Logs fractional epoch.

* Adds relative `walltime` metric.

* Adds `StepTraceAnnotation` and uses `global_step` instead of `step`.

* Applied `black`.

* Streamlines commands in README a bit.

* Removes `--ccache`.

This makes only a very small difference (~1 min) with this model size, so removing
the option introduced in cdb3cc.

* Re-ran `black`.

* Update examples/controlnet/README.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Converts spaces to tab.

* Removes repeated args.

* Skips first step (compilation) in profiling

* Updates README with profiling instructions.

* Unifies tabs/spaces in README.

* Re-ran style & quality.

---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-04-12 08:29:18 -10:00
Patrick von Platen
0a73b4d3cd [Post release] v0.16.0dev (#3072) 2023-04-12 17:18:30 +01:00
Sayak Paul
e126a82cc5 [Tests] Speed up panorama tests (#3067)
* fix: norm group test for UNet3D.

* chore: speed up the panorama tests (fast).

* set default value of _test_inference_batch_single_identical.

* fix: batch_sizes default value.
2023-04-12 16:25:54 +01:00
Patrick von Platen
e7534542a2 Release: v0.15.0 2023-04-12 15:15:31 +00:00
Andranik Movsisyan
b9b891621e Text2video zero refinements (#3070)
* fix progress bar issue in pipeline_text_to_video_zero.py. Copy scheduler after first backward

* fix tensor loading in test_text_to_video_zero.py

* make style && make quality
2023-04-12 14:27:09 +01:00
Ernie Chu
a43934371a Fix a bug of pano when not doing CFG (#3030)
* Fix a bug of pano when not doing CFG

* enhance code quality

* apply formatting.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-04-12 14:20:25 +01:00
Pedro Cuenca
caa5884e8a Update Flax TPU tests (#3069)
Update Flax TPU tests.

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-12 14:17:36 +01:00
Sayak Paul
fa736e321d [Docs] refactor text-to-video zero (#3049)
* fix: norm group test for UNet3D.

* refactor text-to-video zero docs.
2023-04-12 14:15:26 +01:00
Patrick von Platen
a4b233e5b5 Finish docs textual inversion (#3068)
* Finish docs textual inversion

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-04-12 13:35:58 +01:00
Nipun Jindal
524535b5f2 [2064]: Add Karras to DPMSolverMultistepScheduler (#3001)
* [2737]: Add Karras DPMSolverMultistepScheduler

* [2737]: Add Karras DPMSolverMultistepScheduler

* Add test

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* fix: repo consistency.

* remove Copied from statement from the set_timestep method.

* fix: test

* Empty commit.

Co-authored-by: njindal <njindal@adobe.com>

---------

Co-authored-by: njindal <njindal@adobe.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-04-12 18:04:51 +05:30
Sean Sube
7b2407f4d7 add support for pre-calculated prompt embeds to Stable Diffusion ONNX pipelines (#2597)
* add support for prompt embeds to SD ONNX pipeline

* fix up the pipeline copies

* add prompt embeds param to other ONNX pipelines

* fix up prompt embeds param for SD upscaling ONNX pipeline

* add missing type annotations to ONNX pipes
2023-04-12 12:19:56 +01:00
Will Berman
639f6455b4 fix pipeline __setattr__ value == None (#3063)
* fix pipeline __setattr__

* add test

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-12 12:11:09 +01:00
Andy
9d7c08f95e [WIP] implement rest of the test cases (LoRA tests) (#2824)
* inital commit for lora test cases

* help a bit with lora for 3d

* fixed lora tests

* replaced redundant code

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-04-12 15:32:14 +05:30
Pedro Cuenca
dc277501c7 Flax memory efficient attention (#2889)
* add use_memory_efficient params placeholder

* test

* add memory efficient attention jax

* add memory efficient attention jax

* newline

* forgot dot

* Rename use_memory_efficient

* Keep dtype last.

* Actually use key_chunk_size

* Rename symbol

* Apply style

* Rename use_memory_efficient

* Keep dtype last

* Pass `use_memory_efficient_attention` in `from_pretrained`

* Move JAX memory efficient attention to attention_flax.

* Simple test.

* style

---------

Co-authored-by: muhammad_hanif <muhammad_hanif@sofcograha.co.id>
Co-authored-by: MuhHanif <48muhhanif@gmail.com>
2023-04-12 10:17:51 +01:00
Susung Hong
0df47efee2 [Docs] update Self-Attention Guidance docs (#2952)
* Update index.mdx

* Edit docs & add HF space link

* Only change equation numbers in comments
2023-04-12 10:14:32 +01:00
Sayak Paul
5a7d35e29c Fix InstructPix2Pix training in multi-GPU mode (#2978)
* fix: norm group test for UNet3D.

* fix: unet rejig.

* fix: unwrapping when running validation inputs.

* unwrapping the unet too.

* fix: device.

* better unwrapping.

* unwrapping before ema.

* unwrapping.
2023-04-12 10:13:53 +01:00
Patrick von Platen
0c72006e3a fix slow tests (#3066)
* fix slow tests

* make style
2023-04-12 10:23:52 +02:00
Sayak Paul
a89a14fa7a [LoRA] Enabling limited LoRA support for text encoder (#2918)
* add: first draft for a better LoRA enabler.

* make fix-copies.

* feat: backward compatibility.

* add: entry to the docs.

* add: tests.

* fix: docs.

* fix: norm group test for UNet3D.

* feat: add support for flat dicts.

* add depcrcation message instead of warning.
2023-04-12 08:29:04 +05:30
Sayak Paul
e607a582cf [Examples] Fix type-casting issue in the ControlNet training script (#2994)
* fix: norm group test for UNet3D.

* fix: type-casting issue in controlnet training.
2023-04-12 06:35:06 +05:30
Will Berman
ea39cd7e64 Attn added kv processor torch 2.0 block (#3023)
add AttnAddedKVProcessor2_0 block
2023-04-11 16:54:22 -07:00
Will Berman
98c5e5da31 Attention processor cross attention norm group norm (#3021)
add group norm type to attention processor cross attention norm

This lets the cross attention norm use both a group norm block and a
layer norm block.

The group norm operates along the channels dimension
and requires input shape (batch size, channels, *) where as the layer norm with a single
`normalized_shape` dimension only operates over the least significant
dimension i.e. (*, channels).

The channels we want to normalize are the hidden dimension of the encoder hidden states.

By convention, the encoder hidden states are always passed as (batch size, sequence
length, hidden states).

This means the layer norm can operate on the tensor without modification, but the group
norm requires flipping the last two dimensions to operate on (batch size, hidden states, sequence length).

All existing attention processors will have the same logic and we can
consolidate it in a helper function `prepare_encoder_hidden_states`

prepare_encoder_hidden_states -> norm_encoder_hidden_states re: @patrickvonplaten

move norm_cross defined check to outside norm_encoder_hidden_states

add missing attn.norm_cross check
2023-04-11 15:51:40 -07:00
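
A compact, illustrative sketch of the dimension handling described above (not the exact library code): layer norm consumes `(batch, seq_len, channels)` directly, while group norm needs the channel axis second, so the last two dimensions are swapped around the call:

```python
import torch
from torch import nn

def norm_encoder_hidden_states(norm_cross: nn.Module, encoder_hidden_states: torch.Tensor) -> torch.Tensor:
    # encoder_hidden_states: (batch, seq_len, channels)
    if isinstance(norm_cross, nn.LayerNorm):
        # LayerNorm normalizes the trailing dimension, so the tensor can be used as-is.
        return norm_cross(encoder_hidden_states)
    if isinstance(norm_cross, nn.GroupNorm):
        # GroupNorm expects (batch, channels, *): swap, normalize, swap back.
        hidden = encoder_hidden_states.transpose(1, 2)
        hidden = norm_cross(hidden)
        return hidden.transpose(1, 2)
    raise ValueError(f"unsupported cross-attention norm: {type(norm_cross)}")

states = torch.randn(2, 77, 768)
print(norm_encoder_hidden_states(nn.LayerNorm(768), states).shape)      # (2, 77, 768)
print(norm_encoder_hidden_states(nn.GroupNorm(32, 768), states).shape)  # (2, 77, 768)
```
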
Will Berman
2d52e81cb9 unet time embedding activation function (#3048)
* unet time embedding activation function

* typo act_fn -> time_embedding_act_fn

* flatten conditional
2023-04-11 15:51:29 -07:00
Chanchana Sornsoontorn
52c4d32d41 Fix typo and format BasicTransformerBlock attributes (#2953)
* ⚙️chore(train_controlnet) fix typo in logger message

* ⚙️chore(models) refactor modules order; make them the same as calling order

When printing the BasicTransformerBlock to stdout, I think it's crucial that the attributes are shown in their proper (calling) order. Also, the "3. Feed Forward" comment previously made no sense: it should sit next to self.ff, but it was next to self.norm3 instead.

* correct many tests

* remove bogus file

* make style

* correct more tests

* finish tests

* fix one more

* make style

* make unclip deterministic

* ⚙️chore(models/attention) reorganize comments in BasicTransformerBlock class

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-12 00:31:05 +02:00
Will Berman
c6180a311c add only cross attention to simple attention blocks (#3011)
* add only cross attention to simple attention blocks

* add test for only_cross_attention re: @patrickvonplaten

* mid_block_only_cross_attention better default

allow mid_block_only_cross_attention to default to
`only_cross_attention` when `only_cross_attention` is given
as a single boolean
2023-04-11 14:38:50 -07:00
Pedro Cuenca
e3095c5f47 Fix invocation of some slow Flax tests (#3058)
* Fix invocation of some slow tests.

We use __call__ rather than pmapping the generation function ourselves
because the number of static arguments is different now.

* style
2023-04-11 23:21:25 +02:00
Pedro Cuenca
526827c3d1 Fix scheduler type mismatch (#3041)
When doing generation manually and using guidance_scale as a static
argument.
2023-04-11 23:20:35 +02:00
George Ogden
cb63febf2e Update documentation (#2996)
* Update documentation

Based on sampling, the width and height must be powers of 2 as the samples halve in size each time

* make style
2023-04-11 19:02:13 +01:00
Will Berman
8c6b47cfde AttentionProcessor.group_norm num_channels should be query_dim (#3046)
* `AttentionProcessor.group_norm` num_channels should be `query_dim`

The group_norm on the attention processor should really norm the number
of channels in the query _not_ the inner dim. This wasn't caught before
because the group_norm is only used by the added kv attention processors
and the added kv attention processors are only used by the karlo models
which are configured such that the inner dim is the same as the query
dim.

* add_{k,v}_proj should be projecting to inner_dim
2023-04-11 10:32:55 -07:00
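
In code terms, the fix amounts to constructing the norm over the query channels rather than the projected inner dimension; a hedged sketch with illustrative sizes:

```python
from torch import nn

query_dim, inner_dim, norm_num_groups = 320, 640, 32

# Before the fix (only correct when inner_dim happens to equal query_dim, as in the karlo configs):
# group_norm = nn.GroupNorm(num_channels=inner_dim, num_groups=norm_num_groups, eps=1e-5)

# After the fix: normalize the hidden states entering the attention block, which have query_dim channels.
group_norm = nn.GroupNorm(num_channels=query_dim, num_groups=norm_num_groups, eps=1e-5)
```
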
Will Berman
67ec9cf513 accelerate min version for ProjectConfiguration import (#3042) 2023-04-11 10:12:28 -07:00
Will Berman
80bc0c0ced config fixes (#3060) 2023-04-11 17:54:50 +01:00
Patrick von Platen
091a058236 make style 2023-04-11 15:51:21 +00:00
J N Hearns
881a6b58c3 Fix imports for composable_stable_diffusion pipeline (#3002)
* Update composable_stable_diffusion.py

Fix imports

* Formatting

* Formatting

* Formatting
2023-04-11 16:50:25 +01:00
Steven Liu
cb9d77af23 [docs] Reusing components (#3000)
* reuse-components

* format
2023-04-11 15:34:34 +01:00
Patrick von Platen
8b451eb63b Fix config prints and save, load of pipelines (#2849)
* [Config] Fix config prints and save, load

* Only use potential nn.Modules for dtype and device

* Correct vae image processor

* make sure in_channels is not accessed directly

* make sure in channels is only accessed via config

* Make sure schedulers only access config attributes

* Make sure to access config in SAG

* Fix vae processor and make style

* add tests

* uP

* make style

* Fix more naming issues

* Final fix with vae config

* change more
2023-04-11 13:35:42 +02:00
Patrick von Platen
8369196703 fix report tool (#3047) 2023-04-11 10:55:00 +02:00
Mishig
4f48476dd6 Update contribution.mdx (#3054)
* Update contribution.mdx

hotfix for doc-builder parsing quote in heading bug

* quoteation replace
2023-04-11 09:23:58 +02:00
Pedro Cuenca
fbc9a736dd mps: skip unstable test (#3037) 2023-04-11 06:36:54 +05:30
Rogério Júnior
67c3518f68 Small typo correction in comments (#3012) 2023-04-10 13:48:35 -07:00
Andranik Movsisyan
ba49272db8 [Pipeline] Add TextToVideoZeroPipeline (#2954)
* add TextToVideoZeroPipeline and CrossFrameAttnProcessor

* add docs for text-to-video zero

* add teaser image for text-to-video zero docs

* Fix review changes. Add Documentation. Add test

* clean up the code in pipeline_text_to_video.py. Add descriptive comments and docstrings

* make style && make quality

* make fix-copies

* make requested changes to docs. use huggingface server links for resources, delete res folder

* make style && make quality && make fix-copies

* make style && make quality

* Apply suggestions from code review

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-04-10 22:09:53 +02:00
William Berman
074d281ae0 tests and additional scheduler fixes 2023-04-10 12:59:33 -07:00
William Berman
953c9d14eb [bug fix] dpm multistep solver duplicate timesteps 2023-04-10 12:59:33 -07:00
luanjintai
85f1c19282 find another accelerate parameter error 2023-04-10 12:23:17 -07:00
luanjintai
b5d0a9131d fix wrong parameter name for accelerate 2023-04-10 12:23:17 -07:00
Pedro Cuenca
983a7fbfd8 Initial draft of Core ML docs (#2987)
* Initial draft of Core ML docs.

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Fix Core ML spelling

* Apply the rest of suggestions.

* Attempt to fix hyperlink inside Tip.

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-04-10 21:09:04 +02:00
William Berman
c413353e8e add encoder_hid_dim to unet
`encoder_hid_dim` provides an additional projection for the input `encoder_hidden_states` from `encoder_hidden_dim` to `cross_attention_dim`
2023-04-09 23:00:16 -07:00
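
Conceptually, the new option is one extra linear projection applied to the conditioning embeddings before cross-attention; a hedged sketch with illustrative sizes:

```python
import torch
from torch import nn

encoder_hid_dim = 4096        # e.g. a T5 hidden size
cross_attention_dim = 1024    # what the UNet's cross-attention layers expect

# Only created when encoder_hid_dim is set in the UNet config.
encoder_hid_proj = nn.Linear(encoder_hid_dim, cross_attention_dim)

encoder_hidden_states = torch.randn(2, 77, encoder_hid_dim)
encoder_hidden_states = encoder_hid_proj(encoder_hidden_states)  # (2, 77, cross_attention_dim)
```
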
William Berman
8db5e5b37d allow unet varying number of layers per block 2023-04-09 22:57:26 -07:00
William Berman
707341aebe resnet skip time activation and output scale factor 2023-04-09 22:55:33 -07:00
William Berman
26b4319ac5 do not overwrite scheduler instance variables with type casted versions 2023-04-09 22:34:29 -07:00
William Berman
18ebd57bd8 add missing AttnProcessor2_0 to AttentionProcessor union 2023-04-09 22:02:14 -07:00
William Berman
b6cc050245 fix simple attention processor encoder hidden states ordering 2023-04-09 21:57:56 -07:00
William Berman
0cbefefac3 clamp comment @sayakpaul 2023-04-09 21:54:50 -07:00
William Berman
1875c35aeb remove extra min arg @sayakpaul 2023-04-09 21:54:50 -07:00
William Berman
1dc856e508 ddpm scheduler variance fixes 2023-04-09 21:54:50 -07:00
Will Berman
2cbdc586de dynamic threshold sampling bug fixes and docs (#3003)
dynamic threshold sampling bug fix and docs
2023-04-09 21:43:40 -07:00
YiYi Xu
dcfa6e1d20 add Min-SNR loss to Controlnet flax train script (#3016)
* add wandb team and min-snr loss

* make style

* apply feedbacks
2023-04-10 07:56:54 +05:30
Patrick von Platen
1c96f82ed9 Update one_step_unet.py
Fix dummy community pipeline
2023-04-09 19:22:18 +01:00
Guspan Tanadi
ce144d6dd0 docs: Link Navigation Path API Pipelines (#2976)
* docs: link navigation Safe Stable Diffusion

Link navigation API pipelines text2img and using diffusers Conditional Image Generation.

* docs: link navigation Versatile Diffusion

Removing exceeding path Stable Diffusion Overview.

* docs: Python extension Spectrogram Diffusion

Link navigation Spectrogram Diffusion Pipeline source code

* docs: Link navigation AltDiffusion Pipelines

Stable Diffusion Overview and Using Diffusers path.
2023-04-07 14:07:42 -07:00
Pedro Cuenca
8c5c30f3b1 Explain how to install test dependencies (#2983)
As pointed out by @Birch-san: https://github.com/huggingface/diffusers/pull/2634#issuecomment-1496517210
2023-04-07 20:41:09 +02:00
YiYi Xu
2de36fae7b minor fix in controlnet flax example (#2986)
* fix the error when push_to_hub but not log validation

* contronet_from_pt & controlnet_revision

* add intermediate checkpointing to the guide
2023-04-06 10:27:41 -10:00
FurryPotato
e40526431a [scheduler] fix some scheduler dtype error (#2992)
Co-authored-by: wangguan <dizhipeng.dzp@alibaba-inc.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-06 14:55:33 +01:00
Sayak Paul
24947317a6 [Examples] Add support for Min-SNR weighting strategy for better convergence (#2899)
* improve stable unclip doc.

* feat: support for applying min-snr weighting for faster convergence.

* add: support for validation logging with wandb

* make  not a required arg.

* fix: arg name.

* fix: cli args.

* fix: tracker config.

* fix: loss calculation.

* fix: validation logging.

* fix: unwrap call.

* fix: validation logging.

* fix: interval.

* fix: checkpointing push to hub.

* fix: c8a2856c6d#commitcomment-106913193

* fix: norm group test for UNet3D.

* address PR comments.

* remove unneeded code.

* add: entry in the readme and docs.

* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>

---------

Co-authored-by: Suraj Patil <surajp815@gmail.com>
2023-04-06 19:08:40 +05:30
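
The Min-SNR-γ strategy referenced above rescales each sample's loss by min(SNR(t), γ) / SNR(t), with SNR(t) = ᾱ_t / (1 − ᾱ_t). A hedged PyTorch sketch for the ε-prediction case; tensor shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
snr_gamma = 5.0

def compute_snr(scheduler, timesteps):
    # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
    alphas_cumprod = scheduler.alphas_cumprod.to(timesteps.device)[timesteps]
    return alphas_cumprod / (1.0 - alphas_cumprod)

# Stand-ins for one training step with epsilon prediction.
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (4,))
model_pred = torch.randn(4, 4, 64, 64)
noise = torch.randn_like(model_pred)

snr = compute_snr(scheduler, timesteps)
mse_loss_weights = torch.clamp(snr, max=snr_gamma) / snr

loss = F.mse_loss(model_pred.float(), noise.float(), reduction="none")
loss = loss.mean(dim=list(range(1, len(loss.shape)))) * mse_loss_weights
loss = loss.mean()
```
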
cmdr2
8826bae655 Update the K-Diffusion SD pipeline, to allow calling it with only prompt_embeds (instead of always requiring a prompt) (#2962) 2023-04-06 11:59:48 +01:00
Nipun Jindal
6e8e1ed77a [2905]: Add Karras pattern to discrete euler (#2956)
* [2905]: Add Karras pattern to discrete euler

* [2905]: Add Karras pattern to discrete euler

* Review comments

* Review comments

* Review comments

* Review comments

---------

Co-authored-by: njindal <njindal@adobe.com>
2023-04-06 16:10:57 +05:30
Kadir Nar
37b359b2bd The variable name has been updated. (#2970) 2023-04-06 10:55:43 +01:00
Patrick von Platen
a9477bbdac [Pipeline download] Improve pipeline download for index and passed co… (#2980)
* [Pipeline download] Improve pipeline download for index and passed components

* correct

* add more tests

* up
2023-04-06 01:31:09 +02:00
YiYi Xu
ee20d1f8b9 update flax controlnet training script (#2951)
* load_from_disk + checkpointing_steps

* apply feedback
2023-04-04 15:49:44 -10:00
Steven Liu
0d0fa2a3e1 [docs] Simplify loading guide (#2694)
* simplify loading guide

* apply feedbacks

* clarify variants

* clarify torch_dtype and variant

* remove conceptual pipeline doc
2023-04-04 14:08:21 -07:00
YiYi Xu
1a6def3ddb fix post-processing (#2968)
Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-04-04 08:52:55 -10:00
YiYi Xu
0c63c3839a allow use custom local dataset for controlnet training scripts (#2928)
use custom local dataset

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-04 10:37:47 -07:00
Lucain
a87e88b783 Use upload_folder in training scripts (#2934)
use upload folder in training scripts

Co-authored-by: testbot <lucainp@hf.co>
2023-04-04 16:19:12 +01:00
Patrick von Platen
a0263b2e5b make style 2023-04-04 15:18:39 +02:00
Ernie Chu
62c01d267a Ensure validation image RGB not RGBA (#2945)
* ensure validation image RGB not RGBA

* ensure validation image RGB not RGBA

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-04-04 14:17:59 +01:00
Guspan Tanadi
f3e72e9e57 Removing explicit markdown extension (#2944)
Trigger from previous PR. Build the page once again.
2023-04-04 14:15:19 +01:00
M. Tolga Cangöz
4fd7e97f33 Update ddpm.mdx (#2929) 2023-04-04 14:02:30 +01:00
M. Tolga Cangöz
4a1eae07c7 Update ddim.mdx (#2926) 2023-04-04 14:01:55 +01:00
M. Tolga Cangöz
e329edff7e Update score_sde_vp.mdx (#2938) 2023-04-04 14:00:43 +01:00
M. Tolga Cangöz
3e2d1af867 Update score_sde_ve.mdx (#2937) 2023-04-04 14:00:15 +01:00
M. Tolga Cangöz
715c25d344 Update unipc.mdx (#2936) 2023-04-04 13:59:53 +01:00
M. Tolga Cangöz
4274a3a915 Update euler_ancestral.mdx (#2932) 2023-04-04 13:58:58 +01:00
Sayak Paul
7139f0e874 fix: norm group test for UNet3D. (#2959) 2023-04-04 09:01:15 +01:00
Patrick von Platen
8c530fc2f6 make style 2023-03-31 23:46:28 +02:00
Patrick von Platen
723933f5f1 add another import 2023-03-31 23:45:05 +02:00
Patrick von Platen
f23d6eb8f2 fix missing import 2023-03-31 23:37:58 +02:00
wfng92
cd634a8fbb Check for all different packages of opencv (#2901)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-03-31 15:00:59 +01:00
Patrick von Platen
7447f75b9f Update pipeline_stable_diffusion_controlnet.py (#2917) 2023-03-31 14:59:50 +01:00
Patrick von Platen
a5bdb678c0 fix importing diffusers without transformers installed 2023-03-31 13:56:38 +00:00
M. Tolga Cangöz
c43356267b Update controlnet.mdx (#2912)
.
2023-03-31 14:32:36 +01:00
M. Tolga Cangöz
89b23d9869 Update image_variation.mdx (#2911)
.
2023-03-31 14:31:43 +01:00
Guspan Tanadi
419660c99b Have fix current pipeline link (#2910)
Also capitalize the notebook provider name
2023-03-31 14:31:14 +01:00
Patrick von Platen
d36103a089 [Tests] Speed up test (#2919)
speed up test
2023-03-31 14:20:46 +01:00
Nipun Jindal
b3c437e009 [2884]: Fix cross_attention_kwargs in StableDiffusionImg2ImgPipeline (#2902)
* [2884]: Fix cross_attention_kwargs in StableDiffusionImg2ImgPipeline

* [Build Fix]

* [Build Fix]

---------

Co-authored-by: njindal <njindal@adobe.com>
2023-03-31 13:26:04 +01:00
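
A hedged sketch of the argument being fixed: `cross_attention_kwargs` (for example a LoRA scale) forwarded through the img2img call; the model id, placeholder image, and scale value are illustrative:

```python
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
init_image = Image.new("RGB", (512, 512), color="gray")  # placeholder input image

image = pipe(
    "a fantasy landscape, detailed",
    image=init_image,
    strength=0.75,
    cross_attention_kwargs={"scale": 0.5},  # now forwarded to the attention processors
).images[0]
```
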
mengfei25
7b6caca9eb Modify example with intel optimization (#2896)
* modify intel opts inference script

* modify readme

* modify doc

* fix some issues

* reformat

* reformat script

* format issue

* format issue
2023-03-31 13:07:20 +01:00
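
The Intel-optimized example above builds on Intel Extension for PyTorch; a hedged sketch of the typical IPEX + bfloat16 pattern (the actual script differs in details, and the model id is illustrative):

```python
import torch
import intel_extension_for_pytorch as ipex
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Optimize the heaviest submodules for CPU inference in bfloat16.
pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16, inplace=True)
pipe.vae = ipex.optimize(pipe.vae.eval(), dtype=torch.bfloat16, inplace=True)

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    image = pipe("a photo of an astronaut riding a horse").images[0]
```
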
Sandeep
f3fbf9bfc0 Fix check_inputs in upscaler pipeline to allow embeds (#2892)
* Remove suggestion to use cuDNN benchmark in docs

* removing the wrong line

* add support for embeds

* fix line length
2023-03-31 12:46:20 +01:00
Patrick von Platen
e1144ac20c Fix slow tests text inv (#2915)
* fix slow tests

* uP
2023-03-31 10:03:32 +01:00
Guillermo Cique
1055175a18 Fix textual inversion loading (#2914) 2023-03-31 09:52:48 +01:00
Takuma Mori
0df4ad541f Add support Karras sigmas for StableDiffusionKDiffusionPipeline (#2874)
* add use_karras_sigmas option

thanks @Stax124

* fix sigma_min/max from scheduler.sigmas

* add docstring

* revert to use k_diffusion_model.sigma, to(device)

* add integration test

* make style
2023-03-31 09:12:11 +05:30
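
A hedged usage sketch for the K-Diffusion pipeline with the new option; the sampler name and model id are illustrative:

```python
import torch
from diffusers import StableDiffusionKDiffusionPipeline

pipe = StableDiffusionKDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.set_scheduler("sample_dpmpp_2m")  # pick a k-diffusion sampler

image = pipe(
    "an astronaut riding a horse",
    use_karras_sigmas=True,  # option added in this PR
).images[0]
```
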
YiYi Xu
51d970d60d [docs] add the Stable diffusion with Jax/Flax Guide into the docs (#2487)
* add stable diffusion jax guide


---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-03-30 16:22:40 -10:00
Pi Esposito
a937e1b594 add load textual inversion embeddings to stable diffusion (#2009)
* add load textual inversion embeddings draft

* fix quality

* fix typo

* make fix copies

* move to textual inversion mixin

* make it accept from sd-concept library

* accept list of paths to embeddings

* fix styling of stable diffusion pipeline

* add dummy TextualInversionMixin

* add docstring to textualinversionmixin

* add load textual inversion embeddings draft

* fix quality

* fix typo

* make fix copies

* move to textual inversion mixin

* make it accept from sd-concept library

* accept list of paths to embeddings

* fix styling of stable diffusion pipeline

* add dummy TextualInversionMixin

* add docstring to textualinversionmixin

* add case for parsing embedding from auto1111 UI format

Co-authored-by: Evan Jones <evan.a.jones3@gmail.com>
Co-authored-by: Ana Tamais <aninhamoraestamais@gmail.com>

* fix style after rebase

* move textual inversion mixin to loaders

* move mixin inheritance to DiffusionPipeline from StableDiffusionPipeline

* update dummy class name

* addressed all comments

* fix old dangling import

* fix style

* proposal

* remove bogus

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Will Berman <wlbberman@gmail.com>

* finish

* make style

* up

* fix code quality

* fix code quality - again

* fix code quality - 3

* fix alt diffusion code quality

* fix model editing pipeline

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Finish

---------

Co-authored-by: Evan Jones <evan.a.jones3@gmail.com>
Co-authored-by: Ana Tamais <aninhamoraestamais@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Will Berman <wlbberman@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-03-30 18:08:39 +01:00
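
A minimal sketch of the new loader in use; the concept repository and placeholder token are illustrative:

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")  # from the sd-concepts library

image = pipe("a <cat-toy> sitting on a park bench").images[0]
```
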
Michael Gartsbein
1d033a95f6 img2img.multiple.controlnets.pipeline (#2833)
* img2img.multiple.controlnets.pipeline

* remove comments

---------

Co-authored-by: mishka <gartsocial@gmail.com>
2023-03-30 18:00:12 +01:00
Patrick von Platen
49609768b4 make style 2023-03-30 18:26:41 +02:00
Alon Burg
9062b2847d Support fp16 in conversion from original ckpt (#2733)
add --half to convert_original_stable_diffusion_to_diffusers.py
2023-03-30 17:26:18 +01:00
YiYi Xu
b3d5cc4a36 add flax requirement (#2894)
Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-03-30 17:10:26 +01:00
Sayak Paul
b2021273eb [Docs] add an example use for StableUnCLIPPipeline in the pipeline docs (#2897)
* improve stable unclip doc.

* add: entry of StableUnCLIPPipeline to the docs

* Apply suggestions from code review

Co-authored-by: apolinario <joaopaulo.passos@gmail.com>

---------

Co-authored-by: apolinario <joaopaulo.passos@gmail.com>
2023-03-30 17:14:04 +05:30
Steven Liu
e47459c80f [docs] Performance tutorial (#2773)
* update performance tutorial

* fix divs

* oops forgot to close tag

* apply feedback

* apply feedback

* apply feedback

* align doc title
2023-03-29 12:48:14 -07:00
514 changed files with 73308 additions and 9844 deletions

View File

@@ -49,3 +49,32 @@ body:
placeholder: diffusers version, platform, python version, ...
validations:
required: true
- type: textarea
id: who-can-help
attributes:
label: Who can help?
description: |
Your issue will be replied to more quickly if you can figure out the right person to tag with @
If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and
a core maintainer will ping the right person.
Please tag fewer than 3 people.
General library related questions: @patrickvonplaten and @sayakpaul
Questions on the training examples: @williamberman, @sayakpaul, @yiyixuxu
Questions on memory optimizations, LoRA, float16, etc.: @williamberman, @patrickvonplaten, and @sayakpaul
Questions on schedulers: @patrickvonplaten and @williamberman
Questions on models and pipelines: @patrickvonplaten, @sayakpaul, and @williamberman
Questions on JAX- and MPS-related things: @pcuenca
Questions on audio pipelines: @patrickvonplaten, @kashif, and @sanchit-gandhi
Documentation: @stevhliu and @yiyixuxu
placeholder: "@Username ..."

.github/PULL_REQUEST_TEMPLATE.md (new file, 60 lines)
View File

@@ -0,0 +1,60 @@
# What does this PR do?
<!--
Congratulations! You've made it this far! You're not quite done yet though.
Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.
Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.
Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md)?
- [ ] Did you read our [philosophy doc](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) (important for complex PRs)?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the
[documentation guidelines](https://github.com/huggingface/diffusers/tree/main/docs), and
[here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @
If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
Please tag fewer than 3 people.
Core library:
- Schedulers: @williamberman and @patrickvonplaten
- Pipelines: @patrickvonplaten and @sayakpaul
- Training examples: @sayakpaul and @patrickvonplaten
- Docs: @stevhliu and @yiyixuxu
- JAX and MPS: @pcuenca
- Audio: @sanchit-gandhi
- General functionalities: @patrickvonplaten and @sayakpaul
Integrations:
- deepspeed: HF Trainer/Accelerate: @pacman100
HF projects:
- accelerate: [different repo](https://github.com/huggingface/accelerate)
- datasets: [different repo](https://github.com/huggingface/datasets)
- transformers: [different repo](https://github.com/huggingface/transformers)
- safetensors: [different repo](https://github.com/huggingface/safetensors)
-->

View File

@@ -27,7 +27,7 @@ runs:
- name: Get date
id: get-date
shell: bash
run: echo "::set-output name=today::$(/bin/date -u '+%Y%m%d')d"
run: echo "today=$(/bin/date -u '+%Y%m%d')d" >> $GITHUB_OUTPUT
- name: Setup miniconda cache
id: miniconda-cache
uses: actions/cache@v2

View File

@@ -6,6 +6,7 @@ on:
- main
- doc-builder*
- v*-release
- v*-patch
jobs:
build:
@@ -14,6 +15,7 @@ jobs:
commit_sha: ${{ github.sha }}
package: diffusers
notebook_folder: diffusers_doc
languages: en ko
languages: en ko zh
secrets:
token: ${{ secrets.HUGGINGFACE_PUSH }}
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}

View File

@@ -1,13 +1,14 @@
name: Delete dev documentation
name: Delete doc comment
on:
pull_request:
types: [ closed ]
workflow_run:
workflows: ["Delete doc comment trigger"]
types:
- completed
jobs:
delete:
uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
with:
pr_number: ${{ github.event.number }}
package: diffusers
secrets:
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

View File

@@ -0,0 +1,12 @@
name: Delete doc comment trigger
on:
pull_request:
types: [ closed ]
jobs:
delete:
uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
with:
pr_number: ${{ github.event.number }}

View File

@@ -0,0 +1,32 @@
name: Run dependency tests
on:
pull_request:
branches:
- main
push:
branches:
- main
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
jobs:
check_dependencies:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.7"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e .
pip install pytest
- name: Check for soft dependencies
run: |
pytest tests/others/test_dependencies.py

View File

@@ -21,26 +21,26 @@ jobs:
fail-fast: false
matrix:
config:
- name: Fast PyTorch CPU tests on Ubuntu
framework: pytorch
- name: Fast PyTorch Pipeline CPU tests
framework: pytorch_pipelines
runner: docker-cpu
image: diffusers/diffusers-pytorch-cpu
report: torch_cpu
- name: Fast Flax CPU tests on Ubuntu
report: torch_cpu_pipelines
- name: Fast PyTorch Models & Schedulers CPU tests
framework: pytorch_models
runner: docker-cpu
image: diffusers/diffusers-pytorch-cpu
report: torch_cpu_models_schedulers
- name: Fast Flax CPU tests
framework: flax
runner: docker-cpu
image: diffusers/diffusers-flax-cpu
report: flax_cpu
- name: Fast ONNXRuntime CPU tests on Ubuntu
framework: onnxruntime
runner: docker-cpu
image: diffusers/diffusers-onnxruntime-cpu
report: onnx_cpu
- name: PyTorch Example CPU tests on Ubuntu
- name: PyTorch Example CPU tests
framework: pytorch_examples
runner: docker-cpu
image: diffusers/diffusers-pytorch-cpu
report: torch_cpu
report: torch_example_cpu
name: ${{ matrix.config.name }}
@@ -64,20 +64,26 @@ jobs:
run: |
apt-get update && apt-get install libsndfile1-dev -y
python -m pip install -e .[quality,test]
python -m pip install -U git+https://github.com/huggingface/transformers
python -m pip install git+https://github.com/huggingface/accelerate
- name: Environment
run: |
python utils/print_env.py
- name: Run fast PyTorch CPU tests
if: ${{ matrix.config.framework == 'pytorch' }}
- name: Run fast PyTorch Pipeline CPU tests
if: ${{ matrix.config.framework == 'pytorch_pipelines' }}
run: |
python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-s -v -k "not Flax and not Onnx" \
--make-reports=tests_${{ matrix.config.report }} \
tests/
tests/pipelines
- name: Run fast PyTorch Model Scheduler CPU tests
if: ${{ matrix.config.framework == 'pytorch_models' }}
run: |
python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-s -v -k "not Flax and not Onnx and not Dependency" \
--make-reports=tests_${{ matrix.config.report }} \
tests/models tests/schedulers tests/others
- name: Run fast Flax TPU tests
if: ${{ matrix.config.framework == 'flax' }}
@@ -85,15 +91,7 @@ jobs:
python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Flax" \
--make-reports=tests_${{ matrix.config.report }} \
tests/
- name: Run fast ONNXRuntime CPU tests
if: ${{ matrix.config.framework == 'onnxruntime' }}
run: |
python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Onnx" \
--make-reports=tests_${{ matrix.config.report }} \
tests/
tests
- name: Run example PyTorch CPU tests
if: ${{ matrix.config.framework == 'pytorch_examples' }}
@@ -112,56 +110,3 @@ jobs:
with:
name: pr_${{ matrix.config.report }}_test_reports
path: reports
run_fast_tests_apple_m1:
name: Fast PyTorch MPS tests on MacOS
runs-on: [ self-hosted, apple-m1 ]
steps:
- name: Checkout diffusers
uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Clean checkout
shell: arch -arch arm64 bash {0}
run: |
git clean -fxd
- name: Setup miniconda
uses: ./.github/actions/setup-miniconda
with:
python-version: 3.9
- name: Install dependencies
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python -m pip install --upgrade pip
${CONDA_RUN} python -m pip install -e .[quality,test]
${CONDA_RUN} python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate
${CONDA_RUN} python -m pip install -U git+https://github.com/huggingface/transformers
- name: Environment
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python utils/print_env.py
- name: Run fast PyTorch tests on M1 (MPS)
shell: arch -arch arm64 bash {0}
env:
HF_HOME: /System/Volumes/Data/mnt/cache
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
- name: Failure short reports
if: ${{ failure() }}
run: cat reports/tests_torch_mps_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: pr_torch_mps_test_reports
path: reports

View File

@@ -17,6 +17,7 @@ jobs:
run_slow_tests:
strategy:
fail-fast: false
max-parallel: 1
matrix:
config:
- name: Slow PyTorch CUDA tests on Ubuntu
@@ -61,8 +62,6 @@ jobs:
- name: Install dependencies
run: |
python -m pip install -e .[quality,test]
python -m pip install -U git+https://github.com/huggingface/transformers
python -m pip install git+https://github.com/huggingface/accelerate
- name: Environment
run: |
@@ -72,6 +71,9 @@ jobs:
if: ${{ matrix.config.framework == 'pytorch' }}
env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
-s -v -k "not Flax and not Onnx" \
@@ -131,8 +133,6 @@ jobs:
- name: Install dependencies
run: |
python -m pip install -e .[quality,test,training]
python -m pip install git+https://github.com/huggingface/accelerate
python -m pip install -U git+https://github.com/huggingface/transformers
- name: Environment
run: |

View File

@@ -1,4 +1,4 @@
name: Slow tests on main
name: Fast tests on main
on:
push:
@@ -38,7 +38,7 @@ jobs:
framework: pytorch_examples
runner: docker-cpu
image: diffusers/diffusers-pytorch-cpu
report: torch_cpu
report: torch_example_cpu
name: ${{ matrix.config.name }}
@@ -62,8 +62,6 @@ jobs:
run: |
apt-get update && apt-get install libsndfile1-dev -y
python -m pip install -e .[quality,test]
python -m pip install -U git+https://github.com/huggingface/transformers
python -m pip install git+https://github.com/huggingface/accelerate
- name: Environment
run: |
@@ -110,56 +108,3 @@ jobs:
with:
name: pr_${{ matrix.config.report }}_test_reports
path: reports
run_fast_tests_apple_m1:
name: Fast PyTorch MPS tests on MacOS
runs-on: [ self-hosted, apple-m1 ]
steps:
- name: Checkout diffusers
uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Clean checkout
shell: arch -arch arm64 bash {0}
run: |
git clean -fxd
- name: Setup miniconda
uses: ./.github/actions/setup-miniconda
with:
python-version: 3.9
- name: Install dependencies
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python -m pip install --upgrade pip
${CONDA_RUN} python -m pip install -e .[quality,test]
${CONDA_RUN} python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate
${CONDA_RUN} python -m pip install -U git+https://github.com/huggingface/transformers
- name: Environment
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python utils/print_env.py
- name: Run fast PyTorch tests on M1 (MPS)
shell: arch -arch arm64 bash {0}
env:
HF_HOME: /System/Volumes/Data/mnt/cache
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
- name: Failure short reports
if: ${{ failure() }}
run: cat reports/tests_torch_mps_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: pr_torch_mps_test_reports
path: reports

.github/workflows/push_tests_mps.yml (new file, 68 lines)
View File

@@ -0,0 +1,68 @@
name: Fast mps tests on main
on:
push:
branches:
- main
env:
DIFFUSERS_IS_CI: yes
HF_HOME: /mnt/cache
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
PYTEST_TIMEOUT: 600
RUN_SLOW: no
jobs:
run_fast_tests_apple_m1:
name: Fast PyTorch MPS tests on MacOS
runs-on: [ self-hosted, apple-m1 ]
steps:
- name: Checkout diffusers
uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Clean checkout
shell: arch -arch arm64 bash {0}
run: |
git clean -fxd
- name: Setup miniconda
uses: ./.github/actions/setup-miniconda
with:
python-version: 3.9
- name: Install dependencies
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python -m pip install --upgrade pip
${CONDA_RUN} python -m pip install -e .[quality,test]
${CONDA_RUN} python -m pip install torch torchvision torchaudio
${CONDA_RUN} python -m pip install accelerate --upgrade
${CONDA_RUN} python -m pip install transformers --upgrade
- name: Environment
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python utils/print_env.py
- name: Run fast PyTorch tests on M1 (MPS)
shell: arch -arch arm64 bash {0}
env:
HF_HOME: /System/Volumes/Data/mnt/cache
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/
- name: Failure short reports
if: ${{ failure() }}
run: cat reports/tests_torch_mps_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: pr_torch_mps_test_reports
path: reports

View File

@@ -0,0 +1,16 @@
name: Upload PR Documentation
on:
workflow_run:
workflows: ["Build PR Documentation"]
types:
- completed
jobs:
build:
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
with:
package_name: diffusers
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

View File

@@ -395,7 +395,14 @@ passes. You should run the tests impacted by your changes like this:
$ pytest tests/<TEST_TO_RUN>.py
```
You can also run the full suite with the following command, but it takes
Before you run the tests, please make sure you install the dependencies required for testing. You can do so
with this command:
```bash
$ pip install -e ".[test]"
```
You can run the full test suite with the following command, but it takes
a beefy machine to produce a result in a decent amount of time now that
Diffusers has grown a lot. Here is the command for it:

README.md (140 changed lines)
View File

@@ -1,6 +1,6 @@
<p align="center">
<br>
<img src="./docs/source/en/imgs/diffusers_library.jpg" width="400"/>
<img src="https://github.com/huggingface/diffusers/blob/main/docs/source/en/imgs/diffusers_library.jpg" width="400"/>
<br>
<p>
<p align="center">
@@ -25,7 +25,7 @@
## Installation
We recommend installing 🤗 Diffusers in a virtual environment from PyPi or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/installation.html), please refer to their official documentation.
We recommend installing 🤗 Diffusers in a virtual environment from PyPi or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/#installation), please refer to their official documentation.
### PyTorch
@@ -59,8 +59,9 @@ Generating outputs is super easy with 🤗 Diffusers. To generate an image from
```python
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline.to("cuda")
pipeline("An image of a squirrel in Picasso style").images[0]
```
@@ -99,55 +100,11 @@ Check out the [Quickstart](https://huggingface.co/docs/diffusers/quicktour) to l
| **Documentation** | **What can I learn?** |
|---------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Tutorial | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model. |
| Loading | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. |
| Pipelines for inference | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. |
| Optimization | Guides for how to optimize your diffusion model to run faster and consume less memory. |
| [Tutorial](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview) | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model. |
| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading_overview) | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. |
| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/pipeline_overview) | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. |
| [Optimization](https://huggingface.co/docs/diffusers/optimization/opt_overview) | Guides for how to optimize your diffusion model to run faster and consume less memory. |
| [Training](https://huggingface.co/docs/diffusers/training/overview) | Guides for how to train a diffusion model for different tasks with different training techniques. |
## Supported pipelines
| Pipeline | Paper | Tasks |
|---|---|:---:|
| [alt_diffusion](./api/pipelines/alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation |
| [audio_diffusion](./api/pipelines/audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation |
| [controlnet](./api/pipelines/stable_diffusion/controlnet) | [**ControlNet with Stable Diffusion**](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation |
| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
| [dance_diffusion](./api/pipelines/dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
| [ddpm](./api/pipelines/ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
| [ddim](./api/pipelines/ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
| [latent_diffusion](./api/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
| [latent_diffusion](./api/pipelines/latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
| [paint_by_example](./api/pipelines/paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
| [pndm](./api/pipelines/pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
| [score_sde_ve](./api/pipelines/score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
| [score_sde_vp](./api/pipelines/score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [**Semantic Guidance**](https://arxiv.org/abs/2301.12247) | Text-Guided Generation |
| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation |
| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation |
| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting |
| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [**MultiDiffusion**](https://multidiffusion.github.io/) | Text-to-Panorama Generation |
| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [**InstructPix2Pix**](https://github.com/timothybrooks/instruct-pix2pix) | Text-Guided Image Editing|
| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [**Zero-shot Image-to-Image Translation**](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [**Attend and Excite for Stable Diffusion**](https://attendandexcite.github.io/Attend-and-Excite/) | Text-to-Image Generation |
| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [**Self-Attention Guidance**](https://ku-cvlab.github.io/Self-Attention-Guidance) | Text-to-Image Generation |
| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [**Stable Diffusion Image Variations**](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [**Stable Diffusion Latent Upscaler**](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation |
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Depth-Conditional Stable Diffusion**](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation |
| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation |
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation |
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation |
| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
## Contribution
We ❤️ contributions from the open-source community!
@@ -160,6 +117,87 @@ You can look out for [issues](https://github.com/huggingface/diffusers/issues) y
Also, say 👋 in our public Discord channel <a href="https://discord.gg/G7tWnz98XR"><img alt="Join us on Discord" src="https://img.shields.io/discord/823813159592001537?color=5865F2&logo=discord&logoColor=white"></a>. We discuss the hottest trends about diffusion models, help each other with contributions, personal projects or
just hang out ☕.
## Popular Tasks & Pipelines
<table>
<tr>
<th>Task</th>
<th>Pipeline</th>
<th>🤗 Hub</th>
</tr>
<tr style="border-top: 2px solid black">
<td>Unconditional Image Generation</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/ddpm"> DDPM </a></td>
<td><a href="https://huggingface.co/google/ddpm-ema-church-256"> google/ddpm-ema-church-256 </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img">Stable Diffusion Text-to-Image</a></td>
<td><a href="https://huggingface.co/runwayml/stable-diffusion-v1-5"> runwayml/stable-diffusion-v1-5 </a></td>
</tr>
<tr>
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/unclip">unclip</a></td>
<td><a href="https://huggingface.co/kakaobrain/karlo-v1-alpha"> kakaobrain/karlo-v1-alpha </a></td>
</tr>
<tr>
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/if">if</a></td>
<td><a href="https://huggingface.co/DeepFloyd/IF-I-XL-v1.0"> DeepFloyd/IF-I-XL-v1.0 </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/controlnet">Controlnet</a></td>
<td><a href="https://huggingface.co/lllyasviel/sd-controlnet-canny"> lllyasviel/sd-controlnet-canny </a></td>
</tr>
<tr>
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/pix2pix">Instruct Pix2Pix</a></td>
<td><a href="https://huggingface.co/timbrooks/instruct-pix2pix"> timbrooks/instruct-pix2pix </a></td>
</tr>
<tr>
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img">Stable Diffusion Image-to-Image</a></td>
<td><a href="https://huggingface.co/runwayml/stable-diffusion-v1-5"> runwayml/stable-diffusion-v1-5 </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Text-guided Image Inpainting</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/inpaint">Stable Diffusion Inpaint</a></td>
<td><a href="https://huggingface.co/runwayml/stable-diffusion-inpainting"> runwayml/stable-diffusion-inpainting </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Image Variation</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/image_variation">Stable Diffusion Image Variation</a></td>
<td><a href="https://huggingface.co/lambdalabs/sd-image-variations-diffusers"> lambdalabs/sd-image-variations-diffusers </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Super Resolution</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/upscale">Stable Diffusion Upscale</a></td>
<td><a href="https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler"> stabilityai/stable-diffusion-x4-upscaler </a></td>
</tr>
<tr>
<td>Super Resolution</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/latent_upscale">Stable Diffusion Latent Upscale</a></td>
<td><a href="https://huggingface.co/stabilityai/sd-x2-latent-upscaler"> stabilityai/sd-x2-latent-upscaler </a></td>
</tr>
</table>
## Popular libraries using 🧨 Diffusers
- https://github.com/microsoft/TaskMatrix
- https://github.com/invoke-ai/InvokeAI
- https://github.com/apple/ml-stable-diffusion
- https://github.com/Sanster/lama-cleaner
- https://github.com/IDEA-Research/Grounded-Segment-Anything
- https://github.com/ashawkey/stable-dreamfusion
- https://github.com/deep-floyd/IF
- https://github.com/bentoml/BentoML
- https://github.com/bmaltais/kohya_ss
- +3000 other amazing GitHub repositories 💪
Thank you for using us ❤️
## Credits
This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today:

View File

@@ -26,7 +26,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
python3 -m pip install --no-cache-dir \
torch \
torchvision \
torchaudio \
torchaudio && \
python3 -m pip install --no-cache-dir \
accelerate \
datasets \
@@ -37,6 +37,9 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
numpy \
scipy \
tensorboard \
transformers
transformers \
omegaconf \
pytorch-lightning \
xformers
CMD ["/bin/bash"]

View File

@@ -4,7 +4,7 @@
- local: quicktour
title: Quicktour
- local: stable_diffusion
title: Stable Diffusion
title: Effective and efficient diffusion
- local: installation
title: Installation
title: Get started
@@ -25,9 +25,11 @@
- local: using-diffusers/schedulers
title: Load and compare different schedulers
- local: using-diffusers/custom_pipeline_overview
title: Load and add custom pipelines
- local: using-diffusers/kerascv
title: Load KerasCV Stable Diffusion checkpoints
title: Load community pipelines
- local: using-diffusers/using_safetensors
title: Load safetensors
- local: using-diffusers/other-formats
title: Load different Stable Diffusion formats
title: Loading & Hub
- sections:
- local: using-diffusers/pipeline_overview
@@ -42,22 +44,32 @@
title: Text-guided image-inpainting
- local: using-diffusers/depth2img
title: Text-guided depth-to-image
- local: using-diffusers/textual_inversion_inference
title: Textual inversion
- local: training/distributed_inference
title: Distributed inference with multiple GPUs
- local: using-diffusers/reusing_seeds
title: Improve image quality with deterministic generation
- local: using-diffusers/control_brightness
title: Control image brightness
- local: using-diffusers/reproducibility
title: Create reproducible pipelines
- local: using-diffusers/custom_pipeline_examples
title: Community Pipelines
title: Community pipelines
- local: using-diffusers/contribute_pipeline
title: How to contribute a Pipeline
- local: using-diffusers/using_safetensors
title: Using safetensors
title: How to contribute a community pipeline
- local: using-diffusers/stable_diffusion_jax_how_to
title: Stable Diffusion in JAX/Flax
- local: using-diffusers/weighted_prompts
title: Weighting Prompts
title: Pipelines for Inference
- sections:
- local: training/overview
title: Overview
- local: training/create_dataset
title: Create a dataset for training
- local: training/adapt_a_model
title: Adapt a model to a new task
- local: training/unconditional_training
title: Unconditional image generation
- local: training/text_inversion
@@ -72,6 +84,8 @@
title: ControlNet
- local: training/instructpix2pix
title: InstructPix2Pix Training
- local: training/custom_diffusion
title: Custom Diffusion
title: Training
- sections:
- local: using-diffusers/rl
@@ -95,10 +109,14 @@
title: ONNX
- local: optimization/open_vino
title: OpenVINO
- local: optimization/coreml
title: Core ML
- local: optimization/mps
title: MPS
- local: optimization/habana
title: Habana Gaudi
- local: optimization/tome
title: Token Merging
title: Optimization/Special Hardware
- sections:
- local: conceptual/philosophy
@@ -114,8 +132,8 @@
title: Conceptual Guides
- sections:
- sections:
- local: api/models
title: Models
- local: api/attnprocessor
title: Attention Processor
- local: api/diffusion_pipeline
title: Diffusion Pipeline
- local: api/logging
@@ -126,16 +144,48 @@
title: Outputs
- local: api/loaders
title: Loaders
- local: api/utilities
title: Utilities
- local: api/image_processor
title: VAE Image Processor
title: Main Classes
- sections:
- local: api/models/overview
title: Overview
- local: api/models/unet
title: UNet1DModel
- local: api/models/unet2d
title: UNet2DModel
- local: api/models/unet2d-cond
title: UNet2DConditionModel
- local: api/models/unet3d-cond
title: UNet3DConditionModel
- local: api/models/vq
title: VQModel
- local: api/models/autoencoderkl
title: AutoencoderKL
- local: api/models/transformer2d
title: Transformer2D
- local: api/models/transformer_temporal
title: Transformer Temporal
- local: api/models/prior_transformer
title: Prior Transformer
- local: api/models/controlnet
title: ControlNet
title: Models
- sections:
- local: api/pipelines/overview
title: Overview
- local: api/pipelines/alt_diffusion
title: AltDiffusion
- local: api/pipelines/attend_and_excite
title: Attend and Excite
- local: api/pipelines/audio_diffusion
title: Audio Diffusion
- local: api/pipelines/audioldm
title: AudioLDM
- local: api/pipelines/controlnet
title: ControlNet
- local: api/pipelines/cycle_diffusion
title: Cycle Diffusion
- local: api/pipelines/dance_diffusion
@@ -144,24 +194,38 @@
title: DDIM
- local: api/pipelines/ddpm
title: DDPM
- local: api/pipelines/diffedit
title: DiffEdit
- local: api/pipelines/dit
title: DiT
- local: api/pipelines/if
title: IF
- local: api/pipelines/pix2pix
title: InstructPix2Pix
- local: api/pipelines/kandinsky
title: Kandinsky
- local: api/pipelines/latent_diffusion
title: Latent Diffusion
- local: api/pipelines/panorama
title: MultiDiffusion Panorama
- local: api/pipelines/paint_by_example
title: PaintByExample
- local: api/pipelines/paradigms
title: Parallel Sampling of Diffusion Models
- local: api/pipelines/pix2pix_zero
title: Pix2Pix Zero
- local: api/pipelines/pndm
title: PNDM
- local: api/pipelines/repaint
title: RePaint
- local: api/pipelines/stable_diffusion_safe
title: Safe Stable Diffusion
- local: api/pipelines/score_sde_ve
title: Score SDE VE
- local: api/pipelines/self_attention_guidance
title: Self-Attention Guidance
- local: api/pipelines/semantic_stable_diffusion
title: Semantic Guidance
- local: api/pipelines/spectrogram_diffusion
title: "Spectrogram Diffusion"
title: Spectrogram Diffusion
- sections:
- local: api/pipelines/stable_diffusion/overview
title: Overview
@@ -175,37 +239,33 @@
title: Depth-to-Image
- local: api/pipelines/stable_diffusion/image_variation
title: Image-Variation
- local: api/pipelines/stable_diffusion/upscale
title: Super-Resolution
- local: api/pipelines/stable_diffusion/stable_diffusion_safe
title: Safe Stable Diffusion
- local: api/pipelines/stable_diffusion/stable_diffusion_2
title: Stable Diffusion 2
- local: api/pipelines/stable_diffusion/latent_upscale
title: Stable-Diffusion-Latent-Upscaler
- local: api/pipelines/stable_diffusion/pix2pix
title: InstructPix2Pix
- local: api/pipelines/stable_diffusion/attend_and_excite
title: Attend and Excite
- local: api/pipelines/stable_diffusion/pix2pix_zero
title: Pix2Pix Zero
- local: api/pipelines/stable_diffusion/self_attention_guidance
title: Self-Attention Guidance
- local: api/pipelines/stable_diffusion/panorama
title: MultiDiffusion Panorama
- local: api/pipelines/stable_diffusion/controlnet
title: Text-to-Image Generation with ControlNet Conditioning
- local: api/pipelines/stable_diffusion/model_editing
title: Text-to-Image Model Editing
- local: api/pipelines/stable_diffusion/upscale
title: Super-Resolution
- local: api/pipelines/stable_diffusion/ldm3d_diffusion
title: LDM3D Text-to-(RGB, Depth)
title: Stable Diffusion
- local: api/pipelines/stable_diffusion_2
title: Stable Diffusion 2
- local: api/pipelines/stable_unclip
title: Stable unCLIP
- local: api/pipelines/stochastic_karras_ve
title: Stochastic Karras VE
- local: api/pipelines/model_editing
title: Text-to-Image Model Editing
- local: api/pipelines/text_to_video
title: Text-to-Video
- local: api/pipelines/text_to_video_zero
title: Text-to-Video Zero
- local: api/pipelines/unclip
title: UnCLIP
- local: api/pipelines/latent_diffusion_uncond
title: Unconditional Latent Diffusion
- local: api/pipelines/unidiffuser
title: UniDiffuser
- local: api/pipelines/versatile_diffusion
title: Versatile Diffusion
- local: api/pipelines/vq_diffusion
@@ -226,12 +286,16 @@
title: DPM Discrete Scheduler
- local: api/schedulers/dpm_discrete_ancestral
title: DPM Discrete Scheduler with ancestral sampling
- local: api/schedulers/dpm_sde
title: DPMSolverSDEScheduler
- local: api/schedulers/euler_ancestral
title: Euler Ancestral Scheduler
- local: api/schedulers/euler
title: Euler scheduler
- local: api/schedulers/heun
title: Heun Scheduler
- local: api/schedulers/multistep_dpm_solver_inverse
title: Inverse Multistep DPM-Solver
- local: api/schedulers/ipndm
title: IPNDM
- local: api/schedulers/lms_discrete

View File

@@ -0,0 +1,42 @@
# Attention Processor
An attention processor is a class for applying different types of attention mechanisms.
## AttnProcessor
[[autodoc]] models.attention_processor.AttnProcessor
## AttnProcessor2_0
[[autodoc]] models.attention_processor.AttnProcessor2_0
## LoRAAttnProcessor
[[autodoc]] models.attention_processor.LoRAAttnProcessor
## LoRAAttnProcessor2_0
[[autodoc]] models.attention_processor.LoRAAttnProcessor2_0
## CustomDiffusionAttnProcessor
[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor
## AttnAddedKVProcessor
[[autodoc]] models.attention_processor.AttnAddedKVProcessor
## AttnAddedKVProcessor2_0
[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0
## LoRAAttnAddedKVProcessor
[[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor
## XFormersAttnProcessor
[[autodoc]] models.attention_processor.XFormersAttnProcessor
## LoRAXFormersAttnProcessor
[[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor
## CustomDiffusionXFormersAttnProcessor
[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor
## SlicedAttnProcessor
[[autodoc]] models.attention_processor.SlicedAttnProcessor
## SlicedAttnAddedKVProcessor
[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor

View File

@@ -12,8 +12,13 @@ specific language governing permissions and limitations under the License.
# Configuration
Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`] which conveniently takes care of storing all the parameters that are
passed to their respective `__init__` methods in a JSON-configuration file.
Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`] which stores all the parameters that are passed to their respective `__init__` methods in a JSON-configuration file.
<Tip>
To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`.
</Tip>
## ConfigMixin

View File

@@ -12,36 +12,25 @@ specific language governing permissions and limitations under the License.
# Pipelines
The [`DiffusionPipeline`] is the easiest way to load any pretrained diffusion pipeline from the [Hub](https://huggingface.co/models?library=diffusers) and to use it in inference.
The [`DiffusionPipeline`] is the quickest way to load any pretrained diffusion pipeline from the [Hub](https://huggingface.co/models?library=diffusers) for inference.
<Tip>
One should not use the Diffusion Pipeline class for training or fine-tuning a diffusion model. Individual
components of diffusion pipelines are usually trained individually, so we suggest to directly work
with [`UNetModel`] and [`UNetConditionModel`].
You shouldn't use the [`DiffusionPipeline`] class for training or finetuning a diffusion model. Individual
components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead.
</Tip>
Any diffusion pipeline that is loaded with [`~DiffusionPipeline.from_pretrained`] will automatically
detect the pipeline type, *e.g.* [`StableDiffusionPipeline`] and consequently load each component of the
pipeline and pass them into the `__init__` function of the pipeline, *e.g.* [`~StableDiffusionPipeline.__init__`].
The pipeline type (for example [`StableDiffusionPipeline`]) of any diffusion pipeline loaded with [`~DiffusionPipeline.from_pretrained`] is automatically
detected and pipeline components are loaded and passed to the `__init__` function of the pipeline.
Any pipeline object can be saved locally with [`~DiffusionPipeline.save_pretrained`].
## DiffusionPipeline
[[autodoc]] DiffusionPipeline
- all
- __call__
- device
- to
- components
## ImagePipelineOutput
By default diffusion pipelines return an object of class
[[autodoc]] pipelines.ImagePipelineOutput
## AudioPipelineOutput
By default diffusion pipelines return an object of class
[[autodoc]] pipelines.AudioPipelineOutput

View File

@@ -0,0 +1,27 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# VAE Image Processor
The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]s to prepare image inputs for VAE encoding and to post-process outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays.
All pipelines with [`VaeImageProcessor`] accept PIL Image, PyTorch tensor, or NumPy arrays as image inputs and return outputs based on the user's `output_type` argument. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="pt"`). This allows you to take the generated latents from one pipeline and pass them to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between them.
## VaeImageProcessor
[[autodoc]] image_processor.VaeImageProcessor
## VaeImageProcessorLDM3D
The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs.
[[autodoc]] image_processor.VaeImageProcessorLDM3D
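
A hedged sketch of the latent hand-off described here, using the latent upscaler as the second pipeline (model ids are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

text2img = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
low_res_latents = text2img(prompt, output_type="latent").images  # stays in latent space
image = upscaler(prompt=prompt, image=low_res_latents).images[0]
```
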

View File

@@ -12,19 +12,26 @@ specific language governing permissions and limitations under the License.
# Loaders
There are many ways to train adapter neural networks for diffusion models, such as
- [Textual Inversion](./training/text_inversion.mdx)
- [LoRA](https://github.com/cloneofsimo/lora)
- [Hypernetworks](https://arxiv.org/abs/1609.09106)
Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusion model to generate images in a specific style without training or finetuning the entire model. The adapter weights are typically only a tiny fraction of the pretrained model's, which makes them very portable. 🤗 Diffusers provides an easy-to-use `LoaderMixin` API to load adapter weights.
Such adapter neural networks often only consist of a fraction of the number of weights compared
to the pretrained model and as such are very portable. The Diffusers library offers an easy-to-use
API to load such adapter neural networks via the [`loaders.py` module](https://github.com/huggingface/diffusers/blob/main/src/diffusers/loaders.py).
<Tip warning={true}>
**Note**: This module is still highly experimental and prone to future changes.
🧪 The `LoaderMixins` are highly experimental and prone to future changes. To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`.
## LoaderMixins
</Tip>
### UNet2DConditionLoadersMixin
## UNet2DConditionLoadersMixin
[[autodoc]] loaders.UNet2DConditionLoadersMixin
## TextualInversionLoaderMixin
[[autodoc]] loaders.TextualInversionLoaderMixin
## LoraLoaderMixin
[[autodoc]] loaders.LoraLoaderMixin
## FromCkptMixin
[[autodoc]] loaders.FromCkptMixin

View File

@@ -12,12 +12,9 @@ specific language governing permissions and limitations under the License.
# Logging
🧨 Diffusers has a centralized logging system, so that you can setup the verbosity of the library easily.
🤗 Diffusers has a centralized logging system to easily manage the verbosity of the library. The default verbosity is set to `WARNING`.
Currently the default verbosity of the library is `WARNING`.
To change the level of verbosity, just use one of the direct setters. For instance, here is how to change the verbosity
to the INFO level.
To change the verbosity level, use one of the direct setters. For instance, to change the verbosity to the `INFO` level.
```python
import diffusers
@@ -33,7 +30,7 @@ DIFFUSERS_VERBOSITY=error ./myprogram.py
```
Additionally, some `warnings` can be disabled by setting the environment variable
`DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like *1*. This will disable any warning that is logged using
`DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like `1`. This disables any warning logged by
[`logger.warning_advice`]. For example:
```bash
@@ -52,20 +49,21 @@ logger.warning("WARN")
```
All the methods of this logging module are documented below, the main ones are
All methods of the logging module are documented below. The main methods are
[`logging.get_verbosity`] to get the current level of verbosity in the logger and
[`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least
verbose to the most verbose), those levels (with their corresponding int values in parenthesis) are:
[`logging.set_verbosity`] to set the verbosity to the level of your choice.
- `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` (int value, 50): only report the most
critical errors.
- `diffusers.logging.ERROR` (int value, 40): only report errors.
- `diffusers.logging.WARNING` or `diffusers.logging.WARN` (int value, 30): only reports error and
warnings. This the default level used by the library.
- `diffusers.logging.INFO` (int value, 20): reports error, warnings and basic information.
- `diffusers.logging.DEBUG` (int value, 10): report all information.
In order from the least verbose to the most verbose:
By default, `tqdm` progress bars will be displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to suppress or unsuppress this behavior.
| Method | Integer value | Description |
|----------------------------------------------------------:|--------------:|----------------------------------------------------:|
| `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` | 50 | only report the most critical errors |
| `diffusers.logging.ERROR` | 40 | only report errors |
| `diffusers.logging.WARNING` or `diffusers.logging.WARN` | 30 | only report errors and warnings (default) |
| `diffusers.logging.INFO` | 20 | only report errors, warnings, and basic information |
| `diffusers.logging.DEBUG` | 10 | report all information |
By default, `tqdm` progress bars are displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] are used to enable or disable this behavior.
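
A minimal sketch of the setters described here:

```python
import diffusers

diffusers.logging.set_verbosity(diffusers.logging.INFO)  # or: diffusers.logging.set_verbosity_info()
print(diffusers.logging.get_verbosity())                 # 20

diffusers.logging.disable_progress_bar()  # hide tqdm bars during downloads
diffusers.logging.enable_progress_bar()   # restore them
```
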
## Base setters

View File

@@ -1,107 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Models
Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models.
The primary function of these models is to denoise an input sample, by modeling the distribution $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$.
The models are built on the base class ['ModelMixin'] that is a `torch.nn.module` with basic functionality for saving and loading models both locally and from the HuggingFace hub.
## ModelMixin
[[autodoc]] ModelMixin
## UNet2DOutput
[[autodoc]] models.unet_2d.UNet2DOutput
## UNet2DModel
[[autodoc]] UNet2DModel
## UNet1DOutput
[[autodoc]] models.unet_1d.UNet1DOutput
## UNet1DModel
[[autodoc]] UNet1DModel
## UNet2DConditionOutput
[[autodoc]] models.unet_2d_condition.UNet2DConditionOutput
## UNet2DConditionModel
[[autodoc]] UNet2DConditionModel
## UNet3DConditionOutput
[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
## UNet3DConditionModel
[[autodoc]] UNet3DConditionModel
## DecoderOutput
[[autodoc]] models.vae.DecoderOutput
## VQEncoderOutput
[[autodoc]] models.vq_model.VQEncoderOutput
## VQModel
[[autodoc]] VQModel
## AutoencoderKLOutput
[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput
## AutoencoderKL
[[autodoc]] AutoencoderKL
## Transformer2DModel
[[autodoc]] Transformer2DModel
## Transformer2DModelOutput
[[autodoc]] models.transformer_2d.Transformer2DModelOutput
## TransformerTemporalModel
[[autodoc]] models.transformer_temporal.TransformerTemporalModel
## TransformerTemporalModelOutput
[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput
## PriorTransformer
[[autodoc]] models.prior_transformer.PriorTransformer
## PriorTransformerOutput
[[autodoc]] models.prior_transformer.PriorTransformerOutput
## ControlNetOutput
[[autodoc]] models.controlnet.ControlNetOutput
## ControlNetModel
[[autodoc]] ControlNetModel
## FlaxModelMixin
[[autodoc]] FlaxModelMixin
## FlaxUNet2DConditionOutput
[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput
## FlaxUNet2DConditionModel
[[autodoc]] FlaxUNet2DConditionModel
## FlaxDecoderOutput
[[autodoc]] models.vae_flax.FlaxDecoderOutput
## FlaxAutoencoderKLOutput
[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput
## FlaxAutoencoderKL
[[autodoc]] FlaxAutoencoderKL
## FlaxControlNetOutput
[[autodoc]] models.controlnet_flax.FlaxControlNetOutput
## FlaxControlNetModel
[[autodoc]] FlaxControlNetModel

View File

@@ -0,0 +1,31 @@
# AutoencoderKL
The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.
The abstract from the paper is:
*How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.*
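As a rough usage sketch (the checkpoint id, repo layout, and image URL are examples rather than part of this API page), the VAE from a Stable Diffusion checkpoint can encode an image into latents and decode them back:

```python
import numpy as np
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image

# Example checkpoint: the VAE stored in the "vae" subfolder of a Stable Diffusion repo.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = load_image(
    "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
).resize((512, 512))
pixels = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 127.5 - 1.0  # scale to [-1, 1]
pixels = pixels.unsqueeze(0)

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()
    reconstruction = vae.decode(latents).sample
print(latents.shape, reconstruction.shape)  # (1, 4, 64, 64) and (1, 3, 512, 512)
```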
## AutoencoderKL
[[autodoc]] AutoencoderKL
## AutoencoderKLOutput
[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput
## DecoderOutput
[[autodoc]] models.vae.DecoderOutput
## FlaxAutoencoderKL
[[autodoc]] FlaxAutoencoderKL
## FlaxAutoencoderKLOutput
[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput
## FlaxDecoderOutput
[[autodoc]] models.vae_flax.FlaxDecoderOutput

View File

@@ -0,0 +1,23 @@
# ControlNet
The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.
The abstract from the paper is:
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
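As a rough sketch of how this model class is typically consumed (the checkpoint ids and control image URL are examples borrowed from the pipeline documentation, not requirements of this page):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Example pairing: a Canny-conditioned ControlNet plugged into a Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Placeholder control image: replace it with a Canny edge map that matches your prompt.
canny_image = load_image(
    "https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"
)

image = pipe("a bird perched on a branch", image=canny_image, num_inference_steps=20).images[0]
image.save("controlnet_example.png")
```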
## ControlNetModel
[[autodoc]] ControlNetModel
## ControlNetOutput
[[autodoc]] models.controlnet.ControlNetOutput
## FlaxControlNetModel
[[autodoc]] FlaxControlNetModel
## FlaxControlNetOutput
[[autodoc]] models.controlnet_flax.FlaxControlNetOutput

View File

@@ -0,0 +1,12 @@
# Models
🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).
All models are built from the base [`ModelMixin`] class, which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub.
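As a brief, non-authoritative sketch of what [`ModelMixin`] provides, a small randomly configured model can be saved and reloaded like any other pretrained component (the config values and local path below are arbitrary placeholders):

```python
from diffusers import UNet2DModel

# A tiny, randomly initialized model -- the configuration is arbitrary and for illustration only.
model = UNet2DModel(
    sample_size=32,
    in_channels=3,
    out_channels=3,
    layers_per_block=1,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "UpBlock2D"),
)

# ModelMixin supplies save_pretrained/from_pretrained for local paths and Hub repositories.
model.save_pretrained("./tiny-unet")  # placeholder directory
restored = UNet2DModel.from_pretrained("./tiny-unet")
print(sum(p.numel() for p in restored.parameters()))
```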
## ModelMixin
[[autodoc]] ModelMixin
## FlaxModelMixin
[[autodoc]] FlaxModelMixin

View File

@@ -0,0 +1,16 @@
# Prior Transformer
The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.
The abstract from the paper is:
*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
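In practice the prior is usually driven by a pipeline (for example the Kandinsky prior pipeline) rather than called directly. As a minimal loading sketch, assuming a checkpoint that stores the transformer in a `prior` subfolder:

```python
from diffusers import PriorTransformer

# Assumed checkpoint layout: the Kandinsky 2.1 prior repo keeps its transformer in a "prior" subfolder.
prior = PriorTransformer.from_pretrained("kandinsky-community/kandinsky-2-1-prior", subfolder="prior")
print(f"{sum(p.numel() for p in prior.parameters()) / 1e6:.0f}M parameters")
```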
## PriorTransformer
[[autodoc]] PriorTransformer
## PriorTransformerOutput
[[autodoc]] models.prior_transformer.PriorTransformerOutput

View File

@@ -0,0 +1,29 @@
# Transformer2D
A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.
When the input is **continuous**:
1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
2. Apply the Transformer blocks in the standard way.
3. Reshape to image.
When the input is **discrete**:
<Tip>
It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.
</Tip>
1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings.
2. Apply the Transformer blocks in the standard way.
3. Predict classes of unnoised image.
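To make the continuous path concrete, here is a small self-contained sketch with an arbitrary toy configuration (the sizes are illustrative, not taken from any checkpoint):

```python
import torch
from diffusers import Transformer2DModel

# Toy continuous-input configuration: inner dimension = 2 heads x 16 dims = 32 channels.
model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=16,
    in_channels=32,
    num_layers=1,
    norm_num_groups=32,
)

hidden_states = torch.randn(1, 32, 16, 16)  # image-like features: (batch, channels, height, width)
with torch.no_grad():
    out = model(hidden_states).sample
print(out.shape)  # torch.Size([1, 32, 16, 16]) -- projected, transformed, and reshaped back to an image
```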
## Transformer2DModel
[[autodoc]] Transformer2DModel
## Transformer2DModelOutput
[[autodoc]] models.transformer_2d.Transformer2DModelOutput

View File

@@ -0,0 +1,11 @@
# Transformer Temporal
A Transformer model for video-like data.
## TransformerTemporalModel
[[autodoc]] models.transformer_temporal.TransformerTemporalModel
## TransformerTemporalModelOutput
[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput

View File

@@ -0,0 +1,13 @@
# UNet1DModel
The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 1D UNet model.
The abstract from the paper is:
*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
## UNet1DModel
[[autodoc]] UNet1DModel
## UNet1DOutput
[[autodoc]] models.unet_1d.UNet1DOutput

View File

@@ -0,0 +1,19 @@
# UNet2DConditionModel
The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model.
The abstract from the paper is:
*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
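As a rough sketch of a single denoising call (the checkpoint id is just an example, and the tensors below are random stand-ins for latents and text embeddings):

```python
import torch
from diffusers import UNet2DConditionModel

# Example checkpoint: the UNet stored in the "unet" subfolder of a Stable Diffusion repo.
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, unet.config.in_channels, 64, 64)               # random 64x64 latent
text_embeddings = torch.randn(1, 77, unet.config.cross_attention_dim)   # stand-in for CLIP text embeddings

with torch.no_grad():
    noise_pred = unet(latents, timestep=10, encoder_hidden_states=text_embeddings).sample
print(noise_pred.shape)  # same spatial size as the input latents
```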
## UNet2DConditionModel
[[autodoc]] UNet2DConditionModel
## UNet2DConditionOutput
[[autodoc]] models.unet_2d_condition.UNet2DConditionOutput
## FlaxUNet2DConditionModel
[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionModel
## FlaxUNet2DConditionOutput
[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput

View File

@@ -0,0 +1,13 @@
# UNet2DModel
The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet model.
The abstract from the paper is:
*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
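To illustrate how this unconditional UNet slots into the diffusion loop, here is a toy sketch with a randomly initialized model and a DDPM scheduler (the configuration is arbitrary and untrained, so the output is just noise):

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel

# Tiny, randomly initialized 2D UNet -- for illustration only, it has not been trained.
model = UNet2DModel(
    sample_size=32,
    in_channels=3,
    out_channels=3,
    layers_per_block=1,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "UpBlock2D"),
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(10)

sample = torch.randn(1, 3, 32, 32)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, timestep=t).sample            # predict the noise residual
    sample = scheduler.step(noise_pred, t, sample).prev_sample   # one reverse-diffusion step
print(sample.shape)  # torch.Size([1, 3, 32, 32]) -- same size as the input, as noted above
```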
## UNet2DModel
[[autodoc]] UNet2DModel
## UNet2DOutput
[[autodoc]] models.unet_2d.UNet2DOutput

View File

@@ -0,0 +1,13 @@
# UNet3DConditionModel
The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 3D UNet conditional model.
The abstract from the paper is:
*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
## UNet3DConditionModel
[[autodoc]] UNet3DConditionModel
## UNet3DConditionOutput
[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput

View File

@@ -0,0 +1,15 @@
# VQModel
The VQ-VAE model was introduced in [Neural Discrete Representation Learning](https://huggingface.co/papers/1711.00937) by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike [`AutoencoderKL`], the [`VQModel`] works in a quantized latent space.
The abstract from the paper is:
*Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.*
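As a decoding sketch, assuming an LDM-style checkpoint that stores its VQ autoencoder in a `vqvae` subfolder (the repository id below is an example of that layout):

```python
import torch
from diffusers import VQModel

# Assumed layout: the CompVis latent-diffusion CelebA-HQ repo keeps its VQ autoencoder under "vqvae".
vqvae = VQModel.from_pretrained("CompVis/ldm-celebahq-256", subfolder="vqvae")

# For this checkpoint the codebook dimension matches `latent_channels`, so random latents of that
# shape can be decoded; decode() quantizes them against the learned codebook before reconstructing.
latents = torch.randn(1, vqvae.config.latent_channels, 64, 64)
with torch.no_grad():
    image = vqvae.decode(latents).sample
print(image.shape)  # torch.Size([1, 3, 256, 256])
```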
## VQModel
[[autodoc]] VQModel
## VQEncoderOutput
[[autodoc]] models.vq_model.VQEncoderOutput

View File

@@ -10,13 +10,11 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Outputs
All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries.
For example:
```python
from diffusers import DDIMPipeline
pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")
outputs = pipeline()
```
The `outputs` object is a [`~pipelines.ImagePipelineOutput`], which means it has an `images` attribute.
You can access each attribute as you normally would or with a keyword lookup, and if that attribute is not returned by the model, you will get `None`:
```python
outputs.images
```
or via keyword lookup
```python
outputs["images"]
```
When considering the `outputs` object as a tuple, it only considers the attributes that don't have `None` values.
For instance, retrieving an image by indexing into it returns the tuple `(outputs.images)`:
```python
outputs[:1]
```
<Tip>
To check a specific pipeline or model output, refer to its corresponding API documentation.
</Tip>
## BaseOutput
[[autodoc]] utils.BaseOutput
- to_tuple
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput
## FlaxImagePipelineOutput
[[autodoc]] pipelines.pipeline_flax_utils.FlaxImagePipelineOutput
## AudioPipelineOutput
[[autodoc]] pipelines.AudioPipelineOutput
## ImageTextPipelineOutput
[[autodoc]] ImageTextPipelineOutput

View File

@@ -28,11 +28,11 @@ The abstract of the paper is the following:
## Tips
- AltDiffusion is conceptually exactly the same as [Stable Diffusion](./stable_diffusion/overview).
- *Run AltDiffusion*
AltDiffusion can be tested very easily with the [`AltDiffusionPipeline`], [`AltDiffusionImg2ImgPipeline`] and the `"BAAI/AltDiffusion-m9"` checkpoint, in exactly the same way as shown in the [Conditional Image Generation Guide](../../using-diffusers/conditional_image_generation) and the [Image-to-Image Generation Guide](../../using-diffusers/img2img).
- *How to load and use different schedulers.*

View File

@@ -25,14 +25,14 @@ This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit
## Text-to-Audio
The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm-s-full-v2](https://huggingface.co/cvssp/audioldm-s-full-v2) and generate text-conditional audio outputs:
```python
from diffusers import AudioLDMPipeline
import torch
import scipy
repo_id = "cvssp/audioldm"
repo_id = "cvssp/audioldm-s-full-v2"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
@@ -68,12 +68,14 @@ method, or pass the `scheduler` argument to the `from_pretrained` method of the
>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
>>> import torch
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
>>> # or
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", scheduler=dpm_scheduler, torch_dtype=torch.float16)
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm-s-full-v2", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained(
... "cvssp/audioldm-s-full-v2", scheduler=dpm_scheduler, torch_dtype=torch.float16
... )
```
## AudioLDMPipeline

View File

@@ -22,7 +22,7 @@ The abstract of the paper is the following:
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
This model was contributed by the community contributor [takuma104](https://huggingface.co/takuma104) ❤️.
Resources:
@@ -33,7 +33,9 @@ Resources:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet.py) | *Text-to-Image Generation with ControlNet Conditioning* | [Colab Example](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb)
| [StableDiffusionControlNetImg2ImgPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py) | *Image-to-Image Generation with ControlNet Conditioning* |
| [StableDiffusionControlNetInpaintPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py) | *Inpainting Generation with ControlNet Conditioning* |
## Usage example
@@ -131,7 +133,7 @@ This should take only around 3-4 seconds on GPU (depending on hardware). The out
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vermeer_disco_dancing.png)
**Note**: To see how to run all other ControlNet checkpoints, please have a look at [ControlNet with Stable Diffusion 1.5](#controlnet-with-stable-diffusion-1.5).
<!-- TODO: add space -->
@@ -242,6 +244,41 @@ image.save("./multi_controlnet_output.png")
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/multi_controlnet_output.png" width=600/>
### Guess Mode
Guess Mode is [a ControlNet feature that was implemented](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) after the publication of [the paper](https://arxiv.org/abs/2302.05543). The description states:
>In this mode, the ControlNet encoder will try best to recognize the content of the input control map, like depth map, edge map, scribbles, etc, even if you remove all prompts.
#### The core implementation:
It adjusts the scale of the output residuals from ControlNet by a fixed ratio depending on the block depth. The shallowest DownBlock corresponds to `0.1`. As the blocks get deeper, the scale increases exponentially, and the scale for the output of the MidBlock becomes `1.0`.
Since the core implementation is just this, **it does not have any impact on prompt conditioning**. While it is common to use it without specifying any prompts, it is also possible to provide prompts if desired.
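As a rough sketch of that scale schedule (this mirrors the description above rather than quoting the library code; the Stable Diffusion UNet used here yields 12 down-block residuals plus one mid-block output):

```py
import torch

# 13 exponentially spaced scales from ~0.1 (shallowest DownBlock) to 1.0 (MidBlock output).
scales = torch.logspace(-1, 0, 13)
print(scales[0].item(), scales[-1].item())  # ~0.1 and 1.0

# Applying them would look roughly like:
# down_block_res_samples = [s * c for s, c in zip(down_block_res_samples, scales[:-1])]
# mid_block_res_sample = mid_block_res_sample * scales[-1]
```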
#### Usage:
Just specify `guess_mode=True` in the pipe() function. A `guidance_scale` between 3.0 and 5.0 is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode).
```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch

# Control image to condition on; this URL is only a placeholder example -- use your own Canny edge map.
canny_image = load_image(
    "https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"
)

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet).to(
"cuda"
)
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
image.save("guess_mode_generated.png")
```
#### Output image comparison:
Canny Control Example
|no guess_mode with prompt|guess_mode without prompt|
|---|---|
|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"/></a>|
## Available checkpoints
ControlNet requires a *control image* in addition to the text-to-image *prompt*.
@@ -249,7 +286,9 @@ Each pretrained model is trained using a different conditioning method that requ
All checkpoints can be found under the authors' namespace [lllyasviel](https://huggingface.co/lllyasviel).
### ControlNet with Stable Diffusion 1.5
**13.04.2023 Update**: The author has released improved ControlNet checkpoints v1.1 - see [here](#controlnet-v1.1).
### ControlNet v1.0
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
@@ -262,6 +301,25 @@ All checkpoints can be found under the authors' namespace [lllyasviel](https://h
|[lllyasviel/sd-controlnet-scribble](https://huggingface.co/lllyasviel/sd-controlnet_scribble)<br/> *Trained with human scribbles* |A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_scribble.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_scribble.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"/></a> |
|[lllyasviel/sd-controlnet-seg](https://huggingface.co/lllyasviel/sd-controlnet_seg)<br/>*Trained with semantic segmentation* |An [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/)'s segmentation protocol image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_seg.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_seg.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"/></a> |
### ControlNet v1.1
| Model Name | Control Image Overview| Condition Image | Control Image Example | Generated Image Example |
|---|---|---|---|---|
|[lllyasviel/control_v11p_sd15_canny](https://huggingface.co/lllyasviel/control_v11p_sd15_canny)<br/> | *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11e_sd15_ip2p](https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p)<br/> | *Trained with pixel-to-pixel instruction* | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_inpaint](https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint)<br/> | Trained with image inpainting | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"/></a>|
|[lllyasviel/control_v11p_sd15_mlsd](https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd)<br/> | Trained with multi-level line segment detection | An image with annotated line segments.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11f1p_sd15_depth](https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth)<br/> | Trained with depth estimation | An image with depth information, usually represented as a grayscale image.|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_normalbae](https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae)<br/> | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_seg](https://huggingface.co/lllyasviel/control_v11p_sd15_seg)<br/> | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_lineart](https://huggingface.co/lllyasviel/control_v11p_sd15_lineart)<br/> | Trained with line art generation | An image with line art, usually black lines on a white background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15s2_lineart_anime](https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime)<br/> | Trained with anime line art generation | An image with anime-style line art.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_openpose](https://huggingface.co/lllyasviel/control_v11p_sd15_openpose)<br/> | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_scribble](https://huggingface.co/lllyasviel/control_v11p_sd15_scribble)<br/> | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_softedge](https://huggingface.co/lllyasviel/control_v11p_sd15_softedge)<br/> | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11e_sd15_shuffle](https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle)<br/> | Trained with image shuffling | An image with shuffled patches or regions.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11f1e_sd15_tile](https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile)<br/> | Trained with image tiling | A blurry image or part of an image.|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"/></a>|
## StableDiffusionControlNetPipeline
[[autodoc]] StableDiffusionControlNetPipeline
- all
@@ -272,6 +330,31 @@ All checkpoints can be found under the authors' namespace [lllyasviel](https://h
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## StableDiffusionControlNetImg2ImgPipeline
[[autodoc]] StableDiffusionControlNetImg2ImgPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## StableDiffusionControlNetInpaintPipeline
[[autodoc]] StableDiffusionControlNetInpaintPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## FlaxStableDiffusionControlNetPipeline
[[autodoc]] FlaxStableDiffusionControlNetPipeline

View File

@@ -0,0 +1,360 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Zero-shot Diffusion-based Semantic Image Editing with Mask Guidance
## Overview
[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://arxiv.org/abs/2210.11427) by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.
The abstract of the paper is the following:
*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*
Resources:
* [Paper](https://arxiv.org/abs/2210.11427).
* [Blog Post with Demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
* [Implementation on Github](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/).
## Tips
* The pipeline can generate masks that can be fed into other inpainting pipelines. Check out the code examples below to know more.
* In order to generate an image using this pipeline, both an image mask (manually specified or generated using `generate_mask`)
and a set of partially inverted latents (generated using `invert`) _must_ be provided as arguments when calling the pipeline to generate the final edited image.
Refer to the code examples below for more details.
* The function `generate_mask` exposes two prompt arguments, `source_prompt` and `target_prompt`,
that let you control the locations of the semantic edits in the final image to be generated. Let's say,
you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to
`source_prompt_embeds` and "dog" to `target_prompt_embeds`. Refer to the code example below for more details.
* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the
overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives.
Please refer to [this code example](#generating-image-captions-for-inversion) for more details.
* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to
the phrases including "cat" to `negative_prompt_embeds` and "dog" to `prompt_embeds`. Refer to the code example
below for more details.
* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
* Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
* Change the input prompt for `invert` to include "dog".
* Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
* Note that the source and target prompts, or their corresponding embeddings, can also be automatically generated. Please, refer to [this discussion](#generating-source-and-target-embeddings) for more details.
## Available Pipelines:
| Pipeline | Tasks
|---|---|
| [StableDiffusionDiffEditPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py) | *Text-Based Image Editing*
<!-- TODO: add Colab -->
## Usage example
### Based on an input image with a caption
When the pipeline is conditioned on an input image, we first obtain partially inverted latents from the input image using a
`DDIMInverseScheduler` with the help of a caption. Then we generate an editing mask to identify relevant regions in the image using the source and target prompts. Finally,
the inverted noise and generated mask are used to start the generation process.
First, let's load our pipeline:
```py
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
sd_model_ckpt = "stabilityai/stable-diffusion-2-1"
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
sd_model_ckpt,
torch_dtype=torch.float16,
safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
generator = torch.manual_seed(0)
```
Then, we load an input image to edit using our method:
```py
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
```
Then, we employ the source and target prompts to generate the editing mask:
```py
# See the "Generating source and target embeddings" section below to
# automate the generation of these captions with a pre-trained model like Flan-T5 as explained below.
source_prompt = "a bowl of fruits"
target_prompt = "a basket of fruits"
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt=source_prompt,
target_prompt=target_prompt,
generator=generator,
)
```
Then, we employ the caption and the input image to get the inverted latents:
```py
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image, generator=generator).latents
```
Now, generate the image with the inverted latents and semantically generated mask:
```py
image = pipeline(
prompt=target_prompt,
mask_image=mask_image,
image_latents=inv_latents,
generator=generator,
negative_prompt=source_prompt,
).images[0]
image.save("edited_image.png")
```
## Generating image captions for inversion
The authors originally used the source concept prompt as the caption for generating the partially inverted latents. However, we can also leverage open source and public image captioning models for the same purpose.
Below, we provide an end-to-end example with the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model
for generating captions.
First, let's load our automatic image captioning model:
```py
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor
captioner_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(captioner_id)
model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)
```
Then, we define a utility to generate captions from an input image using the model:
```py
@torch.no_grad()
def generate_caption(images, caption_generator, caption_processor):
text = "a photograph of"
inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
caption_generator.to("cuda")
outputs = caption_generator.generate(**inputs, max_new_tokens=128)
# offload caption generator
caption_generator.to("cpu")
caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
return caption
```
Then, we load an input image for conditioning and obtain a suitable caption for it:
```py
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
caption = generate_caption(raw_image, model, processor)
```
Then, we employ the generated caption and the input image to get the inverted latents:
```py
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
generator = torch.manual_seed(0)
inv_latents = pipeline.invert(prompt=caption, image=raw_image, generator=generator).latents
```
Now, generate the image with the inverted latents and semantically generated mask from our source and target prompts:
```py
source_prompt = "a bowl of fruits"
target_prompt = "a basket of fruits"
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt=source_prompt,
target_prompt=target_prompt,
generator=generator,
)
image = pipeline(
prompt=target_prompt,
mask_image=mask_image,
image_latents=inv_latents,
generator=generator,
negative_prompt=source_prompt,
).images[0]
image.save("edited_image.png")
```
## Generating source and target embeddings
The authors originally required the user to manually provide the source and target prompts for discovering
edit directions. However, we can also leverage open source and public models for the same purpose.
Below, we provide an end-to-end example with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model
for generating source and target embeddings.
**1. Load the generation model**:
```py
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
```
**2. Construct a starting prompt**:
```py
source_concept = "bowl"
target_concept = "basket"
source_text = f"Provide a caption for images containing a {source_concept}. "
"The captions should be in English and should be no longer than 150 characters."
target_text = f"Provide a caption for images containing a {target_concept}. "
"The captions should be in English and should be no longer than 150 characters."
```
Here, we're interested in the "bowl -> basket" direction.
**3. Generate prompts**:
We can use a utility like so for this purpose.
```py
@torch.no_grad()
def generate_prompts(input_prompt):
input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(
input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
)
return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
And then we just call it to generate our prompts:
```py
source_prompts = generate_prompts(source_text)
target_prompts = generate_prompts(target_text)
```
We encourage you to play around with the different parameters supported by the
`generate()` method ([documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.generation_tf_utils.TFGenerationMixin.generate)) for the generation quality you are looking for.
**4. Load the embedding model**:
Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.
```py
from diffusers import StableDiffusionDiffEditPipeline
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
generator = torch.manual_seed(0)
```
**5. Compute embeddings**:
```py
import torch
@torch.no_grad()
def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
embeddings = []
for sent in sentences:
text_inputs = tokenizer(
sent,
padding="max_length",
max_length=tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids
prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
embeddings.append(prompt_embeds)
return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)
source_embeddings = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
target_embeddings = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder)
```
And you're done! Now, you can use these embeddings directly while calling the pipeline:
```py
from diffusers import DDIMInverseScheduler, DDIMScheduler
from diffusers.utils import load_image
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt_embeds=source_embeddings,
target_prompt_embeds=target_embeddings,
generator=generator,
)
inv_latents = pipeline.invert(
prompt_embeds=source_embeddings,
image=raw_image,
generator=generator,
).latents
images = pipeline(
mask_image=mask_image,
image_latents=inv_latents,
prompt_embeds=target_embeddings,
negative_prompt_embeds=source_embeddings,
generator=generator,
).images
images[0].save("edited_image.png")
```
## StableDiffusionDiffEditPipeline
[[autodoc]] StableDiffusionDiffEditPipeline
- all
- generate_mask
- invert
- __call__

View File

@@ -0,0 +1,523 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# IF
## Overview
DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding.
The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules:
- Stage 1: a base model that generates a 64x64 px image from a text prompt,
- Stage 2: a 64x64 px => 256x256 px super-resolution model, and
- Stage 3: a 256x256 px => 1024x1024 px super-resolution model.
Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings,
which are then fed into a UNet architecture enhanced with cross-attention and attention pooling.
Stage 3 is [Stability's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler).
The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset.
Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.
## Usage
Before you can use IF, you need to accept its usage conditions. To do so:
1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in
2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). Accepting the license on the stage I model card will auto accept for the other IF models.
3. Make sure to login locally. Install `huggingface_hub`
```sh
pip install huggingface_hub --upgrade
```
run the login function in a Python shell
```py
from huggingface_hub import login
login()
```
and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/security-tokens#what-are-user-access-tokens).
Next we install `diffusers` and dependencies:
```sh
pip install diffusers accelerate transformers safetensors
```
The following sections give more detailed examples of how to use IF. Specifically:
- [Text-to-Image Generation](#text-to-image-generation)
- [Image-to-Image Generation](#text-guided-image-to-image-generation)
- [Inpainting](#text-guided-inpainting-generation)
- [Reusing model weights](#converting-between-different-pipelines)
- [Speed optimization](#optimizing-for-speed)
- [Memory optimization](#optimizing-for-memory)
**Available checkpoints**
- *Stage-1*
- [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0)
- [DeepFloyd/IF-I-L-v1.0](https://huggingface.co/DeepFloyd/IF-I-L-v1.0)
- [DeepFloyd/IF-I-M-v1.0](https://huggingface.co/DeepFloyd/IF-I-M-v1.0)
- *Stage-2*
- [DeepFloyd/IF-II-L-v1.0](https://huggingface.co/DeepFloyd/IF-II-L-v1.0)
- [DeepFloyd/IF-II-M-v1.0](https://huggingface.co/DeepFloyd/IF-II-M-v1.0)
- *Stage-3*
- [stabilityai/stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler)
**Demo**
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/DeepFloyd/IF)
**Google Colab**
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
### Text-to-Image Generation
By default diffusers makes use of [model cpu offloading](https://huggingface.co/docs/diffusers/optimization/fp16#model-offloading-for-fast-inference-and-memory-savings)
to run the whole IF pipeline with as little as 14 GB of VRAM.
```python
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()
# stage 2
stage_2 = DiffusionPipeline.from_pretrained(
"DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()
# stage 3
safety_modules = {
"feature_extractor": stage_1.feature_extractor,
"safety_checker": stage_1.safety_checker,
"watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)
# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
# stage 1
image = stage_1(
prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_I.png")
# stage 2
image = stage_2(
image=image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
generator=generator,
output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")
# stage 3
image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images
image[0].save("./if_stage_III.png")
```
### Text Guided Image-to-Image Generation
The same IF model weights can be used for text-guided image-to-image translation or image variation.
In this case just make sure to load the weights using the [`IFImg2ImgPipeline`] and [`IFImg2ImgSuperResolutionPipeline`] pipelines.
**Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines
without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).
```python
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
from PIL import Image
import requests
from io import BytesIO
# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image.resize((768, 512))
# stage 1
stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()
# stage 2
stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
"DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()
# stage 3
safety_modules = {
"feature_extractor": stage_1.feature_extractor,
"safety_checker": stage_1.safety_checker,
"watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()
prompt = "A fantasy landscape in style minecraft"
generator = torch.manual_seed(1)
# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
# stage 1
image = stage_1(
image=original_image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
generator=generator,
output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_I.png")
# stage 2
image = stage_2(
image=image,
original_image=original_image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
generator=generator,
output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")
# stage 3
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")
```
### Text Guided Inpainting Generation
The same IF model weights can be used for text-guided inpainting.
In this case just make sure to load the weights using the [`IFInpaintingPipeline`] and [`IFInpaintingSuperResolutionPipeline`] pipelines.
**Note**: You can also directly move the weights of the text-to-image pipelines to the inpainting pipelines
without loading them twice by making use of the [`~DiffusionPipeline.components`] property as explained [here](#converting-between-different-pipelines).
```python
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
from PIL import Image
import requests
from io import BytesIO
# download image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
# download mask
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
response = requests.get(url)
mask_image = Image.open(BytesIO(response.content))
# stage 1
stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()
# stage 2
stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
"DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()
# stage 3
safety_modules = {
"feature_extractor": stage_1.feature_extractor,
"safety_checker": stage_1.safety_checker,
"watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()
prompt = "blue sunglasses"
generator = torch.manual_seed(1)
# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
# stage 1
image = stage_1(
image=original_image,
mask_image=mask_image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
generator=generator,
output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_I.png")
# stage 2
image = stage_2(
image=image,
original_image=original_image,
mask_image=mask_image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
generator=generator,
output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")
# stage 3
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")
```
### Converting between different pipelines
In addition to being loaded with `from_pretrained`, pipelines can also be loaded directly from each other.
```python
from diffusers import IFPipeline, IFSuperResolutionPipeline
pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline
pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline
pipe_1 = IFInpaintingPipeline(**pipe_1.components)
pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
```
### Optimizing for speed
The simplest optimization to run IF faster is to move all model components to the GPU.
```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")
```
You can also run the diffusion process for fewer timesteps.
This can be done either with the `num_inference_steps` argument:
```py
pipe("<prompt>", num_inference_steps=30)
```
Or with the `timesteps` argument
```py
from diffusers.pipelines.deepfloyd_if import fast27_timesteps
pipe("<prompt>", timesteps=fast27_timesteps)
```
When doing image variation or inpainting, you can also decrease the number of timesteps
with the `strength` argument. The `strength` argument is the amount of noise to add to
the input image, which also determines how many steps to run in the denoising process.
A smaller number will vary the image less but run faster.
```py
pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")
image = pipe(image=image, prompt="<prompt>", strength=0.3).images
```
You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile`
with IF, and it might not give the expected results.
```py
import torch
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")
pipe.text_encoder = torch.compile(pipe.text_encoder)
pipe.unet = torch.compile(pipe.unet)
```
### Optimizing for memory
When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.
Either the model-based CPU offloading,
```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
```
or the more aggressive layer-based CPU offloading.
```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
```
Additionally, the T5 text encoder can be loaded in 8-bit precision:
```py
from transformers import T5EncoderModel
text_encoder = T5EncoderModel.from_pretrained(
"DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"DeepFloyd/IF-I-XL-v1.0",
text_encoder=text_encoder, # pass the previously instantiated 8bit text encoder
unet=None,
device_map="auto",
)
prompt_embeds, negative_embeds = pipe.encode_prompt("<prompt>")
```
For machines with constrained CPU RAM, such as the Google Colab free tier, where we can't load all
model components to the CPU at once, we can manually load the pipeline with only
the text encoder or only the UNet when the respective model component is needed.
```py
from diffusers import IFPipeline, IFSuperResolutionPipeline
import torch
import gc
from transformers import T5EncoderModel
from diffusers.utils import pt_to_pil
text_encoder = T5EncoderModel.from_pretrained(
"DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)
# text to image
pipe = DiffusionPipeline.from_pretrained(
"DeepFloyd/IF-I-XL-v1.0",
text_encoder=text_encoder, # pass the previously instantiated 8bit text encoder
unet=None,
device_map="auto",
)
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
# Remove the pipeline so we can re-load the pipeline with the unet
del text_encoder
del pipe
gc.collect()
torch.cuda.empty_cache()
pipe = IFPipeline.from_pretrained(
"DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)
generator = torch.Generator().manual_seed(0)
image = pipe(
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
output_type="pt",
generator=generator,
).images
pt_to_pil(image)[0].save("./if_stage_I.png")
# Remove the pipeline so we can load the super-resolution pipeline
del pipe
gc.collect()
torch.cuda.empty_cache()
# First super resolution
pipe = IFSuperResolutionPipeline.from_pretrained(
"DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)
generator = torch.Generator().manual_seed(0)
image = pipe(
image=image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
output_type="pt",
generator=generator,
).images
pt_to_pil(image)[0].save("./if_stage_II.png")
```
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_if.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - |
| [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py) | *Text-to-Image Generation* | - |
| [pipeline_if_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py) | *Image-to-Image Generation* | - |
| [pipeline_if_img2img_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py) | *Image-to-Image Generation* | - |
| [pipeline_if_inpainting.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py) | *Image-to-Image Generation* | - |
| [pipeline_if_inpainting_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py) | *Image-to-Image Generation* | - |
## IFPipeline
[[autodoc]] IFPipeline
- all
- __call__
## IFSuperResolutionPipeline
[[autodoc]] IFSuperResolutionPipeline
- all
- __call__
## IFImg2ImgPipeline
[[autodoc]] IFImg2ImgPipeline
- all
- __call__
## IFImg2ImgSuperResolutionPipeline
[[autodoc]] IFImg2ImgSuperResolutionPipeline
- all
- __call__
## IFInpaintingPipeline
[[autodoc]] IFInpaintingPipeline
- all
- __call__
## IFInpaintingSuperResolutionPipeline
[[autodoc]] IFInpaintingSuperResolutionPipeline
- all
- __call__


@@ -0,0 +1,365 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Kandinsky
## Overview
Kandinsky 2.1 inherits best practices from [DALL-E 2](https://arxiv.org/abs/2204.06125) and [Latent Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/latent_diffusion), while introducing some new ideas.
It uses [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for encoding images and text, and a diffusion image prior (mapping) between latent spaces of CLIP modalities. This approach enhances the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.
The Kandinsky model was created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov); the original codebase can be found [here](https://github.com/ai-forever/Kandinsky-2).
## Available Pipelines:
| Pipeline | Tasks |
|---|---|
| [pipeline_kandinsky.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py) | *Text-to-Image Generation* |
| [pipeline_kandinsky_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_img2img.py) | *Image-Guided Image Generation* |
## Usage example
In the following, we will walk you through some examples of how to use the Kandinsky pipelines to create some visually aesthetic artwork.
### Text-to-Image Generation
For text-to-image generation, we need to use both [`KandinskyPriorPipeline`] and [`KandinskyPipeline`].
The first step is to encode text prompts with CLIP and then diffuse the CLIP text embeddings to CLIP image embeddings,
as first proposed in [DALL-E 2](https://cdn.openai.com/papers/dall-e-2.pdf).
Let's throw a fun prompt at Kandinsky to see what it comes up with.
```py
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
```
First, let's instantiate the prior pipeline and the text-to-image pipeline. Both
pipelines are diffusion models.
```py
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
```
<Tip warning={true}>
By default, the text-to-image pipeline uses the [`DDIMScheduler`]; you can change the scheduler to the [`DDPMScheduler`]:
```py
from diffusers import DDPMScheduler
scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
t2i_pipe = DiffusionPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16
)
t2i_pipe.to("cuda")
```
</Tip>
Now we pass the prompt through the prior to generate image embeddings. The prior
returns both the image embeddings corresponding to the prompt and negative/unconditional image
embeddings corresponding to an empty string.
```py
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
```
<Tip warning={true}>
The text-to-image pipeline expects `image_embeds`, `negative_image_embeds`, and the original
`prompt`, since the text-to-image pipeline uses another text encoder to better guide the second diffusion
process of `t2i_pipe`.
By default, the prior returns unconditioned negative image embeddings corresponding to the negative prompt of `""`.
For better results, you can also pass a `negative_prompt` to the prior. This will increase the effective batch size
of the prior by a factor of 2.
```py
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
```
</Tip>
Next, we can pass the embeddings as well as the prompt to the text-to-image pipeline. Remember that
if you are using a customized negative prompt, you should also pass it to the text-to-image pipeline
with `negative_prompt=negative_prompt`:
```py
image = t2i_pipe(
prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768
).images[0]
image.save("cheeseburger_monster.png")
```
One cheeseburger monster coming up! Enjoy!
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/cheeseburger.png)
The Kandinsky model works extremely well with creative prompts. Here is some of the amazing art that can be created using the exact same process but with different prompts.
```python
prompt = "bird eye view shot of a full body woman with cyan light orange magenta makeup, digital art, long braided hair her face separated by makeup in the style of yin Yang surrealism, symmetrical face, real image, contrasting tone, pastel gradient background"
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/hair.png)
```python
prompt = "A car exploding into colorful dust"
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/dusts.png)
```python
prompt = "editorial photography of an organic, almost liquid smoke style armchair"
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/smokechair.png)
```python
prompt = "birds eye view of a quilted paper style alien planet landscape, vibrant colours, Cinematic lighting"
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/alienplanet.png)
### Text Guided Image-to-Image Generation
The same Kandinsky model weights can be used for text-guided image-to-image translation. In this case, just make sure to load the weights using the [`KandinskyImg2ImgPipeline`] pipeline.
**Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines
without loading them twice by making use of the [`~DiffusionPipeline.components`] function as explained [here](#converting-between-different-pipelines).
Let's download an image.
```python
from PIL import Image
import requests
from io import BytesIO
# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image.resize((768, 512))
```
![img](https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg)
```python
import torch
from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline
# create prior
pipe_prior = KandinskyPriorPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
)
pipe_prior.to("cuda")
# create img2img pipeline
pipe = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.to("cuda")
prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt).to_tuple()
out = pipe(
prompt,
image=original_image,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
height=768,
width=768,
strength=0.3,
)
out.images[0].save("fantasy_land.png")
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/img2img_fantasyland.png)
### Text Guided Inpainting Generation
You can use [`KandinskyInpaintPipeline`] to edit images. In this example, we will add a hat to the portrait of a cat.
```py
from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
from diffusers.utils import load_image
import torch
import numpy as np
pipe_prior = KandinskyPriorPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
)
pipe_prior.to("cuda")
prompt = "a hat"
prior_output = pipe_prior(prompt)
pipe = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.to("cuda")
init_image = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
)
mask = np.ones((768, 768), dtype=np.float32)
# Let's mask out an area above the cat's head
mask[:250, 250:-250] = 0
out = pipe(
prompt,
image=init_image,
mask_image=mask,
**prior_output,
height=768,
width=768,
num_inference_steps=150,
)
image = out.images[0]
image.save("cat_with_hat.png")
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/inpaint_cat_hat.png)
### Interpolate
The [`KandinskyPriorPipeline`] also comes with a cool utility function that allows you to easily interpolate the latent space of different images and texts. Here is an example of how you can create an Impressionist-style portrait of your pet based on "The Starry Night".
Note that you can interpolate between texts and images - in the example below, we pass a text prompt "a cat" and two images to the `interpolate` function, along with a `weights` variable containing the corresponding weights for each condition we interpolate.
```python
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers.utils import load_image
import PIL
import torch
pipe_prior = KandinskyPriorPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
)
pipe_prior.to("cuda")
img1 = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
)
img2 = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/starry_night.jpeg"
)
# add all the conditions we want to interpolate, can be either text or image
images_texts = ["a cat", img1, img2]
# specify the weights for each condition in images_texts
weights = [0.3, 0.3, 0.4]
# We can leave the prompt empty
prompt = ""
prior_out = pipe_prior.interpolate(images_texts, weights)
pipe = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.to("cuda")
image = pipe(prompt, **prior_out, height=768, width=768).images[0]
image.save("starry_cat.png")
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/starry_cat.png)
## Optimization
Running Kandinsky in inference requires running both a prior pipeline, [`KandinskyPriorPipeline`],
and an image decoding pipeline, which is one of [`KandinskyPipeline`], [`KandinskyImg2ImgPipeline`], or [`KandinskyInpaintPipeline`].
The bulk of the computation time will always be spent in the image decoding pipeline, so that is where
optimization efforts should be focused.
When running with PyTorch < 2.0, we strongly recommend making use of [`xformers`](https://github.com/facebookresearch/xformers)
to speed up attention computation. This can be done by simply running:
```py
from diffusers import DiffusionPipeline
import torch
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.enable_xformers_memory_efficient_attention()
```
When running on PyTorch >= 2.0, PyTorch's SDPA attention will automatically be used. For more information on
PyTorch's SDPA, feel free to have a look at [this blog post](https://pytorch.org/blog/accelerated-diffusers-pt-20/).
To have explicit control, you can also manually set the pipeline to use PyTorch's 2.0 efficient attention:
```py
from diffusers.models.attention_processor import AttnAddedKVProcessor2_0
t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
```
The slowest and most memory-intensive attention processor is the default `AttnAddedKVProcessor` processor.
We do **not** recommend using it except for testing purposes or cases where highly deterministic behaviour is desired.
You can set it with:
```py
from diffusers.models.attention_processor import AttnAddedKVProcessor
t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor())
```
With PyTorch >= 2.0, you can also use Kandinsky with `torch.compile`, which, depending
on your hardware, can significantly speed up your inference time once the model is compiled.
To use Kandinsky with `torch.compile`, you can do:
```py
t2i_pipe.unet.to(memory_format=torch.channels_last)
t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=True)
```
After compilation you should see very fast inference times. For more information,
feel free to have a look at [our PyTorch 2.0 benchmark](https://huggingface.co/docs/diffusers/main/en/optimization/torch2.0).
## KandinskyPriorPipeline
[[autodoc]] KandinskyPriorPipeline
- all
- __call__
- interpolate
## KandinskyPipeline
[[autodoc]] KandinskyPipeline
- all
- __call__
## KandinskyImg2ImgPipeline
[[autodoc]] KandinskyImg2ImgPipeline
- all
- __call__
## KandinskyInpaintPipeline
[[autodoc]] KandinskyInpaintPipeline
- all
- __call__


@@ -46,15 +46,22 @@ available a colab notebook to directly try them out.
|---|---|:---:|:---:|
| [alt_diffusion](./alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | -
| [audio_diffusion](./audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio_diffusion.git) | Unconditional Audio Generation |
| [controlnet](./api/pipelines/stable_diffusion/controlnet) | [**ControlNet with Stable Diffusion**](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb)
| [controlnet](./api/pipelines/controlnet) | [**ControlNet with Stable Diffusion**](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb)
| [cycle_diffusion](./cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
| [dance_diffusion](./dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
| [ddpm](./ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
| [ddim](./ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
| [if](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
| [if_img2img](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
| [if_inpainting](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
| [kandinsky](./kandinsky) | **Kandinsky** | Text-to-Image Generation |
| [kandinsky_inpaint](./kandinsky) | **Kandinsky** | Image-to-Image Generation |
| [kandinsky_img2img](./kandinsky) | **Kandinsky** | Image-to-Image Generation |
| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
| [latent_diffusion_uncond](./latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
| [paint_by_example](./paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
| [paradigms](./paradigms) | [**Parallel Sampling of Diffusion Models**](https://arxiv.org/abs/2305.16317) | Text-to-Image Generation |
| [pndm](./pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
| [score_sde_ve](./score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
| [score_sde_vp](./score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
@@ -69,20 +76,20 @@ available a colab notebook to directly try them out.
| [stable_diffusion_self_attention_guidance](./stable_diffusion/self_attention_guidance) | [**Self-Attention Guidance**](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [**Stable Diffusion Image Variations**](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [**Stable Diffusion Latent Upscaler**](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_2](./stable_diffusion_2/) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation |
| [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
| [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Depth-to-Image Text-Guided Generation |
| [stable_diffusion_2](./stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Depth-to-Image Text-Guided Generation |
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_safe](./stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb)
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation |
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation |
| [stochastic_karras_ve](./stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
| [unclip](./unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
| [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
| [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
| [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
| [vq_diffusion](./vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
| [text_to_video_sd](./api/pipelines/text_to_video) | [**Modelscope's Text-to-video-synthesis Model in Open Domain**](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
| [unclip](./unclip) | [**Hierarchical Text-Conditional Image Generation with CLIP Latents**](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
| [vq_diffusion](./vq_diffusion) | [**Vector Quantized Diffusion Model for Text-to-Image Synthesis**](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
| [text_to_video_zero](./text_to_video_zero) | [**Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators**](https://arxiv.org/abs/2303.13439) | Text-to-Video Generation |
**Note**: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers.
@@ -109,105 +116,3 @@ each pipeline, one should look directly into the respective pipeline.
**Note**: All pipelines have PyTorch's autograd disabled by decorating the `__call__` method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should
not be used for training. If you want to store the gradients during the forward pass, we recommend writing your own pipeline, see also our [community-examples](https://github.com/huggingface/diffusers/tree/main/examples/community).
## Contribution
We are more than happy about any contribution to the officially supported pipelines 🤗. We aspire
for all of our pipelines to be **self-contained**, **easy-to-tweak**, **beginner-friendly** and **one-purpose-only** (a minimal skeleton illustrating these principles is sketched after the list below).
- **Self-contained**: A pipeline shall be as self-contained as possible. More specifically, this means that all functionality should be either directly defined in the pipeline file itself, should be inherited from (and only from) the [`DiffusionPipeline` class](.../diffusion_pipeline) or be directly attached to the model and scheduler components of the pipeline.
- **Easy-to-use**: Pipelines should be extremely easy to use - one should be able to load the pipeline and
use it for its designated task, *e.g.* text-to-image generation, in just a couple of lines of code. Most
logic including pre-processing, an unrolled diffusion loop, and post-processing should all happen inside the `__call__` method.
- **Easy-to-tweak**: Certain pipelines will not be able to handle all use cases and tasks that you might like them to. If you want to use a certain pipeline for a specific use case that is not yet supported, you might have to copy the pipeline file and tweak the code to your needs. We try to make the pipeline code as readable as possible so that each part from pre-processing to diffusing to post-processing can easily be adapted. If you would like the community to benefit from your customized pipeline, we would love to see a contribution to our [community-examples](https://github.com/huggingface/diffusers/tree/main/examples/community). If you feel that an important pipeline should be part of the official pipelines but isn't, a contribution to the [official pipelines](./overview) would be even better.
- **One-purpose-only**: Pipelines should be used for one task and one task only. Even if two tasks are very similar from a modeling point of view, *e.g.* image2image translation and in-painting, pipelines shall be used for one task only to keep them *easy-to-tweak* and *readable*.
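To make these principles concrete, below is a minimal, hypothetical pipeline sketch. The class name `MyUnconditionalPipeline` and its arguments are illustrative only and not an existing diffusers pipeline; it simply shows the registered components, the unrolled denoising loop inside `__call__`, and the disabled autograd mentioned above.
```python
import torch
from diffusers import DiffusionPipeline


class MyUnconditionalPipeline(DiffusionPipeline):
    # Hypothetical self-contained pipeline: the model and scheduler are registered
    # so that save_pretrained / from_pretrained work, and all logic lives in __call__.
    def __init__(self, unet, scheduler):
        super().__init__()
        self.register_modules(unet=unet, scheduler=scheduler)

    @torch.no_grad()  # pipelines are inference-only, so autograd is disabled
    def __call__(self, batch_size=1, num_inference_steps=50, generator=None):
        # Start from pure noise and run the unrolled denoising loop
        sample = torch.randn(
            (batch_size, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
            generator=generator,
            device=self.device,
        )
        self.scheduler.set_timesteps(num_inference_steps)
        for t in self.scheduler.timesteps:
            noise_pred = self.unet(sample, t).sample
            sample = self.scheduler.step(noise_pred, t, sample).prev_sample
        return sample
```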
## Examples
### Text-to-Image generation with Stable Diffusion
```python
# make sure you're logged in with `huggingface-cli login`
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
```
### Image-to-Image text-guided generation with Stable Diffusion
The `StableDiffusionImg2ImgPipeline` lets you pass a text prompt and an initial image to condition the generation of new images.
```python
import requests
import torch
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionImg2ImgPipeline
# load the pipeline
device = "cuda"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to(
device
)
# let's download an initial image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))
prompt = "A fantasy landscape, trending on artstation"
images = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images
images[0].save("fantasy_landscape.png")
```
You can also run this example on colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb)
### Tweak prompts reusing seeds and latents
You can generate your own latents to reproduce results, or tweak your prompt on a specific result you liked. [This notebook](https://github.com/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb) shows how to do it step by step. You can also run it in Google Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb)
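As a rough sketch of the idea (assuming the `runwayml/stable-diffusion-v1-5` checkpoint and a 512x512 output, so the latents have shape `(1, 4, 64, 64)`), you can generate the latents yourself from a fixed seed and reuse them across tweaked prompts:
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Fixed seed -> reproducible starting noise
generator = torch.Generator(device="cuda").manual_seed(42)
latents = torch.randn((1, 4, 64, 64), generator=generator, device="cuda", dtype=torch.float16)

# Reusing the same latents with tweaked prompts keeps the overall composition similar
image_a = pipe("a photo of an astronaut riding a horse on mars", latents=latents).images[0]
image_b = pipe("a photo of an astronaut riding a horse on mars, oil painting", latents=latents).images[0]
image_a.save("horse_photo.png")
image_b.save("horse_painting.png")
```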
### In-painting using Stable Diffusion
The `StableDiffusionInpaintPipeline` lets you edit specific parts of an image by providing a mask and text prompt.
```python
import PIL
import requests
import torch
from io import BytesIO
from diffusers import StableDiffusionInpaintPipeline
def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
pipe = StableDiffusionInpaintPipeline.from_pretrained(
"runwayml/stable-diffusion-inpainting",
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
```
You can also run this example on colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb)


@@ -52,6 +52,14 @@ image = pipe(prompt).images[0]
image.save("dolomites.png")
```
<Tip>
While calling this pipeline, it is possible to set `view_batch_size` to a value greater than 1.
On some high-performance GPUs, a higher `view_batch_size` can speed up generation
at the cost of higher VRAM usage (see the short sketch after this tip).
</Tip>
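For example, a minimal sketch, assuming the `stabilityai/stable-diffusion-2-base` checkpoint used in this pipeline's examples and a GPU with some extra VRAM headroom:
```py
import torch
from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of the dolomites"
# Decode several panorama views per forward pass; faster, but uses more VRAM
image = pipe(prompt, view_batch_size=4).images[0]
image.save("dolomites_fast.png")
```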
## StableDiffusionPanoramaPipeline
[[autodoc]] StableDiffusionPanoramaPipeline
- __call__


@@ -0,0 +1,83 @@
<!--Copyright 2023 ParaDiGMS authors and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Parallel Sampling of Diffusion Models (ParaDiGMS)
## Overview
[Parallel Sampling of Diffusion Models](https://arxiv.org/abs/2305.16317) by Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari.
The abstract of the paper is the following:
*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*
Resources:
* [Paper](https://arxiv.org/abs/2305.16317).
* [Original Code](https://github.com/AndyShih12/paradigms).
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionParadigmsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py) | *Faster Text-to-Image Generation* | |
This pipeline was contributed by [`AndyShih12`](https://github.com/AndyShih12) in this [PR](https://github.com/huggingface/diffusers/pull/3716/).
## Usage example
```python
import torch
from diffusers import DDPMParallelScheduler
from diffusers import StableDiffusionParadigmsPipeline
scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
ngpu, batch_per_device = torch.cuda.device_count(), 5
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=[d for d in range(ngpu)])
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=1000).images[0]
```
<Tip>
This pipeline improves sampling speed by running denoising steps in parallel, at the cost of increased total FLOPs.
Therefore, it is better to call this pipeline when running on multiple GPUs. Otherwise, without enough GPU bandwidth
sampling may be even slower than sequential sampling.
The two parameters to play with are `parallel` (batch size) and `tolerance`.
- If it fits in memory, for 1000-step DDPM you can aim for a batch size of around 100
(e.g. 8 GPUs and batch_per_device=12 to get parallel=96). Higher batch size
may not fit in memory, and lower batch size gives less parallelism.
- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation.
If there is quality degradation with the default tolerance, then use a lower tolerance (e.g. 0.001).
For 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from using StableDiffusionParadigmsPipeline instead of StableDiffusionPipeline
by setting parallel=80 and tolerance=0.1 (see the sketch after this tip).
</Tip>
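As a sketch of how these two knobs map onto the pipeline call (assuming 8 GPUs, the same checkpoint as in the usage example above, and that `tolerance` is exposed as a call argument as the tip suggests):
```python
import torch
from diffusers import DDPMParallelScheduler, StableDiffusionParadigmsPipeline

scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# Shard the UNet across all visible GPUs so the parallel batch is split between them
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=list(range(torch.cuda.device_count())))

prompt = "a photo of an astronaut riding a horse on mars"
# parallel = how many denoising steps are guessed and refined per iteration,
# tolerance = convergence threshold of the Picard iterations (lower = stricter)
image = pipe(prompt, parallel=80, tolerance=0.1, num_inference_steps=1000).images[0]
```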
<Tip>
Diffusers also offers distributed inference support for generating multiple prompts
in parallel on multiple GPUs. Check out the docs [here](https://huggingface.co/docs/diffusers/main/en/training/distributed_inference).
In contrast, this pipeline is designed for speeding up sampling of a single prompt (by using multiple GPUs).
</Tip>
## StableDiffusionParadigmsPipeline
[[autodoc]] StableDiffusionParadigmsPipeline
- __call__
- all


@@ -68,3 +68,6 @@ images[0].save("snowy_mountains.png")
[[autodoc]] StableDiffusionInstructPix2PixPipeline
- __call__
- all
- load_textual_inversion
- load_lora_weights
- save_lora_weights


@@ -60,7 +60,7 @@ pipe = pipe.to("cuda")
generator = torch.Generator(device="cuda").manual_seed(0)
output = pipe(
original_image=original_image,
image=original_image,
mask_image=mask_image,
num_inference_steps=250,
eta=0.0,


@@ -0,0 +1,65 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Self-Attention Guidance (SAG)
## Overview
[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.
The abstract of the paper is the following:
*Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.*
Resources:
* [Project Page](https://ku-cvlab.github.io/Self-Attention-Guidance).
* [Paper](https://arxiv.org/abs/2210.00939).
* [Original Code](https://github.com/KU-CVLAB/Self-Attention-Guidance).
* [Hugging Face Demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance).
* [Colab Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [🤗 Space](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) |
## Usage example
```python
import torch
from diffusers import StableDiffusionSAGPipeline
from accelerate.utils import set_seed
pipe = StableDiffusionSAGPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
seed = 8978
prompt = "."
guidance_scale = 7.5
num_images_per_prompt = 1
sag_scale = 1.0
set_seed(seed)
images = pipe(
prompt, num_images_per_prompt=num_images_per_prompt, guidance_scale=guidance_scale, sag_scale=sag_scale
).images
images[0].save("example.png")
```
## StableDiffusionSAGPipeline
[[autodoc]] StableDiffusionSAGPipeline
- __call__
- all


@@ -24,11 +24,11 @@ The abstract of the paper is the following:
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [pipeline_semantic_stable_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/semantic-image-editing/blob/main/examples/SemanticGuidance.ipynb) | [Coming Soon](https://huggingface.co/AIML-TUDA)
| [pipeline_semantic_stable_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/semantic-image-editing/blob/main/examples/SemanticGuidance.ipynb) | [Coming Soon](https://huggingface.co/AIML-TUDA)
## Tips
- The Semantic Guidance pipeline can be used with any [Stable Diffusion](./api/pipelines/stable_diffusion/text2img) checkpoint.
- The Semantic Guidance pipeline can be used with any [Stable Diffusion](./stable_diffusion/text2img) checkpoint.
### Run Semantic Guidance
@@ -67,7 +67,7 @@ out = pipe(
)
```
For more examples check the colab notebook.
For more examples check the Colab notebook.
## StableDiffusionSafePipelineOutput
[[autodoc]] pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput

View File

@@ -30,7 +30,7 @@ As depicted above the model takes as input a MIDI file and tokenizes it into a s
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_spectrogram_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion) | *Unconditional Audio Generation* | - |
| [pipeline_spectrogram_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion.py) | *Unconditional Audio Generation* | - |
## Example usage

View File

@@ -31,3 +31,6 @@ Available Checkpoints are:
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- load_lora_weights
- save_lora_weights

View File

@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
## StableDiffusionImageVariationPipeline
[`StableDiffusionImageVariationPipeline`] lets you generate variations from an input image using Stable Diffusion. It uses a fine-tuned version of Stable Diffusion model, trained by [Justin Pinkney](https://www.justinpinkney.com/) (@Buntworthy) at [Lambda](https://lambdalabs.com/)
[`StableDiffusionImageVariationPipeline`] lets you generate variations from an input image using Stable Diffusion. It uses a fine-tuned version of Stable Diffusion model, trained by [Justin Pinkney](https://www.justinpinkney.com/) (@Buntworthy) at [Lambda](https://lambdalabs.com/).
The original codebase can be found here:
[Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations).
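As a quick illustration, here is a minimal sketch of how the pipeline might be called, assuming the `lambdalabs/sd-image-variations-diffusers` checkpoint from the codebase above and an arbitrary example image URL:
```python
import torch
from diffusers import StableDiffusionImageVariationPipeline
from diffusers.utils import load_image

# Assumed checkpoint id; any compatible image-variation checkpoint should work.
pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The pipeline is conditioned on an input image instead of a text prompt.
init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
).resize((512, 512))

images = pipe(image=init_image, guidance_scale=3.0).images
images[0].save("image_variation.png")
```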

View File

@@ -30,6 +30,10 @@ proposed by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- from_ckpt
- load_lora_weights
- save_lora_weights
[[autodoc]] FlaxStableDiffusionImg2ImgPipeline
- all

View File

@@ -31,6 +31,9 @@ Available checkpoints are:
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- load_lora_weights
- save_lora_weights
[[autodoc]] FlaxStableDiffusionInpaintPipeline
- all

View File

@@ -0,0 +1,55 @@
<!--Copyright 2023 The Intel Labs Team Authors and HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LDM3D
LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal
The abstract of the paper is the following:
*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).*
*Overview*:
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [pipeline_stable_diffusion_ldm3d.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py) | *Text-to-Image Generation* | - | -
## Tips
- LDM3D generates both an image and a depth map from a given text prompt, compared to existing text-to-image diffusion models such as [Stable Diffusion](./stable_diffusion/overview), which generate only an image.
- With almost the same number of parameters, LDM3D creates a latent space that can compress both the RGB images and the depth maps.
Running LDM3D is straightforward with the [`StableDiffusionLDM3DPipeline`]:
```python
from diffusers import StableDiffusionLDM3DPipeline

pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d")

prompt = "A picture of some lemons on a table"
output = pipe(prompt)
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save("lemons_ldm3d_rgb.jpg")
depth_image[0].save("lemons_ldm3d_depth.png")
```
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
- all
- __call__
## StableDiffusionLDM3DPipeline
[[autodoc]] StableDiffusionLDM3DPipeline
- all
- __call__

View File

@@ -26,18 +26,17 @@ For more details about how Stable Diffusion works and how it differs from the ba
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [StableDiffusionPipeline](./text2img) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb) | [🤗 Stable Diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion)
| [StableDiffusionPipelineSafe](./stable_diffusion_safe) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb) | [![Huggingface Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/AIML-TUDA/unsafe-vs-safe-stable-diffusion)
| [StableDiffusionImg2ImgPipeline](./img2img) | *Image-to-Image Text-Guided Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb) | [🤗 Diffuse the Rest](https://huggingface.co/spaces/huggingface/diffuse-the-rest)
| [StableDiffusionInpaintPipeline](./inpaint) | **Experimental** *Text-Guided Image Inpainting* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) | Coming soon
| [StableDiffusionDepth2ImgPipeline](./depth2img) | **Experimental** *Depth-to-Image Text-Guided Generation * | | Coming soon
| [StableDiffusionInpaintPipeline](./inpaint) | **Experimental** *Text-Guided Image Inpainting* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) |
| [StableDiffusionDepth2ImgPipeline](./depth2img) | **Experimental** *Depth-to-Image Text-Guided Generation* | |
| [StableDiffusionImageVariationPipeline](./image_variation) | **Experimental** *Image Variation Generation* | | [🤗 Stable Diffusion Image Variations](https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations)
| [StableDiffusionUpscalePipeline](./upscale) | **Experimental** *Text-Guided Image Super-Resolution * | | Coming soon
| [StableDiffusionLatentUpscalePipeline](./latent_upscale) | **Experimental** *Text-Guided Image Super-Resolution * | | Coming soon
| [StableDiffusionInstructPix2PixPipeline](./pix2pix) | **Experimental** *Text-Based Image Editing * | | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/spaces/timbrooks/instruct-pix2pix)
| [StableDiffusionAttendAndExcitePipeline](./attend_and_excite) | **Experimental** *Text-to-Image Generation * | | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite)
| [StableDiffusionPix2PixZeroPipeline](./pix2pix_zero) | **Experimental** *Text-Based Image Editing * | | [Zero-shot Image-to-Image Translation](https://arxiv.org/abs/2302.03027)
| [StableDiffusionModelEditingPipeline](./model_editing) | **Experimental** *Text-to-Image Model Editing * | | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://arxiv.org/abs/2303.08084)
| [StableDiffusionUpscalePipeline](./upscale) | **Experimental** *Text-Guided Image Super-Resolution* | |
| [StableDiffusionLatentUpscalePipeline](./latent_upscale) | **Experimental** *Text-Guided Image Super-Resolution* | |
| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Image Inpainting* |
| [Stable Diffusion 2](./stable_diffusion_2) | *Depth-to-Image Text-Guided Generation* |
| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Super Resolution Image-to-Image* |
| [StableDiffusionLDM3DPipeline](./ldm3d) | *Text-to-(RGB, Depth)* |
## Tips

View File

@@ -1,64 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Self-Attention Guidance (SAG)
## Overview
[Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.
The abstract of the paper is the following:
*Denoising diffusion models (DDMs) have been drawing much attention for their appreciable sample quality and diversity. Despite their remarkable performance, DDMs remain black boxes on which further study is necessary to take a profound step. Motivated by this, we delve into the design of conventional U-shaped diffusion models. More specifically, we investigate the self-attention modules within these models through carefully designed experiments and explore their characteristics. In addition, inspired by the studies that substantiate the effectiveness of the guidance schemes, we present plug-and-play diffusion guidance, namely Self-Attention Guidance (SAG), that can drastically boost the performance of existing diffusion models. Our method, SAG, extracts the intermediate attention map from a diffusion model at every iteration and selects tokens above a certain attention score for masking and blurring to obtain a partially blurred input. Subsequently, we measure the dissimilarity between the predicted noises obtained from feeding the blurred and original input to the diffusion model and leverage it as guidance. With this guidance, we observe apparent improvements in a wide range of diffusion models, e.g., ADM, IDDPM, and Stable Diffusion, and show that the results further improve by combining our method with the conventional guidance scheme. We provide extensive ablation studies to verify our choices.*
Resources:
* [Project Page](https://ku-cvlab.github.io/Self-Attention-Guidance).
* [Paper](https://arxiv.org/abs/2210.00939).
* [Original Code](https://github.com/KU-CVLAB/Self-Attention-Guidance).
* [Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [Colab](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb) |
## Usage example
```python
import torch
from diffusers import StableDiffusionSAGPipeline
from accelerate.utils import set_seed
pipe = StableDiffusionSAGPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
seed = 8978
prompt = "."
guidance_scale = 7.5
num_images_per_prompt = 1
sag_scale = 1.0
set_seed(seed)
images = pipe(
prompt, num_images_per_prompt=num_images_per_prompt, guidance_scale=guidance_scale, sag_scale=sag_scale
).images
images[0].save("example.png")
```
## StableDiffusionSAGPipeline
[[autodoc]] StableDiffusionSAGPipeline
- __call__
- all

View File

@@ -71,6 +71,64 @@ image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")
```
#### Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed":
The paper **[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891)**
claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion.
The abstract reads as follows:
*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR),
and some implementations of diffusion samplers do not start from the last timestep.
Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference.
We show that the flawed design causes real problems in existing implementations.
In Stable Diffusion, it severely limits the model to only generate images with medium brightness and
prevents it from generating very bright and dark samples. We propose a few simple fixes:
- (1) rescale the noise schedule to enforce zero terminal SNR;
- (2) train the model with v prediction;
- (3) change the sampler to always start from the last timestep;
- (4) rescale classifier-free guidance to prevent over-exposure.
These simple changes ensure the diffusion process is congruent between training and inference and
allow the model to generate samples more faithful to the original data distribution.*
You can apply all of these changes in `diffusers` when using [`DDIMScheduler`]:
- (1) rescale the noise schedule to enforce zero terminal SNR;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
```
- (2) train the model with v prediction;
Continue fine-tuning a checkpoint with [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)
and `--prediction_type="v_prediction"`.
- (3) change the sampler to always start from the last timestep;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
```
- (4) rescale classifier-free guidance to prevent over-exposure.
```py
pipe(..., guidance_rescale=0.7)
```
An example is to use [this checkpoint](https://huggingface.co/ptx0/pseudo-journey-v2),
which has been fine-tuned with `"v_prediction"`.
The checkpoint can then be run in inference as follows:
```py
import torch
from diffusers import DiffusionPipeline, DDIMScheduler

pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
)
pipe.to("cuda")
prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipe(prompt, guidance_rescale=0.7).images[0]
```
## DDIMScheduler
[[autodoc]] DDIMScheduler
### Image Inpainting
- *Image Inpainting (512x512 resolution)*: [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) with [`StableDiffusionInpaintPipeline`]

View File

@@ -28,11 +28,11 @@ The abstract of the paper is the following:
## Tips
- Safe Stable Diffusion may also be used with weights of [Stable Diffusion](./api/pipelines/stable_diffusion/text2img).
- Safe Stable Diffusion may also be used with weights of [Stable Diffusion](./stable_diffusion/text2img).
### Run Safe Stable Diffusion
Safe Stable Diffusion can be tested very easily with the [`StableDiffusionPipelineSafe`], and the `"AIML-TUDA/stable-diffusion-safe"` checkpoint exactly in the same way it is shown in the [Conditional Image Generation Guide](./using-diffusers/conditional_image_generation).
Safe Stable Diffusion can be tested very easily with the [`StableDiffusionPipelineSafe`] and the `"AIML-TUDA/stable-diffusion-safe"` checkpoint, in exactly the same way as shown in the [Conditional Image Generation Guide](../../using-diffusers/conditional_image_generation).
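For instance, a minimal sketch of what that looks like (the prompt is just an illustrative placeholder):
```python
import torch
from diffusers import StableDiffusionPipelineSafe

pipe = StableDiffusionPipelineSafe.from_pretrained(
    "AIML-TUDA/stable-diffusion-safe", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generation works like the regular text-to-image pipeline; safe latent diffusion
# steers sampling away from the configured safety concept.
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.save("astronaut_safe.png")
```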
### Interacting with the Safety Concept

View File

@@ -39,6 +39,10 @@ Available Checkpoints are:
- disable_xformers_memory_efficient_attention
- enable_vae_tiling
- disable_vae_tiling
- load_textual_inversion
- from_ckpt
- load_lora_weights
- save_lora_weights
[[autodoc]] FlaxStableDiffusionPipeline
- all

View File

@@ -32,12 +32,50 @@ we do not add any additional noise to the image embeddings i.e. `noise_level = 0
* [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip)
* [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
* Text-to-image
* Coming soon!
* [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
### Text-to-Image Generation
Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha)
Coming soon!
```python
import torch
from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline
from diffusers.models import PriorTransformer
from transformers import CLIPTokenizer, CLIPTextModelWithProjection
prior_model_id = "kakaobrain/karlo-v1-alpha"
data_type = torch.float16
prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type)
prior_text_model_id = "openai/clip-vit-large-patch14"
prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id)
prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type)
prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler")
prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config)
stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small"
pipe = StableUnCLIPPipeline.from_pretrained(
stable_unclip_model_id,
torch_dtype=data_type,
variant="fp16",
prior_tokenizer=prior_tokenizer,
prior_text_encoder=prior_text_model,
prior=prior,
prior_scheduler=prior_scheduler,
)
pipe = pipe.to("cuda")
wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"
images = pipe(prompt=wave_prompt).images
images[0].save("waves.png")
```
<Tip warning={true}>
For text-to-image, we use `stabilityai/stable-diffusion-2-1-unclip-small`, as it was trained on CLIP ViT-L/14 embeddings, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend using it here.
</Tip>
### Text guided Image-to-Image Variation

View File

@@ -37,9 +37,12 @@ Resources:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [TextToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [🤗 Spaces](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis)
| [VideoToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py) | *Text-Guided Video-to-Video Generation* | [(TODO)🤗 Spaces]()
## Usage example
### `text-to-video-ms-1.7b`
Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):
```python
@@ -119,12 +122,98 @@ Here are some sample outputs:
</tr>
</table>
### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`
The Zeroscope checkpoints are watermark-free models that have been trained on specific sizes such as `576x320` and `1024x576`.
One should first generate a video using the lower resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) with [`TextToVideoSDPipeline`],
which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).
```py
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()
prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames
video_path = export_to_video(video_frames)
video_path
```
Now the video can be upscaled:
```py
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()
video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]
video_frames = pipe(prompt, video=video, strength=0.6).frames
video_path = export_to_video(video_frames)
video_path
```
Here are some sample outputs:
<table>
<tr>
<td ><center>
Darth vader surfing in waves.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/darthvader_cerpense.gif"
alt="Darth vader surfing in waves."
style="width: 576px;" />
</center></td>
</tr>
</table>
### Memory optimizations
Text-guided video generation with [`~TextToVideoSDPipeline`] and [`~VideoToVideoSDPipeline`] is very memory intensive, both
when denoising with [`~UNet3DConditionModel`] and when decoding with [`~AutoencoderKL`]. It is possible, though, to reduce
memory usage at the cost of increased runtime while achieving the exact same result. To do so, it is recommended to enable
**forward chunking** and **VAE slicing**:
Forward chunking via [`~UNet3DConditionModel.enable_forward_chunking`] is explained in [this blog post](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers) and
significantly reduces the memory required by the UNet. You can chunk the feed-forward layers over the `num_frames`
dimension as follows:
```py
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
```
VAE slicing via [`~TextToVideoSDPipeline.enable_vae_slicing`] and [`~VideoToVideoSDPipeline.enable_vae_slicing`] also
gives significant memory savings, since the two pipelines otherwise decode all image frames at once.
```py
pipe.enable_vae_slicing()
```
## Available checkpoints
* [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/)
* [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy)
* [cerspense/zeroscope_v2_576w](https://huggingface.co/cerspense/zeroscope_v2_576w)
* [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL)
## TextToVideoSDPipeline
[[autodoc]] TextToVideoSDPipeline
- all
- __call__
## VideoToVideoSDPipeline
[[autodoc]] VideoToVideoSDPipeline
- all
- __call__

View File

@@ -0,0 +1,275 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Zero-Shot Text-to-Video Generation
## Overview
[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) by
Levon Khachatryan,
Andranik Movsisyan,
Vahram Tadevosyan,
Roberto Henschel,
[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).
Our method Text2Video-Zero enables zero-shot video generation using either
1. A textual prompt, or
2. A prompt combined with guidance from poses or edges, or
3. Video Instruct-Pix2Pix, i.e., instruction-guided video editing.
Results are temporally consistent and closely follow the guidance and textual prompts.
![teaser-img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/t2v_zero_teaser.png)
The abstract of the paper is the following:
*Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain.
Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object.
Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.
As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.*
Resources:
* [Project Page](https://text2video-zero.github.io/)
* [Paper](https://arxiv.org/abs/2303.13439)
* [Original Code](https://github.com/Picsart-AI-Research/Text2Video-Zero)
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [TextToVideoZeroPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py) | *Zero-shot Text-to-Video Generation* | [🤗 Space](https://huggingface.co/spaces/PAIR/Text2Video-Zero)
## Usage example
### Text-To-Video
To generate a video from a prompt, run the following Python code:
```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline
model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("video.mp4", result, fps=4)
```
You can change these parameters in the pipeline call (see the example call after this list):
* Motion field strength (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1):
* `motion_field_strength_x` and `motion_field_strength_y`. Default: `motion_field_strength_x=12`, `motion_field_strength_y=12`
* `T` and `T'` (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1)
* `t0` and `t1` in the range `{0, ..., num_inference_steps}`. Default: `t0=45`, `t1=48`
* Video length:
* `video_length`, the number of frames to be generated. Default: `video_length=8`
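For illustration, a hypothetical call combining these parameters with their default values (`pipe` and `prompt` as defined in the snippet above):
```python
# Each keyword argument below corresponds to a parameter listed above.
result = pipe(
    prompt=prompt,
    video_length=8,               # number of generated frames
    motion_field_strength_x=12,   # motion field strength along x
    motion_field_strength_y=12,   # motion field strength along y
    t0=45,                        # see Sect. 3.3.1 of the paper
    t1=48,
).images
```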
We can also generate longer videos by doing the processing in a chunk-by-chunk manner:
```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline
import numpy as np
model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
seed = 0
video_length = 8
chunk_size = 4
prompt = "A panda is playing guitar on times square"
# Generate the video chunk-by-chunk
result = []
chunk_ids = np.arange(0, video_length, chunk_size - 1)
generator = torch.Generator(device="cuda")
for i in range(len(chunk_ids)):
print(f"Processing chunk {i + 1} / {len(chunk_ids)}")
ch_start = chunk_ids[i]
ch_end = video_length if i == len(chunk_ids) - 1 else chunk_ids[i + 1]
# Attach the first frame for Cross Frame Attention
frame_ids = [0] + list(range(ch_start, ch_end))
# Fix the seed for the temporal consistency
generator.manual_seed(seed)
output = pipe(prompt=prompt, video_length=len(frame_ids), generator=generator, frame_ids=frame_ids)
result.append(output.images[1:])
# Concatenate chunks and save
result = np.concatenate(result)
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("video.mp4", result, fps=4)
```
### Text-To-Video with Pose Control
To generate a video from a prompt with additional pose control:
1. Download a demo video
```python
from huggingface_hub import hf_hub_download
filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4"
repo_id = "PAIR/Text2Video-Zero"
video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
```
2. Read video containing extracted pose images
```python
from PIL import Image
import imageio
reader = imageio.get_reader(video_path, "ffmpeg")
frame_count = 8
pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
```
To extract poses from an actual video, read the [ControlNet documentation](./stable_diffusion/controlnet).
3. Run `StableDiffusionControlNetPipeline` with our custom attention processor
```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
model_id = "runwayml/stable-diffusion-v1-5"
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
model_id, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
# Set the attention processor
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
# fix latents for all frames
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
prompt = "Darth Vader dancing in a desert"
result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
imageio.mimsave("video.mp4", result, fps=4)
```
### Text-To-Video with Edge Control
To generate a video from a prompt with additional edge control,
follow the steps described above for pose-guided generation using the [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).
### Video Instruct-Pix2Pix
To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/pix2pix)):
1. Download a demo video
```python
from huggingface_hub import hf_hub_download
filename = "__assets__/pix2pix video/camel.mp4"
repo_id = "PAIR/Text2Video-Zero"
video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
```
2. Read video from path
```python
from PIL import Image
import imageio
reader = imageio.get_reader(video_path, "ffmpeg")
frame_count = 8
video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
```
3. Run `StableDiffusionInstructPix2PixPipeline` with our custom attention processor
```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
model_id = "timbrooks/instruct-pix2pix"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3))
prompt = "make it Van Gogh Starry Night style"
result = pipe(prompt=[prompt] * len(video), image=video).images
imageio.mimsave("edited_video.mp4", result, fps=4)
```
### DreamBooth specialization
Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control**
can run with custom [DreamBooth](../training/dreambooth) models, as shown below for the
[Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and the
[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model:
1. Download a demo video
```python
from huggingface_hub import hf_hub_download
filename = "__assets__/canny_videos_mp4/girl_turning.mp4"
repo_id = "PAIR/Text2Video-Zero"
video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
```
2. Read video from path
```python
from PIL import Image
import imageio
reader = imageio.get_reader(video_path, "ffmpeg")
frame_count = 8
canny_edges = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
```
3. Run `StableDiffusionControlNetPipeline` with custom trained DreamBooth model
```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
# set model id to custom model
model_id = "PAIR/text2video-zero-controlnet-canny-avatar"
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
model_id, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
# Set the attention processor
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
# fix latents for all frames
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(canny_edges), 1, 1, 1)
prompt = "oil painting of a beautiful girl avatar style"
result = pipe(prompt=[prompt] * len(canny_edges), image=canny_edges, latents=latents).images
imageio.mimsave("video.mp4", result, fps=4)
```
You can filter for available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth).
## TextToVideoZeroPipeline
[[autodoc]] TextToVideoZeroPipeline
- all
- __call__

View File

@@ -0,0 +1,204 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# UniDiffuser
The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://arxiv.org/abs/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.
The abstract of the [paper](https://arxiv.org/abs/2303.06555) is the following:
*This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).*
Resources:
* [Paper](https://arxiv.org/abs/2303.06555).
* [Original Code](https://github.com/thu-ml/unidiffuser).
Available Checkpoints are:
- *UniDiffuser-v0 (512x512 resolution)* [thu-ml/unidiffuser-v0](https://huggingface.co/thu-ml/unidiffuser-v0)
- *UniDiffuser-v1 (512x512 resolution)* [thu-ml/unidiffuser-v1](https://huggingface.co/thu-ml/unidiffuser-v1)
This pipeline was contributed by our community member [dg845](https://github.com/dg845).
## Available Pipelines:
| Pipeline | Tasks | Demo | Colab |
|:---:|:---:|:---:|:---:|
| [UniDiffuserPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_unidiffuser.py) | *Joint Image-Text Gen*, *Text-to-Image*, *Image-to-Text*,<br> *Image Gen*, *Text Gen*, *Image Variation*, *Text Variation* | [🤗 Spaces](https://huggingface.co/spaces/thu-ml/unidiffuser) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/unidiffuser.ipynb) |
## Usage Examples
Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks.
### Unconditional Image and Text Generation
Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a [`UniDiffuserPipeline`] will produce an (image, text) pair:
```python
import torch
from diffusers import UniDiffuserPipeline
device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)
```
This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.
Note that the generation task is inferred from the inputs used when calling the pipeline.
It is also possible to specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]:
```python
# Equivalent to the above.
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
```
When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode.
You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode.
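For instance, a short sketch of that pattern:
```python
# Pin the joint mode, generate, then return to automatic mode inference.
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)  # always joint generation while pinned

pipe.reset_mode()  # subsequent calls infer the mode from the call arguments again
```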
You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation since we sample from the marginal distribution of images and text, respectively):
```python
# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance
# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]
# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]
```
### Text-to-Image Generation
UniDiffuser is also capable of sampling from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image.
Here is an example of sampling from the conditional image distribution (text-to-image generation or text-conditioned image generation):
```python
import torch
from diffusers import UniDiffuserPipeline
device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Text-to-image generation
prompt = "an elephant under the sea"
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")
```
The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`].
### Image-to-Text Generation
Similarly, UniDiffuser can also produce text samples given an image (image-to-text or image-conditioned text generation):
```python
import torch
from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image
device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)
```
The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuserPipeline.set_image_to_text_mode`].
### Image Variation
The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the output of the first generation.
This produces a new image which is semantically similar to the input image:
```python
import torch
from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image
device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)
# 2. Text-to-image generation
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")
```
### Text Variation
Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation:
```python
import torch
from diffusers import UniDiffuserPipeline
device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")
# 2. Image-to-text generation
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)
```
## UniDiffuserPipeline
[[autodoc]] UniDiffuserPipeline
- all
- __call__

View File

@@ -20,7 +20,7 @@ The abstract of the paper is the following:
## Tips
- VersatileDiffusion is conceptually very similar as [Stable Diffusion](./api/pipelines/stable_diffusion/overview), but instead of providing just a image data stream conditioned on text, VersatileDiffusion provides both a image and text data stream and can be conditioned on both text and image.
- VersatileDiffusion is conceptually very similar to [Stable Diffusion](./stable_diffusion/overview), but instead of providing just an image data stream conditioned on text, VersatileDiffusion provides both an image and a text data stream and can be conditioned on both text and image.
### *Run VersatileDiffusion*

View File

@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Denoising diffusion implicit models (DDIM)
# Denoising Diffusion Implicit Models (DDIM)
## Overview
@@ -18,10 +18,71 @@ specific language governing permissions and limitations under the License.
The abstract of the paper is the following:
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training,
yet they require simulating a Markov chain for many steps to produce a sample.
To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models
with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process.
We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from.
We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off
computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim).
For questions, feel free to contact the author on [tsong.me](https://tsong.me/).
### Experimental: "Common Diffusion Noise Schedules and Sample Steps are Flawed":
The paper **[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891)**
claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion.
The abstract reads as follows:
*We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR),
and some implementations of diffusion samplers do not start from the last timestep.
Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference.
We show that the flawed design causes real problems in existing implementations.
In Stable Diffusion, it severely limits the model to only generate images with medium brightness and
prevents it from generating very bright and dark samples. We propose a few simple fixes:
- (1) rescale the noise schedule to enforce zero terminal SNR;
- (2) train the model with v prediction;
- (3) change the sampler to always start from the last timestep;
- (4) rescale classifier-free guidance to prevent over-exposure.
These simple changes ensure the diffusion process is congruent between training and inference and
allow the model to generate samples more faithful to the original data distribution.*
You can apply all of these changes in `diffusers` when using [`DDIMScheduler`]:
- (1) rescale the noise schedule to enforce zero terminal SNR;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True)
```
- (2) train the model with v prediction;
Continue fine-tuning a checkpoint with [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [`train_text_to_image_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)
and `--prediction_type="v_prediction"`.
- (3) change the sampler to always start from the last timestep;
```py
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
```
- (4) rescale classifier-free guidance to prevent over-exposure.
```py
pipe(..., guidance_rescale=0.7)
```
An example is to use [this checkpoint](https://huggingface.co/ptx0/pseudo-journey-v2),
which has been fine-tuned with `"v_prediction"`.
The checkpoint can then be run in inference as follows:
```py
import torch
from diffusers import DiffusionPipeline, DDIMScheduler

pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
)
pipe.to("cuda")
prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipe(prompt, guidance_rescale=0.7).images[0]
```
## DDIMScheduler
[[autodoc]] DDIMScheduler

View File

@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Denoising diffusion probabilistic models (DDPM)
# Denoising Diffusion Probabilistic Models (DDPM)
## Overview

View File

@@ -0,0 +1,23 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DPM Stochastic Scheduler inspired by the Karras et al. paper
## Overview
Inspired by the stochastic sampler from [Karras et al.](https://arxiv.org/abs/2206.00364).
The scheduler is ported from @crowsonkb's [k-diffusion](https://github.com/crowsonkb/k-diffusion) library.
All credit for making this scheduler work goes to [Katherine Crowson](https://github.com/crowsonkb/).
## DPMSolverSDEScheduler
[[autodoc]] DPMSolverSDEScheduler

View File

@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
## Overview
Ancestral sampling with Euler method steps. Based on the original (k-diffusion)[https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72] implementation by Katherine Crowson.
Ancestral sampling with Euler method steps. Based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) implementation by Katherine Crowson.
A fast scheduler which oftentimes generates good outputs with 20-30 steps.
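A minimal sketch of swapping this scheduler into a Stable Diffusion pipeline (the checkpoint id is an assumption, reused from other examples in these docs):
```python
import torch
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# 20-30 inference steps are often enough with this scheduler.
image = pipe("an astronaut riding a horse", num_inference_steps=25).images[0]
image.save("euler_ancestral.png")
```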
## EulerAncestralDiscreteScheduler

View File

@@ -0,0 +1,22 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Inverse Multistep DPM-Solver (DPMSolverMultistepInverse)
## Overview
This scheduler is the inverted scheduler of [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://arxiv.org/abs/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://arxiv.org/abs/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://arxiv.org/pdf/2211.09794.pdf) and the ad-hoc notebook implementation for DiffEdit latent inversion [here](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb).
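A minimal sketch of constructing the inverse scheduler alongside its forward counterpart from the same configuration, e.g. for latent-inversion workflows such as DiffEdit (the checkpoint id is an assumption):
```python
import torch
from diffusers import (
    DiffusionPipeline,
    DPMSolverMultistepScheduler,
    DPMSolverMultistepInverseScheduler,
)

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

# Forward (denoising) scheduler and its inverse, built from the same config.
scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
inverse_scheduler = DPMSolverMultistepInverseScheduler.from_config(pipe.scheduler.config)
```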
## DPMSolverMultistepInverseScheduler
[[autodoc]] DPMSolverMultistepInverseScheduler

View File

@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# variance exploding stochastic differential equation (VE-SDE) scheduler
# Variance Exploding Stochastic Differential Equation (VE-SDE) scheduler
## Overview

View File

@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Variance preserving stochastic differential equation (VP-SDE) scheduler
# Variance Preserving Stochastic Differential Equation (VP-SDE) scheduler
## Overview

View File

@@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License.
UniPC is a training-free framework designed for the fast sampling of diffusion models, which consists of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders.
For more details about the method, please refer to the [[paper]](https://arxiv.org/abs/2302.04867) and the [[code]](https://github.com/wl-zhao/UniPC).
For more details about the method, please refer to the [paper](https://arxiv.org/abs/2302.04867) and the [code](https://github.com/wl-zhao/UniPC).
Fast Sampling of Diffusion Models with Exponential Integrator.
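A minimal usage sketch of plugging UniPC into a Stable Diffusion pipeline (the checkpoint id is an assumption, reused from other examples in these docs):
```python
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# UniPC targets fast sampling, so relatively few steps are typically needed.
image = pipe("a photo of a corgi wearing sunglasses", num_inference_steps=20).images[0]
image.save("unipc.png")
```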

View File

@@ -0,0 +1,23 @@
# Utilities
Utility and helper functions for working with 🤗 Diffusers.
## randn_tensor
[[autodoc]] diffusers.utils.randn_tensor
## numpy_to_pil
[[autodoc]] utils.pil_utils.numpy_to_pil
## pt_to_pil
[[autodoc]] utils.pil_utils.pt_to_pil
## load_image
[[autodoc]] utils.testing_utils.load_image
## export_to_video
[[autodoc]] utils.testing_utils.export_to_video

View File

@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
We ❤️ contributions from the open-source community! Everyone is welcome, and all types of participation, not just code, are valued and appreciated. Answering questions, helping others, reaching out, and improving the documentation are all immensely valuable to the community, so don't be afraid and get involved if you're up for it!
Everyone is encouraged to start by saying 👋 in our public Discord channel. We discuss the latest trends in diffusion models, ask questions, show off personal projects, help each other with contributions, or just hang out ☕. <a href="https://Discord.gg/G7tWnz98XR"><img alt="Join us on Discord" src="https://img.shields.io/Discord/823813159592001537?color=5865F2&logo=Discord&logoColor=white"></a>
Everyone is encouraged to start by saying 👋 in our public Discord channel. We discuss the latest trends in diffusion models, ask questions, show off personal projects, help each other with contributions, or just hang out ☕. <a href="https://Discord.gg/G7tWnz98XR"><img alt="Join us on Discord" src="https://img.shields.io/discord/823813159592001537?color=5865F2&logo=discord&logoColor=white"></a>
Whichever way you choose to contribute, we strive to be part of an open, welcoming, and kind community. Please, read our [code of conduct](https://github.com/huggingface/diffusers/blob/main/CODE_OF_CONDUCT.md) and be mindful to respect it during your interactions. We also recommend you become familiar with the [ethical guidelines](https://huggingface.co/docs/diffusers/conceptual/ethical_guidelines) that guide our project and ask you to adhere to the same principles of transparency and responsibility.
@@ -170,7 +170,7 @@ please have a look at the next sections.
For all of the following contributions, you will need to open a PR. How to do so is explained in detail in the [Opening a pull request](#how-to-open-a-pr) section.
### 4. Fixing a "Good first issue"
### 4. Fixing a `Good first issue`
*Good first issues* are marked by the [Good first issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. Usually, the issue already
explains how a potential solution should look so that it is easier to fix.
@@ -275,7 +275,7 @@ Once an example script works, please make sure to add a comprehensive `README.md
If you are contributing to the official training examples, please also make sure to add a test to [examples/test_examples.py](https://github.com/huggingface/diffusers/blob/main/examples/test_examples.py). This is not necessary for non-official training examples.
### 8. Fixing a "Good second issue"
### 8. Fixing a `Good second issue`
*Good second issues* are marked by the [Good second issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) label. Good second issues are
usually more complicated to solve than [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).


@@ -37,7 +37,8 @@ We cover Diffusion models with the following pipelines:
## Qualitative Evaluation
Qualitative evaluation typically involves human assessment of generated images. Quality is measured across aspects such as compositionality, image-text alignment, and spatial relations. Common prompts provide a degree of uniformity for subjective metrics. DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking. DrawBench and PartiPrompts were introduced by [Imagen](https://imagen.research.google/) and [Parti](https://parti.research.google/) respectively.
Qualitative evaluation typically involves human assessment of generated images. Quality is measured across aspects such as compositionality, image-text alignment, and spatial relations. Common prompts provide a degree of uniformity for subjective metrics.
DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking. DrawBench and PartiPrompts were introduced by [Imagen](https://imagen.research.google/) and [Parti](https://parti.research.google/) respectively.
From the [official Parti website](https://parti.research.google/):
@@ -51,7 +52,13 @@ PartiPrompts has the following columns:
- Category of the prompt (such as “Abstract”, “World Knowledge”, etc.)
- Challenge reflecting the difficulty (such as “Basic”, “Complex”, “Writing & Symbols”, etc.)
These benchmarks allow for side-by-side human evaluation of different image generation models. Lets see how we can use `diffusers` on a couple of PartiPrompts.
These benchmarks allow for side-by-side human evaluation of different image generation models.
For this, the 🧨 Diffusers team has built **Open Parti Prompts**, which is a community-driven qualitative benchmark based on Parti Prompts to compare state-of-the-art open-source diffusion models:
- [Open Parti Prompts Game](https://huggingface.co/spaces/OpenGenAI/open-parti-prompts): For 10 parti prompts, 4 generated images are shown and the user selects the image that suits the prompt best.
- [Open Parti Prompts Leaderboard](https://huggingface.co/spaces/OpenGenAI/parti-prompts-leaderboard): The leaderboard comparing the currently best open-sourced diffusion models to each other.
To manually compare images, let's see how we can use `diffusers` on a couple of PartiPrompts.
Below we show some prompts sampled across different challenges: Basic, Complex, Linguistic Structures, Imagination, and Writing & Symbols. Here we are using PartiPrompts as a [dataset](https://huggingface.co/datasets/nateraw/parti-prompts).
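As a rough sketch of what such sampling can look like (the column names follow the dataset card and are an assumption here):
```python
# Sample a handful of Parti Prompts across challenges from the Hub dataset.
from datasets import load_dataset

parti_prompts = load_dataset("nateraw/parti-prompts", split="train")
sample = parti_prompts.shuffle(seed=0).select(range(4))
for row in sample:
    print(f'{row["Challenge"]:<20} | {row["Prompt"]}')
```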


@@ -53,11 +53,14 @@ The library has three main components:
|---|---|:---:|
| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation |
| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation |
| [controlnet](./api/pipelines/stable_diffusion/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation |
| [controlnet](./api/pipelines/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation |
| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation |
| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation |
| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation |
| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
@@ -73,7 +76,7 @@ The library has three main components:
| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing|
| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation, Unconditional Image Generation |
| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing |
@@ -91,3 +94,4 @@ The library has three main components:
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation |


@@ -12,9 +12,9 @@ specific language governing permissions and limitations under the License.
# Installation
Install 🤗 Diffusers for whichever deep learning library youre working with.
Install 🤗 Diffusers for whichever deep learning library you're working with.
🤗 Diffusers is tested on Python 3.7+, PyTorch 1.7.0+ and flax. Follow the installation instructions below for the deep learning library you are using:
🤗 Diffusers is tested on Python 3.7+, PyTorch 1.7.0+ and Flax. Follow the installation instructions below for the deep learning library you are using:
- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions.
- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions.
@@ -23,7 +23,7 @@ Install 🤗 Diffusers for whichever deep learning library youre working with
You should install 🤗 Diffusers in a [virtual environment](https://docs.python.org/3/library/venv.html).
If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
A virtual environment makes it easier to manage different projects, and avoid compatibility issues between dependencies.
A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies.
Start by creating a virtual environment in your project directory:
@@ -37,27 +37,28 @@ Activate the virtual environment:
source .env/bin/activate
```
Now you're ready to install 🤗 Diffusers with the following command:
**For PyTorch**
🤗 Diffusers also relies on the 🤗 Transformers library, and you can install both with the following command:
<frameworkcontent>
<pt>
```bash
pip install diffusers["torch"]
pip install diffusers["torch"] transformers
```
**For Flax**
</pt>
<jax>
```bash
pip install diffusers["flax"]
pip install diffusers["flax"] transformers
```
</jax>
</frameworkcontent>
## Install from source
Before intsalling `diffusers` from source, make sure you have `torch` and `accelerate` installed.
Before installing 🤗 Diffusers from source, make sure you have `torch` and 🤗 Accelerate installed.
For `torch` installation refer to the `torch` [docs](https://pytorch.org/get-started/locally/#start-locally).
For `torch` installation, refer to the `torch` [installation](https://pytorch.org/get-started/locally/#start-locally) guide.
To install `accelerate`
To install 🤗 Accelerate:
```bash
pip install accelerate
@@ -74,7 +75,7 @@ The `main` version is useful for staying up-to-date with the latest developments
For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet.
However, this means the `main` version may not always be stable.
We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day.
If you run into a problem, please open an [Issue](https://github.com/huggingface/transformers/issues), so we can fix it even sooner!
If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose), so we can fix it even sooner!
## Editable install
@@ -90,21 +91,22 @@ git clone https://github.com/huggingface/diffusers.git
cd diffusers
```
**For PyTorch**
```
<frameworkcontent>
<pt>
```bash
pip install -e ".[torch]"
```
**For Flax**
```
</pt>
<jax>
```bash
pip install -e ".[flax]"
```
</jax>
</frameworkcontent>
These commands will link the folder you cloned the repository to with your Python library paths.
Python will now look inside the folder you cloned to in addition to the normal library paths.
For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.7/site-packages/`, Python will also search the folder you cloned to: `~/diffusers/`.
For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.7/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to.
<Tip warning={true}>
@@ -125,7 +127,7 @@ Your Python environment will find the `main` version of 🤗 Diffusers on the ne
Our library gathers telemetry information during `from_pretrained()` requests.
This data includes the version of Diffusers and PyTorch/Flax, the requested model or pipeline class,
and the path to a pretrained checkpoint if it is hosted on the Hub.
and the path to a pre-trained checkpoint if it is hosted on the Hub.
This usage data helps us debug issues and prioritize new features.
Telemetry is only sent when loading models and pipelines from the HuggingFace Hub,
and is not collected during local usage.


@@ -0,0 +1,167 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# How to run Stable Diffusion with Core ML
[Core ML](https://developer.apple.com/documentation/coreml) is the model format and machine learning library supported by Apple frameworks. If you are interested in running Stable Diffusion models inside your macOS or iOS/iPadOS apps, this guide will show you how to convert existing PyTorch checkpoints into the Core ML format and use them for inference with Python or Swift.
Core ML models can leverage all the compute engines available in Apple devices: the CPU, the GPU, and the Apple Neural Engine (or ANE, a tensor-optimized accelerator available in Apple Silicon Macs and modern iPhones/iPads). Depending on the model and the device it's running on, Core ML can mix and match compute engines too, so some portions of the model may run on the CPU while others run on GPU, for example.
<Tip>
You can also run the `diffusers` Python codebase on Apple Silicon Macs using the `mps` accelerator built into PyTorch. This approach is explained in depth in [the mps guide](mps), but it is not compatible with native apps.
</Tip>
## Stable Diffusion Core ML Checkpoints
Stable Diffusion weights (or checkpoints) are stored in the PyTorch format, so you need to convert them to the Core ML format before you can use them inside native apps.
Thankfully, Apple engineers developed [a conversion tool](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) based on `diffusers` to convert the PyTorch checkpoints to Core ML.
Before you convert a model, though, take a moment to explore the Hugging Face Hub; chances are the model you're interested in is already available in Core ML format:
- the [Apple](https://huggingface.co/apple) organization includes Stable Diffusion versions 1.4, 1.5, 2.0 base, and 2.1 base
- the [coreml](https://huggingface.co/coreml) organization includes custom DreamBoothed and fine-tuned models
- use this [filter](https://huggingface.co/models?pipeline_tag=text-to-image&library=coreml&p=2&sort=likes) to return all available Core ML checkpoints
If you can't find the model you're interested in, we recommend you follow the instructions for [Converting Models to Core ML](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) by Apple.
## Selecting the Core ML Variant to Use
Stable Diffusion models can be converted to different Core ML variants intended for different purposes:
- The type of attention blocks used. The attention operation is used to "pay attention" to the relationship between different areas in the image representations and to understand how the image and text representations are related. Attention is compute- and memory-intensive, so different implementations exist that consider the hardware characteristics of different devices. For Core ML Stable Diffusion models, there are two attention variants:
* `split_einsum` ([introduced by Apple](https://machinelearning.apple.com/research/neural-engine-transformers)) is optimized for ANE devices, which are available in modern iPhones, iPads and M-series computers.
* The "original" attention (the base implementation used in `diffusers`) is only compatible with CPU/GPU and not ANE. It can be *faster* to run your model with `original` attention on CPU + GPU than on the ANE. See [this performance benchmark](https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks) as well as some [additional measures provided by the community](https://github.com/huggingface/swift-coreml-diffusers/issues/31) for more details.
- The supported inference framework.
* `packages` are suitable for Python inference. This can be used to test converted Core ML models before attempting to integrate them inside native apps, or if you want to explore Core ML performance but don't need to support native apps. For example, an application with a web UI could perfectly well use a Python Core ML backend.
* `compiled` models are required for Swift code. The `compiled` models in the Hub split the large UNet model weights into several files for compatibility with iOS and iPadOS devices. This corresponds to the [`--chunk-unet` conversion option](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml). If you want to support native apps, then you need to select the `compiled` variant.
The official Core ML Stable Diffusion [models](https://huggingface.co/apple/coreml-stable-diffusion-v1-4/tree/main) include these variants, but the community ones may vary:
```
coreml-stable-diffusion-v1-4
├── README.md
├── original
│ ├── compiled
│ └── packages
└── split_einsum
├── compiled
└── packages
```
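If you want to check programmatically which variants a given repository ships (community repos may differ from the layout above), here is a small sketch using `huggingface_hub`; the repo id is just an example and the sketch is not part of the original guide:
```python
from huggingface_hub import list_repo_files

files = list_repo_files("apple/coreml-stable-diffusion-v1-4")
# Keep the first two path components, e.g. "original/compiled".
variants = sorted({"/".join(f.split("/")[:2]) for f in files if "/" in f})
print(variants)  # e.g. ['original/compiled', 'original/packages', 'split_einsum/compiled', 'split_einsum/packages']
```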
You can download and use the variant you need as shown below.
## Core ML Inference in Python
Install the following libraries to run Core ML inference in Python:
```bash
pip install huggingface_hub
pip install git+https://github.com/apple/ml-stable-diffusion
```
### Download the Model Checkpoints
To run inference in Python, use one of the versions stored in the `packages` folders because the `compiled` ones are only compatible with Swift. You may choose whether you want to use `original` or `split_einsum` attention.
This is how you'd download the `original` attention variant from the Hub to a directory called `models`:
```Python
from huggingface_hub import snapshot_download
from pathlib import Path
repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"
model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
print(f"Model downloaded at {model_path}")
```
### Inference[[python-inference]]
Once you have downloaded a snapshot of the model, you can test it using Apple's Python script.
```shell
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i models/coreml-stable-diffusion-v1-4_original_packages -o </path/to/output/image> --compute-unit CPU_AND_GPU --seed 93
```
`-i` should point to the checkpoint you downloaded in the step above, and `--compute-unit` indicates the hardware you want to allow for inference. It must be one of the following options: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. You may also provide an optional output path with `-o`, and a seed for reproducibility.
The inference script assumes you're using the original version of the Stable Diffusion model, `CompVis/stable-diffusion-v1-4`. If you use another model, you *have* to specify its Hub id in the inference command line, using the `--model-version` option. This works for models already supported and custom models you trained or fine-tuned yourself.
For example, if you want to use [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5):
```shell
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" --compute-unit ALL -o output --seed 93 -i models/coreml-stable-diffusion-v1-5_original_packages --model-version runwayml/stable-diffusion-v1-5
```
## Core ML inference in Swift
Running inference in Swift is slightly faster than in Python because the models are already compiled in the `mlmodelc` format. This is noticeable on app startup when the model is loaded but shouldn't be noticeable if you run several generations afterward.
### Download
To run inference in Swift on your Mac, you need one of the `compiled` checkpoint versions. We recommend you download them locally using Python code similar to the previous example, but with one of the `compiled` variants:
```Python
from huggingface_hub import snapshot_download
from pathlib import Path
repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/compiled"
model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
print(f"Model downloaded at {model_path}")
```
### Inference[[swift-inference]]
To run inference, please clone Apple's repo:
```bash
git clone https://github.com/apple/ml-stable-diffusion
cd ml-stable-diffusion
```
And then use Apple's command line tool, [Swift Package Manager](https://www.swift.org/package-manager/#):
```bash
swift run StableDiffusionSample --resource-path models/coreml-stable-diffusion-v1-4_original_compiled --compute-units all "a photo of an astronaut riding a horse on mars"
```
You have to point `--resource-path` to one of the checkpoints downloaded in the previous step, so please make sure it contains compiled Core ML bundles with the extension `.mlmodelc`. The `--compute-units` option has to be one of these values: `all`, `cpuOnly`, `cpuAndGPU`, `cpuAndNeuralEngine`.
For more details, please refer to the [instructions in Apple's repo](https://github.com/apple/ml-stable-diffusion).
## Supported Diffusers Features
The Core ML models and inference code don't support many of the features, options, and flexibility of 🧨 Diffusers. These are some of the limitations to keep in mind:
- Core ML models are only suitable for inference. They can't be used for training or fine-tuning.
- Only two schedulers have been ported to Swift, the default one used by Stable Diffusion and `DPMSolverMultistepScheduler`, which we ported to Swift from our `diffusers` implementation. We recommend you use `DPMSolverMultistepScheduler`, since it produces the same quality in about half the steps.
- Negative prompts, classifier-free guidance scale, and image-to-image tasks are available in the inference code. Advanced features such as depth guidance, ControlNet, and latent upscalers are not available yet.
Apple's [conversion and inference repo](https://github.com/apple/ml-stable-diffusion) and our own [swift-coreml-diffusers](https://github.com/huggingface/swift-coreml-diffusers) repos are intended as technology demonstrators to enable other developers to build upon.
If you feel strongly about any missing features, please feel free to open a feature request or, better yet, a contribution PR :)
## Native Diffusers Swift app
One easy way to run Stable Diffusion on your own Apple hardware is to use [our open-source Swift repo](https://github.com/huggingface/swift-coreml-diffusers), based on `diffusers` and Apple's conversion and inference repo. You can study the code, compile it with [Xcode](https://developer.apple.com/xcode/) and adapt it for your own needs. For your convenience, there's also a [standalone Mac app in the App Store](https://apps.apple.com/app/diffusers/id1666309574), so you can play with it without having to deal with the code or IDE. If you are a developer and have determined that Core ML is the best solution to build your Stable Diffusion app, then you can use the rest of this guide to get started with your project. We can't wait to see what you'll build :)


@@ -50,7 +50,6 @@ from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
@@ -60,8 +59,10 @@ image = pipe(prompt).images[0]
```
<Tip warning={true}>
It is strongly discouraged to make use of [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than using pure
float16 precision.
</Tip>
## Sliced attention for additional memory savings
@@ -83,7 +84,6 @@ from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
@@ -110,7 +110,6 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
@@ -164,7 +163,6 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
@@ -189,7 +187,6 @@ from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
)
@@ -202,6 +199,8 @@ image = pipe(prompt).images[0]
**Note**: When using `enable_sequential_cpu_offload()`, it is important to **not** move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal. See [this issue](https://github.com/huggingface/diffusers/issues/1934) for more information.
**Note**: `enable_sequential_cpu_offload()` is a stateful operation that installs hooks on the models.
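Since the diff above only shows fragments of the surrounding code block, here is a minimal self-contained sketch of the pattern being described (checkpoint and prompt are just examples):
```python
# Sequential CPU offloading: do NOT move the pipeline to CUDA first (see the note above).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.enable_sequential_cpu_offload()

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```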
<a name="model_offloading"></a>
## Model offloading for fast inference and memory savings
@@ -251,6 +250,11 @@ image = pipe(prompt).images[0]
This feature requires `accelerate` version 0.17.0 or higher.
</Tip>
**Note**: `enable_model_cpu_offload()` is a stateful operation that installs hooks on the models and state on the pipeline. In order to properly offload
models after they are called, it is required that the entire pipeline is run and models are called in the order the pipeline expects them to be. Exercise caution
if models are re-used outside the context of the pipeline after hooks have been installed. See [accelerate](https://huggingface.co/docs/accelerate/v0.18.0/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module)
for further docs on removing hooks.
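Again, as a minimal sketch of the described usage (requires `accelerate` 0.17.0 or higher, as noted above; checkpoint and prompt are examples):
```python
# Model offloading: whole sub-models are moved to the GPU only when they are needed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```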
## Using Channels Last memory format
Channels last memory format is an alternative way of ordering NCHW tensors in memory that preserves the dimension ordering. Channels last tensors are ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel). Since not all operators currently support the channels last format, using it may result in worse performance, so it's better to try it and see if it works for your model.
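A minimal sketch of switching the UNet to channels-last and verifying the layout via the tensor strides (the checkpoint is an example):
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

print(pipe.unet.conv_out.state_dict()["weight"].stride())  # contiguous layout, e.g. (2880, 9, 3, 1)
pipe.unet.to(memory_format=torch.channels_last)  # in-place operation
print(pipe.unet.conv_out.state_dict()["weight"].stride())  # stride 1 in the channel dimension confirms channels-last
```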
@@ -401,6 +405,13 @@ Here are the speedups we obtain on a few Nvidia GPUs when running the inference
| A100-SXM-80GB | 18.7it/s | 29.5it/s |
To leverage it, just make sure you have the following (a short enabling sketch is shown after the list):
<Tip warning={true}>
If you have PyTorch 2.0 installed, you shouldn't use xFormers!
</Tip>
- PyTorch > 1.12
- CUDA available
- [Installed the xformers library](xformers).
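A minimal enabling sketch (assuming `xformers` is installed and you are on a pre-2.0 PyTorch; checkpoint and prompt are examples):
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```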


@@ -16,8 +16,8 @@ specific language governing permissions and limitations under the License.
## Requirements
- Optimum Habana 1.4 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it.
- SynapseAI 1.8.
- Optimum Habana 1.6 or later, [here](https://huggingface.co/docs/optimum/habana/installation) is how to install it.
- SynapseAI 1.10.
## Inference Pipeline
@@ -41,7 +41,7 @@ pipeline = GaudiStableDiffusionPipeline.from_pretrained(
scheduler=scheduler,
use_habana=True,
use_hpu_graphs=True,
gaudi_config="Habana/stable-diffusion",
gaudi_config="Habana/stable-diffusion-2",
)
```
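For readers who only see the fragment above, here is a hedged sketch of the full construction (class names follow the Optimum Habana documentation; the model id is an example):
```python
# Minimal Gaudi sketch: Gaudi scheduler + pipeline with HPU graphs and a Gaudi configuration.
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

model_name = "stabilityai/stable-diffusion-2-base"
scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion-2",
)
outputs = pipeline(prompt=["a photo of an astronaut riding a horse on mars"], num_images_per_prompt=1, batch_size=1)
image = outputs.images[0]
```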
@@ -62,9 +62,18 @@ For more information, check out Optimum Habana's [documentation](https://hugging
## Benchmark
Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) Gaudi configuration (mixed precision bf16/fp32):
Here are the latencies for Habana first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32):
- [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) (512x512 resolution):
| | Latency (batch size = 1) | Throughput (batch size = 8) |
| ---------------------- |:------------------------:|:---------------------------:|
| first-generation Gaudi | 4.29s | 0.283 images/s |
| Gaudi2 | 1.54s | 0.904 images/s |
| first-generation Gaudi | 3.80s | 0.308 images/s |
| Gaudi2 | 1.33s | 1.081 images/s |
- [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) (768x768 resolution):
| | Latency (batch size = 1) | Throughput |
| ---------------------- |:------------------------:|:-------------------------------:|
| first-generation Gaudi | 10.2s | 0.108 images/s (batch size = 4) |
| Gaudi2 | 3.17s | 0.379 images/s (batch size = 8) |


@@ -0,0 +1,116 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Token Merging
Token Merging (introduced in [Token Merging: Your ViT But Faster](https://arxiv.org/abs/2210.09461)) works by merging the redundant tokens / patches progressively in the forward pass of a Transformer-based network. It can speed up the inference latency of the underlying network.
After Token Merging (ToMe) was released, the authors released [Token Merging for Fast Stable Diffusion](https://arxiv.org/abs/2303.17604), which introduced a version of ToMe which is more compatible with Stable Diffusion. We can use ToMe to gracefully speed up the inference latency of a [`DiffusionPipeline`]. This doc discusses how to apply ToMe to the [`StableDiffusionPipeline`], the expected speedups, and the qualitative aspects of using ToMe on the [`StableDiffusionPipeline`].
## Using ToMe
The authors of ToMe released a convenient Python library called [`tomesd`](https://github.com/dbolya/tomesd) that lets us apply ToMe to a [`DiffusionPipeline`] like so:
```diff
import torch
from diffusers import StableDiffusionPipeline
import tomesd
pipeline = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
+ tomesd.apply_patch(pipeline, ratio=0.5)
image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```
And that's it!
`tomesd.apply_patch()` exposes [a number of arguments](https://github.com/dbolya/tomesd#usage) to let us strike a balance between the pipeline inference speed and the quality of the generated images. Amongst those arguments, the most important one is `ratio`, which controls the number of tokens that will be merged during the forward pass. For more details on `tomesd`, please refer to [the original repository](https://github.com/dbolya/tomesd) and [the paper](https://arxiv.org/abs/2303.17604).
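For example, here is a small sketch sweeping a few `ratio` values to eyeball the speed/quality trade-off (the values and file names are arbitrary):
```python
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for ratio in (0.3, 0.5, 0.75):
    tomesd.remove_patch(pipeline)              # undo any previously applied patch
    tomesd.apply_patch(pipeline, ratio=ratio)  # higher ratio = more merging = faster, lower fidelity
    image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
    image.save(f"astronaut_tome_{ratio}.png")
```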
## Benchmarking `tomesd` with `StableDiffusionPipeline`
We benchmarked the impact of using `tomesd` on [`StableDiffusionPipeline`] along with [xformers](https://huggingface.co/docs/diffusers/optimization/xformers) across different image resolutions. We used A100 and V100 as our test GPU devices with the following development environment:
```bash
- `diffusers` version: 0.15.1
- Python version: 3.8.16
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Huggingface_hub version: 0.13.2
- Transformers version: 4.27.2
- Accelerate version: 0.18.0
- xFormers version: 0.0.16
- tomesd version: 0.1.2
```
We used this script for benchmarking: [https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). Following are our findings:
### A100
| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe speedup (%) | ToMe + xFormers speedup (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 6.88 | 5.26 | 4.69 | 23.54651163 | 31.83139535 |
| | | | | | | |
| 768 | 10 | OOM | 14.71 | 11 | | |
| | 8 | OOM | 11.56 | 8.84 | | |
| | 4 | OOM | 5.98 | 4.66 | | |
| | 2 | 4.99 | 3.24 | 3.1 | 35.07014028 | 37.8757515 |
| | 1 | 3.29 | 2.24 | 2.03 | 31.91489362 | 38.29787234 |
| | | | | | | |
| 1024 | 10 | OOM | OOM | OOM | | |
| | 8 | OOM | OOM | OOM | | |
| | 4 | OOM | 12.51 | 9.09 | | |
| | 2 | OOM | 6.52 | 4.96 | | |
| | 1 | 6.4 | 3.61 | 2.81 | 43.59375 | 56.09375 |
***The timings reported here are in seconds. Speedups are calculated over the `Vanilla` timings.***
### V100
| Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers | ToMe speedup (%) | ToMe + xFormers speedup (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | OOM | 10.03 | 9.29 | | |
| | 8 | OOM | 8.05 | 7.47 | | |
| | 4 | 5.7 | 4.3 | 3.98 | 24.56140351 | 30.1754386 |
| | 2 | 3.14 | 2.43 | 2.27 | 22.61146497 | 27.70700637 |
| | 1 | 1.88 | 1.57 | 1.57 | 16.4893617 | 16.4893617 |
| | | | | | | |
| 768 | 10 | OOM | OOM | 23.67 | | |
| | 8 | OOM | OOM | 18.81 | | |
| | 4 | OOM | 11.81 | 9.7 | | |
| | 2 | OOM | 6.27 | 5.2 | | |
| | 1 | 5.43 | 3.38 | 2.82 | 37.75322284 | 48.06629834 |
| | | | | | | |
| 1024 | 10 | OOM | OOM | OOM | | |
| | 8 | OOM | OOM | OOM | | |
| | 4 | OOM | OOM | 19.35 | | |
| | 2 | OOM | 13 | 10.78 | | |
| | 1 | OOM | 6.66 | 5.54 | | |
As seen in the tables above, the speedup with `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it becomes possible to run the pipeline on a higher resolution, like 1024x1024.
It might be possible to speed up inference even further with [`torch.compile()`](https://huggingface.co/docs/diffusers/optimization/torch2.0).
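A hedged sketch of what that combination could look like (we have not benchmarked it here; whether the patched UNet compiles cleanly may depend on the `tomesd` and PyTorch versions):
```python
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
tomesd.apply_patch(pipeline, ratio=0.5)
pipeline.unet = torch.compile(pipeline.unet)  # requires PyTorch 2.0

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```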
## Quality
As reported in [the paper](https://arxiv.org/abs/2303.17604), ToMe can preserve the quality of the generated images to a great extent while speeding up inference. By increasing the `ratio`, it is possible to further speed up inference, but that might come at the cost of a deterioration in the image quality.
To test the quality of the generated samples using our setup, we sampled a few prompts from the “Parti Prompts” (introduced in [Parti](https://parti.research.google/)) and performed inference with the [`StableDiffusionPipeline`] in the following settings:
- Vanilla [`StableDiffusionPipeline`]
- [`StableDiffusionPipeline`] + ToMe
- [`StableDiffusionPipeline`] + ToMe + xformers
We didn't notice any significant decrease in the quality of the generated samples. Here are samples:
![tome-samples](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/tome/tome_samples.png)
You can check out the generated samples [here](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). We used [this script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd) for conducting this experiment.


@@ -12,19 +12,21 @@ specific language governing permissions and limitations under the License.
# Accelerated PyTorch 2.0 support in Diffusers
Starting from version `0.13.0`, Diffusers supports the latest optimization from the upcoming [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) release. These include:
1. Support for accelerated transformers implementation with memory-efficient attention no extra dependencies required.
Starting from version `0.13.0`, Diffusers supports the latest optimization from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). These include:
1. Support for an accelerated transformers implementation with memory-efficient attention; no extra dependencies (such as `xformers`) required.
2. [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) support for extra performance boost when individual models are compiled.
## Installation
To benefit from the accelerated attention implementation and `torch.compile`, you just need to install the latest versions of PyTorch 2.0 from `pip`, and make sure you are on diffusers 0.13.0 or later. As explained below, `diffusers` automatically uses the attention optimizations (but not `torch.compile`) when available.
To benefit from the accelerated attention implementation and `torch.compile()`, you just need to install the latest versions of PyTorch 2.0 from pip, and make sure you are on diffusers 0.13.0 or later. As explained below, diffusers automatically uses the optimized attention processor ([`AttnProcessor2_0`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L798)) (but not `torch.compile()`)
when PyTorch 2.0 is available.
```bash
pip install --upgrade torch torchvision diffusers
pip install --upgrade torch diffusers
```
## Using accelerated transformers and torch.compile.
## Using accelerated transformers and `torch.compile`.
1. **Accelerated Transformers implementation**
@@ -46,13 +48,13 @@ pip install --upgrade torch torchvision diffusers
If you want to enable it explicitly (which is not required), you can do so as shown below.
```Python
```diff
import torch
from diffusers import DiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0
+ from diffusers.models.attention_processor import AttnProcessor2_0
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.unet.set_attn_processor(AttnProcessor2_0())
+ pipe.unet.set_attn_processor(AttnProcessor2_0())
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
@@ -60,151 +62,383 @@ pip install --upgrade torch torchvision diffusers
This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark).
It is possible to revert to the vanilla attention processor ([`AttnProcessor`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L402)), which can be helpful to make the pipeline more deterministic, or if you need to convert a fine-tuned model to other formats such as [Core ML](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/coreml#how-to-run-stable-diffusion-with-core-ml). To use the normal attention processor you can use the [`~diffusers.UNet2DConditionModel.set_default_attn_processor`] function:
```Python
import torch
from diffusers import DiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.unet.set_default_attn_processor()
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```
2. **torch.compile**
To get an additional speedup, we can use the new `torch.compile` feature. To do so, we simply wrap our `unet` with `torch.compile`. For more information and different options, refer to the
To get an additional speedup, we can use the new `torch.compile` feature. Since the UNet of the pipeline is usually the most computationally expensive component, we wrap the `unet` with `torch.compile`, leaving the rest of the sub-models (text encoder and VAE) as is. For more information and different options, refer to the
[torch compile docs](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
```python
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.unet = torch.compile(pipe.unet)
batch_size = 10
prompt = "A photo of an astronaut riding a horse on mars."
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
images = pipe(prompt, num_inference_steps=50, num_images_per_prompt=batch_size).images
```
Depending on the type of GPU, `compile()` can yield between 2-9% of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
Depending on the type of GPU, `compile()` can yield between **5% - 300%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times.
Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. Calling the compiled pipeline on a different image size will re-trigger compilation which can be expensive.
## Benchmark
We conducted a simple benchmark on different GPUs to compare vanilla attention, xFormers, `torch.nn.functional.scaled_dot_product_attention` and `torch.compile+torch.nn.functional.scaled_dot_product_attention`.
For the benchmark we used the [stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) model with 50 steps. The `xFormers` benchmark is done using the `torch==1.13.1` version, while the accelerated transformers optimizations are tested using nightly versions of PyTorch 2.0. The tables below summarize the results we got.
We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. We used `diffusers 0.17.0.dev0`, which [makes sure `torch.compile()` is leveraged optimally](https://github.com/huggingface/diffusers/pull/3313).
Please refer to [our featured blog post in the PyTorch site](https://pytorch.org/blog/accelerated-diffusers-pt-20/) for more details.
### Benchmarking code
### FP16 benchmark
#### Stable Diffusion text-to-image
The table below shows the benchmark results for inference using `fp16`. As we can see, `torch.nn.functional.scaled_dot_product_attention` is as fast as `xFormers` (sometimes slightly faster/slower) on all the GPUs we tested.
And using `torch.compile` gives a further speed-up of up to 10% over `xFormers`, but it's mostly noticeable on the A100 GPU.
```python
from diffusers import DiffusionPipeline
import torch
___The time reported is in seconds.___
path = "runwayml/stable-diffusion-v1-5"
| GPU | Batch Size | Vanilla Attention | xFormers | PyTorch2.0 SDPA | SDPA + torch.compile | Speed over xformers (%) |
| --- | --- | --- | --- | --- | --- | --- |
| A100 | 1 | 2.69 | 2.7 | 1.98 | 2.47 | 8.52 |
| A100 | 2 | 3.21 | 3.04 | 2.38 | 2.78 | 8.55 |
| A100 | 4 | 5.27 | 3.91 | 3.89 | 3.53 | 9.72 |
| A100 | 8 | 9.74 | 7.03 | 7.04 | 6.62 | 5.83 |
| A100 | 10 | 12.02 | 8.7 | 8.67 | 8.45 | 2.87 |
| A100 | 16 | 18.95 | 13.57 | 13.55 | 13.20 | 2.73 |
| A100 | 32 (1) | OOM | 26.56 | 26.68 | 25.85 | 2.67 |
| A100 | 64 | | 52.51 | 53.03 | 50.93 | 3.01 |
| | | | | | | |
| A10 | 4 | 13.94 | 9.81 | 10.01 | 9.35 | 4.69 |
| A10 | 8 | 27.09 | 19 | 19.53 | 18.33 | 3.53 |
| A10 | 10 | 33.69 | 23.53 | 24.19 | 22.52 | 4.29 |
| A10 | 16 | OOM | 37.55 | 38.31 | 36.81 | 1.97 |
| A10 | 32 (1) | | 77.19 | 78.43 | 76.64 | 0.71 |
| A10 | 64 (1) | | 173.59 | 158.99 | 155.14 | 10.63 |
| | | | | | | |
| T4 | 4 | 38.81 | 30.09 | 29.74 | 27.55 | 8.44 |
| T4 | 8 | OOM | 55.71 | 55.99 | 53.85 | 3.34 |
| T4 | 10 | OOM | 68.96 | 69.86 | 65.35 | 5.23 |
| T4 | 16 | OOM | 111.47 | 113.26 | 106.93 | 4.07 |
| | | | | | | |
| V100 | 4 | 9.84 | 8.16 | 8.09 | 7.65 | 6.25 |
| V100 | 8 | OOM | 15.62 | 15.44 | 14.59 | 6.59 |
| V100 | 10 | OOM | 19.52 | 19.28 | 18.18 | 6.86 |
| V100 | 16 | OOM | 30.29 | 29.84 | 28.22 | 6.83 |
| | | | | | | |
| 3090 | 1 | 2.94 | 2.5 | 2.42 | 2.33 | 6.80 |
| 3090 | 4 | 10.04 | 7.82 | 7.72 | 7.38 | 5.63 |
| 3090 | 8 | 19.27 | 14.97 | 14.88 | 14.15 | 5.48 |
| 3090 | 10| 24.08 | 18.7 | 18.62 | 18.12 | 3.10 |
| 3090 | 16 | OOM | 29.06 | 28.88 | 28.2 | 2.96 |
| 3090 | 32 (1) | | 58.05 | 57.42 | 56.28 | 3.05 |
| 3090 | 64 (1) | | 126.54 | 114.27 | 112.21 | 11.32 |
| | | | | | | |
| 3090 Ti | 1 | 2.7 | 2.26 | 2.19 | 2.12 | 6.19 |
| 3090 Ti | 4 | 9.07 | 7.14 | 7.00 | 6.71 | 6.02 |
| 3090 Ti | 8 | 17.51 | 13.65 | 13.53 | 12.94 | 5.20 |
| 3090 Ti | 10 (2) | 21.79 | 16.85 | 16.77 | 16.44 | 2.43 |
| 3090 Ti | 16 | OOM | 26.1 | 26.04 | 25.53 | 2.18 |
| 3090 Ti | 32 (1) | | 51.78 | 51.71 | 50.91 | 1.68 |
| 3090 Ti | 64 (1) | | 112.02 | 102.78 | 100.89 | 9.94 |
| | | | | | | |
| 4090 | 1 | 4.47 | 3.98 | 1.28 | 1.21 | 69.60 |
| 4090 | 4 | 10.48 | 8.37 | 3.76 | 3.56 | 57.47 |
| 4090 | 8 | 14.33 | 10.22 | 7.43 | 6.99 | 31.60 |
| 4090 | 16 | | 17.07 | 14.98 | 14.58 | 14.59 |
| 4090 | 32 (1) | | 39.03 | 30.18 | 29.49 | 24.44 |
| 4090 | 64 (1) | | 77.29 | 61.34 | 59.96 | 22.42 |
run_compile = True # Set True / False
pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
if run_compile:
print("Run torch compile")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
prompt = "ghibli style, a fantasy landscape with castles"
for _ in range(3):
images = pipe(prompt=prompt).images
```
#### Stable Diffusion image-to-image
```python
from diffusers import StableDiffusionImg2ImgPipeline
import requests
import torch
from PIL import Image
from io import BytesIO
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))
path = "runwayml/stable-diffusion-v1-5"
run_compile = True # Set True / False
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
if run_compile:
print("Run torch compile")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
prompt = "ghibli style, a fantasy landscape with castles"
for _ in range(3):
image = pipe(prompt=prompt, image=init_image).images[0]
```
#### Stable Diffusion - inpainting
```python
from diffusers import StableDiffusionInpaintPipeline
import requests
import torch
from PIL import Image
from io import BytesIO
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
def download_image(url):
response = requests.get(url)
return Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
### FP32 benchmark
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
The table below shows the benchmark results for inference using `fp32`. In this case, `torch.nn.functional.scaled_dot_product_attention` is faster than `xFormers` on all the GPUs we tested.
path = "runwayml/stable-diffusion-inpainting"
Using `torch.compile` in addition to the accelerated transformers implementation can yield up to 19% performance improvement over `xFormers` in Ampere and Ada cards, and up to 20% (Ampere) or 28% (Ada) over vanilla attention.
run_compile = True # Set True / False
| GPU | Batch Size | Vanilla Attention | xFormers | PyTorch2.0 SDPA | SDPA + torch.compile | Speed over xformers (%) | Speed over vanilla (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A100 | 1 | 4.97 | 3.86 | 2.6 | 2.86 | 25.91 | 42.45 |
| A100 | 2 | 9.03 | 6.76 | 4.41 | 4.21 | 37.72 | 53.38 |
| A100 | 4 | 16.70 | 12.42 | 7.94 | 7.54 | 39.29 | 54.85 |
| A100 | 10 | OOM | 29.93 | 18.70 | 18.46 | 38.32 | |
| A100 | 16 | | 47.08 | 29.41 | 29.04 | 38.32 | |
| A100 | 32 | | 92.89 | 57.55 | 56.67 | 38.99 | |
| A100 | 64 | | 185.3 | 114.8 | 112.98 | 39.03 | |
| | | | | | | |
| A10 | 1 | 10.59 | 8.81 | 7.51 | 7.35 | 16.57 | 30.59 |
| A10 | 4 | 34.77 | 27.63 | 22.77 | 22.07 | 20.12 | 36.53 |
| A10 | 8 | | 56.19 | 43.53 | 43.86 | 21.94 | |
| A10 | 16 | | 116.49 | 88.56 | 86.64 | 25.62 | |
| A10 | 32 | | 221.95 | 175.74 | 168.18 | 24.23 | |
| A10 | 48 | | 333.23 | 264.84 | | 20.52 | |
| | | | | | | |
| T4 | 1 | 28.2 | 24.49 | 23.93 | 23.56 | 3.80 | 16.45 |
| T4 | 2 | 52.77 | 45.7 | 45.88 | 45.06 | 1.40 | 14.61 |
| T4 | 4 | OOM | 85.72 | 85.78 | 84.48 | 1.45 | |
| T4 | 8 | | 149.64 | 150.75 | 148.4 | 0.83 | |
| | | | | | | |
| V100 | 1 | 7.4 | 6.84 | 6.8 | 6.66 | 2.63 | 10.00 |
| V100 | 2 | 13.85 | 12.81 | 12.66 | 12.35 | 3.59 | 10.83 |
| V100 | 4 | OOM | 25.73 | 25.31 | 24.78 | 3.69 | |
| V100 | 8 | | 43.95 | 43.37 | 42.25 | 3.87 | |
| V100 | 16 | | 84.99 | 84.73 | 82.55 | 2.87 | |
| | | | | | | |
| 3090 | 1 | 7.09 | 6.78 | 5.34 | 5.35 | 21.09 | 24.54 |
| 3090 | 4 | 22.69 | 21.45 | 18.56 | 18.18 | 15.24 | 19.88 |
| 3090 | 8 | | 42.59 | 36.68 | 35.61 | 16.39 | |
| 3090 | 16 | | 85.35 | 72.93 | 70.18 | 17.77 | |
| 3090 | 32 (1) | | 162.05 | 143.46 | 138.67 | 14.43 | |
| | | | | | | |
| 3090 Ti | 1 | 6.45 | 6.19 | 4.99 | 4.89 | 21.00 | 24.19 |
| 3090 Ti | 4 | 20.32 | 19.31 | 17.02 | 16.48 | 14.66 | 18.90 |
| 3090 Ti | 8 | | 37.93 | 33.21 | 32.24 | 15.00 | |
| 3090 Ti | 16 | | 75.37 | 66.63 | 64.5 | 14.42 | |
| 3090 Ti | 32 (1) | | 142.55 | 128.89 | 124.92 | 12.37 | |
| | | | | | | |
| 4090 | 1 | 5.54 | 4.99 | 2.66 | 2.58 | 48.30 | 53.43 |
| 4090 | 4 | 13.67 | 11.4 | 8.81 | 8.46 | 25.79 | 38.11 |
| 4090 | 8 | | 19.79 | 17.55 | 16.62 | 16.02 | |
| 4090 | 16 | | 38.62 | 35.65 | 34.07 | 11.78 | |
| 4090 | 32 (1) | | 76.57 | 69.48 | 65.35 | 14.65 | |
| 4090 | 48 | | 114.44 | 106.3 | | 7.11 | |
pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
if run_compile:
print("Run torch compile")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
prompt = "ghibli style, a fantasy landscape with castles"
for _ in range(3):
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
```
#### ControlNet
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import requests
import torch
from PIL import Image
from io import BytesIO
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))
path = "runwayml/stable-diffusion-v1-5"
run_compile = True # Set True / False
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
path, controlnet=controlnet, torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
pipe.unet.to(memory_format=torch.channels_last)
pipe.controlnet.to(memory_format=torch.channels_last)
if run_compile:
print("Run torch compile")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)
prompt = "ghibli style, a fantasy landscape with castles"
for _ in range(3):
image = pipe(prompt=prompt, image=init_image).images[0]
```
#### IF text-to-image + upscaling
```python
from diffusers import DiffusionPipeline
import torch
run_compile = True # Set True / False
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe.to("cuda")
pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
pipe_2.to("cuda")
pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
pipe_3.to("cuda")
(1) Batch Size >= 32 requires enable_vae_slicing() because of https://github.com/pytorch/pytorch/issues/81665.
This is required for PyTorch 1.13.1, and also for PyTorch 2.0 and large batch sizes.
pipe.unet.to(memory_format=torch.channels_last)
pipe_2.unet.to(memory_format=torch.channels_last)
pipe_3.unet.to(memory_format=torch.channels_last)
For more details about how this benchmark was run, please refer to [this PR](https://github.com/huggingface/diffusers/pull/2303) and to [the blog post](https://pytorch.org/blog/accelerated-diffusers-pt-20/).
if run_compile:
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)
prompt = "the blue hulk"
prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
for _ in range(3):
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images
```
To give you a pictorial overview of the possible speed-ups that can be obtained with PyTorch 2.0 and `torch.compile()`,
here is a plot that shows relative speed-ups for the [Stable Diffusion text-to-image pipeline](StableDiffusionPipeline) across five
different GPU families (with a batch size of 4):
![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png)
To give you an even better idea of how this speed-up holds for the other pipelines presented above, consider the following
plot that shows the benchmarking numbers from an A100 across three different batch sizes
(with PyTorch 2.0 nightly and `torch.compile()`):
![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png)
_(Our benchmarking metric for the plots above is **number of iterations/second**)_
But we reveal all the benchmarking numbers in the interest of transparency!
In the following tables, we report our findings in terms of the number of **_iterations processed per second_**.
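As a rough sketch of what "iterations per second" means in practice (this is an assumed measurement, not the actual benchmark script; see the PRs linked in this section for the real setup):
```python
import time
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
prompt = "a photo of an astronaut riding a horse"
num_inference_steps = 50

_ = pipe(prompt, num_inference_steps=num_inference_steps)  # warmup (also triggers compilation if enabled)
torch.cuda.synchronize()
start = time.time()
_ = pipe(prompt, num_inference_steps=num_inference_steps)
torch.cuda.synchronize()
print(f"{num_inference_steps / (time.time() - start):.2f} iterations/second")
```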
### A100 (batch size: 1)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 |
| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 |
| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
| IF | 20.21 / <br>13.84 / <br>24.00 | 20.12 / <br>13.70 / <br>24.03 | ❌ | 97.34 / <br>27.23 / <br>111.66 |
### A100 (batch size: 4)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 |
| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 |
| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
| IF | 25.02 | 18.04 | ❌ | 48.47 |
### A100 (batch size: 16)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 |
| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 |
| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
| IF | 8.78 | 9.82 | ❌ | 16.77 |
### V100 (batch size: 1)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 |
| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
| IF | 20.01 / <br>9.08 / <br>23.34 | 19.79 / <br>8.98 / <br>24.10 | ❌ | 55.75 / <br>11.57 / <br>57.67 |
### V100 (batch size: 4)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 |
| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 |
| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 |
| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 |
| IF | 15.41 | 14.76 | ❌ | 22.95 |
### V100 (batch size: 16)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 |
| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
| IF | 5.43 | 5.29 | ❌ | 7.06 |
### T4 (batch size: 1)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 |
| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 |
| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
| IF | 17.42 / <br>2.47 / <br>18.52 | 16.96 / <br>2.45 / <br>18.69 | ❌ | 24.63 / <br>2.47 / <br>23.39 |
### T4 (batch size: 4)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 |
| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 |
| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
| IF | 5.79 | 5.61 | ❌ | 7.39 |
### T4 (batch size: 16)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s |
| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s |
| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
| IF * | 1.44 | 1.44 | ❌ | 1.94 |
### RTX 3090 (batch size: 1)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 |
| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 |
| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 |
| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 |
| IF | 27.08 / <br>9.07 / <br>31.23 | 26.75 / <br>8.92 / <br>31.47 | ❌ | 68.08 / <br>11.16 / <br>65.29 |
### RTX 3090 (batch size: 4)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 |
| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 |
| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 |
| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 |
| IF | 16.81 | 16.62 | ❌ | 21.57 |
### RTX 3090 (batch size: 16)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 |
| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 |
| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 |
| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 |
| IF | 5.01 | 5.00 | ❌ | 6.33 |
### RTX 4090 (batch size: 1)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 |
| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 |
| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
| IF | 69.71 / <br>18.78 / <br>85.49 | 69.13 / <br>18.80 / <br>85.56 | ❌ | 124.60 / <br>26.37 / <br>138.79 |
### RTX 4090 (batch size: 4)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
| IF | 31.88 | 31.14 | ❌ | 43.92 |
### RTX 4090 (batch size: 16)
| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 |
| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 |
| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
| IF | 9.26 | 9.2 | ❌ | 13.31 |
## Notes
* Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
* For the IF pipeline and batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline (text-to-image generation) and NOT for upscaling; the two upscaling pipelines therefore received a batch size of 1.
*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving the `torch.compile()` integration in Diffusers.*

View File

@@ -32,8 +32,9 @@ The quicktour is a simplified version of the introductory 🧨 Diffusers [notebo
Before you begin, make sure you have all the necessary libraries installed:
```bash
pip install --upgrade diffusers accelerate transformers
```
```py
# uncomment to install the necessary libraries in Colab
#!pip install --upgrade diffusers accelerate transformers
```
- [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training.
@@ -121,9 +122,9 @@ Save the image by calling `save`:
You can also use the pipeline locally. The only difference is you need to download the weights first:
```
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
```
```bash
!git lfs install
!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
```
Then load the saved weights into the pipeline:
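For example, assuming the clone above landed in `./stable-diffusion-v1-5`, loading from the local folder could look like this (a sketch, not the exact snippet from the quicktour):
```python
from diffusers import DiffusionPipeline

# Path assumption: the folder created by the git clone command above.
pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5")
```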

View File

@@ -10,156 +10,138 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# The Stable Diffusion Guide 🎨
<a target="_blank" href="https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_101_guide.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
# Effective and efficient diffusion
## Intro
[[open-in-colab]]
Stable Diffusion is a [Latent Diffusion model](https://github.com/CompVis/latent-diffusion) developed by researchers from the Machine Vision and Learning group at LMU Munich, *a.k.a* CompVis.
Model checkpoints were publicly released at the end of August 2022 by a collaboration of Stability AI, CompVis, and Runway with support from EleutherAI and LAION. For more information, you can check out [the official blog post](https://stability.ai/blog/stable-diffusion-public-release).
Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again.
Since its public release the community has done an incredible job at working together to make the stable diffusion checkpoints **faster**, **more memory efficient**, and **more performant**.
This is why it's important to get the most *computational* (speed) and *memory* (GPU RAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster.
🧨 Diffusers offers a simple API to run stable diffusion with all memory, computing, and quality improvements.
This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`].
This notebook walks you through the improvements one-by-one so you can best leverage [`StableDiffusionPipeline`] for **inference**.
## Prompt Engineering 🎨
When running *Stable Diffusion* in inference, we usually want to generate a certain type, or style of image and then improve upon it. Improving upon a previously generated image means running inference over and over again with a different prompt and potentially a different seed until we are happy with our generation.
So to begin with, it is most important to speed up stable diffusion as much as possible to generate as many pictures as possible in a given amount of time.
This can be done by both improving the **computational efficiency** (speed) and the **memory efficiency** (GPU RAM).
Let's start by looking into computational efficiency first.
Throughout the notebook, we will focus on [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5):
``` python
model_id = "runwayml/stable-diffusion-v1-5"
```
Let's load the pipeline.
## Speed Optimization
Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model:
```python
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(model_id)
model_id = "runwayml/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(model_id)
```
We aim at generating a beautiful photograph of an *old warrior chief* and will later try to find the best prompt to generate such a photograph. For now, let's keep the prompt simple:
The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt:
```python
prompt = "portrait photo of a old warrior chief"
```
To begin with, we should make sure we run inference on GPU, so let's move the pipeline to GPU, just like you would with any PyTorch module.
## Speed
<Tip>
💡 If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)!
</Tip>
One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module:
```python
pipe = pipe.to("cuda")
pipeline = pipeline.to("cuda")
```
To generate an image, you should use the [`~StableDiffusionPipeline.__call__`] method.
To make sure we can reproduce more or less the same image in every call, let's make use of the generator. See the documentation on reproducibility [here](./conceptual/reproducibility) for more information.
``` python
generator = torch.Generator("cuda").manual_seed(0)
```
Now, let's take it for a spin.
``` python
image = pipe(prompt, generator=generator).images[0]
image
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_1.png)
Cool, this now took roughly 30 seconds on a T4 GPU (you might see faster inference if your allocated GPU is better than a T4).
The default run we did above used full float32 precision and ran the default number of inference steps (50). The easiest speed-ups come from switching to float16 (or half) precision and simply running fewer inference steps. Let's load the model now in float16 instead.
To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility):
```python
import torch
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
generator = torch.Generator("cuda").manual_seed(0)
```
And we can again call the pipeline to generate an image.
Now you can generate an image:
```python
generator = torch.Generator("cuda").manual_seed(0)
image = pipe(prompt, generator=generator).images[0]
image = pipeline(prompt, generator=generator).images[0]
image
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_2.png)
Cool, this is almost three times as fast for arguably the same image quality.
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_1.png">
</div>
We strongly suggest always running your pipelines in float16 as so far we have very rarely seen degradations in quality because of it.
This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps.
Next, let's see if we need to use 50 inference steps or whether we could use significantly fewer. The number of inference steps is associated with the denoising scheduler we use. Choosing a more efficient scheduler could help us decrease the number of steps.
Let's have a look at all the schedulers the stable diffusion pipeline is compatible with.
Let's start by loading the model in `float16` and generate an image:
```python
pipe.scheduler.compatibles
import torch
pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")
generator = torch.Generator("cuda").manual_seed(0)
image = pipeline(prompt, generator=generator).images[0]
image
```
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_2.png">
</div>
This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before!
<Tip>
💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality.
</Tip>
Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method:
```python
pipeline.scheduler.compatibles
[
diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,
diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,
diffusers.schedulers.scheduling_pndm.PNDMScheduler,
diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler,
diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler,
diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler,
diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,
diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,
diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,
diffusers.schedulers.scheduling_ddpm.DDPMScheduler,
diffusers.schedulers.scheduling_ddim.DDIMScheduler,
diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,
diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler,
]
```
Cool, that's a lot of schedulers.
🧨 Diffusers is constantly adding a bunch of novel schedulers/samplers that can be used with Stable Diffusion. For more information, we recommend taking a look at the official documentation [here](https://huggingface.co/docs/diffusers/main/en/api/schedulers/overview).
Alright, right now Stable Diffusion is using the `PNDMScheduler` which usually requires around 50 inference steps. However, other schedulers such as `DPMSolverMultistepScheduler` or `DPMSolverSinglestepScheduler` seem to get away with just 20 to 25 inference steps. Let's try them out.
You can set a new scheduler by making use of the [from_config](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) function.
The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`ConfigMixin.from_config`] method to load a new scheduler:
```python
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
```
Now, let's try to reduce the number of inference steps to just 20.
Now set the `num_inference_steps` to 20:
```python
generator = torch.Generator("cuda").manual_seed(0)
image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]
image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0]
image
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_3.png)
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_3.png">
</div>
The image now does look a little different, but it's arguably still of equally high quality. We now cut inference time to just 4 seconds though 😍.
Great, you've managed to cut the inference time to just 4 seconds! ⚡️
## Memory Optimization
## Memory
Less memory used in generation indirectly implies more speed, since we're often trying to maximize how many images we can generate per second. Usually, the more images per inference run, the more images per second too.
The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM).
The easiest way to see how many images we can generate at once is to simply try it out, and see when we get an *"Out-of-memory (OOM)"* error.
We can run batched inference by simply passing a list of prompts and generators. Let's define a quick function that generates a batch for us.
Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result.
```python
def get_inputs(batch_size=1):
@@ -169,143 +151,101 @@ def get_inputs(batch_size=1):
return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
```
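As a concrete sketch of such a helper (assuming `prompt` is the string defined earlier and that each `Generator` is seeded with its batch index so a good seed can be reused):
```python
import torch

def get_inputs(batch_size=1):
    # One seeded generator per image so a result you like can be reproduced later.
    generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]  # assumes `prompt` was defined earlier in the guide
    num_inference_steps = 20
    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}
```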
This function returns a list of prompts and a list of generators, so we can reuse the generator that produced a result we like.
We also need a method that allows us to easily display a batch of images.
You'll also need a function that'll display each batch of images:
```python
from PIL import Image
def image_grid(imgs, rows=2, cols=2):
    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
```
Cool, let's see how much memory we can use starting with `batch_size=4`.
Start with `batch_size=4` and see how much memory you've consumed:
```python
images = pipe(**get_inputs(batch_size=4)).images
images = pipeline(**get_inputs(batch_size=4)).images
image_grid(images)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_4.png)
Going over a batch_size of 4 will error out in this notebook (assuming we are running it on a T4 GPU). Also, throughput only improves slightly: roughly 3.75 seconds per image compared to 4 seconds per image previously.
However, the community has found some nice tricks to improve the memory constraints further. After stable diffusion was released, the community found improvements within days and shared them freely over GitHub - open-source at its finest! I believe the original idea came from [this](https://github.com/basujindal/stable-diffusion/pull/117) GitHub thread.
By far most of the memory is taken up by the cross-attention layers. Instead of running this operation in batch, one can run it sequentially to save a significant amount of memory.
It can easily be enabled by calling `enable_attention_slicing` as is documented [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.enable_attention_slicing).
Unless you have a GPU with more RAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function:
```python
pipe.enable_attention_slicing()
pipeline.enable_attention_slicing()
```
Great, now that attention slicing is enabled, let's try to double the batch size again, going for `batch_size=8`.
Now try increasing the `batch_size` to 8!
```python
images = pipe(**get_inputs(batch_size=8)).images
images = pipeline(**get_inputs(batch_size=8)).images
image_grid(images, rows=2, cols=4)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_5.png)
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_5.png">
</div>
Nice, it works. However, the speed gain is again not very big (it might however be much more significant on other GPUs).
Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality.
We're at roughly 3.5 seconds per image 🔥 which is probably the fastest we can be with a simple T4 without sacrificing quality.
## Quality
Next, let's look into how to improve the quality!
In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images.
## Quality Improvements
### Better checkpoints
Now that our image generation pipeline is blazing fast, let's try to get maximum image quality.
The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results.
First of all, image quality is extremely subjective, so it's difficult to make general claims here.
As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in!
The most obvious step to take to improve quality is to use *better checkpoints*. Since the release of Stable Diffusion, many improved versions have been released, which are summarized here:
### Better pipeline components
- *Official Release - 22 Aug 2022*: [Stable-Diffusion 1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
- *20 October 2022*: [Stable-Diffusion 1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5)
- *24 Nov 2022*: [Stable-Diffusion 2.0](https://huggingface.co/stabilityai/stable-diffusion-2-0)
- *7 Dec 2022*: [Stable-Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1)
Newer versions don't necessarily mean better image quality with the same parameters. People mentioned that *2.0* is slightly worse than *1.5* for certain prompts, but given the right prompt engineering *2.0* and *2.1* seem to be better.
Overall, we strongly recommend just trying the models out and reading up on advice online (e.g., it has been shown that using negative prompts is very important for 2.0 and 2.1 to get the highest possible quality; see for example [this nice blog post](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)).
Additionally, the community has started fine-tuning many of the above versions on certain styles with some of them having an extremely high quality and gaining a lot of traction.
We recommend having a look at all [diffusers checkpoints sorted by downloads and trying out the different checkpoints](https://huggingface.co/models?library=diffusers).
For the following, we will stick to v1.5 for simplicity.
Next, we can also try to optimize single components of the pipeline, e.g. switching out the latent decoder. For more details on how the whole Stable Diffusion pipeline works, please have a look at [this blog post](https://huggingface.co/blog/stable_diffusion).
Let's load [stabilityai's newest auto-decoder](https://huggingface.co/stabilityai/stable-diffusion-2-1).
You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images:
```python
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
```
Now we can set it to the vae of the pipeline to use it.
``` python
pipe.vae = vae
```
Let's run the same prompt as before to compare quality.
``` python
images = pipe(**get_inputs(batch_size=8)).images
pipeline.vae = vae
images = pipeline(**get_inputs(batch_size=8)).images
image_grid(images, rows=2, cols=4)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_6.png)
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_6.png">
</div>
Seems like the difference is only very minor, but the new generations are arguably a bit *sharper*.
### Better prompt engineering
Cool, finally, let's look a bit into prompt engineering.
The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep in mind during prompt engineering are:
Our goal was to generate a photo of an old warrior chief. Let's now try to bring a bit more color into the photos and make them look more impressive.
- How is the image or similar images of the one I want to generate stored on the internet?
- What additional detail can I give that steers the model towards the style I want?
Originally our prompt was "*portrait photo of an old warrior chief*".
To improve the prompt, it often helps to add the kinds of cues that were likely attached to similar high-quality photos online, as well as more details.
Essentially, when doing prompt engineering, one has to think:
- How was the photo or similar photos of the one I want probably stored on the internet?
- What additional detail can I give that steers the models into the style that I want?
Cool, let's add more details.
With this in mind, let's improve the prompt to include color and higher quality details:
```python
prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
```
and let's also add some cues that usually help to generate higher quality images.
``` python
prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta"
prompt
```
Cool, let's now try this prompt.
Generate a batch of images with the new prompt:
```python
images = pipe(**get_inputs(batch_size=8)).images
images = pipeline(**get_inputs(batch_size=8)).images
image_grid(images, rows=2, cols=4)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_7.png)
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_7.png">
</div>
Pretty impressive! We got some very high-quality image generations there. The 2nd image is my personal favorite, so I'll re-use this seed and see whether I can tweak the prompts slightly by using "oldest warrior", "old", "", and "young" instead of "old".
Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject:
```python
prompts = [
@@ -315,19 +255,19 @@ prompts = [
"portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
]
generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] # 1 because we want the 2nd image
images = pipe(prompt=prompts, generator=generator, num_inference_steps=25).images
generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
image_grid(images)
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_8.png)
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_8.png">
</div>
The first picture looks nice! The eye movement changed slightly but still looks natural. This finishes up our 101 guide on how to use Stable Diffusion 🤗.
## Next steps
For more information on optimization or other guides, I recommend taking a look at the following:
In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources:
- [Blog post about Stable Diffusion](https://huggingface.co/blog/stable_diffusion): In-detail blog post explaining Stable Diffusion.
- [FlashAttention](https://huggingface.co/docs/diffusers/optimization/xformers): XFormers flash attention can optimize your model even further with more speed and memory improvements.
- [Dreambooth](https://huggingface.co/docs/diffusers/training/dreambooth) - Quickly customize the model by fine-tuning it.
- [General info on Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/overview) - Info on other tasks that are powered by Stable Diffusion.
- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster!
- If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption.
- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16).

View File

@@ -0,0 +1,42 @@
# Adapt a model to a new task
Many diffusion systems share the same components, allowing you to adapt a pretrained model for one task to an entirely different task.
This guide will show you how to adapt a pretrained text-to-image model for inpainting by initializing and modifying the architecture of a pretrained [`UNet2DConditionModel`].
## Configure UNet2DConditionModel parameters
A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](https://huggingface.co/docs/diffusers/v0.16.0/en/api/models#diffusers.UNet2DConditionModel.in_channels). For example, load a pretrained text-to-image model like [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) and take a look at the number of `in_channels`:
```py
from diffusers import StableDiffusionPipeline
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.unet.config["in_channels"]
4
```
Inpainting requires 9 channels in the input sample. You can check this value in a pretrained inpainting model like [`runwayml/stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting):
```py
from diffusers import StableDiffusionPipeline
pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipeline.unet.config["in_channels"]
9
```
To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9.
Initialize a [`UNet2DConditionModel`] with the pretrained text-to-image model weights, and change `in_channels` to 9. Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now.
```py
from diffusers import UNet2DConditionModel
model_id = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(
model_id, subfolder="unet", in_channels=9, low_cpu_mem_usage=False, ignore_mismatched_sizes=True
)
```
The pretrained weights of the other components from the text-to-image model are initialized from their checkpoints, but the input channel weights (`conv_in.weight`) of the `unet` are randomly initialized. It is important to finetune the model for inpainting because otherwise the model returns noise.
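As a quick sanity check (a sketch; the printed shape assumes the default Stable Diffusion 1.5 UNet configuration):
```python
# Continuing from the snippet above.
print(unet.config["in_channels"])  # 9
print(unet.conv_in.weight.shape)   # e.g. torch.Size([320, 9, 3, 3]) for the SD 1.5 UNet
```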

View File

@@ -33,7 +33,12 @@ cd diffusers
pip install -e .
```
Then navigate into the example folder and run:
Then navigate into the [example folder](https://github.com/huggingface/diffusers/tree/main/examples/controlnet)
```bash
cd examples/controlnet
```
Now run:
```bash
pip install -r requirements.txt
```
@@ -64,6 +69,8 @@ The original dataset is hosted in the ControlNet [repo](https://huggingface.co/l
Our training examples use [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) because that is what the original set of ControlNet models was trained on. However, ControlNet can be trained to augment any compatible Stable Diffusion model (such as [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4)) or [`stabilityai/stable-diffusion-2-1`](https://huggingface.co/stabilityai/stable-diffusion-2-1).
To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
## Training
Download the following images to condition our training with:
@@ -74,6 +81,9 @@ wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/ma
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
```
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument.
The training script creates and saves a `diffusion_pytorch_model.bin` file in your repository.
```bash
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
@@ -87,7 +97,8 @@ accelerate launch train_controlnet.py \
--learning_rate=1e-5 \
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
--train_batch_size=4
--train_batch_size=4 \
--push_to_hub
```
This default configuration requires ~38GB VRAM.
@@ -110,7 +121,32 @@ accelerate launch train_controlnet.py \
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
--train_batch_size=1 \
--gradient_accumulation_steps=4
--gradient_accumulation_steps=4 \
--push_to_hub
```
## Training with multiple GPUs
`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
for running distributed training with `accelerate`. Here is an example command:
```bash
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="path to save model"
accelerate launch --mixed_precision="fp16" --multi_gpu train_controlnet.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--output_dir=$OUTPUT_DIR \
--dataset_name=fusing/fill50k \
--resolution=512 \
--learning_rate=1e-5 \
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
--train_batch_size=4 \
--mixed_precision="fp16" \
--tracker_project_name="controlnet-demo" \
--report_to=wandb \
--push_to_hub
```
## Example results
@@ -158,7 +194,8 @@ accelerate launch train_controlnet.py \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--use_8bit_adam
--use_8bit_adam \
--push_to_hub
```
## Training on a 12 GB GPU
@@ -186,7 +223,8 @@ accelerate launch train_controlnet.py \
--gradient_checkpointing \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention \
--set_grads_to_none
--set_grads_to_none \
--push_to_hub
```
When using `enable_xformers_memory_efficient_attention`, please make sure to install `xformers` by `pip install xformers`.
@@ -250,7 +288,8 @@ accelerate launch train_controlnet.py \
--gradient_checkpointing \
--enable_xformers_memory_efficient_attention \
--set_grads_to_none \
--mixed_precision fp16
--mixed_precision fp16 \
--push_to_hub
```
## Inference
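A minimal sketch of what inference with the trained weights could look like (the output path and the conditioning image are placeholders, not the exact snippet from this guide):
```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Placeholder paths: point these at $OUTPUT_DIR from the training run and at one
# of the conditioning images downloaded earlier.
controlnet = ControlNetModel.from_pretrained("path/to/output_dir", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control_image = load_image("./conditioning_image_1.png")
image = pipe("red circle with blue background", image=control_image, num_inference_steps=20).images[0]
image.save("output.png")
```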

View File

@@ -0,0 +1,90 @@
# Create a dataset for training
There are many datasets on the [Hub](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) to train a model on, but if you can't find one you're interested in or want to use your own, you can create a dataset with the 🤗 [Datasets](hf.co/docs/datasets) library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation.
This guide will show you two ways to create a dataset to finetune on:
- provide a folder of images to the `--train_data_dir` argument
- upload a dataset to the Hub and pass the dataset repository id to the `--dataset_name` argument
<Tip>
💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide.
</Tip>
## Provide a dataset as a folder
For unconditional generation, you can provide your own dataset as a folder of images. The training script uses the [`ImageFolder`](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder) builder from 🤗 Datasets to automatically build a dataset from the folder. Your directory structure should look like:
```bash
data_dir/xxx.png
data_dir/xxy.png
data_dir/[...]/xxz.png
```
Pass the path to the dataset directory to the `--train_data_dir` argument, and then you can start training:
```bash
accelerate launch train_unconditional.py \
--train_data_dir <path-to-train-directory> \
<other-arguments>
```
## Upload your data to the Hub
<Tip>
💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the [Image search with 🤗 Datasets](https://huggingface.co/blog/image-search-datasets) post.
</Tip>
Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images.
You can use the `data_dir` or `data_files` parameters to specify the location of the dataset. The `data_files` parameter supports mapping specific files to dataset splits like `train` or `test`:
```python
from datasets import load_dataset
# example 1: local folder
dataset = load_dataset("imagefolder", data_dir="path_to_your_folder")
# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset("imagefolder", data_files="path_to_zip_file")
# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset(
"imagefolder",
data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip",
)
# example 4: providing several splits
dataset = load_dataset(
"imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]}
)
```
Then use the [`~datasets.Dataset.push_to_hub`] method to upload the dataset to the Hub:
```python
# assuming you have ran the huggingface-cli login command in a terminal
dataset.push_to_hub("name_of_your_dataset")
# if you want to push to a private repo, simply pass private=True:
dataset.push_to_hub("name_of_your_dataset", private=True)
```
Now the dataset is available for training by passing the dataset name to the `--dataset_name` argument:
```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--dataset_name="name_of_your_dataset" \
<other-arguments>
```
## Next steps
Now that you've created a dataset, you can plug it into the `train_data_dir` (if your dataset is local) or `dataset_name` (if your dataset is on the Hub) arguments of a training script.
For your next steps, feel free to try and use your dataset to train a model for [unconditional generation](uncondtional_training) or [text-to-image generation](text2image)!

View File

@@ -0,0 +1,303 @@
<!--Copyright 2023 Custom Diffusion authors The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Custom Diffusion training example
[Custom Diffusion](https://arxiv.org/abs/2212.04488) is a method to customize text-to-image models like Stable Diffusion given just a few (4~5) images of a subject.
The `train_custom_diffusion.py` script shows how to implement the training procedure and adapt it for stable diffusion.
This training example was contributed by [Nupur Kumari](https://nupurkmr9.github.io/) (one of the authors of Custom Diffusion).
## Running locally with PyTorch
### Installing the dependencies
Before running the scripts, make sure to install the library's training dependencies:
**Important**
To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```
Then cd into the [example folder](https://github.com/huggingface/diffusers/tree/main/examples/custom_diffusion)
```
cd examples/custom_diffusion
```
Now run
```bash
pip install -r requirements.txt
pip install clip-retrieval
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
```bash
accelerate config
```
Or for a default accelerate configuration without answering questions about your environment
```bash
accelerate config default
```
Or if your environment doesn't support an interactive shell e.g. a notebook
```python
from accelerate.utils import write_basic_config
write_basic_config()
```
### Cat example 😺
Now let's get our dataset. Download dataset from [here](https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip) and unzip it. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
We also collect 200 real images using `clip-retrieval`, which are combined with the target images in the training dataset as a regularization. This prevents overfitting to the given target image. The following flags enable the regularization: `with_prior_preservation`, `real_prior` with `prior_loss_weight=1.`.
The `class_prompt` should be the category name, the same as the target image. The collected real images have text captions similar to the `class_prompt`. The retrieved images are saved in `class_data_dir`. You can disable `real_prior` to use generated images as regularization. To collect the real images, use this command first before training.
```bash
pip install clip-retrieval
python retrieve.py --class_prompt cat --class_data_dir real_reg/samples_cat --num_class_images 200
```
**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___**
The script creates and saves model checkpoints and a `pytorch_custom_diffusion_weights.bin` file in your repository.
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export OUTPUT_DIR="path-to-save-model"
export INSTANCE_DIR="./data/cat"
accelerate launch train_custom_diffusion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--class_data_dir=./real_reg/samples_cat/ \
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
--class_prompt="cat" --num_class_images=200 \
--instance_prompt="photo of a <new1> cat" \
--resolution=512 \
--train_batch_size=2 \
--learning_rate=1e-5 \
--lr_warmup_steps=0 \
--max_train_steps=250 \
--scale_lr --hflip \
--modifier_token "<new1>" \
--push_to_hub
```
**Use `--enable_xformers_memory_efficient_attention` for faster training with lower VRAM requirement (16GB per GPU). Follow [this guide](https://github.com/facebookresearch/xformers) for installation instructions.**
To track your experiments using Weights and Biases (`wandb`) and to save intermediate results (which we HIGHLY recommend), follow these steps:
* Install `wandb`: `pip install wandb`.
* Authorize: `wandb login`.
* Then specify a `validation_prompt` and set `report_to` to `wandb` while launching training. You can also configure the following related arguments:
* `num_validation_images`
* `validation_steps`
Here is an example command:
```bash
accelerate launch train_custom_diffusion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--class_data_dir=./real_reg/samples_cat/ \
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
--class_prompt="cat" --num_class_images=200 \
--instance_prompt="photo of a <new1> cat" \
--resolution=512 \
--train_batch_size=2 \
--learning_rate=1e-5 \
--lr_warmup_steps=0 \
--max_train_steps=250 \
--scale_lr --hflip \
--modifier_token "<new1>" \
--validation_prompt="<new1> cat sitting in a bucket" \
--report_to="wandb" \
--push_to_hub
```
Here is an example [Weights and Biases page](https://wandb.ai/sayakpaul/custom-diffusion/runs/26ghrcau) where you can check out the intermediate results along with other training details.
If you specify `--push_to_hub`, the learned parameters will be pushed to a repository on the Hugging Face Hub. Here is an [example repository](https://huggingface.co/sayakpaul/custom-diffusion-cat).
### Training on multiple concepts 🐱🪵
Provide a [json](https://github.com/adobe-research/custom-diffusion/blob/main/assets/concept_list.json) file with the info about each concept, similar to [this](https://github.com/ShivamShrirao/diffusers/blob/main/examples/dreambooth/train_dreambooth.py).
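For illustration, a hypothetical `concept_list.json` for the cat + wooden pot example could be written like this (the field names mirror the single-concept flags; the linked file is the authoritative reference):
```python
import json

# Hypothetical example; check the linked concept_list.json for the exact format.
concepts_list = [
    {
        "instance_prompt": "photo of a <new1> cat",
        "class_prompt": "cat",
        "instance_data_dir": "./data/cat",
        "class_data_dir": "./real_reg/samples_cat",
    },
    {
        "instance_prompt": "photo of a <new2> wooden pot",
        "class_prompt": "wooden pot",
        "instance_data_dir": "./data/wooden_pot",
        "class_data_dir": "./real_reg/samples_wooden_pot",
    },
]
with open("concept_list.json", "w") as f:
    json.dump(concepts_list, f, indent=4)
```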
To collect the real images run this command for each concept in the json file.
```bash
pip install clip-retrieval
python retrieve.py --class_prompt {} --class_data_dir {} --num_class_images 200
```
And then we're ready to start training!
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export OUTPUT_DIR="path-to-save-model"
accelerate launch train_custom_diffusion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--output_dir=$OUTPUT_DIR \
--concepts_list=./concept_list.json \
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
--resolution=512 \
--train_batch_size=2 \
--learning_rate=1e-5 \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--num_class_images=200 \
--scale_lr --hflip \
--modifier_token "<new1>+<new2>" \
--push_to_hub
```
Here is an example [Weights and Biases page](https://wandb.ai/sayakpaul/custom-diffusion/runs/3990tzkg) where you can check out the intermediate results along with other training details.
### Training on human faces
For fine-tuning on human faces we found the following configuration to work better: `learning_rate=5e-6`, `max_train_steps=1000 to 2000`, and `freeze_model=crossattn` with at least 15-20 images.
To collect the real images use this command first before training.
```bash
pip install clip-retrieval
python retrieve.py --class_prompt person --class_data_dir real_reg/samples_person --num_class_images 200
```
Then start training!
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export OUTPUT_DIR="path-to-save-model"
export INSTANCE_DIR="path-to-images"
accelerate launch train_custom_diffusion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--class_data_dir=./real_reg/samples_person/ \
--with_prior_preservation --real_prior --prior_loss_weight=1.0 \
--class_prompt="person" --num_class_images=200 \
--instance_prompt="photo of a <new1> person" \
--resolution=512 \
--train_batch_size=2 \
--learning_rate=5e-6 \
--lr_warmup_steps=0 \
--max_train_steps=1000 \
--scale_lr --hflip --noaug \
--freeze_model crossattn \
--modifier_token "<new1>" \
--enable_xformers_memory_efficient_attention \
--push_to_hub
```
## Inference
Once you have trained a model using the above command, you can run inference using the below command. Make sure to include the `modifier token` (e.g. \<new1\> in above example) in your prompt.
```python
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
pipe.unet.load_attn_procs("path-to-save-model", weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion("path-to-save-model", weight_name="<new1>.bin")
image = pipe(
"<new1> cat sitting in a bucket",
num_inference_steps=100,
guidance_scale=6.0,
eta=1.0,
).images[0]
image.save("cat.png")
```
It's possible to directly load these parameters from a Hub repository:
```python
import torch
from huggingface_hub.repocard import RepoCard
from diffusers import DiffusionPipeline
model_id = "sayakpaul/custom-diffusion-cat"
card = RepoCard.load(model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
image = pipe(
"<new1> cat sitting in a bucket",
num_inference_steps=100,
guidance_scale=6.0,
eta=1.0,
).images[0]
image.save("cat.png")
```
Here is an example of performing inference with multiple concepts:
```python
import torch
from huggingface_hub.repocard import RepoCard
from diffusers import DiffusionPipeline
model_id = "sayakpaul/custom-diffusion-cat-wooden-pot"
card = RepoCard.load(model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda")
pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
pipe.load_textual_inversion(model_id, weight_name="<new2>.bin")
image = pipe(
"the <new1> cat sculpture in the style of a <new2> wooden pot",
num_inference_steps=100,
guidance_scale=6.0,
eta=1.0,
).images[0]
image.save("multi-subject.png")
```
Here, `cat` and `wooden pot` refer to the multiple concepts.
### Inference from a training checkpoint
You can also perform inference from one of the complete checkpoints saved during the training process, if you used the `--checkpointing_steps` argument.
TODO.
## Set grads to none
To save even more memory, pass the `--set_grads_to_none` argument to the script. This will set grads to None instead of zero. However, be aware that it changes certain behaviors, so if you start experiencing any problems, remove this argument.
More info: https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html
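This maps to PyTorch's `Optimizer.zero_grad(set_to_none=True)` described in the linked documentation; a minimal illustration of the difference (a sketch, not the training script itself):
```python
import torch

param = torch.nn.Parameter(torch.randn(2, 2))
optimizer = torch.optim.AdamW([param])
param.sum().backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
print(param.grad)  # None instead of a zero tensor, which saves a bit of memory
```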
## Experimental results
You can refer to [our webpage](https://www.cs.cmu.edu/~custom-diffusion/) that discusses our experiments in detail.

View File

@@ -0,0 +1,91 @@
# Distributed inference with multiple GPUs
On distributed setups, you can run inference across multiple GPUs with 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel.
This guide will show you how to use 🤗 Accelerate and PyTorch Distributed for distributed inference.
## 🤗 Accelerate
🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.
To begin, create a Python file and initialize an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `distributed_state.device` to assign a GPU to each process.
Now use the [`~accelerate.PartialState.split_between_processes`] utility as a context manager to automatically distribute the prompts between the number of processes.
```py
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
distributed_state = PartialState()
pipeline.to(distributed_state.device)
with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt:
    result = pipeline(prompt).images[0]
    result.save(f"result_{distributed_state.process_index}.png")
```
Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script:
```bash
accelerate launch --num_processes=2 run_distributed.py
```
<Tip>
To learn more, take a look at the [Distributed Inference with 🤗 Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
</Tip>
## PyTorch Distributed
PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables data parallelism.
To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU. You should also initialize a [`DiffusionPipeline`]:
```py
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from diffusers import DiffusionPipeline
sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
```
You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` is 2.
Move the [`DiffusionPipeline`] to `rank` and use `get_rank` to assign a GPU to each process, where each process handles a different prompt:
```py
def run_inference(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    sd.to(rank)
    if torch.distributed.get_rank() == 0:
        prompt = "a dog"
    elif torch.distributed.get_rank() == 1:
        prompt = "a cat"
    image = sd(prompt).images[0]
    # join the prompt words with underscores to build the filename
    image.save(f"./{'_'.join(prompt.split())}.png")
```
To run the distributed inference, call [`mp.spawn`](https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn) to run the `run_inference` function on the number of GPUs defined in `world_size`:
```py
def main():
    world_size = 2
    mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
```
Once you've completed the inference script, use the `--nproc_per_node` argument to specify the number of GPUs to use and call `torchrun` to run the script:
```bash
torchrun --nproc_per_node=2 run_distributed.py
```

View File

@@ -12,8 +12,6 @@ specific language governing permissions and limitations under the License.
# DreamBooth
[[open-in-colab]]
[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views.
![Dreambooth examples from the project's blog](https://dreambooth.github.io/DreamBooth_files/teaser_static.jpg)
@@ -50,6 +48,22 @@ from accelerate.utils import write_basic_config
write_basic_config()
```
Finally, download a [few images of a dog](https://huggingface.co/datasets/diffusers/dog-example) to DreamBooth with:
```py
from huggingface_hub import snapshot_download
local_dir = "./dog"
snapshot_download(
"diffusers/dog-example",
local_dir=local_dir,
repo_type="dataset",
ignore_patterns=".gitattributes",
)
```
To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
## Finetuning
<Tip warning={true}>
@@ -60,11 +74,13 @@ DreamBooth finetuning is very sensitive to hyperparameters and easy to overfit.
<frameworkcontent>
<pt>
Let's try DreamBooth with a [few images of a dog](https://drive.google.com/drive/folders/1BO_dyz-p65qhBRRMRA4TbZ8qW4rB99JZ); download and save them to a directory and then set the `INSTANCE_DIR` environment variable to that path:
Set the `INSTANCE_DIR` environment variable to the path of the directory containing the dog images.
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`.
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path_to_training_images"
export INSTANCE_DIR="./dog"
export OUTPUT_DIR="path_to_saved_model"
```
@@ -82,7 +98,8 @@ accelerate launch train_dreambooth.py \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=400
--max_train_steps=400 \
--push_to_hub
```
</pt>
<jax>
@@ -94,11 +111,13 @@ Before running the script, make sure you have the requirements installed:
pip install -U -r requirements.txt
```
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`.
Now you can launch the training script with the following command:
```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export INSTANCE_DIR="path-to-instance-images"
export INSTANCE_DIR="./dog"
export OUTPUT_DIR="path-to-save-model"
python train_dreambooth_flax.py \
@@ -109,7 +128,8 @@ python train_dreambooth_flax.py \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--max_train_steps=400
--max_train_steps=400 \
--push_to_hub
```
</jax>
</frameworkcontent>
@@ -124,7 +144,7 @@ The authors recommend generating `num_epochs * num_samples` images for prior pre
<pt>
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path_to_training_images"
export INSTANCE_DIR="./dog"
export CLASS_DIR="path_to_class_images"
export OUTPUT_DIR="path_to_saved_model"
@@ -143,13 +163,14 @@ accelerate launch train_dreambooth.py \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800
--max_train_steps=800 \
--push_to_hub
```
</pt>
<jax>
```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export INSTANCE_DIR="path-to-instance-images"
export INSTANCE_DIR="./dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"
@@ -165,7 +186,8 @@ python train_dreambooth_flax.py \
--train_batch_size=1 \
--learning_rate=5e-6 \
--num_class_images=200 \
--max_train_steps=800
--max_train_steps=800 \
--push_to_hub
```
</jax>
</frameworkcontent>
@@ -186,7 +208,7 @@ Pass the `--train_text_encoder` argument to the training script to enable finetu
<pt>
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path_to_training_images"
export INSTANCE_DIR="./dog"
export CLASS_DIR="path_to_class_images"
export OUTPUT_DIR="path_to_saved_model"
@@ -201,19 +223,20 @@ accelerate launch train_dreambooth.py \
--class_prompt="a photo of dog" \
--resolution=512 \
--train_batch_size=1 \
--use_8bit_adam
--use_8bit_adam \
--gradient_checkpointing \
--learning_rate=2e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800
--max_train_steps=800 \
--push_to_hub
```
</pt>
<jax>
```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export INSTANCE_DIR="path-to-instance-images"
export INSTANCE_DIR="./dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"
@@ -230,7 +253,8 @@ python train_dreambooth_flax.py \
--train_batch_size=1 \
--learning_rate=2e-6 \
--num_class_images=200 \
--max_train_steps=800
--max_train_steps=800 \
--push_to_hub
```
</jax>
</frameworkcontent>
@@ -349,7 +373,7 @@ Then pass the `--use_8bit_adam` option to the training script:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path_to_training_images"
export INSTANCE_DIR="./dog"
export CLASS_DIR="path_to_class_images"
export OUTPUT_DIR="path_to_saved_model"
@@ -369,7 +393,8 @@ accelerate launch train_dreambooth.py \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800
--max_train_steps=800 \
--push_to_hub
```
### 12GB GPU
@@ -378,7 +403,7 @@ To run DreamBooth on a 12GB GPU, you'll need to enable gradient checkpointing, t
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export INSTANCE_DIR="./dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"
@@ -400,7 +425,8 @@ accelerate launch train_dreambooth.py \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800
--max_train_steps=800 \
--push_to_hub
```
### 8 GB GPU
@@ -425,7 +451,7 @@ Launch training with the following command:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path_to_training_images"
export INSTANCE_DIR="./dog"
export CLASS_DIR="path_to_class_images"
export OUTPUT_DIR="path_to_saved_model"
@@ -446,7 +472,8 @@ accelerate launch train_dreambooth.py \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800 \
--mixed_precision=fp16
--mixed_precision=fp16 \
--push_to_hub
```
## Inference
@@ -470,3 +497,207 @@ image.save("dog-bucket.png")
```
You may also run inference from any of the [saved training checkpoints](#inference-from-a-saved-checkpoint).
## IF
You can use the LoRA and full DreamBooth scripts to train the text-to-image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) and the stage II upscaler [IF model](https://huggingface.co/DeepFloyd/IF-II-L-v1.0).
Note that IF has a predicted variance, and our finetuning scripts only train the model's predicted error, so for finetuned IF models we switch to a fixed
variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you
must also update the pipeline's scheduler config.
```py
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe.load_lora_weights("<lora weights path>")
# Update scheduler config to fixed variance schedule
pipe.scheduler = pipe.scheduler.__class__.from_config(pipe.scheduler.config, variance_type="fixed_small")
```
Additionally, a few alternative CLI flags are needed for IF.
`--resolution=64`: IF is a pixel-space diffusion model. To operate on uncompressed pixels, the input images are of a much smaller resolution.
`--pre_compute_text_embeddings`: IF uses [T5](https://huggingface.co/docs/transformers/model_doc/t5) for its text encoder. To save GPU memory, we pre-compute all text embeddings and then de-allocate T5.
`--tokenizer_max_length=77`: T5 has a longer default text length, but the default IF encoding procedure uses a smaller number.
`--text_encoder_use_attention_mask`: T5 requires the attention mask to be passed to the text encoder.
### Tips and Tricks
We find LoRA to be sufficient for finetuning the stage I model, as the model's low resolution makes representing fine-grained detail hard regardless.
For common and/or not visually complex object concepts, you can get away with not finetuning the upscaler. Just be sure to adjust the prompt passed to the upscaler to remove the new token from the instance prompt, i.e. if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt.
For fine-grained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than LoRA finetuning of stage II.
For fine-grained detail like faces, we find that lower learning rates along with larger batch sizes work best.
For stage II, we find that lower learning rates are also needed.
We found experimentally that the DDPM scheduler, with its larger default number of denoising steps, sometimes works better than the DPM Solver scheduler used in the training scripts.
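If you want to try this at inference time, a minimal sketch for swapping the scheduler on an already-loaded pipeline (assuming a `pipe` loaded as in the snippet above) looks like:
```py
from diffusers import DDPMScheduler

# replace the pipeline's scheduler with DDPM, reusing the existing scheduler config
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
```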
### Stage II additional validation images
Stage II validation requires images to upscale, so we can download a downsized version of the training set:
```py
from huggingface_hub import snapshot_download
local_dir = "./dog_downsized"
snapshot_download(
"diffusers/dog-example-downsized",
local_dir=local_dir,
repo_type="dataset",
ignore_patterns=".gitattributes",
)
```
### IF stage I LoRA Dreambooth
This training configuration requires ~28 GB VRAM.
```sh
export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_lora"
accelerate launch train_dreambooth_lora.py \
--report_to wandb \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a sks dog" \
--resolution=64 \
--train_batch_size=4 \
--gradient_accumulation_steps=1 \
--learning_rate=5e-6 \
--scale_lr \
--max_train_steps=1200 \
--validation_prompt="a sks dog" \
--validation_epochs=25 \
--checkpointing_steps=100 \
--pre_compute_text_embeddings \
--tokenizer_max_length=77 \
--text_encoder_use_attention_mask
```
### IF stage II LoRA Dreambooth
`--validation_images`: These images are upscaled during validation steps.
`--class_labels_conditioning=timesteps`: Pass additional conditioning to the UNet needed for stage II.
`--learning_rate=1e-6`: Lower learning rate than stage I.
`--resolution=256`: The upscaler expects higher resolution inputs.
```sh
export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_upscale"
export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png"
python train_dreambooth_lora.py \
--report_to wandb \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a sks dog" \
--resolution=256 \
--train_batch_size=4 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-6 \
--max_train_steps=2000 \
--validation_prompt="a sks dog" \
--validation_epochs=100 \
--checkpointing_steps=500 \
--pre_compute_text_embeddings \
--tokenizer_max_length=77 \
--text_encoder_use_attention_mask \
--validation_images $VALIDATION_IMAGES \
--class_labels_conditioning=timesteps
```
### IF Stage I Full Dreambooth
`--skip_save_text_encoder`: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline with a T5 loaded from the original model (see the loading sketch after the training command below).
`--use_8bit_adam`: Due to the size of the optimizer states, we recommend training the full XL IF model with 8-bit Adam.
`--learning_rate=1e-7`: For full DreamBooth, IF requires very low learning rates; with higher learning rates, model quality will degrade. Note that the learning rate can likely be increased with larger batch sizes.
Using 8-bit Adam and a batch size of 4, the model can be trained in ~48 GB of VRAM.
```sh
export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_if"
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of sks dog" \
--resolution=64 \
--train_batch_size=4 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-7 \
--max_train_steps=150 \
--validation_prompt "a photo of sks dog" \
--validation_steps 25 \
--text_encoder_use_attention_mask \
--tokenizer_max_length 77 \
--pre_compute_text_embeddings \
--use_8bit_adam \
--set_grads_to_none \
--skip_save_text_encoder \
--push_to_hub
```
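Since `--skip_save_text_encoder` leaves T5 out of the saved pipeline, you can re-attach the original text encoder when loading. A rough sketch, assuming the pipeline was saved to `dreambooth_if` as in the command above:
```py
import torch
from transformers import T5EncoderModel
from diffusers import DiffusionPipeline

# load the original T5 text encoder, since it was not saved with the finetuned pipeline
text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", torch_dtype=torch.float16
)
pipe = DiffusionPipeline.from_pretrained(
    "dreambooth_if", text_encoder=text_encoder, torch_dtype=torch.float16
)
```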
### IF Stage II Full Dreambooth
`--learning_rate=5e-6`: With a smaller effective batch size of 4, we found that we required learning rates as low as
1e-8.
`--resolution=256`: The upscaler expects higher resolution inputs.
`--train_batch_size=2` and `--gradient_accumulation_steps=6`: We found that full training of stage II particularly with
faces required large effective batch sizes.
```sh
export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_upscale"
export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png"
accelerate launch train_dreambooth.py \
--report_to wandb \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a sks dog" \
--resolution=256 \
--train_batch_size=2 \
--gradient_accumulation_steps=6 \
--learning_rate=5e-6 \
--max_train_steps=2000 \
--validation_prompt="a sks dog" \
--validation_steps=150 \
--checkpointing_steps=500 \
--pre_compute_text_embeddings \
--tokenizer_max_length=77 \
--text_encoder_use_attention_mask \
--validation_images $VALIDATION_IMAGES \
--class_labels_conditioning timesteps \
--push_to_hub
```

View File

@@ -24,7 +24,7 @@ The output is an "edited" image that reflects the edit instruction applied on th
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/output-gs%407-igs%401-steps%4050.png" alt="instructpix2pix-output" width=600/>
</p>
The `train_instruct_pix2pix.py` script shows how to implement the training procedure and adapt it for Stable Diffusion.
The `train_instruct_pix2pix.py` script (you can find it [here](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py)) shows how to implement the training procedure and adapt it for Stable Diffusion.
***Disclaimer: Even though `train_instruct_pix2pix.py` implements the InstructPix2Pix
training procedure while being faithful to the [original implementation](https://github.com/timothybrooks/instruct-pix2pix), we have only tested it on a [small-scale dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples). This can impact the end results. For better results, we recommend longer training runs with a larger dataset. [Here](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) you can find a large dataset for InstructPix2Pix training.***
@@ -44,7 +44,12 @@ cd diffusers
pip install -e .
```
Then cd in the example folder and run
Then cd into the example folder
```bash
cd examples/instruct_pix2pix
```
Now run
```bash
pip install -r requirements.txt
```
@@ -72,17 +77,16 @@ write_basic_config()
### Toy example
As mentioned before, we'll use a [small toy dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) for training. The dataset
is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper.
is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
Configure environment variables such as the dataset identifier and the Stable Diffusion
checkpoint:
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. You'll also need to specify the dataset name in `DATASET_ID`:
```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
```
Now, we can launch training:
Now, we can launch training. The script saves all the components (`feature_extractor`, `scheduler`, `text_encoder`, `unet`, etc) in a subfolder in your repository.
```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
@@ -96,7 +100,8 @@ accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42
--seed=42 \
--push_to_hub
```
Additionally, we support performing validation inference to monitor training progress
@@ -117,7 +122,8 @@ accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
--validation_prompt="make the mountains snowy" \
--seed=42 \
--report_to=wandb
--report_to=wandb \
--push_to_hub
```
We recommend this type of validation as it can be useful for model debugging. Note that you need `wandb` installed to use this. You can install `wandb` by running `pip install wandb`.
@@ -126,6 +132,28 @@ accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
***Note: In the original paper, the authors observed that even when the model is trained with an image resolution of 256x256, it generalizes well to bigger resolutions such as 512x512. This is likely because of the larger dataset they used during training.***
## Training with multiple GPUs
`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
for running distributed training with `accelerate`. Here is an example command:
```bash
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py \
--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
--dataset_name=sayakpaul/instructpix2pix-1000-samples \
--use_ema \
--enable_xformers_memory_efficient_attention \
--resolution=512 --random_flip \
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=15000 \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42 \
--push_to_hub
```
## Inference
Once training is complete, we can perform inference:
@@ -179,3 +207,5 @@ speed and quality during performance:
Particularly, `image_guidance_scale` and `guidance_scale` can have a profound impact
on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).
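As a quick illustration of tuning these two values, here is a sketch using a hypothetical finetuned checkpoint (`your-username/instruct-pix2pix-model` is a placeholder) and the validation image from the command above:
```py
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "your-username/instruct-pix2pix-model", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png")
edited_image = pipe(
    "make the mountains snowy",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how closely the output should stay to the input image
    guidance_scale=7.0,        # how strongly the edit instruction is followed
).images[0]
edited_image.save("snowy_mountains.png")
```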
If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd).

View File

@@ -12,11 +12,10 @@ specific language governing permissions and limitations under the License.
# Low-Rank Adaptation of Large Language Models (LoRA)
[[open-in-colab]]
<Tip warning={true}>
Currently, LoRA is only supported for the attention layers of the [`UNet2DConditionModel`].
Currently, LoRA is only supported for the attention layers of the [`UNet2DConditionModel`]. We also
support fine-tuning the text encoder for DreamBooth with LoRA in a limited capacity. Fine-tuning the text encoder for DreamBooth generally yields better results, but it can increase compute usage.
</Tip>
@@ -50,7 +49,9 @@ Finetuning a model like Stable Diffusion, which has billions of parameters, can
Let's finetune [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) on the [Pokémon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own Pokémon.
To start, make sure you have the `MODEL_NAME` and `DATASET_NAME` environment variables set. The `OUTPUT_DIR` and `HUB_MODEL_ID` variables are optional and specify where to save the model to on the Hub:
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. You'll also need to set the `DATASET_NAME` environment variable to the name of the dataset you want to train on. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
The `OUTPUT_DIR` and `HUB_MODEL_ID` variables are optional and specify where to save the model to on the Hub:
```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
@@ -65,7 +66,7 @@ There are some flags to be aware of before you start training:
* `--report_to=wandb` reports and logs the training results to your Weights & Biases dashboard (as an example, take a look at this [report](https://wandb.ai/pcuenq/text2image-fine-tune/runs/b4k1w0tn?workspace=user-pcuenq)).
* `--learning_rate=1e-04`, you can afford to use a higher learning rate than you normally would with LoRA.
Now you're ready to launch the training (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)):
Now you're ready to launch the training (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)). Training takes about 5 hours on a 2080 Ti GPU with 11GB of VRAM, and it'll create and save model checkpoints and the `pytorch_lora_weights` in your repository.
```bash
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
@@ -111,7 +112,7 @@ Load the LoRA weights from your finetuned model *on top of the base model weight
</Tip>
```py
>>> pipe.unet.load_attn_procs(model_path)
>>> pipe.unet.load_attn_procs(lora_model_path)
>>> pipe.to("cuda")
# use half the weights from the LoRA finetuned model and half the weights from the base model
@@ -124,6 +125,26 @@ Load the LoRA weights from your finetuned model *on top of the base model weight
>>> image.save("blue_pokemon.png")
```
<Tip>
If you are loading the LoRA parameters from the Hub and if the Hub repository has
a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then
you can do:
```py
from huggingface_hub.repocard import RepoCard
lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4"
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
...
```
</Tip>
## DreamBooth
[DreamBooth](https://arxiv.org/abs/2208.12242) is a finetuning technique for personalizing a text-to-image model like Stable Diffusion to generate photorealistic images of a subject in different contexts, given a few images of the subject. However, DreamBooth is very sensitive to hyperparameters and it is easy to overfit. Some important hyperparameters to consider include those that affect the training time (learning rate, number of training steps), and inference time (number of steps, scheduler type).
@@ -136,9 +157,11 @@ Load the LoRA weights from your finetuned model *on top of the base model weight
### Training[[dreambooth-training]]
Let's finetune [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) with DreamBooth and LoRA with some 🐶 [dog images](https://drive.google.com/drive/folders/1BO_dyz-p65qhBRRMRA4TbZ8qW4rB99JZ). Download and save these images to a directory.
Let's finetune [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) with DreamBooth and LoRA with some 🐶 [dog images](https://drive.google.com/drive/folders/1BO_dyz-p65qhBRRMRA4TbZ8qW4rB99JZ). Download and save these images to a directory. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
To start, make sure you have the `MODEL_NAME` and `INSTANCE_DIR` (path to directory containing images) environment variables set. The `OUTPUT_DIR` variables is optional and specifies where to save the model to on the Hub:
To start, specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. You'll also need to set `INSTANCE_DIR` to the path of the directory containing the images.
The `OUTPUT_DIR` variable is optional and specifies where to save the model on the Hub:
```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
@@ -152,7 +175,11 @@ There are some flags to be aware of before you start training:
* `--report_to=wandb` reports and logs the training results to your Weights & Biases dashboard (as an example, take a look at this [report](https://wandb.ai/pcuenq/text2image-fine-tune/runs/b4k1w0tn?workspace=user-pcuenq)).
* `--learning_rate=1e-04`, you can afford to use a higher learning rate than you normally would with LoRA.
Now you're ready to launch the training (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py)):
Now you're ready to launch the training (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py)). The script creates and saves model checkpoints and the `pytorch_lora_weights.bin` file in your repository.
It's also possible to additionally fine-tune the text encoder with LoRA. In most cases, this leads to better results with a slight increase in compute. To allow fine-tuning the text encoder with LoRA, specify the `--train_text_encoder` argument when launching the `train_dreambooth_lora.py` script.
```bash
accelerate launch train_dreambooth_lora.py \
@@ -197,7 +224,7 @@ Load the LoRA weights from your finetuned DreamBooth model *on top of the base m
</Tip>
```py
>>> pipe.unet.load_attn_procs(model_path)
>>> pipe.unet.load_attn_procs(lora_model_path)
>>> pipe.to("cuda")
# use half the weights from the LoRA finetuned model and half the weights from the base model
@@ -212,3 +239,114 @@ Load the LoRA weights from your finetuned DreamBooth model *on top of the base m
>>> image = pipe("A picture of a sks dog in a bucket.", num_inference_steps=25, guidance_scale=7.5).images[0]
>>> image.save("bucket-dog.png")
```
If you used `--train_text_encoder` during training, then use `pipe.load_lora_weights()` to load the LoRA
weights. For example:
```python
from huggingface_hub.repocard import RepoCard
from diffusers import StableDiffusionPipeline
import torch
lora_model_id = "sayakpaul/dreambooth-text-encoder-test"
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.load_lora_weights(lora_model_id)
image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
```
<Tip>
If your LoRA parameters involve the UNet as well as the Text Encoder, then passing
`cross_attention_kwargs={"scale": 0.5}` will apply the `scale` value to both the UNet
and the Text Encoder.
</Tip>
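For example, continuing the snippet above (a minimal sketch, assuming `pipe` already has the LoRA weights loaded):
```py
# apply the LoRA parameters at half strength to both the UNet and the text encoder
image = pipe(
    "A picture of a sks dog in a bucket",
    num_inference_steps=25,
    cross_attention_kwargs={"scale": 0.5},
).images[0]
image.save("bucket-dog-half-lora.png")
```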
Note that the use of [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] is preferred to [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`] for loading LoRA parameters. This is because
[`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] can handle the following situations:
* LoRA parameters that don't have separate identifiers for the UNet and the text encoder (such as [`"patrickvonplaten/lora_dreambooth_dog_example"`](https://huggingface.co/patrickvonplaten/lora_dreambooth_dog_example)). So, you can just do:
```py
pipe.load_lora_weights(lora_model_path)
```
* LoRA parameters that have separate identifiers for the UNet and the text encoder such as: [`"sayakpaul/dreambooth"`](https://huggingface.co/sayakpaul/dreambooth).
**Note** that it is possible to provide a local directory path to [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] as well as [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`]. To learn about the supported inputs, refer to the respective docstrings.
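For instance, a small sketch with placeholder local paths:
```py
# both loaders also accept a local directory created by the training scripts (paths are placeholders)
pipe.load_lora_weights("./lora_output_dir")
# or, to load only the UNet attention processors:
pipe.unet.load_attn_procs("./lora_output_dir")
```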
## Supporting A1111 themed LoRA checkpoints from Diffusers
To provide seamless interoperability with A1111 to our users, we support loading A1111 formatted
LoRA checkpoints using [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] in a limited capacity.
In this section, we explain how to load an A1111 formatted LoRA checkpoint from [CivitAI](https://civitai.com/)
in Diffusers and perform inference with it.
First, download a checkpoint. We'll use
[this one](https://civitai.com/models/13239/light-and-shadow) for demonstration purposes.
```bash
wget https://civitai.com/api/download/models/15603 -O light_and_shadow.safetensors
```
Next, we initialize a [`~DiffusionPipeline`]:
```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
pipeline = StableDiffusionPipeline.from_pretrained(
"gsdf/Counterfeit-V2.5", torch_dtype=torch.float16, safety_checker=None
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
pipeline.scheduler.config, use_karras_sigmas=True
)
```
We then load the checkpoint downloaded from CivitAI:
```python
pipeline.load_lora_weights(".", weight_name="light_and_shadow.safetensors")
```
<Tip warning={true}>
If you're loading a checkpoint in the `safetensors` format, please ensure you have `safetensors` installed.
</Tip>
Then it's time to run inference:
```python
prompt = "masterpiece, best quality, 1girl, at dusk"
negative_prompt = ("(low quality, worst quality:1.4), (bad anatomy), (inaccurate limb:1.2), "
"bad composition, inaccurate eyes, extra digit, fewer digits, (extra arms:1.2), large breasts")
images = pipeline(prompt=prompt,
negative_prompt=negative_prompt,
width=512,
height=768,
num_inference_steps=15,
num_images_per_prompt=4,
generator=torch.manual_seed(0)
).images
```
Below is a comparison between the LoRA and the non-LoRA results:
![lora_non_lora](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lora_non_lora_comparison.png)
If you have a similar checkpoint stored on the Hugging Face Hub, you can load it
directly with [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] like so:
```python
lora_model_id = "sayakpaul/civitai-light-shadow-lora"
lora_filename = "light_and_shadow.safetensors"
pipeline.load_lora_weights(lora_model_id, weight_name=lora_filename)
```

View File

@@ -39,6 +39,8 @@ Training examples show how to pretrain or fine-tune diffusion models for a varie
- [Dreambooth](./dreambooth)
- [LoRA Support](./lora)
- [ControlNet](./controlnet)
- [InstructPix2Pix](./instructpix2pix)
- [Custom Diffusion](./custom_diffusion)
If possible, please [install xFormers](../optimization/xformers) for memory efficient attention. This could help make your training faster and less memory intensive.
@@ -50,6 +52,8 @@ If possible, please [install xFormers](../optimization/xformers) for memory effi
| [**Dreambooth**](./dreambooth) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb)
| [**Training with LoRA**](./lora) | ✅ | - | - |
| [**ControlNet**](./controlnet) | ✅ | ✅ | - |
| [**InstructPix2Pix**](./instructpix2pix) | ✅ | ✅ | - |
| [**Custom Diffusion**](./custom_diffusion) | ✅ | ✅ | - |
## Community

View File

@@ -72,15 +72,29 @@ To load a checkpoint to resume training, pass the argument `--resume_from_checkp
<frameworkcontent>
<pt>
Launch the [PyTorch training script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) for a fine-tuning run on the [Pokémon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset like this:
Launch the [PyTorch training script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) for a fine-tuning run on the [Pokémon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset like this.
<literalinclude>
{"path": "../../../../examples/text_to_image/README.md",
"language": "bash",
"start-after": "accelerate_snippet_start",
"end-before": "accelerate_snippet_end",
"dedent": 0}
</literalinclude>
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument.
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$dataset_name \
--use_ema \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-pokemon-model" \
--push_to_hub
```
To finetune on your own dataset, prepare the dataset according to the format required by 🤗 [Datasets](https://huggingface.co/docs/datasets/index). You can [upload your dataset to the Hub](https://huggingface.co/docs/datasets/image_dataset#upload-dataset-to-the-hub), or you can [prepare a local folder with your files](https://huggingface.co/docs/datasets/image_dataset#imagefolder).
@@ -103,9 +117,38 @@ accelerate launch train_text_to_image.py \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir=${OUTPUT_DIR}
--lr_scheduler="constant"
--lr_warmup_steps=0 \
--output_dir=${OUTPUT_DIR} \
--push_to_hub
```
#### Training with multiple GPUs
`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
for running distributed training with `accelerate`. Here is an example command:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"
accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$dataset_name \
--use_ema \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--output_dir="sd-pokemon-model" \
--push_to_hub
```
</pt>
<jax>
With Flax, it's possible to train a Stable Diffusion model faster on TPUs and GPUs thanks to [@duongna211](https://github.com/duongna21). This is very efficient on TPU hardware but works great on GPUs too. The Flax training script doesn't support features like gradient checkpointing or gradient accumulation yet, so you'll need a GPU with at least 30GB of memory or a TPU v3.
@@ -116,6 +159,8 @@ Before running the script, make sure you have the requirements installed:
pip install -U -r requirements_flax.txt
```
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument.
Now you can launch the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_flax.py) like this:
```bash
@@ -130,7 +175,8 @@ python train_text_to_image_flax.py \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--output_dir="sd-pokemon-model"
--output_dir="sd-pokemon-model" \
--push_to_hub
```
To finetune on your own dataset, prepare the dataset according to the format required by 🤗 [Datasets](https://huggingface.co/docs/datasets/index). You can [upload your dataset to the Hub](https://huggingface.co/docs/datasets/image_dataset#upload-dataset-to-the-hub), or you can [prepare a local folder with your files](https://huggingface.co/docs/datasets/image_dataset#imagefolder).
@@ -150,11 +196,34 @@ python train_text_to_image_flax.py \
--max_train_steps=15000 \
--learning_rate=1e-05 \
--max_grad_norm=1 \
--output_dir="sd-pokemon-model"
--output_dir="sd-pokemon-model" \
--push_to_hub
```
</jax>
</frameworkcontent>
## Training with Min-SNR weighting
We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556) which helps to achieve faster convergence
by rebalancing the loss. In order to use it, one needs to set the `--snr_gamma` argument. The recommended
value when using it is 5.0.
You can find [this project on Weights and Biases](https://wandb.ai/sayakpaul/text2image-finetune-minsnr) that compares the loss surfaces of the following setups:
* Training without the Min-SNR weighting strategy
* Training with the Min-SNR weighting strategy (`snr_gamma` set to 5.0)
* Training with the Min-SNR weighting strategy (`snr_gamma` set to 1.0)
For our small Pokémon dataset, the effects of the Min-SNR weighting strategy might not appear pronounced, but we believe they will be more evident for larger datasets.
Also, note that in this example, we either predict `epsilon` (i.e., the noise) or `v_prediction`. In both cases, the formulation of the Min-SNR weighting strategy we have used holds.
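For intuition, here is a rough sketch of the per-timestep loss weight this strategy implies; the exact computation lives in the training script and may differ in details:
```py
import torch

def min_snr_loss_weights(snr: torch.Tensor, snr_gamma: float, prediction_type: str) -> torch.Tensor:
    # clamp the SNR at snr_gamma so that low-noise timesteps don't dominate the loss
    clipped_snr = torch.clamp(snr, max=snr_gamma)
    if prediction_type == "epsilon":
        return clipped_snr / snr
    if prediction_type == "v_prediction":
        return clipped_snr / (snr + 1)
    raise ValueError(f"unsupported prediction type: {prediction_type}")
```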
<Tip warning={true}>
Training with Min-SNR weighting strategy is only supported in PyTorch.
</Tip>
## LoRA
You can also use Low-Rank Adaptation of Large Language Models (LoRA), a fine-tuning technique for accelerating training large models, for fine-tuning text-to-image models. For more details, take a look at the [LoRA training](lora#text-to-image) guide.

View File

@@ -14,8 +14,6 @@ specific language governing permissions and limitations under the License.
# Textual Inversion
[[open-in-colab]]
[Textual Inversion](https://arxiv.org/abs/2208.01618) is a technique for capturing novel concepts from a small number of example images. While the technique was originally demonstrated with a [latent diffusion model](https://github.com/CompVis/latent-diffusion), it has since been applied to other model variants like [Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/conceptual/stable_diffusion). The learned concepts can be used to better control the images generated from text-to-image pipelines. It learns new "words" in the text encoder's embedding space, which are used within text prompts for personalized image generation.
![Textual Inversion example](https://textual-inversion.github.io/static/images/editing/colorful_teapot.JPG)
@@ -81,9 +79,20 @@ To resume training from a saved checkpoint, pass the following argument to the t
## Finetuning
For your training dataset, download these [images of a cat statue](https://drive.google.com/drive/folders/1fmJMs25nxS_rSNqS5hTcRdLem_YQXbq5) and store them in a directory.
For your training dataset, download these [images of a cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
Set the `MODEL_NAME` environment variable to the model repository id, and the `DATA_DIR` environment variable to the path of the directory containing the images. Now you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py):
```py
from huggingface_hub import snapshot_download
local_dir = "./cat"
snapshot_download(
"diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
)
```
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument, and the `DATA_DIR` environment variable to the path of the directory containing the images.
Now you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py). The script creates and saves the following files to your repository: `learned_embeds.bin`, `token_identifier.txt`, and `type_of_concept.txt`.
<Tip>
@@ -95,7 +104,7 @@ Set the `MODEL_NAME` environment variable to the model repository id, and the `D
<pt>
```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="path-to-dir-containing-images"
export DATA_DIR="./cat"
accelerate launch textual_inversion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
@@ -109,8 +118,21 @@ accelerate launch textual_inversion.py \
--learning_rate=5.0e-04 --scale_lr \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--output_dir="textual_inversion_cat"
--output_dir="textual_inversion_cat" \
--push_to_hub
```
<Tip>
💡 If you want to increase the trainable capacity, you can associate your placeholder token, *e.g.* `<cat-toy>`, with multiple embedding vectors. This can help the model better capture the style of more complex images.
To enable training multiple embedding vectors, simply pass:
```bash
--num_vectors=5
```
</Tip>
</pt>
<jax>
If you have access to TPUs, try out the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py) to train even faster (this'll also work for GPUs). With the same configuration settings, the Flax training script should be at least 70% faster than the PyTorch training script! ⚡️
@@ -121,11 +143,13 @@ Before you begin, make sure you install the Flax specific dependencies:
pip install -U -r requirements_flax.txt
```
Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument.
Then you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py):
```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export DATA_DIR="path-to-dir-containing-images"
export DATA_DIR="./cat"
python textual_inversion_flax.py \
--pretrained_model_name_or_path=$MODEL_NAME \
@@ -136,7 +160,8 @@ python textual_inversion_flax.py \
--train_batch_size=1 \
--max_train_steps=3000 \
--learning_rate=5.0e-04 --scale_lr \
--output_dir="textual_inversion_cat"
--output_dir="textual_inversion_cat" \
--push_to_hub
```
</jax>
</frameworkcontent>
@@ -157,24 +182,61 @@ If you're interested in following along with your model training progress, you c
## Inference
Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`]. Make sure you include the `placeholder_token` in your prompt, in this case, it is `<cat-toy>`.
Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`].
The textual inversion script will by default only save the textual inversion embedding vector(s) that have
been added to the text encoder embedding matrix and consequently been trained.
<frameworkcontent>
<pt>
<Tip>
💡 The community has created a large library of different textual inversion embedding vectors, called [sd-concepts-library](https://huggingface.co/sd-concepts-library).
Instead of training textual inversion embeddings from scratch, you can also check whether a fitting textual inversion embedding has already been added to the library.
</Tip>
To load the textual inversion embeddings, you first need to load the base model that was used when training your textual inversion embedding vectors. Here we assume that [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) was used as a base model, so we load it first:
```python
from diffusers import StableDiffusionPipeline
import torch
model_id = "path-to-your-trained-model"
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
```
Next, we need to load the textual inversion embedding vector which can be done via the [`TextualInversionLoaderMixin.load_textual_inversion`]
function. Here we'll load the embeddings of the "<cat-toy>" example from before.
```python
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
```
Now we can run the pipeline making sure that the placeholder token `<cat-toy>` is used in our prompt.
```python
prompt = "A <cat-toy> backpack"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image = pipe(prompt, num_inference_steps=50).images[0]
image.save("cat-backpack.png")
```
The function [`TextualInversionLoaderMixin.load_textual_inversion`] can not only
load textual embedding vectors saved in Diffusers' format, but also embedding vectors
saved in [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) format.
To do so, you can first download an embedding vector from [civitAI](https://civitai.com/models/3036?modelVersionId=8387)
and then load it locally:
```python
pipe.load_textual_inversion("./charturnerv2.pt")
```
</pt>
<jax>
Currently, there is no `load_textual_inversion` function for Flax, so you have to make sure the textual inversion embedding vector is saved as part of the model after training.
The model can then be run just like any other Flax model:
```python
import jax
import numpy as np
@@ -183,7 +245,7 @@ from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline
model_path = "path-to-your-trained-model"
pipe, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)
prompt = "A <cat-toy> backpack"
prng_seed = jax.random.PRNGKey(0)

View File

@@ -74,7 +74,9 @@ The full training state is saved in a subfolder in the `output_dir` every 500 st
## Finetuning
You're ready to launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) now! Specify the dataset name to finetune on with the `--dataset_name` argument and then save it to the path in `--output_dir`.
You're ready to launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) now! Specify the dataset name to finetune on with the `--dataset_name` argument and then save it to the path in `--output_dir`. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide.
The training script creates and saves a `diffusion_pytorch_model.bin` file in your repository.
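After training, you can load the saved pipeline for inference. A minimal sketch, with a placeholder path in place of your `--output_dir` or Hub repository id:
```py
from diffusers import DDPMPipeline

# the training script saves a full pipeline (UNet + scheduler) to the output directory
pipeline = DDPMPipeline.from_pretrained("path_to_saved_model")
image = pipeline().images[0]
image.save("generated.png")
```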
<Tip>
@@ -122,80 +124,23 @@ accelerate launch train_unconditional.py \
<img src="https://user-images.githubusercontent.com/26864830/180248200-928953b4-db38-48db-b0c6-8b740fe6786f.png"/>
</div>
## Finetuning with your own data
### Training with multiple GPUs
There are two ways to finetune a model on your own dataset:
- provide your own folder of images to the `--train_data_dir` argument
- upload your dataset to the Hub and pass the dataset repository id to the `--dataset_name` argument.
<Tip>
💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide.
</Tip>
Below, we explain both in more detail.
### Provide the dataset as a folder
If you provide your own dataset as a folder, the script expects the following directory structure:
`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
for running distributed training with `accelerate`. Here is an example command:
```bash
data_dir/xxx.png
data_dir/xxy.png
data_dir/[...]/xxz.png
```
```bash
accelerate launch --mixed_precision="fp16" --multi_gpu train_unconditional.py \
--dataset_name="huggan/pokemon" \
--resolution=64 --center_crop --random_flip \
--output_dir="ddpm-ema-pokemon-64" \
--train_batch_size=16 \
--num_epochs=100 \
--gradient_accumulation_steps=1 \
--use_ema \
--learning_rate=1e-4 \
--lr_warmup_steps=500 \
--mixed_precision="fp16" \
--logger="wandb" \
--push_to_hub
```
Pass the path to the folder containing the images to the `--train_data_dir` argument and launch the training:
```bash
accelerate launch train_unconditional.py \
--train_data_dir <path-to-train-directory> \
<other-arguments>
```
Internally, the script uses the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature to automatically build a dataset from the folder.
### Upload your data to the Hub
<Tip>
💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the [Image search with 🤗 Datasets](https://huggingface.co/blog/image-search-datasets) post.
</Tip>
To upload your dataset to the Hub, you can start by creating one with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images, from 🤗 Datasets:
```python
from datasets import load_dataset
# example 1: local folder
dataset = load_dataset("imagefolder", data_dir="path_to_your_folder")
# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset("imagefolder", data_files="path_to_zip_file")
# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset(
"imagefolder",
data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip",
)
# example 4: providing several splits
dataset = load_dataset(
"imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]}
)
```
Then you can use the [`~datasets.Dataset.push_to_hub`] method to upload it to the Hub:
```python
# assuming you have run the huggingface-cli login command in a terminal
dataset.push_to_hub("name_of_your_dataset")
# if you want to push to a private repo, simply pass private=True:
dataset.push_to_hub("name_of_your_dataset", private=True)
```
Now train your model by simply setting the `--dataset_name` argument to the name of your dataset on the Hub.

View File

@@ -26,8 +26,9 @@ This tutorial will teach you how to train a [`UNet2DModel`] from scratch on a su
Before you begin, make sure you have 🤗 Datasets installed to load and preprocess image datasets, and 🤗 Accelerate, to simplify training on any number of GPUs. The following command will also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training).
```bash
!pip install diffusers[training]
```py
# uncomment to install the necessary libraries in Colab
#!pip install diffusers[training]
```
We encourage you to share your model with the community, and in order to do that, you'll need to login to your Hugging Face account (create one [here](https://hf.co/join) if you don't already have one!). You can login from a notebook and enter your token when prompted:
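For example, one way to log in from a notebook is with `huggingface_hub`'s `notebook_login` (shown here as a sketch):
```py
from huggingface_hub import notebook_login

# prompts for your Hugging Face token inside the notebook
notebook_login()
```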
@@ -312,7 +313,7 @@ Now you can wrap all these components together in a training loop with 🤗 Acce
... mixed_precision=config.mixed_precision,
... gradient_accumulation_steps=config.gradient_accumulation_steps,
... log_with="tensorboard",
... logging_dir=os.path.join(config.output_dir, "logs"),
... project_dir=os.path.join(config.output_dir, "logs"),
... )
... if accelerator.is_main_process:
... if config.push_to_hub:
@@ -344,7 +345,7 @@ Now you can wrap all these components together in a training loop with 🤗 Acce
... # Sample a random timestep for each image
... timesteps = torch.randint(
... 0, noise_scheduler.num_train_timesteps, (bs,), device=clean_images.device
... 0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device
... ).long()
... # Add noise to the clean images according to the noise magnitude at each timestep
@@ -407,9 +408,9 @@ Once training is complete, take a look at the final 🦋 images 🦋 generated b
## Next steps
Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques by visiting the [🧨 Diffusers Training Examples](./training/overview) page. Here are some examples of what you can learn:
Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques by visiting the [🧨 Diffusers Training Examples](../training/overview) page. Here are some examples of what you can learn:
* [Textual Inversion](./training/text_inversion), an algorithm that teaches a model a specific visual concept and integrates it into the generated image.
* [DreamBooth](./training/dreambooth), a technique for generating personalized images of a subject given several input images of the subject.
* [Guide](./training/text2image) to finetuning a Stable Diffusion model on your own dataset.
* [Guide](./training/lora) to using LoRA, a memory-efficient technique for finetuning really large models faster.
* [Textual Inversion](../training/text_inversion), an algorithm that teaches a model a specific visual concept and integrates it into the generated image.
* [DreamBooth](../training/dreambooth), a technique for generating personalized images of a subject given several input images of the subject.
* [Guide](../training/text2image) to finetuning a Stable Diffusion model on your own dataset.
* [Guide](../training/lora) to using LoRA, a memory-efficient technique for finetuning really large models faster.

Some files were not shown because too many files have changed in this diff.