Compare commits

...

378 Commits

Author SHA1 Message Date
Sayak Paul
9858053bfe Release: v0.21.3 2023-09-27 22:10:50 +05:30
Patrick von Platen
6a3301fe34 resolve conflicts. 2023-09-27 22:09:04 +05:30
Patrick von Platen
813a1b2ee0 Fix one more 2023-09-19 00:09:30 +02:00
Patrick von Platen
a43b8574a9 Patch release: v0.21.2 2023-09-19 00:06:16 +02:00
Sayak Paul
a2f0db52e3 [LoRA] don't break offloading for incompatible lora ckpts. (#5085)
* don't break offloading for incompatible lora ckpts.

* debugging

* better condition.

* fix

* fix

* fix

* fix

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-19 00:05:34 +02:00
Will Berman
92f6693b37 remove unused adapter weights in constructor (#5088)
remove adapter weights in MultiAdapter constructor
2023-09-19 00:05:28 +02:00
Will Berman
932897afa8 t2i Adapter community member fix (#5090)
* convert tensorrt controlnet

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix number controlnet condition

* Add convert SD XL to onnx

* Add convert SD XL to tensorrt

* Add convert SD XL to tensorrt

* Add examples in comments

* Add examples in comments

* Add test onnx controlnet

* Add tensorrt test

* Remove copied

* Move file test to examples/community

* Remove script

* Remove script

* Remove text

* Fix import

* Fix T2I MultiAdapter

* fix tests

---------

Co-authored-by: dotieuthien <thien.do@mservice.com.vn>
Co-authored-by: dotieuthien <dotieuthien9997@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: dotieuthien <hades@cinnamon.is>
2023-09-19 00:05:20 +02:00
Patrick von Platen
c2940434d0 [Textual inversion] Refactor textual inversion to make it cleaner (#5076)
* [Textual inversion] Clean loading

* [Textual inversion] Clean loading

* [Textual inversion] Clean up

* [Textual inversion] Clean up

* [Textual inversion] Clean up

* [Textual inversion] Clean up
2023-09-19 00:04:53 +02:00
Patrick von Platen
60ab8fad16 Patch release: v0.21.1 2023-09-14 13:06:57 +02:00
Patrick von Platen
d17240457f [Import] Add missing settings / Correct some dummy imports (#5036)
* [Import] Add missing settings

* up

* up

* up
2023-09-14 12:47:55 +02:00
Vladimir Mandic
7512fc4df5 allow loading of sd models from safetensors without online lookups using local config files (#5019)
finish config_files implementation
2023-09-14 12:47:41 +02:00
Patrick von Platen
0c2f1ccc97 [Import] Don't force transformers to be installed (#5035)
* [Import] Don't force transformers to be installed

* make style
2023-09-14 12:47:34 +02:00
Dhruv Nair
47f2d2c7be Fix model offload bug when key isn't present (#5030)
* fix model offload bug when key isn't present

* make style
2023-09-14 12:47:25 +02:00
Patrick von Platen
af85591593 Patch release: v0.21.1 2023-09-14 12:46:39 +02:00
Patrick von Platen
29f15673ed Release: v0.21.0 2023-09-13 15:58:24 +02:00
Patrick von Platen
1037287e2b examples fix t2i training (#5001)
* examples fix t2i training

* make style
2023-09-12 23:52:41 +02:00
Steven Liu
6ea95b7a90 Fix PR template (#4984)
fix template
2023-09-12 19:36:38 +02:00
Patrick von Platen
0e0db625d0 Fix safety checker seq offload (#4998)
* fix safety checker

* fix safety checker

* fix safety checker
2023-09-12 18:56:35 +02:00
dg845
1f948109b8 [docs] Fix DiffusionPipeline.enable_sequential_cpu_offload docstring (#4952)
* Fix an unmatched backtick and make description more general for DiffusionPipeline.enable_sequential_cpu_offload.

* make style

* _exclude_from_cpu_offload -> self._exclude_from_cpu_offload

* make style

* apply suggestions from review

* make style
2023-09-12 08:58:47 -07:00
Patrick von Platen
37cb819df5 [Lora] Speed up lora loading (#4994)
* speed up lora loading

* Apply suggestions from code review

* up

* up

* Fix more

* Correct more

* Apply suggestions from code review

* up

* Fix more

* Fix more -

* up

* up
2023-09-12 17:51:15 +02:00
Dhruv Nair
f64d52dbca fix custom diffusion tests (#4996) 2023-09-12 17:50:47 +02:00
Dhruv Nair
4d897aaff5 fix image variation slow test (#4995)
fix image variation tests

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-12 17:45:47 +02:00
Patrick von Platen
b1105269b7 make style 2023-09-12 14:55:27 +00:00
Kashif Rasul
5d28d2217f [Wuerstchen] fix combined pipeline's num_images_per_prompt (#4989)
* fix encode_prompt

* added prompt_embeds and negative_prompt_embeds

* prompt_embeds for the prior only
2023-09-12 16:55:13 +02:00
Kashif Rasul
73bf620dec fix E721 Do not compare types, use isinstance() (#4992) 2023-09-12 16:52:25 +02:00
Dhruv Nair
c806f2fad6 remove extra gligen in import (#4987) 2023-09-12 18:35:29 +05:30
Patrick von Platen
18b7264bd0 [Utils] Correct custom init sort (#4967)
* [Utils] Correct custom init sort

* [Utils] Correct custom init sort

* [Utils] Correct custom init sort

* add type checking

* fix custom init sort

* fix test

* fix tests

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-09-12 11:05:53 +02:00
zhiqiang
d82157b3ce [Bug Fix] Should pass the dtype instead of torch_dtype (#4917)
.
2023-09-11 19:45:58 +02:00
Patrick von Platen
93579650f8 Refactor model offload (#4514)
* [Draft] Refactor model offload

* [Draft] Refactor model offload

* Apply suggestions from code review

* cpu offlaod updates

* remove model cpu offload from individual pipelines

* add hook to offload models to cpu

* clean up

* model offload

* add model cpu offload string

* make style

* clean up

* fixes for offload issues

* fix tests issues

* resolve merge conflicts

* update src/diffusers/pipelines/pipeline_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* make style

* Update src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-09-11 19:39:26 +02:00
Kashif Rasul
16a056a7b5 Wuerstchen fixes (#4942)
* fix arguments and make example code work

* change arguments in combined test

* Add default timesteps

* style

* fixed test

* fix broken test

* formatting

* fix docstrings

* fix  num_images_per_prompt

* fix doc styles

* please dont change this

* fix tests

* rename to DEFAULT_STAGE_C_TIMESTEPS

---------

Co-authored-by: Dominic Rampas <d6582533@gmail.com>
2023-09-11 15:47:53 +02:00
Patrick von Platen
6c6a246461 Update README.md (#4973)
Add monthly pip installs
2023-09-11 15:45:19 +02:00
Patrick von Platen
6bbee1048b Make sure Flax pipelines can be loaded into PyTorch (#4971)
* Make sure Flax pipelines can be loaded into PyTorch

* add test

* Update src/diffusers/pipelines/pipeline_utils.py
2023-09-11 12:03:49 +02:00
Patrick von Platen
2c60f7d14e [Core] Remove TF import checks (#4968)
[TF] Remove tf
2023-09-11 11:22:40 +02:00
Dhruv Nair
b6e0b016ce Lazy Import for Diffusers (#4829)
* initial commit

* move modules to import struct

* add dummy objects and _LazyModule

* add lazy import to schedulers

* clean up unused imports

* lazy import on models module

* lazy import for schedulers module

* add lazy import to pipelines module

* lazy import altdiffusion

* lazy import audio diffusion

* lazy import audioldm

* lazy import consistency model

* lazy import controlnet

* lazy import dance diffusion ddim ddpm

* lazy import deepfloyd

* lazy import kandinksy

* lazy imports

* lazy import semantic diffusion

* lazy imports

* lazy import stable diffusion

* move sd output to its own module

* clean up

* lazy import t2iadapter

* lazy import unclip

* lazy import versatile and vq diffsuion

* lazy import vq diffusion

* helper to fetch objects from modules

* lazy import sdxl

* lazy import txt2vid

* lazy import stochastic karras

* fix model imports

* fix bug

* lazy import

* clean up

* clean up

* fixes for tests

* fixes for tests

* clean up

* remove import of torch_utils from utils module

* clean up

* clean up

* fix mistake import statement

* dedicated modules for exporting and loading

* remove testing utils from utils module

* fixes from  merge conflicts

* Update src/diffusers/pipelines/kandinsky2_2/__init__.py

* fix docs

* fix alt diffusion copied from

* fix check dummies

* fix more docs

* remove accelerate import from utils module

* add type checking

* make style

* fix check dummies

* remove torch import from xformers check

* clean up error message

* fixes after upstream merges

* dummy objects fix

* fix tests

* remove unused module import

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-11 09:56:22 +02:00
Sayak Paul
88735249da [Docs] fix: minor formatting in the Würstchen docs (#4965)
fix: minor formatting in the docs
2023-09-11 09:12:53 +02:00
Will Berman
4191ddee11 Revert revert and install accelerate main (#4963)
* Revert "Temp Revert "[Core] better support offloading when side loading is enabled… (#4927)"

This reverts commit 2ab170499e.

* tests: install accelerate from main
2023-09-11 08:49:46 +02:00
Will Berman
2ab170499e Temp Revert "[Core] better support offloading when side loading is enabled… (#4927)
Revert "[Core] better support offloading when side loading is enabled. (#4855)"

This reverts commit e4b8e7928b.
2023-09-08 19:54:59 -07:00
Sayak Paul
914c513ee0 [Docs] add t2i adapter entry to overview of training scripts. (#4946)
add t2i adapter entry to overview of training scripts.
2023-09-09 06:52:11 +05:30
Will Berman
d73e6ad050 guard save model hooks to only execute on main process (#4929) 2023-09-08 10:30:06 -07:00
Sayak Paul
d0cf681a1f [Tests] add: tests for t2i adapter training. (#4947)
add: tests for t2i adapter training.
2023-09-08 19:45:39 +05:30
Suraj Patil
dfec61f4b3 [examples] T2IAdapter training script (#4934)
* add t2i_example script

* remove in channels logic

* remove comments

* remove use_euler arg

* add requirements

* only use canny example

* use datasets

* comments

* make log_validation consistent with other scripts

* add readme

* fix title in readme

* update check_min_version

* change a few minor things.

* add doc entry

* add: test for t2i adapter training

* remove use_auth_token

* fix: logged info.

* remove tests for now.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-09-08 10:03:02 +05:30
Suraj Patil
0ec7a02b6a [StableDiffusionXLAdapterPipeline] allow negative micro conds (#4941)
* allow negative micro conds in t2i pipeline

* Empty-Commit

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-09-08 07:59:42 +05:30
Suraj Patil
626284f8d1 [StableDiffusionXLAdapterPipeline] add adapter_conditioning_factor (#4937)
add adapter_conditioning_factor
2023-09-07 19:05:28 +02:00
Sayak Paul
9800cc5ece [InstructPix2Pix] Fix pipeline implementation and add docs (#4844)
* initial evident fixes.

* instructpix2pix fixes.

* add: entry to doc.

* address PR feedback.

* make fix-copies
2023-09-07 15:34:19 +05:30
Kashif Rasul
541bb6ee63 Würstchen model (#3849)
* initial

* initial

* added initial convert script for paella vqmodel

* initial wuerstchen pipeline

* add LayerNorm2d

* added modules

* fix typo

* use model_v2

* embed clip caption amd negative_caption

* fixed name of var

* initial modules in one place

* WuerstchenPriorPipeline

* inital shape

* initial denoising prior loop

* fix output

* add WuerstchenPriorPipeline to __init__.py

* use the noise ratio in the Prior

* try to save pipeline

* save_pretrained working

* Few additions

* add _execution_device

* shape is int

* fix batch size

* fix shape of ratio

* fix shape of ratio

* fix output dataclass

* tests folder

* fix formatting

* fix float16 + started with generator

* Update pipeline_wuerstchen.py

* removed vqgan code

* add WuerstchenGeneratorPipeline

* fix WuerstchenGeneratorPipeline

* fix docstrings

* fix imports

* convert generator pipeline

* fix convert

* Work on Generator Pipeline. WIP

* Pipeline works with our diffuzz code

* apply scale factor

* removed vqgan.py

* use cosine schedule

* redo the denoising loop

* Update src/diffusers/models/resnet.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* use torch.lerp

* use warp-diffusion org

* clip_sample=False,

* some refactoring

* use model_v3_stage_c

* c_cond size

* use clip-bigG

* allow stage b clip to be None

* add dummy

* würstchen scheduler

* minor changes

* set clip=None in the pipeline

* fix attention mask

* add attention_masks to text_encoder

* make fix-copies

* add back clip

* add text_encoder

* gen_text_encoder and tokenizer

* fix import

* updated pipeline test

* undo changes to pipeline test

* nip

* fix typo

* fix output name

* set guidance_scale=0 and remove diffuze

* fix doc strings

* make style

* nip

* removed unused

* initial docs

* rename

* toc

* cleanup

* remvoe test script

* fix-copies

* fix multi images

* remove dup

* remove unused modules

* undo changes for debugging

* no  new line

* remove dup conversion script

* fix doc string

* cleanup

* pass default args

* dup permute

* fix some tests

* fix prepare_latents

* move Prior class to modules

* offload only the text encoder and vqgan

* fix resolution calculation for prior

* nip

* removed testing script

* fix shape

* fix argument to set_timesteps

* do not change .gitignore

* fix resolution calculations + readme

* resolution calculation fix + readme

* small fixes

* Add combined pipeline

* rename generator -> decoder

* Update .gitignore

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* removed efficient_net

* create combined WuerstchenPipeline

* make arguments consistent with VQ model

* fix var names

* no need to return text_encoder_hidden_states

* add latent_dim_scale to config

* split model into its own file

* add WuerschenPipeline to docs

* remove unused latent_size

* register latent_dim_scale

* update script

* update docstring

* use Attention preprocessor

* concat with normed input

* fix-copies

* add docs

* fix test

* fix style

* add to cpu_offloaded_model

* updated type

* remove 1-line func

* updated type

* initial decoder test

* formatting

* formatting

* fix autodoc link

* num_inference_steps is int

* remove comments

* fix example in docs

* Update src/diffusers/pipelines/wuerstchen/diffnext.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* rename layernorm to WuerstchenLayerNorm

* rename DiffNext to WuerstchenDiffNeXt

* added comment about MixingResidualBlock

* move paella vq-vae to pipelines' folder

* initial decoder test

* increased test_float16_inference expected diff

* self_attn is always true

* more passing decoder tests

* batch image_embeds

* fix failing tests

* set the correct dtype

* relax inference test

* update prior

* added combined pipeline test

* faster test

* faster test

* Update src/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* fix issues from review

* update wuerstchen.md + change generator name

* resolve issues

* fix copied from usage and add back batch_size

* fix API

* fix arguments

* fix combined test

* Added timesteps argument + fixes

* Update tests/pipelines/test_pipelines_common.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update tests/pipelines/wuerstchen/test_wuerstchen_prior.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/wuerstchen/pipeline_wuerstchen_combined.py

* up

* Fix more

* failing tests

* up

* up

* correct naming

* correct docs

* correct docs

* fix test params

* correct docs

* fix classifier free guidance

* fix classifier free guidance

* fix more

* fix all

* make tests faster

---------

Co-authored-by: Dominic Rampas <d6582533@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Dominic Rampas <61938694+dome272@users.noreply.github.com>
2023-09-06 16:15:51 +02:00
dg845
b76274cb53 [docs] Fix typo in Inpainting force unmasked area unchanged example (#4910)
Fix typo by replacing init_image_arr and repainted_image_arr with init_image and repainted_image, respectively.
2023-09-06 10:49:01 +02:00
Patrick von Platen
dc3e0ca59b [Textual inversion] Relax loading textual inversion (#4903)
* [Textual inversion] Relax loading textual inversion

* up
2023-09-06 10:39:44 +02:00
Sayak Paul
6c314ad0ce [Docs] add doc entry to explain lora fusion and use of different scales. (#4893)
* add doc entry to explain lora fusion and use of different scales.

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-09-06 07:38:13 +05:30
Steven Liu
946bb53c56 [docs] Add stronger warning for SDXL height/width (#4867)
* add size warning

* feedback
2023-09-05 10:50:42 -07:00
YiYi Xu
ea311e6989 remove latent input for kandinsky prior_emb2emb pipeline (#4887)
* remove latent input

* fix test

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-09-04 22:19:49 -10:00
YiYi Xu
4c5718a09c fix a bug in StableDiffusionUpscalePipeline.run_safety_checker (#4886)
fix

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-09-04 22:18:59 -10:00
Patrick von Platen
2340ed629e [Test] Reduce CPU memory (#4897)
* [Test] Reduce CPU memory

* [Test] Reduce CPU memory
2023-09-05 13:18:35 +05:30
Bagheera
cfdfcf2018 Add --vae_precision option to the SDXL pix2pix script so that we have… (#4881)
* Add --vae_precision option to the SDXL pix2pix script so that we have the option of avoiding float32 overhead

* style

---------

Co-authored-by: bghira <bghira@users.github.com>
2023-09-05 09:04:06 +02:00
Sayak Paul
e4b8e7928b [Core] better support offloading when side loading is enabled. (#4855)
* better support offloading when side loading is enabled.

* load_textual_inversion

* better messaging for textual inversion.

* fixes

* address PR feedback.

* sdxl support.

* improve messaging

* recursive removal when cpu sequential offloading is enabled.

* add: lora tests

* recruse.

* add: offload tests for textual inversion.
2023-09-05 06:55:13 +05:30
dg845
55e17907f9 Add dropout parameter to UNet2DModel/UNet2DConditionModel (#4882)
* Add dropout param to get_down_block/get_up_block and UNet2DModel/UNet2DConditionModel.

* Add dropout param to Versatile Diffusion modeling, which has a copy of UNet2DConditionModel and its own get_down_block/get_up_block functions.
2023-09-05 00:02:21 +02:00
Sayak Paul
c81a88b239 [Core] LoRA improvements pt. 3 (#4842)
* throw warning when more than one lora is attempted to be fused.

* introduce support of lora scale during fusion.

* change test name

* changes

* change to _lora_scale

* lora_scale to call whenever applicable.

* debugging

* lora_scale additional.

* cross_attention_kwargs

* lora_scale -> scale.

* lora_scale fix

* lora_scale in patched projection.

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* styling.

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* remove unneeded prints.

* remove unneeded prints.

* assign cross_attention_kwargs.

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* clean up.

* refactor scale retrieval logic a bit.

* fix nonetypw

* fix: tests

* add more tests

* more fixes.

* figure out a way to pass lora_scale.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* unify the retrieval logic of lora_scale.

* move adjust_lora_scale_text_encoder to lora.py.

* introduce dynamic adjustment lora scale support to sd

* fix up copies

* Empty-Commit

* add: test to check fusion equivalence on different scales.

* handle lora fusion warning.

* make lora smaller

* make lora smaller

* make lora smaller

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-04 23:52:31 +02:00
YiYi Xu
2c1677eefe allow passing components to connected pipelines when use the combined pipeline (#4883)
* fix

* add test

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-09-04 06:21:36 -10:00
dg845
c73e609aae Fix get_dummy_inputs for Stable Diffusion Inpaint Tests (#4845)
* Change StableDiffusionInpaintPipelineFastTests.get_dummy_inputs to produce a random image and a white mask_image.

* Add dummy expected slices for the test_stable_diffusion_inpaint tests.

* Remove print statement

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-04 12:04:59 +02:00
Erwann Millon
2fa4b3ffb0 check for unet_lora_layers in sdxl pipeline's save_lora_weights method (#4821)
run make fix-copies and make style
2023-09-04 09:59:59 +02:00
Isamu Isozaki
3201903d94 Retrieval Augmented Diffusion Models (#3297)
* Resetting rdm pr

* Fixed styles

* Fixed style

* Moved to rdm folder+fixed slight errors

* Removed config diff

* Started adding tests

* Adding retrieved images

* Fixed faiss import

* Fixed import errors

* Fixing tests

* Added require_faiss

* Updated dependency table

* Attempt solving consistency test

* Fixed truncation and vocab size issue

* Passed common tests

* Finished up cpu testing on pipeline

* Passed all tests locally

* Removed some slow tests

* Removed diffs from test_pipeline_common

* Remove logs

* Removed diffs from test_pipelines_common

* Fixed style

* Fully fixed styles on diffs

* Fixed name

* Proper rename

* Fixed dummies

* Fixed issue with dummyonnx

* Fixed black style

* Fixed dummies

* Changed ordering

* Fixed logging

* Fixing

* Fixing

* quality

* Debugging regex

* Fix dummies with guess

* Fixed typo

* Attempt fix dummies

* black

* ruff

* fixed ordering

* Logging

* Attempt fix

* Attempt fix dummy

* Attempt fixing styles

* Fixed faiss dependency

* Removed unnecessary deprecations

* Finished up main changes

* Added doc

* Passed tests

* Fixed tests

* Remove invisible watermark

* Fixed ruff errors

* Added prompt embed to tests

* Added tests and made retriever an optional component

* Fixed styles

* Made faiss a dependency of pipeline

* Logging

* Fixed dummies

* Make pipeline test work

* Fixed style

* Moved to research projects

* Remove diff

* Fixed style error

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-04 09:42:04 +02:00
Patrick von Platen
705c592ea9 [Tests] Add combined pipeline tests (#4869)
* [Tests] Add combined pipeline tests

* Update tests/pipelines/kandinsky_v22/test_kandinsky.py
2023-09-02 21:36:20 +02:00
Harutatsu Akiyama
c52acaaf17 [ControlNet SDXL Inpainting] Support inpainting of ControlNet SDXL (#4694)
* [ControlNet SDXL Inpainting] Support inpainting of ControlNet SDXL

Co-authored-by: Jiabin Bai 1355864570@qq.com


---------

Co-authored-by: Harutatsu Akiyama <kf.zy.qin@gmail.com>
2023-09-02 08:04:22 -10:00
Steven Liu
2c45a53aef [docs] Shap-E guide (#4700)
* first draft

* fixes

* more fixes

* fix toctree
2023-09-01 19:52:41 -07:00
Steven Liu
22ea35cf23 [docs] DiffEdit guide (#4722)
* first draft

* minor edits
2023-09-01 14:18:41 -07:00
YiYi Xu
5c404f20f4 [WIP] masked_latent_inputs for inpainting pipeline (#4819)
* add

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-09-01 06:55:31 -10:00
YiYi Xu
d8b6f5d09e support AutoPipeline.from_pipe between a pipeline and its ControlNet pipeline counterpart (#4861)
add
2023-09-01 06:53:03 -10:00
YiYi Xu
30a5acc39f fix a bug in sdxl-controlnet-img2img when using MultiControlNetModel (#4862)
fix

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-09-01 06:51:59 -10:00
Seongsu Park
0c775544dd [Docs] Korean translation update (#4684)
* Docs kr update 3

controlnet, reproducibility 업로드

generator 그대로 사용
seamless multi-GPU 그대로 사용

create_dataset 번역 1차

stable_diffusion_jax

new translation

Add coreml, tome

kr docs minor fix

translate training/instructpix2pix

fix training/instructpix2pix.mdx

using-diffusers/weighting_prompts 번역 1차

add SDXL docs

Translate using-diffuers/loading_overview.md

translate using-diffusers/textual_inversion_inference.md

Conditional image generation (#37)

* stable_diffusion_jax

* index_update

* index_update

* condition_image_generation

---------

Co-authored-by: Seongsu Park <tjdtnsu@gmail.com>

jihwan/stable_diffusion.mdx

custom_diffusion 작업 완료

quicktour 작업 완료

distributed inference & control brightness (#40)

* distributed_inference.mdx

* control_brightness

---------

Co-authored-by: idra79haza <idra79haza@github.com>
Co-authored-by: Seongsu Park <tjdtnsu@gmail.com>

using_safetensors (#41)

* distributed_inference.mdx

* control_brightness

* using_safetensors.mdx

---------

Co-authored-by: idra79haza <idra79haza@github.com>
Co-authored-by: Seongsu Park <tjdtnsu@gmail.com>

delete safetensor short

* Repace mdx to md

* toctree update

* Add controlling_generation

* toctree fix

* colab link, minor fix

* docs name typo fix

* frontmatter fix

* translation fix
2023-09-01 09:23:45 -07:00
Pedro Cuenca
60d259add1 Fix link from API to using-diffusers (#4856)
* Fix link from API to using-diffusers

* Fix link
2023-09-01 15:05:01 +02:00
Dhruv Nair
189e9f01b3 Test Cleanup Precision issues (#4812)
* proposal for flaky tests

* more precision fixes

* move more tests to use cosine distance

* more test fixes

* clean up

* use default attn

* clean up

* update expected value

* make style

* make style

* Apply suggestions from code review

* Update src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion_img2img.py

* make style

* fix failing tests

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-01 17:58:37 +05:30
Nguyễn Công Tú Anh
38466c369f Add GLIGEN Text Image implementation (#4777)
* Add GLIGEN Text Image implementation

* add style transfer from image

* fix check_repository_consistency

* add convert script GLIGEN model to Diffusers

* rename attention type

* fix style code

* remove PositionNetTextImage

* Revert "fix check_repository_consistency"

This reverts commit 15f098c96e.

* change attention type name

* update docs for GLIGEN

* change examples with hf-document-image

* fix style

* add CLIPImageProjection for GLIGEN

* Add new encode_prompt, load project matrix in pipe init

* move CLIPImageProjection to stable_diffusion

* add comment
2023-09-01 15:48:01 +05:30
dg845
5f740d0f55 [docs] Add inpainting example for forcing the unmasked area to remain unchanged to the docs (#4536)
* Initial code to add force_unmasked_unchanged argument to StableDiffusionInpaintPipeline.__call__.

* Try to improve StableDiffusionInpaintPipelineFastTests.get_dummy_inputs.

* Use original mask to preserve unmasked pixels in pixel space rather than latent space.

* make style

* start working on note in docs to force unmasked area to be unchanged

* Add example of forcing the unmasked area to remain unchanged.

* Revert "make style"

This reverts commit fa7759293a.

* Revert "Use original mask to preserve unmasked pixels in pixel space rather than latent space."

This reverts commit 092bd0e9e9.

* Revert "Try to improve StableDiffusionInpaintPipelineFastTests.get_dummy_inputs."

This reverts commit ff41cf43c5.

* Revert "Initial code to add force_unmasked_unchanged argument to StableDiffusionInpaintPipeline.__call__."

This reverts commit 989979752a.

---------

Co-authored-by: Will Berman <wlbberman@gmail.com>
2023-08-31 21:29:16 -07:00
YiYi Xu
75f81c25d1 fix sdxl-inpaint fast test (#4859)
fix inpaint test

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-08-31 15:42:58 -10:00
Patrick von Platen
bbf733ab70 [SDXL Inpaint] Correct strength default (#4858) 2023-08-31 20:34:33 +02:00
Steven Liu
aedd78767c [docs] ControlNet guide (#4640)
* first draft

* finish first draft

* feedback and remove sections from API pages

* clean docstrings

* add full code example
2023-08-31 10:02:02 -04:00
Patrick von Platen
7caa3682e4 Remove warn with deprecate (#4850)
* Remove warn with deprecate

* Fix typo with 1.0,0
2023-08-31 15:08:41 +02:00
Ella Charlaix
0edb4cac78 Fix image processor inputs width (#4853)
fix width for np array inputs
2023-08-31 14:50:55 +02:00
Yukun Huang
85b3f08c26 Fix potential type mismatch errors in SDXL pipelines (#4796)
* Fix potential type conversion errors in SDXL pipelines

* make sure vae stays in fp16

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-31 09:22:18 +02:00
Sayak Paul
19f3161d94 [Docs] improve the LoRA doc. (#4838)
* improve the LoRA doc.

* include fuse_lora and unfuse_lora

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-08-31 00:13:15 +05:30
Steven Liu
a1fdfca36f [docs] SDXL (#4428)
* first draft

* reorg toctree

* note about minsdxl

* feedback

* fix

* micro-conditionings

* add tip

* fix section levels

* d'oh fix pipeline names

* feedback

* remove old section
2023-08-30 11:34:55 -04:00
Patrick von Platen
d1e20be664 make style 2023-08-30 14:13:14 +02:00
Anatoly Belikov
af3854d6ad sketch inpaint from a1111 for non-inpaint models (#4824)
* Create masked_stable_diffusion_img2img.py

* add MaskedIm2ImPipeline to readme

* Update README.md
2023-08-30 09:51:28 +02:00
Patrick von Platen
9f1936d2fc Fix Unfuse Lora (#4833)
* Fix Unfuse Lora

* add tests

* Fix more

* Fix more

* Fix all

* make style

* make style
2023-08-30 09:32:25 +05:30
Eugene Antropov
fbca2e0a7a Add loading ckpt from file for SDXL controlNet (#4683)
* Add load ckpt from file for ControlNet SDXL

* Reformat code

* Resort imports

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-30 09:00:53 +05:30
Sayak Paul
3768d4d77c [Core] refactor encode_prompt (#4617)
* refactoring of encode_prompt()

* better handling of device.

* fix: device determination

* fix: device determination 2

* handle num_images_per_prompt

* revert changes in loaders.py and give birth to encode_prompt().

* minor refactoring for encode_prompt()/

* make backward compatible.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* fix: concatenation of the neg and pos embeddings.

* incorporate encode_prompt() in test_stable_diffusion.py

* turn it into big PR.

* make it bigger

* gligen fixes.

* more fixes to fligen

* _encode_prompt -> encode_prompt in tests

* first batch

* second batch

* fix blasphemous mistake

* fix

* fix: hopefully for the final time.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-30 08:57:26 +05:30
Nikhil Gajendrakumar
8ccb619416 VaeImageProcessor: Allow image resizing also for torch and numpy inputs (#4832)
Co-authored-by: Nikhil Gajendrakumar <nikhilkatte@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-29 22:45:05 +02:00
zideliu
0699ac62f0 fix typo (#4822) 2023-08-29 20:54:36 +02:00
Patrick von Platen
a76f2ad538 make style 2023-08-29 09:25:09 +02:00
VitjanZ
7200daa412 Support saving multiple t2i adapter models under one checkpoint (#4798)
* adding save and load for MultiAdapter, adding test

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Adding changes from review test_stable_diffusion_adapter

* import sorting fix

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-29 09:24:40 +02:00
Alexsey Shestacov
3eeaf4e041 Fix convert_original_stable_diffusion_to_diffusers script (#4817)
Fix stable diffusion conversion script
2023-08-29 09:14:45 +02:00
Patrick von Platen
c583f3b452 Fuse loras (#4473)
* Fuse loras

* initial implementation.

* add slow test one.

* styling

* add: test for checking efficiency

* print

* position

* place model offload correctly

* style

* style.

* unfuse test.

* final checks

* remove warning test

* remove warnings altogether

* debugging

* tighten up tests.

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* denugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debuging

* debugging

* debugging

* debugging

* suit up the generator initialization a bit.

* remove print

* update assertion.

* debugging

* remove print.

* fix: assertions.

* style

* can generator be a problem?

* generator

* correct tests.

* support text encoder lora fusion.

* tighten up tests.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-29 09:14:24 +02:00
Chong Mou
12358b986f add models for T2I-Adapter-XL (#4696)
* T2I-Adapter-XL

* update

* update

* add pipeline

* modify pipeline

* modify pipeline

* modify pipeline

* modify pipeline

* modify pipeline

* modify modeling_text_unet

* fix styling.

* fix: copies.

* adapter settings

* new test case

* new test case

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* revert prints.

* new test case

* remove print

* org test case

* add test_pipeline

* styling.

* fix copies.

* modify test parameter

* style.

* add adapter-xl doc

* double quotes in docs

* Fix potential type mismatch

* style.

---------

Co-authored-by: sayakpaul <spsayakpaul@gmail.com>
2023-08-29 10:34:07 +05:30
YiYi Xu
5eeedd9e33 add StableDiffusionXLControlNetImg2ImgPipeline (#4592)
---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-28 08:16:27 -10:00
YiYi Xu
a971c598b5 fix auto_pipeline: pass kwargs to load_config (#4793)
* fix

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-28 07:42:16 -10:00
YiYi Xu
934d439a42 fix bug in StableDiffusionXLControlNetPipeline when use guess_mode (#4799)
* fix



---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-28 06:51:17 -10:00
Dhruv Nair
e3f3672f46 Fix Disentangle ONNX and non-ONNX pipeline (#4656)
* initial commit to fix inheritance issue

* clean up sd onnx upscale

* clean up
2023-08-28 21:14:49 +05:30
Mario Namtao Shianti Larcher
87ae330056 [Examples] Save SDXL LoRA weights with chosen precision (#4791)
* Increase min accelerate ver to avoid OOM when mixed precision

* Rm re-instantiation of VAE

* Rm casting to float32

* Del unused models and free GPU

* Fix style
2023-08-28 13:57:40 +05:30
Patrick von Platen
1b46c66132 make style 2023-08-28 07:17:21 +00:00
Yead
031358988b Fix save_path bug in textual inversion training script (#4710)
* Update textual_inversion.py

fixed safe_path bug in textual inversion training

* Update test_examples.py

update test_textual_inversion for updating saved file's name

* Update textual_inversion.py

fixed some formatting issues

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-28 09:17:08 +02:00
Shauray Singh
fd35689f25 [WIP] Add Fabric (#4201)
* empty PR

* init

* changes

* starting with the pipeline

* stable diff

* prev

* more things, getting started

* more functions

* makeing it more readable

* almost done testing

* var changes

* testing

* device

* device support

* maybe

* device malfunctions

* new new

* register

* testing

* exec does not work

* float

* change info

* change of architecture

* might work

* testing with colab

* more attn atuff

* stupid additions

* documenting and testing

* writing tests

* more docs

* tests and docs

* remove test

* empty PR

* init

* changes

* starting with the pipeline

* stable diff

* prev

* more things, getting started

* more functions

* makeing it more readable

* almost done testing

* var changes

* testing

* device

* device support

* maybe

* device malfunctions

* new new

* register

* testing

* exec does not work

* float

* change info

* change of architecture

* might work

* testing with colab

* more attn atuff

* stupid additions

* documenting and testing

* writing tests

* more docs

* tests and docs

* remove test

* change cross attention

* revert back

* tests

* reverting back to orig

* changes

* test passing

* pipeline changes

* before quality

* quality checks pass

* remove print statements

* doc fixes

* __init__ error something

* update docs, working on dim

* working on encoding

* doc fix

* more fixes

* no more dependent on 512*512

* update docs

* fixes

* test passing

* remove comment

* fixes and migration

* simpler tests

* doc changes

* green CI

* changes

* more docs

* changes

* new images

* to community examples

* selete

* more fixes

* changes

* fix

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-28 09:10:55 +02:00
chillpixel
e8c9069d6f Update loaders.py (#4805)
* Update loaders.py

Solves an error sometimes thrown while iterating over state_dict.keys() caused by using the .pop() method within the loop.

* Update loaders.py
2023-08-28 11:23:25 +05:30
Patrick von Platen
766aa50f70 [LoRA Attn Processors] Refactor LoRA Attn Processors (#4765)
* [LoRA Attn] Refactor LoRA attn

* correct for network alphas

* fix more

* fix more tests

* fix more tests

* Move below

* Finish

* better version

* correct serialization format

* fix

* fix more

* fix more

* fix more

* Apply suggestions from code review

* Update src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion_img2img.py

* deprecation

* relax atol for slow test slighly

* Finish tests

* make style

* make style
2023-08-28 10:38:09 +05:30
Patrick von Platen
c4d2823601 [SDXL Lora] Fix last ben sdxl lora (#4797)
* Fix last ben sdxl lora

* Correct typo

* make style
2023-08-26 23:31:56 +02:00
Patrick von Platen
4f8853e481 [Torch compile] Fix torch compile for controlnet (#4795)
Fix torch compile for controlnete
2023-08-26 22:30:02 +02:00
Steven Liu
fed88195e3 [docs] Fix syntax for compel (#4794)
* fix syntax

* update image
2023-08-26 11:33:10 -07:00
Sayak Paul
0de35e4a52 [Tests] Tighten up LoRA loading relaxation (#4787)
* debugging

* better logic for filtering.

* Update src/diffusers/loaders.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-26 15:01:16 +05:30
Canberk Kandemir
0d81e543a2 Unet fix (#4769)
* Optional images variable train_custom_diffusion.py

* Fixed train_custom_diffusion.py

* Revert accidental changes to unet_2d_condition.py

* "Format code with black"
2023-08-26 11:01:24 +02:00
Sayak Paul
3be0ff9056 [Core] Support negative conditions in SDXL (#4774)
* add: support negative conditions.

* fix: key

* add: tests

* address PR feedback.

* add documentation

* add img2img support.

* add inpainting support.

* ad controlnet support

* Apply suggestions from code review

* modify wording in the doc.
2023-08-26 09:13:44 +05:30
Patrick von Platen
2764db3194 [SDXL] Add docs about forcing passed embeddings to be 0 (#4783)
* make style

* make style

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* make style

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-08-25 20:52:45 +02:00
Patrick von Platen
048d901993 make style 2023-08-25 18:51:03 +00:00
cmdr2
cb432c4ebc Allow passing a checkpoint state_dict to convert_from_ckpt (instead of just a string path) (#4653)
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-25 20:50:39 +02:00
YiYi Xu
b7b1a30bc4 refactor prepare_mask_and_masked_image with VaeImageProcessor (#4444)
* refactor image processor for mask
---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-08-25 08:18:48 -10:00
Will Berman
7e5587a5ac instance_prompt->class_prompt (#4784) 2023-08-25 20:06:55 +02:00
Mayank Khanduja
dc8da1d449 Fixed broken link of CLIP doc in evaluation doc (#4760) 2023-08-25 20:04:50 +02:00
Zijian He
3dd540171d fix bug of progress bar in clip guided images mixing (#4729) 2023-08-25 18:54:03 +02:00
YiYi Xu
b3b2d30cd8 fix a bug in from_pretrained when load optional components (#4745)
* fix
---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-25 06:25:48 -10:00
Dhruv Nair
3bba44d74e [WIP ] Proposal to address precision issues in CI (#4775)
* proposal for flaky tests

* clean up
2023-08-25 19:12:09 +05:30
Sanchit Gandhi
b1290d3fb8 Convert MusicLDM (#4579)
* from audioldm

* fix vae

* move to new pipeline

* copied from audioldm

* remove redundant control flow

* iterate

* fix docstring

* finish pipeline

* tests: from audioldm2

* iterate

* finish fast tests

* finish slow integration tests

* add docs

* remove dtype test

* update toctree

* "copied from" in conversion (where possible)

* Update docs/source/en/api/pipelines/musicldm.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* fix docstring

* make nightly

* style

* fix dtype test

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-25 13:31:00 +01:00
Sanchit Gandhi
29a11c2a94 [AudioLDM 2] Pipeline fixes (#4738)
* fix docs

* fix unet docs

* use image output for latents

* fix hub checkpoints

* fix pipeline example

* update example

* return_dict = False

* revert image pipeline output

* revert doc changes

* remove dtype test

* make style

* remove docstring updates

* remove unet docstring update

* Empty commit to re-trigger CI

* fix cpu offload

* fix dtype test

* add offload test
2023-08-25 11:38:10 +01:00
Patrick von Platen
cdacd8f1dd Torch device (#4755) 2023-08-25 11:13:32 +02:00
Sayak Paul
470d51c8ed improve setup.py (#4748) 2023-08-25 13:44:20 +05:30
Andrew Zhu
d6141205cd fix sdxl_lwp empty neg_prompt error issue (#4743)
* fix sdxl_lwp empty neg_prompt error issue

* fix sdxl_lwp empty neg_prompt error issue, update code format

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-25 09:55:56 +05:30
Sayak Paul
4447547eda [Examples] fix sdxl dreambooth lora checkpointing. (#4749)
* fix sdxl dreambooth lora checkpointing.

* style
2023-08-25 09:50:02 +05:30
Sayak Paul
5222294748 [LoRA] relax lora loading logic (#4610)
* relax lora loading logic.

* cater to the other cases too.

* fix: variable name

* bring the chaos down.

* check

* deal with checkpointed files.

* Apply suggestions from code review

Co-authored-by: apolinário <joaopaulo.passos@gmail.com>

* style

---------

Co-authored-by: apolinário <joaopaulo.passos@gmail.com>
2023-08-25 09:35:51 +05:30
Mario Namtao Shianti Larcher
c25c46137d [Examples] Add madebyollin VAE to SDXL LoRA example, along with an explanation (#4762)
Add madebyollin VAE to LoRA example, along with an explenation
2023-08-25 09:34:32 +05:30
Will Berman
3105c710ba [fix] multi t2i adapter set total_downscale_factor (#4621)
* [fix] multi t2i adapter set total_downscale_factor

* move image checks into check inputs

* remove copied from
2023-08-24 12:01:23 -07:00
Patrick von Platen
58f5f748f4 [Tests] Fix paint by example (#4761)
* [Tests] Fix paint by example

* Update src/diffusers/pipelines/paint_by_example/image_encoder.py
2023-08-24 16:03:10 +02:00
Dhruv Nair
4f05058bb7 Clean up flaky behaviour on Slow CUDA Pytorch Push Tests (#4759)
use max diff to compare model outputs
2023-08-24 18:58:02 +05:30
Patrick von Platen
5d4413001b make style 2023-08-24 10:19:47 +00:00
Symbiomatrix
863e741614 Bugfix for SDXL model loading in low ram system. (#4628)
Update convert_from_ckpt.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-24 12:19:16 +02:00
Sanchit Gandhi
24c5e7708b [AudioLDM2] Doc fixes (#4739)
* [AudioLDM2] Doc fixes

* update docstrings

* fix unet docstring

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-24 07:20:27 +05:30
YiYi Xu
cd21b965d1 add a step_index counter (#4347)
add self.step_index

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-23 10:49:54 -10:00
Yinzhen Wang
d185b5ed5f change validation scheduler for train_dreambooth.py when training IF (#4333)
* dreambooth training

* train_dreambooth validation scheduler

* set a particular scheduler via a string

* modify readme after setting a particular scheduler via a string

* modify readme after setting a particular scheduler

* use importlib to set a particular scheduler

* import with correct sort
2023-08-23 22:18:17 +02:00
Suraj Patil
709a642827 fix dummy import for AudioLDM2 (#4741)
* fix import

* style
2023-08-23 22:07:47 +02:00
Sanchit Gandhi
0a0fe69aa6 [AudioLDM Docs] Update docstring (#4744) 2023-08-23 11:04:54 -07:00
realliujiaxu
124e76ddc6 [docs] add variant="fp16" flag (#4678) 2023-08-23 10:00:34 -07:00
Sanchit Gandhi
05b0ec63bc [AudioLDM Docs] Fix docs for output (#4737) 2023-08-23 18:02:11 +02:00
Sayak Paul
4909b1e3ac [Examples] fix checkpointing and casting bugs in train_text_to_image_lora_sdxl.py (#4632)
* fix: casting issues.

* fix checkpointing.

* tests

* fix: bugs
2023-08-23 10:58:54 +05:30
Ollin Boer Bohan
052bf3280b Fix AutoencoderTiny encoder scaling convention (#4682)
* Fix AutoencoderTiny encoder scaling convention

  * Add [-1, 1] -> [0, 1] rescaling to EncoderTiny

  * Move [0, 1] -> [-1, 1] rescaling from AutoencoderTiny.decode to DecoderTiny
    (i.e. immediately after the final conv, as early as possible)

  * Fix missing [0, 255] -> [0, 1] rescaling in AutoencoderTiny.forward

  * Update AutoencoderTinyIntegrationTests to protect against scaling issues.
    The new test constructs a simple image, round-trips it through AutoencoderTiny,
    and confirms the decoded result is approximately equal to the source image.
    This test checks behavior with and without tiling enabled.
    This test will fail if new AutoencoderTiny scaling issues are introduced.

  * Context: Raw TAESD weights expect images in [0, 1], but diffusers'
    convention represents images with zero-centered values in [-1, 1],
    so AutoencoderTiny needs to scale / unscale images at the start of
    encoding and at the end of decoding in order to work with diffusers.

* Re-add existing AutoencoderTiny test, update golden values

* Add comments to AutoencoderTiny.forward
2023-08-23 08:38:37 +05:30
Patrick von Platen
80871ac597 fix bad error message when transformers is missing (#4714) 2023-08-22 21:25:01 +02:00
Patrick von Platen
6abc66ef28 Fix all docs (#4721)
* [Docs] Fix all

* fix
2023-08-22 21:00:21 +02:00
Patrick von Platen
38efac9f61 Revert "Move controlnet load local tests to nightly (#4543)" (#4713)
This reverts commit 7b07f9812a.
2023-08-22 19:55:15 +02:00
Patrick von Platen
4f6399bedd rename test file to run, so that examples tests do not fail (#4715)
* rename test file to run, so that examples tests do not fail

* [Tests] Rename community tests
2023-08-22 19:54:46 +02:00
Patrick von Platen
6e1af3a777 [Docs] Fix docs controlnet missing /Tip (#4717) 2023-08-22 18:40:26 +02:00
zideliu
f22aad6e3a Add reference_attn & reference_adain support for sdxl (#4502)
* ADD SDXL reference & reference adain

* Update README.md

* Update README.md

* format stable_diffusion_xl_reference.py

* format file

* Format file

* format file

* fix format

* fix format with ruff

* fix format

* Update examples/community/README.md

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>

* Update examples/community/README.md

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>

* Update README.md

* Update README.md & fix typo

* Update README.md

* fix format

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-08-22 20:22:01 +05:30
realliujiaxu
ecded50ad5 add convert diffuser pipeline of XL to original stable diffusion (#4596)
convert diffuser pipeline of XL to original stable diffusion

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-08-22 19:11:06 +05:30
Alex McKinney
e34d9aa681 Replaces DIFFUSERS_TEST_DEVICE backend list with trying device (#4673)
This is a better method than comparing against a list of supported backends as it allows for supporting any number of backends provided they are installed on the user's system.
This should have no effect on the behaviour of tests in Huggingface's CI workers.
See transformers#25506 where this approach has already been added.
2023-08-22 11:48:12 +05:30
Sayak Paul
8d30d25794 [LoRA] default to None when fc alphas are not available. (#4706)
default to None when fc alphas are not available.
2023-08-22 08:47:08 +05:30
Sayak Paul
1e0395e791 [LoRA] ensure different LoRA ranks for text encoders can be properly handled (#4669)
* debugging starts

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging ends, but does it?

* more robustness.
2023-08-22 08:21:13 +05:30
Sayak Paul
9141c1f9d5 [Core] enable lora for sdxl controlnets too and add slow tests. (#4666)
* enable lora for sdxl controlnets too.

* add: tests

* fix: assertion values.
2023-08-22 07:13:23 +05:30
dg845
f75b8aa9dd [docs] Add note in UniDiffusers Doc about PyTorch 1.X numerical stability issue (#4703)
* Add note regarding UniDiffuser pipeline numerical stability issues on PyTorch 1.X

* Use the doc-builder warning tag.
2023-08-22 07:12:06 +05:30
Sanchit Gandhi
7a24977ce3 Add AudioLDM 2 (#4549)
* from audioldm

* unet down + mid

* vae, clap, flan-t5

* start sequence audio mae

* iterate on audioldm encoder

* finish encoder

* finish weight conversion

* text pre-processing

* gpt2 pre-processing

* fix projection model

* working

* unet equivalence

* finish in base

* add unet cond

* finish unet

* finish custom unet

* start clean-up

* revert base unet changes

* refactor pre-processing

* tests: from audioldm

* fix some tests

* more fixes

* iterate on tests

* make fix copies

* harden fast tests

* slow integration tests

* finish tests

* update checkpoint

* update copyright

* docs

* remove outdated method

* add docstring

* make style

* remove decode latents

* enable cpu offload

* (text_encoder_1, tokenizer_1) -> (text_encoder, tokenizer)

* more clean up

* more refactor

* build pr docs

* Update docs/source/en/api/pipelines/audioldm2.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* small clean

* tidy conversion

* update for large checkpoint

* generate -> generate_language_model

* full clap model

* shrink clap-audio in tests

* fix large integration test

* fix fast tests

* use generation config

* make style

* update docs

* finish docs

* finish doc

* update tests

* fix last test

* syntax

* finalise tests

* refactor projection model in prep for TTS

* fix fast tests

* style

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-21 12:34:21 +01:00
zuojianghua
74d902eb59 add config_file to from_single_file (#4614)
* Update loaders.py

add config_file to from_single_file, 
when the download_from_original_stable_diffusion_ckpt use

* Update loaders.py

add config_file to from_single_file,
when the download_from_original_stable_diffusion_ckpt use

* change config_file to original_config_file

* make style && make quality

---------

Co-authored-by: jianghua.zuo <jianghua.zuo@weimob.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-08-18 19:33:12 +05:30
Andrew Zhu
d7c4ae619d Add SDXL long weighted prompt pipeline (replace pr:4629) (#4661)
* Add SDXL long weighted prompt pipeline

* Add SDXL long weighted prompt pipeline usage sample in the readme document

* Add SDXL long weighted prompt pipeline usage sample in the readme document, add result image
2023-08-18 11:30:10 +05:30
Isotr0py
67ea2b7afa Support tiled encode/decode for AutoencoderTiny (#4627)
* Impl tae slicing and tiling

* add tae tiling test

* add parameterized test

* formatted code

* fix failed test

* style docs
2023-08-18 09:12:55 +05:30
Sayak Paul
a10107f92b fix: lora sdxl tests (#4652) 2023-08-17 15:59:50 +05:30
Sayak Paul
d0c30cfd37 make post-release (#4650) 2023-08-17 14:16:25 +05:30
Jacqui Wei
7c3e7fedcd Fix use_onnx parameter usage in from_pretrained func and update test_download_no_onnx_by_default test (#4508)
* add missing use_onnx in from_pretrained func

* fix test_download_no_onnx_by_default test func

* address comments

* split test cases
2023-08-17 11:49:32 +05:30
Patrick von Platen
029fb41695 [Safetensors] Make safetensors the default way of saving weights (#4235)
* make safetensors default

* set default save method as safetensors

* update tests

* update to support saving safetensors

* update test to account for safetensors default

* update example tests to use safetensors

* update example to support safetensors

* update unet tests for safetensors

* fix failing loader tests

* fix qc issues

* fix pipeline tests

* fix example test

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-08-17 10:54:28 +05:30
Batuhan Taskaya
852dc76d6d Support higher dimension LoRAs (#4625)
* Support higher dimension LoRAs

* add: tests

* fix: assertion values.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-17 10:07:07 +05:30
Scott Lessans
064f150813 Fix UnboundLocalError during LoRA loading (#4523)
* fixed

* add: tests

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-17 09:33:35 +05:30
Sayak Paul
5333f4c0ec make things clear in the controlnet sdxl doc. (#4644) 2023-08-17 09:04:28 +05:30
Dhruv Nair
3d08d8dc4e fix loading custom text encoder when using from_single_file (#4571)
fix loading custom text encoder when using from_single_file
2023-08-17 08:41:09 +05:30
Steven Liu
bdc4c3265f [docs] MultiControlNet (#4635)
multicontrolnet docs
2023-08-17 08:14:20 +05:30
Steven Liu
4ff7264d9b [docs] PushToHubMixin (#4622)
* push to hub docs

* fix typo

* feedback

* make style
2023-08-16 13:20:59 -06:00
Sayak Paul
5049599143 [Core] feat: MultiControlNet support for SDXL ControlNet pipeline (#4597)
* core: add multicontrolnet support to sdxl controlnet

* modify checks.

* fix: original_size determination

* add: tests for multi controlnet sdxl.

* remove unnecessary prints.
2023-08-16 20:30:39 +05:30
Suraj Patil
7b93c2a882 [research_projects] SDXL controlnet script (#4633)
add controlent script,
2023-08-16 18:27:08 +05:30
Dirk Morris
a7de96505b Fix unipc use_karras_sigmas exception - fixes huggingface/diffusers#4580 (#4581)
* Fix unipc karras sigmas exception - fixes huggingface/diffusers#4580

* Add unipc scheduler tests for karras sigmas
2023-08-16 10:01:53 +05:30
Sayak Paul
351aab60e9 Update text2image.md to fix the links (#4626) 2023-08-16 09:53:10 +05:30
nikhil-masterful
da5ab51d54 Add GLIGEN implementation (#4441)
* Add GLIGEN implementation

* GLIGEN: Fix code quality check failures

* GLIGEN: Fix Import block un-sorted or un-formatted failures

* GLIGEN: Fix check_repository_consistency failures

* GLIGEN: Add 'PositionNet' to versatile_diffusion/modeling_text_unet.py

* GLIGEN: check_repository_consistency: fix 'copy does not match' error

* GLIGEN: Fix review comments (1)

* GLIGEN: Fix E721 Do not compare types, use `isinstance()` failures

* GLIGEN : Ensure _encode_prompt() copy matches to StableDiffusionPipeline

* GLIGEN: Fix ruff E721 failure in unidiffuser/test_unidiffuser.py

* GLIGEN: doc_builder: restyle pipeline_stable_diffusion_gligen.py

* GIGLEN: reset files unrelated to gligen

* GLIGEN: Fix documentation comments (1)

* GLIGEN: Fix review comments (2)

* GLIGEN: Added FastTest

* GLIGEN: Fix review comments (3)
2023-08-16 09:34:17 +05:30
Sayak Paul
5175d3d7a5 add: train to text image with sdxl script. (#4505)
* add: train to text image with sdxl script.

Co-authored-by: CaptnSeraph <s3raph1m@gmail.com>

* fix: partial func.

* fix: default value of output_dir.

* make style

* set num inference steps to 25.

* remove mentions of LoRA.

* up min version

* add: ema cli arg

* run device placement while running step.

* precompute vae encodings too.

* fix

* debug

* should work now.

* debug

* debug

* goes alright?

* style

* debugging

* debugging

* debugging

* debugging

* fix

* reinit scheduler if prediction_type was passed.

* akways cast vae in float32

* better handling of snr.

Co-authored-by: bghira <bghira@users.github.com>

* the vae should be also passed

* add: docs.

* add: sdlx t2i tests

* save the pipeline

* autocast.

* fix: save_model_card

* fix: save_model_card.

---------

Co-authored-by: CaptnSeraph <s3raph1m@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: bghira <bghira@users.github.com>
2023-08-16 09:02:49 +05:30
Sayak Paul
a7508a76f0 add: pushtohubmixin to pipelines and schedulers docs overview. (#4607)
* add: pushtohubmixin to pipelines and schedulers docs overview.

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-08-15 22:23:17 +05:30
Sayak Paul
aaef41b5fe [Docs] fix links in the controlling generation doc. (#4612)
* fix links in the controlling generation doc.

* more fixes.
2023-08-15 20:27:13 +05:30
Wang Qiang
078df46bc9 An invalid clerical error in sdxl finetune (#4608) 2023-08-15 10:41:51 +05:30
Sayak Paul
15782fd506 [Pipeline utils] feat: implement push_to_hub for standalone models, schedulers as well as pipelines (#4128)
* feat: implement push_to_hub for standalone models.

* address PR feedback.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* remove max_shard_size.

* add: support for scheduler push_to_hub

* enable push_to_hub support for flax schedulers.

* enable push_to_hub for pipelines.

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* reflect pr feedback.

* address another round of deedback.

* better handling of kwargs.

* add: tests

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* setting hub staging to False for now.

* incorporate staging test as a separate job.

Co-authored-by: ydshieh <2521628+ydshieh@users.noreply.github.com>

* fix: tokenizer loading.

* fix: json dumping.

* move is_staging_test to a better location.

* better treatment to tokens.

* define repo_id to better handle concurrency

* style

* explicitly set token

* Empty-Commit

* move SUER, TOKEN to test

* collate org_repo_id

* delete repo

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: ydshieh <2521628+ydshieh@users.noreply.github.com>
2023-08-15 07:39:22 +05:30
Sayak Paul
d93ca26893 [Examples] Update InstructPix2Pix README_sdxl.md to fix mentions (#4574)
* Update README_sdxl.md to fix mentions

* add --push_to_hub

* add --push_to_hub

* fix: mention
2023-08-14 17:48:13 +05:30
Claire Froelich
32963c24c5 Fix git-lfs command typo in docs (#4586)
fix typo in git-lfs command

added missing hyphen. "git lfs" is not a command
2023-08-14 17:21:45 +05:30
AisingioroHao
1b739e7344 Fixed invalid pipeline_class_name parameter. (#4590)
* Fixed invalid pipeline_class_name parameter.

* Fix the format
2023-08-14 17:21:17 +05:30
Sayak Paul
d67eba0f31 [Utility] adds an image grid utility (#4576)
* add: utility for image grid.

* add: return type.

* change necessary places.

* add to utility page.
2023-08-12 10:34:51 +05:30
Steven Liu
714bfed859 [docs] Fix ControlNet SDXL docstring (#4582)
fix
2023-08-11 10:43:40 -07:00
Sayak Paul
d5983a6779 [Examples] fix: network_alpha -> network_alphas (#4572)
network_alpha
2023-08-11 14:18:49 +05:30
Mystfit
796c01534d Fixing repo_id regex validation error on windows platforms (#4358)
* Fixing repo_id regex validation error on windows platforms

* Validating correct URL with prefix is provided

If we are loading a URL then we don't need to use os.path.join and array slicing to split out a repo_id and file path from an absolute filepath. 

Checking if the URL prefix is valid first before doing any URL splitting otherwise we raise a ValueError since neither a valid filepath or URL was provided.

* Style fixes

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-11 11:11:43 +05:30
Abhipsha Das
c8d86e9f0a Remove code snippets containing is_safetensors_available() (#4521)
* [WIP] Remove code snippets containing `is_safetensors_available()`

* Modifying `import_utils.py`

* update pipeline tests for safetensor default

* fix test related to cached requests

* address import nits

---------

Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
2023-08-11 11:05:22 +05:30
dotieuthien
b28cd3fba0 Convert Stable Diffusion ControlNet to TensorRT (#4465)
* convert tensorrt controlnet

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix code quality

* Fix number controlnet condition

* Add convert SD XL to onnx

* Add convert SD XL to tensorrt

* Add convert SD XL to tensorrt

* Add examples in comments

* Add examples in comments

* Add test onnx controlnet

* Add tensorrt test

* Remove copied

* Move file test to examples/community

* Remove script

* Remove script

* Remove text

---------

Co-authored-by: dotieuthien <thien.do@mservice.com.vn>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-11 08:12:26 +05:30
Steven Liu
cd7071e750 [docs] Add safetensors flag (#4245)
* add safetensors flag

* apply review
2023-08-10 12:37:23 -07:00
Steven Liu
e31f38b5d6 [docs] Remove attention slicing (#4518)
* remove attention slicing

* apply feedback
2023-08-10 11:00:03 -07:00
Steven Liu
3bd5e073cb [docs] Expand prompt weighting (#4516)
* add more weighting/blend/conjunction

* finish blend/conjunction

* add textual inversion example

* add dreambooth
2023-08-10 10:56:53 -07:00
YiYi Xu
3df52ba8dc [Doc] update sdxl-controlnet repo name (#4564)
* rename

* style

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-08-10 22:02:32 +05:30
Sayak Paul
c697c5ab57 improve controlnet sdxl docs now that we have a good checkpoint. (#4556) 2023-08-10 08:21:36 +05:30
VV-A-VV
3fd45eb10f fix some typo error (#4546)
* fix some typo error

* Undo changes to capitalization
2023-08-10 06:49:25 +05:30
Patrick von Platen
5cbcbe3c63 Revert "introduce minimalistic reimplementation of SDXL on the SDXL doc" (#4548)
Revert "introduce minimalistic reimplementation of SDXL on the SDXL doc (#4532)"

This reverts commit e7e3749498.
2023-08-10 06:49:06 +05:30
Dhruv Nair
7b07f9812a Move controlnet load local tests to nightly (#4543)
move controlnet load local tests to nihghtly
2023-08-09 23:00:42 +05:30
Steven Liu
16ad13b61d [docs] Clean scheduler api (#4204)
* clean scheduler mixin

* up to dpmsolvermultistep

* finish cleaning

* first draft

* fix overview table

* apply feedback

* update reference code
2023-08-09 09:00:35 -07:00
Dhruv Nair
da0e2fce38 pin ruff version for quality checks (#4539)
* pin ruff version for quality checks

* update dependency versions table

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-09 16:46:45 +05:30
Dhruv Nair
a67ff32301 Move slow tests to nightly (#4526)
* move slow pix2pixzero tests to nightly

* move slow panorama tests to nightly

* move txt2video full test to nightly

* clean up

* remove nightly test from text to video pipeline
2023-08-09 12:38:15 +02:00
jere357
3c1b4933bd Changed code that converts tensors to PIL images in the write_your_own_pipeline notebook (#4489)
changed code that converts tensors to PIL images
2023-08-09 15:00:51 +05:30
Sayak Paul
e731ae0ec8 Update README_sdxl.md to include the free-tier Colab Notebook (#4540)
Update README_sdxl.md
2023-08-09 14:32:14 +05:30
Rastislav Švarba
6c5b5b260e Fix push_to_hub in train_text_to_image_lora_sdxl.py example (#4535)
fix: push_to_hub in train text2image lora sdxl
2023-08-09 11:48:24 +05:30
Simo Ryu
e7e3749498 introduce minimalistic reimplementation of SDXL on the SDXL doc (#4532)
minsdxl
2023-08-09 07:33:07 +05:30
Wooyeol Baek
c7c0b57541 Copy lora functions to XLPipelines (#4512)
* add load_lora_weights and save_lora_weights to StableDiffusionXLImg2ImgPipeline

* add load_lora_weights and save_lora_weights to StableDiffusionXLInpaintPipeline

* apply black format

* apply black format

* add copy statement

* fix statements

* fix statements

* fix statements

* run `make fix-copies`
2023-08-08 18:53:22 +05:30
Dhruv Nair
c91272d631 fix indexing issue in sd reference pipeline (#4531) 2023-08-08 15:14:19 +02:00
George He
f0725c5845 Fix misc typos (#4479)
Fix typos
2023-08-07 17:21:19 -07:00
YiYi Xu
aef11cbf66 add pipeline_class_name argument to Stable Diffusion conversion script (#4461)
* add pipeline class

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* style

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-07 06:44:31 -10:00
Dhruv Nair
71c8224159 Moving certain pipelines slow tests to nightly (#4469)
* move audioldm tests to nightly

* move kandinsky im2img ddpm test to nightly

* move flax dpm test to nightly

* move diffedit dpm test to nightly

* move fp16 slow tests to nightly
2023-08-07 17:28:56 +02:00
Patrick von Platen
4367b8a300 move pipeline only when running validation (#4515) 2023-08-07 17:20:18 +02:00
ethansmith2000
f4f854138d grad checkpointing (#4474)
* grad checkpointing

* fix make fix-copies

* fix

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-07 15:26:54 +02:00
Patrick von Platen
e1b5b8ba13 Make sure fp16-fix is used as default (#4510)
* Make sue fp16-fix is used as default

* fix vae

* finish

* fix
2023-08-07 15:16:37 +02:00
Patrick von Platen
dff5ff35a9 [SDXL LoRA] fix batch size lora (#4509)
fix batch size lora
2023-08-07 13:27:13 +02:00
Sayak Paul
b2456717e6 Update lora.md to clarify SDXL support (#4503)
* Update lora.md

* Update lora.md
2023-08-07 11:06:30 +05:30
Vladislav Artemyev
2e69cf16fe Log global_step instead of epoch to tensorboard (#4493)
Co-authored-by: mrlzla <noname@noname.com>
2023-08-07 07:49:39 +05:30
takuoko
9c29bc2df8 [Examples] Support train_text_to_image_lora_sdxl.py (#4365)
* add train_text_to_image_lora_sdxl.py

* add train_text_to_image_lora_sdxl.py

* add test and minor fix

* Update examples/text_to_image/README_sdxl.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* fix unwrap_model rule

* add invisible-watermark in requirements

* del invisible-watermark

* Update examples/text_to_image/README_sdxl.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update examples/text_to_image/README_sdxl.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update examples/text_to_image/train_text_to_image_lora_sdxl.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* del comment & update readme

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-06 13:47:20 +05:30
AisingioroHao
70d098540d Add a data_dir parameter to the load_dataset method. (#4482)
Co-authored-by: AisingioroHao0 <1286098622@qq.com>
2023-08-06 08:45:48 +05:30
Patrick von Platen
ea1fcc28a4 [SDXL] Allow SDXL LoRA to be run with less than 16GB of VRAM (#4470)
* correct

* correct blocks

* finish

* finish

* finish

* Apply suggestions from code review

* fix

* up

* up

* up

* Update examples/dreambooth/README_sdxl.md

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-08-04 20:06:38 +02:00
Patrick von Platen
66de221409 Update README_sdxl.md (#4472) 2023-08-04 16:23:35 +02:00
Sayak Paul
06f73bd6d1 [Tests] Adds integration tests for SDXL LoRAs (#4462)
* add: integration tests for SDXL LoRAs.

* change pipeline class.

* fix assertion values.

* print values again.

* let's see.

* let's see.

* let's see.

* finish
2023-08-04 16:25:53 +05:30
asfiyab-nvidia
c14c141b86 TensorRT Inpaint pipeline: minor fixes (#4457)
Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
2023-08-04 12:28:28 +02:00
manosplitsis
79ef9e528c Fixed multi-token textual inversion training (#4452)
* added placeholder token concatenation during training

* Update examples/textual_inversion/textual_inversion.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-04 12:21:31 +02:00
Dhruv Nair
801a5e2199 Cleanup Pass on flaky slow tests for Stable Diffusion (#4455)
* lower num inference steps and precision checkk

* fix flaky inpaint tests

* remove unsued imports

* set unet default attn processor
2023-08-04 10:24:56 +02:00
YiYi Xu
1edd0debaa fix-format (#4458)
make style

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-08-03 20:34:37 -07:00
YiYi Xu
29ece0db79 a few fix for kandinsky combined pipeline (#4352)
* add xformer

* enable_sequential_cpu_offload

* style

* Update src/diffusers/pipelines/kandinsky/pipeline_kandinsky_combined.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-08-03 15:10:41 -10:00
Patrick von Platen
1a8843f93e add sdxl to prompt weighting (#4439)
* add sdxl to prompt weighting

* Update docs/source/en/using-diffusers/weighted_prompts.md

* Update docs/source/en/using-diffusers/weighted_prompts.md

* add sdxl to prompt weighting

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

* Update docs/source/en/using-diffusers/weighted_prompts.md

* Apply suggestions from code review

* correct

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-08-03 21:41:48 +02:00
JinK
e391b789ac Support different strength for Stable Diffusion TensorRT Inpainting pipeline (#4216)
* Support different strength

* run make style
2023-08-03 21:32:44 +02:00
VV-A-VV
d0b8de1262 Delete the duplicate code for the contolnet img 2 img (#4411)
delete the duplicated code from the controlnet-img-2-img pipelines
2023-08-03 20:32:03 +02:00
Yuyang Zhao
b9058754c5 Fix bug caused by typo (#4357)
Fix the typo that causes `omegaconf.errors.ConfigAttributeError: Missing key parms`
2023-08-03 20:27:42 +02:00
Alan Ji
777becda6b fix typo to ensure make test-examples work correctly (#4329)
fix typo to ensure `make test-examples` work correctly
2023-08-03 20:17:48 +02:00
cmdr2
380bfd82c1 Allow controlnets to be loaded (from ckpt) in a parallel thread with a SD model (ckpt), and speed it up slightly (#4298)
Faster controlnet model instantiation, and allow controlnets to be loaded (from ckpt) in a parallel thread with a SD model (ckpt) without  tensor errors (race condition)
2023-08-03 20:11:13 +02:00
Steven Liu
5989a85edb [docs] Distilled SD (#4442)
* first draft

* add blog link
2023-08-03 11:03:42 -07:00
Levi McCallum
4188f3063a Add rank argument to train_dreambooth_lora_sdxl.py (#4343)
* Add rank argument to train_dreambooth_lora_sdxl.py

* Update train_dreambooth_lora_sdxl.py
2023-08-03 23:27:30 +05:30
George He
0b4430e840 Fix typerror in pipeline handling for MultiControlNets which only contain a single ControlNet (#4454)
* Handle single controlnet in multicontrolnet

* Fix formatting
2023-08-03 17:37:07 +02:00
Patrick von Platen
a74c995e7d make style 2023-08-03 14:45:06 +00:00
Neil Wang
85aa673bec auto type conversion (#4270)
* type conversion

Default value of `control_guidance_start` and `control_guidance_end` in `StableDiffusionControlNetPipeline.check_inputs` causes `TypeError: object of type 'float' has no len()`

Proposed fix: 
Convert `control_guidance_start` and `control_guidance_end` to list if float

* Update src/diffusers/pipelines/controlnet/pipeline_controlnet.py

* Update src/diffusers/pipelines/controlnet/pipeline_controlnet.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/controlnet/pipeline_controlnet.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-03 16:44:48 +02:00
Dhruv Nair
1d2587bb34 move tests to nightly (#4451)
* move tests to nightly

* clean up code quality issues

* more clean up
2023-08-03 15:25:28 +02:00
Patrick von Platen
372b58108e fix make style 2023-08-03 10:17:00 +00:00
w4ffl35
45171174b8 Prevent online access when desired when using download_from_original_stable_diffusion_ckpt (#4271)
Prevent online access when desired

- Bypass requests with config files option added to download_from_original_stable_diffusion_ckpt
- Adds local_files_only flags to all from_pretrained requests
2023-08-03 12:16:41 +02:00
cmdr2
4c4fe042a7 Accept pooled_prompt_embeds in the SDXL Controlnet pipeline. Fixes an error if prompt_embeds are passed. (#4309)
* Accept pooled_prompt_embeds in the SDXL Controlnet pipeline. Fixes an error if prompt_embeds are passed.

* Add a test for pooled prompt embeds
2023-08-03 13:05:19 +05:30
Will Berman
47bf8e566c can call encode_prompt with out setting a text encoder instance variable (#4396)
* can call encode_prompt with out setting a text encoder instance variable

* fix
2023-08-02 21:25:30 -07:00
Sayak Paul
18fc40c169 [Feat] add tiny Autoencoder for (almost) instant decoding (#4384)
* add: model implementation of tiny autoencoder.

* add: inits.

* push the latest devs.

* add: conversion script and finish.

* add: scaling factor args.

* debugging

* fix denormalization.

* fix: positional argument.

* handle use_torch_2_0_or_xformers.

* handle post_quant_conv

* handle dtype

* fix: sdxl image processor for tiny ae.

* fix: sdxl image processor for tiny ae.

* unify upcasting logic.

* copied from madness.

* remove trailing whitespace.

* set is_tiny_vae = False

* address PR comments.

* change to AutoencoderTiny

* make act_fn an str throughout

* fix: apply_forward_hook decorator call

* get rid of the special is_tiny_vae flag.

* directly scale the output.

* fix dummies?

* fix: act_fn.

* get rid of the Clamp() layer.

* bring back copied from.

* movement of the blocks to appropriate modules.

* add: docstrings to AutoencoderTiny

* add: documentation.

* changes to the conversion script.

* add doc entry.

* settle tests.

* style

* add one slow test.

* fix

* fix 2

* fix 2

* fix: 4

* fix: 5

* finish integration tests

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* style

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-08-02 23:58:05 +05:30
Xin Kong
615c04db15 [Pipelines] Add community pipeline for Zero123 (#4295)
* add zero123 pipeline to community

* add community doc

* reformat

* update zero123 pipeline, including cc_projection within diffusers; add convert ckpt scripts; support diffusers weights
2023-08-02 19:36:49 +02:00
Steven Liu
ae82a3eb34 [docs] AutoPipeline tutorial (#4273)
* first draft

* tidy api

* apply feedback

* mdx to md

* apply feedback

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-02 10:32:02 -07:00
Sayak Paul
816ca0048f [LoRA] Fix SDXL text encoder LoRAs (#4371)
* temporarily disable text encoder loras.

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debbuging.

* modify doc.

* rename tests.

* print slices.

* fix: assertions

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-08-02 17:00:56 +05:30
Sayak Paul
fef8d2f726 remove mentions of textual inversion from sdxl. (#4404) 2023-08-02 15:29:46 +05:30
Ella Charlaix
579b4b2020 Update documentation (#4422)
* update documentation

* minor
2023-08-02 11:49:22 +02:00
Steven Liu
6c5bd2a38d [docs] Fix SDXL docstring (#4397)
fix guidance scale value
2023-08-01 16:40:15 -07:00
Will Berman
160474ac61 train dreambooth fix pre encode class prompt (#4395) 2023-08-01 12:00:05 -07:00
YiYi Xu
c10861ee1b fix test_float16_inference (#4412)
fix

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-08-01 07:49:50 -10:00
YiYi Xu
94b332c476 support from_single_file for SDXL inpainting (#4408)
fix

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-08-01 07:47:22 -10:00
Dhruv Nair
6f4355f89f Cleanup pass for flaky Slow Tests for Stable diffusion (#4415)
* update expected slice so img2img compile tests pass

* use default attn processor

* use default attn processor and update expected slice value to pass test

* use default attn processor

* set default attn processor and update expected slice

* set default attn processor and change precision for check

* set unet to use default attn processor
2023-08-01 18:21:14 +02:00
estelleafl
05a1cb902c [ldm3d] documentation fixing typos (#4284)
* fixed typo

* updated doc to be consistent in naming

* make style/quality

* preprocessing for 4 channels and not 6

* make style

* test for 4c

* make style/quality

* fixed test on cpu

* fixed doc typo

* changed default ckpt to 4c

* Update pipeline_stable_diffusion_ldm3d.py

---------

Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu33.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu38.rr.intel.com>
2023-08-01 09:03:29 -07:00
Patrick von Platen
c69526a3d5 [AutoPipeline] Correct naming (#4420) 2023-08-01 14:56:27 +02:00
Nishant Rajadhyaksha
6c49d542a3 Update docs of unet_1d.py (#4394)
Update unet_1d.py

highlighting the way the modules are actually fed in the main code as the order matters because no skip block attaches time embeds whilst others do not
2023-07-31 11:04:47 -07:00
Sayak Paul
ba43ce3476 minor doc fixes. (#4380) 2023-07-31 12:15:56 +05:30
Andrey Voroshilov
ea5b0575f8 Clean up duplicate lines in encode_prompt (#4369)
* Clean up duplicate line

* Clean up duplicate lines

* Clean up duplicate line
2023-07-30 15:49:46 +05:30
Patrick von Platen
4f986fb28a [SDXL] Fix dummy imports incorrect naming (#4370)
[SDXL] Fix dummy imports
2023-07-30 12:17:38 +02:00
Harutatsu Akiyama
aae27262f4 [SDXL-IP2P] Add gif for demonstrating training processes (#4342)
* [SDXL-IP2P] Add gif for demonstrating training processes

* [SDXL-IP2P] Add gif for demonstrating training processes

* [SDXL-IP2P] Change gif to URLs

* [SDXL-IP2P] Add URLs in case gif now show

---------

Co-authored-by: Harutatsu Akiyama <kf.zy.qin@gmail.com>
2023-07-30 10:07:10 +05:30
Sayak Paul
34b5b63bb8 Update README.md to have PyPI-friendly path (#4351) 2023-07-29 08:59:18 +05:30
Will Berman
2b1786735e fix fp type in t2i adapter docs (#4350) 2023-07-28 13:01:52 -07:00
Sayak Paul
4a4cdd6b07 [Feat] Support SDXL Kohya-style LoRA (#4287)
* sdxl lora changes.

* better name replacement.

* better replacement.

* debugging

* debugging

* debugging

* debugging

* debugging

* remove print.

* print state dict keys.

* print

* distingisuih better

* debuggable.

* fxi: tyests

* fix: arg from training script.

* access from class.

* run style

* debug

* save intermediate

* some simplifications for SDXL LoRA

* styling

* unet config is not needed in diffusers format.

* fix: dynamic SGM block mapping for SDXL kohya loras (#4322)

* Use lora compatible layers for linear proj_in/proj_out (#4323)

* improve condition for using the sgm_diffusers mapping

* informative comment.

* load compatible keys and embedding layer maaping.

* Get SDXL 1.0 example lora to load

* simplify

* specif ranks and hidden sizes.

* better handling of k rank and hidden

* debug

* debug

* debug

* debug

* debug

* fix: alpha keys

* add check for handling LoRAAttnAddedKVProcessor

* sanity comment

* modifications for text encoder SDXL

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* denugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* up

* up

* up

* up

* up

* up

* unneeded comments.

* unneeded comments.

* kwargs for the other attention processors.

* kwargs for the other attention processors.

* debugging

* debugging

* debugging

* debugging

* improve

* debugging

* debugging

* more print

* Fix alphas

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* clean up

* clean up.

* debugging

* fix: text

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Batuhan Taskaya <batuhan@python.org>
2023-07-28 19:49:49 +02:00
Patrick von Platen
b7b6d6138d [SDXL] Make watermarker optional under certain circumstances to improve usability of SDXL 1.0 (#4346)
* improve sdxl

* more fixes

* improve sdxl

* improve sdxl

* improve sdxl

* finish
2023-07-28 19:29:22 +02:00
kathath
faa6cbc959 Fix repeat of negative prompt (#4335)
fix repeat of negative prompt
2023-07-28 18:14:22 +02:00
Patrick von Platen
306a7bd047 [ONNX] Don't download ONNX model by default (#4338)
* [Download] Don't download ONNX weights by default

* [Download] Don't download ONNX weights by default

* [Download] Don't download ONNX weights by default

* fix more

* finish

* finish

* finish
2023-07-28 14:02:48 +02:00
Tanupriya Singh
c7250f2b8a correct doc string for default value of guidance_scale (#4339) 2023-07-28 13:54:28 +02:00
Patrick von Platen
18b018c864 [SDXL Refiner] Fix refiner forward pass for batched input (#4327)
* fix_batch_xl

* Fix other pipelines as well

* up

* up

* Update tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py

* sort

* up

* Finish it all up Co-authored-by: Bagheera <bghira@users.github.com>

* Co-authored-by: Bagheera bghira@users.github.com

* Co-authored-by: Bagheera <bghira@users.github.com>

* Finish it all up Co-authored-by: Bagheera <bghira@users.github.com>
2023-07-28 12:34:18 +02:00
Sayak Paul
54fab2cd5f Update README_sdxl.md to correct the header (#4330)
Update README_sdxl.md
2023-07-28 09:22:14 +05:30
Sayak Paul
961173064d Honor the SDXL 1.0 licensing from the training scripts. (#4319)
* honor the original license.

* train_instruct_pix2pix_xl -> train_instruct_pix2pix_sdxl
2023-07-28 01:28:36 +05:30
Sayak Paul
7d0d073261 [Tests] add test for pipeline import. (#4276)
* add test for pipeline import.

* Update tests/others/test_dependencies.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* address suggestions

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-28 00:08:15 +05:30
Xinyang Li
01b6ec21fa fix validation option for dreambooth training example (#4317) 2023-07-27 09:58:52 -07:00
Ella Charlaix
92e5ddd295 Fix typo documentation (#4320)
fix typo documentation
2023-07-27 21:31:58 +05:30
Patrick von Platen
1926331eaf [Local loading] Correct bug with local files only (#4318)
* [Local loading] Correct bug with local files only

* file not found error

* fix

* finish
2023-07-27 16:16:46 +02:00
YiYi Xu
5fd3dca5f3 fix a bug in StableDiffusionUpscalePipeline when prompt is None (#4278)
* fix batch_size

* add test

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-07-27 15:07:50 +02:00
Duong A. Nguyen
a2091b7071 Fix SDXL conversion from original to diffusers (#4280)
* fix sdxl conversion

* convention
2023-07-27 15:05:43 +02:00
Patrick von Platen
d8bc1a4e51 [Torch.compile] Fixes torch compile graph break (#4315)
* fix torch compile

* Fix all

* make style
2023-07-27 13:53:36 +02:00
YiYi Xu
80c10d8245 update Kandinsky doc (#4301)
* update doc

* fix an error in autopipe doc

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-07-27 13:10:41 +02:00
Patrick von Platen
20e92586c1 0.20.0dev0 (#4299)
* 0.20.0dev0

* make style
2023-07-26 23:06:18 +02:00
Patrick von Platen
5623ea065a quick fix 2023-07-26 21:01:17 +02:00
Patrick von Platen
16049caf79 quick fix 2023-07-26 18:47:21 +00:00
Patrick von Platen
6a6dfe1cbd Rename (#4294)
* up

* Apply suggestions from code review

* Apply suggestions from code review

* up
2023-07-26 20:41:21 +02:00
Ella Charlaix
b83bdce42a add openvino and onnx runtime SD XL documentation (#4285)
* add openvino SD XL documentation

* add onnx SD XL integration

* rephrase

* update doc

* add images

* update model
2023-07-26 20:25:07 +02:00
camenduru
c6ae9b7df6 Where did this 'x' come from, Elon? (#4277)
* why mdx?

* why mdx?

* why mdx?

* no x for kandinksy either

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-26 18:18:14 +02:00
Patrick von Platen
b3e5cd6b4d [Kandinsky] Add combined pipelines / Fix cpu model offload / Fix inpainting (#4207)
* Add combined pipeline

* Download readme

* Upload

* up

* up

* fix final

* Add enable model cpu offload kandinsky

* finish

* finish

* Fix

* fix more

* make style

* fix kandinsky mask

* fix inpainting test

* add callbacks

* add tests

* fix tests

* Apply suggestions from code review

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* docs

* docs

* correct docs

* fix tests

* add warning

* correct docs

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>
2023-07-26 17:13:55 +02:00
Patrick von Platen
b37dc3b3cd Fix all missing optional import statements from pipeline folders (#4272)
* fix circular import

* fix imports when watermark not specified

* fix all pipelines
2023-07-26 01:46:05 +02:00
Batuhan Taskaya
ff8f58086b Load Kohya-ss style LoRAs with auxilary states (#4147)
* Support to load Kohya-ss style LoRA file format (without restrictions)

Co-Authored-By: Takuma Mori <takuma104@gmail.com>
Co-Authored-By: Sayak Paul <spsayakpaul@gmail.com>

* tmp: add sdxl to mlp_modules

---------

Co-authored-by: Takuma Mori <takuma104@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-07-26 00:24:19 +02:00
Sayak Paul
161449d51a [SDXL DreamBooth LoRA] multiple fixes (#4262)
* add automatic licensing.

* debugging

* debugging

* more debugging

* more debugging.

* run make fix-copies.

* change to default tracker.
2023-07-25 21:10:01 +02:00
Steven Liu
34abee0907 [docs] Fix image in SDXL docs (#4267)
fix image link
2023-07-25 09:41:11 -07:00
Harutatsu Akiyama
428dbfecd9 [SDXL and IP2P]: instruction pix2pix XL training and pipeline (#4079)
* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* [Community] Implementation of the IADB community pipeline (#3996)

* community pipeline: implementation of iadb

* iadb.py: reformat using black

* iadb.py: linting update

* add kandinsky to readme table (#4081)

Co-authored-by: yiyixuxu <yixu310@gmail,com>

* [From Single File] Force accelerate to be installed (#4078)

force accelerate to be installed

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Support instruction pix2pix sdxl

* Clean up IP2P SDXL code

* Clean up IP2P SDXL code

* [IP2P and SDXL] clean up code

* [IP2P and SDXL] clean up code

* [IP2P and SDXL] clean up code

* [IP2P SDXL] Address code reviews

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews, add docs, tests

* [IP2P SDXL] Address code reviews

* [IP2P SDXL] Address code reviews

* [IP2P SDXL] Add README_SDXL

* [IP2P SDXL] Address code reviews

* [IP2P SDXL] Address code reviews

* [IP2P SDXL] Fix the copy problems

* [IP2P SDXL] Add license

* [IP2P SDXL] Add license

* [IP2P SDXL] Add license

* [IP2P SDXL] Address code reivew for selecting VAE andd others

* [IP2P SDXL] Update README_sdxl

* [IP2P SDXL] Update __init__

* [IP2P SDXL] Update dummy_torch_and_transformers_and_invisible_watermark_objects

* address patrick's comments and some additions to readmes.

---------

Co-authored-by: Harutatsu Akiyama <kf.zy.qin@gmail.com>
Co-authored-by: Thomas Chambon <36728882+tchambon@users.noreply.github.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-07-25 18:19:35 +05:30
Ragnar Rova
4e2a021829 Model path for sdxl wrong in dreambooth README (#4261) 2023-07-25 18:06:50 +05:30
Patrick von Platen
ebfe343149 [from_single_file] Fix circular import (#4259)
* up

* finish

* fix final
2023-07-25 14:30:39 +02:00
Sayak Paul
5ef6b8fa53 Update README_sdxl.md to change the note on default hyperparameters (#4258) 2023-07-25 16:57:48 +05:30
YiYi Xu
c11d11d63d [draft v2] AutoPipeline (#4138)
* initial

* style

* from ...pipelines -> from ..pipeline_util

* make style

* fix-copies

* fix value_guided_sampling oops

* style

* add test

* Show failing test

* update from_pipe

* fix

* add controlnet, additional test and register unused original config

* update for controlnet

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* store unused config as private attribute and pass if can

* add doc

* kandinsky inpaint pipeline does not work with decoder checkpoint

* update doc

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* style

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* fix

* Apply suggestions from code review

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-07-25 13:20:35 +02:00
Patrick von Platen
d74561da2c [SDXL] Improve docs (#4196)
* Improve docs

* Correct docs

* Add better example inpaint

* make style

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-07-25 12:48:25 +02:00
Patrick von Platen
a0422ed0c9 [From Single File] Allow vae to be loaded (#4242)
* Allow vae to be loaded

* up
2023-07-25 12:16:43 +02:00
Will Berman
3dd339379d do not pass list to accelerator.init_trackers (#4248) 2023-07-24 21:10:37 -07:00
nupurkmr9
5652c43f83 Resolve bf16 error as mentioned in this [issue](https://github.com/huggingface/diffusers/issues/4139#issuecomment-1639977304) (#4214)
* resolve bf16 error

* resolve bf16 error

* resolve bf16 error

* resolve bf16 error

* resolve bf16 error

* resolve bf16 error

* resolve bf16 error
2023-07-25 05:41:19 +05:30
Sayak Paul
365e8461ac [SDXL DreamBooth LoRA] add support for text encoder fine-tuning (#4097)
* Allow low precision sd xl

* finish

* finish

* feat: initial draft for supporting text encoder lora finetuning for SDXL DreamBooth

* fix: variable assignments.

* add: autocast block.

* add debugging

* vae dtype hell

* fix: vae dtype hell.

* fix: vae dtype hell 3.

* clean up

* lora text encoder loader.

* fix: unwrapping models.

* add: tests.

* docs.

* handle unexpected keys.

* fix vae dtype in the final inference.

* fix scope problem.

* fix: save_model_card args.

* initialize: prefix to None.

* fix: dtype issues.

* apply gixes.

* debgging.

* debugging

* debugging

* debugging

* debugging

* debugging

* add: fast tests.

* pre-tokenize.

* address: will's comments.

* fix: loader and tests.

* fix: dataloader.

* simplify dataloader.

* length.

* simplification.

* make style && make quality

* simplify state_dict munging

* fix: tests.

* fix: state_dict packing.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-25 05:35:48 +05:30
Sayak Paul
fed12376c5 [ControlNet SDXL training] fixes in the training script (#4223)
* fix: #4206

* add: sdxl controlnet training smoketest.

* remove unnecessary token inits.

* add: licensing to model card.

* include SDXL licensing in the model card and make public visibility default

* debugging

* debugging

* disable local file download.

* fix: training test.

* fix: ckpt prefix.
2023-07-25 05:31:48 +05:30
Patrick von Platen
95b7de88fd [Docs] Fix from pretrained docs (#4240)
* [Docs] Fix from pretrained docs

* [Docs] Fix from pretrained docs
2023-07-24 20:24:29 +02:00
Apoorva Kulkarni
cbb1ead60b docs: Add missing import statement in textual_inversion inference example (#4227)
docs: Add missing import statement in textual_inversion inference instructions
2023-07-24 11:07:53 -07:00
Steven Liu
5470a4fce3 [docs] Other modalities (#4205)
remove coming soon, rl pipeline
2023-07-24 10:51:24 -07:00
39th president of the United States, probably
e98fabc550 Allow specifying denoising_start and denoising_end as integers representing the discrete timesteps, fixing the XL ensemble not working for many schedulers (#4115)
* Fix the XL ensemble not working for any kerras scheduler sigmas and having an off by one bug

* Update src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py

* make sytle

---------

Co-authored-by: Jimmy <39@🇺🇸.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-24 19:44:35 +02:00
Cris
fa356bd4da [docs] Changed path for ControlNet in docs (#4215)
docs: changed path for control net
2023-07-24 10:13:10 -07:00
Patrick von Platen
3ba36f97b8 [SD-XL] Fix sdxl controlnet inference (#4238)
* Fix controlnet xl inference

* correct some sd xl control inference
2023-07-24 18:43:35 +02:00
Patrick von Platen
b288684d25 [SDXL] Fix sd xl encode prompt (#4237)
* [SDXL] Fix sd xl encode prompt

* add tests
2023-07-24 18:37:07 +02:00
Lucain
06eda5b232 Raise initial HTTPError if pipeline is not cached locally (#4230)
* Raise initial HTTPError if pipeline is not cached locally

* make style
2023-07-24 15:35:16 +02:00
Hu Ye
8e5921cac1 fix a bug of prompt embeds in sdxl (#4099)
* fix bug in sdxl

* Update pipeline_stable_diffusion_xl_img2img.py

* Update pipeline_stable_diffusion_xl.py

* Update pipeline_stable_diffusion_xl_img2img.py

* Update pipeline_stable_diffusion_xl_inpaint.py

* Update pipeline_stable_diffusion_xl.py

* Update pipeline_stable_diffusion_xl_img2img.py

* Update pipeline_stable_diffusion_xl_inpaint.py

* Update pipeline_stable_diffusion_xl_img2img.py

* Update pipeline_controlnet_sd_xl.py

* Update pipeline_controlnet_sd_xl.py

* Update pipeline_stable_diffusion_xl.py

* Update pipeline_stable_diffusion_xl_img2img.py

* Update pipeline_stable_diffusion_xl_inpaint.py

* Update test_stable_diffusion_xl.py

* Update test_stable_diffusion_xl.py

* Update test_stable_diffusion_xl.py

add test on prompt_embeds

* add test on prompt_embeds

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-24 10:14:20 +02:00
YiYi Xu
8e8954bd15 fix no CFG for kandinsky pipelines (#4193)
* fix bug when no cfg

* style

* fix no cfg for shap-e and cycle

* style

* fix no cfg for sdxl

* fix copies

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-07-24 10:10:14 +02:00
Jackmin801
09fab5610e [fix] network_alpha when loading unet lora from old format (#4221)
fix: missed network_alpha when loading lora from old format
2023-07-24 06:13:32 +05:30
Apoorva Kulkarni
2e53936c97 docs: Typo in dreambooth example README.md (#4203)
fix: Typo in dreambooth example README.md
2023-07-21 15:16:38 -07:00
Steven Liu
a69754bb87 [docs] Clean up pipeline apis (#3905)
* start with stable diffusion

* fix

* finish stable diffusion pipelines

* fix path to pipeline output

* fix flax paths

* fix copies

* add up to score sde ve

* finish first pass of pipelines

* fix copies

* second review

* align doc titles

* more review fixes

* final review
2023-07-21 11:01:34 -07:00
Kadir Nar
bcc570b910 📄 Renamed File for Better Understanding (#4056)
* 📄 Renamed File for Better Understanding

Renamed the 'rl' file to 'run_locomotion'. This change was made to improve the clarity and readability of the codebase. The 'rl' name was ambiguous, and 'run_locomotion' provides a more clear description of the file's purpose.

Thanks 🙌

* 📁 [Docs] Renamed Directory for Better Clarity

Renamed the 'rl' directory to 'reinforcement_learning'. This change provides a clearer understanding of the directory's purpose and its contents.

* Update examples/reinforcement_learning/README.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* 📝 Update README

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-21 09:08:27 -07:00
Sayak Paul
4dcab9227a [SDXL ControlNet Training] Follow-up fixes (#4188)
* hash computation. thanks to @lhoestq

* disable dtype casting.

* remove comments.
2023-07-21 20:55:33 +05:30
apolinário
aed30dff6b Allow passing different prompts to each text_encoder on stable_diffusion_xl pipelines (#4156)
* sdxl prompt2

* Improve checks

* doc linting

* whoops

* remove cat

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add other pipelines and tests

* Add multi-prompting to docs

* doc and copies check

* Fix copied froms

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Bring back the original code for unrelated files

* Fix tests

* Fix img2img

* Fix all

* fix

---------

Co-authored-by: multimodalart <joaopaulo.passos+multimodal@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-07-21 14:50:22 +02:00
statelesshz
e2bbaa4f54 make enable_sequential_cpu_offload more generic for third-party devices (#4191)
* make enable_sequential_cpu_offload more generic for third-party devices

* make style
2023-07-21 17:15:09 +05:30
Patrick von Platen
1e853e240e [Safetensors] make safetensors a required dep (#4177) 2023-07-21 12:57:30 +02:00
Batuhan Taskaya
ad787082e2 Fix unloading of LoRAs when xformers attention procs are in use (#4179) 2023-07-21 14:29:20 +05:30
Will Berman
7a47df22a5 remove bentoml doc in favor of blogpost (#4182) 2023-07-21 08:23:36 +05:30
Patrick von Platen
d620070bb3 [ControlNet Training] Remove safety from controlnet (#4180)
Remove safety from controlnet
2023-07-21 08:03:59 +05:30
Patrick von Platen
b7a6e34cc6 [From single file] Make sure that controlnet stays False for from_single_file (#4181)
* fix from signle file

* Make sure converison always works with safetensors
2023-07-21 02:09:11 +02:00
YiYi Xu
47b3346422 Shap-E: add support for mesh output (#4062)
* add output_type=mesh

* update img2img

* make style

* add doc

* make style

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* add docstring for output_type

* add a section in doc about hub mesh visualization/ rotation

* update conversion script so default background is white

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* renderer -> shap_e_renderer

* img2img renderer -> shap_e_renderer

* fix tests

---------

Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-07-20 18:05:13 +02:00
Ruslan Vorovchenko
07f1fbb18e Asymmetric vqgan (#3956)
* added AsymmetricAutoencoderKL

* fixed copies+dummy

* added script to convert original asymmetric vqgan

* added docs

* updated docs

* fixed style

* fixes, added tests

* update doc

* fixed doc

* fixed tests

* naming

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* naming

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* udpated code example

* updated doc

* comments fixes

* added docstring

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* comments fixes

* added inpaint pipeline tests

* comment suggestion: delete method

* yet another fixes

---------

Co-authored-by: Ruslan Vorovchenko <r.vorovchenko@prequelapp.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-20 17:51:06 +02:00
Lim Swee Kiat
2551b73670 Fix bug in ControlNetPipelines with MultiControlNetModel of length 1 (#4032)
* Fix bug in ControlNetPipelines with MultiControlNetModel of length 1

* Add tests for varying number of ControlNet models

* Fix missing indexing for control_guidance_start and control_guidance_end

* Fix code quality

* Separate test for MultiControlNet with one model

* Revert formatting of earlier test
2023-07-20 17:45:08 +02:00
Allen Zhang
930c8fdcb7 fix incorrect attention head dimension in AttnProcessor2_0 (#4154)
fix inner_dim
2023-07-20 17:40:50 +02:00
Patrick von Platen
6b1abba18d Add controlnet and vae from single file (#4084)
* Add controlnet from single file

* Updates

* make style

* finish

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-07-19 14:50:27 +02:00
Saurav Maheshkar
470f51cd26 feat: add act_fn param to OutValueFunctionBlock (#3994)
* feat: add act_fn param to OutValueFunctionBlock

* feat: update unet1d tests to not use mish

* feat: add `mish` as the default activation function

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* feat: drop mish tests from unet1d

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-19 14:44:44 +02:00
Patrick von Platen
b7e35dc782 make style 2023-07-19 14:28:35 +02:00
Byron Mallett
c77ac246c1 Fixed SDXL single file loading to use the correct requested pipeline class (#4142)
Using pipeline_class argument to instantiate the correct pipeline when loading SDXL models from single files
2023-07-19 14:40:13 +02:00
Zhao Shenyang
ed2a3584ab Docs/bentoml integration (#4090)
* docs: first draft of BentoML integration

* Update the diffusers doc

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* add BentoML integration guide under Optimization section

* restyle codes

---------

Co-authored-by: Sherlock113 <sherlockxu07@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-07-18 11:56:13 -07:00
Sayak Paul
3eb498e7b4 [Core] add: controlnet support for SDXL (#4038)
* add: controlnet sdxl.

* modifications to controlnet.

* run styling.

* add: __init__.pys

* incorporate https://github.com/huggingface/diffusers/pull/4019 changes.

* run make fix-copies.

* resize the conditioning images.

* remove autocast.

* run styling.

* disable autocast.

* debugging

* device placement.

* back to autocast.

* remove comment.

* save some memory by reusing the vae and unet in the pipeline.

* apply styling.

* Allow low precision sd xl

* finish

* finish

* changes to accommodate the improved VAE.

* modifications to how we handle vae encoding in the training.

* make style

* make existing controlnet fast tests pass.

* change vae checkpoint cli arg.

* fix: vae pretrained paths.

* fix: steps in get_scheduler().

* debugging.

* debugging./

* fix: weight conversion.

* add: docs.

* add: limited tests./

* add: datasets to the requirements.

* update docstrings and incorporate the usage of watermarking.

* incorporate fix from #4083

* fix watermarking dependency handling.

* run make-fix-copies.

* Empty-Commit

* Update requirements_sdxl.txt

* remove vae upcasting part.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* run make style

* run make fix-copies.

* disable suppot for multicontrolnet.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* run make fix-copies.

* dtyle/.

* fix-copies.

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-18 18:25:34 +05:30
clarencechen
c6e56e92ed Add Recent Timestep Scheduling Improvements to DDIM Inverse Scheduler (#3865)
* Add Recent Timestep Scheduling Improvements to DDIM Inverse Scheduler

Roll timesteps by one to reflect origin-destination semantic discrepancy

Restore `set_alpha_to_one` option to handle negative initial timesteps

Remove `set_alpha_to_zero` option not used due to previous truncation

* Bugfix

* Remove unnecessary calls to `detach()`

Use `self.image_processor.preprocess` in DiffEdit pipeline functions

* Preprocess list input for inverted image latents in diffedit pipeline

* Add `timestep_spacing` and `steps_offset` to `DPMSolverMultistepInverseScheduler`

* Update expected test results to account for inverting last forward diffusion step

* Fix inversion progress bar bug

* Add first draft for proper fast tests for DDIMInverseScheduler

* Add deprecated DDIMInverseScheduler kwarg to ConfigMixer registry

* Fix test failure in DPMMultistepInverseScheduler

Invert step specification leads to negative noise variance in SDE-based algs

Add first draft for proper fast tests for DPMMultistepInverseScheduler

* Update expected test results to account for inverting last forward diffusion step

Clean up diffedit fast test
2023-07-18 11:35:16 +02:00
Patrick von Platen
27062c3631 Refactor execution device & cpu offload (#4114)
* create general cpu offload & execution device

* Remove boiler plate

* finish

* kp

* Correct offload more pipelines

* up

* Update src/diffusers/pipelines/pipeline_utils.py

* make style

* up
2023-07-18 11:04:40 +02:00
takuoko
6427aa995e [Enhance] Add rank in dreambooth (#4112)
add rank in dreambooth
2023-07-18 11:30:06 +05:30
Seongsu Park
8b18cd8e7f [Docs] Korean translation update (#4022)
* feat) optimization kr translation

* fix) typo, italic setting

* feat) dreambooth, text2image kr

* feat) lora kr

* fix) LoRA

* fix) fp16 fix

* fix) doc-builder style

* fix) fp16 일부 단어 수정

* fix) fp16 style fix

* fix) opt, training docs update

* merge conflict

* Fix community pipelines (#3266)

* Allow disabling torch 2_0 attention (#3273)

* Allow disabling torch 2_0 attention

* make style

* Update src/diffusers/models/attention.py

* Release: v0.16.1

* feat) toctree update

* feat) toctree update

* Fix custom releases (#3708)

* Fix custom releases

* make style

* Fix loading if unexpected keys are present (#3720)

* Fix loading

* make style

* Release: v0.17.0

* opt_overview

* commit

* Create pipeline_overview.mdx

* unconditional_image_generatoin_1stDraft

*  Add translation for write_own_pipeline.mdx

* conditional-직역, 언컨디셔널

* unconditional_image_generation first draft

* reviese

* Update pipeline_overview.mdx

* revise-2

* ♻️ translation fixed for write_own_pipeline.mdx

* complete translate basic_training.mdx

* other-formats.mdx 번역 완료

* fix tutorials/basic_training.mdx

* other-formats 수정

* inpaint 한국어 번역

* depth2img translation

* translate training/adapt-a-model.mdx

* revised_all

* feedback taken

* using_safetensors.mdx_first_draft

* custom_pipeline_examples.mdx_first_draft

* img2img 한글번역 완료

* tutorial_overview edit

* reusing_seeds

* torch2.0

* translate complete

* fix) 용어 통일 규약 반영

* [fix] 피드백을 반영해서 번역 보정

* 오탈자 정정 + 컨벤션 위배된 부분 정정

* typo, style fix

* toctree update

* copyright fix

* toctree fix

* Update _toctree.yml

---------

Co-authored-by: Chanran Kim <seriousran@gmail.com>
Co-authored-by: apolinário <joaopaulo.passos@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lee, Hongkyu <75282888+howsmyanimeprofilepicture@users.noreply.github.com>
Co-authored-by: hyeminan <adios9709@gmail.com>
Co-authored-by: movie5 <oyh5800@naver.com>
Co-authored-by: idra79haza <idra79haza@github.com>
Co-authored-by: Jihwan Kim <cuchoco@naver.com>
Co-authored-by: jungwoo <boonkoonheart@gmail.com>
Co-authored-by: jjuun0 <jh061993@gmail.com>
Co-authored-by: szjung-test <93111772+szjung-test@users.noreply.github.com>
Co-authored-by: idra79haza <37795618+idra79haza@users.noreply.github.com>
Co-authored-by: howsmyanimeprofilepicture <howsmyanimeprofilepicture@gmail.com>
Co-authored-by: hoswmyanimeprofilepicture <hoswmyanimeprofilepicture@gmail.com>
2023-07-17 18:28:08 -07:00
Will Berman
a0597f33ac t2i pipeline (#3932)
* Quick implementation of t2i-adapter

Load adapter module with from_pretrained

Prototyping generalized adapter framework

Writeup doc string for sideload framework(WIP) + some minor update on implementation

Update adapter models

Remove old adapter optional args in UNet

Add StableDiffusionAdapterPipeline unit test

Handle cpu offload in StableDiffusionAdapterPipeline

Auto correct coding style

Update model repo name to "RzZ/sd-v1-4-adapter-pipeline"

Refactor MultiAdapter to better compatible with config system

Export MultiAdapter

Create pipeline document template from controlnet

Create dummy objects

Supproting new AdapterLight model

Fix StableDiffusionAdapterPipeline common pipeline test

[WIP] Update adapter pipeline document

Handle num_inference_steps in StableDiffusionAdapterPipeline

Update definition of Adapter "channels_in"

Update documents

Apply code style

Fix doc typo and merge error

Update doc string and example

Quality of life improvement

Remove redundant code and file from prototyping

Remove unused pageage

Remove comments

Fix title

Fix typo

Add conditioning scale arg

Bring back old implmentation

Offload sideload

Add supply info on document

Update src/diffusers/models/adapter.py

Co-authored-by: Will Berman <wlbberman@gmail.com>

Update MultiAdapter constructor

Swap out custom checkpoint and update pipeline constructor

Update docment

Apply suggestions from code review

Co-authored-by: Will Berman <wlbberman@gmail.com>

Correcting style

Following single-file policy

Update auto size in image preprocess func

Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py

Co-authored-by: Will Berman <wlbberman@gmail.com>

fix copies

Update adapter pipeline behavior

Add adapter_conditioning_scale doc string

Add the missing doc string

Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Fix few bugs from suggestion

Handle L-mode PIL image as control image

Rename to differentiate adapter resblock

Update src/diffusers/models/adapter.py

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Fix typo

Update adapter parameter name

Update test case and code style

Fix copies

Fix typo

Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py

Co-authored-by: Will Berman <wlbberman@gmail.com>

Update Adapter class name

Add checkpoint converting script

Fix style

Fix-copies

Remove dev script

Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Updates for parameter rename

Fix convert_adapter

remove main

fix diff

more

refactoring

more

more

small fixes

refactor

tests

more slow tests

more tests

Update docs/source/en/api/pipelines/overview.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

add community contributor to docs

Update docs/source/en/api/pipelines/stable_diffusion/adapter.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Update docs/source/en/api/pipelines/stable_diffusion/adapter.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Update docs/source/en/api/pipelines/stable_diffusion/adapter.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Update docs/source/en/api/pipelines/stable_diffusion/adapter.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Update docs/source/en/api/pipelines/stable_diffusion/adapter.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

fix

remove from_adapters

license

paper link

docs

more url fixes

more docs

fix

fixes

fix

fix

* fix sample inplace add

* additional_kwargs -> additional_residuals

* move t2i adapter pipeline to own module

* preprocess -> _preprocess_adapter_image

* add TencentArc to license

* fix example code links

* add image converter and fix example doc string

* fix links

* clearer additional residual application

---------

Co-authored-by: HimariO <dsfhe49854@gmail.com>
2023-07-17 12:55:44 -07:00
Kadir Nar
3929954613 📝 Update doc with more descriptive title and filename for "IF" section (#4049)
* 📝 Update doc with more descriptive title and filename for "IF" section

Updated the documentation to provide a more descriptive title and filename for the "IF" section. Previously, having only "IF" as the title was not conveying a clear meaning. By renaming the section to "DeepFloyd IF," we provide users with a more informative and context-specific heading.

Thanks! 🙌

* 📝 Update name for "IF" section in 📝 Update name for "IF" section in README

Updated the link and name for the "IF" section in the README file to reflect the new heading "DeepFloyd IF."

* 📝 Fix broken link for "Instruct Pix2Pix" section in README

Fixed the broken link for the "Instruct Pix2Pix" section in the README file. Previously, the link was pointing to an incorrect location due to the presence of "stable_diffusion" in the URL. By removing "stable_diffusion" from the URL, I have corrected the error and ensured that users are directed to the correct section.

* 🔧💼 Updated parameters in _toctree.yml file

- ✏️ Updated 'local' parameter to 'api/pipelines/deepfloyd_if'.
- ✏️ Updated 'title' parameter to 'DeepFloyd IF'.

🎯 These changes aim to improve visibility and accessibility in the documentation of the DeepFloyd IF pipeline. 🚀📚
2023-07-17 09:25:37 -07:00
Apoorva Pandey
f6ce323633 Make setup.py compatible with pipenv (#4121) 2023-07-17 17:31:04 +02:00
edward zhu
6b33c11c5b add noise_sampler_seed to StableDiffusionKDiffusionPipeline.__call__ (#3911)
* add noise_sampler to StableDiffusionKDiffusionPipeline

* fix/docs: Fix the broken doc links (#3897)

* fix/docs: Fix the broken doc links

Signed-off-by: GitHub <noreply@github.com>

* Update docs/source/en/using-diffusers/write_own_pipeline.mdx

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* Add video img2img (#3900)

* Add image to image video

* Improve

* better naming

* make fix copies

* add docs

* finish tests

* trigger tests

* make style

* correct

* finish

* Fix more

* make style

* finish

* fix/doc-code: Updating to the latest version parameters (#3924)

fix/doc-code: update to use the new parameter

Signed-off-by: GitHub <noreply@github.com>

* fix/doc: no import torch issue (#3923)

Ffix/doc: no import torch issue

Signed-off-by: GitHub <noreply@github.com>

* Correct controlnet out of list error (#3928)

* Correct controlnet out of list error

* Apply suggestions from code review

* correct tests

* correct tests

* fix

* test all

* Apply suggestions from code review

* test all

* test all

* Apply suggestions from code review

* Apply suggestions from code review

* fix more tests

* Fix more

* Apply suggestions from code review

* finish

* Apply suggestions from code review

* Update src/diffusers/schedulers/scheduling_k_dpm_2_ancestral_discrete.py

* finish

* Adding better way to define multiple concepts and also validation capabilities. (#3807)

* - Added validation parameters
- Changed some parameter descriptions to better explain their use.
- Fixed a few typos.
- Added concept_list parameter for better management of multiple subjects
- changed logic for image validation

* - Fixed bad logic for class data root directories

* Defaulting validation_steps to None for an easier logic

* Fixed multiple validation prompts

* Fixed bug on validation negative prompt

* Changed validation logic for tracker.

* Added uuid for validation image labeling

* Fix error when comparing validation prompts and validation negative prompts

* Improved error message when negative prompts for validation are more than the number of prompts

* - Changed image tracking number from epoch to global_step
- Added Typing for functions

* Added some validations more when using concept_list parameter and the regular ones.

* Fixed error message

* Added more validations for validation parameters

* Improved messaging for errors

* Fixed validation error for parameters with default values

* - Added train step to image name for validation
- reformatted code

* - Added train step to image's name for validation
- reformatted code

* Updated README.md file.

* reverted back original script of train_dreambooth.py

* reverted back original script of train_dreambooth.py

* left one blank line at the eof

* reverted back setup.py

* reverted back setup.py

* added same logic for when parameters for prior preservation are used without enabling the flag while using concept_list parameter.

* Ran black formatter.

* fixed a few strings

* fixed import sort with isort and removed fstrings without placeholder

* fixed import order with ruff (since with isort wasn't ok)

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* [ldm3d] Update code to be functional with the new checkpoints (#3875)

* fixed typo

* updated doc to be consistent in naming

* make style/quality

* preprocessing for 4 channels and not 6

* make style

* test for 4c

* make style/quality

* fixed test on cpu

---------

Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu33.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu38.rr.intel.com>

* Improve memory text to video (#3930)

* Improve memory text to video

* Apply suggestions from code review

* add test

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* finish test setup

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* revert automatic chunking (#3934)

* revert automatic chunking

* Apply suggestions from code review

* revert automatic chunking

* avoid upcasting by assigning dtype to noise tensor (#3713)

* avoid upcasting by assigning dtype to noise tensor

* make style

* Update train_unconditional.py

* Update train_unconditional.py

* make style

* add unit test for pickle

* revert change

---------

Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>

* Fix failing np tests (#3942)

* Fix failing np tests

* Apply suggestions from code review

* Update tests/pipelines/test_pipelines_common.py

* Add `timestep_spacing` and `steps_offset` to schedulers (#3947)

* Add timestep_spacing to DDPM, LMSDiscrete, PNDM.

* Remove spurious line.

* More easy schedulers.

* Add `linspace` to DDIM

* Noise sigma for `trailing`.

* Add timestep_spacing to DEISMultistepScheduler.

Not sure the range is the way it was intended.

* Fix: remove line used to debug.

* Support timestep_spacing in DPMSolverMultistep, DPMSolverSDE, UniPC

* Fix: convert to numpy.

* Use sched. defaults when instantiating from_config

For params not present in the original configuration.

This makes it possible to switch pipeline schedulers even if they use
different timestep_spacing (or any other param).

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Missing args in DPMSolverMultistep

* Test: default args not in config

* Style

* Fix scheduler name in test

* Remove duplicated entries

* Add test for solver_type

This test currently fails in main. When switching from DEIS to UniPC,
solver_type is "logrho" (the default value from DEIS), which gets
translated to "bh1" by UniPC. This is different to the default value for
UniPC: "bh2". This is where the translation happens: 36d22d0709/src/diffusers/schedulers/scheduling_unipc_multistep.py (L171)

* UniPC: use same default for solver_type

Fixes a bug when switching from UniPC from another scheduler (i.e.,
DEIS) that uses a different solver type. The solver is now the same as
if we had instantiated the scheduler directly.

* do not save use default values

* fix more

* fix all

* fix schedulers

* fix more

* finish for real

* finish for real

* flaky tests

* Update tests/pipelines/stable_diffusion/test_stable_diffusion_pix2pix_zero.py

* Default steps_offset to 0.

* Add missing docstrings

* Apply suggestions from code review

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add Consistency Models Pipeline (#3492)

* initial commit

* Improve consistency models sampling implementation.

* Add CMStochasticIterativeScheduler, which implements the multi-step sampler (stochastic_iterative_sampler) in the original code, and make further improvements to sampling.

* Add Unet blocks for consistency models

* Add conversion script for Unet

* Fix bug in new unet blocks

* Fix attention weight loading

* Make design improvements to ConsistencyModelPipeline and CMStochasticIterativeScheduler and add initial version of tests.

* make style

* Make small random test UNet class conditional and set resnet_time_scale_shift to 'scale_shift' to better match consistency model checkpoints.

* Add support for converting a test UNet and non-class-conditional UNets to the consistency models conversion script.

* make style

* Change num_class_embeds to 1000 to better match the original consistency models implementation.

* Add support for distillation in pipeline_consistency_models.py.

* Improve consistency model tests:
	- Get small testing checkpoints from hub
	- Modify tests to take into account "distillation" parameter of ConsistencyModelPipeline
	- Add onestep, multistep tests for distillation and distillation + class conditional
	- Add expected image slices for onestep tests

* make style

* Improve ConsistencyModelPipeline:
	- Add initial support for class-conditional generation
	- Fix initial sigma for onestep generation
	- Fix some sigma shape issues

* make style

* Improve ConsistencyModelPipeline:
	- add latents __call__ argument and prepare_latents method
	- add check_inputs method
	- add initial docstrings for ConsistencyModelPipeline.__call__

* make style

* Fix bug when randomly generating class labels for class-conditional generation.

* Switch CMStochasticIterativeScheduler to configuring a sigma schedule and make related changes to the pipeline and tests.

* Remove some unused code and make style.

* Fix small bug in CMStochasticIterativeScheduler.

* Add expected slices for multistep sampling tests and make them pass.

* Work on consistency model fast tests:
	- in pipeline, call self.scheduler.scale_model_input before denoising
	- get expected slices for Euler and Heun scheduler tests
	- make Euler test pass
	- mark Heun test as expected fail because it doesn't support prediction_type "sample" yet
	- remove DPM and Euler Ancestral tests because they don't support use_karras_sigmas

* make style

* Refactor conversion script to make it easier to add more model architectures to convert in the future.

* Work on ConsistencyModelPipeline tests:
	- Fix device bug when handling class labels in ConsistencyModelPipeline.__call__
	- Add slow tests for onestep and multistep sampling and make them pass
	- Refactor fast tests
	- Refactor ConsistencyModelPipeline.__init__

* make style

* Remove the add_noise and add_noise_to_input methods from CMStochasticIterativeScheduler for now.

* Run python utils/check_copies.py --fix_and_overwrite
python utils/check_dummies.py --fix_and_overwrite to make dummy objects for new pipeline and scheduler.

* Make fast tests from PipelineTesterMixin pass.

* make style

* Refactor consistency models pipeline and scheduler:
	- Remove support for Karras schedulers (only support CMStochasticIterativeScheduler)
	- Move sigma manipulation, input scaling, denoising from pipeline to scheduler
	- Make corresponding changes to tests and ensure they pass

* make style

* Add docstrings and further refactor pipeline and scheduler.

* make style

* Add initial version of the consistency models documentation.

* Refactor custom timesteps logic following DDPMScheduler/IFPipeline and temporarily add torch 2.0 SDPA kernel selection logic for debugging.

* make style

* Convert current slow tests to use fp16 and flash attention.

* make style

* Add slow tests for normal attention on cuda device.

* make style

* Fix attention weights loading

* Update consistency model fast tests for new test checkpoints with attention fix.

* make style

* apply suggestions

* Add add_noise method to CMStochasticIterativeScheduler (copied from EulerDiscreteScheduler).

* Conversion script now outputs pipeline instead of UNet and add support for LSUN-256 models and different schedulers.

* When both timesteps and num_inference_steps are supplied, raise warning instead of error (timesteps take precedence).

* make style

* Add remaining diffusers model checkpoints for models in the original consistency model release and update usage example.

* apply suggestions from review

* make style

* fix attention naming

* Add tests for CMStochasticIterativeScheduler.

* make style

* Make CMStochasticIterativeScheduler tests pass.

* make style

* Override test_step_shape in CMStochasticIterativeSchedulerTest instead of modifying it in SchedulerCommonTest.

* make style

* rename some models

* Improve API

* rename some models

* Remove duplicated block

* Add docstring and make torch compile work

* More fixes

* Fixes

* Apply suggestions from code review

* Apply suggestions from code review

* add more docstring

* update consistency conversion script

---------

Co-authored-by: ayushmangal <ayushmangal@microsoft.com>
Co-authored-by: Ayush Mangal <43698245+ayushtues@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* add test case for StableDiffusionKDiffusionPipeline noise_sampler

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Aisuko <urakiny@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Andrés Mauricio Repetto Ferrero <amd.repetto@gmail.com>
Co-authored-by: estelleafl <estelle.aflalo@intel.com>
Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu33.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu38.rr.intel.com>
Co-authored-by: Prathik Rao <prathikr@usc.edu>
Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: ayushmangal <ayushmangal@microsoft.com>
Co-authored-by: Ayush Mangal <43698245+ayushtues@users.noreply.github.com>
2023-07-17 17:10:17 +02:00
Patrick von Platen
5729829cd8 [From single file] Make accelerate optional (#4132)
* Make accelerate optional

* make accelerate optional
2023-07-17 15:56:21 +02:00
Patrick von Platen
e27500b72c [From ckpt] replace with os path join (#3746)
replace with os path join
2023-07-17 11:39:06 +02:00
Patrick von Platen
fe5911bf3d [Stable Diffusion Inpaint ]Fix dtype inpaint (#4113)
Fix dtype inpaint
2023-07-15 16:10:40 +02:00
Patrick von Platen
b024ebb965 [SD-XL] Add inpainting (#4098)
* Add more

* more

* up

* Get ensemble of expert denoisers working

* Fix code

* add tests

* up
2023-07-14 17:05:44 +02:00
Patrick von Platen
ad8f985e81 Allow low precision vae sd xl (#4083)
* Allow low precision sd xl

* finish

* finish

* make style
2023-07-14 14:51:16 +02:00
Patrick von Platen
ee2f2775b2 [tests] use parent class for monkey patching to not break other tests (#4088)
* [tests] use parent class for monkey patching to not break other tests

* fix
2023-07-14 13:44:44 +02:00
Sayak Paul
692b7a907d [Feat] add: utility for unloading lora. (#4034)
* add: test for testing unloading lora.

* add :reason to skipif.

* initial implementation of lora unload().

* apply styling.

* add: doc.

* change checkpoints.

* reinit generator

* finalize slow test.

* add fast test for unloading lora.
2023-07-14 16:30:18 +05:30
Patrick von Platen
71c918b848 [Invisible watermark] Correct version (#4087) 2023-07-14 09:30:43 +05:30
Sayak Paul
83ca21f539 fix: minor things in the SDXL docs. (#4070) 2023-07-14 09:01:20 +05:30
Gabriel Birnbaum
f3802eb805 fix requirement in SDXL (#4082) 2023-07-14 02:58:20 +02:00
Patrick von Platen
bfe8b41315 [From Single File] Force accelerate to be installed (#4078)
force accelerate to be installed
2023-07-13 23:16:43 +02:00
YiYi Xu
4535088cec add kandinsky to readme table (#4081)
Co-authored-by: yiyixuxu <yixu310@gmail,com>
2023-07-13 22:26:51 +02:00
Thomas Chambon
2eceaaef0f [Community] Implementation of the IADB community pipeline (#3996)
* community pipeline: implementation of iadb

* iadb.py: reformat using black

* iadb.py: linting update
2023-07-13 16:49:41 +02:00
Ruoxi
ece55227ff Multiply lr scheduler steps by num_processes. (#3983)
* Multiply lr scheduler steps by `num_processes`.

* Stop multiplying steps by gradient accumulation.
2023-07-13 17:50:25 +05:30
Patrick von Platen
92a57a8e84 Fix kandinsky remove safety (#4065)
* Finish

* make style
2023-07-12 21:42:29 +02:00
Lucain
d7280b7436 Trigger CI on ci-* branches (#3635) 2023-07-12 19:31:36 +02:00
Patrick von Platen
e9eb0938f4 make style 2023-07-12 19:24:47 +02:00
junming huang
a29ea36d62 Update train_unconditional.py (#3899)
increase the time of timeout when using big dataset or high resolution
2023-07-12 19:24:28 +02:00
Evgenii Kashin
af48bf2008 Add circular padding for artifact-free StableDiffusionPanoramaPipeline (#4025)
* Add circular padding option

* Fix style with black

* Fix corner case with small image size

* Add circular padding test cases

* Fix docstring

* Improve docstring for circular padding, remove slow test case

* Update docs for circular padding argument

* Add images comparison for circular padding
2023-07-12 20:49:46 +05:30
Patrick von Platen
4b50ecceb0 Correct sdxl docs (#4058) 2023-07-12 16:02:31 +02:00
Bagheera
99b540b072 [SDXL] Partial diffusion support for Text2Img and Img2Img Pipelines (#4015)
* diffusers#4003 - initial implementation of max_inference_steps

* diffusers#4003 - initial implementation of max_inference_steps and first_inference_step for img2img

* diffusers#4003 - use first_inference_step as an input arg for get_timestamps in img2img

* diffusers#4003 Do not add noise during img2img when we have a defined first timestep

* diffusers#4003 Mild updates after revert

* diffusers#4003 Missing change

* Show implementation with denoising_start and end

* Apply suggestions from code review

* Update src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* move to 0.19.0dev

* Apply suggestions from code review

* add exhaustive tests

* add docs

* finish

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* make style

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2023-07-12 13:54:27 +02:00
Patrick von Platen
b9feed8795 move to 0.19.0dev (#4048) 2023-07-11 22:49:12 +02:00
Kadir Nar
f9cedfb75c 📝 Fix broken link to models documentation (#4026)
* 📝 Fix broken link to models documentation

Corrected the link to the models documentation in the README. Previously, the link was pointing to an incorrect URL. Now, the link directs users to the correct documentation page for more details on the models.

Thanks! 🙌

* Update src/diffusers/models/README.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2023-07-11 12:10:37 -07:00
Michael Monashev
fc7aa64ea8 Fixed pipeline parameter descriptions. (#3955)
Fixed parameter descriptions
2023-07-11 09:52:17 -07:00
Patrick von Platen
d0979f5274 quick fix to make sure sdxl conversion works 2023-07-11 16:51:56 +00:00
Vines
fcb0da7f00 Fix diffedit doc typo (#3977)
fix diffedit code mistake
2023-07-11 09:47:36 -07:00
manosplitsis
8e8b046d24 Typo fix in example docstring in audioldm pipeline (#4044)
typo fix in example docstring in audioldm pipeline
2023-07-11 09:35:11 -07:00
oOraph
5e704a2c71 keep _use_default_values as a list type (#4040)
Signed-off-by: Raphael <oOraph@users.noreply.github.com>
Co-authored-by: Raphael <oOraph@users.noreply.github.com>
2023-07-11 18:21:08 +02:00
Patrick von Platen
8bff782354 Improve single loading file (#4041)
* start improving single file load

* Fix more

* start improving single file load

* Fix sd 2.1

* further improve from_single_file
2023-07-11 17:01:25 +02:00
Lucain
6632823690 FIX force_download in download utility (#4036)
FIX force_download in download utils
2023-07-11 12:20:00 +02:00
Sayak Paul
f74d5e1c2f [diffusers-cli] Add a CLI command to run FP16 and safetensors conversions and opening PRs (#3982)
* feat: add a cli util for fp16 and safetensors format.

* fix: commit_description.

* add: usage example.
2023-07-11 08:30:09 +05:30
Sayak Paul
3d74dc2abd [Examples] Add a training script for SDXL DreamBooth LoRA (#4016)
* add dreambooth lora script for SDXL incorporating latest changes.

* remove use_auth_token=True.

* add: documentation

* remove unneeded cli.

* increase the number of training steps in the readme.

* add LoraLoaderMixin to the subclassing mix.

* add sdxl lora dreambooth test.

* add: inference code sample.

* add: refiner output.

* add LoraLoaderMixin to the mix of classes of StableDiffusionXLImg2ImgPipeline.

* change default resolution of DreamBoothDataset.

* better sdxl report path.

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-07-11 07:38:41 +05:30
Susnato Dhar
dfd7eafbce Fixed wrong link in CONTRIBUTING.md (#4028)
Update CONTRIBUTING.md
2023-07-10 21:00:55 +02:00
Steven Liu
8dd0ddc3c4 [docs] Fix index page (#3997)
* fix

* correct link

* Fix typo

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-07-10 10:24:37 -07:00
Patrick von Platen
080ecf01b3 Improve loading pipe (#4009)
* improve loading subcomponents

* Add test for logging

* improve loading subcomponents

* make style

* make style

* fix

* finish
2023-07-10 18:05:40 +02:00
Pedro Cuenca
7a91ea6c2b Remove remaining not in upscale pipeline (#4020)
Remove remaining `not` in upscale pipeline.
2023-07-10 12:37:50 +02:00
Sayak Paul
e4559f48c1 minor improvements to the SDXL doc. (#3985)
* minor improvements to the SDXL doc.

* use_refiner variable.

* fix: typo.
2023-07-10 16:04:23 +05:30
Pedro Cuenca
d6b861401e Correctly keep vae in float16 when using PyTorch 2 or xFormers (#4019)
* Update pipeline_stable_diffusion_xl.py

fix a bug

* Update pipeline_stable_diffusion_xl_img2img.py

* Update pipeline_stable_diffusion_xl_img2img.py

* Update pipeline_stable_diffusion_upscale.py

* style

---------

Co-authored-by: Hu Ye <xiaohuzc@gmail.com>
2023-07-10 12:19:56 +02:00
Patrick von Platen
e4f6c3799d [DiffusionPipeline] Deprecate not throwing error when loading non-existant variant (#4011)
* Deprecate variant nicely

* make style

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2023-07-10 12:12:29 +02:00
Patrick von Platen
98c9aac1d5 [SDXL] Fix all sequential offload (#4010)
* Fix all sequential offload

* make style

* make style
2023-07-10 10:27:23 +02:00
Omar Sanseviero
e3d71ad89a Minor nits to Dance DIffusion (#4012)
Update pipeline_dance_diffusion.py
2023-07-10 02:53:06 +05:30
Patrick von Platen
68f61a07d6 Make sure torch compile doesn't access unet config (#4008) 2023-07-10 00:23:55 +05:30
Patrick von Platen
4a3e574807 make style 2023-07-09 16:02:59 +00:00
Will Berman
c2a28c346c Refactor LoRA (#3778)
* refactor to support patching LoRA into T5

instantiate the lora linear layer on the same device as the regular linear layer

get lora rank from state dict

tests

fmt

can create lora layer in float32 even when rest of model is float16

fix loading model hook

remove load_lora_weights_ and T5 dispatching

remove Unet#attn_processors_state_dict

docstrings

* text encoder monkeypatch class method

* fix test

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-07-09 18:02:46 +02:00
756 changed files with 86468 additions and 18749 deletions

View File

@@ -41,7 +41,7 @@ Core library:
- Schedulers: @williamberman and @patrickvonplaten
- Pipelines: @patrickvonplaten and @sayakpaul
- Training examples: @sayakpaul and @patrickvonplaten
- Docs: @stevenliu and @yiyixu
- Docs: @stevhliu and @yiyixuxu
- JAX and MPS: @pcuenca
- Audio: @sanchit-gandhi
- General functionalities: @patrickvonplaten and @sayakpaul

View File

@@ -4,6 +4,9 @@ on:
pull_request:
branches:
- main
push:
branches:
- ci-*
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
@@ -64,6 +67,7 @@ jobs:
run: |
apt-get update && apt-get install libsndfile1-dev libgl1 -y
python -m pip install -e .[quality,test]
python -m pip install git+https://github.com/huggingface/accelerate.git
- name: Environment
run: |
@@ -110,3 +114,60 @@ jobs:
with:
name: pr_${{ matrix.config.report }}_test_reports
path: reports
run_staging_tests:
strategy:
fail-fast: false
matrix:
config:
- name: Hub tests for models, schedulers, and pipelines
framework: hub_tests_pytorch
runner: docker-cpu
image: diffusers/diffusers-pytorch-cpu
report: torch_hub
name: ${{ matrix.config.name }}
runs-on: ${{ matrix.config.runner }}
container:
image: ${{ matrix.config.image }}
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
defaults:
run:
shell: bash
steps:
- name: Checkout diffusers
uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Install dependencies
run: |
apt-get update && apt-get install libsndfile1-dev libgl1 -y
python -m pip install -e .[quality,test]
- name: Environment
run: |
python utils/print_env.py
- name: Run Hub tests for models, schedulers, and pipelines on a staging env
if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}
run: |
HUGGINGFACE_CO_STAGING=true python -m pytest \
-m "is_staging_test" \
--make-reports=tests_${{ matrix.config.report }} \
tests
- name: Failure short reports
if: ${{ failure() }}
run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v2
with:
name: pr_${{ matrix.config.report }}_test_reports
path: reports

View File

@@ -63,6 +63,7 @@ jobs:
run: |
apt-get update && apt-get install libsndfile1-dev libgl1 -y
python -m pip install -e .[quality,test]
python -m pip install git+https://github.com/huggingface/accelerate.git
- name: Environment
run: |

View File

@@ -40,7 +40,7 @@ jobs:
${CONDA_RUN} python -m pip install --upgrade pip
${CONDA_RUN} python -m pip install -e .[quality,test]
${CONDA_RUN} python -m pip install torch torchvision torchaudio
${CONDA_RUN} python -m pip install accelerate --upgrade
${CONDA_RUN} python -m pip install git+https://github.com/huggingface/accelerate.git
${CONDA_RUN} python -m pip install transformers --upgrade
- name: Environment

View File

@@ -297,7 +297,7 @@ if you don't know yet what specific component you would like to add:
- [Model or pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22)
- [Scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)
Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) a read to better understand the design of any of the three components. Please be aware that
Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) a read to better understand the design of any of the three components. Please be aware that
we cannot merge model, scheduler, or pipeline additions that strongly diverge from our design philosophy
as it will lead to API inconsistencies. If you fundamentally disagree with a design choice, please
open a [Feedback issue](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=) instead so that it can be discussed whether a certain design

View File

@@ -78,7 +78,7 @@ test:
# Run tests for examples
test-examples:
python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/
python -m pytest -n auto --dist=loadfile -s -v ./examples/
# Release stuff

View File

@@ -90,7 +90,7 @@ The following design principles are followed:
- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
readable longterm, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
readable longterm, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
### Schedulers
@@ -102,7 +102,7 @@ The following design principles are followed:
- One scheduler python file corresponds to one scheduler algorithm (as might be defined in a paper).
- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./using-diffusers/schedulers.mdx).
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./using-diffusers/schedulers.md).
- Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon
- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).

View File

@@ -1,6 +1,6 @@
<p align="center">
<br>
<img src="https://github.com/huggingface/diffusers/blob/main/docs/source/en/imgs/diffusers_library.jpg" width="400"/>
<img src="https://raw.githubusercontent.com/huggingface/diffusers/main/docs/source/en/imgs/diffusers_library.jpg" width="400"/>
<br>
<p>
<p align="center">
@@ -10,6 +10,9 @@
<a href="https://github.com/huggingface/diffusers/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
</a>
<a href="https://pepy.tech/project/diffusers">
<img alt="GitHub release" src="https://static.pepy.tech/badge/diffusers/month">
</a>
<a href="CODE_OF_CONDUCT.md">
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
</a>
@@ -143,9 +146,14 @@ just hang out ☕.
</tr>
<tr>
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/if">if</a></td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/if">DeepFloyd IF</a></td>
<td><a href="https://huggingface.co/DeepFloyd/IF-I-XL-v1.0"> DeepFloyd/IF-I-XL-v1.0 </a></td>
</tr>
<tr>
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/kandinsky">Kandinsky</a></td>
<td><a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder"> kandinsky-community/kandinsky-2-2-decoder </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/controlnet">Controlnet</a></td>
@@ -153,7 +161,7 @@ just hang out ☕.
</tr>
<tr>
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/pix2pix">Instruct Pix2Pix</a></td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/pix2pix">Instruct Pix2Pix</a></td>
<td><a href="https://huggingface.co/timbrooks/instruct-pix2pix"> timbrooks/instruct-pix2pix </a></td>
</tr>
<tr>

View File

@@ -68,7 +68,7 @@ The `preview` command only works with existing doc files. When you add a complet
## Adding a new element to the navigation bar
Accepted files are Markdown (.md or .mdx).
Accepted files are Markdown (.md).
Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/diffusers/blob/main/docs/source/_toctree.yml) file.
@@ -96,7 +96,7 @@ Sections that were moved:
Use the relative style to link to the new file so that the versioned docs continue to work.
For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.mdx).
For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).
## Writing Documentation - Specification
@@ -119,8 +119,8 @@ depending on the intended targets (beginners, more advanced users, or researcher
When adding a new pipeline:
- create a file `xxx.mdx` under `docs/source/api/pipelines` (don't hesitate to copy an existing file as template).
- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.mdx`, along with the link to the paper, and a colab notebook (if available).
- create a file `xxx.md` under `docs/source/api/pipelines` (don't hesitate to copy an existing file as template).
- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.md`, along with the link to the paper, and a colab notebook (if available).
- Write a short overview of the diffusion model:
- Overview with paper & authors
- Paper abstract

View File

@@ -13,6 +13,8 @@
title: Overview
- local: using-diffusers/write_own_pipeline
title: Understanding models and schedulers
- local: tutorials/autopipeline
title: AutoPipeline
- local: tutorials/basic_training
title: Train a diffusion model
title: Tutorials
@@ -30,20 +32,22 @@
title: Load safetensors
- local: using-diffusers/other-formats
title: Load different Stable Diffusion formats
- local: using-diffusers/push_to_hub
title: Push files to the Hub
title: Loading & Hub
- sections:
- local: using-diffusers/pipeline_overview
title: Overview
- local: using-diffusers/unconditional_image_generation
title: Unconditional image generation
- local: using-diffusers/conditional_image_generation
title: Text-to-image generation
title: Text-to-image
- local: using-diffusers/img2img
title: Text-guided image-to-image
title: Image-to-image
- local: using-diffusers/inpaint
title: Text-guided image-inpainting
title: Inpainting
- local: using-diffusers/depth2img
title: Text-guided depth-to-image
title: Depth-to-image
title: Tasks
- sections:
- local: using-diffusers/textual_inversion_inference
title: Textual inversion
- local: training/distributed_inference
@@ -52,16 +56,28 @@
title: Improve image quality with deterministic generation
- local: using-diffusers/control_brightness
title: Control image brightness
- local: using-diffusers/weighted_prompts
title: Prompt weighting
title: Techniques
- sections:
- local: using-diffusers/pipeline_overview
title: Overview
- local: using-diffusers/sdxl
title: Stable Diffusion XL
- local: using-diffusers/controlnet
title: ControlNet
- local: using-diffusers/shap-e
title: Shap-E
- local: using-diffusers/diffedit
title: DiffEdit
- local: using-diffusers/distilled_sd
title: Distilled Stable Diffusion inference
- local: using-diffusers/reproducibility
title: Create reproducible pipelines
- local: using-diffusers/custom_pipeline_examples
title: Community pipelines
- local: using-diffusers/contribute_pipeline
title: How to contribute a community pipeline
- local: using-diffusers/stable_diffusion_jax_how_to
title: Stable Diffusion in JAX/Flax
- local: using-diffusers/weighted_prompts
title: Weighting Prompts
title: Pipelines for Inference
- sections:
- local: training/overview
@@ -86,12 +102,10 @@
title: InstructPix2Pix Training
- local: training/custom_diffusion
title: Custom Diffusion
- local: training/t2i_adapters
title: T2I-Adapters
title: Training
- sections:
- local: using-diffusers/rl
title: Reinforcement Learning
- local: using-diffusers/audio
title: Audio
- local: using-diffusers/other-modalities
title: Other Modalities
title: Taking Diffusers Beyond Images
@@ -103,6 +117,8 @@
title: Memory and Speed
- local: optimization/torch2.0
title: Torch2.0 support
- local: using-diffusers/stable_diffusion_jax_how_to
title: Stable Diffusion in JAX/Flax
- local: optimization/xformers
title: xFormers
- local: optimization/onnx
@@ -164,6 +180,10 @@
title: VQModel
- local: api/models/autoencoderkl
title: AutoencoderKL
- local: api/models/asymmetricautoencoderkl
title: AsymmetricAutoencoderKL
- local: api/models/autoencoder_tiny
title: Tiny AutoEncoder
- local: api/models/transformer2d
title: Transformer2D
- local: api/models/transformer_temporal
@@ -179,15 +199,21 @@
- local: api/pipelines/alt_diffusion
title: AltDiffusion
- local: api/pipelines/attend_and_excite
title: Attend and Excite
title: Attend-and-Excite
- local: api/pipelines/audio_diffusion
title: Audio Diffusion
- local: api/pipelines/audioldm
title: AudioLDM
- local: api/pipelines/audioldm2
title: AudioLDM 2
- local: api/pipelines/auto_pipeline
title: AutoPipeline
- local: api/pipelines/consistency_models
title: Consistency Models
- local: api/pipelines/controlnet
title: ControlNet
- local: api/pipelines/controlnet_sdxl
title: ControlNet with Stable Diffusion XL
- local: api/pipelines/cycle_diffusion
title: Cycle Diffusion
- local: api/pipelines/dance_diffusion
@@ -196,20 +222,24 @@
title: DDIM
- local: api/pipelines/ddpm
title: DDPM
- local: api/pipelines/deepfloyd_if
title: DeepFloyd IF
- local: api/pipelines/diffedit
title: DiffEdit
- local: api/pipelines/dit
title: DiT
- local: api/pipelines/if
title: IF
- local: api/pipelines/pix2pix
title: InstructPix2Pix
- local: api/pipelines/kandinsky
title: Kandinsky
- local: api/pipelines/kandinsky_v22
title: Kandinsky 2.2
- local: api/pipelines/latent_diffusion
title: Latent Diffusion
- local: api/pipelines/panorama
title: MultiDiffusion Panorama
title: MultiDiffusion
- local: api/pipelines/musicldm
title: MusicLDM
- local: api/pipelines/paint_by_example
title: PaintByExample
- local: api/pipelines/paradigms
@@ -234,15 +264,15 @@
- local: api/pipelines/stable_diffusion/overview
title: Overview
- local: api/pipelines/stable_diffusion/text2img
title: Text-to-Image
title: Text-to-image
- local: api/pipelines/stable_diffusion/img2img
title: Image-to-Image
title: Image-to-image
- local: api/pipelines/stable_diffusion/inpaint
title: Inpaint
title: Inpainting
- local: api/pipelines/stable_diffusion/depth2img
title: Depth-to-Image
title: Depth-to-image
- local: api/pipelines/stable_diffusion/image_variation
title: Image-Variation
title: Image variation
- local: api/pipelines/stable_diffusion/stable_diffusion_safe
title: Safe Stable Diffusion
- local: api/pipelines/stable_diffusion/stable_diffusion_2
@@ -250,85 +280,89 @@
- local: api/pipelines/stable_diffusion/stable_diffusion_xl
title: Stable Diffusion XL
- local: api/pipelines/stable_diffusion/latent_upscale
title: Stable-Diffusion-Latent-Upscaler
title: Latent upscaler
- local: api/pipelines/stable_diffusion/upscale
title: Super-Resolution
title: Super-resolution
- local: api/pipelines/stable_diffusion/ldm3d_diffusion
title: LDM3D Text-to-(RGB, Depth)
- local: api/pipelines/stable_diffusion/adapter
title: Stable Diffusion T2I-adapter
- local: api/pipelines/stable_diffusion/gligen
title: GLIGEN (Grounded Language-to-Image Generation)
title: Stable Diffusion
- local: api/pipelines/stable_unclip
title: Stable unCLIP
- local: api/pipelines/stochastic_karras_ve
title: Stochastic Karras VE
- local: api/pipelines/model_editing
title: Text-to-Image Model Editing
title: Text-to-image model editing
- local: api/pipelines/text_to_video
title: Text-to-Video
title: Text-to-video
- local: api/pipelines/text_to_video_zero
title: Text-to-Video Zero
title: Text2Video-Zero
- local: api/pipelines/unclip
title: UnCLIP
- local: api/pipelines/latent_diffusion_uncond
title: Unconditional Latent Diffusion
- local: api/pipelines/unidiffuser
title: UniDiffuser
- local: api/pipelines/value_guided_sampling
title: Value-guided sampling
- local: api/pipelines/versatile_diffusion
title: Versatile Diffusion
- local: api/pipelines/vq_diffusion
title: VQ Diffusion
- local: api/pipelines/wuerstchen
title: Wuerstchen
title: Pipelines
- sections:
- local: api/schedulers/overview
title: Overview
- local: api/schedulers/cm_stochastic_iterative
title: Consistency Model Multistep Scheduler
- local: api/schedulers/ddim
title: DDIM
title: CMStochasticIterativeScheduler
- local: api/schedulers/ddim_inverse
title: DDIMInverse
title: DDIMInverseScheduler
- local: api/schedulers/ddim
title: DDIMScheduler
- local: api/schedulers/ddpm
title: DDPM
title: DDPMScheduler
- local: api/schedulers/deis
title: DEIS
- local: api/schedulers/dpm_discrete
title: DPM Discrete Scheduler
- local: api/schedulers/dpm_discrete_ancestral
title: DPM Discrete Scheduler with ancestral sampling
title: DEISMultistepScheduler
- local: api/schedulers/multistep_dpm_solver_inverse
title: DPMSolverMultistepInverse
- local: api/schedulers/multistep_dpm_solver
title: DPMSolverMultistepScheduler
- local: api/schedulers/dpm_sde
title: DPMSolverSDEScheduler
- local: api/schedulers/euler_ancestral
title: Euler Ancestral Scheduler
- local: api/schedulers/euler
title: Euler scheduler
- local: api/schedulers/heun
title: Heun Scheduler
- local: api/schedulers/multistep_dpm_solver_inverse
title: Inverse Multistep DPM-Solver
- local: api/schedulers/ipndm
title: IPNDM
- local: api/schedulers/lms_discrete
title: Linear Multistep
- local: api/schedulers/multistep_dpm_solver
title: Multistep DPM-Solver
- local: api/schedulers/pndm
title: PNDM
- local: api/schedulers/repaint
title: RePaint Scheduler
- local: api/schedulers/singlestep_dpm_solver
title: Singlestep DPM-Solver
title: DPMSolverSinglestepScheduler
- local: api/schedulers/euler_ancestral
title: EulerAncestralDiscreteScheduler
- local: api/schedulers/euler
title: EulerDiscreteScheduler
- local: api/schedulers/heun
title: HeunDiscreteScheduler
- local: api/schedulers/ipndm
title: IPNDMScheduler
- local: api/schedulers/stochastic_karras_ve
title: Stochastic Kerras VE
title: KarrasVeScheduler
- local: api/schedulers/dpm_discrete_ancestral
title: KDPM2AncestralDiscreteScheduler
- local: api/schedulers/dpm_discrete
title: KDPM2DiscreteScheduler
- local: api/schedulers/lms_discrete
title: LMSDiscreteScheduler
- local: api/schedulers/pndm
title: PNDMScheduler
- local: api/schedulers/repaint
title: RePaintScheduler
- local: api/schedulers/score_sde_ve
title: ScoreSdeVeScheduler
- local: api/schedulers/score_sde_vp
title: ScoreSdeVpScheduler
- local: api/schedulers/unipc
title: UniPCMultistepScheduler
- local: api/schedulers/score_sde_ve
title: VE-SDE
- local: api/schedulers/score_sde_vp
title: VP-SDE
- local: api/schedulers/vq_diffusion
title: VQDiffusionScheduler
title: Schedulers
- sections:
- local: api/experimental/rl
title: RL Planning
title: Experimental Features
title: API

View File

@@ -1,15 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# TODO
Coming soon!

View File

@@ -35,3 +35,11 @@ Adapters (textual inversion, LoRA, hypernetworks) allow you to modify a diffusio
## FromSingleFileMixin
[[autodoc]] loaders.FromSingleFileMixin
## FromOriginalControlnetMixin
[[autodoc]] loaders.FromOriginalControlnetMixin
## FromOriginalVAEMixin
[[autodoc]] loaders.FromOriginalVAEMixin

View File

@@ -0,0 +1,55 @@
# AsymmetricAutoencoderKL
Improved larger variational autoencoder (VAE) model with KL loss for inpainting task: [Designing a Better Asymmetric VQGAN for StableDiffusion](https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.
The abstract from the paper is:
*StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN*
Evaluation results can be found in section 4.1 of the original paper.
## Available checkpoints
* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5)
* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2)
## Example Usage
```python
from io import BytesIO
from PIL import Image
import requests
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
def download_image(url: str) -> Image.Image:
response = requests.get(url)
return Image.open(BytesIO(response.content)).convert("RGB")
prompt = "a photo of a person"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
image = download_image(img_url).resize((256, 256))
mask_image = download_image(mask_url).resize((256, 256))
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
pipe.to("cuda")
image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
image.save("image.jpeg")
```
## AsymmetricAutoencoderKL
[[autodoc]] models.autoencoder_asym_kl.AsymmetricAutoencoderKL
## AutoencoderKLOutput
[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput
## DecoderOutput
[[autodoc]] models.vae.DecoderOutput

View File

@@ -0,0 +1,45 @@
# Tiny AutoEncoder
Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in [madebyollin/taesd](https://github.com/madebyollin/taesd) by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion's VAE that can quickly decode the latents in a [`StableDiffusionPipeline`] or [`StableDiffusionXLPipeline`] almost instantly.
To use with Stable Diffusion v-2.1:
```python
import torch
from diffusers import DiffusionPipeline, AutoencoderTiny
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cheesecake.png")
```
To use with Stable Diffusion XL 1.0
```python
import torch
from diffusers import DiffusionPipeline, AutoencoderTiny
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cheesecake_sdxl.png")
```
## AutoencoderTiny
[[autodoc]] AutoencoderTiny
## AutoencoderTinyOutput
[[autodoc]] models.autoencoder_tiny.AutoencoderTinyOutput

View File

@@ -6,6 +6,18 @@ The abstract from the paper is:
*How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.*
## Loading from the original format
By default the [`AutoencoderKL`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
from the original format using [`FromOriginalVAEMixin.from_single_file`] as follows:
```py
from diffusers import AutoencoderKL
url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be local file
model = AutoencoderKL.from_single_file(url)
```
## AutoencoderKL
[[autodoc]] AutoencoderKL
@@ -28,4 +40,4 @@ The abstract from the paper is:
## FlaxDecoderOutput
[[autodoc]] models.vae_flax.FlaxDecoderOutput
[[autodoc]] models.vae_flax.FlaxDecoderOutput

View File

@@ -6,6 +6,21 @@ The abstract from the paper is:
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
## Loading from the original format
By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
from the original format using [`FromOriginalControlnetMixin.from_single_file`] as follows:
```py
from diffusers import StableDiffusionControlnetPipeline, ControlNetModel
url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth" # can also be a local path
controlnet = ControlNetModel.from_single_file(url)
url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path
pipe = StableDiffusionControlnetPipeline.from_single_file(url, controlnet=controlnet)
```
## ControlNetModel
[[autodoc]] ControlNetModel
@@ -20,4 +35,4 @@ The abstract from the paper is:
## FlaxControlNetOutput
[[autodoc]] models.controlnet_flax.FlaxControlNetOutput
[[autodoc]] models.controlnet_flax.FlaxControlNetOutput

View File

@@ -9,4 +9,8 @@ All models are built from the base [`ModelMixin`] class which is a [`torch.nn.mo
## FlaxModelMixin
[[autodoc]] FlaxModelMixin
[[autodoc]] FlaxModelMixin
## PushToHubMixin
[[autodoc]] utils.PushToHubMixin

View File

@@ -0,0 +1,47 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AltDiffusion
AltDiffusion was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://huggingface.co/papers/2211.06679) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu.
The abstract from the paper is:
*In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*
## Tips
`AltDiffusion` is conceptually the same as [Stable Diffusion](./stable_diffusion/overview).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## AltDiffusionPipeline
[[autodoc]] AltDiffusionPipeline
- all
- __call__
## AltDiffusionImg2ImgPipeline
[[autodoc]] AltDiffusionImg2ImgPipeline
- all
- __call__
## AltDiffusionPipelineOutput
[[autodoc]] pipelines.alt_diffusion.AltDiffusionPipelineOutput
- all
- __call__

View File

@@ -1,83 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AltDiffusion
AltDiffusion was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu.
The abstract of the paper is the following:
*In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*
*Overview*:
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [pipeline_alt_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion.py) | *Text-to-Image Generation* | - | -
| [pipeline_alt_diffusion_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/alt_diffusion/pipeline_alt_diffusion_img2img.py) | *Image-to-Image Text-Guided Generation* | - |-
## Tips
- AltDiffusion is conceptually exactly the same as [Stable Diffusion](./stable_diffusion/overview).
- *Run AltDiffusion*
AltDiffusion can be tested very easily with the [`AltDiffusionPipeline`], [`AltDiffusionImg2ImgPipeline`] and the `"BAAI/AltDiffusion-m9"` checkpoint exactly in the same way it is shown in the [Conditional Image Generation Guide](../../using-diffusers/conditional_image_generation) and the [Image-to-Image Generation Guide](../../using-diffusers/img2img).
- *How to load and use different schedulers.*
The alt diffusion pipeline uses [`DDIMScheduler`] scheduler by default. But `diffusers` provides many other schedulers that can be used with the alt diffusion pipeline such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`] etc.
To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following:
```python
>>> from diffusers import AltDiffusionPipeline, EulerDiscreteScheduler
>>> pipeline = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9")
>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
>>> # or
>>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("BAAI/AltDiffusion-m9", subfolder="scheduler")
>>> pipeline = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", scheduler=euler_scheduler)
```
- *How to convert all use cases with multiple or single pipeline*
If you want to use all possible use cases in a single `DiffusionPipeline` we recommend using the `components` functionality to instantiate all components in the most memory-efficient way:
```python
>>> from diffusers import (
... AltDiffusionPipeline,
... AltDiffusionImg2ImgPipeline,
... )
>>> text2img = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9")
>>> img2img = AltDiffusionImg2ImgPipeline(**text2img.components)
>>> # now you can use text2img(...) and img2img(...) just like the call methods of each respective pipeline
```
## AltDiffusionPipelineOutput
[[autodoc]] pipelines.alt_diffusion.AltDiffusionPipelineOutput
- all
- __call__
## AltDiffusionPipeline
[[autodoc]] AltDiffusionPipeline
- all
- __call__
## AltDiffusionImg2ImgPipeline
[[autodoc]] AltDiffusionImg2ImgPipeline
- all
- __call__

View File

@@ -0,0 +1,37 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Attend-and-Excite
Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation.
The abstract from the paper is:
*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.*
You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableDiffusionAttendAndExcitePipeline
[[autodoc]] StableDiffusionAttendAndExcitePipeline
- all
- __call__
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -1,75 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Attend and Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models
## Overview
Attend and Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over the image generation.
The abstract of the paper is the following:
*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.*
Resources
* [Project Page](https://attendandexcite.github.io/Attend-and-Excite/)
* [Paper](https://arxiv.org/abs/2301.13826)
* [Original Code](https://github.com/AttendAndExcite/Attend-and-Excite)
* [Demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite)
## Available Pipelines:
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [pipeline_semantic_stable_diffusion_attend_and_excite.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_semantic_stable_diffusion_attend_and_excite) | *Text-to-Image Generation* | - | https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite
### Usage example
```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
pipe = pipe.to("cuda")
prompt = "a cat and a frog"
# use get_indices function to find out indices of the tokens you want to alter
pipe.get_indices(prompt)
token_indices = [2, 5]
seed = 6141
generator = torch.Generator("cuda").manual_seed(seed)
images = pipe(
prompt=prompt,
token_indices=token_indices,
guidance_scale=7.5,
generator=generator,
num_inference_steps=50,
max_iter_to_alter=25,
).images
image = images[0]
image.save(f"../images/{prompt}_{seed}.png")
```
## StableDiffusionAttendAndExcitePipeline
[[autodoc]] StableDiffusionAttendAndExcitePipeline
- all
- __call__

View File

@@ -0,0 +1,37 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Audio Diffusion
[Audio Diffusion](https://github.com/teticio/audio-diffusion) is by Robert Dargavel Smith, and it leverages the recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images.
The original codebase, training scripts and example notebooks can be found at [teticio/audio-diffusion](https://github.com/teticio/audio-diffusion).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## AudioDiffusionPipeline
[[autodoc]] AudioDiffusionPipeline
- all
- __call__
## AudioPipelineOutput
[[autodoc]] pipelines.AudioPipelineOutput
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput
## Mel
[[autodoc]] Mel

View File

@@ -1,98 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Audio Diffusion
## Overview
[Audio Diffusion](https://github.com/teticio/audio-diffusion) by Robert Dargavel Smith.
Audio Diffusion leverages the recent advances in image generation using diffusion models by converting audio samples to
and from mel spectrogram images.
The original codebase of this implementation can be found [here](https://github.com/teticio/audio-diffusion), including
training scripts and example notebooks.
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_audio_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py) | *Unconditional Audio Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/audio_diffusion_pipeline.ipynb) |
## Examples:
### Audio Diffusion
```python
import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
### Latent Audio Diffusion
```python
import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
### Audio Diffusion with DDIM (faster)
```python
import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
### Variations, in-painting, out-painting etc.
```python
output = pipe(
raw_audio=output.audios[0, 0],
start_step=int(pipe.get_default_steps() / 2),
mask_start_secs=1,
mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
## AudioDiffusionPipeline
[[autodoc]] AudioDiffusionPipeline
- all
- __call__
## Mel
[[autodoc]] Mel

View File

@@ -0,0 +1,50 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AudioLDM
AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
sound effects, human speech and music.
The abstract from the paper is:
*Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.*
The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM).
## Tips
When constructing a prompt, keep in mind:
* Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific (for example, "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
During inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## AudioLDMPipeline
[[autodoc]] AudioLDMPipeline
- all
- __call__
## AudioPipelineOutput
[[autodoc]] pipelines.AudioPipelineOutput

View File

@@ -1,84 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AudioLDM
## Overview
AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.
Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
sound effects, human speech and music.
This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found [here](https://github.com/haoheliu/AudioLDM).
## Text-to-Audio
The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm-s-full-v2](https://huggingface.co/cvssp/audioldm-s-full-v2) and generate text-conditional audio outputs:
```python
from diffusers import AudioLDMPipeline
import torch
import scipy
repo_id = "cvssp/audioldm-s-full-v2"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
# save the audio sample as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```
### Tips
Prompts:
* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g., "water stream in a forest" instead of "stream").
* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
Inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
### How to load and use different schedulers
The AudioLDM pipeline uses [`DDIMScheduler`] scheduler by default. But `diffusers` provides many other schedulers
that can be used with the AudioLDM pipeline such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`],
[`EulerAncestralDiscreteScheduler`] etc. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest
scheduler there is.
To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`]
method, or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the
[`DPMSolverMultistepScheduler`], you can do the following:
```python
>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
>>> import torch
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
>>> # or
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm-s-full-v2", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained(
... "cvssp/audioldm-s-full-v2", scheduler=dpm_scheduler, torch_dtype=torch.float16
... )
```
## AudioLDMPipeline
[[autodoc]] AudioLDMPipeline
- all
- __call__

View File

@@ -0,0 +1,93 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AudioLDM 2
AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734)
by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate
text-conditional sound effects, human speech and music.
Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two
text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap)
and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings
are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel).
A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively
predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding
vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel)
of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention
conditioning, as in most other LDMs.
The abstract of the paper is the following:
*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.*
This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be
found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).
## Tips
### Choosing a checkpoint
AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio
generation. The third checkpoint is trained exclusively on text-to-music generation.
All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet.
See table below for details on the three checkpoints:
| Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h |
|-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------|
| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 350M | 1.1B | 1150k |
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M | 1.5B | 1150k |
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M | 1.1B | 665k |
### Constructing a prompt
* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
* Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality."
### Controlling inference
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
### Evaluating generated waveforms:
* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
The following example demonstrates how to construct good music generation using the aforementioned tips: [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between
scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines)
section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## AudioLDM2Pipeline
[[autodoc]] AudioLDM2Pipeline
- all
- __call__
## AudioLDM2ProjectionModel
[[autodoc]] AudioLDM2ProjectionModel
- forward
## AudioLDM2UNet2DConditionModel
[[autodoc]] AudioLDM2UNet2DConditionModel
- forward
## AudioPipelineOutput
[[autodoc]] pipelines.AudioPipelineOutput

View File

@@ -0,0 +1,74 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AutoPipeline
`AutoPipeline` is designed to:
1. make it easy for you to load a checkpoint for a task without knowing the specific pipeline class to use
2. use multiple pipelines in your workflow
Based on the task, the `AutoPipeline` class automatically retrieves the relevant pipeline given the name or path to the pretrained weights with the `from_pretrained()` method.
To seamlessly switch between tasks with the same checkpoint without reallocating additional memory, use the `from_pipe()` method to transfer the components from the original pipeline to the new one.
```py
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt, num_inference_steps=25).images[0]
```
<Tip>
Check out the [AutoPipeline](/tutorials/autopipeline) tutorial to learn how to use this API!
</Tip>
`AutoPipeline` supports text-to-image, image-to-image, and inpainting for the following diffusion models:
- [Stable Diffusion](./stable_diffusion)
- [ControlNet](./api/pipelines/controlnet)
- [Stable Diffusion XL (SDXL)](./stable_diffusion/stable_diffusion_xl)
- [DeepFloyd IF](./if)
- [Kandinsky](./kandinsky)
- [Kandinsky 2.2](./kandinsky#kandinsky-22)
## AutoPipelineForText2Image
[[autodoc]] AutoPipelineForText2Image
- all
- from_pretrained
- from_pipe
## AutoPipelineForImage2Image
[[autodoc]] AutoPipelineForImage2Image
- all
- from_pretrained
- from_pipe
## AutoPipelineForInpainting
[[autodoc]] AutoPipelineForInpainting
- all
- from_pretrained
- from_pipe

View File

@@ -0,0 +1,43 @@
# Consistency Models
Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
The abstract from the paper is:
*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256. *
The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models), and additional checkpoints are available at [openai](https://huggingface.co/openai).
The pipeline was contributed by [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues). ❤️
## Tips
For an additional speed-up, use `torch.compile` to generate multiple images in <1 second:
```diff
import torch
from diffusers import ConsistencyModelPipeline
device = "cuda"
# Load the cd_bedroom256_lpips checkpoint.
model_id_or_path = "openai/diffusers-cd_bedroom256_lpips"
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# Multistep sampling
# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo:
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83
for _ in range(10):
image = pipe(timesteps=[17, 0]).images[0]
image.show()
```
## ConsistencyModelPipeline
[[autodoc]] ConsistencyModelPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -1,87 +0,0 @@
# Consistency Models
Consistency Models were proposed in [Consistency Models](https://arxiv.org/abs/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
The abstract of the [paper](https://arxiv.org/pdf/2303.01469.pdf) is as follows:
*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256. *
Resources:
* [Paper](https://arxiv.org/abs/2303.01469)
* [Original Code](https://github.com/openai/consistency_models)
Available Checkpoints are:
- *cd_imagenet64_l2 (64x64 resolution)* [openai/consistency-model-pipelines](https://huggingface.co/openai/diffusers-cd_imagenet64_l2)
- *cd_imagenet64_lpips (64x64 resolution)* [openai/diffusers-cd_imagenet64_lpips](https://huggingface.co/openai/diffusers-cd_imagenet64_lpips)
- *ct_imagenet64 (64x64 resolution)* [openai/diffusers-ct_imagenet64](https://huggingface.co/openai/diffusers-ct_imagenet64)
- *cd_bedroom256_l2 (256x256 resolution)* [openai/diffusers-cd_bedroom256_l2](https://huggingface.co/openai/diffusers-cd_bedroom256_l2)
- *cd_bedroom256_lpips (256x256 resolution)* [openai/diffusers-cd_bedroom256_lpips](https://huggingface.co/openai/diffusers-cd_bedroom256_lpips)
- *ct_bedroom256 (256x256 resolution)* [openai/diffusers-ct_bedroom256](https://huggingface.co/openai/diffusers-ct_bedroom256)
- *cd_cat256_l2 (256x256 resolution)* [openai/diffusers-cd_cat256_l2](https://huggingface.co/openai/diffusers-cd_cat256_l2)
- *cd_cat256_lpips (256x256 resolution)* [openai/diffusers-cd_cat256_lpips](https://huggingface.co/openai/diffusers-cd_cat256_lpips)
- *ct_cat256 (256x256 resolution)* [openai/diffusers-ct_cat256](https://huggingface.co/openai/diffusers-ct_cat256)
## Available Pipelines
| Pipeline | Tasks | Demo | Colab |
|:---:|:---:|:---:|:---:|
| [ConsistencyModelPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_consistency_models.py) | *Unconditional Image Generation* | | |
This pipeline was contributed by our community members [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues) ❤️
## Usage Example
```python
import torch
from diffusers import ConsistencyModelPipeline
device = "cuda"
# Load the cd_imagenet64_l2 checkpoint.
model_id_or_path = "openai/diffusers-cd_imagenet64_l2"
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Onestep Sampling
image = pipe(num_inference_steps=1).images[0]
image.save("consistency_model_onestep_sample.png")
# Onestep sampling, class-conditional image generation
# ImageNet-64 class label 145 corresponds to king penguins
image = pipe(num_inference_steps=1, class_labels=145).images[0]
image.save("consistency_model_onestep_sample_penguin.png")
# Multistep sampling, class-conditional image generation
# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo.
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L77
image = pipe(timesteps=[22, 0], class_labels=145).images[0]
image.save("consistency_model_multistep_sample_penguin.png")
```
For an additional speed-up, one can also make use of `torch.compile`. Multiple images can be generated in <1 second as follows:
```py
import torch
from diffusers import ConsistencyModelPipeline
device = "cuda"
# Load the cd_bedroom256_lpips checkpoint.
model_id_or_path = "openai/diffusers-cd_bedroom256_lpips"
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# Multistep sampling
# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo:
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83
for _ in range(10):
image = pipe(timesteps=[17, 0]).images[0]
image.show()
```
## ConsistencyModelPipeline
[[autodoc]] ConsistencyModelPipeline
- all
- __call__

View File

@@ -0,0 +1,80 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ControlNet
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
The abstract from the paper is:
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
This model was contributed by [takuma104](https://huggingface.co/takuma104). ❤️
The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet), and you can find official ControlNet checkpoints on [lllyasviel's](https://huggingface.co/lllyasviel) Hub profile.
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableDiffusionControlNetPipeline
[[autodoc]] StableDiffusionControlNetPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## StableDiffusionControlNetImg2ImgPipeline
[[autodoc]] StableDiffusionControlNetImg2ImgPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## StableDiffusionControlNetInpaintPipeline
[[autodoc]] StableDiffusionControlNetInpaintPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
## FlaxStableDiffusionControlNetPipeline
[[autodoc]] FlaxStableDiffusionControlNetPipeline
- all
- __call__
## FlaxStableDiffusionControlNetPipelineOutput
[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput

View File

@@ -1,363 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Text-to-Image Generation with ControlNet Conditioning
## Overview
[Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.
The abstract of the paper is the following:
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
This model was contributed by the community contributor [takuma104](https://huggingface.co/takuma104) ❤️ .
Resources:
* [Paper](https://arxiv.org/abs/2302.05543)
* [Original Code](https://github.com/lllyasviel/ControlNet)
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet.py) | *Text-to-Image Generation with ControlNet Conditioning* | [Colab Example](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb)
| [StableDiffusionControlNetImg2ImgPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py) | *Image-to-Image Generation with ControlNet Conditioning* |
| [StableDiffusionControlNetInpaintPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_controlnet_inpaint.py) | *Inpainting Generation with ControlNet Conditioning* |
## Usage example
In the following we give a simple example of how to use a *ControlNet* checkpoint with Diffusers for inference.
The inference pipeline is the same for all pipelines:
* 1. Take an image and run it through a pre-conditioning processor.
* 2. Run the pre-processed image through the [`StableDiffusionControlNetPipeline`].
Let's have a look at a simple example using the [Canny Edge ControlNet](https://huggingface.co/lllyasviel/sd-controlnet-canny).
```python
from diffusers import StableDiffusionControlNetPipeline
from diffusers.utils import load_image
# Let's load the popular vermeer image
image = load_image(
"https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png)
Next, we process the image to get the canny image. This is step *1.* - running the pre-conditioning processor. The pre-conditioning processor is different for every ControlNet. Please see the model cards of the [official checkpoints](#controlnet-with-stable-diffusion-1.5) for more information about other models.
First, we need to install opencv:
```
pip install opencv-contrib-python
```
Next, let's also install all required Hugging Face libraries:
```
pip install diffusers transformers git+https://github.com/huggingface/accelerate.git
```
Then we can retrieve the canny edges of the image.
```python
import cv2
from PIL import Image
import numpy as np
image = np.array(image)
low_threshold = 100
high_threshold = 200
image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
```
Let's take a look at the processed image.
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vermeer_canny_edged.png)
Now, we load the official [Stable Diffusion 1.5 Model](runwayml/stable-diffusion-v1-5) as well as the ControlNet for canny edges.
```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
```
To speed-up things and reduce memory, let's enable model offloading and use the fast [`UniPCMultistepScheduler`].
```py
from diffusers import UniPCMultistepScheduler
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
# this command loads the individual model components on GPU on-demand.
pipe.enable_model_cpu_offload()
```
Finally, we can run the pipeline:
```py
generator = torch.manual_seed(0)
out_image = pipe(
"disco dancer with colorful lights", num_inference_steps=20, generator=generator, image=canny_image
).images[0]
```
This should take only around 3-4 seconds on GPU (depending on hardware). The output image then looks as follows:
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vermeer_disco_dancing.png)
**Note**: To see how to run all other ControlNet checkpoints, please have a look at [ControlNet with Stable Diffusion 1.5](#controlnet-with-stable-diffusion-1.5).
<!-- TODO: add space -->
## Combining multiple conditionings
Multiple ControlNet conditionings can be combined for a single image generation. Pass a list of ControlNets to the pipeline's constructor and a corresponding list of conditionings to `__call__`.
When combining conditionings, it is helpful to mask conditionings such that they do not overlap. In the example, we mask the middle of the canny map where the pose conditioning is located.
It can also be helpful to vary the `controlnet_conditioning_scales` to emphasize one conditioning over the other.
### Canny conditioning
The original image:
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"/>
Prepare the conditioning:
```python
from diffusers.utils import load_image
from PIL import Image
import cv2
import numpy as np
from diffusers.utils import load_image
canny_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
canny_image = np.array(canny_image)
low_threshold = 100
high_threshold = 200
canny_image = cv2.Canny(canny_image, low_threshold, high_threshold)
# zero out middle columns of image where pose will be overlayed
zero_start = canny_image.shape[1] // 4
zero_end = zero_start + canny_image.shape[1] // 2
canny_image[:, zero_start:zero_end] = 0
canny_image = canny_image[:, :, None]
canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
canny_image = Image.fromarray(canny_image)
```
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/landscape_canny_masked.png"/>
### Openpose conditioning
The original image:
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png" width=600/>
Prepare the conditioning:
```python
from controlnet_aux import OpenposeDetector
from diffusers.utils import load_image
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
openpose_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
openpose_image = openpose(openpose_image)
```
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/person_pose.png" width=600/>
### Running ControlNet with multiple conditionings
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
controlnet = [
ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()
prompt = "a giant standing in a fantasy landscape, best quality"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
generator = torch.Generator(device="cpu").manual_seed(1)
images = [openpose_image, canny_image]
image = pipe(
prompt,
images,
num_inference_steps=20,
generator=generator,
negative_prompt=negative_prompt,
controlnet_conditioning_scale=[1.0, 0.8],
).images[0]
image.save("./multi_controlnet_output.png")
```
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/controlnet/multi_controlnet_output.png" width=600/>
### Guess Mode
Guess Mode is [a ControlNet feature that was implemented](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode) after the publication of [the paper](https://arxiv.org/abs/2302.05543). The description states:
>In this mode, the ControlNet encoder will try best to recognize the content of the input control map, like depth map, edge map, scribbles, etc, even if you remove all prompts.
#### The core implementation:
It adjusts the scale of the output residuals from ControlNet by a fixed ratio depending on the block depth. The shallowest DownBlock corresponds to `0.1`. As the blocks get deeper, the scale increases exponentially, and the scale for the output of the MidBlock becomes `1.0`.
Since the core implementation is just this, **it does not have any impact on prompt conditioning**. While it is common to use it without specifying any prompts, it is also possible to provide prompts if desired.
#### Usage:
Just specify `guess_mode=True` in the pipe() function. A `guidance_scale` between 3.0 and 5.0 is [recommended](https://github.com/lllyasviel/ControlNet#guess-mode--non-prompt-mode).
```py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet).to(
"cuda"
)
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
image.save("guess_mode_generated.png")
```
#### Output image comparison:
Canny Control Example
|no guess_mode with prompt|guess_mode without prompt|
|---|---|
|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"><img width="128" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare_guess_mode/output_images/diffusers/output_bird_canny_0_gm.png"/></a>|
## Available checkpoints
ControlNet requires a *control image* in addition to the text-to-image *prompt*.
Each pretrained model is trained using a different conditioning method that requires different images for conditioning the generated outputs. For example, Canny edge conditioning requires the control image to be the output of a Canny filter, while depth conditioning requires the control image to be a depth map. See the overview and image examples below to know more.
All checkpoints can be found under the authors' namespace [lllyasviel](https://huggingface.co/lllyasviel).
**13.04.2024 Update**: The author has released improved controlnet checkpoints v1.1 - see [here](#controlnet-v1.1).
### ControlNet v1.0
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
|[lllyasviel/sd-controlnet-canny](https://huggingface.co/lllyasviel/sd-controlnet-canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_canny.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_canny.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"/></a>|
|[lllyasviel/sd-controlnet-depth](https://huggingface.co/lllyasviel/sd-controlnet-depth)<br/> *Trained with Midas depth estimation* |A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_depth.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_depth.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"/></a>|
|[lllyasviel/sd-controlnet-hed](https://huggingface.co/lllyasviel/sd-controlnet-hed)<br/> *Trained with HED edge detection (soft edge)* |A monochrome image with white soft edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_hed.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_hed.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"/></a> |
|[lllyasviel/sd-controlnet-mlsd](https://huggingface.co/lllyasviel/sd-controlnet-mlsd)<br/> *Trained with M-LSD line detection* |A monochrome image composed only of white straight lines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_mlsd.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_mlsd.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"/></a>|
|[lllyasviel/sd-controlnet-normal](https://huggingface.co/lllyasviel/sd-controlnet-normal)<br/> *Trained with normal map* |A [normal mapped](https://en.wikipedia.org/wiki/Normal_mapping) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_normal.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_normal.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"/></a>|
|[lllyasviel/sd-controlnet-openpose](https://huggingface.co/lllyasviel/sd-controlnet_openpose)<br/> *Trained with OpenPose bone image* |A [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_openpose.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_openpose.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"/></a>|
|[lllyasviel/sd-controlnet-scribble](https://huggingface.co/lllyasviel/sd-controlnet_scribble)<br/> *Trained with human scribbles* |A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_scribble.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_scribble.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"/></a> |
|[lllyasviel/sd-controlnet-seg](https://huggingface.co/lllyasviel/sd-controlnet_seg)<br/>*Trained with semantic segmentation* |An [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/)'s segmentation protocol image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_seg.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_seg.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"/></a> |
### ControlNet v1.1
| Model Name | Control Image Overview| Condition Image | Control Image Example | Generated Image Example |
|---|---|---|---|---|
|[lllyasviel/control_v11p_sd15_canny](https://huggingface.co/lllyasviel/control_v11p_sd15_canny)<br/> | *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11e_sd15_ip2p](https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p)<br/> | *Trained with pixel to pixel instruction* | No condition .|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_inpaint](https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint)<br/> | Trained with image inpainting | No condition.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"/></a>|
|[lllyasviel/control_v11p_sd15_mlsd](https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd)<br/> | Trained with multi-level line segment detection | An image with annotated line segments.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11f1p_sd15_depth](https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth)<br/> | Trained with depth estimation | An image with depth information, usually represented as a grayscale image.|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_normalbae](https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae)<br/> | Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_seg](https://huggingface.co/lllyasviel/control_v11p_sd15_seg)<br/> | Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_lineart](https://huggingface.co/lllyasviel/control_v11p_sd15_lineart)<br/> | Trained with line art generation | An image with line art, usually black lines on a white background.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15s2_lineart_anime](https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime)<br/> | Trained with anime line art generation | An image with anime-style line art.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_openpose](https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime)<br/> | Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_scribble](https://huggingface.co/lllyasviel/control_v11p_sd15_scribble)<br/> | Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11p_sd15_softedge](https://huggingface.co/lllyasviel/control_v11p_sd15_softedge)<br/> | Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect.|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11e_sd15_shuffle](https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle)<br/> | Trained with image shuffling | An image with shuffled patches or regions.|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"/></a>|
|[lllyasviel/control_v11f1e_sd15_tile](https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile)<br/> | Trained with image tiling | A blurry image or part of an image .|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/original.png"/></a>|<a href="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile/resolve/main/images/output.png"/></a>|
## StableDiffusionControlNetPipeline
[[autodoc]] StableDiffusionControlNetPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## StableDiffusionControlNetImg2ImgPipeline
[[autodoc]] StableDiffusionControlNetImg2ImgPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## StableDiffusionControlNetInpaintPipeline
[[autodoc]] StableDiffusionControlNetInpaintPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
## FlaxStableDiffusionControlNetPipeline
[[autodoc]] FlaxStableDiffusionControlNetPipeline
- all
- __call__

View File

@@ -0,0 +1,46 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ControlNet with Stable Diffusion XL
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
The abstract from the paper is:
*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub.
<Tip warning={true}>
🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve!
</Tip>
If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableDiffusionXLControlNetPipeline
[[autodoc]] StableDiffusionXLControlNetPipeline
- all
- __call__
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -0,0 +1,33 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Cycle Diffusion
Cycle Diffusion is a text guided image-to-image generation model proposed in [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://huggingface.co/papers/2210.05559) by Chen Henry Wu, Fernando De la Torre.
The abstract from the paper is:
*Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.*
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## CycleDiffusionPipeline
[[autodoc]] CycleDiffusionPipeline
- all
- __call__
## StableDiffusionPiplineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -1,100 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Cycle Diffusion
## Overview
Cycle Diffusion is a Text-Guided Image-to-Image Generation model proposed in [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) by Chen Henry Wu, Fernando De la Torre.
The abstract of the paper is the following:
*Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.*
*Tips*:
- The Cycle Diffusion pipeline is fully compatible with any [Stable Diffusion](./stable_diffusion) checkpoints
- Currently Cycle Diffusion only works with the [`DDIMScheduler`].
*Example*:
In the following we should how to best use the [`CycleDiffusionPipeline`]
```python
import requests
import torch
from PIL import Image
from io import BytesIO
from diffusers import CycleDiffusionPipeline, DDIMScheduler
# load the pipeline
# make sure you're logged in with `huggingface-cli login`
model_id_or_path = "CompVis/stable-diffusion-v1-4"
scheduler = DDIMScheduler.from_pretrained(model_id_or_path, subfolder="scheduler")
pipe = CycleDiffusionPipeline.from_pretrained(model_id_or_path, scheduler=scheduler).to("cuda")
# let's download an initial image
url = "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/An%20astronaut%20riding%20a%20horse.png"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))
init_image.save("horse.png")
# let's specify a prompt
source_prompt = "An astronaut riding a horse"
prompt = "An astronaut riding an elephant"
# call the pipeline
image = pipe(
prompt=prompt,
source_prompt=source_prompt,
image=init_image,
num_inference_steps=100,
eta=0.1,
strength=0.8,
guidance_scale=2,
source_guidance_scale=1,
).images[0]
image.save("horse_to_elephant.png")
# let's try another example
# See more samples at the original repo: https://github.com/ChenWu98/cycle-diffusion
url = "https://raw.githubusercontent.com/ChenWu98/cycle-diffusion/main/data/dalle2/A%20black%20colored%20car.png"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))
init_image.save("black.png")
source_prompt = "A black colored car"
prompt = "A blue colored car"
# call the pipeline
torch.manual_seed(0)
image = pipe(
prompt=prompt,
source_prompt=source_prompt,
image=init_image,
num_inference_steps=100,
eta=0.1,
strength=0.85,
guidance_scale=3,
source_guidance_scale=1,
).images[0]
image.save("black_to_blue.png")
```
## CycleDiffusionPipeline
[[autodoc]] CycleDiffusionPipeline
- all
- __call__

View File

@@ -12,23 +12,22 @@ specific language governing permissions and limitations under the License.
# Dance Diffusion
## Overview
[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) is by Zach Evans.
[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) by Zach Evans.
Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org).
Dance Diffusion is the first in a suite of generative audio tools for producers and musicians to be released by Harmonai.
For more info or to get involved in the development of these tools, please visit https://harmonai.org and fill out the form on the front page.
The original codebase of this implementation can be found at [Harmonai-org](https://github.com/Harmonai-org/sample-generator).
The original codebase of this implementation can be found [here](https://github.com/Harmonai-org/sample-generator).
<Tip>
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_dance_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py) | *Unconditional Audio Generation* | - |
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## DanceDiffusionPipeline
[[autodoc]] DanceDiffusionPipeline
- all
- __call__
## AudioPipelineOutput
[[autodoc]] pipelines.AudioPipelineOutput

View File

@@ -0,0 +1,29 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DDIM
[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
The abstract from the paper is:
*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.*
The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim).
## DDIMPipeline
[[autodoc]] DDIMPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -1,36 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DDIM
## Overview
[Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon.
The abstract of the paper is the following:
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
The original codebase of this paper can be found here: [ermongroup/ddim](https://github.com/ermongroup/ddim).
For questions, feel free to contact the author on [tsong.me](https://tsong.me/).
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddim/pipeline_ddim.py) | *Unconditional Image Generation* | - |
## DDIMPipeline
[[autodoc]] DDIMPipeline
- all
- __call__

View File

@@ -0,0 +1,35 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DDPM
[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. In the 🤗 Diffusers library, DDPM refers to the *discrete denoising scheduler* from the paper as well as the pipeline.
The abstract from the paper is:
*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.*
The original codebase can be found at [hohonathanho/diffusion](https://github.com/hojonathanho/diffusion).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
# DDPMPipeline
[[autodoc]] DDPMPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -1,37 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DDPM
## Overview
[Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
(DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes the diffusion based model of the same name, but in the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline.
The abstract of the paper is the following:
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.
The original codebase of this paper can be found [here](https://github.com/hojonathanho/diffusion).
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ddpm/pipeline_ddpm.py) | *Unconditional Image Generation* | - |
# DDPMPipeline
[[autodoc]] DDPMPipeline
- all
- __call__

View File

@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# IF
# DeepFloyd IF
## Overview

View File

@@ -0,0 +1,55 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# DiffEdit
[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.
The abstract from the paper is:
*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*
The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️
## Tips
* The pipeline can generate masks that can be fed into other inpainting pipelines.
* In order to generate an image using this pipeline, both an image mask (source and target prompts can be manually specified or generated, and passed to [`~StableDiffusionDiffEditPipeline.generate_mask`])
and a set of partially inverted latents (generated using [`~StableDiffusionDiffEditPipeline.invert`]) _must_ be provided as arguments when calling the pipeline to generate the final edited image.
* The function [`~StableDiffusionDiffEditPipeline.generate_mask`] exposes two prompt arguments, `source_prompt` and `target_prompt`
that let you control the locations of the semantic edits in the final image to be generated. Let's say,
you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to
`source_prompt` and "dog" to `target_prompt`.
* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the
overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
source concept is sufficently descriptive to yield good results, but feel free to explore alternatives.
* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to
the phrases including "cat" to `negative_prompt` and "dog" to `prompt`.
* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
* Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
* Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog".
* Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](/using-diffusers/diffedit) guide for more details.
## StableDiffusionDiffEditPipeline
[[autodoc]] StableDiffusionDiffEditPipeline
- all
- generate_mask
- invert
- __call__
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -1,360 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Zero-shot Diffusion-based Semantic Image Editing with Mask Guidance
## Overview
[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://arxiv.org/abs/2210.11427) by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.
The abstract of the paper is the following:
*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.*
Resources:
* [Paper](https://arxiv.org/abs/2210.11427).
* [Blog Post with Demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html).
* [Implementation on Github](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/).
## Tips
* The pipeline can generate masks that can be fed into other inpainting pipelines. Check out the code examples below to know more.
* In order to generate an image using this pipeline, both an image mask (manually specified or generated using `generate_mask`)
and a set of partially inverted latents (generated using `invert`) _must_ be provided as arguments when calling the pipeline to generate the final edited image.
Refer to the code examples below for more details.
* The function `generate_mask` exposes two prompt arguments, `source_prompt` and `target_prompt`,
that let you control the locations of the semantic edits in the final image to be generated. Let's say,
you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to
`source_prompt_embeds` and "dog" to `target_prompt_embeds`. Refer to the code example below for more details.
* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the
overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the
source concept is sufficently descriptive to yield good results, but feel free to explore alternatives.
Please refer to [this code example](#generating-image-captions-for-inversion) for more details.
* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt`
and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to
the phrases including "cat" to `negative_prompt_embeds` and "dog" to `prompt_embeds`. Refer to the code example
below for more details.
* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
* Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
* Change the input prompt for `invert` to include "dog".
* Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image.
* Note that the source and target prompts, or their corresponding embeddings, can also be automatically generated. Please, refer to [this discussion](#generating-source-and-target-embeddings) for more details.
## Available Pipelines:
| Pipeline | Tasks
|---|---|
| [StableDiffusionDiffEditPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py) | *Text-Based Image Editing*
<!-- TODO: add Colab -->
## Usage example
### Based on an input image with a caption
When the pipeline is conditioned on an input image, we first obtain partially inverted latents from the input image using a
`DDIMInverseScheduler` with the help of a caption. Then we generate an editing mask to identify relevant regions in the image using the source and target prompts. Finally,
the inverted noise and generated mask is used to start the generation process.
First, let's load our pipeline:
```py
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline
sd_model_ckpt = "stabilityai/stable-diffusion-2-1"
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
sd_model_ckpt,
torch_dtype=torch.float16,
safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
generator = torch.manual_seed(0)
```
Then, we load an input image to edit using our method:
```py
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
```
Then, we employ the source and target prompts to generate the editing mask:
```py
# See the "Generating source and target embeddings" section below to
# automate the generation of these captions with a pre-trained model like Flan-T5 as explained below.
source_prompt = "a bowl of fruits"
target_prompt = "a basket of fruits"
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt=source_prompt,
target_prompt=target_prompt,
generator=generator,
)
```
Then, we employ the caption and the input image to get the inverted latents:
```py
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image, generator=generator).latents
```
Now, generate the image with the inverted latents and semantically generated mask:
```py
image = pipeline(
prompt=target_prompt,
mask_image=mask_image,
image_latents=inv_latents,
generator=generator,
negative_prompt=source_prompt,
).images[0]
image.save("edited_image.png")
```
## Generating image captions for inversion
The authors originally used the source concept prompt as the caption for generating the partially inverted latents. However, we can also leverage open source and public image captioning models for the same purpose.
Below, we provide an end-to-end example with the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model
for generating captions.
First, let's load our automatic image captioning model:
```py
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor
captioner_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(captioner_id)
model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)
```
Then, we define a utility to generate captions from an input image using the model:
```py
@torch.no_grad()
def generate_caption(images, caption_generator, caption_processor):
text = "a photograph of"
inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
caption_generator.to("cuda")
outputs = caption_generator.generate(**inputs, max_new_tokens=128)
# offload caption generator
caption_generator.to("cpu")
caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
return caption
```
Then, we load an input image for conditioning and obtain a suitable caption for it:
```py
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
caption = generate_caption(raw_image, model, processor)
```
Then, we employ the generated caption and the input image to get the inverted latents:
```py
from diffusers import DDIMInverseScheduler, DDIMScheduler
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
generator = torch.manual_seed(0)
inv_latents = pipeline.invert(prompt=caption, image=raw_image, generator=generator).latents
```
Now, generate the image with the inverted latents and semantically generated mask from our source and target prompts:
```py
source_prompt = "a bowl of fruits"
target_prompt = "a basket of fruits"
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt=source_prompt,
target_prompt=target_prompt,
generator=generator,
)
image = pipeline(
prompt=target_prompt,
mask_image=mask_image,
image_latents=inv_latents,
generator=generator,
negative_prompt=source_prompt,
).images[0]
image.save("edited_image.png")
```
## Generating source and target embeddings
The authors originally required the user to manually provide the source and target prompts for discovering
edit directions. However, we can also leverage open source and public models for the same purpose.
Below, we provide an end-to-end example with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model
for generating source an target embeddings.
**1. Load the generation model**:
```py
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
```
**2. Construct a starting prompt**:
```py
source_concept = "bowl"
target_concept = "basket"
source_text = f"Provide a caption for images containing a {source_concept}. "
"The captions should be in English and should be no longer than 150 characters."
target_text = f"Provide a caption for images containing a {target_concept}. "
"The captions should be in English and should be no longer than 150 characters."
```
Here, we're interested in the "bowl -> basket" direction.
**3. Generate prompts**:
We can use a utility like so for this purpose.
```py
@torch.no_grad
def generate_prompts(input_prompt):
input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(
input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
)
return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
And then we just call it to generate our prompts:
```py
source_prompts = generate_prompts(source_text)
target_prompts = generate_prompts(target_text)
```
We encourage you to play around with the different parameters supported by the
`generate()` method ([documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.generation_tf_utils.TFGenerationMixin.generate)) for the generation quality you are looking for.
**4. Load the embedding model**:
Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.
```py
from diffusers import StableDiffusionDiffEditPipeline
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
generator = torch.manual_seed(0)
```
**5. Compute embeddings**:
```py
import torch
@torch.no_grad()
def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
embeddings = []
for sent in sentences:
text_inputs = tokenizer(
sent,
padding="max_length",
max_length=tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids
prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
embeddings.append(prompt_embeds)
return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)
source_embeddings = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
target_embeddings = embed_prompts(target_captions, pipeline.tokenizer, pipeline.text_encoder)
```
And you're done! Now, you can use these embeddings directly while calling the pipeline:
```py
from diffusers import DDIMInverseScheduler, DDIMScheduler
from diffusers.utils import load_image
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt_embeds=source_embeds,
target_prompt_embeds=target_embeds,
generator=generator,
)
inv_latents = pipeline.invert(
prompt_embeds=source_embeds,
image=raw_image,
generator=generator,
).latents
images = pipeline(
mask_image=mask_image,
image_latents=inv_latents,
prompt_embeds=target_embeddings,
negative_prompt_embeds=source_embeddings,
generator=generator,
).images
images[0].save("edited_image.png")
```
## StableDiffusionDiffEditPipeline
[[autodoc]] StableDiffusionDiffEditPipeline
- all
- generate_mask
- invert
- __call__

View File

@@ -10,50 +10,26 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Scalable Diffusion Models with Transformers (DiT)
# DiT
## Overview
[Scalable Diffusion Models with Transformers](https://huggingface.co/papers/2212.09748) (DiT) is by William Peebles and Saining Xie.
[Scalable Diffusion Models with Transformers](https://arxiv.org/abs/2212.09748) (DiT) by William Peebles and Saining Xie.
The abstract of the paper is the following:
The abstract from the paper is:
*We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.*
The original codebase of this paper can be found here: [facebookresearch/dit](https://github.com/facebookresearch/dit).
The original codebase can be found at [facebookresearch/dit](https://github.com/facebookresearch/dit).
## Available Pipelines:
<Tip>
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_dit.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/dit/pipeline_dit.py) | *Conditional Image Generation* | - |
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
## Usage example
```python
from diffusers import DiTPipeline, DPMSolverMultistepScheduler
import torch
pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
# pick words from Imagenet class labels
pipe.labels # to print all available words
# pick words that exist in ImageNet
words = ["white shark", "umbrella"]
class_ids = pipe.get_label_ids(words)
generator = torch.manual_seed(33)
output = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator)
image = output.images[0] # label 'white shark'
```
</Tip>
## DiTPipeline
[[autodoc]] DiTPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -105,6 +105,30 @@ One cheeseburger monster coming up! Enjoy!
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/cheeseburger.png)
<Tip>
We also provide an end-to-end Kandinsky pipeline [`KandinskyCombinedPipeline`], which combines both the prior pipeline and text-to-image pipeline, and lets you perform inference in a single step. You can create the combined pipeline with the [`~AutoPipelineForText2Image.from_pretrained`] method
```python
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained(
"kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
```
Under the hood, it will automatically load both [`KandinskyPriorPipeline`] and [`KandinskyPipeline`]. To generate images, you no longer need to call both pipelines and pass the outputs from one to another. You only need to call the combined pipeline once. You can set different `guidance_scale` and `num_inference_steps` for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` arguments.
```python
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
image = pipe(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale =1.0, guidance_scacle = 4.0, height=768, width=768).images[0]
```
</Tip>
The Kandinsky model works extremely well with creative prompts. Here is some of the amazing art that can be created using the exact same process but with different prompts.
```python
@@ -187,6 +211,34 @@ out.images[0].save("fantasy_land.png")
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/img2img_fantasyland.png)
<Tip>
You can also use the [`KandinskyImg2ImgCombinedPipeline`] for end-to-end image-to-image generation with Kandinsky 2.1
```python
from diffusers import AutoPipelineForImage2Image
import torch
import requests
from io import BytesIO
from PIL import Image
import os
pipe = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image.thumbnail((768, 768))
image = pipe(prompt=prompt, image=original_image, strength=0.3).images[0]
```
</Tip>
### Text Guided Inpainting Generation
You can use [`KandinskyInpaintPipeline`] to edit images. In this example, we will add a hat to the portrait of a cat.
@@ -212,9 +264,9 @@ init_image = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinsky/cat.png"
)
mask = np.ones((768, 768), dtype=np.float32)
mask = np.zeros((768, 768), dtype=np.float32)
# Let's mask out an area above the cat's head
mask[:250, 250:-250] = 0
mask[:250, 250:-250] = 1
out = pipe(
prompt,
@@ -231,6 +283,33 @@ image.save("cat_with_hat.png")
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/inpaint_cat_hat.png)
<Tip>
To use the [`KandinskyInpaintCombinedPipeline`] to perform end-to-end image inpainting generation, you can run below code instead
```python
from diffusers import AutoPipelineForInpainting
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0]
```
</Tip>
🚨🚨🚨 __Breaking change for Kandinsky Mask Inpainting__ 🚨🚨🚨
We introduced a breaking change for Kandinsky inpainting pipeline in the following pull request: https://github.com/huggingface/diffusers/pull/4207. Previously we accepted a mask format where black pixels represent the masked-out area. This is inconsistent with all other pipelines in diffusers. We have changed the mask format in Knaindsky and now using white pixels instead.
Please upgrade your inpainting code to follow the above. If you are using Kandinsky Inpaint in production. You now need to change the mask to:
```python
# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)
# For PyTorch and Numpy input
mask = 1 - mask
```
### Interpolate
The [`KandinskyPriorPipeline`] also comes with a cool utility function that will allow you to interpolate the latent space of different images and texts super easily. Here is an example of how you can create an Impressionist-style portrait for your pet based on "The Starry Night".
@@ -276,208 +355,6 @@ image.save("starry_cat.png")
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/starry_cat.png)
### Text-to-Image Generation with ControlNet Conditioning
In the following, we give a simple example of how to use [`KandinskyV22ControlnetPipeline`] to add control to the text-to-image generation with a depth image.
First, let's take an image and extract its depth map.
```python
from diffusers.utils import load_image
img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png)
We can use the `depth-estimation` pipeline from transformers to process the image and retrieve its depth map.
```python
import torch
import numpy as np
from transformers import pipeline
from diffusers.utils import load_image
def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"]
image = np.array(image)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
detected_map = torch.from_numpy(image).float() / 255.0
hint = detected_map.permute(2, 0, 1)
return hint
depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
```
Now, we load the prior pipeline and the text-to-image controlnet pipeline
```python
from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
)
pipe_prior = pipe_prior.to("cuda")
pipe = KandinskyV22ControlnetPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
```
We pass the prompt and negative prompt through the prior to generate image embeddings
```python
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43)
image_emb, zero_image_emb = pipe_prior(
prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple()
```
Now we can pass the image embeddings and the depth image we extracted to the controlnet pipeline. With Kandinsky 2.2, only prior pipelines accept `prompt` input. You do not need to pass the prompt to the controlnet pipeline.
```python
images = pipe(
image_embeds=image_emb,
negative_image_embeds=zero_image_emb,
hint=hint,
num_inference_steps=50,
generator=generator,
height=768,
width=768,
).images
images[0].save("robot_cat.png")
```
The output image looks as follow:
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat_text2img.png)
### Image-to-Image Generation with ControlNet Conditioning
Kandinsky 2.2 also includes a [`KandinskyV22ControlnetImg2ImgPipeline`] that will allow you to add control to the image generation process with both the image and its depth map. This pipeline works really well with [`KandinskyV22PriorEmb2EmbPipeline`], which generates image embeddings based on both a text prompt and an image.
For our robot cat example, we will pass the prompt and cat image together to the prior pipeline to generate an image embedding. We will then use that image embedding and the depth map of the cat to further control the image generation process.
We can use the same cat image and its depth map from the last example.
```python
import torch
import numpy as np
from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
from diffusers.utils import load_image
from transformers import pipeline
img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png"
).resize((768, 768))
def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"]
image = np.array(image)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
detected_map = torch.from_numpy(image).float() / 255.0
hint = detected_map.permute(2, 0, 1)
return hint
depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
)
pipe_prior = pipe_prior.to("cuda")
pipe = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43)
# run prior pipeline
img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator)
negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
# run controlnet img2img pipeline
images = pipe(
image=img,
strength=0.5,
image_embeds=img_emb.image_embeds,
negative_image_embeds=negative_emb.image_embeds,
hint=hint,
num_inference_steps=50,
generator=generator,
height=768,
width=768,
).images
images[0].save("robot_cat.png")
```
Here is the output. Compared with the output from our text-to-image controlnet example, it kept a lot more cat facial details from the original image and worked into the robot style we asked for.
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat.png)
## Kandinsky 2.2
The Kandinsky 2.2 release includes robust new text-to-image models that support text-to-image generation, image-to-image generation, image interpolation, and text-guided image inpainting. The general workflow to perform these tasks using Kandinsky 2.2 is the same as in Kandinsky 2.1. First, you will need to use a prior pipeline to generate image embeddings based on your text prompt, and then use one of the image decoding pipelines to generate the output image. The only difference is that in Kandinsky 2.2, all of the decoding pipelines no longer accept the `prompt` input, and the image generation process is conditioned with only `image_embeds` and `negative_image_embeds`.
Let's look at an example of how to perform text-to-image generation using Kandinsky 2.2.
First, let's create the prior pipeline and text-to-image pipeline with Kandinsky 2.2 checkpoints.
```python
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
```
You can then use `pipe_prior` to generate image embeddings.
```python
prompt = "portrait of a women, blue eyes, cinematic"
negative_prompt = "low quality, bad quality"
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
```
Now you can pass these embeddings to the text-to-image pipeline. When using Kandinsky 2.2 you don't need to pass the `prompt` (but you do with the previous version, Kandinsky 2.1).
```
image = t2i_pipe(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[
0
]
image.save("portrait.png")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/%20blue%20eyes.png)
We used the text-to-image pipeline as an example, but the same process applies to all decoding pipelines in Kandinsky 2.2. For more information, please refer to our API section for each pipeline.
## Optimization
Running Kandinsky in inference requires running both a first prior pipeline: [`KandinskyPriorPipeline`]
@@ -530,64 +407,24 @@ t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=T
After compilation you should see a very fast inference time. For more information,
feel free to have a look at [Our PyTorch 2.0 benchmark](https://huggingface.co/docs/diffusers/main/en/optimization/torch2.0).
<Tip>
To generate images directly from a single pipeline, you can use [`KandinskyCombinedPipeline`], [`KandinskyImg2ImgCombinedPipeline`], [`KandinskyInpaintCombinedPipeline`].
These combined pipelines wrap the [`KandinskyPriorPipeline`] and [`KandinskyPipeline`], [`KandinskyImg2ImgPipeline`], [`KandinskyInpaintPipeline`] respectively into a single
pipeline for a simpler user experience
</Tip>
## Available Pipelines:
| Pipeline | Tasks |
|---|---|
| [pipeline_kandinsky2_2.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2.py) | *Text-to-Image Generation* |
| [pipeline_kandinsky.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py) | *Text-to-Image Generation* |
| [pipeline_kandinsky2_2_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_inpaint.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky_combined.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky_combined.py) | *End-to-end Text-to-Image, image-to-image, Inpainting Generation* |
| [pipeline_kandinsky_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky2_2_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_img2img.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_img2img.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky2_2_controlnet.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky2_2_controlnet_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet_img2img.py) | *Image-Guided Image Generation* |
### KandinskyV22Pipeline
[[autodoc]] KandinskyV22Pipeline
- all
- __call__
### KandinskyV22ControlnetPipeline
[[autodoc]] KandinskyV22ControlnetPipeline
- all
- __call__
### KandinskyV22ControlnetImg2ImgPipeline
[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
- all
- __call__
### KandinskyV22Img2ImgPipeline
[[autodoc]] KandinskyV22Img2ImgPipeline
- all
- __call__
### KandinskyV22InpaintPipeline
[[autodoc]] KandinskyV22InpaintPipeline
- all
- __call__
### KandinskyV22PriorPipeline
[[autodoc]] ## KandinskyV22PriorPipeline
- all
- __call__
- interpolate
### KandinskyV22PriorEmb2EmbPipeline
[[autodoc]] KandinskyV22PriorEmb2EmbPipeline
- all
- __call__
- interpolate
### KandinskyPriorPipeline
[[autodoc]] KandinskyPriorPipeline
@@ -612,3 +449,21 @@ feel free to have a look at [Our PyTorch 2.0 benchmark](https://huggingface.co/d
[[autodoc]] KandinskyInpaintPipeline
- all
- __call__
### KandinskyCombinedPipeline
[[autodoc]] KandinskyCombinedPipeline
- all
- __call__
### KandinskyImg2ImgCombinedPipeline
[[autodoc]] KandinskyImg2ImgCombinedPipeline
- all
- __call__
### KandinskyInpaintCombinedPipeline
[[autodoc]] KandinskyInpaintCombinedPipeline
- all
- __call__

View File

@@ -0,0 +1,357 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Kandinsky 2.2
The Kandinsky 2.2 release includes robust new text-to-image models that support text-to-image generation, image-to-image generation, image interpolation, and text-guided image inpainting. The general workflow to perform these tasks using Kandinsky 2.2 is the same as in Kandinsky 2.1. First, you will need to use a prior pipeline to generate image embeddings based on your text prompt, and then use one of the image decoding pipelines to generate the output image. The only difference is that in Kandinsky 2.2, all of the decoding pipelines no longer accept the `prompt` input, and the image generation process is conditioned with only `image_embeds` and `negative_image_embeds`.
Same as with Kandinsky 2.1, the easiest way to perform text-to-image generation is to use the combined Kandinsky pipeline. This process is exactly the same as Kandinsky 2.1. All you need to do is to replace the Kandinsky 2.1 checkpoint with 2.2.
```python
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
image = pipe(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale =1.0, height=768, width=768).images[0]
```
Now, let's look at an example where we take separate steps to run the prior pipeline and text-to-image pipeline. This way, we can understand what's happening under the hood and how Kandinsky 2.2 differs from Kandinsky 2.1.
First, let's create the prior pipeline and text-to-image pipeline with Kandinsky 2.2 checkpoints.
```python
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
```
You can then use `pipe_prior` to generate image embeddings.
```python
prompt = "portrait of a women, blue eyes, cinematic"
negative_prompt = "low quality, bad quality"
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
```
Now you can pass these embeddings to the text-to-image pipeline. When using Kandinsky 2.2 you don't need to pass the `prompt` (but you do with the previous version, Kandinsky 2.1).
```
image = t2i_pipe(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[
0
]
image.save("portrait.png")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/%20blue%20eyes.png)
We used the text-to-image pipeline as an example, but the same process applies to all decoding pipelines in Kandinsky 2.2. For more information, please refer to our API section for each pipeline.
### Text-to-Image Generation with ControlNet Conditioning
In the following, we give a simple example of how to use [`KandinskyV22ControlnetPipeline`] to add control to the text-to-image generation with a depth image.
First, let's take an image and extract its depth map.
```python
from diffusers.utils import load_image
img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png)
We can use the `depth-estimation` pipeline from transformers to process the image and retrieve its depth map.
```python
import torch
import numpy as np
from transformers import pipeline
from diffusers.utils import load_image
def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"]
image = np.array(image)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
detected_map = torch.from_numpy(image).float() / 255.0
hint = detected_map.permute(2, 0, 1)
return hint
depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
```
Now, we load the prior pipeline and the text-to-image controlnet pipeline
```python
from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
)
pipe_prior = pipe_prior.to("cuda")
pipe = KandinskyV22ControlnetPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
```
We pass the prompt and negative prompt through the prior to generate image embeddings
```python
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43)
image_emb, zero_image_emb = pipe_prior(
prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple()
```
Now we can pass the image embeddings and the depth image we extracted to the controlnet pipeline. With Kandinsky 2.2, only prior pipelines accept `prompt` input. You do not need to pass the prompt to the controlnet pipeline.
```python
images = pipe(
image_embeds=image_emb,
negative_image_embeds=zero_image_emb,
hint=hint,
num_inference_steps=50,
generator=generator,
height=768,
width=768,
).images
images[0].save("robot_cat.png")
```
The output image looks as follow:
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat_text2img.png)
### Image-to-Image Generation with ControlNet Conditioning
Kandinsky 2.2 also includes a [`KandinskyV22ControlnetImg2ImgPipeline`] that will allow you to add control to the image generation process with both the image and its depth map. This pipeline works really well with [`KandinskyV22PriorEmb2EmbPipeline`], which generates image embeddings based on both a text prompt and an image.
For our robot cat example, we will pass the prompt and cat image together to the prior pipeline to generate an image embedding. We will then use that image embedding and the depth map of the cat to further control the image generation process.
We can use the same cat image and its depth map from the last example.
```python
import torch
import numpy as np
from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
from diffusers.utils import load_image
from transformers import pipeline
img = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png"
).resize((768, 768))
def make_hint(image, depth_estimator):
image = depth_estimator(image)["depth"]
image = np.array(image)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
detected_map = torch.from_numpy(image).float() / 255.0
hint = detected_map.permute(2, 0, 1)
return hint
depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
)
pipe_prior = pipe_prior.to("cuda")
pipe = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
"kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43)
# run prior pipeline
img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator)
negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
# run controlnet img2img pipeline
images = pipe(
image=img,
strength=0.5,
image_embeds=img_emb.image_embeds,
negative_image_embeds=negative_emb.image_embeds,
hint=hint,
num_inference_steps=50,
generator=generator,
height=768,
width=768,
).images
images[0].save("robot_cat.png")
```
Here is the output. Compared with the output from our text-to-image controlnet example, it kept a lot more cat facial details from the original image and worked into the robot style we asked for.
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat.png)
## Optimization
Running Kandinsky in inference requires running both a first prior pipeline: [`KandinskyPriorPipeline`]
and a second image decoding pipeline which is one of [`KandinskyPipeline`], [`KandinskyImg2ImgPipeline`], or [`KandinskyInpaintPipeline`].
The bulk of the computation time will always be the second image decoding pipeline, so when looking
into optimizing the model, one should look into the second image decoding pipeline.
When running with PyTorch < 2.0, we strongly recommend making use of [`xformers`](https://github.com/facebookresearch/xformers)
to speed-up the optimization. This can be done by simply running:
```py
from diffusers import DiffusionPipeline
import torch
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.enable_xformers_memory_efficient_attention()
```
When running on PyTorch >= 2.0, PyTorch's SDPA attention will automatically be used. For more information on
PyTorch's SDPA, feel free to have a look at [this blog post](https://pytorch.org/blog/accelerated-diffusers-pt-20/).
To have explicit control , you can also manually set the pipeline to use PyTorch's 2.0 efficient attention:
```py
from diffusers.models.attention_processor import AttnAddedKVProcessor2_0
t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
```
The slowest and most memory intense attention processor is the default `AttnAddedKVProcessor` processor.
We do **not** recommend using it except for testing purposes or cases where very high determistic behaviour is desired.
You can set it with:
```py
from diffusers.models.attention_processor import AttnAddedKVProcessor
t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor())
```
With PyTorch >= 2.0, you can also use Kandinsky with `torch.compile` which depending
on your hardware can signficantly speed-up your inference time once the model is compiled.
To use Kandinsksy with `torch.compile`, you can do:
```py
t2i_pipe.unet.to(memory_format=torch.channels_last)
t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=True)
```
After compilation you should see a very fast inference time. For more information,
feel free to have a look at [Our PyTorch 2.0 benchmark](https://huggingface.co/docs/diffusers/main/en/optimization/torch2.0).
<Tip>
To generate images directly from a single pipeline, you can use [`KandinskyV22CombinedPipeline`], [`KandinskyV22Img2ImgCombinedPipeline`], [`KandinskyV22InpaintCombinedPipeline`].
These combined pipelines wrap the [`KandinskyV22PriorPipeline`] and [`KandinskyV22Pipeline`], [`KandinskyV22Img2ImgPipeline`], [`KandinskyV22InpaintPipeline`] respectively into a single
pipeline for a simpler user experience
</Tip>
## Available Pipelines:
| Pipeline | Tasks |
|---|---|
| [pipeline_kandinsky2_2.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2.py) | *Text-to-Image Generation* |
| [pipeline_kandinsky2_2_combined.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_combined.py) | *End-to-end Text-to-Image, image-to-image, Inpainting Generation* |
| [pipeline_kandinsky2_2_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_inpaint.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky2_2_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_img2img.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky2_2_controlnet.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet.py) | *Image-Guided Image Generation* |
| [pipeline_kandinsky2_2_controlnet_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky2_2/pipeline_kandinsky2_2_controlnet_img2img.py) | *Image-Guided Image Generation* |
### KandinskyV22Pipeline
[[autodoc]] KandinskyV22Pipeline
- all
- __call__
### KandinskyV22ControlnetPipeline
[[autodoc]] KandinskyV22ControlnetPipeline
- all
- __call__
### KandinskyV22ControlnetImg2ImgPipeline
[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
- all
- __call__
### KandinskyV22Img2ImgPipeline
[[autodoc]] KandinskyV22Img2ImgPipeline
- all
- __call__
### KandinskyV22InpaintPipeline
[[autodoc]] KandinskyV22InpaintPipeline
- all
- __call__
### KandinskyV22PriorPipeline
[[autodoc]] KandinskyV22PriorPipeline
- all
- __call__
- interpolate
### KandinskyV22PriorEmb2EmbPipeline
[[autodoc]] KandinskyV22PriorEmb2EmbPipeline
- all
- __call__
- interpolate
### KandinskyV22CombinedPipeline
[[autodoc]] KandinskyV22CombinedPipeline
- all
- __call__
### KandinskyV22Img2ImgCombinedPipeline
[[autodoc]] KandinskyV22Img2ImgCombinedPipeline
- all
- __call__
### KandinskyV22InpaintCombinedPipeline
[[autodoc]] KandinskyV22InpaintCombinedPipeline
- all
- __call__

View File

@@ -12,31 +12,19 @@ specific language governing permissions and limitations under the License.
# Latent Diffusion
## Overview
Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
The abstract of the paper is the following:
The abstract from the paper is:
*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.*
The original codebase can be found [here](https://github.com/CompVis/latent-diffusion).
The original codebase can be found at [Compvis/latent-diffusion](https://github.com/CompVis/latent-diffusion).
## Tips:
<Tip>
-
-
-
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_latent_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) | *Text-to-Image Generation* | - |
| [pipeline_latent_diffusion_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion_superresolution.py) | *Super Resolution* | - |
## Examples:
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## LDMTextToImagePipeline
[[autodoc]] LDMTextToImagePipeline
@@ -47,3 +35,6 @@ The original codebase can be found [here](https://github.com/CompVis/latent-diff
[[autodoc]] LDMSuperResolutionPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -12,31 +12,24 @@ specific language governing permissions and limitations under the License.
# Unconditional Latent Diffusion
## Overview
Unconditional Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
Unconditional Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
The abstract of the paper is the following:
The abstract from the paper is:
*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.*
The original codebase can be found [here](https://github.com/CompVis/latent-diffusion).
The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion).
## Tips:
<Tip>
-
-
-
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_latent_diffusion_uncond.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion_uncond/pipeline_latent_diffusion_uncond.py) | *Unconditional Image Generation* | - |
## Examples:
</Tip>
## LDMPipeline
[[autodoc]] LDMPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -10,52 +10,26 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Editing Implicit Assumptions in Text-to-Image Diffusion Models
# Text-to-image model editing
## Overview
[Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://huggingface.co/papers/2303.08084) is by Hadas Orgad, Bahjat Kawar, and Yonatan Belinkov. This pipeline enables editing diffusion model weights, such that its assumptions of a given concept are changed. The resulting change is expected to take effect in all prompt generations related to the edited concept.
[Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://arxiv.org/abs/2303.08084) by Hadas Orgad, Bahjat Kawar, and Yonatan Belinkov.
The abstract of the paper is the following:
The abstract from the paper is:
*Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations.*
Resources:
You can find additional information about model editing on the [project page](https://time-diffusion.github.io/), [original codebase](https://github.com/bahjat-kawar/time-diffusion), and try it out in a [demo](https://huggingface.co/spaces/bahjat-kawar/time-diffusion).
* [Project Page](https://time-diffusion.github.io/).
* [Paper](https://arxiv.org/abs/2303.08084).
* [Original Code](https://github.com/bahjat-kawar/time-diffusion).
* [Demo](https://huggingface.co/spaces/bahjat-kawar/time-diffusion).
<Tip>
## Available Pipelines:
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionModelEditingPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_model_editing.py) | *Text-to-Image Model Editing* | [🤗 Space](https://huggingface.co/spaces/bahjat-kawar/time-diffusion)) |
This pipeline enables editing the diffusion model weights, such that its assumptions on a given concept are changed. The resulting change is expected to take effect in all prompt generations pertaining to the edited concept.
## Usage example
```python
import torch
from diffusers import StableDiffusionModelEditingPipeline
model_ckpt = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionModelEditingPipeline.from_pretrained(model_ckpt)
pipe = pipe.to("cuda")
source_prompt = "A pack of roses"
destination_prompt = "A pack of blue roses"
pipe.edit_model(source_prompt, destination_prompt)
prompt = "A field of roses"
image = pipe(prompt).images[0]
image.save("field_of_roses.png")
```
</Tip>
## StableDiffusionModelEditingPipeline
[[autodoc]] StableDiffusionModelEditingPipeline
- __call__
- all
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -0,0 +1,57 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# MusicLDM
MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
MusicLDM takes a text prompt as input and predicts the corresponding music sample.
Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview),
MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents.
MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to
the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies
encourages the model to interpolate between the training samples, but stay within the domain of the training data. The
result is generated music that is more diverse while staying faithful to the corresponding style.
The abstract of the paper is the following:
*In this paper, we present MusicLDM, a state-of-the-art text-to-music model that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, to encourage the model to generate music more diverse while still staying faithful to the corresponding style.*
This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi).
## Tips
When constructing a prompt, keep in mind:
* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
During inference:
* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument.
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between
scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines)
section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## MusicLDMPipeline
[[autodoc]] MusicLDMPipeline
- all
- __call__

View File

@@ -0,0 +1,40 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Pipelines
Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different scheduler or even model components.
All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components.
<Tip warning={true}>
Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../traininig/overview) guides instead!
</Tip>
## DiffusionPipeline
[[autodoc]] DiffusionPipeline
- all
- __call__
- device
- to
- components
## FlaxDiffusionPipeline
[[autodoc]] pipelines.pipeline_flax_utils.FlaxDiffusionPipeline
## PushToHubMixin
[[autodoc]] utils.PushToHubMixin

View File

@@ -1,118 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Pipelines
Pipelines provide a simple way to run state-of-the-art diffusion models in inference.
Most diffusion systems consist of multiple independently-trained models and highly adaptable scheduler
components - all of which are needed to have a functioning end-to-end diffusion system.
As an example, [Stable Diffusion](https://huggingface.co/blog/stable_diffusion) has three independently trained models:
- [Autoencoder](./api/models#vae)
- [Conditional Unet](./api/models#UNet2DConditionModel)
- [CLIP text encoder](https://huggingface.co/docs/transformers/v4.27.1/en/model_doc/clip#transformers.CLIPTextModel)
- a scheduler component, [scheduler](./api/scheduler#pndm),
- a [CLIPImageProcessor](https://huggingface.co/docs/transformers/v4.27.1/en/model_doc/clip#transformers.CLIPImageProcessor),
- as well as a [safety checker](./stable_diffusion#safety_checker).
All of these components are necessary to run stable diffusion in inference even though they were trained
or created independently from each other.
To that end, we strive to offer all open-sourced, state-of-the-art diffusion system under a unified API.
More specifically, we strive to provide pipelines that
- 1. can load the officially published weights and yield 1-to-1 the same outputs as the original implementation according to the corresponding paper (*e.g.* [LDMTextToImagePipeline](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/latent_diffusion), uses the officially released weights of [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)),
- 2. have a simple user interface to run the model in inference (see the [Pipelines API](#pipelines-api) section),
- 3. are easy to understand with code that is self-explanatory and can be read along-side the official paper (see [Pipelines summary](#pipelines-summary)),
- 4. can easily be contributed by the community (see the [Contribution](#contribution) section).
**Note** that pipelines do not (and should not) offer any training functionality.
If you are looking for *official* training examples, please have a look at [examples](https://github.com/huggingface/diffusers/tree/main/examples).
## 🧨 Diffusers Summary
The following table summarizes all officially supported pipelines, their corresponding paper, and if
available a colab notebook to directly try them out.
| Pipeline | Paper | Tasks | Colab
|---|---|:---:|:---:|
| [alt_diffusion](./alt_diffusion) | [**AltDiffusion**](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | -
| [audio_diffusion](./audio_diffusion) | [**Audio Diffusion**](https://github.com/teticio/audio_diffusion.git) | Unconditional Audio Generation |
| [controlnet](./api/pipelines/controlnet) | [**ControlNet with Stable Diffusion**](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb)
| [cycle_diffusion](./cycle_diffusion) | [**Cycle Diffusion**](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
| [dance_diffusion](./dance_diffusion) | [**Dance Diffusion**](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
| [ddpm](./ddpm) | [**Denoising Diffusion Probabilistic Models**](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
| [ddim](./ddim) | [**Denoising Diffusion Implicit Models**](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
| [if](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
| [if_img2img](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
| [if_inpainting](./if) | [**IF**](https://github.com/deep-floyd/IF) | Image-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)
| [kandinsky](./kandinsky) | **Kandinsky** | Text-to-Image Generation |
| [kandinsky_inpaint](./kandinsky) | **Kandinsky** | Image-to-Image Generation |
| [kandinsky_img2img](./kandinsky) | **Kandinsksy** | Image-to-Image Generation |
| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
| [latent_diffusion](./latent_diffusion) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
| [latent_diffusion_uncond](./latent_diffusion_uncond) | [**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
| [paint_by_example](./paint_by_example) | [**Paint by Example: Exemplar-based Image Editing with Diffusion Models**](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
| [paradigms](./paradigms) | [**Parallel Sampling of Diffusion Models**](https://arxiv.org/abs/2305.16317) | Text-to-Image Generation |
| [pndm](./pndm) | [**Pseudo Numerical Methods for Diffusion Models on Manifolds**](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
| [score_sde_ve](./score_sde_ve) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
| [score_sde_vp](./score_sde_vp) | [**Score-Based Generative Modeling through Stochastic Differential Equations**](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
| [semantic_stable_diffusion](./semantic_stable_diffusion) | [**SEGA: Instructing Diffusion using Semantic Dimensions**](https://arxiv.org/abs/2301.12247) | Text-to-Image Generation |
| [stable_diffusion_text2img](./stable_diffusion/text2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb)
| [stable_diffusion_img2img](./stable_diffusion/img2img) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb)
| [stable_diffusion_inpaint](./stable_diffusion/inpaint) | [**Stable Diffusion**](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb)
| [stable_diffusion_panorama](./stable_diffusion/panorama) | [**MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation**](https://arxiv.org/abs/2302.08113) | Text-Guided Panorama View Generation |
| [stable_diffusion_pix2pix](./stable_diffusion/pix2pix) | [**InstructPix2Pix: Learning to Follow Image Editing Instructions**](https://arxiv.org/abs/2211.09800) | Text-Based Image Editing |
| [stable_diffusion_pix2pix_zero](./stable_diffusion/pix2pix_zero) | [**Zero-shot Image-to-Image Translation**](https://arxiv.org/abs/2302.03027) | Text-Based Image Editing |
| [stable_diffusion_attend_and_excite](./stable_diffusion/attend_and_excite) | [**Attend and Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models**](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
| [stable_diffusion_self_attention_guidance](./stable_diffusion/self_attention_guidance) | [**Self-Attention Guidance**](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation |
| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [**Stable Diffusion Image Variations**](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [**Stable Diffusion Latent Upscaler**](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Depth-to-Image Text-Guided Generation |
| [stable_diffusion_2](./stable_diffusion/stable_diffusion_2) | [**Stable Diffusion 2**](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
| [stable_diffusion_safe](./stable_diffusion_safe) | [**Safe Stable Diffusion**](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb)
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation |
| [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation |
| [stochastic_karras_ve](./stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
| [text_to_video_sd](./api/pipelines/text_to_video) | [**Modelscope's Text-to-video-synthesis Model in Open Domain**](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
| [unclip](./unclip) | [**Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
| [versatile_diffusion](./versatile_diffusion) | [**Versatile Diffusion: Text, Images and Variations All in One Diffusion Model**](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
| [vq_diffusion](./vq_diffusion) | [**Vector Quantized Diffusion Model for Text-to-Image Synthesis**](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
| [text_to_video_zero](./text_to_video_zero) | [**Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators**](https://arxiv.org/abs/2303.13439) | Text-to-Video Generation |
**Note**: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers.
However, most of them can be adapted to use different scheduler components or even different model components. Some pipeline examples are shown in the [Examples](#examples) below.
## Pipelines API
Diffusion models often consist of multiple independently-trained models or other previously existing components.
Each model has been trained independently on a different task and the scheduler can easily be swapped out and replaced with a different one.
During inference, we however want to be able to easily load all components and use them in inference - even if one component, *e.g.* CLIP's text encoder, originates from a different library, such as [Transformers](https://github.com/huggingface/transformers). To that end, all pipelines provide the following functionality:
- [`from_pretrained` method](../diffusion_pipeline) that accepts a Hugging Face Hub repository id, *e.g.* [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) or a path to a local directory, *e.g.*
"./stable-diffusion". To correctly retrieve which models and components should be loaded, one has to provide a `model_index.json` file, *e.g.* [runwayml/stable-diffusion-v1-5/model_index.json](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json), which defines all components that should be
loaded into the pipelines. More specifically, for each model/component one needs to define the format `<name>: ["<library>", "<class name>"]`. `<name>` is the attribute name given to the loaded instance of `<class name>` which can be found in the library or pipeline folder called `"<library>"`.
- [`save_pretrained`](../diffusion_pipeline) that accepts a local path, *e.g.* `./stable-diffusion` under which all models/components of the pipeline will be saved. For each component/model a folder is created inside the local path that is named after the given attribute name, *e.g.* `./stable_diffusion/unet`.
In addition, a `model_index.json` file is created at the root of the local path, *e.g.* `./stable_diffusion/model_index.json` so that the complete pipeline can again be instantiated
from the local path.
- [`to`](../diffusion_pipeline) which accepts a `string` or `torch.device` to move all models that are of type `torch.nn.Module` to the passed device. The behavior is fully analogous to [PyTorch's `to` method](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to).
- [`__call__`] method to use the pipeline in inference. `__call__` defines inference logic of the pipeline and should ideally encompass all aspects of it, from pre-processing to forwarding tensors to the different models and schedulers, as well as post-processing. The API of the `__call__` method can strongly vary from pipeline to pipeline. *E.g.* a text-to-image pipeline, such as [`StableDiffusionPipeline`](./stable_diffusion) should accept among other things the text prompt to generate the image. A pure image generation pipeline, such as [DDPMPipeline](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/ddpm) on the other hand can be run without providing any inputs. To better understand what inputs can be adapted for
each pipeline, one should look directly into the respective pipeline.
**Note**: All pipelines have PyTorch's autograd disabled by decorating the `__call__` method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should
not be used for training. If you want to store the gradients during the forward pass, we recommend writing your own pipeline, see also our [community-examples](https://github.com/huggingface/diffusers/tree/main/examples/community).

View File

@@ -0,0 +1,39 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# PaintByExample
[Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen.
The abstract from the paper is:
*Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.*
The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and you can try it out in a [demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example).
## Tips
PaintByExample is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images.
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## PaintByExamplePipeline
[[autodoc]] PaintByExamplePipeline
- all
- __call__
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -1,74 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# PaintByExample
## Overview
[Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen.
The abstract of the paper is the following:
*Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.*
The original codebase can be found [here](https://github.com/Fantasy-Studio/Paint-by-Example).
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_paint_by_example.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py) | *Image-Guided Image Painting* | - |
## Tips
- PaintByExample is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint has been warm-started from the [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) and with the objective to inpaint partly masked images conditioned on example / reference images
- To quickly demo *PaintByExample*, please have a look at [this demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example)
- You can run the following code snippet as an example:
```python
# !pip install diffusers transformers
import PIL
import requests
import torch
from io import BytesIO
from diffusers import DiffusionPipeline
def download_image(url):
response = requests.get(url)
return PIL.Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/image/example_1.png"
mask_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/mask/example_1.png"
example_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/reference/example_1.jpg"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
example_image = download_image(example_url).resize((512, 512))
pipe = DiffusionPipeline.from_pretrained(
"Fantasy-Studio/Paint-by-Example",
torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
image = pipe(image=init_image, mask_image=mask_image, example_image=example_image).images[0]
image
```
## PaintByExamplePipeline
[[autodoc]] PaintByExamplePipeline
- all
- __call__

View File

@@ -0,0 +1,57 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# MultiDiffusion
[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.
The abstract from the paper is:
*Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.*
You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion).
## Tips
While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1.
For some GPUs with high performance, this can speedup the generation process and increase VRAM usage.
To generate panorama-like images make sure you pass the width parameter accordingly. We recommend a width value of 2048 which is the default.
Circular padding is applied to ensure there are no stitching artifacts when working with
panoramas to ensure a seamless transition from the rightmost part to the leftmost part.
By enabling circular padding (set `circular_padding=True`), the operation applies additional
crops after the rightmost point of the image, allowing the model to "see” the transition
from the rightmost part to the leftmost part. This helps maintain visual consistency in
a 360-degree sense and creates a proper “panorama” that can be viewed using 360-degree
panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied
to ensure that the decoded latents match in the RGB space.
For example, without circular padding, there is a stitching artifact (default):
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20no_circular_padding.png)
But with circular padding, the right and the left parts are matching (`circular_padding=True`):
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20circular_padding.png)
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableDiffusionPanoramaPipeline
[[autodoc]] StableDiffusionPanoramaPipeline
- __call__
- all
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -1,66 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
## Overview
[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://arxiv.org/abs/2302.08113) by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.
The abstract of the paper is the following:
*Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
Resources:
* [Project Page](https://multidiffusion.github.io/).
* [Paper](https://arxiv.org/abs/2302.08113).
* [Original Code](https://github.com/omerbt/MultiDiffusion).
* [Demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion).
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionPanoramaPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py) | *Text-Guided Panorama View Generation* | [🤗 Space](https://huggingface.co/spaces/weizmannscience/MultiDiffusion)) |
<!-- TODO: add Colab -->
## Usage example
```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler
model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a photo of the dolomites"
image = pipe(prompt).images[0]
image.save("dolomites.png")
```
<Tip>
While calling this pipeline, it's possible to specify the `view_batch_size` to have a >1 value.
For some GPUs with high performance, higher a `view_batch_size`, can speedup the generation
and increase the VRAM usage.
</Tip>
## StableDiffusionPanoramaPipeline
[[autodoc]] StableDiffusionPanoramaPipeline
- __call__
- all

View File

@@ -10,74 +10,45 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Parallel Sampling of Diffusion Models (ParaDiGMS)
# Parallel Sampling of Diffusion Models
## Overview
[Parallel Sampling of Diffusion Models](https://huggingface.co/papers/2305.16317) is by Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari.
[Parallel Sampling of Diffusion Models](https://arxiv.org/abs/2305.16317) by Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari.
The abstract of the paper is the following:
The abstract from the paper is:
*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.*
Resources:
The original codebase can be found at [AndyShih12/paradigms](https://github.com/AndyShih12/paradigms), and the pipeline was contributed by [AndyShih12](https://github.com/AndyShih12). ❤️
* [Paper](https://arxiv.org/abs/2305.16317).
* [Original Code](https://github.com/AndyShih12/paradigms).
## Tips
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionParadigmsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_paradigms.py) | *Faster Text-to-Image Generation* | |
This pipeline was contributed by [`AndyShih12`](https://github.com/AndyShih12) in this [PR](https://github.com/huggingface/diffusers/pull/3716/).
## Usage example
```python
import torch
from diffusers import DDPMParallelScheduler
from diffusers import StableDiffusionParadigmsPipeline
scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
ngpu, batch_per_device = torch.cuda.device_count(), 5
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=[d for d in range(ngpu)])
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=1000).images[0]
```
<Tip>
This pipeline improves sampling speed by running denoising steps in parallel, at the cost of increased total FLOPs.
Therefore, it is better to call this pipeline when running on multiple GPUs. Otherwise, without enough GPU bandwidth
sampling may be even slower than sequential sampling.
The two parameters to play with are `parallel` (batch size) and `tolerance`.
- If it fits in memory, for 1000-step DDPM you can aim for a batch size of around 100
(e.g. 8 GPUs and batch_per_device=12 to get parallel=96). Higher batch size
- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100
(for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size
may not fit in memory, and lower batch size gives less parallelism.
- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation.
If there is quality degradation with the default tolerance, then use a lower tolerance (e.g. 0.001).
If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`.
For 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup by StableDiffusionParadigmsPipeline instead of StableDiffusionPipeline
by setting parallel=80 and tolerance=0.1.
</Tip>
For a 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from [`StableDiffusionParadigmsPipeline`] compared to the [`StableDiffusionPipeline`]
by setting `parallel=80` and `tolerance=0.1`.
🤗 Diffusers offers [distributed inference support](../training/distributed_inference) for generating multiple prompts
in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs.
<Tip>
Diffusers also offers distributed inference support for generating multiple prompts
in parallel on multiple GPUs. Check out the docs [here](https://huggingface.co/docs/diffusers/main/en/training/distributed_inference).
In contrast, this pipeline is designed for speeding up sampling of a single prompt (by using multiple GPUs).
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## StableDiffusionParadigmsPipeline
[[autodoc]] StableDiffusionParadigmsPipeline
- __call__
- all
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -10,59 +10,21 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# InstructPix2Pix: Learning to Follow Image Editing Instructions
# InstructPix2Pix
## Overview
[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/papers/2211.09800) is by Tim Brooks, Aleksander Holynski and Alexei A. Efros.
[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) by Tim Brooks, Aleksander Holynski and Alexei A. Efros.
The abstract of the paper is the following:
The abstract from the paper is:
*We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.*
Resources:
You can find additional information about InstructPix2Pix on the [project page](https://www.timothybrooks.com/instruct-pix2pix), [original codebase](https://github.com/timothybrooks/instruct-pix2pix), and try it out in a [demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix).
* [Project Page](https://www.timothybrooks.com/instruct-pix2pix).
* [Paper](https://arxiv.org/abs/2211.09800).
* [Original Code](https://github.com/timothybrooks/instruct-pix2pix).
* [Demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionInstructPix2PixPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py) | *Text-Based Image Editing* | [🤗 Space](https://huggingface.co/spaces/timbrooks/instruct-pix2pix) |
<!-- TODO: add Colab -->
## Usage example
```python
import PIL
import requests
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
model_id = "timbrooks/instruct-pix2pix"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
url = "https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png"
def download_image(url):
image = PIL.Image.open(requests.get(url, stream=True).raw)
image = PIL.ImageOps.exif_transpose(image)
image = image.convert("RGB")
return image
image = download_image(url)
prompt = "make the mountains snowy"
images = pipe(prompt, image=image, num_inference_steps=20, image_guidance_scale=1.5, guidance_scale=7).images
images[0].save("snowy_mountains.png")
```
</Tip>
## StableDiffusionInstructPix2PixPipeline
[[autodoc]] StableDiffusionInstructPix2PixPipeline
@@ -71,3 +33,14 @@ images[0].save("snowy_mountains.png")
- load_textual_inversion
- load_lora_weights
- save_lora_weights
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
## StableDiffusionXLInstructPix2PixPipeline
[[autodoc]] StableDiffusionXLInstructPix2PixPipeline
- __call__
- all
## StableDiffusionXLPipelineOutput
[[autodoc]] pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput

View File

@@ -10,22 +10,15 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Zero-shot Image-to-Image Translation
# Pix2Pix Zero
## Overview
[Zero-shot Image-to-Image Translation](https://huggingface.co/papers/2302.03027) is by Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu.
[Zero-shot Image-to-Image Translation](https://arxiv.org/abs/2302.03027).
The abstract of the paper is the following:
The abstract from the paper is:
*Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.*
Resources:
* [Project Page](https://pix2pixzero.github.io/).
* [Paper](https://arxiv.org/abs/2302.03027).
* [Original Code](https://github.com/pix2pixzero/pix2pix-zero).
* [Demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo).
You can find additional information about Pix2Pix Zero on the [project page](https://pix2pixzero.github.io/), [original codebase](https://github.com/pix2pixzero/pix2pix-zero), and try it out in a [demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo).
## Tips

View File

@@ -0,0 +1,35 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# PNDM
[Pseudo Numerical methods for Diffusion Models on manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.
The abstract from the paper is:
*Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules.*
The original codebase can be found at [luping-liu/PNDM](https://github.com/luping-liu/PNDM).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## PNDMPipeline
[[autodoc]] PNDMPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -1,35 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# PNDM
## Overview
[Pseudo Numerical methods for Diffusion Models on manifolds](https://arxiv.org/abs/2202.09778) (PNDM) by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao.
The abstract of the paper is the following:
Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules.
The original codebase can be found [here](https://github.com/luping-liu/PNDM).
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pndm/pipeline_pndm.py) | *Unconditional Image Generation* | - |
## PNDMPipeline
[[autodoc]] PNDMPipeline
- all
- __call__

View File

@@ -0,0 +1,37 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# RePaint
[RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) is by Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool.
The abstract from the paper is:
*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks.
RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions.*
The original codebase can be found at [andreas128/RePaint](https://github.com/andreas128/RePaint).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## RePaintPipeline
[[autodoc]] RePaintPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -1,77 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# RePaint
## Overview
[RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2201.09865) (PNDM) by Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool.
The abstract of the paper is the following:
Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks.
RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions.
The original codebase can be found [here](https://github.com/andreas128/RePaint).
## Available Pipelines:
| Pipeline | Tasks | Colab
|-------------------------------------------------------------------------------------------------------------------------------|--------------------|:---:|
| [pipeline_repaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/repaint/pipeline_repaint.py) | *Image Inpainting* | - |
## Usage example
```python
from io import BytesIO
import torch
import PIL
import requests
from diffusers import RePaintPipeline, RePaintScheduler
def download_image(url):
response = requests.get(url)
return PIL.Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
# Load the original image and the mask as PIL images
original_image = download_image(img_url).resize((256, 256))
mask_image = download_image(mask_url).resize((256, 256))
# Load the RePaint scheduler and pipeline based on a pretrained DDPM model
scheduler = RePaintScheduler.from_pretrained("google/ddpm-ema-celebahq-256")
pipe = RePaintPipeline.from_pretrained("google/ddpm-ema-celebahq-256", scheduler=scheduler)
pipe = pipe.to("cuda")
generator = torch.Generator(device="cuda").manual_seed(0)
output = pipe(
image=original_image,
mask_image=mask_image,
num_inference_steps=250,
eta=0.0,
jump_length=10,
jump_n_sample=10,
generator=generator,
)
inpainted_image = output.images[0]
```
## RePaintPipeline
[[autodoc]] RePaintPipeline
- all
- __call__

View File

@@ -0,0 +1,35 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Score SDE VE
[Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) (Score SDE) is by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole. This pipeline implements the variance expanding (VE) variant of the stochastic differential equation method.
The abstract from the paper is:
*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.*
The original codebase can be found at [yang-song/score_sde_pytorch](https://github.com/yang-song/score_sde_pytorch).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## ScoreSdeVePipeline
[[autodoc]] ScoreSdeVePipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

View File

@@ -1,36 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Score SDE VE
## Overview
[Score-Based Generative Modeling through Stochastic Differential Equations](https://arxiv.org/abs/2011.13456) (Score SDE) by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole.
The abstract of the paper is the following:
Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.
The original codebase can be found [here](https://github.com/yang-song/score_sde_pytorch).
This pipeline implements the Variance Expanding (VE) variant of the method.
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_score_sde_ve.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/score_sde_ve/pipeline_score_sde_ve.py) | *Unconditional Image Generation* | - |
## ScoreSdeVePipeline
[[autodoc]] ScoreSdeVePipeline
- all
- __call__

View File

@@ -10,56 +10,26 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Self-Attention Guidance (SAG)
# Self-Attention Guidance
## Overview
[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://huggingface.co/papers/2210.00939) is by Susung Hong et al.
[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) by Susung Hong et al.
The abstract of the paper is the following:
The abstract from the paper is:
*Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.*
Resources:
You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
* [Project Page](https://ku-cvlab.github.io/Self-Attention-Guidance).
* [Paper](https://arxiv.org/abs/2210.00939).
* [Original Code](https://github.com/KU-CVLAB/Self-Attention-Guidance).
* [Hugging Face Demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance).
* [Colab Demo](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionSAGPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py) | *Text-to-Image Generation* | [🤗 Space](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) |
## Usage example
```python
import torch
from diffusers import StableDiffusionSAGPipeline
from accelerate.utils import set_seed
pipe = StableDiffusionSAGPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
seed = 8978
prompt = "."
guidance_scale = 7.5
num_images_per_prompt = 1
sag_scale = 1.0
set_seed(seed)
images = pipe(
prompt, num_images_per_prompt=num_images_per_prompt, guidance_scale=guidance_scale, sag_scale=sag_scale
).images
images[0].save("example.png")
```
</Tip>
## StableDiffusionSAGPipeline
[[autodoc]] StableDiffusionSAGPipeline
- __call__
- all
## StableDiffusionOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -0,0 +1,35 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Semantic Guidance
Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Diffusion using Semantic Dimensions](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation.
Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition.
The abstract from the paper is:
*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.*
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## SemanticStableDiffusionPipeline
[[autodoc]] SemanticStableDiffusionPipeline
- all
- __call__
## StableDiffusionSafePipelineOutput
[[autodoc]] pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput
- all

View File

@@ -1,79 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Semantic Guidance
Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Diffusion using Semantic Dimensions](https://arxiv.org/abs/2301.12247) and provides strong semantic control over the image generation.
Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, and stay true to the original image composition.
The abstract of the paper is the following:
*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.*
*Overview*:
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [pipeline_semantic_stable_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/semantic-image-editing/blob/main/examples/SemanticGuidance.ipynb) | [Coming Soon](https://huggingface.co/AIML-TUDA)
## Tips
- The Semantic Guidance pipeline can be used with any [Stable Diffusion](./stable_diffusion/text2img) checkpoint.
### Run Semantic Guidance
The interface of [`SemanticStableDiffusionPipeline`] provides several additional parameters to influence the image generation.
Exemplary usage may look like this:
```python
import torch
from diffusers import SemanticStableDiffusionPipeline
pipe = SemanticStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
out = pipe(
prompt="a photo of the face of a woman",
num_images_per_prompt=1,
guidance_scale=7,
editing_prompt=[
"smiling, smile", # Concepts to apply
"glasses, wearing glasses",
"curls, wavy hair, curly hair",
"beard, full beard, mustache",
],
reverse_editing_direction=[False, False, False, False], # Direction of guidance i.e. increase all concepts
edit_warmup_steps=[10, 10, 10, 10], # Warmup period for each concept
edit_guidance_scale=[4, 5, 5, 5.4], # Guidance scale for each concept
edit_threshold=[
0.99,
0.975,
0.925,
0.96,
], # Threshold for each concept. Threshold equals the percentile of the latent space that will be discarded. I.e. threshold=0.99 uses 1% of the latent dimensions
edit_momentum_scale=0.3, # Momentum scale that will be added to the latent guidance
edit_mom_beta=0.6, # Momentum beta
edit_weights=[1, 1, 1, 1, 1], # Weights of the individual concepts against each other
)
```
For more examples check the Colab notebook.
## StableDiffusionSafePipelineOutput
[[autodoc]] pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput
- all
## SemanticStableDiffusionPipeline
[[autodoc]] SemanticStableDiffusionPipeline
- all
- __call__

View File

@@ -0,0 +1,37 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Shap-E
The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai).
The abstract from the paper is:
*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.*
The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e).
<Tip>
See the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## ShapEPipeline
[[autodoc]] ShapEPipeline
- all
- __call__
## ShapEImg2ImgPipeline
[[autodoc]] ShapEImg2ImgPipeline
- all
- __call__
## ShapEPipelineOutput
[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput

View File

@@ -1,139 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Shap-E
## Overview
The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://arxiv.org/abs/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai).
The abstract of the paper is the following:
*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.*
The original codebase can be found [here](https://github.com/openai/shap-e).
## Available Pipelines:
| Pipeline | Tasks |
|---|---|
| [pipeline_shap_e.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e.py) | *Text-to-Image Generation* |
| [pipeline_shap_e_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py) | *Image-to-Image Generation* |
## Available checkpoints
* [`openai/shap-e`](https://huggingface.co/openai/shap-e)
* [`openai/shap-e-img2img`](https://huggingface.co/openai/shap-e-img2img)
## Usage Examples
In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in gif format.
### Text-to-3D image generation
We can use [`ShapEPipeline`] to create 3D object based on a text prompt. In this example, we will make a birthday cupcake for :firecracker: diffusers library's 1 year birthday. The workflow to use the Shap-E text-to-image pipeline is same as how you would use other text-to-image pipelines in diffusers.
```python
import torch
from diffusers import DiffusionPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = ["A firecracker", "A birthday cupcake"]
images = pipe(
prompt,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
```
The output of [`ShapEPipeline`] is a list of lists of images frames. Each list of frames can be used to create a 3D object. Let's use the `export_to_gif` utility function in diffusers to make a 3D cupcake!
```python
from diffusers.utils import export_to_gif
export_to_gif(images[0], "firecracker_3d.gif")
export_to_gif(images[1], "cake_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif)
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif)
### Image-to-Image generation
You can use [`ShapEImg2ImgPipeline`] along with other text-to-image pipelines in diffusers and turn your 2D generation into 3D.
In this example, We will first genrate a cheeseburger with a simple prompt "A cheeseburger, white background"
```python
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
prompt = "A cheeseburger, white background"
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
image = t2i_pipe(
prompt,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
).images[0]
image.save("burger.png")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png)
we will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger :)
```python
from PIL import Image
from diffusers.utils import export_to_gif
repo = "openai/shap-e-img2img"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
guidance_scale = 3.0
image = Image.open("burger.png").resize((256, 256))
images = pipe(
image,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
gif_path = export_to_gif(images[0], "burger_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif)
## ShapEPipeline
[[autodoc]] ShapEPipeline
- all
- __call__
## ShapEImg2ImgPipeline
[[autodoc]] ShapEImg2ImgPipeline
- all
- __call__

View File

@@ -0,0 +1,37 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Spectrogram Diffusion
[Spectrogram Diffusion](https://huggingface.co/papers/2206.05408) is by Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, and Jesse Engel.
*An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.*
The original codebase can be found at [magenta/music-spectrogram-diffusion](https://github.com/magenta/music-spectrogram-diffusion).
![img](https://storage.googleapis.com/music-synthesis-with-spectrogram-diffusion/architecture.png)
As depicted above the model takes as input a MIDI file and tokenizes it into a sequence of 5 second intervals. Each tokenized interval then together with positional encodings is passed through the Note Encoder and its representation is concatenated with the previous window's generated spectrogram representation obtained via the Context Encoder. For the initial 5 second window this is set to zero. The resulting context is then used as conditioning to sample the denoised Spectrogram from the MIDI window and we concatenate this spectrogram to the final output as well as use it for the context of the next MIDI window. The process repeats till we have gone over all the MIDI inputs. Finally a MelGAN decoder converts the potentially long spectrogram to audio which is the final result of this pipeline.
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## SpectrogramDiffusionPipeline
[[autodoc]] SpectrogramDiffusionPipeline
- all
- __call__
## AudioPipelineOutput
[[autodoc]] pipelines.AudioPipelineOutput

View File

@@ -1,54 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Multi-instrument Music Synthesis with Spectrogram Diffusion
## Overview
[Spectrogram Diffusion](https://arxiv.org/abs/2206.05408) by Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, and Jesse Engel.
An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
The original codebase of this implementation can be found at [magenta/music-spectrogram-diffusion](https://github.com/magenta/music-spectrogram-diffusion).
## Model
![img](https://storage.googleapis.com/music-synthesis-with-spectrogram-diffusion/architecture.png)
As depicted above the model takes as input a MIDI file and tokenizes it into a sequence of 5 second intervals. Each tokenized interval then together with positional encodings is passed through the Note Encoder and its representation is concatenated with the previous window's generated spectrogram representation obtained via the Context Encoder. For the initial 5 second window this is set to zero. The resulting context is then used as conditioning to sample the denoised Spectrogram from the MIDI window and we concatenate this spectrogram to the final output as well as use it for the context of the next MIDI window. The process repeats till we have gone over all the MIDI inputs. Finally a MelGAN decoder converts the potentially long spectrogram to audio which is the final result of this pipeline.
## Available Pipelines:
| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_spectrogram_diffusion.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/spectrogram_diffusion/pipeline_spectrogram_diffusion.py) | *Unconditional Audio Generation* | - |
## Example usage
```python
from diffusers import SpectrogramDiffusionPipeline, MidiProcessor
pipe = SpectrogramDiffusionPipeline.from_pretrained("google/music-spectrogram-diffusion")
pipe = pipe.to("cuda")
processor = MidiProcessor()
# Download MIDI from: wget http://www.piano-midi.de/midis/beethoven/beethoven_hammerklavier_2.mid
output = pipe(processor("beethoven_hammerklavier_2.mid"))
audio = output.audios[0]
```
## SpectrogramDiffusionPipeline
[[autodoc]] SpectrogramDiffusionPipeline
- all
- __call__

View File

@@ -0,0 +1,258 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Text-to-Image Generation with Adapter Conditioning
## Overview
[T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.08453) by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.
Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.
The abstract of the paper is the following:
*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate structure control is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and small T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, and achieve rich control and editing effects. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.*
This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ .
## Available Pipelines:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning* | -
| [StableDiffusionXLAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_xl_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning on StableDiffusion-XL* | -
## Usage example with the base model of StableDiffusion-1.4/1.5
In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
All adapters use the same pipeline.
1. Images are first converted into the appropriate *control image* format.
2. The *control image* and *prompt* are passed to the [`StableDiffusionAdapterPipeline`].
Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1).
```python
from diffusers.utils import load_image
image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png)
Then we can create our color palette by simply resizing it to 8 by 8 pixels and then scaling it back to original size.
```python
from PIL import Image
color_palette = image.resize((8, 8))
color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)
```
Let's take a look at the processed image.
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_palette.png)
Next, create the adapter pipeline
```py
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
adapter=adapter,
torch_dtype=torch.float16,
)
pipe.to("cuda")
```
Finally, pass the prompt and control image to the pipeline
```py
# fix the random seed, so you will get the same result as the example
generator = torch.manual_seed(7)
out_image = pipe(
"At night, glowing cubes in front of the beach",
image=color_palette,
generator=generator,
).images[0]
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png)
## Usage example with the base model of StableDiffusion-XL
In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
All adapters use the same pipeline.
1. Images are first downloaded into the appropriate *control image* format.
2. The *control image* and *prompt* are passed to the [`StableDiffusionXLAdapterPipeline`].
Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
```python
from diffusers.utils import load_image
sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
```
![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png)
Then, create the adapter pipeline
```py
import torch
from diffusers import (
T2IAdapter,
StableDiffusionXLAdapterPipeline,
DDPMScheduler
)
from diffusers.models.unet_2d_condition import UNet2DConditionModel
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0",torch_dtype=torch.float16, adapter_type="full_adapter_xl")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
model_id, adapter=adapter, safety_checker=None, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
)
pipe.to("cuda")
```
Finally, pass the prompt and control image to the pipeline
```py
# fix the random seed, so you will get the same result as the example
generator = torch.Generator().manual_seed(42)
sketch_image_out = pipe(
prompt="a photo of a dog in real world, high quality",
negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
image=sketch_image,
generator=generator,
guidance_scale=7.5
).images[0]
```
![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png)
## Available checkpoints
Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://huggingface.co/TencentARC/T2I-Adapter/tree/main/models).
### T2I-Adapter with Stable Diffusion 1.4
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)<br/> *Trained with spatial color palette* | A image with 8x8 color palette.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"/></a>|
|[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"/></a>|
|[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"/></a>|
|[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"/></a>|
|[TencentARC/t2iadapter_openpose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_openpose_sd14v1)<br/> *Trained with OpenPose bone image* | A [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_output.png"/></a>|
|[TencentARC/t2iadapter_keypose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_keypose_sd14v1)<br/> *Trained with mmpose skeleton image* | A [mmpose skeleton](https://github.com/open-mmlab/mmpose) image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_output.png"/></a>|
|[TencentARC/t2iadapter_seg_sd14v1](https://huggingface.co/TencentARC/t2iadapter_seg_sd14v1)<br/>*Trained with semantic segmentation* | An [custom](https://github.com/TencentARC/T2I-Adapter/discussions/25) segmentation protocol image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_output.png"/></a> |
|[TencentARC/t2iadapter_canny_sd15v2](https://huggingface.co/TencentARC/t2iadapter_canny_sd15v2)||
|[TencentARC/t2iadapter_depth_sd15v2](https://huggingface.co/TencentARC/t2iadapter_depth_sd15v2)||
|[TencentARC/t2iadapter_sketch_sd15v2](https://huggingface.co/TencentARC/t2iadapter_sketch_sd15v2)||
|[TencentARC/t2iadapter_zoedepth_sd15v1](https://huggingface.co/TencentARC/t2iadapter_zoedepth_sd15v1)||
|[Adapter/t2iadapter, subfolder='sketch_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0)||
|[Adapter/t2iadapter, subfolder='canny_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/canny_sdxl_1.0)||
|[Adapter/t2iadapter, subfolder='openpose_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/openpose_sdxl_1.0)||
## Combining multiple adapters
[`MultiAdapter`] can be used for applying multiple conditionings at once.
Here we use the keypose adapter for the character posture and the depth adapter for creating the scene.
```py
import torch
from PIL import Image
from diffusers.utils import load_image
cond_keypose = load_image(
"https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
)
cond_depth = load_image(
"https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
)
cond = [[cond_keypose, cond_depth]]
prompt = ["A man walking in an office room with a nice view"]
```
The two control images look as such:
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png)
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png)
`MultiAdapter` combines keypose and depth adapters.
`adapter_conditioning_scale` balances the relative influence of the different adapters.
```py
from diffusers import StableDiffusionAdapterPipeline, MultiAdapter
adapters = MultiAdapter(
[
T2IAdapter.from_pretrained("TencentARC/t2iadapter_keypose_sd14v1"),
T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd14v1"),
]
)
adapters = adapters.to(torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
torch_dtype=torch.float16,
adapter=adapters,
)
images = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8])
```
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_depth_sample_output.png)
## T2I Adapter vs ControlNet
T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet).
T2i-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process.
However, T2I-Adapter performs slightly worse than ControlNet.
## StableDiffusionAdapterPipeline
[[autodoc]] StableDiffusionAdapterPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
## StableDiffusionXLAdapterPipeline
[[autodoc]] StableDiffusionXLAdapterPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention

View File

@@ -10,20 +10,20 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Depth-to-Image Generation
# Depth-to-image
The Stable Diffusion model can also infer depth based on an image using [MiDas](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure.
<Tip>
Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
</Tip>
## StableDiffusionDepth2ImgPipeline
The depth-guided stable diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/), as part of Stable Diffusion 2.0. It uses [MiDas](https://github.com/isl-org/MiDaS) to infer depth based on an image.
[`StableDiffusionDepth2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the images structure.
The original codebase can be found here:
- *Stable Diffusion v2*: [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion)
Available Checkpoints are:
- *stable-diffusion-2-depth*: [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth)
[[autodoc]] StableDiffusionDepth2ImgPipeline
- all
- __call__
@@ -34,3 +34,7 @@ Available Checkpoints are:
- load_textual_inversion
- load_lora_weights
- save_lora_weights
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -0,0 +1,59 @@
<!--Copyright 2023 The GLIGEN Authors and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# GLIGEN (Grounded Language-to-Image Generation)
The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] and [`StableDiffusionGLIGENTextImagePipeline`] can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes with [`StableDiffusionGLIGENPipeline`], if input images are given, [`StableDiffusionGLIGENTextImagePipeline`] can insert objects described by text at the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It's trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
The abstract from the [paper](https://huggingface.co/papers/2301.07093) is:
*Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGENs zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.*
<Tip>
Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently!
If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organizations!
</Tip>
[`StableDiffusionGLIGENPipeline`] was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful) and [`StableDiffusionGLIGENTextImagePipeline`] was contributed by [Nguyễn Công Tú Anh](https://github.com/tuanh123789).
## StableDiffusionGLIGENPipeline
[[autodoc]] StableDiffusionGLIGENPipeline
- all
- __call__
- enable_vae_slicing
- disable_vae_slicing
- enable_vae_tiling
- disable_vae_tiling
- enable_model_cpu_offload
- prepare_latents
- enable_fuser
## StableDiffusionGLIGENTextImagePipeline
[[autodoc]] StableDiffusionGLIGENTextImagePipeline
- all
- __call__
- enable_vae_slicing
- disable_vae_slicing
- enable_vae_tiling
- disable_vae_tiling
- enable_model_cpu_offload
- prepare_latents
- enable_fuser
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -0,0 +1,37 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Image variation
The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model by [Justin Pinkney](https://www.justinpinkney.com/) from [Lambda](https://lambdalabs.com/).
The original codebase can be found at [LambdaLabsML/lambda-diffusers](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers).
<Tip>
Make sure to check out the Stable Diffusion [Tips](./overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
</Tip>
## StableDiffusionImageVariationPipeline
[[autodoc]] StableDiffusionImageVariationPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -1,31 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Image Variation
## StableDiffusionImageVariationPipeline
[`StableDiffusionImageVariationPipeline`] lets you generate variations from an input image using Stable Diffusion. It uses a fine-tuned version of Stable Diffusion model, trained by [Justin Pinkney](https://www.justinpinkney.com/) (@Buntworthy) at [Lambda](https://lambdalabs.com/).
The original codebase can be found here:
[Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations)
Available Checkpoints are:
- *sd-image-variations-diffusers*: [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers)
[[autodoc]] StableDiffusionImageVariationPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention

View File

@@ -0,0 +1,55 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Image-to-image
The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images.
The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon.
The abstract from the paper is:
*Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.*
<Tip>
Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
</Tip>
## StableDiffusionImg2ImgPipeline
[[autodoc]] StableDiffusionImg2ImgPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- from_single_file
- load_lora_weights
- save_lora_weights
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
## FlaxStableDiffusionImg2ImgPipeline
[[autodoc]] FlaxStableDiffusionImg2ImgPipeline
- all
- __call__
## FlaxStableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput

View File

@@ -1,40 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Image-to-Image Generation
## StableDiffusionImg2ImgPipeline
The Stable Diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionImg2ImgPipeline`] lets you pass a text prompt and an initial image to condition the generation of new images using Stable Diffusion.
The original codebase can be found here: [CampVis/stable-diffusion](https://github.com/CompVis/stable-diffusion/blob/main/scripts/img2img.py)
[`StableDiffusionImg2ImgPipeline`] is compatible with all Stable Diffusion checkpoints for [Text-to-Image](./text2img)
The pipeline uses the diffusion-denoising mechanism proposed by SDEdit ([SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://arxiv.org/abs/2108.01073)
proposed by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon).
[[autodoc]] StableDiffusionImg2ImgPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- from_single_file
- load_lora_weights
- save_lora_weights
[[autodoc]] FlaxStableDiffusionImg2ImgPipeline
- all
- __call__

View File

@@ -0,0 +1,57 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Inpainting
The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion.
## Tips
It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such
as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). Default
text-to-image Stable Diffusion checkpoints, such as
[runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) are also compatible but they might be less performant.
<Tip>
Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
</Tip>
## StableDiffusionInpaintPipeline
[[autodoc]] StableDiffusionInpaintPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- load_lora_weights
- save_lora_weights
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
## FlaxStableDiffusionInpaintPipeline
[[autodoc]] FlaxStableDiffusionInpaintPipeline
- all
- __call__
## FlaxStableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput

View File

@@ -1,40 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Text-Guided Image Inpainting
## StableDiffusionInpaintPipeline
The Stable Diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionInpaintPipeline`] lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion.
The original codebase can be found here:
- *Stable Diffusion V1*: [CampVis/stable-diffusion](https://github.com/runwayml/stable-diffusion#inpainting-with-stable-diffusion)
- *Stable Diffusion V2*: [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion#image-inpainting-with-stable-diffusion)
Available checkpoints are:
- *stable-diffusion-inpainting (512x512 resolution)*: [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting)
- *stable-diffusion-2-inpainting (512x512 resolution)*: [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting)
[[autodoc]] StableDiffusionInpaintPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
- load_textual_inversion
- load_lora_weights
- save_lora_weights
[[autodoc]] FlaxStableDiffusionInpaintPipeline
- all
- __call__

View File

@@ -0,0 +1,38 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Latent upscaler
The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It is used to enhance the output image resolution by a factor of 2 (see this demo [notebook](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) for a demonstration of the original implementation).
<Tip>
Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
</Tip>
## StableDiffusionLatentUpscalePipeline
[[autodoc]] StableDiffusionLatentUpscalePipeline
- all
- __call__
- enable_sequential_cpu_offload
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -1,33 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Diffusion Latent Upscaler
## StableDiffusionLatentUpscalePipeline
The Stable Diffusion Latent Upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It can be used on top of any [`StableDiffusionUpscalePipeline`] checkpoint to enhance its output image resolution by a factor of 2.
A notebook that demonstrates the original implementation can be found here:
- [Stable Diffusion Upscaler Demo](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4)
Available Checkpoints are:
- *stabilityai/latent-upscaler*: [stabilityai/sd-x2-latent-upscaler](https://huggingface.co/stabilityai/sd-x2-latent-upscaler)
[[autodoc]] StableDiffusionLatentUpscalePipeline
- all
- __call__
- enable_sequential_cpu_offload
- enable_attention_slicing
- disable_attention_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention

View File

@@ -10,46 +10,28 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# LDM3D
# Text-to-(RGB, depth)
LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal
The abstract of the paper is the following:
LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./stable_diffusion/overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps.
The abstract from the paper is:
*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).*
<Tip>
*Overview*:
Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [pipeline_stable_diffusion_ldm3d.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py) | *Text-to-Image Generation* | - | -
## Tips
- LDM3D generates both an image and a depth map from a given text prompt, compared to the existing txt-to-img diffusion models such as [Stable Diffusion](./stable_diffusion/overview) that generates only an image.
- With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps.
Running LDM3D is straighforward with the [`StableDiffusionLDM3DPipeline`]:
```python
>>> from diffusers import StableDiffusionLDM3DPipeline
>>> pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d")
prompt ="A picture of some lemons on a table"
output = pipe(prompt)
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save("lemons_ldm3d_rgb.jpg")
depth_image[0].save("lemons_ldm3d_depth.png")
```
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
- all
- __call__
</Tip>
## StableDiffusionLDM3DPipeline
[[autodoc]] StableDiffusionLDM3DPipeline
- all
- __call__
## LDM3DPipelineOutput
[[autodoc]] pipelines.stable_diffusion.pipeline_stable_diffusion_ldm3d.LDM3DPipelineOutput
- all
- __call__

View File

@@ -0,0 +1,180 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Diffusion pipelines
Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs.
For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and our own [blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details.
You can find the original codebase for Stable Diffusion v1.0 at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) and Stable Diffusion v2.0 at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) as well as their original scripts for various tasks. Additional official checkpoints for the different Stable Diffusion versions and tasks can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations. Explore these organizations to find the best checkpoint for your use-case!
The table below summarizes the available Stable Diffusion pipelines, their supported tasks, and an interactive demo:
<div class="flex justify-center">
<div class="rounded-xl border border-gray-200">
<table class="min-w-full divide-y-2 divide-gray-200 bg-white text-sm">
<thead>
<tr>
<th class="px-4 py-2 font-medium text-gray-900 text-left">
Pipeline
</th>
<th class="px-4 py-2 font-medium text-gray-900 text-left">
Supported tasks
</th>
<th class="px-4 py-2 font-medium text-gray-900 text-left">
Space
</th>
</tr>
</thead>
<tbody class="divide-y divide-gray-200">
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./text2img">StableDiffusion</a>
</td>
<td class="px-4 py-2 text-gray-700">text-to-image</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/stabilityai/stable-diffusion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./img2img">StableDiffusionImg2Img</a>
</td>
<td class="px-4 py-2 text-gray-700">image-to-image</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/huggingface/diffuse-the-rest"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./inpaint">StableDiffusionInpaint</a>
</td>
<td class="px-4 py-2 text-gray-700">inpainting</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/runwayml/stable-diffusion-inpainting"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./depth2img">StableDiffusionDepth2Img</a>
</td>
<td class="px-4 py-2 text-gray-700">depth-to-image</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/radames/stable-diffusion-depth2img"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./image_variation">StableDiffusionImageVariation</a>
</td>
<td class="px-4 py-2 text-gray-700">image variation</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./stable_diffusion_safe">StableDiffusionPipelineSafe</a>
</td>
<td class="px-4 py-2 text-gray-700">filtered text-to-image</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/AIML-TUDA/unsafe-vs-safe-stable-diffusion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./stable_diffusion_2">StableDiffusion2</a>
</td>
<td class="px-4 py-2 text-gray-700">text-to-image, inpainting, depth-to-image, super-resolution</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/stabilityai/stable-diffusion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./stable_diffusion_xl">StableDiffusionXL</a>
</td>
<td class="px-4 py-2 text-gray-700">text-to-image, image-to-image</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/RamAnanth1/stable-diffusion-xl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./latent_upscale">StableDiffusionLatentUpscale</a>
</td>
<td class="px-4 py-2 text-gray-700">super-resolution</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/huggingface-projects/stable-diffusion-latent-upscaler"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./upscale">StableDiffusionUpscale</a>
</td>
<td class="px-4 py-2 text-gray-700">super-resolution</td>
</tr>
<tr>
<td class="px-4 py-2 text-gray-700">
<a href="./ldm3d_diffusion">StableDiffusionLDM3D</a>
</td>
<td class="px-4 py-2 text-gray-700">text-to-rgb, text-to-depth</td>
<td class="px-4 py-2"><a href="https://huggingface.co/spaces/r23/ldm3d-space"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"/></a>
</td>
</tr>
</tbody>
</table>
</div>
</div>
## Tips
To help you get the most out of the Stable Diffusion pipelines, here are a few tips for improving performance and usability. These tips are applicable to all Stable Diffusion pipelines.
### Explore tradeoff between speed and quality
[`StableDiffusionPipeline`] uses the [`PNDMScheduler`] by default, but 🤗 Diffusers provides many other schedulers (some of which are faster or output better quality) that are compatible. For example, if you want to use the [`EulerDiscreteScheduler`] instead of the default:
```py
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
# or
euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler)
```
### Reuse pipeline components to save memory
To save memory and use the same components across multiple pipelines, use the `.components` method to avoid loading weights into RAM more than once.
```py
from diffusers import (
StableDiffusionPipeline,
StableDiffusionImg2ImgPipeline,
StableDiffusionInpaintPipeline,
)
text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
img2img = StableDiffusionImg2ImgPipeline(**text2img.components)
inpaint = StableDiffusionInpaintPipeline(**text2img.components)
# now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline
```

View File

@@ -1,81 +0,0 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable diffusion pipelines
Stable Diffusion is a text-to-image _latent diffusion_ model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). It's trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs.
Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. You can learn more details about it in the [specific pipeline for latent diffusion](pipelines/latent_diffusion) that is part of 🤗 Diffusers.
For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, please refer to the official [launch announcement post](https://stability.ai/blog/stable-diffusion-announcement) and [this section of our own blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work).
*Tips*:
- To tweak your prompts on a specific result you liked, you can generate your own latents, as demonstrated in the following notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pcuenca/diffusers-examples/blob/main/notebooks/stable-diffusion-seeds.ipynb)
*Overview*:
| Pipeline | Tasks | Colab | Demo
|---|---|:---:|:---:|
| [StableDiffusionPipeline](./text2img) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb) | [🤗 Stable Diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion)
| [StableDiffusionPipelineSafe](./stable_diffusion_safe) | *Text-to-Image Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ml-research/safe-latent-diffusion/blob/main/examples/Safe%20Latent%20Diffusion.ipynb) | [![Huggingface Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/AIML-TUDA/unsafe-vs-safe-stable-diffusion)
| [StableDiffusionImg2ImgPipeline](./img2img) | *Image-to-Image Text-Guided Generation* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/image_2_image_using_diffusers.ipynb) | [🤗 Diffuse the Rest](https://huggingface.co/spaces/huggingface/diffuse-the-rest)
| [StableDiffusionInpaintPipeline](./inpaint) | **Experimental** *Text-Guided Image Inpainting* | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb) |
| [StableDiffusionDepth2ImgPipeline](./depth2img) | **Experimental** *Depth-to-Image Text-Guided Generation* | |
| [StableDiffusionImageVariationPipeline](./image_variation) | **Experimental** *Image Variation Generation* | | [🤗 Stable Diffusion Image Variations](https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations)
| [StableDiffusionUpscalePipeline](./upscale) | **Experimental** *Text-Guided Image Super-Resolution* | |
| [StableDiffusionLatentUpscalePipeline](./latent_upscale) | **Experimental** *Text-Guided Image Super-Resolution* | |
| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Image Inpainting* |
| [Stable Diffusion 2](./stable_diffusion_2) | *Depth-to-Image Text-Guided Generation* |
| [Stable Diffusion 2](./stable_diffusion_2) | *Text-Guided Super Resolution Image-to-Image* |
| [StableDiffusionLDM3DPipeline](./ldm3d) | *Text-to-(RGB, Depth)* |
## Tips
### How to load and use different schedulers.
The stable diffusion pipeline uses [`PNDMScheduler`] scheduler by default. But `diffusers` provides many other schedulers that can be used with the stable diffusion pipeline such as [`DDIMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`] etc.
To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following:
```python
>>> from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
>>> pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
>>> # or
>>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
>>> pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler)
```
### How to convert all use cases with multiple or single pipeline
If you want to use all possible use cases in a single `DiffusionPipeline` you can either:
- Make use of the [Stable Diffusion Mega Pipeline](https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-mega) or
- Make use of the `components` functionality to instantiate all components in the most memory-efficient way:
```python
>>> from diffusers import (
... StableDiffusionPipeline,
... StableDiffusionImg2ImgPipeline,
... StableDiffusionInpaintPipeline,
... )
>>> text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
>>> img2img = StableDiffusionImg2ImgPipeline(**text2img.components)
>>> inpaint = StableDiffusionInpaintPipeline(**text2img.components)
>>> # now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline
```
## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

View File

@@ -0,0 +1,139 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Diffusion 2
Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/).
*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels.
These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAIONs NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).*
For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release).
The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out it's API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest scheduler.
Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image:
| Task | Repository |
|-------------------------|---------------------------------------------------------------------------------------------------------------|
| text-to-image (512x512) | [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) |
| text-to-image (768x768) | [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) |
| inpainting | [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) |
| super-resolution | [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) |
| depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) |
Here are some examples for how to use Stable Diffusion 2 for each task:
<Tip>
Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!
If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
</Tip>
## Text-to-image
```py
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch
repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("astronaut.png")
```
## Inpainting
```py
import PIL
import requests
import torch
from io import BytesIO
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
def download_image(url):
response = requests.get(url)
return PIL.Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
image.save("yellow_cat.png")
```
## Super-resolution
```py
import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionUpscalePipeline
import torch
# load model and scheduler
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")
# let's download an image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))
prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
upscaled_image.save("upsampled_cat.png")
```
## Depth-to-image
```py
import torch
import requests
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-depth",
torch_dtype=torch.float16,
).to("cuda")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)
prompt = "two tigers"
n_propmt = "bad, deformed, ugly, bad anotomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=n_propmt, strength=0.7).images[0]
```

Some files were not shown because too many files have changed in this diff Show More