* Update pipeline_stable_diffusion_instruct_pix2pix.py
Add `cross_attention_kwargs` to `__call__` method of `StableDiffusionInstructPix2PixPipeline`, which are passed to UNet.
* Update documentation for pipeline_stable_diffusion_instruct_pix2pix.py
* Update docstring
* Update docstring
* Fix typing import
* make _callback_tensor_inputs consistent between sdxl pipelines
* forgot this one
* fix failing test
* fix test_components_function
* fix controlnet inpaint tests
* Merged isinstance calls to make the code simpler.
* Corrected formatting errors using ruff.
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Fix `added_cond_kwargs` when using IP-Adapter
Fix error when using IP-Adapter in pipeline and passing `ip_adapter_image_embeds` instead of `ip_adapter_image`
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Expand `diffusers-cli env`
* SafeTensors -> Safetensors
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Move `safetensors_version = "not installed"` to `else`
* Update `safetensors_version` checking
* Add GPU detection for Linux, Mac OS, and Windows
* Add accelerator detection to environment command
* Add is_peft_version to import_utils
* Update env.py
* Add `huggingface_hub` reference
* Add `transformers` reference
* Add reference for `huggingface_hub`
* Fix print statement in env.py for unusual OS
* Up
* Fix platform information in env.py
* up
* Fix import order in env.py
* ruff
* make style
* Fix platform system check in env.py
* Fix run method return type in env.py
* 🤗
* No need f-string
* Remove location info
* Remove accelerate config
* Refactor env.py to remove accelerate config
* feat: Add support for `bitsandbytes` library in environment command
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
* fixed vae loading issue #7619
* rerun make style && make quality
* bring back model_has_vae and add change \ to / in config_file_name on windows os to make match work
* add missing import platform
* bring back import model_info
* make config_file_name OS independent
* switch to using Path.as_posix() to resolve OS dependence
* improve style
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Update requirements.txt
If the datasets library is old, it will not read the metadata.jsonl and the label will default to an integer of type int.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Fixed a wrong link to python versions in contributing.md file.
* Updated the link to a permalink, so that it will permanently point to the specific line.
* find & replace all FloatTensors to Tensor
* apply formatting
* Update torch.FloatTensor to torch.Tensor in the remaining files
* formatting
* Fix the rest of the places where FloatTensor is used as well as in documentation
* formatting
* Update new file from FloatTensor to Tensor
* Remove dead code
* PylancereportGeneralTypeIssues: Strings nested within an f-string cannot use the same quote character as the f-string prior to Python 3.12.
* Remove dead code
SDXL LoRA weights for text encoders should be decoupled on save
The method checks if at least one of unet, text_encoder and
text_encoder_2 lora weights are passed, which was not reflected in the
implentation.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
`model_output.shape` may only have rank 1.
There are warnings related to use of random keys.
```
tests/schedulers/test_scheduler_flax.py: 13 warnings
/Users/phillypham/diffusers/src/diffusers/schedulers/scheduling_ddpm_flax.py:268: FutureWarning: normal accepts a single key, but was given a key array of shape (1, 2) != (). Use jax.vmap for batching. In a future JAX version, this will be an error.
noise = jax.random.normal(split_key, shape=model_output.shape, dtype=self.dtype)
tests/schedulers/test_scheduler_flax.py::FlaxDDPMSchedulerTest::test_betas
/Users/phillypham/virtualenv/diffusers/lib/python3.9/site-packages/jax/_src/random.py:731: FutureWarning: uniform accepts a single key, but was given a key array of shape (1,) != (). Use jax.vmap for batching. In a future JAX version, this will be an error.
u = uniform(key, shape, dtype, lo, hi) # type: ignore[arg-type]
```
* 7879 - adjust documentation to use naruto dataset, since pokemon is now gated
* replace references to pokemon in docs
* more references to pokemon replaced
* Japanese translation update
---------
Co-authored-by: bghira <bghira@users.github.com>
* Add Ascend NPU support for SDXL fine-tuning and fix the model saving bug when using DeepSpeed.
* fix check code quality
* Decouple the NPU flash attention and make it an independent module.
* add doc and unit tests for npu flash attention.
---------
Co-authored-by: mhh001 <mahonghao1@huawei.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* chore: reducing model sizes
* chore: shrinks further
* chore: shrinks further
* chore: shrinking model for img2img pipeline
* chore: reducing size of model for inpaint pipeline
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
FlaxStableDiffusionSafetyChecker sets main_input_name to "clip_input".
This makes StableDiffusionSafetyChecker consistent.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Added get_velocity function to EulerDiscreteScheduler.
* Fix white space on blank lines
* Added copied from statement
* back to the original.
---------
Co-authored-by: Ruining Li <ruining@robots.ox.ac.uk>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
swap the order for do_classifier_free_guidance concat with repeat
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
* Check for latents, before calling prepare_latents - sdxlImg2Img
* Added latents check for all the img2img pipeline
* Fixed silly mistake while checking latents as None
A new function compute_dream_and_update_latents has been added to the
training utilities that allows you to do DREAM rectified training in line
with the paper https://arxiv.org/abs/2312.00210. The method can be used
with an extra argument in the train_text_to_image.py script.
Co-authored-by: Jimmy <39@🇺🇸.com>
* Convert channel order to BGR for the watermark encoder. Convert the watermarked BGR images back to RGB. Fixes#6292
* Revert channel order before stacking images to overcome limitations that negative strides are currently not supported
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Fixed wrong decorator by modifying it to @classmethod.
* Updated the method and it's argument.
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add scheduled pseudo-huber loss training scripts
See #7488
* add reduction modes to huber loss
* [DB Lora] *2 multiplier to huber loss cause of 1/2 a^2 conv.
pairing of c6495def1f
* [DB Lora] add option for smooth l1 (huber / delta)
Pairing of dd22958caa
* [DB Lora] unify huber scheduling
Pairing of 19a834c3ab
* [DB Lora] add snr huber scheduler
Pairing of 47fb1a6854
* fixup examples link
* use snr schedule by default in DB
* update all huber scripts with snr
* code quality
* huber: make style && make quality
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Initialize target_unet from unet rather than teacher_unet so that we correctly add time_embedding.cond_proj if necessary.
* Use UNet2DConditionModel.from_config to initialize target_unet from unet's config.
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* give it a shot.
* print.
* correct assertion.
* gather results from the rest of the tests.
* change the assertion values where needed.
* remove print statements.
* get device <-> component mapping when using multiple gpus.
* condition the device_map bits.
* relax condition
* device_map progress.
* device_map enhancement
* some cleaning up and debugging
* Apply suggestions from code review
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* incorporate suggestions from PR.
* remove multi-gpu condition for now.
* guard check the component -> device mapping
* fix: device_memory variable
* dispatching transformers model to have force_hooks=True
* better guarding for transformers device_map
* introduce support balanced_low_memory and balanced_ultra_low_memory.
* remove device_map patch.
* fix: intermediate variable scoping.
* fix: condition in cpu offload.
* fix: flax class restrictions.
* remove modifications from cpu_offload and model_offload
* incorporate changes.
* add a simple forward pass test
* add: torch_device in get_inputs()
* add: tests
* remove print
* safe-guard to(), model offloading and cpu offloading when balanced is used as a device_map.
* style
* remove .
* safeguard device_map with more checks and remove invalid device_mapping strategues.
* make a class attribute and adjust tests accordingly.
* fix device_map check
* fix test
* adjust comment
* fix: device_map attribute
* fix: dispatching.
* max_memory test for pipeline
* version guard the tests
* fix guard.
* address review feedback.
* reset_device_map method.
* add: test for reset_hf_device_map
* fix a couple things.
* add reset_device_map() in the error message.
* add tests for checking reset_device_map doesn't have unintended consequences.
* fix reset_device_map and offloading tests.
* create _get_final_device_map utility.
* hf_device_map -> _hf_device_map
* add documentation
* add notes suggested by Marc.
* styling.
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* move updates within gpu condition.
* other docs related things
* note on ignore a device not specified in .
* provide a suggestion if device mapping errors out.
* fix: typo.
* _hf_device_map -> hf_device_map
* Empty-Commit
* add: example hf_device_map.
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* remove libsndfile1-dev and libgl1 from workflows and ensure that re present in the respective dockerfiles.
* change to self-hosted runner; let's see 🤞
* add libsndfile1-dev libgl1 for now
* use self-hosted runners for building and push too.
* Restore unet params back to normal from EMA when validation call is finished
* empty commit
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Allow safety and feature extractor arguments to be passed to convert_from_ckpt
Allows management of safety checker and feature extractor
from outside of the convert ckpt class.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* reduce block sizes for unet1d.
* reduce blocks for unet_2d.
* reduce block size for unet_motion
* increase channels.
* correctly increase channels.
* reduce number of layers in unet2dconditionmodel tests.
* reduce block sizes for unet2dconditionmodel tests
* reduce block sizes for unet3dconditionmodel.
* fix: test_feed_forward_chunking
* fix: test_forward_with_norm_groups
* skip spatiotemporal tests on MPS.
* reduce block size in AutoencoderKL.
* reduce block sizes for vqmodel.
* further reduce block size.
* make style.
* Empty-Commit
* reduce sizes for ConsistencyDecoderVAETests
* further reduction.
* further block reductions in AutoencoderKL and AssymetricAutoencoderKL.
* massively reduce the block size in unet2dcontionmodel.
* reduce sizes for unet3d
* fix tests in unet3d.
* reduce blocks further in motion unet.
* fix: output shape
* add attention_head_dim to the test configuration.
* remove unexpected keyword arg
* up a bit.
* groups.
* up again
* fix
* Skip `test_freeu_enabled ` on MPS
* Small fixes
- import skip_mps correctly
- disable all instances of test_freeu_enabled
* Empty commit to trigger tests
* Empty commit to trigger CI
* increase number of workers for the tests.
* move to beefier runner.
* improve the fast push tests too.
* use a beefy machine for pytorch pipeline tests
* up the number of workers further.
* UniPC Multistep add `rescale_betas_zero_snr`
Same patch as DPM and Euler with the patched final alpha cumprod
BF16 doesn't seem to break down, I think cause UniPC upcasts during some
phases already? We could still force an upcast since it only
loses ≈ 0.005 it/s for me but the difference in output is very small. A
better endeavor might upcasting in step() and removing all the other
upcasts elsewhere?
* UniPC ZSNR UT
* Re-add `rescale_betas_zsnr` doc oops
* UniPC UTs iterate solvers on FP16
It wasn't catching errs on order==3. Might be excessive?
* UniPC Multistep fix tensor dtype/device on order=3
* UniPC UTs Add v_pred to fp16 test iter
For completions sake. Probably overkill?
* 7529 do not disable autocast for cuda devices
* Remove typecasting error check for non-mps platforms, as a correct autocast implementation makes it a non-issue
* add autocast fix to other training examples
* disable native_amp for dreambooth (sdxl)
* disable native_amp for pix2pix (sdxl)
* remove tests from remaining files
* disable native_amp on huggingface accelerator for every training example that uses it
* convert more usages of autocast to nullcontext, make style fixes
* make style fixes
* style.
* Empty-Commit
---------
Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* start printing the tensors.
* print full throttle
* set static slices for 7 tests.
* remove printing.
* flatten
* disable test for controlnet
* what happens when things are seeded properly?
* set the right value
* style./
* make pia test fail to check things
* print.
* fix pia.
* checking for animatediff.
* fix: animatediff.
* video synthesis
* final piece.
* style.
* print guess.
* fix: assertion for control guess.
---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
* Add `final_sigma_zero` to UniPCMultistep
Effectively the same trick as DDIM's `set_alpha_to_one` and
DPM's `final_sigma_type='zero'`.
Currently False by default but maybe this should be True?
* `final_sigma_zero: bool` -> `final_sigmas_type: str`
Should 1:1 match DPM Multistep now.
* Set `final_sigmas_type='sigma_min'` in UniPC UTs
* Initial commit
* Implemented block lora
- implemented block lora
- updated docs
- added tests
* Finishing up
* Reverted unrelated changes made by make style
* Fixed typo
* Fixed bug + Made text_encoder_2 scalable
* Integrated some review feedback
* Incorporated review feedback
* Fix tests
* Made every module configurable
* Adapter to new lora test structure
* Final cleanup
* Some more final fixes
- Included examples in `using_peft_for_inference.md`
- Added hint that only attns are scaled
- Removed NoneTypes
- Added test to check mismatching lens of adapter names / weights raise error
* Update using_peft_for_inference.md
* Update using_peft_for_inference.md
* Make style, quality, fix-copies
* Updated tutorial;Warning if scale/adapter mismatch
* floats are forwarded as-is; changed tutorial scale
* make style, quality, fix-copies
* Fixed typo in tutorial
* Moved some warnings into `lora_loader_utils.py`
* Moved scale/lora mismatch warnings back
* Integrated final review suggestions
* Empty commit to trigger CI
* Reverted emoty commit to trigger CI
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* speed up test_vae_slicing in animatediff
* speed up test_karras_schedulers_shape for attend and excite.
* style.
* get the static slices out.
* specify torch print options.
* modify
* test run with controlnet
* specify kwarg
* fix: things
* not None
* flatten
* controlnet img2img
* complete controlet sd
* finish more
* finish more
* finish more
* finish more
* finish the final batch
* add cpu check for expected_pipe_slice.
* finish the rest
* remove print
* style
* fix ssd1b controlnet test
* checking ssd1b
* disable the test.
* make the test_ip_adapter_single controlnet test more robust
* fix: simple inpaint
* multi
* disable panorama
* enable again
* panorama is shaky so leave it for now
* remove print
* raise tolerance.
* Bug fix for controlnetpipeline check_image
Bug fix for controlnetpipeline check_image when using multicontrolnet and prompt list
* Update test_inference_multiple_prompt_input function
* Update test_controlnet.py
add test for multiple prompts and multiple image conditioning
* Update test_controlnet.py
Fix format error
---------
Co-authored-by: Lvkesheng Shen <45848260+Fantast416@users.noreply.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add remove_all_hooks
* a few more fix and tests
* up
* Update src/diffusers/pipelines/pipeline_utils.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* split tests
* add
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* apple mps: training support for SDXL LoRA
* sdxl: support training lora, dreambooth, t2i, pix2pix, and controlnet on apple mps
---------
Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* mps: fix XL pipeline inference at training time due to upstream pytorch bug
* Update src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* apply the safe-guarding logic elsewhere.
---------
Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
you cannot specify `type="bool"` and `action="store_true"` at the same time.
remove excessive and buggy `type=bool`.
Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>
* feat: support dora loras from community
* safe-guard dora operations under peft version.
* pop use_dora when False
* make dora lora from kohya work.
* fix: kohya conversion utils.
* add a fast test for DoRA compatibility..
* add a nightly test.
* fixed typo
* updated doc to be consistent in naming
* make style/quality
* preprocessing for 4 channels and not 6
* make style
* test for 4c
* make style/quality
* fixed test on cpu
* fixed doc typo
* changed default ckpt to 4c
* Update pipeline_stable_diffusion_ldm3d.py
* fix bug
---------
Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu33.rr.intel.com>
Co-authored-by: Aflalo <estellea@isl-gpu38.rr.intel.com>
* Add properties and `IPAdapterTesterMixin` tests for `StableDiffusionPanoramaPipeline`
* Update torch manual seed to use `torch.Generator(device=device)`
* Refactor 📞🔙 to support `callback_on_step_end`
* make fix-copies
* fix freeinit impl
* fix progress bar
* fix progress bar and remove old code
* fix num_inference_steps==1 case for freeinit by atleast running 1 step when fast sampling enabled
* checking to improve pipelines.
* more fixes.
* add: tip to encourage the usage of revision
* Apply suggestions from code review
* retrigger ci
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Fix ControlNetModel.from_unet do not load add_embedding
* delete white space in blank line
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* debugging
* let's see the numbers
* let's see the numbers
* let's see the numbers
* restrict tolerance.
* increase inference steps.
* shallow copy of cross_attentionkwargs
* remove print
* pop scale from the top-level unet instead of getting it.
* improve readability.
* Apply suggestions from code review
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* fix a little bit.
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Add properties and `IPAdapterTesterMixin` tests for `StableDiffusionPanoramaPipeline`
* Fix variable name typo and update comments
* Update deprecated `output_type="numpy"` to "np" in test files
* Discard changes to src/diffusers/pipelines/stable_diffusion_panorama/pipeline_stable_diffusion_panorama.py
* Update test_stable_diffusion_panorama.py
* Update numbers in README.md
* Update get_guidance_scale_embedding method to use timesteps instead of w
* Update number of checkpoints in README.md
* Add type hints and fix var name
* Fix PyTorch's convention for inplace functions
* Fix a typo
* Revert "Fix PyTorch's convention for inplace functions"
This reverts commit 74350cf65b.
* Fix typos
* Indent
* Refactor get_guidance_scale_embedding method in LEditsPPPipelineStableDiffusionXL class
* log loss per image
* add commandline param for per image loss logging
* style
* debug-loss -> debug_loss
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Change step_offset scheduler docstrings
* Mention it may be needed by some models
* More docstrings
These ones failed literal S&R because I performed it case-sensitive
which is fun.
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add: support for notifying maintainers about the nightly test status
* add: a tempoerary workflow for validation.
* cancel in progress.
* runs-on
* clean up
* add: peft dep
* change device.
* multiple edits.
* remove temp workflow.
* add: a workflow to check if docker containers can be built if the files are modified.
* type
* unify docker image build test and push
* make it run on prs too.
* check
* check
* check
* check again.
* remove docker test build file.
* remove extra dependencies./
* check
* Initial commit
* Removed copy hints, as in original SDXLControlNetPipeline
Removed copy hints, as in original SDXLControlNetPipeline, as the `make fix-copies` seems to have issues with the @property decorator.
* Reverted changes to ControlNetXS
* Addendum to: Removed changes to ControlNetXS
* Added test+docs for mixture of denoiser
* Update docs/source/en/using-diffusers/controlnet.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/using-diffusers/controlnet.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* I added a new doc string to the class. This is more flexible to understanding other developers what are doing and where it's using.
* Update src/diffusers/models/unet_2d_blocks.py
This changes suggest by maintener.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update src/diffusers/models/unet_2d_blocks.py
Add suggested text
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update unet_2d_blocks.py
I changed the Parameter to Args text.
* Update unet_2d_blocks.py
proper indentation set in this file.
* Update unet_2d_blocks.py
a little bit of change in the act_fun argument line.
* I run the black command to reformat style in the code
* Update unet_2d_blocks.py
similar doc-string add to have in the original diffusion repository.
* Fix bug for mention in this issue section #6901
* Update src/diffusers/schedulers/scheduling_ddim_flax.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Fix linter
* Restore empty line
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* copied from for t2i pipelines without ip adapter support.
* two more pipelines with proper copied from comments.
* revert to the original implementation
* throw error when patch inputs and layernorm are provided for transformers2d.
* add comment on supported norm_types in transformers2d
* more check
* fix: norm _type handling
* [bug] Fix float/int guidance scale not working in `StableVideoDiffusionPipeline`
* Add test to disable CFG on SVD
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* support and example launch for sdxl turbo
* White space fixes
* Trailing whitespace character
* ruff format
* fix guidance_scale and steps for turbo mode
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Radames Ajna <radamajna@gmail.com>
* update svd docs
* fix example doc string
* update return type hints/docs
* update type hints
* Fix typos in pipeline_stable_video_diffusion.py
* make style && make fix-copies
* Update src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* update based on suggestion
---------
Co-authored-by: M. Tolga Cangöz <mtcangoz@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Enable FakeTensorMode for EulerDiscreteScheduler scheduler
PyTorch's FakeTensorMode does not support `.numpy()` or `numpy.array()`
calls.
This PR replaces `sigmas` numpy tensor by a PyTorch tensor equivalent
Repro
```python
with torch._subclasses.FakeTensorMode() as fake_mode, ONNXTorchPatcher():
fake_model = DiffusionPipeline.from_pretrained(model_name, low_cpu_mem_usage=False)
```
that otherwise would fail with
`RuntimeError: .numpy() is not supported for tensor subclasses.`
* Address comments
* add tags for diffusers training
* add tags for diffusers training
* add tags for diffusers training
* add tags for diffusers training
* add tags for diffusers training
* add tags for diffusers training
* add dora tags for drambooth lora scripts
* style
* add is_dora arg
* style
* add dora training feature to sd 1.5 script
* added notes about DoRA training
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* initial
* check_inputs fix to the rest of pipelines
* add fix for no cfg too
* use of variable
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Add copyright notice to relevant files and fix typos
* Set `timestep_spacing` parameter of `StableDiffusionXLPipeline`'s scheduler to `'trailing'`.
* Update `StableDiffusionXLPipeline.from_single_file` by including EulerAncestralDiscreteScheduler with `timestep_spacing="trailing"` param.
* Update model loading method in SDXL Turbo documentation
* move model helper function in pipeline to EfficiencyMixin
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* DPMMultistep rescale_betas_zero_snr
* DPM upcast samples in step()
* DPM rescale_betas_zero_snr UT
* DPMSolverMulti move sample upcast after model convert
Avoids having to re-use the dtype.
* Add a newline for Ruff
* log_validation unification for controlnet.
* additional fixes.
* remove print.
* better reuse and loading
* make final inference run conditional.
* Update examples/controlnet/README_sdxl.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* resize the control image in the snippet.
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Make LoRACompatibleConv padding_mode work.
* Format code style.
* add fast test
* Update src/diffusers/models/lora.py
Simplify the code by patrickvonplaten.
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* code refactor
* apply patrickvonplaten suggestion to simplify the code.
* rm test_lora_layers_old_backend.py and add test case in test_lora_layers_peft.py
* update test case.
---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* modulize log validation
* run make style and refactor wanddb support
* remove redundant initialization
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* make checkpoint_merger pipeline pass the "variant" argument to from_pretrained()
* make style
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* add stable_diffusion_xl_ipex community pipeline
* make style for code quality check
* update docs as suggested
---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* standardize model card
* fix tags
* correct import styling and update tags
* run make style and make quality
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* feat: allow low_cpu_mem_usage in ip adapter loading
* reduce the number of device placements.
* documentation.
* throw low_cpu_mem_usage warning only once from the main entry point.
* use load_model_into_meta in single file utils
* propagate to autoencoder and controlnet.
* correct class name access behaviour.
* remove torch_dtype from load_model_into_meta; seems unncessary
* remove incorrect kwarg
* style to avoid extra unnecessary line breaks
* fix: bias loading bug
* fixes for SDXL
* apply changes to the conversion script to match single_file_utils.py
* do transpose to match the single file loading logic.
Remove <cat-toy> validation prompt from textual_inversion_sdxl.py
The `<cat-toy>` validation prompt is a default choice for the example task in the README. But no other part of `textual_inversion_sdxl.py` references the cat toy and `textual_inversion.py` has a default validation prompt of `None` as well.
So bring `textual_inversion_sdxl.py` in line with `textual_inversion.py` and change default validation prompt to `None`
* attention_head_dim
* debug
* print more info
* correct num_attention_heads behaviour
* down_block_num_attention_heads -> num_attention_heads.
* correct the image link in doc.
* add: deprecation for num_attention_head
* fix: test argument to use attention_head_dim
* more fixes.
* quality
* address comments.
* remove depcrecation.
* add: support for passing ip adapter image embeddings
* debugging
* make feature_extractor unloading conditioned on safety_checker
* better condition
* type annotation
* index to look into value slices
* more debugging
* debugging
* serialize embeddings dict
* better conditioning
* remove unnecessary prints.
* Update src/diffusers/loaders/ip_adapter.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* make fix-copies and styling.
* styling and further copy fixing.
* fix: check_inputs call in controlnet sdxl img2img pipeline
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* feat: standarize model card creation for dreambooth training.
* correct 'inference
* remove comments.
* take component out of kwargs
* style
* add: card template to have a leaner description.
* widget support.
* propagate changes to train_dreambooth_lora
* propagate changes to custom diffusion
* make widget properly type-annotated
* fix: callback function name is incorrect
On this tutorial there is a function defined and then used inside `callback_on_step_end` argument, but the name was not correct (mismatch)
* fix: typo in num_timestep (correct is num_timesteps)
fixed property name
* remove _to_tensor
* remove _to_tensor definition
* remove _collapse_frames_into_batch
* remove lora for not bloating the code.
* remove sample_size.
* simplify code a bit more
* ensure timesteps are always in tensor.
* Fix `AutoencoderTiny` with `use_slicing`
When using slicing with AutoencoderTiny, the encoder mistakenly encodes the entire batch for every image in the batch.
* Fixed formatting issue
* add noise_offset param
* micro conditioning - wip
* image processing adjusted and moved to support micro conditioning
* change time ids to be computed inside train loop
* change time ids to be computed inside train loop
* change time ids to be computed inside train loop
* time ids shape fix
* move token replacement of validation prompt to the same section of instance prompt and class prompt
* add offset noise to sd15 advanced script
* fix token loading during validation
* fix token loading during validation in sdxl script
* a little clean
* style
* a little clean
* style
* sdxl script - a little clean + minor path fix
sd 1.5 script - change default resolution value
* ad 1.5 script - minor path fix
* fix missing comma in code example in model card
* clean up commented lines
* style
* remove time ids computed outside training loop - no longer used now that we utilize micro-conditioning, as all time ids are now computed inside the training loop
* style
* [WIP] - added draft readme, building off of examples/dreambooth/README.md
* readme
* readme
* readme
* readme
* readme
* readme
* readme
* readme
* removed --crops_coords_top_left from CLI args
* style
* fix missing shape bug due to missing RGB if statement
* add blog mention at the start of the reamde as well
* Update examples/advanced_diffusion_training/README.md
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* change note to render nicely as well
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Fix bug in ResnetBlock2D.forward when not USE_PEFT_BACKEND and using scale_shift for time emb where the lora scale gets overwritten.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* fix minsnr implementation for v-prediction case
* format code
* always compute snr when snr_gamma is specified
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* feat: explicitly tag to diffusers when using push_to_hub
* remove tags.
* reset repo.
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* fix: tests
* fix: push_to_hub behaviour for tagging from save_pretrained
* Apply suggestions from code review
Co-authored-by: Lucain <lucainp@gmail.com>
* Apply suggestions from code review
Co-authored-by: Lucain <lucainp@gmail.com>
* import fixes.
* add library name to existing model card.
* add: standalone test for generate_model_card
* fix tests for standalone method
* moved library_name to a better place.
* merge create_model_card and generate_model_card.
* fix test
* address lucain's comments
* fix return identation
* Apply suggestions from code review
Co-authored-by: Lucain <lucainp@gmail.com>
* address further comments.
* Update src/diffusers/pipelines/pipeline_utils.py
Co-authored-by: Lucain <lucainp@gmail.com>
---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
* initial commit for unconditional/class-conditional consistency training script
* make style
* Add entry for consistency training script in community README.
* Move consistency training script from community to research_projects/consistency_training
* Add requirements.txt and README to research_projects/consistency_training directory.
* Manually revert community README changes for consistency training.
* Fix path to script after moving script to research projects.
* Add option to load U-Net weights from pretrained model.
---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* begin animatediff img2video and video2video
* revert animatediff to original implementation
* add img2video as pipeline
* update
* add vid2vid pipeline
* update imports
* update
* remove copied from line for check_inputs
* update
* update examples
* add multi-batch support
* fix __init__.py files
* move img2vid to community
* update community readme and examples
* fix
* make fix-copies
* add vid2vid batch params
* apply suggestions from review
Co-Authored-By: Dhruv Nair <dhruv.nair@gmail.com>
* add test for animatediff vid2vid
* torch.stack -> torch.cat
Co-Authored-By: Dhruv Nair <dhruv.nair@gmail.com>
* make style
* docs for vid2vid
* update
* fix prepare_latents
* fix docs
* remove img2vid
* update README to :main
* remove slow test
* refactor pipeline output
* update docs
* update docs
* merge community readme from :main
* final fix i promise
* add support for url in animatediff example
* update example
* update callbacks to latest implementation
* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* fix merge
* Apply suggestions from code review
* remove callback and callback_steps as suggested in review
* Update tests/pipelines/animatediff/test_animatediff_video2video.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* fix import error caused due to unet refactor in #6630
* fix numpy import error after tensor2vid refactor in #6626
* make fix-copies
* fix numpy error
* fix progress bar test
---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* sd1.5 support in separate script
A quick adaptation to support people interested in using this method on 1.5 models.
* sd15 prompt text encoding and unet conversions
as per @linoytsaban 's recommendations. Testing would be appreciated,
* Readability and quality improvements
Removed some mentions of SDXL, and some arguments that don't apply to sd 1.5, and cleaned up some comments.
* make style/quality commands
* tracker rename and run-it doc
* Update examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py
* Update examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py
---------
Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>
* move unets to module 🦋
* parameterize unet-level import.
* fix flax unet2dcondition model import
* models __init__
* mildly depcrecating models.unet_2d_blocks in favor of models.unets.unet_2d_blocks.
* noqa
* correct depcrecation behaviour
* inherit from the actual classes.
* Empty-Commit
* backwards compatibility for unet_2d.py
* backward compatibility for unet_2d_condition
* bc for unet_1d
* bc for unet_1d_blocks
* Fixed the bug related to saving DeepSpeed models.
* Add information about training SD models using DeepSpeed to the README.
* Apply suggestions from code review
---------
Co-authored-by: mhh001 <mahonghao1@huawei.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* - extract function for stage in UNet2DConditionModel init & forward
- Add new function get_mid_block() to unet_2d_blocks.py
* add type hint to get_mid_block aligned with get_up_block and get_down_block; rename _set_xxx function
* add type hint and use keyword arguments
* remove `copy from` in versatile diffusion
* add animatediff img2vid
* fix
* Update examples/community/README.md
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* fix code snippet between ip adapter face id and animatediff img2vid
---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* [Fix] Multiple image conditionings in a single batch for `StableDiffusionControlNetPipeline`.
* Refactor `check_inputs` in `StableDiffusionControlNetPipeline` to avoid redundant codes.
* Make the behavior of MultiControlNetModel to be the same to the original ControlNetModel
* Keep the code change minimum for nested list support
* Add fast test `test_inference_nested_image_input`
* Remove redundant check for nested image condition in `check_inputs`
Remove `len(image) == len(prompt)` check out of `check_image()`
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Better `ValueError` message for incompatible nested image list size
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Fix syntax error in `check_inputs`
* Remove warning message for multi-ControlNets with multiple prompts
* Fix a typo in test_controlnet.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Add test case for multiple prompts, single image conditioning in `StableDiffusionMultiControlNetPipelineFastTests`
* Improved `ValueError` message for nested `controlnet_conditioning_scale`
* Documenting the behavior of image list as `StableDiffusionControlNetPipeline` input
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Fixes#6418 Advanced Dreambooth LoRa Training
* change order of import to fix nit
* fix nit, use cast_training_params
* remove torch.compile fix, will move to a new PR
* remove unnecessary import
* Enable image resizing to adjust its height and width in StableDiffusionXLInstructPix2PixPipeline
* Ensure that validation is performed at every 'validation_step', not at every step
* fix: training resume from fp16.
* add: comment
* remove residue from another branch.
* remove more residues.
* thanks to Younes; no hacks.
* style.
* clean things a bit and modularize _set_state_dict_into_text_encoder
* add comment about the fix detailed.
* support compile
* make style
* move unwrap_model inside function
* change unwrap call
* run make style
* Update examples/dreambooth/train_dreambooth.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Revert "Update examples/dreambooth/train_dreambooth.py"
This reverts commit 70ab09732e.
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Remove conversion to RGB
* Add a Conversion Function
* Add type hint for convert_method
* Update src/diffusers/utils/loading_utils.py
Update docstring
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update docstring
* Optimize imports
* Optimize imports (2)
* Reformat code
---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* base template file - train_instruct_pix2pix.py
* additional import and parser argument requried for lora
* finetune only instructpix2pix model -- no need to include these layers
* inject lora layers
* freeze unet model -- only lora layers are trained
* training modifications to train only lora parameters
* store only lora parameters
* move train script to research project
* run quality and style code checks
* move train script to a new folder
* add README
* update README
* update references in README
---------
Co-authored-by: Rahul Raman <rahulraman@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* enable stable-xl textual inversion
* check if optimizer_2 exists
* check text_encoder_2 before using
* add textual inversion for sdxl in a single file
* fix style
* fix example style
* reset for error changes
* add readme for sdxl
* fix style
* disable autocast as it will cause cast error when weight_dtype=bf16
* fix spelling error
* fix style and readme and 8bit optimizer
* add README_sdxl.md link
* add tracker key on log_validation
* run style
* rm the second center crop
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add tutorials to toctree.yml
* fix title
* fix words
* add overview ja
* fix diffusion to 拡散
* fix line 21
* add space
* delete supported pipline
* fix tutorial_overview.md
* fix space
* fix typo
* Delete docs/source/ja/tutorials/using_peft_for_inference.md
this file is not translated
* Delete docs/source/ja/tutorials/basic_training.md
this file is not translated
* Delete docs/source/ja/tutorials/autopipeline.md
this file is not translated
* fix toctree
* add: experimental script for diffusion dpo training.
* random_crop cli.
* fix: caption tokenization.
* fix: pixel_values index.
* fix: grad?
* debug
* fix: reduction.
* fixes in the loss calculation.
* style
* fix: unwrap call.
* fix: validation inference.
* add: initial sdxl script
* debug
* make sure images in the tuple are of same res
* fix model_max_length
* report print
* boom
* fix: numerical issues.
* fix: resolution
* comment about resize.
* change the order of the training transformation.
* save call.
* debug
* remove print
* manually detaching necessary?
* use the same vae for validation.
* add: readme.
* unwrap text encoder when saving hook only for full text encoder tuning
* unwrap text encoder when saving hook only for full text encoder tuning
* save embeddings in each checkpoint as well
* save embeddings in each checkpoint as well
* save embeddings in each checkpoint as well
* Update examples/advanced_diffusion_training/train_dreambooth_lora_sdxl_advanced.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add documentation for DeepCache
* fix typo
* add wandb url for DeepCache
* fix some typos
* add item in _toctree.yml
* update formats for arguments
* Update deepcache.md
* Update docs/source/en/optimization/deepcache.md
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add StableDiffusionXLPipeline in doc
* Separate SDPipeline and SDXLPipeline
* Add the paper link of ablation experiments for hyper-parameters
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Make WDS pipeline interpolation type configurable.
* Make the VAE encoding batch size configurable.
* Make lora_alpha and lora_dropout configurable for LCM LoRA scripts.
* Generalize scalings_for_boundary_conditions function and make the timestep scaling configurable.
* Make LoRA target modules configurable for LCM-LoRA scripts.
* Move resolve_interpolation_mode to src/diffusers/training_utils.py and make interpolation type configurable in non-WDS script.
* apply suggestions from review
* debug
* debug test_with_different_scales_fusion_equivalence
* use the right method.
* place it right.
* let's see.
* let's see again
* alright then.
* add a comment.
* I added a new doc string to the class. This is more flexible to understanding other developers what are doing and where it's using.
* Update src/diffusers/models/unet_2d_blocks.py
This changes suggest by maintener.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update src/diffusers/models/unet_2d_blocks.py
Add suggested text
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update unet_2d_blocks.py
I changed the Parameter to Args text.
* Update unet_2d_blocks.py
proper indentation set in this file.
* Update unet_2d_blocks.py
a little bit of change in the act_fun argument line.
* I run the black command to reformat style in the code
* Update unet_2d_blocks.py
similar doc-string add to have in the original diffusion repository.
* Batter way to write binarize function
* Solve check_code_quality error
* My mistake to run pull request but not reformated file
* Update image_processor.py
* remove extra variable and space
* Update image_processor.py
* Run ruff libarary to reformat my file
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* add: test to check if peft loras are loadable in non-peft envs.
* add torch_device approrpiately.
* fix: get_dummy_inputs().
* test logits.
* rename
* debug
* debug
* fix: generator
* new assertion values after fixing the seed.
* shape
* remove print statements and settle this.
* to update values.
* change values when lora config is initialized under a fixed seed.
* update colab link
* update notebook link
* sanity restored by getting the exact same values without peft.
* change timesteps used to calculate snr when --with_prior_preservation is enabled
* change timesteps used to calculate snr when --with_prior_preservation is enabled (canonical script)
* style
* revert canonical script to before snr gamma change
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Add unload_ip_adapter method
* Update attn_processors with original layers
* Add test
* Use set_default_attn_processor
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Fix gradient-checkpointing option is ignored in SDXL+LoRA training. (#6388)
* Fix gradient-checkpointing option is ignored in SD+LoRA training.
* Fix gradient checkpoint is not applied to text encoders. (SDXL+LoRA)
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add doc for diffusion fast
* add entry to _toctree
* Apply suggestions from code review
* fix titlew
* fix: title entry
* add note about fuse_qkv_projections
* add adapter_name in fuse
* add tesrt
* up
* fix CI
* adapt from suggestion
* Update src/diffusers/utils/testing_utils.py
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
* change to `require_peft_version_greater`
* change variable names in test
* Update src/diffusers/loaders/lora.py
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
* break into 2 lines
* final comments
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
* [Peft] fix saving / loading when unet is not "unet"
* Update src/diffusers/loaders/lora.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* undo stablediffusion-xl changes
* use unet_name to get unet for lora helpers
* use unet_name
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* remove validation args from textual onverson tests
* reduce number of train steps in textual inversion tests
* fix: directories.
* debig
* fix: directories.
* remove validation tests from textual onversion
* try reducing the time of test_text_to_image_checkpointing_use_ema
* fix: directories
* speed up test_text_to_image_checkpointing
* speed up test_text_to_image_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints
* fix
* speed up test_instruct_pix2pix_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints
* set checkpoints_total_limit to 2.
* test_text_to_image_lora_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints speed up
* speed up test_unconditional_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints
* debug
* fix: directories.
* speed up test_instruct_pix2pix_checkpointing_checkpoints_total_limit
* speed up: test_controlnet_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints
* speed up test_controlnet_sdxl
* speed up dreambooth tests
* speed up test_dreambooth_lora_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints
* speed up test_custom_diffusion_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints
* speed up test_text_to_image_lora_sdxl_text_encoder_checkpointing_checkpoints_total_limit
* speed up # checkpoint-2 should have been deleted
* speed up examples/text_to_image/test_text_to_image.py::TextToImage::test_text_to_image_checkpointing_checkpoints_total_limit
* additional speed ups
* style
* fix RuntimeError: Input type (float) and bias type (c10::Half) should be the same
* format source code
* format code
* remove the autocast blocks within the pipeline
* add autocast blocks to pipeline caller in train_text_to_image_lora.py
* [Community Pipeline] Add Marigold Monocular Depth Estimation
- add single-file pipeline
- update README
* fix format - add one blank line
* format script with ruff
* use direct image link in example code
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
echo "Quality check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY
check_repository_consistency:
needs:check_code_quality
runs-on:ubuntu-latest
steps:
- uses:actions/checkout@v3
- name:Set up Python
uses:actions/setup-python@v4
with:
python-version:"3.8"
- name:Install dependencies
run:|
python -m pip install --upgrade pip
pip install .[quality]
- name:Check repo consistency
run:|
python utils/check_copies.py
python utils/check_dummies.py
make deps_table_check_updated
- name:Check if failure
if:${{ failure() }}
run:|
echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY
@@ -57,13 +57,13 @@ Any question or comment related to the Diffusers library can be asked on the [di
- ...
Every question that is asked on the forum or on Discord actively encourages the community to publicly
share knowledge and might very well help a beginner in the future who has the same question you're
share knowledge and might very well help a beginner in the future that has the same question you're
having. Please do pose any questions you might have.
In the same spirit, you are of immense help to the community by answering such questions because this way you are publicly documenting knowledge for everybody to learn from.
**Please** keep in mind that the more effort you put into asking or answering a question, the higher
the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database.
In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formated/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section.
**NOTE about channels**:
[*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago.
@@ -245,7 +245,7 @@ The official training examples are maintained by the Diffusers' core maintainers
This is because of the same reasons put forward in [6. Contribute a community pipeline](#6-contribute-a-community-pipeline) for official pipelines vs. community pipelines: It is not feasible for the core maintainers to maintain all possible training methods for diffusion models.
If the Diffusers core maintainers and the community consider a certain training paradigm to be too experimental or not popular enough, the corresponding training code should be put in the `research_projects` folder and maintained by the author.
Both official training and research examples consist of a directory that contains one or more training scripts, a `requirements.txt` file, and a `README.md` file. In order for the user to make use of the
Both official training and research examples consist of a directory that contains one or more training scripts, a requirements.txt file, and a README.md file. In order for the user to make use of the
training examples, it is required to clone the repository:
Therefore when adding an example, the `requirements.txt` file shall define all pip dependencies required for your training example so that once all those are installed, the user can run the example's training script. See, for example, the [DreamBooth `requirements.txt` file](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/requirements.txt).
@@ -15,7 +15,7 @@ specific language governing permissions and limitations under the License.
🧨 Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities.
Its purpose is to serve as a **modular toolbox** for both inference and training.
We aim to build a library that stands the test of time and therefore take API design very seriously.
We aim at building a library that stands the test of time and therefore take API design very seriously.
In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on [PyTorch's Design Principles](https://pytorch.org/docs/stable/community/design.html#pytorch-design-philosophy). Let's go over the most important ones:
@@ -63,14 +63,14 @@ Let's walk through more detailed design decisions for each class.
Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference.
The following design principles are followed:
- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [#Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [#Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
- Pipelines all inherit from [`DiffusionPipeline`].
- Every pipeline consists of different model and scheduler components, that are documented in the [`model_index.json` file](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline and can be shared between pipelines with [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function.
- Every pipeline consists of different model and scheduler components, that are documented in the [`model_index.json` file](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline and can be shared between pipelines with [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function.
- Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function.
- Pipelines should be used **only** for inference.
- Pipelines should be very readable, self-explanatory, and easy to tweak.
- Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs.
- Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
- Pipelines are **not** intended to be feature-complete user interfaces. For futurecomplete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
- Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines.
- Pipelines should be named after the task they are intended to solve.
- In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file.
@@ -81,7 +81,7 @@ Models are designed as configurable toolboxes that are natural extensions of [Py
The following design principles are followed:
- Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc...
- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py), [`transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformer_2d.py), etc...
- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
- Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages.
- Models all inherit from `ModelMixin` and `ConfigMixin`.
@@ -90,7 +90,7 @@ The following design principles are followed:
- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
### Schedulers
@@ -100,7 +100,7 @@ The following design principles are followed:
- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
- If schedulers share similar functionalities, we can make use of the `#Copied from` mechanism.
- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./docs/source/en/using-diffusers/schedulers.md).
- Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://huggingface.co/docs/diffusers/conceptual/philosophy#usability-over-performance), [simple over easy](https://huggingface.co/docs/diffusers/conceptual/philosophy#simple-over-easy), and [customizability over abstractions](https://huggingface.co/docs/diffusers/conceptual/philosophy#tweakable-contributorfriendly-over-abstraction).
@@ -67,13 +77,13 @@ Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggi
## Quickstart
Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 30,000+ checkpoints):
Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 22000+ checkpoints):
| [Tutorial](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview) | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model. |
| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading) | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. |
| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/overview_techniques) | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. |
| [Optimization](https://huggingface.co/docs/diffusers/optimization/fp16) | Guides for how to optimize your diffusion model to run faster and consume less memory. |
| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading_overview) | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. |
| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/pipeline_overview) | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. |
| [Optimization](https://huggingface.co/docs/diffusers/optimization/opt_overview) | Guides for how to optimize your diffusion model to run faster and consume less memory. |
| [Training](https://huggingface.co/docs/diffusers/training/overview) | Guides for how to train a diffusion model for different tasks with different training techniques. |
## Contribution
@@ -144,7 +154,7 @@ Also, say 👋 in our public Discord channel <a href="https://discord.gg/G7tWnz9
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Outpainting
Outpainting extends an image beyond its original boundaries, allowing you to add, replace, or modify visual elements in an image while preserving the original image. Like [inpainting](../using-diffusers/inpaint), you want to fill the white area (in this case, the area outside of the original image) with new visual elements while keeping the original image (represented by a mask of black pixels). There are a couple of ways to outpaint, such as with a [ControlNet](https://hf.co/blog/OzzyGT/outpainting-controlnet) or with [Differential Diffusion](https://hf.co/blog/OzzyGT/outpainting-differential-diffusion).
This guide will show you how to outpaint with an inpainting model, ControlNet, and a ZoeDepth estimator.
Before you begin, make sure you have the [controlnet_aux](https://github.com/huggingface/controlnet_aux) library installed so you can use the ZoeDepth estimator.
```py
!pipinstall-qcontrolnet_aux
```
## Image preparation
Start by picking an image to outpaint with and remove the background with a Space like [BRIA-RMBG-1.4](https://hf.co/spaces/briaai/BRIA-RMBG-1.4).
<iframe
src="https://briaai-bria-rmbg-1-4.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
For example, remove the background from this image of a pair of shoes.
[Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) models work best with 1024x1024 images, but you can resize the image to any size as long as your hardware has enough memory to support it. The transparent background in the image should also be replaced with a white background. Create a function (like the one below) that scales and pastes the image onto a white background.
To avoid adding unwanted extra details, use the ZoeDepth estimator to provide additional guidance during generation and to ensure the shoes remain consistent with the original image.
Once your image is ready, you can generate content in the white area around the shoes with [controlnet-inpaint-dreamer-sdxl](https://hf.co/destitech/controlnet-inpaint-dreamer-sdxl), a SDXL ControlNet trained for inpainting.
Load the inpainting ControlNet, ZoeDepth model, VAE and pass them to the [`StableDiffusionXLControlNetPipeline`]. Then you can create an optional `generate_image` function (for convenience) to outpaint an initial image.
> Now is a good time to free up some memory if you're running low!
>
> ```py
> pipeline=None
> torch.cuda.empty_cache()
> ```
Now that you have an initial outpainted image, load the [`StableDiffusionXLInpaintPipeline`] with the [RealVisXL](https://hf.co/SG161222/RealVisXL_V4.0) model to generate the final outpainted image with better quality.
Prepare a mask for the final outpainted image. To create a more natural transition between the original image and the outpainted background, blur the mask to help it blend better.
Create a better prompt and pass it to the `generate_outpaint` function to generate the final outpainted image. Again, paste the original image over the final outpainted background.
@@ -12,16 +12,10 @@ specific language governing permissions and limitations under the License.
# LoRA
LoRA is a fast and lightweight training method that inserts and trains a significantly smaller number of parameters instead of all the model parameters. This produces a smaller file (~100 MBs) and makes it easier to quickly train a model to learn a new concept. LoRA weights are typically loaded into the denoiser, text encoder or both. The denoiser usually corresponds to a UNet ([`UNet2DConditionModel`], for example) or a Transformer ([`SD3Transformer2DModel`], for example). There are several classes for loading LoRA weights:
LoRA is a fast and lightweight training method that inserts and trains a significantly smaller number of parameters instead of all the model parameters. This produces a smaller file (~100 MBs) and makes it easier to quickly train a model to learn a new concept. LoRA weights are typically loaded into the UNet, text encoder or both. There are two classes for loading LoRA weights:
- [`StableDiffusionLoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model.
- [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`StableDiffusionLoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model.
- [`SD3LoraLoaderMixin`] provides similar functions for [Stable Diffusion 3](https://huggingface.co/blog/sd3).
- [`FluxLoraLoaderMixin`] provides similar functions for [Flux](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux).
- [`CogVideoXLoraLoaderMixin`] provides similar functions for [CogVideoX](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox).
- [`Mochi1LoraLoaderMixin`] provides similar functions for [Mochi](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi).
- [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`].
- [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, unload, LoRAs and more.
- [`LoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model.
- [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`LoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model.
<Tip>
@@ -29,34 +23,10 @@ To learn more about how to load LoRA weights, see the [LoRA](../../using-diffuse
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# PEFT
Diffusers supports loading adapters such as [LoRA](../../using-diffusers/loading_adapters) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`], [`SD3Transformer2DModel`] to operate with an adapter.
Diffusers supports loading adapters such as [LoRA](../../using-diffusers/loading_adapters) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`] to load an adapter.
@@ -10,17 +10,13 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Single files
# Loading Pipelines and Models via `from_single_file`
The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
The `from_single_file` method allows you to load supported pipelines using a single checkpoint file as opposed to the folder format used by Diffusers. This is useful if you are working with many of the Stable Diffusion Web UI's (such as A1111) that extensively rely on a singlefile to distribute all the components of a diffusion model.
* a model stored in a singlefile, which is useful if you're working with models from the diffusion ecosystem, like Automatic1111, and commonly rely on a single-file layout to store and share models
* a model stored in their originally distributed layout, which is useful if you're working with models finetuned with other services, and want to load it directly into Diffusers model objects and pipelines
The `from_single_file` method also supports loading models in their originally distributed format. This means that supported models that have been finetuned with other services can be loaded directly into supported Diffusers model objects and pipelines.
> [!TIP]
> Read the [Model files and layouts](../../using-diffusers/other-formats) guide to learn more about the Diffusers-multifolder layout versus the single-file layout, and how to load models stored in these different layouts.
## Supported pipelines
## Pipelines that currently support `from_single_file` loading
- [`StableDiffusionPipeline`]
- [`StableDiffusionImg2ImgPipeline`]
@@ -35,7 +31,6 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
- [`StableDiffusionXLInstructPix2PixPipeline`]
- [`StableDiffusionXLControlNetPipeline`]
- [`StableDiffusionXLKDiffusionPipeline`]
- [`StableDiffusion3Pipeline`]
- [`LatentConsistencyModelPipeline`]
- [`LatentConsistencyModelImg2ImgPipeline`]
- [`StableDiffusionControlNetXSPipeline`]
@@ -44,14 +39,205 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
- [`LEditsPPPipelineStableDiffusionXL`]
- [`PIAPipeline`]
## Supported models
## Models that currently support `from_single_file` loading
## Setting components in a Pipeline using `from_single_file`
Swap components of the pipeline by passing them directly to the `from_single_file` method. e.g If you would like use a different scheduler than the pipeline default.
## Using a Diffusers model repository to configure single file loading
Under the hood, `from_single_file` will try to determine a model repository to use to configure the components of the pipeline. You can also pass in a repository id to the `config` argument of the `from_single_file` method to explicitly set the repository to use.
## Override configuration options when using single file loading
Override the default model or pipeline configuration options when using `from_single_file` by passing in the relevant arguments directly to the `from_single_file` method. Any argument that is supported by the model or pipeline class can be configured in this way:
In the example above, since we explicitly passed `repo_id="segmind/SSD-1B"`, it will use this [configuration file](https://huggingface.co/segmind/SSD-1B/blob/main/unet/config.json) from the "unet" subfolder in `"segmind/SSD-1B"` to configure the unet component included in the checkpoint; Similarly, it will use the `config.json` file from `"vae"` subfolder to configure the vae model, `config.json` file from text_encoder folder to configure text_encoder and so on.
Note that most of the time you do not need to explicitly a `config` argument, `from_single_file` will automatically map the checkpoint to a repo id (we will discuss this in more details in next section). However, this can be useful in cases where model components might have been changed from what was originally distributed or in cases where a checkpoint file might not have the necessary metadata to correctly determine the configuration to use for the pipeline.
<Tip>
To learn more about how to load single file weights, see the [Load different Stable Diffusion formats](../../using-diffusers/other-formats) loading guide.
</Tip>
## Working with local files
As of `diffusers>=0.28.0` the `from_single_file` method will attempt to configure a pipeline or model by first inferring the model type from the checkpoint file and then using the model type to determine the appropriate model repo configuration to use from the Hugging Face Hub. For example, any single file checkpoint based on the Stable Diffusion XL base model will use the [`stabilityai/stable-diffusion-xl-base-1.0`](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model repo to configure the pipeline.
If you are working in an environment with restricted internet access, it is recommended to download the config files and checkpoints for the model to your preferred directory and pass the local paths to the `pretrained_model_link_or_path` and `config` arguments of the `from_single_file` method.
By default this will download the checkpoints and config files to the [Hugging Face Hub cache directory](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache). You can also specify a local directory to download the files to by passing the `local_dir` argument to the `hf_hub_download` and `snapshot_download` functions.
## Working with local files on file systems that do not support symlinking
By default the `from_single_file` method relies on the `huggingface_hub` caching mechanism to fetch and store checkpoints and config files for models and pipelines. If you are working with a file system that does not support symlinking, it is recommended that you first download the checkpoint file to a local directory and disable symlinking by passing the `local_dir_use_symlink=False` argument to the `hf_hub_download` and `snapshot_download` functions.
Disabling symlinking means that the `huggingface_hub` caching mechanism has no way to determine whether a file has already been downloaded to the local directory. This means that the `hf_hub_download` and `snapshot_download` functions will download files to the local directory each time they are executed. If you are disabling symlinking, it is recommended that you separate the model download and loading steps to avoid downloading the same file multiple times.
</Tip>
## Using the original configuration file of a model
If you would like to configure the parameters of the model components in the pipeline using the orignal YAML configuration file, you can pass a local path or url to the original configuration file to the `original_config` argument of the `from_single_file` method.
In the example above, the `original_config` file is only used to configure the parameters of the individual model components of the pipeline. For example it will be used to configure parameters such as the `in_channels` of the `vae` model and `unet` model. It is not used to determine the type of component objects in the pipeline.
<Tip>
When using `original_config` with local_files_only=True`, Diffusers will attempt to infer the components based on the type signatures of pipeline class, rather than attempting to fetch the pipeline config from the Hugging Face Hub. This is to prevent backwards breaking changes in existing code that might not be able to connect to the internet to fetch the necessary pipeline config files.
This is not as reliable as providing a path to a local config repo and might lead to errors when configuring the pipeline. To avoid this, please run the pipeline with `local_files_only=False` once to download the appropriate pipeline config files to the local cache.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# SD3Transformer2D
This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder or a text encoder and SD3Transformer2DModel, check [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead.
The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs.
<Tip>
To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# UNet
Some training methods - like LoRA and Custom Diffusion - typically target the UNet's attention layers, but these training methods can also target other non-attention layers. Instead of training all of a model's parameters, only a subset of the parameters are trained, which is faster and more efficient. This class is useful if you're *only* loading weights into a UNet. If you need to load weights into the text encoder or a text encoder and UNet, try using the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] function instead.
Some training methods - like LoRA and Custom Diffusion - typically target the UNet's attention layers, but these training methods can also target other non-attention layers. Instead of training all of a model's parameters, only a subset of the parameters are trained, which is faster and more efficient. This class is useful if you're *only* loading weights into a UNet. If you need to load weights into the text encoder or a text encoder and UNet, try using the [`~loaders.LoraLoaderMixin.load_lora_weights`] function instead.
The [`UNet2DConditionLoadersMixin`] class provides functions for loading and saving weights, fusing and unfusing LoRAs, disabling and enabling LoRAs, and setting and deleting adapters.
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AllegroTransformer3DModel
A Diffusion Transformer model for 3D data from [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI.
The model can be loaded with the following code snippet.
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderDC
The 2D Autoencoder model used in [SANA](https://huggingface.co/papers/2410.10629) and introduced in [DCAE](https://huggingface.co/papers/2410.10733) by authors Junyu Chen\*, Han Cai\*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han from MIT HAN Lab.
The abstract from the paper is:
*We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at [this https URL](https://github.com/mit-han-lab/efficientvit).*
The following DCAE models are released and supported in Diffusers.
The `AutoencoderDC` model has `in` and `mix` single file checkpoint variants that have matching checkpoint keys, but use different scaling factors. It is not possible for Diffusers to automatically infer the correct config file to use with the model based on just the checkpoint and will default to configuring the model using the `mix` variant config file. To override the automatically determined config, please use the `config` argument when using single file loading with `in` variant checkpoints.
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLHunyuanVideo
The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo](https://github.com/Tencent/HunyuanVideo/), which was introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent.
The model can be loaded with the following code snippet.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AutoencoderOobleck
The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms.
The abstract from the paper is:
*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLAllegro
The 3D variational autoencoder (VAE) model with KL loss used in [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI.
The model can be loaded with the following code snippet.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLCogVideoX
The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.
The model can be loaded with the following code snippet.
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLMochi
The 3D variational autoencoder (VAE) model with KL loss used in [Mochi](https://github.com/genmoai/models) was introduced in [Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Tsinghua University & ZhipuAI.
The model can be loaded with the following code snippet.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# CogVideoXTransformer3DModel
A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.
The model can be loaded with the following code snippet.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# CogView3PlusTransformer2DModel
A Diffusion Transformer model for 2D data from [CogView3Plus](https://github.com/THUDM/CogView3) was introduced in [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) by Tsinghua University & ZhipuAI.
The model can be loaded with the following code snippet.
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# ControlNetModel
# ControlNet
The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.
@@ -21,7 +21,7 @@ The abstract from the paper is:
## Loading from the original format
By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded
from the original format using [`FromOriginalModelMixin.from_single_file`] as follows:
from the original format using [`FromOriginalControlnetMixin.from_single_file`] as follows:
<!--Copyright 2024 The HuggingFace Team and The InstantX Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# FluxControlNetModel
FluxControlNetModel is an implementation of ControlNet for Flux.1.
The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.
The abstract from the paper is:
*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
## Loading from the original format
By default the [`FluxControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`].
<!--Copyright 2024 The HuggingFace Team and Tencent Hunyuan Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# HunyuanDiT2DControlNetModel
HunyuanDiT2DControlNetModel is an implementation of ControlNet for [Hunyuan-DiT](https://arxiv.org/abs/2405.08748).
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
The abstract from the paper is:
*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan).
## Example For Loading HunyuanDiT2DControlNetModel
<!--Copyright 2024 The HuggingFace Team and The InstantX Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# SD3ControlNetModel
SD3ControlNetModel is an implementation of ControlNet for Stable Diffusion 3.
The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.
The abstract from the paper is:
*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
## Loading from the original format
By default the [`SD3ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`].
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# SparseControlNetModel
SparseControlNetModel is an implementation of ControlNet for [AnimateDiff](https://arxiv.org/abs/2307.04725).
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
The SparseCtrl version of ControlNet was introduced in [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai.
The abstract from the paper is:
*The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).*
<!--Copyright 2024 The HuggingFace Team and The InstantX Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ControlNetUnionModel
ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL.
The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation.
*We design a new architecture that can support 10+ control types in condition text-to-image generation and can generate high resolution images visually comparable with midjourney. The network is based on the original ControlNet architecture, we propose two new modules to: 1 Extend the original ControlNet to support different image conditions using the same network parameter. 2 Support multiple conditions input without increasing computation offload, which is especially important for designers who want to edit image in detail, different conditions use the same condition encoder, without adding extra computations or parameters.*
## Loading
By default the [`ControlNetUnionModel`] should be loaded with [`~ModelMixin.from_pretrained`].
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# HunyuanVideoTransformer3DModel
A Diffusion Transformer model for 3D video-like data was introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent.
The model can be loaded with the following code snippet.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# PixArtTransformer2DModel
A Transformer model for image-like data from [PixArt-Alpha](https://huggingface.co/papers/2310.00426) and [PixArt-Sigma](https://huggingface.co/papers/2403.04692).
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# PriorTransformer
# PriorTransformer
The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# SanaTransformer2DModel
A Diffusion Transformer model for 2D data from [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) was introduced from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.
The abstract from the paper is:
*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*
The model can be loaded with the following code snippet.
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Transformer2DModel
# Transformer2D
A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.
@@ -38,4 +38,4 @@ It is assumed one of the input classes is the masked latent pixel. The predicted
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# Allegro
[Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) from RhymesAI, by Yuan Zhou, Qiuyue Wang, Yuxuan Cai, Huan Yang.
The abstract from the paper is:
*Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models. However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: https://github.com/rhymes-ai/Allegro , Model: https://huggingface.co/rhymes-ai/Allegro , Gallery: https://rhymes.ai/allegro_gallery .*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
@@ -16,7 +16,7 @@ aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface
Amused is a lightweight text to image model based off of the [MUSE](https://arxiv.org/abs/2301.00704) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.
Amused is a vqvae token based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with muse, it uses the smaller text encoder CLIP-L/14 instead of t5-xxl. Due to its small parameter count and few forward pass generation process, amused can generate many images quickly. This benefit is seen particularly at larger batch sizes.
Amused is a vqvae token based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with muse, it uses the smaller text encoder CLIP-L/14 instead of t5-xxl. Due to its small parameter count and few forward pass generation process, amused can generate many images quickly. This benefit is seen particularly at larger batch sizes.
@@ -25,11 +25,7 @@ The abstract of the paper is the following:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* |
| [AnimateDiffControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_controlnet.py) | *Controlled Video-to-Video Generation with AnimateDiff using ControlNet* |
| [AnimateDiffSparseControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sparsectrl.py) | *Controlled Video-to-Video Generation with AnimateDiff using SparseCtrl* |
| [AnimateDiffSDXLPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sdxl.py) | *Video-to-Video Generation with AnimateDiff* |
| [AnimateDiffVideoToVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py) | *Video-to-Video Generation with AnimateDiff* |
| [AnimateDiffVideoToVideoControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video_controlnet.py) | *Video-to-Video Generation with AnimateDiff using ControlNet* |
## Available checkpoints
@@ -82,6 +78,7 @@ output = pipe(
)
frames=output.frames[0]
export_to_gif(frames,"animation.gif")
```
Here are some sample outputs:
@@ -104,266 +101,6 @@ AnimateDiff tends to work better with finetuned Stable Diffusion models. If you
</Tip>
### AnimateDiffControlNetPipeline
AnimateDiff can also be used with ControlNets ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide depth maps, the ControlNet model generates a video that'll preserve the spatial information from the depth maps. It is a more flexible and accurate way to control the video generation process.
# We use AnimateLCM for this example but one can use the original motion adapters as well (for example, https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-3)
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif" alt="racoon playing a guitar" />
</td>
<td align="center">
a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality
<br/>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-controlnet-output.gif" alt="a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality" />
</td>
</tr>
</table>
### AnimateDiffSparseControlNetPipeline
[SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai.
The abstract from the paper is:
*The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).*
SparseCtrl introduces the following checkpoints for controlled text-to-video generation:
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-sparsectrl-scribble-results.gif" alt="an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality" />
prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
negative_prompt="low quality, worst quality",
num_inference_steps=25,
conditioning_frames=image,
controlnet_frame_indices=[0],
controlnet_conditioning_scale=1.0,
generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_gif(video,"output.gif")
```
Here are some sample outputs:
<table align="center">
<tr>
<center>
<b>closeup face photo of man in black clothes, night city street, bokeh, fireworks in background</b>
</center>
</tr>
<tr>
<td>
<center>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png" alt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background" />
</center>
</td>
<td>
<center>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-sparsectrl-rgb-result.gif" alt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background" />
</center>
</td>
</tr>
</table>
### AnimateDiffSDXLPipeline
AnimateDiff can also be used with SDXL models. This is currently an experimental feature as only a beta release of the motion adapter checkpoint is available.
@@ -519,97 +256,6 @@ Here are some sample outputs:
</tr>
</table>
### AnimateDiffVideoToVideoControlNetPipeline
AnimateDiff can be used together with ControlNets to enhance video-to-video generation by allowing for precise control over the output. ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, and allows you to condition Stable Diffusion with an additional control image to ensure that the spatial information is preserved throughout the video.
This pipeline allows you to condition your generation both on the original video and on a sequence of control images.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff_vid2vid_controlnet.gif" alt="astronaut in space, dancing" />
</td>
</tr>
</table>
**The lights and composition were transferred from the Source Video.**
## Using Motion LoRAs
Motion LoRAs are a collection of LoRAs that work with the `guoyww/animatediff-motion-adapter-v1-5-2` checkpoint. These LoRAs are responsible for adding specific types of motion to the animations.
[FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling](https://arxiv.org/abs/2310.15169) by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu.
FreeNoise is a sampling mechanism that can generate longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.
The currently supported AnimateDiff pipelines that can be used with FreeNoise are:
- [`AnimateDiffPipeline`]
- [`AnimateDiffControlNetPipeline`]
- [`AnimateDiffVideoToVideoPipeline`]
- [`AnimateDiffVideoToVideoControlNetPipeline`]
In order to use FreeNoise, a single line needs to be added to the inference code after loading your pipelines.
```diff
+ pipe.enable_free_noise()
```
After this, either a single prompt could be used, or multiple prompts can be passed as a dictionary of integer-string pairs. The integer keys of the dictionary correspond to the frame index at which the influence of that prompt would be maximum. Each frame index should map to a single string prompt. The prompts for intermediate frame indices, that are not passed in the dictionary, are created by interpolating between the frame prompts that are passed. By default, simple linear interpolation is used. However, you can customize this behaviour with a callback to the `prompt_interpolation_callback` parameter when enabling FreeNoise.
Since FreeNoise processes multiple frames together, there are parts in the modeling where the memory required exceeds that available on normal consumer GPUs. The main memory bottlenecks that we identified are spatial and temporal attention blocks, upsampling and downsampling blocks, resnet blocks and feed-forward layers. Since most of these blocks operate effectively only on the channel/embedding dimension, one can perform chunked inference across the batch dimensions. The batch dimension in AnimateDiff are either spatial (`[B x F, H x W, C]`) or temporal (`B x H x W, F, C`) in nature (note that it may seem counter-intuitive, but the batch dimension here are correct, because spatial blocks process across the `B x F` dimension while the temporal blocks process across the `B x H x W` dimension). We introduce a `SplitInferenceModule` that makes it easier to chunk across any dimension and perform inference. This saves a lot of memory but comes at the cost of requiring more time for inference.
```diff
# Load pipeline and adapters
# ...
+ pipe.enable_free_noise_split_inference()
+ pipe.unet.enable_forward_chunking(16)
```
The call to `pipe.enable_free_noise_split_inference` method accepts two parameters: `spatial_split_size` (defaults to `256`) and `temporal_split_size` (defaults to `16`). These can be configured based on how much VRAM you have available. A lower split size results in lower memory usage but slower inference, whereas a larger split size results in faster inference at the cost of more memory.
## Using `from_single_file` with the MotionAdapter
`diffusers>=0.30.0` supports loading the AnimateDiff checkpoints into the `MotionAdapter` in their original format via `from_single_file`
@@ -20,8 +20,8 @@ The abstract of the paper is the following:
*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at [this https URL](https://audioldm.github.io/audioldm2).*
This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi) and [Nguyễn Công Tú Anh](https://github.com/tuanh123789). The original codebase can be
found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).
This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi) and [Nguyễn Công Tú Anh](https://github.com/tuanh123789). The original codebase can be
found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AuraFlow
AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3.md) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark.
It was developed by the Fal team and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/).
<Tip>
AuraFlow can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details.
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# BLIP-Diffusion
BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->
# CogVideoX
[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.
The abstract from the paper is:
*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficently model video data, we propose to levearge a 3D Variational Autoencoder (VAE) to compresses videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effectively text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at https://github.com/THUDM/CogVideo.*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).
There are three official CogVideoX checkpoints for text-to-video and video-to-video.
- Text-to-video (T2V) works best at a resolution of 1360x768 because it was trained with that specific resolution.
- Image-to-video (I2V) works for multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. The height/width must be divisible by 16.
- Both T2V and I2V models support generation with 81 and 161 frames and work best at this value. Exporting videos at 16 FPS is recommended.
There are two official CogVideoX checkpoints that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team).
# CogVideoX works well with long and well-described prompts
prompt="A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
The [T2V benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:
```
Without torch.compile(): Average inference time: 96.89 seconds.
With torch.compile(): Average inference time: 76.27 seconds.
```
### Memory optimization
CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it not possible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script.
-`pipe.enable_model_cpu_offload()`:
- Without enabling cpu offloading, memory usage is `33 GB`
- With enabling cpu offloading, memory usage is `19 GB`
-`pipe.enable_sequential_cpu_offload()`:
- Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slow inference
- When enabled, memory usage is under `4 GB`
-`pipe.vae.enable_tiling()`:
- With enabling cpu offloading and tiling, memory usage is `11 GB`
-`pipe.vae.enable_slicing()`
### Quantized inference
[torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or lower VRAM GPUs!
It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference speed. Additionally, models can be serialized and stored in a quantized datatype to save disk space with torchao. Find examples and benchmarks in the gists below.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->
# CogView3Plus
[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) from Tsinghua University & ZhipuAI, by Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang.
The abstract from the paper is:
*Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges, in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).
<!--Copyright 2024 The HuggingFace Team, The Black Forest Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# FluxControlInpaint
FluxControlInpaintPipeline is an implementation of Inpainting for Flux.1 Depth/Canny models. It is a pipeline that allows you to inpaint images using the Flux.1 Depth/Canny models. The pipeline takes an image and a mask as input and returns the inpainted image.
FLUX.1 Depth and Canny [dev] is a 12 billion parameter rectified flow transformer capable of generating an image based on a text description while following the structure of a given input image. **This is not a ControlNet model**.
Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c).
<!--Copyright 2024 The HuggingFace Team, The InstantX Team, and the XLabs Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ControlNet with Flux.1
FluxControlNetPipeline is an implementation of ControlNet for Flux.1.
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
The abstract from the paper is:
*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
This controlnet code is implemented by [The InstantX Team](https://huggingface.co/InstantX). You can find pre-trained checkpoints for Flux-ControlNet in the table below:
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.