* ⚡️ Speed up method `AutoencoderKLWan.clear_cache` by 886%
**Key optimizations:**
- Compute the number of `WanCausalConv3d` modules in each model (`encoder`/`decoder`) **only once during initialization**, store in `self._cached_conv_counts`. This removes unnecessary repeated tree traversals at every `clear_cache` call, which was the main bottleneck (from profiling).
- The internal helper `_count_conv3d_fast` is optimized via a generator expression with `sum` for efficiency.
All comments from the original code are preserved, except for updated or removed local docstrings/comments relevant to changed lines.
**Function signatures and outputs remain unchanged.**
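A minimal sketch of the count-once idea described above, assuming a toy module tree rather than the real `torch.nn.Module` classes. `Module`, `CausalConv3d`, `ToyVAE`, `_enc_feat_map`, and `_feat_map` are illustrative stand-ins; only `_cached_conv_counts` and `_count_conv3d_fast` are names taken from the commit message, and their bodies here are guesses at the shape of the change, not the merged code.

```python
class Module:
    """Tiny stand-in for torch.nn.Module: holds children and can walk the tree."""

    def __init__(self, *children):
        self.children = list(children)

    def modules(self):
        # Yield this module and every descendant, depth first.
        yield self
        for child in self.children:
            yield from child.modules()


class CausalConv3d(Module):
    """Stand-in for WanCausalConv3d (assumption: only its type matters here)."""


class ToyVAE:
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder
        # Walk each tree exactly once, at construction time.
        self._cached_conv_counts = {
            "encoder": self._count_conv3d_fast(encoder),
            "decoder": self._count_conv3d_fast(decoder),
        }
        self.clear_cache()

    @staticmethod
    def _count_conv3d_fast(model):
        # Generator expression with sum: no intermediate list is built.
        return sum(isinstance(m, CausalConv3d) for m in model.modules())

    def clear_cache(self):
        # Only dictionary lookups here; no module traversal per call.
        self._enc_feat_map = [None] * self._cached_conv_counts["encoder"]
        self._feat_map = [None] * self._cached_conv_counts["decoder"]


encoder = Module(CausalConv3d(), Module(CausalConv3d()))
decoder = Module(CausalConv3d(), CausalConv3d(), Module(CausalConv3d()))
vae = ToyVAE(encoder, decoder)
vae.clear_cache()  # cheap: reuses the counts computed in __init__
```

The trade-off is that the cached counts go stale if convolution modules are added or removed after construction, so this pattern only suits models whose structure is fixed once built.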
* Apply style fixes
* Apply suggestions from code review
Co-authored-by: Aryan <contact.aryanvs@gmail.com>
* Apply style fixes
---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aryan <aryan@huggingface.co>
Co-authored-by: Aryan <contact.aryanvs@gmail.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
* Add Pruna optimization framework documentation
- Introduced a new section for Pruna in the table of contents.
- Added comprehensive documentation for Pruna, detailing its optimization techniques, installation instructions, and examples for optimizing and evaluating models
* Enhance Pruna documentation with image alt text and code block formatting
- Added alt text to images for better accessibility and context.
- Changed code block syntax from diff to python for improved clarity.
* Add installation section to Pruna documentation
- Introduced a new installation section in the Pruna documentation to guide users on how to install the framework.
- Enhanced the overall clarity and usability of the documentation for new users.
* Update pruna.md
* Update pruna.md
* Update Pruna documentation for model optimization and evaluation
- Changed section titles for consistency and clarity, from "Optimizing models" to "Optimize models" and "Evaluating and benchmarking optimized models" to "Evaluate and benchmark models".
- Enhanced descriptions to clarify the use of `diffusers` models and the evaluation process.
- Added a new example for evaluating standalone `diffusers` models.
- Updated references and links for better navigation within the documentation.
* Refactor Pruna documentation for clarity and consistency
- Removed outdated references to FLUX-juiced and streamlined the explanation of benchmarking.
- Enhanced the description of evaluating standalone `diffusers` models.
- Cleaned up code examples by removing unnecessary imports and comments for better readability.
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Enhance Pruna documentation with new examples and clarifications
- Added an image to illustrate the optimization process.
- Updated the explanation for sharing and loading optimized models on the Hugging Face Hub.
- Clarified the evaluation process for optimized models using the EvaluationAgent.
- Improved descriptions for defining metrics and evaluating standalone diffusers models.
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* support text-to-image
* update example
* make fix-copies
* support use_flow_sigmas in EDM scheduler instead of maintain cosmos-specific scheduler
* support video-to-world
* update
* rename text2image pipeline
* make fix-copies
* add t2i test
* add test for v2w pipeline
* support edm dpmsolver multistep
* update
* update
* update
* update tests
* fix tests
* safety checker
* make conversion script work without guardrail
* add clarity in documentation for device_map
* docs
* fix how compiler tester mixins are used.
* propagate
* more
* typo.
* fix tests
* fix order of decorators.
* clarify more.
* more test cases.
* fix doc
* fix device_map docstring in pipeline_utils.
* more examples
* more
* update
* remove code for stuff that is already supported.
* fix stuff.
* allow loading from repo with dot in name
* put new arg at the end to avoid breaking compatibility
* add test for loading repo with dot in name
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update pipeline_flux_inpaint.py to fix padding_mask_crop returning only the inpainted area and not the entire image.
* Apply style fixes
* Update src/diffusers/pipelines/flux/pipeline_flux_inpaint.py
* Add community class StableDiffusionXL_T5Pipeline
Will be used with base model opendiffusionai/stablediffusionxl_t5
* Changed pooled_embeds to use projection instead of slice
* "make style" tweaks
* Added comments to top of code
* Apply style fixes
[examples] flux-control: use num_training_steps_for_scheduler in get_scheduler instead of args.max_train_steps * accelerator.num_processes
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* add guidance rescale
* update docs
* support adaptive instance norm filter
* fix custom timesteps support
* add custom timestep example to docs
* add a note about best generation settings being available only in the original repository
* use original org hub ids instead of personal
* make fix-copies
---------
Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>
* [gguf] Refactor __torch_function__ to avoid unnecessary computation
This helps with torch.compile compilation latency. Avoiding unnecessary
computation should also lead to a slightly improved eager latency.
* Apply style fixes
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* feat: pipeline-level quant config.
Co-authored-by: SunMarc <marc.sun@hotmail.fr>
condition better.
support mapping.
improvements.
[Quantization] Add Quanto backend (#10756)
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Update docs/source/en/quantization/quanto.md
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Update src/diffusers/quantizers/quanto/utils.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* update
* update
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
[Single File] Add single file loading for SANA Transformer (#10947)
* added support for from_single_file
* added diffusers mapping script
* added testcase
* bug fix
* updated tests
* corrected code quality
* corrected code quality
---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
[LoRA] Improve warning messages when LoRA loading becomes a no-op (#10187)
* updates
* updates
* updates
* updates
* notebooks revert
* fix-copies.
* seeing
* fix
* revert
* fixes
* fixes
* fixes
* remove print
* fix
* conflicts ii.
* updates
* fixes
* better filtering of prefix.
---------
Co-authored-by: hlky <hlky@hlky.ac>
[LoRA] CogView4 (#10981)
* update
* make fix-copies
* update
[Tests] improve quantization tests by additionally measuring the inference memory savings (#11021)
* memory usage tests
* fixes
* gguf
[`Research Project`] Add AnyText: Multilingual Visual Text Generation And Editing (#8998)
* Add initial template
* Second template
* feat: Add TextEmbeddingModule to AnyTextPipeline
* feat: Add AuxiliaryLatentModule template to AnyTextPipeline
* Add bert tokenizer from the anytext repo for now
* feat: Update AnyTextPipeline's modify_prompt method
This commit adds improvements to the modify_prompt method in the AnyTextPipeline class. The method now handles special characters and replaces selected string prompts with a placeholder. Additionally, it includes a check for Chinese text and translation using the trans_pipe.
* Fill in the `forward` pass of `AuxiliaryLatentModule`
* `make style && make quality`
* `chore: Update bert_tokenizer.py with a TODO comment suggesting the use of the transformers library`
* Update error handling to raise and logging
* Add `create_glyph_lines` function into `TextEmbeddingModule`
* make style
* Up
* Up
* Up
* Up
* Remove several comments
* refactor: Remove ControlNetConditioningEmbedding and update code accordingly
* Up
* Up
* up
* refactor: Update AnyTextPipeline to include new optional parameters
* up
* feat: Add OCR model and its components
* chore: Update `TextEmbeddingModule` to include OCR model components and dependencies
* chore: Update `AuxiliaryLatentModule` to include VAE model and its dependencies for masked image in the editing task
* `make style`
* refactor: Update `AnyTextPipeline`'s docstring
* Update `AuxiliaryLatentModule` to include info dictionary so that text processing is done once
* simplify
* `make style`
* Converting `TextEmbeddingModule` to ordinary `encode_prompt()` function
* Simplify for now
* `make style`
* Up
* feat: Add scripts to convert AnyText controlnet to diffusers
* `make style`
* Fix: Move glyph rendering to `TextEmbeddingModule` from `AuxiliaryLatentModule`
* make style
* Up
* Simplify
* Up
* feat: Add safetensors module for loading model file
* Fix device issues
* Up
* Up
* refactor: Simplify
* refactor: Simplify code for loading models and handling data types
* `make style`
* refactor: Update to() method in FrozenCLIPEmbedderT3 and TextEmbeddingModule
* refactor: Update dtype in embedding_manager.py to match proj.weight
* Up
* Add attribution and adaptation information to pipeline_anytext.py
* Update usage example
* Will refactor `controlnet_cond_embedding` initialization
* Add `AnyTextControlNetConditioningEmbedding` template
* Refactor organization
* style
* style
* Move custom blocks from `AuxiliaryLatentModule` to `AnyTextControlNetConditioningEmbedding`
* Follow one-file policy
* style
* [Docs] Update README and pipeline_anytext.py to use AnyTextControlNetModel
* [Docs] Update import statement for AnyTextControlNetModel in pipeline_anytext.py
* [Fix] Update import path for ControlNetModel, ControlNetOutput in anytext_controlnet.py
* Refactor AnyTextControlNet to use configurable conditioning embedding channels
* Complete control net conditioning embedding in AnyTextControlNetModel
* up
* [FIX] Ensure embeddings use correct device in AnyTextControlNetModel
* up
* up
* style
* [UPDATE] Revise README and example code for AnyTextPipeline integration with DiffusionPipeline
* [UPDATE] Update example code in anytext.py to use correct font file and improve clarity
* down
* [UPDATE] Refactor BasicTokenizer usage to a new Checker class for text processing
* update pillow
* [UPDATE] Remove commented-out code and unnecessary docstring in anytext.py and anytext_controlnet.py for improved clarity
* [REMOVE] Delete frozen_clip_embedder_t3.py as it is in the anytext.py file
* [UPDATE] Replace edict with dict for configuration in anytext.py and RecModel.py for consistency
* 🆙
* style
* [UPDATE] Revise README.md for clarity, remove unused imports in anytext.py, and add author credits in anytext_controlnet.py
* style
* Update examples/research_projects/anytext/README.md
Co-authored-by: Aryan <contact.aryanvs@gmail.com>
* Remove commented-out image preparation code in AnyTextPipeline
* Remove unnecessary blank line in README.md
[Quantization] Allow loading TorchAO serialized Tensor objects with torch>=2.6 (#11018)
* update
* update
* update
* update
* update
* update
* update
* update
* update
fix: mixture tiling sdxl pipeline - adjust generating time_ids & embeddings (#11012)
small fix on generating time_ids & embeddings
[LoRA] support wan i2v loras from the world. (#11025)
* support wan i2v loras from the world.
* remove copied from.
* updates
* add lora.
Fix SD3 IPAdapter feature extractor (#11027)
chore: fix help messages in advanced diffusion examples (#10923)
Fix missing **kwargs in lora_pipeline.py (#11011)
* Update lora_pipeline.py
* Apply style fixes
* fix-copies
---------
Co-authored-by: hlky <hlky@hlky.ac>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Fix for multi-GPU WAN inference (#10997)
Ensure that hidden_state and shift/scale are on the same device when running with multiple GPUs
Co-authored-by: Jimmy <39@🇺🇸.com>
[Refactor] Clean up import utils boilerplate (#11026)
* update
* update
* update
Use `output_size` in `repeat_interleave` (#11030)
[hybrid inference 🍯🐝] Add VAE encode (#11017)
* [hybrid inference 🍯🐝] Add VAE encode
* _toctree: add vae encode
* Add endpoints, tests
* vae_encode docs
* vae encode benchmarks
* api reference
* changelog
* Update docs/source/en/hybrid_inference/overview.md
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* update
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Wan Pipeline scaling fix, type hint warning, multi generator fix (#11007)
* Wan Pipeline scaling fix, type hint warning, multi generator fix
* Apply suggestions from code review
[LoRA] change to warning from info when notifying the users about a LoRA no-op (#11044)
* move to warning.
* test related changes.
Rename Lumina(2)Text2ImgPipeline -> Lumina(2)Pipeline (#10827)
* Rename Lumina(2)Text2ImgPipeline -> Lumina(2)Pipeline
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
making ```formatted_images``` initialization compact (#10801)
compact writing
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Fix aclnnRepeatInterleaveIntWithDim error on NPU for get_1d_rotary_pos_embed (#10820)
* get_1d_rotary_pos_embed support npu
* Update src/diffusers/models/embeddings.py
---------
Co-authored-by: Kai zheng <kaizheng@KaideMacBook-Pro.local>
Co-authored-by: hlky <hlky@hlky.ac>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
[Tests] restrict memory tests for quanto for certain schemes. (#11052)
* restrict memory tests for quanto for certain schemes.
* Apply suggestions from code review
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
* fixes
* style
---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
[LoRA] feat: support non-diffusers wan t2v loras. (#11059)
feat: support non-diffusers wan t2v loras.
[examples/controlnet/train_controlnet_sd3.py] Fixes #11050 - Cast prompt_embeds and pooled_prompt_embeds to weight_dtype to prevent dtype mismatch (#11051)
Fix: dtype mismatch of prompt embeddings in sd3 controlnet training
Co-authored-by: Andreas Jörg <andreasjoerg@MacBook-Pro-von-Andreas-2.fritz.box>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
reverts accidental change that removes attn_mask in attn. Improves fl… (#11065)
reverts accidental change that removes attn_mask in attn. Improves flux ptxla by using flash block sizes. Moves encoding outside the for loop.
Co-authored-by: Juan Acevedo <jfacevedo@google.com>
Fix deterministic issue when getting pipeline dtype and device (#10696)
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
[Tests] add requires peft decorator. (#11037)
* add requires peft decorator.
* install peft conditionally.
* conditional deps.
Co-authored-by: DN6 <dhruv.nair@gmail.com>
---------
Co-authored-by: DN6 <dhruv.nair@gmail.com>
CogView4 Control Block (#10809)
* cogview4 control training
---------
Co-authored-by: OleehyO <leehy0357@gmail.com>
Co-authored-by: yiyixuxu <yixu310@gmail.com>
[CI] pin transformers version for benchmarking. (#11067)
pin transformers version for benchmarking.
updates
Fix Wan I2V Quality (#11087)
* fix_wan_i2v_quality
* Update src/diffusers/pipelines/wan/pipeline_wan_i2v.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update src/diffusers/pipelines/wan/pipeline_wan_i2v.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update src/diffusers/pipelines/wan/pipeline_wan_i2v.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update pipeline_wan_i2v.py
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: hlky <hlky@hlky.ac>
LTX 0.9.5 (#10968)
* update
---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: hlky <hlky@hlky.ac>
make PR GPU tests conditioned on styling. (#11099)
Group offloading improvements (#11094)
update
Fix pipeline_flux_controlnet.py (#11095)
* Fix pipeline_flux_controlnet.py
* Fix style
update readme instructions. (#11096)
Co-authored-by: Juan Acevedo <jfacevedo@google.com>
Resolve stride mismatch in UNet's ResNet to support Torch DDP (#11098)
Modify UNet's ResNet implementation to resolve stride mismatch in Torch's DDP
Fix Group offloading behaviour when using streams (#11097)
* update
* update
Quality options in `export_to_video` (#11090)
* Quality options in `export_to_video`
* make style
improve more.
add placeholders for docstrings.
formatting.
smol fix.
solidify validation and annotation
* Revert "feat: pipeline-level quant config."
This reverts commit 316ff46b76.
* feat: implement pipeline-level quantization config
Co-authored-by: SunMarc <marc@huggingface.co>
* update
* fixes
* fix validation.
* add tests and other improvements.
* add tests
* import quality
* remove prints.
* add docs.
* fixes to docs.
* doc fixes.
* doc fixes.
* add validation to the input quantization_config.
* clarify recommendations.
* docs
* add to ci.
* todo.
---------
Co-authored-by: SunMarc <marc@huggingface.co>
* test permission
* Add cross attention type for Sana-Sprint.
* Add Sana-Sprint training script in diffusers.
* make style && make quality;
* modify the attention processor with `set_attn_processor` and change `SanaAttnProcessor3_0` to `SanaVanillaAttnProcessor`
* Add import for SanaVanillaAttnProcessor
* Add README file.
* Apply suggestions from code review
* style
* Update examples/research_projects/sana/README.md
---------
Co-authored-by: lawrence-cj <cjs1020440147@icloud.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* begin transformer conversion
* refactor
* refactor
* refactor
* refactor
* refactor
* refactor
* update
* add conversion script
* add pipeline
* make fix-copies
* remove einops
* update docs
* gradient checkpointing
* add transformer test
* update
* debug
* remove prints
* match sigmas
* add vae pt. 1
* finish CV* vae
* update
* update
* update
* update
* update
* update
* make fix-copies
* update
* make fix-copies
* fix
* update
* update
* make fix-copies
* update
* update tests
* handle device and dtype for safety checker; required in latest diffusers
* remove enable_gqa and use repeat_interleave instead
* enforce safety checker; use dummy checker in fast tests
* add review suggestion for ONNX export
Co-Authored-By: Asfiya Baig <asfiyab@nvidia.com>
* fix safety_checker issues when not passed explicitly
We could either do what's done in this commit, or update the Cosmos examples to explicitly pass the safety checker
* use cosmos guardrail package
* auto format docs
* update conversion script to support 14B models
* update name CosmosPipeline -> CosmosTextToWorldPipeline
* update docs
* fix docs
* fix group offload test failing for vae
---------
Co-authored-by: Asfiya Baig <asfiyab@nvidia.com>
* [train_controlnet_sdxl] Add LANCZOS as the default interpolation mode for image resizing
* [train_dreambooth_lora_flux_advanced] Add LANCZOS as the default interpolation mode for image resizing
* 1. add pre-computation of prompt embeddings when custom prompts are used as well
2. save model card even if model is not pushed to hub
3. remove scheduler initialization from code example - not necessary anymore (it's now in the base model's config)
4. add skip_final_inference - allows running with validation while skipping the final loading of the pipeline with the LoRA weights, to reduce memory requirements
* pre encode validation prompt as well
* Update examples/dreambooth/train_dreambooth_lora_hidream.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update examples/dreambooth/train_dreambooth_lora_hidream.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* Update examples/dreambooth/train_dreambooth_lora_hidream.py
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* pre encode validation prompt as well
* Apply style fixes
* empty commit
* change default trained modules
* empty commit
* address comments + change encoding of validation prompt (before it was only pre-encoded if custom prompts are provided, but should be pre-encoded either way)
* Apply style fixes
* empty commit
* fix validation_embeddings definition
* fix final inference condition
* fix pipeline deletion in last inference
* Apply style fixes
* empty commit
* layers
* remove readme remarks on only pre-computing when instance prompt is provided and change example to 3d icons
* smol fix
* empty commit
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* fix issue that training flux controlnet was unstable and validation results were unstable
* del unused code pieces, fix grammar
---------
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Fix: Inherit `StableDiffusionXLLoraLoaderMixin`
`StableDiffusionXLControlNetAdapterInpaintPipeline`
used to incorrectly inherit
`StableDiffusionLoraLoaderMixin`
instead of `StableDiffusionXLLoraLoaderMixin`
* Update pe_selection_index_based_on_dim
* Make pe_selection_index_based_on_dim work with torch.compile
* Fix AuraFlowTransformer2DModel's docstring default values
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
loosen expected_max_diff from 5e-1 to 8e-1 to make
KandinskyV22PipelineInpaintCombinedFastTests::test_float16_inference
pass on XPU
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Previously, if txt_ids was a 3D tensor, the line using txt_ids[:1] concatenated txt_ids along the batch dim. Now we first check that txt_ids is a 2D tensor (or take the first batch element) and then concatenate along the token dim.
* loosen test_float16_inference's tolerance from 5e-2 to 6e-2, so XPU can
pass UT
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
* fix test_pipeline_flux_redux fail on XPU
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
---------
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
* [WIP][LoRA] Implement hot-swapping of LoRA
This PR adds the possibility to hot-swap LoRA adapters. It is WIP.
Description
As of now, users can already load multiple LoRA adapters. They can
offload existing adapters or they can unload them (i.e. delete them).
However, they cannot "hotswap" adapters yet, i.e. substitute the weights
from one LoRA adapter with the weights of another, without the need to
create a separate LoRA adapter.
Generally, hot-swapping may not appear super useful, but when the
model is compiled, it is necessary to prevent recompilation. See #9279
for more context.
Caveats
To hot-swap a LoRA adapter for another, these two adapters should target
exactly the same layers and the "hyper-parameters" of the two adapters
should be identical. For instance, the LoRA alpha has to be the same:
Given that we keep the alpha from the first adapter, the LoRA scaling
would be incorrect for the second adapter otherwise.
Theoretically, we could override the scaling dict with the alpha values
derived from the second adapter's config, but changing the dict will
trigger a guard for recompilation, defeating the main purpose of the
feature.
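To make the alpha caveat concrete, here is a small sketch of a compatibility check, assuming the usual LoRA scaling factor of alpha divided by rank. `ToyLoraConfig` and `hotswap_problems` are hypothetical names for illustration only; they are not part of PEFT or diffusers, and the real check may cover more hyper-parameters than the two shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLoraConfig:
    """Hypothetical, minimal stand-in for a LoRA adapter config."""

    r: int
    lora_alpha: int
    target_modules: frozenset

    @property
    def scaling(self) -> float:
        # Usual LoRA scaling factor: alpha divided by rank.
        return self.lora_alpha / self.r


def hotswap_problems(loaded: ToyLoraConfig, incoming: ToyLoraConfig) -> list:
    """Return the reasons why `incoming` cannot replace `loaded` in place."""
    problems = []
    if loaded.target_modules != incoming.target_modules:
        problems.append("target layers differ")
    if loaded.lora_alpha != incoming.lora_alpha:
        # The first adapter's alpha is kept, so the second adapter's weights
        # would be scaled with the wrong factor.
        problems.append("lora_alpha differs")
    return problems


first = ToyLoraConfig(r=8, lora_alpha=16, target_modules=frozenset({"to_q", "to_v"}))
same_shape = ToyLoraConfig(r=8, lora_alpha=16, target_modules=frozenset({"to_q", "to_v"}))
other_alpha = ToyLoraConfig(r=8, lora_alpha=32, target_modules=frozenset({"to_q", "to_v"}))

print(hotswap_problems(first, same_shape))
print(hotswap_problems(first, other_alpha))
print(first.scaling, other_alpha.scaling)
```

In this toy case the second adapter expects a scaling of 4.0 but would run with the retained 2.0, which is the mismatch the caveat warns about.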
I also found that compilation flags can have an impact on whether this
works or not. E.g. when passing "reduce-overhead", there will be errors
of the type:
> input name: arg861_1. data pointer changed from 139647332027392 to
139647331054592
I don't know enough about compilation to determine whether this is
problematic or not.
Current state
This is obviously WIP right now to collect feedback and discuss which
direction to take this. If this PR turns out to be useful, the
hot-swapping functions will be added to PEFT itself and can be imported
here (or there is a separate copy in diffusers to avoid the need for a
min PEFT version to use this feature).
Moreover, more tests need to be added to better cover this feature,
although we don't necessarily need tests for the hot-swapping
functionality itself, since those tests will be added to PEFT.
Furthermore, as of now, this is only implemented for the unet. Other
pipeline components have yet to implement this feature.
Finally, it should be properly documented.
I would like to collect feedback on the current state of the PR before
putting more time into finalizing it.
* Reviewer feedback
* Reviewer feedback, adjust test
* Fix, doc
* Make fix
* Fix for possible g++ error
* Add test for recompilation w/o hotswapping
* Make hotswap work
Requires https://github.com/huggingface/peft/pull/2366
More changes to make hotswapping work. Together with the mentioned PEFT
PR, the tests pass for me locally.
List of changes:
- docstring for hotswap
- remove code copied from PEFT, import from PEFT now
- adjustments to PeftAdapterMixin.load_lora_adapter (unfortunately, some
state dict renaming was necessary, LMK if there is a better solution)
- adjustments to UNet2DConditionLoadersMixin._process_lora: LMK if this
is even necessary or not, I'm unsure what the overall relationship is
between this and PeftAdapterMixin.load_lora_adapter
- also in UNet2DConditionLoadersMixin._process_lora, I saw that there is
no LoRA unloading when loading the adapter fails, so I added it
there (in line with what happens in PeftAdapterMixin.load_lora_adapter)
- rewritten tests to avoid shelling out, make the test more precise by
making sure that the outputs align, parametrize it
- also checked the pipeline code mentioned in this comment:
https://github.com/huggingface/diffusers/pull/9453#issuecomment-2418508871;
when running this inside the with
torch._dynamo.config.patch(error_on_recompile=True) context, there is
no error, so I think hotswapping is now working with pipelines.
* Address reviewer feedback:
- Revert deprecated method
- Fix PEFT doc link to main
- Don't use private function
- Clarify magic numbers
- Add pipeline test
Moreover:
- Extend docstrings
- Extend existing test for outputs != 0
- Extend existing test for wrong adapter name
* Change order of test decorators
parameterized.expand seems to ignore skip decorators if added in last
place (i.e. innermost decorator).
* Split model and pipeline tests
Also increase test coverage by also targeting conv2d layers (support of
which was added recently on the PEFT PR).
* Reviewer feedback: Move decorator to test classes
... instead of having them on each test method.
* Apply suggestions from code review
Co-authored-by: hlky <hlky@hlky.ac>
* Reviewer feedback: version check, TODO comment
* Add enable_lora_hotswap method
* Reviewer feedback: check _lora_loadable_modules
* Revert changes in unet.py
* Add possibility to ignore enabled at wrong time
* Fix docstrings
* Log possible PEFT error, test
* Raise helpful error if hotswap not supported
I.e. for the text encoder
* Formatting
* More linter
* More ruff
* Doc-builder complaint
* Update docstring:
- mention no text encoder support yet
- make it clear that LoRA is meant
- mention that same adapter name should be passed
* Fix error in docstring
* Update more methods with hotswap argument
- SDXL
- SD3
- Flux
No changes were made to load_lora_into_transformer.
* Add hotswap argument to load_lora_into_transformer
For SD3 and Flux. Use shorter docstring for brevity.
* Extend docstrings
* Add version guards to tests
* Formatting
* Fix LoRA loading call to add prefix=None
See:
https://github.com/huggingface/diffusers/pull/10187#issuecomment-2717571064
* Run make fix-copies
* Add hot swap documentation to the docs
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: hlky <hlky@hlky.ac>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Refactor `LTXConditionPipeline` to add text-only conditioning
* style
* up
* Refactor `LTXConditionPipeline` to streamline condition handling and improve clarity
* Improve condition checks
* Simplify latents handling based on conditioning type
* Refactor rope_interpolation_scale preparation for clarity and efficiency
* Update LTXConditionPipeline docstring to clarify supported input types
* Add LTX Video 0.9.5 model to documentation
* Clarify documentation to indicate support for text-only conditioning without passing `conditions`
* refactor: comment out unused parameters in LTXConditionPipeline
* fix: restore previously commented parameters in LTXConditionPipeline
* fix: remove unused parameters from LTXConditionPipeline
* refactor: remove unnecessary lines in LTXConditionPipeline
* model card gen code
* push modelcard creation
* remove optional from params
* add import
* add use_dora check
* correct lora var use in tags
* make style && make quality
---------
Co-authored-by: Aryan <aryan@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* allow models to run with a user-provided dtype map instead of a single dtype
* make style
* Add warning, change `_` to `default`
* make style
* add test
* handle shared tensors
* remove warning
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
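A rough sketch of how a per-component dtype map with a `default` key could be resolved. `resolve_dtype` is a hypothetical helper, dtypes are plain strings to keep the example dependency-free, and the `float32` fallback when no `default` entry exists is an assumption rather than something stated in these commits.

```python
def resolve_dtype(component_name, dtype_map, fallback="float32"):
    """Pick the dtype for one pipeline component from a user-provided map."""
    if component_name in dtype_map:
        # An explicit per-component entry wins.
        return dtype_map[component_name]
    # Otherwise use the shared "default" entry, then the assumed fallback.
    return dtype_map.get("default", fallback)


dtype_map = {"transformer": "bfloat16", "default": "float16"}

for name in ("transformer", "vae", "text_encoder"):
    print(name, resolve_dtype(name, dtype_map))
```

Passing a single dtype remains the simpler path; the map only matters when one component needs a different precision from the rest.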
set self._hf_peft_config_loaded to True on successful lora load
Sets the `_hf_peft_config_loaded` flag if a LoRA is successfully loaded in `load_lora_adapter`. Fixes bug huggingface/diffusers/issues/11148
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* [Documentation] Update README and example code with additional usage instructions for AnyText
* [Documentation] Update README for AnyTextPipeline and improve logging in code
* Remove wget command for font file from example docstring in anytext.py
* Don't use `torch_dtype` when `quantization_config` is set
* up
* djkajka
* Apply suggestions from code review
---------
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* fix bug when pixart-dmd inference with `num_inference_steps=1`
* use return_dict=False and return [1] element for 1-step pixart model, which works for both lcm and dmd
PIPELINE_USAGE_CUTOFF: 1000000000 # set high cutoff so that only always-test pipelines run
jobs:
  check_code_quality:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install .[quality]
      - name: Check quality
        run: make quality
      - name: Check if failure
        if: ${{ failure() }}
        run: |
          echo "Quality check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY
  check_repository_consistency:
    needs: check_code_quality
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install .[quality]
      - name: Check repo consistency
        run: |
          python utils/check_copies.py
          python utils/check_dummies.py
          python utils/check_support_list.py
          make deps_table_check_updated
      - name: Check if failure
        if: ${{ failure() }}
        run: |
          echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -11,39 +11,20 @@ specific language governing permissions and limitations under the License. -->
# Caching methods
## Pyramid Attention Broadcast
Cache methods speed up diffusion transformers by storing and reusing intermediate outputs of specific layers, such as attention and feedforward layers, instead of recalculating them at each inference step.
[Pyramid Attention Broadcast](https://huggingface.co/papers/2408.12588) from Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang You.
Pyramid Attention Broadcast (PAB) is a method that speeds up inference in diffusion models by systematically skipping attention computations between successive inference steps and reusing cached attention states. The attention states are not very different between successive inference steps. The most prominent difference is in the spatial attention blocks, not as much in the temporal attention blocks, and finally the least in the cross attention blocks. Therefore, many cross attention computation blocks can be skipped, followed by the temporal and spatial attention blocks. By combining other techniques like sequence parallelism and classifier-free guidance parallelism, PAB achieves near real-time video generation.
Enable PAB with [`~PyramidAttentionBroadcastConfig`] on any pipeline. For some benchmarks, refer to [this](https://github.com/huggingface/diffusers/pull/9562) pull request.
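
The reuse pattern described above can be illustrated with a toy cache. This is a minimal sketch, not the diffusers implementation: `BroadcastCache`, `SKIP_RANGES`, and `fake_attention` are hypothetical names, and the skip ranges are illustrative values chosen so that cross attention is recomputed least often, then temporal, then spatial. Real PAB also restricts caching to a range of timesteps, which is omitted here.

```python
from typing import Callable, Dict

# Hypothetical skip ranges: recompute every N-th step, reuse the cache otherwise.
# Cross attention changes least between steps, so it gets the largest range.
SKIP_RANGES: Dict[str, int] = {"spatial": 2, "temporal": 4, "cross": 6}


class BroadcastCache:
    """Toy cache that reuses a block's last output for a number of steps."""

    def __init__(self, skip_ranges: Dict[str, int]) -> None:
        self.skip_ranges = skip_ranges
        self.cache: Dict[str, float] = {}
        self.computed: Dict[str, int] = {name: 0 for name in skip_ranges}

    def run(self, name: str, step: int, compute: Callable[[int], float]) -> float:
        # Recompute when nothing is cached yet or when the step is a multiple
        # of the block's skip range; otherwise return the cached output.
        if name not in self.cache or step % self.skip_ranges[name] == 0:
            self.cache[name] = compute(step)
            self.computed[name] += 1
        return self.cache[name]


def fake_attention(step: int) -> float:
    # Stand-in for an expensive attention computation.
    return float(step)


cache = BroadcastCache(SKIP_RANGES)
for step in range(12):
    for block in SKIP_RANGES:
        cache.run(block, step, fake_attention)

# Number of real computations per block type over 12 steps.
print(cache.computed)
```

Over 12 steps the toy spatial block is computed 6 times, the temporal block 3 times, and the cross attention block twice, which mirrors the ordering described in the paragraph above.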
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -20,11 +20,15 @@ LoRA is a fast and lightweight training method that inserts and trains a signifi
- [`FluxLoraLoaderMixin`] provides similar functions for [Flux](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux).
- [`CogVideoXLoraLoaderMixin`] provides similar functions for [CogVideoX](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox).
- [`Mochi1LoraLoaderMixin`] provides similar functions for [Mochi](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi).
- [`AuraFlowLoraLoaderMixin`] provides similar functions for [AuraFlow](https://huggingface.co/fal/AuraFlow).
- [`LTXVideoLoraLoaderMixin`] provides similar functions for [LTX-Video](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
- [`SanaLoraLoaderMixin`] provides similar functions for [Sana](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana).
- [`HunyuanVideoLoraLoaderMixin`] provides similar functions for [HunyuanVideo](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hunyuan_video).
- [`Lumina2LoraLoaderMixin`] provides similar functions for [Lumina2](https://huggingface.co/docs/diffusers/main/en/api/pipelines/lumina2).
- [`WanLoraLoaderMixin`] provides similar functions for [Wan](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan).
- [`CogView4LoraLoaderMixin`] provides similar functions for [CogView4](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogview4).
- [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`].
- [`HiDreamImageLoraLoaderMixin`] provides similar functions for [HiDream Image](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hidream).
- [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, and unload LoRAs, and more.
<Tip>
@@ -56,6 +60,9 @@ To learn more about how to load LoRA weights, see the [LoRA](../../using-diffuse
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# AsymmetricAutoencoderKL
Improved larger variational autoencoder (VAE) model with KL loss for inpainting task: [Designing a Better Asymmetric VQGAN for StableDiffusion](https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.
Improved larger variational autoencoder (VAE) model with KL loss for inpainting task: [Designing a Better Asymmetric VQGAN for StableDiffusion](https://huggingface.co/papers/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AutoModel
The `AutoModel` is designed to make it easy to load a checkpoint without needing to know the specific model class. `AutoModel` automatically retrieves the correct model class from the checkpoint `config.json` file.
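
The idea of resolving a class from the checkpoint config can be sketched as follows. This is a simplified illustration, not the actual `AutoModel` code: `ToyUNet`, `ToyTransformer`, `REGISTRY`, and `load_auto` are hypothetical, and the sketch assumes the class name is stored under a `_class_name` key in `config.json`.

```python
import json
from typing import Dict


class ToyUNet:
    def __init__(self, config: dict) -> None:
        self.config = config


class ToyTransformer:
    def __init__(self, config: dict) -> None:
        self.config = config


# Hypothetical registry mapping the class name stored in the config to a class.
REGISTRY: Dict[str, type] = {"ToyUNet": ToyUNet, "ToyTransformer": ToyTransformer}


def load_auto(config_json: str):
    """Instantiate whichever registered class the config names."""
    config = json.loads(config_json)
    class_name = config["_class_name"]  # assumed key name
    if class_name not in REGISTRY:
        raise ValueError(f"Unknown model class: {class_name}")
    return REGISTRY[class_name](config)


model = load_auto('{"_class_name": "ToyTransformer", "num_layers": 2}')
print(type(model).__name__)
```

The caller never names the concrete class; the config alone determines which one is built.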
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# AutoencoderKL
The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.
The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://huggingface.co/papers/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->
# ConsisIDTransformer3DModel
A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) was introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by Peking University & University of Rochester & etc.
A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) was introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://huggingface.co/papers/2411.17440) by Peking University & University of Rochester & etc.
The model can be loaded with the following code snippet.
<!--Copyright 2024 The HuggingFace Team and Tencent Hunyuan Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team and Tencent Hunyuan Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# HunyuanDiT2DControlNetModel
HunyuanDiT2DControlNetModel is an implementation of ControlNet for [Hunyuan-DiT](https://arxiv.org/abs/2405.08748).
HunyuanDiT2DControlNetModel is an implementation of ControlNet for [Hunyuan-DiT](https://huggingface.co/papers/2405.08748).
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# SanaControlNetModel
The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.
The abstract from the paper is:
*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*
This model was contributed by [ishan24](https://huggingface.co/ishan24). ❤️
The original codebase can be found at [NVlabs/Sana](https://github.com/NVlabs/Sana), and you can find official ControlNet checkpoints on [Efficient-Large-Model's](https://huggingface.co/Efficient-Large-Model) Hub profile.
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -11,11 +11,11 @@ specific language governing permissions and limitations under the License. -->
# SparseControlNetModel
SparseControlNetModel is an implementation of ControlNet for [AnimateDiff](https://arxiv.org/abs/2307.04725).
SparseControlNetModel is an implementation of ControlNet for [AnimateDiff](https://huggingface.co/papers/2307.04725).
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
The SparseCtrl version of ControlNet was introduced in [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai.
The SparseCtrl version of ControlNet was introduced in [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://huggingface.co/papers/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai.
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# CosmosTransformer3DModel
A Diffusion Transformer model for 3D video-like data was introduced in [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.
The model can be loaded with the following code snippet.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2401.01808) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen.
Amused is a lightweight text to image model based off of the [MUSE](https://arxiv.org/abs/2301.00704) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.
Amused is a lightweight text to image model based off of the [MUSE](https://huggingface.co/papers/2301.00704) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once.
Amused is a VQ-VAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with MUSE, it uses the smaller text encoder CLIP-L/14 instead of T5-XXL. Because of its small parameter count and the few forward passes it needs per image, Amused can generate many images quickly. This benefit is most noticeable at larger batch sizes.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -18,7 +18,7 @@ specific language governing permissions and limitations under the License.
## Overview
[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai.
[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://huggingface.co/papers/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai.
The abstract of the paper is the following:
@@ -187,7 +187,7 @@ Here are some sample outputs:
### AnimateDiffSparseControlNetPipeline
[SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai.
[SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://huggingface.co/papers/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai.
[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://arxiv.org/abs/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.
[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://huggingface.co/papers/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.
FreeInit is an effective method that improves temporal consistency and overall quality of videos generated using video-diffusion-models without any additional training. It can be applied to AnimateDiff, ModelScope, VideoCrafter and various other video generation models seamlessly at inference time, and works by iteratively refining the latent-initialization noise. More details can be found in the paper.
[FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling](https://arxiv.org/abs/2310.15169) by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu.
[FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling](https://huggingface.co/papers/2310.15169) by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu.
FreeNoise is a sampling mechanism that can generate longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.
@@ -966,7 +966,7 @@ pipe.to("cuda")
prompt={
0:"A caterpillar on a leaf, high quality, photorealistic",
40:"A caterpillar transforming into a cocoon, on a leaf, near flowers, photorealistic",
80:"A cocoon on a leaf, flowers in the backgrond, photorealistic",
80:"A cocoon on a leaf, flowers in the background, photorealistic",
120:"A cocoon maturing and a butterfly being born, flowers and leaves visible in the background, photorealistic",
160:"A beautiful butterfly, vibrant colors, sitting on a leaf, flowers in the background, photorealistic",
200:"A beautiful butterfly, flying away in a forest, photorealistic",
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# AudioLDM 2
AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://huggingface.co/papers/2308.05734) by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap) and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention conditioning, as in most other LDMs.
AuraFlow can be compiled with `torch.compile()` to reduce inference latency even for different resolutions. First, install PyTorch nightly following the instructions from [here](https://pytorch.org/). The snippet below shows the changes needed to enable this:
Setting `use_duck_shape` to `False` tells the compiler not to use the same symbolic variable to represent input sizes that happen to be the same. For more details, check out [this comment](https://github.com/huggingface/diffusers/pull/11327#discussion_r2047659790).
This yields speed improvements ranging from 100% (at low resolutions) to 30% (at 1536x1536 resolution).
Thanks to [AstraliteHeart](https://github.com/huggingface/diffusers/pull/11297/) who helped us rewrite the [`AuraFlowTransformer2DModel`] class so that the above works for different resolutions ([PR](https://github.com/huggingface/diffusers/pull/11297/)).
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
# BLIP-Diffusion
BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://huggingface.co/papers/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.
Chroma is a text to image generation model based on Flux.
Original model checkpoints for Chroma can be found [here](https://huggingface.co/lodestones/Chroma).
<Tip>
Chroma can use all the same optimizations as Flux.
</Tip>
## Inference (Single File)
The `ChromaTransformer2DModel` supports loading checkpoints in the original format. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community.
The following example demonstrates how to run Chroma from a single file.
[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.
# CogVideoX
The abstract from the paper is:
[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.
*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficently model video data, we propose to levearge a 3D Variational Autoencoder (VAE) to compresses videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effectively text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at https://github.com/THUDM/CogVideo.*
You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection.
<Tip>
> [!TIP]
> Click on the CogVideoX models in the right sidebar for more examples of other video generation tasks.
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
The example below demonstrates how to generate a video optimized for memory or inference speed.
</Tip>
<hfoptions id="usage">
<hfoption id="memory">
This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).
Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
There are three official CogVideoX checkpoints for text-to-video and video-to-video.
- Text-to-video (T2V) works best at a resolution of 1360x768 because it was trained with that specific resolution.
- Image-to-video (I2V) works for multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. The height/width must be divisible by 16.
- Both T2V and I2V models support generation with 81 and 161 frames and work best at this value. Exporting videos at 16 FPS is recommended.
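
The constraints listed above can be checked before starting a long generation. The helper below is only a sketch built from the numbers in that list; `check_i2v_resolution` and `check_num_frames` are hypothetical names and are not part of diffusers.

```python
from typing import List


def check_i2v_resolution(width: int, height: int) -> List[str]:
    """Return a list of problems with an image-to-video resolution (empty if fine)."""
    problems = []
    if not 768 <= width <= 1360:
        problems.append("width must be between 768 and 1360")
    if height != 768:
        problems.append("height must be 768")
    if width % 16 != 0 or height % 16 != 0:
        problems.append("height and width must be divisible by 16")
    return problems


def check_num_frames(num_frames: int) -> bool:
    """The list above recommends generating 81 or 161 frames."""
    return num_frames in (81, 161)


print(check_i2v_resolution(1360, 768))  # no problems expected
print(check_i2v_resolution(1000, 768))  # 1000 is not divisible by 16
print(check_num_frames(81))
```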
There are two official CogVideoX checkpoints that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team).
# CogVideoX works well with long and well-described prompts
prompt="A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
The [T2V benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:
```
Without torch.compile(): Average inference time: 96.89 seconds.
With torch.compile(): Average inference time: 76.27 seconds.
```
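
For a sense of scale, the two timings above can be turned into a speedup factor and a relative time saving. The snippet is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
UNCOMPILED_S = 96.89  # average inference time without torch.compile()
COMPILED_S = 76.27    # average inference time with torch.compile()

speedup = UNCOMPILED_S / COMPILED_S
time_saved_pct = (UNCOMPILED_S - COMPILED_S) / UNCOMPILED_S * 100

print(f"Speedup: {speedup:.2f}x")
print(f"Time saved per run: {time_saved_pct:.1f}%")
```

This works out to roughly a 1.27x speedup, or about 21% less time per run, not counting the one-time compilation cost.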
### Memory optimization
CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it impossible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script.
- `pipe.enable_model_cpu_offload()`:
  - Without CPU offloading, memory usage is `33 GB`
  - With CPU offloading, memory usage is `19 GB`
- `pipe.enable_sequential_cpu_offload()`:
  - Similar to `enable_model_cpu_offload`, but can significantly reduce memory usage at the cost of slower inference
  - When enabled, memory usage is under `4 GB`
- `pipe.vae.enable_tiling()`:
  - With CPU offloading and tiling enabled, memory usage is `11 GB`
- `pipe.vae.enable_slicing()`
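
As a rough guide, the figures in the list above can be used to pick the least aggressive option that fits a given GPU. The helper below is a hypothetical sketch: `OPTIONS` and `pick_technique` are not diffusers APIs, "under 4 GB" is recorded as 4, and the ordering assumes that options with lower memory use are slower.

```python
from typing import Optional

# (technique, approximate peak memory in GB), from least to most aggressive.
# The numbers are the ones quoted in the list above.
OPTIONS = [
    ("no offloading", 33),
    ("enable_model_cpu_offload", 19),
    ("enable_model_cpu_offload + vae.enable_tiling", 11),
    ("enable_sequential_cpu_offload", 4),
]


def pick_technique(budget_gb: float) -> Optional[str]:
    """Return the first option whose quoted peak memory fits the budget."""
    for name, peak_gb in OPTIONS:
        if peak_gb <= budget_gb:
            return name
    return None  # nothing in the list fits this budget


for budget in (40, 24, 16, 8):
    print(budget, "GB ->", pick_technique(budget))
```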
## Quantization
Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
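
To see why a lower precision data type helps, the weight storage of a model can be estimated from its parameter count. This is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical helper, it counts weights only, and it ignores activations, other pipeline components, and any quantization metadata, so actual VRAM usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 5e9  # a 5B-parameter model, used here as an example size

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Halving the number of bits per weight halves the storage needed for the weights, which is where the memory savings come from.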
Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`CogVideoXPipeline`] for inference with bitsandbytes.
The quantized CogVideoX 5B model below requires ~16GB of VRAM.
prompt="A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea.
The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse.
Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood,
with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""
video = pipeline(
    prompt=prompt,
    guidance_scale=6,
    num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
</hfoption>
<hfoption id="inference speed">
[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster.
The average inference time with torch.compile on an 80GB A100 is 76.27 seconds compared to 96.89 seconds for an uncompiled model.
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea.
The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse.
Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood,
with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""
video = pipeline(
    prompt=prompt,
    guidance_scale=6,
    num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
</hfoption>
</hfoptions>
## Notes
- CogVideoX supports LoRAs with [`~loaders.CogVideoXLoraLoaderMixin.load_lora_weights`].
<details>
<summary>Show example code</summary>
```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
- The text-to-video (T2V) checkpoints work best with a resolution of 1360x768 because that was the resolution it was pretrained on.
- The image-to-video (I2V) checkpoints work with multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. Both height and width must be divisible by 16.
- Both T2V and I2V checkpoints work best with 81 and 161 frames. It is recommended to export the generated video at 16fps.
- Refer to the table below to view memory usage when various memory-saving techniques are enabled.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Some files were not shown because too many files have changed in this diff