Compare commits

...

35 Commits

Author SHA1 Message Date
Sayak Paul
28052b0691 Merge branch 'main' into tests-load-components 2026-03-20 16:07:07 +05:30
Sayak Paul
522b523e40 [ci] hoping to fix is_flaky with wanvace. (#13294)
* hoping to fix is_flaky with wanvace.

* revert changes in src/diffusers/utils/testing_utils.py and propagate them to tests/testing_utils.py.

* up
2026-03-20 16:02:16 +05:30
Sayak Paul
443d73584b Merge branch 'main' into tests-load-components 2026-03-20 12:32:42 +05:30
Dhruv Nair
e9b9f25f67 [CI] Update transformer version in release tests (#13296)
update
2026-03-20 11:40:06 +05:30
Dhruv Nair
32b4cfc81c [Modular] Test for catching dtype and device issues with AutoModel type hints (#13287)
* update

* update

* update
2026-03-20 10:36:03 +05:30
YiYi Xu
a13e5cf9fc [agents]support skills (#13269)
* support skills

* update

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update baSeed on new best practice

* Update .ai/skills/parity-testing/pitfalls.md

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

* update

---------

Co-authored-by: yiyi@huggingface.co <yiyi@ip-26-0-161-123.ec2.internal>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: yiyi@huggingface.co <yiyi@ip-26-0-160-103.ec2.internal>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
2026-03-19 18:07:41 -10:00
dg845
072d15ee42 Add Support for LTX-2.3 Models (#13217)
* Initial implementation of perturbed attn processor for LTX 2.3

* Update DiT block for LTX 2.3 + add self_attention_mask

* Add flag to control using perturbed attn processor for now

* Add support for new video upsampling blocks used by LTX-2.3

* Support LTX-2.3 Big-VGAN V2-style vocoder

* Initial implementation of LTX-2.3 vocoder with bandwidth extender

* Initial support for LTX-2.3 per-modality feature extractor

* Refactor so that text connectors own all text encoder hidden_states normalization logic

* Fix some bugs for inference

* Fix LTX-2.X DiT block forward pass

* Support prompt timestep embeds and prompt cross attn modulation

* Add LTX-2.3 configs to conversion script

* Support converting LTX-2.3 DiT checkpoints

* Support converting LTX-2.3 Video VAE checkpoints

* Support converting LTX-2.3 Vocoder with bandwidth extender

* Support converting LTX-2.3 text connectors

* Don't convert any upsamplers for now

* Support self attention mask for LTX2Pipeline

* Fix some inference bugs

* Support self attn mask and sigmas for LTX-2.3 I2V, Cond pipelines

* Support STG and modality isolation guidance for LTX-2.3

* make style and make quality

* Make audio guidance values default to video values by default

* Update to LTX-2.3 style guidance rescaling

* Support cross timesteps for LTX-2.3 cross attention modulation

* Fix RMS norm bug for LTX-2.3 text connectors

* Perform guidance rescale in sample (x0) space following original code

* Support LTX-2.3 Latent Spatial Upsampler model

* Support LTX-2.3 distilled LoRA

* Support LTX-2.3 Distilled checkpoint

* Support LTX-2.3 prompt enhancement

* Make LTX-2.X processor non-required so that tests pass

* Fix test_components_function tests for LTX2 T2V and I2V

* Fix LTX-2.3 Video VAE configuration bug causing pixel jitter

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Refactor LTX-2.X Video VAE upsampler block init logic

* Refactor LTX-2.X guidance rescaling to use rescale_noise_cfg

* Use generator initial seed to control prompt enhancement if available

* Remove self attention mask logic as it is not used in any current pipelines

* Commit fixes suggested by claude code (guidance in sample (x0) space, denormalize after timestep conditioning)

* Use constant shift following original code

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2026-03-19 14:58:29 -07:00
kaixuanliu
67613369bb fix: 'PaintByExampleImageEncoder' object has no attribute 'all_tied_w… (#13252)
* fix: 'PaintByExampleImageEncoder' object has no attribute 'all_tied_weights_keys'

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* also fix LDMBertModel

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
2026-03-18 17:55:08 -10:00
Shenghai Yuan
0c01a4b5e2 [Helios] Remove lru_cache for better AoTI compatibility and cleaner code (#13282)
fix: drop lru_cache for better AoTI compatibility
2026-03-18 23:41:58 +05:30
kaixuanliu
8e4b5607ed skip invalid test case for helios pipeline (#13218)
* skip invalid test case for helio pipeline

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

* update skip reason

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2026-03-17 20:58:35 -10:00
Junsong Chen
c6f72ad2f6 add ltx2 vae in sana-video; (#13229)
* add ltx2 vae in sana-video;

* add ltx vae in conversion script;

* Update src/diffusers/pipelines/sana_video/pipeline_sana_video.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* Update src/diffusers/pipelines/sana_video/pipeline_sana_video.py

Co-authored-by: YiYi Xu <yixu310@gmail.com>

* condition `vae_scale_factor_xxx` related settings on VAE types;

* make the mean/std depends on vae class;

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>
2026-03-17 18:09:52 -10:00
Dhruv Nair
11a3284cee [CI] Qwen Image Model Test Refactor (#13069)
* update

* update

* update

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2026-03-17 16:44:04 +05:30
Sayak Paul
16e7067647 [tests] fix llava kwargs in the hunyuan tests (#13275)
fix llava kwargs in the hunyuan tests
2026-03-17 10:11:47 +05:30
Dhruv Nair
d1b3555c29 [Modular] Fix dtype assignment when type hint is AutoModel (#13271)
* update

* update
2026-03-17 09:47:53 +05:30
Wang, Yi
9677859ebf fix parallelism case failure in xpu (#13270)
* fix parallelism case failure in xpu

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

* updated

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2026-03-17 08:52:15 +05:30
Steven Liu
ed31974c3e [docs] updates (#13248)
* fixes

* few more links

* update zh

* fix
2026-03-16 13:24:57 -07:00
YiYi Xu
e5aa719241 Add AGENTS.md (#13259)
* add a draft

* add

* up

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2026-03-14 08:35:12 -10:00
teith
4bc1c59a67 fix: correct invalid type annotation for image in Flux2Pipeline.__call__ (#13205)
fix: correct invalid type annotation for image in Flux2Pipeline.__call__
2026-03-13 15:56:38 -03:00
Sayak Paul
764f7ede33 [core] Flux2 klein kv followups (#13264)
* implement Flux2Transformer2DModelOutput.

* add output class to docs.

* add Flux2KleinKV to docs.

* add pipeline tests for klein kv.
2026-03-13 10:05:11 +05:30
Sayak Paul
8d0f3e1ba8 [lora] fix z-image non-diffusers lora loading. (#13255)
fix z-image non-diffusers lora loading.
2026-03-13 06:58:53 +05:30
huemin
094caf398f klein 9b kv (#13262)
* klein 9b kv

* Apply style fixes

* fix typo inline modulation split

* make fix-copies

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-03-12 06:53:56 -10:00
Sayak Paul
faca9b90a7 Merge branch 'main' into tests-load-components 2026-03-12 20:57:38 +05:30
Alvaro Bartolome
81c354d879 Add PRXPipeline in AUTO_TEXT2IMAGE_PIPELINES_MAPPING (#13257) 2026-03-11 14:39:24 -03:00
Miguel Martin
0a2c26d0a4 Update Documentation for NVIDIA Cosmos (#13251)
* fix docs

* update main example
2026-03-11 09:14:56 -07:00
Dhruv Nair
07c5ba8eee [Context Parallel] Add support for custom device mesh (#13064)
* add custom mesh support

* update

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2026-03-11 16:42:11 +05:30
Dhruv Nair
897aed72fa [Quantization] Deprecate Quanto (#13180)
* update

* update
2026-03-11 09:26:46 +05:30
sayakpaul
a1f63a398c up 2026-03-10 17:55:08 +05:30
sayakpaul
bf846f722c u[ 2026-03-10 17:49:51 +05:30
sayakpaul
78a86e85cf fix 2026-03-10 17:46:55 +05:30
sayakpaul
7673ab1757 fix 2026-03-10 16:50:27 +05:30
sayakpaul
b7648557d4 test load_components. 2026-03-10 16:09:02 +05:30
Dhruv Nair
07a63e197e [CI] Potential fix for code scanning alert no. 2150: Workflow does not contain permissions (#13230)
Potential fix for code scanning alert no. 2150: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2026-03-10 09:07:16 +05:30
YiYi Xu
068c6ef6c1 [modular] helios (#13216)
* add helios modular

* upup

* revert change in guider

* up

* fix for real

* fix batch test

* Apply suggestion from @yiyixuxu

---------

Co-authored-by: yiyi@huggingface.co <yiyi@ip-26-0-163-127.ec2.internal>
2026-03-09 10:37:56 -10:00
DefTruth
94bcb8941e fix: allow pass cpu generator for helios (#13228)
* allow pass cpu generator for helios

* allow pass cpu generator for helios

* allow pass cpu generator for helios

* patch
2026-03-09 10:30:56 -10:00
Dhruv Nair
8ea908f323 [CI] Add Workflow permissions to PR tests (#13233)
[CI] Add workflow permissions to PR tests

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-03-09 17:51:42 +05:30
81 changed files with 10004 additions and 969 deletions

33
.ai/AGENTS.md Normal file

@@ -0,0 +1,33 @@
# Diffusers — Agent Guide
## Coding style
Strive to write code as simple and explicit as possible.
- Minimize small helper/utility functions — inline the logic instead. A reader should be able to follow the full flow without jumping between functions.
- No defensive code or unused code paths — do not add fallback paths, safety checks, or configuration options "just in case". When porting from a research repo, delete training-time code paths, experimental flags, and ablation branches entirely — only keep the inference path you are actually integrating.
- Do not guess user intent and silently correct behavior. Make the expected inputs clear in the docstring, and raise a concise error for unsupported cases rather than adding complex fallback logic.
---
### Dependencies
- No new mandatory dependency without discussion (e.g. `einops`)
- Optional deps guarded with `is_X_available()` and a dummy in `utils/dummy_*.py`
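A minimal sketch of the guard, using `scipy` as a stand-in optional dependency (`is_scipy_available` is one of the existing helpers in `diffusers.utils`):
```python
from diffusers.utils import is_scipy_available

if is_scipy_available():
    import scipy.stats  # only imported when the optional dependency is installed
```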
## Code formatting
- `make style` and `make fix-copies` should be run as the final step before opening a PR
### Copied Code
- Many classes are kept in sync with a source via a `# Copied from ...` header comment
- Do not edit a `# Copied from` block directly — run `make fix-copies` to propagate changes from the source
- Remove the header to intentionally break the link
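For illustration, a typical header looks like this (the path and signature below are illustrative):
```python
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None):
    # body stays byte-identical to the source; `make fix-copies` rewrites it if the source changes
    ...
```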
### Models
- All layer calls should be visible directly in `forward` — avoid helper functions that hide `nn.Module` calls.
- Avoid graph breaks for `torch.compile` compatibility — do not insert NumPy operations or other patterns that break `torch.compile` with `fullgraph=True` into forward implementations.
- See the **model-integration** skill for the attention pattern, pipeline rules, test setup instructions, and other important details.
## Skills
Task-specific guides live in `.ai/skills/` and are loaded on demand by AI agents.
Available skills: **model-integration** (adding/converting pipelines), **parity-testing** (debugging numerical parity).


@@ -0,0 +1,167 @@
---
name: integrating-models
description: >
Use when adding a new model or pipeline to diffusers, setting up file
structure for a new model, converting a pipeline to modular format, or
converting weights for a new version of an already-supported model.
---
## Goal
Integrate a new model into diffusers end-to-end. The overall flow:
1. **Gather info** — ask the user for the reference repo, setup guide, a runnable inference script, and other objectives such as standard vs modular.
2. **Confirm the plan** — once you have everything, tell the user exactly what you'll do: e.g. "I'll integrate model X with pipeline Y into diffusers based on your script. I'll run parity tests (model-level and pipeline-level) using the `parity-testing` skill to verify numerical correctness against the reference."
3. **Implement** — write the diffusers code (model, pipeline, scheduler if needed), convert weights, register in `__init__.py`.
4. **Parity test** — use the `parity-testing` skill to verify component and e2e parity against the reference implementation.
5. **Deliver a unit test** — provide a self-contained test script that runs the diffusers implementation, checks numerical output (np allclose), and saves an image/video for visual verification. This is what the user runs to confirm everything works.
Work one workflow at a time — get it to full parity before moving on.
## Setup — gather before starting
Before writing any code, gather info in this order:
1. **Reference repo** — ask for the github link. If they've already set it up locally, ask for the path. Otherwise, ask what setup steps are needed (install deps, download checkpoints, set env vars, etc.) and run through them before proceeding.
2. **Inference script** — ask for a runnable end-to-end script for a basic workflow first (e.g. T2V). Then ask what other workflows they want to support (I2V, V2V, etc.) and agree on the full implementation order together.
3. **Standard vs modular** — standard pipelines, modular, or both?
Use `AskUserQuestion` with structured choices for step 3 when the options are known.
## Standard Pipeline Integration
### File structure for a new model
```
src/diffusers/
  models/transformers/transformer_<model>.py     # The core model
  schedulers/scheduling_<model>.py               # If model needs a custom scheduler
  pipelines/<model>/
    __init__.py
    pipeline_<model>.py                          # Main pipeline
    pipeline_<model>_<variant>.py                # Variant pipelines (e.g. pyramid, distilled)
    pipeline_output.py                           # Output dataclass
  loaders/lora_pipeline.py                       # LoRA mixin (add to existing file)
tests/
  models/transformers/test_models_transformer_<model>.py
  pipelines/<model>/test_<model>.py
  lora/test_lora_layers_<model>.py
docs/source/en/api/
  pipelines/<model>.md
  models/<model>_transformer3d.md                # or appropriate name
```
### Integration checklist
- [ ] Implement transformer model with `from_pretrained` support
- [ ] Implement or reuse scheduler
- [ ] Implement pipeline(s) with `__call__` method
- [ ] Add LoRA support if applicable
- [ ] Register all classes in `__init__.py` files (lazy imports)
- [ ] Write unit tests (model, pipeline, LoRA)
- [ ] Write docs
- [ ] Run `make style` and `make quality`
- [ ] Test parity with reference implementation (see `parity-testing` skill)
### Attention pattern
Attention must follow the diffusers pattern: both the `Attention` class and its processor are defined in the model file. The processor's `__call__` handles the actual compute and must use `dispatch_attention_fn` rather than calling `F.scaled_dot_product_attention` directly. The attention class inherits `AttentionModuleMixin` and declares `_default_processor_cls` and `_available_processors`.
```python
# transformer_mymodel.py
class MyModelAttnProcessor:
    _attention_backend = None
    _parallel_config = None

    def __call__(self, attn, hidden_states, attention_mask=None, ...):
        query = attn.to_q(hidden_states)
        key = attn.to_k(hidden_states)
        value = attn.to_v(hidden_states)
        # reshape, apply rope, etc.
        hidden_states = dispatch_attention_fn(
            query, key, value,
            attn_mask=attention_mask,
            backend=self._attention_backend,
            parallel_config=self._parallel_config,
        )
        hidden_states = hidden_states.flatten(2, 3)
        return attn.to_out[0](hidden_states)


class MyModelAttention(nn.Module, AttentionModuleMixin):
    _default_processor_cls = MyModelAttnProcessor
    _available_processors = [MyModelAttnProcessor]

    def __init__(self, query_dim, heads=8, dim_head=64, ...):
        super().__init__()
        self.to_q = nn.Linear(query_dim, heads * dim_head, bias=False)
        self.to_k = nn.Linear(query_dim, heads * dim_head, bias=False)
        self.to_v = nn.Linear(query_dim, heads * dim_head, bias=False)
        self.to_out = nn.ModuleList([nn.Linear(heads * dim_head, query_dim), nn.Dropout(0.0)])
        self.set_processor(MyModelAttnProcessor())

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        return self.processor(self, hidden_states, attention_mask, **kwargs)
```
Consult the implementations in `src/diffusers/models/transformers/` if you need further references.
### Implementation rules
1. **Don't combine structural changes with behavioral changes.** Restructuring code to fit diffusers APIs (ModelMixin, ConfigMixin, etc.) is unavoidable. But don't also "improve" the algorithm, refactor computation order, or rename internal variables for aesthetics. Keep numerical logic as close to the reference as possible, even if it looks unclean. For standard → modular, this is stricter: copy loop logic verbatim and only restructure into blocks. Clean up in a separate commit after parity is confirmed.
2. **Pipelines must inherit from `DiffusionPipeline`.** Consult implementations in `src/diffusers/pipelines` in case you need references.
3. **Don't subclass an existing pipeline for a variant.** DO NOT subclass an existing pipeline class (e.g., `FluxPipeline`) to implement a variant pipeline (e.g., `FluxImg2ImgPipeline`) that will live in the core codebase (`src`).
### Test setup
- Slow tests gated with `@slow` and `RUN_SLOW=1`
- Model-level tests should initially be written with the `BaseModelTesterConfig`, `ModelTesterMixin`, `MemoryTesterMixin`, `AttentionTesterMixin`, `LoraTesterMixin`, and `TrainingTesterMixin` classes; add further tests only after discussing them with the maintainers. Use `tests/models/transformers/test_models_transformer_flux.py` as a reference.
### Common diffusers conventions
- Pipelines inherit from `DiffusionPipeline`
- Models use `ModelMixin` with `register_to_config` for config serialization
- Schedulers use `SchedulerMixin` with `ConfigMixin`
- Use `@torch.no_grad()` on pipeline `__call__`
- Support `output_type="latent"` for skipping VAE decode
- Support `generator` parameter for reproducibility
- Use `self.progress_bar(timesteps)` for progress tracking
## Gotchas
1. **Forgetting `__init__.py` lazy imports.** Every new class must be registered in the appropriate `__init__.py` with lazy imports. Missing this causes `ImportError` that only shows up when users try `from diffusers import YourNewClass`.
2. **Using `einops` or other non-PyTorch deps.** Reference implementations often use `einops.rearrange`. Always rewrite with native PyTorch (`reshape`, `permute`, `unflatten`). Don't add the dependency. If a dependency is truly unavoidable, guard its import: `if is_my_dependency_available(): import my_dependency`.
3. **Missing `make fix-copies` after `# Copied from`.** If you add `# Copied from` annotations, you must run `make fix-copies` to propagate them. CI will fail otherwise.
4. **Wrong `_supports_cache_class` / `_no_split_modules`.** These class attributes control KV cache and device placement. Copy from a similar model and verify -- wrong values cause silent correctness bugs or OOM errors.
5. **Missing `@torch.no_grad()` on pipeline `__call__`.** Forgetting this causes GPU OOM from gradient accumulation during inference.
6. **Config serialization gaps.** Every `__init__` parameter in a `ModelMixin` subclass must be captured by `register_to_config`. If you add a new param but forget to register it, `from_pretrained` will silently use the default instead of the saved value.
7. **Forgetting to update `_import_structure` and `_lazy_modules`.** The top-level `src/diffusers/__init__.py` has both -- missing either one causes partial import failures.
8. **Hardcoded dtype in model forward.** Don't hardcode `torch.float32` or `torch.bfloat16` in the model's forward pass. Use the dtype of the input tensors or `self.dtype` so the model works with any precision.
---
## Modular Pipeline Conversion
See [modular-conversion.md](modular-conversion.md) for the full guide on converting standard pipelines to modular format, including block types, build order, guider abstraction, and conversion checklist.
---
## Weight Conversion Tips
<!-- TODO: Add concrete examples as we encounter them. Common patterns to watch for:
- Fused QKV weights that need splitting into separate Q, K, V
- Scale/shift ordering differences (reference stores [shift, scale], diffusers expects [scale, shift])
- Weight transpositions (linear stored as transposed conv, or vice versa)
- Interleaved head dimensions that need reshaping
- Bias terms absorbed into different layers
Add each with a before/after code snippet showing the conversion. -->
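Until real cases are collected, here is a hedged sketch of the first pattern (splitting a fused QKV projection); all state-dict keys are hypothetical:
```python
import torch

# Hypothetical keys: the reference stores one fused (3*dim, dim) projection,
# while diffusers expects separate to_q / to_k / to_v weights.
fused = reference_state_dict["blocks.0.attn.qkv.weight"]
q, k, v = torch.chunk(fused, 3, dim=0)
converted_state_dict["transformer_blocks.0.attn.to_q.weight"] = q
converted_state_dict["transformer_blocks.0.attn.to_k.weight"] = k
converted_state_dict["transformer_blocks.0.attn.to_v.weight"] = v
```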


@@ -0,0 +1,152 @@
# Modular Pipeline Conversion Reference
## When to use
Modular pipelines break a monolithic `__call__` into composable blocks. Convert when:
- The model supports multiple workflows (T2V, I2V, V2V, etc.)
- Users need to swap guidance strategies (CFG, CFG-Zero*, PAG)
- You want to share blocks across pipeline variants
## File structure
```
src/diffusers/modular_pipelines/<model>/
  __init__.py                  # Lazy imports
  modular_pipeline.py          # Pipeline class (tiny, mostly config)
  encoders.py                  # Text encoder + image/video VAE encoder blocks
  before_denoise.py            # Pre-denoise setup blocks
  denoise.py                   # The denoising loop blocks
  decoders.py                  # VAE decode block
  modular_blocks_<model>.py    # Block assembly (AutoBlocks)
```
## Block types decision tree
```
Is this a single operation?
  YES -> ModularPipelineBlocks (leaf block)
Does it run multiple blocks in sequence?
  YES -> SequentialPipelineBlocks
Does it iterate (e.g. chunk loop)?
  YES -> LoopSequentialPipelineBlocks
Does it choose ONE block based on which input is present?
  Is the selection 1:1 with trigger inputs?
    YES -> AutoPipelineBlocks (simple trigger mapping)
    NO  -> ConditionalPipelineBlocks (custom select_block method)
```
## Build order (easiest first)
1. `decoders.py` -- Takes latents, runs VAE decode, returns images/videos
2. `encoders.py` -- Takes prompt, returns prompt_embeds. Add image/video VAE encoder if needed
3. `before_denoise.py` -- Timesteps, latent prep, noise setup. Each logical operation = one block
4. `denoise.py` -- The hardest. Convert guidance to guider abstraction
## Key pattern: Guider abstraction
Original pipeline has guidance baked in:
```python
for i, t in enumerate(timesteps):
    noise_pred = self.transformer(latents, prompt_embeds, ...)
    if self.do_classifier_free_guidance:
        noise_uncond = self.transformer(latents, negative_prompt_embeds, ...)
        noise_pred = noise_uncond + scale * (noise_pred - noise_uncond)
    latents = self.scheduler.step(noise_pred, t, latents).prev_sample
```
Modular pipeline separates concerns:
```python
guider_inputs = {
    "encoder_hidden_states": (prompt_embeds, negative_prompt_embeds),
}
for i, t in enumerate(timesteps):
    components.guider.set_state(step=i, num_inference_steps=num_steps, timestep=t)
    guider_state = components.guider.prepare_inputs(guider_inputs)
    for batch in guider_state:
        components.guider.prepare_models(components.transformer)
        cond_kwargs = {k: getattr(batch, k) for k in guider_inputs}
        context_name = getattr(batch, components.guider._identifier_key)
        with components.transformer.cache_context(context_name):
            batch.noise_pred = components.transformer(
                hidden_states=latents, timestep=timestep,
                return_dict=False, **cond_kwargs, **shared_kwargs,
            )[0]
        components.guider.cleanup_models(components.transformer)
    noise_pred = components.guider(guider_state)[0]
    latents = components.scheduler.step(noise_pred, t, latents, generator=generator)[0]
```
## Key pattern: Chunk loops for video models
Use `LoopSequentialPipelineBlocks` for outer loop:
```python
class ChunkDenoiseStep(LoopSequentialPipelineBlocks):
    block_classes = [PrepareChunkStep, NoiseGenStep, DenoiseInnerStep, UpdateStep]
```
Note: blocks inside `LoopSequentialPipelineBlocks` receive `(components, block_state, k)` where `k` is the loop iteration index.
## Key pattern: Workflow selection
```python
class AutoDenoise(ConditionalPipelineBlocks):
    block_classes = [V2VDenoiseStep, I2VDenoiseStep, T2VDenoiseStep]
    block_trigger_inputs = ["video_latents", "image_latents"]
    default_block_name = "text2video"
```
## Standard InputParam/OutputParam templates
```python
# Inputs
InputParam.template("prompt") # str, required
InputParam.template("negative_prompt") # str, optional
InputParam.template("image") # PIL.Image, optional
InputParam.template("generator") # torch.Generator, optional
InputParam.template("num_inference_steps") # int, default=50
InputParam.template("latents") # torch.Tensor, optional
# Outputs
OutputParam.template("prompt_embeds")
OutputParam.template("negative_prompt_embeds")
OutputParam.template("image_latents")
OutputParam.template("latents")
OutputParam.template("videos")
OutputParam.template("images")
```
## ComponentSpec patterns
```python
# Heavy models - loaded from pretrained
ComponentSpec("transformer", YourTransformerModel)
ComponentSpec("vae", AutoencoderKL)
# Lightweight objects - created inline from config
ComponentSpec(
"guider",
ClassifierFreeGuidance,
config=FrozenDict({"guidance_scale": 7.5}),
default_creation_method="from_config"
)
```
## Conversion checklist
- [ ] Read original pipeline's `__call__` end-to-end, map stages
- [ ] Write test scripts (reference + target) with identical seeds
- [ ] Create file structure under `modular_pipelines/<model>/`
- [ ] Write decoder block (simplest)
- [ ] Write encoder blocks (text, image, video)
- [ ] Write before_denoise blocks (timesteps, latent prep, noise)
- [ ] Write denoise block with guider abstraction (hardest)
- [ ] Create pipeline class with `default_blocks_name`
- [ ] Assemble blocks in `modular_blocks_<model>.py`
- [ ] Wire up `__init__.py` with lazy imports
- [ ] Run `make style` and `make quality`
- [ ] Test all workflows for parity with reference


@@ -0,0 +1,170 @@
---
name: testing-parity
description: >
Use when debugging or verifying numerical parity between pipeline
implementations (e.g., research repo vs diffusers, standard vs modular).
Also relevant when outputs look wrong — washed out, pixelated, or have
visual artifacts — as these are usually parity bugs.
---
## Setup — gather before starting
Before writing any test code, gather:
1. **Which two implementations** are being compared (e.g. research repo → diffusers, standard → modular, or research → modular). Use `AskUserQuestion` with structured choices if not already clear.
2. **Two equivalent runnable scripts** — one for each implementation, both expected to produce identical output given the same inputs. These scripts define what "parity" means concretely.
When invoked from the `model-integration` skill, you already have context: the reference script comes from step 2 of setup, and the diffusers script is the one you just wrote. You just need to make sure both scripts are runnable and use the same inputs/seed/params.
## Test strategy
**Component parity (CPU/float32) -- always run, as you build.**
Test each component before assembling the pipeline. This is the foundation -- if individual pieces are wrong, the pipeline can't be right. Each component in isolation, strict max_diff < 1e-3.
Test freshly converted checkpoints and saved checkpoints.
- **Fresh**: convert from checkpoint weights, compare against reference (catches conversion bugs)
- **Saved**: load from saved model on disk, compare against reference (catches stale saves)
Keep component test scripts around -- you will need to re-run them during pipeline debugging with different inputs or config values.
Template -- one self-contained script per component, reference and diffusers side-by-side:
```python
@torch.inference_mode()
def test_my_component(mode="fresh", model_path=None):
    # 1. Deterministic input
    gen = torch.Generator().manual_seed(42)
    x = torch.randn(1, 3, 64, 64, generator=gen, dtype=torch.float32)
    # 2. Reference: load from checkpoint, run, free
    ref_model = ReferenceModel.from_config(config)
    ref_model.load_state_dict(load_weights("prefix"), strict=True)
    ref_model = ref_model.float().eval()
    ref_out = ref_model(x).clone()
    del ref_model
    # 3. Diffusers: fresh (convert weights) or saved (from_pretrained)
    if mode == "fresh":
        diff_model = convert_my_component(load_weights("prefix"))
    else:
        diff_model = DiffusersModel.from_pretrained(model_path, torch_dtype=torch.float32)
    diff_model = diff_model.float().eval()
    diff_out = diff_model(x)
    del diff_model
    # 4. Compare in same script -- no saving to disk
    max_diff = (ref_out - diff_out).abs().max().item()
    assert max_diff < 1e-3, f"FAIL: max_diff={max_diff:.2e}"
```
Key points: (a) both reference and diffusers component in one script -- never split into separate scripts that save/load intermediates, (b) deterministic input via seeded generator, (c) load one model at a time to fit in CPU RAM, (d) `.clone()` the reference output before deleting the model.
**E2E visual (GPU/bfloat16) -- once the pipeline is assembled.**
Both pipelines generate independently with identical seeds/params. Save outputs and compare visually. If outputs look identical, you're done -- no need for deeper testing.
**Pipeline stage tests -- only if E2E fails and you need to isolate the bug.**
If the user already suspects where divergence is, start there. Otherwise, work through stages in order.
First, **match noise generation**: the way initial noise/latents are constructed (seed handling, generator, randn call order) often differs between the two scripts. If the noise doesn't match, nothing downstream will match. Check how noise is initialized in the diffusers script — if it doesn't match the reference, temporarily change it to match. Note what you changed so it can be reverted after parity is confirmed.
For small models, run on CPU/float32 for strict comparison. For large models (e.g. 22B params), CPU/float32 is impractical -- use GPU/bfloat16 with `enable_model_cpu_offload()` and relax tolerances (max_diff < 1e-1 for bfloat16 is typical for passing tests; cosine similarity > 0.9999 is a good secondary check).
Test encode and decode stages first -- they're simpler and bugs there are easier to fix. Only debug the denoising loop if encode and decode both pass.
The challenge: pipelines are monolithic `__call__` methods -- you can't just call "the encode part". See [checkpoint-mechanism.md](checkpoint-mechanism.md) for the checkpoint class that lets you stop, save, or inject tensors at named locations inside the pipeline.
**Stage test order — encode, decode, then denoise:**
- **`encode`** (test first): Stop both pipelines at `"preloop"`. Compare **every single variable** that will be consumed by the denoising loop -- not just latents and sigmas, but also prompt embeddings, attention masks, positional coordinates, connector outputs, and any conditioning inputs.
- **`decode`** (test second, before denoise): Run the reference pipeline fully -- checkpoint the post-loop latents AND let it finish to get the decoded output. Then feed those same post-loop latents through the diffusers pipeline's decode path. Compare both numerically AND visually.
- **`denoise`** (test last): Run both pipelines with realistic `num_steps` (e.g. 30) so the scheduler computes correct sigmas/timesteps, but stop after 2 loop iterations using `after_step_1`. Don't set `num_steps=2` -- that produces unrealistic sigma schedules.
```python
# Encode stage -- stop before the loop, compare ALL inputs:
ref_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
run_reference_pipeline(ref_ckpts)
ref_data = ref_ckpts["preloop"].data
diff_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
run_diffusers_pipeline(diff_ckpts)
diff_data = diff_ckpts["preloop"].data
# Compare EVERY variable consumed by the denoise loop:
compare_tensors("latents", ref_data["latents"], diff_data["latents"])
compare_tensors("sigmas", ref_data["sigmas"], diff_data["sigmas"])
compare_tensors("prompt_embeds", ref_data["prompt_embeds"], diff_data["prompt_embeds"])
# ... every single tensor the transformer forward() will receive
```
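A similar sketch for the denoise stage test: realistic `num_steps` so the sigma schedule is correct, stopped after two iterations (the `num_steps` kwarg on the assumed runner scripts is illustrative):
```python
# Denoise stage -- run with 30 steps' worth of sigmas but stop after iteration 1:
ref_ckpts = {"after_step_1": Checkpoint(save=True, stop=True)}
run_reference_pipeline(ref_ckpts, num_steps=30)
diff_ckpts = {"after_step_1": Checkpoint(save=True, stop=True)}
run_diffusers_pipeline(diff_ckpts, num_steps=30)
compare_tensors(
    "latents_after_step_1",
    ref_ckpts["after_step_1"].data["latents"],
    diff_ckpts["after_step_1"].data["latents"],
)
```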
**E2E-injected visual test**: Once you've identified a suspected root cause using stage tests, confirm it with an e2e-injected run -- inject the known-good tensor from reference and generate a full video. If the output looks identical to reference, you've confirmed the root cause.
## Debugging technique: Injection for root-cause isolation
When stage tests show divergence, **inject a known-good tensor from one pipeline into the other** to test whether the remaining code is correct.
The principle: if you suspect input X is the root cause of divergence in stage S:
1. Run the reference pipeline and capture X
2. Run the diffusers pipeline but **replace** its X with the reference's X (via checkpoint load)
3. Compare outputs of stage S
If outputs now match: X was the root cause. If they still diverge: the bug is in the stage logic itself, not in X.
| What you're testing | What you inject | Where you inject |
|---|---|---|
| Is the decode stage correct? | Post-loop latents from reference | Before decode |
| Is the denoise loop correct? | Pre-loop latents from reference | Before the loop |
| Is step N correct? | Post-step-(N-1) latents from reference | Before step N |
**Per-step accumulation tracing**: When injection confirms the loop is correct but you want to understand *how* a small initial difference compounds, capture `after_step_{i}` for every step and plot the max_diff curve. A healthy curve stays bounded; an exponential blowup in later steps points to an amplification mechanism (see Pitfall #13 in [pitfalls.md](pitfalls.md)).
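A sketch of that tracing, reusing the checkpoint names from [checkpoint-mechanism.md](checkpoint-mechanism.md) and the same assumed runner scripts as above:
```python
num_steps = 30
ref_ckpts = {f"after_step_{i}": Checkpoint(save=True) for i in range(num_steps)}
diff_ckpts = {f"after_step_{i}": Checkpoint(save=True) for i in range(num_steps)}
run_reference_pipeline(ref_ckpts)
run_diffusers_pipeline(diff_ckpts)
for i in range(num_steps):
    r = ref_ckpts[f"after_step_{i}"].data["latents"]
    d = diff_ckpts[f"after_step_{i}"].data["latents"]
    print(i, (r.float() - d.float()).abs().max().item())  # bounded curve = healthy; exponential blowup = amplification
```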
## Debugging technique: Visual comparison via frame extraction
For video pipelines, numerical metrics alone can be misleading. Extract and view individual frames:
```python
import numpy as np
from PIL import Image
def extract_frames(video_np, frame_indices):
    """video_np: (frames, H, W, 3) float array in [0, 1]"""
    for idx in frame_indices:
        frame = (video_np[idx] * 255).clip(0, 255).astype(np.uint8)
        img = Image.fromarray(frame)
        img.save(f"frame_{idx}.png")
# Compare specific frames from both pipelines
extract_frames(ref_video, [0, 60, 120])
extract_frames(diff_video, [0, 60, 120])
```
## Testing rules
1. **Never use reference code in the diffusers test path.** Each side must use only its own code.
2. **Never monkey-patch model internals in tests.** Do not replace `model.forward` or patch internal methods.
3. **Debugging instrumentation must be non-destructive.** Checkpoint captures for debugging are fine, but must not alter control flow or outputs.
4. **Prefer CPU/float32 for numerical comparison when practical.** Float32 avoids bfloat16 precision noise that obscures real bugs. But for large models (22B+), GPU/bfloat16 with `enable_model_cpu_offload()` is necessary -- use relaxed tolerances and cosine similarity as a secondary metric.
5. **Test both fresh conversion AND saved model.** Fresh catches conversion logic bugs; saved catches stale/corrupted weights from previous runs.
6. **Diff configs before debugging.** Before investigating any divergence, dump and compare all config values. A 30-second config diff prevents hours of debugging based on wrong assumptions.
7. **Never modify cached/downloaded model configs directly.** Don't edit files in `~/.cache/huggingface/`. Instead, save to a local directory or open a PR on the upstream repo.
8. **Compare ALL loop inputs in the encode test.** The preloop checkpoint must capture every single tensor the transformer forward() will receive.
## Comparison utilities
```python
def compare_tensors(name: str, a: torch.Tensor, b: torch.Tensor, tol: float = 1e-3) -> bool:
    if a.shape != b.shape:
        print(f" FAIL {name}: shape mismatch {a.shape} vs {b.shape}")
        return False
    diff = (a.float() - b.float()).abs()
    max_diff = diff.max().item()
    mean_diff = diff.mean().item()
    cos = torch.nn.functional.cosine_similarity(
        a.float().flatten().unsqueeze(0), b.float().flatten().unsqueeze(0)
    ).item()
    passed = max_diff < tol
    print(f" {'PASS' if passed else 'FAIL'} {name}: max={max_diff:.2e}, mean={mean_diff:.2e}, cos={cos:.5f}")
    return passed
```
Cosine similarity is especially useful for GPU/bfloat16 tests where max_diff can be noisy -- `cos > 0.9999` is a strong signal even when max_diff exceeds tolerance.
## Gotchas
See [pitfalls.md](pitfalls.md) for the full list of gotchas to watch for during parity testing.


@@ -0,0 +1,103 @@
# Checkpoint Mechanism for Stage Testing
## Overview
Pipelines are monolithic `__call__` methods -- you can't just call "the encode part". The checkpoint mechanism lets you stop, save, or inject tensors at named locations inside the pipeline.
## The Checkpoint class
Add a `_checkpoints` argument to both the diffusers pipeline and the reference implementation.
```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    save: bool = False   # capture variables into ckpt.data
    stop: bool = False   # halt pipeline after this point
    load: bool = False   # inject ckpt.data into local variables
    data: dict = field(default_factory=dict)
```
## Pipeline instrumentation
The pipeline accepts an optional `dict[str, Checkpoint]`. Place checkpoint calls at boundaries between pipeline stages -- after each encoder, before the denoising loop (capture all loop inputs), after each loop iteration, after the loop (capture final latents before decode).
```python
def __call__(self, prompt, ..., _checkpoints=None):
    # --- text encoding ---
    prompt_embeds = self.text_encoder(prompt)
    _maybe_checkpoint(_checkpoints, "text_encoding", {
        "prompt_embeds": prompt_embeds,
    })
    # --- prepare latents, sigmas, positions ---
    latents = self.prepare_latents(...)
    sigmas = self.scheduler.sigmas
    # ...
    _maybe_checkpoint(_checkpoints, "preloop", {
        "latents": latents,
        "sigmas": sigmas,
        "prompt_embeds": prompt_embeds,
        "prompt_attention_mask": prompt_attention_mask,
        "video_coords": video_coords,
        # capture EVERYTHING the loop needs -- every tensor the transformer
        # forward() receives. Missing even one variable here means you can't
        # tell if it's the source of divergence during denoise debugging.
    })
    # --- denoising loop ---
    for i, t in enumerate(timesteps):
        noise_pred = self.transformer(latents, t, prompt_embeds, ...)
        latents = self.scheduler.step(noise_pred, t, latents)[0]
        _maybe_checkpoint(_checkpoints, f"after_step_{i}", {
            "latents": latents,
        })
    _maybe_checkpoint(_checkpoints, "post_loop", {
        "latents": latents,
    })
    # --- decode ---
    video = self.vae.decode(latents)
    return video
```
## The helper function
Each `_maybe_checkpoint` call does three things based on the Checkpoint's flags: `save` captures the local variables into `ckpt.data`, `load` injects pre-populated `ckpt.data` back into local variables, `stop` halts execution (raises an exception caught at the top level).
```python
def _maybe_checkpoint(checkpoints, name, data):
    if not checkpoints:
        return
    ckpt = checkpoints.get(name)
    if ckpt is None:
        return
    if ckpt.save:
        ckpt.data.update(data)
    if ckpt.stop:
        raise PipelineStop  # caught at __call__ level, returns None
```
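A sketch of the `PipelineStop` exception and how `__call__` might catch it (the split into an inner `_run` is illustrative):
```python
class PipelineStop(Exception):
    """Raised by _maybe_checkpoint when a checkpoint has stop=True."""

# Inside the pipeline, __call__ wraps the instrumented body so an early stop
# returns cleanly instead of raising:
def __call__(self, prompt, ..., _checkpoints=None):
    try:
        return self._run(prompt, _checkpoints=_checkpoints)  # hypothetical inner body with the checkpoint calls
    except PipelineStop:
        return None  # captured tensors are read back from the checkpoint dict
```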
## Injection support
Add `load` support at each checkpoint where you might want to inject:
```python
_maybe_checkpoint(_checkpoints, "preloop", {"latents": latents, ...})
# Load support: replace local variables with injected data
if _checkpoints:
    ckpt = _checkpoints.get("preloop")
    if ckpt is not None and ckpt.load:
        latents = ckpt.data["latents"].to(device=device, dtype=latents.dtype)
```
## Key insight
The checkpoint dict is passed into the pipeline and mutated in-place. After the pipeline returns (or stops early), you read back `ckpt.data` to get the captured tensors. Both pipelines save under their own key names, so the test maps between them (e.g. reference `"video_state.latent"` -> diffusers `"latents"`).
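For example, a small mapping in the test script might look like this (key names are illustrative):
```python
KEY_MAP = {"video_state.latent": "latents", "ctx.text_emb": "prompt_embeds"}  # reference key -> diffusers key
for ref_key, diff_key in KEY_MAP.items():
    compare_tensors(diff_key, ref_data[ref_key], diff_data[diff_key])
```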
## Memory management for large models
For large models, free the source pipeline's GPU memory before loading the target pipeline. Clone injected tensors to CPU, delete everything else, then run the target with `enable_model_cpu_offload()`.
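A rough sketch of that flow (the pipeline loader is an assumed helper from the test script):
```python
import gc
import torch

ref_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
run_reference_pipeline(ref_ckpts)                                     # assumed helper that loads and runs the reference
injected = {k: v.detach().to("cpu").clone() for k, v in ref_ckpts["preloop"].data.items()}

gc.collect()                                                          # after releasing the reference pipeline objects
torch.cuda.empty_cache()

diff_pipe = load_diffusers_pipeline()                                 # assumed helper
diff_pipe.enable_model_cpu_offload()
diff_ckpts = {"preloop": Checkpoint(load=True, data=injected)}
```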


@@ -0,0 +1,116 @@
# Complete Pitfalls Reference
## 1. Global CPU RNG
`MultivariateNormal.sample()` uses the global CPU RNG, not `torch.Generator`. Must call `torch.manual_seed(seed)` before each pipeline run. A `generator=` kwarg won't help.
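A quick demonstration of why the global seed matters here:
```python
import torch
from torch.distributions import MultivariateNormal

dist = MultivariateNormal(torch.zeros(4), torch.eye(4))
torch.manual_seed(42)      # seeds the global CPU RNG that .sample() draws from
a = dist.sample()
torch.manual_seed(42)
b = dist.sample()
assert torch.equal(a, b)   # reproducible only through the global seed; a generator= kwarg is never consulted
```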
## 2. Timestep dtype
Many transformers expect `int64` timesteps. `get_timestep_embedding` casts to float, so `745.3` and `745` produce different embeddings. Match the reference's casting.
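A quick check of the effect, using `get_timestep_embedding` from `diffusers.models.embeddings`:
```python
import torch
from diffusers.models.embeddings import get_timestep_embedding

emb_float = get_timestep_embedding(torch.tensor([745.3]), embedding_dim=256)
emb_int = get_timestep_embedding(torch.tensor([745], dtype=torch.int64), embedding_dim=256)
print((emb_float - emb_int).abs().max())  # non-zero: 745.3 and 745 embed differently
```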
## 3. Guidance parameter mapping
Parameter names may differ: reference `zero_steps=1` (meaning `i <= 1`, 2 steps) vs target `zero_init_steps=2` (meaning `step < 2`, same thing). Check exact semantics.
## 4. `patch_size` in noise generation
If noise generation depends on `patch_size` (e.g. `sample_block_noise`), it must be passed through. Missing it changes noise spatial structure.
## 5. Variable shadowing in nested loops
Nested loops (stages -> chunks -> timesteps) can shadow variable names. If outer loop uses `latents` and inner loop also assigns to `latents`, scoping must match the reference.
## 6. Float precision differences -- don't dismiss them
Target may compute in float32 where reference used bfloat16. Small per-element diffs (1e-3 to 1e-2) *look* harmless but can compound catastrophically over iterative processes like denoising loops (see Pitfalls #11 and #13). Before dismissing a precision difference: (a) check whether it feeds into an iterative process, (b) if so, trace the accumulation curve over all iterations to see if it stays bounded or grows exponentially. Only truly non-iterative precision diffs (e.g. in a single-pass encoder) are safe to accept.
## 7. Scheduler state reset between stages
Some schedulers accumulate state (e.g. `model_outputs` in UniPC) that must be cleared between stages.
## 8. Component access
Standard: `self.transformer`. Modular: `components.transformer`. Missing this causes AttributeError.
## 9. Guider state across stages
In multi-stage denoising, the guider's internal state (e.g. `zero_init_steps`) may need save/restore between stages.
## 10. Model storage location
NEVER store converted models in `/tmp/` -- temporary directories get wiped on restart. Always save converted checkpoints under a persistent path in the project repo (e.g. `models/ltx23-diffusers/`).
## 11. Noise dtype mismatch (causes washed-out output)
Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:
```python
noise = torch.randn(..., dtype=torch.float32, generator=gen)
noise = noise.to(dtype=model_dtype) # bfloat16 -- values get quantized
```
Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (Pitfall #13) and produces completely washed-out output.
**Fix**: Match the reference -- generate noise in the model's working dtype:
```python
latent_dtype = self.transformer.dtype # e.g. bfloat16
latents = self.prepare_latents(..., dtype=latent_dtype, ...)
```
**Detection**: Encode stage test shows initial latent max_diff of exactly ~1.5e-02. This specific magnitude is the signature of float32->bfloat16 quantization error.
## 12. RoPE position dtype
RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly (max_diff up to 2.0). Different modalities may use different position dtypes (e.g. video bfloat16, audio float32) -- check the reference carefully.
## 13. 1/sigma error amplification in Euler denoising
In Euler/flow-matching, the velocity formula divides by sigma: `v = (latents - pred_x0) / sigma`. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (Pitfalls #11, #12) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.
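A toy illustration of the amplification factor over a 30-step schedule:
```python
import torch

sigmas = torch.linspace(1.0, 0.001, 30)   # illustrative sigma schedule, step 0 -> step 29
eps = 1.5e-2                               # the float32->bfloat16 quantization error from Pitfall #11
print((eps / sigmas)[:3])                  # ~0.015 at early steps
print((eps / sigmas)[-3:])                 # up to ~15 at the final steps: the error amplified ~1000x
```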
## 14. Config value assumptions -- always diff, never assume
When debugging parity, don't assume config values match code defaults. The published model checkpoint may override defaults with different values. A wrong assumption about a single config field can send you down hours of debugging in the wrong direction.
**The pattern that goes wrong:**
1. You see `param_x` has default `1` in the code
2. The reference code also uses `param_x` with a default of `1`
3. You assume both sides use `1` and apply a "fix" based on that
4. But the actual checkpoint config has `param_x: 1000`, and so does the published diffusers config
5. Your "fix" now *creates* divergence instead of fixing it
**Prevention -- config diff first:**
```python
# Reference: read from checkpoint metadata (no model loading needed)
from safetensors import safe_open
import json
ref_config = json.loads(safe_open(checkpoint_path, framework="pt").metadata()["config"])
# Diffusers: read from model config
from diffusers import MyModel
diff_model = MyModel.from_pretrained(model_path, subfolder="transformer")
diff_config = dict(diff_model.config)
# Compare all values
for key in sorted(set(list(ref_config.get("transformer", {}).keys()) + list(diff_config.keys()))):
    ref_val = ref_config.get("transformer", {}).get(key, "MISSING")
    diff_val = diff_config.get(key, "MISSING")
    if ref_val != diff_val:
        print(f" DIFF {key}: ref={ref_val}, diff={diff_val}")
```
Run this **before** writing any hooks, analysis code, or fixes. It takes 30 seconds and catches wrong assumptions immediately.
**When debugging divergence -- trace values, don't reason about them:**
If two implementations diverge, hook the actual intermediate values at the point of divergence rather than reading code to figure out what the values "should" be. Code analysis builds on assumptions; value tracing reveals facts.
## 15. Decoder config mismatch (causes pixelated artifacts)
The upstream model config may have wrong values for decoder-specific parameters (e.g. `upsample_residual`, `upsample_type`). These control whether the decoder uses skip connections in upsampling -- getting them wrong produces severe pixelation or blocky artifacts.
**Detection**: Feed identical post-loop latents through both decoders. If max pixel diff is large (PSNR < 40 dB) on CPU/float32, it's a real bug, not precision noise. Trace through decoder blocks (conv_in -> mid_block -> up_blocks) to find where divergence starts.
**Fix**: Correct the config value. Don't edit cached files in `~/.cache/huggingface/` -- either save to a local model directory or open a PR on the upstream repo (see Testing Rule #7).
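A small helper for the detection step above (assumes both decoded outputs are scaled to [0, 1]):
```python
import torch

def psnr(a: torch.Tensor, b: torch.Tensor) -> float:
    """PSNR in dB between two decoded outputs in [0, 1]."""
    mse = (a.float() - b.float()).pow(2).mean()
    if mse.item() == 0:
        return float("inf")
    return (10 * torch.log10(1.0 / mse)).item()

# psnr(ref_decoded, diff_decoded) < 40 on CPU/float32 points to a real decoder bug, not precision noise
```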
## 16. Incomplete injection tests -- inject ALL variables or the test is invalid
When doing injection tests (feeding reference tensors into the diffusers pipeline), you must inject **every** divergent input, including sigmas/timesteps. A common mistake: the preloop checkpoint saves sigmas but the injection code only loads latents and embeddings. The test then runs with different sigma schedules, making it impossible to isolate the real cause.
**Prevention**: After writing injection code, verify by listing every variable the injected stage consumes and checking each one is either (a) injected from reference, or (b) confirmed identical between pipelines.
## 17. bf16 connector/encoder divergence -- don't chase it
When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.
## 18. Stale test fixtures
When using saved tensors for cross-pipeline comparison, always ensure both sets of tensors were captured from the same run configuration (same seed, same config, same code version). Mixing fixtures from different runs (e.g. reference tensors from yesterday, diffusers tensors from today after a code change) creates phantom divergence that wastes debugging time. Regenerate both sides in a single test script execution.


@@ -16,6 +16,9 @@ on:
branches:
- ci-*
permissions:
contents: read
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true


@@ -1,5 +1,8 @@
name: Fast GPU Tests on PR
permissions:
contents: read
on:
pull_request:
branches: main


@@ -4,6 +4,7 @@
name: (Release) Fast GPU Tests on main
on:
workflow_dispatch:
push:
branches:
- "v*.*.*-release"
@@ -33,6 +34,7 @@ jobs:
- name: Install dependencies
run: |
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
- name: Environment
run: |
python utils/print_env.py
@@ -74,6 +76,7 @@ jobs:
run: |
uv pip install -e ".[quality]"
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
- name: Environment
run: |
python utils/print_env.py
@@ -125,6 +128,7 @@ jobs:
uv pip install -e ".[quality]"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
- name: Environment
run: |
@@ -175,6 +179,7 @@ jobs:
uv pip install -e ".[quality]"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
- name: Environment
run: |
@@ -232,6 +237,7 @@ jobs:
- name: Install dependencies
run: |
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
- name: Environment
run: |
python utils/print_env.py
@@ -274,6 +280,7 @@ jobs:
- name: Install dependencies
run: |
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
- name: Environment
run: |
python utils/print_env.py
@@ -316,6 +323,7 @@ jobs:
- name: Install dependencies
run: |
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
- name: Environment
run: |

8
.gitignore vendored

@@ -178,4 +178,10 @@ tags
.ruff_cache
# wandb
wandb
wandb
# AI agent generated symlinks
/AGENTS.md
/CLAUDE.md
/.agents/skills
/.claude/skills


@@ -1,4 +1,4 @@
.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples
.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples codex claude clean-ai
# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
export PYTHONPATH = src
@@ -98,3 +98,21 @@ post-release:
post-patch:
	python utils/release.py --post_release --patch

# AI agent symlinks
codex:
	ln -snf .ai/AGENTS.md AGENTS.md
	mkdir -p .agents
	rm -rf .agents/skills
	ln -snf ../.ai/skills .agents/skills

claude:
	ln -snf .ai/AGENTS.md CLAUDE.md
	mkdir -p .claude
	rm -rf .claude/skills
	ln -snf ../.ai/skills .claude/skills

clean-ai:
	rm -f AGENTS.md CLAUDE.md
	rm -rf .agents/skills .claude/skills


@@ -22,6 +22,8 @@
title: Reproducibility
- local: using-diffusers/schedulers
title: Schedulers
- local: using-diffusers/guiders
title: Guiders
- local: using-diffusers/automodel
title: AutoModel
- local: using-diffusers/other-formats
@@ -110,8 +112,6 @@
title: ModularPipeline
- local: modular_diffusers/components_manager
title: ComponentsManager
- local: modular_diffusers/guiders
title: Guiders
- local: modular_diffusers/custom_blocks
title: Building Custom Blocks
- local: modular_diffusers/mellon
@@ -532,8 +532,6 @@
title: ControlNet-XS with Stable Diffusion XL
- local: api/pipelines/controlnet_union
title: ControlNetUnion
- local: api/pipelines/cosmos
title: Cosmos
- local: api/pipelines/ddim
title: DDIM
- local: api/pipelines/ddpm
@@ -677,6 +675,8 @@
title: CogVideoX
- local: api/pipelines/consisid
title: ConsisID
- local: api/pipelines/cosmos
title: Cosmos
- local: api/pipelines/framepack
title: Framepack
- local: api/pipelines/helios


@@ -17,3 +17,7 @@ A Transformer model for image-like data from [Flux2](https://hf.co/black-forest-
## Flux2Transformer2DModel
[[autodoc]] Flux2Transformer2DModel
## Flux2Transformer2DModelOutput
[[autodoc]] models.transformers.transformer_flux2.Flux2Transformer2DModelOutput


@@ -21,29 +21,31 @@
> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
## Loading original format checkpoints
Original format checkpoints that have not been converted to diffusers-expected format can be loaded using the `from_single_file` method.
## Basic usage
```python
import torch
from diffusers import Cosmos2TextToImagePipeline, CosmosTransformer3DModel
from diffusers import Cosmos2_5_PredictBasePipeline
from diffusers.utils import export_to_video
model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"
transformer = CosmosTransformer3DModel.from_single_file(
"https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/blob/main/model.pt",
torch_dtype=torch.bfloat16,
).to("cuda")
pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
model_id = "nvidia/Cosmos-Predict2.5-2B"
pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
model_id, revision="diffusers/base/post-trained", torch_dtype=torch.bfloat16
)
pipe.to("cuda")
prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
prompt = "As the red light shifts to green, the red bus at the intersection begins to move forward, its headlights cutting through the falling snow. The snowy tire tracks deepen as the vehicle inches ahead, casting fresh lines onto the slushy road. Around it, streetlights glow warmer, illuminating the drifting flakes and wet reflections on the asphalt. Other cars behind start to edge forward, their beams joining the scene. The stillness of the urban street transitions into motion as the quiet snowfall is punctuated by the slow advance of traffic through the frosty city corridor."
negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
output = pipe(
prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
).images[0]
output.save("output.png")
image=None,
video=None,
prompt=prompt,
negative_prompt=negative_prompt,
num_frames=93,
generator=torch.Generator().manual_seed(1),
).frames[0]
export_to_video(output, "text2world.mp4", fps=16)
```
## Cosmos2_5_TransferPipeline

View File

@@ -41,5 +41,11 @@ The [official implementation](https://github.com/black-forest-labs/flux2/blob/5a
## Flux2KleinPipeline
[[autodoc]] Flux2KleinPipeline
- all
- __call__
## Flux2KleinKVPipeline
[[autodoc]] Flux2KleinKVPipeline
- all
- __call__

View File

@@ -99,7 +99,7 @@ To update guider configuration, you can run `pipe.guider = pipe.guider.new(...)`
pipe.guider = pipe.guider.new(guidance_scale=5.0)
```
Read more on Guider [here](../../modular_diffusers/guiders).
Read more on Guider [here](../../using-diffusers/guiders).

View File

@@ -30,7 +30,7 @@ HunyuanImage-2.1 comes in the following variants:
## HunyuanImage-2.1
HunyuanImage-2.1 applies [Adaptive Projected Guidance (APG)](https://huggingface.co/papers/2410.02416) combined with Classifier-Free Guidance (CFG) in the denoising loop. `HunyuanImagePipeline` has a `guider` component (read more about [Guider](../modular_diffusers/guiders.md)) and does not take a `guidance_scale` parameter at runtime. To change guider-related parameters, e.g., `guidance_scale`, you can update the `guider` configuration instead.
HunyuanImage-2.1 applies [Adaptive Projected Guidance (APG)](https://huggingface.co/papers/2410.02416) combined with Classifier-Free Guidance (CFG) in the denoising loop. `HunyuanImagePipeline` has a `guider` component (read more about [Guider](../../using-diffusers/guiders)) and does not take a `guidance_scale` parameter at runtime. To change guider-related parameters, e.g., `guidance_scale`, you can update the `guider` configuration instead.
```python
import torch

View File

@@ -44,6 +44,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
| [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image |
| [ControlNet-XS](controlnetxs) | text2image |
| [ControlNet-XS with Stable Diffusion XL](controlnetxs_sdxl) | text2image |
| [Cosmos](cosmos) | text2video, video2video |
| [Dance Diffusion](dance_diffusion) | unconditional audio generation |
| [DDIM](ddim) | unconditional image generation |
| [DDPM](ddpm) | unconditional image generation |

View File

@@ -565,4 +565,16 @@ $ git push --set-upstream origin your-branch-for-syncing
### Style guide
For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
## Coding with AI agents
The repository keeps AI-agent configuration in `.ai/` and exposes local agent files via symlinks.
- **Source of truth** — edit files under `.ai/` (`AGENTS.md` for coding guidelines, `skills/` for on-demand task knowledge)
- **Don't edit** generated root-level `AGENTS.md`, `CLAUDE.md`, or `.agents/skills`/`.claude/skills` — they are symlinks
- Setup commands:
- `make codex` — symlink guidelines + skills for OpenAI Codex
- `make claude` — symlink guidelines + skills for Claude Code
- `make clean-ai` — remove all generated symlinks

View File

@@ -338,7 +338,7 @@ guider = ClassifierFreeGuidance(guidance_scale=5.0)
pipeline.update_components(guider=guider)
```
See the [Guiders](./guiders) guide for more details on available guiders and how to configure them.
See the [Guiders](../using-diffusers/guiders) guide for more details on available guiders and how to configure them.
## Splitting a pipeline into stages

View File

@@ -39,7 +39,7 @@ The Modular Diffusers docs are organized as shown below.
- [ModularPipeline](./modular_pipeline) shows you how to create and convert pipeline blocks into an executable [`ModularPipeline`].
- [ComponentsManager](./components_manager) shows you how to manage and reuse components across multiple pipelines.
- [Guiders](./guiders) shows you how to use different guidance methods in the pipeline.
- [Guiders](../using-diffusers/guiders) shows you how to use different guidance methods in the pipeline.
## Mellon Integration

View File

@@ -482,144 +482,6 @@ print(
) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
```
## torch.jit.trace
[torch.jit.trace](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) records the operations a model performs on a sample input and creates a new, optimized representation of the model based on the recorded execution path. During tracing, the model is optimized to reduce overhead from Python and dynamic control flows and operations are fused together for more efficiency. The returned executable or [ScriptFunction](https://pytorch.org/docs/stable/generated/torch.jit.ScriptFunction.html) can be compiled.
```py
import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# torch disable grad
torch.set_grad_enabled(False)

# set variables
n_experiments = 2
unet_runs_per_experiment = 50


# load sample inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")
unet = pipeline.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use channels_last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # set return_dict=False as default

# warmup
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")

# warmup and optimize graph
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)

# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")

    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# save the model
unet_traced.save("unet_traced.pt")
```
Replace the pipeline's UNet with the traced version.
```py
import torch
from diffusers import StableDiffusionPipeline
from dataclasses import dataclass


@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor


pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

# use jitted unet
unet_traced = torch.jit.load("unet_traced.pt")


# del pipeline.unet
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipeline.unet.config.in_channels
        self.device = pipeline.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


pipeline.unet = TracedUNet()

prompt = "a photo of an astronaut riding a horse on mars"
with torch.inference_mode():
    image = pipeline([prompt] * 1, num_inference_steps=50).images[0]
```
## Memory-efficient attention
> [!TIP]
> Memory-efficient attention optimizes for memory usage *and* [inference speed](./fp16#scaled-dot-product-attention)!
The Transformers attention mechanism is memory-intensive, especially for long sequences, so you can try using different and more memory-efficient attention types.
By default, if PyTorch >= 2.0 is installed, [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) is used. You don't need to make any additional changes to your code.
SDPA supports [FlashAttention](https://github.com/Dao-AILab/flash-attention) and [xFormers](https://github.com/facebookresearch/xformers) as well as a native C++ PyTorch implementation. It automatically selects the most optimal implementation based on your input.
You can explicitly use xFormers with the [`~ModelMixin.enable_xformers_memory_efficient_attention`] method.
```py
# pip install xformers
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_xformers_memory_efficient_attention()
```
Call [`~ModelMixin.disable_xformers_memory_efficient_attention`] to disable it.
```py
pipeline.disable_xformers_memory_efficient_attention()
```
Diffusers supports multiple memory-efficient attention backends (FlashAttention, xFormers, SageAttention, and more) through [`~ModelMixin.set_attention_backend`]. Refer to the [Attention backends](./attention_backends) guide to learn how to switch between them.
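A rough sketch of switching backends with [`~ModelMixin.set_attention_backend`]; the model id, the `"flash"` backend name, and the `reset_attention_backend` call are assumptions based on the Attention backends guide rather than part of this diff.
```python
import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# "flash" assumes flash-attn is installed; see the Attention backends guide for the valid names.
pipeline.transformer.set_attention_backend("flash")
image = pipeline("a photo of an astronaut riding a horse on mars").images[0]

# Assumed companion method to fall back to the default SDPA path.
pipeline.transformer.reset_attention_backend()
```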

View File

@@ -23,7 +23,7 @@ pip install xformers
> [!TIP]
> The xFormers `pip` package requires the latest version of PyTorch. If you need to use a previous version of PyTorch, then we recommend [installing xFormers from the source](https://github.com/facebookresearch/xformers#installing-xformers).
After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption as shown in this [section](memory#memory-efficient-attention).
After xFormers is installed, you can use it with [`~ModelMixin.set_attention_backend`] as shown in the [Attention backends](./attention_backends) guide.
> [!WARNING]
> According to this [issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or DreamBooth) in some GPUs. If you observe this problem, please install a development version as indicated in the issue comments.

View File

@@ -14,6 +14,8 @@
sections:
- local: using-diffusers/schedulers
title: Load schedulers and models
- local: using-diffusers/guiders
title: Guiders
- title: Inference
isExpanded: false
@@ -80,8 +82,6 @@
title: ModularPipeline
- local: modular_diffusers/components_manager
title: ComponentsManager
- local: modular_diffusers/guiders
title: Guiders
- title: Training
isExpanded: false

View File

@@ -7,7 +7,7 @@ import safetensors.torch
import torch
from accelerate import init_empty_weights
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, Gemma3ForConditionalGeneration
from transformers import AutoTokenizer, Gemma3ForConditionalGeneration, Gemma3Processor
from diffusers import (
AutoencoderKLLTX2Audio,
@@ -17,7 +17,7 @@ from diffusers import (
LTX2Pipeline,
LTX2VideoTransformer3DModel,
)
from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel, LTX2TextConnectors, LTX2Vocoder
from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel, LTX2TextConnectors, LTX2Vocoder, LTX2VocoderWithBWE
from diffusers.utils.import_utils import is_accelerate_available
@@ -44,6 +44,12 @@ LTX_2_0_TRANSFORMER_KEYS_RENAME_DICT = {
"k_norm": "norm_k",
}
LTX_2_3_TRANSFORMER_KEYS_RENAME_DICT = {
**LTX_2_0_TRANSFORMER_KEYS_RENAME_DICT,
"audio_prompt_adaln_single": "audio_prompt_adaln",
"prompt_adaln_single": "prompt_adaln",
}
LTX_2_0_VIDEO_VAE_RENAME_DICT = {
# Encoder
"down_blocks.0": "down_blocks.0",
@@ -72,6 +78,13 @@ LTX_2_0_VIDEO_VAE_RENAME_DICT = {
"per_channel_statistics.std-of-means": "latents_std",
}
LTX_2_3_VIDEO_VAE_RENAME_DICT = {
**LTX_2_0_VIDEO_VAE_RENAME_DICT,
# Decoder extra blocks
"up_blocks.7": "up_blocks.3.upsamplers.0",
"up_blocks.8": "up_blocks.3",
}
LTX_2_0_AUDIO_VAE_RENAME_DICT = {
"per_channel_statistics.mean-of-means": "latents_mean",
"per_channel_statistics.std-of-means": "latents_std",
@@ -84,10 +97,34 @@ LTX_2_0_VOCODER_RENAME_DICT = {
"conv_post": "conv_out",
}
LTX_2_0_TEXT_ENCODER_RENAME_DICT = {
LTX_2_3_VOCODER_RENAME_DICT = {
# Handle upsamplers ("ups" --> "upsamplers") due to name clash
"resblocks": "resnets",
"conv_pre": "conv_in",
"conv_post": "conv_out",
"act_post": "act_out",
"downsample.lowpass": "downsample",
}
LTX_2_0_CONNECTORS_KEYS_RENAME_DICT = {
"connectors.": "",
"video_embeddings_connector": "video_connector",
"audio_embeddings_connector": "audio_connector",
"transformer_1d_blocks": "transformer_blocks",
"text_embedding_projection.aggregate_embed": "text_proj_in",
# Attention QK Norms
"q_norm": "norm_q",
"k_norm": "norm_k",
}
LTX_2_3_CONNECTORS_KEYS_RENAME_DICT = {
"connectors.": "",
"video_embeddings_connector": "video_connector",
"audio_embeddings_connector": "audio_connector",
"transformer_1d_blocks": "transformer_blocks",
# LTX-2.3 uses per-modality embedding projections
"text_embedding_projection.audio_aggregate_embed": "audio_text_proj_in",
"text_embedding_projection.video_aggregate_embed": "video_text_proj_in",
# Attention QK Norms
"q_norm": "norm_q",
"k_norm": "norm_k",
@@ -129,23 +166,24 @@ def convert_ltx2_audio_vae_per_channel_statistics(key: str, state_dict: dict[str
return
def convert_ltx2_3_vocoder_upsamplers(key: str, state_dict: dict[str, Any]) -> None:
# Skip if not a weight, bias
if ".weight" not in key and ".bias" not in key:
return
if ".ups." in key:
new_key = key.replace(".ups.", ".upsamplers.")
param = state_dict.pop(key)
state_dict[new_key] = param
return
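For context, the `*_KEYS_RENAME_DICT` and `*_SPECIAL_KEYS_REMAP` tables above are consumed by a substring-replace loop plus in-place handler calls. A minimal sketch of that pattern, using the hypothetical helper name `apply_rename_and_remap`; the actual `convert_ltx2_*` functions in this script run the equivalent loop internally.
```python
from typing import Any


def apply_rename_and_remap(
    state_dict: dict[str, Any],
    rename_dict: dict[str, str],
    special_keys_remap: dict[str, Any],
) -> dict[str, Any]:
    # Substring-based key renaming, e.g. "k_norm" -> "norm_k" or "conv_post" -> "conv_out".
    for key in list(state_dict.keys()):
        new_key = key
        for old, new in rename_dict.items():
            new_key = new_key.replace(old, new)
        if new_key != key:
            state_dict[new_key] = state_dict.pop(key)
    # Special handlers (such as convert_ltx2_3_vocoder_upsamplers above) mutate the dict in place.
    for key in list(state_dict.keys()):
        for pattern, handler in special_keys_remap.items():
            if pattern in key and key in state_dict:
                handler(key, state_dict)
    return state_dict
```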
LTX_2_0_TRANSFORMER_SPECIAL_KEYS_REMAP = {
"video_embeddings_connector": remove_keys_inplace,
"audio_embeddings_connector": remove_keys_inplace,
"adaln_single": convert_ltx2_transformer_adaln_single,
}
LTX_2_0_CONNECTORS_KEYS_RENAME_DICT = {
"connectors.": "",
"video_embeddings_connector": "video_connector",
"audio_embeddings_connector": "audio_connector",
"transformer_1d_blocks": "transformer_blocks",
"text_embedding_projection.aggregate_embed": "text_proj_in",
# Attention QK Norms
"q_norm": "norm_q",
"k_norm": "norm_k",
}
LTX_2_0_VAE_SPECIAL_KEYS_REMAP = {
"per_channel_statistics.channel": remove_keys_inplace,
"per_channel_statistics.mean-of-stds": remove_keys_inplace,
@@ -155,13 +193,19 @@ LTX_2_0_AUDIO_VAE_SPECIAL_KEYS_REMAP = {}
LTX_2_0_VOCODER_SPECIAL_KEYS_REMAP = {}
LTX_2_3_VOCODER_SPECIAL_KEYS_REMAP = {
".ups.": convert_ltx2_3_vocoder_upsamplers,
}
LTX_2_0_CONNECTORS_SPECIAL_KEYS_REMAP = {}
def split_transformer_and_connector_state_dict(state_dict: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
connector_prefixes = (
"video_embeddings_connector",
"audio_embeddings_connector",
"transformer_1d_blocks",
"text_embedding_projection.aggregate_embed",
"text_embedding_projection",
"connectors.",
"video_connector",
"audio_connector",
@@ -225,7 +269,7 @@ def get_ltx2_transformer_config(version: str) -> tuple[dict[str, Any], dict[str,
special_keys_remap = LTX_2_0_TRANSFORMER_SPECIAL_KEYS_REMAP
elif version == "2.0":
config = {
"model_id": "diffusers-internal-dev/new-ltx-model",
"model_id": "Lightricks/LTX-2",
"diffusers_config": {
"in_channels": 128,
"out_channels": 128,
@@ -238,6 +282,8 @@ def get_ltx2_transformer_config(version: str) -> tuple[dict[str, Any], dict[str,
"pos_embed_max_pos": 20,
"base_height": 2048,
"base_width": 2048,
"gated_attn": False,
"cross_attn_mod": False,
"audio_in_channels": 128,
"audio_out_channels": 128,
"audio_patch_size": 1,
@@ -249,6 +295,8 @@ def get_ltx2_transformer_config(version: str) -> tuple[dict[str, Any], dict[str,
"audio_pos_embed_max_pos": 20,
"audio_sampling_rate": 16000,
"audio_hop_length": 160,
"audio_gated_attn": False,
"audio_cross_attn_mod": False,
"num_layers": 48,
"activation_fn": "gelu-approximate",
"qk_norm": "rms_norm_across_heads",
@@ -263,10 +311,62 @@ def get_ltx2_transformer_config(version: str) -> tuple[dict[str, Any], dict[str,
"timestep_scale_multiplier": 1000,
"cross_attn_timestep_scale_multiplier": 1000,
"rope_type": "split",
"use_prompt_embeddings": True,
"perturbed_attn": False,
},
}
rename_dict = LTX_2_0_TRANSFORMER_KEYS_RENAME_DICT
special_keys_remap = LTX_2_0_TRANSFORMER_SPECIAL_KEYS_REMAP
elif version == "2.3":
config = {
"model_id": "Lightricks/LTX-2.3",
"diffusers_config": {
"in_channels": 128,
"out_channels": 128,
"patch_size": 1,
"patch_size_t": 1,
"num_attention_heads": 32,
"attention_head_dim": 128,
"cross_attention_dim": 4096,
"vae_scale_factors": (8, 32, 32),
"pos_embed_max_pos": 20,
"base_height": 2048,
"base_width": 2048,
"gated_attn": True,
"cross_attn_mod": True,
"audio_in_channels": 128,
"audio_out_channels": 128,
"audio_patch_size": 1,
"audio_patch_size_t": 1,
"audio_num_attention_heads": 32,
"audio_attention_head_dim": 64,
"audio_cross_attention_dim": 2048,
"audio_scale_factor": 4,
"audio_pos_embed_max_pos": 20,
"audio_sampling_rate": 16000,
"audio_hop_length": 160,
"audio_gated_attn": True,
"audio_cross_attn_mod": True,
"num_layers": 48,
"activation_fn": "gelu-approximate",
"qk_norm": "rms_norm_across_heads",
"norm_elementwise_affine": False,
"norm_eps": 1e-6,
"caption_channels": 3840,
"attention_bias": True,
"attention_out_bias": True,
"rope_theta": 10000.0,
"rope_double_precision": True,
"causal_offset": 1,
"timestep_scale_multiplier": 1000,
"cross_attn_timestep_scale_multiplier": 1000,
"rope_type": "split",
"use_prompt_embeddings": False,
"perturbed_attn": True,
},
}
rename_dict = LTX_2_3_TRANSFORMER_KEYS_RENAME_DICT
special_keys_remap = LTX_2_0_TRANSFORMER_SPECIAL_KEYS_REMAP
return config, rename_dict, special_keys_remap
@@ -293,7 +393,7 @@ def get_ltx2_connectors_config(version: str) -> tuple[dict[str, Any], dict[str,
}
elif version == "2.0":
config = {
"model_id": "diffusers-internal-dev/new-ltx-model",
"model_id": "Lightricks/LTX-2",
"diffusers_config": {
"caption_channels": 3840,
"text_proj_in_factor": 49,
@@ -301,20 +401,52 @@ def get_ltx2_connectors_config(version: str) -> tuple[dict[str, Any], dict[str,
"video_connector_attention_head_dim": 128,
"video_connector_num_layers": 2,
"video_connector_num_learnable_registers": 128,
"video_gated_attn": False,
"audio_connector_num_attention_heads": 30,
"audio_connector_attention_head_dim": 128,
"audio_connector_num_layers": 2,
"audio_connector_num_learnable_registers": 128,
"audio_gated_attn": False,
"connector_rope_base_seq_len": 4096,
"rope_theta": 10000.0,
"rope_double_precision": True,
"causal_temporal_positioning": False,
"rope_type": "split",
"per_modality_projections": False,
"proj_bias": False,
},
}
rename_dict = LTX_2_0_CONNECTORS_KEYS_RENAME_DICT
special_keys_remap = {}
rename_dict = LTX_2_0_CONNECTORS_KEYS_RENAME_DICT
special_keys_remap = LTX_2_0_CONNECTORS_SPECIAL_KEYS_REMAP
elif version == "2.3":
config = {
"model_id": "Lightricks/LTX-2.3",
"diffusers_config": {
"caption_channels": 3840,
"text_proj_in_factor": 49,
"video_connector_num_attention_heads": 32,
"video_connector_attention_head_dim": 128,
"video_connector_num_layers": 8,
"video_connector_num_learnable_registers": 128,
"video_gated_attn": True,
"audio_connector_num_attention_heads": 32,
"audio_connector_attention_head_dim": 64,
"audio_connector_num_layers": 8,
"audio_connector_num_learnable_registers": 128,
"audio_gated_attn": True,
"connector_rope_base_seq_len": 4096,
"rope_theta": 10000.0,
"rope_double_precision": True,
"causal_temporal_positioning": False,
"rope_type": "split",
"per_modality_projections": True,
"video_hidden_dim": 4096,
"audio_hidden_dim": 2048,
"proj_bias": True,
},
}
rename_dict = LTX_2_3_CONNECTORS_KEYS_RENAME_DICT
special_keys_remap = LTX_2_0_CONNECTORS_SPECIAL_KEYS_REMAP
return config, rename_dict, special_keys_remap
@@ -416,7 +548,7 @@ def get_ltx2_video_vae_config(
special_keys_remap = LTX_2_0_VAE_SPECIAL_KEYS_REMAP
elif version == "2.0":
config = {
"model_id": "diffusers-internal-dev/dummy-ltx2",
"model_id": "Lightricks/LTX-2",
"diffusers_config": {
"in_channels": 3,
"out_channels": 3,
@@ -435,6 +567,7 @@ def get_ltx2_video_vae_config(
"decoder_spatio_temporal_scaling": (True, True, True),
"decoder_inject_noise": (False, False, False, False),
"downsample_type": ("spatial", "temporal", "spatiotemporal", "spatiotemporal"),
"upsample_type": ("spatiotemporal", "spatiotemporal", "spatiotemporal"),
"upsample_residual": (True, True, True),
"upsample_factor": (2, 2, 2),
"timestep_conditioning": timestep_conditioning,
@@ -451,6 +584,44 @@ def get_ltx2_video_vae_config(
}
rename_dict = LTX_2_0_VIDEO_VAE_RENAME_DICT
special_keys_remap = LTX_2_0_VAE_SPECIAL_KEYS_REMAP
elif version == "2.3":
config = {
"model_id": "Lightricks/LTX-2.3",
"diffusers_config": {
"in_channels": 3,
"out_channels": 3,
"latent_channels": 128,
"block_out_channels": (256, 512, 1024, 1024),
"down_block_types": (
"LTX2VideoDownBlock3D",
"LTX2VideoDownBlock3D",
"LTX2VideoDownBlock3D",
"LTX2VideoDownBlock3D",
),
"decoder_block_out_channels": (256, 512, 512, 1024),
"layers_per_block": (4, 6, 4, 2, 2),
"decoder_layers_per_block": (4, 6, 4, 2, 2),
"spatio_temporal_scaling": (True, True, True, True),
"decoder_spatio_temporal_scaling": (True, True, True, True),
"decoder_inject_noise": (False, False, False, False, False),
"downsample_type": ("spatial", "temporal", "spatiotemporal", "spatiotemporal"),
"upsample_type": ("spatiotemporal", "spatiotemporal", "temporal", "spatial"),
"upsample_residual": (False, False, False, False),
"upsample_factor": (2, 2, 1, 2),
"timestep_conditioning": timestep_conditioning,
"patch_size": 4,
"patch_size_t": 1,
"resnet_norm_eps": 1e-6,
"encoder_causal": True,
"decoder_causal": False,
"encoder_spatial_padding_mode": "zeros",
"decoder_spatial_padding_mode": "zeros",
"spatial_compression_ratio": 32,
"temporal_compression_ratio": 8,
},
}
rename_dict = LTX_2_3_VIDEO_VAE_RENAME_DICT
special_keys_remap = LTX_2_0_VAE_SPECIAL_KEYS_REMAP
return config, rename_dict, special_keys_remap
@@ -485,7 +656,7 @@ def convert_ltx2_video_vae(
def get_ltx2_audio_vae_config(version: str) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
if version == "2.0":
config = {
"model_id": "diffusers-internal-dev/new-ltx-model",
"model_id": "Lightricks/LTX-2",
"diffusers_config": {
"base_channels": 128,
"output_channels": 2,
@@ -508,6 +679,31 @@ def get_ltx2_audio_vae_config(version: str) -> tuple[dict[str, Any], dict[str, A
}
rename_dict = LTX_2_0_AUDIO_VAE_RENAME_DICT
special_keys_remap = LTX_2_0_AUDIO_VAE_SPECIAL_KEYS_REMAP
elif version == "2.3":
config = {
"model_id": "Lightricks/LTX-2.3",
"diffusers_config": {
"base_channels": 128,
"output_channels": 2,
"ch_mult": (1, 2, 4),
"num_res_blocks": 2,
"attn_resolutions": None,
"in_channels": 2,
"resolution": 256,
"latent_channels": 8,
"norm_type": "pixel",
"causality_axis": "height",
"dropout": 0.0,
"mid_block_add_attention": False,
"sample_rate": 16000,
"mel_hop_length": 160,
"is_causal": True,
"mel_bins": 64,
"double_z": True,
}, # Same config as LTX-2.0
}
rename_dict = LTX_2_0_AUDIO_VAE_RENAME_DICT
special_keys_remap = LTX_2_0_AUDIO_VAE_SPECIAL_KEYS_REMAP
return config, rename_dict, special_keys_remap
@@ -540,7 +736,7 @@ def convert_ltx2_audio_vae(original_state_dict: dict[str, Any], version: str) ->
def get_ltx2_vocoder_config(version: str) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
if version == "2.0":
config = {
"model_id": "diffusers-internal-dev/new-ltx-model",
"model_id": "Lightricks/LTX-2",
"diffusers_config": {
"in_channels": 128,
"hidden_channels": 1024,
@@ -549,21 +745,71 @@ def get_ltx2_vocoder_config(version: str) -> tuple[dict[str, Any], dict[str, Any
"upsample_factors": [6, 5, 2, 2, 2],
"resnet_kernel_sizes": [3, 7, 11],
"resnet_dilations": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
"act_fn": "leaky_relu",
"leaky_relu_negative_slope": 0.1,
"antialias": False,
"final_act_fn": "tanh",
"final_bias": True,
"output_sampling_rate": 24000,
},
}
rename_dict = LTX_2_0_VOCODER_RENAME_DICT
special_keys_remap = LTX_2_0_VOCODER_SPECIAL_KEYS_REMAP
elif version == "2.3":
config = {
"model_id": "Lightricks/LTX-2.3",
"diffusers_config": {
"in_channels": 128,
"hidden_channels": 1536,
"out_channels": 2,
"upsample_kernel_sizes": [11, 4, 4, 4, 4, 4],
"upsample_factors": [5, 2, 2, 2, 2, 2],
"resnet_kernel_sizes": [3, 7, 11],
"resnet_dilations": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
"act_fn": "snakebeta",
"leaky_relu_negative_slope": 0.1,
"antialias": True,
"antialias_ratio": 2,
"antialias_kernel_size": 12,
"final_act_fn": None,
"final_bias": False,
"bwe_in_channels": 128,
"bwe_hidden_channels": 512,
"bwe_out_channels": 2,
"bwe_upsample_kernel_sizes": [12, 11, 4, 4, 4],
"bwe_upsample_factors": [6, 5, 2, 2, 2],
"bwe_resnet_kernel_sizes": [3, 7, 11],
"bwe_resnet_dilations": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
"bwe_act_fn": "snakebeta",
"bwe_leaky_relu_negative_slope": 0.1,
"bwe_antialias": True,
"bwe_antialias_ratio": 2,
"bwe_antialias_kernel_size": 12,
"bwe_final_act_fn": None,
"bwe_final_bias": False,
"filter_length": 512,
"hop_length": 80,
"window_length": 512,
"num_mel_channels": 64,
"input_sampling_rate": 16000,
"output_sampling_rate": 48000,
},
}
rename_dict = LTX_2_3_VOCODER_RENAME_DICT
special_keys_remap = LTX_2_3_VOCODER_SPECIAL_KEYS_REMAP
return config, rename_dict, special_keys_remap
def convert_ltx2_vocoder(original_state_dict: dict[str, Any], version: str) -> dict[str, Any]:
config, rename_dict, special_keys_remap = get_ltx2_vocoder_config(version)
diffusers_config = config["diffusers_config"]
if version == "2.3":
vocoder_cls = LTX2VocoderWithBWE
else:
vocoder_cls = LTX2Vocoder
with init_empty_weights():
vocoder = LTX2Vocoder.from_config(diffusers_config)
vocoder = vocoder_cls.from_config(diffusers_config)
# Handle official code --> diffusers key remapping via the remap dict
for key in list(original_state_dict.keys()):
@@ -594,6 +840,18 @@ def get_ltx2_spatial_latent_upsampler_config(version: str):
"spatial_upsample": True,
"temporal_upsample": False,
"rational_spatial_scale": 2.0,
"use_rational_resampler": True,
}
elif version == "2.3":
config = {
"in_channels": 128,
"mid_channels": 1024,
"num_blocks_per_stage": 4,
"dims": 3,
"spatial_upsample": True,
"temporal_upsample": False,
"rational_spatial_scale": 2.0,
"use_rational_resampler": False,
}
else:
raise ValueError(f"Unsupported version: {version}")
@@ -651,13 +909,17 @@ def get_model_state_dict_from_combined_ckpt(combined_ckpt: dict[str, Any], prefi
model_state_dict = {}
for param_name, param in combined_ckpt.items():
if param_name.startswith(prefix):
model_state_dict[param_name.replace(prefix, "")] = param
model_state_dict[param_name.removeprefix(prefix)] = param
if prefix == "model.diffusion_model.":
# Some checkpoints store the text connector projection outside the diffusion model prefix.
connector_key = "text_embedding_projection.aggregate_embed.weight"
if connector_key in combined_ckpt and connector_key not in model_state_dict:
model_state_dict[connector_key] = combined_ckpt[connector_key]
connector_prefixes = ["text_embedding_projection"]
for param_name, param in combined_ckpt.items():
for prefix in connector_prefixes:
if param_name.startswith(prefix):
# Check to make sure we're not overwriting an existing key
if param_name not in model_state_dict:
model_state_dict[param_name] = combined_ckpt[param_name]
return model_state_dict
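The switch from `str.replace` to `str.removeprefix` above matters whenever the prefix also appears later in a key; `replace` strips every occurrence while `removeprefix` only strips the leading one. A small illustration with a hypothetical key:
```python
prefix = "model.diffusion_model."
key = "model.diffusion_model.blocks.0.model.diffusion_model.weight"  # hypothetical key

print(key.replace(prefix, ""))   # blocks.0.weight  (every occurrence stripped)
print(key.removeprefix(prefix))  # blocks.0.model.diffusion_model.weight  (leading prefix only)
```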
@@ -686,7 +948,7 @@ def get_args():
"--version",
type=str,
default="2.0",
choices=["test", "2.0"],
choices=["test", "2.0", "2.3"],
help="Version of the LTX 2.0 model",
)
@@ -748,6 +1010,11 @@ def get_args():
action="store_true",
help="Whether to save a latent upsampling pipeline",
)
parser.add_argument(
"--add_processor",
action="store_true",
help="Whether to add a Gemma3Processor to the pipeline for prompt enhancement.",
)
parser.add_argument("--vae_dtype", type=str, default="bf16", choices=["fp32", "fp16", "bf16"])
parser.add_argument("--audio_vae_dtype", type=str, default="bf16", choices=["fp32", "fp16", "bf16"])
@@ -756,6 +1023,12 @@ def get_args():
parser.add_argument("--text_encoder_dtype", type=str, default="bf16", choices=["fp32", "fp16", "bf16"])
parser.add_argument("--output_path", type=str, required=True, help="Path where converted model should be saved")
parser.add_argument(
"--upsample_output_path",
type=str,
default=None,
help="Path where converted upsampling pipeline should be saved",
)
return parser.parse_args()
@@ -787,7 +1060,7 @@ def main(args):
args.audio_vae,
args.dit,
args.vocoder,
args.text_encoder,
args.connectors,
args.full_pipeline,
args.upsample_pipeline,
]
@@ -852,7 +1125,12 @@ def main(args):
if not args.full_pipeline:
tokenizer.save_pretrained(os.path.join(args.output_path, "tokenizer"))
if args.latent_upsampler or args.full_pipeline or args.upsample_pipeline:
if args.add_processor:
processor = Gemma3Processor.from_pretrained(args.text_encoder_model_id)
if not args.full_pipeline:
processor.save_pretrained(os.path.join(args.output_path, "processor"))
if args.latent_upsampler or args.upsample_pipeline:
original_latent_upsampler_ckpt = load_hub_or_local_checkpoint(
repo_id=args.original_state_dict_repo_id, filename=args.latent_upsampler_filename
)
@@ -866,14 +1144,26 @@ def main(args):
latent_upsampler.save_pretrained(os.path.join(args.output_path, "latent_upsampler"))
if args.full_pipeline:
scheduler = FlowMatchEulerDiscreteScheduler(
use_dynamic_shifting=True,
base_shift=0.95,
max_shift=2.05,
base_image_seq_len=1024,
max_image_seq_len=4096,
shift_terminal=0.1,
)
is_distilled_ckpt = "distilled" in args.combined_filename
if is_distilled_ckpt:
# Disable dynamic shifting and terminal shift so that distilled sigmas are used as-is
scheduler = FlowMatchEulerDiscreteScheduler(
use_dynamic_shifting=False,
base_shift=0.95,
max_shift=2.05,
base_image_seq_len=1024,
max_image_seq_len=4096,
shift_terminal=None,
)
else:
scheduler = FlowMatchEulerDiscreteScheduler(
use_dynamic_shifting=True,
base_shift=0.95,
max_shift=2.05,
base_image_seq_len=1024,
max_image_seq_len=4096,
shift_terminal=0.1,
)
pipe = LTX2Pipeline(
scheduler=scheduler,
@@ -891,10 +1181,12 @@ def main(args):
if args.upsample_pipeline:
pipe = LTX2LatentUpsamplePipeline(vae=vae, latent_upsampler=latent_upsampler)
# Put latent upsampling pipeline in its own subdirectory so it doesn't mess with the full pipeline
pipe.save_pretrained(
os.path.join(args.output_path, "upsample_pipeline"), safe_serialization=True, max_shard_size="5GB"
)
# As two diffusers pipelines cannot be in the same directory, save the upsampling pipeline to its own directory
if args.upsample_output_path:
upsample_output_path = args.upsample_output_path
else:
upsample_output_path = args.output_path
pipe.save_pretrained(upsample_output_path, safe_serialization=True, max_shard_size="5GB")
if __name__ == "__main__":

View File

@@ -12,6 +12,7 @@ from termcolor import colored
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import (
AutoencoderKLLTX2Video,
AutoencoderKLWan,
DPMSolverMultistepScheduler,
FlowMatchEulerDiscreteScheduler,
@@ -24,7 +25,10 @@ from diffusers.utils.import_utils import is_accelerate_available
CTX = init_empty_weights if is_accelerate_available else nullcontext
ckpt_ids = ["Efficient-Large-Model/SANA-Video_2B_480p/checkpoints/SANA_Video_2B_480p.pth"]
ckpt_ids = [
"Efficient-Large-Model/SANA-Video_2B_480p/checkpoints/SANA_Video_2B_480p.pth",
"Efficient-Large-Model/SANA-Video_2B_720p/checkpoints/SANA_Video_2B_720p_LTXVAE.pth",
]
# https://github.com/NVlabs/Sana/blob/main/inference_video_scripts/inference_sana_video.py
@@ -92,12 +96,22 @@ def main(args):
if args.video_size == 480:
sample_size = 30 # Wan-VAE: 8xp2 downsample factor
patch_size = (1, 2, 2)
in_channels = 16
out_channels = 16
elif args.video_size == 720:
sample_size = 22 # Wan-VAE: 32xp1 downsample factor
sample_size = 22 # DC-AE-V: 32xp1 downsample factor
patch_size = (1, 1, 1)
in_channels = 32
out_channels = 32
else:
raise ValueError(f"Video size {args.video_size} is not supported.")
if args.vae_type == "ltx2":
sample_size = 22
patch_size = (1, 1, 1)
in_channels = 128
out_channels = 128
for depth in range(layer_num):
# Transformer blocks.
converted_state_dict[f"transformer_blocks.{depth}.scale_shift_table"] = state_dict.pop(
@@ -182,8 +196,8 @@ def main(args):
# Transformer
with CTX():
transformer_kwargs = {
"in_channels": 16,
"out_channels": 16,
"in_channels": in_channels,
"out_channels": out_channels,
"num_attention_heads": 20,
"attention_head_dim": 112,
"num_layers": 20,
@@ -235,9 +249,12 @@ def main(args):
else:
print(colored(f"Saving the whole Pipeline containing {args.model_type}", "green", attrs=["bold"]))
# VAE
vae = AutoencoderKLWan.from_pretrained(
"Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
if args.vae_type == "ltx2":
vae_path = args.vae_path or "Lightricks/LTX-2"
vae = AutoencoderKLLTX2Video.from_pretrained(vae_path, subfolder="vae", torch_dtype=torch.float32)
else:
vae_path = args.vae_path or "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(vae_path, subfolder="vae", torch_dtype=torch.float32)
# Text Encoder
text_encoder_model_path = "Efficient-Large-Model/gemma-2-2b-it"
@@ -314,7 +331,23 @@ if __name__ == "__main__":
choices=["flow-dpm_solver", "flow-euler", "uni-pc"],
help="Scheduler type to use.",
)
parser.add_argument("--task", default="t2v", type=str, required=True, help="Task to convert, t2v or i2v.")
parser.add_argument(
"--vae_type",
default="wan",
type=str,
choices=["wan", "ltx2"],
help="VAE type to use for saving full pipeline (ltx2 uses patchify 1x1x1).",
)
parser.add_argument(
"--vae_path",
default=None,
type=str,
required=False,
help="Optional VAE path or repo id. If not set, a default is used per VAE type.",
)
parser.add_argument(
"--task", default="t2v", type=str, required=True, choices=["t2v", "i2v"], help="Task to convert, t2v or i2v."
)
parser.add_argument("--dump_path", default=None, type=str, required=True, help="Path to the output pipeline.")
parser.add_argument("--save_full_pipeline", action="store_true", help="save all the pipeline elements in one.")
parser.add_argument("--dtype", default="fp32", type=str, choices=["fp32", "fp16", "bf16"], help="Weight dtype.")

View File

@@ -434,6 +434,12 @@ else:
"FluxKontextAutoBlocks",
"FluxKontextModularPipeline",
"FluxModularPipeline",
"HeliosAutoBlocks",
"HeliosModularPipeline",
"HeliosPyramidAutoBlocks",
"HeliosPyramidDistilledAutoBlocks",
"HeliosPyramidDistilledModularPipeline",
"HeliosPyramidModularPipeline",
"QwenImageAutoBlocks",
"QwenImageEditAutoBlocks",
"QwenImageEditModularPipeline",
@@ -504,6 +510,7 @@ else:
"EasyAnimateControlPipeline",
"EasyAnimateInpaintPipeline",
"EasyAnimatePipeline",
"Flux2KleinKVPipeline",
"Flux2KleinPipeline",
"Flux2Pipeline",
"FluxControlImg2ImgPipeline",
@@ -1188,6 +1195,12 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
FluxKontextAutoBlocks,
FluxKontextModularPipeline,
FluxModularPipeline,
HeliosAutoBlocks,
HeliosModularPipeline,
HeliosPyramidAutoBlocks,
HeliosPyramidDistilledAutoBlocks,
HeliosPyramidDistilledModularPipeline,
HeliosPyramidModularPipeline,
QwenImageAutoBlocks,
QwenImageEditAutoBlocks,
QwenImageEditModularPipeline,
@@ -1254,6 +1267,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
EasyAnimateControlPipeline,
EasyAnimateInpaintPipeline,
EasyAnimatePipeline,
Flux2KleinKVPipeline,
Flux2KleinPipeline,
Flux2Pipeline,
FluxControlImg2ImgPipeline,

View File

@@ -2156,6 +2156,9 @@ def _convert_non_diffusers_ltx2_lora_to_diffusers(state_dict, non_diffusers_pref
"scale_shift_table_a2v_ca_audio": "audio_a2v_cross_attn_scale_shift_table",
"q_norm": "norm_q",
"k_norm": "norm_k",
# LTX-2.3
"audio_prompt_adaln_single": "audio_prompt_adaln",
"prompt_adaln_single": "prompt_adaln",
}
else:
rename_dict = {"aggregate_embed": "text_proj_in"}
@@ -2538,8 +2541,12 @@ def _convert_non_diffusers_z_image_lora_to_diffusers(state_dict):
def get_alpha_scales(down_weight, alpha_key):
rank = down_weight.shape[0]
alpha = state_dict.pop(alpha_key).item()
scale = alpha / rank # LoRA is scaled by 'alpha / rank' in forward pass, so we need to scale it back here
alpha_tensor = state_dict.pop(alpha_key, None)
if alpha_tensor is None:
return 1.0, 1.0
scale = (
alpha_tensor.item() / rank
) # LoRA is scaled by 'alpha / rank' in forward pass, so we need to scale it back here
scale_down = scale
scale_up = 1.0
while scale_down * 2 < scale_up:

View File

@@ -60,6 +60,16 @@ class ContextParallelConfig:
rotate_method (`str`, *optional*, defaults to `"allgather"`):
Method to use for rotating key/value states across devices in ring attention. Currently, only `"allgather"`
is supported.
ulysses_anything (`bool`, *optional*, defaults to `False`):
Whether to enable "Ulysses Anything" mode, which supports arbitrary sequence lengths and head counts that
are not evenly divisible by `ulysses_degree`. When enabled, `ulysses_degree` must be greater than 1 and
`ring_degree` must be 1.
mesh (`torch.distributed.device_mesh.DeviceMesh`, *optional*):
A custom device mesh to use for context parallelism. If provided, this mesh will be used instead of
creating a new one. This is useful when combining context parallelism with other parallelism strategies
(e.g., FSDP, tensor parallelism) that share the same device mesh. The mesh must have both "ring" and
"ulysses" dimensions. Use size 1 for dimensions not being used (e.g., `mesh_shape=(2, 1, 4)` with
`mesh_dim_names=("ring", "ulysses", "fsdp")` for ring attention only with FSDP).
"""
@@ -68,6 +78,7 @@ class ContextParallelConfig:
convert_to_fp32: bool = True
# TODO: support alltoall
rotate_method: Literal["allgather", "alltoall"] = "allgather"
mesh: torch.distributed.device_mesh.DeviceMesh | None = None
# Whether to enable ulysses anything attention to support
# any sequence lengths and any head numbers.
ulysses_anything: bool = False
@@ -124,7 +135,7 @@ class ContextParallelConfig:
f"The product of `ring_degree` ({self.ring_degree}) and `ulysses_degree` ({self.ulysses_degree}) must not exceed the world size ({world_size})."
)
self._flattened_mesh = self._mesh._flatten()
self._flattened_mesh = self._mesh["ring", "ulysses"]._flatten()
self._ring_mesh = self._mesh["ring"]
self._ulysses_mesh = self._mesh["ulysses"]
self._ring_local_rank = self._ring_mesh.get_local_rank()
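A minimal sketch of the new `mesh` argument, assuming `ContextParallelConfig` is importable from the top-level `diffusers` namespace and that `torch.distributed` has already been initialized (for example via `torchrun`). The 8-GPU layout mirrors the docstring example above:
```python
import torch
from diffusers import ContextParallelConfig  # assumed top-level export

# 8 GPUs arranged as ring=2, ulysses=1, fsdp=4 (ring attention combined with FSDP).
mesh = torch.distributed.device_mesh.init_device_mesh(
    "cuda",
    mesh_shape=(2, 1, 4),
    mesh_dim_names=("ring", "ulysses", "fsdp"),
)

# The config reuses the provided mesh instead of creating its own; only the "ring" and
# "ulysses" dimensions are consumed here, the "fsdp" dimension stays free for sharding.
cp_config = ContextParallelConfig(ring_degree=2, ulysses_degree=1, mesh=mesh)
```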

View File

@@ -237,7 +237,7 @@ class LTX2VideoResnetBlock3d(nn.Module):
# Like LTX 1.0 LTXVideoDownsampler3d, but uses new causal Conv3d
class LTXVideoDownsampler3d(nn.Module):
class LTX2VideoDownsampler3d(nn.Module):
def __init__(
self,
in_channels: int,
@@ -285,10 +285,11 @@ class LTXVideoDownsampler3d(nn.Module):
# Like LTX 1.0 LTXVideoUpsampler3d, but uses new causal Conv3d
class LTXVideoUpsampler3d(nn.Module):
class LTX2VideoUpsampler3d(nn.Module):
def __init__(
self,
in_channels: int,
out_channels: int | None = None,
stride: int | tuple[int, int, int] = 1,
residual: bool = False,
upscale_factor: int = 1,
@@ -300,7 +301,8 @@ class LTXVideoUpsampler3d(nn.Module):
self.residual = residual
self.upscale_factor = upscale_factor
out_channels = (in_channels * stride[0] * stride[1] * stride[2]) // upscale_factor
out_channels = out_channels or in_channels
out_channels = (out_channels * stride[0] * stride[1] * stride[2]) // upscale_factor
self.conv = LTX2VideoCausalConv3d(
in_channels=in_channels,
@@ -408,7 +410,7 @@ class LTX2VideoDownBlock3D(nn.Module):
)
elif downsample_type == "spatial":
self.downsamplers.append(
LTXVideoDownsampler3d(
LTX2VideoDownsampler3d(
in_channels=in_channels,
out_channels=out_channels,
stride=(1, 2, 2),
@@ -417,7 +419,7 @@ class LTX2VideoDownBlock3D(nn.Module):
)
elif downsample_type == "temporal":
self.downsamplers.append(
LTXVideoDownsampler3d(
LTX2VideoDownsampler3d(
in_channels=in_channels,
out_channels=out_channels,
stride=(2, 1, 1),
@@ -426,7 +428,7 @@ class LTX2VideoDownBlock3D(nn.Module):
)
elif downsample_type == "spatiotemporal":
self.downsamplers.append(
LTXVideoDownsampler3d(
LTX2VideoDownsampler3d(
in_channels=in_channels,
out_channels=out_channels,
stride=(2, 2, 2),
@@ -580,6 +582,7 @@ class LTX2VideoUpBlock3d(nn.Module):
resnet_eps: float = 1e-6,
resnet_act_fn: str = "swish",
spatio_temporal_scale: bool = True,
upsample_type: str = "spatiotemporal",
inject_noise: bool = False,
timestep_conditioning: bool = False,
upsample_residual: bool = False,
@@ -609,16 +612,23 @@ class LTX2VideoUpBlock3d(nn.Module):
self.upsamplers = None
if spatio_temporal_scale:
self.upsamplers = nn.ModuleList(
[
LTXVideoUpsampler3d(
out_channels * upscale_factor,
stride=(2, 2, 2),
residual=upsample_residual,
upscale_factor=upscale_factor,
spatial_padding_mode=spatial_padding_mode,
)
]
self.upsamplers = nn.ModuleList()
if upsample_type == "spatial":
upsample_stride = (1, 2, 2)
elif upsample_type == "temporal":
upsample_stride = (2, 1, 1)
elif upsample_type == "spatiotemporal":
upsample_stride = (2, 2, 2)
self.upsamplers.append(
LTX2VideoUpsampler3d(
in_channels=out_channels * upscale_factor,
stride=upsample_stride,
residual=upsample_residual,
upscale_factor=upscale_factor,
spatial_padding_mode=spatial_padding_mode,
)
)
resnets = []
@@ -716,7 +726,7 @@ class LTX2VideoEncoder3d(nn.Module):
"LTX2VideoDownBlock3D",
"LTX2VideoDownBlock3D",
),
spatio_temporal_scaling: tuple[bool, ...] = (True, True, True, True),
spatio_temporal_scaling: bool | tuple[bool, ...] = (True, True, True, True),
layers_per_block: tuple[int, ...] = (4, 6, 6, 2, 2),
downsample_type: tuple[str, ...] = ("spatial", "temporal", "spatiotemporal", "spatiotemporal"),
patch_size: int = 4,
@@ -726,6 +736,9 @@ class LTX2VideoEncoder3d(nn.Module):
spatial_padding_mode: str = "zeros",
):
super().__init__()
num_encoder_blocks = len(layers_per_block)
if isinstance(spatio_temporal_scaling, bool):
spatio_temporal_scaling = (spatio_temporal_scaling,) * (num_encoder_blocks - 1)
self.patch_size = patch_size
self.patch_size_t = patch_size_t
@@ -860,19 +873,27 @@ class LTX2VideoDecoder3d(nn.Module):
in_channels: int = 128,
out_channels: int = 3,
block_out_channels: tuple[int, ...] = (256, 512, 1024),
spatio_temporal_scaling: tuple[bool, ...] = (True, True, True),
spatio_temporal_scaling: bool | tuple[bool, ...] = (True, True, True),
layers_per_block: tuple[int, ...] = (5, 5, 5, 5),
upsample_type: tuple[str, ...] = ("spatiotemporal", "spatiotemporal", "spatiotemporal"),
patch_size: int = 4,
patch_size_t: int = 1,
resnet_norm_eps: float = 1e-6,
is_causal: bool = False,
inject_noise: tuple[bool, ...] = (False, False, False),
inject_noise: bool | tuple[bool, ...] = (False, False, False),
timestep_conditioning: bool = False,
upsample_residual: tuple[bool, ...] = (True, True, True),
upsample_residual: bool | tuple[bool, ...] = (True, True, True),
upsample_factor: tuple[bool, ...] = (2, 2, 2),
spatial_padding_mode: str = "reflect",
) -> None:
super().__init__()
num_decoder_blocks = len(layers_per_block)
if isinstance(spatio_temporal_scaling, bool):
spatio_temporal_scaling = (spatio_temporal_scaling,) * (num_decoder_blocks - 1)
if isinstance(inject_noise, bool):
inject_noise = (inject_noise,) * num_decoder_blocks
if isinstance(upsample_residual, bool):
upsample_residual = (upsample_residual,) * (num_decoder_blocks - 1)
self.patch_size = patch_size
self.patch_size_t = patch_size_t
@@ -917,6 +938,7 @@ class LTX2VideoDecoder3d(nn.Module):
num_layers=layers_per_block[i + 1],
resnet_eps=resnet_norm_eps,
spatio_temporal_scale=spatio_temporal_scaling[i],
upsample_type=upsample_type[i],
inject_noise=inject_noise[i + 1],
timestep_conditioning=timestep_conditioning,
upsample_residual=upsample_residual[i],
@@ -1058,11 +1080,12 @@ class AutoencoderKLLTX2Video(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
decoder_block_out_channels: tuple[int, ...] = (256, 512, 1024),
layers_per_block: tuple[int, ...] = (4, 6, 6, 2, 2),
decoder_layers_per_block: tuple[int, ...] = (5, 5, 5, 5),
spatio_temporal_scaling: tuple[bool, ...] = (True, True, True, True),
decoder_spatio_temporal_scaling: tuple[bool, ...] = (True, True, True),
decoder_inject_noise: tuple[bool, ...] = (False, False, False, False),
spatio_temporal_scaling: bool | tuple[bool, ...] = (True, True, True, True),
decoder_spatio_temporal_scaling: bool | tuple[bool, ...] = (True, True, True),
decoder_inject_noise: bool | tuple[bool, ...] = (False, False, False, False),
downsample_type: tuple[str, ...] = ("spatial", "temporal", "spatiotemporal", "spatiotemporal"),
upsample_residual: tuple[bool, ...] = (True, True, True),
upsample_type: tuple[str, ...] = ("spatiotemporal", "spatiotemporal", "spatiotemporal"),
upsample_residual: bool | tuple[bool, ...] = (True, True, True),
upsample_factor: tuple[int, ...] = (2, 2, 2),
timestep_conditioning: bool = False,
patch_size: int = 4,
@@ -1077,6 +1100,16 @@ class AutoencoderKLLTX2Video(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
temporal_compression_ratio: int = None,
) -> None:
super().__init__()
num_encoder_blocks = len(layers_per_block)
num_decoder_blocks = len(decoder_layers_per_block)
if isinstance(spatio_temporal_scaling, bool):
spatio_temporal_scaling = (spatio_temporal_scaling,) * (num_encoder_blocks - 1)
if isinstance(decoder_spatio_temporal_scaling, bool):
decoder_spatio_temporal_scaling = (decoder_spatio_temporal_scaling,) * (num_decoder_blocks - 1)
if isinstance(decoder_inject_noise, bool):
decoder_inject_noise = (decoder_inject_noise,) * num_decoder_blocks
if isinstance(upsample_residual, bool):
upsample_residual = (upsample_residual,) * (num_decoder_blocks - 1)
self.encoder = LTX2VideoEncoder3d(
in_channels=in_channels,
@@ -1098,6 +1131,7 @@ class AutoencoderKLLTX2Video(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
block_out_channels=decoder_block_out_channels,
spatio_temporal_scaling=decoder_spatio_temporal_scaling,
layers_per_block=decoder_layers_per_block,
upsample_type=upsample_type,
patch_size=patch_size,
patch_size_t=patch_size_t,
resnet_norm_eps=resnet_norm_eps,

View File

@@ -1567,7 +1567,7 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
mesh = None
if config.context_parallel_config is not None:
cp_config = config.context_parallel_config
mesh = torch.distributed.device_mesh.init_device_mesh(
mesh = cp_config.mesh or torch.distributed.device_mesh.init_device_mesh(
device_type=device_type,
mesh_shape=cp_config.mesh_shape,
mesh_dim_names=cp_config.mesh_dim_names,

View File

@@ -13,6 +13,7 @@
# limitations under the License.
import inspect
from dataclasses import dataclass
from typing import Any
import torch
@@ -21,7 +22,7 @@ import torch.nn.functional as F
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FluxTransformer2DLoadersMixin, FromOriginalModelMixin, PeftAdapterMixin
from ...utils import apply_lora_scale, logging
from ...utils import BaseOutput, apply_lora_scale, logging
from .._modeling_parallel import ContextParallelInput, ContextParallelOutput
from ..attention import AttentionMixin, AttentionModuleMixin
from ..attention_dispatch import dispatch_attention_fn
@@ -32,7 +33,6 @@ from ..embeddings import (
apply_rotary_emb,
get_1d_rotary_pos_embed,
)
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
from ..normalization import AdaLayerNormContinuous
@@ -40,6 +40,216 @@ from ..normalization import AdaLayerNormContinuous
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
@dataclass
class Flux2Transformer2DModelOutput(BaseOutput):
"""
The output of [`Flux2Transformer2DModel`].
Args:
sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)`):
The hidden states output conditioned on the `encoder_hidden_states` input.
kv_cache (`Flux2KVCache`, *optional*):
The populated KV cache for reference image tokens. Only returned when `kv_cache_mode="extract"`.
"""
sample: "torch.Tensor" # noqa: F821
kv_cache: "Flux2KVCache | None" = None
class Flux2KVLayerCache:
"""Per-layer KV cache for reference image tokens in the Flux2 Klein KV model.
Stores the K and V projections (post-RoPE) for reference tokens extracted during the first denoising step. Tensor
format: (batch_size, num_ref_tokens, num_heads, head_dim).
"""
def __init__(self):
self.k_ref: torch.Tensor | None = None
self.v_ref: torch.Tensor | None = None
def store(self, k_ref: torch.Tensor, v_ref: torch.Tensor):
"""Store reference token K/V."""
self.k_ref = k_ref
self.v_ref = v_ref
def get(self) -> tuple[torch.Tensor, torch.Tensor]:
"""Retrieve cached reference token K/V."""
if self.k_ref is None:
raise RuntimeError("KV cache has not been populated yet.")
return self.k_ref, self.v_ref
def clear(self):
self.k_ref = None
self.v_ref = None
class Flux2KVCache:
"""Container for all layers' reference-token KV caches.
Holds separate cache lists for double-stream and single-stream transformer blocks.
"""
def __init__(self, num_double_layers: int, num_single_layers: int):
self.double_block_caches = [Flux2KVLayerCache() for _ in range(num_double_layers)]
self.single_block_caches = [Flux2KVLayerCache() for _ in range(num_single_layers)]
self.num_ref_tokens: int = 0
def get_double(self, layer_idx: int) -> Flux2KVLayerCache:
return self.double_block_caches[layer_idx]
def get_single(self, layer_idx: int) -> Flux2KVLayerCache:
return self.single_block_caches[layer_idx]
def clear(self):
for cache in self.double_block_caches:
cache.clear()
for cache in self.single_block_caches:
cache.clear()
self.num_ref_tokens = 0
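A toy walkthrough of the cache container above, assuming the classes from this module are in scope; the layer counts, token count, and tensor shapes are made up for the example.
```python
import torch

cache = Flux2KVCache(num_double_layers=2, num_single_layers=3)
cache.num_ref_tokens = 16

# "extract" step: each block's processor stores the post-RoPE reference K/V it computed.
k_ref = torch.randn(1, 16, 24, 128)  # (batch, num_ref_tokens, heads, head_dim)
v_ref = torch.randn(1, 16, 24, 128)
cache.get_double(0).store(k_ref, v_ref)

# "cached" steps: the same K/V are read back and injected between the text and image tokens.
k_cached, v_cached = cache.get_double(0).get()
assert torch.equal(k_cached, k_ref)

cache.clear()  # reset before a new prompt or a new set of reference images
```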
def _flux2_kv_causal_attention(
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
num_txt_tokens: int,
num_ref_tokens: int,
kv_cache: Flux2KVLayerCache | None = None,
backend=None,
) -> torch.Tensor:
"""Causal attention for KV caching where reference tokens only self-attend.
All tensors use the diffusers convention: (batch_size, seq_len, num_heads, head_dim).
Without cache (extract mode): sequence layout is [txt, ref, img]. txt+img tokens attend to all tokens, ref tokens
only attend to themselves. With cache (cached mode): sequence layout is [txt, img]. Cached ref K/V are injected
between txt and img.
"""
# No ref tokens and no cache — standard full attention
if num_ref_tokens == 0 and kv_cache is None:
return dispatch_attention_fn(query, key, value, backend=backend)
if kv_cache is not None:
# Cached mode: inject ref K/V between txt and img
k_ref, v_ref = kv_cache.get()
k_all = torch.cat([key[:, :num_txt_tokens], k_ref, key[:, num_txt_tokens:]], dim=1)
v_all = torch.cat([value[:, :num_txt_tokens], v_ref, value[:, num_txt_tokens:]], dim=1)
return dispatch_attention_fn(query, k_all, v_all, backend=backend)
# Extract mode: ref tokens self-attend, txt+img attend to all
ref_start = num_txt_tokens
ref_end = num_txt_tokens + num_ref_tokens
q_txt = query[:, :ref_start]
q_ref = query[:, ref_start:ref_end]
q_img = query[:, ref_end:]
k_txt = key[:, :ref_start]
k_ref = key[:, ref_start:ref_end]
k_img = key[:, ref_end:]
v_txt = value[:, :ref_start]
v_ref = value[:, ref_start:ref_end]
v_img = value[:, ref_end:]
# txt+img attend to all tokens
q_txt_img = torch.cat([q_txt, q_img], dim=1)
k_all = torch.cat([k_txt, k_ref, k_img], dim=1)
v_all = torch.cat([v_txt, v_ref, v_img], dim=1)
attn_txt_img = dispatch_attention_fn(q_txt_img, k_all, v_all, backend=backend)
attn_txt = attn_txt_img[:, :ref_start]
attn_img = attn_txt_img[:, ref_start:]
# ref tokens self-attend only
attn_ref = dispatch_attention_fn(q_ref, k_ref, v_ref, backend=backend)
return torch.cat([attn_txt, attn_ref, attn_img], dim=1)
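The split-and-concatenate logic above is equivalent to running full attention under a block mask in which reference tokens attend only to themselves while text and image tokens attend to everything. A small sketch of that mask with toy token counts (not taken from the source):
```python
import torch

num_txt, num_ref, num_img = 2, 3, 4  # toy sizes for a [txt, ref, img] layout
total = num_txt + num_ref + num_img

# attend[i, j] == True means query token i may attend to key token j.
attend = torch.ones(total, total, dtype=torch.bool)
ref = slice(num_txt, num_txt + num_ref)
attend[ref, :] = False   # reference queries see nothing ...
attend[ref, ref] = True  # ... except the reference block itself

print(attend.int())
```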
def _blend_mod_params(
img_params: tuple[torch.Tensor, ...],
ref_params: tuple[torch.Tensor, ...],
num_ref: int,
seq_len: int,
) -> tuple[torch.Tensor, ...]:
"""Blend modulation parameters so that the first `num_ref` positions use `ref_params`."""
blended = []
for im, rm in zip(img_params, ref_params):
if im.ndim == 2:
im = im.unsqueeze(1)
rm = rm.unsqueeze(1)
B = im.shape[0]
blended.append(
torch.cat(
[rm.expand(B, num_ref, -1), im.expand(B, seq_len, -1)[:, num_ref:, :]],
dim=1,
)
)
return tuple(blended)
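A toy call to the helper above, showing how the first `num_ref` positions pick up the reference modulation while the remaining positions keep the image modulation; the shapes and values are arbitrary.
```python
import torch

img_params = (torch.zeros(1, 4),)  # modulation computed for image tokens, shape (batch, dim)
ref_params = (torch.ones(1, 4),)   # modulation computed for reference tokens

blended = _blend_mod_params(img_params, ref_params, num_ref=2, seq_len=5)[0]
print(blended.shape)     # torch.Size([1, 5, 4])
print(blended[0, :, 0])  # tensor([1., 1., 0., 0., 0.]) -> reference params on the first two positions
```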
def _blend_double_block_mods(
img_mod: torch.Tensor,
ref_mod: torch.Tensor,
num_ref: int,
seq_len: int,
) -> torch.Tensor:
"""Blend double-block image-stream modulations for a [ref, img] sequence layout.
Takes raw modulation tensors (before `Flux2Modulation.split`) and returns a blended raw tensor that is compatible
with `Flux2Modulation.split(mod, 2)`.
"""
if img_mod.ndim == 2:
img_mod = img_mod.unsqueeze(1)
ref_mod = ref_mod.unsqueeze(1)
img_chunks = torch.chunk(img_mod, 6, dim=-1)
ref_chunks = torch.chunk(ref_mod, 6, dim=-1)
img_mods = (img_chunks[0:3], img_chunks[3:6])
ref_mods = (ref_chunks[0:3], ref_chunks[3:6])
all_params = []
for img_set, ref_set in zip(img_mods, ref_mods):
blended = _blend_mod_params(img_set, ref_set, num_ref, seq_len)
all_params.extend(blended)
return torch.cat(all_params, dim=-1)
def _blend_single_block_mods(
single_mod: torch.Tensor,
ref_mod: torch.Tensor,
num_txt: int,
num_ref: int,
seq_len: int,
) -> torch.Tensor:
"""Blend single-block modulations for a [txt, ref, img] sequence layout.
Takes raw modulation tensors and returns a blended raw tensor compatible with `Flux2Modulation.split(mod, 1)`.
"""
if single_mod.ndim == 2:
single_mod = single_mod.unsqueeze(1)
ref_mod = ref_mod.unsqueeze(1)
img_params = torch.chunk(single_mod, 3, dim=-1)
ref_params = torch.chunk(ref_mod, 3, dim=-1)
blended = []
for im, rm in zip(img_params, ref_params):
if im.ndim == 2:
im = im.unsqueeze(1)
rm = rm.unsqueeze(1)
B = im.shape[0]
im_expanded = im.expand(B, seq_len, -1)
rm_expanded = rm.expand(B, num_ref, -1)
blended.append(
torch.cat(
[im_expanded[:, :num_txt, :], rm_expanded, im_expanded[:, num_txt + num_ref :, :]],
dim=1,
)
)
return torch.cat(blended, dim=-1)
def _get_projections(attn: "Flux2Attention", hidden_states, encoder_hidden_states=None):
query = attn.to_q(hidden_states)
key = attn.to_k(hidden_states)
@@ -181,9 +391,108 @@ class Flux2AttnProcessor:
return hidden_states
class Flux2KVAttnProcessor:
"""
Attention processor for Flux2 double-stream blocks with KV caching support for reference image tokens.
When `kv_cache_mode` is "extract", reference token K/V are stored in the cache after RoPE and causal attention is
used (ref tokens self-attend only, txt+img attend to all). When `kv_cache_mode` is "cached", cached ref K/V are
injected during attention. When no KV args are provided, behaves identically to `Flux2AttnProcessor`.
"""
_attention_backend = None
_parallel_config = None
def __init__(self):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError(f"{self.__class__.__name__} requires PyTorch 2.0. Please upgrade your pytorch version.")
def __call__(
self,
attn: "Flux2Attention",
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor = None,
attention_mask: torch.Tensor | None = None,
image_rotary_emb: torch.Tensor | None = None,
kv_cache: Flux2KVLayerCache | None = None,
kv_cache_mode: str | None = None,
num_ref_tokens: int = 0,
) -> torch.Tensor:
query, key, value, encoder_query, encoder_key, encoder_value = _get_qkv_projections(
attn, hidden_states, encoder_hidden_states
)
query = query.unflatten(-1, (attn.heads, -1))
key = key.unflatten(-1, (attn.heads, -1))
value = value.unflatten(-1, (attn.heads, -1))
query = attn.norm_q(query)
key = attn.norm_k(key)
if attn.added_kv_proj_dim is not None:
encoder_query = encoder_query.unflatten(-1, (attn.heads, -1))
encoder_key = encoder_key.unflatten(-1, (attn.heads, -1))
encoder_value = encoder_value.unflatten(-1, (attn.heads, -1))
encoder_query = attn.norm_added_q(encoder_query)
encoder_key = attn.norm_added_k(encoder_key)
query = torch.cat([encoder_query, query], dim=1)
key = torch.cat([encoder_key, key], dim=1)
value = torch.cat([encoder_value, value], dim=1)
if image_rotary_emb is not None:
query = apply_rotary_emb(query, image_rotary_emb, sequence_dim=1)
key = apply_rotary_emb(key, image_rotary_emb, sequence_dim=1)
num_txt_tokens = encoder_hidden_states.shape[1] if encoder_hidden_states is not None else 0
# Extract ref K/V from the combined sequence
if kv_cache_mode == "extract" and kv_cache is not None and num_ref_tokens > 0:
ref_start = num_txt_tokens
ref_end = num_txt_tokens + num_ref_tokens
kv_cache.store(key[:, ref_start:ref_end].clone(), value[:, ref_start:ref_end].clone())
# Dispatch attention
if kv_cache_mode == "extract" and num_ref_tokens > 0:
hidden_states = _flux2_kv_causal_attention(
query, key, value, num_txt_tokens, num_ref_tokens, backend=self._attention_backend
)
elif kv_cache_mode == "cached" and kv_cache is not None:
hidden_states = _flux2_kv_causal_attention(
query, key, value, num_txt_tokens, 0, kv_cache=kv_cache, backend=self._attention_backend
)
else:
hidden_states = dispatch_attention_fn(
query,
key,
value,
attn_mask=attention_mask,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
hidden_states = hidden_states.flatten(2, 3)
hidden_states = hidden_states.to(query.dtype)
if encoder_hidden_states is not None:
encoder_hidden_states, hidden_states = hidden_states.split_with_sizes(
[encoder_hidden_states.shape[1], hidden_states.shape[1] - encoder_hidden_states.shape[1]], dim=1
)
encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
hidden_states = attn.to_out[0](hidden_states)
hidden_states = attn.to_out[1](hidden_states)
if encoder_hidden_states is not None:
return hidden_states, encoder_hidden_states
else:
return hidden_states
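To make the "extract" attention pattern above concrete, here is a small illustrative sketch (an assumption, not code from this diff) of the mask it implies: ref-token queries attend only to the ref block, while txt and img queries attend to the full sequence. The token counts are arbitrary.

import torch

num_txt, num_ref, num_img = 2, 3, 4
total = num_txt + num_ref + num_img

# True = allowed to attend
mask = torch.ones(total, total, dtype=torch.bool)
ref = slice(num_txt, num_txt + num_ref)
mask[ref, :] = False   # ref queries see nothing...
mask[ref, ref] = True  # ...except the ref tokens themselves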
class Flux2Attention(torch.nn.Module, AttentionModuleMixin):
_default_processor_cls = Flux2AttnProcessor
_available_processors = [Flux2AttnProcessor]
_available_processors = [Flux2AttnProcessor, Flux2KVAttnProcessor]
def __init__(
self,
@@ -312,6 +621,90 @@ class Flux2ParallelSelfAttnProcessor:
return hidden_states
class Flux2KVParallelSelfAttnProcessor:
"""
Attention processor for Flux2 single-stream blocks with KV caching support for reference image tokens.
When `kv_cache_mode` is "extract", reference token K/V are stored and causal attention is used. When
`kv_cache_mode` is "cached", cached ref K/V are injected during attention. When no KV args are provided, behaves
identically to `Flux2ParallelSelfAttnProcessor`.
"""
_attention_backend = None
_parallel_config = None
def __init__(self):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError(f"{self.__class__.__name__} requires PyTorch 2.0. Please upgrade your pytorch version.")
def __call__(
self,
attn: "Flux2ParallelSelfAttention",
hidden_states: torch.Tensor,
attention_mask: torch.Tensor | None = None,
image_rotary_emb: torch.Tensor | None = None,
kv_cache: Flux2KVLayerCache | None = None,
kv_cache_mode: str | None = None,
num_txt_tokens: int = 0,
num_ref_tokens: int = 0,
) -> torch.Tensor:
# Parallel in (QKV + MLP in) projection
hidden_states_proj = attn.to_qkv_mlp_proj(hidden_states)
qkv, mlp_hidden_states = torch.split(
hidden_states_proj, [3 * attn.inner_dim, attn.mlp_hidden_dim * attn.mlp_mult_factor], dim=-1
)
query, key, value = qkv.chunk(3, dim=-1)
query = query.unflatten(-1, (attn.heads, -1))
key = key.unflatten(-1, (attn.heads, -1))
value = value.unflatten(-1, (attn.heads, -1))
query = attn.norm_q(query)
key = attn.norm_k(key)
if image_rotary_emb is not None:
query = apply_rotary_emb(query, image_rotary_emb, sequence_dim=1)
key = apply_rotary_emb(key, image_rotary_emb, sequence_dim=1)
# Extract ref K/V from the combined sequence
if kv_cache_mode == "extract" and kv_cache is not None and num_ref_tokens > 0:
ref_start = num_txt_tokens
ref_end = num_txt_tokens + num_ref_tokens
kv_cache.store(key[:, ref_start:ref_end].clone(), value[:, ref_start:ref_end].clone())
# Dispatch attention
if kv_cache_mode == "extract" and num_ref_tokens > 0:
attn_output = _flux2_kv_causal_attention(
query, key, value, num_txt_tokens, num_ref_tokens, backend=self._attention_backend
)
elif kv_cache_mode == "cached" and kv_cache is not None:
attn_output = _flux2_kv_causal_attention(
query, key, value, num_txt_tokens, 0, kv_cache=kv_cache, backend=self._attention_backend
)
else:
attn_output = dispatch_attention_fn(
query,
key,
value,
attn_mask=attention_mask,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
attn_output = attn_output.flatten(2, 3)
attn_output = attn_output.to(query.dtype)
# Handle the feedforward (FF) logic
mlp_hidden_states = attn.mlp_act_fn(mlp_hidden_states)
# Concatenate and parallel output projection
hidden_states = torch.cat([attn_output, mlp_hidden_states], dim=-1)
hidden_states = attn.to_out(hidden_states)
return hidden_states
class Flux2ParallelSelfAttention(torch.nn.Module, AttentionModuleMixin):
"""
Flux 2 parallel self-attention for the Flux 2 single-stream transformer blocks.
@@ -322,7 +715,7 @@ class Flux2ParallelSelfAttention(torch.nn.Module, AttentionModuleMixin):
"""
_default_processor_cls = Flux2ParallelSelfAttnProcessor
_available_processors = [Flux2ParallelSelfAttnProcessor]
_available_processors = [Flux2ParallelSelfAttnProcessor, Flux2KVParallelSelfAttnProcessor]
# Does not support QKV fusion as the QKV projections are always fused
_supports_qkv_fusion = False
@@ -780,6 +1173,8 @@ class Flux2Transformer2DModel(
self.gradient_checkpointing = False
_skip_keys = ["kv_cache"]
@apply_lora_scale("joint_attention_kwargs")
def forward(
self,
@@ -791,19 +1186,21 @@ class Flux2Transformer2DModel(
guidance: torch.Tensor = None,
joint_attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> torch.Tensor | Transformer2DModelOutput:
kv_cache: "Flux2KVCache | None" = None,
kv_cache_mode: str | None = None,
num_ref_tokens: int = 0,
ref_fixed_timestep: float = 0.0,
) -> torch.Tensor | Flux2Transformer2DModelOutput:
"""
The [`FluxTransformer2DModel`] forward method.
The [`Flux2Transformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep ( `torch.LongTensor`):
timestep (`torch.LongTensor`):
Used to indicate denoising step.
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
A list of tensors that if specified are added to the residuals of transformer blocks.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
@@ -811,13 +1208,23 @@ class Flux2Transformer2DModel(
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
kv_cache (`Flux2KVCache`, *optional*):
KV cache for reference image tokens. When `kv_cache_mode` is "extract", a new cache is created and
returned. When "cached", the provided cache is used to inject ref K/V during attention.
kv_cache_mode (`str`, *optional*):
One of "extract" (first step with ref tokens) or "cached" (subsequent steps using cached ref K/V). When
`None`, standard forward pass without KV caching.
num_ref_tokens (`int`, defaults to `0`):
Number of reference image tokens prepended to `hidden_states` (only used when
`kv_cache_mode="extract"`).
ref_fixed_timestep (`float`, defaults to `0.0`):
Fixed timestep for reference token modulation (only used when `kv_cache_mode="extract"`).
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
`tuple` where the first element is the sample tensor. When `kv_cache_mode="extract"`, also returns the
populated `Flux2KVCache`.
"""
# 0. Handle input arguments
num_txt_tokens = encoder_hidden_states.shape[1]
# 1. Calculate timestep embedding and modulation parameters
@@ -832,13 +1239,33 @@ class Flux2Transformer2DModel(
double_stream_mod_txt = self.double_stream_modulation_txt(temb)
single_stream_mod = self.single_stream_modulation(temb)
# KV extract mode: create cache and blend modulations for ref tokens
if kv_cache_mode == "extract" and num_ref_tokens > 0:
num_img_tokens = hidden_states.shape[1] # includes ref tokens
kv_cache = Flux2KVCache(
num_double_layers=len(self.transformer_blocks),
num_single_layers=len(self.single_transformer_blocks),
)
kv_cache.num_ref_tokens = num_ref_tokens
# Ref tokens use a fixed timestep for modulation
ref_timestep = torch.full_like(timestep, ref_fixed_timestep * 1000)
ref_temb = self.time_guidance_embed(ref_timestep, guidance)
ref_double_mod_img = self.double_stream_modulation_img(ref_temb)
ref_single_mod = self.single_stream_modulation(ref_temb)
# Blend double block img modulation: [ref_mod, img_mod]
double_stream_mod_img = _blend_double_block_mods(
double_stream_mod_img, ref_double_mod_img, num_ref_tokens, num_img_tokens
)
# 2. Input projection for image (hidden_states) and conditioning text (encoder_hidden_states)
hidden_states = self.x_embedder(hidden_states)
encoder_hidden_states = self.context_embedder(encoder_hidden_states)
# 3. Calculate RoPE embeddings from image and text tokens
# NOTE: the logic below means that we can't support batched inference with images of different resolutions or
# text prompts of different lengths. Is this a use case we want to support?
if img_ids.ndim == 3:
img_ids = img_ids[0]
if txt_ids.ndim == 3:
@@ -851,8 +1278,29 @@ class Flux2Transformer2DModel(
torch.cat([text_rotary_emb[1], image_rotary_emb[1]], dim=0),
)
# 4. Double Stream Transformer Blocks
# 4. Build joint_attention_kwargs with KV cache info
if kv_cache_mode == "extract":
kv_attn_kwargs = {
**(joint_attention_kwargs or {}),
"kv_cache": None,
"kv_cache_mode": "extract",
"num_ref_tokens": num_ref_tokens,
}
elif kv_cache_mode == "cached" and kv_cache is not None:
kv_attn_kwargs = {
**(joint_attention_kwargs or {}),
"kv_cache": None,
"kv_cache_mode": "cached",
"num_ref_tokens": kv_cache.num_ref_tokens,
}
else:
kv_attn_kwargs = joint_attention_kwargs
# 5. Double Stream Transformer Blocks
for index_block, block in enumerate(self.transformer_blocks):
if kv_cache_mode is not None and kv_cache is not None:
kv_attn_kwargs["kv_cache"] = kv_cache.get_double(index_block)
if torch.is_grad_enabled() and self.gradient_checkpointing:
encoder_hidden_states, hidden_states = self._gradient_checkpointing_func(
block,
@@ -861,7 +1309,7 @@ class Flux2Transformer2DModel(
double_stream_mod_img,
double_stream_mod_txt,
concat_rotary_emb,
joint_attention_kwargs,
kv_attn_kwargs,
)
else:
encoder_hidden_states, hidden_states = block(
@@ -870,13 +1318,30 @@ class Flux2Transformer2DModel(
temb_mod_img=double_stream_mod_img,
temb_mod_txt=double_stream_mod_txt,
image_rotary_emb=concat_rotary_emb,
joint_attention_kwargs=joint_attention_kwargs,
joint_attention_kwargs=kv_attn_kwargs,
)
# Concatenate text and image streams for single-block inference
hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)
# 5. Single Stream Transformer Blocks
# Blend single block modulation for extract mode: [txt_mod, ref_mod, img_mod]
if kv_cache_mode == "extract" and num_ref_tokens > 0:
total_single_len = hidden_states.shape[1]
single_stream_mod = _blend_single_block_mods(
single_stream_mod, ref_single_mod, num_txt_tokens, num_ref_tokens, total_single_len
)
# Build single-block KV kwargs (single blocks need num_txt_tokens)
if kv_cache_mode is not None:
kv_attn_kwargs_single = {**kv_attn_kwargs, "num_txt_tokens": num_txt_tokens}
else:
kv_attn_kwargs_single = kv_attn_kwargs
# 6. Single Stream Transformer Blocks
for index_block, block in enumerate(self.single_transformer_blocks):
if kv_cache_mode is not None and kv_cache is not None:
kv_attn_kwargs_single["kv_cache"] = kv_cache.get_single(index_block)
if torch.is_grad_enabled() and self.gradient_checkpointing:
hidden_states = self._gradient_checkpointing_func(
block,
@@ -884,7 +1349,7 @@ class Flux2Transformer2DModel(
None,
single_stream_mod,
concat_rotary_emb,
joint_attention_kwargs,
kv_attn_kwargs_single,
)
else:
hidden_states = block(
@@ -892,16 +1357,25 @@ class Flux2Transformer2DModel(
encoder_hidden_states=None,
temb_mod=single_stream_mod,
image_rotary_emb=concat_rotary_emb,
joint_attention_kwargs=joint_attention_kwargs,
joint_attention_kwargs=kv_attn_kwargs_single,
)
# Remove text tokens from concatenated stream
hidden_states = hidden_states[:, num_txt_tokens:, ...]
# 6. Output layers
# Remove text tokens (and ref tokens in extract mode) from concatenated stream
if kv_cache_mode == "extract" and num_ref_tokens > 0:
hidden_states = hidden_states[:, num_txt_tokens + num_ref_tokens :, ...]
else:
hidden_states = hidden_states[:, num_txt_tokens:, ...]
# 7. Output layers
hidden_states = self.norm_out(hidden_states, temb)
output = self.proj_out(hidden_states)
if kv_cache_mode == "extract":
if not return_dict:
return (output, kv_cache)
return Flux2Transformer2DModelOutput(sample=output, kv_cache=kv_cache)
if not return_dict:
return (output,)
return Transformer2DModelOutput(sample=output)
return Flux2Transformer2DModelOutput(sample=output)
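A minimal sketch (an assumption for illustration, not code from this diff) of how a pipeline might drive the two-pass reference KV cache: the first denoising step runs with `kv_cache_mode="extract"` and reference tokens prepended to the latents, and later steps reuse the returned cache with `kv_cache_mode="cached"`. Names such as `transformer`, `scheduler`, `latents`, `ref_latents`, `prompt_embeds`, `img_ids`, and `txt_ids` are assumed to be prepared by the caller; timestep scaling and guidance are omitted.

import torch

def denoise_with_ref_kv_cache(transformer, scheduler, latents, ref_latents, prompt_embeds,
                              img_ids, txt_ids, timesteps, ref_fixed_timestep=0.0):
    kv_cache = None
    num_ref_tokens = ref_latents.shape[1]
    for t in timesteps:
        timestep = t.expand(latents.shape[0])
        if kv_cache is None:
            # First step: prepend ref tokens and populate a fresh cache with their K/V.
            out = transformer(
                hidden_states=torch.cat([ref_latents, latents], dim=1),
                encoder_hidden_states=prompt_embeds,
                timestep=timestep,
                img_ids=img_ids,  # assumed to cover ref + img positions here
                txt_ids=txt_ids,
                kv_cache_mode="extract",
                num_ref_tokens=num_ref_tokens,
                ref_fixed_timestep=ref_fixed_timestep,
            )
            noise_pred, kv_cache = out.sample, out.kv_cache
        else:
            # Later steps: only the denoised latents go in; ref K/V come from the cache.
            noise_pred = transformer(
                hidden_states=latents,
                encoder_hidden_states=prompt_embeds,
                timestep=timestep,
                img_ids=img_ids[num_ref_tokens:],  # img positions only (assumption)
                txt_ids=txt_ids,
                kv_cache=kv_cache,
                kv_cache_mode="cached",
            ).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents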

View File

@@ -13,7 +13,6 @@
# limitations under the License.
import math
from functools import lru_cache
from typing import Any
import torch
@@ -343,7 +342,6 @@ class HeliosRotaryPosEmbed(nn.Module):
return freqs.cos(), freqs.sin()
@torch.no_grad()
@lru_cache(maxsize=32)
def _get_spatial_meshgrid(self, height, width, device_str):
device = torch.device(device_str)
grid_y_coords = torch.arange(height, device=device, dtype=torch.float32)

View File

@@ -178,6 +178,10 @@ class LTX2AudioVideoAttnProcessor:
if encoder_hidden_states is None:
encoder_hidden_states = hidden_states
if attn.to_gate_logits is not None:
# Calculate gate logits on original hidden_states
gate_logits = attn.to_gate_logits(hidden_states)
query = attn.to_q(hidden_states)
key = attn.to_k(encoder_hidden_states)
value = attn.to_v(encoder_hidden_states)
@@ -212,6 +216,112 @@ class LTX2AudioVideoAttnProcessor:
hidden_states = hidden_states.flatten(2, 3)
hidden_states = hidden_states.to(query.dtype)
if attn.to_gate_logits is not None:
hidden_states = hidden_states.unflatten(2, (attn.heads, -1)) # [B, T, H, D]
# The factor of 2.0 ensures that if the gate logits are zero-initialized, the initial gates are all 1
gates = 2.0 * torch.sigmoid(gate_logits) # [B, T, H]
hidden_states = hidden_states * gates.unsqueeze(-1)
hidden_states = hidden_states.flatten(2, 3)
hidden_states = attn.to_out[0](hidden_states)
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
class LTX2PerturbedAttnProcessor:
r"""
Processor which implements attention with perturbation masking and per-head gating for LTX-2.X models.
"""
_attention_backend = None
_parallel_config = None
def __init__(self):
if is_torch_version("<", "2.0"):
raise ValueError(
"LTX attention processors require a minimum PyTorch version of 2.0. Please upgrade your PyTorch installation."
)
def __call__(
self,
attn: "LTX2Attention",
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
query_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
key_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
perturbation_mask: torch.Tensor | None = None,
all_perturbed: bool | None = None,
) -> torch.Tensor:
batch_size, sequence_length, _ = (
hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
)
if attention_mask is not None:
attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
attention_mask = attention_mask.view(batch_size, attn.heads, -1, attention_mask.shape[-1])
if encoder_hidden_states is None:
encoder_hidden_states = hidden_states
if attn.to_gate_logits is not None:
# Calculate gate logits on original hidden_states
gate_logits = attn.to_gate_logits(hidden_states)
value = attn.to_v(encoder_hidden_states)
if all_perturbed is None:
all_perturbed = torch.all(perturbation_mask == 0) if perturbation_mask is not None else False
if all_perturbed:
# Skip attention and use the value projection output directly
hidden_states = value
else:
query = attn.to_q(hidden_states)
key = attn.to_k(encoder_hidden_states)
query = attn.norm_q(query)
key = attn.norm_k(key)
if query_rotary_emb is not None:
if attn.rope_type == "interleaved":
query = apply_interleaved_rotary_emb(query, query_rotary_emb)
key = apply_interleaved_rotary_emb(
key, key_rotary_emb if key_rotary_emb is not None else query_rotary_emb
)
elif attn.rope_type == "split":
query = apply_split_rotary_emb(query, query_rotary_emb)
key = apply_split_rotary_emb(
key, key_rotary_emb if key_rotary_emb is not None else query_rotary_emb
)
query = query.unflatten(2, (attn.heads, -1))
key = key.unflatten(2, (attn.heads, -1))
value = value.unflatten(2, (attn.heads, -1))
hidden_states = dispatch_attention_fn(
query,
key,
value,
attn_mask=attention_mask,
dropout_p=0.0,
is_causal=False,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
hidden_states = hidden_states.flatten(2, 3)
hidden_states = hidden_states.to(query.dtype)
if perturbation_mask is not None:
value = value.flatten(2, 3)
hidden_states = torch.lerp(value, hidden_states, perturbation_mask)
if attn.to_gate_logits is not None:
hidden_states = hidden_states.unflatten(2, (attn.heads, -1)) # [B, T, H, D]
# The factor of 2.0 ensures that if the gate logits are zero-initialized, the initial gates are all 1
gates = 2.0 * torch.sigmoid(gate_logits) # [B, T, H]
hidden_states = hidden_states * gates.unsqueeze(-1)
hidden_states = hidden_states.flatten(2, 3)
hidden_states = attn.to_out[0](hidden_states)
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
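The 2.0 factor in the gating above can be checked with a quick sketch (illustrative arithmetic only, not code from this diff): zero-initialized gate logits give sigmoid outputs of 0.5, so the initial per-head gates are exactly 1 and the gated attention starts out as an identity scaling.

import torch

gate_logits = torch.zeros(1, 4, 8)        # [batch, tokens, heads], zero-initialized
gates = 2.0 * torch.sigmoid(gate_logits)  # all ones at initialization
assert torch.allclose(gates, torch.ones_like(gates))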
@@ -224,7 +334,7 @@ class LTX2Attention(torch.nn.Module, AttentionModuleMixin):
"""
_default_processor_cls = LTX2AudioVideoAttnProcessor
_available_processors = [LTX2AudioVideoAttnProcessor]
_available_processors = [LTX2AudioVideoAttnProcessor, LTX2PerturbedAttnProcessor]
def __init__(
self,
@@ -240,6 +350,7 @@ class LTX2Attention(torch.nn.Module, AttentionModuleMixin):
norm_eps: float = 1e-6,
norm_elementwise_affine: bool = True,
rope_type: str = "interleaved",
apply_gated_attention: bool = False,
processor=None,
):
super().__init__()
@@ -266,6 +377,12 @@ class LTX2Attention(torch.nn.Module, AttentionModuleMixin):
self.to_out.append(torch.nn.Linear(self.inner_dim, self.out_dim, bias=out_bias))
self.to_out.append(torch.nn.Dropout(dropout))
if apply_gated_attention:
# Per head gate values
self.to_gate_logits = torch.nn.Linear(query_dim, heads, bias=True)
else:
self.to_gate_logits = None
if processor is None:
processor = self._default_processor_cls()
self.set_processor(processor)
@@ -321,6 +438,10 @@ class LTX2VideoTransformerBlock(nn.Module):
audio_num_attention_heads: int,
audio_attention_head_dim,
audio_cross_attention_dim: int,
video_gated_attn: bool = False,
video_cross_attn_adaln: bool = False,
audio_gated_attn: bool = False,
audio_cross_attn_adaln: bool = False,
qk_norm: str = "rms_norm_across_heads",
activation_fn: str = "gelu-approximate",
attention_bias: bool = True,
@@ -328,9 +449,16 @@ class LTX2VideoTransformerBlock(nn.Module):
eps: float = 1e-6,
elementwise_affine: bool = False,
rope_type: str = "interleaved",
perturbed_attn: bool = False,
):
super().__init__()
self.perturbed_attn = perturbed_attn
if perturbed_attn:
attn_processor_cls = LTX2PerturbedAttnProcessor
else:
attn_processor_cls = LTX2AudioVideoAttnProcessor
# 1. Self-Attention (video and audio)
self.norm1 = RMSNorm(dim, eps=eps, elementwise_affine=elementwise_affine)
self.attn1 = LTX2Attention(
@@ -343,6 +471,8 @@ class LTX2VideoTransformerBlock(nn.Module):
out_bias=attention_out_bias,
qk_norm=qk_norm,
rope_type=rope_type,
apply_gated_attention=video_gated_attn,
processor=attn_processor_cls(),
)
self.audio_norm1 = RMSNorm(audio_dim, eps=eps, elementwise_affine=elementwise_affine)
@@ -356,6 +486,8 @@ class LTX2VideoTransformerBlock(nn.Module):
out_bias=attention_out_bias,
qk_norm=qk_norm,
rope_type=rope_type,
apply_gated_attention=audio_gated_attn,
processor=attn_processor_cls(),
)
# 2. Prompt Cross-Attention
@@ -370,6 +502,8 @@ class LTX2VideoTransformerBlock(nn.Module):
out_bias=attention_out_bias,
qk_norm=qk_norm,
rope_type=rope_type,
apply_gated_attention=video_gated_attn,
processor=attn_processor_cls(),
)
self.audio_norm2 = RMSNorm(audio_dim, eps=eps, elementwise_affine=elementwise_affine)
@@ -383,6 +517,8 @@ class LTX2VideoTransformerBlock(nn.Module):
out_bias=attention_out_bias,
qk_norm=qk_norm,
rope_type=rope_type,
apply_gated_attention=audio_gated_attn,
processor=attn_processor_cls(),
)
# 3. Audio-to-Video (a2v) and Video-to-Audio (v2a) Cross-Attention
@@ -398,6 +534,8 @@ class LTX2VideoTransformerBlock(nn.Module):
out_bias=attention_out_bias,
qk_norm=qk_norm,
rope_type=rope_type,
apply_gated_attention=video_gated_attn,
processor=attn_processor_cls(),
)
# Video-to-Audio (v2a) Attention --> Q: Audio; K,V: Video
@@ -412,6 +550,8 @@ class LTX2VideoTransformerBlock(nn.Module):
out_bias=attention_out_bias,
qk_norm=qk_norm,
rope_type=rope_type,
apply_gated_attention=audio_gated_attn,
processor=attn_processor_cls(),
)
# 4. Feedforward layers
@@ -422,14 +562,36 @@ class LTX2VideoTransformerBlock(nn.Module):
self.audio_ff = FeedForward(audio_dim, activation_fn=activation_fn)
# 5. Per-Layer Modulation Parameters
# Self-Attention / Feedforward AdaLayerNorm-Zero mod params
self.scale_shift_table = nn.Parameter(torch.randn(6, dim) / dim**0.5)
self.audio_scale_shift_table = nn.Parameter(torch.randn(6, audio_dim) / audio_dim**0.5)
# Self-Attention (attn1) / Feedforward AdaLayerNorm-Zero mod params
# 6 base mod params for text cross-attn K,V; if cross_attn_adaln, also has mod params for Q
self.video_cross_attn_adaln = video_cross_attn_adaln
self.audio_cross_attn_adaln = audio_cross_attn_adaln
video_mod_param_num = 9 if self.video_cross_attn_adaln else 6
audio_mod_param_num = 9 if self.audio_cross_attn_adaln else 6
self.scale_shift_table = nn.Parameter(torch.randn(video_mod_param_num, dim) / dim**0.5)
self.audio_scale_shift_table = nn.Parameter(torch.randn(audio_mod_param_num, audio_dim) / audio_dim**0.5)
# Prompt cross-attn (attn2) additional modulation params
self.cross_attn_adaln = video_cross_attn_adaln or audio_cross_attn_adaln
if self.cross_attn_adaln:
self.prompt_scale_shift_table = nn.Parameter(torch.randn(2, dim))
self.audio_prompt_scale_shift_table = nn.Parameter(torch.randn(2, audio_dim))
# Per-layer a2v, v2a Cross-Attention mod params
self.video_a2v_cross_attn_scale_shift_table = nn.Parameter(torch.randn(5, dim))
self.audio_a2v_cross_attn_scale_shift_table = nn.Parameter(torch.randn(5, audio_dim))
@staticmethod
def get_mod_params(
scale_shift_table: torch.Tensor, temb: torch.Tensor, batch_size: int
) -> tuple[torch.Tensor, ...]:
num_ada_params = scale_shift_table.shape[0]
ada_values = scale_shift_table[None, None].to(temb.device) + temb.reshape(
batch_size, temb.shape[1], num_ada_params, -1
)
ada_params = ada_values.unbind(dim=2)
return ada_params
def forward(
self,
hidden_states: torch.Tensor,
@@ -442,143 +604,181 @@ class LTX2VideoTransformerBlock(nn.Module):
temb_ca_audio_scale_shift: torch.Tensor,
temb_ca_gate: torch.Tensor,
temb_ca_audio_gate: torch.Tensor,
temb_prompt: torch.Tensor | None = None,
temb_prompt_audio: torch.Tensor | None = None,
video_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
audio_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
ca_video_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
ca_audio_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
encoder_attention_mask: torch.Tensor | None = None,
audio_encoder_attention_mask: torch.Tensor | None = None,
self_attention_mask: torch.Tensor | None = None,
audio_self_attention_mask: torch.Tensor | None = None,
a2v_cross_attention_mask: torch.Tensor | None = None,
v2a_cross_attention_mask: torch.Tensor | None = None,
use_a2v_cross_attention: bool = True,
use_v2a_cross_attention: bool = True,
perturbation_mask: torch.Tensor | None = None,
all_perturbed: bool | None = None,
) -> torch.Tensor:
batch_size = hidden_states.size(0)
# 1. Video and Audio Self-Attention
norm_hidden_states = self.norm1(hidden_states)
# 1.1. Video Self-Attention
video_ada_params = self.get_mod_params(self.scale_shift_table, temb, batch_size)
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = video_ada_params[:6]
if self.video_cross_attn_adaln:
shift_text_q, scale_text_q, gate_text_q = video_ada_params[6:9]
num_ada_params = self.scale_shift_table.shape[0]
ada_values = self.scale_shift_table[None, None].to(temb.device) + temb.reshape(
batch_size, temb.size(1), num_ada_params, -1
)
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = ada_values.unbind(dim=2)
norm_hidden_states = self.norm1(hidden_states)
norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
attn_hidden_states = self.attn1(
hidden_states=norm_hidden_states,
encoder_hidden_states=None,
query_rotary_emb=video_rotary_emb,
)
video_self_attn_args = {
"hidden_states": norm_hidden_states,
"encoder_hidden_states": None,
"query_rotary_emb": video_rotary_emb,
"attention_mask": self_attention_mask,
}
if self.perturbed_attn:
video_self_attn_args["perturbation_mask"] = perturbation_mask
video_self_attn_args["all_perturbed"] = all_perturbed
attn_hidden_states = self.attn1(**video_self_attn_args)
hidden_states = hidden_states + attn_hidden_states * gate_msa
norm_audio_hidden_states = self.audio_norm1(audio_hidden_states)
num_audio_ada_params = self.audio_scale_shift_table.shape[0]
audio_ada_values = self.audio_scale_shift_table[None, None].to(temb_audio.device) + temb_audio.reshape(
batch_size, temb_audio.size(1), num_audio_ada_params, -1
)
# 1.2. Audio Self-Attention
audio_ada_params = self.get_mod_params(self.audio_scale_shift_table, temb_audio, batch_size)
audio_shift_msa, audio_scale_msa, audio_gate_msa, audio_shift_mlp, audio_scale_mlp, audio_gate_mlp = (
audio_ada_values.unbind(dim=2)
audio_ada_params[:6]
)
if self.audio_cross_attn_adaln:
audio_shift_text_q, audio_scale_text_q, audio_gate_text_q = audio_ada_params[6:9]
norm_audio_hidden_states = self.audio_norm1(audio_hidden_states)
norm_audio_hidden_states = norm_audio_hidden_states * (1 + audio_scale_msa) + audio_shift_msa
attn_audio_hidden_states = self.audio_attn1(
hidden_states=norm_audio_hidden_states,
encoder_hidden_states=None,
query_rotary_emb=audio_rotary_emb,
)
audio_self_attn_args = {
"hidden_states": norm_audio_hidden_states,
"encoder_hidden_states": None,
"query_rotary_emb": audio_rotary_emb,
"attention_mask": audio_self_attention_mask,
}
if self.perturbed_attn:
audio_self_attn_args["perturbation_mask"] = perturbation_mask
audio_self_attn_args["all_perturbed"] = all_perturbed
attn_audio_hidden_states = self.audio_attn1(**audio_self_attn_args)
audio_hidden_states = audio_hidden_states + attn_audio_hidden_states * audio_gate_msa
# 2. Video and Audio Cross-Attention with the text embeddings
# 2. Video and Audio Cross-Attention with the text embeddings (Q: Video or Audio; K,V: Text)
if self.cross_attn_adaln:
video_prompt_ada_params = self.get_mod_params(self.prompt_scale_shift_table, temb_prompt, batch_size)
shift_text_kv, scale_text_kv = video_prompt_ada_params
audio_prompt_ada_params = self.get_mod_params(
self.audio_prompt_scale_shift_table, temb_prompt_audio, batch_size
)
audio_shift_text_kv, audio_scale_text_kv = audio_prompt_ada_params
# 2.1. Video-Text Cross-Attention (Q: Video; K,V: Text)
norm_hidden_states = self.norm2(hidden_states)
if self.video_cross_attn_adaln:
norm_hidden_states = norm_hidden_states * (1 + scale_text_q) + shift_text_q
if self.cross_attn_adaln:
encoder_hidden_states = encoder_hidden_states * (1 + scale_text_kv) + shift_text_kv
attn_hidden_states = self.attn2(
norm_hidden_states,
encoder_hidden_states=encoder_hidden_states,
query_rotary_emb=None,
attention_mask=encoder_attention_mask,
)
if self.video_cross_attn_adaln:
attn_hidden_states = attn_hidden_states * gate_text_q
hidden_states = hidden_states + attn_hidden_states
# 2.2. Audio-Text Cross-Attention
norm_audio_hidden_states = self.audio_norm2(audio_hidden_states)
if self.audio_cross_attn_adaln:
norm_audio_hidden_states = norm_audio_hidden_states * (1 + audio_scale_text_q) + audio_shift_text_q
if self.cross_attn_adaln:
audio_encoder_hidden_states = audio_encoder_hidden_states * (1 + audio_scale_text_kv) + audio_shift_text_kv
attn_audio_hidden_states = self.audio_attn2(
norm_audio_hidden_states,
encoder_hidden_states=audio_encoder_hidden_states,
query_rotary_emb=None,
attention_mask=audio_encoder_attention_mask,
)
if self.audio_cross_attn_adaln:
attn_audio_hidden_states = attn_audio_hidden_states * audio_gate_text_q
audio_hidden_states = audio_hidden_states + attn_audio_hidden_states
# 3. Audio-to-Video (a2v) and Video-to-Audio (v2a) Cross-Attention
norm_hidden_states = self.audio_to_video_norm(hidden_states)
norm_audio_hidden_states = self.video_to_audio_norm(audio_hidden_states)
if use_a2v_cross_attention or use_v2a_cross_attention:
norm_hidden_states = self.audio_to_video_norm(hidden_states)
norm_audio_hidden_states = self.video_to_audio_norm(audio_hidden_states)
# Combine global and per-layer cross attention modulation parameters
# Video
video_per_layer_ca_scale_shift = self.video_a2v_cross_attn_scale_shift_table[:4, :]
video_per_layer_ca_gate = self.video_a2v_cross_attn_scale_shift_table[4:, :]
# 3.1. Combine global and per-layer cross attention modulation parameters
# Video
video_per_layer_ca_scale_shift = self.video_a2v_cross_attn_scale_shift_table[:4, :]
video_per_layer_ca_gate = self.video_a2v_cross_attn_scale_shift_table[4:, :]
video_ca_scale_shift_table = (
video_per_layer_ca_scale_shift[:, :, ...].to(temb_ca_scale_shift.dtype)
+ temb_ca_scale_shift.reshape(batch_size, temb_ca_scale_shift.shape[1], 4, -1)
).unbind(dim=2)
video_ca_gate = (
video_per_layer_ca_gate[:, :, ...].to(temb_ca_gate.dtype)
+ temb_ca_gate.reshape(batch_size, temb_ca_gate.shape[1], 1, -1)
).unbind(dim=2)
video_ca_ada_params = self.get_mod_params(video_per_layer_ca_scale_shift, temb_ca_scale_shift, batch_size)
video_ca_gate_param = self.get_mod_params(video_per_layer_ca_gate, temb_ca_gate, batch_size)
video_a2v_ca_scale, video_a2v_ca_shift, video_v2a_ca_scale, video_v2a_ca_shift = video_ca_scale_shift_table
a2v_gate = video_ca_gate[0].squeeze(2)
video_a2v_ca_scale, video_a2v_ca_shift, video_v2a_ca_scale, video_v2a_ca_shift = video_ca_ada_params
a2v_gate = video_ca_gate_param[0].squeeze(2)
# Audio
audio_per_layer_ca_scale_shift = self.audio_a2v_cross_attn_scale_shift_table[:4, :]
audio_per_layer_ca_gate = self.audio_a2v_cross_attn_scale_shift_table[4:, :]
# Audio
audio_per_layer_ca_scale_shift = self.audio_a2v_cross_attn_scale_shift_table[:4, :]
audio_per_layer_ca_gate = self.audio_a2v_cross_attn_scale_shift_table[4:, :]
audio_ca_scale_shift_table = (
audio_per_layer_ca_scale_shift[:, :, ...].to(temb_ca_audio_scale_shift.dtype)
+ temb_ca_audio_scale_shift.reshape(batch_size, temb_ca_audio_scale_shift.shape[1], 4, -1)
).unbind(dim=2)
audio_ca_gate = (
audio_per_layer_ca_gate[:, :, ...].to(temb_ca_audio_gate.dtype)
+ temb_ca_audio_gate.reshape(batch_size, temb_ca_audio_gate.shape[1], 1, -1)
).unbind(dim=2)
audio_ca_ada_params = self.get_mod_params(
audio_per_layer_ca_scale_shift, temb_ca_audio_scale_shift, batch_size
)
audio_ca_gate_param = self.get_mod_params(audio_per_layer_ca_gate, temb_ca_audio_gate, batch_size)
audio_a2v_ca_scale, audio_a2v_ca_shift, audio_v2a_ca_scale, audio_v2a_ca_shift = audio_ca_scale_shift_table
v2a_gate = audio_ca_gate[0].squeeze(2)
audio_a2v_ca_scale, audio_a2v_ca_shift, audio_v2a_ca_scale, audio_v2a_ca_shift = audio_ca_ada_params
v2a_gate = audio_ca_gate_param[0].squeeze(2)
# Audio-to-Video Cross Attention: Q: Video; K,V: Audio
mod_norm_hidden_states = norm_hidden_states * (1 + video_a2v_ca_scale.squeeze(2)) + video_a2v_ca_shift.squeeze(
2
)
mod_norm_audio_hidden_states = norm_audio_hidden_states * (
1 + audio_a2v_ca_scale.squeeze(2)
) + audio_a2v_ca_shift.squeeze(2)
# 3.2. Audio-to-Video Cross Attention: Q: Video; K,V: Audio
if use_a2v_cross_attention:
mod_norm_hidden_states = norm_hidden_states * (
1 + video_a2v_ca_scale.squeeze(2)
) + video_a2v_ca_shift.squeeze(2)
mod_norm_audio_hidden_states = norm_audio_hidden_states * (
1 + audio_a2v_ca_scale.squeeze(2)
) + audio_a2v_ca_shift.squeeze(2)
a2v_attn_hidden_states = self.audio_to_video_attn(
mod_norm_hidden_states,
encoder_hidden_states=mod_norm_audio_hidden_states,
query_rotary_emb=ca_video_rotary_emb,
key_rotary_emb=ca_audio_rotary_emb,
attention_mask=a2v_cross_attention_mask,
)
a2v_attn_hidden_states = self.audio_to_video_attn(
mod_norm_hidden_states,
encoder_hidden_states=mod_norm_audio_hidden_states,
query_rotary_emb=ca_video_rotary_emb,
key_rotary_emb=ca_audio_rotary_emb,
attention_mask=a2v_cross_attention_mask,
)
hidden_states = hidden_states + a2v_gate * a2v_attn_hidden_states
hidden_states = hidden_states + a2v_gate * a2v_attn_hidden_states
# Video-to-Audio Cross Attention: Q: Audio; K,V: Video
mod_norm_hidden_states = norm_hidden_states * (1 + video_v2a_ca_scale.squeeze(2)) + video_v2a_ca_shift.squeeze(
2
)
mod_norm_audio_hidden_states = norm_audio_hidden_states * (
1 + audio_v2a_ca_scale.squeeze(2)
) + audio_v2a_ca_shift.squeeze(2)
# 3.3. Video-to-Audio Cross Attention: Q: Audio; K,V: Video
if use_v2a_cross_attention:
mod_norm_hidden_states = norm_hidden_states * (
1 + video_v2a_ca_scale.squeeze(2)
) + video_v2a_ca_shift.squeeze(2)
mod_norm_audio_hidden_states = norm_audio_hidden_states * (
1 + audio_v2a_ca_scale.squeeze(2)
) + audio_v2a_ca_shift.squeeze(2)
v2a_attn_hidden_states = self.video_to_audio_attn(
mod_norm_audio_hidden_states,
encoder_hidden_states=mod_norm_hidden_states,
query_rotary_emb=ca_audio_rotary_emb,
key_rotary_emb=ca_video_rotary_emb,
attention_mask=v2a_cross_attention_mask,
)
v2a_attn_hidden_states = self.video_to_audio_attn(
mod_norm_audio_hidden_states,
encoder_hidden_states=mod_norm_hidden_states,
query_rotary_emb=ca_audio_rotary_emb,
key_rotary_emb=ca_video_rotary_emb,
attention_mask=v2a_cross_attention_mask,
)
audio_hidden_states = audio_hidden_states + v2a_gate * v2a_attn_hidden_states
audio_hidden_states = audio_hidden_states + v2a_gate * v2a_attn_hidden_states
# 4. Feedforward
norm_hidden_states = self.norm3(hidden_states) * (1 + scale_mlp) + shift_mlp
@@ -918,6 +1118,8 @@ class LTX2VideoTransformer3DModel(
pos_embed_max_pos: int = 20,
base_height: int = 2048,
base_width: int = 2048,
gated_attn: bool = False,
cross_attn_mod: bool = False,
audio_in_channels: int = 128, # Audio Arguments
audio_out_channels: int | None = 128,
audio_patch_size: int = 1,
@@ -929,6 +1131,8 @@ class LTX2VideoTransformer3DModel(
audio_pos_embed_max_pos: int = 20,
audio_sampling_rate: int = 16000,
audio_hop_length: int = 160,
audio_gated_attn: bool = False,
audio_cross_attn_mod: bool = False,
num_layers: int = 48, # Shared arguments
activation_fn: str = "gelu-approximate",
qk_norm: str = "rms_norm_across_heads",
@@ -943,6 +1147,8 @@ class LTX2VideoTransformer3DModel(
timestep_scale_multiplier: int = 1000,
cross_attn_timestep_scale_multiplier: int = 1000,
rope_type: str = "interleaved",
use_prompt_embeddings=True,
perturbed_attn: bool = False,
) -> None:
super().__init__()
@@ -956,17 +1162,25 @@ class LTX2VideoTransformer3DModel(
self.audio_proj_in = nn.Linear(audio_in_channels, audio_inner_dim)
# 2. Prompt embeddings
self.caption_projection = PixArtAlphaTextProjection(in_features=caption_channels, hidden_size=inner_dim)
self.audio_caption_projection = PixArtAlphaTextProjection(
in_features=caption_channels, hidden_size=audio_inner_dim
)
if use_prompt_embeddings:
# LTX-2.0; LTX-2.3 uses per-modality feature projections in the connector instead
self.caption_projection = PixArtAlphaTextProjection(in_features=caption_channels, hidden_size=inner_dim)
self.audio_caption_projection = PixArtAlphaTextProjection(
in_features=caption_channels, hidden_size=audio_inner_dim
)
# 3. Timestep Modulation Params and Embedding
self.prompt_modulation = cross_attn_mod or audio_cross_attn_mod # used by LTX-2.3
# 3.1. Global Timestep Modulation Parameters (except for cross-attention) and timestep + size embedding
# time_embed and audio_time_embed calculate both the timestep embedding and (global) modulation parameters
self.time_embed = LTX2AdaLayerNormSingle(inner_dim, num_mod_params=6, use_additional_conditions=False)
video_time_emb_mod_params = 9 if cross_attn_mod else 6
audio_time_emb_mod_params = 9 if audio_cross_attn_mod else 6
self.time_embed = LTX2AdaLayerNormSingle(
inner_dim, num_mod_params=video_time_emb_mod_params, use_additional_conditions=False
)
self.audio_time_embed = LTX2AdaLayerNormSingle(
audio_inner_dim, num_mod_params=6, use_additional_conditions=False
audio_inner_dim, num_mod_params=audio_time_emb_mod_params, use_additional_conditions=False
)
# 3.2. Global Cross Attention Modulation Parameters
@@ -995,6 +1209,13 @@ class LTX2VideoTransformer3DModel(
self.scale_shift_table = nn.Parameter(torch.randn(2, inner_dim) / inner_dim**0.5)
self.audio_scale_shift_table = nn.Parameter(torch.randn(2, audio_inner_dim) / audio_inner_dim**0.5)
# 3.4. Prompt Scale/Shift Modulation parameters (LTX-2.3)
if self.prompt_modulation:
self.prompt_adaln = LTX2AdaLayerNormSingle(inner_dim, num_mod_params=2, use_additional_conditions=False)
self.audio_prompt_adaln = LTX2AdaLayerNormSingle(
audio_inner_dim, num_mod_params=2, use_additional_conditions=False
)
# 4. Rotary Positional Embeddings (RoPE)
# Self-Attention
self.rope = LTX2AudioVideoRotaryPosEmbed(
@@ -1071,6 +1292,10 @@ class LTX2VideoTransformer3DModel(
audio_num_attention_heads=audio_num_attention_heads,
audio_attention_head_dim=audio_attention_head_dim,
audio_cross_attention_dim=audio_cross_attention_dim,
video_gated_attn=gated_attn,
video_cross_attn_adaln=cross_attn_mod,
audio_gated_attn=audio_gated_attn,
audio_cross_attn_adaln=audio_cross_attn_mod,
qk_norm=qk_norm,
activation_fn=activation_fn,
attention_bias=attention_bias,
@@ -1078,6 +1303,7 @@ class LTX2VideoTransformer3DModel(
eps=norm_eps,
elementwise_affine=norm_elementwise_affine,
rope_type=rope_type,
perturbed_attn=perturbed_attn,
)
for _ in range(num_layers)
]
@@ -1101,6 +1327,8 @@ class LTX2VideoTransformer3DModel(
audio_encoder_hidden_states: torch.Tensor,
timestep: torch.LongTensor,
audio_timestep: torch.LongTensor | None = None,
sigma: torch.Tensor | None = None,
audio_sigma: torch.Tensor | None = None,
encoder_attention_mask: torch.Tensor | None = None,
audio_encoder_attention_mask: torch.Tensor | None = None,
num_frames: int | None = None,
@@ -1110,6 +1338,10 @@ class LTX2VideoTransformer3DModel(
audio_num_frames: int | None = None,
video_coords: torch.Tensor | None = None,
audio_coords: torch.Tensor | None = None,
isolate_modalities: bool = False,
spatio_temporal_guidance_blocks: list[int] | None = None,
perturbation_mask: torch.Tensor | None = None,
use_cross_timestep: bool = False,
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> torch.Tensor:
@@ -1131,6 +1363,13 @@ class LTX2VideoTransformer3DModel(
audio_timestep (`torch.Tensor`, *optional*):
Input timestep of shape `(batch_size,)` or `(batch_size, num_audio_tokens)` for audio modulation
params. This is only used by certain pipelines such as the I2V pipeline.
sigma (`torch.Tensor`, *optional*):
Input scaled timestep of shape (batch_size,). Used for video prompt cross attention modulation in
models such as LTX-2.3.
audio_sigma (`torch.Tensor`, *optional*):
Input scaled timestep of shape (batch_size,). Used for audio prompt cross attention modulation in
models such as LTX-2.3. If `sigma` is supplied but `audio_sigma` is not, `audio_sigma` will be set to
the provided `sigma` value.
encoder_attention_mask (`torch.Tensor`, *optional*):
Optional multiplicative text attention mask of shape `(batch_size, text_seq_len)`.
audio_encoder_attention_mask (`torch.Tensor`, *optional*):
@@ -1152,6 +1391,21 @@ class LTX2VideoTransformer3DModel(
audio_coords (`torch.Tensor`, *optional*):
The audio coordinates to be used when calculating the rotary positional embeddings (RoPE) of shape
`(batch_size, 1, num_audio_tokens, 2)`. If not supplied, this will be calculated inside `forward`.
isolate_modalities (`bool`, *optional*, defaults to `False`):
Whether to isolate each modality by turning off cross-modality (audio-to-video and video-to-audio)
cross attention for all blocks. Used for modality isolation guidance in LTX-2.3.
spatio_temporal_guidance_blocks (`list[int]`, *optional*, defaults to `None`):
The transformer block indices at which to apply spatio-temporal guidance (STG), which shortcuts the
self-attention operation by using the value projections directly rather than the full scaled dot-product
attention (SDPA) operation. If `None` or empty, STG is not applied to any block.
perturbation_mask (`torch.Tensor`, *optional*):
Perturbation mask for STG of shape `(batch_size,)` or `(batch_size, 1, 1)`. Should be 0 at batch
elements where STG should be applied and 1 elsewhere. If STG is being used but `perturbation_mask` is
not supplied, STG is applied (perturbing) to all batch elements.
use_cross_timestep (`bool`, *optional*, defaults to `False`):
Whether to use the cross-modality sigma (audio is the cross modality of video, and vice versa) when
calculating the cross attention modulation parameters. `True` is the newer (e.g. LTX-2.3) behavior;
`False` is the legacy LTX-2.0 behavior.
attention_kwargs (`dict[str, Any]`, *optional*):
Optional dict of keyword args to be passed to the attention processor.
return_dict (`bool`, *optional*, defaults to `True`):
@@ -1165,6 +1419,7 @@ class LTX2VideoTransformer3DModel(
"""
# Determine timestep for audio.
audio_timestep = audio_timestep if audio_timestep is not None else timestep
audio_sigma = audio_sigma if audio_sigma is not None else sigma
# convert encoder_attention_mask to a bias the same way we do for attention_mask
if encoder_attention_mask is not None and encoder_attention_mask.ndim == 2:
@@ -1223,14 +1478,28 @@ class LTX2VideoTransformer3DModel(
temb_audio = temb_audio.view(batch_size, -1, temb_audio.size(-1))
audio_embedded_timestep = audio_embedded_timestep.view(batch_size, -1, audio_embedded_timestep.size(-1))
if self.prompt_modulation:
# LTX-2.3
temb_prompt, _ = self.prompt_adaln(
sigma.flatten(), batch_size=batch_size, hidden_dtype=hidden_states.dtype
)
temb_prompt_audio, _ = self.audio_prompt_adaln(
audio_sigma.flatten(), batch_size=batch_size, hidden_dtype=audio_hidden_states.dtype
)
temb_prompt = temb_prompt.view(batch_size, -1, temb_prompt.size(-1))
temb_prompt_audio = temb_prompt_audio.view(batch_size, -1, temb_prompt_audio.size(-1))
else:
temb_prompt = temb_prompt_audio = None
# 3.2. Prepare global modality cross attention modulation parameters
video_ca_timestep = audio_sigma.flatten() if use_cross_timestep else timestep.flatten()
video_cross_attn_scale_shift, _ = self.av_cross_attn_video_scale_shift(
timestep.flatten(),
video_ca_timestep,
batch_size=batch_size,
hidden_dtype=hidden_states.dtype,
)
video_cross_attn_a2v_gate, _ = self.av_cross_attn_video_a2v_gate(
timestep.flatten() * timestep_cross_attn_gate_scale_factor,
video_ca_timestep * timestep_cross_attn_gate_scale_factor,
batch_size=batch_size,
hidden_dtype=hidden_states.dtype,
)
@@ -1239,13 +1508,14 @@ class LTX2VideoTransformer3DModel(
)
video_cross_attn_a2v_gate = video_cross_attn_a2v_gate.view(batch_size, -1, video_cross_attn_a2v_gate.shape[-1])
audio_ca_timestep = sigma.flatten() if use_cross_timestep else audio_timestep.flatten()
audio_cross_attn_scale_shift, _ = self.av_cross_attn_audio_scale_shift(
audio_timestep.flatten(),
audio_ca_timestep,
batch_size=batch_size,
hidden_dtype=audio_hidden_states.dtype,
)
audio_cross_attn_v2a_gate, _ = self.av_cross_attn_audio_v2a_gate(
audio_timestep.flatten() * timestep_cross_attn_gate_scale_factor,
audio_ca_timestep * timestep_cross_attn_gate_scale_factor,
batch_size=batch_size,
hidden_dtype=audio_hidden_states.dtype,
)
@@ -1254,15 +1524,30 @@ class LTX2VideoTransformer3DModel(
)
audio_cross_attn_v2a_gate = audio_cross_attn_v2a_gate.view(batch_size, -1, audio_cross_attn_v2a_gate.shape[-1])
# 4. Prepare prompt embeddings
encoder_hidden_states = self.caption_projection(encoder_hidden_states)
encoder_hidden_states = encoder_hidden_states.view(batch_size, -1, hidden_states.size(-1))
# 4. Prepare prompt embeddings (LTX-2.0)
if self.config.use_prompt_embeddings:
encoder_hidden_states = self.caption_projection(encoder_hidden_states)
encoder_hidden_states = encoder_hidden_states.view(batch_size, -1, hidden_states.size(-1))
audio_encoder_hidden_states = self.audio_caption_projection(audio_encoder_hidden_states)
audio_encoder_hidden_states = audio_encoder_hidden_states.view(batch_size, -1, audio_hidden_states.size(-1))
audio_encoder_hidden_states = self.audio_caption_projection(audio_encoder_hidden_states)
audio_encoder_hidden_states = audio_encoder_hidden_states.view(
batch_size, -1, audio_hidden_states.size(-1)
)
# 5. Run transformer blocks
for block in self.transformer_blocks:
spatio_temporal_guidance_blocks = spatio_temporal_guidance_blocks or []
if len(spatio_temporal_guidance_blocks) > 0 and perturbation_mask is None:
# If STG is being used and perturbation_mask is not supplied, default to perturbing all batch elements.
perturbation_mask = torch.zeros((batch_size,))
if perturbation_mask is not None and perturbation_mask.ndim == 1:
perturbation_mask = perturbation_mask[:, None, None] # unsqueeze to 3D to broadcast with hidden_states
all_perturbed = torch.all(perturbation_mask == 0) if perturbation_mask is not None else False
stg_blocks = set(spatio_temporal_guidance_blocks)
for block_idx, block in enumerate(self.transformer_blocks):
block_perturbation_mask = perturbation_mask if block_idx in stg_blocks else None
block_all_perturbed = all_perturbed if block_idx in stg_blocks else False
if torch.is_grad_enabled() and self.gradient_checkpointing:
hidden_states, audio_hidden_states = self._gradient_checkpointing_func(
block,
@@ -1276,12 +1561,22 @@ class LTX2VideoTransformer3DModel(
audio_cross_attn_scale_shift,
video_cross_attn_a2v_gate,
audio_cross_attn_v2a_gate,
temb_prompt,
temb_prompt_audio,
video_rotary_emb,
audio_rotary_emb,
video_cross_attn_rotary_emb,
audio_cross_attn_rotary_emb,
encoder_attention_mask,
audio_encoder_attention_mask,
None, # self_attention_mask
None, # audio_self_attention_mask
None, # a2v_cross_attention_mask
None, # v2a_cross_attention_mask
not isolate_modalities, # use_a2v_cross_attention
not isolate_modalities, # use_v2a_cross_attention
block_perturbation_mask,
block_all_perturbed,
)
else:
hidden_states, audio_hidden_states = block(
@@ -1295,12 +1590,22 @@ class LTX2VideoTransformer3DModel(
temb_ca_audio_scale_shift=audio_cross_attn_scale_shift,
temb_ca_gate=video_cross_attn_a2v_gate,
temb_ca_audio_gate=audio_cross_attn_v2a_gate,
temb_prompt=temb_prompt,
temb_prompt_audio=temb_prompt_audio,
video_rotary_emb=video_rotary_emb,
audio_rotary_emb=audio_rotary_emb,
ca_video_rotary_emb=video_cross_attn_rotary_emb,
ca_audio_rotary_emb=audio_cross_attn_rotary_emb,
encoder_attention_mask=encoder_attention_mask,
audio_encoder_attention_mask=audio_encoder_attention_mask,
self_attention_mask=None,
audio_self_attention_mask=None,
a2v_cross_attention_mask=None,
v2a_cross_attention_mask=None,
use_a2v_cross_attention=not isolate_modalities,
use_v2a_cross_attention=not isolate_modalities,
perturbation_mask=block_perturbation_mask,
all_perturbed=block_all_perturbed,
)
# 6. Output layers (including unpatchification)

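A small illustration (an assumption, not code from this diff) of how the new STG arguments fit together: the perturbation mask is 1 for batch elements that keep full self-attention and 0 for elements that should be perturbed in the listed blocks.

import torch

batch_size = 3  # e.g. [uncond, cond, perturbed] in a single batched call
perturbation_mask = torch.ones(batch_size)
perturbation_mask[-1] = 0.0  # 0 => shortcut self-attention (apply STG) for this element

# Passed alongside the block indices to perturb, e.g.:
# transformer(..., spatio_temporal_guidance_blocks=[14, 15], perturbation_mask=perturbation_mask)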
View File

@@ -56,6 +56,14 @@ else:
"WanImage2VideoModularPipeline",
"Wan22Image2VideoModularPipeline",
]
_import_structure["helios"] = [
"HeliosAutoBlocks",
"HeliosModularPipeline",
"HeliosPyramidAutoBlocks",
"HeliosPyramidDistilledAutoBlocks",
"HeliosPyramidDistilledModularPipeline",
"HeliosPyramidModularPipeline",
]
_import_structure["flux"] = [
"FluxAutoBlocks",
"FluxModularPipeline",
@@ -103,6 +111,14 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
Flux2KleinModularPipeline,
Flux2ModularPipeline,
)
from .helios import (
HeliosAutoBlocks,
HeliosModularPipeline,
HeliosPyramidAutoBlocks,
HeliosPyramidDistilledAutoBlocks,
HeliosPyramidDistilledModularPipeline,
HeliosPyramidModularPipeline,
)
from .modular_pipeline import (
AutoPipelineBlocks,
BlockState,

View File

@@ -0,0 +1,59 @@
from typing import TYPE_CHECKING
from ...utils import (
DIFFUSERS_SLOW_IMPORT,
OptionalDependencyNotAvailable,
_LazyModule,
get_objects_from_module,
is_torch_available,
is_transformers_available,
)
_dummy_objects = {}
_import_structure = {}
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils import dummy_torch_and_transformers_objects # noqa F403
_dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
_import_structure["modular_blocks_helios"] = ["HeliosAutoBlocks"]
_import_structure["modular_blocks_helios_pyramid"] = ["HeliosPyramidAutoBlocks"]
_import_structure["modular_blocks_helios_pyramid_distilled"] = ["HeliosPyramidDistilledAutoBlocks"]
_import_structure["modular_pipeline"] = [
"HeliosModularPipeline",
"HeliosPyramidDistilledModularPipeline",
"HeliosPyramidModularPipeline",
]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import * # noqa F403
else:
from .modular_blocks_helios import HeliosAutoBlocks
from .modular_blocks_helios_pyramid import HeliosPyramidAutoBlocks
from .modular_blocks_helios_pyramid_distilled import HeliosPyramidDistilledAutoBlocks
from .modular_pipeline import (
HeliosModularPipeline,
HeliosPyramidDistilledModularPipeline,
HeliosPyramidModularPipeline,
)
else:
import sys
sys.modules[__name__] = _LazyModule(
__name__,
globals()["__file__"],
_import_structure,
module_spec=__spec__,
)
for name, value in _dummy_objects.items():
setattr(sys.modules[__name__], name, value)

View File

@@ -0,0 +1,836 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import torch
from ...models import HeliosTransformer3DModel
from ...schedulers import HeliosScheduler
from ...utils import logging
from ...utils.torch_utils import randn_tensor
from ..modular_pipeline import ModularPipelineBlocks, PipelineState
from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
from .modular_pipeline import HeliosModularPipeline
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# Copied from diffusers.pipelines.flux.pipeline_flux.calculate_shift
def calculate_shift(
image_seq_len,
base_seq_len: int = 256,
max_seq_len: int = 4096,
base_shift: float = 0.5,
max_shift: float = 1.15,
):
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
b = base_shift - m * base_seq_len
mu = image_seq_len * m + b
return mu
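As a quick sanity check of the linear shift above (illustrative arithmetic only): with the default parameters, a sequence length halfway between `base_seq_len` and `max_seq_len` yields a `mu` halfway between `base_shift` and `max_shift`.

mu = calculate_shift(image_seq_len=2176)  # 2176 = (256 + 4096) / 2
# m = (1.15 - 0.5) / (4096 - 256) ≈ 1.693e-4, b = 0.5 - m * 256 ≈ 0.4567
# mu = 2176 * m + b ≈ 0.825 = (0.5 + 1.15) / 2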
class HeliosTextInputStep(ModularPipelineBlocks):
model_name = "helios"
@property
def description(self) -> str:
return (
"Input processing step that:\n"
" 1. Determines `batch_size` and `dtype` based on `prompt_embeds`\n"
" 2. Adjusts input tensor shapes based on `batch_size` (number of prompts) and `num_videos_per_prompt`\n\n"
"All input tensors are expected to have either batch_size=1 or match the batch_size\n"
"of prompt_embeds. The tensors will be duplicated across the batch dimension to\n"
"have a final batch_size of batch_size * num_videos_per_prompt."
)
@property
def inputs(self) -> list[InputParam]:
return [
InputParam(
"num_videos_per_prompt",
default=1,
type_hint=int,
description="Number of videos to generate per prompt.",
),
InputParam.template("prompt_embeds"),
InputParam.template("negative_prompt_embeds"),
]
@property
def intermediate_outputs(self) -> list[str]:
return [
OutputParam(
"batch_size",
type_hint=int,
description="Number of prompts, the final batch size of model inputs should be batch_size * num_videos_per_prompt",
),
OutputParam(
"dtype",
type_hint=torch.dtype,
description="Data type of model tensor inputs (determined by `prompt_embeds.dtype`)",
),
]
def check_inputs(self, components, block_state):
if block_state.prompt_embeds is not None and block_state.negative_prompt_embeds is not None:
if block_state.prompt_embeds.shape != block_state.negative_prompt_embeds.shape:
raise ValueError(
"`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
f" got: `prompt_embeds` {block_state.prompt_embeds.shape} != `negative_prompt_embeds`"
f" {block_state.negative_prompt_embeds.shape}."
)
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
self.check_inputs(components, block_state)
block_state.batch_size = block_state.prompt_embeds.shape[0]
block_state.dtype = block_state.prompt_embeds.dtype
_, seq_len, _ = block_state.prompt_embeds.shape
block_state.prompt_embeds = block_state.prompt_embeds.repeat(1, block_state.num_videos_per_prompt, 1)
block_state.prompt_embeds = block_state.prompt_embeds.view(
block_state.batch_size * block_state.num_videos_per_prompt, seq_len, -1
)
if block_state.negative_prompt_embeds is not None:
_, seq_len, _ = block_state.negative_prompt_embeds.shape
block_state.negative_prompt_embeds = block_state.negative_prompt_embeds.repeat(
1, block_state.num_videos_per_prompt, 1
)
block_state.negative_prompt_embeds = block_state.negative_prompt_embeds.view(
block_state.batch_size * block_state.num_videos_per_prompt, seq_len, -1
)
self.set_block_state(state, block_state)
return components, state
# Copied from diffusers.modular_pipelines.wan.before_denoise.repeat_tensor_to_batch_size
def repeat_tensor_to_batch_size(
input_name: str,
input_tensor: torch.Tensor,
batch_size: int,
num_videos_per_prompt: int = 1,
) -> torch.Tensor:
"""Repeat tensor elements to match the final batch size.
This function expands a tensor's batch dimension to match the final batch size (batch_size * num_videos_per_prompt)
by repeating each element along dimension 0.
The input tensor must have batch size 1 or batch_size. The function will:
- If the input batch size is 1: repeat each element (batch_size * num_videos_per_prompt) times
- If the input batch size equals batch_size: repeat each element num_videos_per_prompt times
Args:
input_name (str): Name of the input tensor (used for error messages)
input_tensor (torch.Tensor): The tensor to repeat. Must have batch size 1 or batch_size.
batch_size (int): The base batch size (number of prompts)
num_videos_per_prompt (int, optional): Number of videos to generate per prompt. Defaults to 1.
Returns:
torch.Tensor: The repeated tensor with final batch size (batch_size * num_videos_per_prompt)
Raises:
ValueError: If input_tensor is not a torch.Tensor or has invalid batch size
Examples:
tensor = torch.tensor([[1, 2, 3]])  # shape: [1, 3]
repeated = repeat_tensor_to_batch_size("image", tensor, batch_size=2, num_videos_per_prompt=2)
repeated  # tensor([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]) - shape: [4, 3]

tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])  # shape: [2, 3]
repeated = repeat_tensor_to_batch_size("image", tensor, batch_size=2, num_videos_per_prompt=2)
repeated  # tensor([[1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6]]) - shape: [4, 3]
"""
# make sure input is a tensor
if not isinstance(input_tensor, torch.Tensor):
raise ValueError(f"`{input_name}` must be a tensor")
# make sure input tensor e.g. image_latents has batch size 1 or batch_size same as prompts
if input_tensor.shape[0] == 1:
repeat_by = batch_size * num_videos_per_prompt
elif input_tensor.shape[0] == batch_size:
repeat_by = num_videos_per_prompt
else:
raise ValueError(
f"`{input_name}` must have have batch size 1 or {batch_size}, but got {input_tensor.shape[0]}"
)
# expand the tensor to match the batch_size * num_videos_per_prompt
input_tensor = input_tensor.repeat_interleave(repeat_by, dim=0)
return input_tensor
# Copied from diffusers.modular_pipelines.wan.before_denoise.calculate_dimension_from_latents
def calculate_dimension_from_latents(
latents: torch.Tensor, vae_scale_factor_temporal: int, vae_scale_factor_spatial: int
) -> tuple[int, int]:
"""Calculate image dimensions from latent tensor dimensions.
This function converts latent temporal and spatial dimensions to image temporal and spatial dimensions by
multiplying the latent num_frames/height/width by the VAE scale factor.
Args:
latents (torch.Tensor): The latent tensor. Must have 5 dimensions, with expected shape
[batch, channels, frames, height, width]
vae_scale_factor_temporal (int): The scale factor used by the VAE to compress temporal dimension.
Typically 4 for most VAEs (video is 4x larger than latents in temporal dimension)
vae_scale_factor_spatial (int): The scale factor used by the VAE to compress spatial dimension.
Typically 8 for most VAEs (image is 8x larger than latents in each dimension)
Returns:
tuple[int, int, int]: The calculated dimensions as (num_frames, height, width)
Raises:
ValueError: If the latents tensor doesn't have 5 dimensions
"""
if latents.ndim != 5:
raise ValueError(f"latents must have 5 dimensions, but got {latents.ndim}")
_, _, num_latent_frames, latent_height, latent_width = latents.shape
num_frames = (num_latent_frames - 1) * vae_scale_factor_temporal + 1
height = latent_height * vae_scale_factor_spatial
width = latent_width * vae_scale_factor_spatial
return num_frames, height, width
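For example (illustrative numbers only), with the typical scale factors mentioned in the docstring, a latent of shape [1, 16, 21, 60, 90] maps back to an 81-frame, 480x720 video:

import torch

latents = torch.zeros(1, 16, 21, 60, 90)
num_frames, height, width = calculate_dimension_from_latents(
    latents, vae_scale_factor_temporal=4, vae_scale_factor_spatial=8
)
# num_frames = (21 - 1) * 4 + 1 = 81, height = 60 * 8 = 480, width = 90 * 8 = 720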
class HeliosAdditionalInputsStep(ModularPipelineBlocks):
"""Configurable step that standardizes inputs for the denoising step.
This step handles:
1. For encoded image latents: Computes height/width from latents and expands batch size
2. For additional_batch_inputs: Expands batch dimensions to match final batch size
"""
model_name = "helios"
def __init__(
self,
image_latent_inputs: list[InputParam] | None = None,
additional_batch_inputs: list[InputParam] | None = None,
):
if image_latent_inputs is None:
image_latent_inputs = [InputParam.template("image_latents")]
if additional_batch_inputs is None:
additional_batch_inputs = []
if not isinstance(image_latent_inputs, list):
raise ValueError(f"image_latent_inputs must be a list, but got {type(image_latent_inputs)}")
else:
for input_param in image_latent_inputs:
if not isinstance(input_param, InputParam):
raise ValueError(f"image_latent_inputs must be a list of InputParam, but got {type(input_param)}")
if not isinstance(additional_batch_inputs, list):
raise ValueError(f"additional_batch_inputs must be a list, but got {type(additional_batch_inputs)}")
else:
for input_param in additional_batch_inputs:
if not isinstance(input_param, InputParam):
raise ValueError(
f"additional_batch_inputs must be a list of InputParam, but got {type(input_param)}"
)
self._image_latent_inputs = image_latent_inputs
self._additional_batch_inputs = additional_batch_inputs
super().__init__()
@property
def description(self) -> str:
summary_section = (
"Input processing step that:\n"
" 1. For image latent inputs: Computes height/width from latents and expands batch size\n"
" 2. For additional batch inputs: Expands batch dimensions to match final batch size"
)
inputs_info = ""
if self._image_latent_inputs or self._additional_batch_inputs:
inputs_info = "\n\nConfigured inputs:"
if self._image_latent_inputs:
inputs_info += f"\n - Image latent inputs: {[p.name for p in self._image_latent_inputs]}"
if self._additional_batch_inputs:
inputs_info += f"\n - Additional batch inputs: {[p.name for p in self._additional_batch_inputs]}"
placement_section = "\n\nThis block should be placed after the encoder steps and the text input step."
return summary_section + inputs_info + placement_section
@property
def inputs(self) -> list[InputParam]:
inputs = [
InputParam(name="num_videos_per_prompt", default=1),
InputParam(name="batch_size", required=True),
]
inputs += self._image_latent_inputs + self._additional_batch_inputs
return inputs
@property
def intermediate_outputs(self) -> list[OutputParam]:
outputs = [
OutputParam("height", type_hint=int),
OutputParam("width", type_hint=int),
]
for input_param in self._image_latent_inputs:
outputs.append(OutputParam(input_param.name, type_hint=torch.Tensor))
for input_param in self._additional_batch_inputs:
outputs.append(OutputParam(input_param.name, type_hint=torch.Tensor))
return outputs
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
for input_param in self._image_latent_inputs:
image_latent_tensor = getattr(block_state, input_param.name)
if image_latent_tensor is None:
continue
# Calculate height/width from latents
_, height, width = calculate_dimension_from_latents(
image_latent_tensor, components.vae_scale_factor_temporal, components.vae_scale_factor_spatial
)
block_state.height = height
block_state.width = width
# Expand batch size
image_latent_tensor = repeat_tensor_to_batch_size(
input_name=input_param.name,
input_tensor=image_latent_tensor,
num_videos_per_prompt=block_state.num_videos_per_prompt,
batch_size=block_state.batch_size,
)
setattr(block_state, input_param.name, image_latent_tensor)
for input_param in self._additional_batch_inputs:
input_tensor = getattr(block_state, input_param.name)
if input_tensor is None:
continue
input_tensor = repeat_tensor_to_batch_size(
input_name=input_param.name,
input_tensor=input_tensor,
num_videos_per_prompt=block_state.num_videos_per_prompt,
batch_size=block_state.batch_size,
)
setattr(block_state, input_param.name, input_tensor)
self.set_block_state(state, block_state)
return components, state
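# Configuration sketch (illustrative only; mirrors how the Helios I2V/V2V block presets wire this step):
#     HeliosAdditionalInputsStep(
#         image_latent_inputs=[InputParam.template("image_latents")],
#         additional_batch_inputs=[InputParam("video_latents", type_hint=torch.Tensor)],
#     )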
class HeliosAddNoiseToImageLatentsStep(ModularPipelineBlocks):
"""Adds noise to image_latents and fake_image_latents for I2V conditioning.
Applies single-sigma noise to image_latents (using image_noise_sigma range) and single-sigma noise to
fake_image_latents (using video_noise_sigma range).
"""
model_name = "helios"
@property
def description(self) -> str:
return (
"Adds noise to image_latents and fake_image_latents for I2V conditioning. "
"Uses random sigma from configured ranges for each."
)
@property
def inputs(self) -> list[InputParam]:
return [
InputParam.template("image_latents"),
InputParam(
"fake_image_latents",
required=True,
type_hint=torch.Tensor,
description="Fake image latents used as history seed for I2V generation.",
),
InputParam(
"image_noise_sigma_min",
default=0.111,
type_hint=float,
description="Minimum sigma for image latent noise.",
),
InputParam(
"image_noise_sigma_max",
default=0.135,
type_hint=float,
description="Maximum sigma for image latent noise.",
),
InputParam(
"video_noise_sigma_min",
default=0.111,
type_hint=float,
description="Minimum sigma for video/fake-image latent noise.",
),
InputParam(
"video_noise_sigma_max",
default=0.135,
type_hint=float,
description="Maximum sigma for video/fake-image latent noise.",
),
InputParam.template("generator"),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam.template("image_latents"),
OutputParam("fake_image_latents", type_hint=torch.Tensor, description="Noisy fake image latents"),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
device = components._execution_device
image_latents = block_state.image_latents
fake_image_latents = block_state.fake_image_latents
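# Both latents are blended with fresh Gaussian noise as: noisy = sigma * noise + (1 - sigma) * clean,
# where sigma is drawn uniformly from the configured [min, max] range separately for each input.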
# Add noise to image_latents
image_noise_sigma = (
torch.rand(1, device=device, generator=block_state.generator)
* (block_state.image_noise_sigma_max - block_state.image_noise_sigma_min)
+ block_state.image_noise_sigma_min
)
image_latents = (
image_noise_sigma * randn_tensor(image_latents.shape, generator=block_state.generator, device=device)
+ (1 - image_noise_sigma) * image_latents
)
# Add noise to fake_image_latents
fake_image_noise_sigma = (
torch.rand(1, device=device, generator=block_state.generator)
* (block_state.video_noise_sigma_max - block_state.video_noise_sigma_min)
+ block_state.video_noise_sigma_min
)
fake_image_latents = (
fake_image_noise_sigma
* randn_tensor(fake_image_latents.shape, generator=block_state.generator, device=device)
+ (1 - fake_image_noise_sigma) * fake_image_latents
)
block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
block_state.fake_image_latents = fake_image_latents.to(device=device, dtype=torch.float32)
self.set_block_state(state, block_state)
return components, state
class HeliosAddNoiseToVideoLatentsStep(ModularPipelineBlocks):
"""Adds noise to image_latents and video_latents for V2V conditioning.
Applies single-sigma noise to image_latents (using image_noise_sigma range) and per-frame noise to video_latents in
chunks (using video_noise_sigma range).
"""
model_name = "helios"
@property
def description(self) -> str:
return (
"Adds noise to image_latents and video_latents for V2V conditioning. "
"Uses single-sigma noise for image_latents and per-frame noise for video chunks."
)
@property
def inputs(self) -> list[InputParam]:
return [
InputParam.template("image_latents"),
InputParam(
"video_latents",
required=True,
type_hint=torch.Tensor,
description="Encoded video latents for V2V generation.",
),
InputParam(
"num_latent_frames_per_chunk",
default=9,
type_hint=int,
description="Number of latent frames per temporal chunk.",
),
InputParam(
"image_noise_sigma_min",
default=0.111,
type_hint=float,
description="Minimum sigma for image latent noise.",
),
InputParam(
"image_noise_sigma_max",
default=0.135,
type_hint=float,
description="Maximum sigma for image latent noise.",
),
InputParam(
"video_noise_sigma_min",
default=0.111,
type_hint=float,
description="Minimum sigma for video latent noise.",
),
InputParam(
"video_noise_sigma_max",
default=0.135,
type_hint=float,
description="Maximum sigma for video latent noise.",
),
InputParam.template("generator"),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam.template("image_latents"),
OutputParam("video_latents", type_hint=torch.Tensor, description="Noisy video latents"),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
device = components._execution_device
image_latents = block_state.image_latents
video_latents = block_state.video_latents
num_latent_frames_per_chunk = block_state.num_latent_frames_per_chunk
# Add noise to first frame (single sigma)
image_noise_sigma = (
torch.rand(1, device=device, generator=block_state.generator)
* (block_state.image_noise_sigma_max - block_state.image_noise_sigma_min)
+ block_state.image_noise_sigma_min
)
image_latents = (
image_noise_sigma * randn_tensor(image_latents.shape, generator=block_state.generator, device=device)
+ (1 - image_noise_sigma) * image_latents
)
# Add per-frame noise to video chunks
noisy_latents_chunks = []
num_latent_chunks = video_latents.shape[2] // num_latent_frames_per_chunk
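# Only whole chunks are processed: any trailing latent frames beyond a multiple of
# num_latent_frames_per_chunk are dropped when the noisy chunks are re-concatenated below.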
for i in range(num_latent_chunks):
chunk_start = i * num_latent_frames_per_chunk
chunk_end = chunk_start + num_latent_frames_per_chunk
latent_chunk = video_latents[:, :, chunk_start:chunk_end, :, :]
chunk_frames = latent_chunk.shape[2]
frame_sigmas = (
torch.rand(chunk_frames, device=device, generator=block_state.generator)
* (block_state.video_noise_sigma_max - block_state.video_noise_sigma_min)
+ block_state.video_noise_sigma_min
)
frame_sigmas = frame_sigmas.view(1, 1, chunk_frames, 1, 1)
noisy_chunk = (
frame_sigmas * randn_tensor(latent_chunk.shape, generator=block_state.generator, device=device)
+ (1 - frame_sigmas) * latent_chunk
)
noisy_latents_chunks.append(noisy_chunk)
video_latents = torch.cat(noisy_latents_chunks, dim=2)
block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
block_state.video_latents = video_latents.to(device=device, dtype=torch.float32)
self.set_block_state(state, block_state)
return components, state
class HeliosPrepareHistoryStep(ModularPipelineBlocks):
"""Prepares chunk/history indices and initializes history state for the chunk loop."""
model_name = "helios"
@property
def description(self) -> str:
return (
"Prepares the chunk loop by computing latent dimensions, number of chunks, "
"history indices, and initializing history state (history_latents, image_latents, latent_chunks)."
)
@property
def expected_components(self) -> list[ComponentSpec]:
return [
ComponentSpec("transformer", HeliosTransformer3DModel),
]
@property
def inputs(self) -> list[InputParam]:
return [
InputParam.template("height", default=384),
InputParam.template("width", default=640),
InputParam(
"num_frames", default=132, type_hint=int, description="Total number of video frames to generate."
),
InputParam("batch_size", required=True, type_hint=int),
InputParam(
"num_latent_frames_per_chunk",
default=9,
type_hint=int,
description="Number of latent frames per temporal chunk.",
),
InputParam(
"history_sizes",
default=[16, 2, 1],
type_hint=list,
description="Sizes of long/mid/short history buffers for temporal context.",
),
InputParam(
"keep_first_frame",
default=True,
type_hint=bool,
description="Whether to keep the first frame as a prefix in history.",
),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam("num_latent_chunk", type_hint=int, description="Number of temporal chunks"),
OutputParam("latent_shape", type_hint=tuple, description="Shape of latent tensor per chunk"),
OutputParam("history_sizes", type_hint=list, description="Adjusted history sizes (sorted, descending)"),
OutputParam("indices_hidden_states", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
OutputParam("indices_latents_history_short", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
OutputParam("indices_latents_history_mid", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
OutputParam("indices_latents_history_long", type_hint=torch.Tensor, kwargs_type="denoiser_input_fields"),
OutputParam("history_latents", type_hint=torch.Tensor, description="Initialized zero history latents"),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
batch_size = block_state.batch_size
device = components._execution_device
block_state.num_frames = max(block_state.num_frames, 1)
history_sizes = sorted(block_state.history_sizes, reverse=True)
num_channels_latents = components.num_channels_latents
h_latent = block_state.height // components.vae_scale_factor_spatial
w_latent = block_state.width // components.vae_scale_factor_spatial
# Compute number of chunks
block_state.window_num_frames = (
block_state.num_latent_frames_per_chunk - 1
) * components.vae_scale_factor_temporal + 1
block_state.num_latent_chunk = max(
1, (block_state.num_frames + block_state.window_num_frames - 1) // block_state.window_num_frames
)
# Modify history_sizes for non-keep_first_frame (matching pipeline behavior)
if not block_state.keep_first_frame:
history_sizes = history_sizes.copy()
history_sizes[-1] = history_sizes[-1] + 1
# Compute indices ONCE (same structure for all chunks)
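# Worked example with the defaults (history_sizes=[16, 2, 1], num_latent_frames_per_chunk=9,
# keep_first_frame=True): indices 0..28 split into prefix=[0], long=[1..16], mid=[17, 18],
# 1x=[19], hidden_states=[20..28], and short history = cat(prefix, 1x) = [0, 19].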
if block_state.keep_first_frame:
indices = torch.arange(0, sum([1, *history_sizes, block_state.num_latent_frames_per_chunk]))
(
indices_prefix,
indices_latents_history_long,
indices_latents_history_mid,
indices_latents_history_1x,
indices_hidden_states,
) = indices.split([1, *history_sizes, block_state.num_latent_frames_per_chunk], dim=0)
indices_latents_history_short = torch.cat([indices_prefix, indices_latents_history_1x], dim=0)
else:
indices = torch.arange(0, sum([*history_sizes, block_state.num_latent_frames_per_chunk]))
(
indices_latents_history_long,
indices_latents_history_mid,
indices_latents_history_short,
indices_hidden_states,
) = indices.split([*history_sizes, block_state.num_latent_frames_per_chunk], dim=0)
# Latent shape per chunk
block_state.latent_shape = (
batch_size,
num_channels_latents,
block_state.num_latent_frames_per_chunk,
h_latent,
w_latent,
)
# Set outputs
block_state.history_sizes = history_sizes
block_state.indices_hidden_states = indices_hidden_states.unsqueeze(0)
block_state.indices_latents_history_short = indices_latents_history_short.unsqueeze(0)
block_state.indices_latents_history_mid = indices_latents_history_mid.unsqueeze(0)
block_state.indices_latents_history_long = indices_latents_history_long.unsqueeze(0)
block_state.history_latents = torch.zeros(
batch_size,
num_channels_latents,
sum(history_sizes),
h_latent,
w_latent,
device=device,
dtype=torch.float32,
)
self.set_block_state(state, block_state)
return components, state
class HeliosI2VSeedHistoryStep(ModularPipelineBlocks):
"""Seeds history_latents with fake_image_latents for I2V pipelines.
This small additive step runs after HeliosPrepareHistoryStep and appends fake_image_latents to the initialized
history_latents tensor.
"""
model_name = "helios"
@property
def description(self) -> str:
return "I2V history seeding: appends fake_image_latents to history_latents."
@property
def inputs(self) -> list[InputParam]:
return [
InputParam("history_latents", required=True, type_hint=torch.Tensor),
InputParam("fake_image_latents", required=True, type_hint=torch.Tensor),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam(
"history_latents", type_hint=torch.Tensor, description="History latents seeded with fake_image_latents"
),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
block_state.history_latents = torch.cat([block_state.history_latents, block_state.fake_image_latents], dim=2)
self.set_block_state(state, block_state)
return components, state
class HeliosV2VSeedHistoryStep(ModularPipelineBlocks):
"""Seeds history_latents with video_latents for V2V pipelines.
This step runs after HeliosPrepareHistoryStep and replaces the tail of history_latents with video_latents. If the
video has fewer frames than the history, the beginning of history is preserved.
"""
model_name = "helios"
@property
def description(self) -> str:
return "V2V history seeding: replaces the tail of history_latents with video_latents."
@property
def inputs(self) -> list[InputParam]:
return [
InputParam("history_latents", required=True, type_hint=torch.Tensor),
InputParam("video_latents", required=True, type_hint=torch.Tensor),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam(
"history_latents", type_hint=torch.Tensor, description="History latents seeded with video_latents"
),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
history_latents = block_state.history_latents
video_latents = block_state.video_latents
history_frames = history_latents.shape[2]
video_frames = video_latents.shape[2]
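# If the encoded video is shorter than the history buffer, keep the zero-initialized head of the
# buffer and overwrite only the tail; otherwise the video latents are used as the history directly.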
if video_frames < history_frames:
keep_frames = history_frames - video_frames
history_latents = torch.cat([history_latents[:, :, :keep_frames, :, :], video_latents], dim=2)
else:
history_latents = video_latents
block_state.history_latents = history_latents
self.set_block_state(state, block_state)
return components, state
class HeliosSetTimestepsStep(ModularPipelineBlocks):
"""Computes scheduler parameters (mu, sigmas) for the chunk loop."""
model_name = "helios"
@property
def description(self) -> str:
return "Computes scheduler shift parameter (mu) and default sigmas for the Helios chunk loop."
@property
def expected_components(self) -> list[ComponentSpec]:
return [
ComponentSpec("transformer", HeliosTransformer3DModel),
ComponentSpec("scheduler", HeliosScheduler),
]
@property
def inputs(self) -> list[InputParam]:
return [
InputParam("latent_shape", required=True, type_hint=tuple),
InputParam.template("num_inference_steps"),
InputParam.template("sigmas"),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam("mu", type_hint=float, description="Scheduler shift parameter"),
OutputParam("sigmas", type_hint=list, description="Sigma schedule for diffusion"),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
patch_size = components.transformer.config.patch_size
latent_shape = block_state.latent_shape
image_seq_len = (latent_shape[-1] * latent_shape[-2] * latent_shape[-3]) // (
patch_size[0] * patch_size[1] * patch_size[2]
)
if block_state.sigmas is None:
block_state.sigmas = np.linspace(0.999, 0.0, block_state.num_inference_steps + 1)[:-1]
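# mu linearly interpolates the scheduler shift between (base_image_seq_len, base_shift) and
# (max_image_seq_len, max_shift) as a function of image_seq_len for the current latent shape.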
block_state.mu = calculate_shift(
image_seq_len,
components.scheduler.config.get("base_image_seq_len", 256),
components.scheduler.config.get("max_image_seq_len", 4096),
components.scheduler.config.get("base_shift", 0.5),
components.scheduler.config.get("max_shift", 1.15),
)
self.set_block_state(state, block_state)
return components, state


@@ -0,0 +1,110 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import PIL
import torch
from ...configuration_utils import FrozenDict
from ...models import AutoencoderKLWan
from ...utils import logging
from ...video_processor import VideoProcessor
from ..modular_pipeline import ModularPipelineBlocks, PipelineState
from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class HeliosDecodeStep(ModularPipelineBlocks):
"""Decode all chunk latents with VAE, trim frames, and postprocess into final video output."""
model_name = "helios"
@property
def description(self) -> str:
return (
"Decodes all chunk latents with the VAE, concatenates them, "
"trims to the target frame count, and postprocesses into the final video output."
)
@property
def expected_components(self) -> list[ComponentSpec]:
return [
ComponentSpec("vae", AutoencoderKLWan),
ComponentSpec(
"video_processor",
VideoProcessor,
config=FrozenDict({"vae_scale_factor": 8}),
default_creation_method="from_config",
),
]
@property
def inputs(self) -> list[InputParam]:
return [
InputParam(
"latent_chunks", required=True, type_hint=list, description="List of per-chunk denoised latent tensors"
),
InputParam("num_frames", required=True, type_hint=int, description="The target number of output frames"),
InputParam.template("output_type", default="np"),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam(
"videos",
type_hint=list[list[PIL.Image.Image]] | list[torch.Tensor] | list[np.ndarray],
description="The generated videos, can be a PIL.Image.Image, torch.Tensor or a numpy array",
),
]
@torch.no_grad()
def __call__(self, components, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
vae = components.vae
latents_mean = (
torch.tensor(vae.config.latents_mean).view(1, vae.config.z_dim, 1, 1, 1).to(vae.device, vae.dtype)
)
latents_std = 1.0 / torch.tensor(vae.config.latents_std).view(1, vae.config.z_dim, 1, 1, 1).to(
vae.device, vae.dtype
)
history_video = None
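# Undo the per-channel latent normalization ((x - mean) / std) applied at encode time
# before decoding each chunk with the VAE.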
for chunk_latents in block_state.latent_chunks:
current_latents = chunk_latents.to(vae.dtype) / latents_std + latents_mean
current_video = vae.decode(current_latents, return_dict=False)[0]
if history_video is None:
history_video = current_video
else:
history_video = torch.cat([history_video, current_video], dim=2)
# Trim to a frame count of the form k * vae_scale_factor_temporal + 1 so the output is temporally valid
generated_frames = history_video.size(2)
generated_frames = (
generated_frames - 1
) // components.vae_scale_factor_temporal * components.vae_scale_factor_temporal + 1
history_video = history_video[:, :, :generated_frames]
block_state.videos = components.video_processor.postprocess_video(
history_video, output_type=block_state.output_type
)
self.set_block_state(state, block_state)
return components, state

File diff suppressed because it is too large


@@ -0,0 +1,392 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import html
import regex as re
import torch
from transformers import AutoTokenizer, UMT5EncoderModel
from ...configuration_utils import FrozenDict
from ...guiders import ClassifierFreeGuidance
from ...models import AutoencoderKLWan
from ...utils import is_ftfy_available, logging
from ...video_processor import VideoProcessor
from ..modular_pipeline import ModularPipelineBlocks, PipelineState
from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
from .modular_pipeline import HeliosModularPipeline
if is_ftfy_available():
import ftfy
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
def basic_clean(text):
text = ftfy.fix_text(text)
text = html.unescape(html.unescape(text))
return text.strip()
def whitespace_clean(text):
text = re.sub(r"\s+", " ", text)
text = text.strip()
return text
def prompt_clean(text):
text = whitespace_clean(basic_clean(text))
return text
def get_t5_prompt_embeds(
text_encoder: UMT5EncoderModel,
tokenizer: AutoTokenizer,
prompt: str | list[str],
max_sequence_length: int,
device: torch.device,
dtype: torch.dtype | None = None,
):
"""Encode text prompts into T5 embeddings for Helios.
Args:
text_encoder: The T5 text encoder model.
tokenizer: The tokenizer for the text encoder.
prompt: The prompt or prompts to encode.
max_sequence_length: Maximum sequence length for tokenization.
device: Device to place tensors on.
dtype: Optional dtype override. Defaults to `text_encoder.dtype`.
Returns:
A tuple of `(prompt_embeds, attention_mask)` where `prompt_embeds` is the encoded text embeddings and
`attention_mask` is a boolean mask.
"""
dtype = dtype or text_encoder.dtype
prompt = [prompt] if isinstance(prompt, str) else prompt
prompt = [prompt_clean(u) for u in prompt]
text_inputs = tokenizer(
prompt,
padding="max_length",
max_length=max_sequence_length,
truncation=True,
add_special_tokens=True,
return_attention_mask=True,
return_tensors="pt",
)
text_input_ids, mask = text_inputs.input_ids, text_inputs.attention_mask
seq_lens = mask.gt(0).sum(dim=1).long()
prompt_embeds = text_encoder(text_input_ids.to(device), mask.to(device)).last_hidden_state
prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
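# Trim each sequence to its true (unpadded) length, then zero-pad back to max_sequence_length so
# that padding positions hold exact zeros rather than encoder outputs for pad tokens.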
prompt_embeds = [u[:v] for u, v in zip(prompt_embeds, seq_lens)]
prompt_embeds = torch.stack(
[torch.cat([u, u.new_zeros(max_sequence_length - u.size(0), u.size(1))]) for u in prompt_embeds], dim=0
)
return prompt_embeds, text_inputs.attention_mask.bool()
class HeliosTextEncoderStep(ModularPipelineBlocks):
model_name = "helios"
@property
def description(self) -> str:
return "Text Encoder step that generates text embeddings to guide the video generation"
@property
def expected_components(self) -> list[ComponentSpec]:
return [
ComponentSpec("text_encoder", UMT5EncoderModel),
ComponentSpec("tokenizer", AutoTokenizer),
ComponentSpec(
"guider",
ClassifierFreeGuidance,
config=FrozenDict({"guidance_scale": 5.0}),
default_creation_method="from_config",
),
]
@property
def inputs(self) -> list[InputParam]:
return [
InputParam.template("prompt"),
InputParam.template("negative_prompt"),
InputParam.template("max_sequence_length"),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam.template("prompt_embeds"),
OutputParam.template("negative_prompt_embeds"),
]
@staticmethod
def check_inputs(prompt, negative_prompt):
if prompt is not None and not isinstance(prompt, (str, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
if negative_prompt is not None and not isinstance(negative_prompt, (str, list)):
raise ValueError(f"`negative_prompt` has to be of type `str` or `list` but is {type(negative_prompt)}")
if prompt is not None and negative_prompt is not None:
prompt_list = [prompt] if isinstance(prompt, str) else prompt
neg_list = [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
if type(prompt_list) is not type(neg_list):
raise TypeError(
f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
f" {type(prompt)}."
)
if len(prompt_list) != len(neg_list):
raise ValueError(
f"`negative_prompt` has batch size {len(neg_list)}, but `prompt` has batch size"
f" {len(prompt_list)}. Please make sure that passed `negative_prompt` matches"
" the batch size of `prompt`."
)
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
prompt = block_state.prompt
negative_prompt = block_state.negative_prompt
max_sequence_length = block_state.max_sequence_length
device = components._execution_device
self.check_inputs(prompt, negative_prompt)
# Encode prompt
block_state.prompt_embeds, _ = get_t5_prompt_embeds(
text_encoder=components.text_encoder,
tokenizer=components.tokenizer,
prompt=prompt,
max_sequence_length=max_sequence_length,
device=device,
)
# Encode negative prompt
block_state.negative_prompt_embeds = None
if components.requires_unconditional_embeds:
negative_prompt = negative_prompt or ""
if isinstance(prompt, list) and isinstance(negative_prompt, str):
negative_prompt = len(prompt) * [negative_prompt]
block_state.negative_prompt_embeds, _ = get_t5_prompt_embeds(
text_encoder=components.text_encoder,
tokenizer=components.tokenizer,
prompt=negative_prompt,
max_sequence_length=max_sequence_length,
device=device,
)
self.set_block_state(state, block_state)
return components, state
class HeliosImageVaeEncoderStep(ModularPipelineBlocks):
"""Encodes an input image into VAE latent space for image-to-video generation."""
model_name = "helios"
@property
def description(self) -> str:
return (
"Image Encoder step that encodes an input image into VAE latent space, "
"producing image_latents (first frame prefix) and fake_image_latents (history seed) "
"for image-to-video generation."
)
@property
def expected_components(self) -> list[ComponentSpec]:
return [
ComponentSpec("vae", AutoencoderKLWan),
ComponentSpec(
"video_processor",
VideoProcessor,
config=FrozenDict({"vae_scale_factor": 8}),
default_creation_method="from_config",
),
]
@property
def inputs(self) -> list[InputParam]:
return [
InputParam.template("image"),
InputParam.template("height", default=384),
InputParam.template("width", default=640),
InputParam(
"num_latent_frames_per_chunk",
default=9,
type_hint=int,
description="Number of latent frames per temporal chunk.",
),
InputParam.template("generator"),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam.template("image_latents"),
OutputParam(
"fake_image_latents", type_hint=torch.Tensor, description="Fake image latents for history seeding"
),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
vae = components.vae
device = components._execution_device
latents_mean = (
torch.tensor(vae.config.latents_mean).view(1, vae.config.z_dim, 1, 1, 1).to(vae.device, vae.dtype)
)
latents_std = 1.0 / torch.tensor(vae.config.latents_std).view(1, vae.config.z_dim, 1, 1, 1).to(
vae.device, vae.dtype
)
# Preprocess image to 4D tensor (B, C, H, W)
image = components.video_processor.preprocess(
block_state.image, height=block_state.height, width=block_state.width
)
image_5d = image.unsqueeze(2).to(device=device, dtype=vae.dtype) # (B, C, 1, H, W)
# Encode image to get image_latents
image_latents = vae.encode(image_5d).latent_dist.sample(generator=block_state.generator)
image_latents = (image_latents - latents_mean) * latents_std
# Encode fake video to get fake_image_latents
min_frames = (block_state.num_latent_frames_per_chunk - 1) * components.vae_scale_factor_temporal + 1
fake_video = image_5d.repeat(1, 1, min_frames, 1, 1) # (B, C, min_frames, H, W)
fake_latents_full = vae.encode(fake_video).latent_dist.sample(generator=block_state.generator)
fake_latents_full = (fake_latents_full - latents_mean) * latents_std
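# Keep only the last latent frame of the still-video encoding; HeliosI2VSeedHistoryStep later
# appends it to history_latents as the history seed.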
fake_image_latents = fake_latents_full[:, :, -1:, :, :]
block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
block_state.fake_image_latents = fake_image_latents.to(device=device, dtype=torch.float32)
self.set_block_state(state, block_state)
return components, state
class HeliosVideoVaeEncoderStep(ModularPipelineBlocks):
"""Encodes an input video into VAE latent space for video-to-video generation.
Produces `image_latents` (first frame) and `video_latents` (remaining frames encoded in chunks).
"""
model_name = "helios"
@property
def description(self) -> str:
return (
"Video Encoder step that encodes an input video into VAE latent space, "
"producing image_latents (first frame) and video_latents (chunked video frames) "
"for video-to-video generation."
)
@property
def expected_components(self) -> list[ComponentSpec]:
return [
ComponentSpec("vae", AutoencoderKLWan),
ComponentSpec(
"video_processor",
VideoProcessor,
config=FrozenDict({"vae_scale_factor": 8}),
default_creation_method="from_config",
),
]
@property
def inputs(self) -> list[InputParam]:
return [
InputParam("video", required=True, description="Input video for video-to-video generation"),
InputParam.template("height", default=384),
InputParam.template("width", default=640),
InputParam(
"num_latent_frames_per_chunk",
default=9,
type_hint=int,
description="Number of latent frames per temporal chunk.",
),
InputParam.template("generator"),
]
@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam.template("image_latents"),
OutputParam("video_latents", type_hint=torch.Tensor, description="Encoded video latents (chunked)"),
]
@torch.no_grad()
def __call__(self, components: HeliosModularPipeline, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
vae = components.vae
device = components._execution_device
num_latent_frames_per_chunk = block_state.num_latent_frames_per_chunk
latents_mean = (
torch.tensor(vae.config.latents_mean).view(1, vae.config.z_dim, 1, 1, 1).to(vae.device, vae.dtype)
)
latents_std = 1.0 / torch.tensor(vae.config.latents_std).view(1, vae.config.z_dim, 1, 1, 1).to(
vae.device, vae.dtype
)
# Preprocess video
video = components.video_processor.preprocess_video(
block_state.video, height=block_state.height, width=block_state.width
)
video = video.to(device=device, dtype=vae.dtype)
# Encode video into latents
num_frames = video.shape[2]
min_frames = (num_latent_frames_per_chunk - 1) * 4 + 1
num_chunks = num_frames // min_frames
if num_chunks == 0:
raise ValueError(
f"Video must have at least {min_frames} frames "
f"(got {num_frames} frames). "
f"Required: (num_latent_frames_per_chunk - 1) * 4 + 1 = ({num_latent_frames_per_chunk} - 1) * 4 + 1 = {min_frames}"
)
total_valid_frames = num_chunks * min_frames
start_frame = num_frames - total_valid_frames
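# Only whole chunks of min_frames pixel frames are encoded; leading frames beyond a whole multiple
# are skipped (e.g. num_frames=140, min_frames=33 -> num_chunks=4, start_frame=8). The hard-coded 4
# assumes the Wan VAE temporal compression factor (components.vae_scale_factor_temporal).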
# Encode first frame
first_frame = video[:, :, 0:1, :, :]
image_latents = vae.encode(first_frame).latent_dist.sample(generator=block_state.generator)
image_latents = (image_latents - latents_mean) * latents_std
# Encode remaining frames in chunks
latents_chunks = []
for i in range(num_chunks):
chunk_start = start_frame + i * min_frames
chunk_end = chunk_start + min_frames
video_chunk = video[:, :, chunk_start:chunk_end, :, :]
chunk_latents = vae.encode(video_chunk).latent_dist.sample(generator=block_state.generator)
chunk_latents = (chunk_latents - latents_mean) * latents_std
latents_chunks.append(chunk_latents)
video_latents = torch.cat(latents_chunks, dim=2)
block_state.image_latents = image_latents.to(device=device, dtype=torch.float32)
block_state.video_latents = video_latents.to(device=device, dtype=torch.float32)
self.set_block_state(state, block_state)
return components, state


@@ -0,0 +1,542 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from ...utils import logging
from ..modular_pipeline import AutoPipelineBlocks, ConditionalPipelineBlocks, SequentialPipelineBlocks
from ..modular_pipeline_utils import InputParam, InsertableDict, OutputParam
from .before_denoise import (
HeliosAdditionalInputsStep,
HeliosAddNoiseToImageLatentsStep,
HeliosAddNoiseToVideoLatentsStep,
HeliosI2VSeedHistoryStep,
HeliosPrepareHistoryStep,
HeliosSetTimestepsStep,
HeliosTextInputStep,
HeliosV2VSeedHistoryStep,
)
from .decoders import HeliosDecodeStep
from .denoise import HeliosChunkDenoiseStep, HeliosI2VChunkDenoiseStep
from .encoders import HeliosImageVaeEncoderStep, HeliosTextEncoderStep, HeliosVideoVaeEncoderStep
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# ====================
# 1. Vae Encoder
# ====================
# auto_docstring
class HeliosAutoVaeEncoderStep(AutoPipelineBlocks):
"""
Encoder step that encodes video or image inputs. This is an auto pipeline block.
- `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.
- `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.
- If neither is provided, step will be skipped.
Components:
vae (`AutoencoderKLWan`) video_processor (`VideoProcessor`)
Inputs:
video (`None`, *optional*):
Input video for video-to-video generation
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
image (`Image | list`, *optional*):
Reference image(s) for denoising. Can be a single image or list of images.
Outputs:
image_latents (`Tensor`):
The latent representation of the input image.
video_latents (`Tensor`):
Encoded video latents (chunked)
fake_image_latents (`Tensor`):
Fake image latents for history seeding
"""
block_classes = [HeliosVideoVaeEncoderStep, HeliosImageVaeEncoderStep]
block_names = ["video_encoder", "image_encoder"]
block_trigger_inputs = ["video", "image"]
@property
def description(self):
return (
"Encoder step that encodes video or image inputs. This is an auto pipeline block.\n"
" - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.\n"
" - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.\n"
" - If neither is provided, step will be skipped."
)
# ====================
# 2. DENOISE
# ====================
# DENOISE (T2V)
# auto_docstring
class HeliosCoreDenoiseStep(SequentialPipelineBlocks):
"""
Denoise block that takes encoded conditions and runs the chunk-based denoising process.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps.
sigmas (`list`, *optional*):
Custom sigmas for the denoising process.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
timesteps (`Tensor`, *optional*):
Timesteps for the denoising process.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios"
block_classes = [
HeliosTextInputStep,
HeliosPrepareHistoryStep,
HeliosSetTimestepsStep,
HeliosChunkDenoiseStep,
]
block_names = ["input", "prepare_history", "set_timesteps", "chunk_denoise"]
@property
def description(self):
return "Denoise block that takes encoded conditions and runs the chunk-based denoising process."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (I2V)
# auto_docstring
class HeliosI2VCoreDenoiseStep(SequentialPipelineBlocks):
"""
I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`):
image latents used to guide the image generation. Can be generated from vae_encoder step.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video/fake-image latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video/fake-image latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps.
sigmas (`list`, *optional*):
Custom sigmas for the denoising process.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
timesteps (`Tensor`, *optional*):
Timesteps for the denoising process.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios"
block_classes = [
HeliosTextInputStep,
HeliosAdditionalInputsStep(
image_latent_inputs=[InputParam.template("image_latents")],
additional_batch_inputs=[
InputParam(
"fake_image_latents",
type_hint=torch.Tensor,
description="Fake image latents used as history seed for I2V generation.",
),
],
),
HeliosAddNoiseToImageLatentsStep,
HeliosPrepareHistoryStep,
HeliosI2VSeedHistoryStep,
HeliosSetTimestepsStep,
HeliosI2VChunkDenoiseStep,
]
block_names = [
"input",
"additional_inputs",
"add_noise_image",
"prepare_history",
"seed_history",
"set_timesteps",
"chunk_denoise",
]
@property
def description(self):
return "I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (V2V)
# auto_docstring
class HeliosV2VCoreDenoiseStep(SequentialPipelineBlocks):
"""
V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps.
sigmas (`list`, *optional*):
Custom sigmas for the denoising process.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
timesteps (`Tensor`, *optional*):
Timesteps for the denoising process.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios"
block_classes = [
HeliosTextInputStep,
HeliosAdditionalInputsStep(
image_latent_inputs=[InputParam.template("image_latents")],
additional_batch_inputs=[
InputParam(
"video_latents", type_hint=torch.Tensor, description="Encoded video latents for V2V generation."
),
],
),
HeliosAddNoiseToVideoLatentsStep,
HeliosPrepareHistoryStep,
HeliosV2VSeedHistoryStep,
HeliosSetTimestepsStep,
HeliosI2VChunkDenoiseStep,
]
block_names = [
"input",
"additional_inputs",
"add_noise_video",
"prepare_history",
"seed_history",
"set_timesteps",
"chunk_denoise",
]
@property
def description(self):
return "V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# AUTO DENOISE
# auto_docstring
class HeliosAutoCoreDenoiseStep(ConditionalPipelineBlocks):
"""
Core denoise step that selects the appropriate denoising block.
- `HeliosV2VCoreDenoiseStep` (video2video) for video-to-video tasks.
- `HeliosI2VCoreDenoiseStep` (image2video) for image-to-video tasks.
- `HeliosCoreDenoiseStep` (text2video) for text-to-video tasks.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps.
sigmas (`list`):
Custom sigmas for the denoising process.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
timesteps (`Tensor`, *optional*):
Timesteps for the denoising process.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
block_classes = [HeliosV2VCoreDenoiseStep, HeliosI2VCoreDenoiseStep, HeliosCoreDenoiseStep]
block_names = ["video2video", "image2video", "text2video"]
block_trigger_inputs = ["video_latents", "fake_image_latents"]
default_block_name = "text2video"
def select_block(self, video_latents=None, fake_image_latents=None):
if video_latents is not None:
return "video2video"
elif fake_image_latents is not None:
return "image2video"
return None
@property
def description(self):
return (
"Core denoise step that selects the appropriate denoising block.\n"
" - `HeliosV2VCoreDenoiseStep` (video2video) for video-to-video tasks.\n"
" - `HeliosI2VCoreDenoiseStep` (image2video) for image-to-video tasks.\n"
" - `HeliosCoreDenoiseStep` (text2video) for text-to-video tasks."
)
AUTO_BLOCKS = InsertableDict(
[
("text_encoder", HeliosTextEncoderStep()),
("vae_encoder", HeliosAutoVaeEncoderStep()),
("denoise", HeliosAutoCoreDenoiseStep()),
("decode", HeliosDecodeStep()),
]
)
# ====================
# 3. Auto Blocks
# ====================
# auto_docstring
class HeliosAutoBlocks(SequentialPipelineBlocks):
"""
Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.
Supported workflows:
- `text2video`: requires `prompt`
- `image2video`: requires `prompt`, `image`
- `video2video`: requires `prompt`, `video`
Components:
text_encoder (`UMT5EncoderModel`) tokenizer (`AutoTokenizer`) guider (`ClassifierFreeGuidance`) vae
(`AutoencoderKLWan`) video_processor (`VideoProcessor`) transformer (`HeliosTransformer3DModel`) scheduler
(`HeliosScheduler`)
Inputs:
prompt (`str`):
The prompt or prompts to guide image generation.
negative_prompt (`str`, *optional*):
The prompt or prompts not to guide the image generation.
max_sequence_length (`int`, *optional*, defaults to 512):
Maximum sequence length for prompt encoding.
video (`None`, *optional*):
Input video for video-to-video generation
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
image (`Image | list`, *optional*):
Reference image(s) for denoising. Can be a single image or list of images.
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps.
sigmas (`list`):
Custom sigmas for the denoising process.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
timesteps (`Tensor`, *optional*):
Timesteps for the denoising process.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
output_type (`str`, *optional*, defaults to np):
Output format: 'pil', 'np', 'pt'.
Outputs:
videos (`list`):
The generated videos.
"""
model_name = "helios"
block_classes = AUTO_BLOCKS.values()
block_names = AUTO_BLOCKS.keys()
_workflow_map = {
"text2video": {"prompt": True},
"image2video": {"prompt": True, "image": True},
"video2video": {"prompt": True, "video": True},
}
@property
def description(self):
return "Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios."
@property
def outputs(self):
return [OutputParam.template("videos")]
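# Usage sketch (illustrative only; the repository id is hypothetical and assumes the standard
# modular-pipelines workflow of init_pipeline -> load_components -> __call__):
#     blocks = HeliosAutoBlocks()
#     pipe = blocks.init_pipeline("<org>/helios-modular")
#     pipe.load_components(torch_dtype=torch.bfloat16)
#     videos = pipe(prompt="a red panda surfing a wave", num_frames=132, output="videos")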


@@ -0,0 +1,520 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from ...utils import logging
from ..modular_pipeline import AutoPipelineBlocks, ConditionalPipelineBlocks, SequentialPipelineBlocks
from ..modular_pipeline_utils import InputParam, InsertableDict, OutputParam
from .before_denoise import (
HeliosAdditionalInputsStep,
HeliosAddNoiseToImageLatentsStep,
HeliosAddNoiseToVideoLatentsStep,
HeliosI2VSeedHistoryStep,
HeliosPrepareHistoryStep,
HeliosTextInputStep,
HeliosV2VSeedHistoryStep,
)
from .decoders import HeliosDecodeStep
from .denoise import HeliosPyramidChunkDenoiseStep, HeliosPyramidI2VChunkDenoiseStep
from .encoders import HeliosImageVaeEncoderStep, HeliosTextEncoderStep, HeliosVideoVaeEncoderStep
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# ====================
# 1. Vae Encoder
# ====================
# auto_docstring
class HeliosPyramidAutoVaeEncoderStep(AutoPipelineBlocks):
"""
Encoder step that encodes video or image inputs. This is an auto pipeline block.
- `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.
- `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.
- If neither is provided, step will be skipped.
Components:
vae (`AutoencoderKLWan`) video_processor (`VideoProcessor`)
Inputs:
video (`None`, *optional*):
Input video for video-to-video generation
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
image (`Image | list`, *optional*):
Reference image(s) for denoising. Can be a single image or list of images.
Outputs:
image_latents (`Tensor`):
The latent representation of the input image.
video_latents (`Tensor`):
Encoded video latents (chunked)
fake_image_latents (`Tensor`):
Fake image latents for history seeding
"""
block_classes = [HeliosVideoVaeEncoderStep, HeliosImageVaeEncoderStep]
block_names = ["video_encoder", "image_encoder"]
block_trigger_inputs = ["video", "image"]
@property
def description(self):
return (
"Encoder step that encodes video or image inputs. This is an auto pipeline block.\n"
" - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.\n"
" - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.\n"
" - If neither is provided, step will be skipped."
)
# ====================
# 2. DENOISE
# ====================
# DENOISE (T2V)
# auto_docstring
class HeliosPyramidCoreDenoiseStep(SequentialPipelineBlocks):
"""
T2V pyramid denoise block with progressive multi-resolution denoising.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
(`ClassifierFreeZeroStarGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios-pyramid"
block_classes = [
HeliosTextInputStep,
HeliosPrepareHistoryStep,
HeliosPyramidChunkDenoiseStep,
]
block_names = ["input", "prepare_history", "pyramid_chunk_denoise"]
@property
def description(self):
return "T2V pyramid denoise block with progressive multi-resolution denoising."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (I2V)
# auto_docstring
class HeliosPyramidI2VCoreDenoiseStep(SequentialPipelineBlocks):
"""
I2V pyramid denoise block with progressive multi-resolution denoising.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
(`ClassifierFreeZeroStarGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`):
image latents used to guide the image generation. Can be generated from vae_encoder step.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video/fake-image latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video/fake-image latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios-pyramid"
block_classes = [
HeliosTextInputStep,
HeliosAdditionalInputsStep(
image_latent_inputs=[InputParam.template("image_latents")],
additional_batch_inputs=[
InputParam(
"fake_image_latents",
type_hint=torch.Tensor,
description="Fake image latents used as history seed for I2V generation.",
),
],
),
HeliosAddNoiseToImageLatentsStep,
HeliosPrepareHistoryStep,
HeliosI2VSeedHistoryStep,
HeliosPyramidI2VChunkDenoiseStep,
]
block_names = [
"input",
"additional_inputs",
"add_noise_image",
"prepare_history",
"seed_history",
"pyramid_chunk_denoise",
]
@property
def description(self):
return "I2V pyramid denoise block with progressive multi-resolution denoising."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (V2V)
# auto_docstring
class HeliosPyramidV2VCoreDenoiseStep(SequentialPipelineBlocks):
"""
V2V pyramid denoise block with progressive multi-resolution denoising.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
(`ClassifierFreeZeroStarGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios-pyramid"
block_classes = [
HeliosTextInputStep,
HeliosAdditionalInputsStep(
image_latent_inputs=[InputParam.template("image_latents")],
additional_batch_inputs=[
InputParam(
"video_latents", type_hint=torch.Tensor, description="Encoded video latents for V2V generation."
),
],
),
HeliosAddNoiseToVideoLatentsStep,
HeliosPrepareHistoryStep,
HeliosV2VSeedHistoryStep,
HeliosPyramidI2VChunkDenoiseStep,
]
block_names = [
"input",
"additional_inputs",
"add_noise_video",
"prepare_history",
"seed_history",
"pyramid_chunk_denoise",
]
@property
def description(self):
return "V2V pyramid denoise block with progressive multi-resolution denoising."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# AUTO DENOISE
# auto_docstring
class HeliosPyramidAutoCoreDenoiseStep(ConditionalPipelineBlocks):
"""
Pyramid core denoise step that selects the appropriate denoising block.
- `HeliosPyramidV2VCoreDenoiseStep` (video2video) for video-to-video tasks.
- `HeliosPyramidI2VCoreDenoiseStep` (image2video) for image-to-video tasks.
- `HeliosPyramidCoreDenoiseStep` (text2video) for text-to-video tasks.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider
(`ClassifierFreeZeroStarGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
block_classes = [HeliosPyramidV2VCoreDenoiseStep, HeliosPyramidI2VCoreDenoiseStep, HeliosPyramidCoreDenoiseStep]
block_names = ["video2video", "image2video", "text2video"]
block_trigger_inputs = ["video_latents", "fake_image_latents"]
default_block_name = "text2video"
def select_block(self, video_latents=None, fake_image_latents=None):
if video_latents is not None:
return "video2video"
elif fake_image_latents is not None:
return "image2video"
return None
@property
def description(self):
return (
"Pyramid core denoise step that selects the appropriate denoising block.\n"
" - `HeliosPyramidV2VCoreDenoiseStep` (video2video) for video-to-video tasks.\n"
" - `HeliosPyramidI2VCoreDenoiseStep` (image2video) for image-to-video tasks.\n"
" - `HeliosPyramidCoreDenoiseStep` (text2video) for text-to-video tasks."
)
# ====================
# 3. Auto Blocks
# ====================
PYRAMID_AUTO_BLOCKS = InsertableDict(
[
("text_encoder", HeliosTextEncoderStep()),
("vae_encoder", HeliosPyramidAutoVaeEncoderStep()),
("denoise", HeliosPyramidAutoCoreDenoiseStep()),
("decode", HeliosDecodeStep()),
]
)
# auto_docstring
class HeliosPyramidAutoBlocks(SequentialPipelineBlocks):
"""
Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios.
Supported workflows:
- `text2video`: requires `prompt`
- `image2video`: requires `prompt`, `image`
- `video2video`: requires `prompt`, `video`
Components:
text_encoder (`UMT5EncoderModel`) tokenizer (`AutoTokenizer`) guider (`ClassifierFreeGuidance`) vae
(`AutoencoderKLWan`) video_processor (`VideoProcessor`) transformer (`HeliosTransformer3DModel`) scheduler
(`HeliosScheduler`)
Inputs:
prompt (`str`):
The prompt or prompts to guide image generation.
negative_prompt (`str`, *optional*):
The prompt or prompts not to guide the image generation.
max_sequence_length (`int`, *optional*, defaults to 512):
Maximum sequence length for prompt encoding.
video (`None`, *optional*):
Input video for video-to-video generation
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
image (`Image | list`, *optional*):
Reference image(s) for denoising. Can be a single image or list of images.
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
output_type (`str`, *optional*, defaults to np):
Output format: 'pil', 'np', 'pt'.
Outputs:
videos (`list`):
The generated videos.
"""
model_name = "helios-pyramid"
block_classes = PYRAMID_AUTO_BLOCKS.values()
block_names = PYRAMID_AUTO_BLOCKS.keys()
_workflow_map = {
"text2video": {"prompt": True},
"image2video": {"prompt": True, "image": True},
"video2video": {"prompt": True, "video": True},
}
@property
def description(self):
return "Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios."
@property
def outputs(self):
return [OutputParam.template("videos")]
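For orientation, a minimal usage sketch of the auto blocks defined above. This is not part of the diff: the repository id is a placeholder and the `init_pipeline` / `load_components` / `output=` entry points are assumed to follow the standard modular-pipelines API rather than taken from this PR.
import torch
from diffusers.modular_pipelines import HeliosPyramidAutoBlocks  # import path assumed

blocks = HeliosPyramidAutoBlocks()
pipe = blocks.init_pipeline("<your-org>/helios-pyramid-modular")  # placeholder repo id
pipe.load_components(torch_dtype=torch.bfloat16)

# Only `prompt` is given, so the workflow map routes to text2video; passing `image=...`
# or `video=...` as well would select the image2video / video2video denoise path instead.
videos = pipe(prompt="a red panda surfing at sunset", num_frames=132, output="videos")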

View File

@@ -0,0 +1,530 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from ...utils import logging
from ..modular_pipeline import AutoPipelineBlocks, ConditionalPipelineBlocks, SequentialPipelineBlocks
from ..modular_pipeline_utils import InputParam, InsertableDict, OutputParam
from .before_denoise import (
HeliosAdditionalInputsStep,
HeliosAddNoiseToImageLatentsStep,
HeliosAddNoiseToVideoLatentsStep,
HeliosI2VSeedHistoryStep,
HeliosPrepareHistoryStep,
HeliosTextInputStep,
HeliosV2VSeedHistoryStep,
)
from .decoders import HeliosDecodeStep
from .denoise import HeliosPyramidDistilledChunkDenoiseStep, HeliosPyramidDistilledI2VChunkDenoiseStep
from .encoders import HeliosImageVaeEncoderStep, HeliosTextEncoderStep, HeliosVideoVaeEncoderStep
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# ====================
# 1. Vae Encoder
# ====================
# auto_docstring
class HeliosPyramidDistilledAutoVaeEncoderStep(AutoPipelineBlocks):
"""
Encoder step for distilled pyramid pipeline.
- `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.
- `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.
- If neither is provided, step will be skipped.
Components:
vae (`AutoencoderKLWan`) video_processor (`VideoProcessor`)
Inputs:
video (`None`, *optional*):
Input video for video-to-video generation
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
image (`Image | list`, *optional*):
Reference image(s) for denoising. Can be a single image or list of images.
Outputs:
image_latents (`Tensor`):
The latent representation of the input image.
video_latents (`Tensor`):
Encoded video latents (chunked)
fake_image_latents (`Tensor`):
Fake image latents for history seeding
"""
block_classes = [HeliosVideoVaeEncoderStep, HeliosImageVaeEncoderStep]
block_names = ["video_encoder", "image_encoder"]
block_trigger_inputs = ["video", "image"]
@property
def description(self):
return (
"Encoder step for distilled pyramid pipeline.\n"
" - `HeliosVideoVaeEncoderStep` (video_encoder) is used when `video` is provided.\n"
" - `HeliosImageVaeEncoderStep` (image_encoder) is used when `image` is provided.\n"
" - If neither is provided, step will be skipped."
)
# ====================
# 2. DENOISE
# ====================
# DENOISE (T2V)
# auto_docstring
class HeliosPyramidDistilledCoreDenoiseStep(SequentialPipelineBlocks):
"""
T2V distilled pyramid denoise block with DMD scheduler and no CFG.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
is_amplify_first_chunk (`bool`, *optional*, defaults to True):
Whether to double the first chunk's timesteps via the scheduler for amplified generation.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios-pyramid"
block_classes = [
HeliosTextInputStep,
HeliosPrepareHistoryStep,
HeliosPyramidDistilledChunkDenoiseStep,
]
block_names = ["input", "prepare_history", "pyramid_chunk_denoise"]
@property
def description(self):
return "T2V distilled pyramid denoise block with DMD scheduler and no CFG."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (I2V)
# auto_docstring
class HeliosPyramidDistilledI2VCoreDenoiseStep(SequentialPipelineBlocks):
"""
I2V distilled pyramid denoise block with DMD scheduler and no CFG.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`):
image latents used to guide the image generation. Can be generated from vae_encoder step.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video/fake-image latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video/fake-image latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
is_amplify_first_chunk (`bool`, *optional*, defaults to True):
Whether to double the first chunk's timesteps via the scheduler for amplified generation.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios-pyramid"
block_classes = [
HeliosTextInputStep,
HeliosAdditionalInputsStep(
image_latent_inputs=[InputParam.template("image_latents")],
additional_batch_inputs=[
InputParam(
"fake_image_latents",
type_hint=torch.Tensor,
description="Fake image latents used as history seed for I2V generation.",
),
],
),
HeliosAddNoiseToImageLatentsStep,
HeliosPrepareHistoryStep,
HeliosI2VSeedHistoryStep,
HeliosPyramidDistilledI2VChunkDenoiseStep,
]
block_names = [
"input",
"additional_inputs",
"add_noise_image",
"prepare_history",
"seed_history",
"pyramid_chunk_denoise",
]
@property
def description(self):
return "I2V distilled pyramid denoise block with DMD scheduler and no CFG."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# DENOISE (V2V)
# auto_docstring
class HeliosPyramidDistilledV2VCoreDenoiseStep(SequentialPipelineBlocks):
"""
V2V distilled pyramid denoise block with DMD scheduler and no CFG.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`, *optional*, defaults to [16, 2, 1]):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
is_amplify_first_chunk (`bool`, *optional*, defaults to True):
Whether to double the first chunk's timesteps via the scheduler for amplified generation.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
model_name = "helios-pyramid"
block_classes = [
HeliosTextInputStep,
HeliosAdditionalInputsStep(
image_latent_inputs=[InputParam.template("image_latents")],
additional_batch_inputs=[
InputParam(
"video_latents", type_hint=torch.Tensor, description="Encoded video latents for V2V generation."
),
],
),
HeliosAddNoiseToVideoLatentsStep,
HeliosPrepareHistoryStep,
HeliosV2VSeedHistoryStep,
HeliosPyramidDistilledI2VChunkDenoiseStep,
]
block_names = [
"input",
"additional_inputs",
"add_noise_video",
"prepare_history",
"seed_history",
"pyramid_chunk_denoise",
]
@property
def description(self):
return "V2V distilled pyramid denoise block with DMD scheduler and no CFG."
@property
def outputs(self):
return [OutputParam("latent_chunks", type_hint=list, description="List of per-chunk denoised latent tensors")]
# AUTO DENOISE
# auto_docstring
class HeliosPyramidDistilledAutoCoreDenoiseStep(ConditionalPipelineBlocks):
"""
Distilled pyramid core denoise step that selects the appropriate denoising block.
- `HeliosPyramidDistilledV2VCoreDenoiseStep` (video2video) for video-to-video tasks.
- `HeliosPyramidDistilledI2VCoreDenoiseStep` (image2video) for image-to-video tasks.
- `HeliosPyramidDistilledCoreDenoiseStep` (text2video) for text-to-video tasks.
Components:
transformer (`HeliosTransformer3DModel`) scheduler (`HeliosScheduler`) guider (`ClassifierFreeGuidance`)
Inputs:
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
prompt_embeds (`Tensor`):
text embeddings used to guide the image generation. Can be generated from text_encoder step.
negative_prompt_embeds (`Tensor`, *optional*):
negative text embeddings used to guide the image generation. Can be generated from text_encoder step.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
is_amplify_first_chunk (`bool`, *optional*, defaults to True):
Whether to double the first chunk's timesteps via the scheduler for amplified generation.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
Outputs:
latent_chunks (`list`):
List of per-chunk denoised latent tensors
"""
block_classes = [
HeliosPyramidDistilledV2VCoreDenoiseStep,
HeliosPyramidDistilledI2VCoreDenoiseStep,
HeliosPyramidDistilledCoreDenoiseStep,
]
block_names = ["video2video", "image2video", "text2video"]
block_trigger_inputs = ["video_latents", "fake_image_latents"]
default_block_name = "text2video"
def select_block(self, video_latents=None, fake_image_latents=None):
if video_latents is not None:
return "video2video"
elif fake_image_latents is not None:
return "image2video"
return None
@property
def description(self):
return (
"Distilled pyramid core denoise step that selects the appropriate denoising block.\n"
" - `HeliosPyramidDistilledV2VCoreDenoiseStep` (video2video) for video-to-video tasks.\n"
" - `HeliosPyramidDistilledI2VCoreDenoiseStep` (image2video) for image-to-video tasks.\n"
" - `HeliosPyramidDistilledCoreDenoiseStep` (text2video) for text-to-video tasks."
)
# ====================
# 3. Auto Blocks
# ====================
DISTILLED_PYRAMID_AUTO_BLOCKS = InsertableDict(
[
("text_encoder", HeliosTextEncoderStep()),
("vae_encoder", HeliosPyramidDistilledAutoVaeEncoderStep()),
("denoise", HeliosPyramidDistilledAutoCoreDenoiseStep()),
("decode", HeliosDecodeStep()),
]
)
# auto_docstring
class HeliosPyramidDistilledAutoBlocks(SequentialPipelineBlocks):
"""
Auto Modular pipeline for distilled pyramid progressive generation (T2V/I2V/V2V) using Helios.
Supported workflows:
- `text2video`: requires `prompt`
- `image2video`: requires `prompt`, `image`
- `video2video`: requires `prompt`, `video`
Components:
text_encoder (`UMT5EncoderModel`) tokenizer (`AutoTokenizer`) guider (`ClassifierFreeGuidance`) vae
(`AutoencoderKLWan`) video_processor (`VideoProcessor`) transformer (`HeliosTransformer3DModel`) scheduler
(`HeliosScheduler`)
Inputs:
prompt (`str`):
The prompt or prompts to guide image generation.
negative_prompt (`str`, *optional*):
The prompt or prompts not to guide the image generation.
max_sequence_length (`int`, *optional*, defaults to 512):
Maximum sequence length for prompt encoding.
video (`None`, *optional*):
Input video for video-to-video generation
height (`int`, *optional*, defaults to 384):
The height in pixels of the generated image.
width (`int`, *optional*, defaults to 640):
The width in pixels of the generated image.
num_latent_frames_per_chunk (`int`, *optional*, defaults to 9):
Number of latent frames per temporal chunk.
generator (`Generator`, *optional*):
Torch generator for deterministic generation.
image (`Image | list`, *optional*):
Reference image(s) for denoising. Can be a single image or list of images.
num_videos_per_prompt (`int`, *optional*, defaults to 1):
Number of videos to generate per prompt.
image_latents (`Tensor`, *optional*):
image latents used to guide the image generation. Can be generated from vae_encoder step.
video_latents (`Tensor`, *optional*):
Encoded video latents for V2V generation.
image_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for image latent noise.
image_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for image latent noise.
video_noise_sigma_min (`float`, *optional*, defaults to 0.111):
Minimum sigma for video latent noise.
video_noise_sigma_max (`float`, *optional*, defaults to 0.135):
Maximum sigma for video latent noise.
num_frames (`int`, *optional*, defaults to 132):
Total number of video frames to generate.
history_sizes (`list`):
Sizes of long/mid/short history buffers for temporal context.
keep_first_frame (`bool`, *optional*, defaults to True):
Whether to keep the first frame as a prefix in history.
pyramid_num_inference_steps_list (`list`, *optional*, defaults to [10, 10, 10]):
Number of denoising steps per pyramid stage.
latents (`Tensor`, *optional*):
Pre-generated noisy latents for image generation.
**denoiser_input_fields (`None`, *optional*):
conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
is_amplify_first_chunk (`bool`, *optional*, defaults to True):
Whether to double the first chunk's timesteps via the scheduler for amplified generation.
attention_kwargs (`dict`, *optional*):
Additional kwargs for attention processors.
fake_image_latents (`Tensor`, *optional*):
Fake image latents used as history seed for I2V generation.
output_type (`str`, *optional*, defaults to np):
Output format: 'pil', 'np', 'pt'.
Outputs:
videos (`list`):
The generated videos.
"""
model_name = "helios-pyramid"
block_classes = DISTILLED_PYRAMID_AUTO_BLOCKS.values()
block_names = DISTILLED_PYRAMID_AUTO_BLOCKS.keys()
_workflow_map = {
"text2video": {"prompt": True},
"image2video": {"prompt": True, "image": True},
"video2video": {"prompt": True, "video": True},
}
@property
def description(self):
return "Auto Modular pipeline for distilled pyramid progressive generation (T2V/I2V/V2V) using Helios."
@property
def outputs(self):
return [OutputParam.template("videos")]

View File

@@ -0,0 +1,87 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...loaders import HeliosLoraLoaderMixin
from ...utils import logging
from ..modular_pipeline import ModularPipeline
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class HeliosModularPipeline(
ModularPipeline,
HeliosLoraLoaderMixin,
):
"""
A ModularPipeline for Helios text-to-video generation.
> [!WARNING]
> This is an experimental feature and is likely to change in the future.
"""
default_blocks_name = "HeliosAutoBlocks"
@property
def vae_scale_factor_spatial(self):
vae_scale_factor = 8
if hasattr(self, "vae") and self.vae is not None:
vae_scale_factor = self.vae.config.scale_factor_spatial
return vae_scale_factor
@property
def vae_scale_factor_temporal(self):
vae_scale_factor = 4
if hasattr(self, "vae") and self.vae is not None:
vae_scale_factor = self.vae.config.scale_factor_temporal
return vae_scale_factor
@property
def num_channels_latents(self):
# YiYi TODO: find out default value
num_channels_latents = 16
if hasattr(self, "transformer") and self.transformer is not None:
num_channels_latents = self.transformer.config.in_channels
return num_channels_latents
@property
def requires_unconditional_embeds(self):
requires_unconditional_embeds = False
if hasattr(self, "guider") and self.guider is not None:
requires_unconditional_embeds = self.guider._enabled and self.guider.num_conditions > 1
return requires_unconditional_embeds
class HeliosPyramidModularPipeline(HeliosModularPipeline):
"""
A ModularPipeline for Helios pyramid (progressive resolution) video generation.
> [!WARNING]
> This is an experimental feature and is likely to change in the future.
"""
default_blocks_name = "HeliosPyramidAutoBlocks"
class HeliosPyramidDistilledModularPipeline(HeliosModularPipeline):
"""
A ModularPipeline for Helios distilled pyramid video generation using DMD scheduler.
Uses guidance_scale=1.0 (no CFG) and supports is_amplify_first_chunk for the DMD scheduler.
> [!WARNING]
> This is an experimental feature and is likely to change in the future.
"""
default_blocks_name = "HeliosPyramidDistilledAutoBlocks"

View File

@@ -106,6 +106,16 @@ def _wan_i2v_map_fn(config_dict=None):
return "WanImage2VideoModularPipeline"
def _helios_pyramid_map_fn(config_dict=None):
if config_dict is None:
return "HeliosPyramidModularPipeline"
if config_dict.get("is_distilled", False):
return "HeliosPyramidDistilledModularPipeline"
else:
return "HeliosPyramidModularPipeline"
MODULAR_PIPELINE_MAPPING = OrderedDict(
[
("stable-diffusion-xl", _create_default_map_fn("StableDiffusionXLModularPipeline")),
@@ -120,6 +130,8 @@ MODULAR_PIPELINE_MAPPING = OrderedDict(
("qwenimage-edit-plus", _create_default_map_fn("QwenImageEditPlusModularPipeline")),
("qwenimage-layered", _create_default_map_fn("QwenImageLayeredModularPipeline")),
("z-image", _create_default_map_fn("ZImageModularPipeline")),
("helios", _create_default_map_fn("HeliosModularPipeline")),
("helios-pyramid", _helios_pyramid_map_fn),
]
)
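Illustrative calls against the new mapping (expected return values in comments); a `helios-pyramid` checkpoint resolves to the distilled pipeline class only when its config sets `is_distilled`.
_helios_pyramid_map_fn(None)                     # "HeliosPyramidModularPipeline"
_helios_pyramid_map_fn({"is_distilled": False})  # "HeliosPyramidModularPipeline"
_helios_pyramid_map_fn({"is_distilled": True})   # "HeliosPyramidDistilledModularPipeline"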

View File

@@ -309,16 +309,16 @@ class ComponentSpec:
f"`type_hint` is required when loading a single file model but is missing for component: {self.name}"
)
from diffusers import AutoModel
# `torch_dtype` is not an accepted parameter for tokenizers and processors.
# As a result, it gets stored in `init_kwargs`, which are written to the config
# during save. This causes JSON serialization to fail when saving the component.
if self.type_hint is not None and not issubclass(self.type_hint, torch.nn.Module):
if self.type_hint is not None and not issubclass(self.type_hint, (torch.nn.Module, AutoModel)):
kwargs.pop("torch_dtype", None)
if self.type_hint is None:
try:
from diffusers import AutoModel
component = AutoModel.from_pretrained(pretrained_model_name_or_path, **load_kwargs, **kwargs)
except Exception as e:
raise ValueError(f"Unable to load {self.name} without `type_hint`: {e}")
@@ -332,12 +332,6 @@ class ComponentSpec:
else getattr(self.type_hint, "from_pretrained")
)
# `torch_dtype` is not an accepted parameter for tokenizers and processors.
# As a result, it gets stored in `init_kwargs`, which are written to the config
# during save. This causes JSON serialization to fail when saving the component.
if not issubclass(self.type_hint, torch.nn.Module):
kwargs.pop("torch_dtype", None)
try:
component = load_method(pretrained_model_name_or_path, **load_kwargs, **kwargs)
except Exception as e:

View File

@@ -129,7 +129,7 @@ else:
]
_import_structure["bria"] = ["BriaPipeline"]
_import_structure["bria_fibo"] = ["BriaFiboPipeline", "BriaFiboEditPipeline"]
_import_structure["flux2"] = ["Flux2Pipeline", "Flux2KleinPipeline"]
_import_structure["flux2"] = ["Flux2Pipeline", "Flux2KleinPipeline", "Flux2KleinKVPipeline"]
_import_structure["flux"] = [
"FluxControlPipeline",
"FluxControlInpaintPipeline",
@@ -671,7 +671,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
FluxPriorReduxPipeline,
ReduxImageEncoder,
)
from .flux2 import Flux2KleinPipeline, Flux2Pipeline
from .flux2 import Flux2KleinKVPipeline, Flux2KleinPipeline, Flux2Pipeline
from .glm_image import GlmImagePipeline
from .helios import HeliosPipeline, HeliosPyramidPipeline
from .hidream_image import HiDreamImagePipeline

View File

@@ -95,6 +95,7 @@ from .pag import (
StableDiffusionXLPAGPipeline,
)
from .pixart_alpha import PixArtAlphaPipeline, PixArtSigmaPipeline
from .prx import PRXPipeline
from .qwenimage import (
QwenImageControlNetPipeline,
QwenImageEditInpaintPipeline,
@@ -185,6 +186,7 @@ AUTO_TEXT2IMAGE_PIPELINES_MAPPING = OrderedDict(
("z-image-controlnet-inpaint", ZImageControlNetInpaintPipeline),
("z-image-omni", ZImageOmniPipeline),
("ovis", OvisImagePipeline),
("prx", PRXPipeline),
]
)

View File

@@ -82,13 +82,16 @@ EXAMPLE_DOC_STRING = """
```python
>>> import cv2
>>> import numpy as np
>>> from PIL import Image
>>> import torch
>>> from diffusers import Cosmos2_5_TransferPipeline, AutoModel
>>> from diffusers.utils import export_to_video, load_video
>>> model_id = "nvidia/Cosmos-Transfer2.5-2B"
>>> # Load a Transfer2.5 controlnet variant (edge, depth, seg, or blur)
>>> controlnet = AutoModel.from_pretrained(model_id, revision="diffusers/controlnet/general/edge")
>>> controlnet = AutoModel.from_pretrained(
... model_id, revision="diffusers/controlnet/general/edge", torch_dtype=torch.bfloat16
... )
>>> pipe = Cosmos2_5_TransferPipeline.from_pretrained(
... model_id, controlnet=controlnet, revision="diffusers/general", torch_dtype=torch.bfloat16
... )

View File

@@ -24,6 +24,7 @@ except OptionalDependencyNotAvailable:
else:
_import_structure["pipeline_flux2"] = ["Flux2Pipeline"]
_import_structure["pipeline_flux2_klein"] = ["Flux2KleinPipeline"]
_import_structure["pipeline_flux2_klein_kv"] = ["Flux2KleinKVPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
@@ -33,6 +34,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
else:
from .pipeline_flux2 import Flux2Pipeline
from .pipeline_flux2_klein import Flux2KleinPipeline
from .pipeline_flux2_klein_kv import Flux2KleinKVPipeline
else:
import sys

View File

@@ -744,7 +744,7 @@ class Flux2Pipeline(DiffusionPipeline, Flux2LoraLoaderMixin):
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
image: list[PIL.Image.Image, PIL.Image.Image] | None = None,
image: PIL.Image.Image | list[PIL.Image.Image] | None = None,
prompt: str | list[str] = None,
height: int | None = None,
width: int | None = None,

View File

@@ -0,0 +1,886 @@
# Copyright 2025 Black Forest Labs and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
from typing import Any, Callable
import numpy as np
import PIL
import torch
from transformers import Qwen2TokenizerFast, Qwen3ForCausalLM
from ...loaders import Flux2LoraLoaderMixin
from ...models import AutoencoderKLFlux2, Flux2Transformer2DModel
from ...models.transformers.transformer_flux2 import Flux2KVAttnProcessor, Flux2KVParallelSelfAttnProcessor
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import is_torch_xla_available, logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from ..pipeline_utils import DiffusionPipeline
from .image_processor import Flux2ImageProcessor
from .pipeline_output import Flux2PipelineOutput
if is_torch_xla_available():
import torch_xla.core.xla_model as xm
XLA_AVAILABLE = True
else:
XLA_AVAILABLE = False
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from PIL import Image
>>> from diffusers import Flux2KleinKVPipeline
>>> pipe = Flux2KleinKVPipeline.from_pretrained(
... "black-forest-labs/FLUX.2-klein-9b-kv", torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> ref_image = Image.open("reference.png")
>>> image = pipe("A cat dressed like a wizard", image=ref_image, num_inference_steps=4).images[0]
>>> image.save("flux2_kv_output.png")
```
"""
# Copied from diffusers.pipelines.flux2.pipeline_flux2.compute_empirical_mu
def compute_empirical_mu(image_seq_len: int, num_steps: int) -> float:
a1, b1 = 8.73809524e-05, 1.89833333
a2, b2 = 0.00016927, 0.45666666
if image_seq_len > 4300:
mu = a2 * image_seq_len + b2
return float(mu)
m_200 = a2 * image_seq_len + b2
m_10 = a1 * image_seq_len + b1
a = (m_200 - m_10) / 190.0
b = m_200 - 200.0 * a
mu = a * num_steps + b
return float(mu)
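# Illustrative note: at num_steps=10 the result equals the (a1, b1) fit and at num_steps=200
# the (a2, b2) fit, with linear interpolation in between. For example, image_seq_len=4096 and
# num_steps=28 give m_10 ≈ 2.256, m_200 ≈ 1.150, mu ≈ 2.151; for image_seq_len > 4300 the
# (a2, b2) fit is used directly.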
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: int | None = None,
device: str | torch.device | None = None,
timesteps: list[int] | None = None,
sigmas: list[float] | None = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`list[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`list[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
encoder_output: torch.Tensor, generator: torch.Generator | None = None, sample_mode: str = "sample"
):
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
return encoder_output.latent_dist.sample(generator)
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
return encoder_output.latent_dist.mode()
elif hasattr(encoder_output, "latents"):
return encoder_output.latents
else:
raise AttributeError("Could not access latents of provided encoder_output")
class Flux2KleinKVPipeline(DiffusionPipeline, Flux2LoraLoaderMixin):
r"""
The Flux2 Klein KV pipeline for text-to-image generation with KV-cached reference image conditioning.
On the first denoising step, reference image tokens are included in the forward pass and their attention K/V
projections are cached. On subsequent steps, the cached K/V are reused without recomputing, providing faster
inference when using reference images.
Reference:
[https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence](https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence)
Args:
transformer ([`Flux2Transformer2DModel`]):
Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
scheduler ([`FlowMatchEulerDiscreteScheduler`]):
A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae ([`AutoencoderKLFlux2`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder ([`Qwen3ForCausalLM`]):
[Qwen3ForCausalLM](https://huggingface.co/docs/transformers/en/model_doc/qwen3#transformers.Qwen3ForCausalLM)
tokenizer (`Qwen2TokenizerFast`):
Tokenizer of class
[Qwen2TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/qwen2#transformers.Qwen2TokenizerFast).
"""
model_cpu_offload_seq = "text_encoder->transformer->vae"
_callback_tensor_inputs = ["latents", "prompt_embeds"]
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKLFlux2,
text_encoder: Qwen3ForCausalLM,
tokenizer: Qwen2TokenizerFast,
transformer: Flux2Transformer2DModel,
is_distilled: bool = True,
):
super().__init__()
self.register_modules(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
scheduler=scheduler,
transformer=transformer,
)
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
# Flux latents are turned into 2x2 patches and packed. This means the latent width and height have to be divisible
# by the patch size. So the vae scale factor is multiplied by the patch size to account for this
self.image_processor = Flux2ImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
self.tokenizer_max_length = 512
self.default_sample_size = 128
# Set KV-cache-aware attention processors
self._set_kv_attn_processors()
@staticmethod
def _get_qwen3_prompt_embeds(
text_encoder: Qwen3ForCausalLM,
tokenizer: Qwen2TokenizerFast,
prompt: str | list[str],
dtype: torch.dtype | None = None,
device: torch.device | None = None,
max_sequence_length: int = 512,
hidden_states_layers: list[int] = (9, 18, 27),
):
dtype = text_encoder.dtype if dtype is None else dtype
device = text_encoder.device if device is None else device
prompt = [prompt] if isinstance(prompt, str) else prompt
all_input_ids = []
all_attention_masks = []
for single_prompt in prompt:
messages = [{"role": "user", "content": single_prompt}]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False,
)
inputs = tokenizer(
text,
return_tensors="pt",
padding="max_length",
truncation=True,
max_length=max_sequence_length,
)
all_input_ids.append(inputs["input_ids"])
all_attention_masks.append(inputs["attention_mask"])
input_ids = torch.cat(all_input_ids, dim=0).to(device)
attention_mask = torch.cat(all_attention_masks, dim=0).to(device)
# Forward pass through the model
output = text_encoder(
input_ids=input_ids,
attention_mask=attention_mask,
output_hidden_states=True,
use_cache=False,
)
# Only use outputs from intermediate layers and stack them
out = torch.stack([output.hidden_states[k] for k in hidden_states_layers], dim=1)
out = out.to(dtype=dtype, device=device)
batch_size, num_channels, seq_len, hidden_dim = out.shape
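# (batch, num_layers, seq_len, hidden) -> (batch, seq_len, num_layers * hidden):
# the selected hidden-state layers are concatenated feature-wise for each token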
prompt_embeds = out.permute(0, 2, 1, 3).reshape(batch_size, seq_len, num_channels * hidden_dim)
return prompt_embeds
@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._prepare_text_ids
def _prepare_text_ids(
x: torch.Tensor, # (B, L, D) or (L, D)
t_coord: torch.Tensor | None = None,
):
B, L, _ = x.shape
out_ids = []
for i in range(B):
t = torch.arange(1) if t_coord is None else t_coord[i]
h = torch.arange(1)
w = torch.arange(1)
l = torch.arange(L)
coords = torch.cartesian_prod(t, h, w, l)
out_ids.append(coords)
return torch.stack(out_ids)
@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._prepare_latent_ids
def _prepare_latent_ids(
latents: torch.Tensor, # (B, C, H, W)
):
r"""
Generates 4D position coordinates (T, H, W, L) for latent tensors.
Args:
latents (torch.Tensor):
Latent tensor of shape (B, C, H, W)
Returns:
torch.Tensor:
Position IDs tensor of shape (B, H*W, 4). All batches share the same coordinate structure: T=0,
H=[0..H-1], W=[0..W-1], L=0.
"""
batch_size, _, height, width = latents.shape
t = torch.arange(1) # [0] - time dimension
h = torch.arange(height)
w = torch.arange(width)
l = torch.arange(1) # [0] - layer dimension
# Create position IDs: (H*W, 4)
latent_ids = torch.cartesian_prod(t, h, w, l)
# Expand to batch: (B, H*W, 4)
latent_ids = latent_ids.unsqueeze(0).expand(batch_size, -1, -1)
return latent_ids
@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._prepare_image_ids
def _prepare_image_ids(
image_latents: list[torch.Tensor], # [(1, C, H, W), (1, C, H, W), ...]
scale: int = 10,
):
r"""
Generates 4D time-space coordinates (T, H, W, L) for a sequence of image latents.
This function creates a unique coordinate for every pixel/patch across all input latents, which may have
different dimensions.
Args:
image_latents (list[torch.Tensor]):
A list of image latent feature tensors, typically of shape (C, H, W).
scale (int, optional):
A factor used to define the time separation (T-coordinate) between latents. T-coordinate for the i-th
latent is: 'scale + scale * i'. Defaults to 10.
Returns:
torch.Tensor:
The combined coordinate tensor of shape (1, N_total, 4), where N_total is the sum of (H * W) over all
input latents.
Coordinate Components (Dimension 4):
- T (Time): The unique index indicating which latent image the coordinate belongs to.
- H (Height): The row index within that latent image.
- W (Width): The column index within that latent image.
- L (Seq. Length): A sequence length dimension, which is always fixed at 0 (size 1)
"""
if not isinstance(image_latents, list):
raise ValueError(f"Expected `image_latents` to be a list, got {type(image_latents)}.")
# create time offset for each reference image
t_coords = [scale + scale * t for t in torch.arange(0, len(image_latents))]
t_coords = [t.view(-1) for t in t_coords]
image_latent_ids = []
for x, t in zip(image_latents, t_coords):
x = x.squeeze(0)
_, height, width = x.shape
x_ids = torch.cartesian_prod(t, torch.arange(height), torch.arange(width), torch.arange(1))
image_latent_ids.append(x_ids)
image_latent_ids = torch.cat(image_latent_ids, dim=0)
image_latent_ids = image_latent_ids.unsqueeze(0)
return image_latent_ids
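# Layout sketch (hypothetical inputs, illustrative only): with two reference latents of shape
# (1, C, 2, 2) and the default scale=10, the first latent gets T=10 and the second T=20, giving a
# (1, 8, 4) tensor whose rows run (10, 0, 0, 0), (10, 0, 1, 0), (10, 1, 0, 0), (10, 1, 1, 0),
# (20, 0, 0, 0), ..., (20, 1, 1, 0); tokens from different reference images therefore never share
# a T coordinate with each other or with the generated latents (which use T=0).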
@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._patchify_latents
def _patchify_latents(latents):
batch_size, num_channels_latents, height, width = latents.shape
latents = latents.view(batch_size, num_channels_latents, height // 2, 2, width // 2, 2)
latents = latents.permute(0, 1, 3, 5, 2, 4)
latents = latents.reshape(batch_size, num_channels_latents * 4, height // 2, width // 2)
return latents
@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._unpatchify_latents
def _unpatchify_latents(latents):
batch_size, num_channels_latents, height, width = latents.shape
latents = latents.reshape(batch_size, num_channels_latents // (2 * 2), 2, 2, height, width)
latents = latents.permute(0, 1, 4, 2, 5, 3)
latents = latents.reshape(batch_size, num_channels_latents // (2 * 2), height * 2, width * 2)
return latents
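# Shape sketch (illustrative values): _patchify_latents folds each 2x2 spatial patch into the channel
# dimension, e.g. (1, 32, 64, 64) -> (1, 128, 32, 32), and _unpatchify_latents is its exact inverse,
# mapping (1, 128, 32, 32) back to (1, 32, 64, 64).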
@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._pack_latents
def _pack_latents(latents):
"""
pack latents: (batch_size, num_channels, height, width) -> (batch_size, height * width, num_channels)
"""
batch_size, num_channels, height, width = latents.shape
latents = latents.reshape(batch_size, num_channels, height * width).permute(0, 2, 1)
return latents
@staticmethod
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._unpack_latents_with_ids
def _unpack_latents_with_ids(x: torch.Tensor, x_ids: torch.Tensor) -> list[torch.Tensor]:
"""
Scatter tokens back into their (H, W) spatial positions using their position ids.
"""
x_list = []
for data, pos in zip(x, x_ids):
_, ch = data.shape # noqa: F841
h_ids = pos[:, 1].to(torch.int64)
w_ids = pos[:, 2].to(torch.int64)
h = torch.max(h_ids) + 1
w = torch.max(w_ids) + 1
flat_ids = h_ids * w + w_ids
out = torch.zeros((h * w, ch), device=data.device, dtype=data.dtype)
out.scatter_(0, flat_ids.unsqueeze(1).expand(-1, ch), data)
# reshape from (H * W, C) to (H, W, C) and permute to (C, H, W)
out = out.view(h, w, ch).permute(2, 0, 1)
x_list.append(out)
return torch.stack(x_list, dim=0)
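# Round-trip sketch (illustrative, assuming dense ids from _prepare_latent_ids): _pack_latents turns
# (B, C, H, W) into (B, H*W, C) tokens in row-major order; since flat_ids = h * W + w reproduces that
# ordering, the scatter above rebuilds the dense (B, C, H, W) grid, so
# _unpack_latents_with_ids(_pack_latents(x), ids) recovers x.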
def _set_kv_attn_processors(self):
"""Replace default attention processors with KV-cache-aware variants."""
for block in self.transformer.transformer_blocks:
block.attn.set_processor(Flux2KVAttnProcessor())
for block in self.transformer.single_transformer_blocks:
block.attn.set_processor(Flux2KVParallelSelfAttnProcessor())
def encode_prompt(
self,
prompt: str | list[str],
device: torch.device | None = None,
num_images_per_prompt: int = 1,
prompt_embeds: torch.Tensor | None = None,
max_sequence_length: int = 512,
text_encoder_out_layers: tuple[int] = (9, 18, 27),
):
device = device or self._execution_device
if prompt is None:
prompt = ""
prompt = [prompt] if isinstance(prompt, str) else prompt
if prompt_embeds is None:
prompt_embeds = self._get_qwen3_prompt_embeds(
text_encoder=self.text_encoder,
tokenizer=self.tokenizer,
prompt=prompt,
device=device,
max_sequence_length=max_sequence_length,
hidden_states_layers=text_encoder_out_layers,
)
batch_size, seq_len, _ = prompt_embeds.shape
prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
text_ids = self._prepare_text_ids(prompt_embeds)
text_ids = text_ids.to(device)
return prompt_embeds, text_ids
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline._encode_vae_image
def _encode_vae_image(self, image: torch.Tensor, generator: torch.Generator):
if image.ndim != 4:
raise ValueError(f"Expected image dims 4, got {image.ndim}.")
image_latents = retrieve_latents(self.vae.encode(image), generator=generator, sample_mode="argmax")
image_latents = self._patchify_latents(image_latents)
latents_bn_mean = self.vae.bn.running_mean.view(1, -1, 1, 1).to(image_latents.device, image_latents.dtype)
latents_bn_std = torch.sqrt(self.vae.bn.running_var.view(1, -1, 1, 1) + self.vae.config.batch_norm_eps)
image_latents = (image_latents - latents_bn_mean) / latents_bn_std
return image_latents
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline.prepare_latents
def prepare_latents(
self,
batch_size,
num_latents_channels,
height,
width,
dtype,
device,
generator: torch.Generator,
latents: torch.Tensor | None = None,
):
# VAE applies 8x compression on images but we must also account for packing which requires
# latent height and width to be divisible by 2.
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
shape = (batch_size, num_latents_channels * 4, height // 2, width // 2)
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
latents = latents.to(device=device, dtype=dtype)
latent_ids = self._prepare_latent_ids(latents)
latent_ids = latent_ids.to(device)
latents = self._pack_latents(latents) # [B, C, H, W] -> [B, H*W, C]
return latents, latent_ids
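# Worked example of the dimension arithmetic above (hypothetical values): with vae_scale_factor=8 and
# a requested 1024x1024 image, height and width are snapped to 2 * (1024 // 16) = 128 latent pixels,
# the allocated shape is (batch_size, 4 * num_latents_channels, 64, 64), and packing yields
# 64 * 64 = 4096 image tokens per sample.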
# Copied from diffusers.pipelines.flux2.pipeline_flux2.Flux2Pipeline.prepare_image_latents
def prepare_image_latents(
self,
images: list[torch.Tensor],
batch_size,
generator: torch.Generator,
device,
dtype,
):
image_latents = []
for image in images:
image = image.to(device=device, dtype=dtype)
image_latent = self._encode_vae_image(image=image, generator=generator)
image_latents.append(image_latent) # (1, 128, 32, 32)
image_latent_ids = self._prepare_image_ids(image_latents)
# Pack each latent and concatenate
packed_latents = []
for latent in image_latents:
# latent: (1, 128, 32, 32)
packed = self._pack_latents(latent) # (1, 1024, 128)
packed = packed.squeeze(0) # (1024, 128) - remove batch dim
packed_latents.append(packed)
# Concatenate all reference tokens along sequence dimension
image_latents = torch.cat(packed_latents, dim=0) # (N*1024, 128)
image_latents = image_latents.unsqueeze(0) # (1, N*1024, 128)
image_latents = image_latents.repeat(batch_size, 1, 1)
image_latent_ids = image_latent_ids.repeat(batch_size, 1, 1)
image_latent_ids = image_latent_ids.to(device)
return image_latents, image_latent_ids
def check_inputs(
self,
prompt,
height,
width,
prompt_embeds=None,
callback_on_step_end_tensor_inputs=None,
):
if (
height is not None
and height % (self.vae_scale_factor * 2) != 0
or width is not None
and width % (self.vae_scale_factor * 2) != 0
):
logger.warning(
f"`height` and `width` have to be divisible by {self.vae_scale_factor * 2} but are {height} and {width}. Dimensions will be resized accordingly"
)
if callback_on_step_end_tensor_inputs is not None and not all(
k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
):
raise ValueError(
f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
)
if prompt is not None and prompt_embeds is not None:
raise ValueError(
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
" only forward one of the two."
)
elif prompt is None and prompt_embeds is None:
raise ValueError(
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
)
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
@property
def attention_kwargs(self):
return self._attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@property
def current_timestep(self):
return self._current_timestep
@property
def interrupt(self):
return self._interrupt
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
image: list[PIL.Image.Image] | PIL.Image.Image | None = None,
prompt: str | list[str] = None,
height: int | None = None,
width: int | None = None,
num_inference_steps: int = 4,
sigmas: list[float] | None = None,
num_images_per_prompt: int = 1,
generator: torch.Generator | list[torch.Generator] | None = None,
latents: torch.Tensor | None = None,
prompt_embeds: torch.Tensor | None = None,
output_type: str = "pil",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
callback_on_step_end: Callable[[int, int, dict], None] | None = None,
callback_on_step_end_tensor_inputs: list[str] = ["latents"],
max_sequence_length: int = 512,
text_encoder_out_layers: tuple[int] = (9, 18, 27),
):
r"""
Function invoked when calling the pipeline for generation.
Args:
image (`PIL.Image.Image` or `List[PIL.Image.Image]`, *optional*):
Reference image(s) for conditioning. On the first denoising step, reference tokens are included in the
forward pass and their attention K/V are cached. On subsequent steps, the cached K/V are reused without
recomputing.
prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation.
height (`int`, *optional*):
The height in pixels of the generated image.
width (`int`, *optional*):
The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 4):
The number of denoising steps.
sigmas (`List[float]`, *optional*):
Custom sigmas for the denoising schedule.
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
Generator(s) for deterministic generation.
latents (`torch.Tensor`, *optional*):
Pre-generated noisy latents.
prompt_embeds (`torch.Tensor`, *optional*):
Pre-generated text embeddings.
output_type (`str`, *optional*, defaults to `"pil"`):
Output format: `"pil"` or `"np"`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether to return a `Flux2PipelineOutput` or a plain tuple.
attention_kwargs (`dict`, *optional*):
Extra kwargs passed to attention processors.
callback_on_step_end (`Callable`, *optional*):
Callback function called at the end of each denoising step.
callback_on_step_end_tensor_inputs (`List`, *optional*):
Tensor inputs for the callback function.
max_sequence_length (`int`, defaults to 512):
Maximum sequence length for the prompt.
text_encoder_out_layers (`tuple[int]`):
Layer indices for text encoder hidden state extraction.
Examples:
Returns:
[`~pipelines.flux2.Flux2PipelineOutput`] or `tuple`.
"""
# 1. Check inputs
self.check_inputs(
prompt=prompt,
height=height,
width=width,
prompt_embeds=prompt_embeds,
callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
)
self._attention_kwargs = attention_kwargs
self._current_timestep = None
self._interrupt = False
# 2. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = prompt_embeds.shape[0]
device = self._execution_device
# 3. prepare text embeddings
prompt_embeds, text_ids = self.encode_prompt(
prompt=prompt,
prompt_embeds=prompt_embeds,
device=device,
num_images_per_prompt=num_images_per_prompt,
max_sequence_length=max_sequence_length,
text_encoder_out_layers=text_encoder_out_layers,
)
# 4. process images
if image is not None and not isinstance(image, list):
image = [image]
condition_images = None
if image is not None:
for img in image:
self.image_processor.check_image_input(img)
condition_images = []
for img in image:
image_width, image_height = img.size
if image_width * image_height > 1024 * 1024:
img = self.image_processor._resize_to_target_area(img, 1024 * 1024)
image_width, image_height = img.size
multiple_of = self.vae_scale_factor * 2
image_width = (image_width // multiple_of) * multiple_of
image_height = (image_height // multiple_of) * multiple_of
img = self.image_processor.preprocess(img, height=image_height, width=image_width, resize_mode="crop")
condition_images.append(img)
height = height or image_height
width = width or image_width
height = height or self.default_sample_size * self.vae_scale_factor
width = width or self.default_sample_size * self.vae_scale_factor
# 5. prepare latent variables
num_channels_latents = self.transformer.config.in_channels // 4
latents, latent_ids = self.prepare_latents(
batch_size=batch_size * num_images_per_prompt,
num_latents_channels=num_channels_latents,
height=height,
width=width,
dtype=prompt_embeds.dtype,
device=device,
generator=generator,
latents=latents,
)
image_latents = None
image_latent_ids = None
if condition_images is not None:
image_latents, image_latent_ids = self.prepare_image_latents(
images=condition_images,
batch_size=batch_size * num_images_per_prompt,
generator=generator,
device=device,
dtype=self.vae.dtype,
)
# 6. Prepare timesteps
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
if hasattr(self.scheduler.config, "use_flow_sigmas") and self.scheduler.config.use_flow_sigmas:
sigmas = None
image_seq_len = latents.shape[1]
mu = compute_empirical_mu(image_seq_len=image_seq_len, num_steps=num_inference_steps)
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
mu=mu,
)
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
# 7. Denoising loop with KV caching
# Step 0 with ref images: forward_kv_extract (full pass, cache ref K/V)
# Steps 1+: forward_kv_cached (reuse cached ref K/V)
# No ref images: standard forward
self.scheduler.set_begin_index(0)
kv_cache = None
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
if self.interrupt:
continue
self._current_timestep = t
timestep = t.expand(latents.shape[0]).to(latents.dtype)
if i == 0 and image_latents is not None:
# Step 0: include ref tokens, extract KV cache
latent_model_input = torch.cat([image_latents, latents], dim=1).to(self.transformer.dtype)
latent_image_ids = torch.cat([image_latent_ids, latent_ids], dim=1)
noise_pred, kv_cache = self.transformer(
hidden_states=latent_model_input,
timestep=timestep / 1000,
guidance=None,
encoder_hidden_states=prompt_embeds,
txt_ids=text_ids,
img_ids=latent_image_ids,
joint_attention_kwargs=self.attention_kwargs,
return_dict=False,
kv_cache_mode="extract",
num_ref_tokens=image_latents.shape[1],
)
elif kv_cache is not None:
# Steps 1+: use cached ref KV, no ref tokens in input
noise_pred = self.transformer(
hidden_states=latents.to(self.transformer.dtype),
timestep=timestep / 1000,
guidance=None,
encoder_hidden_states=prompt_embeds,
txt_ids=text_ids,
img_ids=latent_ids,
joint_attention_kwargs=self.attention_kwargs,
return_dict=False,
kv_cache=kv_cache,
kv_cache_mode="cached",
)[0]
else:
# No reference images: standard forward
noise_pred = self.transformer(
hidden_states=latents.to(self.transformer.dtype),
timestep=timestep / 1000,
guidance=None,
encoder_hidden_states=prompt_embeds,
txt_ids=text_ids,
img_ids=latent_ids,
joint_attention_kwargs=self.attention_kwargs,
return_dict=False,
)[0]
# compute the previous noisy sample x_t -> x_t-1
latents_dtype = latents.dtype
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
if latents.dtype != latents_dtype:
if torch.backends.mps.is_available():
latents = latents.to(latents_dtype)
if callback_on_step_end is not None:
callback_kwargs = {}
for k in callback_on_step_end_tensor_inputs:
callback_kwargs[k] = locals()[k]
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
latents = callback_outputs.pop("latents", latents)
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if XLA_AVAILABLE:
xm.mark_step()
# Clean up KV cache
if kv_cache is not None:
kv_cache.clear()
self._current_timestep = None
latents = self._unpack_latents_with_ids(latents, latent_ids)
latents_bn_mean = self.vae.bn.running_mean.view(1, -1, 1, 1).to(latents.device, latents.dtype)
latents_bn_std = torch.sqrt(self.vae.bn.running_var.view(1, -1, 1, 1) + self.vae.config.batch_norm_eps).to(
latents.device, latents.dtype
)
latents = latents * latents_bn_std + latents_bn_mean
latents = self._unpatchify_latents(latents)
if output_type == "latent":
image = latents
else:
image = self.vae.decode(latents, return_dict=False)[0]
image = self.image_processor.postprocess(image, output_type=output_type)
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (image,)
return Flux2PipelineOutput(images=image)

View File

@@ -456,6 +456,8 @@ class HeliosPyramidPipeline(DiffusionPipeline, HeliosLoraLoaderMixin):
# the output will be non-deterministic and may produce incorrect results in CP context.
if generator is None:
generator = torch.Generator(device=device)
elif isinstance(generator, list):
generator = generator[0]
gamma = self.scheduler.config.gamma
_, ph, pw = patch_size
@@ -470,7 +472,8 @@ class HeliosPyramidPipeline(DiffusionPipeline, HeliosLoraLoaderMixin):
L = torch.linalg.cholesky(cov)
block_number = batch_size * channel * num_frames * (height // ph) * (width // pw)
z = torch.randn(block_number, block_size, device=device, generator=generator)
z = torch.randn(block_number, block_size, generator=generator, device=generator.device)
z = z.to(device=device)
noise = z @ L.T
noise = noise.view(batch_size, channel, num_frames, height // ph, width // pw, ph, pw)

View File

@@ -720,6 +720,7 @@ class LDMBertModel(LDMBertPreTrainedModel):
super().__init__(config)
self.model = LDMBertEncoder(config)
self.to_logits = nn.Linear(config.hidden_size, config.vocab_size)
self.post_init()
def forward(
self,

View File

@@ -28,7 +28,7 @@ else:
_import_structure["pipeline_ltx2_condition"] = ["LTX2ConditionPipeline"]
_import_structure["pipeline_ltx2_image2video"] = ["LTX2ImageToVideoPipeline"]
_import_structure["pipeline_ltx2_latent_upsample"] = ["LTX2LatentUpsamplePipeline"]
_import_structure["vocoder"] = ["LTX2Vocoder"]
_import_structure["vocoder"] = ["LTX2Vocoder", "LTX2VocoderWithBWE"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
@@ -44,7 +44,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .pipeline_ltx2_condition import LTX2ConditionPipeline
from .pipeline_ltx2_image2video import LTX2ImageToVideoPipeline
from .pipeline_ltx2_latent_upsample import LTX2LatentUpsamplePipeline
from .vocoder import LTX2Vocoder
from .vocoder import LTX2Vocoder, LTX2VocoderWithBWE
else:
import sys

View File

@@ -1,3 +1,5 @@
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
@@ -9,6 +11,79 @@ from ...models.modeling_utils import ModelMixin
from ...models.transformers.transformer_ltx2 import LTX2Attention, LTX2AudioVideoAttnProcessor
def per_layer_masked_mean_norm(
text_hidden_states: torch.Tensor,
sequence_lengths: torch.Tensor,
device: str | torch.device,
padding_side: str = "left",
scale_factor: int = 8,
eps: float = 1e-6,
):
"""
Performs per-batch per-layer normalization using a masked mean and range on per-layer text encoder hidden_states.
Respects the padding of the hidden states.
Args:
text_hidden_states (`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim, num_layers)`):
Per-layer hidden_states from a text encoder (e.g. `Gemma3ForConditionalGeneration`).
sequence_lengths (`torch.Tensor` of shape `(batch_size,)`):
The number of valid (non-padded) tokens for each batch instance.
device (`str` or `torch.device`, *optional*):
The torch device to place the resulting embeddings on.
padding_side (`str`, *optional*, defaults to `"left"`):
Whether the text tokenizer performs padding on the `"left"` or `"right"`.
scale_factor (`int`, *optional*, defaults to `8`):
Scaling factor to multiply the normalized hidden states by.
eps (`float`, *optional*, defaults to `1e-6`):
A small positive value for numerical stability when performing normalization.
Returns:
`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim * num_layers)`:
Normed and flattened text encoder hidden states.
"""
batch_size, seq_len, hidden_dim, num_layers = text_hidden_states.shape
original_dtype = text_hidden_states.dtype
# Create padding mask
token_indices = torch.arange(seq_len, device=device).unsqueeze(0)
if padding_side == "right":
# For right padding, valid tokens are from 0 to sequence_length-1
mask = token_indices < sequence_lengths[:, None] # [batch_size, seq_len]
elif padding_side == "left":
# For left padding, valid tokens are from (T - sequence_length) to T-1
start_indices = seq_len - sequence_lengths[:, None] # [batch_size, 1]
mask = token_indices >= start_indices # [B, T]
else:
raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side}")
mask = mask[:, :, None, None] # [batch_size, seq_len] --> [batch_size, seq_len, 1, 1]
# Compute masked mean over non-padding positions; result has shape (batch_size, 1, 1, num_layers)
masked_text_hidden_states = text_hidden_states.masked_fill(~mask, 0.0)
num_valid_positions = (sequence_lengths * hidden_dim).view(batch_size, 1, 1, 1)
masked_mean = masked_text_hidden_states.sum(dim=(1, 2), keepdim=True) / (num_valid_positions + eps)
# Compute min/max over non-padding positions; results have shape (batch_size, 1, 1, num_layers)
x_min = text_hidden_states.masked_fill(~mask, float("inf")).amin(dim=(1, 2), keepdim=True)
x_max = text_hidden_states.masked_fill(~mask, float("-inf")).amax(dim=(1, 2), keepdim=True)
# Normalization
normalized_hidden_states = (text_hidden_states - masked_mean) / (x_max - x_min + eps)
normalized_hidden_states = normalized_hidden_states * scale_factor
# Pack the hidden states to a 3D tensor (batch_size, seq_len, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.flatten(2)
mask_flat = mask.squeeze(-1).expand(-1, -1, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.masked_fill(~mask_flat, 0.0)
normalized_hidden_states = normalized_hidden_states.to(dtype=original_dtype)
return normalized_hidden_states
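# Numeric sketch (illustrative only): for a batch entry with seq_len=3 where only the last two tokens
# are valid (left padding), the masked mean and min/max above are taken over the 2 * hidden_dim valid
# activations of each layer, and the padded first token is zeroed in the output, so the connectors
# never see statistics contaminated by padding.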
def per_token_rms_norm(text_encoder_hidden_states: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
variance = torch.mean(text_encoder_hidden_states**2, dim=2, keepdim=True)
norm_text_encoder_hidden_states = text_encoder_hidden_states * torch.rsqrt(variance + eps)
return norm_text_encoder_hidden_states
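# Behaviour sketch (illustrative): each feature vector along dim=2 is divided by its own RMS, so a
# hidden state filled with 2.0 maps to 1.0 (up to eps); unlike an RMSNorm layer there is no learned
# weight, making this a pure rescaling of the text encoder features.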
class LTX2RotaryPosEmbed1d(nn.Module):
"""
1D rotary positional embeddings (RoPE) for the LTX 2.0 text encoder connectors.
@@ -106,6 +181,7 @@ class LTX2TransformerBlock1d(nn.Module):
activation_fn: str = "gelu-approximate",
eps: float = 1e-6,
rope_type: str = "interleaved",
apply_gated_attention: bool = False,
):
super().__init__()
@@ -115,8 +191,9 @@ class LTX2TransformerBlock1d(nn.Module):
heads=num_attention_heads,
kv_heads=num_attention_heads,
dim_head=attention_head_dim,
processor=LTX2AudioVideoAttnProcessor(),
rope_type=rope_type,
apply_gated_attention=apply_gated_attention,
processor=LTX2AudioVideoAttnProcessor(),
)
self.norm2 = torch.nn.RMSNorm(dim, eps=eps, elementwise_affine=False)
@@ -160,6 +237,7 @@ class LTX2ConnectorTransformer1d(nn.Module):
eps: float = 1e-6,
causal_temporal_positioning: bool = False,
rope_type: str = "interleaved",
gated_attention: bool = False,
):
super().__init__()
self.num_attention_heads = num_attention_heads
@@ -188,6 +266,7 @@ class LTX2ConnectorTransformer1d(nn.Module):
num_attention_heads=num_attention_heads,
attention_head_dim=attention_head_dim,
rope_type=rope_type,
apply_gated_attention=gated_attention,
)
for _ in range(num_layers)
]
@@ -260,24 +339,36 @@ class LTX2TextConnectors(ModelMixin, PeftAdapterMixin, ConfigMixin):
@register_to_config
def __init__(
self,
caption_channels: int,
text_proj_in_factor: int,
video_connector_num_attention_heads: int,
video_connector_attention_head_dim: int,
video_connector_num_layers: int,
video_connector_num_learnable_registers: int | None,
audio_connector_num_attention_heads: int,
audio_connector_attention_head_dim: int,
audio_connector_num_layers: int,
audio_connector_num_learnable_registers: int | None,
connector_rope_base_seq_len: int,
rope_theta: float,
rope_double_precision: bool,
causal_temporal_positioning: bool,
caption_channels: int = 3840, # default Gemma-3-12B text encoder hidden_size
text_proj_in_factor: int = 49, # num_layers + 1 for embedding layer = 48 + 1 for Gemma-3-12B
video_connector_num_attention_heads: int = 30,
video_connector_attention_head_dim: int = 128,
video_connector_num_layers: int = 2,
video_connector_num_learnable_registers: int | None = 128,
video_gated_attn: bool = False,
audio_connector_num_attention_heads: int = 30,
audio_connector_attention_head_dim: int = 128,
audio_connector_num_layers: int = 2,
audio_connector_num_learnable_registers: int | None = 128,
audio_gated_attn: bool = False,
connector_rope_base_seq_len: int = 4096,
rope_theta: float = 10000.0,
rope_double_precision: bool = True,
causal_temporal_positioning: bool = False,
rope_type: str = "interleaved",
per_modality_projections: bool = False,
video_hidden_dim: int = 4096,
audio_hidden_dim: int = 2048,
proj_bias: bool = False,
):
super().__init__()
self.text_proj_in = nn.Linear(caption_channels * text_proj_in_factor, caption_channels, bias=False)
text_encoder_dim = caption_channels * text_proj_in_factor
if per_modality_projections:
self.video_text_proj_in = nn.Linear(text_encoder_dim, video_hidden_dim, bias=proj_bias)
self.audio_text_proj_in = nn.Linear(text_encoder_dim, audio_hidden_dim, bias=proj_bias)
else:
self.text_proj_in = nn.Linear(text_encoder_dim, caption_channels, bias=proj_bias)
self.video_connector = LTX2ConnectorTransformer1d(
num_attention_heads=video_connector_num_attention_heads,
attention_head_dim=video_connector_attention_head_dim,
@@ -288,6 +379,7 @@ class LTX2TextConnectors(ModelMixin, PeftAdapterMixin, ConfigMixin):
rope_double_precision=rope_double_precision,
causal_temporal_positioning=causal_temporal_positioning,
rope_type=rope_type,
gated_attention=video_gated_attn,
)
self.audio_connector = LTX2ConnectorTransformer1d(
num_attention_heads=audio_connector_num_attention_heads,
@@ -299,26 +391,86 @@ class LTX2TextConnectors(ModelMixin, PeftAdapterMixin, ConfigMixin):
rope_double_precision=rope_double_precision,
causal_temporal_positioning=causal_temporal_positioning,
rope_type=rope_type,
gated_attention=audio_gated_attn,
)
def forward(
self, text_encoder_hidden_states: torch.Tensor, attention_mask: torch.Tensor, additive_mask: bool = False
):
# Convert to additive attention mask, if necessary
if not additive_mask:
text_dtype = text_encoder_hidden_states.dtype
attention_mask = (attention_mask - 1).reshape(attention_mask.shape[0], 1, -1, attention_mask.shape[-1])
attention_mask = attention_mask.to(text_dtype) * torch.finfo(text_dtype).max
self,
text_encoder_hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
padding_side: str = "left",
scale_factor: int = 8,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Given per-layer text encoder hidden_states, extracts features and runs per-modality connectors to get text
embeddings for the LTX-2.X DiT models.
text_encoder_hidden_states = self.text_proj_in(text_encoder_hidden_states)
Args:
text_encoder_hidden_states (`torch.Tensor`):
Per-layer text encoder hidden_states. Can either be 4D with shape `(batch_size, seq_len,
caption_channels, text_proj_in_factor)` or 3D with the last two dimensions flattened.
attention_mask (`torch.Tensor` of shape `(batch_size, seq_len)`):
Multiplicative binary attention mask where 1s indicate unmasked positions and 0s indicate masked
positions.
padding_side (`str`, *optional*, defaults to `"left"`):
The padding side used by the text encoder's tokenizer (either `"left"` or `"right"`). Defaults to
`"left"` as this is what the default Gemma3-12B text encoder uses. Only used if
`per_modality_projections` is `False` (LTX-2.0 models).
scale_factor (`int`, *optional*, defaults to `8`):
Scale factor for masked mean/range normalization. Only used if `per_modality_projections` is `False`
(LTX-2.0 models).
"""
if text_encoder_hidden_states.ndim == 3:
# Ensure shape is [batch_size, seq_len, caption_channels, text_proj_in_factor]
text_encoder_hidden_states = text_encoder_hidden_states.unflatten(2, (self.config.caption_channels, -1))
video_text_embedding, new_attn_mask = self.video_connector(text_encoder_hidden_states, attention_mask)
if self.config.per_modality_projections:
# LTX-2.3
norm_text_encoder_hidden_states = per_token_rms_norm(text_encoder_hidden_states)
attn_mask = (new_attn_mask < 1e-6).to(torch.int64)
attn_mask = attn_mask.reshape(video_text_embedding.shape[0], video_text_embedding.shape[1], 1)
video_text_embedding = video_text_embedding * attn_mask
new_attn_mask = attn_mask.squeeze(-1)
norm_text_encoder_hidden_states = norm_text_encoder_hidden_states.flatten(2, 3)
bool_mask = attention_mask.bool().unsqueeze(-1)
norm_text_encoder_hidden_states = torch.where(
bool_mask, norm_text_encoder_hidden_states, torch.zeros_like(norm_text_encoder_hidden_states)
)
audio_text_embedding, _ = self.audio_connector(text_encoder_hidden_states, attention_mask)
# Rescale norms with respect to video and audio dims for feature extractors
video_scale_factor = math.sqrt(self.config.video_hidden_dim / self.config.caption_channels)
video_norm_text_emb = norm_text_encoder_hidden_states * video_scale_factor
audio_scale_factor = math.sqrt(self.config.audio_hidden_dim / self.config.caption_channels)
audio_norm_text_emb = norm_text_encoder_hidden_states * audio_scale_factor
return video_text_embedding, audio_text_embedding, new_attn_mask
# Per-Modality Feature extractors
video_text_emb_proj = self.video_text_proj_in(video_norm_text_emb)
audio_text_emb_proj = self.audio_text_proj_in(audio_norm_text_emb)
else:
# LTX-2.0
sequence_lengths = attention_mask.sum(dim=-1)
norm_text_encoder_hidden_states = per_layer_masked_mean_norm(
text_hidden_states=text_encoder_hidden_states,
sequence_lengths=sequence_lengths,
device=text_encoder_hidden_states.device,
padding_side=padding_side,
scale_factor=scale_factor,
)
text_emb_proj = self.text_proj_in(norm_text_encoder_hidden_states)
video_text_emb_proj = text_emb_proj
audio_text_emb_proj = text_emb_proj
# Convert to additive attention mask for connectors
text_dtype = video_text_emb_proj.dtype
attention_mask = (attention_mask.to(torch.int64) - 1).to(text_dtype)
attention_mask = attention_mask.reshape(attention_mask.shape[0], 1, -1, attention_mask.shape[-1])
add_attn_mask = attention_mask * torch.finfo(text_dtype).max
video_text_embedding, video_attn_mask = self.video_connector(video_text_emb_proj, add_attn_mask)
# Convert video attn mask to binary (multiplicative) mask and mask video text embedding
binary_attn_mask = (video_attn_mask < 1e-6).to(torch.int64)
binary_attn_mask = binary_attn_mask.reshape(video_text_embedding.shape[0], video_text_embedding.shape[1], 1)
video_text_embedding = video_text_embedding * binary_attn_mask
audio_text_embedding, _ = self.audio_connector(audio_text_emb_proj, add_attn_mask)
return video_text_embedding, audio_text_embedding, binary_attn_mask.squeeze(-1)

View File

@@ -195,7 +195,8 @@ class LTX2LatentUpsamplerModel(ModelMixin, ConfigMixin):
dims: int = 3,
spatial_upsample: bool = True,
temporal_upsample: bool = False,
rational_spatial_scale: float | None = 2.0,
rational_spatial_scale: float = 2.0,
use_rational_resampler: bool = True,
):
super().__init__()
@@ -220,7 +221,7 @@ class LTX2LatentUpsamplerModel(ModelMixin, ConfigMixin):
PixelShuffleND(3),
)
elif spatial_upsample:
if rational_spatial_scale is not None:
if use_rational_resampler:
self.upsampler = SpatialRationalResampler(mid_channels=mid_channels, scale=rational_spatial_scale)
else:
self.upsampler = torch.nn.Sequential(

View File

@@ -18,7 +18,7 @@ from typing import Any, Callable
import numpy as np
import torch
from transformers import Gemma3ForConditionalGeneration, GemmaTokenizer, GemmaTokenizerFast
from transformers import Gemma3ForConditionalGeneration, Gemma3Processor, GemmaTokenizer, GemmaTokenizerFast
from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...loaders import FromSingleFileMixin, LTX2LoraLoaderMixin
@@ -31,7 +31,7 @@ from ...video_processor import VideoProcessor
from ..pipeline_utils import DiffusionPipeline
from .connectors import LTX2TextConnectors
from .pipeline_output import LTX2PipelineOutput
from .vocoder import LTX2Vocoder
from .vocoder import LTX2Vocoder, LTX2VocoderWithBWE
if is_torch_xla_available():
@@ -209,7 +209,7 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
"""
model_cpu_offload_seq = "text_encoder->connectors->transformer->vae->audio_vae->vocoder"
_optional_components = []
_optional_components = ["processor"]
_callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]
def __init__(
@@ -221,7 +221,8 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
tokenizer: GemmaTokenizer | GemmaTokenizerFast,
connectors: LTX2TextConnectors,
transformer: LTX2VideoTransformer3DModel,
vocoder: LTX2Vocoder,
vocoder: LTX2Vocoder | LTX2VocoderWithBWE,
processor: Gemma3Processor | None = None,
):
super().__init__()
@@ -234,6 +235,7 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
transformer=transformer,
vocoder=vocoder,
scheduler=scheduler,
processor=processor,
)
self.vae_spatial_compression_ratio = (
@@ -268,73 +270,6 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
self.tokenizer.model_max_length if getattr(self, "tokenizer", None) is not None else 1024
)
@staticmethod
def _pack_text_embeds(
text_hidden_states: torch.Tensor,
sequence_lengths: torch.Tensor,
device: str | torch.device,
padding_side: str = "left",
scale_factor: int = 8,
eps: float = 1e-6,
) -> torch.Tensor:
"""
Packs and normalizes text encoder hidden states, respecting padding. Normalization is performed per-batch and
per-layer in a masked fashion (only over non-padded positions).
Args:
text_hidden_states (`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim, num_layers)`):
Per-layer hidden_states from a text encoder (e.g. `Gemma3ForConditionalGeneration`).
sequence_lengths (`torch.Tensor of shape `(batch_size,)`):
The number of valid (non-padded) tokens for each batch instance.
device: (`str` or `torch.device`, *optional*):
torch device to place the resulting embeddings on
padding_side: (`str`, *optional*, defaults to `"left"`):
Whether the text tokenizer performs padding on the `"left"` or `"right"`.
scale_factor (`int`, *optional*, defaults to `8`):
Scaling factor to multiply the normalized hidden states by.
eps (`float`, *optional*, defaults to `1e-6`):
A small positive value for numerical stability when performing normalization.
Returns:
`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim * num_layers)`:
Normed and flattened text encoder hidden states.
"""
batch_size, seq_len, hidden_dim, num_layers = text_hidden_states.shape
original_dtype = text_hidden_states.dtype
# Create padding mask
token_indices = torch.arange(seq_len, device=device).unsqueeze(0)
if padding_side == "right":
# For right padding, valid tokens are from 0 to sequence_length-1
mask = token_indices < sequence_lengths[:, None] # [batch_size, seq_len]
elif padding_side == "left":
# For left padding, valid tokens are from (T - sequence_length) to T-1
start_indices = seq_len - sequence_lengths[:, None] # [batch_size, 1]
mask = token_indices >= start_indices # [B, T]
else:
raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side}")
mask = mask[:, :, None, None] # [batch_size, seq_len] --> [batch_size, seq_len, 1, 1]
# Compute masked mean over non-padding positions of shape (batch_size, 1, 1, seq_len)
masked_text_hidden_states = text_hidden_states.masked_fill(~mask, 0.0)
num_valid_positions = (sequence_lengths * hidden_dim).view(batch_size, 1, 1, 1)
masked_mean = masked_text_hidden_states.sum(dim=(1, 2), keepdim=True) / (num_valid_positions + eps)
# Compute min/max over non-padding positions of shape (batch_size, 1, 1 seq_len)
x_min = text_hidden_states.masked_fill(~mask, float("inf")).amin(dim=(1, 2), keepdim=True)
x_max = text_hidden_states.masked_fill(~mask, float("-inf")).amax(dim=(1, 2), keepdim=True)
# Normalization
normalized_hidden_states = (text_hidden_states - masked_mean) / (x_max - x_min + eps)
normalized_hidden_states = normalized_hidden_states * scale_factor
# Pack the hidden states to a 3D tensor (batch_size, seq_len, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.flatten(2)
mask_flat = mask.squeeze(-1).expand(-1, -1, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.masked_fill(~mask_flat, 0.0)
normalized_hidden_states = normalized_hidden_states.to(dtype=original_dtype)
return normalized_hidden_states
def _get_gemma_prompt_embeds(
self,
prompt: str | list[str],
@@ -387,16 +322,7 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
)
text_encoder_hidden_states = text_encoder_outputs.hidden_states
text_encoder_hidden_states = torch.stack(text_encoder_hidden_states, dim=-1)
sequence_lengths = prompt_attention_mask.sum(dim=-1)
prompt_embeds = self._pack_text_embeds(
text_encoder_hidden_states,
sequence_lengths,
device=device,
padding_side=self.tokenizer.padding_side,
scale_factor=scale_factor,
)
prompt_embeds = prompt_embeds.to(dtype=dtype)
prompt_embeds = text_encoder_hidden_states.flatten(2, 3).to(dtype=dtype) # Pack to 3D
# duplicate text embeddings for each generation per prompt, using mps friendly method
_, seq_len, _ = prompt_embeds.shape
@@ -494,6 +420,50 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
return prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask
@torch.no_grad()
def enhance_prompt(
self,
prompt: str,
system_prompt: str,
max_new_tokens: int = 512,
seed: int = 10,
generator: torch.Generator | None = None,
generation_kwargs: dict[str, Any] | None = None,
device: str | torch.device | None = None,
):
"""
Enhances the supplied `prompt` by using the current text encoder (by default a
`transformers.Gemma3ForConditionalGeneration` model) to generate a new prompt from it and a system prompt.
"""
device = device or self._execution_device
if generation_kwargs is None:
# Set to default generation kwargs
generation_kwargs = {"do_sample": True, "temperature": 0.7}
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"user prompt: {prompt}"},
]
template = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = self.processor(text=template, images=None, return_tensors="pt").to(device)
self.text_encoder.to(device)
# `transformers.GenerationMixin.generate` does not support using a `torch.Generator` to control randomness,
# so manually apply a seed for reproducible generation.
if generator is not None:
# Overwrite seed to generator's initial seed
seed = generator.initial_seed()
torch.manual_seed(seed)
generated_sequences = self.text_encoder.generate(
**model_inputs,
max_new_tokens=max_new_tokens,
**generation_kwargs,
) # tensor of shape [batch_size, seq_len]
generated_ids = [seq[len(model_inputs.input_ids[i]) :] for i, seq in enumerate(generated_sequences)]
enhanced_prompt = self.processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
return enhanced_prompt
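# Usage sketch (hypothetical call; `pipe` and the system prompt below are illustrative):
#   enhanced = pipe.enhance_prompt(
#       prompt="a cat playing piano",
#       system_prompt="Rewrite the user prompt as a detailed cinematic video description.",
#       max_new_tokens=256,
#   )
# `enhanced` is a list with one rewritten prompt per decoded sequence; __call__ uses it in place of
# `prompt` when a `system_prompt` is supplied.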
def check_inputs(
self,
prompt,
@@ -504,6 +474,9 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
spatio_temporal_guidance_blocks=None,
stg_scale=None,
audio_stg_scale=None,
):
if height % 32 != 0 or width % 32 != 0:
raise ValueError(f"`height` and `width` have to be divisible by 32 but are {height} and {width}.")
@@ -547,6 +520,12 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
f" {negative_prompt_attention_mask.shape}."
)
if ((stg_scale > 0.0) or (audio_stg_scale > 0.0)) and not spatio_temporal_guidance_blocks:
raise ValueError(
"Spatio-Temporal Guidance (STG) is specified but no STG blocks are supplied. Please supply a list of"
"block indices at which to apply STG in `spatio_temporal_guidance_blocks`"
)
@staticmethod
def _pack_latents(latents: torch.Tensor, patch_size: int = 1, patch_size_t: int = 1) -> torch.Tensor:
# Unpacked latents of shape are [B, C, F, H, W] are patched into tokens of shape [B, C, F // p_t, p_t, H // p, p, W // p, p].
@@ -734,7 +713,6 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
latents = self._create_noised_state(latents, noise_scale, generator)
return latents.to(device=device, dtype=dtype)
# TODO: confirm whether this logic is correct
latent_mel_bins = num_mel_bins // self.audio_vae_mel_compression_ratio
shape = (batch_size, num_channels_latents, audio_latent_length, latent_mel_bins)
@@ -749,6 +727,24 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
latents = self._pack_audio_latents(latents)
return latents
def convert_velocity_to_x0(
self, sample: torch.Tensor, denoised_output: torch.Tensor, step_idx: int, scheduler: Any | None = None
) -> torch.Tensor:
if scheduler is None:
scheduler = self.scheduler
sample_x0 = sample - denoised_output * scheduler.sigmas[step_idx]
return sample_x0
def convert_x0_to_velocity(
self, sample: torch.Tensor, denoised_output: torch.Tensor, step_idx: int, scheduler: Any | None = None
) -> torch.Tensor:
if scheduler is None:
scheduler = self.scheduler
sample_v = (sample - denoised_output) / scheduler.sigmas[step_idx]
return sample_v
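# Relationship sketch (flow-matching convention used above, illustrative numbers): with
# x_t = x_0 + sigma_t * v the two helpers are exact inverses, e.g. for sigma_t = 0.8, x_0 = 1.0,
# v = 0.5 we get x_t = 1.4, convert_velocity_to_x0 -> 1.4 - 0.5 * 0.8 = 1.0 and
# convert_x0_to_velocity -> (1.4 - 1.0) / 0.8 = 0.5.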
@property
def guidance_scale(self):
return self._guidance_scale
@@ -757,9 +753,41 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
def guidance_rescale(self):
return self._guidance_rescale
@property
def stg_scale(self):
return self._stg_scale
@property
def modality_scale(self):
return self._modality_scale
@property
def audio_guidance_scale(self):
return self._audio_guidance_scale
@property
def audio_guidance_rescale(self):
return self._audio_guidance_rescale
@property
def audio_stg_scale(self):
return self._audio_stg_scale
@property
def audio_modality_scale(self):
return self._audio_modality_scale
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1.0
return (self._guidance_scale > 1.0) or (self._audio_guidance_scale > 1.0)
@property
def do_spatio_temporal_guidance(self):
return (self._stg_scale > 0.0) or (self._audio_stg_scale > 0.0)
@property
def do_modality_isolation_guidance(self):
return (self._modality_scale > 1.0) or (self._audio_modality_scale > 1.0)
@property
def num_timesteps(self):
@@ -791,7 +819,14 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
sigmas: list[float] | None = None,
timesteps: list[int] = None,
guidance_scale: float = 4.0,
stg_scale: float = 0.0,
modality_scale: float = 1.0,
guidance_rescale: float = 0.0,
audio_guidance_scale: float | None = None,
audio_stg_scale: float | None = None,
audio_modality_scale: float | None = None,
audio_guidance_rescale: float | None = None,
spatio_temporal_guidance_blocks: list[int] | None = None,
noise_scale: float = 0.0,
num_videos_per_prompt: int = 1,
generator: torch.Generator | list[torch.Generator] | None = None,
@@ -803,6 +838,11 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
negative_prompt_attention_mask: torch.Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
use_cross_timestep: bool = False,
system_prompt: str | None = None,
prompt_max_new_tokens: int = 512,
prompt_enhancement_kwargs: dict[str, Any] | None = None,
prompt_enhancement_seed: int = 10,
output_type: str = "pil",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
@@ -841,13 +881,47 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
`guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
the text `prompt`, usually at the expense of lower image quality.
the text `prompt`, usually at the expense of lower image quality. Used for the video modality (there is
a separate value `audio_guidance_scale` for the audio modality).
stg_scale (`float`, *optional*, defaults to `0.0`):
Video guidance scale for Spatio-Temporal Guidance (STG), proposed in [Spatiotemporal Skip Guidance for
Enhanced Video Diffusion Sampling](https://arxiv.org/abs/2411.18664). STG uses a CFG-like estimate
where the sample is moved away from a weaker sample produced by a perturbed version of the denoising model.
Enabling STG will result in an additional denoising model forward pass; the default value of `0.0`
means that STG is disabled.
modality_scale (`float`, *optional*, defaults to `1.0`):
Video guidance scale for LTX-2.X modality isolation guidance, where a CFG-like estimate moves the sample
away from a weaker sample generated by the denoising model with cross-modality (audio-to-video and
video-to-audio) cross attention disabled. Enabling modality guidance will result in an
additional denoising model forward pass; the default value of `1.0` means that modality guidance is
disabled.
guidance_rescale (`float`, *optional*, defaults to 0.0):
Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
Flawed](https://huggingface.co/papers/2305.08891) `guidance_scale` is defined as `φ` in equation 16. of
[Common Diffusion Noise Schedules and Sample Steps are
Flawed](https://huggingface.co/papers/2305.08891). Guidance rescale factor should fix overexposure when
using zero terminal SNR.
using zero terminal SNR. Used for the video modality.
audio_guidance_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for CFG with respect to the negative prompt. The CFG update rule is the same for
video and audio, but they can use different values for the guidance scale. The LTX-2.X authors suggest
that the `audio_guidance_scale` should be higher relative to the video `guidance_scale` (e.g. for
LTX-2.3 they suggest 3.0 for video and 7.0 for audio). If `None`, defaults to the video value
`guidance_scale`.
audio_stg_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for STG. As with CFG, the STG update rule is otherwise the same for video and
audio. For LTX-2.3, a value of 1.0 is suggested for both video and audio. If `None`, defaults to the
video value `stg_scale`.
audio_modality_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for LTX-2.X modality isolation guidance. As with CFG, the modality guidance rule
is otherwise the same for video and audio. For LTX-2.3, a value of 3.0 is suggested for both video and
audio. If `None`, defaults to the video value `modality_scale`.
audio_guidance_rescale (`float`, *optional*, defaults to `None`):
A separate guidance rescale factor for the audio modality. If `None`, defaults to the video value
`guidance_rescale`.
spatio_temporal_guidance_blocks (`list[int]`, *optional*, defaults to `None`):
The zero-indexed transformer block indices at which to apply STG. Must be supplied if STG is used
(`stg_scale` or `audio_stg_scale` is greater than `0`). A value of `[29]` is recommended for LTX-2.0
and `[28]` is recommended for LTX-2.3.
noise_scale (`float`, *optional*, defaults to `0.0`):
The interpolation factor between random noise and denoised latents at each timestep. Noise is applied to
the `latents` and `audio_latents` before denoising continues.
@@ -878,6 +952,24 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
The timestep at which generated video is decoded.
decode_noise_scale (`float`, defaults to `None`):
The interpolation factor between random noise and denoised latents at the decode timestep.
use_cross_timestep (`bool` *optional*, defaults to `False`):
Whether to use the cross modality (audio is the cross modality of video, and vice versa) sigma when
calculating the cross attention modulation parameters. `True` is the newer (e.g. LTX-2.3) behavior;
`False` is the legacy LTX-2.0 behavior.
system_prompt (`str`, *optional*, defaults to `None`):
Optional system prompt to use for prompt enhancement. The system prompt will be used by the current
text encoder (by default, a `Gemma3ForConditionalGeneration` model) to generate an enhanced prompt from
the original `prompt` to condition generation. If not supplied, prompt enhancement will not be
performed.
prompt_max_new_tokens (`int`, *optional*, defaults to `512`):
The maximum number of new tokens to generate when performing prompt enhancement.
prompt_enhancement_kwargs (`dict[str, Any]`, *optional*, defaults to `None`):
Keyword arguments for `self.text_encoder.generate`. If not supplied, default arguments of
`do_sample=True` and `temperature=0.7` will be used. See
https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate
for more details.
prompt_enhancement_seed (`int`, *optional*, defaults to `10`):
Random seed for any random operations during prompt enhancement.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
@@ -910,6 +1002,11 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
audio_guidance_scale = audio_guidance_scale or guidance_scale
audio_stg_scale = audio_stg_scale or stg_scale
audio_modality_scale = audio_modality_scale or modality_scale
audio_guidance_rescale = audio_guidance_rescale or guidance_rescale
# 1. Check inputs. Raise error if not correct
self.check_inputs(
prompt=prompt,
@@ -920,10 +1017,21 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
negative_prompt_embeds=negative_prompt_embeds,
prompt_attention_mask=prompt_attention_mask,
negative_prompt_attention_mask=negative_prompt_attention_mask,
spatio_temporal_guidance_blocks=spatio_temporal_guidance_blocks,
stg_scale=stg_scale,
audio_stg_scale=audio_stg_scale,
)
# Per-modality guidance scales (video, audio)
self._guidance_scale = guidance_scale
self._stg_scale = stg_scale
self._modality_scale = modality_scale
self._guidance_rescale = guidance_rescale
self._audio_guidance_scale = audio_guidance_scale
self._audio_stg_scale = audio_stg_scale
self._audio_modality_scale = audio_modality_scale
self._audio_guidance_rescale = audio_guidance_rescale
self._attention_kwargs = attention_kwargs
self._interrupt = False
self._current_timestep = None
@@ -939,6 +1047,17 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
device = self._execution_device
# 3. Prepare text embeddings
if system_prompt is not None and prompt is not None:
prompt = self.enhance_prompt(
prompt=prompt,
system_prompt=system_prompt,
max_new_tokens=prompt_max_new_tokens,
seed=prompt_enhancement_seed,
generator=generator,
generation_kwargs=prompt_enhancement_kwargs,
device=device,
)
(
prompt_embeds,
prompt_attention_mask,
@@ -960,9 +1079,11 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
prompt_attention_mask = torch.cat([negative_prompt_attention_mask, prompt_attention_mask], dim=0)
additive_attention_mask = (1 - prompt_attention_mask.to(prompt_embeds.dtype)) * -1000000.0
tokenizer_padding_side = "left" # Padding side for default Gemma3-12B text encoder
if getattr(self, "tokenizer", None) is not None:
tokenizer_padding_side = getattr(self.tokenizer, "padding_side", "left")
connector_prompt_embeds, connector_audio_prompt_embeds, connector_attention_mask = self.connectors(
prompt_embeds, additive_attention_mask, additive_mask=True
prompt_embeds, prompt_attention_mask, padding_side=tokenizer_padding_side
)
# 4. Prepare latent variables
@@ -984,7 +1105,7 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
raise ValueError(
f"Provided `latents` tensor has shape {latents.shape}, but the expected shape is either [batch_size, seq_len, num_features] or [batch_size, latent_dim, latent_frames, latent_height, latent_width]."
)
video_sequence_length = latent_num_frames * latent_height * latent_width
# video_sequence_length = latent_num_frames * latent_height * latent_width
num_channels_latents = self.transformer.config.in_channels
latents = self.prepare_latents(
@@ -1041,7 +1162,7 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
# 5. Prepare timesteps
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
mu = calculate_shift(
video_sequence_length,
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_image_seq_len", 1024),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.95),
@@ -1069,11 +1190,6 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
self._num_timesteps = len(timesteps)
# 6. Prepare micro-conditions
rope_interpolation_scale = (
self.vae_temporal_compression_ratio / frame_rate,
self.vae_spatial_compression_ratio,
self.vae_spatial_compression_ratio,
)
# Pre-compute video and audio positional ids as they will be the same at each step of the denoising loop
video_coords = self.transformer.rope.prepare_video_coords(
latents.shape[0], latent_num_frames, latent_height, latent_width, latents.device, fps=frame_rate
@@ -1111,6 +1227,7 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
encoder_hidden_states=connector_prompt_embeds,
audio_encoder_hidden_states=connector_audio_prompt_embeds,
timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=connector_attention_mask,
audio_encoder_attention_mask=connector_attention_mask,
num_frames=latent_num_frames,
@@ -1120,7 +1237,10 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
audio_num_frames=audio_num_frames,
video_coords=video_coords,
audio_coords=audio_coords,
# rope_interpolation_scale=rope_interpolation_scale,
isolate_modalities=False,
spatio_temporal_guidance_blocks=None,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
@@ -1128,24 +1248,155 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
noise_pred_audio = noise_pred_audio.float()
if self.do_classifier_free_guidance:
noise_pred_video_uncond, noise_pred_video_text = noise_pred_video.chunk(2)
noise_pred_video = noise_pred_video_uncond + self.guidance_scale * (
noise_pred_video_text - noise_pred_video_uncond
noise_pred_video_uncond_text, noise_pred_video = noise_pred_video.chunk(2)
noise_pred_video = self.convert_velocity_to_x0(latents, noise_pred_video, i, self.scheduler)
noise_pred_video_uncond_text = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_text, i, self.scheduler
)
# Use delta formulation as it works more nicely with multiple guidance terms
video_cfg_delta = (self.guidance_scale - 1) * (noise_pred_video - noise_pred_video_uncond_text)
noise_pred_audio_uncond_text, noise_pred_audio = noise_pred_audio.chunk(2)
noise_pred_audio = self.convert_velocity_to_x0(audio_latents, noise_pred_audio, i, audio_scheduler)
noise_pred_audio_uncond_text = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_text, i, audio_scheduler
)
audio_cfg_delta = (self.audio_guidance_scale - 1) * (
noise_pred_audio - noise_pred_audio_uncond_text
)
noise_pred_audio_uncond, noise_pred_audio_text = noise_pred_audio.chunk(2)
noise_pred_audio = noise_pred_audio_uncond + self.guidance_scale * (
noise_pred_audio_text - noise_pred_audio_uncond
# Get positive values from merged CFG inputs in case we need to do other DiT forward passes
if self.do_spatio_temporal_guidance or self.do_modality_isolation_guidance:
if i == 0:
# Only split values that remain constant throughout the loop once
video_prompt_embeds = connector_prompt_embeds.chunk(2, dim=0)[1]
audio_prompt_embeds = connector_audio_prompt_embeds.chunk(2, dim=0)[1]
prompt_attn_mask = connector_attention_mask.chunk(2, dim=0)[1]
video_pos_ids = video_coords.chunk(2, dim=0)[0]
audio_pos_ids = audio_coords.chunk(2, dim=0)[0]
# Split values that vary each denoising loop iteration
timestep = timestep.chunk(2, dim=0)[0]
else:
video_cfg_delta = audio_cfg_delta = 0
video_prompt_embeds = connector_prompt_embeds
audio_prompt_embeds = connector_audio_prompt_embeds
prompt_attn_mask = connector_attention_mask
video_pos_ids = video_coords
audio_pos_ids = audio_coords
noise_pred_video = self.convert_velocity_to_x0(latents, noise_pred_video, i, self.scheduler)
noise_pred_audio = self.convert_velocity_to_x0(audio_latents, noise_pred_audio, i, audio_scheduler)
if self.do_spatio_temporal_guidance:
with self.transformer.cache_context("uncond_stg"):
noise_pred_video_uncond_stg, noise_pred_audio_uncond_stg = self.transformer(
hidden_states=latents.to(dtype=prompt_embeds.dtype),
audio_hidden_states=audio_latents.to(dtype=prompt_embeds.dtype),
encoder_hidden_states=video_prompt_embeds,
audio_encoder_hidden_states=audio_prompt_embeds,
timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=prompt_attn_mask,
audio_encoder_attention_mask=prompt_attn_mask,
num_frames=latent_num_frames,
height=latent_height,
width=latent_width,
fps=frame_rate,
audio_num_frames=audio_num_frames,
video_coords=video_pos_ids,
audio_coords=audio_pos_ids,
isolate_modalities=False,
# Use STG at given blocks to perturb model
spatio_temporal_guidance_blocks=spatio_temporal_guidance_blocks,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
noise_pred_video_uncond_stg = noise_pred_video_uncond_stg.float()
noise_pred_audio_uncond_stg = noise_pred_audio_uncond_stg.float()
noise_pred_video_uncond_stg = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_stg, i, self.scheduler
)
noise_pred_audio_uncond_stg = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_stg, i, audio_scheduler
)
if self.guidance_rescale > 0:
# Based on 3.4. in https://huggingface.co/papers/2305.08891
noise_pred_video = rescale_noise_cfg(
noise_pred_video, noise_pred_video_text, guidance_rescale=self.guidance_rescale
)
noise_pred_audio = rescale_noise_cfg(
noise_pred_audio, noise_pred_audio_text, guidance_rescale=self.guidance_rescale
video_stg_delta = self.stg_scale * (noise_pred_video - noise_pred_video_uncond_stg)
audio_stg_delta = self.audio_stg_scale * (noise_pred_audio - noise_pred_audio_uncond_stg)
else:
video_stg_delta = audio_stg_delta = 0
if self.do_modality_isolation_guidance:
with self.transformer.cache_context("uncond_modality"):
noise_pred_video_uncond_modality, noise_pred_audio_uncond_modality = self.transformer(
hidden_states=latents.to(dtype=prompt_embeds.dtype),
audio_hidden_states=audio_latents.to(dtype=prompt_embeds.dtype),
encoder_hidden_states=video_prompt_embeds,
audio_encoder_hidden_states=audio_prompt_embeds,
timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=prompt_attn_mask,
audio_encoder_attention_mask=prompt_attn_mask,
num_frames=latent_num_frames,
height=latent_height,
width=latent_width,
fps=frame_rate,
audio_num_frames=audio_num_frames,
video_coords=video_pos_ids,
audio_coords=audio_pos_ids,
# Turn off A2V and V2A cross attn to isolate video and audio modalities
isolate_modalities=True,
spatio_temporal_guidance_blocks=None,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
noise_pred_video_uncond_modality = noise_pred_video_uncond_modality.float()
noise_pred_audio_uncond_modality = noise_pred_audio_uncond_modality.float()
noise_pred_video_uncond_modality = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_modality, i, self.scheduler
)
noise_pred_audio_uncond_modality = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_modality, i, audio_scheduler
)
video_modality_delta = (self.modality_scale - 1) * (
noise_pred_video - noise_pred_video_uncond_modality
)
audio_modality_delta = (self.audio_modality_scale - 1) * (
noise_pred_audio - noise_pred_audio_uncond_modality
)
else:
video_modality_delta = audio_modality_delta = 0
# Now apply all guidance terms
noise_pred_video_g = noise_pred_video + video_cfg_delta + video_stg_delta + video_modality_delta
noise_pred_audio_g = noise_pred_audio + audio_cfg_delta + audio_stg_delta + audio_modality_delta
# Apply LTX-2.X guidance rescaling
if self.guidance_rescale > 0:
noise_pred_video = rescale_noise_cfg(
noise_pred_video_g, noise_pred_video, guidance_rescale=self.guidance_rescale
)
else:
noise_pred_video = noise_pred_video_g
if self.audio_guidance_rescale > 0:
noise_pred_audio = rescale_noise_cfg(
noise_pred_audio_g, noise_pred_audio, guidance_rescale=self.audio_guidance_rescale
)
else:
noise_pred_audio = noise_pred_audio_g
# Convert back to velocity for scheduler
noise_pred_video = self.convert_x0_to_velocity(latents, noise_pred_video, i, self.scheduler)
noise_pred_audio = self.convert_x0_to_velocity(audio_latents, noise_pred_audio, i, audio_scheduler)
# compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(noise_pred_video, t, latents, return_dict=False)[0]
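For reference, a minimal sketch of the x0-space, delta-style guidance combination implemented above, assuming all predictions have already been converted from velocity to x0; the function and argument names are illustrative, not pipeline API.
def combine_guidance_x0(x0_cond, x0_uncond, cfg_scale, x0_stg=None, stg_scale=0.0, x0_modality=None, modality_scale=1.0):
    # Each guidance signal is expressed as a delta added to the positive (text-conditioned)
    # x0 prediction, so several guidance terms can be stacked independently.
    out = x0_cond + (cfg_scale - 1.0) * (x0_cond - x0_uncond)        # classifier-free guidance
    if x0_stg is not None:
        out = out + stg_scale * (x0_cond - x0_stg)                   # spatio-temporal guidance
    if x0_modality is not None:
        out = out + (modality_scale - 1.0) * (x0_cond - x0_modality) # modality isolation guidance
    return out
If guidance rescaling is enabled, `rescale_noise_cfg` is then applied against the positive prediction before converting the result back to a velocity for the scheduler step, as in the loop above.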
@@ -1177,9 +1428,6 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
self.transformer_spatial_patch_size,
self.transformer_temporal_patch_size,
)
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
audio_latents = self._denormalize_audio_latents(
audio_latents, self.audio_vae.latents_mean, self.audio_vae.latents_std
@@ -1187,6 +1435,9 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
audio_latents = self._unpack_audio_latents(audio_latents, audio_num_frames, num_mel_bins=latent_mel_bins)
if output_type == "latent":
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
video = latents
audio = audio_latents
else:
@@ -1209,6 +1460,10 @@ class LTX2Pipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
]
latents = (1 - decode_noise_scale) * latents + decode_noise_scale * noise
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
latents = latents.to(self.vae.dtype)
video = self.vae.decode(latents, timestep, return_dict=False)[0]
video = self.video_processor.postprocess_video(video, output_type=output_type)
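For reference, a minimal sketch of the decode-time noise injection used above when `decode_noise_scale` is set; the helper name is illustrative and only the blending step is shown, not the denormalization or VAE decode.
import torch

def add_decode_noise(latents, decode_noise_scale, generator=None):
    # Blend a small amount of fresh Gaussian noise into the final latents before decoding.
    # decode_noise_scale = 0 keeps the latents unchanged; 1 replaces them with pure noise.
    noise = torch.randn(latents.shape, generator=generator, device=latents.device, dtype=latents.dtype)
    return (1.0 - decode_noise_scale) * latents + decode_noise_scale * noise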

View File

@@ -33,7 +33,7 @@ from ...video_processor import VideoProcessor
from ..pipeline_utils import DiffusionPipeline
from .connectors import LTX2TextConnectors
from .pipeline_output import LTX2PipelineOutput
from .vocoder import LTX2Vocoder
from .vocoder import LTX2Vocoder, LTX2VocoderWithBWE
if is_torch_xla_available():
@@ -254,7 +254,7 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
tokenizer: GemmaTokenizer | GemmaTokenizerFast,
connectors: LTX2TextConnectors,
transformer: LTX2VideoTransformer3DModel,
vocoder: LTX2Vocoder,
vocoder: LTX2Vocoder | LTX2VocoderWithBWE,
):
super().__init__()
@@ -300,74 +300,6 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
self.tokenizer.model_max_length if getattr(self, "tokenizer", None) is not None else 1024
)
@staticmethod
# Copied from diffusers.pipelines.ltx2.pipeline_ltx2.LTX2Pipeline._pack_text_embeds
def _pack_text_embeds(
text_hidden_states: torch.Tensor,
sequence_lengths: torch.Tensor,
device: str | torch.device,
padding_side: str = "left",
scale_factor: int = 8,
eps: float = 1e-6,
) -> torch.Tensor:
"""
Packs and normalizes text encoder hidden states, respecting padding. Normalization is performed per-batch and
per-layer in a masked fashion (only over non-padded positions).
Args:
text_hidden_states (`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim, num_layers)`):
Per-layer hidden_states from a text encoder (e.g. `Gemma3ForConditionalGeneration`).
sequence_lengths (`torch.Tensor` of shape `(batch_size,)`):
The number of valid (non-padded) tokens for each batch instance.
device: (`str` or `torch.device`, *optional*):
torch device to place the resulting embeddings on
padding_side: (`str`, *optional*, defaults to `"left"`):
Whether the text tokenizer performs padding on the `"left"` or `"right"`.
scale_factor (`int`, *optional*, defaults to `8`):
Scaling factor to multiply the normalized hidden states by.
eps (`float`, *optional*, defaults to `1e-6`):
A small positive value for numerical stability when performing normalization.
Returns:
`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim * num_layers)`:
Normed and flattened text encoder hidden states.
"""
batch_size, seq_len, hidden_dim, num_layers = text_hidden_states.shape
original_dtype = text_hidden_states.dtype
# Create padding mask
token_indices = torch.arange(seq_len, device=device).unsqueeze(0)
if padding_side == "right":
# For right padding, valid tokens are from 0 to sequence_length-1
mask = token_indices < sequence_lengths[:, None] # [batch_size, seq_len]
elif padding_side == "left":
# For left padding, valid tokens are from (T - sequence_length) to T-1
start_indices = seq_len - sequence_lengths[:, None] # [batch_size, 1]
mask = token_indices >= start_indices # [B, T]
else:
raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side}")
mask = mask[:, :, None, None] # [batch_size, seq_len] --> [batch_size, seq_len, 1, 1]
# Compute masked mean over non-padding positions of shape (batch_size, 1, 1, num_layers)
masked_text_hidden_states = text_hidden_states.masked_fill(~mask, 0.0)
num_valid_positions = (sequence_lengths * hidden_dim).view(batch_size, 1, 1, 1)
masked_mean = masked_text_hidden_states.sum(dim=(1, 2), keepdim=True) / (num_valid_positions + eps)
# Compute min/max over non-padding positions of shape (batch_size, 1, 1, num_layers)
x_min = text_hidden_states.masked_fill(~mask, float("inf")).amin(dim=(1, 2), keepdim=True)
x_max = text_hidden_states.masked_fill(~mask, float("-inf")).amax(dim=(1, 2), keepdim=True)
# Normalization
normalized_hidden_states = (text_hidden_states - masked_mean) / (x_max - x_min + eps)
normalized_hidden_states = normalized_hidden_states * scale_factor
# Pack the hidden states to a 3D tensor (batch_size, seq_len, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.flatten(2)
mask_flat = mask.squeeze(-1).expand(-1, -1, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.masked_fill(~mask_flat, 0.0)
normalized_hidden_states = normalized_hidden_states.to(dtype=original_dtype)
return normalized_hidden_states
# Copied from diffusers.pipelines.ltx2.pipeline_ltx2.LTX2Pipeline._get_gemma_prompt_embeds
def _get_gemma_prompt_embeds(
self,
@@ -421,16 +353,7 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
)
text_encoder_hidden_states = text_encoder_outputs.hidden_states
text_encoder_hidden_states = torch.stack(text_encoder_hidden_states, dim=-1)
sequence_lengths = prompt_attention_mask.sum(dim=-1)
prompt_embeds = self._pack_text_embeds(
text_encoder_hidden_states,
sequence_lengths,
device=device,
padding_side=self.tokenizer.padding_side,
scale_factor=scale_factor,
)
prompt_embeds = prompt_embeds.to(dtype=dtype)
prompt_embeds = text_encoder_hidden_states.flatten(2, 3).to(dtype=dtype) # Pack to 3D
# duplicate text embeddings for each generation per prompt, using mps friendly method
_, seq_len, _ = prompt_embeds.shape
@@ -541,6 +464,9 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
negative_prompt_attention_mask=None,
latents=None,
audio_latents=None,
spatio_temporal_guidance_blocks=None,
stg_scale=None,
audio_stg_scale=None,
):
if height % 32 != 0 or width % 32 != 0:
raise ValueError(f"`height` and `width` have to be divisible by 32 but are {height} and {width}.")
@@ -597,6 +523,12 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
f" using the `_unpack_audio_latents` method)."
)
if ((stg_scale > 0.0) or (audio_stg_scale > 0.0)) and not spatio_temporal_guidance_blocks:
raise ValueError(
"Spatio-Temporal Guidance (STG) is specified but no STG blocks are supplied. Please supply a list of"
"block indices at which to apply STG in `spatio_temporal_guidance_blocks`"
)
@staticmethod
# Copied from diffusers.pipelines.ltx2.pipeline_ltx2.LTX2Pipeline._pack_latents
def _pack_latents(latents: torch.Tensor, patch_size: int = 1, patch_size_t: int = 1) -> torch.Tensor:
@@ -984,6 +916,24 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
latents = self._pack_audio_latents(latents)
return latents
def convert_velocity_to_x0(
self, sample: torch.Tensor, denoised_output: torch.Tensor, step_idx: int, scheduler: Any | None = None
) -> torch.Tensor:
if scheduler is None:
scheduler = self.scheduler
sample_x0 = sample - denoised_output * scheduler.sigmas[step_idx]
return sample_x0
def convert_x0_to_velocity(
self, sample: torch.Tensor, denoised_output: torch.Tensor, step_idx: int, scheduler: Any | None = None
) -> torch.Tensor:
if scheduler is None:
scheduler = self.scheduler
sample_v = (sample - denoised_output) / scheduler.sigmas[step_idx]
return sample_v
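A small numerical sanity check of the two conversions above, assuming the flow-matching parameterization x_t = (1 - sigma) * x0 + sigma * noise with velocity v = noise - x0 (tensor names illustrative):
import torch

sigma = 0.7
x0 = torch.randn(2, 4)
noise = torch.randn(2, 4)
x_t = (1 - sigma) * x0 + sigma * noise
v = noise - x0

x0_rec = x_t - v * sigma        # convert_velocity_to_x0
v_rec = (x_t - x0_rec) / sigma  # convert_x0_to_velocity
assert torch.allclose(x0_rec, x0, atol=1e-6)
assert torch.allclose(v_rec, v, atol=1e-6)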
@property
def guidance_scale(self):
return self._guidance_scale
@@ -992,9 +942,41 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
def guidance_rescale(self):
return self._guidance_rescale
@property
def stg_scale(self):
return self._stg_scale
@property
def modality_scale(self):
return self._modality_scale
@property
def audio_guidance_scale(self):
return self._audio_guidance_scale
@property
def audio_guidance_rescale(self):
return self._audio_guidance_rescale
@property
def audio_stg_scale(self):
return self._audio_stg_scale
@property
def audio_modality_scale(self):
return self._audio_modality_scale
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1.0
return (self._guidance_scale > 1.0) or (self._audio_guidance_scale > 1.0)
@property
def do_spatio_temporal_guidance(self):
return (self._stg_scale > 0.0) or (self._audio_stg_scale > 0.0)
@property
def do_modality_isolation_guidance(self):
return (self._modality_scale > 1.0) or (self._audio_modality_scale > 1.0)
@property
def num_timesteps(self):
@@ -1027,7 +1009,14 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
sigmas: list[float] | None = None,
timesteps: list[float] | None = None,
guidance_scale: float = 4.0,
stg_scale: float = 0.0,
modality_scale: float = 1.0,
guidance_rescale: float = 0.0,
audio_guidance_scale: float | None = None,
audio_stg_scale: float | None = None,
audio_modality_scale: float | None = None,
audio_guidance_rescale: float | None = None,
spatio_temporal_guidance_blocks: list[int] | None = None,
noise_scale: float | None = None,
num_videos_per_prompt: int | None = 1,
generator: torch.Generator | list[torch.Generator] | None = None,
@@ -1039,6 +1028,7 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
negative_prompt_attention_mask: torch.Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
use_cross_timestep: bool = False,
output_type: str = "pil",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
@@ -1079,13 +1069,47 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
`guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
the text `prompt`, usually at the expense of lower image quality.
the text `prompt`, usually at the expense of lower image quality. Used for the video modality (there is
a separate value `audio_guidance_scale` for the audio modality).
stg_scale (`float`, *optional*, defaults to `0.0`):
Video guidance scale for Spatio-Temporal Guidance (STG), proposed in [Spatiotemporal Skip Guidance for
Enhanced Video Diffusion Sampling](https://arxiv.org/abs/2411.18664). STG uses a CFG-like estimate
where we move the sample away from a weaker sample produced by a perturbed version of the denoising model.
Enabling STG will result in an additional denoising model forward pass; the default value of `0.0`
means that STG is disabled.
modality_scale (`float`, *optional*, defaults to `1.0`):
Video guidance scale for LTX-2.X modality isolation guidance, where we move the sample away from a
weaker sample generated by the denoising model with cross-modality (audio-to-video and video-to-audio)
cross attention disabled using a CFG-like estimate. Enabling modality guidance will result in an
additional denoising model forward pass; the default value of `1.0` means that modality guidance is
disabled.
guidance_rescale (`float`, *optional*, defaults to 0.0):
Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
Flawed](https://huggingface.co/papers/2305.08891) `guidance_scale` is defined as `φ` in equation 16. of
[Common Diffusion Noise Schedules and Sample Steps are
Flawed](https://huggingface.co/papers/2305.08891). Guidance rescale factor should fix overexposure when
using zero terminal SNR.
using zero terminal SNR. Used for the video modality.
audio_guidance_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for CFG with respect to the negative prompt. The CFG update rule is the same for
video and audio, but they can use different values for the guidance scale. The LTX-2.X authors suggest
that the `audio_guidance_scale` should be higher relative to the video `guidance_scale` (e.g. for
LTX-2.3 they suggest 3.0 for video and 7.0 for audio). If `None`, defaults to the video value
`guidance_scale`.
audio_stg_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for STG. As with CFG, the STG update rule is otherwise the same for video and
audio. For LTX-2.3, a value of 1.0 is suggested for both video and audio. If `None`, defaults to the
video value `stg_scale`.
audio_modality_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for LTX-2.X modality isolation guidance. As with CFG, the modality guidance rule
is otherwise the same for video and audio. For LTX-2.3, a value of 3.0 is suggested for both video and
audio. If `None`, defaults to the video value `modality_scale`.
audio_guidance_rescale (`float`, *optional*, defaults to `None`):
A separate guidance rescale factor for the audio modality. If `None`, defaults to the video value
`guidance_rescale`.
spatio_temporal_guidance_blocks (`list[int]`, *optional*, defaults to `None`):
The zero-indexed transformer block indices at which to apply STG. Must be supplied if STG is used
(`stg_scale` or `audio_stg_scale` is greater than `0`). A value of `[29]` is recommended for LTX-2.0
and `[28]` is recommended for LTX-2.3.
noise_scale (`float`, *optional*, defaults to `None`):
The interpolation factor between random noise and denoised latents at each timestep. Noise is applied to
the `latents` and `audio_latents` before continuing denoising. If not set, it will be inferred from the
@@ -1117,6 +1141,10 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
The timestep at which generated video is decoded.
decode_noise_scale (`float`, defaults to `None`):
The interpolation factor between random noise and denoised latents at the decode timestep.
use_cross_timestep (`bool`, *optional*, defaults to `False`):
Whether to use the cross-modality sigma (audio is the cross modality of video, and vice versa) when
calculating the cross attention modulation parameters. `True` is the newer (e.g. LTX-2.3) behavior;
`False` is the legacy LTX-2.0 behavior.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
@@ -1149,6 +1177,11 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
audio_guidance_scale = audio_guidance_scale or guidance_scale
audio_stg_scale = audio_stg_scale or stg_scale
audio_modality_scale = audio_modality_scale or modality_scale
audio_guidance_rescale = audio_guidance_rescale or guidance_rescale
# 1. Check inputs. Raise error if not correct
self.check_inputs(
prompt=prompt,
@@ -1161,10 +1194,21 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
negative_prompt_attention_mask=negative_prompt_attention_mask,
latents=latents,
audio_latents=audio_latents,
spatio_temporal_guidance_blocks=spatio_temporal_guidance_blocks,
stg_scale=stg_scale,
audio_stg_scale=audio_stg_scale,
)
# Per-modality guidance scales (video, audio)
self._guidance_scale = guidance_scale
self._stg_scale = stg_scale
self._modality_scale = modality_scale
self._guidance_rescale = guidance_rescale
self._audio_guidance_scale = audio_guidance_scale
self._audio_stg_scale = audio_stg_scale
self._audio_modality_scale = audio_modality_scale
self._audio_guidance_rescale = audio_guidance_rescale
self._attention_kwargs = attention_kwargs
self._interrupt = False
self._current_timestep = None
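A hedged usage sketch of the new per-modality guidance arguments; the checkpoint path and most values are illustrative assumptions (only the video/audio guidance split and the STG block index follow the docstring recommendations above), and conditioning inputs are omitted for brevity.
import torch
from diffusers import LTX2ConditionPipeline  # assumes the pipeline is exported under this name

pipe = LTX2ConditionPipeline.from_pretrained("path/to/ltx-2.3-checkpoint", torch_dtype=torch.bfloat16)
pipe.to("cuda")

output = pipe(
    prompt="a red fox running through fresh snow, cinematic lighting",
    guidance_scale=3.0,                    # video CFG
    audio_guidance_scale=7.0,              # audio CFG, higher than video per the LTX-2.3 suggestion
    stg_scale=1.0,
    audio_stg_scale=1.0,
    spatio_temporal_guidance_blocks=[28],  # required whenever stg_scale / audio_stg_scale > 0
    modality_scale=3.0,
    audio_modality_scale=3.0,
    use_cross_timestep=True,               # LTX-2.3 behaviour
    num_inference_steps=40,
)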
@@ -1208,9 +1252,11 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
prompt_attention_mask = torch.cat([negative_prompt_attention_mask, prompt_attention_mask], dim=0)
additive_attention_mask = (1 - prompt_attention_mask.to(prompt_embeds.dtype)) * -1000000.0
tokenizer_padding_side = "left" # Padding side for default Gemma3-12B text encoder
if getattr(self, "tokenizer", None) is not None:
tokenizer_padding_side = getattr(self.tokenizer, "padding_side", "left")
connector_prompt_embeds, connector_audio_prompt_embeds, connector_attention_mask = self.connectors(
prompt_embeds, additive_attention_mask, additive_mask=True
prompt_embeds, prompt_attention_mask, padding_side=tokenizer_padding_side
)
# 4. Prepare latent variables
@@ -1222,7 +1268,7 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
"Got latents of shape [batch_size, latent_dim, latent_frames, latent_height, latent_width], `latent_num_frames`, `latent_height`, `latent_width` will be inferred."
)
_, _, latent_num_frames, latent_height, latent_width = latents.shape # [B, C, F, H, W]
video_sequence_length = latent_num_frames * latent_height * latent_width
# video_sequence_length = latent_num_frames * latent_height * latent_width
num_channels_latents = self.transformer.config.in_channels
latents, conditioning_mask, clean_latents = self.prepare_latents(
@@ -1272,7 +1318,7 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
# 5. Prepare timesteps
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
mu = calculate_shift(
video_sequence_length,
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_image_seq_len", 1024),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.95),
@@ -1301,11 +1347,6 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
self._num_timesteps = len(timesteps)
# 6. Prepare micro-conditions
rope_interpolation_scale = (
self.vae_temporal_compression_ratio / frame_rate,
self.vae_spatial_compression_ratio,
self.vae_spatial_compression_ratio,
)
# Pre-compute video and audio positional ids as they will be the same at each step of the denoising loop
video_coords = self.transformer.rope.prepare_video_coords(
latents.shape[0], latent_num_frames, latent_height, latent_width, latents.device, fps=frame_rate
@@ -1344,6 +1385,7 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
audio_encoder_hidden_states=connector_audio_prompt_embeds,
timestep=video_timestep,
audio_timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=connector_attention_mask,
audio_encoder_attention_mask=connector_attention_mask,
num_frames=latent_num_frames,
@@ -1353,7 +1395,10 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
audio_num_frames=audio_num_frames,
video_coords=video_coords,
audio_coords=audio_coords,
# rope_interpolation_scale=rope_interpolation_scale,
isolate_modalities=False,
spatio_temporal_guidance_blocks=None,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
@@ -1361,41 +1406,172 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
noise_pred_audio = noise_pred_audio.float()
if self.do_classifier_free_guidance:
noise_pred_video_uncond, noise_pred_video_text = noise_pred_video.chunk(2)
noise_pred_video = noise_pred_video_uncond + self.guidance_scale * (
noise_pred_video_text - noise_pred_video_uncond
noise_pred_video_uncond_text, noise_pred_video = noise_pred_video.chunk(2)
noise_pred_video = self.convert_velocity_to_x0(latents, noise_pred_video, i, self.scheduler)
noise_pred_video_uncond_text = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_text, i, self.scheduler
)
# Use delta formulation as it works more nicely with multiple guidance terms
video_cfg_delta = (self.guidance_scale - 1) * (noise_pred_video - noise_pred_video_uncond_text)
noise_pred_audio_uncond_text, noise_pred_audio = noise_pred_audio.chunk(2)
noise_pred_audio = self.convert_velocity_to_x0(audio_latents, noise_pred_audio, i, audio_scheduler)
noise_pred_audio_uncond_text = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_text, i, audio_scheduler
)
audio_cfg_delta = (self.audio_guidance_scale - 1) * (
noise_pred_audio - noise_pred_audio_uncond_text
)
noise_pred_audio_uncond, noise_pred_audio_text = noise_pred_audio.chunk(2)
noise_pred_audio = noise_pred_audio_uncond + self.guidance_scale * (
noise_pred_audio_text - noise_pred_audio_uncond
# Get positive values from merged CFG inputs in case we need to do other DiT forward passes
if self.do_spatio_temporal_guidance or self.do_modality_isolation_guidance:
if i == 0:
# Only split values that remain constant throughout the loop once
video_prompt_embeds = connector_prompt_embeds.chunk(2, dim=0)[1]
audio_prompt_embeds = connector_audio_prompt_embeds.chunk(2, dim=0)[1]
prompt_attn_mask = connector_attention_mask.chunk(2, dim=0)[1]
video_pos_ids = video_coords.chunk(2, dim=0)[0]
audio_pos_ids = audio_coords.chunk(2, dim=0)[0]
# Split values that vary each denoising loop iteration
timestep = timestep.chunk(2, dim=0)[0]
video_timestep = video_timestep.chunk(2, dim=0)[0]
else:
video_cfg_delta = audio_cfg_delta = 0
video_prompt_embeds = connector_prompt_embeds
audio_prompt_embeds = connector_audio_prompt_embeds
prompt_attn_mask = connector_attention_mask
video_pos_ids = video_coords
audio_pos_ids = audio_coords
noise_pred_video = self.convert_velocity_to_x0(latents, noise_pred_video, i, self.scheduler)
noise_pred_audio = self.convert_velocity_to_x0(audio_latents, noise_pred_audio, i, audio_scheduler)
if self.do_spatio_temporal_guidance:
with self.transformer.cache_context("uncond_stg"):
noise_pred_video_uncond_stg, noise_pred_audio_uncond_stg = self.transformer(
hidden_states=latents.to(dtype=prompt_embeds.dtype),
audio_hidden_states=audio_latents.to(dtype=prompt_embeds.dtype),
encoder_hidden_states=video_prompt_embeds,
audio_encoder_hidden_states=audio_prompt_embeds,
timestep=video_timestep,
audio_timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=prompt_attn_mask,
audio_encoder_attention_mask=prompt_attn_mask,
num_frames=latent_num_frames,
height=latent_height,
width=latent_width,
fps=frame_rate,
audio_num_frames=audio_num_frames,
video_coords=video_pos_ids,
audio_coords=audio_pos_ids,
isolate_modalities=False,
# Use STG at given blocks to perturb model
spatio_temporal_guidance_blocks=spatio_temporal_guidance_blocks,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
noise_pred_video_uncond_stg = noise_pred_video_uncond_stg.float()
noise_pred_audio_uncond_stg = noise_pred_audio_uncond_stg.float()
noise_pred_video_uncond_stg = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_stg, i, self.scheduler
)
noise_pred_audio_uncond_stg = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_stg, i, audio_scheduler
)
if self.guidance_rescale > 0:
# Based on 3.4. in https://huggingface.co/papers/2305.08891
noise_pred_video = rescale_noise_cfg(
noise_pred_video, noise_pred_video_text, guidance_rescale=self.guidance_rescale
)
noise_pred_audio = rescale_noise_cfg(
noise_pred_audio, noise_pred_audio_text, guidance_rescale=self.guidance_rescale
video_stg_delta = self.stg_scale * (noise_pred_video - noise_pred_video_uncond_stg)
audio_stg_delta = self.audio_stg_scale * (noise_pred_audio - noise_pred_audio_uncond_stg)
else:
video_stg_delta = audio_stg_delta = 0
if self.do_modality_isolation_guidance:
with self.transformer.cache_context("uncond_modality"):
noise_pred_video_uncond_modality, noise_pred_audio_uncond_modality = self.transformer(
hidden_states=latents.to(dtype=prompt_embeds.dtype),
audio_hidden_states=audio_latents.to(dtype=prompt_embeds.dtype),
encoder_hidden_states=video_prompt_embeds,
audio_encoder_hidden_states=audio_prompt_embeds,
timestep=video_timestep,
audio_timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=prompt_attn_mask,
audio_encoder_attention_mask=prompt_attn_mask,
num_frames=latent_num_frames,
height=latent_height,
width=latent_width,
fps=frame_rate,
audio_num_frames=audio_num_frames,
video_coords=video_pos_ids,
audio_coords=audio_pos_ids,
# Turn off A2V and V2A cross attn to isolate video and audio modalities
isolate_modalities=True,
spatio_temporal_guidance_blocks=None,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
noise_pred_video_uncond_modality = noise_pred_video_uncond_modality.float()
noise_pred_audio_uncond_modality = noise_pred_audio_uncond_modality.float()
noise_pred_video_uncond_modality = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_modality, i, self.scheduler
)
noise_pred_audio_uncond_modality = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_modality, i, audio_scheduler
)
video_modality_delta = (self.modality_scale - 1) * (
noise_pred_video - noise_pred_video_uncond_modality
)
audio_modality_delta = (self.audio_modality_scale - 1) * (
noise_pred_audio - noise_pred_audio_uncond_modality
)
else:
video_modality_delta = audio_modality_delta = 0
# Now apply all guidance terms
noise_pred_video_g = noise_pred_video + video_cfg_delta + video_stg_delta + video_modality_delta
noise_pred_audio_g = noise_pred_audio + audio_cfg_delta + audio_stg_delta + audio_modality_delta
# Apply LTX-2.X guidance rescaling
if self.guidance_rescale > 0:
noise_pred_video = rescale_noise_cfg(
noise_pred_video_g, noise_pred_video, guidance_rescale=self.guidance_rescale
)
else:
noise_pred_video = noise_pred_video_g
if self.audio_guidance_rescale > 0:
noise_pred_audio = rescale_noise_cfg(
noise_pred_audio_g, noise_pred_audio, guidance_rescale=self.audio_guidance_rescale
)
else:
noise_pred_audio = noise_pred_audio_g
# NOTE: use only the first chunk of conditioning mask in case it is duplicated for CFG
bsz = noise_pred_video.size(0)
sigma = self.scheduler.sigmas[i]
# Convert the noise_pred_video velocity model prediction into a sample (x0) prediction
denoised_sample = latents - noise_pred_video * sigma
# Apply the (packed) conditioning mask to the denoised (x0) sample and clean conditioning. The
# conditioning mask contains conditioning strengths from 0 (always use denoised sample) to 1 (always
# use conditions), with intermediate values specifying how strongly to follow the conditions.
# NOTE: this operation should be applied in sample (x0) space and not velocity space (which is the
# space the denoising model outputs are in)
denoised_sample_cond = (
denoised_sample * (1 - conditioning_mask[:bsz]) + clean_latents.float() * conditioning_mask[:bsz]
noise_pred_video * (1 - conditioning_mask[:bsz]) + clean_latents.float() * conditioning_mask[:bsz]
).to(noise_pred_video.dtype)
# Convert the denoised (x0) sample back to a velocity for the scheduler
denoised_latents_cond = ((latents - denoised_sample_cond) / sigma).to(noise_pred_video.dtype)
noise_pred_video = self.convert_x0_to_velocity(latents, denoised_sample_cond, i, self.scheduler)
noise_pred_audio = self.convert_x0_to_velocity(audio_latents, noise_pred_audio, i, audio_scheduler)
# Compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(denoised_latents_cond, t, latents, return_dict=False)[0]
latents = self.scheduler.step(noise_pred_video, t, latents, return_dict=False)[0]
# NOTE: for now duplicate scheduler for audio latents in case self.scheduler sets internal state in
# the step method (such as _step_index)
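For reference, a minimal sketch of the x0-space conditioning blend performed above (names illustrative); the conditioning mask holds per-token strengths in [0, 1], where 0 keeps the denoised prediction and 1 keeps the clean conditioning latents.
def apply_conditioning_x0(x0_pred, clean_latents, conditioning_mask):
    # Interpolate, token-wise, between the model's denoised (x0) prediction and the
    # clean conditioning latents according to the conditioning strength.
    return x0_pred * (1 - conditioning_mask) + clean_latents * conditioning_mask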
@@ -1425,9 +1601,6 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
self.transformer_spatial_patch_size,
self.transformer_temporal_patch_size,
)
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
audio_latents = self._denormalize_audio_latents(
audio_latents, self.audio_vae.latents_mean, self.audio_vae.latents_std
@@ -1435,6 +1608,9 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
audio_latents = self._unpack_audio_latents(audio_latents, audio_num_frames, num_mel_bins=latent_mel_bins)
if output_type == "latent":
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
video = latents
audio = audio_latents
else:
@@ -1457,6 +1633,10 @@ class LTX2ConditionPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraLoad
]
latents = (1 - decode_noise_scale) * latents + decode_noise_scale * noise
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
latents = latents.to(self.vae.dtype)
video = self.vae.decode(latents, timestep, return_dict=False)[0]
video = self.video_processor.postprocess_video(video, output_type=output_type)

View File

@@ -18,7 +18,7 @@ from typing import Any, Callable
import numpy as np
import torch
from transformers import Gemma3ForConditionalGeneration, GemmaTokenizer, GemmaTokenizerFast
from transformers import Gemma3ForConditionalGeneration, Gemma3Processor, GemmaTokenizer, GemmaTokenizerFast
from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...image_processor import PipelineImageInput
@@ -32,7 +32,7 @@ from ...video_processor import VideoProcessor
from ..pipeline_utils import DiffusionPipeline
from .connectors import LTX2TextConnectors
from .pipeline_output import LTX2PipelineOutput
from .vocoder import LTX2Vocoder
from .vocoder import LTX2Vocoder, LTX2VocoderWithBWE
if is_torch_xla_available():
@@ -212,7 +212,7 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
"""
model_cpu_offload_seq = "text_encoder->connectors->transformer->vae->audio_vae->vocoder"
_optional_components = []
_optional_components = ["processor"]
_callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]
def __init__(
@@ -224,7 +224,8 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
tokenizer: GemmaTokenizer | GemmaTokenizerFast,
connectors: LTX2TextConnectors,
transformer: LTX2VideoTransformer3DModel,
vocoder: LTX2Vocoder,
vocoder: LTX2Vocoder | LTX2VocoderWithBWE,
processor: Gemma3Processor | None = None,
):
super().__init__()
@@ -237,6 +238,7 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
transformer=transformer,
vocoder=vocoder,
scheduler=scheduler,
processor=processor,
)
self.vae_spatial_compression_ratio = (
@@ -271,74 +273,6 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
self.tokenizer.model_max_length if getattr(self, "tokenizer", None) is not None else 1024
)
@staticmethod
# Copied from diffusers.pipelines.ltx2.pipeline_ltx2.LTX2Pipeline._pack_text_embeds
def _pack_text_embeds(
text_hidden_states: torch.Tensor,
sequence_lengths: torch.Tensor,
device: str | torch.device,
padding_side: str = "left",
scale_factor: int = 8,
eps: float = 1e-6,
) -> torch.Tensor:
"""
Packs and normalizes text encoder hidden states, respecting padding. Normalization is performed per-batch and
per-layer in a masked fashion (only over non-padded positions).
Args:
text_hidden_states (`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim, num_layers)`):
Per-layer hidden_states from a text encoder (e.g. `Gemma3ForConditionalGeneration`).
sequence_lengths (`torch.Tensor` of shape `(batch_size,)`):
The number of valid (non-padded) tokens for each batch instance.
device: (`str` or `torch.device`, *optional*):
torch device to place the resulting embeddings on
padding_side: (`str`, *optional*, defaults to `"left"`):
Whether the text tokenizer performs padding on the `"left"` or `"right"`.
scale_factor (`int`, *optional*, defaults to `8`):
Scaling factor to multiply the normalized hidden states by.
eps (`float`, *optional*, defaults to `1e-6`):
A small positive value for numerical stability when performing normalization.
Returns:
`torch.Tensor` of shape `(batch_size, seq_len, hidden_dim * num_layers)`:
Normed and flattened text encoder hidden states.
"""
batch_size, seq_len, hidden_dim, num_layers = text_hidden_states.shape
original_dtype = text_hidden_states.dtype
# Create padding mask
token_indices = torch.arange(seq_len, device=device).unsqueeze(0)
if padding_side == "right":
# For right padding, valid tokens are from 0 to sequence_length-1
mask = token_indices < sequence_lengths[:, None] # [batch_size, seq_len]
elif padding_side == "left":
# For left padding, valid tokens are from (T - sequence_length) to T-1
start_indices = seq_len - sequence_lengths[:, None] # [batch_size, 1]
mask = token_indices >= start_indices # [B, T]
else:
raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side}")
mask = mask[:, :, None, None] # [batch_size, seq_len] --> [batch_size, seq_len, 1, 1]
# Compute masked mean over non-padding positions of shape (batch_size, 1, 1, num_layers)
masked_text_hidden_states = text_hidden_states.masked_fill(~mask, 0.0)
num_valid_positions = (sequence_lengths * hidden_dim).view(batch_size, 1, 1, 1)
masked_mean = masked_text_hidden_states.sum(dim=(1, 2), keepdim=True) / (num_valid_positions + eps)
# Compute min/max over non-padding positions of shape (batch_size, 1, 1, num_layers)
x_min = text_hidden_states.masked_fill(~mask, float("inf")).amin(dim=(1, 2), keepdim=True)
x_max = text_hidden_states.masked_fill(~mask, float("-inf")).amax(dim=(1, 2), keepdim=True)
# Normalization
normalized_hidden_states = (text_hidden_states - masked_mean) / (x_max - x_min + eps)
normalized_hidden_states = normalized_hidden_states * scale_factor
# Pack the hidden states to a 3D tensor (batch_size, seq_len, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.flatten(2)
mask_flat = mask.squeeze(-1).expand(-1, -1, hidden_dim * num_layers)
normalized_hidden_states = normalized_hidden_states.masked_fill(~mask_flat, 0.0)
normalized_hidden_states = normalized_hidden_states.to(dtype=original_dtype)
return normalized_hidden_states
# Copied from diffusers.pipelines.ltx2.pipeline_ltx2.LTX2Pipeline._get_gemma_prompt_embeds
def _get_gemma_prompt_embeds(
self,
@@ -392,16 +326,7 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
)
text_encoder_hidden_states = text_encoder_outputs.hidden_states
text_encoder_hidden_states = torch.stack(text_encoder_hidden_states, dim=-1)
sequence_lengths = prompt_attention_mask.sum(dim=-1)
prompt_embeds = self._pack_text_embeds(
text_encoder_hidden_states,
sequence_lengths,
device=device,
padding_side=self.tokenizer.padding_side,
scale_factor=scale_factor,
)
prompt_embeds = prompt_embeds.to(dtype=dtype)
prompt_embeds = text_encoder_hidden_states.flatten(2, 3).to(dtype=dtype) # Pack to 3D
# duplicate text embeddings for each generation per prompt, using mps friendly method
_, seq_len, _ = prompt_embeds.shape
@@ -500,6 +425,57 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
return prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask
@torch.no_grad()
def enhance_prompt(
self,
image: PipelineImageInput,
prompt: str,
system_prompt: str,
max_new_tokens: int = 512,
seed: int = 10,
generator: torch.Generator | None = None,
generation_kwargs: dict[str, Any] | None = None,
device: str | torch.device | None = None,
):
"""
Enhances the supplied `prompt` by using the current text encoder (by default a
`transformers.Gemma3ForConditionalGeneration` model) to generate a new prompt from it and a system prompt.
"""
device = device or self._execution_device
if generation_kwargs is None:
# Set to default generation kwargs
generation_kwargs = {"do_sample": True, "temperature": 0.7}
messages = [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": f"User Raw Input Prompt: {prompt}."},
],
},
]
template = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = self.processor(text=template, images=image, return_tensors="pt").to(device)
self.text_encoder.to(device)
# `transformers.GenerationMixin.generate` does not support using a `torch.Generator` to control randomness,
# so manually apply a seed for reproducible generation.
if generator is not None:
# Overwrite seed to generator's initial seed
seed = generator.initial_seed()
torch.manual_seed(seed)
generated_sequences = self.text_encoder.generate(
**model_inputs,
max_new_tokens=max_new_tokens,
**generation_kwargs,
) # tensor of shape [batch_size, seq_len]
generated_ids = [seq[len(model_inputs.input_ids[i]) :] for i, seq in enumerate(generated_sequences)]
enhanced_prompt = self.processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
return enhanced_prompt
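An illustrative usage sketch of the prompt-enhancement path above; the checkpoint path, image file, and system prompt text are assumptions, and a `processor` component must be loaded with the pipeline.
import torch
from PIL import Image
from diffusers import LTX2ImageToVideoPipeline  # assumes the pipeline is exported under this name

pipe = LTX2ImageToVideoPipeline.from_pretrained("path/to/ltx-2.3-checkpoint", torch_dtype=torch.bfloat16)
pipe.to("cuda")

init_image = Image.open("example.jpg")  # any conditioning image
enhanced = pipe.enhance_prompt(
    image=init_image,
    prompt="a dog surfing",
    system_prompt="Rewrite the user's prompt into a detailed, cinematic video description.",
    max_new_tokens=256,
)
print(enhanced[0])  # batch_decode returns a list of strings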
# Copied from diffusers.pipelines.ltx2.pipeline_ltx2.LTX2Pipeline.check_inputs
def check_inputs(
self,
@@ -511,6 +487,9 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
spatio_temporal_guidance_blocks=None,
stg_scale=None,
audio_stg_scale=None,
):
if height % 32 != 0 or width % 32 != 0:
raise ValueError(f"`height` and `width` have to be divisible by 32 but are {height} and {width}.")
@@ -554,6 +533,12 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
f" {negative_prompt_attention_mask.shape}."
)
if ((stg_scale > 0.0) or (audio_stg_scale > 0.0)) and not spatio_temporal_guidance_blocks:
raise ValueError(
"Spatio-Temporal Guidance (STG) is specified but no STG blocks are supplied. Please supply a list of"
"block indices at which to apply STG in `spatio_temporal_guidance_blocks`"
)
@staticmethod
# Copied from diffusers.pipelines.ltx2.pipeline_ltx2.LTX2Pipeline._pack_latents
def _pack_latents(latents: torch.Tensor, patch_size: int = 1, patch_size_t: int = 1) -> torch.Tensor:
@@ -788,7 +773,6 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
latents = self._create_noised_state(latents, noise_scale, generator)
return latents.to(device=device, dtype=dtype)
# TODO: confirm whether this logic is correct
latent_mel_bins = num_mel_bins // self.audio_vae_mel_compression_ratio
shape = (batch_size, num_channels_latents, audio_latent_length, latent_mel_bins)
@@ -803,6 +787,24 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
latents = self._pack_audio_latents(latents)
return latents
def convert_velocity_to_x0(
self, sample: torch.Tensor, denoised_output: torch.Tensor, step_idx: int, scheduler: Any | None = None
) -> torch.Tensor:
if scheduler is None:
scheduler = self.scheduler
sample_x0 = sample - denoised_output * scheduler.sigmas[step_idx]
return sample_x0
def convert_x0_to_velocity(
self, sample: torch.Tensor, denoised_output: torch.Tensor, step_idx: int, scheduler: Any | None = None
) -> torch.Tensor:
if scheduler is None:
scheduler = self.scheduler
sample_v = (sample - denoised_output) / scheduler.sigmas[step_idx]
return sample_v
@property
def guidance_scale(self):
return self._guidance_scale
@@ -811,9 +813,41 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
def guidance_rescale(self):
return self._guidance_rescale
@property
def stg_scale(self):
return self._stg_scale
@property
def modality_scale(self):
return self._modality_scale
@property
def audio_guidance_scale(self):
return self._audio_guidance_scale
@property
def audio_guidance_rescale(self):
return self._audio_guidance_rescale
@property
def audio_stg_scale(self):
return self._audio_stg_scale
@property
def audio_modality_scale(self):
return self._audio_modality_scale
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1.0
return (self._guidance_scale > 1.0) or (self._audio_guidance_scale > 1.0)
@property
def do_spatio_temporal_guidance(self):
return (self._stg_scale > 0.0) or (self._audio_stg_scale > 0.0)
@property
def do_modality_isolation_guidance(self):
return (self._modality_scale > 1.0) or (self._audio_modality_scale > 1.0)
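A small sketch of how the guidance toggles above determine the number of transformer forward passes per denoising step; the CFG pass is batched as [uncond, cond], so it stays a single call (names illustrative).
def passes_per_step(do_cfg: bool, do_stg: bool, do_modality_isolation: bool) -> int:
    passes = 1                             # main pass (batch of 2 when CFG is active)
    passes += int(do_stg)                  # extra STG-perturbed pass
    passes += int(do_modality_isolation)   # extra pass with cross-modality attention disabled
    return passes

assert passes_per_step(True, True, True) == 3
assert passes_per_step(True, False, False) == 1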
@property
def num_timesteps(self):
@@ -846,7 +880,14 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
sigmas: list[float] | None = None,
timesteps: list[int] | None = None,
guidance_scale: float = 4.0,
stg_scale: float = 0.0,
modality_scale: float = 1.0,
guidance_rescale: float = 0.0,
audio_guidance_scale: float | None = None,
audio_stg_scale: float | None = None,
audio_modality_scale: float | None = None,
audio_guidance_rescale: float | None = None,
spatio_temporal_guidance_blocks: list[int] | None = None,
noise_scale: float = 0.0,
num_videos_per_prompt: int = 1,
generator: torch.Generator | list[torch.Generator] | None = None,
@@ -858,6 +899,11 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
negative_prompt_attention_mask: torch.Tensor | None = None,
decode_timestep: float | list[float] = 0.0,
decode_noise_scale: float | list[float] | None = None,
use_cross_timestep: bool = False,
system_prompt: str | None = None,
prompt_max_new_tokens: int = 512,
prompt_enhancement_kwargs: dict[str, Any] | None = None,
prompt_enhancement_seed: int = 10,
output_type: str = "pil",
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
@@ -898,13 +944,47 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
`guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
the text `prompt`, usually at the expense of lower image quality.
the text `prompt`, usually at the expense of lower image quality. Used for the video modality (there is
a separate value `audio_guidance_scale` for the audio modality).
stg_scale (`float`, *optional*, defaults to `0.0`):
Video guidance scale for Spatio-Temporal Guidance (STG), proposed in [Spatiotemporal Skip Guidance for
Enhanced Video Diffusion Sampling](https://arxiv.org/abs/2411.18664). STG uses a CFG-like estimate
where we move the sample away from a weaker sample produced by a perturbed version of the denoising model.
Enabling STG will result in an additional denoising model forward pass; the default value of `0.0`
means that STG is disabled.
modality_scale (`float`, *optional*, defaults to `1.0`):
Video guidance scale for LTX-2.X modality isolation guidance, where we move the sample away from a
weaker sample generated by the denoising model with cross-modality (audio-to-video and video-to-audio)
cross attention disabled using a CFG-like estimate. Enabling modality guidance will result in an
additional denoising model forward pass; the default value of `1.0` means that modality guidance is
disabled.
guidance_rescale (`float`, *optional*, defaults to 0.0):
Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
Flawed](https://huggingface.co/papers/2305.08891) `guidance_scale` is defined as `φ` in equation 16. of
[Common Diffusion Noise Schedules and Sample Steps are
Flawed](https://huggingface.co/papers/2305.08891). Guidance rescale factor should fix overexposure when
using zero terminal SNR.
using zero terminal SNR. Used for the video modality.
audio_guidance_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for CFG with respect to the negative prompt. The CFG update rule is the same for
video and audio, but they can use different values for the guidance scale. The LTX-2.X authors suggest
that the `audio_guidance_scale` should be higher relative to the video `guidance_scale` (e.g. for
LTX-2.3 they suggest 3.0 for video and 7.0 for audio). If `None`, defaults to the video value
`guidance_scale`.
audio_stg_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for STG. As with CFG, the STG update rule is otherwise the same for video and
audio. For LTX-2.3, a value of 1.0 is suggested for both video and audio. If `None`, defaults to the
video value `stg_scale`.
audio_modality_scale (`float`, *optional*, defaults to `None`):
Audio guidance scale for LTX-2.X modality isolation guidance. As with CFG, the modality guidance rule
is otherwise the same for video and audio. For LTX-2.3, a value of 3.0 is suggested for both video and
audio. If `None`, defaults to the video value `modality_scale`.
audio_guidance_rescale (`float`, *optional*, defaults to `None`):
A separate guidance rescale factor for the audio modality. If `None`, defaults to the video value
`guidance_rescale`.
spatio_temporal_guidance_blocks (`list[int]`, *optional*, defaults to `None`):
The zero-indexed transformer block indices at which to apply STG. Must be supplied if STG is used
(`stg_scale` or `audio_stg_scale` is greater than `0`). A value of `[29]` is recommended for LTX-2.0
and `[28]` is recommended for LTX-2.3.
noise_scale (`float`, *optional*, defaults to `0.0`):
The interpolation factor between random noise and denoised latents at each timestep. Noise is applied to
the `latents` and `audio_latents` before continuing denoising.
@@ -935,6 +1015,24 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
The timestep at which generated video is decoded.
decode_noise_scale (`float`, defaults to `None`):
The interpolation factor between random noise and denoised latents at the decode timestep.
use_cross_timestep (`bool`, *optional*, defaults to `False`):
Whether to use the cross-modality sigma (audio is the cross modality of video, and vice versa) when
calculating the cross attention modulation parameters. `True` is the newer (e.g. LTX-2.3) behavior;
`False` is the legacy LTX-2.0 behavior.
system_prompt (`str`, *optional*, defaults to `None`):
Optional system prompt to use for prompt enhancement. The system prompt will be used by the current
text encoder (by default, a `Gemma3ForConditionalGeneration` model) to generate an enhanced prompt from
the original `prompt` to condition generation. If not supplied, prompt enhancement will not be
performed.
prompt_max_new_tokens (`int`, *optional*, defaults to `512`):
The maximum number of new tokens to generate when performing prompt enhancement.
prompt_enhancement_kwargs (`dict[str, Any]`, *optional*, defaults to `None`):
Keyword arguments for `self.text_encoder.generate`. If not supplied, default arguments of
`do_sample=True` and `temperature=0.7` will be used. See
https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate
for more details.
prompt_enhancement_seed (`int`, *optional*, defaults to `10`):
Random seed for any random operations during prompt enhancement.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
@@ -967,6 +1065,11 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
audio_guidance_scale = audio_guidance_scale or guidance_scale
audio_stg_scale = audio_stg_scale or stg_scale
audio_modality_scale = audio_modality_scale or modality_scale
audio_guidance_rescale = audio_guidance_rescale or guidance_rescale
# 1. Check inputs. Raise error if not correct
self.check_inputs(
prompt=prompt,
@@ -977,10 +1080,21 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
negative_prompt_embeds=negative_prompt_embeds,
prompt_attention_mask=prompt_attention_mask,
negative_prompt_attention_mask=negative_prompt_attention_mask,
spatio_temporal_guidance_blocks=spatio_temporal_guidance_blocks,
stg_scale=stg_scale,
audio_stg_scale=audio_stg_scale,
)
# Per-modality guidance scales (video, audio)
self._guidance_scale = guidance_scale
self._stg_scale = stg_scale
self._modality_scale = modality_scale
self._guidance_rescale = guidance_rescale
self._audio_guidance_scale = audio_guidance_scale
self._audio_stg_scale = audio_stg_scale
self._audio_modality_scale = audio_modality_scale
self._audio_guidance_rescale = audio_guidance_rescale
self._attention_kwargs = attention_kwargs
self._interrupt = False
self._current_timestep = None
@@ -996,6 +1110,18 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
device = self._execution_device
# 3. Prepare text embeddings
if system_prompt is not None and prompt is not None:
prompt = self.enhance_prompt(
image=image,
prompt=prompt,
system_prompt=system_prompt,
max_new_tokens=prompt_max_new_tokens,
seed=prompt_enhancement_seed,
generator=generator,
generation_kwargs=prompt_enhancement_kwargs,
device=device,
)
(
prompt_embeds,
prompt_attention_mask,
@@ -1017,9 +1143,11 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
prompt_attention_mask = torch.cat([negative_prompt_attention_mask, prompt_attention_mask], dim=0)
additive_attention_mask = (1 - prompt_attention_mask.to(prompt_embeds.dtype)) * -1000000.0
tokenizer_padding_side = "left" # Padding side for default Gemma3-12B text encoder
if getattr(self, "tokenizer", None) is not None:
tokenizer_padding_side = getattr(self.tokenizer, "padding_side", "left")
connector_prompt_embeds, connector_audio_prompt_embeds, connector_attention_mask = self.connectors(
prompt_embeds, additive_attention_mask, additive_mask=True
prompt_embeds, prompt_attention_mask, padding_side=tokenizer_padding_side
)
# 4. Prepare latent variables
@@ -1041,7 +1169,7 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
raise ValueError(
f"Provided `latents` tensor has shape {latents.shape}, but the expected shape is either [batch_size, seq_len, num_features] or [batch_size, latent_dim, latent_frames, latent_height, latent_width]."
)
video_sequence_length = latent_num_frames * latent_height * latent_width
# video_sequence_length = latent_num_frames * latent_height * latent_width
if latents is None:
image = self.video_processor.preprocess(image, height=height, width=width)
@@ -1105,7 +1233,7 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
# 5. Prepare timesteps
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps) if sigmas is None else sigmas
mu = calculate_shift(
video_sequence_length,
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_image_seq_len", 1024),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.95),
@@ -1134,11 +1262,6 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
self._num_timesteps = len(timesteps)
# 6. Prepare micro-conditions
rope_interpolation_scale = (
self.vae_temporal_compression_ratio / frame_rate,
self.vae_spatial_compression_ratio,
self.vae_spatial_compression_ratio,
)
# Pre-compute video and audio positional ids as they will be the same at each step of the denoising loop
video_coords = self.transformer.rope.prepare_video_coords(
latents.shape[0], latent_num_frames, latent_height, latent_width, latents.device, fps=frame_rate
@@ -1177,6 +1300,7 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
audio_encoder_hidden_states=connector_audio_prompt_embeds,
timestep=video_timestep,
audio_timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=connector_attention_mask,
audio_encoder_attention_mask=connector_attention_mask,
num_frames=latent_num_frames,
@@ -1186,7 +1310,10 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
audio_num_frames=audio_num_frames,
video_coords=video_coords,
audio_coords=audio_coords,
# rope_interpolation_scale=rope_interpolation_scale,
isolate_modalities=False,
spatio_temporal_guidance_blocks=None,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
@@ -1194,24 +1321,154 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
noise_pred_audio = noise_pred_audio.float()
if self.do_classifier_free_guidance:
noise_pred_video_uncond, noise_pred_video_text = noise_pred_video.chunk(2)
noise_pred_video = noise_pred_video_uncond + self.guidance_scale * (
noise_pred_video_text - noise_pred_video_uncond
noise_pred_video_uncond_text, noise_pred_video = noise_pred_video.chunk(2)
noise_pred_video = self.convert_velocity_to_x0(latents, noise_pred_video, i, self.scheduler)
noise_pred_video_uncond_text = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_text, i, self.scheduler
)
# Use delta formulation as it works more nicely with multiple guidance terms
video_cfg_delta = (self.guidance_scale - 1) * (noise_pred_video - noise_pred_video_uncond_text)
noise_pred_audio_uncond_text, noise_pred_audio = noise_pred_audio.chunk(2)
noise_pred_audio = self.convert_velocity_to_x0(audio_latents, noise_pred_audio, i, audio_scheduler)
noise_pred_audio_uncond_text = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_text, i, audio_scheduler
)
audio_cfg_delta = (self.audio_guidance_scale - 1) * (
noise_pred_audio - noise_pred_audio_uncond_text
)
noise_pred_audio_uncond, noise_pred_audio_text = noise_pred_audio.chunk(2)
noise_pred_audio = noise_pred_audio_uncond + self.guidance_scale * (
noise_pred_audio_text - noise_pred_audio_uncond
# Get positive values from merged CFG inputs in case we need to do other DiT forward passes
if self.do_spatio_temporal_guidance or self.do_modality_isolation_guidance:
if i == 0:
# Only split values that remain constant throughout the loop once
video_prompt_embeds = connector_prompt_embeds.chunk(2, dim=0)[1]
audio_prompt_embeds = connector_audio_prompt_embeds.chunk(2, dim=0)[1]
prompt_attn_mask = connector_attention_mask.chunk(2, dim=0)[1]
video_pos_ids = video_coords.chunk(2, dim=0)[0]
audio_pos_ids = audio_coords.chunk(2, dim=0)[0]
# Split values that vary each denoising loop iteration
timestep = timestep.chunk(2, dim=0)[0]
video_timestep = video_timestep.chunk(2, dim=0)[0]
else:
video_cfg_delta = audio_cfg_delta = 0
video_prompt_embeds = connector_prompt_embeds
audio_prompt_embeds = connector_audio_prompt_embeds
prompt_attn_mask = connector_attention_mask
video_pos_ids = video_coords
audio_pos_ids = audio_coords
noise_pred_video = self.convert_velocity_to_x0(latents, noise_pred_video, i, self.scheduler)
noise_pred_audio = self.convert_velocity_to_x0(audio_latents, noise_pred_audio, i, audio_scheduler)
if self.do_spatio_temporal_guidance:
with self.transformer.cache_context("uncond_stg"):
noise_pred_video_uncond_stg, noise_pred_audio_uncond_stg = self.transformer(
hidden_states=latents.to(dtype=prompt_embeds.dtype),
audio_hidden_states=audio_latents.to(dtype=prompt_embeds.dtype),
encoder_hidden_states=video_prompt_embeds,
audio_encoder_hidden_states=audio_prompt_embeds,
timestep=video_timestep,
audio_timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=prompt_attn_mask,
audio_encoder_attention_mask=prompt_attn_mask,
num_frames=latent_num_frames,
height=latent_height,
width=latent_width,
fps=frame_rate,
audio_num_frames=audio_num_frames,
video_coords=video_pos_ids,
audio_coords=audio_pos_ids,
isolate_modalities=False,
# Use STG at given blocks to perturb model
spatio_temporal_guidance_blocks=spatio_temporal_guidance_blocks,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
noise_pred_video_uncond_stg = noise_pred_video_uncond_stg.float()
noise_pred_audio_uncond_stg = noise_pred_audio_uncond_stg.float()
noise_pred_video_uncond_stg = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_stg, i, self.scheduler
)
noise_pred_audio_uncond_stg = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_stg, i, audio_scheduler
)
if self.guidance_rescale > 0:
# Based on 3.4. in https://huggingface.co/papers/2305.08891
noise_pred_video = rescale_noise_cfg(
noise_pred_video, noise_pred_video_text, guidance_rescale=self.guidance_rescale
)
noise_pred_audio = rescale_noise_cfg(
noise_pred_audio, noise_pred_audio_text, guidance_rescale=self.guidance_rescale
video_stg_delta = self.stg_scale * (noise_pred_video - noise_pred_video_uncond_stg)
audio_stg_delta = self.audio_stg_scale * (noise_pred_audio - noise_pred_audio_uncond_stg)
else:
video_stg_delta = audio_stg_delta = 0
if self.do_modality_isolation_guidance:
with self.transformer.cache_context("uncond_modality"):
noise_pred_video_uncond_modality, noise_pred_audio_uncond_modality = self.transformer(
hidden_states=latents.to(dtype=prompt_embeds.dtype),
audio_hidden_states=audio_latents.to(dtype=prompt_embeds.dtype),
encoder_hidden_states=video_prompt_embeds,
audio_encoder_hidden_states=audio_prompt_embeds,
timestep=video_timestep,
audio_timestep=timestep,
sigma=timestep, # Used by LTX-2.3
encoder_attention_mask=prompt_attn_mask,
audio_encoder_attention_mask=prompt_attn_mask,
num_frames=latent_num_frames,
height=latent_height,
width=latent_width,
fps=frame_rate,
audio_num_frames=audio_num_frames,
video_coords=video_pos_ids,
audio_coords=audio_pos_ids,
# Turn off A2V and V2A cross attn to isolate video and audio modalities
isolate_modalities=True,
spatio_temporal_guidance_blocks=None,
perturbation_mask=None,
use_cross_timestep=use_cross_timestep,
attention_kwargs=attention_kwargs,
return_dict=False,
)
noise_pred_video_uncond_modality = noise_pred_video_uncond_modality.float()
noise_pred_audio_uncond_modality = noise_pred_audio_uncond_modality.float()
noise_pred_video_uncond_modality = self.convert_velocity_to_x0(
latents, noise_pred_video_uncond_modality, i, self.scheduler
)
noise_pred_audio_uncond_modality = self.convert_velocity_to_x0(
audio_latents, noise_pred_audio_uncond_modality, i, audio_scheduler
)
video_modality_delta = (self.modality_scale - 1) * (
noise_pred_video - noise_pred_video_uncond_modality
)
audio_modality_delta = (self.audio_modality_scale - 1) * (
noise_pred_audio - noise_pred_audio_uncond_modality
)
else:
video_modality_delta = audio_modality_delta = 0
# Now apply all guidance terms
noise_pred_video_g = noise_pred_video + video_cfg_delta + video_stg_delta + video_modality_delta
noise_pred_audio_g = noise_pred_audio + audio_cfg_delta + audio_stg_delta + audio_modality_delta
# Apply LTX-2.X guidance rescaling
if self.guidance_rescale > 0:
noise_pred_video = rescale_noise_cfg(
noise_pred_video_g, noise_pred_video, guidance_rescale=self.guidance_rescale
)
else:
noise_pred_video = noise_pred_video_g
if self.audio_guidance_rescale > 0:
noise_pred_audio = rescale_noise_cfg(
noise_pred_audio_g, noise_pred_audio, guidance_rescale=self.audio_guidance_rescale
)
else:
noise_pred_audio = noise_pred_audio_g
# compute the previous noisy sample x_t -> x_t-1
noise_pred_video = self._unpack_latents(
@@ -1231,6 +1488,10 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
self.transformer_temporal_patch_size,
)
# Convert back to velocity for scheduler
noise_pred_video = self.convert_x0_to_velocity(latents, noise_pred_video, i, self.scheduler)
noise_pred_audio = self.convert_x0_to_velocity(audio_latents, noise_pred_audio, i, audio_scheduler)
noise_pred_video = noise_pred_video[:, :, 1:]
noise_latents = latents[:, :, 1:]
pred_latents = self.scheduler.step(noise_pred_video, t, noise_latents, return_dict=False)[0]
@@ -1268,9 +1529,6 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
self.transformer_spatial_patch_size,
self.transformer_temporal_patch_size,
)
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
audio_latents = self._denormalize_audio_latents(
audio_latents, self.audio_vae.latents_mean, self.audio_vae.latents_std
@@ -1278,6 +1536,9 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
audio_latents = self._unpack_audio_latents(audio_latents, audio_num_frames, num_mel_bins=latent_mel_bins)
if output_type == "latent":
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
video = latents
audio = audio_latents
else:
@@ -1300,6 +1561,10 @@ class LTX2ImageToVideoPipeline(DiffusionPipeline, FromSingleFileMixin, LTX2LoraL
]
latents = (1 - decode_noise_scale) * latents + decode_noise_scale * noise
latents = self._denormalize_latents(
latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
)
latents = latents.to(self.vae.dtype)
video = self.vae.decode(latents, timestep, return_dict=False)[0]
video = self.video_processor.postprocess_video(video, output_type=output_type)
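For reference, the guidance logic in the denoising loop above combines all terms additively in x0 space before converting back to velocity. A minimal sketch of that delta formulation, under the assumption that the x0_* tensors are already converted from velocity predictions (illustrative names, not pipeline API):

def combine_guidance_x0(x0_cond, x0_uncond, x0_stg, x0_modality, cfg_scale, stg_scale, modality_scale):
    cfg_delta = (cfg_scale - 1) * (x0_cond - x0_uncond)                 # classifier-free guidance delta
    stg_delta = stg_scale * (x0_cond - x0_stg)                          # spatio-temporal guidance delta
    modality_delta = (modality_scale - 1) * (x0_cond - x0_modality)     # modality isolation delta
    return x0_cond + cfg_delta + stg_delta + modality_delta

The same combination is applied independently to the video and audio branches, each with its own scales and optional guidance rescaling.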

View File

@@ -8,6 +8,209 @@ from ...configuration_utils import ConfigMixin, register_to_config
from ...models.modeling_utils import ModelMixin
def kaiser_sinc_filter1d(cutoff: float, half_width: float, kernel_size: int) -> torch.Tensor:
"""
Creates a Kaiser sinc kernel for low-pass filtering.
Args:
cutoff (`float`):
Normalized frequency cutoff (relative to the sampling rate). Must be between 0 and 0.5 (the Nyquist
frequency).
half_width (`float`):
Used to determine the Kaiser window's beta parameter.
kernel_size (`int`):
Size of the Kaiser window (and ultimately the Kaiser sinc kernel).
Returns:
`torch.Tensor` of shape `(kernel_size,)`:
The Kaiser sinc kernel.
"""
delta_f = 4 * half_width
half_size = kernel_size // 2
amplitude = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
if amplitude > 50.0:
beta = 0.1102 * (amplitude - 8.7)
elif amplitude >= 21.0:
beta = 0.5842 * (amplitude - 21) ** 0.4 + 0.07886 * (amplitude - 21.0)
else:
beta = 0.0
window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
even = kernel_size % 2 == 0
time = torch.arange(-half_size, half_size) + 0.5 if even else torch.arange(kernel_size) - half_size
if cutoff == 0.0:
filter = torch.zeros_like(time)
else:
time = 2 * cutoff * time
sinc = torch.where(
time == 0,
torch.ones_like(time),
torch.sin(math.pi * time) / math.pi / time,
)
filter = 2 * cutoff * window * sinc
filter = filter / filter.sum()
return filter
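A minimal usage sketch of the filter builder above, using the same cutoff/half-width convention as DownSample1d below (assumes torch and kaiser_sinc_filter1d as defined here are in scope):

import torch

ratio = 2
kernel = kaiser_sinc_filter1d(cutoff=0.5 / ratio, half_width=0.6 / ratio, kernel_size=12)
print(kernel.shape)                                        # torch.Size([12])
print(torch.isclose(kernel.sum(), torch.tensor(1.0)))      # normalized to unit DC gain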
class DownSample1d(nn.Module):
"""1D low-pass filter for antialias downsampling."""
def __init__(
self,
ratio: int = 2,
kernel_size: int | None = None,
use_padding: bool = True,
padding_mode: str = "replicate",
persistent: bool = True,
):
super().__init__()
self.ratio = ratio
self.kernel_size = kernel_size or int(6 * ratio // 2) * 2
self.pad_left = self.kernel_size // 2 + (self.kernel_size % 2) - 1
self.pad_right = self.kernel_size // 2
self.use_padding = use_padding
self.padding_mode = padding_mode
cutoff = 0.5 / ratio
half_width = 0.6 / ratio
low_pass_filter = kaiser_sinc_filter1d(cutoff, half_width, self.kernel_size)
self.register_buffer("filter", low_pass_filter.view(1, 1, self.kernel_size), persistent=persistent)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# x expected shape: [batch_size, num_channels, hidden_dim]
num_channels = x.shape[1]
if self.use_padding:
x = F.pad(x, (self.pad_left, self.pad_right), mode=self.padding_mode)
x_filtered = F.conv1d(x, self.filter.expand(num_channels, -1, -1), stride=self.ratio, groups=num_channels)
return x_filtered
class UpSample1d(nn.Module):
def __init__(
self,
ratio: int = 2,
kernel_size: int | None = None,
window_type: str = "kaiser",
padding_mode: str = "replicate",
persistent: bool = True,
):
super().__init__()
self.ratio = ratio
self.padding_mode = padding_mode
if window_type == "hann":
rolloff = 0.99
lowpass_filter_width = 6
width = math.ceil(lowpass_filter_width / rolloff)
self.kernel_size = 2 * width * ratio + 1
self.pad = width
self.pad_left = 2 * width * ratio
self.pad_right = self.kernel_size - ratio
time_axis = (torch.arange(self.kernel_size) / ratio - width) * rolloff
time_clamped = time_axis.clamp(-lowpass_filter_width, lowpass_filter_width)
window = torch.cos(time_clamped * math.pi / lowpass_filter_width / 2) ** 2
sinc_filter = (torch.sinc(time_axis) * window * rolloff / ratio).view(1, 1, -1)
else:
# Kaiser sinc filter is BigVGAN default
self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
self.pad = self.kernel_size // ratio - 1
self.pad_left = self.pad * self.ratio + (self.kernel_size - self.ratio) // 2
self.pad_right = self.pad * self.ratio + (self.kernel_size - self.ratio + 1) // 2
sinc_filter = kaiser_sinc_filter1d(
cutoff=0.5 / ratio,
half_width=0.6 / ratio,
kernel_size=self.kernel_size,
)
self.register_buffer("filter", sinc_filter.view(1, 1, self.kernel_size), persistent=persistent)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# x expected shape: [batch_size, num_channels, hidden_dim]
num_channels = x.shape[1]
x = F.pad(x, (self.pad, self.pad), mode=self.padding_mode)
low_pass_filter = self.filter.to(dtype=x.dtype, device=x.device).expand(num_channels, -1, -1)
x = self.ratio * F.conv_transpose1d(x, low_pass_filter, stride=self.ratio, groups=num_channels)
return x[..., self.pad_left : -self.pad_right]
class AntiAliasAct1d(nn.Module):
"""
Antialiasing activation for a 1D signal: upsamples, applies an activation (usually snakebeta), and then downsamples
to avoid aliasing.
"""
def __init__(
self,
act_fn: str | nn.Module,
ratio: int = 2,
kernel_size: int = 12,
**kwargs,
):
super().__init__()
self.upsample = UpSample1d(ratio=ratio, kernel_size=kernel_size)
if isinstance(act_fn, str):
if act_fn == "snakebeta":
act_fn = SnakeBeta(**kwargs)
elif act_fn == "snake":
act_fn = SnakeBeta(**kwargs)
else:
act_fn = nn.LeakyReLU(**kwargs)
self.act = act_fn
self.downsample = DownSample1d(ratio=ratio, kernel_size=kernel_size)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.upsample(x)
x = self.act(x)
x = self.downsample(x)
return x
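Shape-wise, the wrapper above should be length-preserving for the default ratio of 2: the signal is upsampled to twice the length, activated, and low-pass filtered back down. A rough sketch, assuming the classes defined in this file are in scope and that the up/downsample padding preserves length as intended:

import torch

act = AntiAliasAct1d("snakebeta", ratio=2, kernel_size=12, channels=8)
x = torch.randn(1, 8, 64)        # [batch_size, num_channels, length]
y = act(x)
print(y.shape)                   # expected torch.Size([1, 8, 64]); internally 64 -> 128 -> 64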
class SnakeBeta(nn.Module):
"""
Implements the Snake and SnakeBeta activations, which help with learning periodic patterns.
"""
def __init__(
self,
channels: int,
alpha: float = 1.0,
eps: float = 1e-9,
trainable_params: bool = True,
logscale: bool = True,
use_beta: bool = True,
):
super().__init__()
self.eps = eps
self.logscale = logscale
self.use_beta = use_beta
self.alpha = nn.Parameter(torch.zeros(channels) if self.logscale else torch.ones(channels) * alpha)
self.alpha.requires_grad = trainable_params
if use_beta:
self.beta = nn.Parameter(torch.zeros(channels) if self.logscale else torch.ones(channels) * alpha)
self.beta.requires_grad = trainable_params
def forward(self, hidden_states: torch.Tensor, channel_dim: int = 1) -> torch.Tensor:
broadcast_shape = [1] * hidden_states.ndim
broadcast_shape[channel_dim] = -1
alpha = self.alpha.view(broadcast_shape)
if self.use_beta:
beta = self.beta.view(broadcast_shape)
if self.logscale:
alpha = torch.exp(alpha)
if self.use_beta:
beta = torch.exp(beta)
amplitude = beta if self.use_beta else alpha
hidden_states = hidden_states + (1.0 / (amplitude + self.eps)) * torch.sin(hidden_states * alpha).pow(2)
return hidden_states
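With the default zero-initialized log-scale parameters, exp(alpha) = exp(beta) = 1, so the activation above reduces to approximately x + sin^2(x) at initialization. A quick numeric check, assuming SnakeBeta as defined above is in scope:

import torch

act = SnakeBeta(channels=4)
x = torch.randn(2, 4, 16)        # [batch_size, num_channels, length]
y = act(x)
print(torch.allclose(y, x + torch.sin(x).pow(2), atol=1e-6))  # True at initialization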
class ResBlock(nn.Module):
def __init__(
self,
@@ -15,12 +218,15 @@ class ResBlock(nn.Module):
kernel_size: int = 3,
stride: int = 1,
dilations: tuple[int, ...] = (1, 3, 5),
act_fn: str = "leaky_relu",
leaky_relu_negative_slope: float = 0.1,
antialias: bool = False,
antialias_ratio: int = 2,
antialias_kernel_size: int = 12,
padding_mode: str = "same",
):
super().__init__()
self.dilations = dilations
self.negative_slope = leaky_relu_negative_slope
self.convs1 = nn.ModuleList(
[
@@ -28,6 +234,18 @@ class ResBlock(nn.Module):
for dilation in dilations
]
)
self.acts1 = nn.ModuleList()
for _ in range(len(self.convs1)):
if act_fn == "snakebeta":
act = SnakeBeta(channels, use_beta=True)
elif act_fn == "snake":
act = SnakeBeta(channels, use_beta=False)
else:
act = nn.LeakyReLU(negative_slope=leaky_relu_negative_slope)
if antialias:
act = AntiAliasAct1d(act, ratio=antialias_ratio, kernel_size=antialias_kernel_size)
self.acts1.append(act)
self.convs2 = nn.ModuleList(
[
@@ -35,12 +253,24 @@ class ResBlock(nn.Module):
for _ in range(len(dilations))
]
)
self.acts2 = nn.ModuleList()
for _ in range(len(self.convs2)):
if act_fn == "snakebeta":
act = SnakeBeta(channels, use_beta=True)
elif act_fn == "snake":
act = SnakeBeta(channels, use_beta=False)
else:
act = nn.LeakyReLU(negative_slope=leaky_relu_negative_slope)
if antialias:
act = AntiAliasAct1d(act, ratio=antialias_ratio, kernel_size=antialias_kernel_size)
self.acts2.append(act)
def forward(self, x: torch.Tensor) -> torch.Tensor:
for conv1, conv2 in zip(self.convs1, self.convs2):
xt = F.leaky_relu(x, negative_slope=self.negative_slope)
for act1, conv1, act2, conv2 in zip(self.acts1, self.convs1, self.acts2, self.convs2):
xt = act1(x)
xt = conv1(xt)
xt = F.leaky_relu(xt, negative_slope=self.negative_slope)
xt = act2(xt)
xt = conv2(xt)
x = x + xt
return x
@@ -61,7 +291,13 @@ class LTX2Vocoder(ModelMixin, ConfigMixin):
upsample_factors: list[int] = [6, 5, 2, 2, 2],
resnet_kernel_sizes: list[int] = [3, 7, 11],
resnet_dilations: list[list[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
act_fn: str = "leaky_relu",
leaky_relu_negative_slope: float = 0.1,
antialias: bool = False,
antialias_ratio: int = 2,
antialias_kernel_size: int = 12,
final_act_fn: str | None = "tanh", # tanh, clamp, None
final_bias: bool = True,
output_sampling_rate: int = 24000,
):
super().__init__()
@@ -69,7 +305,9 @@ class LTX2Vocoder(ModelMixin, ConfigMixin):
self.resnets_per_upsample = len(resnet_kernel_sizes)
self.out_channels = out_channels
self.total_upsample_factor = math.prod(upsample_factors)
self.act_fn = act_fn
self.negative_slope = leaky_relu_negative_slope
self.final_act_fn = final_act_fn
if self.num_upsample_layers != len(upsample_factors):
raise ValueError(
@@ -83,6 +321,13 @@ class LTX2Vocoder(ModelMixin, ConfigMixin):
f" {len(self.resnets_per_upsample)} and {len(resnet_dilations)}, respectively."
)
supported_act_fns = ["snakebeta", "snake", "leaky_relu"]
if self.act_fn not in supported_act_fns:
raise ValueError(
f"Unsupported activation function: {self.act_fn}. Currently supported values of `act_fn` are "
f"{supported_act_fns}."
)
self.conv_in = nn.Conv1d(in_channels, hidden_channels, kernel_size=7, stride=1, padding=3)
self.upsamplers = nn.ModuleList()
@@ -103,15 +348,27 @@ class LTX2Vocoder(ModelMixin, ConfigMixin):
for kernel_size, dilations in zip(resnet_kernel_sizes, resnet_dilations):
self.resnets.append(
ResBlock(
output_channels,
kernel_size,
channels=output_channels,
kernel_size=kernel_size,
dilations=dilations,
act_fn=act_fn,
leaky_relu_negative_slope=leaky_relu_negative_slope,
antialias=antialias,
antialias_ratio=antialias_ratio,
antialias_kernel_size=antialias_kernel_size,
)
)
input_channels = output_channels
self.conv_out = nn.Conv1d(output_channels, out_channels, 7, stride=1, padding=3)
if act_fn == "snakebeta" or act_fn == "snake":
# Always use antialiasing
act_out = SnakeBeta(channels=output_channels, use_beta=True)
self.act_out = AntiAliasAct1d(act_out, ratio=antialias_ratio, kernel_size=antialias_kernel_size)
elif act_fn == "leaky_relu":
# NOTE: does NOT use self.negative_slope, following the original code
self.act_out = nn.LeakyReLU()
self.conv_out = nn.Conv1d(output_channels, out_channels, 7, stride=1, padding=3, bias=final_bias)
def forward(self, hidden_states: torch.Tensor, time_last: bool = False) -> torch.Tensor:
r"""
@@ -139,7 +396,9 @@ class LTX2Vocoder(ModelMixin, ConfigMixin):
hidden_states = self.conv_in(hidden_states)
for i in range(self.num_upsample_layers):
hidden_states = F.leaky_relu(hidden_states, negative_slope=self.negative_slope)
if self.act_fn == "leaky_relu":
# Other activations are inside each upsampling block
hidden_states = F.leaky_relu(hidden_states, negative_slope=self.negative_slope)
hidden_states = self.upsamplers[i](hidden_states)
# Run all resnets in parallel on hidden_states
@@ -149,10 +408,190 @@ class LTX2Vocoder(ModelMixin, ConfigMixin):
hidden_states = torch.mean(resnet_outputs, dim=0)
# NOTE: unlike the first leaky ReLU, this leaky ReLU is set to use the default F.leaky_relu negative slope of
# 0.01 (whereas the others usually use a slope of 0.1). Not sure if this is intended
hidden_states = F.leaky_relu(hidden_states, negative_slope=0.01)
hidden_states = self.act_out(hidden_states)
hidden_states = self.conv_out(hidden_states)
hidden_states = torch.tanh(hidden_states)
if self.final_act_fn == "tanh":
hidden_states = torch.tanh(hidden_states)
elif self.final_act_fn == "clamp":
hidden_states = torch.clamp(hidden_states, -1, 1)
return hidden_states
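As a rough length check with the default upsample_factors shown above, the total upsampling factor is 6 * 5 * 2 * 2 * 2 = 240, so each input mel frame maps to roughly 240 output samples at the default 24 kHz output rate (assuming the transposed-conv upsamplers follow the usual stride-times-length relationship):

import math

upsample_factors = [6, 5, 2, 2, 2]                    # defaults shown above
total_upsample_factor = math.prod(upsample_factors)   # 240
# e.g. 100 mel frames -> roughly 100 * 240 = 24000 samples, i.e. about 1 s at output_sampling_rate=24000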
class CausalSTFT(nn.Module):
"""
Performs a causal short-time Fourier transform (STFT) using causal Hann windows on a waveform. The DFT bases
multiplied by the Hann windows are pre-calculated and stored as buffers. For exact parity with training, these
buffers should be loaded from the checkpoint in bfloat16.
"""
def __init__(self, filter_length: int = 512, hop_length: int = 80, window_length: int = 512):
super().__init__()
self.hop_length = hop_length
self.window_length = window_length
n_freqs = filter_length // 2 + 1
self.register_buffer("forward_basis", torch.zeros(n_freqs * 2, 1, filter_length), persistent=True)
self.register_buffer("inverse_basis", torch.zeros(n_freqs * 2, 1, filter_length), persistent=True)
def forward(self, waveform: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
if waveform.ndim == 2:
waveform = waveform.unsqueeze(1) # [B, num_channels, num_samples]
left_pad = max(0, self.window_length - self.hop_length) # causal: left-only
waveform = F.pad(waveform, (left_pad, 0))
spec = F.conv1d(waveform, self.forward_basis, stride=self.hop_length, padding=0)
n_freqs = spec.shape[1] // 2
real, imag = spec[:, :n_freqs], spec[:, n_freqs:]
magnitude = torch.sqrt(real**2 + imag**2)
phase = torch.atan2(imag.float(), real.float()).to(dtype=real.dtype)
return magnitude, phase
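With the defaults (filter_length = window_length = 512, hop_length = 80), the causal left-only padding makes the frame count come out to exactly num_samples / hop_length whenever the waveform length is a multiple of the hop. A quick check of the conv1d arithmetic:

num_samples = 16000                        # 1 s at 16 kHz
left_pad = 512 - 80                        # 432 samples of causal (left-only) padding
num_frames = (num_samples + left_pad - 512) // 80 + 1
print(num_frames)                          # 200 == num_samples // 80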
class MelSTFT(nn.Module):
"""
Calculates a causal log-mel spectrogram from a waveform. Uses a pre-calculated mel filterbank, which should be
loaded from the checkpoint in bfloat16.
"""
def __init__(
self,
filter_length: int = 512,
hop_length: int = 80,
window_length: int = 512,
num_mel_channels: int = 64,
):
super().__init__()
self.stft_fn = CausalSTFT(filter_length, hop_length, window_length)
num_freqs = filter_length // 2 + 1
self.register_buffer("mel_basis", torch.zeros(num_mel_channels, num_freqs), persistent=True)
def forward(self, waveform: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
magnitude, phase = self.stft_fn(waveform)
energy = torch.norm(magnitude, dim=1)
mel = torch.matmul(self.mel_basis.to(magnitude.dtype), magnitude)
log_mel = torch.log(torch.clamp(mel, min=1e-5))
return log_mel, magnitude, phase, energy
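Shape-wise, with the defaults above the STFT yields 512 // 2 + 1 = 257 frequency bins and the mel projection is a plain matmul over that axis. A minimal sketch, assuming MelSTFT as defined above is in scope (its buffers are zero-initialized placeholders here; real values come from the checkpoint):

import torch

mel_stft = MelSTFT()
waveform = torch.randn(2, 16000)                      # [batch_size, num_samples], 1 s at 16 kHz
log_mel, magnitude, phase, energy = mel_stft(waveform)
print(magnitude.shape, log_mel.shape)                 # [2, 257, 200] and [2, 64, 200]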
class LTX2VocoderWithBWE(ModelMixin, ConfigMixin):
"""
LTX-2.X vocoder with bandwidth extension (BWE) upsampling. The vocoder and the BWE module run in sequence, with the
BWE module upsampling the vocoder output waveform to a higher sampling rate. The BWE module itself has the same
architecture as the original vocoder.
"""
@register_to_config
def __init__(
self,
in_channels: int = 128,
hidden_channels: int = 1536,
out_channels: int = 2,
upsample_kernel_sizes: list[int] = [11, 4, 4, 4, 4, 4],
upsample_factors: list[int] = [5, 2, 2, 2, 2, 2],
resnet_kernel_sizes: list[int] = [3, 7, 11],
resnet_dilations: list[list[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
act_fn: str = "snakebeta",
leaky_relu_negative_slope: float = 0.1,
antialias: bool = True,
antialias_ratio: int = 2,
antialias_kernel_size: int = 12,
final_act_fn: str | None = None,
final_bias: bool = False,
bwe_in_channels: int = 128,
bwe_hidden_channels: int = 512,
bwe_out_channels: int = 2,
bwe_upsample_kernel_sizes: list[int] = [12, 11, 4, 4, 4],
bwe_upsample_factors: list[int] = [6, 5, 2, 2, 2],
bwe_resnet_kernel_sizes: list[int] = [3, 7, 11],
bwe_resnet_dilations: list[list[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
bwe_act_fn: str = "snakebeta",
bwe_leaky_relu_negative_slope: float = 0.1,
bwe_antialias: bool = True,
bwe_antialias_ratio: int = 2,
bwe_antialias_kernel_size: int = 12,
bwe_final_act_fn: str | None = None,
bwe_final_bias: bool = False,
filter_length: int = 512,
hop_length: int = 80,
window_length: int = 512,
num_mel_channels: int = 64,
input_sampling_rate: int = 16000,
output_sampling_rate: int = 48000,
):
super().__init__()
self.vocoder = LTX2Vocoder(
in_channels=in_channels,
hidden_channels=hidden_channels,
out_channels=out_channels,
upsample_kernel_sizes=upsample_kernel_sizes,
upsample_factors=upsample_factors,
resnet_kernel_sizes=resnet_kernel_sizes,
resnet_dilations=resnet_dilations,
act_fn=act_fn,
leaky_relu_negative_slope=leaky_relu_negative_slope,
antialias=antialias,
antialias_ratio=antialias_ratio,
antialias_kernel_size=antialias_kernel_size,
final_act_fn=final_act_fn,
final_bias=final_bias,
output_sampling_rate=input_sampling_rate,
)
self.bwe_generator = LTX2Vocoder(
in_channels=bwe_in_channels,
hidden_channels=bwe_hidden_channels,
out_channels=bwe_out_channels,
upsample_kernel_sizes=bwe_upsample_kernel_sizes,
upsample_factors=bwe_upsample_factors,
resnet_kernel_sizes=bwe_resnet_kernel_sizes,
resnet_dilations=bwe_resnet_dilations,
act_fn=bwe_act_fn,
leaky_relu_negative_slope=bwe_leaky_relu_negative_slope,
antialias=bwe_antialias,
antialias_ratio=bwe_antialias_ratio,
antialias_kernel_size=bwe_antialias_kernel_size,
final_act_fn=bwe_final_act_fn,
final_bias=bwe_final_bias,
output_sampling_rate=output_sampling_rate,
)
self.mel_stft = MelSTFT(
filter_length=filter_length,
hop_length=hop_length,
window_length=window_length,
num_mel_channels=num_mel_channels,
)
self.resampler = UpSample1d(
ratio=output_sampling_rate // input_sampling_rate,
window_type="hann",
persistent=False,
)
def forward(self, mel_spec: torch.Tensor) -> torch.Tensor:
# 1. Run stage 1 vocoder to get low sampling rate waveform
x = self.vocoder(mel_spec)
batch_size, num_channels, num_samples = x.shape
# Pad to exact multiple of hop_length for exact mel frame count
remainder = num_samples % self.config.hop_length
if remainder != 0:
x = F.pad(x, (0, self.config.hop_length - remainder))
# 2. Compute mel spectrogram on vocoder output
mel, _, _, _ = self.mel_stft(x.flatten(0, 1))
mel = mel.unflatten(0, (-1, num_channels))
# 3. Run bandwidth extender (BWE) on new mel spectrogram
mel_for_bwe = mel.transpose(2, 3) # [B, C, num_mel_bins, num_frames] --> [B, C, num_frames, num_mel_bins]
residual = self.bwe_generator(mel_for_bwe)
# 4. Residual connection with resampler
skip = self.resampler(x)
waveform = torch.clamp(residual + skip, -1, 1)
output_samples = num_samples * self.config.output_sampling_rate // self.config.input_sampling_rate
waveform = waveform[..., :output_samples]
return waveform
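As a quick sanity check of the trimming at the end of the forward pass above, the default sampling rates make the BWE stage triple the sample count:

input_sampling_rate, output_sampling_rate = 16000, 48000   # defaults shown above
ratio = output_sampling_rate // input_sampling_rate        # 3, used by the hann-windowed resampler
num_samples = 16000                                        # 1 s of stage-1 vocoder output at 16 kHz
output_samples = num_samples * output_sampling_rate // input_sampling_rate
print(ratio, output_samples)                               # 3 48000, i.e. 1 s at 48 kHz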

View File

@@ -35,6 +35,8 @@ class PaintByExampleImageEncoder(CLIPPreTrainedModel):
# uncondition for scaling
self.uncond_vector = nn.Parameter(torch.randn((1, 1, self.proj_size)))
self.post_init()
def forward(self, pixel_values, return_uncond_vector=False):
clip_output = self.model(pixel_values=pixel_values)
latent_states = clip_output.pooler_output

View File

@@ -24,7 +24,7 @@ from transformers import Gemma2PreTrainedModel, GemmaTokenizer, GemmaTokenizerFa
from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...loaders import SanaLoraLoaderMixin
from ...models import AutoencoderDC, AutoencoderKLWan, SanaVideoTransformer3DModel
from ...models import AutoencoderDC, AutoencoderKLLTX2Video, AutoencoderKLWan, SanaVideoTransformer3DModel
from ...schedulers import DPMSolverMultistepScheduler
from ...utils import (
BACKENDS_MAPPING,
@@ -194,7 +194,7 @@ class SanaVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
The tokenizer used to tokenize the prompt.
text_encoder ([`Gemma2PreTrainedModel`]):
Text encoder model to encode the input prompts.
vae ([`AutoencoderKLWan` or `AutoencoderDCAEV`]):
vae ([`AutoencoderKLWan`, `AutoencoderDC`, or `AutoencoderKLLTX2Video`]):
Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
transformer ([`SanaVideoTransformer3DModel`]):
Conditional Transformer to denoise the input latents.
@@ -213,7 +213,7 @@ class SanaVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
self,
tokenizer: GemmaTokenizer | GemmaTokenizerFast,
text_encoder: Gemma2PreTrainedModel,
vae: AutoencoderDC | AutoencoderKLWan,
vae: AutoencoderDC | AutoencoderKLLTX2Video | AutoencoderKLWan,
transformer: SanaVideoTransformer3DModel,
scheduler: DPMSolverMultistepScheduler,
):
@@ -223,8 +223,19 @@ class SanaVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
tokenizer=tokenizer, text_encoder=text_encoder, vae=vae, transformer=transformer, scheduler=scheduler
)
self.vae_scale_factor_temporal = self.vae.config.scale_factor_temporal if getattr(self, "vae", None) else 4
self.vae_scale_factor_spatial = self.vae.config.scale_factor_spatial if getattr(self, "vae", None) else 8
if getattr(self, "vae", None):
if isinstance(self.vae, AutoencoderKLLTX2Video):
self.vae_scale_factor_temporal = self.vae.config.temporal_compression_ratio
self.vae_scale_factor_spatial = self.vae.config.spatial_compression_ratio
elif isinstance(self.vae, (AutoencoderDC, AutoencoderKLWan)):
self.vae_scale_factor_temporal = self.vae.config.scale_factor_temporal
self.vae_scale_factor_spatial = self.vae.config.scale_factor_spatial
else:
self.vae_scale_factor_temporal = 4
self.vae_scale_factor_spatial = 8
else:
self.vae_scale_factor_temporal = 4
self.vae_scale_factor_spatial = 8
self.vae_scale_factor = self.vae_scale_factor_spatial
@@ -985,14 +996,21 @@ class SanaVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
if is_torch_version(">=", "2.5.0")
else torch_accelerator_module.OutOfMemoryError
)
latents_mean = (
torch.tensor(self.vae.config.latents_mean)
.view(1, self.vae.config.z_dim, 1, 1, 1)
.to(latents.device, latents.dtype)
)
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
latents.device, latents.dtype
)
if isinstance(self.vae, AutoencoderKLLTX2Video):
latents_mean = self.vae.latents_mean
latents_std = self.vae.latents_std
z_dim = self.vae.config.latent_channels
elif isinstance(self.vae, AutoencoderKLWan):
latents_mean = torch.tensor(self.vae.config.latents_mean)
latents_std = torch.tensor(self.vae.config.latents_std)
z_dim = self.vae.config.z_dim
else:
latents_mean = torch.zeros(latents.shape[1], device=latents.device, dtype=latents.dtype)
latents_std = torch.ones(latents.shape[1], device=latents.device, dtype=latents.dtype)
z_dim = latents.shape[1]
latents_mean = latents_mean.view(1, z_dim, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = 1.0 / latents_std.view(1, z_dim, 1, 1, 1).to(latents.device, latents.dtype)
latents = latents / latents_std + latents_mean
try:
video = self.vae.decode(latents, return_dict=False)[0]
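Note that `latents_std` above is stored as the reciprocal (1.0 / std), so the de-normalization `latents / latents_std + latents_mean` is equivalent to the usual latents * std + mean. A minimal check with hypothetical scalar values:

import torch

std, mean = torch.tensor(2.5), torch.tensor(0.1)       # hypothetical values, for illustration only
latents = torch.randn(4)
inverse_std = 1.0 / std                                 # what the code above stores in `latents_std`
print(torch.allclose(latents / inverse_std + mean, latents * std + mean))  # True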

View File

@@ -26,7 +26,7 @@ from transformers import Gemma2PreTrainedModel, GemmaTokenizer, GemmaTokenizerFa
from ...callbacks import MultiPipelineCallbacks, PipelineCallback
from ...image_processor import PipelineImageInput
from ...loaders import SanaLoraLoaderMixin
from ...models import AutoencoderDC, AutoencoderKLWan, SanaVideoTransformer3DModel
from ...models import AutoencoderDC, AutoencoderKLLTX2Video, AutoencoderKLWan, SanaVideoTransformer3DModel
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import (
BACKENDS_MAPPING,
@@ -184,7 +184,7 @@ class SanaImageToVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
The tokenizer used to tokenize the prompt.
text_encoder ([`Gemma2PreTrainedModel`]):
Text encoder model to encode the input prompts.
vae ([`AutoencoderKLWan` or `AutoencoderDCAEV`]):
vae ([`AutoencoderKLWan`, `AutoencoderDC`, or `AutoencoderKLLTX2Video`]):
Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
transformer ([`SanaVideoTransformer3DModel`]):
Conditional Transformer to denoise the input latents.
@@ -203,7 +203,7 @@ class SanaImageToVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
self,
tokenizer: GemmaTokenizer | GemmaTokenizerFast,
text_encoder: Gemma2PreTrainedModel,
vae: AutoencoderDC | AutoencoderKLWan,
vae: AutoencoderDC | AutoencoderKLLTX2Video | AutoencoderKLWan,
transformer: SanaVideoTransformer3DModel,
scheduler: FlowMatchEulerDiscreteScheduler,
):
@@ -213,8 +213,19 @@ class SanaImageToVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
tokenizer=tokenizer, text_encoder=text_encoder, vae=vae, transformer=transformer, scheduler=scheduler
)
self.vae_scale_factor_temporal = self.vae.config.scale_factor_temporal if getattr(self, "vae", None) else 4
self.vae_scale_factor_spatial = self.vae.config.scale_factor_spatial if getattr(self, "vae", None) else 8
if getattr(self, "vae", None):
if isinstance(self.vae, AutoencoderKLLTX2Video):
self.vae_scale_factor_temporal = self.vae.config.temporal_compression_ratio
self.vae_scale_factor_spatial = self.vae.config.spatial_compression_ratio
elif isinstance(self.vae, (AutoencoderDC, AutoencoderKLWan)):
self.vae_scale_factor_temporal = self.vae.config.scale_factor_temporal
self.vae_scale_factor_spatial = self.vae.config.scale_factor_spatial
else:
self.vae_scale_factor_temporal = 4
self.vae_scale_factor_spatial = 8
else:
self.vae_scale_factor_temporal = 4
self.vae_scale_factor_spatial = 8
self.vae_scale_factor = self.vae_scale_factor_spatial
@@ -687,14 +698,18 @@ class SanaImageToVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
image_latents = retrieve_latents(self.vae.encode(image), sample_mode="argmax")
image_latents = image_latents.repeat(batch_size, 1, 1, 1, 1)
latents_mean = (
torch.tensor(self.vae.config.latents_mean)
.view(1, -1, 1, 1, 1)
.to(image_latents.device, image_latents.dtype)
)
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, -1, 1, 1, 1).to(
image_latents.device, image_latents.dtype
)
if isinstance(self.vae, AutoencoderKLLTX2Video):
_latents_mean = self.vae.latents_mean
_latents_std = self.vae.latents_std
elif isinstance(self.vae, AutoencoderKLWan):
_latents_mean = torch.tensor(self.vae.config.latents_mean)
_latents_std = torch.tensor(self.vae.config.latents_std)
else:
_latents_mean = torch.zeros(image_latents.shape[1], device=image_latents.device, dtype=image_latents.dtype)
_latents_std = torch.ones(image_latents.shape[1], device=image_latents.device, dtype=image_latents.dtype)
latents_mean = _latents_mean.view(1, -1, 1, 1, 1).to(image_latents.device, image_latents.dtype)
latents_std = 1.0 / _latents_std.view(1, -1, 1, 1, 1).to(image_latents.device, image_latents.dtype)
image_latents = (image_latents - latents_mean) * latents_std
latents[:, :, 0:1] = image_latents.to(dtype)
@@ -1034,14 +1049,21 @@ class SanaImageToVideoPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
if is_torch_version(">=", "2.5.0")
else torch_accelerator_module.OutOfMemoryError
)
latents_mean = (
torch.tensor(self.vae.config.latents_mean)
.view(1, self.vae.config.z_dim, 1, 1, 1)
.to(latents.device, latents.dtype)
)
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
latents.device, latents.dtype
)
if isinstance(self.vae, AutoencoderKLLTX2Video):
latents_mean = self.vae.latents_mean
latents_std = self.vae.latents_std
z_dim = self.vae.config.latent_channels
elif isinstance(self.vae, AutoencoderKLWan):
latents_mean = torch.tensor(self.vae.config.latents_mean)
latents_std = torch.tensor(self.vae.config.latents_std)
z_dim = self.vae.config.z_dim
else:
latents_mean = torch.zeros(latents.shape[1], device=latents.device, dtype=latents.dtype)
latents_std = torch.ones(latents.shape[1], device=latents.device, dtype=latents.dtype)
z_dim = latents.shape[1]
latents_mean = latents_mean.view(1, z_dim, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = 1.0 / latents_std.view(1, z_dim, 1, 1, 1).to(latents.device, latents.dtype)
latents = latents / latents_std + latents_mean
try:
video = self.vae.decode(latents, return_dict=False)[0]

View File

@@ -36,7 +36,7 @@ from typing import Any, Callable
from packaging import version
from ..utils import is_torch_available, is_torchao_available, is_torchao_version, logging
from ..utils import deprecate, is_torch_available, is_torchao_available, is_torchao_version, logging
if is_torch_available():
@@ -844,6 +844,8 @@ class QuantoConfig(QuantizationConfigMixin):
modules_to_not_convert: list[str] | None = None,
**kwargs,
):
deprecation_message = "`QuantoConfig` is deprecated and will be removed in version 1.0.0."
deprecate("QuantoConfig", "1.0.0", deprecation_message)
self.quant_method = QuantizationMethod.QUANTO
self.weights_dtype = weights_dtype
self.modules_to_not_convert = modules_to_not_convert

View File

@@ -3,6 +3,7 @@ from typing import TYPE_CHECKING, Any
from diffusers.utils.import_utils import is_optimum_quanto_version
from ...utils import (
deprecate,
get_module_from_name,
is_accelerate_available,
is_accelerate_version,
@@ -42,6 +43,9 @@ class QuantoQuantizer(DiffusersQuantizer):
super().__init__(quantization_config, **kwargs)
def validate_environment(self, *args, **kwargs):
deprecation_message = "The Quanto quantizer is deprecated and will be removed in version 1.0.0."
deprecate("QuantoQuantizer", "1.0.0", deprecation_message)
if not is_optimum_quanto_available():
raise ImportError(
"Loading an optimum-quanto quantized model requires optimum-quanto library (`pip install optimum-quanto`)"

View File

@@ -152,6 +152,96 @@ class FluxModularPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class HeliosAutoBlocks(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class HeliosModularPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidAutoBlocks(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidDistilledAutoBlocks(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidDistilledModularPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class HeliosPyramidModularPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class QwenImageAutoBlocks(metaclass=DummyObject):
_backends = ["torch", "transformers"]
@@ -1112,6 +1202,21 @@ class EasyAnimatePipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class Flux2KleinKVPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class Flux2KleinPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]

View File

@@ -28,7 +28,6 @@ from diffusers.utils.import_utils import is_peft_available
from ..testing_utils import (
floats_tensor,
is_flaky,
require_peft_backend,
require_peft_version_greater,
skip_mps,
@@ -46,7 +45,6 @@ from .utils import PeftLoraLoaderMixinTests # noqa: E402
@require_peft_backend
@skip_mps
@is_flaky(max_attempts=10, description="very flaky class")
class WanVACELoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
pipeline_class = WanVACEPipeline
scheduler_cls = FlowMatchEulerDiscreteScheduler
@@ -73,8 +71,8 @@ class WanVACELoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
"base_dim": 3,
"z_dim": 4,
"dim_mult": [1, 1, 1, 1],
"latents_mean": torch.randn(4).numpy().tolist(),
"latents_std": torch.randn(4).numpy().tolist(),
"latents_mean": [-0.7571, -0.7089, -0.9113, -0.7245],
"latents_std": [2.8184, 1.4541, 2.3275, 2.6558],
"num_res_blocks": 1,
"temperal_downsample": [False, True, True],
}

View File

@@ -26,9 +26,17 @@ from diffusers.models._modeling_parallel import ContextParallelConfig
from ...testing_utils import (
is_context_parallel,
require_torch_multi_accelerator,
torch_device,
)
# Device configuration mapping
DEVICE_CONFIG = {
"cuda": {"backend": "nccl", "module": torch.cuda},
"xpu": {"backend": "xccl", "module": torch.xpu},
}
def _find_free_port():
"""Find a free port on localhost."""
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
@@ -47,12 +55,17 @@ def _context_parallel_worker(rank, world_size, master_port, model_class, init_di
os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
# Get device configuration
device_config = DEVICE_CONFIG.get(torch_device, DEVICE_CONFIG["cuda"])
backend = device_config["backend"]
device_module = device_config["module"]
# Initialize process group
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
# Set device for this process
torch.cuda.set_device(rank)
device = torch.device(f"cuda:{rank}")
device_module.set_device(rank)
device = torch.device(f"{torch_device}:{rank}")
# Create model
model = model_class(**init_dict)
@@ -60,12 +73,7 @@ def _context_parallel_worker(rank, world_size, master_port, model_class, init_di
model.eval()
# Move inputs to device
inputs_on_device = {}
for key, value in inputs_dict.items():
if isinstance(value, torch.Tensor):
inputs_on_device[key] = value.to(device)
else:
inputs_on_device[key] = value
inputs_on_device = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs_dict.items()}
# Enable context parallelism
cp_config = ContextParallelConfig(**cp_dict)
@@ -89,6 +97,65 @@ def _context_parallel_worker(rank, world_size, master_port, model_class, init_di
dist.destroy_process_group()
def _custom_mesh_worker(
rank,
world_size,
master_port,
model_class,
init_dict,
cp_dict,
mesh_shape,
mesh_dim_names,
inputs_dict,
return_dict,
):
"""Worker function for context parallel testing with a user-provided custom DeviceMesh."""
try:
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(master_port)
os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
# Get device configuration
device_config = DEVICE_CONFIG.get(torch_device, DEVICE_CONFIG["cuda"])
backend = device_config["backend"]
device_module = device_config["module"]
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
# Set device for this process
device_module.set_device(rank)
device = torch.device(f"{torch_device}:{rank}")
model = model_class(**init_dict)
model.to(device)
model.eval()
inputs_on_device = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs_dict.items()}
# DeviceMesh must be created after init_process_group, inside each worker process.
mesh = torch.distributed.device_mesh.init_device_mesh(
torch_device, mesh_shape=mesh_shape, mesh_dim_names=mesh_dim_names
)
cp_config = ContextParallelConfig(**cp_dict, mesh=mesh)
model.enable_parallelism(config=cp_config)
with torch.no_grad():
output = model(**inputs_on_device, return_dict=False)[0]
if rank == 0:
return_dict["status"] = "success"
return_dict["output_shape"] = list(output.shape)
except Exception as e:
if rank == 0:
return_dict["status"] = "error"
return_dict["error"] = str(e)
finally:
if dist.is_initialized():
dist.destroy_process_group()
@is_context_parallel
@require_torch_multi_accelerator
class ContextParallelTesterMixin:
@@ -126,3 +193,48 @@ class ContextParallelTesterMixin:
assert return_dict.get("status") == "success", (
f"Context parallel inference failed: {return_dict.get('error', 'Unknown error')}"
)
@pytest.mark.parametrize(
"cp_type,mesh_shape,mesh_dim_names",
[
("ring_degree", (2, 1, 1), ("ring", "ulysses", "fsdp")),
("ulysses_degree", (1, 2, 1), ("ring", "ulysses", "fsdp")),
],
ids=["ring-3d-fsdp", "ulysses-3d-fsdp"],
)
def test_context_parallel_custom_mesh(self, cp_type, mesh_shape, mesh_dim_names):
if not torch.distributed.is_available():
pytest.skip("torch.distributed is not available.")
if not hasattr(self.model_class, "_cp_plan") or self.model_class._cp_plan is None:
pytest.skip("Model does not have a _cp_plan defined for context parallel inference.")
world_size = 2
init_dict = self.get_init_dict()
inputs_dict = {k: v.cpu() if isinstance(v, torch.Tensor) else v for k, v in self.get_dummy_inputs().items()}
cp_dict = {cp_type: world_size}
master_port = _find_free_port()
manager = mp.Manager()
return_dict = manager.dict()
mp.spawn(
_custom_mesh_worker,
args=(
world_size,
master_port,
self.model_class,
init_dict,
cp_dict,
mesh_shape,
mesh_dim_names,
inputs_dict,
return_dict,
),
nprocs=world_size,
join=True,
)
assert return_dict.get("status") == "success", (
f"Custom mesh context parallel inference failed: {return_dict.get('error', 'Unknown error')}"
)

View File

@@ -1,4 +1,3 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -13,49 +12,84 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import warnings
import torch
from diffusers import QwenImageTransformer2DModel
from diffusers.models.transformers.transformer_qwenimage import compute_text_seq_len_from_mask
from diffusers.utils.torch_utils import randn_tensor
from ...testing_utils import enable_full_determinism, torch_device
from ..test_modeling_common import ModelTesterMixin, TorchCompileTesterMixin
from ..testing_utils import (
AttentionTesterMixin,
BaseModelTesterConfig,
BitsAndBytesTesterMixin,
ContextParallelTesterMixin,
LoraHotSwappingForModelTesterMixin,
LoraTesterMixin,
MemoryTesterMixin,
ModelTesterMixin,
TorchAoTesterMixin,
TorchCompileTesterMixin,
TrainingTesterMixin,
)
enable_full_determinism()
class QwenImageTransformerTests(ModelTesterMixin, unittest.TestCase):
model_class = QwenImageTransformer2DModel
main_input_name = "hidden_states"
# We override the items here because the transformer under consideration is small.
model_split_percents = [0.7, 0.6, 0.6]
# Skip setting testing with default: AttnProcessor
uses_custom_attn_processor = True
class QwenImageTransformerTesterConfig(BaseModelTesterConfig):
@property
def model_class(self):
return QwenImageTransformer2DModel
@property
def dummy_input(self):
return self.prepare_dummy_input()
@property
def input_shape(self):
def output_shape(self) -> tuple[int, int]:
return (16, 16)
@property
def output_shape(self):
def input_shape(self) -> tuple[int, int]:
return (16, 16)
def prepare_dummy_input(self, height=4, width=4):
@property
def model_split_percents(self) -> list:
return [0.7, 0.6, 0.6]
@property
def main_input_name(self) -> str:
return "hidden_states"
@property
def generator(self):
return torch.Generator("cpu").manual_seed(0)
def get_init_dict(self) -> dict[str, int | list[int]]:
return {
"patch_size": 2,
"in_channels": 16,
"out_channels": 4,
"num_layers": 2,
"attention_head_dim": 16,
"num_attention_heads": 4,
"joint_attention_dim": 16,
"guidance_embeds": False,
"axes_dims_rope": (8, 4, 4),
}
def get_dummy_inputs(self) -> dict[str, torch.Tensor]:
batch_size = 1
num_latent_channels = embedding_dim = 16
sequence_length = 7
height = width = 4
sequence_length = 8
vae_scale_factor = 4
hidden_states = torch.randn((batch_size, height * width, num_latent_channels)).to(torch_device)
encoder_hidden_states = torch.randn((batch_size, sequence_length, embedding_dim)).to(torch_device)
hidden_states = randn_tensor(
(batch_size, height * width, num_latent_channels), generator=self.generator, device=torch_device
)
encoder_hidden_states = randn_tensor(
(batch_size, sequence_length, embedding_dim), generator=self.generator, device=torch_device
)
encoder_hidden_states_mask = torch.ones((batch_size, sequence_length)).to(torch_device, torch.long)
timestep = torch.tensor([1.0]).to(torch_device).expand(batch_size)
orig_height = height * 2 * vae_scale_factor
@@ -70,89 +104,57 @@ class QwenImageTransformerTests(ModelTesterMixin, unittest.TestCase):
"img_shapes": img_shapes,
}
def prepare_init_args_and_inputs_for_common(self):
init_dict = {
"patch_size": 2,
"in_channels": 16,
"out_channels": 4,
"num_layers": 2,
"attention_head_dim": 16,
"num_attention_heads": 3,
"joint_attention_dim": 16,
"guidance_embeds": False,
"axes_dims_rope": (8, 4, 4),
}
inputs_dict = self.dummy_input
return init_dict, inputs_dict
def test_gradient_checkpointing_is_applied(self):
expected_set = {"QwenImageTransformer2DModel"}
super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
class TestQwenImageTransformer(QwenImageTransformerTesterConfig, ModelTesterMixin):
def test_infers_text_seq_len_from_mask(self):
"""Test that compute_text_seq_len_from_mask correctly infers sequence lengths and returns tensors."""
init_dict, inputs = self.prepare_init_args_and_inputs_for_common()
init_dict = self.get_init_dict()
inputs = self.get_dummy_inputs()
model = self.model_class(**init_dict).to(torch_device)
# Test 1: Contiguous mask with padding at the end (only first 2 tokens valid)
encoder_hidden_states_mask = inputs["encoder_hidden_states_mask"].clone()
encoder_hidden_states_mask[:, 2:] = 0 # Only first 2 tokens are valid
encoder_hidden_states_mask[:, 2:] = 0
rope_text_seq_len, per_sample_len, normalized_mask = compute_text_seq_len_from_mask(
inputs["encoder_hidden_states"], encoder_hidden_states_mask
)
# Verify rope_text_seq_len is returned as an int (for torch.compile compatibility)
self.assertIsInstance(rope_text_seq_len, int)
assert isinstance(rope_text_seq_len, int)
assert isinstance(per_sample_len, torch.Tensor)
assert int(per_sample_len.max().item()) == 2
assert normalized_mask.dtype == torch.bool
assert normalized_mask.sum().item() == 2
assert rope_text_seq_len >= inputs["encoder_hidden_states"].shape[1]
# Verify per_sample_len is computed correctly (max valid position + 1 = 2)
self.assertIsInstance(per_sample_len, torch.Tensor)
self.assertEqual(int(per_sample_len.max().item()), 2)
# Verify mask is normalized to bool dtype
self.assertTrue(normalized_mask.dtype == torch.bool)
self.assertEqual(normalized_mask.sum().item(), 2) # Only 2 True values
# Verify rope_text_seq_len is at least the sequence length
self.assertGreaterEqual(rope_text_seq_len, inputs["encoder_hidden_states"].shape[1])
# Test 2: Verify model runs successfully with inferred values
inputs["encoder_hidden_states_mask"] = normalized_mask
with torch.no_grad():
output = model(**inputs)
self.assertEqual(output.sample.shape[1], inputs["hidden_states"].shape[1])
assert output.sample.shape[1] == inputs["hidden_states"].shape[1]
# Test 3: Different mask pattern (padding at beginning)
encoder_hidden_states_mask2 = inputs["encoder_hidden_states_mask"].clone()
encoder_hidden_states_mask2[:, :3] = 0 # First 3 tokens are padding
encoder_hidden_states_mask2[:, 3:] = 1 # Last 4 tokens are valid
encoder_hidden_states_mask2[:, :3] = 0
encoder_hidden_states_mask2[:, 3:] = 1
rope_text_seq_len2, per_sample_len2, normalized_mask2 = compute_text_seq_len_from_mask(
inputs["encoder_hidden_states"], encoder_hidden_states_mask2
)
# Max valid position is 6 (last token), so per_sample_len should be 7
self.assertEqual(int(per_sample_len2.max().item()), 7)
self.assertEqual(normalized_mask2.sum().item(), 4) # 4 True values
assert int(per_sample_len2.max().item()) == 8
assert normalized_mask2.sum().item() == 5
# Test 4: No mask provided (None case)
rope_text_seq_len_none, per_sample_len_none, normalized_mask_none = compute_text_seq_len_from_mask(
inputs["encoder_hidden_states"], None
)
self.assertEqual(rope_text_seq_len_none, inputs["encoder_hidden_states"].shape[1])
self.assertIsInstance(rope_text_seq_len_none, int)
self.assertIsNone(per_sample_len_none)
self.assertIsNone(normalized_mask_none)
assert rope_text_seq_len_none == inputs["encoder_hidden_states"].shape[1]
assert isinstance(rope_text_seq_len_none, int)
assert per_sample_len_none is None
assert normalized_mask_none is None
def test_non_contiguous_attention_mask(self):
"""Test that non-contiguous masks work correctly (e.g., [1, 0, 1, 0, 1, 0, 0])"""
init_dict, inputs = self.prepare_init_args_and_inputs_for_common()
init_dict = self.get_init_dict()
inputs = self.get_dummy_inputs()
model = self.model_class(**init_dict).to(torch_device)
# Create a non-contiguous mask pattern: valid, padding, valid, padding, etc.
encoder_hidden_states_mask = inputs["encoder_hidden_states_mask"].clone()
# Pattern: [True, False, True, False, True, False, False]
encoder_hidden_states_mask[:, 1] = 0
encoder_hidden_states_mask[:, 3] = 0
encoder_hidden_states_mask[:, 5:] = 0
@@ -160,95 +162,85 @@ class QwenImageTransformerTests(ModelTesterMixin, unittest.TestCase):
inferred_rope_len, per_sample_len, normalized_mask = compute_text_seq_len_from_mask(
inputs["encoder_hidden_states"], encoder_hidden_states_mask
)
self.assertEqual(int(per_sample_len.max().item()), 5)
self.assertEqual(inferred_rope_len, inputs["encoder_hidden_states"].shape[1])
self.assertIsInstance(inferred_rope_len, int)
self.assertTrue(normalized_mask.dtype == torch.bool)
assert int(per_sample_len.max().item()) == 5
assert inferred_rope_len == inputs["encoder_hidden_states"].shape[1]
assert isinstance(inferred_rope_len, int)
assert normalized_mask.dtype == torch.bool
inputs["encoder_hidden_states_mask"] = normalized_mask
with torch.no_grad():
output = model(**inputs)
self.assertEqual(output.sample.shape[1], inputs["hidden_states"].shape[1])
assert output.sample.shape[1] == inputs["hidden_states"].shape[1]
def test_txt_seq_lens_deprecation(self):
"""Test that passing txt_seq_lens raises a deprecation warning."""
init_dict = self.get_init_dict()
inputs = self.get_dummy_inputs()
model = self.model_class(**init_dict).to(torch_device)
# Prepare inputs with txt_seq_lens (deprecated parameter)
txt_seq_lens = [inputs["encoder_hidden_states"].shape[1]]
# Remove encoder_hidden_states_mask to use the deprecated path
inputs_with_deprecated = inputs.copy()
inputs_with_deprecated.pop("encoder_hidden_states_mask")
inputs_with_deprecated["txt_seq_lens"] = txt_seq_lens
# Test that deprecation warning is raised
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
with torch.no_grad():
output = model(**inputs_with_deprecated)
future_warnings = [x for x in w if issubclass(x.category, FutureWarning)]
assert len(future_warnings) > 0, "Expected FutureWarning to be raised"
# Verify the warning message mentions the deprecation
warning_message = str(future_warnings[0].message)
assert "txt_seq_lens" in warning_message
assert "deprecated" in warning_message
# Verify the model still works correctly despite the deprecation
assert output.sample.shape[1] == inputs["hidden_states"].shape[1]
def test_layered_model_with_mask(self):
"""Test QwenImageTransformer2DModel with use_layer3d_rope=True (layered model)."""
# Create layered model config
init_dict = {
"patch_size": 2,
"in_channels": 16,
"out_channels": 4,
"num_layers": 2,
"attention_head_dim": 16,
"num_attention_heads": 3,
"num_attention_heads": 4,
"joint_attention_dim": 16,
"axes_dims_rope": (8, 4, 4), # Must match attention_head_dim (8+4+4=16)
"use_layer3d_rope": True, # Enable layered RoPE
"use_additional_t_cond": True, # Enable additional time conditioning
"axes_dims_rope": (8, 4, 4),
"use_layer3d_rope": True,
"use_additional_t_cond": True,
}
model = self.model_class(**init_dict).to(torch_device)
# Verify the model uses QwenEmbedLayer3DRope
from diffusers.models.transformers.transformer_qwenimage import QwenEmbedLayer3DRope
assert isinstance(model.pos_embed, QwenEmbedLayer3DRope)
# Test single generation with layered structure
batch_size = 1
text_seq_len = 8
img_h, img_w = 4, 4
layers = 4
# For layered model: (layers + 1) because we have N layers + 1 combined image
hidden_states = torch.randn(batch_size, (layers + 1) * img_h * img_w, 16).to(torch_device)
encoder_hidden_states = torch.randn(batch_size, text_seq_len, 16).to(torch_device)
# Create mask with some padding
encoder_hidden_states_mask = torch.ones(batch_size, text_seq_len).to(torch_device)
encoder_hidden_states_mask[0, 5:] = 0 # Only 5 valid tokens
timestep = torch.tensor([1.0]).to(torch_device)
# additional_t_cond for use_additional_t_cond=True (0 or 1 index for embedding)
addition_t_cond = torch.tensor([0], dtype=torch.long).to(torch_device)
# Layer structure: 4 layers + 1 condition image
img_shapes = [
[
(1, img_h, img_w), # layer 0
(1, img_h, img_w), # layer 1
(1, img_h, img_w), # layer 2
(1, img_h, img_w), # layer 3
(1, img_h, img_w), # condition image (last one gets special treatment)
]
]
@@ -262,37 +254,113 @@ class QwenImageTransformerTests(ModelTesterMixin, unittest.TestCase):
additional_t_cond=addition_t_cond,
)
assert output.sample.shape[1] == hidden_states.shape[1]
class TestQwenImageTransformerMemory(QwenImageTransformerTesterConfig, MemoryTesterMixin):
"""Memory optimization tests for QwenImage Transformer."""
def prepare_init_args_and_inputs_for_common(self):
return QwenImageTransformerTests().prepare_init_args_and_inputs_for_common()
def prepare_dummy_input(self, height, width):
return QwenImageTransformerTests().prepare_dummy_input(height=height, width=width)
class TestQwenImageTransformerTraining(QwenImageTransformerTesterConfig, TrainingTesterMixin):
"""Training tests for QwenImage Transformer."""
def test_gradient_checkpointing_is_applied(self):
expected_set = {"QwenImageTransformer2DModel"}
super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
class TestQwenImageTransformerAttention(QwenImageTransformerTesterConfig, AttentionTesterMixin):
"""Attention processor tests for QwenImage Transformer."""
class TestQwenImageTransformerContextParallel(QwenImageTransformerTesterConfig, ContextParallelTesterMixin):
"""Context Parallel inference tests for QwenImage Transformer."""
class TestQwenImageTransformerLoRA(QwenImageTransformerTesterConfig, LoraTesterMixin):
"""LoRA adapter tests for QwenImage Transformer."""
class TestQwenImageTransformerLoRAHotSwap(QwenImageTransformerTesterConfig, LoraHotSwappingForModelTesterMixin):
"""LoRA hot-swapping tests for QwenImage Transformer."""
class TestQwenImageTransformerCompile(QwenImageTransformerTesterConfig, TorchCompileTesterMixin):
"""Torch compile tests for QwenImage Transformer."""
@property
def different_shapes_for_compilation(self):
return [(4, 4), (4, 8), (8, 8)]
def get_dummy_inputs(self, height: int = 4, width: int = 4) -> dict[str, torch.Tensor]:
batch_size = 1
num_latent_channels = embedding_dim = 16
sequence_length = 8
vae_scale_factor = 4
hidden_states = randn_tensor(
(batch_size, height * width, num_latent_channels), generator=self.generator, device=torch_device
)
encoder_hidden_states = randn_tensor(
(batch_size, sequence_length, embedding_dim), generator=self.generator, device=torch_device
)
encoder_hidden_states_mask = torch.ones((batch_size, sequence_length)).to(torch_device, torch.long)
timestep = torch.tensor([1.0]).to(torch_device).expand(batch_size)
orig_height = height * 2 * vae_scale_factor
orig_width = width * 2 * vae_scale_factor
img_shapes = [(1, orig_height // vae_scale_factor // 2, orig_width // vae_scale_factor // 2)] * batch_size
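# Sanity check on the arithmetic above: orig_height == height * 2 * vae_scale_factor, so
# orig_height // vae_scale_factor // 2 == height, i.e. img_shapes simply mirrors the (1, height, width) latent grid.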
return {
"hidden_states": hidden_states,
"encoder_hidden_states": encoder_hidden_states,
"encoder_hidden_states_mask": encoder_hidden_states_mask,
"timestep": timestep,
"img_shapes": img_shapes,
}
def test_torch_compile_with_and_without_mask(self):
"""Test that torch.compile works with both None mask and padding mask."""
init_dict = self.get_init_dict()
inputs = self.get_dummy_inputs()
model = self.model_class(**init_dict).to(torch_device)
model.eval()
model.compile(mode="default", fullgraph=True)
# Test 1: Run with None mask (no padding, all tokens are valid)
inputs_no_mask = inputs.copy()
inputs_no_mask["encoder_hidden_states_mask"] = None
# First run to allow compilation
with torch.no_grad():
output_no_mask = model(**inputs_no_mask)
# Second run to verify no recompilation
with (
torch._inductor.utils.fresh_inductor_cache(),
torch._dynamo.config.patch(error_on_recompile=True),
@@ -300,19 +368,15 @@ class QwenImageTransformerCompileTests(TorchCompileTesterMixin, unittest.TestCase):
):
output_no_mask_2 = model(**inputs_no_mask)
assert output_no_mask.sample.shape[1] == inputs["hidden_states"].shape[1]
assert output_no_mask_2.sample.shape[1] == inputs["hidden_states"].shape[1]
# Test 2: Run with all-ones mask (should behave like None)
inputs_all_ones = inputs.copy()
# Keep the all-ones mask
assert inputs_all_ones["encoder_hidden_states_mask"].all().item()
# First run to allow compilation
with torch.no_grad():
output_all_ones = model(**inputs_all_ones)
# Second run to verify no recompilation
with (
torch._inductor.utils.fresh_inductor_cache(),
torch._dynamo.config.patch(error_on_recompile=True),
@@ -320,21 +384,18 @@ class QwenImageTransformerCompileTests(TorchCompileTesterMixin, unittest.TestCase):
):
output_all_ones_2 = model(**inputs_all_ones)
assert output_all_ones.sample.shape[1] == inputs["hidden_states"].shape[1]
assert output_all_ones_2.sample.shape[1] == inputs["hidden_states"].shape[1]
# Test 3: Run with actual padding mask (has zeros)
inputs_with_padding = inputs.copy()
mask_with_padding = inputs["encoder_hidden_states_mask"].clone()
mask_with_padding[:, 4:] = 0 # Last 4 tokens are padding
inputs_with_padding["encoder_hidden_states_mask"] = mask_with_padding
# First run to allow compilation
with torch.no_grad():
output_with_padding = model(**inputs_with_padding)
# Second run to verify no recompilation
with (
torch._inductor.utils.fresh_inductor_cache(),
torch._dynamo.config.patch(error_on_recompile=True),
@@ -342,8 +403,15 @@ class QwenImageTransformerCompileTests(TorchCompileTesterMixin, unittest.TestCase):
):
output_with_padding_2 = model(**inputs_with_padding)
assert output_with_padding.sample.shape[1] == inputs["hidden_states"].shape[1]
assert output_with_padding_2.sample.shape[1] == inputs["hidden_states"].shape[1]
# Verify that outputs are different (mask should affect results)
assert not torch.allclose(output_no_mask.sample, output_with_padding.sample, atol=1e-3)
class TestQwenImageTransformerBitsAndBytes(QwenImageTransformerTesterConfig, BitsAndBytesTesterMixin):
"""BitsAndBytes quantization tests for QwenImage Transformer."""
class TestQwenImageTransformerTorchAo(QwenImageTransformerTesterConfig, TorchAoTesterMixin):
"""TorchAO quantization tests for QwenImage Transformer."""


@@ -0,0 +1,166 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pytest
from diffusers.modular_pipelines import (
HeliosAutoBlocks,
HeliosModularPipeline,
HeliosPyramidAutoBlocks,
HeliosPyramidModularPipeline,
)
from ..test_modular_pipelines_common import ModularPipelineTesterMixin
HELIOS_WORKFLOWS = {
"text2video": [
("text_encoder", "HeliosTextEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.set_timesteps", "HeliosSetTimestepsStep"),
("denoise.chunk_denoise", "HeliosChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
"image2video": [
("text_encoder", "HeliosTextEncoderStep"),
("vae_encoder", "HeliosImageVaeEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
("denoise.add_noise_image", "HeliosAddNoiseToImageLatentsStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.seed_history", "HeliosI2VSeedHistoryStep"),
("denoise.set_timesteps", "HeliosSetTimestepsStep"),
("denoise.chunk_denoise", "HeliosI2VChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
"video2video": [
("text_encoder", "HeliosTextEncoderStep"),
("vae_encoder", "HeliosVideoVaeEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
("denoise.add_noise_video", "HeliosAddNoiseToVideoLatentsStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.seed_history", "HeliosV2VSeedHistoryStep"),
("denoise.set_timesteps", "HeliosSetTimestepsStep"),
("denoise.chunk_denoise", "HeliosI2VChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
}
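# Each workflow above lists the ordered (block_name, block_class_name) pairs the auto blocks are expected to
# resolve to for that task; they are wired into the shared tester via expected_workflow_blocks below
# (the exact comparison is assumed to live in ModularPipelineTesterMixin).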
class TestHeliosModularPipelineFast(ModularPipelineTesterMixin):
pipeline_class = HeliosModularPipeline
pipeline_blocks_class = HeliosAutoBlocks
pretrained_model_name_or_path = "hf-internal-testing/tiny-helios-modular-pipe"
params = frozenset(["prompt", "height", "width", "num_frames"])
batch_params = frozenset(["prompt"])
optional_params = frozenset(["num_inference_steps", "num_videos_per_prompt", "latents"])
output_name = "videos"
expected_workflow_blocks = HELIOS_WORKFLOWS
def get_dummy_inputs(self, seed=0):
generator = self.get_generator(seed)
inputs = {
"prompt": "A painting of a squirrel eating a burger",
"generator": generator,
"num_inference_steps": 2,
"height": 16,
"width": 16,
"num_frames": 9,
"max_sequence_length": 16,
"output_type": "pt",
}
return inputs
@pytest.mark.skip(reason="num_videos_per_prompt")
def test_num_images_per_prompt(self):
pass
HELIOS_PYRAMID_WORKFLOWS = {
"text2video": [
("text_encoder", "HeliosTextEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.pyramid_chunk_denoise", "HeliosPyramidChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
"image2video": [
("text_encoder", "HeliosTextEncoderStep"),
("vae_encoder", "HeliosImageVaeEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
("denoise.add_noise_image", "HeliosAddNoiseToImageLatentsStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.seed_history", "HeliosI2VSeedHistoryStep"),
("denoise.pyramid_chunk_denoise", "HeliosPyramidI2VChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
"video2video": [
("text_encoder", "HeliosTextEncoderStep"),
("vae_encoder", "HeliosVideoVaeEncoderStep"),
("denoise.input", "HeliosTextInputStep"),
("denoise.additional_inputs", "HeliosAdditionalInputsStep"),
("denoise.add_noise_video", "HeliosAddNoiseToVideoLatentsStep"),
("denoise.prepare_history", "HeliosPrepareHistoryStep"),
("denoise.seed_history", "HeliosV2VSeedHistoryStep"),
("denoise.pyramid_chunk_denoise", "HeliosPyramidI2VChunkDenoiseStep"),
("decode", "HeliosDecodeStep"),
],
}
class TestHeliosPyramidModularPipelineFast(ModularPipelineTesterMixin):
pipeline_class = HeliosPyramidModularPipeline
pipeline_blocks_class = HeliosPyramidAutoBlocks
pretrained_model_name_or_path = "hf-internal-testing/tiny-helios-pyramid-modular-pipe"
params = frozenset(["prompt", "height", "width", "num_frames"])
batch_params = frozenset(["prompt"])
optional_params = frozenset(["pyramid_num_inference_steps_list", "num_videos_per_prompt", "latents"])
output_name = "videos"
expected_workflow_blocks = HELIOS_PYRAMID_WORKFLOWS
def get_dummy_inputs(self, seed=0):
generator = self.get_generator(seed)
inputs = {
"prompt": "A painting of a squirrel eating a burger",
"generator": generator,
"pyramid_num_inference_steps_list": [2, 2],
"height": 64,
"width": 64,
"num_frames": 9,
"max_sequence_length": 16,
"output_type": "pt",
}
return inputs
def test_inference_batch_single_identical(self):
# Pyramid pipeline injects noise at each stage, so batch vs single can differ more
super().test_inference_batch_single_identical(expected_max_diff=5e-1)
@pytest.mark.skip(reason="Pyramid multi-stage noise makes offload comparison unreliable with tiny models")
def test_components_auto_cpu_offload_inference_consistent(self):
pass
@pytest.mark.skip(reason="Pyramid multi-stage noise makes save/load comparison unreliable with tiny models")
def test_save_from_pretrained(self):
pass
@pytest.mark.skip(reason="num_videos_per_prompt")
def test_num_images_per_prompt(self):
pass


@@ -5,6 +5,7 @@ from typing import Callable
import pytest
import torch
from huggingface_hub import hf_hub_download
import diffusers
from diffusers import AutoModel, ComponentsManager, ModularPipeline, ModularPipelineBlocks
@@ -32,6 +33,33 @@ from ..testing_utils import (
)
def _get_specified_components(path_or_repo_id, cache_dir=None):
if os.path.isdir(path_or_repo_id):
config_path = os.path.join(path_or_repo_id, "modular_model_index.json")
else:
try:
config_path = hf_hub_download(
repo_id=path_or_repo_id,
filename="modular_model_index.json",
local_dir=cache_dir,
)
except Exception:
return None
with open(config_path) as f:
config = json.load(f)
components = set()
for k, v in config.items():
if isinstance(v, (str, int, float, bool)):
continue
for entry in v:
if isinstance(entry, dict) and (entry.get("repo") or entry.get("pretrained_model_name_or_path")):
components.add(k)
break
return components
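# The helper above assumes a modular_model_index.json whose top-level keys are component names, with scalar
# entries skipped and a component counted as "specified" when one of its entries carries a "repo" or
# "pretrained_model_name_or_path" field. Illustrative sketch (not a real index):
#   {"_class_name": "...", "text_encoder": [{"repo": "some/repo", "subfolder": "text_encoder"}], "scheduler": [{}]}
# would make _get_specified_components(...) return {"text_encoder"}.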
class ModularPipelineTesterMixin:
"""
It provides a set of common tests for each modular pipeline,
@@ -360,6 +388,39 @@ class ModularPipelineTesterMixin:
assert torch.abs(image_slices[0] - image_slices[1]).max() < 1e-3
def test_load_expected_components_from_pretrained(self, tmp_path):
pipe = self.get_pipeline()
expected = _get_specified_components(self.pretrained_model_name_or_path, cache_dir=tmp_path)
if not expected:
pytest.skip("Skipping test as we couldn't fetch the expected components.")
actual = {
name
for name in pipe.components
if getattr(pipe, name, None) is not None
and getattr(getattr(pipe, name), "_diffusers_load_id", None) not in (None, "null")
}
assert expected == actual, f"Component mismatch: missing={expected - actual}, unexpected={actual - expected}"
def test_load_expected_components_from_save_pretrained(self, tmp_path):
pipe = self.get_pipeline()
save_dir = str(tmp_path / "saved-pipeline")
pipe.save_pretrained(save_dir)
expected = _get_specified_components(save_dir)
loaded_pipe = ModularPipeline.from_pretrained(save_dir)
loaded_pipe.load_components(torch_dtype=torch.float32)
actual = {
name
for name in loaded_pipe.components
if getattr(loaded_pipe, name, None) is not None
and getattr(getattr(loaded_pipe, name), "_diffusers_load_id", None) not in (None, "null")
}
assert expected == actual, (
f"Component mismatch after save/load: missing={expected - actual}, unexpected={actual - expected}"
)
def test_modular_index_consistency(self, tmp_path):
pipe = self.get_pipeline()
components_spec = pipe._component_specs


@@ -31,7 +31,41 @@ from diffusers.modular_pipelines import (
WanModularPipeline,
)
from ..testing_utils import nightly, require_torch, require_torch_accelerator, slow, torch_device
def _create_tiny_model_dir(model_dir):
TINY_MODEL_CODE = (
"import torch\n"
"from diffusers import ModelMixin, ConfigMixin\n"
"from diffusers.configuration_utils import register_to_config\n"
"\n"
"class TinyModel(ModelMixin, ConfigMixin):\n"
" @register_to_config\n"
" def __init__(self, hidden_size=4):\n"
" super().__init__()\n"
" self.linear = torch.nn.Linear(hidden_size, hidden_size)\n"
"\n"
" def forward(self, x):\n"
" return self.linear(x)\n"
)
with open(os.path.join(model_dir, "modeling.py"), "w") as f:
f.write(TINY_MODEL_CODE)
config = {
"_class_name": "TinyModel",
"_diffusers_version": "0.0.0",
"auto_map": {"AutoModel": "modeling.TinyModel"},
"hidden_size": 4,
}
with open(os.path.join(model_dir, "config.json"), "w") as f:
json.dump(config, f)
torch.save(
{"linear.weight": torch.randn(4, 4), "linear.bias": torch.randn(4)},
os.path.join(model_dir, "diffusion_pytorch_model.bin"),
)
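# The directory written above mimics a minimal remote-code model repo: modeling.py defines TinyModel,
# config.json routes AutoModel to it via auto_map, and diffusion_pytorch_model.bin holds the Linear weights.
# Loading it therefore requires trust_remote_code=True, which is what the tests below pass to load_components.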
class DummyCustomBlockSimple(ModularPipelineBlocks):
@@ -341,6 +375,81 @@ class TestModularCustomBlocks:
loaded_pipe.update_components(custom_model=custom_model)
assert getattr(loaded_pipe, "custom_model", None) is not None
def test_automodel_type_hint_preserves_torch_dtype(self, tmp_path):
"""Regression test for #13271: torch_dtype was incorrectly removed when type_hint is AutoModel."""
from diffusers import AutoModel
model_dir = str(tmp_path / "model")
os.makedirs(model_dir)
_create_tiny_model_dir(model_dir)
class DtypeTestBlock(ModularPipelineBlocks):
@property
def expected_components(self):
return [ComponentSpec("model", AutoModel, pretrained_model_name_or_path=model_dir)]
@property
def inputs(self) -> List[InputParam]:
return [InputParam("prompt", type_hint=str, required=True)]
@property
def intermediate_inputs(self) -> List[InputParam]:
return []
@property
def intermediate_outputs(self) -> List[OutputParam]:
return [OutputParam("output", type_hint=str)]
def __call__(self, components, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
block_state.output = "test"
self.set_block_state(state, block_state)
return components, state
block = DtypeTestBlock()
pipe = block.init_pipeline()
pipe.load_components(torch_dtype=torch.float16, trust_remote_code=True)
assert pipe.model.dtype == torch.float16
@require_torch_accelerator
def test_automodel_type_hint_preserves_device(self, tmp_path):
"""Test that ComponentSpec with AutoModel type_hint correctly passes device_map."""
from diffusers import AutoModel
model_dir = str(tmp_path / "model")
os.makedirs(model_dir)
_create_tiny_model_dir(model_dir)
class DeviceTestBlock(ModularPipelineBlocks):
@property
def expected_components(self):
return [ComponentSpec("model", AutoModel, pretrained_model_name_or_path=model_dir)]
@property
def inputs(self) -> List[InputParam]:
return [InputParam("prompt", type_hint=str, required=True)]
@property
def intermediate_inputs(self) -> List[InputParam]:
return []
@property
def intermediate_outputs(self) -> List[OutputParam]:
return [OutputParam("output", type_hint=str)]
def __call__(self, components, state: PipelineState) -> PipelineState:
block_state = self.get_block_state(state)
block_state.output = "test"
self.set_block_state(state, block_state)
return components, state
block = DeviceTestBlock()
pipe = block.init_pipeline()
pipe.load_components(device_map=torch_device, trust_remote_code=True)
assert pipe.model.device.type == torch_device
def test_custom_block_loads_from_hub(self):
repo_id = "hf-internal-testing/tiny-modular-diffusers-block"
block = ModularPipelineBlocks.from_pretrained(repo_id, trust_remote_code=True)


@@ -0,0 +1,174 @@
import unittest
import numpy as np
import torch
from PIL import Image
from transformers import Qwen2TokenizerFast, Qwen3Config, Qwen3ForCausalLM
from diffusers import (
AutoencoderKLFlux2,
FlowMatchEulerDiscreteScheduler,
Flux2KleinKVPipeline,
Flux2Transformer2DModel,
)
from ...testing_utils import torch_device
from ..test_pipelines_common import PipelineTesterMixin, check_qkv_fused_layers_exist
class Flux2KleinKVPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = Flux2KleinKVPipeline
params = frozenset(["prompt", "height", "width", "prompt_embeds", "image"])
batch_params = frozenset(["prompt"])
test_xformers_attention = False
test_layerwise_casting = True
test_group_offloading = True
supports_dduf = False
def get_dummy_components(self, num_layers: int = 1, num_single_layers: int = 1):
torch.manual_seed(0)
transformer = Flux2Transformer2DModel(
patch_size=1,
in_channels=4,
num_layers=num_layers,
num_single_layers=num_single_layers,
attention_head_dim=16,
num_attention_heads=2,
joint_attention_dim=16,
timestep_guidance_channels=256,
axes_dims_rope=[4, 4, 4, 4],
guidance_embeds=False,
)
# Create minimal Qwen3 config
config = Qwen3Config(
intermediate_size=16,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=2,
num_key_value_heads=2,
vocab_size=151936,
max_position_embeddings=512,
)
torch.manual_seed(0)
text_encoder = Qwen3ForCausalLM(config)
# Use a simple tokenizer for testing
tokenizer = Qwen2TokenizerFast.from_pretrained(
"hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration"
)
torch.manual_seed(0)
vae = AutoencoderKLFlux2(
sample_size=32,
in_channels=3,
out_channels=3,
down_block_types=("DownEncoderBlock2D",),
up_block_types=("UpDecoderBlock2D",),
block_out_channels=(4,),
layers_per_block=1,
latent_channels=1,
norm_num_groups=1,
use_quant_conv=False,
use_post_quant_conv=False,
)
scheduler = FlowMatchEulerDiscreteScheduler()
return {
"scheduler": scheduler,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
"transformer": transformer,
"vae": vae,
}
def get_dummy_inputs(self, device, seed=0):
if str(device).startswith("mps"):
generator = torch.manual_seed(seed)
else:
generator = torch.Generator(device="cpu").manual_seed(seed)
inputs = {
"prompt": "a dog is dancing",
"image": Image.new("RGB", (64, 64)),
"generator": generator,
"num_inference_steps": 2,
"height": 8,
"width": 8,
"max_sequence_length": 64,
"output_type": "np",
"text_encoder_out_layers": (1,),
}
return inputs
def test_fused_qkv_projections(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
image = pipe(**inputs).images
original_image_slice = image[0, -3:, -3:, -1]
pipe.transformer.fuse_qkv_projections()
self.assertTrue(
check_qkv_fused_layers_exist(pipe.transformer, ["to_qkv"]),
("Something wrong with the fused attention layers. Expected all the attention projections to be fused."),
)
inputs = self.get_dummy_inputs(device)
image = pipe(**inputs).images
image_slice_fused = image[0, -3:, -3:, -1]
pipe.transformer.unfuse_qkv_projections()
inputs = self.get_dummy_inputs(device)
image = pipe(**inputs).images
image_slice_disabled = image[0, -3:, -3:, -1]
self.assertTrue(
np.allclose(original_image_slice, image_slice_fused, atol=1e-3, rtol=1e-3),
("Fusion of QKV projections shouldn't affect the outputs."),
)
self.assertTrue(
np.allclose(image_slice_fused, image_slice_disabled, atol=1e-3, rtol=1e-3),
("Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled."),
)
self.assertTrue(
np.allclose(original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2),
("Original outputs should match when fused QKV projections are disabled."),
)
def test_image_output_shape(self):
pipe = self.pipeline_class(**self.get_dummy_components()).to(torch_device)
inputs = self.get_dummy_inputs(torch_device)
height_width_pairs = [(32, 32), (72, 57)]
for height, width in height_width_pairs:
expected_height = height - height % (pipe.vae_scale_factor * 2)
expected_width = width - width % (pipe.vae_scale_factor * 2)
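# Worked example: outputs snap down to a multiple of pipe.vae_scale_factor * 2, so with the tiny single-block
# VAE here (vae_scale_factor of 1 or 2, assumption), a requested (72, 57) becomes (72, 56) and (32, 32) is unchanged.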
inputs.update({"height": height, "width": width})
image = pipe(**inputs).images[0]
output_height, output_width, _ = image.shape
self.assertEqual(
(output_height, output_width),
(expected_height, expected_width),
f"Output shape {image.shape} does not match expected shape {(expected_height, expected_width)}",
)
def test_without_image(self):
device = "cpu"
pipe = self.pipeline_class(**self.get_dummy_components()).to(device)
inputs = self.get_dummy_inputs(device)
del inputs["image"]
image = pipe(**inputs).images
self.assertEqual(image.shape, (1, 8, 8, 3))
@unittest.skip("Needs to be revisited")
def test_encode_prompt_works_in_isolation(self):
pass


@@ -139,9 +139,9 @@ class HeliosPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
generated_slice = torch.cat([generated_slice[:8], generated_slice[-8:]])
self.assertTrue(torch.allclose(generated_slice, expected_slice, atol=1e-3))
@unittest.skip("Helios uses a lot of mixed precision internally, which is not suitable for this test case")
def test_save_load_float16(self):
pass
@unittest.skip("Test not supported")
def test_attention_slicing_forward_pass(self):


@@ -139,7 +139,9 @@ class HunyuanVideoImageToVideoPipelineFastTests(
num_hidden_layers=2,
image_size=224,
)
llava_text_encoder_config = LlavaConfig(
vision_config=vision_config, text_config=text_config, pad_token_id=100, image_token_index=101
)
clip_text_encoder_config = CLIPTextConfig(
bos_token_id=0,


@@ -171,6 +171,7 @@ class LTX2PipelineFastTests(PipelineTesterMixin, unittest.TestCase):
"tokenizer": tokenizer,
"connectors": connectors,
"vocoder": vocoder,
"processor": None,
}
return components

View File

@@ -171,6 +171,7 @@ class LTX2ImageToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
"tokenizer": tokenizer,
"connectors": connectors,
"vocoder": vocoder,
"processor": None,
}
return components