Compare commits

...

15 Commits

Author SHA1 Message Date
DN6
1d75ab7e35 update 2026-03-17 23:23:01 +05:30
DN6
b086b6da0a update 2026-03-17 15:40:27 +05:30
DN6
28c7516229 update 2026-03-17 15:38:03 +05:30
DN6
53b9b56059 update 2026-03-17 15:33:17 +05:30
DN6
65579667e9 update 2026-03-17 15:31:30 +05:30
DN6
cda1c36eeb update 2026-03-17 13:41:49 +05:30
DN6
f634485333 Merge branch 'main' into make-tiny-model 2026-03-17 13:37:02 +05:30
DN6
7820980959 update 2026-03-17 13:36:49 +05:30
Sayak Paul
16e7067647 [tests] fix llava kwargs in the hunyuan tests (#13275)
fix llava kwargs in the hunyuan tests
2026-03-17 10:11:47 +05:30
Dhruv Nair
d1b3555c29 [Modular] Fix dtype assignment when type hint is AutoModel (#13271)
* update

* update
2026-03-17 09:47:53 +05:30
Wang, Yi
9677859ebf fix parallelism case failure in xpu (#13270)
* fix parallelism case failure in xpu

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

* updated

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
2026-03-17 08:52:15 +05:30
Steven Liu
ed31974c3e [docs] updates (#13248)
* fixes

* few more links

* update zh

* fix
2026-03-16 13:24:57 -07:00
YiYi Xu
e5aa719241 Add AGENTS.md (#13259)
* add a draft

* add

* up

* Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2026-03-14 08:35:12 -10:00
teith
4bc1c59a67 fix: correct invalid type annotation for image in Flux2Pipeline.__call__ (#13205)
fix: correct invalid type annotation for image in Flux2Pipeline.__call__
2026-03-13 15:56:38 -03:00
Sayak Paul
764f7ede33 [core] Flux2 klein kv followups (#13264)
* implement Flux2Transformer2DModelOutput.

* add output class to docs.

* add Flux2KleinKV to docs.

* add pipeline tests for klein kv.
2026-03-13 10:05:11 +05:30
24 changed files with 567 additions and 177 deletions

77
.ai/AGENTS.md Normal file
View File

@@ -0,0 +1,77 @@
# Diffusers — Agent Guide
## Coding style
Strive to write code as simple and explicit as possible.
- Minimize small helper/utility functions — inline the logic instead. A reader should be able to follow the full flow without jumping between functions.
- No defensive code or unused code paths — do not add fallback paths, safety checks, or configuration options "just in case". When porting from a research repo, delete training-time code paths, experimental flags, and ablation branches entirely — only keep the inference path you are actually integrating.
- Do not guess user intent and silently correct behavior. Make the expected inputs clear in the docstring, and raise a concise error for unsupported cases rather than adding complex fallback logic.
---
### Dependencies
- No new mandatory dependency without discussion (e.g. `einops`)
- Optional deps guarded with `is_X_available()` and a dummy in `utils/dummy_*.py`
## Code formatting
- `make style` and `make fix-copies` should be run as the final step before opening a PR
### Copied Code
- Many classes are kept in sync with a source via a `# Copied from ...` header comment
- Do not edit a `# Copied from` block directly — run `make fix-copies` to propagate changes from the source
- Remove the header to intentionally break the link
### Models
- All layer calls should be visible directly in `forward` — avoid helper functions that hide `nn.Module` calls.
- Try to not introduce graph breaks as much as possible for better compatibility with `torch.compile`. For example, DO NOT arbitrarily insert operations from NumPy in the forward implementations.
- Attention must follow the diffusers pattern: both the `Attention` class and its processor are defined in the model file. The processor's `__call__` handles the actual compute and must use `dispatch_attention_fn` rather than calling `F.scaled_dot_product_attention` directly. The attention class inherits `AttentionModuleMixin` and declares `_default_processor_cls` and `_available_processors`.
```python
# transformer_mymodel.py
class MyModelAttnProcessor:
_attention_backend = None
_parallel_config = None
def __call__(self, attn, hidden_states, attention_mask=None, ...):
query = attn.to_q(hidden_states)
key = attn.to_k(hidden_states)
value = attn.to_v(hidden_states)
# reshape, apply rope, etc.
hidden_states = dispatch_attention_fn(
query, key, value,
attn_mask=attention_mask,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
hidden_states = hidden_states.flatten(2, 3)
return attn.to_out[0](hidden_states)
class MyModelAttention(nn.Module, AttentionModuleMixin):
_default_processor_cls = MyModelAttnProcessor
_available_processors = [MyModelAttnProcessor]
def __init__(self, query_dim, heads=8, dim_head=64, ...):
super().__init__()
self.to_q = nn.Linear(query_dim, heads * dim_head, bias=False)
self.to_k = nn.Linear(query_dim, heads * dim_head, bias=False)
self.to_v = nn.Linear(query_dim, heads * dim_head, bias=False)
self.to_out = nn.ModuleList([nn.Linear(heads * dim_head, query_dim), nn.Dropout(0.0)])
self.set_processor(MyModelAttnProcessor())
def forward(self, hidden_states, attention_mask=None, **kwargs):
return self.processor(self, hidden_states, attention_mask, **kwargs)
```
Consult the implementations in `src/diffusers/models/transformers/` if you need further references.
### Pipeline
- All pipelines must inherit from `DiffusionPipeline`. Consult implementations in `src/diffusers/pipelines` in case you need references.
- DO NOT use an existing pipeline class (e.g., `FluxPipeline`) to override another pipeline (e.g., `FluxImg2ImgPipeline` which will be a part of the core codebase (`src`).
### Tests
- Slow tests gated with `@slow` and `RUN_SLOW=1`
- All model-level tests must use the `BaseModelTesterConfig`, `ModelTesterMixin`, `MemoryTesterMixin`, `AttentionTesterMixin`, `LoraTesterMixin`, and `TrainingTesterMixin` classes initially to write the tests. Any additional tests should be added after discussions with the maintainers. Use `tests/models/transformers/test_models_transformer_flux.py` as a reference.

6
.gitignore vendored
View File

@@ -178,4 +178,8 @@ tags
.ruff_cache
# wandb
wandb
wandb
# AI agent generated symlinks
/AGENTS.md
/CLAUDE.md

View File

@@ -1,4 +1,4 @@
.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples
.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples codex claude clean-ai
# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
export PYTHONPATH = src
@@ -98,3 +98,14 @@ post-release:
post-patch:
python utils/release.py --post_release --patch
# AI agent symlinks
codex:
ln -snf .ai/AGENTS.md AGENTS.md
claude:
ln -snf .ai/AGENTS.md CLAUDE.md
clean-ai:
rm -f AGENTS.md CLAUDE.md

View File

@@ -22,6 +22,8 @@
title: Reproducibility
- local: using-diffusers/schedulers
title: Schedulers
- local: using-diffusers/guiders
title: Guiders
- local: using-diffusers/automodel
title: AutoModel
- local: using-diffusers/other-formats
@@ -110,8 +112,6 @@
title: ModularPipeline
- local: modular_diffusers/components_manager
title: ComponentsManager
- local: modular_diffusers/guiders
title: Guiders
- local: modular_diffusers/custom_blocks
title: Building Custom Blocks
- local: modular_diffusers/mellon

View File

@@ -17,3 +17,7 @@ A Transformer model for image-like data from [Flux2](https://hf.co/black-forest-
## Flux2Transformer2DModel
[[autodoc]] Flux2Transformer2DModel
## Flux2Transformer2DModelOutput
[[autodoc]] models.transformers.transformer_flux2.Flux2Transformer2DModelOutput

View File

@@ -41,5 +41,11 @@ The [official implementation](https://github.com/black-forest-labs/flux2/blob/5a
## Flux2KleinPipeline
[[autodoc]] Flux2KleinPipeline
- all
- __call__
## Flux2KleinKVPipeline
[[autodoc]] Flux2KleinKVPipeline
- all
- __call__

View File

@@ -99,7 +99,7 @@ To update guider configuration, you can run `pipe.guider = pipe.guider.new(...)`
pipe.guider = pipe.guider.new(guidance_scale=5.0)
```
Read more on Guider [here](../../modular_diffusers/guiders).
Read more on Guider [here](../../using-diffusers/guiders).

View File

@@ -30,7 +30,7 @@ HunyuanImage-2.1 comes in the following variants:
## HunyuanImage-2.1
HunyuanImage-2.1 applies [Adaptive Projected Guidance (APG)](https://huggingface.co/papers/2410.02416) combined with Classifier-Free Guidance (CFG) in the denoising loop. `HunyuanImagePipeline` has a `guider` component (read more about [Guider](../modular_diffusers/guiders.md)) and does not take a `guidance_scale` parameter at runtime. To change guider-related parameters, e.g., `guidance_scale`, you can update the `guider` configuration instead.
HunyuanImage-2.1 applies [Adaptive Projected Guidance (APG)](https://huggingface.co/papers/2410.02416) combined with Classifier-Free Guidance (CFG) in the denoising loop. `HunyuanImagePipeline` has a `guider` component (read more about [Guider](../../using-diffusers/guiders)) and does not take a `guidance_scale` parameter at runtime. To change guider-related parameters, e.g., `guidance_scale`, you can update the `guider` configuration instead.
```python
import torch

View File

@@ -565,4 +565,16 @@ $ git push --set-upstream origin your-branch-for-syncing
### Style guide
For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
## Coding with AI agents
The repository keeps AI-agent configuration in `.ai/` and exposes local agent files via symlinks.
- **Source of truth** — edit `.ai/AGENTS.md` (and any future `.ai/skills/`)
- **Don't edit** generated root-level `AGENTS.md` or `CLAUDE.md` — they are symlinks
- Setup commands:
- `make codex` — symlink for OpenAI Codex
- `make claude` — symlink for Claude Code
- `make clean-ai` — remove generated symlinks

View File

@@ -338,7 +338,7 @@ guider = ClassifierFreeGuidance(guidance_scale=5.0)
pipeline.update_components(guider=guider)
```
See the [Guiders](./guiders) guide for more details on available guiders and how to configure them.
See the [Guiders](../using-diffusers/guiders) guide for more details on available guiders and how to configure them.
## Splitting a pipeline into stages

View File

@@ -39,7 +39,7 @@ The Modular Diffusers docs are organized as shown below.
- [ModularPipeline](./modular_pipeline) shows you how to create and convert pipeline blocks into an executable [`ModularPipeline`].
- [ComponentsManager](./components_manager) shows you how to manage and reuse components across multiple pipelines.
- [Guiders](./guiders) shows you how to use different guidance methods in the pipeline.
- [Guiders](../using-diffusers/guiders) shows you how to use different guidance methods in the pipeline.
## Mellon Integration

View File

@@ -482,144 +482,6 @@ print(
) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
```
## torch.jit.trace
[torch.jit.trace](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) records the operations a model performs on a sample input and creates a new, optimized representation of the model based on the recorded execution path. During tracing, the model is optimized to reduce overhead from Python and dynamic control flows and operations are fused together for more efficiency. The returned executable or [ScriptFunction](https://pytorch.org/docs/stable/generated/torch.jit.ScriptFunction.html) can be compiled.
```py
import time
import torch
from diffusers import StableDiffusionPipeline
import functools
# torch disable grad
torch.set_grad_enabled(False)
# set variables
n_experiments = 2
unet_runs_per_experiment = 50
# load sample inputs
def generate_inputs():
sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
return sample, timestep, encoder_hidden_states
pipeline = StableDiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
).to("cuda")
unet = pipeline.unet
unet.eval()
unet.to(memory_format=torch.channels_last) # use channels_last memory format
unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default
# warmup
for _ in range(3):
with torch.inference_mode():
inputs = generate_inputs()
orig_output = unet(*inputs)
# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")
# warmup and optimize graph
for _ in range(5):
with torch.inference_mode():
inputs = generate_inputs()
orig_output = unet_traced(*inputs)
# benchmarking
with torch.inference_mode():
for _ in range(n_experiments):
torch.cuda.synchronize()
start_time = time.time()
for _ in range(unet_runs_per_experiment):
orig_output = unet_traced(*inputs)
torch.cuda.synchronize()
print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
for _ in range(n_experiments):
torch.cuda.synchronize()
start_time = time.time()
for _ in range(unet_runs_per_experiment):
orig_output = unet(*inputs)
torch.cuda.synchronize()
print(f"unet inference took {time.time() - start_time:.2f} seconds")
# save the model
unet_traced.save("unet_traced.pt")
```
Replace the pipeline's UNet with the traced version.
```py
import torch
from diffusers import StableDiffusionPipeline
from dataclasses import dataclass
@dataclass
class UNet2DConditionOutput:
sample: torch.Tensor
pipeline = StableDiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16,
use_safetensors=True,
).to("cuda")
# use jitted unet
unet_traced = torch.jit.load("unet_traced.pt")
# del pipeline.unet
class TracedUNet(torch.nn.Module):
def __init__(self):
super().__init__()
self.in_channels = pipe.unet.config.in_channels
self.device = pipe.unet.device
def forward(self, latent_model_input, t, encoder_hidden_states):
sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
return UNet2DConditionOutput(sample=sample)
pipeline.unet = TracedUNet()
with torch.inference_mode():
image = pipe([prompt] * 1, num_inference_steps=50).images[0]
```
## Memory-efficient attention
> [!TIP]
> Memory-efficient attention optimizes for memory usage *and* [inference speed](./fp16#scaled-dot-product-attention)!
The Transformers attention mechanism is memory-intensive, especially for long sequences, so you can try using different and more memory-efficient attention types.
By default, if PyTorch >= 2.0 is installed, [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) is used. You don't need to make any additional changes to your code.
SDPA supports [FlashAttention](https://github.com/Dao-AILab/flash-attention) and [xFormers](https://github.com/facebookresearch/xformers) as well as a native C++ PyTorch implementation. It automatically selects the most optimal implementation based on your input.
You can explicitly use xFormers with the [`~ModelMixin.enable_xformers_memory_efficient_attention`] method.
```py
# pip install xformers
import torch
from diffusers import StableDiffusionXLPipeline
pipeline = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_xformers_memory_efficient_attention()
```
Call [`~ModelMixin.disable_xformers_memory_efficient_attention`] to disable it.
```py
pipeline.disable_xformers_memory_efficient_attention()
```
Diffusers supports multiple memory-efficient attention backends (FlashAttention, xFormers, SageAttention, and more) through [`~ModelMixin.set_attention_backend`]. Refer to the [Attention backends](./attention_backends) guide to learn how to switch between them.

View File

@@ -23,7 +23,7 @@ pip install xformers
> [!TIP]
> The xFormers `pip` package requires the latest version of PyTorch. If you need to use a previous version of PyTorch, then we recommend [installing xFormers from the source](https://github.com/facebookresearch/xformers#installing-xformers).
After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption as shown in this [section](memory#memory-efficient-attention).
After xFormers is installed, you can use it with [`~ModelMixin.set_attention_backend`] as shown in the [Attention backends](./attention_backends) guide.
> [!WARNING]
> According to this [issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or DreamBooth) in some GPUs. If you observe this problem, please install a development version as indicated in the issue comments.

View File

@@ -14,6 +14,8 @@
sections:
- local: using-diffusers/schedulers
title: Load schedulers and models
- local: using-diffusers/guiders
title: Guiders
- title: Inference
isExpanded: false
@@ -80,8 +82,6 @@
title: ModularPipeline
- local: modular_diffusers/components_manager
title: ComponentsManager
- local: modular_diffusers/guiders
title: Guiders
- title: Training
isExpanded: false

View File

@@ -13,6 +13,7 @@
# limitations under the License.
import inspect
from dataclasses import dataclass
from typing import Any
import torch
@@ -21,7 +22,7 @@ import torch.nn.functional as F
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FluxTransformer2DLoadersMixin, FromOriginalModelMixin, PeftAdapterMixin
from ...utils import apply_lora_scale, logging
from ...utils import BaseOutput, apply_lora_scale, logging
from .._modeling_parallel import ContextParallelInput, ContextParallelOutput
from ..attention import AttentionMixin, AttentionModuleMixin
from ..attention_dispatch import dispatch_attention_fn
@@ -32,7 +33,6 @@ from ..embeddings import (
apply_rotary_emb,
get_1d_rotary_pos_embed,
)
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
from ..normalization import AdaLayerNormContinuous
@@ -40,6 +40,22 @@ from ..normalization import AdaLayerNormContinuous
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
@dataclass
class Flux2Transformer2DModelOutput(BaseOutput):
"""
The output of [`Flux2Transformer2DModel`].
Args:
sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)`):
The hidden states output conditioned on the `encoder_hidden_states` input.
kv_cache (`Flux2KVCache`, *optional*):
The populated KV cache for reference image tokens. Only returned when `kv_cache_mode="extract"`.
"""
sample: "torch.Tensor" # noqa: F821
kv_cache: "Flux2KVCache | None" = None
class Flux2KVLayerCache:
"""Per-layer KV cache for reference image tokens in the Flux2 Klein KV model.
@@ -1174,7 +1190,7 @@ class Flux2Transformer2DModel(
kv_cache_mode: str | None = None,
num_ref_tokens: int = 0,
ref_fixed_timestep: float = 0.0,
) -> torch.Tensor | Transformer2DModelOutput:
) -> torch.Tensor | Flux2Transformer2DModelOutput:
"""
The [`Flux2Transformer2DModel`] forward method.
@@ -1356,10 +1372,10 @@ class Flux2Transformer2DModel(
if kv_cache_mode == "extract":
if not return_dict:
return (output,), kv_cache
return Transformer2DModelOutput(sample=output), kv_cache
return (output, kv_cache)
return Flux2Transformer2DModelOutput(sample=output, kv_cache=kv_cache)
if not return_dict:
return (output,)
return Transformer2DModelOutput(sample=output)
return Flux2Transformer2DModelOutput(sample=output)

View File

@@ -309,16 +309,16 @@ class ComponentSpec:
f"`type_hint` is required when loading a single file model but is missing for component: {self.name}"
)
from diffusers import AutoModel
# `torch_dtype` is not an accepted parameter for tokenizers and processors.
# As a result, it gets stored in `init_kwargs`, which are written to the config
# during save. This causes JSON serialization to fail when saving the component.
if self.type_hint is not None and not issubclass(self.type_hint, torch.nn.Module):
if self.type_hint is not None and not issubclass(self.type_hint, (torch.nn.Module, AutoModel)):
kwargs.pop("torch_dtype", None)
if self.type_hint is None:
try:
from diffusers import AutoModel
component = AutoModel.from_pretrained(pretrained_model_name_or_path, **load_kwargs, **kwargs)
except Exception as e:
raise ValueError(f"Unable to load {self.name} without `type_hint`: {e}")
@@ -332,12 +332,6 @@ class ComponentSpec:
else getattr(self.type_hint, "from_pretrained")
)
# `torch_dtype` is not an accepted parameter for tokenizers and processors.
# As a result, it gets stored in `init_kwargs`, which are written to the config
# during save. This causes JSON serialization to fail when saving the component.
if not issubclass(self.type_hint, torch.nn.Module):
kwargs.pop("torch_dtype", None)
try:
component = load_method(pretrained_model_name_or_path, **load_kwargs, **kwargs)
except Exception as e:

View File

@@ -744,7 +744,7 @@ class Flux2Pipeline(DiffusionPipeline, Flux2LoraLoaderMixin):
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
image: list[PIL.Image.Image, PIL.Image.Image] | None = None,
image: PIL.Image.Image | list[PIL.Image.Image] | None = None,
prompt: str | list[str] = None,
height: int | None = None,
width: int | None = None,

View File

@@ -793,7 +793,7 @@ class Flux2KleinKVPipeline(DiffusionPipeline, Flux2LoraLoaderMixin):
latent_model_input = torch.cat([image_latents, latents], dim=1).to(self.transformer.dtype)
latent_image_ids = torch.cat([image_latent_ids, latent_ids], dim=1)
output, kv_cache = self.transformer(
noise_pred, kv_cache = self.transformer(
hidden_states=latent_model_input,
timestep=timestep / 1000,
guidance=None,
@@ -805,7 +805,6 @@ class Flux2KleinKVPipeline(DiffusionPipeline, Flux2LoraLoaderMixin):
kv_cache_mode="extract",
num_ref_tokens=image_latents.shape[1],
)
noise_pred = output[0]
elif kv_cache is not None:
# Steps 1+: use cached ref KV, no ref tokens in input

View File

@@ -26,9 +26,17 @@ from diffusers.models._modeling_parallel import ContextParallelConfig
from ...testing_utils import (
is_context_parallel,
require_torch_multi_accelerator,
torch_device,
)
# Device configuration mapping
DEVICE_CONFIG = {
"cuda": {"backend": "nccl", "module": torch.cuda},
"xpu": {"backend": "xccl", "module": torch.xpu},
}
def _find_free_port():
"""Find a free port on localhost."""
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
@@ -47,12 +55,17 @@ def _context_parallel_worker(rank, world_size, master_port, model_class, init_di
os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
# Get device configuration
device_config = DEVICE_CONFIG.get(torch_device, DEVICE_CONFIG["cuda"])
backend = device_config["backend"]
device_module = device_config["module"]
# Initialize process group
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
# Set device for this process
torch.cuda.set_device(rank)
device = torch.device(f"cuda:{rank}")
device_module.set_device(rank)
device = torch.device(f"{torch_device}:{rank}")
# Create model
model = model_class(**init_dict)
@@ -103,10 +116,16 @@ def _custom_mesh_worker(
os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
# Get device configuration
device_config = DEVICE_CONFIG.get(torch_device, DEVICE_CONFIG["cuda"])
backend = device_config["backend"]
device_module = device_config["module"]
torch.cuda.set_device(rank)
device = torch.device(f"cuda:{rank}")
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
# Set device for this process
device_module.set_device(rank)
device = torch.device(f"{torch_device}:{rank}")
model = model_class(**init_dict)
model.to(device)
@@ -116,7 +135,7 @@ def _custom_mesh_worker(
# DeviceMesh must be created after init_process_group, inside each worker process.
mesh = torch.distributed.device_mesh.init_device_mesh(
"cuda", mesh_shape=mesh_shape, mesh_dim_names=mesh_dim_names
torch_device, mesh_shape=mesh_shape, mesh_dim_names=mesh_dim_names
)
cp_config = ContextParallelConfig(**cp_dict, mesh=mesh)
model.enable_parallelism(config=cp_config)

View File

@@ -0,0 +1,174 @@
import unittest
import numpy as np
import torch
from PIL import Image
from transformers import Qwen2TokenizerFast, Qwen3Config, Qwen3ForCausalLM
from diffusers import (
AutoencoderKLFlux2,
FlowMatchEulerDiscreteScheduler,
Flux2KleinKVPipeline,
Flux2Transformer2DModel,
)
from ...testing_utils import torch_device
from ..test_pipelines_common import PipelineTesterMixin, check_qkv_fused_layers_exist
class Flux2KleinKVPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = Flux2KleinKVPipeline
params = frozenset(["prompt", "height", "width", "prompt_embeds", "image"])
batch_params = frozenset(["prompt"])
test_xformers_attention = False
test_layerwise_casting = True
test_group_offloading = True
supports_dduf = False
def get_dummy_components(self, num_layers: int = 1, num_single_layers: int = 1):
torch.manual_seed(0)
transformer = Flux2Transformer2DModel(
patch_size=1,
in_channels=4,
num_layers=num_layers,
num_single_layers=num_single_layers,
attention_head_dim=16,
num_attention_heads=2,
joint_attention_dim=16,
timestep_guidance_channels=256,
axes_dims_rope=[4, 4, 4, 4],
guidance_embeds=False,
)
# Create minimal Qwen3 config
config = Qwen3Config(
intermediate_size=16,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=2,
num_key_value_heads=2,
vocab_size=151936,
max_position_embeddings=512,
)
torch.manual_seed(0)
text_encoder = Qwen3ForCausalLM(config)
# Use a simple tokenizer for testing
tokenizer = Qwen2TokenizerFast.from_pretrained(
"hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration"
)
torch.manual_seed(0)
vae = AutoencoderKLFlux2(
sample_size=32,
in_channels=3,
out_channels=3,
down_block_types=("DownEncoderBlock2D",),
up_block_types=("UpDecoderBlock2D",),
block_out_channels=(4,),
layers_per_block=1,
latent_channels=1,
norm_num_groups=1,
use_quant_conv=False,
use_post_quant_conv=False,
)
scheduler = FlowMatchEulerDiscreteScheduler()
return {
"scheduler": scheduler,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
"transformer": transformer,
"vae": vae,
}
def get_dummy_inputs(self, device, seed=0):
if str(device).startswith("mps"):
generator = torch.manual_seed(seed)
else:
generator = torch.Generator(device="cpu").manual_seed(seed)
inputs = {
"prompt": "a dog is dancing",
"image": Image.new("RGB", (64, 64)),
"generator": generator,
"num_inference_steps": 2,
"height": 8,
"width": 8,
"max_sequence_length": 64,
"output_type": "np",
"text_encoder_out_layers": (1,),
}
return inputs
def test_fused_qkv_projections(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
image = pipe(**inputs).images
original_image_slice = image[0, -3:, -3:, -1]
pipe.transformer.fuse_qkv_projections()
self.assertTrue(
check_qkv_fused_layers_exist(pipe.transformer, ["to_qkv"]),
("Something wrong with the fused attention layers. Expected all the attention projections to be fused."),
)
inputs = self.get_dummy_inputs(device)
image = pipe(**inputs).images
image_slice_fused = image[0, -3:, -3:, -1]
pipe.transformer.unfuse_qkv_projections()
inputs = self.get_dummy_inputs(device)
image = pipe(**inputs).images
image_slice_disabled = image[0, -3:, -3:, -1]
self.assertTrue(
np.allclose(original_image_slice, image_slice_fused, atol=1e-3, rtol=1e-3),
("Fusion of QKV projections shouldn't affect the outputs."),
)
self.assertTrue(
np.allclose(image_slice_fused, image_slice_disabled, atol=1e-3, rtol=1e-3),
("Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled."),
)
self.assertTrue(
np.allclose(original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2),
("Original outputs should match when fused QKV projections are disabled."),
)
def test_image_output_shape(self):
pipe = self.pipeline_class(**self.get_dummy_components()).to(torch_device)
inputs = self.get_dummy_inputs(torch_device)
height_width_pairs = [(32, 32), (72, 57)]
for height, width in height_width_pairs:
expected_height = height - height % (pipe.vae_scale_factor * 2)
expected_width = width - width % (pipe.vae_scale_factor * 2)
inputs.update({"height": height, "width": width})
image = pipe(**inputs).images[0]
output_height, output_width, _ = image.shape
self.assertEqual(
(output_height, output_width),
(expected_height, expected_width),
f"Output shape {image.shape} does not match expected shape {(expected_height, expected_width)}",
)
def test_without_image(self):
device = "cpu"
pipe = self.pipeline_class(**self.get_dummy_components()).to(device)
inputs = self.get_dummy_inputs(device)
del inputs["image"]
image = pipe(**inputs).images
self.assertEqual(image.shape, (1, 8, 8, 3))
@unittest.skip("Needs to be revisited")
def test_encode_prompt_works_in_isolation(self):
pass

View File

@@ -139,7 +139,9 @@ class HunyuanVideoImageToVideoPipelineFastTests(
num_hidden_layers=2,
image_size=224,
)
llava_text_encoder_config = LlavaConfig(vision_config, text_config, pad_token_id=100, image_token_index=101)
llava_text_encoder_config = LlavaConfig(
vision_config=vision_config, text_config=text_config, pad_token_id=100, image_token_index=101
)
clip_text_encoder_config = CLIPTextConfig(
bos_token_id=0,

210
utils/make_tiny_model.py Normal file
View File

@@ -0,0 +1,210 @@
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "diffusers",
# "torch",
# "huggingface_hub",
# "accelerate",
# "transformers",
# "sentencepiece",
# "protobuf",
# ]
# ///
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utility script to create tiny versions of diffusers models by reducing layer counts.
Can be run locally or submitted as an HF Job via `--launch`.
Usage:
# Run locally
python make_tiny_model.py --model_repo_id <model_repo_id> --output_repo_id <output_repo_id> [--subfolder transformer] [--num_layers 2]
# Push to Hub
python make_tiny_model.py --model_repo_id <model_repo_id> --output_repo_id <output_repo_id> --push_to_hub --token $HF_TOKEN
# Submit as an HF Job
python make_tiny_model.py --model_repo_id <model_repo_id> --output_repo_id <output_repo_id> --launch [--flavor cpu-basic]
"""
import argparse
import os
import re
LAYER_PARAM_PATTERN = re.compile(r"^(num_.*layers?|n_layers|n_refiner_layers)$")
DIM_PARAM_PATTERNS = {
re.compile(r"^num_attention_heads$"): 2,
re.compile(r"^num_.*attention_heads$"): 2,
re.compile(r"^num_key_value_heads$"): 2,
re.compile(r"^num_kv_heads$"): 1,
re.compile(r"^n_heads$"): 2,
re.compile(r"^n_kv_heads$"): 2,
re.compile(r"^attention_head_dim$"): 8,
re.compile(r"^.*attention_head_dim$"): 4,
re.compile(r"^cross_attention_dim.*$"): 8,
re.compile(r"^joint_attention_dim$"): 32,
re.compile(r"^pooled_projection_dim$"): 32,
re.compile(r"^caption_projection_dim$"): 32,
re.compile(r"^caption_channels$"): 8,
re.compile(r"^cap_feat_dim$"): 16,
re.compile(r"^hidden_size$"): 16,
re.compile(r"^dim$"): 16,
re.compile(r"^.*embed_dim$"): 16,
re.compile(r"^.*embed_.*dim$"): 16,
re.compile(r"^text_dim$"): 16,
re.compile(r"^time_embed_dim$"): 4,
re.compile(r"^ffn_dim$"): 32,
re.compile(r"^intermediate_size$"): 32,
re.compile(r"^sample_size$"): 32,
}
def parse_args():
parser = argparse.ArgumentParser(description="Create a tiny version of a diffusers model.")
parser.add_argument("--model_repo_id", type=str, required=True, help="HuggingFace repo ID of the source model.")
parser.add_argument(
"--output_repo_id",
type=str,
required=True,
help="HuggingFace repo ID or local path to save the tiny model to.",
)
parser.add_argument("--subfolder", type=str, default=None, help="Subfolder within the model repo.")
parser.add_argument("--num_layers", type=int, default=2, help="Number of layers to use for the tiny model.")
parser.add_argument(
"--shrink_dims",
action="store_true",
help="Also reduce dimension parameters (attention heads, hidden size, embedding dims, etc.).",
)
parser.add_argument("--push_to_hub", action="store_true", help="Push the tiny model to the HuggingFace Hub.")
parser.add_argument(
"--token", type=str, default=None, help="HuggingFace token. Defaults to $HF_TOKEN env var if not provided."
)
launch_group = parser.add_argument_group("HF Jobs launch options")
launch_group.add_argument("--launch", action="store_true", help="Submit as an HF Job instead of running locally.")
launch_group.add_argument("--flavor", type=str, default="cpu-basic", help="HF Jobs hardware flavor.")
launch_group.add_argument("--timeout", type=str, default="30m", help="HF Jobs timeout.")
args = parser.parse_args()
if args.token is None:
args.token = os.environ.get("HF_TOKEN")
return args
def launch_job(args):
from huggingface_hub import run_uv_job
script_args = [
"--model_repo_id",
args.model_repo_id,
"--output_repo_id",
args.output_repo_id,
"--num_layers",
str(args.num_layers),
]
if args.subfolder:
script_args.extend(["--subfolder", args.subfolder])
if args.shrink_dims:
script_args.append("--shrink_dims")
if args.push_to_hub:
script_args.append("--push_to_hub")
job = run_uv_job(
__file__,
script_args=script_args,
flavor=args.flavor,
timeout=args.timeout,
secrets={"HF_TOKEN": args.token} if args.token else {},
)
print(f"Job submitted: {job.url}")
print(f"Job ID: {job.id}")
return job
def make_tiny_model(
model_repo_id, output_repo_id, subfolder=None, num_layers=2, shrink_dims=False, push_to_hub=False, token=None
):
from diffusers import AutoModel
config_kwargs = {}
if token:
config_kwargs["token"] = token
config = AutoModel.load_config(model_repo_id, subfolder=subfolder, **config_kwargs)
modified_keys = {}
for key, value in config.items():
if LAYER_PARAM_PATTERN.match(key) and isinstance(value, int) and value > num_layers:
modified_keys[key] = (value, num_layers)
config[key] = num_layers
if shrink_dims:
for key, value in config.items():
if not isinstance(value, int) or key.startswith("_"):
continue
for pattern, tiny_value in DIM_PARAM_PATTERNS.items():
if pattern.match(key) and value > tiny_value:
modified_keys[key] = (value, tiny_value)
config[key] = tiny_value
break
if not modified_keys:
print("WARNING: No config parameters were modified.")
print(f"Config keys: {[k for k in config if not k.startswith('_')]}")
return
print("Modified config parameters:")
for key, (old, new) in modified_keys.items():
print(f" {key}: {old} -> {new}")
model = AutoModel.from_config(config)
total_params = sum(p.numel() for p in model.parameters())
print(f"Tiny model created with {total_params:,} parameters.")
save_kwargs = {}
if token:
save_kwargs["token"] = token
if push_to_hub:
save_kwargs["repo_id"] = output_repo_id
model.save_pretrained(output_repo_id, push_to_hub=push_to_hub, **save_kwargs)
if push_to_hub:
print(f"Model pushed to https://huggingface.co/{output_repo_id}")
else:
print(f"Model saved to {output_repo_id}")
def main():
args = parse_args()
if args.launch:
launch_job(args)
else:
make_tiny_model(
model_repo_id=args.model_repo_id,
output_repo_id=args.output_repo_id,
subfolder=args.subfolder,
num_layers=args.num_layers,
shrink_dims=args.shrink_dims,
push_to_hub=args.push_to_hub,
token=args.token,
)
if __name__ == "__main__":
main()