Compare commits

..

1 Commits

Author SHA1 Message Date
sayakpaul
2532668363 up 2025-12-05 21:47:31 +07:00
26 changed files with 56 additions and 1747 deletions

View File

@@ -20,7 +20,7 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: '3.8'
- name: Fetch latest branch
id: fetch_latest_branch
@@ -54,6 +54,7 @@ jobs:
python -m pip install --upgrade pip
pip install -U setuptools wheel twine
pip install -U torch --index-url https://download.pytorch.org/whl/cpu
pip install -U transformers
- name: Build the dist files
run: python setup.py bdist_wheel && python setup.py sdist
@@ -68,8 +69,6 @@ jobs:
run: |
pip install diffusers && pip uninstall diffusers -y
pip install -i https://test.pypi.org/simple/ diffusers
pip install -U transformers
python utils/print_env.py
python -c "from diffusers import __version__; print(__version__)"
python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()"
python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')"

View File

@@ -34,9 +34,3 @@ Cache methods speed up diffusion transformers by storing and reusing intermediate
[[autodoc]] FirstBlockCacheConfig
[[autodoc]] apply_first_block_cache
### TaylorSeerCacheConfig
[[autodoc]] TaylorSeerCacheConfig
[[autodoc]] apply_taylorseer_cache

View File

@@ -26,41 +26,8 @@ specific language governing permissions and limitations under the License.
Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers sub-second inference latency on enterprise-grade H800 GPUs and fits comfortably on consumer devices with 16GB of VRAM. It excels at photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.
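For orientation, here is a minimal text-to-image sketch with [`ZImagePipeline`]. It assumes the pipeline accepts the same `num_inference_steps`, `guidance_scale`, and `generator` arguments as the image-to-image example below, so treat it as illustrative rather than canonical usage.
```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A photorealistic portrait of an elderly fisherman at golden hour, detailed skin texture"
image = pipe(
    prompt,
    num_inference_steps=9,  # the Turbo model is distilled for very few function evaluations
    guidance_scale=0.0,     # distilled models are typically run without classifier-free guidance
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("zimage_t2i.png")
```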
## Image-to-image
Use [`ZImageImg2ImgPipeline`] to transform an existing image based on a text prompt.
```python
import torch
from diffusers import ZImageImg2ImgPipeline
from diffusers.utils import load_image
pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = load_image(url).resize((1024, 1024))
prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
image = pipe(
prompt,
image=init_image,
strength=0.6,
num_inference_steps=9,
guidance_scale=0.0,
generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_img2img.png")
```
## ZImagePipeline
[[autodoc]] ZImagePipeline
- all
- __call__
## ZImageImg2ImgPipeline
[[autodoc]] ZImageImg2ImgPipeline
- all
- __call__

View File

@@ -67,34 +67,3 @@ config = FasterCacheConfig(
)
pipeline.transformer.enable_cache(config)
```
## TaylorSeer Cache
[TaylorSeer Cache](https://huggingface.co/papers/2503.06923) accelerates diffusion inference by using Taylor series expansions to approximate and cache intermediate activations across denoising steps. The method predicts future outputs from past computations, reusing them at specified intervals to avoid redundant calculations.
This caching mechanism delivers strong results with minimal additional memory overhead. For detailed performance analysis, see [our findings here](https://github.com/huggingface/diffusers/pull/12648#issuecomment-3610615080).
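In sketch form (and not the library hook itself), a first-order variant caches the last full output together with a finite-difference estimate of its rate of change, then extrapolates later steps instead of recomputing them:
```python
import torch

# Hypothetical first-order TaylorSeer-style extrapolation (max_order=1).
y_prev = torch.randn(2, 16)          # module output at an earlier full step
y_last = torch.randn(2, 16)          # module output at the most recent full step
t_prev, t_last, t_now = 3, 4, 7      # denoising step indices

first_order = (y_last - y_prev) / (t_last - t_prev)  # divided difference (the cached "Taylor factor")
step_offset = t_now - t_last
y_predicted = y_last + first_order * step_offset      # reused in place of a full forward pass
```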
To enable TaylorSeer Cache, create a [`TaylorSeerCacheConfig`] and pass it to your pipeline's transformer:
- `cache_interval`: Number of steps to reuse cached outputs before performing a full forward pass
- `disable_cache_before_step`: Initial steps that use full computations to gather data for approximations
- `max_order`: Approximation accuracy (in theory, higher values improve quality but also increase memory usage; we recommend setting it to `1`)
```python
import torch
from diffusers import FluxPipeline, TaylorSeerCacheConfig
pipe = FluxPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
config = TaylorSeerCacheConfig(
cache_interval=5,
max_order=1,
disable_cache_before_step=10,
taylor_factors_dtype=torch.bfloat16,
)
pipe.transformer.enable_cache(config)
```
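Assuming the transformer exposes the usual `CacheMixin.disable_cache` method, the hooks can be removed again once you no longer want the approximation:
```python
pipe.transformer.disable_cache()
```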

View File

@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->
# NVIDIA ModelOpt
[NVIDIA-ModelOpt](https://github.com/NVIDIA/Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
Before you begin, make sure you have `nvidia_modelopt` installed.
@@ -57,7 +57,7 @@ image.save("output.png")
>
> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
>
> More details can be found [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples).
> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).
## NVIDIAModelOptConfig
@@ -86,7 +86,7 @@ The quantization methods supported are as follows:
| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | only `channel_quantize = -1` is supported for now |
Refer to the [official modelopt documentation](https://nvidia.github.io/Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
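As a rough illustration of how these options fit together, the sketch below assumes `NVIDIAModelOptConfig` accepts the `quant_type`, `channel_quantize`, and `block_quantize` arguments listed in the table and is passed through the usual `quantization_config` argument of `from_pretrained`; consult the API reference above for the exact signature and supported values.
```python
import torch
from diffusers import FluxTransformer2DModel, NVIDIAModelOptConfig

# Hypothetical NVFP4 block quantization, using the argument names from the table above.
quant_config = NVIDIAModelOptConfig(
    quant_type="NVFP4",
    channel_quantize=-1,  # per the table note, only -1 is supported for now
    block_quantize=16,    # assumed block size; check the modelopt docs for valid values
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```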
## Serializing and Deserializing quantized models

View File

@@ -69,11 +69,6 @@ TRANSFORMER_CONFIGS = {
"target_size": 960,
"task_type": "i2v",
},
"480p_i2v_step_distilled": {
"target_size": 640,
"task_type": "i2v",
"use_meanflow": True,
},
}
SCHEDULER_CONFIGS = {
@@ -98,9 +93,6 @@ SCHEDULER_CONFIGS = {
"720p_i2v_distilled": {
"shift": 7.0,
},
"480p_i2v_step_distilled": {
"shift": 7.0,
},
}
GUIDANCE_CONFIGS = {
@@ -125,9 +117,6 @@ GUIDANCE_CONFIGS = {
"720p_i2v_distilled": {
"guidance_scale": 1.0,
},
"480p_i2v_step_distilled": {
"guidance_scale": 1.0,
},
}
@@ -137,7 +126,7 @@ def swap_scale_shift(weight):
return new_weight
def convert_hyvideo15_transformer_to_diffusers(original_state_dict, config=None):
def convert_hyvideo15_transformer_to_diffusers(original_state_dict):
"""
Convert HunyuanVideo 1.5 original checkpoint to Diffusers format.
"""
@@ -153,20 +142,6 @@ def convert_hyvideo15_transformer_to_diffusers(original_state_dict, config=None)
)
converted_state_dict["time_embed.timestep_embedder.linear_2.bias"] = original_state_dict.pop("time_in.mlp.2.bias")
if config.use_meanflow:
converted_state_dict["time_embed.timestep_embedder_r.linear_1.weight"] = original_state_dict.pop(
"time_r_in.mlp.0.weight"
)
converted_state_dict["time_embed.timestep_embedder_r.linear_1.bias"] = original_state_dict.pop(
"time_r_in.mlp.0.bias"
)
converted_state_dict["time_embed.timestep_embedder_r.linear_2.weight"] = original_state_dict.pop(
"time_r_in.mlp.2.weight"
)
converted_state_dict["time_embed.timestep_embedder_r.linear_2.bias"] = original_state_dict.pop(
"time_r_in.mlp.2.bias"
)
# 2. context_embedder.time_text_embed.timestep_embedder <- txt_in.t_embedder
converted_state_dict["context_embedder.time_text_embed.timestep_embedder.linear_1.weight"] = (
original_state_dict.pop("txt_in.t_embedder.mlp.0.weight")
@@ -652,7 +627,7 @@ def convert_transformer(args):
config = TRANSFORMER_CONFIGS[args.transformer_type]
with init_empty_weights():
transformer = HunyuanVideo15Transformer3DModel(**config)
state_dict = convert_hyvideo15_transformer_to_diffusers(original_state_dict, config=transformer.config)
state_dict = convert_hyvideo15_transformer_to_diffusers(original_state_dict)
transformer.load_state_dict(state_dict, strict=True, assign=True)
return transformer

View File

@@ -169,12 +169,10 @@ else:
"LayerSkipConfig",
"PyramidAttentionBroadcastConfig",
"SmoothedEnergyGuidanceConfig",
"TaylorSeerCacheConfig",
"apply_faster_cache",
"apply_first_block_cache",
"apply_layer_skip",
"apply_pyramid_attention_broadcast",
"apply_taylorseer_cache",
]
)
_import_structure["models"].extend(
@@ -662,7 +660,6 @@ else:
"WuerstchenCombinedPipeline",
"WuerstchenDecoderPipeline",
"WuerstchenPriorPipeline",
"ZImageImg2ImgPipeline",
"ZImagePipeline",
]
)
@@ -902,12 +899,10 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
LayerSkipConfig,
PyramidAttentionBroadcastConfig,
SmoothedEnergyGuidanceConfig,
TaylorSeerCacheConfig,
apply_faster_cache,
apply_first_block_cache,
apply_layer_skip,
apply_pyramid_attention_broadcast,
apply_taylorseer_cache,
)
from .models import (
AllegroTransformer3DModel,
@@ -1361,7 +1356,6 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
WuerstchenCombinedPipeline,
WuerstchenDecoderPipeline,
WuerstchenPriorPipeline,
ZImageImg2ImgPipeline,
ZImagePipeline,
)

View File

@@ -25,4 +25,3 @@ if is_torch_available():
from .layerwise_casting import apply_layerwise_casting, apply_layerwise_casting_hook
from .pyramid_attention_broadcast import PyramidAttentionBroadcastConfig, apply_pyramid_attention_broadcast
from .smoothed_energy_guidance_utils import SmoothedEnergyGuidanceConfig
from .taylorseer_cache import TaylorSeerCacheConfig, apply_taylorseer_cache

View File

@@ -1,346 +0,0 @@
import math
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import torch
import torch.nn as nn
from ..utils import logging
from .hooks import HookRegistry, ModelHook, StateManager
logger = logging.get_logger(__name__)
_TAYLORSEER_CACHE_HOOK = "taylorseer_cache"
_SPATIAL_ATTENTION_BLOCK_IDENTIFIERS = (
"^blocks.*attn",
"^transformer_blocks.*attn",
"^single_transformer_blocks.*attn",
)
_TEMPORAL_ATTENTION_BLOCK_IDENTIFIERS = ("^temporal_transformer_blocks.*attn",)
_TRANSFORMER_BLOCK_IDENTIFIERS = _SPATIAL_ATTENTION_BLOCK_IDENTIFIERS + _TEMPORAL_ATTENTION_BLOCK_IDENTIFIERS
_BLOCK_IDENTIFIERS = ("^[^.]*block[^.]*\\.[^.]+$",)
_PROJ_OUT_IDENTIFIERS = ("^proj_out$",)
@dataclass
class TaylorSeerCacheConfig:
"""
Configuration for TaylorSeer cache. See: https://huggingface.co/papers/2503.06923
Attributes:
cache_interval (`int`, defaults to `5`):
The interval between full computation steps. After a full computation, the cached (predicted) outputs are
reused for this many subsequent denoising steps before refreshing with a new full forward pass.
disable_cache_before_step (`int`, defaults to `3`):
The denoising step index before which caching is disabled, meaning full computation is performed for the
initial steps (0 to disable_cache_before_step - 1) to gather data for Taylor series approximations. During
these steps, Taylor factors are updated, but caching/predictions are not applied. Caching begins at this
step.
disable_cache_after_step (`int`, *optional*, defaults to `None`):
The denoising step index after which caching is disabled. If set, for steps >= this value, all modules run
full computations without predictions or state updates, ensuring accuracy in later stages if needed.
max_order (`int`, defaults to `1`):
The highest order in the Taylor series expansion for approximating module outputs. Higher orders provide
better approximations but increase computation and memory usage.
taylor_factors_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
Data type used for storing and computing Taylor series factors. Lower precision reduces memory but may
affect stability; higher precision improves accuracy at the cost of more memory.
skip_predict_identifiers (`List[str]`, *optional*, defaults to `None`):
Regex patterns (using `re.fullmatch`) for module names to run in "skip" mode. In this mode,
the module computes fully during initial or refresh steps but returns a zero tensor (matching recorded
shape) during prediction steps to skip computation cheaply.
cache_identifiers (`List[str]`, *optional*, defaults to `None`):
Regex patterns (using `re.fullmatch`) for module names to place in Taylor-series caching mode, where
outputs are approximated and cached for reuse.
use_lite_mode (`bool`, *optional*, defaults to `False`):
Enables a lightweight TaylorSeer variant that minimizes memory usage by applying predefined patterns for
skipping and caching (e.g., skipping blocks and caching projections). This overrides any custom
`skip_predict_identifiers` or `cache_identifiers`.
Notes:
- Patterns are matched using `re.fullmatch` on the module name.
- If `skip_predict_identifiers` or `cache_identifiers` are provided, only matching modules are hooked.
- If neither is provided, all attention-like modules are hooked by default.
Example of inactive and active usage:
```py
def forward(x):
x = self.module1(x) # inactive module: returns zeros tensor based on shape recorded during full compute
x = self.module2(x) # active module: caches output here, avoiding recomputation of prior steps
return x
```
"""
cache_interval: int = 5
disable_cache_before_step: int = 3
disable_cache_after_step: Optional[int] = None
max_order: int = 1
taylor_factors_dtype: Optional[torch.dtype] = torch.bfloat16
skip_predict_identifiers: Optional[List[str]] = None
cache_identifiers: Optional[List[str]] = None
use_lite_mode: bool = False
def __repr__(self) -> str:
return (
"TaylorSeerCacheConfig("
f"cache_interval={self.cache_interval}, "
f"disable_cache_before_step={self.disable_cache_before_step}, "
f"disable_cache_after_step={self.disable_cache_after_step}, "
f"max_order={self.max_order}, "
f"taylor_factors_dtype={self.taylor_factors_dtype}, "
f"skip_predict_identifiers={self.skip_predict_identifiers}, "
f"cache_identifiers={self.cache_identifiers}, "
f"use_lite_mode={self.use_lite_mode})"
)
class TaylorSeerState:
def __init__(
self,
taylor_factors_dtype: Optional[torch.dtype] = torch.bfloat16,
max_order: int = 1,
is_inactive: bool = False,
):
self.taylor_factors_dtype = taylor_factors_dtype
self.max_order = max_order
self.is_inactive = is_inactive
self.module_dtypes: Tuple[torch.dtype, ...] = ()
self.last_update_step: Optional[int] = None
self.taylor_factors: Dict[int, Dict[int, torch.Tensor]] = {}
self.inactive_shapes: Optional[Tuple[Tuple[int, ...], ...]] = None
self.device: Optional[torch.device] = None
self.current_step: int = -1
def reset(self) -> None:
self.current_step = -1
self.last_update_step = None
self.taylor_factors = {}
self.inactive_shapes = None
self.device = None
def update(
self,
outputs: Tuple[torch.Tensor, ...],
) -> None:
self.module_dtypes = tuple(output.dtype for output in outputs)
self.device = outputs[0].device
if self.is_inactive:
self.inactive_shapes = tuple(output.shape for output in outputs)
else:
for i, features in enumerate(outputs):
new_factors: Dict[int, torch.Tensor] = {0: features}
is_first_update = self.last_update_step is None
if not is_first_update:
delta_step = self.current_step - self.last_update_step
if delta_step == 0:
raise ValueError("Delta step cannot be zero for TaylorSeer update.")
# Recursive divided differences up to max_order
prev_factors = self.taylor_factors.get(i, {})
for j in range(self.max_order):
prev = prev_factors.get(j)
if prev is None:
break
new_factors[j + 1] = (new_factors[j] - prev.to(features.dtype)) / delta_step
self.taylor_factors[i] = {
order: factor.to(self.taylor_factors_dtype) for order, factor in new_factors.items()
}
self.last_update_step = self.current_step
@torch.compiler.disable
def predict(self) -> List[torch.Tensor]:
if self.last_update_step is None:
raise ValueError("Cannot predict without prior initialization/update.")
step_offset = self.current_step - self.last_update_step
outputs = []
if self.is_inactive:
if self.inactive_shapes is None:
raise ValueError("Inactive shapes not set during prediction.")
for i in range(len(self.module_dtypes)):
outputs.append(
torch.zeros(
self.inactive_shapes[i],
dtype=self.module_dtypes[i],
device=self.device,
)
)
else:
if not self.taylor_factors:
raise ValueError("Taylor factors empty during prediction.")
num_outputs = len(self.taylor_factors)
num_orders = len(self.taylor_factors[0])
for i in range(num_outputs):
output_dtype = self.module_dtypes[i]
taylor_factors = self.taylor_factors[i]
output = torch.zeros_like(taylor_factors[0], dtype=output_dtype)
for order in range(num_orders):
coeff = (step_offset**order) / math.factorial(order)
factor = taylor_factors[order]
output = output + factor.to(output_dtype) * coeff
outputs.append(output)
return outputs
class TaylorSeerCacheHook(ModelHook):
_is_stateful = True
def __init__(
self,
cache_interval: int,
disable_cache_before_step: int,
taylor_factors_dtype: torch.dtype,
state_manager: StateManager,
disable_cache_after_step: Optional[int] = None,
):
super().__init__()
self.cache_interval = cache_interval
self.disable_cache_before_step = disable_cache_before_step
self.disable_cache_after_step = disable_cache_after_step
self.taylor_factors_dtype = taylor_factors_dtype
self.state_manager = state_manager
def initialize_hook(self, module: torch.nn.Module):
return module
def reset_state(self, module: torch.nn.Module) -> None:
"""
Reset state between sampling runs.
"""
self.state_manager.reset()
@torch.compiler.disable
def _measure_should_compute(self) -> Tuple[bool, "TaylorSeerState"]:
state: TaylorSeerState = self.state_manager.get_state()
state.current_step += 1
current_step = state.current_step
is_warmup_phase = current_step < self.disable_cache_before_step
is_compute_interval = (current_step - self.disable_cache_before_step - 1) % self.cache_interval == 0
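# e.g. with disable_cache_before_step=3 and cache_interval=5: warmup covers steps 0-2, full
# recomputation then happens at steps 4, 9, 14, ...; other steps reuse the Taylor prediction.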
is_cooldown_phase = self.disable_cache_after_step is not None and current_step >= self.disable_cache_after_step
should_compute = is_warmup_phase or is_compute_interval or is_cooldown_phase
return should_compute, state
def new_forward(self, module: torch.nn.Module, *args, **kwargs):
should_compute, state = self._measure_should_compute()
if should_compute:
outputs = self.fn_ref.original_forward(*args, **kwargs)
wrapped_outputs = (outputs,) if isinstance(outputs, torch.Tensor) else outputs
state.update(wrapped_outputs)
return outputs
outputs_list = state.predict()
return outputs_list[0] if len(outputs_list) == 1 else tuple(outputs_list)
def _resolve_patterns(config: TaylorSeerCacheConfig) -> Tuple[List[str], List[str]]:
"""
Resolve the effective inactive and active pattern lists from the config.
"""
inactive_patterns = config.skip_predict_identifiers if config.skip_predict_identifiers is not None else None
active_patterns = config.cache_identifiers if config.cache_identifiers is not None else None
return inactive_patterns or [], active_patterns or []
def apply_taylorseer_cache(module: torch.nn.Module, config: TaylorSeerCacheConfig):
"""
Applies the TaylorSeer cache to a given module (typically the transformer / UNet).
This function hooks selected modules in the model to enable caching or skipping based on the provided
configuration, reducing redundant computations in diffusion denoising loops.
Args:
module (torch.nn.Module): The model subtree to apply the hooks to.
config (TaylorSeerCacheConfig): Configuration for the cache.
Example:
```python
>>> import torch
>>> from diffusers import FluxPipeline, TaylorSeerCacheConfig
>>> pipe = FluxPipeline.from_pretrained(
... "black-forest-labs/FLUX.1-dev",
... torch_dtype=torch.bfloat16,
... )
>>> pipe.to("cuda")
>>> config = TaylorSeerCacheConfig(
... cache_interval=5,
... max_order=1,
... disable_cache_before_step=3,
... taylor_factors_dtype=torch.float32,
... )
>>> pipe.transformer.enable_cache(config)
```
"""
inactive_patterns, active_patterns = _resolve_patterns(config)
active_patterns = active_patterns or _TRANSFORMER_BLOCK_IDENTIFIERS
if config.use_lite_mode:
logger.info("Using TaylorSeer Lite variant for cache.")
active_patterns = _PROJ_OUT_IDENTIFIERS
inactive_patterns = _BLOCK_IDENTIFIERS
if config.skip_predict_identifiers or config.cache_identifiers:
logger.warning("Lite mode overrides user patterns.")
for name, submodule in module.named_modules():
matches_inactive = any(re.fullmatch(pattern, name) for pattern in inactive_patterns)
matches_active = any(re.fullmatch(pattern, name) for pattern in active_patterns)
if not (matches_inactive or matches_active):
continue
_apply_taylorseer_cache_hook(
module=submodule,
config=config,
is_inactive=matches_inactive,
)
def _apply_taylorseer_cache_hook(
module: nn.Module,
config: TaylorSeerCacheConfig,
is_inactive: bool,
):
"""
Registers the TaylorSeer hook on the specified nn.Module.
Args:
module: The nn.Module to be hooked.
config: Cache configuration.
is_inactive: Whether this module should operate in "inactive" (skip) mode.
"""
state_manager = StateManager(
TaylorSeerState,
init_kwargs={
"taylor_factors_dtype": config.taylor_factors_dtype,
"max_order": config.max_order,
"is_inactive": is_inactive,
},
)
registry = HookRegistry.check_if_exists_or_initialize(module)
hook = TaylorSeerCacheHook(
cache_interval=config.cache_interval,
disable_cache_before_step=config.disable_cache_before_step,
taylor_factors_dtype=config.taylor_factors_dtype,
disable_cache_after_step=config.disable_cache_after_step,
state_manager=state_manager,
)
registry.register_hook(hook, _TAYLORSEER_CACHE_HOOK)

View File

@@ -67,11 +67,9 @@ class CacheMixin:
FasterCacheConfig,
FirstBlockCacheConfig,
PyramidAttentionBroadcastConfig,
TaylorSeerCacheConfig,
apply_faster_cache,
apply_first_block_cache,
apply_pyramid_attention_broadcast,
apply_taylorseer_cache,
)
if self.is_cache_enabled:
@@ -85,25 +83,16 @@ class CacheMixin:
apply_first_block_cache(self, config)
elif isinstance(config, PyramidAttentionBroadcastConfig):
apply_pyramid_attention_broadcast(self, config)
elif isinstance(config, TaylorSeerCacheConfig):
apply_taylorseer_cache(self, config)
else:
raise ValueError(f"Cache config {type(config)} is not supported.")
self._cache_config = config
def disable_cache(self) -> None:
from ..hooks import (
FasterCacheConfig,
FirstBlockCacheConfig,
HookRegistry,
PyramidAttentionBroadcastConfig,
TaylorSeerCacheConfig,
)
from ..hooks import FasterCacheConfig, FirstBlockCacheConfig, HookRegistry, PyramidAttentionBroadcastConfig
from ..hooks.faster_cache import _FASTER_CACHE_BLOCK_HOOK, _FASTER_CACHE_DENOISER_HOOK
from ..hooks.first_block_cache import _FBC_BLOCK_HOOK, _FBC_LEADER_BLOCK_HOOK
from ..hooks.pyramid_attention_broadcast import _PYRAMID_ATTENTION_BROADCAST_HOOK
from ..hooks.taylorseer_cache import _TAYLORSEER_CACHE_HOOK
if self._cache_config is None:
logger.warning("Caching techniques have not been enabled, so there's nothing to disable.")
@@ -118,8 +107,6 @@ class CacheMixin:
registry.remove_hook(_FBC_BLOCK_HOOK, recurse=True)
elif isinstance(self._cache_config, PyramidAttentionBroadcastConfig):
registry.remove_hook(_PYRAMID_ATTENTION_BROADCAST_HOOK, recurse=True)
elif isinstance(self._cache_config, TaylorSeerCacheConfig):
registry.remove_hook(_TAYLORSEER_CACHE_HOOK, recurse=True)
else:
raise ValueError(f"Cache config {type(self._cache_config)} is not supported.")

View File

@@ -184,32 +184,19 @@ class HunyuanVideo15TimeEmbedding(nn.Module):
The dimension of the output embedding.
"""
def __init__(self, embedding_dim: int, use_meanflow: bool = False):
def __init__(self, embedding_dim: int):
super().__init__()
self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
self.use_meanflow = use_meanflow
self.time_proj_r = None
self.timestep_embedder_r = None
if use_meanflow:
self.time_proj_r = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
self.timestep_embedder_r = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
def forward(
self,
timestep: torch.Tensor,
timestep_r: Optional[torch.Tensor] = None,
) -> torch.Tensor:
timesteps_proj = self.time_proj(timestep)
timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=timestep.dtype))
if timestep_r is not None:
timesteps_proj_r = self.time_proj_r(timestep_r)
timesteps_emb_r = self.timestep_embedder_r(timesteps_proj_r.to(dtype=timestep.dtype))
timesteps_emb = timesteps_emb + timesteps_emb_r
return timesteps_emb
@@ -580,7 +567,6 @@ class HunyuanVideo15Transformer3DModel(
# YiYi Notes: config based on target_size_config https://github.com/yiyixuxu/hy15/blob/main/hyvideo/pipelines/hunyuan_video_pipeline.py#L205
target_size: int = 640, # did not name sample_size since it is in pixel spaces
task_type: str = "i2v",
use_meanflow: bool = False,
) -> None:
super().__init__()
@@ -596,7 +582,7 @@ class HunyuanVideo15Transformer3DModel(
)
self.context_embedder_2 = HunyuanVideo15ByT5TextProjection(text_embed_2_dim, 2048, inner_dim)
self.time_embed = HunyuanVideo15TimeEmbedding(inner_dim, use_meanflow=use_meanflow)
self.time_embed = HunyuanVideo15TimeEmbedding(inner_dim)
self.cond_type_embed = nn.Embedding(3, inner_dim)
@@ -626,7 +612,6 @@ class HunyuanVideo15Transformer3DModel(
timestep: torch.LongTensor,
encoder_hidden_states: torch.Tensor,
encoder_attention_mask: torch.Tensor,
timestep_r: Optional[torch.LongTensor] = None,
encoder_hidden_states_2: Optional[torch.Tensor] = None,
encoder_attention_mask_2: Optional[torch.Tensor] = None,
image_embeds: Optional[torch.Tensor] = None,
@@ -658,7 +643,7 @@ class HunyuanVideo15Transformer3DModel(
image_rotary_emb = self.rope(hidden_states)
# 2. Conditional embeddings
temb = self.time_embed(timestep, timestep_r=timestep_r)
temb = self.time_embed(timestep)
hidden_states = self.x_embedder(hidden_states)

View File

@@ -404,7 +404,7 @@ else:
"Kandinsky5T2IPipeline",
"Kandinsky5I2IPipeline",
]
_import_structure["z_image"] = ["ZImageImg2ImgPipeline", "ZImagePipeline"]
_import_structure["z_image"] = ["ZImagePipeline"]
_import_structure["skyreels_v2"] = [
"SkyReelsV2DiffusionForcingPipeline",
"SkyReelsV2DiffusionForcingImageToVideoPipeline",
@@ -841,7 +841,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
WuerstchenDecoderPipeline,
WuerstchenPriorPipeline,
)
from .z_image import ZImageImg2ImgPipeline, ZImagePipeline
from .z_image import ZImagePipeline
try:
if not is_onnx_available():

View File

@@ -119,7 +119,6 @@ from .stable_diffusion_xl import (
)
from .wan import WanImageToVideoPipeline, WanPipeline, WanVideoToVideoPipeline
from .wuerstchen import WuerstchenCombinedPipeline, WuerstchenDecoderPipeline
from .z_image import ZImageImg2ImgPipeline, ZImagePipeline
AUTO_TEXT2IMAGE_PIPELINES_MAPPING = OrderedDict(
@@ -163,7 +162,6 @@ AUTO_TEXT2IMAGE_PIPELINES_MAPPING = OrderedDict(
("cogview4-control", CogView4ControlPipeline),
("qwenimage", QwenImagePipeline),
("qwenimage-controlnet", QwenImageControlNetPipeline),
("z-image", ZImagePipeline),
]
)
@@ -191,7 +189,6 @@ AUTO_IMAGE2IMAGE_PIPELINES_MAPPING = OrderedDict(
("qwenimage", QwenImageImg2ImgPipeline),
("qwenimage-edit", QwenImageEditPipeline),
("qwenimage-edit-plus", QwenImageEditPlusPipeline),
("z-image", ZImageImg2ImgPipeline),
]
)

View File

@@ -852,15 +852,6 @@ class HunyuanVideo15ImageToVideoPipeline(DiffusionPipeline):
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latent_model_input.shape[0]).to(latent_model_input.dtype)
if self.transformer.config.use_meanflow:
if i == len(timesteps) - 1:
timestep_r = torch.tensor([0.0], device=device)
else:
timestep_r = timesteps[i + 1]
timestep_r = timestep_r.expand(latents.shape[0]).to(latents.dtype)
else:
timestep_r = None
# Step 1: Collect model inputs needed for the guidance method
# conditional inputs should always be first element in the tuple
guider_inputs = {
@@ -902,7 +893,6 @@ class HunyuanVideo15ImageToVideoPipeline(DiffusionPipeline):
hidden_states=latent_model_input,
image_embeds=image_embeds,
timestep=timestep,
timestep_r=timestep_r,
attention_kwargs=self.attention_kwargs,
return_dict=False,
**cond_kwargs,

View File

@@ -17,7 +17,7 @@ import torch
import torch.nn as nn
from transformers import CLIPConfig, CLIPVisionModel, PreTrainedModel
from ...utils import is_transformers_version, logging
from ...utils import logging
logger = logging.get_logger(__name__)
@@ -46,9 +46,6 @@ class StableDiffusionSafetyChecker(PreTrainedModel):
self.concept_embeds_weights = nn.Parameter(torch.ones(17), requires_grad=False)
self.special_care_embeds_weights = nn.Parameter(torch.ones(3), requires_grad=False)
# Model requires post_init after transformers v4.57.3
if is_transformers_version(">", "4.57.3"):
self.post_init()
@torch.no_grad()
def forward(self, clip_input, images):

View File

@@ -23,7 +23,6 @@ except OptionalDependencyNotAvailable:
else:
_import_structure["pipeline_output"] = ["ZImagePipelineOutput"]
_import_structure["pipeline_z_image"] = ["ZImagePipeline"]
_import_structure["pipeline_z_image_img2img"] = ["ZImageImg2ImgPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
@@ -36,7 +35,6 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
else:
from .pipeline_output import ZImagePipelineOutput
from .pipeline_z_image import ZImagePipeline
from .pipeline_z_image_img2img import ZImageImg2ImgPipeline
else:
import sys

View File

@@ -1,709 +0,0 @@
# Copyright 2025 Alibaba Z-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
from typing import Any, Callable, Dict, List, Optional, Union
import torch
from transformers import AutoTokenizer, PreTrainedModel
from ...image_processor import PipelineImageInput, VaeImageProcessor
from ...loaders import FromSingleFileMixin, ZImageLoraLoaderMixin
from ...models.autoencoders import AutoencoderKL
from ...models.transformers import ZImageTransformer2DModel
from ...pipelines.pipeline_utils import DiffusionPipeline
from ...schedulers import FlowMatchEulerDiscreteScheduler
from ...utils import logging, replace_example_docstring
from ...utils.torch_utils import randn_tensor
from .pipeline_output import ZImagePipelineOutput
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from diffusers import ZImageImg2ImgPipeline
>>> from diffusers.utils import load_image
>>> pipe = ZImageImg2ImgPipeline.from_pretrained("Z-a-o/Z-Image-Turbo", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
>>> init_image = load_image(url).resize((1024, 1024))
>>> prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
>>> image = pipe(
... prompt,
... image=init_image,
... strength=0.6,
... num_inference_steps=9,
... guidance_scale=0.0,
... generator=torch.Generator("cuda").manual_seed(42),
... ).images[0]
>>> image.save("zimage_img2img.png")
```
"""
# Copied from diffusers.pipelines.flux.pipeline_flux.calculate_shift
def calculate_shift(
image_seq_len,
base_seq_len: int = 256,
max_seq_len: int = 4096,
base_shift: float = 0.5,
max_shift: float = 1.15,
):
m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
b = base_shift - m * base_seq_len
mu = image_seq_len * m + b
return mu
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
):
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
return encoder_output.latent_dist.sample(generator)
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
return encoder_output.latent_dist.mode()
elif hasattr(encoder_output, "latents"):
return encoder_output.latents
else:
raise AttributeError("Could not access latents of provided encoder_output")
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
scheduler,
num_inference_steps: Optional[int] = None,
device: Optional[Union[str, torch.device]] = None,
timesteps: Optional[List[int]] = None,
sigmas: Optional[List[float]] = None,
**kwargs,
):
r"""
Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
Args:
scheduler (`SchedulerMixin`):
The scheduler to get timesteps from.
num_inference_steps (`int`):
The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
must be `None`.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
timesteps (`List[int]`, *optional*):
Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
`num_inference_steps` and `sigmas` must be `None`.
sigmas (`List[float]`, *optional*):
Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
`num_inference_steps` and `timesteps` must be `None`.
Returns:
`Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
second element is the number of inference steps.
"""
if timesteps is not None and sigmas is not None:
raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
if timesteps is not None:
accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accepts_timesteps:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" timestep schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
elif sigmas is not None:
accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
if not accept_sigmas:
raise ValueError(
f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
f" sigmas schedules. Please check whether you are using the correct scheduler."
)
scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
timesteps = scheduler.timesteps
num_inference_steps = len(timesteps)
else:
scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
timesteps = scheduler.timesteps
return timesteps, num_inference_steps
class ZImageImg2ImgPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, FromSingleFileMixin):
r"""
The ZImage pipeline for image-to-image generation.
Args:
scheduler ([`FlowMatchEulerDiscreteScheduler`]):
A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder ([`PreTrainedModel`]):
A text encoder model to encode text prompts.
tokenizer ([`AutoTokenizer`]):
A tokenizer to tokenize text prompts.
transformer ([`ZImageTransformer2DModel`]):
A ZImage transformer model to denoise the encoded image latents.
"""
model_cpu_offload_seq = "text_encoder->transformer->vae"
_optional_components = []
_callback_tensor_inputs = ["latents", "prompt_embeds"]
def __init__(
self,
scheduler: FlowMatchEulerDiscreteScheduler,
vae: AutoencoderKL,
text_encoder: PreTrainedModel,
tokenizer: AutoTokenizer,
transformer: ZImageTransformer2DModel,
):
super().__init__()
self.register_modules(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
scheduler=scheduler,
transformer=transformer,
)
self.vae_scale_factor = (
2 ** (len(self.vae.config.block_out_channels) - 1) if hasattr(self, "vae") and self.vae is not None else 8
)
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor * 2)
# Copied from diffusers.pipelines.z_image.pipeline_z_image.ZImagePipeline.encode_prompt
def encode_prompt(
self,
prompt: Union[str, List[str]],
device: Optional[torch.device] = None,
do_classifier_free_guidance: bool = True,
negative_prompt: Optional[Union[str, List[str]]] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
negative_prompt_embeds: Optional[torch.FloatTensor] = None,
max_sequence_length: int = 512,
):
prompt = [prompt] if isinstance(prompt, str) else prompt
prompt_embeds = self._encode_prompt(
prompt=prompt,
device=device,
prompt_embeds=prompt_embeds,
max_sequence_length=max_sequence_length,
)
if do_classifier_free_guidance:
if negative_prompt is None:
negative_prompt = ["" for _ in prompt]
else:
negative_prompt = [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
assert len(prompt) == len(negative_prompt)
negative_prompt_embeds = self._encode_prompt(
prompt=negative_prompt,
device=device,
prompt_embeds=negative_prompt_embeds,
max_sequence_length=max_sequence_length,
)
else:
negative_prompt_embeds = []
return prompt_embeds, negative_prompt_embeds
# Copied from diffusers.pipelines.z_image.pipeline_z_image.ZImagePipeline._encode_prompt
def _encode_prompt(
self,
prompt: Union[str, List[str]],
device: Optional[torch.device] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
max_sequence_length: int = 512,
) -> List[torch.FloatTensor]:
device = device or self._execution_device
if prompt_embeds is not None:
return prompt_embeds
if isinstance(prompt, str):
prompt = [prompt]
for i, prompt_item in enumerate(prompt):
messages = [
{"role": "user", "content": prompt_item},
]
prompt_item = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
prompt[i] = prompt_item
text_inputs = self.tokenizer(
prompt,
padding="max_length",
max_length=max_sequence_length,
truncation=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids.to(device)
prompt_masks = text_inputs.attention_mask.to(device).bool()
prompt_embeds = self.text_encoder(
input_ids=text_input_ids,
attention_mask=prompt_masks,
output_hidden_states=True,
).hidden_states[-2]
embeddings_list = []
for i in range(len(prompt_embeds)):
embeddings_list.append(prompt_embeds[i][prompt_masks[i]])
return embeddings_list
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3_img2img.StableDiffusion3Img2ImgPipeline.get_timesteps
def get_timesteps(self, num_inference_steps, strength, device):
# get the original timestep using init_timestep
init_timestep = min(num_inference_steps * strength, num_inference_steps)
t_start = int(max(num_inference_steps - init_timestep, 0))
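# e.g. num_inference_steps=9, strength=0.6 -> init_timestep=5.4, t_start=3, so only the last 6 timesteps are used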
timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
if hasattr(self.scheduler, "set_begin_index"):
self.scheduler.set_begin_index(t_start * self.scheduler.order)
return timesteps, num_inference_steps - t_start
def prepare_latents(
self,
image,
timestep,
batch_size,
num_channels_latents,
height,
width,
dtype,
device,
generator,
latents=None,
):
height = 2 * (int(height) // (self.vae_scale_factor * 2))
width = 2 * (int(width) // (self.vae_scale_factor * 2))
shape = (batch_size, num_channels_latents, height, width)
if latents is not None:
return latents.to(device=device, dtype=dtype)
# Encode the input image
image = image.to(device=device, dtype=dtype)
if image.shape[1] != num_channels_latents:
if isinstance(generator, list):
image_latents = [
retrieve_latents(self.vae.encode(image[i : i + 1]), generator=generator[i])
for i in range(image.shape[0])
]
image_latents = torch.cat(image_latents, dim=0)
else:
image_latents = retrieve_latents(self.vae.encode(image), generator=generator)
# Apply scaling (inverse of decoding: decode does latents/scaling_factor + shift_factor)
image_latents = (image_latents - self.vae.config.shift_factor) * self.vae.config.scaling_factor
else:
image_latents = image
# Handle batch size expansion
if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:
additional_image_per_prompt = batch_size // image_latents.shape[0]
image_latents = torch.cat([image_latents] * additional_image_per_prompt, dim=0)
elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:
raise ValueError(
f"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts."
)
# Add noise using flow matching scale_noise
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
latents = self.scheduler.scale_noise(image_latents, timestep, noise)
return latents
@property
def guidance_scale(self):
return self._guidance_scale
@property
def do_classifier_free_guidance(self):
return self._guidance_scale > 1
@property
def joint_attention_kwargs(self):
return self._joint_attention_kwargs
@property
def num_timesteps(self):
return self._num_timesteps
@property
def interrupt(self):
return self._interrupt
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
prompt: Union[str, List[str]] = None,
image: PipelineImageInput = None,
strength: float = 0.6,
height: Optional[int] = None,
width: Optional[int] = None,
num_inference_steps: int = 50,
sigmas: Optional[List[float]] = None,
guidance_scale: float = 5.0,
cfg_normalization: bool = False,
cfg_truncation: float = 1.0,
negative_prompt: Optional[Union[str, List[str]]] = None,
num_images_per_prompt: Optional[int] = 1,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[List[torch.FloatTensor]] = None,
negative_prompt_embeds: Optional[List[torch.FloatTensor]] = None,
output_type: Optional[str] = "pil",
return_dict: bool = True,
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 512,
):
r"""
Function invoked when calling the pipeline for image-to-image generation.
Args:
prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
numpy array and pytorch tensor, the expected value range is between `[0, 1]`. If it's a tensor or a
list of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or
a list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`.
strength (`float`, *optional*, defaults to 0.6):
Indicates the extent to which the reference `image` is transformed. Must be between 0 and 1. `image` is used as a
starting point and more noise is added the higher the `strength`. The number of denoising steps depends
on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
essentially ignores `image`.
height (`int`, *optional*, defaults to 1024):
The height in pixels of the generated image. If not provided, uses the input image height.
width (`int`, *optional*, defaults to 1024):
The width in pixels of the generated image. If not provided, uses the input image width.
num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
sigmas (`List[float]`, *optional*):
Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
will be used.
guidance_scale (`float`, *optional*, defaults to 5.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
cfg_normalization (`bool`, *optional*, defaults to False):
Whether to renormalize the guided noise prediction so its norm does not exceed that of the conditional prediction (classifier-free guidance renormalization).
cfg_truncation (`float`, *optional*, defaults to 1.0):
The normalized denoising progress (0 at the start, 1 at the end) above which classifier-free guidance is skipped.
negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
prompt_embeds (`List[torch.FloatTensor]`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
provided, text embeddings will be generated from `prompt` input argument.
negative_prompt_embeds (`List[torch.FloatTensor]`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.z_image.ZImagePipelineOutput`] instead of a plain
tuple.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
callback_on_step_end (`Callable`, *optional*):
A function that is called at the end of each denoising step during inference. The function is called
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
`callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (`List`, *optional*):
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (`int`, *optional*, defaults to 512):
Maximum sequence length to use with the `prompt`.
Examples:
Returns:
[`~pipelines.z_image.ZImagePipelineOutput`] or `tuple`: [`~pipelines.z_image.ZImagePipelineOutput`] if
`return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the
generated images.
"""
# 1. Check inputs and validate strength
if strength < 0 or strength > 1:
raise ValueError(f"The value of strength should be in [0.0, 1.0] but is {strength}")
# 2. Preprocess image
init_image = self.image_processor.preprocess(image)
init_image = init_image.to(dtype=torch.float32)
# Get dimensions from the preprocessed image if not specified
if height is None:
height = init_image.shape[-2]
if width is None:
width = init_image.shape[-1]
vae_scale = self.vae_scale_factor * 2
if height % vae_scale != 0:
raise ValueError(
f"Height must be divisible by {vae_scale} (got {height}). "
f"Please adjust the height to a multiple of {vae_scale}."
)
if width % vae_scale != 0:
raise ValueError(
f"Width must be divisible by {vae_scale} (got {width}). "
f"Please adjust the width to a multiple of {vae_scale}."
)
device = self._execution_device
self._guidance_scale = guidance_scale
self._joint_attention_kwargs = joint_attention_kwargs
self._interrupt = False
self._cfg_normalization = cfg_normalization
self._cfg_truncation = cfg_truncation
# 3. Define call parameters
if prompt is not None and isinstance(prompt, str):
batch_size = 1
elif prompt is not None and isinstance(prompt, list):
batch_size = len(prompt)
else:
batch_size = len(prompt_embeds)
# If prompt_embeds is provided and prompt is None, skip encoding
if prompt_embeds is not None and prompt is None:
if self.do_classifier_free_guidance and negative_prompt_embeds is None:
raise ValueError(
"When `prompt_embeds` is provided without `prompt`, "
"`negative_prompt_embeds` must also be provided for classifier-free guidance."
)
else:
(
prompt_embeds,
negative_prompt_embeds,
) = self.encode_prompt(
prompt=prompt,
negative_prompt=negative_prompt,
do_classifier_free_guidance=self.do_classifier_free_guidance,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
device=device,
max_sequence_length=max_sequence_length,
)
# 4. Prepare latent variables
num_channels_latents = self.transformer.in_channels
# Repeat prompt_embeds for num_images_per_prompt
if num_images_per_prompt > 1:
prompt_embeds = [pe for pe in prompt_embeds for _ in range(num_images_per_prompt)]
if self.do_classifier_free_guidance and negative_prompt_embeds:
negative_prompt_embeds = [npe for npe in negative_prompt_embeds for _ in range(num_images_per_prompt)]
actual_batch_size = batch_size * num_images_per_prompt
# Calculate latent dimensions for image_seq_len
latent_height = 2 * (int(height) // (self.vae_scale_factor * 2))
latent_width = 2 * (int(width) // (self.vae_scale_factor * 2))
image_seq_len = (latent_height // 2) * (latent_width // 2)
# 5. Prepare timesteps
mu = calculate_shift(
image_seq_len,
self.scheduler.config.get("base_image_seq_len", 256),
self.scheduler.config.get("max_image_seq_len", 4096),
self.scheduler.config.get("base_shift", 0.5),
self.scheduler.config.get("max_shift", 1.15),
)
self.scheduler.sigma_min = 0.0
scheduler_kwargs = {"mu": mu}
timesteps, num_inference_steps = retrieve_timesteps(
self.scheduler,
num_inference_steps,
device,
sigmas=sigmas,
**scheduler_kwargs,
)
# 6. Adjust timesteps based on strength
timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, strength, device)
if num_inference_steps < 1:
raise ValueError(
f"After adjusting the num_inference_steps by strength parameter: {strength}, the number of pipeline "
f"steps is {num_inference_steps} which is < 1 and not appropriate for this pipeline."
)
latent_timestep = timesteps[:1].repeat(actual_batch_size)
# 7. Prepare latents from image
latents = self.prepare_latents(
init_image,
latent_timestep,
actual_batch_size,
num_channels_latents,
height,
width,
prompt_embeds[0].dtype,
device,
generator,
latents,
)
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
self._num_timesteps = len(timesteps)
# 8. Denoising loop
with self.progress_bar(total=num_inference_steps) as progress_bar:
for i, t in enumerate(timesteps):
if self.interrupt:
continue
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latents.shape[0])
timestep = (1000 - timestep) / 1000
# Normalized time for time-aware config (0 at start, 1 at end)
t_norm = timestep[0].item()
# Handle cfg truncation
current_guidance_scale = self.guidance_scale
if (
self.do_classifier_free_guidance
and self._cfg_truncation is not None
and float(self._cfg_truncation) <= 1
):
if t_norm > self._cfg_truncation:
current_guidance_scale = 0.0
# Run CFG only if configured AND scale is non-zero
apply_cfg = self.do_classifier_free_guidance and current_guidance_scale > 0
if apply_cfg:
latents_typed = latents.to(self.transformer.dtype)
latent_model_input = latents_typed.repeat(2, 1, 1, 1)
prompt_embeds_model_input = prompt_embeds + negative_prompt_embeds
timestep_model_input = timestep.repeat(2)
else:
latent_model_input = latents.to(self.transformer.dtype)
prompt_embeds_model_input = prompt_embeds
timestep_model_input = timestep
latent_model_input = latent_model_input.unsqueeze(2)
latent_model_input_list = list(latent_model_input.unbind(dim=0))
model_out_list = self.transformer(
latent_model_input_list,
timestep_model_input,
prompt_embeds_model_input,
)[0]
if apply_cfg:
# Perform CFG
pos_out = model_out_list[:actual_batch_size]
neg_out = model_out_list[actual_batch_size:]
noise_pred = []
for j in range(actual_batch_size):
pos = pos_out[j].float()
neg = neg_out[j].float()
pred = pos + current_guidance_scale * (pos - neg)
# Renormalization
if self._cfg_normalization and float(self._cfg_normalization) > 0.0:
ori_pos_norm = torch.linalg.vector_norm(pos)
new_pos_norm = torch.linalg.vector_norm(pred)
max_new_norm = ori_pos_norm * float(self._cfg_normalization)
if new_pos_norm > max_new_norm:
pred = pred * (max_new_norm / new_pos_norm)
noise_pred.append(pred)
noise_pred = torch.stack(noise_pred, dim=0)
else:
noise_pred = torch.stack([t.float() for t in model_out_list], dim=0)
noise_pred = noise_pred.squeeze(2)
noise_pred = -noise_pred
# compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(noise_pred.to(torch.float32), t, latents, return_dict=False)[0]
assert latents.dtype == torch.float32
if callback_on_step_end is not None:
callback_kwargs = {}
for k in callback_on_step_end_tensor_inputs:
callback_kwargs[k] = locals()[k]
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
latents = callback_outputs.pop("latents", latents)
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
progress_bar.update()
if output_type == "latent":
image = latents
else:
latents = latents.to(self.vae.dtype)
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
image = self.vae.decode(latents, return_dict=False)[0]
image = self.image_processor.postprocess(image, output_type=output_type)
# Offload all models
self.maybe_free_model_hooks()
if not return_dict:
return (image,)
return ZImagePipelineOutput(images=image)

View File

@@ -27,7 +27,7 @@ logger = logging.get_logger(__name__)
class NVIDIAModelOptQuantizer(DiffusersQuantizer):
r"""
Diffusers Quantizer for Nvidia-Model Optimizer
Diffusers Quantizer for TensorRT Model Optimizer
"""
use_keep_in_fp32_modules = True

View File

@@ -84,35 +84,33 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
methods the library implements for all schedulers such as loading and saving.
Args:
num_train_timesteps (`int`, defaults to `1000`):
num_train_timesteps (`int`, defaults to 1000):
The number of diffusion steps to train the model.
beta_start (`float`, defaults to `0.0001`):
beta_start (`float`, defaults to 0.0001):
The starting `beta` value of inference.
beta_end (`float`, defaults to `0.02`):
beta_end (`float`, defaults to 0.02):
The final `beta` value.
beta_schedule (`"linear"`, `"scaled_linear"`, or `"squaredcos_cap_v2"`, defaults to `"linear"`):
beta_schedule (`str`, defaults to `"linear"`):
The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from
`linear`, `scaled_linear`, or `squaredcos_cap_v2`.
trained_betas (`np.ndarray` or `List[float]`, *optional*):
trained_betas (`np.ndarray`, *optional*):
Pass an array of betas directly to the constructor to bypass `beta_start` and `beta_end`.
solver_order (`int`, defaults to `2`):
solver_order (`int`, defaults to 2):
The DEIS order, which can be `1`, `2`, or `3`. It is recommended to use `solver_order=2` for guided
sampling, and `solver_order=3` for unconditional sampling.
prediction_type (`"epsilon"`, `"sample"`, `"v_prediction"`, or `"flow_prediction"`, defaults to `"epsilon"`):
prediction_type (`str`, defaults to `epsilon`):
Prediction type of the scheduler function; can be `epsilon` (predicts the noise of the diffusion process),
`sample` (directly predicts the noisy sample), `v_prediction` (see section 2.4 of [Imagen
Video](https://huggingface.co/papers/2210.02303) paper), or `flow_prediction`.
`sample` (directly predicts the noisy sample) or `v_prediction` (see section 2.4 of [Imagen
Video](https://huggingface.co/papers/2210.02303) paper).
thresholding (`bool`, defaults to `False`):
Whether to use the "dynamic thresholding" method. This is unsuitable for latent-space diffusion models such
as Stable Diffusion.
dynamic_thresholding_ratio (`float`, defaults to `0.995`):
dynamic_thresholding_ratio (`float`, defaults to 0.995):
The ratio for the dynamic thresholding method. Valid only when `thresholding=True`.
sample_max_value (`float`, defaults to `1.0`):
sample_max_value (`float`, defaults to 1.0):
The threshold value for dynamic thresholding. Valid only when `thresholding=True`.
algorithm_type (`"deis"`, defaults to `"deis"`):
algorithm_type (`str`, defaults to `deis`):
The algorithm type for the solver.
solver_type (`"logrho"`, defaults to `"logrho"`):
Solver type for DEIS.
lower_order_final (`bool`, defaults to `True`):
Whether to use lower-order solvers in the final steps. Only valid for < 15 inference steps.
use_karras_sigmas (`bool`, *optional*, defaults to `False`):
@@ -123,19 +121,11 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
use_beta_sigmas (`bool`, *optional*, defaults to `False`):
Whether to use beta sigmas for step sizes in the noise schedule during the sampling process. Refer to [Beta
Sampling is All You Need](https://huggingface.co/papers/2407.12173) for more information.
use_flow_sigmas (`bool`, *optional*, defaults to `False`):
Whether to use flow sigmas for step sizes in the noise schedule during the sampling process.
flow_shift (`float`, *optional*, defaults to `1.0`):
The flow shift parameter for flow-based models.
timestep_spacing (`"linspace"`, `"leading"`, or `"trailing"`, defaults to `"linspace"`):
timestep_spacing (`str`, defaults to `"linspace"`):
The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and
Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.
steps_offset (`int`, defaults to `0`):
steps_offset (`int`, defaults to 0):
An offset added to the inference steps, as required by some model families.
use_dynamic_shifting (`bool`, defaults to `False`):
Whether to use dynamic shifting for the noise schedule.
time_shift_type (`"exponential"`, defaults to `"exponential"`):
The type of time shifting to apply.
"""
_compatibles = [e.name for e in KarrasDiffusionSchedulers]
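
The docstring above maps one-to-one onto the constructor arguments. A small usage sketch with arbitrary example values rather than recommended settings:

```python
import torch
from diffusers import DEISMultistepScheduler

scheduler = DEISMultistepScheduler(
    num_train_timesteps=1000,
    beta_start=0.0001,
    beta_end=0.02,
    beta_schedule="linear",
    solver_order=2,               # order 2 is the suggestion for guided sampling
    timestep_spacing="linspace",
)
scheduler.set_timesteps(num_inference_steps=25)
print(scheduler.timesteps[:5])
```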
@@ -147,38 +137,29 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
num_train_timesteps: int = 1000,
beta_start: float = 0.0001,
beta_end: float = 0.02,
beta_schedule: Literal["linear", "scaled_linear", "squaredcos_cap_v2"] = "linear",
trained_betas: Optional[Union[np.ndarray, List[float]]] = None,
beta_schedule: str = "linear",
trained_betas: Optional[np.ndarray] = None,
solver_order: int = 2,
prediction_type: Literal["epsilon", "sample", "v_prediction", "flow_prediction"] = "epsilon",
prediction_type: str = "epsilon",
thresholding: bool = False,
dynamic_thresholding_ratio: float = 0.995,
sample_max_value: float = 1.0,
algorithm_type: Literal["deis"] = "deis",
solver_type: Literal["logrho"] = "logrho",
algorithm_type: str = "deis",
solver_type: str = "logrho",
lower_order_final: bool = True,
use_karras_sigmas: Optional[bool] = False,
use_exponential_sigmas: Optional[bool] = False,
use_beta_sigmas: Optional[bool] = False,
use_flow_sigmas: Optional[bool] = False,
flow_shift: Optional[float] = 1.0,
timestep_spacing: Literal["linspace", "leading", "trailing"] = "linspace",
timestep_spacing: str = "linspace",
steps_offset: int = 0,
use_dynamic_shifting: bool = False,
time_shift_type: Literal["exponential"] = "exponential",
) -> None:
time_shift_type: str = "exponential",
):
if self.config.use_beta_sigmas and not is_scipy_available():
raise ImportError("Make sure to install scipy if you want to use beta sigmas.")
if (
sum(
[
self.config.use_beta_sigmas,
self.config.use_exponential_sigmas,
self.config.use_karras_sigmas,
]
)
> 1
):
if sum([self.config.use_beta_sigmas, self.config.use_exponential_sigmas, self.config.use_karras_sigmas]) > 1:
raise ValueError(
"Only one of `config.use_beta_sigmas`, `config.use_exponential_sigmas`, `config.use_karras_sigmas` can be used."
)
@@ -188,15 +169,7 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
self.betas = torch.linspace(beta_start, beta_end, num_train_timesteps, dtype=torch.float32)
elif beta_schedule == "scaled_linear":
# this schedule is very specific to the latent diffusion model.
self.betas = (
torch.linspace(
beta_start**0.5,
beta_end**0.5,
num_train_timesteps,
dtype=torch.float32,
)
** 2
)
self.betas = torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
elif beta_schedule == "squaredcos_cap_v2":
# Glide cosine schedule
self.betas = betas_for_alpha_bar(num_train_timesteps)
@@ -238,21 +211,21 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
self.sigmas = self.sigmas.to("cpu") # to avoid too much CPU/GPU communication
@property
def step_index(self) -> Optional[int]:
def step_index(self):
"""
The index counter for the current timestep. It increases by 1 after each scheduler step.
"""
return self._step_index
@property
def begin_index(self) -> Optional[int]:
def begin_index(self):
"""
The index for the first timestep. It should be set from pipeline with `set_begin_index` method.
"""
return self._begin_index
# Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.set_begin_index
def set_begin_index(self, begin_index: int = 0) -> None:
def set_begin_index(self, begin_index: int = 0):
"""
Sets the begin index for the scheduler. This function should be run from the pipeline before inference.
@@ -263,11 +236,8 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
self._begin_index = begin_index
def set_timesteps(
self,
num_inference_steps: int,
device: Union[str, torch.device] = None,
mu: Optional[float] = None,
) -> None:
self, num_inference_steps: int, device: Union[str, torch.device] = None, mu: Optional[float] = None
):
"""
Sets the discrete timesteps used for the diffusion chain (to be run before inference).
@@ -276,9 +246,6 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
The number of diffusion steps used when generating samples with a pre-trained model.
device (`str` or `torch.device`, *optional*):
The device to which the timesteps should be moved. If `None`, the timesteps are not moved.
mu (`float`, *optional*):
The mu parameter for dynamic shifting. Only used when `use_dynamic_shifting=True` and
`time_shift_type="exponential"`.
"""
if mu is not None:
assert self.config.use_dynamic_shifting and self.config.time_shift_type == "exponential"
@@ -396,7 +363,7 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
return sample
# Copied from diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler._sigma_to_t
def _sigma_to_t(self, sigma: np.ndarray, log_sigmas: np.ndarray) -> np.ndarray:
def _sigma_to_t(self, sigma, log_sigmas):
"""
Convert sigma values to corresponding timestep values through interpolation.
@@ -433,7 +400,7 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
return t
# Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler._sigma_to_alpha_sigma_t
def _sigma_to_alpha_sigma_t(self, sigma: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
def _sigma_to_alpha_sigma_t(self, sigma):
"""
Convert sigma values to alpha_t and sigma_t values.
@@ -455,7 +422,7 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
return alpha_t, sigma_t
# Copied from diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler._convert_to_karras
def _convert_to_karras(self, in_sigmas: torch.Tensor, num_inference_steps: int) -> torch.Tensor:
def _convert_to_karras(self, in_sigmas: torch.Tensor, num_inference_steps) -> torch.Tensor:
"""
Construct the noise schedule as proposed in [Elucidating the Design Space of Diffusion-Based Generative
Models](https://huggingface.co/papers/2206.00364).
@@ -681,10 +648,7 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
"Passing `prev_timestep` is deprecated and has no effect as model output conversion is now handled via an internal counter `self.step_index`",
)
sigma_t, sigma_s = (
self.sigmas[self.step_index + 1],
self.sigmas[self.step_index],
)
sigma_t, sigma_s = self.sigmas[self.step_index + 1], self.sigmas[self.step_index]
alpha_t, sigma_t = self._sigma_to_alpha_sigma_t(sigma_t)
alpha_s, sigma_s = self._sigma_to_alpha_sigma_t(sigma_s)
lambda_t = torch.log(alpha_t) - torch.log(sigma_t)
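
These sigma/alpha/lambda quantities feed the first-order update. As a rough sketch of the standard exponential-integrator step built from them (this mirrors the common DEIS/DPM-Solver first-order form and is not a verbatim copy of the code elided from this hunk):

```python
import torch

def first_order_step(sample, model_output, alpha_t, sigma_t, alpha_s, sigma_s):
    # Log-SNR at the target (t) and source (s) steps.
    lambda_t = torch.log(alpha_t) - torch.log(sigma_t)
    lambda_s = torch.log(alpha_s) - torch.log(sigma_s)
    h = lambda_t - lambda_s
    # Exponential-integrator first-order update (DEIS "logrho" / DPM-Solver style).
    return (alpha_t / alpha_s) * sample - sigma_t * (torch.exp(h) - 1.0) * model_output
```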
@@ -750,11 +714,7 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
m0, m1 = model_output_list[-1], model_output_list[-2]
rho_t, rho_s0, rho_s1 = (
sigma_t / alpha_t,
sigma_s0 / alpha_s0,
sigma_s1 / alpha_s1,
)
rho_t, rho_s0, rho_s1 = sigma_t / alpha_t, sigma_s0 / alpha_s0, sigma_s1 / alpha_s1
if self.config.algorithm_type == "deis":
@@ -894,7 +854,7 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
return step_index
# Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler._init_step_index
def _init_step_index(self, timestep: Union[int, torch.Tensor]) -> None:
def _init_step_index(self, timestep):
"""
Initialize the step_index counter for the scheduler.
@@ -924,17 +884,18 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
Args:
model_output (`torch.Tensor`):
The direct output from learned diffusion model.
timestep (`int` or `torch.Tensor`):
timestep (`int`):
The current discrete timestep in the diffusion chain.
sample (`torch.Tensor`):
A current instance of a sample created by the diffusion process.
return_dict (`bool`, defaults to `True`):
return_dict (`bool`):
Whether or not to return a [`~schedulers.scheduling_utils.SchedulerOutput`] or `tuple`.
Returns:
[`~schedulers.scheduling_utils.SchedulerOutput`] or `tuple`:
If return_dict is `True`, [`~schedulers.scheduling_utils.SchedulerOutput`] is returned, otherwise a
tuple is returned where the first element is the sample tensor.
"""
if self.num_inference_steps is None:
raise ValueError(
@@ -1039,5 +1000,5 @@ class DEISMultistepScheduler(SchedulerMixin, ConfigMixin):
noisy_samples = alpha_t * original_samples + sigma_t * noise
return noisy_samples
def __len__(self) -> int:
def __len__(self):
return self.config.num_train_timesteps
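
Putting the API described above together, a minimal sampling loop against this scheduler looks roughly like the following; the `model` here is a purely illustrative stand-in for a trained noise-prediction network:

```python
import torch
from diffusers import DEISMultistepScheduler

scheduler = DEISMultistepScheduler()
scheduler.set_timesteps(num_inference_steps=25)

sample = torch.randn(1, 4, 64, 64)  # start from pure noise

def model(x, t):
    # Hypothetical stand-in for an epsilon-prediction network.
    return torch.zeros_like(x)

for t in scheduler.timesteps:
    model_output = model(sample, t)
    # step() consumes the model output and returns the previous (less noisy) sample.
    sample = scheduler.step(model_output, t, sample, return_dict=False)[0]
```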

View File

@@ -257,21 +257,6 @@ class SmoothedEnergyGuidanceConfig(metaclass=DummyObject):
requires_backends(cls, ["torch"])
class TaylorSeerCacheConfig(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def apply_faster_cache(*args, **kwargs):
requires_backends(apply_faster_cache, ["torch"])
@@ -288,10 +273,6 @@ def apply_pyramid_attention_broadcast(*args, **kwargs):
requires_backends(apply_pyramid_attention_broadcast, ["torch"])
def apply_taylorseer_cache(*args, **kwargs):
requires_backends(apply_taylorseer_cache, ["torch"])
class AllegroTransformer3DModel(metaclass=DummyObject):
_backends = ["torch"]

View File

@@ -3752,21 +3752,6 @@ class WuerstchenPriorPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class ZImageImg2ImgPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class ZImagePipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]

View File

@@ -29,7 +29,6 @@ from ..test_pipelines_common import (
FluxIPAdapterTesterMixin,
PipelineTesterMixin,
PyramidAttentionBroadcastTesterMixin,
TaylorSeerCacheTesterMixin,
check_qkv_fused_layers_exist,
)
@@ -40,7 +39,6 @@ class FluxPipelineFastTests(
PyramidAttentionBroadcastTesterMixin,
FasterCacheTesterMixin,
FirstBlockCacheTesterMixin,
TaylorSeerCacheTesterMixin,
unittest.TestCase,
):
pipeline_class = FluxPipeline

View File

@@ -33,7 +33,6 @@ from ..test_pipelines_common import (
FirstBlockCacheTesterMixin,
PipelineTesterMixin,
PyramidAttentionBroadcastTesterMixin,
TaylorSeerCacheTesterMixin,
to_np,
)
@@ -46,7 +45,6 @@ class HunyuanVideoPipelineFastTests(
PyramidAttentionBroadcastTesterMixin,
FasterCacheTesterMixin,
FirstBlockCacheTesterMixin,
TaylorSeerCacheTesterMixin,
unittest.TestCase,
):
pipeline_class = HunyuanVideoPipeline

View File

@@ -36,7 +36,6 @@ from diffusers.hooks import apply_group_offloading
from diffusers.hooks.faster_cache import FasterCacheBlockHook, FasterCacheDenoiserHook
from diffusers.hooks.first_block_cache import FirstBlockCacheConfig
from diffusers.hooks.pyramid_attention_broadcast import PyramidAttentionBroadcastHook
from diffusers.hooks.taylorseer_cache import TaylorSeerCacheConfig
from diffusers.image_processor import VaeImageProcessor
from diffusers.loaders import FluxIPAdapterMixin, IPAdapterMixin
from diffusers.models.attention import AttentionModuleMixin
@@ -2925,57 +2924,6 @@ class FirstBlockCacheTesterMixin:
)
class TaylorSeerCacheTesterMixin:
taylorseer_cache_config = TaylorSeerCacheConfig(
cache_interval=5,
disable_cache_before_step=10,
max_order=1,
taylor_factors_dtype=torch.bfloat16,
use_lite_mode=True,
)
def test_taylorseer_cache_inference(self, expected_atol: float = 0.1):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
def create_pipe():
torch.manual_seed(0)
num_layers = 2
components = self.get_dummy_components(num_layers=num_layers)
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
return pipe
def run_forward(pipe):
torch.manual_seed(0)
inputs = self.get_dummy_inputs(device)
inputs["num_inference_steps"] = 50
return pipe(**inputs)[0]
# Run inference without TaylorSeerCache
pipe = create_pipe()
output = run_forward(pipe).flatten()
original_image_slice = np.concatenate((output[:8], output[-8:]))
# Run inference with TaylorSeerCache enabled
pipe = create_pipe()
pipe.transformer.enable_cache(self.taylorseer_cache_config)
output = run_forward(pipe).flatten()
image_slice_fbc_enabled = np.concatenate((output[:8], output[-8:]))
# Run inference with TaylorSeerCache disabled
pipe.transformer.disable_cache()
output = run_forward(pipe).flatten()
image_slice_fbc_disabled = np.concatenate((output[:8], output[-8:]))
assert np.allclose(original_image_slice, image_slice_fbc_enabled, atol=expected_atol), (
"TaylorSeerCache outputs should not differ much."
)
assert np.allclose(original_image_slice, image_slice_fbc_disabled, atol=1e-4), (
"Outputs from normal inference and after disabling cache should not differ."
)
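
Outside the test harness, the pattern exercised by this mixin is the same one used for the other cache configs: build a `TaylorSeerCacheConfig` and hand it to the denoiser's `enable_cache`. A hedged sketch as the API stood before this change (model id and config values are illustrative only):

```python
import torch
from diffusers import DiffusionPipeline, TaylorSeerCacheConfig

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

config = TaylorSeerCacheConfig(
    cache_interval=5,
    disable_cache_before_step=10,
    max_order=1,
    taylor_factors_dtype=torch.bfloat16,
)
pipe.transformer.enable_cache(config)

image = pipe("a photo of a cat", num_inference_steps=28).images[0]

# Caching can be reverted at any point.
pipe.transformer.disable_cache()
```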
# Some models (e.g. unCLIP) are extremely likely to significantly deviate depending on which hardware is used.
# This helper function is used to check that the image doesn't deviate on average more than 10 pixels from a
# reference image.

View File

@@ -1,358 +0,0 @@
# Copyright 2025 Alibaba Z-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import os
import unittest
import numpy as np
import torch
from transformers import Qwen2Tokenizer, Qwen3Config, Qwen3Model
from diffusers import (
AutoencoderKL,
FlowMatchEulerDiscreteScheduler,
ZImageImg2ImgPipeline,
ZImageTransformer2DModel,
)
from diffusers.utils.testing_utils import floats_tensor
from ...testing_utils import torch_device
from ..pipeline_params import (
IMAGE_TO_IMAGE_IMAGE_PARAMS,
TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS,
TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
)
from ..test_pipelines_common import PipelineTesterMixin, to_np
# Z-Image requires torch.use_deterministic_algorithms(False) due to complex64 RoPE operations
# Cannot use enable_full_determinism() which sets it to True
# Note: Z-Image does not support FP16 inference due to complex64 RoPE embeddings
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
torch.use_deterministic_algorithms(False)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
if hasattr(torch.backends, "cuda"):
torch.backends.cuda.matmul.allow_tf32 = False
class ZImageImg2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = ZImageImg2ImgPipeline
params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"cross_attention_kwargs"}
batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
required_optional_params = frozenset(
[
"num_inference_steps",
"strength",
"generator",
"latents",
"return_dict",
"callback_on_step_end",
"callback_on_step_end_tensor_inputs",
]
)
supports_dduf = False
test_xformers_attention = False
test_layerwise_casting = True
test_group_offloading = True
def setUp(self):
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
torch.manual_seed(0)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(0)
def tearDown(self):
super().tearDown()
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
torch.manual_seed(0)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(0)
def get_dummy_components(self):
torch.manual_seed(0)
transformer = ZImageTransformer2DModel(
all_patch_size=(2,),
all_f_patch_size=(1,),
in_channels=16,
dim=32,
n_layers=2,
n_refiner_layers=1,
n_heads=2,
n_kv_heads=2,
norm_eps=1e-5,
qk_norm=True,
cap_feat_dim=16,
rope_theta=256.0,
t_scale=1000.0,
axes_dims=[8, 4, 4],
axes_lens=[256, 32, 32],
)
torch.manual_seed(0)
vae = AutoencoderKL(
in_channels=3,
out_channels=3,
down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
block_out_channels=[32, 64],
layers_per_block=1,
latent_channels=16,
norm_num_groups=32,
sample_size=32,
scaling_factor=0.3611,
shift_factor=0.1159,
)
torch.manual_seed(0)
scheduler = FlowMatchEulerDiscreteScheduler()
torch.manual_seed(0)
config = Qwen3Config(
hidden_size=16,
intermediate_size=16,
num_hidden_layers=2,
num_attention_heads=2,
num_key_value_heads=2,
vocab_size=151936,
max_position_embeddings=512,
)
text_encoder = Qwen3Model(config)
tokenizer = Qwen2Tokenizer.from_pretrained("hf-internal-testing/tiny-random-Qwen2VLForConditionalGeneration")
components = {
"transformer": transformer,
"vae": vae,
"scheduler": scheduler,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
}
return components
def get_dummy_inputs(self, device, seed=0):
import random
if str(device).startswith("mps"):
generator = torch.manual_seed(seed)
else:
generator = torch.Generator(device=device).manual_seed(seed)
image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
inputs = {
"prompt": "dance monkey",
"negative_prompt": "bad quality",
"image": image,
"strength": 0.6,
"generator": generator,
"num_inference_steps": 2,
"guidance_scale": 3.0,
"cfg_normalization": False,
"cfg_truncation": 1.0,
"height": 32,
"width": 32,
"max_sequence_length": 16,
"output_type": "np",
}
return inputs
def test_inference(self):
device = "cpu"
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
image = pipe(**inputs).images
generated_image = image[0]
self.assertEqual(generated_image.shape, (32, 32, 3))
def test_inference_batch_single_identical(self):
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
torch.manual_seed(0)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(0)
self._test_inference_batch_single_identical(batch_size=3, expected_max_diff=1e-1)
def test_num_images_per_prompt(self):
import inspect
sig = inspect.signature(self.pipeline_class.__call__)
if "num_images_per_prompt" not in sig.parameters:
return
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
batch_sizes = [1, 2]
num_images_per_prompts = [1, 2]
for batch_size in batch_sizes:
for num_images_per_prompt in num_images_per_prompts:
inputs = self.get_dummy_inputs(torch_device)
for key in inputs.keys():
if key in self.batch_params:
inputs[key] = batch_size * [inputs[key]]
images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt)[0]
assert images.shape[0] == batch_size * num_images_per_prompt
del pipe
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
def test_attention_slicing_forward_pass(
self, test_max_difference=True, test_mean_pixel_difference=True, expected_max_diff=1e-3
):
if not self.test_attention_slicing:
return
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
for component in pipe.components.values():
if hasattr(component, "set_default_attn_processor"):
component.set_default_attn_processor()
pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
generator_device = "cpu"
inputs = self.get_dummy_inputs(generator_device)
output_without_slicing = pipe(**inputs)[0]
pipe.enable_attention_slicing(slice_size=1)
inputs = self.get_dummy_inputs(generator_device)
output_with_slicing1 = pipe(**inputs)[0]
pipe.enable_attention_slicing(slice_size=2)
inputs = self.get_dummy_inputs(generator_device)
output_with_slicing2 = pipe(**inputs)[0]
if test_max_difference:
max_diff1 = np.abs(to_np(output_with_slicing1) - to_np(output_without_slicing)).max()
max_diff2 = np.abs(to_np(output_with_slicing2) - to_np(output_without_slicing)).max()
self.assertLess(
max(max_diff1, max_diff2),
expected_max_diff,
"Attention slicing should not affect the inference results",
)
def test_vae_tiling(self, expected_diff_max: float = 0.3):
import random
generator_device = "cpu"
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe.to("cpu")
pipe.set_progress_bar_config(disable=None)
# Without tiling
inputs = self.get_dummy_inputs(generator_device)
inputs["height"] = inputs["width"] = 128
# Generate a larger image for the input
inputs["image"] = floats_tensor((1, 3, 128, 128), rng=random.Random(0)).to("cpu")
output_without_tiling = pipe(**inputs)[0]
# With tiling (standard AutoencoderKL doesn't accept parameters)
pipe.vae.enable_tiling()
inputs = self.get_dummy_inputs(generator_device)
inputs["height"] = inputs["width"] = 128
inputs["image"] = floats_tensor((1, 3, 128, 128), rng=random.Random(0)).to("cpu")
output_with_tiling = pipe(**inputs)[0]
self.assertLess(
(to_np(output_without_tiling) - to_np(output_with_tiling)).max(),
expected_diff_max,
"VAE tiling should not affect the inference results",
)
def test_pipeline_with_accelerator_device_map(self, expected_max_difference=5e-4):
# Z-Image RoPE embeddings (complex64) have slightly higher numerical tolerance
super().test_pipeline_with_accelerator_device_map(expected_max_difference=expected_max_difference)
def test_group_offloading_inference(self):
# Block-level offloading conflicts with RoPE cache. Pipeline-level offloading (tested separately) works fine.
self.skipTest("Using test_pipeline_level_group_offloading_inference instead")
def test_save_load_float16(self, expected_max_diff=1e-2):
# Z-Image does not support FP16 due to complex64 RoPE embeddings
self.skipTest("Z-Image does not support FP16 inference")
def test_float16_inference(self, expected_max_diff=5e-2):
# Z-Image does not support FP16 due to complex64 RoPE embeddings
self.skipTest("Z-Image does not support FP16 inference")
def test_strength_parameter(self):
"""Test that strength parameter affects the output correctly."""
device = "cpu"
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe.to(device)
pipe.set_progress_bar_config(disable=None)
# Test with different strength values
inputs_low_strength = self.get_dummy_inputs(device)
inputs_low_strength["strength"] = 0.2
inputs_high_strength = self.get_dummy_inputs(device)
inputs_high_strength["strength"] = 0.8
# Both should complete without errors
output_low = pipe(**inputs_low_strength).images[0]
output_high = pipe(**inputs_high_strength).images[0]
# Outputs should be different (different amount of transformation)
self.assertFalse(np.allclose(output_low, output_high, atol=1e-3))
def test_invalid_strength(self):
"""Test that invalid strength values raise appropriate errors."""
device = "cpu"
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe.to(device)
inputs = self.get_dummy_inputs(device)
# Test strength < 0
inputs["strength"] = -0.1
with self.assertRaises(ValueError):
pipe(**inputs)
# Test strength > 1
inputs["strength"] = 1.5
with self.assertRaises(ValueError):
pipe(**inputs)
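
Both strength tests above rely on the usual img2img convention: `strength` decides how much of the noise schedule actually runs, so higher values overwrite more of the input image. A rough sketch of that mapping, following the common diffusers pattern rather than the exact Z-Image implementation:

```python
def get_timesteps(timesteps, num_inference_steps, strength):
    # strength=1.0 runs the full schedule; strength=0.0 leaves the input untouched.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return timesteps[t_start:], num_inference_steps - t_start
```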