remove existing tests/others/test_attention_backends.py file

add attention backend tests.
Support Flux Klein peft (fal) lora format (#13169 )
2026-02-23 19:30:38 +08:00 · 2026-02-23 11:31:37 +05:30 · 2026-02-23 11:30:28 +05:30 · 2026-02-21 10:31:18 +05:30 · 2026-02-20 09:31:52 +05:30 · 2026-02-20 09:01:20 +05:30
29 changed files with 950 additions and 319 deletions
--- a/.github/workflows/pr_tests_gpu.yml
+++ b/.github/workflows/pr_tests_gpu.yml
@@ -199,11 +199,6 @@ jobs:

    - name: Install dependencies
      run: |
-        # Install pkgs which depend on setuptools<81 for pkg_resources first with no build isolation
-        uv pip install pip==25.2 setuptools==80.10.2
-        uv pip install --no-build-isolation k-diffusion==0.0.12
-        uv pip install --upgrade pip setuptools
-        # Install the rest as normal
        uv pip install -e ".[quality]"
        uv pip install peft@git+https://github.com/huggingface/peft.git
        uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
--- a/.github/workflows/push_tests.yml
+++ b/.github/workflows/push_tests.yml
@@ -126,11 +126,6 @@ jobs:

    - name: Install dependencies
      run: |
-        # Install pkgs which depend on setuptools<81 for pkg_resources first with no build isolation
-        uv pip install pip==25.2 setuptools==80.10.2
-        uv pip install --no-build-isolation k-diffusion==0.0.12
-        uv pip install --upgrade pip setuptools
-        # Install the rest as normal
        uv pip install -e ".[quality]"
        uv pip install peft@git+https://github.com/huggingface/peft.git
        uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
--- a/.github/workflows/push_tests_mps.yml
+++ b/.github/workflows/push_tests_mps.yml
@@ -41,7 +41,7 @@ jobs:
      shell: arch -arch arm64 bash {0}
      run: |
        ${CONDA_RUN} python -m pip install --upgrade pip uv
-        ${CONDA_RUN} python -m uv pip install -e ".[quality,test]"
+        ${CONDA_RUN} python -m uv pip install -e ".[quality]"
        ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio
        ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
        ${CONDA_RUN} python -m uv pip install transformers --upgrade
--- a/docs/source/en/api/pipelines/qwenimage.md
+++ b/docs/source/en/api/pipelines/qwenimage.md
@@ -29,7 +29,7 @@ Qwen-Image comes in the following variants:
 | Qwen-Image-Edit Plus | [Qwen/Qwen-Image-Edit-2509](https://huggingface.co/Qwen/Qwen-Image-Edit-2509) |

 > [!TIP]
-> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.
+> See the [Caching](../../optimization/cache) guide to speed up inference by storing and reusing intermediate outputs.

 ## LoRA for faster inference

@@ -190,6 +190,12 @@ For detailed benchmark scripts and results, see [this gist](https://gist.github.
  - all
  - __call__

+## QwenImageLayeredPipeline
+
+[[autodoc]] QwenImageLayeredPipeline
+  - all
+  - __call__
+
 ## QwenImagePipelineOutput

 [[autodoc]] pipelines.qwenimage.pipeline_output.QwenImagePipelineOutput
--- a/setup.py
+++ b/setup.py
@@ -101,6 +101,7 @@ _deps = [
    "datasets",
    "filelock",
    "flax>=0.4.1",
+    "ftfy",
    "hf-doc-builder>=0.3.0",
    "httpx<1.0.0",
    "huggingface-hub>=0.34.0,<2.0",
@@ -221,12 +222,14 @@ extras["docs"] = deps_list("hf-doc-builder")
 extras["training"] = deps_list("accelerate", "datasets", "protobuf", "tensorboard", "Jinja2", "peft", "timm")
 extras["test"] = deps_list(
    "compel",
+    "ftfy",
    "GitPython",
    "datasets",
    "Jinja2",
    "invisible-watermark",
    "librosa",
    "parameterized",
+    "protobuf",
    "pytest",
    "pytest-timeout",
    "pytest-xdist",
@@ -235,6 +238,7 @@ extras["test"] = deps_list(
    "sentencepiece",
    "scipy",
    "tiktoken",
+    "torchsde",
    "torchvision",
    "transformers",
    "phonemizer",
--- a/src/diffusers/dependency_versions_table.py
+++ b/src/diffusers/dependency_versions_table.py
@@ -8,6 +8,7 @@ deps = {
    "datasets": "datasets",
    "filelock": "filelock",
    "flax": "flax>=0.4.1",
+    "ftfy": "ftfy",
    "hf-doc-builder": "hf-doc-builder>=0.3.0",
    "httpx": "httpx<1.0.0",
    "huggingface-hub": "huggingface-hub>=0.34.0,<2.0",
--- a/src/diffusers/loaders/lora_pipeline.py
+++ b/src/diffusers/loaders/lora_pipeline.py
@@ -5472,6 +5472,10 @@ class Flux2LoraLoaderMixin(LoraBaseMixin):
            logger.warning(warn_msg)
            state_dict = {k: v for k, v in state_dict.items() if "dora_scale" not in k}

+        is_peft_format = any(k.startswith("base_model.model.") for k in state_dict)
+        if is_peft_format:
+            state_dict = {k.replace("base_model.model.", "diffusion_model."): v for k, v in state_dict.items()}
+
        is_ai_toolkit = any(k.startswith("diffusion_model.") for k in state_dict)
        if is_ai_toolkit:
            state_dict = _convert_non_diffusers_flux2_lora_to_diffusers(state_dict)
--- a/src/diffusers/models/attention_dispatch.py
+++ b/src/diffusers/models/attention_dispatch.py
@@ -266,6 +266,10 @@ class _HubKernelConfig:
    function_attr: str
    revision: str | None = None
    kernel_fn: Callable | None = None
+    wrapped_forward_attr: str | None = None
+    wrapped_backward_attr: str | None = None
+    wrapped_forward_fn: Callable | None = None
+    wrapped_backward_fn: Callable | None = None


 # Registry for hub-based attention kernels
@@ -280,7 +284,11 @@ _HUB_KERNELS_REGISTRY: dict["AttentionBackendName", _HubKernelConfig] = {
        # revision="fake-ops-return-probs",
    ),
    AttentionBackendName.FLASH_HUB: _HubKernelConfig(
-        repo_id="kernels-community/flash-attn2", function_attr="flash_attn_func", revision=None
+        repo_id="kernels-community/flash-attn2",
+        function_attr="flash_attn_func",
+        revision=None,
+        wrapped_forward_attr="flash_attn_interface._wrapped_flash_attn_forward",
+        wrapped_backward_attr="flash_attn_interface._wrapped_flash_attn_backward",
    ),
    AttentionBackendName.FLASH_VARLEN_HUB: _HubKernelConfig(
        repo_id="kernels-community/flash-attn2", function_attr="flash_attn_varlen_func", revision=None
@@ -605,22 +613,39 @@ def _flex_attention_causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):


 # ===== Helpers for downloading kernels =====
+def _resolve_kernel_attr(module, attr_path: str):
+    target = module
+    for attr in attr_path.split("."):
+        if not hasattr(target, attr):
+            raise AttributeError(f"Kernel module '{module.__name__}' does not define attribute path '{attr_path}'.")
+        target = getattr(target, attr)
+    return target
+
+
 def _maybe_download_kernel_for_backend(backend: AttentionBackendName) -> None:
    if backend not in _HUB_KERNELS_REGISTRY:
        return
    config = _HUB_KERNELS_REGISTRY[backend]

-    if config.kernel_fn is not None:
+    needs_kernel = config.kernel_fn is None
+    needs_wrapped_forward = config.wrapped_forward_attr is not None and config.wrapped_forward_fn is None
+    needs_wrapped_backward = config.wrapped_backward_attr is not None and config.wrapped_backward_fn is None
+
+    if not (needs_kernel or needs_wrapped_forward or needs_wrapped_backward):
        return

    try:
        from kernels import get_kernel

        kernel_module = get_kernel(config.repo_id, revision=config.revision)
-        kernel_func = getattr(kernel_module, config.function_attr)
+        if needs_kernel:
+            config.kernel_fn = _resolve_kernel_attr(kernel_module, config.function_attr)

-        # Cache the downloaded kernel function in the config object
-        config.kernel_fn = kernel_func
+        if needs_wrapped_forward:
+            config.wrapped_forward_fn = _resolve_kernel_attr(kernel_module, config.wrapped_forward_attr)
+
+        if needs_wrapped_backward:
+            config.wrapped_backward_fn = _resolve_kernel_attr(kernel_module, config.wrapped_backward_attr)

    except Exception as e:
        logger.error(f"An error occurred while fetching kernel '{config.repo_id}' from the Hub: {e}")
@@ -1071,6 +1096,237 @@ def _flash_attention_backward_op(
    return grad_query, grad_key, grad_value


+def _flash_attention_hub_forward_op(
+    ctx: torch.autograd.function.FunctionCtx,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attn_mask: torch.Tensor | None = None,
+    dropout_p: float = 0.0,
+    is_causal: bool = False,
+    scale: float | None = None,
+    enable_gqa: bool = False,
+    return_lse: bool = False,
+    _save_ctx: bool = True,
+    _parallel_config: "ParallelConfig" | None = None,
+):
+    if attn_mask is not None:
+        raise ValueError("`attn_mask` is not yet supported for flash-attn hub kernels.")
+    if enable_gqa:
+        raise ValueError("`enable_gqa` is not yet supported for flash-attn hub kernels.")
+
+    config = _HUB_KERNELS_REGISTRY[AttentionBackendName.FLASH_HUB]
+    wrapped_forward_fn = config.wrapped_forward_fn
+    wrapped_backward_fn = config.wrapped_backward_fn
+    if wrapped_forward_fn is None or wrapped_backward_fn is None:
+        raise RuntimeError(
+            "Flash attention hub kernels must expose `_wrapped_flash_attn_forward` and `_wrapped_flash_attn_backward` "
+            "for context parallel execution."
+        )
+
+    if scale is None:
+        scale = query.shape[-1] ** (-0.5)
+
+    window_size = (-1, -1)
+    softcap = 0.0
+    alibi_slopes = None
+    deterministic = False
+    grad_enabled = any(x.requires_grad for x in (query, key, value))
+
+    if grad_enabled or (_parallel_config is not None and _parallel_config.context_parallel_config._world_size > 1):
+        dropout_p = dropout_p if dropout_p > 0 else 1e-30
+
+    with torch.set_grad_enabled(grad_enabled):
+        out, lse, S_dmask, rng_state = wrapped_forward_fn(
+            query,
+            key,
+            value,
+            dropout_p,
+            scale,
+            is_causal,
+            window_size[0],
+            window_size[1],
+            softcap,
+            alibi_slopes,
+            return_lse,
+        )
+        lse = lse.permute(0, 2, 1).contiguous()
+
+    if _save_ctx:
+        ctx.save_for_backward(query, key, value, out, lse, rng_state)
+        ctx.dropout_p = dropout_p
+        ctx.scale = scale
+        ctx.is_causal = is_causal
+        ctx.window_size = window_size
+        ctx.softcap = softcap
+        ctx.alibi_slopes = alibi_slopes
+        ctx.deterministic = deterministic
+
+    return (out, lse) if return_lse else out
+
+
+def _flash_attention_hub_backward_op(
+    ctx: torch.autograd.function.FunctionCtx,
+    grad_out: torch.Tensor,
+    *args,
+    **kwargs,
+):
+    config = _HUB_KERNELS_REGISTRY[AttentionBackendName.FLASH_HUB]
+    wrapped_backward_fn = config.wrapped_backward_fn
+    if wrapped_backward_fn is None:
+        raise RuntimeError(
+            "Flash attention hub kernels must expose `_wrapped_flash_attn_backward` for context parallel execution."
+        )
+
+    query, key, value, out, lse, rng_state = ctx.saved_tensors
+    grad_query, grad_key, grad_value = torch.empty_like(query), torch.empty_like(key), torch.empty_like(value)
+
+    _ = wrapped_backward_fn(
+        grad_out,
+        query,
+        key,
+        value,
+        out,
+        lse,
+        grad_query,
+        grad_key,
+        grad_value,
+        ctx.dropout_p,
+        ctx.scale,
+        ctx.is_causal,
+        ctx.window_size[0],
+        ctx.window_size[1],
+        ctx.softcap,
+        ctx.alibi_slopes,
+        ctx.deterministic,
+        rng_state,
+    )
+
+    grad_query = grad_query[..., : grad_out.shape[-1]]
+    grad_key = grad_key[..., : grad_out.shape[-1]]
+    grad_value = grad_value[..., : grad_out.shape[-1]]
+
+    return grad_query, grad_key, grad_value
+
+
+def _flash_attention_3_hub_forward_op(
+    ctx: torch.autograd.function.FunctionCtx,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attn_mask: torch.Tensor | None = None,
+    dropout_p: float = 0.0,
+    is_causal: bool = False,
+    scale: float | None = None,
+    enable_gqa: bool = False,
+    return_lse: bool = False,
+    _save_ctx: bool = True,
+    _parallel_config: "ParallelConfig" | None = None,
+    *,
+    window_size: tuple[int, int] = (-1, -1),
+    softcap: float = 0.0,
+    num_splits: int = 1,
+    pack_gqa: bool | None = None,
+    deterministic: bool = False,
+    sm_margin: int = 0,
+):
+    if attn_mask is not None:
+        raise ValueError("`attn_mask` is not yet supported for flash-attn 3 hub kernels.")
+    if dropout_p != 0.0:
+        raise ValueError("`dropout_p` is not yet supported for flash-attn 3 hub kernels.")
+    if enable_gqa:
+        raise ValueError("`enable_gqa` is not yet supported for flash-attn 3 hub kernels.")
+
+    func = _HUB_KERNELS_REGISTRY[AttentionBackendName._FLASH_3_HUB].kernel_fn
+    out = func(
+        q=query,
+        k=key,
+        v=value,
+        softmax_scale=scale,
+        causal=is_causal,
+        qv=None,
+        q_descale=None,
+        k_descale=None,
+        v_descale=None,
+        window_size=window_size,
+        softcap=softcap,
+        num_splits=num_splits,
+        pack_gqa=pack_gqa,
+        deterministic=deterministic,
+        sm_margin=sm_margin,
+        return_attn_probs=return_lse,
+    )
+
+    lse = None
+    if return_lse:
+        out, lse = out
+        lse = lse.permute(0, 2, 1).contiguous()
+
+    if _save_ctx:
+        ctx.save_for_backward(query, key, value)
+        ctx.scale = scale
+        ctx.is_causal = is_causal
+        ctx._hub_kernel = func
+
+    return (out, lse) if return_lse else out
+
+
+def _flash_attention_3_hub_backward_op(
+    ctx: torch.autograd.function.FunctionCtx,
+    grad_out: torch.Tensor,
+    *args,
+    window_size: tuple[int, int] = (-1, -1),
+    softcap: float = 0.0,
+    num_splits: int = 1,
+    pack_gqa: bool | None = None,
+    deterministic: bool = False,
+    sm_margin: int = 0,
+):
+    query, key, value = ctx.saved_tensors
+    kernel_fn = ctx._hub_kernel
+    # NOTE: Unlike the FA2 hub kernel, the FA3 hub kernel does not expose separate wrapped forward/backward
+    # primitives (no `wrapped_forward_attr`/`wrapped_backward_attr` in its `_HubKernelConfig`). We
+    # therefore rerun the forward pass under `torch.enable_grad()` and differentiate through it with
+    # `torch.autograd.grad()`. This is a second forward pass during backward; it can be avoided once
+    # the FA3 hub exposes a dedicated fused backward kernel (analogous to `_wrapped_flash_attn_backward`
+    # in the FA2 hub), at which point this can be refactored to match `_flash_attention_hub_backward_op`.
+    with torch.enable_grad():
+        query_r = query.detach().requires_grad_(True)
+        key_r = key.detach().requires_grad_(True)
+        value_r = value.detach().requires_grad_(True)
+
+        out = kernel_fn(
+            q=query_r,
+            k=key_r,
+            v=value_r,
+            softmax_scale=ctx.scale,
+            causal=ctx.is_causal,
+            qv=None,
+            q_descale=None,
+            k_descale=None,
+            v_descale=None,
+            window_size=window_size,
+            softcap=softcap,
+            num_splits=num_splits,
+            pack_gqa=pack_gqa,
+            deterministic=deterministic,
+            sm_margin=sm_margin,
+            return_attn_probs=False,
+        )
+        if isinstance(out, tuple):
+            out = out[0]
+
+        grad_query, grad_key, grad_value = torch.autograd.grad(
+            out,
+            (query_r, key_r, value_r),
+            grad_out,
+            retain_graph=False,
+            allow_unused=False,
+        )
+
+    return grad_query, grad_key, grad_value
+
+
 def _sage_attention_forward_op(
    ctx: torch.autograd.function.FunctionCtx,
    query: torch.Tensor,
@@ -1109,6 +1365,46 @@ def _sage_attention_forward_op(
    return (out, lse) if return_lse else out


+def _sage_attention_hub_forward_op(
+    ctx: torch.autograd.function.FunctionCtx,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attn_mask: torch.Tensor | None = None,
+    dropout_p: float = 0.0,
+    is_causal: bool = False,
+    scale: float | None = None,
+    enable_gqa: bool = False,
+    return_lse: bool = False,
+    _save_ctx: bool = True,
+    _parallel_config: "ParallelConfig" | None = None,
+):
+    if attn_mask is not None:
+        raise ValueError("`attn_mask` is not yet supported for Sage attention.")
+    if dropout_p > 0.0:
+        raise ValueError("`dropout_p` is not yet supported for Sage attention.")
+    if enable_gqa:
+        raise ValueError("`enable_gqa` is not yet supported for Sage attention.")
+
+    func = _HUB_KERNELS_REGISTRY[AttentionBackendName.SAGE_HUB].kernel_fn
+    out = func(
+        q=query,
+        k=key,
+        v=value,
+        tensor_layout="NHD",
+        is_causal=is_causal,
+        sm_scale=scale,
+        return_lse=return_lse,
+    )
+
+    lse = None
+    if return_lse:
+        out, lse, *_ = out
+        lse = lse.permute(0, 2, 1).contiguous()
+
+    return (out, lse) if return_lse else out
+
+
 def _sage_attention_backward_op(
    ctx: torch.autograd.function.FunctionCtx,
    grad_out: torch.Tensor,
@@ -1117,6 +1413,26 @@ def _sage_attention_backward_op(
    raise NotImplementedError("Backward pass is not implemented for Sage attention.")


+def _maybe_modify_attn_mask_npu(query: torch.Tensor, key: torch.Tensor, attn_mask: torch.Tensor | None = None):
+    # Skip Attention Mask if all values are 1, `None` mask can speedup the computation
+    if attn_mask is not None and torch.all(attn_mask != 0):
+        attn_mask = None
+
+    # Reshape Attention Mask: [batch_size, seq_len_k] -> [batch_size, 1, sqe_len_q, seq_len_k]
+    # https://www.hiascend.com/document/detail/zh/Pytorch/730/apiref/torchnpuCustomsapi/docs/context/torch_npu-npu_fusion_attention.md
+    if (
+        attn_mask is not None
+        and attn_mask.ndim == 2
+        and attn_mask.shape[0] == query.shape[0]
+        and attn_mask.shape[1] == key.shape[1]
+    ):
+        B, Sq, Skv = attn_mask.shape[0], query.shape[1], key.shape[1]
+        attn_mask = ~attn_mask.to(torch.bool)
+        attn_mask = attn_mask.unsqueeze(1).expand(B, Sq, Skv).unsqueeze(1).contiguous()
+
+    return attn_mask
+
+
 def _npu_attention_forward_op(
    ctx: torch.autograd.function.FunctionCtx,
    query: torch.Tensor,
@@ -1134,11 +1450,14 @@ def _npu_attention_forward_op(
    if return_lse:
        raise ValueError("NPU attention backend does not support setting `return_lse=True`.")

+    attn_mask = _maybe_modify_attn_mask_npu(query, key, attn_mask)
+
    out = npu_fusion_attention(
        query,
        key,
        value,
        query.size(2),  # num_heads
+        atten_mask=attn_mask,
        input_layout="BSND",
        pse=None,
        scale=1.0 / math.sqrt(query.shape[-1]) if scale is None else scale,
@@ -1942,7 +2261,7 @@ def _flash_attention(
@_AttentionBackendRegistry.register(
    AttentionBackendName.FLASH_HUB,
    constraints=[_check_device, _check_qkv_dtype_bf16_or_fp16, _check_shape],
-    supports_context_parallel=False,
+    supports_context_parallel=True,
 )
 def _flash_attention_hub(
    query: torch.Tensor,
@@ -1960,17 +2279,35 @@ def _flash_attention_hub(
        raise ValueError("`attn_mask` is not supported for flash-attn 2.")

    func = _HUB_KERNELS_REGISTRY[AttentionBackendName.FLASH_HUB].kernel_fn
-    out = func(
-        q=query,
-        k=key,
-        v=value,
-        dropout_p=dropout_p,
-        softmax_scale=scale,
-        causal=is_causal,
-        return_attn_probs=return_lse,
-    )
-    if return_lse:
-        out, lse, *_ = out
+    if _parallel_config is None:
+        out = func(
+            q=query,
+            k=key,
+            v=value,
+            dropout_p=dropout_p,
+            softmax_scale=scale,
+            causal=is_causal,
+            return_attn_probs=return_lse,
+        )
+        if return_lse:
+            out, lse, *_ = out
+    else:
+        out = _templated_context_parallel_attention(
+            query,
+            key,
+            value,
+            None,
+            dropout_p,
+            is_causal,
+            scale,
+            False,
+            return_lse,
+            forward_op=_flash_attention_hub_forward_op,
+            backward_op=_flash_attention_hub_backward_op,
+            _parallel_config=_parallel_config,
+        )
+        if return_lse:
+            out, lse = out

    return (out, lse) if return_lse else out

@@ -2117,7 +2454,7 @@ def _flash_attention_3(
@_AttentionBackendRegistry.register(
    AttentionBackendName._FLASH_3_HUB,
    constraints=[_check_device, _check_qkv_dtype_bf16_or_fp16, _check_shape],
-    supports_context_parallel=False,
+    supports_context_parallel=True,
 )
 def _flash_attention_3_hub(
    query: torch.Tensor,
@@ -2132,33 +2469,68 @@ def _flash_attention_3_hub(
    return_attn_probs: bool = False,
    _parallel_config: "ParallelConfig" | None = None,
 ) -> torch.Tensor:
-    if _parallel_config:
-        raise NotImplementedError(f"{AttentionBackendName._FLASH_3_HUB.value} is not implemented for parallelism yet.")
    if attn_mask is not None:
        raise ValueError("`attn_mask` is not supported for flash-attn 3.")

    func = _HUB_KERNELS_REGISTRY[AttentionBackendName._FLASH_3_HUB].kernel_fn
-    out = func(
-        q=query,
-        k=key,
-        v=value,
-        softmax_scale=scale,
-        causal=is_causal,
-        qv=None,
-        q_descale=None,
-        k_descale=None,
-        v_descale=None,
+    if _parallel_config is None:
+        out = func(
+            q=query,
+            k=key,
+            v=value,
+            softmax_scale=scale,
+            causal=is_causal,
+            qv=None,
+            q_descale=None,
+            k_descale=None,
+            v_descale=None,
+            window_size=window_size,
+            softcap=softcap,
+            num_splits=1,
+            pack_gqa=None,
+            deterministic=deterministic,
+            sm_margin=0,
+            return_attn_probs=return_attn_probs,
+        )
+        return (out[0], out[1]) if return_attn_probs else out
+
+    forward_op = functools.partial(
+        _flash_attention_3_hub_forward_op,
        window_size=window_size,
        softcap=softcap,
        num_splits=1,
        pack_gqa=None,
        deterministic=deterministic,
        sm_margin=0,
-        return_attn_probs=return_attn_probs,
    )
-    # When `return_attn_probs` is True, the above returns a tuple of
-    # actual outputs and lse.
-    return (out[0], out[1]) if return_attn_probs else out
+    backward_op = functools.partial(
+        _flash_attention_3_hub_backward_op,
+        window_size=window_size,
+        softcap=softcap,
+        num_splits=1,
+        pack_gqa=None,
+        deterministic=deterministic,
+        sm_margin=0,
+    )
+    out = _templated_context_parallel_attention(
+        query,
+        key,
+        value,
+        None,
+        0.0,
+        is_causal,
+        scale,
+        False,
+        return_attn_probs,
+        forward_op=forward_op,
+        backward_op=backward_op,
+        _parallel_config=_parallel_config,
+    )
+    if return_attn_probs:
+        out, lse = out
+        return out, lse
+
+    return out


@_AttentionBackendRegistry.register(
@@ -2668,16 +3040,17 @@ def _native_npu_attention(
    return_lse: bool = False,
    _parallel_config: "ParallelConfig" | None = None,
 ) -> torch.Tensor:
-    if attn_mask is not None:
-        raise ValueError("`attn_mask` is not supported for NPU attention")
    if return_lse:
        raise ValueError("NPU attention backend does not support setting `return_lse=True`.")
    if _parallel_config is None:
+        attn_mask = _maybe_modify_attn_mask_npu(query, key, attn_mask)
+
        out = npu_fusion_attention(
            query,
            key,
            value,
            query.size(2),  # num_heads
+            atten_mask=attn_mask,
            input_layout="BSND",
            pse=None,
            scale=1.0 / math.sqrt(query.shape[-1]) if scale is None else scale,
@@ -2692,7 +3065,7 @@ def _native_npu_attention(
            query,
            key,
            value,
-            None,
+            attn_mask,
            dropout_p,
            None,
            scale,
@@ -2789,7 +3162,7 @@ def _sage_attention(
@_AttentionBackendRegistry.register(
    AttentionBackendName.SAGE_HUB,
    constraints=[_check_device_cuda, _check_qkv_dtype_bf16_or_fp16, _check_shape],
-    supports_context_parallel=False,
+    supports_context_parallel=True,
 )
 def _sage_attention_hub(
    query: torch.Tensor,
@@ -2817,6 +3190,23 @@ def _sage_attention_hub(
        )
        if return_lse:
            out, lse, *_ = out
+    else:
+        out = _templated_context_parallel_attention(
+            query,
+            key,
+            value,
+            None,
+            0.0,
+            is_causal,
+            scale,
+            False,
+            return_lse,
+            forward_op=_sage_attention_hub_forward_op,
+            backward_op=_sage_attention_backward_op,
+            _parallel_config=_parallel_config,
+        )
+        if return_lse:
+            out, lse = out

    return (out, lse) if return_lse else out

--- a/src/diffusers/models/transformers/transformer_flux2.py
+++ b/src/diffusers/models/transformers/transformer_flux2.py
@@ -424,7 +424,7 @@ class Flux2SingleTransformerBlock(nn.Module):
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor | None,
-        temb_mod_params: tuple[torch.Tensor, torch.Tensor, torch.Tensor],
+        temb_mod: torch.Tensor,
        image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
        joint_attention_kwargs: dict[str, Any] | None = None,
        split_hidden_states: bool = False,
@@ -436,7 +436,7 @@ class Flux2SingleTransformerBlock(nn.Module):
            text_seq_len = encoder_hidden_states.shape[1]
            hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)

-        mod_shift, mod_scale, mod_gate = temb_mod_params
+        mod_shift, mod_scale, mod_gate = Flux2Modulation.split(temb_mod, 1)[0]

        norm_hidden_states = self.norm(hidden_states)
        norm_hidden_states = (1 + mod_scale) * norm_hidden_states + mod_shift
@@ -498,16 +498,18 @@ class Flux2TransformerBlock(nn.Module):
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
-        temb_mod_params_img: tuple[tuple[torch.Tensor, torch.Tensor, torch.Tensor], ...],
-        temb_mod_params_txt: tuple[tuple[torch.Tensor, torch.Tensor, torch.Tensor], ...],
+        temb_mod_img: torch.Tensor,
+        temb_mod_txt: torch.Tensor,
        image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
        joint_attention_kwargs: dict[str, Any] | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        joint_attention_kwargs = joint_attention_kwargs or {}

        # Modulation parameters shape: [1, 1, self.dim]
-        (shift_msa, scale_msa, gate_msa), (shift_mlp, scale_mlp, gate_mlp) = temb_mod_params_img
-        (c_shift_msa, c_scale_msa, c_gate_msa), (c_shift_mlp, c_scale_mlp, c_gate_mlp) = temb_mod_params_txt
+        (shift_msa, scale_msa, gate_msa), (shift_mlp, scale_mlp, gate_mlp) = Flux2Modulation.split(temb_mod_img, 2)
+        (c_shift_msa, c_scale_msa, c_gate_msa), (c_shift_mlp, c_scale_mlp, c_gate_mlp) = Flux2Modulation.split(
+            temb_mod_txt, 2
+        )

        # Img stream
        norm_hidden_states = self.norm1(hidden_states)
@@ -627,15 +629,19 @@ class Flux2Modulation(nn.Module):
        self.linear = nn.Linear(dim, dim * 3 * self.mod_param_sets, bias=bias)
        self.act_fn = nn.SiLU()

-    def forward(self, temb: torch.Tensor) -> tuple[tuple[torch.Tensor, torch.Tensor, torch.Tensor], ...]:
+    def forward(self, temb: torch.Tensor) -> torch.Tensor:
        mod = self.act_fn(temb)
        mod = self.linear(mod)
+        return mod

+    @staticmethod
+    # split inside the transformer blocks, to avoid passing tuples into checkpoints https://github.com/huggingface/diffusers/issues/12776
+    def split(mod: torch.Tensor, mod_param_sets: int) -> tuple[tuple[torch.Tensor, torch.Tensor, torch.Tensor], ...]:
        if mod.ndim == 2:
            mod = mod.unsqueeze(1)
-        mod_params = torch.chunk(mod, 3 * self.mod_param_sets, dim=-1)
+        mod_params = torch.chunk(mod, 3 * mod_param_sets, dim=-1)
        # Return tuple of 3-tuples of modulation params shift/scale/gate
-        return tuple(mod_params[3 * i : 3 * (i + 1)] for i in range(self.mod_param_sets))
+        return tuple(mod_params[3 * i : 3 * (i + 1)] for i in range(mod_param_sets))


 class Flux2Transformer2DModel(
@@ -824,7 +830,7 @@ class Flux2Transformer2DModel(

        double_stream_mod_img = self.double_stream_modulation_img(temb)
        double_stream_mod_txt = self.double_stream_modulation_txt(temb)
-        single_stream_mod = self.single_stream_modulation(temb)[0]
+        single_stream_mod = self.single_stream_modulation(temb)

        # 2. Input projection for image (hidden_states) and conditioning text (encoder_hidden_states)
        hidden_states = self.x_embedder(hidden_states)
@@ -861,8 +867,8 @@ class Flux2Transformer2DModel(
                encoder_hidden_states, hidden_states = block(
                    hidden_states=hidden_states,
                    encoder_hidden_states=encoder_hidden_states,
-                    temb_mod_params_img=double_stream_mod_img,
-                    temb_mod_params_txt=double_stream_mod_txt,
+                    temb_mod_img=double_stream_mod_img,
+                    temb_mod_txt=double_stream_mod_txt,
                    image_rotary_emb=concat_rotary_emb,
                    joint_attention_kwargs=joint_attention_kwargs,
                )
@@ -884,7 +890,7 @@ class Flux2Transformer2DModel(
                hidden_states = block(
                    hidden_states=hidden_states,
                    encoder_hidden_states=None,
-                    temb_mod_params=single_stream_mod,
+                    temb_mod=single_stream_mod,
                    image_rotary_emb=concat_rotary_emb,
                    joint_attention_kwargs=joint_attention_kwargs,
                )
--- a/src/diffusers/models/transformers/transformer_qwenimage.py
+++ b/src/diffusers/models/transformers/transformer_qwenimage.py
@@ -164,7 +164,11 @@ def compute_text_seq_len_from_mask(
    position_ids = torch.arange(text_seq_len, device=encoder_hidden_states.device, dtype=torch.long)
    active_positions = torch.where(encoder_hidden_states_mask, position_ids, position_ids.new_zeros(()))
    has_active = encoder_hidden_states_mask.any(dim=1)
-    per_sample_len = torch.where(has_active, active_positions.max(dim=1).values + 1, torch.as_tensor(text_seq_len))
+    per_sample_len = torch.where(
+        has_active,
+        active_positions.max(dim=1).values + 1,
+        torch.as_tensor(text_seq_len, device=encoder_hidden_states.device),
+    )
    return text_seq_len, per_sample_len, encoder_hidden_states_mask


--- a/src/diffusers/pipelines/pipeline_utils.py
+++ b/src/diffusers/pipelines/pipeline_utils.py
@@ -112,7 +112,7 @@ LIBRARIES = []
 for library in LOADABLE_CLASSES:
    LIBRARIES.append(library)

-SUPPORTED_DEVICE_MAP = ["balanced"] + [get_device()]
+SUPPORTED_DEVICE_MAP = ["balanced"] + [get_device(), "cpu"]

 logger = logging.get_logger(__name__)

@@ -468,8 +468,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
        pipeline_is_sequentially_offloaded = any(
            module_is_sequentially_offloaded(module) for _, module in self.components.items()
        )
-
-        is_pipeline_device_mapped = self.hf_device_map is not None and len(self.hf_device_map) > 1
+        is_pipeline_device_mapped = self._is_pipeline_device_mapped()
        if is_pipeline_device_mapped:
            raise ValueError(
                "It seems like you have activated a device mapping strategy on the pipeline which doesn't allow explicit device placement using `to()`. You can call `reset_device_map()` to remove the existing device map from the pipeline."
@@ -1188,7 +1187,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
        """
        self._maybe_raise_error_if_group_offload_active(raise_error=True)

-        is_pipeline_device_mapped = self.hf_device_map is not None and len(self.hf_device_map) > 1
+        is_pipeline_device_mapped = self._is_pipeline_device_mapped()
        if is_pipeline_device_mapped:
            raise ValueError(
                "It seems like you have activated a device mapping strategy on the pipeline so calling `enable_model_cpu_offload() isn't allowed. You can call `reset_device_map()` first and then call `enable_model_cpu_offload()`."
@@ -1312,7 +1311,7 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
            raise ImportError("`enable_sequential_cpu_offload` requires `accelerate v0.14.0` or higher")
        self.remove_all_hooks()

-        is_pipeline_device_mapped = self.hf_device_map is not None and len(self.hf_device_map) > 1
+        is_pipeline_device_mapped = self._is_pipeline_device_mapped()
        if is_pipeline_device_mapped:
            raise ValueError(
                "It seems like you have activated a device mapping strategy on the pipeline so calling `enable_sequential_cpu_offload() isn't allowed. You can call `reset_device_map()` first and then call `enable_sequential_cpu_offload()`."
@@ -2228,6 +2227,21 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
                return True
        return False

+    def _is_pipeline_device_mapped(self):
+        # We support passing `device_map="cuda"`, for example. This is helpful, in case
+        # users want to pass `device_map="cpu"` when initializing a pipeline. This explicit declaration is desirable
+        # in limited VRAM environments because quantized models often initialize directly on the accelerator.
+        device_map = self.hf_device_map
+        is_device_type_map = False
+        if isinstance(device_map, str):
+            try:
+                torch.device(device_map)
+                is_device_type_map = True
+            except RuntimeError:
+                pass
+
+        return not is_device_type_map and isinstance(device_map, dict) and len(device_map) > 1
+

 class StableDiffusionMixin:
    r"""
--- a/src/diffusers/pipelines/prx/pipeline_prx.py
+++ b/src/diffusers/pipelines/prx/pipeline_prx.py
@@ -18,7 +18,6 @@ import re
 import urllib.parse as ul
 from typing import Callable

-import ftfy
 import torch
 from transformers import (
    AutoTokenizer,
@@ -34,13 +33,13 @@ from diffusers.models.transformers.transformer_prx import PRXTransformer2DModel
 from diffusers.pipelines.pipeline_utils import DiffusionPipeline
 from diffusers.pipelines.prx.pipeline_output import PRXPipelineOutput
 from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
-from diffusers.utils import (
-    logging,
-    replace_example_docstring,
-)
+from diffusers.utils import is_ftfy_available, logging, replace_example_docstring
 from diffusers.utils.torch_utils import randn_tensor


+if is_ftfy_available():
+    import ftfy
+
 DEFAULT_RESOLUTION = 512

 ASPECT_RATIO_256_BIN = {
--- a/src/diffusers/quantizers/gguf/utils.py
+++ b/src/diffusers/quantizers/gguf/utils.py
@@ -516,6 +516,9 @@ def dequantize_gguf_tensor(tensor):

    block_size, type_size = GGML_QUANT_SIZES[quant_type]

+    # Conver to plain tensor to avoid unnecessary __torch_function__ overhead.
+    tensor = tensor.as_tensor()
+
    tensor = tensor.view(torch.uint8)
    shape = _quant_shape_from_byte_shape(tensor.shape, type_size, block_size)

@@ -525,7 +528,7 @@ def dequantize_gguf_tensor(tensor):
    dequant = dequant_fn(blocks, block_size, type_size)
    dequant = dequant.reshape(shape)

-    return dequant.as_tensor()
+    return dequant


 class GGUFParameter(torch.nn.Parameter):
--- a/src/diffusers/schedulers/scheduling_flow_match_lcm.py
+++ b/src/diffusers/schedulers/scheduling_flow_match_lcm.py
@@ -14,6 +14,7 @@

 import math
 from dataclasses import dataclass
+from typing import Literal

 import numpy as np
 import torch
@@ -41,7 +42,7 @@ class FlowMatchLCMSchedulerOutput(BaseOutput):
            denoising loop.
    """

-    prev_sample: torch.FloatTensor
+    prev_sample: torch.Tensor


 class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
@@ -79,11 +80,11 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        use_beta_sigmas (`bool`, defaults to False):
            Whether to use beta sigmas for step sizes in the noise schedule during sampling.
        time_shift_type (`str`, defaults to "exponential"):
-            The type of dynamic resolution-dependent timestep shifting to apply. Either "exponential" or "linear".
-        scale_factors ('list', defaults to None)
+            The type of dynamic resolution-dependent timestep shifting to apply.
+        scale_factors (`list[float]`, *optional*, defaults to `None`):
            It defines how to scale the latents at which predictions are made.
-        upscale_mode ('str', defaults to 'bicubic')
-            Upscaling method, applied if scale-wise generation is considered
+        upscale_mode (`str`, *optional*, defaults to "bicubic"):
+            Upscaling method, applied if scale-wise generation is considered.
    """

    _compatibles = []
@@ -101,16 +102,33 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        max_image_seq_len: int = 4096,
        invert_sigmas: bool = False,
        shift_terminal: float | None = None,
-        use_karras_sigmas: bool = False,
-        use_exponential_sigmas: bool = False,
-        use_beta_sigmas: bool = False,
-        time_shift_type: str = "exponential",
+        use_karras_sigmas: bool | None = False,
+        use_exponential_sigmas: bool | None = False,
+        use_beta_sigmas: bool | None = False,
+        time_shift_type: Literal["exponential", "linear"] = "exponential",
        scale_factors: list[float] | None = None,
-        upscale_mode: str = "bicubic",
+        upscale_mode: Literal[
+            "nearest",
+            "linear",
+            "bilinear",
+            "bicubic",
+            "trilinear",
+            "area",
+            "nearest-exact",
+        ] = "bicubic",
    ):
        if self.config.use_beta_sigmas and not is_scipy_available():
            raise ImportError("Make sure to install scipy if you want to use beta sigmas.")
-        if sum([self.config.use_beta_sigmas, self.config.use_exponential_sigmas, self.config.use_karras_sigmas]) > 1:
+        if (
+            sum(
+                [
+                    self.config.use_beta_sigmas,
+                    self.config.use_exponential_sigmas,
+                    self.config.use_karras_sigmas,
+                ]
+            )
+            > 1
+        ):
            raise ValueError(
                "Only one of `config.use_beta_sigmas`, `config.use_exponential_sigmas`, `config.use_karras_sigmas` can be used."
            )
@@ -162,7 +180,7 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        return self._begin_index

    # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.set_begin_index
-    def set_begin_index(self, begin_index: int = 0):
+    def set_begin_index(self, begin_index: int = 0) -> None:
        """
        Sets the begin index for the scheduler. This function should be run from pipeline before the inference.

@@ -172,18 +190,18 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        """
        self._begin_index = begin_index

-    def set_shift(self, shift: float):
+    def set_shift(self, shift: float) -> None:
        self._shift = shift

-    def set_scale_factors(self, scale_factors: list, upscale_mode):
+    def set_scale_factors(self, scale_factors: list[float], upscale_mode: str) -> None:
        """
        Sets scale factors for a scale-wise generation regime.

        Args:
-            scale_factors (`list`):
-                The scale factors for each step
+            scale_factors (`list[float]`):
+                The scale factors for each step.
            upscale_mode (`str`):
-                Upscaling method
+                Upscaling method.
        """
        self._scale_factors = scale_factors
        self._upscale_mode = upscale_mode
@@ -238,16 +256,18 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):

        return sample

-    def _sigma_to_t(self, sigma):
+    def _sigma_to_t(self, sigma: float | torch.FloatTensor) -> float | torch.FloatTensor:
        return sigma * self.config.num_train_timesteps

-    def time_shift(self, mu: float, sigma: float, t: torch.Tensor):
+    def time_shift(
+        self, mu: float, sigma: float, t: float | np.ndarray | torch.Tensor
+    ) -> float | np.ndarray | torch.Tensor:
        if self.config.time_shift_type == "exponential":
            return self._time_shift_exponential(mu, sigma, t)
        elif self.config.time_shift_type == "linear":
            return self._time_shift_linear(mu, sigma, t)

-    def stretch_shift_to_terminal(self, t: torch.Tensor) -> torch.Tensor:
+    def stretch_shift_to_terminal(self, t: np.ndarray | torch.Tensor) -> np.ndarray | torch.Tensor:
        r"""
        Stretches and shifts the timestep schedule to ensure it terminates at the configured `shift_terminal` config
        value.
@@ -256,12 +276,13 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        https://github.com/Lightricks/LTX-Video/blob/a01a171f8fe3d99dce2728d60a73fecf4d4238ae/ltx_video/schedulers/rf.py#L51

        Args:
-            t (`torch.Tensor`):
-                A tensor of timesteps to be stretched and shifted.
+            t (`torch.Tensor` or `np.ndarray`):
+                A tensor or numpy array of timesteps to be stretched and shifted.

        Returns:
-            `torch.Tensor`:
-                A tensor of adjusted timesteps such that the final value equals `self.config.shift_terminal`.
+            `torch.Tensor` or `np.ndarray`:
+                A tensor or numpy array of adjusted timesteps such that the final value equals
+                `self.config.shift_terminal`.
        """
        one_minus_z = 1 - t
        scale_factor = one_minus_z[-1] / (1 - self.config.shift_terminal)
@@ -270,12 +291,12 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):

    def set_timesteps(
        self,
-        num_inference_steps: int = None,
-        device: str | torch.device = None,
+        num_inference_steps: int | None = None,
+        device: str | torch.device | None = None,
        sigmas: list[float] | None = None,
-        mu: float = None,
+        mu: float | None = None,
        timesteps: list[float] | None = None,
-    ):
+    ) -> None:
        """
        Sets the discrete timesteps used for the diffusion chain (to be run before inference).

@@ -317,43 +338,45 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        is_timesteps_provided = timesteps is not None

        if is_timesteps_provided:
-            timesteps = np.array(timesteps).astype(np.float32)
+            timesteps = np.array(timesteps).astype(np.float32)  # type: ignore

        if sigmas is None:
            if timesteps is None:
-                timesteps = np.linspace(
-                    self._sigma_to_t(self.sigma_max), self._sigma_to_t(self.sigma_min), num_inference_steps
+                timesteps = np.linspace(  # type: ignore
+                    self._sigma_to_t(self.sigma_max),
+                    self._sigma_to_t(self.sigma_min),
+                    num_inference_steps,
                )
-            sigmas = timesteps / self.config.num_train_timesteps
+            sigmas = timesteps / self.config.num_train_timesteps  # type: ignore
        else:
-            sigmas = np.array(sigmas).astype(np.float32)
+            sigmas = np.array(sigmas).astype(np.float32)  # type: ignore
            num_inference_steps = len(sigmas)

        # 2. Perform timestep shifting. Either no shifting is applied, or resolution-dependent shifting of
        #    "exponential" or "linear" type is applied
        if self.config.use_dynamic_shifting:
-            sigmas = self.time_shift(mu, 1.0, sigmas)
+            sigmas = self.time_shift(mu, 1.0, sigmas)  # type: ignore
        else:
-            sigmas = self.shift * sigmas / (1 + (self.shift - 1) * sigmas)
+            sigmas = self.shift * sigmas / (1 + (self.shift - 1) * sigmas)  # type: ignore

        # 3. If required, stretch the sigmas schedule to terminate at the configured `shift_terminal` value
        if self.config.shift_terminal:
-            sigmas = self.stretch_shift_to_terminal(sigmas)
+            sigmas = self.stretch_shift_to_terminal(sigmas)  # type: ignore

        # 4. If required, convert sigmas to one of karras, exponential, or beta sigma schedules
        if self.config.use_karras_sigmas:
-            sigmas = self._convert_to_karras(in_sigmas=sigmas, num_inference_steps=num_inference_steps)
+            sigmas = self._convert_to_karras(in_sigmas=sigmas, num_inference_steps=num_inference_steps)  # type: ignore
        elif self.config.use_exponential_sigmas:
-            sigmas = self._convert_to_exponential(in_sigmas=sigmas, num_inference_steps=num_inference_steps)
+            sigmas = self._convert_to_exponential(in_sigmas=sigmas, num_inference_steps=num_inference_steps)  # type: ignore
        elif self.config.use_beta_sigmas:
-            sigmas = self._convert_to_beta(in_sigmas=sigmas, num_inference_steps=num_inference_steps)
+            sigmas = self._convert_to_beta(in_sigmas=sigmas, num_inference_steps=num_inference_steps)  # type: ignore

        # 5. Convert sigmas and timesteps to tensors and move to specified device
-        sigmas = torch.from_numpy(sigmas).to(dtype=torch.float32, device=device)
+        sigmas = torch.from_numpy(sigmas).to(dtype=torch.float32, device=device)  # type: ignore
        if not is_timesteps_provided:
-            timesteps = sigmas * self.config.num_train_timesteps
+            timesteps = sigmas * self.config.num_train_timesteps  # type: ignore
        else:
-            timesteps = torch.from_numpy(timesteps).to(dtype=torch.float32, device=device)
+            timesteps = torch.from_numpy(timesteps).to(dtype=torch.float32, device=device)  # type: ignore

        # 6. Append the terminal sigma value.
        #    If a model requires inverted sigma schedule for denoising but timesteps without inversion, the
@@ -370,7 +393,11 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        self._step_index = None
        self._begin_index = None

-    def index_for_timestep(self, timestep, schedule_timesteps=None):
+    def index_for_timestep(
+        self,
+        timestep: float | torch.Tensor,
+        schedule_timesteps: torch.Tensor | None = None,
+    ) -> int:
        if schedule_timesteps is None:
            schedule_timesteps = self.timesteps

@@ -382,9 +409,9 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        # case we start in the middle of the denoising schedule (e.g. for image-to-image)
        pos = 1 if len(indices) > 1 else 0

-        return indices[pos].item()
+        return int(indices[pos].item())

-    def _init_step_index(self, timestep):
+    def _init_step_index(self, timestep: float | torch.Tensor) -> None:
        if self.begin_index is None:
            if isinstance(timestep, torch.Tensor):
                timestep = timestep.to(self.timesteps.device)
@@ -459,7 +486,12 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
                size = [round(self._scale_factors[self._step_index] * size) for size in self._init_size]
                x0_pred = torch.nn.functional.interpolate(x0_pred, size=size, mode=self._upscale_mode)

-        noise = randn_tensor(x0_pred.shape, generator=generator, device=x0_pred.device, dtype=x0_pred.dtype)
+        noise = randn_tensor(
+            x0_pred.shape,
+            generator=generator,
+            device=x0_pred.device,
+            dtype=x0_pred.dtype,
+        )
        prev_sample = (1 - sigma_next) * x0_pred + sigma_next * noise

        # upon completion increase step index by one
@@ -473,7 +505,7 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        return FlowMatchLCMSchedulerOutput(prev_sample=prev_sample)

    # Copied from diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler._convert_to_karras
-    def _convert_to_karras(self, in_sigmas: torch.Tensor, num_inference_steps) -> torch.Tensor:
+    def _convert_to_karras(self, in_sigmas: torch.Tensor, num_inference_steps: int) -> torch.Tensor:
        """
        Construct the noise schedule as proposed in [Elucidating the Design Space of Diffusion-Based Generative
        Models](https://huggingface.co/papers/2206.00364).
@@ -594,11 +626,15 @@ class FlowMatchLCMScheduler(SchedulerMixin, ConfigMixin):
        )
        return sigmas

-    def _time_shift_exponential(self, mu, sigma, t):
+    def _time_shift_exponential(
+        self, mu: float, sigma: float, t: float | np.ndarray | torch.Tensor
+    ) -> float | np.ndarray | torch.Tensor:
        return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)

-    def _time_shift_linear(self, mu, sigma, t):
+    def _time_shift_linear(
+        self, mu: float, sigma: float, t: float | np.ndarray | torch.Tensor
+    ) -> float | np.ndarray | torch.Tensor:
        return mu / (mu + (1 / t - 1) ** sigma)

-    def __len__(self):
+    def __len__(self) -> int:
        return self.config.num_train_timesteps
--- a/tests/models/testing_utils/init.py
+++ b/tests/models/testing_utils/init.py
@@ -1,4 +1,4 @@
-from .attention import AttentionTesterMixin
+from .attention import AttentionBackendTesterMixin, AttentionTesterMixin
 from .cache import (
    CacheTesterMixin,
    FasterCacheConfigMixin,
@@ -38,6 +38,7 @@ from .training import TrainingTesterMixin


 __all__ = [
+    "AttentionBackendTesterMixin",
    "AttentionTesterMixin",
    "BaseModelTesterConfig",
    "BitsAndBytesCompileTesterMixin",
--- a/tests/models/testing_utils/attention.py
+++ b/tests/models/testing_utils/attention.py
@@ -14,22 +14,105 @@
 # limitations under the License.

 import gc
+import logging

 import pytest
 import torch

 from diffusers.models.attention import AttentionModuleMixin
-from diffusers.models.attention_processor import (
-    AttnProcessor,
+from diffusers.models.attention_dispatch import AttentionBackendName, _AttentionBackendRegistry, attention_backend
+from diffusers.models.attention_processor import AttnProcessor
+from diffusers.utils import is_kernels_available, is_torch_version
+
+from ...testing_utils import assert_tensors_close, backend_empty_cache, is_attention, torch_device
+
+
+logger = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Module-level backend parameter sets for AttentionBackendTesterMixin
+# ---------------------------------------------------------------------------
+
+_CUDA_AVAILABLE = torch.cuda.is_available()
+_KERNELS_AVAILABLE = is_kernels_available()
+
+_PARAM_NATIVE = pytest.param(AttentionBackendName.NATIVE, id="native")
+
+_PARAM_NATIVE_CUDNN = pytest.param(
+    AttentionBackendName._NATIVE_CUDNN,
+    id="native_cudnn",
+    marks=pytest.mark.skipif(
+        not _CUDA_AVAILABLE,
+        reason="CUDA is required for _native_cudnn backend.",
+    ),
 )

-from ...testing_utils import (
-    assert_tensors_close,
-    backend_empty_cache,
-    is_attention,
-    torch_device,
+_PARAM_FLASH_HUB = pytest.param(
+    AttentionBackendName.FLASH_HUB,
+    id="flash_hub",
+    marks=[
+        pytest.mark.skipif(not _CUDA_AVAILABLE, reason="CUDA is required for flash_hub backend."),
+        pytest.mark.skipif(
+            not _KERNELS_AVAILABLE,
+            reason="`kernels` package is required for flash_hub backend. Install with `pip install kernels`.",
+        ),
+    ],
 )

+_PARAM_FLASH_3_HUB = pytest.param(
+    AttentionBackendName._FLASH_3_HUB,
+    id="flash_3_hub",
+    marks=[
+        pytest.mark.skipif(not _CUDA_AVAILABLE, reason="CUDA is required for _flash_3_hub backend."),
+        pytest.mark.skipif(
+            not _KERNELS_AVAILABLE,
+            reason="`kernels` package is required for _flash_3_hub backend. Install with `pip install kernels`.",
+        ),
+    ],
+)
+
+# All backends under test.
+_ALL_BACKEND_PARAMS = [_PARAM_NATIVE, _PARAM_NATIVE_CUDNN, _PARAM_FLASH_HUB, _PARAM_FLASH_3_HUB]
+
+# Backends that only accept bf16/fp16 inputs; models and inputs must be cast before running them.
+_BF16_REQUIRED_BACKENDS = {
+    AttentionBackendName._NATIVE_CUDNN,
+    AttentionBackendName.FLASH_HUB,
+    AttentionBackendName._FLASH_3_HUB,
+}
+
+# Backends that perform non-deterministic operations and therefore cannot run when
+# torch.use_deterministic_algorithms(True) is active (e.g. after enable_full_determinism()).
+_NON_DETERMINISTIC_BACKENDS = {AttentionBackendName._NATIVE_CUDNN}
+
+
+def _maybe_cast_to_bf16(backend, model, inputs_dict):
+    """Cast model and floating-point inputs to bfloat16 when the backend requires it."""
+    if backend not in _BF16_REQUIRED_BACKENDS:
+        return model, inputs_dict
+    model = model.to(dtype=torch.bfloat16)
+    inputs_dict = {
+        k: v.to(dtype=torch.bfloat16) if isinstance(v, torch.Tensor) and v.is_floating_point() else v
+        for k, v in inputs_dict.items()
+    }
+    return model, inputs_dict
+
+
+def _skip_if_backend_requires_nondeterminism(backend):
+    """Skip at runtime when torch.use_deterministic_algorithms(True) blocks the backend.
+
+    This check is intentionally deferred to test execution time because
+    enable_full_determinism() is typically called at module level in test files *after*
+    the module-level pytest.param() objects in this file have already been evaluated,
+    making it impossible to catch via a collection-time skipif condition.
+    """
+    if backend in _NON_DETERMINISTIC_BACKENDS and torch.are_deterministic_algorithms_enabled():
+        pytest.skip(
+            f"Backend '{backend.value}' performs non-deterministic operations and cannot run "
+            f"while `torch.use_deterministic_algorithms(True)` is active."
+        )
+

@is_attention
 class AttentionTesterMixin:
@@ -39,7 +122,6 @@ class AttentionTesterMixin:
    Tests functionality from AttentionModuleMixin including:
        - Attention processor management (set/get)
        - QKV projection fusion/unfusion
-        - Attention backends (XFormers, NPU, etc.)

    Expected from config mixin:
        - model_class: The model class to test
@@ -179,3 +261,208 @@ class AttentionTesterMixin:
            model.set_attn_processor(wrong_processors)

        assert "number of processors" in str(exc_info.value).lower(), "Error should mention processor count mismatch"
+
+
+@is_attention
+class AttentionBackendTesterMixin:
+    """
+    Mixin class for testing attention backends on models. Following things are tested:
+
+    1. Backends can be set with the `attention_backend` context manager and with
+    `set_attention_backend()` method.
+    2. SDPA outputs don't deviate too much from backend outputs.
+    3. Backend works with (regional) compilation.
+    4. Backends can be restored.
+
+    Tests the backends using the model provided by the host test class. The backends to test
+    are defined in `_ALL_BACKEND_PARAMS`.
+
+    Expected from the host test class:
+        - model_class: The model class to instantiate.
+
+    Expected methods from the host test class:
+        - get_init_dict(): Returns dict of kwargs to construct the model.
+        - get_dummy_inputs(): Returns dict of inputs for the model's forward pass.
+
+    Pytest mark: attention
+        Use `pytest -m "not attention"` to skip these tests.
+    """
+
+    # -----------------------------------------------------------------------
+    # Tolerance attributes — override in host class to loosen/tighten checks.
+    # -----------------------------------------------------------------------
+
+    # test_output_close_to_native: alternate backends (flash, cuDNN) may
+    # accumulate small numerical errors vs the reference PyTorch SDPA kernel.
+    backend_vs_native_atol: float = 1e-2
+    backend_vs_native_rtol: float = 1e-2
+
+    # test_compile: regional compilation introduces the same kind of numerical
+    # error as the non-compiled backend path, so the same loose tolerance applies.
+    compile_vs_native_atol: float = 1e-2
+    compile_vs_native_rtol: float = 1e-2
+
+    def setup_method(self):
+        gc.collect()
+        backend_empty_cache(torch_device)
+
+    def teardown_method(self):
+        gc.collect()
+        backend_empty_cache(torch_device)
+
+    @torch.no_grad()
+    @pytest.mark.parametrize("backend", _ALL_BACKEND_PARAMS)
+    def test_set_attention_backend_matches_context_manager(self, backend):
+        """set_attention_backend() and the attention_backend() context manager must yield identical outputs."""
+        _skip_if_backend_requires_nondeterminism(backend)
+
+        init_dict = self.get_init_dict()
+        inputs_dict = self.get_dummy_inputs()
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+        model.eval()
+
+        model, inputs_dict = _maybe_cast_to_bf16(backend, model, inputs_dict)
+
+        with attention_backend(backend):
+            ctx_output = model(**inputs_dict, return_dict=False)[0]
+
+        initial_registry_backend, _ = _AttentionBackendRegistry.get_active_backend()
+
+        try:
+            model.set_attention_backend(backend.value)
+        except Exception as e:
+            logger.warning("Skipping test for backend '%s': %s", backend.value, e)
+            pytest.skip(str(e))
+
+        try:
+            set_output = model(**inputs_dict, return_dict=False)[0]
+        finally:
+            model.reset_attention_backend()
+            _AttentionBackendRegistry.set_active_backend(initial_registry_backend)
+
+        assert_tensors_close(
+            set_output,
+            ctx_output,
+            atol=0,
+            rtol=0,
+            msg=(
+                f"Output from model.set_attention_backend('{backend.value}') should be identical "
+                f"to the output from `with attention_backend('{backend.value}'):`."
+            ),
+        )
+
+    @torch.no_grad()
+    @pytest.mark.parametrize("backend", _ALL_BACKEND_PARAMS)
+    def test_output_close_to_native(self, backend):
+        """All backends should produce model output numerically close to the native SDPA reference."""
+        _skip_if_backend_requires_nondeterminism(backend)
+
+        init_dict = self.get_init_dict()
+        inputs_dict = self.get_dummy_inputs()
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+        model.eval()
+
+        model, inputs_dict = _maybe_cast_to_bf16(backend, model, inputs_dict)
+
+        with attention_backend(AttentionBackendName.NATIVE):
+            native_output = model(**inputs_dict, return_dict=False)[0]
+
+        initial_registry_backend, _ = _AttentionBackendRegistry.get_active_backend()
+
+        try:
+            model.set_attention_backend(backend.value)
+        except Exception as e:
+            logger.warning("Skipping test for backend '%s': %s", backend.value, e)
+            pytest.skip(str(e))
+
+        try:
+            backend_output = model(**inputs_dict, return_dict=False)[0]
+        finally:
+            model.reset_attention_backend()
+            _AttentionBackendRegistry.set_active_backend(initial_registry_backend)
+
+        assert_tensors_close(
+            backend_output,
+            native_output,
+            atol=self.backend_vs_native_atol,
+            rtol=self.backend_vs_native_rtol,
+            msg=f"Output from {backend} should be numerically close to native SDPA.",
+        )
+
+    @pytest.mark.parametrize("backend", _ALL_BACKEND_PARAMS)
+    def test_context_manager_switches_and_restores_backend(self, backend):
+        """attention_backend() should activate the requested backend and restore the previous one on exit."""
+        initial_backend, _ = _AttentionBackendRegistry.get_active_backend()
+
+        with attention_backend(backend):
+            active_backend, _ = _AttentionBackendRegistry.get_active_backend()
+            assert active_backend == backend, (
+                f"Backend should be {backend} inside the context manager, got {active_backend}."
+            )
+
+        restored_backend, _ = _AttentionBackendRegistry.get_active_backend()
+        assert restored_backend == initial_backend, (
+            f"Backend should be restored to {initial_backend} after exiting the context manager, "
+            f"got {restored_backend}."
+        )
+
+    @pytest.mark.parametrize("backend", _ALL_BACKEND_PARAMS)
+    def test_compile(self, backend):
+        """
+        `torch.compile` tests checking for recompilation, graph breaks, forward can run, etc.
+        For speed, we use regional compilation here (`model.compile_repeated_blocks()`
+        as opposed to `model.compile`).
+        """
+        _skip_if_backend_requires_nondeterminism(backend)
+        if getattr(self.model_class, "_repeated_blocks", None) is None:
+            pytest.skip("Skipping tests as regional compilation is not supported.")
+
+        if backend == AttentionBackendName.NATIVE and not is_torch_version(">=", "2.9.0"):
+            pytest.xfail(
+                "test_compile with the native backend requires torch >= 2.9.0 for stable "
+                "fullgraph compilation with error_on_recompile=True."
+            )
+
+        init_dict = self.get_init_dict()
+        inputs_dict = self.get_dummy_inputs()
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+        model.eval()
+
+        model, inputs_dict = _maybe_cast_to_bf16(backend, model, inputs_dict)
+
+        with torch.no_grad(), attention_backend(AttentionBackendName.NATIVE):
+            native_output = model(**inputs_dict, return_dict=False)[0]
+
+        initial_registry_backend, _ = _AttentionBackendRegistry.get_active_backend()
+
+        try:
+            model.set_attention_backend(backend.value)
+        except Exception as e:
+            logger.warning("Skipping test for backend '%s': %s", backend.value, e)
+            pytest.skip(str(e))
+
+        try:
+            model.compile_repeated_blocks(fullgraph=True)
+            torch.compiler.reset()
+
+            with (
+                torch._inductor.utils.fresh_inductor_cache(),
+                torch._dynamo.config.patch(error_on_recompile=True),
+            ):
+                with torch.no_grad():
+                    compile_output = model(**inputs_dict, return_dict=False)[0]
+                    model(**inputs_dict, return_dict=False)
+        finally:
+            model.reset_attention_backend()
+            _AttentionBackendRegistry.set_active_backend(initial_registry_backend)
+
+        assert_tensors_close(
+            compile_output,
+            native_output,
+            atol=self.compile_vs_native_atol,
+            rtol=self.compile_vs_native_rtol,
+            msg=f"Compiled output with backend '{backend.value}' should be numerically close to eager native SDPA.",
+        )
--- a/tests/models/testing_utils/lora.py
+++ b/tests/models/testing_utils/lora.py
@@ -375,7 +375,7 @@ class LoraHotSwappingForModelTesterMixin:
            # additionally check if dynamic compilation works.
            if different_shapes is not None:
                for height, width in different_shapes:
-                    new_inputs_dict = self.prepare_dummy_input(height=height, width=width)
+                    new_inputs_dict = self.get_dummy_inputs(height=height, width=width)
                    _ = model(**new_inputs_dict)
            else:
                output0_after = model(**inputs_dict)["sample"]
@@ -390,7 +390,7 @@ class LoraHotSwappingForModelTesterMixin:
        with torch.inference_mode():
            if different_shapes is not None:
                for height, width in different_shapes:
-                    new_inputs_dict = self.prepare_dummy_input(height=height, width=width)
+                    new_inputs_dict = self.get_dummy_inputs(height=height, width=width)
                    _ = model(**new_inputs_dict)
            else:
                output1_after = model(**inputs_dict)["sample"]
--- a/tests/models/testing_utils/quantization.py
+++ b/tests/models/testing_utils/quantization.py
@@ -628,6 +628,21 @@ class BitsAndBytesTesterMixin(BitsAndBytesConfigMixin, QuantizationTesterMixin):
        """Test that quantized models can be used for training with adapters."""
        self._test_quantization_training(BitsAndBytesConfigMixin.BNB_CONFIGS["4bit_nf4"])

+    @pytest.mark.parametrize(
+        "config_name",
+        list(BitsAndBytesConfigMixin.BNB_CONFIGS.keys()),
+        ids=list(BitsAndBytesConfigMixin.BNB_CONFIGS.keys()),
+    )
+    def test_cpu_device_map(self, config_name):
+        config_kwargs = BitsAndBytesConfigMixin.BNB_CONFIGS[config_name]
+        model_quantized = self._create_quantized_model(config_kwargs, device_map="cpu")
+
+        assert hasattr(model_quantized, "hf_device_map"), "Model should have hf_device_map attribute"
+        assert model_quantized.hf_device_map is not None, "hf_device_map should not be None"
+        assert model_quantized.device == torch.device("cpu"), (
+            f"Model should be on CPU, but is on {model_quantized.device}"
+        )
+

@is_quantization
@is_quanto
--- a/tests/models/transformers/test_models_transformer_flux.py
+++ b/tests/models/transformers/test_models_transformer_flux.py
@@ -25,6 +25,7 @@ from diffusers.utils.torch_utils import randn_tensor

 from ...testing_utils import enable_full_determinism, torch_device
 from ..testing_utils import (
+    AttentionBackendTesterMixin,
    AttentionTesterMixin,
    BaseModelTesterConfig,
    BitsAndBytesCompileTesterMixin,
@@ -224,6 +225,10 @@ class TestFluxTransformerAttention(FluxTransformerTesterConfig, AttentionTesterM
    """Attention processor tests for Flux Transformer."""


+class TestFluxTransformerAttentionBackend(FluxTransformerTesterConfig, AttentionBackendTesterMixin):
+    """Attention backend tests for Flux Transformer."""
+
+
 class TestFluxTransformerContextParallel(FluxTransformerTesterConfig, ContextParallelTesterMixin):
    """Context Parallel inference tests for Flux Transformer"""

--- a/tests/others/test_attention_backends.py
+++ b/tests/others/test_attention_backends.py
@@ -1,163 +0,0 @@
-"""
-This test suite exists for the maintainers currently. It's not run in our CI at the moment.
-
-Once attention backends become more mature, we can consider including this in our CI.
-
-To run this test suite:
-
-```bash
-export RUN_ATTENTION_BACKEND_TESTS=yes
-
-pytest tests/others/test_attention_backends.py
-```
-
-Tests were conducted on an H100 with PyTorch 2.8.0 (CUDA 12.9). Slices for the compilation tests in
-"native" variants were obtained with a torch nightly version (2.10.0.dev20250924+cu128).
-
-Tests for aiter backend were conducted and slices for the aiter backend tests collected on a MI355X
-with torch 2025-09-25 nightly version (ad2f7315ca66b42497047bb7951f696b50f1e81b) and
-aiter 0.1.5.post4.dev20+ga25e55e79.
-"""
-
-import os
-
-import pytest
-import torch
-
-
-pytestmark = pytest.mark.skipif(
-    os.getenv("RUN_ATTENTION_BACKEND_TESTS", "false") == "false", reason="Feature not mature enough."
-)
-from diffusers import FluxPipeline  # noqa: E402
-from diffusers.utils import is_torch_version  # noqa: E402
-
-
-# fmt: off
-FORWARD_CASES = [
-    (
-        "flash_hub",
-        torch.tensor([0.0820, 0.0859, 0.0918, 0.1016, 0.0957, 0.0996, 0.0996, 0.1016, 0.2188, 0.2266, 0.2363, 0.2500, 0.2539, 0.2461, 0.2422, 0.2695], dtype=torch.bfloat16)
-    ),
-    (
-        "_flash_3_hub",
-        torch.tensor([0.0820, 0.0859, 0.0938, 0.1016, 0.0977, 0.0996, 0.1016, 0.1016, 0.2188, 0.2246, 0.2344, 0.2480, 0.2539, 0.2480, 0.2441, 0.2715], dtype=torch.bfloat16),
-    ),
-    (
-        "native",
-        torch.tensor([0.0820, 0.0859, 0.0938, 0.1016, 0.0957, 0.0996, 0.0996, 0.1016, 0.2188, 0.2266, 0.2363, 0.2500, 0.2539, 0.2480, 0.2461, 0.2734], dtype=torch.bfloat16)
-        ),
-    (
-        "_native_cudnn",
-        torch.tensor([0.0781, 0.0840, 0.0879, 0.0957, 0.0898, 0.0957, 0.0957, 0.0977, 0.2168, 0.2246, 0.2324, 0.2500, 0.2539, 0.2480, 0.2441, 0.2695], dtype=torch.bfloat16),
-    ),
-    (
-        "aiter",
-        torch.tensor([0.0781, 0.0820, 0.0879, 0.0957, 0.0898, 0.0938, 0.0957, 0.0957, 0.2285, 0.2363, 0.2461, 0.2637, 0.2695, 0.2617, 0.2617, 0.2891], dtype=torch.bfloat16),
-    )
-]
-
-COMPILE_CASES = [
-    (
-        "flash_hub",
-        torch.tensor([0.0410, 0.0410, 0.0449, 0.0508, 0.0488, 0.0586, 0.0605, 0.0586, 0.2324, 0.2422, 0.2539, 0.2734, 0.2832, 0.2812, 0.2773, 0.3047], dtype=torch.bfloat16),
-        True
-    ),
-    (
-        "_flash_3_hub",
-        torch.tensor([0.0410, 0.0410, 0.0449, 0.0508, 0.0508, 0.0605, 0.0625, 0.0605, 0.2344, 0.2461, 0.2578, 0.2734, 0.2852, 0.2812, 0.2773, 0.3047], dtype=torch.bfloat16),
-        True,
-    ),
-    (
-        "native",
-        torch.tensor([0.0410, 0.0410, 0.0449, 0.0508, 0.0508, 0.0605, 0.0605, 0.0605, 0.2344, 0.2461, 0.2578, 0.2773, 0.2871, 0.2832, 0.2773, 0.3066], dtype=torch.bfloat16),
-        True,
-    ),
-    (
-        "_native_cudnn",
-        torch.tensor([0.0410, 0.0410, 0.0430, 0.0508, 0.0488, 0.0586, 0.0605, 0.0586, 0.2344, 0.2461, 0.2578, 0.2773, 0.2871, 0.2832, 0.2793, 0.3086], dtype=torch.bfloat16),
-        True,
-    ),
-    (
-        "aiter",
-        torch.tensor([0.0391, 0.0391, 0.0430, 0.0488, 0.0469, 0.0566, 0.0586, 0.0566, 0.2402, 0.2539, 0.2637, 0.2812, 0.2930, 0.2910, 0.2891, 0.3164], dtype=torch.bfloat16),
-        True,
-    )
-]
-# fmt: on
-
-INFER_KW = {
-    "prompt": "dance doggo dance",
-    "height": 256,
-    "width": 256,
-    "num_inference_steps": 2,
-    "guidance_scale": 3.5,
-    "max_sequence_length": 128,
-    "output_type": "pt",
-}
-
-
-def _backend_is_probably_supported(pipe, name: str):
-    try:
-        pipe.transformer.set_attention_backend(name)
-        return pipe, True
-    except Exception:
-        return False
-
-
-def _check_if_slices_match(output, expected_slice):
-    img = output.images.detach().cpu()
-    generated_slice = img.flatten()
-    generated_slice = torch.cat([generated_slice[:8], generated_slice[-8:]])
-    assert torch.allclose(generated_slice, expected_slice, atol=1e-4)
-
-
-@pytest.fixture(scope="session")
-def device():
-    if not torch.cuda.is_available():
-        pytest.skip("CUDA is required for these tests.")
-    return torch.device("cuda:0")
-
-
-@pytest.fixture(scope="session")
-def pipe(device):
-    repo_id = "black-forest-labs/FLUX.1-dev"
-    pipe = FluxPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16).to(device)
-    pipe.set_progress_bar_config(disable=True)
-    return pipe
-
-
-@pytest.mark.parametrize("backend_name,expected_slice", FORWARD_CASES, ids=[c[0] for c in FORWARD_CASES])
-def test_forward(pipe, backend_name, expected_slice):
-    out = _backend_is_probably_supported(pipe, backend_name)
-    if isinstance(out, bool):
-        pytest.xfail(f"Backend '{backend_name}' not supported in this environment.")
-
-    modified_pipe = out[0]
-    out = modified_pipe(**INFER_KW, generator=torch.manual_seed(0))
-    _check_if_slices_match(out, expected_slice)
-
-
-@pytest.mark.parametrize(
-    "backend_name,expected_slice,error_on_recompile",
-    COMPILE_CASES,
-    ids=[c[0] for c in COMPILE_CASES],
-)
-def test_forward_with_compile(pipe, backend_name, expected_slice, error_on_recompile):
-    if "native" in backend_name and error_on_recompile and not is_torch_version(">=", "2.9.0"):
-        pytest.xfail(f"Test with {backend_name=} is compatible with a higher version of torch.")
-
-    out = _backend_is_probably_supported(pipe, backend_name)
-    if isinstance(out, bool):
-        pytest.xfail(f"Backend '{backend_name}' not supported in this environment.")
-
-    modified_pipe = out[0]
-    modified_pipe.transformer.compile(fullgraph=True)
-
-    torch.compiler.reset()
-    with (
-        torch._inductor.utils.fresh_inductor_cache(),
-        torch._dynamo.config.patch(error_on_recompile=error_on_recompile),
-    ):
-        out = modified_pipe(**INFER_KW, generator=torch.manual_seed(0))
-
-    _check_if_slices_match(out, expected_slice)
--- a/tests/pipelines/allegro/test_allegro.py
+++ b/tests/pipelines/allegro/test_allegro.py
@@ -158,6 +158,10 @@ class AllegroPipelineFastTests(PipelineTesterMixin, PyramidAttentionBroadcastTes
    def test_save_load_optional_components(self):
        pass

+    @unittest.skip("Decoding without tiling is not yet implemented")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
+
    def test_inference(self):
        device = "cpu"

--- a/tests/pipelines/kandinsky/test_kandinsky_combined.py
+++ b/tests/pipelines/kandinsky/test_kandinsky_combined.py
@@ -34,9 +34,7 @@ enable_full_determinism()

 class KandinskyPipelineCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = KandinskyCombinedPipeline
-    params = [
-        "prompt",
-    ]
+    params = ["prompt"]
    batch_params = ["prompt", "negative_prompt"]
    required_optional_params = [
        "generator",
@@ -148,6 +146,10 @@ class KandinskyPipelineCombinedFastTests(PipelineTesterMixin, unittest.TestCase)
    def test_dict_tuple_outputs_equivalent(self):
        super().test_dict_tuple_outputs_equivalent(expected_max_difference=5e-4)

+    @unittest.skip("Test not supported.")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
+

 class KandinskyPipelineImg2ImgCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = KandinskyImg2ImgCombinedPipeline
@@ -264,6 +266,10 @@ class KandinskyPipelineImg2ImgCombinedFastTests(PipelineTesterMixin, unittest.Te
    def test_save_load_optional_components(self):
        super().test_save_load_optional_components(expected_max_difference=5e-4)

+    @unittest.skip("Test not supported.")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
+

 class KandinskyPipelineInpaintCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = KandinskyInpaintCombinedPipeline
@@ -384,3 +390,7 @@ class KandinskyPipelineInpaintCombinedFastTests(PipelineTesterMixin, unittest.Te

    def test_save_load_local(self):
        super().test_save_load_local(expected_max_difference=5e-3)
+
+    @unittest.skip("Test not supported.")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
--- a/tests/pipelines/kandinsky2_2/test_kandinsky_combined.py
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_combined.py
@@ -36,9 +36,7 @@ enable_full_determinism()

 class KandinskyV22PipelineCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = KandinskyV22CombinedPipeline
-    params = [
-        "prompt",
-    ]
+    params = ["prompt"]
    batch_params = ["prompt", "negative_prompt"]
    required_optional_params = [
        "generator",
@@ -70,12 +68,7 @@ class KandinskyV22PipelineCombinedFastTests(PipelineTesterMixin, unittest.TestCa
    def get_dummy_inputs(self, device, seed=0):
        prior_dummy = PriorDummies()
        inputs = prior_dummy.get_dummy_inputs(device=device, seed=seed)
-        inputs.update(
-            {
-                "height": 64,
-                "width": 64,
-            }
-        )
+        inputs.update({"height": 64, "width": 64})
        return inputs

    def test_kandinsky(self):
@@ -155,12 +148,18 @@ class KandinskyV22PipelineCombinedFastTests(PipelineTesterMixin, unittest.TestCa
    def test_save_load_optional_components(self):
        super().test_save_load_optional_components(expected_max_difference=5e-3)

+    @unittest.skip("Test not supported.")
    def test_callback_inputs(self):
        pass

+    @unittest.skip("Test not supported.")
    def test_callback_cfg(self):
        pass

+    @unittest.skip("Test not supported.")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
+

 class KandinskyV22PipelineImg2ImgCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = KandinskyV22Img2ImgCombinedPipeline
@@ -279,12 +278,18 @@ class KandinskyV22PipelineImg2ImgCombinedFastTests(PipelineTesterMixin, unittest
    def save_load_local(self):
        super().test_save_load_local(expected_max_difference=5e-3)

+    @unittest.skip("Test not supported.")
    def test_callback_inputs(self):
        pass

+    @unittest.skip("Test not supported.")
    def test_callback_cfg(self):
        pass

+    @unittest.skip("Test not supported.")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
+

 class KandinskyV22PipelineInpaintCombinedFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = KandinskyV22InpaintCombinedPipeline
@@ -411,3 +416,7 @@ class KandinskyV22PipelineInpaintCombinedFastTests(PipelineTesterMixin, unittest

    def test_callback_cfg(self):
        pass
+
+    @unittest.skip("`device_map` is not yet supported for connected pipelines.")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
--- a/tests/pipelines/kandinsky2_2/test_kandinsky_inpaint.py
+++ b/tests/pipelines/kandinsky2_2/test_kandinsky_inpaint.py
@@ -296,6 +296,9 @@ class KandinskyV22InpaintPipelineFastTests(PipelineTesterMixin, unittest.TestCas
        output = pipe(**inputs)[0]
        assert output.abs().sum() == 0

+    def test_pipeline_with_accelerator_device_map(self):
+        super().test_pipeline_with_accelerator_device_map(expected_max_difference=5e-3)
+

@slow
@require_torch_accelerator
--- a/tests/pipelines/kandinsky3/test_kandinsky3_img2img.py
+++ b/tests/pipelines/kandinsky3/test_kandinsky3_img2img.py
@@ -194,6 +194,9 @@ class Kandinsky3Img2ImgPipelineFastTests(PipelineTesterMixin, unittest.TestCase)
    def test_save_load_dduf(self):
        super().test_save_load_dduf(atol=1e-3, rtol=1e-3)

+    def test_pipeline_with_accelerator_device_map(self):
+        super().test_pipeline_with_accelerator_device_map(expected_max_difference=5e-3)
+

@slow
@require_torch_accelerator
--- a/tests/pipelines/prx/test_pipeline_prx.py
+++ b/tests/pipelines/prx/test_pipeline_prx.py
@@ -1,7 +1,6 @@
 import unittest

 import numpy as np
-import pytest
 import torch
 from transformers import AutoTokenizer
 from transformers.models.t5gemma.configuration_t5gemma import T5GemmaConfig, T5GemmaModuleConfig
@@ -11,17 +10,11 @@ from diffusers.models import AutoencoderDC, AutoencoderKL
 from diffusers.models.transformers.transformer_prx import PRXTransformer2DModel
 from diffusers.pipelines.prx.pipeline_prx import PRXPipeline
 from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
-from diffusers.utils import is_transformers_version

 from ..pipeline_params import TEXT_TO_IMAGE_PARAMS
 from ..test_pipelines_common import PipelineTesterMixin


-@pytest.mark.xfail(
-    condition=is_transformers_version(">", "4.57.1"),
-    reason="See https://github.com/huggingface/diffusers/pull/12456#issuecomment-3424228544",
-    strict=False,
-)
 class PRXPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = PRXPipeline
    params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
--- a/tests/pipelines/test_pipelines_common.py
+++ b/tests/pipelines/test_pipelines_common.py
@@ -2355,7 +2355,6 @@ class PipelineTesterMixin:
                    f"Component '{name}' has dtype {component.dtype} but expected {expected_dtype}",
                )

-    @require_torch_accelerator
    def test_pipeline_with_accelerator_device_map(self, expected_max_difference=1e-4):
        components = self.get_dummy_components()
        pipe = self.pipeline_class(**components)
--- a/tests/pipelines/visualcloze/test_pipeline_visualcloze_combined.py
+++ b/tests/pipelines/visualcloze/test_pipeline_visualcloze_combined.py
@@ -342,3 +342,7 @@ class VisualClozePipelineFastTests(unittest.TestCase, PipelineTesterMixin):
        self.assertLess(
            max_diff, expected_max_diff, "The output of the fp16 pipeline changed after saving and loading."
        )
+
+    @unittest.skip("Test not supported.")
+    def test_pipeline_with_accelerator_device_map(self):
+        pass
--- a/tests/pipelines/visualcloze/test_pipeline_visualcloze_generation.py
+++ b/tests/pipelines/visualcloze/test_pipeline_visualcloze_generation.py
@@ -310,3 +310,7 @@ class VisualClozeGenerationPipelineFastTests(unittest.TestCase, PipelineTesterMi
    @unittest.skip("Skipped due to missing layout_prompt. Needs further investigation.")
    def test_encode_prompt_works_in_isolation(self, extra_required_param_value_dict=None, atol=0.0001, rtol=0.0001):
        pass
+
+    @unittest.skip("Needs to be revisited later.")
+    def test_pipeline_with_accelerator_device_map(self, expected_max_difference=0.0001):
+        pass
Author	SHA1	Message	Date
sayakpaul	7220687151	remove existing tests/others/test_attention_backends.py file	2026-02-23 11:31:37 +05:30
sayakpaul	73b159f2a1	add attention backend tests.	2026-02-23 11:30:28 +05:30
Álvaro Somoza	a80b19218b	Support Flux Klein peft (fal) lora format (#13169 ) peft (fal) lora format	2026-02-21 10:31:18 +05:30
Animesh Jain	01de02e8b4	[gguf][torch.compile time] Convert to plain tensor earlier in dequantize_gguf_tensor (#13166 ) [gguf] Convert to plain tensor earlier in dequantize_gguf_tensor Once dequantize_gguf_tensor fetches the quant_type attributed from the GGUFParamter tensor subclass, there is no further need of running the actual dequantize operations on the Tensor subclass, we can just convert to plain tensor right away. This not only makes PyTorch eager faster, but reduces torch.compile tracer compile time from 36 seconds to 10 seconds, because there is lot less code to trace now.	2026-02-20 09:31:52 +05:30
Dhruv Nair	db2d7e7bc4	[CI] Fix new LoRAHotswap tests (#13163 ) update Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2026-02-20 09:01:20 +05:30
Sayak Paul	f8d3db9ca7	remove deps related to test from ci (#13164 )	2026-02-20 08:35:35 +05:30
Sayak Paul	99daaa802d	[core] Enable CP for kernels-based attention backends (#12812 ) * up * up * up * up --------- Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>	2026-02-19 18:16:50 +05:30
dg845	fe78a7b7c6	Fix `ftfy` import for PRX Pipeline (#13154 ) * Guard ftfy import with is_ftfy_available * Remove xfail for PRX pipeline tests as they appear to work on transformers>4.57.1 * make style and make quality	2026-02-18 20:44:33 -08:00
dg845	53e1d0e458	[CI] Revert `setuptools` CI Fix as the Failing Pipelines are Deprecated (#13149 ) * Pin setuptools version for dependencies which explicitly depend on pkg_resources * Revert setuptools pin as k-diffusion pipelines are now deprecated --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2026-02-18 20:34:00 -08:00
dxqb	a577ec36df	Flux2: Tensor tuples can cause issues for checkpointing (#12777 ) * split tensors inside the transformer blocks to avoid checkpointing issues * clean up, fix type hints * fix merge error * Apply style fixes --------- Co-authored-by: s <you@example.com> Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2026-02-18 17:03:22 -08:00
Steven Liu	6875490c3b	[docs] add docs for qwenimagelayered (#13158 ) * add example * feedback	2026-02-18 11:02:15 -08:00
David El Malih	64734b2115	docs: improve docstring scheduling_flow_match_lcm.py (#13160 ) Improve docstring scheduling flow match lcm	2026-02-18 10:52:02 -08:00
Dhruv Nair	f81e653197	[CI] Add ftfy as a test dependency (#13155 ) * update * update * update * update * update * update	2026-02-18 22:51:10 +05:30
zhangtao0408	bcbbded7c3	[Bug] Fix QwenImageEditPlus Series on NPU (#13017 ) * [Bug Fix][Qwen-Image-Edit] Fix Qwen-Image-Edit series on NPU * Enhance NPU attention handling by converting attention mask to boolean and refining mask checks. * Refine attention mask handling in NPU attention function to improve validation and conversion logic. * Clean Code * Refine attention mask processing in NPU attention functions to enhance performance and validation. * Remove item() ops on npu fa backend. * Reuse NPU attention mask by `_maybe_modify_attn_mask_npu` * Apply style fixes * Update src/diffusers/models/attention_dispatch.py --------- Co-authored-by: zhangtao <zhangtao529@huawei.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>	2026-02-17 09:10:40 +05:30
Sayak Paul	35086ac06a	[core] support device type device_maps to work with offloading. (#12811 ) * support device type device_maps to work with offloading. * add tests. * fix tests * skip tests where it's not supported. * empty * up * up * fix allegro.	2026-02-16 16:31:45 +05:30