fix

refactor autoencoderdc tests
Fix Ulysses SP backward with SDPA (#13328)
2026-04-01 05:16:39 +08:00 · 2026-03-30 16:05:22 +05:30 · 2026-03-30 15:30:18 +05:30 · 2026-03-30 15:15:27 +05:30 · 2026-03-29 23:13:52 -07:00 · 2026-03-30 11:09:14 +05:30
21 changed files with 1151 additions and 1046 deletions
--- a/.ai/AGENTS.md
+++ b/.ai/AGENTS.md
@@ -10,24 +10,34 @@ Strive to write code as simple and explicit as possible.

 ---

-### Dependencies
- No new mandatory dependency without discussion (e.g. `einops`)
- Optional deps guarded with `is_X_available()` and a dummy in `utils/dummy_*.py`
-
 ## Code formatting
+
 - `make style` and `make fix-copies` should be run as the final step before opening a PR

 ### Copied Code
+
 - Many classes are kept in sync with a source via a `# Copied from ...` header comment
 - Do not edit a `# Copied from` block directly — run `make fix-copies` to propagate changes from the source
 - Remove the header to intentionally break the link

 ### Models
- All layer calls should be visible directly in `forward` — avoid helper functions that hide `nn.Module` calls.
- Avoid graph breaks for `torch.compile` compatibility — do not insert NumPy operations in forward implementations and any other patterns that can break `torch.compile` compatibility with `fullgraph=True`.
- See the **model-integration** skill for the attention pattern, pipeline rules, test setup instructions, and other important details.
+
+- See [models.md](models.md) for model conventions, attention pattern, implementation rules, dependencies, and gotchas.
+- See the [model-integration](./skills/model-integration/SKILL.md) skill for the full integration workflow, file structure, test setup, and other details.
+
+### Pipelines & Schedulers
+
+- Pipelines inherit from `DiffusionPipeline`
+- Schedulers use `SchedulerMixin` with `ConfigMixin`
+- Use `@torch.no_grad()` on pipeline `__call__`
+- Support `output_type="latent"` for skipping VAE decode
+- Support `generator` parameter for reproducibility
+- Use `self.progress_bar(timesteps)` for progress tracking
+- Don't subclass an existing pipeline for a variant — DO NOT use an existing pipeline class (e.g., `FluxPipeline`) to override another pipeline (e.g., `FluxImg2ImgPipeline`) which will be a part of the core codebase (`src`)

 ## Skills

-Task-specific guides live in `.ai/skills/` and are loaded on demand by AI agents.
-Available skills: **model-integration** (adding/converting pipelines), **parity-testing** (debugging numerical parity).
+Task-specific guides live in `.ai/skills/` and are loaded on demand by AI agents. Available skills include:
+
+- [model-integration](./skills/model-integration/SKILL.md) (adding/converting pipelines)
+- [parity-testing](./skills/parity-testing/SKILL.md) (debugging numerical parity).
--- a/.ai/models.md
+++ b/.ai/models.md
@@ -0,0 +1,76 @@
+# Model conventions and rules
+
+Shared reference for model-related conventions, patterns, and gotchas.
+Linked from `AGENTS.md`, `skills/model-integration/SKILL.md`, and `review-rules.md`.
+
+## Coding style
+
+- All layer calls should be visible directly in `forward` — avoid helper functions that hide `nn.Module` calls.
+- Avoid graph breaks for `torch.compile` compatibility — do not insert NumPy operations in forward implementations and any other patterns that can break `torch.compile` compatibility with `fullgraph=True`.
+- No new mandatory dependency without discussion (e.g. `einops`). Optional deps guarded with `is_X_available()` and a dummy in `utils/dummy_*.py`.
+
+## Common model conventions
+
+- Models use `ModelMixin` with `register_to_config` for config serialization
+
+## Attention pattern
+
+Attention must follow the diffusers pattern: both the `Attention` class and its processor are defined in the model file. The processor's `__call__` handles the actual compute and must use `dispatch_attention_fn` rather than calling `F.scaled_dot_product_attention` directly. The attention class inherits `AttentionModuleMixin` and declares `_default_processor_cls` and `_available_processors`.
+
+```python
+# transformer_mymodel.py
+
+class MyModelAttnProcessor:
+    _attention_backend = None
+    _parallel_config = None
+
+    def __call__(self, attn, hidden_states, attention_mask=None, ...):
+        query = attn.to_q(hidden_states)
+        key = attn.to_k(hidden_states)
+        value = attn.to_v(hidden_states)
+        # reshape, apply rope, etc.
+        hidden_states = dispatch_attention_fn(
+            query, key, value,
+            attn_mask=attention_mask,
+            backend=self._attention_backend,
+            parallel_config=self._parallel_config,
+        )
+        hidden_states = hidden_states.flatten(2, 3)
+        return attn.to_out[0](hidden_states)
+
+
+class MyModelAttention(nn.Module, AttentionModuleMixin):
+    _default_processor_cls = MyModelAttnProcessor
+    _available_processors = [MyModelAttnProcessor]
+
+    def __init__(self, query_dim, heads=8, dim_head=64, ...):
+        super().__init__()
+        self.to_q = nn.Linear(query_dim, heads * dim_head, bias=False)
+        self.to_k = nn.Linear(query_dim, heads * dim_head, bias=False)
+        self.to_v = nn.Linear(query_dim, heads * dim_head, bias=False)
+        self.to_out = nn.ModuleList([nn.Linear(heads * dim_head, query_dim), nn.Dropout(0.0)])
+        self.set_processor(MyModelAttnProcessor())
+
+    def forward(self, hidden_states, attention_mask=None, **kwargs):
+        return self.processor(self, hidden_states, attention_mask, **kwargs)
+```
+
+Consult the implementations in `src/diffusers/models/transformers/` if you need further references.
+
+## Gotchas
+
+1. **Forgetting `__init__.py` lazy imports.** Every new class must be registered in the appropriate `__init__.py` with lazy imports. Missing this causes `ImportError` that only shows up when users try `from diffusers import YourNewClass`.
+
+2. **Using `einops` or other non-PyTorch deps.** Reference implementations often use `einops.rearrange`. Always rewrite with native PyTorch (`reshape`, `permute`, `unflatten`). Don't add the dependency. If a dependency is truly unavoidable, guard its import: `if is_my_dependency_available(): import my_dependency`.
+
+3. **Missing `make fix-copies` after `# Copied from`.** If you add `# Copied from` annotations, you must run `make fix-copies` to propagate them. CI will fail otherwise.
+
+4. **Wrong `_supports_cache_class` / `_no_split_modules`.** These class attributes control KV cache and device placement. Copy from a similar model and verify -- wrong values cause silent correctness bugs or OOM errors.
+
+5. **Missing `@torch.no_grad()` on pipeline `__call__`.** Forgetting this causes GPU OOM from gradient accumulation during inference.
+
+6. **Config serialization gaps.** Every `__init__` parameter in a `ModelMixin` subclass must be captured by `register_to_config`. If you add a new param but forget to register it, `from_pretrained` will silently use the default instead of the saved value.
+
+7. **Forgetting to update `_import_structure` and `_lazy_modules`.** The top-level `src/diffusers/__init__.py` has both -- missing either one causes partial import failures.
+
+8. **Hardcoded dtype in model forward.** Don't hardcode `torch.float32` or `torch.bfloat16` in the model's forward pass. Use the dtype of the input tensors or `self.dtype` so the model works with any precision.
--- a/.ai/review-rules.md
+++ b/.ai/review-rules.md
@@ -3,8 +3,8 @@
 Review-specific rules for Claude. Focus on correctness — style is handled by ruff.

 Before reviewing, read and apply the guidelines in:
- [AGENTS.md](AGENTS.md) — coding style, dependencies, copied code, model conventions
- [skills/model-integration/SKILL.md](skills/model-integration/SKILL.md) — attention pattern, pipeline rules, implementation checklist, gotchas
+- [AGENTS.md](AGENTS.md) — coding style, copied code
+- [models.md](models.md) — model conventions, attention pattern, implementation rules, dependencies, gotchas
 - [skills/parity-testing/SKILL.md](skills/parity-testing/SKILL.md) — testing rules, comparison utilities
 - [skills/parity-testing/pitfalls.md](skills/parity-testing/pitfalls.md) — known pitfalls (dtype mismatches, config assumptions, etc.)

--- a/.ai/skills/model-integration/SKILL.md
+++ b/.ai/skills/model-integration/SKILL.md
@@ -65,89 +65,19 @@ docs/source/en/api/
 - [ ] Run `make style` and `make quality`
 - [ ] Test parity with reference implementation (see `parity-testing` skill)

-### Attention pattern
+### Model conventions, attention pattern, and implementation rules

-Attention must follow the diffusers pattern: both the `Attention` class and its processor are defined in the model file. The processor's `__call__` handles the actual compute and must use `dispatch_attention_fn` rather than calling `F.scaled_dot_product_attention` directly. The attention class inherits `AttentionModuleMixin` and declares `_default_processor_cls` and `_available_processors`.
+See [../../models.md](../../models.md) for the attention pattern, implementation rules, common conventions, dependencies, and gotchas. These apply to all model work.

-```python
-# transformer_mymodel.py
+### Model integration specific rules

-class MyModelAttnProcessor:
-    _attention_backend = None
-    _parallel_config = None
-
-    def __call__(self, attn, hidden_states, attention_mask=None, ...):
-        query = attn.to_q(hidden_states)
-        key = attn.to_k(hidden_states)
-        value = attn.to_v(hidden_states)
-        # reshape, apply rope, etc.
-        hidden_states = dispatch_attention_fn(
-            query, key, value,
-            attn_mask=attention_mask,
-            backend=self._attention_backend,
-            parallel_config=self._parallel_config,
-        )
-        hidden_states = hidden_states.flatten(2, 3)
-        return attn.to_out[0](hidden_states)
-
-
-class MyModelAttention(nn.Module, AttentionModuleMixin):
-    _default_processor_cls = MyModelAttnProcessor
-    _available_processors = [MyModelAttnProcessor]
-
-    def __init__(self, query_dim, heads=8, dim_head=64, ...):
-        super().__init__()
-        self.to_q = nn.Linear(query_dim, heads * dim_head, bias=False)
-        self.to_k = nn.Linear(query_dim, heads * dim_head, bias=False)
-        self.to_v = nn.Linear(query_dim, heads * dim_head, bias=False)
-        self.to_out = nn.ModuleList([nn.Linear(heads * dim_head, query_dim), nn.Dropout(0.0)])
-        self.set_processor(MyModelAttnProcessor())
-
-    def forward(self, hidden_states, attention_mask=None, **kwargs):
-        return self.processor(self, hidden_states, attention_mask, **kwargs)
-```
-
-Consult the implementations in `src/diffusers/models/transformers/` if you need further references.
-
-### Implementation rules
-
-1. **Don't combine structural changes with behavioral changes.** Restructuring code to fit diffusers APIs (ModelMixin, ConfigMixin, etc.) is unavoidable. But don't also "improve" the algorithm, refactor computation order, or rename internal variables for aesthetics. Keep numerical logic as close to the reference as possible, even if it looks unclean. For standard → modular, this is stricter: copy loop logic verbatim and only restructure into blocks. Clean up in a separate commit after parity is confirmed.
-2. **Pipelines must inherit from `DiffusionPipeline`.** Consult implementations in `src/diffusers/pipelines` in case you need references.
-3. **Don't subclass an existing pipeline for a variant.** DO NOT use an existing pipeline class (e.g., `FluxPipeline`) to override another pipeline (e.g., `FluxImg2ImgPipeline`) which will be a part of the core codebase (`src`).
+**Don't combine structural changes with behavioral changes.** Restructuring code to fit diffusers APIs (ModelMixin, ConfigMixin, etc.) is unavoidable. But don't also "improve" the algorithm, refactor computation order, or rename internal variables for aesthetics. Keep numerical logic as close to the reference as possible, even if it looks unclean. For standard → modular, this is stricter: copy loop logic verbatim and only restructure into blocks. Clean up in a separate commit after parity is confirmed.

 ### Test setup

 - Slow tests gated with `@slow` and `RUN_SLOW=1`
 - All model-level tests must use the `BaseModelTesterConfig`, `ModelTesterMixin`, `MemoryTesterMixin`, `AttentionTesterMixin`, `LoraTesterMixin`, and `TrainingTesterMixin` classes initially to write the tests. Any additional tests should be added after discussions with the maintainers. Use `tests/models/transformers/test_models_transformer_flux.py` as a reference.

-### Common diffusers conventions
-
- Pipelines inherit from `DiffusionPipeline`
- Models use `ModelMixin` with `register_to_config` for config serialization
- Schedulers use `SchedulerMixin` with `ConfigMixin`
- Use `@torch.no_grad()` on pipeline `__call__`
- Support `output_type="latent"` for skipping VAE decode
- Support `generator` parameter for reproducibility
- Use `self.progress_bar(timesteps)` for progress tracking
-
-## Gotchas
-
-1. **Forgetting `__init__.py` lazy imports.** Every new class must be registered in the appropriate `__init__.py` with lazy imports. Missing this causes `ImportError` that only shows up when users try `from diffusers import YourNewClass`.
-
-2. **Using `einops` or other non-PyTorch deps.** Reference implementations often use `einops.rearrange`. Always rewrite with native PyTorch (`reshape`, `permute`, `unflatten`). Don't add the dependency. If a dependency is truly unavoidable, guard its import: `if is_my_dependency_available(): import my_dependency`.
-
-3. **Missing `make fix-copies` after `# Copied from`.** If you add `# Copied from` annotations, you must run `make fix-copies` to propagate them. CI will fail otherwise.
-
-4. **Wrong `_supports_cache_class` / `_no_split_modules`.** These class attributes control KV cache and device placement. Copy from a similar model and verify -- wrong values cause silent correctness bugs or OOM errors.
-
-5. **Missing `@torch.no_grad()` on pipeline `__call__`.** Forgetting this causes GPU OOM from gradient accumulation during inference.
-
-6. **Config serialization gaps.** Every `__init__` parameter in a `ModelMixin` subclass must be captured by `register_to_config`. If you add a new param but forget to register it, `from_pretrained` will silently use the default instead of the saved value.
-
-7. **Forgetting to update `_import_structure` and `_lazy_modules`.** The top-level `src/diffusers/__init__.py` has both -- missing either one causes partial import failures.
-
-8. **Hardcoded dtype in model forward.** Don't hardcode `torch.float32` or `torch.bfloat16` in the model's forward pass. Use the dtype of the input tensors or `self.dtype` so the model works with any precision.
-
 ---

 ## Modular Pipeline Conversion
--- a/.github/workflows/claude_review.yml
+++ b/.github/workflows/claude_review.yml
@@ -32,6 +32,9 @@ jobs:
      )
    runs-on: ubuntu-latest
    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 1
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
--- a/src/diffusers/models/attention_dispatch.py
+++ b/src/diffusers/models/attention_dispatch.py
@@ -862,23 +862,23 @@ def _native_attention_backward_op(
    key.requires_grad_(True)
    value.requires_grad_(True)

-    query_t, key_t, value_t = (x.permute(0, 2, 1, 3) for x in (query, key, value))
-    out = torch.nn.functional.scaled_dot_product_attention(
-        query=query_t,
-        key=key_t,
-        value=value_t,
-        attn_mask=ctx.attn_mask,
-        dropout_p=ctx.dropout_p,
-        is_causal=ctx.is_causal,
-        scale=ctx.scale,
-        enable_gqa=ctx.enable_gqa,
-    )
-    out = out.permute(0, 2, 1, 3)
+    with torch.enable_grad():
+        query_t, key_t, value_t = (x.permute(0, 2, 1, 3) for x in (query, key, value))
+        out = torch.nn.functional.scaled_dot_product_attention(
+            query=query_t,
+            key=key_t,
+            value=value_t,
+            attn_mask=ctx.attn_mask,
+            dropout_p=ctx.dropout_p,
+            is_causal=ctx.is_causal,
+            scale=ctx.scale,
+            enable_gqa=ctx.enable_gqa,
+        )
+        out = out.permute(0, 2, 1, 3)

-    grad_out_t = grad_out.permute(0, 2, 1, 3)
-    grad_query_t, grad_key_t, grad_value_t = torch.autograd.grad(
-        outputs=out, inputs=[query_t, key_t, value_t], grad_outputs=grad_out_t, retain_graph=False
-    )
+        grad_query_t, grad_key_t, grad_value_t = torch.autograd.grad(
+            outputs=out, inputs=[query_t, key_t, value_t], grad_outputs=grad_out, retain_graph=False
+        )

    grad_query = grad_query_t.permute(0, 2, 1, 3)
    grad_key = grad_key_t.permute(0, 2, 1, 3)
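
Note on the hunk above: autograd runs a custom `torch.autograd.Function.backward` with grad mode disabled by default, so a forward pass recomputed there records no graph unless it is wrapped in `torch.enable_grad()`. The hunk also passes `grad_out` directly as `grad_outputs`, which appears to be because `out` has already been permuted back to the layout of `grad_out`. The sketch below reproduces only the grad-mode part on a toy op; `RecomputeSquare` is a hypothetical name used for illustration and is not part of diffusers.

```python
import torch


class RecomputeSquare(torch.autograd.Function):
    """Toy op y = x * x whose backward recomputes the forward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Work on a fresh leaf so the recomputed graph is local to backward.
        x = x.detach().requires_grad_(True)
        # Grad mode is off here by default; without enable_grad() the
        # recomputed `out` would have no grad_fn and autograd.grad would fail.
        with torch.enable_grad():
            out = x * x
            (grad_x,) = torch.autograd.grad(outputs=out, inputs=[x], grad_outputs=grad_out)
        return grad_x


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
RecomputeSquare.apply(x).sum().backward()
print(x.grad)  # gradient of sum(x * x) is 2 * x
```

Removing the `with torch.enable_grad():` line (and dedenting its body) should make the `torch.autograd.grad` call raise, which is the failure mode the hunk guards against.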
--- a/src/diffusers/quantizers/quantization_config.py
+++ b/src/diffusers/quantizers/quantization_config.py
@@ -470,8 +470,8 @@ class TorchAoConfig(QuantizationConfigMixin):
        self.post_init()

    def post_init(self):
-        if is_torchao_version("<=", "0.9.0"):
-            raise ValueError("TorchAoConfig requires torchao > 0.9.0. Please upgrade with `pip install -U torchao`.")
+        if is_torchao_version("<", "0.15.0"):
+            raise ValueError("TorchAoConfig requires torchao >= 0.15.0. Please upgrade with `pip install -U torchao`.")

        from torchao.quantization.quant_api import AOBaseConfig

@@ -495,8 +495,8 @@ class TorchAoConfig(QuantizationConfigMixin):
    @classmethod
    def from_dict(cls, config_dict, return_unused_kwargs=False, **kwargs):
        """Create configuration from a dictionary."""
-        if not is_torchao_version(">", "0.9.0"):
-            raise NotImplementedError("TorchAoConfig requires torchao > 0.9.0 for construction from dict")
+        if not is_torchao_version(">=", "0.15.0"):
+            raise NotImplementedError("TorchAoConfig requires torchao >= 0.15.0 for construction from dict")
        config_dict = config_dict.copy()
        quant_type = config_dict.pop("quant_type")

--- a/src/diffusers/quantizers/torchao/torchao_quantizer.py
+++ b/src/diffusers/quantizers/torchao/torchao_quantizer.py
@@ -113,7 +113,7 @@ if (
    is_torch_available()
    and is_torch_version(">=", "2.6.0")
    and is_torchao_available()
-    and is_torchao_version(">=", "0.7.0")
+    and is_torchao_version(">=", "0.15.0")
 ):
    _update_torch_safe_globals()

@@ -168,10 +168,10 @@ class TorchAoHfQuantizer(DiffusersQuantizer):
            raise ImportError(
                "Loading a TorchAO quantized model requires the torchao library. Please install with `pip install torchao`"
            )
-        torchao_version = version.parse(importlib.metadata.version("torch"))
-        if torchao_version < version.parse("0.7.0"):
+        torchao_version = version.parse(importlib.metadata.version("torchao"))
+        if torchao_version < version.parse("0.15.0"):
            raise RuntimeError(
-                f"The minimum required version of `torchao` is 0.7.0, but the current version is {torchao_version}. Please upgrade with `pip install -U torchao`."
+                f"The minimum required version of `torchao` is 0.15.0, but the current version is {torchao_version}. Please upgrade with `pip install -U torchao`."
            )

        self.offload = False
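
Note on the version gates above: the new minimum `0.15.0` only orders correctly against older releases such as `0.9.0` when versions are compared component by component, which is what `version.parse` (presumably from `packaging`) and the `is_torchao_version` helper are used for in the diff. The stdlib-only sketch below illustrates the difference from plain string comparison; `parse_release` and `meets_minimum` are hypothetical helpers, not diffusers or torchao APIs.

```python
import re


def parse_release(version_string):
    """Turn '0.15.0' or '0.15.0.dev1+cu121' into the tuple (0, 15, 0)."""
    match = re.match(r"\d+(\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Not a version string: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare release numbers numerically, component by component."""
    return parse_release(installed) >= parse_release(minimum)


print(meets_minimum("0.9.0", "0.15.0"))   # False: 9 < 15 numerically
print(meets_minimum("0.15.0", "0.15.0"))  # True
print("0.9.0" >= "0.15.0")                # True: string comparison gets this wrong
```

The same hunk also fixes the package name passed to `importlib.metadata.version`, which previously read the installed `torch` version instead of `torchao`.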
--- a/tests/models/autoencoders/test_models_autoencoder_dc.py
+++ b/tests/models/autoencoders/test_models_autoencoder_dc.py
@@ -13,24 +13,29 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import unittest
+import pytest
+import torch

 from diffusers import AutoencoderDC

-from ...testing_utils import IS_GITHUB_ACTIONS, enable_full_determinism, floats_tensor, torch_device
-from ..test_modeling_common import ModelTesterMixin
+from ...testing_utils import IS_GITHUB_ACTIONS, enable_full_determinism, torch_device
+from ..testing_utils import BaseModelTesterConfig, MemoryTesterMixin, ModelTesterMixin, TrainingTesterMixin
 from .testing_utils import AutoencoderTesterMixin


 enable_full_determinism()


-class AutoencoderDCTests(ModelTesterMixin, AutoencoderTesterMixin, unittest.TestCase):
-    model_class = AutoencoderDC
-    main_input_name = "sample"
-    base_precision = 1e-2
+class AutoencoderDCTesterConfig(BaseModelTesterConfig):
+    @property
+    def model_class(self):
+        return AutoencoderDC

-    def get_autoencoder_dc_config(self):
+    @property
+    def output_shape(self):
+        return (3, 32, 32)
+
+    def get_init_dict(self):
        return {
            "in_channels": 3,
            "latent_channels": 4,
@@ -56,33 +61,34 @@ class AutoencoderDCTests(ModelTesterMixin, AutoencoderTesterMixin, unittest.Test
            "scaling_factor": 0.41407,
        }

-    @property
-    def dummy_input(self):
+    def get_dummy_inputs(self, seed=0):
+        torch.manual_seed(seed)
        batch_size = 4
        num_channels = 3
        sizes = (32, 32)
-
-        image = floats_tensor((batch_size, num_channels) + sizes).to(torch_device)
-
+        image = torch.randn(batch_size, num_channels, *sizes).to(torch_device)
        return {"sample": image}

-    @property
-    def input_shape(self):
-        return (3, 32, 32)
-
-    @property
-    def output_shape(self):
-        return (3, 32, 32)
-
+    # Bridge for AutoencoderTesterMixin which still uses the old interface
    def prepare_init_args_and_inputs_for_common(self):
-        init_dict = self.get_autoencoder_dc_config()
-        inputs_dict = self.dummy_input
-        return init_dict, inputs_dict
+        return self.get_init_dict(), self.get_dummy_inputs()

-    @unittest.skipIf(IS_GITHUB_ACTIONS, reason="Skipping test inside GitHub Actions environment")
-    def test_layerwise_casting_inference(self):
-        super().test_layerwise_casting_inference()

-    @unittest.skipIf(IS_GITHUB_ACTIONS, reason="Skipping test inside GitHub Actions environment")
+class TestAutoencoderDC(AutoencoderDCTesterConfig, ModelTesterMixin):
+    base_precision = 1e-2
+
+
+class TestAutoencoderDCTraining(AutoencoderDCTesterConfig, TrainingTesterMixin):
+    """Training tests for AutoencoderDC."""
+
+
+class TestAutoencoderDCMemory(AutoencoderDCTesterConfig, MemoryTesterMixin):
+    """Memory optimization tests for AutoencoderDC."""
+
+    @pytest.mark.skipif(IS_GITHUB_ACTIONS, reason="Skipping test inside GitHub Actions environment")
    def test_layerwise_casting_memory(self):
        super().test_layerwise_casting_memory()
+
+
+class TestAutoencoderDCSlicingTiling(AutoencoderDCTesterConfig, AutoencoderTesterMixin):
+    """Slicing and tiling tests for AutoencoderDC."""
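
Note on `get_dummy_inputs(seed=0)` above: because the method reseeds the global generator before sampling, separate calls return identical tensors. That appears to be what lets the `ModelTesterMixin` changes later in this diff call `self.get_dummy_inputs()` once per forward pass instead of sharing one dictionary. A minimal standalone sketch, run on CPU and without the `torch_device` move used in the real test:

```python
import torch


def get_dummy_inputs(seed=0):
    """Standalone stand-in for the tester config method in the diff."""
    torch.manual_seed(seed)
    batch_size, num_channels, sizes = 4, 3, (32, 32)
    return {"sample": torch.randn(batch_size, num_channels, *sizes)}


first = get_dummy_inputs()
second = get_dummy_inputs()
other = get_dummy_inputs(seed=1)

print(torch.equal(first["sample"], second["sample"]))  # True: same seed, same tensor
print(torch.equal(first["sample"], other["sample"]))   # False: different seed
```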
--- a/tests/models/autoencoders/testing_utils.py
+++ b/tests/models/autoencoders/testing_utils.py
@@ -44,9 +44,9 @@ class AutoencoderTesterMixin:
            if isinstance(output, dict):
                output = output.to_tuple()[0]

-        self.assertIsNotNone(output)
+        assert output is not None
        expected_shape = inputs_dict["sample"].shape
-        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")
+        assert output.shape == expected_shape, "Input and output shapes do not match"

    def test_enable_disable_tiling(self):
        if not hasattr(self.model_class, "enable_tiling"):
--- a/tests/models/test_modeling_common.py
+++ b/tests/models/test_modeling_common.py
@@ -465,8 +465,7 @@ class UNetTesterMixin:
    def test_forward_with_norm_groups(self):
        if not self._accepts_norm_num_groups(self.model_class):
            pytest.skip(f"Test not supported for {self.model_class.__name__}")
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["norm_num_groups"] = 16
        init_dict["block_out_channels"] = (16, 32)
@@ -481,9 +480,9 @@ class UNetTesterMixin:
            if isinstance(output, dict):
                output = output.to_tuple()[0]

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")


 class ModelTesterMixin:
--- a/tests/models/testing_utils/common.py
+++ b/tests/models/testing_utils/common.py
@@ -287,9 +287,8 @@ class ModelTesterMixin:
                f"Parameter shape mismatch for {param_name}. Original: {param_1.shape}, loaded: {param_2.shape}"
            )

-        inputs_dict = self.get_dummy_inputs()
-        image = model(**inputs_dict, return_dict=False)[0]
-        new_image = new_model(**inputs_dict, return_dict=False)[0]
+        image = model(**self.get_dummy_inputs(), return_dict=False)[0]
+        new_image = new_model(**self.get_dummy_inputs(), return_dict=False)[0]

        assert_tensors_close(image, new_image, atol=atol, rtol=rtol, msg="Models give different forward passes.")

@@ -309,9 +308,8 @@ class ModelTesterMixin:

        new_model.to(torch_device)

-        inputs_dict = self.get_dummy_inputs()
-        image = model(**inputs_dict, return_dict=False)[0]
-        new_image = new_model(**inputs_dict, return_dict=False)[0]
+        image = model(**self.get_dummy_inputs(), return_dict=False)[0]
+        new_image = new_model(**self.get_dummy_inputs(), return_dict=False)[0]

        assert_tensors_close(image, new_image, atol=atol, rtol=rtol, msg="Models give different forward passes.")

@@ -339,9 +337,8 @@ class ModelTesterMixin:
        model.to(torch_device)
        model.eval()

-        inputs_dict = self.get_dummy_inputs()
-        first = model(**inputs_dict, return_dict=False)[0]
-        second = model(**inputs_dict, return_dict=False)[0]
+        first = model(**self.get_dummy_inputs(), return_dict=False)[0]
+        second = model(**self.get_dummy_inputs(), return_dict=False)[0]

        first_flat = first.flatten()
        second_flat = second.flatten()
@@ -398,9 +395,8 @@ class ModelTesterMixin:
        model.to(torch_device)
        model.eval()

-        inputs_dict = self.get_dummy_inputs()
-        outputs_dict = model(**inputs_dict)
-        outputs_tuple = model(**inputs_dict, return_dict=False)
+        outputs_dict = model(**self.get_dummy_inputs())
+        outputs_tuple = model(**self.get_dummy_inputs(), return_dict=False)

        recursive_check(outputs_tuple, outputs_dict)

@@ -527,10 +523,8 @@ class ModelTesterMixin:
        new_model = new_model.to(torch_device)

        torch.manual_seed(0)
-        # Re-create inputs only if they contain a generator (which needs to be reset)
-        if "generator" in inputs_dict:
-            inputs_dict = self.get_dummy_inputs()
-        new_output = new_model(**inputs_dict, return_dict=False)[0]
+        inputs_dict_new = self.get_dummy_inputs()
+        new_output = new_model(**inputs_dict_new, return_dict=False)[0]

        assert_tensors_close(
            base_output, new_output, atol=atol, rtol=rtol, msg="Output should match after sharded save/load"
@@ -569,10 +563,8 @@ class ModelTesterMixin:
        new_model = new_model.to(torch_device)

        torch.manual_seed(0)
-        # Re-create inputs only if they contain a generator (which needs to be reset)
-        if "generator" in inputs_dict:
-            inputs_dict = self.get_dummy_inputs()
-        new_output = new_model(**inputs_dict, return_dict=False)[0]
+        inputs_dict_new = self.get_dummy_inputs()
+        new_output = new_model(**inputs_dict_new, return_dict=False)[0]

        assert_tensors_close(
            base_output, new_output, atol=atol, rtol=rtol, msg="Output should match after variant sharded save/load"
@@ -622,10 +614,8 @@ class ModelTesterMixin:
            model_parallel = model_parallel.to(torch_device)

            torch.manual_seed(0)
-            # Re-create inputs only if they contain a generator (which needs to be reset)
-            if "generator" in inputs_dict:
-                inputs_dict = self.get_dummy_inputs()
-            output_parallel = model_parallel(**inputs_dict, return_dict=False)[0]
+            inputs_dict_parallel = self.get_dummy_inputs()
+            output_parallel = model_parallel(**inputs_dict_parallel, return_dict=False)[0]

            assert_tensors_close(
                base_output, output_parallel, atol=atol, rtol=rtol, msg="Output should match with parallel loading"
--- a/tests/models/testing_utils/compile.py
+++ b/tests/models/testing_utils/compile.py
@@ -92,6 +92,9 @@ class TorchCompileTesterMixin:
        model.eval()
        model.compile_repeated_blocks(fullgraph=True)

+        if self.model_class.__name__ == "UNet2DConditionModel":
+            recompile_limit = 2
+
        with (
            torch._inductor.utils.fresh_inductor_cache(),
            torch._dynamo.config.patch(recompile_limit=recompile_limit),
--- a/tests/models/testing_utils/parallelism.py
+++ b/tests/models/testing_utils/parallelism.py
@@ -98,6 +98,64 @@ def _context_parallel_worker(rank, world_size, master_port, model_class, init_di
            dist.destroy_process_group()


+def _context_parallel_backward_worker(
+    rank, world_size, master_port, model_class, init_dict, cp_dict, inputs_dict, return_dict
+):
+    """Worker function for context parallel backward pass testing."""
+    try:
+        # Set up distributed environment
+        os.environ["MASTER_ADDR"] = "localhost"
+        os.environ["MASTER_PORT"] = str(master_port)
+        os.environ["RANK"] = str(rank)
+        os.environ["WORLD_SIZE"] = str(world_size)
+
+        # Get device configuration
+        device_config = DEVICE_CONFIG.get(torch_device, DEVICE_CONFIG["cuda"])
+        backend = device_config["backend"]
+        device_module = device_config["module"]
+
+        # Initialize process group
+        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
+
+        # Set device for this process
+        device_module.set_device(rank)
+        device = torch.device(f"{torch_device}:{rank}")
+
+        # Create model in training mode
+        model = model_class(**init_dict)
+        model.to(device)
+        model.train()
+
+        # Move inputs to device
+        inputs_on_device = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs_dict.items()}
+
+        # Enable context parallelism
+        cp_config = ContextParallelConfig(**cp_dict)
+        model.enable_parallelism(config=cp_config)
+
+        # Run forward and backward pass
+        output = model(**inputs_on_device, return_dict=False)[0]
+        loss = output.sum()
+        loss.backward()
+
+        # Check that backward actually produced at least one valid gradient
+        grads = [p.grad for p in model.parameters() if p.requires_grad and p.grad is not None]
+        has_valid_grads = len(grads) > 0 and all(torch.isfinite(g).all() for g in grads)
+
+        # Only rank 0 reports results
+        if rank == 0:
+            return_dict["status"] = "success"
+            return_dict["has_valid_grads"] = bool(has_valid_grads)
+
+    except Exception as e:
+        if rank == 0:
+            return_dict["status"] = "error"
+            return_dict["error"] = str(e)
+    finally:
+        if dist.is_initialized():
+            dist.destroy_process_group()
+
+
 def _custom_mesh_worker(
    rank,
    world_size,
@@ -204,6 +262,51 @@ class ContextParallelTesterMixin:
    def test_context_parallel_batch_inputs(self, cp_type):
        self.test_context_parallel_inference(cp_type, batch_size=2)

+    @pytest.mark.parametrize("cp_type", ["ulysses_degree", "ring_degree"], ids=["ulysses", "ring"])
+    def test_context_parallel_backward(self, cp_type, batch_size: int = 1):
+        if not torch.distributed.is_available():
+            pytest.skip("torch.distributed is not available.")
+
+        if not hasattr(self.model_class, "_cp_plan") or self.model_class._cp_plan is None:
+            pytest.skip("Model does not have a _cp_plan defined for context parallelism.")
+
+        if cp_type == "ring_degree":
+            active_backend, _ = _AttentionBackendRegistry.get_active_backend()
+            if active_backend == AttentionBackendName.NATIVE:
+                pytest.skip("Ring attention is not supported with the native attention backend.")
+
+        world_size = 2
+        init_dict = self.get_init_dict()
+        inputs_dict = self.get_dummy_inputs(batch_size=batch_size)
+
+        # Move all tensors to CPU for multiprocessing
+        inputs_dict = {k: v.cpu() if isinstance(v, torch.Tensor) else v for k, v in inputs_dict.items()}
+        cp_dict = {cp_type: world_size}
+
+        # Find a free port for distributed communication
+        master_port = _find_free_port()
+
+        # Use multiprocessing manager for cross-process communication
+        manager = mp.Manager()
+        return_dict = manager.dict()
+
+        # Spawn worker processes
+        mp.spawn(
+            _context_parallel_backward_worker,
+            args=(world_size, master_port, self.model_class, init_dict, cp_dict, inputs_dict, return_dict),
+            nprocs=world_size,
+            join=True,
+        )
+
+        assert return_dict.get("status") == "success", (
+            f"Context parallel backward pass failed: {return_dict.get('error', 'Unknown error')}"
+        )
+        assert return_dict.get("has_valid_grads"), "Context parallel backward pass did not produce valid gradients."
+
+    @pytest.mark.parametrize("cp_type", ["ulysses_degree", "ring_degree"], ids=["ulysses", "ring"])
+    def test_context_parallel_backward_batch_inputs(self, cp_type):
+        self.test_context_parallel_backward(cp_type, batch_size=2)
+
    @pytest.mark.parametrize(
        "cp_type,mesh_shape,mesh_dim_names",
        [
--- a/tests/models/unets/test_models_unet_1d.py
+++ b/tests/models/unets/test_models_unet_1d.py
@@ -13,6 +13,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import unittest
+
 import pytest
 import torch

@@ -24,39 +26,64 @@ from ...testing_utils import (
    slow,
    torch_device,
 )
-from ..test_modeling_common import UNetTesterMixin
-from ..testing_utils import (
-    BaseModelTesterConfig,
-    MemoryTesterMixin,
-    ModelTesterMixin,
-)
+from ..test_modeling_common import ModelTesterMixin, UNetTesterMixin


-_LAYERWISE_CASTING_XFAIL_REASON = (
-    "RuntimeError: 'fill_out' not implemented for 'Float8_e4m3fn'. The error is caused due to certain torch.float8_e4m3fn and torch.float8_e5m2 operations "
-    "not being supported when using deterministic algorithms (which is what the tests run with). To fix:\n"
-    "1. Wait for next PyTorch release: https://github.com/pytorch/pytorch/issues/137160.\n"
-    "2. Unskip this test."
-)
-
-
-class UNet1DTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNet1DModel testing (standard variant)."""
+class UNet1DModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNet1DModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNet1DModel
+    def dummy_input(self):
+        batch_size = 4
+        num_features = 14
+        seq_len = 16
+
+        noise = floats_tensor((batch_size, num_features, seq_len)).to(torch_device)
+        time_step = torch.tensor([10] * batch_size).to(torch_device)
+
+        return {"sample": noise, "timestep": time_step}
+
+    @property
+    def input_shape(self):
+        return (4, 14, 16)

    @property
    def output_shape(self):
-        return (14, 16)
+        return (4, 14, 16)

-    @property
-    def main_input_name(self):
-        return "sample"
+    @unittest.skip("Test not supported.")
+    def test_ema_training(self):
+        pass

-    def get_init_dict(self):
-        return {
+    @unittest.skip("Test not supported.")
+    def test_training(self):
+        pass
+
+    @unittest.skip("Test not supported.")
+    def test_layerwise_casting_training(self):
+        pass
+
+    def test_determinism(self):
+        super().test_determinism()
+
+    def test_outputs_equivalence(self):
+        super().test_outputs_equivalence()
+
+    def test_from_save_pretrained(self):
+        super().test_from_save_pretrained()
+
+    def test_from_save_pretrained_variant(self):
+        super().test_from_save_pretrained_variant()
+
+    def test_model_from_pretrained(self):
+        super().test_model_from_pretrained()
+
+    def test_output(self):
+        super().test_output()
+
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "block_out_channels": (8, 8, 16, 16),
            "in_channels": 14,
            "out_channels": 14,
@@ -70,40 +97,18 @@ class UNet1DTesterConfig(BaseModelTesterConfig):
            "up_block_types": ("UpResnetBlock1D", "UpResnetBlock1D", "UpResnetBlock1D"),
            "act_fn": "swish",
        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

-    def get_dummy_inputs(self):
-        batch_size = 4
-        num_features = 14
-        seq_len = 16
-
-        return {
-            "sample": floats_tensor((batch_size, num_features, seq_len)).to(torch_device),
-            "timestep": torch.tensor([10] * batch_size).to(torch_device),
-        }
-
-
-class TestUNet1D(UNet1DTesterConfig, ModelTesterMixin, UNetTesterMixin):
-    @pytest.mark.skip("Not implemented yet for this UNet")
-    def test_forward_with_norm_groups(self):
-        pass
-
-
-class TestUNet1DMemory(UNet1DTesterConfig, MemoryTesterMixin):
-    @pytest.mark.xfail(reason=_LAYERWISE_CASTING_XFAIL_REASON)
-    def test_layerwise_casting_memory(self):
-        super().test_layerwise_casting_memory()
-
-
-class TestUNet1DHubLoading(UNet1DTesterConfig):
    def test_from_pretrained_hub(self):
        model, loading_info = UNet1DModel.from_pretrained(
            "bglick13/hopper-medium-v2-value-function-hor32", output_loading_info=True, subfolder="unet"
        )
-        assert model is not None
-        assert len(loading_info["missing_keys"]) == 0
+        self.assertIsNotNone(model)
+        self.assertEqual(len(loading_info["missing_keys"]), 0)

        model.to(torch_device)
-        image = model(**self.get_dummy_inputs())
+        image = model(**self.dummy_input)

        assert image is not None, "Make sure output is not None"

@@ -126,7 +131,12 @@ class TestUNet1DHubLoading(UNet1DTesterConfig):
        # fmt: off
        expected_output_slice = torch.tensor([-2.137172, 1.1426016, 0.3688687, -0.766922, 0.7303146, 0.11038864, -0.4760633, 0.13270172, 0.02591348])
        # fmt: on
-        assert torch.allclose(output_slice, expected_output_slice, rtol=1e-3)
+        self.assertTrue(torch.allclose(output_slice, expected_output_slice, rtol=1e-3))
+
+    @unittest.skip("Test not supported.")
+    def test_forward_with_norm_groups(self):
+        # Not implemented yet for this UNet
+        pass

    @slow
    def test_unet_1d_maestro(self):
@@ -147,29 +157,98 @@ class TestUNet1DHubLoading(UNet1DTesterConfig):
        assert (output_sum - 224.0896).abs() < 0.5
        assert (output_max - 0.0607).abs() < 4e-4

+    @pytest.mark.xfail(
+        reason=(
+            "RuntimeError: 'fill_out' not implemented for 'Float8_e4m3fn'. The error is caused by certain torch.float8_e4m3fn and torch.float8_e5m2 operations "
+            "not being supported when using deterministic algorithms (which is what the tests run with). To fix:\n"
+            "1. Wait for next PyTorch release: https://github.com/pytorch/pytorch/issues/137160.\n"
+            "2. Unskip this test."
+        ),
+    )
+    def test_layerwise_casting_inference(self):
+        super().test_layerwise_casting_inference()

-# =============================================================================
-# UNet1D RL (Value Function) Model Tests
-# =============================================================================
+    @pytest.mark.xfail(
+        reason=(
+            "RuntimeError: 'fill_out' not implemented for 'Float8_e4m3fn'. The error is caused by certain torch.float8_e4m3fn and torch.float8_e5m2 operations "
+            "not being supported when using deterministic algorithms (which is what the tests run with). To fix:\n"
+            "1. Wait for next PyTorch release: https://github.com/pytorch/pytorch/issues/137160.\n"
+            "2. Unskip this test."
+        ),
+    )
+    def test_layerwise_casting_memory(self):
+        pass


-class UNet1DRLTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNet1DModel testing (RL value function variant)."""
+class UNetRLModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNet1DModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNet1DModel
+    def dummy_input(self):
+        batch_size = 4
+        num_features = 14
+        seq_len = 16
+
+        noise = floats_tensor((batch_size, num_features, seq_len)).to(torch_device)
+        time_step = torch.tensor([10] * batch_size).to(torch_device)
+
+        return {"sample": noise, "timestep": time_step}
+
+    @property
+    def input_shape(self):
+        return (4, 14, 16)

    @property
    def output_shape(self):
-        return (1,)
+        return (4, 14, 1)

-    @property
-    def main_input_name(self):
-        return "sample"
+    def test_determinism(self):
+        super().test_determinism()

-    def get_init_dict(self):
-        return {
+    def test_outputs_equivalence(self):
+        super().test_outputs_equivalence()
+
+    def test_from_save_pretrained(self):
+        super().test_from_save_pretrained()
+
+    def test_from_save_pretrained_variant(self):
+        super().test_from_save_pretrained_variant()
+
+    def test_model_from_pretrained(self):
+        super().test_model_from_pretrained()
+
+    def test_output(self):
+        # UNetRL is a value function with a different output shape: (batch, 1)
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+        model.eval()
+
+        with torch.no_grad():
+            output = model(**inputs_dict)
+
+            if isinstance(output, dict):
+                output = output.sample
+
+        self.assertIsNotNone(output)
+        expected_shape = torch.Size((inputs_dict["sample"].shape[0], 1))
+        self.assertEqual(output.shape, expected_shape, "Output shape does not match the expected (batch, 1) shape")
+
+    @unittest.skip("Test not supported.")
+    def test_ema_training(self):
+        pass
+
+    @unittest.skip("Test not supported.")
+    def test_training(self):
+        pass
+
+    @unittest.skip("Test not supported.")
+    def test_layerwise_casting_training(self):
+        pass
+
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "in_channels": 14,
            "out_channels": 14,
            "down_block_types": ["DownResnetBlock1D", "DownResnetBlock1D", "DownResnetBlock1D", "DownResnetBlock1D"],
@@ -185,54 +264,18 @@ class UNet1DRLTesterConfig(BaseModelTesterConfig):
            "time_embedding_type": "positional",
            "act_fn": "mish",
        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

-    def get_dummy_inputs(self):
-        batch_size = 4
-        num_features = 14
-        seq_len = 16
-
-        return {
-            "sample": floats_tensor((batch_size, num_features, seq_len)).to(torch_device),
-            "timestep": torch.tensor([10] * batch_size).to(torch_device),
-        }
-
-
-class TestUNet1DRL(UNet1DRLTesterConfig, ModelTesterMixin, UNetTesterMixin):
-    @pytest.mark.skip("Not implemented yet for this UNet")
-    def test_forward_with_norm_groups(self):
-        pass
-
-    @torch.no_grad()
-    def test_output(self):
-        # UNetRL is a value-function with different output shape (batch, 1)
-        model = self.model_class(**self.get_init_dict())
-        model.to(torch_device)
-        model.eval()
-
-        inputs_dict = self.get_dummy_inputs()
-        output = model(**inputs_dict, return_dict=False)[0]
-
-        assert output is not None
-        expected_shape = torch.Size((inputs_dict["sample"].shape[0], 1))
-        assert output.shape == expected_shape, "Input and output shapes do not match"
-
-
-class TestUNet1DRLMemory(UNet1DRLTesterConfig, MemoryTesterMixin):
-    @pytest.mark.xfail(reason=_LAYERWISE_CASTING_XFAIL_REASON)
-    def test_layerwise_casting_memory(self):
-        super().test_layerwise_casting_memory()
-
-
-class TestUNet1DRLHubLoading(UNet1DRLTesterConfig):
    def test_from_pretrained_hub(self):
        value_function, vf_loading_info = UNet1DModel.from_pretrained(
            "bglick13/hopper-medium-v2-value-function-hor32", output_loading_info=True, subfolder="value_function"
        )
-        assert value_function is not None
-        assert len(vf_loading_info["missing_keys"]) == 0
+        self.assertIsNotNone(value_function)
+        self.assertEqual(len(vf_loading_info["missing_keys"]), 0)

        value_function.to(torch_device)
-        image = value_function(**self.get_dummy_inputs())
+        image = value_function(**self.dummy_input)

        assert image is not None, "Make sure output is not None"

@@ -256,4 +299,31 @@ class TestUNet1DRLHubLoading(UNet1DRLTesterConfig):
        # fmt: off
        expected_output_slice = torch.tensor([165.25] * seq_len)
        # fmt: on
-        assert torch.allclose(output, expected_output_slice, rtol=1e-3)
+        self.assertTrue(torch.allclose(output, expected_output_slice, rtol=1e-3))
+
+    @unittest.skip("Test not supported.")
+    def test_forward_with_norm_groups(self):
+        # Not implemented yet for this UNet
+        pass
+
+    @pytest.mark.xfail(
+        reason=(
+            "RuntimeError: 'fill_out' not implemented for 'Float8_e4m3fn'. The error is caused by certain torch.float8_e4m3fn and torch.float8_e5m2 operations "
+            "not being supported when using deterministic algorithms (which is what the tests run with). To fix:\n"
+            "1. Wait for next PyTorch release: https://github.com/pytorch/pytorch/issues/137160.\n"
+            "2. Unskip this test."
+        ),
+    )
+    def test_layerwise_casting_inference(self):
+        pass
+
+    @pytest.mark.xfail(
+        reason=(
+            "RuntimeError: 'fill_out' not implemented for 'Float8_e4m3fn'. The error is caused by certain torch.float8_e4m3fn and torch.float8_e5m2 operations "
+            "not being supported when using deterministic algorithms (which is what the tests run with). To fix:\n"
+            "1. Wait for next PyTorch release: https://github.com/pytorch/pytorch/issues/137160.\n"
+            "2. Unskip this test."
+        ),
+    )
+    def test_layerwise_casting_memory(self):
+        pass
--- a/tests/models/unets/test_models_unet_2d.py
+++ b/tests/models/unets/test_models_unet_2d.py
@@ -15,11 +15,12 @@

 import gc
 import math
+import unittest

-import pytest
 import torch

 from diffusers import UNet2DModel
+from diffusers.utils import logging

 from ...testing_utils import (
    backend_empty_cache,
@@ -30,40 +31,39 @@ from ...testing_utils import (
    torch_all_close,
    torch_device,
 )
-from ..test_modeling_common import UNetTesterMixin
-from ..testing_utils import (
-    BaseModelTesterConfig,
-    MemoryTesterMixin,
-    ModelTesterMixin,
-    TrainingTesterMixin,
-)
+from ..test_modeling_common import ModelTesterMixin, UNetTesterMixin


+logger = logging.get_logger(__name__)
+
 enable_full_determinism()


-# =============================================================================
-# Standard UNet2D Model Tests
-# =============================================================================
-
-
-class UNet2DTesterConfig(BaseModelTesterConfig):
-    """Base configuration for standard UNet2DModel testing."""
+class Unet2DModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNet2DModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNet2DModel
+    def dummy_input(self):
+        batch_size = 4
+        num_channels = 3
+        sizes = (32, 32)
+
+        noise = floats_tensor((batch_size, num_channels) + sizes).to(torch_device)
+        time_step = torch.tensor([10]).to(torch_device)
+
+        return {"sample": noise, "timestep": time_step}
+
+    @property
+    def input_shape(self):
+        return (3, 32, 32)

    @property
    def output_shape(self):
        return (3, 32, 32)

-    @property
-    def main_input_name(self):
-        return "sample"
-
-    def get_init_dict(self):
-        return {
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "block_out_channels": (4, 8),
            "norm_num_groups": 2,
            "down_block_types": ("DownBlock2D", "AttnDownBlock2D"),
@@ -74,22 +74,11 @@ class UNet2DTesterConfig(BaseModelTesterConfig):
            "layers_per_block": 2,
            "sample_size": 32,
        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

-    def get_dummy_inputs(self):
-        batch_size = 4
-        num_channels = 3
-        sizes = (32, 32)
-
-        return {
-            "sample": floats_tensor((batch_size, num_channels) + sizes).to(torch_device),
-            "timestep": torch.tensor([10]).to(torch_device),
-        }
-
-
-class TestUNet2D(UNet2DTesterConfig, ModelTesterMixin, UNetTesterMixin):
    def test_mid_block_attn_groups(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["add_attention"] = True
        init_dict["attn_norm_num_groups"] = 4
@@ -98,11 +87,13 @@ class TestUNet2D(UNet2DTesterConfig, ModelTesterMixin, UNetTesterMixin):
        model.to(torch_device)
        model.eval()

-        assert model.mid_block.attentions[0].group_norm is not None, (
-            "Mid block Attention group norm should exist but does not."
+        self.assertIsNotNone(
+            model.mid_block.attentions[0].group_norm, "Mid block Attention group norm should exist but does not."
        )
-        assert model.mid_block.attentions[0].group_norm.num_groups == init_dict["attn_norm_num_groups"], (
-            "Mid block Attention group norm does not have the expected number of groups."
+        self.assertEqual(
+            model.mid_block.attentions[0].group_norm.num_groups,
+            init_dict["attn_norm_num_groups"],
+            "Mid block Attention group norm does not have the expected number of groups.",
        )

        with torch.no_grad():
@@ -111,15 +102,13 @@ class TestUNet2D(UNet2DTesterConfig, ModelTesterMixin, UNetTesterMixin):
            if isinstance(output, dict):
                output = output.to_tuple()[0]

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_mid_block_none(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
-        mid_none_init_dict = self.get_init_dict()
-        mid_none_inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        mid_none_init_dict, mid_none_inputs_dict = self.prepare_init_args_and_inputs_for_common()
        mid_none_init_dict["mid_block_type"] = None

        model = self.model_class(**init_dict)
@@ -130,7 +119,7 @@ class TestUNet2D(UNet2DTesterConfig, ModelTesterMixin, UNetTesterMixin):
        mid_none_model.to(torch_device)
        mid_none_model.eval()

-        assert mid_none_model.mid_block is None, "Mid block should not exist."
+        self.assertIsNone(mid_none_model.mid_block, "Mid block should not exist.")

        with torch.no_grad():
            output = model(**inputs_dict)
@@ -144,10 +133,8 @@ class TestUNet2D(UNet2DTesterConfig, ModelTesterMixin, UNetTesterMixin):
            if isinstance(mid_none_output, dict):
                mid_none_output = mid_none_output.to_tuple()[0]

-        assert not torch.allclose(output, mid_none_output, rtol=1e-3), "outputs should be different."
+        self.assertFalse(torch.allclose(output, mid_none_output, rtol=1e-3), "outputs should be different.")

-
-class TestUNet2DTraining(UNet2DTesterConfig, TrainingTesterMixin):
    def test_gradient_checkpointing_is_applied(self):
        expected_set = {
            "AttnUpBlock2D",
@@ -156,32 +143,41 @@ class TestUNet2DTraining(UNet2DTesterConfig, TrainingTesterMixin):
            "UpBlock2D",
            "DownBlock2D",
        }
+
        # NOTE: unlike UNet2DConditionModel, UNet2DModel does not currently support tuples for `attention_head_dim`
-        super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
+        attention_head_dim = 8
+        block_out_channels = (16, 32)
+
+        super().test_gradient_checkpointing_is_applied(
+            expected_set=expected_set, attention_head_dim=attention_head_dim, block_out_channels=block_out_channels
+        )


-# =============================================================================
-# UNet2D LDM Model Tests
-# =============================================================================
-
-
-class UNet2DLDMTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNet2DModel LDM variant testing."""
+class UNetLDMModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNet2DModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNet2DModel
+    def dummy_input(self):
+        batch_size = 4
+        num_channels = 4
+        sizes = (32, 32)
+
+        noise = floats_tensor((batch_size, num_channels) + sizes).to(torch_device)
+        time_step = torch.tensor([10]).to(torch_device)
+
+        return {"sample": noise, "timestep": time_step}
+
+    @property
+    def input_shape(self):
+        return (4, 32, 32)

    @property
    def output_shape(self):
        return (4, 32, 32)

-    @property
-    def main_input_name(self):
-        return "sample"
-
-    def get_init_dict(self):
-        return {
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "sample_size": 32,
            "in_channels": 4,
            "out_channels": 4,
@@ -191,34 +187,17 @@ class UNet2DLDMTesterConfig(BaseModelTesterConfig):
            "down_block_types": ("DownBlock2D", "DownBlock2D"),
            "up_block_types": ("UpBlock2D", "UpBlock2D"),
        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

-    def get_dummy_inputs(self):
-        batch_size = 4
-        num_channels = 4
-        sizes = (32, 32)
-
-        return {
-            "sample": floats_tensor((batch_size, num_channels) + sizes).to(torch_device),
-            "timestep": torch.tensor([10]).to(torch_device),
-        }
-
-
-class TestUNet2DLDMTraining(UNet2DLDMTesterConfig, TrainingTesterMixin):
-    def test_gradient_checkpointing_is_applied(self):
-        expected_set = {"DownBlock2D", "UNetMidBlock2D", "UpBlock2D"}
-        # NOTE: unlike UNet2DConditionModel, UNet2DModel does not currently support tuples for `attention_head_dim`
-        super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
-
-
-class TestUNet2DLDMHubLoading(UNet2DLDMTesterConfig):
    def test_from_pretrained_hub(self):
        model, loading_info = UNet2DModel.from_pretrained("fusing/unet-ldm-dummy-update", output_loading_info=True)

-        assert model is not None
-        assert len(loading_info["missing_keys"]) == 0
+        self.assertIsNotNone(model)
+        self.assertEqual(len(loading_info["missing_keys"]), 0)

        model.to(torch_device)
-        image = model(**self.get_dummy_inputs()).sample
+        image = model(**self.dummy_input).sample

        assert image is not None, "Make sure output is not None"

@@ -226,7 +205,7 @@ class TestUNet2DLDMHubLoading(UNet2DLDMTesterConfig):
    def test_from_pretrained_accelerate(self):
        model, _ = UNet2DModel.from_pretrained("fusing/unet-ldm-dummy-update", output_loading_info=True)
        model.to(torch_device)
-        image = model(**self.get_dummy_inputs()).sample
+        image = model(**self.dummy_input).sample

        assert image is not None, "Make sure output is not None"

@@ -286,31 +265,44 @@ class TestUNet2DLDMHubLoading(UNet2DLDMTesterConfig):
        expected_output_slice = torch.tensor([-13.3258, -20.1100, -15.9873, -17.6617, -23.0596, -17.9419, -13.3675, -16.1889, -12.3800])
        # fmt: on

-        assert torch_all_close(output_slice, expected_output_slice, rtol=1e-3)
+        self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-3))
+
+    def test_gradient_checkpointing_is_applied(self):
+        expected_set = {"DownBlock2D", "UNetMidBlock2D", "UpBlock2D"}
+
+        # NOTE: unlike UNet2DConditionModel, UNet2DModel does not currently support tuples for `attention_head_dim`
+        attention_head_dim = 32
+        block_out_channels = (32, 64)
+
+        super().test_gradient_checkpointing_is_applied(
+            expected_set=expected_set, attention_head_dim=attention_head_dim, block_out_channels=block_out_channels
+        )


-# =============================================================================
-# NCSN++ Model Tests
-# =============================================================================
-
-
-class NCSNppTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNet2DModel NCSN++ variant testing."""
+class NCSNppModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNet2DModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNet2DModel
+    def dummy_input(self, sizes=(32, 32)):
+        batch_size = 4
+        num_channels = 3
+
+        noise = floats_tensor((batch_size, num_channels) + sizes).to(torch_device)
+        time_step = torch.tensor(batch_size * [10]).to(dtype=torch.int32, device=torch_device)
+
+        return {"sample": noise, "timestep": time_step}
+
+    @property
+    def input_shape(self):
+        return (3, 32, 32)

    @property
    def output_shape(self):
        return (3, 32, 32)

-    @property
-    def main_input_name(self):
-        return "sample"
-
-    def get_init_dict(self):
-        return {
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "block_out_channels": [32, 64, 64, 64],
            "in_channels": 3,
            "layers_per_block": 1,
@@ -332,71 +324,17 @@ class NCSNppTesterConfig(BaseModelTesterConfig):
                "SkipUpBlock2D",
            ],
        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

-    def get_dummy_inputs(self):
-        batch_size = 4
-        num_channels = 3
-        sizes = (32, 32)
-
-        return {
-            "sample": floats_tensor((batch_size, num_channels) + sizes).to(torch_device),
-            "timestep": torch.tensor(batch_size * [10]).to(dtype=torch.int32, device=torch_device),
-        }
-
-
-class TestNCSNpp(NCSNppTesterConfig, ModelTesterMixin, UNetTesterMixin):
-    @pytest.mark.skip("Test not supported.")
-    def test_forward_with_norm_groups(self):
-        pass
-
-    @pytest.mark.skip(
-        "To make layerwise casting work with this model, we will have to update the implementation. "
-        "Due to potentially low usage, we don't support it here."
-    )
-    def test_keep_in_fp32_modules(self):
-        pass
-
-    @pytest.mark.skip(
-        "To make layerwise casting work with this model, we will have to update the implementation. "
-        "Due to potentially low usage, we don't support it here."
-    )
-    def test_from_save_pretrained_dtype_inference(self):
-        pass
-
-
-class TestNCSNppMemory(NCSNppTesterConfig, MemoryTesterMixin):
-    @pytest.mark.skip(
-        "To make layerwise casting work with this model, we will have to update the implementation. "
-        "Due to potentially low usage, we don't support it here."
-    )
-    def test_layerwise_casting_memory(self):
-        pass
-
-    @pytest.mark.skip(
-        "To make layerwise casting work with this model, we will have to update the implementation. "
-        "Due to potentially low usage, we don't support it here."
-    )
-    def test_layerwise_casting_training(self):
-        pass
-
-
-class TestNCSNppTraining(NCSNppTesterConfig, TrainingTesterMixin):
-    def test_gradient_checkpointing_is_applied(self):
-        expected_set = {
-            "UNetMidBlock2D",
-        }
-        super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
-
-
-class TestNCSNppHubLoading(NCSNppTesterConfig):
    @slow
    def test_from_pretrained_hub(self):
        model, loading_info = UNet2DModel.from_pretrained("google/ncsnpp-celebahq-256", output_loading_info=True)
-        assert model is not None
-        assert len(loading_info["missing_keys"]) == 0
+        self.assertIsNotNone(model)
+        self.assertEqual(len(loading_info["missing_keys"]), 0)

        model.to(torch_device)
-        inputs = self.get_dummy_inputs()
+        inputs = self.dummy_input
        noise = floats_tensor((4, 3) + (256, 256)).to(torch_device)
        inputs["sample"] = noise
        image = model(**inputs)
@@ -423,7 +361,7 @@ class TestNCSNppHubLoading(NCSNppTesterConfig):
        expected_output_slice = torch.tensor([-4836.2178, -6487.1470, -3816.8196, -7964.9302, -10966.3037, -20043.5957, 8137.0513, 2340.3328, 544.6056])
        # fmt: on

-        assert torch_all_close(output_slice, expected_output_slice, rtol=1e-2)
+        self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-2))

    def test_output_pretrained_ve_large(self):
        model = UNet2DModel.from_pretrained("fusing/ncsnpp-ffhq-ve-dummy-update")
@@ -444,4 +382,35 @@ class TestNCSNppHubLoading(NCSNppTesterConfig):
        expected_output_slice = torch.tensor([-0.0325, -0.0900, -0.0869, -0.0332, -0.0725, -0.0270, -0.0101, 0.0227, 0.0256])
        # fmt: on

-        assert torch_all_close(output_slice, expected_output_slice, rtol=1e-2)
+        self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-2))
+
+    @unittest.skip("Test not supported.")
+    def test_forward_with_norm_groups(self):
+        # not required for this model
+        pass
+
+    def test_gradient_checkpointing_is_applied(self):
+        expected_set = {
+            "UNetMidBlock2D",
+        }
+
+        block_out_channels = (32, 64, 64, 64)
+
+        super().test_gradient_checkpointing_is_applied(
+            expected_set=expected_set, block_out_channels=block_out_channels
+        )
+
+    def test_effective_gradient_checkpointing(self):
+        super().test_effective_gradient_checkpointing(skip={"time_proj.weight"})
+
+    @unittest.skip(
+        "To make layerwise casting work with this model, we will have to update the implementation. Due to potentially low usage, we don't support it here."
+    )
+    def test_layerwise_casting_inference(self):
+        pass
+
+    @unittest.skip(
+        "To make layerwise casting work with this model, we will have to update the implementation. Due to potentially low usage, we don't support it here."
+    )
+    def test_layerwise_casting_memory(self):
+        pass
--- a/tests/models/unets/test_models_unet_2d_condition.py
+++ b/tests/models/unets/test_models_unet_2d_condition.py
@@ -20,7 +20,6 @@ import tempfile
 import unittest
 from collections import OrderedDict

-import pytest
 import torch
 from huggingface_hub import snapshot_download
 from parameterized import parameterized
@@ -53,24 +52,17 @@ from ...testing_utils import (
    torch_all_close,
    torch_device,
 )
-from ..test_modeling_common import UNetTesterMixin
-from ..testing_utils import (
-    AttentionTesterMixin,
-    BaseModelTesterConfig,
-    IPAdapterTesterMixin,
+from ..test_modeling_common import (
    LoraHotSwappingForModelTesterMixin,
-    LoraTesterMixin,
-    MemoryTesterMixin,
    ModelTesterMixin,
    TorchCompileTesterMixin,
-    TrainingTesterMixin,
+    UNetTesterMixin,
 )


 if is_peft_available():
    from peft import LoraConfig
-
-    from ..testing_utils.lora import check_if_lora_correctly_set
+    from peft.tuners.tuners_utils import BaseTunerLayer


 logger = logging.get_logger(__name__)
@@ -90,6 +82,16 @@ def get_unet_lora_config():
    return unet_lora_config


+def check_if_lora_correctly_set(model) -> bool:
+    """
+    Checks if the LoRA layers are correctly set with peft
+    """
+    for module in model.modules():
+        if isinstance(module, BaseTunerLayer):
+            return True
+    return False
+
+
 def create_ip_adapter_state_dict(model):
    # "ip_adapter" (cross-attention weights)
    ip_cross_attn_state_dict = {}
@@ -352,28 +354,34 @@ def create_custom_diffusion_layers(model, mock_weights: bool = True):
    return custom_diffusion_attn_procs


-class UNet2DConditionTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNet2DConditionModel testing."""
+class UNet2DConditionModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNet2DConditionModel
+    main_input_name = "sample"
+    # We override the items here because the unet under consideration is small.
+    model_split_percents = [0.5, 0.34, 0.4]

    @property
-    def model_class(self):
-        return UNet2DConditionModel
+    def dummy_input(self):
+        batch_size = 4
+        num_channels = 4
+        sizes = (16, 16)
+
+        noise = floats_tensor((batch_size, num_channels) + sizes).to(torch_device)
+        time_step = torch.tensor([10]).to(torch_device)
+        encoder_hidden_states = floats_tensor((batch_size, 4, 8)).to(torch_device)
+
+        return {"sample": noise, "timestep": time_step, "encoder_hidden_states": encoder_hidden_states}

    @property
-    def output_shape(self) -> tuple[int, int, int]:
+    def input_shape(self):
        return (4, 16, 16)

    @property
-    def model_split_percents(self) -> list[float]:
-        return [0.5, 0.34, 0.4]
+    def output_shape(self):
+        return (4, 16, 16)

-    @property
-    def main_input_name(self) -> str:
-        return "sample"
-
-    def get_init_dict(self) -> dict:
-        """Return UNet2D model initialization arguments."""
-        return {
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "block_out_channels": (4, 8),
            "norm_num_groups": 4,
            "down_block_types": ("CrossAttnDownBlock2D", "DownBlock2D"),
@@ -385,24 +393,26 @@ class UNet2DConditionTesterConfig(BaseModelTesterConfig):
            "layers_per_block": 1,
            "sample_size": 16,
        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

-    def get_dummy_inputs(self) -> dict[str, torch.Tensor]:
-        """Return dummy inputs for UNet2D model."""
-        batch_size = 4
-        num_channels = 4
-        sizes = (16, 16)
+    @unittest.skipIf(
+        torch_device != "cuda" or not is_xformers_available(),
+        reason="XFormers attention is only available with CUDA and `xformers` installed",
+    )
+    def test_xformers_enable_works(self):
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        model = self.model_class(**init_dict)

-        return {
-            "sample": floats_tensor((batch_size, num_channels) + sizes).to(torch_device),
-            "timestep": torch.tensor([10]).to(torch_device),
-            "encoder_hidden_states": floats_tensor((batch_size, 4, 8)).to(torch_device),
-        }
+        model.enable_xformers_memory_efficient_attention()

+        assert (
+            model.mid_block.attentions[0].transformer_blocks[0].attn1.processor.__class__.__name__
+            == "XFormersAttnProcessor"
+        ), "xformers is not enabled"

-class TestUNet2DCondition(UNet2DConditionTesterConfig, ModelTesterMixin, UNetTesterMixin):
    def test_model_with_attention_head_dim_tuple(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -417,13 +427,12 @@ class TestUNet2DCondition(UNet2DConditionTesterConfig, ModelTesterMixin, UNetTes
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_model_with_use_linear_projection(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["use_linear_projection"] = True

@@ -437,13 +446,12 @@ class TestUNet2DCondition(UNet2DConditionTesterConfig, ModelTesterMixin, UNetTes
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_model_with_cross_attention_dim_tuple(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["cross_attention_dim"] = (8, 8)

@@ -457,13 +465,12 @@ class TestUNet2DCondition(UNet2DConditionTesterConfig, ModelTesterMixin, UNetTes
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_model_with_simple_projection(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        batch_size, _, _, sample_size = inputs_dict["sample"].shape

@@ -482,13 +489,12 @@ class TestUNet2DCondition(UNet2DConditionTesterConfig, ModelTesterMixin, UNetTes
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_model_with_class_embeddings_concat(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        batch_size, _, _, sample_size = inputs_dict["sample"].shape

@@ -508,287 +514,12 @@ class TestUNet2DCondition(UNet2DConditionTesterConfig, ModelTesterMixin, UNetTes
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
-
-    # see diffusers.models.attention_processor::Attention#prepare_attention_mask
-    # note: we may not need to fix mask padding to work for stable-diffusion cross-attn masks.
-    # since the use-case (somebody passes in a too-short cross-attn mask) is pretty small,
-    # maybe it's fine that this only works for the unclip use-case.
-    @mark.skip(
-        reason="we currently pad mask by target_length tokens (what unclip needs), whereas stable-diffusion's cross-attn needs to instead pad by remaining_length."
-    )
-    def test_model_xattn_padding(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
-
-        model = self.model_class(**{**init_dict, "attention_head_dim": (8, 16)})
-        model.to(torch_device)
-        model.eval()
-
-        cond = inputs_dict["encoder_hidden_states"]
-        with torch.no_grad():
-            full_cond_out = model(**inputs_dict).sample
-            assert full_cond_out is not None
-
-            batch, tokens, _ = cond.shape
-            keeplast_mask = (torch.arange(tokens) == tokens - 1).expand(batch, -1).to(cond.device, torch.bool)
-            keeplast_out = model(**{**inputs_dict, "encoder_attention_mask": keeplast_mask}).sample
-            assert not keeplast_out.allclose(full_cond_out), "a 'keep last token' mask should change the result"
-
-            trunc_mask = torch.zeros(batch, tokens - 1, device=cond.device, dtype=torch.bool)
-            trunc_mask_out = model(**{**inputs_dict, "encoder_attention_mask": trunc_mask}).sample
-            assert trunc_mask_out.allclose(keeplast_out), (
-                "a mask with fewer tokens than condition, will be padded with 'keep' tokens. a 'discard-all' mask missing the final token is thus equivalent to a 'keep last' mask."
-            )
-
-    def test_pickle(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
-
-        init_dict["block_out_channels"] = (16, 32)
-        init_dict["attention_head_dim"] = (8, 16)
-
-        model = self.model_class(**init_dict)
-        model.to(torch_device)
-
-        with torch.no_grad():
-            sample = model(**inputs_dict).sample
-
-        sample_copy = copy.copy(sample)
-
-        assert (sample - sample_copy).abs().max() < 1e-4
-
-    def test_asymmetrical_unet(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
-        # Add asymmetry to configs
-        init_dict["transformer_layers_per_block"] = [[3, 2], 1]
-        init_dict["reverse_transformer_layers_per_block"] = [[3, 4], 1]
-
-        torch.manual_seed(0)
-        model = self.model_class(**init_dict)
-        model.to(torch_device)
-
-        output = model(**inputs_dict).sample
-        expected_shape = inputs_dict["sample"].shape
-
-        # Check if input and output shapes are the same
-        assert output.shape == expected_shape, "Input and output shapes do not match"
-
-
-class TestUNet2DConditionHubLoading(UNet2DConditionTesterConfig):
-    """Hub checkpoint loading tests for UNet2DConditionModel."""
-
-    @parameterized.expand(
-        [
-            ("hf-internal-testing/unet2d-sharded-dummy", None),
-            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format", "fp16"),
-        ]
-    )
-    @require_torch_accelerator
-    def test_load_sharded_checkpoint_from_hub(self, repo_id, variant):
-        inputs_dict = self.get_dummy_inputs()
-        loaded_model = self.model_class.from_pretrained(repo_id, variant=variant)
-        loaded_model = loaded_model.to(torch_device)
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-    @parameterized.expand(
-        [
-            ("hf-internal-testing/unet2d-sharded-dummy-subfolder", None),
-            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format-subfolder", "fp16"),
-        ]
-    )
-    @require_torch_accelerator
-    def test_load_sharded_checkpoint_from_hub_subfolder(self, repo_id, variant):
-        inputs_dict = self.get_dummy_inputs()
-        loaded_model = self.model_class.from_pretrained(repo_id, subfolder="unet", variant=variant)
-        loaded_model = loaded_model.to(torch_device)
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-    @require_torch_accelerator
-    def test_load_sharded_checkpoint_from_hub_local(self):
-        inputs_dict = self.get_dummy_inputs()
-        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy")
-        loaded_model = self.model_class.from_pretrained(ckpt_path, local_files_only=True)
-        loaded_model = loaded_model.to(torch_device)
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-    @require_torch_accelerator
-    def test_load_sharded_checkpoint_from_hub_local_subfolder(self):
-        inputs_dict = self.get_dummy_inputs()
-        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy-subfolder")
-        loaded_model = self.model_class.from_pretrained(ckpt_path, subfolder="unet", local_files_only=True)
-        loaded_model = loaded_model.to(torch_device)
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-    @require_torch_accelerator
-    @parameterized.expand(
-        [
-            ("hf-internal-testing/unet2d-sharded-dummy", None),
-            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format", "fp16"),
-        ]
-    )
-    def test_load_sharded_checkpoint_device_map_from_hub(self, repo_id, variant):
-        inputs_dict = self.get_dummy_inputs()
-        loaded_model = self.model_class.from_pretrained(repo_id, variant=variant, device_map="auto")
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-    @require_torch_accelerator
-    @parameterized.expand(
-        [
-            ("hf-internal-testing/unet2d-sharded-dummy-subfolder", None),
-            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format-subfolder", "fp16"),
-        ]
-    )
-    def test_load_sharded_checkpoint_device_map_from_hub_subfolder(self, repo_id, variant):
-        inputs_dict = self.get_dummy_inputs()
-        loaded_model = self.model_class.from_pretrained(repo_id, variant=variant, subfolder="unet", device_map="auto")
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-    @require_torch_accelerator
-    def test_load_sharded_checkpoint_device_map_from_hub_local(self):
-        inputs_dict = self.get_dummy_inputs()
-        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy")
-        loaded_model = self.model_class.from_pretrained(ckpt_path, local_files_only=True, device_map="auto")
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-    @require_torch_accelerator
-    def test_load_sharded_checkpoint_device_map_from_hub_local_subfolder(self):
-        inputs_dict = self.get_dummy_inputs()
-        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy-subfolder")
-        loaded_model = self.model_class.from_pretrained(
-            ckpt_path, local_files_only=True, subfolder="unet", device_map="auto"
-        )
-        new_output = loaded_model(**inputs_dict)
-
-        assert loaded_model
-        assert new_output.sample.shape == (4, 4, 16, 16)
-
-
-class TestUNet2DConditionLoRA(UNet2DConditionTesterConfig, LoraTesterMixin):
-    """LoRA adapter tests for UNet2DConditionModel."""
-
-    @require_peft_backend
-    def test_load_attn_procs_raise_warning(self):
-        """Test that deprecated load_attn_procs method raises FutureWarning."""
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
-        model = self.model_class(**init_dict)
-        model.to(torch_device)
-
-        # forward pass without LoRA
-        with torch.no_grad():
-            non_lora_sample = model(**inputs_dict).sample
-
-        unet_lora_config = get_unet_lora_config()
-        model.add_adapter(unet_lora_config)
-
-        assert check_if_lora_correctly_set(model), "Lora not correctly set in UNet."
-
-        # forward pass with LoRA
-        with torch.no_grad():
-            lora_sample_1 = model(**inputs_dict).sample
-
-        with tempfile.TemporaryDirectory() as tmpdirname:
-            model.save_attn_procs(tmpdirname)
-            model.unload_lora()
-
-            with pytest.warns(FutureWarning, match="Using the `load_attn_procs\\(\\)` method has been deprecated"):
-                model.load_attn_procs(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
-
-            # import to still check for the rest of the stuff.
-            assert check_if_lora_correctly_set(model), "Lora not correctly set in UNet."
-
-            with torch.no_grad():
-                lora_sample_2 = model(**inputs_dict).sample
-
-        assert not torch.allclose(non_lora_sample, lora_sample_1, atol=1e-4, rtol=1e-4), (
-            "LoRA injected UNet should produce different results."
-        )
-        assert torch.allclose(lora_sample_1, lora_sample_2, atol=1e-4, rtol=1e-4), (
-            "Loading from a saved checkpoint should produce identical results."
-        )
-
-    @require_peft_backend
-    def test_save_attn_procs_raise_warning(self):
-        """Test that deprecated save_attn_procs method raises FutureWarning."""
-        init_dict = self.get_init_dict()
-        model = self.model_class(**init_dict)
-        model.to(torch_device)
-
-        unet_lora_config = get_unet_lora_config()
-        model.add_adapter(unet_lora_config)
-
-        assert check_if_lora_correctly_set(model), "Lora not correctly set in UNet."
-
-        with tempfile.TemporaryDirectory() as tmpdirname:
-            with pytest.warns(FutureWarning, match="Using the `save_attn_procs\\(\\)` method has been deprecated"):
-                model.save_attn_procs(os.path.join(tmpdirname))
-
-
-class TestUNet2DConditionMemory(UNet2DConditionTesterConfig, MemoryTesterMixin):
-    """Memory optimization tests for UNet2DConditionModel."""
-
-
-class TestUNet2DConditionTraining(UNet2DConditionTesterConfig, TrainingTesterMixin):
-    """Training tests for UNet2DConditionModel."""
-
-    def test_gradient_checkpointing_is_applied(self):
-        expected_set = {
-            "CrossAttnUpBlock2D",
-            "CrossAttnDownBlock2D",
-            "UNetMidBlock2DCrossAttn",
-            "UpBlock2D",
-            "Transformer2DModel",
-            "DownBlock2D",
-        }
-        super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
-
-
-class TestUNet2DConditionAttention(UNet2DConditionTesterConfig, AttentionTesterMixin):
-    """Attention processor tests for UNet2DConditionModel."""
-
-    @unittest.skipIf(
-        torch_device != "cuda" or not is_xformers_available(),
-        reason="XFormers attention is only available with CUDA and `xformers` installed",
-    )
-    def test_xformers_enable_works(self):
-        init_dict = self.get_init_dict()
-        model = self.model_class(**init_dict)
-
-        model.enable_xformers_memory_efficient_attention()
-
-        assert (
-            model.mid_block.attentions[0].transformer_blocks[0].attn1.processor.__class__.__name__
-            == "XFormersAttnProcessor"
-        ), "xformers is not enabled"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_model_attention_slicing(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -813,7 +544,7 @@ class TestUNet2DConditionAttention(UNet2DConditionTesterConfig, AttentionTesterM
        assert output is not None

    def test_model_sliceable_head_dim(self):
-        init_dict = self.get_init_dict()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -831,6 +562,21 @@ class TestUNet2DConditionAttention(UNet2DConditionTesterConfig, AttentionTesterM
        for module in model.children():
            check_sliceable_dim_attr(module)

+    def test_gradient_checkpointing_is_applied(self):
+        expected_set = {
+            "CrossAttnUpBlock2D",
+            "CrossAttnDownBlock2D",
+            "UNetMidBlock2DCrossAttn",
+            "UpBlock2D",
+            "Transformer2DModel",
+            "DownBlock2D",
+        }
+        attention_head_dim = (8, 16)
+        block_out_channels = (16, 32)
+        super().test_gradient_checkpointing_is_applied(
+            expected_set=expected_set, attention_head_dim=attention_head_dim, block_out_channels=block_out_channels
+        )
+
    def test_special_attn_proc(self):
        class AttnEasyProc(torch.nn.Module):
            def __init__(self, num):
@@ -872,8 +618,7 @@ class TestUNet2DConditionAttention(UNet2DConditionTesterConfig, AttentionTesterM
                return hidden_states

        # enable deterministic behavior for gradient checkpointing
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -900,8 +645,7 @@ class TestUNet2DConditionAttention(UNet2DConditionTesterConfig, AttentionTesterM
        ]
    )
    def test_model_xattn_mask(self, mask_dtype):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        model = self.model_class(**{**init_dict, "attention_head_dim": (8, 16), "block_out_channels": (16, 32)})
        model.to(torch_device)
@@ -931,13 +675,39 @@ class TestUNet2DConditionAttention(UNet2DConditionTesterConfig, AttentionTesterM
                "masking the last token from our cond should be equivalent to truncating that token out of the condition"
            )

+    # see diffusers.models.attention_processor::Attention#prepare_attention_mask
+    # note: we may not need to fix mask padding to work for stable-diffusion cross-attn masks.
+    # since the use-case (somebody passes in a too-short cross-attn mask) is pretty esoteric,
+    # maybe it's fine that this only works for the unclip use-case.
+    @mark.skip(
+        reason="we currently pad mask by target_length tokens (what unclip needs), whereas stable-diffusion's cross-attn needs to instead pad by remaining_length."
+    )
+    def test_model_xattn_padding(self):
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

-class TestUNet2DConditionCustomDiffusion(UNet2DConditionTesterConfig):
-    """Custom Diffusion processor tests for UNet2DConditionModel."""
+        model = self.model_class(**{**init_dict, "attention_head_dim": (8, 16)})
+        model.to(torch_device)
+        model.eval()
+
+        cond = inputs_dict["encoder_hidden_states"]
+        with torch.no_grad():
+            full_cond_out = model(**inputs_dict).sample
+            assert full_cond_out is not None
+
+            batch, tokens, _ = cond.shape
+            keeplast_mask = (torch.arange(tokens) == tokens - 1).expand(batch, -1).to(cond.device, torch.bool)
+            keeplast_out = model(**{**inputs_dict, "encoder_attention_mask": keeplast_mask}).sample
+            assert not keeplast_out.allclose(full_cond_out), "a 'keep last token' mask should change the result"
+
+            trunc_mask = torch.zeros(batch, tokens - 1, device=cond.device, dtype=torch.bool)
+            trunc_mask_out = model(**{**inputs_dict, "encoder_attention_mask": trunc_mask}).sample
+            assert trunc_mask_out.allclose(keeplast_out), (
+                "a mask with fewer tokens than condition, will be padded with 'keep' tokens. a 'discard-all' mask missing the final token is thus equivalent to a 'keep last' mask."
+            )

    def test_custom_diffusion_processors(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        # build the shared init args and dummy inputs; the config is overridden below
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -963,8 +733,8 @@ class TestUNet2DConditionCustomDiffusion(UNet2DConditionTesterConfig):
        assert (sample1 - sample2).abs().max() < 3e-3

    def test_custom_diffusion_save_load(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        # build the shared init args and dummy inputs; the config is overridden below
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -984,7 +754,7 @@ class TestUNet2DConditionCustomDiffusion(UNet2DConditionTesterConfig):

        with tempfile.TemporaryDirectory() as tmpdirname:
            model.save_attn_procs(tmpdirname, safe_serialization=False)
-            assert os.path.isfile(os.path.join(tmpdirname, "pytorch_custom_diffusion_weights.bin"))
+            self.assertTrue(os.path.isfile(os.path.join(tmpdirname, "pytorch_custom_diffusion_weights.bin")))
            torch.manual_seed(0)
            new_model = self.model_class(**init_dict)
            new_model.load_attn_procs(tmpdirname, weight_name="pytorch_custom_diffusion_weights.bin")
@@ -1003,8 +773,8 @@ class TestUNet2DConditionCustomDiffusion(UNet2DConditionTesterConfig):
        reason="XFormers attention is only available with CUDA and `xformers` installed",
    )
    def test_custom_diffusion_xformers_on_off(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        # build the shared init args and dummy inputs; the config is overridden below
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -1028,28 +798,41 @@ class TestUNet2DConditionCustomDiffusion(UNet2DConditionTesterConfig):
        assert (sample - on_sample).abs().max() < 1e-4
        assert (sample - off_sample).abs().max() < 1e-4

+    def test_pickle(self):
+        # build the shared init args and dummy inputs; the config is overridden below
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

-class TestUNet2DConditionIPAdapter(UNet2DConditionTesterConfig, IPAdapterTesterMixin):
-    """IP Adapter tests for UNet2DConditionModel."""
+        init_dict["block_out_channels"] = (16, 32)
+        init_dict["attention_head_dim"] = (8, 16)

-    @property
-    def ip_adapter_processor_cls(self):
-        return (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0)
+        model = self.model_class(**init_dict)
+        model.to(torch_device)

-    def create_ip_adapter_state_dict(self, model):
-        return create_ip_adapter_state_dict(model)
+        with torch.no_grad():
+            sample = model(**inputs_dict).sample

-    def modify_inputs_for_ip_adapter(self, model, inputs_dict):
-        batch_size = inputs_dict["encoder_hidden_states"].shape[0]
-        # for ip-adapter image_embeds has shape [batch_size, num_image, embed_dim]
-        cross_attention_dim = getattr(model.config, "cross_attention_dim", 8)
-        image_embeds = floats_tensor((batch_size, 1, cross_attention_dim)).to(torch_device)
-        inputs_dict["added_cond_kwargs"] = {"image_embeds": [image_embeds]}
-        return inputs_dict
+        sample_copy = copy.copy(sample)
+
+        assert (sample - sample_copy).abs().max() < 1e-4
+
+    def test_asymmetrical_unet(self):
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        # Add asymmetry to configs
+        init_dict["transformer_layers_per_block"] = [[3, 2], 1]
+        init_dict["reverse_transformer_layers_per_block"] = [[3, 4], 1]
+
+        torch.manual_seed(0)
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+
+        output = model(**inputs_dict).sample
+        expected_shape = inputs_dict["sample"].shape
+
+        # Check if input and output shapes are the same
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_ip_adapter(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -1122,8 +905,7 @@ class TestUNet2DConditionIPAdapter(UNet2DConditionTesterConfig, IPAdapterTesterM
        assert sample2.allclose(sample6, atol=1e-4, rtol=1e-4)

    def test_ip_adapter_plus(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = (8, 16)
@@ -1195,16 +977,185 @@ class TestUNet2DConditionIPAdapter(UNet2DConditionTesterConfig, IPAdapterTesterM
        assert sample2.allclose(sample5, atol=1e-4, rtol=1e-4)
        assert sample2.allclose(sample6, atol=1e-4, rtol=1e-4)

+    @parameterized.expand(
+        [
+            ("hf-internal-testing/unet2d-sharded-dummy", None),
+            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format", "fp16"),
+        ]
+    )
+    @require_torch_accelerator
+    def test_load_sharded_checkpoint_from_hub(self, repo_id, variant):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        loaded_model = self.model_class.from_pretrained(repo_id, variant=variant)
+        loaded_model = loaded_model.to(torch_device)
+        new_output = loaded_model(**inputs_dict)

-class TestUNet2DConditionModelCompile(UNet2DConditionTesterConfig, TorchCompileTesterMixin):
-    """Torch compile tests for UNet2DConditionModel."""
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)

-    def test_torch_compile_repeated_blocks(self):
-        return super().test_torch_compile_repeated_blocks(recompile_limit=2)
+    @parameterized.expand(
+        [
+            ("hf-internal-testing/unet2d-sharded-dummy-subfolder", None),
+            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format-subfolder", "fp16"),
+        ]
+    )
+    @require_torch_accelerator
+    def test_load_sharded_checkpoint_from_hub_subfolder(self, repo_id, variant):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        loaded_model = self.model_class.from_pretrained(repo_id, subfolder="unet", variant=variant)
+        loaded_model = loaded_model.to(torch_device)
+        new_output = loaded_model(**inputs_dict)
+
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)
+
+    @require_torch_accelerator
+    def test_load_sharded_checkpoint_from_hub_local(self):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy")
+        loaded_model = self.model_class.from_pretrained(ckpt_path, local_files_only=True)
+        loaded_model = loaded_model.to(torch_device)
+        new_output = loaded_model(**inputs_dict)
+
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)
+
+    @require_torch_accelerator
+    def test_load_sharded_checkpoint_from_hub_local_subfolder(self):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy-subfolder")
+        loaded_model = self.model_class.from_pretrained(ckpt_path, subfolder="unet", local_files_only=True)
+        loaded_model = loaded_model.to(torch_device)
+        new_output = loaded_model(**inputs_dict)
+
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)
+
+    @require_torch_accelerator
+    @parameterized.expand(
+        [
+            ("hf-internal-testing/unet2d-sharded-dummy", None),
+            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format", "fp16"),
+        ]
+    )
+    def test_load_sharded_checkpoint_device_map_from_hub(self, repo_id, variant):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        loaded_model = self.model_class.from_pretrained(repo_id, variant=variant, device_map="auto")
+        new_output = loaded_model(**inputs_dict)
+
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)
+
+    @require_torch_accelerator
+    @parameterized.expand(
+        [
+            ("hf-internal-testing/unet2d-sharded-dummy-subfolder", None),
+            ("hf-internal-testing/tiny-sd-unet-sharded-latest-format-subfolder", "fp16"),
+        ]
+    )
+    def test_load_sharded_checkpoint_device_map_from_hub_subfolder(self, repo_id, variant):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        loaded_model = self.model_class.from_pretrained(repo_id, variant=variant, subfolder="unet", device_map="auto")
+        new_output = loaded_model(**inputs_dict)
+
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)
+
+    @require_torch_accelerator
+    def test_load_sharded_checkpoint_device_map_from_hub_local(self):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy")
+        loaded_model = self.model_class.from_pretrained(ckpt_path, local_files_only=True, device_map="auto")
+        new_output = loaded_model(**inputs_dict)
+
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)
+
+    @require_torch_accelerator
+    def test_load_sharded_checkpoint_device_map_from_hub_local_subfolder(self):
+        _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        ckpt_path = snapshot_download("hf-internal-testing/unet2d-sharded-dummy-subfolder")
+        loaded_model = self.model_class.from_pretrained(
+            ckpt_path, local_files_only=True, subfolder="unet", device_map="auto"
+        )
+        new_output = loaded_model(**inputs_dict)
+
+        assert loaded_model
+        assert new_output.sample.shape == (4, 4, 16, 16)
+
+    @require_peft_backend
+    def test_load_attn_procs_raise_warning(self):
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+
+        # forward pass without LoRA
+        with torch.no_grad():
+            non_lora_sample = model(**inputs_dict).sample
+
+        unet_lora_config = get_unet_lora_config()
+        model.add_adapter(unet_lora_config)
+
+        assert check_if_lora_correctly_set(model), "Lora not correctly set in UNet."
+
+        # forward pass with LoRA
+        with torch.no_grad():
+            lora_sample_1 = model(**inputs_dict).sample
+
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            model.save_attn_procs(tmpdirname)
+            model.unload_lora()
+
+            with self.assertWarns(FutureWarning) as warning:
+                model.load_attn_procs(os.path.join(tmpdirname, "pytorch_lora_weights.safetensors"))
+
+            warning_message = str(warning.warnings[0].message)
+            assert "Using the `load_attn_procs()` method has been deprecated" in warning_message
+
+            # important to still check for the rest of the stuff.
+            assert check_if_lora_correctly_set(model), "Lora not correctly set in UNet."
+
+            with torch.no_grad():
+                lora_sample_2 = model(**inputs_dict).sample
+
+        assert not torch.allclose(non_lora_sample, lora_sample_1, atol=1e-4, rtol=1e-4), (
+            "LoRA injected UNet should produce different results."
+        )
+        assert torch.allclose(lora_sample_1, lora_sample_2, atol=1e-4, rtol=1e-4), (
+            "Loading from a saved checkpoint should produce identical results."
+        )
+
+    @require_peft_backend
+    def test_save_attn_procs_raise_warning(self):
+        init_dict, _ = self.prepare_init_args_and_inputs_for_common()
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+
+        unet_lora_config = get_unet_lora_config()
+        model.add_adapter(unet_lora_config)
+
+        assert check_if_lora_correctly_set(model), "Lora not correctly set in UNet."
+
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            with self.assertWarns(FutureWarning) as warning:
+                model.save_attn_procs(tmpdirname)
+
+        warning_message = str(warning.warnings[0].message)
+        assert "Using the `save_attn_procs()` method has been deprecated" in warning_message


-class TestUNet2DConditionModelLoRAHotSwap(UNet2DConditionTesterConfig, LoraHotSwappingForModelTesterMixin):
-    """LoRA hot-swapping tests for UNet2DConditionModel."""
+class UNet2DConditionModelCompileTests(TorchCompileTesterMixin, unittest.TestCase):
+    model_class = UNet2DConditionModel
+
+    def prepare_init_args_and_inputs_for_common(self):
+        return UNet2DConditionModelTests().prepare_init_args_and_inputs_for_common()
+
+
+class UNet2DConditionModelLoRAHotSwapTests(LoraHotSwappingForModelTesterMixin, unittest.TestCase):
+    model_class = UNet2DConditionModel
+
+    def prepare_init_args_and_inputs_for_common(self):
+        return UNet2DConditionModelTests().prepare_init_args_and_inputs_for_common()


@slow
--- a/tests/models/unets/test_models_unet_3d_condition.py
+++ b/tests/models/unets/test_models_unet_3d_condition.py
@@ -18,44 +18,47 @@ import unittest
 import numpy as np
 import torch

-from diffusers import UNet3DConditionModel
+from diffusers.models import ModelMixin, UNet3DConditionModel
+from diffusers.utils import logging
 from diffusers.utils.import_utils import is_xformers_available

-from ...testing_utils import (
-    enable_full_determinism,
-    floats_tensor,
-    skip_mps,
-    torch_device,
-)
-from ..test_modeling_common import UNetTesterMixin
-from ..testing_utils import (
-    AttentionTesterMixin,
-    BaseModelTesterConfig,
-    ModelTesterMixin,
-)
+from ...testing_utils import enable_full_determinism, floats_tensor, skip_mps, torch_device
+from ..test_modeling_common import ModelTesterMixin, UNetTesterMixin


 enable_full_determinism()

+logger = logging.get_logger(__name__)
+

@skip_mps
-class UNet3DConditionTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNet3DConditionModel testing."""
+class UNet3DConditionModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNet3DConditionModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNet3DConditionModel
+    def dummy_input(self):
+        batch_size = 4
+        num_channels = 4
+        num_frames = 4
+        sizes = (16, 16)
+
+        noise = floats_tensor((batch_size, num_channels, num_frames) + sizes).to(torch_device)
+        time_step = torch.tensor([10]).to(torch_device)
+        encoder_hidden_states = floats_tensor((batch_size, 4, 8)).to(torch_device)
+
+        return {"sample": noise, "timestep": time_step, "encoder_hidden_states": encoder_hidden_states}
+
+    @property
+    def input_shape(self):
+        return (4, 4, 16, 16)

    @property
    def output_shape(self):
        return (4, 4, 16, 16)

-    @property
-    def main_input_name(self):
-        return "sample"
-
-    def get_init_dict(self):
-        return {
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "block_out_channels": (4, 8),
            "norm_num_groups": 4,
            "down_block_types": (
@@ -70,25 +73,27 @@ class UNet3DConditionTesterConfig(BaseModelTesterConfig):
            "layers_per_block": 1,
            "sample_size": 16,
        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

-    def get_dummy_inputs(self):
-        batch_size = 4
-        num_channels = 4
-        num_frames = 4
-        sizes = (16, 16)
+    @unittest.skipIf(
+        torch_device != "cuda" or not is_xformers_available(),
+        reason="XFormers attention is only available with CUDA and `xformers` installed",
+    )
+    def test_xformers_enable_works(self):
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        model = self.model_class(**init_dict)

-        return {
-            "sample": floats_tensor((batch_size, num_channels, num_frames) + sizes).to(torch_device),
-            "timestep": torch.tensor([10]).to(torch_device),
-            "encoder_hidden_states": floats_tensor((batch_size, 4, 8)).to(torch_device),
-        }
+        model.enable_xformers_memory_efficient_attention()

+        assert (
+            model.mid_block.attentions[0].transformer_blocks[0].attn1.processor.__class__.__name__
+            == "XFormersAttnProcessor"
+        ), "xformers is not enabled"

-class TestUNet3DCondition(UNet3DConditionTesterConfig, ModelTesterMixin, UNetTesterMixin):
    # Overriding to set `norm_num_groups` needs to be different for this model.
    def test_forward_with_norm_groups(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
        init_dict["block_out_channels"] = (32, 64)
        init_dict["norm_num_groups"] = 32

@@ -102,74 +107,39 @@ class TestUNet3DCondition(UNet3DConditionTesterConfig, ModelTesterMixin, UNetTes
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    # Overriding since the UNet3D outputs a different structure.
-    @torch.no_grad()
    def test_determinism(self):
-        model = self.model_class(**self.get_init_dict())
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        model = self.model_class(**init_dict)
        model.to(torch_device)
        model.eval()

-        inputs_dict = self.get_dummy_inputs()
+        with torch.no_grad():
+            # Warmup pass when using mps (see #372)
+            if torch_device == "mps" and isinstance(model, ModelMixin):
+                model(**self.dummy_input)

-        first = model(**inputs_dict)
-        if isinstance(first, dict):
-            first = first.sample
+            first = model(**inputs_dict)
+            if isinstance(first, dict):
+                first = first.sample

-        second = model(**inputs_dict)
-        if isinstance(second, dict):
-            second = second.sample
+            second = model(**inputs_dict)
+            if isinstance(second, dict):
+                second = second.sample

        out_1 = first.cpu().numpy()
        out_2 = second.cpu().numpy()
        out_1 = out_1[~np.isnan(out_1)]
        out_2 = out_2[~np.isnan(out_2)]
        max_diff = np.amax(np.abs(out_1 - out_2))
-        assert max_diff <= 1e-5
-
-    def test_feed_forward_chunking(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
-        init_dict["block_out_channels"] = (32, 64)
-        init_dict["norm_num_groups"] = 32
-
-        model = self.model_class(**init_dict)
-        model.to(torch_device)
-        model.eval()
-
-        with torch.no_grad():
-            output = model(**inputs_dict)[0]
-
-        model.enable_forward_chunking()
-        with torch.no_grad():
-            output_2 = model(**inputs_dict)[0]
-
-        assert output.shape == output_2.shape, "Shape doesn't match"
-        assert np.abs(output.cpu() - output_2.cpu()).max() < 1e-2
-
-
-class TestUNet3DConditionAttention(UNet3DConditionTesterConfig, AttentionTesterMixin):
-    @unittest.skipIf(
-        torch_device != "cuda" or not is_xformers_available(),
-        reason="XFormers attention is only available with CUDA and `xformers` installed",
-    )
-    def test_xformers_enable_works(self):
-        init_dict = self.get_init_dict()
-        model = self.model_class(**init_dict)
-
-        model.enable_xformers_memory_efficient_attention()
-
-        assert (
-            model.mid_block.attentions[0].transformer_blocks[0].attn1.processor.__class__.__name__
-            == "XFormersAttnProcessor"
-        ), "xformers is not enabled"
+        self.assertLessEqual(max_diff, 1e-5)

    def test_model_attention_slicing(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["block_out_channels"] = (16, 32)
        init_dict["attention_head_dim"] = 8
@@ -192,3 +162,22 @@ class TestUNet3DConditionAttention(UNet3DConditionTesterConfig, AttentionTesterM
        with torch.no_grad():
            output = model(**inputs_dict)
        assert output is not None
+
+    def test_feed_forward_chunking(self):
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        init_dict["block_out_channels"] = (32, 64)
+        init_dict["norm_num_groups"] = 32
+
+        model = self.model_class(**init_dict)
+        model.to(torch_device)
+        model.eval()
+
+        with torch.no_grad():
+            output = model(**inputs_dict)[0]
+
+        model.enable_forward_chunking()
+        with torch.no_grad():
+            output_2 = model(**inputs_dict)[0]
+
+        self.assertEqual(output.shape, output_2.shape, "Shape doesn't match")
+        assert np.abs(output.cpu() - output_2.cpu()).max() < 1e-2
--- a/tests/models/unets/test_models_unet_controlnetxs.py
+++ b/tests/models/unets/test_models_unet_controlnetxs.py
@@ -13,42 +13,59 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import unittest
+
 import numpy as np
-import pytest
 import torch
 from torch import nn

 from diffusers import ControlNetXSAdapter, UNet2DConditionModel, UNetControlNetXSModel
+from diffusers.utils import logging

 from ...testing_utils import enable_full_determinism, floats_tensor, is_flaky, torch_device
-from ..test_modeling_common import UNetTesterMixin
-from ..testing_utils import (
-    BaseModelTesterConfig,
-    ModelTesterMixin,
-    TrainingTesterMixin,
-)
+from ..test_modeling_common import ModelTesterMixin, UNetTesterMixin


+logger = logging.get_logger(__name__)
+
 enable_full_determinism()


-class UNetControlNetXSTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNetControlNetXSModel testing."""
+class UNetControlNetXSModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNetControlNetXSModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNetControlNetXSModel
+    def dummy_input(self):
+        batch_size = 4
+        num_channels = 4
+        sizes = (16, 16)
+        conditioning_image_size = (3, 32, 32)  # size of additional, unprocessed image for control-conditioning
+
+        noise = floats_tensor((batch_size, num_channels) + sizes).to(torch_device)
+        time_step = torch.tensor([10]).to(torch_device)
+        encoder_hidden_states = floats_tensor((batch_size, 4, 8)).to(torch_device)
+        controlnet_cond = floats_tensor((batch_size, *conditioning_image_size)).to(torch_device)
+        conditioning_scale = 1
+
+        return {
+            "sample": noise,
+            "timestep": time_step,
+            "encoder_hidden_states": encoder_hidden_states,
+            "controlnet_cond": controlnet_cond,
+            "conditioning_scale": conditioning_scale,
+        }
+
+    @property
+    def input_shape(self):
+        return (4, 16, 16)

    @property
    def output_shape(self):
        return (4, 16, 16)

-    @property
-    def main_input_name(self):
-        return "sample"
-
-    def get_init_dict(self):
-        return {
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "sample_size": 16,
            "down_block_types": ("DownBlock2D", "CrossAttnDownBlock2D"),
            "up_block_types": ("CrossAttnUpBlock2D", "UpBlock2D"),
@@ -63,23 +80,11 @@ class UNetControlNetXSTesterConfig(BaseModelTesterConfig):
            "ctrl_max_norm_num_groups": 2,
            "ctrl_conditioning_embedding_out_channels": (2, 2),
        }
-
-    def get_dummy_inputs(self):
-        batch_size = 4
-        num_channels = 4
-        sizes = (16, 16)
-        conditioning_image_size = (3, 32, 32)
-
-        return {
-            "sample": floats_tensor((batch_size, num_channels) + sizes).to(torch_device),
-            "timestep": torch.tensor([10]).to(torch_device),
-            "encoder_hidden_states": floats_tensor((batch_size, 4, 8)).to(torch_device),
-            "controlnet_cond": floats_tensor((batch_size, *conditioning_image_size)).to(torch_device),
-            "conditioning_scale": 1,
-        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

    def get_dummy_unet(self):
-        """Build the underlying UNet for tests that construct UNetControlNetXSModel from UNet + Adapter."""
+        """For some tests we also need the underlying UNet. For these, we build the UNetControlNetXSModel from the UNet and the ControlNetXS-Adapter."""
        return UNet2DConditionModel(
            block_out_channels=(4, 8),
            layers_per_block=2,
@@ -94,16 +99,10 @@ class UNetControlNetXSTesterConfig(BaseModelTesterConfig):
        )

    def get_dummy_controlnet_from_unet(self, unet, **kwargs):
-        """Build the ControlNetXS-Adapter from a UNet."""
+        """For some tests we also need the underlying ControlNetXS-Adapter. For these, we build the UNetControlNetXSModel from the UNet and the ControlNetXS-Adapter."""
+        # size_ratio and conditioning_embedding_out_channels chosen to keep model small
        return ControlNetXSAdapter.from_unet(unet, size_ratio=1, conditioning_embedding_out_channels=(2, 2), **kwargs)

-
-class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetTesterMixin):
-    @pytest.mark.skip("Test not supported.")
-    def test_forward_with_norm_groups(self):
-        # UNetControlNetXSModel only supports SD/SDXL with norm_num_groups=32
-        pass
-
    def test_from_unet(self):
        unet = self.get_dummy_unet()
        controlnet = self.get_dummy_controlnet_from_unet(unet)
@@ -116,7 +115,7 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT
                assert torch.equal(model_state_dict[weight_dict_prefix + "." + param_name], param_value)

        # # check unet
-        # everything except down,mid,up blocks
+        # everything except down,mid,up blocks
        modules_from_unet = [
            "time_embedding",
            "conv_in",
@@ -153,7 +152,7 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT
                assert_equal_weights(u.upsamplers[0], f"up_blocks.{i}.upsamplers")

        # # check controlnet
-        # everything except down,mid,up blocks
+        # everything except down,mid,up blocks
        modules_from_controlnet = {
            "controlnet_cond_embedding": "controlnet_cond_embedding",
            "conv_in": "ctrl_conv_in",
@@ -194,12 +193,12 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT
            for p in module.parameters():
                assert p.requires_grad

-        init_dict = self.get_init_dict()
+        init_dict, _ = self.prepare_init_args_and_inputs_for_common()
        model = UNetControlNetXSModel(**init_dict)
        model.freeze_unet_params()

        # # check unet
-        # everything except down,mid,up blocks
+        # everything except down,mid,up blocks
        modules_from_unet = [
            model.base_time_embedding,
            model.base_conv_in,
@@ -237,7 +236,7 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT
                assert_frozen(u.upsamplers)

        # # check controlnet
-        # everything except down,mid,up blocks
+        # everything except down,mid,up blocks
        modules_from_controlnet = [
            model.controlnet_cond_embedding,
            model.ctrl_conv_in,
@@ -268,6 +267,16 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT
        for u in model.up_blocks:
            assert_unfrozen(u.ctrl_to_base)

+    def test_gradient_checkpointing_is_applied(self):
+        expected_set = {
+            "Transformer2DModel",
+            "UNetMidBlock2DCrossAttn",
+            "ControlNetXSCrossAttnDownBlock2D",
+            "ControlNetXSCrossAttnMidBlock2D",
+            "ControlNetXSCrossAttnUpBlock2D",
+        }
+        super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
+
    @is_flaky
    def test_forward_no_control(self):
        unet = self.get_dummy_unet()
@@ -278,7 +287,7 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT
        unet = unet.to(torch_device)
        model = model.to(torch_device)

-        input_ = self.get_dummy_inputs()
+        input_ = self.dummy_input

        control_specific_input = ["controlnet_cond", "conditioning_scale"]
        input_for_unet = {k: v for k, v in input_.items() if k not in control_specific_input}
@@ -303,7 +312,7 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT
        model = model.to(torch_device)
        model_mix_time = model_mix_time.to(torch_device)

-        input_ = self.get_dummy_inputs()
+        input_ = self.dummy_input

        with torch.no_grad():
            output = model(**input_).sample
@@ -311,14 +320,7 @@ class TestUNetControlNetXS(UNetControlNetXSTesterConfig, ModelTesterMixin, UNetT

        assert output.shape == output_mix_time.shape

-
-class TestUNetControlNetXSTraining(UNetControlNetXSTesterConfig, TrainingTesterMixin):
-    def test_gradient_checkpointing_is_applied(self):
-        expected_set = {
-            "Transformer2DModel",
-            "UNetMidBlock2DCrossAttn",
-            "ControlNetXSCrossAttnDownBlock2D",
-            "ControlNetXSCrossAttnMidBlock2D",
-            "ControlNetXSCrossAttnUpBlock2D",
-        }
-        super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
+    @unittest.skip("Test not supported.")
+    def test_forward_with_norm_groups(self):
+        # UNetControlNetXSModel currently only supports StableDiffusion and StableDiffusion-XL, both of which have norm_num_groups fixed at 32. So we don't need to test different values for norm_num_groups.
+        pass
--- a/tests/models/unets/test_models_unet_spatiotemporal.py
+++ b/tests/models/unets/test_models_unet_spatiotemporal.py
@@ -16,10 +16,10 @@
 import copy
 import unittest

-import pytest
 import torch

 from diffusers import UNetSpatioTemporalConditionModel
+from diffusers.utils import logging
 from diffusers.utils.import_utils import is_xformers_available

 from ...testing_utils import (
@@ -28,34 +28,45 @@ from ...testing_utils import (
    skip_mps,
    torch_device,
 )
-from ..test_modeling_common import UNetTesterMixin
-from ..testing_utils import (
-    AttentionTesterMixin,
-    BaseModelTesterConfig,
-    ModelTesterMixin,
-    TrainingTesterMixin,
-)
+from ..test_modeling_common import ModelTesterMixin, UNetTesterMixin


+logger = logging.get_logger(__name__)
+
 enable_full_determinism()


@skip_mps
-class UNetSpatioTemporalTesterConfig(BaseModelTesterConfig):
-    """Base configuration for UNetSpatioTemporalConditionModel testing."""
+class UNetSpatioTemporalConditionModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
+    model_class = UNetSpatioTemporalConditionModel
+    main_input_name = "sample"

    @property
-    def model_class(self):
-        return UNetSpatioTemporalConditionModel
+    def dummy_input(self):
+        batch_size = 2
+        num_frames = 2
+        num_channels = 4
+        sizes = (32, 32)
+
+        noise = floats_tensor((batch_size, num_frames, num_channels) + sizes).to(torch_device)
+        time_step = torch.tensor([10]).to(torch_device)
+        encoder_hidden_states = floats_tensor((batch_size, 1, 32)).to(torch_device)
+
+        return {
+            "sample": noise,
+            "timestep": time_step,
+            "encoder_hidden_states": encoder_hidden_states,
+            "added_time_ids": self._get_add_time_ids(),
+        }
+
+    @property
+    def input_shape(self):
+        return (2, 2, 4, 32, 32)

    @property
    def output_shape(self):
        return (4, 32, 32)

-    @property
-    def main_input_name(self):
-        return "sample"
-
    @property
    def fps(self):
        return 6
@@ -72,8 +83,8 @@ class UNetSpatioTemporalTesterConfig(BaseModelTesterConfig):
    def addition_time_embed_dim(self):
        return 32

-    def get_init_dict(self):
-        return {
+    def prepare_init_args_and_inputs_for_common(self):
+        init_dict = {
            "block_out_channels": (32, 64),
            "down_block_types": (
                "CrossAttnDownBlockSpatioTemporal",
@@ -92,23 +103,8 @@ class UNetSpatioTemporalTesterConfig(BaseModelTesterConfig):
            "projection_class_embeddings_input_dim": self.addition_time_embed_dim * 3,
            "addition_time_embed_dim": self.addition_time_embed_dim,
        }
-
-    def get_dummy_inputs(self):
-        batch_size = 2
-        num_frames = 2
-        num_channels = 4
-        sizes = (32, 32)
-
-        noise = floats_tensor((batch_size, num_frames, num_channels) + sizes).to(torch_device)
-        time_step = torch.tensor([10]).to(torch_device)
-        encoder_hidden_states = floats_tensor((batch_size, 1, 32)).to(torch_device)
-
-        return {
-            "sample": noise,
-            "timestep": time_step,
-            "encoder_hidden_states": encoder_hidden_states,
-            "added_time_ids": self._get_add_time_ids(),
-        }
+        inputs_dict = self.dummy_input
+        return init_dict, inputs_dict

    def _get_add_time_ids(self, do_classifier_free_guidance=True):
        add_time_ids = [self.fps, self.motion_bucket_id, self.noise_aug_strength]
@@ -128,15 +124,43 @@ class UNetSpatioTemporalTesterConfig(BaseModelTesterConfig):

        return add_time_ids

-
-class TestUNetSpatioTemporal(UNetSpatioTemporalTesterConfig, ModelTesterMixin, UNetTesterMixin):
-    @pytest.mark.skip("Number of Norm Groups is not configurable")
+    @unittest.skip("Number of Norm Groups is not configurable")
    def test_forward_with_norm_groups(self):
        pass

+    @unittest.skip("Deprecated functionality")
+    def test_model_attention_slicing(self):
+        pass
+
+    @unittest.skip("Not supported")
+    def test_model_with_use_linear_projection(self):
+        pass
+
+    @unittest.skip("Not supported")
+    def test_model_with_simple_projection(self):
+        pass
+
+    @unittest.skip("Not supported")
+    def test_model_with_class_embeddings_concat(self):
+        pass
+
+    @unittest.skipIf(
+        torch_device != "cuda" or not is_xformers_available(),
+        reason="XFormers attention is only available with CUDA and `xformers` installed",
+    )
+    def test_xformers_enable_works(self):
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        model = self.model_class(**init_dict)
+
+        model.enable_xformers_memory_efficient_attention()
+
+        assert (
+            model.mid_block.attentions[0].transformer_blocks[0].attn1.processor.__class__.__name__
+            == "XFormersAttnProcessor"
+        ), "xformers is not enabled"
+
    def test_model_with_num_attention_heads_tuple(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["num_attention_heads"] = (8, 16)
        model = self.model_class(**init_dict)
@@ -149,13 +173,12 @@ class TestUNetSpatioTemporal(UNetSpatioTemporalTesterConfig, ModelTesterMixin, U
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")

    def test_model_with_cross_attention_dim_tuple(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["cross_attention_dim"] = (32, 32)

@@ -169,13 +192,27 @@ class TestUNetSpatioTemporal(UNetSpatioTemporalTesterConfig, ModelTesterMixin, U
            if isinstance(output, dict):
                output = output.sample

-        assert output is not None
+        self.assertIsNotNone(output)
        expected_shape = inputs_dict["sample"].shape
-        assert output.shape == expected_shape, "Input and output shapes do not match"
+        self.assertEqual(output.shape, expected_shape, "Input and output shapes do not match")
+
+    def test_gradient_checkpointing_is_applied(self):
+        expected_set = {
+            "TransformerSpatioTemporalModel",
+            "CrossAttnDownBlockSpatioTemporal",
+            "DownBlockSpatioTemporal",
+            "UpBlockSpatioTemporal",
+            "CrossAttnUpBlockSpatioTemporal",
+            "UNetMidBlockSpatioTemporal",
+        }
+        num_attention_heads = (8, 16)
+        super().test_gradient_checkpointing_is_applied(
+            expected_set=expected_set, num_attention_heads=num_attention_heads
+        )

    def test_pickle(self):
-        init_dict = self.get_init_dict()
-        inputs_dict = self.get_dummy_inputs()
+        # enable deterministic behavior for gradient checkpointing
+        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["num_attention_heads"] = (8, 16)

@@ -188,33 +225,3 @@ class TestUNetSpatioTemporal(UNetSpatioTemporalTesterConfig, ModelTesterMixin, U
        sample_copy = copy.copy(sample)

        assert (sample - sample_copy).abs().max() < 1e-4
-
-
-class TestUNetSpatioTemporalAttention(UNetSpatioTemporalTesterConfig, AttentionTesterMixin):
-    @unittest.skipIf(
-        torch_device != "cuda" or not is_xformers_available(),
-        reason="XFormers attention is only available with CUDA and `xformers` installed",
-    )
-    def test_xformers_enable_works(self):
-        init_dict = self.get_init_dict()
-        model = self.model_class(**init_dict)
-
-        model.enable_xformers_memory_efficient_attention()
-
-        assert (
-            model.mid_block.attentions[0].transformer_blocks[0].attn1.processor.__class__.__name__
-            == "XFormersAttnProcessor"
-        ), "xformers is not enabled"
-
-
-class TestUNetSpatioTemporalTraining(UNetSpatioTemporalTesterConfig, TrainingTesterMixin):
-    def test_gradient_checkpointing_is_applied(self):
-        expected_set = {
-            "TransformerSpatioTemporalModel",
-            "CrossAttnDownBlockSpatioTemporal",
-            "DownBlockSpatioTemporal",
-            "UpBlockSpatioTemporal",
-            "CrossAttnUpBlockSpatioTemporal",
-            "UNetMidBlockSpatioTemporal",
-        }
-        super().test_gradient_checkpointing_is_applied(expected_set=expected_set)
--- a/tests/quantization/torchao/test_torchao.py
+++ b/tests/quantization/torchao/test_torchao.py
@@ -14,13 +14,11 @@
 # limitations under the License.

 import gc
-import importlib.metadata
 import tempfile
 import unittest
 from typing import List

 import numpy as np
-from packaging import version
 from parameterized import parameterized
 from transformers import AutoTokenizer, CLIPTextModel, CLIPTokenizer, T5EncoderModel

@@ -82,18 +80,17 @@ if is_torchao_available():
        Float8WeightOnlyConfig,
        Int4WeightOnlyConfig,
        Int8DynamicActivationInt8WeightConfig,
+        Int8DynamicActivationIntxWeightConfig,
        Int8WeightOnlyConfig,
+        IntxWeightOnlyConfig,
    )
    from torchao.quantization.linear_activation_quantized_tensor import LinearActivationQuantizedTensor
    from torchao.utils import get_model_size_in_bytes

-    if version.parse(importlib.metadata.version("torchao")) >= version.Version("0.10.0"):
-        from torchao.quantization import Int8DynamicActivationIntxWeightConfig, IntxWeightOnlyConfig
-

@require_torch
@require_torch_accelerator
-@require_torchao_version_greater_or_equal("0.14.0")
+@require_torchao_version_greater_or_equal("0.15.0")
 class TorchAoConfigTest(unittest.TestCase):
    def test_to_dict(self):
        """
@@ -128,7 +125,7 @@ class TorchAoConfigTest(unittest.TestCase):
 # Slices for these tests have been obtained on our aws-g6e-xlarge-plus runners
@require_torch
@require_torch_accelerator
-@require_torchao_version_greater_or_equal("0.14.0")
+@require_torchao_version_greater_or_equal("0.15.0")
 class TorchAoTest(unittest.TestCase):
    def tearDown(self):
        gc.collect()
@@ -527,7 +524,7 @@ class TorchAoTest(unittest.TestCase):
        inputs = self.get_dummy_inputs(torch_device)
        _ = pipe(**inputs)

-    @require_torchao_version_greater_or_equal("0.9.0")
+    @require_torchao_version_greater_or_equal("0.15.0")
    def test_aobase_config(self):
        quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
        components = self.get_dummy_components(quantization_config)
@@ -540,7 +537,7 @@ class TorchAoTest(unittest.TestCase):
 # Slices for these tests have been obtained on our aws-g6e-xlarge-plus runners
@require_torch
@require_torch_accelerator
-@require_torchao_version_greater_or_equal("0.14.0")
+@require_torchao_version_greater_or_equal("0.15.0")
 class TorchAoSerializationTest(unittest.TestCase):
    model_name = "hf-internal-testing/tiny-flux-pipe"

@@ -650,7 +647,7 @@ class TorchAoSerializationTest(unittest.TestCase):
        self._check_serialization_expected_slice(quant_type, expected_slice, device)


-@require_torchao_version_greater_or_equal("0.14.0")
+@require_torchao_version_greater_or_equal("0.15.0")
 class TorchAoCompileTest(QuantCompileTests, unittest.TestCase):
    @property
    def quantization_config(self):
@@ -696,7 +693,7 @@ class TorchAoCompileTest(QuantCompileTests, unittest.TestCase):
 # Slices for these tests have been obtained on our aws-g6e-xlarge-plus runners
@require_torch
@require_torch_accelerator
-@require_torchao_version_greater_or_equal("0.14.0")
+@require_torchao_version_greater_or_equal("0.15.0")
@slow
@nightly
 class SlowTorchAoTests(unittest.TestCase):
@@ -854,7 +851,7 @@ class SlowTorchAoTests(unittest.TestCase):

@require_torch
@require_torch_accelerator
-@require_torchao_version_greater_or_equal("0.14.0")
+@require_torchao_version_greater_or_equal("0.15.0")
@slow
@nightly
 class SlowTorchAoPreserializedModelTests(unittest.TestCase):
Author	SHA1	Message	Date
sayakpaul	119daffdf1	fix	2026-03-30 16:05:22 +05:30
sayakpaul	c346ad5eb8	refactor autoencoderdc tests	2026-03-30 15:30:18 +05:30
Cheung Ka Wai	e1e7d58a4a	Fix Ulysses SP backward with SDPA (#13328) * add UT for backward * fix SDPA attention backward	2026-03-30 15:15:27 +05:30
Steven Liu	a93f7f137a	[docs] refactor model skill (#13334) * refactor * feedback * feedback --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2026-03-29 23:13:52 -07:00
Sayak Paul	10ec3040a2	[ci] move to assert instead of self.Assert* (#13366) move to assert instead of self.Assert*	2026-03-30 11:09:14 +05:30
Howard Zhang	f2be8bd6b3	change minimum version guard for torchao to 0.15.0 (#13355)	2026-03-28 09:11:51 +05:30
Sayak Paul	7da22b9db5	[ci] include checkout step in claude review workflow (#13352) up	2026-03-27 17:28:31 +05:30