Mirror of https://github.com/huggingface/diffusers.git
Synced 2026-04-02 22:01:42 +08:00
Commit 1dd2004954

.ai/skills/optimizations/SKILL.md (new file, 113 lines)
---
name: optimizations
description: >
  NEVER answer optimization questions from general knowledge — ALWAYS invoke
  this skill via the Skill tool first. Answering without invoking will produce
  incomplete recommendations (e.g. missing group offloading, wrong API calls).

  IMPORTANT: When ANY tool output (especially Bash) contains
  "torch.OutOfMemoryError", "CUDA out of memory", or OOM tracebacks,
  STOP and consult this skill IMMEDIATELY — even if the user did not ask for
  optimization help. Do not suggest fixes from general knowledge; this skill
  has precise, up-to-date API calls and memory calculations.

  Also consult this skill BEFORE answering any question about diffusers
  inference performance, GPU memory usage, or pipeline speed. Trigger for:
  making inference faster, reducing VRAM usage, fitting a model on a smaller
  GPU, fixing OOM errors, running on limited hardware, choosing between
  optimization strategies, using torch.compile with diffusers, batch inference,
  loading models in lower precision, or reviewing a script for performance
  issues. Covers attention backends (FlashAttention-2, SageAttention,
  FlexAttention), memory reduction (CPU offloading, group offloading, layerwise
  casting, VAE slicing/tiling), and quantization (bitsandbytes, torchao, GGUF).

  Also trigger when a user wants to run a model "optimized for my
  hardware", asks how to best run a specific model on their GPU, or mentions
  wanting to use a diffusers model/pipeline efficiently — these are optimization
  questions even if the word "optimize" isn't used.
---

## Goal

Help users apply and debug optimizations for diffusers pipelines. There are five main areas:

1. **Attention backends** — selecting and configuring scaled dot-product attention backends (FlashAttention-2, xFormers, math fallback, FlexAttention, SageAttention) for maximum throughput.
2. **Memory reduction** — techniques to reduce peak GPU memory: model CPU offloading, group offloading, layerwise casting, VAE slicing/tiling, and attention slicing.
3. **Quantization** — reducing model precision with bitsandbytes, torchao, or GGUF to fit larger models on smaller GPUs.
4. **torch.compile** — compiling the transformer (and optionally the VAE) for a 20-50% inference speedup on repeated runs.
5. **Combining techniques** — layerwise casting + group offloading, quantization + offloading, etc.

## Workflow: When a user hits OOM or asks to fit a model on their GPU

When a user asks how to make a pipeline run on their hardware, or hits an OOM error, follow these steps **in order** before proposing any changes:

### Step 1: Detect hardware

Run these commands to understand the user's system:

```bash
# GPU VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits

# System RAM
free -g | head -2
```

Record the GPU name, total VRAM (in GB), and total system RAM (in GB). These numbers drive the recommendation.
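Turning that command output into numbers can be sketched as a small parser. The helper names and sample values below are illustrative, not part of any diffusers API:

```python
import csv
import io

def parse_nvidia_smi(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=name,memory.total,memory.free
    --format=csv,noheader,nounits`. Memory values are reported in MiB."""
    name, total, free = next(csv.reader(io.StringIO(csv_line)))
    return {
        "gpu": name.strip(),
        "vram_total_gb": round(int(total) / 1024, 1),
        "vram_free_gb": round(int(free) / 1024, 1),
    }

def parse_free_g(output: str) -> int:
    """Parse `free -g` output and return total system RAM in GB."""
    for line in output.splitlines():
        if line.startswith("Mem:"):
            return int(line.split()[1])
    raise ValueError("no Mem: line found")

# Made-up sample outputs for illustration
info = parse_nvidia_smi("NVIDIA GeForce RTX 4090, 24564, 23010")
ram_gb = parse_free_g("              total        used        free\nMem:             62          10          52")
print(info["gpu"], info["vram_total_gb"], ram_gb)
```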

### Step 2: Measure model memory and calculate strategies

Read the user's script to identify the pipeline class, model ID, `torch_dtype`, and generation params (resolution, frames).

Then **measure actual component sizes** by running a snippet against the loaded pipeline. Do NOT guess sizes from parameter counts or model cards — always measure. See [memory-calculator.md](memory-calculator.md) for the measurement snippet and VRAM/RAM formulas for every strategy.

Steps:

1. Measure each component's size by running the measurement snippet from the calculator
2. Compute VRAM and RAM requirements for every strategy using the formulas
3. Filter out strategies that don't fit the user's hardware

This is the critical step — the calculator contains exact formulas for every strategy, including the RAM cost of CUDA streams (which require ~2x the model size in pinned memory). Don't skip it: recommending `use_stream=True` to a user with limited RAM will cause swapping or OOM on the CPU side.
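The stream caveat can be sketched as a quick feasibility check. The function names are illustrative, and the 2.5x overhead factor follows the conservative estimate used in the calculator:

```python
def stream_offload_ram_gb(s_total_gb: float, overhead: float = 2.5) -> float:
    """Estimated host RAM needed for use_stream=True group offloading:
    original weights + pinned copies + allocator overhead (~2.5-3x S_total)."""
    return s_total_gb * overhead

def can_use_stream(s_total_gb: float, ram_available_gb: float) -> bool:
    return stream_offload_ram_gb(s_total_gb) <= ram_available_gb

# A 28 GB model needs ~70 GB of host RAM for streamed offloading
print(can_use_stream(28.0, 64.0))  # False
```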

### Step 3: Ask the user their preference

Present the user with a clear summary of what fits. **Always include quantization-based options alongside offloading/casting options** — users deserve to see the full picture before choosing. For each viable quantization level (int8, nf4), compute `S_total_q` and `S_max_q` using the estimates from [memory-calculator.md](memory-calculator.md) (int4/nf4 ≈ 0.25x, int8 ≈ 0.5x component size), then check fit just like the other strategies.

Present options grouped by approach so the user can compare:

> Based on your hardware (**X GB VRAM**, **Y GB RAM**) and the model requirements (~**Z GB** total, largest component ~**W GB**), here are the strategies that fit your system:
>
> **Offloading / casting strategies:**
> 1. **Quality** — [specific strategy]. Full precision, no quality loss. [estimated VRAM / RAM / speed tradeoff].
> 2. **Speed** — [specific strategy]. [quality tradeoff]. [estimated VRAM / RAM].
> 3. **Memory saving** — [specific strategy]. Minimizes VRAM. [tradeoffs].
>
> **Quantization strategies:**
> 4. **int8 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Less quality loss than int4.
> 5. **nf4 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Maximum memory savings, some quality degradation.
>
> Which would you prefer?

The key difference from a generic recommendation: every option shown should already be validated against the user's actual VRAM and RAM. Don't show options that won't fit. Read [quantization.md](quantization.md) for correct API usage when applying quantization strategies.

### Step 4: Apply the strategy

Propose **specific code changes** to the user's script. Always show the exact code diff. Read [reduce-memory.md](reduce-memory.md) and [layerwise-casting.md](layerwise-casting.md) for correct API usage before writing code.

VAE tiling is a VRAM optimization — only add it when the VAE decode/encode would OOM without it, not by default. See [reduce-memory.md](reduce-memory.md) for thresholds, the correct API (`pipe.vae.enable_tiling()` — the pipeline-level call is deprecated since v0.40.0), and which VAEs don't support it.

## Reference guides

Read these for correct API usage and detailed technique descriptions:

- [memory-calculator.md](memory-calculator.md) — **Read this first when recommending strategies.** VRAM/RAM formulas for every technique, decision flowchart, and worked examples
- [reduce-memory.md](reduce-memory.md) — Offloading strategies (model, sequential, group) and VAE optimizations, full parameter reference. **Authoritative source for compatibility rules.**
- [layerwise-casting.md](layerwise-casting.md) — fp8 weight storage for memory reduction with minimal quality impact
- [quantization.md](quantization.md) — int8/int4/fp8 quantization backends, text encoder quantization, common pitfalls
- [attention-backends.md](attention-backends.md) — attention backend selection for speed
- [torch-compile.md](torch-compile.md) — torch.compile for inference speedup

## Important compatibility rules

See [reduce-memory.md](reduce-memory.md) for the full compatibility reference. Key constraints:

- **`enable_model_cpu_offload()` and group offloading cannot coexist** on the same pipeline — use pipeline-level `enable_group_offload()` instead.
- **`torch.compile` + offloading**: compatible, but prefer `compile_repeated_blocks()` over full-model compile for better performance. See [torch-compile.md](torch-compile.md).
- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails** — int8 matmul cannot run on CPU. See [quantization.md](quantization.md) for the fix.
- **Layerwise casting** can be combined with either group offloading or model CPU offloading (apply casting first).
- **`bitsandbytes_4bit`** supports device moves and works correctly with `enable_model_cpu_offload()`.
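The constraints above can be sketched as a small compatibility check before proposing a plan. This is a hypothetical helper, not a diffusers API; it encodes only the two hard conflicts listed above:

```python
# Hard conflicts from the compatibility rules above (illustrative, not diffusers code)
INCOMPATIBLE = {
    frozenset({"model_cpu_offload", "group_offload"}):
        "cannot coexist on the same pipeline — use pipeline-level group offloading",
    frozenset({"model_cpu_offload", "bitsandbytes_8bit"}):
        "int8 matmul cannot run on CPU",
}

def check_plan(techniques: set) -> list:
    """Return human-readable conflicts for a proposed set of techniques."""
    return [reason for pair, reason in INCOMPATIBLE.items() if pair <= techniques]

print(check_plan({"layerwise_casting", "group_offload"}))      # []
print(check_plan({"model_cpu_offload", "bitsandbytes_8bit"}))  # one conflict
```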

.ai/skills/optimizations/attention-backends.md (new file, 40 lines)

# Attention Backends

## Overview

Diffusers supports multiple attention backends through `dispatch_attention_fn`. The backend affects both speed and memory usage. The right choice depends on hardware, sequence length, and whether you need features like sliding windows or custom masks.

## Available backends

| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |
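The table above can be read as a selection heuristic. This sketch is illustrative only (it is not a diffusers function, and it ignores SageAttention, which is an opt-in speed/quality tradeoff rather than a default):

```python
def choose_backend(torch_version: tuple,
                   has_flash_attn: bool,
                   ampere_or_newer: bool,
                   needs_custom_mask: bool) -> str:
    """Heuristic backend choice based on the requirements column above."""
    if needs_custom_mask and torch_version >= (2, 5):
        return "flex_attention"
    if has_flash_attn and ampere_or_newer:
        return "flash_attention_2"
    if torch_version >= (2, 0):
        return "torch_sdpa"
    return "xformers"  # older GPUs / older PyTorch

print(choose_backend((2, 4), True, True, False))    # flash_attention_2
print(choose_backend((2, 1), False, False, False))  # torch_sdpa
```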

## How to set the backend

```python
# Global default
from diffusers import set_attention_backend

set_attention_backend("flash_attention_2")

# Per-model
from diffusers.models.attention_processor import AttnProcessor2_0

pipe.transformer.set_attn_processor(AttnProcessor2_0())  # torch_sdpa

# Via environment variable:
# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2
```

## Debugging attention issues

- **NaN outputs**: check whether the attention mask dtype matches what the backend expects. Some backends require `bool` masks; others require float masks with `-inf` at masked positions.
- **Speed regression**: profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with the math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.

.ai/skills/optimizations/layerwise-casting.md (new file, 68 lines)

# Layerwise Casting

## Overview

Layerwise casting stores model weights in a smaller data format (e.g., `torch.float8_e4m3fn`) to use less memory, and upcasts them to a higher precision (e.g., `torch.bfloat16`) on the fly during computation. This cuts weight memory roughly in half (bf16 → fp8) with minimal quality impact because normalization and modulation layers are automatically skipped.

This is one of the most effective techniques for fitting a large model on a GPU that's just slightly too small — it doesn't require any special quantization libraries, just PyTorch.

## When to use

- The model **almost** fits in VRAM (e.g., a 28 GB model on a 32 GB GPU)
- You want memory savings with **less speed penalty** than offloading
- You want to **combine with group offloading** for even more savings

## Basic usage

Call `enable_layerwise_casting` on any Diffusers model component:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Store weights in fp8, compute in bf16
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

pipe.to("cuda")
```

The `storage_dtype` controls how weights are stored in memory. The `compute_dtype` controls the precision used during the actual forward pass. Normalization and modulation layers are automatically kept at full precision.

### Supported storage dtypes

| Storage dtype | Memory per param | Quality impact |
|---|---|---|
| `torch.float8_e4m3fn` | 1 byte (vs 2 for bf16) | Minimal for most models |
| `torch.float8_e5m2` | 1 byte | Slightly more range, less precision than e4m3fn |
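The savings follow directly from bytes per parameter. A quick back-of-the-envelope check (the 28B parameter count is a made-up example; real reduction is somewhat less than half because norm/embed layers stay in bf16):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Weight memory for a model with n_params parameters."""
    return n_params * bytes_per_param / 1e9

n = 28e9  # hypothetical 28B-parameter transformer
bf16_gb = weight_memory_gb(n, 2)  # 2 bytes/param in bf16
fp8_gb = weight_memory_gb(n, 1)   # 1 byte/param in fp8 (upper bound on savings)
print(bf16_gb, fp8_gb)
```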

## Functional API

For more control, use `apply_layerwise_casting` directly. This lets you target specific submodules or customize which layers to skip:

```python
import torch
from diffusers.hooks import apply_layerwise_casting

apply_layerwise_casting(
    pipe.transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    skip_modules_pattern=["norm"],  # name patterns of modules to keep at full precision
    non_blocking=True,
)
```

## Combining with other techniques

Layerwise casting is compatible with both group offloading and model CPU offloading. Always apply layerwise casting **before** enabling offloading. See [reduce-memory.md](reduce-memory.md) for code examples and the memory-savings formulas for each combination.

## Known limitations

- May not work with all models if the forward implementation contains internal typecasting of weights (the hook assumes the forward pass is independent of weight precision)
- May fail with PEFT layers (LoRA). There are some checks, but they're not guaranteed to cover all cases
- Not suitable for training — inference only
- The `compute_dtype` should match what the model expects (usually bf16 or fp16)

.ai/skills/optimizations/memory-calculator.md (new file, 298 lines)

# Memory Calculator

Use this guide to measure VRAM and RAM requirements for each optimization strategy, then recommend the best fit for the user's hardware.

## Step 1: Measure model sizes

**Do NOT guess sizes from parameter counts or model cards.** Pipelines often contain components that are not obvious from the model name (e.g., a pipeline marketed as having a "28B transformer" may also include a 24 GB text encoder, a 6 GB connectors module, etc.). Always measure by running this snippet after loading the pipeline:

```python
import torch
from diffusers import DiffusionPipeline  # or the specific pipeline class

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

for name, component in pipe.components.items():
    if hasattr(component, "parameters"):
        size_gb = sum(p.numel() * p.element_size() for p in component.parameters()) / 1e9
        print(f"{name}: {size_gb:.2f} GB")
```

For the transformer, also measure block-level and leaf-level sizes:

```python
# S_block: size of one transformer block
transformer = pipe.transformer
block_attr = None
for attr in ["transformer_blocks", "blocks", "layers"]:
    if hasattr(transformer, attr):
        block_attr = attr
        break
if block_attr:
    blocks = getattr(transformer, block_attr)
    block_size = sum(p.numel() * p.element_size() for p in blocks[0].parameters()) / 1e9
    print(f"S_block: {block_size:.2f} GB ({len(blocks)} blocks)")

# S_leaf: largest leaf module
max_leaf = max(
    (sum(p.numel() * p.element_size() for p in m.parameters(recurse=False))
     for m in transformer.modules() if list(m.parameters(recurse=False))),
    default=0
) / 1e9
print(f"S_leaf: {max_leaf:.4f} GB")
```

To measure the effect of layerwise casting on a component, apply it and re-measure:

```python
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
size_after = sum(p.numel() * p.element_size() for p in pipe.transformer.parameters()) / 1e9
print(f"Transformer after layerwise casting: {size_after:.2f} GB")
```

From the measurements, record:

- `S_total` = sum of all component sizes
- `S_max` = size of the largest single component
- `S_block` = size of one transformer block
- `S_leaf` = size of the largest leaf module
- `S_total_lc` = `S_total` after applying layerwise casting to castable components (measured, not estimated — norm/embed layers are skipped, so it's not exactly half)
- `S_max_lc` = size of the largest component after layerwise casting (measured)
- `A` = activation memory during the forward pass (cannot be measured ahead of time — estimate conservatively):
  - **Video models**: `A` scales with resolution and number of frames. A 5-second 960x544 video at 24 fps can use ~7-8 GB. Higher resolution or more seconds = more activation memory.
  - **Image models**: `A` scales with image resolution. A 1024x1024 image might use 2-4 GB, but 2048x2048 could use 8-16 GB.
  - **Edit/inpainting models**: `A` includes the reference image(s) in addition to the generation activations, so budget extra.
  - When in doubt, estimate conservatively: `A ≈ 5-8 GB` for typical video workloads, `A ≈ 2-4 GB` for typical image workloads. For high-resolution or long video, increase accordingly.

## Step 2: Compute VRAM and RAM per strategy

### No optimization (all on GPU)

| | Estimate |
|---|---|
| **VRAM** | `S_total + A` |
| **RAM** | Minimal (just for loading) |
| **Speed** | Fastest — no transfers |
| **Quality** | Full precision |

### Model CPU offloading

| | Estimate |
|---|---|
| **VRAM** | `S_max + A` (only one component on GPU at a time) |
| **RAM** | `S_total` (all components stored on CPU) |
| **Speed** | Moderate — full component transfers between CPU/GPU per step |
| **Quality** | Full precision |

### Group offloading: block_level (no stream)

| | Estimate |
|---|---|
| **VRAM** | `num_blocks_per_group * S_block + A` |
| **RAM** | `S_total` (all weights on CPU, no pinned copy) |
| **Speed** | Moderate — synchronous transfers per group |
| **Quality** | Full precision |

Tune `num_blocks_per_group` to fill available VRAM: `floor((VRAM - A) / S_block)`.
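That tuning formula, as a one-liner (the clamp to a minimum of 1 is an assumption for the degenerate case where even one block barely fits):

```python
import math

def num_blocks_per_group(vram_gb: float, a_gb: float, s_block_gb: float) -> int:
    """Blocks that fit on the GPU at once: floor((VRAM - A) / S_block)."""
    return max(1, math.floor((vram_gb - a_gb) / s_block_gb))

# 24 GB GPU, ~6 GB activations, 1.2 GB per block
print(num_blocks_per_group(24.0, 6.0, 1.2))  # 15
```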

### Group offloading: block_level (with stream)

Streams force `num_blocks_per_group=1`. The next block is prefetched while the current one runs.

| | Estimate |
|---|---|
| **VRAM** | `2 * S_block + A` (current block + prefetched next block) |
| **RAM** | `~2.5-3 * S_total` (original weights + pinned copies + allocation overhead) |
| **Speed** | Fast — overlaps transfer and compute |
| **Quality** | Full precision |

With `low_cpu_mem_usage=True`: RAM drops to `~S_total` (tensors are pinned on the fly instead of pre-pinned), but it is slower.

With `record_stream=True`: slightly more VRAM (delays memory reclamation), slightly faster (avoids stream synchronization).

> **Note on RAM estimates with streams:** Measured RAM usage is consistently higher than the theoretical `2 * S_total`. Pinned-memory allocation, CUDA runtime overhead, and memory fragmentation add ~30-50% on top. Always use `~2.5-3 * S_total` when checking whether the user has enough RAM for streamed offloading.

### Group offloading: leaf_level (no stream)

| | Estimate |
|---|---|
| **VRAM** | `S_leaf + A` (single leaf module, typically very small) |
| **RAM** | `S_total` |
| **Speed** | Slow — synchronous transfer per leaf module (many transfers) |
| **Quality** | Full precision |

### Group offloading: leaf_level (with stream)

| | Estimate |
|---|---|
| **VRAM** | `2 * S_leaf + A` (current + prefetched leaf) |
| **RAM** | `~2.5-3 * S_total` (pinned copies + overhead — see note above) |
| **Speed** | Medium-fast — overlaps transfer/compute at leaf granularity |
| **Quality** | Full precision |

With `low_cpu_mem_usage=True`: RAM drops to `~S_total`, but slower.

### Sequential CPU offloading (legacy)

| | Estimate |
|---|---|
| **VRAM** | `S_leaf + A` (similar to leaf_level group offloading) |
| **RAM** | `S_total` |
| **Speed** | Very slow — no stream support, synchronous per-leaf transfers |
| **Quality** | Full precision |

Group offloading with `leaf_level + use_stream=True` is strictly better. Prefer that.

### Layerwise casting (fp8 storage)

Reduces weight memory by casting to fp8. Norm and embedding layers are automatically skipped, so the reduction is less than 50% — always measure with the snippet above.

**`pipe.to()` caveat:** `pipe.to(device)` internally calls `module.to(device, dtype)` where `dtype` is `None` when not explicitly passed. This preserves fp8 weights. However, if the user passes a dtype explicitly (e.g., `pipe.to("cuda", torch.bfloat16)`) or the pipeline has internal dtype overrides, the fp8 storage will be cast back to bf16. When in doubt, combine with `enable_model_cpu_offload()`, which safely moves one component at a time without dtype overrides.

**Case 1: Everything on GPU** (if `S_total_lc + A <= VRAM`)

| | Estimate |
|---|---|
| **VRAM** | `S_total_lc + A` (measured — use the layerwise casting measurement snippet) |
| **RAM** | Minimal |
| **Speed** | Near-native — small cast overhead per layer |
| **Quality** | Slight degradation (fp8 weights; norm layers kept at full precision) |

Use `pipe.to("cuda")` (without explicit dtype) after applying layerwise casting, or move each component individually.

**Case 2: With model CPU offloading** (if Case 1 doesn't fit but `S_max_lc + A <= VRAM`)

| | Estimate |
|---|---|
| **VRAM** | `S_max_lc + A` (largest component after layerwise casting, one on GPU at a time) |
| **RAM** | `S_total` (all components on CPU) |
| **Speed** | Fast — small cast overhead per layer, component transfer overhead between steps |
| **Quality** | Slight degradation (fp8 weights; norm layers kept at full precision) |

Apply layerwise casting to the target components, then call `pipe.enable_model_cpu_offload()`.

### Layerwise casting + group offloading

Combines reduced weight size with offloading. The offloaded weights are in fp8, so transfers are faster and pinned copies smaller.

| | Estimate |
|---|---|
| **VRAM** | `num_blocks_per_group * S_block * 0.5 + A` (block_level) or `S_leaf * 0.5 + A` (leaf_level) |
| **RAM** | `S_total * 0.5` (no stream) or `~S_total` (with stream, pinned copy of fp8 weights) |
| **Speed** | Good — smaller transfers due to fp8 |
| **Quality** | Slight degradation from fp8 |

### Quantization (int4/nf4)

Quantization reduces weight memory but requires full-precision weights during loading. Always use `device_map="cpu"` so quantization happens on CPU.

Notation:

- `S_component_q` = quantized size of a component (int4/nf4 ≈ `S_component * 0.25`, int8 ≈ `S_component * 0.5`)
- `S_total_q` = total pipeline size after quantizing the selected components
- `S_max_q` = size of the largest single component after quantization

**Loading (with `device_map="cpu"`):**

| | Estimate |
|---|---|
| **RAM (peak during loading)** | `S_largest_component_bf16` — the full-precision weights of the largest component must fit in RAM during quantization |
| **RAM (after loading)** | `S_total_q` — all components at their final (quantized or bf16) sizes |

**Inference with `pipe.to(device)`:**

| | Estimate |
|---|---|
| **VRAM** | `S_total_q + A` (all components on GPU at once) |
| **RAM** | Minimal |
| **Speed** | Good — smaller model, possible dequantization overhead |
| **Quality** | Noticeable degradation possible, especially with int4. Try int8 first. |

**Inference with `enable_model_cpu_offload()`:**

| | Estimate |
|---|---|
| **VRAM** | `S_max_q + A` (largest component on GPU at a time) |
| **RAM** | `S_total_q` (all components stored on CPU) |
| **Speed** | Moderate — component transfers between CPU/GPU |
| **Quality** | Depends on quantization level |

## Step 3: Pick the best strategy

Given `VRAM_available` and `RAM_available`, filter strategies by what fits, then rank by the user's preference.

### Algorithm

```
1. Measure S_total, S_max, S_block, S_leaf, S_total_lc, S_max_lc, A for the pipeline
2. For each strategy (offloading, casting, AND quantization), compute estimated VRAM and RAM
3. Filter out strategies where VRAM > VRAM_available or RAM > RAM_available
4. Present ALL viable strategies to the user, grouped by approach (offloading/casting vs quantization)
5. Let the user pick based on their preference:
   - Quality: pick the one with the highest precision that fits
   - Speed: pick the one with the lowest transfer overhead
   - Memory: pick the one with the lowest VRAM usage
   - Balanced: pick the lightest technique that fits comfortably (target ~80% VRAM)
```
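Steps 2-3 of the algorithm can be sketched for a subset of strategies. The formulas are copied from the tables above; the function and strategy names are illustrative (the real decision also covers layerwise casting, leaf_level offloading, and quantization):

```python
def viable_strategies(vram_gb: float, ram_gb: float,
                      s_total: float, s_max: float,
                      s_block: float, a: float) -> list:
    """Filter a subset of strategies by hardware fit, using the Step 2 formulas."""
    estimates = {
        # name: (estimated VRAM, estimated RAM)
        "no_optimization":          (s_total + a,     0.0),
        "model_cpu_offload":        (s_max + a,       s_total),
        "group_block_no_stream_n1": (s_block + a,     s_total),        # num_blocks_per_group=1
        "group_block_stream":       (2 * s_block + a, 2.5 * s_total),  # pinned copies + overhead
    }
    return [name for name, (vram, ram) in estimates.items()
            if vram <= vram_gb and ram <= ram_gb]

# Example: 24 GB VRAM, 64 GB RAM; 28 GB model, 14 GB transformer, 0.6 GB blocks, A = 6 GB
print(viable_strategies(24.0, 64.0, 28.0, 14.0, 0.6, 6.0))
```

Here the streamed variant is filtered out because 2.5 x 28 = 70 GB exceeds the 64 GB of RAM, exactly the failure mode the stream note warns about.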

### Quantization size estimates

Always compute these alongside the offloading strategies — don't treat quantization as a last resort. Pick the largest components worth quantizing (typically the transformer, plus the text encoder if it is LLM-based):

```
S_component_int8 = S_component * 0.5
S_component_nf4 = S_component * 0.25

S_total_int8 = sum of quantized components (int8) + remaining components (bf16)
S_total_nf4 = sum of quantized components (nf4) + remaining components (bf16)
S_max_int8 = max single component after int8 quantization
S_max_nf4 = max single component after nf4 quantization
```

RAM requirement for quantization loading: `RAM >= S_largest_component_bf16` (the full-precision weights must fit during quantization). If this doesn't hold, quantization is not viable unless pre-quantized checkpoints are available.
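The estimates above, as a small helper (the component sizes in the example are made up; measure real ones with the Step 1 snippet):

```python
def quantized_sizes(components_gb: dict, to_quantize: set, factor: float) -> tuple:
    """Return (S_total_q, S_max_q) from bf16 component sizes.
    factor: 0.5 for int8, 0.25 for int4/nf4."""
    sizes = {name: gb * factor if name in to_quantize else gb
             for name, gb in components_gb.items()}
    return sum(sizes.values()), max(sizes.values())

components = {"transformer": 28.0, "text_encoder": 8.0, "vae": 0.5}  # example numbers
print(quantized_sizes(components, {"transformer", "text_encoder"}, 0.25))  # nf4
print(quantized_sizes(components, {"transformer", "text_encoder"}, 0.5))   # int8
```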

### Quick decision flowchart

Offloading / casting path:

```
VRAM >= S_total + A?
→ YES: No optimization needed (maybe an attention backend for speed)
→ NO:
  VRAM >= S_total_lc + A? (layerwise casting, everything on GPU)
  → YES: Layerwise casting, pipe.to("cuda") without explicit dtype
  → NO:
    VRAM >= S_max + A? (model CPU offload, full precision)
    → YES: Model CPU offloading
           - Want less VRAM? → add layerwise casting too
    → NO:
      VRAM >= S_max_lc + A? (layerwise casting + model CPU offload)
      → YES: Layerwise casting + model CPU offloading
      → NO: Need group offloading
        RAM >= 3 * S_total? (enough for pinned copies + overhead)
        → YES: group offload leaf_level + stream (fast)
        → NO:
          RAM >= S_total?
          → YES: group offload leaf_level + stream + low_cpu_mem_usage
                 or group offload block_level (no stream)
          → NO: Quantization required to reduce model size, then retry
```

Quantization path (evaluate in parallel with the above, not as a fallback):

```
RAM >= S_largest_component_bf16? (must fit full-precision weights during quantization)
→ NO: Cannot quantize — need more RAM or pre-quantized checkpoints
→ YES: Compute quantized sizes for target components (typically transformer + text_encoder)

  nf4 quantization:
    VRAM >= S_total_nf4 + A? → pipe.to("cuda"), fastest (no offloading overhead)
    VRAM >= S_max_nf4 + A? → model CPU offload, moderate speed

  int8 quantization:
    VRAM >= S_total_int8 + A? → pipe.to("cuda"), fastest
    VRAM >= S_max_int8 + A? → model CPU offload, moderate speed

Show all viable quantization options alongside the offloading options so the user can compare
quality/speed/memory tradeoffs across approaches.
```

.ai/skills/optimizations/quantization.md (new file, 180 lines)

# Quantization

## Overview

Quantization reduces model weights from fp16/bf16 to lower precision (int8, int4, fp8), cutting memory usage and often improving throughput. Diffusers supports several quantization backends.

## Supported backends

| Backend | Precisions | Key features |
|---|---|---|
| **bitsandbytes** | int8, int4 (nf4/fp4) | Easiest to use, widely supported, QLoRA training |
| **torchao** | int8, int4, fp8 | PyTorch-native, good for inference, `autoquant` support |
| **GGUF** | Various (Q4_K_M, Q5_K_S, etc.) | Load GGUF checkpoints directly, community-quantized models |

## Critical: Pipeline-level vs component-level quantization

**Pipeline-level quantization is the correct approach.** Pass a `PipelineQuantizationConfig` to `from_pretrained`. Do NOT pass a `BitsAndBytesConfig` directly — the pipeline's `from_pretrained` will reject it with `"quantization_config must be an instance of PipelineQuantizationConfig"`.

### Backend names in `PipelineQuantizationConfig`

The `quant_backend` string must match one of the registered backend keys. These are NOT the same as the config class names:

| `quant_backend` value | Notes |
|---|---|
| `"bitsandbytes_4bit"` | NOT `"bitsandbytes"` — the `_4bit` suffix is required |
| `"bitsandbytes_8bit"` | NOT `"bitsandbytes"` — the `_8bit` suffix is required |
| `"gguf"` | |
| `"torchao"` | |
| `"modelopt"` | |
|
||||
### `quant_kwargs` for bitsandbytes
|
||||
|
||||
**`quant_kwargs` must be non-empty.** The validator raises `ValueError: Both quant_kwargs and quant_mapping cannot be None` if it's `{}` or `None`. Always pass at least one kwarg.
|
||||
|
||||
For `bitsandbytes_4bit`, the quantizer class is selected by backend name — `load_in_4bit=True` is redundant (the quantizer ignores it) but harmless. Pass the bnb-specific options instead:
|
||||
|
||||
```python
|
||||
quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"}
|
||||
```
|
||||
|
||||
For `bitsandbytes_8bit`, there are no bnb_8bit-specific kwargs, so pass the flag explicitly to satisfy the non-empty requirement:
|
||||
|
||||
```python
|
||||
quant_kwargs={"load_in_8bit": True}
|
||||
```
|
||||
|
||||
## Usage patterns

### bitsandbytes (pipeline-level, recommended)

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

quantization_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
    components_to_quantize=["transformer"],  # specify which components to quantize
)

pipe = DiffusionPipeline.from_pretrained(
    "model_id",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # load on CPU first to avoid OOM during quantization
)
```

### torchao (pipeline-level)

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

quantization_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8_weight_only"},
    components_to_quantize=["transformer"],
)

pipe = DiffusionPipeline.from_pretrained(
    "model_id",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
```

### GGUF (pipeline-level)

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

quantization_config = PipelineQuantizationConfig(
    quant_backend="gguf",
    quant_kwargs={"compute_dtype": torch.bfloat16},
)

pipe = DiffusionPipeline.from_pretrained(
    "model_id",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
```

## Loading: memory requirements and `device_map="cpu"`

Quantization is NOT free at load time. The full-precision (bf16/fp16) weights must be loaded into memory first, then compressed. This means:

- **Without `device_map="cpu"`** (default): each component loads to GPU in full precision, gets quantized on GPU, then the full-precision copy is freed. While loading, you need VRAM for the full-precision weights of the current component PLUS all previously loaded components (quantized or not). For large models, this causes OOM.
- **With `device_map="cpu"`**: components load and quantize on CPU. This requires **RAM >= S_largest_component_bf16** — the full-precision weights of the largest component must fit in RAM during quantization. After quantization, RAM usage drops to the quantized size.

**Always pass `device_map="cpu"` when using quantization.** Then choose how to move to GPU:

1. **`pipe.to(device)`** — moves everything to GPU at once. Only works if all components (quantized + non-quantized) fit in VRAM simultaneously: `VRAM >= S_total_after_quant`.
2. **`pipe.enable_model_cpu_offload(device=device)`** — moves components to GPU one at a time during inference. Use this when `S_total_after_quant > VRAM` but `S_max_after_quant + A <= VRAM` (where `A` is activation memory).

### Memory check before recommending quantization

Before recommending quantization, verify:

- **RAM >= S_largest_component_bf16** — the full-precision weights of the largest component to be quantized must fit in RAM during loading
- **VRAM >= S_total_after_quant + A** (for `pipe.to()`) or **VRAM >= S_max_after_quant + A** (for model CPU offload) — the quantized model plus activations must fit during inference
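
The `S_*` quantities can be measured directly from a loaded pipeline. A minimal sketch — the helper name and GB units are choices made here, not diffusers API; pass it `pipe.components`, the dict every `DiffusionPipeline` exposes:

```python
import torch

def component_sizes_gb(components):
    """Parameter memory (GB) per torch module; pass pipe.components from a DiffusionPipeline.
    Non-module entries (schedulers, tokenizers) are skipped."""
    sizes = {}
    for name, module in components.items():
        if isinstance(module, torch.nn.Module):
            nbytes = sum(p.numel() * p.element_size() for p in module.parameters())
            sizes[name] = nbytes / 1e9
    return sizes
```

From the result, `sum(sizes.values())` approximates `S_total` and `max(sizes.values())` approximates `S_max` for the checks above.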

## `components_to_quantize`

Use this parameter to control which pipeline components get quantized. Common choices:

- `["transformer"]` — quantize only the denoising model
- `["transformer", "text_encoder"]` — also quantize the text encoder (see below)
- `["transformer", "text_encoder", "text_encoder_2"]` — for dual-encoder models (FLUX.1, SD3, etc.) when both encoders are large
- Omit the parameter to quantize all compatible components

The VAE and vocoder are typically small enough that quantizing them gives little benefit and can hurt quality.

### Text encoder quantization

**Quantizing the text encoder is a first-class optimization, not an afterthought.** Many modern models use LLM-based text encoders that are as large as or larger than the transformer itself:

| Model family | Text encoder | Size (bf16) |
|---|---|---|
| FLUX.2 Klein | Qwen3 | ~9 GB |
| FLUX.1 | T5-XXL | ~10 GB |
| SD3 | T5-XXL + CLIP-L + CLIP-G | ~11 GB total |
| CogVideoX | T5-XXL | ~10 GB |

Newer models (FLUX.2 Klein, etc.) use a **single LLM-based text encoder** — check the pipeline definition for `text_encoder` vs `text_encoder_2`. Never assume a CLIP+T5 dual-encoder layout.

When the text encoder is LLM-based, always include it in `components_to_quantize`. The combined savings often allow both components to fit in VRAM simultaneously, eliminating the need for CPU offloading entirely:

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

# Both transformer (~4.5 GB) + Qwen3 text encoder (~4.5 GB) fit in VRAM at int4
quantization_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
    components_to_quantize=["transformer", "text_encoder"],
)
pipe = DiffusionPipeline.from_pretrained("model_id", quantization_config=quantization_config, device_map="cpu")
pipe.to("cuda")  # everything fits — no offloading needed
```

Contrast this with transformer-only quantization, which may still require offloading because the text encoder alone exceeds available VRAM.

## Choosing a backend

- **Just want it to work**: bitsandbytes nf4 (`bitsandbytes_4bit`)
- **Best inference speed**: torchao int8 or fp8 (on supported hardware)
- **Using community GGUF files**: GGUF
- **Need to fine-tune**: bitsandbytes (QLoRA support)

## Common issues

- **OOM during loading**: You forgot `device_map="cpu"`. See the loading section above.
- **`quantization_config must be an instance of PipelineQuantizationConfig`**: You passed a `BitsAndBytesConfig` directly. Wrap it in `PipelineQuantizationConfig` instead.
- **`quant_backend not found`**: The backend name is wrong. Use `bitsandbytes_4bit` or `bitsandbytes_8bit`, not `bitsandbytes`. See the backend names table above.
- **`Both quant_kwargs and quant_mapping cannot be None`**: `quant_kwargs` is empty or `None`. Always pass at least one kwarg — see the `quant_kwargs` section above.
- **OOM during `pipe.to(device)` after loading**: Even quantized, the components don't all fit in VRAM at once. Use `enable_model_cpu_offload()` instead of `pipe.to(device)`.
- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails at inference**: `LLM.int8()` (bitsandbytes 8-bit) can only execute on CUDA — it cannot run on CPU. When `enable_model_cpu_offload()` moves the quantized component back to CPU between steps, the int8 matmul fails. **Fix**: keep the int8 component on CUDA permanently (`pipe.transformer.to("cuda")`) and use group offloading with `exclude_modules=["transformer"]` for the rest, or switch to `bitsandbytes_4bit`, which supports device moves.
- **Quality degradation**: int4 can produce noticeable artifacts for some models. Try int8 first, then drop to int4 if memory requires it.
- **Slow first inference**: Some backends (torchao) compile/calibrate on first run. Subsequent runs are faster.
- **Incompatible layers**: Not all layer types support all quantization schemes. Check backend docs for supported module types.
- **Training**: Only bitsandbytes supports training (via QLoRA). Other backends are inference-only.

213
.ai/skills/optimizations/reduce-memory.md
Normal file
@@ -0,0 +1,213 @@

# Reduce Memory

## Overview

Large diffusion models can exceed GPU VRAM. Diffusers provides several techniques to reduce peak memory, each with a different speed/memory tradeoff.

## Techniques (ordered by ease of use)

### 1. Model CPU offloading

Moves entire models to CPU when not in use and loads each one to GPU just before its forward pass.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
# Do NOT call pipe.to("cuda") — the hook handles device placement
```

- **Memory savings**: Significant — only one model on GPU at a time
- **Speed cost**: Moderate — full model transfers between CPU and GPU
- **When to use**: First thing to try when hitting OOM
- **Limitation**: If the single largest component (e.g. the transformer) exceeds VRAM, this won't help — you need group offloading or layerwise casting instead.

### 2. Group offloading

Offloads groups of internal layers to CPU, loading them to GPU only during their forward pass. More granular than model offloading and faster than sequential offloading.

**Two offload types:**

- `block_level` — offloads groups of N layers at a time. Lower memory, moderate speed.
- `leaf_level` — offloads individual leaf modules. Equivalent to sequential offloading, but can be made faster with CUDA streams.

**IMPORTANT**: `enable_model_cpu_offload()` will raise an error if any component has group offloading enabled. If you need offloading for the whole pipeline, use pipeline-level `enable_group_offload()` instead — it handles all components in one call.

#### Pipeline-level group offloading

Applies group offloading to ALL components in the pipeline at once. This is the simplest approach.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Option A: leaf_level with CUDA streams (recommended — fast + low memory)
pipe.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)

# Option B: block_level (more memory savings, slower)
pipe.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=2,
)
```

#### Component-level group offloading

Apply group offloading selectively to specific components. This is useful when only the transformer is too large for VRAM but the other components fit fine.

For Diffusers model components (those inheriting from `ModelMixin`), use `enable_group_offload`:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Group offload the transformer (the largest component)
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)

# Group offload the VAE too if needed (offload_device defaults to CPU)
pipe.vae.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_type="leaf_level",
)
```

For non-Diffusers components (e.g. text encoders from the transformers library), use the functional API:

```python
import torch
from diffusers.hooks import apply_group_offloading

apply_group_offloading(
    pipe.text_encoder,
    onload_device=torch.device("cuda"),
    offload_type="block_level",
    num_blocks_per_group=2,
)
```

#### CUDA streams for faster group offloading

When `use_stream=True`, the next layer is prefetched to GPU while the current layer runs, overlapping data transfer with computation. Requires roughly 2x the model's size in CPU memory.

```python
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    record_stream=True,  # slightly more speed, slightly more memory
)
```

If using `block_level` with `use_stream=True`, set `num_blocks_per_group=1` (a warning is raised otherwise).

#### Full parameter reference

Parameters available across the three group offloading APIs:

| Parameter | Pipeline | Model | `apply_group_offloading` | Description |
|---|---|---|---|---|
| `onload_device` | yes | yes | yes | Device to load layers onto for computation (e.g. `torch.device("cuda")`) |
| `offload_device` | yes | yes | yes | Device to offload layers to when idle (default: `torch.device("cpu")`) |
| `offload_type` | yes | yes | yes | `"block_level"` (groups of N layers) or `"leaf_level"` (individual modules) |
| `num_blocks_per_group` | yes | yes | yes | Required for `block_level` — how many layers per group |
| `non_blocking` | yes | yes | yes | Non-blocking data transfer between devices |
| `use_stream` | yes | yes | yes | Overlap data transfer and computation via CUDA streams. Requires ~2x the model's size in CPU RAM |
| `record_stream` | yes | yes | yes | With `use_stream`, marks tensors for the stream. Faster but slightly more memory |
| `low_cpu_mem_usage` | yes | yes | yes | Pins tensors on the fly instead of pre-pinning. Saves CPU RAM when using streams, but slower |
| `offload_to_disk_path` | yes | yes | yes | Path to offload weights to disk instead of CPU RAM. Useful when system RAM is also limited |
| `exclude_modules` | **yes** | no | no | Pipeline-only: list of component names to skip (they are placed on `onload_device` instead) |
| `block_modules` | no | **yes** | **yes** | Override which submodules are treated as blocks for `block_level` offloading |
| `exclude_kwargs` | no | **yes** | **yes** | Kwarg keys that should not be moved between devices (e.g. mutable cache state) |

### 3. Sequential CPU offloading

Moves individual layers to GPU one at a time during the forward pass.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload()
# Do NOT call pipe.to("cuda") first — it saves little memory if you do
```

- **Memory savings**: Maximum — only one layer on GPU at a time
- **Speed cost**: Very high — many small transfers per forward pass
- **When to use**: Last resort when group offloading with streams isn't enough
- **Note**: Group offloading with `leaf_level` + `use_stream=True` is essentially the same idea but faster. Prefer that.

### 4. VAE slicing

Processes VAE encode/decode in slices along the batch dimension.

```python
pipe.vae.enable_slicing()
```

- **Memory savings**: Reduces VAE peak memory for batch sizes > 1
- **Speed cost**: Minimal
- **When to use**: When generating multiple images/videos in a batch
- **Note**: `AutoencoderKLWan` and `AsymmetricAutoencoderKL` don't support slicing.
- **API note**: The pipeline-level `pipe.enable_vae_slicing()` is deprecated since v0.40.0. Use `pipe.vae.enable_slicing()`.

### 5. VAE tiling

Processes VAE encode/decode in spatial tiles. This is a **VRAM optimization** — only use it when the VAE decode/encode would OOM without it.

```python
pipe.vae.enable_tiling()
```

- **Memory savings**: Bounds VAE peak memory by tile size rather than full resolution
- **Speed cost**: Some overhead from tile overlap processing
- **When to use** (only when VAE decode would OOM):
  - **Image models**: Typically needed above ~1.5 MP on ≤16 GB GPUs, or ~4 MP on ≤32 GB GPUs
  - **Video models**: When `H × W × num_frames` is large relative to the VRAM remaining after denoising
- **When NOT to use**: At standard resolutions where the VAE fits comfortably — tiling adds overhead for no benefit
- **Note**: `AutoencoderKLWan` and `AsymmetricAutoencoderKL` don't support tiling.
- **API note**: The pipeline-level `pipe.enable_vae_tiling()` is deprecated since v0.40.0. Use `pipe.vae.enable_tiling()`.
- **Tip for group offloading with streams**: If combining VAE tiling with group offloading (`use_stream=True`), do a dummy forward pass first to avoid device mismatch errors.

### 6. Attention slicing (legacy)

```python
pipe.enable_attention_slicing()
```

- Largely superseded by `torch_sdpa` and FlashAttention
- Still useful on very old GPUs without SDPA support

## Combining techniques

Compatible combinations:

- Group offloading (pipeline-level) + VAE tiling — good general setup
- Group offloading (pipeline-level, `exclude_modules=["small_component"]`) — keeps small models on GPU, offloads large ones
- Model CPU offloading + VAE tiling — simple and effective when the largest component fits in VRAM
- Layerwise casting + group offloading — maximum savings (see [layerwise-casting.md](layerwise-casting.md))
- Layerwise casting + model CPU offloading — also works
- Quantization + model CPU offloading — works well
- Per-component group offloading with different configs — e.g. `block_level` for the transformer, `leaf_level` for the VAE

**Incompatible combinations:**

- `enable_model_cpu_offload()` on a pipeline where ANY component has group offloading — raises `ValueError`
- `enable_sequential_cpu_offload()` on a pipeline where ANY component has group offloading — same error

## Debugging OOM

1. Check which stage OOMs: loading, encoding, denoising, or decoding
2. If OOM during `.to("cuda")` — the full pipeline doesn't fit. Use model CPU offloading or group offloading
3. If OOM during denoising with model CPU offloading — the transformer alone exceeds VRAM. Use layerwise casting (see [layerwise-casting.md](layerwise-casting.md)) or group offloading instead
4. If still OOM during VAE decode, add `pipe.vae.enable_tiling()`
5. Consider quantization (see [quantization.md](quantization.md)) as a complementary approach
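
The escalation steps above can be condensed into a small decision helper. A sketch only — the strategy names and the default activation estimate are assumptions made for illustration, not library values; feed it per-component sizes in GB:

```python
def choose_strategy(vram_gb, sizes_gb, activation_gb=2.0):
    """Pick an offloading strategy from estimated component sizes (GB).

    Mirrors the rules in this document:
    - everything fits          -> plain pipe.to("cuda")
    - largest component fits   -> model CPU offloading
    - otherwise                -> group offloading (leaf_level + streams)
    """
    s_total = sum(sizes_gb.values())
    s_max = max(sizes_gb.values())
    if s_total + activation_gb <= vram_gb:
        return "to_cuda"
    if s_max + activation_gb <= vram_gb:
        return "model_cpu_offload"
    return "group_offload_leaf_stream"
```

If even group offloading OOMs during VAE decode, add `pipe.vae.enable_tiling()` on top, as in step 4.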

72
.ai/skills/optimizations/torch-compile.md
Normal file
@@ -0,0 +1,72 @@

# torch.compile

## Overview

`torch.compile` traces a model's forward pass and compiles it to optimized machine code (via Triton or other backends). For diffusers, it typically speeds up the denoising loop by 20-50% after a warmup period.

## Full model compilation

Compile individual components, not the whole pipeline:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16).to("cuda")

pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True)
# Optionally compile the VAE decoder too
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
```

The first 1-3 inference calls are slow (compilation/warmup). Subsequent calls are fast. Always do a warmup run before benchmarking.

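
A minimal timing harness that respects the warmup requirement — the helper and its step counts are illustrative choices, not a diffusers utility:

```python
import time
import torch

def benchmark(fn, warmup=3, iters=10):
    """Average wall time of fn() after warmup runs; syncs CUDA so timings are honest."""
    for _ in range(warmup):  # triggers compilation / autotuning on compiled models
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Usage: benchmark(lambda: pipe(prompt, num_inference_steps=20))
```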
## Regional compilation (preferred)

Regional compilation compiles only the frequently repeated sub-modules (the transformer blocks) instead of the whole model. It provides the same runtime speedup with ~8-10x faster compile time and better compatibility with offloading.

Diffusers models declare their repeated blocks via the `_repeated_blocks` class attribute (a list of class name strings). Most modern transformers define this:

```python
# FluxTransformer defines:
_repeated_blocks = ["FluxTransformerBlock", "FluxSingleTransformerBlock"]
```

Use `compile_repeated_blocks()` to compile them:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer.compile_repeated_blocks(fullgraph=True)
```

**Always guard before calling** — it raises `ValueError` if `_repeated_blocks` is empty or the named classes aren't found. Use this pattern universally, whether or not you're using offloading:

```python
# Works with or without enable_model_cpu_offload() / enable_group_offload()
if getattr(pipe.transformer, "_repeated_blocks", None):
    pipe.transformer.compile_repeated_blocks(fullgraph=True)
else:
    pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True)
```

`torch.compile` is compatible with diffusers' offloading methods — the offloading hooks use `@torch.compiler.disable()` on device-transfer operations so they run natively outside the compiled graph. Regional compilation is preferred when combining with offloading because it avoids compiling the parts that interact with the hooks.

Models with `_repeated_blocks` defined include: Flux, Flux2, HunyuanVideo, LTX2Video, Wan, CogVideo, SD3, UNet2DConditionModel, and most other modern architectures.

## Compile modes

| Mode | Speed gain | Compile time | Notes |
|---|---|---|---|
| `"default"` | Moderate | Fast | Safe starting point |
| `"reduce-overhead"` | Good | Moderate | Reduces Python overhead via CUDA graphs |
| `"max-autotune"` | Best | Very slow | Tries many kernel configs; best for repeated inference |

## `fullgraph=True`

Requires the entire forward pass to be compilable as a single graph. Most diffusers transformers support this. If you hit a `torch._dynamo` graph-break error, remove `fullgraph=True` to allow partial compilation.

## Limitations

- **Dynamic shapes**: Changing resolution between calls triggers recompilation. Use `torch.compile(..., dynamic=True)` for variable resolutions, at some speed cost.
- **First call is slow**: Budget 1-3 minutes for initial compilation depending on model size.
- **Windows**: The `reduce-overhead` and `max-autotune` modes may have issues. Use `"default"` if you hit errors.