Mirror of https://github.com/huggingface/diffusers.git
Synced 2026-04-02 22:01:42 +08:00
Commit 1dd2004954

.ai/skills/optimizations/SKILL.md (new file, 113 lines)
---
name: optimizations
description: >
  NEVER answer optimization questions from general knowledge — ALWAYS invoke
  this skill via the Skill tool first. Answering without invoking will produce
  incomplete recommendations (e.g. missing group offloading, wrong API calls).

  IMPORTANT: When ANY tool output (especially Bash) contains
  "torch.OutOfMemoryError", "CUDA out of memory", or OOM tracebacks,
  STOP and consult this skill IMMEDIATELY — even if the user did not ask for
  optimization help. Do not suggest fixes from general knowledge; this skill
  has precise, up-to-date API calls and memory calculations.

  Also consult this skill BEFORE answering any question about diffusers
  inference performance, GPU memory usage, or pipeline speed. Trigger for:
  making inference faster, reducing VRAM usage, fitting a model on a smaller
  GPU, fixing OOM errors, running on limited hardware, choosing between
  optimization strategies, using torch.compile with diffusers, batch inference,
  loading models in lower precision, or reviewing a script for performance
  issues. Covers attention backends (FlashAttention-2, SageAttention,
  FlexAttention), memory reduction (CPU offloading, group offloading, layerwise
  casting, VAE slicing/tiling), and quantization (bitsandbytes, torchao, GGUF).

  Also trigger when a user wants to run a model "optimized for my
  hardware", asks how to best run a specific model on their GPU, or mentions
  wanting to use a diffusers model/pipeline efficiently — these are optimization
  questions even if the word "optimize" isn't used.
---

## Goal

Help users apply and debug optimizations for diffusers pipelines. There are five main areas:

1. **Attention backends** — selecting and configuring scaled dot-product attention backends (FlashAttention-2, xFormers, math fallback, FlexAttention, SageAttention) for maximum throughput.
2. **Memory reduction** — techniques to reduce peak GPU memory: model CPU offloading, group offloading, layerwise casting, VAE slicing/tiling, and attention slicing.
3. **Quantization** — reducing model precision with bitsandbytes, torchao, or GGUF to fit larger models on smaller GPUs.
4. **torch.compile** — compiling the transformer (and optionally the VAE) for a 20-50% inference speedup on repeated runs.
5. **Combining techniques** — layerwise casting + group offloading, quantization + offloading, etc.

## Workflow: When a user hits OOM or asks to fit a model on their GPU

When a user asks how to make a pipeline run on their hardware, or hits an OOM error, follow these steps **in order** before proposing any changes:

### Step 1: Detect hardware

Run these commands to understand the user's system:

```bash
# GPU VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits

# System RAM
free -g | head -2
```

Record the GPU name, total VRAM (in GB), and total system RAM (in GB). These numbers drive the recommendation.
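Turning that command output into numbers can be sketched as a small parser. The helper names and sample values below are illustrative, not part of any diffusers API:

```python
import csv
import io

def parse_nvidia_smi(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=name,memory.total,memory.free
    --format=csv,noheader,nounits`. Memory values are reported in MiB."""
    name, total, free = next(csv.reader(io.StringIO(csv_line)))
    return {
        "gpu": name.strip(),
        "vram_total_gb": round(int(total) / 1024, 1),
        "vram_free_gb": round(int(free) / 1024, 1),
    }

def parse_free_g(output: str) -> int:
    """Parse `free -g` output and return total system RAM in GB."""
    for line in output.splitlines():
        if line.startswith("Mem:"):
            return int(line.split()[1])
    raise ValueError("no Mem: line found")

# Made-up sample outputs for illustration
info = parse_nvidia_smi("NVIDIA GeForce RTX 4090, 24564, 23010")
ram_gb = parse_free_g("              total        used        free\nMem:             62          10          52")
print(info["gpu"], info["vram_total_gb"], ram_gb)
```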

### Step 2: Measure model memory and calculate strategies

Read the user's script to identify the pipeline class, model ID, `torch_dtype`, and generation params (resolution, frames).

Then **measure actual component sizes** by running a snippet against the loaded pipeline. Do NOT guess sizes from parameter counts or model cards — always measure. See [memory-calculator.md](memory-calculator.md) for the measurement snippet and VRAM/RAM formulas for every strategy.

Steps:

1. Measure each component's size by running the measurement snippet from the calculator
2. Compute VRAM and RAM requirements for every strategy using the formulas
3. Filter out strategies that don't fit the user's hardware

This is the critical step — the calculator contains exact formulas for every strategy, including the RAM cost of CUDA streams (which require ~2x the model size in pinned memory). Don't skip it: recommending `use_stream=True` to a user with limited RAM will cause swapping or OOM on the CPU side.
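The stream caveat can be sketched as a quick feasibility check. The function names are illustrative, and the 2.5x overhead factor follows the conservative estimate used in the calculator:

```python
def stream_offload_ram_gb(s_total_gb: float, overhead: float = 2.5) -> float:
    """Estimated host RAM needed for use_stream=True group offloading:
    original weights + pinned copies + allocator overhead (~2.5-3x S_total)."""
    return s_total_gb * overhead

def can_use_stream(s_total_gb: float, ram_available_gb: float) -> bool:
    return stream_offload_ram_gb(s_total_gb) <= ram_available_gb

# A 28 GB model needs ~70 GB of host RAM for streamed offloading
print(can_use_stream(28.0, 64.0))  # False
```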

### Step 3: Ask the user their preference

Present the user with a clear summary of what fits. **Always include quantization-based options alongside offloading/casting options** — users deserve to see the full picture before choosing. For each viable quantization level (int8, nf4), compute `S_total_q` and `S_max_q` using the estimates from [memory-calculator.md](memory-calculator.md) (int4/nf4 ≈ 0.25x, int8 ≈ 0.5x component size), then check fit just like the other strategies.

Present options grouped by approach so the user can compare:

> Based on your hardware (**X GB VRAM**, **Y GB RAM**) and the model requirements (~**Z GB** total, largest component ~**W GB**), here are the strategies that fit your system:
>
> **Offloading / casting strategies:**
> 1. **Quality** — [specific strategy]. Full precision, no quality loss. [estimated VRAM / RAM / speed tradeoff].
> 2. **Speed** — [specific strategy]. [quality tradeoff]. [estimated VRAM / RAM].
> 3. **Memory saving** — [specific strategy]. Minimizes VRAM. [tradeoffs].
>
> **Quantization strategies:**
> 4. **int8 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Less quality loss than int4.
> 5. **nf4 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Maximum memory savings, some quality degradation.
>
> Which would you prefer?

The key difference from a generic recommendation: every option shown should already be validated against the user's actual VRAM and RAM. Don't show options that won't fit. Read [quantization.md](quantization.md) for correct API usage when applying quantization strategies.

### Step 4: Apply the strategy

Propose **specific code changes** to the user's script. Always show the exact code diff. Read [reduce-memory.md](reduce-memory.md) and [layerwise-casting.md](layerwise-casting.md) for correct API usage before writing code.

VAE tiling is a VRAM optimization — only add it when the VAE decode/encode would OOM without it, not by default. See [reduce-memory.md](reduce-memory.md) for thresholds, the correct API (`pipe.vae.enable_tiling()` — the pipeline-level call is deprecated since v0.40.0), and which VAEs don't support it.

## Reference guides

Read these for correct API usage and detailed technique descriptions:

- [memory-calculator.md](memory-calculator.md) — **Read this first when recommending strategies.** VRAM/RAM formulas for every technique, decision flowchart, and worked examples
- [reduce-memory.md](reduce-memory.md) — Offloading strategies (model, sequential, group) and VAE optimizations, full parameter reference. **Authoritative source for compatibility rules.**
- [layerwise-casting.md](layerwise-casting.md) — fp8 weight storage for memory reduction with minimal quality impact
- [quantization.md](quantization.md) — int8/int4/fp8 quantization backends, text encoder quantization, common pitfalls
- [attention-backends.md](attention-backends.md) — attention backend selection for speed
- [torch-compile.md](torch-compile.md) — torch.compile for inference speedup

## Important compatibility rules

See [reduce-memory.md](reduce-memory.md) for the full compatibility reference. Key constraints:

- **`enable_model_cpu_offload()` and group offloading cannot coexist** on the same pipeline — use pipeline-level `enable_group_offload()` instead.
- **`torch.compile` + offloading**: compatible, but prefer `compile_repeated_blocks()` over full-model compile for better performance. See [torch-compile.md](torch-compile.md).
- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails** — int8 matmul cannot run on CPU. See [quantization.md](quantization.md) for the fix.
- **Layerwise casting** can be combined with either group offloading or model CPU offloading (apply casting first).
- **`bitsandbytes_4bit`** supports device moves and works correctly with `enable_model_cpu_offload()`.
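The constraints above can be sketched as a small compatibility check before proposing a plan. This is a hypothetical helper, not a diffusers API; it encodes only the two hard conflicts listed above:

```python
# Hard conflicts from the compatibility rules above (illustrative, not diffusers code)
INCOMPATIBLE = {
    frozenset({"model_cpu_offload", "group_offload"}):
        "cannot coexist on the same pipeline — use pipeline-level group offloading",
    frozenset({"model_cpu_offload", "bitsandbytes_8bit"}):
        "int8 matmul cannot run on CPU",
}

def check_plan(techniques: set) -> list:
    """Return human-readable conflicts for a proposed set of techniques."""
    return [reason for pair, reason in INCOMPATIBLE.items() if pair <= techniques]

print(check_plan({"layerwise_casting", "group_offload"}))      # []
print(check_plan({"model_cpu_offload", "bitsandbytes_8bit"}))  # one conflict
```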

.ai/skills/optimizations/attention-backends.md (new file, 40 lines)

# Attention Backends

## Overview

Diffusers supports multiple attention backends through `dispatch_attention_fn`. The backend affects both speed and memory usage. The right choice depends on hardware, sequence length, and whether you need features like sliding windows or custom masks.

## Available backends

| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |
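The table above can be read as a selection heuristic. This sketch is illustrative only (it is not a diffusers function, and it ignores SageAttention, which is an opt-in speed/quality tradeoff rather than a default):

```python
def choose_backend(torch_version: tuple,
                   has_flash_attn: bool,
                   ampere_or_newer: bool,
                   needs_custom_mask: bool) -> str:
    """Heuristic backend choice based on the requirements column above."""
    if needs_custom_mask and torch_version >= (2, 5):
        return "flex_attention"
    if has_flash_attn and ampere_or_newer:
        return "flash_attention_2"
    if torch_version >= (2, 0):
        return "torch_sdpa"
    return "xformers"  # older GPUs / older PyTorch

print(choose_backend((2, 4), True, True, False))    # flash_attention_2
print(choose_backend((2, 1), False, False, False))  # torch_sdpa
```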

## How to set the backend

```python
# Global default
from diffusers import set_attention_backend

set_attention_backend("flash_attention_2")

# Per-model
from diffusers.models.attention_processor import AttnProcessor2_0

pipe.transformer.set_attn_processor(AttnProcessor2_0())  # torch_sdpa

# Via environment variable:
# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2
```

## Debugging attention issues

- **NaN outputs**: check whether the attention mask dtype matches what the backend expects. Some backends require `bool` masks; others require float masks with `-inf` at masked positions.
- **Speed regression**: profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with the math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.

.ai/skills/optimizations/layerwise-casting.md (new file, 68 lines)

# Layerwise Casting

## Overview

Layerwise casting stores model weights in a smaller data format (e.g., `torch.float8_e4m3fn`) to use less memory, and upcasts them to a higher precision (e.g., `torch.bfloat16`) on the fly during computation. This cuts weight memory roughly in half (bf16 → fp8) with minimal quality impact because normalization and modulation layers are automatically skipped.

This is one of the most effective techniques for fitting a large model on a GPU that's just slightly too small — it doesn't require any special quantization libraries, just PyTorch.

## When to use

- The model **almost** fits in VRAM (e.g., a 28 GB model on a 32 GB GPU)
- You want memory savings with **less speed penalty** than offloading
- You want to **combine with group offloading** for even more savings

## Basic usage

Call `enable_layerwise_casting` on any Diffusers model component:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Store weights in fp8, compute in bf16
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

pipe.to("cuda")
```

The `storage_dtype` controls how weights are stored in memory. The `compute_dtype` controls the precision used during the actual forward pass. Normalization and modulation layers are automatically kept at full precision.

### Supported storage dtypes

| Storage dtype | Memory per param | Quality impact |
|---|---|---|
| `torch.float8_e4m3fn` | 1 byte (vs 2 for bf16) | Minimal for most models |
| `torch.float8_e5m2` | 1 byte | Slightly more range, less precision than e4m3fn |
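The savings follow directly from bytes per parameter. A quick back-of-the-envelope check (the 28B parameter count is a made-up example; real reduction is somewhat less than half because norm/embed layers stay in bf16):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Weight memory for a model with n_params parameters."""
    return n_params * bytes_per_param / 1e9

n = 28e9  # hypothetical 28B-parameter transformer
bf16_gb = weight_memory_gb(n, 2)  # 2 bytes/param in bf16
fp8_gb = weight_memory_gb(n, 1)   # 1 byte/param in fp8 (upper bound on savings)
print(bf16_gb, fp8_gb)
```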

## Functional API

For more control, use `apply_layerwise_casting` directly. This lets you target specific submodules or customize which layers to skip:

```python
import torch
from diffusers.hooks import apply_layerwise_casting

apply_layerwise_casting(
    pipe.transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    skip_modules_pattern=["norm"],  # name patterns of modules to keep at full precision
    non_blocking=True,
)
```

## Combining with other techniques

Layerwise casting is compatible with both group offloading and model CPU offloading. Always apply layerwise casting **before** enabling offloading. See [reduce-memory.md](reduce-memory.md) for code examples and the memory-savings formulas for each combination.

## Known limitations

- May not work with all models if the forward implementation contains internal typecasting of weights (the hook assumes the forward pass is independent of weight precision)
- May fail with PEFT layers (LoRA). There are some checks, but they're not guaranteed to cover all cases
- Not suitable for training — inference only
- The `compute_dtype` should match what the model expects (usually bf16 or fp16)

.ai/skills/optimizations/memory-calculator.md (new file, 298 lines)

# Memory Calculator

Use this guide to measure VRAM and RAM requirements for each optimization strategy, then recommend the best fit for the user's hardware.

## Step 1: Measure model sizes

**Do NOT guess sizes from parameter counts or model cards.** Pipelines often contain components that are not obvious from the model name (e.g., a pipeline marketed as having a "28B transformer" may also include a 24 GB text encoder, a 6 GB connectors module, etc.). Always measure by running this snippet after loading the pipeline:

```python
import torch
from diffusers import DiffusionPipeline  # or the specific pipeline class

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

for name, component in pipe.components.items():
    if hasattr(component, "parameters"):
        size_gb = sum(p.numel() * p.element_size() for p in component.parameters()) / 1e9
        print(f"{name}: {size_gb:.2f} GB")
```

For the transformer, also measure block-level and leaf-level sizes:

```python
# S_block: size of one transformer block
transformer = pipe.transformer
block_attr = None
for attr in ["transformer_blocks", "blocks", "layers"]:
    if hasattr(transformer, attr):
        block_attr = attr
        break
if block_attr:
    blocks = getattr(transformer, block_attr)
    block_size = sum(p.numel() * p.element_size() for p in blocks[0].parameters()) / 1e9
    print(f"S_block: {block_size:.2f} GB ({len(blocks)} blocks)")

# S_leaf: largest leaf module
max_leaf = max(
    (sum(p.numel() * p.element_size() for p in m.parameters(recurse=False))
     for m in transformer.modules() if list(m.parameters(recurse=False))),
    default=0
) / 1e9
print(f"S_leaf: {max_leaf:.4f} GB")
```

To measure the effect of layerwise casting on a component, apply it and re-measure:

```python
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
size_after = sum(p.numel() * p.element_size() for p in pipe.transformer.parameters()) / 1e9
print(f"Transformer after layerwise casting: {size_after:.2f} GB")
```

From the measurements, record:

- `S_total` = sum of all component sizes
- `S_max` = size of the largest single component
- `S_block` = size of one transformer block
- `S_leaf` = size of the largest leaf module
- `S_total_lc` = `S_total` after applying layerwise casting to castable components (measured, not estimated — norm/embed layers are skipped, so it's not exactly half)
- `S_max_lc` = size of the largest component after layerwise casting (measured)
- `A` = activation memory during the forward pass (cannot be measured ahead of time — estimate conservatively):
  - **Video models**: `A` scales with resolution and number of frames. A 5-second 960x544 video at 24 fps can use ~7-8 GB. Higher resolution or more seconds = more activation memory.
  - **Image models**: `A` scales with image resolution. A 1024x1024 image might use 2-4 GB, but 2048x2048 could use 8-16 GB.
  - **Edit/inpainting models**: `A` includes the reference image(s) in addition to the generation activations, so budget extra.
  - When in doubt, estimate conservatively: `A ≈ 5-8 GB` for typical video workloads, `A ≈ 2-4 GB` for typical image workloads. For high-resolution or long video, increase accordingly.

## Step 2: Compute VRAM and RAM per strategy

### No optimization (all on GPU)

| | Estimate |
|---|---|
| **VRAM** | `S_total + A` |
| **RAM** | Minimal (just for loading) |
| **Speed** | Fastest — no transfers |
| **Quality** | Full precision |

### Model CPU offloading

| | Estimate |
|---|---|
| **VRAM** | `S_max + A` (only one component on GPU at a time) |
| **RAM** | `S_total` (all components stored on CPU) |
| **Speed** | Moderate — full component transfers between CPU/GPU per step |
| **Quality** | Full precision |

### Group offloading: block_level (no stream)

| | Estimate |
|---|---|
| **VRAM** | `num_blocks_per_group * S_block + A` |
| **RAM** | `S_total` (all weights on CPU, no pinned copy) |
| **Speed** | Moderate — synchronous transfers per group |
| **Quality** | Full precision |

Tune `num_blocks_per_group` to fill available VRAM: `floor((VRAM - A) / S_block)`.
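That tuning formula, as a one-liner (the clamp to a minimum of 1 is an assumption for the degenerate case where even one block barely fits):

```python
import math

def num_blocks_per_group(vram_gb: float, a_gb: float, s_block_gb: float) -> int:
    """Blocks that fit on the GPU at once: floor((VRAM - A) / S_block)."""
    return max(1, math.floor((vram_gb - a_gb) / s_block_gb))

# 24 GB GPU, ~6 GB activations, 1.2 GB per block
print(num_blocks_per_group(24.0, 6.0, 1.2))  # 15
```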

### Group offloading: block_level (with stream)

Streams force `num_blocks_per_group=1`. The next block is prefetched while the current one runs.

| | Estimate |
|---|---|
| **VRAM** | `2 * S_block + A` (current block + prefetched next block) |
| **RAM** | `~2.5-3 * S_total` (original weights + pinned copies + allocation overhead) |
| **Speed** | Fast — overlaps transfer and compute |
| **Quality** | Full precision |

With `low_cpu_mem_usage=True`: RAM drops to `~S_total` (tensors are pinned on the fly instead of pre-pinned), but it is slower.

With `record_stream=True`: slightly more VRAM (delays memory reclamation), slightly faster (avoids stream synchronization).

> **Note on RAM estimates with streams:** Measured RAM usage is consistently higher than the theoretical `2 * S_total`. Pinned-memory allocation, CUDA runtime overhead, and memory fragmentation add ~30-50% on top. Always use `~2.5-3 * S_total` when checking whether the user has enough RAM for streamed offloading.

### Group offloading: leaf_level (no stream)

| | Estimate |
|---|---|
| **VRAM** | `S_leaf + A` (single leaf module, typically very small) |
| **RAM** | `S_total` |
| **Speed** | Slow — synchronous transfer per leaf module (many transfers) |
| **Quality** | Full precision |

### Group offloading: leaf_level (with stream)

| | Estimate |
|---|---|
| **VRAM** | `2 * S_leaf + A` (current + prefetched leaf) |
| **RAM** | `~2.5-3 * S_total` (pinned copies + overhead — see note above) |
| **Speed** | Medium-fast — overlaps transfer/compute at leaf granularity |
| **Quality** | Full precision |

With `low_cpu_mem_usage=True`: RAM drops to `~S_total`, but slower.

### Sequential CPU offloading (legacy)

| | Estimate |
|---|---|
| **VRAM** | `S_leaf + A` (similar to leaf_level group offloading) |
| **RAM** | `S_total` |
| **Speed** | Very slow — no stream support, synchronous per-leaf transfers |
| **Quality** | Full precision |

Group offloading with `leaf_level + use_stream=True` is strictly better. Prefer that.

### Layerwise casting (fp8 storage)

Reduces weight memory by casting to fp8. Norm and embedding layers are automatically skipped, so the reduction is less than 50% — always measure with the snippet above.

**`pipe.to()` caveat:** `pipe.to(device)` internally calls `module.to(device, dtype)` where `dtype` is `None` when not explicitly passed. This preserves fp8 weights. However, if the user passes a dtype explicitly (e.g., `pipe.to("cuda", torch.bfloat16)`) or the pipeline has internal dtype overrides, the fp8 storage will be cast back to bf16. When in doubt, combine with `enable_model_cpu_offload()`, which safely moves one component at a time without dtype overrides.

**Case 1: Everything on GPU** (if `S_total_lc + A <= VRAM`)

| | Estimate |
|---|---|
| **VRAM** | `S_total_lc + A` (measured — use the layerwise casting measurement snippet) |
| **RAM** | Minimal |
| **Speed** | Near-native — small cast overhead per layer |
| **Quality** | Slight degradation (fp8 weights; norm layers kept at full precision) |

Use `pipe.to("cuda")` (without explicit dtype) after applying layerwise casting, or move each component individually.

**Case 2: With model CPU offloading** (if Case 1 doesn't fit but `S_max_lc + A <= VRAM`)

| | Estimate |
|---|---|
| **VRAM** | `S_max_lc + A` (largest component after layerwise casting, one on GPU at a time) |
| **RAM** | `S_total` (all components on CPU) |
| **Speed** | Fast — small cast overhead per layer, component transfer overhead between steps |
| **Quality** | Slight degradation (fp8 weights; norm layers kept at full precision) |

Apply layerwise casting to the target components, then call `pipe.enable_model_cpu_offload()`.

### Layerwise casting + group offloading

Combines reduced weight size with offloading. The offloaded weights are in fp8, so transfers are faster and pinned copies smaller.

| | Estimate |
|---|---|
| **VRAM** | `num_blocks_per_group * S_block * 0.5 + A` (block_level) or `S_leaf * 0.5 + A` (leaf_level) |
| **RAM** | `S_total * 0.5` (no stream) or `~S_total` (with stream, pinned copy of fp8 weights) |
| **Speed** | Good — smaller transfers due to fp8 |
| **Quality** | Slight degradation from fp8 |

### Quantization (int4/nf4)

Quantization reduces weight memory but requires full-precision weights during loading. Always use `device_map="cpu"` so quantization happens on CPU.

Notation:

- `S_component_q` = quantized size of a component (int4/nf4 ≈ `S_component * 0.25`, int8 ≈ `S_component * 0.5`)
- `S_total_q` = total pipeline size after quantizing the selected components
- `S_max_q` = size of the largest single component after quantization

**Loading (with `device_map="cpu"`):**

| | Estimate |
|---|---|
| **RAM (peak during loading)** | `S_largest_component_bf16` — the full-precision weights of the largest component must fit in RAM during quantization |
| **RAM (after loading)** | `S_total_q` — all components at their final (quantized or bf16) sizes |

**Inference with `pipe.to(device)`:**

| | Estimate |
|---|---|
| **VRAM** | `S_total_q + A` (all components on GPU at once) |
| **RAM** | Minimal |
| **Speed** | Good — smaller model, possible dequantization overhead |
| **Quality** | Noticeable degradation possible, especially with int4. Try int8 first. |

**Inference with `enable_model_cpu_offload()`:**

| | Estimate |
|---|---|
| **VRAM** | `S_max_q + A` (largest component on GPU at a time) |
| **RAM** | `S_total_q` (all components stored on CPU) |
| **Speed** | Moderate — component transfers between CPU/GPU |
| **Quality** | Depends on quantization level |

## Step 3: Pick the best strategy

Given `VRAM_available` and `RAM_available`, filter strategies by what fits, then rank by the user's preference.

### Algorithm

```
1. Measure S_total, S_max, S_block, S_leaf, S_total_lc, S_max_lc, A for the pipeline
2. For each strategy (offloading, casting, AND quantization), compute estimated VRAM and RAM
3. Filter out strategies where VRAM > VRAM_available or RAM > RAM_available
4. Present ALL viable strategies to the user, grouped by approach (offloading/casting vs quantization)
5. Let the user pick based on their preference:
   - Quality: pick the one with the highest precision that fits
   - Speed: pick the one with the lowest transfer overhead
   - Memory: pick the one with the lowest VRAM usage
   - Balanced: pick the lightest technique that fits comfortably (target ~80% VRAM)
```
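Steps 2-3 of the algorithm can be sketched for a subset of strategies. The formulas are copied from the tables above; the function and strategy names are illustrative (the real decision also covers layerwise casting, leaf_level offloading, and quantization):

```python
def viable_strategies(vram_gb: float, ram_gb: float,
                      s_total: float, s_max: float,
                      s_block: float, a: float) -> list:
    """Filter a subset of strategies by hardware fit, using the Step 2 formulas."""
    estimates = {
        # name: (estimated VRAM, estimated RAM)
        "no_optimization":          (s_total + a,     0.0),
        "model_cpu_offload":        (s_max + a,       s_total),
        "group_block_no_stream_n1": (s_block + a,     s_total),        # num_blocks_per_group=1
        "group_block_stream":       (2 * s_block + a, 2.5 * s_total),  # pinned copies + overhead
    }
    return [name for name, (vram, ram) in estimates.items()
            if vram <= vram_gb and ram <= ram_gb]

# Example: 24 GB VRAM, 64 GB RAM; 28 GB model, 14 GB transformer, 0.6 GB blocks, A = 6 GB
print(viable_strategies(24.0, 64.0, 28.0, 14.0, 0.6, 6.0))
```

Here the streamed variant is filtered out because 2.5 x 28 = 70 GB exceeds the 64 GB of RAM, exactly the failure mode the stream note warns about.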

### Quantization size estimates

Always compute these alongside the offloading strategies — don't treat quantization as a last resort. Pick the largest components worth quantizing (typically the transformer, plus the text encoder if it is LLM-based):

```
S_component_int8 = S_component * 0.5
S_component_nf4 = S_component * 0.25

S_total_int8 = sum of quantized components (int8) + remaining components (bf16)
S_total_nf4 = sum of quantized components (nf4) + remaining components (bf16)
S_max_int8 = max single component after int8 quantization
S_max_nf4 = max single component after nf4 quantization
```

RAM requirement for quantization loading: `RAM >= S_largest_component_bf16` (the full-precision weights must fit during quantization). If this doesn't hold, quantization is not viable unless pre-quantized checkpoints are available.
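The estimates above, as a small helper (the component sizes in the example are made up; measure real ones with the Step 1 snippet):

```python
def quantized_sizes(components_gb: dict, to_quantize: set, factor: float) -> tuple:
    """Return (S_total_q, S_max_q) from bf16 component sizes.
    factor: 0.5 for int8, 0.25 for int4/nf4."""
    sizes = {name: gb * factor if name in to_quantize else gb
             for name, gb in components_gb.items()}
    return sum(sizes.values()), max(sizes.values())

components = {"transformer": 28.0, "text_encoder": 8.0, "vae": 0.5}  # example numbers
print(quantized_sizes(components, {"transformer", "text_encoder"}, 0.25))  # nf4
print(quantized_sizes(components, {"transformer", "text_encoder"}, 0.5))   # int8
```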

### Quick decision flowchart

Offloading / casting path:

```
VRAM >= S_total + A?
→ YES: No optimization needed (maybe an attention backend for speed)
→ NO:
  VRAM >= S_total_lc + A? (layerwise casting, everything on GPU)
  → YES: Layerwise casting, pipe.to("cuda") without explicit dtype
  → NO:
    VRAM >= S_max + A? (model CPU offload, full precision)
    → YES: Model CPU offloading
           - Want less VRAM? → add layerwise casting too
    → NO:
      VRAM >= S_max_lc + A? (layerwise casting + model CPU offload)
      → YES: Layerwise casting + model CPU offloading
      → NO: Need group offloading
        RAM >= 3 * S_total? (enough for pinned copies + overhead)
        → YES: group offload leaf_level + stream (fast)
        → NO:
          RAM >= S_total?
          → YES: group offload leaf_level + stream + low_cpu_mem_usage
                 or group offload block_level (no stream)
          → NO: Quantization required to reduce model size, then retry
```

Quantization path (evaluate in parallel with the above, not as a fallback):

```
RAM >= S_largest_component_bf16? (must fit full-precision weights during quantization)
→ NO: Cannot quantize — need more RAM or pre-quantized checkpoints
→ YES: Compute quantized sizes for target components (typically transformer + text_encoder)

  nf4 quantization:
    VRAM >= S_total_nf4 + A? → pipe.to("cuda"), fastest (no offloading overhead)
    VRAM >= S_max_nf4 + A? → model CPU offload, moderate speed

  int8 quantization:
    VRAM >= S_total_int8 + A? → pipe.to("cuda"), fastest
    VRAM >= S_max_int8 + A? → model CPU offload, moderate speed

Show all viable quantization options alongside the offloading options so the user can compare
quality/speed/memory tradeoffs across approaches.
```

.ai/skills/optimizations/quantization.md (new file, 180 lines)

# Quantization

## Overview

Quantization reduces model weights from fp16/bf16 to lower precision (int8, int4, fp8), cutting memory usage and often improving throughput. Diffusers supports several quantization backends.

## Supported backends

| Backend | Precisions | Key features |
|---|---|---|
| **bitsandbytes** | int8, int4 (nf4/fp4) | Easiest to use, widely supported, QLoRA training |
| **torchao** | int8, int4, fp8 | PyTorch-native, good for inference, `autoquant` support |
| **GGUF** | Various (Q4_K_M, Q5_K_S, etc.) | Load GGUF checkpoints directly, community-quantized models |

## Critical: Pipeline-level vs component-level quantization

**Pipeline-level quantization is the correct approach.** Pass a `PipelineQuantizationConfig` to `from_pretrained`. Do NOT pass a `BitsAndBytesConfig` directly — the pipeline's `from_pretrained` will reject it with `"quantization_config must be an instance of PipelineQuantizationConfig"`.

### Backend names in `PipelineQuantizationConfig`

The `quant_backend` string must match one of the registered backend keys. These are NOT the same as the config class names:

| `quant_backend` value | Notes |
|---|---|
| `"bitsandbytes_4bit"` | NOT `"bitsandbytes"` — the `_4bit` suffix is required |
| `"bitsandbytes_8bit"` | NOT `"bitsandbytes"` — the `_8bit` suffix is required |
| `"gguf"` | |
| `"torchao"` | |
| `"modelopt"` | |
|
||||
### `quant_kwargs` for bitsandbytes
|
||||
|
||||
**`quant_kwargs` must be non-empty.** The validator raises `ValueError: Both quant_kwargs and quant_mapping cannot be None` if it's `{}` or `None`. Always pass at least one kwarg.
|
||||
|
||||
For `bitsandbytes_4bit`, the quantizer class is selected by backend name — `load_in_4bit=True` is redundant (the quantizer ignores it) but harmless. Pass the bnb-specific options instead:
|
||||
|
||||
```python
|
||||
quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"}
|
||||
```
|
||||
|
||||
For `bitsandbytes_8bit`, there are no bnb_8bit-specific kwargs, so pass the flag explicitly to satisfy the non-empty requirement:
|
||||
|
||||
```python
|
||||
quant_kwargs={"load_in_8bit": True}
|
||||
```
|
||||
|
||||
## Usage patterns

### bitsandbytes (pipeline-level, recommended)

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

quantization_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
    components_to_quantize=["transformer"],  # specify which components to quantize
)

pipe = DiffusionPipeline.from_pretrained(
    "model_id",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # load on CPU first to avoid OOM during quantization
)
```

### torchao (pipeline-level)

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

quantization_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8_weight_only"},
    components_to_quantize=["transformer"],
)

pipe = DiffusionPipeline.from_pretrained(
    "model_id",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
```

### GGUF (pipeline-level)

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

quantization_config = PipelineQuantizationConfig(
    quant_backend="gguf",
    quant_kwargs={"compute_dtype": torch.bfloat16},
)

pipe = DiffusionPipeline.from_pretrained(
    "model_id",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
```

## Loading: memory requirements and `device_map="cpu"`

Quantization is NOT free at load time. The full-precision (bf16/fp16) weights must be loaded into memory first, then compressed. This means:

- **Without `device_map="cpu"`** (default): each component loads to GPU in full precision, gets quantized on GPU, then the full-precision copy is freed. While loading, you need VRAM for the full-precision weights of the current component PLUS all previously loaded components (quantized or not). For large models, this causes OOM.
- **With `device_map="cpu"`**: components load and quantize on CPU. This requires **RAM >= S_largest_component_bf16** — the full-precision weights of the largest component must fit in RAM during quantization. After quantization, RAM usage drops to the quantized size.

**Always pass `device_map="cpu"` when using quantization.** Then choose how to move to GPU:

1. **`pipe.to(device)`** — moves everything to GPU at once. Only works if all components (quantized + non-quantized) fit in VRAM simultaneously: `VRAM >= S_total_after_quant`.
2. **`pipe.enable_model_cpu_offload(device=device)`** — moves components to GPU one at a time during inference. Use this when `S_total_after_quant > VRAM` but `S_max_after_quant + A <= VRAM` (where `A` is activation memory).

### Memory check before recommending quantization

Before recommending quantization, verify:

- **RAM >= S_largest_component_bf16** — the full-precision weights of the largest component to be quantized must fit in RAM during loading
- **VRAM >= S_total_after_quant + A** (for `pipe.to()`) or **VRAM >= S_max_after_quant + A** (for model CPU offload) — the quantized model plus activations must fit during inference
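
The `S_*` quantities can be measured directly from a loaded pipeline. A minimal sketch — the helper name and GB units are choices made here, not diffusers API; pass it `pipe.components`, the dict every `DiffusionPipeline` exposes:

```python
import torch

def component_sizes_gb(components):
    """Parameter memory (GB) per torch module; pass pipe.components from a DiffusionPipeline.
    Non-module entries (schedulers, tokenizers) are skipped."""
    sizes = {}
    for name, module in components.items():
        if isinstance(module, torch.nn.Module):
            nbytes = sum(p.numel() * p.element_size() for p in module.parameters())
            sizes[name] = nbytes / 1e9
    return sizes
```

From the result, `sum(sizes.values())` approximates `S_total` and `max(sizes.values())` approximates `S_max` for the checks above.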

## `components_to_quantize`

Use this parameter to control which pipeline components get quantized. Common choices:

- `["transformer"]` — quantize only the denoising model
- `["transformer", "text_encoder"]` — also quantize the text encoder (see below)
- `["transformer", "text_encoder", "text_encoder_2"]` — for dual-encoder models (FLUX.1, SD3, etc.) when both encoders are large
- Omit the parameter to quantize all compatible components

The VAE and vocoder are typically small enough that quantizing them gives little benefit and can hurt quality.

### Text encoder quantization

**Quantizing the text encoder is a first-class optimization, not an afterthought.** Many modern models use LLM-based text encoders that are as large as or larger than the transformer itself:

| Model family | Text encoder | Size (bf16) |
|---|---|---|
| FLUX.2 Klein | Qwen3 | ~9 GB |
| FLUX.1 | T5-XXL | ~10 GB |
| SD3 | T5-XXL + CLIP-L + CLIP-G | ~11 GB total |
| CogVideoX | T5-XXL | ~10 GB |

Newer models (FLUX.2 Klein, etc.) use a **single LLM-based text encoder** — check the pipeline definition for `text_encoder` vs `text_encoder_2`. Never assume a CLIP+T5 dual-encoder layout.

When the text encoder is LLM-based, always include it in `components_to_quantize`. The combined savings often allow both components to fit in VRAM simultaneously, eliminating the need for CPU offloading entirely:

```python
import torch
from diffusers import PipelineQuantizationConfig, DiffusionPipeline

# Both transformer (~4.5 GB) + Qwen3 text encoder (~4.5 GB) fit in VRAM at int4
quantization_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
    components_to_quantize=["transformer", "text_encoder"],
)
pipe = DiffusionPipeline.from_pretrained("model_id", quantization_config=quantization_config, device_map="cpu")
pipe.to("cuda")  # everything fits — no offloading needed
```

Contrast this with transformer-only quantization, which may still require offloading because the text encoder alone exceeds available VRAM.

## Choosing a backend

- **Just want it to work**: bitsandbytes nf4 (`bitsandbytes_4bit`)
- **Best inference speed**: torchao int8 or fp8 (on supported hardware)
- **Using community GGUF files**: GGUF
- **Need to fine-tune**: bitsandbytes (QLoRA support)

## Common issues

- **OOM during loading**: You forgot `device_map="cpu"`. See the loading section above.
- **`quantization_config must be an instance of PipelineQuantizationConfig`**: You passed a `BitsAndBytesConfig` directly. Wrap it in `PipelineQuantizationConfig` instead.
- **`quant_backend not found`**: The backend name is wrong. Use `bitsandbytes_4bit` or `bitsandbytes_8bit`, not `bitsandbytes`. See the backend names table above.
- **`Both quant_kwargs and quant_mapping cannot be None`**: `quant_kwargs` is empty or `None`. Always pass at least one kwarg — see the `quant_kwargs` section above.
- **OOM during `pipe.to(device)` after loading**: Even quantized, the components don't all fit in VRAM at once. Use `enable_model_cpu_offload()` instead of `pipe.to(device)`.
- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails at inference**: `LLM.int8()` (bitsandbytes 8-bit) can only execute on CUDA — it cannot run on CPU. When `enable_model_cpu_offload()` moves the quantized component back to CPU between steps, the int8 matmul fails. **Fix**: keep the int8 component on CUDA permanently (`pipe.transformer.to("cuda")`) and use group offloading with `exclude_modules=["transformer"]` for the rest, or switch to `bitsandbytes_4bit`, which supports device moves.
- **Quality degradation**: int4 can produce noticeable artifacts for some models. Try int8 first, then drop to int4 if memory requires it.
- **Slow first inference**: Some backends (torchao) compile/calibrate on first run. Subsequent runs are faster.
- **Incompatible layers**: Not all layer types support all quantization schemes. Check backend docs for supported module types.
- **Training**: Only bitsandbytes supports training (via QLoRA). Other backends are inference-only.

213
.ai/skills/optimizations/reduce-memory.md
Normal file
@@ -0,0 +1,213 @@

# Reduce Memory

## Overview

Large diffusion models can exceed GPU VRAM. Diffusers provides several techniques to reduce peak memory, each with a different speed/memory tradeoff.

## Techniques (ordered by ease of use)

### 1. Model CPU offloading

Moves entire models to CPU when not in use and loads each one to GPU just before its forward pass.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
# Do NOT call pipe.to("cuda") — the hook handles device placement
```

- **Memory savings**: Significant — only one model on GPU at a time
- **Speed cost**: Moderate — full model transfers between CPU and GPU
- **When to use**: First thing to try when hitting OOM
- **Limitation**: If the single largest component (e.g. the transformer) exceeds VRAM, this won't help — you need group offloading or layerwise casting instead.

### 2. Group offloading

Offloads groups of internal layers to CPU, loading them to GPU only during their forward pass. More granular than model offloading and faster than sequential offloading.

**Two offload types:**

- `block_level` — offloads groups of N layers at a time. Lower memory, moderate speed.
- `leaf_level` — offloads individual leaf modules. Equivalent to sequential offloading, but can be made faster with CUDA streams.

**IMPORTANT**: `enable_model_cpu_offload()` will raise an error if any component has group offloading enabled. If you need offloading for the whole pipeline, use pipeline-level `enable_group_offload()` instead — it handles all components in one call.

#### Pipeline-level group offloading

Applies group offloading to ALL components in the pipeline at once. This is the simplest approach.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Option A: leaf_level with CUDA streams (recommended — fast + low memory)
pipe.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)

# Option B: block_level (more memory savings, slower)
pipe.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=2,
)
```

#### Component-level group offloading

Apply group offloading selectively to specific components. This is useful when only the transformer is too large for VRAM but the other components fit fine.

For Diffusers model components (those inheriting from `ModelMixin`), use `enable_group_offload`:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Group offload the transformer (the largest component)
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)

# Group offload the VAE too if needed (offload_device defaults to CPU)
pipe.vae.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_type="leaf_level",
)
```

For non-Diffusers components (e.g. text encoders from the transformers library), use the functional API:

```python
import torch
from diffusers.hooks import apply_group_offloading

apply_group_offloading(
    pipe.text_encoder,
    onload_device=torch.device("cuda"),
    offload_type="block_level",
    num_blocks_per_group=2,
)
```

#### CUDA streams for faster group offloading

When `use_stream=True`, the next layer is prefetched to GPU while the current layer runs, overlapping data transfer with computation. Requires roughly 2x the model's size in CPU memory.

```python
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    record_stream=True,  # slightly more speed, slightly more memory
)
```

If using `block_level` with `use_stream=True`, set `num_blocks_per_group=1` (a warning is raised otherwise).

#### Full parameter reference

Parameters available across the three group offloading APIs:

| Parameter | Pipeline | Model | `apply_group_offloading` | Description |
|---|---|---|---|---|
| `onload_device` | yes | yes | yes | Device to load layers onto for computation (e.g. `torch.device("cuda")`) |
| `offload_device` | yes | yes | yes | Device to offload layers to when idle (default: `torch.device("cpu")`) |
| `offload_type` | yes | yes | yes | `"block_level"` (groups of N layers) or `"leaf_level"` (individual modules) |
| `num_blocks_per_group` | yes | yes | yes | Required for `block_level` — how many layers per group |
| `non_blocking` | yes | yes | yes | Non-blocking data transfer between devices |
| `use_stream` | yes | yes | yes | Overlap data transfer and computation via CUDA streams. Requires ~2x the model's size in CPU RAM |
| `record_stream` | yes | yes | yes | With `use_stream`, marks tensors for the stream. Faster but slightly more memory |
| `low_cpu_mem_usage` | yes | yes | yes | Pins tensors on the fly instead of pre-pinning. Saves CPU RAM when using streams, but slower |
| `offload_to_disk_path` | yes | yes | yes | Path to offload weights to disk instead of CPU RAM. Useful when system RAM is also limited |
| `exclude_modules` | **yes** | no | no | Pipeline-only: list of component names to skip (they are placed on `onload_device` instead) |
| `block_modules` | no | **yes** | **yes** | Override which submodules are treated as blocks for `block_level` offloading |
| `exclude_kwargs` | no | **yes** | **yes** | Kwarg keys that should not be moved between devices (e.g. mutable cache state) |

### 3. Sequential CPU offloading

Moves individual layers to GPU one at a time during the forward pass.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload()
# Do NOT call pipe.to("cuda") first — it saves little memory if you do
```

- **Memory savings**: Maximum — only one layer on GPU at a time
- **Speed cost**: Very high — many small transfers per forward pass
- **When to use**: Last resort when group offloading with streams isn't enough
- **Note**: Group offloading with `leaf_level` + `use_stream=True` is essentially the same idea but faster. Prefer that.

### 4. VAE slicing

Processes VAE encode/decode in slices along the batch dimension.

```python
pipe.vae.enable_slicing()
```

- **Memory savings**: Reduces VAE peak memory for batch sizes > 1
- **Speed cost**: Minimal
- **When to use**: When generating multiple images/videos in a batch
- **Note**: `AutoencoderKLWan` and `AsymmetricAutoencoderKL` don't support slicing.
- **API note**: The pipeline-level `pipe.enable_vae_slicing()` is deprecated since v0.40.0. Use `pipe.vae.enable_slicing()`.

### 5. VAE tiling

Processes VAE encode/decode in spatial tiles. This is a **VRAM optimization** — only use it when the VAE decode/encode would OOM without it.

```python
pipe.vae.enable_tiling()
```

- **Memory savings**: Bounds VAE peak memory by tile size rather than full resolution
- **Speed cost**: Some overhead from tile overlap processing
- **When to use** (only when VAE decode would OOM):
  - **Image models**: Typically needed above ~1.5 MP on ≤16 GB GPUs, or ~4 MP on ≤32 GB GPUs
  - **Video models**: When `H × W × num_frames` is large relative to the VRAM remaining after denoising
- **When NOT to use**: At standard resolutions where the VAE fits comfortably — tiling adds overhead for no benefit
- **Note**: `AutoencoderKLWan` and `AsymmetricAutoencoderKL` don't support tiling.
- **API note**: The pipeline-level `pipe.enable_vae_tiling()` is deprecated since v0.40.0. Use `pipe.vae.enable_tiling()`.
- **Tip for group offloading with streams**: If combining VAE tiling with group offloading (`use_stream=True`), do a dummy forward pass first to avoid device mismatch errors.

### 6. Attention slicing (legacy)

```python
pipe.enable_attention_slicing()
```

- Largely superseded by `torch_sdpa` and FlashAttention
- Still useful on very old GPUs without SDPA support

## Combining techniques

Compatible combinations:

- Group offloading (pipeline-level) + VAE tiling — good general setup
- Group offloading (pipeline-level, `exclude_modules=["small_component"]`) — keeps small models on GPU, offloads large ones
- Model CPU offloading + VAE tiling — simple and effective when the largest component fits in VRAM
- Layerwise casting + group offloading — maximum savings (see [layerwise-casting.md](layerwise-casting.md))
- Layerwise casting + model CPU offloading — also works
- Quantization + model CPU offloading — works well
- Per-component group offloading with different configs — e.g. `block_level` for the transformer, `leaf_level` for the VAE

**Incompatible combinations:**

- `enable_model_cpu_offload()` on a pipeline where ANY component has group offloading — raises `ValueError`
- `enable_sequential_cpu_offload()` on a pipeline where ANY component has group offloading — same error

## Debugging OOM

1. Check which stage OOMs: loading, encoding, denoising, or decoding
2. If OOM during `.to("cuda")` — the full pipeline doesn't fit. Use model CPU offloading or group offloading
3. If OOM during denoising with model CPU offloading — the transformer alone exceeds VRAM. Use layerwise casting (see [layerwise-casting.md](layerwise-casting.md)) or group offloading instead
4. If still OOM during VAE decode, add `pipe.vae.enable_tiling()`
5. Consider quantization (see [quantization.md](quantization.md)) as a complementary approach
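
The escalation steps above can be condensed into a small decision helper. A sketch only — the strategy names and the default activation estimate are assumptions made for illustration, not library values; feed it per-component sizes in GB:

```python
def choose_strategy(vram_gb, sizes_gb, activation_gb=2.0):
    """Pick an offloading strategy from estimated component sizes (GB).

    Mirrors the rules in this document:
    - everything fits          -> plain pipe.to("cuda")
    - largest component fits   -> model CPU offloading
    - otherwise                -> group offloading (leaf_level + streams)
    """
    s_total = sum(sizes_gb.values())
    s_max = max(sizes_gb.values())
    if s_total + activation_gb <= vram_gb:
        return "to_cuda"
    if s_max + activation_gb <= vram_gb:
        return "model_cpu_offload"
    return "group_offload_leaf_stream"
```

If even group offloading OOMs during VAE decode, add `pipe.vae.enable_tiling()` on top, as in step 4.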

72
.ai/skills/optimizations/torch-compile.md
Normal file
@@ -0,0 +1,72 @@

# torch.compile

## Overview

`torch.compile` traces a model's forward pass and compiles it to optimized machine code (via Triton or other backends). For diffusers, it typically speeds up the denoising loop by 20-50% after a warmup period.

## Full model compilation

Compile individual components, not the whole pipeline:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16).to("cuda")

pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True)
# Optionally compile the VAE decoder too
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
```

The first 1-3 inference calls are slow (compilation/warmup). Subsequent calls are fast. Always do a warmup run before benchmarking.

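
A minimal timing harness that respects the warmup requirement — the helper and its step counts are illustrative choices, not a diffusers utility:

```python
import time
import torch

def benchmark(fn, warmup=3, iters=10):
    """Average wall time of fn() after warmup runs; syncs CUDA so timings are honest."""
    for _ in range(warmup):  # triggers compilation / autotuning on compiled models
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Usage: benchmark(lambda: pipe(prompt, num_inference_steps=20))
```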
## Regional compilation (preferred)

Regional compilation compiles only the frequently repeated sub-modules (the transformer blocks) instead of the whole model. It provides the same runtime speedup with ~8-10x faster compile time and better compatibility with offloading.

Diffusers models declare their repeated blocks via the `_repeated_blocks` class attribute (a list of class name strings). Most modern transformers define this:

```python
# FluxTransformer defines:
_repeated_blocks = ["FluxTransformerBlock", "FluxSingleTransformerBlock"]
```

Use `compile_repeated_blocks()` to compile them:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer.compile_repeated_blocks(fullgraph=True)
```

**Always guard before calling** — it raises `ValueError` if `_repeated_blocks` is empty or the named classes aren't found. Use this pattern universally, whether or not you're using offloading:

```python
# Works with or without enable_model_cpu_offload() / enable_group_offload()
if getattr(pipe.transformer, "_repeated_blocks", None):
    pipe.transformer.compile_repeated_blocks(fullgraph=True)
else:
    pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True)
```

`torch.compile` is compatible with diffusers' offloading methods — the offloading hooks use `@torch.compiler.disable()` on device-transfer operations so they run natively outside the compiled graph. Regional compilation is preferred when combining with offloading because it avoids compiling the parts that interact with the hooks.

Models with `_repeated_blocks` defined include: Flux, Flux2, HunyuanVideo, LTX2Video, Wan, CogVideo, SD3, UNet2DConditionModel, and most other modern architectures.

## Compile modes

| Mode | Speed gain | Compile time | Notes |
|---|---|---|---|
| `"default"` | Moderate | Fast | Safe starting point |
| `"reduce-overhead"` | Good | Moderate | Reduces Python overhead via CUDA graphs |
| `"max-autotune"` | Best | Very slow | Tries many kernel configs; best for repeated inference |

## `fullgraph=True`

Requires the entire forward pass to be compilable as a single graph. Most diffusers transformers support this. If you hit a `torch._dynamo` graph-break error, remove `fullgraph=True` to allow partial compilation.

## Limitations

- **Dynamic shapes**: Changing resolution between calls triggers recompilation. Use `torch.compile(..., dynamic=True)` for variable resolutions, at some speed cost.
- **First call is slow**: Budget 1-3 minutes for initial compilation depending on model size.
- **Windows**: The `reduce-overhead` and `max-autotune` modes may have issues. Use `"default"` if you hit errors.