Compare commits

...

3 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| sayakpaul | b7f414f719 | more combos. | 2025-06-18 07:10:34 +05:30 |
| Sayak Paul | 222ed1500d | Merge branch 'main' into tip-compile-offload | 2025-06-17 16:32:17 +05:30 |
| sayakpaul | edef2da4e4 | add a tip for compile + offload | 2025-06-17 16:30:24 +05:30 |


@@ -302,6 +302,13 @@ compute-bound, [group-offloading](#group-offloading) tends to be better. Group o
</Tip>
<Tip>
When using offloading, you can additionally compile the diffusion transformer/UNet for a
good speed-memory trade-off. First set `torch._dynamo.config.cache_size_limit=1000`, and then, before calling the pipeline, add `pipeline.transformer.compile()`.
</Tip>
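
A minimal sketch of how the pieces fit together, assuming a transformer-based pipeline with model CPU offloading (the checkpoint name below is only an example; substitute your own pipeline):

```py
import torch
from diffusers import DiffusionPipeline

# Raise the recompilation cache limit so graph breaks introduced by offloading
# do not exhaust torch.compile's cache.
torch._dynamo.config.cache_size_limit = 1000

# Example checkpoint; swap in the pipeline you actually use.
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

# Compile the diffusion transformer in place before the first pipeline call.
pipeline.transformer.compile()

image = pipeline("a photo of a cat sitting on a windowsill").images[0]
```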
## Layerwise casting
Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization- and modulation-related weights) are skipped because storing them in fp8 can degrade generation quality.
@@ -365,6 +372,12 @@ apply_layerwise_casting(
)
```
<Tip>
Layerwise casting can be combined with group offloading.
</Tip>
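
A sketch of combining the two on a standalone transformer, assuming the `apply_layerwise_casting` and `apply_group_offloading` helpers from `diffusers.hooks` (the checkpoint, devices, and block grouping below are illustrative and may need adjusting for your version):

```py
import torch
from diffusers import FluxTransformer2DModel
from diffusers.hooks import apply_group_offloading, apply_layerwise_casting

# Example checkpoint; substitute the transformer you actually use.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Store weights in fp8 and upcast them to bf16 for computation.
apply_layerwise_casting(
    transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

# Keep groups of blocks on the CPU and move them to the GPU only when needed.
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=2,
)
```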
## torch.channels_last
[torch.channels_last](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) flips how tensors are stored from `(batch size, channels, height, width)` to `(batch size, height, width, channels)`. This aligns the tensor layout with the hardware's sequential memory access pattern and avoids skipping around in memory to reach pixel values.
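
For example, with a UNet-based pipeline the conversion is a single in-place call (the checkpoint below is illustrative):

```py
import torch
from diffusers import DiffusionPipeline

# Example UNet-based checkpoint; channels_last mainly benefits convolutional models.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Convert the UNet weights to the channels-last memory format in place.
pipeline.unet.to(memory_format=torch.channels_last)

# A stride of 1 in the channels dimension confirms the new storage order.
print(pipeline.unet.conv_out.state_dict()["weight"].stride())
```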