Compare commits

...

3 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| sayakpaul | b7f414f719 | more combos. | 2025-06-18 07:10:34 +05:30 |
| Sayak Paul | 222ed1500d | Merge branch 'main' into tip-compile-offload | 2025-06-17 16:32:17 +05:30 |
| sayakpaul | edef2da4e4 | add a tip for compile + offload | 2025-06-17 16:30:24 +05:30 |


@@ -302,6 +302,13 @@ compute-bound, [group-offloading](#group-offloading) tends to be better. Group o
</Tip>
<Tip>
When using offloading, you can additionally compile the diffusion transformer/UNet for a
good speed-memory trade-off. First set `torch._dynamo.config.cache_size_limit=1000`, and then, before calling the pipeline, add `pipeline.transformer.compile()`.
</Tip>
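
A minimal sketch of how the pieces fit together, assuming a transformer-based pipeline with model CPU offloading (the checkpoint name below is only an example; substitute your own pipeline):

```py
import torch
from diffusers import DiffusionPipeline

# Raise the recompilation cache limit so graph breaks introduced by offloading
# do not exhaust torch.compile's cache.
torch._dynamo.config.cache_size_limit = 1000

# Example checkpoint; swap in the pipeline you actually use.
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

# Compile the diffusion transformer in place before the first pipeline call.
pipeline.transformer.compile()

image = pipeline("a photo of a cat sitting on a windowsill").images[0]
```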
## Layerwise casting
Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization- and modulation-related weights) are skipped because storing them in fp8 can degrade generation quality.
@@ -365,6 +372,12 @@ apply_layerwise_casting(
)
```
<Tip>
Layerwise casting can be combined with group offloading.
</Tip>
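
A sketch of combining the two on a standalone transformer, assuming the `apply_layerwise_casting` and `apply_group_offloading` helpers from `diffusers.hooks` (the checkpoint, devices, and block grouping below are illustrative and may need adjusting for your version):

```py
import torch
from diffusers import FluxTransformer2DModel
from diffusers.hooks import apply_group_offloading, apply_layerwise_casting

# Example checkpoint; substitute the transformer you actually use.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Store weights in fp8 and upcast them to bf16 for computation.
apply_layerwise_casting(
    transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

# Keep groups of blocks on the CPU and move them to the GPU only when needed.
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=2,
)
```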
## torch.channels_last
[torch.channels_last](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) flips how tensors are stored from `(batch size, channels, height, width)` to `(batch size, height, width, channels)`. This aligns the tensor layout with the hardware's sequential memory access pattern and avoids skipping around in memory to reach pixel values.
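
For example, with a UNet-based pipeline the conversion is a single in-place call (the checkpoint below is illustrative):

```py
import torch
from diffusers import DiffusionPipeline

# Example UNet-based checkpoint; channels_last mainly benefits convolutional models.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Convert the UNet weights to the channels-last memory format in place.
pipeline.unet.to(memory_format=torch.channels_last)

# A stride of 1 in the channels dimension confirms the new storage order.
print(pipeline.unet.conv_out.state_dict()["weight"].stride())
```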