Compare commits

...

3 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Sayak Paul | ddee236a20 | Merge branch 'main' into fp8-note-torchao | 2025-06-13 06:53:29 +05:30 |
| Sayak Paul | 01c637f3c2 | Update docs/source/en/quantization/torchao.md (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>) | 2025-06-13 06:53:20 +05:30 |
| sayakpaul | fb0c722661 | mention fp8 benefits on supported hardware. | 2025-06-12 09:34:57 +05:30 |

docs/source/en/quantization/torchao.md

@@ -65,6 +65,9 @@ transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)

For speed and memory benchmarks on Flux and CogVideoX, please refer to the table [here](https://github.com/huggingface/diffusers/pull/10009#issue-2688781450). You can also find torchao [benchmark](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks) numbers for various hardware.

> [!TIP]
> The FP8 post-training quantization schemes in torchao are effective for GPUs with compute capability of at least 8.9 (RTX-4090, Hopper, etc.). FP8 often provides the best speed, memory, and quality trade-off when generating images and videos. We recommend combining FP8 and torch.compile if your GPU is compatible.
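As a rough illustration of that combination, the sketch below quantizes the Flux transformer to FP8 and then compiles it. The `"float8dq_e4m3"` quant type string and the `black-forest-labs/FLUX.1-dev` checkpoint are assumptions for illustration, not part of this diff; check the quant types supported by your installed diffusers and torchao versions before relying on them.

```py
# Hedged sketch: FP8 quantization + torch.compile on a compatible GPU
# (compute capability >= 8.9). The "float8dq_e4m3" quant type string is an
# assumption; verify it against the supported quant types in your install.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

quantization_config = TorchAoConfig("float8dq_e4m3")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)

pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=dtype)
pipe.to("cuda")

# Compiling the quantized transformer typically adds a further speedup.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("flux_fp8.png")
```
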
torchao also supports an automatic quantization API through [autoquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#autoquantization). Autoquantization determines the best quantization strategy applicable to a model by comparing the performance of each technique on chosen input types and shapes. Currently, this can be used directly on the underlying modeling components. Diffusers will also expose an autoquant configuration option in the future.
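For reference, here is a minimal sketch of applying autoquant to an underlying modeling component, following the `torchao.autoquant` pattern from the linked torchao README; the Flux checkpoint and the choice of the transformer as the target component are illustrative assumptions, not an official diffusers recipe.

```py
# Hedged sketch: apply torchao's autoquant to a pipeline's transformer component.
# autoquant benchmarks candidate quantization strategies on the shapes it sees
# during the first forward pass and keeps the fastest option per layer.
import torch
from torchao import autoquant
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Wrap the underlying modeling component directly, as mentioned above.
pipe.transformer = autoquant(torch.compile(pipe.transformer, mode="max-autotune"))

# The first call triggers the benchmarking; later calls reuse the selected kernels.
image = pipe("a photo of a cat holding a sign that says hello").images[0]
```
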
The `TorchAoConfig` class accepts three parameters: