mirror of https://github.com/huggingface/diffusers.git
synced 2025-12-06 20:44:33 +08:00

Compare commits: 8 commits (sage-kerne...ltx-098-me)

| SHA1 |
|---|
| 4eaf84e2d3 |
| aed636f5f0 |
| 53a10518b9 |
| b4e6dc3037 |
| 3eb40786ca |
| a4bc845478 |
| fa468c5d57 |
| 47d93ce35e |
.gitignore (vendored): 3 changes

@@ -125,6 +125,9 @@ dmypy.json
 .vs
 .vscode
 
+# Cursor
+.cursor
+
 # Pycharm
 .idea
@@ -49,7 +49,7 @@
 isExpanded: false
 sections:
 - local: using-diffusers/weighted_prompts
-  title: Prompt techniques
+  title: Prompting
 - local: using-diffusers/create_a_server
   title: Create a server
 - local: using-diffusers/batched_inference
@@ -15,7 +15,7 @@ specific language governing permissions and limitations under the License.
 
 [IP-Adapter](https://hf.co/papers/2308.06721) is a lightweight adapter that enables prompting a diffusion model with an image. This method decouples the cross-attention layers of the image and text features. The image features are generated from an image encoder.
 
 > [!TIP]
-> Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading](../../using-diffusers/loading_adapters#ip-adapter) guide, and you can see how to use it in the [usage](../../using-diffusers/ip_adapter) guide.
+> Learn how to load and use an IP-Adapter checkpoint and image in the [IP-Adapter](../../using-diffusers/ip_adapter) guide.
 
 ## IPAdapterMixin
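The decoupled cross-attention described in the hunk above can be sketched in a few lines. This is a toy NumPy illustration, not the diffusers implementation: it assumes identity projections, a single head, and a hypothetical adapter weight `scale` that blends the image branch into the text branch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention over one head
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
latent = rng.standard_normal((4, d))   # query tokens from the denoiser
text = rng.standard_normal((3, d))     # text encoder features
image = rng.standard_normal((2, d))    # image encoder features

# Decoupled cross-attention: separate attention over each modality,
# with the image branch added in, scaled by an adapter weight.
scale = 0.8  # hypothetical IP-Adapter scale
out = attend(latent, text, text) + scale * attend(latent, image, image)
```

Setting `scale` to zero recovers plain text-only cross-attention, which is why the adapter is "lightweight": the base model's text path is untouched.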
@@ -34,7 +34,7 @@ LoRA is a fast and lightweight training method that inserts and trains a signifi
 
 - [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, unload LoRAs, and more.
 
 > [!TIP]
-> To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
+> To learn more about how to load LoRA weights, see the [LoRA](../../tutorials/using_peft_for_inference) loading guide.
 
 ## LoraBaseMixin
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
 
 # PEFT
 
-Diffusers supports loading adapters such as [LoRA](../../using-diffusers/loading_adapters) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`], [`SD3Transformer2DModel`] to operate with an adapter.
+Diffusers supports loading adapters such as [LoRA](../../tutorials/using_peft_for_inference) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`], [`SD3Transformer2DModel`] to operate with an adapter.
 
 > [!TIP]
 > Refer to the [Inference with PEFT](../../tutorials/using_peft_for_inference.md) tutorial for an overview of how to use PEFT in Diffusers for inference.
@@ -17,7 +17,7 @@ Textual Inversion is a training method for personalizing models by learning new
 
 [`TextualInversionLoaderMixin`] provides a function for loading Textual Inversion embeddings from Diffusers and Automatic1111 into the text encoder and loading a special token to activate the embeddings.
 
 > [!TIP]
-> To learn more about how to load Textual Inversion embeddings, see the [Textual Inversion](../../using-diffusers/loading_adapters#textual-inversion) loading guide.
+> To learn more about how to load Textual Inversion embeddings, see the [Textual Inversion](../../using-diffusers/textual_inversion_inference) loading guide.
 
 ## TextualInversionLoaderMixin
@@ -17,7 +17,7 @@ This class is useful when *only* loading weights into a [`SD3Transformer2DModel`
 
 The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs.
 
 > [!TIP]
-> To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
+> To learn more about how to load LoRA weights, see the [LoRA](../../tutorials/using_peft_for_inference) loading guide.
 
 ## SD3Transformer2DLoadersMixin
@@ -17,7 +17,7 @@ Some training methods - like LoRA and Custom Diffusion - typically target the UN
 
 The [`UNet2DConditionLoadersMixin`] class provides functions for loading and saving weights, fusing and unfusing LoRAs, disabling and enabling LoRAs, and setting and deleting adapters.
 
 > [!TIP]
-> To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
+> To learn more about how to load LoRA weights, see the [LoRA](../../tutorials/using_peft_for_inference) guide.
 
 ## UNet2DConditionLoadersMixin
@@ -418,7 +418,7 @@ When unloading the Control LoRA weights, call `pipe.unload_lora_weights(reset_to
 
 ## IP-Adapter
 
 > [!TIP]
-> Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.
+> Check out [IP-Adapter](../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.
 
 An IP-Adapter lets you prompt Flux with images, in addition to the text prompt. This is especially useful when describing complex concepts that are difficult to articulate through text alone and for which you have reference images.
@@ -21,7 +21,7 @@
 
 ## Available models
 
-The following models are available for the [`HiDreamImagePipeline`](text-to-image) pipeline:
+The following models are available for the [`HiDreamImagePipeline`] pipeline:
 
 | Model name | Description |
 |:---|:---|
@@ -254,8 +254,8 @@ export_to_video(video, "output.mp4", fps=24)
 pipeline.vae.enable_tiling()
 
 def round_to_nearest_resolution_acceptable_by_vae(height, width):
-    height = height - (height % pipeline.vae_temporal_compression_ratio)
-    width = width - (width % pipeline.vae_temporal_compression_ratio)
+    height = height - (height % pipeline.vae_spatial_compression_ratio)
+    width = width - (width % pipeline.vae_spatial_compression_ratio)
     return height, width
 
 prompt = """
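The corrected helper above rounds each dimension down to a multiple of the VAE's spatial compression ratio. It can be exercised standalone; this sketch replaces the pipeline attribute with an assumed constant of 32, so the exact value is an assumption, not the model's actual config.

```python
# Standalone sketch of the rounding helper, with the pipeline's
# vae_spatial_compression_ratio replaced by an assumed constant of 32.
VAE_SPATIAL_COMPRESSION_RATIO = 32

def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Round each dimension down to the nearest multiple of the ratio
    height = height - (height % VAE_SPATIAL_COMPRESSION_RATIO)
    width = width - (width % VAE_SPATIAL_COMPRESSION_RATIO)
    return height, width

# 480x832 scaled by 2/3 gives 320x554; rounding makes it VAE-friendly
print(round_to_nearest_resolution_acceptable_by_vae(320, 554))  # (320, 544)
```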
@@ -325,6 +325,95 @@ export_to_video(video, "output.mp4", fps=24)
 
 </details>
 
+- LTX-Video 0.9.8 distilled model is similar to the 0.9.7 variant. It is guidance- and timestep-distilled, and similar inference code can be used as above. An improvement of this version is that it supports generating very long videos. Additionally, it supports using tone mapping to improve the quality of the generated video via the `tone_map_compression_ratio` parameter. The default value of `0.6` is recommended.
+
+<details>
+<summary>Show example code</summary>
+
+```python
+import torch
+from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
+from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
+from diffusers.pipelines.ltx.modeling_latent_upsampler import LTXLatentUpsamplerModel
+from diffusers.utils import export_to_video, load_video
+
+pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.8-13B-distilled", torch_dtype=torch.bfloat16)
+# TODO: Update the checkpoint here once updated in LTX org
+upsampler = LTXLatentUpsamplerModel.from_pretrained("a-r-r-o-w/LTX-0.9.8-Latent-Upsampler", torch_dtype=torch.bfloat16)
+pipe_upsample = LTXLatentUpsamplePipeline(vae=pipeline.vae, latent_upsampler=upsampler).to(torch.bfloat16)
+pipeline.to("cuda")
+pipe_upsample.to("cuda")
+pipeline.vae.enable_tiling()
+
+def round_to_nearest_resolution_acceptable_by_vae(height, width):
+    height = height - (height % pipeline.vae_spatial_compression_ratio)
+    width = width - (width % pipeline.vae_spatial_compression_ratio)
+    return height, width
+
+prompt = """The camera pans over a snow-covered mountain range, revealing a vast expanse of snow-capped peaks and valleys. The mountains are covered in a thick layer of snow, with some areas appearing almost white while others have a slightly darker, almost grayish hue. The peaks are jagged and irregular, with some rising sharply into the sky while others are more rounded. The valleys are deep and narrow, with steep slopes that are also covered in snow. The trees in the foreground are mostly bare, with only a few leaves remaining on their branches. The sky is overcast, with thick clouds obscuring the sun. The overall impression is one of peace and tranquility, with the snow-covered mountains standing as a testament to the power and beauty of nature."""
+# prompt = """A woman walks away from a white Jeep parked on a city street at night, then ascends a staircase and knocks on a door. The woman, wearing a dark jacket and jeans, walks away from the Jeep parked on the left side of the street, her back to the camera; she walks at a steady pace, her arms swinging slightly by her sides; the street is dimly lit, with streetlights casting pools of light on the wet pavement; a man in a dark jacket and jeans walks past the Jeep in the opposite direction; the camera follows the woman from behind as she walks up a set of stairs towards a building with a green door; she reaches the top of the stairs and turns left, continuing to walk towards the building; she reaches the door and knocks on it with her right hand; the camera remains stationary, focused on the doorway; the scene is captured in real-life footage."""
+negative_prompt = "bright colors, symbols, graffiti, watermarks, worst quality, inconsistent motion, blurry, jittery, distorted"
+expected_height, expected_width = 480, 832
+downscale_factor = 2 / 3
+# num_frames = 161
+num_frames = 361
+
+# 1. Generate video at smaller resolution
+downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
+downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
+latents = pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=downscaled_width,
+    height=downscaled_height,
+    num_frames=num_frames,
+    timesteps=[1000, 993, 987, 981, 975, 909, 725, 0.03],
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    image_cond_noise_scale=0.0,
+    guidance_scale=1.0,
+    guidance_rescale=0.7,
+    generator=torch.Generator().manual_seed(0),
+    output_type="latent",
+).frames
+
+# 2. Upscale generated video using latent upsampler with fewer inference steps
+# The available latent upsampler upscales the height/width by 2x
+upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
+upscaled_latents = pipe_upsample(
+    latents=latents,
+    adain_factor=1.0,
+    tone_map_compression_ratio=0.6,
+    output_type="latent"
+).frames
+
+# 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
+video = pipeline(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    width=upscaled_width,
+    height=upscaled_height,
+    num_frames=num_frames,
+    denoise_strength=0.999,  # Effectively, 4 inference steps out of 5
+    timesteps=[1000, 909, 725, 421, 0],
+    latents=upscaled_latents,
+    decode_timestep=0.05,
+    decode_noise_scale=0.025,
+    image_cond_noise_scale=0.0,
+    guidance_scale=1.0,
+    guidance_rescale=0.7,
+    generator=torch.Generator().manual_seed(0),
+    output_type="pil",
+).frames[0]
+
+# 4. Downscale the video to the expected resolution
+video = [frame.resize((expected_width, expected_height)) for frame in video]
+
+export_to_video(video, "output.mp4", fps=24)
+```
+
+</details>
+
 - LTX-Video supports LoRAs with [`~loaders.LTXVideoLoraLoaderMixin.load_lora_weights`].
 
 <details>
@@ -32,7 +32,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
 | [Attend-and-Excite](attend_and_excite) | text2image |
 | [AudioLDM](audioldm) | text2audio |
 | [AudioLDM2](audioldm2) | text2audio |
-| [AuraFlow](auraflow) | text2image |
+| [AuraFlow](aura_flow) | text2image |
 | [BLIP Diffusion](blip_diffusion) | text2image |
 | [Bria 3.2](bria_3_2) | text2image |
 | [CogVideoX](cogvideox) | text2video |
@@ -109,7 +109,7 @@ image_1 = load_image("https://huggingface.co/datasets/huggingface/documentation-
 image_2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peng.png")
 image = pipe(
     image=[image_1, image_2],
-    prompt="put the penguin and the cat at a game show called "Qwen Edit Plus Games"",
+    prompt='''put the penguin and the cat at a game show called "Qwen Edit Plus Games"''',
     num_inference_steps=50
 ).images[0]
 ```
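The fix above replaces broken nested double quotes with a triple-quoted string, a plain Python quoting concern rather than anything diffusers-specific. Any of the following spellings produce the identical prompt string:

```python
# Three equivalent ways to embed double quotes in a Python string literal
a = '''put the penguin and the cat at a game show called "Qwen Edit Plus Games"'''
b = "put the penguin and the cat at a game show called \"Qwen Edit Plus Games\""
c = 'put the penguin and the cat at a game show called "Qwen Edit Plus Games"'
assert a == b == c
```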
@@ -271,7 +271,7 @@ Check out the full script [here](https://gist.github.com/sayakpaul/508d89d7aad4f
 
 Quantization helps reduce the memory requirements of very large models by storing model weights in a lower-precision data type. However, quantization may have a varying impact on video quality depending on the video model.
 
-Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`StableDiffusion3Pipeline`] for inference with bitsandbytes.
+Refer to the [Quantization](../../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`StableDiffusion3Pipeline`] for inference with bitsandbytes.
 
 ```py
 import torch
@@ -29,7 +29,7 @@ The abstract from the paper is:
 
 Video generation is memory-intensive, and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.
 
-Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
+Check out the [Text or image-to-video](../../../using-diffusers/text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
 
 ## StableVideoDiffusionPipeline
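The feedforward chunking mentioned above works because an MLP acts on each sequence position independently, so processing the sequence in slices yields the same result with a smaller peak of intermediate activations. A toy NumPy sketch (not the diffusers implementation) makes the equivalence concrete:

```python
import numpy as np

# Toy sketch of feed-forward chunking: process the sequence dimension in
# slices so only one chunk's intermediate activations exist at a time.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))   # (sequence, hidden)
w1 = rng.standard_normal((8, 32))  # up-projection
w2 = rng.standard_normal((32, 8))  # down-projection

def feed_forward(h):
    return np.maximum(h @ w1, 0) @ w2  # ReLU MLP, position-wise

full = feed_forward(x)
chunked = np.concatenate([feed_forward(c) for c in np.array_split(x, 4)])
assert np.allclose(full, chunked)  # same output, smaller peak memory
```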
@@ -172,7 +172,7 @@ Here are some sample outputs:
 
 Video generation is memory-intensive, and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.
 
-Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
+Check out the [Text or image-to-video](../../using-diffusers/text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
 
 > [!TIP]
 > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
@@ -26,6 +26,10 @@ Utility and helper functions for working with 🤗 Diffusers.
 
 [[autodoc]] utils.load_image
 
+## load_video
+
+[[autodoc]] utils.load_video
+
 ## export_to_gif
 
 [[autodoc]] utils.export_to_gif
@@ -548,4 +548,4 @@ Training the DeepFloyd IF model can be challenging, but here are some tips that
 
 Congratulations on training your DreamBooth model! To learn more about how to use your new model, the following guide may be helpful:
 
-- Learn how to [load a DreamBooth](../using-diffusers/loading_adapters) model for inference if you trained your model with LoRA.
+- Learn how to [load a DreamBooth](../using-diffusers/dreambooth) model for inference if you trained your model with LoRA.
@@ -75,7 +75,7 @@ accelerate launch train_lcm_distill_sd_wds.py \
 
 Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so you'll focus on the parameters that are relevant to latent consistency distillation in this guide.
 
 - `--pretrained_teacher_model`: the path to a pretrained latent diffusion model to use as the teacher model
-- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify an alternative VAE (like this [VAE]((https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)) by madebyollin which works in fp16)
+- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify an alternative VAE (like this [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) by madebyollin which works in fp16)
 - `--w_min` and `--w_max`: the minimum and maximum guidance scale values for guidance scale sampling
 - `--num_ddim_timesteps`: the number of timesteps for DDIM sampling
 - `--loss_type`: the type of loss (L2 or Huber) to calculate for latent consistency distillation; Huber loss is generally preferred because it's more robust to outliers
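The L2 versus Huber trade-off named in `--loss_type` above is easy to see numerically: Huber is quadratic near zero but only linear for large residuals, so an outlier contributes far less to the gradient. A small self-contained sketch (scalar residuals, `delta=1.0` assumed):

```python
def l2_loss(residual):
    return residual ** 2

def huber_loss(residual, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond it, so outliers
    # are penalized much less than under L2.
    a = abs(residual)
    return 0.5 * a ** 2 if a <= delta else delta * (a - 0.5 * delta)

print(l2_loss(0.5), huber_loss(0.5))    # 0.25 0.125
print(l2_loss(10.0), huber_loss(10.0))  # 100.0 9.5
```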
@@ -245,5 +245,5 @@ The SDXL training script is discussed in more detail in the [SDXL training](sdxl
 
 Congratulations on distilling an LCM model! To learn more about LCM, the following may be helpful:
 
-- Learn how to use [LCMs for inference](../using-diffusers/lcm) for text-to-image, image-to-image, and with LoRA checkpoints.
+- Learn how to use [LCMs for inference](../using-diffusers/inference_with_lcm) for text-to-image, image-to-image, and with LoRA checkpoints.
 - Read the [SDXL in 4 steps with Latent Consistency LoRAs](https://huggingface.co/blog/lcm_lora) blog post to learn more about SDXL LCM-LoRA's for super fast inference, quality comparisons, benchmarks, and more.
@@ -198,5 +198,5 @@ image = pipeline("A naruto with blue eyes").images[0]
 
 Congratulations on training a new model with LoRA! To learn more about how to use your new model, the following guides may be helpful:
 
-- Learn how to [load different LoRA formats](../using-diffusers/loading_adapters#LoRA) trained using community trainers like Kohya and TheLastBen.
+- Learn how to [load different LoRA formats](../tutorials/using_peft_for_inference) trained using community trainers like Kohya and TheLastBen.
 - Learn how to use and [combine multiple LoRA's](../tutorials/using_peft_for_inference) with PEFT for inference.
@@ -178,5 +178,5 @@ image.save("yoda-naruto.png")
 
 Congratulations on training your own text-to-image model! To learn more about how to use your new model, the following guides may be helpful:
 
-- Learn how to [load LoRA weights](../using-diffusers/loading_adapters#LoRA) for inference if you trained your model with LoRA.
+- Learn how to [load LoRA weights](../tutorials/using_peft_for_inference) for inference if you trained your model with LoRA.
 - Learn more about how certain parameters like guidance scale or techniques such as prompt weighting can help you control inference in the [Text-to-image](../using-diffusers/conditional_image_generation) task guide.
@@ -203,5 +203,4 @@ image.save("cat-train.png")
 
 Congratulations on training your own Textual Inversion model! 🎉 To learn more about how to use your new model, the following guides may be helpful:
 
-- Learn how to [load Textual Inversion embeddings](../using-diffusers/loading_adapters) and also use them as negative embeddings.
-- Learn how to use [Textual Inversion](textual_inversion_inference) for inference with Stable Diffusion 1/2 and Stable Diffusion XL.
+- Learn how to [load Textual Inversion embeddings](../using-diffusers/textual_inversion_inference) and also use them as negative embeddings.
@@ -16,24 +16,24 @@ Batch inference processes multiple prompts at a time to increase throughput. It
 
 The downside is increased latency because you must wait for the entire batch to complete, and more GPU memory is required for large batches.
 
 <hfoptions id="usage">
 <hfoption id="text-to-image">
 
-For text-to-image, pass a list of prompts to the pipeline.
+For text-to-image, pass a list of prompts to the pipeline, and for image-to-image, pass a list of images and prompts. The example below demonstrates batched text-to-image inference.
 
 ```py
 import torch
 import matplotlib.pyplot as plt
 from diffusers import DiffusionPipeline
 
 pipeline = DiffusionPipeline.from_pretrained(
     "stabilityai/stable-diffusion-xl-base-1.0",
-    torch_dtype=torch.float16
-).to("cuda")
+    torch_dtype=torch.float16,
+    device_map="cuda"
+)
 
 prompts = [
-    "cinematic photo of A beautiful sunset over mountains, 35mm photograph, film, professional, 4k, highly detailed",
-    "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
-    "pixel-art a cozy coffee shop interior, low-res, blocky, pixel art style, 8-bit graphics"
+    "Cinematic shot of a cozy coffee shop interior, warm pastel light streaming through a window where a cat rests. Shallow depth of field, glowing cups in soft focus, dreamy lofi-inspired mood, nostalgic tones, framed like a quiet film scene.",
+    "Polaroid-style photograph of a cozy coffee shop interior, bathed in warm pastel light. A cat sits on the windowsill near steaming mugs. Soft, slightly faded tones and dreamy blur evoke nostalgia, a lofi mood, and the intimate, imperfect charm of instant film.",
+    "Soft watercolor illustration of a cozy coffee shop interior, pastel washes of color filling the space. A cat rests peacefully on the windowsill as warm light glows through. Gentle brushstrokes create a dreamy, lofi-inspired atmosphere with whimsical textures and nostalgic calm.",
+    "Isometric pixel-art illustration of a cozy coffee shop interior in detailed 8-bit style. Warm pastel light fills the space as a cat rests on the windowsill. Blocky furniture and tiny mugs add charm, low-res retro graphics enhance the nostalgic, lofi-inspired game aesthetic."
 ]
 
 images = pipeline(
@@ -52,6 +52,10 @@ plt.tight_layout()
 plt.show()
 ```
 
+<div class="flex justify-center">
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/batch-inference.png"/>
+</div>
+
 To generate multiple variations of one prompt, use the `num_images_per_prompt` argument.
 
 ```py
@@ -61,11 +65,18 @@ from diffusers import DiffusionPipeline
 
 pipeline = DiffusionPipeline.from_pretrained(
     "stabilityai/stable-diffusion-xl-base-1.0",
-    torch_dtype=torch.float16
-).to("cuda")
+    torch_dtype=torch.float16,
+    device_map="cuda"
+)
 
+prompt="""
+Isometric pixel-art illustration of a cozy coffee shop interior in detailed 8-bit style. Warm pastel light fills the
+space as a cat rests on the windowsill. Blocky furniture and tiny mugs add charm, low-res retro graphics enhance the
+nostalgic, lofi-inspired game aesthetic.
+"""
+
 images = pipeline(
-    prompt="pixel-art a cozy coffee shop interior, low-res, blocky, pixel art style, 8-bit graphics",
+    prompt=prompt,
     num_images_per_prompt=4
 ).images
@@ -81,6 +92,10 @@ plt.tight_layout()
 plt.show()
 ```
 
+<div class="flex justify-center">
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/batch-inference-2.png"/>
+</div>
+
 Combine both approaches to generate different variations of different prompts.
 
 ```py
@@ -89,7 +104,7 @@ images = pipeline(
     num_images_per_prompt=2,
 ).images
 
-fig, axes = plt.subplots(2, 2, figsize=(12, 12))
+fig, axes = plt.subplots(2, 4, figsize=(12, 12))
 axes = axes.flatten()
 
 for i, image in enumerate(images):
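The subplot fix above sizes the grid for 8 outputs (4 prompts × 2 variations each) instead of 4. A hypothetical helper, not part of the diffusers docs, makes the arithmetic explicit:

```python
import math

# Pick a grid with a fixed number of columns for N generated images:
# 8 images with 4 columns need ceil(8 / 4) = 2 rows, hence subplots(2, 4).
def grid_shape(n_images, n_cols=4):
    return math.ceil(n_images / n_cols), n_cols

print(grid_shape(8))  # (2, 4)
print(grid_shape(4))  # (1, 4)
```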
@@ -101,126 +116,18 @@ plt.tight_layout()
 plt.show()
 ```
 
 </hfoption>
-<hfoption id="image-to-image">
-
-For image-to-image, pass a list of input images and prompts to the pipeline.
-
-```py
-import torch
-from diffusers.utils import load_image
-from diffusers import DiffusionPipeline
-
-pipeline = DiffusionPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0",
-    torch_dtype=torch.float16
-).to("cuda")
-
-input_images = [
-    load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"),
-    load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"),
-    load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/detail-prompt.png")
-]
-
-prompts = [
-    "cinematic photo of a beautiful sunset over mountains, 35mm photograph, film, professional, 4k, highly detailed",
-    "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
-    "pixel-art a cozy coffee shop interior, low-res, blocky, pixel art style, 8-bit graphics"
-]
-
-images = pipeline(
-    prompt=prompts,
-    image=input_images,
-    guidance_scale=8.0,
-    strength=0.5
-).images
-
-fig, axes = plt.subplots(2, 2, figsize=(12, 12))
-axes = axes.flatten()
-
-for i, image in enumerate(images):
-    axes[i].imshow(image)
-    axes[i].set_title(f"Image {i+1}")
-    axes[i].axis('off')
-
-plt.tight_layout()
-plt.show()
-```
-
-To generate multiple variations of one prompt, use the `num_images_per_prompt` argument.
-
-```py
-import torch
-import matplotlib.pyplot as plt
-from diffusers.utils import load_image
-from diffusers import DiffusionPipeline
-
-pipeline = DiffusionPipeline.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0",
-    torch_dtype=torch.float16
-).to("cuda")
-
-input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/detail-prompt.png")
-
-images = pipeline(
-    prompt="pixel-art a cozy coffee shop interior, low-res, blocky, pixel art style, 8-bit graphics",
-    image=input_image,
-    num_images_per_prompt=4
-).images
-
-fig, axes = plt.subplots(2, 2, figsize=(12, 12))
-axes = axes.flatten()
-
-for i, image in enumerate(images):
-    axes[i].imshow(image)
-    axes[i].set_title(f"Image {i+1}")
-    axes[i].axis('off')
-
-plt.tight_layout()
-plt.show()
-```
-
-Combine both approaches to generate different variations of different prompts.
-
-```py
-input_images = [
-    load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"),
-    load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/detail-prompt.png")
-]
-
-prompts = [
-    "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
-    "pixel-art a cozy coffee shop interior, low-res, blocky, pixel art style, 8-bit graphics"
-]
-
-images = pipeline(
-    prompt=prompts,
-    image=input_images,
-    num_images_per_prompt=2,
-).images
-
-fig, axes = plt.subplots(2, 2, figsize=(12, 12))
-axes = axes.flatten()
-
-for i, image in enumerate(images):
-    axes[i].imshow(image)
-    axes[i].set_title(f"Image {i+1}")
-    axes[i].axis('off')
-
-plt.tight_layout()
-plt.show()
-```
-
-</hfoption>
 </hfoptions>
+
+<div class="flex justify-center">
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/batch-inference-3.png"/>
+</div>
 
 ## Deterministic generation
 
 Enable reproducible batch generation by passing a list of [Generator's](https://pytorch.org/docs/stable/generated/torch.Generator.html) to the pipeline and tying each `Generator` to a seed so you can reuse it.
 
-Use a list comprehension to iterate over the batch size specified in `range()` to create a unique `Generator` object for each image in the batch.
 
 > [!TIP]
 > Refer to the [Reproducibility](./reusing_seeds) docs to learn more about deterministic algorithms and the `Generator` object.
 
-Don't multiply the `Generator` by the batch size because that only creates one `Generator` object that is used sequentially for each image in the batch.
+Use a list comprehension to iterate over the batch size specified in `range()` to create a unique `Generator` object for each image in the batch. Don't multiply the `Generator` by the batch size because that only creates one `Generator` object that is used sequentially for each image in the batch.
 
 ```py
 generator = [torch.Generator(device="cuda").manual_seed(0)] * 3
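The pitfall warned about above (multiplying the list instead of using a comprehension) can be demonstrated without a GPU or torch at all, using Python's stdlib `random.Random` as a stand-in for `torch.Generator`; the aliasing behavior is identical:

```python
import random

# Multiplying a one-element list repeats the SAME generator object...
shared = [random.Random(0)] * 3
assert shared[0] is shared[1] is shared[2]
draws_shared = [g.random() for g in shared]  # one shared state, advanced 3 times

# ...while a list comprehension creates an independent generator per image.
independent = [random.Random(0) for _ in range(3)]
draws_independent = [g.random() for g in independent]

assert len(set(draws_independent)) == 1  # identical: each starts fresh from seed 0
assert len(set(draws_shared)) == 3       # distinct: the single state kept advancing
```

This is why reusing a seed per image only reproduces results when each batch slot owns its own generator.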
@@ -234,14 +141,16 @@ from diffusers import DiffusionPipeline
 
 pipeline = DiffusionPipeline.from_pretrained(
     "stabilityai/stable-diffusion-xl-base-1.0",
-    torch_dtype=torch.float16
-).to("cuda")
+    torch_dtype=torch.float16,
+    device_map="cuda"
+)
 
 generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(3)]
 prompts = [
-    "cinematic photo of A beautiful sunset over mountains, 35mm photograph, film, professional, 4k, highly detailed",
-    "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
-    "pixel-art a cozy coffee shop interior, low-res, blocky, pixel art style, 8-bit graphics"
+    "Cinematic shot of a cozy coffee shop interior, warm pastel light streaming through a window where a cat rests. Shallow depth of field, glowing cups in soft focus, dreamy lofi-inspired mood, nostalgic tones, framed like a quiet film scene.",
+    "Polaroid-style photograph of a cozy coffee shop interior, bathed in warm pastel light. A cat sits on the windowsill near steaming mugs. Soft, slightly faded tones and dreamy blur evoke nostalgia, a lofi mood, and the intimate, imperfect charm of instant film.",
+    "Soft watercolor illustration of a cozy coffee shop interior, pastel washes of color filling the space. A cat rests peacefully on the windowsill as warm light glows through. Gentle brushstrokes create a dreamy, lofi-inspired atmosphere with whimsical textures and nostalgic calm.",
+    "Isometric pixel-art illustration of a cozy coffee shop interior in detailed 8-bit style. Warm pastel light fills the space as a cat rests on the windowsill. Blocky furniture and tiny mugs add charm, low-res retro graphics enhance the nostalgic, lofi-inspired game aesthetic."
 ]
 
 images = pipeline(
@@ -261,4 +170,4 @@ plt.tight_layout()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
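The warning about multiplying a `Generator` is ordinary Python list semantics: `[obj] * 3` repeats one object reference instead of creating three objects. A minimal sketch of the same aliasing behavior in plain Python, with `random.Random` standing in for `torch.Generator` (an illustration of list semantics only, not diffusers behavior):

```python
import random

# Multiplying a one-element list repeats a single object reference:
# every entry is the SAME generator, advancing one shared random stream.
shared = [random.Random(0)] * 3
assert shared[0] is shared[1] is shared[2]
draws_shared = [rng.random() for rng in shared]  # three draws from one stream

# A list comprehension builds an independent, identically seeded generator per entry.
independent = [random.Random(0) for _ in range(3)]
assert independent[0] is not independent[1]
draws_independent = [rng.random() for rng in independent]  # same first draw three times
```

The same logic applies to `torch.Generator`: create one generator per image, seeding each with `manual_seed(i)` if you want distinct but reproducible results per image.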
You can use this to iteratively select an image associated with a seed and then improve on it by crafting a more detailed prompt.
You can use this to select an image associated with a seed and iteratively improve on it by crafting a more detailed prompt.
@@ -70,32 +70,6 @@ For convenience, we provide a table to denote which methods are inference-only a
[InstructPix2Pix](../api/pipelines/pix2pix) is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
InstructPix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.

## Pix2Pix Zero

[Paper](https://huggingface.co/papers/2302.03027)

[Pix2Pix Zero](../api/pipelines/pix2pix_zero) allows modifying an image so that one concept or subject is translated to another one while preserving general image semantics.

The denoising process is guided from one conceptual embedding towards another conceptual embedding. The intermediate latents are optimized during the denoising process to push the attention maps towards reference attention maps. The reference attention maps are from the denoising process of the input image and are used to encourage semantic preservation.

Pix2Pix Zero can be used both to edit synthetic images as well as real images.

- To edit synthetic images, one first generates an image given a caption.
Next, we generate image captions for the concept that shall be edited and for the new target concept. We can use a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) for this purpose. Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
- To edit a real image, one first generates an image caption using a model like [BLIP](https://huggingface.co/docs/transformers/model_doc/blip). Then one applies DDIM inversion on the prompt and image to generate "inverse" latents. Similar to before, "mean" prompt embeddings for both source and target concepts are created and finally the pix2pix-zero algorithm in combination with the "inverse" latents is used to edit the image.

> [!TIP]
> Pix2Pix Zero is the first model that allows "zero-shot" image editing. This means that the model
> can edit an image in less than a minute on a consumer GPU as shown [here](../api/pipelines/pix2pix_zero#usage-example).

As mentioned above, Pix2Pix Zero includes optimizing the latents (and not any of the UNet, VAE, or the text encoder) to steer the generation toward a specific concept. This means that the overall
pipeline might require more memory than a standard [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).

> [!TIP]
> An important distinction between methods like InstructPix2Pix and Pix2Pix Zero is that the former
> involves fine-tuning the pre-trained weights while the latter does not. This means that you can
> apply Pix2Pix Zero to any of the available Stable Diffusion models.

## Attend and Excite

[Paper](https://huggingface.co/papers/2301.13826)
@@ -178,14 +152,6 @@ multi-concept training by design. Like DreamBooth and Textual Inversion, Custom
teach a pre-trained text-to-image diffusion model about new concepts to generate outputs involving the
concept(s) of interest.

## Model Editing

[Paper](https://huggingface.co/papers/2303.08084)

The [text-to-image model editing pipeline](../api/pipelines/model_editing) helps you mitigate some of the incorrect implicit assumptions a pre-trained text-to-image
diffusion model might make about the subjects present in the input prompt. For example, if you prompt Stable Diffusion to generate images for "A pack of roses", the roses in the generated images
are more likely to be red. This pipeline helps you change that assumption.

## DiffEdit

[Paper](https://huggingface.co/papers/2210.11427)

@@ -257,7 +257,7 @@ LCMs are compatible with adapters like LoRA, ControlNet, T2I-Adapter, and Animat

### LoRA

[LoRA](../using-diffusers/loading_adapters#lora) adapters can be rapidly finetuned to learn a new style from just a few images and plugged into a pretrained model to generate images in that style.
[LoRA](../tutorials/using_peft_for_inference) adapters can be rapidly finetuned to learn a new style from just a few images and plugged into a pretrained model to generate images in that style.

<hfoptions id="lcm-lora">
<hfoption id="LCM">

@@ -18,7 +18,7 @@ Trajectory Consistency Distillation (TCD) enables a model to generate higher qua

The major advantages of TCD are:

- Better than Teacher: TCD demonstrates superior generative quality at both small and large inference steps and exceeds the performance of [DPM-Solver++(2S)](../../api/schedulers/multistep_dpm_solver) with Stable Diffusion XL (SDXL). There is no additional discriminator or LPIPS supervision included during TCD training.
- Better than Teacher: TCD demonstrates superior generative quality at both small and large inference steps and exceeds the performance of [DPM-Solver++(2S)](../api/schedulers/multistep_dpm_solver) with Stable Diffusion XL (SDXL). There is no additional discriminator or LPIPS supervision included during TCD training.

- Flexible Inference Steps: The inference steps for TCD sampling can be freely adjusted without adversely affecting the image quality.

@@ -166,7 +166,7 @@ image = pipe(
TCD-LoRA also supports other LoRAs trained on different styles. For example, let's load the [TheLastBen/Papercut_SDXL](https://huggingface.co/TheLastBen/Papercut_SDXL) LoRA and fuse it with the TCD-LoRA with the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] method.

> [!TIP]
> Check out the [Merge LoRAs](merge_loras) guide to learn more about efficient merging methods.
> Check out the [Merge LoRAs](../tutorials/using_peft_for_inference#merge) guide to learn more about efficient merging methods.

```python
import torch

@@ -280,7 +280,7 @@ refiner = DiffusionPipeline.from_pretrained(
```

> [!WARNING]
> You can use SDXL refiner with a different base model. For example, you can use the [Hunyuan-DiT](../../api/pipelines/hunyuandit) or [PixArt-Sigma](../../api/pipelines/pixart_sigma) pipelines to generate images with better prompt adherence. Once you have generated an image, you can pass it to the SDXL refiner model to enhance final generation quality.
> You can use SDXL refiner with a different base model. For example, you can use the [Hunyuan-DiT](../api/pipelines/hunyuandit) or [PixArt-Sigma](../api/pipelines/pixart_sigma) pipelines to generate images with better prompt adherence. Once you have generated an image, you can pass it to the SDXL refiner model to enhance final generation quality.

Generate an image from the base model, and set the model output to **latent** space:

@@ -10,423 +10,96 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->

# Prompt techniques

[[open-in-colab]]

Prompts are important because they describe what you want a diffusion model to generate. The best prompts are detailed, specific, and well-structured to help the model realize your vision. But crafting a great prompt takes time and effort and sometimes it may not be enough because language and words can be imprecise. This is where you need to boost your prompt with other techniques, such as prompt enhancing and prompt weighting, to get the results you want.
# Prompting

This guide will show you how you can use these prompt techniques to generate high-quality images with lower effort and adjust the weight of certain keywords in a prompt.
Prompts describe what a model should generate. Good prompts are detailed, specific, and structured, and they generate better images and videos.

## Prompt engineering
This guide shows you how to write effective prompts and introduces techniques that make them stronger.

> [!TIP]
> This is not an exhaustive guide on prompt engineering, but it will help you understand the necessary parts of a good prompt. We encourage you to continue experimenting with different prompts and combine them in new ways to see what works best. As you write more prompts, you'll develop an intuition for what works and what doesn't!
## Writing good prompts

New diffusion models do a pretty good job of generating high-quality images from a basic prompt, but it is still important to create a well-written prompt to get the best results. Here are a few tips for writing a good prompt:
Every effective prompt needs three core elements.

1. What is the image *medium*? Is it a photo, a painting, a 3D illustration, or something else?
2. What is the image *subject*? Is it a person, animal, object, or scene?
3. What *details* would you like to see in the image? This is where you can get really creative and have a lot of fun experimenting with different words to bring your image to life. For example, what is the lighting like? What is the vibe and aesthetic? What kind of art or illustration style are you looking for? The more specific and precise words you use, the better the model will understand what you want to generate.
1. <span class="underline decoration-sky-500 decoration-2 underline-offset-4">Subject</span> - what you want to generate. Start your prompt here.
2. <span class="underline decoration-pink-500 decoration-2 underline-offset-4">Style</span> - the medium or aesthetic. How should it look?
3. <span class="underline decoration-green-500 decoration-2 underline-offset-4">Context</span> - details about actions, setting, and mood.

Use these elements as a structured narrative, not a keyword list. Modern models understand language better than keyword matching. Start simple, then add details.

Context is especially important for creating better prompts. Try adding lighting, artistic details, and mood.

<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/plain-prompt.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"A photo of a banana-shaped couch in a living room"</figcaption>
<div class="flex-1 text-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ok-prompt.png" class="w-full h-auto object-cover rounded-lg">
<figcaption class="mt-2 text-sm text-gray-500">A <span class="underline decoration-sky-500 decoration-2 underline-offset-1">cute cat</span> <span class="underline decoration-pink-500 decoration-2 underline-offset-1">lounges on a leaf in a pool during a peaceful summer afternoon</span>, in <span class="underline decoration-green-500 decoration-2 underline-offset-1">lofi art style, illustration</span>.</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/detail-prompt.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"A vibrant yellow banana-shaped couch sits in a cozy living room, its curve cradling a pile of colorful cushions. on the wooden floor, a patterned rug adds a touch of eclectic charm, and a potted plant sits in the corner, reaching towards the sunlight filtering through the windows"</figcaption>
<div class="flex-1 text-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/better-prompt.png" class="w-full h-auto object-cover rounded-lg"/>
<figcaption class="mt-2 text-sm text-gray-500">A cute cat lounges on a floating leaf in a sparkling pool during a peaceful summer afternoon. Clear reflections ripple across the water, with sunlight casting soft, smooth highlights. The illustration is detailed and polished, with elegant lines and harmonious colors, evoking a relaxing, serene, and whimsical lofi mood, anime-inspired and visually comforting.</figcaption>
</div>
</div>

## Prompt enhancing with GPT2

Prompt enhancing is a technique for quickly improving prompt quality without spending too much effort constructing one. It uses a model like GPT2 pretrained on Stable Diffusion text prompts to automatically enrich a prompt with additional important keywords to generate high-quality images.

The technique works by curating a list of specific keywords and forcing the model to generate those words to enhance the original prompt. This way, your prompt can be "a cat" and GPT2 can enhance the prompt to "cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic".
Be specific and add context. Use photography terms like lens type, focal length, camera angles, and depth of field.

> [!TIP]
> You should also use an [*offset noise*](https://www.crosslabs.org//blog/diffusion-with-offset-noise) LoRA to improve the contrast in bright and dark images and create better lighting overall. This [LoRA](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_offset_example-lora_1.0.safetensors) is available from [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0).

Start by defining certain styles and a list of words (you can check out a more comprehensive list of [words](https://hf.co/LykosAI/GPT-Prompt-Expansion-Fooocus-v2/blob/main/positive.txt) and [styles](https://github.com/lllyasviel/Fooocus/tree/main/sdxl_styles) used by Fooocus) to enhance a prompt with.

```py
import torch
from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList
from diffusers import StableDiffusionXLPipeline

styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
    "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
    "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
    "pixelart": " pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
}

words = [
    "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
    "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
    "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
    "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
    "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
    "elegant", "awesome", "amazing", "dynamic", "trendy",
]
```

You may have noticed that certain words in the `words` list can be paired together to create something more meaningful. For example, the words "high" and "quality" can be combined to create "high quality". Let's pair these words together and remove the words that can't be paired.

```py
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]

def find_and_order_pairs(s, pairs):
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])

    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s
```

Next, implement a custom [`~transformers.LogitsProcessor`] class that assigns tokens in the `words` list a value of 0 and assigns tokens not in the `words` list a negative value so they aren't picked during generation. This way, generation is biased towards words in the `words` list. After a word from the list is used, it is also assigned a negative value so it isn't picked again.

```py
class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias

word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])
```

Combine the prompt and the `cinematic` style prompt defined in the `styles` dictionary earlier.

```py
prompt = "a cat basking in the sun on a roof in Turkey"
style = "cinematic"

prompt = styles[style].format(prompt=prompt)
prompt
"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"
```

Load a GPT2 tokenizer and model from the [Gustavosta/MagicPrompt-Stable-Diffusion](https://huggingface.co/Gustavosta/MagicPrompt-Stable-Diffusion) checkpoint (this specific checkpoint is trained to generate prompts) to enhance the prompt.

```py
tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
    "cuda"
)
model.eval()

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 50 - token_count

generation_config = GenerationConfig(
    penalty_alpha=0.7,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    pad_token=model.config.pad_token_id,
    do_sample=True,
)

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )
```

Then you can combine the input prompt and the generated prompt. Feel free to take a look at what the generated prompt (`generated_part`) is, the word pairs that were found (`pairs`), and the remaining words (`words`). This is all packed together in the `enhanced_prompt`.

```py
output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part
enhanced_prompt
["cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic"]
```

Finally, load a pipeline and the offset noise LoRA with a *low weight* to generate an image with the enhanced prompt.

```py
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

pipeline.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipeline.set_adapters(["offset"], adapter_weights=[0.2])

image = pipeline(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image
```

<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/non-enhanced-prompt.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"a cat basking in the sun on a roof in Turkey"</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/enhanced-prompt.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"</figcaption>
</div>
</div>
> Try a [prompt enhancer](https://huggingface.co/models?sort=downloads&search=prompt+enhancer) to help improve your prompt structure.

## Prompt weighting

Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image. A prompt can include several concepts, which gets turned into contextualized text embeddings. The embeddings are used by the model to condition its cross-attention layers to generate an image (read the Stable Diffusion [blog post](https://huggingface.co/blog/stable_diffusion) to learn more about how it works).
Prompt weighting makes some words stronger and others weaker. It scales the text embeddings associated with each concept so you control how much influence each one has.

Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to its concept in the prompt because you may not necessarily want the model to focus on all concepts equally. The easiest way to prepare the prompt embeddings is to use [Stable Diffusion Long Prompt Weighted Embedding](https://github.com/xhinker/sd_embed) (sd_embed). Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a [prompt_embeds](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.prompt_embeds) (and optionally [negative_prompt_embeds](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__.negative_prompt_embeds)) parameter, such as [`StableDiffusionPipeline`], [`StableDiffusionControlNetPipeline`], and [`StableDiffusionXLPipeline`].
Diffusers handles this through `prompt_embeds` and `pooled_prompt_embeds` arguments which take scaled text embedding vectors. Use the [sd_embed](https://github.com/xhinker/sd_embed) library to generate these embeddings. It also supports longer prompts.

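Conceptually, the scaling is just elementwise multiplication of a concept's token embeddings by its weight. A toy sketch of that idea (the `weight_embeddings` helper and the values are made up for illustration; sd_embed's real implementation also rebalances against the unweighted embeddings):

```python
# Toy illustration of prompt weighting: scale each token's embedding vector
# by its weight before it conditions the model. Not sd_embed's actual code.
def weight_embeddings(token_embeds, weights):
    return [[v * w for v in embed] for embed, w in zip(token_embeds, weights)]

token_embeds = [[0.5, -1.0], [2.0, 4.0]]  # pretend embeddings for "cute", "cat"
weights = [1.0, 1.5]                      # upweight "cat" by 1.5x

weighted = weight_embeddings(token_embeds, weights)
print(weighted)  # [[0.5, -1.0], [3.0, 6.0]]
```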
> [!TIP]
> If your favorite pipeline doesn't have a `prompt_embeds` parameter, please open an [issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can add it!

This guide will show you how to weight your prompts with sd_embed.

Before you begin, make sure you have the latest version of sd_embed installed:

```bash
pip install git+https://github.com/xhinker/sd_embed.git@main
```

For this example, let's use [`StableDiffusionXLPipeline`].
> [!NOTE]
> The sd_embed library only supports Stable Diffusion, Stable Diffusion XL, Stable Diffusion 3, Stable Cascade, and Flux. Prompt weighting doesn't necessarily help for newer models like Flux, which already has very good prompt adherence.

```py
from diffusers import StableDiffusionXLPipeline, UniPCMultistepScheduler
import torch

pipe = StableDiffusionXLPipeline.from_pretrained("Lykon/dreamshaper-xl-1-0", torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
!uv pip install git+https://github.com/xhinker/sd_embed.git@main
```

To upweight or downweight a concept, surround the text with parentheses. More parentheses applies a heavier weight on the text. You can also append a numerical multiplier to the text to indicate how much you want to increase or decrease its weights by.
Format weighted text with numerical multipliers or parentheses. More parentheses mean stronger weighting.

| format | multiplier |
|---|---|
| `(hippo)` | increase by 1.1x |
| `((hippo))` | increase by 1.21x |
| `(hippo:1.5)` | increase by 1.5x |
| `(hippo:0.5)` | decrease by 2x |
| `(cat)` | increase by 1.1x |
| `((cat))` | increase by 1.21x |
| `(cat:1.5)` | increase by 1.5x |
| `(cat:0.5)` | decrease by 2x |

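Each nesting level multiplies the weight by 1.1 (so two levels give 1.1 x 1.1 = 1.21), while an explicit `:weight` sets the multiplier directly. A quick sketch of the convention (the `paren_weight` parser is hypothetical and for illustration only; sd_embed implements its own parsing):

```python
def paren_weight(token):
    """Toy parser for '(cat)', '((cat))', and '(cat:1.5)' style weights."""
    depth = 0
    # Peel off matched parentheses, counting the nesting depth.
    while token.startswith("(") and token.endswith(")"):
        token = token[1:-1]
        depth += 1
    # An explicit ':weight' overrides the nesting rule.
    if ":" in token:
        word, weight = token.split(":")
        return word, float(weight)
    return token, round(1.1 ** depth, 2)

print(paren_weight("(cat)"))      # ('cat', 1.1)
print(paren_weight("((cat))"))    # ('cat', 1.21)
print(paren_weight("(cat:0.5)"))  # ('cat', 0.5)
```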
Create a prompt and use a combination of parentheses and numerical multipliers to upweight various text.
Create a weighted prompt and pass it to [get_weighted_text_embeddings_sdxl](https://github.com/xhinker/sd_embed/blob/4a47f71150a22942fa606fb741a1c971d95ba56f/src/sd_embed/embedding_funcs.py#L405) to generate embeddings.

> [!TIP]
> You could also pass negative prompts to `negative_prompt_embeds` and `negative_pooled_prompt_embeds`.

```py
import torch
from diffusers import DiffusionPipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sdxl

prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""
```

Use the `get_weighted_text_embeddings_sdxl` function to generate the prompt embeddings and the negative prompt embeddings. It'll also generate the pooled and negative pooled prompt embeddings since you're using the SDXL model.

> [!TIP]
> You can safely ignore the error message below about the token index length exceeding the model's maximum sequence length. All your tokens will be used in the embedding process.
>
> ```
> Token indices sequence length is longer than the specified maximum sequence length for this model
> ```

```py
(
    prompt_embeds,
    prompt_neg_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds
) = get_weighted_text_embeddings_sdxl(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
pipeline = DiffusionPipeline.from_pretrained(
    "Lykon/dreamshaper-xl-1-0", torch_dtype=torch.bfloat16, device_map="cuda"
)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_inference_steps=30,
    height=1024,
    width=1024 + 512,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2)
).images[0]
image
prompt = """
A (cute cat:1.4) lounges on a (floating leaf:1.2) in a (sparkling pool:1.1) during a peaceful summer afternoon.
Gentle ripples reflect pastel skies, while (sunlight:1.1) casts soft highlights. The illustration is smooth and polished
with elegant, sketchy lines and subtle gradients, evoking a ((whimsical, nostalgic, dreamy lofi atmosphere:2.0)),
(anime-inspired:1.6), calming, comforting, and visually serene.
"""

prompt_embeds, _, pooled_prompt_embeds, *_ = get_weighted_text_embeddings_sdxl(pipeline, prompt=prompt)
```

Pass the embeddings to `prompt_embeds` and `pooled_prompt_embeds` to generate your image.

```py
image = pipeline(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds).images[0]
image
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/prompt-embed-sdxl.png"/>
</div>

> [!TIP]
> Refer to the [sd_embed](https://github.com/xhinker/sd_embed) repository for additional details about long prompt weighting for FLUX.1, Stable Cascade, and Stable Diffusion 1.5.
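The `(text:weight)` syntax in the prompts above attaches an emphasis factor to each phrase. A toy parser showing how such weights can be separated from the text (illustrative only; sd_embed's real logic is tokenizer-aware and more involved):

```python
import re

def parse_weighted_segments(prompt):
    """Split a prompt into (text, weight) pairs for `(text:weight)` spans.

    Unparenthesized text gets the default weight 1.0. This is a toy
    parser, not sd_embed's API.
    """
    pattern = re.compile(r"\(([^():]+):([0-9.]+)\)")
    segments, last = [], 0
    for match in pattern.finditer(prompt):
        plain = prompt[last:match.start()].strip()
        if plain:
            segments.append((plain, 1.0))
        segments.append((match.group(1).strip(), float(match.group(2))))
        last = match.end()
    tail = prompt[last:].strip()
    if tail:
        segments.append((tail, 1.0))
    return segments

print(parse_weighted_segments("A (cute cat:1.4) lounges on a (floating leaf:1.2)"))
# [('A', 1.0), ('cute cat', 1.4), ('lounges on a', 1.0), ('floating leaf', 1.2)]
```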
### Textual inversion

[Textual inversion](../training/text_inversion) is a technique for learning a specific concept from some images which you can use to generate new images conditioned on that concept.

Create a pipeline and use the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] function to load the textual inversion embeddings (feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer) for 100+ trained concepts):

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/midjourney-style")
```

Add the `<midjourney-style>` text to the prompt to trigger the textual inversion.

```py
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15

prompt = """<midjourney-style> A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""
```
Use the `get_weighted_text_embeddings_sd15` function to generate the prompt embeddings and the negative prompt embeddings.

```py
(
    prompt_embeds,
    prompt_neg_embeds,
) = get_weighted_text_embeddings_sd15(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    height=768,
    width=896,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2)
).images[0]
image
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sd_embed_textual_inversion.png"/>
</div>
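Under the hood, `load_textual_inversion` registers the trigger token with the tokenizer and appends its learned vector to the text encoder's embedding table, so `<midjourney-style>` embeds like any other word. A toy illustration of that lookup (the two-dimensional vectors are invented for the sketch):

```python
# Toy embedding table: the learned <midjourney-style> vector is simply an
# extra row added to the text encoder's vocabulary (values invented here).
embedding_table = {
    "a": [0.1, 0.2],
    "photo": [0.3, 0.1],
    "<midjourney-style>": [0.9, -0.4],
}

def embed_prompt(prompt):
    """Look up each known token's vector, as the text encoder would."""
    return [embedding_table[tok] for tok in prompt.split() if tok in embedding_table]

print(embed_prompt("a <midjourney-style> photo"))
# [[0.1, 0.2], [0.9, -0.4], [0.3, 0.1]]
```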
### DreamBooth

[DreamBooth](../training/dreambooth) is a technique for generating contextualized images of a subject given just a few images of the subject to train on. It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings. This means you should use [`~DiffusionPipeline.from_pretrained`] to load the DreamBooth model (feel free to browse the [Stable Diffusion Dreambooth Concepts Library](https://huggingface.co/sd-dreambooth-library) for 100+ trained models):

```py
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

pipe = DiffusionPipeline.from_pretrained("sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```

Depending on the model you use, you'll need to incorporate the model's unique identifier into your prompt. For example, the `dndcoverart-v1` model uses the identifier `dndcoverart`:

```py
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15

prompt = """dndcoverart of A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

(
    prompt_embeds,
    prompt_neg_embeds,
) = get_weighted_text_embeddings_sd15(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt,
)
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sd_embed_dreambooth.png"/>
</div>
@@ -280,5 +280,5 @@ This is really what 🧨 Diffusers is designed for: to make it intuitive and eas

For your next steps, feel free to:

* Learn how to [build and contribute a pipeline](../using-diffusers/contribute_pipeline) to 🧨 Diffusers. We can't wait and see what you'll come up with!
* Learn how to [build and contribute a pipeline](../conceptual/contribution) to 🧨 Diffusers. We can't wait and see what you'll come up with!
* Explore [existing pipelines](../api/pipelines/overview) in the library, and see if you can deconstruct and build a pipeline from scratch using the models and schedulers separately.
@@ -369,6 +369,15 @@ def get_spatial_latent_upsampler_config(version: str) -> Dict[str, Any]:
            "spatial_upsample": True,
            "temporal_upsample": False,
        }
    elif version == "0.9.8":
        config = {
            "in_channels": 128,
            "mid_channels": 512,
            "num_blocks_per_stage": 4,
            "dims": 3,
            "spatial_upsample": True,
            "temporal_upsample": False,
        }
    else:
        raise ValueError(f"Unsupported version: {version}")
    return config

@@ -402,7 +411,7 @@ def get_args():
        "--version",
        type=str,
        default="0.9.0",
        choices=["0.9.0", "0.9.1", "0.9.5", "0.9.7"],
        choices=["0.9.0", "0.9.1", "0.9.5", "0.9.7", "0.9.8"],
        help="Version of the LTX model",
    )
    return parser.parse_args()
@@ -18,7 +18,6 @@ import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...utils import logging

@@ -23,7 +23,6 @@ from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FromOriginalModelMixin

@@ -17,7 +17,6 @@ from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FromOriginalModelMixin

@@ -16,7 +16,6 @@ from math import gcd
from typing import Any, Dict, List, Optional, Tuple, Union

import torch
import torch.utils.checkpoint
from torch import Tensor, nn

from ...configuration_utils import ConfigMixin, register_to_config

@@ -18,7 +18,6 @@ from typing import Dict, Optional, Union
import numpy as np
import torch
import torch.nn as nn
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...utils import logging

@@ -16,7 +16,6 @@ from typing import Any, Dict, List, Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import PeftAdapterMixin, UNet2DConditionLoadersMixin

@@ -18,7 +18,6 @@ from typing import Any, Dict, List, Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import UNet2DConditionLoadersMixin

@@ -16,7 +16,6 @@ from typing import Any, Dict, Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import UNet2DConditionLoadersMixin

@@ -16,7 +16,6 @@ from dataclasses import dataclass
from typing import Dict, Tuple, Union

import torch
import torch.utils.checkpoint
from torch import nn

from ...configuration_utils import ConfigMixin, register_to_config

@@ -18,7 +18,6 @@ from typing import Any, Dict, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, FrozenDict, register_to_config
from ...loaders import FromOriginalModelMixin, PeftAdapterMixin, UNet2DConditionLoadersMixin

@@ -17,7 +17,6 @@ from typing import Any, Dict, List, Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import UNet2DConditionLoadersMixin

@@ -14,7 +14,6 @@
from typing import Optional, Tuple, Union

import torch
import torch.utils.checkpoint
from torch import nn
from transformers import BertTokenizer
from transformers.activations import QuickGELUActivation as QuickGELU

@@ -18,7 +18,6 @@ from typing import Callable, List, Optional, Union
import numpy as np
import PIL.Image
import torch
import torch.utils.checkpoint
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

from ....image_processor import VaeImageProcessor

@@ -16,7 +16,6 @@ import inspect
from typing import Callable, List, Optional, Union

import torch
import torch.utils.checkpoint
from transformers import CLIPImageProcessor, CLIPTextModelWithProjection, CLIPTokenizer

from ....image_processor import VaeImageProcessor

@@ -17,7 +17,6 @@ from typing import List, Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.utils.checkpoint
from transformers import PretrainedConfig, PreTrainedModel, PreTrainedTokenizer
from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutput

@@ -4,7 +4,6 @@ from typing import List, Optional, Tuple, Union
import numpy as np
import PIL.Image
import torch
import torch.utils.checkpoint

from ...models import UNet2DModel, VQModel
from ...schedulers import (
@@ -121,6 +121,38 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
        result = torch.lerp(latents, result, factor)
        return result

    def tone_map_latents(self, latents: torch.Tensor, compression: float) -> torch.Tensor:
        """
        Applies a non-linear tone-mapping function to latent values to reduce their dynamic range in a perceptually
        smooth way using a sigmoid-based compression.

        This is useful for regularizing high-variance latents or for conditioning outputs during generation, especially
        when controlling dynamic behavior with a `compression` factor.

        Args:
            latents : torch.Tensor
                Input latent tensor with arbitrary shape. Expected to be roughly in [-1, 1] or [0, 1] range.
            compression : float
                Compression strength in the range [0, 1].
                - 0.0: No tone-mapping (identity transform)
                - 1.0: Full compression effect

        Returns:
            torch.Tensor
                The tone-mapped latent tensor of the same shape as input.
        """
        # Remap [0-1] to [0-0.75] and apply sigmoid compression in one shot
        scale_factor = compression * 0.75
        abs_latents = torch.abs(latents)

        # Sigmoid compression: sigmoid shifts large values toward 0.2, small values stay ~1.0
        # When scale_factor=0, sigmoid term vanishes, when scale_factor=0.75, full effect
        sigmoid_term = torch.sigmoid(4.0 * scale_factor * (abs_latents - 1.0))
        scales = 1.0 - 0.8 * scale_factor * sigmoid_term

        filtered = latents * scales
        return filtered

    @staticmethod
    # Copied from diffusers.pipelines.ltx.pipeline_ltx.LTXPipeline._normalize_latents
    def _normalize_latents(

@@ -196,7 +228,7 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
        )
        self.vae.disable_tiling()

    def check_inputs(self, video, height, width, latents):
    def check_inputs(self, video, height, width, latents, tone_map_compression_ratio):
        if height % self.vae_spatial_compression_ratio != 0 or width % self.vae_spatial_compression_ratio != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 32 but are {height} and {width}.")

@@ -205,6 +237,9 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
        if video is None and latents is None:
            raise ValueError("One of `video` or `latents` has to be provided.")

        if not (0 <= tone_map_compression_ratio <= 1):
            raise ValueError("`tone_map_compression_ratio` must be in the range [0, 1]")

    @torch.no_grad()
    def __call__(
        self,

@@ -215,6 +250,7 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
        decode_timestep: Union[float, List[float]] = 0.0,
        decode_noise_scale: Optional[Union[float, List[float]]] = None,
        adain_factor: float = 0.0,
        tone_map_compression_ratio: float = 0.0,
        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,

@@ -224,6 +260,7 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
            height=height,
            width=width,
            latents=latents,
            tone_map_compression_ratio=tone_map_compression_ratio,
        )

        if video is not None:

@@ -266,6 +303,9 @@ class LTXLatentUpsamplePipeline(DiffusionPipeline):
        else:
            latents = latents_upsampled

        if tone_map_compression_ratio > 0.0:
            latents = self.tone_map_latents(latents, tone_map_compression_ratio)

        if output_type == "latent":
            latents = self._normalize_latents(
                latents, self.vae.latents_mean, self.vae.latents_std, self.vae.config.scaling_factor
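The tone-mapping added in this diff is a purely elementwise operation, so the curve can be inspected with plain floats. A scalar re-expression of the same formula (the demo values are made up, not from the source):

```python
import math

def tone_map_latent(value, compression):
    """Scalar version of the elementwise tone_map_latents op in the diff above."""
    scale_factor = compression * 0.75
    # The sigmoid pushes large-magnitude latents toward a smaller scale,
    # while magnitudes well below 1.0 are left nearly untouched.
    sigmoid_term = 1.0 / (1.0 + math.exp(-4.0 * scale_factor * (abs(value) - 1.0)))
    scales = 1.0 - 0.8 * scale_factor * sigmoid_term
    return value * scales

print(tone_map_latent(3.0, 0.0))  # compression 0 is the identity: 3.0
print(round(tone_map_latent(3.0, 1.0), 3))  # large values are compressed
```

With `compression=0` the scale factor is zero, so the sigmoid term is multiplied away and the transform is the identity, matching the pipeline's `tone_map_compression_ratio: float = 0.0` default.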
@@ -18,7 +18,6 @@ from typing import Optional

import torch
import torch.nn as nn
import torch.utils.checkpoint

from ...configuration_utils import ConfigMixin, register_to_config
from ...models.modeling_utils import ModelMixin

@@ -49,7 +49,7 @@ EXAMPLE_DOC_STRING = """
    Examples:
        ```python
        >>> import torch
        >>> from diffusers.utils import export_to_video
        >>> from diffusers.utils import export_to_video, load_video
        >>> from diffusers import AutoencoderKLWan, WanVideoToVideoPipeline
        >>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

@@ -138,10 +138,8 @@ class AudioLDM2PipelineFastTests(PipelineTesterMixin, unittest.TestCase):
            patch_stride=2,
            patch_embed_input_channels=4,
        )
        text_encoder_config = ClapConfig.from_text_audio_configs(
            text_config=text_branch_config,
            audio_config=audio_branch_config,
            projection_dim=16,
        text_encoder_config = ClapConfig(
            text_config=text_branch_config, audio_config=audio_branch_config, projection_dim=16
        )
        text_encoder = ClapModel(text_encoder_config)
        tokenizer = RobertaTokenizer.from_pretrained("hf-internal-testing/tiny-random-roberta", model_max_length=77)