mirror of
https://github.com/huggingface/diffusers.git
synced 2026-03-27 02:47:41 +08:00
Compare commits
2 Commits
ltx2-3-upd
...
profiling-
| Author | SHA1 | Date | |
|---|---|---|---|
| | af96109435 | | |
| | b757035df6 | | |
1
.github/workflows/claude_review.yml
vendored
@@ -10,6 +10,7 @@ permissions:
  contents: write
  pull-requests: write
  issues: read
  id-token: write

jobs:
  claude-review:

@@ -18,7 +18,7 @@
  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[LTX-2](https://arxiv.org/abs/2601.03233) is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.

@@ -293,7 +293,6 @@ import torch
from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
from diffusers.utils import load_image, load_video

device = "cuda"
@@ -316,6 +315,19 @@ prompt = (
    "landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the "
    "solitude and beauty of a winter drive through a mountainous region."
)
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)

cond_video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
@@ -331,7 +343,7 @@ frame_rate = 24.0
video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
@@ -354,154 +366,6 @@ encode_video(

Because the conditioning is done via latent frames, the 8 data space frames corresponding to the specified latent frame for an image condition will tend to be static.

## Multimodal Guidance

With LTX-2.3 support, the LTX-2.X pipelines now support multimodal guidance. The LTX-2.X multimodal guidance setup is composed of three terms, all using a CFG-style update rule:

1. **Classifier-Free Guidance (CFG)**: standard [CFG](https://arxiv.org/abs/2207.12598) where the perturbed ("weaker") output is generated using the negative prompt.
2. **Spatio-Temporal Guidance (STG)**: [STG](https://arxiv.org/pdf/2411.18664) moves away from a perturbed output created by short-cutting self-attention operations, substituting in the attention values in place of the attention output. The idea is that this creates sharper videos and better spatiotemporal consistency.
3. **Modality Isolation Guidance**: this moves away from a perturbed output created by disabling cross-modality (audio-to-video and video-to-audio) cross attention. This guidance is more specific to [LTX-2.X](https://arxiv.org/pdf/2601.03233) models, with the idea that this produces better consistency between the generated audio and video.

These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments, respectively, and can be set separately for video and audio. For STG, the transformer block indices where self-attention is skipped also need to be specified via the `spatio_temporal_guidance_blocks` argument. The LTX-2.X pipelines additionally support [guidance rescaling](https://arxiv.org/abs/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.

```py
import torch
from diffusers import LTX2ImageToVideoPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
from diffusers.utils import load_image

device = "cuda"
width = 768
height = 512
random_seed = 42
frame_rate = 24.0
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "dg845/LTX-2.3-Diffusers"

pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload(device=device)
pipe.vae.enable_tiling()

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg",
)

video, audio = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=30,
    guidance_scale=3.0,  # Recommended LTX-2.3 guidance parameters
    stg_scale=1.0,  # Note that 0.0 (not 1.0) means that STG is disabled (all other guidance is disabled at 1.0)
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,  # Note that a higher CFG guidance scale is recommended for audio
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    use_cross_timestep=True,
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_i2v_stage_1.mp4",
)
```

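To make the roles of the three scales more concrete, here is a minimal sketch of how three CFG-style terms could be combined. It assumes each term is added to the conditional prediction; the function name `combine_guidance` and the additive form are illustrative assumptions, not the pipelines' actual implementation. The only properties taken from this section are that CFG and modality guidance are disabled at a scale of 1.0 while STG is disabled at 0.0. Guidance rescaling is left out.

```python
def combine_guidance(cond, uncond, stg_perturbed, modality_perturbed,
                     guidance_scale=1.0, stg_scale=0.0, modality_scale=1.0):
    """Hypothetical additive combination of three CFG-style guidance terms.

    All inputs are equal-length lists of floats standing in for model outputs.
    """
    combined = []
    for c, u, s, m in zip(cond, uncond, stg_perturbed, modality_perturbed):
        value = (
            c
            + (guidance_scale - 1.0) * (c - u)  # CFG: move away from the negative-prompt output
            + stg_scale * (c - s)               # STG: move away from the attention-skipped output
            + (modality_scale - 1.0) * (c - m)  # modality: move away from the cross-modal-off output
        )
        combined.append(value)
    return combined


cond = [1.0, 2.0]
uncond = [0.0, 1.0]
# Passing `cond` as both perturbed outputs zeroes the STG and modality terms,
# so only the CFG term contributes here.
print(combine_guidance(cond, uncond, cond, cond, guidance_scale=3.0))
```

With every scale left at its default, the function returns the conditional prediction unchanged, which matches the "disabled" values noted in the example's comments.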
## Prompt Enhancement

The LTX-2.X models are sensitive to prompting style. Refer to the [official prompting guide](https://ltx.io/model/model-blog/prompting-guide-for-ltx-2) for recommendations on how to write a good prompt. Prompt enhancement can also improve sample quality: the supplied prompt is rewritten by the pipeline's text encoder (by default a [Gemma 3](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized) model) according to a system prompt. Prompt enhancement requires the optional `processor` pipeline component and is enabled by supplying a `system_prompt` argument:

```py
import torch
from transformers import Gemma3Processor
from diffusers import LTX2Pipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT, T2V_DEFAULT_SYSTEM_PROMPT

device = "cuda"
width = 768
height = 512
random_seed = 42
frame_rate = 24.0
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "dg845/LTX-2.3-Diffusers"

pipe = LTX2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()
if getattr(pipe, "processor", None) is None:
    processor = Gemma3Processor.from_pretrained("google/gemma-3-12b-it-qat-q4_0-unquantized")
    pipe.processor = processor

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)

video, audio = pipe(
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=30,
    guidance_scale=3.0,
    stg_scale=1.0,
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    use_cross_timestep=True,
    system_prompt=T2V_DEFAULT_SYSTEM_PROMPT,
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_t2v_stage_1.mp4",
)
```

## LTX2Pipeline

[[autodoc]] LTX2Pipeline

168
profiling/PROFILING_PLAN.md
Normal file
@@ -0,0 +1,168 @@
# Profiling Plan: Diffusers Pipeline Profiling with torch.profiler

## Context

We want to uncover CPU overhead, CPU-GPU sync points, and other bottlenecks in popular diffusers pipelines — especially issues that become non-trivial under `torch.compile`. The approach is inspired by [flux-fast's run_benchmark.py](https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/run_benchmark.py#L66-L85) which uses `torch.profiler` with method-level annotations, and motivated by issues like [diffusers#11696](https://github.com/huggingface/diffusers/pull/11696) (DtoH sync from scheduler `.item()` call).

## Target Pipelines

| Pipeline | Type | Checkpoint | Steps |
|----------|------|-----------|-------|
| `FluxPipeline` | text-to-image | `black-forest-labs/FLUX.1-dev` | 4 |
| `Flux2Pipeline` | text-to-image | `black-forest-labs/FLUX.2-dev` | 4 |
| `WanPipeline` | text-to-video | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 4 |
| `LTX2Pipeline` | text-to-video | `Lightricks/LTX-2` | 4 |
| `QwenImagePipeline` | text-to-image | `Qwen/Qwen-Image` | 4 |

## Approach

Follow the flux-fast pattern: **annotate key pipeline methods** with `torch.profiler.record_function` wrappers, then run the pipeline under `torch.profiler.profile` and export a Chrome trace.

### New Files

```
profiling/
  profiling_utils.py       # Annotation helper + profiler setup
  profiling_pipelines.py   # CLI entry point with pipeline configs
```

### Step 1: `profiling_utils.py` — Annotation and Profiler Infrastructure

**A) `annotate(func, name)` helper** (same pattern as flux-fast):

```python
def annotate(func, name):
    """Wrap a function with torch.profiler.record_function for trace annotation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with torch.profiler.record_function(name):
            return func(*args, **kwargs)
    return wrapper
```

**B) `annotate_pipeline(pipe)` function** — applies annotations to key methods on any pipeline:

- `pipe.transformer.forward` → `"transformer_forward"`
- `pipe.vae.decode` → `"vae_decode"` (if present)
- `pipe.vae.encode` → `"vae_encode"` (if present)
- `pipe.scheduler.step` → `"scheduler_step"`
- `pipe.encode_prompt` → `"encode_prompt"` (if present, for full-pipeline profiling)

This is non-invasive — it monkey-patches bound methods without modifying source.

**C) `PipelineProfiler` class:**

- `__init__(pipeline_config, output_dir, mode="eager"|"compile")`
- `setup_pipeline()` → loads from pretrained, optionally compiles transformer, calls `annotate_pipeline()`
- `run()`:
  1. Warm up with 1 unannotated run
  2. Profile 1 run with `torch.profiler.profile`:
     - `activities=[CPU, CUDA]`
     - `record_shapes=True`
     - `profile_memory=True`
     - `with_stack=True`
  3. Export Chrome trace JSON
  4. Print `key_averages()` summary table (sorted by CUDA time) to stdout

### Step 2: `profiling_pipelines.py` — CLI with Pipeline Configs

**Pipeline config registry** — each entry specifies:

- `pipeline_cls`, `pretrained_model_name_or_path`, `torch_dtype`
- `call_kwargs` with pipeline-specific defaults:

| Pipeline | Resolution | Frames | Steps | Extra |
|----------|-----------|--------|-------|-------|
| Flux | 1024x1024 | — | 4 | `guidance_scale=3.5` |
| Flux2 | 1024x1024 | — | 4 | `guidance_scale=3.5` |
| Wan | 480x832 | 81 | 4 | — |
| LTX2 | 768x512 | 121 | 4 | `guidance_scale=4.0` |
| QwenImage | 1024x1024 | — | 4 | `true_cfg_scale=4.0` |

All configs use `output_type="latent"` by default (skip VAE decode for cleaner denoising-loop traces).

**CLI flags:**

- `--pipeline flux|flux2|wan|ltx2|qwenimage|all`
- `--mode eager|compile|both`
- `--output_dir profiling_results/`
- `--num_steps N` (override, default 4)
- `--full_decode` (switch output_type from `"latent"` to `"pil"` to include VAE)
- `--compile_mode default|reduce-overhead|max-autotune`
- `--compile_fullgraph` flag

**Output:** `{output_dir}/{pipeline}_{mode}.json` Chrome trace + stdout summary.

### Step 3: Known Sync Issues to Validate

The profiling should surface these known/suspected issues:

1. **Scheduler DtoH sync via `nonzero().item()`** — For Flux, this was fixed by adding `scheduler.set_begin_index(0)` before the denoising loop ([diffusers#11696](https://github.com/huggingface/diffusers/pull/11696)). Profiling should reveal whether similar sync points exist in other pipelines.

2. **`modulate_index` tensor rebuilt every forward in `transformer_qwenimage.py`** (lines 901-905) — Python list comprehension + `torch.tensor()` each step. Minor but visible in trace.

3. **Any other `.item()`, `.cpu()`, `.numpy()` calls** in the denoising loop hot path — the profiler's `with_stack=True` will surface these as CPU stalls with Python stack traces.

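As a static complement to the trace, the calls named in item 3 can also be searched for in source code. The sketch below uses only the standard-library `ast` module; `find_sync_calls` and `SYNC_METHODS` are hypothetical names introduced here, not part of the profiling scripts. A hit is only a candidate (for example, `.cpu()` on a tensor that already lives on the CPU does not sync), so the profiler trace remains the source of truth.

```python
import ast

# Method names from item 3 that can force a device-to-host sync on CUDA tensors.
SYNC_METHODS = {"item", "cpu", "numpy"}


def find_sync_calls(source, filename="<string>"):
    """Return sorted (line_number, method_name) pairs for calls such as x.item()."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in SYNC_METHODS
        ):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)


# Illustrative snippet modelled on the scheduler pattern described in item 1.
example = '''
def step(self, timestep):
    index = (self.timesteps == timestep).nonzero().item()
    return self.sigmas[index]
'''
print(find_sync_calls(example))
```

To scan a real module, read its text with `pathlib.Path(path).read_text()` and pass it in; the scheduler and transformer files on the denoising-loop hot path are the natural first targets.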
## Verification

1. Run: `python profiling/profiling_pipelines.py --pipeline flux --mode eager --num_steps 4`
2. Verify `profiling_results/flux_eager.json` is produced
3. Open trace in [Perfetto UI](https://ui.perfetto.dev/) — confirm:
   - `transformer_forward` and `scheduler_step` annotations visible
   - CPU and CUDA timelines present
   - Stack traces visible on CPU events
4. Run with `--mode compile` and compare trace for fewer/fused CUDA kernels

## Interpreting Traces in Perfetto UI

Open the exported `.json` trace at [ui.perfetto.dev](https://ui.perfetto.dev/). The trace has two main rows: **CPU** (top) and **CUDA** (bottom).

### What to look for

**1. Gaps between CUDA kernels**

Zoom into the CUDA row during the denoising loop. Ideally, GPU kernels should be back-to-back with no gaps. Gaps mean the GPU is idle waiting for the CPU to launch the next kernel. Common causes:
- Python overhead between ops (visible as CPU slices in the CPU row during the gap)
- DtoH sync (`.item()`, `.cpu()`) forcing the GPU to drain before the CPU can proceed

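The same gap check can be approximated offline from the exported JSON. The sketch below assumes the Chrome trace event layout: a top-level `traceEvents` list whose complete events have `ph == "X"`, with `ts` and `dur` in microseconds. It also assumes that GPU kernels carry `cat == "kernel"`, which should be confirmed against a real trace. The helper names are hypothetical, and the 100 us default mirrors the threshold used in the checklist in this plan.

```python
import json


def load_trace_events(path):
    """Load the event list from a Chrome trace JSON file."""
    with open(path) as f:
        data = json.load(f)
    # Chrome traces are either {"traceEvents": [...]} or a bare list of events.
    return data["traceEvents"] if isinstance(data, dict) else data


def find_kernel_gaps(events, min_gap_us=100.0):
    """Return (gap_start_us, gap_length_us, next_kernel_name) for idle GPU periods."""
    kernels = sorted(
        (e for e in events if e.get("ph") == "X" and e.get("cat") == "kernel"),
        key=lambda e: e["ts"],
    )
    gaps = []
    busy_until = None  # latest end time over all kernels seen so far, on any stream
    for kernel in kernels:
        if busy_until is not None and kernel["ts"] - busy_until >= min_gap_us:
            gaps.append((busy_until, kernel["ts"] - busy_until, kernel["name"]))
        end = kernel["ts"] + kernel.get("dur", 0)
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Synthetic events standing in for a real trace.
demo_events = [
    {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 0, "dur": 50},
    {"ph": "X", "cat": "kernel", "name": "softmax", "ts": 60, "dur": 40},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaStreamSynchronize", "ts": 100, "dur": 290},
    {"ph": "X", "cat": "kernel", "name": "layer_norm", "ts": 400, "dur": 10},
]
print(find_kernel_gaps(demo_events))
```

For a real run, `find_kernel_gaps(load_trace_events("profiling_results/flux_eager.json"))` would list the idle periods; the start timestamps can then be located in Perfetto to see what the CPU was doing at that moment.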
**2. CPU stalls (DtoH syncs)**

Look for long CPU slices labeled `cudaStreamSynchronize` or `cudaDeviceSynchronize`. Click on them — if `with_stack=True` was enabled, the bottom panel shows the Python stack trace pointing to the exact line causing the sync (e.g., a `.item()` call in the scheduler).

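When a trace is too large to scan by eye, the sync slices can be listed with a few lines of Python. This is a minimal sketch under the same Chrome trace assumptions (`ph == "X"` complete events, `ts` and `dur` in microseconds); `find_sync_slices` is a hypothetical helper, and the events below are synthetic stand-ins for the `traceEvents` list of a real export.

```python
SYNC_NAMES = {"cudaStreamSynchronize", "cudaDeviceSynchronize"}


def find_sync_slices(events, min_dur_us=0.0):
    """Return (start_us, duration_us, name) for sync calls, longest first."""
    slices = [
        (e["ts"], e.get("dur", 0), e["name"])
        for e in events
        if e.get("ph") == "X"
        and e.get("name") in SYNC_NAMES
        and e.get("dur", 0) >= min_dur_us
    ]
    return sorted(slices, key=lambda s: s[1], reverse=True)


# Synthetic events standing in for a real trace.
demo_events = [
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaLaunchKernel", "ts": 10, "dur": 5},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaStreamSynchronize", "ts": 100, "dur": 290},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaDeviceSynchronize", "ts": 500, "dur": 20},
]
for start, dur, name in find_sync_slices(demo_events, min_dur_us=50):
    print(f"{name}: {dur} us at t={start} us")
```

Sorting by duration puts the most expensive stalls first, so the top few timestamps are usually enough to find the offending stack in Perfetto.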
**3. Annotated regions**

Our `record_function` annotations (`transformer_forward`, `scheduler_step`, etc.) appear as labeled spans on the CPU row. This lets you quickly:
- Measure how long each phase takes (click a span to see duration)
- See if `scheduler_step` is disproportionately expensive relative to `transformer_forward` (it should be negligible)
- Spot unexpected CPU work between annotated regions

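The per-phase comparison can also be computed from the exported events. The sketch below sums span durations per annotation name; it assumes that CPU-side `record_function` spans are exported with `cat == "user_annotation"`, which is worth confirming on a real trace, and `summarize_annotations` is a hypothetical helper. Because GPU work is launched asynchronously, these CPU-side durations reflect time spent on the host, not GPU execution time.

```python
from collections import defaultdict


def summarize_annotations(events, names=("transformer_forward", "scheduler_step"),
                          category="user_annotation"):
    """Sum CPU-side span durations (microseconds) per annotation name."""
    totals = defaultdict(float)
    calls = defaultdict(int)
    for e in events:
        if e.get("ph") == "X" and e.get("cat") == category and e.get("name") in names:
            totals[e["name"]] += e.get("dur", 0)
            calls[e["name"]] += 1
    return {name: {"calls": calls[name], "total_us": totals[name]} for name in names}


# Synthetic events standing in for two denoising steps of a real trace.
demo_events = [
    {"ph": "X", "cat": "user_annotation", "name": "transformer_forward", "ts": 0, "dur": 900},
    {"ph": "X", "cat": "user_annotation", "name": "scheduler_step", "ts": 900, "dur": 100},
    {"ph": "X", "cat": "user_annotation", "name": "transformer_forward", "ts": 1000, "dur": 1100},
    {"ph": "X", "cat": "user_annotation", "name": "scheduler_step", "ts": 2100, "dur": 100},
]
summary = summarize_annotations(demo_events)
ratio = summary["scheduler_step"]["total_us"] / summary["transformer_forward"]["total_us"]
print(summary)
print(f"scheduler_step / transformer_forward = {ratio:.1%}")
```

A ratio well above a few percent on a real trace would be a reason to inspect the `scheduler_step` spans for sync calls.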
**4. Eager vs compile comparison**

Open both traces side by side (two Perfetto tabs). Key differences to look for:
- **Fewer, wider CUDA kernels** in compile mode (fused ops) vs many small kernels in eager
- **Smaller CPU gaps** between kernels in compile mode (less Python dispatch overhead)
- **Graph breaks**: if compile mode still shows many small kernels in a section, that section likely has a graph break — check `TORCH_LOGS="+dynamo"` output for details

**5. Memory timeline**

In Perfetto, look for the memory counter track (if `profile_memory=True`). Spikes during the denoising loop suggest unexpected allocations per step. Steady-state memory during denoising is expected — growing memory is not.

**6. Kernel launch latency**

Each CUDA kernel is launched from the CPU. In Perfetto, you can see the CPU-side launch call (e.g., `cudaLaunchKernel`) and the corresponding GPU-side kernel execution. The time between the CPU dispatch and the GPU kernel starting should be minimal (single-digit microseconds). If you see consistent delays > 10-20us between launch and execution:
- The launch queue may be starved because of excessive Python work between ops
- There may be implicit syncs forcing serialization
- `torch.compile` should help here by batching launches — compare eager vs compile to confirm

To inspect this: zoom into a single denoising step, select a CUDA kernel on the GPU row, and look at the corresponding CPU-side launch slice directly above it. The horizontal offset between them is the launch latency. In a healthy trace, CPU launch slices should be well ahead of GPU execution (the CPU is "feeding" the GPU faster than it can consume).

### Quick checklist per pipeline

| Question | Where to look | Healthy | Unhealthy |
|----------|--------------|---------|-----------|
| GPU staying busy? | CUDA row gaps | Back-to-back kernels | Frequent gaps > 100us |
| CPU blocking on GPU? | `cudaStreamSynchronize` slices | Rare/absent during denoise | Present every step |
| Scheduler overhead? | `scheduler_step` span duration | < 1% of step time | > 5% of step time |
| Compile effective? | CUDA kernel count per step | Fewer large kernels | Same as eager |
| Kernel launch latency? | CPU launch → GPU kernel offset | < 10us, CPU ahead of GPU | > 20us or CPU trailing GPU |
| Memory stable? | Memory counter track | Flat during denoise loop | Growing per step |
182
profiling/profiling_pipelines.py
Normal file
@@ -0,0 +1,182 @@
"""
Profile diffusers pipelines with torch.profiler.

Usage:
    python profiling/profiling_pipelines.py --pipeline flux --mode eager
    python profiling/profiling_pipelines.py --pipeline flux --mode compile
    python profiling/profiling_pipelines.py --pipeline flux --mode both
    python profiling/profiling_pipelines.py --pipeline all --mode eager
    python profiling/profiling_pipelines.py --pipeline wan --mode eager --full_decode
    python profiling/profiling_pipelines.py --pipeline flux --mode compile --num_steps 4
"""

import argparse
import copy
import logging

import torch

from profiling_utils import PipelineProfiler, PipelineProfilingConfig


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger(__name__)

PROMPT = "A cat holding a sign that says hello world"


def build_registry():
    """Build the pipeline config registry. Imports are deferred to avoid loading all pipelines upfront."""
    from diffusers import FluxPipeline, Flux2Pipeline, WanPipeline, LTX2Pipeline, QwenImagePipeline

    return {
        "flux": PipelineProfilingConfig(
            name="flux",
            pipeline_cls=FluxPipeline,
            pipeline_init_kwargs={
                "pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
                "torch_dtype": torch.bfloat16,
            },
            pipeline_call_kwargs={
                "prompt": PROMPT,
                "height": 1024,
                "width": 1024,
                "num_inference_steps": 4,
                "guidance_scale": 3.5,
                "output_type": "latent",
            },
        ),
        "flux2": PipelineProfilingConfig(
            name="flux2",
            pipeline_cls=Flux2Pipeline,
            pipeline_init_kwargs={
                "pretrained_model_name_or_path": "black-forest-labs/FLUX.2-klein-base-9B",
                "torch_dtype": torch.bfloat16,
            },
            pipeline_call_kwargs={
                "prompt": PROMPT,
                "height": 1024,
                "width": 1024,
                "num_inference_steps": 4,
                "guidance_scale": 3.5,
                "output_type": "latent",
            },
        ),
        "wan": PipelineProfilingConfig(
            name="wan",
            pipeline_cls=WanPipeline,
            pipeline_init_kwargs={
                "pretrained_model_name_or_path": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
                "torch_dtype": torch.bfloat16,
            },
            pipeline_call_kwargs={
                "prompt": PROMPT,
                "negative_prompt": "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards",
                "height": 480,
                "width": 832,
                "num_frames": 81,
                "num_inference_steps": 4,
                "output_type": "latent",
            },
        ),
        "ltx2": PipelineProfilingConfig(
            name="ltx2",
            pipeline_cls=LTX2Pipeline,
            pipeline_init_kwargs={
                "pretrained_model_name_or_path": "Lightricks/LTX-2",
                "torch_dtype": torch.bfloat16,
            },
            pipeline_call_kwargs={
                "prompt": PROMPT,
                "negative_prompt": "worst quality, inconsistent motion, blurry, jittery, distorted",
                "height": 512,
                "width": 768,
                "num_frames": 121,
                "num_inference_steps": 4,
                "guidance_scale": 4.0,
                "output_type": "latent",
            },
        ),
        "qwenimage": PipelineProfilingConfig(
            name="qwenimage",
            pipeline_cls=QwenImagePipeline,
            pipeline_init_kwargs={
                "pretrained_model_name_or_path": "Qwen/Qwen-Image",
                "torch_dtype": torch.bfloat16,
            },
            pipeline_call_kwargs={
                "prompt": PROMPT,
                "negative_prompt": " ",
                "height": 1024,
                "width": 1024,
                "num_inference_steps": 4,
                "true_cfg_scale": 4.0,
                "output_type": "latent",
            },
        ),
    }


def main():
    parser = argparse.ArgumentParser(description="Profile diffusers pipelines with torch.profiler")
    parser.add_argument(
        "--pipeline",
        choices=["flux", "flux2", "wan", "ltx2", "qwenimage", "all"],
        required=True,
        help="Which pipeline to profile",
    )
    parser.add_argument(
        "--mode",
        choices=["eager", "compile", "both"],
        default="eager",
        help="Run in eager mode, compile mode, or both",
    )
    parser.add_argument("--output_dir", default="profiling_results", help="Directory for trace output")
    parser.add_argument("--num_steps", type=int, default=None, help="Override num_inference_steps")
    parser.add_argument("--full_decode", action="store_true", help="Profile including VAE decode (output_type='pil')")
    parser.add_argument(
        "--compile_mode",
        default="default",
        choices=["default", "reduce-overhead", "max-autotune"],
        help="torch.compile mode",
    )
    parser.add_argument("--compile_fullgraph", action="store_true", help="Use fullgraph=True for torch.compile")
    parser.add_argument(
        "--compile_regional",
        action="store_true",
        help="Use compile_repeated_blocks() instead of full model compile",
    )
    args = parser.parse_args()

    registry = build_registry()

    pipeline_names = list(registry.keys()) if args.pipeline == "all" else [args.pipeline]
    modes = ["eager", "compile"] if args.mode == "both" else [args.mode]

    for pipeline_name in pipeline_names:
        for mode in modes:
            config = copy.deepcopy(registry[pipeline_name])

            # Apply overrides
            if args.num_steps is not None:
                config.pipeline_call_kwargs["num_inference_steps"] = args.num_steps
            if args.full_decode:
                config.pipeline_call_kwargs["output_type"] = "pil"
            if mode == "compile":
                config.compile_kwargs = {
                    "fullgraph": args.compile_fullgraph,
                    "mode": args.compile_mode,
                }
                config.compile_regional = args.compile_regional

            logger.info(f"Profiling {pipeline_name} in {mode} mode...")
            profiler = PipelineProfiler(config, args.output_dir)
            try:
                trace_file = profiler.run()
                logger.info(f"Done: {trace_file}")
            except Exception as e:
                logger.error(f"Failed to profile {pipeline_name} ({mode}): {e}")


if __name__ == "__main__":
    main()
143
profiling/profiling_utils.py
Normal file
@@ -0,0 +1,143 @@
import functools
import gc
import logging
import os
from dataclasses import dataclass, field
from typing import Any

import torch
import torch.profiler


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger(__name__)


def annotate(func, name):
    """Wrap a function with torch.profiler.record_function for trace annotation."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with torch.profiler.record_function(name):
            return func(*args, **kwargs)

    return wrapper


def annotate_pipeline(pipe):
    """Apply profiler annotations to key pipeline methods.

    Monkey-patches bound methods so they appear as named spans in the trace.
    Non-invasive — no source modifications required.
    """
    annotations = [
        ("transformer", "forward", "transformer_forward"),
        ("vae", "decode", "vae_decode"),
        ("vae", "encode", "vae_encode"),
        ("scheduler", "step", "scheduler_step"),
    ]

    # Annotate sub-component methods
    for component_name, method_name, label in annotations:
        component = getattr(pipe, component_name, None)
        if component is None:
            continue
        method = getattr(component, method_name, None)
        if method is None:
            continue
        setattr(component, method_name, annotate(method, label))

    # Annotate pipeline-level methods
    if hasattr(pipe, "encode_prompt"):
        pipe.encode_prompt = annotate(pipe.encode_prompt, "encode_prompt")


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


@dataclass
|
||||
class PipelineProfilingConfig:
|
||||
name: str
|
||||
pipeline_cls: Any
|
||||
pipeline_init_kwargs: dict[str, Any]
|
||||
pipeline_call_kwargs: dict[str, Any]
|
||||
compile_kwargs: dict[str, Any] | None = field(default=None)
|
||||
compile_regional: bool = False
|
||||
|
||||
|
||||
class PipelineProfiler:
|
||||
def __init__(self, config: PipelineProfilingConfig, output_dir: str = "profiling_results"):
|
||||
self.config = config
|
||||
self.output_dir = output_dir
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
def setup_pipeline(self):
|
||||
"""Load the pipeline from pretrained, optionally compile, and annotate."""
|
||||
logger.info(f"Loading pipeline: {self.config.name}")
|
||||
pipe = self.config.pipeline_cls.from_pretrained(**self.config.pipeline_init_kwargs)
|
||||
pipe.to("cuda")
|
||||
|
||||
if self.config.compile_kwargs:
|
||||
if self.config.compile_regional:
|
||||
logger.info(f"Regional compilation (compile_repeated_blocks) with kwargs: {self.config.compile_kwargs}")
|
||||
pipe.transformer.compile_repeated_blocks(**self.config.compile_kwargs)
|
||||
else:
|
||||
logger.info(f"Full compilation with kwargs: {self.config.compile_kwargs}")
|
||||
pipe.transformer.compile(**self.config.compile_kwargs)
|
||||
|
||||
annotate_pipeline(pipe)
|
||||
return pipe
|
||||
|
||||
def run(self):
|
||||
"""Execute the profiling run: warmup, then profile one pipeline call."""
|
||||
pipe = self.setup_pipeline()
|
||||
flush()
|
||||
|
||||
mode = "compile" if self.config.compile_kwargs else "eager"
|
||||
trace_file = os.path.join(self.output_dir, f"{self.config.name}_{mode}.json")
|
||||
|
||||
# Warmup (pipeline __call__ is already decorated with @torch.no_grad())
|
||||
logger.info("Running warmup...")
|
||||
pipe(**self.config.pipeline_call_kwargs)
|
||||
flush()
|
||||
|
||||
# Profile
|
||||
logger.info("Running profiled iteration...")
|
||||
activities = [
|
||||
torch.profiler.ProfilerActivity.CPU,
|
||||
torch.profiler.ProfilerActivity.CUDA,
|
||||
]
|
||||
with torch.profiler.profile(
|
||||
activities=activities,
|
||||
record_shapes=True,
|
||||
profile_memory=True,
|
||||
with_stack=True,
|
||||
) as prof:
|
||||
with torch.profiler.record_function("pipeline_call"):
|
||||
pipe(**self.config.pipeline_call_kwargs)
|
||||
|
||||
# Export trace
|
||||
prof.export_chrome_trace(trace_file)
|
||||
logger.info(f"Chrome trace saved to: {trace_file}")
|
||||
|
||||
# Print summary
|
||||
print("\n" + "=" * 80)
|
||||
print(f"Profile summary: {self.config.name} ({mode})")
|
||||
print("=" * 80)
|
||||
print(
|
||||
prof.key_averages().table(
|
||||
sort_by="cuda_time_total",
|
||||
row_limit=20,
|
||||
)
|
||||
)
|
||||
|
||||
# Cleanup
|
||||
pipe.to("cpu")
|
||||
del pipe
|
||||
flush()
|
||||
|
||||
return trace_file
|
||||
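The `annotate` helper called by `annotate_pipeline` is defined earlier in this file and is not visible in this part of the diff. As a rough illustration of the wrap-and-reassign pattern used in the annotation loop, here is a minimal sketch. The names `annotate_sketch`, `record_label`, `RECORDED`, `DummyTransformer`, and `DummyPipe` are hypothetical, and a plain list stands in for `torch.profiler.record_function` so the example runs without PyTorch or a GPU.

```python
import functools
from contextlib import contextmanager

# Hypothetical stand-in for torch.profiler.record_function: it only logs
# labels so the sketch runs without PyTorch.
RECORDED = []


@contextmanager
def record_label(label):
    RECORDED.append(f"enter:{label}")
    try:
        yield
    finally:
        RECORDED.append(f"exit:{label}")


def annotate_sketch(func, label):
    """Return a wrapper that runs func inside a labelled region."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with record_label(label):
            return func(*args, **kwargs)

    return wrapper


class DummyTransformer:
    def forward(self, x):
        return x * 2


class DummyPipe:
    def __init__(self):
        self.transformer = DummyTransformer()


def annotate_pipeline_sketch(pipe, annotations):
    # Same getattr/setattr pattern as the loop in the diff: skip anything
    # the pipeline does not have, otherwise replace the bound method.
    for component_name, method_name, label in annotations:
        component = getattr(pipe, component_name, None)
        if component is None:
            continue
        method = getattr(component, method_name, None)
        if method is None:
            continue
        setattr(component, method_name, annotate_sketch(method, label))


pipe = DummyPipe()
annotate_pipeline_sketch(
    pipe,
    [
        ("transformer", "forward", "transformer_forward"),
        ("vae", "decode", "vae_decode"),  # missing component, silently skipped
    ],
)
result = pipe.transformer.forward(21)
```

Because missing components and methods are skipped rather than raising, one annotation table can be shared across pipelines that expose different sets of components.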
39
profiling/run_profiling.sh
Executable file
@@ -0,0 +1,39 @@
#!/bin/bash
# Run profiling across all pipelines in eager and compile (regional) modes.
#
# Usage:
#   bash profiling/run_profiling.sh
#   bash profiling/run_profiling.sh my_results

set -euo pipefail

OUTPUT_DIR="${1:-profiling_results}"
NUM_STEPS=2
PIPELINES=("flux" "flux2" "wan" "ltx2" "qwenimage")
MODES=("eager" "compile")

for pipeline in "${PIPELINES[@]}"; do
    for mode in "${MODES[@]}"; do
        echo "============================================================"
        echo "Profiling: ${pipeline} | mode: ${mode}"
        echo "============================================================"

        COMPILE_ARGS=""
        if [ "$mode" = "compile" ]; then
            COMPILE_ARGS="--compile_regional --compile_fullgraph --compile_mode default"
        fi

        python profiling/profiling_pipelines.py \
            --pipeline "$pipeline" \
            --mode "$mode" \
            --output_dir "$OUTPUT_DIR" \
            --num_steps "$NUM_STEPS" \
            $COMPILE_ARGS

        echo ""
    done
done

echo "============================================================"
echo "All traces saved to: ${OUTPUT_DIR}/"
echo "============================================================"
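To make the nested loop concrete, the following Python sketch builds the same argument lists that the script passes to `profiling/profiling_pipelines.py`, without launching anything. `build_commands` is a hypothetical helper that is not part of this change; the pipeline names, modes, and flags are copied from the script.

```python
from itertools import product

PIPELINES = ["flux", "flux2", "wan", "ltx2", "qwenimage"]
MODES = ["eager", "compile"]
COMPILE_ARGS = ["--compile_regional", "--compile_fullgraph", "--compile_mode", "default"]


def build_commands(output_dir="profiling_results", num_steps=2):
    """Return one argv list per (pipeline, mode) pair, in script order."""
    commands = []
    # product() iterates pipelines in the outer position and modes in the
    # inner position, matching the nested for-loops in the shell script.
    for pipeline, mode in product(PIPELINES, MODES):
        argv = [
            "python", "profiling/profiling_pipelines.py",
            "--pipeline", pipeline,
            "--mode", mode,
            "--output_dir", output_dir,
            "--num_steps", str(num_steps),
        ]
        if mode == "compile":
            # Extra flags are appended only for the compile runs.
            argv += COMPILE_ARGS
        commands.append(argv)
    return commands


commands = build_commands()
for argv in commands:
    print(" ".join(argv))
```

With five pipelines and two modes, this yields ten invocations, and only the `compile` ones carry the regional-compilation flags.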
@@ -1,155 +1,6 @@
# Copyright 2026 Lightricks and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Pre-trained sigma values for distilled model are taken from
# https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/utils/constants.py
DISTILLED_SIGMA_VALUES = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

# Reduced schedule for super-resolution stage 2 (subset of distilled values)
STAGE_2_DISTILLED_SIGMA_VALUES = [0.909375, 0.725, 0.421875]


# Default negative prompt from
# https://github.com/Lightricks/LTX-2/blob/ae855f8538843825f9015a419cf4ba5edaf5eec2/packages/ltx-pipelines/src/ltx_pipelines/utils/constants.py#L131-L143
DEFAULT_NEGATIVE_PROMPT = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)


# System prompts for prompt enhancement
# https://github.com/Lightricks/LTX-2/blob/ae855f8538843825f9015a419cf4ba5edaf5eec2/packages/ltx-core/src/ltx_core/text_encoders/gemma/encoders/prompts/gemma_t2v_system_prompt.txt#L1
# Disable line-too-long rule in ruff to keep the prompts exactly the same (e.g. in terms of newlines)
# Supported in ruff>=0.15.0
# ruff: disable[E501]
T2V_DEFAULT_SYSTEM_PROMPT = """
You are a Creative Assistant. Given a user's raw input prompt describing a scene or concept, expand it into a detailed
video generation prompt with specific visuals and integrated audio to guide a text-to-video model.

#### Guidelines
- Strictly follow all aspects of the user's raw input: include every element requested (style, visuals, motions,
actions, camera movement, audio).
- If the input is vague, invent concrete details: lighting, textures, materials, scene settings, etc.
- For characters: describe gender, clothing, hair, expressions. DO NOT invent unrequested characters.
- Use active language: present-progressive verbs ("is walking," "speaking"). If no action specified, describe natural
movements.
- Maintain chronological flow: use temporal connectors ("as," "then," "while").
- Audio layer: Describe complete soundscape (background audio, ambient sounds, SFX, speech/music when requested).
Integrate sounds chronologically alongside actions. Be specific (e.g., "soft footsteps on tile"), not vague (e.g.,
"ambient sound is present").
- Speech (only when requested):
- For ANY speech-related input (talking, conversation, singing, etc.), ALWAYS include exact words in quotes with
voice characteristics (e.g., "The man says in an excited voice: 'You won't believe what I just saw!'").
- Specify language if not English and accent if relevant.
- Style: Include visual style at the beginning: "Style: <style>, <rest of prompt>." Default to cinematic-realistic if
unspecified. Omit if unclear.
- Visual and audio only: NO non-visual/auditory senses (smell, taste, touch).
- Restrained language: Avoid dramatic/exaggerated terms. Use mild, natural phrasing.
- Colors: Use plain terms ("red dress"), not intensified ("vibrant blue," "bright red").
- Lighting: Use neutral descriptions ("soft overhead light"), not harsh ("blinding light").
- Facial features: Use delicate modifiers for subtle features (i.e., "subtle freckles").

#### Important notes:
- Analyze the user's raw input carefully. In cases of FPV or POV, exclude the description of the subject whose POV is
requested.
- Camera motion: DO NOT invent camera motion unless requested by the user.
- Speech: DO NOT modify user-provided character dialogue unless it's a typo.
- No timestamps or cuts: DO NOT use timestamps or describe scene cuts unless explicitly requested.
- Format: DO NOT use phrases like "The scene opens with...". Start directly with Style (optional) and chronological
scene description.
- Format: DO NOT start your response with special characters.
- DO NOT invent dialogue unless the user mentions speech/talking/singing/conversation.
- If the user's raw input prompt is highly detailed, chronological and in the requested format: DO NOT make major edits
or introduce new elements. Add/enhance audio descriptions if missing.

#### Output Format (Strict):
- Single continuous paragraph in natural language (English).
- NO titles, headings, prefaces, code fences, or Markdown.
- If unsafe/invalid, return original user prompt. Never ask questions or clarifications.

Your output quality is CRITICAL. Generate visually rich, dynamic prompts with integrated audio for high-quality video
generation.

#### Example Input: "A woman at a coffee shop talking on the phone" Output: Style: realistic with cinematic lighting.
In a medium close-up, a woman in her early 30s with shoulder-length brown hair sits at a small wooden table by the
window. She wears a cream-colored turtleneck sweater, holding a white ceramic coffee cup in one hand and a smartphone
to her ear with the other. Ambient cafe sounds fill the space—espresso machine hiss, quiet conversations, gentle
clinking of cups. The woman listens intently, nodding slightly, then takes a sip of her coffee and sets it down with a
soft clink. Her face brightens into a warm smile as she speaks in a clear, friendly voice, 'That sounds perfect! I'd
love to meet up this weekend. How about Saturday afternoon?' She laughs softly—a genuine chuckle—and shifts in her
chair. Behind her, other patrons move subtly in and out of focus. 'Great, I'll see you then,' she concludes cheerfully,
lowering the phone.
"""
# ruff: enable[E501]

# ruff: disable[E501]
I2V_DEFAULT_SYSTEM_PROMPT = """
You are a Creative Assistant writing concise, action-focused image-to-video prompts. Given an image (first frame) and
user Raw Input Prompt, generate a prompt to guide video generation from that image.

#### Guidelines:
- Analyze the Image: Identify Subject, Setting, Elements, Style and Mood.
- Follow user Raw Input Prompt: Include all requested motion, actions, camera movements, audio, and details. If in
conflict with the image, prioritize user request while maintaining visual consistency (describe transition from image
to user's scene).
- Describe only changes from the image: Don't reiterate established visual details. Inaccurate descriptions may cause
scene cuts.
- Active language: Use present-progressive verbs ("is walking," "speaking"). If no action specified, describe natural
movements.
- Chronological flow: Use temporal connectors ("as," "then," "while").
- Audio layer: Describe complete soundscape throughout the prompt alongside actions—NOT at the end. Align audio
intensity with action tempo. Include natural background audio, ambient sounds, effects, speech or music (when
requested). Be specific (e.g., "soft footsteps on tile") not vague (e.g., "ambient sound").
- Speech (only when requested): Provide exact words in quotes with character's visual/voice characteristics (e.g., "The
tall man speaks in a low, gravelly voice"), language if not English and accent if relevant. If general conversation
mentioned without text, generate contextual quoted dialogue. (i.e., "The man is talking" input -> the output should
include exact spoken words, like: "The man is talking in an excited voice saying: 'You won't believe what I just
saw!' His hands gesture expressively as he speaks, eyebrows raised with enthusiasm. The ambient sound of a quiet room
underscores his animated speech.")
- Style: Include visual style at beginning: "Style: <style>, <rest of prompt>." If unclear, omit to avoid conflicts.
- Visual and audio only: Describe only what is seen and heard. NO smell, taste, or tactile sensations.
- Restrained language: Avoid dramatic terms. Use mild, natural, understated phrasing.

#### Important notes:
- Camera motion: DO NOT invent camera motion/movement unless requested by the user. Make sure to include camera motion
only if specified in the input.
- Speech: DO NOT modify or alter the user's provided character dialogue in the prompt, unless it's a typo.
- No timestamps or cuts: DO NOT use timestamps or describe scene cuts unless explicitly requested.
- Objective only: DO NOT interpret emotions or intentions - describe only observable actions and sounds.
- Format: DO NOT use phrases like "The scene opens with..." / "The video starts...". Start directly with Style
(optional) and chronological scene description.
- Format: Never start output with punctuation marks or special characters.
- DO NOT invent dialogue unless the user mentions speech/talking/singing/conversation.
- Your performance is CRITICAL. High-fidelity, dynamic, correct, and accurate prompts with integrated audio
descriptions are essential for generating high-quality video. Your goal is flawless execution of these rules.

#### Output Format (Strict):
- Single concise paragraph in natural English. NO titles, headings, prefaces, sections, code fences, or Markdown.
- If unsafe/invalid, return original user prompt. Never ask questions or clarifications.

#### Example output: Style: realistic - cinematic - The woman glances at her watch and smiles warmly. She speaks in a
cheerful, friendly voice, "I think we're right on time!" In the background, a café barista prepares drinks at the
counter. The barista calls out in a clear, upbeat tone, "Two cappuccinos ready!" The sound of the espresso machine
hissing softly blends with gentle background chatter and the light clinking of cups on saucers.
"""
# ruff: enable[E501]
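The comment on `STAGE_2_DISTILLED_SIGMA_VALUES` near the top of this file describes it as a subset of the distilled schedule. A quick check using only the two lists from the file suggests it is specifically the last three entries of `DISTILLED_SIGMA_VALUES`. Whether "take the tail" is the intended rule or just a property of these particular values is an assumption here.

```python
DISTILLED_SIGMA_VALUES = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]
STAGE_2_DISTILLED_SIGMA_VALUES = [0.909375, 0.725, 0.421875]

# The distilled schedule decreases strictly from high noise to low noise.
is_decreasing = all(
    a > b for a, b in zip(DISTILLED_SIGMA_VALUES, DISTILLED_SIGMA_VALUES[1:])
)

# The stage-2 schedule matches the tail of the distilled schedule.
tail = DISTILLED_SIGMA_VALUES[-len(STAGE_2_DISTILLED_SIGMA_VALUES):]
is_tail = tail == STAGE_2_DISTILLED_SIGMA_VALUES

print(is_decreasing, is_tail)
```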