<!--Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
Copyright 2024-2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Marigold Computer Vision

**Marigold** is a diffusion-based [method](https://huggingface.co/papers/2312.02145) and a collection of [pipelines](../api/pipelines/marigold) designed for
dense computer vision tasks, including **monocular depth prediction**, **surface normals estimation**, and **intrinsic
image decomposition**.

This guide will walk you through using Marigold to generate fast and high-quality predictions for images and videos.

Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a
corresponding prediction.
Currently, the following computer vision tasks are implemented:

| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities |
|----------|-------------------------------|:-------------------------:|----------------------|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) |
| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1),<br>[prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) |

All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face.
They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train
new model checkpoints.
The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.

| Checkpoint | Modality | Comment |
|------------|----------|---------|
| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. |
| [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen-space camera frame, with values in the range from -1 to 1. |
| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition comprises Albedo and two BRDF material properties: Roughness and Metallicity. |
| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | Hypersim decomposition of an image \\(I\\) comprises Albedo \\(A\\), Diffuse shading \\(S\\), and Non-diffuse residual \\(R\\): \\(I = A*S+R\\). |

The examples below are mostly given for depth prediction, but they can be universally applied to other supported
modalities.
We showcase the predictions using the same input image of Albert Einstein generated by Midjourney.
This makes it easier to compare visualizations of the predictions across various modalities and checkpoints.

<div class="flex gap-4" style="justify-content: center; width: 100%;">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://marigoldmonodepth.github.io/images/einstein.jpg"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Example input image for all Marigold pipelines
    </figcaption>
  </div>
</div>

## Depth Prediction

To get a depth prediction, load the `prs-eth/marigold-depth-v1-1` checkpoint into [`MarigoldDepthPipeline`],
put the image through the pipeline, and save the predictions:

```python
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(image)

vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")

depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")
```

The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth`] function applies one of
[matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]`
depth range into an RGB image.
With the `Spectral` colormap, near pixels are painted red and far pixels are blue.
The 16-bit PNG file stores the single-channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`.
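
For reference, the 16-bit export can also be reproduced by hand. The following is a minimal sketch, assuming `depth.prediction` from the snippet above is a NumPy array (the default output type) with values in `[0, 1]`; the output file name is illustrative:

```python
import numpy as np
from PIL import Image

# Hypothetical manual equivalent of export_depth_to_16bit_png: take the first
# prediction, map [0, 1] linearly to [0, 65535], and save as a 16-bit PNG.
arr = np.asarray(depth.prediction[0]).squeeze()        # (H, W), float in [0, 1]
arr_16bit = (arr * 65535.0).round().astype(np.uint16)  # linear map to [0, 65535]
Image.fromarray(arr_16bit).save("einstein_depth_16bit_manual.png")
```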

Below are the raw and the visualized predictions. The darker and closer areas (the mustache) are easier to distinguish in
the visualization.

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth_16bit.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Predicted depth (16-bit PNG)
    </figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Predicted depth visualization (Spectral)
    </figcaption>
  </div>
</div>

## Surface Normals Estimation

Load the `prs-eth/marigold-normals-v1-1` checkpoint into [`MarigoldNormalsPipeline`], put the image through the
pipeline, and save the predictions:

```python
import diffusers
import torch

pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
    "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

normals = pipe(image)

vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")
```

The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals`] function maps the three-dimensional
prediction, with pixel values in the range `[-1, 1]`, into an RGB image.
The visualization function supports flipping the surface normals axes to make the visualization compatible with other
choices of the frame of reference.
Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where the `X` axis
points right, the `Y` axis points up, and the `Z` axis points at the viewer.

Below is the visualized prediction:

<div class="flex gap-4" style="justify-content: center; width: 100%;">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Predicted surface normals visualization
    </figcaption>
  </div>
</div>

In this example, the nose tip almost certainly contains a point on the surface whose normal vector points
straight at the viewer, meaning that its coordinates are `[0, 0, 1]`.
This vector maps to the RGB value `[128, 128, 255]`, which corresponds to the violet-blue color.
Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the
red hue.
Points on the shoulders pointing up, with a large `Y` component, promote a green hue.
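
These colors can be sanity-checked with a small helper. The following is a minimal sketch of the conventional `(n + 1) / 2` normalization; the exact rounding inside `visualize_normals` may differ, and `normal_to_rgb` is a hypothetical name:

```python
import numpy as np

# Map a unit surface normal with components in [-1, 1] to an RGB color in [0, 255].
def normal_to_rgb(n):
    return np.round((np.asarray(n) + 1.0) / 2.0 * 255.0).astype(np.uint8)

print(normal_to_rgb([0.0, 0.0, 1.0]))  # [128 128 255], the violet-blue nose tip
```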
## Intrinsic Image Decomposition

Marigold provides two models for Intrinsic Image Decomposition (IID): "Appearance" and "Lighting".
Both models produce Albedo maps, derived from InteriorVerse and Hypersim annotations, respectively.

- The "Appearance" model also estimates Material properties: Roughness and Metallicity.
- The "Lighting" model generates Diffuse Shading and Non-diffuse Residual.

Here is sample code that saves the predictions made by the "Appearance" model:

```python
import diffusers
import torch

pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
    "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

intrinsics = pipe(image)

vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
vis[0]["albedo"].save("einstein_albedo.png")
vis[0]["roughness"].save("einstein_roughness.png")
vis[0]["metallicity"].save("einstein_metallicity.png")
```

Another example, demonstrating the predictions made by the "Lighting" model:

```python
import diffusers
import torch

pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
    "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

intrinsics = pipe(image)

vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
vis[0]["albedo"].save("einstein_albedo.png")
vis[0]["shading"].save("einstein_shading.png")
vis[0]["residual"].save("einstein_residual.png")
```

Both models share the same pipeline while supporting different decomposition types.
The exact decomposition parameterization (e.g., sRGB vs. linear space) is stored in the
`pipe.target_properties` dictionary, which is passed into the
[`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics`] function.
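
To see which targets and parameterization a loaded checkpoint provides, the dictionary can simply be printed; a minimal sketch, continuing from the "Lighting" snippet above:

```python
# Describes the decomposition targets of the loaded checkpoint (e.g., "albedo",
# "shading", "residual" for the "Lighting" model) and their parameterization;
# the same dictionary is consumed by visualize_intrinsics above.
print(pipe.target_properties)
```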

Below are some examples showcasing the predicted decomposition outputs.
All modalities can be inspected in the
[Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) Space.

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/8c7986eaaab5eb9604eb88336311f46a7b0ff5ab/marigold/marigold_einstein_albedo.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Predicted albedo ("Appearance" model)
    </figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/8c7986eaaab5eb9604eb88336311f46a7b0ff5ab/marigold/marigold_einstein_diffuse.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Predicted diffuse shading ("Lighting" model)
    </figcaption>
  </div>
</div>

## Speeding up inference

The quick start snippets above are already optimized for quality and speed: they load the checkpoint, use the
`fp16` variant of weights and computation, and perform the default number (4) of denoising diffusion steps.
The first step to accelerate inference, at the expense of prediction quality, is to reduce the number of denoising
diffusion steps to the minimum:

```diff
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

- depth = pipe(image)
+ depth = pipe(image, num_inference_steps=1)
```

With this change, the `pipe` call completes in 280ms on an RTX 3090 GPU.
Internally, the input image is first encoded with the Stable Diffusion VAE encoder, followed by a single denoising
step performed by the U-Net.
Finally, the prediction latent is decoded with the VAE decoder into pixel space.
In this setup, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM.
Since Marigold's latent space is compatible with Stable Diffusion 2.0, inference can be accelerated by more than 3x,
reducing the call time to 85ms on an RTX 3090, by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny).
Note that using a lightweight VAE may slightly reduce the visual quality of the predictions.

```diff
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

+ pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
+     "madebyollin/taesd", torch_dtype=torch.float16
+ ).cuda()

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(image, num_inference_steps=1)
```
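
To verify such timings on your own hardware, the call can be measured directly. The following is a minimal sketch, continuing from the snippet above; the warm-up call is excluded because it pays one-time initialization costs, and absolute numbers depend on the GPU:

```python
import time
import torch

pipe(image, num_inference_steps=1)  # warm-up call, excluded from timing

torch.cuda.synchronize()
start = time.perf_counter()
depth = pipe(image, num_inference_steps=1)
torch.cuda.synchronize()
print(f"Inference took {(time.perf_counter() - start) * 1000:.0f} ms")
```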

So far, we have optimized the number of diffusion steps and model components. Self-attention operations account for a
significant portion of the computation.
They can be sped up by using a more efficient attention processor:

```diff
import diffusers
import torch
+ from diffusers.models.attention_processor import AttnProcessor2_0

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

+ pipe.vae.set_attn_processor(AttnProcessor2_0())
+ pipe.unet.set_attn_processor(AttnProcessor2_0())

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(image, num_inference_steps=1)
```

Finally, as suggested in [Optimizations](../optimization/torch2.0#torch.compile), enabling `torch.compile` can further enhance performance depending on
the target hardware.
However, compilation incurs a significant overhead during the first pipeline invocation, making it beneficial only when
the same pipeline instance is called repeatedly, such as within a loop.

```diff
import diffusers
import torch
from diffusers.models.attention_processor import AttnProcessor2_0

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

pipe.vae.set_attn_processor(AttnProcessor2_0())
pipe.unet.set_attn_processor(AttnProcessor2_0())

+ pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(image, num_inference_steps=1)
```
## Maximizing Precision and Ensembling

Marigold pipelines have a built-in ensembling mechanism that combines multiple predictions from different random latents.
This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion.
The ensembling path is activated automatically when the `ensemble_size` argument is set to a value greater than or equal to `3`.
When aiming for maximum precision, it makes sense to adjust `num_inference_steps` together with `ensemble_size`.
The recommended values vary across checkpoints but primarily depend on the scheduler type.
The effect of ensembling is particularly visible with surface normals:

```diff
import diffusers

pipe = diffusers.MarigoldNormalsPipeline.from_pretrained("prs-eth/marigold-normals-v1-1").to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

- normals = pipe(image)
+ normals = pipe(image, num_inference_steps=10, ensemble_size=5)

vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")
```

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_normals.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Surface normals, no ensembling
    </figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Surface normals, with ensembling
    </figcaption>
  </div>
</div>

As can be seen, all areas with fine-grained structure, such as hair, receive more conservative and, on average, more
correct predictions.
Such results are more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.

## Frame-by-frame Video Processing with Temporal Consistency

Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent
initialization.
This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the
following videos:

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Input video</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption>
  </div>
</div>

To address this issue, it is possible to pass the `latents` argument to the pipelines, which defines the starting point of
diffusion.
Empirically, we found that a convex combination of the very same starting point noise latent and the latent
corresponding to the previous frame's prediction gives sufficiently smooth results, as implemented in the snippet below:

```python
import imageio
import diffusers
import torch
from diffusers.models.attention_processor import AttnProcessor2_0
from PIL import Image
from tqdm import tqdm

device = "cuda"
path_in = "https://huggingface.co/spaces/prs-eth/marigold-lcm/resolve/c7adb5427947d2680944f898cd91d386bf0d4924/files/video/obama.mp4"
path_out = "obama_depth.gif"

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to(device)
pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16
).to(device)
pipe.unet.set_attn_processor(AttnProcessor2_0())
pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.set_progress_bar_config(disable=True)

with imageio.get_reader(path_in) as reader:
    size = reader.get_meta_data()['size']
    last_frame_latent = None
    latent_common = torch.randn(
        (1, 4, 768 * size[1] // (8 * max(size)), 768 * size[0] // (8 * max(size)))
    ).to(device=device, dtype=torch.float16)

    out = []
    for frame_id, frame in tqdm(enumerate(reader), desc="Processing Video"):
        frame = Image.fromarray(frame)
        latents = latent_common
        if last_frame_latent is not None:
            latents = 0.9 * latents + 0.1 * last_frame_latent

        depth = pipe(
            frame,
            num_inference_steps=1,
            match_input_resolution=False,
            latents=latents,
            output_latent=True,
        )
        last_frame_latent = depth.latent
        out.append(pipe.image_processor.visualize_depth(depth.prediction)[0])

    diffusers.utils.export_to_gif(out, path_out, fps=reader.get_meta_data()['fps'])
```

Here, the diffusion process starts from the given computed latent.
Setting `output_latent=True` makes the pipeline return `depth.latent`, which is then mixed into the next frame's latent
initialization.
The result is much more stable now:

<div class="flex gap-4">
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_independent.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth applied to input video frames independently</figcaption>
  </div>
  <div style="flex: 1 1 50%; max-width: 50%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_obama_depth_consistent.gif"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">Marigold Depth with forced latents initialization</figcaption>
  </div>
</div>

## Marigold for ControlNet

A very common application for depth prediction with diffusion models comes in conjunction with ControlNet.
Depth crispness plays a crucial role in obtaining high-quality results from ControlNet, and Marigold excels at that task.
The snippet below demonstrates how to load an image, compute depth, and pass it into ControlNet in a compatible format:

```python
import torch
import diffusers

device = "cuda"
generator = torch.Generator(device=device).manual_seed(2024)
image = diffusers.utils.load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"
)

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", torch_dtype=torch.float16, variant="fp16"
).to(device)

depth_image = pipe(image, generator=generator).prediction
depth_image = pipe.image_processor.visualize_depth(depth_image, color_map="binary")
depth_image[0].save("motorcycle_controlnet_depth.png")

controlnet = diffusers.ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16, variant="fp16"
).to(device)
pipe = diffusers.StableDiffusionXLControlNetPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnet
).to(device)
pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

controlnet_out = pipe(
    prompt="high quality photo of a sports bike, city",
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
    image=depth_image,
    controlnet_conditioning_scale=0.7,
    control_guidance_end=0.7,
    generator=generator,
).images
controlnet_out[0].save("motorcycle_controlnet_out.png")
```

<div class="flex gap-4">
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_depth_source.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Input image
    </figcaption>
  </div>
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_depth.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Depth in the format compatible with ControlNet
    </figcaption>
  </div>
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/motorcycle_controlnet_out.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      ControlNet generation, conditioned on depth and prompt: "high quality photo of a sports bike, city"
    </figcaption>
  </div>
</div>

## Quantitative Evaluation

To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets),
follow the evaluation protocol outlined in the paper: load the full-precision `fp32` model and use appropriate values
for `num_inference_steps` and `ensemble_size`.
Optionally seed the randomness to ensure reproducibility.
Maximizing `batch_size` will deliver maximum device utilization.

```python
import diffusers
import torch

device = "cuda"
seed = 2024

generator = torch.Generator(device=device).manual_seed(seed)
pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device)

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(
    image,
    num_inference_steps=4,  # set according to the evaluation protocol from the paper
    ensemble_size=10,       # set according to the evaluation protocol from the paper
    generator=generator,
)

# evaluate metrics
```
## Using Predictive Uncertainty

The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random
latents.
As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify an `ensemble_size` greater
than or equal to 3 and set `output_uncertainty=True`.
The resulting uncertainty will be available in the `uncertainty` field of the output.
It can be visualized as follows:

```python
import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(
    image,
    ensemble_size=10,  # any number >= 3
    output_uncertainty=True,
)

uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
uncertainty[0].save("einstein_depth_uncertainty.png")
```

<div class="flex gap-4">
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_depth_uncertainty.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Depth uncertainty
    </figcaption>
  </div>
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_normals_uncertainty.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Surface normals uncertainty
    </figcaption>
  </div>
  <div style="flex: 1 1 33%; max-width: 33%;">
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/4f83035d84a24e5ec44fdda129b1d51eba12ce04/marigold/marigold_einstein_albedo_uncertainty.png"/>
    <figcaption class="mt-1 text-center text-sm text-gray-500">
      Albedo uncertainty
    </figcaption>
  </div>
</div>

The interpretation of uncertainty is straightforward: higher values (white) correspond to pixels where the model struggles to
make consistent predictions.

- The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly.
- The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the
  collar area.
- Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel,
  unlike depth and surface normals. It is also higher in shaded regions and at discontinuities.
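
Beyond visualization, the raw uncertainty can also be used to mask out unreliable pixels for downstream processing. The following is a minimal sketch, assuming `depth.uncertainty` from the snippet above is a NumPy array; the 90th-percentile cut-off is an arbitrary illustrative choice:

```python
import numpy as np

unc = np.asarray(depth.uncertainty[0]).squeeze()

# Keep only pixels whose uncertainty lies below the 90th percentile.
confident = unc < np.percentile(unc, 90)
print(f"{confident.mean():.1%} of pixels marked as confident")
```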
## Conclusion

We hope Marigold proves valuable for your downstream tasks, whether as part of a broader generative workflow or for
perception-based applications like 3D reconstruction.