CS180: Project 5A

by Adriel Vijuan

Exploring Diffusion Models through Sampling, Denoising, and Translation

Part 0. Setup

I began the project by setting up the development environment: installing necessary libraries such as PyTorch, configuring the UNet architecture, and preparing the training dataset. I fixed the key diffusion model parameters, including the number of inference steps and the noise schedule, and set a fixed seed to ensure reproducibility. Below are three test images generated from specific prompts, followed by a minimal sketch of the setup:

an oil painting of a snowy mountain village

a man wearing a hat

a rocket ship
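For reference, here is a minimal sketch of the setup described above. The schedule values are illustrative (a standard linear beta schedule); the actual pretrained model defines its own schedule, and `alphas_cumprod` below is the cumulative-product ("alpha-bar") tensor reused by the sketches in later sections.

```python
import torch
import numpy as np

# Fix seeds so every run reproduces the same samples.
SEED = 180
torch.manual_seed(SEED)
np.random.seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative noise schedule: a standard linear beta schedule over
# 1000 training timesteps.  The pretrained model ships with its own
# schedule; these values just make the later sketches self-contained.
NUM_TIMESTEPS = 1000
betas = torch.linspace(1e-4, 0.02, NUM_TIMESTEPS, device=device)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)   # "alpha-bar" at each timestep
```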

1.1 Forward Process

I implemented the forward process, which adds Gaussian noise to an image by scaling it according to the cumulative product of the noise schedule and mixing in noise of the complementary magnitude. The function apply_noise computes the noisy image at a given timestep, showing how noise progressively degrades the original image.
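A minimal sketch of the forward process, assuming `alphas_cumprod` is the cumulative-product tensor from the setup sketch (the function name matches the one described above):

```python
import torch

def apply_noise(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0):  x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)                        # eps ~ N(0, I)
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps
```

Larger t means smaller alpha-bar, so the image term shrinks while the noise term grows, which is exactly the progression visible below.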

Timestep 100 (light noise)

Timestep 500 (moderate noise)

Timestep 750 (heavy noise)

1.2 Classical Denoising

Gaussian blur served as a classical baseline against which to compare model-based denoising. The kernel size and sigma were tuned to suppress high-frequency noise while retaining as much image detail as possible.
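The baseline is just a blur, as in this sketch (the kernel size and sigma here are illustrative values):

```python
import torch
import torchvision.transforms.functional as TF

def classical_denoise(noisy: torch.Tensor, kernel_size: int = 5, sigma: float = 2.0) -> torch.Tensor:
    # Gaussian blur suppresses high-frequency noise, but it also blurs
    # genuine detail -- it has no way to "undo" the forward process.
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```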

Gaussian blur at timestep 250

Gaussian blur at timestep 500

Gaussian blur at timestep 750

1.3 One-Step Denoising

Using a trained UNet, I implemented one-step denoising: the function predict_noise estimates the noise added at a given timestep, and inverting the forward-process equation with this estimate recovers an approximation of the clean image in a single step.
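A sketch of the idea, assuming `unet(xt, t)` returns the predicted noise (the actual call signature depends on the model wrapper):

```python
import torch

@torch.no_grad()
def one_step_denoise(xt, t, unet, alphas_cumprod):
    """Estimate the clean image x0 from x_t in a single step."""
    abar_t = alphas_cumprod[t]
    eps_hat = unet(xt, t)                    # predicted noise (assumed interface)
    # Invert the forward process  x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps:
    x0_hat = (xt - (1.0 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
    return x0_hat.clamp(-1, 1)
```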

One-step denoising, step 250

One-step denoising, step 500

One-step denoising, step 750

1.4 Iterative Denoising

This step refines the one-step approach by repeatedly estimating and removing noise over a sequence of timesteps. Each iteration uses the UNet's noise prediction to step from the current noisy image to a slightly less noisy one, producing much cleaner results than a single jump.
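A sketch of one common form of the update, stepping along a strided list of timesteps; the per-step random noise term is omitted for brevity, and `unet(xt, t)` returning the predicted noise is again an assumed interface:

```python
import torch

@torch.no_grad()
def iterative_denoise(xt, timesteps, unet, alphas_cumprod):
    """Denoise along a strided timestep list, e.g. [690, 660, ..., 0]."""
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = abar_t / abar_next            # effective alpha over this stride
        beta = 1.0 - alpha
        eps_hat = unet(xt, t)                 # predicted noise at timestep t
        x0_hat = (xt - (1.0 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        # Move to the less-noisy timestep t_next (posterior mean; the
        # variance/noise term is dropped here for simplicity).
        xt = ((abar_next.sqrt() * beta / (1.0 - abar_t)) * x0_hat
              + (alpha.sqrt() * (1.0 - abar_next) / (1.0 - abar_t)) * xt)
    return xt
```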

Initial noisy image (timestep 690)

Intermediate step 10

Intermediate step 30

Fully denoised result

1.5 Diffusion Model Sampling

I initialized pure random noise and iteratively denoised it using the trained UNet to generate synthetic images from scratch, demonstrating that the model can synthesize plausible images purely from noise.
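Sampling then reduces to starting the previous loop from pure noise, as in this sketch (reusing `iterative_denoise` from 1.4):

```python
import torch

@torch.no_grad()
def sample(shape, timesteps, unet, alphas_cumprod, device="cuda"):
    xt = torch.randn(shape, device=device)    # pure Gaussian noise
    return iterative_denoise(xt, timesteps, unet, alphas_cumprod)
```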

Sampling step 0 (random noise)

Sampling step 10 (intermediate result)

Sampling step 20 (improved details)

Sampling step 30 (final output)

1.6 Classifier-Free Guidance (CFG)

CFG modifies the sampling process by computing both a conditional and an unconditional noise estimate and extrapolating from the unconditional toward the conditional one. The guidance scale controls the trade-off: higher values improve prompt adherence and image quality at the cost of diversity.
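A sketch of the guided estimate, where `cond_emb`/`uncond_emb` stand in for the prompt and null-prompt conditioning (the conditioning interface is an assumption):

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(xt, t, unet, cond_emb, uncond_emb, gamma=7.0):
    """Extrapolate from the unconditional toward the conditional estimate.

    gamma = 1 recovers plain conditional sampling; larger gamma trades
    sample diversity for prompt adherence.
    """
    eps_cond = unet(xt, t, cond_emb)      # prompt-conditioned noise estimate
    eps_uncond = unet(xt, t, uncond_emb)  # null-prompt noise estimate
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```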

Guided generation with CFG, sample 1

Guided generation with CFG, sample 2

Guided generation with CFG, sample 3

1.7 Image-to-Image Translation

1.7.1 Editing Hand-Drawn and Web Images

I performed image-to-image translation by adding varying levels of Gaussian noise to input images and then iteratively denoising them with the diffusion model. The amount of noise added controls how far the output departs from the input, which let me progressively refine both hand-drawn sketches and web-sourced images into more polished, detailed results.
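A sketch of the procedure (SDEdit-style editing), reusing `apply_noise` from 1.1 and `iterative_denoise` from 1.4; the list `strided_timesteps` and the index `i_start` are the assumed knobs for the noise level:

```python
import torch

@torch.no_grad()
def edit_image(x_orig, i_start, strided_timesteps, unet, alphas_cumprod):
    """Noise the input to an intermediate timestep, then denoise back.

    The starting point controls how much of the original survives: the
    more noise added, the more freedom the model has to reinvent it.
    """
    t_start = strided_timesteps[i_start]
    xt = apply_noise(x_orig, t_start, alphas_cumprod)
    return iterative_denoise(xt, strided_timesteps[i_start:], unet, alphas_cumprod)
```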

Hand-Drawn Image

Noise level 3

Noise level 5

Noise level 7

Noise level 10

Noise level 20

Web Image

Noise level 1

Noise level 5

Noise level 10

Noise level 20

1.7.2 Inpainting

I explored inpainting, which reconstructs specific parts of an image while leaving the rest unchanged. A binary mask designates the "unknown" region; during the reverse diffusion process, after every denoising step the pixels outside the mask are reset to the original image noised to the current timestep, so only the masked region is actually synthesized and the unmasked parts remain intact and consistent throughout.
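A sketch of the masked update; `denoise_step` is a hypothetical helper performing one update of the 1.4 loop, and `mask` is 1 on the region to synthesize:

```python
import torch

@torch.no_grad()
def inpaint(x_orig, mask, timesteps, unet, alphas_cumprod):
    xt = torch.randn_like(x_orig)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        # One iterative-denoising update (hypothetical helper, as in 1.4).
        xt = denoise_step(xt, t, t_next, unet, alphas_cumprod)
        # Outside the mask, reset to the original image noised to t_next,
        # so only the masked region is actually generated.
        xt = mask * xt + (1 - mask) * apply_noise(x_orig, t_next, alphas_cumprod)
    return xt
```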

Original image

Applied mask

Intermediate result (masked)

Final inpainted output

1.7.3 Text-Conditional Image Translation

I implemented text-conditional image translation: a text prompt guides the diffusion model while it denoises a noised version of the input image, so the output not only removes noise but also shifts the content toward the semantics specified by the prompt.
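This is the editing loop from 1.7.1 with the CFG estimate from 1.6 plugged in, sketched below (`posterior_step` is a hypothetical helper applying the 1.4 update for a given noise estimate):

```python
import torch

@torch.no_grad()
def text_translate(x_orig, i_start, cond_emb, uncond_emb, strided_timesteps,
                   unet, alphas_cumprod, gamma=7.0):
    t_start = strided_timesteps[i_start]
    xt = apply_noise(x_orig, t_start, alphas_cumprod)
    steps = strided_timesteps[i_start:]
    for t, t_next in zip(steps[:-1], steps[1:]):
        # Prompt-guided noise estimate (from the 1.6 sketch).
        eps_hat = cfg_noise_estimate(xt, t, unet, cond_emb, uncond_emb, gamma)
        # One denoising update using that estimate (hypothetical helper).
        xt = posterior_step(xt, eps_hat, t, t_next, alphas_cumprod)
    return xt
```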

Example 1

Noise level 1

Noise level 5

Noise level 10

Noise level 20

1.8 Visual Anagrams

I created visual anagrams by averaging two noise estimates at every denoising step: one computed on the image under a first prompt, and one computed on the flipped image under a second prompt and then flipped back. The result is an image that reads one way upright and another way upside-down, for example an old man that becomes a campfire when flipped, showcasing the creative potential of diffusion models.
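A sketch of the combined estimate; flipping is along the image height axis, and the prompt-conditioning interface is assumed as in 1.6:

```python
import torch

@torch.no_grad()
def anagram_noise_estimate(xt, t, unet, emb_upright, emb_flipped):
    """Average the upright and flipped noise estimates so the image
    denoises toward one prompt upright and another upside-down."""
    eps_a = unet(xt, t, emb_upright)
    # Estimate for the second prompt on the flipped image, flipped back.
    eps_b = torch.flip(unet(torch.flip(xt, dims=[-2]), t, emb_flipped), dims=[-2])
    return (eps_a + eps_b) / 2
```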

Upright orientation: old man

Flipped orientation: campfire

1.10 Hybrid Images

I experimented with creating hybrid images by blending frequency content from two different prompts: at every denoising step, a lowpass filter is applied to one prompt's noise estimate and a highpass filter to the other's, and the two are summed. The resulting images change interpretation with viewing distance, with the low-frequency content dominating from afar and the high-frequency content up close, highlighting the model's ability to combine distinct frequency components into a single cohesive image.
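A sketch of the frequency-split estimate; the Gaussian-blur kernel here is an illustrative lowpass (the highpass is the residual), and the prompt interface is assumed as above:

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(xt, t, unet, emb_low, emb_high,
                          kernel_size=33, sigma=2.0):
    """Low frequencies from one prompt's estimate, high from the other's.

    From far away the lowpass content dominates (e.g. the skull); up
    close the highpass content takes over (e.g. the waterfall)."""
    eps_low = unet(xt, t, emb_low)
    eps_high = unet(xt, t, emb_high)
    lowpass = TF.gaussian_blur(eps_low, kernel_size, sigma)
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)
    return lowpass + highpass
```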

Example 1: Skull x Waterfall

Example 2: Dog x Old Man

Example 3: Man Wearing Hat x Pencil (failure case)