Project 5: Fun with Diffusion Models

CS 180: Computer Vision and Computational Photography, Fall 2024

Rebecca Feng

Part A: The Power of Diffusion Models!

0. The DeepFloyd Diffusion Model

Let's test out the diffusion model on a few prompts! We notice that the more inference steps we use, the higher the quality of our images, and the more closely each image matches its prompt. We are using a random seed of 180.

Step size
A man wearing a hat
An oil painting of a snowy mountain village
A rocket ship

1. The Forward Process

The forward process in diffusion models takes a clean image and adds noise to it. The resulting noised image at timestep t, denoted x_t, is given by the equation

\[ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim N(0, 1) \tag{A.2}\]

t ranges from 0 to T. When t=0, we have the clean image. When t=T, the image is maximally noised. The noise is sampled from a normal distribution. Here are the results of the noised test images at timesteps t = 250, 500, and 750:

t=0 (Original image)
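
As a rough illustration of Eq. A.2, here is a minimal NumPy sketch of the forward process. The linear beta schedule, the helper names `make_alphas_cumprod` and `forward`, and the random stand-in image are assumptions for illustration only; the pretrained model ships its own noise coefficients.

```python
import numpy as np

def make_alphas_cumprod(T=1000, beta_start=1e-4, beta_end=0.02):
    # Hypothetical linear beta schedule, used only to get plausible alpha-bar values.
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward(x0, t, alphas_cumprod, rng):
    """Noise a clean image x0 to timestep t following Eq. A.2."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

rng = np.random.default_rng(180)
alphas_cumprod = make_alphas_cumprod()
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for a test image
noised = {t: forward(x0, t, alphas_cumprod, rng)[0] for t in (250, 500, 750)}
```

Because the cumulative coefficient shrinks as t grows, the image term fades and the noise term takes over at larger timesteps.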

2. Classical Denoising

Now let's try denoising these images. We can attempt to do so by simply applying a Gaussian blur with a kernel size of (7,7). However, we find that the noisier an image is, the lower the quality of the result:

Noised t=250
Noised t=500
Noised t=750
Blurred t=250
Blurred t=500
Blurred t=750
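
For reference, a 7x7 Gaussian blur can be sketched in plain NumPy as two 1D passes. The value `sigma=2.0`, the reflect padding, and the function names are assumptions; the write-up only specifies the kernel size.

```python
import numpy as np

def gaussian_kernel1d(size=7, sigma=2.0):
    # sigma is an assumed value; only the kernel size comes from the text.
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, size=7, sigma=2.0):
    """Blur a 2D image with a separable size x size Gaussian (reflect padding)."""
    k = gaussian_kernel1d(size, sigma)
    padded = np.pad(img, size // 2, mode="reflect")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

rng = np.random.default_rng(0)
clean = np.zeros((32, 32))
clean[8:24, 8:24] = 1.0
noisy = clean + 0.5 * rng.standard_normal(clean.shape)
blurred = gaussian_blur(noisy)
```

The blur averages away some noise, but it also smears edges, which is why the results degrade as the noise level rises.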

3. One-Step Denoising

We use a pretrained diffusion model to attempt to denoise our input noised image in one step, with the prompt "a high quality photo". That is, we try to remove the noise all at once. We use the given UNet to predict the noise in the image, then remove that noise from our current noised image (solving the forward equation for x_0) to get a denoised estimate. Here are the results for denoising an image at varying noise levels.

Denoised t=250
Denoised t=500
Denoised t=750

Notice how higher noise levels make it harder to recover the original image. This makes sense: as an image gets noisier, it is less clear what details were in the original, so the model has to make assumptions about what those details might be.
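
One-step denoising amounts to solving Eq. A.2 for x_0 once a noise estimate is available. The sketch below uses the true noise and a perturbed copy in place of a UNet prediction; `estimate_x0` and the value of `abar_t` are illustrative assumptions.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, abar_t):
    """Solve Eq. A.2 for x_0 given a noise estimate eps_hat."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)
abar_t = 0.3  # hypothetical cumulative coefficient at some timestep
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# A perfect noise estimate recovers x0 exactly; an imperfect one does not.
x0_oracle = estimate_x0(x_t, eps, abar_t)
eps_imperfect = eps + 0.1 * rng.standard_normal(eps.shape)
x0_imperfect = estimate_x0(x_t, eps_imperfect, abar_t)
```

Any error in the noise estimate is scaled by sqrt((1 - abar_t) / abar_t), which grows as the image gets noisier, consistent with the worse results at t=750.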

4. Iterative Denoising

We can improve our results above by denoising our noised image step by step, removing a little bit of noise each time in the direction of our prompt. The full schedule covers timesteps from t = 0 to t = 1000. However, denoising an image one timestep at a time, for up to 1000 steps, is computationally expensive. Instead, we can skip a few timesteps in each iteration. We keep track of the timesteps we actually visit in an array we define as strided_timesteps, consisting of only every 30th timestep.

The equation to estimate our less noisy image is

\[ x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma\tag{A.3} \]

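where t is our current timestep, t' is the next strided timestep (leading to a less noisy image), and x_0 is the current estimate of the original image (predicted as in section 3). The $\bar\alpha$ terms are the cumulative noise coefficients at their respective timesteps, with $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$ and $\beta_t = 1 - \alpha_t$. The v_sigma term is additional noise, which is also predicted by the model.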
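
A single update of Eq. A.3 can be sketched as follows. This assumes that alpha_t is the ratio of the two cumulative coefficients and beta_t = 1 - alpha_t, that `strided_timesteps` runs from 990 down to 0 in steps of 30 (which matches the timesteps shown in the figure), and a hypothetical linear schedule; `v_sigma` is left as an input.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # hypothetical schedule for illustration
alphas_cumprod = np.cumprod(1.0 - betas)
strided_timesteps = list(range(990, -1, -30))  # assumed: 990, 960, ..., 0

def denoise_step(x_t, x0_hat, t, t_prime, v_sigma=0.0):
    """One update of Eq. A.3 from timestep t to the less noisy t_prime."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp            # assumed definition for strided steps
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_tp) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_tp) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma

# Sanity check: a noise-free x_t with a perfect x0 estimate lands exactly on
# the noise-free image at t_prime.
x0 = np.ones((2, 2))
t, t_prime = 690, 660
out = denoise_step(np.sqrt(alphas_cumprod[t]) * x0, x0, t, t_prime)
```

The update is a weighted blend of the clean estimate and the current noisy image, so each step moves only part of the way toward the estimate.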

We show the process of a noised image being iteratively denoised, resulting in a final, less noised, predicted result:

Noisy Campanile at t=690
Noisy Campanile at t=540
Noisy Campanile at t=390
Noisy Campanile at t=240
Noisy Campanile at t=90

5. Diffusion Model Sampling

With our pretrained model, we now generate images of our own by passing in random noise and seeing what the model outputs for our prompt, "a high quality photo". Here are some results:

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5

6. Classifier-Free Guidance (CFG)

You may notice that some of the images above, or parts of them, are unclear in what they depict. We aim to improve the quality of these images and make the results more comprehensible by employing a technique called "classifier-free guidance."

We estimate both a conditional noise estimate $\epsilon_c $ and unconditional noise estimate $\epsilon_u $, such that our final output noise is

\[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \tag{A.4} \]

where $\gamma$ is the strength of the classifier-free guidance. We end up having higher quality images when $\gamma > 1$. We implement CFG in our iterative denoiser, with $\gamma = 7$.
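
Eq. A.4 is a one-line extrapolation from the unconditional estimate toward the conditional one. In this sketch the two small arrays stand in for the UNet outputs, and `cfg_noise` is a hypothetical helper name.

```python
import numpy as np

def cfg_noise(eps_u, eps_c, gamma=7.0):
    """Combine unconditional and conditional noise estimates as in Eq. A.4."""
    return eps_u + gamma * (eps_c - eps_u)

eps_u = np.array([0.0, 1.0])   # stand-in unconditional estimate
eps_c = np.array([1.0, 1.5])   # stand-in conditional estimate
eps = cfg_noise(eps_u, eps_c, gamma=7.0)
```

Setting gamma to 0 returns the unconditional estimate and gamma to 1 returns the conditional one; values above 1 push past the conditional estimate.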

Here are some results:

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5

These images look much clearer and more comprehensible than the ones generated without CFG!

7. Image-to-image Translation

With CFG, we add a little noise to a test image, then force it back onto the natural image manifold by iteratively denoising it. This algorithm is called SDEdit. Here are some results from noising our image to various noise levels, set by different starting indices into strided_timesteps:

Ghirardelli Square
Water tower
Original Image
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
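
To make the role of i_start concrete, the sketch below maps each index to a timestep and noises an image to that level. It assumes `strided_timesteps` runs from 990 down to 0 in steps of 30 and uses a hypothetical linear schedule; the iterative denoising loop that follows the noising step is omitted.

```python
import numpy as np

strided_timesteps = list(range(990, -1, -30))           # assumed ordering
alphas_cumprod = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # hypothetical

def sdedit_start(x_orig, i_start, rng):
    """Noise x_orig to the timestep selected by i_start (first half of SDEdit)."""
    t = strided_timesteps[i_start]
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    return t, np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps

rng = np.random.default_rng(0)
x_orig = rng.uniform(-1.0, 1.0, size=(3, 16, 16))  # stand-in image
start_timesteps = [sdedit_start(x_orig, i, rng)[0] for i in (1, 3, 5, 7, 10, 20)]
```

Under this assumption, a small i_start begins from a very noisy image, so the output can drift far from the original, while i_start=20 begins much closer to it.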

We can also draw images of our own, or find images on the internet, and pass them through the model to see what it turns them into:

Tomato carrot
Mysterious guy
Original Image
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20

We can also apply this algorithm to a portion of our image instead of the entire image. This is called inpainting. We define a mask over the image, setting it to 0 where we want to keep the original content and to 1 where we want to generate something new in the denoising loop.
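
One common way to apply such a mask, shown here as a sketch rather than the exact implementation used, is to reset the kept region after every denoising step to the original image noised to the current timestep. The function name and the value of `abar_t` are assumptions.

```python
import numpy as np

def apply_inpaint_mask(x_t, x_orig, mask, abar_t, rng):
    """Keep generated pixels where mask == 1; elsewhere use noised x_orig."""
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_orig_t

rng = np.random.default_rng(0)
x_orig = rng.uniform(-1.0, 1.0, size=(3, 16, 16))  # stand-in original image
x_t = rng.standard_normal(x_orig.shape)            # stand-in current sample
mask = np.zeros((1, 16, 16))
mask[:, 4:12, 4:12] = 1.0                          # region to regenerate
out = apply_inpaint_mask(x_t, x_orig, mask, abar_t=0.5, rng=rng)
```

Only the masked square evolves freely; everything outside it is pinned to the original image at the matching noise level.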

Here are some inpainting results:

Original image
Portion to replace
Inpainted image

Instead of using our default prompt, "a high quality photo", we can change the prompt and generate different-looking results combined with inpainting. We show the result of CFG at different noise levels determined by i_start and a few chosen prompts:

A rocket ship
A photo of a hipster barista
A pencil
Original image

8. Visual Anagrams

We can also generate optical illusions with our diffusion model. We will generate an image that looks like one prompt when right-side up, but when flipped upside down, it'll look like something else! To do so, we predict the noise for the right-side-up prompt and for the upside-down prompt, average the two, and iteratively denoise our image. Our noise is calculated as follows:

\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = (\epsilon_1 + \epsilon_2) / 2 \]
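
The three equations above can be sketched as below. The `toy_unet` function is a made-up stand-in for the real noise predictor, and the (channels, height, width) layout is an assumption; only `anagram_noise` mirrors the equations.

```python
import numpy as np

def anagram_noise(unet, x_t, t, p1, p2):
    """Average the upright estimate with the flipped-back upside-down estimate."""
    eps1 = unet(x_t, t, p1)
    flipped = np.flip(x_t, axis=-2)               # flip vertically (height axis)
    eps2 = np.flip(unet(flipped, t, p2), axis=-2)  # flip the estimate back
    return (eps1 + eps2) / 2.0

def toy_unet(x, t, prompt):
    # Hypothetical predictor: adds a prompt-dependent top-to-bottom ramp.
    rows = np.linspace(-1.0, 1.0, x.shape[-2]).reshape(-1, 1)
    return x + {"p1": 1.0, "p2": 2.0}[prompt] * rows

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
eps = anagram_noise(toy_unet, x_t, t=500, p1="p1", p2="p2")
```

Flipping the second estimate back before averaging keeps both estimates in the same orientation as x_t.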

Here are some visual results:

Right side up, upside down
a photo of a dog, a man wearing a hat
an oil painting of people around a campfire, an oil painting of an old man
an oil painting of a snowy mountain village, a lithograph of waterfalls

9. Hybrid Images

We can also create hybrid images, which look like one prompt up close and like another prompt from far away. This is a concept we have explored extensively in earlier projects. High frequencies tend to dominate our interpretation of an image when we look at it up close, and low frequencies dominate when we look at it from far away.

This is how we will calculate the predicted noise at each timestep:

\[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2) \]

The calculation above works as follows. For prompt 1, we keep only the low frequencies of the predicted noise: we apply a Gaussian blur to the noise predicted for prompt 1, which filters out its high frequencies. For prompt 2, we keep only the high frequencies: we subtract the blurred predicted noise for prompt 2 from the unblurred predicted noise for prompt 2.

We add the resulting noises together to get the predicted noise that we will use for iterative denoising.
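
A minimal NumPy sketch of this low-pass and high-pass combination follows. The kernel size, `sigma`, and the 2D single-channel arrays are assumptions chosen for brevity, and the two random arrays stand in for the UNet noise estimates.

```python
import numpy as np

def lowpass(img, size=7, sigma=2.0):
    """Separable Gaussian blur of a 2D array (assumed size and sigma)."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img, size // 2, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def hybrid_noise(eps1, eps2):
    """Low frequencies of eps1 plus high frequencies of eps2."""
    highpass_eps2 = eps2 - lowpass(eps2)
    return lowpass(eps1) + highpass_eps2

rng = np.random.default_rng(0)
eps1 = rng.standard_normal((16, 16))  # stand-in estimate for prompt 1
eps2 = rng.standard_normal((16, 16))  # stand-in estimate for prompt 2
eps = hybrid_noise(eps1, eps2)
```

If both prompts produced the same estimate, the low-pass and high-pass parts would add back up to that estimate unchanged.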

Low freq
High freq
a photo of the amalfi cost
a photo of a hipster barista
a lithograph of skulls
a lithograph of waterfalls
a lithograph of waterfalls
a photo of a dog

Part B: Diffusion Models from Scratch!

Now, we implement three different UNets from scratch: unconditioned, time-conditioned, and class-conditioned. The class-conditioned UNet is also time-conditioned. In the process, we also implement DDPM as a wrapper for our UNet based upon this paper, and train solely on the MNIST dataset.

Unconditional UNet

The implemented UNet has the following structure:


We used this unconditioned UNet as a one-step denoiser, optimized on an L2 loss:

\[ L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2 \tag{B.1} \]

where $z$ is our noisy image, and $x$ is the clean image. We generate such an image $z$ by adding noise $\epsilon$ scaled by a noise level $\sigma$, such that

\[ z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim N(0, I). \tag{B.2} \]
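
Eqs. B.1 and B.2 can be sketched in a few lines. The random batch stands in for MNIST digits, the helper names are hypothetical, and the loss here averages over pixels, which differs from the squared norm in B.1 only by a constant factor.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Eq. B.2: z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * rng.standard_normal(x.shape)

def l2_loss(denoised, x):
    """Mean squared error between the denoiser output and the clean image."""
    return np.mean((denoised - x) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(256, 28, 28))  # stand-in for a batch of digits
z = add_noise(x, sigma=0.5, rng=rng)
baseline = l2_loss(z, x)  # loss of a "denoiser" that returns z unchanged
```

The baseline comes out close to sigma squared, which gives a reference point for reading the training loss curve.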

Here are the results of applying noise onto a clean image with varying amounts of $\sigma$

We trained our UNet on noisy images from the MNIST dataset with $\sigma = 0.5$, with batch size of 256 over 5 epochs. We set our number of hidden dimensions to D = 128, and use the Adam optimizer with a learning rate of 1e-4. Here is our training loss over 5 epochs:

Here are our results after training:

Denoised images after 1 epoch
Denoised images after 5 epochs

After 5 epochs, the output of our image becomes much clearer!

Let's test how well our unconditioned UNet works on images with varying noise levels, even though it was trained only at noise level $\sigma=0.5$:

The output isn't looking that great for noise levels less than 0.5.

Conditional UNet

Let's implement iterative denoising so we can get a much better output. The loss function is now modified to

\[ L = \mathbb{E}_{\epsilon,z} \|\epsilon_{\theta}(z) - \epsilon\|^2 \tag{B.3} \]

We predefine a list of parameters helping us to calculate the noise at each timestep:

  1. $\beta$ with length T, where $\beta_0 = 0.0001$ and $\beta_T = 0.02$. In between, we have equally spaced values
  2. $\alpha_t = 1 - \beta_t$
  3. $\bar\alpha_t = \prod_{s=1}^t \alpha_s$ where $s \in \{1, \cdots, t\}$
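
The three definitions above translate directly into NumPy. This sketch uses T = 300 from the text and zero-based array indices, so `betas[0]` is the first value and `betas[-1]` the last.

```python
import numpy as np

T = 300
betas = np.linspace(1e-4, 0.02, T)  # equally spaced from 0.0001 to 0.02
alphas = 1.0 - betas                # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)     # running product of the alphas
```

Since every alpha is below 1, `alpha_bars` decreases monotonically, so later timesteps keep less of the clean image.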

We let $T=300$. The architecture for our time-conditioned UNet is:

We add a new operator FCBlock which works as follows:

In order to train our UNet, we take a batch of images, noise each one to a randomly sampled timestep $t$, and optimize the loss given above so that the model predicts the noise at any given $t$. At sampling time, the predicted noise is used to step from timestep $t$ to the less noisy image at timestep $t-1$. We use a batch size of 128, train over 20 epochs with hidden dimension D = 64, and use the Adam optimizer with an initial learning rate of 1e-3. We also use an exponential learning rate decay scheduler with a gamma of $0.1^{(1.0 / \text{num\_epochs})}$. Here are our training loss and the resulting images after training our model for 5 epochs and for 20 epochs:

Training loss of the time conditioned UNet over 20 epochs
Generated images after training on 5 epochs
Generated images after training on 20 epochs
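
To see what the decay setting implies, the arithmetic can be checked directly. This is a plain-Python sketch of an exponential schedule that multiplies the learning rate by gamma once per epoch, using the values stated above; `lr_at_epoch` is a hypothetical helper.

```python
num_epochs = 20
initial_lr = 1e-3
gamma = 0.1 ** (1.0 / num_epochs)  # roughly 0.891 per epoch

def lr_at_epoch(epoch):
    """Learning rate after `epoch` multiplicative decays."""
    return initial_lr * gamma ** epoch

final_lr = lr_at_epoch(num_epochs)
```

With this gamma, the learning rate falls by a factor of 10 over the whole run, ending near 1e-4.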

These look great, but some of the generated digits are still hard to read. Similar to Part A, we can apply class conditioning and train on the digit labels in order to improve our output.

Class-conditioned UNet

Refer to Part A's description of classifier-free guidance. We use a gamma value of 5 in our implementation of class-conditioning, with the same parameters as the time-conditioned UNet. We also insert two more FCBlocks into our architecture for the class-conditioning one-hot vector c. With c1 and c2 denoting the FCBlock outputs for c, and t1 and t2 the FCBlock outputs for the timestep,

unflatten = c1 * unflatten + t1 and up1 = c2 * up1 + t2
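
The modulation is a per-channel scale and shift that broadcasts over the spatial dimensions. The shapes below (batch of 2, 4 channels, 7x7 feature map) and the `modulate` helper are assumptions for illustration, not the network's actual sizes.

```python
import numpy as np

def modulate(features, scale, shift):
    """Scale by the class embedding and shift by the time embedding."""
    # (B, C) embeddings are broadcast over the (H, W) dimensions.
    return scale[:, :, None, None] * features + shift[:, :, None, None]

unflatten = np.ones((2, 4, 7, 7))  # stand-in feature map
c1 = np.full((2, 4), 2.0)          # stand-in FCBlock output for the class vector
t1 = np.full((2, 4), 3.0)          # stand-in FCBlock output for the timestep
unflatten = modulate(unflatten, c1, t1)
```

The class signal acts multiplicatively on the features while the time signal is added on top.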

Here are the results:

Training loss over 20 epochs of class-conditioning
Generated numbers over 5 epochs of class-conditioning
Generated numbers over 20 epochs of class-conditioning

Our numbers look clearer and more comprehensible!

Bells and Whistles

Course logo

Here are two variations on a course logo, made by feeding a drawn version of the iconic "orapple" into the image-to-image translation, then using the stage 2 UNet to upsample the image. I also made references to our facial triangulation project!

Artistic results

I took one of my drawings and fed it into the image-to-image translation as well to see what it outputs with i_start=20. Didn't get any dogs back though.

The original image
Image to image translation with i_start=20

Gif of denoising

A gif of denoising a class-conditioned sample every 30 timesteps:

Over 5 epochs
Over 20 epochs