Let's test out the diffusion model by passing in some prompts! We notice that the more inference steps we use, the higher the quality of the images and the more faithful they are to the prompt. We are using a random seed of 180.
| Inference steps | A man wearing a hat | An oil painting of a snowy mountain village | A rocket ship |
|---|---|---|---|
| 2 | ![]() | ![]() | ![]() |
| 5 | ![]() | ![]() | ![]() |
| 10 | ![]() | ![]() | ![]() |
| 20 | ![]() | ![]() | ![]() |
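For reference, generating a grid like this might look roughly like the sketch below. It assumes a Hugging Face `diffusers` pipeline for the first DeepFloyd IF stage; the exact model ID and file names are illustrative, not the project's actual code.

```python
import torch
from diffusers import DiffusionPipeline

# Load the first-stage DeepFloyd IF pipeline (requires accepting the model license
# and logging in to Hugging Face; the fp16 variant keeps memory manageable).
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "a man wearing a hat",
    "an oil painting of a snowy mountain village",
    "a rocket ship",
]

for steps in [2, 5, 10, 20]:
    for prompt in prompts:
        # Re-seed before every sample so that only the step count changes.
        generator = torch.Generator(device="cuda").manual_seed(180)
        image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
        image.save(f"{prompt.replace(' ', '_')}_{steps}_steps.png")
```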
The forward process in diffusion models consists of taking an image and adding noise to it. The resulting noised image at timestep $t$, denoted $x_t$, is given by the equation

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $t$ ranges from 0 to $T$. When $t = 0$ we have the clean image, and when $t = T$ the image is maximally noised; the noise $\epsilon$ is sampled from a normal distribution. Here are the results of noising a test image at timesteps t = 250, 500, and 750:
| Original | t = 250 | t = 500 | t = 750 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
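In code, the forward process is a one-liner. A minimal sketch, assuming `alphas_cumprod` is a 1-D tensor holding the cumulative products $\bar\alpha_t$ of the noise schedule (DeepFloyd's scheduler exposes one, but any DDPM schedule works the same way):

```python
import torch

def forward(im: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Return x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```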
Now let's try denoising these images. We can attempt to do so classically, by simply applying a Gaussian blur with a kernel size of (7, 7). However, we find that the noisier the image, the lower the quality of the result:
| | t = 250 | t = 500 | t = 750 |
|---|---|---|---|
| Noised image | ![]() | ![]() | ![]() |
| Gaussian blur | ![]() | ![]() | ![]() |
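A sketch of this classical baseline is below. The (7, 7) kernel matches the text above; the sigma value is an illustrative choice.

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=7, sigma=2.0):
    # Averaging neighboring pixels suppresses some noise but also destroys detail,
    # which is why the results degrade quickly as the noise level grows.
    return TF.gaussian_blur(noisy_im, kernel_size=[kernel_size, kernel_size], sigma=sigma)
```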
We use a pretrained diffusion model to attempt to denoise our noised image in one step, with the prompt "a high quality photo". That is, we try to remove all of the noise at once: we use the provided UNet to predict the noise in the image, then remove that noise from the noised image by inverting the forward-process equation, $x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon}{\sqrt{\bar\alpha_t}}$. Here are the results of one-step denoising at varying noise levels.
| t = 250 | t = 500 | t = 750 |
|---|---|---|
| ![]() | ![]() | ![]() |
Notice how higher noise levels make it harder to recover the original image. This makes sense: as an image gets noisier, it becomes less clear what details the original actually contained, so the model has to make assumptions about what those details might have been.
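For reference, a one-step denoise of this kind might look roughly like the sketch below, assuming a DeepFloyd-style UNet called through `diffusers`, with `prompt_embeds` the precomputed text embeddings and `alphas_cumprod` the schedule from before.

```python
import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    t_batch = torch.tensor([t], device=x_t.device)
    model_out = unet(x_t, t_batch, encoder_hidden_states=prompt_embeds).sample
    # DeepFloyd's UNet predicts a variance alongside the noise; keep only the noise part.
    noise_pred = model_out[:, :3] if model_out.shape[1] == 6 else model_out
    alpha_bar = alphas_cumprod[t]
    # Invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps to estimate the clean image x_0.
    return (x_t - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
```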
We can improve on the results above by denoising the noised image step by step, removing a little bit of noise each time in the direction of our prompt. We create a list of timesteps from t = 0 to t = 1000. However, denoising one step at a time, for up to 1000 steps, is computationally expensive. Instead, we can skip several timesteps in each iteration: we keep track of the timesteps we actually visit in an array we define as strided_timesteps, consisting of only every 30th timestep.
The equation to estimate our less noisy image is

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1-\bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t'})}{1-\bar\alpha_t}\, x_t + v_\sigma$$

where $t'$ is the next strided timestep (which yields a less noisy image), $t$ is the current timestep, $x_0$ is the current estimate of the original image (predicted as in the one-step denoising above), $\bar\alpha_t$ and $\bar\alpha_{t'}$ are the cumulative noise coefficients at the respective timesteps, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, and $v_\sigma$ is additional noise predicted by the model.
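A sketch of the loop implementing this update is below. It assumes `strided_timesteps` runs from high noise down to 0, `estimate_x0` is the one-step estimate from before, and `predict_v_sigma` supplies the $v_\sigma$ term; the helper names are illustrative.

```python
def iterative_denoise(x, strided_timesteps, alphas_cumprod, estimate_x0, predict_v_sigma):
    # Walk from the noisiest strided timestep down to t = 0, applying the update above.
    for t, t_prime in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        x0_hat = estimate_x0(x, t)                      # clean-image estimate at t
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha = abar_t / abar_tp                        # alpha_t for this stride
        beta = 1 - alpha                                # beta_t for this stride
        x = (abar_tp.sqrt() * beta / (1 - abar_t)) * x0_hat \
            + (alpha.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x \
            + predict_v_sigma(x, t)
    return x
```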
We show the process of a noised image being iteratively denoised, ending with the final predicted clean image:
![]() | ![]() | ![]() | ![]() | ![]() |
With the pretrained model, we now attempt to generate images of our own from scratch: we pass in pure random noise and iteratively denoise it based on our prompt, "a high quality photo". Here are some results:
![]() | ![]() | ![]() | ![]() | ![]() |
You may notice that a couple of the images above, or parts of them, are unclear about what they are depicting. We can improve the quality of these images and get more comprehensible results by employing a technique called "classifier-free guidance" (CFG).
We estimate both a conditional noise estimate $\epsilon_c$ and an unconditional noise estimate $\epsilon_u$, and combine them into the final noise estimate

$$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$$

where $\gamma$ is the strength of the classifier-free guidance. We end up with higher quality images when $\gamma > 1$. We implement CFG in our iterative denoiser with $\gamma = 7$.
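Inside the denoising loop, the CFG combination might look roughly like this sketch, assuming the DeepFloyd-style UNet call from before, with `uncond_embeds` the embedding of the empty prompt "":

```python
import torch

@torch.no_grad()
def cfg_noise(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    t_batch = torch.tensor([t], device=x_t.device)
    # Keep only the noise channels; DeepFloyd's UNet also predicts a variance.
    eps_c = unet(x_t, t_batch, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_u = unet(x_t, t_batch, encoder_hidden_states=uncond_embeds).sample[:, :3]
    # eps = eps_u + gamma * (eps_c - eps_u); gamma > 1 pushes the sample toward the prompt.
    return eps_u + gamma * (eps_c - eps_u)
```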
Here are some results:
![]() | ![]() | ![]() | ![]() | ![]() |
These images look much clearer and more comprehensible than the ones generated without CFG!
With CFG, we can noise a test image a little and then force it back onto the natural image manifold over a certain number of timesteps. This algorithm is called SDEdit. Here are some results, with the amount of noise set by different starting indices i_start into strided_timesteps:
| | i_start=1 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | Original |
|---|---|---|---|---|---|---|---|
| Campanile | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Ghirardelli Square | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Water tower | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
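For reference, the SDEdit loop above might look roughly like this, reusing the `forward` and CFG iterative-denoising sketches from earlier (`iterative_denoise_cfg` is an illustrative name):

```python
def sdedit(image, i_start, strided_timesteps, alphas_cumprod, iterative_denoise_cfg):
    # Noise the input up to strided_timesteps[i_start], then denoise back with CFG.
    t_start = strided_timesteps[i_start]
    x = forward(image, t_start, alphas_cumprod)
    # A small i_start means lots of noise (the result drifts far from the input);
    # a large i_start keeps the result close to the original image.
    return iterative_denoise_cfg(x, strided_timesteps[i_start:])
```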
We can also play around by drawing images of our own, or finding images on the internet, and passing them through the model to see what it predicts they will end up as:
| | i_start=1 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | Original |
|---|---|---|---|---|---|---|---|
| Tomato carrot | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Mysterious guy | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Galaxy | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
We can also perform this algorithm on just a portion of the image instead of the entire image; this is called inpainting. We create a binary mask over the image: the mask is 1 where we want to generate something new and 0 where we want to keep the same content. At every step of the denoising loop, the pixels where the mask is 0 are forced back to the original image, noised to the current timestep: $x_t \leftarrow m\, x_t + (1 - m)\,\text{forward}(x_{orig}, t)$.
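In code, the extra step inside the loop might look like this sketch, with `mask` equal to 1 where new content is generated and `forward` as sketched earlier:

```python
def inpaint_step(x_t, t, original, mask, alphas_cumprod):
    # Outside the mask, overwrite x_t with the original image noised to level t,
    # so only the masked region is actually regenerated by the denoiser.
    return mask * x_t + (1 - mask) * forward(original, t, alphas_cumprod)
```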
Here are some inpainting results:
![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() |
Instead of using our default prompt, "a high quality photo", we can change the prompt and generate different-looking results combined with inpainting. We show the results of CFG at different noise levels, determined by i_start, for a few chosen prompts:
| | i_start=20 | i_start=10 | i_start=7 | i_start=5 | i_start=3 | i_start=1 | Original |
|---|---|---|---|---|---|---|---|
| A rocket ship | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| A photo of a hipster barista | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| A pencil | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
We can also generate optical illusions with our diffusion model: an image that looks like one prompt when right side up, but like something else when flipped upside down! To do so, at each denoising step we predict the noise for the right-side-up prompt on the image, predict the noise for the upside-down prompt on the flipped image and flip that estimate back, then average the two and use the result in the iterative denoiser. The noise is calculated as

$$\epsilon_1 = \epsilon_\theta(x_t, t, p_1), \qquad \epsilon_2 = \text{flip}\!\big(\epsilon_\theta(\text{flip}(x_t), t, p_2)\big), \qquad \epsilon = \tfrac{1}{2}(\epsilon_1 + \epsilon_2)$$

where $p_1$ and $p_2$ are the right-side-up and upside-down prompts.
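A sketch of this noise estimate, reusing the `cfg_noise` helper from the CFG section (the flip is over the image's height dimension):

```python
import torch

def anagram_noise(unet, x_t, t, embeds_up, embeds_down, uncond_embeds, gamma=7.0):
    eps1 = cfg_noise(unet, x_t, t, embeds_up, uncond_embeds, gamma)
    flipped = torch.flip(x_t, dims=[-2])                 # view the image upside down
    eps2 = cfg_noise(unet, flipped, t, embeds_down, uncond_embeds, gamma)
    eps2 = torch.flip(eps2, dims=[-2])                   # flip the estimate back upright
    return (eps1 + eps2) / 2
```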
Here are some visual results:
Right side up | ![]() | ![]() | ![]() |
Upside down | ![]() | ![]() | ![]() |
Prompts | a man wearing a hat | an oil painting of people around a campfire, an oil painting of an old man | an oil painting of a snowy mountain village, a lithograph of waterfalls |
We can also create hybrid images, which look like one prompt up close and like another prompt from far away. This is a concept we have explored extensively in earlier projects: high frequencies tend to dominate our interpretation of an image when we view it up close, and low frequencies dominate when we view it from far away.
This is how we calculate the predicted noise at each timestep:

$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + \big(\epsilon_2 - f_{\text{lowpass}}(\epsilon_2)\big)$$

where $\epsilon_1$ and $\epsilon_2$ are the noise estimates for prompt 1 and prompt 2, and $f_{\text{lowpass}}$ is a Gaussian blur. In other words, we keep only the low frequencies of the noise predicted for prompt 1, by Gaussian-blurring the directly predicted noise to filter out its high frequencies. For prompt 2, we keep only the high frequencies of its predicted noise: we subtract the blurred (low-pass) predicted noise for prompt 2 from the directly predicted noise for prompt 2. We add the resulting noises together to get the predicted noise that we will use for iterative denoising.
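A sketch of this noise estimate, again reusing `cfg_noise`; the Gaussian-blur kernel size and sigma below are illustrative choices for the low-pass filter:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, embeds_far, embeds_near, uncond_embeds, gamma=7.0):
    eps1 = cfg_noise(unet, x_t, t, embeds_far, uncond_embeds, gamma)       # prompt seen from afar
    eps2 = cfg_noise(unet, x_t, t, embeds_near, uncond_embeds, gamma)      # prompt seen up close
    low = TF.gaussian_blur(eps1, kernel_size=[33, 33], sigma=2.0)          # low frequencies of eps1
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=[33, 33], sigma=2.0)  # high frequencies of eps2
    return low + high
```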
![]() | ![]() | ![]() |
Low freq | High freq |
a photo of a hipster barista | a lithograph of waterfalls | a photo of a dog |
Now we implement three different UNets from scratch: unconditioned, time-conditioned, and class-conditioned (the class-conditioned UNet is also time-conditioned). In the process, we also implement DDPM as a wrapper around our UNet, based on this paper, and train solely on the MNIST dataset.
The implemented UNet has the following structure:
where the individual operations are defined as:
We use this unconditioned UNet as a one-step denoiser, optimized with an L2 loss

$$L = \mathbb{E}_{z,x}\big\|D_\theta(z) - x\big\|^2$$

where $z$ is our noisy image and $x$ is the clean image. We can generate such an image $z$ in the forward pass by adding noise $\epsilon$ scaled by an amount $\sigma$:

$$z = x + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
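A sketch of how the noisy training pairs and the L2 loss above might be computed, where `denoiser` is the unconditioned UNet $D_\theta$:

```python
import torch
import torch.nn.functional as F

def add_noise(x, sigma=0.5):
    # z = x + sigma * eps with eps ~ N(0, I)
    return x + sigma * torch.randn_like(x)

def denoiser_loss(denoiser, x, sigma=0.5):
    z = add_noise(x, sigma)
    return F.mse_loss(denoiser(z), x)   # L2 loss between D(z) and the clean image x
```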
Here are the results of applying noise to a clean image with varying amounts of $\sigma$:
We trained our UNet on noisy images from the MNIST dataset with $\sigma = 0.5$, using a batch size of 256 over 5 epochs. We set the hidden dimension to D = 128 and used the Adam optimizer with a learning rate of 1e-4.
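A sketch of the training loop just described (MNIST, $\sigma = 0.5$, batch size 256, 5 epochs, Adam at 1e-4); `UnconditionalUNet` and its constructor arguments are illustrative names for the network above:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UnconditionalUNet(in_channels=1, num_hiddens=128)   # D = 128 (illustrative class name)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in loader:                          # labels are unused by the plain denoiser
        loss = denoiser_loss(model, x, sigma=0.5)  # from the sketch above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```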
Here is our training loss over 5 epochs:
Here are our results after training:
After 5 epochs, the denoised outputs become much clearer!
Let's test how well our unconditioned UNet works on images with varying noise levels, even though it was only trained for noise level $\sigma = 0.5$:
The output isn't looking that great for noise levels less than 0.5.
Let's implement iterative denoising so we can get much better outputs. The loss function is now modified to

$$L = \mathbb{E}_{\epsilon, x_0, t}\big\|\epsilon_\theta(x_t, t) - \epsilon\big\|^2$$

so the network now predicts the noise added at timestep $t$ rather than the clean image directly.
We predefine a list of schedule parameters that help us calculate the noise at each timestep: the variances $\beta_t$, $\alpha_t = 1 - \beta_t$, and the cumulative products $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
We add a new operator, FCBlock, which works as follows:
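The exact block layout was shown in a figure; a minimal sketch assuming the common Linear, GELU, Linear design looks like this:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small fully connected block used to embed a conditioning signal (e.g. t/T)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)
```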
In order to train our UNet, we take a batch of images and noise them at randomly sampled timesteps $t$, then optimize the loss given above so the network can predict the noise (and hence the denoised image at timestep $t-1$) for any given $t$. We use a batch size of 128 images at a time, train over 20 epochs with hidden dimension D = 64, and use the Adam optimizer with an initial learning rate of 1e-3. We also use an exponential learning rate decay scheduler with a gamma of $0.1^{(1.0 / \text{num\_epochs})}$.
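One training step might look roughly like this sketch, assuming the UNet takes the timestep normalized to [0, 1] and `alphas_cumprod` holds the $\bar\alpha_t$ values from the schedule above:

```python
import torch
import torch.nn.functional as F

def time_conditioned_loss(unet, x0, alphas_cumprod):
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps   # forward process at timestep t
    eps_pred = unet(x_t, t.float() / T)                # timestep passed in normalized form
    return F.mse_loss(eps_pred, eps)                   # matches the loss above
```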
Here is our training loss, along with generated results after training for 5 epochs and for 20 epochs:
These look great, but some of the generated digits are still pretty incomprehensible. Similar to part A, we can apply class conditioning, training on the digit labels, in order to improve our output.
Refer to Part A's description of classifier-free guidance. We use a guidance value of $\gamma = 5$ in our implementation of class conditioning, with the same training parameters as the time-conditioned UNet. We also insert two more FCBlocks for the one-hot class-conditioning vector c in our architecture, so that with FCBlock outputs c1, c2, t1, and t2,
unflatten = c1 * unflatten + t1
and
up1 = c2 * up1 + t2
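In the forward pass this modulation might look roughly like the sketch below, where cfc1/cfc2 and tfc1/tfc2 are the four FCBlocks (the names are illustrative) and c is the one-hot class vector, typically zeroed out for a fraction of training batches so the unconditional case is also learned:

```python
def apply_conditioning(unflatten, up1, c, t, cfc1, cfc2, tfc1, tfc2):
    # FCBlock outputs are (B, D); reshape to (B, D, 1, 1) so they broadcast over H and W.
    c1 = cfc1(c).unsqueeze(-1).unsqueeze(-1)
    c2 = cfc2(c).unsqueeze(-1).unsqueeze(-1)
    t1 = tfc1(t).unsqueeze(-1).unsqueeze(-1)
    t2 = tfc2(t).unsqueeze(-1).unsqueeze(-1)
    unflatten = c1 * unflatten + t1     # first modulation point
    up1 = c2 * up1 + t2                 # second modulation point
    return unflatten, up1
```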
Here are the results:
Our numbers now look clearer and more comprehensible!
Here are two variations on a course logo, made by feeding a drawn version of the iconic "orapple" into the image-to-image translation and then using the stage-2 UNet to upsample the image. I also made references to our facial triangulation project! I fed one of my own drawings into the image-to-image translation as well, with i_start=20, to see what it outputs. Didn't get any dogs back, though.
A GIF of denoising a class-conditioned sample every 30 timesteps. Hover over it: