A Gentle Introduction to Stable Diffusion

Chuck Chen

January 17, 2024

If you love online photo editing and enhancing tools, you might have heard of Stable Diffusion models. They are powerful models that can create stunning images from simple text prompts or enhance low-quality images. But what are Stable Diffusion models and how do they work?

Stable Diffusion models are a type of generative model: a model that can create new data resembling the data it was trained on. For example, a generative model can take a text prompt such as "a sunset over the ocean" and produce a beautiful image that matches the description.

Diffusion models work by adding noise to an image and then reversing the process. Imagine you have a clear photo of a cat. You can add some noise to it, making it blurry and distorted. You can repeat this process, adding more and more noise, until the image becomes completely unrecognizable. This is called the forward process.

Now, what if you could do the opposite? What if you could start from a noisy image and remove the noise gradually, until you recover the original image? This is called the reverse process. Diffusion models use a deep learning model to learn how to do this reverse process. They can also use the same model to generate new images from scratch, by starting from random noise and removing it until a realistic image emerges.
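To make the forward process concrete, here is a minimal sketch in Python. It assumes NumPy and Pillow are installed and that a file named cat.jpg exists; the schedule values are made up for illustration and are not taken from any real model.

```python
import numpy as np
from PIL import Image

# Load an image and scale pixel values to [0, 1].
x0 = np.asarray(Image.open("cat.jpg"), dtype=np.float32) / 255.0

# A simple linear noise schedule: more noise is mixed in at every step.
num_steps = 10
betas = np.linspace(0.02, 0.4, num_steps)

x = x0
for beta in betas:
    # Each forward step blends the current image with fresh Gaussian noise.
    noise = np.random.randn(*x.shape).astype(np.float32)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise

# After enough steps, x is close to pure noise and the cat is unrecognizable.
Image.fromarray((np.clip(x, 0.0, 1.0) * 255).astype(np.uint8)).save("noisy_cat.png")
```

The reverse process learns to undo exactly these steps, one noise level at a time.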

Diffusion models have many advantages over other generative models, such as:

  • They can produce diverse, stable, and high-quality images with few artifacts
  • They can handle complex and challenging tasks, such as super-resolution, inpainting, and style transfer
  • They do not suffer from problems such as mode collapse, instability, or distortions that affect other generative models, such as Generative Adversarial Networks (GANs)

In this blog post, we will introduce you to Stable Diffusion, a state-of-the-art diffusion model for photo editing and enhancing. Stable Diffusion is a breakthrough technique that improves the speed, stability, and quality of diffusion models. We will explain how Stable Diffusion works, how it differs from other diffusion models, and how it can help you with various photo editing and enhancing tasks. We will also show how capable Stable Diffusion models are across a range of tasks.

By the end of this blog post, you will have a better understanding of Stable Diffusion models. You will also see with your own eyes how Stable Diffusion is used on various occasions. So, let's get started!

What is Stable Diffusion?

Why is Stable Diffusion Different?

Stable Diffusion models make use of a training technique commonly known as self-supervised learning: the training targets are derived from the images themselves, so no manually labeled data is needed.

To understand what Stable Diffusion models are and how they work, let's first take a look at some of the other generative models that are commonly used for image generation. Generative models are models that can create new data that resembles the data they are trained on. For example, a generative model for image generation can create new images that look like real photos, paintings, cartoons, etc.

There are different types of generative models, such as:

  • Generative adversarial networks (GANs): GANs are composed of two models: a generator and a discriminator. The generator tries to create fake images that look real, while the discriminator tries to tell apart real images from fake ones. The generator and the discriminator compete with each other, and in the process, they both improve their skills. GANs can produce realistic and sharp images, but they are also prone to instability and mode collapse, which means they can fail to generate diverse images or even produce nonsense images.
  • Variational autoencoders (VAEs): VAEs are composed of two models: an encoder and a decoder. The encoder takes an image and compresses it into a low-dimensional vector, called a latent code. The decoder takes a latent code and reconstructs the original image from it. VAEs can learn to represent the essential features of the images in the latent space, and they can generate diverse images by sampling different latent codes. However, VAEs tend to produce blurry and low-quality images, because they use a simple distribution for the latent space, such as a Gaussian distribution.
  • Autoregressive models: Autoregressive models are models that generate images pixel by pixel, following a certain order. For example, a model can generate an image from left to right, top to bottom, or in a spiral pattern. Autoregressive models can capture the dependencies and correlations between the pixels, and they can produce high-quality and diverse images. However, autoregressive models are very slow and inefficient, because they have to generate each pixel one by one, and they have to use a large number of parameters to model the complex distribution of the pixels.

Diffusion models are a newer type of generative model based on a different idea: instead of producing an image in a single shot, they transform an image gradually, step by step, until a new image emerges. Diffusion models are inspired by a natural phenomenon called diffusion, which is the process of something spreading more widely. For example, when you put a drop of ink in a glass of water, the ink will diffuse and spread throughout the water.

Diffusion models use diffusion to generate images in two steps: a forward step and a reverse step. In the forward step, the model takes a real image and adds some noise to it, making it more blurry and distorted. The model repeats this process several times, following a predefined schedule of noise levels, until the image becomes completely random. In the reverse step, the model starts from random noise and removes the noise from it, making it clearer and more realistic. The model repeats this process in the opposite order, following the same schedule of noise levels, until a new image emerges that resembles the kind of data the model was trained on.

The main components of a diffusion model are:

  • A denoising function: This is a neural network that learns how to remove the noise from the images. The denoising function takes a noisy image and a noise level as inputs, and outputs a less noisy image. The denoising function is trained by comparing the outputs with the original images, and minimizing the difference between them.
  • A noise schedule: This is a predefined sequence of noise levels that determines how much noise to add or remove at each step of the diffusion process. The noise schedule can be linear, exponential, or any other function that starts from a low value and ends at a high value. The noise schedule affects the speed and quality of the sampling process, and it can be optimized for different tasks and domains.
  • A sampling procedure: This is a method that generates new images by using the denoising function and the noise schedule. The most common choice is ancestral sampling: starting from pure noise, the model repeatedly draws a slightly less noisy image from a Gaussian whose mean comes from the denoising function and whose variance is set by the noise schedule. Score-based variants add Markov chain Monte Carlo (MCMC) correction steps, such as Langevin dynamics, to refine each intermediate image at the cost of extra computation. A minimal sketch of ancestral sampling follows this list.
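To show how these three components fit together, here is a minimal sketch of ancestral sampling in PyTorch-style Python. The denoise_fn argument is a stand-in for a trained denoising network that predicts the noise in an image, and the formulas follow the standard DDPM parameterization; this is an illustration of the idea, not Stable Diffusion's actual sampler.

```python
import torch

def ancestral_sample(denoise_fn, shape, betas):
    """Generate one image by running the reverse diffusion process.

    denoise_fn(x, t) is assumed to predict the noise present in x at step t.
    betas is the forward noise schedule (a 1-D tensor); everything else is
    derived from it.
    """
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = denoise_fn(x, t)  # predicted noise at this step
        # Mean of the reverse-step Gaussian (standard DDPM parameterization).
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # Each step samples from a Gaussian: mean from the network,
            # variance from the noise schedule.
            x = mean + torch.sqrt(betas[t]) * torch.randn(shape)
        else:
            x = mean  # the final step is taken deterministically
    return x
```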

The following diagram illustrates the steps of a diffusion model:

(Diagram: sampling steps of Stable Diffusion)

As you can see, diffusion models are quite different from other generative models, and they have some unique advantages and challenges. In the next section, we will discuss the benefits and challenges of diffusion models for image generation.

What can Stable Diffusion do?

Diffusion models are a new and exciting type of generative model with some remarkable advantages over other generative models, such as GANs, VAEs, and autoregressive models. However, they also have some limitations and drawbacks that need to be addressed. In this section, we will explore some of the benefits and challenges of diffusion models for image generation.

Stable Diffusion models are a family of models that follow the same approach but use different architectures.

Model: Stable Diffusion
Description: A diffusion model that can generate images from text, images, or masks, using a latent diffusion technique
Features:
  • High-quality and diverse images
  • Text-to-image generation
  • Image-to-image modification
  • Mask-to-image editing
  • Outpainting

Model: Stable Diffusion XL
Description: A large-scale diffusion model that can generate images from text, with high resolution and quality
Features:
  • Higher resolution and quality than Stable Diffusion
  • Text-to-image generation

Model: Stable Diffusion XL Turbo
Description: A fast diffusion model that can generate images from text in a single step and in real time, using a novel distillation technique called Adversarial Diffusion Distillation
Features:
  • Faster and more stable than Stable Diffusion XL
  • Text-to-image generation

Model: Stable Video Diffusion
Description: A diffusion model that can generate short videos from images, with high resolution and quality
Features:
  • High-quality and diverse videos
  • Image-to-video generation

Model: Stable Diffusion Cascade
Description: A diffusion model that can generate images from various inputs, using a three-stage approach that achieves high compression and efficiency
Features:
  • Higher compression and efficiency than Stable Diffusion
  • Text-to-image generation
  • Image-to-image modification
  • Mask-to-image editing
  • Outpainting
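As a rough illustration of how one of these models can be run, here is a sketch that loads SDXL Turbo through the Hugging Face diffusers library and samples in a single step. The model identifier, device, and settings follow the library's documented usage, but treat them as assumptions rather than part of this post.

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load SDXL Turbo, which is designed for single-step sampling.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

prompt = "a sunset over the ocean, photorealistic"
# One denoising step and no classifier-free guidance, as recommended for Turbo.
image = pipe(prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
image.save("sunset.png")
```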

Stable Diffusion capabilities

Some of the benefits of diffusion models are:

  • High-quality and diverse samples: Diffusion models can produce realistic and sharp images that capture the details and textures of the original images. They can also generate diverse images that cover the whole range of the data distribution, without repeating or missing any modes. This is because diffusion models use a simple and powerful denoising function that can learn to preserve the essential features of the images, and a flexible and scalable noise schedule that can adjust the difficulty and diversity of the sampling process.
  • Stability and robustness to mode collapse: Diffusion models are more stable and robust than other generative models, especially GANs, which are notorious for suffering from instability and mode collapse. Mode collapse is a phenomenon where the generator produces only a few types of images, ignoring the rest of the data distribution. This happens when the generator and the discriminator reach an equilibrium where the generator can fool the discriminator with a limited set of images, and the discriminator cannot provide useful feedback to the generator. Diffusion models avoid mode collapse by using a single model that does not compete with another model, and by using a noise schedule that ensures that the model sees the whole data distribution at different noise levels.
  • Flexibility and scalability to different domains and tasks: Diffusion models are very flexible and scalable, and they can be applied to different domains and tasks, such as natural images, medical images, paintings, cartoons, faces, text, speech, music, etc. They can also handle different types of conditioning information, such as text, labels, attributes, sketches, etc. This is because diffusion models use a generic and modular framework that can be easily adapted and extended to different scenarios, and because diffusion models can leverage the existing advances and techniques from other generative models, such as transformers, attention, self-attention, etc.

Stable Diffusion limitations

Some of the challenges of diffusion models are:

  • High computational cost and memory usage: Diffusion models are very computationally expensive and memory intensive, and they require a lot of time and resources to train and sample. This is because diffusion models use a large number of steps to generate images, and each step requires a forward and a backward pass of the denoising function, which can be a complex and deep neural network. Moreover, diffusion models have to store and process a large amount of intermediate data, such as the noisy images, the noise levels, the gradients, etc. This makes diffusion models difficult to scale up and deploy in practice, especially for high-resolution and large-scale image generation.
  • Difficulty of incorporating conditioning information: Diffusion models have a hard time incorporating conditioning information, such as text, labels, attributes, sketches, etc., into the image generation process. This is because diffusion models use a noisy and distorted image as the input, which makes it challenging to align and match the conditioning information with the image features. Moreover, diffusion models have to deal with the trade-off between the fidelity and the diversity of the conditioned images, as adding more conditioning information can reduce the noise and the uncertainty of the sampling process, but also limit the possible outcomes and variations of the images.
  • Lack of interpretability and control over the generation process: Diffusion models are not very interpretable and controllable, and they do not provide much insight or feedback on how and why they generate the images they do. This is because diffusion models use a black-box denoising function that does not have a clear or intuitive meaning or representation of the image features, and because diffusion models use a random and irreversible sampling procedure that does not allow for any manipulation or intervention of the images. This makes diffusion models hard to understand and trust, and also hard to customize and fine-tune for specific purposes and preferences.

Stable Diffusion for various tasks

Image generation

Because Stable Diffusion is trained on a massive number of images, it learns a wealth of details, textures, compositions, and styles from them. This makes it possible to generate images with a vast variety of those elements in combination.
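A minimal text-to-image sketch using the diffusers library might look like the following; the model identifier, prompt, and sampler settings are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The text prompt describes the desired content and style.
image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("lighthouse.png")
```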

Image editing

At every iteration, the diffusion architecture takes a noisy image and removes some of the noise in latent space. If an arbitrary image is encoded and fed into the latent space as the starting point, this makes a really helpful image editor with generative capabilities.
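A sketch of this image-to-image workflow with the diffusers library might look like the following; the input file, prompt, and strength value are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png")  # any starting image

# strength controls how much of the original image is kept:
# low values stay close to the input, high values allow bigger changes.
image = pipe(
    prompt="a cozy cabin in a snowy forest, digital art",
    image=init_image,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
image.save("cabin.png")
```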

Style transfer

Style transfer renders an image in a target style while keeping the original composition.

Image upscaling

Upscaling images is a tricky problem. It introduces ambiguity, because each pixel in the original image is replaced by several pixels: how should those extra pixels be colored? A generative model can fill in plausible detail instead of merely interpolating.
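One way to let a diffusion model answer that question is a text-guided upscaling pipeline. The sketch below uses the diffusers library's 4x upscaler; the model identifier and inputs are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = load_image("small_photo.png")

# The text prompt guides how the missing detail is filled in.
upscaled = pipe(prompt="a sharp photo of a red fox", image=low_res).images[0]
upscaled.save("big_photo.png")
```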

Video generation

The industry has been very successful with video compression and processing, precisely because video is not just a collection of images laid out along a time axis: neighboring frames are highly correlated. A video diffusion model has to preserve this temporal consistency while generating every frame.
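A sketch of image-to-video generation with Stable Video Diffusion through the diffusers library might look like the following; the model identifier and settings are illustrative assumptions.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

image = load_image("starting_frame.png")

# Generate a short clip of frames conditioned on the input image.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```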

Conclusion

In this blog post, we have learned about diffusion models, a new class of generative models that can create high-quality and diverse images from text or other sources of information. We have also learned about Stable Diffusion, a state-of-the-art text-to-image generation model that uses diffusion to synthesize images from text prompts. We have seen some examples of images generated by Stable Diffusion from different text prompts and compared them with other text-to-image models.

We hope you have enjoyed this blog post and gained some insights and knowledge about diffusion models and Stable Diffusion. Thank you for reading this blog post, and happy image generation! 😊

We at PixelsAI provide easy-to-use tools with the power of Stable Diffusion to create crystal clear images.

Get free trial