ControlNet Tile: The Secret to Transforming Low-Quality Images into Masterpieces

Chuck Chen

This is the first post in our series on image generation models. This post can also be found here on LinkedIn or here on Medium.

Have you ever wondered about the ControlNet Tile model? If you're not quite sure what it is, you're not alone. ControlNet Tile is probably the most misunderstood model in the realm of image generation.

Unlike other ControlNet models that extract specific features to guide the diffusion generation process, the ControlNet Tile model doesn't require any preprocessing. It operates differently from models like ControlNet Depth, Scribble, or others. The name can be confusing at first: what exactly is a tile model?

A quick recap: Stable Diffusion and ControlNet

Stable Diffusion, a suite of models introduced by Stability AI, represents a significant advancement over previous state-of-the-art image generation models like GANs. Its architecture allows for efficient training on extensive datasets, giving it the remarkable ability to render intricate details, including a variety of textures. If you need a more general introduction, we have covered it in our previous Stable Diffusion series.

However, Stable Diffusion has some quirks when it comes to composition. It's infamous for producing unwanted artifacts, such as superfluous digits, incorrect placement of objects or figures, and distorted body shapes. In essence, while Stable Diffusion excels at filling pixels with detail, it has yet to achieve the nuanced artistry of a master painter.

This is where ControlNet comes into play as a valuable addition to Stable Diffusion. Acting as a supportive counterpart, ControlNet lets Stable Diffusion handle texture details while ControlNet itself trains on pairs of images—one detailed, one simplified. These image pairs are carefully crafted, ensuring that each pair shares common elements such as lines, lighting, or composition. By training on these datasets, ControlNet models can focus on areas where Stable Diffusion struggles—namely, the composition of the generated images. We've seen ControlNet models skillfully infuse line art with vibrant colors to achieve various artistic styles. But what role do these tile resampling or ControlNet Tile models play?

ControlNet Tile

The confusing naming of ControlNet Tile

The ControlNet Tile model may have tile in its name, but it's actually a ControlNet model trained to fill in missing details. Traditional methods for creating high-detail images involve increasing the image resolution, and one approach, called tiled diffusion, generates a high-resolution image tile by tile, with the tiles usually overlapping each other. When it works properly, you get output images at extremely high resolutions, up to 4K or even 8K, even on mid-range or low-end hardware.

But Stable Diffusion can run into problems when run tile by tile. Each tile sees the global prompt but only its own local content, and the two can conflict: in a 4x6 tiled diffusion, a prompt like "a gorgeous woman" can get you 24 women in the resulting image, one per tile.
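
To make the tiling concrete, here is a minimal sketch of how an image might be cut into overlapping tiles before each tile is diffused separately. The tile size, overlap, and image dimensions are illustrative assumptions, not values from any particular implementation.

```python
# Illustrative only: lay out overlapping tiles for tiled diffusion.
def tile_boxes(width, height, tile=512, overlap=64):
    """Yield (left, top, right, bottom) boxes covering the image with overlap."""
    step = tile - overlap
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            yield (left, top, min(left + tile, width), min(top + tile, height))

# A 4x6 grid over a large canvas: each of the 24 tiles would be diffused
# separately and then blended back into the full image.
boxes = list(tile_boxes(2688, 1792, tile=512, overlap=64))
print(len(boxes), "tiles")  # 24
```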

The capabilities of ControlNet Tile

The above problem is why the ControlNet Tile model was created. According to its creators, it excels in two key areas:

  • It skillfully replaces missing details while preserving the overall structure of an image.
  • It can disregard a global prompt if it conflicts with local semantics, instead guiding the diffusion process with the local context.

The impact of these capabilities on image quality might not be immediately apparent. However, consider this: ControlNet Tile models let you set the stage—whether it's sketching the outlines on a canvas or framing the perfect shot. They then step in to carefully refine the fine details and textures. This is more than just another tool; it's a powerful enhancement that offers a level of detail and finesse that traditional software like Photoshop struggles to match.

ControlNet Tile in action

Clearing up noisy images

The convenience of smartphone cameras has become a staple in our daily lives, but this convenience comes with trade-offs. One such trade-off is the lack of detail in natural photography, a result of the small sensors that fit within our compact devices.

Consider a typical photograph of Mount Everest, commonly found on Instagram or other sharing platforms. The composition is impeccable, and the lighting is spot-on. However, the image is marred by blurry details, a result of image compression and the limitations of the small sensor.

mount everest in low resolution

Now, take a look at the enhanced version processed by Stable Diffusion with the assistance of the ControlNet Tile model. While the original layout and composition are preserved, the clouds and the textures on Mount Everest regain their clarity. The comparison between the 'before' and 'after' is remarkable.

mount everest enhanced by Stable Diffusion with ControlNet Tile

Upscaling low-resolution images

ControlNet Tile models are known for their ability to transform low-resolution images into high-definition visuals. However, the enlargement itself is not really their job: to actually increase the pixel count, they work together with AI upscalers such as ESRGAN. What ControlNet Tile models excel at is refining imperfections, enhancing textures, and adding clarity, even without increasing the image size.

The transformative power of this technology is evident in a typical upscaling process that integrates both ESRGAN and ControlNet Tile models. Observe the transformation of a minuscule canine image—merely a segment of a larger picture—magnified 16-fold. The result? A remarkably detailed portrayal, with the dog's fur and the surrounding environment rendered in stunning clarity.

fluffy dog in 64 pixels by 64 pixels
fluffy dog upscaled to 1024 pixels by 1024 pixels with ESRGAN and ControlNet Tile

More details for the curious

ControlNet Tile training

The official code repository offers insights into the training and functionality of ControlNet models. ControlNet models train independently of Stable Diffusion's weights (see training instructions here). This differs from typical fine-tuning approaches for Stable Diffusion models, as ControlNet doesn't alter Stable Diffusion's weights. Instead, it trains a separate neural network that works with Stable Diffusion. This network trains on a dataset of image pairs: each pair has an original image and its preprocessed version, or as ControlNet calls it, an image processed by its 'annotator.' Usually, the annotator extracts lines or outlines from the image, so the trained ControlNet model can predict the detailed image from just lines or outlines.

While inference code for ControlNet is openly shared, the datasets and training code for various models are not publicly available. Nevertheless, the repository provides a sample dataset and training script to help developers navigate the training process.

To train a ControlNet Tile model, training data consisting of image pairs and captions needs to be prepared in advance. Each image pair includes the original image, usually a 512x512 or larger image with few details, and a target image, which contains much more detail. The original image also serves as the control image, much like the preprocessed images for other ControlNet models, such as depth maps or edge lines. The target image looks similar to the control image, but with much more detail.

By training on these image pairs together with their captions, the model gradually learns to honor local semantics even when the global prompt suggests otherwise.
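
For illustration, here is a minimal sketch of how such control/target/caption training examples could be assembled with Pillow. The degradation recipe (shrink, enlarge back, light blur), the folder layout, and the caption handling are assumptions made for this example; the official dataset recipe has not been published.

```python
# Illustrative sketch: build low-detail control images from detailed targets.
from pathlib import Path
from PIL import Image, ImageFilter

def make_pair(path, size=512, factor=8):
    # Detailed target image, resized to the training resolution.
    target = Image.open(path).convert("RGB").resize((size, size), Image.LANCZOS)
    # Control image: shrink, enlarge back, and blur slightly so fine detail
    # is lost while the overall composition is preserved.
    control = (
        target.resize((size // factor, size // factor), Image.BILINEAR)
              .resize((size, size), Image.BILINEAR)
              .filter(ImageFilter.GaussianBlur(radius=1))
    )
    return control, target

records = []
for path in Path("photos").glob("*.jpg"):  # hypothetical image folder
    control, target = make_pair(path)
    caption = "a photo"  # in practice, a real caption per image
    records.append({"control": control, "target": target, "caption": caption})
```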

ControlNet Tile inference

The ControlNet Tile model, as outlined in its dedicated repository, functions like a super-resolution tool but is not confined to image upscaling. Its training data pairs each high-resolution image with a degraded rendition of it, pixelated almost like a 'Minecraft' scene. Utilizing diffusion models, it enhances local details that are blurry or missing, all while maintaining the original composition.

Conceptually, it is similar to a super-resolution model, but its usage is not limited to that. It is also possible to generate details at the same size as the input (conditioned) image.

Consider the process of upscaling a 64x64 resolution image by a factor of 16 in each dimension. Every original pixel maps to a 16x16 block of 256 pixels, so 255 new pixels must be populated for every single pixel in the original image. The question arises: how do we determine the content of these 255 pixels based on just one?

A naive method would simply replicate the original pixel across the 255 new pixels, resulting in an extremely pixelated, enlarged image without any added detail. This is what ImageMagick's scale operator does when magnifying an image. The result looks like the image below.

fluffy dog in 64 pixels by 64 pixels
fluffy dog resized to 1024 pixels by 1024 pixels with pixel duplication
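
A minimal Pillow sketch of this pixel duplication; the filenames are placeholders.

```python
# Nearest-neighbor resize: each source pixel becomes a 16x16 block of copies.
from PIL import Image

small = Image.open("dog_64.png")                       # 64x64 input (placeholder name)
pixelated = small.resize((1024, 1024), Image.NEAREST)  # pure pixel duplication
pixelated.save("dog_1024_nearest.png")
```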

Conventional image editing tools adopt a different strategy: they interpolate the new pixels based on the original pixel and its neighbors, producing a smoother result but still leaving the images notably blurry.

fluffy dog in 64 pixels by 64 pixels
fluffy dog resized to 1024 pixels by 1024 pixels with interpolation
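
The same resize with interpolation, again as a small Pillow sketch with placeholder filenames.

```python
# Bicubic resize: new pixels are blended from neighboring source pixels,
# smoother than duplication but still blurry at 16x.
from PIL import Image

small = Image.open("dog_64.png")
smooth = small.resize((1024, 1024), Image.BICUBIC)  # or Image.LANCZOS
smooth.save("dog_1024_bicubic.png")
```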

AI upscalers work differently from traditional methods, using insights from large datasets instead of simple pixel calculations. This process is like viewing the world through human eyes, figuring out what elements need to be brought into sharper focus. However, previous generation AI upscalers like ESRGAN struggle with extremely small images like 64x64 pixels, since they're trained on images that are at least a few hundred pixels on each side.

fluffy dog in 64 pixels by 64 pixels
fluffy dog upscaled to 1024 pixels by 1024 pixels with an AI upscaler

Enter ControlNet Tile, which takes detail enhancement to new heights precisely where conventional AI upscalers fall short, on tiny inputs such as a 64x64 pixel image. Trained on datasets that pair high-resolution images with degraded versions, ControlNet Tile models excel at reconstructing the high-quality version from the low-quality one, using Stable Diffusion's power to add intricate textures with precision. For those familiar with Stable Diffusion's img2img, this process is similar to a denoising operation. The result is remarkable: every single pixel becomes a 16x16 block of 256 pixels filled with detail that was once invisible, a true leap forward in image transformation!

fluffy dog resized to 1024 pixels by 1024 pixels with pixel duplication
fluffy dog upscaled to 1024 pixels by 1024 pixels with ControlNet Tile
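
For readers who want to try this themselves, here is a minimal sketch of such a refinement pass using Hugging Face diffusers. The model IDs, prompt, and parameters such as strength are illustrative choices, and the plain Lanczos resize stands in for a dedicated AI upscaler like ESRGAN.

```python
# Illustrative sketch: upscale a tiny image, then let ControlNet Tile
# re-synthesize the missing texture during denoising.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1e_sd15_tile", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Enlarge first (stand-in for an AI upscaler), then refine with ControlNet Tile.
source = Image.open("dog_64.png").resize((1024, 1024), Image.LANCZOS)

result = pipe(
    prompt="a fluffy dog, sharp focus, high detail",
    image=source,            # img2img starting point
    control_image=source,    # tile condition: preserve the original composition
    strength=0.75,           # how much of the image gets re-generated
    num_inference_steps=30,
).images[0]
result.save("dog_1024_tile.png")
```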

Closing thoughts

Grasping the mechanics behind ControlNet Tile models requires a close look at the datasets they are trained on. Unlike other ControlNet variants, such as the line art models whose preprocessed inputs make the guidance obvious, the data guiding the denoising process in Stable Diffusion is less apparent here, which leaves a rich landscape for exploration and understanding.

Simultaneously, the practicality of these models becomes evident in their ability to restore low-quality images. ControlNet Tile will undoubtedly prove to be an invaluable asset in the arsenal of any creative professional.

You can also just use PixelsAI to upscale and clean up images. It provides a hassle-free experience without the effort of setting up your own ControlNet workflow.