ControlNet Tile: The Secret to Transforming Low-Quality Images into Masterpieces

Chuck Chen

January 18, 2024

This is the first post in our series on image generation models. It can also be found here on LinkedIn or here on Medium.

Have you ever wondered about the ControlNet Tile model? If you're not quite sure what it is, you're not alone. The ControlNet Tile models are probably the most misunderstood models in the realm of image generation.

Unlike other ControlNet models that extract specific features to guide the diffusion generation process, the ControlNet Tile model doesn't require any preprocessing. It operates differently from models like ControlNet Depth, Scribble, or others.

A quick recap: Stable Diffusion and ControlNet

Stable Diffusion, a suite of models introduced by Stability AI, represents a significant advancement over previous state-of-the-art image generation models like GANs. Its architecture allows for efficient training on extensive datasets, giving it the remarkable ability to render intricate details, including a variety of textures. In case you need a more general introduction, we have covered it in our previous Stable Diffusion series.

However, Stable Diffusion has some quirks when it comes to composition. It's infamous for producing unwanted artifacts, such as superfluous digits, incorrect placement of objects or figures, and distorted body shapes. In essence, while Stable Diffusion excels at filling pixels with detail, it has yet to achieve the nuanced artistry of a master painter.

This is where ControlNet comes into play as a valuable addition to Stable Diffusion. Acting as a supportive counterpart, ControlNet leaves the texture details to Stable Diffusion and instead trains on supplemental pairs of images that lack such detail. These image pairs are carefully crafted so that the two images in each pair share common elements such as lines, lighting, or composition. By training on these datasets, ControlNet models can concentrate on the area where Stable Diffusion may falter: the composition of the generated images. We've seen ControlNet models skillfully infuse line art with vibrant colors to achieve various artistic styles. But what role do these tile resampling, or ControlNet Tile, models play?

The capabilities of ControlNet Tile

According to its creators, ControlNet Tile excels in two key areas:

  • It adeptly replaces missing details while preserving the overall structure of an image.
  • It can disregard a global prompt if it conflicts with local semantics, instead guiding the diffusion process with the local context.

The impact of these capabilities on image quality might not be immediately apparent. However, consider this: ControlNet Tile models empower you to set the stage—whether it's sketching the outlines on a canvas or framing the perfect shot. They then step in to meticulously refine the fine details and textures. This isn't just another tool; it's the elusive magic brush that artists and photographers have sought for years, offering a level of detail and finesse that traditional software like Photoshop has yet to provide.

ControlNet Tile in action

Clearing up noisy images

Smartphone cameras have become a staple of our daily lives, and their convenience often overshadows the compromises they bring. One such compromise is the lack of detail in natural photography, a consequence of the small sensors that fit within the confines of our compact devices.

Consider a typical photograph of Mount Everest, commonly found on Instagram or other sharing platforms. The composition is impeccable, and the lighting is spot-on. However, the image is marred by blurry details, a result of image compression and the limitations of the small sensor.

mount everest in low resolution

Now, take a look at the enhanced version processed by Stable Diffusion with the assistance of the ControlNet Tile model. While the original layout and composition are preserved, the clouds and the textures on Mount Everest regain their clarity. The comparison between the 'before' and 'after' is remarkable.

mount everest enhanced with Stable Diffusion and ControlNet Tile
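If you want to try this yourself, below is a minimal sketch of this kind of cleanup using the Hugging Face diffusers library with the publicly released SD 1.5 Tile checkpoint. The file names, prompt, and parameter values are illustrative assumptions, not the exact settings used for the image above.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1e_sd15_tile", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

source = load_image("everest_low_quality.jpg")  # hypothetical input file

# The same photo serves as both the img2img input and the ControlNet
# condition; Tile regenerates local detail while preserving the layout.
result = pipe(
    prompt="a photo of Mount Everest, sharp, highly detailed",
    image=source,
    control_image=source,
    strength=0.6,            # how much the model is allowed to repaint
    num_inference_steps=30,
).images[0]
result.save("everest_enhanced.png")
```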

Upscaling low-resolution images

ControlNet Tile models are renowned for their ability to transform low-resolution images into high-definition visuals. They do not, however, change the pixel dimensions themselves: to actually enlarge images, they rely on AI upscalers such as ESRGAN. What ControlNet Tile models excel at is refining imperfections, enhancing textures, and adding clarity, even without increasing the image size.

The transformative power of this technology is evident in a typical upscaling process that integrates both ESRGAN and ControlNet Tile models. Observe the transformation of a minuscule canine image—merely a segment of a larger picture—magnified 16-fold. The result? A remarkably detailed portrayal, with the dog's fur and the surrounding environment rendered in stunning clarity.

fluffy dog in 64 pixels by 64 pixels
fluffy dog upscaled to 1024 pixels by 1024 pixels with ESRGAN and ControlNet Tile
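For the curious, here is a rough sketch of that two-stage workflow, reusing the pipe constructed in the earlier snippet. For brevity, Pillow's bicubic resize stands in for the ESRGAN pass; in a real workflow you would substitute an ESRGAN-family upscaler such as Real-ESRGAN at that step. File names and parameters are again illustrative.

```python
from PIL import Image

dog = Image.open("dog_64x64.png").convert("RGB")  # hypothetical 64x64 crop

# Stage 1: enlarge 16x per side (64 -> 1024). In practice an ESRGAN-family
# upscaler goes here; bicubic resize keeps the sketch self-contained.
enlarged = dog.resize((1024, 1024), Image.Resampling.BICUBIC)

# Stage 2: let ControlNet Tile regenerate the missing fur and texture,
# reusing the `pipe` from the previous snippet.
refined = pipe(
    prompt="a fluffy dog, detailed fur, sharp focus",
    image=enlarged,
    control_image=enlarged,
    strength=0.75,
    num_inference_steps=30,
).images[0]
refined.save("dog_1024x1024_refined.png")
```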

More details for the curious

ControlNet Tile training

The official code repository offers intriguing insights into the training and functionality of ControlNet models. ControlNet models undergo training independently of Stable Diffusion's weights (see training instructions here). This diverges from the typical fine-tuning approaches for Stable Diffusion models: ControlNet does not alter the weights of Stable Diffusion. Instead, it trains an ancillary neural network that interfaces with Stable Diffusion. This network is trained on a novel dataset of image pairs: each pair consists of an original image and its preprocessed version, or, in ControlNet terminology, an image processed by its 'annotator.' Typically, the annotator extracts lines or outlines from the image, so that the trained ControlNet model learns to predict the detailed image from mere lines or outlines.
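As a concrete illustration, a dataset class for such image pairs might look like the sketch below, which follows the layout of the sample dataset shipped with the official repository (a prompt.json file with one JSON record per line, pointing to a 'source' conditioning image and a 'target' original). Details are simplified from the repo's tutorial code.

```python
import json
import cv2
from torch.utils.data import Dataset

class PairedControlDataset(Dataset):
    """Pairs of (annotator output, original image) plus a text prompt,
    following the sample-dataset layout in the official ControlNet repo."""

    def __init__(self, root: str):
        self.root = root
        with open(f"{root}/prompt.json") as f:
            self.items = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        # "source" is the preprocessed (annotated) image; "target" is the
        # detailed original the ControlNet learns to reconstruct.
        source = cv2.cvtColor(
            cv2.imread(f"{self.root}/{item['source']}"), cv2.COLOR_BGR2RGB)
        target = cv2.cvtColor(
            cv2.imread(f"{self.root}/{item['target']}"), cv2.COLOR_BGR2RGB)
        return dict(
            jpg=target / 127.5 - 1.0,  # original, normalized to [-1, 1]
            txt=item["prompt"],
            hint=source / 255.0,       # condition, normalized to [0, 1]
        )
```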

While the training and inference code for ControlNet is openly shared, the datasets for various models are not publicly available. Nevertheless, the repository provides a sample dataset to help developers navigate the training process.

ControlNet Tile inference

The ControlNet Tile model, as outlined in its dedicated repository, functions like a super-resolution tool but is not confined to image upscaling. Its training data pairs high-resolution images with pixelated, 'Minecraft'-esque renditions of them. Utilizing diffusion models, it enhances local details that are blurry or missing, all while maintaining the original composition.

Conceptually, it is similar to a super-resolution model, but its usage is not limited to that: it can also generate details at the same size as the input (condition) image.
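The exact degradation used to build the Tile training pairs is not public (the datasets are not released, as noted above), but one plausible way to produce such a blocky condition image, purely for intuition, is to downsample and then re-enlarge without smoothing:

```python
from PIL import Image

def degrade(img: Image.Image, factor: int = 8) -> Image.Image:
    """Produce a blocky, 'Minecraft'-esque stand-in for the condition image.
    The real training degradation is not published; this is for intuition."""
    w, h = img.size
    small = img.resize((w // factor, h // factor), Image.Resampling.BICUBIC)
    return small.resize((w, h), Image.Resampling.NEAREST)
```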

Consider the process of upscaling a 64x64 resolution image by a factor of 16 per side. Each original pixel becomes a 16x16 block of 256 pixels, which means populating 255 new pixels for every single pixel in the original image. The question arises: how do we determine the content of these 255 pixels based on just one?

A naive method simply replicates the original pixel across the 255 new positions, resulting in an extremely pixelated, enlarged image without any added detail. This is what ImageMagick's -scale operator does when magnifying. The result looks like the images below.

fluffy dog in 64 pixels by 64 pixels
fluffy dog resized to 1024 pixels by 1024 pixels with pixel duplication
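The same replication is easy to reproduce in a few lines with Pillow, using nearest-neighbor resampling (file names are hypothetical):

```python
from PIL import Image

img = Image.open("dog_64x64.png")  # hypothetical 64x64 input
# Nearest-neighbor enlargement copies each source pixel into a 16x16 block,
# mirroring what ImageMagick's -scale operator does when magnifying.
big = img.resize((1024, 1024), Image.Resampling.NEAREST)
big.save("dog_1024x1024_replicated.png")
```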

Conventional image editing tools adopt a different strategy: they interpolate the new pixels based on the original pixel and its neighbors, producing a smoother result but still leaving the images notably blurry.

fluffy dog in 64 pixels by 64 pixels
fluffy dog resized to 1024 pixels by 1024 pixels with interpolation
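Continuing the Pillow sketch above, interpolated enlargement is a one-line change: swap nearest-neighbor resampling for a smoothing filter such as Lanczos.

```python
# New pixels are computed from neighboring source pixels rather than
# copied, which smooths the blocks but cannot invent real detail at 16x.
smooth = img.resize((1024, 1024), Image.Resampling.LANCZOS)
smooth.save("dog_1024x1024_interpolated.png")
```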

AI upscalers diverge from traditional methods, eschewing deterministic pixel functions in favor of insights gleaned from expansive datasets. This process is akin to viewing the world through human eyes, discerning what elements need to be brought into sharper focus. Still, previous generations of AI upscalers like ESRGAN produce very little detail when the input is a tiny 64x64 image, since they are trained on images of at least a few hundred pixels on each dimension.

fluffy dog in 64 pixels by 64 pixels
fluffy dog upscaled to 1024 pixels by 1024 pixels with ESRGAN

Enter ControlNet Tile, which elevates the art of detail enhancement to new heights. While conventional AI upscalers excel with larger images, they often falter with minuscule ones, such as a 64x64 pixel image. Trained on datasets featuring pairs of high-resolution and degraded images, ControlNet Tile models are adept at reconstructing the former from the latter. They harness the power of Stable Diffusion to weave in intricate textures with precision. For those acquainted with Stable Diffusion's img2img, this process might echo a denoising operation. Yet it culminates in a visual symphony where a solitary pixel blossoms into a vibrant tapestry of 256, unveiling a world of detail that was once imperceptible: truly a leap into the future of image transformation!

fluffy dog resized to 1024 pixels by 1024 pixels with pixel duplication
fluffy dog upscaled to 1024 pixels by 1024 pixels with ControlNet Tile

Closing thoughts

Grasping the mechanics behind ControlNet Tile models necessitates a deep dive into the datasets they are trained on. Unlike other ControlNet variants, such as the line art models whose preprocessing makes their conditioning obvious, the data that guides the denoising process in ControlNet Tile is less apparent, leaving a rich landscape for exploration and understanding.

Simultaneously, the practicality of these models becomes evident in their ability to restore low-quality images. ControlNet Tile will undoubtedly prove to be an invaluable asset in the arsenal of any creative professional.

You can simply use PixelsAI to upscale and clean up images. It offers a hassle-free experience, sparing you the effort of setting up your own ControlNet workflow.

Get free trial