If you’ve recently typed a few simple words into an AI image generator, you’ve probably been amazed at the wide variety of captivating images that can be created in seconds. The possibilities are pretty much endless, and range from photorealistic to utter fantasy.
And like those old potato chip ads used to say, “Betcha can’t [make] just one!” We hear you there…
How do AI image generators do what they do?
About 90% of the people we talked to who make AI-generated images don’t know how the system works. Maybe they expect the answer to be geek-speak that goes over their heads, or maybe they simply don’t care to know how the sausage is made.
In order to understand the hugely hyped and justifiably popular image generators well enough to explain them to our readers, we needed to find out. But instead of a little swim in a lap pool, we needed to do a deep dive to get a handle on it.
💡 The one weird way many popular AI image generators work
Before researching the science behind it, we expected it would be a fairly straightforward process, sort of like a digital copy machine dialed up to 100.
Instead, we learned we had the process backwards, and that the reality of the systems in play was far more complicated and unusual than we could have ever imagined.
People — meaning everyone from world-renowned artists to toddlers — typically approach creating pictures the same way: they take a blank canvas and add to it. Here’s Bob Ross to demonstrate…
Additive art is a broad category spanning a wide range of artistic practices, including traditional methods like drawing, painting, and printmaking, as well as modern techniques like digital art and mixed media. The term highlights the fundamental process of adding materials to create an image, as opposed to carving or chiseling away, as seen in subtractive art forms.
One expert on subtractive art was the Renaissance artist Michelangelo Buonarroti. The legendary creation of his famous statue of David is often distilled into an interesting anecdote about artistic vision and skill.
The famous sculptor is said to have approached the creation of David with a unique perspective: When he looked at the marble block from which his hero would emerge, he didn’t see just a stone — he envisioned the man trapped within it, waiting to be released.
According to the story, Michelangelo’s process was not so much about sculpting in the traditional sense, but more about liberating David from the excess marble. He is often quoted as saying that he simply removed the parts of the stone that were not David.
This approach was revolutionary for its time, and underscored his phenomenal skill and a remarkable understanding of form and anatomy.
Fast forward 500 years, and this approach — where the art is found hidden inside a mass of material — is surprisingly similar to the way many AI image generators create art.
💡 Taking pixels away until only an image remains
In one of the most popular modern approaches to image creation — the diffusion model — the AI image generator starts with a canvas of digital noise. That noise is essentially static in the form of random pixels with no discernible pattern or meaning — sort of like the rough, uncarved block of marble that Michelangelo began with.
As AI processes the input (meaning the prompt from the user), it gradually refines this noise, removing or altering pixels that don’t contribute to the desired image.
Both processes involve the removal of unnecessary elements to uncover a preconceived form, be it a statue or a digital image. For Michelangelo, it was the non-essential marble — but for AI, it’s the pixels that don’t fit the image being generated. In each case, the creator — whether artist or AI algorithm — has a vision of the end result, and systematically works to reveal it from within the raw material.
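To make the analogy concrete, here’s a toy Python sketch — purely illustrative, with a made-up grid and shape — of “removing everything that is not the image.” Real diffusion models transform noise statistically rather than masking pixels, but the spirit is the same:

```python
import random

random.seed(0)

# A hypothetical 5x5 "marble block": every cell starts as random static.
block = [[random.random() for _ in range(5)] for _ in range(5)]

# The creator's "vision": a plus-sign shape hidden in the block.
vision = {(1, 2), (2, 1), (2, 2), (2, 3), (3, 2)}

# Remove everything that is not part of the vision.
for row in range(5):
    for col in range(5):
        if (row, col) not in vision:
            block[row][col] = 0.0

# Only the plus sign remains.
revealed = sum(1 for r in block for px in r if px > 0)
print(revealed)
```

Michelangelo would recognize the loop in the middle: it does nothing but take material away.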
Of course, while Michelangelo’s process was guided by his extraordinary talent, AI image generation is driven by algorithms trained on mind-bogglingly huge datasets, which they use to identify and create patterns that align with the input prompt.
So while the work of the true Renaissance Man is a shining example of human creativity and craftsmanship, AI image generation is a marvel of technological innovation, harnessing the power of data and machine learning algorithms to create art.
It may seem strange, but it works — in fact, all of the images below were created with this remarkable techy technique.
💡 AI image refinement, step-by-step
Here is a set of 4 images being created, shown across 8 steps that we captured while generating graphics on Midjourney.
Steps one through four:
Steps five through eight:
Let’s get technical
In many popular AI models like Stable Diffusion, Midjourney, and DALL-E, adding noise to an image is the essential first step of the process. And as counterintuitive as it might seem to our human minds, adding that noise and then teaching the model to remove it is how the AI is trained to understand and generate images.
This method uses the system’s programmed ability to find order in chaos, turning it into a powerful tool for creative and varied image generation. AI graphic production systems use this approach to produce a wide array of images — from the realistic to the fantastical — based on text input or other guidelines.
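Here’s a rough Python sketch of the training-side idea: corrupting a made-up “image” with progressively more noise, so that a model could learn to reverse each step. The blend formula and noise levels are illustrative stand-ins, not any particular system’s actual schedule:

```python
import math
import random

random.seed(1)

def add_noise(image, noise_level):
    """Blend an image with Gaussian noise.

    noise_level runs from 0.0 (untouched) to 1.0 (pure static).
    """
    return [
        math.sqrt(1 - noise_level) * px + math.sqrt(noise_level) * random.gauss(0, 1)
        for px in image
    ]

clean = [0.1, 0.5, 0.9, 0.5, 0.1]  # a made-up one-dimensional "image"

# Training pairs: each corruption level with its noisy result, so a model
# could learn to undo every step of the corruption.
for level in (0.1, 0.5, 0.9):
    noisy = add_noise(clean, level)
    print(level, [round(px, 2) for px in noisy])
```

At low levels the original shines through; at high levels the picture is nearly gone. Seeing both extremes (and everything in between) is what teaches the model to find order in chaos.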
🤓 The art of noise in denoising diffusion probabilistic models
This approach is used for generating images and works in a somewhat counterintuitive way. Here’s a (somewhat) simplified explanation:
Starting with noise: The process begins with a pattern of random noise. This noise does not contain any meaningful information or structure — it’s essentially visual chaos.
Sequential denoising: The model then gradually removes this noise over a series of steps. At each step, it makes predictions about what parts of the “clean” image (without noise) might look like. These predictions are based on the model’s training, where it has learned about shapes, textures, colors, and how objects in images typically appear.
Learning to create images: By repeatedly trying to remove noise and make sense of the underlying image, the AI model effectively learns how to create images. It understands how to form coherent structures and recognizable subjects from what initially looks like random visual static.
Control and creativity: Adding noise and then learning to remove it allows the model to generate a wide variety of images. It can create different versions of an image based on different “noisy” starting points, leading to a huge range of creative outputs.
How creative? It can blend reality and imagination in all-new ways… like by making a picture of a 20-layer cake.
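The four steps above can be sketched in a few lines of Python. Everything here is a stand-in: the `predict_clean` stub plays the role of the trained neural network, which in a real system infers the clean image from the shapes and textures it learned:

```python
import random

random.seed(2)

# The "clean" image the network would have learned to predict.
TRUE_IMAGE = [0.0, 0.5, 1.0, 0.5, 0.0]

def predict_clean(noisy, step):
    # Stand-in for a trained neural network: a real model infers this
    # from the shapes, textures, and colors it learned during training.
    return TRUE_IMAGE

# Step 1: start with pure visual static.
x = [random.gauss(0, 1) for _ in TRUE_IMAGE]

# Steps 2-3: sequentially denoise, nudging each pixel toward the prediction.
for t in range(50):
    guess = predict_clean(x, t)
    x = [0.9 * xi + 0.1 * gi for xi, gi in zip(x, guess)]

# After enough steps, the static has resolved into the image.
print([round(v, 2) for v in x])
```

Swap in a different starting static (step 4) and the intermediate frames change, which is where the variety comes from.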
🤓 What exactly does “denoising diffusion probabilistic models” even mean?!
Although it may sound like a whole bunch of techno-jargon that doesn’t actually say anything, “denoising diffusion probabilistic models” really does mean something.
It refers to a class of AI models that create images by starting with a noisy canvas and progressively refining it through a process that removes noise and brings the image closer to the intended target, all while considering multiple possible outcomes and their probabilities.
Here’s a breakdown of the words, courtesy of various AI systems:
1. Denoising: This refers to the core process of removing noise from a signal. In image generation, noise is often added intentionally as a starting point, and the model gradually removes it to reveal the desired image. It’s like sculpting a statue by removing excess marble.
2. Diffusion: This describes the underlying mathematical process used to create the images. The model starts with a canvas of pure noise and runs the diffusion process in reverse, gradually removing randomness and replacing pixels to match the user’s textual prompt.
It involves iteratively predicting and replacing pixels to better match the intended image, guided by a text prompt and statistical patterns learned from a massive image dataset.
3. Probabilistic: This emphasizes that the model doesn’t produce a single definitive image. Instead, it generates multiple possible images, each with varying probabilities of being “correct” based on the prompt and dataset. This probabilistic nature allows for the exploration of different creative possibilities.
4. Models: This simply reminds us that we’re dealing with mathematical models, not physical processes. These models are complex algorithms trained on massive amounts of data to learn how to create images that resemble real-world photos or artistic styles.
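For the mathematically curious, the “diffusion” and “probabilistic” pieces have a compact form in the original DDPM formulation: the noisy image at step t is a weighted mix of the clean image and random Gaussian noise, with the weights set by a noise schedule. Here’s a one-pixel sketch assuming a simple linear schedule (the constants are illustrative, loosely modeled on common defaults):

```python
import math
import random

random.seed(3)

# DDPM forward process (one pixel):
#   x_t = sqrt(a_bar[t]) * x0 + sqrt(1 - a_bar[t]) * eps
# where a_bar[t] shrinks toward 0 as t grows, so late steps are almost pure noise.

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule

a_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    a_bar.append(running)

def noisy_at(x0, t):
    eps = random.gauss(0, 1)  # the "probabilistic" part: fresh randomness each call
    return math.sqrt(a_bar[t]) * x0 + math.sqrt(1 - a_bar[t]) * eps

x0 = 0.7  # one pixel of a clean image
print(round(noisy_at(x0, 0), 3))      # early step: still close to 0.7
print(round(noisy_at(x0, T - 1), 3))  # final step: essentially pure noise
```

The “denoising model” is then trained to run this corruption backwards, one step at a time.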
🤓 Why do AI image generators use noise & diffusion?
1. To escape local minima
Imagine the creative process as navigating a hilly landscape. Local minima are like valleys where your progress gets stuck. Adding noise helps the algorithm “jitter” out of these local minima and explore a wider range of possibilities, leading to more diverse and creative outputs. (Read more about minima below.)
2. To prevent overfitting
Overfitting occurs when the AI model memorizes its training data too closely, leading to less generalizable results. Noise disrupts these memorized patterns, forcing the model to rely on its understanding of underlying image statistics and generate outputs that are consistent with real-world images, not just its training set.
3. To control the level of detail
The amount of noise added directly affects the level of detail in the generated image. High noise leads to abstract, blurry outputs, while low noise results in sharper, more realistic images. This allows users to control the trade-off between creativity and detail based on their desired outcome.
4. To guide the diffusion process
During the “diffusion” process, the AI gradually removes noise from the image to reveal the final work. The initial noise pattern acts as a guiding force, influencing the overall composition, textures, and details of the final image.
5. To introduce randomness and prevent deterministic outputs
Adding noise ensures that each generation of the same image is slightly different, even with the same prompt. This element of randomness helps to preserve the excitement and unpredictability of the creative process, preventing predictable or “stale” outputs.
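Point 5 is easy to demonstrate: with everything else held constant, a different random starting point leaves a different fingerprint on the output. In this toy Python sketch, the hard-coded “target” stands in for a prompt and the blending loop for the real denoiser:

```python
import random

def generate(seed, steps=5):
    """Toy 'generation': denoise a random canvas toward the same target."""
    rng = random.Random(seed)
    target = [0.0, 1.0, 0.0]                    # the same "prompt" every time
    canvas = [rng.gauss(0, 1) for _ in target]  # a different noisy start per seed
    for _ in range(steps):
        canvas = [0.5 * c + 0.5 * t for c, t in zip(canvas, target)]
    return [round(c, 3) for c in canvas]

# Same prompt, different seeds: the starting noise leaves its fingerprint.
print(generate(seed=1))
print(generate(seed=2))
```

Many real tools expose this as a “seed” setting: reuse a seed and you reproduce an image; change it and you get a fresh variation of the same prompt.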
In essence, the noise in diffusion models acts as a catalyst for creativity, a tool for exploration, and a key ingredient in generating diverse, realistic, and unique images. Also, some people argue that the noise inherent in the diffusion process adds a unique aesthetic quality to the generated images, contributing to their dreamlike and artistic appeal.
🤓 What is “minima”?
In AI image generation models like Stable Diffusion, “minima” refer to points in the model’s creative process where it can get stuck producing repetitive or predictable images. They aren’t failures of creativity so much as ruts the model could settle into without varied input.
Think of it as navigating a landscape with hills and valleys: the valleys are these minima where creativity is limited. These models use noise to move out of these valleys, pushing the model to explore new, more creative possibilities and avoid getting trapped in these repetitive zones. Essentially, noise helps the model to be more diverse and original in its outputs.
🤓 Key points to remember about diffusion tech
- Not exclusive to image generation: Denoising diffusion models have applications beyond image generation, such as speech and audio synthesis.
- Active research area: This field of AI is still evolving rapidly, with new models and techniques emerging frequently.
- Wider reach: While not all AI image generators use diffusion models, the approach is gaining momentum as researchers explore its potential for generating high-quality, diverse, and creative images.
What systems use diffusion technology for AI images?
While not all AI image generators use denoising diffusion probabilistic models, several notable ones do:
🤖 DALL-E 2
This super versatile model can generate stunning images from text descriptions, even creating realistic images of objects and scenes that don’t exist in the real world. It employs a diffusion-based approach, leveraging noise as a tool to guide image creation.
🤖 Imagen

Another model capable of generating high-quality images from text, Imagen also utilizes a diffusion process that involves gradual denoising. This model is known for its exceptional visual fidelity and ability to capture intricate details.
🤖 Midjourney

Popular among artists and creative professionals, Midjourney offers a wide range of image generation capabilities, including landscapes, portraits, and abstract art. It too relies on diffusion-based techniques, allowing users to control the level of noise and detail in the generated images.
🤖 Stable Diffusion

Known for its efficient and versatile image creation, Stable Diffusion stands out in the realm of AI-driven art generation. This tool leverages advanced diffusion-based methods to offer a broad spectrum of possibilities, from photorealistic scenes to stylized illustrations. Its strength lies in its ability to finely balance noise and detail, giving users considerable control over the texture and clarity of the output.
This open-source model has gained popularity for its ability to produce diverse and creative images with varying artistic styles. It employs a diffusion model with several unique features, including the ability to generate images in different resolutions and with varying levels of abstraction.
🤖 Disco Diffusion

Another open-source model, Disco Diffusion is known for its ability to create dreamlike and surreal images. It utilizes a diffusion approach that allows for a high degree of control over the image generation process, enabling users to experiment with different prompts and settings.
And what’s next?
The field of AI and machine learning, especially regarding image generation and processing, is evolving rapidly. Denoising Diffusion Probabilistic Models (DDPMs) currently represent a significant leap in generating high-quality images, but predicting their longevity as the industry standard is tricky, given the fast-paced nature of AI development.
As of now, DDPMs are a hot topic in AI image generation. They’ve shown impressive results in creating high-quality, diverse images and have certain advantages over other methods like Generative Adversarial Networks (GANs). Their ability to generate images gradually, starting from noise and adding details iteratively, helps in creating more realistic and less artifact-prone images.
But still, new methodologies are being explored all the time. For instance, transformer-based models, known for their success in natural language processing, are being adapted for image generation tasks.
You also need to consider that the standard in AI technologies often shifts not only due to technological advancements, but also due to factors like resource availability, ease of use, community support, and scalability. Even if a new method is slightly superior technically, it might not become the standard if it’s not user-friendly or requires significantly more computational resources.
Also, different applications may favor different technologies. For instance, fields requiring extremely high-resolution images might develop and standardize different models than those used for things like quick, lower-resolution social media content generation.
So while DDPMs are certainly making waves now, the dynamic nature of AI research suggests that they may either evolve or be supplanted by newer technologies in the future. The key is to stay adaptable, and keep an eye on emerging trends to leverage the most effective tools as they develop.
No matter how far AI takes us in the realm of imagery, though, there will never be anything that matches the finest examples of art made with human hands.