Jon Evans explaining how text-to-image generators work

Thursday 15th December, 2022 - Bruce Sterling

*That’s a noble effort at explication there, and well worth a read.


This is where the magic happens. Suppose the diffusion process with which you trained your model had 1,000 steps, invariably ending after the thousandth in pure Gaussian noise, seemingly incomprehensible random chaos. Well, if you show your trained model some of that pure noise … it has learned how to estimate the change since the step just before random chaos. You can then remove that estimated noise from the pure randomness, resulting in, in effect, the model’s estimate of a 999th step, a point at which there is the tiniest hint of structure, the merest suggestion of some kind of order.
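That single backwards step can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the exact update a real sampler uses (a real DDPM step also rescales by noise-schedule coefficients); `predict_noise` is a hypothetical stand-in for the trained network:

```python
import numpy as np

def reverse_step(x_t, t, predict_noise):
    """One reverse-diffusion step (simplified sketch).

    `predict_noise(x_t, t)` stands in for the trained network's
    estimate of the noise added between step t-1 and step t.
    """
    estimated_noise = predict_noise(x_t, t)
    # Removing the estimated noise yields the model's guess
    # at what step t-1 looked like.
    return x_t - estimated_noise
```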

So far, so little. To the human eye, that estimated 999th step will still be indistinguishable from chaos. But…

…then you run that estimated 999th step through the network again, to generate an estimated 998th step … and then a 997th … and very slowly, step by step, patterns begin to appear which seem non-random, and then, ultimately, recognizable to the human eye … forms, then shapes, then images, even faces … and eventually, after a thousand backwards steps, you wind up with a glorious, brand-new image, generated on the spot. Brand-new because, remember, the neural network estimates the added noise, rather than making precise calculations, so every input will lead to new estimation fuzziness at each step, and ultimately a different result.
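The whole loop described above can likewise be sketched: start from pure Gaussian noise at step 1,000 and walk backwards to step 0, subtracting the network’s noise estimate each time. Again, `predict_noise` is a hypothetical placeholder for the trained model, and a production sampler would also reinject fresh noise and apply schedule coefficients at each intermediate step:

```python
import numpy as np

def sample(predict_noise, shape, steps=1000, rng=None):
    """Simplified reverse-diffusion sampling loop (sketch only)."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1000: pure Gaussian noise, indistinguishable from chaos.
    x = rng.standard_normal(shape)
    # Walk backwards: 1000 -> 999 -> 998 -> ... -> 0.
    for t in range(steps, 0, -1):
        x = x - predict_noise(x, t)  # estimate of step t-1
    return x  # the brand-new generated image
```

Because the network only estimates the noise, small differences in the starting randomness compound across the thousand steps, which is why every run produces a different image.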

In the end, as you’ve no doubt witnessed, the generated DALL-E image can be as photorealistic, as apparently seamless, as any of the original set of high-quality, human-generated images on which the diffusion model was initially trained. It feels like sorcery! But it’s really just a huge number of very repetitive, quite simple mathematical calculations … which accumulate into enormous power and subtlety….