Diffusion Models are Evolutionary Algorithms
Authors: Yanbo Zhang, Benedikt Hartl, Hananel Hazan, Michael Levin
Paper: https://arxiv.org/abs/2410.02543
Code: https://github.com/Zhangyanbo/diffusion-evolution
An interesting paper from Michael Levin and co. This is the same Michael Levin who has long been working on bioelectricity, artificial life, and many other biology-related topics. He gave a fantastic talk titled "What Bodies Think About: Bioelectric Computation Outside the Nervous System" at NeurIPS 2018:
He also did an intriguing keynote "Robot Cancer: what the bioelectrics of embryogenesis and regeneration can teach us about unconventional computing, cognition, and the software of life" at ALife 2020:
He co-authored an excellent work on cellular automata and differentiable morphogenesis (https://distill.pub/2020/growing-ca/).
The paper claims that diffusion models are evolutionary algorithms. How so? Let's break it down.
In the biosphere, at least two processes are capable of generalization and creating novelty: evolution (a slow process of adaptation to the environment across generations of organisms through natural selection) and learning (a fast process allowing individuals to acquire knowledge and generalize subjective experience during their lifetime). Recently, there's been a surge in work about the connection between evolution and learning, from Hinton's classic "How Learning Can Guide Evolution" (1987) to Vanchurin, Wolf, Katsnelson, Koonin's "Toward a theory of evolution as multilevel learning" (2022) and Watson (not that one) and Levin's (that one) "The collective intelligence of evolution and development" (2023). The current work argues that a specific class of diffusion models, where a generative model performs sequential stochastic denoising, can be understood through an evolutionary process that performs natural selection, mutations, and reproductive isolation.
Let's recall both concepts:
Diffusion models in simple terms. The forward diffusion process takes an image as input (it could be any other signal instead of an image) and sequentially adds noise step by step until it becomes completely noisy. The forward diffusion process isn't very interesting; what's interesting is the reverse process – it takes noise as input and sequentially removes it, "revealing" (creating) the image hidden behind it (like doing denoising). The forward and reverse processes can be called diffusion and denoising, respectively. I've previously discussed examples of diffusion models like DALLE 2 here.
Evolutionary algorithms in simple terms. Imagine we have a complex task (for example, finding the optimal shape of an airplane wing), and we create a set of random solutions – like a "population" of creatures in nature. We evaluate each solution according to specific criteria (how well it flies), "crossbreed" the best solutions (taking some parameters from one solution, some from another), occasionally "mutate" them randomly (slightly changing some parameters), and get a new "generation" of solutions. This process repeats many times, and gradually, as in natural selection, more successful variants survive and reproduce. Eventually, we get a solution that might not be perfect but is good enough for practical application. Usually, the structure of the parameter space is unknown beforehand, so the initial population often starts with a standard normal distribution. The main advantage of this approach is that it doesn't require a precise understanding of how the task works – it's enough to be able to evaluate the quality of solutions. Among popular methods are CMA-ES and PEPG (the latter, by the way, is from Schmidhuber and co., he was also actively involved with these https://people.idsia.ch/~juergen/evolution.html), but there are many others. Some work with discrete parameter sets, some with continuous ones; here we're looking at the latter.
As you can see, both approaches involve iterative data updates and sampling new objects from complex distributions. Both have a combination of directed updates and random perturbations. These are selection+mutation in evolution's case, random noise+learned denoising in diffusion's case. This raises the question: is the mechanics of these two processes fundamentally connected, and is there a deep mathematical duality between biological evolution and generative modeling? Or is it all just analogy and vanity of vanities?
First, the authors analyze evolution from the perspective of generative models. Looking at species populations in the biosphere, the variational evolutionary process can be understood as a transformation of distributions (will it be dist2dist analogous to seq2seq?), distributions of genotypes and phenotypes. Mutations and selection jointly change the shapes of these distributions. Many biologically inspired evolutionary algorithms can be understood similarly: they optimize an objective function by maintaining and iteratively changing the distribution of a large population. And this same concept – transformation of distributions – is central to many generative models: VAEs, GANs, and diffusion models learn to transform simple distributions (often standard Gaussian) into more complex ones, where samples represent meaningful images, sounds, and texts.
On the other hand, diffusion can also be viewed from an evolutionary perspective. During training, data points are noised, and the model learns to predict this added noise to reverse the process (by the way, has anyone worked on a diffusion time machine yet?). In the sampling phase, the model starts with points from a Gaussian distribution and incrementally updates them through denoising, where noise-free samples are the ideal. In this case, directed denoising can be interpreted as directed selection, and each step adds a small noise (with a minus sign?) analogous to mutations. This resembles an evolutionary process and aligns with ideas interpreting the genome as a latent space parameterization of a multi-scale generative morphogenetic process, rather than just a blueprint of an organism. If an evolutionary process is reversed, the evolved population of highly correlated and highly fit individuals will gradually dissolve, similar to the forward diffusion process.
Analogous to energy and probability in statistical physics, evolutionary tasks can be connected to generative ones by mapping fitness to probability density: high fitness corresponds to high probability density. The authors ultimately mathematically derive a new algorithm called Diffusion Evolution – an evolutionary optimization procedure based on iterative error correction similar to diffusion models, but not relying on neural networks.
Here are its key features:
Start with a population of random solutions (like noise in diffusion models)
At each step:
Each solution is evaluated by a fitness function
For each solution, its "denoised" version is estimated through weighted averaging with neighboring solutions (higher weight for more successful neighbors)
The solution takes a small step toward its "denoised version" and receives a small random mutation
As it progresses:
The neighbor search radius gradually decreases (like noise reduction in diffusion models)
This allows exploring the solution space globally at first, then optimizing locally
The key advantage of the algorithm is that it can find and maintain multiple good solutions simultaneously, unlike many classical evolutionary algorithms, which usually converge to a single solution.
Several experiments were conducted with the new algorithm.
In the first experiment, they used five different two-dimensional fitness landscapes: Rosenbrock and Beale with one optimum, and Himmelblau, Ackley, and Rastrigin with multiple optima. These were compared with other evolutionary strategies: CMA-ES, OpenES, and PEPG.
Evolution was run 100 times for each method. Each experiment had a population size of 512 and used 25 iterations (except for OpenES, which needed 1000 steps to converge). Diffusion Evolution finds quality and diverse solutions, especially on the last three landscapes where other methods struggle and tend to converge to a single solution.
In evolutionary algorithms, fitness evaluation is often the most computationally expensive operation, so the authors tried to reduce the number of iterations by borrowing cosine scheduling from work on diffusion models. This significantly reduced the number of fitness evaluations.
In the second experiment, they proposed Latent Space Diffusion Evolution, inspired by latent space diffusion models (https://arxiv.org/abs/2112.10752). It allows solving problems with high-dimensional parameter spaces by exploring a low-dimensional latent space. Here, the method was applied to RL tasks where a network had to learn to control the classic cart-pole system. A two-layer network with 58 parameters was used for control. Direct Diffusion Evolution works poorly, but switching to a latent space with two parameters works well. As I understand it, the transformation is performed through a random projection matrix, and it's only used to calculate distances between solutions, while the solutions themselves are updated in the original space. The result is good and works with larger networks too (they also tested it on a three-layer network with 17,410 parameters).
Overall, it's a success. They also showed that working solutions can be transferred from other areas (like they transferred the idea from latent diffusion models). This approach is similar to what Tri Dao and Albert Gu use in their SSMs, when they unite SSM and something known like linear transformer into one class and transfer ideas that work on this transformer to SSM, as was done in the Mamba-2 paper (we discussed it here), for example.
This is all very exciting movement, showing that learning and evolution are essentially doing the same thing. And remembering the work comparing neural network training through SGD with the diffusion process (Neural Network Diffusion, a beautiful paper discussed here), then transitively we can probably say that gradient descents are also evolutionary algorithms? Are evolution and learning unifying again? And in that case, is the thermodynamic computer (we touched it here and in more detail here) the universal hardware for all this future AI? There's a lot to think about.
There are also open questions, for instance, a very big question about how diffusion models work in finite time, while real evolution is potentially infinite and open-ended. How can Diffusion Evolution be adapted to an open-ended setting? Could other variants of diffusion models lead to new evolutionary algorithms? (why not?) Can inductive biases from diffusion models be brought into evolutionary algorithms? How do latent diffusion models correlate with neutral genes? Can diffusion models be advanced by ideas from evolution?
In short, let's actively cross-pollinate!