Authors: Kai Wang, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, Yang You
Paper: https://arxiv.org/abs/2402.13144
Code: https://github.com/NUS-HPC-AI-Lab/Neural-Network-Diffusion
Diffusion models are currently leading the way, creating stunning images and more. The authors suggest they can also generate neural network parameters. It seems to me they've essentially reinvented the hypernetwork, this time via diffusion.
A couple of 101s
If you're not familiar with the concept of hypernetworks, in a nutshell, it's when one neural network (which is the hypernetwork) generates weights for another neural network (the main network). This direction was initiated by the eponymous work of David Ha and colleagues (https://arxiv.org/abs/1609.09106). Research on hypernetworks has been flowing quite steadily, but in my opinion, the topic is still relatively unknown and feels insufficiently explored. I'm convinced that there's still a lot more interesting stuff hidden within it.
For those unfamiliar with how diffusion models work, here's a quick and simple explanation. The forward diffusion process takes an image (or any other signal) and gradually adds noise step by step until it becomes pure noise. The forward process isn't the interesting part; the reverse process is: it starts from noise and progressively removes it, "revealing" (creating) the hidden image, i.e., denoising.
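For concreteness, here's a minimal sketch of the forward noising step; the linear schedule and its values are illustrative, not tied to this paper:

```python
import torch

# Minimal DDPM-style forward noising; the linear schedule below is illustrative.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta)

def forward_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of the clean signal x0."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps

# At t close to T the output is essentially pure Gaussian noise; the learned reverse
# process starts from such noise and removes it step by step to recover a clean sample.
```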
SGD as a diffusion process
Training a neural network through SGD is conceptually similar to the reverse diffusion process: we start with random initialization and sequentially update the weights until we achieve high quality on a given task.
The authors call their approach neural network diffusion or p-diff (from parameter diffusion). The idea and implementation are simple and beautiful in their own way.
First, we collect a dataset of neural network parameters trained with SGD and train an autoencoder on it; its encoder gives us a latent representation of the parameters (this can be done on a subset of the parameters rather than the full set). The second step is to train a diffusion model that generates latent representations from random noise; these are then converted back into actual weights through the decoder of the autoencoder trained in the first step. In principle, one could train the diffusion model directly on the weights, but that requires significantly more memory.
For the autoencoder, the parameters are flattened into a one-dimensional vector, and noise augmentation is applied to both the input parameters and the latent representation. The encoder and decoder are four-layer 1D CNNs, and the diffusion model is trained as a classic DDPM (https://arxiv.org/abs/2006.11239).
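A minimal sketch of the first stage as I read it: a small 1D-CNN autoencoder over the flattened parameter vector, with Gaussian noise added to both the input and the latent during training. The channel counts, kernel sizes, and noise scales below are placeholders of my own, not the authors' configuration; the second stage would then train a standard DDPM on the latents z and decode generated latents back into parameters.

```python
import torch
import torch.nn as nn

class ParamAutoencoder(nn.Module):
    """Sketch of a 1D-CNN autoencoder over flattened parameter vectors.
    Channel counts, kernel sizes, and noise scales are illustrative, not the paper's settings."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 4, kernel_size=3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(4, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(32, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, w, sigma_in=1e-3, sigma_z=1e-1):
        # w: (batch, 1, num_params) -- checkpoints flattened into 1D vectors.
        w_noisy = w + sigma_in * torch.randn_like(w)   # noise augmentation on the input
        z = self.encoder(w_noisy)
        z_noisy = z + sigma_z * torch.randn_like(z)    # noise augmentation on the latent
        return self.decoder(z_noisy), z

# Reconstruction objective over the collected checkpoints (stand-in batch of parameter subsets):
ae = ParamAutoencoder()
w = torch.randn(8, 1, 2048)
w_hat, z = ae(w)
loss = nn.functional.mse_loss(w_hat, w)
```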
Results
The method was tested on image datasets MNIST, CIFAR-10, CIFAR-100, STL-10, Flowers, Pets, F-101, ImageNet-1K, and on networks ResNet-18/50, ViT-Tiny/Base, ConvNeXt-T/B.
For each architecture, 200 training points (checkpoints from the last epoch) were accumulated. It's not entirely clear to me what exactly they saved: they mention the last two normalization layers (just the learned BatchNorm parameters?) while keeping the rest fixed. In most cases, training the autoencoder and diffusion model took 1-3 hours on a single NVIDIA A100 40G GPU.
For inference, they generate 100 new sets of parameters, keep the one with the highest performance on the training set, evaluate it on the validation set, and report that result.
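A sketch of that selection loop; the callables (sample_latent, decode, load_subset_into, accuracy) are hypothetical stand-ins for whatever the authors' code actually does:

```python
# Hypothetical helpers are passed in as callables; this is a sketch of the selection
# procedure, not the authors' actual API.
def generate_and_select(model, sample_latent, decode, load_subset_into, accuracy,
                        train_loader, val_loader, n_candidates=100):
    best_acc, best_params = -1.0, None
    for _ in range(n_candidates):
        z = sample_latent()                    # run the trained diffusion model from noise
        params = decode(z)                     # decode the latent back into a parameter vector
        load_subset_into(model, params)        # overwrite the generated subset of weights
        acc = accuracy(model, train_loader)    # score the candidate on the *training* set
        if acc > best_acc:
            best_acc, best_params = acc, params
    load_subset_into(model, best_params)       # keep the best candidate...
    return accuracy(model, val_loader)         # ...and report its *validation* accuracy
```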
The baselines are 1) the original models and 2) ensembles in the form of model soups, i.e., averaged weights of fine-tuned models ("Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time", https://arxiv.org/abs/2203.05482).
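The soup baseline is just a uniform average of checkpoint weights; a minimal sketch (the checkpoint paths in the usage comment are hypothetical):

```python
import torch

def uniform_soup(state_dicts):
    """Uniformly average the weights of several fine-tuned checkpoints (the model-soup baseline)."""
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0) for k in keys}

# Usage (checkpoint_paths is a hypothetical list of .pt files):
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(soup)
```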
In most cases, the results are no worse than either baseline. In other words, the method does learn the distribution of high-performing parameters, and it works consistently well across different datasets.
Many ablations were conducted on ResNet-18 + CIFAR-100.
The more trained models collected, the better the results. The method generates high-quality parameters for layers at any depth, but the best results are achieved with the last layers (presumably because less error accumulates during forward propagation). Noise augmentation in the autoencoder is very important, especially on the latent representation (and better still when applied to the input as well).
All of the above was for a subset of the weights. They also tested generating the full set of weights for small networks (MLP-3 and ConvNet-3, 25-155k parameters) on MNIST/CIFAR-10/100. That works as well.
Additionally, they trained ResNet-18 with three random seeds and looked for patterns in the parameters. There do seem to be some (to me, the figures are not very illustrative; I couldn't tell exactly what patterns they saw). If such patterns exist, then presumably the proposed approach learns them.
They also explored the difference between original and generated models to understand 1) whether p-diff memorizes the training data, and 2) whether there's any difference between parameters obtained through fine-tuning or noise addition and the newly generated ones. Model similarity was assessed via the Intersection over Union (IoU) of their incorrect predictions. I hadn't encountered this way of measuring model similarity before (but maybe I missed something and it's already a common approach?).
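In code, that similarity measure is straightforward; a sketch, assuming hard-label predictions for two models on the same evaluation set:

```python
import numpy as np

def wrong_prediction_iou(preds_a, preds_b, labels):
    """Similarity of two models as the IoU of the sets of examples they each misclassify."""
    wrong_a = set(np.flatnonzero(preds_a != labels))
    wrong_b = set(np.flatnonzero(preds_b != labels))
    union = wrong_a | wrong_b
    return len(wrong_a & wrong_b) / len(union) if union else 1.0
```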
The differences between generated models were noticeably greater than between the original ones, and even the maximum similarity between a generated and an original model was significantly lower than among the originals. So the method does generate genuinely new parameters rather than copying the training checkpoints.
Fine-tuned and noise-perturbed versions of the models sit in their own narrow clusters, while the diffusion approach generates much more diverse (and sometimes higher-quality) models.
A t-SNE of the latent representations shows that p-diff samples differ markedly from the original and noise-perturbed versions (it makes sense that the noised ones land near the originals, since the autoencoder was trained to be robust to noise).
Overall, it's an interesting topic. Indeed, why not have a diffusion-based optimizer? It could also be useful for initialization (if, for example, it can shave a couple of epochs off training).
We await further developments!