Stable diffusion is essentially removing entropy from noisy images. If you have $n$ latent dimensions and add Gaussian noise independently to each dimension, the added entropy is on the order of $n\log\sigma$, where $\sigma$ is the standard deviation of the noise. I would guess that the entropy of a neural net is proportional to its number of parameters, and to remove the noise, the net's entropy must be at least as large as the entropy added.
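For reference, the differential entropy of that isotropic Gaussian noise is

$$h\!\left(\mathcal{N}(0,\sigma^2 I_n)\right) = \frac{n}{2}\log\!\left(2\pi e\,\sigma^2\right) = n\log\sigma + \frac{n}{2}\log(2\pi e),$$

which is $n\log\sigma$ up to a term linear in $n$.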
The entropy of an image increases as the logarithm of its resolution (conjecture), so to run stable diffusion at twice the resolution (i.e. one extra latent dimension) you would need $O(\log\sigma)$ extra parameters. Similarly, to remove twice as much noise, you would need $O(n)=O(\log\text{resolution})$ extra parameters. Stable diffusion works for super noisy images because there are already a ton of extra parameters lying around. Same for upscaling: the 512x512 model was fine-tuned for a few extra steps and can suddenly produce 768x768 images perfectly fine.
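Spelling out the scaling I have in mind (this just combines the two conjectures, it isn't a derivation):

$$\#\text{params} \;\gtrsim\; n\log\sigma \;\propto\; \log(\text{resolution})\cdot\log\sigma,$$

so doubling the resolution bumps $n$ by a constant and costs $O(\log\sigma)$ extra parameters, while doubling $\sigma$ costs $O(n)=O(\log\text{resolution})$ extra parameters.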
Anyway, it seems we should be able to do a lot better than 1B parameters. Has anyone tried to find the minimum number of parameters needed to run stable diffusion on, say, the MNIST dataset? If so, how many parameters were needed? It should only take ~6x as many parameters to run stable diffusion on a 768x768 image ($\log(768)/\log(28)\approx 2$, and three times as many channels).
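Quick sanity check of the ~6x arithmetic (the resolution scaling here is the log-resolution conjecture above, not an established fact):

```python
# Back-of-envelope check of the ~6x estimate, assuming entropy ~ log(resolution)
import math

mnist_side, mnist_channels = 28, 1      # MNIST: 28x28 grayscale
target_side, target_channels = 768, 3   # target: 768x768 RGB

resolution_factor = math.log(target_side) / math.log(mnist_side)
channel_factor = target_channels / mnist_channels

print(resolution_factor)                   # ~1.99
print(resolution_factor * channel_factor)  # ~5.98, i.e. roughly 6x
```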
Here someone classifies MNIST with >99% accuracy using only 8K parameters. I feel like you should be able to get a diffuser for MNIST in <100K parameters and run stable diffusion on larger images with <1M.
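To make the <100K figure concrete, here is a rough sketch of a DDPM-style noise predictor for 28x28 MNIST. The architecture and layer sizes are my own arbitrary choices (not anything from the Stable Diffusion codebase), just to show the parameter budget is plausible:

```python
# Toy DDPM-style denoiser for 28x28 MNIST -- a sketch, sized to stay well under 100K params.
import math
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, channels=64, time_dim=64):
        super().__init__()
        self.time_dim = time_dim
        # small MLP on a sinusoidal timestep embedding
        self.time_mlp = nn.Sequential(
            nn.Linear(time_dim, time_dim), nn.SiLU(), nn.Linear(time_dim, channels)
        )
        self.conv_in = nn.Conv2d(1, 32, 3, padding=1)
        self.conv_mid = nn.Conv2d(32, channels, 3, padding=1)
        self.conv_out = nn.Sequential(
            nn.SiLU(), nn.Conv2d(channels, 32, 3, padding=1),
            nn.SiLU(), nn.Conv2d(32, 1, 3, padding=1),
        )

    def sinusoidal_embedding(self, t):
        # standard sinusoidal embedding of the diffusion timestep
        half = self.time_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([args.sin(), args.cos()], dim=-1)

    def forward(self, x, t):
        emb = self.time_mlp(self.sinusoidal_embedding(t))   # (B, channels)
        h = torch.relu(self.conv_in(x))
        h = self.conv_mid(h) + emb[:, :, None, None]        # inject the timestep
        return self.conv_out(h)                             # predicted noise, same shape as x

model = TinyDenoiser()
print(sum(p.numel() for p in model.parameters()))  # 45,889 with these sizes

x, t = torch.randn(8, 1, 28, 28), torch.randint(0, 1000, (8,))
print(model(x, t).shape)  # torch.Size([8, 1, 28, 28])
```

Whether something this small actually trains into a decent MNIST sampler is exactly the empirical question.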
If this is all true, then it would make sense for someone to have tried this. If not, where am I going wrong?