Why is it an advantage "that Markov chains are never needed" to obtain gradients?

Question

In the original GAN (Generative Adversarial Network) paper, "Generative Adversarial Networks" by I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., the authors state that an advantage of GANs is "that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning" (Section 6 of the paper).

I don't understand why this is an advantage. Put the other way around: why would using Markov chains be a disadvantage?

I'm not very competent but my guess would be that it's related to the estimation process with Markov chains, like Gibbs sampling: it's about repeatedly generating trials randomly to fit the data, it's a quite resource-intensive process. — Erwan, Oct 04 '22 at 10:49

score 2 · Accepted Answer · answered Oct 05 '22 at 08:50

Implementation Considerations

Markov chains are sequential, because they describe the state at step t in terms of the state at step t-1. A long Markov chain is therefore a long sequence of calculations, each of which has to wait until the previous state has been computed. Because of this sequential nature, the steps of a single chain cannot be parallelized in the way the gradient and inference computations in a neural net can.
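To make the sequential dependency concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the random-walk Metropolis chain, the standard normal target, and the `toy_generator` function (a stand-in for a trained generator network) are all assumptions chosen for illustration. The point is only the shape of the two loops: in the chain, step t reads the result of step t-1, while each generator sample depends only on its own noise draw.

```python
import math
import random


def markov_chain_samples(n_steps, step_size=0.5, seed=0):
    """Random-walk Metropolis chain whose target is a standard normal.

    Step t reads the state produced by step t-1, so the loop body
    cannot be distributed across workers within a single chain.
    """
    rng = random.Random(seed)
    x = 0.0  # initial state of the chain
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Acceptance probability min(1, p(proposal) / p(x)) for p = N(0, 1).
        accept_prob = math.exp(min(0.0, 0.5 * (x * x - proposal * proposal)))
        if rng.random() < accept_prob:
            x = proposal
        samples.append(x)
    return samples


def toy_generator(z):
    """Hypothetical stand-in for a trained generator: a fixed map from noise to data."""
    return 2.0 * z + 1.0


def generator_samples(n_samples, seed=0):
    """Each output depends only on its own noise draw, not on other outputs."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # The calls below are independent of one another, so they could be
    # evaluated in any order or as one batched operation.
    return [toy_generator(z) for z in noise]
```

In a real GAN the generator call would be a single batched forward pass through a network, which is where the parallelism pays off. Several independent chains can still be run side by side, but the steps inside any one chain remain serial.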

Architecture Considerations

In the paper they state several issues with Markov chains:

- "[...] methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes."
- In addition, in Section 6 they state that Markov chains require "blurry" data distributions, as opposed to GANs, which can represent "sharp" distributions as well.
- Section 2, first paragraph: They talk about Markov chains as a means of approximating the partition function of Deep/Restricted Boltzmann Machines, which would otherwise be intractable. However, they state that mixing is a problem here. I am not sure what they mean by mixing here.
- Figure 2 caption: As I understand it, they state that Markov chain mixing leads to correlated samples. This might be due to the seed, i.e., the initial state, which you must provide in order to sample from a Markov chain.
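The "blurry versus sharp" and "correlated samples" points can be illustrated with a small experiment. This is a sketch under my own assumptions, not the paper's setup: the target is an equal mixture of two Gaussians centred at -3 and +3, the sampler is random-walk Metropolis, and the names `bimodal_density`, `run_chain`, `fraction_right` and `lag1_autocorrelation` are made up for this example. "Mixing" is commonly used to describe how readily a chain moves between regions of the distribution, such as the two modes here.

```python
import math
import random


def bimodal_density(x, sigma):
    """Unnormalized equal-weight mixture of Gaussians centred at -3 and +3."""
    left = math.exp(-((x + 3.0) ** 2) / (2.0 * sigma ** 2))
    right = math.exp(-((x - 3.0) ** 2) / (2.0 * sigma ** 2))
    return left + right


def run_chain(sigma, n_steps=20000, step_size=0.5, seed=0):
    """Random-walk Metropolis chain started in the left mode (x = -3)."""
    rng = random.Random(seed)
    x = -3.0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, p(proposal) / p(x)).
        ratio = bimodal_density(proposal, sigma) / bimodal_density(x, sigma)
        if rng.random() < ratio:
            x = proposal
        samples.append(x)
    return samples


def fraction_right(samples):
    """Share of samples that landed in the right-hand mode (x > 0)."""
    return sum(1 for s in samples if s > 0.0) / len(samples)


def lag1_autocorrelation(samples):
    """Correlation between each sample and the one that follows it."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    cov = sum((samples[i] - mean) * (samples[i + 1] - mean) for i in range(n - 1)) / (n - 1)
    return cov / var


sharp = run_chain(sigma=0.1)   # narrow, well-separated modes
blurry = run_chain(sigma=1.5)  # wide, overlapping modes

print("sharp  target, fraction in right mode:", fraction_right(sharp))
print("blurry target, fraction in right mode:", fraction_right(blurry))
print("blurry target, lag-1 autocorrelation: ", lag1_autocorrelation(blurry))
```

With the sharp target, the chain has to pass through a region of almost zero density to reach the other mode, so it effectively stays where it started. With the blurry target it moves back and forth. In both cases consecutive states are correlated, because each state is a small perturbation of the previous one; samples drawn by pushing independent noise through a generator do not have that dependence.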
