The previous answer was wrong so I removed it.
Here goes my second attempt, after reading Speech and Language Processing by Daniel Jurafsky and James H. Martin (a good book to read).
The 39 features associated with an acoustic observation are assumed to have been drawn from a mixture of multivariate (MV) Gaussians.
Why a mixture of MV Gaussians? Assuming a single MV Gaussian for each state (phone) is a strong assumption that might not hold, since the feature vectors for one state need not cluster around a single mean.
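To make the mixture idea concrete, here is a minimal sketch of how the likelihood of one feature vector under a mixture could be computed. I am assuming diagonal covariances, and the function names and the toy 2-dimensional numbers are mine, not from the book (a real front end would use 39 dimensions).

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance multivariate Gaussian at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log density of a mixture: log sum_k w_k * N(x; mean_k, var_k)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical toy state with two mixture components in 2 dimensions.
weights = [0.5, 0.5]
means = [[-2.0, 0.0], [2.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
x = [2.0, 0.0]

mixture_ll = log_gmm(x, weights, means, variances)
```

With a single component of weight 1 this reduces to the plain MV Gaussian, which is the strong assumption mentioned above.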
How does the HMM come into the picture with the GMM in ASR? Consider the univariate case, where there is a single cepstral feature (usually there are 39) modeled by a single Gaussian. Each HMM state then has its own mean and variance, and these generate the observation emitted in that state. In the full model, each state has a GMM over the 39-dimensional feature vector instead. Working out which state produced which observation is part of the decoding problem.
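As a rough illustration of that decoding step in the univariate case, here is a small Viterbi sketch. The two states, their means and variances, and the transition probabilities are all made-up values for illustration, and this is only my reading of the idea, not a full ASR decoder.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, start, trans, means, variances):
    """Return the most likely state sequence for a Gaussian-emission HMM."""
    n = len(means)
    # score[s] = best log probability of any path ending in state s
    score = [
        math.log(start[s]) + log_gauss(obs[0], means[s], variances[s])
        for s in range(n)
    ]
    backptrs = []
    for x in obs[1:]:
        new_score, ptr = [], []
        for s in range(n):
            # Best predecessor of state s at this time step
            best = max(range(n), key=lambda p: score[p] + math.log(trans[p][s]))
            ptr.append(best)
            new_score.append(
                score[best] + math.log(trans[best][s])
                + log_gauss(x, means[s], variances[s])
            )
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state model: state 0 emits around 0.0, state 1 around 5.0.
means = [0.0, 5.0]
variances = [1.0, 1.0]
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
obs = [0.1, -0.3, 0.2, 4.8, 5.1, 5.3]

path = viterbi(obs, start, trans, means, variances)
```

Here the first three observations sit near the mean of state 0 and the last three near the mean of state 1, so the decoded path assigns them accordingly. Swapping log_gauss for a per-state mixture likelihood would give the GMM-HMM version.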
Let me know if this is right.