GMM in speech recognition using HMM-GMM

Question

I am trying to solve/understand ASR using HMM-GMM.

At an abstract level I understand what's happening, but I do not understand how the GMM fits into it.

My data has 5K hours of speech from a single user. I took the above picture from this article.

I do know what a GMM is, but I am unable to wrap my head around it. Can somebody explain with a simple example?

Naveen Gabriel · Answer 1 · 2020-02-14T09:51:58.590

The previous answer was wrong, so I removed it.

Here goes my second attempt, after reading Speech and Language Processing by Daniel Jurafsky and James H. Martin (a good book to read).

The 39 features associated with an acoustic observation are assumed to have been drawn from a mixture of multivariate Gaussians.

Why a mixture of multivariate Gaussians? Assuming a single multivariate Gaussian for each state (phone) is a strong assumption that might not hold.
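To make the mixture idea concrete, here is a minimal sketch of how one state's GMM could score a single feature vector. It assumes diagonal covariances (a common simplification, not something stated above), and the function names, weights, means and variances are all made up for illustration. The toy vectors have 3 dimensions instead of 39 to keep it readable.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance multivariate Gaussian at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log of sum_k w_k * N(x; mean_k, var_k), using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical state with 2 mixture components over 3-dimensional features.
weights = [0.6, 0.4]
means = [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]]
variances = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

# A vector close to the first component scores higher than one
# sitting halfway between the two components.
near_first = gmm_log_likelihood([0.1, -0.2, 0.0], weights, means, variances)
between = gmm_log_likelihood([1.5, 1.5, 1.5], weights, means, variances)
print(near_first, between)
```

With a single component and weight 1.0 this reduces to the plain single-Gaussian case, which is the strong assumption the mixture is meant to relax.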

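How does the HMM come into the picture with the GMM in ASR? Consider a univariate case where each observation is a single cepstral feature (in practice there are usually 39) modeled by a single Gaussian. Each HMM state then has its own mean and variance, and these define how likely that state is to generate a particular observation. With a GMM, the single Gaussian per state is replaced by a weighted sum of Gaussians, which gives the state's emission probability. Working out which state produced which observation is part of the decoding problem.

Let me know if this is right.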
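As a rough illustration of the univariate case and the decoding step, the sketch below runs Viterbi decoding over a two-state HMM where each state emits from one Gaussian. Everything numeric here (the two states, their means and variances, the transition and start probabilities, the observations) is invented for the example, and `viterbi` and `log_gauss` are just illustrative names.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(observations, log_start, log_trans, means, variances):
    """Return the most likely state sequence for the observations."""
    n_states = len(means)
    # score[s] = best log-probability of any path that ends in state s
    score = [
        log_start[s] + log_gauss(observations[0], means[s], variances[s])
        for s in range(n_states)
    ]
    backpointers = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best previous state to transition from into state s
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(
                score[prev] + log_trans[prev][s] + log_gauss(x, means[s], variances[s])
            )
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state model: state 0 emits around 0.0, state 1 around 5.0.
means = [0.0, 5.0]
variances = [1.0, 1.0]
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.1), math.log(0.9)],
]

observations = [0.1, -0.3, 0.2, 4.8, 5.1, 5.3]
path = viterbi(observations, log_start, log_trans, means, variances)
print(path)  # [0, 0, 0, 1, 1, 1]
```

To move from this toy to the HMM-GMM setting, `log_gauss` would be swapped for a per-state mixture likelihood over the full feature vector; the decoding logic itself stays the same.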
