I have a personal dataset of 10,000 audio files, each consisting of a single spoken sentence. Each file comes with a transcribed text label that I can use for supervised HMM training.
Now that I have extracted MFCC features, how do I input the sequence of MFCC vectors for each file into the HMM? After reviewing multiple sources, my understanding is that I should initialize the $N \times N$ transition matrix with all of the vectors, but how does this mark the start and end of each spoken sentence? I'm also unsure how to choose the number of states when the files vary in word length.
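To make the boundary question concrete, here is a minimal sketch of one common convention: stack the frames of all utterances into one long list and keep a separate list of per-utterance lengths, so the sequence boundaries are never lost. Some libraries take input in this shape (for example, hmmlearn's `fit(X, lengths)`), but whether that is the right approach for my setup is an assumption. The helper names and the tiny 2-dimensional "MFCC" frames are hypothetical and for illustration only.

```python
from itertools import accumulate


def stack_sequences(sequences):
    """Concatenate per-utterance frame lists and record each utterance's length."""
    frames = [frame for seq in sequences for frame in seq]
    lengths = [len(seq) for seq in sequences]
    return frames, lengths


def split_sequences(frames, lengths):
    """Recover the original utterances from the stacked frames and the lengths."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return [frames[s:e] for s, e in zip(starts, ends)]


# Hypothetical toy data: 3 utterances of different lengths,
# each frame a 2-dimensional stand-in for a real MFCC vector.
utterances = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]

frames, lengths = stack_sequences(utterances)
recovered = split_sequences(frames, lengths)
```

In this layout the feature vectors are the observations handed to training, and `lengths` tells the trainer where one sentence stops and the next begins.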
This is my current understanding of the components involved:
An $N \times N$ transition probability matrix – which I believe is defined over words (the language model).
$1 \times N$ start and end probability matrices – which I also believe are $0$th-order Markovian per word.
The type of underlying distribution(s) used for each state.
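As a point of reference for the three components listed above, here is a minimal sketch of what they might look like for a single small model. It assumes that $N$ counts hidden states of a left-to-right HMM (not words) and that each state emits from a diagonal Gaussian; both are assumptions on my part, and the function name, `stay_prob` value, and dimensions are hypothetical.

```python
def left_to_right_hmm(n_states, stay_prob=0.6):
    """Build start probabilities and a left-to-right transition matrix."""
    # Always start in the first state.
    start = [1.0] + [0.0] * (n_states - 1)
    trans = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            trans[i][i] = stay_prob            # stay in the same state
            trans[i][i + 1] = 1.0 - stay_prob  # advance to the next state
        else:
            trans[i][i] = 1.0                  # last state is absorbing
    return start, trans


n_states = 3
start, trans = left_to_right_hmm(n_states)

# One diagonal Gaussian per state (hypothetical 2-dimensional features).
emissions = [{"mean": [0.0, 0.0], "var": [1.0, 1.0]} for _ in range(n_states)]
```

Each row of `trans` sums to 1, and the matrix size depends only on the number of states, not on the number or length of the training sentences.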
Please correct me if my perspective is wrong.