I saw this tutorial on generating text using an LSTM. In it, the author trained the network by taking the 100 previous characters as input and the next character as the output label.
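To make the setup concrete, here is a minimal sketch of how I understand that input/label construction. It is not the tutorial's actual code: the function name `make_windows` and the character-to-integer mapping are my own assumptions for illustration.

```python
def make_windows(text, seq_len=100):
    """Build (input, label) pairs for next-character prediction.

    Each input is a list of seq_len character ids; the label is the id
    of the single character that follows that window.
    """
    # Hypothetical vocabulary: every distinct character, sorted, mapped to an integer id.
    chars = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(chars)}

    inputs, labels = [], []
    # Slide a window of seq_len characters over the text, one step at a time.
    for start in range(len(text) - seq_len):
        window = text[start:start + seq_len]
        inputs.append([char_to_id[c] for c in window])
        labels.append(char_to_id[text[start + seq_len]])
    return inputs, labels, char_to_id


# Tiny usage example with a short window so the pairs are easy to inspect.
inputs, labels, char_to_id = make_windows("hello world", seq_len=3)
```

The point is that for text, every input window has exactly one well-defined label, because the input and the output live on the same character timeline.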
I am interested in trying some simple speech recognition using an LSTM. I may use MFCC features of the audio signal as input data, but what confuses me most is how to represent the output label.
The dataset I have is the VCTK corpus, which contains sentence-level audio recordings and their transcriptions.
In the tutorial, the next character after the input vector was used as the output label. For speech, however, it is impractical to know which part of the recording produced which character without transcribing the audio second by second. So how should I represent the output labels for this problem?