Questions tagged [speech-to-text]

46 questions
4
votes
1 answer

Why are observation probabilities modelled as Gaussian distributions in HMM?

HMM is a statistical model with unobserved (i.e. hidden) states, used in recognition algorithms (speech, handwriting, gesture, ...). What distinguishes a DHMM from a CHMM is the transition probability matrix P with elements. In a CHMM, the state space of…
4
votes
2 answers

How to convert a mel spectrogram to log-scaled mel spectrogram

I was reading this paper on environmental noise discrimination using Convolutional Neural Networks and wanted to reproduce their results. They convert WAV files into log-scaled mel spectrograms. How do you do this? I am able to convert a WAV file to a…
Ajay H • 222
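With librosa, the usual recipe is `librosa.feature.melspectrogram` followed by `librosa.power_to_db`. The dB conversion itself is just a clipped logarithm; a minimal numpy sketch mirroring `power_to_db`'s defaults (`ref`, `amin`, `top_db`):

```python
import numpy as np

def power_to_db(S, ref=1.0, amin=1e-10, top_db=80.0):
    """10 * log10(S / ref), clipped below at amin and floored at
    (peak - top_db) dB -- mirroring librosa.power_to_db's defaults."""
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    return np.maximum(log_spec, log_spec.max() - top_db)

mel = np.array([[1.0, 0.1], [0.01, 1e-12]])   # toy mel power spectrogram
log_mel = power_to_db(mel, ref=mel.max())     # peak maps to 0 dB
```

In practice you would pass the output of `librosa.feature.melspectrogram(y=wave, sr=sr)` as `S`; the toy array above only illustrates the scaling.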
3
votes
1 answer

How to double audio dataset?

I am trying to develop a mispronunciation detection model for English speech. I use the TIMIT dataset, a phoneme-labelled audio dataset. A phoneme is any of the perceptually distinct units of sound. So, my dataset looks like an audio file and…
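For phoneme-labelled audio, augmentations that preserve time alignment are the safest way to grow the dataset, since the original phoneme boundaries stay valid. Pairing each file with one noisy copy literally doubles it. A minimal numpy sketch (the noise level is illustrative):

```python
import numpy as np

def augment(wave, noise_std=0.005, rng=None):
    """Additive Gaussian noise keeps phoneme boundaries aligned,
    so the original labels still apply. noise_std is illustrative."""
    if rng is None:
        rng = np.random.default_rng(0)
    return wave + rng.normal(0.0, noise_std, wave.shape)

clean = np.zeros(16000)   # 1 s of silence at 16 kHz stands in for a TIMIT file
noisy = augment(clean)    # same length, same phoneme labels apply
```

Time-stretching or pitch-shifting (e.g. via librosa's effects) also works, but stretching changes phoneme boundary times, so the label timestamps must be rescaled accordingly.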
3
votes
1 answer

ASR on low dataset

I am doing ASR (automatic speech recognition) for my master's thesis on a small dataset. The voice and text data are labelled. There are around 4000 phrases and around 5 hours of speech. I don't have a background in speech or signal processing. How huge would be…
2
votes
2 answers

How to evaluate the quality of speech-to-text data without access to the true labels?

I am dealing with a data set of transcribed call-center data, where customers are recorded while interacting with the agent. This is then automatically transcribed by an external transcription system. I want to automatically assess the quality…
miri_h_ds • 21
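One common proxy when no ground truth exists is agreement between two independent ASR systems: utterances where the transcripts diverge are flagged for manual review. A minimal sketch using Python's `difflib` (word-level similarity as a cheap stand-in for computing WER between the two hypotheses):

```python
import difflib

def agreement(hyp_a, hyp_b):
    """Word-level similarity between two independent ASR transcripts;
    low agreement flags utterances that likely need manual review."""
    return difflib.SequenceMatcher(None, hyp_a.split(), hyp_b.split()).ratio()
```

Other label-free signals worth combining with this: the ASR system's own confidence scores (if exposed) and language-model perplexity of the transcripts.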
2
votes
0 answers

How to do phoneme segmentation using dynamic time warping?

Background Information: Dynamic Time Warping (DTW): In time series analysis, dynamic time warping (DTW) is one of the algorithms for measuring similarity between two temporal sequences, which may vary in speed. (Source: Wikipedia) Phoneme…
Sam Kagawa • 21
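For reference, the DTW recurrence itself is short. A textbook numpy implementation over two 1-D sequences (real phoneme segmentation would run it over frame-level feature vectors such as MFCCs, and backtrack through the accumulated-cost matrix to recover the alignment path):

```python
import numpy as np

def dtw(x, y):
    """Accumulated-cost DTW between two 1-D sequences; backtracking
    through D would recover the warping path used for segmentation."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For segmentation, one sequence would be the utterance's frames and the other a reference pronunciation; phoneme boundaries fall where the warping path crosses reference-phoneme transitions.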
2
votes
1 answer

How is an ASR's output compared to ground truth for validation?

I am curious how it is done, as I am interested in doing something similar. I have some manually transcribed data that contains tags for multiple speakers. I want to compare how well out-of-the-box ASRs (Google, AWS Transcribe) are able to…
Samarth • 339
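The standard metric here is word error rate (WER): the word-level Levenshtein distance between hypothesis and reference, divided by the reference length. Libraries such as jiwer provide it ready-made; a self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER is sensitive to text normalisation (casing, punctuation, numerals), so both sides should be normalised identically before scoring; multi-speaker tags are usually stripped first.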
2
votes
2 answers

Creating pronunciation dictionary for ASR

I am working on ASR (automatic speech recognition) for Somali data as my master's thesis, and now I am stuck on how to create a phonetics or pronunciation dictionary for it. I searched the web and could not find one. I'm not sure how to tackle this.…
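When no dictionary exists, a common fallback for languages with fairly regular orthography is a rule-based grapheme-to-phoneme (G2P) mapping that generates the lexicon automatically. A toy sketch (the rule table below is hypothetical and not validated Somali phonology; a real table needs a native speaker or linguistic reference):

```python
# Hypothetical grapheme -> phoneme rules; digraphs listed alongside singles.
RULES = {"sh": "SH", "dh": "DH", "a": "AA", "b": "B", "n": "N"}

def g2p(word):
    """Longest-match G2P: try two-letter digraphs before single letters."""
    phones, i = [], 0
    while i < len(word):
        if word[i:i + 2] in RULES:
            phones.append(RULES[word[i:i + 2]])
            i += 2
        elif word[i] in RULES:
            phones.append(RULES[word[i]])
            i += 1
        else:
            i += 1   # skip graphemes not covered by the toy rule table
    return phones
```

Running `g2p` over the training vocabulary and writing `word<TAB>phone phone …` lines yields the lexicon format toolkits like Kaldi expect.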
2
votes
1 answer

GMM in speech recognition using HMM-GMM

I am trying to solve/understand ASR using HMM-GMM. At an abstract level I understand what's happening, but I did not understand how the GMM fits into it. My data has 5K hours of speech from a single user. I took the above picture from this article. I…
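In an HMM-GMM system, the GMM is each state's emission model: while the HMM handles the sequence of states, the GMM scores how likely a single acoustic feature vector (e.g. one MFCC frame) is under a given state. A minimal diagonal-covariance log-likelihood in numpy:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log p(x | state) under a diagonal-covariance GMM: the emission
    score an HMM state assigns to one feature frame x.
    weights: (K,), means/variances: (K, D), x: (D,)."""
    comp = (np.log(weights)
            - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
            - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))   # stable log-sum-exp
```

During decoding, the Viterbi search combines these per-frame emission scores with the HMM transition probabilities to find the best state sequence.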
2
votes
2 answers

Where to start in natural language processing for a language

My native language is a regional language and few people speak it. I have some assignments in a machine learning course and I was thinking about doing some natural language processing on my native language, but I don't know where to start since there…
2
votes
0 answers

Representing output labels for character-level speech recognition using RNN

I saw this tutorial on generating text using an LSTM. In this tutorial the author trained the network by taking the 100 previous characters as input and the next character as the output label. I am interested in trying some simple speech recognition using…
Jahir Islam • 121
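For character-level speech recognition (typically trained with a CTC loss rather than next-character prediction), the output labels are just integer indices into a character vocabulary, with index 0 conventionally reserved for the CTC blank symbol. A minimal sketch:

```python
# Build the label vocabulary from the training transcripts (toy corpus here).
# Index 0 is conventionally reserved for the CTC blank symbol.
chars = sorted(set("the cat sat"))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}

def encode(text):
    """Map a transcript to the integer label sequence a CTC loss expects."""
    return [char_to_idx[c] for c in text]
```

Unlike the text-generation tutorial, the input here is a sequence of acoustic frames and the label is the whole character sequence; CTC handles the unknown frame-to-character alignment.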
2
votes
1 answer

Validation loss is less than training loss by 5 units. How is this result interpreted?

I am training a Keras model for end-to-end speech recognition. I have my own dataset of speech containing about 400 wave files. Text transcriptions are also given as input. Model summary…
1
vote
0 answers

How do I initialize a Hidden Markov Model when using MFCC features for speech recognition?

I have a personal dataset of 10000 audio files, each consisting of a single spoken sentence. These files each have transcribed text labels with them that I can use for supervised HMM training. Now that I have extracted MFCC features, how do I…
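A common starting point is a "flat start": every state of a left-to-right HMM is initialized with the global mean and variance of all MFCC frames, transitions are split between self-loop and forward, and training (e.g. Baum-Welch) then differentiates the states. A numpy sketch (3 states per phone and the 0.5/0.5 split are typical but illustrative choices):

```python
import numpy as np

def flat_start(mfcc, n_states=3):
    """Flat start: every state gets the global MFCC mean/variance;
    left-to-right topology with 0.5 self-loop / 0.5 forward transitions.
    mfcc: (n_frames, n_coeffs) array of feature frames."""
    mean = mfcc.mean(axis=0)
    var = mfcc.var(axis=0) + 1e-6          # floor the variance
    A = np.zeros((n_states, n_states))
    for s in range(n_states):
        if s + 1 < n_states:
            A[s, s], A[s, s + 1] = 0.5, 0.5
        else:
            A[s, s] = 1.0                  # final state self-loops
    means = np.tile(mean, (n_states, 1))
    variances = np.tile(var, (n_states, 1))
    return A, means, variances
```

Libraries like hmmlearn accept such initial parameters directly; the transcripts then constrain which phone HMMs are concatenated for each utterance during supervised training.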
1
vote
1 answer

How does Wav2Vec 2.0 feed output from Convolutional Feature Encoder as input to the Transformer Context Network

I was reading the Wav2Vec 2.0 paper and trying to understand the model architecture, but I have trouble understanding how audio raw inputs of variable lengths can be fed through the model, especially from the Convolutional Feature Encoder to the…
user116029 • 11
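The feature encoder copes with variable-length raw audio because each convolution just shortens the sequence deterministically; the Transformer then receives however many frames come out. Using the kernel widths (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2) reported for the Wav2Vec 2.0 encoder, the output length for any input is:

```python
def conv_out_len(n_samples,
                 kernels=(10, 3, 3, 3, 3, 2, 2),
                 strides=(5, 2, 2, 2, 2, 2, 2)):
    """Sequence length after the 7 conv layers (no padding):
    L -> floor((L - k) / s) + 1 at each layer."""
    L = n_samples
    for k, s in zip(kernels, strides):
        L = (L - k) // s + 1
    return L
```

One second of 16 kHz audio becomes 49 frames (roughly a 20 ms hop), and within a batch shorter utterances are zero-padded with an attention mask so the Transformer ignores the padding.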
1
vote
1 answer

How to prepare Audio-text data for speech recognition

I have gathered some raw audio from all the conferences, meetings, lectures & casual conversations that I was part of. The machine transcription did not offer good results (from Azure, AWS, etc.). I would transcribe it so as to have both data+label…
johnyc • 11
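Once transcripts exist, most ASR toolkits just want a manifest pairing each audio file with lightly normalised text. A sketch of one JSON-lines entry (the field names follow a common convention, e.g. NeMo-style manifests, not a formal standard):

```python
import json

def manifest_line(wav_path, transcript):
    """One manifest entry: audio path plus lowercased, stripped text.
    Field names ("audio_filepath", "text") follow a common convention."""
    return json.dumps({"audio_filepath": wav_path,
                       "text": transcript.lower().strip()})
```

Long recordings are usually segmented into utterances of a few seconds first (e.g. by silence detection), since most training pipelines expect short audio-text pairs rather than hour-long files.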