Keyword localization in audio file

Question

I want to build a model that can localize occurrences of a particular word in an audio file. For example, I want to find the word "pizza" in a ~5min recording. The program should return an array with (start, stop) objects describing the start and stop boundaries of that word in the file.

Can I just use classic object localization with some kind of CNN, where the object is the wanted word in the spectrogram? If so, how should I prepare the training data: recordings of the word "pizza" plus an equal number of recordings of other words, or more recordings of other words?

Is there perhaps a better method for word-searching in recordings?

score 2 · Answer 1 · answered Jan 31 '20 at 14:42

The problem you are describing is known as wake word detection or trigger word detection.

I'm sure you could use a CNN to classify a chunked Mel-spectrogram of your audio (see also librosa). As training labels you would simply use 0 for timestamps with no wake word (no "pizza") and 1 for timestamps with the wake word. As an alternative to classifying all timestamps of a chunk, you could also train on just the center frame of each spectrogram chunk (which makes things easier). In any case, you will have to make sure that your dataset is at least mildly balanced, i.e., that it contains enough wake word and non-wake word instances. One way to achieve this is to overlay recordings of background noise with recordings of wake and non-wake words. There are some tutorials that detail how to do this, e.g. this YouTube video, this article or this GitHub repo. Note that all these approaches use RNNs for the task. However, it has been argued by Bai et al. that a temporal convolutional network (TCN) architecture (in essence a CNN with skip connections and dilation) may work equally well or better for tasks such as the one you describe and is probably easier to train.
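
To make the labeling idea concrete, here is a minimal sketch in plain Python. It is not taken from any of the tutorials mentioned; the function names, the `hop_s` parameter (seconds per spectrogram frame) and the 0.5 threshold are my own assumptions. The first function turns annotated (start, stop) times into per-frame 0/1 labels for training, and the second turns per-frame model outputs back into the (start, stop) pairs asked for in the question.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, stop) in seconds


def frame_labels(intervals: Sequence[Interval], n_frames: int, hop_s: float) -> List[int]:
    """Label each frame 1 if its centre time lies inside an annotated interval, else 0.

    hop_s is an assumed frame spacing in seconds (hop length / sample rate).
    """
    labels = [0] * n_frames
    for i in range(n_frames):
        centre = (i + 0.5) * hop_s
        if any(start <= centre < stop for start, stop in intervals):
            labels[i] = 1
    return labels


def frames_to_intervals(
    probs: Sequence[float], hop_s: float, threshold: float = 0.5, min_frames: int = 1
) -> List[Interval]:
    """Merge runs of consecutive frames with prob >= threshold into (start, stop) pairs.

    Runs shorter than min_frames are dropped as likely false alarms.
    """
    intervals: List[Interval] = []
    run_start = None  # index of the first frame of the current active run
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and run_start is None:
            run_start = i
        elif not active and run_start is not None:
            if i - run_start >= min_frames:
                intervals.append((run_start * hop_s, i * hop_s))
            run_start = None
    # Close a run that extends to the end of the recording.
    if run_start is not None and len(probs) - run_start >= min_frames:
        intervals.append((run_start * hop_s, len(probs) * hop_s))
    return intervals


# Toy example: one "pizza" annotated from 1.0 s to 2.0 s, frames every 0.5 s.
labels = frame_labels([(1.0, 2.0)], n_frames=6, hop_s=0.5)
print(labels)                                  # [0, 0, 1, 1, 0, 0]
print(frames_to_intervals(labels, hop_s=0.5))  # [(1.0, 2.0)]
```

In practice `probs` would be the per-frame outputs of whatever network you train, and raising `min_frames` is a cheap way to suppress single-frame false positives.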

Hopefully this answer will give you some points to start from.

score 1 · Answer 2 · answered Jul 25 '21 at 16:25

For a practical way to search for words in a recording, consider using a speech recognition model and simply matching on the text output. Modern pretrained speech recognition models are really good, and are available both as services and as locally installable open-source packages. See this answer for an example of word-level speech recognition.
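
As a rough illustration of the matching step, assume the recognizer returns one entry per word with its start and end time in seconds. The dictionary keys `word`, `start` and `end` below are a hypothetical format, not the output of any specific package, so adapt them to whatever your recognizer actually produces.

```python
import string
from typing import Dict, List, Sequence, Tuple


def find_keyword(words: Sequence[Dict], keyword: str) -> List[Tuple[float, float]]:
    """Return (start, stop) for every recognized word equal to keyword.

    Comparison ignores case, surrounding whitespace and surrounding punctuation.
    Each item in words is assumed to look like
    {"word": "pizza", "start": 12.3, "end": 12.8}.
    """
    target = keyword.strip().lower()
    hits = []
    for w in words:
        token = w["word"].strip().strip(string.punctuation).lower()
        if token == target:
            hits.append((w["start"], w["end"]))
    return hits


# Made-up transcript fragment, for illustration only.
transcript = [
    {"word": "I", "start": 0.0, "end": 0.2},
    {"word": "ordered", "start": 0.2, "end": 0.7},
    {"word": "Pizza,", "start": 0.7, "end": 1.2},
    {"word": "not", "start": 1.3, "end": 1.5},
    {"word": "pasta.", "start": 1.5, "end": 2.0},
]
print(find_keyword(transcript, "pizza"))  # [(0.7, 1.2)]
```

Exact matching like this will miss inflected forms such as "pizzas", so you may want to compare against a small set of variants instead of a single string.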
