I want to build a model that can localize occurrences of a particular word in an audio file. For example, I want to find the word "pizza" in a ~5 min recording. The program should return an array of (start, stop) pairs giving the boundaries of each occurrence of that word in the file.
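To make the desired output concrete, here is a minimal sketch of the last step only. It assumes some model already produces one keyword score per audio frame. The function name `scores_to_segments`, the `hop_seconds` frame spacing, and the 0.5 threshold are hypothetical choices for illustration, not part of any particular library.

```python
from typing import List, Sequence, Tuple


def scores_to_segments(
    scores: Sequence[float],
    hop_seconds: float,
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Turn per-frame keyword scores into (start, stop) times in seconds.

    Assumption: scores[i] is the model's confidence that the word is being
    spoken during frame i, and frames are spaced hop_seconds apart.
    """
    segments = []
    run_start = None  # index of the first frame of the current run, if any
    for i, score in enumerate(scores):
        if score >= threshold and run_start is None:
            run_start = i  # a run of above-threshold frames begins
        elif score < threshold and run_start is not None:
            # the run ended just before frame i
            segments.append((run_start * hop_seconds, i * hop_seconds))
            run_start = None
    if run_start is not None:
        # the recording ended while the word was still active
        segments.append((run_start * hop_seconds, len(scores) * hop_seconds))
    return segments


# Made-up scores, one every 0.5 s, purely to show the output shape.
example_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0, 0.6, 0.9]
segments = scores_to_segments(example_scores, hop_seconds=0.5)
print(segments)  # [(1.0, 2.5), (3.5, 4.5)]
```

In practice the scores would probably need smoothing and a minimum segment duration before thresholding, otherwise a single noisy frame splits or creates a segment.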
Can I just use classic object localization with some kind of CNN, where the object is the target word in the spectrogram? If so, how should I prepare the training data: recordings of the word "pizza" plus an equal number of recordings of other words, or more recordings of other words than of "pizza"?
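For illustration, the simplest variant of this idea is a sliding-window classifier rather than a full object-localization network: cut the spectrogram into overlapping fixed-length patches and score each one. The sketch below assumes the spectrogram is a list of frames and that `classify` is a trained model returning a score per patch. Both `score_windows` and `toy_classifier` are hypothetical names, and the toy classifier is only a stand-in so the example runs.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # one spectrogram column, e.g. mel-band energies


def score_windows(
    spectrogram: Sequence[Frame],
    classify: Callable[[Sequence[Frame]], float],
    window: int,
    hop: int,
) -> List[Tuple[int, int, float]]:
    """Score every window of `window` frames, stepping `hop` frames at a time.

    Returns (start_frame, stop_frame, score) triples; stop_frame is exclusive.
    """
    results = []
    for start in range(0, len(spectrogram) - window + 1, hop):
        stop = start + window
        results.append((start, stop, classify(spectrogram[start:stop])))
    return results


def toy_classifier(patch: Sequence[Frame]) -> float:
    # Stand-in for a trained CNN (assumption): mean energy of the patch.
    total = sum(sum(frame) for frame in patch)
    count = sum(len(frame) for frame in patch)
    return total / count


# Toy "spectrogram": 12 frames of 2 bands, with energy only in frames 4..7.
toy_spec = [[0.0, 0.0]] * 4 + [[1.0, 1.0]] * 4 + [[0.0, 0.0]] * 4
hits = score_windows(toy_spec, toy_classifier, window=4, hop=2)
best = max(hits, key=lambda h: h[2])
print(best)  # (4, 8, 1.0)
```

Frame indices convert to seconds by multiplying by the spectrogram's hop duration. With this setup, the training set would consist of fixed-length patches labeled as containing the word or not, which is where the question of class balance comes in.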
Is there perhaps a better method for word-searching in recordings?