
I was reading this paper on environmental noise discrimination using Convolutional Neural Networks and wanted to reproduce their results. They convert WAV files into log-scaled mel spectrograms. How do you do this? I am able to convert a WAV file to a mel spectrogram:

import librosa
import librosa.display

y, sr = librosa.load('audio/100263-2-0-117.wav', duration=3)
ps = librosa.feature.melspectrogram(y=y, sr=sr)  # power mel spectrogram
librosa.display.specshow(ps, y_axis='mel', x_axis='time')

(image: mel spectrogram)

I am also able to display it as a log-scaled spectrogram:

librosa.display.specshow(ps, y_axis='log', x_axis='time')

(image: log-scaled spectrogram)

Clearly, they look different, but the underlying spectrogram ps is the same. Using librosa, how can I convert this mel spectrogram into a log-scaled mel spectrogram? Furthermore, what is the use of a log-scaled spectrogram over the original? Is it just to reduce the variance of the frequency domain to make it comparable to the time axis, or something else?

Ajay H

2 Answers


I think you're misinterpreting what the authors mean by log-scaled. They are not referring to the frequency (y) axis, although spectrograms are typically log-scaled there. They are referring to the scale of the third dimension of the spectrogram, i.e. the values themselves. In your case the raw spectrogram displays power in color; what you want instead is decibels, which are log-scaled.

In your case, the code would look like this:

import numpy as np
import librosa
import librosa.display

y, sr = librosa.load('audio/100263-2-0-117.wav', duration=3)
ps = librosa.feature.melspectrogram(y=y, sr=sr)
ps_db = librosa.power_to_db(ps, ref=np.max)  # convert power to decibels

librosa.display.specshow(ps_db, x_axis='time', y_axis='mel')

Note: each spectrogram will be scaled based on the ref argument of librosa.power_to_db. If you do not supply one, librosa defaults to a reference of 1.0, which may or may not be what you're looking for. You can also try np.median.
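As a small illustration (reusing the clip from the question; the variable names are just for this sketch), you can compare the default reference with ref=np.max:

import numpy as np
import librosa

y, sr = librosa.load('audio/100263-2-0-117.wav', duration=3)
ps = librosa.feature.melspectrogram(y=y, sr=sr)

# Default reference (ref=1.0): decibels relative to a power of 1.0.
db_default = librosa.power_to_db(ps)

# ref=np.max: decibels relative to the loudest bin, so the peak sits at 0 dB
# and every clip is normalized against its own maximum.
db_peak = librosa.power_to_db(ps, ref=np.max)

print(db_default.max(), db_peak.max())  # the peak of db_peak is 0 dB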

Austin

Here is example code to extract audio features as a log spectrogram using Python and SciPy:

import numpy as np
from scipy.io import wavfile
from scipy import signal

window_size = 20  # window length in ms
step_size = 10    # hop length in ms
eps = 1e-10       # avoid log(0)

sample_rate, data = wavfile.read('filename.wav')  # use the file's actual sample rate
if data.ndim > 1:  # ignore channels 2+
    data = data[:, 0]

nperseg = int(round(window_size * sample_rate / 1e3))   # window length in samples
noverlap = int(round(step_size * sample_rate / 1e3))    # overlap between consecutive windows
freqs, times, spec = signal.spectrogram(data, fs=sample_rate, window='hann',
                                        nperseg=nperseg, noverlap=noverlap)
log_specgram = np.log(spec.T.astype(np.float32) + eps)
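
Note that this uses the natural logarithm, whereas librosa.power_to_db in the answer above returns decibels (10 * log10 of the power ratio); the two differ only by a constant factor. A small sketch of the equivalence, assuming the spec and eps defined above:

import numpy as np

# Natural-log features (as above) vs. a decibel scale:
# dB = 10 * log10(power), so the two differ by the factor 10 / ln(10).
log_specgram = np.log(spec.T.astype(np.float32) + eps)
db_specgram = 10.0 * np.log10(spec.T.astype(np.float32) + eps)

assert np.allclose(db_specgram, log_specgram * (10.0 / np.log(10)))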
Denize