The first step in a speech recognition system is feature extraction. The Mel-frequency cepstral coefficient (MFCC) is a feature that describes the envelope of the short-term power spectrum, and it is widely used in speech recognition systems.
I. Mel filter
Each speech signal is divided into several frames, and each frame corresponds to a spectrum (computed with the FFT) that describes the relationship between frequency and signal energy. A Mel filter bank is a set of band-pass filters. On the Mel-frequency axis the passbands of these filters all have the same width; on the Hertz axis, however, the filters are narrow and densely spaced at low frequencies and wide and sparsely spaced at high frequencies. This mimics the non-linear perception of the human ear, which discriminates better between low frequencies than between high ones.
The relationship between Hertz frequency and Mel frequency (in the natural-log form used by the code below) is:

mel(f) = 1125 · ln(1 + f / 700),    f = 700 · (e^(mel / 1125) − 1)
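As a minimal sketch, the two directions of this conversion can be written as a pair of functions (the 1125·ln form matches the code later in this article; some references use the equivalent 2595·log10 form instead):

```python
import numpy as np

def hz2mel(f):
    """Convert a frequency in Hz to the Mel scale: 1125 * ln(1 + f/700)."""
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel2hz(m):
    """Invert hz2mel: convert a Mel value back to Hz."""
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

# The mapping is roughly linear below 1 kHz and logarithmic above it.
for f in (100, 1000, 4000, 8000):
    print(f, 'Hz ->', round(hz2mel(f), 1), 'Mel')
```

Note how each additional octave in Hz adds a smaller and smaller Mel increment, which is exactly the compression at high frequencies described above.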
Assume there are M band-pass filters H_m(k), 0 ≤ m < M, on the Mel spectrum, each with center frequency f(m). The transfer function of each band-pass filter is triangular:

H_m(k) = 0,                                    k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)),       f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)),       f(m) ≤ k ≤ f(m+1)
H_m(k) = 0,                                    k > f(m+1)
The following figure shows such a Mel filter bank of 24 band-pass filters on the Hertz-frequency axis:
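A single triangular filter from this transfer function can be sketched over FFT bins as follows (the bin indices 10, 14, 20 are hypothetical edge/center/edge positions chosen only for illustration):

```python
import numpy as np

def tri_filter(n_bins, n1, n0, n2):
    """Triangular band-pass filter H_m(k): rises from 0 at bin n1 to a
    peak of 1 at the center bin n0, then falls back to 0 at bin n2."""
    H = np.zeros(n_bins)
    for k in range(n_bins):
        if n1 <= k <= n0:
            H[k] = (k - n1) / (n0 - n1)
        elif n0 < k <= n2:
            H[k] = (n2 - k) / (n2 - n0)
    return H

H = tri_filter(129, 10, 14, 20)  # hypothetical left edge, center, right edge
print(H[14])  # peak of 1.0 at the center bin
```

Stacking M such filters, with each one's edges at its neighbors' centers, produces exactly the overlapping triangles shown in the figure.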
II. MFCC features
MFCC extraction steps:
(1) Split the speech signal into frames.
(2) Take the Fourier transform of each frame and compute its power spectrum.
(3) Pass the short-time power spectrum through the Mel filter bank.
(4) Take the logarithm of the filter-bank coefficients.
(5) Apply the DCT to the log filter-bank coefficients.
(6) Keep the 2nd through 13th cepstral coefficients as the features of the short-time speech signal.
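The six steps above can be sketched end-to-end on a synthetic signal; the sampling rate, frame length, hop size, and filter count below are illustrative, and a simple log floor is added to avoid log(0):

```python
import numpy as np
from scipy.fftpack import dct

fs, wlen, inc, n_mel, n_ceps = 8000, 256, 80, 24, 12  # illustrative parameters
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)  # 1 s synthetic 440 Hz tone

# (1) frame the signal
n_frames = (len(signal) - wlen) // inc + 1
frames = np.stack([signal[i * inc:i * inc + wlen] for i in range(n_frames)])

# (2) power spectrum of each frame
w2 = wlen // 2 + 1
S = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # shape (n_frames, w2)

# (3) pass through a triangular Mel filter bank
mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
imel = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)
edges = imel(np.linspace(mel(0.0), mel(fs / 2), n_mel + 2))  # Hz edges, equal Mel spacing
bins = np.floor(edges / (fs / wlen)).astype(int)             # FFT bin of each edge
bank = np.zeros((n_mel, w2))
for m in range(1, n_mel + 1):
    n1, n0, n2 = bins[m - 1], bins[m], bins[m + 1]
    bank[m - 1, n1:n0 + 1] = (np.arange(n1, n0 + 1) - n1) / max(n0 - n1, 1)
    bank[m - 1, n0:n2 + 1] = (n2 - np.arange(n0, n2 + 1)) / max(n2 - n0, 1)
P = S @ bank.T

# (4) log of the filter-bank energies (small floor avoids log(0))
logP = np.log(P + 1e-10)

# (5) DCT of the log energies, (6) keep coefficients 2..13
mfcc = dct(logP, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
print(mfcc.shape)  # one 12-dimensional feature vector per frame
```

This is only a compact sketch of the pipeline; the full implementation, including endpoint detection, follows below.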
```python
import wave
import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import dct


def read(data_path):
    '''Read a speech signal from a WAV file.'''
    f = wave.open(data_path, 'rb')
    params = f.getparams()
    nchannels, sampwidth, framerate, nframes = params[:4]  # channels, sample width, sampling rate, number of samples
    str_data = f.readframes(nframes)  # read the audio as a byte string
    f.close()
    wavedata = np.frombuffer(str_data, dtype=np.short)  # convert bytes to 16-bit integers
    wavedata = wavedata * 1.0 / max(abs(wavedata))  # normalize the amplitude
    return wavedata, nframes, framerate


def enframe(data, win, inc):
    '''Split the speech signal into frames.
    Input:
        data (1-D array): speech signal
        win (int or 1-D array): window length, or the window itself
        inc (int): hop size between successive frames
    Output:
        f (2-D array): one frame per row
    '''
    nx = len(data)  # length of the speech signal
    try:
        nwin = len(win)
    except TypeError:
        nwin = 1
    wlen = win if nwin == 1 else nwin
    nf = int(np.fix((nx - wlen) / inc) + 1)  # number of frames
    indf = inc * np.arange(nf).reshape(-1, 1)  # start index of each frame
    inds = np.arange(wlen)  # sample offsets within one frame
    f = data[indf + inds]  # fancy indexing builds the frame matrix
    return f


def point_check(wavedata, win, inc):
    '''Endpoint detection for a speech signal.
    Input:
        wavedata (1-D array): original speech signal
    Output:
        the frames between the detected start point and end point
    '''
    # 1. Short-time zero-crossing rate
    FrameTemp1 = enframe(wavedata[0:-1], win, inc)
    FrameTemp2 = enframe(wavedata[1:], win, inc)
    signs = np.sign(np.multiply(FrameTemp1, FrameTemp2))  # adjacent samples with opposite signs cross zero
    signs = list(map(lambda x: [[i, 0][i > 0] for i in x], signs))
    signs = list(map(lambda x: [[i, 1][i < 0] for i in x], signs))
    diffs = np.sign(abs(FrameTemp1 - FrameTemp2) - 0.01)
    diffs = list(map(lambda x: [[i, 0][i < 0] for i in x], diffs))
    zcr = list((np.multiply(signs, diffs)).sum(axis=1))
    # 2. Short-time energy
    amp = list((abs(enframe(wavedata, win, inc))).sum(axis=1))
    # Thresholds
    ZcrLow = max([round(np.mean(zcr) * 0.1), 3])  # low zero-crossing-rate threshold
    ZcrHigh = max([round(max(zcr) * 0.1), 5])  # high zero-crossing-rate threshold
    AmpLow = min([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1])  # low energy threshold
    AmpHigh = max([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1])  # high energy threshold
    # Endpoint detection
    MaxSilence = 8  # longest tolerated gap inside speech (frames)
    MinAudio = 16  # shortest accepted speech segment (frames)
    Status = 0  # 0: silence, 1: transition, 2: speech, 3: finished
    HoldTime = 0  # speech duration (frames)
    SilenceTime = 0  # gap duration (frames)
    print('Start endpoint detection')
    StartPoint = 0
    for n in range(len(zcr)):
        if Status == 0 or Status == 1:
            if amp[n] > AmpHigh or zcr[n] > ZcrHigh:
                StartPoint = n - HoldTime
                Status = 2
                HoldTime = HoldTime + 1
                SilenceTime = 0
            elif amp[n] > AmpLow or zcr[n] > ZcrLow:
                Status = 1
                HoldTime = HoldTime + 1
            else:
                Status = 0
                HoldTime = 0
        elif Status == 2:
            if amp[n] > AmpLow or zcr[n] > ZcrLow:
                HoldTime = HoldTime + 1
            else:
                SilenceTime = SilenceTime + 1
                if SilenceTime < MaxSilence:
                    HoldTime = HoldTime + 1
                elif (HoldTime - SilenceTime) < MinAudio:
                    Status = 0
                    HoldTime = 0
                    SilenceTime = 0
                else:
                    Status = 3
        if Status == 3:
            break
    HoldTime = HoldTime - SilenceTime
    EndPoint = StartPoint + HoldTime
    return FrameTemp1[StartPoint:EndPoint]


def mfcc(FrameK, framerate, win):
    '''Extract MFCC features.
    Input:
        FrameK (2-D array): framed speech signal
        framerate (int): sampling rate
        win (int): frame length (number of FFT points)
    Output:
        power spectrum, Mel filter bank, filter-bank energies, their logarithm, and the DCT coefficients
    '''
    # Mel filter bank
    mel_bank, w2 = mel_filter(24, win, framerate, 0, 0.5)
    FrameK = FrameK.T
    # Power spectrum of each frame
    S = abs(np.fft.fft(FrameK, axis=0)) ** 2
    # Pass the power spectrum through the filter bank
    P = np.dot(mel_bank, S[0:w2, :])
    # Take the logarithm
    logP = np.log(P)
    # DCT of the log energies; keep the 2nd to 13th cepstral coefficients
    num_ceps = 12
    D = dct(logP, type=2, axis=0, norm='ortho')[1:(num_ceps + 1), :]
    return S, mel_bank, P, logP, D


def mel_filter(M, N, fs, l, h):
    '''Build a Mel filter bank.
    Input:
        M (int): number of filters
        N (int): number of FFT points
        fs (int): sampling rate
        l (float): low-frequency coefficient
        h (float): high-frequency coefficient
    Output:
        melbank (2-D array): Mel filter bank
        w2 (int): number of positive-frequency bins
    '''
    fl = fs * l  # lowest frequency covered by the filter bank
    fh = fs * h  # highest frequency covered by the filter bank
    bl = 1125 * np.log(1 + fl / 700)  # convert the band edges to Mel
    bh = 1125 * np.log(1 + fh / 700)
    B = bh - bl  # bandwidth on the Mel scale
    y = np.linspace(0, B, M + 2)  # M + 2 equally spaced points on the Mel scale
    print('Mel points:', y)
    Fb = 700 * (np.exp(y / 1125) - 1)  # convert the Mel points back to Hz
    print(Fb)
    w2 = int(N / 2 + 1)
    df = fs / N
    freq = [int(n * df) for n in range(0, w2)]  # frequency of each FFT bin
    print(freq)
    melbank = np.zeros((M, w2))
    for k in range(1, M + 1):
        f1 = Fb[k - 1]  # left edge of the k-th filter
        f0 = Fb[k]      # center
        f2 = Fb[k + 1]  # right edge
        n1 = np.floor(f1 / df)
        n0 = np.floor(f0 / df)
        n2 = np.floor(f2 / df)
        for i in range(1, w2):
            if n1 <= i <= n0:
                melbank[k - 1, i] = (i - n1) / (n0 - n1)
            if n0 <= i <= n2:
                melbank[k - 1, i] = (n2 - i) / (n2 - n0)
        plt.plot(freq, melbank[k - 1, :])
    plt.show()
    return melbank, w2


if __name__ == '__main__':
    data_path = 'audio_data.wav'
    win = 256
    inc = 80
    wavedata, nframes, framerate = read(data_path)
    FrameK = point_check(wavedata, win, inc)
    S, mel_bank, P, logP, D = mfcc(FrameK, framerate, win)
```
The above is the whole content of this article. I hope it helps you in your study, and I hope you will continue to support developepaer.