Mel frequency cepstrum coefficient (MFCC) and its implementation in Python

Time:2019-11-8

The first step of a speech recognition system is feature extraction. MFCC is a feature that describes the envelope of the short-term power spectrum, and it is widely used in speech recognition systems.

I. Mel filter

Each speech signal is divided into frames, and each frame corresponds to a spectrum (computed with the FFT) that describes the relationship between frequency and signal energy. A Mel filter bank is a set of band-pass filters. On the Mel scale the passbands all have the same width, but on the Hertz scale the filters are dense and narrow at low frequencies and sparse and wide at high frequencies. This spacing simulates the non-linear perception of the human ear, which discriminates more finely between low frequencies than between high ones.

The relationship between Hertz frequency and Mel frequency is as follows (this is the form the code below uses):

mel(f) = 1125 * ln(1 + f / 700)

and the inverse mapping back to Hertz is:

f = 700 * (exp(mel / 1125) - 1)

Assume there are M band-pass filters H_m(k), 0 ≤ m < M, on the Mel scale, and that the center frequency of the m-th filter is f(m). The transfer function of each triangular band-pass filter is:

H_m(k) = 0                                for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))   for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))   for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0                                for k > f(m+1)
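As a quick sanity check of the two conversion formulas above, here is a minimal sketch of the Hz-to-Mel mapping and its inverse. The helper names hz_to_mel and mel_to_hz are illustrative, not from the code below, but the constants 1125 and 700 match those used in mel_filter later in this article:

```python
import numpy as np

def hz_to_mel(f):
    # Hz -> Mel, the natural-log form with the 1125 / 700 constants
    return 1125.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Mel -> Hz, the inverse mapping
    return 700.0 * (np.exp(np.asarray(m, dtype=float) / 1125.0) - 1.0)

# Equal steps on the Mel scale give unequal steps in Hz:
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 6)
edges_hz = mel_to_hz(edges_mel)
# edges_hz starts with narrow low-frequency bands and ends with wide ones
```

Equal spacing on the Mel scale maps to Hz bands that widen with frequency, which is exactly the spacing the filter bank exploits.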

The figure below shows a Mel filter bank of 24 band-pass filters plotted against Hertz frequency (the mel_filter function in the code also plots it):

II. MFCC feature extraction

MFCC coefficients are extracted in the following steps:

(1) Frame the speech signal
(2) Compute the power spectrum of each frame with the Fourier transform
(3) Pass the short-time power spectrum through the Mel filter bank
(4) Take the logarithm of the filter-bank coefficients
(5) Apply the DCT to the log filter-bank coefficients
(6) Keep the 2nd through 13th cepstral coefficients as the features of the short-time speech signal
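The six steps above can be sketched end to end in one short function. This is only an illustrative outline under simplifying assumptions (rectangular window, no pre-emphasis, a small floor on the energies to avoid log(0)); the name mfcc_sketch and its default parameters are made up for this example, and the article's full implementation follows below:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_sketch(signal, fs, frame_len=256, hop=80, n_filters=24, n_ceps=12):
    # (1) frame the signal
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    # (2) power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # (3) triangular Mel filter bank between 0 and fs/2
    mel_pts = np.linspace(0, 1125*np.log(1 + (fs/2)/700), n_filters + 2)
    hz_pts = 700*(np.exp(mel_pts/1125) - 1)
    bins = np.floor(hz_pts / (fs / frame_len)).astype(int)  # FFT bin of each edge
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m-1], bins[m], bins[m+1]
        fbank[m-1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m-1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    energies = power @ fbank.T
    # (4) logarithm, with a floor to avoid log(0)
    log_e = np.log(np.maximum(energies, 1e-10))
    # (5) DCT, then (6) keep coefficients 2..13
    return dct(log_e, type=2, axis=1, norm='ortho')[:, 1:n_ceps+1]
```

One row of the result per frame, one column per retained cepstral coefficient.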

III. Python implementation

import wave
import numpy as np
import math
import matplotlib.pyplot as plt
from scipy.fftpack import dct

def read(data_path):
 ''' Read the speech signal '''
 f = wave.open(data_path, 'rb')
 params = f.getparams()
 nchannels, sampwidth, framerate, nframes = params[:4] #channels, sample width, sampling rate, number of frames
 str_data = f.readframes(nframes) #read the audio as a byte string
 f.close()
 wavedata = np.frombuffer(str_data, dtype=np.int16) #convert the bytes to 16-bit integer samples
 wavedata = wavedata * 1.0 / np.max(np.abs(wavedata)) #normalize the amplitude to [-1, 1]
 return wavedata, nframes, framerate

def enframe(data,win,inc):
 ''' Split the speech data into frames
 Input: data (1-D array): speech signal
   win (int or array): window length, or the window itself
   inc (int): number of samples the window advances each frame
 Output: f (2-D array): one frame of data per row
 '''
 nx = len(data) #length of the speech signal
 try:
  nwin = len(win)
 except Exception as err:
  nwin = 1 
 if nwin == 1:
  wlen = win
 else:
  wlen = nwin
 nf = int(np.fix((nx - wlen) / inc) + 1) #number of frames (window positions)
 indf = [inc * j for j in range(nf)]
 indf = (np.mat(indf)).T
 inds = np.mat(range(wlen))
 indf_tile = np.tile(indf,wlen)
 inds_tile = np.tile(inds,(nf,1))
 mix_tile = indf_tile + inds_tile
 f = np.zeros((nf,wlen))
 for i in range(nf):
  for j in range(wlen):
   f[i,j] = data[mix_tile[i,j]]
 return f

def point_check(wavedata,win,inc):
 ''' Endpoint detection for the speech signal
 Input: wavedata (1-D array): original speech signal
   win (int): window length
   inc (int): hop length
 Output: the framed signal between the detected start and end points
 '''
 #1. Calculate the short-time zero crossing rate
 FrameTemp1 = enframe(wavedata[0:-1],win,inc)
 FrameTemp2 = enframe(wavedata[1:],win,inc)
 signs = np.sign(np.multiply(FrameTemp1, FrameTemp2)) #adjacent samples with opposite signs mark a zero crossing
 signs = list(map(lambda x:[[i,0] [i>0] for i in x],signs))
 signs = list(map(lambda x:[[i,1] [i<0] for i in x], signs))
 diffs = np.sign(abs(FrameTemp1 - FrameTemp2)-0.01)
 diffs = list(map(lambda x:[[i,0] [i<0] for i in x], diffs))
 zcr = list((np.multiply(signs, diffs)).sum(axis = 1))
 #2. Calculate short-term energy
 amp = list((abs(enframe(wavedata,win,inc))).sum(axis = 1))
 #Set the thresholds
 ZcrLow = max([round(np.mean(zcr) * 0.1), 3]) #low threshold of the zero-crossing rate
 ZcrHigh = max([round(max(zcr) * 0.1), 5]) #high threshold of the zero-crossing rate
 AmpLow = min([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1]) #low energy threshold
 AmpHigh = max([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1]) #high energy threshold
 #Endpoint detection
 MaxSilence = 8 #longest allowed gap inside an utterance, in frames
 MinAudio = 16 #shortest allowed utterance, in frames
 Status = 0 #0: silence, 1: transition, 2: speech, 3: end
 HoldTime = 0 #speech duration, in frames
 SilenceTime = 0 #gap duration, in frames
 print('Starting endpoint detection')
 StartPoint = 0
 for n in range(len(zcr)):
  if Status ==0 or Status == 1:
   if amp[n] > AmpHigh or zcr[n] > ZcrHigh:
    StartPoint = n - HoldTime
    Status = 2
    HoldTime = HoldTime + 1
    SilenceTime = 0
   elif amp[n] > AmpLow or zcr[n] > ZcrLow:
    Status = 1
    HoldTime = HoldTime + 1
   else:
    Status = 0
    HoldTime = 0
  elif Status == 2:
   if amp[n] > AmpLow or zcr[n] > ZcrLow:
    HoldTime = HoldTime + 1
   else:
    SilenceTime = SilenceTime + 1
    if SilenceTime < MaxSilence:
     HoldTime = HoldTime + 1
    elif (HoldTime - SilenceTime) < MinAudio:
     Status = 0
     HoldTime = 0
     SilenceTime = 0
    else:
     Status = 3
  elif Status == 3:
   break
  if Status == 3:
   break
 HoldTime = HoldTime - SilenceTime
 EndPoint = StartPoint + HoldTime
 return FrameTemp1[StartPoint:EndPoint]


def mfcc(FrameK,framerate,win):
 ''' Extract the MFCC parameters
 Input: FrameK (2-D array): framed speech signal
   framerate (int): sampling rate
   win (int): frame length (number of FFT points)
 Output: power spectrum, Mel filter bank, filter-bank energies, their logarithm, and the MFCC matrix D
 '''
 #Mel filter
 mel_bank,w2 = mel_filter(24,win,framerate,0,0.5)
 FrameK = FrameK.T
 #Calculate power spectrum
 S = abs(np.fft.fft(FrameK,axis = 0)) ** 2
 #Pass the power spectrum through the filter
 P = np.dot(mel_bank,S[0:w2,:])
 #Take the logarithm
 logP = np.log(P)
 #Calculate DCT coefficient
# rDCT = 12
# cDCT = 24
# dctcoef = []
# for i in range(1,rDCT+1):
#  tmp = [np.cos((2*j+1)*i*math.pi*1.0/(2.0*cDCT)) for j in range(cDCT)]
#  dctcoef.append(tmp)
# Cosine transform applied after taking the logarithm
# D = np.dot(dctcoef,logP)
 num_ceps = 12
 D = dct(logP,type = 2,axis = 0,norm = 'ortho')[1:(num_ceps+1),:]
 return S,mel_bank,P,logP,D
 


def mel_filter(M,N,fs,l,h):
 ''' Build the Mel filter bank
 Input: M (int): number of filters
   N (int): number of FFT points
   fs (int): sampling rate
   l (float): low-frequency coefficient (fraction of fs)
   h (float): high-frequency coefficient (fraction of fs)
 Output: melbank (2-D array): the Mel filter bank
   w2 (int): number of positive-frequency FFT bins
 '''
 fl = fs * l #lowest frequency covered by the filter bank
 fh = fs * h #highest frequency covered by the filter bank
 bl = 1125 * np.log(1 + fl / 700) #lowest frequency on the Mel scale
 bh = 1125 * np.log(1 + fh / 700) #highest frequency on the Mel scale
 B = bh - bl #Mel-scale bandwidth
 y = np.linspace(0, B, M + 2) #equally spaced points on the Mel scale
 print('Mel spacing:', y)
 Fb = 700 * (np.exp(y / 1125) - 1) #convert the Mel points back to Hz
 print(Fb)
 w2 = int(N / 2 + 1)
 df = fs / N
 freq = [] #frequency in Hz of each FFT bin
 for n in range(0,w2):
  freqs = int(n * df)
  freq.append(freqs)
 melbank = np.zeros((M,w2))
 print(freq)
 
 for k in range(1,M+1):
  f1 = Fb[k - 1]
  f2 = Fb[k + 1]
  f0 = Fb[k]
  n1 = np.floor(f1/df)
  n2 = np.floor(f2/df)
  n0 = np.floor(f0/df)
  for i in range(1,w2):
   if i >= n1 and i <= n0:
    melbank[k-1,i] = (i-n1)/(n0-n1)
   if i >= n0 and i <= n2:
    melbank[k-1,i] = (n2-i)/(n2-n0)
  plt.plot(freq,melbank[k-1,:])
 plt.show()
 return melbank,w2

if __name__ == '__main__':
 data_path = 'audio_data.wav'
 win = 256
 inc = 80
 wavedata,nframes,framerate = read(data_path)
 FrameK = point_check(wavedata,win,inc)
 S,mel_bank,P,logP,D = mfcc(FrameK,framerate,win)

That is the whole content of this article. I hope it helps you in your study, and that you will continue to support developepaer.
