Deep Neural Network-Chinese Speech Recognition


1. Background introduction

Speech is the most natural way for humans to interact. Since the invention of the computer, it has been a long-standing goal to make machines "understand" human speech, grasp the meaning of language, and respond correctly. This process mainly relies on three technologies: automatic speech recognition (ASR), natural language processing (NLP), and speech synthesis (SS). The purpose of speech recognition technology is to make the machine understand the human voice; it is a typical interdisciplinary task.

2. Overview

The model of a speech recognition system consists of two parts: the acoustic model and the language model. The acoustic model computes the probability of mapping speech to phonemes, and the language model computes the probability of mapping phonemes to text.

A continuous speech recognition system is composed of four parts: feature extraction, the acoustic model, the language model, and the decoder. The process is, first, to extract acoustic features from the speech data; then, to obtain an acoustic model through statistical training, which serves as a template for recognition; and finally, to produce a recognition result by decoding with the language model.

The acoustic model is a key part of a speech recognition system. Its function is to describe the feature sequences generated by acoustic units and to classify speech signals. With the acoustic model we can calculate the probability that the observed feature vectors belong to each acoustic unit, and transform the feature sequence into a state sequence according to the likelihood criterion.
The data set used here is the THCHS30 Chinese speech data set from Tsinghua University.

2.1 Feature Extraction

Neural networks cannot take raw audio directly as training input, so the first step is to extract features from the audio data. Common feature extraction is based on the human vocal mechanism and auditory perception, recognizing the essence of sound from how it is produced and how it is heard.
Some commonly used acoustic features are as follows:

(1) Linear Prediction Coefficients (LPC). Linear prediction analysis simulates the human vocal principle by modeling the vocal tract as a cascade of short tubes. Assuming that the system's transfer function is that of an all-pole digital filter, the features of a speech signal can usually be described with 12 to 16 poles. The speech signal at time n can then be approximated by a linear combination of the signals at previous times, and the LPC coefficients are obtained by minimizing the mean square error (MSE) between the actual speech samples and the linearly predicted samples.
(2) Perceptual Linear Prediction (PLP), a feature parameter based on an auditory model. The parameters are the coefficients of the prediction polynomial of an all-pole model, a feature equivalent in form to LPC. The difference is that PLP is based on the auditory properties of the human ear, applied through computation to spectrum analysis: the input speech signal is processed by the auditory model, which replaces the time-domain signal used by LPC. This is conducive to extracting noise-robust speech features.

(3) Mel-Frequency Cepstral Coefficients (MFCC). MFCC is also based on auditory characteristics: the Mel-frequency cepstrum divides the frequency bands equidistantly on the Mel scale, and the logarithmic relationship between Mel-scale values and actual frequency matches the auditory characteristics of the human ear more closely, so the speech signal can be represented better.

(4) Filter-bank (Fbank) features. Fbank feature extraction is equivalent to MFCC with the final DCT step removed; compared with the MFCC feature, the Fbank feature therefore retains more of the original voice data.

(5) Spectrogram, i.e. the speech spectrum map. It is usually obtained by processing the received time-domain signal, so a spectrogram can be produced as long as enough time-domain data is available. Its strength is that one can observe the signal intensity in different frequency bands of the speech and how it changes over time.
In this article we use the spectrogram as the feature input and train a CNN on it, as in image processing. A spectrogram can be understood as spectra stacked over a period of time, so the main steps of extraction are: frame division, windowing, and the fast Fourier transform (FFT).
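As a small concrete piece of point (3) above, the logarithmic relationship between the Mel scale and actual frequency is commonly written as mel = 2595 · log10(1 + f/700). A minimal sketch of the conversion (illustrative only, not part of the project's feature pipeline, since this tutorial uses spectrograms):

```python
import numpy as np

# Mel <-> Hz conversion using the common 2595 * log10(1 + f/700) formula,
# illustrating the logarithmic Mel-scale relationship described above.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Low frequencies are nearly linear on the Mel scale while high frequencies are compressed, which is what makes Mel-spaced filter banks match human hearing better than linearly spaced ones.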

2.1.1 Read Audio

In the first step, we need to see how to use the SciPy module to convert audio into useful data: fs is the sampling frequency and wavsignal is the voice data. The fs of our data set is always 16 kHz.

import scipy.io.wavfile as wav

filepath = 'test.wav'

fs, wavsignal = wav.read(filepath)

2.1.2 Frame-dividing and windowing

A speech signal is non-stationary macroscopically but stationary microscopically: it has short-term stationarity (within 10-30 ms the signal can be considered approximately invariant for the pronunciation of one phoneme). A frame length of 25 ms is generally used.
To process the speech signal, we window it, i.e. we only process the data inside the window at any one time. Because the actual speech signal is very long, we cannot, and need not, process all of it at once; the sensible solution is to take one segment of data, analyze it, then take the next segment and analyze that. Our windowing operation uses the Hamming window: the data in a frame is multiplied by a window function to obtain a new frame of data. The formula is given below.
Why can we just take a segment of data? Because we will then process the data in the Hamming window with the fast Fourier transform (FFT), which assumes that the signal inside the window represents one period of a periodic signal (that is, the left and right ends of the window can be joined approximately continuously). A short section of raw audio usually has no obvious periodicity, but after the Hamming window is applied, the data becomes much closer to a periodic function.
Because the Hamming window emphasizes the data in the middle and attenuates the data at both ends, overlapping parts are kept as the window moves: with a 25 ms window, a step of 10 ms can be used (the value of a is usually 0.46).
The formula is (with N the window length, n = 0, ..., N - 1, and a usually taken as 0.46):

W(n) = (1 - a) - a * cos(2 * pi * n / (N - 1))

The code is:

import numpy as np
import matplotlib.pyplot as plt

# Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), with N = 400
x = np.linspace(0, 400 - 1, 400, dtype=np.int64)  # evenly spaced integers in the interval
w = 0.54 - 0.46 * np.cos(2 * np.pi * x / (400 - 1))

time_window = 25  # frame length in ms
window_length = fs // 1000 * time_window  # 400 samples at 16 kHz

# Frame division: take one frame of data
p_begin = 0
p_end = p_begin + window_length
frame = wavsignal[p_begin:p_end]

plt.figure(figsize=(15, 5))
ax4 = plt.subplot(121)
ax4.set_title('the original picture of one frame')
ax4.plot(frame)

# Windowing: multiply the frame by the Hamming window
frame = frame * w
ax5 = plt.subplot(122)
ax5.set_title('after hamming')
ax5.plot(frame)
plt.show()

The effect is as follows (figure: one frame before and after applying the Hamming window).

2.1.3 Fast Fourier Transform (FFT)

The characteristics of a speech signal are hard to see in the time domain, so it is usually converted to an energy distribution in the frequency domain. We therefore apply the fast Fourier transform to each windowed frame to convert the time-domain data into the spectrum of that frame, and then stack the spectra of successive windows to obtain the spectrogram.

The code is:

from scipy.fftpack import fft

# Fast Fourier transform; keep the first half of the 400-point spectrum
frame_fft = np.abs(fft(frame))[:200]

# Take the logarithm to obtain a dB-like scale
frame_log = np.log(frame_fft)
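Putting the three steps together (framing, Hamming windowing, FFT), the whole spectrogram extraction can be sketched as one function. This is an illustrative rewrite of the snippets above, assuming a 16 kHz mono signal:

```python
import numpy as np

# Sketch of the full spectrogram pipeline: frame division, Hamming window, FFT.
# Assumes a 16 kHz signal, 25 ms frames (400 samples) with a 10 ms step.
def spectrogram(wavsignal, fs=16000, frame_ms=25, step_ms=10):
    frame_len = fs // 1000 * frame_ms   # 400 samples per frame
    step = fs // 1000 * step_ms         # 160-sample hop between frames
    w = np.hamming(frame_len)           # Hamming window coefficients
    frames = []
    for start in range(0, len(wavsignal) - frame_len + 1, step):
        frame = wavsignal[start:start + frame_len] * w
        # Keep the first half of the magnitude spectrum (real input is symmetric)
        frames.append(np.abs(np.fft.fft(frame))[:frame_len // 2])
    # Log scale; the small constant avoids log(0)
    return np.log(np.array(frames) + 1e-10)
```

The result is a 2-D array of shape (number of frames, 200), which is exactly the image-like input the CNN acoustic model below consumes.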

2.2 CTC(Connectionist Temporal Classification)

In speech recognition, suppose we have a data set of audio clips with corresponding transcriptions, but we do not know how the transcribed characters align with the phonemes in the audio; this greatly increases the difficulty of training, and without adjusting the data, some simple methods cannot be used. The first option is to make a rule such as "one character corresponds to ten phoneme inputs," but people speak at different speeds, so such a rule is easily broken. The second option, manually aligning the position of each character in the audio, is better for model performance, because we then know the ground truth at every input time step. But its drawback is also obvious: even for a data set of modest size, it is extremely time-consuming. In fact, inaccurate hand-made rules and lengthy manual alignment are problems not only in speech recognition but also in other tasks, such as handwriting recognition and adding action markers to video.
This is exactly where CTC comes in. CTC is a good way to let the network learn the alignment automatically, which makes it well suited to speech recognition and handwriting recognition. More formally, we map an input sequence (audio) X = [x1, x2, ..., xT] to an output sequence (transcription) Y = [y1, y2, ..., yU]; aligning characters with phonemes is then equivalent to establishing an accurate mapping between X and Y. Details can be found in the classic CTC papers.
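CTC makes this work by allowing a blank symbol and repeated labels in the per-frame network output, then collapsing each output path: consecutive repeats are merged first, then blanks are removed. A minimal sketch of that collapsing rule (illustrative only, using 0 as the blank index):

```python
# Collapse a CTC output path: merge consecutive repeats, then drop blanks.
# E.g. with blank=0, the path [1, 1, 0, 1, 2, 2, 0] collapses to [1, 1, 2]:
# the blank between the two 1s keeps them from being merged into one.
def ctc_collapse(path, blank=0):
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out
```

Many different paths collapse to the same transcription; the CTC loss sums the probabilities of all of them, which is what frees us from hand-aligning characters to frames.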
Loss function part code:

def ctc_lambda(args):
    labels, y_pred, input_length, label_length = args
    y_pred = y_pred[:, :, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

Decoding code:

# num_result is the model's prediction; num2word is the corresponding Pinyin list
def decode_ctc(num_result, num2word):
    result = num_result[:, :, :]
    in_len = np.zeros((1), dtype=np.int32)
    in_len[0] = result.shape[1]
    r = K.ctc_decode(result, in_len, greedy=True, beam_width=10, top_paths=1)
    r1 = K.get_value(r[0][0])
    r1 = r1[0]
    text = []
    for i in r1:
        text.append(num2word[i])
    return r1, text

3. Acoustic Model

The model mainly uses a CNN to process the spectrogram image, extracts the main features with max pooling, and trains with the CTC loss function defined above. Given the inputs and labels, the model architecture can be designed freely, and any change that improves accuracy is desirable; readers interested in LSTMs and other network structures can add them as well. There is already plenty of material on CNNs and pooling operations, which will not be repeated here; interested readers can refer to the earlier article on the convolutional neural network AlexNet.

class Amodel():
    """docstring for Amodel."""
    def __init__(self, vocab_size):
        super(Amodel, self).__init__()
        self.vocab_size = vocab_size
        # Build the model, the CTC branch, and the optimizer on construction
        self._model_init()
        self._ctc_init()
        self.opt_init()

    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.h1)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        # 200 / 8 * 128 = 3200
        self.h6 = Reshape((-1, 3200))(self.h4)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size, activation='softmax')(self.h7)
        self.model = Model(inputs=self.inputs, outputs=self.outputs)

    def _ctc_init(self):
        self.labels = Input(name='the_labels', shape=[None], dtype='float32')
        self.input_length = Input(name='input_length', shape=[1], dtype='int64')
        self.label_length = Input(name='label_length', shape=[1], dtype='int64')
        self.loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')\
            ([self.labels, self.outputs, self.input_length, self.label_length])
        self.ctc_model = Model(inputs=[self.labels, self.inputs,
            self.input_length, self.label_length], outputs=self.loss_out)

    def opt_init(self):
        opt = Adam(lr = 0.0008, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01, epsilon = 10e-8)
        self.ctc_model.compile(loss={'ctc': lambda y_true, output: output}, optimizer=opt)
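To make the max-pooling step concrete, here is a toy 2x2 max pooling in plain NumPy (an illustration only, independent of the Keras layers used above): each non-overlapping 2x2 block of the feature map is reduced to its maximum, halving both spatial dimensions, which is why three pooled `cnn_cell` blocks shrink the 200 frequency bins to 200 / 8 = 25.

```python
import numpy as np

# Toy 2x2 max pooling: keep the maximum of each non-overlapping 2x2 block.
def max_pool_2x2(x):
    # Trim odd edges so the array splits evenly into 2x2 blocks
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    # Reshape to (h/2, 2, w/2, 2) and take the max over each 2x2 block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```

For example, pooling `np.arange(16).reshape(4, 4)` keeps the bottom-right value of each block's maximum, yielding a 2x2 result.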

4. Language Model

4.1 Introduction to Statistical Language Model

A statistical language model (SLM) is the foundation of natural language processing. It is a mathematical model with context-dependent characteristics, essentially a kind of probabilistic graph model. It is widely used in machine translation, speech recognition, Pinyin input, optical character recognition, spelling correction, error detection, search engines, and so on. In many tasks, the computer needs to know whether a sequence of words forms a meaningful sentence that everyone understands and that contains no wrong words. Consider these sentences:

Many people may not know exactly what machine learning is, but it has actually become an indispensable part of our daily life.
It is not clear what machine learning is for many people, and it has become an indispensable part of our daily life.
It's not clear to many people what machine learning is, but it's not important or essential for us to survive.

The first sentence conforms to grammatical norms and its meaning is clear. The meaning of the second is still clear, though the wording is awkward. The meaning of the third is vague. This is how the sentences are understood from a rule-based perspective, the approach scientists took before the 1970s. Later, Jelinek solved the problem with a simple statistical model. From a statistical point of view, the probability of the first sentence is the highest, the second lower, and the third the smallest; under such a model, the probability of the first sentence appearing is about ten times that of the second, let alone the third, so the first sentence best matches common sense.

4.2 Model Establishment

Suppose S is a sentence composed of a sequence of words w1, w2, ..., wn. The probability that sentence S occurs is:

P(S) = P(w1) * P(w2|w1) * P(w3|w1, w2) * ... * P(wn|w1, w2, ..., wn-1)

Due to the limits of computer memory and computing power, we obviously need a more reasonable way to compute this. Generally speaking, considering only the preceding word already gives fairly good accuracy; in practical use, considering the two preceding words is sufficient, and only rarely are the three preceding words considered. Therefore, we can adopt the following (bigram) approximation:

P(S) ≈ P(w1) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1)

Each conditional probability P(wi|wi-1) can then be estimated from word frequencies counted in crawled data: P(wi|wi-1) ≈ count(wi-1, wi) / count(wi-1).
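As a toy illustration of this counting approach (hypothetical helper names, not code from the project), a bigram model can be trained by counting unigram and bigram frequencies over a corpus:

```python
from collections import Counter

# Minimal bigram language model sketch: count word and word-pair frequencies.
# '<s>' marks the start of each sentence.
def train_bigram(sentences):
    unigram, bigram = Counter(), Counter()
    for s in sentences:
        tokens = ['<s>'] + s.split()
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))
    return unigram, bigram

def prob(w_prev, w, unigram, bigram):
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigram[(w_prev, w)] / unigram[w_prev] if unigram[w_prev] else 0.0
```

A real system would also smooth these counts (e.g. add-one or back-off smoothing) so that unseen word pairs do not get probability zero.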

4.3 Implementation of Pinyin to Text

The algorithm for converting Pinyin to Chinese characters is dynamic programming, essentially the same as the shortest-path algorithm. We can regard Chinese input as a communication problem: each Pinyin syllable can correspond to multiple Chinese characters, while each Chinese character is read with only one pronunciation at a time. If we connect the candidate characters of each syllable from left to right, we obtain a directed graph.
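The dynamic-programming search over this directed graph can be sketched as a Viterbi-style pass that keeps, for each candidate character of the current syllable, the best-scoring path so far. The candidate lists and probabilities below are hypothetical toy values, not the project's actual tables:

```python
# Viterbi-style dynamic programming over the Pinyin candidate graph.
# candidates: pinyin -> list of candidate characters
# start_p: char -> P(char) for the first syllable
# trans_p: (prev_char, char) -> P(char | prev_char), the bigram probability
def viterbi(pinyins, candidates, start_p, trans_p):
    # best[c] = (probability of the best path ending in c, that path)
    best = {c: (start_p.get(c, 1e-9), [c]) for c in candidates[pinyins[0]]}
    for py in pinyins[1:]:
        new_best = {}
        for c in candidates[py]:
            # Extend every previous path to c and keep the most probable one
            p, path = max(
                (bp * trans_p.get((prev, c), 1e-9), bpath)
                for prev, (bp, bpath) in best.items()
            )
            new_best[c] = (p, path + [c])
        best = new_best
    return max(best.values())[1]
```

Keeping only the best path per node at each step is what reduces the exponential number of character sequences to a cost linear in the sentence length, exactly as in shortest-path search.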

5. Model testing

Acoustic model test:
(figure: acoustic model test output)
Language model test:
(figure: language model test output)
Because the model is simple and the data set is too small, the effect of the model is not very good.
Project source address:

6. References

Paper: Research Progress and Prospect of Speech Recognition Technology
Blog: ASRT_SpeechRecognition
Blog: DeepSpeechRecognition

About us

Mo is a Python-enabled artificial intelligence online modeling platform that can help you develop, train, and deploy models quickly.

The Mo Artificial Intelligence Club is a club initiated by the website's R&D and product design team, dedicated to lowering the threshold for developing and using AI. The team has experience in big data processing, analysis, visualization, and data modeling; has undertaken intelligent projects in multiple domains; and has design and development capabilities from the back end to the front end. Its main research directions are big data management and analysis and artificial intelligence technology, promoting data-driven scientific research.

At present, the club holds an offline technology salon in Hangzhou every Saturday, themed on machine learning, and organizes paper sharing and academic exchanges from time to time. We hope to gather friends from all walks of life who are interested in AI, to exchange ideas and grow together, and to promote the democratization and popularization of AI.