Starting from the encoder-decoder model: exploring solutions to contextual biasing

Time: 2021-07-24

Summary: In this paper, we present CLAS, an all-neural, end-to-end contextual ASR model that incorporates context information by embedding a list of context phrases. In our experimental evaluation, the proposed CLAS model outperformed the standard shallow-fusion biasing method.

This article is shared from the Huawei Cloud community article《How to solve contextual biasing? The road to domain-specific end-to-end ASR (II)》, original author: xiaoye0829.

Here we introduce a work on domain-specific end-to-end ASR, "Deep Context: End-to-End Contextual Speech Recognition", which comes from the same research team at Google.

In ASR, what a user says depends on the context they are in, and this context can often be represented by a list of n-gram phrases. In this work, the authors study how to exploit such context information in an end-to-end model. The core approach can be viewed as a contextual LAS [1] model: CLAS builds on LAS and incorporates embeddings of the n-gram phrases, in contrast to performing shallow fusion of an independently trained n-gram model with the LAS model during beam search.

In this paper, the authors consider dynamically integrating context information into the recognition process. In traditional ASR systems, a mainstream way to integrate context is an independently trained on-the-fly rescoring framework, which dynamically boosts the weights of a small set of n-grams relevant to a particular scenario. Extending this technique to seq2seq ASR models is important. To bias the recognition process toward a specific task, previous work has tried to integrate an independent LM into decoding, usually via shallow fusion or cold fusion. In [2], shallow fusion is used to build a contextual LAS: the output probabilities of LAS are adjusted by a special WFST built from the speaker's context, which improves accuracy.

Previous work used an externally and independently trained LM for on-the-fly rescoring, which runs counter to the benefit of jointly optimizing a seq2seq model. In this paper, we therefore propose contextual LAS (CLAS), which takes a list of context phrases (bias phrases) to improve recognition. Our method first maps each phrase to a fixed-dimensional embedding, and then uses an attention mechanism to summarize the available context at each step of the model's output prediction. Our method can be seen as a generalization of streaming keyword spotting [3], in that it allows a variable number of context phrases at inference time. The proposed model needs no specific context information during training, requires no careful tuning of rescoring weights, and can still handle OOV terms.

Next, this article explains the standard LAS model, the standard contextual-biasing approach for LAS, and the CLAS model proposed in this paper.

LAS is a seq2seq model consisting of an encoder and a decoder with an attention mechanism. When decoding each output symbol, the attention mechanism dynamically computes a weight for every encoder hidden state and forms the current attention vector as their weighted linear combination. The input x of the model is the speech signal, and the output y is a sequence of graphemes (English characters: a–z, 0–9, <space>, <comma>, <period>, <apostrophe>, <unk>).

The output of LAS is the following formula:
P(y_t | x, y_{<t}) = softmax(W_s [c_t ; d_t] + b_s)

This formula depends on the encoder state vectors h^x, the decoder hidden state d_t, and the context vector c_t; c_t is computed by an attention mechanism that aggregates the encoder states conditioned on the decoder state.
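To make the decoder step concrete, here is a minimal PyTorch sketch of one LAS decoding step with additive attention. This is an illustrative sketch under my own naming and dimension assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One LAS decoder step: attend over encoder states h_x, then predict y_t."""
    def __init__(self, enc_dim, dec_dim, att_dim, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(vocab_size + enc_dim, dec_dim)
        self.w_h = nn.Linear(enc_dim, att_dim, bias=False)   # projects encoder states
        self.w_d = nn.Linear(dec_dim, att_dim, bias=True)    # projects decoder state
        self.v = nn.Linear(att_dim, 1, bias=False)           # attention energy
        self.out = nn.Linear(dec_dim + enc_dim, vocab_size)  # softmax layer over graphemes

    def forward(self, y_prev, c_prev, state, h_x):
        # y_prev: one-hot previous label (B, vocab); h_x: encoder states (B, T, enc_dim).
        d_t, cell_t = self.cell(torch.cat([y_prev, c_prev], dim=-1), state)
        # Additive attention energies over input frames.
        u = self.v(torch.tanh(self.w_h(h_x) + self.w_d(d_t).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(u, dim=-1)                     # weights over frames
        c_t = (alpha.unsqueeze(-1) * h_x).sum(dim=1)         # context vector c_t
        # P(y_t | x, y_<t) = softmax(W_s [c_t ; d_t] + b_s)
        logits = self.out(torch.cat([d_t, c_t], dim=-1))
        return logits, c_t, (d_t, cell_t)
```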

In the standard contextual-biasing approach for LAS, we assume that a list of word-level bias phrases is known in advance and compiled into a WFST. This word-level WFST G is then composed with a "speller" FST S, which converts a sequence of graphemes or wordpieces into the corresponding word. This yields a contextual language model C = min(det(S ∘ G)). The score P_C(y) from this contextual LM is then used during decoding to boost the standard log-probability term:
y* = argmax_y [ log P(y | x) + λ log P_C(y) ]

Here, λ is a tunable parameter that controls the influence of the contextual language model on the overall model score. The contextual score in this formula is applied only at the word level, as shown in the figure below:
[Figure: example biasing WFSTs; weights are applied at the word level, while panel (c) pushes weights onto subword units and adds subtractive costs]

Consequently, if the relevant word never enters the beam, this technique cannot improve the result. Moreover, we observed that although this method works well when the number of context phrases is small (e.g., "yes", "no", "cancel"), it performs poorly when the context list contains many proper nouns (such as song names and contacts). Therefore, as shown in panel (c) of the figure above, we explore pushing the weight onto the subword units of each word. To avoid artificially boosting prefixes that match the beginning of a phrase but not the whole phrase, we also include a subtractive cost, shown as the negative weights in panel (c).
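As a toy illustration of subword-level biasing with a subtractive cost (the phrase list, tokenization, and weight below are all made up for the example):

```python
# Toy sketch of subword-level shallow-fusion biasing during beam search.
# BIAS_WEIGHT plays the role of lambda in: score = log P(y|x) + lambda * log P_C(y).

BIAS_PHRASES = [["pla", "y_", "song"]]  # hypothetical subword tokenization of "play song"
BIAS_WEIGHT = 2.0

def bias_bonus(prev_tokens, next_token):
    """Score increment for extending a hypothesis by next_token.

    Weight is pushed onto every matched subword; if the extension breaks a
    partial match, a subtractive cost cancels the bonus collected so far.
    """
    for phrase in BIAS_PHRASES:
        # Longest phrase prefix that the hypothesis currently matches.
        matched = 0
        for k in range(min(len(prev_tokens), len(phrase)), 0, -1):
            if prev_tokens[-k:] == phrase[:k]:
                matched = k
                break
        if matched == len(phrase):
            continue                       # phrase fully matched, bonus already paid
        if next_token == phrase[matched]:
            return BIAS_WEIGHT             # reward the next matched subword
        if matched > 0:
            return -BIAS_WEIGHT * matched  # subtractive cost for a failed prefix
    return 0.0

# In the beam search loop: total_score = log_p_las + bias_bonus(hyp_tokens, token)
```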

Next, we introduce the contextual LAS (CLAS) model proposed in this paper, which models P(y | x, z) using the additional context provided by a list of bias phrases z. Each element of z is a phrase relevant to the particular context, such as a contact or a song name. We write the bias phrases as z = {z_1, z_2, …, z_N}. These phrases are meant to bias the model toward outputting them, but not every bias phrase is relevant to the utterance being processed: the model must decide which phrases are likely to be relevant and use them to modify its target output distribution.

We augment LAS with a bias encoder that encodes the phrases into h^z = {h_0^z, h_1^z, …, h_N^z}; the superscript z distinguishes these from the audio-related vectors, and h_i^z is the embedding of z_i. Since all bias phrases may be irrelevant to the current utterance, we include an additional learnable vector h_0^z = h_nb, corresponding to the no-bias option, i.e., using no bias phrase for the output; this option lets the model ignore all bias phrases. The bias encoder is a multi-layer LSTM network: to obtain h_i^z, the embedding sequence of z_i's subwords is fed into the bias encoder, and the last LSTM state is used as the feature for the whole phrase. An additional attention mechanism then computes c_t^z with the formulas below, and the vector fed into the decoder becomes c_t = [c_t^x ; c_t^z]. All other parts are the same as in the standard LAS model.
u_{it}^z = v^T tanh(W_h h_i^z + W_d d_t + b_a)
α_t^z = softmax(u_t^z)
c_t^z = Σ_{i=0..N} α_{it}^z h_i^z

It is worth noting that the formulas above explicitly model the probability of each particular phrase being relevant at the current step, given the audio and the previously emitted outputs.
[Figure: the CLAS architecture, with the bias encoder and bias attention alongside the audio encoder and decoder]
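A minimal PyTorch sketch of the bias encoder and bias attention described above; module names and dimensions are my own illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn

class BiasEncoder(nn.Module):
    """Embeds each bias phrase with an LSTM; h_0^z is a learned no-bias vector."""
    def __init__(self, num_subwords, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(num_subwords, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.no_bias = nn.Parameter(torch.zeros(hid_dim))     # h_nb: "use no bias phrase"

    def forward(self, phrases):
        # phrases: list of LongTensors, one subword-id sequence per bias phrase.
        encodings = [self.no_bias]
        for z in phrases:
            _, (h_last, _) = self.lstm(self.embed(z).unsqueeze(0))
            encodings.append(h_last[-1, 0])                   # last LSTM state = phrase vector
        return torch.stack(encodings)                         # (N+1, hid_dim) = h^z

class BiasAttention(nn.Module):
    """Summarizes h^z at each decoder step; alpha_t^z scores each phrase's relevance."""
    def __init__(self, hid_dim, dec_dim, att_dim):
        super().__init__()
        self.w_h = nn.Linear(hid_dim, att_dim, bias=False)
        self.w_d = nn.Linear(dec_dim, att_dim, bias=True)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, h_z, d_t):
        # u_it^z = v^T tanh(W_h h_i^z + W_d d_t + b_a), for an unbatched d_t.
        u = self.v(torch.tanh(self.w_h(h_z) + self.w_d(d_t))).squeeze(-1)
        alpha = torch.softmax(u, dim=-1)                      # attention over phrases
        c_z = alpha @ h_z                                     # bias context vector c_t^z
        return c_z                                            # decoder input: [c_t^x ; c_t^z]
```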

Now let's look at the experiments. They are conducted on 25,000 hours of English data. The training utterances are artificially corrupted using a room simulator, adding noise and reverberation of varying intensity so that the signal-to-noise ratio lies between 0 and 30 dB. The noise sources come from YouTube and from recordings of everyday noisy environments. The encoder consists of 10 layers of unidirectional LSTM with 256 units per layer. The bias encoder is a single-layer LSTM with 512 units. The decoder consists of 4 LSTM layers with 256 units per layer. The test sets of the experiments are as follows:
[Table: test sets used in the experiments]
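For reference, the layer sizes reported above can be collected into a small config sketch (the field names are mine; the values are from the text above):

```python
from dataclasses import dataclass

@dataclass
class CLASConfig:
    # Sizes as reported in the experimental setup.
    encoder_layers: int = 10       # unidirectional LSTM
    encoder_units: int = 256
    bias_encoder_layers: int = 1
    bias_encoder_units: int = 512
    decoder_layers: int = 4
    decoder_units: int = 256
```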

First, to test whether the biasing module we introduced hurts decoding when no bias phrases are present, we compared CLAS with the plain LAS model. The CLAS model was trained with randomly sampled bias phrases, but no bias phrases were provided at test time. Somewhat unexpectedly, CLAS achieved better performance than LAS even without any bias phrases.
[Table: LAS versus CLAS when no bias phrases are provided]
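For the random bias phrases used during training, the paper samples n-gram snippets from the reference transcripts of the training batch, sometimes presenting no phrases at all so that the no-bias option h_nb also gets trained. A rough sketch of such sampling, with all constants illustrative:

```python
import random

def sample_bias_phrases(references, p_keep=0.5, max_phrases=8, max_len=3):
    """Sample training-time bias phrases from the batch's reference transcripts.

    With probability (1 - p_keep), return no phrases at all, which trains the
    model to fall back on the no-bias option. All constants are illustrative.
    """
    if random.random() > p_keep:
        return []
    phrases = []
    for _ in range(random.randint(1, max_phrases)):
        words = random.choice(references).split()
        if not words:
            continue
        n = random.randint(1, min(max_len, len(words)))       # phrase length in words
        start = random.randint(0, len(words) - n)             # random n-gram snippet
        phrases.append(" ".join(words[start:start + n]))
    return phrases
```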

We further compare different on-the-fly rescoring schemes, which differ in how weight is assigned to subword units. As the table below shows, the best model applies the bias at every subword unit, which helps keep the relevant words in the beam. All subsequent on-the-fly rescoring experiments apply the bias at the subword level.
[Table: on-the-fly rescoring schemes with different subword weight assignments]

Next, we compare CLAS against the schemes above:
[Table: CLAS compared with on-the-fly rescoring]

As this table shows, CLAS significantly outperforms the traditional method and requires no additional hyperparameter tuning.

Finally, combining CLAS with the traditional method shows that biasing and on-the-fly rescoring are complementary and further improve the results:
[Table: CLAS combined with on-the-fly rescoring]

To conclude: in this paper we presented CLAS, an all-neural, end-to-end contextual ASR model that incorporates context information by embedding a list of context phrases. In our experimental evaluation, the proposed CLAS model outperformed the standard shallow-fusion biasing method.

[1] Chan, William, et al. “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.

[2] Ian Williams, Anjuli Kannan, Petar Aleksic, David Rybach, and Tara N. Sainath, “Contextual speech recognition in end-to-end neural network systems using beam search,” in Proc. of Interspeech, 2018.

[3] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming Small-footprint Keyword Spotting Using Sequence-to-Sequence Models,” in Proc. ASRU, 2017.

Click "Follow" to learn about Huawei Cloud's latest technologies as soon as they are released~