ICASSP 2021 will be held in Toronto, Canada, from June 6 to 11, 2021. Backed by its accumulated expertise and cutting-edge innovation in speech technology, Jingdong Technology Group has had three papers accepted by ICASSP 2021.
ICASSP stands for the International Conference on Acoustics, Speech and Signal Processing. Sponsored by IEEE, it is the world’s largest and most comprehensive top-tier academic conference on signal processing and its applications. The accepted papers showcase Jingdong Technology Group’s strength in speech enhancement, speech synthesis, and multi-turn dialogue on the international stage.
01. Neural Kalman Filtering for Speech Enhancement
A speech enhancement algorithm based on neural Kalman filtering
*Link to the paper: https://arxiv.org/abs/2007.13962
Because real environments contain complex noise, speech enhancement plays an important role in human-computer speech interaction systems. Enhancement algorithms based on statistical machine learning usually assemble the system from common building blocks such as fully connected, recurrent, and convolutional neural networks. How to bring optimal filter design theory, which is grounded in expert knowledge, into a learning-based speech enhancement system remains an open problem.
The accepted paper “Neural Kalman Filtering for Speech Enhancement” proposes a neural Kalman filtering framework for speech enhancement. It combines neural networks with optimal filter theory and uses supervised learning to train the optimal Kalman filter weights.
The researchers first build a model of how speech evolves over time using a recurrent neural network. Unlike the traditional Kalman filter, this model drops the unrealistic assumption that speech follows a linear prediction model, so it can capture the nonlinear evolution of real speech. The algorithm then forms two estimates. First, from the temporal model and the Kalman hidden state vector, it predicts the long-term envelope of the speech. Second, by incorporating the observation at the current time step, it computes a speech spectrum estimate using the Wiener filter from traditional signal processing. The final output of the system is a linear combination of the long-term envelope prediction and the Wiener filter estimate, and the optimal combination weight is obtained directly from traditional Kalman filter theory. Because the system is designed end to end, the weights of the speech temporal-evolution network and of the noise estimation network used by the Wiener filter can be updated jointly.
The study was evaluated on the LibriSpeech speech corpus, the PNL 100 Nonspeech Sounds set, and the MUSAN noise set. The experimental results show that the proposed algorithm outperforms conventional speech enhancement algorithms based on U-Net and CRNN in terms of SNR gain, perceptual speech quality (PESQ), and speech intelligibility (STOI).
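The linear combination described above can be illustrated with the textbook Kalman update for one frequency bin at a time. This is only a minimal sketch, not the paper's implementation: it assumes real-valued spectral magnitudes and that the error variances of the two estimates (pred_err_var and obs_err_var, both hypothetical names) are already available, whereas in the paper the corresponding quantities come from trained networks.

```python
import numpy as np

def kalman_fuse(x_pred, x_wiener, pred_err_var, obs_err_var, eps=1e-12):
    """Fuse two per-bin speech spectrum estimates with a Kalman-style gain.

    x_pred       : prediction from the temporal (recurrent) model
    x_wiener     : estimate from the Wiener filter on the current observation
    pred_err_var : assumed error variance of x_pred, per bin
    obs_err_var  : assumed error variance of x_wiener, per bin
    """
    x_pred = np.asarray(x_pred, dtype=float)
    x_wiener = np.asarray(x_wiener, dtype=float)
    p = np.asarray(pred_err_var, dtype=float)
    r = np.asarray(obs_err_var, dtype=float)

    # Scalar Kalman gain: trust the observation-based estimate more
    # when the prediction is uncertain (large p) and less when it is not.
    gain = p / (p + r + eps)

    # Linear combination of the two estimates.
    fused = (1.0 - gain) * x_pred + gain * x_wiener
    return fused, gain

fused, gain = kalman_fuse(
    x_pred=[1.0, 2.0], x_wiener=[3.0, 2.5],
    pred_err_var=[1.0, 0.0], obs_err_var=[1.0, 1.0],
)
print(gain)   # equal variances give a gain near 0.5; zero prediction error gives 0
print(fused)
```

When the prediction error variance is zero the gain is zero and the output equals the temporal-model prediction; when both variances are equal the two estimates are averaged.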
02. Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis
Prosody modeling for end-to-end speech synthesis based on cross-utterance information
*Link to the paper:
Although current end-to-end speech synthesis can already produce fairly natural speech with rich prosody, it ignores the structure of the surrounding text and uses only the linguistic features of the current sentence. Prosody is generally strongly tied to context: the same sentence can be delivered with completely different prosody in different contexts. An end-to-end system that uses only the text features of the current sentence therefore struggles to render a passage of text as natural, prosodically rich speech that fits its context.
The accepted paper “Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis” uses the now-mainstream BERT model to extract cross-utterance feature vectors for the text to be synthesized, and then uses these context vectors to improve the prosody of the end-to-end speech synthesis model.
Figure 2: model structure diagram
Instead of using any explicit prosody control information, the researchers use the BERT language model to extract cross-utterance feature representations of the sentences surrounding the sentence to be synthesized, and feed these representations as an additional input to a mainstream end-to-end speech synthesis model. The paper explores two ways of using the cross-utterance features. The first concatenates the cross-utterance features of all context sentences and feeds the result to the end-to-end speech synthesis system as a single input. The second treats the cross-utterance features of the context sentences as a sequence, computes attention between each speech unit of the text to be synthesized and that sequence, and then takes an attention-weighted sum of the context features to obtain a cross-utterance feature for each speech unit. With the second approach, every pronunciation unit receives a fine-grained cross-utterance feature that is relevant to that unit.
The experimental results show that incorporating cross-utterance features into an end-to-end speech synthesis system effectively improves the naturalness and expressiveness of synthesized paragraph-level text. The results were verified on Chinese and English audiobooks. In addition, in preference tests against the end-to-end baseline model, most listeners preferred the audio synthesized with the cross-utterance vector representations proposed in this study.
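The two ways of using cross-utterance features can be sketched with NumPy as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: the speech-unit encodings and the context sentence embeddings share the same dimension (a real system would typically add a learned projection), the attention is plain scaled dot-product attention, and the function names concat_context and attend_context are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def concat_context(unit_encodings, context_embeddings):
    """Way 1: concatenate all context embeddings into one vector and
    append the same vector to every speech unit (assumed broadcast)."""
    flat = context_embeddings.reshape(-1)
    tiled = np.tile(flat, (unit_encodings.shape[0], 1))
    return np.concatenate([unit_encodings, tiled], axis=1)

def attend_context(unit_encodings, context_embeddings):
    """Way 2: each speech unit attends over the sequence of context
    embeddings and receives its own weighted sum."""
    dim = unit_encodings.shape[1]
    scores = unit_encodings @ context_embeddings.T / np.sqrt(dim)
    weights = softmax(scores, axis=1)            # (units, sentences)
    per_unit = weights @ context_embeddings      # (units, dim)
    return np.concatenate([unit_encodings, per_unit], axis=1), weights

rng = np.random.default_rng(0)
units = rng.normal(size=(5, 8))      # 5 speech units, hypothetical dim 8
context = rng.normal(size=(3, 8))    # 3 context sentences

print(concat_context(units, context).shape)
augmented, weights = attend_context(units, context)
print(augmented.shape, weights.sum(axis=1))
```

In the first function every unit sees the same context vector, while in the second each row of weights is a separate distribution over context sentences, which is what gives each unit its own fine-grained feature.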
03. Conversational Query Rewriting with Self-supervised Learning
Conversational query rewriting based on self-supervised learning
*Link to the paper:
In multi-turn dialogue systems, user utterances tend to be short and colloquial, and they contain many omissions and references to earlier turns. These phenomena make it hard for a dialogue bot to understand the user’s real intent, which greatly increases the difficulty of generating a response. To improve the dialogue system, query rewriting completes the user’s utterance based on the conversation history, restoring all omitted and referenced information. However, existing query rewriting techniques all rely on supervised learning, so model quality is severely limited by the amount of annotated data, which is a major obstacle to deploying the technique in real business scenarios. In addition, existing work has not addressed whether the user’s intent changes after rewriting. How to keep the user’s intent consistent after rewriting remains an urgent open problem.
The accepted paper “Conversational Query Rewriting with Self-supervised Learning” proposes a self-supervised query rewriting method. When words co-occur in the user’s question and the conversation history, these co-occurring words are deleted or replaced by pronouns with a certain probability. The query rewriting model is then trained to restore the user’s original question from the conversation history. Compared with supervised learning, this self-supervised approach obtains a large amount of training data at low cost and makes full use of the model’s representation learning ability.
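The self-supervised data construction can be sketched as below. This is a simplified illustration rather than the paper's procedure: it assumes whitespace-tokenized English text, treats every shared token as a co-occurring word (a real pipeline would likely filter stop words), and uses a hypothetical pronoun list and hypothetical probabilities p_delete and p_replace.

```python
import random

PRONOUNS = ("it", "that", "they")  # hypothetical pronoun inventory

def make_training_pair(context_tokens, query_tokens, p_delete, p_replace, rng):
    """Build one (corrupted query, original query) training pair.

    Tokens of the query that also appear in the conversation history are
    deleted with probability p_delete or replaced by a pronoun with
    probability p_replace; all other tokens are kept unchanged.
    """
    context_vocab = set(context_tokens)
    corrupted = []
    for token in query_tokens:
        if token in context_vocab:
            draw = rng.random()
            if draw < p_delete:
                continue  # simulate an omission
            if draw < p_delete + p_replace:
                corrupted.append(rng.choice(PRONOUNS))  # simulate a reference
                continue
        corrupted.append(token)
    # The model input is (history, corrupted query); the target is the original.
    return corrupted, list(query_tokens)

history = "how much is the iphone 12".split()
query = "does the iphone 12 support 5g".split()
rng = random.Random(0)
corrupted, target = make_training_pair(history, query, 0.3, 0.3, rng)
print(corrupted, "->", target)
```

Because the target is simply the untouched original question, no manual annotation is needed to produce training pairs.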
The Jingdong researchers further propose an improved model, Teresa, which raises the quality and accuracy of the rewriting model in two ways. The first is a keyword detection module in the Transformer encoder layer, which extracts keywords to guide sentence generation. A self-attention graph is first built over the encoded output of the conversation history to capture how strongly the words in the history relate to one another; the TextRank algorithm is then used to compute an importance score for each word; finally, the word importance scores are fed into the decoder as prior information, guiding the model to generate questions that contain more key information. The second is an intent consistency module. A special [CLS] token is added to the input text of the Transformer encoder to obtain the intent distribution of the text, and intent consistency is maintained by constraining this distribution. The original session (context, query) and the generated sentence (target) share the Transformer encoder, which yields the intent distributions before and after rewriting. The two distributions are kept consistent, so that the intent of the generated sentence matches the original.
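Both modules can be illustrated numerically. The sketch below is an assumption-laden illustration, not Teresa's implementation: it runs a PageRank-style TextRank iteration directly on a given self-attention matrix with a conventional damping factor of 0.85, and it measures intent consistency with a KL divergence, which is one plausible choice since the text above does not say which constraint is used.

```python
import numpy as np

def textrank_scores(attention, damping=0.85, iterations=50):
    """Word importance from a self-attention graph.

    attention[i, j] is the attention word i pays to word j. Rows are
    normalized so each word distributes one unit of "vote".
    """
    attention = np.asarray(attention, dtype=float)
    n = attention.shape[0]
    transition = attention / attention.sum(axis=1, keepdims=True)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        # A word is important if important words attend to it.
        scores = (1.0 - damping) / n + damping * (transition.T @ scores)
    return scores

def intent_consistency_loss(p_source, p_rewrite, eps=1e-12):
    """KL divergence between intent distributions before and after
    rewriting (assumed form of the consistency constraint)."""
    p = np.clip(np.asarray(p_source, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_rewrite, dtype=float), eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy attention map in which every word attends only to word 1.
attention = [[0.0, 1.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 1.0, 0.0]]
print(textrank_scores(attention))
print(intent_consistency_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

In the toy example the word that receives all of the attention ends up with the highest importance score, and the loss is zero only when the two intent distributions are identical.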
As the core division through which Jingdong provides technology services externally, Jingdong Technology Group is committed to cutting-edge research and exploration, and continues to use science and technology to help cities and industries upgrade digitally. To date, Jingdong Technology Group has published 350+ papers at top international AI conferences such as AAAI, IJCAI, CVPR, KDD, NeurIPS, ICML, ACL, and ICASSP, and has taken 19 world-first places in international academic competitions. Going forward, Jingdong Technology Group will continue to invest in speech and semantics, computer vision, machine learning, and related fields, supporting the real economy with science and technology and making a tangible difference in people’s lives.