Abstract: With advances in big data and deep learning, building an automatic human-computer dialogue system to serve as a personal assistant or chat partner is no longer a fantasy. But have dialogue systems really reached the point where “the future has arrived,” as the media report? What problems and challenges does research on human-computer dialogue systems currently face? This article surveys the current state and challenges of human-computer dialogue systems. Yan Rui, an assistant professor at Peking University, addressed these questions from a scholar’s point of view at the third session of the Yunqi Lecture Hall | Future Human-Computer Interaction Technology Salon.
About the speaker:
Yan Rui is a researcher at the Institute of Computer Science and Technology of Peking University, an assistant professor and doctoral supervisor at Peking University, formerly in senior R&D at Baidu, and a visiting professor and external supervisor at Central China Normal University and the Central University of Finance and Economics. Yan has led the research and development of multiple open-domain dialogue systems and service dialogue systems, published nearly 50 high-level research papers, and served as a (senior) program committee member and reviewer for multiple academic conferences (KDD, IJCAI, SIGIR, ACL, WWW, AAAI, CIKM, EMNLP, etc.).
The following content is compiled from the speaker’s slides and video (edited by the Yunqi Community without changing the original meaning).
Application status of dialogue systems in China and abroad
Dialogue systems are currently very popular in China. The following picture lists some of them, such as Ali Xiaomi (AliMe) from the large Internet company Alibaba and Baidu’s Dumi (Duer). Many start-ups have also entered the field, such as the well-known Sanjiaoshou (“Triangle Beast”).
The situation abroad is basically the same as in China. Everyone knows that dialogue systems are a big cake, and everyone hopes to get a piece of it. No company on the market today can claim to be capable of eating the whole cake, so all of them are rushing forward, hoping to seize the opportunity and win more users early. Among dialogue systems abroad, Amazon’s Echo has done well and can be said to have ignited the smart-device market. The “otaku goddess” created by Microsoft’s XiaoIce (Xiaobing) team has “saved” countless young single men. XiaoIce has launched in five countries, including China, Japan, and Indonesia, and on the road of localizing open-domain chatbots it can be said to be at the forefront of the world.
Problem definition of dialogue systems
Back to the topic: to build a dialogue system, we first need to understand the dialogue process. In its simplest definition, a dialogue process generates an output for the user’s input. What that output should be is what the algorithm has to decide, for example which information is needed to respond. Because dialogue is a continuous process, context is very important and has to be considered when replying. Information also needs to be drawn from a knowledge base: a dialogue system that does not understand knowledge about the human world will talk nonsense, and the dialogue cannot continue. The process also requires semantic and logical information. The system should be able to chat along a main logical thread; otherwise it will seem to have a split personality. All of these requirements make dialogue systems difficult to build.
Why have dialogue systems suddenly become so popular in recent years? This is probably because of the data-driven approach: large amounts of data are now available, so machines can learn human dialogue patterns. Large numbers of machines, powerful computing resources, and efficient machine learning algorithms provide the support that makes human-computer dialogue possible.
However, dialogue systems are rooted in human cognition of intuitive behavior, and one reason they are not yet good enough is that humans themselves do not fully understand how dialogue is carried out. Conversation between people is largely intuitive, and we rarely think about why we produce a particular answer. Humans are therefore not very clear about the dialogue process and are still at a cognitive bottleneck. A dialogue system also has several qualities to take into account. Relevance is handled relatively well now: the answers generated today are at least related to the context. Beyond that there is interestingness: if chatting with the machine is boring, people will not keep chatting. There is also informativeness. Chatting between people is a process of exchanging information, so if everything a dialogue robot says is devoid of substance, chatting with it may be pointless. These are problems a dialogue system needs to solve, and they have not yet been solved well.
Dialogue system classification
By domain, dialogue systems can be divided into open-domain and vertical-domain systems. Open-domain systems are usually chatbots, which can chat about anything. Vertical-domain systems, such as dialogue systems for medicine, finance, and law, are more practical and have commercial potential, since they handle dialogue about concrete problems in their field. Note, however, that open data is relatively hard to obtain for these vertical domains. Although this is a bottleneck now, it does not mean there will be no other way to obtain vertical-domain data in the future.
By technique, the current mainstream in industry is the retrieval-based dialogue system, which relies on collecting a large corpus of human dialogue. When the system receives a sentence, it checks whether the corpus contains something similar and, if so, directly returns the stored answer. Retrieval-based systems are relatively controllable, because every answer in the corpus was written by a human. The second type is the generative dialogue system, which is the current trend but is not yet mature. It learns from a large corpus how to answer a received sentence and may produce a new answer that is not in the original corpus. Generative systems have their own problems: although neural networks have raised the level of generative dialogue considerably, the output is sometimes not something a human would say. A feasible approach is to combine the existing mainstream (retrieval) system with the future (generative) system to make a usable product.
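The retrieval idea described above can be sketched in a few lines. This is a minimal illustration, not any company’s system: the toy corpus, the Jaccard word-overlap similarity, and the `threshold` value are all assumptions made for the example, and a real system would use learned matching models over a far larger corpus.

```python
import re

# Hypothetical toy corpus of (post, reply) pairs written by humans.
CORPUS = [
    ("what is the weather like today", "It looks sunny today."),
    ("can you recommend a good movie", "How about a classic comedy?"),
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
]


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets, in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_reply(query, corpus=CORPUS, threshold=0.3):
    """Return the stored reply of the most similar post, or None if no post is similar enough."""
    query_tokens = tokenize(query)
    best_score, best_reply = 0.0, None
    for post, reply in corpus:
        score = jaccard(query_tokens, tokenize(post))
        if score > best_score:
            best_score, best_reply = score, reply
    return best_reply if best_score >= threshold else None


print(retrieve_reply("How do I reset the password?"))
print(retrieve_reply("tell me a joke"))  # nothing similar in the corpus
```

Returning `None` when nothing is similar enough is the point at which a combined system could fall back to a generative model.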
By scenario, dialogue is generally divided into single-turn and multi-turn. Single-turn dialogue considers only the current user input when generating an answer. This simplest assumption is fine for preliminary research but should not be used in practical applications, because real dialogue is generally continuous and involves multiple turns of interaction.
Classification by mode of interaction is more interesting. Current dialogue systems are generally passive: the system generates an answer only after receiving a message from the person. Dialogue between people is not like this. One party may steer the conversation, that is, bring up new content, add proactive elements to the dialogue, and interact deliberately.
Simple flow of a dialogue system
Compared with industry, the process built by academia is very abstract. Roughly speaking, voice input arrives and is first converted into text; the text goes through intent understanding and dialogue management; and the result is turned into an answer and output. Academic work rarely goes beyond this abstract process, whereas industrial applications work through it in great detail, designing and handling every step carefully. This is the difference between academia and industry.
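The abstract flow can be written down as a chain of stages. The sketch below is only a skeleton under stated assumptions: every function name, the keyword-based intent rule, and the reply templates are hypothetical placeholders, and speech recognition is stubbed out by assuming the audio already carries a transcript.

```python
def speech_to_text(audio):
    """Placeholder for speech recognition: assumes the transcript is already attached."""
    return audio["transcript"]


def understand_intent(text):
    """Toy intent understanding based on a single keyword rule."""
    intent = "ask_weather" if "weather" in text.lower() else "chitchat"
    return {"intent": intent, "text": text}


def manage_dialogue(state, understanding):
    """Record the intent in the dialogue state and pick the next system action."""
    state.append(understanding["intent"])
    return "report_weather" if understanding["intent"] == "ask_weather" else "small_talk"


def generate_answer(action):
    """Turn the chosen action into a surface reply using fixed templates."""
    templates = {
        "report_weather": "Let me check the forecast for you.",
        "small_talk": "Tell me more about that.",
    }
    return templates[action]


def run_turn(audio, state):
    """One turn: speech -> text -> intent -> dialogue management -> answer."""
    text = speech_to_text(audio)
    understanding = understand_intent(text)
    action = manage_dialogue(state, understanding)
    return generate_answer(action)


state = []
print(run_turn({"transcript": "What's the weather tomorrow?"}, state))
print(state)
```

In an industrial system each of these one-line stubs would be a carefully engineered component, which is the contrast the paragraph above draws.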
Structure of a dialogue system
A dialogue system has an offline process: data acquisition, the retrieval model, and the deep model are all built offline. There is also an online process: when a user sends a sentence, the system needs to generate a corresponding answer and return it to the user.
Data is very important for a dialogue system, so let us first look at the open data available to academia. Academia differs from industry: companies have real users, logs, and other data, while academia has only public data, such as Sina Weibo, Douban, and Tieba. Obtaining it is relatively complex. First the dialogue data between people has to be crawled, then filtered and cleaned, and finally the dialogue pairs and dialogue patterns are extracted.
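The filtering and pair-extraction step might look roughly like the following. It is a sketch under assumptions: a crawled thread is taken to be an ordered list of utterance strings, adjacent utterances are treated as (post, reply) pairs, and the cleaning rules (URL stripping, length limits on whitespace tokens, de-duplication) are illustrative choices. Chinese text would additionally need a word segmenter rather than `str.split`.

```python
import re


def extract_pairs(thread, min_tokens=2, max_tokens=30):
    """Turn an ordered list of utterances into cleaned (post, reply) pairs."""
    # Cleaning: strip URLs and surrounding whitespace from every utterance.
    cleaned = [re.sub(r"https?://\S+", "", utterance).strip() for utterance in thread]

    pairs, seen = [], set()
    for post, reply in zip(cleaned, cleaned[1:]):
        # Filtering: drop pairs where either side is too short or too long.
        if not all(min_tokens <= len(u.split()) <= max_tokens for u in (post, reply)):
            continue
        # De-duplication: keep each distinct pair once.
        if (post, reply) in seen:
            continue
        seen.add((post, reply))
        pairs.append((post, reply))
    return pairs


thread = [
    "Anyone seen the new film?",
    "Yes, it was great http://example.com/x",
    "ok",
    "Which cinema did you go to?",
]
print(extract_pairs(thread))
```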
Retrieval dialogue – background
The idea behind retrieval dialogue is very simple: an input sentence needs an answer, so the task is to match the two. If the input and a candidate match well, the candidate is a good answer. The basic approach is to project the input and the candidate output into a semantic space and judge whether they are similar there. If they are, the candidate is good; otherwise it is not. The traditional way to compute similarity is based on one-hot representations: a dimension is 1 when the word appears and 0 when it does not. This is a relatively old form of representation. After the rise of deep learning, people began to use representation learning, that is, learning an embedding. Both word embeddings and sentence embeddings are abstract representations, and similarity can be computed on them as well.
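The difference between the two representations shows up even in a toy example. In the sketch below the vocabulary and the two-dimensional embedding values are invented by hand purely for illustration (real embeddings are learned from data), and a sentence embedding is assumed to be the average of its word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


VOCAB = ["movie", "film", "weather", "good", "nice"]

# Hand-made embeddings, for illustration only.
EMBEDDINGS = {
    "movie": [0.90, 0.10],
    "film": [0.85, 0.15],
    "weather": [0.10, 0.90],
    "good": [0.50, 0.50],
    "nice": [0.55, 0.45],
}


def one_hot_sentence(tokens):
    """1 where a vocabulary word appears in the sentence, 0 where it does not."""
    return [1.0 if word in tokens else 0.0 for word in VOCAB]


def embed_sentence(tokens):
    """Average the word embeddings of the known tokens."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


a, b, c = ["good", "movie"], ["nice", "film"], ["weather"]
print(cosine(one_hot_sentence(a), one_hot_sentence(b)))  # no shared words
print(cosine(embed_sentence(a), embed_sentence(b)))      # close in embedding space
print(cosine(embed_sentence(a), embed_sentence(c)))      # farther apart
```

With one-hot vectors, two sentences that share no words look completely unrelated, while the embedding view can still place them close together.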
Retrieval dialogue – typical matching patterns
The figure below shows Huawei’s early work on judging whether two sentences are similar. It provides two good architectures: one encodes each sentence into a separate representation and then computes their similarity, and the other computes similarity from finer-grained interactions between the two sentences during the matching process. Both architectures are used in later retrieval dialogue work.
Retrieval dialogue – recurrent matching representation
Retrieval matching has some interesting aspects. In general, matching takes the final state after the whole sentence has been read and computes the match from that. However, borrowing the idea of recurrence familiar from computer science, the intermediate states can also be taken out and matched, which provides more evidence and therefore more confidence in judging whether the two sentences match. There are two variants of this matching, one based on LSTM and the other on GRU.
Retrieval dialogue – multi-turn dialogue
Baidu was the first to work on the context-sensitive part of retrieval dialogue. The initial idea is relatively simple. First, candidate results are found and ranked based on the current user input alone; this is the context-free step. Then the candidates are judged again given the context, which produces a second ranking. Combining the two yields an optimized ranking, so that the final result is both query-related and context-related.
Later work on multi-turn dialogue became more fine-grained. Earlier work merged the context as a whole before judging and matching. It was then found that not all of the context is equally important: some words matter and some do not. So the context is scored for importance, the current user input is reformulated using the scores, the important content is reflected in the retrieved answers as far as possible, and the results are merged and ranked to get the final result. This is a more fine-grained use of context.
The Baidu work shown below initially made no hierarchical distinction when learning sentence representations; it simply encoded the text sequentially from beginning to end. A newer idea treats word granularity as one level and sentence granularity as another, learns representations at both levels at the same time, and then computes the match, which is also a good attempt.
The Microsoft XiaoIce team took matching further. It considers both the word level and the sentence level, matches at each level, and then aggregates the matches in temporal order to output the final result.
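The two-step ranking can be illustrated with a small sketch. It is not the Baidu model: word overlap stands in for the learned matching scores, and the function names, the interpolation weight `alpha`, and `top_k` are assumptions made for the example.

```python
def overlap(a, b):
    """Word-overlap (Jaccard) similarity between two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def rank_with_context(query, context, candidates, alpha=0.5, top_k=3):
    """Rank candidates by the query alone, then re-rank the top ones using the context too."""
    # Step 1 (context-free): keep the candidates that best match the current input.
    first_pass = sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:top_k]

    # Step 2 (context-aware): blend query relevance with context relevance.
    def combined(candidate):
        return alpha * overlap(query, candidate) + (1 - alpha) * overlap(context, candidate)

    return sorted(first_pass, key=combined, reverse=True)


candidates = [
    "good food is everywhere",
    "beijing has good roast duck food",
    "i like football",
]
query = "any good food there"
print(rank_with_context(query, "", candidates)[0])                       # query only
print(rank_with_context(query, "i am going to beijing", candidates)[0])  # with context
```

With an empty context the generic reply wins; once the context mentions Beijing, the reply that fits both the query and the context moves to the top.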
Retrieval dialogue – dialogue suggestions
The picture below shows related work done at Peking University. The idea of retrieval dialogue comes from information retrieval, which has an interesting concept called query suggestion: when the user searches for something, the system suggests other related or better queries. The Peking University work borrows this practice: when the user inputs a sentence, the system generates an answer and also offers suggestions, and the user can continue the conversation along those suggestions. This is also an attempt at proactive dialogue, in which the system deliberately learns content that leads the user onward. This line of work is very meaningful in academia, but how to put it into practice still needs to be tried by industry.
Generative dialogue – background
For generative dialogue, the three components shown in the figure below are generally regarded as standard: the sequence-to-sequence model, the attention mechanism used to judge the alignment between input and output, and the bidirectional encoding mechanism.
Generative dialogue – sequence to sequence + attention
The figure below shows early work on generative dialogue by Huawei’s Noah’s Ark Lab, which applied the sequence-to-sequence model to dialogue and added attention mechanisms at different levels.
Generative dialogue – multi-turn dialogue
Dialogue needs context, so how can context be integrated into generative dialogue? The simplest way is to encode the context into a representation, concatenate it with the representation of the current input (representation + representation), and then decode to get a reply. The second way adds a hierarchy with an intermediate state. Each turn has its own information, which is passed to a global representation that carries the information forward. As a result, when a reply is generated later, the model is aware of the context.
Generative dialogue – topic information
The work of the Microsoft XiaoIce team holds that literal relevance of the answer is not enough; the generated answer should also be relevant to the topic, which is a more abstract notion than literal meaning. In the figure below, the left side is the generation probability based on the literal text and the right side is the probability based on the topic, so that both literal meaning and topic meaning are taken into account in the end. This is the generative dialogue model with topic information.
Generative dialogue – proactive dialogue
The figure below shows Peking University’s work on proactive dialogue. When we chat, we usually want to tell the other person something and often speak with a purpose, so the response carries information we want to convey. This can be called a constraint, and constraints can be handled in several ways. With a hard constraint, when the input query arrives, the model first decides which word must appear in the reply and generates that word in advance. It then generates the first half of the sentence backward from that word, conditioned on the input query, and completes the rest based on that half sentence. The reply is therefore guaranteed to contain the required information.
But a hard constraint looks rather like a hard-sell advertisement: forcing certain things to appear can make people uncomfortable. One idea is therefore to soften the constraint, so that the word itself does not have to appear but related semantics do. The word is converted into an embedding representation, and the model decides when to reflect its semantics.
Generative dialogue – persona
In dialogue, everyone speaks in a different character. A man from Northeast China and a girl from Jiangnan certainly have different characters. A dialogue system therefore usually needs to maintain a consistent personality and must not seem to have a split personality. Stanford University’s work models personality through an additional persona: when generating a reply, the model draws information from the persona so that the reply fits the character setting.
Generative dialogue – emotion
Tsinghua University has done related work on emotional dialogue. People convey feelings when they talk, so it is unreasonable for a dialogue system’s emotions to be completely uncontrolled. If you pour out something very sad to the dialogue system and it responds happily, you may want to turn off the computer within minutes. Tsinghua’s work adds a control unit that determines which emotion the current sentence should express and checks whether that emotion has been expressed in the generated output, so that the final response reflects the emotion.
Generative dialogue – diversity
Diversity is also an important issue, because dialogue is unlike other language tasks. In machine translation, where English input produces Chinese output, the results all carry roughly the same meaning, and the system basically only needs to translate with reference to the English. In dialogue, an input sentence may have n possible replies, each different and each acceptable, so diversity is a strong feature of dialogue. But it is hard to capture through model learning alone, because machines tend to learn the most likely reply, and the most likely reply is a generic answer like “I don’t know” or “ha ha”, since these can answer almost any input. It is therefore necessary to push down replies that appear often and bring up replies that appear rarely, so that the dialogue becomes diverse.
Peking University also adds a determinantal point process to generative dialogue, so that both quality and diversity are considered when each word of the response is generated, striking a balance between the two.
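One simple way to push down frequent replies, offered here as an assumption rather than the method from the talk, is to subtract a penalty based on how likely a reply is regardless of the input. The sketch below re-ranks candidates by `log p(reply | query) - lam * log p(reply)`; the probabilities are made-up numbers and `lam` is a hypothetical tuning weight.

```python
import math


def rerank_for_diversity(candidates, lam=0.5):
    """Re-rank (reply, p_given_query, p_overall) triples, penalizing generic replies."""
    scored = [
        (math.log(p_given_query) - lam * math.log(p_overall), reply)
        for reply, p_given_query, p_overall in candidates
    ]
    scored.sort(reverse=True)  # highest score first
    return [reply for _, reply in scored]


# Made-up probabilities: the generic reply is likely for this query,
# but it is also likely for almost every other query.
candidates = [
    ("I don't know", 0.30, 0.20),
    ("Try the noodle shop nearby", 0.10, 0.001),
]
print(rerank_for_diversity(candidates, lam=0.0))  # plain likelihood ranking
print(rerank_for_diversity(candidates, lam=0.5))  # generic reply pushed down
```

With `lam=0.0` the ranking is by likelihood alone and the generic reply wins; a positive `lam` favors the specific reply.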
Generative dialogue – reinforcement learning
The figure below shows Stanford’s work on using reinforcement learning in generative dialogue. Two dialogue robots interact with each other to improve, a form of self-play. The goal is clear: to choose, as far as possible, the dialogue strategy that keeps the conversation going as long as possible.
Generative dialogue – GAN
A GAN is a generative adversarial network, which is also a form of self-play: a generator produces fake data for a discriminator to judge. GANs were applied to language in 2017. Language differs from images: images are continuous data, which is easier to handle, whereas language is discrete data, to which GANs are harder to apply. Stanford University improved on this and implemented a seq2seq GAN, in which generated answers are passed to a discriminator that decides whether each is a real answer or a system-generated one. When the discriminator can no longer tell whether a dialogue was produced by a human or by the system, the dialogue system has become very capable.
Harbin Institute of Technology and Sanjiaoshou (“Triangle Beast”) apply GANs from another angle, judging whether the generated reply is good enough by way of evaluation.
Peking University has proposed an idea that has been applied in Ali Xiaomi (AliMe). Retrieval dialogue is known to have a bottleneck: the data will never be “enough”. The advantage of generative dialogue is flexibility: once the dialogue patterns have been learned, the system can answer without a stored response. The problem is that generated answers are sometimes not controllable. It is therefore necessary to combine retrieval dialogue with generative dialogue, an approach that also appears in Ali Xiaomi’s published work.
We are also concerned with the automatic evaluation of dialogue, because without evaluation we cannot know whether what we have built is good enough. Dialogue methods are usually borrowed from machine translation, so machine translation metrics are often used for evaluation, but they do not work well: translation has a close one-to-one correspondence between input and output, and dialogue does not. The University of Montreal has done work demonstrating that it is unscientific to measure dialogue systems with machine translation metrics.
Dialogue evaluation – learning methods
The University of Montreal proposes feeding the input query, the context, and the response into a neural network and training it against human scores, so that the network can judge whether a dialogue system’s reply is good enough. This is one solution. It raises another problem, however: it requires manual scoring, and manual scoring may not be objective enough.
Dialogue evaluation – RUBER
Peking University has also proposed a method for dialogue evaluation that needs no manual scoring. It rests on two assumptions. First, if a generated answer is very similar to the real (ground-truth) reply, it is a good answer. Second, if it is not similar, the method judges whether the reply matches the query well, and if it does, it is also a good reply. Evaluating dialogue this way requires no manual scoring.
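The two assumptions can be combined into a single score, as in the rough sketch below. This is not the RUBER implementation: word overlap is an assumed stand-in for both the similarity to the ground-truth reply and the query-reply relatedness, and taking the maximum is just one way to express that either condition is enough for a reply to count as good.

```python
def overlap(a, b):
    """Word-overlap (Jaccard) similarity between two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def blended_score(query, reply, reference, combine=max):
    """Score a reply without human ratings by blending two automatic signals."""
    referenced = overlap(reply, reference)  # how close is the reply to the real answer?
    unreferenced = overlap(query, reply)    # how well does the reply fit the query?
    return combine(referenced, unreferenced)


query = "do you like pizza"
reference = "yes i love pizza"
print(blended_score(query, "yes i love pizza a lot", reference))
print(blended_score(query, "haha", reference))
```

Passing `combine=min` instead would demand that a reply satisfy both conditions, a stricter reading than the one described above.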
In summary, human-computer dialogue may seem to be a solved problem. The media keep saying that “the future has arrived”: one day a “heavyweight”, the next day a “no call”, leading people to think we are not far from the “Terminator”. But anyone who actually uses a dialogue system will find that the technology still has a long way to go.
Author: Yan Rui
This is original content from the Yunqi Community and may not be reproduced without permission.