Dialogue interaction is a key technology direction with great potential for the era that follows the traditional PC, the PC Internet and the mobile Internet, and it receives close attention from both academia and industry. As one of the key nodes of OPPO’s strategy of integrating all things, it also carries a significant and demanding mission.
Algorithms are one of the core capabilities of dialogue interaction: they determine the level of intelligence a voice assistant can reach and therefore have high technical value. This article introduces the objectives of dialogue interaction, the key problems the algorithms must solve, the current state and trends of the industry, Xiaobu’s main practice and progress, and the challenges and outlook ahead.
2 Objectives and key issues of dialogue and interaction
Generally speaking, the goal of dialogue interaction is to complete human-computer interaction, such as task execution, information acquisition and emotional communication, through natural dialogue by voice or text. Intelligent assistants such as Jarvis and Dabai in science fiction movies represent what people expect of dialogue interaction in its ideal state.
Why has dialogue interaction attracted more and more attention in recent years? A look back at the past 40 years of information technology makes the reason clear. Information technology has passed through the eras of the traditional PC, the PC Internet and the mobile Internet. Each era was closely tied to a class of device, and each brought a revolution in entry points and interaction modes. We are now moving, with high hopes, towards the AIoT era. Because of its potential as a new generation of search engine, a super service distribution center and a new interaction method, dialogue interaction carries the mission and vision of the next entry-level interaction revolution.
However, the ideal dialogue interaction effect is very hard to achieve, mainly because it requires moving beyond today’s mature perceptual intelligence towards cognitive intelligence. Many problems in cognitive intelligence have not been fundamentally solved, or even clearly defined. Typical examples include how to represent and understand common sense, how to give machines the ability to reason and plan, and how to give machines human-like imagination and autonomy. To some extent, solving cognitive intelligence is basically equivalent to realizing strong artificial intelligence, which shows how difficult dialogue interaction is.
The main process of dialogue interaction is shown in the figure below. Almost every key node depends on algorithms, so algorithms are the core capability for achieving a better dialogue interaction effect.
For OPPO’s self-developed Xiaobu assistant, the current status of its algorithm capabilities is shown in the table below. Voice wake-up is mainly supported by third parties and the software engineering system. On new devices the effect is on par with the top competing products in the industry, but upgrading older models is costly and some low-end models cannot support voice wake-up. Speech recognition uses the capabilities of both third parties and the OPPO Research Institute. Because speech recognition technology is mature, the overall effect is good and the word error rate can be kept below 6%; the main remaining problem is audio quality. Speech synthesis is likewise supported by third parties and the OPPO Research Institute. It performs well on accuracy and fluency, but naturalness, emotion and similar dimensions are highly subjective to evaluate, and user personalization is not yet supported. Semantic understanding and dialogue ability are mainly provided by the business and technical team. For semantic understanding, accuracy and recall both exceed 90%, but long-tail open-domain queries remain hard to understand. For dialogue ability, the assistant currently supports immersive strong multi-turn, freely switching weak multi-turn and context-reasoning multi-turn dialogue; the main difficulties of multi-turn dialogue are that it is hard to evaluate, user habits are weak and online penetration is low.
Semantic understanding and dialogue ability are the focus of this article. Their task is to first understand what the user wants, then decide what to give the user, and finally assemble suitable resources to meet the user’s needs. The semantic algorithm system, composed of semantic understanding and dialogue ability, exists to achieve these goals. It faces two categories of problems, systematic and technical, as shown in the figure below.
Systematic problems include: how to decouple and decompose a complex system that must support queries in all domains, hundreds of skills, and multiple devices and channels; how to iterate efficiently given many product requirements, many modules, a long pipeline and high algorithmic uncertainty; how to safeguard the experience through effect monitoring when the variety of spoken queries is inexhaustible; and how to avoid experiences that make the assistant seem unintelligent, such as low-level defects, answers that miss the question, and excessive disclosure.
Technical problems include algorithm selection, the modeling and solving of key problems, the control of multi-turn dialogue, performance guarantees and so on.
3 Industry status and algorithm trends
First, dialogue interaction has become increasingly mature in its application scenarios, covering smart home, in-car, daily life and travel, professional services and many other fields. Convenience and speed are the inherent advantages of natural-language dialogue interaction, and more and more users are accepting it. It is estimated that more than 7 billion devices will be equipped with voice assistants in 2020.
In terms of trends, top technology companies have never stopped investing in this direction over the past decade. Abroad, Apple, Amazon and Google all treat dialogue interaction as a very important direction. The domestic situation is similar: Baidu, Xiaomi and Alibaba are all actively positioning themselves to seize the future traffic entry point of dialogue interaction.
A noteworthy trend is that intelligent dialogue assistants built for third-party devices are gradually fading out, and each company mainly focuses on its own devices. Besides the tight coupling between the technology and the device, a more important reason is that this entry point is too important: no leading device manufacturer is willing to hand it over to a third-party technology provider.
Dialogue interaction is also a hot spot in academic research. A trend analysis of ACL papers shows that the direction has risen rapidly in the last five years and became the hottest research direction in 2019 and 2020.
Reference: Trends of ACL: https://public.flourish.studi…
For the core cognitive understanding algorithms, the solution paradigm has evolved from the traditional multi-module pipeline, which depends heavily on the language, the problem type and hand-crafted experience, to a simpler, more general and more efficient end-to-end scheme. This evolution greatly simplifies problem solving: it avoids the accumulation of errors across modules, makes it possible to apply big data, large models and large computing power, and significantly improves the effect.
In the past two years, large-scale pre-trained models represented by Google’s BERT have emerged at the model level and swept the leaderboards of the major language tasks. They have released great potential for developing more advanced semantic understanding models, and they will provide solid technical support for the development of dialogue interaction.
In conclusion, both industry and academia pay close attention to dialogue interaction, which reflects their expectation of future trends. Breakthroughs in algorithm technology further accelerate the deployment of dialogue interaction products and bring that future closer.
4 Practice and progress of the Xiaobu algorithm system
As mentioned earlier, semantic understanding and dialogue ability together constitute Xiaobu’s core semantic algorithm system. The following parts will present our practice and key progress in this direction in detail.
Generally speaking, Xiaobu assistant’s mission is to establish a dialogue connection. One end of the connection is the huge user base of the Oujia group’s device ecosystem, and the other end is high-quality dialogue services. Through this connection it can realize user value, marketing value and technical value.
To support these business requirements, we have distilled four design principles to guide the design of the algorithm system:
Domain partition: decompose the complex all-domain problem by domain into simpler sub-problems that are solved group by group, which reduces the difficulty of solving and improves the controllability of the system;
Effect first: to avoid unintelligent experiences as far as possible, we do not commit to any single technology; the design of the algorithm scheme is driven by effect above all, so as to avoid low-level defects;
Closed-loop monitoring: establish a complete closed-loop monitoring mechanism, improve test coverage through test cases designed by product, testing and R&D, and safeguard the online experience through real-time monitoring on dynamic test sets and manual evaluation;
Platform efficiency: to support the many mid-tail and long-tail skills, promote the construction of a skill platform and reduce their R&D and maintenance cost with consistent, general platform solutions.
Based on the business requirements and design principles, the overall architecture of Xiaobu assistant’s current algorithm system is shown in the figure below. On the platform and tool side, the basic algorithms are mainly the industry’s mainstream deep learning algorithms. Algorithm schemes for different problem types are built on top of them and further encapsulated into modules such as the NLU framework, general knowledge graph Q&A, the skill platform and the open platform. On the business side, the top layer first applies symbolic, structured and numerical processing to the query in a general way, and then splits the business along the dimensions of system applications, life services, video and entertainment, information query and intelligent chat, with each business line iterating independently. Finally, dialogue generation and fusion ranking select the best skill to meet the user’s needs.
The processing flow can be divided into preprocessing, intent recognition, multi-class ranking, resource acquisition and post-processing. The first three nodes are mainly responsible for intent recall, the last two are responsible for resource coverage and result relevance, and the whole flow is responsible for the final satisfaction with skill execution.
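The five stages can be pictured as a chain of small functions. The sketch below is only an illustration under assumed names: the keyword recognizers, skill names and resource strings are hypothetical stand-ins, not Xiaobu’s actual components.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    skill: str    # hypothetical skill name
    score: float  # confidence assigned by that skill's recognizer


def preprocess(query: str) -> str:
    # Normalize case and whitespace; real preprocessing would do much more.
    return " ".join(query.lower().split())


def recognize_intents(query: str) -> List[Candidate]:
    # Toy keyword recognizers standing in for per-domain intent models.
    keywords = {"music": ["play", "song"], "weather": ["weather", "rain"]}
    tokens = query.split()
    candidates = []
    for skill, words in keywords.items():
        hits = sum(word in tokens for word in words)
        if hits:
            candidates.append(Candidate(skill, hits / len(words)))
    return candidates


def rank(candidates: List[Candidate]) -> Optional[Candidate]:
    # Pick the highest-scoring candidate, or None if nothing was recalled.
    return max(candidates, key=lambda c: c.score, default=None)


def fetch_resource(skill: str) -> str:
    # Stand-in for the resource lookup of the chosen skill.
    resources = {"music": "a playlist", "weather": "today's forecast"}
    return resources[skill]


def postprocess(skill: str, resource: str) -> str:
    return f"[{skill}] Here is {resource}."


def handle(query: str) -> str:
    best = rank(recognize_intents(preprocess(query)))
    if best is None:
        return "Sorry, I did not understand that."
    return postprocess(best.skill, fetch_resource(best.skill))
```

For example, `handle("  Play a SONG ")` is routed to the toy music skill, while a query that no recognizer recalls falls through to the apology.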
The key algorithm modules of the semantic algorithm system are shown in the figure below. The following introduces the three core modules: semantic understanding, dialogue management and dialogue generation.
Intent recognition is the core module of semantic understanding. Its main task is to infer what the user wants to do by analyzing the user’s current query and interaction history. Typical scenarios include closed-domain, open-domain and context-dependent recognition.
Slot extraction is closely related to intent recognition. Its main task is to extract key information from the user’s current query and interaction history, so that the answer or content the user needs can be obtained accurately.
Intent recognition and slot extraction together constitute the semantic understanding module. Its difficulty lies mainly in the diversity of spoken language (distinct queries number in the hundreds of millions), ambiguity (for example, Peppa Pig is both an animation and an app) and dependence on knowledge (for example, “can or can not” is also a song title).
Dialogue management is another key module of the semantic algorithm system. Its task is to update the dialogue state according to the current query and the dialogue context, and to infer the best next response of the dialogue system.
After semantic understanding and dialogue management, dialogue generation is needed to deliver the final, appropriate feedback on skill execution. Its task is to produce a suitable response, in a suitable way, from the parsing results of semantic understanding and the action to be performed.
In terms of algorithm models, Xiaobu is mainly driven by deep learning. On the one hand, this kind of model works well; on the other hand, the technical schemes are relatively mature and there are many successful cases.
However, it is worth emphasizing that in this field there is basically no single “silver bullet” model that solves all technical problems. Generally, the main model based on deep learning secures the baseline effect, and customized rules are still needed to handle bad cases at the edges.
For the control skills of system applications, we mainly adopt a scheme that fuses rules with a deep learning model to improve semantic understanding: negative rules quickly reject out-of-domain queries, positive rules cover strong, typical phrasings, and the deep learning model is responsible for generalizing to general cases. In addition, multi-task joint learning is introduced to improve the joint accuracy of intent and slot.
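The order in which the three components run can be sketched as follows. This is a minimal illustration, assuming a single hypothetical phone-call skill; the regular expressions and `toy_model` are invented for the example and stand in for the real rule sets and classifier.

```python
import re
from typing import Callable

# Hypothetical rules for a phone-call skill; the patterns are illustrative only.
NEGATIVE_RULES = [re.compile(r"\b(movie|song|weather)\b")]
POSITIVE_RULES = [re.compile(r"^(call|dial) \w+")]


def toy_model(query: str) -> float:
    # Stand-in for a deep learning classifier: returns P(query is a call intent).
    return 0.9 if "phone" in query or "ring" in query else 0.1


def classify_call_intent(
    query: str,
    model: Callable[[str], float] = toy_model,
    threshold: float = 0.5,
) -> bool:
    text = query.lower().strip()
    # 1. Negative rules reject out-of-domain queries before anything else runs.
    if any(rule.search(text) for rule in NEGATIVE_RULES):
        return False
    # 2. Positive rules accept strong, unambiguous phrasings directly.
    if any(rule.search(text) for rule in POSITIVE_RULES):
        return True
    # 3. The model generalizes to everything the rules did not decide.
    return model(text) >= threshold
```

Running the cheap rules first means the model is consulted only for queries the rules leave undecided, which is one plausible reason for this ordering.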
Multi-task joint learning lets intent and slot disambiguate each other. It is mainly used in skills such as phone calls, SMS and schedules. Compared with learning each task independently, accuracy generally improves by 1% to 3%. Combined with detailed data-driven optimization and rule verification, the calling rate can exceed 95%.
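The text does not give the training objective. A common formulation, assumed here purely for illustration, sums the intent loss and a weighted average slot-tagging loss so that a shared encoder is trained on both signals. The function names and `slot_weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], gold: int) -> float:
    # Negative log-probability the model assigns to the gold class.
    return -math.log(probs[gold])


def joint_loss(
    intent_probs: Sequence[float],
    intent_gold: int,
    slot_probs: List[Sequence[float]],
    slot_gold: List[int],
    slot_weight: float = 1.0,
) -> float:
    # One intent label per utterance, one slot label per token.
    intent_loss = cross_entropy(intent_probs, intent_gold)
    token_losses = [cross_entropy(p, g) for p, g in zip(slot_probs, slot_gold)]
    slot_loss = sum(token_losses) / len(token_losses)
    # Both tasks contribute to one objective, so gradients from either task
    # would update the shared encoder in a real model.
    return intent_loss + slot_weight * slot_loss
```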
For skills that depend strongly on knowledge, such as music, radio, and film and television, we mainly adopt a knowledge-enhanced intent recognition scheme, as shown in the figure below. The main difficulty of such skills is that the intent cannot be judged from the sentence pattern alone, so accurately extracting the resource fields from the query is essential. Recognizing intent after integrating the resource lookup results significantly reduces the difficulty of the problem.
Unlike closed-domain recognition, open-domain intent recognition is hard to model as a classification problem and generally has to be solved by semantic matching. We mainly adopt deep semantic matching, as shown in the figure below. It works better than traditional matching on text symbols, and the matching accuracy can exceed 95%. However, it still has problems such as subject recognition and semantic containment, which must be controlled with downstream verification strategies. At present it is mainly used for information query and chat QA matching.
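Semantic matching of this kind usually compares a vector for the query against vectors for known questions. The sketch below assumes the vectors already come from some sentence encoder (not shown) and uses cosine similarity with a hypothetical threshold of 0.8; the three-dimensional toy vectors are invented for the example.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(
    query_vec: List[float],
    index: Dict[str, List[float]],
    threshold: float = 0.8,
) -> Optional[Tuple[str, float]]:
    # Return the closest indexed question, or None when even the best
    # candidate is below the threshold (the query is left unmatched).
    if not index:
        return None
    scored = [(question, cosine(query_vec, vec)) for question, vec in index.items()]
    question, score = max(scored, key=lambda item: item[1])
    return (question, score) if score >= threshold else None


# Toy index: in practice the vectors would be produced by a trained encoder.
INDEX = {
    "how tall is mount everest": [0.9, 0.1, 0.0],
    "tell me a joke": [0.0, 0.2, 0.9],
}
```

The threshold plays the role of a simple downstream check: a query that is only loosely similar to every indexed question is rejected rather than answered with the nearest neighbor.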
In addition, to further improve semantic understanding, we are exploring how to deploy large-scale complex models. For large-scale pre-trained language models, the team has improved, retrained and fine-tuned open-source models and achieved rapid gains. At present the model ranks sixth overall on the Chinese Language Understanding Evaluation benchmark (CLUE).
However, such models have high computational complexity and generally cannot meet the latency requirements of online inference. They need to be combined with acceleration schemes such as knowledge distillation before they can be applied.
Common knowledge distillation schemes can be divided into data distillation and model distillation. Data distillation assumes that a simple model underperforms a complex model because it lacks labeled data: if the complex model supplies enough pseudo-labeled data, the simple model can gradually approach the complex model’s effect. Model distillation assumes that the simple model lacks not only data but also good guidance: if the intermediate results obtained while training the complex model guide the training of the simple model, the simple model can approach the complex model’s effect. Both data distillation and model distillation have been applied in the Xiaobu assistant business.
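The two ideas can be contrasted in a few lines. In this sketch, `pseudo_label` illustrates data distillation (the teacher’s prediction becomes a training label), and `soft_target_loss` illustrates one widely used form of model distillation, matching temperature-softened output distributions. Using output logits as the guiding signal is an assumption made for the example; the text speaks only of intermediate results.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits: List[float]) -> int:
    # Data distillation: the teacher's top class labels an unlabeled example.
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def soft_target_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    # Model distillation: cross-entropy between the softened teacher and
    # student distributions; it is smallest when the two distributions agree.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```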
A dialogue system is also regarded as the next-generation search engine, and users have many knowledge Q&A needs for which they expect accurate answers. To meet these needs, we build our own knowledge base through data acquisition and data mining, and provide Q&A services by combining online semantic matching with KBQA.
In addition, to answer vertical factual questions accurately, we have built a general Q&A capability based on a knowledge graph. For high-quality vertical categories, we build domain graphs through data partnerships and our own crawling, and then answer precisely based on templates and the graphs.
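Template-based Q&A over a graph can be sketched as: match the question against a template, read off the subject and relation, then look up the triple. The templates and the tiny graph below are hypothetical examples chosen for the sketch, not the production data.

```python
import re
from typing import Optional

# Each template maps a question pattern to a relation in the graph.
TEMPLATES = [
    (re.compile(r"^what is the capital of (?P<subject>.+?)\??$"), "capital"),
    (re.compile(r"^what currency does (?P<subject>.+?) use\??$"), "currency"),
]

# Toy domain graph stored as (subject, relation) -> object.
GRAPH = {
    ("france", "capital"): "Paris",
    ("japan", "capital"): "Tokyo",
    ("japan", "currency"): "yen",
}


def answer(question: str) -> Optional[str]:
    text = question.lower().strip()
    for pattern, relation in TEMPLATES:
        match = pattern.match(text)
        if match:
            # A matched template with no triple in the graph yields None,
            # so the system never invents an answer.
            return GRAPH.get((match.group("subject"), relation))
    return None
```

Returning `None` both for unmatched questions and for missing triples leaves room for a fallback, such as the semantic matching service, to take over.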
At present, more than 50% of Xiaobu assistant’s online head traffic is served by the self-built knowledge Q&A services, and the long tail is covered by companies with strong search capabilities such as Duer and Sogou.
For dialogue management, commonly used schemes include those based on finite state machines, those based on slot filling, and end-to-end schemes. The difficulties are flexible flow control, context inheritance and forgetting, intent jumps, exception handling and so on. At present the slot-filling mode is mainly adopted.
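A slot-filling manager keeps the current intent and the slots collected so far, asks for whatever is still missing, and executes once everything is filled. The sketch below is a minimal illustration; the `set_alarm` intent, its required slots and the prompts are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical slot requirements and clarification prompts.
REQUIRED_SLOTS = {"set_alarm": ["date", "time"]}
PROMPTS = {"date": "Which day?", "time": "What time?"}


@dataclass
class DialogueState:
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


def step(state: DialogueState, intent: Optional[str], slots: Dict[str, str]) -> str:
    # A new explicit intent is treated as an intent jump: old slots are forgotten.
    if intent is not None and intent != state.intent:
        state.intent, state.slots = intent, {}
    # Otherwise the turn inherits the context and only adds new slot values.
    state.slots.update(slots)
    if state.intent is None:
        return "What would you like to do?"
    for name in REQUIRED_SLOTS.get(state.intent, []):
        if name not in state.slots:
            return PROMPTS[name]
    return f"Executing {state.intent} with {state.slots}"
```

For example, “set an alarm for 7am” followed by “tomorrow” fills `time` in the first turn and `date` in the second, after which the skill executes.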
To achieve better context understanding in multi-turn dialogue, Xiaobu assistant implements a context understanding scheme based on anaphora resolution, which handles the common problems of anaphora and ellipsis in multi-turn dialogue.
Reference: ACL 2019, “Improving Multi-turn Dialogue Modelling with Utterance Rewriter”
With the help of dialogue management and context understanding, Xiaobu assistant supports immersive strong multi-turn, freely switching weak multi-turn, context-reasoning multi-turn and other modes, covering task-oriented, information query, multi-turn chat and other business scenarios.
For dialogue generation, there are three types of scheme in the industry: template-based, retrieval-based and generative model-based. Because generative models are weakly controllable, Xiaobu mainly adopts template-based and retrieval-based schemes, and generative models are still in pre-research.
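The controllability of the template-based approach is easy to see in code: every possible reply is written in advance and only the slot values vary. The sketch below uses hypothetical intents, actions and wording.

```python
from typing import Dict

# Hypothetical response templates keyed by (intent, action).
RESPONSE_TEMPLATES = {
    ("set_alarm", "confirm"): "OK, alarm set for {time} {date}.",
    ("set_alarm", "ask_time"): "What time should I set the alarm for?",
}
FALLBACK = "Sorry, I cannot help with that yet."


def generate(intent: str, action: str, slots: Dict[str, str]) -> str:
    template = RESPONSE_TEMPLATES.get((intent, action))
    if template is None:
        return FALLBACK
    try:
        return template.format(**slots)
    except KeyError:
        # A template that needs a slot we do not have is never half-filled.
        return FALLBACK
```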
On the algorithm engineering side, in the early stage a Python-based service framework was provided so that services could go online quickly, with multiple deployed instances compensating for the weak concurrency of a single service. At present, for services with high computational complexity, we are also exploring the engineering refactoring and optimization of operators, and are working with the machine learning platform team on a simpler and more efficient serving model.
For skill building, in the early stage we focused on custom development of skills in order to go online quickly. Construction of the skill platform began at the end of last year. The main idea is to standardize offline model generation and online inference, package the key algorithms as operators, and complete skill development through data import and flow configuration, thereby reducing the cost of supporting and maintaining mid-tail and long-tail skills.
Finally, to safeguard the dialogue interaction experience, we have worked with the data team and the evaluation team to build a closed-loop monitoring scheme for the whole process. First, R&D self-testing ensures that the algorithm model meets expectations; a round of batch testing at release then ensures that no new risks are introduced. After launch, routine monitoring and real-time monitoring ensure, respectively, the overall effect and the normal operation of key functions. In addition, manual sampling evaluation and third-party evaluation are introduced to further monitor the experience.
5 Challenges and future thinking
Although dialogue interaction has made great progress in algorithm technology in recent years, many challenges remain before it matches the Jarvis and Dabai that users expect.
First, in semantic understanding, current models are essentially statistical induction over data, and they lack robustness and completeness in extreme cases.
Second, as a candidate to replace the search engine, dialogue interaction is bound to take on the role of a “know-it-all”. Low-frequency Q&A, however, is open-domain, has an obvious long-tail effect and depends heavily on knowledge content, which makes it difficult and costly to build.
In addition, unlike the relatively mature search and recommendation scenarios, iterative optimization of dialogue interaction ability relies mainly on manual effort. It is hard to connect to a fast, big-data-driven engine of self-feedback and self-learning, so rapid improvement is difficult.
The challenges ahead go far beyond these. We will continue to explore stronger semantic understanding, broader knowledge, smoother dialogue, dialogue management across more domains, and self-feedback, weakly supervised and self-evolving learning abilities, and we will work tirelessly to create the intelligent assistant with the best user experience in the Chinese-language field.
Colleagues interested in intelligent assistants and dialogue interaction technology are welcome to exchange ideas and discuss with us!
Introduction to the author
Zhenyu, head of NLP and dialogue algorithms at the OPPO Xiaobu Intelligent Center
A candidate of the Shenzhen high-level talent program, he received his bachelor’s degree and doctorate in computer science from the University of Science and Technology of China.
In recent years, he has focused on the research, development and deployment of key algorithm technologies for dialogue AI. He joined OPPO in 2018 to lead the construction of the NLP and dialogue algorithm system. His representative academic papers have been cited more than 800 times, and he has won the second prize for scientific and technological progress of colleges and universities (Science and Technology) once and the second prize for scientific and technological progress of Hunan Province twice.