Natural language intelligence (NLP)
Natural language intelligence research aims to realize effective communication between humans and computers through language. It is a science integrating linguistics, psychology, computer science, mathematics, and statistics, and it involves the analysis, extraction, understanding, transformation, and generation of natural and formal languages.
Artificial intelligence can be divided into several stages:
• Computational intelligence: relying on powerful computing power and the capacity to store massive data, machines can surpass human performance in some fields. A representative example is Google’s AlphaGo: with the computing power of Google’s TPUs and algorithms such as Monte Carlo tree search and reinforcement learning, it can find good decision paths in the enormous search space of Go and defeat human players. This is computational intelligence;
• Perceptual intelligence: identifying important elements from unstructured data, for example analyzing a query to extract person names, place names, organization names, and so on;
• Cognitive intelligence: on the basis of perception, understanding the meaning of those elements and performing some reasoning over them. For example, “whose son is Nicholas Tse” and “who is Nicholas Tse’s son” use nearly the same words and entities but differ greatly in semantics; distinguishing them is a problem of cognitive intelligence;
• Creative intelligence: on the basis of understanding semantics, the computer creates sentences that hold up in common sense, semantics, and logic. For example, it can automatically write flowing novels, compose beautiful music, and chat naturally with people.
Natural language processing research covers perceptual, cognitive, and creative intelligence, and is a necessary technology for realizing complete artificial intelligence.
Development trends of natural language intelligence
- Deep language models are a breakthrough, driving progress in key natural language technologies;
- Public cloud NLP services are moving from general-purpose functions to customized services;
- Natural language technology is increasingly combined with specific industries and scenarios, producing greater value.
NLP platform capabilities of Alibaba Group
From bottom to top, the platform is divided into NLP data, NLP basic capabilities, NLP application technology, and upper-layer applications.
NLP data is the raw material of many algorithms, and includes language dictionaries, entity knowledge dictionaries, syntax dictionaries, sentiment analysis dictionaries, and so on. Alibaba’s basic NLP technology includes lexical analysis, syntactic analysis, text analysis, and deep models. On top of these sit NLP vertical technologies, including Q&A and dialogue technology, anti-spam, address resolution, and so on. Together these technologies support many applications, among which search is a very typical NLP application.
NLP applications and typical technologies in Open Search
The infrastructure includes basic Alibaba Cloud products as well as several self-developed search systems built for Alibaba’s ecosystem search scenarios, such as HA3, RTP, DII, etc.;
The basic control platform supports our offline data collection, management, training, and so on;
The algorithm module is divided into two parts: query analysis, including multi-granularity word segmentation, entity recognition, and error correction and rewriting; and relevance and ranking, including text relevance, CTR/CVR estimation, LTR, and so on;
(components with an orange background are NLP-related)
The goal of Open Search is to provide one-stop, out-of-the-box intelligent search services, so we expose these algorithm capabilities to users in the form of industry templates, scenario templates, and peripheral services.
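To make the relevance-and-ranking half of the algorithm module concrete, here is a toy pointwise scoring sketch. The linear weighted sum and the feature names (`text_rel`, `ctr`) are illustrative assumptions; the production LTR and CTR/CVR models mentioned above are learned models, not hand-set weights.

```python
def rank(docs, weights):
    """Score each document by a weighted sum of its feature values and sort
    descending. A linear stand-in for a learned ranking model."""
    def score(doc):
        return sum(weights.get(name, 0.0) * value
                   for name, value in doc["features"].items())
    return sorted(docs, key=score, reverse=True)

# Hypothetical documents with text-relevance and predicted-CTR features.
docs = [
    {"id": 1, "features": {"text_rel": 0.2, "ctr": 0.9}},
    {"id": 2, "features": {"text_rel": 0.9, "ctr": 0.1}},
]
ordered = rank(docs, {"text_rel": 1.0, "ctr": 0.5})
```

With these (made-up) weights, document 2 outranks document 1 because text relevance is weighted twice as heavily as predicted click-through rate.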
The NLP analysis pipeline in Open Search
A search is usually triggered by a search keyword, for example a user searching for “aj1 new North Carolina sneakers”.
Cross-domain word segmentation
We provide a series of domain-specific word segmentation models in Open Search.
Word segmentation challenge
- New words appearing in various domains can greatly reduce the effectiveness of word segmentation;
- The cost of the whole pipeline, from labeling to training, is relatively high;
- By combining statistical features such as mutual information and left/right entropy, a word-formation model can be constructed to quickly build a domain dictionary from user data;
- By combining the source-domain word segmentation model with the target-domain dictionary, we can quickly build a target-domain segmenter based on distant supervision;
(the figure above shows the automatic cross-domain word segmentation framework)
Users only need to provide some of their own business corpus data, and we can automatically produce a customized word segmentation model, which greatly improves efficiency and meets customer needs faster.
With this technology we achieve better results than open-source general-purpose segmenters across various domains.
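The mutual-information / boundary-entropy idea behind dictionary mining can be sketched as follows. This is a toy character-level implementation: the thresholds, the candidate length limit, and the scoring details are illustrative assumptions, not Open Search's actual word-formation model.

```python
import math
from collections import Counter

def discover_words(corpus, max_len=4, min_count=2,
                   min_pmi=1.0, min_entropy=0.5):
    """Find candidate new words: high internal cohesion (pointwise mutual
    information over every binary split) plus high boundary freedom
    (entropy of left/right neighboring characters)."""
    text = "".join(corpus)
    total = len(text)
    ngrams = Counter()
    left, right = {}, {}
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            w = text[i:i + n]
            ngrams[w] += 1
            if n > 1:
                # Record neighboring characters ("^"/"$" at the edges).
                left.setdefault(w, Counter())[text[i - 1] if i > 0 else "^"] += 1
                j = i + n
                right.setdefault(w, Counter())[text[j] if j < len(text) else "$"] += 1

    def prob(w):
        return ngrams[w] / total

    def entropy(counter):
        s = sum(counter.values())
        return -sum(c / s * math.log(c / s) for c in counter.values())

    found = {}
    for w, c in ngrams.items():
        if len(w) < 2 or c < min_count:
            continue
        # Cohesion: the weakest split determines the PMI score.
        pmi = min(math.log(prob(w) / (prob(w[:k]) * prob(w[k:])))
                  for k in range(1, len(w)))
        ent = min(entropy(left[w]), entropy(right[w]))
        if pmi >= min_pmi and ent >= min_entropy:
            found[w] = (pmi, ent)
    return found
```

A string like "ab" that always co-occurs (high PMI) and appears in many different contexts (high boundary entropy) is promoted to the domain dictionary; fragments that occur rarely or only in one context are filtered out.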
Named entity recognition
Named entity recognition (NER) extracts elements such as person names, place names, and times from a query.
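As a minimal illustration of what NER output looks like, here is a sketch that decodes BIO tag sequences into entity spans. The BIO scheme and the example tags (`LOC`, `TIME`) are standard conventions, not necessarily the exact scheme of the Open Search model.

```python
def decode_bio(tokens, tags):
    """Turn parallel token/BIO-tag sequences into (entity text, type) spans.
    B- starts an entity, I- continues one of the same type, O is outside."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and current_type == tag[2:]:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities
```

For example, the (hypothetical) tagging `["New", "York", "next", "Monday"]` / `["B-LOC", "I-LOC", "B-TIME", "I-TIME"]` decodes into a place name and a time expression.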
Challenges and difficulties
NER has been studied extensively in NLP, but it still faces many challenges. In Chinese especially, because of the lack of natural separators, NER struggles with boundary ambiguity, semantic ambiguity, nested ambiguity, and so on.
The upper right corner of the figure below shows the model architecture we use in Open Search;
In Open Search, many users have accumulated large dictionary-style entity libraries. To make full use of these dictionaries, we propose GraphNER, a framework that organically integrates this knowledge on top of BERT. As the table in the lower right corner shows, it achieves the best results on Chinese;
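One simple way such user dictionaries are typically exploited is to match them against the query to produce candidate entity spans. The forward-maximum-matching sketch below is an illustrative baseline only; it is not the GraphNER knowledge-integration mechanism itself, which fuses dictionary evidence into the neural model.

```python
def dictionary_match(text, gazetteer, max_len=6):
    """Forward maximum matching: at each position, take the longest
    substring found in the entity dictionary. Returns candidate spans
    as (start, end, surface, entity_type)."""
    spans = []
    i = 0
    while i < len(text):
        match = None
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if cand in gazetteer:
                match = (i, i + n, cand, gazetteer[cand])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched entity
        else:
            i += 1
    return spans

# Hypothetical user dictionary for a sneaker-shop scenario.
gaz = {"aj1": "PRODUCT", "north carolina": "COLORWAY"}
spans = dictionary_match("aj1 north carolina shoes", gaz, max_len=14)
```

Greedy longest-match resolves overlaps in favor of longer dictionary entries, which is usually what sellers intend when they upload multi-word entity names.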
Query error correction
Error correction in Open Search is divided into four steps: mining, training, evaluation, and online prediction.
The main models are based on statistical machine translation and neural machine translation, and we have a complete set of methods covering performance, display style, and manual intervention.
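As a simplified illustration of the correction idea (not the translation-model pipeline described above), the classic noisy-channel approach generates edit-distance-1 candidates and picks the most frequent in-vocabulary one:

```python
from collections import Counter

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz0123456789"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab_counts):
    """Return the input if it is in-vocabulary; otherwise the most frequent
    in-vocabulary candidate one edit away, falling back to the input."""
    if word in vocab_counts:
        return word
    candidates = [w for w in edits1(word) if w in vocab_counts]
    return max(candidates, key=vocab_counts.get, default=word)
```

Here the vocabulary counts play the role of the language model; a translation-based corrector additionally learns an error model from mined (typo, correction) pairs.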
Semantic matching
The emergence of deep language models has brought leapfrog improvements to many NLP tasks, especially semantic matching.
DAMO Academy has also made many innovations around BERT, proposing the self-developed StructBERT. Its main innovation is to add, during deep language model pre-training, objective functions for word-order prediction and sentence-structure prediction, and to train them in a multi-task fashion. But a general-purpose StructBERT cannot be applied directly to the thousands of customers and domains in Open Search; domain adaptation is needed. We therefore propose a three-stage semantic matching paradigm that lets customers quickly customize a semantic matching model suited to their own business.
(the specific process is shown in the figure)
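To illustrate what a semantic matching score is, here is a bag-of-words cosine similarity sketch. In the paradigm above the score would come from a fine-tuned StructBERT model rather than raw term counts; this toy version only shows the query-document scoring interface.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match_score(query_tokens, doc_tokens):
    """Bag-of-words stand-in for a learned semantic matcher: in the
    article's setting this score would come from a fine-tuned deep
    language model such as StructBERT."""
    return cosine(Counter(query_tokens), Counter(doc_tokens))
```

A learned matcher improves on this baseline precisely where term overlap fails, e.g. matching "sneakers" against "basketball shoes"; the bag-of-words score there is zero while a semantic model can still score the pair highly.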
Productizing NLP algorithms
The system architecture for productizing the algorithm modules includes offline computing, the online engine, and the product console.
The light blue parts of the figure show the NLP algorithm functions in Open Search, which users can directly experience and use in the console.
This article is original content from Alibaba Cloud and may not be reproduced without permission.