The evolution of geographic text processing technology at Gaode


1、 Background

The functions of a map app can be summarized as positioning, search, and navigation, which answer three questions: where am I, where do I want to go, and how do I get there. In the Gaode Map search scenario, the input is geography-related information such as the query, the user's location, and the map view in the app, and the output is the POI the user wants. How accurately we find that POI, and how satisfied the user is with it, are the most important measures of search quality.

A search engine can usually be divided into three parts: query analysis, recall, and ranking. Query analysis tries to understand what the query means and provides guidance for recall and ranking.

Query analysis in map search includes general NLP techniques such as word segmentation, component analysis, synonyms, and error correction, as well as map-specific intention understanding such as city analysis, where-what analysis, and path planning analysis.

[Figure: common query intentions in map scenarios]
Query analysis is a strategy-intensive part of a search engine and usually applies a wide range of NLP techniques. In the map scenario it only has to handle geography-related text, which is less diverse than web search, so it may seem simpler. However, geographic text is usually short, and most users want only a small number of results, which demands very high accuracy. Analyzing text well in map scenarios and improving the quality of search results is therefore full of challenges.

2、 Overall technical framework
[Figure: search architecture]

Like a general retrieval architecture, the map retrieval architecture consists of query analysis, recall, and ranking. A priori, the user's input can be interpreted as expressing several possible intentions, and a request is sent for each of them to try to obtain retrieval results. A posteriori, once the retrieval results for each intention are available, we judge them together and choose the best one.
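The a priori / a posteriori flow can be illustrated with a minimal sketch. Everything named here (`IntentResult`, `search_with_intents`, the intent labels, the `score` field, and the stub retrievers) is hypothetical and only stands in for real recall services; it is not the production design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class IntentResult:
    intent: str         # name of the interpretation (hypothetical labels)
    results: List[str]  # what the retrieval for this intent returned
    score: float        # hypothetical quality score of the result set


def search_with_intents(
    query: str, retrievers: Dict[str, Callable[[str], IntentResult]]
) -> Optional[IntentResult]:
    # A priori: each plausible intention issues its own retrieval request.
    outcomes = [retrieve(query) for retrieve in retrievers.values()]
    # A posteriori: compare the actual result sets and keep the best one.
    non_empty = [o for o in outcomes if o.results]
    return max(non_empty, key=lambda o: o.score) if non_empty else None


# Stub retrievers standing in for real recall services.
retrievers = {
    "poi_search": lambda q: IntentResult("poi_search", [], 0.0),
    "path_planning": lambda q: IntentResult("path_planning", ["route A -> B"], 0.9),
}
best = search_with_intents("from A to B", retrievers)
print(best.intent)
```

The point of the design is that no intention is discarded before retrieval; the decision is deferred until the actual results can be compared.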
[Figure: query analysis process]

Intention understanding can be divided into two parts: basic query analysis and application query analysis. Basic query analysis uses common NLP techniques to understand the query, including word segmentation, component analysis, omission, synonyms, and error correction. Application query analysis targets problems specific to the map scenario, including identifying the user's target city, deciding whether the query is a where + what expression, and deciding whether it expresses a path planning request from A to B.
[Figure: overall technology evolution]

In text processing, the overall technology has evolved from rule-based methods, through the gradual introduction of machine learning, to the full application of machine learning. Because the search module is a highly concurrent online service, the conditions for introducing deep models are more stringent. As the performance problems were gradually solved, we introduced deep learning into the individual sub-directions one by one and achieved a new round of improvements.

In recent years NLP has developed rapidly, and models such as BERT and XLNet have topped the leaderboards one after another. We are gradually unifying the query analysis subtasks, using a unified vector representation to express user needs, and carrying out seq2seq multi-task learning. This further improves the effect while keeping the system from becoming bloated.

In this article we introduce how the related technologies have evolved over the past few years. We select some topics and divide them into two parts. The first part covers query analysis techniques common to search engines, including error correction, rewriting, and omission. The second part covers query analysis techniques unique to the map scenario, including city analysis, where-what analysis, and path planning.

3、 Evolution of general query analysis technology

3.1 Error correction

In a search engine, the queries that users type are often misspelled. If the erroneous query is retrieved directly, the user will not get the desired result. Therefore both general and vertical search engines correct the user's query and try to obtain the query the user most probably intended.

In current map search, about 6% to 10% of user requests contain input errors, so query error correction is a very important module that can greatly improve the search experience.

In search engines, low-frequency and mid- and long-tail problems are often hard to solve, and they are also the main problems of the error correction module. In addition, map search differs clearly from general search: map queries are more structured, and the segments of a query often carry location information. Making good use of this structured information to better identify the user's intention is the challenge unique to map error correction.

Common error classification

(1) Same or similar pinyin, for example: PANQIAO Logistics Park – PANQIAO Logistics Park (different characters with the same pinyin)
(2) Similar glyph shapes, for example: Maoli, Hebei – Changli, Hebei
(3) Extra or missing characters, for example: Quanzhou top Street – Quanzhou top street

Status quo of error correction

The original error correction module includes several recall methods:

Pinyin error correction: mainly handles pinyin errors in short queries. Queries with identical or fuzzy-matching pinyin are used as correction candidates.
Glyph error correction: also called similar-shape character correction. Candidates are generated by traversing the query and replacing characters with similar-shaped ones; the resulting queries are filtered and added as candidates.
Combinatorial error correction: replacements are made through a translation model, whose resources are mainly replacement pairs mined through query alignment.
In the calculation formula of the combinatorial error correction translation model, P(f) is the language model and P(f | e) is the replacement model.
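As a rough illustration of how the two models might be combined, the sketch below scores each candidate f for an original query e by log P(f) + log P(f | e) and picks the highest. Combining the two terms as a product is an assumption here, and the function names, probability tables, and numbers are invented for illustration.

```python
import math


def score_candidate(candidate, query, lm_prob, replace_prob):
    """log P(f) + log P(f | e) for candidate f and original query e."""
    return math.log(lm_prob[candidate]) + math.log(replace_prob[(candidate, query)])


def best_correction(query, candidates, lm_prob, replace_prob):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda f: score_candidate(f, query, lm_prob, replace_prob))


# Toy numbers (assumed): the corrected form is far more likely under the
# language model, which outweighs the cost of making a replacement.
query = "hebei maoli"
lm_prob = {"hebei changli": 1e-4, "hebei maoli": 1e-7}
replace_prob = {
    ("hebei changli", query): 0.3,  # replace "maoli" with "changli"
    ("hebei maoli", query): 0.7,    # keep the query unchanged
}
corrected = best_correction(query, list(lm_prob), lm_prob, replace_prob)
print(corrected)
```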

Problem 1: the recall methods are limited. The main recall strategies of the query error correction module are pinyin recall, similar-shape character recall, and replacement resource recall. Their ability to handle low-frequency cases is limited.

Problem 2: the ranking is unreasonable. Error correction is split by recall method into several independent modules, each doing its own recall and ranking, so candidates from different recall methods are never compared with one another.

Technical transformation

Transformation 1: entity error correction based on spatial relationships
The original error correction relies mainly on fragment replacement resources mined from user sessions, so its ability to solve low-frequency problems is limited. Long-tail problems, however, are concentrated in the low-frequency range, so low-frequency errors are the current pain point.

One big difference between map search and general search is that map queries are more structured, for example: Shoukai Square, No. 10, Furong Street, Chaoyang District, Beijing. We can segment the query structurally (this is the job of component analysis in the map scenario) and obtain a structured description with categories, for example tagging Shoukai Square as [general entity] and the remaining segments (Beijing, Chaoyang District, Furong Street, No. 10) with their respective administrative and address categories.

We also have authoritative geographic knowledge data. We use the authoritative geographic entity database to build a prefix tree + suffix tree index, look up the suspected erroneous segment in the index to recall candidates from its posting lists ("zipper" recall), and use the containment relationships in the entity database to filter the correction results. In practice this method is clearly effective for low-frequency administrative division and entity errors.
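The recall-then-filter idea can be sketched as follows. This is a simplification under stated assumptions: dictionaries keyed on fixed-length prefixes and suffixes stand in for real prefix and suffix trees, romanized strings stand in for Chinese characters, and `ENTITY_PARENT` is a made-up miniature entity database.

```python
from collections import defaultdict

# Hypothetical entity database: entity name -> parent administrative division.
ENTITY_PARENT = {
    "changli": "hebei",
    "changle": "fujian",
    "changping": "beijing",
    "funing": "hebei",
}


def build_affix_index(names, k=2):
    """Index every entity by its first k and last k characters."""
    prefix_index, suffix_index = defaultdict(set), defaultdict(set)
    for name in names:
        prefix_index[name[:k]].add(name)
        suffix_index[name[-k:]].add(name)
    return prefix_index, suffix_index


def recall_candidates(segment, parent, prefix_index, suffix_index, k=2):
    # Recall entities sharing a prefix or a suffix with the suspect segment.
    recalled = prefix_index.get(segment[:k], set()) | suffix_index.get(segment[-k:], set())
    # Keep only entities that really belong to the parent named in the query.
    return sorted(c for c in recalled if c != segment and ENTITY_PARENT.get(c) == parent)


prefix_index, suffix_index = build_affix_index(ENTITY_PARENT)
# Query "maoli, hebei": "maoli" is not a known entity under "hebei".
candidates = recall_candidates("maoli", "hebei", prefix_index, suffix_index)
print(candidates)
```

The parent check is what makes the spatial relationship useful: a candidate that matches on characters but sits under a different division is discarded.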

Character similarity calculation based on radicals

In the ranking strategy, the glyph edit distance is an important feature. We developed a glyph similarity calculation based on radicals (character components), which makes the edit distance finer-grained and more accurate. The Chinese character information used includes a table that splits each character into its components and the stroke count of each character.
The similarity of two characters is computed from the components they have in common.
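A minimal sketch of a component-based similarity and the resulting weighted edit distance follows. The Dice coefficient is an assumed choice of similarity measure, the tiny `COMPONENTS` table is illustrative only, and stroke counts are left out for brevity.

```python
from collections import Counter

# Toy decomposition table (illustrative only): character -> components.
COMPONENTS = {
    "明": ["日", "月"],
    "朋": ["月", "月"],
    "林": ["木", "木"],
    "森": ["木", "木", "木"],
}


def glyph_similarity(a, b):
    """Dice coefficient over the multisets of components of two characters."""
    if a == b:
        return 1.0
    ca = Counter(COMPONENTS.get(a, [a]))
    cb = Counter(COMPONENTS.get(b, [b]))
    shared = sum((ca & cb).values())
    return 2.0 * shared / (sum(ca.values()) + sum(cb.values()))


def glyph_edit_distance(s, t):
    """Levenshtein distance where substituting similar glyphs is cheaper."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [float(i)]
        for j, ct in enumerate(t, 1):
            substitute = prev[j - 1] + (1.0 - glyph_similarity(cs, ct))
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, substitute))
        prev = cur
    return prev[-1]


print(glyph_similarity("林", "森"))        # shares two of the 木 components
print(glyph_edit_distance("森林", "林林"))  # cheaper than a plain substitution
```

With a plain edit distance every substitution costs 1; here a visually close pair costs only a fraction, so glyph-confusable candidates rank ahead of unrelated ones.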

Transformation 2: restructuring the ranking strategy
In the original strategy, recall and ranking were coupled inside each recall path, so improving one path tended to come at the expense of another. To make full use of the strengths of the different recall methods, recall and ranking had to be decoupled and ranking optimized globally. We therefore added a ranking module and split the process into two stages: recall and ranking.
Model selection

For this ranking problem, we follow industry practice and train a pairwise GBRank model.

Sample construction

Samples are constructed from online output combined with manual review.

Feature construction
(1) Semantic features, such as a statistical language model.
(2) Popularity features: PV, clicks, and so on.
(3) Basic features: edit distance, word segmentation and component features, cumulative distribution features, and so on.
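A pairwise ranker is trained on preference pairs rather than on single candidates. The sketch below shows one plausible way to turn reviewed candidates of a single query into (better, worse) feature pairs; the feature layout `[language model score, log PV, edit distance]`, the labels, and all numbers are assumptions made for illustration.

```python
from itertools import combinations


def build_pairs(candidates):
    """Turn labeled candidates of one query into (better, worse) feature pairs."""
    pairs = []
    for a, b in combinations(candidates, 2):
        if a["label"] == b["label"]:
            continue  # no preference between them, so no training pair
        better, worse = (a, b) if a["label"] > b["label"] else (b, a)
        pairs.append((better["features"], worse["features"]))
    return pairs


# Candidates for one erroneous query; label 1 means reviewed as correct.
# Assumed feature layout: [language model score, log PV, edit distance].
candidates = [
    {"text": "hebei changli", "features": [-9.2, 4.1, 0.2], "label": 1},
    {"text": "hebei maoli", "features": [-16.1, 0.3, 0.0], "label": 0},
    {"text": "hebei changle", "features": [-12.5, 1.0, 1.0], "label": 0},
]
pairs = build_pairs(candidates)
print(len(pairs))
```

Because the pairs depend only on features and labels, a new recall method merely adds candidates to this step and the ranker can be retrained without other changes.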

This work resolves two pain points of the error correction module. First, it handles most low-frequency error correction problems in the map scenario. Second, it restructures the module's process by decoupling recall from ranking, so that every recall path can play its full role. After a recall method is updated, only the ranking model needs to be retrained, which makes the module more reasonable and lays a good foundation for the later upgrade to deep models. Within this framework we later used a deep model for seq2seq error correction recall and obtained further gains.

3.2 Rewriting

As a form of query transformation, error correction recall has many limitations, and some atypical query transformations fall outside every existing strategy. For example, a user may describe the new rural cooperative medical service hall of Yongcheng City in words that differ from the name of the target POI, even though the semantics of the description are close to a high-frequency query for that POI.

We therefore propose query rewriting: rewriting a low-frequency query into a high-frequency query with a similar meaning, so as to better serve the diverse ways in which users express their needs.

This module was built from scratch. Users express queries in diverse ways that rules cannot enumerate. The intuitive idea is vector recall, but vector recall tends to over-generalize, which does not suit retrieval in map scenarios. These are all issues that have to be addressed in practice.

Overall, the scheme has three stages: recall, ranking, and filtering.

Recall stage
We investigated several sentence vector representations and chose SIF (smooth inverse frequency), which is algorithmically simple and comparable in effect to CNN- and RNN-based models. Vector recall can use the open source Faiss vector search engine; here we use a vector retrieval engine from within Alibaba that performs better.
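For readers unfamiliar with SIF, here is a small pure-Python sketch of the idea: average the word vectors with weight a / (a + p(w)), then remove each sentence vector's projection on the first principal direction. The word vectors, word probabilities, and the value of `a` are toy assumptions, and power iteration is used only to avoid a numerical library dependency.

```python
import math


def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
    """Return SIF sentence vectors and the common direction that was removed."""
    dim = len(next(iter(word_vecs.values())))
    vectors = []
    for tokens in sentences:  # every sentence is assumed to be non-empty
        v = [0.0] * dim
        for w in tokens:
            weight = a / (a + word_prob[w])  # rare words weigh more
            v = [x + weight * y for x, y in zip(v, word_vecs[w])]
        vectors.append([x / len(tokens) for x in v])

    # First principal direction of the sentence matrix via power iteration.
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(100):
        proj = [sum(x * y for x, y in zip(v, u)) for v in vectors]
        nxt = [sum(p * v[i] for p, v in zip(proj, vectors)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]

    # Remove each vector's projection on the common direction.
    cleaned = []
    for v in vectors:
        p = sum(x * y for x, y in zip(v, u))
        cleaned.append([x - p * y for x, y in zip(v, u)])
    return cleaned, u


word_vecs = {
    "yongcheng": [1.0, 0.2, 0.0],
    "medical": [0.1, 1.0, 0.3],
    "hall": [0.0, 0.4, 1.0],
    "the": [0.5, 0.5, 0.5],
}
word_prob = {"yongcheng": 1e-5, "medical": 1e-4, "hall": 1e-4, "the": 5e-2}
sentences = [["yongcheng", "medical", "hall"], ["the", "medical", "hall"], ["yongcheng", "hall"]]
sentence_vecs, common = sif_embeddings(sentences, word_vecs, word_prob)
```

The resulting vectors would then be indexed in a vector search engine, and the nearest high-frequency queries recalled as rewrite candidates.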

Ranking stage
Sample construction
For the original query and the set of high-frequency query candidates, we compute semantic similarity, select the top K candidates by similarity, and manually label them as training samples.

Feature construction

1. Basic text features
2. Edit distance
3. Combination features

Model selection

Score regression using XGBoost

Filtering stage
Query over-generalization is severe in vector recall. To make the approach usable in the map scenario, an alignment model is added. Two statistical alignment models, GIZA++ and fast_align, were tried. Experiments show that their effect is almost the same, but fast_align is faster, so fast_align was chosen.
The recall results are further filtered by alignment probability and non-alignment probability to obtain high-precision results.
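The filtering step might look like the following sketch. It assumes a word-level probability table of the kind a statistical alignment model can provide; the table `trans_prob`, the thresholds, and the function names are hypothetical and do not reflect the actual output format of fast_align.

```python
def unaligned_ratio(src_tokens, tgt_tokens, trans_prob, min_prob=0.1):
    """Fraction of source tokens with no sufficiently probable target token."""
    unaligned = 0
    for s in src_tokens:
        best = max((trans_prob.get((s, t), 0.0) for t in tgt_tokens), default=0.0)
        if best < min_prob:
            unaligned += 1
    return unaligned / len(src_tokens)


def filter_rewrites(query_tokens, rewrites, trans_prob, max_unaligned=0.0):
    """Keep only rewrites in which (almost) every query token is aligned."""
    return [
        r for r in rewrites
        if unaligned_ratio(query_tokens, r, trans_prob) <= max_unaligned
    ]


# Hypothetical word-level alignment probabilities.
trans_prob = {
    ("yongcheng", "yongcheng"): 0.9,
    ("medical", "medical"): 0.9,
    ("office", "hall"): 0.4,
}
query_tokens = ["yongcheng", "medical", "office"]
rewrites = [
    ["yongcheng", "medical", "hall"],  # same place, different wording
    ["xiayi", "medical", "hall"],      # over-generalized: a different city
]
kept = filter_rewrites(query_tokens, rewrites, trans_prob)
print(kept)
```

The second rewrite is semantically close in vector space but names a different place, which is exactly the kind of over-generalization the alignment check is meant to catch.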

Query rewriting fills some gaps in the original query analysis module. Unlike synonyms or error correction, which are explicit query transformations, the sentence vector representation expresses similar queries implicitly and has its own advantages.

Vector representation and recall were also an early step in gradually applying deep learning models. Synonyms, rewriting, and error correction are the three main forms of query transformation in map search. In the past they were scattered across separate modules, each doing its own job, with some overlap between them. In later iterations we introduced a unified query transformation model and shed the historical burden of numerous rules and coupled models.

3.3 Omission

In the map search scenario, many queries contain words that do not help retrieval. If all the words are used for recall, valid results may not be recalled. For example, a user in Xiamen searches for "room 1101, 11th floor, Xinjiechuang Operation Center, Xianhou High-tech Park, Huli District". This calls for a retrieval intention that, without obviously changing the meaning, uses the core terms to recall the candidate POI set. It acts as a supplementary recall when the full query returns no results or poor ones.

Deciding what to omit involves balancing a priori and a posteriori considerations. The omission intention is an a priori judgment, but the desired outcome is that POIs are recalled effectively, which depends closely on the current state of the POI recall fields. Keeping the strategy consistent a priori while still getting good POIs a posteriori is the difficult part of building a good omission module.

The original omission module was mainly rule-based, and the rules relied chiefly on the features from upstream component analysis. Because the rules were fitted by hand, there was considerable room to improve the effect. In addition, the strong dependence on component analysis made the module less robust.

Technical transformation

The transformation of the omission module upgrades it from rules to a CRF model, and a deep learning model is used offline to assist sample generation.

Model selection

Identifying which parts of a query are core and which can be omitted is a sequence labeling problem. Among shallow models, CRF is the natural choice.
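To make the sequence labeling view concrete, the sketch below runs Viterbi decoding, the standard inference step of a linear-chain CRF, over hand-written scores. The labels `CORE` and `OMIT`, the segmentation of the query, and every score are assumptions made for illustration; a trained CRF would compute the scores from its features and weights.

```python
def viterbi(emissions, transitions, labels):
    """Best label sequence given per-token and label-pair scores."""
    best = {y: emissions[0][y] for y in labels}
    back = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[(p, y)])
            step[y] = best[prev] + transitions[(prev, y)] + scores[y]
            pointers[y] = prev
        best = step
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["CORE", "OMIT"]
tokens = ["huli district", "xinjiechuang operation center", "11th floor", "room 1101"]
# Hypothetical scores for each token and label.
emissions = [
    {"CORE": 1.0, "OMIT": 0.2},
    {"CORE": 2.0, "OMIT": -1.0},
    {"CORE": -0.5, "OMIT": 1.0},
    {"CORE": -1.0, "OMIT": 1.5},
]
# Mildly discourage switching back to CORE after an omitted part.
transitions = {(p, y): 0.0 for p in labels for y in labels}
transitions[("OMIT", "CORE")] = -0.5

path = viterbi(emissions, transitions, labels)
core_terms = [t for t, y in zip(tokens, path) if y == "CORE"]
print(core_terms)
```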

Feature construction

Term features. Term weight, part of speech, prior dictionary features, and so on.
Component features. The component analysis features are still used.
Statistical features. The left and right boundary entropy of a fragment and its city distribution entropy, discretized by binning.
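The statistical features can be sketched in a few lines: boundary entropy is the Shannon entropy of the tokens observed next to a fragment, and binning maps the continuous value to a discrete feature. The neighbor lists and bin edges below are invented for illustration; the same entropy function would apply to the distribution of cities in which a fragment occurs.

```python
import math
from collections import Counter


def boundary_entropy(neighbors):
    """Shannon entropy (in bits) of the tokens seen next to a fragment."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def bucketize(value, edges):
    """Index of the first bin whose upper edge is >= value, else the last bin."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


# Tokens observed to the right of a fragment in query logs (toy data).
right_neighbors = ["east gate", "parking", "east gate", "parking"]
entropy = boundary_entropy(right_neighbors)
feature = bucketize(entropy, edges=[0.5, 1.5, 2.5])
print(entropy, feature)
```

A fragment with high boundary entropy combines freely with many neighbors, which suggests it is a self-contained unit; discretizing the value makes it usable as a categorical CRF feature.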

Sample construction

In the first phase of the project, we built samples on the order of ten thousand for CRF training, using coarse labeling by the online strategy followed by fine labeling by outsourced annotators.

However, queries that need omission are highly diverse, and samples on the order of ten thousand are not enough. Since a deep model could not yet be deployed online quickly, we used a boosting approach that relies on the generalization ability of the deep model to construct a large number of samples offline.
In this way the sample set can easily be expanded from the ten-thousand level to the million level. We still use the CRF model for training and online serving.
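One plausible reading of this offline expansion is a pseudo-labeling loop: a deep model trained on the manually labeled seed set labels a large pool of unlabeled queries, and only confident predictions are kept as extra CRF training data. The confidence threshold, the function names, and the stub labeler below are assumptions, not the actual pipeline.

```python
def expand_samples(seed_samples, unlabeled_queries, deep_labeler, min_confidence=0.9):
    """Add confidently machine-labeled queries to the manually labeled seed set."""
    expanded = list(seed_samples)
    for query in unlabeled_queries:
        labels, confidence = deep_labeler(query)
        if confidence >= min_confidence:
            expanded.append((query, labels))
    return expanded


def stub_deep_labeler(query):
    """Stand-in for an offline deep model: returns (labels, confidence)."""
    tokens = query.split()
    labels = ["OMIT" if t.startswith("room") else "CORE" for t in tokens]
    confidence = 0.95 if len(tokens) <= 3 else 0.5
    return labels, confidence


seed = [("center room1101", ["CORE", "OMIT"])]
pool = ["plaza room302", "a very long and ambiguous query here"]
training_set = expand_samples(seed, pool, stub_deep_labeler)
print(len(training_set))
```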

In the omission module, we completed the upgrade from rules to machine learning and introduced features beyond component analysis to improve robustness. At the same time, offline deep learning is used in a loop to construct samples, which improves sample diversity and brings the model closer to the ceiling of what CRF can achieve.

In the subsequent deep modeling work, we gradually removed the dependence on component analysis features, modeled directly from the query to the core of the hit POI, built a large number of samples, and obtained further gains.