Introduction: "multi-channel recall" (multiple recall) refers to the strategy of using different strategies, features, or lightweight models to each recall a portion of the candidate set, and then merging these candidates for downstream ranking models. This article introduces how multi-channel recall on the OpenSearch platform can substantially improve search quality.
Alibaba Cloud OpenSearch is a one-stop intelligent search development platform built on the large-scale distributed search engine independently developed by Alibaba. It currently provides search services for Alibaba Group's core businesses, including Taobao and Tmall. OpenSearch currently offers text retrieval: it segments the text query, performs query analysis and rewriting, and then queries the engine, which greatly improves search quality. However, some scenarios place higher demands on search quality, such as photo-based question search in education. Educational photo question search differs markedly from traditional web or e-commerce search in two ways. First, the queries are particularly long. Second, the query text is produced by OCR on a photo, so if a key term is misrecognized, recall and ranking suffer severely. One solution to these problems is to keep optimizing query processing (QP) and strengthen its ability to handle text. Another is to introduce vector recall, which retrieves documents by computing distances in a vector space, as a supplement to text recall.
In scenarios such as long queries, long-tail queries, and non-standard queries, where text-based retrieval suffers from inaccurate recall or insufficient results, supplementing it with vector recall not only effectively improves the quality of the recalled results but also provides the ability to expand recall.
OpenSearch provides multi-channel recall as an algorithm engineering capability, lets users in different industries customize multi-channel recall to their own requirements, and has been commercialized for users across multiple industries. Its advantages include the following:
1. Flexible algorithm capability: text vectorization can be optimized for the characteristics of different industries, balancing effectiveness and performance;
2. Support for Cava scripts, providing more flexible customized ranking and scoring;
3. Support for both model-based and model-free analyzers, providing vector recall respectively for users without their own algorithm capability and for users who have it;
4. Compared with open-source products, OpenSearch has clear advantages in accuracy and search latency, reducing latency from seconds (typical of open source) to tens of milliseconds.
Multi-channel recall architecture
OpenSearch supports multi-channel queries. Once a query strategy is configured, a text query and a vector query can be issued at the same time; issuing only a text query or only a vector query is also supported. If the text vectorization function is configured, OpenSearch vectorizes the query text, generates a vector query, and ranks the combined results of both recall channels.
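The idea of running a text channel and a vector channel side by side can be sketched as follows. This is a toy illustration, not the OpenSearch API: the document collection, the term-overlap scoring, and the function names are all invented for the example.

```python
import math

# Toy collection: doc id -> (text, embedding). In a real system the
# embeddings would come from a trained text-vectorization model.
DOCS = {
    "d1": ("solve the quadratic equation", [0.9, 0.1, 0.0]),
    "d2": ("history of the tang dynasty", [0.1, 0.8, 0.2]),
    "d3": ("factor the polynomial", [0.8, 0.2, 0.1]),
}

def text_recall(query_terms, k=10):
    """Text channel: score documents by the number of overlapping terms."""
    scored = []
    for doc_id, (text, _) in DOCS.items():
        overlap = len(set(query_terms) & set(text.split()))
        if overlap > 0:
            scored.append((doc_id, float(overlap)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def vector_recall(query_vec, k=10):
    """Vector channel: score documents by cosine similarity of embeddings."""
    scored = [(d, cosine(query_vec, emb)) for d, (_, emb) in DOCS.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

def multi_channel_recall(query_terms, query_vec, k=10):
    """Union of both channels' candidates; unified ranking happens downstream."""
    text_ids = {d for d, _ in text_recall(query_terms, k)}
    vector_ids = {d for d, _ in vector_recall(query_vec, k)}
    return text_ids | vector_ids
```

The vector channel here retrieves "d3" even though it shares no term with the query, which is exactly the expanded-recall behavior the article describes.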
OpenSearch supports several types of vector analyzers, mainly industry-general vector analyzers, industry-customized vector analyzers, and general-purpose vector analyzers (64-, 128-, and 256-dimensional). The general-purpose vector analyzer requires users to convert their data into vectors themselves and store them using the double_array type, which suits customers with strong algorithm capabilities.
Algorithm engineers customize vector models for different industries. Taking the education industry as an example, the special optimizations for educational question search include:
- The BERT model uses StructBERT, developed by Alibaba DAMO Academy and customized for the education industry
- The vector retrieval engine uses Proxima, developed by DAMO Academy, which is substantially more accurate and faster than open-source systems
- Training data accumulates continuously from customers' search logs, so the effect keeps improving
- The query is rewritten into a semantic vector query; the text terms are used only at ranking time, participating in scoring but not in recall, which improves the quality of the top recalled documents.
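The last optimization above, recalling by vector only while letting text terms contribute only to the score, can be sketched as below. The collection, the blending weight `alpha`, and the scoring formula are illustrative assumptions, not the production implementation.

```python
import math

# Toy collection: doc id -> (text, embedding).
DOCS = {
    "d1": ("solve the quadratic equation", [0.9, 0.1]),
    "d2": ("tang dynasty history", [0.1, 0.9]),
    "d3": ("factor the polynomial", [0.8, 0.3]),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_then_rank(query_terms, query_vec, k=2, alpha=0.7):
    """Recall by vector similarity only; term overlap enters only the score.

    Because recall ignores terms, an OCR error in a key term cannot
    empty the candidate set -- it only lowers that document's rank.
    """
    # Recall stage: every candidate comes from vector similarity alone.
    candidates = sorted(
        DOCS, key=lambda d: cosine(query_vec, DOCS[d][1]), reverse=True
    )[:k]

    # Ranking stage: blend vector similarity with text-term overlap.
    def score(doc_id):
        text, emb = DOCS[doc_id]
        overlap = len(set(query_terms) & set(text.split())) / max(len(query_terms), 1)
        return alpha * cosine(query_vec, emb) + (1 - alpha) * overlap

    return sorted(candidates, key=score, reverse=True)
```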
OpenSearch exposes two ranking stages: basic ranking and business ranking, that is, rough ranking and fine ranking. Fine ranking supports Cava scripts, which accommodate users' ranking needs more flexibly.
In the multi-channel recall process, OpenSearch ultimately performs unified ranking. It currently supports engine-internal ranking and ranking by a fine-ranking model's scores. Internal ranking sorts the multi-channel recall results directly by their returned scores, from high to low. Model-based ranking requires users to provide model information; the multi-channel results are then ranked by the model's scores.
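The internal-ranking case, merging candidates from several recall channels and sorting directly by the returned scores, amounts to the small sketch below. The function name and channel data are illustrative, not the OpenSearch API; one simple policy when a document appears in several channels is to keep its best score, which is what this sketch assumes.

```python
def internal_rank(*channels):
    """Each channel is a list of (doc_id, score); keep each doc's best score,
    then sort the merged candidates from high to low."""
    best = {}
    for channel in channels:
        for doc_id, score in channel:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Example channels: scores from a text recall and a vector recall.
text_channel = [("d1", 3.2), ("d2", 1.5)]
vector_channel = [("d2", 0.94), ("d3", 0.87)]
```

Note that raw scores from different channels live on different scales (term-match scores vs. similarities), which is one reason the article mentions model-based scoring as the alternative to this direct merge.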
Multi-channel recall practice cases
E-commerce / retail search
Community Forum Search
Comparison of the top titles returned before and after enabling multi-channel recall
This article is original content from Alibaba Cloud and may not be reproduced without permission.