Two scoring models of ES

Time:2022-6-19

Relevance score: a measure of how well a document matches a query. The list of documents matching the query is obtained from the inverted index.

How do we put the documents that best meet the user's query at the top of the results?
This is essentially a sorting problem: the matching documents found through the inverted index are ranked by their relevance score, and the highest-scoring document comes first.

Parameters that affect the relevance score:

  1. TF (term frequency): the number of times a term appears in the document. The higher the term frequency, the higher the relevance
  2. DF (document frequency): the number of documents that contain the term
  3. IDF (inverse document frequency): the inverse of the document frequency, in its simplest form 1/df. The fewer documents contain the term, the higher the relevance (common terms like "and" or "the" contribute little to relevance, as they appear in most documents, while uncommon terms like "elastic" or "hippopotamus" help us zoom in on the most interesting documents)
  4. Field length norm: the shorter the field in which the term appears, the higher the relevance
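
The four parameters above can be computed by hand on a tiny corpus. The following is a minimal sketch, not Elasticsearch's actual code: the corpus and all function names are made up for illustration, the IDF uses a smoothed logarithmic form (an assumption; the list above simplifies it to 1/df), and the field length norm is taken as 1/sqrt(number of terms).

```python
import math
from collections import Counter

# Toy corpus, invented for illustration: document id -> list of tokens.
docs = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog and the fox".split(),
    "d3": "elastic search and the hippopotamus".split(),
}

def term_frequency(term, tokens):
    """TF: how many times the term occurs in one document."""
    return Counter(tokens)[term]

def document_frequency(term, corpus):
    """DF: how many documents contain the term at least once."""
    return sum(1 for tokens in corpus.values() if term in tokens)

def inverse_document_frequency(term, corpus):
    """IDF, smoothed logarithmic form (assumed): 1 + ln(N / (df + 1))."""
    n = len(corpus)
    return 1.0 + math.log(n / (document_frequency(term, corpus) + 1))

def field_length_norm(tokens):
    """Shorter fields get a larger norm (assumed form: 1 / sqrt(length))."""
    return 1.0 / math.sqrt(len(tokens))

for term in ("the", "hippopotamus"):
    print(term, document_frequency(term, docs),
          round(inverse_document_frequency(term, docs), 3))
```

In this corpus "the" appears in every document and so gets a lower IDF than "hippopotamus", which appears in only one.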

There are two algorithms to evaluate document relevance:

  • TF-IDF model (the default scoring model in versions before Elasticsearch 5)

  • BM25 model (the default scoring model from Elasticsearch 5 onward)
    Starting from Elasticsearch 5, the default similarity algorithm is Okapi BM25. The Okapi BM25 model was proposed in 1994. "BM" is short for "best match", and the 25 is usually explained as marking the 25th iteration of the algorithm. The model evolved from TF/IDF. The Okapi information retrieval system was the first system to implement it, and it has since been widely used in other systems
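
As a rough illustration of how BM25 combines the parameters listed earlier, here is a sketch of the textbook per-term BM25 score. It is not taken from the Lucene source: the function names are hypothetical, `k1=1.2` and `b=0.75` are the commonly used default values, and the exact formula in a given Elasticsearch version may differ in constant factors.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """IDF part: rarer terms (small doc_freq) get a larger weight."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_freq, doc_count, doc_len, avg_doc_len,
                    k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    tf          -- occurrences of the term in the document
    doc_freq    -- number of documents containing the term
    doc_count   -- total number of documents
    doc_len     -- length of this document's field, in terms
    avg_doc_len -- average field length across the index
    k1          -- controls how quickly term frequency saturates
    b           -- controls how strongly field length is normalised
    """
    length_ratio = doc_len / avg_doc_len
    tf_part = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * length_ratio))
    return bm25_idf(doc_count, doc_freq) * tf_part

# Same term and tf, but a shorter field scores higher than a longer one.
print(bm25_term_score(tf=3, doc_freq=5, doc_count=100, doc_len=50, avg_doc_len=100))
print(bm25_term_score(tf=3, doc_freq=5, doc_count=100, doc_len=200, avg_doc_len=100))
```

Setting `b=0` switches off length normalisation entirely, and a smaller `k1` makes the term frequency part level off sooner.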

Comparison of word frequency effects between TF-IDF and BM25

TF/IDF was designed in an era when it was standard practice to remove the most common words (stopwords) from the index altogether. The algorithm didn't need to worry about an upper limit for term frequency because the most frequent terms had already been removed.

In Elasticsearch, the standard analyzer (the default for string fields) doesn't remove stopwords because, even though they are words of little value, they do still have some value. The result is that, for very long documents, the sheer number of occurrences of words like "the" and "and" can artificially boost their weight.

BM25, on the other hand, does have an upper limit. Terms that appear 5 to 10 times in a document have a significantly larger impact on relevance than terms that appear just once or twice. However, as can be seen in Figure 34, “Term frequency saturation for TF/IDF and BM25”, terms that appear 20 times in a document have almost the same impact as terms that appear a thousand times or more.


Term frequency saturation for TF/IDF and BM25
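
The saturation effect can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the code behind the figure: it takes the classic TF/IDF term frequency component as sqrt(tf), uses the textbook BM25 component with `k1=1.2`, and assumes a document of exactly average length so that length normalisation drops out. The function names are hypothetical.

```python
import math

def classic_tf(tf):
    """TF component of classic TF/IDF (assumed sqrt form): grows without bound."""
    return math.sqrt(tf)

def bm25_tf(tf, k1=1.2):
    """TF component of BM25 for an average-length document.

    It approaches, but never reaches, the upper limit k1 + 1.
    """
    return tf * (k1 + 1.0) / (tf + k1)

for tf in (1, 2, 5, 10, 20, 1000):
    print(f"tf={tf:>4}  classic={classic_tf(tf):6.2f}  bm25={bm25_tf(tf):.3f}")
```

With these assumptions the BM25 value for tf=20 is already close to the value for tf=1000, while the classic value keeps growing with every additional occurrence.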

