Redis in Action 10. Implementing content search, ad targeting, and job search


Search with Redis P153

By changing the way a program searches its data and by using Redis, we can reduce the execution time of most word-based or keyword-based content search operations. P154

Basic search principles P154

An inverted index is the underlying structure used by most search engines on the Internet, and it resembles the index at the end of a book. An inverted index extracts words from each indexed document and records, for each word, the set of documents that contain it. P154


Suppose there are three documents:

  • Document 0 = “it is what it is”
  • Document 1 = “what is it”
  • Document 2 = “it is a banana”

We can get the following inverted index set:

  • “a”: {2}
  • “banana”: {2}
  • “is”: {0, 1, 2}
  • “it”: {0, 1, 2}
  • “what”: {0, 1}

A search for “what”, “is” and “it” then corresponds to this intersection: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
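As a minimal sketch of the example above, the following Python code builds the same inverted index in memory and intersects the sets for the three search words. Plain Python sets stand in for Redis sets here, and the names DOCUMENTS, build_index, and search_all are illustrative assumptions, not taken from the book.

```python
from collections import defaultdict

# The three example documents, keyed by document ID.
DOCUMENTS = {
    0: "it is what it is",
    1: "what is it",
    2: "it is a banana",
}


def build_index(documents):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def search_all(index, words):
    """Return the IDs of the documents that contain every word in `words`."""
    sets = [index.get(word, set()) for word in words]
    return set.intersection(*sets) if sets else set()


index = build_index(DOCUMENTS)
print(sorted(index["is"]))                              # [0, 1, 2]
print(sorted(search_all(index, ["what", "is", "it"])))  # [0, 1]
```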

As this shows, Redis sets (SET) and sorted sets (ZSET) are well suited to implementing an inverted index.

Basic index operation

The process of extracting words from a document is usually called parsing and tokenization. It produces a series of tokens, sometimes simply called words, that represent the document. P155

A common additional step of tokenization is removing stop words. Stop words are words that appear frequently in documents but carry little information; searching for them returns a lot of useless results. P155

In the book, the logic for building the inverted index is very simple:

  1. Split the document into words and discard single-character words
  2. For each word, get or create the corresponding set and add the current document's unique ID to it
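The two steps above can be sketched as follows. This is an in-memory illustration under stated assumptions: the stop-word list is a tiny made-up sample, the key pattern idx:{word} in the comment is hypothetical, and a dict of Python sets stands in for the Redis sets that SADD would fill.

```python
import re
from collections import defaultdict

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "is", "it", "of", "to", "in", "that", "this"}

# A word is a run of letters and apostrophes at least two characters long,
# so single-character words are never matched.
WORD_RE = re.compile(r"[a-z']{2,}")


def tokenize(content):
    """Return the set of indexable words in `content`."""
    words = set()
    for match in WORD_RE.finditer(content.lower()):
        word = match.group().strip("'")
        if len(word) >= 2:
            words.add(word)
    return words - STOP_WORDS


def index_document(index, doc_id, content):
    """Add `doc_id` to the set of every word found in `content`."""
    words = tokenize(content)
    for word in words:
        # With Redis this step would be: SADD idx:{word} {doc_id}
        index[word].add(doc_id)
    return len(words)


index = defaultdict(set)
index_document(index, 7, "Redis search and indexing")
print(sorted(index))  # ['indexing', 'redis', 'search']
```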

If you need to support Chinese or other languages, simple English word splitting is not enough; you need a word segmenter. My first contact with inverted indexes was in Elasticsearch. If you are interested, you can read more about how Elasticsearch implements its inverted index and about the IK Chinese word segmenter.

Basic search operations

Looking up a single word in the index is easy: just fetch all documents in that word's set. To search by multiple words, we combine the corresponding sets according to the conditions and then fetch all documents from the final set. P156

Redis's set commands handle the different conditions:

  • SINTER / SINTERSTORE: find the documents that contain all of the specified words
  • SUNION / SUNIONSTORE: find the documents that contain at least one of the specified words
  • SDIFF / SDIFFSTORE: find the documents that contain some words but none of certain other words

With these three kinds of commands, we can implement most AND, OR, and NOT conditions.

Analyze and perform search

After tokenization, the query strings we use have the following characteristics:

  • A word beginning with + is a synonym of the previous word, so the two are combined with a union
  • A word beginning with - must not appear in the document, so its set is subtracted
  • Any other word is one the user wants to query, so its set is intersected

For example, “connect +connection chat -proxy -proxies” means that a matching document must contain “connect” or “connection”, must also contain “chat”, and must not contain “proxy” or “proxies”.

In practice, we first take the union of each group of synonyms, then intersect the results with the other query words, and finally subtract the sets of the unwanted words. The remaining set is the result of the user's query.
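The union, intersection, and difference steps can be sketched as below. This is an in-memory illustration, not the book's code: Python sets stand in for SUNION, SINTER, and SDIFF, the sample index is made up, and the sketch assumes the + or - prefix is attached directly to its word.

```python
def parse(query):
    """Split a query into synonym groups to intersect and words to exclude."""
    wanted = []       # list of synonym groups; each group is unioned
    unwanted = set()  # words whose documents are removed from the result
    for token in query.lower().split():
        prefix = token[0] if token[0] in "+-" else ""
        word = token[1:] if prefix else token
        if not word:
            continue
        if prefix == "-":
            unwanted.add(word)
        elif prefix == "+" and wanted:
            wanted[-1].append(word)
        else:
            wanted.append([word])
    return wanted, unwanted


def parse_and_search(index, query):
    """Evaluate `query` against a dict that maps word -> set of document IDs."""
    wanted, unwanted = parse(query)
    if not wanted:
        return set()
    # SUNION within each synonym group, then SINTER across the groups.
    groups = [set().union(*(index.get(w, set()) for w in group)) for group in wanted]
    result = set.intersection(*groups)
    # SDIFF against every unwanted word.
    for word in unwanted:
        result -= index.get(word, set())
    return result


# Hypothetical index contents, for illustration only.
index = {
    "connect": {1, 2, 5},
    "connection": {3, 5},
    "chat": {1, 3, 4, 5},
    "proxy": {5},
    "proxies": {4},
}
print(sorted(parse_and_search(index, "connect +connection chat -proxy -proxies")))  # [1, 3]
```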

Sort and paginate search results P160

The search function above gives us the set of unique IDs of all documents that match the user's query. Now we sort and paginate those IDs using the detailed information stored for each document.

  • Document ID set: stores the unique ID of each document, for example: {1, 2, 276}
  • Per-document information: a HASH keyed by doc:{id} that stores the document's details, for example: "doc:276": {"id": 276, "created": 1324114412, "updated": 132562777, "title": "Troubleshooting...", ...}

With this layout, we can use Redis's SORT command to sort and paginate the set of document IDs by referring to each document's fields. (See 05. Introduction to Redis's other commands)
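To make the behavior concrete, here is a small in-memory sketch of what such a SORT call does: order the IDs by a field of each document's hash, then slice out one page. The function name, the sample documents, and the ids_key placeholder in the comment are assumptions for illustration; on a real server the work would be done by SORT with its BY and LIMIT options.

```python
def sort_and_paginate(doc_ids, docs, field="updated", start=0, num=20, desc=True):
    """Sort document IDs by one field of each document's hash, then take a page.

    Roughly what this command does on the server:
        SORT {ids_key} BY doc:*->{field} LIMIT {start} {num} DESC
    """
    ordered = sorted(
        doc_ids,
        key=lambda doc_id: docs[f"doc:{doc_id}"][field],
        reverse=desc,
    )
    return ordered[start:start + num]


# Hypothetical documents, for illustration only.
docs = {
    "doc:1": {"id": 1, "updated": 300},
    "doc:2": {"id": 2, "updated": 100},
    "doc:276": {"id": 276, "updated": 200},
}
print(sort_and_paginate({1, 2, 276}, docs, num=2))           # [1, 276]
print(sort_and_paginate({1, 2, 276}, docs, start=2, num=2))  # [2]
```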

Sorted indexes P162

The previous section showed how to search with Redis and how to sort and paginate the results using data stored in HASHes. Next, we use sets and sorted sets to implement composite sorting based on multiple scores, which offers more flexibility than the SORT command. P162

Sort multiple numeric fields P162

Suppose we need to sort documents by both update time and number of votes. We then need two sorted sets to store this information: the members of both are the documents' unique IDs, and the scores are the update time and the number of votes, respectively.

Let filtered_doc_ids be the set of IDs of the documents that match the search criteria, doc_ids_with_update the sorted set that maps each document ID to its update time, and doc_ids_with_votes the sorted set that maps each document ID to its number of votes. The ZINTERSTORE command can then intersect the three sets to produce a sorted set that maps each matching document's ID to its sort score, and ZRANGE or ZREVRANGE can page through the result. P162

Example: ZINTERSTORE filtered_doc_ids_with_sort_score 3 filtered_doc_ids doc_ids_with_update doc_ids_with_votes WEIGHTS 0 {update_weight} {vote_weight}

Where:

  • filtered_doc_ids_with_sort_score is the resulting sorted set
  • filtered_doc_ids has a weight of 0, so it only filters the results and does not affect the order
  • doc_ids_with_update and doc_ids_with_votes take the weights that control the order: a positive weight sorts the field in ascending order, a negative weight sorts it in descending order, and a weight of 0 leaves the field out of the score
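The effect of the weights can be checked with a small in-memory emulation of ZINTERSTORE using the default SUM aggregate. This is a sketch under stated assumptions: dicts stand in for sorted sets, the plain set is modeled as a dict with every score equal to 1 (which is how Redis treats a SET in this command), and the sample scores and the weights 1 and 100 are made up.

```python
def zinterstore(sources, weights):
    """Emulate ZINTERSTORE ... WEIGHTS ... with the default SUM aggregate.

    Each source maps member -> score; only members present in every
    source survive, and each score is multiplied by its source's weight.
    """
    common = set(sources[0])
    for source in sources[1:]:
        common &= set(source)
    return {
        member: sum(weight * source[member] for source, weight in zip(sources, weights))
        for member in common
    }


# Hypothetical data, for illustration only.
filtered_doc_ids = {"1": 1, "2": 1}
doc_ids_with_update = {"1": 1000, "2": 2000, "3": 3000}
doc_ids_with_votes = {"1": 50, "2": 10, "3": 5}

scores = zinterstore(
    [filtered_doc_ids, doc_ids_with_update, doc_ids_with_votes],
    [0, 1, 100],  # filter only, update_weight = 1, vote_weight = 100
)
# Highest score first, as ZREVRANGE would return the members.
print(sorted(scores, key=scores.get, reverse=True))  # ['1', '2']
```

Document 3 is dropped because it is not in the filter set, and document 1 outranks document 2 only because the vote weight is large enough to outweigh the newer update time.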

Personal opinion

Using scores this way is very clever and covers most multi-field sorting needs, but priority is hard to control: it is difficult to sort strictly by one field first and then by another. Because the scores of different fields may differ by orders of magnitude, the weights have to be designed carefully whenever the sort fields need a priority.

Sort non-numeric fields P164

The section above uses sorted sets to sort by multiple numeric fields. Because a sorted-set score can only be a floating-point number, a non-numeric field cannot be used for sorting directly and must first be converted to a floating-point number. However, a double-precision float has only 64 bits (of which, according to the book, at most 63 are actually usable), so only the first few characters of a string can be used to compute an approximate score. If the string has fewer characters than the chosen prefix length, it must be padded to that length. Of course, if the character set is reduced, the characters can be re-encoded more compactly, so that the score covers a longer prefix. P165

When the scores are very large, the final computed score may overflow.
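A minimal sketch of such a prefix encoding is shown below. It is modeled on the idea described above rather than copied from the book: it assumes the first six bytes of the UTF-8 encoding are used, treats a missing byte as smaller than any real byte, and keeps the result below 2^53 so that a double can hold it exactly.

```python
PREFIX_LENGTH = 6  # assumption: six bytes keep the score exactly representable


def string_to_score(text, ignore_case=False):
    """Convert a string to an integer that preserves byte-wise prefix order."""
    if ignore_case:
        text = text.lower()
    data = text.encode("utf-8")
    pieces = list(data[:PREFIX_LENGTH])
    # Pad short strings with -1 so that "a" sorts before "a\x00".
    while len(pieces) < PREFIX_LENGTH:
        pieces.append(-1)
    score = 0
    for piece in pieces:
        # Base 257: 256 byte values plus one slot for the padding marker.
        score = score * 257 + piece + 1
    # The lowest bit records whether the string was longer than the prefix.
    return score * 2 + (len(data) > PREFIX_LENGTH)


words = ["banana", "apple", "app", "cherry"]
print(sorted(words, key=string_to_score))  # ['app', 'apple', 'banana', 'cherry']
```

Strings that share the same first six bytes get scores that differ by at most 1, so the order among them is only approximate, which is the estimation trade-off the paragraph above describes.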

Ad targeting P166

Next, we use sets and sorted sets to build an almost complete ad serving platform. P166

Index advertisements P167

Indexing ads is not much different from indexing other content. The differences are that an indexed ad usually carries required targeting parameters such as location, age, and gender, and that a targeting request returns only a single ad. P167

Ad pricing P167

  • Cost per view: also known as CPM or cost per mille; the advertiser pays a fixed fee for every 1,000 displays
  • Cost per click: also known as CPC; the advertiser pays a fixed fee per click
  • Cost per action: also known as CPA or cost per acquisition; the advertiser pays a fee that varies with the action the user performs on the ad's destination website

To keep price calculation as simple as possible, all ad types are converted so that their prices are expressed per 1,000 displays, which yields an estimated CPM (eCPM). P168

  • CPM ads: the eCPM is simply the CPM price
  • CPC ads: the eCPM is the price per click multiplied by the ad's click-through rate (CTR), multiplied by 1,000
  • CPA ads: the eCPM is the ad's click-through rate multiplied by the probability that the user performs the action on the advertiser's target page, multiplied by the price of the action, multiplied by 1,000
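The three conversions above translate directly into code. The sketch below follows the formulas as stated in this list; the function and parameter names are illustrative assumptions, and the rates are expected as fractions (0.02 for 2%).

```python
def cpm_to_ecpm(cpm):
    """A CPM price is already a price per 1,000 displays."""
    return float(cpm)


def cpc_to_ecpm(cpc, ctr):
    """Price per click x click-through rate x 1,000."""
    return 1000.0 * cpc * ctr


def cpa_to_ecpm(cpa, ctr, action_rate):
    """Price per action x click-through rate x action probability x 1,000."""
    return 1000.0 * cpa * ctr * action_rate


# Hypothetical prices and rates, for illustration only.
print(cpc_to_ecpm(cpc=0.25, ctr=0.02))
print(cpa_to_ecpm(cpa=10.0, ctr=0.02, action_rate=0.05))
```

Expressing every ad type in the same unit is what lets ads with different pricing models be compared in a single sorted set.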

Insert ads into the inverted index P169

We can largely reuse the search functions described above. Besides inserting keywords into the inverted index, we also insert the targeting parameters (location, age, gender, etc.) and record each ad's type, base price, and eCPM. P169

Perform ad targeting P170

When the system receives an ad targeting request, it must find the ad with the highest eCPM among the ads that match the user's targeting parameters. The program also records how well the page content matches the ad content, how different degrees of matching affect the click-through rate, and other statistics. Using these statistics, the words in an ad that match the page content contribute a bonus to the eCPM of CPC and CPA ads, so that ads containing matching content are shown more often. P170

Calculate added value

Calculating the added value means working out how much to add to an ad's eCPM based on the words that the page content and the ad content have in common. Each word has a sorted set whose members are ad IDs and whose scores are the value that the word adds to each ad's eCPM. P171

When looking for the right ad, we first filter for ads that match the location and contain at least one word from the page, then add the computed added value to each ad's base eCPM. This lets us pick the highest-value ad each time and keep learning from user behavior. Because each ad matches different content, the best approach would be a weighted average of the matched words' added values, but given the limits of Redis's own commands, we settle for (max + min) / 2 as the word part of the added value, where max is the largest added value among the matched words and min is the smallest. This uses the following command: ZUNIONSTORE final_score 3 base max min WEIGHTS 1 0.5 0.5
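The arithmetic behind that command can be sketched in memory as follows. The sketch assumes that base already holds only the candidate ads that passed the location filter, and that the max and min sorted sets contain only those candidates; all names and numbers are made up for illustration.

```python
def final_ecpm(base, word_bonus, matched_words):
    """Compute base + 0.5 * max + 0.5 * min for every candidate ad.

    base:          ad_id -> base eCPM of the candidate ads
    word_bonus:    word  -> {ad_id -> added value of that word for the ad}
    matched_words: words found in the page content
    """
    final = {}
    for ad_id, base_ecpm in base.items():
        bonuses = [
            word_bonus[word][ad_id]
            for word in matched_words
            if ad_id in word_bonus.get(word, {})
        ]
        if bonuses:
            final[ad_id] = base_ecpm + 0.5 * max(bonuses) + 0.5 * min(bonuses)
        else:
            # An ad with no matching word keeps its base eCPM.
            final[ad_id] = base_ecpm
    return final


# Hypothetical data, for illustration only.
base = {"ad1": 2.0, "ad2": 3.0}
word_bonus = {
    "redis": {"ad1": 1.0, "ad2": 0.5},
    "search": {"ad1": 3.0},
}
scores = final_ecpm(base, word_bonus, ["redis", "search", "python"])
print(max(scores, key=scores.get))  # ad1
```

Here ad1 wins even though its base eCPM is lower, because two page words match it and raise its score to 4.0 against 3.5 for ad2.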

Learn from user behavior P175

First, we store users' view records, which consist of three parts (the eCPM is updated every 100 views): P175

  • The words that targeted a given ad (i.e. the intersection of the words in the content and the words in the ad)
  • The number of times a given ad was targeted
  • The number of times a word in an ad was used to calculate added value

Second, we store users' click and action records, which are used to calculate the click-through rate = clicks or actions / ad displays (the eCPM is updated every time). P176

Finally, the eCPM is updated, in two parts:

  • Ad eCPM: recalculate the eCPM from the ad's actual price and its current click-through rate
  • eCPM added value of ad words: recalculate each word's added value from the ad's base price and each word's click-through rate
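The ad-level part of this loop can be sketched with a small in-memory class. This is only an illustration under stated assumptions: the class and attribute names are made up, the ad is assumed to be a CPC ad, plain attributes stand in for the counters kept in Redis, and the per-word added values are left out.

```python
class AdStats:
    """In-memory stand-in for one ad's counters and eCPM."""

    def __init__(self, price_per_click):
        self.price_per_click = price_per_click
        self.views = 0
        self.clicks = 0
        self.ecpm = 0.0

    def record_view(self):
        self.views += 1
        # Views are frequent, so the eCPM is refreshed only every 100 views.
        if self.views % 100 == 0:
            self.update_ecpm()

    def record_click(self):
        self.clicks += 1
        # Clicks are rare, so the eCPM is refreshed on every click.
        self.update_ecpm()

    def update_ecpm(self):
        ctr = self.clicks / self.views if self.views else 0.0
        self.ecpm = 1000.0 * self.price_per_click * ctr


ad = AdStats(price_per_click=0.5)
for _ in range(200):
    ad.record_view()
for _ in range(50):
    ad.record_click()
print(ad.ecpm)  # 125.0
```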

Improvement plan P179

  • Decay over time: following the RescaleItemViewedNum function in 03. Redis simple practice, periodically scale down each ad's display count and click count (or action count)
  • Additional counts: consider keeping click counts for the previous day, the previous week, or other time periods, and weight them according to the length of the period
  • Use a second-price auction to determine the cost of an ad slot
  • Give low-priced ads some exposure: some of the time, choose among the top 100 ads based on their relative eCPM instead of always choosing the ad with the highest eCPM
  • Optimize the initial eCPM of new ads:

    • Use the average click data of ads of the same type at the start
    • Build a simple inverse linear or inverse sigmoid relationship between the average click-through rate of ads of the same type and the ad's actual click-through rate so far, until the ad has been displayed enough times
    • Artificially boost the click-through rate so that the ad gets enough traffic to learn its real eCPM
  • Consider using real Bayesian statistics, neural networks, association rule learning, clustering, or other techniques to calculate the added value
  • Make the recording logic asynchronous (it can use the task queue from 09. Implementing task queues, message pulling, and file distribution) to improve response time

Job search P180

Next, we use sets and sorted sets to implement job search, finding suitable positions for job seekers based on their skills. P180

Finding suitable positions one by one P180

The first idea is to check every position directly for each job seeker to find the ones that fit. But this is very inefficient (most positions will certainly not match the job seeker's skills), and it does not scale. P181

Finding suitable positions the search way P181

Using the same approach as the added-value calculation above: each time a position is added, add its ID to the set of every skill it requires (SADD idx:skill:{skill} {job_id}), and also add it to a sorted set of positions whose members are position IDs and whose scores are the number of skills each position requires (ZADD job_required_skill_count {required_skill_count} {job_id}). When searching, first use ZUNIONSTORE over the sets of all the job seeker's skills to count how many required skills each position has matched (ZUNIONSTORE matched {n} idx:skill:{skill} ... WEIGHTS 1 ...), then intersect the result with the position sorted set, giving the position sorted set a weight of -1 (ZINTERSTORE result 2 job_required_skill_count matched WEIGHTS -1 1). Every position whose final score is 0 has all of its required skills matched, which completes the search. P181
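The scoring trick can be checked with a small in-memory sketch. Dicts and sets stand in for the Redis structures, the comments name the command each step corresponds to, and the job IDs and skills are made up for illustration.

```python
from collections import defaultdict

skill_index = defaultdict(set)  # skill -> job IDs that require it
required_count = {}             # job ID -> number of required skills


def add_job(job_id, required_skills):
    skills = set(required_skills)
    for skill in skills:
        # SADD idx:skill:{skill} {job_id}
        skill_index[skill].add(job_id)
    # ZADD job_required_skill_count {required_skill_count} {job_id}
    required_count[job_id] = len(skills)


def find_jobs(candidate_skills):
    # ZUNIONSTORE matched {n} idx:skill:{skill} ... WEIGHTS 1 ...
    # Each set member counts as 1, so the sum is the number of matched skills.
    matched = defaultdict(int)
    for skill in set(candidate_skills):
        for job_id in skill_index.get(skill, ()):
            matched[job_id] += 1
    # ZINTERSTORE result 2 job_required_skill_count matched WEIGHTS -1 1
    # A score of 0 means every required skill was matched.
    return sorted(
        job_id
        for job_id, count in matched.items()
        if count - required_count[job_id] == 0
    )


# Hypothetical jobs, for illustration only.
add_job("job1", ["python", "redis"])
add_job("job2", ["python", "go", "redis"])
add_job("job3", ["sql"])
print(find_jobs(["python", "redis", "sql"]))  # ['job1', 'job3']
```

job2 is excluded because only two of its three required skills are matched, which leaves it with a score of -1 rather than 0.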

Personal opinion

The book's method is cumbersome. In fact, the plain set-based inverted index from the beginning of this article would do: a position corresponds to a document to be searched, and the skills it requires correspond to words.

This article was first published on the official account “full Fu machine”; the notes are open source on GitHub: reading-notes/redis-in-action