How to implement full-text retrieval of IM SDK chat messages on Electron

Time: 2021-8-21

Preface

Full-text search over local data plays an important role in the client-side requirements of IM scenarios. Full-text retrieval is a technique for finding where a word occurs in a large number of documents. In the past, relational databases could only implement it with LIKE, which has several disadvantages:

  1. The database index cannot be used, so the whole table has to be scanned and performance is poor.
  2. Search quality is poor: only prefix and suffix fuzzy matching is possible, and complex search requirements cannot be met.
  3. The relevance of a document to the search criteria cannot be obtained.

The iOS, Android, and desktop clients of NetEase Yunxin IM all implement local full-text retrieval based on SQLite and other libraries, but this capability is missing on the web and on Electron. On the web, the browser environment limits the local database to IndexedDB, which is outside the scope of this discussion. Electron also embeds the Chromium kernel, but because Node.js is available, there are more choices.

Let's first look at how full-text retrieval is implemented.

Basic technical knowledge points

To implement full-text retrieval, two techniques are indispensable:

  • Inverted index
  • Word segmentation

These two techniques are both the core and the main difficulty of full-text retrieval, and implementing them is relatively complex. Before discussing the full-text index itself, let's look at how each of them works.

Inverted index

First, a brief introduction to the inverted index, a concept defined in contrast to the forward index.

  • Forward index: uses the unique ID of the document object as the index and the document content as the record.
  • Inverted index: uses the words in the document content as the index and the IDs of the documents containing each word as the record.
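To make the contrast concrete, here is a minimal sketch that builds both structures in memory. The sample documents, IDs and function names are illustrative assumptions, not part of any SDK.

```javascript
// Hypothetical sample documents; the IDs and texts are illustrative only.
const docs = [
  { id: "doc1", text: "full text search" },
  { id: "doc2", text: "text index" },
];

// Forward index: document ID -> list of words in that document.
function buildForwardIndex(documents) {
  const forward = new Map();
  for (const { id, text } of documents) {
    forward.set(id, text.split(/\s+/).filter(Boolean));
  }
  return forward;
}

// Inverted index: word -> list of IDs of the documents containing it.
function buildInvertedIndex(forward) {
  const inverted = new Map();
  for (const [id, words] of forward) {
    for (const word of new Set(words)) {
      if (!inverted.has(word)) inverted.set(word, []);
      inverted.get(word).push(id);
    }
  }
  return inverted;
}

const forwardIndex = buildForwardIndex(docs);
const invertedIndex = buildInvertedIndex(forwardIndex);
console.log(invertedIndex.get("text")); // [ 'doc1', 'doc2' ]
```

Looking up a word in the inverted index is a single map access, whereas answering the same question from the forward index requires visiting every document.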


Take the inverted index library search-index as a practical example. In NetEase Yunxin IM, each message object has idClient as its unique ID. Suppose we enter “今天天气真好” (“nice weather today”) and split it into single Chinese characters (word segmentation is covered in detail below), so the input becomes “今”, “天”, “天”, “气”, “真” and “好”. We then write it into the database through the put method of search-index. Finally, look at the structure of the stored content:

[Figure: structure of the content stored by search-index]

As the figure shows, this is an inverted index structure: each key is a single Chinese character produced by segmentation, and each value is an array of the idClient values of the message objects containing that character. Besides the above, search-index also stores other data, such as weights, counts, and forward-index data, which support sorting, paging, searching by field, and other functions. This article does not cover them in detail.
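The key/value layout described above can be sketched with a small in-memory class. This is not the search-index API: the class name, its `put` and `search` methods, and the sample messages are all hypothetical, and the sample sentence is assumed to be 今天天气真好 ("nice weather today").

```javascript
// Minimal in-memory stand-in for an inverted index store.
// NOT the search-index API; class and method names are hypothetical.
class TinyInvertedIndex {
  constructor() {
    this.postings = new Map(); // character -> Set of idClient values
  }

  // Split text into single characters, dropping whitespace.
  static tokenize(text) {
    return Array.from(text).filter((ch) => ch.trim() !== "");
  }

  // Index one message object of the shape { idClient, text }.
  put(message) {
    for (const token of TinyInvertedIndex.tokenize(message.text)) {
      if (!this.postings.has(token)) this.postings.set(token, new Set());
      this.postings.get(token).add(message.idClient);
    }
  }

  // Return the idClient values of messages containing every query character.
  search(query) {
    const tokens = TinyInvertedIndex.tokenize(query);
    if (tokens.length === 0) return [];
    let result = null;
    for (const token of tokens) {
      const ids = this.postings.get(token) || new Set();
      result = result === null
        ? new Set(ids)
        : new Set([...result].filter((id) => ids.has(id)));
    }
    return [...result];
  }
}

const index = new TinyInvertedIndex();
index.put({ idClient: "id-1", text: "今天天气真好" });
index.put({ idClient: "id-2", text: "明天见" });
console.log(index.search("天气")); // [ 'id-1' ]
```

Note that this sketch intersects posting lists per character, so it matches messages containing all query characters in any order; phrase matching would need position data, which the sketch leaves out.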

Word segmentation

Word segmentation splits the content of the original message into multiple words or phrases according to semantics. Considering the quality of Chinese word segmentation and the need to run on Node.js, we chose nodejieba as the underlying segmentation library. The following is the flow chart of Jieba word segmentation:

[Figure: flow chart of Jieba word segmentation]

Taking “去北京大学玩” (“go to Peking University to play”) as an example, we analyze the most important modules:

Load dictionary

Jieba loads its dictionary during initialization. Its contents look roughly as follows:

[Figure: sample dictionary contents]

Build prefix dictionary

Next, a prefix dictionary will be built based on the dictionary, with the following structure:

[Figure: structure of the prefix dictionary]

Here “北京大”, a prefix of “北京大学” (“Peking University”), has a word frequency of 0, which makes the later construction of the DAG easier.
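The prefix dictionary step can be sketched as follows, using the Chinese form of the example (北京大 as a prefix of 北京大学, "Peking University"). The word frequencies are made-up numbers and `buildPrefixDict` is a hypothetical helper, not nodejieba's API.

```javascript
// Hypothetical mini dictionary: [word, frequency]. The numbers are made up.
const dictionary = [
  ["去", 10],
  ["北", 5],
  ["北京", 30],
  ["北京大学", 20],
  ["京", 2],
  ["大", 15],
  ["大学", 25],
  ["学", 8],
  ["玩", 6],
];

// Build a map of word -> frequency in which every proper prefix of a
// dictionary word is also present. Prefixes that are not words get 0.
function buildPrefixDict(entries) {
  const freq = new Map();
  let total = 0;
  for (const [word, count] of entries) {
    freq.set(word, count); // a real word always overrides a 0 placeholder
    total += count;
    const chars = Array.from(word);
    for (let len = 1; len < chars.length; len++) {
      const prefix = chars.slice(0, len).join("");
      if (!freq.has(prefix)) freq.set(prefix, 0);
    }
  }
  return { freq, total };
}

const { freq, total } = buildPrefixDict(dictionary);
console.log(freq.get("北京大")); // 0
console.log(freq.get("北京"));   // 30
```

The zero-frequency entries let a scanner keep extending a fragment as long as it is still a prefix of some word, without treating the fragment itself as a word.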

Build the DAG

DAG is short for directed acyclic graph.

The input is segmented based on the prefix dictionary. “去” (“go”) is not a prefix of any longer word, so it has only one segmentation; “北” has three: “北”, “北京” (“Beijing”) and “北京大学” (“Peking University”); “京” has only one; “大” has two: “大” and “大学” (“university”); “学” and “玩” each have only one. This gives, for every character, the segmentations that start at that character. The resulting DAG is shown in the following figure:

[Figure: DAG of the example sentence]
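A minimal sketch of the DAG construction, assuming the example sentence is 去北京大学玩 and a prefix dictionary with made-up frequencies. The function name `buildDAG` and the array-of-end-indices representation are assumptions chosen for illustration.

```javascript
// Prefix dictionary with made-up frequencies; "北京大" is a prefix only (0).
const freq = new Map([
  ["去", 10], ["北", 5], ["北京", 30], ["北京大", 0], ["北京大学", 20],
  ["京", 2], ["大", 15], ["大学", 25], ["学", 8], ["玩", 6],
]);

// For each start position, collect the end positions of every dictionary
// word that begins there. dag[start] = [end, end, ...] (inclusive indices).
function buildDAG(sentence, freq) {
  const chars = Array.from(sentence);
  const dag = [];
  for (let start = 0; start < chars.length; start++) {
    const ends = [];
    let end = start;
    let fragment = chars[start];
    // Keep extending while the fragment is still a known prefix.
    while (end < chars.length && freq.has(fragment)) {
      if (freq.get(fragment) > 0) ends.push(end); // only real words form edges
      end += 1;
      fragment = chars.slice(start, end + 1).join("");
    }
    if (ends.length === 0) ends.push(start); // unknown character: single-char edge
    dag.push(ends);
  }
  return dag;
}

const dag = buildDAG("去北京大学玩", freq);
console.log(JSON.stringify(dag)); // [[0],[1,2,4],[2],[3,4],[4],[5]]
```

Position 1 (北) has three outgoing edges, ending at 北, 北京 and 北京大学, which matches the three segmentations described above.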

Maximum probability path calculation

All paths through the DAG above are as follows:

  1. 去 / 北 / 京 / 大 / 学 / 玩
  2. 去 / 北 / 京 / 大学 / 玩
  3. 去 / 北京 / 大 / 学 / 玩
  4. 去 / 北京 / 大学 / 玩
  5. 去 / 北京大学 / 玩

Every node carries a weight; for a word in the prefix dictionary, the weight is its word frequency. The problem is therefore to find the path that maximizes the total weight of the whole sentence.

This is a typical dynamic programming problem. First, we confirm that the two conditions for dynamic programming hold:

  • Overlapping subproblems: for a node i and its possible successor nodes j and k:
The weight of any path through i to j = the weight of the path to i + the weight of j, that is, R(i → j) = R(i) + W(j)
The weight of any path through i to k = the weight of the path to i + the weight of k, that is, R(i → k) = R(i) + W(k)

That is, for j and k sharing the predecessor node i, the weight of the path to i would otherwise be computed repeatedly.

  • Optimal substructure: let the optimal path of the whole sentence be Rmax, its end node be x, and the possible predecessor nodes of x be i, j and k. Then:
Rmax = max(Rmax_i, Rmax_j, Rmax_k) + W(x)

So the problem reduces to solving Rmax_i, Rmax_j and Rmax_k: the optimal solution of each substructure is part of the global optimal solution.

Computed this way, the optimal path is “去 / 北京大学 / 玩” (“go / Peking University / play”).
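The dynamic programming step can be sketched as below. Two assumptions to flag: the frequencies are made up, and each word is scored with the log of its relative frequency, so summing scores corresponds to multiplying probabilities. The sketch fills the table from the end of the sentence backwards, which is equivalent to the forward formulation above. `bestPath` is a hypothetical helper name.

```javascript
const chars = Array.from("去北京大学玩");

// Made-up frequencies; "北京大" is a prefix-only entry with frequency 0.
const freq = new Map([
  ["去", 10], ["北", 5], ["北京", 30], ["北京大", 0], ["北京大学", 20],
  ["京", 2], ["大", 15], ["大学", 25], ["学", 8], ["玩", 6],
]);

// dag[start] lists the inclusive end index of each word starting at start.
const dag = [[0], [1, 2, 4], [2], [3, 4], [4], [5]];

function bestPath(chars, dag, freq) {
  let total = 0;
  for (const count of freq.values()) total += count;
  const logTotal = Math.log(total);
  const n = chars.length;

  // route[i] = best score for the suffix starting at i, plus the end index
  // of the first word on that best path.
  const route = new Array(n + 1);
  route[n] = { score: 0, end: n };
  for (let i = n - 1; i >= 0; i--) {
    let best = null;
    for (const end of dag[i]) {
      const word = chars.slice(i, end + 1).join("");
      const score =
        Math.log(freq.get(word) || 1) - logTotal + route[end + 1].score;
      if (best === null || score > best.score) best = { score, end };
    }
    route[i] = best;
  }

  // Walk the table from the start to recover the segmentation.
  const words = [];
  for (let i = 0; i < n; i = route[i].end + 1) {
    words.push(chars.slice(i, route[i].end + 1).join(""));
  }
  return words;
}

console.log(bestPath(chars, dag, freq)); // [ '去', '北京大学', '玩' ]
```

Because every log probability is negative, paths with fewer, more frequent words tend to score higher, which is why the single word 北京大学 beats its finer-grained alternatives here.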

HMM (hidden Markov model)

For out-of-vocabulary words, Jieba uses an HMM (hidden Markov model) for segmentation. It treats segmentation as a sequence labeling problem: the sentence is the observation sequence and the segmentation result is the state sequence. The author of Jieba mentioned in an issue that the HMM parameters come from the segmented 1998 People's Daily corpus that can be downloaded online, an MSR corpus, and some TXT novels he collected himself, which were segmented with ICTCLAS and then had their word frequencies counted with a Python script.

The model consists of a five-tuple and rests on two basic assumptions.

Five-tuple:

  1. Set of state values
  2. Set of observations
  3. Initial state probabilities
  4. State transition probabilities
  5. State emission probabilities

Basic assumptions:

  1. Homogeneous Markov assumption: the state of the hidden Markov chain at any time t depends only on its state at the previous time t-1; it is independent of the states and observations at other times, and of the time t itself.
  2. Observation independence assumption: the observation at any time depends only on the state of the Markov chain at that time, and is independent of other observations and states.

The set of state values is {B: begin, E: end, M: middle, S: single}, representing the position of each character within a word: B is the start of a word, E is the end, M is the middle, and S is a single character that forms a word by itself.

The set of observations is the set of characters in the input sentence.

The initial state probabilities give the probability that the first character of the sentence is in each of the four states B, M, E and S. The probabilities of E and M are 0, because the first character can only be B or S, which matches reality.
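To make the B/M/E/S labels concrete, the sketch below turns a sequence of per-character states back into words. The function name `tagsToWords` and the sample sentence and tags are illustrative assumptions.

```javascript
// Rebuild words from per-character states: a word closes on E or S.
function tagsToWords(chars, tags) {
  const words = [];
  let current = "";
  chars.forEach((ch, i) => {
    current += ch;
    if (tags[i] === "E" || tags[i] === "S") {
      words.push(current);
      current = "";
    }
  });
  if (current !== "") words.push(current); // tolerate an unterminated tail
  return words;
}

const sample = Array.from("去北京大学玩");
const sampleTags = ["S", "B", "M", "M", "E", "S"];
console.log(tagsToWords(sample, sampleTags)); // [ '去', '北京大学', '玩' ]
```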

The state transition probabilities give the probability of moving from one state to the next, which satisfies the homogeneity assumption. The structure can be represented by a nested object:

P = {
    B: {E: -0.510825623765990, M: -0.916290731874155},
    E: {B: -0.5897149736854513, S: -0.8085250474669937},
    M: {E: -0.33344856811948514, M: -1.2603623820268226},
    S: {B: -0.7211965654669841, S: -0.6658631448798212},
}

P['B']['E'] is the probability of moving from state B to state E, which is 0.6 (the structure stores the logarithm of each probability for easier computation). Similarly, P['B']['M'] means the probability that the next state is M is 0.4. So when a character starts a word, the next character is more likely to end the word than to be in its middle, which is intuitive, because two-character words are more common than longer ones.
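The 0.6 and 0.4 figures can be checked by exponentiating the stored values, assuming they are natural logarithms. The sketch below copies the transition table from above; `toProbabilities` is a hypothetical helper.

```javascript
// Log transition probabilities copied from the structure above.
const logTransition = {
  B: { E: -0.510825623765990, M: -0.916290731874155 },
  E: { B: -0.5897149736854513, S: -0.8085250474669937 },
  M: { E: -0.33344856811948514, M: -1.2603623820268226 },
  S: { B: -0.7211965654669841, S: -0.6658631448798212 },
};

// Convert one row of log probabilities back to plain probabilities.
function toProbabilities(row) {
  const result = {};
  for (const [next, logP] of Object.entries(row)) {
    result[next] = Math.exp(logP);
  }
  return result;
}

for (const [state, row] of Object.entries(logTransition)) {
  const probs = toProbabilities(row);
  const sum = Object.values(probs).reduce((a, b) => a + b, 0);
  console.log(state, probs, "row sum:", sum.toFixed(4));
}
```

Each row sums to approximately 1, as expected for a probability distribution over the next state. Working in log space lets a decoder add scores instead of multiplying many small probabilities.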

The state emission probabilities give the probability of observing a particular character in the current state, which satisfies the observation independence assumption. The structure is the same as above and can also be represented by a nested object:

P = {
    B: {'Tu': -2.70366861046, 'Su': -10.2782270947, 'Shi': -5.57547658034},
    M: {'want': -4.26625051239, 'close': -2.1517176509, 'Cheng': -5.11354837278},
    S: {……},
    E: {……},
}

P['B']['Tu'] means that when the state is B, the logarithm of the probability that the observed character is “突” (“sudden”, rendered as 'Tu' in the structure above) is -2.70366861046.

Finally, the Viterbi algorithm takes the set of observations as input, uses the initial state probabilities, state transition probabilities and state emission probabilities as parameters, and outputs the sequence of state values, that is, the maximum-probability segmentation. This article does not cover the Viterbi algorithm in detail; interested readers can look it up on their own.
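For readers who want a starting point, here is a compact sketch of Viterbi decoding over the four states. The transition table is the one shown earlier; the initial and emission probabilities are made-up toy values chosen only so the example runs, and `MIN_LOG` is an assumed stand-in for the log of zero. It is not Jieba's implementation.

```javascript
const STATES = ["B", "M", "E", "S"];
const MIN_LOG = -3.14e100; // assumed stand-in for log(0)

// Toy initial probabilities (illustrative); E and M cannot start a sentence.
const start = { B: Math.log(0.77), S: Math.log(0.23) };

// Transition log probabilities from the article; missing pairs are impossible.
const trans = {
  B: { E: -0.510825623765990, M: -0.916290731874155 },
  E: { B: -0.5897149736854513, S: -0.8085250474669937 },
  M: { E: -0.33344856811948514, M: -1.2603623820268226 },
  S: { B: -0.7211965654669841, S: -0.6658631448798212 },
};

// Toy emission probabilities, made up for this example only.
const emit = {
  B: { "北": Math.log(0.9), "大": Math.log(0.1) },
  M: { "京": Math.log(0.45), "大": Math.log(0.45) },
  E: { "学": Math.log(0.9), "京": Math.log(0.1) },
  S: { "去": Math.log(0.45), "玩": Math.log(0.45), "北": Math.log(0.01),
       "京": Math.log(0.01), "大": Math.log(0.05), "学": Math.log(0.04) },
};

function viterbi(chars) {
  if (chars.length === 0) return [];
  // scores[s]: best log probability of a tag sequence ending in state s.
  let scores = {};
  let paths = {};
  for (const s of STATES) {
    scores[s] = (start[s] ?? MIN_LOG) + (emit[s][chars[0]] ?? MIN_LOG);
    paths[s] = [s];
  }
  for (let t = 1; t < chars.length; t++) {
    const nextScores = {};
    const nextPaths = {};
    for (const s of STATES) {
      let bestPrev = STATES[0];
      let bestScore = -Infinity;
      for (const prev of STATES) {
        const score = scores[prev] + (trans[prev][s] ?? MIN_LOG) +
          (emit[s][chars[t]] ?? MIN_LOG);
        if (score > bestScore) {
          bestScore = score;
          bestPrev = prev;
        }
      }
      nextScores[s] = bestScore;
      nextPaths[s] = [...paths[bestPrev], s];
    }
    scores = nextScores;
    paths = nextPaths;
  }
  // A sentence can only end with E or S.
  return scores.E >= scores.S ? paths.E : paths.S;
}

const tags = viterbi(Array.from("去北京大学玩"));
console.log(tags.join("")); // SBMMES
```

With these toy parameters the decoded states S, B, M, M, E, S correspond to the segmentation 去 / 北京大学 / 玩. Real parameters are estimated from a segmented corpus, as described above.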

The two techniques above are the technical core of our architecture. Building on them, we improved the Electron-side architecture of NetEase Yunxin IM.

NetEase Yunxin IM architecture on Electron

Architecture diagram explained

Full-text retrieval is only one feature of IM. To avoid affecting other IM features and to allow faster iteration, we adopted the following architecture:

[Figure: architecture diagram, with the new scheme on the left and the previous scheme on the right]

On the right is the previous architecture. The underlying storage uses IndexedDB, and the upper layer has two modules, one for writing and one for reading:

  • When users send, sync, delete or receive messages, the message objects are synchronized to IndexedDB;
  • When users search for a keyword, all message objects in IndexedDB are traversed, and indexOf is used to check whether each message object contains the keyword (similar to LIKE).

As a result, queries become very slow when the amount of data is large.
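The previous read path amounts to a linear scan. A minimal sketch, with a plain array standing in for the stored message objects (the field names and sample data are assumptions):

```javascript
// Hypothetical message objects; only idClient and text matter here.
const storedMessages = [
  { idClient: "id-1", text: "nice weather today" },
  { idClient: "id-2", text: "see you tomorrow" },
  { idClient: "id-3", text: "what is the weather like" },
];

// Visit every message and test it with indexOf, like a SQL LIKE '%kw%'.
function scanSearch(messages, keyword) {
  return messages.filter((msg) => msg.text.indexOf(keyword) !== -1);
}

console.log(scanSearch(storedMessages, "weather").map((m) => m.idClient));
// [ 'id-1', 'id-3' ]
```

Every query touches every message, so the cost grows with the total size of the chat history regardless of how many messages actually match.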

On the left is the new architecture, which adds word segmentation and an inverted index database. It does not affect the previous scheme at all; it only adds a layer in front of it. Now:

  • When users send, sync, delete or receive messages, the content of each message object is segmented and then synchronized to the inverted index database;
  • When users search for a keyword, the idClient of each matching message is first looked up in the inverted index database, and the corresponding message object is then fetched from IndexedDB by idClient and returned to the user.
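The two paths of the new scheme can be sketched with in-memory maps standing in for both databases. Everything here is an assumption made for illustration: the function names, the whitespace tokenizer (a real setup would plug in a proper segmenter), and the single-keyword lookup.

```javascript
const messageStore = new Map();  // stand-in for the main message database
const invertedIndex = new Map(); // token -> Set of idClient

// Placeholder tokenizer: lowercase and split on whitespace.
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Write path: store the message, then index its tokens.
function onMessageSaved(msg) {
  messageStore.set(msg.idClient, msg);
  for (const token of tokenize(msg.text)) {
    if (!invertedIndex.has(token)) invertedIndex.set(token, new Set());
    invertedIndex.get(token).add(msg.idClient);
  }
}

// Delete path: remove the message from the index and from the store.
function onMessageDeleted(idClient) {
  const msg = messageStore.get(idClient);
  if (!msg) return;
  for (const token of tokenize(msg.text)) {
    const ids = invertedIndex.get(token);
    if (!ids) continue;
    ids.delete(idClient);
    if (ids.size === 0) invertedIndex.delete(token);
  }
  messageStore.delete(idClient);
}

// Read path: keyword -> idClient values -> message objects.
function searchMessages(keyword) {
  const [token] = tokenize(keyword);
  const ids = (token && invertedIndex.get(token)) || new Set();
  return [...ids].map((id) => messageStore.get(id));
}

onMessageSaved({ idClient: "id-1", text: "nice weather today" });
onMessageSaved({ idClient: "id-2", text: "weather report" });
console.log(searchMessages("weather").map((m) => m.idClient)); // [ 'id-1', 'id-2' ]
```

The search only touches the posting list of the keyword and then fetches the matching messages by key, instead of scanning the whole store.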

Architecture advantages

The scheme has the following four advantages:

  • Speed: the inverted index is implemented with search-index, which improves search speed.
  • Cross-platform: because both search-index and IndexedDB are built on LevelDB, search-index also supports the browser environment, which makes full-text retrieval on the web possible.
  • Independence: the inverted index database is separate from IndexedDB, the main IM business database. When IndexedDB writes data, it automatically notifies the write module of the inverted index database, which segments the message content, puts it into a storage queue, and finally inserts it into the inverted index database in order. When full-text retrieval is needed, the read module of the inverted index database quickly finds the idClient of the message objects matching the keyword, and the message objects are then fetched from IndexedDB by idClient and returned.
  • Flexibility: full-text retrieval is integrated as a plug-in. The plug-in exposes a higher-order function that wraps the IM and returns a new IM that inherits from and extends it. Because JS is prototype-based, methods that do not exist on the new IM are looked up automatically along the prototype chain (that is, on the old IM). The plug-in can therefore focus on implementing its own methods without caring about the specific IM version. It also supports custom word segmentation functions, to meet different users' segmentation needs.
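The plug-in pattern from the last point might look roughly like this. `BaseIM`, `withFullTextSearch`, `searchText` and the `tokenize` option are hypothetical names for illustration and do not reflect the real SDK or plug-in API.

```javascript
// Hypothetical stand-in for the existing IM class; not the real SDK API.
class BaseIM {
  constructor() {
    this.messages = new Map(); // idClient -> message object
  }
  sendText(text) {
    const msg = { idClient: `id-${this.messages.size + 1}`, text };
    this.messages.set(msg.idClient, msg);
    return msg;
  }
  getMessage(idClient) {
    return this.messages.get(idClient);
  }
}

// Default tokenizer: split on whitespace. Callers can pass their own.
function defaultTokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Higher-order function: takes an IM class and returns an extended class.
function withFullTextSearch(IM, { tokenize = defaultTokenize } = {}) {
  return class extends IM {
    constructor(...args) {
      super(...args);
      this.ftsIndex = new Map(); // token -> Set of idClient
    }
    sendText(text) {
      const msg = super.sendText(text);
      for (const token of tokenize(text)) {
        if (!this.ftsIndex.has(token)) this.ftsIndex.set(token, new Set());
        this.ftsIndex.get(token).add(msg.idClient);
      }
      return msg;
    }
    searchText(keyword) {
      const ids = new Set();
      for (const token of tokenize(keyword)) {
        for (const id of this.ftsIndex.get(token) || []) ids.add(id);
      }
      // getMessage is not defined here; it resolves on BaseIM through
      // the prototype chain.
      return [...ids].map((id) => this.getMessage(id));
    }
  };
}

const SearchableIM = withFullTextSearch(BaseIM);
const im = new SearchableIM();
im.sendText("hello world");
im.sendText("goodbye world");
console.log(im.searchText("hello").map((m) => m.idClient)); // [ 'id-1' ]
```

Passing a different `tokenize` function, for example one that splits text into single characters, changes how messages are indexed without touching the wrapped class.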

Results

With the above architecture, our tests show that at a data volume of 200,000 records, search time drops from more than ten seconds to one second, about 20 times faster.

Summary

Based on nodejieba and search-index, we implemented full-text retrieval of NetEase Yunxin IM SDK chat messages on Electron, which speeds up searching chat history. We plan further optimizations in the following areas:

  • Write performance: in actual use, we found that with large amounts of data, LevelDB, the underlying database that search-index depends on, becomes a write bottleneck and consumes a lot of CPU and memory. Our investigation showed that SQLite has much better write performance: from our observations, write time grows only in proportion to the amount of data, while CPU and memory usage stay relatively stable. We may therefore compile SQLite into a Node native module to replace search-index in the future.
  • Scalability: business logic is not yet fully decoupled, and some business fields are stored in the inverted index database. In the future, the inverted index database could be limited to finding the idClient of message objects by keyword, with searches on business attributes moved to IndexedDB, fully decoupling the inverted index database from the main business database.

That is all for this article. Follow us for more technical content. I hope this was helpful to you.

Author introduction

Li Ning, senior front-end development engineer at NetEase Yunxin, is responsible for application development, component development and solution development for the NetEase Yunxin audio/video and IM SDK, and has extensive practical experience in React, PaaS component design, and multi-platform development and compilation. If you have any questions, please leave a message.
