Architecture Analysis of the Distributed Search Engine Elasticsearch


1、 Preface

More and more enterprises use Elasticsearch (hereinafter ES) to store unstructured data in business scenarios such as product search in e-commerce, metrics analysis, and log analysis. As a supplement to traditional relational databases, ES provides capabilities that relational databases lack.

ES first came to public attention for its full-text search capability. That capability comes from its Lucene-based implementation and the inverted index data structure inside it.

This article introduces the distributed architecture of ES and its index storage mechanism. It does not cover the ES API in detail; it analyzes ES at the level of overall architecture. How to use ES will be covered in other articles.

2、 What is an inverted index

To understand what an inverted index is, let's first review what an index is. A book's table of contents lists chapter names and page numbers. To read a particular chapter, we look up its name in the table of contents, find the page number, and turn to the content. Going from chapter name to page number to content is an indexing process. So what is an inverted index?

For example, to read an article in the book “Java programming ideas”, we open the table of contents, which records chapter names and their page numbers. Looking up the chapter name “inheritance” gives us the location of that chapter, and we can read its content. While reading, we notice that the text contains the word “object” many times.

So what if we want to find every article in the book that contains the word “object”?

With the index described so far, this is like looking for a needle in a haystack. But suppose we had a mapping from “object” to articles; wouldn't that solve it? Such a reverse mapping is called an inverted index.

As shown in Figure 1, each article is segmented into keywords (terms), and the inverted index is built from those terms. The terms form a dictionary. Each entry in the dictionary is a term, and each term has a corresponding list, called the inverted list, which stores information such as document numbers and term frequency. Each element of an inverted list is an inverted item. The whole inverted index resembles a Xinhua dictionary. The inverted lists of all terms are usually stored sequentially in a file on disk, called the inverted file.


(Figure 1)

The dictionary and the inverted file are two basic data structures of Lucene, but they are stored differently: the dictionary is kept in memory and the inverted file on disk. This article does not cover word segmentation, TF-IDF, BM25, vector space similarity, or the other techniques used to build and query an inverted index. A basic understanding of the inverted index is enough.
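To make the structure concrete, here is a minimal Python sketch of an inverted index. It is only an illustration, not how Lucene stores its data: the tokenizer is a naive regular expression, the sample documents are made up, and the names tokenize and build_inverted_index are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Very naive word segmentation: lowercase, then keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each term to an inverted list of (doc_id, term_frequency) pairs."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # The sorted terms play the role of the dictionary; each value is an inverted list.
    return {term: sorted(hits.items()) for term, hits in sorted(postings.items())}


# Made-up sample "articles", keyed by document number.
docs = {
    1: "Inheritance lets one object reuse another object",
    2: "An interface describes what an object can do",
    3: "Exceptions report errors",
}
index = build_inverted_index(docs)
print(index["object"])      # [(1, 2), (2, 1)]
print(index["exceptions"])  # [(3, 1)]
```

Looking up “object” now returns every document that contains the word, together with how often it occurs, without scanning the documents themselves.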

3、 Cluster architecture of ES

1. Cluster node

An ES cluster consists of one or more nodes, and each node is an ES service instance. A node joins a cluster by configuring the cluster name, cluster.name. How does configuring the same cluster name let nodes join the cluster? To answer that, we first need to understand the roles of nodes in an ES cluster.

Nodes in ES have different roles, which are set through the following options in the configuration file conf/elasticsearch.yml.

node.master: true/false
node.data: true/false

A single node in a cluster can be a candidate master node, a data node, or both. Combining the two options above gives four categories:

(1) Candidate master node only
(2) Both a candidate master node and a data node
(3) Data node only
(4) Neither a candidate master node nor a data node

Candidate master node: only candidate master nodes can vote in an election, and only a candidate master node can be elected master.

Master node: responsible for creating and deleting indexes, tracking which nodes are part of the cluster, allocating shards, and collecting the status of every node in the cluster. A stable master node is very important to the health of the cluster.

Data node: responsible for adding, deleting, modifying, querying, and aggregating data. Data nodes handle all data storage and queries, which places heavy demands on CPU, I/O, and memory, so machines with high-end configurations are usually chosen as data nodes.

In addition, there is a node role called the coordinating node. A user request can be sent to any node; that node distributes the request and collects the results, without the master node having to forward anything. Such a node is called a coordinating node, and any node in the cluster can act as one. All nodes keep in contact with each other.


(Figure 2)

2. Discovery mechanism

As mentioned earlier, a node joins a cluster by setting the cluster name. How does ES do this?

This is where Zen Discovery, ES's own discovery mechanism, comes in.

Zen Discovery is the built-in discovery mechanism of ES. It provides unicast and multicast discovery, and its main responsibilities are discovering the nodes in a cluster and electing the master node.

Multicast: a node sends requests to multiple machines. This method is not recommended for production, because in a large cluster multicast generates a lot of unnecessary traffic.

Unicast: when a node joins an existing cluster or forms a new one, it sends requests to specific machines. Once the node contacts a member of the unicast list, it obtains the state of all nodes in the cluster, then contacts the master node and joins the cluster.

Only nodes running on the same machine form a cluster automatically. ES uses unicast discovery by default. The unicast list does not need to contain every node in the cluster; it only needs enough nodes that a new node can reach one of them and start communicating. If you use candidate master nodes as the unicast list, listing three is enough.

This is configured in the elasticsearch.yml file:

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

The cluster information collection stage uses the gossip protocol, and the list configured above acts as the seed nodes. The gossip protocol is not described further here.

The official ES recommendation is to set unicast.hosts to all candidate master nodes. Zen Discovery pings once every ping_interval (a configuration item), each ping times out after ping_timeout (a configuration item), and after 3 failed pings (the ping_retries configuration item) the node is considered down. A node going down triggers failover, and operations such as shard reallocation and replication are carried out.
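The ping-based failure detection described above can be sketched in a few lines of Python. This is a simplified model, not the ES implementation: the function name detect_failure is hypothetical, each boolean stands for one ping (True means it was answered within ping_timeout), and the sketch assumes the failed pings must be consecutive.

```python
def detect_failure(ping_results, ping_retries=3):
    """Return the index of the ping at which the node is declared down, or None.

    ping_results: iterable of booleans, one per ping_interval tick.
    True means the ping was answered within ping_timeout, False means it timed out.
    Assumption: the node is declared down after ping_retries consecutive failures.
    """
    consecutive_failures = 0
    for i, answered in enumerate(ping_results):
        if answered:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= ping_retries:
                return i
    return None


# Two isolated timeouts are tolerated; the third failure in a row triggers failover.
print(detect_failure([True, False, False, True, False, False, False]))  # 6
print(detect_failure([True, False, True]))                              # None
```

Raising ping_timeout or ping_retries makes the detector more tolerant of slow responses, at the price of reacting later to a real outage.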

If the node that went down is not the master, the master updates the cluster meta information and publishes the latest version to the other nodes, which reply with an ACK. Once the master has received replies from discovery.zen.minimum_master_nodes - 1 candidate master nodes, it sends an apply message to the other nodes and the cluster state is updated. If the node that went down is the master, the remaining candidate master nodes start the master election process.

2.1 Master election

An election must produce exactly one master. ES uses a quorum, a threshold representing the majority, to ensure that the elected master is recognized by at least a quorum of candidate master nodes.

An election is initiated by a candidate master node. The node starts an election when it finds that it is not the master, that it cannot reach the master by pinging other nodes, and that more than minimum_master_nodes nodes, including itself, cannot reach the master.

Master election flow chart


(Figure 3)

During the election, the cluster nodes are sorted by the pair <stateVersion, id>. stateVersion is sorted in descending order, so that the node with the newest cluster meta information is preferred as master. id is sorted in ascending order as a tiebreaker, so that the vote does not fail when stateVersion values are equal.

After sorting, the first node is the master. When a candidate master node initiates an election, it picks a master according to this ordering.
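The ordering rule above is easy to express as a sort key. The following Python sketch only illustrates the <stateVersion, id> comparison; the Candidate class, the elect_master function, and the sample node names are hypothetical and are not ES code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    node_id: str
    state_version: int


def elect_master(candidates):
    """Pick the candidate with the highest state_version; break ties by smallest node_id."""
    ranked = sorted(candidates, key=lambda c: (-c.state_version, c.node_id))
    return ranked[0]


candidates = [
    Candidate("node-b", 12),
    Candidate("node-a", 12),
    Candidate("node-c", 11),
]
print(elect_master(candidates).node_id)  # node-a
```

node-a and node-b hold equally recent cluster state, so the smaller id wins; node-c loses because its state is older.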

2.2 Split brain

Any discussion of elections in distributed systems has to mention split brain. What is split brain? If multiple master nodes are elected in a cluster, data updates become inconsistent. This phenomenon is called split brain.

In short, different nodes in the cluster disagree about which node is the master, and multiple masters compete.

  Split brain generally has the following possible causes:

  • Network problems: network latency within the cluster leaves some nodes unable to reach the master. They conclude that the master is dead and elect a new one, even though the original master is still up. The shards and replicas on the old master are marked red, and new primary shards are allocated.
  • Node load: the master node serves as both master and data node. Under heavy traffic, ES may stop responding (it appears dead) and cause widespread delays. Other nodes get no response from the master, conclude that it is dead, and elect a new master.
  • Memory reclamation: the master node serves as both master and data node. When the ES process on the data node occupies a large amount of memory, it triggers large-scale JVM garbage collection, and the ES process stops responding.

How to avoid split brain: each of the causes above suggests a countermeasure.

  • Increase the response timeout appropriately to reduce misjudgment. The discovery.zen.ping_timeout parameter sets the node ping timeout; the default is 3s and it can be raised moderately.
  • Control when elections are triggered. Set discovery.zen.minimum_master_nodes in the configuration file of each candidate master node. This parameter is the number of candidate master nodes that must take part in a master election. The default is 1, and the officially recommended value is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of candidate master nodes. This prevents split brain while keeping the cluster as available as possible: as long as at least discovery.zen.minimum_master_nodes candidate nodes are alive, elections proceed normally. When fewer are alive, no election can be triggered and the cluster becomes unavailable, but the shards are not thrown into disarray.
  • Role separation: separate the candidate master role from the data role, as described above. This lightens the load on the master node, keeps it from appearing dead, and reduces false detections of master failure.
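The recommended quorum value can be checked with a little arithmetic. The Python sketch below uses the formula from the text with integer division; the function names are hypothetical, and the loop simply enumerates every way a cluster of five candidate master nodes could be split into two network partitions.

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Recommended quorum: (master_eligible_nodes / 2) + 1, using integer division."""
    return master_eligible_nodes // 2 + 1


def can_elect(reachable_candidates: int, master_eligible_nodes: int) -> bool:
    """An election may proceed only if a quorum of candidates is reachable."""
    return reachable_candidates >= minimum_master_nodes(master_eligible_nodes)


total = 5  # number of candidate master nodes in this made-up cluster
for side_a in range(total + 1):
    side_b = total - side_a
    print(
        f"partition {side_a}+{side_b}: "
        f"side A can elect={can_elect(side_a, total)}, "
        f"side B can elect={can_elect(side_b, total)}"
    )
```

For every split, at most one side reaches the quorum, which is why two masters cannot be elected at the same time with this setting. With the default value of 1, both sides of a partition could elect their own master.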

4、 How an index is written

1. Index writing principle

1.1 Shards

ES supports full-text search at the PB scale. As the amount of data in an index grows, queries become slower and slower. One natural solution is to spread the data across different places, and ES does exactly this: it splits the data of an index horizontally into separate blocks, each of which is called a shard. This is very similar to database and table sharding in MySQL.

Different primary shards are distributed across different nodes. So when an index has multiple shards, which one should a document be written to? It cannot be random, or the data could not be located quickly at query time. A routing strategy is needed to decide which shard receives the write; it is described later. The number of shards must be specified when the index is created, and once set, it cannot be changed.

1.2 Replicas

A replica is a copy of a shard. Each primary shard has one or more replica shards. When the primary shard fails, a replica can serve queries and other operations. A primary shard and its replica shards are never placed on the same node, to avoid data loss: when a node goes down, the data can still be queried through a replica. The maximum number of replica shards is n-1 (where n is the number of nodes).

Creating, indexing, and deleting a document are all write operations. A write must complete on the primary shard before it is copied to the corresponding replicas. To improve write throughput, ES performs this replication concurrently. To resolve data conflicts during concurrent writes, ES uses optimistic locking: each document has a _version number, which is incremented whenever the document is modified.

Once all replica shards have been written successfully, the result is reported to the coordinating node, which then reports to the client.
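Optimistic locking with a per-document version number can be illustrated with a small in-memory store. This is a sketch of the general technique, not the ES write path: the VersionedStore class, the VersionConflict exception, and the expected_version parameter are hypothetical names introduced for the example.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        """Write a document and bump its version.

        If expected_version is given, the write succeeds only when it matches
        the current version; otherwise someone else has modified the document.
        """
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
store.index("1", {"stock": 10})                             # version becomes 1
# Two writers both read version 1, then try to write.
store.index("1", {"stock": 9}, expected_version=1)          # first writer wins, version 2
try:
    store.index("1", {"stock": 8}, expected_version=1)      # second writer is stale
except VersionConflict as exc:
    print("rejected:", exc)
```

No lock is held while the writers work; the conflict is detected only at write time, and the losing writer can re-read the document and retry.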


(Figure 4)

1.3 Index write process in Elasticsearch

As mentioned above, a write goes to the primary shard first and is then synchronized to the replica shards. In Figure 4 there are four primary shards: S0, S1, S2, and S3. By what strategy is a document written to a particular shard? Why is it written to S0 rather than S1 or S2? This is determined by the following formula.

shard = hash(routing) % number_of_primary_shards

The result of this formula is a remainder between 0 and number_of_primary_shards-1, which is the position of the shard where the document lives. routing is passed through a hash function to produce a number, and that number is divided by number_of_primary_shards (the number of primary shards) to get the remainder. routing is a variable value: by default it is the document's _id, and it can also be set to a custom value.

When a write request is sent to a node, that node acts as the coordinating node, as described above, and uses the routing formula to work out which shard to write to. The node has the shard information of all other nodes. If the target shard is on another node, it forwards the request to the node that holds the primary shard.

In an ES cluster, every node can locate data through the formula above, so every node is able to accept read and write requests.

So why is the number of primary shards fixed when the index is created? Because if the number changed, every previously computed routing value would become invalid, and the data could no longer be found.
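The routing formula, and the reason the shard count cannot change, can be demonstrated with a short Python sketch. The hash function here is an assumption: zlib.crc32 is used only as a convenient deterministic stand-in and is not the hash ES uses internally. The route function and the document ids are likewise made up for the example.

```python
import zlib


def route(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    crc32 is a stand-in hash chosen for this sketch, not the ES hash function.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# By default the routing value is the document _id.
doc_ids = [f"doc-{i}" for i in range(1000)]

with_4_shards = {doc_id: route(doc_id, 4) for doc_id in doc_ids}
with_5_shards = {doc_id: route(doc_id, 5) for doc_id in doc_ids}

moved = sum(1 for d in doc_ids if with_4_shards[d] != with_5_shards[d])
print(f"{moved} of {len(doc_ids)} documents would be looked up on a different shard")
```

Documents written while the index had four shards would be looked up on the wrong shard if the same formula were evaluated with five, which is the situation the fixed shard count prevents.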


(Figure 5)

As shown in Figure 5 above, suppose the routing formula gives shard = hash(routing) % 4 = 1 for a document. The process is then as follows:

(1) The write request is sent to node1. The routing calculation yields 1, so the data belongs on primary shard S1.
(2) node1 forwards the request to node2, where primary shard S1 is located. node2 accepts the request and writes it to disk.
(3) The data is replicated concurrently to the three replica shards R1, with optimistic concurrency control handling data conflicts. Once all replica shards report success, node2 reports success to node1, and node1 reports success to the client.

In this mode, as long as there is a replica, the minimum write latency is the sum of two single-shard write times, so efficiency is lower. The benefit is equally clear: it avoids data loss if a single machine's hardware fails after a write. Between data integrity and performance, data integrity is generally preferred, except in special scenarios that tolerate data loss.
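The latency remark above follows from a simple model: the primary write must finish first, and the replica writes then run concurrently. The Python sketch below captures only that model; it ignores network hops and queuing, the function name write_latency_ms is hypothetical, and the millisecond figures are made up.

```python
def write_latency_ms(primary_ms, replica_ms):
    """Client-visible write latency under a simplified model.

    The primary shard write completes first; all replica writes then run
    concurrently, so the slowest replica determines the second phase.
    Network transfer time is deliberately ignored.
    """
    return primary_ms + (max(replica_ms) if replica_ms else 0)


print(write_latency_ms(5, []))         # 5: no replica, a single shard write
print(write_latency_ms(5, [5]))        # 10: two single-shard write times
print(write_latency_ms(5, [4, 7, 6]))  # 12: the slowest replica dominates
```

Because the replicas are written concurrently, adding more replicas does not add one write time per replica; the cost is roughly one extra write phase.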

To reduce disk I/O and keep read and write performance high, ES typically persists data to disk only at intervals (for example, every 30 minutes). Data that has been written to memory but not yet flushed to disk would be lost if the machine crashed or lost power. How does ES guard against this?

For this, ES borrows an approach from databases and adds a commit log module, called the translog in ES. It is described in the storage principle section below.

2. Storage principle

The above describes how ES writes an index. After data is written to the primary shard and its replicas, it sits in memory. To ensure that it survives a power failure, it must be persisted to disk.

ES is built on Lucene, and internally it uses Lucene to create, write, and search indexes. Lucene works as shown in the figure below. When a new document is added, Lucene first performs preprocessing such as word segmentation, then writes the document's index into memory and records the operation in the transaction log (translog). The translog is similar to MySQL's binlog: it stores the operation log of data that has not yet been persisted and is used to recover in-memory data after a crash.

By default, Lucene refreshes every 1s (the refresh_interval configuration item), moving the data in memory into the file system cache as a segment. Once it is in the file system cache, the segment can be searched; before that it cannot.

refresh_interval therefore determines how real-time ES data is, which is why ES is a near-real-time system. Segments on disk cannot be modified, which avoids random disk writes; all random writes happen in memory. Segments accumulate over time. By default, Lucene persists the segments in the cache to disk every 30 minutes or when the segment space exceeds 512M. This is called a commit point, and the corresponding translog is deleted at that time.
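The interplay of in-memory buffer, refresh, commit, and translog can be modeled with a toy class. This is a simplified sketch under stated assumptions, not how ES or Lucene is implemented: the Shard class and its method names are hypothetical, segments are plain tuples, and a crash is assumed to lose everything that has not reached disk except the translog.

```python
class Shard:
    """Toy model of the write path: buffer -> refresh -> commit, with a translog."""

    def __init__(self):
        self.buffer = []          # indexed in memory, not yet searchable
        self.cache_segments = []  # segments in the file system cache, searchable
        self.disk_segments = []   # segments persisted to disk
        self.translog = []        # operations not yet covered by a commit point

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new immutable, searchable segment."""
        if self.buffer:
            self.cache_segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Commit point: persist cached segments and discard the translog."""
        self.refresh()
        self.disk_segments.extend(self.cache_segments)
        self.cache_segments = []
        self.translog = []

    def search(self, predicate):
        segments = self.disk_segments + self.cache_segments
        return [doc for segment in segments for doc in segment if predicate(doc)]

    def recover_after_crash(self):
        """Assumed crash model: buffer and cache are lost; replay the translog."""
        pending = list(self.translog)
        self.buffer, self.cache_segments, self.translog = [], [], []
        for doc in pending:
            self.index(doc)


shard = Shard()
shard.index({"id": 1})
print(shard.search(lambda d: True))  # []: not searchable before refresh
shard.refresh()
print(shard.search(lambda d: True))  # [{'id': 1}]: searchable after refresh
shard.flush()
print(len(shard.translog))           # 0: translog cleared at the commit point
```

The gap between index and refresh is what makes search near real time, and the translog is what makes the gap between refresh and flush safe.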

When testing writes, we can refresh manually so that data becomes searchable right away. In production, however, do not refresh manually after every indexed document, because a refresh has a performance cost. In typical business scenarios, refreshing every second is unnecessary.

You can reduce the refresh frequency of an index by increasing refresh_interval in its settings, for example refresh_interval = “30s”. Be sure to include the time unit, otherwise the value is interpreted as milliseconds. Setting refresh_interval = -1 turns off automatic refresh for the index.


(Figure 6)

Index files are stored in segments and cannot be modified. So how are additions, updates, and deletions handled?

  • Add: adding is easy. Because the data is new, a new segment is simply created for the new document.
  • Delete: because segments cannot be modified, a deleted document is not removed from its old segment. Instead, a .del file is added that lists the deleted documents and the segments they belong to. A document marked as deleted can still be matched by a query, but it is removed from the result set before the final result is returned.
  • Update: an old segment cannot be modified to reflect an update, so an update is equivalent to a delete followed by an add. The old version of the document is marked as deleted in the .del file, and the new version is indexed into a new segment. Both versions may be matched by a query, but the deleted old version is removed before the result set is returned.
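The three operations above can be sketched with immutable tuples standing in for segments and a set standing in for the .del file. This is an illustration of the idea only, not Lucene's file format: the SegmentedIndex class, its methods, and the sample documents are hypothetical.

```python
class SegmentedIndex:
    """Toy model: segments are immutable tuples; deletes are recorded separately."""

    def __init__(self):
        self.segments = []    # each segment: tuple of (doc_id, version, source)
        self.deleted = set()  # stands in for the .del file: (segment_no, position)

    def _live(self):
        for seg_no, segment in enumerate(self.segments):
            for pos, entry in enumerate(segment):
                if (seg_no, pos) not in self.deleted:
                    yield seg_no, pos, entry

    def add(self, doc_id, source, version=1):
        # New data always goes into a new segment.
        self.segments.append(((doc_id, version, source),))

    def delete(self, doc_id):
        # Mark the document as deleted; the old segment itself is untouched.
        for seg_no, pos, (found_id, _, _) in list(self._live()):
            if found_id == doc_id:
                self.deleted.add((seg_no, pos))

    def update(self, doc_id, source):
        # Update = mark the old version deleted + add the new version.
        old_versions = [v for _, _, (i, v, _) in self._live() if i == doc_id]
        self.delete(doc_id)
        self.add(doc_id, source, version=max(old_versions, default=0) + 1)

    def search(self, predicate):
        # Match against every segment first, old versions included...
        matches = [
            (seg_no, pos, entry)
            for seg_no, segment in enumerate(self.segments)
            for pos, entry in enumerate(segment)
            if predicate(entry[2])
        ]
        # ...then drop hits that are marked as deleted before returning.
        return [e for seg_no, pos, e in matches if (seg_no, pos) not in self.deleted]


idx = SegmentedIndex()
idx.add("1", {"title": "object basics"})
idx.update("1", {"title": "object advanced"})
print(len(idx.segments))                             # 2: the old segment is still there
print(idx.search(lambda s: "object" in s["title"]))  # [('1', 2, {'title': 'object advanced'})]
```

Both versions of the document match the query, but only the live one is returned, and the old version keeps occupying space until it is physically removed.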

Making segments immutable has both advantages and disadvantages.

Advantages:

  • No locks are needed. If the index is never updated, there is no need to worry about multiple processes modifying the data at the same time.
  • Once the index is read into the kernel's file system cache, it stays there because it never changes. As long as the file system cache has enough space, most read requests are served from memory without hitting the disk, which is a big performance boost.
  • Other caches, such as the filter cache, remain valid for the life of the index. They do not need to be rebuilt every time the data changes, because the data does not change.
  • Writing a single large inverted index allows the data to be compressed, reducing disk I/O and the amount of index data that must be cached in memory.


Disadvantages:

  • Deleted data is not removed immediately. It is only marked as deleted in the .del file and is physically removed only when the segment is rewritten, which wastes a lot of space.
  • If a document is updated frequently, each update adds a new version and marks the old one as deleted, which also wastes a lot of space.
  • Every addition of data requires a new segment to store it. When the number of segments grows too large, the consumption of server resources such as file handles becomes very high.
  • Query results initially include every match, so old data marked as deleted has to be filtered out, which adds to the cost of the query.

2.1 Segment merging

Each refresh creates a new segment, so the number of segments grows rapidly in a short time, and too many segments cause serious problems. A large number of segments hurts read performance, and each segment consumes file handles, memory, and CPU cycles.

More importantly, each search request must check every segment in turn and then merge the results, so the more segments there are, the slower the search.

Lucene therefore merges segments according to certain policies. During a merge, documents previously marked as deleted are purged from the file system: they are not copied into the new, larger segment.

Indexing and searching are not interrupted during a merge, and the inverted index data structure makes merging files relatively easy.

Segment merging runs automatically during indexing and searching. The merge process selects a small number of similarly sized segments and merges them into a larger segment in the background. The selected segments may be committed or uncommitted.

After merging, the old segments are deleted and the new segment is flushed to disk. A new commit point is written that includes the new segment and excludes the old, smaller segments, and the new segment is opened for search. Segment merging is computationally heavy and consumes a lot of disk I/O, so it can drag down the write rate and, if left unchecked, hurt search performance.

By default, ES limits the resources available to the merge process so that search performance is preserved.
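The merge step can be sketched as follows. This is a toy illustration, not Lucene's merge policy: picking the smallest segments is a crude stand-in for "similar size", the function names are hypothetical, and each segment is a dict with a list of documents and a set of deleted positions.

```python
def pick_small_segments(segments, max_merge=3):
    """Pick the indexes of the max_merge smallest segments (crude stand-in for a merge policy)."""
    ranked = sorted(range(len(segments)), key=lambda i: len(segments[i]["docs"]))
    return sorted(ranked[:max_merge])


def merge(segments, chosen):
    """Build one new segment from the chosen ones, skipping documents marked as deleted."""
    merged_docs = [
        doc
        for i in chosen
        for pos, doc in enumerate(segments[i]["docs"])
        if pos not in segments[i]["deleted"]
    ]
    remaining = [seg for i, seg in enumerate(segments) if i not in chosen]
    # The old segments are dropped; the new segment starts with nothing deleted.
    return remaining + [{"docs": merged_docs, "deleted": set()}]


segments = [
    {"docs": ["a1", "a2", "a3", "a4", "a5", "a6"], "deleted": set()},
    {"docs": ["b1", "b2"], "deleted": {0}},  # b1 was deleted
    {"docs": ["c1"], "deleted": set()},
    {"docs": ["d1", "d2"], "deleted": {1}},  # d2 was deleted
]
chosen = pick_small_segments(segments)
after = merge(segments, chosen)
print(chosen)             # [1, 2, 3]
print(after[-1]["docs"])  # ['b2', 'c1', 'd1']
print(len(after))         # 2
```

Four segments become two, and the space held by the deleted documents b1 and d2 is reclaimed because they are never copied into the merged segment.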


(Figure 7)

5、 Conclusion

This article introduced the architecture of ES and its index storage and write mechanisms. The overall architecture of ES is quite ingenious, and its design ideas are worth borrowing when we design our own systems. This article covers only the overall architecture of ES; more will be shared in other articles.

Author: vivo official website mall development team