Dewu Technology teaches you how to understand MongoDB sharded clusters



With the full launch of the company’s live-streaming platform and the continuous upgrading of live-streaming features, two changes stand out:

① Data volume is growing rapidly.

② Concurrency has increased significantly.

In response to these two changes, as back-end developers we need to make some technology-selection decisions. At present the high-volume live-streaming services run on MySQL, but after a period of observation we found the following problems:

① The data stored across the sub-tables of some services is unevenly balanced, and the sub-tables for some top anchors are clearly oversized.

② For new business lines or new activities, we have to create dozens of tables up front for the expected data volume, which is cumbersome and makes later changes inconvenient.

So during technology selection we thought of MongoDB’s sharded cluster. Let’s explore it.

1、 A quick recap of MongoDB’s features

1. Flexible data model

One of the biggest differences from a relational database is the flexibility to change fields, which suits business scenarios with frequent iteration and fast-changing data models. For example, in our live-streaming activity business there are more and more activities with ever-changing gameplay, so MongoDB is a natural fit for storing activity data.
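As a quick sketch of that flexibility (a mongosh fragment; the collection and field names are made up for illustration): two activities with completely different shapes can live in the same collection, with no `ALTER TABLE` step.

```javascript
// mongosh sketch — collection and field names are hypothetical.
// A PK battle and a lucky draw need different fields, yet coexist in one collection:
db.activities.insertOne({ type: "pk_battle", anchors: ["a1", "a2"], score: 0 })
db.activities.insertOne({ type: "lucky_draw", prizes: [{ item: "coin", count: 100 }] })
// Adding a new field for the next activity requires no schema migration at all.
```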

2. JSON document format

MongoDB stores data as JSON-style documents (BSON on disk), which maps naturally onto RESTful APIs.

3. Horizontal scaling capability

Unlike MySQL’s manual database and table splitting, MongoDB can scale out automatically without changing business code. This solves the two problems described above:
① uneven data across different tables;
② having to redistribute data when a sub-table grows too large.

2、 The principles and mechanics of MongoDB’s sharded cluster

Before formally describing the sharded cluster, let’s look at the common MongoDB deployment architectures, as shown in the figure below:

[Figure: the three common MongoDB deployment architectures]

From the figure above, we can clearly see three common architecture patterns:
1. Standalone: only for development and test environments; no high availability.
2. Replica set: typically one primary and two secondaries, though more secondaries can be configured. This is what most production environments run; it covers most business scenarios and is highly available.
3. Sharded cluster: provides both horizontal scalability and high availability.

Components of a sharded cluster

As shown in the figure below:

[Figure: components of a MongoDB sharded cluster]

Let’s break the figure above into four modules.

1. Application + driver

2. Routing nodes (mongos)

The routing nodes decide which shard a request is dispatched to: mongos holds a mapping table that is kept in sync with the mapping data stored on the config nodes, and it is normally loaded into mongos memory at startup.

As the figure shows, there are multiple mongos processes; this too is to satisfy high availability.

3. Config nodes (config)

The config nodes store the cluster metadata, chiefly the data-distribution mapping table.

The format of the stored mapping table is shown in the figure below:

[Figure: sample contents of the config metadata (chunk mapping) table]
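Since the original figure is unavailable, here is a hedged sketch of what the mapping looks like: in mongosh you can inspect the `config.chunks` collection, whose documents record each chunk’s key range and owning shard (exact field names vary across MongoDB versions, and the values below are illustrative).

```javascript
// mongosh sketch: inspect the chunk mapping kept by the config servers.
// Exact field names differ between MongoDB versions.
db.getSiblingDB("config").chunks.find().limit(2)
// Each document looks roughly like (illustrative values only):
// { min: { userId: MinKey }, max: { userId: 1000 },   shard: "shard0000", ... }
// { min: { userId: 1000 },   max: { userId: MaxKey }, shard: "shard0001", ... }
```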

4. Data nodes (shard)

The area at the bottom of the figure shows the shards, i.e. the data nodes. Each shard is itself highly available, typically one primary and two secondaries, and a cluster supports up to 1,024 shards. All the shards together hold the complete data set; no data is duplicated across shards.

So far, we have roughly described the components of a sharded cluster.

Do you have questions like these?

  1. How are the data ranges recorded in the config mapping table determined?
  2. How does MongoDB’s sharded cluster keep data balanced?

If you have any other questions, leave me a message at the end of the article. Before answering these, I’d like to explain a few concepts. Take a look at the figure below:

[Figure: cluster > shard > chunk > document hierarchy]

As the figure shows, the hierarchy of terms is: cluster > shard > chunk > document.

A cluster is composed of multiple shards; a shard stores multiple chunks (logical partitions of the data); and a chunk contains multiple documents. The document is not the smallest unit: a document contains fields, and one field, or several fields combined, can form a shard key.

What is the shard key for? The shard key determines how your data can be split into chunks.
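In mongosh, declaring the shard key looks roughly like this (a sketch with hypothetical database and collection names; it assumes a running sharded cluster):

```javascript
// mongosh sketch — "live" and "gift_records" are made-up names.
sh.enableSharding("live")                              // allow sharding in this database
sh.shardCollection("live.gift_records", { userId: 1 }) // range-shard on one field
// A hashed or compound shard key is declared the same way, e.g.:
// sh.shardCollection("live.gift_records", { userId: "hashed" })
```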

Now that the basic concepts are in place, let me answer the questions above.

MongoDB provides three ways of distributing data.

a. Range-based

[Figure: range-based data distribution]

As the figure shows, the data is logically divided into blocks. For example, suppose your system stores company user information and you shard by age: if the company’s age range is 18-60 and each age forms one chunk, you get at most 43 chunks, which are then distributed across the shards. A query for an age band, say 22 to 25, hits data that sits on one shard, or at most two, so range queries perform well. Now scale the data up: if more than 80% of the company is aged between 22 and 25, the chunks for those ages become far larger than the others. In other words, the data distribution is unbalanced, and since new data also falls between 22 and 25, those shards become hot shards.
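The skew in that example is easy to see in a toy simulation (plain Node.js, with hypothetical numbers: 10,000 users, 80% of them aged 22-25, and one chunk per age):

```javascript
// Toy sketch of range-based chunking by age: one chunk per age value.
// This is NOT MongoDB code — just a model of how a skewed, low-cardinality
// shard key concentrates documents into a few hot chunks.
function countPerChunk(ages) {
  const chunks = {};               // age -> number of documents in that chunk
  for (const age of ages) chunks[age] = (chunks[age] || 0) + 1;
  return chunks;
}

const ages = [];
for (let i = 0; i < 10000; i++) {
  if (i < 8000) ages.push(22 + (i % 4));   // 80% of users are 22..25
  else ages.push(18 + (i % 43));           // the rest spread over 18..60
}

const chunks = countPerChunk(ages);
console.log("chunk age=23:", chunks[23]);  // a hot chunk, well over 2000 docs
console.log("chunk age=40:", chunks[40]);  // a cold chunk, a few dozen docs
```

Whichever shards end up holding the 22-25 chunks receive most of the reads and writes: exactly the hot-shard problem described above.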

b. Hash-based

[Figure: hash-based data distribution]

As the figure above shows, the shard-key values are no longer stored contiguously; they are hashed to different chunks, which solves the problem of uneven data. The trade-off is that range queries become very inefficient, because every shard must be scanned to satisfy the query. For example, in an order system we could hash the order data by user ID, so that different users’ orders are spread evenly across the shards. Querying one user’s orders is very efficient, but querying by a time range requires scanning all the shards.

c. Zone-based

[Figure: zone-based data distribution]

As the name suggests, data is divided by geographic region. We may serve users globally, and since the data is naturally regional we can divide it into different zones. This guarantees that users in a given region are served by nearby data nodes.
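In mongosh, zone-based placement is configured roughly like this (a sketch; the shard names, namespace, and `region` field are all hypothetical, and the key ranges are simplified):

```javascript
// mongosh sketch — all names below are made up for illustration.
sh.addShardToZone("shard-eu-1", "EU")
sh.addShardToZone("shard-us-1", "US")
// Pin documents whose shard-key region falls in a range to a zone
// (assumes "region" is part of the shard key):
sh.updateZoneKeyRange("live.users", { region: "EU" }, { region: "EU_" }, "EU")
sh.updateZoneKeyRange("live.users", { region: "US" }, { region: "US_" }, "US")
```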

Is the first question clear now? Let’s answer the second one.

In short, MongoDB provides two mechanisms:

① Chunk splitter

The splitter splits the data on a source shard along chunk (logical data block) boundaries.

② Balancer

When data across shards becomes uneven, the balancer kicks in: it tells the splitter to split the data on the shards that need to shed chunks, then moves chunks to the shards holding less data. The concrete steps are as follows:

  • The balancer sends a moveChunk command to the source shard.
  • On receiving it, the source shard starts its internal moveChunk procedure. Reads and writes issued by clients during the move are still served by the source shard (because the metadata on the config servers has not changed yet).
  • The target shard requests the documents of the chunk being moved from the source shard and starts copying them.
  • After the target shard receives the last document of the chunk, it starts a synchronization process to verify that all documents have been copied.
  • Once synchronization finishes, the target shard connects to the config servers and updates the chunk’s location in the metadata.
  • After the target shard completes the metadata update, the source shard deletes the original chunk. If there are more chunks to move, the process repeats.
  • The config servers notify the mongos processes to update their mapping tables.
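The balancer drives all of this automatically, but the same move can also be triggered by hand in mongosh (the namespace, key value, and shard name below are hypothetical), which can be handy for observing the steps above:

```javascript
// mongosh sketch — namespace, key value, and shard name are made up.
sh.moveChunk("live.orders", { userId: 12345 }, "shard0001") // move the chunk owning this key
sh.isBalancerRunning()                                      // is a balancing round in progress?
sh.getBalancerState()                                       // is automatic balancing enabled?
```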

That answers the second question in some detail. Now let’s summarize the characteristics of MongoDB’s sharded cluster:

  • Completely transparent to the application, no special handling required

This is very developer-friendly: the business code does not need to change, and MongoDB deployment changes do not affect it.

  • Automatic data balancing

You don’t have to worry about any one shard growing too big. Do note, however, that your chunks must not grow too large, or even the balancer cannot save you: oversized chunks frequently fail to move.

  • Dynamic scaling without downtime

If your current production environment runs a MongoDB replica set, you can switch to a sharded cluster directly, online. If you use Alibaba Cloud’s MongoDB service, Alibaba provides two migration modes: 1. full cutover; 2. incremental cutover (note that incremental cutover costs extra), which is very convenient.

3、 When to use a sharded cluster

Above we described the concept of the sharded cluster, including its components, mechanics, and features, so you should now have a fairly complete picture. In which scenarios, then, should we consider a sharded cluster?

1. Data volume keeps growing and access performance is degrading

2. A new product launch takes off: how do we support more concurrent users?

With the live-streaming campaigns in full swing, the growth and concurrency of some business data have risen significantly. For now we can cope without adjusting the deployment strategy, but if the growth curve steepens sharply, a sharded cluster is the way to break through the database’s performance bottleneck.

3. A single database holds a large amount of data: how do we recover quickly after a failure?

Here we must anticipate the risk in advance rather than wait for a failure to start considering a sharded cluster: restoring massive data (say, at the TB level) takes a long time.

4. Geographical distribution

When data is naturally regional, zone-based sharding is worth considering.

4、 Questions to consider before using a sharded cluster

1. To design a reasonable architecture, we need to consider the following questions

Question 1: do you need to shard at all?

This must be judged against your own business scenario and current deployment; weigh it against the applicable scenarios listed above.

Question 2: how many shards do you need?

There are three dimensions to consider:

① Estimate the total storage your business will need over the next year or your planning horizon; a common rule of thumb is about 2 TB per shard.

② Calculate from your business’s peak concurrency.

③ Measure against your hardware conditions.

This usually calls for a DBA’s evaluation.

Question 3: how should the data be distributed?

This is a very critical point. If your data distribution is unreasonable, it directly causes problems such as poor horizontal scaling and low query efficiency.

2. Use it the right way

① Choose the collections you want to shard

MongoDB shards at the collection (table) level: this is a major premise. Shard only the collections that need it.

② Choose the right shard key

As we said above, the shard key directly determines how your data is partitioned. Next we focus on shard-key selection: what makes a key “right”?

First, we should follow these guidelines:

Large cardinality!

Why must the cardinality be large? If it is small, as in the age example above, your data can be split into at most a few dozen chunks. With 10 TB of data, even if every chunk were perfectly even, each would hold hundreds of GB. Chunks that large are hard to move, which directly defeats the balancing strategy.

Even distribution

Why must the distribution be even? What happens if it isn’t?

a. Hot shards emerge (recall the example in the range-based distribution section above).

b. Shard sizes drift far apart, triggering frequent rebalancing.

Good targeted queries

A good architecture serves the business. If your workload is query-heavy, you certainly don’t want every query to traverse all the shards.

So we want the data a query touches to land on the same shard as much as possible.

3. Let’s walk through two business scenarios

① Merchant order system

Business description: suppose the platform’s daily order volume is in the millions, and different merchants frequently need statistics on their orders over different time periods.

For this business scenario, let’s think through the following steps:

Step 1: should we shard?

Answer: yes, because the data volume is large and statistical queries are frequent.

Step 2: what is the most appropriate shard key?

Answer: applying the selection guidelines above, the best choice is merchant ID plus order time. Merchant ID alone gives a large enough cardinality and even data, but querying orders over a continuous time period would have to traverse all shards. So we choose the compound key of merchant ID and order time to guarantee targeted queries.
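The compound key described above would be declared like this in mongosh (the database, collection, and field names are hypothetical):

```javascript
// mongosh sketch — names are made up for illustration.
sh.enableSharding("mall")
sh.shardCollection("mall.orders", { merchantId: 1, orderTime: 1 })
// Queries that include merchantId (optionally plus a time range) can be
// routed to a single shard; the time component keeps cardinality high.
```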

② Live-stream reward records

Business description: record users’ gift rewards to anchors. Suppose reward records grow by millions per day, and only recent data is ever queried.

From the business description, the historical data in this log is never queried, so we can simply archive it; there is no need to consider a sharded cluster.

5、 Summary

Having read the full article, you should now have a solid understanding of sharded clusters, and I hope it broadens your options for your own business scenarios. Finally, two closing points:

  1. A sharded cluster can effectively solve performance bottlenecks and system-scaling problems.
  2. Shard management is complex and costly; if you can avoid sharding, don’t shard.

Author: Rongrong
Follow Dewu Technology, and let us take you to the clouds of technology