Massive data segmentation: this is how it’s done

Time:2021-1-22

Background

We live in an age of information explosion. Everyone uses all kinds of application software, and that software produces a great deal of data. Enterprises treasure this data, but for us engineers it is often a headache: storing and accessing it becomes the bottleneck of system design and operation. The data usually lives in a database, and a traditional database falls short here, because a single database has a performance ceiling and is very hard to scale out. In the era of big data, we have to solve this problem.

If a stand-alone database were easy to scale out and could segment its data, these problems could be avoided. However, database vendors, including the open source MySQL, currently charge for such capabilities. So we generally turn to third-party software to segment our data, dispersing what used to sit in one database across several databases and reducing the load on each one. So how do we segment data? Follow Laomao and let’s look at the segmentation schemes.

Data segmentation

Data segmentation means distributing the data held on one machine across multiple databases according to certain rules, so as to reduce the pressure on any single database. It falls roughly into two categories: vertical segmentation and horizontal segmentation. Let’s take a look at each.

Vertical partitioning

Vertical segmentation means moving different tables or schemas into different databases. Take a simple example: the order table, the product information table, and the member table of an e-commerce product. In the early days we might put them all in the same database. Now we need to split them. The rule is that the tables of each business line go to their own database, on different physical machines or even in different data centers, so that they are completely isolated and the load on each database is reduced. The following is an example:

[Figure: example of vertical segmentation]

Vertical segmentation has relatively simple rules and is easy to implement. Modules can be divided by business type, which reduces the coupling between businesses and keeps their influence on each other small.

Laomao thinks that in an application system with a well-designed architecture, the overall functionality is composed of different business modules, and each business module corresponds to a set of tables in the database. Take the three business modules just mentioned; expanded a little, they look like this:

  • Order module: order, order details, order shipping address, order log, etc.
  • Commodity module: category, attribute, attribute value, commodity, SKU, etc.
  • Member module: member basic information table, member information operation log table, etc.

With that, the earlier example extends to the following figure:

[Figure: vertical segmentation by business module]
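The module-to-database split described above can be sketched as a simple lookup table. This is only an illustration: the database names (`order_db`, `product_db`, `member_db`) and the table names are assumptions made up for this sketch, not names from any real system.

```python
# Hypothetical mapping for vertical segmentation: one database per business module.
# All database and table names here are illustrative assumptions.
MODULE_DATABASES = {
    "order_db": ["order", "order_detail", "order_address", "order_log"],
    "product_db": ["category", "attribute", "attribute_value", "product", "sku"],
    "member_db": ["member", "member_operation_log"],
}

# Invert the mapping once so a table name can be looked up directly.
TABLE_TO_DATABASE = {
    table: database
    for database, tables in MODULE_DATABASES.items()
    for table in tables
}


def database_for(table: str) -> str:
    """Return the database that owns the given table."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        raise ValueError(f"no database configured for table {table!r}") from None
```

With this in place, `database_for("sku")` resolves to the product database, and a table that belongs to no module fails loudly instead of silently landing in the wrong place.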

When we design a system’s architecture, the more uniform the interaction between modules and the less of it there is, the better. The coupling between modules will then be very low, and the scalability and maintainability of each module will improve greatly. If the data volume of such a system later grows by orders of magnitude, vertical segmentation is quite easy to carry out.

However, in real system architectures it is often hard to achieve complete independence, and some queries will join across tables. For example, we may need to query how many orders were generated under a category. In a single database we can query directly with a join. Once the tables are split vertically into two databases, we have to query by calling an interface instead, which increases the complexity of the system. At this point we need to weigh two options: let the database give way to the business and keep these tables in one database, or split them into multiple databases and call across them through interfaces. How to segment, and how far, is a real test for the architect.
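To make the "orders under a category" case concrete, here is a minimal sketch of replacing a cross-database join with two single-database queries. Two in-memory SQLite databases stand in for the product and order databases, and the schema (`product`, `order_item`) and function names are assumptions invented for this illustration.

```python
import sqlite3

# Two in-memory SQLite databases stand in for the separate product and order databases.
product_db = sqlite3.connect(":memory:")
order_db = sqlite3.connect(":memory:")

product_db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, category_id INTEGER)")
product_db.executemany("INSERT INTO product VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])

order_db.execute("CREATE TABLE order_item (order_id INTEGER, product_id INTEGER)")
order_db.executemany(
    "INSERT INTO order_item VALUES (?, ?)",
    [(100, 1), (101, 2), (101, 3), (102, 3)],
)


def product_ids_in_category(category_id: int) -> list:
    """Stands in for the interface exposed by the product module."""
    rows = product_db.execute(
        "SELECT id FROM product WHERE category_id = ?", (category_id,)
    ).fetchall()
    return [row[0] for row in rows]


def count_orders_for_category(category_id: int) -> int:
    """Do the join in the application: ask the product side first, then the order side."""
    ids = product_ids_in_category(category_id)
    if not ids:
        return 0
    placeholders = ",".join("?" * len(ids))
    row = order_db.execute(
        "SELECT COUNT(DISTINCT order_id) FROM order_item "
        f"WHERE product_id IN ({placeholders})",
        ids,
    ).fetchone()
    return row[0]
```

The extra round trip and the hand-built `IN` list are exactly the added complexity the paragraph above refers to; with a large category, the ID list itself can become a problem.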

To sum up, the advantages and disadvantages of vertical segmentation:

Advantages:

  • After the split, business boundaries are clearer and the splitting rules are clear.
  • Data maintenance becomes simple.
  • It is easier to expand and integrate between systems.

Disadvantages:

  • Tables in different databases can no longer be joined in a query; the data has to be fetched through interfaces, which increases the complexity of the system.
  • If transactions are involved, cross-database transactions are harder to handle.
  • Even after vertical segmentation, some business data, such as orders, is still too large, so a single-database performance bottleneck remains.

Those are the shortcomings of vertical segmentation. How can we solve the last one? This is where horizontal segmentation comes in.

Horizontal partitioning

Horizontal segmentation is more complex than vertical segmentation. It splits the rows of one table across different databases according to specific rules. Take a relatively simple example: the order data is still very large after vertical segmentation, so we split it horizontally by some rule. For instance, by the parity of the order number, we store odd-numbered orders in database A and even-numbered orders in database B. The trouble is that queries must then go to different databases depending on parity. The architecture of horizontal segmentation looks as follows:

[Figure: horizontal segmentation architecture]

When we split data horizontally, we need to decide which dimension to split on. Above, we split orders by the parity of the order number. What problems does that cause? Suppose I am a user who has placed two orders, one with an odd order number and one with an even one. To show my order history, the system has to look up the two orders in two different databases by my user ID. We can imagine how troublesome that is.
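The parity rule and the problem it causes can be sketched in a few lines. Plain Python lists stand in for databases A and B here, and the function names and sample order numbers are assumptions made for this illustration.

```python
# Each "database" is a plain list of (order_no, user_id) rows in this sketch.
databases = {"A": [], "B": []}


def database_for_order(order_no: int) -> str:
    """Odd order numbers go to database A, even ones to database B."""
    return "A" if order_no % 2 == 1 else "B"


def save_order(order_no: int, user_id: int) -> None:
    """Store the order in the database chosen by the parity rule."""
    databases[database_for_order(order_no)].append((order_no, user_id))


def orders_of_user(user_id: int) -> list:
    """The user ID is not the splitting key, so every database must be searched."""
    found = []
    for rows in databases.values():
        found.extend(order_no for order_no, uid in rows if uid == user_id)
    return sorted(found)


# One user places two orders, one odd and one even.
save_order(1001, user_id=7)
save_order(1002, user_id=7)
```

A lookup by order number touches exactly one database, while `orders_of_user(7)` has to visit both and merge the results. With many databases, that fan-out is the cost of querying on a field other than the splitting key.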

Therefore, horizontal splitting has to take the specific business scenario into account. Would it be fine to split by user ID instead? Not necessarily. Let’s change perspective and stand in the position of a merchant rather than a user. A merchant’s back office also holds many orders, and merchants need to manage their own orders. If orders are split by user ID, a merchant fetching its orders still has to query several order tables and aggregate the results into one order list. In that case, splitting orders by user ID is clearly unreasonable.

Here are several ways to split horizontally:

  • Modulo on the user ID, building on the user ID split discussed above.
  • Split data by date.
  • Split the data according to other fields.

The schematic diagram of the user ID modulo method is as follows:

[Figure: schematic of the user ID modulo method]
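As a rough sketch, the first two splitting rules from the list above could look like this. The shard count of 3, the `order_db_` naming scheme, and the choice of one database per month are assumptions for this example only.

```python
from datetime import date

SHARD_COUNT = 3  # assumed number of order databases


def shard_by_user(user_id: int, shard_count: int = SHARD_COUNT) -> str:
    """Modulo method: all orders of one user land in the same database."""
    return f"order_db_{user_id % shard_count}"


def shard_by_date(created: date) -> str:
    """Date method: one database (or table) per calendar month."""
    return f"order_db_{created:%Y%m}"
```

For example, `shard_by_user(7)` gives `order_db_1`, and an order created on 2021-01-22 goes to `order_db_202101`. The modulo rule spreads load evenly but fixes the number of databases into the rule, while the date rule makes old data easy to archive but concentrates current traffic on the newest database.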

To sum up, let’s look at the advantages and disadvantages of horizontal segmentation.

Advantages:

  • It solves the performance bottleneck a single database hits under large data volumes and high concurrency.
  • After encapsulation, the splitting rules are transparent to the application layer, and developers do not need to care about the splitting details.
  • The stability and load capacity of the system are improved.

Disadvantages:

  • Splitting rules are hard to define.
  • The problem of transaction consistency is difficult to solve.
  • When expanding again (for example, going from modulo 3 to modulo 5, so that historical data must be redistributed from three databases to five), data migration and maintenance are difficult.
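The cost of the modulo 3 to modulo 5 expansion mentioned in the last point can be estimated with a small calculation. This sketch assumes consecutive integer IDs and that database number i keeps its identity after the expansion; the function name is made up for the illustration.

```python
def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose target database changes when the modulus changes."""
    keys = list(keys)
    moved = sum(1 for key in keys if key % old_count != key % new_count)
    return moved / len(keys)


fraction = moved_fraction(range(15000), old_count=3, new_count=5)
print(f"{fraction:.0%} of the rows must be migrated")  # prints "80% of the rows must be migrated"
```

Under these assumptions, only IDs whose remainder is the same under both moduli stay put (3 out of every 15 consecutive IDs), so 80% of the historical rows have to move. This is why a second expansion under a plain modulo rule is so painful.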

Final words

Nothing in the world is perfect; everything has both advantages and disadvantages. The same is true of segmenting big data, whether vertically or horizontally. Both methods solve the performance problems of storing and accessing massive data, but they also create many new problems. Common ones:

  • Distributed transaction problem.
  • Cross-database join query problem.
  • Management of multiple data sources.

For the last problem, there are two ways to manage multiple data sources:

  1. Client mode: each application module configures the data sources it needs and accesses the databases directly.
  2. Intermediate agent mode: an intermediate agent manages all data sources uniformly, so the database layer is transparent to developers, who do not need to care about the details of splitting.
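The difference between the two modes comes down to who holds the data source list and the routing rule. The sketch below is purely conceptual: the class names, the host names, and the modulo routing rule are all hypothetical and do not reflect the API of any real tool.

```python
# Hypothetical data source addresses; none of these names come from a real tool.
DATA_SOURCES = {
    "order_db_0": "mysql://db-host-0/order",
    "order_db_1": "mysql://db-host-1/order",
}


def route(user_id: int) -> str:
    """Assumed routing rule: user ID modulo the number of databases."""
    return f"order_db_{user_id % len(DATA_SOURCES)}"


class ClientModeOrderModule:
    """Client mode: the application holds the data source list and routes by itself."""

    def __init__(self, data_sources: dict):
        self.data_sources = data_sources

    def address_for(self, user_id: int) -> str:
        return self.data_sources[route(user_id)]


class AgentModeOrderModule:
    """Agent mode: the application only knows the single address of the agent."""

    def __init__(self, agent_address: str):
        self.agent_address = agent_address

    def address_for(self, user_id: int) -> str:
        # Routing happens inside the agent, out of the application's sight.
        return self.agent_address
```

In client mode every application carries the configuration and must be redeployed when the layout changes. In agent mode the layout can change behind one address, at the price of an extra network hop and one more component to keep available.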

Mature third-party software exists for both modes: MYCAT (intermediate agent mode) and Sharding-JDBC (client mode).

Due to space limitations, Laomao will give detailed, hands-on examples of these two tools in follow-up articles.