Vivo Global Mall: Order Center Architecture Design and Practice

Time: 2022-01-14

1. Background

With the rapid growth of users, the monolithic architecture of vivo official mall v1.0 gradually exposed its drawbacks: increasingly bloated modules, low development efficiency, performance bottlenecks, and a system that was hard to maintain.

The v2.0 architecture upgrade launched in 2017 physically split the system vertically by business module. The split business lines each perform their own duties, expose service-oriented capabilities, and jointly support the main site's business.

The order module is the transaction core of the e-commerce system. Its continuously accumulating data was approaching the single-table storage bottleneck, and the system could barely support the traffic of new product launches and promotions, so a service-oriented transformation was inevitable.

This article introduces the problems encountered while building the vivo mall order system, the solutions adopted, and the architecture design experience gained.

2. System architecture

The order module was separated from the mall into an independent order system. It uses its own database and provides standardized order, payment, logistics and after-sales services to the mall's related systems.

The system architecture is shown in the figure below:

[Figure: order system architecture]

3. Technical challenges

3.1 Data volume and high concurrency

The first challenge comes from the storage system:

  • Data volume problem

    With the continuous accumulation of historical orders, the order tables in MySQL had reached tens of millions of rows.

    We know that InnoDB's storage structure is a B+ tree and lookups take O(log n) time, so as the total amount of data n grows, retrieval inevitably slows down. No amount of indexing or tuning helps beyond a point; the only real option is to reduce the amount of data in a single table.

    Solutions to the data volume problem include: data archiving and table sharding.

  • High concurrency problem

    The mall business was in a period of rapid growth: order volume kept hitting new highs, business complexity kept increasing, and the application put more and more pressure on MySQL.

    A single MySQL instance has limited processing capacity. When the pressure is too high, all requests slow down, and the database may even go down.

    Solutions to the high concurrency problem include: caching, read-write separation, and database sharding.

These schemes are briefly described below:

  • Data archiving

    Order data has a time dimension and a "hot tail" effect: most queries target recent orders, while a large amount of rarely accessed old data sits in the order table.

    New and old data can therefore be stored separately: historical orders are moved into another table, and the query logic in the code is adjusted accordingly, which effectively reduces the single-table data volume.

  • Caching

    Using Redis as a front-side cache for MySQL can absorb most query requests and reduce response latency.

    Caching works especially well for a product system, whose data has little to do with individual users; but in the order system every user's orders are different, so the cache hit rate is low and the benefit is limited.


  • Read-write separation

    The master database handles data update requests and synchronizes the changes to all slave databases in near real time, while query requests are spread across multiple slaves.

    However, order data is updated frequently, so the pressure on the master at order-placing peaks is not relieved. There is also master-slave replication lag; it is normally tiny, under 1 ms, but it can still leave the master and slaves momentarily inconsistent.

    All affected business scenarios must handle this, sometimes with compromises. For example, after placing an order successfully, the user first lands on an "order succeeded" page and only sees the order after manually clicking to view it.


  • Database sharding

    Database sharding comes in two forms: vertical and horizontal.

    ① Horizontal database sharding: split the data of the same table into different databases according to certain rules; each database can be placed on a different server.

    ② Vertical database sharding: group tables by business and distribute them to different databases; each database can be placed on a different server. The core idea is a dedicated database for a dedicated purpose.

  • Table sharding

    Table sharding likewise comes in vertical and horizontal forms.

    ① Horizontal table sharding: split the data of one table into multiple tables within the same database, according to certain rules.

    ② Vertical table sharding: split a table into multiple tables by field, each table storing a subset of the fields.

After weighing the transformation cost, the effect, and the impact on existing business, we decided to go straight to the last option: database and table sharding.

3.2 Sharding technology selection

The technology selection for database and table sharding mainly considered these options:

  1. Open-source client SDK solutions
  2. Open-source middleware proxy solutions
  3. The framework developed in-house by the company's middleware team
  4. Building our own wheel

After reviewing previous project experience and talking with the company's middleware team, we adopted the open-source Sharding-JDBC solution (now renamed ShardingSphere).

  • GitHub: https://github.com/sharding-sphere/
  • Documentation: the official docs are rough, but online materials, source-code analyses and demos are plentiful
  • Community: active
  • Features: shipped as a jar package, client-side sharding, supports XA transactions


3.2.1 Sharding strategy

Combined with the business characteristics, the user ID was chosen as the sharding key, and the library and table numbers for a user's order data are obtained by hashing the user ID and taking the modulus.
Suppose there are n libraries and each library has m tables.

The library and table numbers are calculated as follows:

- Library number: hash(userId) / m % n

- Table number: hash(userId) % m
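As a concrete illustration of this routing rule, here is a minimal Java sketch; the library/table counts and the hash function are placeholder assumptions, not the production configuration:

// Illustrative sharding router: library/table counts and the hash function
// are assumptions for the sketch, not the real production values.
public final class OrderShardingRouter {

    private static final int DB_COUNT = 8;             // n libraries (assumed)
    private static final int TABLE_COUNT_PER_DB = 16;  // m tables per library (assumed)

    // Library number = hash(userId) / m % n
    public static int dbIndex(long userId) {
        return hash(userId) / TABLE_COUNT_PER_DB % DB_COUNT;
    }

    // Table number = hash(userId) % m
    public static int tableIndex(long userId) {
        return hash(userId) % TABLE_COUNT_PER_DB;
    }

    // Non-negative hash of the user ID (placeholder implementation).
    private static int hash(long userId) {
        return Long.hashCode(userId) & Integer.MAX_VALUE;
    }
}

With this rule, all orders of the same user land in the same library and table, so any query that carries the user ID touches only one shard.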

The routing process is shown in the following figure:

[Figure: order routing process]

3.2.2 Limitations of sharding and countermeasures

Database and table sharding solves the data volume and concurrency problems, but it severely limits the database's query capability. Some simple join queries become impossible after sharding, and SQL that Sharding-JDBC does not support has to be rewritten separately.

In addition, we ran into these challenges:

(1) Globally unique ID design

After sharding, the database's auto-increment primary key is no longer globally unique and cannot be used as the order number. Moreover, many internal inter-system interfaces carry only an order number, with no user-identifying sharding key. How do we locate the right library and table from the order number alone?

The answer is to embed the library and table numbers in the order number when it is generated. That way, in scenarios without a user ID, the library and table numbers can be recovered from the order number itself.
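The article does not give the actual order-number format, but as a hedged illustration, suppose the last four digits of the order number carried the library number (two digits) and table number (two digits); the routing information could then be recovered without a user ID:

// Hypothetical order-number layout (not the real vivo format): the trailing
// four digits encode the library number and table number.
public final class OrderNoCodec {

    // e.g. sequence 20211021001, library 3, table 12 -> "202110210010312"
    public static String encode(long sequence, int dbIndex, int tableIndex) {
        return String.format("%d%02d%02d", sequence, dbIndex, tableIndex);
    }

    public static int dbIndexOf(String orderNo) {
        return Integer.parseInt(orderNo.substring(orderNo.length() - 4, orderNo.length() - 2));
    }

    public static int tableIndexOf(String orderNo) {
        return Integer.parseInt(orderNo.substring(orderNo.length() - 2));
    }
}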

(2) Historical order numbers contain no embedded library/table information

A separate table stores the mapping between historical order numbers and user IDs. Over time these orders are no longer exchanged between systems, so the mapping gradually falls out of use.

(3) The management back office needs to page through all orders matching various filter conditions

Order data is stored redundantly in the search engine Elasticsearch, which is used only for back-office queries.

3.3 How to synchronize data from MySQL to ES

As mentioned above, order data is stored redundantly in Elasticsearch to support back-office queries. So after order data changes in MySQL, how do we synchronize it to ES?

The considerations here are the timeliness and consistency of the synchronization, minimal intrusion into business code, and no impact on the performance of the service itself.

  • MQ scheme

    An ES update service acts as a consumer: it updates ES whenever it receives an order-change MQ message.

[Figure: MQ-based synchronization scheme]

  • Binlog scheme

    With the help of open-source projects such as Canal, the ES update service disguises itself as a MySQL slave node, receives the binlog, parses the real-time change information, and updates ES accordingly.

[Figure: binlog-based synchronization scheme]

The binlog approach is more general-purpose, but it is also more complex to implement; we finally chose the MQ approach.

Because the ES data is only used by the management back office, the requirements on data reliability and synchronization timeliness are not especially high.

To cover extreme cases such as downtime or message loss, a back-office function was added to manually trigger ES data synchronization under certain conditions, as a compensation measure.
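A minimal sketch of the MQ scheme, assuming a generic message listener and simple DAO/ES abstractions; the types and method names below are illustrative, not the project's actual code:

// Illustrative ES update service: on every order-change message it reloads the
// current order snapshot from MySQL and upserts it into Elasticsearch.
public class OrderEsSyncConsumer {

    // Assumed abstraction over the (sharded) MySQL order storage.
    public interface OrderDao {
        java.util.Map<String, Object> findByOrderNo(String orderNo);
    }

    // Assumed thin wrapper over the Elasticsearch client.
    public interface OrderEsClient {
        void upsert(String id, java.util.Map<String, Object> doc);
    }

    private final OrderDao orderDao;
    private final OrderEsClient esClient;

    public OrderEsSyncConsumer(OrderDao orderDao, OrderEsClient esClient) {
        this.orderDao = orderDao;
        this.esClient = esClient;
    }

    // Invoked by the MQ framework for every order-change message.
    public void onMessage(String orderNo) {
        // Re-query by order number instead of trusting the message body, so the
        // document written to ES reflects the current database state.
        java.util.Map<String, Object> latest = orderDao.findByOrderNo(orderNo);
        if (latest != null) {
            esClient.upsert(orderNo, latest);
        }
    }
}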

3.4 How to switch databases safely

Migrating data from the original single-instance database to the new database cluster was another major technical challenge.

We had to guarantee data correctness and also make sure that, if any step went wrong, we could quickly roll back to the previous step.

We considered both a downtime migration scheme and a non-downtime migration scheme:

(1) Non-downtime migration scheme:

  • Copy the data from the old library to the new library, bring a synchronization program online, and use binlog or a similar mechanism to keep the new library synchronized with the old library in real time.
  • Deploy the order service that can double-write to the old and new libraries, but initially keep reading from and writing to the old library only.
  • Enable double writing, stop the synchronization program at the same time, and start a comparison-and-compensation program to keep the new library consistent with the old library.
  • Gradually switch read requests to the new library.
  • Switch reads and writes to the new library, with the comparison-and-compensation program now keeping the old library consistent with the new library.
  • Take the old library offline, then retire the double-write logic, the synchronization program, and the comparison-and-compensation program.

[Figures: non-downtime migration steps]

(2) Downtime migration scheme:

  • Bring the new order system online, run the migration program to copy orders older than two months into the new library, and verify the data.
  • Shut down the mall v1 application so that the old library's data no longer changes.
  • Run the migration program again to copy the orders not covered in the first step, and verify.
  • Bring the mall v2 application online and start testing and verification; if it fails, roll back to the mall v1 application (the new order system has a switch to double-write to the old library).

[Figures: downtime migration steps]

Considering the high transformation cost of the non-downtime scheme and the small business impact of an overnight downtime window, we finally chose the downtime migration scheme.
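The downtime scheme above keeps a double-write switch to the old library as a rollback safety net. A hedged sketch of such a switch, with assumed types and flag handling for illustration only:

// Hypothetical double-write switch: the new (sharded) library is the source of
// truth, and the old single library is also written while the switch is on so
// that a rollback to mall v1 remains possible.
public class OrderWriteService {

    public interface OrderRepository {
        void save(java.util.Map<String, Object> order);
    }

    private final OrderRepository newLibraryRepo;    // sharded databases (assumed)
    private final OrderRepository oldLibraryRepo;    // legacy single database (assumed)
    private volatile boolean doubleWriteOldLibrary;  // toggled via configuration

    public OrderWriteService(OrderRepository newLibraryRepo, OrderRepository oldLibraryRepo) {
        this.newLibraryRepo = newLibraryRepo;
        this.oldLibraryRepo = oldLibraryRepo;
    }

    public void setDoubleWriteOldLibrary(boolean enabled) {
        this.doubleWriteOldLibrary = enabled;
    }

    public void saveOrder(java.util.Map<String, Object> order) {
        newLibraryRepo.save(order);
        if (doubleWriteOldLibrary) {
            oldLibraryRepo.save(order);  // keeps the old library usable for rollback
        }
    }
}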

3.5 Distributed transactions

Distributed transactions are a classic problem in the e-commerce transaction flow, for example:

  • After a successful payment, the shipping system must be notified to deliver the goods to the user.
  • After the user confirms receipt, the points system must be notified to issue the shopping-reward points.

How do we ensure data consistency under the microservice architecture?

Different business scenarios have different consistency requirements. Among mainstream industry solutions, two-phase commit (2PC) and three-phase commit (3PC) address strong consistency, while TCC, local message tables, transactional messages and best-effort notification address eventual consistency.

Those schemes are not described in detail here; instead, here is the local message table scheme we use: the asynchronous operation to be executed is recorded in a message table within the local transaction, and failed executions are compensated by a scheduled task.

The following figure takes notifying the points system to grant points after order completion as an example.

[Figures: granting points via the local message table after order completion]
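A minimal sketch of this local message table pattern, assuming Spring's JdbcTemplate and declarative transactions; the table and column names are illustrative only:

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

// Sketch: the order-state change and the pending "grant points" message are
// written in the same local transaction; a scheduled task later delivers the
// pending messages to the points system and retries on failure.
public class OrderCompleteService {

    private final JdbcTemplate jdbcTemplate;

    public OrderCompleteService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Transactional
    public void completeOrder(String orderNo, long userId, int points) {
        // 1. Business change in the local database.
        jdbcTemplate.update(
                "UPDATE t_order SET status = 'COMPLETED' WHERE order_no = ?", orderNo);
        // 2. Record the asynchronous operation in the local message table,
        //    inside the same transaction, so it cannot be lost.
        jdbcTemplate.update(
                "INSERT INTO t_local_message (biz_type, biz_no, payload, status) VALUES (?, ?, ?, 'PENDING')",
                "GRANT_POINTS", orderNo, "{\"userId\":" + userId + ",\"points\":" + points + "}");
    }

    // A scheduled compensation job would scan t_local_message for PENDING rows,
    // call the points system, and mark each row SUCCESS (or leave it for retry).
}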

3.6 System security and stability

  • Network isolation

    Only a few third-party-facing interfaces are exposed to the public network, and they verify request signatures. Internal systems interact via intranet domain names and RPC interfaces.

  • Concurrent lock

    Every order update operation is guarded by a database row-level lock to prevent concurrent updates.

  • Idempotency

    All interfaces are idempotent, so retries triggered by the caller's network timeouts have no unwanted side effects.

  • Circuit breaking

    Using Hystrix components, circuit-breaker protection is added to real-time calls to external systems, preventing a fault in one system from spreading across the whole distributed system (a minimal sketch follows this list).

  • Monitoring and alerting

    Through error-log alerts on the logging platform, call-chain service-analysis alerts, and the monitoring and alerting features of the company's middleware and basic components, we can detect system anomalies immediately.
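A minimal Hystrix sketch for the circuit-breaking point above, wrapping a hypothetical remote call; the command name, group key and fallback value are illustrative:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Illustrative Hystrix command around a call to an external system; when the
// circuit is open or the call fails or times out, getFallback() returns a
// degraded result instead of letting the failure cascade through the system.
public class QueryLogisticsCommand extends HystrixCommand<String> {

    private final String orderNo;

    public QueryLogisticsCommand(String orderNo) {
        super(HystrixCommandGroupKey.Factory.asKey("LogisticsService"));
        this.orderNo = orderNo;
    }

    @Override
    protected String run() {
        // The real RPC call to the logistics system would go here (assumed).
        throw new UnsupportedOperationException("placeholder for the real call for order " + orderNo);
    }

    @Override
    protected String getFallback() {
        return "LOGISTICS_UNAVAILABLE";  // degraded result on failure
    }
}

A caller would execute it with new QueryLogisticsCommand(orderNo).execute().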

3.7 Pitfalls we hit

When MQ consumption was used to synchronize order-related data from the database to ES, we found that the data written to ES was sometimes not the latest order data.

The original scheme is shown on the left of the figure below:

When consuming MQ messages to synchronize order data, thread A runs first and reads the order data. The order is then updated and thread B starts its synchronization, reads the newer order data, and writes it to ES one step ahead of thread A. When thread A then performs its write, it overwrites the data written by thread B, leaving ES with order data that is not up to date.

The solution is to take a row lock when querying the order data and run the whole operation inside a transaction, so that the next thread executes only after the previous one has finished.

[Figure: original sync flow (left) and fixed flow with row lock (right)]
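A hedged sketch of the fix, again assuming Spring's JdbcTemplate and declarative transactions: the order row is locked with SELECT ... FOR UPDATE inside the transaction, so two sync threads handling the same order are serialized and the later writer always holds the newer snapshot.

import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

// Sketch of the fixed sync flow: lock the order row first, then write to ES
// while still holding the lock, so concurrent consumers of the same order
// cannot interleave and overwrite newer data with an older snapshot.
public class OrderEsSyncService {

    public interface OrderEsWriter {
        void upsert(String orderNo, Map<String, Object> doc);  // assumed ES wrapper
    }

    private final JdbcTemplate jdbcTemplate;
    private final OrderEsWriter esWriter;

    public OrderEsSyncService(JdbcTemplate jdbcTemplate, OrderEsWriter esWriter) {
        this.jdbcTemplate = jdbcTemplate;
        this.esWriter = esWriter;
    }

    @Transactional
    public void syncOrder(String orderNo) {
        // Row-level lock: a second thread syncing the same order blocks here
        // until this transaction commits.
        Map<String, Object> order = jdbcTemplate.queryForMap(
                "SELECT * FROM t_order WHERE order_no = ? FOR UPDATE", orderNo);
        esWriter.upsert(orderNo, order);
    }
}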

The second pitfall: after passing through Sharding-JDBC, a grouped, sorted, paginated query fetches all data.

Example: select a from temp group by a, b order by a desc limit 1,10.

In Sharding-JDBC, when the group by and order by fields are inconsistent, the row count 10 is rewritten to Integer.MAX_VALUE, so the query effectively fetches all data and the pagination is defeated.


io.shardingsphere.core.routing.router.sharding.ParsingSQLRouter#processLimit

private void processLimit(final List<Object> parameters, final SelectStatement selectStatement, final boolean isSingleRouting) {
    boolean isNeedFetchAll = (!selectStatement.getGroupByItems().isEmpty() || !selectStatement.getAggregationSelectItems().isEmpty()) && !selectStatement.isSameGroupByAndOrderByItems();
    selectStatement.getLimit().processParameters(parameters, isNeedFetchAll, databaseType, isSingleRouting);
}

io.shardingsphere.core.parsing.parser.context.limit.Limit#processParameters

/**
* Fill parameters for rewrite limit.
*
* @param parameters parameters
* @param isFetchAll is fetch all data or not
* @param databaseType database type
* @param isSingleRouting is single routing or not
*/
public void processParameters(final List<Object> parameters, final boolean isFetchAll, final DatabaseType databaseType, final boolean isSingleRouting) {
    fill(parameters);
    rewrite(parameters, isFetchAll, databaseType, isSingleRouting);
}


private void rewrite(final List<Object> parameters, final boolean isFetchAll, final DatabaseType databaseType, final boolean isSingleRouting) {
    int rewriteOffset = 0;
    int rewriteRowCount;
    if (isFetchAll) {
        rewriteRowCount = Integer.MAX_VALUE;
    } else if (isNeedRewriteRowCount(databaseType) && !isSingleRouting) {
        rewriteRowCount = null == rowCount ? -1 : getOffsetValue() + rowCount.getValue();
    } else {
        rewriteRowCount = rowCount.getValue();
    }
    if (null != offset && offset.getIndex() > -1 && !isSingleRouting) {
        parameters.set(offset.getIndex(), rewriteOffset);
    }
    if (null != rowCount && rowCount.getIndex() > -1) {
        parameters.set(rowCount.getIndex(), rewriteRowCount);
    }
}

The correct way to write it is: select a from temp group by a desc, b limit 1,10. The Sharding-JDBC version used here is 3.1.1.

If the sort field of an ES paginated query can contain duplicate values, it is better to add a unique field as a secondary sort condition, to avoid missing or duplicated rows across pages. For example, if the order creation time is the only sort condition and many orders share the same timestamp, the query will miss or duplicate orders; add a unique value as the secondary sort condition, or sort directly by the unique value.
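As a hedged illustration using the Elasticsearch high-level REST client (field and index names are assumptions), the unique order number can be added as a secondary sort so that pagination stays stable even when creation times collide:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

// Illustrative stable-pagination query: createTime alone may contain ties, so
// the unique orderNo field is used as a tie-breaker to avoid missing or
// duplicated rows across pages.
public class OrderEsPageQuery {

    public SearchRequest buildPageRequest(int from, int size) {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.matchAllQuery())   // real filter conditions would go here
                .from(from)
                .size(size)
                .sort("createTime", SortOrder.DESC)     // primary sort, may have duplicates
                .sort("orderNo", SortOrder.DESC);       // unique tie-breaker
        return new SearchRequest("order_index").source(source);  // index name is illustrative
    }
}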

4. Results

  • It went live successfully on the first attempt and has run stably for more than a year
  • Core service performance improved by more than ten times
  • The system is decoupled and iteration efficiency is greatly improved
  • It can support the mall's rapid growth for at least the next five years

5. Conclusion

In this system design we did not blindly chase cutting-edge technologies and ideas, nor did we simply copy mainstream e-commerce solutions when facing problems; instead, we chose the most appropriate approach based on the actual state of the business.

Personally, I believe a good system is not fully designed by some guru at the outset; it has to iterate and evolve along with the business, continuously anticipating the direction of business growth and planning the architecture evolution in advance. In short: stay one step ahead of the business!

Author: vivo official website mall development team