With the growth of the company’s transaction volume, the TPS on the application side decreases, the interface time becomes longer, and there may even be request timeout during business peak hours. However, the CPU of the application server is less than 50%, the memory consumption is 40% (4c8g configuration), and the load average: 0.43, 0.45, 0.49; The load is very low.
The database is Oracle 11g, CPU: 128C, memory: 512g, storage: 14T. CPU is used up by 60% and IO is idle by 45% during peak business hours.
The average time consumption of the main interfaces is one point per minute. As can be seen from the figure below, when the traffic is small, the average time consumption is about 25ms. Once the business peak is reached, the maximum average time consumption suddenly rises to 120 milliseconds.
Database monitoring chart: CPU and IO utilization is very high during peak hours.
After the situation happened, a research team was established immediately. After all, it is only one week before the May Day holiday, and the trading volume will increase by at least 30% during the May Day holiday.
After the first stage analysis, the following transformation schemes are formulated:
1. Modify the number of connections initialized by the application Druid connection pool. From the Druid monitoring, the peak value of the maximum number of connections during peak hours is 27. It may be that the connection number is insufficient due to the sudden surge in transaction volume. Past experience tells us that it will take a long time to create a new database connection, which will lead to instant requests queuing for connections. So I decided to change the initial value from 20 to 30.
2. Because the accounting database (Oracle) adopts two master and one slave high availability deployment (a, B as the master node and C as the slave node), at present, the master node configured by all application systems is a and the slave node is B. It is decided to adjust some application master nodes to B to reduce the pressure of node a.
3. After the transformation of the accounting idempotent table, the idempotent table has been divided into four tables (400 million data in a single table). However, from the analysis of the time-consuming of the whole bookkeeping request, the query and insertion of the idempotent table and the accounting flow table are time-consuming. Therefore, we decided to expand the idempotent table to 64 (why 64? It has been scientifically calculated. The specific calculation process cannot be said because it is confidential).
4. MQ takeover is under great pressure because it is close to May Day. If the above optimization points fail, the worst plan is to adopt MQ asynchronous takeover mechanism. However, this scheme will have an impact on the business and will not be used as a last resort.
After the transformation plan was determined, the overtime mode was started. The brothers completed the transformation day and night and put it into operation. But after the launch, the effect is far from expected.
Scheme 1: after putting into operation, it is found that the bookkeeping time in peak hours is still serious. Adjusting the Druid connection pool not only has no effect, but increases the number of database connections, wastes resources and goes back.
Scheme 2: after the primary and secondary nodes change the order, there is serious contention for database resources, frequent GC, but the utilization rate of CPU increases and goes back.
Scheme 3: after the idempotent table is divided into 64, the reading and writing speed of the new table is very fast, with an average of 0.7 MS, and that of the old idempotent table is 4.7 Ms. however, because the old idempotent table has a large amount of data, it does not re hash the data, so the code layer must be compatible with the query of the old idempotent table, which does not improve the overall efficiency.
Summary of the first stage: except for scheme 3, which is effective in the long run, the other two schemes have not solved the problem. After reflection, we feel that we are in the wrong direction and still do not apply the right medicine to the case. So everyone embarked on the road of seeking medical treatment again.
After the second stage analysis, the following two schemes are formulated:
1. At this time, the DBA found a new problem. During the peak business period, there was an index splitting problem between the accounting flow table and the accounting flow table. It was initially suspected that the serial number adopted the incremental mode (generated through sequence and is an index column). PS: the index splitting was mainly caused by inserting a smaller serial number in front of a larger value, and the current index block space was insufficient, You need to split the original index block into two to ensure that smaller values can be inserted in front. This brings a problem. Part of the data of the old block needs to be put into the newly opened index block. Frequent index splitting is certainly not a good thing. Therefore, it is decided to change the generation method of serial number to snow algorithm.
2. At present, the flow table adopts a partition of 5 days according to the date, and retains the data for 2 months. After discussion, it was decided to change to one partition per day and only keep the data for nearly 10 days. Why not split the table? Because the flow table involves data synchronization, many business lines use this table. The split table transformation cannot be completed in a short time, and the partition is transparent to the application system.
PS: for index splitting, please refer to this post: http://blog.itpub.net/2673616… , the situation we encountered belongs to 9-1 splitting. No matter what type of index splitting is, it is certainly not a good thing for the system.
Like the first stage, the overtime mode is started again. However, after the final production, the effect was the same as that in the first stage, and it was busy in vain. The database is still under great pressure during peak hours, and the average time of bookkeeping during peak hours is still more than 100ms. At this moment, the state of mind has burst. After all, we have worked overtime for a week and are already tired physically and mentally. But the problem still needs to be solved, otherwise there will be no holiday during the May Day Golden Week…
The third stage begins, the so-called life two, two born three, three born all things, all things change, and recycle and return to one after 9981. In the first stage of optimization, it was mentioned that from the time-consuming analysis of the entire bookkeeping request, the query and insertion operations of idempotent table and accounting flow table are time-consuming. At this time, the idempotent table has been completed in separate tables. What if the insertion flow operation is changed to asynchronous insertion? Yes, first of all, pipeline asynchronous insertion will not affect the business. Secondly, changing to asynchronous can reduce the whole transaction cycle, reduce the transaction time, and release resources faster. Finally, even if there is no effect, a peak clipping can be achieved through asynchronous writing. Through MQ backlog, reduce the database write speed and protect the database.
PS: because the flow table is written in 100 million level flow every day, the amount of data is still relatively large.
After modification, we went online soon. After all, time is precious. Sure enough, nine things come together and have an effect.
In the figure above, gray is the time-consuming curve after we change pipelining to asynchronous insertion, which is very obvious compared with the previous effect. Even in peak hours, it takes only 30 milliseconds.
Take another look at the database monitoring chart. Compared with before, the pressure has decreased significantly. At this time, I am finally relieved.
The fourth stage of Optimization: however, as a technician, he should not be satisfied at this time. Looking back, it is indeed scientific to think about index splitting. It should not be changed and has no effect. After re combing, it is found that the snow algorithm used by the development partners is a mutated algorithm. In other words, it can only ensure global hash (distributed), but the local is still continuous (single machine). If we directly use UUID, we have carried out another round of transformation with the idea of giving it a try. Sure enough, this idea was verified.
After putting into operation, the pressure on the database has dropped a lot. The following three figures show the effects of application optimization and index splitting optimization before optimization.