Handling high-frequency access from a large number of simultaneous users is a hard problem for any platform, and one the industry studies with great interest. Fortunately, although business scenarios differ, the underlying design and optimization ideas stay the same. This article analyzes system architecture optimization schemes in depth, combining business realities with the core techniques of high-concurrency system design.
This article is compiled from the keynote "The Road to Evolving a 100-Million-Traffic System Architecture", delivered by Roger Lin, a senior cloud engineer at Authing, at the Beijing stop of the Upyun Open Talk technology salon.
I believe everyone agrees that the rapid growth of the Internet has changed many of our habits. Online shopping, bank transfers, and similar services no longer need to be handled offline, which makes life far more convenient. Behind this, of course, we as Internet practitioners face ever greater challenges and must devote ever more effort to upgrading system architectures.
Understanding high-concurrency systems
High-concurrency systems are characterized by high concurrency, high performance, high availability, distribution, clustering, security, and so on.
Let's first look at high concurrency, high performance, and high availability, the "three highs" we often mention. When traffic is very large, we must guarantee all three. High concurrency means supporting many users at the same time; high performance means keeping performance excellent under that concurrency; high availability means that when one node fails, the system as a whole keeps serving rather than going down. The main means of achieving the three highs are distribution and clustering, and the main additional problem we must solve is security.
The figure above shows some common high-concurrency scenarios from daily life. At the top left is the e-commerce flash sale, the most common scenario. During the epidemic last year masks were in short supply, and many people clicked the same page at the same moment, producing especially high concurrency. At the top right is ticket grabbing, which everyone is also familiar with: friends working away from home who travel back for Spring Festival have surely kept a ticket-grabbing app running to snatch tickets, and that concurrent traffic is enormous. At the lower left is the bank trading system; all our online and offline QR-code payments actually pass through bank systems, giving them a huge daily transaction volume. Finally there is Authing's identity service. We build a complete identity authentication and user management system so that developers can avoid rebuilding identity from scratch, write less code, and work more efficiently. The following figure is an example:
The figure shows our core component. On the surface it is a simple login box, that is, the user authentication interface, but behind it stands a large backend composed of a user system, a management system, an authentication system, and other services. Although the user only enters a username and password, we must consider not only secure authentication and multiple login methods, but also how to handle many users authenticating at the same time. In addition, we must consider how to achieve high availability, rapid deployment, and rapid integration for all kinds of customers, including those with private deployments.
If you have worked with high concurrency, you must be familiar with the CAP theorem. Its main point is that a distributed system cannot satisfy consistency (C), availability (A), and partition tolerance (P) all at once, but only two of the three. Since network partitions must be tolerated in practice, a distributed system chooses between CP and AP: if we keep availability and partition tolerance, we may have to sacrifice strong consistency and settle for eventual consistency. CAP tells us a choice has to be made.
Starting from the monolithic architecture
The monolithic architecture illustrated in the figure above is a common early-stage pattern. With limited manpower, the web front end and the server are usually developed and deployed together and connected to a database to provide the service. The advantage is that it is easy to maintain; the drawback is that iteration is troublesome.
Now that front end and back end are separated, we usually deploy the web and the server as two services, which enables rapid iteration. If one service needs a fix, we can modify, deploy, and ship that service's code on its own. The disadvantage is that as the business grows, the server accumulates more and more content, becoming deeply coupled and slow. I know this pain well. Many years ago a friend of mine had architecture problems, and for a while he would buy a bag of melon seeds every weekend and come to my place to think them over together. Why the melon seeds? Because the coupling was so deep that the service took five minutes to start, and another five to restart after every change, so we chatted and nibbled while we waited.
As mentioned above, complex dependencies and bloat are problems of monolithic applications, which also suffer from the following:
- Single-point bottlenecks
- Poor stability
- Poor scalability
- Missing business models
- Poor extensibility for new business
- Lack of basic business process capabilities
- Tight coupling between front end and back end
- Messy, hard-to-maintain APIs
Since the pain points are so obvious, how to optimize matters greatly. But before we discuss that, we need to consider a new question: do more CPUs always mean better performance?
In most cases yes, because more CPU cores speed up computation. But it is not absolute: if our program uses many locks, multithreaded code cannot exploit the extra cores, and adding CPUs may have no noticeable effect. In that situation many companies consider splitting the service, which raises cost questions; adding CPUs is not the optimal solution, and we still need to optimize the locking. Before looking at concrete optimizations, though, let us first understand pooling.
The figure above shows the abstract concept of pooling: connections and threads are returned to a resource pool after use instead of being destroyed. Four pools matter here: the connection pool, the thread pool, the constant pool, and the memory pool.
Connection pools are used most, because calls between systems and requests to external services go over connections. We used to use short-lived connections, but since every HTTP connection must repeat the expensive setup and teardown process, we now use connection pools: a connection created for one request can be reused, which saves considerable overhead. Likewise, our work is eventually broken into tasks, and the asynchronous ones are handed to a thread pool. Constant pools and memory pools follow the same idea: we allocate a large block of memory up front and reuse it.
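To make the pooling idea concrete, here is a minimal generic object pool sketch in Python. The `factory` and pool size are illustrative; in practice the pooled objects would be database or HTTP connections.

```python
import queue


class ObjectPool:
    """A minimal object pool: expensive-to-create resources (e.g. connections)
    are created once up front, then borrowed and returned instead of being
    re-created for every request."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        # Blocks until a pooled object is free, bounding total resource usage.
        return self._pool.get(timeout=timeout)

    def release(self, obj):
        # Return the object for reuse instead of destroying it.
        self._pool.put(obj)


# Example: a pool of two fake "connections" shared by callers.
pool = ObjectPool(factory=lambda: object(), size=2)
conn = pool.acquire()
pool.release(conn)
```

The same borrow/return discipline underlies real connection pools and, with worker threads pulling tasks from a queue, thread pools as well.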
After understanding pooling technology, let’s go back to specific optimization.
Application architecture optimization
Web server optimization
First, let's look at web server optimization, which is achieved mainly through code optimization, hotspot caching, algorithm optimization, and so on.
The first step is code optimization: fixing unreasonable code. For example, a query interface that fetches far more data than needed will be slow and should be optimized first.
The second step is hotspot caching: cache all hot data to reduce database operations as much as possible. For example, Authing cannot hit the database every time a token is presented, or QPS would collapse; caching the hot data raises QPS.
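A minimal sketch of hotspot caching: a TTL-cache decorator so that repeated lookups of a hot key skip the expensive backend call. The function names and TTL are illustrative; a production system would also bound the cache size and typically use a shared store.

```python
import time


def ttl_cache(ttl_seconds):
    """Cache each key's result for ttl_seconds, so hot keys (e.g. token
    lookups) skip the database on repeat calls within the window."""
    def wrap(fn):
        store = {}  # key -> (value, timestamp)

        def inner(key):
            hit = store.get(key)
            if hit is not None and time.monotonic() - hit[1] < ttl_seconds:
                return hit[0]          # cache hit: no backend call
            value = fn(key)
            store[key] = (value, time.monotonic())
            return value
        return inner
    return wrap


db_calls = []                          # records each simulated database hit


@ttl_cache(ttl_seconds=60)
def load_user(token):
    db_calls.append(token)             # stands in for a database query
    return {"token": token}


load_user("abc")
load_user("abc")                       # second call is served from cache
```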
The third step is algorithm optimization. Because business logic is usually complex, this concept is broad. For example, when querying a list, should we fetch everything at once, or compute in memory and return only the results to the front end? Different business scenarios call for different optimizations to improve performance.
After optimizing the monolith, if these services still share one server, they compete for CPU and memory. We can then move the web layer and the caching layer onto separate servers, and put all static resources on a CDN so pages load from nearby nodes. With these methods, Authing meets its requirement of responding within 50 milliseconds. Separate deployment suits many systems: whatever your business scenario, if you need faster responses, consider it.
Then we need to split the business. There are three ways to do it:
- Split by business scenario, for example into users, orders, and accounts.
- Split by whether the work is synchronous or asynchronous; the advantage is that asynchronous traffic can be controlled so it never affects the core services.
- Split by model. Since business splitting mainly tackles tight coupling between systems, and we want to minimize cross-system churn later, the early models must be built as well as possible.
After splitting the system, we need to judge how much load the optimized system can carry and how much it has improved, which means load testing it. Load testing involves the familiar bucket theory: picture the system as a wooden bucket whose capacity is set by its shortest plank. So during a load test we need not watch the components with spare capacity, but the ones that have hit the system's bottleneck, and use them to find the system's latent problems.
After splitting services vertically, growing request volumes may still exceed capacity. We can then split the system horizontally and scale out: if one node is not enough, add two or more, with a load-balancing server distributing requests evenly across them. We usually choose Nginx as the load balancer.
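A minimal Nginx sketch of this setup, assuming two horizontally scaled application nodes (the addresses and ports are placeholders):

```nginx
# Round-robin load balancing across two horizontally scaled app nodes.
upstream app_nodes {
    server 10.0.0.11:8080;   # placeholder addresses
    server 10.0.0.12:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_nodes;
    }
}
```

Adding a third node is then just one more `server` line in the `upstream` block.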
The figure above shows our load-balancing layer, with many gateway systems beneath it and an Nginx cluster in the middle. Nginx can withstand a very large amount of concurrency, so this cluster is unnecessary while traffic is small; by the time it is needed, concurrency is already huge. When concurrency grows beyond even the Nginx cluster, it is best not to put yet another layer of Nginx in front of it, since the effect is marginal. I also do not personally recommend F5: it is hardware, and its cost is high. I suggest LVS, the virtual server facility under Linux; well configured, its performance is fully comparable to F5.
Having finished load balancing, let’s go back to horizontal splitting.
We cannot ignore caching when splitting horizontally. On a single machine, all caches are local. Once we go distributed, if one server issues a token and stores it only locally, another server that does not have it will fail the request. So we introduce a distributed cache, for example putting the cache in Redis so that all application nodes fetch it from Redis.
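A small sketch of why the shared cache fixes this. The `SharedCache` class here is an in-memory stand-in for Redis (whose client would expose analogous `get`/`set` calls); the `AppServer` class and token format are purely illustrative.

```python
class SharedCache:
    """Stand-in for a distributed cache such as Redis: one store that every
    application node talks to, instead of per-process local dicts."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class AppServer:
    def __init__(self, cache):
        self.cache = cache             # all nodes share the same cache

    def login(self, user):
        token = f"token-{user}"        # illustrative token format
        self.cache.set(token, user)
        return token

    def authenticate(self, token):
        return self.cache.get(token) is not None


cache = SharedCache()                  # in production: a Redis client
node_a, node_b = AppServer(cache), AppServer(cache)
token = node_a.login("alice")
# node_b can validate a token issued by node_a because the cache is shared;
# with per-node local dicts this lookup would miss.
assert node_b.authenticate(token)
```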
Horizontal splitting also forces us to look at distributed IDs, because the ID generation method that worked in the monolith may not suit distributed services. Take timestamps: in a single instance, generating an ID per request yields unique values, but in a distributed setup multiple servers receiving requests at the same moment can generate duplicates. So we create a dedicated ID service to generate IDs.
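One common approach such an ID service can take is a Snowflake-style generator: each ID combines a timestamp, a per-node machine ID, and a per-millisecond sequence, so different nodes never collide. The bit widths below are illustrative, not any particular implementation's exact layout.

```python
import threading
import time


class SnowflakeLike:
    """Sketch of a Snowflake-style distributed ID generator. IDs are
    (timestamp << 22) | (machine_id << 12) | sequence, giving uniqueness
    across nodes (machine_id) and within a millisecond (sequence)."""

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024      # 10 bits for the node ID
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence
                if self.seq == 0:                   # sequence exhausted:
                    while now <= self.last_ms:      # spin to next millisecond
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq


gen = SnowflakeLike(machine_id=1)
ids = [gen.next_id() for _ in range(1000)]   # unique, increasing IDs
```

Because the machine ID is baked into every value, two servers configured with different machine IDs can generate IDs independently without coordinating on each request.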
After splitting services horizontally and vertically, pushing configuration to every service uniformly and synchronously becomes a problem. Ideally, every service should notice a configuration change and apply it by itself; therefore we introduce the configuration center.
The figure above shows the general flow of a configuration center. Two schemes are popular at present: Nacos, open-sourced by Alibaba, and Spring Cloud Config from the Spring Cloud project. Interested readers can look into them.
Next, look at the figure above. The server side is the console where configuration is stored; developers usually modify configuration through the console's API, and the changes are persisted in MySQL or another database. The client side comprises all our applications, each running a monitor that watches the server for configuration changes and pulls the new configuration when one occurs, so every application updates promptly. To guard against an app failing to fetch updates because of network problems, we also keep a local snapshot; when the network fails, the app degrades to reading the local file.
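The client-side pattern can be sketched as follows. The remote fetch is simulated with a plain callable (a real client would poll Nacos or Spring Cloud Config over HTTP); the snapshot path and config keys are placeholders.

```python
import json
import os
import tempfile


class ConfigClient:
    """Sketch of the config-center client pattern: pull the latest config
    from the server, persist a local snapshot, and fall back to the snapshot
    when the server is unreachable (degraded mode)."""

    def __init__(self, fetch_remote, snapshot_path):
        self.fetch_remote = fetch_remote      # e.g. an HTTP GET to the server
        self.snapshot_path = snapshot_path
        self.config = {}

    def refresh(self):
        try:
            self.config = self.fetch_remote()
            with open(self.snapshot_path, "w") as f:
                json.dump(self.config, f)     # persist last known-good config
        except OSError:
            # Network problem: degrade to the local snapshot if one exists.
            if os.path.exists(self.snapshot_path):
                with open(self.snapshot_path) as f:
                    self.config = json.load(f)
        return self.config


snapshot = os.path.join(tempfile.gettempdir(), "config.snapshot.json")
client = ConfigClient(fetch_remote=lambda: {"feature_x": True},
                      snapshot_path=snapshot)
client.refresh()                              # fetches and snapshots config


def server_down():
    raise OSError("config server unreachable")


client.fetch_remote = server_down             # simulate a network failure
```

After the simulated outage, `client.refresh()` still returns the snapshotted configuration rather than failing, which is exactly the degradation described above.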
We have split the system, set up load balancing, and built the configuration center. Under modest request volumes, the system optimization is essentially complete. As the business keeps expanding, the bottleneck is no longer the system but the database. How do we solve that?
The first method is master-slave replication with read-write separation. It solves the problem of all reads and writes hitting one database: the master handles writes while the replicas serve reads, spreading the load and improving database performance. As volume keeps growing and replication alone no longer suffices, we turn to the second method.
The second method is vertical splitting, a concept similar to business splitting: we split the database by service into users, orders, apps, and so on, so that each service has its own database and requests no longer converge on one, improving concurrency. As volume grows further, even a single per-service database hits its limit, and we need the third method.
The third method is horizontal splitting: for example, further splitting the tables in the users database into users1, users2, users3, and so on. This split requires us to think about how queries find the right table, which depends on the business. For user queries, say, we can shard by user ID, hashing IDs into a fixed range; every subsequent lookup hashes the ID to compute its shard and jumps straight to it. Authing's multi-tenant design uses this splitting concept, as shown in the figure below.
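The hash-based routing can be sketched in a few lines. The shard count, table naming, and choice of MD5 are illustrative; any stable hash over a fixed shard count gives the same property that equal IDs always route to the same table.

```python
import hashlib

NUM_SHARDS = 4   # e.g. tables users1..users4; the count is illustrative


def shard_for(user_id):
    """Map a user ID to a shard index with a stable hash, so every lookup
    for the same user lands on the same table."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def table_for(user_id):
    return f"users{shard_for(user_id) + 1}"


# The same ID always routes to the same table:
assert table_for(42) == table_for(42)
```

Note that changing `NUM_SHARDS` later remaps most keys, which is why real systems often layer consistent hashing or a lookup table on top of this basic scheme.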
Service rate limiting
Once business volume reaches a certain level, we inevitably need service rate limiting, which is a degradation strategy in disguise. Our ideal is a system that can absorb ever more users, but resources are always finite, so limits are necessary.
There are two main rate-limiting algorithms: the leaky bucket and the token bucket; the picture above illustrates them vividly. In the leaky bucket algorithm, imagine traffic as water poured into a cup with a limited outlet: however fast water flows in, it flows out at a constant rate. The token bucket instead runs a task that issues tokens, and every request must obtain a token before entering; if requests arrive too fast and tokens run out, the configured limiting strategy kicks in. Besides these two, there is also the familiar counter algorithm, which interested readers can study on their own; we will not cover it here.
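A minimal token-bucket sketch: tokens refill at a fixed rate up to a burst capacity, and a request proceeds only if it can take a token. The rate and capacity values are illustrative.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`, then
    throttles to a steady `rate` of requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity            # start full
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller should reject or queue


bucket = TokenBucket(rate=10, capacity=5)
results = [bucket.allow() for _ in range(8)]   # a burst of 8 requests
```

With a capacity of 5, the first five requests of the burst pass and the rest are rejected until tokens refill; the leaky bucket differs in that it smooths output to a constant rate instead of permitting bursts.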
These algorithms all reject the excess portion of requests when traffic overflows. Besides this rejection strategy, there is also a queuing strategy.
When business traffic can be neither limited nor rejected, we need message queues.
As shown in the figure, the core idea of a message queue is that producers put messages into the queue and consumers take them out and process them. We typically use MQ middleware, Redis, or Kafka as the message queue. The queue handles publish/subscribe and client push/pull, and it solves the following problems:
- Buffering: absorb excess traffic at the entrance
- Peak shaving: similar to buffering, smoothing traffic spikes
- System decoupling: two services with no direct dependency can be decoupled through the queue
- Asynchronous communication
- Extension: many listeners can subscribe on top of the queue
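The producer/consumer pattern behind all of this can be sketched in-process with Python's standard `queue` module; in production the queue would be Kafka, Redis, or another MQ, but the decoupling works the same way. The sentinel value and message format are illustrative.

```python
import queue
import threading

tasks = queue.Queue()      # stands in for Kafka / Redis / MQ middleware
processed = []


def producer():
    for i in range(5):
        tasks.put(f"msg-{i}")     # producer only enqueues and moves on
    tasks.put(None)               # sentinel: no more messages


def consumer():
    while True:
        msg = tasks.get()
        if msg is None:
            break
        processed.append(msg)     # consumer drains at its own pace


t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

The producer never waits for the consumer, which is exactly the buffering and decoupling effect listed above: a traffic spike piles up in the queue instead of overloading the downstream service.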
Even while the business is serving normally, we may encounter the following situation:
Services A and B call services C and D respectively, and all of them ultimately call service E. Once service E dies, accumulating requests drag down every upstream service. This phenomenon is commonly called a service avalanche.
To avoid it, we introduce service circuit breaking, which acts like an electrical fuse: when service E's failure count reaches a threshold, subsequent requests stop going to E and immediately return a failure, preventing piled-up requests from hammering E further.
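A minimal circuit-breaker sketch, assuming a simple consecutive-failure threshold. A production breaker (e.g. the pattern implemented by Hystrix or resilience4j) would also half-open after a timeout to probe recovery; that part is omitted here.

```python
class CircuitBreaker:
    """After `threshold` consecutive failures the circuit opens and calls
    fail fast, instead of piling more requests onto the sick service."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
            self.failures = 0          # success resets the counter
            return result
        except Exception:
            self.failures += 1
            raise


breaker = CircuitBreaker(threshold=3)


def flaky():
    raise IOError("service E is down")     # simulated downstream failure


outcomes = []
for _ in range(5):
    try:
        breaker.call(flaky)
    except RuntimeError:
        outcomes.append("fast-fail")   # breaker rejected without calling E
    except IOError:
        outcomes.append("real-call")   # request actually reached service E
```

The first three failures reach service E; from the fourth call on, the breaker trips and upstream callers get an immediate error instead of a queued, doomed request.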
Broadly, this is a form of service degradation. Common degradations include the following:
- Page degradation: disable interactive widgets or fall back to a static page
- Delayed service: e.g. defer processing to a scheduled task, or park messages in MQ for later handling
- Write degradation: directly reject the affected write requests
- Read degradation: directly reject the affected read requests
- Cache degradation: serve frequently read interfaces from cache
- Service shutdown: turn off unimportant features to free resources for core services
The figure above shows what to watch in a concrete load test. First, understand that load testing is a closed loop: we may repeat the cycle of finding problems, fixing them, verifying the fix, and finding new problems many times, until we finally reach the load-testing goal.
Before the test begins, we set load-testing objectives and prepare the environment accordingly. The environment can be online or offline. Offline testing, for cost reasons, usually runs on a single machine or a small cluster, which can make results inaccurate; so we usually test online or in the production data center, where the data is more accurate. During the test we find new problems, fix them, and verify the results until the goal is reached.
During load testing, the following metrics deserve attention. First is QPS, queries per second. It differs from TPS in that TPS carries the notion of a transaction, counted only when the whole transaction completes, whereas QPS counts each query as soon as it returns a result. Second is RT, response time, which demands our focus: the more concurrent the system, the more RT matters. We also watch how much concurrency and throughput the system sustains. Success rate means whether, as pressure rises, the business still executes as planned and returns the expected results. GC, garbage collection, is another big issue: with poorly written code, GC grows more frequent as pressure increases and can eventually pause the whole system.
Then comes the hardware: we watch CPU, memory, network, and I/O usage, since a stall in any of them can become the system bottleneck. Last is the database, which we will not discuss in detail here.
How do we spot problems during a load test? Through logs, which make the system observable and help us find the root cause of a problem.
How do logs help? Mainly through instrumentation points: for example, record the entry time and response time of a request at each system and layer, and the difference between the two reveals where time is spent. Clearly, only precise instrumentation lets us pinpoint problems accurately.
The figure above shows a typical log processing pipeline: logs produced by each service are collected by Filebeat, shipped through Kafka to Logstash, and finally into Elasticsearch, with Kibana as the visual interface for analyzing them.
The figure above shows Authing's logging and monitoring system. In the middle is the Kubernetes cluster, on the left the service message queue, and on the right our monitoring system. For monitoring we use Grafana to alert on business metrics, for example configuring an alert when the success rate drops below a threshold. The logging side uses Logstash to feed log files into Elasticsearch, viewed through Kibana.
Finally, every highly available system must not forget one core concept: geo-distributed multi-active deployment. For example, deploy multiple data centers in multiple regions, with multiple backups and disaster recovery. The figure above summarizes all the application architecture optimizations discussed; I hope it serves as a useful reference. Thank you.