This article was shared by Fu Hongcheng, producer of a special session at the technology open day held by Tencent SNG and msup, and was edited by 100 cases. The original text was published by 100 cases.
As producer of the platform architecture session at this technology open day, I would like to present “Building a highly reliable service for massive numbers of users: the core backend technology behind SNG’s businesses with hundreds of millions of daily active users”. From the perspective of availability, it discusses how to improve the reliability of massive services and how to troubleshoot them, covering:
An overview of the SNG backend architecture;
Design principles for massive services: the general approaches Tencent uses in backend design for massive services, including how to improve high availability and how to keep the service design sound at the architecture layer, product layer, and operations and maintenance layer;
Solutions to backend service failures.
SNG is Tencent’s Social Network Group, which includes QQ, Qzone, QQ Music, Tencent Cloud, karaoke, Penguin MV, national movie king, Tiantian P map and many other businesses. The SNG backend architecture can be viewed in terms of business logic, the data layer, and operations and maintenance, together with an overall availability star rating:
QQ is a Tencent business that has been running for 16 years, and its business logic is very complex. From a communication perspective, it is mainly about storing and forwarding messages. The QQ team is committed to building a service that never goes down.
Qzone has long been a leader among China’s social products. It is the business that Ross, Tencent’s first PhD, has long been in charge of. Its front end requires high-performance access servers. Its mobile access framework, WNS, is gradually being opened up through Tencent Cloud, so third-party developers can also use Qzone’s high-performance basic framework.
QQ Music mainly involves on-demand streaming media, an MV system, and the construction of a massive licensed-content knowledge base. We have announced that it has passed the 100 million mark, and it is the most popular leading Internet music platform in China.
Tencent Cloud packages Tencent’s best capabilities for building massive services. It provides operations platforms for all kinds of business-facing (2B) customers, which can save start-ups, especially small and medium-sized ones, a great deal of R&D cost. There are probably many entrepreneurs here; you are welcome to use Tencent Cloud products.
The karaoke product was started two years ago, when Ross moved from Qzone to head the digital music department and formed the team to build it. In less than two years since launch, its registered users have far exceeded those of competing products. In terms of business logic, it mainly involves uploading, storing and distributing large volumes of UGC streaming media, feeds management and interaction, knowledge base management, and so on. If you have not installed it yet, you can install it when you get back; many high-quality users are singing on it.
How to define availability
Start with a well-known case: when 12306 first went online it often went down and displayed a notice that “troubleshooting is in progress”, especially during peak periods such as the Spring Festival and holidays, which seriously affected users.
As an Internet architect, how would you improve the availability of 12306? How should the availability of a large-scale Internet service be defined? How should a team set a reasonable availability target? And how can availability be improved?
1. How to define the availability of large-scale Internet services?
Availability is the robustness and reliability of a system or service. Every service, including QQ and Qzone, experiences failures, and for each one we assess whether it is a level 1, level 2 or level 3 failure.
In a business scenario, what matters most is the value delivered to users, so a more reasonable definition of availability is:
P = (total user value – user value lost due to accident) / total user value * 100%
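As a worked example of the formula above, here is a minimal Python sketch. The function name and the sample numbers are illustrative assumptions, not figures from the talk.

```python
def availability(total_user_value: float, lost_user_value: float) -> float:
    """Return P = (total - lost) / total * 100, as a percentage."""
    if total_user_value <= 0:
        raise ValueError("total_user_value must be positive")
    return (total_user_value - lost_user_value) / total_user_value * 100.0


# Hypothetical numbers: 1,000,000 units of user value in a period,
# of which 100 units were lost because of an incident.
print(f"{availability(1_000_000, 100):.2f}%")
```

How "user value" is measured (requests served, active users, revenue) is left open here; the formula only needs the same unit for both terms.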
2. How to define the availability target?
Whether the product is QQ, QQ Music or Qzone, if service is interrupted for 1 minute, users start checking what is wrong; after more than 5 minutes, they start to have doubts; at 30 minutes, complaints start to appear on forums and elsewhere; and after several hours, users may abandon the software for a competitor.
For Tencent, the availability target for backend services is four nines. Converted into time, four nines means total failure time within one year must not exceed about 52 minutes. Why four nines rather than five? Isn’t higher always better?
The direct reason is cost. When a service has just launched, we guarantee two nines. We then keep optimizing and adjusting the architecture: disaster recovery deployment and dual data centers, for example, take us from two nines to three nines, and a large further investment of R&D effort and people takes us from three nines to four. Reaching five nines would require investment to grow by several orders of magnitude. Beyond sound architecture design, five nines also depends on the dedicated-line bandwidth of the telecom operators and on the equipment in the data centers, whose failures all count toward the system failure rate, and guarding against them takes a great deal of effort. So for now we stay at four nines.
Dedicated lines do sometimes fail in our company. And when a business has a very high failure rate and many complaints, there must be problems in its backend architecture, including the overall system architecture and the front-end and back-end architecture.
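The conversion from "nines" to a yearly downtime budget can be checked with a short calculation. This is a minimal sketch that assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: about {downtime_budget_minutes(n):.1f} minutes per year")
```

Four nines gives roughly 52.6 minutes a year, which matches the 52-minute figure quoted above, while five nines leaves only about 5 minutes.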
3. Before talking about how to do it, a word on estimation
When we review an architect, we attach great importance to the system architecture. Perhaps half of a 45-minute review is spent discussing whether the architecture is sound, why it was designed that way, and what its evolution and background are.
System architecture is very important for an architect. As an architect, you should be able to sketch and explain your system architecture; how well you understand it also reflects how well you control the whole business.
An architect’s basic skill is really the ability to estimate. I used to ask colleagues attending a career-track promotion review (the “channel defense”) about this. One would say, “My system handles 10,000 concurrent requests per second.” Asked how he got that number, he would say it came from a stress test. Finding a system’s maximum concurrency through stress testing is something we expect of an engineer, but it is not the skill of an architect.
An architect should be able to predict a service’s maximum concurrency before the system goes online, rather than learning it from tests. Google’s chief architect has also said that a good system architect knows a system’s capacity not by testing but by estimating. This is a basic architect skill: it may not prevent every mistake, but it prevents the big ones.
Service design for mass users: architecture layer
Take some past data as an example. Suppose there are 1 billion visits in a day. A day has 86,400 seconds; rounding that up to 100,000 seconds for estimation, the average works out to about 10,000 visits per second. Peak concurrency is 2 to 3 times the average, so the peak is about 20,000 to 30,000 requests per second.
Architects should have this awareness: how do you translate business operating metrics into each technical detail? This is what I meant earlier: an architect should have a lot of operating data at their fingertips.
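The back-of-the-envelope estimate above can be written down as a small Python sketch. The function name and the default parameters are illustrative; the rounded 100,000-second day and the 2x to 3x peak factor are the values used in the paragraph.

```python
def estimate_qps(daily_requests: int,
                 seconds_per_day: int = 100_000,  # 86,400 rounded up for mental math
                 peak_low: float = 2.0,
                 peak_high: float = 3.0):
    """Return (average, low peak, high peak) requests per second."""
    average = daily_requests / seconds_per_day
    return average, average * peak_low, average * peak_high


avg, low, high = estimate_qps(1_000_000_000)
print(f"average ~{avg:,.0f}/s, peak ~{low:,.0f} to {high:,.0f}/s")
```

Using the exact 86,400 seconds instead would raise the average to about 11,600 per second, which does not change the order of magnitude, and that is the point of the estimate.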
1. Code level optimization
To be specific, you first need a good enough understanding of the data at each level of granularity to estimate business data from the system architecture and technology. Then it comes back to the system itself, which inevitably comes down to the code.
Haifeng, who leads our basic data services, used new technical components to optimize code and improve service availability not long after he joined. Before going live, he could predict how much online traffic would change because of an optimization and how many machines could be saved as business volume grew. He optimizes cost, bandwidth and technology while keeping the user experience unchanged or improving it.
Code is the foundation that supports the whole massive service. When we recruit, we look at whether you are an outstanding programmer or an ordinary one, and that shows in the code. If you are an experienced architect, or one who wants to keep improving, you will constantly ask yourself whether the code you write has any bottleneck or room for optimization.
Tencent also has programmers who stop once the assigned task is complete and do not think about how to do it better, while others keep spending time on optimization after the task is done and even find possible memory leaks or vulnerabilities. The former merely complete what they are assigned; the latter keep thinking and often exceed expectations.
To grow as an architect, you need to keep asking how much memory is consumed and where the bottlenecks and optimization room are in the products you serve. For example, how can TCP and tcp2.0 be made faster? These are the points an architect should keep thinking about and reflecting on, and they are the basis for improving the whole massive service.
2. System design level optimization
Architects should not only push the project team to optimize code but also pay attention to whether the system design is sound. Take a user scenario: after opening QQ Music, users collect songs into their own song lists. From the outside the system looks fairly simple, but it contains very complex technical logic.
This technical logic ensures that, while staying highly available, each module can be deployed independently and can be upgraded and scaled out smoothly. The key technical point is storage. When an architect is asked to build a personal collection feature, he should break it down into quantifiable, understandable technical indicators: storage, large numbers of small files, and how to store data for 280 million users. These problems must be listed explicitly. He should also think about extensibility, since in future users may collect not only songs but also albums, song lists, special topics, and so on. Then there is availability: thinking in depth about how the backend should be deployed to ensure the reliability and consistency of data. All of this is the thinking required when decomposing the task.
Going deeper, the design should apply priority-based control. Our users come from the PC client, web, Android, iOS, and so on. When a user collects a song on the web, they may refresh or leave the page, so the server needs to write the data in real time. The PC or mobile client, by contrast, can write the record locally first, so server-side processing need not be real-time. This helps a great deal at our online peak of 50 million users, and we can tolerate a certain amount of storage redundancy.
This is a fast-path and slow-path scheme based on priority. Requests from clients can take the slow path: as soon as the request reaches the server, the server tells the front end that the collection succeeded, even though the write may not have completed yet. The server writes the request to a message queue, and another process handles it later. Requests from the web are processed in real time.
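The fast-path and slow-path split described above can be sketched as follows. This is a minimal single-process illustration, not Tencent's implementation: the class name `CollectService`, the in-memory dictionary standing in for storage, and `queue.Queue` standing in for the message queue are all assumptions.

```python
import queue


class CollectService:
    """Toy 'collect a song' handler with a fast path and a slow path."""

    def __init__(self):
        self.store = {}               # stands in for persistent storage
        self.pending = queue.Queue()  # stands in for a message queue

    def _persist(self, user_id, song_id):
        self.store.setdefault(user_id, set()).add(song_id)

    def handle(self, user_id, song_id, source):
        if source == "web":
            # Web users may refresh or leave the page: write in real time.
            self._persist(user_id, song_id)
        else:
            # Clients keep a local record, so acknowledge now, persist later.
            self.pending.put((user_id, song_id))
        return "ok"

    def drain(self):
        """What the separate worker process would do: persist queued writes."""
        done = 0
        while True:
            try:
                user_id, song_id = self.pending.get_nowait()
            except queue.Empty:
                return done
            self._persist(user_id, song_id)
            done += 1
```

In a real deployment the worker would run in its own process and the queue would be durable; the sketch only shows which requests are acknowledged before the write happens.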
When a service has just launched, there may be a large promotion driving traffic to it. Mapped to the code level, shaving 1 millisecond off each request is a very considerable saving under high concurrency: if each request takes 1 millisecond less, handling 1,000 requests takes 1 second less. So in Internet backend development every code detail matters, and so, of course, does storage consistency.
3. Flexible service thinking
At the system level, you are not just building systems, you are delivering services, and we have a lot of service-oriented thinking. For example, how do you make sure a fault does not immediately affect the front end of the system? We call this flexible service thinking.
Failures come in many types, such as hardware, network and code failures, and there will be many of them. Without this kind of flexible service, more than 80% of users may be affected, which is unacceptable.
Tencent’s automated test standard for every backend CGI is 1 second: if the time from user request to response exceeds 1 second, the CGI fails the standard. To get every CGI to meet it as far as possible, we use many strategies. For CGIs such as the many kinds of upload CGI, there is a lot to think through, including whether the backend could suffer an avalanche and how to identify the critical path and non-critical paths.
Many Tencent applications introduce an algorithm similar to a stock EMI, namely request expectation: when the stock price is very low, your expectation is high, and when the price is very high, your expectation is low. For example, if CGI calls to a server return within 10 milliseconds, the timeout can be set very long. If response times have reached 100 ms, 1 s or even longer, the timeout should be set as short as possible. This is one anti-avalanche measure, and it effectively protects the back end.
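One way to picture the "request expectation" idea is a timeout that shrinks as observed latency grows. The sketch below is an assumption-laden illustration rather than the algorithm Tencent uses: it smooths latency with an exponential moving average (my choice), and the class name, the smoothing factor, and the 2,000 ms and 200 ms timeout bounds are hypothetical. Only the 10 ms and 100 ms thresholds come from the text.

```python
class AdaptiveTimeout:
    """Long timeout while the backend is fast, short timeout once it is slow."""

    def __init__(self, alpha=0.2, fast_ms=10.0, slow_ms=100.0,
                 long_timeout_ms=2000.0, short_timeout_ms=200.0):
        self.alpha = alpha                    # smoothing factor (assumed)
        self.fast_ms = fast_ms                # "returns within 10 ms"
        self.slow_ms = slow_ms                # "has reached 100 ms"
        self.long_timeout_ms = long_timeout_ms
        self.short_timeout_ms = short_timeout_ms
        self.avg_ms = None                    # smoothed observed latency

    def observe(self, latency_ms):
        """Fold one measured response time into the moving average."""
        if self.avg_ms is None:
            self.avg_ms = latency_ms
        else:
            self.avg_ms = self.alpha * latency_ms + (1 - self.alpha) * self.avg_ms

    def timeout_ms(self):
        """Timeout to use for the next call."""
        if self.avg_ms is None or self.avg_ms <= self.fast_ms:
            return self.long_timeout_ms
        if self.avg_ms >= self.slow_ms:
            return self.short_timeout_ms
        # Between the thresholds, interpolate linearly from long to short.
        frac = (self.avg_ms - self.fast_ms) / (self.slow_ms - self.fast_ms)
        return self.long_timeout_ms - frac * (self.long_timeout_ms - self.short_timeout_ms)
```

The effect is that a backend which is already struggling stops holding caller resources for long, so slow responses are cut off quickly instead of piling up into an avalanche.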
Some businesses cannot accept lossy service, such as bank transactions or payments and purchases, but others can. For example, when QQ loads home page information at login or shows the membership level, and the membership server is down, the request can return quickly without that data. This is a scenario where lossy service is acceptable.
How to identify the critical path and non-critical paths? Tencent takes user value as the core of everything. When identifying the critical path, the usual test is whether a given point is what matters most to users. For example, what is the most important user information when logging in to a client? In many business scenarios it is possible to identify which paths are critical and which are not.
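The distinction between critical and non-critical paths can be illustrated with the login example. This is a minimal sketch with hypothetical names: `fetch_profile` stands for a critical dependency and `fetch_member_level` for a non-critical one, both passed in as plain callables.

```python
def load_home_page(user_id, fetch_profile, fetch_member_level):
    """Build home page data, degrading only on the non-critical path."""
    # Critical path: without the profile there is nothing to show,
    # so any error here is allowed to propagate to the caller.
    profile = fetch_profile(user_id)

    # Non-critical path: if the membership server is down, return
    # quickly without the level instead of failing the whole page.
    try:
        member_level = fetch_member_level(user_id)
    except Exception:
        member_level = None

    return {"profile": profile, "member_level": member_level}
```

A payment flow would not be written this way: every step there is on the critical path, which is why the text treats it as a business that cannot accept lossy service.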
Qzone also has many such cases. For example, when a data center fails, you can configure 80% or 50% of traffic to be served, and you can also set a policy that gives priority to Yellow Diamond customers. When the system fails we need an emergency strategy, and we also have disaster recovery strategies divided into levels, applied according to the severity of the network failure.
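A traffic-limiting policy of this kind might look like the sketch below. It is an illustration under stated assumptions, not the production policy: the function name is hypothetical, Yellow Diamond users are always admitted (one possible reading of "priority"), and other users are admitted by a stable hash bucket so that the same user gets the same answer on every request.

```python
import zlib


def admit(user_id: str, is_yellow_diamond: bool, capacity_percent: int) -> bool:
    """Decide whether to serve a request while capacity is reduced."""
    if is_yellow_diamond:
        return True  # assumed policy: paying members are served first
    # Map each user to a stable bucket in [0, 100) and serve the low buckets.
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < capacity_percent


# During a data center failure, operators might lower the dial to 80 or 50.
print(admit("user-42", False, 80), admit("user-42", True, 0))
```

Because the bucket is derived from the user ID, lowering the dial from 80 to 50 only removes users; nobody who was rejected at 80 is suddenly admitted at 50.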
4. Load balancing
Many of Tencent’s services are built on an LVS-based network architecture. Tencent also has GSLB, our global load balancing system. Every domain name goes through GSLB and LVS. For example, if you are a Shanghai Telecom user, your request first passes through Tencent’s load scheduler, which finds a suitable IDC and returns it to you.
This approach comes from Tencent’s experience as an Internet company running massive services. We have done a lot of testing, and the results show that it produces the most effective, optimal scheduling, including for all kinds of IDC failures, which are anticipated in advance. It is also better to deploy modules independently.
Service design for mass users: operation and maintenance layer
A system without monitoring is not a good system. In SNG, and across Tencent, every R&D team attaches great importance to this. You should have a firm grasp of every link and module of your system, and to do that, monitoring across all of your modules is essential.
Tencent has many network management platforms. Each device’s current CPU status, network egress, and uplink and downlink traffic are configured and tracked in great detail, which also helps you achieve consistency. There are security concerns too, such as purchasing game currency or opening and cancelling memberships, and in everyday messaging it may be necessary to prevent intrusion by others or message leaks. I think system monitoring is a very basic part of backend development.
Service design for mass users: product layer
On product strategy: I am currently responsible for our department’s basic data center and for search, music library products, and so on. From the perspective of building massive services, good product design is indispensable.
Some of our earlier failure messages were very annoying, such as “network busy”, “system crashed” and “IE is infected”. Messages like these are unfriendly and should be dropped in product design. Even when the system is failing, do not turn a blind eye: give users reassuring, friendly messages. That is the greatest comfort you can give users when your technology fails.
Background fault solution
When you are in charge of a backend system, the system runs online all the time, so there will always be faults of various kinds. How do you find these problems?
Monitoring is very important, both system monitoring and automated monitoring. Automated monitoring means that, in addition to real users, there are many test machines that periodically launch tests, simulate user requests, and check whether the return value indicates success or failure, in order to judge whether the service is healthy. We must also pay attention to feedback from every colleague around us and from every user.
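The automated monitoring described above amounts to a synthetic probe. The sketch below shows the idea with hypothetical names; `send_request` is any callable that simulates a user request and returns a status code, and treating code 0 as success and tolerating one failure in three attempts are assumptions.

```python
from typing import Callable


def probe_once(send_request: Callable[[], int], success_code: int = 0) -> bool:
    """Simulate one user request and judge it by its return value."""
    try:
        return send_request() == success_code
    except Exception:
        # Timeouts and connection errors count as failed probes.
        return False


def service_is_healthy(send_request: Callable[[], int],
                       attempts: int = 3, max_failures: int = 1) -> bool:
    """Run several probes so one transient failure does not raise an alarm."""
    failures = sum(1 for _ in range(attempts) if not probe_once(send_request))
    return failures <= max_failures
```

A scheduler on each test machine would call `service_is_healthy` at a fixed interval and raise an alert when it returns `False`.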
Work from the surface to the core of the problem. For example, if some users cannot use the service while others can, check whether the complaints are concentrated in one region, which must be ruled out, and whether it is a client version problem, a carrier line problem, or a component problem. Analyzing the problem and finding the key issue is very important.
For problems that occur often in the backend, experienced people can tell what is wrong at a glance. But when the problem is unclear and locating it may be tedious, list every link in the chain and eliminate causes one by one. The cause that cannot be eliminated is the root cause.
Here is a summary of some experience. Once a bug has been reported, its cause can always be found and fixed, and that is also a programmer’s greatest value. When locating a problem, first consider whether your own system is at fault, and only when you cannot find a problem there look at other systems.
Building massive services is a big, deep topic. Across SNG, many businesses constantly deal with massive volumes of requests, system problems and even failures. This is great training for system architecture skills and yields a lot of experience and insight.
Q: What considerations led Tencent to adopt C++?
A: That goes back a long way. At the beginning it was Tony (editor’s note: Tony is Zhang Zhidong, Tencent co-founder and former CTO) and the other founders who set the direction of using C++ in the backend. C++ is a very efficient programming language, and we are quite willing to invest in a language, even spend more time on it, to achieve high performance.
Q: For load balancing, does QQ’s optimization lean toward hardware or software? In which scenarios is hardware the better choice?
A: Tencent combines software and hardware in many places. We use software-level load balancing widely. Suppose a request needs to reach a backend with 100 machines. The usual approach is to pick among the 100 at random. With scheduling, the component collects the return codes for each IP and uses them to decide which IP is the best one to hand out the next time the interface is requested. This is L5, the software-level load balancing component commonly used in SNG.
In the network platform department (Wangping), LVS and TGW provide load balancing. Tencent uses many of these in combination, and we cannot say that one is simply better. We keep thinking about how to maximize service availability, which may mean combining hardware and software.
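The return-code-based scheduling mentioned in the answer can be illustrated with a small sketch. It is not the L5 algorithm, whose details are not given here: the class name, the smoothed success rate, the weight floor, and treating return code 0 as success are all assumptions made for illustration.

```python
import random


class ReturnCodeBalancer:
    """Prefer backend IPs whose recent return codes indicate success."""

    def __init__(self, ips, alpha=0.3, min_weight=0.01):
        self.alpha = alpha            # how quickly new results outweigh old ones
        self.min_weight = min_weight  # keeps failing IPs probed occasionally
        self.success_rate = {ip: 1.0 for ip in ips}

    def report(self, ip, return_code):
        """Record the return code of a finished call to `ip` (0 = success)."""
        ok = 1.0 if return_code == 0 else 0.0
        self.success_rate[ip] = self.alpha * ok + (1 - self.alpha) * self.success_rate[ip]

    def pick(self, rng=random):
        """Choose the next IP, weighted by smoothed success rate."""
        ips = list(self.success_rate)
        weights = [max(self.success_rate[ip], self.min_weight) for ip in ips]
        return rng.choices(ips, weights=weights, k=1)[0]


balancer = ReturnCodeBalancer(["10.0.0.1", "10.0.0.2"])  # example addresses
balancer.report("10.0.0.1", -1)  # a failed call lowers that IP's weight
print(balancer.pick())
```

Compared with purely random selection, a failing machine quickly receives only a trickle of traffic, and the small weight floor lets it recover once its return codes are healthy again.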
Q: Is Tencent’s payment business also held to four nines? Are there higher requirements?
A: There are higher requirements. In terms of organization, Tencent has a dedicated payment line responsible for payment, and it is not my area. In many cases the issue is not the system itself but the external environment: problems in a bank’s gateway system, for example, also affect the service quality of the payment business.
The target for Tencent’s payment business may still be four nines, because it involves interfaces with banks. But when the service fails, we must still make sure the transaction eventually succeeds and ultimately protect every user’s interests. WeChat TenPay and QQ payment use many strategies for this, including guaranteeing the transactional integrity of transaction records, so that a user never pays without receiving the goods. According to our data, if 10,000 users hit a failure in an ordinary feature, perhaps no more than 10 will report it. For payment, which touches users’ core interests, at least one user will report it when two users have problems.
As for payment’s business targets, a goal of 100% would also be very costly to achieve. What we can do is this: if a user fails to purchase Q coins, for example, the final step is handling by customer service. Tencent makes sure every user’s problem is eventually resolved, and not only payment problems.
Massive services are a big subject, and for lack of time many aspects could not be covered in detail today; we can continue the discussion in the QQ group. Once again, welcome, and thank you for coming to the SNG technology open day. I hope you gain even more from the upcoming sessions in the backend track. (Want to see more backend architecture cases from this open day? Follow the official account; they will be published in the near future.)
You are also welcome to join Tencent SNG’s Qzone and music application teams. QQ Music is still recruiting excellent backend developers (Linux C/C++), mobile developers (iOS/Android) and web developers, and content editing, product planning and operations roles are also open. If you are an excellent Internet practitioner, come to Tencent SNG and take part in building services for 100 million users. Your resume can be sent to: [email protected]tencent.com