Youku Quality Assurance Series (1): Server-Side Stability Assurance in Practice


Editor's note: Quality assurance runs through the whole R&D process. As the builder and guardian of quality, the testing team must guarantee not only functional quality after testing, but also the quality and efficiency of the whole R&D process. This article shares how Youku improved R&D efficiency and quality by building out its quality assurance system.
More articles in this series will follow, so stay tuned.

What does server-side quality assurance do?

Before answering this question, we should first look at the factors that affect server-side quality. In the current server-side R&D process, a requirement passes through several stages on its way to launch, each with its own main activities.
[Figure: the stages of a requirement launch and the main activities of each stage]
It can be seen that quality-related activities run through the whole R&D process. As the builder and guardian of quality, the testing team must guarantee not only functional quality after testing, but also the quality and efficiency of the whole R&D process.

The main factors affecting quality at each stage are:

Requirements validation: the validity and business value of the requirements
Design review: the soundness of the design and the quality risks introduced by the change
Code development: code logic and coding standards
Offline validation: efficiency and quality of regression testing; efficiency and quality of new-feature testing
Safety production: effectiveness of the traffic observed; adequacy of quality verification
Online publishing: online stability guarantee mechanisms and the ability to troubleshoot anomalies

Based on Youku's business characteristics and current R&D status, the following areas were identified as the current focus:

Code development: establish static scanning and unit testing so that submitted code is verified continuously
Offline verification: ensure the quality of code submitted for testing and of offline acceptance
Safety production: ensure the effectiveness of safety production verification
Online publishing: ensure the stability of online services

In one sentence: the server-side quality assurance system builds automated test support capabilities that fit the business characteristics and integrates them into the key quality stages of the R&D process (admittance test, smoke testing, test submission, regression testing, safety production verification, online publishing) to ensure the continuous integration, deployment, and release of application changes.

How to build the quality assurance system?

Driving the process

Support capabilities only deliver value once they are embedded in the R&D process. Youku therefore built its offline deployment and online publishing processes by customizing the application publishing process and its components. Because the publishing process is unified, upgrading it upgrades the assurance system along with it.
Admittance test

Guarantee content: check the unit test results and static code scan results to guarantee the basic quality of code entering the test phase, and prevent invalid deployments (code debugging, developer self-tests) from triggering test task execution.
Exit conditions: no blocker-level issues in the static scan; unit tests passed.
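The admittance gate can be pictured as a simple predicate over its two inputs. The following is a minimal Python sketch under stated assumptions, not Youku's implementation: the names `ScanIssue`, `AdmittanceInput`, and `admittance_passed`, the "blocker" severity label, and the `is_debug_deploy` flag are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScanIssue:
    rule: str
    severity: str  # assumed labels: "blocker", "major", "minor"


@dataclass
class AdmittanceInput:
    scan_issues: List[ScanIssue] = field(default_factory=list)
    unit_tests_run: int = 0
    unit_tests_failed: int = 0
    is_debug_deploy: bool = False  # hypothetical flag for debugging / self-test deployments


def admittance_passed(inp: AdmittanceInput) -> bool:
    """Return True only if the deployment may trigger downstream test tasks."""
    if inp.is_debug_deploy:
        # Debugging or self-test deployments must not trigger test tasks.
        return False
    has_blocker = any(issue.severity == "blocker" for issue in inp.scan_issues)
    # Assumption: a change with no unit tests at all does not count as "passed".
    tests_ok = inp.unit_tests_run > 0 and inp.unit_tests_failed == 0
    return not has_blocker and tests_ok
```

Treating "zero unit tests run" as a failure is a design choice made for this sketch; a real gate might instead enforce a coverage threshold.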

Smoke testing

Guarantee content: build an automated test laboratory that checks the application's basic functions, preventing low-level problems from flowing into the testing and integration testing stages and protecting the quality and efficiency of subsequent testing activities.
Exit conditions: smoke test completed; failed cases analyzed.

Test submission

Guarantee content: a custom "test submission" component connects the release platform with the Youku R&D efficiency platform and enables one-click test submission within the release process. It automatically collects the information related to the submission and generates a test submission sheet, which ensures that the information is valid and blocks low-quality submissions from entering the testing stage.
Exit conditions: smoke test results meet the business's test submission baseline; the test submission sheet contains the code changes, function description, affected interfaces, and other test information.

Integration testing

Guarantee content: build regression test tasks that fit the business characteristics and run a full regression on every change to ensure that the changed code does not break existing functionality.
Exit conditions: regression test completed; failed cases analyzed; no issues that block online publishing.

Safety production verification

Guarantee content: integrate the safety production verification component so that application releases can be verified in the safety production environment, ensuring traffic effectiveness and quality stability during the safety production observation period.
Exit conditions: safety production verification completed; failed verification points analyzed; no issues that block online publishing.

Grayscale verification

Guarantee content: set up a pressure test group in the small-scale grayscale environment, run automated pressure tests on the group's machines by diverting traffic to them, and compare the results with the historical pressure test baseline. This provides a performance evaluation capability and ensures the performance stability of the changed code.
Exit conditions: the pressure test evaluation passes when compared against the performance baseline.
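The comparison against a historical baseline can be sketched as below. This is an illustrative Python example only: the metrics chosen (`qps`, `avg_rt_ms`, `error_rate`), the 10% tolerance, and the error-rate limit are assumptions, since the article does not state which metrics or thresholds Youku uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PerfResult:
    qps: float         # throughput reached during the pressure test
    avg_rt_ms: float   # average response time in milliseconds
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def evaluate_against_baseline(current: PerfResult, baseline: PerfResult,
                              tolerance: float = 0.10,
                              max_error_rate: float = 0.001) -> List[str]:
    """Return the detected regressions; an empty list means the evaluation passed."""
    problems = []
    if current.qps < baseline.qps * (1 - tolerance):
        problems.append("throughput dropped more than tolerance")
    if current.avg_rt_ms > baseline.avg_rt_ms * (1 + tolerance):
        problems.append("average RT rose more than tolerance")
    if current.error_rate > max_error_rate:
        problems.append("error rate above limit")
    return problems
```

Returning the list of problems rather than a bare boolean keeps the failure analysis step simple: the gate passes when the list is empty, and otherwise the entries say what regressed.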

Online deployment

Guarantee content: build online inspection tasks that periodically check the core scenarios of the core interfaces and promptly detect online problems caused by configuration changes, code changes, and dependency changes.
Exit conditions: none.

Capability and platform building

Once the application release process is unified, the quality assurance system must also help business teams quickly build the support capabilities that the process requires. Following the principle of "not reinventing the wheel", Youku integrated Alibaba's existing capabilities and services for server-side testing into a set of quality assurance capabilities suited to Youku's business characteristics. These are exposed to the various businesses through a platform, which helps business teams build a quality assurance system quickly.
R&D process

Custom publishing-process components connect the publishing process with the Youku R&D efficiency platform and enable one-click test submission from within the publishing process. They mainly provide a test submission landing page, code change analysis for the submission, and test submission gate checks.

Basic capabilities
Built on the capabilities and platform provided by JVM sandbox, this layer automates deployment of the collection module, maintenance of module status, throttling of request collection, and resolution of enhanced classes. It provides the basic capabilities an application needs for one-click access to the assurance system:

Data collection in all environments (request inputs and return results, the application's internal method links)
Request playback in all environments, for all interface protocols, in multiple modes (real-time playback, mock playback, generalized playback)
Mocking in offline environments (simulated return results, simulated exceptions, simulated timeouts)
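The three mock behaviours listed above (simulated result, simulated exception, simulated timeout) can be illustrated with a small wrapper. This Python sketch is conceptual only: Youku's capability is built on JVM sandbox and works by enhancing Java classes, whereas `MockRule` and `with_mock` here are hypothetical names that simply wrap a callable.

```python
from typing import Any, Callable, Optional


class MockRule:
    """Describes how a mocked dependency should behave (hypothetical structure)."""

    def __init__(self, result: Any = None,
                 error: Optional[Exception] = None,
                 timeout: bool = False):
        self.result = result
        self.error = error
        self.timeout = timeout


def with_mock(real_call: Callable[..., Any],
              rule: Optional[MockRule]) -> Callable[..., Any]:
    """Wrap a dependency call; with no rule, the real call goes through unchanged."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if rule is None:
            return real_call(*args, **kwargs)
        if rule.timeout:
            # Simulated timeout: raise immediately instead of actually waiting.
            raise TimeoutError("simulated timeout")
        if rule.error is not None:
            raise rule.error
        return rule.result
    return wrapper
```

For example, `with_mock(fetch_title, MockRule(result="stub"))` returns a callable that answers `"stub"` without ever invoking `fetch_title`, where `fetch_title` stands for any downstream call.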

Automated test tasks
The support capabilities that different businesses need may come from different platforms within the group, which makes onboarding and analysis relatively costly for business teams. The automated testing page therefore provides access configuration, scheduled execution, result collection, and failure analysis for the various task types, closing the loop from task scheduling to analysis and handling. The task types and configurations currently supported are:

Supported task types: automated laboratory, intelligent playback, safety production verification, single-machine performance evaluation
Supported task configurations: continuous integration configuration, result notification configuration

Automated testing framework
Youku's self-developed interface testing framework provides remote interface calls, wrapped assertions, custom test reports, cross-environment middleware access, the Pandora class isolation mechanism, and more. Business teams can develop test scripts at low cost, and the framework effectively supports the development and maintenance of smoke test, regression test, and online inspection scripts for all of Youku's business scenarios. The main modules of the framework are:

Remote interface call module: provides cross-environment calls to interfaces that use the group's common protocols
Assertion module: provides a what-you-see-is-what-you-get assertion mechanism, so no assertion methods need to be written by hand
Report module: automatically records the interface calls, assertions, and other information in a test script to generate a complete test report
Class isolation module: runs test scripts inside a container to resolve package conflicts in the test project

Intelligent playback
Alibaba's existing playback test platforms all work by comparing sub-calls during offline mock playback. Most of Youku's server-side interfaces, however, are read interfaces, which are better suited to real-time playback: it avoids inconsistent return results caused by configuration changes and reduces the cost of onboarding and using the service.

Therefore, based on the link collection and playback capabilities provided by JVM sandbox, Youku implemented real-time playback driven by online hot link recommendation. Compared with the group's existing playback platforms, it has the following features:
More effective request recommendation: by aggregating the application's internal method links collected online, the hot links (the most frequently executed call chains) can be identified and playback requests recommended from them. Compared with recommending requests based on sub-call links, this covers the application's internal code paths and real business scenarios more effectively.
More stable comparison playback (read interfaces only): playback is real-time and unmocked. The same request is sent to the target environment and the playback environment at the same time, and the interface return results are compared, which avoids comparison failures caused by sub-call changes and configuration data changes.
Lower access and usage costs: deployment is non-intrusive thanks to JVM sandbox, and for read-only interfaces there is no sub-call comparison logic to configure. Once the interface is specified, playback can start, so it works out of the box.
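The two ideas above, recommending requests from hot links and comparing live responses from two environments, can be sketched in a few lines of Python. This is an illustration under assumptions rather than Youku's implementation: a collected sample is assumed to be a `(request, call_chain)` pair, responses are assumed to be dictionaries, and `recommend_requests`, `replay_and_compare`, and the `ignore_keys` parameter are hypothetical names. The sketch also calls the two environments one after the other, whereas the article describes requesting them at the same time.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence, Tuple

Request = Dict[str, Any]
Response = Dict[str, Any]
# One collected sample: the request plus the internal method chain it executed.
Sample = Tuple[Request, Tuple[str, ...]]


def recommend_requests(samples: List[Sample], top_n: int = 2) -> List[Request]:
    """Pick one representative request for each of the top_n most frequent chains."""
    counts = Counter(chain for _, chain in samples)
    representative: Dict[Tuple[str, ...], Request] = {}
    for request, chain in samples:
        representative.setdefault(chain, request)  # keep the first request seen per chain
    return [representative[chain] for chain, _ in counts.most_common(top_n)]


def replay_and_compare(requests: List[Request],
                       call_target: Callable[[Request], Response],
                       call_playback: Callable[[Request], Response],
                       ignore_keys: Sequence[str] = ("timestamp",)
                       ) -> List[Tuple[Request, Response, Response]]:
    """Send each request to both environments and collect the responses that differ."""
    def strip(resp: Response) -> Response:
        # Drop fields that legitimately differ between two live calls.
        return {k: v for k, v in resp.items() if k not in ignore_keys}

    diffs = []
    for req in requests:
        expected, actual = call_target(req), call_playback(req)
        if strip(expected) != strip(actual):
            diffs.append((req, expected, actual))
    return diffs
```

Because both sides are real calls, no sub-call mocks have to be recorded or maintained, which is why this approach only suits read interfaces: replaying a write request against a live environment would change its data.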

Safety production
Youku has built a verification mechanism and capabilities for the safety production environment that effectively guarantee traffic effectiveness and quality stability during the safety production observation period:

Business quality rule verification: quality rules are configured for interface return results, and every result returned during the safety production observation period is checked against them to ensure the business quality of the interface return results
Intelligent policy alarms: business alarms raised during the safety production observation period are monitored and analyzed, so that problems normally caught by routine business metric monitoring and middleware monitoring are found early, at the safety production stage
Interface RT comparison: the average RT (response time) of core interfaces in safety production is compared with their average online RT during the observation period to find interface performance problems early
Interface automatic verification: automated scripts verify interface quality in the safety production environment and promptly detect major problems such as interface call failures and empty responses
Intelligent playback: the interface return results of safety production and online are compared through intelligent playback to ensure the functional stability of read interfaces
Business scenario coverage: business scenario rules are defined for core interfaces to guarantee that traffic during the observation period covers the core business scenarios
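Two of the checks above, business quality rule verification and interface RT comparison, lend themselves to a short illustration. The Python sketch below rests on assumptions: a rule is modelled as a named predicate over a response dictionary, and the 20% RT tolerance is an arbitrary value, since the article does not describe the rule format or thresholds actually used.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

# A quality rule is a predicate over one interface response (assumed shape).
Rule = Callable[[Dict[str, Any]], bool]


def check_quality_rules(responses: List[Dict[str, Any]],
                        rules: Dict[str, Rule]) -> Dict[str, int]:
    """Count, per rule, how many observed responses violate it."""
    return {name: sum(1 for resp in responses if not rule(resp))
            for name, rule in rules.items()}


def rt_regressed(safety_prod_rt_ms: List[float], online_rt_ms: List[float],
                 tolerance: float = 0.2) -> bool:
    """True if the safety production average RT exceeds the online average by more than tolerance."""
    return mean(safety_prod_rt_ms) > mean(online_rt_ms) * (1 + tolerance)
```

A rule set might look like `{"has_items": lambda r: len(r.get("items", [])) > 0}`; any rule with a non-zero violation count would then be surfaced for failure verification point analysis.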

How to measure the quality of the assurance system?

The main purpose of building a server-side assurance system is to improve server-side release quality and R&D efficiency. Youku defines reducing the number of server-side failures and improving the efficiency of change publishing as the main problems to be solved. Based on these two problems, the goals are defined as follows:
Business quality

Number of failures caused by publishing: the number of online problems caused by application publishing. Used to evaluate the value of intercepting failures that releases would otherwise cause
Number of sudden online failures: the number of failures that occur suddenly online. Used to evaluate the stability of online services

R&D efficiency

Unattended change rate: evaluates the value of the continuous integration system in improving regression testing efficiency
Change verification duration: evaluates the value of the continuous integration system in improving change release efficiency
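The article does not give formulas for these indicators, so the following Python sketch shows one plausible way to compute them from per-change records. Everything here is an assumption: the `ChangeRecord` fields, the definition of the unattended rate as the share of changes verified without manual intervention, and the definition of verification duration as the average time from submission to the end of verification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChangeRecord:
    submitted_at: datetime     # when the change entered verification
    verified_at: datetime      # when verification finished
    manual_intervention: bool  # True if a person had to step in
    caused_failure: bool       # True if the release led to an online problem


def unattended_rate(changes: List[ChangeRecord]) -> float:
    """Share of changes verified with no manual intervention (assumed definition)."""
    if not changes:
        return 0.0
    return sum(not c.manual_intervention for c in changes) / len(changes)


def avg_verification_duration(changes: List[ChangeRecord]) -> timedelta:
    """Average time from submission to the end of verification (assumed definition)."""
    if not changes:
        return timedelta()
    total = sum((c.verified_at - c.submitted_at for c in changes), timedelta())
    return total / len(changes)


def failures_caused_by_publishing(changes: List[ChangeRecord]) -> int:
    """Number of changes whose release caused an online problem."""
    return sum(c.caused_failure for c in changes)
```

Aggregating the same records by application, team, or time window would give the per-dimension views that business teams and test owners need.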

Once the indicators of the assurance system are determined, the required basic data must be collected and aggregated along different dimensions to form accurate and reliable metrics. This is the measurement capability of the assurance system: it lets business teams, project owners, and test owners find problems through the metric data and gives direction for subsequent optimization.
The Youku server-side assurance platform connects to the various external systems and provides a general failure analysis capability, forming a closed measurement loop of data collection, failure analysis, and metric aggregation.
After six months of capability building and continuous promotion, hundreds of applications have been onboarded to the server-side test platform. The onboarding rate for applications in core scenarios is 100%, and user activity is high. The assurance effect is clear: quality problems are found effectively, and problems that could have caused a rollback or even an online failure have been intercepted several times, which has greatly improved both server-side release quality and R&D efficiency.

