Practice of traffic recording and playback technology

Time:2021-12-8

Article guide

This article describes how traffic recording and playback technology is applied to load testing. By reading it, you will learn how an open source recording tool is integrated with internal systems, how to extend it through secondary development to support Dubbo traffic recording, how to resolve jar version conflicts through the Java class loading mechanism, and how traffic recording is applied in automated testing scenarios and what value it brings. The article has about 14,000 words and 17 pictures. It summarizes my work over the past year and touches on many technical points. I learned a lot from it, and I hope you gain something as well. My ability is limited, so corrections to anything inappropriate in the article are welcome.

1. Preface

This article records and summarizes the project I led over the past year: traffic recording and playback, which is mainly used to provide load testing services for business teams. As the project owner I did about 70% of the work, so the project carries many memories for me. I was deeply involved in every stage: requirement proposal, technical research, selection verification, problem handling, solution design, launching a minimum available system within two weeks, promotion and adoption, supporting the mid-year and year-end full-link load tests, iterative optimization, supporting Dubbo traffic recording, and landing new scenarios. Along the way I learned a great deal, including but not limited to the Go language, networking, Dubbo protocol details, and the Java class loading mechanism. I am also pleased with the value the project has produced. In the year since launch, it has helped the business lines find more than a dozen performance problems and helped the middleware team find several serious problems in basic components. This article focuses on the implementation ideas and does not post much code. Interested readers can build their own version based on these ideas. Let's begin.

2. Project background

The project originated from a request by the business team: use real online traffic for load testing so that the tests are more "real". The business team felt that the old load testing platform (based on JMeter) was not realistic because the test data lacked diversity and code coverage was insufficient. A conventional load test usually targets the top 30 interfaces of an application, and enriching the test data for these interfaces by hand is very costly. Based on this requirement, we investigated several tools and finally chose GoReplay, which is written in Go, as the traffic recording and playback tool. The reasons for this choice are explained in the next section.

3. Technical selection and verification

3.1 technical selection

At the beginning of the selection I was inexperienced and did not consider many factors. I only evaluated two dimensions: functionality and popularity. First, the features must meet our needs, for example traffic filtering, so that specified interfaces can be recorded on demand. Second, a candidate should be backed by a large company or have many stars on GitHub. Based on these two requirements, the following tools were shortlisted:


Figure 1: technology selection

The first candidate is an open source tool from Alibaba, jvm-sandbox-repeater, which is built on JVM-Sandbox. In principle, the tool intercepts the target interface through bytecode enhancement to obtain the interface parameters and return values. The effect is equivalent to around advice in AOP.

The second candidate is goreplay, which is written in Go. Underneath, it relies on the pcap library for traffic recording. The well-known tcpdump also relies on pcap, so goreplay can be regarded as a minimalist tcpdump: it supports a single protocol and only records HTTP traffic.

The third candidate is nginx's traffic mirroring module, ngx_http_mirror_module. With this module, traffic can be mirrored to another machine to achieve traffic recording.

The fourth candidate is a sub-product of Alibaba Cloud: the Dual-Engine Regression Test Platform. As the name suggests, it was developed for regression testing. We need load testing, so many functions of this service are of no use to us.

After comparison and filtering, we chose goreplay as the traffic recording tool. Before analyzing the advantages and disadvantages of goreplay, let's first look at the problems with the other tools.

  1. jvm-sandbox-repeater is implemented on top of JVM-Sandbox. To use it, the code of both projects must be loaded into the target application, which intrudes on the application's runtime environment. If either project has a bug that causes something like an OOM, the impact on the target application is severe. In addition, because it serves a niche area, JVM-Sandbox is not widely used and community activity is low. We were worried that the maintainers could not fix problems in time, so this candidate was set aside.
  2. ngx_http_mirror_module looks like a good choice and comes from a "famous family", but it has some problems. First, it only supports HTTP traffic, and we will certainly need Dubbo traffic recording in the future. Second, to mirror requests the module inevitably consumes TCP connections, network bandwidth and other resources on the machine. Since our traffic recording runs continuously on the gateway, this resource consumption must be taken into account. Finally, the module cannot mirror only specified interfaces, and switching mirroring on or off requires changing the nginx configuration. Online configuration cannot be changed casually, especially for core applications such as the gateway. For these reasons this candidate was also dropped.
  3. At the time of our research, the Alibaba Cloud engine regression test platform was still being polished and was quite troublesome to use. It is also a sub-product of Cloud Effect and is not sold separately. Moreover, it is mainly intended for regression testing, which deviates considerably from our scenario, so it was also dropped.

Next, the advantages and disadvantages of goreplay. First, the advantages:

  • It is a single program with no dependencies or configuration other than the pcap library, so environment preparation is very simple
  • It is an executable that runs directly and is very lightweight. Recording starts as soon as the appropriate parameters are passed in, so it is easy to use
  • It has many stars on GitHub, is well known, and has an active community
  • It supports traffic filtering, multiple-speed playback, rewriting interface parameters during playback, and more, so it meets our functional needs
  • It consumes few resources, does not intrude on the JVM runtime of business applications, and has little impact on the target application

For a company built on the Java technology stack, the fact that goreplay is written in Go is a real issue: the stacks differ greatly, and future maintenance and extension become a problem. It would be normal to eliminate this candidate on that basis alone. However, its advantages are prominent, so after weighing the pros and cons of the other candidates we finally chose goreplay. You may also wonder why we did not choose tcpdump. There are two reasons. First, our requirements are relatively small, and using tcpdump would feel like shooting mosquitoes with a cannon. Second, tcpdump seemed too complex for us to control (shedding the tears of insufficient skill), so we did not consider it much at the beginning.

Comparison of the candidates (tool, language, advantages, disadvantages):

GoReplay (language: Go)
  Advantages:
  1. Open source project with simple code, convenient to customize
  2. Single binary with few dependencies; no configuration is needed, so environment preparation is simple
  3. Lightweight and easy to use
  4. Relatively rich features that meet all our needs
  5. Built-in playback, so recorded data can be used directly without separate development
  6. Low resource consumption, and it does not intrude on the JVM runtime of the target application, so the impact is small
  7. Provides a plug-in mechanism that does not restrict the implementation language, which makes extension convenient
  Disadvantages:
  1. Not widely adopted, no backing from large companies, and not mature enough
  2. Quite a few problems; version 1.2.0 is not recommended even by the maintainers
  3. Demanding for users: when problems occur, you must be able to read the source code yourself, and the maintainers' response speed is average
  4. The community edition only supports HTTP, not binary protocols, and the core logic is coupled with HTTP, so extension is troublesome
  5. Only command-line startup is supported and there is no built-in service, so integration is difficult

JVM-Sandbox / jvm-sandbox-repeater (language: Java)
  Advantages:
  1. Through bytecode enhancement it can record Java class methods directly, which is very powerful
  2. Rich features that meet the requirements
  3. Transparent to business code, with no code intrusion
  Disadvantages:
  1. It intrudes on the application runtime environment to some extent; if something goes wrong, the application may be affected
  2. The tool is oriented toward regression testing, so some functions cannot be used in our scenario. For example, its playback function cannot be used for high-speed load testing
  3. Community activity is low and there is a risk that maintenance stops
  4. The underlying implementation is complex and the maintenance cost is relatively high (once again, the tears of insufficient skill)
  5. It needs other auxiliary systems to work, and the integration cost is not low

ngx_http_mirror_module (language: C)
  Advantages:
  1. Produced by nginx, so maturity is assured
  2. Relatively simple configuration
  Disadvantages:
  1. Inconvenient to start and stop, and no filtering support
  2. Must be used with nginx, so the scope of use is limited

Alibaba Cloud engine regression test platform

3.2 Selection verification

After the selection is made, the tool's functions, performance, resource consumption and other aspects must be verified to check whether it meets the requirements. According to our requirements, the following verifications were performed:

  1. Recording verification: verify whether traffic is recorded completely, including the number of requests and the integrity and accuracy of request data, and verify resource consumption under heavy traffic
  2. Traffic filtering verification: verify whether the traffic of specified interfaces can be filtered, and whether the filtered traffic is complete
  3. Playback verification: verify whether traffic playback works as expected and whether the playback request volume meets expectations
  4. Multiple-speed playback verification: verify whether the speed multiplier works as expected, and measure resource consumption during high-speed playback

All of the above verifications passed offline, the results looked good, and everyone was satisfied. However, when multiple-speed playback was verified in the production environment, the playback pressure could not go up: it topped out at about 600 QPS, and no matter how much we tried to increase it, the QPS stayed at that level. Together with colleagues from the business line, we tested online for several rounds with different recorded data, without success. At first we suspected a machine resource bottleneck, but CPU and memory usage were very low, and TCP connections and bandwidth had plenty of headroom, so resources were not the problem. This exposed a mistake: in the early stage we only tested the tool's functions and did no performance testing, so the problem was not exposed early.

So I built a test service offline with nginx and Tomcat and ran some performance tests, and found that I could easily reach thousands of QPS. I did not know whether to laugh or cry. Later I noticed that the RT of the offline service was too short, very different from that of the online services. After making the service thread sleep for a random tens to hundreds of milliseconds, the result became very close to what we saw online. At this point the scope of the problem was basically clear: the problem was in goreplay. But goreplay is written in Go, nobody on the team had Go experience, and it was frustrating to see the problem so close and have nowhere to start. The leaders then decided to invest time in the goreplay source code and find the problem by analyzing it. That is when I started learning Go. The original plan was to give a preliminary conclusion within two weeks, but the problem was found within one.

The cause was a large discrepancy between the goreplay v1.1.0 documentation and the code, so following the documentation did not produce the expected result. Details are as follows:


Figure 2: instructions for goreplay

Let's first look at what the misleading documentation says: the --output-http-workers parameter sets how many goroutines send HTTP requests concurrently, and the default value is 0, meaning unlimited. Now let's look at how the code (output_http.go) implements it:


Figure 3: goreplay concurrency decision logic

The documentation says that by default the number of HTTP-sending goroutines is unlimited, but the code sets it to 10, which is a big difference. Why are 10 goroutines not enough? Because each goroutine waits in place for the response, that is, it blocks, so the QPS that 10 goroutines can generate is limited. Once the cause was found, we set the --output-http-workers parameter explicitly, and the QPS of multiple-speed playback finally met the requirements.
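The ceiling described above can be estimated with simple arithmetic: if every worker blocks until its response arrives, throughput is bounded by the number of workers divided by the average response time. The sketch below is only an illustration. The response times are assumed values (the article only says "tens to hundreds of milliseconds"), and `max_replay_qps` is a hypothetical helper, not part of goreplay.

```python
def max_replay_qps(workers: int, avg_response_seconds: float) -> float:
    """Upper bound on QPS when each worker blocks until its response arrives."""
    if workers <= 0 or avg_response_seconds <= 0:
        raise ValueError("workers and avg_response_seconds must be positive")
    return workers / avg_response_seconds


# Assumed response times, for illustration only.
for rt_ms in (1, 16, 50, 200):
    qps = max_replay_qps(10, rt_ms / 1000)
    print(f"10 workers, RT {rt_ms:>3} ms -> at most {qps:,.0f} QPS")
```

Under these assumptions, an average response time of roughly 16 ms would cap 10 workers at around 625 QPS, the same order of magnitude as the ceiling observed online, while a service that answers in about 1 ms would allow thousands of QPS, which matches the offline result.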

After this problem, we had serious doubts about goreplay. The problem felt rather basic, and if a problem like this can appear, what other problems would appear later? We were uneasy about using it. On the other hand, very few people maintain the project, so it can basically be regarded as a personal project. It has also not been applied at large scale, and no large company backs it. Seen that way, problems like this are understandable and there is no need to be too harsh. So when problems come up later, we will deal with them as they come. After all, the code is all there, and we can simply audit it as a white box.

3.3 summary and reflection

First, the problems in the selection process. As the description above shows, I made some serious mistakes during selection and verification, and learned a vivid lesson. In the selection stage, I equated popularity with the number of stars, which looks naive in hindsight. Maturity is actually more important than popularity: a stable tool has fewer pitfalls and lets you go home on time. Observability must also be considered, otherwise you will feel helpless when troubleshooting.

In the verification stage, there were no major problems with functional verification. The performance verification, however, was only a token effort, and it failed when we verified together with colleagues from the business line. Performance testing cannot be treated carelessly during verification. If such problems are only discovered online, you end up in a very passive position.

Based on this experience, here is a summary for future reference. The selection dimensions are as follows:

| Dimension | Explanation |
|---|---|
| Functionality | 1. Whether the candidate's features meet the requirements, and if not, what secondary development would cost |
| Maturity | 1. Whether the candidate is widely used in its field. For example, in the Java web domain, the Spring technology stack is well known. 2. A candidate in a niche field may not be widely used, so you can only review its issues, search for records of pitfalls, and evaluate it yourself |
| Observability | 1. Whether there is a way to observe internal status data. For example, goreplay prints its internal status data periodically. 2. Whether it is convenient to connect to the company's monitoring system. After all, observing by hand is too laborious |

The verification steps are summarized as follows:

  1. Verify, item by item against the requirements, whether the candidate's functions meet expectations. A verification checklist helps here
  2. Test the candidate's performance from every plausible angle, and watch the consumption of each resource during the tests. For goreplay, for example, traffic recording, filtering and playback must all be tested
  3. Verify stability over long-running operation, and observe and analyze any abnormal behavior during the verification period
  4. For stricter verification, run some fault tests, for example killing the process or disconnecting the network

For more detailed practical experience on technology selection, see Li Yunhua's article: How to use open source projects correctly

4. Specific practice

Once technology selection and verification are complete, the next step is to turn the idea into reality. Following the small-steps, fast-iteration approach, in the start-up stage we usually plan only the core functions to make sure the process runs end to end. After that we iterate according to requirement priority and improve gradually. The following sections follow the project's iteration history.

4.1 minimum available system

4.1.1 requirements introduction

| No. | Category | Requirement | Description |
|---|---|---|---|
| 1 | Recording | Traffic filtering, on-demand recording | Supports filtering traffic by HTTP request path, so that the traffic of specified interfaces can be recorded |
| 2 | Recording | Configurable recording duration | The recording duration can be set. Typically 10 minutes are recorded to capture the traffic peak |
| 3 | Recording | Recording task details | Including recording status, recording result statistics, etc. |
| 4 | Playback | Configurable playback duration | Supports a playback duration of 1 to 10 minutes |
| 5 | Playback | Configurable playback speed | Traffic is amplified by a multiple of the QPS at recording time; the minimum granularity is 1x |
| 6 | Playback | Manual termination during playback | When problems are found in the application under test, playback can be terminated manually |
| 7 | Playback | Playback task details | Including playback status and playback result statistics |

The above is the list of requirements in the project startup phase, which are the most basic requirements. As long as these requirements are completed, a minimum available system is realized.

4.1.2 introduction to technical scheme

4.1.2.1 architecture diagram


Figure 4: phase I architecture of the load testing system

The architecture diagram above has been edited and differs somewhat from the actual deployment, but this does not affect the explanation. Note that the gateway service, the load generators and the load testing service each consist of multiple servers, and goreplay and its controller are deployed on every gateway and load generator instance. To keep the diagram simple, only one machine of each is drawn. Some core processes are introduced below.

4.1.2.2 gor controller

Before going further, let's look at the purpose of the gor controller. In one sentence: this middle layer exists to integrate goreplay, a command-line tool, with our load testing system. We developed this module ourselves. It was first written in shell (painful) and later rewritten in Go. The gor controller is mainly responsible for the following:

  1. Controlling the life cycle of goreplay: it can start and terminate the goreplay program
  2. Hiding the usage details of goreplay, to reduce complexity and improve ease of use
  3. Reporting status back to the load testing system before goreplay starts, after it starts, and after other milestone events
  4. Processing and returning the data produced by recording and playback
  5. Logging the status data output by goreplay for later troubleshooting

Goreplay itself only provides the most basic functions. You can think of it as a car with only basic parts such as a chassis, wheels, a steering wheel and an engine. It can drive, but driving it is hard work. Our gor controller adds features comparable to one-button start and stop, power steering and connected-car services, which make it much more usable. This is only a rough metaphor, so don't dwell on how exact it is. Now that the purpose of the controller is clear, the recording and playback processes are described below.

4.1.2.3 introduction to recording process

The user's recording command is first sent to the load testing service. The service could have sent the command directly to the gor controller over SSH, but for security reasons the command has to be routed through the operation and maintenance system. After the gor controller receives the recording command and validates the parameters, it invokes goreplay. After recording, the gor controller reports the status back to the load testing system, which decides whether the recording task is over. The detailed process is as follows:

  1. The user sets the recording parameters and submits the recording request to the load testing service
  2. The load testing service creates a load testing task and generates a recording command from the parameters specified by the user
  3. The recording command is sent to the specific machine through the operation and maintenance system
  4. The gor controller receives the recording command, reports the "recording about to start" status to the load testing service, and then invokes goreplay
  5. After recording, goreplay exits, and the gor controller reports the "recording finished" status to the load testing service
  6. The gor controller sends other information back to the load testing system
  7. After the load testing service determines that the recording task is over, it notifies the load generator to pull the recorded data into a local file
  8. The recording task ends
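To make steps 2 to 5 more concrete, here is a minimal sketch of how a controller might turn recording parameters into a goreplay command line. It is not the author's controller. `build_record_command` is a hypothetical helper, the port, URL pattern and file name are made-up values, and it assumes the `--http-allow-url` and `--exit-after` flags of the community edition behave as documented for the version in use.

```python
import shlex


def build_record_command(port, allow_url_patterns, output_file, duration):
    """Build a goreplay recording command (hypothetical helper, illustrative values)."""
    cmd = ["gor", "--input-raw", f":{port}"]
    for pattern in allow_url_patterns:
        # Only requests whose URL matches one of the patterns are recorded.
        cmd += ["--http-allow-url", pattern]
    # Store recorded requests in a file and stop after the requested duration.
    cmd += ["--output-file", output_file, "--exit-after", duration]
    return cmd


cmd = build_record_command(80, ["^/api/order"], "requests.gor", "10m")
print(shlex.join(cmd))
# A controller would report "recording about to start", run the command
# (for example with subprocess.run(cmd)), and report "recording finished"
# once the process exits.
```

Keeping the command as a list of arguments rather than one string avoids shell quoting problems when URL patterns contain special characters.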

Note that to use goreplay's multiple-speed playback, the recorded data must be stored in a file. The speed is then set through the following parameter:

# Triple-speed playback
gor --input-file "requests.gor|300%" --output-http "test.com"

4.1.2.4 introduction to playback process

The playback process is basically the same as the recording process, except that the playback command is sent to the load generator, so the details are not repeated. The differences are:

  1. Tag the playback traffic with a load-test mark: to distinguish playback traffic from real traffic, a mark is required, namely the load-test mark
  2. Rewrite parameters as needed: for example, change the User-Agent to goreplay, or add the token of a test account
  3. Collect goreplay runtime status, including QPS, task queue backlog, and so on. This information helps you understand how goreplay is running
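The differences above can be sketched as a playback command builder. This is only an illustration under assumptions: `build_replay_command` is a hypothetical helper, the header name `X-Load-Test` is invented for the load-test mark (the article does not say how the mark is carried), and the flags used are `--input-file`, `--output-http`, `--output-http-workers` and `--http-set-header` from the community edition.

```python
def build_replay_command(input_file, speed, target, headers, workers):
    """Build a goreplay playback command (hypothetical helper, illustrative values)."""
    if speed < 1:
        raise ValueError("the minimum playback speed is 1x")
    # goreplay expresses the speed as a percentage appended to the file name.
    cmd = ["gor", "--input-file", f"{input_file}|{int(speed * 100)}%",
           "--output-http", target,
           # Set explicitly instead of relying on the default worker count.
           "--output-http-workers", str(workers)]
    for name, value in headers.items():
        # Mark or rewrite replayed requests by setting headers.
        cmd += ["--http-set-header", f"{name}: {value}"]
    return cmd


cmd = build_replay_command(
    "requests.gor", 3, "test.com",
    {"X-Load-Test": "true", "User-Agent": "goreplay"},  # assumed header names
    workers=100,
)
print(" ".join(cmd))
```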

4.1.3 deficiencies

The minimum available system ran online for almost 4 months without major problems, but it had some deficiencies. There were two main points:

  1. The command transmission path is somewhat long, which increases the probability of errors and makes troubleshooting harder. For example, the interface of the operation and maintenance system occasionally failed and left no key logs, so at first there was no way to investigate
  2. The gor controller was written in shell, about 300 lines. Shell syntax is quite different from Java, and the code is hard to debug. Complex logic such as generating JSON strings is also troublesome to write, and the later maintenance cost is high

These two deficiencies stayed with us throughout development and operations until the later optimization, which finally resolved them.

4.2 continuous optimization


Figure 5: architecture diagram of GOR controller after optimization

We made targeted improvements for the pain points above. The main change was rewriting the gor controller in Go; the new controller is named gor server. As the name suggests, it has a built-in HTTP service. With this service, the load testing service can issue commands to it directly, and there is finally no need to detour through the operation and maintenance system. All modules are now under our control, and development and maintenance efficiency has improved noticeably.

4.3 support Dubbo traffic recording

We use Dubbo internally as the RPC framework, and calls between applications go through Dubbo, so we also have a strong need for Dubbo traffic recording. After gateway traffic recording showed results, colleagues responsible for internal systems also wanted to run load tests with goreplay. To meet this internal demand, we carried out secondary development on goreplay to support recording and playback of Dubbo traffic.

4.3.1 introduction to the Dubbo protocol

To support Dubbo recording, you first need to understand the Dubbo protocol. Dubbo is a binary protocol, and its encoding rules are shown in the following figure:


Figure 6: illustration of the Dubbo protocol; source: Dubbo official website

The protocol is introduced briefly below. The meaning of each field is given in the order shown in the figure.

| Field | Bits | Meaning | Explanation |
|---|---|---|---|
| Magic High | 8 | Magic number, high byte | Fixed to 0xda |
| Magic Low | 8 | Magic number, low byte | Fixed to 0xbb |
| Req/Res | 1 | Packet type | 0: Response; 1: Request |
| 2way | 1 | Call mode | 0: one-way call; 1: two-way call |
| Event | 1 | Event flag | For example, heartbeat events |
| Serialization ID | 5 | Serializer number | 2: Hessian2Serialization; 3: JavaSerialization; 4: CompactedJavaSerialization; 6: FastJsonSerialization; …… |
| Status | 8 | Response status | 20: OK; 30: CLIENT_TIMEOUT; 31: SERVER_TIMEOUT; 40: BAD_REQUEST; 50: BAD_RESPONSE; …… |
| Request ID | 64 | Request ID | The same ID is carried in the response header to associate the request with the response |
| Data Length | 32 | Data length | Length of the variable part |
| Variable Part (payload) | | Data payload | |

After understanding the protocol, we ran the official demo and captured a packet to study it.


Figure 7: Dubbo request packet capture

First, we can see the magic number 0xdabb, which occupies two bytes. The next 14 bytes are the remaining fields of the protocol header. Let's analyze them briefly:


Figure 8: Dubbo request header data analysis

The annotations above are fairly clear; here is a brief explanation. The third byte shows that this packet is a Dubbo request. Because it is the first request, the request ID is 0. The data length is 0xdc, which is 220 bytes in decimal. Adding the 16-byte header gives a total length of exactly 236 bytes, which matches the length shown in the packet capture.
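The header layout can be checked with a few lines of code. The sketch below parses a 16-byte header according to the field table above. The sample bytes are reconstructed from the values described in the text (request, request ID 0, data length 0xdc); the flag byte 0xc2 is an assumption standing for a two-way request serialized with Hessian2, since the capture itself is only available as a figure. `parse_dubbo_header` is a hypothetical helper, not Dubbo code.

```python
import struct

HEADER_LENGTH = 16
MAGIC = 0xDABB
SERIALIZATION_NAMES = {
    2: "Hessian2Serialization",
    3: "JavaSerialization",
    4: "CompactedJavaSerialization",
    6: "FastJsonSerialization",
}


def parse_dubbo_header(data: bytes) -> dict:
    """Parse the fixed 16-byte Dubbo header (big-endian fields)."""
    if len(data) < HEADER_LENGTH:
        raise ValueError("need at least 16 bytes")
    # magic (16 bits), flags (8), status (8), request id (64), data length (32)
    magic, flags, status, request_id, data_length = struct.unpack(
        ">HBBQI", data[:HEADER_LENGTH])
    if magic != MAGIC:
        raise ValueError("not a Dubbo packet")
    return {
        "is_request": bool(flags & 0x80),
        "two_way": bool(flags & 0x40),
        "event": bool(flags & 0x20),
        "serialization_id": flags & 0x1F,
        "status": status,  # only meaningful for responses
        "request_id": request_id,
        "data_length": data_length,
    }


# Sample header rebuilt from the values in the text; 0xc2 is an assumed flag byte.
sample = bytes.fromhex("dabbc200" + "0000000000000000" + "000000dc")
header = parse_dubbo_header(sample)
print(header)
print("serializer:", SERIALIZATION_NAMES.get(header["serialization_id"], "unknown"))
print("total length:", HEADER_LENGTH + header["data_length"])
```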

4.3.2 parsing the Dubbo protocol

To support Dubbo traffic recording, we first need to decode packets according to the Dubbo protocol to determine whether the recorded data is a Dubbo request. So how do we determine that the data in a recorded TCP segment is a Dubbo request? The answer is as follows:

  1. First, check whether the data length is greater than or equal to the length of the protocol header, that is, 16 bytes
  2. Check whether the first two bytes of the data are the magic number 0xdabb
  3. Check whether the 17th bit is 1. If it is not 1, the data can be discarded
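The three checks map directly to a small function. This is a sketch of the idea rather than the actual goreplay modification, and `looks_like_dubbo_request` is a hypothetical name. The 17th bit is the highest bit of the third byte, which is the Req/Res flag.

```python
def looks_like_dubbo_request(data: bytes) -> bool:
    """Quick check that a TCP payload starts with a Dubbo request header."""
    if len(data) < 16:                         # 1. at least one full header
        return False
    if data[0] != 0xDA or data[1] != 0xBB:     # 2. magic number 0xdabb
        return False
    return bool(data[2] & 0x80)                # 3. 17th bit: 1 means request


request = bytes.fromhex("dabbc200" + "00" * 8 + "000000dc")
response = bytes.fromhex("dabb0214" + "00" * 8 + "00000010")
print(looks_like_dubbo_request(request))
print(looks_like_dubbo_request(response))
```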

These checks quickly determine whether the data conforms to the Dubbo request format. If they pass, how do we determine whether the recorded request data is complete? The answer is to compare the length L1 of the recorded data (after the 16-byte header) with the length L2 given by the Data Length field, and proceed according to the result. There are three cases:

  1. L1 == L2: the data has been received completely and no additional processing is required
  2. L1 < L2: some data has not been received yet, so keep waiting for the rest
  3. L1 > L2: extra data has been received that does not belong to the current request, so the received data must be split at L2

The three cases are illustrated below:


Figure 9: several situations of application layer receiver
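The three cases can be handled by one loop that cuts complete frames off the front of a buffer and keeps whatever is left for later. The sketch below only illustrates the length comparison; it is not the goreplay or Dubbo implementation, and `split_dubbo_frames` and `make_frame` are hypothetical helpers.

```python
import struct

HEADER_LENGTH = 16


def split_dubbo_frames(buffer: bytes):
    """Return (complete_frames, remaining_bytes) for a buffer of Dubbo data."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER_LENGTH:
        # Data Length is the last 4 bytes of the header, big-endian.
        (body_length,) = struct.unpack_from(">I", buffer, offset + 12)
        frame_end = offset + HEADER_LENGTH + body_length
        if frame_end > len(buffer):
            break                      # case 2: wait for more data
        frames.append(buffer[offset:frame_end])
        offset = frame_end             # case 3: continue with the extra data
    return frames, buffer[offset:]


def make_frame(body: bytes) -> bytes:
    """Build a request frame with request ID 0 for the demo (assumed flag byte)."""
    return (bytes.fromhex("dabbc200") + (0).to_bytes(8, "big")
            + len(body).to_bytes(4, "big") + body)


first, second = make_frame(b"a" * 5), make_frame(b"b" * 3)
print(split_dubbo_frames(first))               # case 1: exactly one frame
print(split_dubbo_frames(first[:18]))          # case 2: incomplete frame
print(split_dubbo_frames(first + second[:4]))  # case 3: one frame plus leftovers
```

The caller is expected to prepend the returned remainder to the next chunk of data it receives for the same connection.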

Seeing this, some readers will surely say: isn't this the typical TCP "sticky packet" and "packet splitting" problem? I prefer not to use those two terms for the cases above. TCP is a byte-stream-oriented protocol, and the protocol itself has no such problems. When transmitting data, TCP does not care how the upper layer defines its data. To TCP, everything is just bytes, and its only job is to deliver those bytes to the target process reliably and in order. Cases 2 and 3 are for the application layer to handle. That is why the corresponding logic can be found in Dubbo's code; interested readers can read the NettyCodecAdapter.InternalDecoder#decode method.

That's all for this section. Finally, here is a question for you: goreplay's code does not handle case 3, so why does recording HTTP traffic not go wrong?

4.3.3 goreplay reconstruction

4.3.3.1 introduction to transformation

At present, the goreplay community edition only supports recording HTTP traffic. Its commercial edition supports some binary protocols, but not Dubbo. To meet our internal requirements, secondary development was the only option. However, the community edition code is tightly coupled to HTTP protocol handling, so supporting a new protocol is troublesome. In our implementation, the changes to goreplay mainly cover Dubbo protocol identification, Dubbo traffic filtering, and packet integrity checks. Decoding and deserialization of the packets are done by a Java program, and the deserialized result is converted to JSON for storage. The effect is as follows:

Figure 10: Dubbo traffic recording effect

Goreplay uses three monkey emoji as the request separator, which looks funny at first sight.

4.3.3.2 introduction to goreplay plug-in mechanism

You may be curious about how goreplay works together with a Java program. The principle is simple. First, let’s see how to turn on goreplay’s plug-in mode:

gor --input-raw :80 --middleware "java -jar xxx.jar" --output-file request.gor

The middleware parameter passes a command to goreplay, and goreplay starts a process to execute it. During recording, goreplay communicates with the plug-in process through that process’s standard input and output. The data flow is roughly as follows:

+-------------+     Original request     +--------------+     Modified request      +-------------+
|  Gor input  |----------STDIN---------->|  Middleware  |----------STDOUT---------->| Gor output  |
+-------------+                          +--------------+                           +-------------+
  input-raw                              java -jar xxx.jar                            output-file           
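
As a rough illustration of the plug-in side, the sketch below reads messages from an input stream and writes them back unchanged. It assumes, as I understand goreplay’s middleware documentation, that each message arrives as one hex-encoded line and must be written back the same way. The class name and the `transform` hook are hypothetical, and this is not our actual plug-in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: a pass-through middleware loop with a hook for custom processing.
class PassThroughMiddleware {

    /** Decodes a hex string such as "68656c6c6f" into raw bytes. */
    static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Encodes raw bytes as lower-case hex. */
    static String encodeHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Hypothetical hook: a real plug-in would decode the Dubbo body here. */
    static byte[] transform(byte[] message) {
        return message;
    }

    /** Reads one hex-encoded message per line and writes the transformed message back. */
    static void run(InputStream in, PrintStream out) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.US_ASCII));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String hex = line.trim();
                if (hex.isEmpty()) {
                    continue;
                }
                out.print(encodeHex(transform(decodeHex(hex))) + "\n");
                out.flush();   // goreplay waits on STDOUT, so flush after every message
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A `main` method that calls `run(System.in, System.out)` would turn this into a process that can be passed to the middleware parameter.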
4.3.3.3 implementation idea of Dubbo decoding plug-in

Decoding the Dubbo protocol is relatively easy to implement. After all, the Dubbo framework already contains most of the code, and we only need to modify and customize it as needed. The parsing logic for the protocol header is in the DubboCodec#decodeBody method, and the parsing logic for the message body is in the DecodeableRpcInvocation#decode(Channel, InputStream) method. Since goreplay has already parsed and processed the data, the plug-in does not need to parse many header fields. It only parses the serialization ID, which guides the subsequent deserialization.

Decoding the message body is a little more troublesome. We copied the code of DecodeableRpcInvocation into the plug-in project and modified it: the unnecessary logic was deleted, only the decode method was kept, and the class was turned into a utility class. Because it is inconvenient for the plug-in to depend on the jar of the application being recorded, the type-related logic also has to be removed when modifying the decode method. The modified code is roughly as follows:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
// Dubbo imports (CodecSupport, ObjectInput) omitted; the package depends on the Dubbo version

public class RpcInvocationCodec {

    public static MyRpcInvocation decode(byte[] bytes, int serializationId)
            throws IOException, ClassNotFoundException {
        InputStream input = new ByteArrayInputStream(bytes);
        ObjectInput in = CodecSupport.getSerializationById((byte) serializationId).deserialize(null, input);

        MyRpcInvocation rpcInvocation = new MyRpcInvocation();
        String dubboVersion = in.readUTF();
        // ......
        rpcInvocation.setMethodName(in.readUTF());

        // Original code: Class<?>[] pts = DubboCodec.EMPTY_CLASS_ARRAY;
        // After modification, pts becomes String[]; the type name list is needed for generic invocation
        String[] pts = desc2className(in.readUTF());
        Object[] args = new Object[pts.length];
        for (int i = 0; i < args.length; i++) {
            // Original code: args[i] = in.readObject(pts[i]);
            // After modification, it no longer depends on concrete types; values are deserialized into Maps
            args[i] = in.readObject();
        }
        rpcInvocation.setArguments(args);
        rpcInvocation.setParameterTypeNames(pts);

        return rpcInvocation;
    }
}

From a pure coding perspective, this is not very difficult, provided you have some understanding of Dubbo’s source code. For me, most of the time went into modifying goreplay, mainly because I was not familiar with the Go language and had to look things up while writing, which made me slow. When the feature was written and debugged, I was really happy to see the correct output. However, the happiness did not last. Soon afterwards, during online verification with business colleagues, the plug-in crashed, and the scene was quite embarrassing. I was confused by the error message and could not solve it on the spot, so to save face I quickly terminated the verification. Later investigation showed that when certain deserialized data was converted to JSON, an infinite loop occurred and ended in a StackOverflowError. Because the plug-in’s main flow is single-threaded and only catches Exception, the plug-in exited abnormally.
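
The reason the plug-in died is that StackOverflowError is an Error, not an Exception, so a `catch (Exception e)` block never sees it. The sketch below reproduces this with a deliberately unbounded recursion. The class and method names are made up for the illustration and are not taken from the plug-in.

```java
// Illustrative sketch: why catching Exception is not enough for a StackOverflowError.
class PluginLoopGuard {

    /** Recurses without a base case to provoke a StackOverflowError. */
    static int recurseForever(int depth) {
        return recurseForever(depth + 1) + 1;
    }

    /** Mirrors the faulty handler: Errors are not Exceptions, so they escape. */
    static String handleCatchingException(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (Exception e) {
            return "recovered: " + e.getClass().getSimpleName();
        }
    }

    /** Catches Throwable, so the main loop survives a stack overflow in one message. */
    static String handleCatchingThrowable(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (Throwable t) {
            return "recovered: " + t.getClass().getSimpleName();
        }
    }
}
```

Whether a process should keep running after an Error is a judgment call. For a recorder that handles independent messages, skipping the bad message and logging it seems a reasonable trade-off.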

Figure 11: error reporting of gson framework caused by cyclic dependency

This error tells us there is a circular reference between classes, and our plug-in code did not handle circular references, so the error itself is reasonable. But when I found the business code that triggered it, I could not see any circular reference. I only found it after debugging locally. Code similar to the business code is shown below:

public class Outer {
    private Inner inner;

    public class Inner {
        private Long xyz;

        public Inner() {
        }
    }
}

The problem lies in the inner class. Inner implicitly holds a reference to Outer, which is presumably the compiler’s doing. There are no secrets in front of the source code: decompile the class file of the inner class and everything becomes clear.

Figure 12: decompilation results of internal classes
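
The hidden reference can also be observed without a decompiler. The following sketch uses reflection to look for a compiler-generated (synthetic) field whose type is the enclosing class. The helper name is hypothetical. The inner class deliberately reads a field of the outer instance, because newer javac versions may drop the hidden field when the inner class never uses it.

```java
import java.lang.reflect.Field;

// Illustrative sketch: finding the hidden reference to the enclosing instance.
class Outer {
    private String name = "outer";

    class Inner {
        private Long xyz;

        // Reading a field of the enclosing instance forces the compiler to keep the reference.
        String outerName() {
            return name;
        }
    }

    /** Returns the synthetic field that points at the enclosing instance, or null if none exists. */
    static Field hiddenOuterReference(Class<?> innerClass) {
        for (Field field : innerClass.getDeclaredFields()) {
            if (field.isSynthetic() && field.getType() == innerClass.getEnclosingClass()) {
                return field;
            }
        }
        return null;
    }
}
```

With javac, the field returned for `Outer.Inner.class` is the one named this$0 in the decompiled output.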

This counts as basic Java knowledge, but it is rarely needed in daily work, so when I first saw the code I did not notice the circular reference hidden in it. So far the explanation makes sense. Is that the end? Not yet. In fact, gson does not report an error when serializing Outer. Debugging showed that gson excludes the this$0 field. The exclusion logic is as follows:

public final class Excluder {
    public boolean excludeField(Field field, boolean serialize) {
        // ......

        // Check whether the field is synthetic (compiler-generated)
        if (field.isSynthetic()) {
          return true;
        }

        // ......
        return false;
    }
}

So why did we get an error when converting recorded traffic to JSON? The reason is that our plug-in cannot get the type information of the interface parameters during deserialization, so the parameters are deserialized into Map objects, and the this$0 field and its value are stored in the map as a key-value pair. gson’s filtering rule no longer applies, this$0 is not filtered out, and the result is an infinite loop that eventually overflows the stack. Knowing the cause, how do we solve the problem? The next section explains.
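
To see the cycle for yourself, a small diagnostic like the one below can be run against the deserialized maps. It is only a sketch, not something our plug-in does: the class name is hypothetical, and it walks nested Map values only, ignoring lists and arrays.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: detects a map that (indirectly) contains itself.
class MapCycleCheck {

    static boolean hasCycle(Object root) {
        Set<Object> path = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        return hasCycle(root, path);
    }

    private static boolean hasCycle(Object node, Set<Object> path) {
        if (!(node instanceof Map)) {
            return false;
        }
        if (!path.add(node)) {
            return true;            // this map is already on the current path
        }
        for (Object value : ((Map<?, ?>) node).values()) {
            if (hasCycle(value, path)) {
                return true;
            }
        }
        path.remove(node);          // leaving this branch; shared but acyclic maps are fine
        return false;
    }
}
```

The set compares by identity rather than by `equals`, because calling `equals` or `hashCode` on a cyclic map would itself recurse forever.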

4.3.3.4 direct attack

At first I considered cleaning the data in the map manually, but that turned out to be difficult. If the map’s structure is complex, for example nested many layers deep, the cleaning logic is not easy to implement. I also did not know what other pitfalls might be waiting, so I gave up this idea and left the dirty work to the deserialization tool. For that, the plug-in needs the parameter types of the interface. How can the plug-in get the parameter types of a business application’s API? One way is to download the target application’s API jar when the plug-in starts and load it with a separate class loader. But there is a problem: the API jar has dependencies of its own. Should those be downloaded recursively too? The second way is simple and crude: add the business application’s API as a dependency of the plug-in project and package everything as a fat jar. Then there is no need for a separate class loader and no need to download dependencies recursively. The only obvious disadvantage is that some unrelated dependencies end up in the plug-in project’s POM, but compared with the benefits this is nothing. For convenience, we depended on the APIs of many business applications at once, which gave us the following POM configuration:

<project>
    <groupId>com.xxx.middleware</groupId>
    <artifactId>DubboParser</artifactId>
    <version>1.0</version>
    
    <dependencies>
        <dependency>
            <groupId>com.xxx</groupId>
            <artifactId>app-api-1</artifactId>
            <version>1.0</version>
        </dependency>
        <dependency>
            <groupId>com.xxx</groupId>
            <artifactId>app-api-2</artifactId>
            <version>1.0</version>
        </dependency>
        <!-- ...... -->
    </dependencies>
</project>

Next, change the RpcInvocationCodec#decode method, which really just restores the original code:

public class RpcInvocationCodec {

    public static MyRpcInvocation decode(byte[] bytes, int serializationId)
            throws IOException, ClassNotFoundException {
        InputStream input = new ByteArrayInputStream(bytes);
        ObjectInput in = CodecSupport.getSerializationById((byte) serializationId).deserialize(null, input);

        MyRpcInvocation rpcInvocation = new MyRpcInvocation();
        String dubboVersion = in.readUTF();
        // ......
        rpcInvocation.setMethodName(in.readUTF());

        // Resolve the interface parameter types from the type descriptor
        String desc = in.readUTF();
        Class<?>[] pts = ReflectUtils.desc2classArray(desc);
        Object[] args = new Object[pts.length];
        for (int i = 0; i < args.length; i++) {
            // Deserialize according to the concrete type
            args[i] = in.readObject(pts[i]);
        }
        rpcInvocation.setArguments(args);
        rpcInvocation.setParameterTypeNames(pts);

        return rpcInvocation;
    }
}

After the code was adjusted, we picked a date and verified it online. Everything worked, which was worth celebrating. But soon I realized there were hidden dangers. If something went wrong online one day, the investigation would be very difficult.

4.3.3.5 potential problems

Consider this situation: the API jars of business applications A and B both depend on some internal common package, and the versions they depend on may differ. How do we handle the dependency conflict? And what if the common package is poorly maintained and its versions are incompatible?

Figure 13: dependency conflict diagram

For example, the versions of the common package here conflict, and 3.0 is incompatible with 1.0. How do we deal with it?

The simple fix is to stop depending on all the business API packages in the plug-in POM and depend on only one. The disadvantage is that the plug-in has to be built separately for each application. Obviously, we did not like this approach.

Going further, we remove the business API packages from the plug-in entirely. This keeps the plug-in code clean, and we do not have to repackage it every time. How does the plug-in get the business application’s API jar, then? The answer is to create a dedicated project for each API jar and package that project as a fat jar. The plug-in uses a custom class loader to load the business classes. When the plug-in starts, it downloads the jar to the machine according to its configuration. Only one jar needs to be loaded at a time, so there is no dependency conflict, and the problem is solved.

Going further still: while reading the source code of Alibaba’s open source JVM sandbox project earlier, I noticed that it implements a class loader with a routing function. Could our plug-in have a similar loader? Out of curiosity I tried it and found that it works. The final implementation looks like this:

Figure 14: schematic diagram of custom class loading mechanism

The primary class loader routes by a “fragment” of the package name, and the secondary class loaders do the actual loading. The application API jars are all placed in one folder, and only the secondary class loaders can load them. JDK classes such as List are still loaded by the JVM’s built-in class loader. Finally, I built this routing class loader mostly for fun. Although it achieves the goal, the previous approach is the safer choice in a real project.
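
The routing idea can be sketched in a few lines of Java. This is my own minimal reconstruction of the concept, not the JVM sandbox code and not our production loader. The class name, `addRoute`, and `find` are hypothetical, and a secondary loader would typically be a URLClassLoader pointing at one API fat jar.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a primary loader that routes by package prefix to secondary loaders.
class RoutingClassLoader extends ClassLoader {
    private final Map<String, ClassLoader> routes = new LinkedHashMap<>();

    RoutingClassLoader(ClassLoader parent) {
        super(parent);
    }

    /** Registers a secondary loader for all classes whose name starts with the prefix. */
    void addRoute(String packagePrefix, ClassLoader secondary) {
        routes.put(packagePrefix, secondary);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        ClassLoader target = null;
        int longest = -1;
        for (Map.Entry<String, ClassLoader> route : routes.entrySet()) {
            String prefix = route.getKey();
            if (name.startsWith(prefix) && prefix.length() > longest) {
                target = route.getValue();      // the longest matching prefix wins
                longest = prefix.length();
            }
        }
        if (target != null) {
            return target.loadClass(name);      // the secondary loader does the real work
        }
        return super.loadClass(name, resolve);  // JDK and plug-in classes use normal delegation
    }

    /** Convenience wrapper that returns null instead of throwing. */
    Class<?> find(String name) {
        try {
            return loadClass(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```

Because unmatched names fall through to the normal parent delegation, classes such as java.util.List keep coming from the built-in loader, which matches the behaviour described above.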

4.4 blossom and bear fruit, landing a new scene

At that time, pressure testing was the main and only use case of our traffic recording and playback system. Once the system was stable, we started to consider what else it could do. I happened to have tried JVM sandbox repeater during the technology selection stage. Its main application scenario is traffic comparison testing: for a code refactoring that does not change the structure of an interface’s return value, you can verify whether the change introduces problems by comparing traffic. However, the bosses felt that JVM sandbox repeater and the underlying JVM sandbox were a little heavy and technically complex, and there were no resources to develop and maintain the two tools. So the plan was to build this on top of the traffic recording and playback system and get the process working end to end first.

The project is led by the QA team. They develop the traffic playback and diff functions, and we provide the underlying recording capability. The system works as follows:

Figure 15: schematic diagram of comparative test

Our recording system provides real-time traffic data to the repeater. After receiving the data, the repeater immediately replays it to the pre-release and online environments. It then collects the results returned by the two environments and sends them to the comparison module. Finally, the comparison results are stored in the database, and users can see which requests failed the comparison. The recording module must take care to filter out the replayed traffic. Otherwise the QPS of the interface doubles, the replay turns into a pressure test, and you earn yourself an incident.
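
One simple way to filter replayed traffic is to have the repeater tag every request it sends and have the recorder skip tagged requests. The sketch below shows the idea with a request attachment map. The attachment key and class name are hypothetical, and this is not necessarily how our system does it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: marking replayed requests so the recorder can ignore them.
class ReplayTrafficFilter {
    // Hypothetical attachment key; any marker agreed between repeater and recorder would do.
    static final String REPLAY_MARK = "x-traffic-replay";

    /** Repeater side: returns a copy of the attachments with the replay mark set. */
    static Map<String, String> markAsReplay(Map<String, String> attachments) {
        Map<String, String> marked = new HashMap<>(attachments);
        marked.put(REPLAY_MARK, "true");
        return marked;
    }

    /** Recorder side: only requests without the replay mark are recorded. */
    static boolean shouldRecord(Map<String, String> attachments) {
        return !"true".equals(attachments.get(REPLAY_MARK));
    }
}
```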

The project has been online for 3 months and has helped the business line find 3 serious bugs and 6 minor problems, so its value is beginning to show. Although the project is not led by us, we are very happy to be the provider of the underlying service. We hope to find more use cases for our system in the future and let it grow into a big tree with luxuriant branches and leaves.

5. Project results

As of the publication of this article, the project has been online for nearly a year. Five applications have been connected, and recording and playback have been run about four or five hundred times in total. The usage data looks a little shabby, mainly because the company’s business is ToB and there is not much demand for pressure testing. Even so, the system has delivered real value as a pressure testing platform, mainly in two respects:

  1. Performance problem discovery: the pressure testing platform found more than a dozen performance problems for the business line and helped the middleware team find six serious problems in basic components
  2. Efficiency improvement: the new pressure testing system is simple and easy to use. An online traffic recording takes only 10 minutes. Compared with the half a day it used to take one person, efficiency has improved by at least 20 times, and the user experience is much better. One piece of evidence is that more than 90% of pressure testing tasks are now completed on the new platform.

You may have doubts about the efficiency figures, so consider how to obtain online traffic without a recording tool. The traditional approach is for business developers to modify the interface code and add some logging, taking care over the log volume. The changed code is then released online. For large applications a release involves dozens of machines, which is quite time-consuming. Next, the interface parameter data is extracted and cleaned from the log files. Finally, the data is converted into pressure test scripts. Every step of this traditional process takes time. Of course, companies with good infrastructure can get interface data from a full-link tracing platform, but most companies probably still use the traditional approach. On our platform, the user only needs to select the target application and interface, set the recording duration, and click the record button. That is all the user has to do, so the efficiency gain is obvious.

6. Looking ahead

Although the project has been online for a year, manpower is limited and I am basically the only one developing and maintaining it, so iteration is relatively slow. Here are several obvious problems encountered in practice that I hope to solve one by one in the future.

1. Full link node pressure diagram

At present, testers have to open the monitoring pages of many applications on the monitoring platform and switch between them during a pressure test. In the future I hope to display a pressure diagram of every node on the whole link and push node alarms to the testers, reducing the monitoring cost of pressure testing.

2. Collection and visualization of pressure measuring tool status

The pressure testing tool itself has some useful status information, such as the task queue backlog and the current number of processes. This information helps with troubleshooting when the load will not go up. For example, suppose the number of tasks in the queue keeps growing while the number of processes stays high. Can you infer the cause? Most likely the applied load is too high, so RT grows, the load-generating processes (fixed in number) stay blocked for longer, and the queue backs up. Goreplay currently prints this status information to the console, which is inconvenient to view, and there is no alarm function, so it can only be checked passively after a problem appears. I therefore hope to put this status data on the monitoring platform in the future, which would make for a much better experience.
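
The backlog reasoning can be put into numbers with a very rough model: a fixed pool of blocking senders can complete at most (number of senders / RT) requests per second, and anything arriving faster than that piles up in the queue. The sketch below encodes just this arithmetic. The class name and parameters are hypothetical, and the model ignores everything except blocking on the response.

```java
// Illustrative sketch: why a longer RT turns into a queue backlog when senders are fixed.
class PressureModel {

    /** Upper bound on requests per second that a fixed pool of blocking senders can issue. */
    static double maxThroughput(int senders, long responseTimeMillis) {
        return senders * 1000.0 / responseTimeMillis;
    }

    /** Tasks added to the queue per second when tasks arrive faster than they can be sent. */
    static double backlogGrowthPerSecond(double arrivalsPerSecond, int senders, long responseTimeMillis) {
        return Math.max(0.0, arrivalsPerSecond - maxThroughput(senders, responseTimeMillis));
    }
}
```

For example, 100 senders at an RT of 50 ms can push up to 2000 requests per second. If the application slows down to an RT of 200 ms, the same pool manages only 500 per second, so a task stream of 1000 per second grows the queue by 500 tasks every second.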

3. Pressure sensing and automatic regulation

At present, the pressure testing system is not aware of the load on business applications. Whatever state the application under test is in, the system keeps applying load according to the configured settings. Because of the limits of goreplay’s concurrency model, this is not a concern for now. But it cannot be ruled out that goreplay’s concurrency model will change in the future. For example, if a goroutine were started to send a request as soon as there is a task in the queue, that would pose a great risk to business applications.

There are other problems, but they are less important, so I will not list them here. In general, our demand for pressure testing is still small and the QPS involved is not high, so many optimizations are simply not needed yet, for example performance tuning of the load generators and their dynamic scaling. We have four load-generating machines, and the default configuration fully meets our needs, so we have not bothered with these. Of course, from the perspective of improving personal technical ability, these optimizations are still valuable and worth playing with when time allows.

7. Personal gains

7.1 technical harvest

1. Introduction to go language

Because goreplay is written in Go and we did run into problems while using it, we had to dig into the source code. To have better control of the tool and to make troubleshooting and secondary development easier, I learned Go specifically for this. My current level is entry level, a rookie. I had been using Java for a long time, so I was confused when I started learning Go. Take Go’s method definition, for example:

type Rectangle struct {
    Length uint32
    Width  uint32
}

// Area returns the area of the rectangle
func (r *Rectangle) Area() uint32 {
    return r.Length * r.Width
}

At the time the syntax looked very strange to me. What on earth was that declaration in front of the Area method name? Fortunately, I still knew some C. On second thought, how would you do object orientation in C?

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct Rectangle {
    uint32_t length;
    uint32_t width;

    // Member function pointer
    uint32_t (*Area) (struct Rectangle *rect);
};

uint32_t Area(struct Rectangle *rect) {
    return rect->length * rect->width;
}

struct Rectangle *newRect(uint32_t length, uint32_t width)
{
    struct Rectangle *rp = (struct Rectangle *) malloc(sizeof(struct Rectangle));
    rp->length = length;
    rp->width = width;

    // Bind the function to the struct
    rp->Area = Area;
    return rp;
}

int main()
{
    struct Rectangle *rp = newRect(5, 8);
    // The "receiver" is passed explicitly as the first argument
    uint32_t area = rp->Area(rp);
    printf("area: %u\n", area);
    free(rp);
    return 0;
}

If you understand the code above, you will see why Go methods are defined this way.

As I studied further, I found that Go’s syntax really is similar to C’s. It even has pointers. Go truly deserves to be called the C of the 21st century. So while learning I could not help comparing the two and learning Go through my C experience. That is why the following code frightened me.

func NewRectangle(length, width uint32) *Rectangle {
    var rect Rectangle = Rectangle{length, width}
    return &rect
}

func main() {
    fmt.Println(NewRectangle(4, 5).Area())
}

I expected the operating system to mercilessly throw a segmentation fault at me, but the code compiled and ran without any problem. Had I misread it? I looked again and still saw nothing wrong with my reasoning: in C you cannot return a pointer to stack memory, so surely Go could not do this either. This is where the two languages differ. The Rectangle above looks as if it is allocated on the stack, but it is actually allocated on the heap, which is the same as in Java.

Overall, Go’s syntax is similar to C’s, and C was my first programming language, so Go feels familiar and I like it. Its syntax is simple, its standard library is rich and easy to use, and the experience is good. Of course, I am still in the novice village and have not written any large project in Go, so my understanding of the language is shallow. Please forgive any mistakes above.

2. Master the principle of goreplay

I have read basically all of goreplay’s core recording and playback logic, and I have written articles about it on the company intranet. Here is a brief overview of the tool. Goreplay abstracts a few concepts in its design: input and output represent the source and destination of data, and middleware, which sits between the input and output modules, provides the extension mechanism. Inputs and outputs can be combined flexibly and can even form a cluster.

Figure 16: goreplay cluster diagram

In the recording phase, each TCP segment is abstracted as a packet. When the data is large and has to be split into multiple segments for transmission, the receiving end needs to combine those segments in order. It also has to handle out-of-order and duplicate segments, so that complete and correct HTTP data is passed to the next module. All of this logic is encapsulated in tcp_message, and the relationship between tcp_message and packet is one-to-many. The subsequent logic takes the data out of the tcp_message, marks it, and passes it to the middleware (optional) or the output module.
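
The core of that reassembly logic can be sketched as follows. This is a simplified Java illustration of the idea, not a port of goreplay’s Go code. The class name is hypothetical, and sequence number wrap-around and overlapping retransmissions are ignored.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: one message assembled from many captured segments.
class SegmentReassembler {
    // Segments sorted by TCP sequence number, which fixes out-of-order arrival.
    private final TreeMap<Long, byte[]> segments = new TreeMap<>();

    /** Stores a segment by sequence number; a retransmitted duplicate is ignored. */
    void add(long sequenceNumber, byte[] payload) {
        segments.putIfAbsent(sequenceNumber, payload);
    }

    /** Concatenates the payloads in sequence order and stops at the first gap. */
    byte[] assemble(long firstSequenceNumber) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long expected = firstSequenceNumber;
        for (Map.Entry<Long, byte[]> entry : segments.tailMap(firstSequenceNumber, true).entrySet()) {
            if (entry.getKey() != expected) {
                break;                              // a segment is still missing
            }
            byte[] payload = entry.getValue();
            out.write(payload, 0, payload.length);
            expected += payload.length;             // next sequence number = current + payload size
        }
        return out.toByteArray();
    }
}
```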

The playback process is relatively simple, and it still follows the input → middleware → output flow. Usually the input module is input-file and the output module is output-http. An interesting point of the playback stage is how multi-speed playback works: acceleration is achieved by dividing the interval between requests by the speed factor, and the implementation code is very simple.
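
The speed-up principle fits in one small function. The sketch below is a Java illustration of the idea rather than goreplay’s implementation, and the class and method names are hypothetical: given the timestamps at which requests were recorded, it computes how long to wait before sending each subsequent request.

```java
// Illustrative sketch: scaling the gaps between recorded requests by a speed factor.
class ReplayScheduler {

    /**
     * Returns the delay in milliseconds before each request after the first one.
     * A speed factor of 2.0 halves every gap; 0.5 doubles it.
     */
    static long[] delays(long[] recordedTimestampsMillis, double speedFactor) {
        if (speedFactor <= 0) {
            throw new IllegalArgumentException("speed factor must be positive");
        }
        long[] result = new long[Math.max(0, recordedTimestampsMillis.length - 1)];
        for (int i = 1; i < recordedTimestampsMillis.length; i++) {
            long gap = recordedTimestampsMillis[i] - recordedTimestampsMillis[i - 1];
            result[i - 1] = Math.round(gap / speedFactor);
        }
        return result;
    }
}
```

For requests recorded at 0 ms, 1000 ms and 3000 ms, a speed factor of 2 gives delays of 500 ms and 1000 ms.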

Generally speaking, the core code of this tool is not much, but the functions are relatively rich. You can experience it.

3. Have more knowledge of Dubbo framework and class loading mechanism

While implementing Dubbo traffic recording, I read basically all of the decoding-related logic. I had read this logic and written about it before, but customizing the code this time gave me a deeper understanding than just reading the source and writing articles, because I had to deal with real problems. Along the way I had to build a custom class loader, so I also learned more about the class loading mechanism. The routing class loader in particular was a lot of fun. Of course, learning these technologies is no big deal in itself. The key is finding and solving problems.

4. Other harvest

The other gains are relatively small points, so I will not go into them. I will leave them to you as questions.

  1. TCP guarantees that data is delivered to the upper layer in order, so why does goreplay, which works at the application layer, have to deal with out-of-order data?
  2. What is the communication process of the HTTP 1.1 protocol? What is the problem if two HTTP requests are sent back to back on one TCP connection?

7.2 lessons and feelings

1. Be careful in technical selection

I had little experience with technology selection at the beginning, and the dimensions I investigated were few and not comprehensive enough. This led to several problems. First, during the verification stage the tool did not meet expectations, which cost a lot of time. Second, in later iterations I found that goreplay has many small problems and is not as rigorous as it should be. For example, in version 1.1.0 there are many differences between the documentation and the code, so be careful when using it. For another example, I found a resource leak in version 1.3.0-rc1 (#926), which I helped fix (#927). It is normal for an RC version to have problems, but frankly such an obvious one should not occur. Still, considering that the project is maintained by an individual, we cannot ask too much. For users, though, caution is needed: running this kind of program in production is worrying because it is not reliable enough. So for me personally, maturity will come first in future technology selection, and projects maintained by individuals should, as far as possible, not be the first choice.

2. Technical verification shall be comprehensive

No performance test or limit test was carried out during the initial selection, so the problem was found only during online verification. It is embarrassing to find such an obvious problem so late. Technical verification should therefore include performance tests and limit tests from different angles. To be stricter still, follow what Li Yunhua describes in the article “How to use open source projects correctly” and run fault tests, such as killing the process or cutting the power. Do enough work early to avoid being passive later.

3. Sharpening the knife does not delay cutting the firewood

This project involves several different technologies, and the company’s existing development platform could not support it, so packaging and publishing were a hassle. During development and testing the code changed frequently. Packaging it by hand, uploading it to the FTP server (the online machines cannot be accessed directly), and finally deploying it to the recording machine is mechanical and inefficient. So I wrote an automated build script to speed up build and deployment. Practice proved that it works very well. Since then my state of mind has been much more stable, and I rarely enter irritable mode.

Figure 17: rendering of automated build script

Embarrassingly, I did not write the script until the project had gone online, so I missed the benefits of automation in the early stage. In the later iterations, though, the script has been very helpful. Automating compilation and packaging early helps improve work efficiency. Writing tools may seem to take a lot of time, but if you can foresee that something will be repeated many times, the benefit of the tool far exceeds its cost.

8. Write at the end

I was very lucky to take part in and lead this project, and I learned a great deal from it. It is the first project in my career in which I have been deeply involved and that I have iterated on continuously. Watching its features gradually improve, and seeing it provide stable, uninterrupted service and deliver value, makes me very happy and proud as the project leader. There are also some regrets. Because the company’s business is ToB, the requirements on the pressure testing system are not high. The system has now entered a stable period, with few new requirements and no big problems to work on. I could do some technical optimization in my own time, but the effect would be hard to see, since current usage is far from the system’s bottleneck, and premature optimization is not a good idea. I look forward to the company’s business growing and placing higher demands on the pressure testing system, and I would be very happy to keep optimizing it. In addition, I would like to thank the colleagues who took part in the project. Their strong output enabled the project to go online with both quality and quantity within a tight schedule and to serve the business line on time. Well, that is the end of this article. Thank you for reading.

This article is released under the Creative Commons 4.0 license. Please credit the source when reprinting
Author: Tian Xiaobo
Original articles are published first on the author’s personal website: https://www.tianxiaobo.com

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.