Understanding Link Tracking in One Article


Background

In the era of microservices, service-oriented thinking has gradually become the default way of thinking for programmers. However, most projects simply keep adding services without managing them properly. When an interface has a problem, it is then hard to find the root cause in the tangled network of service calls, and the golden window for limiting the damage is missed.

Link tracking emerged to solve this problem. It can pinpoint a problem within a complex web of service calls, and it lets newcomers who join the backend team see clearly which part of the call chain they are responsible for.

In addition, if the latency of an interface suddenly increases, we can analyze the service’s performance bottleneck intuitively, and so scale capacity accurately and sensibly when traffic surges.

Link tracking

The term “link tracking” came into wide use after 2010, when Google published its Dapper paper. The paper describes the principles behind Google’s in-house distributed tracing system and how it stays transparent to applications at low cost.

In fact, Dapper started as a standalone call-chain tracing system and gradually evolved into a monitoring platform. Many tools have since been built on top of that platform, such as real-time alerting, overload protection, and metric queries.

Besides Google’s Dapper, there are other well-known products, such as Alibaba’s EagleEye, Dianping’s CAT, Twitter’s Zipkin, Naver’s Pinpoint (Naver is the parent company of the well-known social app LINE), and SkyWalking, an open-source project that originated in China.

Basic Implementation Principle

To find out which link in an interface’s call path has a problem, you must know which services the interface calls and in what order. Strung together, these services look like a chain, which we call the invocation chain.

To build the invocation chain, we assign an identifier to each call and then sort the calls by that identifier, which makes the order of calls clear. For now, let’s name this identifier spanid.

In practice, we want to look at the calls made by one particular request, so spanid alone is not enough. Each request also needs a unique identifier, so that we can find every service that request invoked. We call this identifier traceid.

With spanid it is easy to tell the order in which services were invoked, but spanid cannot reflect the hierarchical relationship between calls. As shown in the figure below, services may be invoked one after another down a chain, or several of them may be invoked by the same service.

So each call should also record who made it. We name this identifier parentid.
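The three identifiers can be put together in a minimal sketch. The field names traceid, spanid and parentid come from the text above; the record layout, the service names and the helper build_call_tree are assumptions made only for illustration, not the format of any particular tracing system.

```python
from collections import defaultdict

# Hypothetical span records. Only the three identifier names come from the
# article; the service names and values are invented for this example.
spans = [
    {"traceid": "t1", "spanid": 1, "parentid": None, "service": "gateway"},
    {"traceid": "t1", "spanid": 2, "parentid": 1, "service": "order"},
    {"traceid": "t1", "spanid": 3, "parentid": 2, "service": "inventory"},
    {"traceid": "t1", "spanid": 4, "parentid": 1, "service": "user"},
    {"traceid": "t2", "spanid": 1, "parentid": None, "service": "gateway"},
]


def build_call_tree(spans, traceid):
    """Return the invocation chain of one request as indented lines."""
    # traceid selects the request; parentid groups calls under their caller.
    children = defaultdict(list)
    for span in spans:
        if span["traceid"] == traceid:
            children[span["parentid"]].append(span)

    lines = []

    def walk(parentid, depth):
        # Sorting by spanid restores the order of calls among siblings.
        for span in sorted(children[parentid], key=lambda s: s["spanid"]):
            lines.append("  " * depth + f'{span["service"]} (spanid={span["spanid"]})')
            walk(span["spanid"], depth + 1)

    walk(None, 0)  # the root call has no parent
    return lines


for line in build_call_tree(spans, "t1"):
    print(line)
```

In this sample data, order and user are both called by gateway, while inventory is called one level deeper by order, which is exactly the distinction that spanid alone cannot express.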

So far we know the order of the calls and their hierarchy, but when an interface has a problem we still cannot find the faulty link. If a service has a problem, the call to that service will take a long time. To measure elapsed time, the three identifiers above are not enough. We also need a timestamp, which can be as fine-grained as the microsecond level.

Recording only the timestamp at which the call is initiated is not enough to compute elapsed time. We also have to record the timestamp at which the service returns, because a time difference needs both a start and an end. And since the return is recorded as well, it must carry all three identifiers above. Otherwise we cannot tell which call a timestamp belongs to.

Although the total time from invocation to return can now be calculated, it includes both service execution time and network latency. Sometimes we need to separate the two so that optimization can be targeted. So how do we calculate network latency? We can divide the call and its return into the following four events.

  • Client Sent, abbreviated as cs: the client sends the call request to the server.
  • Server Received, abbreviated as sr: the server receives the call request from the client.
  • Server Sent, abbreviated as ss: the server has finished processing and is ready to return the result to the client.
  • Client Received, abbreviated as cr: the client receives the result returned by the server.

If a timestamp is recorded when each of these four events occurs, the elapsed times are easy to calculate: sr minus cs is the network latency of the request, ss minus sr is the service execution time, cr minus ss is the network latency of the response, and cr minus cs is the total time of the whole call.
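The four subtractions can be written out directly. This is only an illustrative sketch: the function name span_timings, the result keys and the sample timestamps are assumptions, and all values are taken to be microseconds.

```python
def span_timings(cs, sr, ss, cr):
    """Split one call into network and execution time (microseconds)."""
    return {
        "request_latency": sr - cs,   # network delay of the request
        "server_time": ss - sr,       # service execution time
        "response_latency": cr - ss,  # network delay of the response
        "total": cr - cs,             # whole call as seen by the client
    }


# Invented timestamps for one call.
timings = span_timings(cs=1_000_000, sr=1_000_400, ss=1_003_400, cr=1_003_900)
for name, micros in timings.items():
    print(name, micros)
```

Note that sr minus cs and cr minus ss compare timestamps taken on two different machines, so they are only meaningful when the client and server clocks are reasonably well synchronized. The total, cr minus cs, and the execution time, ss minus sr, each use a single machine’s clock.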

In fact, besides these parameters, a span block can also record other information, such as the name of the calling service, the name of the called service, the return result, the IP address, and so on. Finally, we merge the records that share the same spanid into one large span block, and the spans together form a complete invocation chain. Interested readers can dig deeper into link tracking. I hope this article is helpful to you.
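The merging step might look like the sketch below, which folds the four event records of one call into a single span block. The field names other than traceid, spanid and parentid, the sample values (including the IP address) and the helper merge_into_spans are all hypothetical.

```python
# Hypothetical event records for a single call. Each event may carry extra
# fields such as service names, the return result, or an IP address.
events = [
    {"traceid": "t1", "spanid": 2, "parentid": 1, "event": "cs",
     "ts_us": 1_000_000, "caller": "gateway"},
    {"traceid": "t1", "spanid": 2, "parentid": 1, "event": "sr",
     "ts_us": 1_000_400, "callee": "order", "ip": "10.0.0.2"},
    {"traceid": "t1", "spanid": 2, "parentid": 1, "event": "ss",
     "ts_us": 1_003_400, "result": "ok"},
    {"traceid": "t1", "spanid": 2, "parentid": 1, "event": "cr",
     "ts_us": 1_003_900},
]

CORE_FIELDS = ("traceid", "spanid", "parentid", "event", "ts_us")


def merge_into_spans(events):
    """Merge records that share a (traceid, spanid) into one span block."""
    spans = {}
    for ev in events:
        key = (ev["traceid"], ev["spanid"])
        span = spans.setdefault(key, {
            "traceid": ev["traceid"],
            "spanid": ev["spanid"],
            "parentid": ev["parentid"],
            "timestamps": {},
        })
        # Keep one timestamp per event name (cs, sr, ss, cr).
        span["timestamps"][ev["event"]] = ev["ts_us"]
        # Copy any extra information onto the merged span.
        for field, value in ev.items():
            if field not in CORE_FIELDS:
                span[field] = value
    return spans


merged = merge_into_spans(events)
print(merged[("t1", 2)])
```

The key here includes traceid as well as spanid, on the assumption that spanid values are only unique within one request.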