Application and practice of link tracking technology


Wen Danqing

Senior Architect, NetEase Smart Enterprise

Link tracking background

As shown in the figure, in a microservice architecture a single request often requires several services to cooperate.

Everything has trade-offs. This model gives us better scalability, but it also brings new problems. Troubleshooting becomes harder: an abnormality in any node may cause abnormalities in the upstream link, and the root cause is difficult to trace. The system topology also becomes complex and hard to control, which leaves hidden risks for its robustness.

In 2010, Google published a paper describing Dapper, its internal link tracking system, and link tracking (more commonly called distributed tracing) has been on the community's radar ever since.

Next, we briefly introduce its application in the field of APM, as well as our practice in service dependency governance and R&D efficiency improvement.

Application in APM

In a distributed system, a request flows through multiple nodes. APM uses a trace ID (traceid) to associate all the nodes on the request processing link, and records the execution time and other information of each node to form a link that covers the request's whole life cycle.

As shown in the figure, we can see at a glance which nodes the request passes through and how long each node takes. Besides watching the running state of each service, we can now examine every detail and metric on the whole request link from the perspective of the request life cycle, which greatly improves the efficiency of troubleshooting and locating problems.
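
The idea of associating nodes through a shared trace ID and recording per-node timing can be sketched as follows. This is a minimal illustration, not the system described in this article: the `Span` and `Tracer` classes and their fields are hypothetical names chosen for the example.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One node's record on the request link (hypothetical structure)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float = 0.0


@dataclass
class Tracer:
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, service, trace_id=None, parent_id=None):
        # Reuse the incoming trace ID if there is one; otherwise this node starts a new trace.
        s = Span(trace_id or uuid.uuid4().hex, uuid.uuid4().hex[:16], parent_id, service)
        start = time.perf_counter()
        try:
            yield s
        finally:
            # Record the execution time of this node when it finishes.
            s.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(s)


tracer = Tracer()
with tracer.span("gateway") as root:
    # The downstream node receives the trace ID and the caller's span ID.
    with tracer.span("ticket", root.trace_id, root.span_id) as child:
        pass  # the real work of the downstream service would happen here
```

Because every span carries the same `trace_id` and a `parent_id`, the recorded spans can later be assembled into the request's call tree.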

Service dependency governance

Unreasonable dependencies may allow the failure of an edge system to bring down core services and threaten the overall stability of the distributed system. By aggregating and analyzing link tracking data, we can draw the dependency topology between systems and provide data support for dependency governance.
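
As a rough sketch of how link tracking data can be aggregated into a dependency topology, the function below counts caller-to-callee service pairs. The span tuple layout `(trace_id, span_id, parent_span_id, service)` and the sample data are assumptions made for this example.

```python
from collections import Counter

# Hypothetical span records: (trace_id, span_id, parent_span_id, service)
spans = [
    ("t1", "a", None, "gateway"),
    ("t1", "b", "a", "ticket"),
    ("t1", "c", "b", "auth"),
    ("t2", "d", None, "gateway"),
    ("t2", "e", "d", "ticket"),
]


def dependency_edges(spans):
    """Count caller -> callee service pairs across all traces."""
    # Index every span by (trace, span ID) so a child can look up its parent's service.
    service_of = {(t, s): svc for t, s, _, svc in spans}
    edges = Counter()
    for trace_id, _, parent_id, service in spans:
        caller = service_of.get((trace_id, parent_id))
        # Skip root spans and calls that stay inside one service.
        if caller is not None and caller != service:
            edges[(caller, service)] += 1
    return edges


edges = dependency_edges(spans)
```

The resulting edge counts form a directed graph of services, which is the input that the dependency checks need.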

We generally evaluate whether a service dependency is reasonable from the following three perspectives:

  • Reverse dependency. A reverse dependency means that a high-level service relies on a low-level service. For example, the tenant service is one of our core services, while the statistics service is less important, so we do not allow the tenant service to depend on the statistics service. By combining the service topology with service levels, we can easily automate reverse dependency analysis and raise real-time early warnings.
  • Strong and weak dependency. A strong dependency means that an exception in a downstream service affects the stability of the current node. At design time we should fully consider whether the strong dependency is necessary in the current scenario, whether it can be weakened, and if not, whether the business scenario allows protective measures such as circuit breaking and degradation. By sorting out strong and weak dependencies and combining them with fault injection tools, we can produce systematic reports.
  • Circular dependency. A circular dependency is often a symptom of unclear boundaries, tangled responsibilities, and unclear layering. Sorting out circular dependencies also means sorting out business and system boundaries, which is of great significance for the overall health of the system.
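
Two of these checks, reverse dependency and circular dependency, can be automated on the dependency graph. The sketch below assumes a hypothetical `levels` mapping in which a lower number means a more important service, and uses a depth-first search to find one cycle; the service names and edges are made up for illustration.

```python
# Hypothetical inputs: a lower number means a more important (higher-level) service.
levels = {"tenant": 0, "ticket": 1, "statistics": 2}
edges = [("ticket", "tenant"), ("tenant", "statistics"), ("statistics", "ticket")]


def reverse_dependencies(edges, levels):
    """Edges where a higher-level caller depends on a lower-level callee."""
    return [(a, b) for a, b in edges if levels[a] < levels[b]]


def find_cycle(edges):
    """Return one dependency cycle as a list of services, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {}  # node -> "visiting" (on the current path) or "done"
    path = []

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                # Reached a node already on the current path: that closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        state[node] = "done"
        path.pop()
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None
```

With the sample data, `reverse_dependencies` flags the tenant service calling the statistics service, and `find_cycle` reports the loop through all three services. Strong and weak dependency analysis is not covered here because it relies on fault injection rather than on the graph alone.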

Improving R&D efficiency

As the business grows, the R&D team expands at certain stages, but the infrastructure that supports our R&D activities cannot grow linearly with it. The most important piece of that infrastructure is the joint debugging or test environment.

Business growth usually means more parallel iterations, and these iterations inevitably change the same services, especially core basic services, as shown in the figure.

This leads to two problems:

1. Environment contention. As shown in the figure, story-b needs to deploy the ticket service, while hotfix-a is waiting for verification and also needs to deploy the ticket service. At least one of them will be blocked and have to wait. This serial mode greatly reduces our delivery efficiency, and the more parallel iterations there are, the more obvious the loss.

2. Environment stability. Services are interrelated, so the instability of any one service may make the whole environment unstable. The auth service in the figure is used by almost every business process. If story-a's restart or deployment of the auth service is not smooth enough, or story-a's code contains bugs, the whole test environment becomes unstable.

When the project is small, we can often coordinate through management measures. For example: serialize versions, staggering iteration plans so that the same service is not deployed by two iterations at the same time; deploy only one specific branch to the test environment, and merge each iteration's code into that branch when verification is required; or require that code deployed to the test environment meet certain standards, to improve the stability of the test environment.

However, the effectiveness of management measures is inversely related to the size of the team and the number of microservices. We need technical means to achieve better results.

On closer thought, the root cause is that we all share one test environment, so our R&D activities compete for resources, and operating on one service may affect other services.

So, can everyone easily create their own environment, with the environments not affecting one another?

As shown in the figure, story-a needs to deploy the user and auth services, so we create env-1 and deploy user and auth there; story-b needs to deploy the ticket service, so we create env-2 and deploy the ticket service there; env-3 works the same way.

For convenience, we call each env-x environment in the figure a test environment, and the lower part of the figure the regression environment. A test environment contains only the applications to be deployed in the current iteration, while the regression environment contains all applications.

With this mechanism, we expect requests from env-1 users to be routed only to env-1 when they call the user and auth services, and to the regression environment when they call any other service. Likewise, requests from env-2 users are routed only to env-2 when they call the ticket service, and to the regression environment when they call other services.

That is, for a request from a user of a given environment, if the target application is deployed in that environment, the request is processed only by the application in that environment; otherwise it is routed to the regression environment.

The regression environment is a relatively stable environment that contains all services. Development and test builds may not be deployed there, which keeps it sufficiently stable.

In terms of the R&D process, developers no longer deploy to a shared environment as before. Each creates an exclusive environment and completes the relevant work there.

We call the above mechanism environment isolation. To achieve it, the technical side needs at least two capabilities:

  • Identify and transfer the environment information corresponding to the request
  • Intervene in the instance selection / consumption rules of middleware

Identify and transfer the environment information corresponding to the request

First, we need to be able to associate the request with the test environment.

Identifying the environment of a request means that when we create a test environment we specify a certain identity, one that can also be extracted from the request, so that the two can be associated by matching the identity on both sides. This identity can be a user account, a group of IP addresses, or an enterprise tenant. Which one is used does not matter much; what matters is choosing one that fits the business characteristics and is convenient to use.

For example, in our SaaS system we use the tenant ID as the identity. When creating a test environment on our platform, in addition to specifying the applications to deploy, we also enter the tenant information, which establishes the mapping between requests and the test environment.

At the unified entry point of requests (for example, the gateway), we can get the tenant to which the request belongs, and from it the environment information (environment ID, application list). Just as the link tracking system transmits the traceid along the request chain, we attach the environment information to the request chain, which lays the foundation for environment isolation.
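
A minimal sketch of this step, assuming the environment information travels as request headers next to the trace ID. The header names (`x-env-id`, `x-env-apps`, `x-trace-id`), the `ENV_BY_TENANT` registry, and the function names are all hypothetical and not taken from the system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvInfo:
    env_id: str
    apps: tuple  # applications deployed in this test environment


# Hypothetical registry, filled in when a test environment is created on the platform.
ENV_BY_TENANT = {"tenant-42": EnvInfo("env-1", ("user", "auth"))}

# Hypothetical header names used to carry the environment information.
ENV_ID_HEADER = "x-env-id"
ENV_APPS_HEADER = "x-env-apps"


def attach_env(headers, tenant_id):
    """At the gateway: add environment info if the tenant maps to a test environment."""
    out = dict(headers)
    env = ENV_BY_TENANT.get(tenant_id)
    if env is not None:
        out[ENV_ID_HEADER] = env.env_id
        out[ENV_APPS_HEADER] = ",".join(env.apps)
    return out


def downstream_headers(incoming):
    """In each service: copy trace and environment headers onto outgoing calls."""
    keep = ("x-trace-id", ENV_ID_HEADER, ENV_APPS_HEADER)
    return {k: v for k, v in incoming.items() if k in keep}
```

Requests from tenants without a test environment carry no environment headers, which downstream components can treat as "use the regression environment".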

Intervene in the instance selection / consumption rules of middleware

Let’s take the most commonly used components in a microservice architecture as examples to discuss how environment isolation is implemented.

RPC framework

The core RPC process: a provider instance registers itself in the registry; the consumer obtains the list of provider instances from the registry and selects one to call according to an instance filtering strategy and a load balancing algorithm.

The transformation is therefore straightforward. When a provider starts, it writes its environment ID into its registration metadata. During instance selection, we read the environment ID from the link data of the request and match it against that metadata.

Note that when no instance matches, we cannot simply conclude that there is no provider and let the program report an error. We need to check whether the provider's application is in the application list of the corresponding environment. If it is not, we route the request to the regression environment.
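
The selection rule above might look like the following sketch. It is not tied to any particular RPC framework: the `Instance` class, the `env` metadata key, the `REGRESSION` constant, and the use of random choice in place of a real load balancing algorithm are all assumptions for illustration.

```python
import random
from dataclasses import dataclass, field

REGRESSION = "regression"  # hypothetical ID of the regression environment


@dataclass
class Instance:
    app: str
    address: str
    metadata: dict = field(default_factory=dict)  # written by the provider at startup


class NoProviderError(Exception):
    pass


def select_instance(instances, app, env_id, env_apps, rng=random):
    """Pick a provider of `app` for a request that carries (env_id, env_apps)."""
    candidates = [i for i in instances if i.app == app]
    # Only applications listed in the request's environment are served there;
    # every other application falls back to the regression environment.
    wanted = env_id if env_id and app in env_apps else REGRESSION
    matched = [i for i in candidates if i.metadata.get("env") == wanted]
    if not matched:
        # A genuine error: the environment that should serve this call has no instance.
        raise NoProviderError(f"no provider for {app} in {wanted}")
    return rng.choice(matched)  # stand-in for the real load balancing algorithm
```

For example, a request tagged with env-1 (applications user and auth) that calls the ticket service is sent to a regression instance, while a call to the user service only ever reaches the env-1 instance.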

Message-oriented middleware

RPC has the instance filtering step described above before each call. Message middleware has no such step, but we can still intervene in the consumption rules: after a consumer receives a message, it decides whether to consume or discard it. Take Kafka as an example. When a Kafka consumer in a test environment starts, its consumer group ID is changed to groupid_${env}. When the consumer receives a message, it applies logic similar to the RPC instance filtering above.
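
The consume-or-discard decision can be sketched without any Kafka client code. The example assumes the environment ID and application list travel with each message (for instance in message headers) and that regression consumers keep the original group ID; the function names and the `REGRESSION` constant are hypothetical.

```python
REGRESSION = "regression"  # hypothetical ID of the regression environment


def consumer_group_id(base_group_id, env_id):
    """Give each test environment its own consumer group, so it sees every message."""
    if env_id == REGRESSION:
        return base_group_id
    return f"{base_group_id}_{env_id}"


def should_consume(app, own_env, msg_env_id, msg_env_apps):
    """Decide whether the consumer of `app` running in `own_env` handles this message."""
    # Same rule as RPC instance filtering: the message belongs to its test
    # environment only if this application is deployed there.
    if msg_env_id and app in msg_env_apps:
        target = msg_env_id
    else:
        target = REGRESSION
    return own_env == target
```

Because each environment consumes with a separate group ID, every environment receives each message once, and exactly one of them (the target) processes it while the others discard it.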

Scheduled tasks

Scheduled tasks are the most special case. As mentioned earlier, the environment information of a request is looked up at the unified entry point and written to the link. However, a request initiated by a scheduled task is not triggered by a user. It comes from inside the system, and the scheduling component is the “source” of the request. We therefore need to add the identification logic at the start of scheduled task execution, which requires the following:

  • Scheduled tasks need a unified scheduling platform, so that business modules do not each schedule tasks in their own way, which general components could not handle uniformly
  • The task distribution / sharding mechanism of the scheduling component is transformed: the execution layer is abstracted in a unified way and the environment isolation logic is added to it
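
One way to picture a unified execution layer is a wrapper that resolves the environment before each task shard runs. The sketch below assumes that tasks are sharded by tenant, which is an assumption for this example rather than something stated above; `ENV_BY_TENANT`, `run_shard`, `dispatch`, and `report_task` are hypothetical names.

```python
import contextvars

# Environment of the work currently executing; None means "regression environment".
current_env = contextvars.ContextVar("current_env", default=None)

# Hypothetical tenant -> environment mapping, as registered on the platform.
ENV_BY_TENANT = {"tenant-42": "env-1"}


def run_shard(task, tenant_id):
    """Unified execution layer: resolve the environment, then run the task shard."""
    token = current_env.set(ENV_BY_TENANT.get(tenant_id))
    try:
        # Calls made inside the task can read current_env and propagate it downstream.
        return task(tenant_id)
    finally:
        current_env.reset(token)


def dispatch(task, tenant_ids):
    """Scheduler side: distribute one shard per tenant through the execution layer."""
    return [run_shard(task, t) for t in tenant_ids]


def report_task(tenant_id):
    """Example task that only reports which environment it ran under."""
    return (tenant_id, current_env.get())
```

Since every task goes through `run_shard`, the scheduling component plays the same role for scheduled work that the gateway plays for user requests.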


At NetEase Smart Enterprise, link tracking technology has not only improved our troubleshooting efficiency and our control over request links, but also provided the data needed for service dependency governance. At the same time, environment isolation has greatly improved delivery efficiency and the stability of the test environment, and with them the overall well-being of the R&D team.

Overall, we have built a unified link tracking system that supports service dependency governance and environment isolation, but this is not the end point. We can explore more scenarios, such as multi-tenant resource isolation in SaaS systems, or exception monitoring and early warning. The world is big. Let’s explore more together.