“Microservice Wars” is a series on microservice design thinking. It focuses on the contradictions and conflicts that appear after adopting microservices rather than on any single knowledge point. Questions and suggestions are welcome.
After living through the P0 incident in Microservice War: cascading failures and avalanches, you finally catch your breath and start a retrospective. Looking back at the troubleshooting, there was no supporting infrastructure yet, so once customer feedback came in you investigated the problem through the error logs.
But during a cascading failure there are far too many error logs. Entries from different services and different call chains are all mixed together, so most of the repair time goes into paging through logs, and it takes several pages to find a reasonably useful error message.
If a similar problem happens again, that will be painful: the MTTR is too long, and the downtime budget of four nines will be used up quickly. At this point you remember a powerful tool that the industry often mentions, the “distributed link tracing system”. Roughly speaking, it lets you see the call dependencies between applications.
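To make the "four nines" worry concrete, here is a minimal arithmetic sketch. It assumes a 365-day year and treats every incident as costing exactly one MTTR of downtime; the function names are hypothetical and only serve the illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, leap years ignored (assumption)


def downtime_budget_minutes(nines: int) -> float:
    """Yearly downtime allowed by an availability of `nines` nines (4 means 99.99%)."""
    return MINUTES_PER_YEAR / (10 ** nines)


def incidents_within_budget(nines: int, mttr_minutes: float) -> int:
    """Number of whole incidents of the given MTTR that fit into the yearly budget."""
    return int(downtime_budget_minutes(nines) // mttr_minutes)
```

Under these assumptions four nines leave roughly 52.56 minutes per year, so a single incident that takes half an hour to repair already consumes more than half of the budget.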
The best known is Dapper, introduced in the Google Dapper paper. To cope with the software complexity caused by different teams, languages, modules, servers and data centers, Google built a distributed tracing system.
Since then it has inspired the industry's work on distributed tracing. Many well-known distributed link tracing systems are based on the Google Dapper paper, and their basic principles and architectures are similar. If you are interested, read the Google Dapper paper; it is very interesting.
(The concepts of the trace tree and the span come from Google Dapper.)
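As a rough illustration of the trace tree and span idea, the sketch below models a span as a record that carries a trace ID, its own ID and its parent's ID, and rebuilds the tree from a flat list. The field names and the `render_tree` helper are assumptions made for this example, not the data model of Dapper or of any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    name: str                 # e.g. the service or operation name


def render_tree(spans: List[Span]) -> List[str]:
    """Rebuild the trace tree from a flat span list and return indented lines."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    roots = children[None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root span")

    lines: List[str] = []

    def walk(span: Span, depth: int) -> None:
        lines.append("  " * depth + span.name)
        for child in children[span.span_id]:
            walk(child, depth + 1)

    walk(roots[0], 0)
    return lines
```

Spans can arrive in any order from different services; the parent ID alone is enough to put them back into a tree, which is what makes the call dependencies visible.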
Technology selection: what are the options?
If you want link tracing, you will most likely pick an open source product as your distributed link tracing system rather than build a brand new one, because delivering the business goal comes first. A search on the Internet turns up the following products:
- Elastic Stack: Elastic APM.
- Apache: SkyWalking.
- Naver: Pinpoint.
- Alibaba: EagleEye.
- Dianping: CAT.
- JD: Hydra.
There are clearly many such products, and it is said that every major company has its own internal link tracing system, so you are stuck. They all evolved from Google Dapper, so what really distinguishes them? Why have so many new products emerged?
Let's first look at Jaeger, developed by Uber. Jaeger is currently hosted by the Cloud Native Computing Foundation (CNCF) and is the seventh top-level project of CNCF (graduated in October 2019). Its main components are:
- Jaeger client: a language-specific implementation of the OpenTracing API. It can be used to instrument applications for distributed tracing, either manually or through existing open source frameworks that are already integrated with OpenTracing (such as Flask, Dropwizard, gRPC, etc.).
- Jaeger agent: a daemon that listens for spans sent over UDP and forwards them to the collector in batches.
- Jaeger collector: as the name suggests, it receives data from the agents and collects and manages the trace information.
- Jaeger query: data query and front-end display.
- Jaeger ingester: reads data from Kafka and writes it to another storage backend (Cassandra, Elasticsearch).
With the role of each Jaeger component clear, we can focus on the data flow through the overall architecture.
Jaeger has a classic architecture. The client actively sends trace information to the agent, the agent reports it to the collector, and the data then passes through a queue and finally lands in storage. A separate visual management console is used to view and analyze it.
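The reporting path described above can be sketched as a small in-memory simulation. This is not Jaeger's code: the `Agent` and `Collector` classes, the `batch_size` parameter and the dictionary span format are all assumptions, and the UDP transport, the queue and the real storage backend are deliberately left out.

```python
from typing import Dict, List

Span = Dict[str, str]  # hypothetical span representation for this sketch


class Collector:
    """Stands in for the collector plus storage: it simply keeps what it receives."""

    def __init__(self) -> None:
        self.storage: List[Span] = []
        self.batches_received = 0

    def submit_batch(self, batch: List[Span]) -> None:
        self.batches_received += 1
        self.storage.extend(batch)


class Agent:
    """Buffers spans reported by clients and forwards them to the collector in batches."""

    def __init__(self, collector: Collector, batch_size: int) -> None:
        self._collector = collector
        self._batch_size = batch_size
        self._buffer: List[Span] = []

    def report(self, span: Span) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._collector.submit_batch(self._buffer)
            self._buffer = []
```

Batching on the agent side keeps the client's hot path cheap: the application only hands a span to a local process, and the cost of talking to the collector is paid once per batch rather than once per span.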
This is the standardized pipeline that modern tracing systems share: reporting, collection, storage and analysis. You will also find that Jaeger and Zipkin are similar in architecture:
- Zipkin collector: collects and manages the trace information.
- Storage: Zipkin's data storage, which supports Cassandra, Elasticsearch, MySQL and other third-party storage.
- Zipkin query service: once data is stored and indexed, it is used to find and retrieve traces.
- Web UI: data query and front-end display.
Jaeger appeared four years after Zipkin. Was Uber simply reinventing the wheel? The main reasons for building Jaeger were as follows:
At that time, the only way to send spans to Zipkin was through Scribe, and the only high-performance data store that Zipkin supported was Cassandra. Uber had no experience with either technology, so it chose to build its own back end, combining some custom components with the Zipkin UI to form a complete tracing system.
For more details, read Evolving Distributed Tracing at Uber Engineering.
Alibaba EagleEye
Another representative kind of link tracing system is based on logs and stream computing, such as Alibaba's EagleEye and Didi's trace system, as shown in the following figure:
For specifics, see the conference talks Alibaba EagleEye technology decryption and Link tracking in heterogeneous systems: Didi trace practice; they will not be repeated here. Both are recommended reading for the curious.
For an initial selection, most teams choose the tracing system that best fits their own stack. For example, Jaeger is written in Go, while Zipkin and SkyWalking belong mostly to the Java family. All of them are compatible with OpenTracing and all descend from Google Dapper, although their architectures differ somewhat. What matters in practice is therefore the basic feature set and how polished the query pages are.
If you already have N existing systems, plugging them directly into a new link tracing system is very troublesome, because the original purpose of adopting it is to troubleshoot and locate problems in the existing systems, not just in new ones. From the integration point of view, most teams therefore do not adopt an existing open source tracing system as is (unless the historical debt is small), and the amount of data may also be very large.
It is therefore quite common to adapt what already exists: clean the existing data and then build link tracing on top of it. Logs are often a good starting point, that is, clean the log data, build a new analysis system on it, and end up with an in-house wheel.
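A minimal sketch of the log-cleaning approach follows. It assumes a made-up log format of the form `<timestamp> <service> trace_id=<id> <message>`; real logs will look different, and the regular expression and function names here are hypothetical. The idea is only to show how existing logs can be grouped into per-request call paths.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical log format: "<timestamp> <service> trace_id=<id> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d+) (?P<service>\S+) trace_id=(?P<trace_id>\S+) (?P<message>.*)$"
)

Event = Tuple[int, str, str]  # (timestamp, service, message)


def clean_logs(lines: Iterable[str]) -> Dict[str, List[Event]]:
    """Drop lines that do not match the format and group the rest by trace ID."""
    traces: Dict[str, List[Event]] = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # unstructured noise is discarded during cleaning
        traces[match["trace_id"]].append(
            (int(match["ts"]), match["service"], match["message"])
        )
    # Sort each trace by timestamp so events read in call order.
    return {trace_id: sorted(events) for trace_id, events in traces.items()}


def call_path(events: List[Event]) -> List[str]:
    """Collapse consecutive events from the same service into a service path."""
    path: List[str] = []
    for _, service, _ in events:
        if not path or path[-1] != service:
            path.append(service)
    return path
```

This only works if every service already writes a shared request identifier into its logs; where that is missing, adding it is usually the first step of the transformation.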
In addition, non-intrusive link tracing based on a service mesh has become popular over the last two years and looks like a promising direction. One representative is Jaeger, which is hosted by CNCF and is also compatible with Zipkin, so Jaeger has the advantage here.