Author: Jiang Mingming
Source: Erda official account
Reading guide: To help you better understand the design and implementation of the APM system in MSP, we are writing a series of articles on microservice observability that digs into the products, architecture design, and underlying technology of the APM system. This is the third article in the series. It covers how Telegraf's data-processing pipeline works and how its plug-ins are implemented.
Articles in the microservice observability series:
- From monitoring to observability, where are we finally going
- Only after I got started did I know that this dashboard system is really cool to use
- Understanding Telegraf, a sharp tool for metrics, in one article (this article)
Telegraf is a very popular open-source metrics collection agent from InfluxData, with tens of thousands of stars on GitHub. With the help of the community, it has more than 200 collection (input) plug-ins and more than 40 export (output) plug-ins, covering almost every kind of monitoring, including machine monitoring, service monitoring, and even hardware monitoring.
Pipeline concurrent programming
In Go, the pipeline is a common concurrency pattern. In short, a pipeline is a series of stages connected by channels, where each stage is a group of goroutines running the same function.
In each stage, the goroutines are responsible for the following:
- Receive the data generated by the upstream stage through the inbound channel.
- Process the data, for example format conversion, filtering, or aggregation.
- Send the processed data to the downstream stage through the outbound channel.
Each stage can have any number of inbound and outbound channels, except the first and last stages, which have only outbound and only inbound channels respectively.
Implementation in Telegraf
Telegraf adopts this pattern. Its pipeline has four main stages: inputs, processors, aggregators, and outputs.
- Inputs: collect the raw monitoring metrics, through either active or passive collection.
- Processors: process the data collected by inputs, for example deduplication, renaming, and format conversion.
- Aggregators: aggregate the data emitted by processors and compute the aggregated values.
- Outputs: receive the data emitted by processors or aggregators and export it to other media, such as files or databases.
These stages are also connected to each other by channels. The architecture is as follows:
As the diagram shows, Telegraf as a whole follows the pipeline pattern. Here is a brief look at how it operates:
- The first stage is inputs. Each input runs in its own goroutine, collects data, and fans in to a shared channel.
- The second stage is processors. Each processor runs in its own goroutine, and the processors are chained to one another by channels in order.
- The third stage is aggregators. Each aggregator runs in its own goroutine; the data produced by processors is consumed and fanned out to each aggregator.
- The last stage is outputs. Each output runs in its own goroutine; the data produced by processors or aggregators is consumed and fanned out to each output.
Fan-in: multiple functions write to one channel, and a single function reads from that channel until it is closed.
Fan-out: multiple functions read from the same channel until it is closed.
Plug-in design
With so many input, output, and processor plug-ins, how does Telegraf manage them efficiently? And how is the plug-in system designed to cope with ever-growing extension requirements? Let's go through it step by step.
In fact, a plug-in here is not a plug-in in the usual sense (a dynamic link library that is loaded and bound at runtime), but a variant of the factory pattern. First, let's look at Telegraf's plug-in directory structure:
plugins
├── aggregators
│   ├── all
│   ├── basicstats
│   ├── registry.go
...
├── inputs
│   ├── all
│   ├── cpu
│   ├── registry.go
...
├── outputs
│   ├── all
│   ├── amqp
│   ├── registry.go
...
├── processors
│   ├── all
│   ├── clone
│   ├── registry.go
...
As shown above, the directory structure is regular. We use the inputs plug-ins as the example below; the other modules are implemented similarly.
- plugins/inputs: contains one package directory per input plug-in.
- plugins/inputs/all: imports every plug-in package (mainly to avoid circular imports).
- plugins/inputs/registry.go: holds the registry and its related functions.
Telegraf declares an Input interface that represents an input plug-in.
To add a plug-in, create a package under the plugins/inputs/ directory, such as cpu, and implement the Input interface.
Finally, we only need to register the plug-in's factory function in the global registry.
In this way, the many plug-ins are managed in an orderly manner. Extension is also convenient: you only need to implement the Input interface and register the factory function.
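A factory registry of this kind can be sketched in a few lines. The names below (Creator, Add, New, registry) are chosen for this example and collapsed into a single file so that it runs on its own. The Input interface is trimmed to one method to keep the sketch short. In Telegraf's layout, the init function would live in the plug-in's own package and run when the all package imports it.

```go
package main

import "fmt"

// Input is trimmed to a single method for this sketch.
type Input interface {
	Description() string
}

// Creator is a factory function that returns a fresh plug-in instance.
type Creator func() Input

// registry maps plug-in names to their factory functions.
var registry = map[string]Creator{}

// Add registers a factory under a name; plug-ins call it from init().
func Add(name string, creator Creator) {
	registry[name] = creator
}

// CPUStats is an illustrative plug-in.
type CPUStats struct{}

func (*CPUStats) Description() string { return "Read metrics about cpu usage" }

// init runs before main, so the plug-in is registered as soon as its
// package is linked into the binary.
func init() {
	Add("cpu", func() Input { return &CPUStats{} })
}

// New looks a plug-in up by name, as a configuration loader might.
func New(name string) (Input, error) {
	creator, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("undefined input plug-in: %s", name)
	}
	return creator(), nil
}

func main() {
	plugin, err := New("cpu")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(plugin.Description()) // prints Read metrics about cpu usage

	_, err = New("nope")
	fmt.Println(err) // prints undefined input plug-in: nope
}
```

Registering a factory rather than an instance means every lookup yields a new, independently configured plug-in, and adding a plug-in never requires touching the registry code itself.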
Application in Erda
In Erda, we use Telegraf as the platform's metrics collection service and deploy it on every physical machine as a daemon. It is now widely used in production, runs stably on thousands of machines, and collects and reports a large volume of metrics that SREs and operations staff use for analysis and troubleshooting.
Because of some special requirements, we had to do secondary development on top of Telegraf to better fit our business needs. Even so, thanks to Telegraf's powerful plug-in system, we often only need to add plug-ins. For example, we added an output plug-in that reports to our own collector, and an input plug-in that checks the health of Erda's own components.
In the future, we will gradually phase out this secondary development, embrace open source, and stay as consistent as possible with the official open-source version of Telegraf, so that we can give back to the community.
References:
- Go Concurrency Patterns: Pipelines and cancellation
- Telegraf project address
- Talking about Go's factory pattern with a real project
If you have any questions, you are welcome to add our assistant on WeChat (erda202106) to join the discussion group.