Powerdotnet platform software architecture design and implementation series (11): log platform

Time: 2022-05-30

Almost all back-end applications record logs, so the logging system can be abstracted out and provided as a service.

The recent Log4j2 episode left developers like me able only to sigh a few times: log handling is genuinely hard, and only those who have struggled with it know the hardships.

Before implementing the powerdotnet logging system, I studied the logging solutions of Flume, ELK, Scribe, and Kafka. After comparison, Facebook's Scribe was chosen as the model, and a lock-free, asynchronous, scalable logging system, Power.Xlogger, was built on the Thrift protocol (it also supports HTTP).

This article describes powerdotnet's built-in logging platform.

Environment preparation

1. (required) .NET Framework 4.5+

2. (required) MySQL, SQL Server, PostgreSQL, MariaDB, MongoDB, or Elasticsearch

3. (required) powerdotnet database management platform, mainly for its dbkey function

4. (required) powerdotnet configuration center Power.ConfigCenter

5. (required) powerdotnet registry Power.RegistryCenter

6. (required) powerdotnet basic data platform Power.BaseData

7. (required) powerdotnet caching platform Power.Cache

8. (required) powerdotnet messaging platform Power.Message

9. (required) powerdotnet personnel management platform Power.HCRM, covered in detail in a later article

1、 About scribe

Scribe is Facebook’s open source log collection system. It can collect logs from various log sources and store them on a central storage system (NFS, distributed file system, etc.) for centralized statistical analysis and processing.

Scribe provides an extensible and error tolerant solution for “distributed collection and unified processing” of logs.

Scribe collects data from various data sources, puts it on a shared queue, and then pushes it to the back-end central storage system.

When the central storage system fails, scribe temporarily writes the log to a local file; after the central storage system recovers, scribe resumes transmitting the local log to it.

Scribe mainly consists of three parts: the scribe agent, the scribe server, and the storage system.

1、 Scribe Agent

The scribe agent is actually a Thrift client, and the only way to send data to scribe is through this Thrift client. Scribe internally defines a Thrift interface that users call to send data to the server.
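For concreteness, below is a minimal C# sketch of such a client. It assumes the scribe.Client and LogEntry classes that the Thrift compiler generates from scribe's interface definition, plus the classic Apache Thrift C# runtime; scribe conventionally listens on port 1463 and expects a framed binary transport.

```csharp
// Minimal scribe-client sketch. Assumes scribe.Client and LogEntry are
// generated by the Thrift compiler from scribe's interface definition,
// and that the classic Apache Thrift C# runtime is referenced.
using System.Collections.Generic;
using Thrift.Protocol;
using Thrift.Transport;

class ScribeClientSketch
{
    static void Main()
    {
        // Scribe conventionally listens on port 1463 and expects a
        // framed transport over a plain socket.
        TTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 1463));
        transport.Open();
        try
        {
            var client = new scribe.Client(new TBinaryProtocol(transport));

            // Each entry carries a category and a message body; the
            // server routes categories to stores per its configuration.
            var entries = new List<LogEntry>
            {
                new LogEntry { Category = "app_access", Message = "GET /health 200" }
            };
            client.Log(entries);
        }
        finally
        {
            transport.Close();
        }
    }
}
```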

2、 Scribe

Scribe receives the data sent by Thrift clients and, according to its configuration file, routes data of different categories to different stores. Scribe provides a variety of stores, such as file and HDFS, into which it can load the data.

3、 Store

The store in scribe is the storage system as we understand it. Currently, scribe supports many stores, including:

File

Buffer (two-layer storage: one primary store and one secondary store)

Network (another scribe server)

Bucket (contains multiple stores; data is hashed into the different stores)

Null (discards the data)

Thriftfile (writes to a Thrift TFileTransport file)

Multi (writes the data to several stores at the same time)

2、 Log storage

The design of powerdotnet's log platform draws on scribe: it likewise collects data from various sources, puts it on a shared queue, and then pushes or pulls it to the back-end central storage system.

However, to cover different application scenarios, this shared queue is designed as a dynamically configurable log container. The container can be any of several mainstream message queues (RabbitMQ, MSMQ, RocketMQ, Kafka, etc.), Redis, a local cache, and so on.

When the storage system fails, powerdotnet likewise temporarily serializes the log "messages" to a local file; after the storage system recovers, it deserializes the local logs and resumes transmitting them to the central storage system. This fault-tolerance idea is simple and straightforward, and very easy to understand.
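As an illustration (not the actual Power.Xlogger code), the idea can be sketched like this: try the central store first, spool serialized entries to a local file on failure, and replay the file once the store recovers.

```csharp
// Illustrative sketch of the write-local-then-replay fault tolerance;
// names and the spool-file path are invented for this example.
using System;
using System.IO;

class FallbackWriterSketch
{
    private const string SpoolFile = "xlogger-spool.log"; // hypothetical path

    // Normal path: write to the central store; on failure, serialize
    // the entry to the local spool file instead.
    public void Write(string serializedEntry, Action<string> writeToCentralStore)
    {
        try
        {
            writeToCentralStore(serializedEntry);
        }
        catch (Exception)
        {
            File.AppendAllText(SpoolFile, serializedEntry + Environment.NewLine);
        }
    }

    // Recovery path: forward the spooled entries to the central store,
    // then clear the local file.
    public void Replay(Action<string> writeToCentralStore)
    {
        if (!File.Exists(SpoolFile)) return;
        foreach (var line in File.ReadAllLines(SpoolFile))
            writeToCentralStore(line);
        File.Delete(SpoolFile);
    }
}
```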

The built-in log storage media (i.e. central storage systems) of powerdotnet include MongoDB, MySQL, MariaDB, PostgreSQL, SQL Server, and Elasticsearch; interfaces for Exceptionless and ELK are also reserved, and Hive is being considered for later development. After all, big-data processing and a reasonably complete analysis tool chain are extremely important criteria for a log system.

When a back-end application is created, the configuration center automatically allocates a dbkey for recording logs, PostgreSQL by default; in my company, however, we usually use MongoDB, Elasticsearch, or ELK.

Once the dbkey is configured, the log system takes effect automatically; in code, you simply call the existing powerdotnet logging methods.

By default, the powerdotnet logging method collects logs fully asynchronously, so it does not affect the main business flow.
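A minimal sketch of what fully asynchronous collection means in practice (the class and method names are invented; the real Power.Xlogger API is not shown in this article): the logging call only enqueues, and a background consumer performs the actual storage I/O.

```csharp
// Sketch of asynchronous log collection; XLoggerSketch is an invented
// stand-in, not the real Power.Xlogger API.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class XLoggerSketch
{
    private static readonly BlockingCollection<string> Queue =
        new BlockingCollection<string>();

    static XLoggerSketch()
    {
        // A single long-running background consumer drains the queue
        // and writes to the configured store (stubbed as Console here).
        Task.Factory.StartNew(() =>
        {
            foreach (var entry in Queue.GetConsumingEnumerable())
                Console.WriteLine(entry); // replace with a real store write
        }, TaskCreationOptions.LongRunning);
    }

    // Returns immediately; the business thread never waits on storage I/O.
    public static void Info(string message)
    {
        Queue.Add(DateTime.UtcNow.ToString("o") + " INFO " + message);
    }
}
```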

The powerdotnet log component supports masking sensitive information (desensitization), which is a very common business requirement.
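A masking pass can be as simple as a set of regular-expression rules applied to the message before it is enqueued; the rule below (bank card numbers) is only an example, not the platform's actual rule set.

```csharp
// Example desensitization rule: mask the middle digits of anything
// that looks like a bank card number. Real rule sets are configurable.
using System.Text.RegularExpressions;

static class MaskSketch
{
    private static readonly Regex CardNumber =
        new Regex(@"\b(\d{4})\d{8,11}(\d{4})\b");

    public static string Apply(string message)
    {
        return CardNumber.Replace(message, "$1****$2");
    }
}

// MaskSketch.Apply("card=6222021234567890123") -> "card=6222****0123"
```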

Compared with scribe's storage system, Power.Xlogger is considerably trimmed down.

3、 Log management

1. Get dbkey

Because each application is configured with a dbkey by default, querying an application's logs means first locating the storage indirectly through the dbkey and then displaying the query results.

Considering that log volume is usually very large, sharding must be considered when designing the log system.

Configuring logs through a dbkey is naturally suited to sharded log storage. Once logs grow to a certain volume, no intervention from operations staff or DBAs is needed: you can switch to a new log storage medium simply by changing the dbkey, or by modifying the dbkey's connection string via the DataX data synchronization platform. Direct access to historical logs through other tools, however, is not as easy.
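The indirection can be pictured as follows (a hypothetical sketch; the real resolution goes through the database management platform and configuration center): the application holds only a logical dbkey, and a resolver maps it to the current physical connection string, so re-pointing the key migrates log storage with no application code changes.

```csharp
// Hypothetical sketch of dbkey indirection; the dictionary stands in
// for the config center, which refreshes this mapping dynamically.
using System.Collections.Generic;

static class DbKeyResolverSketch
{
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>
        {
            // A PostgreSQL connection string today; it could point at
            // MongoDB or Elasticsearch after a storage switch.
            { "PowerXLogger.OrderService",
              "Host=pg-log-01;Database=xlogger;Username=logger;Password=***" }
        };

    public static string Resolve(string dbkey)
    {
        return Map[dbkey];
    }
}
```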

2. Log query

The log query interface automatically locates each application's own log records through the dbkey adapted to that application.

3. Call chain query

For interfaces with complex and lengthy call links, call-chain query support is very important. In my development and operations experience, the call-chain query function is a very intuitive and efficient aid for troubleshooting online problems.

The call chain supports queries across multiple systems and applications. Different systems may have different dbkeys, with their data stored in different databases, so the results have to be aggregated and paged in memory.
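In-memory aggregation can be sketched like this (the types and per-store query delegates are illustrative): query each dbkey's store for the trace, merge the spans, order by timestamp, and page the merged list.

```csharp
// Sketch of cross-database call-chain aggregation with in-memory
// paging; Span and the per-store query delegates are illustrative.
using System;
using System.Collections.Generic;
using System.Linq;

class Span
{
    public string TraceId;
    public DateTime Timestamp;
    public string Application;
}

static class CallChainQuerySketch
{
    public static IList<Span> Query(
        string traceId,
        IEnumerable<Func<string, IEnumerable<Span>>> perStoreQueries,
        int pageIndex, int pageSize)
    {
        // Each delegate queries one store (one dbkey); no single
        // database holds the whole chain, so merge and page here.
        return perStoreQueries
            .SelectMany(query => query(traceId))
            .OrderBy(span => span.Timestamp)
            .Skip(pageIndex * pageSize)
            .Take(pageSize)
            .ToList();
    }
}
```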

Excellent open-source components such as Zipkin or SkyWalking (with SkyAPM) implement distributed link tracing by embedding the corresponding instrumentation code in your services.

Call chain query is very helpful for troubleshooting call anomalies.

4、 Other

For a logging system, we hardly insist on requirements such as ACID, CAP, or BASE. The log system should be as fast and efficient as possible without affecting the main business flow; the best outcome is a highly available log pipeline that loses no data.

By default, Power.Xlogger recommends that all applications record logs through AOP, which unifies the logging format (see the sketch below). Of course, to ease troubleshooting, it is still necessary to embed specific log statements at certain steps.
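A simple way to picture AOP-style logging (the platform's real interceptor plumbing is not shown in this article) is a wrapper that logs entry, exit, elapsed time, and exceptions in one uniform format without touching the business code:

```csharp
// Illustrative logging aspect; LoggingAspectSketch is invented for
// this example and wraps any call in a uniform log format.
using System;
using System.Diagnostics;

static class LoggingAspectSketch
{
    public static T Invoke<T>(string method, Func<T> body, Action<string> log)
    {
        var watch = Stopwatch.StartNew();
        log("ENTER " + method);
        try
        {
            T result = body();
            log("EXIT  " + method + " elapsed=" + watch.ElapsedMilliseconds + "ms");
            return result;
        }
        catch (Exception ex)
        {
            log("ERROR " + method + " ex=" + ex.Message);
            throw;
        }
    }
}

// Usage: LoggingAspectSketch.Invoke("OrderService.Create",
//            () => orderService.Create(order), XLoggerSketch.Info);
```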

For some systems, such as payment, finance, and account systems, logs can be extremely important.

When logs form a very important part of the business logic, especially the key logs of critical links, they may be consumed by the business logic itself as well as being very useful for troubleshooting and tracing. In such cases, using the log platform directly is not recommended.

For these log-sensitive systems, powerdotnet creates log tables within the system itself to record the core key logs; of course, logging to Power.Xlogger on demand is also supported.

For log instrumentation (burying points), Power.Xlogger's asynchronous batch processing with an explicit timeout is recommended: by default the timeout is 2 seconds and 200 entries are processed per batch, and both parameters can be adjusted dynamically through the configuration center.

When the log queue accumulates beyond a threshold (200,000 by default, dynamically configurable in the configuration center), logs are automatically pruned to prevent the system from crashing due to insufficient memory or other causes. The sketch below illustrates the combined batch/timeout/prune behavior.
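Under the defaults quoted above (200 entries per batch, a 2-second flush timeout, a 200,000-entry queue cap), a consumer loop could look as follows; all three values are drawn from those defaults, and this is only a sketch, not the platform's actual implementation.

```csharp
// Sketch of batched, bounded log consumption: flush when the batch is
// full or the timeout elapses; drop (prune) new entries once the
// bounded queue reaches its capacity.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class BatchConsumerSketch
{
    private readonly BlockingCollection<string> _queue =
        new BlockingCollection<string>(boundedCapacity: 200000);

    // Non-blocking enqueue: when the cap is reached, TryAdd returns
    // false and the entry is discarded instead of exhausting memory.
    public bool Enqueue(string entry)
    {
        return _queue.TryAdd(entry);
    }

    public void ConsumeLoop(Action<IList<string>> flush)
    {
        var batch = new List<string>(200);
        while (true)
        {
            // Wait up to 2 seconds for the next entry.
            string entry;
            bool got = _queue.TryTake(out entry, TimeSpan.FromSeconds(2));
            if (got) batch.Add(entry);

            // Flush on a full batch, or on timeout with pending entries.
            if (batch.Count >= 200 || (!got && batch.Count > 0))
            {
                flush(batch);
                batch.Clear();
            }
        }
    }
}
```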

Reference:

https://flume.apache.org

https://github.com/facebookarchive/scribe

https://kafka.apache.org

https://www.elastic.co/cn/products/logstash

https://www.elastic.co/guide/en/logstash/5.6/index.html

https://www.elastic.co/cn/products/kibana

https://www.elastic.co/guide/en/kibana/5.5/index.html

https://www.elastic.co/cn/products/elasticsearch

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/index.html

https://elasticsearch.cn

https://www.elastic.co/cn/products/beats/filebeat

https://www.elastic.co/guide/en/beats/filebeat/5.6/index.html