Top 100 Summit: [Sharing Record – Microsoft] A Real-Time Big Data Quality Monitoring Platform Based on Kafka and Spark

Time: 2021-07-18

The content of this article comes from the case shared by Tony Xing (Xing Guodong), Senior Product Manager at Microsoft, at the Top100 Summit 2016.
Editor: Cynthia

Tony Xing: Senior Product Manager at Microsoft, responsible for building the big data platform, data products, and services of Microsoft's Application and Services Group.

Introduction: Microsoft's ASG (Application and Services Group) includes Bing, Office, and Skype, which together generate more than 5 PB of data every day. Building a highly scalable data audit service that guarantees the integrity and timeliness of data at this scale is very challenging. This article describes how the Microsoft ASG big data team uses Kafka, Spark, and Elasticsearch to solve this problem.

1. Case introduction

This case introduces the big data quality monitoring platform designed and deployed by the Microsoft big data platform team on top of open-source technologies (Kafka, Spark, Elasticsearch, Kibana). The platform is real-time, highly available, scalable, and highly credible, and it provides reliable data quality monitoring for Microsoft Bing, Office 365, Skype, and other businesses with a combined annual revenue of 27 billion US dollars. Driven by the business needs, the design and implementation achieve the following goals:

Monitor the integrity and latency of streaming data;
Support data pipelines with multiple data producers, multiple processing stages, and multiple data consumers;
Monitor data quality in near real time;
When data quality problems occur, provide diagnostic information that helps engineers resolve them quickly;
Keep the monitoring platform itself extremely stable and highly available, with more than 99.9% uptime;
Make the monitoring and auditing itself highly credible;
Allow the platform architecture to scale out.

2. Background and problem introduction

To serve Microsoft's Bing, Office 365, and Skype businesses, our big data platform needs to process more than ten petabytes of data per day. All data analysis, reports, insights, and A/B tests rely on high-quality data; if data quality is poor, the businesses that depend on data for decision making are seriously affected.

At the same time, the demand from Microsoft's businesses for real-time data processing keeps growing, and many earlier solutions for monitoring batch data are no longer suitable for monitoring the quality of real-time streaming data.

On the other hand, for historical reasons, different business groups often use different technologies and tools to process data. How to integrate these heterogeneous technologies and tools, and the data quality monitoring built on top of them, is also an urgent problem to solve.

Figure 1 shows the conceptual architecture of our data processing platform. On the data producer side, we use common SDKs on the client and the server to generate data according to a common schema. The data is routed through data collectors distributed around the world to the corresponding Kafka clusters, and is then subscribed to by various computing and storage frameworks in a pub/sub fashion.

[Figure 1: Conceptual architecture of the data processing platform]
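To make the pub/sub pattern concrete, the sketch below shows how a downstream processing framework might subscribe to the collector-fed topic using the kafka-python client. The topic name, broker addresses, consumer group, and event fields are hypothetical illustrations of the "common schema" idea, not the actual internal ones.

```python
# Minimal sketch of the pub/sub pattern in Figure 1: a downstream framework
# subscribes to events that collectors have written to Kafka.
# Topic, brokers, consumer group, and event fields are hypothetical.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "telemetry-events",                              # topic fed by the collectors
    bootstrap_servers=["kafka-broker-1:9092"],       # placeholder brokers
    group_id="batch-ingest",                         # each consumer group gets its own copy
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # Each framework applies its own processing; here we just inspect the event.
    print(event["eventType"], event["timestamp"], event["dataCenter"])
```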

In this way, each team can choose the tools it is most familiar with or has already been using. For real-time processing, for example, a business team can use Spark or Microsoft's USQL streaming framework, as well as third-party tools for specific scenarios, such as Splunk for log analysis or Interana for interactive analysis. For batch processing, users can choose Hadoop or Spark from the open-source community, or Microsoft's Cosmos.

[Figure 2: Growth of real-time streaming data volume]

As shown in Figure 2, while migrating big data to the architecture of Figure 1, we also saw rapid growth in real-time streaming data: at peak, more than 1 billion messages per day, 1.3 million messages per second, and 3.5 PB of streaming data per day.

3. Data monitoring scenarios and working principle

3.1 data monitoring scenarios

Based on the business requirements, we summarize the characteristics of the data processing pipelines to be monitored (as shown in Figure 3):
Multiple data producers: data comes from both clients and servers;
Multiple data consumers: the various downstream data processing frameworks;
Multiple stages: from generation to processing, data often flows through several pipeline components, and monitoring must ensure that at each stage the data is not lost, excessively delayed, or anomalous.

[Figure 3: Characteristics of the data pipelines to be monitored]

3.2 working principle

Based on the data pipeline in Figure 3, we state the problem as: how to guarantee the integrity, timeliness, and anomaly monitoring of the upstream and downstream data pipelines built around Kafka. Figure 4 shows the abstract monitoring architecture and how it works.

The blue components are the processing stages that the data flows through in the pipeline; the green component is Audit Trail, the core real-time data quality monitoring service described in this article. As data flows through each component, a corresponding audit record is sent to Audit Trail at the same time. The audit data can be regarded as a kind of metadata describing the data flow: which data center and which machine a message was generated on, how many records the message contains, its size, its timestamp, and so on. After aggregating the metadata sent by the various data processing components, Audit Trail can evaluate data quality in real time, for example the integrity and timeliness of the data at that moment and whether there are any anomalies.

[Figure 4: Abstract monitoring architecture and working principle]
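To make the idea of audit metadata concrete, here is a hedged sketch of what an audit record emitted by a pipeline component might look like and how it could be sent to Audit Trail alongside the real data. The field names and the audit topic are illustrative assumptions, not the platform's exact schema.

```python
# Sketch of emitting an audit record for a batch of messages as it passes
# through one pipeline stage. Field names and topic are illustrative only.
import json
import time

from kafka import KafkaProducer

audit_producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092"],       # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_audit(stage, data_center, machine, record_count, byte_size, event_time_ms):
    """Send one audit record describing a batch that just flowed through `stage`."""
    audit_record = {
        "stage": stage,                  # e.g. "producer", "collector", "consumer-A"
        "dataCenter": data_center,
        "machine": machine,
        "recordCount": record_count,     # how many records were in the batch
        "byteSize": byte_size,
        "eventTime": event_time_ms,      # event time of the batch
        "auditTime": int(time.time() * 1000),  # when the audit record was emitted
    }
    audit_producer.send("audit-trail", value=audit_record)

# Example: a collector in the "us-west" data center just forwarded 500 records.
emit_audit("collector", "us-west", "machine-042", 500, 1_048_576, int(time.time() * 1000))
audit_producer.flush()
```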

Based on the audit metadata in Figure 5, once a data quality problem occurs, engineers can quickly locate which server in which data center had a problem during which time period, then take action to resolve or mitigate the problem and minimize the impact on downstream data processing.

[Figure 5: Audit metadata]

The data quality problems that can be monitored fall into the following categories:
● Data latency exceeds the specified SLA (service level agreement)

Through the latency status chart shown in Figure 6, engineers can quickly see whether the latency dimension of data quality is normal, which is very important for data products and applications with strict real-time requirements: data that arrives late often loses much of its value.

Note that the chart only plays an auxiliary role here. In the real production environment, system API calls are used to regularly check SLA compliance; once the latency threshold is exceeded, the on-call engineer is notified by phone, SMS, and other means so the problem can be resolved immediately.

[Figure 6: Data latency status chart]
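As a hedged illustration of such a periodic check (the real system calls internal APIs and paging services that are not described in the talk), the sketch below compares the observed end-to-end latency against an SLA threshold and alerts when it is exceeded; `get_latest_latency_seconds` and `alert_on_call_engineer` are hypothetical stand-ins.

```python
# Periodic SLA check sketch: compare observed latency to a threshold and alert.
# The two helper functions are hypothetical placeholders for internal services.
import random

LATENCY_SLA_SECONDS = 15 * 60  # e.g. data must arrive within 15 minutes

def get_latest_latency_seconds(pipeline: str) -> float:
    """Placeholder for a call to the data analysis service's latency API."""
    return random.uniform(0, 30 * 60)  # simulated latency for illustration

def alert_on_call_engineer(message: str) -> None:
    """Placeholder for phone/SMS paging; here we just print."""
    print("ALERT:", message)

def check_latency_sla(pipeline: str) -> None:
    latency = get_latest_latency_seconds(pipeline)
    if latency > LATENCY_SLA_SECONDS:
        alert_on_call_engineer(
            f"pipeline {pipeline}: latency {latency:.0f}s exceeds SLA of {LATENCY_SLA_SECONDS}s"
        )

# In production this would run on a schedule (e.g. every minute).
check_latency_sla("bing-clickstream")  # hypothetical pipeline name
```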

● Data loss in transit causes integrity to fall below the SLA (service level agreement)

Engineers can see the status of data integrity through the simple chart shown in Figure 7, which covers two data processing stages: one data producer and two data consumers, so there are three lines in the chart. The green line is the real-time data volume from the producer, and the blue and purple lines are the data volumes processed by the two consumers. Ideally, if there is no integrity problem, the three lines coincide exactly. In this example the lines diverge at the last point, which indicates an integrity problem that requires the intervention of an engineer.

[Figure 7: Data integrity chart for one producer and two consumers]
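The sketch below illustrates, with assumed field names and tolerance, the kind of completeness check behind Figure 7: for each time window, the record count reported by the producer is compared with the counts reported by each consumer, and any shortfall beyond a small tolerance is flagged.

```python
# Completeness check sketch: compare producer vs consumer record counts per window.
# Counts would come from aggregated audit metadata; the values here are illustrative.

TOLERANCE = 0.001  # allow a 0.1% discrepancy before flagging

def check_completeness(window, producer_count, consumer_counts):
    """Flag consumers whose processed count falls short of the produced count."""
    problems = []
    for consumer, count in consumer_counts.items():
        if producer_count == 0:
            continue
        loss_ratio = (producer_count - count) / producer_count
        if loss_ratio > TOLERANCE:
            problems.append(
                f"{window}: consumer '{consumer}' processed {count} of "
                f"{producer_count} records ({loss_ratio:.2%} missing)"
            )
    return problems

# Example window: one producer, two consumers (as in Figure 7).
issues = check_completeness(
    "2016-10-01 12:00-12:05",
    producer_count=1_000_000,
    consumer_counts={"consumer-A": 999_950, "consumer-B": 930_000},
)
for issue in issues:
    print(issue)
```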

● Anomalies in the data itself, monitored in real time through anomaly detection

If the data itself is abnormal, we use anomaly detection based on statistical metadata (as shown in Figure 8) for real-time monitoring. Anomaly detection is a very common problem and challenge in the industry; almost every Internet company has a service or platform for it, but doing it well is not easy. It is a big topic that deserves a separate article; here we only briefly introduce the algorithms.

[Figure 8: Anomaly detection based on statistical metadata]

In this example, detecting an abnormal data volume uncovers problems with upstream log writing or other logic errors in data production.

3.3 anomaly detection

3.3.1 anomaly detection algorithm 1

[Figure 9: Holt-Winters based anomaly detection]

We use the Holt-Winters algorithm (Figure 9) to train a model and make predictions, with many improvements to increase the robustness and fault tolerance of the algorithm.

Improvements in robustness include:
● Better estimation using the median absolute deviation (MAD);
● Handling of missing data points and noise (e.g., data smoothing).
Functional improvements include:
● Automatic detection of trend and seasonality information;
● Allowing users to manually label and give feedback so that trend changes are handled better.
By comparing predicted values with actual values, we use the GLR (generalized likelihood ratio) test to find outliers (a simplified sketch follows this list). We have also made corresponding improvements, including:
● A floating-threshold GLR that dynamically adjusts the model based on newly arrived data;
● Removal of outliers from noisy data.
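A minimal sketch of the approach is shown below: fit a Holt-Winters (triple exponential smoothing) model to historical counts with statsmodels and flag points whose residuals are extreme under a median/MAD-based robust score. The full floating-threshold GLR test is simplified here to a robust z-score, so this is an approximation of the method described above, not the production algorithm; the data is synthetic.

```python
# Sketch: Holt-Winters forecast plus a MAD-based residual test (a simplification
# of the floating-threshold GLR described above). Data here is synthetic.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                       # two weeks of hourly counts
series = 1000 + 200 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 20, hours.size)
series[-3] += 400                                # inject an anomaly near the end

train, test = series[:-24], series[-24:]         # hold out the most recent day
model = ExponentialSmoothing(train, trend="add", seasonal="add",
                             seasonal_periods=24).fit()
forecast = np.asarray(model.forecast(24))

residuals = test - forecast
mad = np.median(np.abs(residuals - np.median(residuals)))
robust_z = 0.6745 * (residuals - np.median(residuals)) / mad   # MAD-based score

for i, z in enumerate(robust_z):
    if abs(z) > 3.5:                             # common cutoff for MAD outliers
        print(f"hour {i}: observed {test[i]:.0f}, expected {forecast[i]:.0f}, z={z:.1f}")
```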

3.3.2 anomaly detection algorithm 2

This is an online time-series anomaly detection algorithm based on exchangeability martingales. Its core assumption is that the distribution of the data is stable; if adding a new data point causes a relatively large change in the distribution, we consider that an anomaly has occurred. Based on historical data, we therefore need to quantify how anomalous a new value is. The formulas below make this precise; readers not interested in the mathematics can skip them.

At a certain time $t$, we receive a new data point and compute its strangeness with respect to the history:

$$s_t = \text{strangeness}(\text{value}_t,\ \text{history})$$

The strangeness is converted into a randomized p-value:

$$p_t = \frac{\#\{i : s_i > s_t\} + r \cdot \#\{i : s_i = s_t\}}{N}, \qquad r \sim \text{Uniform}(0,1)$$

The uniform $r$ ensures that $p_t$ itself is uniformly distributed when the data is exchangeable.

The exchangeability (power) martingale is

$$M_t = \prod_{i=1}^{t} \epsilon\, p_i^{\epsilon - 1},$$

which satisfies $E[M_t \mid p_1, p_2, \ldots, p_{t-1}] = M_{t-1}$, since $\int_0^1 \epsilon\, p^{\epsilon - 1}\, dp = 1$ and each $p_i$ is uniform.

The alarm-triggering threshold is controlled by Doob's maximal inequality:

$$\Pr(\exists\, t : M_t > \lambda) < \frac{1}{\lambda}$$

For outliers, the martingale value exceeds the threshold $\lambda$.
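The following is a self-contained sketch of this scheme: a simple strangeness function (distance from the median of the history), the randomized p-value above, and the power martingale with an alarm when it exceeds the threshold λ. The strangeness function and the parameter values are illustrative choices, not the exact ones used in the production system.

```python
# Sketch of online anomaly detection with an exchangeability (power) martingale.
# The strangeness function and parameter values are illustrative choices.
import random

random.seed(7)
EPSILON = 0.92        # power-martingale parameter in (0, 1)
LAMBDA = 20.0         # alarm threshold; false-alarm probability < 1/LAMBDA

def strangeness(value, history):
    """How unusual `value` is relative to history; here: distance from the median."""
    med = sorted(history)[len(history) // 2]
    return abs(value - med)

def detect(stream, warmup=10):
    history, scores = [], []
    martingale = 1.0
    for t, value in enumerate(stream):
        if len(history) >= warmup:
            s_t = strangeness(value, history)
            scores.append(s_t)
            greater = sum(1 for s in scores if s > s_t)
            equal = sum(1 for s in scores if s == s_t)
            p_t = max((greater + random.random() * equal) / len(scores), 1e-9)
            martingale *= EPSILON * p_t ** (EPSILON - 1)
            if martingale > LAMBDA:
                print(f"t={t}: anomaly detected, martingale={martingale:.1f}")
                martingale = 1.0              # reset after raising an alarm
        history.append(value)

# Stable data followed by a distribution shift that should trigger an alarm.
data = [100 + random.gauss(0, 2) for _ in range(200)] + \
       [130 + random.gauss(0, 2) for _ in range(30)]
detect(data)
```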

3.3.3 anomaly detection algorithm 3

This is a simple and effective exponential smoothing algorithm based on historical data. It first generates dynamic upper and lower bounds from the history:

$$\text{Threshold} = \min\bigl(\max(M_1 \cdot \text{Mean},\ M_2 \cdot \text{StdDev}),\ M_3 \cdot \text{Mean}\bigr), \qquad M_1 < M_3$$

An alert is raised when $|\text{value} - \text{predicted value}| > \text{Threshold}$, where the predicted value is an exponentially weighted average of the five most recent historical values:

$$\text{Predicted value} = \frac{s_1 + \tfrac{1}{2}s_2 + \tfrac{1}{4}s_3 + \tfrac{1}{8}s_4 + \tfrac{1}{16}s_5}{1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16}}$$
The advantage of this method is that it handles periodic data well and allows users to adjust the dynamic upper and lower bounds through feedback and labeling.
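A direct transcription of these formulas into code might look like the sketch below; the values of M1, M2, and M3 are illustrative tuning parameters, not values given in the talk.

```python
# Sketch of algorithm 3: exponentially weighted prediction over the last five
# historical points plus a dynamic threshold. M1, M2, M3 are tunable parameters.
import statistics

M1, M2, M3 = 0.05, 3.0, 0.20       # illustrative values, with M1 < M3
WEIGHTS = [1, 1/2, 1/4, 1/8, 1/16]

def predict(history):
    """Weighted average of the five most recent values (most recent first)."""
    recent = history[-1:-6:-1]                     # last 5 values, newest first
    weights = WEIGHTS[:len(recent)]
    return sum(w * v for w, v in zip(weights, recent)) / sum(weights)

def threshold(history):
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return min(max(M1 * mean, M2 * stdev), M3 * mean)

def is_anomaly(value, history):
    return abs(value - predict(history)) > threshold(history)

history = [100, 98, 103, 101, 99, 102, 100, 97, 104, 101]
print(is_anomaly(140, history))  # True: far above the dynamic band
print(is_anomaly(101, history))  # False: within the band
```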

4. System design overview

Based on the needs of the business scenarios, the design and implementation have to achieve the following goals and handle the corresponding challenges:
Monitor the integrity and latency of streaming data;
Support data pipelines with multiple data producers, multiple processing stages, and multiple data consumers;
Monitor data quality in near real time;
When data problems occur, provide diagnostic information that helps engineers resolve them quickly;
Keep the monitoring platform itself extremely stable and highly available, with more than 99.9% uptime;
Make the monitoring and auditing itself highly credible;
Allow the platform architecture to scale out.

4.1 highly available and scalable architecture

[Figure 10: Audit Trail architecture: Kafka, Spark Streaming, Elasticsearch, Kibana]

As shown in Figure 10, audit metadata arrives at Kafka through a front-end web service. We use Kafka as highly available temporary storage, so that data producers and consumers are never blocked when they send audit data and the more important data flows are not affected.

A Spark Streaming application aggregates the audit data by time window, with logic to handle duplicate, late, and out-of-order records, and with various kinds of fault tolerance to ensure high availability.

Elasticsearch is used to store and aggregate the audit data, reports are displayed through Kibana, and a data analysis service exposes APIs that let users retrieve various kinds of data quality information.

As the final layer, the data analysis service provides integrity, timeliness, and anomaly information through its APIs.
Each of the components above is designed to scale out independently, and the fault tolerance built into the design provides high availability.
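A hedged sketch of this Kafka → Spark → Elasticsearch path using PySpark Structured Streaming is shown below: audit records are read from an assumed audit-trail topic, late and duplicate records are handled with a watermark and dropDuplicates, record counts are aggregated per time window and stage, and the results are written out. The topic name, schema fields, and sinks are illustrative; a console sink is used here for local testing, while production would write to Elasticsearch (for example through the es-hadoop connector).

```python
# PySpark Structured Streaming sketch: aggregate audit records from Kafka per
# time window and stage. Topic, schema, and sink choices are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, sum as sum_, window
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("audit-trail-aggregation").getOrCreate()

audit_schema = StructType([
    StructField("auditId", StringType()),
    StructField("stage", StringType()),
    StructField("dataCenter", StringType()),
    StructField("recordCount", LongType()),
    StructField("eventTime", StringType()),   # ISO-8601 timestamp string
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker-1:9092")
       .option("subscribe", "audit-trail")
       .load())

audits = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), audit_schema).alias("a"))
          .select("a.*")
          .withColumn("eventTime", col("eventTime").cast("timestamp"))
          .withWatermark("eventTime", "10 minutes")        # tolerate late data
          .dropDuplicates(["auditId", "eventTime"]))       # remove duplicates

per_window = (audits
              .groupBy(window(col("eventTime"), "1 minute"),
                       col("stage"), col("dataCenter"))
              .agg(sum_("recordCount").alias("totalRecords")))

# Console sink for local testing; in production the es-hadoop connector
# ("org.elasticsearch.spark.sql") would write to an audit index instead.
query = (per_window.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()
```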

4.2 reliability guarantee through geo-distributed active-active data centers

The active-active disaster recovery across two data centers shown in Figure 11 further guarantees high availability and reliability. The overall architecture ensures that the data flow is processed by two identical audit processing pipelines at the same time; even if one data center goes offline for any reason, the service as a whole remains available, guaranteeing around-the-clock data quality auditing and monitoring.

[Figure 11: Active-active deployment across two data centers]

4.3 highly credible audit and monitoring services

For any monitoring service, a common question is whether the results of the monitoring service itself are accurate and reliable. We guarantee the credibility of the service in two ways:
● Audit for audit (Figure 12);
● Synthetic probes.

[Figure 12: Audit for audit]

In addition to the Kafka/Spark/Elasticsearch-based pipeline, we maintain a separate, independent pipeline for processing the audit metadata. By comparing the results of the two pipelines, we can verify the reliability of the audit data.
In addition, using the synthetic probe approach, we send a group of synthetic data to the front-end web service every minute and then try to read it back from the data analysis web service, which further guarantees the reliability of the data.
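A hedged sketch of the synthetic probe idea: every minute a labelled synthetic audit record is posted to the front-end web service, and after a short delay the probe queries the data analysis service to verify that the record is visible end to end. The endpoints and payload fields are hypothetical placeholders.

```python
# Synthetic probe sketch: inject a labelled synthetic audit record and verify
# it can be read back through the analysis API. Endpoints are hypothetical.
import time
import uuid

import requests

FRONTEND_URL = "https://audit-frontend.example.com/api/audit"     # placeholder
ANALYSIS_URL = "https://audit-analysis.example.com/api/records"   # placeholder

def run_probe():
    probe_id = str(uuid.uuid4())
    record = {
        "auditId": probe_id,
        "stage": "synthetic-probe",
        "recordCount": 1,
        "eventTime": int(time.time() * 1000),
    }
    requests.post(FRONTEND_URL, json=record, timeout=10)

    time.sleep(120)  # give the pipeline time to process the record

    resp = requests.get(ANALYSIS_URL, params={"auditId": probe_id}, timeout=10)
    if resp.status_code != 200 or not resp.json():
        print(f"probe {probe_id}: NOT found, pipeline may be unhealthy")
    else:
        print(f"probe {probe_id}: found, pipeline healthy")

run_probe()  # in production, scheduled to run every minute
```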

4.4 assisting the diagnosis of data quality problems

When a data quality problem occurs, Audit Trail provides the raw audit metadata to help engineers diagnose it further. Engineers can join this metadata with their own traces for interactive diagnosis, as shown in Figure 13.

[Figure 13: Interactive diagnosis with audit metadata]
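As an illustration of such a join, the pandas sketch below combines raw audit metadata with an engineer's own trace data to narrow a problem down to a specific machine; the columns and values are hypothetical.

```python
# Sketch: join raw audit metadata with an engineer's own trace data to narrow
# down which machine caused a data quality problem. Columns are hypothetical.
import pandas as pd

audit = pd.DataFrame({
    "machine": ["m-041", "m-042", "m-043"],
    "dataCenter": ["us-west", "us-west", "eu-north"],
    "recordCount": [500_000, 120_000, 498_000],   # m-042 looks suspiciously low
})

traces = pd.DataFrame({
    "machine": ["m-041", "m-042", "m-043"],
    "lastError": [None, "disk full", None],
})

diagnosis = audit.merge(traces, on="machine", how="left")
print(diagnosis[diagnosis["recordCount"] < 200_000])   # surface the suspect machine
```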

5. Effect evaluation and summary

With the system architecture designed and deployed as described above, we have achieved the data quality monitoring goals needed to support the business development of Bing, Office, and Skype:
Monitor the integrity and latency of streaming data;
Support data pipelines with multiple data producers, multiple processing stages, and multiple data consumers;
Monitor data quality in near real time;
When data problems occur, provide diagnostic information that helps engineers resolve them quickly;
Keep the monitoring platform itself extremely stable and highly available, with more than 99.9% uptime;
Make the monitoring and auditing itself highly credible;
Allow the platform architecture to scale out.

At the 6th Top100 Global Software Case Study Summit, held at the Beijing National Convention Center from November 9 to 12, Microsoft Principal Product Designer Bill Zhong will share "The Agile UX Transformation Practice of Microsoft OneNote"; Microsoft data scientist Kirk Lee will share "Reinforcement Learning in Azure Customer Engagement"; and Zheng Yu, senior researcher at Microsoft Research Asia, will share "Driving Smart Cities with Big Data and AI".

The Top100 Global Software Case Study Summit has been held six times. It selects outstanding software R&D cases from around the world and draws 2,000 attendees each year, with tracks on product, team, architecture, operations, big data, artificial intelligence, and other topics, where attendees can learn the latest R&D practices of first-tier Internet companies such as Google, Microsoft, Tencent, Alibaba, and Baidu.
