How is big data collected and extracted? This article finally makes it clear!

Time: 2020-10-27

This article comes from the official account [fat rolling pig programming]; please credit the source when reprinting!

The concept and architecture of the data platform and the data middle platform were already made clear in two earlier articles: a plain-language "six questions" overview of the data middle platform, and a one-article introduction for middle-platform architects. Starting with this article, I will share hands-on experience of actually putting the middle platform into production.

In fact, whether you are building a data platform or a data middle platform, data is undoubtedly the core of the core, so it goes without saying that data aggregation is the entry point. Look at the architecture diagrams of most middle platforms and you will find data acquisition and aggregation right at the front.


This article shares data collection from the following angles:

1. Enterprise data sources
2. The concept and value of data collection
3. Common data collection tools
4. Design principles for a data collection system
5. Sharing a production deployment of the data collection module

Collection only makes sense when there is a source, so let’s first summarize the data sources found in an enterprise.

Data sources

There are many data sources in an enterprise, but most of them fall into a few categories: databases, logs, front-end tracking (buried points), crawler systems, and so on.

  • Not much needs to be said about databases. MySQL is usually used as the business database, storing key business data such as user information and order information. Some NoSQL databases are also used, generally for less critical data.
  • Logs are another important data source, because they record everything the program executes, including the trail of the user’s business operations. From logs we can analyze program exceptions and also compute key business indicators such as PV and UV.
  • Front-end tracking (buried points) should not be overlooked either; for example, buried-point data is very important for analyzing user churn on the front end.
  • Crawler systems should be familiar to everyone. Although many companies declare that crawling is forbidden, the data that is forbidden to crawl is precisely the valuable data: management and decision-making often need competitors’ data for comparison, and such data can be obtained with crawlers.

Data acquisition and extraction

We have just listed a lot of data, but it is scattered across different network environments and storage platforms, and different project teams may repeatedly collect the same data, so the data is hard to use, hard to reuse, and hard to turn into value. Data aggregation is about bringing data from heterogeneous networks and heterogeneous data sources into the data center in a convenient, unified way, ready for subsequent processing and modeling.

  • Data aggregation can be real-time: for example, Flume collects logs in real time, and Canal collects MySQL binlogs in real time.
  • It can also be offline synchronization: for example, Sqoop synchronizes MySQL data to Hive offline, and DataX synchronizes MongoDB data to Hive.

Technology selection

Common data acquisition frameworks include Flume, Sqoop, Logstash, DataX, and Canal. There are also some less mainstream tools worth considering, such as Waterdrop and Maxwell. These tools are simple to use and cheap to learn, although there can be tricky details in real-world use; on the whole, none of them is very difficult.

So the important thing is to understand each tool’s applicable scope, strengths, and weaknesses, and then think about your own requirements: real-time or offline? From which data source to which destination? How should the data be processed along the way?

Flume

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data.
Flume can collect source data in the form of files, socket packets, and more, and write the collected data to external storage systems such as HDFS, HBase, Hive, and Kafka.
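As a rough illustration, the sketch below uses Flume's embedded-agent API to push events towards downstream collector agents; the collector hostnames, ports, and the event payload are placeholders, and a production setup would normally run standalone agents configured with properties files instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedFlumeSketch {
    public static void main(String[] args) {
        // Embedded agents only support memory/file channels and Avro sinks,
        // so the receiving agents would forward on to HDFS/HBase/Kafka sinks.
        Map<String, String> conf = new HashMap<>();
        conf.put("channel.type", "memory");
        conf.put("channel.capacity", "10000");
        conf.put("sinks", "sink1 sink2");
        conf.put("sink1.type", "avro");
        conf.put("sink1.hostname", "collector1.example.com"); // placeholder hosts
        conf.put("sink1.port", "4545");
        conf.put("sink2.type", "avro");
        conf.put("sink2.hostname", "collector2.example.com");
        conf.put("sink2.port", "4545");
        conf.put("processor.type", "load_balance");

        EmbeddedAgent agent = new EmbeddedAgent("log-agent");
        agent.configure(conf);
        agent.start();

        // Push one log line as a Flume event; real agents usually tail files or listen on sockets.
        Event event = EventBuilder.withBody("order paid, orderId=1001", StandardCharsets.UTF_8);
        agent.put(event);

        agent.stop();
    }
}
```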


Logstash

Logstash is the "L" in the famous ELK stack (Elasticsearch + Logstash + Kibana), where it is most commonly used as the log collector.


The main components of Logstash are as follows:

  • Inputs: required; responsible for generating events. Commonly used: file, syslog, redis, beats (e.g. Filebeat)
  • Filters: optional; responsible for processing and transforming events. Commonly used: grok, mutate, drop, clone, geoip
  • Outputs: required; responsible for shipping events elsewhere. Commonly used: elasticsearch, file, graphite, statsd

Sqoop

Sqoop is mainly used to move data between Hadoop (HDFS, Hive, HBase) and traditional relational databases (MySQL, PostgreSQL, and so on): data in a relational database can be imported into HDFS, and HDFS data can be exported back into the relational database.
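Sqoop is driven from the command line. As a minimal sketch, the snippet below simply shells out to a typical "MySQL table to Hive table" import; it assumes sqoop is on the PATH, and every connection detail and table name is a placeholder.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class SqoopImportSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Typical sqoop import: pull the MySQL table "user" into the Hive table ods.ods_user.
        List<String> cmd = Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:mysql://mysql.example.com:3306/shop",
                "--username", "etl_user",
                "--password", "etl_password",
                "--table", "user",
                "--hive-import",
                "--hive-table", "ods.ods_user",
                "--num-mappers", "4");

        Process p = new ProcessBuilder(cmd)
                .inheritIO()   // stream sqoop's own console output through to us
                .start();
        System.out.println("sqoop exited with code " + p.waitFor());
    }
}
```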


DataX

DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It implements efficient data synchronization among heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.


The supported data sources are summarized below; custom plug-ins can also be developed:

  • RDBMS (relational databases): MySQL, Oracle, SQL Server, PostgreSQL, DRDS, plus a universal RDBMS plug-in for other relational databases — read and write
  • NoSQL data storage: OTS, HBase 0.94, HBase 1.1, Phoenix 4.x, Phoenix 5.x, MongoDB, Hive, Cassandra — read and write
  • Unstructured data storage: TxtFile, FTP, HDFS — read and write; Elasticsearch — write only
  • Time series databases: OpenTSDB — read only; TSDB — read and write

Canal

Canal’s main purpose is to provide incremental data subscription and consumption based on parsing MySQL’s incremental binlog.

How is it used? Start the canal server and connect it to MySQL, then use a canal client to connect to the canal server and receive data-change messages. Once the client gets the changed tables and rows, it can trigger the corresponding business logic. More commonly, canal posts the data changes straight to a message queue and the queue’s consumers handle the processing logic; canal also supports landing data directly in ES and other stores.
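The official canal client is a Java library. Below is a minimal sketch along the lines of canal's quick-start example; the server address, destination name, and subscription filter are placeholders.

```java
import java.net.InetSocketAddress;

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.Message;

public class CanalClientSketch {
    public static void main(String[] args) throws InterruptedException {
        // Connect to a canal server that is already tailing MySQL's binlog.
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        try {
            connector.connect();
            connector.subscribe(".*\\..*");   // all schemas and all tables
            connector.rollback();

            while (true) {
                Message message = connector.getWithoutAck(100); // fetch up to 100 entries
                long batchId = message.getId();
                if (batchId == -1 || message.getEntries().isEmpty()) {
                    Thread.sleep(1000);
                    continue;
                }
                // In a real pipeline the row-change entries would be parsed and forwarded
                // to Kafka/ES; here we only report how many changes arrived.
                System.out.println("batch " + batchId + ": " + message.getEntries().size() + " entries");
                connector.ack(batchId);       // confirm the batch has been handled
            }
        } finally {
            connector.disconnect();
        }
    }
}
```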

Due to space constraints, this article does not compare these tools in detail. Want to know their pros and cons and how to choose among them? Head over to the official account [fat pig learn programming] for the answer!

Data landing

After collection, the data needs to land somewhere, that is, in the storage layer:

  • MySQL, Oracle
  • Hive, HDFS
  • HBase
  • Redis
  • Elasticsearch
  • TiDB
  • MongoDB

To learn Hive, HBase, Elasticsearch, and Redis, please follow the official account.

Note that after collection, data is often sent to Kafka, a message queue, before it actually lands in the various storage layers.
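For example, a collector process can hand each record to Kafka with the standard Java producer; the broker address, topic name, and record payload below are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CollectToKafkaSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1.example.com:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One collected record (a log line, a binlog change, ...) per Kafka message;
            // downstream consumers then write the topic out to Hive/HBase/ES.
            String payload = "{\"table\":\"user\",\"op\":\"INSERT\",\"id\":1001}";
            producer.send(new ProducerRecord<>("ods_user_changes", "1001", payload));
        }
    }
}
```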

Data aggregation design principles

From a middle-platform perspective, I think several key factors should be considered when designing the data aggregation layer:

  • From the very start, the design should support all kinds of data sources, from different origins and of different types. The data aggregation layer is not built for one particular kind of data and should be general-purpose.
  • It needs to support collection over different time windows: real-time, non-real-time, and historical.
  • Operation should be friendly and simple, so that even people who don’t know the technology can conveniently synchronize data. For example, when synchronizing MySQL to Hive, users should not have to fill in complicated sqoop task parameters; they only need to pick the source table and the destination table, and the middle platform takes care of the rest.
  • Choose the storage layer sensibly: different kinds of data should be stored in different places. Log data, for instance, is definitely not suited to MySQL.


Production landing and sharing

Now I will share our company’s real production case! Articles online mostly look the same, and very few share an actual production deployment. Your suggestions are welcome!

To recap the design principles: we should support all kinds of data sources and landing targets, handle offline and real-time collection separately, and keep operation friendly and simple enough that people who don’t know the technology can still use it. Our overall design follows these principles. I will introduce the company’s solution from two angles: offline collection and real-time collection.

Offline acquisition

For offline synchronization, our company mainly collects and extracts data from several kinds of data sources and finally lands it in Hive or TiDB. There is no need to say much about landing in Hive; we are all familiar with that. TiDB mainly serves real-time queries, real-time business analysis, and all kinds of regularly scheduled batch reports.

Next, taking self-service synchronization from MySQL to Hive as an example, I will share the system design of the self-service offline data collection module.

First, the data-source management module in the middle platform lists the information of each data source, and users tick the tables they want to synchronize as needed.


Synchronization supports both full and incremental modes, plus additional configuration such as desensitization, encryption, and decryption. Since warehouse table names have to follow a naming standard, the target table name is generated automatically by the system; for example, tables synchronized from MySQL to Hive all get the uniform prefix ods_ (the warehouse naming specification will be covered in a follow-up; please follow the official account [fat pig learning programming]).


After the user clicks to confirm the synchronization, the system first queries the metadata management system for the meta information the task needs (IP, port, account and password, column information), assembles it into sqoop parameters, and records the synchronization request (applicant, reason, synchronization parameters, and so on) in a MySQL table. The work order system is then invoked so that the request can be reviewed by the user’s superior.
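As a rough illustration of "assembling the meta information into sqoop parameters", the sketch below turns one metadata record into a sqoop argument list; the TableMeta class, its field values, and the ods_ target table name are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class SqoopParamAssembler {

    /** Hypothetical shape of what the metadata system returns for one source table. */
    static class TableMeta {
        String ip = "10.0.0.12";
        int port = 3306;
        String db = "shop";
        String user = "etl_user";
        String password = "etl_password";
        String table = "user";
        String columns = "id,name,phone,update_time";
    }

    /** Assemble sqoop import arguments for a "MySQL table -> Hive ods table" task. */
    static List<String> buildImportArgs(TableMeta m, String hiveTable) {
        List<String> args = new ArrayList<>();
        args.add("import");
        args.add("--connect");
        args.add("jdbc:mysql://" + m.ip + ":" + m.port + "/" + m.db);
        args.add("--username");
        args.add(m.user);
        args.add("--password");
        args.add(m.password);
        args.add("--table");
        args.add(m.table);
        args.add("--columns");
        args.add(m.columns);
        args.add("--hive-import");
        args.add("--hive-table");
        args.add(hiveTable);   // e.g. ods.ods_user_tmp, following the ods_ prefix convention
        return args;
    }

    public static void main(String[] args) {
        System.out.println("sqoop " + String.join(" ", buildImportArgs(new TableMeta(), "ods.ods_user_tmp")));
    }
}
```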

After approval, the work order system sends a message to MQ, so the audit status can be picked up in real time. If the audit passes, a task is automatically generated in the scheduling system (based on EasyScheduler). Early on our company chose Azkaban, but we later found that EasyScheduler outperformed it in many respects, especially usability, UI, and monitoring.
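The snippet below sketches the "get the audit status in real time through MQ" step with a plain Kafka consumer; the topic name, message format, and the call into the scheduling system are all invented for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class WorkOrderListenerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1.example.com:9092"); // placeholder broker
        props.put("group.id", "data-platform-workorder");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("work_order_audit")); // invented topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // The message body is assumed to carry the work-order id and audit result.
                    if (r.value().contains("\"status\":\"APPROVED\"")) {
                        createSyncTaskInScheduler(r.value()); // hypothetical call into the scheduler
                    }
                }
            }
        }
    }

    static void createSyncTaskInScheduler(String workOrder) {
        // In the real system this would call the scheduling system's API to generate
        // the sqoop workflow; here it is just a stub.
        System.out.println("would create a scheduling task for: " + workOrder);
    }
}
```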


Synchronizing MySQL to Hive involves three process nodes. Taking incremental synchronization of the user table as an example: the first step uses a sqoop task to synchronize the MySQL data into Hive’s ods_user_tmp table; the second step merges the ods_user_tmp data into ods_user (overwriting the original partition); the third step performs data verification.
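Below is a minimal sketch of the merge step run against HiveServer2 over JDBC: overwrite the current partition of ods_user with the union of the existing partition and ods_user_tmp, keeping the latest row per primary key. The HiveServer2 address, partition value, and the id/name/phone/update_time column layout are assumptions; in production this step is typically a templated shell/HQL script run by the scheduler, sketched here over JDBC only for concreteness.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OdsUserMergeSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive JDBC driver must be on the classpath

        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive.example.com:10000/ods", "etl_user", "");
             Statement stmt = conn.createStatement()) {

            // Step 2: merge the freshly imported ods_user_tmp into ods_user,
            // deduplicating by id and keeping the most recent update_time.
            String merge =
                "INSERT OVERWRITE TABLE ods_user PARTITION (dt = '2020-10-26') " +
                "SELECT id, name, phone, update_time FROM ( " +
                "  SELECT u.*, row_number() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn " +
                "  FROM ( " +
                "    SELECT id, name, phone, update_time FROM ods_user WHERE dt = '2020-10-26' " +
                "    UNION ALL " +
                "    SELECT id, name, phone, update_time FROM ods_user_tmp " +
                "  ) u " +
                ") t WHERE rn = 1";
            stmt.execute(merge);

            // Step 3 (data verification) could be as simple as comparing row counts
            // between the source extract and the merged partition.
        }
    }
}
```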

Besides MySQL-to-Hive, synchronizing other data sources works in a similar way: the key is to define the process templates (usually shell scripts) and the dependencies between process nodes, and then let the scheduling system do the scheduling.

Real-time acquisition

The real-time acquisition module is built on a Flink real-time computing platform and has the following features:

  • Supports multiple data sources: Kafka, RocketMQ, Hive, etc.
  • Supports multiple landing targets: Kafka, JDBC, HDFS, Elasticsearch, RocketMQ, Hive, etc.
  • General SQL processing: data processing can be configured with a single SQL statement (a minimal sketch follows this list)
  • Alarm strategies: supports multiple alarm strategies, such as monitoring for backlog in stream jobs and for application start-up and exit.
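As a rough sketch of the "one SQL" idea with Flink's Table API: declare a Kafka source table and an Elasticsearch sink table, and the whole job becomes a single INSERT ... SELECT. The topic, field layout, index name, and addresses are placeholders, and the connector options follow the Flink 1.11-era SQL connectors.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OneSqlJobSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Source: a Kafka topic of user events in JSON.
        tEnv.executeSql(
            "CREATE TABLE user_events (" +
            "  user_id STRING," +
            "  event_type STRING," +
            "  event_time TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'user_events'," +
            "  'properties.bootstrap.servers' = 'kafka1.example.com:9092'," +
            "  'properties.group.id' = 'rt-collect'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json'" +
            ")");

        // Sink: an Elasticsearch index.
        tEnv.executeSql(
            "CREATE TABLE user_events_es (" +
            "  user_id STRING," +
            "  event_type STRING," +
            "  event_time TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'elasticsearch-7'," +
            "  'hosts' = 'http://es1.example.com:9200'," +
            "  'index' = 'user_events'" +
            ")");

        // The processing job itself is literally one SQL statement.
        tEnv.executeSql("INSERT INTO user_events_es SELECT user_id, event_type, event_time FROM user_events");
    }
}
```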

Extensibility and ease of use were fully considered in the design: sources, processing operators, and sink/dim (dimension table) components are all developed as plug-ins, and with UI-based configuration and automatic generation of the DAG, even people who do not know the technology can quickly get started developing stream computing tasks.


Due to space limits, I cannot cover all the details here; I will continue to share them in the official account.