A Painstaking Compilation: Commonly Used Big Data Collection Tools You Should Know


There are many sources of big data. In the era of big data, how to collect useful information from it is the key factor in the industry's development. Big data acquisition is the cornerstone of the big data industry, and the work done in the acquisition stage involves some of its core technologies. To collect big data efficiently, it is important to select appropriate acquisition methods and platforms according to the collection environment and data types. Below are some common big data acquisition platforms and tools.

1 Flume

Flume is a distributed log collection system originally developed by Cloudera and now part of the Hadoop ecosystem. With continuous improvement in recent years, especially in ease of use during development, Flume has become an Apache top-level project.
Flume can collect data from sources such as the console, RPC (Thrift RPC), text files, tail (UNIX tail), syslog, and exec (command execution).
Flume uses a multi-master approach. To ensure the consistency of configuration data, Flume introduces ZooKeeper to store it. ZooKeeper itself guarantees the consistency and high availability of the configuration data, and can notify the Flume master nodes when it changes. A gossip protocol is used to synchronize data between Flume master nodes.
Flume also offers good customization and extension capabilities for special scenarios, so it is suitable for most day-to-day data collection needs. Flume is written in Java and therefore depends on a Java runtime environment. It is designed as a distributed pipeline architecture, which can be regarded as a network of agents between data sources and destinations that supports data routing.
Flume supports sink failover and load balancing, so the system can keep collecting data even if an agent fails. The unit of data transmitted in Flume is called an event, which consists of headers (containing metadata) and a payload.
Flume provides an SDK to support custom development. A Flume client sends events to a Flume agent at the point where the events originate; the client usually lives in the same process space as the application that generates the data. Common Flume clients include Avro, Log4j, syslog, and HTTP POST.
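To make the source/channel/sink model concrete, here is a minimal sketch of a Flume agent properties file. The agent name `a1` and the file paths are hypothetical; only standard Flume component types (exec source, memory channel, hdfs sink) are used:

```properties
# Hypothetical agent "a1": tail a log file into HDFS via a memory channel
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# exec source: run a command and turn its output lines into events
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# memory channel: buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# HDFS sink: writes events out to date-partitioned paths
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.channel = c1
```

An agent like this would be started with `flume-ng agent -n a1 -f <this file>`; failover or load-balancing sink groups would be layered on top of the same structure.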

2 Fluentd

Fluentd is another open source data collection framework, as shown in Figure 1. Fluentd is developed in C and Ruby and uses JSON to unify log data. Through its rich set of plug-ins, it can collect logs from various systems and applications and then classify them according to user-defined rules. With Fluentd, operations such as tailing log files, filtering them, and transferring them to MongoDB are easy to implement. Fluentd can free people from cumbersome log handling.
Figure 1 Fluentd architecture
Fluentd has several notable characteristics: easy installation, a small footprint, semi-structured data logging, a flexible plug-in mechanism, reliable buffering, and log forwarding. Treasure Data provides support and maintenance for the product. The adoption of JSON as a unified data/log format is another distinguishing feature. Compared with Flume, Fluentd's configuration is also relatively simple.
Fluentd has very good scalability: customers can write their own input/buffer/output plug-ins in Ruby. Fluentd does have cross-platform limitations and does not support the Windows platform.
Fluentd's input/buffer/output is very similar to Flume's source/channel/sink. The Fluentd architecture is shown in Figure 2.
Figure 2 Fluentd architecture
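The input/buffer/output model can be illustrated with a minimal Fluentd configuration sketch covering the tail-filter-MongoDB flow mentioned above. The paths and tag names are hypothetical, and the `mongo` output assumes the fluent-plugin-mongo plug-in is installed:

```
# tail input plug-in: follow an access log and parse each line
<source>
  @type tail
  path /var/log/httpd/access.log
  pos_file /var/log/td-agent/access.log.pos
  tag apache.access
  <parse>
    @type apache2
  </parse>
</source>

# match output plug-in: route events tagged apache.* to MongoDB
<match apache.*>
  @type mongo
  host localhost
  port 27017
  database apache
  collection access
</match>
```

Each event flows from a `<source>` through optional `<filter>` sections into the `<match>` whose tag pattern it satisfies, buffered in between.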

3 Logstash

Logstash is the L in the famous open source data stack ELK (Elasticsearch, Logstash, Kibana). Because Logstash is developed in JRuby, it relies on the JVM at runtime. A deployment architecture for Logstash is shown in Figure 3; of course, this is only one deployment option.

Figure 3 deployment architecture of logstash
A typical Logstash configuration looks like the following, with input, filter, and output sections.

input {
    file {
        type => "apache-access"
        path => "/var/log/apache2/other_vhosts_access.log"
    }
    file {
        type => "apache-error"
        path => "/var/log/apache2/error.log"
    }
}
filter {
    grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
        match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
}
output {
    stdout {}
    redis {
        data_type => "list"
        key => "logstash"
    }
}

In most cases, ELK is used together as a stack. When your data system already uses Elasticsearch, Logstash is the natural first choice.
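When the full ELK stack is used, the output section shown above would typically point at Elasticsearch instead of (or alongside) Redis. A minimal sketch, with a hypothetical host and index name:

```
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "apache-%{+YYYY.MM.dd}"
    }
}
```

The `%{+YYYY.MM.dd}` pattern creates one index per day, which is the common convention for time-series log data in Elasticsearch.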

4 Chukwa

Chukwa is another Apache open source data collection platform, far less well known than the others. Chukwa is built on Hadoop's HDFS and MapReduce (and implemented in Java), inheriting their scalability and reliability. It provides many modules to support log analysis for Hadoop clusters, as well as data display, analysis, and monitoring. The project is currently inactive.
Chukwa meets the following needs:
(1) Flexible, dynamic and controllable data source.
(2) High performance, highly scalable storage system.
(3) Appropriate architecture for analyzing the collected large-scale data.
Chukwa architecture is shown in Figure 4.
Figure 4 Chukwa architecture

5 Scribe

Scribe is a data (log) collection system developed by Facebook; its official repository has not been maintained for many years. Scribe provides a scalable, fault-tolerant solution for the distributed collection and unified processing of logs. When the network or the machines of the central storage system fail, Scribe diverts the logs to local disk or another location; when the central storage system recovers, Scribe resends the diverted logs to it. Scribe is usually combined with Hadoop: logs are pushed into HDFS, and Hadoop processes them periodically through MapReduce jobs.
Scribe architecture is shown in Figure 5.
Figure 5 Scribe architecture
The Scribe architecture is relatively simple. It mainly includes three parts: the Scribe agent, the Scribe server, and the storage system.
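The failover behavior described above, writing locally while the central store is down and resending when it recovers, is expressed in Scribe as a buffer store with a network primary and a file secondary. A minimal sketch of a scribe.conf; the host name and paths are hypothetical:

```
port=1463

<store>
category=default
type=buffer
retry_interval=30

# primary: forward log messages to the central Scribe server
<primary>
type=network
remote_host=central-scribe.example.com
remote_port=1463
</primary>

# secondary: spill to local files while the primary is unreachable
<secondary>
type=file
fs_type=std
file_path=/tmp/scribe
base_filename=default
</secondary>
</store>
```

While the primary is down, messages accumulate in the secondary file store; once the network store is reachable again, the buffer store replays them.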

6 Splunk

Among commercial big data platform products, Splunk provides complete capabilities for data collection, storage, analysis and processing, and presentation. Splunk is a distributed machine data platform with three main roles. The Splunk architecture is shown in Figure 6.
Figure 6 Splunk architecture
Search Head: responsible for searching and processing data, providing information extraction at search time.
Indexer: responsible for data storage and indexing.
Forwarder: responsible for collecting, cleaning, and transforming data and sending it to the indexer.
Splunk has built-in support for syslog, TCP/UDP, and file spooling. Users can also obtain specific data by developing scripted inputs and modular inputs. Splunk's app repository offers many mature data collection apps, such as those for AWS and databases (DB Connect), which make it easy to pull data from the cloud or from databases into Splunk's platform for analysis.
Both the search head and the indexer support cluster configuration, i.e. high availability and scalability. However, Splunk offers no clustering for forwarders: if a forwarder fails, data collection is interrupted, and running collection tasks cannot fail over to another forwarder.
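As an illustration of the forwarder role, a forwarder is typically configured with an inputs.conf describing what to collect and an outputs.conf describing which indexers to send to. A hedged sketch with hypothetical paths and hosts:

```
# inputs.conf: monitor a log directory
[monitor:///var/log/httpd]
sourcetype = access_combined

# outputs.conf: distribute events across two indexers
[tcpout:primary_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
```

Listing several indexers lets the forwarder balance its output across them, but, as noted above, this does not protect against the forwarder itself failing.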

7 Scrapy

Scrapy is Python's best-known crawler framework: a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl web sites and extract structured data from their pages. Scrapy is widely used for data mining, monitoring, and automated testing.
What is interesting about Scrapy is that it is a framework anyone can easily modify to suit their needs. It also provides base classes for various types of crawlers, such as BaseSpider and sitemap spiders, and recent versions add support for Web 2.0 crawling.
The operation principle of scrapy is shown in Figure 7.
Figure 7 operation principle of Scrapy
The whole data processing flow of Scrapy is controlled by the Scrapy engine. It runs as follows:
(1) When the Scrapy engine opens a domain, the spider processes the domain and obtains the first URL to crawl.
(2) The Scrapy engine takes that first URL from the spider and schedules it as a request in the scheduler.
(3) The Scrapy engine asks the scheduler for the next page to crawl.
(4) The scheduler returns the next URL to the engine, which sends it to the downloader through the downloader middleware.
(5) After the downloader fetches the web page, the response is sent back to the Scrapy engine through the downloader middleware.
(6) The Scrapy engine receives the response from the downloader and sends it to the spider for processing through the spider middleware.
(7) The spider processes the response, returns the crawled items, and sends new requests to the Scrapy engine.
(8) The Scrapy engine puts the crawled items into the item pipeline and sends the new requests to the scheduler.
(9) The system repeats from step (2) until there are no more requests in the scheduler, and then the Scrapy engine disconnects from the domain.
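The nine steps above amount to a simple engine loop: the scheduler holds a queue of requests, the downloader turns requests into responses, and the spider turns responses into items and new requests. The following is a greatly simplified, self-contained Python sketch of that loop, not the real Scrapy API; names like `fetch` and `parse` are stand-ins for the downloader and spider:

```python
from collections import deque

def crawl(start_url, fetch, parse):
    """Minimal engine loop in the spirit of Scrapy's data flow.

    fetch(url) -> response            stands in for the downloader
    parse(response) -> (items, urls)  stands in for the spider
    """
    scheduler = deque([start_url])         # step (2): first URL scheduled
    seen = {start_url}                     # deduplicate requests
    collected = []                         # stands in for the item pipeline
    while scheduler:                       # step (9): loop until empty
        url = scheduler.popleft()          # steps (3)-(4): next request
        response = fetch(url)              # step (5): downloader fetches
        items, new_urls = parse(response)  # steps (6)-(7): spider processes
        collected.extend(items)            # step (8): items to pipeline
        for u in new_urls:                 # step (8): new requests scheduled
            if u not in seen:
                seen.add(u)
                scheduler.append(u)
    return collected

# Toy in-memory "site" to exercise the loop: path -> (items, linked paths)
site = {
    "/":  (["home"],   ["/a", "/b"]),
    "/a": (["item-a"], ["/"]),
    "/b": (["item-b"], []),
}
result = crawl("/", fetch=lambda u: u, parse=lambda r: site[r])
# result collects "home", "item-a", "item-b" in crawl order
```

The real Scrapy engine adds middleware hooks, asynchronous downloading, and pluggable scheduler policies around essentially this flow.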