What Flume is all about

Date: 2020-10-22

When learning a framework, the most reliable sources of information are the official Flume website and the Flume source code.

If reading the English documentation is too slow for you, you can also search for Flume material in Chinese.
This article records the process of understanding Flume and its core technical points, so that you can get up to speed with Flume quickly.

Quick understanding

Flume User Guide

[Overview]Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Flume is a distributed, reliable and available system for collecting, aggregating and moving massive amounts of log data from many different sources to a centralized data store.


Classic application scenarios:

  • Flume collects logs from log servers and streams them to Kafka in real time
  • Flume consumes data from a Kafka cluster and lands it in HDFS

The Flume User Guide page has a top-level menu as well as an on-page menu ("This Page"), so you can quickly jump to what you need (see Appendix 1). Common needs include:

  1. Quickly getting familiar with a specific kind of component, for example how to choose a source
  2. Looking up unfamiliar parameters when defining a component, for example the parameter configuration of a source
  3. Quickly developing a custom component; refer to the [Developer's Guide]

Core concepts

  • Flume flow chart (a minimal agent configuration is sketched after this list)
  1. The source reads from the data source and wraps the data into events (the process method)
  2. Each event then passes through the interceptor chain, where it may be modified or dropped
  3. The event enters the channel selector, which picks one or more channels according to the configured policy
  4. The event is written into the selected channel's buffer (the source pushes events here)
  5. The event reaches the sink processor, which selects a sink according to the configured policy
  6. The event is handed to the selected sink (the sink pulls events by polling)
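
To make these steps concrete, here is a minimal single-agent configuration in the style of the User Guide's introductory example (the agent and component names a1, r1, c1, k1 are arbitrary, and the netcat source plus logger sink are only convenient stand-ins for testing):

# One source, one channel and one sink, all owned by agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: reads lines from a TCP port and wraps each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer that decouples the source from the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: logs events, useful for testing
a1.sinks.k1.type = logger

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Such an agent is typically started with something like bin/flume-ng agent --conf conf --conf-file example.conf --name a1.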

[Figure: Flume agent dataflow (source -> channel -> sink), from the Flume User Guide]

[Data flow model]A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
[Flume Interceptors]Flume has the capability to modify/drop events in-flight. This is done with the help of interceptors. Interceptors are classes that implement org.apache.flume.interceptor.Interceptor interface. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor. Flume supports chaining of interceptors. This is made possible through by specifying the list of interceptor builder class names in the configuration. Interceptors are specified as a whitespace separated list in the source configuration. The order in which the interceptors are specified is the order in which they are invoked. The list of events returned by one interceptor is passed to the next interceptor in the chain. Interceptors can modify or drop events. If an interceptor needs to drop events, it just does not return that event in the list that it returns. If it is to drop all events, then it simply returns an empty list. Interceptors are named components, here is an example of how they are created through configuration:
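
The example referred to above is not reproduced in the quote, so here is a minimal sketch of chaining two built-in interceptors (host and timestamp) on source r1 of agent a1; the header name "hostname" is just an illustrative choice:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
# Interceptors are listed on the source, whitespace separated, and run in the order given
a1.sources.r1.interceptors = i1 i2
# i1: host interceptor, writes the agent's host into an event header
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
# i2: timestamp interceptor, stamps each event with the ingest time
a1.sources.r1.interceptors.i2.type = timestamp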

  • Core concepts (must be understood):
Event: the unit of data transferred through a Flume flow; it carries a byte-array body plus a set of optional header attributes
Agent: the JVM process that hosts the components an event flows through; it can be thought of as a job or a deployment instance
Source: consumes data from an external system and delivers it to the channel
Channel: the staging area between source and sink (comparable to a queue, or a channel in Go) that decouples them and makes the flow asynchronous
Sink: consumes events from the channel and writes them downstream (to another agent or to an external system)
Channel Selector: configured on the source; as the name suggests, the selection policy when there are multiple channels. The default is replicating (an event is copied to every channel); multiplexing is optional (events are routed according to a header attribute)
Sink Processor: configured on a sink group; as the name suggests, the selection strategy when there are multiple sinks: failover or load_balance (with round_robin or random as the balancing strategy)
Serializer: chooses which part of the event is serialized (for example only the body) and the compression format
Interceptor: modifies or even drops events (for example format validation, adding a timestamp, adding a unique ID, or setting a message key based on the content when the sink is Kafka)
  • Remarks:

1. Source and sink run in separate threads.
2. The Event interface exposes a header (a Map) and a body (a byte array); the main implementations include FlumeEvent, JSONEvent, MockEvent, PersistableEvent and SimpleEvent.

Network topology

[Setting multi-agent flow]In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.

  • If multiple agents are chained in series, the previous agent's sink and the next agent's source must both be Avro, with the sink pointing at the hostname (or IP) and port of the next source, as sketched below
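
A sketch of such an Avro hop between two agents (the hostnames, ports and the exec source command are placeholders):

# Agent a1 (on the log server): tails a log file and forwards events over Avro
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.channels.c1.type = memory
# The Avro sink must point at the host and port of the next hop's Avro source
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4545
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Agent a2 (on the collector): an Avro source listening on the same port
a2.sources = r1
a2.channels = c1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.channels.c1.type = memory
a2.sources.r1.channels = c1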

[Figure: two agents connected in series through an Avro sink/source hop, from the Flume User Guide]

[Consolidation]A very common scenario in log collection is a large number of log producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen of agents that write to HDFS cluster.

  • A common scenario: there are many log-collecting agents but only a few agents attached to the storage system, so the former collect the logs and the latter consolidate them (in real production the consolidation tier usually has more than one machine, which means sink groups need to be configured; see the sketch below)
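
A sketch of the sink group configuration on a collection agent that load-balances over two downstream Avro collectors (hostnames and ports are placeholders); using processor.type = failover with per-sink priorities would give failover behaviour instead:

# Two Avro sinks draining the same channel, grouped behind a load-balancing sink processor
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2
a1.sinks.k2.port = 4545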

[Figure: consolidation topology with many web-server agents fanning in to a collector tier that writes to HDFS, from the Flume User Guide]

[Multiplexing the flow]Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

  • Flume supports multiplexing the event flow: by configuring the channel selector, an event can either be replicated to several channels or routed to a specific destination based on its headers (see the sketch below)
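
A sketch of a multiplexing channel selector that routes on an event header (the header name State and the channel mappings are purely illustrative; selector.type = replicating, the default, would instead copy every event to all listed channels):

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
# Route by the value of the "State" header
a1.sources.r1.selector.header = State
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
# Anything that does not match goes to c4
a1.sources.r1.selector.default = c4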

[Figure: a source multiplexing events to multiple channels and destinations, from the Flume User Guide]

How to use it

Basic usage can follow the User Guide, which contains more detailed examples.
Below is a common production example: Kafka -> HDFS.

Check the descriptions of the Kafka source and the HDFS sink as needed (skip them if you already know them).

Kafka Source

[Kafka Source]Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics.
[Security and Kafka Source]Secure authentication as well as data encryption is supported on the communication channel between Flume and Kafka.
SASL_PLAINTEXT – Kerberos or plaintext authentication with no data encryption

  • Description of the main parameters:
kafka.bootstrap.servers: comma-separated list of Kafka brokers in the cluster
kafka.consumer.group.id (default: flume): consumer group ID; to increase parallelism, several agents can share the same group
kafka.topics: comma-separated list of topics to subscribe to
kafka.topics.regex: regex of topics to subscribe to; overrides kafka.topics
kafka.consumer.security.protocol (default: PLAINTEXT): security protocol used when connecting to Kafka
  • For example:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
# Maximum number of events written to the channel in one batch (5000 here)
tier1.sources.source1.batchSize = 5000 
# Maximum time in milliseconds before a batch is written to the channel even if it is not full (2000 ms here)
tier1.sources.source1.batchDurationMillis = 2000 
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
# This regex overrides the kafka.topics subscription above
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# The consumer group is generally named following a flume.<topicName>.<index> convention
tier1.sources.source1.kafka.consumer.group.id = custom.g.id

# Communication between Flume and Kafka supports secure authentication and data encryption
# In enterprises, SASL_PLAINTEXT (Kerberos authentication without data encryption) is the most common choice
tier1.sources.source1.kafka.consumer.security.protocol = SASL_PLAINTEXT
tier1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
tier1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
  • KafkaSource (implemented as a Kafka consumer): only the core logic of doProcess (executed once per batch) is shown here, abridged; see the source code for the detailed logic
public class KafkaSource extends AbstractPollableSource implements Configurable, BatchSizeSupported {

    // Abridged excerpt: fields such as consumer, eventList, batchUpperLimit, maxBatchEndTime,
    // duration, headers and tpAndOffsetMetadata are class members omitted here for brevity
    protected Status doProcess() throws EventDeliveryException {
        // Generate an ID for this batch so the offset commit can be tied to the batch (transactional behaviour)
        final String batchUUID = UUID.randomUUID().toString();
        // The whole batch write is wrapped in the try block as one unit of work; offsets are committed
        // only if everything succeeds, otherwise BACKOFF is returned and nothing is committed
        try {
            // Keep consuming until the batch is full or the batch time window has elapsed;
            // each ConsumerRecord is parsed into an Event and appended to eventList
            while (eventList.size() < batchUpperLimit && System.currentTimeMillis() < maxBatchEndTime) {
                ConsumerRecords<String, byte[]> records = consumer.poll(duration);
                for (ConsumerRecord<String, byte[]> message : records) {
                    byte[] eventBody = message.value();
                    Event event = EventBuilder.withBody(eventBody, headers);
                    eventList.add(event);
                    // Remember the offset to commit for this topic-partition
                    tpAndOffsetMetadata.put(new TopicPartition(message.topic(), message.partition()),
                        new OffsetAndMetadata(message.offset() + 1, batchUUID));
                }
            }
            // The channel processor writes the batch to the channel (internally calls channel.put)
            getChannelProcessor().processEventBatch(eventList);
            // Commit the Kafka offsets only after the channel write succeeded
            consumer.commitSync(tpAndOffsetMetadata);
            return Status.READY;
        } catch (Exception e) {
            return Status.BACKOFF;
        }
    }
}

HDFS Sink

[HDFS Sink]This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated. The HDFS directory path may contain formatting escape sequences that will replaced by the HDFS sink to generate a directory/file name to store the events.

  • Description of the main parameters:
hdfs.path: HDFS directory path (for example hdfs://namenode/flume/webdata/)
hdfs.filePrefix (default: FlumeData): fixed prefix for the files Flume creates in the HDFS directory (the instance IP or host is usually added to make files easy to trace)
hdfs.rollInterval (default: 30): roll the file (close the current file and create a new one) every N seconds; 0 disables time-based rolling
hdfs.rollSize (default: 1024): roll once the file reaches this size in bytes; 0 disables size-based rolling
hdfs.rollCount (default: 10): roll after this many events have been written; 0 disables count-based rolling
hdfs.batchSize (default: 100): number of events written to HDFS per batch
hdfs.codeC: compression codec, for example snappy
  • For example:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# User-defined output path using escape sequences; see the official site for the other escape sequences
a1.sinks.k1.hdfs.path = /user/log/business/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = %[IP]-log-%t
# The leading dot of the suffix is not added automatically
a1.sinks.k1.hdfs.fileSuffix = .snappy
a1.sinks.k1.hdfs.codeC = snappy
# Set rollSize larger (128 MB here) to avoid small files
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.batchSize = 1000 

# The following controls how the output directories are bucketed: here a new directory is created every 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
  • HDFSEventSink, core logic (abridged):
public class HDFSEventSink extends AbstractSink implements Configurable, BatchSizeSupported {

    // Abridged excerpt: fields such as txnEventCount, batchSize, sfWriters, sfWritersLock,
    // bucketWriter and writers are class members omitted here for brevity
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        // The Transaction interface is used directly here; usage is begin -> commit/rollback -> close
        Transaction transaction = channel.getTransaction();
        transaction.begin();
        try {
            // Take up to batchSize events from the channel
            for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
                Event event = channel.take();
                if (event == null) {
                    break;
                }
                // Callback passed to the BucketWriter: when a bucket is closed, drop its
                // writer reference so it can be garbage collected
                WriterCallback closeCallback = new WriterCallback() {
                    @Override
                    public void run(String bucketPath) {
                        synchronized (sfWritersLock) {
                            sfWriters.remove(bucketPath);
                        }
                    }
                };
                // Append the event via the BucketWriter, which delegates to an HDFSWriter;
                // the HDFSWriter implementation is chosen by fileType in the configuration
                // (lookup/creation of the BucketWriter for the event's bucket path is omitted here)
                bucketWriter.append(event);
            }
            // Flush all pending buckets before committing the transaction
            for (BucketWriter bucketWriter : writers) {
                // flush() syncs buffered data in the stream out to the file
                bucketWriter.flush();
            }
            // Commit the transaction; the events are now safely removed from the channel
            transaction.commit();
            return txnEventCount == 0 ? Status.BACKOFF : Status.READY;
        } catch (IOException eIO) {
            transaction.rollback();
            return Status.BACKOFF;
        } catch (Throwable th) {
            transaction.rollback();
            throw new EventDeliveryException(th);
        } finally {
            transaction.close();
        }
    }
}

Epilogue

The aim of this article is to provide a reasonably complete set of implementation ideas, so that development tasks can be finished quickly in any scenario, and to inform the design of an automated Flume transport platform.

Appendix

[Appendix 1: flume user guide page directory]

This Page
Flume 1.9.0 User Guide
Introduction
    Overview
    System Requirements
    Architecture
Setup
    Setting up an agent
    Data ingestion
    Setting multi-agent flow
    Consolidation
    Multiplexing the flow
Configuration
    Defining the flow
    Configuring individual components
    Adding multiple flows in an agent
    Configuring a multi agent flow
    Fan out flow
    SSL/TLS support
    Source and sink batch sizes and channel transaction capacities
    Flume Sources
        Avro Source
        Thrift Source
        Exec Source
        JMS Source
        Spooling Directory Source
        Taildir Source
        Twitter 1% firehose Source (experimental)
        Kafka Source
        NetCat TCP Source
        NetCat UDP Source
        Sequence Generator Source
        Syslog Sources
        HTTP Source
        Stress Source
        Legacy Sources
        Custom Source
        Scribe Source
    Flume Sinks
        HDFS Sink
        Hive Sink
        Logger Sink
        Avro Sink
        Thrift Sink
        IRC Sink
        File Roll Sink
        Null Sink
        HBaseSinks
        MorphlineSolrSink
        ElasticSearchSink
        Kite Dataset Sink
        Kafka Sink
        HTTP Sink
        Custom Sink
    Flume Channels
        Memory Channel
        JDBC Channel
        Kafka Channel
        File Channel
        Spillable Memory Channel
        Pseudo Transaction Channel
        Custom Channel
    Flume Channel Selectors
        Replicating Channel Selector (default)
        Multiplexing Channel Selector
        Custom Channel Selector
    Flume Sink Processors
        Default Sink Processor
        Failover Sink Processor
        Load balancing Sink Processor
        Custom Sink Processor
    Event Serializers
        Body Text Serializer
        “Flume Event” Avro Event Serializer
        Avro Event Serializer
    Flume Interceptors
        Timestamp Interceptor
        Host Interceptor
        Static Interceptor
        Remove Header Interceptor
        UUID Interceptor
        Morphline Interceptor
        Search and Replace Interceptor
        Regex Filtering Interceptor
        Regex Extractor Interceptor
        Example 1:
        Example 2:
    Flume Properties
        Property: flume.called.from.service
Configuration Filters
    Common usage of config filters
    Environment Variable Config Filter
        Example
    External Process Config Filter
        Example
        Example 2
    Hadoop Credential Store Config Filter
        Example
Log4J Appender
Load Balancing Log4J Appender
Security
Monitoring
    Available Component Metrics
        Sources 1
        Sources 2
        Sinks 1
        Sinks 2
        Channels
    JMX Reporting
    Ganglia Reporting
    JSON Reporting
    Custom Reporting
    Reporting metrics from custom components
Tools
    File Channel Integrity Tool
    Event Validator Tool
Topology Design Considerations
    Is Flume a good fit for your problem?
    Flow reliability in Flume
    Flume topology design
    Sizing a Flume deployment
Troubleshooting
    Handling agent failures
    Compatibility
        HDFS
        AVRO
        Additional version requirements
    Tracing
    More Sample Configs
Component Summary
Alias Conventions
