Use of Flume Log Collection Framework

Time: 2019-08-13

Author: foochane

Link to the original text: https://foochane.cn/article/2019062701.html

This article covers Flume installation and deployment, the Flume running mechanism, collecting static files to HDFS, collecting dynamic log files to HDFS, and cascading two agents.

Flume Log Collection Framework

In a complete offline big data processing system, besides the core analysis system built from HDFS + MapReduce + Hive, indispensable auxiliary systems such as data collection, result export and task scheduling are also needed. Convenient open-source frameworks for these auxiliary tools exist in the Hadoop ecosystem, as shown in the figure:

[Figure: auxiliary data collection, export and scheduling tools in the Hadoop ecosystem]

1 Flume Introduction

Flume is a distributed, reliable and highly available system for collecting, aggregating and moving massive amounts of log data. Flume can collect source data in many forms, such as socket packets, files, directories and Kafka, and write the collected data (via sinks) to HDFS, HBase, Hive, Kafka and many other external storage systems.

For common collection requirements, simple Flume configuration is enough.

For special scenarios, Flume also provides good customization and extension capabilities, so it can be used in most day-to-day data collection scenarios.

2 Flume Operating Mechanism

The core role in a Flume collection system is the agent. The collection system is formed by connecting agents one after another, and each agent acts as a data transporter containing three components (a minimal configuration skeleton is sketched after this list):

  • Source: the collection component, which connects to the data source and reads data
  • Sink: the delivery component, which passes data on to the next agent or writes it to the final storage system
  • Channel: the transmission channel component, which buffers data from the source and hands it to the sink
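
As an illustration only, here is a minimal sketch of a single-agent configuration; the agent name a1, the netcat source and the logger sink are assumptions chosen just to keep the example self-contained (the real configurations used in this article appear in the later sections).

# Minimal single-agent sketch: the names a1/r1/c1/k1 are arbitrary
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Assumed example source: listen for lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Buffer events in memory
a1.channels.c1.type = memory

# Assumed example sink: just log received events
a1.sinks.k1.type = logger

# Wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1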

Single agent collects data:

[Figure: data collection with a single agent]

Multi-level agent in series:

[Figure: multiple agents cascaded in series]

3 Flume Installation and Deployment

1. Download the installation package apache-flume-1.9.0-bin.tar.gz and extract it.

2. In the conf folder, add JAVA_HOME to flume-env.sh:

export JAVA_HOME=/usr/local/bigdata/java/jdk1.8.0_211
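
As a sketch of steps 1 and 2, the commands below download the package, extract it and set JAVA_HOME; the download mirror and the installation path /usr/local/bigdata/ are assumptions, adjust them to your environment.

# Download and extract the release (example mirror URL)
wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /usr/local/bigdata/
cd /usr/local/bigdata/apache-flume-1.9.0-bin

# Create flume-env.sh from the bundled template and point it at the JDK
cp conf/flume-env.sh.template conf/flume-env.sh
echo 'export JAVA_HOME=/usr/local/bigdata/java/jdk1.8.0_211' >> conf/flume-env.sh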

3. Add a configuration file describing the collection scheme according to your requirements; the file name can be chosen freely.

See the examples in the following sections.

4. Start Flume

In the test environment:

$ bin/flume-ng agent -c conf/ -f ./dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console

Instructions:

  • -c specifies the directory of Flume's own configuration files, which does not need to be modified
  • -f specifies your own configuration file, here dir-hdfs.conf in the current directory
  • -n specifies which agent to use, i.e. the agent name defined in your configuration file
  • -Dflume.root.logger=INFO,console prints INFO-level logs to the console; this is only for testing, later the logs go to the log file

In production, Flume should be started in the background:

nohup bin/flume-ng agent -c ./conf -f ./dir-hdfs.conf -n agent1 1>/dev/null 2>&1 &
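
To confirm that the background agent started, a check along these lines can be used; the logs/flume.log location is Flume's default from its bundled log4j.properties, so adjust it if you have changed the logging configuration.

# Verify the agent process is alive
ps -ef | grep flume-ng

# Follow Flume's own log file (default location under the installation directory)
tail -f logs/flume.log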

4 Collecting Static Files to HDFS

4.1 Collection Requirements

New files are continuously generated in a particular directory on a server; whenever a new file appears, it needs to be collected into HDFS.

4.2 Add Configuration Files

Create a file named dir-hdfs.conf in the installation directory, then add the configuration information to it.

The first collection agent is named agent1, and all the configuration entries below are prefixed with agent1. The name can be changed to any other value, such as agt1. A single configuration file can hold several agent configurations; when starting Flume, pass the name of the agent you want to run.

According to the requirements, first define the following three elements:

Source component

The source monitors a file directory, using the spooldir source.
The spooldir source has the following characteristics:

  • It monitors a directory and collects the contents of any new file that appears in it
  • The agent automatically appends a suffix to files it has collected: .COMPLETED by default (configurable)
  • Files with the same name must not appear repeatedly in the monitored directory

Sink component

The sink writes to the HDFS file system, using the hdfs sink.

Channel component

The channel can be either a file channel or a memory channel.

# Define the names of three major components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configuring source components
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log/
agent1.sources.source1.fileSuffix = .FINISHED
# Maximum line length per event; any line longer than this is truncated, which loses the excess data
agent1.sources.source1.deserializer.maxLineLength = 5120

# Configuring sink components
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://Master:9000/access_log/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = app_log
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

# roll: rules that control when the output file is rolled (switched to a new one)
## Roll when the file reaches this size (bytes)
agent1.sinks.sink1.hdfs.rollSize = 512000
## Roll after this many events
agent1.sinks.sink1.hdfs.rollCount = 1000000
## Roll after this time interval (seconds)
agent1.sinks.sink1.hdfs.rollInterval = 60

## Rules controlling how the time-based output directories are rounded (here: every 10 minutes)
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute

agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Channel component configuration
agent1.channels.channel1.type = memory
## Maximum number of events the channel can hold
agent1.channels.channel1.capacity = 500000
## Maximum number of events handled in one transaction
agent1.channels.channel1.transactionCapacity = 600

# Binding the connection between source, channel and sink
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

Channel parameter explanation:

  • capacity: the maximum number of events the channel can store
  • transactionCapacity: the maximum number of events received from the source, or delivered to the sink, in a single transaction
  • keep-alive: how long adding an event to, or removing an event from, the channel is allowed to wait

4.3 Start Flume

$ bin/flume-ng agent -c conf/ -f dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console
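
As a quick test, assuming the agent is running with the configuration above, dropping a file into the monitored directory /root/log/ should get it renamed with the .FINISHED suffix and its contents written under the configured HDFS path (the exact date/time subdirectory depends on when the file is collected).

# Put a test file into the monitored directory
echo "hello flume" > /root/log/test-$(date +%s).log

# After a moment the file should be renamed with the .FINISHED suffix
ls /root/log/

# The data should then appear under the configured HDFS path
hdfs dfs -ls -R /access_log/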

5 Collecting Dynamic Log Files to HDFS

5.1 Collection Requirements

For example, the business system uses log4j to write its logs, and the log file keeps growing; the data appended to the log file needs to be collected into HDFS in real time.

5.2 Configuration File

Configuration file name: tail-hdfs.conf
According to the requirements, first define the following three elements:

  • The collection source, i.e. the source: monitor file content updates with an exec source running tail -F file
  • The sink target, i.e. the sink: the HDFS file system, via the hdfs sink
  • The transfer channel between source and sink, i.e. the channel: either a file channel or a memory channel

Configuration file content:


# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/app_weichat_login.log

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://Master:9000/app_weichat_login_log/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.filePrefix = weichat_log
a1.sinks.k1.hdfs.fileSuffix = .dat
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollCount = 1000000
a1.sinks.k1.hdfs.rollInterval = 60

a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute

a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.3 Start Flume

Start command:

bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
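
To simulate a continuously growing log for testing, a loop like the one below can keep appending lines to the file monitored by the exec source above; this is only a test helper, stop it with Ctrl-C.

# Append one line per second to the tailed log file
while true; do
  echo "$(date) user login event" >> /root/app_weichat_login.log
  sleep 1
done

# In another terminal, watch the collected data arrive in HDFS
hdfs dfs -ls -R /app_weichat_login_log/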

6 Cascading Two Agents

The first agent reads data from a tail command and sends it to an Avro port.
On another node, a second agent uses an Avro source to receive the relayed data and sends it on to external storage.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/access.log


# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp-05
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2



# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The second agent receives data from the Avro port and sinks it to HDFS.

Configuration file: avro-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
## The avro source acts as a receiving server
a1.sources.r1.type = avro
a1.sources.r1.bind = hdp-05
a1.sources.r1.port = 4141


# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/taildata/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = tail-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 50
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type; the default is SequenceFile, while DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
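
A possible way to start the two agents is sketched below; the downstream agent on hdp-05 should be started first so the Avro port is listening before the upstream sink connects. The upstream configuration file name tail-avro.conf is an assumption, since the article does not name it.

# On hdp-05: start the downstream agent (Avro source -> HDFS sink)
bin/flume-ng agent -c conf -f avro-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

# On the upstream node: start the agent that tails the file and sends to hdp-05:4141
# (tail-avro.conf is a hypothetical name for the first configuration above)
bin/flume-ng agent -c conf -f tail-avro.conf -n a1 -Dflume.root.logger=INFO,console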
