Link to the original text: https://foochane.cn/article/2019062701.html
Flume Log Collection Framework · Installation and Deployment · Flume Running Mechanism · Collecting Static Files to HDFS · Collecting Dynamic Log Files to HDFS · Two-Agent Cascade
Flume Log Collection Framework
In a complete offline big data processing system, besides the core analysis system built from HDFS + MapReduce + Hive, indispensable auxiliary systems are also needed, such as data collection, result data export, and task scheduling. The Hadoop ecosystem provides convenient open-source frameworks for these auxiliary tools, as shown in the figure:
1 Flume Introduction
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transferring massive amounts of log data. Flume can collect source data in many forms, such as files, folders, socket packets, and Kafka, and can sink the collected data to HDFS, Kafka, and many other external storage systems.
General collection requirements can be met through simple Flume configuration. For special scenarios, Flume also provides good customization and extension capabilities, so it can be used in most routine data collection scenarios.
2 Flume Operating Mechanism
The core role in a Flume distributed system is the agent: a Flume collection system is formed by connecting agents together. Each agent acts as a data courier and contains three components:
- Source: the collection component, which interfaces with the data source to obtain data
- Sink: the sinking component, which passes data to the next-level agent or to the final storage system
- Channel: the transport channel component, which transfers data from the source to the sink
Single agent collects data:
Multiple agents cascaded in series:
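To make the three roles concrete, here is a minimal illustrative sketch of an agent definition in a Flume properties file; the component names (a1, r1, c1, k1) and the netcat source are only examples, not part of the setups described later in this article:

# Name the agent's components (names are arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read lines from a netcat socket (example only)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: write events to the Flume log (example only)
a1.sinks.k1.type = logger

# Wire source -> channel and channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Note that a source can write to multiple channels (hence channels, plural), while a sink reads from exactly one channel (channel, singular).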
3 Flume Installation and Deployment
1. Download the installation package and unzip it.
2. Under the conf folder, adjust the environment configuration as needed (for example, set JAVA_HOME in flume-env.sh).
3. According to the collection requirements, add a configuration file for the collection scheme; the file name can be chosen arbitrarily.
See the examples below for details.
In the test environment, start Flume like this:
$ bin/flume-ng agent -c conf/ -f ./dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console
- -c specifies the directory of configuration files that ship with Flume; you do not need to modify them yourself
- -f specifies your own configuration file (here a path relative to the current folder)
- -n specifies which agent to use, i.e. the agent name defined in your configuration file
- -Dflume.root.logger=INFO,console prints INFO-level logs to the console; this is only for testing, otherwise the logs go to the log file
In production, Flume should be started in the background:
nohup bin/flume-ng agent -c ./conf -f ./dir-hdfs.conf -n agent1 1>/dev/null 2>&1 &
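After starting in the background, you can check that the agent process is alive and watch its output; this is a quick sketch assuming the default log4j configuration, which writes to flume.log under the logs directory of the Flume installation:

# The Flume agent runs as the Java class org.apache.flume.node.Application
jps | grep Application

# Follow the agent's log file (default location under the installation directory)
tail -f logs/flume.log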
4 Collecting Static Files to HDFS
4.1 Collection Requirements
New files are continuously generated in a specific directory on a server; whenever a new file appears, it needs to be collected into HDFS.
4.2 Add Configuration Files
In the installation directory, add a file named dir-hdfs.conf and then add the configuration information to it. The configurations below all hang off the agent name agent1; it can be changed to another value, such as agt1. A single configuration file can contain multiple configuration schemes, and when starting an agent you simply reference the corresponding name.
According to the requirements, first define the following three elements:
- The data source, i.e. source, monitoring a file directory: spooldir, which has the following characteristics:
  - It monitors a directory and collects the contents of any new file that appears in it
  - The agent automatically adds a suffix to files that have been collected (here .FINISHED, per the fileSuffix setting in the configuration below)
  - Files with the same name must not appear repeatedly in the monitored directory
- The sinking target, i.e. sink: the HDFS file system (hdfs sink)
- The transfer channel between source and sink, i.e. channel: a file channel can be used, or a memory channel
# Define the names of the three major components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source component
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log/
agent1.sources.source1.fileSuffix = .FINISHED
# Maximum length of a line in a file. Note: if a line in an event file exceeds this length, it is truncated and the excess data is lost.
agent1.sources.source1.deserializer.maxLineLength = 5120

# Configure the sink component
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://Master:9000/access_log/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = app_log
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

# roll: rolling (file switching) rules that control when a new output file is started
## Roll by file size (bytes)
agent1.sinks.sink1.hdfs.rollSize = 512000
## Roll by number of events
agent1.sinks.sink1.hdfs.rollCount = 1000000
## Roll by time interval
agent1.sinks.sink1.hdfs.rollInterval = 60

## Rules controlling how directories are generated
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Channel component configuration
agent1.channels.channel1.type = memory
## Capacity in events
agent1.channels.channel1.capacity = 500000
## Number of events handled per Flume transaction
agent1.channels.channel1.transactionCapacity = 600

# Bind source, channel and sink together
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
- capacity: the maximum number of events that the channel can store
- transactionCapacity: the maximum number of events that can be taken from the source or delivered to the sink in a single transaction
- keep-alive: the time allowed for adding an event to, or removing it from, the channel
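As noted above, a file channel can be used instead of the memory channel for better durability across agent restarts; a minimal sketch, with assumed local paths, would replace the channel configuration like this:

# File channel: events are persisted on local disk (paths are illustrative)
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir = /root/flume/checkpoint
agent1.channels.channel1.dataDirs = /root/flume/data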
4.3 Start Flume
$ bin/flume-ng agent -c conf/ -f dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console
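Once the agent is running, a quick way to verify the pipeline (assuming the paths in the configuration above and a working HDFS client) is to drop a file into the monitored directory and check the results:

# Create a test file in the monitored spool directory
echo "hello flume" > /root/log/test.log

# After collection, the agent renames the file with the configured suffix
ls /root/log/            # test.log.FINISHED

# Check that the data arrived under the configured HDFS path
hdfs dfs -ls /access_log/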
5 Collecting Dynamic Log Files to HDFS
5.1 Collection Requirements
For example, the business system generates logs with log4j, and the log content keeps growing; the data appended to the log file needs to be collected into HDFS in real time.
5.2 Add Configuration File
Configuration file name: tail-hdfs.conf
According to the requirements, first define the following three elements:
- The collection source, i.e. source, monitoring file content updates: exec with tail -F file
- The sinking target, i.e. sink: the HDFS file system (hdfs sink)
- The transfer channel between source and sink, i.e. channel: a file channel can be used, or a memory channel
Configuration file content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/app_weichat_login.log

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://Master:9000/app_weichat_login_log/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.filePrefix = weichat_log
a1.sinks.k1.hdfs.fileSuffix = .dat
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollCount = 1000000
a1.sinks.k1.hdfs.rollInterval = 60

a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
5.3 Start Flume
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
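To test this pipeline you need something writing to the monitored file; a simple hypothetical generator (the file path matches the exec source command above) is:

# Keep appending a line to the monitored log file every half second
while true; do
  echo "$(date) user login event" >> /root/app_weichat_login.log
  sleep 0.5
done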
6 Cascading Two Agents
The first agent gets data from the tail command and sends it to an Avro port; another node configures an Avro source to relay the data and send it on to external storage. The configuration for the first agent:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/access.log

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp-05
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The collection configuration file for the second agent, which receives data from the Avro port and sinks it to HDFS:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
## The avro source acts as a receiving server
a1.sources.r1.type = avro
a1.sources.r1.bind = hdp-05
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/taildata/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = tail-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 50
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type of the generated files: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
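Start the downstream agent (the one with the avro source) before the upstream one, so the avro sink has something to connect to. A sketch, assuming the two schemes above were saved as tail-avro.conf and avro-hdfs.conf (the file names are illustrative):

# On hdp-05: start the agent that receives on the Avro port and writes to HDFS
bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1

# On the node producing the log: start the agent that tails the file and sends to hdp-05:4141
bin/flume-ng agent -c conf -f conf/tail-avro.conf -n a1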