1. Why does MapReduce sort? Conclusion first
All of us are familiar with how the whole MapReduce process operates. It involves at least three sorts: the quick sort during each spill, the merge sort when spill files are merged, and the merge sort when reduce pulls its data. And sorting is the default behavior, that is, natural ordering by key. So why do it this way? What is the design reason? Conclusion first: to be more stable, and to produce output that satisfies most needs. The former is reflected in using sortShuffle rather than hashShuffle; the latter is reflected in precomputation: knowing the data is already sorted makes subsequent use much more convenient, for example wherever an index comes into play, or when reduce pulls data.
2. MapReduce principle analysis
Before analyzing the design reasons, let's first go through the whole process. In the map phase, output is partitioned according to the predefined partition rules. The map output is first written to an in-memory buffer; when the buffer content reaches a threshold, the result is spilled to disk. Each spill generates a spill file on disk, so one map task may generate multiple spill files, and each spill file is sorted by key as it is written out. After the map writes its last output, a merge operation runs once at the map end, merging and sorting by key (merge + sort), so that within each partition the keys are ordered as a whole. Then the second merge starts, this time on the reduce side, during which data exists both in memory and on disk. Strictly speaking, this merge is not a sort from scratch; like the previous one, it is a merge + sort, except that it organizes multiple files into a whole, and it is this merge that completes the sorting work. Having gone through the whole process, don't you feel that if you implemented a MapReduce framework yourself, you might have considered using a HashMap to output the map content instead?
2.1 Detailed explanation of the MapTask operation mechanism
The whole flow is as follows:
First, the data-reading component (TextInputFormat by default) logically slices the files in the input directory into splits via its getSplits method. The number of splits determines how many MapTasks are started; by default, splits and HDFS blocks correspond one to one.

After the input files are sliced into splits, the RecordReader (LineRecordReader for TextInputFormat) reads the data line by line, using \n as the separator, and returns key/value pairs: key is the byte offset of the first character of each line, and value is the text of that line.

Next comes the map method that the user overrides in a subclass of Mapper: every line the RecordReader reads triggers exactly one call to map.
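To make this concrete, here is a minimal word-count style Mapper sketch (the class name is made up for illustration); the (offset, line) key/value contract matches the RecordReader behavior just described:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// key:   byte offset of the line's first character (set by the record reader)
// value: the text of the line itself
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // map() is invoked once per line delivered by the RecordReader
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // hands the pair to the collect phase
            }
        }
    }
}
```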
After the map logic finishes, each map result is passed through context.write to collect for data collection. Inside collect, the data is partitioned first, by default using the HashPartitioner implementation of the Partitioner interface, which decides which reduce task each piece of output should be handed to. The default hashes the key and takes the result modulo the number of reduce tasks; this merely spreads the load evenly across the reducers. If the user has specific requirements for how keys are routed, the Partitioner can be customized and set on the job.
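As a sketch: the default HashPartitioner boils down to `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and a custom partitioner only needs to override getPartition. The routing rule below is a hypothetical example:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical rule: route keys by their first character so that
// related keys land in the same reduce task.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : s.charAt(0);
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Registered on the job with:
//   job.setPartitionerClass(FirstLetterPartitioner.class);
```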
Next, the data is written into memory. This area of memory is called the ring buffer; it collects map results in batches, reducing the impact of disk IO. Our key/value pairs, along with the partition result, are all written into the buffer. Of course, before being written in, the key and value are serialized into byte arrays.

The ring buffer is actually an array. The array stores the serialized key/value data and metadata about each key/value record, including the partition, the starting position of the key, the starting position of the value, and the length of the value. The ring structure is an abstract concept.
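Purely as a conceptual sketch (not Hadoop's actual MapOutputBuffer, which packs these fields as ints inside the same byte array and handles wrap-around), the per-record metadata looks roughly like this:

```java
// Conceptual sketch only -- each collected record contributes serialized
// key/value bytes plus one small fixed-size metadata entry used by the sort:
final class RecordMeta {
    final int partition; // which reduce task the record belongs to
    final int keyStart;  // offset of the serialized key in the buffer
    final int valStart;  // offset of the serialized value (== end of key)
    final int valLen;    // length of the serialized value

    RecordMeta(int partition, int keyStart, int valStart, int valLen) {
        this.partition = partition;
        this.keyStart = keyStart;
        this.valStart = valStart;
        this.valLen = valLen;
    }
}
// Sorting swaps only these small metadata entries (compare by partition,
// then by key bytes); the serialized records themselves never move.
```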
The buffer has a size limit, 100MB by default. When the map task produces a lot of output, memory could burst, so under certain conditions the data in the buffer must be temporarily written to disk, after which the buffer is reused. This process of writing data from memory to disk is called a Spill, commonly translated as overflow-writing. The spill is carried out by a separate thread and does not affect the thread writing map results into the buffer; the spill thread must not block the output of map results when it starts, so the buffer has a spill ratio, spill.percent, with a default of 0.8. When the data in the buffer reaches the threshold (buffer size * spill.percent = 100MB * 0.8 = 80MB), the spill thread starts and locks these 80MB to carry out the spill, while the map task's output continues to be written into the remaining 20MB; the two do not interfere with each other.
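For reference, a minimal sketch of tuning these two knobs, assuming the Hadoop 2.x+ property names (defaults 100 MB and 0.80):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);           // ring buffer size in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.8f); // spill threshold ratio
```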
When the spill thread starts, it sorts the keys within this 80MB space (Sort). Sorting is the default behavior of the MapReduce model!
If a Combiner is configured, now is the time to use it. Values with the same key are combined, which reduces the amount of data spilled to disk. The Combiner optimizes MapReduce's intermediate results, so it is used multiple times throughout the model.

Which scenarios can use a Combiner, then? From this analysis: the output of the Combiner is the input of the Reducer, so the Combiner must never change the final calculation result. The Combiner should only be used in scenarios where the reduce input key/value types are identical to the reduce output key/value types and the final result is not affected, such as accumulation or taking a maximum. Be careful when using a Combiner: used well, it helps the job's execution efficiency; used badly, it affects the final result of reduce.
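For example, a word-count sum Reducer can double as the Combiner, because summing is commutative and associative and its input and output key/value types match (a sketch; an average, by contrast, must not be combined this way):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // partial sums combine into the same total
        }
        result.set(sum);
        context.write(key, result);
    }
}

// Wiring it up on the job:
//   job.setCombinerClass(SumReducer.class);
```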
Merging spill files: each spill generates a temporary file on disk (it first checks whether there is anything to spill before writing). If the map output is really large and such spills happen many times, many temporary files will accumulate on disk. When the whole data processing is finished, the temporary files on disk are merged into one, since there may be only one final file written to disk; an index file is provided for this file, recording the offset of the data corresponding to each reduce.
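The width of each merge pass, i.e. how many spill files are merged at once, is governed by a single property (default 10); a sketch, continuing the Configuration example above:

```java
// Merge up to 50 streams per pass instead of the default 10:
conf.setInt("mapreduce.task.io.sort.factor", 50);
```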
2.2 Detailed explanation of the ReduceTask operation mechanism
Reduce is roughly divided into three stages: copy, sort, and reduce, with the emphasis on the first two. The copy stage contains an eventFetcher to obtain the list of completed maps, while Fetcher threads copy the data; during this process two merge threads are started, one merging in-memory data to disk and the other merging the data already on disk. Once the data copy is done, the copy stage is complete and the sort stage begins; the sort stage mainly performs the final merge. After it completes comes the reduce stage, which calls the user-defined reduce function to process the data.
2.2.1 Copy phase
Simply pull the data. The Reduce process starts some data copy threads (Fetchers), which request, over HTTP, the files belonging to this reduce from each maptask.
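The number of parallel Fetcher threads is configurable; a sketch assuming the Hadoop 2.x+ property name (default 5):

```java
// Pull map output with 10 parallel copy threads instead of 5:
conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
```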
2.2.2 Merge phase
The merge here is like the merge action on the map side, except that the array stores values copied from different maps. The copied data is first put into a memory buffer, and the buffer size here is more flexible than on the map side. Merge comes in three forms: memory to memory, memory to disk, and disk to disk. By default the first form is not enabled. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts. As on the map side, this is also a spill process; if you have a Combiner set, it is applied here as well, and then numerous spill files are generated on disk. The second merge form keeps running until there is no more data arriving from the map side, and then the third form, disk-to-disk merge, starts and generates the final file.
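The thresholds that trigger the memory-to-disk merge are tunable too; a sketch using the usual Hadoop 2.x+ property names (defaults per the Hadoop documentation):

```java
// Share of reducer heap used to hold copied map output (default 0.70):
conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
// Fill ratio of that buffer at which the memory-to-disk merge starts (default 0.66):
conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
```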
2.2.3 Merge sort
After the scattered data has been merged into one large file, the merged data is sorted once more. The reduce method is then called on the sorted key/value pairs: the key/value pairs sharing the same key trigger one call to the reduce method, each call produces zero or more key/value pairs, and finally these output key/value pairs are written to an HDFS file.
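Tying the pieces together, here is a minimal driver sketch reusing the hypothetical WordCountMapper and SumReducer from the earlier examples; the reduce output key/value pairs land under the HDFS output path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);  // from the earlier sketch
        job.setCombinerClass(SumReducer.class);     // safe: sum is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```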
Having looked at how MapReduce executes, let's consider why it sorts, and why the shuffle is a sortShuffle.

From the design point of view, a maptask and a reducetask are two completely different processes running on YARN, and processes can interact only through memory or disk. To decouple the two programs and better implement the failure-and-retry mechanism, we cannot do it the way Kafka hands messages directly from producers to consumers: problems such as blocking would appear, and the cluster must not get stuck. The data MapReduce runs on is massive, so we should let the data processed on the map side land on disk as far as possible, while still keeping the whole job as fast as we can. So at the end of the map we hand the sorted data and an index file to reduce: although this sacrifices a certain amount of CPU, reduce can pull its data as quickly as possible. And once map has finished executing, in theory the reducetask can continue running even after a crash and still complete the whole task.

At the same time, why not hashShuffle? Because with big data, hashShuffle occupies a great deal of memory, which is likely to blow up memory and make cluster computing unstable.
Wu Xie, Little Third Master, is a rookie in the fields of big data and artificial intelligence.
Please follow for more.