Why do Hadoop and Spark sort keys?


1. Thinking

Anyone familiar with how Hadoop MapReduce works knows that the whole process involves at least three sorts: the quick sort performed on each spill, the merge sort when spill files are merged, and the merge sort when reduce pulls the data. Sorting is the default behavior, i.e. natural ordering of keys. So why design it this way? Conclusion first: to make the job more stable, and to produce output that satisfies most needs. Stability is reflected in the choice of sort shuffle over hash shuffle; the usefulness of the output is reflected in precomputation, since already-sorted data is much more convenient for downstream use, for example as the basis of the index consulted when reduce pulls its data.


2. MapReduce principle analysis

Before analyzing the design reasons, first walk through the whole process. In the map phase, output is partitioned according to the rules of the predefined Partitioner. map first writes its output into a cache; when the cached content reaches a threshold, the result is spilled to the hard disk, and every spill generates one spill file on disk. A map task may therefore generate multiple spill files, and during each spill the data is sorted by key. Next comes the shuffle stage: when map writes its last output, a merge is performed once at the end of map, which merge-sorts (merge + sort) by partition and by key within each partition, so that each partition is ordered by key as a whole. Then the second merge starts, this time on the reduce side; during this period data lives both in memory and on disk. Strictly speaking, the merge at this stage is not a full sort; like the previous one it is a merge + sort, except that here multiple already-sorted files are merged as a whole, which completes the sorting work. Having walked through the process: if you were implementing a MapReduce framework yourself, would you consider using a HashMap to output the map contents?
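The three sorts described above can be simulated with a minimal, hypothetical sketch in plain Python (not the Hadoop API): `heapq.merge` stands in for the merge sorts, and the record data is illustrative.

```python
import heapq

# Simulated map output: unsorted (key, value) pairs, more than fit in one "buffer".
records = [("banana", 1), ("apple", 1), ("cherry", 1),
           ("apple", 1), ("banana", 1), ("date", 1)]

# Sort 1: each time the buffer fills, its contents are quick-sorted and spilled.
buffer_size = 2
spills = [sorted(records[i:i + buffer_size])
          for i in range(0, len(records), buffer_size)]

# Sort 2: at the end of the map task, the sorted spill files are merge-sorted
# into one sorted output file per map task.
map_output = list(heapq.merge(*spills))

# Sort 3: the reduce side pulls the sorted outputs of several map tasks and
# merge-sorts them again, so reduce sees each key's values contiguously.
other_map_output = [("apple", 1), ("date", 1)]
reduce_input = list(heapq.merge(map_output, other_map_output))
```

Because every input to each merge is already sorted, the merges are linear passes; no stage after the first spill ever needs a full re-sort.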

2.1 Detailed explanation of the MapTask operation mechanism

The whole flow chart is as follows:


Detailed steps:

  1. First, the reading component InputFormat (TextInputFormat by default) logically slices the files in the input directory into splits via its getSplits method. One MapTask is launched per split. By default, a split corresponds one-to-one to a block.

  2. After the input file is sliced into splits, a RecordReader object (LineRecordReader by default) reads them, using \n as the separator: it reads one line at a time and returns a pair in which key is the byte offset of the first character of the line and value is the text of the line.

  3. The data read from the split is handed to the user's subclass of Mapper, which executes the user-overridden map function. map is called once for each line RecordReader reads.

  4. After the map logic finishes, each map result is passed to context.write, which performs collect (data collection). In collect the data is first partitioned, with HashPartitioner used by default. MapReduce exposes the Partitioner interface, which decides, based on the key (or value) and the number of reduce tasks, which reduce task a given output pair should be handed to. The default hashes the key and takes it modulo the number of reduce tasks, which merely spreads the load evenly across reducers. If this average distribution does not suit the application, the user can write a custom Partitioner and set it on the job.

  5. Next the data is written into memory, into an area called the ring buffer. The buffer collects map results in batches, reducing the impact of disk IO. Both the key/value pair and its partition number are written into the buffer; of course, the key and value are serialized into byte arrays before being written.

    • The ring buffer is actually a byte array that holds both the serialized key/value data and metadata about each pair, including its partition, the start position of the key, the start position of the value, and the length of the value. The ring structure itself is an abstract concept.

    • The buffer has a size limit, 100MB by default. When a map task produces a lot of output, memory could be exhausted, so under certain conditions the buffered data must be temporarily written to disk and the buffer reused. This process of writing data from memory to disk is called a Spill, which translates as overflow. The spill is carried out by a separate thread and does not affect the thread that writes map results into the buffer; the spill thread must not block the output of map results while it runs, so the buffer has a spill ratio, spill.percent, 0.8 by default. When the data in the buffer reaches the threshold (buffer size * spill.percent = 100MB * 0.8 = 80MB), the spill thread starts and locks those 80MB to carry out the spill, while the MapTask keeps writing its output into the remaining 20MB; the two do not interfere with each other.

  6. When the spill thread starts, it sorts the keys within those 80MB of space (Sort). Sorting is the default behavior of the MapReduce model!

    • If the job has a Combiner configured, this is when it is applied: the values of key/value pairs with the same key are aggregated, which reduces the amount of data spilled to disk. The Combiner optimizes the intermediate data volume of MapReduce, so it may be applied several times throughout the model.

    • In which scenarios can a Combiner be used? From this analysis, the output of the Combiner becomes the input of the Reducer, so the Combiner must never change the final result. A Combiner should only be used when the reduce input key/value and output key/value have the same types and applying it does not affect the final result, such as summation or taking a maximum. Be careful when using a Combiner: used well it helps job efficiency, used badly it affects the final result of reduce.

  7. Merging spill files: each spill generates a temporary file on disk (with the combiner applied before writing, if one is configured). If the map output really is large and such spills occur many times, many temporary files accumulate on disk. When all the data has been processed, the temporary files on disk are merged into one final file, which is written to disk together with an index file recording the offset of the data corresponding to each reduce task.
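To make steps 1–3 above concrete, here is a minimal, hypothetical sketch in plain Python (not the Hadoop API) of what LineRecordReader hands to map: the byte offset of each line as the key, and the line text as the value.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, as LineRecordReader does for a split."""
    offset = 0
    for line in data.split(b"\n"):
        yield offset, line.decode("utf-8")
        offset += len(line) + 1  # +1 accounts for the '\n' separator

split = b"hello world\nhadoop sorts keys\nso does spark"
records = list(line_records(split))
# records[1] is (12, "hadoop sorts keys"): the second line starts at byte 12.
```

Each (offset, line) pair is what the user-overridden map function receives, once per line.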
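Steps 4–7 can likewise be sketched as a toy spill pipeline, again in plain Python: a deterministic hash stands in for HashPartitioner (Hadoop itself uses key.hashCode() & Integer.MAX_VALUE), the combiner is a sum, and all names here are illustrative rather than Hadoop's API.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCES = 2

def partition(key: str) -> int:
    # Stand-in for HashPartitioner: deterministic hash of the key,
    # modulo the number of reduce tasks.
    return zlib.crc32(key.encode()) % NUM_REDUCES

def spill(buffer):
    """Sort one buffer by (partition, key), then apply a summing combiner."""
    buffer = sorted(buffer, key=lambda kv: (partition(kv[0]), kv[0]))
    return [(key, sum(v for _, v in group))
            for key, group in groupby(buffer, key=itemgetter(0))]

pairs = [("spark", 1), ("hadoop", 1), ("spark", 1),
         ("sort", 1), ("hadoop", 1), ("sort", 1)]

# Two spills of three records each, as if the buffer filled twice.
spills = [spill(pairs[:3]), spill(pairs[3:])]

# Final merge into one file ordered by (partition, key), plus an index
# recording where each reduce task's slice of the output begins.
merged = list(heapq.merge(*spills, key=lambda kv: (partition(kv[0]), kv[0])))
index = {}
for i, (k, _) in enumerate(merged):
    index.setdefault(partition(k), i)
```

Because the final file is ordered by (partition, key), each reducer can seek straight to its slice via the index and pull a contiguous, already-sorted byte range.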

2.2 Detailed explanation of the ReduceTask operation mechanism


Reduce is roughly divided into three stages: copy, sort, and reduce, with the emphasis on the first two. The copy stage contains an eventFetcher that obtains the list of completed maps, while Fetcher threads copy the data. During this process two merge threads run, inMemoryMerger and onDiskMerger, which respectively merge in-memory data to disk and merge the data already on disk. When the data copy is done, the copy stage is complete and the sort stage begins. The sort stage mainly performs the finalMerge operation, the pure sort stage; after it completes comes the reduce stage, which calls the user-defined reduce function to process the data.
Detailed steps:

2.2.1 Copy phase

This stage simply pulls the data: the Reduce process starts several data-copy threads (Fetcher), which request the output files of each MapTask via HTTP.

2.2.2 Merge phase

The merge here is like the map-side merge action, except that what is being merged are the values copied from the different map ends. The copied data is first put into an in-memory buffer, whose sizing here is more flexible than on the map side. Merge takes three forms: memory-to-memory, memory-to-disk, and disk-to-disk. The first form is not enabled by default. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts; like on the map side, this is also a spill process, and if a Combiner is set it is applied here as well. Many spill files are then generated on disk. This second merge mode keeps running until there is no more data coming from the map end, and then the third, disk-to-disk merge starts to generate the final file.
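The reduce-side behavior can be sketched in plain Python under an assumed in-memory threshold (the threshold, list-based "disk", and names are all illustrative): copied map outputs accumulate in memory, are merged out when the threshold is hit, and the on-disk runs are merged once at the end.

```python
import heapq

MEM_THRESHOLD = 4  # pretend the in-memory buffer holds at most 4 records

in_memory, on_disk = [], []

def copy_from_map(sorted_run):
    """Copy one map task's sorted output; merge memory to 'disk' when full."""
    global in_memory
    in_memory.extend(sorted_run)
    if len(in_memory) >= MEM_THRESHOLD:    # the memory-to-disk merge (spill-like)
        on_disk.append(sorted(in_memory))  # each on-disk run is itself sorted
        in_memory = []

for run in ([("a", 1), ("c", 1)], [("b", 1), ("d", 1)], [("a", 1), ("e", 1)]):
    copy_from_map(run)

# Final disk-to-disk merge (plus whatever is left in memory) produces the
# single sorted stream that feeds the sort and reduce phases.
final_stream = list(heapq.merge(*on_disk, sorted(in_memory)))
```

The point of the sketch: no matter how many map outputs arrive or how often memory spills, every intermediate run stays sorted, so the final pass is a cheap k-way merge rather than a full sort.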

2.2.3 Merge sort

After the scattered data has been merged into one large dataset, the merged data is sorted once more. The sorted key/value pairs are then handed to the reduce method: all values sharing the same key are passed in a single call to reduce, each call produces zero or more output key/value pairs, and these output pairs are finally written to a file on HDFS.
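Because the merged stream is sorted, grouping same-key values for reduce is a single linear pass. A minimal sketch (itertools.groupby stands in for Hadoop's grouping, and the reduce here is a word-count sum):

```python
from itertools import groupby
from operator import itemgetter

sorted_stream = [("apple", 1), ("apple", 1), ("banana", 1),
                 ("cherry", 1), ("cherry", 1)]

def reduce_fn(key, values):
    # One call per key, exactly because the input arrives sorted by key.
    return key, sum(values)

output = [reduce_fn(k, (v for _, v in group))
          for k, group in groupby(sorted_stream, key=itemgetter(0))]
# output: [("apple", 2), ("banana", 1), ("cherry", 2)]
```

Without the upstream sort, grouping would instead require holding every key in memory at once, which is exactly the hash-shuffle weakness the summary below discusses.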

3. Summary

Looking at the MapReduce execution process, why sort, and why use SortShuffle for the shuffle? From a design point of view, the MapTask and the ReduceTask are two completely different processes running on YARN, and processes can only interact through memory or disk. To decouple the two programs and better support the failure-and-retry mechanism, we cannot couple them the way Kafka couples producers and consumers, where one side can block the other; the cluster must not get stuck. The data MapReduce runs on is huge, so we should let the data processed on the map side land on disk as much as possible, while still making the whole job as fast as we can. So at the end of map we hand reduce the sorted data together with an index file; although this sacrifices some CPU, it lets reduce pull its data as quickly as possible. And once a map has finished execution, even if its node then goes down, the reduce tasks can in theory keep running and complete the whole job. As for why not HashShuffle: with big data, HashShuffle occupies a lot of memory and is likely to blow up memory, making cluster computation unstable.
Wu Xie, "Little Third Master", is a rookie in the fields of big data and artificial intelligence. Please follow for more.