Shuffle of Hadoop


The Hadoop framework performs partitioning, sorting, and passing of the map output as input to the reducers; this process is called shuffle. Shuffle is the data processing that takes place after the map method and before the reduce method.

Map side

Each map task processes one input split, and its output is not simply written to disk. It goes through the following process:

  1. Each map task has a circular memory buffer for storing its output. By default the buffer is 100 MB, which can be changed with the mapreduce.task.io.sort.mb property. Once the contents reach a threshold (default 0.80, or 80%), a background thread starts to spill the contents to disk. While the spill takes place, the map output continues to be written to the buffer, but if the buffer fills up during this time, the map blocks until the spill to disk is complete.
  2. Before writing to disk, the thread first divides the data into partitions corresponding to the reducers that the data will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key. If a combiner function is defined, it runs on the sorted output. Running the combiner makes the map output more compact, reducing the amount of data written to disk and transferred to the reducer; for example, a combiner can merge records that share the same key.
  3. Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there may be several spill files.
  4. Before the map task completes, the spill files are merged (using merge sort) into a single output file that is partitioned and sorted within each partition. It is often a good idea to compress the map output as it is written to disk, because this makes writing to disk faster, saves disk space, and reduces the amount of data transferred to the reducer. By default the output is not compressed, but compression is enabled by setting mapreduce.map.output.compress to true; the compression library to use is specified by mapreduce.map.output.compress.codec.
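The partition assignment in step 2 can be sketched in plain Java. The demo class and sample keys below are made up for illustration, but the one-line formula mirrors the logic of Hadoop's default HashPartitioner: mask off the sign bit of the key's hash, then take the remainder modulo the number of reduce tasks.

```java
// Sketch of how map output keys are assigned to reducer partitions.
// HashPartitionDemo is a hypothetical name; the formula follows
// Hadoop's default HashPartitioner.
public class HashPartitionDemo {
    // Returns the partition (reducer index) for a key: mask the sign
    // bit so the result is non-negative, then mod the reducer count.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "apple"};
        int numReducers = 3;
        for (String k : keys) {
            System.out.println(k + " -> partition " + getPartition(k, numReducers));
        }
    }
}
```

Because the partition depends only on the key, the two "apple" records always land in the same partition, which is what guarantees that one reducer sees all values for a given key.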


Reduce side

The shuffle on the reduce side consists of the following phases:

  1. Each reduce task fetches the data for its own partition number from every map task's machine; the data for one partition may come from the output files of many different map tasks. The reduce task starts several copier threads (fetchers) that request the map output files over HTTP from the tasktracker on each map task's node. Because the map tasks may have finished long before, these files are managed by the tasktracker on its local disk.
  2. The reduce task merges these files (with merge sort), first in the memory buffer and then into larger files on disk. It should be emphasized that the merge takes three forms: 1) memory to memory, 2) memory to disk, and 3) disk to disk. By default the first form is not enabled. When the amount of data in memory reaches a threshold, the memory-to-disk merge starts; this second form keeps running until no more data is arriving from the map side. Finally, the disk-to-disk merge runs to produce the final file that serves as the input to the reduce method.
  3. Once the input file for the reduce method has been generated, the shuffle is finished, and the reduce task's logic begins: it takes one key and its group of values from the file and calls the user-defined reduce() method.
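All three merge forms above rely on the same operation: a k-way merge of already-sorted runs. A minimal sketch in plain Java (no Hadoop APIs; the class name and sample data are hypothetical) merges sorted "spill" segments with a priority queue, the way merge sort combines sorted runs:

```java
import java.util.*;

// Sketch of the k-way merge performed during shuffle: several sorted
// segments (spill files) are combined into one sorted stream.
public class MergeDemo {
    static List<Integer> mergeSegments(List<List<Integer>> segments) {
        // Each heap entry is {value, segmentIndex, offsetWithinSegment};
        // the heap is ordered by value, so the smallest head wins.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[]{segments.get(s).get(0), s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            int next = top[2] + 1; // advance within the winning segment
            List<Integer> seg = segments.get(top[1]);
            if (next < seg.size()) {
                heap.add(new int[]{seg.get(next), top[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three sorted "spill files" for the same partition.
        List<List<Integer>> spills = Arrays.asList(
            Arrays.asList(1, 4, 7),
            Arrays.asList(2, 5, 8),
            Arrays.asList(3, 6, 9));
        System.out.println(mergeSegments(spills)); // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Because each segment is already sorted, only the head of each segment needs to be compared at any moment, so merging k segments of n total records costs O(n log k) rather than a full re-sort.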


References

  1. Hadoop: The Definitive Guide (4th Edition)