1. Shuffle mechanism
1) Shuffle is the processing that takes place after the map method and before the reduce method.
2) After the map method, each record first passes through the partition method, which tags it with its target partition, and is then written to the ring buffer. The default size of the ring buffer is 100 MB. When the buffer is 80% full, its contents are spilled to disk. Before each spill the data is sorted within each partition by key in dictionary (lexicographic) order, using quicksort. Spilling can produce a large number of spill files, which must be merge-sorted together. A combiner can also be applied to the spill files, provided the operation is an aggregation such as a sum, whose result does not change when partial results are combined early; an average cannot be computed this way. Finally, the merged output is stored on disk by partition, where it waits for the reduce side to pull it.
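The map-side steps above can be sketched in a few lines of Python. This is only a toy model, not Hadoop's implementation: the names `partition`, `spill`, and `map_side` are hypothetical, the buffer is counted in records rather than bytes, and the partitioner is a simple byte-sum hash chosen so the result is easy to check by hand. In Hadoop itself the buffer size and spill threshold are controlled by the `mapreduce.task.io.sort.mb` and `mapreduce.map.sort.spill.percent` properties.

```python
def partition(key, num_reducers):
    """Toy partitioner (assumption): byte sum of the key modulo the reducer count."""
    return sum(key.encode("utf-8")) % num_reducers


def spill(buffer, combine=None):
    """Sort buffered (partition, key, value) records by partition, then key.

    If a combine function is given, adjacent records that share the same
    partition and key are folded into one record (the combiner step).
    """
    ordered = sorted(buffer, key=lambda r: (r[0], r[1]))
    if combine is None:
        return ordered
    combined = []
    for part, key, value in ordered:
        if combined and combined[-1][:2] == (part, key):
            combined[-1] = (part, key, combine(combined[-1][2], value))
        else:
            combined.append((part, key, value))
    return combined


def map_side(records, num_reducers, buffer_capacity=10, spill_ratio=0.8, combine=None):
    """Return a list of spill 'files', each a sorted list of (partition, key, value)."""
    threshold = int(buffer_capacity * spill_ratio)
    spills, buffer = [], []
    for key, value in records:
        buffer.append((partition(key, num_reducers), key, value))
        if len(buffer) >= threshold:  # buffer reached the spill threshold
            spills.append(spill(buffer, combine))
            buffer = []
    if buffer:  # flush whatever is left when the map output ends
        spills.append(spill(buffer, combine))
    return spills


# Word-count style input: capacity 5 records, so a spill happens at 4 records.
words = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
spills = map_side(words, num_reducers=2, buffer_capacity=5,
                  combine=lambda x, y: x + y)
```

With this input the first spill holds `("b", 2)` for partition 0 and `("a", 1)`, `("c", 1)` for partition 1, and the second spill holds the remaining `("a", 1)`. Summation is safe as a combiner because adding partial sums gives the same total. Averaging is not: if one spill holds the values 1 and 2 and another holds 3, averaging the per-spill averages gives (1.5 + 3) / 2 = 2.25, while the true average is 2.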
3) Each reduce task pulls the data for its own partition from the map side. Pulled data is held in memory first and written to disk when memory runs short. Once all data has been pulled, the in-memory and on-disk data are merge-sorted together. Before the data enters the reduce method, it can be grouped by key.
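The reduce-side merge and grouping can be illustrated with a similar toy sketch. It assumes every pulled run is already a list of `(key, value)` pairs sorted by key, and it ignores the memory-versus-disk distinction; `reduce_side` is a hypothetical name, not a Hadoop API.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_side(sorted_runs, reduce_fn):
    """Merge already-sorted runs, group by key, and call reduce_fn once per key."""
    # heapq.merge performs a k-way merge of sorted inputs without re-sorting them.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    results = []
    # groupby relies on equal keys being adjacent, which the merge guarantees.
    for key, group in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append((key, reduce_fn(key, values)))
    return results


# Two runs pulled from different map tasks for the same partition.
runs = [[("a", 1), ("c", 1)], [("a", 1), ("b", 2)]]
totals = reduce_side(runs, lambda key, values: sum(values))
```

Here `totals` is `[("a", 2), ("b", 2), ("c", 1)]`. A merge sort fits this stage because every run arrives already sorted from the map side, so the reducer only has to interleave the runs.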