Home >Common Problem >What is the process of sorting performed by the system called?
MapReduce ensures that the input of each reducer is sorted by key, and the system performs the sorting process called shuffle. The shuffle phase mainly includes combine, group, sort, partition in the map phase and merge sorting in the reducer phase.
The operating environment of this tutorial: Windows 7 system, Dell G3 computer.
MapReduce ensures that the input of each reducer is sorted by key, and the system performs the sorting process called shuffle. We can understand it as the entire project where map generates output and digests input to reduce.
Map side: Each mapperTask has a ring memory buffer used to store the output of the map task. Once the threshold is reached, a background thread writes the content to a new overflow write file in the specified directory on the disk. , partition, sort, and combiner must be passed before writing to disk. After the last record is written, merge all overflow written files into one partitioned and sorted file.
Reduce side: can be divided into copy phase, sorting phase, reduce phase
Copy phase: The map output file is located on the local disk of the tasktracker running the map task, and reduce obtains the output through http For the partition of the file, tasktracker runs the reduce task for the partition file. As long as a map task is completed, the reduce task starts to copy the output.
Sorting phase: A more appropriate term is the merging phase, because sorting is performed on the map side. This stage will merge the map output, maintain its order, and loop.
The final stage is the reduce stage. The reduce function is called for each key in the sorted output. The output of this stage is written directly to the output file system, usually HDFS. ,
Shuffle phase description
The shuffle phase mainly includes combine, group, sort, partition in the map phase and merge sorting in the reducer phase. After shuffling in the Map stage, the output data will be saved in files according to the reduce partitions, and the file contents will be sorted according to the defined sort. After the Map phase is completed, the ApplicationMaster will be notified, and then the AM will notify the Reduce to pull the data, and perform the shuffle process on the reduce side during the pulling process.
Note: The output data of the Map stage is stored on the disk running the Map node. It is a temporary file and does not exist on HDFS. After Reduce pulls the data, the temporary file will be deleted. If it exists on HDFS, it will be deleted. This causes a waste of storage space (three copies will be generated).
User-defined Combiner
Combiner can reduce the number of intermediate output results in the Map stage and reduce network overhead. By default, there is no Combiner. The user-defined Combiner is required to be a subclass of Reducer. The output
You can set the combiner processing class through job.setCombinerClass. The MapReduce framework does not guarantee that the method of this class will be called.
Note: If the input and output of reduce are the same, you can directly use the reduce class as combiner
User-defined Partitioner
Partitioner is used Determine which node is the processing reducer corresponding to the
You can specify the Partitioner class through the job.setPartitionerClass method. By default, HashPartitioner is used (the key's hashCode method is called by default).
User-defined Group
GroupingComparator is used to group the
Our custom class is required to implement the self-interface RawComparator, and the comparison class can be specified through the job.setGroupingComparatorClass method. By default a WritableComparator is used, but the key's compareTo method is ultimately called for comparison.
User-defined Sort
SortComparator is the key class used to key sort the
Our custom class is required to implement the self-interface RawComparator, and the comparison class can be specified through the job.setSortComparatorClass method. By default a WritableComparator is used, but the key's compareTo method is ultimately called for comparison.
User-defined Reducer's Shuffle
When the reduce side pulls the output data of the map, shuffle (merge sort) will be performed, and the MapReduce framework is provided in plug-in mode In a customized way, we can specify custom shuffle rules by implementing the interface ShuffleConsumerPlugin and specifying the parameter mapreduce.job.reduce.shuffle.consumer.plugin.class, but in general, the default class org is used directly. apache.hadoop.mapreduce.task.reduce.Shuffle.
For more programming-related knowledge, please visit: Programming Video! !
The above is the detailed content of What is the process of sorting performed by the system called?. For more information, please follow other related articles on the PHP Chinese website!