What is the process of sorting performed by the system called?-Common Problem-php.cn

Home

Common Problem

What is the process of sorting performed by the system called?

青灯夜游

Apr 25, 2021 pm 05:10 PM

hadoopshuffle

MapReduce ensures that the input of each reducer is sorted by key, and the system performs the sorting process called shuffle. The shuffle phase mainly includes combine, group, sort, partition in the map phase and merge sorting in the reducer phase.

What is the process of sorting performed by the system called?

The operating environment of this tutorial: Windows 7 system, Dell G3 computer.

MapReduce ensures that the input of each reducer is sorted by key, and the system performs the sorting process called shuffle. We can understand it as the entire project where map generates output and digests input to reduce.

Map side: Each mapperTask has a ring memory buffer used to store the output of the map task. Once the threshold is reached, a background thread writes the content to a new overflow write file in the specified directory on the disk. , partition, sort, and combiner must be passed before writing to disk. After the last record is written, merge all overflow written files into one partitioned and sorted file.

Reduce side: can be divided into copy phase, sorting phase, reduce phase

Copy phase: The map output file is located on the local disk of the tasktracker running the map task, and reduce obtains the output through http For the partition of the file, tasktracker runs the reduce task for the partition file. As long as a map task is completed, the reduce task starts to copy the output.

Sorting phase: A more appropriate term is the merging phase, because sorting is performed on the map side. This stage will merge the map output, maintain its order, and loop.

The final stage is the reduce stage. The reduce function is called for each key in the sorted output. The output of this stage is written directly to the output file system, usually HDFS. ,

Shuffle phase description

The shuffle phase mainly includes combine, group, sort, partition in the map phase and merge sorting in the reducer phase. After shuffling in the Map stage, the output data will be saved in files according to the reduce partitions, and the file contents will be sorted according to the defined sort. After the Map phase is completed, the ApplicationMaster will be notified, and then the AM will notify the Reduce to pull the data, and perform the shuffle process on the reduce side during the pulling process.

Note: The output data of the Map stage is stored on the disk running the Map node. It is a temporary file and does not exist on HDFS. After Reduce pulls the data, the temporary file will be deleted. If it exists on HDFS, it will be deleted. This causes a waste of storage space (three copies will be generated).

User-defined Combiner

Combiner can reduce the number of intermediate output results in the Map stage and reduce network overhead. By default, there is no Combiner. The user-defined Combiner is required to be a subclass of Reducer. The output of the Map is used as the input and output of the Combiner. That is to say, the input and output of the Combiner must it's the same.

You can set the combiner processing class through job.setCombinerClass. The MapReduce framework does not guarantee that the method of this class will be called.

Note: If the input and output of reduce are the same, you can directly use the reduce class as combiner
User-defined Partitioner

Partitioner is used Determine which node is the processing reducer corresponding to the output by the map. The default MapReduce task reduce number is 1. At this time, the Partitioner actually has no effect. However, when we change the number of reduce to multiple, the Partitioner will determine the node number of the reduce corresponding to the key (starting from 0).

You can specify the Partitioner class through the job.setPartitionerClass method. By default, HashPartitioner is used (the key's hashCode method is called by default).
User-defined Group

GroupingComparator is used to group the output by Map into > The key class, to put it bluntly, is used to determine whether key1 and key2 belong to the same group. If they are the same group, the output values of the map are combined.

Our custom class is required to implement the self-interface RawComparator, and the comparison class can be specified through the job.setGroupingComparatorClass method. By default a WritableComparator is used, but the key's compareTo method is ultimately called for comparison.
User-defined Sort

SortComparator is the key class used to key sort the output by Map. To put it bluntly, it is used for Determine which group key1 belongs to and which group key2 belongs to comes first and which one comes last.

Our custom class is required to implement the self-interface RawComparator, and the comparison class can be specified through the job.setSortComparatorClass method. By default a WritableComparator is used, but the key's compareTo method is ultimately called for comparison.
User-defined Reducer's Shuffle

When the reduce side pulls the output data of the map, shuffle (merge sort) will be performed, and the MapReduce framework is provided in plug-in mode In a customized way, we can specify custom shuffle rules by implementing the interface ShuffleConsumerPlugin and specifying the parameter mapreduce.job.reduce.shuffle.consumer.plugin.class, but in general, the default class org is used directly. apache.hadoop.mapreduce.task.reduce.Shuffle.

For more programming-related knowledge, please visit: Programming Video! !

The above is the detailed content of What is the process of sorting performed by the system called?. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn