The three core components of Hadoop are HDFS, MapReduce, and YARN. In brief: 1. HDFS is a distributed file system used to store large amounts of data in a Hadoop cluster; it is highly fault tolerant, stores data across multiple DataNodes, and provides high-throughput data access. 2. MapReduce is used for parallel processing of large-scale data sets; it splits a big data job into many small tasks, processes them in parallel on multiple nodes, and finally merges the results. 3. YARN is responsible for allocating and managing cluster resources.
The three core components of Hadoop are HDFS (distributed file storage), MapReduce (distributed computing) and YARN (resource scheduling).
1. HDFS: Hadoop Distributed File System
HDFS (Hadoop Distributed File System) is a core subproject of Hadoop and is mainly responsible for storing and reading the cluster's data. HDFS is a distributed file system with a master/slave architecture. It supports a traditional hierarchical file organization: users or applications can create directories and then store files in them. The hierarchical namespace is similar to that of most existing file systems, and files can be created, read, updated, and deleted through file paths. Because of its distributed nature, however, it differs from traditional file systems in significant ways.
HDFS advantages (a minimal Java sketch of the file API follows this list):
- High fault tolerance. HDFS automatically keeps multiple replicas of uploaded data, and fault tolerance can be raised by increasing the number of replicas. If a replica is lost, HDFS automatically re-creates it on another machine, without users having to worry about how this is implemented.
- Suitable for big data processing. HDFS can handle gigabytes, terabytes, and even petabytes of data, and can store files numbering in the millions. (1 PB = 1024 TB, 1 TB = 1024 GB)
- Streaming data access. HDFS uses a streaming data access model to store very large files: write once, read many times. Once a file has been written it cannot be modified, only appended to, which keeps the data consistent.
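As an illustration of the write-once, read-many model and of working with files through paths, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are placeholders for illustration, not values from the article or from a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster's NameNode (the address is a placeholder).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/demo/hello.txt");

        // Write once: create the file and write a line of text.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: open the file and print its contents.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }

        // Inspect metadata: block size and replication factor of the stored file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size = " + status.getBlockSize()
                + ", replicas = " + status.getReplication());

        fs.close();
    }
}
```

The replication factor reported at the end is what HDFS maintains automatically: if a DataNode holding one replica fails, the NameNode schedules a new copy on another node without any action from the client.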
2. MapReduce: Large-scale data processing
MapReduce is Hadoop's core computing framework. It is a programming model for parallel operations on large-scale data sets (greater than 1 TB) and consists of two parts: Map (mapping) and Reduce (reduction).
When a MapReduce job starts, the Map side reads data from HDFS, maps it into key-value pairs of the required type, and passes them to the Reduce side. The Reduce side receives the key-value pairs from the Map side, groups them by key, processes each group of values sharing the same key, and writes the resulting new key-value pairs to HDFS. This is the core idea of MapReduce.
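To make the key-value flow concrete, below is a minimal word-count sketch (the classic MapReduce example, used here purely for illustration): the Mapper receives each line keyed by its byte offset and emits (word, 1) pairs, and the Reducer sums the counts for each word.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map side: the input key is the line's byte offset in the file, the value is the line itself.
    // Each line is split into words and emitted as (word, 1) pairs.
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce side: after the shuffle/sort, all counts for the same word arrive together
    // and are summed into a single (word, total) pair that is written to HDFS.
    public static class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```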
A complete MapReduce run includes data input and splitting, Map-stage processing, the shuffle/sort, Reduce-stage processing, and data output (a driver sketch that ties these stages together follows the list):
- Read input data. Data in a MapReduce job is read from HDFS. When a file is uploaded to HDFS, it is generally divided into blocks of 128 MB, so by default each block produces one Map task when the MapReduce program runs; the number of Map tasks can also be adjusted by changing the split size. At run time, the file is re-divided into splits according to the configured split size, and each split corresponds to one Map task.
- Map stage. A job has one or more Map tasks, determined by the default block count or by the number of splits. In the Map stage, data is read as key-value pairs: the key is generally the byte offset of the first character of the line from the start of the file, and the value is the content of that line. The key-value pairs are processed as required, mapped into new key-value pairs, and passed to the Reduce side.
- Shuffle/Sort stage. This stage covers the process from the Map output to its transfer to Reduce as input. Output records that share the same key within the same Map task are first combined to reduce the amount of data transferred, and the combined data is then sorted by key.
- Reduce stage. There can also be multiple Reduce tasks; based on the partitions defined in the Map stage, each partition is processed by one Reduce task. Each Reduce task receives data from several different Map tasks, and the data from each Map task is already sorted. A Reduce task reduces all values that share the same key and writes the result to HDFS as a new key-value pair.
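A driver that wires these stages together might look like the sketch below. It assumes the WordCount mapper and reducer from the earlier sketch, and the input/output paths are placeholders. Lowering the maximum split size and setting the number of reduce tasks correspond to the split-size and partition knobs described above; the exact values shown are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and Reducer from the sketch above (assumed to be on the classpath).
        job.setMapperClass(WordCount.WordMapper.class);
        job.setReducerClass(WordCount.WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One map task per input split; a max split size below the 128 MB block size
        // produces more, smaller splits and therefore more map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // One reduce task per partition of the map output.
        job.setNumReduceTasks(2);

        // Input and output locations on HDFS (placeholders).
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```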
3. YARN: Resource Manager
Hadoop's redesigned MapReduce architecture is known as YARN (Yet Another Resource Negotiator). It separates resource management from the computation model and serves as the cluster's more efficient resource management core.
YARN consists mainly of three modules: the Resource Manager (RM), the Node Manager (NM), and the Application Master (AM); a small client sketch that queries the Resource Manager follows this list:
- The Resource Manager is responsible for monitoring, allocating, and managing all cluster resources;
- The Application Master is responsible for scheduling and coordinating a single specific application;
- The Node Manager is responsible for maintaining and managing each individual node.
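As a small, read-only illustration of talking to the Resource Manager, the sketch below uses the YarnClient API to list the cluster's running Node Managers and the applications currently tracked (each of which has an Application Master). It assumes the cluster configuration (yarn-site.xml) is available on the classpath; it is a sketch, not a full application submission.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        // Picks up the Resource Manager address from yarn-site.xml on the classpath.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Each NodeReport describes one Node Manager known to the Resource Manager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
        }

        // Applications currently tracked by the Resource Manager.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```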