Optimize HDFS data access efficiency: use data heat and coldness to manage
Jan 15, 2024

Topics covered:

  1. HDFS storage optimization features
  2. SSM system architecture design
  3. SSM application scenario analysis
1. Background

With the development and popularization of big data technologies, more and more companies are building platform systems on open source Hadoop, and more and more businesses and applications are migrating from traditional technical architectures to these big data platforms. In a typical Hadoop big data platform, HDFS serves as the core of the storage services.

In the early days of big data, the main application scenario was offline batch processing, and the storage requirement was throughput; HDFS was designed for exactly such scenarios. As the technology has evolved, however, more and more scenarios place new demands on storage, and HDFS faces new challenges in several areas:

1. Data volume problem

On the one hand, business growth and newly onboarded applications bring more data into HDFS; on the other hand, with the development of deep learning, artificial intelligence and related technologies, users want to retain data for longer periods to improve model quality. The rapid growth in data volume forces the cluster to expand continuously, driving storage costs ever higher.

2. Small file problem

As we all know, HDFS was designed for offline batch processing of large files; handling small files is not a scenario traditional HDFS is good at. The root cause of the HDFS small file problem is that file metadata is maintained in the memory of a single NameNode, and the memory of a single machine is always limited; it is estimated that a single-NameNode cluster can hold at most about 150 million files. In practice, an HDFS platform usually serves as the underlying storage for multiple upper-layer computing frameworks and business scenarios, so small files are unavoidable from a business perspective. Solutions such as HDFS Federation address the NameNode single-point scalability problem, but they also bring considerable operation and maintenance complexity.

3. Hot and cold data issues

As data continues to grow and accumulate, it also shows huge differences in access popularity. For example, a platform continuously writes new data, and recently written data is usually accessed far more often than data written long ago. Applying the same storage strategy regardless of whether data is hot or cold wastes cluster resources. How to optimize the HDFS storage system based on the hotness and coldness of data is therefore an urgent problem.

2. Existing HDFS optimization technology

More than 10 years have passed since the birth of Hadoop, and during that time HDFS itself has been continuously optimized and evolved. Some existing techniques can solve the problems above to a certain extent. Here is a brief introduction to HDFS heterogeneous storage and HDFS erasure coding.

HDFS heterogeneous storage:

Hadoop has supported heterogeneous storage since version 2.6.0. By default, HDFS stores three replicas of each data block on the disks of different nodes. Heterogeneous storage uses the different types of storage media available on a server (HDD, SSD, memory, etc.) to provide additional storage policies (for example, of the three replicas, one is stored on SSD and the remaining two stay on HDD), making HDFS storage more flexible and efficient in responding to various application scenarios.

The storage types predefined and supported in HDFS include:

  • ARCHIVE: storage media with high density but low power consumption, such as tape, usually used for cold data

  • DISK: disk media, the earliest storage medium supported by HDFS

  • SSD: solid-state drives, a newer storage medium now used by many Internet companies

  • RAM_DISK: data is written into memory, and another replica is written asynchronously to persistent storage


Storage strategies supported in HDFS include:

  1. Lazy_Persist: one replica is kept in memory (RAM_DISK) and the remaining replicas on disk

  2. All_SSD: all replicas are kept on SSD

  3. One_SSD: one replica is kept on SSD and the remaining replicas on disk

  4. Hot: all replicas are kept on disk; this is the default storage policy

  5. Warm: one replica is kept on disk and the remaining replicas on archive storage

  6. Cold: all replicas are kept on archive storage

In general, the value of HDFS heterogeneous storage is that it applies different strategies according to data popularity to improve the cluster's overall resource efficiency. Frequently accessed data can be kept wholly or partly on high-performance media (memory or SSD) to improve read and write performance, while rarely accessed data can be kept on archive media to reduce storage cost. However, configuring heterogeneous storage requires users to specify the policy for each directory, which means users must know the access popularity of the files in each directory in advance; in real big data platforms this is difficult.
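For example, with Hadoop's standard storage-policy tooling a policy can be assigned to a directory and existing blocks migrated to match it. A minimal sketch (the directory path is illustrative):

  # Assign the One_SSD policy to a directory (path is illustrative)
  hdfs storagepolicies -setStoragePolicy -path /data/hot -policy ONE_SSD
  # Verify the policy now in effect on the directory
  hdfs storagepolicies -getStoragePolicy -path /data/hot
  # Migrate already-written blocks to media that satisfy the new policy
  hdfs mover -p /data/hot

Note that setting a policy only affects where newly written blocks are placed; the mover must be run to relocate existing blocks.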

HDFS erasure coding:

Traditional HDFS uses a three-replica mechanism to ensure data reliability: for every 1 TB of data stored, the cluster actually holds 3 TB, an extra overhead of 200%. This puts great pressure on disk storage and network transmission.

Hadoop 3.0 introduced erasure coding support at the HDFS file-block level, implemented underneath with the Reed-Solomon (k, m) algorithm. RS is a commonly used erasure code: through matrix operations it generates m parity blocks from k data blocks, and different choices of k and m yield different degrees of fault tolerance, making it a relatively flexible erasure coding scheme.


Common configurations are RS(3,2), RS(6,3), and RS(10,4). The k data blocks and m parity blocks form a group, and the loss of any m blocks within the group can be tolerated.

HDFS erasure coding can greatly reduce storage redundancy. Taking RS(3,2) as an example, its redundancy is 67% (two parity blocks for every three data blocks), a large reduction from the 200% of Hadoop's default three-replica scheme. However, erasure coding consumes CPU for both storing and recovering data; it is effectively a trade of time for space, so the most applicable scenario is cold data storage. Cold data is often written once and not accessed for a long time afterwards, so it can be erasure coded to reduce the number of replicas.
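As a sketch with the erasure coding CLI that ships with Hadoop 3 (the directory path is illustrative; RS-6-3-1024k is one of the built-in policy names):

  # List the erasure coding policies known to the cluster
  hdfs ec -listPolicies
  # Enable a built-in Reed-Solomon policy: 6 data blocks plus 3 parity blocks
  hdfs ec -enablePolicy -policy RS-6-3-1024k
  # Apply the policy to a directory holding cold data
  hdfs ec -setPolicy -path /cold -policy RS-6-3-1024k
  # Confirm which policy is in effect on the directory
  hdfs ec -getPolicy -path /cold

Files written under the directory afterwards are stored as erasure-coded block groups instead of three full replicas.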

3. Big data storage optimization: SSM

Both the HDFS heterogeneous storage and the erasure coding technology introduced above presuppose that the user specifies the storage behavior for specific data, which means the user must know which data is hot and which is cold. So is there a way to optimize storage automatically?

The answer is yes. The SSM (Smart Storage Management) system introduced here obtains metadata from the underlying storage (usually HDFS), derives the hotness of data by analyzing read and write access information, and then, following a set of pre-defined rules, applies the appropriate storage optimization strategy to each hotness level to improve the efficiency of the whole storage system. SSM is an open source project led by Intel, with China Mobile also participating in its development. The project is available on GitHub: https://github.com/Intel-bigdata/SSM.

SSM is positioned as a storage-side optimization system with an overall Server-Agent-Client architecture: the Server implements the overall logic of SSM, Agents perform the various operations on the storage cluster, and the Client provides users with data access interfaces, usually including the native HDFS interface.

[Figure: SSM-Server main framework]

The main framework of SSM-Server is shown in the figure above. From top to bottom: StatesManager interacts with the HDFS cluster to obtain metadata and maintain the access popularity of each file; this information is persisted to a relational database, with TiDB used as the underlying database in SSM. RuleManager maintains and manages rule information: users define a series of storage rules through the front end, and RuleManager parses and executes them. CacheManager/StorageManager generates concrete action tasks based on popularity and rules. ActionExecutor is responsible for those tasks, assigning them to Agents, which execute them on the Agent nodes.


The internal logic of SSM-Server relies on rules, which the administrator defines for the system through the front-end web page. A rule consists of several parts:

  • Operation object: usually files that meet specific conditions.
  • Trigger: the point at which the rule fires, for example a scheduled daily trigger.
  • Execution conditions: a set of popularity-based conditions, such as a required access count within a time window.
  • Execution operations: actions applied to the data that meets the conditions, usually setting its storage policy.

An actual rule example:

file.path matches "/foo/*": accessCount(10min) >= 3 | one-ssd

This rule means: for files under the /foo directory, if a file is accessed at least three times within 10 minutes, apply the One_SSD storage policy, that is, keep one replica of the data on SSD and the remaining two replicas on ordinary disks.
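Following the same syntax, other rules can be sketched. The rule below is an illustrative assumption: the all-ssd action mirrors the one-ssd style above, but the exact action names should be verified against the SSM documentation:

file.path matches "/reports/*": accessCount(1day) >= 100 | all-ssd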

4. SSM application scenarios

SSM can apply different storage strategies according to the hotness and coldness of data. The following are some typical application scenarios:

[Figure: rule demoting long-unaccessed data from SSD to HDD]

The most typical scenario concerns cold data. As shown in the figure above, relevant rules are defined so that data that has not been accessed for a long time is moved to lower-cost storage, for example demoting the original data blocks from SSD to HDD.
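A sketch of such a cold-data rule, again mirroring the syntax above; the age condition and archive action are assumptions that should be checked against the SSM documentation:

file.path matches "/logs/*": age > 90day | archive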

[Figure: rules promoting frequently accessed data from HDD to SSD, and the hottest data to memory]

Similarly, hotspot data can be promoted to faster storage according to other rules. As shown in the figure above, data that is accessed heavily within a short period is promoted from HDD to SSD storage, and the hottest data uses the memory storage strategy.


For cold data, SSM can also apply erasure coding optimization: by defining corresponding rules, rarely accessed cold data is erasure coded to reduce its replica redundancy.


It is also worth mentioning that SSM also has a corresponding optimization for small files; this feature is still under development. The general logic is that SSM merges a series of small HDFS files into a large file, and records in SSM's metadata the mapping between each original small file and the merged file, including each small file's offset within the large file. When a user needs to access a small file, the SSM-specific client (SmartClient) consults the mapping information in SSM's metadata and retrieves the original small file from the merged file.
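As a purely illustrative sketch of that mapping (the paths and field names are assumptions, not SSM's actual metadata schema), an entry might look like:

  /data/part-00001 -> (container=/ssm/containers/c-0001, offset=0, length=4096)
  /data/part-00002 -> (container=/ssm/containers/c-0001, offset=4096, length=10240)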

Finally, SSM is an open source project still in rapid iterative evolution, and anyone interested is welcome to contribute to its development.

Q&A

Q1: At what scale should we start building our own HDFS?

A1: HDFS supports a pseudo-distributed mode, so an HDFS system can be built even with a single node. To better experience and understand the distributed architecture of HDFS, an environment of 3 to 5 nodes is recommended.
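As a sketch, the usual minimal pseudo-distributed setup points the default filesystem at a local NameNode and lowers the replication factor to 1 (values follow the standard Hadoop single-node guide; adjust host and port as needed):

core-site.xml:

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

hdfs-site.xml:

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>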

Q2: Has Su Yan put SSM into use on the provinces' production big data platforms?

A2: Not yet. The project is still developing rapidly; it will be rolled out to production gradually once testing shows it is stable.

Q3: What is the difference between HDFS and Spark? What are the pros and cons?

A3: HDFS and Spark are not technologies at the same level: HDFS is a storage system, while Spark is a computing engine. What is usually compared with Spark is the MapReduce computing framework in Hadoop, not the HDFS storage system. In actual projects, HDFS and Spark usually cooperate, with HDFS providing the underlying storage and Spark the upper-layer computation.
