search
HomeDatabaseMysql TutorialHive RCFile的高效存储结构

本文介绍了Facebook公司数据分析系统中的RCFile存储结构,该结构集行存储和列存储的优点于一身,在 MapReduce环境下的大规模数据

本文介绍了Facebook公司数据分析系统中的RCFile存储结构,该结构集行存储和列存储的优点于一身,在 MapReduce环境下的大规模数据分析中扮演重要角色。

Facebook曾在2010 ICDE(IEEE International Conference on Data Engineering)会议上介绍了数据仓库Hive。Hive存储海量数据在Hadoop系统中,提供了一套类数据库的数据存储和处理机制。它采用类 SQL语言对数据进行自动化管理和处理,经过语句解析和转换,最终生成基于Hadoop的MapReduce任务,通过执行这些任务完成数据处理。图1显 示了Hive数据仓库的系统结构。

图1 Hive数据仓库的系统结构

基于MapReduce的数据仓库在超大规模数据分析中扮演了重要角色,对于典型的Web服 务供应商,这些分析有助于它们快速理解动态的用户行为及变化的用户需求。数据存储结构是影响数据仓库性能的关键因素之一。Hadoop系统中常用的文件存 储格式有支持文本的TextFile和支持二进制的SequenceFile等,它们都属于行存储方式。Facebook工程师发表的RCFile: A Fast and Spaceefficient Data Placement Structure in MapReducebased Warehouse Systems一文,介绍了一种高效的数据存储结构——RCFile(Record Columnar File),并将其应用于Facebook的数据仓库Hive中。与传统数据库的数据存储结构相比,RCFile更有效地满足了基于MapReduce的 数据仓库的四个关键需求,即Fast data loading、Fast query processing、Highly efficient storage space utilization和Strong adaptivity to highly dynamic workload patterns。

数据仓库的需求

基于Facebook系统特征和用户数据的分析,在MapReduce计算环境下,数据仓库对于数据存储结构有四个关键需求。

Fast data loading

对于Facebook的产品数据仓库而言,快速加载数据(写数据)是非常关键的。每天大约有超过20TB的数据上传到Facebook的数据仓库, 由于数据加载期间网络和磁盘流量会干扰正常的查询执行,因此缩短数据加载时间是非常必要的。

Fast query processing

为了满足实时性的网站请求和支持高并发用户提交查询的大量读负载,查询响应时间是非常关键的,这要求底层存储结构能够随着查询数量的增加而保持高速 的查询处理。

Highly efficient storage space utilization

高速增长的用户活动总是需要可扩展的存储容量和计算能力,有限的磁盘空间需要合理管理海量数据的存储。实际上,该问题的解决方案就是最大化磁盘空间 利用率。

Strong adaptivity to highly dynamic workload patterns

同一份数据集会供给不同应用的用户,通过各种方式来分析。某些数据分析是例行过程,,按照某种固定模式周期性执行;而另一些则是从中间平台发起的查 询。大多数负载不遵循任何规则模式,这需要底层系统在存储空间有限的前提下,对数据处理中不可预知的动态数据具备高度的适应性,而不是专注于某种特殊的负 载模式。

MapReduce存储策略

要想设计并实现一种基于MapReduce数据仓库的高效数据存储结构,关键挑战是在MapReduce计算环境中满足上述四个需求。在传统数据库 系统中,三种数据存储结构被广泛研究,分别是行存储结构、列存储结构和PAX混合存储结构。上面这三种结构都有其自身特点,不过简单移植这些数据库导向的 存储结构到基于MapReduce的数据仓库系统并不能很好地满足所有需求。

行存储

如图2所示,基于Hadoop系统行存储结构的优点在于快速数据加载和动态负载的高适应能力,这是因为行存储保证了相同记录的所有域都在同一个集群 节点,即同一个HDFS块。不过,行存储的缺点也是显而易见的,例如它不能支持快速查询处理,因为当查询仅仅针对多列表中的少数几列时,它不能跳过不必要 的列读取;此外,由于混合着不同数据值的列,行存储不易获得一个极高的压缩比,即空间利用率不易大幅提高。尽管通过熵编码和利用列相关性能够获得一个较好 的压缩比,但是复杂数据存储实现会导致解压开销增大。

图2 HDFS块内行存储的例子

列存储

图3显示了在HDFS上按照列组存储表格的例子。在这个例子中,列A和列B存储在同一列组,而列C和列D分别存储在单独的列组。查询时列存储能够避 免读不必要的列,并且压缩一个列中的相似数据能够达到较高的压缩比。然而,由于元组重构的较高开销,它并不能提供基于Hadoop系统的快速查询处理。列 存储不能保证同一记录的所有域都存储在同一集群节点,例如图2的例子中,记录的4个域存储在位于不同节点的3个HDFS块中。因此,记录的重构将导致通过 集群节点网络的大量数据传输。尽管预先分组后,多个列在一起能够减少开销,但是对于高度动态的负载模式,它并不具备很好的适应性。除非所有列组根据可能的 查询预先创建,否则对于一个查询需要一个不可预知的列组合,一个记录的重构或许需要2个或多个列组。再者由于多个组之间的列交叠,列组可能会创建多余的列 数据存储,这导致存储利用率的降低。

图3 HDFS块内列存储的例子

更多详情见请继续阅读下一页的精彩内容

Hive 的详细介绍:请点这里
Hive 的下载地址:请点这里

相关阅读:

基于Hadoop集群的Hive安装

Hive内表和外表的区别

Hadoop + Hive + Map +reduce 集群安装部署

Hive本地独立模式安装

Hive学习之WordCount单词统计

linux

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
How to identify and optimize slow queries in MySQL? (slow query log, performance_schema)How to identify and optimize slow queries in MySQL? (slow query log, performance_schema)Apr 10, 2025 am 09:36 AM

To optimize MySQL slow query, slowquerylog and performance_schema need to be used: 1. Enable slowquerylog and set thresholds to record slow query; 2. Use performance_schema to analyze query execution details, find out performance bottlenecks and optimize.

MySQL and SQL: Essential Skills for DevelopersMySQL and SQL: Essential Skills for DevelopersApr 10, 2025 am 09:30 AM

MySQL and SQL are essential skills for developers. 1.MySQL is an open source relational database management system, and SQL is the standard language used to manage and operate databases. 2.MySQL supports multiple storage engines through efficient data storage and retrieval functions, and SQL completes complex data operations through simple statements. 3. Examples of usage include basic queries and advanced queries, such as filtering and sorting by condition. 4. Common errors include syntax errors and performance issues, which can be optimized by checking SQL statements and using EXPLAIN commands. 5. Performance optimization techniques include using indexes, avoiding full table scanning, optimizing JOIN operations and improving code readability.

Describe MySQL asynchronous master-slave replication process.Describe MySQL asynchronous master-slave replication process.Apr 10, 2025 am 09:30 AM

MySQL asynchronous master-slave replication enables data synchronization through binlog, improving read performance and high availability. 1) The master server record changes to binlog; 2) The slave server reads binlog through I/O threads; 3) The server SQL thread applies binlog to synchronize data.

MySQL: Simple Concepts for Easy LearningMySQL: Simple Concepts for Easy LearningApr 10, 2025 am 09:29 AM

MySQL is an open source relational database management system. 1) Create database and tables: Use the CREATEDATABASE and CREATETABLE commands. 2) Basic operations: INSERT, UPDATE, DELETE and SELECT. 3) Advanced operations: JOIN, subquery and transaction processing. 4) Debugging skills: Check syntax, data type and permissions. 5) Optimization suggestions: Use indexes, avoid SELECT* and use transactions.

MySQL: A User-Friendly Introduction to DatabasesMySQL: A User-Friendly Introduction to DatabasesApr 10, 2025 am 09:27 AM

The installation and basic operations of MySQL include: 1. Download and install MySQL, set the root user password; 2. Use SQL commands to create databases and tables, such as CREATEDATABASE and CREATETABLE; 3. Execute CRUD operations, use INSERT, SELECT, UPDATE, DELETE commands; 4. Create indexes and stored procedures to optimize performance and implement complex logic. With these steps, you can build and manage MySQL databases from scratch.

How does the InnoDB Buffer Pool work and why is it crucial for performance?How does the InnoDB Buffer Pool work and why is it crucial for performance?Apr 09, 2025 am 12:12 AM

InnoDBBufferPool improves the performance of MySQL databases by loading data and index pages into memory. 1) The data page is loaded into the BufferPool to reduce disk I/O. 2) Dirty pages are marked and refreshed to disk regularly. 3) LRU algorithm management data page elimination. 4) The read-out mechanism loads the possible data pages in advance.

MySQL: The Ease of Data Management for BeginnersMySQL: The Ease of Data Management for BeginnersApr 09, 2025 am 12:07 AM

MySQL is suitable for beginners because it is simple to install, powerful and easy to manage data. 1. Simple installation and configuration, suitable for a variety of operating systems. 2. Support basic operations such as creating databases and tables, inserting, querying, updating and deleting data. 3. Provide advanced functions such as JOIN operations and subqueries. 4. Performance can be improved through indexing, query optimization and table partitioning. 5. Support backup, recovery and security measures to ensure data security and consistency.

When might a full table scan be faster than using an index in MySQL?When might a full table scan be faster than using an index in MySQL?Apr 09, 2025 am 12:05 AM

Full table scanning may be faster in MySQL than using indexes. Specific cases include: 1) the data volume is small; 2) when the query returns a large amount of data; 3) when the index column is not highly selective; 4) when the complex query. By analyzing query plans, optimizing indexes, avoiding over-index and regularly maintaining tables, you can make the best choices in practical applications.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use