search
HomeDatabaseMysql Tutorial使用RawComparator加速Hadoop程序

在前面两篇文章[1][2]中我们介绍了Hadoop序列化的相关知识,包括Writable接口与Writable对象以及如何编写定制的Writable类,深入的分析了Writable类序列化之后占用的字节空间以及字节序列的构成。我们指出Hadoop序列化是Hadoop的核心部分之一,了解和分析Wri

在前面两篇文章[1][2]中我们介绍了Hadoop序列化的相关知识,包括Writable接口与Writable对象以及如何编写定制的Writable类,深入的分析了Writable类序列化之后占用的字节空间以及字节序列的构成。我们指出Hadoop序列化是Hadoop的核心部分之一,了解和分析Writable类的相关知识有助于我们理解Hadoop序列化的工作方式以及选择合适的Writable类作为MapReduce的键和值,以达到高效利用磁盘空间以及快速读写对象。因为在数据密集型计算中,在网络数据的传输是影响计算效率的一个重要因素,选择合适的Writable对象不但减小了磁盘空间,而且更重要的是其减小了需要在网络中传输的数据量,从而加快了程序的速度。

在本文中我们介绍另外一种方法加快程序的速度,这就是使用RawComparator加速Hadoop程序。我们知道作为键(Key)的Writable类必须实现WritableComparable接口,以实现对键进行排序的功能。Writable类进行比较时,Hadoop的默认方式是先将序列化后的对象字节流反序列化为对象,然后再进行比较(compareTo方法),比较过程需要一个反序列化的步骤。RawComparator的做法是不进行反序列化,而是在字节流层面进行比较,这样就省下了反序列化过程,从而加速程序的运行。Hadoop自身提供的IntWritable、LongWritabe等类已经实现了这种优化,使这些Writable类作为键进行比较时,直接使用序列化的字节数组进行比较大小,而不用进行反序列化。

RawComparator的实现

在Hadoop中编写Writable的RawComparator一般不直接继承RawComparator类,而是继承RawComparator的子类WritableComparator,因为WritableComparator类为我们提供了一些有用的工具方法,比如从字节数组中读取int、long和vlong等值。下面是上两篇文章中我们定制的MyWritable类的RawComparator实现,定制的MyWritable由两个VLongWritable对组成,为了添加RawComparator功能,Writable类必须实现WritableComparable接口,这里不再展示实现了WritableComparable接口的MyWritableComparable类的全部内容,而只是MyWritableComparable类中Comparator的实现,完整的代码可以在github中找到。

...//omitted for conciseness
/**
 * A RawComparator that compares serialized VlongWritable Pair
 * compare method decode long value from serialized byte array one by one
 *
 * @author yoyzhou
 *
 * */
public static class Comparator extends WritableComparator {
	public Comparator() {
		super(MyWritableComparable.class);
	}
	public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
		int cmp = 1;
		//determine how many bytes the first VLong takes
		int n1 = WritableUtils.decodeVIntSize(b1[s1]);
		int n2 = WritableUtils.decodeVIntSize(b2[s2]);
		try {
			//read value from VLongWritable byte array
			long l11 = readVLong(b1, s1);
			long l21 = readVLong(b2, s2);
			cmp = l11 > l21 ? 1 : (l11 == l21 ? 0 : -1);
			if (cmp != 0) {
				return cmp;
			} else {
				long l12 = readVLong(b1, s1 + n1);
				long l22 = readVLong(b2, s2 + n2);
				return cmp = l12 > l22 ? 1 : (l12 == l22 ? 0 : -1);
			}
		} catch (IOException e) {
				throw new RuntimeException(e);
		}
	}
}
static { // register this comparator
	WritableComparator.define(MyWritableComparable.class, new Comparator());
}
...

通过上面的代码我们可以看到要实现Writable的RawComparator我们只需要重载WritableComparator的public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)方法。在我们的例子中,通过从VLongWritable对序列化后字节数组中一个一个的读取VLongWritable的值,再进行比较。

当然编写完compare方法之后,不要忘了为Writable类注册编写的RawComparator类。

总结

为Writable类编写RawComparator必须对Writable本身序列化之后的字节数组有清晰的了解,知道如何从字节数组中读取Writable对象的值,而这正是我们前两篇关于Hadoop序列化和Writable接口的文章所要阐述的内容。

通过以上的三篇文章,我们了解了Hadoop Writable接口,如何编写自己的Writable类,Writable类的字节序列长度与其构成,以及如何为Writable类编写RawComparator来为Hadoop提速。

参考资料

Tom White, Hadoop: The Definitive Guide, 3rd Edition

Hadoop序列化与Writable接口(一)

Hadoop序列化与Writable接口(二)

--EOF--

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Explain the role of InnoDB redo logs and undo logs.Explain the role of InnoDB redo logs and undo logs.Apr 15, 2025 am 12:16 AM

InnoDB uses redologs and undologs to ensure data consistency and reliability. 1.redologs record data page modification to ensure crash recovery and transaction persistence. 2.undologs records the original data value and supports transaction rollback and MVCC.

What are the key metrics to look for in an EXPLAIN output (type, key, rows, Extra)?What are the key metrics to look for in an EXPLAIN output (type, key, rows, Extra)?Apr 15, 2025 am 12:15 AM

Key metrics for EXPLAIN commands include type, key, rows, and Extra. 1) The type reflects the access type of the query. The higher the value, the higher the efficiency, such as const is better than ALL. 2) The key displays the index used, and NULL indicates no index. 3) rows estimates the number of scanned rows, affecting query performance. 4) Extra provides additional information, such as Usingfilesort prompts that it needs to be optimized.

What is the Using temporary status in EXPLAIN and how to avoid it?What is the Using temporary status in EXPLAIN and how to avoid it?Apr 15, 2025 am 12:14 AM

Usingtemporary indicates that the need to create temporary tables in MySQL queries, which are commonly found in ORDERBY using DISTINCT, GROUPBY, or non-indexed columns. You can avoid the occurrence of indexes and rewrite queries and improve query performance. Specifically, when Usingtemporary appears in EXPLAIN output, it means that MySQL needs to create temporary tables to handle queries. This usually occurs when: 1) deduplication or grouping when using DISTINCT or GROUPBY; 2) sort when ORDERBY contains non-index columns; 3) use complex subquery or join operations. Optimization methods include: 1) ORDERBY and GROUPB

Describe the different SQL transaction isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) and their implications in MySQL/InnoDB.Describe the different SQL transaction isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) and their implications in MySQL/InnoDB.Apr 15, 2025 am 12:11 AM

MySQL/InnoDB supports four transaction isolation levels: ReadUncommitted, ReadCommitted, RepeatableRead and Serializable. 1.ReadUncommitted allows reading of uncommitted data, which may cause dirty reading. 2. ReadCommitted avoids dirty reading, but non-repeatable reading may occur. 3.RepeatableRead is the default level, avoiding dirty reading and non-repeatable reading, but phantom reading may occur. 4. Serializable avoids all concurrency problems but reduces concurrency. Choosing the appropriate isolation level requires balancing data consistency and performance requirements.

MySQL vs. Other Databases: Comparing the OptionsMySQL vs. Other Databases: Comparing the OptionsApr 15, 2025 am 12:08 AM

MySQL is suitable for web applications and content management systems and is popular for its open source, high performance and ease of use. 1) Compared with PostgreSQL, MySQL performs better in simple queries and high concurrent read operations. 2) Compared with Oracle, MySQL is more popular among small and medium-sized enterprises because of its open source and low cost. 3) Compared with Microsoft SQL Server, MySQL is more suitable for cross-platform applications. 4) Unlike MongoDB, MySQL is more suitable for structured data and transaction processing.

How does MySQL index cardinality affect query performance?How does MySQL index cardinality affect query performance?Apr 14, 2025 am 12:18 AM

MySQL index cardinality has a significant impact on query performance: 1. High cardinality index can more effectively narrow the data range and improve query efficiency; 2. Low cardinality index may lead to full table scanning and reduce query performance; 3. In joint index, high cardinality sequences should be placed in front to optimize query.

MySQL: Resources and Tutorials for New UsersMySQL: Resources and Tutorials for New UsersApr 14, 2025 am 12:16 AM

The MySQL learning path includes basic knowledge, core concepts, usage examples, and optimization techniques. 1) Understand basic concepts such as tables, rows, columns, and SQL queries. 2) Learn the definition, working principles and advantages of MySQL. 3) Master basic CRUD operations and advanced usage, such as indexes and stored procedures. 4) Familiar with common error debugging and performance optimization suggestions, such as rational use of indexes and optimization queries. Through these steps, you will have a full grasp of the use and optimization of MySQL.

Real-World MySQL: Examples and Use CasesReal-World MySQL: Examples and Use CasesApr 14, 2025 am 12:15 AM

MySQL's real-world applications include basic database design and complex query optimization. 1) Basic usage: used to store and manage user data, such as inserting, querying, updating and deleting user information. 2) Advanced usage: Handle complex business logic, such as order and inventory management of e-commerce platforms. 3) Performance optimization: Improve performance by rationally using indexes, partition tables and query caches.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

DVWA

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.