
Hive User-Defined Functions

Jun 07, 2016, 03:28 PM

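When the built-in functions that Hive provides cannot meet your business processing needs, consider writing a user-defined function (UDF).

Hive currently supports writing user-defined functions only in Java. If you need another language, such as Python, consider the transform syntax covered in the previous section.

Hive supports three kinds of user-defined functions. We cover them one by one.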

UDF

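This is the ordinary user-defined function. It takes a single row as input and produces a single row of output.

Write the following Java code: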

package com.oserp.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class PassExam extends UDF {

    // Maps a score to "Pass" or "Failed".
    public Text evaluate(Integer score) {
        Text result = new Text();
        // ASSUMPTION: the source shows only "if(score" here; the pass mark of 60 is assumed.
        if (score < 60)
            result.set("Failed");
        else
            result.set("Pass");
        return result;
    }
}

Then package it as a .jar file, for example hiveudf.jar.

Run the following statements:

add jar /home/user/hadoop_jar/hiveudf.jar;

create temporary function pass_score as 'com.oserp.hiveudf.PassExam';

select stuNo,pass_score(score) from student;

The output is:

N0101 Pass

N0102 Failed

N0201 Pass

N0103 Pass

N0302 Pass

N0202 Pass

N0203 Pass

N0301 Failed

N0306 Pass

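The first statement registers the jar file; the second creates an alias for the user-defined function; the third calls it.

In the Java code, the function class extends the UDF class and provides an evaluate method, which takes an integer and returns a string. The structure is straightforward. The evaluate method is not declared in an interface, because in practice the number and types of a function's arguments vary.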
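Because evaluate is not part of an interface, the matching overload has to be found by name and argument types at call time. The following is a minimal plain-Java sketch of that idea using reflection. It is not Hive's actual resolver; the class names EvaluateDispatchDemo and PassExamOverloads, the second overload, and the pass mark of 60 are all assumptions made for illustration.

```java
import java.lang.reflect.Method;

class EvaluateDispatchDemo {

    // Hypothetical stand-in for a UDF class; it has no Hive dependency.
    public static class PassExamOverloads {
        // The pass mark of 60 is an assumption made for this sketch.
        public String evaluate(Integer score) {
            return score < 60 ? "Failed" : "Pass";
        }

        // Second overload: the caller supplies the pass mark.
        public String evaluate(Integer score, Integer passMark) {
            return score < passMark ? "Failed" : "Pass";
        }
    }

    // Looks up the evaluate overload matching the runtime argument types and invokes it.
    static Object call(Object udf, Object... args) {
        Class<?>[] argTypes = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            argTypes[i] = args[i].getClass();
        }
        try {
            Method m = udf.getClass().getMethod("evaluate", argTypes);
            return m.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate method", e);
        }
    }

    public static void main(String[] args) {
        PassExamOverloads udf = new PassExamOverloads();
        System.out.println(call(udf, 75));     // prints Pass
        System.out.println(call(udf, 75, 80)); // prints Failed
    }
}
```

A fixed interface would pin evaluate to one signature, whereas lookup by name lets a single class offer several overloads.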

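The UDF name above is case-insensitive. Calling it as PASS_SCORE also works, because it is an alias inside Hive, not the Java class name.

When you are done, drop the function alias with the following statement: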

Drop temporary function pass_score;

UDAF

User-defined aggregate function. It takes multiple rows as input and produces a single row of output, like the MAX and COUNT functions.

Write the following Java code:

package com.oserp.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;

public class HiveAvg extends UDAF {

    public static class AvgEvaluate implements UDAFEvaluator {

        // Partial aggregation state: how many non-null values were seen, and their sum.
        public static class PartialResult {
            public int count;
            public double total;

            public PartialResult() {
                count = 0;
                total = 0;
            }
        }

        private PartialResult partialResult;

        @Override
        public void init() {
            partialResult = new PartialResult();
        }

        public boolean iterate(IntWritable value) {
            // partialResult must be checked for null here, otherwise an error occurs.
            // The reason is that init is called only once; it is not re-run
            // to initialize each partial aggregation.
            if (partialResult == null) {
                partialResult = new PartialResult();
            }
            if (value != null) {
                partialResult.total = partialResult.total + value.get();
                partialResult.count = partialResult.count + 1;
            }
            return true;
        }

        public PartialResult terminatePartial() {
            return partialResult;
        }

        public boolean merge(PartialResult other) {
            partialResult.total = partialResult.total + other.total;
            partialResult.count = partialResult.count + other.count;
            return true;
        }

        public DoubleWritable terminate() {
            return new DoubleWritable(partialResult.total / partialResult.count);
        }
    }
}

Then package it as a jar file, for example hiveudf.jar.

Run the following statements:

add jar /home/user/hadoop_jar/hiveudf.jar;

create temporary function avg_udf as 'com.oserp.hiveudf.HiveAvg';

select classNo, avg_udf(score) from student group by classNo;

The output is as follows:

C01 68.66666666666667

C02 80.66666666666667

C03 73.33333333333333

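[Figure: UDAF processing flow over two input parts, file1 and file2, with the init function boxed in red; from Hadoop: The Definitive Guide]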

With reference to the figure above (from Hadoop: The Definitive Guide), let us look at each function:

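- init is similar to a constructor and initializes the UDAF.

Note the init function in the red box in the figure above. At run time, no matter how many parts Hive splits the record set into (for example, the two parts file1 and file2 in the figure), init is called only once. The example in the figure is therefore misleading, which is why the code above carries an explicit comment about it. Put another way, init should not hold logic that initializes partial aggregation values; it should handle only global setup.

- iterate performs the aggregation. It is called each time a new value is aggregated.

- terminatePartial is called when a partial aggregation is complete, that is, when Hive wants the aggregation result for part of the records.

- merge combines previously obtained partial aggregation results (which can also be thought of as the aggregation results of individual chunks of records).

- terminate returns the final aggregation result.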

As we can see, the parameter type of merge must be the same as the return type of terminatePartial.
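The call sequence described above can be traced without a cluster. The following is a plain-Java simulation that mirrors the evaluator's five methods, with Integer standing in for IntWritable. The names AvgLifecycleDemo, aggregateChunk and average, the sample scores, and the extra null and zero-count guards are assumptions for illustration; how Hive actually schedules these calls is not modeled.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the UDAF call sequence; no Hive classes are used.
class AvgLifecycleDemo {

    // Mirrors PartialResult from the article: a running count and sum.
    static class PartialResult {
        int count;
        double total;
    }

    // Mirrors AvgEvaluate, with Integer in place of IntWritable.
    static class AvgEvaluator {
        private PartialResult partial;

        void init() {
            partial = new PartialResult();
        }

        boolean iterate(Integer value) {
            if (partial == null) partial = new PartialResult();
            if (value != null) {
                partial.total += value;
                partial.count++;
            }
            return true;
        }

        PartialResult terminatePartial() {
            return partial;
        }

        // The parameter type matches the return type of terminatePartial.
        boolean merge(PartialResult other) {
            if (partial == null) partial = new PartialResult();
            if (other != null) {
                partial.total += other.total;
                partial.count += other.count;
            }
            return true;
        }

        // Returns null when no non-null value was aggregated (an added guard).
        Double terminate() {
            if (partial == null || partial.count == 0) return null;
            return partial.total / partial.count;
        }
    }

    // Aggregates one chunk of rows and hands back its partial result.
    static PartialResult aggregateChunk(List<Integer> rows) {
        AvgEvaluator e = new AvgEvaluator();
        e.init();
        for (Integer r : rows) e.iterate(r);
        return e.terminatePartial();
    }

    // Merges the partial results of all chunks and produces the final value.
    static Double average(List<List<Integer>> chunks) {
        AvgEvaluator fin = new AvgEvaluator();
        fin.init();
        for (List<Integer> c : chunks) fin.merge(aggregateChunk(c));
        return fin.terminate();
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks = Arrays.asList(
            Arrays.asList(70, 80),        // hypothetical chunk 1
            Arrays.asList(90, null, 60)); // hypothetical chunk 2, with a NULL score
        System.out.println(average(chunks)); // prints 75.0
    }
}
```

Carrying count and total in the partial result, rather than a per-chunk average, is what keeps merge correct when chunks differ in size.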

UDTF

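User-defined table-generating function. It takes a single row as input and produces multiple rows of output (that is, a table). It is not used especially often, so it is not covered in detail here.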
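The one-row-in, many-rows-out shape can be shown with a small plain-Java sketch. It only illustrates the concept and does not use Hive's UDTF API (in Hive, a table-generating function would extend GenericUDTF instead). The class name ExplodeDemo, the explode method, and the sample row are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ExplodeDemo {

    // Turns one input row (a delimited string) into zero or more output rows.
    static List<String> explode(String row, String delimiter) {
        List<String> out = new ArrayList<>();
        if (row == null || row.isEmpty()) {
            return out; // no input value, so no output rows
        }
        // Pattern.quote treats the delimiter as literal text, not as a regex.
        for (String part : row.split(Pattern.quote(delimiter))) {
            out.add(part.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input: one row holding a comma-separated list of courses.
        System.out.println(explode("math, physics, english", ","));
        // prints [math, physics, english]
    }
}
```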
