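When Hive's built-in functions cannot meet your business processing needs, you can consider writing a user-defined function (UDF).
Hive currently supports writing user-defined functions only in Java. If you need another language, such as Python, consider the transform syntax covered in the previous section.
Hive supports three kinds of user-defined functions; we cover them one by one.
UDF
This is the ordinary user-defined function. It takes a single row as input and produces a single row as output.
Write the following Java code: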
package com.oserp.hiveudf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class PassExam extends UDF {
    public Text evaluate(Integer score)
    {
        Text result = new Text();
        // The comparison was truncated in the source text; a pass mark of 60 is assumed here.
        if (score < 60)
            result.set("Failed");
        else
            result.set("Pass");
        return result;
    }
}
Then package it into a .jar file, for example hiveudf.jar.
Run the following statements:
add jar /home/user/hadoop_jar/hiveudf.jar;
create temporary function pass_score as 'com.oserp.hiveudf.PassExam';
select stuNo,pass_score(score) from student;
The output is:
N0101 Pass
N0102 Failed
N0201 Pass
N0103 Pass
N0302 Pass
N0202 Pass
N0203 Pass
N0301 Failed
N0306 Pass
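The first statement registers the jar file; the second gives the user-defined function an alias; the third calls it.
In the Java code, the function's class extends the UDF class and provides an evaluate method. This method takes an integer as its parameter and returns a string, so the structure is straightforward. The evaluate method is not declared in an interface, because in practice the number and types of a function's parameters vary.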
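Because evaluate is not fixed by an interface, a single class can offer several overloads, and the caller picks one by argument types. The following is a minimal plain-Java sketch of that idea, not Hive's actual resolution code. PassExamSketch and EvaluateResolver are hypothetical names, the class deliberately does not extend Hive's UDF so that it runs without Hive on the classpath, and the default pass mark of 60 is an assumption.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a UDF class with two evaluate overloads.
class PassExamSketch {
    // One-argument form: falls back to an assumed pass mark of 60.
    public String evaluate(Integer score) {
        return evaluate(score, 60);
    }

    // Two-argument form: the caller supplies the pass mark.
    public String evaluate(Integer score, Integer passMark) {
        if (score == null || passMark == null) {
            return null;
        }
        return score < passMark ? "Failed" : "Pass";
    }
}

class EvaluateResolver {
    // Looks up the evaluate overload matching the runtime argument types and invokes it.
    // Simplification: arguments must be non-null and match the declared types exactly.
    static Object call(Object udf, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            Method m = udf.getClass().getMethod("evaluate", types);
            return m.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate method", e);
        }
    }

    public static void main(String[] args) {
        PassExamSketch udf = new PassExamSketch();
        System.out.println(call(udf, 75));     // prints "Pass"
        System.out.println(call(udf, 75, 80)); // prints "Failed"
    }
}
```

An interface would have to commit to one signature, whereas a lookup by name and argument types lets each function choose its own parameter list.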
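The UDF name above is case-insensitive; for example, calling it as PASS_SCORE also works (because it is an alias within Hive, not the Java class name).
When you are done, run the following statement to drop the function alias:
drop temporary function pass_score;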
UDAF
User-defined aggregate function. It takes multiple rows as input and produces a single row as output, like the MAX and COUNT functions.
Write the following Java code:
package com.oserp.hiveudf;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
public class HiveAvg extends UDAF {
    public static class AvgEvaluate implements UDAFEvaluator
    {
        // Holds the running count and sum for one partial aggregation.
        public static class PartialResult
        {
            public int count;
            public double total;
            public PartialResult()
            {
                count = 0;
                total = 0;
            }
        }
        private PartialResult partialResult;
        @Override
        public void init() {
            partialResult = new PartialResult();
        }
        public boolean iterate(IntWritable value)
        {
            // partialResult must be checked for null here, otherwise an error occurs.
            // The reason is that init is called only once; it is not called again
            // to initialize each partial aggregation.
            if (partialResult == null)
            {
                partialResult = new PartialResult();
            }
            if (value != null)
            {
                partialResult.total = partialResult.total + value.get();
                partialResult.count = partialResult.count + 1;
            }
            return true;
        }
        public PartialResult terminatePartial()
        {
            return partialResult;
        }
        public boolean merge(PartialResult other)
        {
            // Same null guard as in iterate; also skip an empty partial result.
            if (other == null)
            {
                return true;
            }
            if (partialResult == null)
            {
                partialResult = new PartialResult();
            }
            partialResult.total = partialResult.total + other.total;
            partialResult.count = partialResult.count + other.count;
            return true;
        }
        public DoubleWritable terminate()
        {
            return new DoubleWritable(partialResult.total / partialResult.count);
        }
    }
}
Then package it into a jar file, for example hiveudf.jar.
Run the following statements:
add jar /home/user/hadoop_jar/hiveudf.jar;
create temporary function avg_udf as 'com.oserp.hiveudf.HiveAvg';
select classNo, avg_udf(score) from student group by classNo;
The output is as follows:
C01 68.66666666666667
C02 80.66666666666667
C03 73.33333333333333
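With reference to the diagram from Hadoop: The Definitive Guide (the figure is not included in this text), let's look at each function:
- init is similar to a constructor and is used to initialize the function.
Note the init function boxed in red in that diagram. At run time, no matter how many parts Hive splits the record set into (for example, the two parts file1 and file2 in the diagram), init is called only once, so the example in the diagram is misleading. This is also why the code above carries a special comment about it. Put another way, init should not hold logic that initializes values for a partial aggregation; it should handle global data logic only.
- iterate performs the aggregation. It is called each time a new value is aggregated.
- terminatePartial is called after a partial aggregation completes, when Hive wants the aggregated result for part of the records.
- merge combines previously obtained partial aggregation results (which can also be understood as the aggregated results of record chunks).
- terminate returns the final aggregation result.
We can see that the parameter type of merge and the return type of terminatePartial must be the same.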
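The call order described above can be traced outside Hive with a small plain-Java model. This is a sketch under stated assumptions, not how Hive schedules the calls: AvgEvaluatorSketch and UdafLifecycleDemo are hypothetical names, Integer and Double stand in for IntWritable and DoubleWritable, and the driver deliberately skips init() on the per-chunk evaluators to mirror the caveat that init is not run for each partial aggregation.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the evaluator; no Hive classes are required.
class AvgEvaluatorSketch {
    static class PartialResult {
        int count;
        double total;
    }

    private PartialResult partial;

    void init() {
        partial = new PartialResult();
    }

    boolean iterate(Integer value) {
        if (partial == null) {          // guard: init() may not have run
            partial = new PartialResult();
        }
        if (value != null) {            // null inputs are ignored
            partial.total += value;
            partial.count++;
        }
        return true;
    }

    PartialResult terminatePartial() {
        return partial;
    }

    // The parameter type matches the return type of terminatePartial().
    boolean merge(PartialResult other) {
        if (other == null) {
            return true;
        }
        if (partial == null) {
            partial = new PartialResult();
        }
        partial.total += other.total;
        partial.count += other.count;
        return true;
    }

    Double terminate() {
        if (partial == null || partial.count == 0) {
            return null;                // no rows: no average
        }
        return partial.total / partial.count;
    }
}

class UdafLifecycleDemo {
    // Hypothetical driver: one evaluator per chunk, then one evaluator merges the partials.
    static Double average(List<List<Integer>> chunks) {
        AvgEvaluatorSketch finalEval = new AvgEvaluatorSketch();
        finalEval.init();
        for (List<Integer> chunk : chunks) {
            AvgEvaluatorSketch chunkEval = new AvgEvaluatorSketch(); // init() skipped on purpose
            for (Integer v : chunk) {
                chunkEval.iterate(v);
            }
            finalEval.merge(chunkEval.terminatePartial());
        }
        return finalEval.terminate();
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks = Arrays.asList(
                Arrays.asList(80, 50, null),
                Arrays.asList(70, 90, 60));
        System.out.println(average(chunks)); // prints 70.0
    }
}
```

Carrying both count and total in PartialResult is what makes the merge correct: averaging the per-chunk averages would weight chunks of different sizes equally.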
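UDTF
User-defined table-generating function. It takes a single row as input and produces multiple rows as output (that is, a table). It is not used very often, so it is not covered in detail here.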
