
Alex's Hadoop Beginner Tutorial: Lesson 10, Getting Started with Hive


Jun 07, 2016, 04:12 PM


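Installing Hive

Many tutorials start with concepts. I prefer to install the thing first and then explain the concepts through examples. So let's install Hive.

First, check that the matching yum repository is installed. If it is not, follow this tutorial to set up the CDH yum repository: http://blog.csdn.net/nsrainbow/article/details/36629339

What Is Hive

Hive gives you a way to query data with SQL. It is best not to use it for real-time queries, though: Hive works by translating each SQL statement into one or more MapReduce jobs, so it is very slow. The official documentation describes Hive as suited to high-latency scenarios, and it is resource-hungry as well.

As a simple example, you can query like this: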

hive> select * from h_employee;
OK
1	1	peter
2	2	paul
Time taken: 9.289 seconds, Fetched: 2 row(s)

Note that h_employee here is not necessarily a database table.
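To make the earlier point about MapReduce a bit more concrete, here is a toy Python sketch of how a query such as select name, count(*) from t group by name breaks down into map, shuffle, and reduce steps. It is not how Hive is implemented internally; the rows and function names are made up for illustration only.

```python
from collections import defaultdict

# Hypothetical (id, name) rows; not taken from a real Hive table.
ROWS = [(1, "peter"), (2, "paul"), (3, "peter")]


def map_phase(rows):
    """Emit one (name, 1) pair per row, as a mapper would for COUNT(*)."""
    return [(name, 1) for _id, name in rows]


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key to produce the final counts."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_name(rows):
    """Run the three phases in order."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


print(count_by_name(ROWS))
```

On a real cluster each phase runs as distributed tasks with job startup overhead, which is a large part of why even small Hive queries take seconds.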

metastore

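Tables created in Hive are all called metastore tables. They do not actually store data. Instead, they define the mapping between the real data and Hive, much like table metadata in a traditional database, hence the name metastore. There are four storage modes to choose from:

Internal table (the default)
Partitioned table
Bucketed table
External table

For example, this statement creates an internal table: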

CREATE TABLE worker(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

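This statement creates an internal table named worker. Internal is the default type, so there is no need to specify a storage mode. Fields are stored with a comma as the delimiter.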
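As a quick sanity check on the '\054' delimiter, the Python sketch below shows that octal 054 is the comma character, and how one delimited text line splits into the table's two columns. The sample line and the parse_row helper are illustrative assumptions, not part of Hive.

```python
# '\054' is an octal escape: 0o54 == 44, the ASCII code of a comma.
DELIMITER = chr(0o54)


def parse_row(line):
    """Split one text line into (id, name), the two columns of the worker table."""
    raw_id, name = line.rstrip("\n").split(DELIMITER)
    return int(raw_id), name


print(repr(DELIMITER))
print(parse_row("1,jack\n"))
```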

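Types supported in CREATE TABLE statements

Primitive types
tinyint / smallint / int / bigint
float / double
boolean
string

Complex types
Array / Map / Struct

There is no datetime type. Depending on the Hive version, date may be missing too: timestamp arrived in Hive 0.8.0 and date in Hive 0.12.0.

Where does a newly created table live?

Under /user/hive/warehouse. You can use hdfs to see where the tables are: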

$ hdfs dfs -ls /user/hive/warehouse
Found 11 items
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee2
drwxrwxrwt   - wlsuser  supergroup          0 2014-12-04 17:21 /user/hive/warehouse/h_employee_export
drwxrwxrwt   - root     supergroup          0 2014-08-18 09:20 /user/hive/warehouse/h_http_access_logs
drwxrwxrwt   - root     supergroup          0 2014-06-30 10:15 /user/hive/warehouse/hbase_apache_access_log
drwxrwxrwt   - username supergroup          0 2014-06-27 17:48 /user/hive/warehouse/hbase_table_1
drwxrwxrwt   - username supergroup          0 2014-06-30 09:21 /user/hive/warehouse/hbase_table_2
drwxrwxrwt   - username supergroup          0 2014-06-30 09:43 /user/hive/warehouse/hive_apache_accesslog
drwxrwxrwt   - root     supergroup          0 2014-12-02 15:12 /user/hive/warehouse/hive_employee

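Each directory corresponds to one metastore table.

Using Each Type of Hive Table

Internal Tables

CREATE TABLE workers(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

This statement creates an internal table named workers with a comma as the field delimiter; \054 is the octal ASCII code for a comma.
We can run show tables; to see which tables exist. Many Hive statements are modeled on MySQL, so when you don't know the syntax, the MySQL statement will usually work. The exception is limit, which is a bit odd; more on that later.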

hive> show tables;
OK
h_employee
h_employee2
h_employee_export
h_http_access_logs
hive_employee
workers
Time taken: 0.371 seconds, Fetched: 6 row(s)

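Now that the table exists, let's try inserting some rows. Be aware that the Hive version used here does not support single-row inserts; data must be loaded in bulk, so don't expect a statement like insert into workers values (1,'jack') to work. Hive supports two ways of inserting data:

Reading data from a file
Reading data from another table and inserting it (insert from select)

Here I load data from a file. First create a file named workers.csv: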

$ cat workers.csv
1,jack
2,terry
3,michael

Use LOAD DATA to import it into the Hive table:

hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;
Copying data from file:/home/alex/workers.csv
Copying file: file:/home/alex/workers.csv
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0]
OK
Time taken: 0.655 seconds

Don't leave out LOCAL. LOAD DATA LOCAL INPATH looks for the source file on your local disk, while LOAD DATA INPATH looks for it on HDFS. Adding OVERWRITE empties the table before loading, for example LOAD DATA LOCAL INPATH '/home/alex/workers.csv' OVERWRITE INTO TABLE workers;

Query the data:

hive> select * from workers;
OK
1	jack
2	terry
3	michael
Time taken: 0.177 seconds, Fetched: 3 row(s)

Let's look at how the data is stored inside the Hive internal table after the load:

# hdfs dfs -ls /user/hive/warehouse/workers/
Found 1 items
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv

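So the file was simply copied in unchanged. It really is that crude! Let's try adding another file, workers2.txt (I changed the extension on purpose; Hive does not look at extensions):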

# cat workers2.txt 
4,peter
5,kate
6,ted

Load it:

hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers;
Copying data from file:/home/alex/workers2.txt
Copying file: file:/home/alex/workers2.txt
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 46, raw_data_size: 0]
OK
Time taken: 0.79 seconds

Check the file layout again:

# hdfs dfs -ls /user/hive/warehouse/workers/
Found 2 items
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
-rwxrwxrwt   2 root supergroup         21 2014-12-08 15:29 /user/hive/warehouse/workers/workers2.txt

A new workers2.txt has appeared. Query with SQL again:

hive> select * from workers;
OK
1	jack
2	terry
3	michael
4	peter
5	kate
6	ted
Time taken: 0.144 seconds, Fetched: 6 row(s)
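The behavior seen above can be mimicked with a small Python sketch: loading is just a file copy into the table directory, and a query parses every file found there at read time. This is only an analogy for what Hive does on HDFS; load_local and select_all are hypothetical helpers, and a temporary local directory stands in for /user/hive/warehouse/workers.

```python
import os
import tempfile


def load_local(file_name, content, table_dir):
    """Mimic LOAD DATA LOCAL INPATH: put the file into the table directory as-is."""
    with open(os.path.join(table_dir, file_name), "w") as f:
        f.write(content)


def select_all(table_dir, delimiter=","):
    """Mimic SELECT *: parse every file in the directory, whatever its extension."""
    rows = []
    for file_name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, file_name)) as f:
            for line in f:
                raw_id, name = line.rstrip("\n").split(delimiter)
                rows.append((int(raw_id), name))
    return rows


def demo():
    """Load the two sample files from this lesson and read them back."""
    with tempfile.TemporaryDirectory() as table_dir:
        load_local("workers.csv", "1,jack\n2,terry\n3,michael\n", table_dir)
        load_local("workers2.txt", "4,peter\n5,kate\n6,ted\n", table_dir)
        return select_all(table_dir)


print(demo())
```

Nothing is validated or converted at load time in this sketch; the columns only take shape when the files are read, which matches the "just copy the file" behavior observed above.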

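Partitioned Tables

Partitioned tables exist to speed up queries. Say you have a huge amount of data, but your use case is daily reports built on it. You can partition by day, so that producing the report for 2014-05-05 only requires reading that day's data. Let's create a partitioned table and see:

create table partition_employee(id int, name string) 
partitioned by(daytime string) 
row format delimited fields TERMINATED BY '\054';

Notice that the partition attribute is not one of the table's columns. Next, create two test data files, one for each of two days: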

# cat 2014-05-05
22,kitty
33,lily
# cat 2014-05-06
14,sami
45,micky

Load them into the partitioned table:

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05');
Copying data from file:/home/alex/2014-05-05
Copying file: file:/home/alex/2014-05-05
Loading data to table default.partition_employee partition (daytime=2014-05-05)
Partition default.partition_employee{daytime=2014-05-05} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
OK
Time taken: 1.154 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06');
Copying data from file:/home/alex/2014-05-06
Copying file: file:/home/alex/2014-05-06
Loading data to table default.partition_employee partition (daytime=2014-05-06)
Partition default.partition_employee{daytime=2014-05-06} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 42, raw_data_size: 0]
OK
Time taken: 0.763 seconds

Use partition to specify the partition when loading.
When querying, select a partition by filtering on it:

hive> select * from partition_employee where daytime='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
Time taken: 0.173 seconds, Fetched: 2 row(s)

The query uses no special syntax; Hive automatically detects whether the where clause references a partition column. Comparison operators such as greater-than and less-than work too:

hive> select * from partition_employee where daytime>='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
14	sami	2014-05-06
45	micky	2014-05-06
Time taken: 0.273 seconds, Fetched: 4 row(s)

Let's look at the storage layout:

# hdfs dfs -ls /user/hive/warehouse/partition_employee
Found 2 items
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-05
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-06
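The directory layout explains why filtering on the partition is cheap. The Python sketch below imitates it on a local temporary directory: each partition is a daytime=<value> subdirectory, a filter on daytime decides which subdirectories get read at all, and the partition value is appended to every row as an extra column. The helper names are hypothetical, and this is an analogy rather than Hive's actual code.

```python
import os
import tempfile


def load_partition(table_dir, daytime, file_name, content):
    """Put a data file into the daytime=<value> subdirectory of the table."""
    part_dir = os.path.join(table_dir, "daytime=" + daytime)
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, file_name), "w") as f:
        f.write(content)


def select_where(table_dir, predicate):
    """Read only the partitions whose daytime value satisfies the predicate."""
    rows = []
    for part_name in sorted(os.listdir(table_dir)):
        daytime = part_name.split("=", 1)[1]
        if not predicate(daytime):
            continue  # pruned: files under this directory are never opened
        part_dir = os.path.join(table_dir, part_name)
        for file_name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, file_name)) as f:
                for line in f:
                    raw_id, name = line.rstrip("\n").split(",")
                    # The partition value becomes an extra trailing column.
                    rows.append((int(raw_id), name, daytime))
    return rows


def demo():
    """Rebuild the two partitions from this lesson and run both filters."""
    with tempfile.TemporaryDirectory() as table_dir:
        load_partition(table_dir, "2014-05-05", "2014-05-05", "22,kitty\n33,lily\n")
        load_partition(table_dir, "2014-05-06", "2014-05-06", "14,sami\n45,micky\n")
        one_day = select_where(table_dir, lambda d: d == "2014-05-05")
        from_day = select_where(table_dir, lambda d: d >= "2014-05-05")
        return one_day, from_day


one_day, from_day = demo()
print(one_day)
print(from_day)
```

Because the dates are stored as strings in year-month-day order, plain string comparison is enough for the >= filter.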

Now let's try a two-level partitioned table:

create table p_student(id int, name string) 
partitioned by(daytime string,country string) 
row format delimited fields TERMINATED BY '\054';

Prepare some data to insert:

# cat 2014-09-09-CN 
1,tammy
2,eric
# cat 2014-09-10-CN 
3,paul
4,jolly
# cat 2014-09-10-EN 
44,ivan
66,billy

Load it into Hive:

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN');
Copying data from file:/home/alex/2014-09-09-CN
Copying file: file:/home/alex/2014-09-09-CN
Loading data to table default.p_student partition (daytime=2014-09-09, country=CN)
Partition default.p_student{daytime=2014-09-09, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
OK
Time taken: 0.736 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN');
Copying data from file:/home/alex/2014-09-10-CN
Copying file: file:/home/alex/2014-09-10-CN
Loading data to table default.p_student partition (daytime=2014-09-10, country=CN)
Partition default.p_student{daytime=2014-09-10, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 38, raw_data_size: 0]
OK
Time taken: 0.691 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN');
Copying data from file:/home/alex/2014-09-10-EN
Copying file: file:/home/alex/2014-09-10-EN
Loading data to table default.p_student partition (daytime=2014-09-10, country=EN)
Partition default.p_student{daytime=2014-09-10, country=EN} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 59, raw_data_size: 0]
OK
Time taken: 0.622 seconds

Check the storage layout:

# hdfs dfs -ls /user/hive/warehouse/p_student
Found 2 items
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-10
# hdfs dfs -ls /user/hive/warehouse/p_student/daytime=2014-09-09
Found 1 items
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN

Query the data:

hive> select * from p_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.228 seconds, Fetched: 6 row(s)

hive> select * from p_student where daytime='2014-09-10' and country='EN';
OK
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.224 seconds, Fetched: 2 row(s)

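Bucketed Tables

A bucketed table distributes rows into different "buckets" according to the hash of a column. The name comes from the habit of sorting things by setting out a few buckets, each with a different label. Bucketed tables are used mainly for sampling.
The example below is adapted from the official tutorial. Partitioning and bucketing can be combined, so this example uses both:

CREATE TABLE b_student(id INT, name STRING)
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(id) SORTED BY(name) INTO 4 BUCKETS
row format delimited 
    fields TERMINATED BY '\054';

This means rows are assigned to buckets by the hash of id and stored sorted by name within each bucket. I won't repeat the steps for preparing and loading the data; here is the table after loading: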

hive> select * from b_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
34	allen	2014-09-11	EN
Time taken: 0.727 seconds, Fetched: 5 row(s)

Sample one bucket out of the four:

hive> select * from b_student tablesample(bucket 1 out of 4 on id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1406097234796_0041
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-08 17:35:56,995 Stage-1 map = 0%,  reduce = 0%
2014-12-08 17:36:06,783 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
2014-12-08 17:36:07,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_1406097234796_0041
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.9 sec   HDFS Read: 482 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
4	jolly	2014-09-10	CN
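The sampling result can be reproduced by hand. The Python sketch below assumes that the hash of an INT column is the integer itself and that the bucket is that hash modulo the bucket count; both are my understanding of Hive's default behavior rather than something shown in this lesson, and the function names are hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative


def bucket_index(value, num_buckets):
    """Zero-based bucket of an integer; assumes hash(INT) is the value itself."""
    return (value & INT_MAX) % num_buckets


def tablesample(rows, x, y):
    """Rows that TABLESAMPLE(BUCKET x OUT OF y ON id) would keep (x is one-based)."""
    return [row for row in rows if bucket_index(row[0], y) == x - 1]


# The five rows of b_student shown earlier.
ROWS = [
    (1, "tammy", "2014-09-09", "CN"),
    (2, "eric", "2014-09-09", "CN"),
    (3, "paul", "2014-09-10", "CN"),
    (4, "jolly", "2014-09-10", "CN"),
    (34, "allen", "2014-09-11", "EN"),
]

print([bucket_index(row[0], 4) for row in ROWS])
print(tablesample(ROWS, 1, 4))
```

Under these assumptions only id 4 lands in the first bucket (4 mod 4 is 0), which agrees with the single jolly row returned by the query.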

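External Tables

An external table is one whose data is not stored by Hive itself. The data can live in HBase, for example, with Hive only providing a mapping. I'll use HBase as the example.
First create an HBase table named employee:

hbase(main):005:0> create 'employee','info'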
0 row(s) in 0.4740 seconds  
  
=> Hbase::Table - employee  
hbase(main):006:0> put 'employee',1,'info:id',1
0 row(s) in 0.2080 seconds  
  
hbase(main):008:0> scan 'employee'
ROW                                      COLUMN+CELL                                                                                                             
 1                                       column=info:id, timestamp=1417591291730, value=1                                                                        
1 row(s) in 0.0610 seconds  
  
hbase(main):009:0> put 'employee',1,'info:name','peter'
0 row(s) in 0.0220 seconds  
  
hbase(main):010:0> scan 'employee'
ROW                                      COLUMN+CELL                                                                                                             
 1                                       column=info:id, timestamp=1417591291730, value=1                                                                        
 1                                       column=info:name, timestamp=1417591321072, value=peter                                                                  
1 row(s) in 0.0450 seconds  
  
hbase(main):011:0> put 'employee',2,'info:id',2
0 row(s) in 0.0370 seconds  
  
hbase(main):012:0> put 'employee',2,'info:name','paul'
0 row(s) in 0.0180 seconds  
  
hbase(main):013:0> scan 'employee'
ROW                                      COLUMN+CELL                                                                                                             
 1                                       column=info:id, timestamp=1417591291730, value=1                                                                        
 1                                       column=info:name, timestamp=1417591321072, value=peter                                                                  
 2                                       column=info:id, timestamp=1417591500179, value=2                                                                        
 2                                       column=info:name, timestamp=1417591512075, value=paul                                                                   
2 row(s) in 0.0440 seconds

Create an external table that maps onto it:

hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name")
    > TBLPROPERTIES ("hbase.table.name" = "employee");
OK  
Time taken: 0.324 seconds  
hive> select * from h_employee;  
OK  
1   1   peter  
2   2   paul  
Time taken: 1.129 seconds, Fetched: 2 row(s)
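To show what the hbase.columns.mapping string expresses, here is a rough Python sketch that pairs each mapping entry with one Hive column, in order: ":key" takes the HBase row key and every other entry names a family:qualifier cell. The dictionary below just restates the scan output above; the function names are hypothetical, and stripping whitespace around entries is a simplification of mine, not a statement about how the storage handler parses the string.

```python
def parse_mapping(mapping):
    """Split the mapping string into one entry per Hive column, in order."""
    return [entry.strip() for entry in mapping.split(",")]


def to_hive_rows(hbase_rows, mapping):
    """Turn {row_key: {"family:qualifier": value}} into Hive-style tuples."""
    entries = parse_mapping(mapping)
    result = []
    for row_key in sorted(hbase_rows):
        cells = hbase_rows[row_key]
        # ":key" stands for the row key; any other entry names a cell.
        result.append(
            tuple(row_key if entry == ":key" else cells.get(entry) for entry in entries)
        )
    return result


# Contents of the employee table as shown by the last scan.
EMPLOYEE = {
    "1": {"info:id": "1", "info:name": "peter"},
    "2": {"info:id": "2", "info:name": "paul"},
}

print(to_hive_rows(EMPLOYEE, ":key, info:id,info:name"))
```

The values stay as strings in this sketch; converting them to the declared int columns is left out to keep the mapping idea in focus.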

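Query Syntax

For the full syntax, see the official manual (https://cwiki.apache.org/confluence/display/Hive/Tutorial). I'll only mention a few oddities.

Limiting the Number of Rows

To show x rows you still use limit, for example: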

hive> select * from h_employee limit 1
    > ;
OK
1	1	peter
Time taken: 0.284 seconds, Fetched: 1 row(s)

However, a starting position such as offset is not supported in the Hive version used here.
Class dismissed!
