What services does the Hive component provide?
The Hive component provides three services: 1. it converts SQL statements into MapReduce code; 2. it stores data, using HDFS; 3. it computes over data, using MapReduce. Hive is a data warehouse tool based on Hadoop, used for data extraction, transformation, and loading. It can map structured data files to database tables and provides SQL query functions, converting SQL statements into MapReduce tasks for execution.
The operating environment of this tutorial: Windows 7 system, Dell G3 computer.
When building a data warehouse, the Hive component plays a key role. Hive is an important data warehouse tool based on Hadoop, but how to apply it requires further exploration.
Hive is a data warehouse tool based on Hadoop, used for data extraction, transformation, and loading. It provides a mechanism to store, query, and analyze large-scale data kept in Hadoop. Hive can map structured data files to database tables and provides SQL query functions, converting SQL statements into MapReduce tasks for execution. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly through SQL-like statements, without having to develop a dedicated MapReduce application. This makes Hive very suitable for statistical analysis in data warehouses.
1. Converts SQL statements into MapReduce code
2. Stores data, using HDFS
3. Computes over data, using MapReduce
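To make the first service concrete, here is a minimal Python sketch of how a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be expressed as map, shuffle, and reduce phases. The table, columns, and sample rows are hypothetical, and this in-memory simulation only illustrates the idea. It is not the code Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a structured file mapped to a table employees(name, dept).
ROWS = [
    ("alice", "sales"),
    ("bob", "engineering"),
    ("carol", "sales"),
    ("dave", "engineering"),
    ("erin", "support"),
]


def map_phase(rows):
    """Emit one (dept, 1) pair per row, like a mapper for COUNT(*) ... GROUP BY dept."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce the per-group count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Run the three phases in order over the given rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

On a real cluster the map and reduce phases run in parallel on many nodes and read their input from HDFS; the sketch keeps everything in one process to show only the data flow.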
a. Advantages of Hive
(1) Simple and easy to use: provides HQL, a SQL-like query language
(2) Scalable: designed for computation over extremely large data sets (MR as the computing engine, HDFS as the storage system). Hive can freely expand the scale of the cluster, generally without restarting the service.
(3) Provides unified metadata management
(4) Extensible: Hive supports user-defined functions, so users can implement their own functions according to their needs
(5) Fault tolerant: if a node has a problem, the SQL statement can still complete execution
b. Disadvantages of Hive
(1) HQL has limited expressive power
1) Iterative algorithms, such as PageRank, cannot be expressed
2) Hive is not well suited to data mining, such as k-means
(2) Hive's efficiency is relatively low
1) The MapReduce jobs automatically generated by Hive are usually not intelligent enough
2) Hive tuning is difficult and the granularity is coarse
3) Hive has poor controllability
(3) Hive does not support transactions, so it is mainly used for OLAP (Online Analytical Processing)
1) The data processed by Hive is stored in HDFS
2) By default, Hive analyzes data using MapReduce under the hood
3) The execution programs run on YARN
Summary: Hive effectively acts as a client of Hadoop.
(1) Comparison between Hive and traditional databases
Hive is used for offline analysis of massive data. Hive looks like a SQL database, but its application scenarios are completely different: Hive is only suitable for statistical analysis of batch data.
(2) Advantages of Hive
Hive uses HDFS to store data and MapReduce to query and analyze it. Processing data directly with Hadoop MapReduce carries a high learning cost, and complex query logic is very difficult to develop in MapReduce. Hive's operation interface adopts SQL-like syntax, which enables rapid development, avoids hand-written MapReduce, reduces developers' learning costs, and makes it easier to extend functionality.
Hive solves the problem of querying big data, so that people who cannot write MR can still use MR. In essence, it converts HQL into MR, which remains the bottom layer. Writing MR by hand is inefficient and painful, so Hive has been a welcome shortcut for Java EE developers.
1. User interface: Client
CLI (Hive shell), JDBC/ODBC (Java access to Hive), Web UI (browser access to Hive)
2. Metadata: Metastore
Metadata includes: the table name, the database the table belongs to (the default database is named default), the table owner, the column and partition fields, the table type (whether it is an external table), the directory where the table data is located, etc.
Metadata is stored in the built-in Derby database by default. Using MySQL to store the Metastore is recommended.
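As an illustration of the kind of information the Metastore keeps per table, the following Python sketch models one metadata record. The class name, field names, and the example path are hypothetical; the real Metastore stores this information in relational tables (in Derby or MySQL), not in Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TableMetadata:
    """Hypothetical record with the kinds of fields the Metastore tracks."""

    table_name: str
    location: str  # directory where the table data is located
    database: str = "default"  # tables belong to "default" unless stated otherwise
    owner: str = "unknown"
    columns: List[Tuple[str, str]] = field(default_factory=list)  # (name, type)
    partition_columns: List[Tuple[str, str]] = field(default_factory=list)
    is_external: bool = False  # table type: external vs. managed

    def qualified_name(self) -> str:
        """Return the name in database.table form."""
        return f"{self.database}.{self.table_name}"


# Example record for a made-up table; the path is an assumption for illustration.
page_views = TableMetadata(
    table_name="page_views",
    location="/warehouse/page_views",
    owner="analyst",
    columns=[("url", "string"), ("hits", "int")],
    partition_columns=[("dt", "string")],
)

if __name__ == "__main__":
    print(page_views.qualified_name(), page_views.is_external)
```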
3. Hadoop
Uses HDFS for storage and MapReduce for calculation.
4. Driver: Driver
(1) Parser (SQL Parser): Convert the SQL string into an abstract syntax tree (AST). This step is generally completed using a third-party tool library, such as ANTLR. The AST is then analyzed, checking for example whether the table exists, whether the fields exist, and whether the SQL semantics are valid.
(2) Compiler (Physical Plan): Compile the AST to generate a logical execution plan.
(3) Optimizer (Query Optimizer): Optimize the logical execution plan.
(4) Execution: Convert the logical execution plan into a physical plan that can be run. For Hive, this means MR/Spark.
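The four driver stages above can be sketched as a small pipeline. This is a toy model under stated assumptions: the catalog, the function names, and the plan representation are all hypothetical, the parser accepts only queries of the form SELECT columns FROM table, and the "physical plan" is just a list of stage descriptions. Hive's real driver is far more elaborate.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical catalog standing in for the Metastore: table name -> column names.
CATALOG: Dict[str, List[str]] = {"employees": ["name", "dept"]}


@dataclass
class Ast:
    columns: List[str]
    table: str


def parse(sql: str) -> Ast:
    """Parser: turn the SQL string into a tiny AST (SELECT ... FROM ... only)."""
    pattern = r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)\s*;?\s*"
    match = re.fullmatch(pattern, sql, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported or malformed query: {sql!r}")
    columns = [c.strip() for c in match.group(1).split(",")]
    return Ast(columns=columns, table=match.group(2))


def analyze(ast: Ast) -> Ast:
    """Semantic check: does the table exist, and do the fields exist?"""
    if ast.table not in CATALOG:
        raise LookupError(f"table not found: {ast.table}")
    for col in ast.columns:
        if col != "*" and col not in CATALOG[ast.table]:
            raise LookupError(f"column not found: {col}")
    return ast


def compile_plan(ast: Ast) -> List[Tuple[str, object]]:
    """Compiler: build a logical plan as an ordered list of operators."""
    columns = list(CATALOG[ast.table]) if ast.columns == ["*"] else ast.columns
    return [("scan", ast.table), ("project", columns)]


def optimize(plan: List[Tuple[str, object]]) -> List[Tuple[str, object]]:
    """Optimizer: drop a projection that keeps every column (a no-op)."""
    table = plan[0][1]
    return [op for op in plan
            if not (op[0] == "project" and op[1] == CATALOG[table])]


def to_physical(plan: List[Tuple[str, object]]) -> List[str]:
    """Execution: turn logical operators into runnable stage descriptions."""
    stages = []
    for kind, arg in plan:
        if kind == "scan":
            stages.append(f"map: read rows of {arg}")
        else:
            stages.append(f"map: keep columns {', '.join(arg)}")
    return stages


def run_driver(sql: str) -> List[str]:
    """Chain parser, semantic check, compiler, optimizer, and execution."""
    return to_physical(optimize(compile_plan(analyze(parse(sql)))))


if __name__ == "__main__":
    print(run_driver("SELECT name FROM employees"))
    print(run_driver("SELECT * FROM employees"))
```

Keeping each stage as a separate function mirrors the separation described above: a query that fails the semantic check never reaches the compiler, and the optimizer only ever sees a logical plan.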
Comparison between Hive and databases
Because Hive uses a SQL-like query language, HQL (Hive Query Language), it is easy to mistake Hive for a database. In fact, from a structural point of view, apart from having similar query languages, Hive and databases have nothing in common. This section explains the differences between Hive and databases from several aspects. Databases can be used in online applications, whereas Hive is designed for data warehouses. Knowing this helps in understanding the characteristics of Hive from an application perspective.
2. Data storage location: Hive is built on Hadoop, and all Hive data is stored in HDFS. A database can save its data on a block device or in the local file system.
3. Data update: Since Hive is designed for data warehouse applications, and data warehouse content is read often and written rarely, rewriting data in Hive is not recommended. All data is determined at load time. The data in a database usually needs to be modified frequently, so you can use INSERT INTO... VALUES to add data and UPDATE... SET to modify it.
4. Index: Hive does not process the data in any way while loading it, and does not even scan it, so no index is built on any key in the data. When Hive needs to access specific values that meet a condition, it has to brute-force scan the entire data set, so access latency is high. Thanks to MapReduce, Hive can access data in parallel, so even without indexes it still has an advantage when accessing large amounts of data. In a database, an index is usually created on one or several columns, so the database can access a small amount of data matching specific conditions with high efficiency and low latency. Because of its high data access latency, Hive is not suitable for online data queries.
5. Execution: Most queries in Hive are executed through MapReduce provided by Hadoop. The database usually has its own execution engine.
6. Execution delay: When Hive queries data, there is no index, so the entire table needs to be scanned and the delay is high. Another factor that contributes to high Hive execution latency is the MapReduce framework. Since MapReduce itself has high latency, executing Hive queries through MapReduce also incurs high latency. In contrast, the execution latency of a database is low. Of course, this low latency is conditional on the data scale being small. When the data scale is large enough to exceed the processing capabilities of the database, Hive's parallel computing clearly shows its advantages.
7. Scalability: Since Hive is built on Hadoop, the scalability of Hive is consistent with the scalability of Hadoop (in 2009 the world's largest Hadoop cluster, at Yahoo!, had around 4,000 nodes). Due to the strict restrictions of ACID semantics, a database has very limited scalability. Currently, Oracle, the most advanced parallel database, has a theoretical scalability of only about 100 nodes.
8. Data scale: Since Hive is built on a cluster and can use MapReduce for parallel computing, it can support large-scale data; a database, correspondingly, supports a smaller data scale.
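The index point in the comparison above (item 4) can be illustrated with a short Python sketch that contrasts a full scan with an index lookup and counts how many rows each approach examines. The rows and function names are hypothetical, and the dictionary-based index is only a stand-in for a database index, not how any particular database implements one.

```python
from collections import defaultdict

# Hypothetical table rows: (id, dept).
ROWS = [(1, "sales"), (2, "engineering"), (3, "sales"), (4, "support")]


def full_scan(rows, dept):
    """Check every row, as a system without an index must."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row[1] == dept:
            matches.append(row)
    return matches, examined


def build_index(rows):
    """Build a dept -> rows mapping once, ahead of query time."""
    index = defaultdict(list)
    for row in rows:
        index[row[1]].append(row)
    return index


def index_lookup(index, dept):
    """Jump straight to the matching rows; only those rows are touched."""
    matches = index.get(dept, [])
    return matches, len(matches)


if __name__ == "__main__":
    print(full_scan(ROWS, "support"))
    print(index_lookup(build_index(ROWS), "support"))
```

The scan always examines every row regardless of how few match, which is why the text describes Hive's latency as high for selective queries, while a scan split across many nodes can still win on very large data.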