Overview
Hive Shell Commands
Database
Create Database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES
(property_name=property_value, ...)];
Example statements
create database if not exists dtinone;
create database if not exists test03 location "/a/test03.db";
Drop Database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
## RESTRICT is the default and may be omitted
## CASCADE drops a non-empty database (cascading delete)
drop database test01 cascade;
Alter Database
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
ALTER DATABASE can only change a database's properties and its owner.
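For example (the property key and new owner below are made-up values for illustration):
alter database dtinone set dbproperties ('edited-by'='dtinone');
alter database dtinone set owner user hadoop;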
Use Database
Switches to (uses) the given database.
USE database_name;
USE DEFAULT;
Table
Create Table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
-- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)
-- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
-- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
-- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];
-- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type
-- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY
-- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP
-- (Note: Available in Hive 0.8.0 and later)
| DECIMAL
-- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale)
-- (Note: Available in Hive 0.13.0 and later)
| DATE
-- (Note: Available in Hive 0.12.0 and later)
| VARCHAR
-- (Note: Available in Hive 0.12.0 and later)
| CHAR
-- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... >
-- (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char]
-- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE
-- (Default, depending on hive.default.fileformat configuration)
| RCFILE
-- (Note: Available in Hive 0.6.0 and later)
| ORC
-- (Note: Available in Hive 0.11.0 and later)
| PARQUET
-- (Note: Available in Hive 0.13.0 and later)
| AVRO
-- (Note: Available in Hive 0.14.0 and later)
| JSONFILE
-- (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
column_constraint_specification:
: [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK
[check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
default_value:
: [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ]
constraint_specification:
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE
[, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
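A minimal sketch of a constraint in practice (hypothetical table; constraints require Hive 2.1.0 or later, and Hive keeps them as metadata only, hence DISABLE NOVALIDATE):
create table dtinone.t_pk_demo(
id int,
name string,
primary key (id) disable novalidate
);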
Table creation example
create table dtinone.t_student(
sid int,
name string,
age int,
address string
)row format delimited fields terminated by ',';
After the table is created, a t_student directory is generated under /user/hive/warehouse/dtinone.db/ on HDFS. Once a local data file is uploaded into that directory, the data can be queried from the t_student table.
Contents of the local file student.txt:
1,zhangsan,20,chengdu
2,lisi,18,beijing
3,wangwu,22,shanghai
4,zhao,21,guangzhou
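One way to get the file into place (the local path is assumed; the HDFS path is the table directory generated above):
hdfs dfs -put /home/hadoop/data/student.txt /user/hive/warehouse/dtinone.db/t_student/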
Query result:
0: jdbc:hive2://hadoop101:10000> select * from dtinone.t_student;
OK
+----------------+-----------------+----------------+--------------------+--+
| t_student.sid  | t_student.name  | t_student.age  | t_student.address  |
+----------------+-----------------+----------------+--------------------+--+
| 1              | zhangsan        | 20             | chengdu            |
| 2              | lisi            | 18             | beijing            |
| 3              | wangwu          | 22             | shanghai           |
| 4              | zhao            | 21             | guangzhou          |
+----------------+-----------------+----------------+--------------------+--+
4 rows selected (0.124 seconds)
Hive Data Types
Primitive Types
Types are associated with the columns in the tables. The following primitive types are supported:
- Integers
- TINYINT—1 byte integer
- SMALLINT—2 byte integer
- INT—4 byte integer
- BIGINT—8 byte integer
- Boolean type
- BOOLEAN—TRUE/FALSE
- Floating point numbers
- FLOAT—single precision
- DOUBLE—Double precision
- Fixed point numbers
- DECIMAL—a fixed point value of user defined scale and precision
- String types
- STRING—sequence of characters in a specified character set
- VARCHAR—sequence of characters in a specified character set with a maximum length
- CHAR—sequence of characters in a specified character set with a defined length
- Date and time types
- TIMESTAMP — A date and time without a timezone (“LocalDateTime” semantics)
- TIMESTAMP WITH LOCAL TIME ZONE — A point in time measured down to nanoseconds (“Instant” semantics)
- DATE—a date
- Binary types
- BINARY—a sequence of bytes
Complex Types
Complex types can be built up from primitive types and other composite types:
- Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
- Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']
- Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation, where n is a zero-based index into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
Arrays type demo
student_array.txt
1,zhangsan,20,beijing,10 50 80
2,lisi,22,chengdu,40 55 80
3,wangwu,21,shanghai,70 90 80
4,zhaoliu,20,hangzhou,30 52 80
create table if not exists dtinone.t_student_arr(
sid int,
name string,
age int,
address string,
scores array<int>
)row format delimited fields terminated by ','
collection items terminated by ' ';
Query result (the data differs from the sample file above, but the structure is the same):
0: jdbc:hive2://hadoop101:10000> select * from dtinone.t_student_arr;
OK
+----------------+-----------------+----------------+--------------------+-------------------+--+
| t_student.sid  | t_student.name  | t_student.age  | t_student.address  | t_student.scores  |
+----------------+-----------------+----------------+--------------------+-------------------+--+
| 1              | zhangsan        | 20             | chengdu            | [70,75,90]        |
| 2              | lisi            | 22             | beijing            | [88,96,74]        |
| 3              | wangwu          | 21             | shanghai           | [77,65,87]        |
| 4              | zhaoliu         | 20             | guangzhou          | [90,96,94]        |
+----------------+-----------------+----------------+--------------------+-------------------+--+
4 rows selected (0.124 seconds)
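Array elements are then read with a zero-based index, e.g. each student's first score:
select name, scores[0] from dtinone.t_student_arr;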
Maps type
student_map.txt
1,zhangsan,20,beijing,father:laozhang mother:baby
2,lisi,22,chengdu,father:laoli mother:lily
3,wangwu,21,shanghai,father:laozhang mother:baby
4,zhaoliu,20,hangzhou,father:laozhang mother:baby
create table if not exists dtinone.t_student_map(
sid int,
name string,
age int,
address string,
family map<string,string>
)row format delimited fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';
Query result (the data differs from the sample file above, but the structure is the same):
0: jdbc:hive2://hadoop101:10000> select * from test.t_student_map;
OK
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
| t_student_map.sid  | t_student_map.name  | t_student_map.age  | t_student_map.address  | t_student_map.family             |
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
| 1                  | zhangsan            | 20                 | chengdu                | {"father":"aaa","mother":"bbb"}  |
| 2                  | lisi                | 18                 | beijing                | {"father":"ccc","mother":"ddd"}  |
| 3                  | wangwu              | 22                 | shanghai               | {"father":"eee","mother":"fff"}  |
| 4                  | zhao                | 21                 | guangzhou              | {"father":"aaa","mother":"bbb"}  |
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
4 rows selected (0.135 seconds)
## A map field is accessed as M['key']
select * from test.t_student_map where family['father'] = 'aaa';
Structs type
student_struct.txt
1,zhangsan,20,sichuan chengdu gaoxin
2,lisi,22,beijing beijing haidian
3,wangwu,21,sichuan chengdu jinjiang
4,zhaoliu,20,sichuan chengdu jinjiang
create table if not exists dtinone.t_student_struct(
sid int,
name string,
age int,
address struct<province:string,city:string,area:string>
)row format delimited fields terminated by ','
collection items terminated by ' ';
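Struct fields are read with dot notation, e.g.:
select name, address.province, address.city from dtinone.t_student_struct;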
Using multiple complex types together
{
sid:'1',
name:'zhangsan',
age:20,
scores:[88,99,100],
family:{
father:'laozhang',
mother:'lily'
},
address:{
province:'四川',
city:'成都',
area:'高新区'
}
}
1,zhangsan,20,88 99 100,father:laozhang mother:lily,四川 成都 高新区
2,zhaosi,21,78 90 100,father:laozhao mother:lily,四川 成都 高新区
create table if not exists dtinone.t_student_mix(
sid int,
name string,
age int,
scores array<int>,
family map<string,string>,
address struct<province:string,city:string,area:string>
)
row format delimited fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';
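All three access notations can be combined in a single query, e.g.:
select name, scores[0], family['father'], address.city from dtinone.t_student_mix;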
Managed and External Tables
Managed (internal) tables
By default, Hive creates managed tables, whose files, metadata, and statistics are managed by internal Hive processes.
A managed table is stored under the path set by the hive.metastore.warehouse.dir property, by default in a folder like /user/hive/warehouse/databasename.db/tablename/.
The location property can override the default location at table creation time.
If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted.
Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
To recap: the location property overrides the default location, and dropping a managed table deletes both metadata and data.
create table dtinone.t1(
id int,
name string,
age int
)row format delimited fields terminated by ','
location '/xxx';
## /xxx overrides the default /user/hive/warehouse/dtinone.db/t1
External tables
An external table describes the metadata/schema of files stored outside Hive.
An external table can access data stored in the external source.
If the structure or partitioning of an external table is changed, the metadata can be refreshed with an MSCK REPAIR TABLE table_name statement.
Dropping an external table deletes only the metadata; the data is not deleted.
Source data is generally kept in external tables.
Create an external table
create external table dtinone.t2(
id int,
name string,
age int
)row format delimited fields terminated by ',';
Dropping the external table removes only the metadata; the data remains.
Converting between managed and external tables
Viewing a table's description
desc table_name; ## show the table's column info
desc extended table_name; ## show extended table info
desc formatted table_name; ## show extended table info, nicely formatted
Managed table to external table
ALTER TABLE dtinone.t3 SET TBLPROPERTIES ('EXTERNAL'='TRUE');
External table to managed table
ALTER TABLE dtinone.t3 SET TBLPROPERTIES ('EXTERNAL'='FALSE');
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ... )
Copy table structure and data
create table student as select * from dtinone.t_student;
Copy table structure only
create table student01 like dtinone.t_student;
Partitioned Tables
Hive processes offline (historical) data, typically organized in units of days or weeks. If every day's data goes into one table, processing a single day requires loading all of it, which is clearly unworkable. Partitioned tables solve this.
In Hive, a database is a directory and a table is also a directory; a partition is a subdirectory of the table's directory.
Partition directory naming format: partition_field=value
- Create the access_20201210 data file:
1 1001 www.baidu.com 2020-12-10
2 1001 www.126.com 2020-12-10
3 1002 www.baidu.com 2020-12-10
4 1003 www.baidu.com 2020-12-10
- Create the partitioned table:
create table t_access(
id int,
uid int,
website string,
time string
)partitioned by (access_time string)
row format delimited fields terminated by ' ';
- Create the partition directory:
hdfs dfs -mkdir /user/hive/warehouse/t_access/access_time=2020-12-10
- Upload the data into the partition directory:
hdfs dfs -put /home/hadoop/data/access_20201210 /user/hive/warehouse/t_access/access_time=2020-12-10
- Query the data:
select * from t_access;
Note: no rows are returned, because the partition directory was created manually with hdfs commands; the Hive metastore has no metadata for this partition, so the data cannot be found.
- Fix 1: use the metastore repair command MSCK (metastore check):
MSCK REPAIR table t_access;
- Fix 2: add the partition(s) with Hive's DDL (multiple PARTITION clauses are separated by spaces; the official grammar, which shows a comma, is wrong):
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'] [PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
ALTER TABLE t_access ADD IF NOT EXISTS PARTITION (access_time='2020-12-12');
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808' PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
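Once either fix has run, the new partitions are registered in the metastore and can be listed and queried, e.g.:
show partitions t_access;
select * from t_access where access_time='2020-12-10';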
Multi-level partitions
Two or three levels are generally enough.
- Create a multi-level partitioned table:
create table t_access01(
id int,
uid int,
website string,
time string
)partitioned by (access_time string,hours int)
row format delimited fields terminated by ' ';
- Creating a multi-level partition requires creating the nested directories:
hdfs dfs -mkdir -p /user/hive/warehouse/t_access01/access_time=2020-12-10/hours=10
- Upload the data file into the nested directory:
hdfs dfs -put /home/hadoop/data/access_20201210 /user/hive/warehouse/t_access01/access_time=2020-12-10/hours=10
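As in the single-level case, these hand-made directories are invisible to Hive until the metastore is updated; a sketch using either of the two fixes above (partition values matching the directories just created):
MSCK REPAIR table t_access01;
-- or register the partition explicitly:
ALTER TABLE t_access01 ADD IF NOT EXISTS PARTITION (access_time='2020-12-10', hours=10);
select * from t_access01 where access_time='2020-12-10' and hours=10;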
Drop partitions
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
Note: when adding partitions, separate multiple PARTITION clauses with spaces; when dropping partitions, separate them with commas.
alter table t_access drop if exists partition (access_time='2020-12-10'),partition (access_time='2020-12-11');
Hive Table Storage Formats
File Formats
Hive supports several file formats:
- Text File
- SequenceFile
- RCFile
- Avro Files
- ORC Files
- Parquet
- Custom INPUTFORMAT and OUTPUTFORMAT
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
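A table's format is chosen with STORED AS; for example, a columnar copy of the earlier student table (the table name t_student_orc is made up for illustration):
create table dtinone.t_student_orc
stored as orc
as select * from dtinone.t_student;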
Parquet
https://zhuanlan.zhihu.com/p/141908285
A brief introduction:
- Parquet is a columnar storage format that supports nested structures
- It is very well suited to OLAP scenarios, with column-wise storage and scanning
ORC
Besides Parquet, the other common columnar storage format is ORC (Optimized RC File). Before ORC, Apache Hive already had a columnar format called RCFile (Record Columnar File); ORC is an improvement on RCFile, mainly optimizing compression encoding and query performance. Both ORC and RCFile therefore originate in Hive and are mainly used to speed up Hive queries and reduce Hadoop storage usage.
ORC has the following advantages:
- ORC is columnar, offers multiple compression codecs, and achieves a high compression ratio.
- Files are splittable. Using ORC as a table's file format in Hive therefore not only saves HDFS storage but also shrinks the query's input data volume, which reduces the number of MapTasks used.
- It provides several kinds of indexes: row group index and bloom filter index.
- ORC supports complex data structures (such as Map).
The main differences between Parquet and ORC:
- Nested structure support: Parquet supports nested structures very well, while ORC's support is weak; expressing them is complicated and costly in both performance and space.
- Updates and ACID: ORC supports update operations and ACID; Parquet does not.
- Compression and query performance: in compression ratio and query performance the two are broadly comparable, with ORC perhaps slightly ahead of Parquet.
- Query engine support: here Parquet has the advantage, being supported by Hive, Impala, Presto, and other engines, while ORC is tightly coupled to Hive and poorly adapted to Impala. Impala did not support ORC until CDH 6.1.x (Impala 3.x), and then only as an experimental feature.