我是靠谱客的博主 朴素玉米,最近开发中收集的这篇文章主要介绍Hive Shell命令Hive Shell命令,觉得挺不错的,现在分享给大家,希望可以做个参考。

概述

Hive Shell命令

Database

Create Database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES
(property_name=property_value, ...)];

建库语句

create database if not exists dtinone;
create database if not exists test03 location "/a/test03.db";

Drop Database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
## RESTRICT 默认,可以不写
## CASCADE 级联,删除非空的数据库
drop database test01 cascade;

Alter Database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;

更改数据库只能更改属性信息和数据库拥有者。

Use Database

使用(切换)数据库。

USE database_name;
USE DEFAULT;

Table

Create Table 创建表

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
-- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)
-- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
-- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
-- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];
-- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type
-- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY
-- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP
-- (Note: Available in Hive 0.8.0 and later)
| DECIMAL
-- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale)
-- (Note: Available in Hive 0.13.0 and later)
| DATE
-- (Note: Available in Hive 0.12.0 and later)
| VARCHAR
-- (Note: Available in Hive 0.12.0 and later)
| CHAR
-- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... >
-- (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char]
-- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE
-- (Default, depending on hive.default.fileformat configuration)
| RCFILE
-- (Note: Available in Hive 0.6.0 and later)
| ORC
-- (Note: Available in Hive 0.11.0 and later)
| PARQUET
-- (Note: Available in Hive 0.13.0 and later)
| AVRO
-- (Note: Available in Hive 0.14.0 and later)
| JSONFILE
-- (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
column_constraint_specification:
: [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK
[check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
default_value:
: [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ]
constraint_specification:
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE
[, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

建表语句

create table dtinone.t_student(
sid int,
name string,
age int,
address string
)row format delimited fields terminated by ',';

建表后会在hdfs上/user/hive/warehouse/dtinone.db/目录下生成t_student目录,这时将本地文件上传到t_student目录下,就可以查询t_student表中的数据了

本地文件student.txt的内容:

1,zhangsan,20,chengdu
2,lisi,18,beijing
3,wangwu,22,shanghai
4,zhao,21,guangzhou

查询结果:

0: jdbc:hive2://hadoop101:10000> select * from dtinone.t_student;
OK
+----------------+-----------------+----------------+--------------------+--+
| t_student.sid
| t_student.name
| t_student.age
| t_student.address
|
+----------------+-----------------+----------------+--------------------+--+
| 1
| zhangsan
| 20
| chengdu
|
| 2
| lisi
| 18
| beijing
|
| 3
| wangwu
| 22
| shanghai
|
| 4
| zhao
| 21
| guangzhou
|
+----------------+-----------------+----------------+--------------------+--+
4 rows selected (0.124 seconds)

Hive数据类型

Primitive Types 原始类型

类型与表中的列相关联。支持以下基本类型:

  • Types are associated with the columns in the tables. The following Primitive types are supported:
  • Integers
    • TINYINT—1 byte integer
    • SMALLINT—2 byte integer
    • INT—4 byte integer
    • BIGINT—8 byte integer
  • Boolean type
    • BOOLEAN—TRUE/FALSE
  • Floating point numbers
    • FLOAT—single precision
    • DOUBLE—Double precision
  • Fixed point numbers
    • DECIMAL—a fixed point value of user defined scale and precision
  • String types
    • STRING—sequence of characters in a specified character set
    • VARCHAR—sequence of characters in a specified character set with a maximum length
    • CHAR—sequence of characters in a specified character set with a defined length
  • Date and time types
    • TIMESTAMP — A date and time without a timezone (“LocalDateTime” semantics)
    • TIMESTAMP WITH LOCAL TIME ZONE — A point in time measured down to nanoseconds (“Instant” semantics)
    • DATE—a date
  • Binary types
    • BINARY—a sequence of bytes

Complex Types 复杂类型

可以使用基本类型和其他复合类型来构建复杂类型:

  • Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
  • Maps (key-value tuples): The elements are accessed using [‘element name’] notation. For example in a map M comprising of a mapping from ‘group’ -> gid the gid value can be accessed using M[‘group’]
  • Arrays (indexable lists): The elements in the array have to be in the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements [‘a’, ‘b’, ‘c’], A[1] retruns ‘b’.

Arrays类型演示

student_array.txt
1,zhangsan,20,beijing,10 50 80
2,lisi,22,chengdu,40 55 80
3,wangwu,21,shanghai,70 90 80
4,zhaoliu,20,hangzhou,30 52 80
create table if not exists dtinone.t_student_arr(
sid int,
name string,
age int,
address string,
scores array<int>
)row format delimited fields terminated by ','
collection items terminated by ' ';

查询结果(数据不太一致,但结构是一样的):

0: jdbc:hive2://hadoop101:10000> select * from dtinone.t_student_arr;
OK
+----------------+-----------------+----------------+--------------------+-------------------+--+
| t_student.sid
| t_student.name
| t_student.age
| t_student.address
| t_student.scores
|
+----------------+-----------------+----------------+--------------------+-------------------+--+
| 1
| zhangsan
| 20
| chengdu
| [70,75,90]
|
| 2
| lisi
| 22
| beijing
| [88,96,74]
|
| 3
| wangwu
| 21
| shanghai
| [77,65,87]
|
| 4
| zhaoliu
| 20
| guangzhou
| [90,96,94]
|
+----------------+-----------------+----------------+--------------------+-------------------+--+
4 rows selected (0.124 seconds)

Maps 类型

student_map.txt
1,zhangsan,20,beijing,father:laozhang mother:baby
2,lisi,22,chengdu,father:laoli mother:lily
3,wangwu,21,shanghai,father:laozhang mother:baby
4,zhaoliu,20,hangzhou,father:laozhang mother:baby
create table if not exists dtinone.t_student_map(
sid int,
name string,
age int,
address string,
family map<string,string>
)row format delimited fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';

查询结果(数据不太一致,但结构是一样的):

0: jdbc:hive2://hadoop101:10000> select * from test.t_student_map;
OK
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
| t_student_map.sid
| t_student_map.name
| t_student_map.age
| t_student_map.address
|
t_student_map.family
|
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
| 1
| zhangsan
| 20
| chengdu
| {"father":"aaa","mother":"bbb"}
|
| 2
| lisi
| 18
| beijing
| {"father":"ccc","mother":"ddd"}
|
| 3
| wangwu
| 22
| shanghai
| {"father":"eee","mother":"fff"}
|
| 4
| zhao
| 21
| guangzhou
| {"father":"aaa","mother":"bbb"}
|
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
4 rows selected (0.135 seconds)
## map的字段用M['group']表示
select * from test.t_student_map where family['father'] = 'aaa';

Structs类型

student_struct.txt
1,zhangsan,20,sichuan chengdu gaoxin
2,lisi,22,beijing beijing haidian
3,wangwu,21,sichuan chengdu jinjiang
4,zhaoliu,20,sichuan chengdu jinjiang
create table if not exists dtinone.t_student_struct(
sid int,
name string,
age int,
address struct<province:string,city:string,area:string>
)row format delimited fields terminated by ','
collection items terminated by ' ';

多种复杂类型一起使用

{
sid:'1',
name:'zhangsan',
age:20,
scores:[88,99,100],
family:{
father:'laozhang',
mother:'lily'
},
address:{
province:'四川',
city:'成都',
area:'高新区'
}
}
1,zhangsan,20,88 99 100,father:laozhangmother:lily,四川 成都 高新区
2,zhaosi,21,78 90 100,father:laozhaomother:lily,四川 成都 高新区
create table if not exists dtinone.t_student_mix(
sid int,
name string,
age int,
scores array<int>,
family map<string,string>,
address struct<province:string,city:string,area:string>
)
row format delimited fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';

内部表和外部表

内部表/管理表

默认情况下,Hive创建内部表,其中的文件、元数据和统计数据由内部Hive进程管理。

一个内部表存储在hive.metastore.warehouse.dir路径属性,默认情况下在类似于/user/hive/warehouse/databasename.db/tablename/的文件夹路径中。

在表创建期间,location属性可以覆盖默认位置。

如果删除了一个内部表或分区,则会删除与该表或分区关联的数据和元数据。

当Hive应该管理表的生命周期,或者在生成临时表时,使用内部表。

内部表使用location属性可以覆盖默认位置

删除内部表,会删除元数据和数据

create table dtinone.t1(
id int,
name string,
age int
)row format delimited fields terminated by ','
location '/xxx';
## /xxx 会覆盖/user/hive/warehouse/dtinone.db/t1

外部表

外部表描述外部文件的元数据/模式。
外部表可以访问存储在源中的数据。
如果外部表的结构或分区被更改,可以使用MSCK REPAIR table table_name语句刷新元数据信息。
删除外部表,只会删除元数据信息,数据不会删除
一般源数据使用外部表。

创建外部表

create external table dtinone.t2(
id int,
name string,
age int
)row format delimited fields terminated by ',';

删除外部表,仅仅是删除元数据,数据还在

内部表和外部表转换

查看表的描述信息

desc 表名; ## 查看表的字段信息
desc extended 表名; ## 查看表的扩展信息
desc formatted 表名; ## 格式化表的扩展信息

内部表转外部表

ALTER TABLE dtinone.t3 SET TBLPROPERTIES ('EXTERNAL'='TRUE');

外部表转内部表

ALTER TABLE dtinone.t3 SET TBLPROPERTIES ('EXTERNAL'='FALSE');

Alter Table Properties

ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ... )

复制表结构及数据

create table student as select * from dtinone.t_student;

复制表结构

create table student01 like dtinone.t_student;

分区表

Hive处理的是离线数据(历史数据)。离线数据一般是 按照天或者周为单位的数据。现在就每天的数据都放在 一个表中,那么处理其中一天的数据的时候会加载所有 的数据。这种肯定是不行的。此时可以使用分区表。

在hive中数据库是一个目录;hive的表也是一个目录; hive的分区表是表目录的子目录。

分区表的目录格式:分区字段=值

  1. 创建access_20201210数据

    1 1001 www.baidu.com 2020-12-10
    2 1001 www.126.com 2020-12-10
    3 1002 www.baidu.com 2020-12-10
    4 1003 www.baidu.com 2020-12-10
    
  2. 创建分区表

    create table t_access(
    id int,
    uid int,
    website string,
    time string
    )
    partitioned by (access_time string)
    row format delimited fields terminated by ' ';
    
  3. 创建分区表的目录

    hdfs dfs -mkdir /user/hive/warehouse/t_access/access_time=2020-12-10
    
  4. 上传数据到分区表目录下

    hdfs dfs -put /home/hadoop/data/access_20201210 /user/hive/warehouse/t_access/access_time=2020-12-10
    
  5. 查询数据

    select * from t_access;
    

注意:此时没有数据显示,原因是因为分区目录是使用hdfs命 名手动创建的。hive的元数据信息中没有该分表表的信 息,所以无法查询到数据。

  • 解决方式一:使用元数据修复命令MSCK(metastore check)

    MSCK REPAIR table t_access;
    
  • 解决方式二:使用hive的命令添加分区(添加分区时多个分区用空格隔开,官网语法有误

    ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
    [LOCATION 'location'][ PARTITION partition_spec [LOCATION 'location'], ...];
    partition_spec:
    : (partition_column = partition_col_value, partition_column = partition_col_value, ...)
    
    ALTER TABLE t_access ADD IF NOT EXISTS PARTITION (access_time='2020-12-12');
    
    ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
    PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
    

多级分区

一般也就是2-3级足以。

  • 创建多级分区

    create table t_access01(
    id int,
    uid int,
    website string,
    time string
    )
    partitioned by (access_time string,hours int)
    row format delimited fields terminated by ' ';
    
  • 创建多级分区需要创建多级目录

    hdfs dfs -mkdir /user/hive/warehouse/t_access/access_time=2020-12-10/hours=10
    
  • 将数据文件上传到多级目录下

    hdfs dfs -put /home/hadoop/data/access_20201210 /user/hive/warehouse/t_access/access_time=2020-12-10/hours=10
    

删除分区

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)

注意:添加分区的时候多个PARTITION之间使用空格; 删除分区的时候多个PARTITION之间使用逗号。

alter table t_access drop if exists partition (access_time='2020-12-10'),partition (access_time='2020-12-11');

Hive表的存储格式

File Formats

Hive supports several file formats:

  • Text File
  • SequenceFile
  • RCFile
  • Avro Files
  • ORC Files
  • Parquet
  • Custom INPUTFORMAT and OUTPUTFORMAT

file_format:

:SEQUENCEFILE

| TEXTFILE --(Default, depending on hive.default.fileformat configuration)

| ORC – (Note: Available in Hive0.11.0 and later)

| PARQUET – (Note: Available in Hive 0.13.0 and later)

| AVRO – (Note: Available in Hive 0.14.0 and later)

| INPUTFORMAT input_format_classname

| OUTPUTFORMAT output_format_classname

parquet

https://zhuanlan.zhihu.com/p/141908285
简单介绍下:

  • Parquet 是一种支持嵌套结构的列式存储格式
  • 非常适用于 OLAP 场景,按列存储和扫描

ORC

除了 Parquet,另一个常见的列式存储格式是ORC(OptimizedRC File)。在 ORC 之前,Apache Hive 中就有一种列式存储格式称为RCFile(RecordColumnar File),ORC 是对 RCFile 格式的改进,主要在压缩编码、查询性能方面做了优化。因此 ORC/RC 都源于Hive,主要用来提高 Hive 查询速度和降低 Hadoop 的数据存储空间。

ORC具有以下一些优势:

  1. ORC是列式存储,有多种文件压缩方式,并且有着很高的压缩比。
  2. 文件是可切分(Split)的。因此,在Hive中使用ORC作为表的文件存储格式,不仅节省HDFS存储资源,查询任务的输入数据量减少,使用的MapTask也就减少了。
  3. 提供了多种索引,row group index、bloom filter index。
  4. ORC可以支持复杂的数据结构(比如Map等)

Parquet 与 ORC 的不同点总结以下:

  • 嵌套结构支持:Parquet 能够很完美的支持嵌套式结构,而在这一点上 ORC 支持的并不好,表达起来复杂且性能和空间都损耗较大。
  • 更新与 ACID 支持:ORC 格式支持 update 操作与ACID,而 Parquet 并不支持。
  • 压缩与查询性能:在压缩空间与查询性能方面,Parquet 与 ORC 总体上相差不大。可能 ORC 要稍好于 Parquet。
  • 查询引擎支持:这方面 Parquet 可能更有优势,支持Hive、Impala、Presto 等各种查询引擎,而 ORC 与Hive 接触的比较紧密,而与Impala 适配的并不好。之前我们说 Impala 不支持 ORC,直到 CDH 6.1.x 版本也就是 Impala3.x 才开始以 experimental feature支持 ORC 格式。

最后

以上就是朴素玉米为你收集整理的Hive Shell命令Hive Shell命令的全部内容,希望文章能够帮你解决Hive Shell命令Hive Shell命令所遇到的程序开发问题。

如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。

本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
点赞(50)

评论列表共有 0 条评论

立即
投稿
返回
顶部