Hive SQL Operations and Custom Functions (Part 1)

Overview

http://blog.csdn.net/u013980127/article/details/52604882


Where Hive behaves like traditional SQL and the concept is simple, this article gives little or no explanation. It is also not a word-for-word translation of the official documentation; it only distills what I consider useful.

1 DDL

1.1 Create/Drop/Alter/Use Database

1.1.1 Create Database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];


Note: DATABASE and SCHEMA are interchangeable.

1.1.2 Drop Database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];


Note: RESTRICT is the default; use CASCADE to drop a database that still contains tables.

1.1.3 Alter Database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
-- (Note: SCHEMA added in Hive 0.14.0)
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
-- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

1.1.4 Use Database

USE database_name;
USE DEFAULT;
-- Get the current database
SELECT current_database()

1.2 Create/Drop/Truncate Table

1.2.1 Create Table

1.2.1.1 Syntax


CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
-- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)
-- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
-- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
-- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];
-- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type
-- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY
-- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP
-- (Note: Available in Hive 0.8.0 and later)
| DECIMAL
-- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale)
-- (Note: Available in Hive 0.13.0 and later)
| DATE
-- (Note: Available in Hive 0.12.0 and later)
| VARCHAR
-- (Note: Available in Hive 0.12.0 and later)
| CHAR
-- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY<data_type>
map_type
: MAP<primitive_type, data_type>
struct_type
: STRUCT<col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE<data_type, data_type, ...>

1.2.1.2 Row Format, Storage Format, and SerDe

You can use a custom SerDe or one of Hive's built-in SerDes. The built-in delimited SerDe is used when no ROW FORMAT is specified or when ROW FORMAT DELIMITED is specified.

The property 'hive.default.fileformat' sets the default storage format.

Reference: [InputFormat, OutputFormat, and SerDe in Hive](https://www.coder4.com/archives/4031)
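As a minimal sketch (the table names demo_csv and demo_json are illustrative; the JsonSerDe requires the hive-hcatalog jar on the classpath):

-- Implicitly uses the built-in delimited SerDe (LazySimpleSerDe):
CREATE TABLE demo_csv (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Names a SerDe class explicitly:
CREATE TABLE demo_json (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;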

1.2.1.3 Partitioned Table

Partitioned tables are created with PARTITIONED BY. One or more partition columns can be specified; the table can additionally be bucketed with CLUSTERED BY columns, and the data within each bucket ordered with SORT BY.

-- Example: suppose the source data contains the columns id, date, and name.
-- A partition column must not repeat a data column, so the date value is
-- carried by the partition and the data column is renamed:
create table table_name (
id int,
dtDontQuery string,
name string
)
partitioned by (date string)

1.2.1.4 External Tables

LOCATION points at the existing data. When an external table is dropped, its data is not deleted.

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';

1.2.1.5 Create Table As Select (CTAS)

Restrictions:
1. The target table cannot be a partitioned table;
2. The target table cannot be an external table;
3. The target table cannot be a list-bucketing table.

Example

CREATE TABLE new_key_value_store
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFile
AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

1.2.1.6 Create Table Like

Copies an existing table's or view's definition into a new table (without data).

CREATE TABLE empty_key_value_store
LIKE key_value_store;

1.2.1.7 Bucketed Sorted Tables

For each table or partition, Hive can further organize the data into buckets, a finer-grained subdivision of the data range. Use the CLUSTERED BY clause to specify the bucketing column(s) and the number of buckets.


CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

Reference: [Hive basics: partitions, buckets, and Sort Merge Bucket Join](http://blog.csdn.net/wisgood/article/details/17186107)

1.2.1.8 Skewed Tables

This feature optimizes tables where one or more columns have heavily skewed values.

CREATE TABLE list_bucket_single (key STRING, value STRING)
SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES];
CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING)
SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [STORED AS DIRECTORIES];

Reference: [Hive data skew tuning summary](http://www.cnblogs.com/end/archive/2012/06/19/2554582.html)

1.2.1.9 Temporary Tables


1. Partitions are not supported;
2. Indexes are not supported.

Example

CREATE TEMPORARY TABLE temp_table (key STRING, value STRING);

1.2.2 Drop Table

DROP TABLE [IF EXISTS] table_name [PURGE];


Note: With PURGE, the data skips the trash and is deleted immediately.

1.2.3 Truncate Table

TRUNCATE TABLE table_name [PARTITION partition_spec];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)


Note: With PARTITION, only that partition's data is deleted; otherwise all data in the table is deleted.

1.3 Alter Table/Partition/Column

1.3.1 Alter Table

-- Rename a table
ALTER TABLE table_name RENAME TO new_table_name;
-- Set table properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ... )
-- Set the table comment
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
-- Add SerDe properties
ALTER TABLE table_name [PARTITION partition_spec] SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties];
ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;
serde_properties:
: (property_name = property_value, property_name = property_value, ... )
-- Example
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');
-- Alter storage properties
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
INTO num_buckets BUCKETS;
-- Alter Table Skewed
ALTER TABLE table_name SKEWED BY (col_name1, col_name2, ...)
ON ([(col_name1_value, col_name2_value, ...) [, (col_name1_value, col_name2_value), ...]
[STORED AS DIRECTORIES];
-- Alter Table Not Skewed
ALTER TABLE table_name NOT SKEWED;
-- Alter Table Not Stored as Directories
ALTER TABLE table_name NOT STORED AS DIRECTORIES;
-- Alter Table Set Skewed Location
ALTER TABLE table_name SET SKEWED LOCATION (col_name1="location1" [, col_name2="location2", ...] );

1.3.2 Alter Partition

Add partitions

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
-- Example
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';

Dynamic partitions

See DynamicPartitions.

Rename a partition

ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

Exchange partitions

Partitions can be moved between tables:

ALTER TABLE table_name_1 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_2;
-- multiple partitions
ALTER TABLE table_name_1 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_2;

Recover partitions (MSCK REPAIR TABLE)

Partitions added directly through HDFS commands are not recorded in the Hive metastore; the following command repairs the metadata automatically.

MSCK REPAIR TABLE table_name;
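A minimal sketch of the workflow (the database and path names are illustrative):

-- A partition directory added outside of Hive, e.g.:
--   hdfs dfs -mkdir -p /user/hive/warehouse/mydb.db/page_view/dt=20080810
-- is unknown to the metastore until the table is repaired:
MSCK REPAIR TABLE page_view;
SHOW PARTITIONS page_view;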

Drop partitions

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE];
-- (Note: PURGE available in Hive 1.2.0 and later, IGNORE PROTECTION not available 2.0.0 and later)

(Un)Archive Partition

ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;

1.3.3 Alter Either Table or Partition

Change the file format of a table or partition

ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;

Change the storage location of a table or partition

ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location";

Alter Table/Partition Touch

ALTER TABLE table_name TOUCH [PARTITION partition_spec];

Alter Table/Partition Protections

ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE NO_DROP [CASCADE];
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE OFFLINE;

Alter Table/Partition Compact

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])]
COMPACT 'compaction_type'
[WITH OVERWRITE TBLPROPERTIES ("property"="value" [, ...])];

Alter Table/Partition Concatenate
If a table or partition contains many small RCFile or ORC files, the following command merges them into larger files.

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

1.3.4 Alter Column

Change a column's name, type, position, or comment

ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
-- RESTRICT is the default; CASCADE changes the metadata of all partitions as well as the table's.

Add or replace columns

ALTER TABLE table_name
[PARTITION partition_spec]
-- (Note: Hive 0.14.0 and later)
ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
[CASCADE|RESTRICT]
-- (Note: Hive 0.15.0 and later)
-- REPLACE removes all existing columns and adds the new ones.

Partial Partition Specification

-- The following statement
SET hive.exec.dynamic.partition = true;
ALTER TABLE foo PARTITION (ds='2008-04-08', hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
-- is equivalent to
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=11) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=12) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
...

Supported operations: change column, add column, replace column, file format, SerDe properties.

1.4 Create/Drop/Alter View

1.4.1 Create View

Syntax:

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;

Example:

CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
COMMENT 'Referrers to The Onion website'
AS
SELECT DISTINCT referrer_url
FROM page_view
WHERE page_url='http://www.theonion.com';

1.4.2 Drop View

DROP VIEW [IF EXISTS] [db_name.]view_name;

1.4.3 Alter View Properties

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ...)

1.4.4 Alter View As Select

ALTER VIEW [db_name.]view_name AS select_statement;

1.5 Create/Drop/Alter Index

1.5.1 Create Index

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"];

Reference: CREATE INDEX

1.5.2 Drop Index

DROP INDEX [IF EXISTS] index_name ON table_name;

1.5.3 Alter Index

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;

1.6 Create/Drop Macro

1.6.1 Create Temporary Macro

Valid only in the current session.

CREATE TEMPORARY MACRO macro_name([col_name col_type, ...]) expression;
-- Examples
CREATE TEMPORARY MACRO fixed_number() 42;
CREATE TEMPORARY MACRO string_len_plus_two(x string) length(x) + 2;
CREATE TEMPORARY MACRO simple_add (x int, y int) x + y;

1.6.2 Drop Temporary Macro

DROP TEMPORARY MACRO [IF EXISTS] macro_name;

1.7 Create/Drop/Reload Function

1.7.1 Temporary Functions

-- Create
CREATE TEMPORARY FUNCTION function_name AS class_name;
-- Drop
DROP TEMPORARY FUNCTION [IF EXISTS] function_name;

1.7.2 Permanent Functions

-- Create
CREATE FUNCTION [db_name.]function_name AS class_name
[USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ];
-- Drop
DROP FUNCTION [IF EXISTS] function_name;
-- Make functions created by other sessions visible in the current session
RELOAD FUNCTION;
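For example, a sketch of a permanent function's life cycle (the class name com.example.hive.udf.MyLower and the jar path are hypothetical):

CREATE FUNCTION mydb.my_lower AS 'com.example.hive.udf.MyLower'  -- hypothetical UDF class
USING JAR 'hdfs:///lib/my_udf.jar';                              -- hypothetical jar location
SELECT mydb.my_lower('HIVE');
DROP FUNCTION IF EXISTS mydb.my_lower;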

1.8 Create/Drop/Grant/Revoke Roles and Privileges

Reference: Create/Drop/Grant/Revoke Roles and Privileges

1.9 Show

1.9.1 Show Databases

SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];


Note: The LIKE clause filters with wildcards: '*' matches any number of characters and '|' means OR, e.g. 'emp*|*ees'.
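For example:

SHOW DATABASES LIKE 'emp*|*ees';
-- matches databases such as employees, emp_db, trainees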

1.9.2 Show Tables/Partitions/Indexes

SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
SHOW PARTITIONS table_name;
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)];
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
-- Show table properties
SHOW TBLPROPERTIES table_name;
SHOW TBLPROPERTIES table_name("foo");
-- Show the CREATE TABLE statement
SHOW CREATE TABLE ([db_name.]table_name|view_name);
-- Show indexes
SHOW [FORMATTED] (INDEX|INDEXES) ON table_with_index [(FROM|IN) db_name];

1.9.3 Show Columns

SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name];

1.9.4 Show Functions

-- Show functions whose names start with 'a'
SHOW FUNCTIONS "a.*";
-- Show all functions
SHOW FUNCTIONS ".*";

1.9.5 Show Granted Roles and Privileges

Reference: Show Granted Roles and Privileges

1.9.6 Show Locks

SHOW LOCKS <table_name>;
SHOW LOCKS <table_name> EXTENDED;
SHOW LOCKS <table_name> PARTITION (<partition_spec>);
SHOW LOCKS <table_name> PARTITION (<partition_spec>) EXTENDED;
SHOW LOCKS (DATABASE|SCHEMA) database_name;
-- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

Reference: [HIVE-6460](https://issues.apache.org/jira/browse/HIVE-6460)

1.9.7 Show Conf

SHOW CONF <configuration_name>;
-- Returns
default value
required value
description

Reference: [Hive Configuration Properties](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties)

1.9.8 Show Transactions

SHOW TRANSACTIONS;
-- Returns
transaction ID
transaction state
user who started the transaction
machine where the transaction was started

1.9.9 Show Compactions

SHOW COMPACTIONS;

1.10 Describe

1.10.1 Describe Database

DESCRIBE DATABASE [EXTENDED] db_name;
DESCRIBE SCHEMA [EXTENDED] db_name;
-- (Note: Hive 0.15.0 and later)
-- For example
hive> desc database w1;
OK
w1    hdfs://hsm01:9000/hive/warehouse/w1.db    zkpk    USER
Time taken: 0.032 seconds, Fetched: 1 row(s)


Note: EXTENDED also shows the database properties.

1.10.2 Describe Table/View/Column


-- Hive 1.x.x and 0.x.x only.
DESCRIBE [EXTENDED|FORMATTED]
table_name[.col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];
DESCRIBE [EXTENDED|FORMATTED]
[db_name.]table_name[ col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];
-- Examples
hive> desc province.province;
OK
province    string    from deserializer
hive> desc w1.province province;
OK
province    string    from deserializer

1.10.3 Show Column Statistics

-- (Note: Hive 0.14.0 and later)
DESCRIBE FORMATTED [db_name.]table_name column_name;
-- (Note: Hive 0.14.0 to 1.x.x)
DESCRIBE FORMATTED [db_name.]table_name column_name PARTITION (partition_spec);

1.10.4 Describe Partition

-- (Hive 1.x.x and 0.x.x only)
DESCRIBE [EXTENDED|FORMATTED] table_name[.column_name] PARTITION partition_spec;
DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name [column_name] PARTITION partition_spec;

1.10.5 Hive 2.0+

DESCRIBE [EXTENDED | FORMATTED]
[db_name.]table_name [PARTITION partition_spec] [col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ];

1.11 Abort

Aborts the specified transactions.

ABORT TRANSACTIONS transactionID [, ...];

2 Statistics

Reference: StatsDev

Statistics on Hive tables and partitions are gathered with ANALYZE:

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
-- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS]
-- (Note: Hive 0.10.0 and later.)
[CACHE METADATA]
-- (Note: Hive 2.1.0 and later.)
[NOSCAN];
-- Examples
ANALYZE TABLE Table1 COMPUTE STATISTICS;
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS NOSCAN;
-- View the statistics
DESCRIBE EXTENDED TABLE1;
DESCRIBE EXTENDED TABLE1 PARTITION(ds='2008-04-09', hr=11);

3 Indexes

Reference: Indexing Resources

Examples:


Create/build, show, and drop index:
CREATE INDEX table01_index ON TABLE table01 (column2) AS 'COMPACT';
SHOW INDEX ON table01;
DROP INDEX table01_index ON table01;
Create then build, show formatted (with column names), and drop index:
CREATE INDEX table02_index ON TABLE table02 (column3) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX table02_index ON table2 REBUILD;
SHOW FORMATTED INDEX ON table02;
DROP INDEX table02_index ON table02;
Create bitmap index, build, show, and drop:
CREATE INDEX table03_index ON TABLE table03 (column4) AS 'BITMAP' WITH DEFERRED REBUILD;
ALTER INDEX table03_index ON table03 REBUILD;
SHOW FORMATTED INDEX ON table03;
DROP INDEX table03_index ON table03;
Create index in a new table:
CREATE INDEX table04_index ON TABLE table04 (column5) AS 'COMPACT' WITH DEFERRED REBUILD IN TABLE table04_index_table;
Create index stored as RCFile:
CREATE INDEX table05_index ON TABLE table05 (column6) AS 'COMPACT' STORED AS RCFILE;
Create index stored as text file:
CREATE INDEX table06_index ON TABLE table06 (column7) AS 'COMPACT' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Create index with index properties:
CREATE INDEX table07_index ON TABLE table07 (column8) AS 'COMPACT' IDXPROPERTIES ("prop1"="value1", "prop2"="value2");
Create index with table properties:
CREATE INDEX table08_index ON TABLE table08 (column9) AS 'COMPACT' TBLPROPERTIES ("prop3"="value3", "prop4"="value4");
Drop index if exists:
DROP INDEX IF EXISTS table09_index ON table09;
Rebuild index on a partition:
ALTER INDEX table10_index ON table10 PARTITION (columnX='valueQ', columnY='valueR') REBUILD;

4 Archiving

Hadoop Archives pack many files into a single archive while keeping each file transparently accessible, which reduces the number of files in a partition.

To enable archiving, three properties must be set:

hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;

Syntax:

ALTER TABLE table_name ARCHIVE|UNARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)

Examples:

-- Archive
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
-- Unarchive
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')

5 DML

5.1 Loading Files into Tables

5.1.1 Syntax

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

5.1.2 Examples

-- Load from the local file system, appending to the partition:
LOAD DATA LOCAL INPATH '/tmp/pv_2013-06-08_us.txt' INTO TABLE c02 PARTITION(date='2013-06-08', country='US')
-- Load from HDFS, overwriting the partition's existing data:
LOAD DATA INPATH '/user/admin/SqlldrDat/CnClickstat/20131101/18/clickstat_gp_fatdt0/0' OVERWRITE INTO TABLE c02_clickstat_fatdt1 PARTITION (dt='20131201');

5.2 Inserting Query Results into Hive Tables

5.2.1 Standard Insert

Syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] (z,y) select_statement1 FROM from_statement;

Example:

FROM B INSERT INTO TABLE A SELECT 1, 'abc' LIMIT 1;

5.2.2 Hive Extension (Multiple Inserts):

Syntax:

FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

5.2.3 Hive Extension (Dynamic Partition Inserts)

5.2.3.1 Syntax

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

5.2.3.2 Notes

Dynamic partition inserts are disabled by default. Configuration properties:

Property                                   Default   Description
--------------------------------------------------------------------------------------------
hive.exec.dynamic.partition                false     Whether dynamic partition inserts are enabled
hive.exec.dynamic.partition.mode           strict    In strict mode, at least one static partition column must be specified
hive.exec.max.dynamic.partitions.pernode   100       Max dynamic partitions each mapper/reducer node may create
hive.exec.max.dynamic.partitions           1000      Max total dynamic partitions that may be created
hive.exec.max.created.files                100000    Max HDFS files all mappers/reducers may create in one MapReduce job
hive.error.on.empty.partition              false     Whether to raise an error if the insert produces an empty result set

5.2.3.3 Example

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

5.3 Writing Data to the File System

5.3.1 Syntax

-- Standard
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
-- Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

5.3.2 Examples

-- Export to the local file system:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
-- Export to HDFS:
INSERT OVERWRITE DIRECTORY '/user/admin/SqlldrDat/CnClickstat/20131101/19/clickstat_gp_fatdt0/0' SELECT a.* FROM c02_clickstat_fatdt1 a WHERE dt='20131201';
-- One source can feed several target tables or directories in a single multi-target insert:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2013-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

5.4 Inserting Values into Tables

5.4.1 Syntax

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal

5.4.2 Example

INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

5.5 Update

Syntax:

UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

5.6 Delete

DELETE FROM tablename [WHERE expression]

6 import/export

6.1 Syntax

-- EXPORT
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]
-- IMPORT
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
FROM 'source_path'
[LOCATION 'import_target_path']

6.2 Examples


Simple export and import:
export table department to 'hdfs_exports_location/department';
import from 'hdfs_exports_location/department';
Rename a table on import:
export table department to 'hdfs_exports_location/department';
import table imported_dept from 'hdfs_exports_location/department';
Export and import a partition:
export table employee partition (emp_country="in", emp_state="ka") to 'hdfs_exports_location/employee';
import from 'hdfs_exports_location/employee';
Export a table, import into a partition:
export table employee to 'hdfs_exports_location/employee';
import table employee partition (emp_country="us", emp_state="tn") from 'hdfs_exports_location/employee';
Specify the import location:
export table department to 'hdfs_exports_location/department';
import table department from 'hdfs_exports_location/department'
location 'import_target_location/department';
Import as an external table:
export table department to 'hdfs_exports_location/department';
import external table department from 'hdfs_exports_location/department';

7 explain plan

7.1 Syntax

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

Example:

hive> explain
    > select * from student;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: student
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: classno (type: string), stuno (type: string), score (type: float)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            ListSink

Time taken: 0.106 seconds, Fetched: 17 row(s)

7.2 DEPENDENCY

More information about the query's inputs is shown.

7.3 AUTHORIZATION

Authorization-related information for the query is shown.


With hive.explain.user=true (default false), the execution plan is presented in a more detailed, user-oriented form.

8 Select

8.1 Syntax

[WITH CommonTableExpression (, CommonTableExpression)*]
(Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

8.2 REGEX Columns

When the configuration property hive.support.quoted.identifiers (default column) is set to none, regular expressions can be used to select columns. The regex syntax is Java's.

-- Return all columns except ds and hr
SELECT `(ds|hr)?+.+` FROM sales

8.3 Group by

8.3.1 Syntax

groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

Example (Multi-Group-By Inserts)

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;

8.3.2 Enhanced Aggregation

8.3.2.1 GROUPING SETS

GROUPING SETS is equivalent to several GROUP BY queries connected by UNION.

Example: a GROUPING SETS query and its equivalent GROUP BY ... UNION form.

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ((a, b), a, b, ())
-- is equivalent to
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM(c) FROM tab1 GROUP BY a, null
UNION
SELECT null, b, SUM(c) FROM tab1 GROUP BY null, b
UNION
SELECT null, null, SUM(c) FROM tab1

8.3.2.2 The GROUPING__ID Function

Indicates which grouping set a result row belongs to.

SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM lxw1234
GROUP BY month,day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;
month     day           uv   GROUPING__ID
------------------------------------------------
2015-03   NULL          5    1
2015-04   NULL          6    1
NULL      2015-03-10    4    2
NULL      2015-03-12    1    2
NULL      2015-04-12    2    2
NULL      2015-04-13    3    2
NULL      2015-04-15    2    2
NULL      2015-04-16    2    2
2015-03   2015-03-10    4    3
2015-03   2015-03-12    1    3
2015-04   2015-04-12    2    3
2015-04   2015-04-13    3    3
2015-04   2015-04-15    2    3
2015-04   2015-04-16    2    3

8.3.2.3 CUBE

Performs a GROUP BY over every possible combination of the listed columns.

GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())

8.3.2.4 ROLLUP

A special case of GROUPING SETS that computes level-by-level aggregations along one dimension's hierarchy.

GROUP BY a, b, c WITH ROLLUP is equivalent to
GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ())

8.3.2.5 hive.new.job.grouping.set.cardinality

When a query's number of grouping sets exceeds this value, Hive adds an extra map-reduce job to cut down the map-side data volume.
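For example (the threshold value here is illustrative):

SET hive.new.job.grouping.set.cardinality = 30;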

8.4 Order, Sort, Cluster, Distribute By, and Transform

8.4.1 Order By

Syntax:

colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST)
-- (Note: Available in Hive 2.1.0 and later)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy

Note:
When hive.mapred.mode=strict, a LIMIT clause is required, because ORDER BY sorts and emits the final result through a single reducer; without a limit, large inputs can take a very long time.
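For example, assuming strict mode and an illustrative sales table:

SET hive.mapred.mode = strict;
-- SELECT * FROM sales ORDER BY amount DESC;          -- rejected: no LIMIT
SELECT * FROM sales ORDER BY amount DESC LIMIT 100;   -- accepted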

8.4.2 Sort By

Syntax:

colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy

Notes:
Rows are sorted by the given columns before being passed to each reducer; with multiple reducers, each reducer's portion is sorted independently.

8.4.3 Distribute By

Distribute By spreads rows across reducers by the given columns: all rows with the same Distribute By column values go to the same reducer.
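A minimal sketch (reusing the demo_win_fun table from section 8.12): rows with the same cookieid land on the same reducer, and each reducer then sorts its rows by createtime:

SELECT cookieid, createtime
FROM demo_win_fun
DISTRIBUTE BY cookieid
SORT BY createtime;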

8.4.4 Cluster By

Cluster By is a shorthand for Distribute By plus Sort By on the same columns. The sort order cannot be specified (no ASC or DESC); it is always ascending.
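That is, the following two queries are equivalent (same illustrative table as above):

SELECT cookieid, createtime FROM demo_win_fun CLUSTER BY cookieid;
SELECT cookieid, createtime FROM demo_win_fun DISTRIBUTE BY cookieid SORT BY cookieid;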

8.4.5 Transform

Calls a user-defined Map/Reduce script from within a Hive query.

Given the table student(classNo, stuNo, score):

#!/usr/bin/env python
import sys

# Keep only rows whose score is at least 60.
for line in sys.stdin:
    (classNo, stuNo, score) = line.strip().split('\t')
    if int(score) >= 60:
        print "%s\t%s\t%s" % (classNo, stuNo, score)

add file /home/user/score_pass.py;
select
transform(classNo, stuNo, score)
using 'score_pass.py'
as classNo, stuNo, score
from student;

8.4.6 References

  1. [Hive's Transform feature](http://www.tuicool.com/articles/fuimmmQ)

  2. [Hive queries](http://blog.csdn.net/zythy/article/details/18814781)

8.5 Join

8.5.1 Syntax

join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression


Note: Hive does not support non-equality join conditions.

8.5.2 Examples

If every join in the query uses the same join key, the join chain is compiled into a single map/reduce job:

SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c
ON (c.key = b.key1)

The following join, by contrast, becomes two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)


Note: The reducer buffers the rows of every table in the join sequence except the last one, then streams the last table's rows through to serialize the result to the file system. This keeps reducer-side memory usage down, so in practice the largest table should be written last.

The streamed table can be changed with a hint:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

8.5.3 LEFT, RIGHT, and FULL OUTER

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)

LEFT: keeps all rows of a;
RIGHT: keeps all rows of b;
FULL: keeps all rows of both a and b.

8.5.4 LEFT SEMI JOIN

A more efficient formulation of IN/EXISTS subqueries. The right-hand table may only be referenced in the ON clause, not in WHERE or SELECT.

SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM B);

can be rewritten as

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key)

8.5.5 MapJoin

When hive.auto.convert.join is set to true, Hive automatically converts eligible joins into map joins at execution time.
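A map join can also be requested explicitly with a hint, loading the small table into memory (table names are illustrative):

SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON (a.key = b.key);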

Bucket mapjoin

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

8.5.6 Optimization

References:
JoinOptimization
深入浅出数据仓库中SQL性能优化之Hive篇 (an article on Hive SQL performance tuning in data warehouses)

8.6 UNION

8.6.1 Syntax

select_statement UNION [ALL | DISTINCT] select_statement UNION [ALL | DISTINCT] select_statement ...


Notes:
1. Before Hive 1.2.0, only UNION ALL is supported;
2. When UNION is mixed with UNION ALL, the UNION (DISTINCT) takes precedence over any UNION ALL to its left;
3. Each SELECT must produce the same column aliases and types.

8.6.2 Examples

Using UNION in a FROM clause

SELECT u.id, actions.date
FROM (
SELECT av.uid AS uid
FROM action_video av
WHERE av.date = '2008-06-03'
UNION ALL
SELECT ac.uid AS uid
FROM action_comment ac
WHERE ac.date = '2008-06-03'
) actions JOIN users u ON (u.id = actions.uid)

UNION can also be used in DDL and INSERT statements.

To apply ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT to an individual SELECT, wrap that SELECT in parentheses:

SELECT key FROM (SELECT key FROM src ORDER BY key LIMIT 10)subq1
UNION
SELECT key FROM (SELECT key FROM src1 ORDER BY key LIMIT 10)subq2

To apply ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT to the UNION as a whole, put the clause after the last SELECT:

SELECT key FROM src
UNION
SELECT key FROM src1
ORDER BY key LIMIT 10

8.7 TABLESAMPLE

TABLESAMPLE draws a sample of rows from a Hive table according to a given rule.

8.7.1 Sampling Bucketed Tables

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
-- Samples rows from a Hive table according to the given rule.
-- This statement hashes table lxw1 into 10 buckets at random and samples the first bucket:
SELECT COUNT(1) FROM lxw1 TABLESAMPLE (BUCKET 1 OUT OF 10 ON rand());

8.7.2 Block Sampling

block_sample: TABLESAMPLE (n PERCENT)
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s;
block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
SELECT * FROM source TABLESAMPLE(100M) s;
block_sample: TABLESAMPLE (n ROWS)
SELECT * FROM source TABLESAMPLE(10 ROWS);

8.8 Subqueries

Subqueries can be used in the FROM clause and in the WHERE clause.
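For example (illustrative tables):

-- Subquery in FROM (the alias t is required):
SELECT t.key FROM (SELECT key, value FROM src WHERE key > '100') t;
-- Subquery in WHERE:
SELECT * FROM a WHERE a.key IN (SELECT b.key FROM b);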

8.9 Virtual Columns

Virtual column                 Meaning
------------------------------------------------------------------
INPUT__FILE__NAME              Input file name of the mapper task
BLOCK__OFFSET__INSIDE__FILE    Current global offset in the file
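For example (the src table is illustrative):

SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, key
FROM src;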

8.10 UDF

See Chapter 9.

8.11 Lateral View

8.11.1 Syntax

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*

A LATERAL VIEW joins each input row with the rows a UDTF generates from it, making it easy to work with row-to-column results (such as explode) within a single query.

8.11.2 Example

Data

Array<int> col1    Array<string> col2
-------------------------------------
[1, 2]             ["a", "b", "c"]
[3, 4]             ["d", "e", "f"]

SELECT myCol1, col2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1;

int mycol1   Array<string> col2
---------------------------------
1            ["a", "b", "c"]
2            ["a", "b", "c"]
3            ["d", "e", "f"]
4            ["d", "e", "f"]

8.11.3 Multiple LATERAL VIEWs

SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;

8.11.4 Outer Lateral Views

Analogous to an outer join: when the UDTF produces no rows, LATERAL VIEW OUTER still emits the source row (with NULL for the generated columns), whereas a plain LATERAL VIEW drops it.

SELECT * FROM src LATERAL VIEW explode(array()) C AS a limit 1;
-- returns no rows
SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a limit 1;
-- returns
河北省 NULL

8.12 Windowing and Analytics

8.12.1 Window Functions

Data preparation

CREATE EXTERNAL TABLE demo_win_fun (
cookieid string,
createtime string,
-- page visit time
url STRING
-- URL of the visited page
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/hw/hive/win/';
cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55

8.12.1.1 LEAD

Syntax:

LEAD(col, n, DEFAULT) returns the value of the nth following row in the window.
The first argument is the column name;
the second is the offset n (optional, default 1);
the third is the default used when the nth following row does not exist (default NULL).

Example

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM demo_win_fun;
cookieid  createtime           url     rn  next_1_time          next_2_time
-------------------------------------------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   2015-04-10 10:00:02  2015-04-10 10:03:04
cookie1   2015-04-10 10:00:02  url2    2   2015-04-10 10:03:04  2015-04-10 10:10:00
cookie1   2015-04-10 10:03:04  1url3   3   2015-04-10 10:10:00  2015-04-10 10:50:01
cookie1   2015-04-10 10:10:00  url4    4   2015-04-10 10:50:01  2015-04-10 10:50:05
cookie1   2015-04-10 10:50:01  url5    5   2015-04-10 10:50:05  2015-04-10 11:00:00
cookie1   2015-04-10 10:50:05  url6    6   2015-04-10 11:00:00  NULL
cookie1   2015-04-10 11:00:00  url7    7   1970-01-01 00:00:00  NULL
cookie2   2015-04-10 10:00:00  url11   1   2015-04-10 10:00:02  2015-04-10 10:03:04
cookie2   2015-04-10 10:00:02  url22   2   2015-04-10 10:03:04  2015-04-10 10:10:00
cookie2   2015-04-10 10:03:04  1url33  3   2015-04-10 10:10:00  2015-04-10 10:50:01
cookie2   2015-04-10 10:10:00  url44   4   2015-04-10 10:50:01  2015-04-10 10:50:05
cookie2   2015-04-10 10:50:01  url55   5   2015-04-10 10:50:05  2015-04-10 11:00:00
cookie2   2015-04-10 10:50:05  url66   6   2015-04-10 11:00:00  NULL
cookie2   2015-04-10 11:00:00  url77   7   1970-01-01 00:00:00  NULL

8.12.1.2 LAG

Syntax

LAG(col, n, DEFAULT) returns the value of the nth preceding row in the window.
The first argument is the column name;
the second is the offset n (optional, default 1);
the third is the default used when the nth preceding row does not exist (default NULL).

Example

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM demo_win_fun;
cookieid  createtime           url     rn  last_1_time          last_2_time
-------------------------------------------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   1970-01-01 00:00:00  NULL
cookie1   2015-04-10 10:00:02  url2    2   2015-04-10 10:00:00  NULL
cookie1   2015-04-10 10:03:04  1url3   3   2015-04-10 10:00:02  2015-04-10 10:00:00
cookie1   2015-04-10 10:10:00  url4    4   2015-04-10 10:03:04  2015-04-10 10:00:02
cookie1   2015-04-10 10:50:01  url5    5   2015-04-10 10:10:00  2015-04-10 10:03:04
cookie1   2015-04-10 10:50:05  url6    6   2015-04-10 10:50:01  2015-04-10 10:10:00
cookie1   2015-04-10 11:00:00  url7    7   2015-04-10 10:50:05  2015-04-10 10:50:01
cookie2   2015-04-10 10:00:00  url11   1   1970-01-01 00:00:00  NULL
cookie2   2015-04-10 10:00:02  url22   2   2015-04-10 10:00:00  NULL
cookie2   2015-04-10 10:03:04  1url33  3   2015-04-10 10:00:02  2015-04-10 10:00:00
cookie2   2015-04-10 10:10:00  url44   4   2015-04-10 10:03:04  2015-04-10 10:00:02
cookie2   2015-04-10 10:50:01  url55   5   2015-04-10 10:10:00  2015-04-10 10:03:04
cookie2   2015-04-10 10:50:05  url66   6   2015-04-10 10:50:01  2015-04-10 10:10:00
cookie2   2015-04-10 11:00:00  url77   7   2015-04-10 10:50:05  2015-04-10 10:50:01

8.12.1.3 FIRST_VALUE

Syntax

FIRST_VALUE(col) returns the first value in the ordered partition, from its start up to the current row.

Example

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM demo_win_fun;
cookieid  createtime           url     rn  first1
---------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   url1
cookie1   2015-04-10 10:00:02  url2    2   url1
cookie1   2015-04-10 10:03:04  1url3   3   url1
cookie1   2015-04-10 10:10:00  url4    4   url1
cookie1   2015-04-10 10:50:01  url5    5   url1
cookie1   2015-04-10 10:50:05  url6    6   url1
cookie1   2015-04-10 11:00:00  url7    7   url1
cookie2   2015-04-10 10:00:00  url11   1   url11
cookie2   2015-04-10 10:00:02  url22   2   url11
cookie2   2015-04-10 10:03:04  1url33  3   url11
cookie2   2015-04-10 10:10:00  url44   4   url11
cookie2   2015-04-10 10:50:01  url55   5   url11
cookie2   2015-04-10 10:50:05  url66   6   url11
cookie2   2015-04-10 11:00:00  url77   7   url11

8.12.1.4 LAST_VALUE

Syntax

LAST_VALUE(col) returns the last value in the ordered partition up to the current row.

Example

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM demo_win_fun;
cookieid  createtime           url     rn  last1
-----------------------------------------------------------------
cookie1   2015-04-10 10:00:00  url1    1   url1
cookie1   2015-04-10 10:00:02  url2    2   url2
cookie1   2015-04-10 10:03:04  1url3   3   1url3
cookie1   2015-04-10 10:10:00  url4    4   url4
cookie1   2015-04-10 10:50:01  url5    5   url5
cookie1   2015-04-10 10:50:05  url6    6   url6
cookie1   2015-04-10 11:00:00  url7    7   url7
cookie2   2015-04-10 10:00:00  url11   1   url11
cookie2   2015-04-10 10:00:02  url22   2   url22
cookie2   2015-04-10 10:03:04  1url33  3   1url33
cookie2   2015-04-10 10:10:00  url44   4   url44
cookie2   2015-04-10 10:50:01  url55   5   url55
cookie2   2015-04-10 10:50:05  url66   6   url66
cookie2   2015-04-10 11:00:00  url77   7   url77

8.12.2 The OVER Clause

Data preparation

CREATE EXTERNAL TABLE demo_over (
cookieid string,
createtime string,
--day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
location '/hw/hive/over/';
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4

If ROWS BETWEEN is not specified, the window defaults to running from the start of the partition to the current row;
if ORDER BY is not specified, the aggregate is taken over all rows of the partition.
The key is the meaning of ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED PRECEDING: the start of the partition
UNBOUNDED FOLLOWING: the end of the partition

8.12.2.1 SUM

Syntax

SUM(col): the result depends on the ORDER BY clause (and hence the default window); ordering defaults to ascending.

Example

SELECT cookieid,
createtime,
pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: partition start to current row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- partition start to current row, same as pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding + 1 following
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM demo_over
order by createtime;
cookieid  createtime  pv  pv1  pv2  pv3  pv4  pv5  pv6
-----------------------------------------------------------------------------
cookie1   2015-04-10  1   1    1    26   1    6    26
cookie1   2015-04-11  5   6    6    26   6    13   25
cookie1   2015-04-12  7   13   13   26   13   16   20
cookie1   2015-04-13  3   16   16   26   16   18   13
cookie1   2015-04-14  2   18   18   26   17   21   10
cookie1   2015-04-15  4   22   22   26   16   20   8
cookie1   2015-04-16  4   26   26   26   13   13   4

8.12.2.2 COUNT

Example

SELECT cookieid,
createtime,
pv,
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: partition start to current row
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- partition start to current row, same as pv1
COUNT(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding + 1 following
COUNT(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM demo_over
order by createtime;
cookieid  createtime  pv  pv1  pv2  pv3  pv4  pv5  pv6
-----------------------------------------------------------------------------
cookie1   2015-04-10  1   1    1    7    1    2    7
cookie1   2015-04-11  5   2    2    7    2    3    6
cookie1   2015-04-12  7   3    3    7    3    4    5
cookie1   2015-04-13  3   4    4    7    4    5    4
cookie1   2015-04-14  2   5    5    7    4    5    3
cookie1   2015-04-15  4   6    6    7    4    5    2
cookie1   2015-04-16  4   7    7    7    4    4    1

8.12.2.3 AVG

Example

SELECT cookieid,
createtime,
pv,
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: partition start to current row
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- partition start to current row, same as pv1
AVG(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding + 1 following
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM demo_over
order by createtime;
cookieid  createtime  pv  pv1                 pv2                 pv3                 pv4                 pv5                pv6
-----------------------------------------------------------------------------
cookie1   2015-04-10  1   1.0                 1.0                 3.7142857142857144  1.0                 3.0                3.7142857142857144
cookie1   2015-04-11  5   3.0                 3.0                 3.7142857142857144  3.0                 4.333333333333333  4.166666666666667
cookie1   2015-04-12  7   4.333333333333333   4.333333333333333   3.7142857142857144  4.333333333333333   4.0                4.0
cookie1   2015-04-13  3   4.0                 4.0                 3.7142857142857144  4.0                 3.6                3.25
cookie1   2015-04-14  2   3.6                 3.6                 3.7142857142857144  4.25                4.2                3.3333333333333335
cookie1   2015-04-15  4   3.6666666666666665  3.6666666666666665  3.7142857142857144  4.0                 4.0                4.0
cookie1   2015-04-16  4   3.7142857142857144  3.7142857142857144  3.7142857142857144  3.25                3.25               4.0

8.12.2.4 MIN

Example

SELECT cookieid,
createtime,
pv,
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: partition start to current row
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- partition start to current row, same as pv1
MIN(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding + 1 following
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM demo_over
order by createtime;
cookieid  createtime  pv  pv1  pv2  pv3  pv4  pv5  pv6
-----------------------------------------------------------------------------
cookie1   2015-04-10  1   1    1    1    1    1    1
cookie1   2015-04-11  5   1    1    1    1    1    2
cookie1   2015-04-12  7   1    1    1    1    1    2
cookie1   2015-04-13  3   1    1    1    1    1    2
cookie1   2015-04-14  2   1    1    1    2    2    2
cookie1   2015-04-15  4   1    1    1    2    2    4
cookie1   2015-04-16  4   1    1    1    2    2    4

8.12.2.5 MAX

Example

SELECT cookieid,
createtime,
pv,
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default: partition start to current row
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- partition start to current row, same as pv1
MAX(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding + 1 following
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM demo_over
order by createtime;
cookieid  createtime  pv  pv1  pv2  pv3  pv4  pv5  pv6
-----------------------------------------------------------------------------
cookie1   2015-04-10  1   1    1    7    1    5    7
cookie1   2015-04-11  5   5    5    7    5    7    7
cookie1   2015-04-12  7   7    7    7    7    7    7
cookie1   2015-04-13  3   7    7    7    7    7    4
cookie1   2015-04-14  2   7    7    7    7    7    4
cookie1   2015-04-15  4   7    7    7    7    7    4
cookie1   2015-04-16  4   7    7    7    4    4    4

8.12.3 Analytic Functions

Data preparation

d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
CREATE EXTERNAL TABLE analytics_1 (
dept STRING,
userid string,
sal INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/hw/hive/analytics/1';
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
CREATE EXTERNAL TABLE analytics_2 (
cookieid string,
createtime string,
--day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/hw/hive/analytics/2';

8.12.3.1 RANK

Syntax

RANK() ranks rows within their partition; tied rows share a rank and leave gaps in the sequence.

8.12.3.2 DENSE_RANK

Syntax

DENSE_RANK() ranks rows within their partition; tied rows share a rank and leave no gaps.

Example

SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM analytics_2
WHERE cookieid = 'cookie1';
cookieid  day         pv  rn1  rn2  rn3
--------------------------------------------------
cookie1   2015-04-12  7   1    1    1
cookie1   2015-04-11  5   2    2    2
cookie1   2015-04-15  4   3    3    3
cookie1   2015-04-16  4   3    3    4
cookie1   2015-04-13  3   5    4    5
cookie1   2015-04-14  2   6    5    6
cookie1   2015-04-10  1   7    6    7

rn1: the 15th and 16th tie for 3rd; the 13th ranks 5th
rn2: the 15th and 16th tie for 3rd; the 13th ranks 4th
rn3: ties still receive distinct sequential numbers; among equal values the relative order is effectively arbitrary

8.12.3.3 ROW_NUMBER

Syntax

ROW_NUMBER() assigns sequential numbers starting from 1 within each partition, in the given order.
For example, numbering each cookie's days by descending pv gives the daily pv rank within the group.
ROW_NUMBER() has many applications, e.g. picking the top-ranked record per group, or the first referer of a session.

Example

SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM analytics_2;
cookieid  day         pv  rn
-------------------------------------------
cookie1   2015-04-12  7   1
cookie1   2015-04-11  5   2
cookie1   2015-04-15  4   3
cookie1   2015-04-16  4   4
cookie1   2015-04-13  3   5
cookie1   2015-04-14  2   6
cookie1   2015-04-10  1   7
cookie2   2015-04-15  9   1
cookie2   2015-04-16  7   2
cookie2   2015-04-13  6   3
cookie2   2015-04-12  5   4
cookie2   2015-04-14  3   5
cookie2   2015-04-11  3   6
cookie2   2015-04-10  2   7
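A common pattern built on this result: keep only the top-ranked row per group, e.g. each cookie's highest-pv day:

SELECT cookieid, createtime, pv
FROM (
SELECT cookieid, createtime, pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM analytics_2
) t
WHERE rn = 1;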

8.12.3.4 CUME_DIST

Syntax

CUME_DIST(): (number of rows less than or equal to the current value) / (total rows in the partition).
For example: the fraction of people earning at most the current salary.

Example

SELECT
dept,
userid,
sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM analytics_1;
dept  userid  sal   rn1  rn2
-------------------------------------------
d1    user1   1000  0.2  0.3333333333333333
d1    user2   2000  0.4  0.6666666666666666
d1    user3   3000  0.6  1.0
d2    user4   4000  0.8  0.5
d2    user5   5000  1.0  1.0

rn1: no PARTITION BY, so all 5 rows form one group:
row 1: 1 row <= 1000, so 1/5 = 0.2
row 3: 3 rows <= 3000, so 3/5 = 0.6
rn2: partitioned by dept; dept=d1 has 3 rows:
row 2: 2 rows <= 2000, so 2/3 = 0.6666666666666666

8.12.3.5 PERCENT_RANK

Syntax

PERCENT_RANK(): (RANK of the current row within its partition - 1) / (total rows in the partition - 1).

Example

SELECT
dept,
userid,
sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,               -- over the whole set
RANK() OVER(ORDER BY sal) AS rn11,                      -- RANK over the whole set
SUM(1) OVER(PARTITION BY NULL) AS rn12,                 -- total row count
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM analytics_1;
dept  userid  sal   rn1   rn11  rn12  rn2
---------------------------------------------------
d1    user1   1000  0.0   1     5     0.0
d1    user2   2000  0.25  2     5     0.5
d1    user3   3000  0.5   3     5     1.0
d2    user4   4000  0.75  4     5     0.0
d2    user5   5000  1.0   5     5     1.0

rn1 = (rn11 - 1) / (rn12 - 1):
row 1: (1-1)/(5-1) = 0/4 = 0
row 2: (2-1)/(5-1) = 1/4 = 0.25
row 4: (4-1)/(5-1) = 3/4 = 0.75
rn2: partitioned by dept; dept=d1 has 3 rows:
row 1: (1-1)/(3-1) = 0
row 3: (3-1)/(3-1) = 1

8.12.3.6 NTILE

Syntax

NTILE(n) splits the ordered partition into n slices and returns the slice number of the current row.
NTILE does not support ROWS BETWEEN.
If the rows do not divide evenly, the earlier slices receive the extra rows.

Example

SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, -- split each group into 2 slices
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, -- split each group into 3 slices
NTILE(4) OVER(ORDER BY createtime) AS rn3                        -- split all rows into 4 slices
FROM analytics_2
ORDER BY cookieid,createtime;
cookieid  day         pv  rn1  rn2  rn3
-------------------------------------------------
cookie1   2015-04-10  1   1    1    1
cookie1   2015-04-11  5   1    1    1
cookie1   2015-04-12  7   1    1    2
cookie1   2015-04-13  3   1    2    2
cookie1   2015-04-14  2   2    2    3
cookie1   2015-04-15  4   2    3    3
cookie1   2015-04-16  4   2    3    4
cookie2   2015-04-10  2   1    1    1
cookie2   2015-04-11  3   1    1    1
cookie2   2015-04-12  5   1    1    2
cookie2   2015-04-13  6   1    2    2
cookie2   2015-04-14  3   2    2    3
cookie2   2015-04-15  9   2    3    4
cookie2   2015-04-16  7   2    3    4

8.12.4 Distinct (Hive 2.1.0 and later)

COUNT(DISTINCT a) OVER (PARTITION BY c)

8.12.5 Aggregate Functions in the OVER Clause (Hive 2.1.0 and later)

SELECT rank() OVER (ORDER BY sum(b))
FROM T
GROUP BY a;

8.12.6 References

Hive分析窗口函数(一) SUM,AVG,MIN,MAX

Hive分析窗口函数(二) NTILE,ROW_NUMBER,RANK,DENSE_RANK

Hive分析窗口函数(三) CUME_DIST,PERCENT_RANK

Hive分析窗口函数(四) LAG,LEAD,FIRST_VALUE,LAST_VALUE

8.13 WITH Expressions

A WITH clause names a temporary result set produced by a query. It can be used in SELECT, INSERT, CREATE TABLE AS SELECT, and CREATE VIEW AS SELECT statements.

1. WITH is not supported inside subqueries;
2. Recursive queries are not supported.

8.13.1 Syntax

withClause: cteClause (, cteClause)*
cteClause: cte_name AS (select statement)

8.13.2 Examples

with q1 as ( select key from src where key = '5')
select *
from q1;
-- from style
with q1 as (select * from src where key= '5')
from q1
select *;
-- chaining CTEs
with q1 as ( select key from q2 where key = '5'),
q2 as ( select key from src where key = '5')
select * from (select key from q1) a;
-- union example
with q1 as (select * from src where key= '5'),
q2 as (select * from src s2 where key = '4')
select * from q1 union all select * from q2;
-- insert example
create table s1 like src;
with q1 as ( select key, value from src where key = '5')
from q1
insert overwrite table s1
select *;
-- ctas example
create table s2 as
with q1 as ( select key from src where key = '4')
select * from q1;
-- view example
create view v1 as
with q1 as ( select key from src where key = '5')
select * from q1;
select * from v1;
-- view example, name collision
create view v1 as
with q1 as ( select key from src where key = '5')
select * from q1;
with q1 as ( select key from src where key = '4')
select * from v1;

8.14 References

  1. [LanguageManual Select (official)](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select)
