Hive: Key Points on Partitioned Tables, Buckets, and Skewed Tables


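I'm 重要戒指, a blogger on 靠谱客. I collected these notes during recent development work; they cover Hive partitioned tables, buckets, and skewed tables. I found them useful and am sharing them here for reference.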

Overview

Why introduce partitions and buckets?

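By default a Hive select scans the entire table. Partitions and buckets divide a table's data into chunks, so a query can read only the chunks it needs.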

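Partitions are the coarser-grained division and buckets the finer-grained one; both improve efficiency for queries that touch only a small range of the data.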

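Partitions:

partitioned by (partition_column_name partition_column_type)
e.g. partitioned by (time date)
Static partition: the partition value is fixed in advance [suppose a job computes sales totals every day, with the table partitioned by date; each day's data is inserted into the specified date partition]
Dynamic partition: the partition value is not fixed in advance and is determined by the input data [for example, JD.com has many second-level categories, each containing many products, and the table is partitioned by second-level category]
Whether dynamic partitioning is enabled is controlled by set hive.exec.dynamic.partition [true enables it]
If the mode (hive.exec.dynamic.partition.mode) is set to strict, at least one static partition is required [for flexibility it is usually set to nonstrict]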
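
To make the difference concrete, here is a minimal sketch of a static and a dynamic partition insert. The tables sales and sales_staging, their columns, and the date value are hypothetical and are not from the original notes.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE IF NOT EXISTS sales (
  product_id STRING,
  amount     DOUBLE
)
PARTITIONED BY (dt STRING);

CREATE TABLE IF NOT EXISTS sales_staging (
  product_id STRING,
  amount     DOUBLE,
  dt         STRING
);

-- Static partition: the partition value is written out in the statement.
INSERT INTO TABLE sales PARTITION (dt = '2020-01-01')
SELECT product_id, amount
FROM sales_staging
WHERE dt = '2020-01-01';

-- Dynamic partition: the partition value is taken from the data,
-- specifically from the last column of the SELECT list.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE sales PARTITION (dt)
SELECT product_id, amount, dt
FROM sales_staging;
```

With the mode left at strict, the second insert would be rejected because it names no static partition.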

Each partition is stored in its own directory; every partition column adds one directory level.

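partitioned by (column) [this is a pseudo column; its name must not duplicate a column in the table definition]
cluster by: shorthand for distribute by plus sort by on the same columns; each reducer receives a disjoint set of keys and sorts its own data [this is not a global sort]
sort by: sorts the data within each reducer [in a table definition, sorted by keeps the data within each bucket ordered]
distribute by: sends rows to different reducers according to the given columns; the data within each reducer is unordered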
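
As a rough illustration of how these clauses combine in a query, assuming a hypothetical table page_view with columns userid and viewtime:

```sql
-- page_view is a hypothetical table with columns userid and viewtime.

-- Rows with the same userid go to the same reducer;
-- each reducer then sorts its own rows by viewtime.
SELECT userid, viewtime
FROM page_view
DISTRIBUTE BY userid
SORT BY viewtime;

-- CLUSTER BY userid is shorthand for
-- DISTRIBUTE BY userid SORT BY userid.
SELECT userid, viewtime
FROM page_view
CLUSTER BY userid;
```

Neither query yields a single globally sorted result; that would need order by.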

Bucketed tables [bucketed sorted tables]
create table page (
viewtime int,
userid string,
id string
)
partitioned by (dt string) -- the original omitted the type of dt; string is an assumption
clustered by (userid) sorted by (viewtime) into 30 buckets;

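Skewed tables [skewed table]
These are useful once you already know the table's data well enough to tell which values are heavily skewed.
skewed by (userid) on (null) [on lists the skewed values]
Advantage: a skewed table stores the heavily skewed values separately in their own files, and with stored as directories each skewed value gets its own directory, so a query that filters on the skewed column can avoid the skewed data. Even with a skewed table defined, a query that does not filter still scans the whole table.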
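
A minimal sketch of a skewed table definition follows. The table name page_skewed and the skewed values '0' and 'unknown' are invented for illustration; stored as directories is the optional clause that gives each skewed value its own directory.

```sql
-- Hypothetical table; the skewed values '0' and 'unknown' are made up.
CREATE TABLE page_skewed (
  viewtime INT,
  userid   STRING,
  id       STRING
)
SKEWED BY (userid) ON ('0', 'unknown')
STORED AS DIRECTORIES;

-- A query that filters on the skewed column is the case
-- where the separate storage can help.
SELECT count(*)
FROM page_skewed
WHERE userid = '0';
```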

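Temporary tables [temporary table]
A table visible only in the current session [opening a Hive session counts as starting one session]
In Hive, enter: show tables
create temporary table [creates a temporary table]
desc formatted table_name [shows the table's location URL; a table that is not stored in the metastore is a temporary table]
Temporary tables are not recommended, because they support neither indexes nor partitions. If a temporary table test and a regular table test both exist, operations on test go to the temporary table by default.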
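
The commands above might be used together as follows. The names tmp_user_views and page_view are hypothetical; page_view stands in for any existing table with a userid column.

```sql
-- Hypothetical names; page_view stands in for any existing table.
CREATE TEMPORARY TABLE tmp_user_views AS
SELECT userid, count(*) AS views
FROM page_view
GROUP BY userid;

-- The temporary table is listed only within this session.
SHOW TABLES;

-- Inspect the Location field to see where the data is kept.
DESC FORMATTED tmp_user_views;
```

When the session ends, the temporary table and its data are removed.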

Dropping tables

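1. Dropping a managed (internal) table [both the metadata and the data files are deleted; when purge is not specified, the data is actually moved to the trash]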

drop table if exists table_name purge [the data skips the trash and is deleted permanently]

Dropping an external table [only the metadata is deleted; the data itself is managed by a third party such as HDFS]
Dropping a table that is referenced by a view raises no error, but the view becomes invalid
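
A short sketch of the drop variants described in this section; the table names are hypothetical.

```sql
-- Hypothetical table names, for illustration only.

-- Managed table, no PURGE: metadata is removed and the data files
-- are moved to the trash (when the trash is enabled).
DROP TABLE IF EXISTS sales_managed;

-- Managed table with PURGE: the data files skip the trash.
DROP TABLE IF EXISTS sales_backup PURGE;

-- External table: only the metadata is removed;
-- the files at the table's location are left in place.
DROP TABLE IF EXISTS sales_external;
```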

Altering tables
Rename a table: alter table [table_name] rename to [new_table_name]
Change table properties: alter table [table_name] set tblproperties (property_name = value)

Altering partitions:
Add partitions: alter table [table_name] add
if not exists partition [name]
if not ...

Rename a partition: alter table [table_name] partition [p_name] rename to partition (p_name)
Exchange partitions: alter table [table_name] exchange partition (p_name, p_name) with table [t_name]

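Change the table comment: alter table [table_name] set tblproperties ('comment' = value)
Change storage properties: alter table [table_name] clustered by (col, col) into num buckets [this changes only the table's metadata; existing data is not reorganized]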
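
For reference, the alter statements above could look like this with concrete values filled in. All table names, partition values, and the bucket count are hypothetical; the statements assume a table partitioned by dt, and page_archive is assumed to have the same schema as page_view.

```sql
-- Hypothetical names and values, for illustration only.

-- Rename a table.
ALTER TABLE page RENAME TO page_view;

-- Set a table property; here, the table comment.
ALTER TABLE page_view SET TBLPROPERTIES ('comment' = 'Page view log');

-- Add a partition if it does not already exist.
ALTER TABLE page_view ADD IF NOT EXISTS PARTITION (dt = '2020-01-01');

-- Rename a partition.
ALTER TABLE page_view PARTITION (dt = '2020-01-01')
  RENAME TO PARTITION (dt = '2020-01-02');

-- Move the partition out of page_view and into page_archive.
ALTER TABLE page_archive EXCHANGE PARTITION (dt = '2020-01-02')
  WITH TABLE page_view;

-- Change bucketing metadata; existing files are not rewritten.
ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewtime)
  INTO 64 BUCKETS;
```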