How Hive Organizes Data


I am 苹果美女, a blogger on 靠谱客. This article introduces how Hive organizes its data; I am sharing it here in the hope that it is a useful reference.

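1. Databases:

A Hive database plays the same role as a database in MySQL.

Databases allow finer-grained data management: the data belonging to one business module is kept together in one database.

This is the same idea as splitting data across databases and tables.

In Hive, the data of different modules is generally placed in different databases.

This makes the data easier to manage.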
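As a minimal sketch of the one-database-per-module idea, the statements below create two databases and a table inside one of them. The names sales_db, hr_db and employee are hypothetical and chosen only for illustration.

```sql
-- Hypothetical database names, one per business module
create database if not exists sales_db comment 'data for the sales module';
create database if not exists hr_db comment 'data for the HR module';

-- List all databases known to the metastore
show databases;

-- Switch to a database; tables created afterwards belong to it
use hr_db;
create table if not exists employee(eid int, name string)
row format delimited fields terminated by ',';

-- From any database, the table can be referenced by its qualified name
select * from hr_db.employee limit 10;
```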

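2. Tables:

By who manages the data, Hive tables are divided into:

Managed (internal) tables:

Hive itself manages the data and has full rights to add and delete the table's underlying data. When a managed table is dropped, both its metadata and its data (the corresponding HDFS directory) are deleted.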

Example: creating a managed table

create table if not exists stu_managed(sid int,name string,sex string,age int,dept string) comment "test one managed_table" row format delimited fields terminated by ",";

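External tables:

An external table is more like a consumer of data that lives in HDFS: the data is managed by HDFS itself, and Hive only has the right to use it. When an external table is dropped, its metadata is deleted, but the corresponding data in HDFS is not.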

Example: creating an external table

create external table if not exists stu_external(sid int,name string,sex string,age int,dept string) comment "test one external_table" row format delimited fields terminated by ",";
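Because an external table only uses data that HDFS already holds, it is commonly pointed at an existing directory with a location clause. The sketch below assumes a hypothetical directory /data/shared/student containing comma-separated files, and a hypothetical table name stu_external_loc.

```sql
-- '/data/shared/student' is a hypothetical HDFS directory that already holds the data files
create external table if not exists stu_external_loc(
  sid int, name string, sex string, age int, dept string)
comment "external table over an existing directory"
row format delimited fields terminated by ","
location '/data/shared/student';

-- The files in that directory are readable immediately, with no load step
select * from stu_external_loc limit 10;
```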

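Summary: differences between managed and external tables

On dropping the table:

Dropping a managed table deletes both the metadata and the table data (in HDFS).

Dropping an external table deletes the metadata, but the table data (in HDFS) is kept.

In the create statement:

A managed table is created without the external keyword; an external table must include external.

In management rights:

The data of a managed table is managed by Hive itself, which has the right to delete it.

The data of an external table is managed by HDFS; Hive only has the right to use it, not to delete it.

In use cases:

Managed tables are generally used for intermediate-layer data (data derived from processing the raw data) and for result-layer data.

External tables are generally used for shared data and raw data (to guard against accidental deletion), and the two kinds of table are used together.

Whether a table is managed or external, dropping it always deletes its metadata.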
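The drop behaviour summarized above can be checked on two throwaway tables. This is a sketch only: tmp_managed and tmp_external are hypothetical names, and the dfs listing assumes the default warehouse directory /user/hive/warehouse and a session (such as the Hive CLI) that accepts dfs commands.

```sql
-- Hypothetical scratch tables, so that the tables used elsewhere are left untouched
create table if not exists tmp_managed(id int);
create external table if not exists tmp_external(id int);

-- "Table Type" reports MANAGED_TABLE or EXTERNAL_TABLE; "Location" is the HDFS directory
describe formatted tmp_managed;
describe formatted tmp_external;

drop table tmp_managed;   -- metadata and the HDFS directory are both removed
drop table tmp_external;  -- only the metadata is removed

-- Assumes the default warehouse path; the tmp_external directory should still be listed
dfs -ls /user/hive/warehouse/;
```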

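Partitioned tables:

A partitioned table splits its data into multiple partitions according to the values of one or more columns.

Ordinary partitioning:

Partitioning is based on a column, called the partition column.

Partition column: usually a column that is frequently used as a filter condition.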

Example: creating a partitioned table

Partition column: dept

create table if not exists stu_ptn(sid int,name string,sex string,age int) comment "test one partitioned_table" partitioned by(dept string) row format delimited fields terminated by ",";

alter table stu_ptn add partition(dept="MA"); (adds a partition for the department named MA)

alter table stu_ptn drop partition(dept="MA"); (drops the partition)

load data local inpath "/home/joker-717/tmpdata/student.txt" into table stu_ptn partition(dept=""); (loads a local file into a partition; put the target partition value, such as "MA", in place of the empty string)

insert into table stu_ptn partition (dept="MA") select sid,name,sex,age from stu_external where dept="MA"; (inserts query results into a single partition)

from stu_external insert into table stu_ptn partition(dept="CS") select sid,name,sex,age where dept="CS" insert into table stu_ptn partition(dept="IS") select sid,name,sex,age where dept="IS"; (inserts query results into several partitions in one pass)
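Since the partition column is meant to be a frequent filter condition, the following sketch shows how the partitions can be inspected and queried, reusing the stu_ptn and stu_external tables from above. The dynamic-partition insert at the end is an addition to the static inserts shown in the article, offered as one possible alternative rather than the author's method.

```sql
-- List the partitions that currently exist
show partitions stu_ptn;

-- Filtering on the partition column lets Hive read only the dept=MA directory
select sid, name, age from stu_ptn where dept = "MA";

-- Dynamic partitioning: the partition value comes from the last column of the select
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table stu_ptn partition(dept)
select sid, name, sex, age, dept from stu_external;
```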

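Bucketed tables:

Bucketing divides data at a finer granularity than partitioning. It splits the whole data set according to the hash value of a chosen column.

Approach:

Split both tables using the same rule.

Each small file produced by the split is called a bucket.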
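To make the hash-based split concrete, the query below computes an approximate bucket number for each row of stu_external using the built-in hash and pmod functions. It is an illustration under assumptions: the choice of age as the column and 3 as the bucket count is only an example here, and the expression approximates rather than reproduces Hive's internal bucket assignment.

```sql
-- Approximate bucket number per row: hash of the column modulo the bucket count
select sid, age, pmod(hash(age), 3) as bucket_id
from stu_external;

-- How many rows would land in each bucket under this rule
select pmod(hash(age), 3) as bucket_id, count(*) as row_count
from stu_external
group by pmod(hash(age), 3);
```

Rows with the same age always get the same bucket_id, which is what allows two tables split by the same rule to be matched bucket by bucket.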

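Purposes:

1. Improve the performance of join queries. When two tables are joined, their bucket counts are constrained: they must be equal or multiples of one another.

2. Improve the performance of sampling queries. Sampling relies on randomness, so the data in one bucket can be treated as one sample.

Storage of bucketed-table data:

Each bucket corresponds to a separate file.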

Example: creating a bucketed table

Bucketing column: age. Number of buckets: 3

create table if not exists stu_buk(sid int,name string,sex string,age int,dept string) clustered by(age) sorted by (dept desc,age asc) into 3 buckets row format delimited fields terminated by ",";

Bucketing: set these parameters first

set hive.strict.checks.bucketing=false;

set hive.mapred.mode=nonstrict;

insert into table stu_buk select * from stu_external; (inserts data into the bucketed table)
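Once stu_buk is populated, the sampling benefit of bucketing can be sketched with tablesample, which reads a chosen bucket instead of the whole table. The example reuses the 3 buckets on age defined above.

```sql
-- Read only the first of the 3 buckets as a sample
select * from stu_buk tablesample(bucket 1 out of 3 on age);

-- Compare the sample size with the full table
select count(*) from stu_buk tablesample(bucket 1 out of 3 on age);
select count(*) from stu_buk;
```

Because the sampling column matches the clustered by column, Hive can pick the sample from the bucket files directly rather than scanning every row.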

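insert overwrite local directory "/home/joker-717/hive_data" select * from stu_external where age=18; (exports one query result to a local directory)

from stu_external insert overwrite directory "/home/joker-717/hive_data" select * where age=18 insert overwrite directory "/home/joker-717/hive_data" select * where age=19; (exports several query results in one pass; without the local keyword the target is an HDFS directory, and each branch should be given its own directory so that the results do not collide)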