Big Data Learning Tutorial (SD Edition), Part 7: Hive


Author: 风趣短靴 (blogger at 靠谱客). This article walks through Hive: an overview, Hive vs RDBMS, installation, JDBC, the shell, configuration, data types, SQL, compression, and file formats.

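1. Hive Overview

Hive is a data warehouse tool. It maps structured data files to two-dimensional tables and offers SQL-like queries; under the hood it translates HQL into MapReduce (MR) jobs

Clients bundled with Hive

hive client
beeline client

Characteristics

HQL is meant for data analysis, but its processing granularity is coarse
Handles large data volumes, but with high latency
Supports user-defined functions

Architecture

Metastore: metadata storage
Client: user-facing clients
MapReduce: compute engine
HDFS: data source

Parser: parses the HQL and resolves table mappings against the metadata
Compiler: translates the HQL into MR
Optimizer: optimizes the execution logic
Executor: turns the logical plan into a physical plan and runs it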

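2. Hive vs RDBMS

Query language: SQL-like (HQL)
Data scale: large data volumes
Data updates: read-heavy and rarely written; data is mostly imported and exported in bulk
Execution latency: high, because of the data volume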

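3. Hive Installation

Download the installation package
Unpack the installation package
Configure the environment variables
Initialize the metadata

./schematool -dbType derby -initSchema

Start Hive to test it

# Start Hadoop first, then start Hive
./hive

Changing where the metadata is stored

By default Hive keeps its metadata in the bundled Derby database. Drawbacks: 1. the metadata is hard to inspect; 2. Derby does not support multi-tenancy, so several hive clients cannot use it at the same time. The metadata store can be switched to MySQL instead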

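3.1 Installing MySQL

The MySQL packages installed here are RPMs, so these steps only apply to the Red Hat family of distributions

Remove the MySQL/MariaDB that ships with the system

# Find installed packages
rpm -qa|grep mariadb
rpm -qa|grep mysql
# Uninstall without checking dependencies
rpm -e --nodeps mariadb-libs

Unpack the tar archive and install the RPM packages

tar -xvf xx.tar
# Install in this order: common, libs, libs-compat, client, server
rpm -ivh xxx.rpm
# On a minimal OS install, an extra dependency is required
yum install -y libaio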

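Edit the MySQL configuration file

vim /etc/my.cnf # check where datadir points

# Delete everything under the datadir directory

Initialize MySQL

mysqld --initialize --user=mysql

# Look up the temporary password
cat /var/log/mysqld.log

Start the MySQL service and log in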

systemctl start mysqld

mysql -u root -pXXX

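# Change the password (optional); the syntax differs between MySQL versions
# The form below works on MySQL 5.7; the password() function was removed in MySQL 8.0
set password = password("000000");

Enable remote connections to MySQL

# Inspect
select host,user from mysql.user;

# Modify
update mysql.user set host='%' where user='root';
flush privileges;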

3.2 Hive Metastore

Switch the metadata store to MySQL

Copy a MySQL JDBC driver jar into Hive's lib directory
Create a hive-site.xml configuration file under conf

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <!-- The four MySQL connection parameters for the metastore database -->
   <property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false&amp;allowPublicKeyRetrieval=true</value>
	 </property>  
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>root</value>
	</property> 
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>000000</value>
	</property>
   <!-- Metastore schema version verification and metadata validation -->
	<property>
	<name>hive.metastore.schema.verification</name>
	<value>false</value>
	</property>
	<property>
	<name>datanucleus.metadata.validate</name>
	<value>false</value>
	</property>
</configuration>
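
As a quick sanity check, the properties in a hive-site.xml like the one above can be read back programmatically. The following is only a minimal sketch using Python's standard xml.etree.ElementTree; the helper name load_hive_site and the trimmed inline sample are illustrative assumptions, not part of Hive.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the hive-site.xml above; in practice read the real file.
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false&amp;allowPublicKeyRetrieval=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
</configuration>
"""


def load_hive_site(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


props = load_hive_site(SAMPLE)
# The XML entity &amp; is decoded to a plain & in the parsed value.
print(props["javax.jdo.option.ConnectionURL"])
```

Note that the ampersand must stay escaped as &amp; inside the XML file itself, which is why the JDBC URL looks different in the file and in the parsed value.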

Log in to MySQL and create the metastore database

create database metastore;

Initialize the Hive metastore schema

schematool -initSchema -dbType mysql -verbose

Start Hive again

hive

3.3 Hive Remote Service

Enable remote access to Hive by running the metastore as a service

Edit hive-site.xml and add

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop102:9083</value>
</property>

Start the metastore service

hive --service metastore

4. Hive JDBC

To connect to Hive over JDBC, the hiveserver2 service must be running as well

Edit hive-site.xml

        <property> 
                <name>hive.server2.thrift.port</name> 
                <value>10000</value>
        </property>  
        <property> 
                <name>hive.server2.thrift.bind.host</name>
                <value>hadoop102</value>
        </property>

Start the services

hive --service metastore &
hive --service hiveserver2 &

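Wrapping the startup in a script

Shell scripting notes:

[ test ] && xxx || yyy: runs xxx if the test is true and yyy if it is false (yyy also runs if xxx itself fails)

eval $cmd: runs the string held in $cmd as a shell command

grep -v grep: filters the grep process itself out of the matches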

#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
if [[ ! -d $HIVE_LOG_DIR ]]; then
	mkdir -p $HIVE_LOG_DIR
fi

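# Print the PID(s) of processes whose command line matches $1; succeed if any were found.
# Callers also pass a port as $2, but it is not used here.
function check_process(){
	pid=$(ps -ef |grep -v grep |grep -i "$1" |awk '{print $2}')
	echo $pid
	[ "$pid" ] && return 0 || return 1
}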

function hive_start(){
	metapid=$(check_process HiveMetastore 9083)
	cmd="nohup hive --service metastore > $HIVE_LOG_DIR/metastore.log 2>&1 &"
	[ -z "$metapid" ] && eval $cmd && echo "HiveMetastore Is Starting!" || echo "HiveMetastore Is Running!"
	server2pid=$(check_process HiveServer2 10000)
	cmd="nohup hive --service hiveserver2  >$HIVE_LOG_DIR/HiveServer2.log 2>&1  &"
	[ -z "$server2pid" ] && eval $cmd && echo "Hiveserver2 Is Starting!" || echo "HiveServer2 IS Running!"
}

function hive_stop(){
	metapid=$(check_process HiveMetastore 9083)
	[ "$metapid" ] && kill $metapid && echo "HiveMetastore Is Killing!" || echo "HiveMetastore Not Running!"
	server2pid=$(check_process HiveServer2 10000)
	[ "$server2pid" ] && kill $server2pid && echo "HiveServer2 Is Killing!" || echo "HiveServer2 Not Running!"
}


case $1 in
	"start" )
		hive_start
		;;
	"stop" )
		hive_stop
		;;
	"restart" )
		hive_stop
		sleep 5
		hive_start
		;;
	"status" )
		check_process HiveMetastore 9083 && echo "HiveMetastore Is Running!" || echo "HiveMetastore Not Running!"
		check_process HiveServer2 10000 && echo "HiveServer2 Is Running!" || echo "HiveServer2 Not Running!"
		;;
	* )
		echo "Args Error!"
		;;
esac

Both services can be checked with jps -ml |grep Hive
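
Besides looking at the process list, it can help to confirm that the two ports configured above (9083 for the metastore, 10000 for hiveserver2) actually accept connections. The sketch below uses only Python's standard socket module; the function name port_open is an illustrative assumption, and a successful TCP connect only shows that something is listening, not that Hive is healthy.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def hive_ports_status(host):
    """Check the two ports used in this article (assumed defaults: 9083 and 10000)."""
    return {
        "HiveMetastore": port_open(host, 9083),
        "HiveServer2": port_open(host, 10000),
    }
```

Calling hive_ports_status("hadoop102") would return a dict of booleans for the host name used throughout this article; hiveserver2 can take a while to start listening after launch, so a False right after startup is not necessarily an error.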

5. Hive Shell

hive -e "sql" # run a SQL statement
hive -f query.sql # run the SQL statements in a file
quit;

# The command history is kept in the .hivehistory file in the current user's home directory

6. Hive Conf

Hive configuration precedence: set key=value inside the hive client > hive -hiveconf key=value > hive-site.xml > hive-default.xml

To view a configuration value, run set xxx; inside the hive client
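
The precedence order above behaves like a layered lookup where the first layer that defines a key wins. Here is a small sketch of that idea with Python's collections.ChainMap; the dictionaries and their values are made up for illustration and are not read from a real Hive installation.

```python
from collections import ChainMap

# Hypothetical values at each level, from lowest to highest precedence.
hive_default = {"hive.cli.print.header": "false", "hive.exec.parallel": "false"}
hive_site = {"hive.cli.print.header": "true"}      # hive-site.xml
hiveconf_args = {"hive.exec.parallel": "true"}     # hive -hiveconf key=value
session_set = {}                                   # set key=value in the client

# ChainMap searches its mappings left to right, so the first one that has the key wins.
effective = ChainMap(session_set, hiveconf_args, hive_site, hive_default)

print(effective["hive.cli.print.header"])  # value comes from hive-site.xml
print(effective["hive.exec.parallel"])     # value comes from -hiveconf

# A later `set` inside the client overrides every other level.
session_set["hive.exec.parallel"] = "false"
print(effective["hive.exec.parallel"])
```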

 <!-- Show column headers and the current database in the hive client; optional -->
<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>

7. Hive Type

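Hive automatically recognizes date strings as dates or timestamps, so the date functions can be applied to them directly

Complex types need their delimiters specified. Array elements are accessed by index, Map values by key, and Struct fields by attribute name

Data type	Java	Hive
Int	int	int
Long	long	bigint
Double	double	double
String	String	string
Date	Date	timestamp
Array	Array	array
Map	Map	map<string,string>
Struct	Class	struct<id:int,name:string>

Implicit conversion

Integer types and String can be implicitly converted to Double

Explicit conversion

cast (field as type)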

8. Hive SQL

8.1 Hive DDL

# 1. Create a database
create database [if not exists] xxx
[comment 'xxx']
[location 'xxx'];
# 2. Inspect databases
show databases;
show databases like '*xxx*';
desc database xxx;
desc database extended xxx;
# 3. Alter a database (rarely useful)
alter database xxx set dbproperties("key"="value");
# 4. Drop a database
drop database xxx;  # only works on an empty database
drop database xxx cascade; # force drop, tables included

Tables come in two types: managed (internal) tables, which are the default, and external tables

When a table is dropped, a managed table also deletes its data on HDFS, whereas an external table does not, which makes it safer

# 1. Create a table
create [external] table [if not exists] xxx(
col_name data_type [comment 'col xxx']
)
[comment 'table xxx']
[partitioned by (col type,col2 type2)]
[clustered by (col1,col2) [sorted by (col asc|desc)] into num_buckets buckets]
[row format delimited fields terminated by 'xxx']
[stored as file_type]
[location file_path]
[as select_statement];

# 1.m Create a table with complex types
create table test(
name string,
friends array<string>,
children map<string,int>,
address struct<street:string,city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':';
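
To make the three delimiters in the table definition above concrete, here is a rough Python sketch of how one text line would be split into the four columns. The function parse_row and the sample line are hypothetical; real Hive parsing is done by its SerDe and handles escaping and nulls that this sketch ignores.

```python
def parse_row(line, field_sep=",", item_sep="_", key_sep=":"):
    """Split one text line into (name, friends, children, address)."""
    name, friends, children, address = line.split(field_sep)
    friends_arr = friends.split(item_sep)            # array<string>
    children_map = {}
    for pair in children.split(item_sep):            # map<string,int>
        key, value = pair.split(key_sep)
        children_map[key] = int(value)
    street, city = address.split(item_sep)           # struct<street:string,city:string>
    return name, friends_arr, children_map, {"street": street, "city": city}


# Hypothetical sample line, not taken from the original article.
name, friends, children, address = parse_row(
    "alice,bob_carol,dave:5_erin:7,main street_springfield"
)
print(friends[1])        # corresponds to friends[1] in HQL
print(children["dave"])  # corresponds to children['dave']
print(address["city"])   # corresponds to address.city
```

Struct members share the collection items delimiter with arrays and map entries, which is why the address is split on '_' as well.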

# 2. Alter a table
alter table xxx set tblproperties('EXTERNAL'='TRUE');
alter table xxx rename to new_xxx;
# a string column cannot be changed to int
alter table xxx change old_col new_col type;
# replace swaps the table's whole column list for the new one
alter table xxx add|replace columns(col type,col2 type2);
# To drop a column, use replace columns to remove it indirectly

# 3. Inspect tables
show tables;
show create table xxx;
desc xxx;
desc formatted xxx;

# 4. Drop a table
drop table xxx;

8.2 Hive DML

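Importing data

# 1. Load from a file
load data [local] inpath 'xxx' [overwrite] into table xxx [partition (par_col=par_value)];
# 2. Insert from a query
insert into|overwrite table table_name select * from table2_name;
# 3. Create a table from a query
create table if not exists table_name as select * from table_name2;
# 4. Create a table that points at an existing data location
create table …… location 'xxx';
# 5. import; the path must be one produced by export
import table xxx from 'hdfs_path';

Exporting data

# 1. Export with insert
insert overwrite [local] directory 'xxx' [row format delimited fields terminated by ','] select * from table_name ;
# 2. Export with get
dfs -get /remote /local
# 3. Export from the client
hive -f xxx.sql >> result.txt
# 4. Export with export
export table xxx to 'hdfs_path';

How count(*) behaves depending on how the data was imported:

insert: both numFiles and numRows in the metadata are updated; count(*) does not run MR and the result is correct

put: neither value in the metadata is updated; count(*) does not run MR and the result is wrong

load: numFiles is updated but numRows is not; count(*) runs MR and the result is correct

If load reads a local file, the file is uploaded to the table's location on HDFS; if it reads a file already on HDFS, the file is moved there

Clearing data

# Only managed tables can be truncated
truncate table xxx;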

8.3 Hive DQL

Execution order:

from
join on
where
group by
aggregate functions such as avg(), sum()
having
select
distinct
order by
limit

Basic syntax

# Hive SQL is case-insensitive; do not indent with tabs
select * from table_name;
select col1,col2 from table_name;

# 1. alias
as

# 2. Arithmetic
+ - * /

# 3. Aggregate functions
count  max min sum avg

# 4. limit
limit x;

# 5. Filtering
where

# 6. Comparison
> >= < <= <>  != is [not] null  in (x,y)
[not] like % _
x<=>y : null-safe equality; false if exactly one side is null, true if both are
between xx and yy: inclusive on both ends
and or not

# 7. Grouping
group by xxx having yyy

# 8. Joins
## 1.inner join
xxx join yyy on
## 2.left outer join
xxx left join yyy on
## 3.right outer join
xxx right join yyy on
## 4.full outer join
xxx full join yyy on
## 5. xxx - yyy: rows that exist only in the left table xxx
xxx left join yyy on  where yyy.col is null
xxx left join yyy on  where xxx.col not in(select col from yyy)
## 6. yyy - xxx: the mirror image of the above, rows that exist only in the right table yyy
## 7. Rows that exist in only one of xxx and yyy
xxx full join yyy on  where xxx.col is null or yyy.col is null;
## union: both queries must select matching columns; union all skips deduplication and is faster
select mmm from ( ……t1 union ……t2)tmp;

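# 9. Sorting; order by is a global sort and always runs with a single reducer
order by xxx asc|desc
## sort by: rows are sorted within each partition but not globally; rows are assigned to partitions arbitrarily unless distribute by is given
sort by xxx asc|desc
distribute by col sort by xxx asc
## cluster by: shorthand for when distribute by and sort by use the same column; limitation: ascending only
cluster by xxx

# 10. Partitions: one directory per partition, which avoids full table scans; data is normally loaded per day
## 1. Add a partition
alter table table_name add partition(par_name=par_value);
## 2. Drop a partition
alter table table_name drop partition(par_name=par_value);
## 3. Inspect partitions
show partitions table_name;
desc formatted table_name;
## Two-level partitioning: two partition columns
## Partition repair: after files are put into a partition directory by hand, the metadata has to be repaired
msck repair table table_name;
## Dynamic partitioning: rows are routed to partitions according to the query result
### 1. Turn off strict mode; this is not needed if the bracketed clause in the statement below is left out
set hive.exec.dynamic.partition.mode=nonstrict;
### 2. The last column of the query is the partition column
insert into table table_name [partition(par_name)]
select xxx,xxx,par_name from X;

# 11. Buckets: multiple files per table; not used often, mainly for sampling queries
create table table_name(id int,name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';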
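
The dynamic partition insert in item 10 above can be pictured as grouping the query result by its last column and writing each group under a directory named column=value. The Python sketch below only models that routing; the function route_to_partitions and the sample rows are hypothetical, and no files are written.

```python
from collections import defaultdict


def route_to_partitions(rows, partition_column="par_name"):
    """Group rows by their last field, keyed like a Hive partition directory name."""
    partitions = defaultdict(list)
    for *data_columns, partition_value in rows:
        directory = f"{partition_column}={partition_value}"
        partitions[directory].append(tuple(data_columns))
    return dict(partitions)


# Hypothetical query result with columns (id, name, par_name).
result = [
    (1, "a", "2021-12-21"),
    (2, "b", "2021-12-22"),
    (3, "c", "2021-12-21"),
]
for directory, data in sorted(route_to_partitions(result).items()):
    print(directory, data)
```

The partition value itself is not stored in the data rows, which mirrors how Hive keeps it in the directory name rather than in the data files.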

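Common functions

Functions are classified by how many rows go in and come out:

UDF: one row in, one row out, e.g. upper, nvl, split, concat

UDAF: many rows in, one row out, e.g. count, sum, collect_set

UDTF: one row in, many rows out, e.g. explode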
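
The three categories differ only in the shape of their input and output. As an informal illustration in Python (the function names are made up and these are not Hive APIs), the shapes look like this:

```python
def udf_upper(value):
    """UDF shape: one value in, one value out."""
    return value.upper()


def udaf_collect_set(values):
    """UDAF shape: many values in, one value out (here a de-duplicated list)."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen


def udtf_explode(array):
    """UDTF shape: one value in, many rows out."""
    for item in array:
        yield item


rows = ["hive", "spark", "hive"]
print([udf_upper(r) for r in rows])    # applied once per row
print(udaf_collect_set(rows))          # applied once per group
print(list(udtf_explode(["a", "b"])))  # one input row becomes two output rows
```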

# Show Hive's built-in functions
show functions;
desc function fun_name;
desc function extended fun_name;

# 1. nvl
select nvl(col1,col2) from X;

# 2. case when then else end   |   if(condition,x,y)

# 3. concat, concat_ws, collect_set
select concat(col1,str,col2……) from
select concat_ws(sep,str1,str2,arr1<string>,……) from
# collect_set removes duplicates
select collect_set(col) from

# 4. explode, lateral view
select explode(arr) from

select col1,col2,new_col from old_table
lateral view explode(arr) new_table as new_col
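
The lateral view query above pairs every exploded element with the other columns of the row it came from. A minimal Python sketch of that behaviour follows; the function name and the sample table are hypothetical.

```python
def lateral_view_explode(rows, array_key, new_col):
    """Yield one output row per array element, keeping the row's other columns."""
    for row in rows:
        for item in row[array_key]:
            out = {k: v for k, v in row.items() if k != array_key}
            out[new_col] = item
            yield out


# Hypothetical table: one movie per row with an array of categories.
old_table = [
    {"movie": "m1", "categories": ["drama", "crime"]},
    {"movie": "m2", "categories": ["comedy"]},
]
for r in lateral_view_explode(old_table, "categories", "category"):
    print(r)
```

A row whose array is empty produces no output rows in this sketch, which matches a plain lateral view; Hive's lateral view outer keeps such rows instead.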

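Window functions

# 1. over: opens a window over the rows, a bit like a lateral view; think of it as a second pass over the query result, more flexible than group by
over()
CURRENT ROW # the current row
n PRECEDING # n rows before the current row
n FOLLOWING # n rows after the current row
UNBOUNDED # no bound
UNBOUNDED PRECEDING # from the first row of the partition
UNBOUNDED FOLLOWING # to the last row of the partition
# These must be used together with over()
LAG(col,n,default_val) # value of col n rows before the current row
LEAD(col,n, default_val) # value of col n rows after the current row
NTILE(n) # splits the ordered rows into n groups and returns the group number; useful for taking the top x% of the data

# Window partitioning
over(partition by col1,col2,……)
# Window frame: by default from the start of the partition to the current row
over(partition by col1 order by col2) 
over(partition by col1 order by col2 rows|range between xxx and yyy)

# When several rows share the same order by value, Hive treats them as a single row under the default frame

# Ranking; must be used together with over()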
rank() # 1 1 3 
dense_rank() # 1 1 2 
row_number() # 1 2 3
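
The difference between the three ranking functions is easiest to see on data with a tie. The sketch below reimplements them in plain Python for a single partition ordered descending; it is an illustration of the semantics, not how Hive computes them, and the function name rankings is made up.

```python
def rankings(values):
    """Return (value, rank, dense_rank, row_number) tuples for values sorted descending."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = dense = 0
    previous = object()  # sentinel that compares unequal to every real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # rank jumps to the row position after a tie
            dense += 1         # dense_rank only ever advances by one
            previous = value
        out.append((value, rank, dense, row_number))
    return out


for row in rankings([90, 90, 80]):
    print(row)
```

With the scores 90, 90, 80 the rank column comes out as 1 1 3, dense_rank as 1 1 2 and row_number as 1 2 3, matching the comments in the listing above.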

Additional functions

# 1. Dates: a 10-digit timestamp is in seconds, a 13-digit one in milliseconds
current_timestamp() # 2021-12-22 10:55:13.105
year() month() day() hour() minute() second()

unix_timestamp() # 1640141746
unix_timestamp("2021-12-22",'yyyy-MM-dd') # 1640131200
unix_timestamp('2021/12/22 10:10:10','yyyy/MM/dd HH:mm:ss') # 1640167810

from_unixtime(1640131200) # 2021-12-22 00:00:00
from_unixtime(1640131200,'yyyy-MM-dd') # 2021-12-22
from_unixtime(unix_timestamp('2021/12/22 10:10:10','yyyy/MM/dd HH:mm:ss'),'yyyy-MM-dd') # 2021-12-22

current_date() # 2021-12-22

datediff('2021-12-22','2021-11-11') # 41
date_add('2021-12-22',3) # 2021-12-25 
date_sub('2021-12-22',10) # 2021-12-12
last_day('2021-12-22') # 2021-12-31

date_format('2021-12-22 22:22:22','yyyy-MM-dd') # 2021-12-22
date_format('2021-12-22 22:22:22','yyyy/MM/dd') # 2021/12/22

# 2. Numbers
round(3.14),round(3.69) # 3 4
ceil(3.14),ceil(3.69) # 4 4
floor(3.14),floor(3.69) # 3 3

# 3. Strings
upper() lower() length() trim()
substring('abcde',1,2)  ==  substring('abcde',0,2) # ab  
lpad('abcde',6,'m') # mabcde
rpad('abcde',6,'m') # abcdem

regexp_replace('2021/12/22','/','-') # 2021-12-22

# 4. Collections
size(array('aaa','bbb','ccc')) # 3
map_keys(map('zhangsan',18,'lisi',20)) # ["zhangsan","lisi"]
map_values(map('zhangsan',18,'lisi',20)) # [18,20]
array_contains(array('aaa','bbb','ccc'),'bbb') # true
sort_array(array('aaa','ccc','bbb')) # ["aaa","bbb","ccc"]

# Hive wordcount
+---------------+
|  t_line.line  |
+---------------+
| hello,aaa     |
| hello,aaa     |
| hello,hadoop  |
| hadoop,hive   |
| hive,spark    |
+---------------+
select word,count(*)
from (
select 
explode(split(line,',')) word
from t_line
)t1
group by word;
+---------+------+
|  word   | _c1  |
+---------+------+
| aaa     | 2    |
| hadoop  | 2    |
| hello   | 3    |
| hive    | 2    |
| spark   | 1    |
+---------+------+

# grouping sets: multi-dimensional analysis, equivalent to a union of several group by queries
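
For comparison, the Hive wordcount query shown earlier in this listing maps onto a few lines of plain Python: the split plus explode step becomes a flattening loop and group by with count(*) becomes a Counter. The input lines are the five rows of t_line from the listing.

```python
from collections import Counter

lines = ["hello,aaa", "hello,aaa", "hello,hadoop", "hadoop,hive", "hive,spark"]

# explode(split(line, ',')): one word per output row
words = [word for line in lines for word in line.split(",")]

# group by word with count(*)
counts = Counter(words)
for word, n in sorted(counts.items()):
    print(word, n)
```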

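User-defined functions

Taking a UDF as the example: define a UDF that converts a date string into a date string in the standard format

For example:

Input: "1970/01/01 00:00:01.666"

Output: 1970-01-01

Add the hive-exec dependency
Write the core function logic
Package it and upload the jar, usually to Hive's lib directory
Restart Hive, or add the jar manually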

add jar /xxx/xxx.jar

Create and name the function

create [temporary] function fun_name as "fully qualified class name of the function"

Test it

select format_string('1999/01/01 10:10:10.666');
+-------------+
|     _c0     |
+-------------+
| 1999-01-01  |
+-------------+
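
The article does not show the UDF body. A real Hive UDF is written in Java against hive-exec, so the following Python sketch only illustrates the conversion logic such a function would need; the name format_date_string and the decision to return None on unparseable input are assumptions.

```python
from datetime import datetime


def format_date_string(text):
    """Convert 'yyyy/MM/dd HH:mm:ss.SSS' into 'yyyy-MM-dd'; return None on bad input."""
    try:
        parsed = datetime.strptime(text, "%Y/%m/%d %H:%M:%S.%f")
    except (TypeError, ValueError):
        # None input or a string in another format.
        return None
    return parsed.strftime("%Y-%m-%d")


print(format_date_string("1970/01/01 00:00:01.666"))
print(format_date_string("1999/01/01 10:10:10.666"))
```

Returning a null-like value instead of raising mirrors how most built-in Hive functions react to malformed input, which keeps one bad row from failing the whole query.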

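9. Hive Compression

Hive builds on Hadoop's compression, which is configured in core-site.xml and (mainly) mapred-site.xml

Hive itself distinguishes between intermediate compression and final output compression, both controlled through parameters

Common codecs: LZO, Snappy

10. Hive File Formats

The main ones are:

Row-oriented storage: TEXTFILE, SEQUENCEFILE
Columnar storage: ORCFILE, PARQUET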


Category: Big Data Learning
Published: 2023-10-20 09:11:24
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_o_26_f5_13_jg3.html
