我是靠谱客的博主 粗犷抽屉,这篇文章主要介绍hive执行计划分析(join详解),现在分享给大家,希望可以做个参考。

我们都知道执行的hive sql是需要编译成MapReduce任务去执行的,那是如何编译的呢,可以分为六个阶段:

  1. Antlr定义SQL的语法规则,完成SQL词法、语法解析,将SQL转化为抽象语法树AST Tree
  2. 遍历AST Tree,抽象出查询的基本组成单元QueryBlock
  3. 遍历QueryBlock,翻译为执行操作树OperatorTree
  4. 逻辑层优化器进行OperatorTree变换,合并不必要的ReduceSinkOperator,减少shuffle数据量
  5. 遍历OperatorTree,翻译为MapReduce任务
  6. 物理层优化器进行MapReduce任务的变换,生成最终的执行计划

通过六个阶段得到了执行计划,通过执行计划我们可以清楚的知道,sql的执行顺序以及执行过程,这样有助于我们对底层的理解以及对代码的优化,提高执行效率

下面我们来看看怎么查看执行计划:

语法:
复制代码
1
2
EXPLAIN [EXTENDED|CBO|AST|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION|ANALYZE] query

查看执行计划的关键词为:EXPLAIN
官方文档上也有详细的描述:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

我们来看一个简单的例子:
复制代码
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
hive> explain select * from test.test8; OK STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: TableScan alias: test8 Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: user_id (type: int), name (type: string), address (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 1 Data size: 38 Basic stats: COMPLETE Column stats: NONE ListSink Time taken: 0.791 seconds, Fetched: 17 row(s)
执行计划第一部分:
STAGE DEPENDENCIES,各stage之间的依赖关系,就是执行的顺序
复制代码
1
2
3
STAGE DEPENDENCIES: Stage-0 is a root stage
执行计划第二部分:
STAGE PLANS,详细的执行计划

这里只有一步,没有依赖的操作,比较简单

复制代码
1
2
3
STAGE PLANS: Stage: Stage-0

抓取数据操作,没有limit限制

复制代码
1
2
3
4
Fetch Operator limit: -1

表扫描,表别名为test8

复制代码
1
2
3
4
TableScan alias: test8

需要查询的字段,user_id 、name 、address以及字段类型

复制代码
1
2
3
4
Select Operator expressions: user_id (type: int), name (type: string), address (type: string)

输出字段

复制代码
1
2
outputColumnNames: _col0, _col1, _col2

如果是MR任务的话,
分为 Map Operator Tree和Reduce Operator Tree,如果中间有过滤操作的话,Filter Operator,聚合操作Group By Operator,join操作Join Operator等等。

下面我们来看几个复杂一点的例子:
聚合操作:
复制代码
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
hive> explain select > apptypeid, > uid, > srcqid, > os, > isnewuser, > SUM(pv) AS pv, > dt > FROM dw_center.dwb_open_srcqid_os_ver_user > where dt='20210727' > group by apptypeid,uid,srcqid,os,isnewuser,dt; OK STAGE DEPENDENCIES:--stage之间的依赖关系,有两个stage,下面依赖上面 Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS:--执行计划 Stage: Stage-1 Map Reduce--MR任务 Map Operator Tree:--map阶段 TableScan--表扫描 alias: dwb_open_srcqid_os_ver_user--表名 Statistics: Num rows: 220982 Data size: 91044719 Basic stats: COMPLETE Column stats: NONE--表信息 Select Operator--需要查询的字段以及字段结构 expressions: apptypeid (type: string), uid (type: string), srcqid (type: string), os (type: string), isnewuser (type: int), pv (type: bigint) outputColumnNames: apptypeid, uid, srcqid, os, isnewuser, pv--输出的字段 Statistics: Num rows: 220982 Data size: 91044719 Basic stats: COMPLETE Column stats: NONE Group By Operator--group by操作 aggregations: sum(pv)--聚合函数sum,聚合字段pv keys: apptypeid (type: string), uid (type: string), srcqid (type: string), os (type: string), isnewuser (type: int)--聚合的key mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5--map阶段输出的字段 Statistics: Num rows: 220982 Data size: 91044719 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator--reduce阶段输出信息 key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int) sort order: +++++--按5个字段正序排序 Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int) Statistics: Num rows: 220982 Data size: 91044719 Basic stats: COMPLETE Column stats: NONE value expressions: _col5 (type: bigint)--聚合后的值 Reduce Operator Tree:--reduce阶段 Group By Operator--group by聚合操作 aggregations: sum(VALUE._col0)--聚合函数 keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: int)--聚合的key mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 110491 Data size: 45522359 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int), _col5 (type: bigint), '20210727' (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Statistics: Num rows: 110491 Data size: 45522359 Basic stats: COMPLETE Column stats: NONE File Output Operator--文件输出 compressed: false--是否压缩 Statistics: Num rows: 110491 Data size: 45522359 Basic stats: COMPLETE Column stats: NONE table:输出数据的格式 input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator--抓取数据操作 limit: -1--不做limit限制 Processor Tree: ListSink Time taken: 0.471 seconds, Fetched: 52 row(s)
join操作:
复制代码
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
hive> explain select > t1.apptypeid, > t1.uid, > t1.srcqid, > t2.os, > t1.dt > from > (select > apptypeid, > uid, > srcqid, > dt > from dw_center.dwb_open_srcqid_user > where dt='20210727') t1 > > left join > (select > apptypeid, > uid, > os, > dt > from dw_center.dwb_open_os_user > where dt='20210727') t2 > on t1.apptypeid=t2.apptypeid and t1.uid=t2.uid and t1.dt=t2.dt; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree:--map阶段 TableScan alias: dwb_open_srcqid_user--表别名 Statistics: Num rows: 289319 Data size: 86795789 Basic stats: COMPLETE Column stats: NONE Select Operator--查询的字段以及字段结构 expressions: apptypeid (type: string), uid (type: string), srcqid (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 289319 Data size: 86795789 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string)--关联的字段 sort order: ++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string) Statistics: Num rows: 289319 Data size: 86795789 Basic stats: COMPLETE Column stats: NONE value expressions: _col2 (type: string) TableScan alias: dwb_open_os_user--表别名 Statistics: Num rows: 259103 Data size: 77731098 Basic stats: COMPLETE Column stats: NONE Select Operator--查询的字段以及字段结构 expressions: apptypeid (type: string), uid (type: string), os (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 259103 Data size: 77731098 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string)--关联的字段 sort order: ++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string) Statistics: Num rows: 259103 Data size: 77731098 Basic stats: COMPLETE Column stats: NONE value expressions: _col2 (type: string) Reduce Operator Tree:--reduce阶段 Join Operator--join操作 condition map: Left Outer Join0 to 1 --left join操作 keys:--关联的key,dt为虚拟字段,没有参与关联 0 _col0 (type: string), _col1 (type: string) 1 _col0 (type: string), _col1 (type: string) outputColumnNames: _col0, _col1, _col2, _col5--输出字段 Statistics: Num rows: 318250 Data size: 95475369 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col5 (type: string), '20210727' (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4 Statistics: Num rows: 318250 Data size: 95475369 Basic stats: COMPLETE Column stats: NONE File Output Operator--文件输出 compressed: false Statistics: Num rows: 318250 Data size: 95475369 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Time taken: 2.032 seconds, Fetched: 61 row(s)

最后

以上就是粗犷抽屉最近收集整理的关于hive执行计划分析(join详解)的全部内容,更多相关hive执行计划分析(join详解)内容请搜索靠谱客的其他文章。

本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
点赞(72)

评论列表共有 0 条评论

立即
投稿
返回
顶部