概述
查看很多博客,都说join on和where一起使用时,先进行join产生临时表,再进行where条件过滤得到结果表,但使用explain查看执行计划,会发现执行顺序并不是这样。
在两表join(注意: 此处只表示内连接, 如果是left join的话,不会过滤null, 会产生数据倾斜)时如果关联字段为null,有些博客上写会产生数据倾斜或笛卡儿积,其实不会,因为在join之前就会对join的表中关联字段进行非空过滤。
> explain select
> *
> from
> deal_tb
> join
> order_tb
> on daystr=day
> where
> month="2019-02";
OK
Explain
STAGE DEPENDENCIES:
Stage-4 is a root stage # 先执行Stage-4
Stage-3 depends on stages: Stage-4 # Stage-3依赖于Stage-4
Stage-0 depends on stages: Stage-3 # Stage-0依赖于Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_1:order_tb #先执行join后的表
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_1:order_tb
TableScan #扫描表
alias: order_tb
Statistics: Num rows: 14 Data size: 280 Basic stats: COMPLETE Column stats: NONE
Filter Operator #执行where条件进行过滤
predicate: ((UDFToString(month) = '2019-01') and day is not null) (type: boolean)
Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
Select Operator #在执行select语句
expressions: day (type: varchar(10)), money (type: int)
outputColumnNames: _col1, _col2
Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col1 (type: varchar(10))
1 _col1 (type: varchar(10))
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan #扫描表
alias: deal_tb
Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
Filter Operator #执行where语句进行过滤
predicate: daystr is not null (type: boolean) #对关联字段进行非空过滤
Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
Select Operator #执行select语句
expressions: id (type: int), daystr (type: varchar(10)), amount (type: decimal(10,2))
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map: #执行join
Inner Join 0 to 1
keys:
0 _col1 (type: varchar(10))
1 _col1 (type: varchar(10))
outputColumnNames: _col0, _col1, _col2, _col4, _col5
Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: varchar(10)), _col2 (type: decimal(10,2)), '2019-01' (type: varchar(7)), _col4 (type: varcha
r(10)), _col5 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 1.019 seconds, Fetched: 71 row(s)
在explain后的执行计划中的29行和47行可以看出,先扫描join后的表,对此表进行where语句的条件过滤和关联字段非空过滤,再扫描from后的表进行where语句中条件的过滤和关联字段的非空过滤。
当where语句中的条件为关联字段的过滤,不管你过滤的字段名用A表还是B表,在join之前都会对两表的关联字段进行过滤,不信可以查看SQL语句的执行计划:
> explain select
> *
> from
> deal_tb
> join
> order_tb
> on daystr=day
> where
> daystr="2019-01-20";
OK
Explain
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
order_tb
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
order_tb
TableScan
alias: order_tb
Statistics: Num rows: 14 Data size: 280 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (day = '2019-01-20') (type: boolean)
Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 daystr (type: varchar(10))
1 day (type: varchar(10))
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: deal_tb
Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (daystr = '2019-01-20') (type: boolean)
Statistics: Num rows: 19 Data size: 361 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 daystr (type: varchar(10))
1 day (type: varchar(10))
outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: varchar(10)), _col2 (type: decimal(10,2)), _col6 (type: varchar(7)), _col7 (type: varchar(10))
, _col8 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 2.237 seconds, Fetched: 63 row(s)
在执行计划中30行和44行出现这两句话,说明对两表都进行了关联字段where语句条件的过滤
predicate: (day = ‘2019-01-20’) (type: boolean)
predicate: (daystr = ‘2019-01-20’) (type: boolean)
对于hive SQL执行的顺序如果有模糊的,建议使用explain查看执行计划,这样能看到正确的执行顺序。
最后
以上就是忐忑果汁为你收集整理的hive中join on和where一起使用时的执行顺序及关联字段为null的情况的全部内容,希望文章能够帮你解决hive中join on和where一起使用时的执行顺序及关联字段为null的情况所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
发表评论 取消回复