我是靠谱客的博主 忐忑果汁,最近开发中收集的这篇文章主要介绍hive中join on和where一起使用时的执行顺序及关联字段为null的情况,觉得挺不错的,现在分享给大家,希望可以做个参考。

概述

查看很多博客,都说join on和where一起使用时,先进行join产生临时表,再进行where条件过滤得到结果表,但使用explain查看执行计划,会发现执行顺序并不是这样。

在两表join(注意: 此处只表示内连接, 如果是left join的话,不会过滤null, 会产生数据倾斜)时如果关联字段为null,有些博客上写会产生数据倾斜或笛卡儿积,其实不会,因为在join之前就会对join的表中关联字段进行非空过滤。

> explain select
> *
> from
> deal_tb
> join
> order_tb
> on daystr=day
> where
> month="2019-02";
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage             # 先执行Stage-4
  Stage-3 depends on stages: Stage-4  # Stage-3依赖于Stage-4
  Stage-0 depends on stages: Stage-3  # Stage-0依赖于Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:order_tb   #先执行join后的表
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:order_tb 
          TableScan  #扫描表
            alias: order_tb
            Statistics: Num rows: 14 Data size: 280 Basic stats: COMPLETE Column stats: NONE
            Filter Operator  #执行where条件进行过滤
              predicate: ((UDFToString(month) = '2019-01') and day is not null) (type: boolean)
              Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
              Select Operator  #在执行select语句
                expressions: day (type: varchar(10)), money (type: int)
                outputColumnNames: _col1, _col2
                Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col1 (type: varchar(10))
                    1 _col1 (type: varchar(10))

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan  #扫描表
            alias: deal_tb
            Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
            Filter Operator  #执行where语句进行过滤
              predicate: daystr is not null (type: boolean)  #对关联字段进行非空过滤
              Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
              Select Operator  #执行select语句
                expressions: id (type: int), daystr (type: varchar(10)), amount (type: decimal(10,2))
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:  #执行join
                       Inner Join 0 to 1
                  keys:
                    0 _col1 (type: varchar(10))
                    1 _col1 (type: varchar(10))
                  outputColumnNames: _col0, _col1, _col2, _col4, _col5
                  Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: int), _col1 (type: varchar(10)), _col2 (type: decimal(10,2)), '2019-01' (type: varchar(7)), _col4 (type: varcha
r(10)), _col5 (type: int)                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 1.019 seconds, Fetched: 71 row(s)

在explain后的执行计划中的29行和47行可以看出,先扫描join后的表,对此表进行where语句的条件过滤和关联字段非空过滤,再扫描from后的表进行where语句中条件的过滤和关联字段的非空过滤。

当where语句中的条件为关联字段的过滤,不管你过滤的字段名用A表还是B表,在join之前都会对两表的关联字段进行过滤,不信可以查看SQL语句的执行计划:

> explain select
> *
> from
> deal_tb
> join
> order_tb
> on daystr=day
> where
> daystr="2019-01-20";
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        order_tb 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        order_tb 
          TableScan
            alias: order_tb
            Statistics: Num rows: 14 Data size: 280 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (day = '2019-01-20') (type: boolean)
              Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 daystr (type: varchar(10))
                  1 day (type: varchar(10))

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: deal_tb
            Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (daystr = '2019-01-20') (type: boolean)
              Statistics: Num rows: 19 Data size: 361 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 daystr (type: varchar(10))
                  1 day (type: varchar(10))
                outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: varchar(10)), _col2 (type: decimal(10,2)), _col6 (type: varchar(7)), _col7 (type: varchar(10))
, _col8 (type: int)                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                  Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 2.237 seconds, Fetched: 63 row(s)

在执行计划中30行和44行出现这两句话,说明对两表都进行了关联字段where语句条件的过滤
predicate: (day = ‘2019-01-20’) (type: boolean)
predicate: (daystr = ‘2019-01-20’) (type: boolean)

对于hive SQL执行的顺序如果有模糊的,建议使用explain查看执行计划,这样能看到正确的执行顺序。

最后

以上就是忐忑果汁为你收集整理的hive中join on和where一起使用时的执行顺序及关联字段为null的情况的全部内容,希望文章能够帮你解决hive中join on和where一起使用时的执行顺序及关联字段为null的情况所遇到的程序开发问题。

如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。

本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
点赞(76)

评论列表共有 0 条评论

立即
投稿
返回
顶部