我是靠谱客的博主 感性鱼,最近开发中收集的这篇文章主要介绍HIVE SQL 聚合函数与 rows between / range between详解,觉得挺不错的,现在分享给大家,希望可以做个参考。

概述

一、rows between 与 range between 用法

1. 相关关键词解析

unbounded 无边界
preceding 往前
following 往后
unbounded preceding 往前所有行,即初始行
n preceding 往前n行
unbounded following 往后所有行,即末尾行
n following 往后n行
current row 当前行

语法
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

2. rows between ... and ...

rows:指以行号来决定frame的范围,是物理意义上的行。

比如rows between 1 preceding and 1 following代表从当前行往前一行以及往后一行。

3. range between ... and ...

range:指以当前行在开窗函数中的值为根基,然后按照order by进行排序,最后根据range去加减上下界。是逻辑意义上的行。

比如sum(score) over (PARTITION by id order by score ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) 表示按照id分组,按照score升序排序,然后以当前行的score,下界减一,上界加一,作为范围,将这范围里的score进行加总。

讲的比较拗口,下面看个例子就懂了。

二、举例

1. 数据准备

假设有表datadev.t_student,数据如下

idscore
stu_11
stu_12
stu_13
stu_14
stu_15
stu_15

2. 测试 rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a3,
sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a4,
sum(score) over (PARTITION by id order by 1) as a5
from datadev.t_student;

测试结果如下:

分析:

  1. sum(score) over (PARTITION by id) as a1:按照id分组直接加总score,这种大家最熟悉了
  2. sum(score) over (PARTITION by id order by score) as a2:按照score排序,从起始行到当前行进行加总,与a3中的ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW不同的是,当score相同时,算相同排名,会一起加总。类似rank的概念。
  3. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):从起始行到当前行进行加总,与a2不同的是,当score相同时,排名不同,不会加总到当前行。类似row_number的概念。
  4. sum(score) over (PARTITION by id order by score ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):从起始行到末尾行进行加总,与a1相同。
  5. sum(score) over (PARTITION by id order by 1):作用与a2一样,这里order by 1,相当于score相同,因此全部加总。

a1与a2在官网上的解释如下:

  • When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
  • When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

因此,a1与a2等价于

SELECT id, score,
sum(score) over (PARTITION by id) as a1,
sum(score) over (PARTITION by id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as a1,
sum(score) over (PARTITION by id order by score) as a2,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as a2,
from datadev.t_student;

 官网地址如下:

LanguageManual WindowingAndAnalytics - Apache Hive - Apache Software Foundation

3. 测试 range between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as b1,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as b2
from datadev.t_student;

测试结果如下:

 分析:

  1. sum(score) over (PARTITION by id order by score RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 是默认值,可不写。
  2. sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING):按照id分组,score升序排序,并将当前行的score下界减一,上界为全部(可认为是无穷大),作为筛选范围。最后将符合筛选范围的score进行相加。

b2运算过程分析如下:

idscore运算过程运算b2
stu_11[当前行的score值 - 1,∞] ==> 即[0, ∞]1+2+3+4+5+5=2020
stu_12[当前行的score值 - 1,∞] ==> 即[1, ∞]1+2+3+4+5+5=2020
stu_13[当前行的score值 - 1,∞] ==> 即[2, ∞]2+3+4+5+5=1919
stu_14[当前行的score值 - 1,∞] ==> 即[3, ∞]3+4+5+5=1717
stu_15[当前行的score值 - 1,∞] ==> 即[4, ∞]4+5+5=1414
stu_15[当前行的score值 - 1,∞] ==> 即[4, ∞]4+5+5=1414

4. 对比 range between ... and ... 与 rows between ... and ...

SELECT id, score,
sum(score) over (PARTITION by id order by score RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) as a,
sum(score) over (PARTITION by id order by score ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as b
from datadev.t_student;

测试结果如下:

 分析:

idscorerange运算过程range运算a(range)rows运算过程rows运算b(rows)
stu_11[当前行的score值 - 1,当前行的score值 + 1] ==> 即[0, 2]1+2=33将当前行的上一行以及下一行
的score进行相加
1+2=33
stu_12[当前行的score值 - 1,当前行的score值 + 1] ==> 即[1, 3]1+2+3=66将当前行的上一行以及下一行
的score进行相加
1+2+3=66
stu_13[当前行的score值 - 1,当前行的score值 + 1] ==> 即[2, 4]2+3+4=99将当前行的上一行以及下一行
的score进行相加
2+3+4=99
stu_14[当前行的score值 - 1,当前行的score值 + 1] ==> 即[3, 5]3+4+5+5=1717将当前行的上一行以及下一行
的score进行相加
3+4+5=1212
stu_15[当前行的score值 - 1,当前行的score值 + 1] ==> 即[4, 6]4+5+5=1414将当前行的上一行以及下一行
的score进行相加
4+5+5=1414
stu_15[当前行的score值 - 1,当前行的score值 + 1] ==> 即[4, 6]4+5+5=1414将当前行的上一行以及下一行
的score进行相加
5+5=1010

参考文档:Hive 窗口与分析型函数

最后

以上就是感性鱼为你收集整理的HIVE SQL 聚合函数与 rows between / range between详解的全部内容,希望文章能够帮你解决HIVE SQL 聚合函数与 rows between / range between详解所遇到的程序开发问题。

如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。

本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
点赞(74)

评论列表共有 0 条评论

立即
投稿
返回
顶部