概述
窗口函数
物理窗口
真实往上下移动多少行rows between
CURRENT ROW | UNBOUNDED PRECEDING | [num] PRECEDING AND UNBOUNDED FOLLOWING | [num] FOLLOWING| CURRENT ROW
如: over(partition by col order by rows between 1 preceding and 1 fllowing)
逻辑窗口
满足条件上下多少行
range between [num] PRECEDING AND [num] FOLLOWING
如: over(partition by col order by range between 5 preceding and 5 fllowing)
函数
sum(col) over() : 分组对col累计求和,over() 中的语法如下
count(col) over() : 分组对col累计,over() 中的语法如下
min(col) over() : 分组对col求最小
max(col) over() : 分组求col的最大值
avg(col) over() : 分组求col列的平均值
first_value(col) over() : 某分区排序后的第一个col值
last_value(col) over() : 某分区排序后的最后一个col值
lag(col,n,DEFAULT) : 统计往前n行的col值,n可选,默认为1,DEFAULT当往上第n行为NULL时候,
取默认值,如不指定,则为NULL
lead(col,n,DEFAULT) : 统计往后n行的col值,n可选,默认为1,DEFAULT当往下第n行为NULL时候,
取默认值,如不指定,则为NULL
ntile(n) : 用于将分组数据按照顺序切分成n片,返回当前切片值。注意:n必须为int类型。
排名函数:
row_number() over() : 排名函数,不会重复,适合于生成主键或者不并列排名
rank() over() : 排名函数,有并列名次,名次不连续。如:1,1,3
dense_rank() over() : 排名函数,有并列名次,名次连续。如:1,1,2
练习
1、
编写hql实现每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
数据:
userid,month,visits
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
B 2015-02 10
B 2015-02 5
A 2015-03 16
A 2015-03 22
B 2015-03 23
B 2015-03 10
B 2015-03 1
SELECT userid
,month
,MAX(vis) OVER(distribute by userid sort by month)
,SUM(vis) OVER(distribute by userid sort by month)
,visits
FROM
(
SELECT userid
,month
,SUM(visits) vis
FROM visits
GROUP BY userid
,month
) t1
;
2、
求出每个栏目的被观看次数及累计观看时长
数据:
Uid channl min
1 1 23
2 1 12
3 1 12
4 1 32
5 1 342
6 2 13
7 2 34
8 2 13
9 2 134
SELECT channl
,COUNT(*)
,SUM(min)
FROM vedio
GROUP BY channl
;
3、
编写连续7天登录的总人数
数据:
Uid dt login_status(1登录成功,0异常)
1 2019-07-11 1
1 2019-07-12 1
1 2019-07-13 1
1 2019-07-14 1
1 2019-07-15 1
1 2019-07-16 1
1 2019-07-17 1
1 2019-07-18 1
2 2019-07-11 1
2 2019-07-12 1
2 2019-07-13 0
2 2019-07-14 1
2 2019-07-15 1
2 2019-07-16 0
2 2019-07-17 1
2 2019-07-18 0
3 2019-07-11 1
3 2019-07-12 1
3 2019-07-13 1
3 2019-07-14 1
3 2019-07-15 1
3 2019-07-16 1
3 2019-07-17 1
3 2019-07-18 1
SELECT COUNT(distinct uid)
FROM
(
SELECT uid
,dt
FROM
(
SELECT t1.uid uid
,date_sub(t1.dt,t1.rm) dt
FROM
(
SELECT uid
,dt
,row_number() over(distribute by uid sort by dt) rm
FROM login
WHERE login_status=1
) t1
)t2
GROUP BY uid
,dt
HAVING COUNT(uid) >7
) t
;
4、
编写hql语句实现每班前三名,分数一样并列,同时求出前三名按名次排序的一次的分差
数据:
Stu_no class score
1 1901 90
2 1901 90
3 1901 83
4 1901 60
5 1902 66
6 1902 23
7 1902 99
8 1902 67
9 1902 87
SELECT class
,stu_no
,score
,dr
,score-nvl(lag(score) over(distribute by class sort by dr),0)
FROM
(
SELECT
class,
stu_no,
score,
dense_rank() over(distribute by class sort by score desc) dr
from stu
) t1
WHERE t1.dr<4
;
5、
编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额
数据:
店铺,月份,金额
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
SELECT userid
,month
,vis
,SUM(vis) over(distribute by userid sort by month)
FROM
(
SELECT userid
,month
,SUM(visits) vis
FROM visits
GROUP BY userid
,month
) t1
;
6、
分析用户行为习惯,找到每一个用户在表中的第一次行为
数据:
uid time action
1 1 read
3 2 comment
1 3 share
2 4 like
1 5 write
2 6 like
3 7 write
2 8 read
SELECT distinct u_id
,first_value(action) over(distribute by u_id sort by time) val
FROM user_action_log
;
7、
订单及订单类型行列互换
数据:
t1表: order_id order_type order_time
111 N 10:00
111 A 10:05
111 B 10:10
是用hql获取结果如下:
order_id order_type_1 order_type_2 order_time_1 order_time_2
111 N A 10:00 10:05
111 A B 10:05 10:10
SELECT *
FROM
(
SELECT order_id
,order_type order_type1
,lead(order_type,1) over(sort by order_time) order_type_2
,order_time order_time_1
,lead(order_time,1) over(sort by order_time) order_time_2
FROM t1
) t
WHERE order_type_2 is not null
;
8、
数据:
某APP每天访问数据存放在表access_log里面,包含日期字段 ds,用户类型字段user_type,用户账号user_id,用户访问时间 log_time,请使用hive的hql语句实现如下需求:
(1)、每天整体的访问UV、PV?
(2)、每天每个类型的访问UV、PV?
(3)、每天每个类型中早访问时间和晚访问时间?
(4)、每天每个类型中访问次数高的10个用户?
--(1)
SELECT ds
,COUNT(*) pv
,COUNT(distinct user_id) uv
FROM access_log
GROUP BY ds
;
--(2)
SELECT ds
,user_type
,COUNT(*) pv
,COUNT(distinct user_id) uv
FROM access_log
GROUP BY ds
,user_type
;
--(3)
SELECT ds
,user_type
,first_value(log_time) over(distribute by ds,user_type sort by log_time)
,last_value(log_time) over(distribute by ds,user_type sort by log_time)
FROM access_log
;
--(4)
SELECT *
from
(
SELECT ds
,user_type
,user_id,
,row_number() over(distribute by ds,user_type sort by cnt desc) rn
FROM
(
SELECT ds
,user_type
,user_id
,COUNT(*) cnt
FROM access_log
GROUP BY ds,user_type,user_id;
) t ) tt
WHERE rn < 11
;
9、
每个用户连续登陆的最大天数
数据:
uid,date
1,2019-08-01
1,2019-08-02
1,2019-08-03
2,2019-08-01
2,2019-08-02
3,2019-08-01
3,2019-08-03
4,2019-07-28
4,2019-07-29
4,2019-08-01
4,2019-08-02
4,2019-08-03
SELECT uid
,MAX(cnt)
FROM
(
SELECT uid
,COUNT(*) cnt
FROM
(
SELECT uid
,date_sub(date,dense_rank() over(distribue by uid sort by date)) ds
FROM login
) t
GROUP BY uid
,ds
) tt
GROUP BY uid
;
10、
使用hive的hql实现男女各自第一名及其它
1、男女各自语文第一名(0:男,1:女)
2、男生成绩语文大于80,女生数学成绩大于70
数据:
id sex chinese_s math_s
0 0 70 50
1 0 90 70
2 1 80 90
--(1)
SELECT *
FROM
(
SELECT id
,sex
,chinese_s
,row_number() over(distribue by sex sort by chinese_s desc) rn
FROM t1
) t
WHERE rn = 1
;
--(2)
SELECT id
,sex
,chinese_s
,math_s
FROM score_s
WHERE sex=0
AND chinese_s>80 union
SELECT id
,sex
,chinese_s
,math_s
FROM score_s
WHERE sex=1
AND math_s>70
;
11、
使用hive的hql实现最大连续访问天数
数据:
log_time uid
2019-09-21 12:34:11 123
2019-09-22 12:34:11 123
2019-09-22 12:34:11 456
2019-09-24 12:34:11 123
2019-09-24 12:34:11 456
2019-09-25 12:34:11 123
2019-09-26 12:34:11 123
SELECT uid
,MAX(cnt)
FROM
(
SELECT uid
,COUNT(ds) cnt
FROM
(
SELECT uid
,data_sub(time,dense_rank() over(distribute by uid sort by time)) ds
FROM
(
SELECT uid
,to_date(log_time) time
FROM tableA
GROUP BY uid
,to_date(log_time)
)
) t1
GROUP BY uid
,ds
)
GROUP BY uid
;
最后
以上就是瘦瘦冰淇淋为你收集整理的hive hql 窗口函数 练习的全部内容,希望文章能够帮你解决hive hql 窗口函数 练习所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
发表评论 取消回复