1. Window Function Overview
Window: the range of rows (the data set) the function runs over.
Function: the computation applied to that window.
(1) Official reference documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
(2) OVER with aggregate functions (official description)
The OVER clause
OVER with standard aggregates:
COUNT
SUM
MIN
MAX
AVG
OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
OVER with a window specification. Windows can be defined separately in a WINDOW clause. Window specifications support the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
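In other words, the two statements in each pair below are equivalent. A minimal sketch, assuming a hypothetical table t(k string, v int):
-- ORDER BY present, frame omitted: defaults to RANGE ... CURRENT ROW
select k, v, sum(v) over (partition by k order by v) from t;
select k, v, sum(v) over (partition by k order by v range between unbounded preceding and current row) from t;
-- ORDER BY and frame both omitted: the frame is the whole partition
select k, v, sum(v) over (partition by k) from t;
select k, v, sum(v) over (partition by k rows between unbounded preceding and unbounded following) from t;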
The above can be summarized into one general pattern:
select aggregate_function(col) OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ... AND ...)
(note that the ROWS/RANGE frame goes inside the OVER parentheses). See the cases below for concrete usage.
(3) OVER with analytics functions (official description)
Analytics functions
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE
(4)Windowing functions
LEAD
The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
Returns null when the lead for the current row extends beyond the end of the window.
LAG
The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
Returns null when the lag for the current row extends before the beginning of the window.
FIRST_VALUE
This takes at most two parameters. The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.
LAST_VALUE
This takes at most two parameters. The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.
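For instance, the optional skip-nulls flag of FIRST_VALUE can be exercised like this. A minimal sketch, assuming a hypothetical table clicks(uid string, ts string, ref string) where ref may be NULL:
select uid, ts, ref,
first_value(ref) over (partition by uid order by ts) as first_ref, -- may be NULL
first_value(ref, true) over (partition by uid order by ts) as first_nonnull_ref -- skips NULL values
from clicks;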
2. Aggregate Functions (SUM, etc.) + OVER: Running Totals
WINDOW clause keywords:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the partition boundary;
UNBOUNDED PRECEDING means starting from the first row of the partition,
UNBOUNDED FOLLOWING means ending at the last row of the partition
2.1 Example
-- Data
pentaKilldata,2019-04-10,1
pentaKilldata,2019-04-11,5
pentaKilldata,2019-04-12,7
pentaKilldata,2019-04-13,3
pentaKilldata,2019-04-14,2
pentaKilldata,2019-04-15,4
pentaKilldata,2019-04-16,4
-- Create the table and load the data
drop table if exists pentaKilldata_window;
create table pentaKilldata_window(domain string,time string,traffic int) row format delimited fields terminated by ',';
load data local inpath '/opt/data/window.txt' overwrite into table pentaKilldata_window;
-- Query
select
domain, time, traffic,
sum(traffic) OVER (partition by domain order by time) pv1,
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) pv2,
sum(traffic) OVER (partition by domain) pv3,
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) pv4,
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) pv5,
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) pv6
from pentaKilldata_window
order by time;
-- Output
domain time traffic pv1 pv2 pv3 pv4 pv5 pv6
pentaKilldata 2019-04-10 1 1 1 26 1 6 26
pentaKilldata 2019-04-11 5 6 6 26 6 13 25
pentaKilldata 2019-04-12 7 13 13 26 13 16 20
pentaKilldata 2019-04-13 3 16 16 26 16 18 13
pentaKilldata 2019-04-14 2 18 18 26 17 21 10
pentaKilldata 2019-04-15 4 22 22 26 16 20 8
pentaKilldata 2019-04-16 4 26 26 26 13 13 4
(1) pv1 explained
sum(traffic) OVER (partition by domain order by time) pv1
Partition by domain, order by time ascending, then compute a running SUM.
traffic pv1 calculation
1 1 1
5 6 1+5
7 13 1+5+7
3 16 1+5+7+3
2 18 1+5+7+3+2
4 22 ...
4 26 ...
Notably, the same running total can also be produced with a self-join instead of a window function:
-- assumes a source table access(domain string, day string, pv bigint)
with t as (
select domain, date_format(day, 'yyyy-MM') as month, sum(pv) as pv
from access
group by domain, date_format(day, 'yyyy-MM')
)
select
b.domain, b.month, b.pv,
max(a.pv) as max_pv, -- running max of monthly pv up to b.month
sum(a.pv) as sum_pv  -- running total of monthly pv up to b.month
from t a join t b
on a.domain = b.domain
where a.month <= b.month
group by b.domain, b.month, b.pv;
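For comparison, a window function produces the same running max and running total in a single pass. A sketch against the same assumed access table:
with t as (
select domain, date_format(day, 'yyyy-MM') as month, sum(pv) as pv
from access
group by domain, date_format(day, 'yyyy-MM')
)
select
domain, month, pv,
max(pv) over (partition by domain order by month) as max_pv,
sum(pv) over (partition by domain order by month) as sum_pv
from t;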
(2) pv2 explained
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) pv2,
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means from the first row of the partition through the current row, so pv2 matches pv1. (Strictly speaking, pv1's implicit frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; the two agree here because the time values are unique.)
traffic pv2
1 1
5 6
7 13
3 16
2 18
4 22
4 26
(3) pv3 explained -> this form is handy for computing percentages
sum(traffic) OVER (partition by domain) pv3
Partitioning by domain alone (no ORDER BY) sums the entire partition, so every row of the same domain gets 1+5+7+3+2+4+4 = 26.
traffic pv3
1 26
5 26
7 26
3 26
2 26
4 26
4 26
# For a worked example of group-wise percentages in Hive, see:
https://blog.csdn.net/oyy_90/article/details/89843016
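A sketch of that percentage calculation, using the pv3-style frame on the table from this section (division in Hive returns a double, so no cast is needed):
select domain, time, traffic,
round(traffic / sum(traffic) over (partition by domain), 4) as traffic_ratio
from pentaKilldata_window;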
(4) pv4 explained
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) pv4,
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW means from 3 rows before the current row through the current row.
traffic pv4 calculation
1 1 1
5 6 5+1
7 13 1+5+7
3 16 1+5+7+3
2 17 5+7+3+2 the row with traffic 2 is the current row; the 3 preceding rows have traffic 5, 7, 3
4 16 7+3+2+4
4 13 3+2+4+4
(5) pv5 explained
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) pv5,
ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING means from 3 rows before the current row through 1 row after it.
Recall that the window is simply the range of rows in scope; in short, it frames the data.
traffic pv5 calculation
1 6
5 13
7 16
3 18
2 21 5+7+3+2+4 -> the row with traffic 2 is the current row; the 3 preceding rows have traffic 5, 7, 3 and the 1 following row has traffic 4
4 20
4 13
(6) pv6 explained
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) pv6
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING means from the current row through the end of the partition.
traffic pv6 calculation
1 26 1+5+7+3+2+4+4
5 25 5+7+3+2+4+4
7 20 7+3+2+4+4
3 13 3+2+4+4
2 10 2+4+4
4 8 4+4
4 4 4
2.2 When to use it
Most commonly for running totals.
3. NTILE + OVER
3.1 Example
NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row; when the rows do not divide evenly, the first bucket gets the extra rows by default. NTILE does not support ROWS BETWEEN.
-- Source data
pentaKilldata,2019-04-10,1
pentaKilldata,2019-04-11,5
pentaKilldata,2019-04-12,7
pentaKilldata,2019-04-13,3
pentaKilldata,2019-04-14,2
pentaKilldata,2019-04-15,4
pentaKilldata,2019-04-16,4
(1) Test 1
-- Query 1
select
domain, time, traffic,
NTILE(2) OVER (partition by domain order by time) rn1,
NTILE(3) OVER (partition by domain order by time) rn2,
NTILE(4) OVER (order by time) rn3
from pentaKilldata_window
order by domain,time;
-- Result 1
domain time traffic rn1 rn2 rn3
pentaKilldata 2019-04-10 1 1 1 1
pentaKilldata 2019-04-11 5 1 1 2
pentaKilldata 2019-04-12 7 1 1 2
pentaKilldata 2019-04-13 3 1 2 3
pentaKilldata 2019-04-14 2 2 2 3
pentaKilldata 2019-04-15 4 2 3 4
pentaKilldata 2019-04-16 4 2 3 4
NTILE(2): splits the rows into 2 buckets
NTILE(3): splits the rows into 3 buckets; the extra row goes to the first bucket
NTILE(4): splits the rows into 4 buckets; since the rows don't divide evenly, some buckets end up one row short
(2) Test 2
-- Append these rows to the table
yy.com,2019-04-10,2
yy.com,2019-04-11,3
yy.com,2019-04-12,5
yy.com,2019-04-13,6
yy.com,2019-04-14,3
yy.com,2019-04-15,9
yy.com,2019-04-16,7
-- Query 2
select
domain, time, traffic,
NTILE(4) OVER (order by time) rn3
from pentaKilldata_window
order by domain,time;
domain time traffic rn3
pentaKilldata 2019-04-10 1 1
pentaKilldata 2019-04-11 5 1
pentaKilldata 2019-04-12 7 2
pentaKilldata 2019-04-13 3 2
pentaKilldata 2019-04-14 2 3
pentaKilldata 2019-04-15 4 4
pentaKilldata 2019-04-16 4 4
yy.com 2019-04-10 2 1
yy.com 2019-04-11 3 1
yy.com 2019-04-12 5 2
yy.com 2019-04-13 6 2
yy.com 2019-04-14 3 3
yy.com 2019-04-15 9 3
yy.com 2019-04-16 7 4
NTILE(4): with no PARTITION BY, all 14 rows are ordered by time and split into 4 buckets; 14 does not divide evenly, so bucket sizes differ (4, 4, 3, 3 here), and rows sharing the same time value may land in either of two adjacent buckets.
3.2 When to use it
It can help when a table has few columns but very many rows and you need to slice them into even portions; see the quantile sketch below. In practice it is not used often.
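One concrete use is keeping only the top quarter of rows by traffic. A minimal sketch:
select domain, time, traffic
from (
select domain, time, traffic,
NTILE(4) OVER (order by traffic desc) as quartile
from pentaKilldata_window
) t
where quartile = 1;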
4. ROW_NUMBER: Per-Group Top-N

| Window function | Behavior on tied sort values |
|---|---|
| ROW_NUMBER | numbering continues consecutively (ties broken by row order) |
| RANK | tied rows share a rank; the overall rank count is unchanged (gaps follow ties) |
| DENSE_RANK | tied rows share a rank; the overall rank count shrinks (no gaps) |

Window functions can also handle the GROUP BY deduplication scenario where you need columns outside the grouping key.
That said, ROW_NUMBER can be very slow in some cases; the GROUP BY approach below also retrieves the non-grouped columns:
SELECT
url,
max_token[2] AS post_data,
max_token[3] AS res_raw_data,
max_token[4] AS req_raw_data,
'' AS d,
max_token[5] AS request_datetime,
'' AS f
FROM
(
-- for each url, pack the other fields into an array led by the sort key,
-- then take max(): arrays compare element-wise, so the row with the
-- largest content_length wins
SELECT max(
ARRAY(
cast(content_length AS BIGINT) + 1000000 -- 0
,content_length -- 1
,post_data -- 2
,res_raw_data -- 3
,req_raw_data -- 4
,request_datetime -- 5
)
) AS max_token,
url
FROM XXXX
WHERE ds = '20210101'
GROUP BY url
) tmp;
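For reference, the ROW_NUMBER version of the same deduplication; usually clearer, but as noted it can be slower on large or skewed data. A sketch:
SELECT url, post_data, res_raw_data, req_raw_data, request_datetime
FROM
(
SELECT url, post_data, res_raw_data, req_raw_data, request_datetime,
ROW_NUMBER() OVER (PARTITION BY url ORDER BY content_length DESC) AS rn
FROM XXXX
WHERE ds = '20210101'
) t
WHERE rn = 1;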
row_number(): starting from 1, assigns sequential numbers to the rows of each partition; the values never repeat, and when sort keys tie, rows are numbered in the order they appear in the table. Typically used to fetch the top-ranked record per group, e.g. the first referer of a session.
rank(): ranks rows within the partition; tied rows share a rank and leave a gap in the numbering after them.
dense_rank(): ranks rows within the partition; tied rows share a rank and leave no gap.
4.1 Example (the output below shows only the yy.com partition)
select
domain, time, traffic,
ROW_NUMBER() OVER (partition by domain order by traffic desc) rn1,
RANK() OVER (partition by domain order by traffic desc) rn2,
DENSE_RANK() OVER (partition by domain order by traffic desc) rn3
from pentaKilldata_window;
domain time traffic rn1 rn2 rn3
yy.com 2019-04-15 9 1 1 1
yy.com 2019-04-16 7 2 2 2
yy.com 2019-04-13 6 3 3 3
yy.com 2019-04-12 5 4 4 4
yy.com 2019-04-14 3 5 5 5
yy.com 2019-04-11 3 6 5 5
yy.com 2019-04-10 2 7 7 6
(1) ROW_NUMBER
ROW_NUMBER() OVER (partition by domain order by traffic desc) rn1,
This is the most frequently used form for per-group Top-N.
traffic rn1
9 1
7 2
6 3
5 4
3 5
3 6
2 7
For example: the top 3 traffic values per domain (only the yy.com rows are shown below):
select domain,time,traffic from
(select
domain, time, traffic,
ROW_NUMBER() OVER (partition by domain order by traffic desc) as rank
from pentaKilldata_window
) c where rank<=3;
domain time traffic
yy.com 2019-04-15 9
yy.com 2019-04-16 7
yy.com 2019-04-13 6
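If tied traffic values should all make the cut, swap ROW_NUMBER for RANK (or DENSE_RANK). A sketch:
select domain,time,traffic from
(select
domain, time, traffic,
RANK() OVER (partition by domain order by traffic desc) as rk
from pentaKilldata_window
) c where rk<=3;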
(2) RANK
RANK() OVER (partition by domain order by traffic desc) rn2,
The two rows with traffic 3 tie, so they share the same rn2 value, but the ranking still runs up to 7: the total is unchanged. This variant comes up less often in practice.
traffic rn2
9 1
7 2
6 3
5 4
3 5
3 5
2 7
(3) DENSE_RANK
DENSE_RANK() OVER (partition by domain order by traffic desc) rn3
The two rows with traffic 3 tie and share the same rn3 value, and the maximum rank shrinks to 6: the total goes down.
traffic rn3
9 1
7 2
6 3
5 4
3 5
3 5
2 6
4.2 When to use it
Per-group Top-N queries.
5. CUME_DIST and PERCENT_RANK
dept01,pentakill,10000
dept01,doubleKill,20000
dept01,firstblood,30000
dept02,zhangsan,40000
dept02,lisi,50000
-- `user` is a reserved word in newer Hive versions, so it is quoted with backticks
create table ruozedata_window02(dept string, `user` string, sal int)
row format delimited fields terminated by ',';
load data local inpath '/opt/data/window02.txt' overwrite into table ruozedata_window02;
5.1 CUME_DIST: (number of rows with a value <= the current row's value) / (total rows in the partition)
select
dept, `user`, sal,
round(CUME_DIST() over(order by sal), 2) rn1,
round(CUME_DIST() over(partition by dept order by sal), 2) rn2
from ruozedata_window02;
dept user sal rn1 rn1 calculation rn2 rn2 calculation
dept01 pentakill 10000 0.2 1/5 0.33 1/3
dept01 doubleKill 20000 0.4 2/5 0.67 2/3
dept01 firstblood 30000 0.6 3/5 1.0 3/3
dept02 zhangsan 40000 0.8 4/5 0.5 1/2
dept02 lisi 50000 1.0 5/5 1.0 2/2
Note: the 2 in round(CUME_DIST() over(order by sal), 2) is the number of decimal places to keep;
partition by dept is equivalent to grouping by dept.
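A typical CUME_DIST use: keep the employees whose salary falls in the bottom 30% company-wide. A minimal sketch:
select dept, `user`, sal
from (
select dept, `user`, sal,
CUME_DIST() over (order by sal) as cd
from ruozedata_window02
) t
where cd <= 0.3;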
5.2 PERCENT_RANK: (rank of the current row within the partition - 1) / (total rows in the partition - 1)
select
dept, `user`, sal,
round(PERCENT_RANK() over(order by sal), 2) rn1,
round(PERCENT_RANK() over(partition by dept order by sal), 2) rn2
from ruozedata_window02;
dept user sal rn1 rn1 calculation rn2 rn2 calculation
dept01 pentakill 10000 0.0 (1-1)/(5-1) 0.0 (1-1)/(3-1)
dept01 doubleKill 20000 0.25 (2-1)/(5-1) 0.5 (2-1)/(3-1)
dept01 firstblood 30000 0.5 (3-1)/(5-1) 1.0 (3-1)/(3-1)
dept02 zhangsan 40000 0.75 (4-1)/(5-1) 0.0 (1-1)/(2-1)
dept02 lisi 50000 1.0 (5-1)/(5-1) 1.0 (2-1)/(2-1)
6. LAG and LEAD
cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55
drop table if exists ruozedata_window03;
create table ruozedata_window03(cookieid string,time string,url string)
row format delimited fields terminated by ',';
load data local inpath '/opt/data/window03.txt' overwrite into table ruozedata_window03;
6.1 LEAD (look ahead)
LEAD(col, n, DEFAULT) returns the value of the nth row after the current row within the window.
The first parameter is the column; the second is the offset n (optional, default 1); the third is the default returned when no nth following row exists (if unspecified, NULL).
select cookieid,time,url,
lead(time, 1, '1970-01-01 00:00:00') over(partition by cookieid order by time) pre1,
lead(time, 2) over(partition by cookieid order by time) pre2
from ruozedata_window03;
cookieid time url pre1 pre2
cookie1 2015-04-10 10:00:00 url1 2015-04-10 10:00:02 2015-04-10 10:03:04
cookie1 2015-04-10 10:00:02 url2 2015-04-10 10:03:04 2015-04-10 10:10:00
cookie1 2015-04-10 10:03:04 1url3 2015-04-10 10:10:00 2015-04-10 10:50:01
cookie1 2015-04-10 10:10:00 url4 2015-04-10 10:50:01 2015-04-10 10:50:05
cookie1 2015-04-10 10:50:01 url5 2015-04-10 10:50:05 2015-04-10 11:00:00
cookie1 2015-04-10 10:50:05 url6 2015-04-10 11:00:00 NULL
cookie1 2015-04-10 11:00:00 url7 1970-01-01 00:00:00 NULL
cookie2 2015-04-10 10:00:00 url11 2015-04-10 10:00:02 2015-04-10 10:03:04
cookie2 2015-04-10 10:00:02 url22 2015-04-10 10:03:04 2015-04-10 10:10:00
cookie2 2015-04-10 10:03:04 1url33 2015-04-10 10:10:00 2015-04-10 10:50:01
cookie2 2015-04-10 10:10:00 url44 2015-04-10 10:50:01 2015-04-10 10:50:05
cookie2 2015-04-10 10:50:01 url55 2015-04-10 10:50:05 2015-04-10 11:00:00
cookie2 2015-04-10 10:50:05 url66 2015-04-10 11:00:00 NULL
cookie2 2015-04-10 11:00:00 url77 1970-01-01 00:00:00 NULL
6.2 LAG (look back)
LAG(col, n, DEFAULT) returns the value of the nth row before the current row within the window.
The first parameter is the column; the second is the offset n (optional, default 1); the third is the default returned when no nth preceding row exists (if unspecified, NULL).
select cookieid,time,url,
lag(time, 1, '1970-01-01 00:00:00') over(partition by cookieid order by time) pre1,
lag(time, 2) over(partition by cookieid order by time) pre2
from ruozedata_window03;
cookieid time url pre1 pre2
cookie1 2015-04-10 10:00:00 url1 1970-01-01 00:00:00 NULL
cookie1 2015-04-10 10:00:02 url2 2015-04-10 10:00:00 NULL
cookie1 2015-04-10 10:03:04 1url3 2015-04-10 10:00:02 2015-04-10 10:00:00
cookie1 2015-04-10 10:10:00 url4 2015-04-10 10:03:04 2015-04-10 10:00:02
cookie1 2015-04-10 10:50:01 url5 2015-04-10 10:10:00 2015-04-10 10:03:04
cookie1 2015-04-10 10:50:05 url6 2015-04-10 10:50:01 2015-04-10 10:10:00
cookie1 2015-04-10 11:00:00 url7 2015-04-10 10:50:05 2015-04-10 10:50:01
cookie2 2015-04-10 10:00:00 url11 1970-01-01 00:00:00 NULL
cookie2 2015-04-10 10:00:02 url22 2015-04-10 10:00:00 NULL
cookie2 2015-04-10 10:03:04 1url33 2015-04-10 10:00:02 2015-04-10 10:00:00
cookie2 2015-04-10 10:10:00 url44 2015-04-10 10:03:04 2015-04-10 10:00:02
cookie2 2015-04-10 10:50:01 url55 2015-04-10 10:10:00 2015-04-10 10:03:04
cookie2 2015-04-10 10:50:05 url66 2015-04-10 10:50:01 2015-04-10 10:10:00
cookie2 2015-04-10 11:00:00 url77 2015-04-10 10:50:05 2015-04-10 10:50:01
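A common LAG use case is the gap between consecutive events, e.g. seconds since the previous page view per cookie (the first row of each cookie yields NULL). A sketch against the same table:
select cookieid, time, url,
unix_timestamp(time) - unix_timestamp(lag(time) over(partition by cookieid order by time)) as secs_since_prev
from ruozedata_window03;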
7. FIRST_VALUE and LAST_VALUE
7.1 LAST_VALUE: after ordering within the partition, the last value up to and including the current row
select cookieid,time,url,
LAST_VALUE(url) over(partition by cookieid order by time) rn
from ruozedata_window03;
cookieid time url rn
cookie1 2015-04-10 10:00:00 url1 url1
cookie1 2015-04-10 10:00:02 url2 url2
cookie1 2015-04-10 10:03:04 1url3 1url3
cookie1 2015-04-10 10:10:00 url4 url4
cookie1 2015-04-10 10:50:01 url5 url5
cookie1 2015-04-10 10:50:05 url6 url6
cookie1 2015-04-10 11:00:00 url7 url7
cookie2 2015-04-10 10:00:00 url11 url11
cookie2 2015-04-10 10:00:02 url22 url22
cookie2 2015-04-10 10:03:04 1url33 1url33
cookie2 2015-04-10 10:10:00 url44 url44
cookie2 2015-04-10 10:50:01 url55 url55
cookie2 2015-04-10 10:50:05 url66 url66
cookie2 2015-04-10 11:00:00 url77 url77
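Because the default frame ends at the current row, LAST_VALUE above simply echoes each row's own url. To get the true last value of the whole partition, widen the frame explicitly. A sketch:
select cookieid,time,url,
LAST_VALUE(url) over(partition by cookieid order by time rows between unbounded preceding and unbounded following) as last_url
from ruozedata_window03;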
7.2 FIRST_VALUE: after ordering within the partition, the first value up to the current row
select cookieid,time,url,
FIRST_VALUE(url) over(partition by cookieid order by time) rn
from ruozedata_window03;
cookieid time url rn
cookie1 2015-04-10 10:00:00 url1 url1 -- within the cookie1 partition ordered by time ascending, the first value is url1
cookie1 2015-04-10 10:00:02 url2 url1
cookie1 2015-04-10 10:03:04 1url3 url1
cookie1 2015-04-10 10:10:00 url4 url1
cookie1 2015-04-10 10:50:01 url5 url1
cookie1 2015-04-10 10:50:05 url6 url1
cookie1 2015-04-10 11:00:00 url7 url1
cookie2 2015-04-10 10:00:00 url11 url11 -- within the cookie2 partition ordered by time ascending, the first value is url11
cookie2 2015-04-10 10:00:02 url22 url11
cookie2 2015-04-10 10:03:04 1url33 url11
cookie2 2015-04-10 10:10:00 url44 url11
cookie2 2015-04-10 10:50:01 url55 url11
cookie2 2015-04-10 10:50:05 url66 url11
cookie2 2015-04-10 11:00:00 url77 url11
7.3 When to use them
Given a batch of order data, for example:
What time did you place your first order this month?
What time did you place your last order this month?
FIRST_VALUE and LAST_VALUE answer both, as the sketch below shows.
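A sketch of those two questions, assuming a hypothetical table orders(user_id string, order_time string):
select user_id, month,
FIRST_VALUE(order_time) over(partition by user_id, month order by order_time) as first_order_time,
LAST_VALUE(order_time) over(partition by user_id, month order by order_time rows between unbounded preceding and unbounded following) as last_order_time
from (
select user_id, order_time, date_format(order_time, 'yyyy-MM') as month
from orders
) t;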