窗口函数
窗口函数一般就是说over()函数,其窗口是由一个OVER字句定义的多行记录。
窗口函数有两种形式:
over(distribute by 分区字段 sort by 排序字段)
distribute by 是按照多个reduce去处理数据的,对应的排序是局部排序sort by
over(partition by 分区字段 order by 排序字段)
partition by 是按照一个reduce去处理数据的,对应的排序是全局排序order by
开窗大小设置:
窗口大小的设置使用rows between
语句,也叫window字句。
-- 由起点到当前行
over(partition by city order by year rows between UNBOUNDED PRECEDING and current row)
-- current row 指当前行
-- UNBOUNDED PRECEDING 指第一行
-- and前面和后面的关系就是范围,从and后面的到and前面
-- 由终点到当前行
over(partition by city order by year rows between UNBOUNDED FOLLOWING and current row)
-- UNBOUNDED FOLLOWING 指最后一行
-- current row 指当前行
-- 当前行和前面一行
over(partition by city order by year rows between 1 PRECEDING and current row)
-- current row 指当前行
-- 1 PRECEDING 指前面一行
-- 当前行和前边一行及后面一行
over(partition by city order by year rows between 1 FOLLOWING and current row)
-- current row 指当前行
-- 1 FOLLOWING 指后面一行
-- 当前行和前边一行及后面一行
over(partition by city order by year rows between 1 PRECEDING and 1 FOLLOWING)
-- 1 FOLLOWING 指后面一行
-- 1 PRECEDING 指前面一行
分析函数
分析函数是对数据进行处理、分析的函数,是对开窗函数获取的窗口数据进行操作的函数。
-
sum()
对窗口范围内聚合
-
avg()
对窗口范围内求平均数
-
max()
求窗口范围内最大值
-
min()
求窗口范围内最小值
-
row_number()
row_number排序为标准顺序排序,排序后序号按照行号依次递增
id number a 1 b 2 b 3 b 4 c 5 c 6
-
dense_rank()
dense_rank排序中大小一致的元素序号一样,然后按照元素降序依次降序排序
id number a 1 b 2 b 2 b 2 c 3 c 3
-
rank()
rank排序大小一致的元素序号一样,但是会按照行号依次降序排序
id number a 1 b 2 b 2 b 2 c 5 c 5
-
ntile(n)
用于将分组数据按照顺序切分成n片,返回切片值。
SELECT cookieid, createtime, pv, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, --分组内将数据分成2片 NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, --分组内将数据分成3片 NTILE(4) OVER(ORDER BY createtime) AS rn3 --将所有数据分成4片 FROM test ORDER BY cookieid,createtime; cookieid day pv rn1 rn2 rn3 cookie1 2021-04-10 1 1 1 1 cookie1 2021-04-11 5 1 1 1 cookie1 2021-04-12 7 1 1 2 cookie1 2021-04-13 3 1 2 2 cookie1 2021-04-14 2 2 2 3 cookie1 2021-04-15 4 2 3 3 cookie1 2021-04-16 4 2 3 4
-
cume_dist()
小于或等于当前值的行数/当前分组内总行数
SELECT dept, userid, sal, CUME_DIST() OVER(ORDER BY sal) AS rn1, CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2 FROM test; dept userid sal rn1 rn2 d1 user1 1000 0.2 0.3333333333333333 d1 user2 2000 0.4 0.6666666666666666 d1 user3 3000 0.6 1.0 d2 user4 4000 0.8 0.5 d2 user5 5000 1.0 1.0 rn1: 没有partition,所有数据均为1组,总行数为5, 第一行:小于等于1000的行数为1,因此,1/5=0.2 第三行:小于等于3000的行数为3,因此,3/5=0.6 rn2: 按照部门分组,dpet=d1的行数为3, 第二行:小于等于2000的行数为2,因此,2/3=0.6666666666666666
-
percent_rank()
分组内当前行的rank值-1/当前分组内总行数
SELECT dept, userid, sal, PERCENT_RANK() OVER(ORDER BY sal) AS rn1, --分组内 RANK() OVER(ORDER BY sal) AS rn11, --分组内RANK值 PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2 FROM test; dept userid sal rn1 rn11 rn2 d1 user1 1000 0.0 1 0.0 d1 user2 2000 0.25 2 0.5 d1 user3 3000 0.5 3 1.0 d2 user4 4000 0.75 4 0.0 d2 user5 5000 1.0 5 1.0 rn1: rn1 = (rn11-1) / (rn12-1) 第一行,(1-1)/(5-1)=0/4=0 第二行,(2-1)/(5-1)=1/4=0.25 第四行,(4-1)/(5-1)=3/4=0.75 rn2: 按照dept分组, dept=d1的总行数为3 第一行,(1-1)/(3-1)=0 第三行,(3-1)/(3-1)=1
-
lag()
lag(col,n,default)用于统计窗口内往上第n行值,第一个值是列名,第二个值为向上第n行,第三个值是设置默认值(当往上第n行为null时,取默认值,如不指定,则为null)
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time, LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time FROM test; cookieid createtime url rn last_1_time last_2_time cookie1 2021-04-10 10:00:00 url1 1 1970-01-01 00:00:00 NULL cookie1 2021-04-10 10:00:02 url2 2 2021-04-10 10:00:00 NULL cookie1 2021-04-10 10:03:04 1url3 3 2021-04-10 10:00:02 2021-04-10 10:00:00 cookie1 2021-04-10 10:10:00 url4 4 2021-04-10 10:03:04 2021-04-10 10:00:02 cookie1 2021-04-10 10:50:01 url5 5 2021-04-10 10:10:00 2021-04-10 10:03:04 cookie1 2021-04-10 10:50:05 url6 6 2021-04-10 10:50:01 2021-04-10 10:10:00 cookie1 2021-04-10 11:00:00 url7 7 2021-04-10 10:50:05 2021-04-10 10:50:01 cookie2 2021-04-10 10:00:00 url11 1 1970-01-01 00:00:00 NULL cookie2 2021-04-10 10:00:02 url22 2 2021-04-10 10:00:00 NULL cookie2 2021-04-10 10:03:04 1url33 3 2021-04-10 10:00:02 2021-04-10 10:00:00 cookie2 2021-04-10 10:10:00 url44 4 2021-04-10 10:03:04 2021-04-10 10:00:02 cookie2 2021-04-10 10:50:01 url55 5 2021-04-10 10:10:00 2021-04-10 10:03:04 cookie2 2021-04-10 10:50:05 url66 6 2021-04-10 10:50:01 2021-04-10 10:10:00 cookie2 2021-04-10 11:00:00 url77 7 2021-04-10 10:50:05 2021-04-10 10:50:01 last_1_time: 指定了往上第1行的值,default为'1970-01-01 00:00:00' cookie1第一行,往上1行为NULL,因此取默认值 1970-01-01 00:00:00 cookie1第三行,往上1行值为第二行值,2021-04-10 10:00:02 cookie1第六行,往上1行值为第五行值,2021-04-10 10:50:01 last_2_time: 指定了往上第2行的值,为指定默认值 cookie1第一行,往上2行为NULL cookie1第二行,往上2行为NULL cookie1第四行,往上2行为第二行值,2021-04-10 10:00:02 cookie1第七行,往上2行为第五行值,2021-04-10 10:50:01
-
lead()
与lag相反,用于统计窗口内往下第n行值。
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time, LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time FROM test; cookieid createtime url rn next_1_time next_2_time cookie1 2021-04-10 10:00:00 url1 1 2021-04-10 10:00:02 2021-04-10 10:03:04 cookie1 2021-04-10 10:00:02 url2 2 2021-04-10 10:03:04 2021-04-10 10:10:00 cookie1 2021-04-10 10:03:04 1url3 3 2021-04-10 10:10:00 2021-04-10 10:50:01 cookie1 2021-04-10 10:10:00 url4 4 2021-04-10 10:50:01 2021-04-10 10:50:05 cookie1 2021-04-10 10:50:01 url5 5 2021-04-10 10:50:05 2021-04-10 11:00:00 cookie1 2021-04-10 10:50:05 url6 6 2021-04-10 11:00:00 NULL cookie1 2021-04-10 11:00:00 url7 7 1970-01-01 00:00:00 NULL cookie2 2021-04-10 10:00:00 url11 1 2021-04-10 10:00:02 2021-04-10 10:03:04 cookie2 2021-04-10 10:00:02 url22 2 2021-04-10 10:03:04 2021-04-10 10:10:00 cookie2 2021-04-10 10:03:04 1url33 3 2021-04-10 10:10:00 2021-04-10 10:50:01 cookie2 2021-04-10 10:10:00 url44 4 2021-04-10 10:50:01 2021-04-10 10:50:05 cookie2 2021-04-10 10:50:01 url55 5 2021-04-10 10:50:05 2021-04-10 11:00:00 cookie2 2021-04-10 10:50:05 url66 6 2021-04-10 11:00:00 NULL cookie2 2021-04-10 11:00:00 url77 7 1970-01-01 00:00:00 NULL
-
first_value()
取分组内排序后,截止到当前行,第一个值
SELECT cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 FROM test; cookieid createtime url rn first1 cookie1 2021-04-10 10:00:00 url1 1 url1 cookie1 2021-04-10 10:00:02 url2 2 url1 cookie1 2021-04-10 10:03:04 1url3 3 url1 cookie1 2021-04-10 10:10:00 url4 4 url1 cookie1 2021-04-10 10:50:01 url5 5 url1 cookie1 2021-04-10 10:50:05 url6 6 url1 cookie1 2021-04-10 11:00:00 url7 7 url1 cookie2 2021-04-10 10:00:00 url11 1 url11 cookie2 2021-04-10 10:00:02 url22 2 url11 cookie2 2021-04-10 10:03:04 1url33 3 url11 cookie2 2021-04-10 10:10:00 url44 4 url11 cookie2 2021-04-10 10:50:01 url55 5 url11 cookie2 2021-04-10 10:50:05 url66 6 url11 cookie2 2021-04-10 11:00:00 url77 7 url11
-
last_value()
取分组内排序后,截止到当前行,最后一个值,使用这个分析函数需要注意order by 子句的排序方式
SELECT cookieid, createtime, url, LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2 FROM test; cookieid createtime url last2 cookie1 2021-04-10 10:00:02 url2 url5 cookie1 2021-04-10 10:00:00 url1 url5 cookie1 2021-04-10 10:03:04 1url3 url5 cookie1 2021-04-10 10:50:05 url6 url5 cookie1 2021-04-10 11:00:00 url7 url5 cookie1 2021-04-10 10:10:00 url4 url5 cookie1 2021-04-10 10:50:01 url5 url5 cookie2 2021-04-10 10:00:02 url22 url55 cookie2 2021-04-10 10:00:00 url11 url55 cookie2 2021-04-10 10:03:04 1url33 url55 cookie2 2021-04-10 10:50:05 url66 url55 cookie2 2021-04-10 11:00:00 url77 url55 cookie2 2021-04-10 10:10:00 url44 url55 cookie2 2021-04-10 10:50:01 url55 url55
-
grouping sets()
group by可以进行单维度分析,但是如果要进行多维度分析的话,可以使用grouping sets()子句。
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test GROUP BY month,day GROUPING SETS (month,day) ORDER BY GROUPING__ID; month day uv GROUPING__ID 2021-03 NULL 5 1 2021-04 NULL 7 1 NULL 2021-03-10 4 2 NULL 2021-03-12 1 2 NULL 2021-04-12 2 2 NULL 2021-04-13 3 2 NULL 2021-04-15 2 2 NULL 2021-04-16 2 2 等价于 SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test GROUP BY day 同时子句中可以声明多维度 SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test GROUP BY month,day GROUPING SETS (month,day,(month,day)) ORDER BY GROUPING__ID; month day uv GROUPING__ID 2021-03 NULL 5 1 2021-04 NULL 7 1 NULL 2021-03-10 4 2 NULL 2021-03-12 1 2 NULL 2021-04-12 2 2 NULL 2021-04-13 3 2 NULL 2021-04-15 2 2 NULL 2021-04-16 2 2 2021-03 2021-03-10 4 3 2021-03 2021-03-12 1 3 2021-04 2021-04-12 2 3 2021-04 2021-04-13 3 3 2021-04 2021-04-15 2 3 2021-04 2021-04-16 2 3 其中的 GROUPING__ID,表示结果属于哪一个分组集合
-
cube()
根据GROUP BY的维度的所有组合进行聚合
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test GROUP BY month,day WITH CUBE ORDER BY GROUPING__ID; month day uv GROUPING__ID NULL NULL 7 0 2021-03 NULL 5 1 2021-04 NULL 7 1 NULL 2021-04-12 2 2 NULL 2021-04-13 3 2 NULL 2021-04-15 2 2 NULL 2021-04-16 2 2 NULL 2021-03-10 4 2 NULL 2021-03-12 1 2 2021-03 2021-03-10 4 3 2021-03 2021-03-12 1 3 2021-04 2021-04-16 2 3 2021-04 2021-04-12 2 3 2021-04 2021-04-13 3 3 2021-04 2021-04-15 2 3
-
rollup()
cube的子集,以最左侧的维度为止进行层级聚合。
比如,以month维度进行层级聚合: SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test GROUP BY month,day WITH ROLLUP ORDER BY GROUPING__ID; month day uv GROUPING__ID NULL NULL 7 0 2021-03 NULL 5 1 2021-04 NULL 7 1 2021-03 2021-03-10 4 3 2021-03 2021-03-12 1 3 2021-04 2021-04-12 2 3 2021-04 2021-04-13 3 3 2021-04 2021-04-15 2 3 2021-04 2021-04-16 2 3