
Using Hive Window Functions

1. Introduction to Window Functions

Window: the range of rows the function operates over when it runs
Function: the function being executed

(1) Official reference documentation

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

(2) Official description of OVER with aggregate functions

The OVER clause
OVER with standard aggregates:

COUNT
SUM
MIN
MAX
AVG
OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
OVER with a window specification. Windows can be defined separately in a WINDOW clause. Window specifications support the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.


From the above we can summarize a general pattern:

select agg_function over (partition by ... order by ... ROWS ...)

See the examples below for concrete usage.

(3) Official description of OVER with analytics functions

Analytics functions
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

(4)Windowing functions

LEAD
The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
Returns null when the lead for the current row extends beyond the end of the window.
LAG
The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
Returns null when the lag for the current row extends before the beginning of the window.
FIRST_VALUE
This takes at most two parameters. The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.
LAST_VALUE
This takes at most two parameters. The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.

2. Aggregate functions (SUM, etc.) + OVER: cumulative sums

The WINDOW clause keywords:

PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the partition boundary;
UNBOUNDED PRECEDING means from the very first row of the partition,
UNBOUNDED FOLLOWING means to the very last row of the partition

2.1 Example

-- Data
pentaKilldata,2019-04-10,1
pentaKilldata,2019-04-11,5
pentaKilldata,2019-04-12,7
pentaKilldata,2019-04-13,3
pentaKilldata,2019-04-14,2
pentaKilldata,2019-04-15,4
pentaKilldata,2019-04-16,4
-- Load the data
drop table if exists pentaKilldata_window;

create table pentaKilldata_window(domain string,time string,traffic int) row format delimited fields terminated by ',';

load data local inpath '/opt/data/window.txt' overwrite into table pentaKilldata_window;

-- Query
select 
domain, time, traffic,
sum(traffic) OVER (partition by domain order by time) pv1,
sum(traffic) OVER (partition by domain order by time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) pv2,
sum(traffic) OVER (partition by domain)  pv3,
sum(traffic) OVER (partition by domain order by time  ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)  pv4,
sum(traffic) OVER (partition by domain order by time  ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING)  pv5,
sum(traffic) OVER (partition by domain order by time  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)  pv6
from pentaKilldata_window
order by time;

-- Output
domain  		    time    	traffic 	pv1     pv2     pv3     pv4     pv5     pv6
pentaKilldata       2019-04-10      1       1       1       26      1       6       26
pentaKilldata       2019-04-11      5       6       6       26      6       13      25
pentaKilldata       2019-04-12      7       13      13      26      13      16      20
pentaKilldata       2019-04-13      3       16      16      26      16      18      13
pentaKilldata       2019-04-14      2       18      18      26      17      21      10
pentaKilldata       2019-04-15      4       22      22      26      16      20      8
pentaKilldata       2019-04-16      4       26      26      26      13      13      4

(1) Explanation of pv1

sum(traffic) OVER (partition by domain order by time) pv1
Partition by domain, sort by time ascending, and compute a running sum.


traffic 	pv1   calculation
    1       1     1
    5       6     1+5
    7       13    1+5+7
    3       16    1+5+7+3
    2       18    1+5+7+3+2
    4       22    ...
    4       26    ...


It is worth mentioning that the same cumulative effect can be achieved with a self-join instead of a window function (the example below assumes a separate access(domain, day, pv) table, aggregated by month):

with t as(select domain, date_format(day, 'yyyy-MM') as month, sum(pv) pv from access group by domain, date_format(day, 'yyyy-MM'))
select 
b.domain, b.month, b.pv,
max(a.pv) max_pv,
sum(a.pv) sum_pv
from t a join t b 
on a.domain=b.domain 
where a.month <= b.month
group by b.domain, b.month, b.pv;

(2) Explanation of pv2

sum(traffic) OVER (partition by domain order by time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) pv2,

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means from the first row of the partition to the current row, so here it produces the same result as pv1. (Strictly speaking, pv1 uses the default RANGE frame, under which rows with equal time values would share a total; since every time here is unique, the two columns match.)
traffic   pv2  
    1     1   
    5     6   
    7     13  
    3     16  
    2     18  
    4     22  
    4     26  

(3) Explanation of pv3 -> useful for computing ratios

sum(traffic) OVER (partition by domain)  pv3
Partition by domain only, with no ORDER BY: every row with the same domain sees the full partition, so each row gets the total 1+5+7+3+2+4+4 = 26.

traffic  pv3
    1    26 
    5    26 
    7    26 
    3    26 
    2    26 
    4    26 
    4    26 

# See also:
Business analysis: computing per-group percentages in Hive
https://blog.csdn.net/oyy_90/article/details/89843016	
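Ratios follow directly from pv3: dividing each row's traffic by the partition total gives that day's share of the domain's traffic. A minimal sketch against the example table above:

```sql
-- Each day's traffic as a share of the domain total
select
domain, time, traffic,
round(traffic / sum(traffic) over (partition by domain), 2) as ratio
from pentaKilldata_window;
```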

(4) Explanation of pv4

sum(traffic) OVER (partition by domain order by time  ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)  pv4,

ROWS BETWEEN 3 PRECEDING AND CURRENT ROW means from 3 rows before the current row to the current row.

traffic   pv4 calculation
    1     1   1
    5     6   5+1
    7     13  1+5+7
    3     16  1+5+7+3
    2     17  5+7+3+2   the row with traffic 2 is the current row; the 3 preceding rows are those with traffic 5, 7, and 3
    4     16  7+3+2+4
    4     13  3+2+4+4

(5) Explanation of pv5

sum(traffic) OVER (partition by domain order by time  ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING)  pv5,

ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING means from 3 rows before the current row to 1 row after it.
Recall that the window is simply the range of rows in scope; in effect, it frames a slice of the data.

traffic  pv5 calculation
    1    6   1+5
    5    13  1+5+7
    7    16  1+5+7+3
    3    18  1+5+7+3+2
    2    21  5+7+3+2+4 -> the row with traffic 2 is the current row; the 3 preceding rows have traffic 5, 7, 3, and the 1 following row has traffic 4
    4    20  7+3+2+4+4
    4    13  3+2+4+4

(6) Explanation of pv6

sum(traffic) OVER (partition by domain order by time  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)  pv6

ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING means from the current row to the last row of the partition.

traffic   pv6 calculation
    1     26  1+5+7+3+2+4+4
    5     25  5+7+3+2+4+4
    7     20  7+3+2+4+4
    3     13  3+2+4+4
    2     10  2+4+4
    4     8   4+4
    4     4   4

2.2 When to use

Most commonly used for cumulative (running) sums.

3. Using NTILE + OVER

3.1 Example

NTILE(n) splits the ordered rows of a partition into n slices and returns the slice number of the current row. When the rows do not divide evenly, the extra rows go to the first slices by default. NTILE does not support ROWS BETWEEN.

-- Data
pentaKilldata,2019-04-10,1
pentaKilldata,2019-04-11,5
pentaKilldata,2019-04-12,7
pentaKilldata,2019-04-13,3
pentaKilldata,2019-04-14,2
pentaKilldata,2019-04-15,4
pentaKilldata,2019-04-16,4    

(1) Test 1

-- Query 1
select 
domain, time, traffic,
NTILE(2) OVER (partition by domain order by time) rn1,
NTILE(3) OVER (partition by domain order by time) rn2,
NTILE(4) OVER (order by time) rn3
from pentaKilldata_window
order by domain,time;
-- Result 1
domain          time        traffic  rn1  rn2  rn3
pentaKilldata   2019-04-10     1      1    1    1
pentaKilldata   2019-04-11     5      1    1    2
pentaKilldata   2019-04-12     7      1    1    2
pentaKilldata   2019-04-13     3      1    2    3
pentaKilldata   2019-04-14     2      2    2    3
pentaKilldata   2019-04-15     4      2    3    4
pentaKilldata   2019-04-16     4      2    3    4

  NTILE(2): splits the rows into 2 slices
  NTILE(3): splits the rows into 3 slices; the extra rows go to the earlier slices
  NTILE(4): splits the rows into 4 slices; when the count does not divide evenly, some slices end up with one fewer row

(2) Test 2

-- Append more rows to the existing table
yy.com,2019-04-10,2       
yy.com,2019-04-11,3       
yy.com,2019-04-12,5      
yy.com,2019-04-13,6      
yy.com,2019-04-14,3       
yy.com,2019-04-15,9       
yy.com,2019-04-16,7
 -- Query 2
select 
domain, time, traffic,
NTILE(4) OVER (order by time) rn3
from pentaKilldata_window
order by domain,time;

domain          time        traffic  rn3
pentaKilldata   2019-04-10     1      1
pentaKilldata   2019-04-11     5      1
pentaKilldata   2019-04-12     7      2
pentaKilldata   2019-04-13     3      2
pentaKilldata   2019-04-14     2      3
pentaKilldata   2019-04-15     4      4
pentaKilldata   2019-04-16     4      4
yy.com          2019-04-10     2      1
yy.com          2019-04-11     3      1
yy.com          2019-04-12     5      2
yy.com          2019-04-13     6      2
yy.com          2019-04-14     3      3
yy.com          2019-04-15     9      3
yy.com          2019-04-16     7      4

NTILE(4): splits all rows (no PARTITION BY here) into 4 slices; when the count does not divide evenly some slices get one fewer row, and rows with the same time value may be assigned to either slice.

3.2 When to use

It can be used to chop a table with few columns but many rows into equal-sized chunks. In practice, though, it does not come up often.

4. Using ROW_NUMBER: top-N per group

When rows tie on the ordering value:
ROW_NUMBER still assigns distinct, consecutive numbers in row order
RANK gives tied rows the same rank and leaves gaps (the highest rank stays equal to the row count)
DENSE_RANK gives tied rows the same rank with no gaps (the highest rank shrinks)

Window functions can also cover the case where a GROUP BY deduplication needs columns outside the grouping key.

However, ROW_NUMBER can be very slow on some data; a GROUP BY can retrieve the non-grouped columns as well, as below:

SELECT 
url,
max_token[2] as post_data,
max_token[3] as res_raw_data,
max_token[4] as req_raw_data,
'' AS d,
max_token[5] as request_datetime,
'' AS f
FROM 
(
    -- For each url, pack the other fields into an array, then take the max array
    -- (arrays compare element by element, so the padded content_length at index 0 decides)
	SELECT  max(
	            ARRAY(
	                cast(content_length AS BIGINT) + 1000000    -- 0
	                ,content_length      -- 1
	                ,post_data           -- 2
	                ,res_raw_data        -- 3
	                ,req_raw_data        -- 4
	                ,request_datetime    -- 5
	            )
	        ) AS max_token,
	        url
	FROM    XXXX
	WHERE   ds = '20210101' 
	group by url
)  tmp

row_number(): starting from 1, assigns a sequence number to each row in the partition in order; the numbers never repeat — when ordering values tie, rows are numbered in the order they appear in the table. Typically used to fetch the top-ranked row per group, e.g. the first referer in a session.

rank(): ranks rows within the partition; tied rows share a rank and a gap is left in the sequence.

dense_rank(): ranks rows within the partition; tied rows share a rank and no gap is left.

4.1 Example

select 
domain, time, traffic,
ROW_NUMBER() OVER (partition by domain order by traffic desc) rn1,
RANK() OVER (partition by domain order by traffic desc) rn2,
DENSE_RANK() OVER (partition by domain order by traffic desc) rn3
from pentaKilldata_window;


domain  time    	traffic   rn1     rn2     rn3
yy.com  2019-04-15      9       1       1       1
yy.com  2019-04-16      7       2       2       2
yy.com  2019-04-13      6       3       3       3
yy.com  2019-04-12      5       4       4       4
yy.com  2019-04-14      3       5       5       5
yy.com  2019-04-11      3       6       5       5
yy.com  2019-04-10      2       7       7       6

(1) ROW_NUMBER

ROW_NUMBER() OVER (partition by domain order by traffic desc) rn1,
This is the most frequently used form, especially for per-group top-N queries.

traffic   rn1  
    9       1  
    7       2  
    6       3  
    5       4  
    3       5  
    3       6  
    2       7  

For example: query the top 3 rows by traffic within each domain
	select domain,time,traffic from 
	(select 
	domain, time, traffic,
	ROW_NUMBER() OVER (partition by domain order by traffic desc) as rank
	from pentaKilldata_window
	) c where rank<=3;
 
domain  time    	traffic  
yy.com  2019-04-15      9    
yy.com  2019-04-16      7    
yy.com  2019-04-13      6 

(2) RANK

RANK() OVER (partition by domain order by traffic desc) rn2,
The two rows with traffic 3 tie, so they share the same rn2 value, but a gap is left and the highest rank is still 7. This case comes up less often in practice.

traffic    rn2   
    9        1   
    7        2   
    6        3   
    5        4   
    3        5   
    3        5   
    2        7   

(3)DENSE_RANK

DENSE_RANK() OVER (partition by domain order by traffic desc) rn3
The two rows with traffic 3 tie, so they share the same rn3 value, and no gap is left: the highest rank shrinks to 6.

traffic   rn3
    9       1
    7       2
    6       3
    5       4
    3       5
    3       5
    2       6

4.2 When to use

Use it for per-group top-N queries.
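If tied values should all be kept in a top-N query, DENSE_RANK can replace ROW_NUMBER in the query above; a sketch against the same example table:

```sql
-- Top 3 traffic values per domain, keeping ties
select domain, time, traffic from
(
    select domain, time, traffic,
    DENSE_RANK() OVER (partition by domain order by traffic desc) as rk
    from pentaKilldata_window
) t where rk <= 3;
```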

5. Using CUME_DIST and PERCENT_RANK

dept01,pentakill,10000
dept01,doubleKill,20000
dept01,firstblood,30000
dept02,zhangsan,40000
dept02,lisi,50000      

create table ruozedata_window02(dept  string,user string,sal int)
row format  delimited fields  terminated by ',';

load data local inpath '/opt/data/window02.txt' overwrite into table ruozedata_window02;

5.1 CUME_DIST: (rows with a value less than or equal to the current row's value) / (total rows in the partition)

select 
dept,user,sal,
round(CUME_DIST () over(order by sal),2) rn1,
round(CUME_DIST () over(partition by dept order by sal),2) rn2
from ruozedata_window02;

dept    user        sal     rn1     rn1 calc    rn2     rn2 calc
dept01  pentakill   10000   0.2     1/5         0.33    1/3
dept01  doubleKill  20000   0.4     2/5         0.67    2/3
dept01  firstblood  30000   0.6     3/5         1.0     3/3
dept02  zhangsan    40000   0.8     4/5         0.5     1/2
dept02  lisi        50000   1.0     5/5         1.0     2/2

Note: in round(CUME_DIST() over(order by sal), 2), the 2 is the number of decimal places to keep;
      partition by dept is equivalent to grouping by dept.

5.2 PERCENT_RANK: (rank of the current row within the partition - 1) / (total rows in the partition - 1)

select 
dept,user,sal,
round(PERCENT_RANK() over(order by sal),2) rn1,
round(PERCENT_RANK() over(partition by dept order by sal),2) rn2
from ruozedata_window02;


dept    user        sal     rn1     rn1 calc       rn2     rn2 calc
dept01  pentakill   10000   0.0     (1-1)/(5-1)    0.0     (1-1)/(3-1)
dept01  doubleKill  20000   0.25    (2-1)/(5-1)    0.5     (2-1)/(3-1)
dept01  firstblood  30000   0.5     (3-1)/(5-1)    1.0     (3-1)/(3-1)
dept02  zhangsan    40000   0.75    (4-1)/(5-1)    0.0     (1-1)/(2-1)
dept02  lisi        50000   1.0     (5-1)/(5-1)    1.0     (2-1)/(2-1)

6. Using LAG and LEAD

cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55

drop table if exists  ruozedata_window03;
create table ruozedata_window03(cookieid string,time string,url string)
row format delimited  fields terminated by ',';

load data local inpath '/opt/data/window03.txt' overwrite into table ruozedata_window03;

6.1 LEAD (rows after)

LEAD(col, n, DEFAULT) returns the value of col from the n-th row after the current row within the window.
The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value used when the n-th following row falls beyond the end of the window (NULL if unspecified).

select cookieid,time,url,
lead(time, 1, '1970-01-01 00:00:00') over(partition by cookieid order by time) pre1,
lead(time, 2) over(partition by cookieid order by time) pre2
from ruozedata_window03;


cookieid	time	        url	    pre1	            pre2
cookie1	2015-04-10 10:00:00	url1	2015-04-10 10:00:02	2015-04-10 10:03:04
cookie1	2015-04-10 10:00:02	url2	2015-04-10 10:03:04	2015-04-10 10:10:00
cookie1	2015-04-10 10:03:04	1url3	2015-04-10 10:10:00	2015-04-10 10:50:01
cookie1	2015-04-10 10:10:00	url4	2015-04-10 10:50:01	2015-04-10 10:50:05
cookie1	2015-04-10 10:50:01	url5	2015-04-10 10:50:05	2015-04-10 11:00:00
cookie1	2015-04-10 10:50:05	url6	2015-04-10 11:00:00	NULL
cookie1	2015-04-10 11:00:00	url7	1970-01-01 00:00:00	NULL
cookie2	2015-04-10 10:00:00	url11	2015-04-10 10:00:02	2015-04-10 10:03:04
cookie2	2015-04-10 10:00:02	url22	2015-04-10 10:03:04	2015-04-10 10:10:00
cookie2	2015-04-10 10:03:04	1url33	2015-04-10 10:10:00	2015-04-10 10:50:01
cookie2	2015-04-10 10:10:00	url44	2015-04-10 10:50:01	2015-04-10 10:50:05
cookie2	2015-04-10 10:50:01	url55	2015-04-10 10:50:05	2015-04-10 11:00:00
cookie2	2015-04-10 10:50:05	url66	2015-04-10 11:00:00	NULL
cookie2	2015-04-10 11:00:00	url77	1970-01-01 00:00:00	NULL

6.2 LAG (rows before)

LAG(col, n, DEFAULT) returns the value of col from the n-th row before the current row within the window.
The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value used when the n-th preceding row falls before the start of the window (NULL if unspecified).

select cookieid,time,url,
lag(time, 1, '1970-01-01 00:00:00') over(partition by cookieid order by time) pre1,
lag(time, 2) over(partition by cookieid order by time) pre2
from ruozedata_window03;

cookieid	time	url	pre1	pre2
cookie1	2015-04-10 10:00:00	url1	1970-01-01 00:00:00	NULL
cookie1	2015-04-10 10:00:02	url2	2015-04-10 10:00:00	NULL
cookie1	2015-04-10 10:03:04	1url3	2015-04-10 10:00:02	2015-04-10 10:00:00
cookie1	2015-04-10 10:10:00	url4	2015-04-10 10:03:04	2015-04-10 10:00:02
cookie1	2015-04-10 10:50:01	url5	2015-04-10 10:10:00	2015-04-10 10:03:04
cookie1	2015-04-10 10:50:05	url6	2015-04-10 10:50:01	2015-04-10 10:10:00
cookie1	2015-04-10 11:00:00	url7	2015-04-10 10:50:05	2015-04-10 10:50:01
cookie2	2015-04-10 10:00:00	url11	1970-01-01 00:00:00	NULL
cookie2	2015-04-10 10:00:02	url22	2015-04-10 10:00:00	NULL
cookie2	2015-04-10 10:03:04	1url33	2015-04-10 10:00:02	2015-04-10 10:00:00
cookie2	2015-04-10 10:10:00	url44	2015-04-10 10:03:04	2015-04-10 10:00:02
cookie2	2015-04-10 10:50:01	url55	2015-04-10 10:10:00	2015-04-10 10:03:04
cookie2	2015-04-10 10:50:05	url66	2015-04-10 10:50:01	2015-04-10 10:10:00
cookie2	2015-04-10 11:00:00	url77	2015-04-10 10:50:05	2015-04-10 10:50:01

7. FIRST_VALUE and LAST_VALUE

7.1 LAST_VALUE: after sorting within the partition, the last value up to and including the current row

select cookieid,time,url,
LAST_VALUE(url) over(partition by cookieid order by time) rn
from ruozedata_window03;

cookieid	time        	url	    rn
cookie1	2015-04-10 10:00:00	url1	url1   
cookie1	2015-04-10 10:00:02	url2	url2
cookie1	2015-04-10 10:03:04	1url3	1url3
cookie1	2015-04-10 10:10:00	url4	url4
cookie1	2015-04-10 10:50:01	url5	url5
cookie1	2015-04-10 10:50:05	url6	url6
cookie1	2015-04-10 11:00:00	url7	url7
cookie2	2015-04-10 10:00:00	url11	url11
cookie2	2015-04-10 10:00:02	url22	url22
cookie2	2015-04-10 10:03:04	1url33	1url33
cookie2	2015-04-10 10:10:00	url44	url44
cookie2	2015-04-10 10:50:01	url55	url55
cookie2	2015-04-10 10:50:05	url66	url66
cookie2	2015-04-10 11:00:00	url77	url77
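Because the default frame ends at the current row, LAST_VALUE above simply echoes the current row's url. To get the true last value of the whole partition, the frame has to be extended explicitly; a sketch:

```sql
-- True last url per cookie: extend the frame to the end of the partition
select cookieid, time, url,
LAST_VALUE(url) over(partition by cookieid order by time
    rows between unbounded preceding and unbounded following) rn
from ruozedata_window03;
```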

7.2 FIRST_VALUE: after sorting within the partition, the first value up to the current row

select cookieid,time,url,
FIRST_VALUE(url) over(partition by cookieid order by time) rn
from ruozedata_window03;


cookieid	time	url	rn
cookie1	2015-04-10 10:00:00	url1	url1   // within cookie1's partition, sorted by time ascending, the first value is url1
cookie1	2015-04-10 10:00:02	url2	url1
cookie1	2015-04-10 10:03:04	1url3	url1
cookie1	2015-04-10 10:10:00	url4	url1
cookie1	2015-04-10 10:50:01	url5	url1
cookie1	2015-04-10 10:50:05	url6	url1
cookie1	2015-04-10 11:00:00	url7	url1
cookie2	2015-04-10 10:00:00	url11	url11   // within cookie2's partition, sorted by time ascending, the first value is url11
cookie2	2015-04-10 10:00:02	url22	url11
cookie2	2015-04-10 10:03:04	1url33	url11
cookie2	2015-04-10 10:10:00	url44	url11
cookie2	2015-04-10 10:50:01	url55	url11
cookie2	2015-04-10 10:50:05	url66	url11
cookie2	2015-04-10 11:00:00	url77	url11

7.3 Use cases

Given a batch of order data, for example:
At what time did you place your first order this month?
At what time did you place your last order this month?

Both can be answered with FIRST_VALUE and LAST_VALUE.
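A sketch of those two questions, assuming a hypothetical orders(user_id, order_time) table; sorting descending for the last order avoids depending on LAST_VALUE's default frame (which, as 7.1 shows, only extends to the current row):

```sql
-- First and last order time per user per month (orders table is hypothetical)
select distinct
user_id,
date_format(order_time, 'yyyy-MM') as month,
FIRST_VALUE(order_time) over(partition by user_id, date_format(order_time, 'yyyy-MM')
    order by order_time) as first_order,
FIRST_VALUE(order_time) over(partition by user_id, date_format(order_time, 'yyyy-MM')
    order by order_time desc) as last_order
from orders;
```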

References

    # Reference 1
    https://blog.csdn.net/dingchangxiu11/article/details/83145151
    # Reference 2
    https://blog.csdn.net/weixin_38750084/article/details/82779910