Hive Analytic Window Functions

Window Functions

A window function usually means a function used with over(); its window is the set of rows defined by an OVER clause.

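The OVER clause can be written in two forms:

over(distribute by partition_column sort by sort_column)

over(partition by partition_column order by sort_column)

Inside over(), the two forms are generally treated as equivalent: rows are split into partitions by the partition column and sorted within each partition. This differs from the query-level clauses of the same names, where distribute by with sort by sorts locally within each reducer and order by sorts globally through a single reducer.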

Setting the window size:
The window size is set with a rows between clause, also called the window clause.

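-- From the first row of the partition to the current row
over(partition by city order by year rows between UNBOUNDED PRECEDING and current row)

-- current row is the current row
-- UNBOUNDED PRECEDING is the first row of the partition
-- The two bounds around "and" define the range: it starts at the bound before "and" and ends at the bound after it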

-- From the current row to the last row of the partition
over(partition by city order by year rows between current row and UNBOUNDED FOLLOWING)

-- UNBOUNDED FOLLOWING is the last row of the partition
-- current row is the current row

-- The current row and the row before it
over(partition by city order by year rows between 1 PRECEDING and current row)

-- current row is the current row
-- 1 PRECEDING is the row before the current row

-- The current row and the row after it
over(partition by city order by year rows between current row and 1 FOLLOWING)

-- current row is the current row
-- 1 FOLLOWING is the row after the current row

-- The current row, the row before it, and the row after it
over(partition by city order by year rows between 1 PRECEDING and 1 FOLLOWING)

-- 1 FOLLOWING is the row after the current row
-- 1 PRECEDING is the row before the current row
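
The frame bounds above can be tried out without a Hive cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive, assuming SQLite 3.25 or later, which accepts the same rows between syntax. The sales table and its values are hypothetical.

```python
import sqlite3

# SQLite stands in for Hive here (assumption: SQLite >= 3.25).
# The table name and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, year INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A", 2019, 10), ("A", 2020, 20), ("A", 2021, 30), ("A", 2022, 40)],
)

rows = conn.execute("""
    SELECT year,
           -- first row of the partition up to the current row
           SUM(amount) OVER (PARTITION BY city ORDER BY year
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum,
           -- current row up to the last row of the partition
           SUM(amount) OVER (PARTITION BY city ORDER BY year
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_sum,
           -- previous row, current row, next row
           SUM(amount) OVER (PARTITION BY city ORDER BY year
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS neighbor_sum
    FROM sales
    ORDER BY year
""").fetchall()

for row in rows:
    print(row)
```

Each frame always lists the earlier bound first, which is why the "current row to the end" frame is written as current row and UNBOUNDED FOLLOWING.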

Analytic Functions

Analytic functions process and analyze data: they operate on the rows of the window obtained through the OVER clause.

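  • sum()

    Sum of the values within the window

  • avg()

    Average of the values within the window

  • max()

    Maximum value within the window

  • min()

    Minimum value within the window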

  • row_number()

    row_number numbers the rows in sort order; the number increases by one per row, even when values are tied

    id	number
    a	1
    b	2
    b	3
    b	4
    c	5
    c	6
    
  • dense_rank()

    dense_rank gives rows with equal values the same rank, and the next distinct value gets the next consecutive rank, so there are no gaps

    id	number
    a	1
    b	2
    b	2
    b	2
    c	3
    c	3
    
  • rank()

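    rank gives rows with equal values the same rank, but the next distinct value takes its row number as its rank, so gaps appear after ties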

    id	number
    a	1
    b	2
    b	2
    b	2
    c	5
    c	5
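
The three numbering functions can be compared side by side on the same ids used in the tables above. This is a small sketch using Python's sqlite3 as a stand-in for Hive (assumption: SQLite 3.25 or later); the table name t is hypothetical.

```python
import sqlite3

# SQLite stands in for Hive (assumption: SQLite >= 3.25); table t is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in "abbbcc"])

ranking = conn.execute("""
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY id) AS row_num,   -- 1..6, ties still advance
           RANK()       OVER (ORDER BY id) AS rnk,       -- ties share a rank, gaps follow
           DENSE_RANK() OVER (ORDER BY id) AS dense_rnk  -- ties share a rank, no gaps
    FROM t
    ORDER BY row_num
""").fetchall()

for row in ranking:
    print(row)
```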
    
  • ntile(n)

    Splits the ordered rows of a partition into n buckets and returns the bucket number of each row.

    SELECT 
    cookieid,
    createtime,
    pv,
    NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,  -- split each partition into 2 buckets
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,  -- split each partition into 3 buckets
    NTILE(4) OVER(ORDER BY createtime) AS rn3                         -- split all rows into 4 buckets
    FROM test 
    ORDER BY cookieid,createtime;
    
    cookieid createtime    pv       rn1     rn2     rn3
    cookie1 2021-04-10      1       1       1       1
    cookie1 2021-04-11      5       1       1       1
    cookie1 2021-04-12      7       1       1       2
    cookie1 2021-04-13      3       1       2       2
    cookie1 2021-04-14      2       2       2       3
    cookie1 2021-04-15      4       2       3       3
    cookie1 2021-04-16      4       2       3       4
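
The bucket assignment above can be reproduced with the same seven rows. The sketch below uses Python's sqlite3 as a stand-in for Hive (assumption: SQLite 3.25 or later, which, like the output shown here, puts the larger buckets first when the rows do not divide evenly).

```python
import sqlite3

# SQLite stands in for Hive (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (cookieid TEXT, createtime TEXT, pv INTEGER)")
pvs = [1, 5, 7, 3, 2, 4, 4]  # pv values from the table above
conn.executemany(
    "INSERT INTO test VALUES ('cookie1', ?, ?)",
    [(f"2021-04-{10 + i}", pv) for i, pv in enumerate(pvs)],
)

tiles = conn.execute("""
    SELECT createtime,
           NTILE(2) OVER (PARTITION BY cookieid ORDER BY createtime) AS rn1,
           NTILE(3) OVER (PARTITION BY cookieid ORDER BY createtime) AS rn2,
           NTILE(4) OVER (ORDER BY createtime) AS rn3
    FROM test
    ORDER BY createtime
""").fetchall()

for row in tiles:
    print(row)
```

With 7 rows, NTILE(3) yields buckets of sizes 3, 2, 2, and NTILE(4) yields sizes 2, 2, 2, 1.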
    
  • cume_dist()

    Number of rows with a value less than or equal to the current value / total number of rows in the partition

    SELECT 
    dept,
    userid,
    sal,
    CUME_DIST() OVER(ORDER BY sal) AS rn1,
    CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2 
    FROM test;
     
    dept    userid   sal   rn1       rn2 
    d1      user1   1000    0.2     0.3333333333333333
    d1      user2   2000    0.4     0.6666666666666666
    d1      user3   3000    0.6     1.0
    d2      user4   4000    0.8     0.5
    d2      user5   5000    1.0     1.0
     
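    rn1: no partition, so all rows form one group with 5 rows in total
         row 1: 1 row has sal <= 1000, so 1/5 = 0.2
         row 3: 3 rows have sal <= 3000, so 3/5 = 0.6
    rn2: partitioned by department; dept=d1 has 3 rows
         row 2: 2 rows have sal <= 2000, so 2/3 = 0.6666666666666666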
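
The arithmetic above can be checked by running the same query on the five sample rows. This sketch uses Python's sqlite3 as a stand-in for Hive (assumption: SQLite 3.25 or later).

```python
import sqlite3

# SQLite stands in for Hive (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (dept TEXT, userid TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        ("d1", "user1", 1000),
        ("d1", "user2", 2000),
        ("d1", "user3", 3000),
        ("d2", "user4", 4000),
        ("d2", "user5", 5000),
    ],
)

dist = conn.execute("""
    SELECT userid,
           CUME_DIST() OVER (ORDER BY sal) AS rn1,                   -- over all 5 rows
           CUME_DIST() OVER (PARTITION BY dept ORDER BY sal) AS rn2  -- within each dept
    FROM test
    ORDER BY sal
""").fetchall()

for row in dist:
    print(row)
```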
    
  • percent_rank()

    (rank of the current row within the partition - 1) / (total number of rows in the partition - 1)

    SELECT 
    dept,
    userid,
    sal,
    PERCENT_RANK() OVER(ORDER BY sal) AS rn1,   -- percent rank over all rows
    RANK() OVER(ORDER BY sal) AS rn11,          -- RANK value over all rows
    PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2 
    FROM test;
     
    dept    userid   sal    rn1    rn11     rn2
    d1      user1   1000    0.0     1       0.0
    d1      user2   2000    0.25    2       0.5
    d1      user3   3000    0.5     3       1.0
    d2      user4   4000    0.75    4       0.0
    d2      user5   5000    1.0     5       1.0
     
    rn1: rn1 = (rn11-1) / (total rows-1); with no partition the total is 5
         row 1: (1-1)/(5-1)=0/4=0
         row 2: (2-1)/(5-1)=1/4=0.25
         row 4: (4-1)/(5-1)=3/4=0.75
    rn2: partitioned by dept,
         dept=d1 has 3 rows in total
         row 1: (1-1)/(3-1)=0
         row 3: (3-1)/(3-1)=1
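
The relation between rank and percent_rank can also be verified programmatically. The sketch below runs on the same sample rows with Python's sqlite3 as a stand-in for Hive (assumption: SQLite 3.25 or later) and recomputes (rank - 1) / (total rows - 1) in Python; the total column is added here for illustration only.

```python
import sqlite3

# SQLite stands in for Hive (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (dept TEXT, userid TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        ("d1", "user1", 1000),
        ("d1", "user2", 2000),
        ("d1", "user3", 3000),
        ("d2", "user4", 4000),
        ("d2", "user5", 5000),
    ],
)

pct = conn.execute("""
    SELECT userid,
           RANK() OVER (ORDER BY sal) AS rn11,
           COUNT(*) OVER () AS total,               -- row count of the whole window
           PERCENT_RANK() OVER (ORDER BY sal) AS rn1
    FROM test
    ORDER BY sal
""").fetchall()

# Recompute the formula by hand and compare with the built-in function.
recomputed = [(rank - 1) / (total - 1) for _, rank, total, _ in pct]
for (userid, rank, total, rn1), by_hand in zip(pct, recomputed):
    print(userid, rank, rn1, by_hand)
```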
    
  • lag()

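    lag(col, n, default) returns the value of col from the nth row above the current row in the window. The first argument is the column name, the second is the number of rows to go up, and the third is the default value (used when the nth row above does not exist; if not specified, the result is NULL)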

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
    LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
    FROM test;
    
    cookieid createtime             url    rn       last_1_time             last_2_time
    cookie1 2021-04-10 10:00:00     url1    1       1970-01-01 00:00:00     NULL
    cookie1 2021-04-10 10:00:02     url2    2       2021-04-10 10:00:00     NULL
    cookie1 2021-04-10 10:03:04     1url3   3       2021-04-10 10:00:02     2021-04-10 10:00:00
    cookie1 2021-04-10 10:10:00     url4    4       2021-04-10 10:03:04     2021-04-10 10:00:02
    cookie1 2021-04-10 10:50:01     url5    5       2021-04-10 10:10:00     2021-04-10 10:03:04
    cookie1 2021-04-10 10:50:05     url6    6       2021-04-10 10:50:01     2021-04-10 10:10:00
    cookie1 2021-04-10 11:00:00     url7    7       2021-04-10 10:50:05     2021-04-10 10:50:01
    cookie2 2021-04-10 10:00:00     url11   1       1970-01-01 00:00:00     NULL
    cookie2 2021-04-10 10:00:02     url22   2       2021-04-10 10:00:00     NULL
    cookie2 2021-04-10 10:03:04     1url33  3       2021-04-10 10:00:02     2021-04-10 10:00:00
    cookie2 2021-04-10 10:10:00     url44   4       2021-04-10 10:03:04     2021-04-10 10:00:02
    cookie2 2021-04-10 10:50:01     url55   5       2021-04-10 10:10:00     2021-04-10 10:03:04
    cookie2 2021-04-10 10:50:05     url66   6       2021-04-10 10:50:01     2021-04-10 10:10:00
    cookie2 2021-04-10 11:00:00     url77   7       2021-04-10 10:50:05     2021-04-10 10:50:01
     
     
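    last_1_time: the value 1 row above, with default '1970-01-01 00:00:00'
                 cookie1 row 1: there is no row above, so the default 1970-01-01 00:00:00 is used
                 cookie1 row 3: the value 1 row above is row 2's value, 2021-04-10 10:00:02
                 cookie1 row 6: the value 1 row above is row 5's value, 2021-04-10 10:50:01
    last_2_time: the value 2 rows above, with no default specified
                 cookie1 row 1: there is no row 2 rows above, so the result is NULL
                 cookie1 row 2: there is no row 2 rows above, so the result is NULL
                 cookie1 row 4: the value 2 rows above is row 2's value, 2021-04-10 10:00:02
                 cookie1 row 7: the value 2 rows above is row 5's value, 2021-04-10 10:50:01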
    
  • lead()

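    The opposite of lag: lead(col, n, default) returns the value of col from the nth row below the current row in the window.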

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
    LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time 
    FROM test;
     
     
    cookieid createtime             url    rn       next_1_time             next_2_time 
    cookie1 2021-04-10 10:00:00     url1    1       2021-04-10 10:00:02     2021-04-10 10:03:04
    cookie1 2021-04-10 10:00:02     url2    2       2021-04-10 10:03:04     2021-04-10 10:10:00
    cookie1 2021-04-10 10:03:04     1url3   3       2021-04-10 10:10:00     2021-04-10 10:50:01
    cookie1 2021-04-10 10:10:00     url4    4       2021-04-10 10:50:01     2021-04-10 10:50:05
    cookie1 2021-04-10 10:50:01     url5    5       2021-04-10 10:50:05     2021-04-10 11:00:00
    cookie1 2021-04-10 10:50:05     url6    6       2021-04-10 11:00:00     NULL
    cookie1 2021-04-10 11:00:00     url7    7       1970-01-01 00:00:00     NULL
    cookie2 2021-04-10 10:00:00     url11   1       2021-04-10 10:00:02     2021-04-10 10:03:04
    cookie2 2021-04-10 10:00:02     url22   2       2021-04-10 10:03:04     2021-04-10 10:10:00
    cookie2 2021-04-10 10:03:04     1url33  3       2021-04-10 10:10:00     2021-04-10 10:50:01
    cookie2 2021-04-10 10:10:00     url44   4       2021-04-10 10:50:01     2021-04-10 10:50:05
    cookie2 2021-04-10 10:50:01     url55   5       2021-04-10 10:50:05     2021-04-10 11:00:00
    cookie2 2021-04-10 10:50:05     url66   6       2021-04-10 11:00:00     NULL
    cookie2 2021-04-10 11:00:00     url77   7       1970-01-01 00:00:00     NULL
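
lag and lead can be seen mirroring each other on a three-row excerpt of the cookie1 data. This is a minimal sketch using Python's sqlite3 as a stand-in for Hive (assumption: SQLite 3.25 or later).

```python
import sqlite3

# SQLite stands in for Hive (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO test VALUES ('cookie1', ?, ?)",
    [
        ("2021-04-10 10:00:00", "url1"),
        ("2021-04-10 10:00:02", "url2"),
        ("2021-04-10 10:03:04", "1url3"),
    ],
)

shifted = conn.execute("""
    SELECT url,
           LAG(createtime, 1, '1970-01-01 00:00:00')
               OVER (PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
           LAG(createtime, 2)
               OVER (PARTITION BY cookieid ORDER BY createtime) AS last_2_time,
           LEAD(createtime, 1, '1970-01-01 00:00:00')
               OVER (PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
           LEAD(createtime, 2)
               OVER (PARTITION BY cookieid ORDER BY createtime) AS next_2_time
    FROM test
    ORDER BY createtime
""").fetchall()

for row in shifted:
    print(row)
```

Where the requested row does not exist, the call with a default returns that default and the call without one returns NULL (None in Python).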
    
  • first_value()

    After sorting within the partition, takes the first value up to the current row

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 
    FROM test;
     
    cookieid  createtime              url     rn      first1
    cookie1   2021-04-10 10:00:00     url1    1       url1
    cookie1   2021-04-10 10:00:02     url2    2       url1
    cookie1   2021-04-10 10:03:04     1url3   3       url1
    cookie1   2021-04-10 10:10:00     url4    4       url1
    cookie1   2021-04-10 10:50:01     url5    5       url1
    cookie1   2021-04-10 10:50:05     url6    6       url1
    cookie1   2021-04-10 11:00:00     url7    7       url1
    cookie2   2021-04-10 10:00:00     url11   1       url11
    cookie2   2021-04-10 10:00:02     url22   2       url11
    cookie2   2021-04-10 10:03:04     1url33  3       url11
    cookie2   2021-04-10 10:10:00     url44   4       url11
    cookie2   2021-04-10 10:50:01     url55   5       url11
    cookie2   2021-04-10 10:50:05     url66   6       url11
    cookie2   2021-04-10 11:00:00     url77   7       url11
    
  • last_value()

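    After sorting within the partition, takes the last value up to the current row. Pay attention to the order by clause when using this function: with order by and the default frame, the window ends at the current row, so the result is usually the current row's own value. The example below omits order by, so the window is the whole partition and which row counts as last depends on the order in which the rows arrive.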

    SELECT cookieid,
    createtime,
    url,
    LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2  
    FROM test;
     
    cookieid  createtime              url     last2
    cookie1   2021-04-10 10:00:02     url2    url5
    cookie1   2021-04-10 10:00:00     url1    url5
    cookie1   2021-04-10 10:03:04     1url3   url5
    cookie1   2021-04-10 10:50:05     url6    url5
    cookie1   2021-04-10 11:00:00     url7    url5
    cookie1   2021-04-10 10:10:00     url4    url5
    cookie1   2021-04-10 10:50:01     url5    url5
    cookie2   2021-04-10 10:00:02     url22   url55
    cookie2   2021-04-10 10:00:00     url11   url55
    cookie2   2021-04-10 10:03:04     1url33  url55
    cookie2   2021-04-10 10:50:05     url66   url55
    cookie2   2021-04-10 11:00:00     url77   url55
    cookie2   2021-04-10 10:10:00     url44   url55
    cookie2   2021-04-10 10:50:01     url55   url55
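
The effect of the frame on last_value is easiest to see next to first_value. The sketch below uses Python's sqlite3 as a stand-in for Hive (assumption: SQLite 3.25 or later) on a three-row excerpt of the cookie1 data; the column names last_default and last_full are hypothetical. It contrasts the default frame with an explicit frame spanning the whole partition.

```python
import sqlite3

# SQLite stands in for Hive (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO test VALUES ('cookie1', ?, ?)",
    [
        ("2021-04-10 10:00:00", "url1"),
        ("2021-04-10 10:00:02", "url2"),
        ("2021-04-10 10:03:04", "1url3"),
    ],
)

values = conn.execute("""
    SELECT url,
           FIRST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime) AS first1,
           -- default frame: ends at the current row
           LAST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime) AS last_default,
           -- explicit frame: the whole partition
           LAST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_full
    FROM test
    ORDER BY createtime
""").fetchall()

for row in values:
    print(row)
```

With the default frame, last_default simply echoes the current row's url; only the explicit frame returns the partition's final value on every row.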
    
  • grouping sets()

    group by aggregates over one combination of dimensions at a time; to aggregate over several groupings in a single query, use the grouping sets() clause.

    SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
    FROM test 
    GROUP BY month,day 
    GROUPING SETS (month,day) 
    ORDER BY GROUPING__ID;
     
    month      day            uv      GROUPING__ID
    2021-03    NULL            5       1
    2021-04    NULL            7       1
    NULL       2021-03-10      4       2
    NULL       2021-03-12      1       2
    NULL       2021-04-12      2       2
    NULL       2021-04-13      3       2
    NULL       2021-04-15      2       2
    NULL       2021-04-16      2       2
     
     
    Equivalent to
    SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test GROUP BY month 
    UNION ALL 
    SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test GROUP BY day
    
    
    The clause can also declare grouping sets made of several columns
    SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
    FROM test 
    GROUP BY month,day 
    GROUPING SETS (month,day,(month,day)) 
    ORDER BY GROUPING__ID;
     
    month         day             uv      GROUPING__ID
    2021-03       NULL            5       1
    2021-04       NULL            7       1
    NULL          2021-03-10      4       2
    NULL          2021-03-12      1       2
    NULL          2021-04-12      2       2
    NULL          2021-04-13      3       2
    NULL          2021-04-15      2       2
    NULL          2021-04-16      2       2
    2021-03       2021-03-10      4       3
    2021-03       2021-03-12      1       3
    2021-04       2021-04-12      2       3
    2021-04       2021-04-13      3       3
    2021-04       2021-04-15      2       3
    2021-04       2021-04-16      2       3
    
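    GROUPING__ID indicates which grouping set a result row belongs to. The numbering shown here is that of the original example; the exact values may differ between Hive versions.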
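
The UNION ALL equivalence stated above can be run directly. SQLite has no grouping sets clause, so this sketch (Python's sqlite3, with hypothetical data and a hypothetical gid column standing in for GROUPING__ID) only executes the UNION ALL form that GROUPING SETS (month, day) is described as being equivalent to.

```python
import sqlite3

# SQLite stands in for Hive; the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (month TEXT, day TEXT, cookieid TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        ("2021-03", "2021-03-10", "c1"),
        ("2021-03", "2021-03-10", "c2"),
        ("2021-03", "2021-03-12", "c1"),
        ("2021-04", "2021-04-12", "c3"),
    ],
)

# One SELECT per grouping set; the column not grouped on is filled with NULL.
grouped = conn.execute("""
    SELECT month, NULL AS day, COUNT(DISTINCT cookieid) AS uv, 1 AS gid
    FROM test GROUP BY month
    UNION ALL
    SELECT NULL, day, COUNT(DISTINCT cookieid), 2
    FROM test GROUP BY day
    ORDER BY gid, month, day
""").fetchall()

for row in grouped:
    print(row)
```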
    
  • cube()

    Aggregates over every combination of the GROUP BY dimensions

    SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
    FROM test 
    GROUP BY month,day 
    WITH CUBE 
    ORDER BY GROUPING__ID;
     
     
    month           day             uv      GROUPING__ID
    NULL            NULL            7       0
    2021-03         NULL            5       1
    2021-04         NULL            7       1
    NULL            2021-04-12      2       2
    NULL            2021-04-13      3       2
    NULL            2021-04-15      2       2
    NULL            2021-04-16      2       2
    NULL            2021-03-10      4       2
    NULL            2021-03-12      1       2
    2021-03         2021-03-10      4       3
    2021-03         2021-03-12      1       3
    2021-04         2021-04-16      2       3
    2021-04         2021-04-12      2       3
    2021-04         2021-04-13      3       3
    2021-04         2021-04-15      2       3
    
  • rollup()

    A subset of cube: aggregates hierarchically, led by the leftmost dimension.

    For example, hierarchical aggregation led by the month dimension:
    SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID  
    FROM test 
    GROUP BY month,day
    WITH ROLLUP 
    ORDER BY GROUPING__ID;
     
    month            day             uv      GROUPING__ID
    NULL             NULL            7       0
    2021-03          NULL            5       1
    2021-04          NULL            7       1
    2021-03          2021-03-10      4       3
    2021-03          2021-03-12      1       3
    2021-04          2021-04-12      2       3
    2021-04          2021-04-13      3       3
    2021-04          2021-04-15      2       3
    2021-04          2021-04-16      2       3
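
The difference between cube and rollup comes down to which grouping sets each one expands to. The following pure-Python sketch illustrates that idea only; it is not how Hive implements either clause, and the helper names (cube_sets, rollup_sets, aggregate) and sample rows are hypothetical.

```python
from itertools import combinations

def cube_sets(dims):
    """All combinations of the dimensions, from the full set down to the empty set."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]

def rollup_sets(dims):
    """Only the prefixes of the dimension list, led by the leftmost dimension."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]

def aggregate(rows, dims, grouping_sets):
    """Count distinct cookieid per group; dimensions outside the set become None."""
    out = []
    for gs in grouping_sets:
        groups = {}
        for row in rows:
            key = tuple(row[d] if d in gs else None for d in dims)
            groups.setdefault(key, set()).add(row["cookieid"])
        out.extend((*key, len(ids)) for key, ids in groups.items())
    return out

dims = ("month", "day")
rows = [
    {"month": "2021-03", "day": "2021-03-10", "cookieid": "c1"},
    {"month": "2021-03", "day": "2021-03-10", "cookieid": "c2"},
    {"month": "2021-03", "day": "2021-03-12", "cookieid": "c1"},
    {"month": "2021-04", "day": "2021-04-12", "cookieid": "c3"},
]

cube_result = aggregate(rows, dims, cube_sets(dims))
rollup_result = aggregate(rows, dims, rollup_sets(dims))

print(cube_sets(dims))
print(rollup_sets(dims))
```

rollup omits the day-only grouping set, which matches the rollup output above containing no rows with a NULL month and a non-NULL day.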
    