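Contents
1. Common Hive window functions and row/column conversion
Ranking: ROW_NUMBER(), RANK(), DENSE_RANK(), etc.
Value access: FIRST_VALUE(col), LAST_VALUE(col), LEAD(col,n,DEFAULT), LAG(col,n,DEFAULT), etc.
Aggregation: COUNT(), SUM(), MIN(), MAX(), AVG(), etc.
Row/column conversion
Column to rows (split a delimited column into multiple rows, exposed as a new column): lateral view explode(split(column, ',')) num
Rows to column (merge the values of a column per key): concat_ws(',',collect_set(column))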
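2. Use cases for window functions
Window functions are used for ranking within groups, dynamic GROUP BY, Top N queries, running (cumulative) calculations, and hierarchical queries.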
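Of these use cases, Top N per group is probably the most common. A minimal sketch, assuming a table student_score(user_id, course, score) like the one used in the examples in this article; the cutoff of 3 is arbitrary:

```sql
-- Top 3 scores per course (assumed table: student_score(user_id, course, score)).
-- The window function is computed in a subquery because its result
-- cannot be filtered in the WHERE clause of the same SELECT.
SELECT user_id,
       course,
       score
FROM (
    SELECT user_id,
           course,
           score,
           ROW_NUMBER() OVER (PARTITION BY course ORDER BY score DESC) AS rn
    FROM student_score
) ranked
WHERE rn <= 3;
```

ROW_NUMBER() returns exactly three rows per course even when scores tie; replacing it with RANK() or DENSE_RANK() would keep all rows tied at the cutoff.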
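3. Examples
3.1 Ranking window functions
Ranking function | Description |
---|---|
ROW_NUMBER() | Numbers the rows within each partition sequentially, starting from 1, in sort order |
RANK() | Rank within the partition; tied rows share a rank and leave gaps in the ranking |
DENSE_RANK() | Rank within the partition; tied rows share a rank and leave no gaps in the ranking |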
SELECT user_id,
course,
score,
ROW_NUMBER() OVER(PARTITION BY course ORDER BY score) as rn,
RANK() OVER(PARTITION BY course ORDER BY score) as rk,
DENSE_RANK() OVER(PARTITION BY course ORDER BY score) as dr
FROM student_score
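The three functions only differ when the sort key has ties. The sketch below uses a small inline data set (hypothetical values, built with a CTE and UNION ALL, which assumes Hive 0.13 or later) so the difference is visible without an existing table:

```sql
-- Hypothetical sample data: two students tie at 90.
WITH demo AS (
    SELECT 'u1' AS user_id, 'math' AS course, 90 AS score
    UNION ALL
    SELECT 'u2' AS user_id, 'math' AS course, 90 AS score
    UNION ALL
    SELECT 'u3' AS user_id, 'math' AS course, 80 AS score
    UNION ALL
    SELECT 'u4' AS user_id, 'math' AS course, 70 AS score
)
SELECT user_id,
       score,
       ROW_NUMBER() OVER (PARTITION BY course ORDER BY score DESC) AS rn,
       RANK()       OVER (PARTITION BY course ORDER BY score DESC) AS rk,
       DENSE_RANK() OVER (PARTITION BY course ORDER BY score DESC) AS dr
FROM demo;
-- score 90, 90, 80, 70 gives:
--   rk: 1, 1, 3, 4   (rank 2 is skipped after the tie)
--   dr: 1, 1, 2, 3   (no gap)
--   rn: 1, 2, 3, 4   (which tied row gets 1 and which gets 2 is arbitrary)
```

If the row numbering must be reproducible, a tie-breaker such as user_id can be added to the ORDER BY.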
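3.2 Value-access window functions
Value-access function | Description |
---|---|
FIRST_VALUE(col) | The first value of col within the partition after sorting, up to the current row |
LAST_VALUE(col) | The last value of col within the partition after sorting, up to the current row. When the ORDER BY value changes from row to row, this is effectively the current row's value; when several rows share the same ORDER BY value, it is the value from the last row of that tied group. Because rows can tie on the sort key while holding different values in the selected column, the results of FIRST_VALUE and LAST_VALUE are not deterministic in that case. |
LEAD(col,n,DEFAULT) | Returns the value from the n-th row after the current row within the window. Argument 1: the column; argument 2: the offset n (optional, defaults to 1); argument 3: the default value, used when there is no n-th following row (NULL if not specified) |
LAG(col,n,DEFAULT) | The opposite of LEAD: returns the value from the n-th row before the current row within the window. Argument 1: the column; argument 2: the offset n (optional, defaults to 1); argument 3: the default value, used when there is no n-th preceding row (NULL if not specified) |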
SELECT user_id,
course,
score,
ROW_NUMBER() OVER(PARTITION BY course ORDER BY score ASC) AS rn,
FIRST_VALUE(score) OVER(PARTITION BY course ORDER BY score ASC) AS first_scorea,
FIRST_VALUE(score) OVER(PARTITION BY course ORDER BY score DESC) AS first_scored,
FIRST_VALUE(user_id) OVER(PARTITION BY course ORDER BY score ASC) AS first_usera,
FIRST_VALUE(user_id) OVER(PARTITION BY course ORDER BY score DESC, user_id ASC) AS first_userda,
LAST_VALUE(score) OVER(PARTITION BY course ORDER BY score) AS last_scorea,
LAST_VALUE(user_id) OVER(PARTITION BY course ORDER BY score ASC,user_id ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_user_upcr,
LAST_VALUE(user_id) OVER(PARTITION BY course ORDER BY score ASC,user_id ASC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_user_upuf,
LAG(score,1,0) OVER(PARTITION BY course ORDER BY score) AS lag_1_0
FROM student_score
ORDER BY course,
rn
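The query above exercises LAG but not LEAD. A short sketch of LEAD next to LAG, again assuming the student_score(user_id, course, score) table; the column aliases are illustrative only:

```sql
-- For each student, look one row ahead and two rows back within the course,
-- ordered from the highest score to the lowest.
SELECT user_id,
       course,
       score,
       -- next lower score; NULL on the last row of the partition
       LEAD(score)       OVER (PARTITION BY course ORDER BY score DESC) AS next_score,
       -- same, but 0 instead of NULL when there is no following row
       LEAD(score, 1, 0) OVER (PARTITION BY course ORDER BY score DESC) AS next_score_or_0,
       -- score two rows earlier; NULL on the first two rows of the partition
       LAG(score, 2)     OVER (PARTITION BY course ORDER BY score DESC) AS prev2_score
FROM student_score;
```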
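3.3 Aggregate window functions and syntax
The examples above all use the OVER clause; its syntax is summarized here.
OVER clause
1. OVER with the standard aggregate functions COUNT, SUM, MIN, MAX, and AVG
2. OVER with a PARTITION BY statement, using one or more partitioning columns of any primitive data type
3. OVER with PARTITION BY and ORDER BY, using one or more partitioning or ordering columns of any data type
4. OVER with a window specification, which supports the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING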
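Notes: (1) The meaning of ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit, that is, all the way to the partition boundary
UNBOUNDED PRECEDING: from the start of the partition
UNBOUNDED FOLLOWING: to the end of the partition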
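(2) ORDER BY without a window clause versus neither ORDER BY nor a window clause:
When ORDER BY is specified without a window clause, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.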
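(3) ROWS versus RANGE:
ROWS defines a physical window. It does not depend on the current row's ORDER BY key value, only on row positions after sorting: the frame is a range of rows.
RANGE defines a logical window. It depends on the current row's ORDER BY key value: the frame is a range of key values, and every row whose key falls in that range is included.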
SELECT user_id,
course,
score,
ROW_NUMBER() OVER(PARTITION BY course ORDER BY score ASC) AS rn,
-- Total score within the partition
SUM(score) OVER(PARTITION BY course) AS sum_p_score,
-- Running sum up to the current score value (rows tied on score are included)
SUM(score) OVER(PARTITION BY course ORDER BY score ASC) AS sum_po_score,
-- Same running sum, with the default frame for ORDER BY written out explicitly
SUM(score) OVER(PARTITION BY course ORDER BY score ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_po_range_score,
-- Running sum of score up to the current row (physical rows)
SUM(score) OVER(PARTITION BY course ORDER BY score ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_po_row_score,
-- Sum of score over the 2 preceding rows plus the current row
SUM(score) OVER(PARTITION BY course ORDER BY score ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sum_row_p2_score,
-- The summed column differs from the ORDER BY column
SUM(user_id) OVER(PARTITION BY course ORDER BY score ASC) AS sum_user
FROM student_score
ORDER BY course,
rn
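The ROWS versus RANGE distinction only shows up when the ORDER BY key has ties. A small sketch with hypothetical inline data (CTE plus UNION ALL, assuming Hive 0.13 or later) makes the difference concrete:

```sql
-- Hypothetical sample data: two rows tie at score 80.
WITH demo AS (
    SELECT 'u1' AS user_id, 'math' AS course, 70 AS score
    UNION ALL
    SELECT 'u2' AS user_id, 'math' AS course, 80 AS score
    UNION ALL
    SELECT 'u3' AS user_id, 'math' AS course, 80 AS score
    UNION ALL
    SELECT 'u4' AS user_id, 'math' AS course, 90 AS score
)
SELECT user_id,
       score,
       -- logical frame: all rows whose score is <= the current score
       SUM(score) OVER (PARTITION BY course ORDER BY score
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_range,
       -- physical frame: all rows up to and including the current row
       SUM(score) OVER (PARTITION BY course ORDER BY score
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_rows
FROM demo;
-- score 70, 80, 80, 90 gives:
--   sum_range: 70, 230, 230, 320   (both tied rows see 70 + 80 + 80)
--   sum_rows:  70, 150, 230, 320   (which tied row gets 150 is arbitrary)
```

This is also why a plain ORDER BY in the OVER clause, which defaults to the RANGE frame, can return the same running total for several rows.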
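3.4 Row/column conversion functions and syntax
Column to rows (split a delimited column into multiple rows, exposed as a new column): lateral view explode(split(column, ',')) num
Rows to column (merge the values of a column per key): concat_ws(',',collect_set(column))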
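The two patterns can be sketched as full queries. The tables user_tags(user_id, tags), holding a comma-separated string, and user_tag_rows(user_id, tag), holding one tag per row, are hypothetical names introduced only for illustration:

```sql
-- Column to rows: one output row per comma-separated element.
-- user_tags(user_id, tags) is a hypothetical table, e.g. tags = 'a,b,c'.
SELECT user_id,
       tag
FROM user_tags
LATERAL VIEW explode(split(tags, ',')) exploded AS tag;

-- Rows to column: merge all tags of a user back into one string.
-- user_tag_rows(user_id, tag) is a hypothetical table with a STRING tag column.
SELECT user_id,
       concat_ws(',', collect_set(tag)) AS tags
FROM user_tag_rows
GROUP BY user_id;
```

collect_set removes duplicate values, while collect_list keeps them; neither guarantees the order of elements in the merged string. concat_ws expects strings, so a non-string column would need a CAST first.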