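The agg function in Spark SQL aggregates over the entire DataFrame without grouping; df.agg(...) is shorthand for df.groupBy().agg(...).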

1. agg(expr: Column, exprs: Column*): returns a DataFrame; evaluates the given aggregate Column expressions
df.agg(max("age"), avg("salary"))
df.groupBy().agg(max("age"), avg("salary"))
2. agg(exprs: Map[String, String]): returns a DataFrame; same computation, specified as a Map from column name to aggregate function name
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
3. agg(aggExpr: (String, String), aggExprs: (String, String)*): returns a DataFrame; same computation, specified as (column name, aggregate function name) pairs
df.agg("age" -> "max", "salary" -> "avg")
df.groupBy().agg("age" -> "max", "salary" -> "avg")
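The three overloads above can be compared side by side. The following is a minimal sketch, meant to be pasted into spark-shell (:paste) or run as a Scala script with Spark on the classpath; the sample rows and the names people, byColumn, byMap and byPairs are made up for illustration and do not come from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

// In spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("agg-overloads").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the two columns used in the list above.
val people = Seq((30, 100.0), (40, 300.0)).toDF("age", "salary")

// 1. Column expressions
val byColumn = people.agg(max("age"), avg("salary"))
// 2. Map from column name to aggregate function name
val byMap = people.agg(Map("age" -> "max", "salary" -> "avg"))
// 3. (column name, aggregate function name) pairs
val byPairs = people.agg("age" -> "max", "salary" -> "avg")

// Each call yields a single-row DataFrame holding the same values.
assert(byColumn.collect().toSeq == byMap.collect().toSeq)
assert(byColumn.collect().toSeq == byPairs.collect().toSeq)
byColumn.show()
```

The Column form is usually the most flexible of the three, since it allows aliases and arbitrary expressions; note that a Map can hold only one aggregate per column name, because its keys are unique.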
Example 1:
scala> spark.version
res2: String = 2.0.2

scala> case class Test(bf: Int, df: Int, duration: Int, tel_date: Int)
defined class Test

scala> val df = Seq(Test(1, 1, 1, 1), Test(1, 1, 2, 2), Test(1, 1, 3, 3), Test(2, 2, 3, 3), Test(2, 2, 2, 2), Test(2, 2, 1, 1)).toDF
df: org.apache.spark.sql.DataFrame = [bf: int, df: int ... 2 more fields]
 
scala> df.show
+---+---+--------+--------+
| bf| df|duration|tel_date|
+---+---+--------+--------+
|  1|  1|       1|       1|
|  1|  1|       2|       2|
|  1|  1|       3|       3|
|  2|  2|       3|       3|
|  2|  2|       2|       2|
|  2|  2|       1|       1|
+---+---+--------+--------+
 
 
scala> df.groupBy("bf", "df").agg(("duration", "sum"), ("tel_date", "min"), ("tel_date", "max")).show()
+---+---+-------------+-------------+-------------+
| bf| df|sum(duration)|min(tel_date)|max(tel_date)|
+---+---+-------------+-------------+-------------+
|  2|  2|            6|            1|            3|
|  1|  1|            6|            1|            3|
+---+---+-------------+-------------+-------------+
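Note: the resulting DataFrame no longer contains the original duration and tel_date columns; it has only the groupBy keys and the aggregated columns from agg. The original df itself is unchanged.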
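The note above can be checked by inspecting the result's column list. The sketch below rebuilds the rows of Example 1 from tuples and uses the Column overload with alias to give the aggregated columns readable names; the names calls, summary, total_duration, first_tel_date and last_tel_date are hypothetical and chosen only for this illustration. It is meant for spark-shell (:paste) or a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min, sum}

// In spark-shell, getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("agg-columns").getOrCreate()
import spark.implicits._

// Same rows as Example 1, built from tuples instead of the Test case class.
val calls = Seq(
  (1, 1, 1, 1), (1, 1, 2, 2), (1, 1, 3, 3),
  (2, 2, 3, 3), (2, 2, 2, 2), (2, 2, 1, 1)
).toDF("bf", "df", "duration", "tel_date")

// Group by the two key columns and name each aggregate explicitly.
val summary = calls.groupBy("bf", "df").agg(
  sum("duration").alias("total_duration"),
  min("tel_date").alias("first_tel_date"),
  max("tel_date").alias("last_tel_date")
)

// Only the grouping keys and the aliased aggregates remain in the result.
println(summary.columns.mkString(", "))
// The input DataFrame still has its original four columns.
println(calls.columns.mkString(", "))
summary.show()
```

If the raw duration or tel_date values are needed alongside the aggregates, one option is to join summary back to calls on the grouping keys.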

Example 2 (PySpark):
import pyspark.sql.functions as func
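# df is assumed to be a DataFrame with an event_time column; the same call also works after df.groupBy(...)
df.agg(func.max("event_time").alias("max_event_tm"), func.min("event_time").alias("min_event_tm"))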