Spark SQL -- Functions
Spark SQL provides two kinds of functions to cover a wide range of user needs: built-in functions and user-defined functions (UDFs). Built-in functions are common routines predefined by Spark SQL; the complete list can be found in the built-in functions API documentation. When the built-in functions are not sufficient for the task at hand, UDFs allow users to define their own functions.
1. Built-in Functions
Spark SQL ships with commonly used built-in functions for aggregation, arrays/maps, dates/timestamps, and JSON data. This section lists these functions and describes their usage; a short usage sketch follows each category below.
1.1 Scalar Functions
- Array Functions
- Map Functions
- Date and Timestamp Functions
- JSON Functions
- Mathematical Functions
- String Functions
- Bitwise Functions
- Conversion Functions
- Conditional Functions
- Predicate Functions
- Csv Functions
- Misc Functions
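As a quick illustration, a few of these scalar functions can be called directly from a SQL expression. A minimal sketch (the literal inputs are arbitrary, and the local master is assumed only so the snippet runs standalone):
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
  .builder()
  .appName("Built-in scalar functions example")
  .master("local[*]")  // assumption: local run, for illustration only
  .getOrCreate();
// upper (string), array_contains (array) and date_format (date/timestamp)
spark.sql(
  "SELECT upper('spark') AS upper_value, " +
  "array_contains(array(1, 2, 3), 2) AS contains_two, " +
  "date_format(current_date(), 'yyyy-MM-dd') AS today"
).show();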
1.2 Aggregate-like Functions
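A minimal sketch of an aggregate-like function, reusing the `spark` session from the sketch above (the inline data is made up):
// avg collapses many input rows into a single output row
spark.sql("SELECT avg(col) AS average FROM VALUES (1), (2), (3) AS t(col)").show();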
1.3 Generator Functions
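A minimal sketch of a generator function, which does the opposite of an aggregate and turns one input row into multiple output rows (again reusing `spark` from above):
// explode emits one output row per array element
spark.sql("SELECT explode(array(10, 20, 30)) AS value").show();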
2. UDFs (User-Defined Functions)
2.1 Scalar User-Defined Functions (UDFs)
2.1.1 Description
User-defined functions (UDFs) are user-programmable routines that act on a single row. This section lists the classes required to create and register UDFs, and includes examples that demonstrate how to define, register, and invoke UDFs in Spark SQL.
2.1.2 UserDefinedFunction
To configure the properties of a user-defined function, the following methods of this class can be used (a short sketch of chaining them follows the list):
- asNonNullable(): UserDefinedFunction updates the UserDefinedFunction to non-nullable.
- asNondeterministic(): UserDefinedFunction updates the UserDefinedFunction to nondeterministic.
- withName(name: String): UserDefinedFunction updates the UserDefinedFunction with the given name.
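A minimal sketch of chaining these methods before registration, using the same imports as the example in 2.1.3 below (the function body and name are arbitrary):
UserDefinedFunction plusTen = udf(
  (UDF1<Integer, Integer>) x -> x + 10, DataTypes.IntegerType)
  .withName("plusTen")   // the name shown in query plans and result columns
  .asNonNullable();      // declares that the result is never null
spark.udf().register("plusTen", plusTen);
spark.sql("SELECT plusTen(5)").show();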
2.1.3 Examples
import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.types.DataTypes;
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL UDF scalar example")
.getOrCreate();
// Define and register a zero-argument non-deterministic UDF
// UDF is deterministic by default, i.e. produces the same result for the same input.
UserDefinedFunction random = udf(
() -> Math.random(), DataTypes.DoubleType
);
// asNondeterministic() returns a new UserDefinedFunction, so reassign the result
random = random.asNondeterministic();
spark.udf().register("random", random);
spark.sql("SELECT random()").show();
// +-------+
// |UDF() |
// +-------+
// |xxxxxxx|
// +-------+
// Define and register a one-argument UDF
spark.udf().register("plusOne",
(UDF1<Integer, Integer>) x -> x + 1, DataTypes.IntegerType);
spark.sql("SELECT plusOne(5)").show();
// +----------+
// |plusOne(5)|
// +----------+
// | 6|
// +----------+
// Define and register a two-argument UDF
UserDefinedFunction strLen = udf(
(String s, Integer x) -> s.length() + x, DataTypes.IntegerType
);
spark.udf().register("strLen", strLen);
spark.sql("SELECT strLen('test', 1)").show();
// +------------+
// |UDF(test, 1)|
// +------------+
// | 5|
// +------------+
// UDF in a WHERE clause
spark.udf().register("oneArgFilter",
(UDF1<Long, Boolean>) x -> x > 5, DataTypes.BooleanType);
spark.range(1, 10).createOrReplaceTempView("test");
spark.sql("SELECT * FROM test WHERE oneArgFilter(id)").show();
// +---+
// | id|
// +---+
// | 6|
// | 7|
// | 8|
// | 9|
// +---+
2.2 User-Defined Aggregate Functions (UDAFs)
2.2.1 Description
User-defined aggregate functions (UDAFs) are user-programmable routines that act on multiple rows at once and return a single aggregated value as a result. This section lists the classes required to create and register UDAFs, and includes examples that demonstrate how to define and register UDAFs in Java and invoke them in Spark SQL.
2.2.2 Aggregator[-IN, BUF, OUT]
The base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
IN - the input type for the aggregation.
BUF - the type of the intermediate value of the reduction.
OUT - the type of the final output result.
- bufferEncoder: Encoder[BUF] specifies the Encoder for the intermediate value type.
- finish(reduction: BUF): OUT transforms the output of the reduction.
- merge(b1: BUF, b2: BUF): BUF merges two intermediate values.
- outputEncoder: Encoder[OUT] specifies the Encoder for the final output value type.
- reduce(b: BUF, a: IN): BUF aggregates the input value a into the current intermediate value. For performance, the function may modify b and return it instead of constructing a new object for b.
- zero: BUF is the initial value of the intermediate result for this aggregation.
2.2.3 Examples
- Type-Safe User-Defined Aggregate Functions
User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract class. For example, a type-safe user-defined average looks like this:
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.TypedColumn;
import org.apache.spark.sql.expressions.Aggregator;
public static class Employee implements Serializable {
private String name;
private long salary;
// Constructors, getters, setters...
}
public static class Average implements Serializable {
private long sum;
private long count;
// Constructors, getters, setters...
}
public static class MyAverage extends Aggregator<Employee, Average, Double> {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
@Override
public Average zero() {
return new Average(0L, 0L);
}
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
@Override
public Average reduce(Average buffer, Employee employee) {
long newSum = buffer.getSum() + employee.getSalary();
long newCount = buffer.getCount() + 1;
buffer.setSum(newSum);
buffer.setCount(newCount);
return buffer;
}
// Merge two intermediate values
@Override
public Average merge(Average b1, Average b2) {
long mergedSum = b1.getSum() + b2.getSum();
long mergedCount = b1.getCount() + b2.getCount();
b1.setSum(mergedSum);
b1.setCount(mergedCount);
return b1;
}
// Transform the output of the reduction
@Override
public Double finish(Average reduction) {
return ((double) reduction.getSum()) / reduction.getCount();
}
// Specifies the Encoder for the intermediate value type
@Override
public Encoder<Average> bufferEncoder() {
return Encoders.bean(Average.class);
}
// Specifies the Encoder for the final output value type
@Override
public Encoder<Double> outputEncoder() {
return Encoders.DOUBLE();
}
}
Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
String path = "examples/src/main/resources/employees.json";
Dataset<Employee> ds = spark.read().json(path).as(employeeEncoder);
ds.show();
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
MyAverage myAverage = new MyAverage();
// Convert the function to a `TypedColumn` and give it a name
TypedColumn<Employee, Double> averageSalary = myAverage.toColumn().name("average_salary");
Dataset<Double> result = ds.select(averageSalary);
result.show();
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
The full example code can be found at "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java" in the Spark repo.
- Untyped User-Defined Aggregate Functions
Typed aggregations, as described above, can also be registered as untyped aggregating UDFs for use with DataFrames. For example, a user-defined average for untyped DataFrames looks like this:
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Aggregator;
import org.apache.spark.sql.functions;
public static class Average implements Serializable {
private long sum;
private long count;
// Constructors, getters, setters...
public Average() {
}
public Average(long sum, long count) {
this.sum = sum;
this.count = count;
}
public long getSum() {
return sum;
}
public void setSum(long sum) {
this.sum = sum;
}
public long getCount() {
return count;
}
public void setCount(long count) {
this.count = count;
}
}
public static class MyAverage extends Aggregator<Long, Average, Double> {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
@Override
public Average zero() {
return new Average(0L, 0L);
}
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
@Override
public Average reduce(Average buffer, Long data) {
long newSum = buffer.getSum() + data;
long newCount = buffer.getCount() + 1;
buffer.setSum(newSum);
buffer.setCount(newCount);
return buffer;
}
// Merge two intermediate values
@Override
public Average merge(Average b1, Average b2) {
long mergedSum = b1.getSum() + b2.getSum();
long mergedCount = b1.getCount() + b2.getCount();
b1.setSum(mergedSum);
b1.setCount(mergedCount);
return b1;
}
// Transform the output of the reduction
@Override
public Double finish(Average reduction) {
return ((double) reduction.getSum()) / reduction.getCount();
}
// Specifies the Encoder for the intermediate value type
@Override
public Encoder<Average> bufferEncoder() {
return Encoders.bean(Average.class);
}
// Specifies the Encoder for the final output value type
@Override
public Encoder<Double> outputEncoder() {
return Encoders.DOUBLE();
}
}
// Register the function to access it
spark.udf().register("myAverage", functions.udaf(new MyAverage(), Encoders.LONG()));
Dataset<Row> df = spark.read().json("examples/src/main/resources/employees.json");
df.createOrReplaceTempView("employees");
df.show();
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
Dataset<Row> result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees");
result.show();
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
The full example code can be found at "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedUntypedAggregation.java" in the Spark repo.
2.3 Integrating Hive UDFs/UDAFs/UDTFs
2.3.1 Description
Spark SQL supports the integration of Hive UDFs, UDAFs, and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs take a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result. In addition, Hive supports UDTFs (user-defined table functions), which take a single row as input and return multiple rows as output. To use Hive UDFs/UDAFs/UDTFs, users should register them in Spark and then use them in Spark SQL queries.
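On the Java side, this typically means creating a Hive-enabled session and issuing the registration statements through spark.sql(). A minimal sketch (the JAR path and the table `t` are hypothetical; GenericUDFAbs itself ships with Hive, so the ADD JAR line is only needed for your own UDFs):
SparkSession spark = SparkSession
  .builder()
  .appName("Hive UDF integration example")
  .enableHiveSupport()  // required so Spark can resolve Hive function classes
  .getOrCreate();
// Only needed when the function class lives in your own JAR (hypothetical path)
spark.sql("ADD JAR /path/to/yourHiveUDF.jar");
spark.sql("CREATE TEMPORARY FUNCTION testUDF AS " +
  "'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'");
spark.sql("SELECT testUDF(value) FROM t").show();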
2.3.2 Examples
Hive has two UDF interfaces: UDF and GenericUDF. The example below uses GenericUDFAbs, which is derived from GenericUDF.
-- Register `GenericUDFAbs` and use it in Spark SQL.
-- Note that, if you use your own programmed one, you need to add a JAR containing it
-- into a classpath,
-- e.g., ADD JAR yourHiveUDF.jar;
CREATE TEMPORARY FUNCTION testUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs';
SELECT * FROM t;
+-----+
|value|
+-----+
| -1.0|
| 2.0|
| -3.0|
+-----+
SELECT testUDF(value) FROM t;
+--------------+
|testUDF(value)|
+--------------+
| 1.0|
| 2.0|
| 3.0|
+--------------+
-- Register `UDFSubstr` and use it in Spark SQL.
-- Note that better performance can be achieved if the return types and method parameters use Java primitives,
-- e.g., UDFSubstr. The data processing path is UTF8String <-> Text <-> String; using primitives avoids the UTF8String <-> Text step.
CREATE TEMPORARY FUNCTION hive_substr AS 'org.apache.hadoop.hive.ql.udf.UDFSubstr';
select hive_substr('Spark SQL', 1, 5) as value;
+-----+
|value|
+-----+
|Spark|
+-----+
The example below uses GenericUDTFExplode, which is derived from GenericUDTF.
-- Register `GenericUDTFExplode` and use it in Spark SQL
CREATE TEMPORARY FUNCTION hiveUDTF
AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode';
SELECT * FROM t;
+------+
| value|
+------+
|[1, 2]|
|[3, 4]|
+------+
SELECT hiveUDTF(value) FROM t;
+---+
|col|
+---+
| 1|
| 2|
| 3|
| 4|
+---+
Hive has two UDAF interfaces: UDAF and GenericUDAFResolver. The example below uses GenericUDAFSum, which is derived from GenericUDAFResolver.
-- Register `GenericUDAFSum` and use it in Spark SQL
CREATE TEMPORARY FUNCTION hiveUDAF
AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
SELECT * FROM t;
+---+-----+
|key|value|
+---+-----+
| a| 1|
| a| 2|
| b| 3|
+---+-----+
SELECT key, hiveUDAF(value) FROM t GROUP BY key;
+---+---------------+
|key|hiveUDAF(value)|
+---+---------------+
| b| 3|
| a| 3|
+---+---------------+