Spark SQL: Functions


Spark SQL provides two kinds of functions to meet a wide range of user needs: built-in functions and user-defined functions (UDFs). Built-in functions are common routines predefined by Spark SQL; the complete list can be found in the built-in function API documentation. When the built-in functions are not sufficient for the task at hand, UDFs allow users to define their own functions.

1. Built-in Functions

Spark SQL provides a number of commonly used built-in functions for aggregation, arrays/maps, dates/timestamps, and JSON data. This section describes their usage.

1.1 Scalar Functions
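
Scalar functions return a single value for each input row. As a minimal, illustrative sketch (the literal values here are arbitrary), a few common scalar functions can be called directly from SQL:

spark.sql("SELECT abs(-3) AS abs_val, upper('spark') AS upper_val, length('Spark SQL') AS len").show();
// abs_val = 3, upper_val = SPARK, len = 9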

1.2 Aggregate-like Functions
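
Aggregate functions operate on a group of rows and reduce them to a single value per group. A minimal sketch using an inline VALUES table (the data is made up for illustration):

spark.sql(
  "SELECT dept, count(*) AS cnt, avg(salary) AS avg_salary " +
  "FROM VALUES ('a', 10), ('a', 20), ('b', 30) AS t(dept, salary) " +
  "GROUP BY dept").show();
// dept=a: cnt=2, avg_salary=15.0; dept=b: cnt=1, avg_salary=30.0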

1.3 Generator Functions
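
Generator functions such as explode produce multiple output rows from a single input row. A minimal sketch with an arbitrary array literal:

spark.sql("SELECT explode(array(10, 20, 30)) AS value").show();
// returns three rows: 10, 20, 30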

2. UDFs (User-Defined Functions)

2.1 Scalar User-Defined Functions (UDFs)

2.1.1 Description

User-defined functions (UDFs) are user-programmable routines that act on one row. This section lists the classes required to create and register UDFs, and includes examples that demonstrate how to define, register, and invoke UDFs in Spark SQL.

2.1.2 UserDefinedFunction

To define the properties of a user-defined function, you can use the following methods of this class (a short sketch follows the list):

  • asNonNullable(): UserDefinedFunction
    Updates the UserDefinedFunction to non-nullable.
  • asNondeterministic(): UserDefinedFunction
    Updates the UserDefinedFunction to nondeterministic.
  • withName(name: String): UserDefinedFunction
    Updates the UserDefinedFunction with the given name.
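
For example (a minimal sketch; the function name toUpper and the literal input are made up for illustration), withName and asNonNullable can be chained onto the result of udf before registering it:

import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical UDF: uppercases a string, named "toUpper", declared non-nullable
UserDefinedFunction toUpper = udf(
  (UDF1<String, String>) s -> s.toUpperCase(), DataTypes.StringType
).withName("toUpper").asNonNullable();
spark.udf().register("toUpper", toUpper);
spark.sql("SELECT toUpper('spark')").show();
// returns SPARK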

2.1.3 Examples

import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.types.DataTypes;

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL UDF scalar example")
  .getOrCreate();

// Define and register a zero-argument non-deterministic UDF
// UDF is deterministic by default, i.e. produces the same result for the same input.
// asNondeterministic() returns a new UserDefinedFunction rather than mutating in place,
// so chain the call instead of discarding its result.
UserDefinedFunction random = udf(
  () -> Math.random(), DataTypes.DoubleType
).asNondeterministic();
spark.udf().register("random", random);
spark.sql("SELECT random()").show();
// +-------+
// |UDF()  |
// +-------+
// |xxxxxxx|
// +-------+

// Define and register a one-argument UDF
spark.udf().register("plusOne",
  (UDF1<Integer, Integer>) x -> x + 1, DataTypes.IntegerType);
spark.sql("SELECT plusOne(5)").show();
// +----------+
// |plusOne(5)|
// +----------+
// |         6|
// +----------+

// Define and register a two-argument UDF
UserDefinedFunction strLen = udf(
  (String s, Integer x) -> s.length() + x, DataTypes.IntegerType
);
spark.udf().register("strLen", strLen);
spark.sql("SELECT strLen('test', 1)").show();
// +------------+
// |UDF(test, 1)|
// +------------+
// |           5|
// +------------+

// UDF in a WHERE clause
spark.udf().register("oneArgFilter",
  (UDF1<Long, Boolean>) x -> x > 5, DataTypes.BooleanType);
spark.range(1, 10).createOrReplaceTempView("test");
spark.sql("SELECT * FROM test WHERE oneArgFilter(id)").show();
// +---+
// | id|
// +---+
// |  6|
// |  7|
// |  8|
// |  9|
// +---+
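
Registered UDFs can also be invoked through the DataFrame API. A minimal sketch (reusing the oneArgFilter UDF and the test view registered above) that is equivalent to the WHERE-clause query:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Filter the "test" view with the registered UDF via a Column expression
spark.table("test")
  .filter(callUDF("oneArgFilter", col("id")))
  .show();
// same four rows as above: 6, 7, 8, 9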

2.2 User-Defined Aggregate Functions (UDAFs)

2.2.1 Description

User-defined aggregate functions (UDAFs) are user-programmable routines that act on multiple rows at once and return a single aggregated value as a result. This section lists the classes required to create and register UDAFs, and includes examples that demonstrate how to define UDAFs in Java, register them, and invoke them in Spark SQL.

2.2.2 Aggregator[-IN, BUF, OUT]

The base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
IN - The input type for the aggregation.
BUF - The type of the intermediate value of the reduction.
OUT - The type of the final output result.

  • bufferEncoder: Encoder[BUF]
    Specifies the Encoder for the intermediate value type.
  • finish(reduction: BUF): OUT
    Transforms the output of the reduction.
  • merge(b1: BUF, b2: BUF): BUF
    Merges two intermediate values.
  • outputEncoder: Encoder[OUT]
    Specifies the Encoder for the final output value type.
  • reduce(b: BUF, a: IN): BUF
    Aggregates the input value a into the current intermediate value. For performance, the function may modify b and return it instead of constructing a new object for b.
  • zero: BUF
    The initial value of the intermediate result for this aggregation.

2.2.3 Examples

  • Type-Safe User-Defined Aggregate Functions
    User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract class. For example, a type-safe user-defined average looks like this:
import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.TypedColumn;
import org.apache.spark.sql.expressions.Aggregator;

public static class Employee implements Serializable {
  private String name;
  private long salary;

  // Constructors, getters, setters...

}

public static class Average implements Serializable  {
  private long sum;
  private long count;

  // Constructors, getters, setters...

}

public static class MyAverage extends Aggregator<Employee, Average, Double> {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  @Override
  public Average zero() {
    return new Average(0L, 0L);
  }
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  @Override
  public Average reduce(Average buffer, Employee employee) {
    long newSum = buffer.getSum() + employee.getSalary();
    long newCount = buffer.getCount() + 1;
    buffer.setSum(newSum);
    buffer.setCount(newCount);
    return buffer;
  }
  // Merge two intermediate values
  @Override
  public Average merge(Average b1, Average b2) {
    long mergedSum = b1.getSum() + b2.getSum();
    long mergedCount = b1.getCount() + b2.getCount();
    b1.setSum(mergedSum);
    b1.setCount(mergedCount);
    return b1;
  }
  // Transform the output of the reduction
  @Override
  public Double finish(Average reduction) {
    return ((double) reduction.getSum()) / reduction.getCount();
  }
  // Specifies the Encoder for the intermediate value type
  @Override
  public Encoder<Average> bufferEncoder() {
    return Encoders.bean(Average.class);
  }
  // Specifies the Encoder for the final output value type
  @Override
  public Encoder<Double> outputEncoder() {
    return Encoders.DOUBLE();
  }
}

Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
String path = "examples/src/main/resources/employees.json";
Dataset<Employee> ds = spark.read().json(path).as(employeeEncoder);
ds.show();
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

MyAverage myAverage = new MyAverage();
// Convert the function to a `TypedColumn` and give it a name
TypedColumn<Employee, Double> averageSalary = myAverage.toColumn().name("average_salary");
Dataset<Double> result = ds.select(averageSalary);
result.show();
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

See "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java" in the Spark repo for the complete example code.

  • Untyped User-Defined Aggregate Functions
    Typed aggregations, as described above, can also be registered as untyped aggregating UDFs for use with DataFrames. For example, a user-defined average for untyped DataFrames looks like this:
import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Aggregator;
import org.apache.spark.sql.functions;

public static class Average implements Serializable  {
  private long sum;
  private long count;

  // Constructors, getters, setters...
  public Average() {
  }

  public Average(long sum, long count) {
    this.sum = sum;
    this.count = count;
  }

  public long getSum() {
    return sum;
  }

  public void setSum(long sum) {
    this.sum = sum;
  }

  public long getCount() {
    return count;
  }

  public void setCount(long count) {
    this.count = count;
  }
}

public static class MyAverage extends Aggregator<Long, Average, Double> {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  @Override
  public Average zero() {
    return new Average(0L, 0L);
  }
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  @Override
  public Average reduce(Average buffer, Long data) {
    long newSum = buffer.getSum() + data;
    long newCount = buffer.getCount() + 1;
    buffer.setSum(newSum);
    buffer.setCount(newCount);
    return buffer;
  }
  // Merge two intermediate values
  @Override
  public Average merge(Average b1, Average b2) {
    long mergedSum = b1.getSum() + b2.getSum();
    long mergedCount = b1.getCount() + b2.getCount();
    b1.setSum(mergedSum);
    b1.setCount(mergedCount);
    return b1;
  }
  // Transform the output of the reduction
  @Override
  public Double finish(Average reduction) {
    return ((double) reduction.getSum()) / reduction.getCount();
  }
  // Specifies the Encoder for the intermediate value type
  @Override
  public Encoder<Average> bufferEncoder() {
    return Encoders.bean(Average.class);
  }
  // Specifies the Encoder for the final output value type
  @Override
  public Encoder<Double> outputEncoder() {
    return Encoders.DOUBLE();
  }
}

// Register the function to access it
spark.udf().register("myAverage", functions.udaf(new MyAverage(), Encoders.LONG()));

Dataset<Row> df = spark.read().json("examples/src/main/resources/employees.json");
df.createOrReplaceTempView("employees");
df.show();
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

Dataset<Row> result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees");
result.show();
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

See "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedUntypedAggregation.java" in the Spark repo for the complete example code.

2.3 Integration with Hive UDFs/UDAFs/UDTFs

2.3.1 Description

Spark SQL supports the integration of Hive UDFs, UDAFs, and UDTFs. Similar to Spark UDFs and UDAFs, a Hive UDF takes a single row as input and produces a single row as output, while a Hive UDAF operates on multiple rows and returns a single aggregated row as a result. In addition, Hive supports UDTFs (user-defined table functions), which take one row as input and return multiple rows as output. To use Hive UDFs/UDAFs/UDTFs, users should register them in Spark and then use them in Spark SQL queries.

2.3.2 Examples

Hive has two UDF interfaces: UDF and GenericUDF. The example below uses GenericUDFAbs, which is derived from GenericUDF.

-- Register `GenericUDFAbs` and use it in Spark SQL.
-- Note that, if you use your own programmed one, you need to add a JAR containing it
-- into a classpath,
-- e.g., ADD JAR yourHiveUDF.jar;
CREATE TEMPORARY FUNCTION testUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs';

SELECT * FROM t;
+-----+
|value|
+-----+
| -1.0|
|  2.0|
| -3.0|
+-----+

SELECT testUDF(value) FROM t;
+--------------+
|testUDF(value)|
+--------------+
|           1.0|
|           2.0|
|           3.0|
+--------------+

-- Register `UDFSubstr` and use it in Spark SQL.
-- Note that better performance can be achieved if the return types and method parameters use Java primitives,
-- e.g., UDFSubstr. The data processing path is UTF8String <-> Text <-> String; we can avoid the UTF8String <-> Text step.
CREATE TEMPORARY FUNCTION hive_substr AS 'org.apache.hadoop.hive.ql.udf.UDFSubstr';

SELECT hive_substr('Spark SQL', 1, 5) AS value;
+-----+
|value|
+-----+
|Spark|
+-----+

The example below uses GenericUDTFExplode, which is derived from GenericUDTF.

-- Register `GenericUDTFExplode` and use it in Spark SQL
CREATE TEMPORARY FUNCTION hiveUDTF
    AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode';

SELECT * FROM t;
+------+
| value|
+------+
|[1, 2]|
|[3, 4]|
+------+

SELECT hiveUDTF(value) FROM t;
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  4|
+---+

Hive has two UDAF interfaces: UDAF and GenericUDAFResolver. The example below uses GenericUDAFSum, which is derived from GenericUDAFResolver.

-- Register `GenericUDAFSum` and use it in Spark SQL
CREATE TEMPORARY FUNCTION hiveUDAF
    AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';

SELECT * FROM t;
+---+-----+
|key|value|
+---+-----+
|  a|    1|
|  a|    2|
|  b|    3|
+---+-----+

SELECT key, hiveUDAF(value) FROM t GROUP BY key;
+---+---------------+
|key|hiveUDAF(value)|
+---+---------------+
|  b|              3|
|  a|              3|
+---+---------------+