1. Stream processing basics
- Structured Streaming: structured stream processing
Stream processing (Streaming) vs. batch processing (Batch)
- Batch processing (batch jobs) works on offline data; each run handles a large volume of data and is relatively slow
- Stream processing works on online, real-time data; each run handles a small volume of data and is relatively fast
Spark Streaming vs. Spark Structured Streaming
- Before Spark 2.0, the recommended streaming module was Spark Streaming; the current recommendation is Structured Streaming
- Spark Streaming is built on RDDs, and its core data structure is the DStream
- Structured Streaming is built on Spark SQL, and its core data structure is the DataFrame; most DataFrame APIs also work on streams, and Spark SQL's automatic query optimization gives it better performance (a comparison sketch follows the SparkSession setup below)
Case overview
- First, generate simulated real-time transaction data with make_streaming_data.py
- Then process the real-time transaction data with a streaming query
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
import time
# spark.sql.shuffle.partitions: number of partitions used for shuffles
# spark.default.parallelism: default number of parallel tasks
spark = SparkSession.builder \
    .appName("structured streaming") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.default.parallelism", "4") \
    .master("local[4]") \
    .enableHiveSupport() \
    .getOrCreate()
sc = spark.sparkContext
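To make the difference between the two modules concrete, here is a minimal word-count sketch in both APIs. It assumes a socket text source on localhost:9999, which is not part of this case study; only the style of each API matters here.
# Spark Streaming (RDD-based DStream API): explicit micro-batch interval,
# RDD-style operators such as flatMap/map/reduceByKey
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                              # 5-second micro-batches
lines_dstream = ssc.socketTextStream("localhost", 9999)    # DStream of text lines
counts_dstream = lines_dstream.flatMap(lambda line: line.split()) \
                              .map(lambda word: (word, 1)) \
                              .reduceByKey(lambda a, b: a + b)
counts_dstream.pprint()
# ssc.start(); ssc.awaitTermination()

# Structured Streaming (DataFrame-based): the same DataFrame/SQL API as batch jobs
lines_df = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()
counts_df = lines_df.select(F.explode(F.split(lines_df.value, " ")).alias("word")) \
                    .groupBy("word").count()
# counts_df.writeStream.outputMode("complete").format("console").start()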
2. Define the table schema
# Schema of the streaming DataFrame
user_id = T.StructField("user_id", T.IntegerType())
quantity = T.StructField("quantity", T.IntegerType())
price = T.StructField("price", T.IntegerType())
order_time = T.StructField("order_time", T.StringType())
schema = T.StructType([user_id, quantity, price, order_time])
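For reference, readStream.schema() also accepts a DDL-formatted string instead of a StructType; a minimal sketch of the equivalent string for the columns above (the variable name ddl_schema is just for illustration):
# Equivalent DDL-formatted schema string (same columns and types as `schema`)
ddl_schema = "user_id INT, quantity INT, price INT, order_time STRING"
# spark.readStream.schema(ddl_schema).json(...) behaves the same as using the StructType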
3. make_streaming_data.py (data generator)
import os
import shutil
import time
import random
import datetime
import json
path = "./data/streaming_input"
# Remove the directory if it already exists
if os.path.exists(path):
    shutil.rmtree(path)
# Create the directory
os.makedirs(path)
for i in range(100):
    user_id = random.choice(range(1, 4))
    quantity = random.choice(range(10, 100))
    price = random.choice(range(100, 1000))
    order_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    data = {
        "user_id": user_id,
        "quantity": quantity,
        "price": price,
        "order_time": order_time
    }
    # Write each order as a separate JSON file, one every 5 seconds
    file = os.path.join(path, str(i) + ".json")
    with open(file, "w") as f:
        json.dump(data, f)
    time.sleep(5)
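The streaming queries below must run while this generator is still producing files, so the loop above is saved as make_streaming_data.py and started separately. One possible way to launch it from the notebook (a sketch; it assumes the script sits in the working directory):
import subprocess
# Start the generator as a background process so the notebook stays free to run the streaming queries
generator = subprocess.Popen(["python", "make_streaming_data.py"])
# ... run the streaming queries while the generator keeps producing files ...
# generator.terminate()   # stop the generator when done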
4. Read the streaming data
# Input directory of the stream, i.e. the streaming data source
input_dir = "data/streaming_input"
df = spark.readStream.schema(schema).json(input_dir)
df.printSchema()
'''
root
|-- user_id: integer (nullable = true)
|-- quantity: integer (nullable = true)
|-- price: integer (nullable = true)
|-- order_time: string (nullable = true)
'''
# Queries with streaming sources must be executed with start();
# calling an action such as show() directly raises an AnalysisException
# df.show()
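Two things worth checking before writing the stream anywhere (a small sketch using the source defined above; maxFilesPerTrigger is a standard option of file-based sources that caps how many new files each micro-batch reads):
# isStreaming is True for a DataFrame backed by a streaming source
print(df.isStreaming)  # True
# Optionally throttle the file source so each micro-batch reads at most one new file
df_throttled = spark.readStream \
    .schema(schema) \
    .option("maxFilesPerTrigger", 1) \
    .json(input_dir)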
5. Write the stream to files
- A checkpoint directory (checkpointLocation) must be set
- If this is run before make_streaming_data has finished, the output will keep changing as new files arrive
# Output directory of the stream
output_dir = "data/streaming_output"
# Checkpoint directory: records progress so data is not processed twice
checkpoint = "data/checkpoint"
stream = df.writeStream.format('csv') \
    .option('checkpointLocation', checkpoint) \
    .option('path', output_dir) \
    .start()
# Let the query run for 20 seconds
time.sleep(20)
# Stop the streaming query
stream.stop()
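Once the query has stopped, the sink directory contains header-less CSV part files, so they can be read back as a normal batch DataFrame with the same schema to check what was written (a quick verification sketch using the paths above):
# Read the CSV output back as a static (batch) DataFrame and inspect it
result = spark.read.schema(schema).csv(output_dir)
result.show(5)
print(result.count())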
6. Write the stream to the console
- When running from Jupyter Notebook, the console output appears in the terminal window that launched the notebook, not in the notebook cell
total = df.groupBy('user_id').sum('quantity')
# outputMode controls what is emitted on each trigger; 'complete' writes the full aggregated result every time
stream = total.writeStream.outputMode('complete').format('console').start()
# Let the query run for 20 seconds
time.sleep(20)
# Stop the streaming query
stream.stop()
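By default the console sink fires a new micro-batch as soon as data is available, and 'complete' reprints the whole result every time. A variant of the same query (a sketch, assuming the session above) that fires every 5 seconds and uses the 'update' mode, which only prints the rows whose aggregates changed:
# Fire a micro-batch every 5 seconds; 'update' prints only user_ids whose sum changed
stream = total.writeStream \
    .outputMode('update') \
    .trigger(processingTime='5 seconds') \
    .format('console') \
    .start()
# Let the query run for 20 seconds, then stop it
time.sleep(20)
stream.stop()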
7. Write the stream to memory
# The memory sink registers the stream as an in-memory table named by queryName (here 'hive_table'), which can then be queried with Spark SQL
stream = df.writeStream.queryName('hive_table').outputMode('append').format('memory').start()
sql = 'select user_id, sum(quantity) from hive_table group by user_id'
total = spark.sql(sql)
total.show()
# Stop the streaming query manually
stream.stop()
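Because the in-memory table keeps growing while the query runs, the aggregation can also be re-run periodically to watch the totals change (a sketch; the query name hive_table_poll is only an illustrative name, not used elsewhere in this case):
stream = df.writeStream.queryName('hive_table_poll') \
    .outputMode('append').format('memory').start()
# Re-run the aggregation every 5 seconds while the stream is active
for _ in range(4):
    time.sleep(5)
    spark.sql(
        "select user_id, sum(quantity) as total_quantity "
        "from hive_table_poll group by user_id"
    ).show()
# Stop the streaming query
stream.stop()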