
Apache Flink Stream Processing (Quick Start)

Flink Streaming

Overview

DataStream is Flink's abstraction for applying transformations to data streams. A DataStream can be created from various sources, such as message queues, socket streams, and files. The results of a streaming computation are emitted through sinks, for example by writing them to files or to standard output.
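
For orientation, here is a minimal sketch of such a pipeline (host name, port and job name are placeholders); the sections below flesh out each part:

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// source: read text lines from a socket (host and port are placeholders)
env.socketTextStream("CentOS", 9999)
  // transformations: split each line into words and count per word
  .flatMap(_.split(" "))
  .map((_, 1))
  .keyBy(_._1)
  .sum(1)
  // sink: print the running counts to standard output
  .print()
env.execute("quick start")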

Common dependencies

<properties>
  <flink.version>1.7.1</flink.version>
  <flink.scala.version>2.11</flink.scala.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
  </dependency>
</dependencies>

Flink Data Sources

Kafka source

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_${flink.scala.version}</artifactId>
  <version>${flink.version}</version>
</dependency>

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, 
                  "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")
env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.flatMap(line => for( i <- line.split(" ")) yield (i,1))
.keyBy(_._1)
.reduce((in1,in2)=>(in1._1,in1._2+in2._2))
.print()
env.execute("hello world")

Run the job:

[root@CentOS flink-1.7.1]# ./bin/flink run -d -c com.jiangzz.StreamingSateTests /root/flink_streaming-1.0-SNAPSHOT.jar

Starting execution of program
Job has been submitted with JobID 4e12f0860af48c592917194b2e2e481f

Cancel the job (taking a savepoint):

[root@CentOS flink-1.7.1]# ./bin/flink cancel -s /root/flink_savepoint 4e12f0860af48c592917194b2e2e481f

Resume the job from the savepoint:

[root@CentOS flink-1.7.1]# ./bin/flink run -d -s /root/flink_savepoint/savepoint-4e12f0-a49053766a81 -c com.jiangzz.StreamingSateTests /root/flink_streaming-1.0-SNAPSHOT.jar

Socket source

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.createLocalEnvironment()
env.socketTextStream("CentOS",9999)
.flatMap(line => for( i <- line.split(" ")) yield (i,1))
.keyBy(_._1)
.reduce((in1,in2)=>(in1._1,in1._2+in2._2))
.print()
env.execute("socket demo")

File system source

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.9.2</version>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.9.2</version>
</dependency>

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.createLocalEnvironment()
val text = env.readTextFile("hdfs://CentOS:9000/demo/csv")

val counts = text.flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty)
.map((_, 1))
.keyBy(0)
.sum(1)
counts.print()
env.execute("Window Stream WordCount")

Stream Operators

Basic operators

map/flatMap/filter

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

// order zhangsan TV,GAME
val env = StreamExecutionEnvironment.createLocalEnvironment()
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.filter(line => line.startsWith("order"))
.map(line => line.replace("order","").trim)
.flatMap(user =>for(i <- user.split(" ")(1).split(",")) yield (user.split(" ")(0),i))
.print()

env.execute("word counts")

Grouping operators

KeyBy

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig


val env = StreamExecutionEnvironment.createLocalEnvironment()
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

//001 zhansan 苹果 4.5 2 2018-10-01
//003 lisi 机械键盘 800 1 2018-01-23
//002 zhansan 橘子 2.5 2 2018-11-22
env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.map(line => {
  val user = line.split(" ")(1)
  val cost= line.split(" ")(3).toDouble * line.split(" ")(4).toInt
  (user,cost)
})
.keyBy(0)
.reduce((item1,item2)=>(item1._1,item1._2+item2._2))
.print()
env.execute("order counts")

Aggregation operators

reduce / fold / max|maxBy / min|minBy / sum
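
The examples below cover fold and sum. For max/maxBy (and analogously min/minBy), the difference is that max(1) only tracks the maximum of the given field, while maxBy(1) emits the whole record that contains the maximum. A minimal sketch, reusing the same Kafka setup and order format as the KeyBy example above:

import java.util.Properties

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

//001 zhansan 苹果 4.5 2 2018-10-01   (same order format as the KeyBy example)
env.addSource(new FlinkKafkaConsumer[String]("topic01", new SimpleStringSchema(), props))
.map(line => {
  val user = line.split(" ")(1)
  val cost = line.split(" ")(3).toDouble * line.split(" ")(4).toInt
  (user, cost)
})
.keyBy(0)
.maxBy(1)   // emits the whole (user, cost) record holding the largest cost seen so far per key
//.max(1)   // max(1) would only track the maximum cost field, keeping the other field from the first record
.print()
env.execute("maxBy demo")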

fold

import java.util.Properties

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.{KeyedStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig


val env = StreamExecutionEnvironment.createLocalEnvironment()
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.map(line => {
  val user = line.split(" ")(1)
  val cost = line.split(" ")(3).toDouble * line.split(" ")(4).toInt
  (user, cost)
})
.keyBy(0)
.fold(("",0.0))((z,t)=>{(t._1,t._2+z._2)})
.print()
env.execute("order counts")

sum

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

//001 zhansan 苹果 4.5 2 2018-10-01
//003 lisi 机械键盘 800 1 2018-01-23
//002 zhansan 橘子 2.5 2 2018-11-22

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.map(line => {
  val user = line.split(" ")(1)
  val cost = line.split(" ")(3).toDouble * line.split(" ")(4).toInt
  (user, cost)
})
.keyBy(0).sum(1)
.print()
env.execute("word counts")

Windows

Windows are at the heart of stream processing. A window splits the stream into finite-size "buckets" over which we can apply computations.

Basic concepts

Window Assigners

A window assigner defines how elements are assigned to windows: it is responsible for assigning each incoming element to one or more windows. Flink ships with predefined window assigners for the most common use cases, namely tumbling windows, sliding windows, session windows and global windows. You can also implement a custom window assigner by extending the WindowAssigner class. All built-in window assigners (except global windows) assign elements to windows based on time, which can be either processing time or event time. A time-based window has a start timestamp (inclusive) and an end timestamp (exclusive) that together describe its size. For time-based windows Flink uses TimeWindow, which has methods for querying the start and end timestamps as well as an additional method maxTimestamp() that returns the largest allowed timestamp for a given window.
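
For quick reference, the four built-in assigners are created as follows (a sketch only; the socket source, window sizes and gap are placeholders):

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.assigners.{GlobalWindows, ProcessingTimeSessionWindows, SlidingProcessingTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val keyed = env.socketTextStream("CentOS", 9999)   // host/port are placeholders
  .flatMap(_.split(" "))
  .map((_, 1))
  .keyBy(_._1)

keyed.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))                  // tumbling: fixed size
keyed.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))  // sliding: size, slide
keyed.window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))               // session: inactivity gap
keyed.window(GlobalWindows.create())                                              // global: requires a custom trigger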

Event Time

(Figure: assets/times_clocks-1551841367065.svg)

  • Processing time: the system time of the machine executing the operation; all time-based operations (such as time windows) use the system clock of the machine running the operator. The drawback is that in a distributed system the result is not deterministic, because window assignment has nothing to do with when the data was produced and depends only on when it happens to arrive at the processing node.
  • Event time: the time at which each individual event occurred on its producing device. This timestamp is usually assigned by the producing device before the event enters Flink and marks when the data was produced, so window assignment depends entirely on the timestamps carried by the data itself. Drawback: the user must supply a watermark, which adds latency. Advantage: out-of-order data can be handled correctly, which better matches the semantics of windowing.
  • Ingestion time: the time at which a record enters the Flink system.

Flink supports both Processing Time and Event Time for window assignment. For simplicity, the following examples use windows based on Processing Time.
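
The time characteristic is selected on the environment; a minimal sketch (ProcessingTime is the default, EventTime additionally requires assigning timestamps and watermarks as shown later):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime) // the default
// env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)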

Window types

Tumbling Windows

(Figure: assets/tumbling-windows.svg)

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.sum(1)
.print()

env.execute("word counts")

Sliding Windows

(Figure: assets/sliding-windows.svg)

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10),Time.seconds(5)))
.sum(1)
.print()

env.execute("word counts")

Session Windows

The session window assigner groups elements by sessions of activity. In contrast to tumbling and sliding windows, session windows do not overlap and have no fixed start and end time. A session window closes when it has not received any elements for a certain period of time, i.e. when a gap of inactivity occurs.

(Figure: assets/session-windows.svg)

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(0)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
.sum(1)
.print()

env.execute("word counts")

Global Windows

The global windows assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger; otherwise no computation is ever performed, because a global window has no natural end at which the aggregated elements could be processed.

(Figure: assets/non-windowed.svg)

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(0)
.window(GlobalWindows.create())
.trigger(new CustomTrigger())
.sum(1)
.print()

env.execute("word counts")

-----

import scala.collection.mutable

class CustomTrigger extends Trigger[(String,Int),GlobalWindow]{
  var map:mutable.HashMap[String,Long]= new mutable.HashMap[String,Long]()
  override def onElement(element: (String, Int), timestamp: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    if(!map.contains(element._1)){
      map.put(element._1,System.currentTimeMillis())
      TriggerResult.CONTINUE
    }else{
      var time:Long =System.currentTimeMillis() - map.getOrElse(element._1,0L)
      if(time <= 5000 ){
        TriggerResult.CONTINUE
      }else{
        map.remove(element._1)
        TriggerResult.FIRE_AND_PURGE
      }
    }
  }

  override def onProcessingTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    TriggerResult.CONTINUE
  }

  override def onEventTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    TriggerResult.CONTINUE
  }

  override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {}

}

Window functions

After the window assigner has been defined, a window function specifies the computation to perform on the elements of each window. The window function can be a ReduceFunction, AggregateFunction, FoldFunction or ProcessWindowFunction. The first two can be executed more efficiently because Flink can incrementally aggregate the elements of each window as they arrive. A ProcessWindowFunction instead receives an Iterable with all elements contained in the window, together with meta information about the window the elements belong to.

ReduceFunction

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((in1,in2)=>(in1._1,in1._2+in2._2))
.print()

env.execute("word counts")

AggregateFunction

AggregateFunction is a generalized version of ReduceFunction with three type parameters: an input type (IN), an accumulator type (ACC) and an output type (OUT).

import java.util.Properties

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.aggregate(new CustomAggregateFunction)
.print()

env.execute("word counts")

Custom AggregateFunction

class CustomAggregateFunction extends AggregateFunction[(String,Int),(String,Int),String]{
  // return the initial value of the accumulator
  override def createAccumulator(): (String,Int) = {
    ("",0)
  }
  // add an element to the accumulator
  override def add(value: (String, Int), accumulator: (String,Int)):(String,Int) = {
    (value._1,value._2+accumulator._2)
  }
  // merge two accumulators
  override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
    (a._1,a._2+b._2)
  }
  // compute the final result from the accumulator
  override def getResult(accumulator: (String, Int)): String = {
    accumulator._1+" -> "+accumulator._2
  }
}

FoldFunction

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10),Time.seconds(5)))
.fold(("",0))((i1,i2)=>(i2._1,i1._2+i2._2))
.print()

env.execute("word counts")

Note: fold() cannot be used with session windows or other mergeable windows.

ProcessWindowFunction

A ProcessWindowFunction gets an Iterable containing all elements of the window, as well as a Context object with access to time and state information, which gives it more flexibility than the other window functions. This comes at the cost of performance and resource consumption, because elements cannot be aggregated incrementally but instead have to be buffered internally until the window is considered ready for processing.

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.util.Collector
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(), props))
.flatMap(line => line.split(" "))
.map((_,1))
.keyBy(_._1)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10),Time.seconds(5)))
.process(new CustomProcessWindowFunction())
.print()

env.execute("word counts")

Custom ProcessWindowFunction

class CustomProcessWindowFunction extends ProcessWindowFunction[(String,Int),String,String,TimeWindow]{
  override def process(key: String, context: Context, elements: Iterable[(String, Int)], out: Collector[String]): Unit = {
    val tuple: (String, Int) = elements.reduce((i1,i2)=>(i1._1,i1._2+i2._2))
    out.collect(tuple._1+" => "+tuple._2)
  }
}

Event Time Window

As explained above, Flink supports two mechanisms for assigning elements to windows: Processing Time and Event Time. Because processing-time windows are assigned according to when the data arrives at the processing node, they cannot accurately describe what a window's result should be for the time at which the data was actually produced. Flink therefore also provides event-time windows. Their advantage is that the system can handle out-of-order data within a window and supports features such as allowed lateness and the collection of late data.

Watermark

import java.text.SimpleDateFormat

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.createLocalEnvironment()
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val watermarker = env.socketTextStream("CentOS", 9999)
.map(line => (line.split(",")(0), line.split(",")(1).toLong))
.assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, Long)] { // extract timestamps and generate watermarks
  var maxOutOfOrderness:Long=5000L
  var currentMaxTimestamp:Long=0
  // checkAndGetNextWatermark is called automatically for every incoming element
  override def checkAndGetNextWatermark(lastElement: (String, Long), extractedTimestamp: Long): Watermark = {
    new Watermark(currentMaxTimestamp-maxOutOfOrderness)
  }
  // extract the event timestamp
  override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
    var sf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    currentMaxTimestamp=Math.max(currentMaxTimestamp,element._2)
    println(element+"\tcurrentMaxTimestamp:"+sf.format(currentMaxTimestamp)+"\ttimestamp:"+sf.format(element._2)+"\t watermarker:"+sf.format(currentMaxTimestamp-maxOutOfOrderness))
    element._2
  }
})
watermarker.keyBy(_._1)
.timeWindow(Time.seconds(5)) // 5-second tumbling window
.apply((key:String,w:TimeWindow,values:Iterable[(String,Long)],out:Collector[String])=>{
  var sf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val tuples = values.toList.sortBy(_._2).map(t=> sf.format(t._2))
  var msg =tuples.mkString(" , ") +" size:"+tuples.size+" window:["+sf.format(w.getStart)+","+sf.format(w.getEnd)+"]"
  out.collect(msg)
})
.print()
env.execute("water marker")

Handling late data

Set the maximum allowed lateness: as long as window_end + allowedLateness > watermark, a window that has already fired will fire again when late elements arrive.

import java.text.SimpleDateFormat

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.createLocalEnvironment()
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val watermarker = env.socketTextStream("CentOS", 9999)

.map(line => (line.split(",")(0), line.split(",")(1).toLong))
.assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, Long)] { // extract timestamps and generate watermarks
  var maxOutOfOrderness:Long=5000L
  var currentMaxTimestamp:Long=0
  // checkAndGetNextWatermark is called automatically for every incoming element
  override def checkAndGetNextWatermark(lastElement: (String, Long), extractedTimestamp: Long): Watermark = {
    new Watermark(currentMaxTimestamp-maxOutOfOrderness)
  }
  // extract the event timestamp
  override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
    var sf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    currentMaxTimestamp=Math.max(currentMaxTimestamp,element._2)
    println(element+"\tcurrentMaxTimestamp:"+sf.format(currentMaxTimestamp)+"\ttimestamp:"+sf.format(element._2)+"\t watermarker:"+sf.format(currentMaxTimestamp-maxOutOfOrderness))
    element._2
  }
})
watermarker.keyBy(_._1)
.timeWindow(Time.seconds(5)) // 5-second tumbling window
.allowedLateness(Time.seconds(2))
.apply((key:String,w:TimeWindow,values:Iterable[(String,Long)],out:Collector[String])=>{
  var sf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val tuples = values.toList.sortBy(_._2).map(t=> sf.format(t._2))
  var msg =tuples.mkString(" , ") +" size:"+tuples.size+" window:["+sf.format(w.getStart)+","+sf.format(w.getEnd)+"]"
  out.collect(msg)
})
.print()
env.execute("water marker")

Capturing late data with sideOutputLateData

import java.text.SimpleDateFormat

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.{DataStream, OutputTag, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.createLocalEnvironment()
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val watermarker = env.socketTextStream("CentOS", 9999)

.map(line => (line.split(",")(0), line.split(",")(1).toLong))
.assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, Long)] { // extract timestamps and generate watermarks
  var maxOutOfOrderness:Long=5000L
  var currentMaxTimestamp:Long=0
  // checkAndGetNextWatermark is called automatically for every incoming element
  override def checkAndGetNextWatermark(lastElement: (String, Long), extractedTimestamp: Long): Watermark = {
    new Watermark(currentMaxTimestamp-maxOutOfOrderness)
  }
  // extract the event timestamp
  override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
    var sf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    currentMaxTimestamp=Math.max(currentMaxTimestamp,element._2)
    println(element+"\tcurrentMaxTimestamp:"+sf.format(currentMaxTimestamp)+"\ttimestamp:"+sf.format(element._2)+"\t watermarker:"+sf.format(currentMaxTimestamp-maxOutOfOrderness))
    element._2
  }
})
val window = watermarker.keyBy(_._1)
.timeWindow(Time.seconds(5)) // 5-second tumbling window
.allowedLateness(Time.seconds(2))
.sideOutputLateData(new OutputTag[(String, Long)]("lateData"))
.apply((key: String, w: TimeWindow, values: Iterable[(String, Long)], out: Collector[String]) => {
  var sf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val tuples = values.toList.sortBy(_._2).map(t => sf.format(t._2))
  var msg = tuples.mkString(" , ") + " size:" + tuples.size + " window:[" + sf.format(w.getStart) + "," + sf.format(w.getEnd) + "]"
  out.collect(msg)
})

window
.getSideOutput(new OutputTag[(String, Long)]("lateData"))
.print()
window.print()
env.execute("water marker")

Merging and splitting operators

Union

Merges streams; all streams being merged must have the same element type.

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

val env = StreamExecutionEnvironment.createLocalEnvironment()
val stream1: DataStream[String] = env.fromElements("a","b","c")
val stream2: DataStream[String] = env.fromElements("b","c","d")
stream1.union(stream2)
.print()
env.execute("union")

Connect

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.co.{ CoMapFunction}
import org.apache.flink.streaming.api.scala.{ DataStream, StreamExecutionEnvironment}

val env = StreamExecutionEnvironment.createLocalEnvironment()
val s1: DataStream[String] = env.socketTextStream("CentOS",9999)
val s2: DataStream[String] = env.socketTextStream("CentOS",8888)
s1.connect(s2).map(new CoMapFunction[String,String,String] {
  override def map1(value: String) = {
    value.split(" ")(0)+","+value.split(" ")(1)
  }
  override def map2(value: String) = {
    value.split(",")(0)+","+value.split(",")(1)
  }
}).map(line =>  (line.split(",")(0),line.split(",")(1).toDouble))
.keyBy(_._1)
.sum(1)
.print()
env.execute("connect demo")

Split/Select


import java.lang
import java.util

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.collector.selector.OutputSelector
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.createLocalEnvironment()
val split = env.socketTextStream("CentOS", 9999)
.split(new OutputSelector[String] {
  override def select(value: String): lang.Iterable[String] = {
    var list = new util.ArrayList[String]()
    if (value.contains("error")) {
      list.add("error")
    } else {
      list.add("info")
    }
    return list
  }
})
split.select("error").map(t=> "ERROR "+t).print()
split.select("info").map(t=> "INFO "+t).print()

env.execute("split demo")

Window Join

Tumbling Window Join

(Figure: assets/tumbling-window-join.svg)

Sliding Window Join

(Figure: assets/sliding-window-join.svg)

Session Window Join

(Figure: assets/session-window-join.svg)

Interval Join

(Figure: assets/interval-join.svg)

Join elements of two streams that fall into the same time window:

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{DataStream, OutputTag, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.createLocalEnvironment()
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

//001 张三 true ts
val s1 = env.socketTextStream("CentOS", 9999)
.map(line => (line.split(" ")(0),line.split(" ")(1),line.split(" ")(3).toLong))
.assignTimestampsAndWatermarks(new UserAssignerWithPunctuatedWatermarks)

//苹果,4.5,2,001,ts
val s2= env.socketTextStream("CentOS", 8888)
.map(line => (line.split(",")(3),line.split(",")(1).toDouble * line.split(",")(2).toInt ,line.split(",")(4).toLong))
.assignTimestampsAndWatermarks(new OrderAssignerWithPunctuatedWatermarks)

s1.join(s2).where(_._1).equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5))) // 5-second tumbling window
.allowedLateness(Time.seconds(2))
.apply((t1,t2,out:Collector[String])=>{
  out.collect(t1+" "+t2)
}).print()

env.execute("window join")
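
The two timestamp/watermark assigners referenced above are not shown in the original listing; a minimal sketch of what they could look like, reusing the AssignerWithPunctuatedWatermarks pattern from the watermark examples with an assumed 5-second out-of-orderness bound:

import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// sketch of the user-side assigner: element is (id, name, timestamp)
class UserAssignerWithPunctuatedWatermarks extends AssignerWithPunctuatedWatermarks[(String, String, Long)] {
  val maxOutOfOrderness = 5000L
  var currentMaxTimestamp = 0L
  override def checkAndGetNextWatermark(lastElement: (String, String, Long), extractedTimestamp: Long): Watermark =
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
  override def extractTimestamp(element: (String, String, Long), previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = Math.max(currentMaxTimestamp, element._3)
    element._3
  }
}

// sketch of the order-side assigner: element is (userId, cost, timestamp)
class OrderAssignerWithPunctuatedWatermarks extends AssignerWithPunctuatedWatermarks[(String, Double, Long)] {
  val maxOutOfOrderness = 5000L
  var currentMaxTimestamp = 0L
  override def checkAndGetNextWatermark(lastElement: (String, Double, Long), extractedTimestamp: Long): Watermark =
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
  override def extractTimestamp(element: (String, Double, Long), previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = Math.max(currentMaxTimestamp, element._3)
    element._3
  }
}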

State & Fault Tolerance

Stateful functions and operators in a streaming job can store computation state across the events they process, and such state is an indispensable building block for any kind of exact operation. Flink needs to know the state of its operators so that it can use checkpoints and savepoints for failure recovery and fault tolerance. Queryable State additionally allows external applications to query state while a Flink job is running. When state is used, Flink's state backend mechanism determines where the state information is stored: the computation state may live on the Java heap or off-heap, depending on the chosen state backend. Configuring a state backend does not affect the processing logic of the application.

Types of state

  • Keyed State: state scoped to a key, used with operations on a KeyedStream.
  • Operator State: state bound to a non-keyed operator instance. FlinkKafkaConsumer, for example, uses operator state: each Kafka consumer instance keeps track of the topic partitions and offsets it consumes.

Both kinds of state can be kept in managed or raw form. For managed state, the data structures ("ValueState", "ListState", etc.) are determined and managed by Flink, and Flink automatically persists the state via its checkpoint mechanism at runtime; all Flink functions support managed state. Raw state is only serialized to bytes at checkpoint time, so Flink knows nothing about its internal structure; it is used only when defining custom operators. Flink recommends managed state, because it can be redistributed when the parallelism of a job changes and allows better memory management.

Managed Keyed State

  • ValueState<T>: keeps a single value scoped to the current key; update(T) sets the value and T value() retrieves it.

  • ListState<T>: keeps a list of elements; add(T) or addAll(List) appends elements, Iterable get() retrieves them, and update(List) overwrites the list.

  • ReducingState<T>: keeps a single value representing the aggregate of all elements added so far; the ReduceFunction supplied at creation time defines the aggregation logic, and add(T) folds a new element into the aggregate.

  • AggregatingState<IN, OUT>: keeps a single value representing the aggregate of all elements added so far; the AggregateFunction supplied at creation time defines the aggregation logic, and add(IN) adds an element.

  • FoldingState<T, ACC>: keeps a single value representing the aggregate of all elements added so far. Unlike ReducingState, the aggregate type does not have to match the element type; the FoldFunction supplied at creation time defines the logic, and add(T) adds an element. FoldingState was deprecated in Flink 1.4 and is expected to be removed in a later release; use AggregatingState instead.

  • MapState<UK, UV>: keeps a set of mappings; put(UK, UV) and putAll(Map<UK, UV>) add entries, get(UK) retrieves a value, and entries(), keys() and values() iterate over the stored data.

All of these state types also provide a clear() method that removes the state for the current key. Note that these state objects can only be used inside stateful interfaces; the state is not necessarily stored inside the operator but may reside on disk or elsewhere, and the value obtained from a state object always depends on the key of the current input element.
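
The example below uses ReducingState. As a complementary illustration, here is a minimal sketch of the same word count written with ValueState (the state name "wordcount-value" is arbitrary); it would plug into the same keyBy(_._1).map(...) pipeline:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

class ValueStateWordCount extends RichMapFunction[(String, Int), (String, Int)] {
  @transient var countState: ValueState[(String, Int)] = _

  override def open(parameters: Configuration): Unit = {
    // register a per-key ValueState holding (word, running count)
    val descriptor = new ValueStateDescriptor[(String, Int)]("wordcount-value", createTypeInformation[(String, Int)])
    countState = getRuntimeContext.getState(descriptor)
  }

  override def map(value: (String, Int)): (String, Int) = {
    // value() returns null if nothing has been stored for this key yet
    val current = countState.value()
    val updated = if (current == null) value else (value._1, current._2 + value._2)
    countState.update(updated)
    updated
  }
}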

import org.apache.flink.api.common.functions.{ReduceFunction, RichMapFunction}
import org.apache.flink.api.common.state.{ReducingState, ReducingStateDescriptor}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.createLocalEnvironment()
env.socketTextStream("CentOS",9999)
.flatMap(line=>line.split(" "))
.map((_,1))
.keyBy(_._1)
.map(new RichMapFunction[(String,Int),(String,Int)] {
  @transient var reduceState:ReducingState[(String,Int)] = _
  override def map(value: (String, Int)):(String,Int) = {
    reduceState.add(value)
    reduceState.get()
  }
  override def open(parameters: Configuration) = {
    // initialize the reducing state descriptor and obtain the state handle
    val reduceDescriptor = new ReducingStateDescriptor[(String, Int)]("wordreduce", new ReduceFunction[(String, Int)]() {
      override def reduce(value1: (String, Int), value2: (String, Int)): (String, Int) = {
        (value1._1, value1._2 + value2._2)
      }
    }, createTypeInformation[(String, Int)])
    reduceState = getRuntimeContext.getReducingState(reduceDescriptor)
  }
})
.print()
env.execute("keyed state demo")

Managed Operator State

To use operator state, a function implements either the CheckpointedFunction interface or the ListCheckpointed interface.

  • CheckpointedFunction

This interface provides non-keyed state with different redistribution schemes; it requires implementing two methods:

void snapshotState(FunctionSnapshotContext context) throws Exception;
void initializeState(FunctionInitializationContext context) throws Exception;

Whenever Flink performs a checkpoint, snapshotState() is called. The counterpart initializeState() is called when the function is initialized; note that it is invoked both when the function is first created and when it is being restored from an earlier checkpoint.

Currently, list-style managed operator state is supported. The state is expected to be a list of serializable objects that are independent of each other and can therefore be redistributed when the parallelism changes. Two redistribution schemes are currently supported:

  • Even-split redistribution: each operator returns a list of state elements; the whole state is logically the concatenation of all lists. On restore, the elements are evenly redistributed across the parallel operator instances.
  • Union redistribution: each operator returns a list of state elements; the whole state is logically the concatenation of all lists. On restore/redistribution, every operator instance receives the complete list of state elements.

import java.lang

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.scala._
import scala.collection.JavaConversions._
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

import scala.collection.mutable.ListBuffer

class BufferingSink(threshold: Int = 0) extends SinkFunction[(String,Int)] with CheckpointedFunction{
  @transient
  private var checkpointedState: ListState[(String, Int)] = _
  private val bufferedElements = ListBuffer[(String, Int)]()

  override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
    bufferedElements += value
    if (bufferedElements.size == threshold) {
      for (element <- bufferedElements) {
        // send it to the sink
        print(element)
      }
      bufferedElements.clear()
    }
  }

  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    checkpointedState.clear()
    for (element <- bufferedElements) {
      checkpointedState.add(element)
    }
  }

  override def initializeState(context: FunctionInitializationContext): Unit = {
    val descriptor = new ListStateDescriptor[(String, Int)]("buffered-elements",
                                                            createTypeInformation[(String,Int)])

    checkpointedState = context.getOperatorStateStore.getListState(descriptor)
    // check whether we are restoring from a previous checkpoint
    if(context.isRestored) {
      val tuples: lang.Iterable[(String, Int)] = checkpointedState.get()
      for(v  <- tuples){
        bufferedElements += v
      }
    }
  }
}

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(line=>line.split(" "))
.map((_,1))
.addSink(new BufferingSink(10))
env.execute("Xxx demo")

[root@CentOS flink-1.7.1]# ./bin/flink list
Waiting for response...
------------------ Running/Restarting Jobs -------------------
08.03.2019 19:24:39 : 862c4b5a99a508cdd0e21c22e3befb60 : Xxx demo (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
[root@CentOS flink-1.7.1]# ./bin/flink cancel 862c4b5a99a508cdd0e21c22e3befb60 -s /root/savepoint
Cancelling job 862c4b5a99a508cdd0e21c22e3befb60 with savepoint to /root/savepoint.
Cancelled job 862c4b5a99a508cdd0e21c22e3befb60. Savepoint stored in file:/root/savepoint/savepoint-862c4b-bdeed30d80dd.

Here getListState(descriptor) uses even-split redistribution according to the task parallelism, whereas getUnionListState(descriptor) gives every instance a complete copy of the state; the code can be modified accordingly to test this, as sketched below.
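
For example, the only change in initializeState of the BufferingSink above would be (a sketch):

// union redistribution: every parallel instance receives the complete list on restore
checkpointedState = context.getOperatorStateStore.getUnionListState(descriptor)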

ListCheckpointed

Compared with CheckpointedFunction, this interface is more restrictive about how state is redistributed on recovery: it only supports the even-split scheme. It likewise defines two methods:

List<T> snapshotState(long checkpointId, long timestamp) throws Exception;
void restoreState(List<T> state) throws Exception;

Example 1

import java.{lang, util}
import java.util.Collections

import org.apache.flink.streaming.api.checkpoint.ListCheckpointed
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.collection.JavaConversions._

class CounterSource extends RichParallelSourceFunction[Long]  with ListCheckpointed[java.lang.Long] {

    @volatile
    private var isRunning = true

    private var offset = 0L

    override def snapshotState(checkpointId: Long, timestamp: Long): util.List[lang.Long] = {
        Collections.singletonList(offset)
    }

    override def restoreState(state: util.List[java.lang.Long]): Unit = {
        for (s <- state) {
            offset = s
        }
    }

    override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
        val lock = ctx.getCheckpointLock
        while (isRunning) {
            // output and state update are atomic
            lock.synchronized({
                Thread.sleep(1000)
                ctx.collect(offset)
                offset += 1
            })
        }
    }
    override def cancel(): Unit = {
        isRunning = false
    }
}

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

val env =StreamExecutionEnvironment.getExecutionEnvironment
env.addSource(new CounterSource)
.map(i=> i+" offset")
.print()
env.execute("test case")

Broadcast State

The state of one stream can be broadcast so that the other stream can keep a local copy of that state and use it while processing its own elements. Prerequisites for using broadcast state:

  • The broadcast state must be in map format (a MapStateDescriptor).
  • It connects one broadcast stream with one non-broadcast stream.
  • The broadcast stream may carry several broadcast states, but their names must be distinct.

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConversions._

val env = StreamExecutionEnvironment.createLocalEnvironment()

val stream1 = env.socketTextStream("localhost", 9999)
.flatMap(_.split(" "))
.map((_, 1))
.setParallelism(3)
.keyBy(_._1)

val mapStateDescriptor = new MapStateDescriptor[String,(String,Int)]("user state",BasicTypeInfo.STRING_TYPE_INFO,createTypeInformation[(String,Int)])

val stream2 = env.socketTextStream("localhost", 8888)
.map(line => (line.split(",")(0), line.split(",")(1).toInt))
.broadcast(mapStateDescriptor)

stream1.connect(stream2).process(new KeyedBroadcastProcessFunction[String,(String,Int),(String,Int),String]{
    override def processElement(in1: (String, Int), readOnlyContext: KeyedBroadcastProcessFunction[String, (String, Int), (String, Int), String]#ReadOnlyContext, collector: Collector[String]): Unit = {
        val mapBroadstate = readOnlyContext.getBroadcastState(mapStateDescriptor)
        println("in1:"+in1)
        println("---------state---------")
        for (i <- mapBroadstate.immutableEntries()){
            println(i.getKey+" "+i.getValue)
        }
    }

    override def processBroadcastElement(in2: (String, Int), context: KeyedBroadcastProcessFunction[String, (String, Int), (String, Int), String]#Context, collector: Collector[String]): Unit = {
        context.getBroadcastState(mapStateDescriptor).put(in2._1,in2)
    }
}).print()
env.execute("broadcast state test")

State Checkpointing, Backends & Savepoints

To make state fault tolerant, Flink needs to checkpoint it. Checkpoints allow Flink to recover both the state and the positions in the streams, giving the application the same semantics as a failure-free execution. Checkpointing is disabled by default; to enable it, call enableCheckpointing(n) on the StreamExecutionEnvironment, where n is the checkpoint interval in milliseconds.

savepoint

A savepoint is a consistent image of a streaming job's execution state, created via Flink's checkpointing mechanism. Savepoints can be used to stop-and-resume, fork, or upgrade Flink jobs. Conceptually, savepoints differ from checkpoints in the same way backups differ from the recovery log of a traditional database system: the main purpose of checkpoints is to provide a recovery mechanism for unexpected job failures, and their lifecycle is managed by Flink, i.e. Flink creates, owns and releases them without user interaction. Savepoints, in contrast, are created, owned and deleted by the user.

Triggering a savepoint

D:\flink-1.7.2\bin>flink savepoint ca66561b000bba339881b2c781216f27 hdfs://localhost:9000/savepoint

Checkpointing

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// take a checkpoint every 1000 ms
env.enableCheckpointing(1000)
// advanced options
// exactly-once mode (the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// make sure at least 500 ms pass between two checkpoints
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
// checkpoints must complete within one minute, otherwise they are discarded
env.getCheckpointConfig.setCheckpointTimeout(60000)
// a checkpoint failure must not fail the running job
env.getCheckpointConfig.setFailTasksOnCheckpointingErrors(false)
// allow only one checkpoint to be in progress at a time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)

Backends

Flink provides different state backends that determine how and where state is stored. State can live on the Java heap or off-heap; depending on the backend, Flink can also manage the application's state for you (handling memory management and spilling to disk if necessary), which allows applications to hold very large state. By default, the state backend for all Flink jobs is taken from the flink-conf.yaml configuration file, but it is usually overridden with a job-specific configuration:

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStateBackend(...)

(1) MemoryStateBackend: state is kept on the Java heap of the TaskManager; when a checkpoint is taken, the state snapshot is sent to the JobManager and kept in its memory. A memory-based backend is not recommended for production.

(2) FsStateBackend: working state is kept in the TaskManager's memory; when a checkpoint is taken, the state snapshot is written to the configured file system, which can be a distributed file system such as HDFS.

(3) RocksDBStateBackend: differs from the two above in that working state is maintained in a local RocksDB instance on the TaskManager's local file system. It also requires a remote file system URI (usually HDFS): on checkpoints the local data is copied to that file system, and on failover the state is restored from it back to the local instance. RocksDB removes the limitation that state must fit in memory while still persisting to a remote file system, so it is well suited for production use.

If you use RocksDB, you need to add the following dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-statebackend-rocksdb_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
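
A per-job configuration might look like the following sketch (the HDFS checkpoint paths are placeholders; in practice you would pick exactly one backend):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.runtime.state.memory.MemoryStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// (1) heap state, snapshots kept in the JobManager memory -- for local testing only
env.setStateBackend(new MemoryStateBackend())
// (2) heap state, snapshots written to a (distributed) file system
env.setStateBackend(new FsStateBackend("hdfs://CentOS:9000/flink/checkpoints"))
// (3) RocksDB on local disk, snapshots copied to the file system; needs flink-statebackend-rocksdb
env.setStateBackend(new RocksDBStateBackend("hdfs://CentOS:9000/flink/checkpoints", true))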

Queryable State

In a nutshell, this feature exposes Flink's managed (partitioned) keyed state to the outside world and allows users to query a job's state from outside Flink. To enable queryable state on a Flink cluster, copy flink-queryable-state-runtime_2.11-1.7.2.jar from the opt/ folder of the Flink distribution into the lib/ folder; otherwise the feature is not enabled. Also add the following settings to flink-conf.yaml:

query.server.ports: 50100-50200
query.server.network-threads: 3
query.server.query-threads: 3

query.proxy.ports: 50300-50400
query.proxy.network-threads: 3
query.proxy.query-threads: 3

After starting the Flink cluster, verify that it is running with queryable state enabled by checking any TaskManager log for a line like: "Started Queryable State Proxy Server @ ...".

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-queryable-state-client-java_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

import org.apache.flink.api.common.functions.{ReduceFunction}
import org.apache.flink.api.common.state.{ReducingStateDescriptor}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
val reduceFunction = new ReduceFunction[(String, Int)] {
  override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = (v1._1, v1._2 + v2._2)
}
val reducestate = new ReducingStateDescriptor[(String, Int)]("reducestate",
                                                             reduceFunction,
                                                             createTypeInformation[(String, Int)])
env.socketTextStream("localhost",8888)
.flatMap(_.split(" "))
.map((_,1))
.keyBy(_._1)
.asQueryableState("WordCount",reducestate)

env.execute("state query")

Querying the state

import org.apache.flink.api.common.JobID
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.ReducingStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.scala._
import org.apache.flink.queryablestate.client.QueryableStateClient

var reduceFunction = new ReduceFunction[(String, Int)] {
    override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
        (v1._1,v1._2+v2._2)
    }
}
val reducestate = new ReducingStateDescriptor("reducestate",reduceFunction
                                              , createTypeInformation[(String, Int)])
var client = new QueryableStateClient("localhost",50300)
val jobID = JobID.fromHexString("93e1b5127416047541257f9c9dc29c34")
val result = client.getKvState(jobID,
                               "WordCount", "this",
                               BasicTypeInfo.STRING_TYPE_INFO, reducestate)
println(result.get().get())

Flink Data Sinks

Hadoop FileSystem

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>${hadoop.version}</version>
</dependency>

import java.util.Properties

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.consumer.ConsumerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
.flatMap(line => for( i <- line.split(" ")) yield (i,1))
.keyBy(_._1)
.reduce((in1,in2)=>(in1._1,in1._2+in2._2))
.writeAsText("hdfs://CentOS:9000/result",WriteMode.OVERWRITE).setParallelism(1)

env.execute("word counts")

Note: the write*() methods on DataStream are mainly intended for debugging. They do not participate in Flink's checkpointing, which means they typically provide at-least-once semantics. When data is flushed to the target system depends on the OutputFormat implementation, so not every element sent to the OutputFormat is immediately visible in the target system, and in case of failure such records may be lost. For reliable, exactly-once delivery of a stream to a file system, use flink-connector-filesystem; custom sink implementations added via .addSink(...) can also participate in Flink's checkpointing for exactly-once semantics.

Writing to a file system (exactly-once)

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-filesystem_${flink.scala.version}</artifactId>
  <version>${flink.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-compatibility_${flink.scala.version}</artifactId>
  <version>${flink.version}</version>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>${hadoop.version}</version>
</dependency>

import java.time.ZoneId

import org.apache.flink.api.scala._

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.fs.{SequenceFileWriter, StringWriter}
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
import org.apache.hadoop.io.{IntWritable,Text}
import org.apache.flink.api.java.tuple.Tuple2

val env = StreamExecutionEnvironment.createLocalEnvironment()

val bucketingSink = new BucketingSink[Tuple2[Text, IntWritable]]("hdfs://CentOS:9000/res3")
bucketingSink.setBucketer(new DateTimeBucketer[Tuple2[Text, IntWritable]]("yyyy-MM-dd",ZoneId.of("Asia/Shanghai")))
bucketingSink.setWriter(new StringWriter[Tuple2[Text, IntWritable]]())
bucketingSink.setBatchSize(1024 * 1024 * 128) // this is 128 MB,
bucketingSink.setBatchRolloverInterval(10 * 60 * 1000); // this is 10 mins

val stream: DataStream[Tuple2[Text,IntWritable]] = env.socketTextStream("CentOS", 9999)
.flatMap(line => for (i <- line.split(" ")) yield (i, 1))
.keyBy(_._1)
.reduce((in1, in2) => (in1._1, in1._2 + in2._2))
.map(in => {
  val value = new Tuple2[Text,IntWritable](new Text(in._1),new IntWritable(in._2))
  value
})
stream.addSink(bucketingSink)
env.execute("write to bucket")

Redis Sink

<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>flink-connector-redis_2.11</artifactId>
  <version>1.0</version>
</dependency>

Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig

val env = StreamExecutionEnvironment.getExecutionEnvironment

val conf = new FlinkJedisPoolConfig.Builder().setHost("CentOS").setPort(6379).build()

env.socketTextStream("localhost",8888)
.flatMap(_.split(" "))
.map((_,1))
.keyBy(_._1)
.sum(1)
.addSink(new RedisSink[(String, Int)](conf, new RedisWordMapper))
env.execute("state test")

RedisMapper

import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

class RedisWordMapper  extends RedisMapper[(String,Int)]{
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET,"wordcounts")
  }

  override def getKeyFromData(t: (String, Int)): String = {
    t._1
  }

  override def getValueFromData(t: (String, Int)): String = {
    t._2.toString
  }
}

Cluster

FlinkJedisClusterConfig conf = new FlinkJedisClusterConfig.Builder()
    .setNodes(new HashSet<InetSocketAddress>(Arrays.asList(new InetSocketAddress(5601)))).build();

Sentinel

val conf = new FlinkJedisSentinelConfig.Builder().setMasterName("master").setSentinels(...).build()
stream.addSink(new RedisSink[(String, String)](conf, new RedisExampleMapper))

Kafka Sink

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.7.2</version>
</dependency>

import java.util.Properties

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.kafka.clients.producer.ProducerConfig

val env = StreamExecutionEnvironment.createLocalEnvironment()

val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092,CentOS:9093,CentOS:9094")
val producer = new FlinkKafkaProducer[String]("topic01", new SimpleStringSchema(), props)

env.socketTextStream("localhost",8888)
.flatMap(_.split(" "))
.map((_,1))
.keyBy(_._1).sum(1)
.map(t => t._1+","+t._2).addSink(producer)
env.execute("state test")