Flink watermark
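1. Introduction
A Flink watermark is essentially a special element in a DataStream, and each watermark carries a timestamp. When a watermark with timestamp T appears, it signals that all data with event time t <= T has arrived, so only data with event time t > T should flow in after it. In other words, the watermark is the criterion Flink uses to decide whether data is late, and it is also the marker that triggers windows. Its purpose is to handle out-of-order data in real-time streams, which is usually done by combining watermarks with windows.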
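2. When a Watermark Triggers a Window
As mentioned above, out-of-order data is handled with watermark + window. So when should a window fire?
For event-time processing, Flink's default firing conditions are:
For out-of-order and in-order data:
the watermark timestamp >= window_end_time
there is data in [window_start_time, window_end_time).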
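For late elements (records that arrive after their window has already fired):
the window fires again when such a record arrives, provided allowedLateness is configured and the watermark has not yet passed window_end_time + allowedLateness.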
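A watermark acts like an end line: once the watermark reaches or passes a window's end_time, every window whose end_time is at or before the watermark starts computing.
That is, we compute the watermark according to some rule and build in a delay to give late data a chance. Put plainly: for late data, I will wait for you only so long, and if you still have not arrived, you have missed your chance.
Watermark time can be based on Flink's system (wall-clock) time or on the event time carried by the records being processed.
To sum up, a watermark triggers a window when:
1: watermark time >= window_end_time, i.e. max(timestamp, currentMaxTimestamp, ...) - maxOutOfOrderness >= window_end_time
2: there is data in [window_start_time, window_end_time)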
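The two conditions above can be sketched in a few lines of plain Java. This is only an illustration of the rule as stated in this article, not Flink's trigger code: the class and method names are made up for the example, and the window alignment assumes non-negative timestamps and no window offset. Flink's own EventTimeTrigger compares the watermark against the last millisecond that belongs to the window (end - 1); the sketch keeps the simplified ">= window_end_time" form used here.

```java
/** Illustrative sketch of the firing rule described above; not Flink's actual trigger code. */
final class WindowFiringSketch {

    /** Start of the tumbling window that contains the timestamp (assumes timestamp >= 0, no offset). */
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - (timestamp % windowSize);
    }

    /** A window [start, end) fires once the watermark reaches its end and it holds at least one element. */
    static boolean shouldFire(long watermark, long windowEnd, int elementsInWindow) {
        return watermark >= windowEnd && elementsInWindow > 0;
    }

    public static void main(String[] args) {
        long windowSize = 5_000L;        // 5-second tumbling windows
        long maxOutOfOrderness = 3_000L; // the watermark lags the max timestamp by 3 seconds
        long start = windowStart(12_000L, windowSize); // the event at 12000 falls into [10000, 15000)
        long end = start + windowSize;
        long[] maxTimestampsSeen = {12_000L, 16_000L, 18_000L};
        for (long maxTimestamp : maxTimestampsSeen) {
            long watermark = maxTimestamp - maxOutOfOrderness;
            System.out.println("maxTs=" + maxTimestamp + " watermark=" + watermark
                    + " fire=" + shouldFire(watermark, end, 1));
        }
    }
}
```

With a 3-second bound, the window [10000, 15000) does not fire when the largest timestamp seen is 16000 (watermark 13000); it fires once an event with timestamp 18000 pushes the watermark to 15000.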
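The handling of out-of-order events can be summarized as follows:
A window groups the stream into bounded chunks so that data can be processed periodically.
A watermark is a safeguard against out-of-order data (which is common): it keeps a window from being evaluated before the data belonging to its event-time range has had a chance to arrive.
allowedLateness postpones the final closing of a window by an additional period.
sideOutputLateData is the last line of defense: any record that arrives after its window has been closed for good is sent to a side output stream.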
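The interplay of watermark, allowedLateness and side output summarized above can be written as a small decision function. The following is a minimal plain-Java sketch under the simplified boundaries used in this article (Flink itself works with the window's last millisecond, so the exact cut-offs differ by one millisecond); the class, enum and method names are hypothetical.

```java
/** Illustrative sketch: what happens to a record, given the state of the window it belongs to. */
final class LateDataSketch {

    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, SIDE_OUTPUT }

    /**
     * @param windowEnd       end of the window the record belongs to
     * @param watermark       current watermark when the record arrives
     * @param allowedLateness extra time the window state is kept after the window first fires
     */
    static Fate classify(long windowEnd, long watermark, long allowedLateness) {
        if (watermark < windowEnd) {
            return Fate.ON_TIME;            // the window has not fired yet
        }
        if (watermark < windowEnd + allowedLateness) {
            return Fate.LATE_BUT_ACCEPTED;  // the window already fired, but its state is still kept
        }
        return Fate.SIDE_OUTPUT;            // the window is closed for good
    }

    public static void main(String[] args) {
        long windowEnd = 15_000L;
        long allowedLateness = 2_000L;
        long[] watermarks = {14_000L, 16_000L, 17_000L};
        for (long watermark : watermarks) {
            System.out.println("watermark=" + watermark + " -> "
                    + classify(windowEnd, watermark, allowedLateness));
        }
    }
}
```

For a window ending at 15000 with 2 seconds of allowed lateness, a record for that window is on time at watermark 14000, late but still accepted at 16000, and goes to the side output at 17000.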
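3. Ways to Generate Watermarks
3.1 Punctuated Watermark
A punctuated watermark is generated when certain special marker events appear in the data stream. With this approach, window firing does not depend on a time interval but on when a marker event is received.
In production, the punctuated approach generates a large number of watermarks when TPS is high, which puts some pressure on downstream operators, so it is usually chosen only when the real-time requirements are very strict.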
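import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// MyEvent is the application's event type; it is assumed to expose getCreationTime and hasWatermarkMarker.
class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[MyEvent] {
  // Extract the event time from the record.
  override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
    element.getCreationTime
  }
  // Called for every record: return a watermark only for marker events, otherwise null.
  override def checkAndGetNextWatermark(lastElement: MyEvent, extractedTimestamp: Long): Watermark = {
    if (lastElement.hasWatermarkMarker()) new Watermark(extractedTimestamp) else null
  }
}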
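Here extractTimestamp extracts the event time from a message, and checkAndGetNextWatermark checks whether the event is a punctuation (marker) event and, if so, generates a new watermark. Unlike periodic watermarks, where getCurrentWatermark is called on a timer, checkAndGetNextWatermark is called for every event received. If the return value is non-null and the new watermark is greater than the current one, the watermark advances, which may trigger window computation.
Note: in the extreme case, every increasing event time in the stream produces a watermark, which is why this approach becomes costly at high TPS.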
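To make the per-event check concrete, here is a minimal plain-Java simulation of the behavior just described: a watermark is emitted only for marker events, and only if it is greater than the current watermark. It does not use the Flink API; the Event class and the method name are hypothetical stand-ins for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of punctuated watermark emission; not the Flink API. */
final class PunctuatedSketch {

    /** Hypothetical event: a timestamp plus a flag saying whether it carries a watermark marker. */
    static final class Event {
        final long timestamp;
        final boolean marker;

        Event(long timestamp, boolean marker) {
            this.timestamp = timestamp;
            this.marker = marker;
        }
    }

    /** Returns, in order, the watermarks that would actually advance the current watermark. */
    static List<Long> emittedWatermarks(List<Event> events) {
        List<Long> emitted = new ArrayList<>();
        long currentWatermark = Long.MIN_VALUE;
        for (Event event : events) {
            // Checked once per event, like checkAndGetNextWatermark.
            if (event.marker && event.timestamp > currentWatermark) {
                currentWatermark = event.timestamp;
                emitted.add(currentWatermark);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(1_000L, false),
                new Event(2_000L, true),
                new Event(1_500L, true),  // marker, but older than the current watermark: ignored
                new Event(3_000L, false),
                new Event(4_000L, true));
        System.out.println(emittedWatermarks(events)); // prints [2000, 4000]
    }
}
```

If every event carried a marker, the list would grow by one entry per increasing timestamp, which is the overhead the note above warns about.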
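3.2 Periodic Watermark
A watermark is generated periodically (at a fixed time interval or after a given number of records). The interval at which the watermark advances is set by the user and applies whether or not new messages arrive. The messages that flow in between two watermark updates are used to compute the new watermark.
In production, periodic generation should combine both dimensions, elapsed time and accumulated record count; otherwise latency can become very large in extreme cases.
For example, the simplest watermark algorithm takes the maximum event time seen so far. This is rather crude: it tolerates very little disorder and tends to produce many late events. The generator below therefore subtracts a fixed out-of-orderness bound from the maximum timestamp.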
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {
  val maxOutOfOrderness = 3500L // 3.5 seconds
  // Highest event timestamp seen so far; starts at 0 so the subtraction below cannot underflow.
  var currentMaxTimestamp: Long = 0L
  override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
    val timestamp = element.getCreationTime()
    currentMaxTimestamp = math.max(timestamp, currentMaxTimestamp)
    timestamp
  }
  override def getCurrentWatermark(): Watermark = {
    // return the watermark as current highest timestamp minus the out-of-orderness bound
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
  }
}
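Here extractTimestamp extracts the event time from a message, and getCurrentWatermark generates the new watermark; a new watermark takes effect only if it is greater than the current one. Each parallel instance of the operator has its own instance of this class, so member variables can hold state, such as the current maximum timestamp in the example above.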
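To see how the out-of-orderness bound plays out on a concrete sequence of timestamps, here is a minimal plain-Java sketch of the same bookkeeping. It is not the Flink interface: the class and method names are invented for the example, and the watermark is read after every event, whereas a periodic generator would only be asked for it at the configured interval.

```java
/** Plain-Java sketch of "max timestamp minus a fixed bound"; not the Flink API. */
final class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrderness;
    private long currentMaxTimestamp;

    BoundedOutOfOrdernessSketch(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
        // Start high enough that the subtraction in currentWatermark() cannot underflow.
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrderness;
    }

    /** Track the largest event timestamp seen so far. */
    void onEvent(long timestamp) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, timestamp);
    }

    /** Watermark = largest timestamp seen minus the out-of-orderness bound. */
    long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness;
    }

    /** True if a record with this timestamp would arrive behind the current watermark. */
    boolean isBehindWatermark(long timestamp) {
        return timestamp <= currentWatermark();
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(3_500L);
        long[] arrivals = {10_000L, 12_000L, 9_500L, 16_000L, 11_000L};
        for (long timestamp : arrivals) {
            boolean behind = generator.isBehindWatermark(timestamp);
            generator.onEvent(timestamp);
            System.out.println("event=" + timestamp + " behindWatermark=" + behind
                    + " watermark=" + generator.currentWatermark());
        }
    }
}
```

In this run the out-of-order event at 9500 is still ahead of the watermark (8500), so the bound absorbs it. The event at 11000 arrives after 16000 has pushed the watermark to 12500, so it is behind the watermark.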
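4. The new watermark generation strategy in Flink 1.11 and later: WatermarkStrategy
Before Flink 1.11, there were two ways to generate watermarks, AssignerWithPunctuatedWatermarks and AssignerWithPeriodicWatermarks, both of which extend the TimestampAssigner interface. To avoid duplicated code, Flink 1.11 reworked the watermark generation interfaces: watermarks are now configured uniformly through the assignTimestampsAndWatermarks method, which takes a WatermarkStrategy object.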
assignTimestampsAndWatermarks(WatermarkStrategy<T>)
4.1 WatermarkStrategy source code:
@Public
public interface WatermarkStrategy<T> extends
TimestampAssignerSupplier<T>, WatermarkGeneratorSupplier<T> {
/**
* Instantiates a WatermarkGenerator that generates watermarks according to this strategy.
*/
@Override
WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);
/**
* Instantiates a {@link TimestampAssigner} for assigning timestamps according to this
* strategy.
*/
@Override
default TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
// By default, this is {@link RecordTimestampAssigner},
// for cases where records come out of a source with valid timestamps, for example from Kafka.
return new RecordTimestampAssigner<>();
}
// ------------------------------------------------------------------------
// Builder methods for enriching a base WatermarkStrategy
// ------------------------------------------------------------------------
/**
* Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
* {@link TimestampAssigner} (via a {@link TimestampAssignerSupplier}).
*
* <p>You can use this when a {@link TimestampAssigner} needs additional context, for example
* access to the metrics system.
*
* <pre>
* {@code WatermarkStrategy<Object> wmStrategy = WatermarkStrategy
* .forMonotonousTimestamps()
* .withTimestampAssigner((ctx) -> new MetricsReportingAssigner(ctx));
* }</pre>
*/
default WatermarkStrategy<T> withTimestampAssigner(TimestampAssignerSupplier<T> timestampAssigner) {
checkNotNull(timestampAssigner, "timestampAssigner");
return new WatermarkStrategyWithTimestampAssigner<>(this, timestampAssigner);
}
/**
* Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
* {@link SerializableTimestampAssigner}.
*
* <p>You can use this in case you want to specify a {@link TimestampAssigner} via a lambda
* function.
*
* <pre>
* {@code WatermarkStrategy<CustomObject> wmStrategy = WatermarkStrategy
* .forMonotonousTimestamps()
* .withTimestampAssigner((event, timestamp) -> event.getTimestamp());
* }</pre>
*/
default WatermarkStrategy<T> withTimestampAssigner(SerializableTimestampAssigner<T> timestampAssigner) {
checkNotNull(timestampAssigner, "timestampAssigner");
return new WatermarkStrategyWithTimestampAssigner<>(this,
TimestampAssignerSupplier.of(timestampAssigner));
}
/**
* Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
* created {@link WatermarkGenerator}.
*
* <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
* stream for that amount of time, then that partition is considered "idle" and will not hold
* back the progress of watermarks in downstream operators.
*
* <p>Idleness can be important if some partitions have little data and might not have events
* during some periods. Without idleness, these streams can stall the overall event time
* progress of the application.
*/
default WatermarkStrategy<T> withIdleness(Duration idleTimeout) {
checkNotNull(idleTimeout, "idleTimeout");
checkArgument(!(idleTimeout.isZero() || idleTimeout.isNegative()),
"idleTimeout must be greater than zero");
return new WatermarkStrategyWithIdleness<>(this, idleTimeout);
}
// ------------------------------------------------------------------------
// Convenience methods for common watermark strategies
// ------------------------------------------------------------------------
/**
* Creates a watermark strategy for situations with monotonously ascending timestamps.
*
* <p>The watermarks are generated periodically and tightly follow the latest
* timestamp in the data. The delay introduced by this strategy is mainly the periodic interval
* in which the watermarks are generated.
*
* @see AscendingTimestampsWatermarks
*/
static <T> WatermarkStrategy<T> forMonotonousTimestamps() {
return (ctx) -> new AscendingTimestampsWatermarks<>();
}
/**
* @see BoundedOutOfOrdernessWatermarks
*/
static <T> WatermarkStrategy<T> forBoundedOutOfOrderness(Duration maxOutOfOrderness) {
return (ctx) -> new BoundedOutOfOrdernessWatermarks<>(maxOutOfOrderness);
}
/**
* Creates a watermark strategy based on an existing {@link WatermarkGeneratorSupplier}.
*/
static <T> WatermarkStrategy<T> forGenerator(WatermarkGeneratorSupplier<T> generatorSupplier) {
return generatorSupplier::createWatermarkGenerator;
}
/**
* Creates a watermark strategy that generates no watermarks at all. This may be useful in
* scenarios that do pure processing-time based stream processing.
*/
static <T> WatermarkStrategy<T> noWatermarks() {
return (ctx) -> new NoWatermarksGenerator<>();
}
}
Bounded-out-of-orderness (fixed delay) watermarks, configured after creating a source such as Kafka:
wordSource.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // let the watermark lag by up to 5 seconds
        .withTimestampAssigner((event, timestamp) -> event.f1));
4.2 Watermarks for monotonically increasing timestamps:
dataStream.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
Demo that generates watermarks with WatermarkStrategy:
package it.kenn.eventtime;
import com.alibaba.fastjson.JSONObject;
import it.kenn.util.DateUtils;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Iterator;
import java.util.Properties;
/**
* Mainly covers event time and watermarks
*/
public class EventTimeDemo {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(6);
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "1test_34fldink182ddddd344356");
properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
SingleOutputStreamOperator<JSONObject> kafkaSource = env.addSource(new FlinkKafkaConsumer<>("metric-topic", new SimpleStringSchema(), properties)).map(JSONObject::parseObject);
kafkaSource
.assignTimestampsAndWatermarks(WatermarkStrategy
.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // watermark strategy
.withTimestampAssigner((record, ts) -> {
DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
// LocalDateTime parse = LocalDateTime.parse(record.getString("@timestamp"), pattern).plusHours(8);
// return parse.toInstant(ZoneOffset.of("+8")).toEpochMilli();
return DateUtils.parseStringToLong(record.getString("@timestamp"),pattern,8, ChronoUnit.HOURS);
}) // parse the event time
.withIdleness(Duration.ofMinutes(1)) // how to treat idle streams, i.e. a source that delivers no data for a while
)
.keyBy(new KeySelector<JSONObject, String>() {
@Override
public String getKey(JSONObject record){
if (record.containsKey("process") && record.getJSONObject("process").containsKey("name")){
return record.getJSONObject("process").getString("name");
}else {
return "unknown-process";
}
}
})
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
// The four type parameters are the input type, the output type, the key and TimeWindow; this process function receives all the data in the 5-second window
.process(new ProcessWindowFunction<JSONObject, Tuple2<String,Long>, String, TimeWindow>() {
@Override
public void process(String key, Context context, Iterable<JSONObject> iterable, Collector<Tuple2<String,Long>> collector) throws Exception {
String time = null;
Long ts = 0L;
Iterator<JSONObject> iterator = iterable.iterator();
if (iterator.hasNext()){
JSONObject next = iterator.next();
time = next.getString("@timestamp");
DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
// time = LocalDateTime.parse(time, pattern).plusHours(8).toString().replace("T"," ");
ts = DateUtils.parseStringToLong(time, pattern, 8, ChronoUnit.HOURS);
}
collector.collect(new Tuple2<>(key,ts));
}
})
.print();
// kafkaSource.print();
env.execute();
}
}
package it.kenn.util;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalUnit;
/**
* Time utility class
*
* @author kenn
* 2020-11-25 23:10
*/
public final class DateUtils {
public static Long parseStringToLong(String time, DateTimeFormatter pattern, int offset, TemporalUnit unit) {
// DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
LocalDateTime dateTime = null;
if (offset > 0){
dateTime = LocalDateTime.parse(time, pattern).plus(offset, unit);
}else if (offset < 0){
dateTime = LocalDateTime.parse(time, pattern).minus(Math.abs(offset), unit);
}else {
dateTime = LocalDateTime.parse(time, pattern);
}
return dateTime.toInstant(ZoneOffset.of("+8")).toEpochMilli();
}
public static Long parseStringToLong(String time, DateTimeFormatter pattern) {
return parseStringToLong(time, pattern, 0, null);
}
public static Long parseStringToLong(String time) {
return parseStringToLong(time, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"));
}
public static LocalDateTime parseStringToDateTime(String time, DateTimeFormatter pattern) {
return LocalDateTime.parse(time, pattern);
}
public static LocalDateTime parseStringToDateTime(String time) {
return parseStringToDateTime(time, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"));
}
}
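The demo above parses the "@timestamp" field with a pattern that treats the trailing Z as a literal, then shifts by 8 hours and converts at offset +8, which nets out to the UTC instant. As a hedged alternative for strings that are plain ISO-8601 UTC instants, java.time.Instant.parse gives the same epoch milliseconds directly. The sketch below compares the two; the class and method names are made up for illustration.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Sketch comparing two ways to turn an ISO-8601 UTC string into epoch milliseconds. */
final class TimestampParsingSketch {

    /** Direct route: Instant.parse understands the trailing 'Z' as UTC. */
    static long viaInstant(String isoUtc) {
        return Instant.parse(isoUtc).toEpochMilli();
    }

    /** Route taken in the demo: parse as local time, add 8 hours, then interpret at offset +8. */
    static long viaShiftedLocalDateTime(String isoUtc) {
        DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        return LocalDateTime.parse(isoUtc, pattern)
                .plusHours(8)
                .toInstant(ZoneOffset.of("+8"))
                .toEpochMilli();
    }

    public static void main(String[] args) {
        String sample = "1970-01-01T00:00:01.500Z";
        System.out.println(viaInstant(sample));              // prints 1500
        System.out.println(viaShiftedLocalDateTime(sample)); // prints 1500
    }
}
```

Both routes agree for this format. The direct one has no hard-coded offset that must stay in sync with the shift.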
4.3 Periodic watermarks
public class MonkeyPeriodicWatermarkGenerator implements WatermarkGenerator<Tuple2<String, Long>> {
    // The watermark only moves forward, so we always keep the largest event time seen so far
    private long currentTimestamp;
    // Maximum out-of-orderness allowed, in milliseconds
    private long maxOutOfOrderness = 3000;
    @Override
    public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
        currentTimestamp = Math.max(event.f1, currentTimestamp);
    }
    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Emit the watermark, lagging by the allowed out-of-orderness
        output.emitWatermark(new Watermark(currentTimestamp - maxOutOfOrderness));
    }
}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
ExecutionConfig config = env.getConfig();
// Set the watermark interval to 1 second, i.e. a watermark is inserted into the stream every second
config.setAutoWatermarkInterval(1000);
DataStreamSource<Tuple2<String, Long>> wordSource = env.addSource(new RichSourceFunction<Tuple2<String, Long>>() {
private volatile Boolean isCancel;
private int totalCount;
@Override
public void open(Configuration parameters) throws Exception {
this.isCancel = false;
this.totalCount = 0;
}
@Override
public void run(SourceContext<Tuple2<String, Long>> ctx) throws Exception {
while(!this.isCancel) {
String word = RandomStringUtils.randomAlphabetic(10);
ctx.collect(Tuple2.of(word, System.currentTimeMillis()));
this.totalCount++;
if(this.totalCount % 100 == 0) {
TimeUnit.SECONDS.sleep(1);
}
}
}
@Override
public void cancel() {
this.isCancel = true;
}
});
SingleOutputStreamOperator<Tuple2<String, Long>> wordWithTsDS =
wordSource.assignTimestampsAndWatermarks(new WatermarkStrategy<Tuple2<String, Long>>() {
@Override
public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new MonkeyPeriodicWatermarkGenerator();
}
@Override
public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return (event, ts) -> event.f1;
}
});
wordWithTsDS.map(tuple -> tuple.f0)
.map(word -> Tuple2.of(word, 1), TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}))
.keyBy(wordAndCnt -> wordAndCnt.f0)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.reduce((wc1, wc2) -> Tuple2.of(wc1.f0, wc1.f1 + wc2.f1)).name("reduce")
.print();
env.execute("Flink Eventtime and Watermark");
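Punctuated watermark
Next, let me simulate a punctuated watermark in code. The Source needs the following change: the messages it emits may or may not carry a timestamp. Whenever we detect a timestamp, we emit a watermark immediately.
First, the watermark here is driven by punctuated events: as soon as the second field of the tuple is not -1, a watermark is emitted. Note one small detail when extracting the event time: the first time around there is no event time yet, so the default is Long.MIN_VALUE and the system raises an error, so we fall back to 0.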
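public class PunctuatedWatermarkGenerator
        implements WatermarkGenerator<Tuple2<String, Long>>, TimestampAssigner<Tuple2<String, Long>> {
    @Override
    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
        // Before extracting the event time, check whether the timestamp field is -1
        if (element.f1 != -1) {
            return element.f1;
        } else {
            // No timestamp in the element: fall back to the timestamp already attached to the record.
            // recordTimestamp is Long.MIN_VALUE when none has been assigned, so use 0 in that case.
            return recordTimestamp > 0 ? recordTimestamp : 0;
        }
    }
    @Override
    public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
        // Emit a watermark as soon as an element carries a timestamp
        if (event.f1 != -1) {
            output.emitWatermark(new Watermark(event.f1));
        }
    }
    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // nothing: watermarks are emitted per event, not periodically
    }
}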
4.4 Specifying the punctuated watermark generator
SingleOutputStreamOperator<Tuple2<String, Long>> wordWithTsDS =
wordSource.assignTimestampsAndWatermarks(new WatermarkStrategy<Tuple2<String, Long>>() {
@Override
public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new PunctuatedWatermarkGenerator();
}
@Override
public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return new PunctuatedWatermarkGenerator();
}
});
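4.5 Handling idle sources
In some cases a source produces so little data that no records arrive for a period of time. No watermarks are generated either, and downstream operations that depend on watermarks run into problems. For example, when an operator has several upstream operators, its watermark is the minimum of the watermarks of all its upstream inputs. If one upstream operator produces no watermark for a long time because it has no data, event time becomes skewed and the downstream operator cannot trigger its computation.
For this reason, Flink provides WatermarkStrategy.withIdleness(), which marks a stream as idle when no records arrive within the configured timeout. Downstream operators then no longer wait for watermarks from that stream.
The next time a watermark is generated and emitted downstream, the stream becomes active again.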
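The effect of marking an input idle can be illustrated with a small plain-Java sketch of the "minimum over upstream inputs" rule described above. This is a simplification for illustration, not Flink's internal code: the Input class and method name are hypothetical, and returning Long.MIN_VALUE when every input is idle is a choice made for the sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of how idle inputs are left out when combining upstream watermarks. */
final class IdleInputSketch {

    /** Hypothetical view of one upstream input: its latest watermark and whether it is marked idle. */
    static final class Input {
        final long watermark;
        final boolean idle;

        Input(long watermark, boolean idle) {
            this.watermark = watermark;
            this.idle = idle;
        }
    }

    /** Minimum watermark over the active inputs; Long.MIN_VALUE if none is active (sketch choice). */
    static long combinedWatermark(List<Input> inputs) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (Input input : inputs) {
            if (input.idle) {
                continue; // idle inputs do not hold the watermark back
            }
            anyActive = true;
            min = Math.min(min, input.watermark);
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // The second input is stuck at 2000 because it has no data.
        System.out.println(combinedWatermark(Arrays.asList(
                new Input(10_000L, false), new Input(2_000L, false)))); // prints 2000
        // Once it is marked idle, the busy input alone drives event time.
        System.out.println(combinedWatermark(Arrays.asList(
                new Input(10_000L, false), new Input(2_000L, true))));  // prints 10000
    }
}
```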
In Flink, we can use withIdleness to configure idle-source detection.
SingleOutputStreamOperator<Tuple2<String, Long>> wordWithTsDS =
    wordSource.assignTimestampsAndWatermarks(WatermarkStrategy
        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // let the watermark lag by up to 5 seconds
        .withIdleness(Duration.ofSeconds(15)) // mark the source idle after 15 seconds without records
        .withTimestampAssigner((event, timestamp) -> event.f1));
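Most of the time, the built-in BoundedOutOfOrdernessWatermarks is all we need, together with a lambda expression that extracts the timestamp from the event. It is still worth understanding how it is implemented, so that when something goes wrong we can quickly locate the problem.
Example demo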
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import javax.annotation.Nullable;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
/**
 * Watermark example
 *
 * Created by xuwei.tech.
 */
public class StreamingWindowWatermark {
    public static void main(String[] args) throws Exception {
        // Port of the socket to read from
        int port = 9000;
        // Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use event time; the default is processing time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Set the parallelism to 1; the default is the number of CPUs on the current machine
        env.setParallelism(1);
        // Connect to the socket to read the input data
        DataStream<String> text = env.socketTextStream("hadoop100", port, "\n");
        // Parse the input: each line is "key,timestamp"
        DataStream<Tuple2<String, Long>> inputMap = text.map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] arr = value.split(",");
                return new Tuple2<>(arr[0], Long.parseLong(arr[1]));
            }
        });
        // Extract timestamps and generate watermarks
        DataStream<Tuple2<String, Long>> waterMarkStream = inputMap.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {
            Long currentMaxTimestamp = 0L;
            final Long maxOutOfOrderness = 10000L; // maximum allowed out-of-orderness is 10s
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
            /**
             * Defines how the watermark is generated.
             * Called every 100 ms by default.
             */
            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
            }
            // Defines how the timestamp is extracted
            @Override
            public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
                long timestamp = element.f1;
                currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                System.out.println("key:" + element.f0 + ",eventtime:[" + element.f1 + "|" + sdf.format(element.f1) + "],currentMaxTimestamp:[" + currentMaxTimestamp + "|" +
                        sdf.format(currentMaxTimestamp) + "],watermark:[" + getCurrentWatermark().getTimestamp() + "|" + sdf.format(getCurrentWatermark().getTimestamp()) + "]");
                return timestamp;
            }
        });
        // Tag for collecting late data that would otherwise be dropped
        OutputTag<Tuple2<String, Long>> outputTag = new OutputTag<Tuple2<String, Long>>("late-data") {};
        // Group and aggregate
        SingleOutputStreamOperator<String> window = waterMarkStream.keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.seconds(3))) // assign windows by the message's event time; same effect as calling timeWindow
                .allowedLateness(Time.seconds(2)) // allow data to be up to 2s late
                .sideOutputLateData(outputTag) // collect late data in a side output so it can be stored and inspected later
                .apply(new WindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow>() {
                    /**
                     * Sorts the data in the window to guarantee its order.
                     */
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                        String key = tuple.toString();
                        List<Long> arrarList = new ArrayList<Long>();
                        Iterator<Tuple2<String, Long>> it = input.iterator();
                        while (it.hasNext()) {
                            Tuple2<String, Long> next = it.next();
                            arrarList.add(next.f1);
                        }
                        Collections.sort(arrarList);
                        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        String result = key + "," + arrarList.size() + "," + sdf.format(arrarList.get(0)) + "," + sdf.format(arrarList.get(arrarList.size() - 1))
                                + "," + sdf.format(window.getStart()) + "," + sdf.format(window.getEnd());
                        out.collect(result);
                    }
                });
        // Print late data to the console for now; in practice it can be written to other storage
        DataStream<Tuple2<String, Long>> sideOut = window.getSideOutput(outputTag);
        sideOut.print();
        // For testing, simply print the result to the console
        window.print();
        // Note: Flink is lazily evaluated, so execute must be called for the code above to run
        env.execute("eventtime-watermark");
    }
}