
Flume + Kafka + Spark Streaming

 

Building a real-time log collection system with Flume and Kafka

Flume tails the logs under a given directory and delivers them to a Kafka sink; a Kafka consumer then pulls the data from the topic for processing.

Environment

Hostname: s201

ZooKeeper 3.4.12: s201:2181

Kafka 0.9.0.1: s201:9092

Flume 1.7.0

Spark 2.2.3

The Flume configuration (two agents) is as follows:

# Agent exec-memory-avro: tail the log under flume_msg and forward the data to the Avro sink
exec-memory-avro.sources=execSrc
exec-memory-avro.channels=memoryChannel
exec-memory-avro.sinks=avroSink

exec-memory-avro.sources.execSrc.type=exec
exec-memory-avro.sources.execSrc.command=tail -F /home/hadoop/data/flume/source/flume_msg/data.log
exec-memory-avro.sources.execSrc.shell=/bin/sh -c

exec-memory-avro.sinks.avroSink.type=avro
exec-memory-avro.sinks.avroSink.hostname=s201
exec-memory-avro.sinks.avroSink.port=33333

exec-memory-avro.sources.execSrc.channels=memoryChannel
exec-memory-avro.sinks.avroSink.channel=memoryChannel

exec-memory-avro.channels.memoryChannel.type=memory
exec-memory-avro.channels.memoryChannel.capacity=100

#-----------------------------------------------------------
#-----------------------------------------------------------

# Agent avro-memory-kafka: receive Avro events and write them to the Kafka topic hello
avro-memory-kafka.sources=avroSource
avro-memory-kafka.sinks=kafkaSink
avro-memory-kafka.channels=memoryChannel

avro-memory-kafka.sources.avroSource.type=avro
avro-memory-kafka.sources.avroSource.bind=s201
avro-memory-kafka.sources.avroSource.port=33333

avro-memory-kafka.sinks.kafkaSink.type=org.apache.flume.sink.kafka.KafkaSink
avro-memory-kafka.sinks.kafkaSink.kafka.bootstrap.servers=s201:9092
avro-memory-kafka.sinks.kafkaSink.kafka.topic=hello
avro-memory-kafka.sinks.kafkaSink.flumeBatchSize=5
avro-memory-kafka.sinks.kafkaSink.kafka.producer.acks=1

avro-memory-kafka.channels.memoryChannel.type=memory
avro-memory-kafka.channels.memoryChannel.capacity=100

avro-memory-kafka.sources.avroSource.channels=memoryChannel
avro-memory-kafka.sinks.kafkaSink.channel=memoryChannel

Create the hello topic in Kafka to receive the messages from the producer:

kafka-topics.sh --create --zookeeper s201:2181/mykafka --replication-factor 1 --partitions 1 --topic hello

Testing

  1. Start the avro-memory-kafka Flume agent first, so it can receive the log events.
  2. Start the exec-memory-avro Flume agent, which ships log lines from the source.
  3. Start the console consumer bundled with Kafka and consume the hello topic.
# First start the Avro-source agent, which listens for events sent from the other agent
bin/flume-ng agent \
--name avro-memory-kafka \
--conf conf \
--conf-file conf/avro-memory-kafka.properties \
-Dflume.root.logger=INFO,console

# Tail data.log and forward each new line as soon as it appears
bin/flume-ng agent \
--name exec-memory-avro \
--conf conf \
--conf-file conf/exec-memory-avro.properties \
-Dflume.root.logger=INFO,console

# Start the Kafka console consumer on the hello topic
kafka-console-consumer.sh --zookeeper s201:2181/mykafka --topic hello

Once everything is up, append data to data.log, e.g. echo "hello1" >> data.log  ... and the lines should appear in the console consumer.

Newer Flume releases (1.7+) ship a TAILDIR source, so the exec source in the exec-memory-avro agent can be swapped out for it; a possible configuration is sketched below.
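
A possible TAILDIR version of that source, keeping the rest of the agent unchanged (the positionFile path is just an example; TAILDIR records its read offset there so it can resume after a restart without losing or re-reading lines):

# Hypothetical replacement for the exec source (requires Flume 1.7+)
exec-memory-avro.sources.execSrc.type=TAILDIR
exec-memory-avro.sources.execSrc.positionFile=/home/hadoop/data/flume/taildir_position.json
exec-memory-avro.sources.execSrc.filegroups=f1
exec-memory-avro.sources.execSrc.filegroups.f1=/home/hadoop/data/flume/source/flume_msg/data.log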

 

Integrating Flume and Kafka with Spark Streaming

Integrating Flume with Spark Streaming

Push approach

Install the netcat tool on Windows 10 and add it to the PATH; it can then be invoked directly as nc.exe hostname port.

The Flume configuration file:

# flume_push_streaming.properties
# netcat-memory-avro
flume_push_streaming.sources=netcatSrc
flume_push_streaming.channels=memoryChannel
flume_push_streaming.sinks=avroSink

flume_push_streaming.sources.netcatSrc.type=netcat
flume_push_streaming.sources.netcatSrc.bind=s201
flume_push_streaming.sources.netcatSrc.port=22222

flume_push_streaming.sinks.avroSink.type=avro
# IP address of the machine running the local IDE
flume_push_streaming.sinks.avroSink.hostname=192.168.204.1
flume_push_streaming.sinks.avroSink.port=33333

flume_push_streaming.sources.netcatSrc.channels=memoryChannel
flume_push_streaming.sinks.avroSink.channel=memoryChannel

flume_push_streaming.channels.memoryChannel.type=memory
flume_push_streaming.channels.memoryChannel.capacity=100

The Scala code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePush")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // FlumeUtils turns the stream of Flume events into a DStream for further processing.
    // Binding to 0.0.0.0 accepts connections on any local interface.
    val flumeStream = FlumeUtils.createStream(ssc, "0.0.0.0", 33333)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that FlumeUtils.createStream(ssc, hostname, port) returns a ReceiverInputDStream[SparkFlumeEvent].

Testing

# 1. First start the Spark Streaming program written in the local IDEA
# 2. Start the Flume agent
bin/flume-ng agent \
--name flume_push_streaming \
--conf conf \
--conf-file conf/flume_push_streaming.properties \
-Dflume.root.logger=INFO,console
# 3. Use netcat to send messages to the port the Flume netcat source listens on
# (nc is run from the Windows machine in this setup)
nc.exe s201 22222

Pull approach

Flume pushes events into a special sink, where they are buffered; Spark Streaming then uses a reliable Flume receiver to pull the data from that sink.

In the Flume configuration, change the sink type to org.apache.spark.streaming.flume.sink.SparkSink.

In the Scala code, switch to the polling stream (a fuller sketch follows the configuration below):

FlumeUtils.createPollingStream(ssc, "s201", 33333)  // hostname and port of the SparkSink

# flume_pull_streaming.properties
flume_pull_streaming.sources=netcatSrc
flume_pull_streaming.channels=memoryChannel
# the sink is what differs from the push setup
flume_pull_streaming.sinks=sparkSink

flume_pull_streaming.sources.netcatSrc.type=netcat
flume_pull_streaming.sources.netcatSrc.bind=s201
flume_pull_streaming.sources.netcatSrc.port=22222

# SparkSink buffers the events until the Spark receiver pulls them
flume_pull_streaming.sinks.sparkSink.type=org.apache.spark.streaming.flume.sink.SparkSink
flume_pull_streaming.sinks.sparkSink.hostname=s201
flume_pull_streaming.sinks.sparkSink.port=33333

flume_pull_streaming.sources.netcatSrc.channels=memoryChannel
flume_pull_streaming.sinks.sparkSink.channel=memoryChannel

flume_pull_streaming.channels.memoryChannel.type=memory
flume_pull_streaming.channels.memoryChannel.capacity=100
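
A minimal pull-mode counterpart of the push example, assuming the SparkSink from this configuration is listening on s201:33333 (the object name FlumePullWordCount is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumePull")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // Pull events from the SparkSink running on s201:33333
    val flumeStream = FlumeUtils.createPollingStream(ssc, "s201", 33333)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}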

Tip: if extra dependency jars are needed on the server, they can be passed to spark-submit with --jars (comma-separated jar paths), or with --packages for Maven coordinates.

For example: spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.3 ...

Integrating Kafka with Spark Streaming

Official documentation: http://spark.apache.org/docs/2.2.3/streaming-kafka-0-8-integration.html

Receiver-based

This approach uses Kafka's high-level consumer API and stores the consumed offsets in ZooKeeper. The receiver has to persist the incoming data to a Write Ahead Log, which adds the cost of replicating the data one more time, so it is less efficient than the direct stream. It can only guarantee at-least-once semantics: records may be processed more than once, and exactly-once is not achievable.
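
A minimal receiver-based sketch, assuming the s201:2181/mykafka ZooKeeper address and the hello topic from the earlier sections plus the spark-streaming-kafka-0-8 dependency (the group id test-group is arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaReceiverWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaReceiver")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // createStream(ssc, zkQuorum, groupId, topic -> receiver-thread count)
    // returns a DStream of (key, message) pairs
    val messages = KafkaUtils.createStream(ssc, "s201:2181/mykafka", "test-group", Map("hello" -> 1))

    messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}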

Direct Approach (more commonly used; introduced in Spark 1.3)

Instead of using a receiver, this approach periodically queries Kafka for the latest offsets in each topic+partition and then reads the resulting offset ranges with Kafka's simple consumer API.

With this approach Spark Streaming creates exactly as many RDD partitions as there are Kafka partitions to consume, in a one-to-one mapping, which makes the parallelism straightforward.

*It uses the low-level Kafka API and records offsets in the Spark Streaming checkpoint instead of ZooKeeper, which makes exactly-once semantics possible.

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map[String, String]("metadata.broker.list" -> brokers),  // kafkaParams
  topicsSet  // the set of topics to consume
)

 

Integrating Flume + Kafka + Spark Streaming
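
The end-to-end pipeline reuses the exec-memory-avro and avro-memory-kafka agents from the first section to feed the hello topic, and consumes that topic from Spark Streaming with the direct approach described above. A minimal consumer sketch, assuming the s201:9092 broker and the hello topic (the object name is illustrative):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object FlumeKafkaDirectWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FlumeKafkaDirect")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    val brokers = "s201:9092"
    val topicsSet = Set("hello")
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)

    // One RDD partition per Kafka partition; offsets are tracked by Spark Streaming
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Writing lines into data.log (as in the first section) should then show up as word counts in the Spark Streaming console.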

