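1. Preface
There are plenty of write-ups on consuming multiple topics at once with the high-level API, and plenty on consuming a single topic with the low-level API while managing offsets manually in ZooKeeper. After a long search, though, I could not find one that uses the low-level API to consume multiple topics while managing offsets manually in ZooKeeper.
2. Prerequisites
1) A ZooKeeper cluster and a Kafka cluster are installed and running.
For a simple ZooKeeper cluster setup, see [ZooKeeper Quick Start 1]: Setting Up a Simple ZooKeeper Cluster
For a simple Kafka cluster setup, see [Kafka Quick Start 1]: Setting Up a Kafka Cluster
2) The Maven dependencies. Choose the exact versions to match the components in your own cluster: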
scala 2.11.11
kafka 0.8
zookeeper 3.4.6
spark 2.3.0
hbase 1.1.2
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.11</scala.version>
<kafka.version>0.8.2.1</kafka.version>
<zookeeper.version>3.4.6</zookeeper.version>
<spark.version>2.3.0</spark.version>
<hbase.version>1.1.2</hbase.version>
</properties>
<dependencies>
<!-- scala -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.8.1</version>
<scope>test</scope>
</dependency>
<!--Kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2.1</version>
</dependency>
<!--spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark Streaming and Kafka integration -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!--hbase-->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
</dependencies>
3. Consuming Kafka data with the low-level API
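The listing in this section reads all of its settings from sparkStreaming.properties. As a rough aid, the sketch below collects the property keys that listing looks up and reports any that are missing. The object name StreamingConfigCheck, its helper names, and the sample values for group.id, topic, sparkStreamName and the rate settings are assumptions made for illustration; only the key names and the host names come from the listing.

```scala
import java.io.StringReader
import java.util.Properties

object StreamingConfigCheck {
  // Property keys looked up by the consumer listing in this section.
  val requiredKeys: List[String] = List(
    "kafka.zookeeper.connect",
    "kafka.metadata.broker.list",
    "group.id",
    "topic",
    "kafka.hosts",
    "kafka_port",
    "sparkStreamName",
    "spark.streaming.backpressure.enabled",
    "spark.streaming.backpressure.initialRate",
    "spark.streaming.kafka.maxRatePerPartition"
  )

  // Hypothetical sample file content; host names follow the comments in the listing.
  val sample: String =
    """kafka.zookeeper.connect=master.hadoop:2181,slave1.hadoop:2181,slave2.hadoop:2181
      |kafka.metadata.broker.list=master.hadoop:9092,slave1.hadoop:9092
      |group.id=demo-group
      |topic=topicA,topicB
      |kafka.hosts=master.hadoop
      |kafka_port=9092
      |sparkStreamName=demo-stream
      |spark.streaming.backpressure.enabled=true
      |spark.streaming.backpressure.initialRate=100
      |spark.streaming.kafka.maxRatePerPartition=100
      |""".stripMargin

  // Parse text in .properties format.
  def parse(text: String): Properties = {
    val p = new Properties()
    p.load(new StringReader(text))
    p
  }

  // Required keys that are absent or blank.
  def missingKeys(p: Properties): List[String] =
    requiredKeys.filter(k => Option(p.getProperty(k)).forall(_.trim.isEmpty))

  // Split the comma-separated "topic" value into individual topic names.
  def topics(p: Properties): List[String] =
    p.getProperty("topic", "").split(",").map(_.trim).filter(_.nonEmpty).toList
}
```

Calling StreamingConfigCheck.missingKeys before building the StreamingContext would surface a missing key as a readable list rather than as a NullPointerException later on.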
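import java.io.FileInputStream
import java.util.Properties

import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
import kafka.common.TopicAndPartition
import kafka.consumer.SimpleConsumer
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.ZKGroupTopicDirs
import org.I0Itec.zkclient.ZkClient
import org.apache.log4j.LogManager
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaCluster, KafkaUtils}
import org.apache.spark.streaming.kafka.KafkaCluster.Err

class GetDataFromKafka {
  // Logger
  @transient lazy val log = LogManager.getLogger(classOf[GetDataFromKafka])

  /**
   * Consume data from Kafka.
   */
  def getKafkaData(): Unit = {
    val prop = new Properties()
    prop.load(new FileInputStream("C:\\Users\\qin\\Desktop\\study\\Kafka\\sparkStreaming.properties")) // read the config file from the local Windows machine
    // prop.load(new FileInputStream("/home/kbd/producer/sparkStreaming.properties")) // read the config file on Linux
    // prop.load(new FileInputStream(this.getClass.getResource("/").getPath+"sparkStreaming.properties")) // read the config file from resources
    val zkHost = prop.getProperty("kafka.zookeeper.connect") // ZooKeeper cluster address, e.g. master.hadoop:2181,slave1.hadoop:2181,slave2.hadoop:2181
    val brokerList = prop.getProperty("kafka.metadata.broker.list") // Kafka broker addresses, e.g. master.hadoop:9092,slave1.hadoop:9092
    val groupId = prop.getProperty("group.id") // Kafka consumer group
    val zkClient = new ZkClient(zkHost) // create a ZkClient
    // kafkaParams is used later to create the Kafka stream
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokerList,
      "zookeeper.connect" -> zkHost,
      "group.id" -> groupId)
    var kafkaStream: InputDStream[(String, String)] = null
    val conf = new SparkConf().setAppName(prop.getProperty("sparkStreamName"))
      .setMaster("local[2]")
    // Spark's backpressure mechanism (worth looking up if you are interested); here it is used to limit the Kafka consumption rate
    conf.set("spark.streaming.backpressure.enabled", prop.getProperty("spark.streaming.backpressure.enabled"))
    conf.set("spark.streaming.backpressure.initialRate", prop.getProperty("spark.streaming.backpressure.initialRate")) // initial rate used for the first batch
    conf.set("spark.streaming.kafka.maxRatePerPartition", prop.getProperty("spark.streaming.kafka.maxRatePerPartition")) // maximum records per partition per second
    // Create the StreamingContext
    val ssc = new StreamingContext(conf, Seconds(5))
    val accumulator = ssc.sparkContext.accumulator(0)
    val broadcast: Broadcast[Properties] = ssc.sparkContext.broadcast(prop) // broadcast variable
    val topics = prop.getProperty("topic") // comma-separated list of topics
    var fromOffsets: Map[TopicAndPartition, Long] = Map()
    // A consumer used below to query the latest and earliest offsets in Kafka
    val simpleConsumer: SimpleConsumer = new SimpleConsumer(prop.getProperty("kafka.hosts"), prop.getProperty("kafka_port").toInt, 1000000, 64 * 1024, "octServer")
    // Keep the topic name with each message so records from different topics can be told apart
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())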
    // Read the offsets stored in ZooKeeper for each topic
    topics.split(",").foreach(topic => {
      val topicDirs = new ZKGroupTopicDirs(groupId, topic)
      println(topicDirs.consumerOffsetDir)
      val children = zkClient.countChildren(s"${topicDirs.consumerOffsetDir}")
      // Check whether this topic already has offsets stored in ZooKeeper
      if (children > 0) {
        for (i <- 0 until children) {
          // Read the offset currently stored in ZooKeeper for this partition
          val partitionOffset = zkClient.readData[String](s"${topicDirs.consumerOffsetDir}/${i}")
          var currentOffset = partitionOffset.toLong
          val tp = TopicAndPartition(topic, i)
          // Request for the earliest offset available in Kafka
          val earliestOffset: OffsetRequest = OffsetRequest(Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.EarliestTime, 1)))
          // Request for the latest offset available in Kafka
          val latestOffset: OffsetRequest = OffsetRequest(Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
          val earOffset: Seq[Long] = simpleConsumer.getOffsetsBefore(earliestOffset).partitionErrorAndOffsets(tp).offsets
          val latOffset: Seq[Long] = simpleConsumer.getOffsetsBefore(latestOffset).partitionErrorAndOffsets(tp).offsets
          if (earOffset.nonEmpty && currentOffset < earOffset.head) {
            // The stored offset has already been removed from Kafka; start from the earliest one
            currentOffset = earOffset.head
          } else if (latOffset.nonEmpty && currentOffset > latOffset.head) {
            // The stored offset is ahead of the log; fall back to the latest one
            currentOffset = latOffset.head
          }
          log.info(s"Offset currently stored in ZooKeeper for $tp: $partitionOffset")
          fromOffsets += (tp -> currentOffset)
        }
      } else {
        // No offsets stored in ZooKeeper for this topic yet: ask KafkaCluster for the earliest offset of each partition
        val kc = new KafkaCluster(kafkaParams)
        val either: Either[Err, Set[TopicAndPartition]] = kc.getPartitions(Set(topic))
        either.right.foreach(partitions => {
          val leaderEarliestOffsets = kc.getEarliestLeaderOffsets(partitions).right.get
          leaderEarliestOffsets.foreach(x => {
            fromOffsets += (x._1 -> x._2.offset)
          })
        })
      }
    })
    // Create the Kafka direct stream, starting from the offsets collected in fromOffsets
    kafkaStream = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)
    // Write the Kafka data to HBase
    new PutDataToHBase(broadcast).save(kafkaStream, accumulator, groupId)
    ssc.start()
    ssc.awaitTermination()
  }
}
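The listing above fetches both the earliest and the latest offset of every partition before deciding where to start. That bounds check can be pulled out into a small pure function, which makes it easy to test without a cluster. This is only a sketch: the names OffsetClamp and resolveStartOffset are hypothetical, and falling back to the latest offset when the stored value is too large is one possible policy rather than the only one.

```scala
object OffsetClamp {
  /**
   * Pick a safe starting offset for one partition.
   *
   * @param stored   offset read from ZooKeeper
   * @param earliest smallest offset the broker still holds
   * @param latest   offset the broker will assign to the next message
   */
  def resolveStartOffset(stored: Long, earliest: Long, latest: Long): Long =
    if (stored < earliest) earliest    // stored position was removed by retention
    else if (stored > latest) latest   // stored position is ahead of the log
    else stored                        // stored position is still valid
}
```

For example, OffsetClamp.resolveStartOffset(5L, 10L, 100L) yields 10, so the stream would start at the earliest offset still available instead of failing with an out-of-range error.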
4. Manually saving offsets to ZooKeeper
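import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka.HasOffsetRanges

/**
 * Save offsets to ZooKeeper.
 * @param rdd      RDD taken directly from the Kafka direct stream (before any transformation)
 * @param groupId  Kafka consumer group
 * @param zkClient ZooKeeper client
 */
def saveOffsetsToZookeeper(rdd: RDD[(String, String)], groupId: String, zkClient: ZkClient): Unit = {
  log.info("------------------------------- Start updating offsets --------------------------------------------")
  // The cast only works on the RDD produced by the direct stream itself
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  for (o <- offsetRanges) {
    // Each topic has its own offset directory, so build the path per offset range
    val topicDirs = new ZKGroupTopicDirs(groupId, o.topic)
    val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
    // Store untilOffset (the end of this batch) so that a restart resumes after the batch instead of re-reading it
    ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
    log.info("------------------------------- Offset of " + o.topic + " partition " + o.partition + " updated to " + o.untilOffset + " --------------------------------------------")
  }
  log.info("------------------------------- All offsets updated --------------------------------------------")
}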
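Because one RDD from the direct stream can contain offset ranges for several topics, the save routine ends up writing into several ZooKeeper directories, one per topic. The sketch below shows the resulting znode layout without needing Spark or ZooKeeper. PartitionRange is a hypothetical stand-in for Spark's OffsetRange, ZkOffsetLayout and znodeWrites are illustrative names, and the choice of which offset to store is left to the caller through the pick argument. The path format mirrors what ZKGroupTopicDirs.consumerOffsetDir returns in Kafka 0.8.

```scala
object ZkOffsetLayout {
  // Stand-in for Spark's OffsetRange so the sketch runs without Spark on the classpath.
  final case class PartitionRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // Same layout as ZKGroupTopicDirs.consumerOffsetDir: /consumers/<group>/offsets/<topic>
  def offsetDir(groupId: String, topic: String): String =
    s"/consumers/$groupId/offsets/$topic"

  // One (znode path, value) pair per partition; `pick` selects the offset to store.
  def znodeWrites(groupId: String, ranges: Seq[PartitionRange])
                 (pick: PartitionRange => Long): Seq[(String, String)] =
    ranges.map(r => s"${offsetDir(groupId, r.topic)}/${r.partition}" -> pick(r).toString)
}
```

With two ranges, one for partition 0 of topic a and one for partition 1 of topic b, and group g1, the writes go to /consumers/g1/offsets/a/0 and /consumers/g1/offsets/b/1, which are the same paths the startup code reads back.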
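5. Reflections
This approach does let a single Kafka stream consume multiple topics while managing offsets manually in ZooKeeper. To avoid consuming data twice, however, it updates the offsets in ZooKeeper first and only then writes the data to HBase. The drawback is that some data may be missing from HBase if a failure occurs after the offsets are saved but before the write completes. In effect, it gives up some data completeness in exchange for not having duplicates.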
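The trade-off described above can be reproduced with a small in-memory simulation. The sketch below is not the code used in this post: State stands in for HBase (sink) and ZooKeeper (committed), and all names are hypothetical. It only illustrates how the order of the two steps decides whether a crash causes loss or duplication.

```scala
import scala.collection.mutable.ArrayBuffer

object CommitOrderDemo {
  // One micro-batch covering offsets [from, until).
  final case class Batch(from: Long, until: Long)

  // In-memory stand-ins: `sink` plays HBase, `committed` plays the offset in ZooKeeper.
  final class State {
    val sink: ArrayBuffer[Long] = ArrayBuffer.empty[Long]
    var committed: Long = 0L
  }

  // Offsets first, then data (the order used in this post).
  def commitThenWrite(s: State, b: Batch, crashBeforeWrite: Boolean): Unit = {
    s.committed = b.until
    if (!crashBeforeWrite) s.sink ++= (b.from until b.until)
  }

  // Data first, then offsets (the opposite order).
  def writeThenCommit(s: State, b: Batch, crashBeforeCommit: Boolean): Unit = {
    s.sink ++= (b.from until b.until)
    if (!crashBeforeCommit) s.committed = b.until
  }

  // After a restart, the next batch begins at the committed offset.
  def nextBatch(s: State, size: Long): Batch = Batch(s.committed, s.committed + size)
}
```

If commitThenWrite crashes on the batch covering offsets 0 to 2 and the job restarts, the sink only ever receives 3, 4, 5, so the first three records are lost. If writeThenCommit crashes on the same batch, the restart replays it and the sink receives 0, 1, 2 twice.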
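How to guarantee both that Kafka data is never written to HBase twice and that no data goes missing from HBase is something I still have not worked out. Suggestions are very welcome, thanks!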