
[Kafka Quick Start, Part 3]: Consuming multiple topics with the low-level API and manually managing offsets in ZooKeeper

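1. Preface

There are plenty of posts online about consuming multiple topics at once with the high-level API, and plenty about consuming a single topic with the low-level API while manually storing offsets in ZooKeeper. After a long search, though, I still could not find one that uses the low-level API to consume multiple topics while manually managing offsets in ZooKeeper.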

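2. Prerequisites

1) A ZooKeeper cluster and a Kafka cluster are already installed and running

For a simple ZooKeeper cluster setup, see [ZooKeeper Quick Start, Part 1]: Setting up a simple ZooKeeper cluster
For a simple Kafka cluster setup, see [Kafka Quick Start, Part 1]: Setting up a Kafka cluster

2) The Maven dependencies below; pick the versions that match the components in your own cluster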

scala 2.11.11

kafka 0.8

zookeeper 3.4.6

spark 2.3.0

hbase 1.1.2

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.11</scala.version>
    <kafka.version>0.8.2.1</kafka.version>
    <zookeeper.version>3.4.6</zookeeper.version>
    <spark.version>2.3.0</spark.version>
    <hbase.version>1.1.2</hbase.version>
  </properties>

  <dependencies>
    <!-- scala -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
      <version>${scala.version}</version>
    </dependency>

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-reflect</artifactId>
      <version>${scala.version}</version>
    </dependency>

    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.1</version>
      <scope>test</scope>
    </dependency>

    <!--Kafka -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>0.8.2.1</version>
    </dependency>

    <!--spark -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <!-- Spark-Kafka integration -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <!--hbase-->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>${hbase.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>${hbase.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>${hbase.version}</version>
    </dependency>

  </dependencies>

3. Consuming Kafka data with the low-level API
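
The class in this section reads all of its settings from sparkStreaming.properties, but the file itself is not shown. As a rough sketch, the snippet below lists the keys the code looks up and parses them the same way. The ZooKeeper and broker addresses are taken from the code comments; the values for kafka.hosts, kafka_port, group.id, topic, sparkStreamName and the three rate settings are placeholders I made up, and the names StreamingConfigSketch, load and topicsOf are hypothetical.

```scala
import java.io.StringReader
import java.util.Properties

object StreamingConfigSketch {
  // Keys match the ones getKafkaData() reads; most values are placeholders.
  val sample: String =
    """kafka.zookeeper.connect=master.hadoop:2181,slave1.hadoop:2181,slave2.hadoop:2181
      |kafka.metadata.broker.list=master.hadoop:9092,slave1.hadoop:9092
      |kafka.hosts=master.hadoop
      |kafka_port=9092
      |group.id=demo-group
      |topic=topicA,topicB
      |sparkStreamName=demo-stream
      |spark.streaming.backpressure.enabled=true
      |spark.streaming.backpressure.initialRate=1000
      |spark.streaming.kafka.maxRatePerPartition=500
      |""".stripMargin

  // Load properties from text instead of a file so the sketch is self-contained.
  def load(text: String): Properties = {
    val prop = new Properties()
    prop.load(new StringReader(text))
    prop
  }

  // Split the comma-separated topic list, ignoring stray spaces and empty entries.
  def topicsOf(prop: Properties): List[String] =
    prop.getProperty("topic", "").split(",").map(_.trim).filter(_.nonEmpty).toList
}
```

With the sample above, topicsOf returns the two topics topicA and topicB, which is the list the offset lookup iterates over.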

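import java.io.FileInputStream
import java.util.Properties

import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
import kafka.common.TopicAndPartition
import kafka.consumer.SimpleConsumer
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.ZKGroupTopicDirs
import org.I0Itec.zkclient.ZkClient
import org.apache.log4j.LogManager
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaCluster, KafkaUtils}
import org.apache.spark.streaming.kafka.KafkaCluster.Err

class GetDataFromKafka {
  // Logger
  @transient lazy val log = LogManager.getLogger(classOf[GetDataFromKafka])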

  /**
    * Consume data from Kafka.
    */
  def getKafkaData(): Unit = {
    val prop = new Properties()
    prop.load(new FileInputStream("C:\\Users\\qin\\Desktop\\study\\Kafka\\sparkStreaming.properties")) // config file on the local Windows machine
    //    prop.load(new FileInputStream("/home/kbd/producer/sparkStreaming.properties")) // config file on Linux
    //    prop.load(new FileInputStream(this.getClass.getResource("/").getPath+"sparkStreaming.properties")) // config file under resources
    val zkHost = prop.getProperty("kafka.zookeeper.connect")  // ZooKeeper cluster address, e.g. master.hadoop:2181,slave1.hadoop:2181,slave2.hadoop:2181
    val brokerList = prop.getProperty("kafka.metadata.broker.list") // Kafka broker addresses, e.g. master.hadoop:9092,slave1.hadoop:9092
    val groupId = prop.getProperty("group.id")  // Kafka consumer group
    val zkClient = new ZkClient(zkHost) // ZooKeeper client
    val kafkaParams = Map[String,String]("metadata.broker.list" -> brokerList, // parameters used later to create the Kafka stream
      "zookeeper.connect" -> zkHost,
      "group.id" -> groupId)
    var kafkaStream : InputDStream[(String,String)] = null
    val conf = new SparkConf().setAppName(prop.getProperty("sparkStreamName"))
      .setMaster("local[2]")
    // Spark's backpressure mechanism, used here to control the rate at which Kafka is consumed
    conf.set("spark.streaming.backpressure.enabled",prop.getProperty("spark.streaming.backpressure.enabled"))
    conf.set("spark.streaming.backpressure.initialRate",prop.getProperty("spark.streaming.backpressure.initialRate")) // maximum receive rate for the first batch
    conf.set("spark.streaming.kafka.maxRatePerPartition",prop.getProperty("spark.streaming.kafka.maxRatePerPartition")) // maximum records read per Kafka partition per second
    // Create the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(conf,Seconds(5))
    val accumulator = ssc.sparkContext.accumulator(0)
    val broadcast: Broadcast[Properties] = ssc.sparkContext.broadcast(prop) // broadcast variable
    val topics = prop.getProperty("topic") // comma-separated list of topics
    var fromOffsets : Map[TopicAndPartition,Long] = Map()
    // Consumer used below to fetch the earliest and latest offsets from Kafka
    val simpleConsumer: SimpleConsumer = new SimpleConsumer(prop.getProperty("kafka.hosts"), prop.getProperty("kafka_port").toInt, 1000000, 64 * 1024, "octServer")
    val messageHandler = (mmd : MessageAndMetadata[String,String]) => (mmd.topic,mmd.message())
    // Look up the offsets stored in ZooKeeper for each topic
    topics.split(",").foreach(topic=>{
      val topicDirs = new ZKGroupTopicDirs(groupId,topic)
      println(topicDirs.consumerOffsetDir)
      val children = zkClient.countChildren(s"${topicDirs.consumerOffsetDir}")
      // Check whether offsets for this topic have already been stored in ZooKeeper
      if (children > 0){
        for (i <- 0 until children){
          // Offset currently stored in ZooKeeper for partition i
          val partitionOffset = zkClient.readData[String](s"${topicDirs.consumerOffsetDir}/${i}")
          var currentOffset = partitionOffset.toLong
          val tp = TopicAndPartition(topic,i)
          // Request for the earliest offset still available in Kafka
          val earliestOffset: OffsetRequest = OffsetRequest(Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.EarliestTime, 1)))
          // Request for the latest offset in Kafka
          val latestOffset: OffsetRequest = OffsetRequest(Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
          val earOffset: Seq[Long] = simpleConsumer.getOffsetsBefore(earliestOffset).partitionErrorAndOffsets(tp).offsets
          val latOffset: Seq[Long] = simpleConsumer.getOffsetsBefore(latestOffset).partitionErrorAndOffsets(tp).offsets
          // Keep the starting offset inside the range Kafka still holds, otherwise the stream fails with an out-of-range error
          if (earOffset.nonEmpty && currentOffset < earOffset.head){
            currentOffset = earOffset.head
          } else if (latOffset.nonEmpty && currentOffset > latOffset.head){
            currentOffset = latOffset.head
          }
          log.info(s"Offset stored in ZooKeeper for $tp: $partitionOffset")
          fromOffsets += (tp->currentOffset)
        }
      }else {
        // No offsets stored in ZooKeeper for this topic yet: start from the earliest offsets reported by the partition leaders
        val kc = new KafkaCluster(kafkaParams)
        val either : Either[Err, Set[TopicAndPartition]] = kc.getPartitions(Set(topic))
        either.right.foreach(topicAndPartitions=>{
          val leaderEarliestOffsets = kc.getEarliestLeaderOffsets(topicAndPartitions).right.get
          leaderEarliestOffsets.foreach(x=>{
            fromOffsets +=(x._1->x._2.offset)
          })
        })
      }
    })
    // The consumer is only needed for the offset lookups above
    simpleConsumer.close()
    // Create the Kafka direct stream, starting each partition at the offset resolved above
    kafkaStream = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)
    // Write the Kafka data to HBase
    new PutDataToHBase(broadcast).save(kafkaStream,accumulator,groupId)
    ssc.start()
    ssc.awaitTermination()
  }

}
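
The loop above compares the offset stored in ZooKeeper with the range the broker still holds, because a stored offset that has aged out of the log makes the direct stream fail on startup. That decision can be isolated as a small pure function, sketched below. The names OffsetClampSketch and resolveStartOffset are hypothetical, and clamping at the upper bound is my assumption about the desired behavior rather than something the post states.

```scala
object OffsetClampSketch {
  /** Pick a safe starting offset from the stored value and the broker's current range. */
  def resolveStartOffset(stored: Long, earliest: Long, latest: Long): Long = {
    require(earliest <= latest, "earliest must not exceed latest")
    if (stored < earliest) earliest      // older data was already removed by retention
    else if (stored > latest) latest     // stored value is ahead of the log, e.g. the topic was recreated
    else stored                          // stored value is still valid
  }
}
```

For example, with a broker range of 10 to 100, a stored offset of 5 resolves to 10, a stored offset of 50 stays at 50, and a stored offset of 500 resolves to 100.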

4. Manually saving offsets to ZooKeeper

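  import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
  import org.I0Itec.zkclient.ZkClient
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.kafka.HasOffsetRanges

  /**
    * Save offsets to ZooKeeper.
    * @param rdd      RDD taken directly from the Kafka direct stream, before any transformation,
    *                 so that it still carries its offset ranges
    * @param groupId  Kafka consumer group
    * @param zkClient ZooKeeper client
    */
  def saveOffsetsToZookeeper(rdd :RDD[(String,String)],groupId : String ,zkClient: ZkClient): Unit ={
    log.info("-------------------------------start updating offsets--------------------------------------------")
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    for (o <- offsetRanges) {
      val topicDirs = new ZKGroupTopicDirs(groupId,o.topic)
      val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
      // Store the end of this batch (untilOffset) so a restart resumes after it.
      // Storing fromOffset would replay the whole batch on every restart.
      ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
      log.info("-------------------------------offset of "+o.topic+" partition "+o.partition+" updated to "+o.untilOffset+"--------------------------------------------")
    }
    log.info("-------------------------------all offsets updated--------------------------------------------")
  }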
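
When checking what the function above wrote, it helps to know where the value lives. In Kafka 0.8, ZKGroupTopicDirs places consumer offsets under /consumers/<group>/offsets/<topic>, with one child node per partition. The sketch below rebuilds that path with plain strings; ZkOffsetPathSketch and its two functions are hypothetical names, and the group and topic used in the note are placeholders.

```scala
object ZkOffsetPathSketch {
  // Same layout as Kafka 0.8's ZKGroupTopicDirs.consumerOffsetDir
  def consumerOffsetDir(groupId: String, topic: String): String =
    s"/consumers/$groupId/offsets/$topic"

  // One child node per partition holds that partition's offset as a string
  def partitionOffsetPath(groupId: String, topic: String, partition: Int): String =
    s"${consumerOffsetDir(groupId, topic)}/$partition"
}
```

For a group named demo-group and a topic named topicA, partition 0 is stored at /consumers/demo-group/offsets/topicA/0, which can be read from the ZooKeeper shell with get.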

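5. Closing thoughts

This approach does let a single Kafka stream consume multiple topics while managing offsets manually in ZooKeeper. To make sure no data is consumed twice, however, it updates the offsets in ZooKeeper first and only then writes the data to HBase. The drawback is that some data may never reach HBase: if the job fails after the offsets are updated but before the write finishes, that batch is skipped on restart. In other words, it gives up a small amount of data in exchange for never writing a record twice.

I have not yet found a way to guarantee both that HBase never receives duplicate Kafka data and that no data is lost. Suggestions are very welcome, thanks!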
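
The trade-off can be made concrete with a toy simulation that involves no Kafka, ZooKeeper or HBase at all. The sketch below models one batch that crashes between the two steps and a second run that resumes from the committed offset. Everything in it (CommitOrderSketch, Outcome, runBatch, crashAndRecover, the batch boundaries 0, 3 and 6) is hypothetical and only illustrates the ordering.

```scala
object CommitOrderSketch {
  /** What reached the sink so far and which offset is committed. */
  final case class Outcome(sink: Vector[Long], committed: Long)

  /** Handle offsets [fromOffset, untilOffset); `crash` kills the job between the two steps. */
  def runBatch(fromOffset: Long, untilOffset: Long, commitFirst: Boolean,
               crash: Boolean, start: Outcome): Outcome = {
    val records = (fromOffset until untilOffset).toVector
    if (commitFirst) {
      val afterCommit = start.copy(committed = untilOffset)
      if (crash) afterCommit else afterCommit.copy(sink = afterCommit.sink ++ records)
    } else {
      val afterWrite = start.copy(sink = start.sink ++ records)
      if (crash) afterWrite else afterWrite.copy(committed = untilOffset)
    }
  }

  /** First run crashes mid-batch; the second run resumes from the committed offset. */
  def crashAndRecover(commitFirst: Boolean): Outcome = {
    val initial = Outcome(Vector.empty, committed = 0L)
    val crashed = runBatch(0L, 3L, commitFirst, crash = true, initial)
    runBatch(crashed.committed, 6L, commitFirst, crash = false, crashed)
  }
}
```

With commitFirst = true the sink ends up holding only offsets 3, 4 and 5, so offsets 0 to 2 are lost. With commitFirst = false the sink holds 0, 1, 2 twice followed by 3, 4, 5, so nothing is lost but the first batch is duplicated.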
