Flume+Kafka+Spark streaming

I. Collecting log data in real time with Flume
Create streaming_project.conf:

exec-memory-logger.sources = exec-source
exec-memory-logger.channels = memory-channel
exec-memory-logger.sinks = logger-sink

exec-memory-logger.sources.exec-source.type = exec
exec-memory-logger.sources.exec-source.command = tail -F /home/hzhang/logs/pyaccess.log
exec-memory-logger.sources.exec-source.shell = /bin/sh -c

exec-memory-logger.channels.memory-channel.type = memory

exec-memory-logger.sinks.logger-sink.type = logger

exec-memory-logger.sources.exec-source.channels = memory-channel
exec-memory-logger.sinks.logger-sink.channel = memory-channel

Start Flume:

flume-ng agent \
--name exec-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file /home/hzhang/streaming_project.conf \
-Dflume.root.logger=INFO,console

II. Receiving the log data in Kafka in real time
Create streaming_project2.conf:

exec-memory-kafka.sources = exec-source
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.sinks = kafka-sink

exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/hzhang/logs/pyaccess.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c

exec-memory-kafka.channels.memory-channel.type = memory

exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = 10.5.45.212:9092
exec-memory-kafka.sinks.kafka-sink.topic = test
exec-memory-kafka.sinks.kafka-sink.batchSize = 5
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1

exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel

Start Flume:

flume-ng agent \
--name exec-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file /home/hzhang/streaming_project2.conf \
-Dflume.root.logger=INFO,console

III. Connecting Spark Streaming
See StatStreamingApp.Scala
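StatStreamingApp itself is not reproduced in this post. As a rough idea of what such an entry point might look like, the sketch below assumes the receiver-based Kafka API from spark-streaming-kafka 1.6 (the artifact passed to --packages in the spark-submit command later in this post) and the four arguments given to spark-submit there: ZooKeeper quorum, consumer group, topic list, and threads per topic. The object name, the 60-second batch interval, and the per-batch count are illustrative assumptions, not the author's code.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical stand-in for StatStreamingApp; needs Spark and
// spark-streaming-kafka on the classpath (supplied by spark-submit).
object StatStreamingSketch {
  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      System.err.println("Usage: StatStreamingSketch <zkQuorum> <groupId> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, groupId, topics, numThreads) = args

    // The master is left unset so that --master on spark-submit decides it.
    val conf = new SparkConf().setAppName("StatStreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(60)) // batch interval is an assumption

    // topic -> number of receiver threads, e.g. Map("test" -> 1)
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

    // Receiver-based stream of (key, message) pairs; the log line is the message.
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)

    // Smoke test: print how many log lines arrived in each batch.
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Printing a per-batch count is a convenient first check that Flume, Kafka, and Spark Streaming are wired together before any cleaning or HBase logic is added.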
IV. Data cleaning
(1) Extract the fields we need from the raw data
(2) The cleaned records look like this:

ClickLog(98.55.30.143,20191105173301,112,500,-)
ClickLog(55.143.4.29,20191105173401,162,404,-)
ClickLog(4.187.156.132,20191105173601,122,200,https://search.yahoo.com/search?p=Hadoop基础)
ClickLog(55.30.44.29,20191106150901,162,404,-)
ClickLog(4.143.124.55,20191106150901,122,200,-)
ClickLog(30.167.156.98,20191106150901,123,404,https://www.sogou.com/web?query=大数据面试)

After cleaning, only the log entries for hands-on courses remain.
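The cleaning step could look roughly like the following Scala sketch. The raw log layout is an assumption, since this post does not show it: five tab-separated fields (IP, time as `yyyy-MM-dd HH:mm:ss`, request line, status code, referer), with hands-on course pages under `/class/<id>.html`. `LogCleaner` and `parse` are hypothetical names; only the `ClickLog` field order is taken from the sample output.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Field order mirrors the sample output: ip, time, course id, status code, referer.
case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)

object LogCleaner {
  // Assumed input format and the yyyyMMddHHmmss format seen in the sample output.
  private val inFmt  = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  private val outFmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Returns Some(ClickLog) for a hands-on course page, None for anything else
  // (other pages, malformed lines, unparsable numbers or timestamps).
  def parse(line: String): Option[ClickLog] = {
    val fields = line.split("\t")
    if (fields.length != 5) None
    else Try {
      // Request line assumed to look like: "GET /class/112.html HTTP/1.1"
      val url = fields(2).split(" ")(1)
      if (!url.startsWith("/class/")) None
      else {
        val courseId = url.stripPrefix("/class/").takeWhile(_ != '.').toInt
        val time = LocalDateTime.parse(fields(1), inFmt).format(outFmt)
        Some(ClickLog(fields(0), time, courseId, fields(3).toInt, fields(4)))
      }
    }.toOption.flatten
  }
}
```

In a streaming job, a function like this would typically be applied with `flatMap`, so that lines returning `None` are dropped from the batch.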
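V. Features implemented
[1]
(1) Count the visits to each hands-on course from the start of today up to now
yyyyMMdd courseid
(2) Store the statistics in a database:
Spark Streaming writes the statistics to the database.
Visualization front end: reads the statistics from the database by yyyyMMdd and courseid and displays them.
(3) Choosing a database
RDBMS: MySQL, Oracle, ...
day       course_id  click_count
20191106  1          10
20191106  2          10
When the next batch arrives:
20191106 + 2 (course_id) ==> read click_count, add the next batch's count ==> write the sum back to the database
NoSQL: HBase, Redis, ...
HBase: a single API call (an increment such as incrementColumnValue) handles it
20191106 + 2 (course_id) ==> click_count + the next batch's count
(4) Designing the HBase table
Create the table: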

create 'test_course_clickcount', 'info'

Rowkey design:
day_courseid
(5) Write an HBase utility class
(6) Implement the HBase operations in Scala
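The day_courseid rowkey and the per-batch accumulation can be illustrated without a running cluster. In this sketch an in-memory map stands in for the HBase table; `InMemoryCounter`, `CourseClickCount`, and their methods are hypothetical names used only for illustration and are not part of the HBase API.

```scala
import scala.collection.mutable

// Stand-in for the HBase table: rowkey -> click_count.
class InMemoryCounter {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Adds delta to the stored value and returns the new total,
  // mimicking an increment-style call that needs no separate read by the caller.
  def increment(rowkey: String, delta: Long): Long = {
    counts(rowkey) += delta
    counts(rowkey)
  }

  def get(rowkey: String): Long = counts(rowkey)
}

object CourseClickCount {
  // Rowkey layout from the text: day_courseid, e.g. 20191106_2.
  def rowkey(day: String, courseId: Int): String = s"${day}_$courseId"

  // Collapses one batch of (time as yyyyMMddHHmmss, courseId) pairs
  // into a count per rowkey.
  def batchCounts(batch: Seq[(String, Int)]): Map[String, Long] =
    batch
      .map { case (time, courseId) => rowkey(time.take(8), courseId) }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size.toLong }

  // Folds one batch into the running totals.
  def save(table: InMemoryCounter, batch: Seq[(String, Int)]): Unit =
    batchCounts(batch).foreach { case (key, n) => table.increment(key, n) }
}
```

Because each batch only adds its own count to whatever is already stored, the job does not need to keep the day's running total in Spark state.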
[2] Feature 1, restricted to visits referred from search engines
(1) HBase table design

create 'test_course_search_clickcount', 'info'

Rowkey design: driven by the business requirement
20191111 + search + 3 (course ID)
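One way to derive the "search" part of this rowkey is to take the host name from the referer field of the cleaned log. The sketch below is an assumption on two counts: that "search" means the referer's host (for example www.sogou.com), and that the three parts are joined with underscores in the same way as day_courseid. `SearchRowkey` and its methods are hypothetical names.

```scala
object SearchRowkey {
  // Extracts the host from a referer such as https://www.sogou.com/web?query=...
  // Returns None for "-" (no referer) or anything without a scheme and host.
  def searchHost(referer: String): Option[String] = {
    val parts = referer.split("//", 2)
    if (parts.length < 2) None
    else {
      val host = parts(1).takeWhile(c => c != '/' && c != '?')
      if (host.isEmpty) None else Some(host)
    }
  }

  // Assumed layout: day_searchHost_courseId, e.g. 20191111_www.sogou.com_3.
  // time is the cleaned yyyyMMddHHmmss value; only the day part is used.
  def rowkey(time: String, referer: String, courseId: Int): Option[String] =
    searchHost(referer).map(host => s"${time.take(8)}_${host}_$courseId")
}
```

Records whose referer is "-" yield `None` and can simply be skipped, so only visits that arrived from a search engine are counted in this table.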
VI. Deploying the project to the server
(i) Package the project: go to the root of the project folder and run

mvn clean package -DskipTests

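If the build fails, the cause is that Java and Scala sources cannot be compiled together.
Fix: comment out the build path settings in pom.xml.
(ii) Submit the job
(1) Change to Spark's bin directory: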

cd /opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/spark/bin

(2) Submit the application:

./spark-submit --master local[5] \
--jars $(echo /opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/hbase/lib/*.jar | tr ' ' ',') \
--class FirstThrought.spark.StatStreamingApp \
--packages org.apache.spark:spark-streaming-kafka_2.11:1.6.0 \
/home/hzhang/sparktrain-1.0.jar \
10.5.45.212:2181 test_group1 test 1

If it fails:
(1) A class cannot be found.
Fix: add

--packages org.apache.spark:spark-streaming-kafka_2.11:1.6.0 \

(2) A second error, resolved by putting the HBase jars on the classpath.
Fix: add

--jars $(echo /opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/hbase/lib/*.jar | tr ' ' ',') \

With both options added, the full command is:

./spark-submit --master local[5] \
--jars $(echo /opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/hbase/lib/*.jar | tr ' ' ',') \
--class FirstThrought.spark.StatStreamingApp \
--packages org.apache.spark:spark-streaming-kafka_2.11:1.6.0 \
/home/hzhang/sparktrain-1.0.jar \
10.5.45.212:2181 test_group1 test 1
