Bootstrap

spark 算子例子_Spark算子进阶和案例讲解

Spark算子进阶和案例讲解

回顾1、RDD的概念和属性

2、常用算子回顾

今天内容1、map、mapPartitions、mapPartitionsWithIndex算子区别

2、aggregate算子

3、aggregateByKey算子

4、checkpoint(设置检查点)

5、repartition、coalesce、partitionBy算子区别

6、combineByKey算子

7、其它算子

8、根据基站位置判断用户家庭工作地址案例

教学目标1、掌握用算子实现函数式编程

2、熟悉checkpoint流程

3、用SparkCore实现案例需求

第一节 map、mapPartitions、mapPartitionsWithIndex​x

map和partition的区别:scala> val rdd2 = rdd1.mapPartitions(_.map(_*10))rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] ...​scala> rdd2.collectres1: Array[Int] = Array(10, 20, 30, 40, 50, 60, 70)​scala> rdd1.map(_ * 10).collectres3: Array[Int] = Array(10, 20, 30, 40, 50, 60, 70)​介绍mapPartition和map的区别,引出下面的内容:​mapPartitionsWithIndexval func = (index: Int, iter: Iterator[(Int)]) => {iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator}val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)rdd1.mapPartitionsWithIndex(func).collect

第二节 aggregate第一个参数是分区里的每个元素相加,第二个参数是每个分区的结果再相加

rdd1.aggregate(0)(_+_, _+_)

需求:把每个分区的最大值取出来,再把各分区最大值相加

rdd1.aggregate(0)(math.max(_, _), _+_)

再看初始值设为10的结果

rdd1.aggregate(10)(math.max(_, _), _+_)

再看初始值设为2的结果

rdd1.aggregate(2)(math.max(_, _), _+_)

def func1(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {

iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator

}

val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)

rdd1.mapPartitionsWithIndex(func1).collect

rdd1.aggregate(0)(math.max(_, _), _ + _)

rdd1.aggregate(5)(math.max(_, _), _ + _)

val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)

def func2(index: Int, iter: Iterator[(String)]) : Iterator[String] = {

iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator

}

rdd2.mapPartitionsWithIndex(func2).collect --查看每个分区的元素

rdd2.aggregate("")(_ + _, _ + _)

查看初始值被应用了几次

rdd2.aggregate("=")(_ + _, _ + _)

如果设了三个分区,初始值被应用了几次?

val rdd3 = sc.parallelize(List("12","23","345","4567"),2)

rdd3.mapPartitionsWithIndex(func2).collect --查看每个分区的元素

每次返回的值不一样,因为executor有时返回的慢,有时返回的快一些

rdd3.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)

val rdd4 = sc.parallelize(List("12","23","345",""),2)

rdd4.mapPartitionsWithIndex(func2).collect --查看每个分区的元素

为什么是01或10? 关键点:"".length是"0",下次比较最小length就是1了

rdd4.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

val rdd5 = sc.parallelize(List("12","23","","345"),2)

rdd5.mapPartitionsWithIndex(func2).collect

rdd5.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

第三节 aggregateByKeyval pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

需求:统计猫狗耗子各有多少只? 比较两种方法

pairRDD.aggregateByKey(0)(_+_,_+_).collect

pairRDD.reduceByKey(_+_).collect --也能实现

需求:把每个分区每种最多的动物取出来再进行对应的相加

def func2(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {

iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator

}

pairRDD.mapPartitionsWithIndex(func2).collect

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect

第四节 checkpointcheckpoint(以后结合实例再讲)

sc.setCheckpointDir("hdfs://node01:9000/ck")

val rdd = sc.textFile("hdfs://node01:9000/wc").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)

rdd.checkpoint

rdd.isCheckpointed

rdd.count

rdd.isCheckpointed

rdd.getCheckpointFile

第五节 repartition、coalesce、partitionByrepartition(重新分配分区), coalesce((合并)重新分配分区并设置是否shuffle),

partitionBy(根据partitioner函数生成新的ShuffleRDD,将原RDD重新分区)

val rdd1 = sc.parallelize(1 to 10, 10)

rdd1.repartition(5) --分区调整为5个

rdd1.partitions.length =5

coalesce:调整分区数量,参数一:要合并成几个分区,参数二:是否shuffle,false不会shuffle

val rdd2 = rdd1.coalesce(2, false)

val rdd1 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)), 3)

var rdd2 = rdd1.partitionBy(new org.apache.spark.HashPartitioner(2))

rdd2.partitions.length

第六节 combineByKey偏底层,reduceByKey、aggregateByKey、combineByKey底层都是调用的combineByKeyWithClassTag

combineByKey

val rdd1 = sc.textFile("hdfs://node01:9000/wc").flatMap(_.split(" ")).map((_, 1))

以前是这样调用的

val rdd2 = rdd1.reduceByKey(_+_)

或者是这样

val rdd2 = rdd1.aggregateByKey(0)(_+_,_+_)

现在用这个,x => x:把每一个元素拿出来,(a: Int, b: Int) => a + b:

分区的元素相加,(m: Int, n: Int) => m + n:把每个分区的结果相加

val rdd2 = rdd1.combineByKey(x => x, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n)

rdd2.collect

上面的combineByKey的应用场景,虽然复杂,但可以实现很多需求

下面每个值多了30,因为有三个分区,各加了10。x代表分区里的第一个值,10只加一次

val rdd3 = rdd1.combineByKey(x => x + 10, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n)

rdd3.collect

val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)

val rdd5 = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

组合在一起,用拉链,RDD5在前面,谁拉谁?

val rdd6 = rdd5.zip(rdd4)

rdd6.collect

需求:把单身狗放在一起,成双成对放在一起

val rdd7 = rdd6.combineByKey(List(_), (x: List[String], y: String) => x :+ y,

(m: List[String], n: List[String]) => m ++ n)

画图分析

idea实现

在源码查看reduceByKey、aggregateByKey、combineByKey调用的都是combineByKeyWithClassTag

第七节 其它算子数组或集合变成一个map

collectAsMap

val rdd = sc.parallelize(List(("a", 1), ("b", 2)))

rdd.collectAsMap

-------------------------------------------------------------------------------------------

countByKey

val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("b", 2), ("c", 2), ("c", 1)))

统计相同key的value出现的次数,和reduceByKey比较

rdd1.countByKey

统计相同元素出现的次数

rdd1.countByValue

-------------------------------------------------------------------------------------------

filterByRange

过滤出一个范围的所有的值,以key过滤

val rdd1 = sc.parallelize(List(("e", 5), ("c", 3), ("d", 4), ("c", 2), ("a", 1)))

val rdd2 = rdd1.filterByRange("c", "d")

rdd2.colllect

-------------------------------------------------------------------------------------------

flatMapValues

val a = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))

以value进行map再压平

rdd3.flatMapValues(_.split(" "))

-------------------------------------------------------------------------------------------

foldByKey

val rdd1 = sc.parallelize(List("dog", "wolf", "cat", "bear"), 2)

val rdd2 = rdd1.map(x => (x.length, x))

以key来进行折叠

val rdd3 = rdd2.foldByKey("")(_+_)

val rdd = sc.textFile("hdfs://node01:9000/wc").flatMap(_.split(" ")).map((_, 1))

rdd.foldByKey(0)(_+_)

-------------------------------------------------------------------------------------------

foreachPartition

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)

把每个分区的元素取出来,常用作操作分区数据后向数据库写入数据

rdd1.foreachPartition(x => println(x.reduce(_ + _))) 该结果再IDEA里才能 打印出来

-------------------------------------------------------------------------------------------

keyBy

val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)

以参数作为key来生成新的元组

val rdd2 = rdd1.keyBy(_.length)

rdd2.collect

-------------------------------------------------------------------------------------------

keys values

val rdd1 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)

val rdd2 = rdd1.map(x => (x.length, x))

rdd2.keys.collect

rdd2.values.collect

第八节 根据基站位置判断用户家庭工作地址

通过用户经常连接的基站信息,判断用户的家庭地址和工作地址。

用户连接信息:手机号,发生时间,基站ID,事件类型

18101056806,20160327075000,9F36407EAD0629FC166F14DDE7970F68,1

18101056806,20160327081000,9F36407EAD0629FC166F14DDE7970F68,0

18101056806,20160327081100,CC0710CC94ECC657A8561DE549D940E0,1

18101056806,20160327082000,CC0710CC94ECC657A8561DE549D940E0,0

18688888888,20160327082400,16030401EAFB68F1E3CDF819735E1C66,1

18101056806,20160327082500,16030401EAFB68F1E3CDF819735E1C66,1

18688888888,20160327170000,16030401EAFB68F1E3CDF819735E1C66,0

18101056806,20160327180000,16030401EAFB68F1E3CDF819735E1C66,0

18688888888,20160327171000,CC0710CC94ECC657A8561DE549D940E0,1

18688888888,20160327171600,CC0710CC94ECC657A8561DE549D940E0,1

18101056806,20160327180500,CC0710CC94ECC657A8561DE549D940E0,1

18101056806,20160327181500,CC0710CC94ECC657A8561DE549D940E0,0

18101056806,20160327182000,9F36407EAD0629FC166F14DDE7970F68,1

18101056806,20160327230000,9F36407EAD0629FC166F14DDE7970F68,0

基站信息:基站ID,经度,纬度,信号类型

9F36407EAD0629FC166F14DDE7970F68,116.304864,40.050645,6

CC0710CC94ECC657A8561DE549D940E0,116.303955,40.041935,6

16030401EAFB68F1E3CDF819735E1C66,116.296302,40.032296,6

案例代码:xxxxxxxxxx

object MobileLocation {def main(args: Array[String]): Unit = {

// 模板代码val conf = new SparkConf().setAppName("MobileLocation").setMaster("local[2]")val sc = new SparkContext(conf)​// 获取用户访问基站信息数据

val file: RDD[String] = sc.textFile("mobilelocation/log")​// 切分数据val phoneAndLacAndTime: RDD[((String, String), Long)] = file.map(line => {val fields = line.split(",")val phone = fields(0) // 用户手机号val time = fields(1).toLong // 时间戳val lac = fields(2) // 基站IDval eventType = fields(3).toInt // 事件类型val time_long = if (eventType == 1) -time else time​((phone, lac), time_long)})​// 用户在基站停留的时间的总和val sumedPhoneAndLacAndTime: RDD[((String, String), Long)] = phoneAndLacAndTime.reduceByKey(_+_)​// 把经纬度加到数据里val lacAndPhoneAndTime: RDD[(String, (String, Long))] = sumedPhoneAndLacAndTime.map(x => {val phone = x._1._1 // 手机号val lac = x._1._2 // 基站IDval time = x._2 // 用户在某个基站停留的总时长​(lac, (phone, time))})​// 读取基站的经纬度信息val lacInfo: RDD[String] =

sc.textFile("mobilelocation/lac_info.txt")// 切分基站对应的经纬度信息val lacAndXY: RDD[(String, (String, String))] = lacInfo.map(line => {val fields = line.split(",")val lac = fields(0) // 基站IDval x = fields(1) // 经度val y = fields(2) // 纬度​(lac, (x, y))})​// 用户在基站停留的时间上加上经纬度

val joined: RDD[(String, ((String, Long), (String, String)))] = lacAndPhoneAndTime.join(lacAndXY)​// 首先把数据重新调整,便于以后的计算val phoneAndTimeAndXY: RDD[(String, Long, (String, String))] = joined.map(x => {val phone = x._2._1._1 // 手机号val lac = x._1 // 基站IDval time = x._2._1._2 // 停留时长val xy = x._2._2 // 经纬度​(phone, time, xy)})​// 按手机号进行分组并按停留的时间进行排序val sorted: RDD[(String, List[(String, Long, (String, String))])] =phoneAndTimeAndXY.groupBy(_._1).mapValues(_.toList.sortBy(_._2).reverse.take(2))​val res = sorted.map(_._2)​println(res.collect.toBuffer)​sc.stop()}}

附件

作业1、SparkCore算子熟悉

2、checkpoint实现流程

3、利用所学过的算子实现案例需求

面试题1、为什么要实现checkpoint以及checkpoint流程

2、aggregate和aggregateByKey算子的区别

;