Six scenarios that cause data skew
- 1. During a shuffle, if the shuffle key is null/empty for most rows, data skew occurs, because all of those rows hash to the same partition
- 2. There are many keys but too few partitions, so many keys pile up in one partition and cause data skew
- 3. One key in a table has far more rows than the others, so a group by on that key causes data skew
- 4. A big table joins a small table, and one of the two tables has one or a few keys with far more rows than the others (a broadcast-join sketch follows this list)
- 5. A big table joins a big table, where one table is evenly distributed and the other has one or a few heavily skewed keys
- 6. A big table joins a big table, where one table is evenly distributed and the other has many heavily skewed keys
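For scenario 4, a minimal sketch (the object name and toy data below are mine, not from the source): when the small side of a join fits in memory, broadcasting it turns the shuffle join into a map-side join, so the skewed keys of the big table are never shuffled at all. The spark.sql.autoBroadcastJoinThreshold setting used in the code further below does this automatically for small tables; broadcast() forces it explicitly.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("broadcastJoinDemo")
      .getOrCreate()
    import spark.implicits._

    // bigDF is skewed on clazzId; smallDF is a tiny dimension table
    val bigDF = Seq((1, "aa", "class_01"), (2, "bb", "class_01"), (3, "cc", "class_02"))
      .toDF("id", "name", "clazzId")
    val smallDF = Seq(("class_01", "java"), ("class_02", "python"))
      .toDF("id", "name")

    // broadcast() ships smallDF to every executor, so each task joins
    // locally and the skewed clazzId values are never shuffled
    bigDF.join(broadcast(smallDF), bigDF("clazzId") === smallDF("id"), "left")
      .show()

    spark.stop()
  }
}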
Code for fixing data skew caused by null values
import org.apache.spark.sql.SparkSession
import org.junit.Test

class DataProcess extends Serializable {
  val spark = SparkSession
    .builder()
    // 10485760 bytes = 10 MB: tables smaller than this are broadcast automatically
    .config("spark.sql.autoBroadcastJoinThreshold", "10485760")
    .master("local[4]")
    .appName("dataprocess")
    .getOrCreate()
  import spark.implicits._

  /**
   * 1. During a shuffle, a join key that is null/empty for most rows causes
   *    data skew, because all of those rows hash to the same partition.
   * Solution: filter out the rows whose key is null or empty.
   */
  @Test
  def solution1(): Unit = {
    spark.sparkContext.parallelize(Seq[(Int, String, Int, String)](
      (1, "aa", 20, ""),
      (2, "bb", 20, ""),
      (3, "vv", 20, ""),
      (4, "dd", 20, ""),
      (5, "ee", 20, ""),
      (6, "ss", 20, ""),
      (7, "uu", 20, ""),
      (8, "qq", 20, ""),
      (9, "ww", 20, ""),
      (10, "rr", 20, ""),
      (11, "tt", 20, ""),
      (12, "xx", 20, "class_02"),
      (13, "kk", 20, "class_03"),
      (14, "oo", 20, "class_01"),
      (15, "pp", 20, "class_01")
    )).toDF("id", "name", "age", "clazzId")
      // drop rows whose join key is null or empty so they cannot
      // all land in a single reducer during the shuffle
      .filter("clazzId is not null and clazzId != ''")
      .createOrReplaceTempView("student")

    spark.sparkContext.parallelize(Seq[(String, String)](
      ("class_01", "java"),
      ("class_02", "python"),
      ("class_03", "big data")
    )).toDF("id", "name")
      .createOrReplaceTempView("class_info")

    spark.sql(
      """
        |select s.id, s.name, s.age, c.name
        |  from student s left join class_info c
        |    on s.clazzId = c.id
      """.stripMargin)
      .show() // an action is needed, otherwise the query is never executed
  }
}
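Filtering is fine when the null rows are not needed downstream. If the left-join semantics must keep them, a common alternative (a sketch of my own, not from the source) is to replace each empty key with a random value that cannot match anything, so those rows spread across partitions instead of piling into one:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NullSaltDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("nullSalt")
      .getOrCreate()
    import spark.implicits._

    val studentDF = Seq(
      (1, "aa", ""), (2, "bb", ""), (3, "cc", "class_01")
    ).toDF("id", "name", "clazzId")

    // rows with an empty join key get a random, non-matching key such as
    // "null_7": they survive the left join (class name comes back null)
    // but no longer all hash to the same partition during the shuffle
    val salted = studentDF.withColumn("clazzId",
      when(col("clazzId").isNull || col("clazzId") === "",
        concat(lit("null_"), (rand() * 10).cast("int").cast("string")))
        .otherwise(col("clazzId")))

    salted.show()
    spark.stop()
  }
}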
Code for fixing data skew caused by too many rows sharing one key
**Applicable scenarios:** this approach fits aggregation-style shuffle operators such as reduceByKey on an RDD, or group by aggregations in Spark SQL.
**Implementation idea:** the core of this approach is two-stage aggregation. The first stage is local aggregation: prefix every key with a random number, say a random number below 10, so that previously identical keys become distinct. For example, (hello, 1) (hello, 1) (hello, 1) (hello, 1) might become (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Aggregating the salted keys spreads the hot key across partitions; the second stage then strips the prefix and aggregates again to get the true total per original key.
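A minimal sketch of the two-stage aggregation just described (the object and variable names are mine): salt each key, reduce on the salted keys, strip the salt, reduce again.

import org.apache.spark.sql.SparkSession
import scala.util.Random

object TwoStageAggDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("twoStageAgg")
      .getOrCreate()

    // "hello" is heavily skewed in this toy data set
    val words = spark.sparkContext.parallelize(
      Seq.fill(1000)(("hello", 1)) ++ Seq(("world", 1), ("spark", 1)))

    val result = words
      // stage 1: salt each key with a random prefix in [0, 10), so
      // (hello, 1) becomes e.g. (3_hello, 1) and the hot key is spread
      // across up to 10 distinct keys
      .map { case (word, cnt) => (s"${Random.nextInt(10)}_$word", cnt) }
      .reduceByKey(_ + _)            // aggregate the salted keys
      // stage 2: strip the salt and aggregate again for the true totals
      .map { case (salted, cnt) => (salted.split("_", 2)(1), cnt) }
      .reduceByKey(_ + _)            // global aggregation per original key

    result.collect().foreach(println) // (hello,1000), (world,1), (spark,1)
    spark.stop()
  }
}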