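Spark SQL Operations
I. Creating and Saving a DataFrame
1. Prerequisites
Before creating a DataFrame, run import spark.implicits._ to enable implicit conversions. They are needed to convert an RDD to a DataFrame (for example with .toDF) and for the SQL-style operations that follow.
//Enable implicit conversions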
scala> import spark.implicits._
import spark.implicits._
2. Preparing the data
Use the sample files shipped in the Spark installation directory.
The files are located in /usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources
[centos7@master resources]$ pwd
/usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources
[centos7@master resources]$ ls
full_user.avsc kv1.txt people.json people.txt user.avsc users.avro users.parquet
[centos7@master resources]$
3. Creating a DataFrame
Create a DataFrame from the people.json file
scala> val df = spark.read.json("file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
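4. Saving a DataFrame
As with the RDD method saveAsTextFile, the save path becomes a directory rather than a single file. The directory contains two kinds of files:
① Files whose names start with part, which hold the data
② A _SUCCESS file of size 0, which only marks that the write job completed successfully; if it is missing, the save should be treated as failed
//Save as JSON
scala> df.write.json("file:///home/centos7/df1")
//Save as CSV
scala> df.write.csv("file:///home/centos7/df2")
//Save as Parquet
scala> df.write.parquet("file:///home/centos7/df3")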
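The write calls above fail if the target directory already exists, because the default save mode raises an error on an existing path. The following is a minimal sketch of writing with an explicit save mode and reading the result back. The local SparkSession, the sample rows, and the temporary output directory are assumptions made for illustration; in spark-shell the spark value already exists.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("save-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for people.json
val people = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")

// Hypothetical output location under the system temp directory
val out = java.nio.file.Files.createTempDirectory("df_save_demo")
  .resolve("people_json").toUri.toString

// mode("overwrite") replaces an existing directory instead of failing,
// so running the same write twice is safe
people.write.mode("overwrite").json(out)
people.write.mode("overwrite").json(out)

// Reading the directory back yields a DataFrame with the same rows
val reloaded = spark.read.json(out)
reloaded.show()
```

Other accepted mode strings include "append", "ignore", and "error"; which one fits depends on whether earlier output should be kept.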
II. DataFrame Operations
1. printSchema
Prints the schema (column names and types)
scala> df.printSchema
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
2. show
Prints the contents of the DataFrame as a table
scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
3. select
Selects some of the columns of a DataFrame and returns them as a new DataFrame
scala> df.select("name")
res7: org.apache.spark.sql.DataFrame = [name: string]
scala> df.select("name").show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
scala> df.select("age").show
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+
scala> df.select("age","name").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
scala> df.select("age","name","age").show
+----+-------+----+
| age|   name| age|
+----+-------+----+
|null|Michael|null|
|  30|   Andy|  30|
|  19| Justin|  19|
+----+-------+----+
//Inside select, a column can be qualified with the DataFrame name, as in df("name"), or given as a plain string
scala> df.select(df("name")).show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
//Use as to rename a column
scala> df.select(df("name").as("username")).show
+--------+
|username|
+--------+
| Michael|
|    Andy|
|  Justin|
+--------+
4. filter
Filters rows by a condition (a conditional query)
scala> df.filter(df("age")<20).show
+---+------+
|age|  name|
+---+------+
| 19|Justin|
+---+------+
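The same condition can be written in several equivalent ways. The sketch below rebuilds the three people rows in memory (an assumption, so that it runs without people.json) and compares the column-object form, the $"..." form enabled by spark.implicits._, and the SQL-string form. Note that the row with a null age is dropped in every case, because a comparison with null is never true.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("filter-demo").getOrCreate()
import spark.implicits._

// In-memory stand-in for people.json; None becomes a null age
val df = Seq[(String, Option[Int])](
  ("Michael", None), ("Andy", Some(30)), ("Justin", Some(19))
).toDF("name", "age")

val byColumn = df.filter(df("age") < 20)   // column taken from the DataFrame
val byDollar = df.filter($"age" < 20)      // $"..." syntax from spark.implicits._
val byString = df.filter("age < 20")       // condition written as a SQL expression

// where is an alias of filter; conditions combine with && and ||
val combined = df.where($"age" < 20 && $"name" === "Justin")

byColumn.show()
```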
5. groupBy(field)
Groups rows by a field
scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
scala> df.groupBy("name").count().show()
+-------+-----+
|   name|count|
+-------+-----+
|Michael|    1|
|   Andy|    1|
| Justin|    1|
+-------+-----+
6. sort(field)
Sorts rows by a field
//Sort by age; the default order is ascending
scala> df.sort(df("age")).show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  19| Justin|
|  30|   Andy|
+----+-------+
//Sort by age; adding .asc requests ascending order explicitly
scala> df.sort(df("age").asc).show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  19| Justin|
|  30|   Andy|
+----+-------+
//Sort by age; adding .desc requests descending order
scala> df.sort(df("age").desc).show
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+
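In the outputs above, the null age comes first in ascending order and last in descending order. When a different placement is wanted, the Column methods asc_nulls_last and desc_nulls_first (available since Spark 2.1) state it explicitly. A minimal sketch, again using an in-memory copy of the people rows as an assumption:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("sort-demo").getOrCreate()
import spark.implicits._

// In-memory stand-in for people.json; None becomes a null age
val df = Seq[(String, Option[Int])](
  ("Michael", None), ("Andy", Some(30)), ("Justin", Some(19))
).toDF("name", "age")

// Ascending by age, but with the null row moved to the end
val nullsLast = df.sort(df("age").asc_nulls_last)
  .collect().map(_.getAs[String]("name"))

// Descending by age, but with the null row moved to the front
val nullsFirst = df.sort(df("age").desc_nulls_first)
  .collect().map(_.getAs[String]("name"))

println(nullsLast.mkString(", "))
println(nullsFirst.mkString(", "))
```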
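III. Temporary View Operations
Register an existing DataFrame as a temporary view, then query it with SQL statements
1. Creating a temporary view
Syntax: DataFrame.createOrReplaceTempView("view name")
//Pressing Tab after typing df.create lists the matching methods
scala> df.create
createGlobalTempView createOrReplaceTempView createTempView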
scala> df.createOrReplaceTempView("people_tmp")
2. Querying the temporary view with SQL
Note: each query returns a new DataFrame
scala> spark.sql("select * from people_tmp")
res26: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> spark.sql("select * from people_tmp").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
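Because the result of spark.sql is an ordinary DataFrame, projections, conditions, and aggregates can all be written in SQL against the view. The sketch below is illustrative only: the in-memory rows and the specific queries are assumptions, not part of the original session. A temporary view created this way is visible only within the SparkSession that registered it.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("view-demo").getOrCreate()
import spark.implicits._

// In-memory stand-in for people.json; None becomes a null age
val df = Seq[(String, Option[Int])](
  ("Michael", None), ("Andy", Some(30)), ("Justin", Some(19))
).toDF("name", "age")

df.createOrReplaceTempView("people_tmp")

// Projection plus a WHERE clause; the null age never satisfies BETWEEN
val teens = spark.sql("SELECT name, age FROM people_tmp WHERE age BETWEEN 13 AND 19")
teens.show()

// Aggregates skip null values, so only the two known ages are averaged
val avgAge = spark.sql("SELECT avg(age) AS avg_age FROM people_tmp").first().getDouble(0)
println(avgAge)
```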
IV. Converting an RDD to a DataFrame
1. Inferring the schema by reflection
//Define a case class that describes each record of the RDD
scala> case class Person(name:String,age:Int)
defined class Person
//Create the RDD
scala> val rdd = sc.textFile("file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt MapPartitionsRDD[72] at textFile at <console>:27
//Show the contents of the RDD
scala> rdd.collect
res29: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)
//Split each element of the RDD on the comma; every element of the resulting RDD is an Array[String]
scala> rdd.map(_.split(",")).collect
res30: Array[Array[String]] = Array(Array(Michael, " 29"), Array(Andy, " 30"), Array(Justin, " 19"))
//Use .map(x=>Person(x(0),x(1).trim.toInt)) to turn each array from the previous step into a Person object
//Every element of the new RDD is a Person object
scala> rdd.map(_.trim().split(",")).map(x=>Person(x(0),x(1).trim.toInt)).collect
res32: Array[Person] = Array(Person(Michael,29), Person(Andy,30), Person(Justin,19))
//Call .toDF to convert the RDD of Person objects to a DataFrame
scala> rdd.map(_.trim().split(",")).map(x=>Person(x(0),x(1).trim.toInt)).toDF
res33: org.apache.spark.sql.DataFrame = [name: string, age: int]
//Show the contents of the DataFrame as a table
scala> rdd.map(_.trim().split(",")).map(x=>Person(x(0),x(1).trim.toInt)).toDF.show()
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+
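With the reflection approach, the column names and types come from the fields of the case class. The sketch below condenses the steps above into one snippet meant to be pasted into a REPL such as spark-shell. It uses sc.parallelize with three hard-coded lines instead of people.txt, which is an assumption made so that it runs without the file, and it also shows that the same data can be viewed as a typed Dataset[Person].

```scala
import org.apache.spark.sql.SparkSession

// The case class fields define the column names and types
case class Person(name: String, age: Int)

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("reflect-demo").getOrCreate()
import spark.implicits._

// Hard-coded lines in the same "name, age" layout as people.txt
val lines = spark.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))

// Split, trim, and build Person objects, then convert to a DataFrame
val peopleDF = lines
  .map(_.split(","))
  .map(f => Person(f(0).trim, f(1).trim.toInt))
  .toDF()
peopleDF.printSchema()

// The DataFrame can be viewed as a typed Dataset and filtered with plain Scala
val adults = peopleDF.as[Person].filter(_.age >= 20).collect()
adults.foreach(println)
```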
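2. Defining the schema programmatically
This takes three steps:
① Build the header (the schema)
② Build the records (the rows)
③ Combine the header and the records
//Import types so that StructType and StructField can be used below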
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
//Import Row so that the RDD elements can be converted to Row objects below
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
//Define the header
//StructField("name",StringType,true): "name" is the field name, StringType is the string type, and true means the field is nullable
scala> val fields = Array(StructField("name",StringType,true), StructField("age",IntegerType,true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,IntegerType,true))
//Wrap the fields from the previous step in a StructType
scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))
//Create the RDD
scala> val rdd = sc.textFile("file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt MapPartitionsRDD[85] at textFile at <console>:30
//Split each line
scala> rdd.map(_.split(",")).collect
res35: Array[Array[String]] = Array(Array(Michael, " 29"), Array(Andy, " 30"), Array(Justin, " 19"))
//Convert the split elements of the RDD to Row objects
scala> rdd.map(_.split(",")).map(x=>Row(x(0),x(1).trim.toInt)).collect
res36: Array[org.apache.spark.sql.Row] = Array([Michael,29], [Andy,30], [Justin,19])
//Keep the converted Row objects in the rows variable (rows holds the table records)
scala> val rows = rdd.map(_.split(",")).map(x=>Row(x(0),x(1).trim.toInt))
rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[90] at map at <console>:33
//Show the contents of the rows RDD
scala> rows.collect
res37: Array[org.apache.spark.sql.Row] = Array([Michael,29], [Andy,30], [Justin,19])
//Combine the header and the records into a DataFrame
scala> spark.createDataFrame(rows,schema)
res40: org.apache.spark.sql.DataFrame = [name: string, age: int]
//Show the contents of the DataFrame as a table
scala> spark.createDataFrame(rows,schema).show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+
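The programmatic approach is mainly useful when the columns are not known until run time, so that no case class can be written in advance. The sketch below builds the schema from a string of field names; the schemaString value, the hard-coded input lines, and the choice to keep every column as a string are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("schema-demo").getOrCreate()

// Hypothetical: the field names arrive as a string at run time
val schemaString = "name age"

// Step 1: build the header, one nullable string column per field name
val fields = schemaString.split(" ").map(n => StructField(n, StringType, nullable = true))
val schema = StructType(fields)

// Step 2: build the records; every value stays a string to match the schema
val lines = spark.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))
val rows = lines.map(_.split(",")).map(a => Row(a(0).trim, a(1).trim))

// Step 3: combine header and records
val df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()
```

The values placed in each Row must match the declared types, for example a String for StringType and an Int for IntegerType; a mismatch typically surfaces as a runtime error when the DataFrame is evaluated rather than when it is created.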
V. Case Study:
1. Analyzing student scores
A text file of student scores (score.txt) has the following contents
2103080003,张三,male,20,88,77,100
2103080006,赵六,male,20,100,88,100
2103080005,王五,male,20,99,100,77
2103080007,孙七,male,20,88,88,100
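The columns of score.txt are:
student ID, name, gender, age, score 1, score 2, score 3
stuid,name,gender,age,score1,score2,score3
1. Use RDD programming to compute the total score (score_sum) of each student
//Create the RDD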
scala> val rdd = sc.textFile("file:///home/centos7/score.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///home/centos7/score.txt MapPartitionsRDD[1] at textFile at <console>:24
//Show the contents of the RDD
scala> rdd.collect
res0: Array[String] = Array(2103080003,张三,male,20,88,77,100, 2103080006,赵六,male,20,100,88,100, 2103080005,王五,male,20,99,100,77, 2103080007,孙七,male,20,88,88,100)
//Split each line
scala> rdd.map(_.split(",")).collect
res1: Array[Array[String]] = Array(Array(2103080003, 张三, male, 20, 88, 77, 100), Array(2103080006, 赵六, male, 20, 100, 88, 100), Array(2103080005, 王五, male, 20, 99, 100, 77), Array(2103080007, 孙七, male, 20, 88, 88, 100))
//Sum the three scores
scala> rdd.map(_.split(",")).map(x=>(x(0),x(1),x(2),x(3),x(4).toInt+x(5).toInt+x(6).toInt)).collect
res2: Array[(String, String, String, String, Int)] = Array((2103080003,张三,male,20,265), (2103080006,赵六,male,20,288), (2103080005,王五,male,20,276), (2103080007,孙七,male,20,276))
//Print
scala> rdd.map(_.split(",")).map(x=>(x(0),x(1),x(2),x(3),x(4).toInt+x(5).toInt+x(6).toInt)).foreach(println(_))
(2103080003,张三,male,20,265)
(2103080006,赵六,male,20,288)
(2103080005,王五,male,20,276)
(2103080007,孙七,male,20,276)
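As a small extension of the RDD computation above, the totals can also be ranked. The sketch below uses sc.parallelize with the four lines of score.txt typed in directly (an assumption, so that it runs without the file), sums columns 4 to 6 with slice, and sorts by total in descending order with the RDD method sortBy.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `sc`
val spark = SparkSession.builder().master("local[*]").appName("rdd-sum-demo").getOrCreate()
val sc = spark.sparkContext

// The four records of score.txt, typed in directly
val lines = sc.parallelize(Seq(
  "2103080003,张三,male,20,88,77,100",
  "2103080006,赵六,male,20,100,88,100",
  "2103080005,王五,male,20,99,100,77",
  "2103080007,孙七,male,20,88,88,100"
))

// (name, score_sum): fields 4, 5 and 6 hold the three scores
val totals = lines
  .map(_.split(","))
  .map(f => (f(1), f.slice(4, 7).map(_.trim.toInt).sum))

// Highest total first
val ranked = totals.sortBy(_._2, ascending = false).collect()
ranked.foreach(println)
```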
2. Use a DataFrame to display the student scores as a table
Method 1: infer the schema by reflection
//Define a case class that describes each record of the RDD
scala> case class Score(stuid:String,name:String,gender:String,age:Int,score1:Int,score2:Int,score3:Int)
defined class Score
//Create the RDD from the file
scala> val rdd = sc.textFile("file:///home/centos7/score.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///home/centos7/score.txt MapPartitionsRDD[13] at textFile at <console>:24
//Show the contents of the RDD
scala> rdd.collect
res8: Array[String] = Array(2103080003,张三,male,20,88,77,100, 2103080006,赵六,male,20,100,88,100, 2103080005,王五,male,20,99,100,77, 2103080007,孙七,male,20,88,88,100)
//Split each line
scala> rdd.map(_.split(",")).collect
res9: Array[Array[String]] = Array(Array(2103080003, 张三, male, 20, 88, 77, 100), Array(2103080006, 赵六, male, 20, 100, 88, 100), Array(2103080005, 王五, male, 20, 99, 100, 77), Array(2103080007, 孙七, male, 20, 88, 88, 100))
//Turn each array from the previous step into a Score object
scala> rdd.map(_.split(",")).map(x=>Score(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt)).collect
res10: Array[Score] = Array(Score(2103080003,张三,male,20,88,77,100), Score(2103080006,赵六,male,20,100,88,100), Score(2103080005,王五,male,20,99,100,77), Score(2103080007,孙七,male,20,88,88,100))
//Convert the RDD to a DataFrame
scala> var df = rdd.map(_.split(",")).map(x=>Score(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt)).toDF
df: org.apache.spark.sql.DataFrame = [stuid: string, name: string ... 5 more fields]
//Show the contents of the DataFrame as a table
scala> df.show()
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
Method 2: define the schema programmatically
//Import types so that StructType and StructField can be used below
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
//Import Row so that the RDD elements can be converted to Row objects below
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
//Define the header
scala> val fields = Array(StructField("stuid",StringType,true),StructField("name",StringType,true),StructField("gender",StringType,true), StructField("age",IntegerType,true), StructField("score1",IntegerType,true), StructField("score2",IntegerType,true), StructField("score3",IntegerType,true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(stuid,StringType,true), StructField(name,StringType,true), StructField(gender,StringType,true), StructField(age,IntegerType,true), StructField(score1,IntegerType,true), StructField(score2,IntegerType,true), StructField(score3,IntegerType,true))
//Wrap the fields from the previous step in a StructType
scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(stuid,StringType,true), StructField(name,StringType,true), StructField(gender,StringType,true), StructField(age,IntegerType,true), StructField(score1,IntegerType,true), StructField(score2,IntegerType,true), StructField(score3,IntegerType,true))
//Create the RDD
scala> val rdd = sc.textFile("file:///home/centos7/score.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///home/centos7/score.txt MapPartitionsRDD[23] at textFile at <console>:28
//Split each line
scala> rdd.map(_.split(",")).collect
res12: Array[Array[String]] = Array(Array(2103080003, 张三, male, 20, 88, 77, 100), Array(2103080006, 赵六, male, 20, 100, 88, 100), Array(2103080005, 王五, male, 20, 99, 100, 77), Array(2103080007, 孙七, male, 20, 88, 88, 100))
//Convert the split elements of the RDD to Row objects
scala> rdd.map(_.split(",")).map(x=>Row(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt)).collect
res13: Array[org.apache.spark.sql.Row] = Array([2103080003,张三,male,20,88,77,100], [2103080006,赵六,male,20,100,88,100], [2103080005,王五,male,20,99,100,77], [2103080007,孙七,male,20,88,88,100])
//Keep the converted Row objects in the rows variable (rows holds the table records)
scala> val rows = rdd.map(_.split(",")).map(x=>Row(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt))
rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[28] at map at <console>:30
//Show the contents of the rows RDD
scala> rows.collect
res14: Array[org.apache.spark.sql.Row] = Array([2103080003,张三,male,20,88,77,100], [2103080006,赵六,male,20,100,88,100], [2103080005,王五,male,20,99,100,77], [2103080007,孙七,male,20,88,88,100])
//Combine the header and the records into a DataFrame
scala> var df = spark.createDataFrame(rows,schema)
df: org.apache.spark.sql.DataFrame = [stuid: string, name: string ... 5 more fields]
//Show the contents of the DataFrame as a table
scala> df.show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
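//Display with a SQL statement
//Create the temporary view
scala> df.createOrReplaceTempView("score_tmp")
//A query on the view returns a new DataFrame
scala> spark.sql("select * from score_tmp")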
res15: org.apache.spark.sql.DataFrame = [stuid: string, name: string ... 5 more fields]
//Query and show the result
scala> spark.sql("select * from score_tmp").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
3. Use a DataFrame to display the student scores as a table (sorted by student ID in ascending order)
//Method 1: reuse the DataFrame from the previous step and sort it with DataFrame operations
scala> df.sort(df("stuid")).show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
//Method 2: reuse the temporary view from the previous step and query it with SQL
scala> spark.sql("select * from score_tmp order by stuid").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
4. Use a DataFrame to display the student scores as a table (sorted by score 1 in descending order; ties on score 1 are broken by score 2 in descending order)
//Method 1: use the DataFrame from step 2
//Descending by score1 only
scala> df.sort(df("score1").desc).show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
//Descending by score1; rows with equal score1 are ordered by score2 descending
scala> df.sort(df("score1").desc,df("score2").desc).show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
|2103080003|  张三|  male| 20|    88|    77|   100|
+----------+----+------+---+------+------+------+
//Method 2: use the temporary view from step 2
//Descending by score1 only
scala> spark.sql("select * from score_tmp order by score1 desc ").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
//Descending by score1; rows with equal score1 are ordered by score2 descending
scala> spark.sql("select * from score_tmp order by score1 desc, score2 desc ").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
|2103080003|  张三|  male| 20|    88|    77|   100|
+----------+----+------+---+------+------+------+
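The total score from step 1 can also be computed on the DataFrame itself and then used as a sort key. The sketch below is one possible way to do this, not part of the original exercise: it builds the score rows in memory (gender and age are left out for brevity), adds a score_sum column with withColumn, and orders by the total in descending order with the student ID as a tie-breaker.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("score-sum-demo").getOrCreate()
import spark.implicits._

// The score.txt records, without the gender and age columns
val scores = Seq(
  ("2103080003", "张三", 88, 77, 100),
  ("2103080006", "赵六", 100, 88, 100),
  ("2103080005", "王五", 99, 100, 77),
  ("2103080007", "孙七", 88, 88, 100)
).toDF("stuid", "name", "score1", "score2", "score3")

// Add the total as a new column
val withSum = scores.withColumn("score_sum", col("score1") + col("score2") + col("score3"))

// Highest total first; equal totals are ordered by student ID ascending
val ranked = withSum.orderBy(desc("score_sum"), col("stuid").asc)
ranked.show()

val rankedNames = ranked.collect().map(_.getAs[String]("name"))
```

The same result could be obtained in SQL by registering withSum as a temporary view and using order by score_sum desc, stuid.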