Spark SQL Operations

I. Creating and Saving DataFrames
1. Prerequisites

Before creating a DataFrame, import the implicit conversions (i.e. import spark.implicits._) so that RDDs can be converted to DataFrames and the subsequent SQL operations work.

Enable the implicit conversions directly by running import spark.implicits._:

//Enable implicit conversions
scala> import spark.implicits._
import spark.implicits._
2. Preparing the data

Use the files that ship with the Spark installation directly.

The file path is /usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources

[centos7@master resources]$ pwd
/usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources
[centos7@master resources]$ ls
full_user.avsc  kv1.txt  people.json  people.txt  user.avsc  users.avro  users.parquet
[centos7@master resources]$
3. Creating a DataFrame

Create a DataFrame from the people.json file:

scala> val df = spark.read.json("file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
4. Saving a DataFrame

After a successful save, the result is a directory (similar to an RDD's saveAsTextFile). The directory contains two kinds of files:

① part-* files, which store the data

② a _SUCCESS file of size 0, which only marks success; if this file is absent, the save failed

//Save as a JSON file
scala> df.write.json("file:///home/centos7/df1")

//Save as a CSV file
scala> df.write.csv("file:///home/centos7/df2")

//Save as a Parquet file
scala> df.write.parquet("file:///home/centos7/df3")
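The saved directories can be loaded back into DataFrames later. A minimal sketch, assuming the same spark-shell session (`spark` is the SparkSession) and the output paths written above:

```scala
// Parquet stores the schema with the data, so column names and types round-trip exactly:
val df3 = spark.read.parquet("file:///home/centos7/df3")
df3.printSchema   // age: bigint, name: string, as before

// JSON also round-trips, but the schema is re-inferred from the data:
val df1 = spark.read.json("file:///home/centos7/df1")

// CSV carries no header or types by default in Spark 2.x, so columns come back
// as _c0, _c1, ... strings unless options such as inferSchema are set:
val df2 = spark.read.option("inferSchema", "true").csv("file:///home/centos7/df2")
```

This is one reason Parquet is often the preferred format for intermediate results: unlike CSV, nothing about the schema is lost on the round trip.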
II. DataFrame Operations
1. printSchema

Print the schema (column names and types):

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
2. show

Print the contents of the DataFrame as a table:

scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
3. select

Select some columns of the DataFrame to produce a new DataFrame:

scala> df.select("name")
res7: org.apache.spark.sql.DataFrame = [name: string]

scala> df.select("name").show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

scala> df.select("age").show
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+

scala> df.select("age","name").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> df.select("age","name","age").show
+----+-------+----+
| age|   name| age|
+----+-------+----+
|null|Michael|null|
|  30|   Andy|  30|
|  19| Justin|  19|
+----+-------+----+

//Inside select, columns can optionally be qualified with the DataFrame name
scala> df.select(df("name")).show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

//Rename a column with as
scala> df.select(df("name").as("username")).show
+--------+
|username|
+--------+
| Michael|
|    Andy|
|  Justin|
+--------+
4. filter

Filter rows, i.e. run conditional queries:

scala> df.filter(df("age")<20).show
+---+------+
|age|  name|
+---+------+
| 19|Justin|
+---+------+
5. groupBy(field)

Group by a given field:

scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+

scala> df.groupBy("name").count().show()
+-------+-----+
|   name|count|
+-------+-----+
|Michael|    1|
|   Andy|    1|
| Justin|    1|
+-------+-----+
6. sort(field)

Sort by a given field.

//Sort by the age field; ascending by default
scala> df.sort(df("age")).show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  19| Justin|
|  30|   Andy|
+----+-------+

//Sort by the age field; adding .asc means ascending order
scala> df.sort(df("age").asc).show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  19| Justin|
|  30|   Andy|
+----+-------+

//Sort by the age field; adding .desc means descending order
scala> df.sort(df("age").desc).show
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|null|Michael|
+----+-------+
III. Temporary Table Operations

Create a temporary table from an existing DataFrame and query it with SQL statements.

1. Creating a temporary table

Format: DataFrame.createOrReplaceTempView("temp_table_name")

scala> df.create
createGlobalTempView   createOrReplaceTempView   createTempView

scala> df.createOrReplaceTempView("people_tmp")

2. Querying through the temporary table with SQL statements

Note: the query produces a new DataFrame.

scala> spark.sql("select * from people_tmp")
res26: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> spark.sql("select * from people_tmp").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
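Any SQL that Spark supports can be run against the temporary view, not only `select *`. A small sketch, assuming the people_tmp view registered above (the rows shown depend on the people.json data):

```scala
// Conditional query through SQL -- equivalent to df.filter(df("age") < 20):
spark.sql("select name from people_tmp where age < 20").show

// Aggregation through SQL -- equivalent to df.groupBy("age").count():
spark.sql("select age, count(*) as cnt from people_tmp group by age").show
```

Each call returns a new DataFrame, so the DataFrame operations from section II (select, filter, sort, ...) can be chained onto the SQL result as well.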

IV. Converting an RDD to a DataFrame
1. Inferring the RDD schema via reflection
//Define a case class used to convert the RDD
scala> case class Person(name:String,age:Int)
defined class Person

//Create the RDD
scala> val rdd = sc.textFile("file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt MapPartitionsRDD[72] at textFile at <console>:27

//Show the RDD's contents
scala> rdd.collect
res29: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)

//Split each element of the RDD on commas; each element of the resulting RDD is of type Array[String]
scala> rdd.map(_.split(",")).collect
res30: Array[Array[String]] = Array(Array(Michael, " 29"), Array(Andy, " 30"), Array(Justin, " 19"))

//.map(x=>Person(x(0),x(1).trim.toInt)) converts each element of the RDD above from an array to a Person object
//Each element of the new RDD is a Person object
scala> rdd.map(_.trim().split(",")).map(x=>Person(x(0),x(1).trim.toInt)).collect
res32: Array[Person] = Array(Person(Michael,29), Person(Andy,30), Person(Justin,19))

//Convert the RDD of Person objects to a DataFrame with .toDF
scala> rdd.map(_.trim().split(",")).map(x=>Person(x(0),x(1).trim.toInt)).toDF
res33: org.apache.spark.sql.DataFrame = [name: string, age: int]

//Show the DataFrame's contents as a table
scala> rdd.map(_.trim().split(",")).map(x=>Person(x(0),x(1).trim.toInt)).toDF.show()
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+
2. Defining the RDD schema programmatically

Three steps:

① build the table header (schema)

② build the table rows

③ combine the header with the rows

//Import types so that StructType and StructField can be used below
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

//Import Row so that the RDD elements can later be of type Row
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

//Define the header information
//In StructField("name",StringType,true), "name" is the field name, StringType means string, and true means nullable
scala> val fields = Array(StructField("name",StringType,true), StructField("age",IntegerType,true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,IntegerType,true))

//Convert the header information above into a StructType
scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))

//Create the RDD
scala> val rdd = sc.textFile("file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///usr/local/src/spark-2.1.0-bin-without-hadoop/examples/src/main/resources/people.txt MapPartitionsRDD[85] at textFile at <console>:30

//Split
scala> rdd.map(_.split(",")).collect
res35: Array[Array[String]] = Array(Array(Michael, " 29"), Array(Andy, " 30"), Array(Justin, " 19"))

//Convert the split RDD into Row objects
scala> rdd.map(_.split(",")).map(x=>Row(x(0),x(1).trim.toInt)).collect
res36: Array[org.apache.spark.sql.Row] = Array([Michael,29], [Andy,30], [Justin,19])

//Store the converted Row objects in the rows variable (rows is the "table rows" part)
scala> val rows = rdd.map(_.split(",")).map(x=>Row(x(0),x(1).trim.toInt))
rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[90] at map at <console>:33

//Show the contents of the rows RDD
scala> rows.collect
res37: Array[org.apache.spark.sql.Row] = Array([Michael,29], [Andy,30], [Justin,19])

//Combine the header with the rows to form the DataFrame
scala> spark.createDataFrame(rows,schema)
res40: org.apache.spark.sql.DataFrame = [name: string, age: int]

//Show the DataFrame's contents as a table
scala> spark.createDataFrame(rows,schema).show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+
V. Case Study
1. Student score statistics

There is a text file of student score information (score.txt) with the following contents:

2103080003,张三,male,20,88,77,100
2103080006,赵六,male,20,100,88,100
2103080005,王五,male,20,99,100,77
2103080007,孙七,male,20,88,88,100

The column headings for score.txt are:

student ID, name, gender, age, score1, score2, score3

stuid,name,gender,age,score1,score2,score3

1. Use RDD programming to compute each student's total score (score_sum)
//Create the RDD
scala> val rdd = sc.textFile("file:///home/centos7/score.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///home/centos7/score.txt MapPartitionsRDD[1] at textFile at <console>:24

//View the RDD contents
scala> rdd.collect
res0: Array[String] = Array(2103080003,张三,male,20,88,77,100, 2103080006,赵六,male,20,100,88,100, 2103080005,王五,male,20,99,100,77, 2103080007,孙七,male,20,88,88,100)

//Split
scala> rdd.map(_.split(",")).collect
res1: Array[Array[String]] = Array(Array(2103080003, 张三, male, 20, 88, 77, 100), Array(2103080006, 赵六, male, 20, 100, 88, 100), Array(2103080005, 王五, male, 20, 99, 100, 77), Array(2103080007, 孙七, male, 20, 88, 88, 100))

//Sum
scala> rdd.map(_.split(",")).map(x=>(x(0),x(1),x(2),x(3),x(4).toInt+x(5).toInt+x(6).toInt)).collect
res2: Array[(String, String, String, String, Int)] = Array((2103080003,张三,male,20,265), (2103080006,赵六,male,20,288), (2103080005,王五,male,20,276), (2103080007,孙七,male,20,276))

//Print
scala> rdd.map(_.split(",")).map(x=>(x(0),x(1),x(2),x(3),x(4).toInt+x(5).toInt+x(6).toInt)).foreach(println(_))
(2103080003,张三,male,20,265)
(2103080006,赵六,male,20,288)
(2103080005,王五,male,20,276)
(2103080007,孙七,male,20,276)
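The split-and-sum step above can be checked without Spark: applying the same map logic to a plain Scala collection produces the same totals. A minimal sketch using the four sample lines from the document's score.txt:

```scala
// The same CSV lines that score.txt contains
val lines = Seq(
  "2103080003,张三,male,20,88,77,100",
  "2103080006,赵六,male,20,100,88,100",
  "2103080005,王五,male,20,99,100,77",
  "2103080007,孙七,male,20,88,88,100"
)

// Mirror of rdd.map(_.split(",")).map(...): split each line, keep the id and name,
// and sum the three score columns
val totals = lines.map { line =>
  val x = line.split(",")
  (x(0), x(1), x(4).toInt + x(5).toInt + x(6).toInt)
}

totals.foreach(println)   // (2103080003,张三,265) ... matching the RDD output above
```

The only difference from the RDD version is that `lines` is a local Seq instead of a distributed dataset; the per-element transformation is identical.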
2. Use a DataFrame to display the student score information as a table

Method 1: infer the RDD schema via reflection

//Define a case class used to convert the RDD
scala> case class Score(stuid:String,name:String,gender:String,age:Int,score1:Int,score2:Int,score3:Int)
defined class Score

//Create the RDD from the file
scala> val rdd = sc.textFile("file:///home/centos7/score.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///home/centos7/score.txt MapPartitionsRDD[13] at textFile at <console>:24

//Show the RDD contents
scala> rdd.collect
res8: Array[String] = Array(2103080003,张三,male,20,88,77,100, 2103080006,赵六,male,20,100,88,100, 2103080005,王五,male,20,99,100,77, 2103080007,孙七,male,20,88,88,100)

//Split
scala> rdd.map(_.split(",")).collect
res9: Array[Array[String]] = Array(Array(2103080003, 张三, male, 20, 88, 77, 100), Array(2103080006, 赵六, male, 20, 100, 88, 100), Array(2103080005, 王五, male, 20, 99, 100, 77), Array(2103080007, 孙七, male, 20, 88, 88, 100))

//Convert the elements of the RDD above from arrays to Score objects
scala> rdd.map(_.split(",")).map(x=>Score(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt)).collect
res10: Array[Score] = Array(Score(2103080003,张三,male,20,88,77,100), Score(2103080006,赵六,male,20,100,88,100), Score(2103080005,王五,male,20,99,100,77), Score(2103080007,孙七,male,20,88,88,100))

//Convert the RDD into a DataFrame
scala> var df = rdd.map(_.split(",")).map(x=>Score(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt)).toDF
df: org.apache.spark.sql.DataFrame = [stuid: string, name: string ... 5 more fields]

//Show the DataFrame's contents as a table
scala> df.show()
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+

Method 2: define the RDD schema programmatically

//Import types so that StructType and StructField can be used below
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

//Import Row so that the RDD elements can later be of type Row
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

//Define the header information
scala> val fields = Array(StructField("stuid",StringType,true),StructField("name",StringType,true),StructField("gender",StringType,true), StructField("age",IntegerType,true), StructField("score1",IntegerType,true), StructField("score2",IntegerType,true), StructField("score3",IntegerType,true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(stuid,StringType,true), StructField(name,StringType,true), StructField(gender,StringType,true), StructField(age,IntegerType,true), StructField(score1,IntegerType,true), StructField(score2,IntegerType,true), StructField(score3,IntegerType,true))

//Convert the header information above into a StructType
scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(stuid,StringType,true), StructField(name,StringType,true), StructField(gender,StringType,true), StructField(age,IntegerType,true), StructField(score1,IntegerType,true), StructField(score2,IntegerType,true), StructField(score3,IntegerType,true))

//Create the RDD
scala> val rdd = sc.textFile("file:///home/centos7/score.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:///home/centos7/score.txt MapPartitionsRDD[23] at textFile at <console>:28

//Split
scala> rdd.map(_.split(",")).collect
res12: Array[Array[String]] = Array(Array(2103080003, 张三, male, 20, 88, 77, 100), Array(2103080006, 赵六, male, 20, 100, 88, 100), Array(2103080005, 王五, male, 20, 99, 100, 77), Array(2103080007, 孙七, male, 20, 88, 88, 100))

//Convert the split RDD into Row objects
scala> rdd.map(_.split(",")).map(x=>Row(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt)).collect
res13: Array[org.apache.spark.sql.Row] = Array([2103080003,张三,male,20,88,77,100], [2103080006,赵六,male,20,100,88,100], [2103080005,王五,male,20,99,100,77], [2103080007,孙七,male,20,88,88,100])

//Store the converted Row objects in the rows variable (rows is the "table rows" part)
scala> val rows = rdd.map(_.split(",")).map(x=>Row(x(0),x(1),x(2),x(3).trim.toInt,x(4).trim.toInt,x(5).trim.toInt,x(6).trim.toInt))
rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[28] at map at <console>:30

//Show the contents of the rows RDD
scala> rows.collect
res14: Array[org.apache.spark.sql.Row] = Array([2103080003,张三,male,20,88,77,100], [2103080006,赵六,male,20,100,88,100], [2103080005,王五,male,20,99,100,77], [2103080007,孙七,male,20,88,88,100])

//Combine the header with the rows to form the DataFrame
scala> var df = spark.createDataFrame(rows,schema)
df: org.apache.spark.sql.DataFrame = [stuid: string, name: string ... 5 more fields]

//Show the DataFrame's contents as a table
scala> df.show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+

//Display via SQL statements
//Create the temporary table first
scala> df.createOrReplaceTempView("score_tmp")

//Querying it returns a new DataFrame
scala> spark.sql("select * from score_tmp")
res15: org.apache.spark.sql.DataFrame = [stuid: string, name: string ... 5 more fields]

//Query
scala> spark.sql("select * from score_tmp").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+
3. Use a DataFrame to display the student score information as a table, sorted by student ID in ascending order
//Method 1: reuse the DataFrame from the previous step and display directly with DataFrame operations
scala> df.sort(df("stuid")).show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+

//Method 2: reuse the temporary table from the previous step and query with SQL
scala> spark.sql("select * from score_tmp order by stuid").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+

4. Use a DataFrame to display the student score information as a table, sorted by score1 descending, with ties on score1 broken by score2 descending
//Method 1: use the DataFrame from step 2
//Descending by score1 only
scala> df.sort(df("score1").desc).show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+

//Descending by score1; ties on score1 broken by score2 descending
scala> df.sort(df("score1").desc,df("score2").desc).show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
|2103080003|  张三|  male| 20|    88|    77|   100|
+----------+----+------+---+------+------+------+

//Method 2: use the temporary table from step 2
//Descending by score1 only
scala> spark.sql("select * from score_tmp order by score1 desc ").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080003|  张三|  male| 20|    88|    77|   100|
|2103080007|  孙七|  male| 20|    88|    88|   100|
+----------+----+------+---+------+------+------+

//Descending by score1; ties on score1 broken by score2 descending
scala> spark.sql("select * from score_tmp order by score1 desc, score2 desc ").show
+----------+----+------+---+------+------+------+
|     stuid|name|gender|age|score1|score2|score3|
+----------+----+------+---+------+------+------+
|2103080006|  赵六|  male| 20|   100|    88|   100|
|2103080005|  王五|  male| 20|    99|   100|    77|
|2103080007|  孙七|  male| 20|    88|    88|   100|
|2103080003|  张三|  male| 20|    88|    77|   100|
+----------+----+------+---+------+------+------+