Pitfalls of integrating Spark with Elasticsearch
- Environment
- Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated
- ERROR NetworkClient: Node [xxx] failed (java.net.SocketException: Connection reset); selected next node [xxx]
- Cannot parse value [2020-04-19 09:35:53] for field [updatetime]
- scala.None$ is not a valid external type for schema of string
- How to print the generated DSL query
- Reading from and writing to ES on different hosts in one SparkSession
- Manually specifying the document id when writing to ES
Environment
- Spark 3.0
- Scala 2.12
- ES 7.3
- pom dependency:
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-30_2.12</artifactId>
    <version>7.12.0</version>
</dependency>
Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated
Problem: a Hadoop-related error at startup; the HDFS filesystem provider cannot be instantiated, typically because the Hadoop client classes are missing from (or mismatched on) the classpath.
Fix: add the hadoop-client dependency:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.3</version>
</dependency>
ERROR NetworkClient: Node [xxx] failed (java.net.SocketException: Connection reset); selected next node [xxx]
Problem: the connector cannot connect to ES and keeps failing over from one node to the next.
Fix: set two parameters so the connector talks only to the nodes you list, instead of discovering and routing to the cluster's (possibly unreachable) internal data-node addresses:
.set("es.nodes.discovery", "false")
.set("es.nodes.data.only", "false")
import org.apache.spark.SparkConf

val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
  // List the reachable ES endpoints explicitly...
  .set("es.nodes", "node1:9200,node2:39200,node3:39200")
  // ...and turn off node discovery so the connector never tries internal addresses.
  .set("es.nodes.discovery", "false")
  .set("es.nodes.data.only", "false")
  .set("es.net.http.auth.user", "es")
  .set("es.net.http.auth.pass", "123456")
Cannot parse value [2020-04-19 09:35:53] for field [updatetime]
Problem: the date cannot be parsed. By default the connector maps ES date fields to rich date objects, and a non-ISO 8601 value like 2020-04-19 09:35:53 fails that parsing.
Fix: add the setting "es.mapping.date.rich" -> "false" so the field is returned as a plain string.
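With rich dates off, the parsing can then be done on the Spark side. A minimal sketch, assuming the SparkSession spark from above and the placeholder index my_index:

import org.apache.spark.sql.functions.{col, to_timestamp}

// Read updatetime as a plain string instead of letting the connector parse it...
val raw = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.date.rich", "false")
  .load("my_index")

// ...then parse the "yyyy-MM-dd HH:mm:ss" text into a real timestamp column.
val parsed = raw.withColumn("updatetime", to_timestamp(col("updatetime"), "yyyy-MM-dd HH:mm:ss"))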
scala.None$ is not a valid external type for schema of string
Problem: empty values come back as None, and Spark cannot convert scala.None to the external string type of the schema.
Fix: add the setting "es.field.read.empty.as.null" -> "false".
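A minimal sketch of applying the fix (same assumed spark session and placeholder index as above); with the default, empty fields arrive as null/None, while "false" keeps them as empty strings that fit a string schema:

import org.apache.spark.sql.DataFrame

// Empty fields are kept as "" instead of being converted to null/None.
val df: DataFrame = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.field.read.empty.as.null", "false")
  .load("my_index")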
How to print the generated DSL query
Problem: pushdown is strongly recommended by the official docs, but how do I know what DSL statement my Spark SQL gets translated into? In other words, how do I verify that predicate pushdown actually worked?
Fix: add a logger to log4j.properties:
log4j.logger.org.elasticsearch.spark.sql=TRACE
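To see it in action, run a filtered read and watch the TRACE output; the filter should appear inside the logged DSL. A sketch, reusing the assumed spark session and placeholder index/field names from above:

import org.apache.spark.sql.functions.col

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")
  .load("my_index")
  .filter(col("updatetime") >= "2020-04-19 00:00:00")

// Triggering an action makes the connector issue the query (and log the DSL);
// the scan node in df.explain() may also list the pushed filters.
df.show()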
Reading from and writing to ES on different hosts in one SparkSession
Problem: I read ES data from host1 and need to write it to host2. How do I do that?
Fix: use separate option maps for the read and the write. Because es.nodes is passed per operation instead of globally on the SparkConf, each side targets its own cluster.
import org.apache.spark.sql.DataFrame
import org.elasticsearch.spark.sql._ // brings saveToEs into scope

val query =
  """
xxx
  """.stripMargin

// Read from the source cluster (host1).
val readOptions = Map(
  "pushdown" -> "true",
  "es.resource.read" -> "index1",
  "es.mapping.date.rich" -> "false",
  "es.field.read.empty.as.null" -> "false",
  "es.scroll.size" -> "10000",
  "es.nodes" -> "node1",
  "es.nodes.discovery" -> "false",
  "es.nodes.data.only" -> "false",
  "es.http.timeout" -> "10m",
  "es.query" -> query
)
val result: DataFrame = spark.read.options(readOptions)
  .format("org.elasticsearch.spark.sql")
  .load()

// Write to the target cluster (host2); es.nodes here overrides the read-side nodes.
val writeOptions = Map(
  "es.nodes" -> "other-host-node",
  "es.http.timeout" -> "10m",
  "es.resource.write" -> "index2",
  "es.mapping.id" -> "id",
  "es.index.auto.create" -> "false"
)
result.saveToEs(writeOptions)
Manually specifying the document id when writing to ES
Problem: I don't want ES to auto-generate the _id; I want to supply it myself.
Fix: map one of the schema's fields to the document id (already shown in the previous example):
"es.mapping.id" -> "id"