
Pitfalls encountered when integrating Spark with Elasticsearch

Environment

  • Spark 3.0
  • Scala 2.12
  • ES 7.3
  • pom.xml dependency:
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch-spark-30_2.12</artifactId>
        <version>7.12.0</version>
    </dependency>

Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated

Problem: a Hadoop-related error appears; the Hadoop filesystem classes the connector needs are not on the classpath.

Solution: add the hadoop-client dependency:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>

ERROR NetworkClient: Node [xxx] failed (java.net.SocketException: Connection reset); selected next node [xxx]

Problem: the connector cannot connect to ES.

Solution: set the following two parameters, so the connector talks only to the declared nodes instead of the (possibly unreachable) addresses it discovers from the cluster:

.set("es.nodes.discovery", "false")
.set("es.nodes.data.only", "false")

    import org.apache.spark.SparkConf

    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
      .set("es.nodes", "node1:9200,node2:39200,node3:39200")
      .set("es.nodes.discovery", "false")   // use only the nodes listed above, do not discover the rest of the cluster
      .set("es.nodes.data.only", "false")   // allow routing requests to non-data nodes as well
      .set("es.net.http.auth.user", "es")
      .set("es.net.http.auth.pass", "123456")

Cannot parse value [2020-04-19 09:35:53] for field [updatetime]

Problem: the connector cannot parse the date value.

Solution: add the setting "es.mapping.date.rich" -> "false", so date fields are returned as plain strings instead of being parsed into date objects.
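
A minimal sketch of where the flag goes: it can be set on the SparkConf from the earlier snippet, or passed per read (the index name index1 is just a placeholder, and spark is an already-built SparkSession):

    // set globally on the SparkConf built above:
    // date fields are then returned as plain strings instead of being parsed
    conf.set("es.mapping.date.rich", "false")

    // or per read, as an option
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.mapping.date.rich", "false")
      .option("es.resource.read", "index1")
      .load()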

scala.None$ is not a valid external type for schema of string

Problem: None values are not handled correctly when read into the DataFrame.

Solution: add the setting "es.field.read.empty.as.null" -> "false".
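
Like the date flag, it can go on the SparkConf or be passed per read; a minimal sketch of the read-option form (the index name is again a placeholder):

    // with es.field.read.empty.as.null=false, empty fields are read back as-is
    // instead of being converted to null/None for string columns
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.field.read.empty.as.null", "false")
      .option("es.resource.read", "index1")
      .load()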

How to print the generated DSL query

Problem: pushdown is strongly recommended in the official docs, but how do I know what DSL statement my Spark SQL was translated into? In other words, how do I know whether predicate pushdown actually succeeded?

Solution: add the following logger to log4j.properties:

log4j.logger.org.elasticsearch.spark.sql=TRACE
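
With that logger at TRACE, the connector logs the JSON query it builds for each read, so a pushed-down filter shows up as a translated clause. A hypothetical check (column and index names are placeholders):

    // if pushdown works, the TRACE output contains a range clause on updatetime;
    // if it does not, only a catch-all query is logged and filtering happens in Spark
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("pushdown", "true")
      .option("es.resource.read", "index1")
      .load()
    df.filter(df("updatetime") >= "2020-04-19 00:00:00").count()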

Reading from and writing to different ES hosts in one SparkSession

Problem: I read ES data from host1 and need to write it to ES on host2. How do I do that?

Solution: use separate option maps for reading and writing:

    import org.apache.spark.sql.DataFrame
    import org.elasticsearch.spark.sql._   // brings the saveToEs method into scope

    // query DSL sent to the source cluster (body elided)
    val query =
      """
       xxx
      """.stripMargin

    // options for reading from the first cluster (node1)
    val readOptions = Map(
      "pushdown" -> "true",
      "es.resource.read" -> "index1",
      "es.mapping.date.rich" -> "false",
      "es.field.read.empty.as.null" -> "false",
      "es.scroll.size" -> "10000",
      "es.nodes" -> "node1",
      "es.nodes.discovery" -> "false",
      "es.nodes.data.only" -> "false",
      "es.http.timeout" -> "10m",
      "es.query" -> query
    )

    val result: DataFrame = spark.read.options(readOptions).format("org.elasticsearch.spark.sql")
      .load()

    // options for writing to the second cluster (other-host-node)
    val writeOptions = Map(
      "es.nodes" -> "other-host-node",
      "es.http.timeout" -> "10m",
      "es.resource.write" -> "index2",
      "es.mapping.id" -> "id",
      "es.index.auto.create" -> "false"
    )

    result.saveToEs(writeOptions)

Specifying the document id manually when writing to ES

Problem: I don't want ES to auto-generate the _id; I want to specify it myself.

Solution: map one field of the schema to the document id (already shown in the previous snippet):

 "es.mapping.id" -> "id"
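
A minimal standalone sketch of such a write, assuming the DataFrame result from the snippet above has an id column:

    // documents are indexed with _id taken from the "id" column, so re-running
    // the job overwrites existing documents instead of creating duplicates
    result.saveToEs(Map(
      "es.nodes"          -> "other-host-node",
      "es.resource.write" -> "index2",
      "es.mapping.id"     -> "id"
    ))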