1.ES--基本操作

待研究：
https

文章目录

数据准备

POST /forum/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

因为ES是json结构，可以后期灵活调整json结构

1 . 查询映射

GET /forum/_mapping/article

{
  "forum" : {
    "mappings" : {
      "properties" : {
        "articleID" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "hidden" : {
          "type" : "boolean"
        },
        "postDate" : {
          "type" : "date"
        },
        "userID" : {
          "type" : "long"
        }
      }
    }
  }
}

现在es 7.1版本，type=text，默认会设置两个field，一个是field本身，比如articleID，就是分词的；还有一个的话，就是field.keyword，articleID.keyword，默认不分词，会最多保留256个字符

2.默认分词

GET /forum/_analyze
{
  "text": "JODL-X-1937-#pV7"
}

查询结果

{
  "tokens" : [
    {
      "token" : "jodl",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "x",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "1937",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "pv7",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

es中就是按照分词器，将内容分词以后，维护到倒排索引中的；

当我们用term搜索，输入的内容，就会作为一个整体，去倒排索引中匹配；
如果我们采用match搜索，输入的内容，就会按照默认的分词器或者指定的分词器，将输入内容分词以后，去倒排索引中匹配；

所以分词器又分为两类：一类是文档字段的进行分词，一类是对输入的关键词进行分词，所以文档创建的时候，我们可以指定分词器对某一字段进行分词；搜索的时候，也可以指定某个分词器对搜索词进行分词--个人理解

所以

GET /forum/_search 
{
  "query": {
    "match": {
      "articleID": "JODL-X-1937-#pV7"
    }
  }
}
匹配了
{
  "took" : 22,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 4.8158913,
    "hits" : [
      {
        "_index" : "forum",
        "_type" : "article",
        "_id" : "3",
        "_score" : 4.8158913,
        "_source" : {
          "articleID" : "JODL-X-1937-#pV7",
          "userID" : 2,
          "hidden" : false,
          "postDate" : "2017-01-01"
        }
      }
    ]
  }
}

但是

GET /forum/_search 
{
  "query": {
    "term": {
      "articleID": "JODL-X-1937-#pV7"
    }
  }
}
term方式，却无法搜索到，因为articleID字段给分词了；
但是mapping中，我们可以看到对text类型的字段，默认额外创建了新的字段： 字段名.keyword ，而这个字段类型是keyword的，倒排索引中存储的就是articleID的整体内容

所以当我们用
GET /forum/_search 
{
  "query": {
    "term": {
      "articleID.keyword": "JODL-X-1937-#pV7"
    }
  }
}
这个方式查询，也能查询到

term filter/query：对搜索文本不分词，直接拿去倒排索引中匹配，你输入的是什么，就去匹配什么
比如说，如果对搜索文本进行分词的话，“helle world” --> “hello”和“world”，两个词分别去倒排索引中匹配
term，“hello world” --> “hello world”，直接去倒排索引中匹配“hello world”

3.搜索没有隐藏的帖子

GET /forum/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "hidden" : false
                }
            }
        }
    }
}

4.根据发帖日期搜索帖子

GET /forum/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "postDate" : "2017-01-01"
                }
            }
        }
    }
}

5.查看分词

GET /forum/_analyze
{
  "field": "articleID",
  "text": "XHDK-A-1293-#fJ3"
}

默认是analyzed的text类型的field，建立倒排索引的时候，就会对所有的articleID分词，分词以后，原本的articleID就没有了，只有分词后的各个word存在于倒排索引中。
term，是不对搜索文本分词的，XHDK-A-1293-#fJ3 --> XHDK-A-1293-#fJ3；但是articleID建立索引的时候，XHDK-A-1293-#fJ3 --> xhdk，a，1293，fj3

6. 重建索引

PUT /forum
{
  "mappings": {
    "properties": {
      "articleID":{"type": "keyword"}
    }
  }
}

新增字段映射是可以的，但是如果对现有字段的索引方式进行修改，例如以前是text类型，已经按照text建设好倒排索引了，如果要再改成keyword就会报错；

因为上面已经创建好索引，这里重新设置就会报错，当然也可以将索引删除，如果索引下已经有很多数据了，删除就不合适了，具体方案以后再讲，这里就用删除索引的方式重建

DELETE /forum

PUT /forum
{
  "mappings": {
    "article": {
      "properties": {
        "articleID": {
          "type": "keyword"
        }
      }
    }
  }
}

重新导入数据

POST /forum/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

再查看映射

GET /forum/_mapping

{
  "forum" : {
    "mappings" : {
      "properties" : {
        "articleID" : {
          "type" : "keyword"
        },
        "hidden" : {
          "type" : "boolean"
        },
        "postDate" : {
          "type" : "date"
        },
        "userID" : {
          "type" : "long"
        }
      }
    }
  }
}

7.bool查询

1、搜索发帖日期为2017-01-01，或者帖子ID为XHDK-A-1293-#fJ3的帖子，同时要求帖子的发帖日期绝对不为2017-01-02

GET /forum/_search 
{
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "postDate": {
              "value": "2017-01-02"
            }
          }
        }
      ],
      "should": [
        {
          "term": {
            "postDate": "2017-01-01"
          }
        },
        {
          "term": {
            "articleID": "XHDK-A-1293-#fJ3"
          }
        }
      ]
    }
  }
}

must，should，must_not，filter：必须匹配，可以匹配其中任意一个即可，必须不匹配

2、2、搜索帖子ID为XHDK-A-1293-#fJ3，或者是帖子ID为JODL-X-1937-#pV7而且所有文档的发帖日期为2017-01-01的帖子

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should":[
            {
              "term":{
                "articleID":"XHDK-A-1293-#fJ3"
              }
            },
            {
              "term":{
                "articleID":"JODL-X-1937-#pV7"
              }
            }
            ],
            "must":{"term":{"postDate":"2017-01-01"}}
        }
      }
    }
  }
}

terms 搜索

term: {“field”: “value”}
terms: {“field”: [“value1”, “value2”]}

POST  /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag" : ["java", "hadoop"]} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag" : ["java"]} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag" : ["hadoop"]} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag" : ["java", "elasticsearch"]} }

搜索articleID为KDKE-B-9947-#kL5或QQPX-R-3956-#aD8的帖子，搜索tag中包含java的帖子

GET /forum/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "articleID": [
                  "KDKE-B-9947-#kL5",
                  "QQPX-R-3956-#aD8"
                ]
              }
            },
            {
              "terms":{"tag":["java"]}
            }
          ]
        }
      }
    }
  }
}


GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "tag": [
            "java"
          ]
        }
      }
    }
  }
}

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "tag": "java"
        }
      }
    }
  }
}

range查询

range，sql中的between，或者是>=1，<=1
为帖子增加浏览量

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }

# 搜索浏览量在30~60之间的帖子

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "view_cnt": {
            "gte": 30,
            "lte": 60
          }
        }
      }
    }
  }
}



# 搜索发帖日期在最近1个月的帖子

GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gte": "2017-03-10||-30d"
          }
        }
      }
    }
  }
}



GET /forum/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "now-30d"
          }
        }
      }
    }
  }
}

全文检索 match

1、全文检索的时候，进行多个值的检索，有两种做法，match query；should
2、控制搜索结果精准度：and operator，minimum_should_match

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }

#搜索标题中包含java或elasticsearch的blog

GET /forum/_search 
{
  "query": {
    "match": {
      "title": "java elasticsearch"
    }
  }
}


#搜索标题中包含java和elasticsearch的blog
GET /forum/_search 
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"
      }
       
    }
  }
}

#搜索包含java，elasticsearch，spark，hadoop，4个关键字中，至少3个的blog

GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"
      }
    }
  }
}


GET /forum/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark"  }},
      "should": [
                  { "match": { "title": "hadoop" }},
                  { "match": { "title": "elasticsearch"   }}
      ]
    }
  }
}


1、普通match如何转换为term+should

{
    "match": { "title": "java elasticsearch"}
}

使用诸如上面的match query进行多值搜索的时候，es会在底层自动将这个match query转换为bool的语法
bool should，指定多个搜索词，同时使用term query

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }}
    ]
  }
}

2、and match如何转换为term+must

{
    "match": {
        "title": {
            "query":    "java elasticsearch",
            "operator": "and"
        }
    }
}

{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }}
    ]
  }
}

3、minimum_should_match如何转换

{
    "match": {
        "title": {
            "query":                "java elasticsearch hadoop spark",
            "minimum_should_match": "75%"
        }
    }
}

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3 
  }
}

搜索权重

需求：搜索标题中包含java的帖子，同时呢，如果标题中包含hadoop或elasticsearch就优先搜索出来，同时呢，如果一个帖子包含java hadoop，一个帖子包含java elasticsearch，包含hadoop的帖子要比elasticsearch优先搜索出来

知识点，搜索条件的权重，boost，可以将某个搜索条件的权重加大，此时当匹配这个搜索条件和匹配另一个搜索条件的document，计算relevance score时，匹配权重更大的搜索条件的document，relevance score会更高，当然也就会优先被返回回来

默认情况下，搜索条件的权重都是一样的，都是1

GET /forum/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "blog"
          }
        }
      ],
      "should": [
        {
          "match": {
            "title": {
              "query": "java"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "hadoop"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "elasticsearch"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "spark",
              "boost": 5
            }
          }
        }
      ]
    }
  }
}

基于dis_max实现多字段搜索

POST /forum/_bulk
{ "index": { "_id": 5 }}
{ "articleID" : "QQPX-R-4956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-11-02" ,"tag" : ["java", "elasticsearch"]}

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }


POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }

搜索title或content中包含java或solution的帖子

下面这个就是multi-field搜索，多字段搜索

GET /forum/article/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}

3、结果分析

期望的是doc5，结果是doc2,doc4排在了前面

计算每个document的relevance score：每个query的分数，乘以matched query数量，除以总query数量

算一下doc4的分数

{ "match": { "title": "java solution" }}，针对doc4，是有一个分数的
{ "match": { "content":  "java solution" }}，针对doc4，也是有一个分数的

所以是两个分数加起来，比如说，1.1 + 1.2 = 2.3
matched query数量 = 2
总query数量 = 2

2.3 * 2 / 2 = 2.3

算一下doc5的分数

{ "match": { "title": "java solution" }}，针对doc5，是没有分数的
{ "match": { "content":  "java solution" }}，针对doc5，是有一个分数的

所以说，只有一个query是有分数的，比如2.3
matched query数量 = 1
总query数量 = 2

2.3 * 1 / 2 = 1.15

doc5的分数 = 1.15 < doc4的分数 = 2.3

4、best fields策略，dis_max

best fields策略，就是说，搜索到的结果，应该是某一个field中匹配到了尽可能多的关键词，被排在前面；而不是尽可能多的field匹配到了少数的关键词，排在了前面

dis_max语法，直接取多个query中，分数最高的那一个query的分数即可

{ "match": { "title": "java solution" }}，针对doc4，是有一个分数的，1.1
{ "match": { "content":  "java solution" }}，针对doc4，也是有一个分数的，1.2
取最大分数，1.2

{ "match": { "title": "java solution" }}，针对doc5，是没有分数的
{ "match": { "content":  "java solution" }}，针对doc5，是有一个分数的，2.3
取最大分数，2.3

然后doc4的分数 = 1.2 < doc5的分数 = 2.3，所以doc5就可以排在更前面的地方，符合我们的需要

GET /forum/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}

tie_breaker

tie_breaker参数的意义，在于说，将其他query的分数，乘以tie_breaker，然后综合与最高分数的那个query的分数，综合在一起进行计算
除了取最高分以外，还会考虑其他的query的分数
tie_breaker的值，在0~1之间，是个小数，就ok

基于multi_match语法实现dis_max+tie_breaker

GET /forum/_search
{
  "query": {
    "multi_match": {
        "query":                "java solution",
        "type":                 "best_fields", 
        "fields":               [ "title^2", "content" ],
        "tie_breaker":          0.3,
        "minimum_should_match": "50%" 
    }
  } 
}

GET /forum/_search
{
  "query": {
    "dis_max": {
      "queries":  [
        {
          "match": {
            "title": {
              "query": "java beginner",
              "minimum_should_match": "50%",
	      "boost": 2
            }
          }
        },
        {
          "match": {
            "body": {
              "query": "java beginner",
              "minimum_should_match": "30%"
            }
          }
        }
      ],
      "tie_breaker": 0.3
    }
  } 
}

minimum_should_match，主要是用来干嘛的？
去长尾，long tail
长尾，比如你搜索5个关键词，但是很多结果是只匹配1个关键词的，其实跟你想要的结果相差甚远，这些结果就是长尾
minimum_should_match，控制搜索结果的精准度，只有匹配一定数量的关键词的数据，才能返回

基于multi_match+most fiels策略进行multi-field搜索

从best-fields换成most-fields策略
best-fields策略，主要是说将某一个field匹配尽可能多的关键词的doc优先返回回来
most-fields策略，主要是说尽可能返回更多field匹配到某个关键词的doc，优先返回回来

POST /forum/_mapping
{
  "properties": {
      "sub_title": { 
          "type":     "text",
          "analyzer": "english",
          "fields": {
              "std":   { 
                  "type":     "text",
                  "analyzer": "standard"
              }
          }
      }
  }
}


POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"sub_title" : "learning more courses"} }
{ "update": { "_id": "2"} }
{ "doc" : {"sub_title" : "learned a lot of course"} }
{ "update": { "_id": "3"} }
{ "doc" : {"sub_title" : "we have a lot of fun"} }
{ "update": { "_id": "4"} }
{ "doc" : {"sub_title" : "both of them are good"} }
{ "update": { "_id": "5"} }
{ "doc" : {"sub_title" : "haha, hello world"} }

GET /_analyze
{
  "text": "learning course",
  "analyzer": "english"
}
如果不指定analyzer，默认就是standard分词器，分词为learning 和course；如果分词器是english，则会将learned或者learning 分词为learn；

所以以上返回
{
  "tokens" : [
    {
      "token" : "learn",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "cours",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

如果指定了index，指定了field，则默认会按照field的分词器进行分词

GET /forum/_analyze
{
  "text": "learning course",
  "field": "sub_title"
}
返回
{
  "tokens" : [
    {
      "token" : "learn",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "cours",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

因为我们前面把sub_title 的分词器设置为english了；

查询

GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title": "learning courses"
    }
  }
}

返回
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.7968805,
    "hits" : [
      {
        "_index" : "forum",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.7968805,
        "_source" : {
          "articleID" : "XHDK-A-1293-#fJ3",
          "userID" : 1,
          "hidden" : false,
          "postDate" : "2017-01-01",
          "tag" : [
            "java",
            "hadoop"
          ],
          "view_cnt" : 30,
          "title" : "this is java and elasticsearch blog",
          "content" : "i like to write best elasticsearch article",
          "sub_title" : "learning more courses"
        }
      },
      {
        "_index" : "forum",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.7968805,
        "_source" : {
          "articleID" : "KDKE-B-9947-#kL5",
          "userID" : 1,
          "hidden" : false,
          "postDate" : "2017-01-02",
          "tag" : [
            "java"
          ],
          "view_cnt" : 50,
          "title" : "this is java blog",
          "content" : "i think java is the best programming language",
          "sub_title" : "learned a lot of course"
        }
      }
    ]
  }
}

GET /forum/article/_search
{
   "query": {
        "multi_match": {
            "query":  "learning courses",
            "type":   "most_fields", 
            "fields": [ "sub_title", "sub_title.std" ]
        }
    }
}

（1）best_fields，是对多个field进行搜索，挑选某个field匹配度最高的那个分数，同时在多个query最高分相同的情况下，在一定程度上考虑其他query的分数。简单来说，你对多个field进行搜索，就想搜索到某一个field尽可能包含更多关键字的数据

优点：通过best_fields策略，以及综合考虑其他field，还有minimum_should_match支持，可以尽可能精准地将匹配的结果推送到最前面
缺点：除了那些精准匹配的结果，其他差不多大的结果，排序结果不是太均匀，没有什么区分度了

实际的例子：百度之类的搜索引擎，最匹配的到最前面，但是其他的就没什么区分度了

（2）most_fields，综合多个field一起进行搜索，尽可能多地让所有field的query参与到总分数的计算中来，此时就会是个大杂烩，出现类似best_fields案例最开始的那个结果，结果不一定精准，某一个document的一个field包含更多的关键字，但是因为其他document有更多field匹配到了，所以排在了前面；所以需要建立类似sub_title.std这样的field，尽可能让某一个field精准匹配query string，贡献更高的分数，将更精准匹配的数据排到前面

优点：将尽可能匹配更多field的结果推送到最前面，整个排序结果是比较均匀的
缺点：可能那些精准匹配的结果，无法推送到最前面

实际的例子：wiki，明显的most_fields策略，搜索结果比较均匀，但是的确要翻好几页才能找到最匹配的结果

cross_field

cross-fields搜索，一个唯一标识，跨了多个field。比如一个人，标识，是姓名；一个建筑，它的标识是地址。姓名可以散落在多个field中，比如first_name和last_name中，地址可以散落在country，province，city中。
跨多个field搜索一个标识，比如搜索一个人名，或者一个地址，就是cross-fields搜索
初步来说，如果要实现，可能用most_fields比较合适。因为best_fields是优先搜索单个field最匹配的结果，cross-fields本身就不是一个field的问题了。

POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }

GET /forum/_search
{
  "query": {
    "multi_match": {
      "query":       "Peter Smith",
      "type":        "cross_fields",
      "fields":      [ "author_first_name", "author_last_name" ]
    }
  }
}

copy to

一个人名，本来是first_name，last_name，现在合并成一个full_name

PUT /forum/_mapping
{
  "properties": {
      "new_author_first_name": {
          "type":     "text",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_last_name": {
          "type":     "text",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_full_name": {
          "type":     "text"
      }
  }
}


POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"new_author_first_name" : "Peter", "new_author_last_name" : "Smith"} }		
{ "update": { "_id": "2"} }	
{ "doc" : {"new_author_first_name" : "Smith", "new_author_last_name" : "Williams"} }	
{ "update": { "_id": "3"} }
{ "doc" : {"new_author_first_name" : "Jack", "new_author_last_name" : "Ma"} }		
{ "update": { "_id": "4"} }
{ "doc" : {"new_author_first_name" : "Robbin", "new_author_last_name" : "Li"} }		
{ "update": { "_id": "5"} }
{ "doc" : {"new_author_first_name" : "Tonny", "new_author_last_name" : "Peter Smith"} }		 

GET /forum/_search
{
  "query": {
    "match": {
      "new_author_full_name":       "Peter Smith"
    }
  }
}

用了这个copy_to语法之后，就可以将多个字段的值拷贝到一个字段中，并建立倒排索引

短语匹配近似匹配

1、什么是近似匹配

两个句子

java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark’s programming language and scala is also based on jvm like java.

match query，搜索java spark


{
	"match": {
		"content": "java spark"
	}
}

match query，只能搜索到包含java和spark的document，但是不知道java和spark是不是离的很近

包含java或包含spark，或包含java和spark的doc，都会被返回回来。我们其实并不知道哪个doc，java和spark距离的比较近。如果我们就是希望搜索java spark，中间不能插入任何其他的字符，那这个时候match去做全文检索，能搞定我们的需求吗？答案是，搞不定。

如果我们要尽量让java和spark离的很近的document优先返回，要给它一个更高的relevance score，这就涉及到了proximity match，近似匹配

如果说，要实现两个需求：
1、java spark，就靠在一起，中间不能插入任何其他字符，就要搜索出来这种doc
2、java spark，但是要求，java和spark两个单词靠的越近，doc的分数越高，排名越靠前

要实现上述两个需求，用match做全文检索，是搞不定的，必须得用proximity match，近似匹配

phrase match，proximity match：短语匹配，近似匹配

phrase match 短语匹配

phrase match 就是要去将多个term作为一个短语，一起去搜索，只有包含这个短语的doc才会作为结果返回。不像是match，java spark，java的doc也会返回，spark的doc也会返回。

GET /forum/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}

单单包含java的doc也返回了，不是我们想要的结果

POST /forum/_update/5
{
  "doc": {
   
    "content": "spark is best big data solution based on scala ,an programming language similar to java spark"
  }
}

当采用 
GET /forum/_search 
{
  "query": {
    "term": {
      "content": {
        "value": "java spark"
      }
    }
  }
}
发现么有匹配到，这就是因为term 查询，会将java spark作为整体去倒排索引中去查，而倒排索引中没 java spark，只有单独的java，和spark；

而采用
GET /forum/_search 
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}
可以匹配，是因为match_phrase 会将java spark进行分词，分词为java和spark，去倒排索引中去匹配；
首先先定位文档，寻找同时有java和spark的文档，然后，在看spark的位置是否在java位置的后一位，如果是，则返回

term position

hello world, java spark		doc1
hi, spark java				doc2

hello 		doc1(0)		
wolrd		doc1(1)
java		doc1(2) doc2(2)
spark		doc1(3) doc2(1)

了解什么是分词后的position

GET _analyze
{
  "text": "hello world, java spark",
  "analyzer": "standard"
}

match_phrase的基本原理

索引中的position，match_phrase

hello world, java spark		doc1
hi, spark java				doc2

hello 		doc1(0)		
wolrd		doc1(1)
java		doc1(2) doc2(2)
spark		doc1(3) doc2(1)

java spark --> match phrase

java spark --> java和spark

java --> doc1(2) doc2(2)
spark --> doc1(3) doc2(1)

要找到每个term都在的一个共有的那些doc，就是要求一个doc，必须包含每个term，才能拿出来继续计算

doc1 --> java和spark --> spark position恰巧比java大1 --> java的position是2，spark的position是3，恰好满足条件

doc1符合条件

doc2 --> java和spark --> java position是2，spark position是1，spark position比java position小1，而不是大1 --> 光是position就不满足，那么doc2不匹配

必须理解这块原理！！！！

因为后面的proximity match就是原理跟这个一模一样！！！

match_phrase ,slop 实现近似匹配

GET /_analyze
{
  "text": "i like to write best elasticsearch article."
}
slop表示从第一个分词到下一个分词，最多移动几步；【只要看两个分词之间间隔了几个词就可以了】
GET /forum/_search 
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "i article",
        "slop": 5
      }
    }
  }
}

term查询的字段，最好不要分词，尤其是中文，分词以后，就无法和term 中的参数匹配了

混合使用match和近似匹配实现召回率与精准度的平衡

召回率
比如你搜索一个java spark，总共有100个doc，能返回多少个doc作为结果，就是召回率，recall
精准度

GET /forum/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
               "slop": 50
            }
          }
        }
      ]
    }
  }
}

match和match_phrase 区别

match和phrase match(proximity match)区别

match --> 只要简单的匹配到了一个term，就可以理解将term对应的doc作为结果返回，扫描倒排索引，扫描到了就ok

phrase match --> 首先扫描到所有term的doc list; 找到包含所有term的doc list; 然后对每个doc都计算每个term的position，是否符合指定的范围; slop，需要进行复杂的运算，来判断能否通过slop移动，匹配一个doc

match query的性能比phrase match和proximity match（有slop）要高很多。因为后两者都要计算position的距离。
match query比phrase match的性能要高10倍，比proximity match的性能要高20倍。

但是别太担心，因为es的性能一般都在毫秒级别，match query一般就在几毫秒，或者几十毫秒，而phrase match和proximity match的性能在几十毫秒到几百毫秒之间，所以也是可以接受的。

优化proximity match的性能，一般就是减少要进行proximity match搜索的document数量。主要思路就是，用match query先过滤出需要的数据，然后再用proximity match来根据term距离提高doc的分数，同时proximity match只针对每个shard的分数排名前n个doc起作用，来重新调整它们的分数，这个过程称之为rescoring，重计分。因为一般用户会分页查询，只会看到前几页的数据，所以不需要对所有结果进行proximity match操作。

用我们刚才的说法，match + proximity match同时实现召回率和精准度

默认情况下，match也许匹配了1000个doc，proximity match全都需要对每个doc进行一遍运算，判断能否slop移动匹配上，然后去贡献自己的分数
但是很多情况下，match出来也许1000个doc，其实用户大部分情况下是分页查询的，所以可能最多只会看前几页，比如一页是10条，最多也许就看5页，就是50条
proximity match只要对前50个doc进行slop移动去匹配，去贡献自己的分数即可，不需要对全部1000个doc都去进行计算和贡献分数

rescore：重打分

match：1000个doc，其实这时候每个doc都有一个分数了; proximity match，前50个doc，进行rescore，重打分，即可; 让前50个doc，term举例越近的，排在越前面

GET /forum/_search 
{
  "query": {
    "match": {
      "content": "java spark"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}

前缀搜索、通配符搜索、正则搜索等技术

课程大纲

1、前缀搜索

C3D0-KD345
C3K5-DFG65
C4I8-UI365

C3 --> 上面这两个都搜索出来 --> 根据字符串的前缀去搜索

不用帖子的案例背景，因为比较简单，直接用自己手动建的新索引，给大家演示一下就可以了

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}

GET my_index/my_type/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C3"
      }
    }
  }
}

2、前缀搜索的原理

prefix query不计算relevance score，与prefix filter唯一的区别就是，filter会cache bitset

扫描整个倒排索引，举例说明

前缀越短，要处理的doc越多，性能越差，尽可能用长前缀搜索

前缀搜索，它是怎么执行的？性能为什么差呢？

match

C3-D0-KD345
C3-K5-DFG65
C4-I8-UI365

全文检索

每个字符串都需要被分词

c3			doc1,doc2
d0
kd345
k5
dfg65
c4
i8
ui365

c3 --> 扫描倒排索引 --> 一旦扫描到c3，就可以停了，因为带c3的就2个doc，已经找到了 --> 没有必要继续去搜索其他的term了

match性能往往是很高的

不分词

C3-D0-KD345
C3-K5-DFG65
C4-I8-UI365

c3 --> 先扫描到了C3-D0-KD345，很棒，找到了一个前缀带c3的字符串 --> 还是要继续搜索的，因为后面还有一个C3-K5-DFG65，也许还有其他很多的前缀带c3的字符串 --> 你扫描到了一个前缀匹配的term，不能停，必须继续搜索 --> 直到扫描完整个的倒排索引，才能结束

因为实际场景中，可能有些场景是全文检索解决不了的

C3D0-KD345
C3K5-DFG65
C4I8-UI365

c3d0
kd345

c3 --> match --> 扫描整个倒排索引，能找到吗

c3 --> 只能用prefix

prefix性能很差

3、通配符搜索

跟前缀搜索类似，功能更加强大

C3D0-KD345
C3K5-DFG65
C4I8-UI365

5字符-D任意个字符5

5?-*5：通配符去表达更加复杂的模糊搜索的语义

GET my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "C?K*5"
      }
    }
  }
}

?：任意字符
*：0个或任意多个字符

性能一样差，必须扫描整个倒排索引，才ok

4、正则搜索

GET /my_index/my_type/_search 
{
  "query": {
    "regexp": {
      "title": "C[0-9].+"
    }
  }
}

C[0-9].+

[0-9]：指定范围内的数字
[a-z]：指定范围内的字母
.：一个字符
+：前面的正则表达式可以出现一次或多次

wildcard和regexp，与prefix原理一致，都会扫描整个索引，性能很差

主要是给大家介绍一些高级的搜索语法。在实际应用中，能不用尽量别用。性能太差了。

搜索推荐

hello w --> 搜索

hello world
hello we
hello win
hello wind
hello dog
hello cat

hello w -->

hello world
hello we
hello win
hello wind

搜索推荐的功能

百度 --> elas --> elasticsearch --> elasticsearch权威指南

GET /my_index/my_type/_search 
{
  "query": {
    "match_phrase_prefix": {
      "title": "hello d"
    }
  }
}

原理跟match_phrase类似，唯一的区别，就是把最后一个term作为前缀去搜索

hello就是去进行match，搜索对应的doc
w，会作为前缀，去扫描整个倒排索引，找到所有w开头的doc
然后找到所有doc中，即包含hello，又包含w开头的字符的doc
根据你的slop去计算，看在slop范围内，能不能让hello w，正好跟doc中的hello和w开头的单词的position相匹配

也可以指定slop，但是只有最后一个term会作为前缀

max_expansions：指定prefix最多匹配多少个term，超过这个数量就不继续匹配了，限定性能

默认情况下，前缀要扫描所有的倒排索引中的term，去查找w打头的单词，但是这样性能太差。可以用max_expansions限定，w前缀最多匹配多少个term，就不再继续搜索倒排索引了。

尽量不要用，因为，最后一个前缀始终要去扫描大量的索引，性能可能会很差

match_phrase和edge_ngram&ngram分词器的区别

match_phrase会将检索关键词进行分词，分词结果，必须在检索字段的分词中都包含，而且顺序必须相同，如果没有设置slop 值，默认搜索关键词的各个分词必须在检索字段分词中的 position 是连续的。

此外，除了english，standard分词器以外，es 还自带了ngram、edge_ngram分词器，而且区别：

ngram会细分,如name 会分词成n,na,am,me,但是edge_ngram只会从开头分词,如n,na

PUT my_index
{
  "mappings":{
      "properties": {
        "content": {
            "type": "text",
            "analyzer": "my_edge_ngram"
            }
        }
    },  
  "settings": {
    "analysis": {
      "analyzer": {
          "my_edge_ngram": {
                "tokenizer": "custom_edge_ngram"
          }        
      },
      "tokenizer": {
        "custom_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
          ,"token_chars": [
            "letter",
            "punctuation",
            "symbol",
            "digit"
            ]
        }
      }
    }
  }
}

添加内容

POST my_index/_doc/2
{
    "content": "that isnot a test"
}

我们来看一下，that isnot a test的分词效果

GET /my_index/_analyze
{
  "text": "that isnot a test",
  "field": "content"
}

结果为

{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "th",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "tha",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "that",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "i",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "isn",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "isno",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "isnot",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "a",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "t",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "te",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "tes",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "test",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 13
    }
  ]
}

我们可以看到 is 的postion 是5，a的postion 是9；所以如果match_phrase方式查询is a的话，slop至少要是9-5+1=3才可以；
我们可以看到isnot 的postion 8，a的postion是9，所以他们是连续的，无须设置slop也可以搜到

GET /my_index/_search 
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "isnot a "
     
      }
    }
  }
}

GET /my_index/_search 
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "is a ",
        "slop": 3  #少于3就检索不到了
      }
    }
  }
}

结合filter实现的edge_ngram

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "ik_max_word",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}


GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "SWJG_DM 我爱你中国"
}

响应结果

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "sw",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "swj",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "swjg",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "swjg_",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "swjg_d",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "swjg_dm",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "sw",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "swj",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "swjg",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "d",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "ENGLISH",
      "position" : 2
    },
    {
      "token" : "dm",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "ENGLISH",
      "position" : 2
    },
    {
      "token" : "我",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "我爱",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "我爱你",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "爱",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "爱你",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "中",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "中国",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

而我们想字段分词采用自定的"edge_ngram",但是检索关键词直接用ik_max_word分词器

PUT /my_index/_mapping
{
  "properties": {
      "title": {
          "type":     "text",
          "analyzer": "autocomplete",
          "search_analyzer": "ik_max_word"
      }
  }
}


POST my_index/_doc/2
{
    "title": "hello world SWJG_DM"
}

POST my_index/_doc/3
{
    "title": "我爱你中国 YHUUID"
}

GET /my_index/_search 
{
  "query": {
    "match": {
      "title": "hel w yh 我"
    }
  }
}

我们惊喜的发现，都可以很好的匹配，这个自定义分词就是我业务上所需要的，哈哈

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.0332315,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.0332315,
        "_source" : {
          "title" : "我爱你中国 YHUUID"
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.9676435,
        "_source" : {
          "title" : "hello world SWJG_DM"
        }
      }
    ]
  }
}

这个分词器也可以实现自动完成功能了

四种常见的相关度分数优化方式

1、query-time boost

GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "java spark",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}

2.重构查询结果，在es新版本中，影响越来越小了。一般情况下，没什么必要的话，大家不用也行。

GET /forum/article/_search 
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": "java"
          }
        },
        {
          "match": {
            "content": "spark"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "content": "solution"
                }
              },
              {
                "match": {
                  "content": "beginner"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

3、negative boost

搜索包含java，不包含spark的doc，但是这样子很死板
搜索包含java，尽量不包含spark的doc，如果包含了spark，不会说排除掉这个doc，而是说将这个doc的分数降低
包含了negative term的doc，分数乘以negative boost，分数降低

GET /forum/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "content": "spark"
          }
        }
      ]
    }
  }
}



GET /forum/_search 
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "java"
        }
      },
      "negative": {
        "match": {
          "content": "spark"
        }
      },
      "negative_boost": 0.2
    }
  }
}
negative的doc，会乘以negative_boost，降低分数

4、constant_score
如果你压根儿不需要相关度评分，直接走constant_score加filter，所有的doc分数都是1，没有评分的概念了

GET /forum/article/_search 
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "query": {
              "match": {
                "title": "java"
              }
            }
          }
        },
        {
          "constant_score": {
            "query": {
              "match": {
                "title": "spark"
              }
            }
          }
        }
      ]
    }
  }
}

拼写错误时的fuzzy搜索

analyzer 和search_analyzer

如果想要在创建索引和查询时分别使⽤不同的分词器，ElasticSearch也是⽀持的。
在创建索引，指定analyzer，ES在创建时会先检查是否设置了analyzer字段，如果没定义就⽤ES预设的
在查询时，指定search_analyzer，ES查询时会先检查是否设置了search_analyzer字段，如果没有设置，还会去检查创建索引时是否
指定了analyzer，还是没有还设置才会去使⽤ES预设的
ES分析器主要有两种情况会被使⽤：
插⼊⽂档时，将text类型的字段做分词然后插⼊倒排索引，此时就可能⽤到analyzer指定的分词器
在查询时，先对要查询的text类型的输⼊做分词，再去倒排索引搜索，此时就可能⽤到search_analyzer指定的分词器

PUT /mapping_analyzer
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "ik_smart"
      },
      "desc": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "ik_smart"
      }
    }
  }

计算分数不准确原因

1、多shard场景下relevance score不准确问题大揭秘

如果你的一个index有多个shard的话，可能搜索结果会不准确

图解

2、如何解决该问题？

（1）生产环境下，数据量大，尽可能实现均匀分配

数据量很大的话，其实一般情况下，在概率学的背景下，es都是在多个shard中均匀路由数据的，路由的时候根据_id，负载均衡
比如说有10个document，title都包含java，一共有5个shard，那么在概率学的背景下，如果负载均衡的话，其实每个shard都应该有2个doc，title包含java
如果说数据分布均匀的话，其实就没有刚才说的那个问题了

（2）测试环境下，将索引的primary shard设置为1个，number_of_shards=1，index settings

如果说只有一个shard，那么当然，所有的document都在这个shard里面，就没有这个问题了

（3）测试环境下，搜索附带search_type=dfs_query_then_fetch参数，会将local IDF取出来计算global IDF

计算一个doc的相关度分数的时候，就会将所有shard对的local IDF计算一下，获取出来，在本地进行global IDF分数的计算，会将所有shard的doc作为上下文来进行计算，也能确保准确性。但是production生产环境下，不推荐这个参数，因为性能很差。

啥是倒排索引


（1）在倒排索引中查找搜索串，获取document list


date来举例

word		doc1		doc2		doc3

2017-01-01	*		*
2017-02-02			*		*
2017-03-03	*		*		*

filter：2017-02-02

到倒排索引中一找，发现2017-02-02对应的document list是doc2,doc3

（2）为每个在倒排索引中搜索到的结果，构建一个bitset，[0, 0, 0, 1, 0, 1]

非常重要

使用找到的doc list，构建一个bitset，就是一个二进制的数组，数组每个元素都是0或1，用来标识一个doc对一个filter条件是否匹配，如果匹配就是1，不匹配就是0

[0, 1, 1]

doc1：不匹配这个filter的
doc2和do3：是匹配这个filter的

尽可能用简单的数据结构去实现复杂的功能，可以节省内存空间，提升性能

（3）遍历每个过滤条件对应的bitset，优先从最稀疏的开始搜索，查找满足所有条件的document

后面会讲解，一次性其实可以在一个search请求中，发出多个filter条件，每个filter条件都会对应一个bitset
遍历每个filter条件对应的bitset，先从最稀疏的开始遍历

[0, 0, 0, 1, 0, 0]：比较稀疏
[0, 1, 0, 1, 0, 1]

先遍历比较稀疏的bitset，就可以先过滤掉尽可能多的数据

遍历所有的bitset，找到匹配所有filter条件的doc

请求：filter，postDate=2017-01-01，userID=1

postDate: [0, 0, 1, 1, 0, 0]
userID:   [0, 1, 0, 1, 0, 1]

遍历完两个bitset之后，找到的匹配所有条件的doc，就是doc4

就可以将document作为结果返回给client了

（4）caching bitset，跟踪query，在最近256个query中超过一定次数的过滤条件，缓存其bitset。对于小segment（<1000，或<3%），不缓存bitset。

比如postDate=2017-01-01，[0, 0, 1, 1, 0, 0]，可以缓存在内存中，这样下次如果再有这个条件过来的时候，就不用重新扫描倒排索引，反复生成bitset，可以大幅度提升性能。

在最近的256个filter中，有某个filter超过了一定的次数，次数不固定，就会自动缓存这个filter对应的bitset

segment（上半季），filter针对小segment获取到的结果，可以不缓存，segment记录数<1000，或者segment大小<index总大小的3%

segment数据量很小，此时哪怕是扫描也很快；segment会在后台自动合并，小segment很快就会跟其他小segment合并成大segment，此时就缓存也没有什么意义，segment很快就消失了

针对一个小segment的bitset，[0, 0, 1, 0]

filter比query的好处就在于会caching，但是之前不知道caching的是什么东西，实际上并不是一个filter返回的完整的doc list数据结果。而是filter bitset缓存起来。下次不用扫描倒排索引了。

（5）filter大部分情况下来说，在query之前执行，先尽量过滤掉尽可能多的数据

query：是会计算doc对搜索条件的relevance score，还会根据这个score去排序
filter：只是简单过滤出想要的数据，不计算relevance score，也不排序

（6）如果document有新增或修改，那么cached bitset会被自动更新

postDate=2017-01-01，[0, 0, 1, 0]
document，id=5，postDate=2017-01-01，会自动更新到postDate=2017-01-01这个filter的bitset中，全自动，缓存会自动更新。postDate=2017-01-01的bitset，[0, 0, 1, 0, 1]
document，id=1，postDate=2016-12-30，修改为postDate-2017-01-01，此时也会自动更新bitset，[1, 0, 1, 0, 1]

（7）以后只要是有相同的filter条件的，会直接来使用这个过滤条件对应的cached bitset