2024-03-13
原文作者:吴声子夜歌 原文地址: https://blog.csdn.net/cold___play/article/details/133772481

搜索

1、搜索入门

搜索分为两个过程:

  1. 当向索引中保存文档时,默认情况下,es 会保存两份内容,一份是 _source 中的数据,另一份则是通过分词、排序等一系列过程生成的倒排索引文件,倒排索引中保存了词项和文档之间的对应关系。
  2. 搜索时,当 es 接收到用户的搜索请求之后,就会去倒排索引中查询,通过的倒排索引中维护的倒排记录表找到关键词对应的文档集合,然后对文档进行评分、排序、高亮等处理,处理完成后返回文档。

2、简单搜索

2.1、match_all——查询所有

    GET /bank/_search
    {
      "query": {
        "match_all": {}
      }
    }

简写:

    GET /bank/_search

结果:

202403132037129611.png

因为没有设置查询条件,所有最大的得分是 1.0。

这里并没有把所有的数据都展示出来,因为默认是有分页功能的。

2.2、term——词项查询

即 term 查询,就是根据词去查询,查询指定字段中包含给定单词的文档,term 查询不被解析,只有搜索的词和文档中的词精确匹配,才会返回文档。应用场景如:人名、地名等等。

    GET /bank/_search
    {
      "query": {
        "term": {
          "city.keyword": {
            "value": "Brogan"
          }
        }
      }
    }

结果:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 6.5032897,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 6.5032897,
            "_source" : {
              "account_number" : 1,
              "balance" : 39225,
              "firstname" : "Amber",
              "lastname" : "Duke",
              "age" : 32,
              "gender" : "M",
              "address" : "880 Holmes Lane",
              "employer" : "Pyrami",
              "email" : "amberduke@pyrami.com",
              "city" : "Brogan",
              "state" : "IL"
            }
          }
        ]
      }
    }

2.3、from/size——分页

默认返回前 10 条数据,es 中也可以像关系型数据库一样,给一个分页参数:

from:从第几条开始。
size:多少条数据。

    GET /bank/_search
    {
      "query": {
        "term": {
          "age": {
            "value": 32
          }
        }
      },
      "from": 0,
      "size": 2
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 52,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 1,
              "balance" : 39225,
              "firstname" : "Amber",
              "lastname" : "Duke",
              "age" : 32,
              "gender" : "M",
              "address" : "880 Holmes Lane",
              "employer" : "Pyrami",
              "email" : "amberduke@pyrami.com",
              "city" : "Brogan",
              "state" : "IL"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "56",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 56,
              "balance" : 14992,
              "firstname" : "Josie",
              "lastname" : "Nelson",
              "age" : 32,
              "gender" : "M",
              "address" : "857 Tabor Court",
              "employer" : "Emtrac",
              "email" : "josienelson@emtrac.com",
              "city" : "Sunnyside",
              "state" : "UT"
            }
          }
        ]
      }
    }

2.4、_source——过滤返回字段

如果返回的字段比较多,又不需要这么多字段,此时可以指定返回的字段:

    GET /bank/_search
    {
      "query": {
        "term": {
          "age": {
            "value": 32
          }
        }
      },
      "from": 0,
      "size": 2,
      "_source": ["firstname", "lastname"]
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 52,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "firstname" : "Amber",
              "lastname" : "Duke"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "56",
            "_score" : 1.0,
            "_source" : {
              "firstname" : "Josie",
              "lastname" : "Nelson"
            }
          }
        ]
      }
    }

2.5、min_score——最小评分

有的文档得分特别低,说明这个文档和我们查询的关键字相关度很低。我们可以设置一个最低分,只有得分超过最低分的文档才会被返回。

    GET /bank/_search
    {
      "query": {
        "match": {
          "address": "Street"
        }
      },
      "min_score": 0.9
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 385,
          "relation" : "eq"
        },
        "max_score" : 0.95395315,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "6",
            "_score" : 0.95395315,
            "_source" : {
              "account_number" : 6,
              "balance" : 5686,
              "firstname" : "Hattie",
              "lastname" : "Bond",
              "age" : 36,
              "gender" : "M",
              "address" : "671 Bristol Street",
              "employer" : "Netagy",
              "email" : "hattiebond@netagy.com",
              "city" : "Dante",
              "state" : "TN"
            }
          },
          ...
        ]
      }
    }

2.6、highlight——高亮

查询关键字高亮:

    GET /bank/_search
    {
      "query": {
        "term": {
          "city.keyword": {
            "value": "Brogan"
          }
        }
      },
      "highlight": {
        "fields": {"city.keyword": {}}
      }
    }

返回:

    {
      "took" : 59,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 6.5032897,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 6.5032897,
            "_source" : {
              "account_number" : 1,
              "balance" : 39225,
              "firstname" : "Amber",
              "lastname" : "Duke",
              "age" : 32,
              "gender" : "M",
              "address" : "880 Holmes Lane",
              "employer" : "Pyrami",
              "email" : "amberduke@pyrami.com",
              "city" : "Brogan",
              "state" : "IL"
            },
            "highlight" : {
              "city.keyword" : [
                "<em>Brogan</em>"
              ]
            }
          }
        ]
      }
    }

3、全文搜索

3.1、match query——分词查询

match query 会对查询语句进行分词,分词后,如果查询语句中的任何一个词项被匹配,则文档就会被索引到。

    GET /bank/_search
    {
      "query": {
        "match": {
          "address": "Bristol Street"
        }
      },
      "from": 0,
      "size": 2
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 385,
          "relation" : "eq"
        },
        "max_score" : 7.455468,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "6",
            "_score" : 7.455468,
            "_source" : {
              "account_number" : 6,
              "balance" : 5686,
              "firstname" : "Hattie",
              "lastname" : "Bond",
              "age" : 36,
              "gender" : "M",
              "address" : "671 Bristol Street",
              "employer" : "Netagy",
              "email" : "hattiebond@netagy.com",
              "city" : "Dante",
              "state" : "TN"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 0.95395315,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          }
        ]
      }
    }

Bristol Street只要能有一个词能匹配,这条记录就算是相关记录会返回来。如果想要两个词都包含,那么可以使用 operator 的 and (默认是 or):

    GET /bank/_search
    {
      "query": {
        "match": {
          "address": {
            "query": "Bristol Street",
            "operator": "and"
          }
        }
      },
      "from": 0,
      "size": 2
    }

返回:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 7.455468,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "6",
            "_score" : 7.455468,
            "_source" : {
              "account_number" : 6,
              "balance" : 5686,
              "firstname" : "Hattie",
              "lastname" : "Bond",
              "age" : 36,
              "gender" : "M",
              "address" : "671 Bristol Street",
              "employer" : "Netagy",
              "email" : "hattiebond@netagy.com",
              "city" : "Dante",
              "state" : "TN"
            }
          }
        ]
      }
    }

3.2、match_phrase query——分词且有序

match_phrase query 也会对查询的关键字进行分词,但是它分词后有两个特点:

  1. 分词后的词项顺序必须和文档中词项的顺序一致
  2. 所有的词都必须出现在文档中
    GET /bank/_search
    {
      "query": {
        "match_phrase": {
          "address": {
            "query": "671 street",
            "slop": 1
          }
        }
      }
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 4.1140327,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "206",
            "_score" : 4.1140327,
            "_source" : {
              "account_number" : 206,
              "balance" : 47423,
              "firstname" : "Kelli",
              "lastname" : "Francis",
              "age" : 20,
              "gender" : "M",
              "address" : "671 George Street",
              "employer" : "Exoswitch",
              "email" : "kellifrancis@exoswitch.com",
              "city" : "Babb",
              "state" : "NJ"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "6",
            "_score" : 4.1140327,
            "_source" : {
              "account_number" : 6,
              "balance" : 5686,
              "firstname" : "Hattie",
              "lastname" : "Bond",
              "age" : 36,
              "gender" : "M",
              "address" : "671 Bristol Street",
              "employer" : "Netagy",
              "email" : "amberduke@pyrami.com",
              "city" : "Dante",
              "state" : "TN"
            }
          }
        ]
      }
    }

query 是查询的关键字,会被分词器进行分解,分解之后去倒排索引中进行匹配。

slop 是指关键字之间的最小距离,但是注意不是关键之间间隔的字数。文档中的字段被分词器解析之后,解析出来的词项都包含一个 position 字段表示词项的位置,查询短语分词之后 的 position 之间的间隔要满足 slop 的要求。

    PUT /b
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
    
    PUT /b/_doc/1
    {
      "title": "普通高等教育【十一五】国家级规划教材:大学计算机基础"
    }
    
    GET /b/_search
    {
      "query": {
        "match_phrase": {
          "title": {
            "query": "十一五计算机"
          }
        }
      }
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      }
    }

因为十一五计算机分词之后为十一五计算机。两个term之间有分隔,所以并没有搜索到:

    POST /_analyze
    {
      "analyzer": "ik_smart",
      "text": "普通高等教育【十一五】国家级规划教材:大学计算机基础"
    }
    {
      "tokens" : [
        {
          "token" : "普通",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "高等教育",
          "start_offset" : 2,
          "end_offset" : 6,
          "type" : "CN_WORD",
          "position" : 1
        },
        {
          "token" : "十一五",
          "start_offset" : 7,
          "end_offset" : 10,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "国家级",
          "start_offset" : 11,
          "end_offset" : 14,
          "type" : "CN_WORD",
          "position" : 3
        },
        {
          "token" : "规划",
          "start_offset" : 14,
          "end_offset" : 16,
          "type" : "CN_WORD",
          "position" : 4
        },
        {
          "token" : "教材",
          "start_offset" : 16,
          "end_offset" : 18,
          "type" : "CN_WORD",
          "position" : 5
        },
        {
          "token" : "大学",
          "start_offset" : 19,
          "end_offset" : 21,
          "type" : "CN_WORD",
          "position" : 6
        },
        {
          "token" : "计算机",
          "start_offset" : 21,
          "end_offset" : 24,
          "type" : "CN_WORD",
          "position" : 7
        },
        {
          "token" : "基础",
          "start_offset" : 24,
          "end_offset" : 26,
          "type" : "CN_WORD",
          "position" : 8
        }
      ]
    }

十一五的position为2,计算机的position为7,中间还有3,4,5,6四个,所以slot最少为4,才可以查询到:

    GET /b/_search
    {
      "query": {
        "match_phrase": {
          "title": {
            "query": "十一五计算机",
            "slop": 4
          }
        }
      }
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.18082869,
        "hits" : [
          {
            "_index" : "b",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.18082869,
            "_source" : {
              "title" : "普通高等教育【十一五】国家级规划教材:大学计算机基础"
            }
          }
        ]
      }
    }

3.3、match_pharse_prefix query(效率低)

这个类似于 match_phrase query,只不过这里多了一个通配符,match_phrase_prefix 支持最后一个词项的前缀匹配,但是由于这种匹配方式效率较低,因此大家作为了解即可。

    GET /b/_search
    {
      "query": {
        "match_phrase_prefix": {
          "title": {
            "query": "普通"
          }
        }
      }
    }

这个查询过程,会自动进行单词匹配,会自动查找以普通开始的单词,默认是 50 个,可以自己控制:

    GET /b/_search
    {
      "query": {
        "match_phrase_prefix": {
          "title": {
            "query": "普通",
            "max_expansions": 1
          }
        }
      }
    }

match_phrase_prefix 是针对分片级别的查询,假设 max_expansions 为 1,可能返回多个文档,但是只有一个词,这是我们预期的结果。有的时候实际返回结果和我们预期结果并不一致,原因在于这个查询是分片级别的,不同的分片确实只返回了一个词,但是结果可能来自不同的分片,所以最终会看到多个词。

3.4、multi_match query——多字段

match 查询的升级版,可以指定多个查询域(意思就是查询多个字段):

    GET /bank/_search
    {
      "query": {
        "multi_match": {
          "query": "street",
          "fields": ["address", "email"]
        }
      },
      "from": 0,
      "size": 2
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 385,
          "relation" : "eq"
        },
        "max_score" : 0.95335925,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 0.95335925,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "32",
            "_score" : 0.95335925,
            "_source" : {
              "account_number" : 32,
              "balance" : 48086,
              "firstname" : "Dillard",
              "lastname" : "Mcpherson",
              "age" : 34,
              "gender" : "F",
              "address" : "702 Quentin Street",
              "employer" : "Quailcom",
              "email" : "dillardmcpherson@quailcom.com",
              "city" : "Veguita",
              "state" : "IN"
            }
          }
        ]
      }
    }

这种查询方式还可以指定字段的权重:

    GET /bank/_search
    {
      "query": {
        "multi_match": {
          "query": "street",
          "fields": ["address^10", "email"]
        }
      },
      "from": 0,
      "size": 2
    }

3.5、query_string query——Lucene查询

query_string 是一种紧密结合 Lucene 的查询方式,在一个查询语句中可以用到 Lucene 的一些查询语法。

query_string查询解析输入并围绕运算符拆分文本。每个文本部分都是独立分析的。

    GET /bank/_search
    {
      "query": {
        "query_string": {
          "default_field": "address",
          "query": "(702) AND (Street)"
        }
      }
    }

返回:

    {
      "took" : 15,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 6.946187,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "32",
            "_score" : 6.946187,
            "_source" : {
              "account_number" : 32,
              "balance" : 48086,
              "firstname" : "Dillard",
              "lastname" : "Mcpherson",
              "age" : 34,
              "gender" : "F",
              "address" : "702 Quentin Street",
              "employer" : "Quailcom",
              "email" : "dillardmcpherson@quailcom.com",
              "city" : "Veguita",
              "state" : "IN"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "361",
            "_score" : 6.946187,
            "_source" : {
              "account_number" : 361,
              "balance" : 23659,
              "firstname" : "Noreen",
              "lastname" : "Shelton",
              "age" : 36,
              "gender" : "M",
              "address" : "702 Tillary Street",
              "employer" : "Medmex",
              "email" : "noreenshelton@medmex.com",
              "city" : "Derwood",
              "state" : "NH"
            }
          }
        ]
      }
    }

语法非常灵活,类似与SQL方式where子句的写法,以下是各种场景的应用示例

3.5.1、指定匹配多字段
    GET /bank/_search
    {
      "query": {
        "query_string": {
          "fields": ["employer", "firstname"],
          "query": "Reversus OR Nanette"
        }
      }
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 6.507941,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "37",
            "_score" : 6.507941,
            "_source" : {
              "account_number" : 37,
              "balance" : 18612,
              "firstname" : "Mcgee",
              "lastname" : "Mooney",
              "age" : 39,
              "gender" : "M",
              "address" : "826 Fillmore Place",
              "employer" : "Reversus",
              "email" : "mcgeemooney@reversus.com",
              "city" : "Tooleville",
              "state" : "OK"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 6.5052857,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          }
        ]
      }
    }
3.5.2、指定单个字段
    GET /bank/_search
    {
      "query": {
        "query_string": {
          "query": "(employer:Reversus) OR (firstname:Nanette)"
        }
      }
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 6.507941,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "37",
            "_score" : 6.507941,
            "_source" : {
              "account_number" : 37,
              "balance" : 18612,
              "firstname" : "Mcgee",
              "lastname" : "Mooney",
              "age" : 39,
              "gender" : "M",
              "address" : "826 Fillmore Place",
              "employer" : "Reversus",
              "email" : "mcgeemooney@reversus.com",
              "city" : "Tooleville",
              "state" : "OK"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 6.5052857,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          }
        ]
      }
    }
3.5.3、加权查询

对字段加权,由于几个查询是从单个搜索词生成的,因此使用dis_max查询和一个连接中断器自动组合它们,例如(名称使用^5符号加5)。

    GET /bank/_search
    {
      "query": {
        "query_string": {
          "fields": ["employer", "firstname^5"],
          "query": "Reversus OR Nanette"
        }
      }
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 32.52643,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 32.52643,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "37",
            "_score" : 6.507941,
            "_source" : {
              "account_number" : 37,
              "balance" : 18612,
              "firstname" : "Mcgee",
              "lastname" : "Mooney",
              "age" : 39,
              "gender" : "M",
              "address" : "826 Fillmore Place",
              "employer" : "Reversus",
              "email" : "mcgeemooney@reversus.com",
              "city" : "Tooleville",
              "state" : "OK"
            }
          }
        ]
      }
    }
3.5.4、通配符查询
    GET /bank/_search
    {
      "query": {
        "query_string": {
          "fields": ["*name"],
          "query": "Nanette OR Bates"
        }
      }
    }

返回:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 13.0105715,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 13.0105715,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          }
        ]
      }
    }
3.5.5、范围查询
    GET /bank/_search
    {
      "query": {
        "query_string": {
          "fields": ["age"],
          "query": "[28 TO 29]"
        }
      }
    }

3.6、simple_query_string——query_string升级版

这个是 query_string 的升级,可以直接使用 +、|、- 代替 AND、OR、NOT 等。

    GET /bank/_search
    {
      "query": {
        "simple_query_string": {
          "fields": ["address"],
          "query": "lake + street"
        }
      }
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 6.6098056,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "537",
            "_score" : 6.6098056,
            "_source" : {
              "account_number" : 537,
              "balance" : 31069,
              "firstname" : "Morin",
              "lastname" : "Frost",
              "age" : 29,
              "gender" : "M",
              "address" : "910 Lake Street",
              "employer" : "Primordia",
              "email" : "morinfrost@primordia.com",
              "city" : "Rivera",
              "state" : "DE"
            }
          }
        ]
      }
    }

3.7、term query——词项查询

词项查询。词项查询不会分析查询字符,直接拿查询字符去倒排索引中比对。

    GET /bank/_search
    {
      "query": {
        "term": {
          "balance": {
            "value": 31069
          }
        }
      }
    }

3.8、terms query——多词项查询

词项查询,但是可以给多个关键词。

    GET /bank/_search
    {
      "query": {
        "terms": {
          "balance": [
            31069,
            48086
          ]
        }
      }
    }

3.9、range query——范围查询

范围查询,可以按照日期范围、数字范围等查询。

range query 中的参数主要有四个:

  • gt:大于
  • lt:小于
  • gte:大于等于
  • lte:小于等于
    GET /bank/_search
    {
      "query": {
        "range": {
          "age": {
            "gte": 28,
            "lte": 29
          }
        }
      }
    }

3.10、exists query——非空查询

exists query 会返回指定字段中至少有一个非空值的文档:

    GET /bank/_search
    {
      "query": {
        "exists": {
          "field": "address"
        }
      }
    }

注意,空字符串也是有值。null 是空值。

3.11、prefix query——前缀查询

前缀查询,效率略低,除非必要,一般不太建议使用。

给定关键词的前缀去查询:

    GET /bank/_search
    {
      "query": {
        "prefix": {
          "firstname": {
            "value": "na"
          }
        }
      }
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 4,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "727",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 727,
              "balance" : 27263,
              "firstname" : "Natasha",
              "lastname" : "Knapp",
              "age" : 36,
              "gender" : "M",
              "address" : "723 Hubbard Street",
              "employer" : "Exostream",
              "email" : "natashaknapp@exostream.com",
              "city" : "Trexlertown",
              "state" : "LA"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "158",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 158,
              "balance" : 9380,
              "firstname" : "Natalie",
              "lastname" : "Mcdowell",
              "age" : 27,
              "gender" : "M",
              "address" : "953 Roder Avenue",
              "employer" : "Myopium",
              "email" : "nataliemcdowell@myopium.com",
              "city" : "Savage",
              "state" : "ND"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "827",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 827,
              "balance" : 37536,
              "firstname" : "Naomi",
              "lastname" : "Ball",
              "age" : 29,
              "gender" : "F",
              "address" : "319 Stewart Street",
              "employer" : "Isotronic",
              "email" : "naomiball@isotronic.com",
              "city" : "Trona",
              "state" : "NM"
            }
          }
        ]
      }
    }

3.12、wildcard query——通配符查询

wildcard query 即通配符查询。支持单字符和多字符通配符:

  • 表示一个任意字符。
  • * 表示零个或者多个字符。
    GET /bank/_search
    {
      "query": {
        "wildcard": {
          "firstname": {
            "value": "na???te"
          }
        }
      }
    }

返回:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          }
        ]
      }
    }

3.13、regexp query——正则表达式查询

支持正则表达式查询。

    GET /bank/_search
    {
      "query": {
        "regexp": {
          "firstname": {
            "value": "na.*"
          }
        }
      }
    }

返回:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 4,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "727",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 727,
              "balance" : 27263,
              "firstname" : "Natasha",
              "lastname" : "Knapp",
              "age" : 36,
              "gender" : "M",
              "address" : "723 Hubbard Street",
              "employer" : "Exostream",
              "email" : "natashaknapp@exostream.com",
              "city" : "Trexlertown",
              "state" : "LA"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "158",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 158,
              "balance" : 9380,
              "firstname" : "Natalie",
              "lastname" : "Mcdowell",
              "age" : 27,
              "gender" : "M",
              "address" : "953 Roder Avenue",
              "employer" : "Myopium",
              "email" : "nataliemcdowell@myopium.com",
              "city" : "Savage",
              "state" : "ND"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "827",
            "_score" : 1.0,
            "_source" : {
              "account_number" : 827,
              "balance" : 37536,
              "firstname" : "Naomi",
              "lastname" : "Ball",
              "age" : 29,
              "gender" : "F",
              "address" : "319 Stewart Street",
              "employer" : "Isotronic",
              "email" : "naomiball@isotronic.com",
              "city" : "Trona",
              "state" : "NM"
            }
          }
        ]
      }
    }

3.14、fuzzy query——模糊查询

在实际搜索中,有时我们可能会打错字,从而导致搜索不到,在 match query 中,可以通过 fuzziness属性实现模糊查询。

fuzzy query 返回与搜索关键字相似的文档。怎么样就算相似?以LevenShtein 编辑距离为准。编辑距离是指将一个字符变为另一个字符所需要更改字符的次数,更改主要包括四种:

  • 更改字符( javb–〉java )
  • 删除字符( javva–〉java )
  • 插入字符( jaa–〉java )
  • 转置字符( ajva–〉java )

为了找到相似的词,模糊查询会在指定的编辑距离中创建搜索关键词的所有可能变化或者扩展的集合,然后进行搜索匹配:

    GET /bank/_search
    {
      "query": {
        "fuzzy": {
          "employer": {
            "value": "Qaility"
          }
        }
      }
    }

返回:

    {
      "took" : 18,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 4.648529,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "13",
            "_score" : 4.648529,
            "_source" : {
              "account_number" : 13,
              "balance" : 32838,
              "firstname" : "Nanette",
              "lastname" : "Bates",
              "age" : 28,
              "gender" : "F",
              "address" : "789 Madison Street",
              "employer" : "Quility",
              "email" : "nanettebates@quility.com",
              "city" : "Nogal",
              "state" : "VA"
            }
          }
        ]
      }
    }

3.15、ids query——多id查询

    GET /bank/_search
    {
      "query": {
        "ids": {
          "values": [1,2,3]
        }
      }
    }

4、复合查询

4.1、constant_score query——无关词频的查询

当我们不关心检索词项的频率(TF)对搜索结果排序的影响时,可以使用 constant_score 将查询语句或者过滤语句包裹起来。

比如:Chillium 出现 10 次跟出现 1 次是一样的,那么就可以这么做:

    GET /bank/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "match": {
              "employer": "Chillium"
            }
          },
          "boost": 1.2
        }
      }
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.2,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 1.2,
            "_source" : {
              "account_number" : 2,
              "balance" : 28838,
              "firstname" : "Roberta",
              "lastname" : "Bender",
              "age" : 22,
              "gender" : "F",
              "address" : "560 Kingsway Place",
              "employer" : "Chillium",
              "email" : "robertabender@chillium.com",
              "city" : "Bennett",
              "state" : "LA"
            }
          }
        ]
      }
    }

4.2、bool query——逻辑组合查询

bool query 可以将任意多个简单查询组装在一起,有四个关键字可供选择,四个关键字所描述的条件可以有一个或者多个。

  • must:文档必须匹配 must 选项下的查询条件。
  • should:文档可以匹配 should 下的查询条件,也可以不匹配。
  • must_not:文档必须不满足 must_not 选项下的查询条件。
  • filter:类似于 must,但是 filter 不评分,只是过滤数据。
    GET /bank/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "address": "street"
              }
            }
          ],
          "should": [
            {
              "term": {
                "balance": {
                  "value": 1
                }
              }
            }
          ],
          "must_not": [
            {
              "range": {
                "age": {
                  "gte": 20,
                  "lte": 39
                }
              }
            }
          ]
        }
      },
      "from": 0,
      "size": 2
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 18,
          "relation" : "eq"
        },
        "max_score" : 0.95335925,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "291",
            "_score" : 0.95335925,
            "_source" : {
              "account_number" : 291,
              "balance" : 19955,
              "firstname" : "Lynn",
              "lastname" : "Pollard",
              "age" : 40,
              "gender" : "F",
              "address" : "685 Pierrepont Street",
              "employer" : "Slambda",
              "email" : "lynnpollard@slambda.com",
              "city" : "Mappsville",
              "state" : "ID"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "878",
            "_score" : 0.95335925,
            "_source" : {
              "account_number" : 878,
              "balance" : 49159,
              "firstname" : "Battle",
              "lastname" : "Blackburn",
              "age" : 40,
              "gender" : "F",
              "address" : "234 Hendrix Street",
              "employer" : "Zilphur",
              "email" : "battleblackburn@zilphur.com",
              "city" : "Wanamie",
              "state" : "PA"
            }
          }
        ]
      }
    }

这里还涉及到一个关键字, minmum_should_match 参数。

minmum_should_match 参数在 es 官网上称作最小匹配度。在之前学习的 multi_match 或者这里的should 查询中,都可以设置 minmum_should_match 参数。表示最少需要匹配的词项个数。

    GET /bank/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "term": {
                "address": {
                  "value": "madison"
                }
              }
            },
            {
              "term": {
                "address": {
                  "value": "street"
                }
              }
            }
          ],
          "minimum_should_match": 2
        }
      }
    }
    
    //等价于
    
    GET /bank/_search
    {
      "query": {
        "match": {
          "address": {
            "query": "madison street",
            "minimum_should_match": 2 
          }
        }
      }
    }

4.3、dis_max query——分离最大化查询

假设现在有两本书:

    PUT /blog 
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "ik_max_word"
          },
          "content": {
            "type": "text",
            "analyzer": "ik_max_word"
          }
        }
      }
    }
    
    POST /blog/_doc 
    {
      "title": "如何通过Java代码代码调用Elasticsearch",
      "content": "松哥力荐,这是一篇很好的解决方案"
    }
    
    POST /blog/_doc
    {
      "title": "初识MongoDB",
      "content": "简单介绍一下MongoDB,以及如何通过Java调用MongoDB,MongoDB是一个NoSQL解决方案"
    }

现在假设搜索Java解决方案关键字,但是不确定关键字是在title还是在content,所以两者都搜索:

    GET /blog/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "title": "java解决方案"
              }
            },
            {
              "match": {
                "content": "java解决方案"
              }
            }
          ]
        }
      }
    }

202403132037134422.png

肉眼观察,第二个的 comtent 语句有两个关键字,第一个只是 title 和 content 各有一个关键字,感觉第二个和查询关键字相似度更高。但是实际查询结果并非这样。

要理解这个原因,我们需要来看下 should query 中的评分策略:

  1. 首先会执行 should 中的两个查询。
  2. 对两个查询结果的评分求和。
  3. 对求和结果乘以匹配语句总数。
  4. 在对第三步的结果除以所有语句总数。

反映到具体的查询中:

前者:

  1. title 中 包含 java,假设评分是 1.1
  2. content 中包含解决方案,假设评分是 1.2
  3. 有得分的 query 数量,这里是 2
  4. 总的 query 数量也是 2

最终结果: (1.1+1.2)*2/2=2.3

后者:

  1. title 中 不包含查询关键字,没有得分
  2. content 中包含解决方案和 java,假设评分是 2
  3. 有得分的 query 数量,这里是 1
  4. 总的 query 数量也是 2

最终结果: 2*1/2=1

在这种查询中,title 和 content 相当于是相互竞争的关系,所以我们需要找到一个最佳匹配字段。

为了解决这一问题,就需要用到 dis_max query(disjunction max query,分离最大化查询):匹配的文档依然返回,但是只将最佳匹配的评分作为查询的评分。(这里就是:跟 title 和 content 都去匹配,谁的得分高就用谁;不在给这两个去算综合的分数):

    GET /blog/_search
    {
      "query": {
        "dis_max": {
          "queries": [
            {
              "match": {
                "title": "java解决方案"
              }
            },
            {
              "match": {
                "content": "java解决方案"
              }
            }
          ]
        }
      }
    }

202403132037139143.png
这一次的结果就符合心中的预期了。

在 dis_max query 中,还有一个参数 tie_breaker (取值在0~1),在 dis_max query 中,是完全不考虑其他 query 的分数,只是将最佳匹配的字段的评分返回。但是,有的时候,我们又不得不考虑一下其他 query 的分数,此时,可以通过 tie_breaker 来优化 dis_max query。 tie_breaker 会将其他 query 的分数,乘以 tie_breaker ,然后和分数最高的 query 进行一个综合计算:

    GET /blog/_search
    {
      "query": {
        "dis_max": {
          "tie_breaker": 0.7,
          "boost": 1.2, 
          "queries": [
            {
              "match": {
                "title": "java解决方案"
              }
            },
            {
              "match": {
                "content": "java解决方案"
              }
            }
          ]
        }
      }
    }

注意:如果tie_breaker设置为1,则变得和原来一样。

4.4、function_score query——自定义评分

场景:例如想要搜索博客信息,搜索的关键字是java,但是我们希望能够将评分较高的博客优先展示出来。但是默认的评分策略是没有办法考虑到评分的,他只是考虑相关性,这个时候可以通过 function_score query 来实现。

    PUT /blog 
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "ik_max_word"
          },
          "votes": {
            "type": "integer"
          }
        }
      }
    }
    
    PUT /blog/_doc/1
    {
      "title": "Java集合详解",
      "votes": 100
    }
    
    PUT /blog/_doc/2
    {
      "title": "Java多线程详解,Java锁详解",
      "votes": 10
    }
    
    GET /blog/_search
    {
      "query": {
        "match": {
          "title": "java"
        }
      }
    }

202403132037142964.png

默认情况,votes为10的评分更高。

如果我们在查询中,希望能够充分考虑 votes 字段,将 votes 较高的文档优先展示,就可以通过function_score 来实现。

具体的思路,就是在旧的得分基础上,根据 votes 的数值进行综合运算,重新得出一个新的评分。

具体有几种不同的计算方式:

  • weight
  • random_score
  • script_score
  • field_value_factor
4.4.1、weight

weight 可以对评分设置权重,就是在旧的评分基础上乘以 weight,他其实无法解决我们上面所说的问题。具体用法如下:

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "weight": 10
            }
          ]
        }
      }
    }

202403132037147675.png

4.4.2、random_score

random_score 会根据 uid 字段进行 hash 运算,生成分数,使用 random_score 时可以配置一个种子,如果不配置,默认使用当前时间

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "random_score": {}
            }
          ]
        }
      }
    }

202403132037151836.png

4.4.3、script_score

自定义评分脚本。假设每个文档的最终得分是旧的分数加上votes。查询方式如下:

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "script_score": {
                "script": "_score + doc['votes'].value"
              }
            }
          ]
        }
      }
    }

202403132037156057.png

现在,最终得分是:(oldScore + votes) * oldScore。

如果不想乘以oldScore,查询方式如下:

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "script_score": {
                "script": "_score + doc['votes'].value"
              }
            }
          ],
          "boost_mode": "replace"
        }
      }
    }

202403132037159068.png

通过 boost_mode 参数,可以设置最终的计算方式。该参数还有其他取值:

  • multiply:分数相乘
  • sum:分数相加
  • avg:求平均数
  • max:最大分
  • min:最小分
  • replace:不进行二次计算
4.4.4、field_value_factor

这个的功能类似于 script_score ,但是不用自己写脚本。
假设每个文档的最终得分是旧的分数乘以votes。查询方式如下:

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "field_value_factor": {
                "field": "votes"
              }
            }
          ]
        }
      }
    }

202403132037162949.png

默认的得分就是 oldScore * votes 。
还可以利用 es 内置的函数进行一些更复杂的运算:

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "field_value_factor": {
                "field": "votes",
                "modifier": "sqrt"
              }
            }
          ],
          "boost_mode": "replace"
        }
      }
    }

2024031320371675910.png

此时,最终的得分是(sqrt(votes))。
modifier 中可以设置内置函数,其他的内置函数还有:

参数名 含义
none 默认的,不进行任何计算
log 对字段值取对数
log1p 字段值加1然后取对数
log2p 字段值加2然后取对数
ln 取字段值的自然对数
ln1p 字段值加1然后取自然对数
ln2p 字段值加2然后取自然对数
sqrt 字段值求平方根
square 字段值的平方
ciprocal 倒数

另外还有个参数 factor ,影响因子。字段值先乘以影响因子,然后再进行计算。以 sqrt 为例,计算方式为 sqrt ( factor * votes) :

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "field_value_factor": {
                "field": "votes",
                "modifier": "sqrt",
                "factor": 10
              }
            }
          ],
          "boost_mode": "replace"
        }
      }
    }

还有一个参数 max_boost ,控制计算结果的范围:

    GET /blog/_search
    {
      "query": {
        "function_score": {
          "query": {
            "match": {
              "title": "java"
            }
          },
          "functions": [
            {
              "field_value_factor": {
                "field": "votes"
              }
            }
          ],
          "boost_mode": "sum",
          "max_boost": 100
        }
      }
    }

max_boost 参数表示 functions 模块中,最终的计算结果上限。如果超过上限,就按照上线计算。

4.5、boosting query——子句加权查询

Boosting Query允许在查询中为不同的子句设置不同的权重,以影响文档的相关性得分。不同于bool查询的是bool查询中只要一个子查询条件不匹配那么搜索的数据就不会出现。而Boosting Query则是降低显示的权重/优先级(即score)。这对于优化搜索结果的排序非常有用。Boosting Query有以下重要参数:

  • positive:正向查询子句(希望匹配的条件),用于增加相关性得分。
  • negative:负向查询子句(不希望匹配的条件),用于减少相关性得分。
  • negative_boost:0到1.0之间的浮点数,用于降低与negative查询匹配的文档的相关性得分,指定负向查询的权重值越大,负向查询的影响越小。

得分计算规则:

  1. 如果文档不满足nagative,那么返回原始得分
  2. 如果文档满足了nagative,那么将原始匹配得分乘以
    GET /blog/_search
    {
      "query": {
        "boosting": {
          "positive": { 
            "match": { "title": "java" } 
          },
          "negative": { 
            "term": { "votes": 10 } 
          },
          "negative_boost": 0.5
        }
      }
    }

在上面的示例中,正向查询子句匹配title包含"java"的文档,而负向查询子句匹配votes字段为10"的文档,但由于negative_boost设置为0.5,负向查询的影响被减小了一半。

4.6、nested query——嵌套查询

4.6.1、嵌套文档

假设:有一个电影文档,每个电影都有演员信息:

    PUT /movies
    {
      "mappings": {
        "properties": {
          "name": {
            "type": "text"
          },
          "actors": {
            "type": "nested"
          }
        }
      }
    }
    
    PUT /movies/_doc/1
    {
      "name": "霸王别姬",
      "actors": [
        {
          "name": "张国荣",
          "gender": "男"
        },
        {
          "name": "葛优",
          "gender": "男"
        },
        {
          "name": "张丰毅",
          "gender": "男"
        },
        {
          "name": "巩俐",
          "gender": "女"
        }
      ]
    }

注意 actors 类型要是 nested。

缺点:查看文档数量:

    GET /_cat/indices?v

2024031320371715611.png

这是因为 nested 文档在 es 内部其实也是独立的 lucene 文档,只是在我们查询的时候,es 内部帮我们做了 join 处理,所以最终看起来就像一个独立文档一样。因此这种方案性能并不是特别好。

4.6.2、嵌套查询
    GET /movies/_search
    {
      "query": {
        "nested": {
          "path": "actors",
          "query": {
            "bool": {
              "must": [
                {
                  "match": {
                    "actors.name": "张国荣"
                  }
                },
                {
                  "match": {
                    "actors.gender": "男"
                  }
                }
              ]
            }
          }
        }
      }
    }

4.7、父子文档查询

4.7.1、父子文档

相比于嵌套文档,父子文档主要有如下优势:

  1. 更新父文档时,不会重新索引子文档
  2. 创建、修改或者删除子文档时,不会影响父文档或者其他的子文档。
  3. 子文档可以作为搜索结果独立返回。

定义父子文档映射语法:

    PUT /my_index
    {
        "mappings": {
            "properties" : {
                "my-join-field" : {
                    "type" : "join",
                    "relations": {
                        "parent": "child"
                    }
                }
            }
        }
    }

例如学生和班级的关系:

    PUT /stu_class
    {
      "mappings": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "s_c": {
            "type": "join",
            "relations": {
              "class": "student"
            }
          }
        }
      }
    }

s_c 表示父子文档关系的名字,可以自定义。join 表示这是一个父子文档。relations 里边,class 这个位置是 parent,student 这个位置是 child。

接下来,插入两个父文档:

    PUT /stu_class/_doc/1
    {
      "name": "一班",
      "s_c": {
        "name": "class"
      }
    }
    
    PUT /stu_class/_doc/2
    {
      "name": "二班",
      "s_c": {
        "name": "class"
      }
    }

再来添加三个子文档:

    PUT /stu_class/_doc/3?routing=1
    {
      "name": "zhangsan",
      "s_c": {
        "name": "student",
        "parent": 1
      }
    }
    
    PUT /stu_class/_doc/4?routing=1
    {
      "name": "lisi",
      "s_c": {
        "name": "student",
        "parent": 1
      }
    }
    
    PUT /stu_class/_doc/5?routing=2
    {
      "name": "wangwu",
      "s_c": {
        "name": "student",
        "parent": 2
      }
    }

首先大家可以看到,子文档都是独立的文档。特别需要注意的地方是,子文档需要和父文档在同一个分片上,所以 routing 关键字的值为父文档的 id。另外,name 属性表明这是一个子文档。

父子文档需要注意的地方:

  1. 每个索引只能定义一个 join filed
  2. 父子文档需要在同一个分片上(查询,修改需要routing)
  3. 可以向一个已经存在的 join filed 上新增关系
4.7.2、has_child query——子查父

通过子文档查询父文档使用 has_child query。

    GET /stu_class/_search
    {
      "query": {
        "has_child": {
          "type": "student",
          "query": {
            "match": {
              "name": "wangwu"
            }
          }
        }
      }
    }

返回:

    {
      "took" : 21,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "stu_class",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 1.0,
            "_source" : {
              "name" : "二班",
              "s_c" : {
                "name" : "class"
              }
            }
          }
        ]
      }
    }
4.7.3、has_parent query——父查子

通过父文档查询子文档:

    GET /stu_class/_search
    {
      "query": {
        "has_parent": {
          "parent_type": "class",
          "query": {
            "match": {
              "name": "一班"
            }
          }
        }
      }
    }

返回:不计算得分

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "stu_class",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 1.0,
            "_routing" : "1",
            "_source" : {
              "name" : "zhangsan",
              "s_c" : {
                "name" : "student",
                "parent" : 1
              }
            }
          },
          {
            "_index" : "stu_class",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 1.0,
            "_routing" : "1",
            "_source" : {
              "name" : "lisi",
              "s_c" : {
                "name" : "student",
                "parent" : 1
              }
            }
          }
        ]
      }
    }
4.7.3、parent_id——父id查子文档

通过 parent id 查询,默认情况下使用相关性计算分数。

    GET /stu_class/_search
    {
      "query": {
        "parent_id": {
          "type": "student",
          "id": 2
        }
      }
    }

返回:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.87546873,
        "hits" : [
          {
            "_index" : "stu_class",
            "_type" : "_doc",
            "_id" : "5",
            "_score" : 0.87546873,
            "_routing" : "2",
            "_source" : {
              "name" : "wangwu",
              "s_c" : {
                "name" : "student",
                "parent" : 2
              }
            }
          }
        ]
      }
    }

5、地理位置查询

5.1、数据准备

    PUT /geo 
    {
      "mappings": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "location": {
            "type": "geo_point"
          }
        }
      }
    }
    
    POST /geo/_bulk
    {"index":{"_index":"geo","_id":1}}
    {"name":"西安","location":"34.288991865037524,108.9404296875"}
    {"index":{"_index":"geo","_id":2}}
    {"name":"北京","location":"39.926588421909436,116.43310546875"}
    {"index":{"_index":"geo","_id":3}}
    {"name":"上海","location":"31.240985378021307,121.53076171875"}
    {"index":{"_index":"geo","_id":4}}
    {"name":"天津","location":"39.13006024213511,117.20214843749999"}
    {"index":{"_index":"geo","_id":5}}
    {"name":"杭州","location":"30.259067203213018,120.21240234375001"}
    {"index":{"_index":"geo","_id":6}}
    {"name":"武汉","location":"30.581179257386985,114.3017578125"}
    {"index":{"_index":"geo","_id":7}}
    {"name":"合肥","location":"31.840232667909365,117.20214843749999"}
    {"index":{"_index":"geo","_id":8}}
    {"name":"重庆","location":"29.592565403314087,106.5673828125"}

5.2、geo_distance query

给出一个中心点,查询距离该中心点指定范围内的文档:

以(34.288991865037524,108.9404296875) 为圆心,以 600KM 为半径,这个范围内的数据。

    GET /geo/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match_all": {}
            }
          ],
          "filter": [
            {
              "geo_distance": {
                "distance": "600km",
                "location": {
                  "lat": 34.288991865037524,
                  "lon": 108.9404296875
                }
              }
            }
          ]
        }
      }
    }

返回:

    {
      "took" : 958,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "geo",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "name" : "西安",
              "location" : "34.288991865037524,108.9404296875"
            }
          },
          {
            "_index" : "geo",
            "_type" : "_doc",
            "_id" : "8",
            "_score" : 1.0,
            "_source" : {
              "name" : "重庆",
              "location" : "29.592565403314087,106.5673828125"
            }
          }
        ]
      }
    }

5.3、geo_bounding_box query

在某一个矩形内的点,通过两个点锁定一个矩形:

    GET /geo/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match_all": {}
            }
          ],
          "filter": [
            {
              "geo_bounding_box": {
                "location": {
                  "top_left": {
                    "lat": 32.0639555946604,
                    "lon": 118.78967285156249
                  },
                  "bottom_right": {
                    "lat": 29.98824461550903,
                    "lon": 122.20642089843749
                  }
                }
              }
            }
          ]
        }
      }
    }

以南京经纬度作为矩形的左上角,以舟山经纬度作为矩形的右下角,构造出来的矩形中,包含上海和杭州两个城市。

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "geo",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 1.0,
            "_source" : {
              "name" : "上海",
              "location" : "31.240985378021307,121.53076171875"
            }
          },
          {
            "_index" : "geo",
            "_type" : "_doc",
            "_id" : "5",
            "_score" : 1.0,
            "_source" : {
              "name" : "杭州",
              "location" : "30.259067203213018,120.21240234375001"
            }
          }
        ]
      }
    }

5.4、geo_polygon query

在某一个多边形范围内的查询。

    GET /geo/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match_all": {}
            }
          ],
          "filter": [
            {
              "geo_polygon": {
                "location": {
                  "points": [
                    {
                      "lat": 31.793755581217674,
                      "lon": 113.8238525390625
                    },
                    {
                      "lat": 30.007273923504556,
                      "lon": 114.224853515625
                    },
                    {
                      "lat": 30.007273923504556,
                      "lon": 114.8345947265625
                    }
                  ]
                }
              }
            }
          ]
        }
      }
    }

给定多个点,由多个点组成的多边形中的数据。

上图是三个点,也就是三角形。

    {
      "took" : 22,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "geo",
            "_type" : "_doc",
            "_id" : "6",
            "_score" : 1.0,
            "_source" : {
              "name" : "武汉",
              "location" : "30.581179257386985,114.3017578125"
            }
          }
        ]
      }
    }

5.5、geo_shape query

geo_shape 用来查询图形,针对 geo_shape ,两个图形之间的关系有:相交、包含、不相交。

新建索引:

    PUT /geo_shape
    {
      "mappings": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "location": {
            "type": "geo_shape"
          }
        }
      }
    }

然后添加一条线:

    PUT /geo_shape/_doc/1
    {
      "name": "西安-郑州",
      "location": {
        "type": "linestring",
        "coordinates": [
          [108.9404296875, 34.279914398549934],
          [113.66455078125, 34.768691457552706]
        ]
      }
    }

接下来查询某一个图形中是否包含该线:

    GET /geo_shape/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match_all": {}
            }
          ],
          "filter": [
            {
              "geo_shape": {
                "location": {
                  "shape": {
                    "type": "envelope",
                    "coordinates": [
                      [
                        106.5234375,
                        36.80928470205937
                      ],
                      [
                        115.3344726625,
                        32.24997445586331
                      ]
                    ]
                  },
                  "relation": "within"
                }
              }
            }
          ]
        }
      }
    }

relation 属性表示两个图形的关系:

  • within:包含
  • intersects:相交
  • disjoint:不相交
    {
      "took" : 530,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "geo_shape",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "name" : "西安-郑州",
              "location" : {
                "type" : "linestring",
                "coordinates" : [
                  [
                    108.9404296875,
                    34.279914398549934
                  ],
                  [
                    113.66455078125,
                    34.768691457552706
                  ]
                ]
              }
            }
          }
        ]
      }
    }

6、特殊查询

6.1、more_like_this query——基于内容的推荐

more_like_this query 可以实现基于内容的推荐,给定一篇文章,可以查询出和该文章相似的内容。

    GET /bank/_search
    {
      "query": {
        "more_like_this": {
          "fields": [
            "address"
          ],
          "like": "madison street",
          "min_term_freq": 1,
          "max_query_terms": 12
        }
      }
    }
  • fields:要匹配的字段,可以有多个。
  • like:要匹配的文本。
  • min_term_freq:词项的最低频率,默认是 2。特别注意,这个是指词项在要匹配的文本中的频率,而不是 es 文档中的频率。
  • max_query_terms:query 中包含的最大词项数目。
  • min_doc_freq:最小的文档频率,搜索的词,至少在多少个文档中出现,少于指定数目,该词会被忽略。
  • max_doc_freq:最大文档频率。
  • analyzer:分词器,默认使用字段的分词器。
  • stop_words:停用词列表。
  • minmum_should_match

6.2、script query——脚本查询

    GET /bank/_search
    {
      "query": {
        "bool": {
          "filter": [
            {
              "script": {
                "script": {
                  "lang": "painless",
                  "source": "doc['age'].value > 39"
                }
              }
            }
          ]
        }
      }
    }

6.3、percolate query——反向查询

percolate query译作渗透查询或者反向查询:

  • 正常操作:根据查询语句找到对应的文档query->document
  • percolate query:根据文档,返回与之匹配的查询语句,docuent->query

应用场景:

  • 价格监控
  • 库存报警
  • 股票警告

例如阈值告警,假设这顶字段值大于阈值,报警提示:

    PUT /log 
    {
      "mappings": {
        "properties": {
          "threshold": {
            "type": "long"
          },
          "count": {
            "type": "long"
          },
          "query":{
            "type": "percolator"
          }
        }
      }
    }
    //percolator类型相当于一个es查询
    
    
    
    PUT /log/_doc/1
    {
      "thredhold": 10,
      "query": {
        "bool": {
          "must": {
            "range": {
              "count": {
                "gt": 10
              }
            }
          }
        }
      }
    }
    
    GET /log/_search
    {
      "query": {
        "percolate": {
          "field": "query",
          "documents": [
            {"count":3},
            {"count":6},
            {"count":90},
            {"count":12},
            {"count":15}
          ]
        }
      }
    }

查询结果中会列出满足条件的文档。
查询结果中的_percolator_document_slot字段表示文档的position,从0开始计数。

    {
      "took" : 5,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "log",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "thredhold" : 10,
              "query" : {
                "bool" : {
                  "must" : {
                    "range" : {
                      "count" : {
                        "gt" : 10
                      }
                    }
                  }
                }
              }
            },
            "fields" : {
              "_percolator_document_slot" : [
                2,
                3,
                4
              ]
            }
          }
        ]
      }
    }

6.4、highlight——搜索高亮

普通高亮,默认自动添加em标签:

    GET /bank/_search
    {
      "query": {
        "match": {
          "address": "madison"
        }
      },
      "highlight": {
        "fields": {
          "address": {}
        }
      }
    }

2024031320371789612.png

自定义高亮标签:

    GET /bank/_search
    {
      "query": {
        "match": {
          "address": "madison"
        }
      },
      "highlight": {
        "fields": {
          "address": {
            "pre_tags": ["<strong>"],
            "post_tags": ["</strong>"]
          }
        }
      }
    }

2024031320371846413.png

多字段高亮:

    GET /blog/_search
    {
      "query": {
        "match": {
          "title": "mongoDB"
        }
      },
      "highlight": {
        "require_field_match": "false",
        "fields": {
          "title": {},
          "content": {}
        }
      }
    }

2024031320371888414.png

6.5、sort——排序

排序很简单,默认是按照查询文档的相关度来排序的,即( _score 字段)。

另外match_all查询只是返回所有文档,不评分,默认按照添加顺序返回,可以通过_doc字段对其进行排序。

    GET /bank/_search
    {
      "query": {
        "match_all": {}
      },
      "sort": [
        {
          "_doc": {
            "order": "desc"
          }
        }
      ],
      "size": 5
    }

多字段排序:

    GET /bank/_search
    {
      "query": {
        "match_all": {}
      },
      "sort": [
        {
          "age": {
            "order": "desc"
          },
          "balance": {
            "order": "asc"
          }
        }
      ],
      "size": 5
    }

6.6、_count——计数

_count API不会返回实际的字段数据,只会返回一个数量,支持所有查询语句。

    GET /bank/_count
    {
      "query": {
        "match_all": {}
      }
    }
    {
      "count" : 1000,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      }
    }
阅读全文