常用术语查询 · Elasticsearch 5.4 中文文档

# 常用术语查询原文链接 : [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html)（修改该链接为官网对应的链接）译文链接 : [http://www.le.wiki/pages/viewpage.action?pageId=4883341](http://www.le.wiki/pages/viewpage.action?pageId=4883341)（修改该链接为 ApacheCN 对应的译文链接）贡献者 : @羊两头 ## 常用术语查询常用术语查询是停用词的现代替代，其提高了搜索结果的精度和回忆（通过考虑停用词）且不牺牲性能 ### 问题：查询中的每个词都有成本。搜索“棕色狐狸”需要三个词查询，一个针对“the”，“brown”和“fox”中的每一个，所有这些查询针对索引中的所有文档执行。对“the”的查询可能匹配许多文档，因此对相关性的影响要小于其他两个术语。以前，这个问题的解决方案是忽略高频率的项。通过将“the”视为停用词，我们减少索引大小并减少需要执行的术语查询的数量。这种方法的问题是，虽然停用词对相关性有小的影响，但它们仍然很重要。如果我们删除禁用词，我们失去精确性（例如，我们无法区分“快乐”和“不快乐”），我们失去回忆（例如像“The”或“To be or not be be” 存在于索引中）。 ### 解决：常见术语查询将查询项划分为两组：更重要的（即低频项）和不太重要的（即先前已经是停用词的高频项）。首先它搜索与更重要的术语匹配的文档。这些是出现在较少文件中并对相关性具有更大影响的术语。然后，它对比不重要的术语执行第二次查询 - 经常出现并且对相关性影响较小的术语。但是，不是计算所有匹配文档的相关性分数，而是仅计算已由第一个查询匹配的文档的分数。以这种方式，高频项可以改进相关性计算，而不支付差的性能的成本。如果查询仅由高频项组成，则单个查询作为AND（连接）查询执行，换句话说，所有项都是必需的。即使每个单独的术语将匹配许多文档，术语的组合将结果集缩小到仅最相关。单个查询也可以作为具有特定minimum_should_match的OR执行，在这种情况下，应该使用足够高的值。基于cutoff频率将术语分配给高频组或低频组，其可以被指定为绝对频率（> = 1）或相对频率（0.0..1.0）。（请记住，文档频率是按照每个分片级别计算的，如博文中所述的相关性已损坏）。也许这个查询的最有趣的属性是它自动适应域特定的停用词。例如，在视频托管网站上，诸如“剪辑”或“视频”之类的常用术语将自动表现为停用词，而无需保留手动列表。 ### 举例：在该示例中，具有大于0.1％的文档频率（例如“this”和“is”）的词将被视为共同词。 ``` GET /_search { "query": { "common": { "body": { "query": "this is bonsai cool", "cutoff_frequency": 0.001 } } } } ``` 可以使用minimum_should_match（high_freq，low_freq），low_freq_operator（默认“或”）和high_freq_operator（默认“或”）参数来控制应匹配的术语数。对于低频项，将low_freq_operator设置为“and”，以使所有项都是必需的： ``` GET /_search { "query": { "common": { "body": { "query": "nelly the elephant as a cartoon", "cutoff_frequency": 0.001, "low_freq_operator": "and" } } } } ``` 相当于： ``` GET /_search { "query": { "bool": { "must": [ { "term": { "body": "nelly"}}, { "term": { "body": "elephant"}}, { "term": { "body": "cartoon"}} ], "should": [ { "term": { "body": "the"}}, { "term": { "body": "as"}}, { "term": { "body": "a"}} ] } } } ``` 或者使用minimum_should_match来指定必须存在的低频项的最小数量或百分比，例如： ``` GET /_search { "query": { "common": { "body": { "query": "nelly the elephant as a cartoon", "cutoff_frequency": 0.001, "minimum_should_match": 2 } } } } ``` 相当于： ``` GET /_search { "query": { "bool": { "must": { "bool": { "should": [ { "term": { "body": "nelly"}}, { "term": { "body": "elephant"}}, { "term": { "body": "cartoon"}} ], "minimum_should_match": 2 } }, "should": [ { "term": { "body": "the"}}, { "term": { "body": "as"}}, { "term": { "body": "a"}} ] } } } ``` minimum_should_match 可以对具有附加low_freq和high_freq参数的低频和高频项应用不同的minimum_should_match。这里是一个提供附加参数的例子（注意结构的变化）： ``` GET /_search { "query": { "common": { "body": { "query": "nelly the elephant not as a cartoon", "cutoff_frequency": 0.001, "minimum_should_match": { "low_freq" : 2, "high_freq" : 3 } } } } } ``` 相当于： ``` GET /_search { "query": { "bool": { "must": { "bool": { "should": [ { "term": { "body": "nelly"}}, { "term": { "body": "elephant"}}, { "term": { "body": "cartoon"}} ], "minimum_should_match": 2 } }, "should": { "bool": { "should": [ { "term": { "body": "the"}}, { "term": { "body": "not"}}, { "term": { "body": "as"}}, { "term": { "body": "a"}} ], "minimum_should_match": 3 } } } } } ``` 在这种情况下，这意味着当至少有三个词时，高频词只对相关性有影响。但是对于高频项，minimum_should_match的最有趣的使用是当只有高频项时： ``` GET /_search { "query": { "common": { "body": { "query": "how not to be", "cutoff_frequency": 0.001, "minimum_should_match": { "low_freq" : 2, "high_freq" : 3 } } } } } ``` 相当于： ``` GET /_search { "query": { "bool": { "should": [ { "term": { "body": "how"}}, { "term": { "body": "not"}}, { "term": { "body": "to"}}, { "term": { "body": "be"}} ], "minimum_should_match": "3<50%" } } } ``` 高频率生成的查询然后比使用AND稍微限制性。常见的术语查询还支持boost，analyzer和disable_coord作为参数