[TOC]

## 1. How a match query is converted internally

> See section 14, "Controlling search precision", in the [Operations] part.

~~~
{
  "match": { "title": "java elasticsearch" }
}
~~~

1. When a `match` query like the one above searches for several terms, Elasticsearch automatically rewrites it internally into a `bool` query: a `bool` `should` clause with one `term` query per search term.

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

2. How a `match` with `"operator": "and"` is converted to `term` + `must`

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch",
      "operator": "and"
    }
  }
}
~~~

Internally this becomes

~~~
{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

3. How `minimum_should_match` is converted

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch hadoop spark",
      "minimum_should_match": "75%"
    }
  }
}
~~~

Internally this becomes (75% of 4 terms is 3)

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3
  }
}
~~~

## 2. boost: controlling search weight

> Requirement:
> Search for posts whose title contains java. Posts whose title also contains hadoop or elasticsearch should be ranked first. In addition, if one post contains java hadoop and another contains java elasticsearch, the hadoop post should rank ahead of the elasticsearch post.

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "java"}}
      ],
      "should": [
        {"match": {"title": {"query": "hadoop", "boost": 3}}},
        {"match": {"title": {"query": "elasticsearch", "boost": 2}}}
      ]
    }
  }
}
~~~

## 3. dis_max: multi-field query that takes the best match (highest relevance score)

1. Find documents whose `title` or `content` field contains "java solution"

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ]
    }
  }
}
~~~

Result

~~~
"hits": [
  {
    "_index": "forum",
    "_type": "article",
    "_id": "2",
    "_score": 0.8849759,
    "_source": {
      "title": "this is java blog",
      "content": "i think java is the best programming language"
    },
    "highlight": {
      "title": [ "this is <em>java</em> blog" ],
      "content": [ "i think <em>java</em> is the best programming language" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "4",
    "_score": 0.7120095,
    "_source": {
      "title": "this is java, elasticsearch, hadoop blog",
      "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
    },
    "highlight": {
      "title": [ "this is <em>java</em>, elasticsearch, hadoop blog" ],
      "content": [ "elasticsearch and hadoop are all very good <em>solution</em>, i am a beginner" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "5",
    "_score": 0.56008905,
    "_source": {
      "title": "this is spark blog",
      "content": "spark is best big data solution based on scala ,an programming language similar to java"
    },
    "highlight": {
      "content": [ "spark is best big data <em>solution</em> based on scala ,an programming language similar to <em>java</em>" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "1",
    "_score": 0.26742277,
    "_source": {
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article"
    },
    "highlight": {
      "title": [ "this is <em>java</em> and elasticsearch blog" ]
    }
  }
]
~~~

* The `content` field of doc 5 contains both java and solution, yet doc 5 does not get the highest score and even ranks below doc 4. This is not the result we want.

The score is computed roughly as follows

~~~
Relevance score of each document: the sum of the scores of the matching queries,
multiplied by the number of matched queries, divided by the total number of queries

Score for doc4:

{ "match": { "title": "java solution" }} produces a score for doc4
{ "match": { "content": "java solution" }} also produces a score for doc4

So the two scores are added, for example 1.1 + 1.2 = 2.3
matched queries = 2
total queries = 2

2.3 * 2 / 2 = 2.3

Score for doc5: only one query produces a score

{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content": "java solution" }} produces a score for doc5

So only one query has a score, for example 2.3
matched queries = 1
total queries = 2

2.3 * 1 / 2 = 1.15

doc5 score = 1.15 < doc4 score = 2.3
~~~

2. Enter the dis_max query

* It takes the highest relevance score among the queries instead of averaging them. This is the best fields strategy: a result in which one single field matches as many of the keywords as possible is ranked first, rather than a result in which many fields each match only a few keywords.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ]
    }
  }
}
~~~

* Doc 5 now moves ahead of doc 4

~~~
"hits": {
  "total": 4,
  "max_score": 0.68640786,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 0.68640786,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "5",
      "_score": 0.56008905,
      "_source": {
        "articleID": "hjPX-R-hhh-#aDn",
        "userID": 3,
        "hidden": true,
        "postDate": "2017-01-04",
        "title": "this is spark blog",
        "content": "spark is best big data solution based on scala ,an programming language similar to java"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "4",
      "_score": 0.5565415,
      "_source": {
        "articleID": "QQPX-R-3956-#aD8",
        "userID": 2,
        "hidden": true,
        "postDate": "2017-01-02",
        "title": "this is java, elasticsearch, hadoop blog",
        "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 0.26742277,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article"
      }
    }
  ]
}
}
~~~

3. dis_max only considers the best-scoring query, so it has a certain weakness. Adding `tie_breaker` improves it.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ],
      "tie_breaker": 0.3
    }
  }
}
~~~

`tie_breaker` (a value between 0 and 1) is multiplied with the scores of every query other than the best-scoring one. The products are then added to the best score to give the final score, so the results of the other queries are taken into account as well.

4. Implementing dis_max with multi_match

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

## 4. Field strategies

best_fields strategy (the default): documents in which one single field matches as many of the keywords as possible are returned first.

most_fields strategy: documents in which as many fields as possible match the keywords are returned first.

The strategy is selected with the `type` parameter of `multi_match`:

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

Add a `sub_title` field to the sample documents for the examples that follow:

~~~
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"sub_title" : "learning more courses"} }
{ "update": { "_id": "2"} }
{ "doc" : {"sub_title" : "learned a lot of course"} }
{ "update": { "_id": "3"} }
{ "doc" : {"sub_title" : "we have a lot of fun"} }
{ "update": { "_id": "4"} }
{ "doc" : {"sub_title" : "both of them are good"} }
{ "update": { "_id": "5"} }
{ "doc" : {"sub_title" : "haha, hello world"} }
~~~

### 4.1 Searching with match

1. Search `sub_title` with `match`. `sub_title` uses the `english` analyzer, which reduces plurals, gerunds, and past-tense forms to their stems. The query string "learning courses" is analyzed with the same analyzer as the field it targets, so it is split into stemmed tokens too.

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title": "learning courses"
    }
  }
}
~~~

* How the query string is analyzed

~~~
GET _analyze
{
  "analyzer": "english",
  "text": "learning courses"
}
~~~

Result

~~~
{
  "tokens": [
    {
      "token": "learn",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cours",
      "start_offset": 9,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1   # term position: used in proximity matching (match_phrase) to express the distance between two terms
    }
  ]
}
~~~

For comparison, the same search can be run against the `sub_title.std` sub-field, which uses the standard analyzer:

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title.std": "learning courses"
    }
  }
}
~~~

With the `english` analyzer on `sub_title`, "learning more courses" ends up ranked last. Stemming reduces `learned` and `course` to the same tokens `learn` and `cours`, so "learned a lot of course" matches both terms as well:
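~~~
"hits": [
  ...
  "sub_title": "learned a lot of course"
  ...
  "sub_title": "learning more courses"
  }
  }
]
}
}
~~~

Querying the `sub_title.std` sub-field (standard analyzer) instead gives the result we expect:

~~~
"hits": [
  {
    "_index": "forum",
    "_type": "article",
    "_id": "1",
    "_score": 0.5063205,
    "_source": {
      "articleID": "XHDK-A-1293-#fJ3",
      "userID": 1,
      "hidden": false,
      "postDate": "2017-01-01",
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article",
      "sub_title": "learning more courses"
    }
  }
]
}
}
~~~

2. multi_match queries several fields, which is where the field strategies come in

> 1. Default best_fields query

~~~
GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "learning courses",
      "fields": ["sub_title", "sub_title.std"]
    }
  }
}
~~~

In the result, "learned a lot of course" is ranked first

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 1.219939,
      "_source": {
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 0.5063205,
      "_source": {
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

> 2. most_fields strategy

Running the same query with `"type": "most_fields"`, "learned a lot of course" still ranks first, but its score is unchanged (1.219939), while the score of "learning more courses" rises from 0.5063205 to 1.012641. This shows that most_fields takes the match in every field into account.

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 1.219939,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 1.012641,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

> Differences between best_fields and most_fields:
>
> (1) best_fields searches several fields and takes the score of the field that matches best. When several queries tie on the best score, the scores of the other queries are taken into account to some degree. In short, you search several fields because you want the documents in which one single field contains as many of the keywords as possible.
>
> Pros: with the best_fields strategy, some consideration of the other fields, and support for minimum_should_match, the most precisely matching results can be pushed to the top.
>
> Cons: apart from those precise matches, the remaining results score about the same, so their ordering is uneven and has little discriminating power.
>
> Real-world example: search engines such as Baidu. The best matches come first, but the rest are barely differentiated.
>
> (2) most_fields searches several fields together and lets the queries on as many fields as possible contribute to the total score. The result is a mixed bag, similar to the very first result in the best_fields example, and it is not necessarily precise: one document may contain more of the keywords in a single field, but another document matches in more fields and is therefore ranked ahead of it. This is why a field such as sub_title.std is added: it lets one field match the query string as exactly as possible, contribute a higher score, and move the more precise matches to the front.
>
> Pros: results that match in as many fields as possible are pushed to the top, and the overall ordering is fairly even.
>
> Cons: the most precisely matching results may not make it to the top.
>
> Real-world example: wiki sites, which clearly use a most_fields strategy. The results are fairly even, but you do have to page through several pages to find the best match.

> 3. cross_fields: suited to searching for one entity whose identifying information spans several fields, such as a person's name or a place name

~~~
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}
~~~

> Problem 1: most_fields only finds documents that match in as many fields as possible, not documents in which the fields together match completely --> solved by requiring every term to appear in at least one of the fields
>
> Peter, Smith
>
> Peter must appear in author_first_name or author_last_name
>
> Smith must appear in author_first_name or author_last_name
>
> "Peter Smith" may be spread across several fields, so every term is required to appear in some field. Only the combination forms the identifier we want, the complete name.
>
> With most_fields, a document such as Smith Williams could also show up, because most_fields only requires that any one field matches, and the more fields match, the higher the score.
>
> Problem 2: most_fields cannot use minimum_should_match to cut off the long tail, that is, results that match very little --> solved: since every term is required to appear, the long tail is removed
>
> java hadoop spark --> all 3 terms must appear in some field
>
> A document in which only one field contains java, for example, is dropped. As part of the long tail it simply disappears.
>
> Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents in that field is very low and its score is very high, and Smith Williams may end up ranked ahead of Peter Smith --> when computing IDF, take the IDF of each term in every field and use the smallest value, so extreme values no longer occur
>
> Peter Smith
>
> Peter
>
> Smith
>
> Smith occurs very rarely in the author_first_name field across all documents, which gives a very high IDF score. An IDF score is also computed from the frequency of Smith in the author_last_name field across all documents. Smith is generally frequent in last_name, so that IDF score is normal and not too high. For Smith, the smaller of the two IDF scores is used, so the IDF score can no longer become excessive.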
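
The two cross_fields ideas above (every term must appear in at least one of the fields, and each term uses the smallest IDF found across the fields) can be sketched in a few lines of Python. This is only a simplified illustration under stated assumptions, not Elasticsearch's actual scoring code: the sample documents, the whitespace tokenizer, the IDF formula, and the helper names `matches_cross_fields`, `idf`, and `blended_idf` are all made up for the example.

```python
import math

# Hypothetical sample documents; the field names follow the query above.
DOCS = [
    {"author_first_name": "Peter", "author_last_name": "Smith"},
    {"author_first_name": "Smith", "author_last_name": "Williams"},
    {"author_first_name": "John", "author_last_name": "Smith"},
    {"author_first_name": "Peter", "author_last_name": "Jones"},
]
FIELDS = ["author_first_name", "author_last_name"]


def tokens(text):
    """Lowercase whitespace tokenizer standing in for a real analyzer."""
    return text.lower().split()


def matches_cross_fields(doc, query, fields):
    """operator=and: every query term must occur in at least one field."""
    return all(
        any(term in tokens(doc[field]) for field in fields)
        for term in tokens(query)
    )


def idf(term, field, docs):
    """Textbook-style IDF of one term in one field (an assumed formula)."""
    doc_freq = sum(1 for doc in docs if term in tokens(doc[field]))
    return math.log(1 + len(docs) / (doc_freq + 1))


def blended_idf(term, fields, docs):
    """Use the smallest per-field IDF so a rare field cannot inflate the score."""
    return min(idf(term, field, docs) for field in fields)


# Only the document whose fields together contain both terms survives.
hits = [doc for doc in DOCS if matches_cross_fields(doc, "Peter Smith", FIELDS)]
```

In this toy data set, "smith" is rarer in `author_first_name` than in `author_last_name`, so `blended_idf("smith", FIELDS, DOCS)` falls back to the lower last-name value, and "Smith Williams" is filtered out because "peter" appears in none of its fields.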