Elasticsearch Basics (2) · TUNA-daily

[TOC]

## 1. How a match query is rewritten internally

> See item 14, "Controlling search precision", in the [Operations] notes.

~~~
{
  "match": { "title": "java elasticsearch" }
}
~~~

1. When a match query like the one above searches for several terms, Elasticsearch rewrites it internally into a bool query: one term query per search term, combined under `should`.

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

2. How a match query with `"operator": "and"` is rewritten into term + must

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch",
      "operator": "and"
    }
  }
}
~~~

is rewritten internally into

~~~
{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

3. How `minimum_should_match` is rewritten

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch hadoop spark",
      "minimum_should_match": "75%"
    }
  }
}
~~~

is rewritten internally into (75% of 4 terms is 3 terms)

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3
  }
}
~~~

## 2. Controlling search weight with boost

> Requirement:
> Find posts whose title contains java. Posts whose title also contains hadoop or elasticsearch should rank higher, and a post containing java and hadoop should rank above a post containing java and elasticsearch.

hadoop therefore gets the larger boost:

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "java" }}
      ],
      "should": [
        { "match": { "title": { "query": "hadoop", "boost": 3 }}},
        { "match": { "title": { "query": "elasticsearch", "boost": 2 }}}
      ]
    }
  }
}
~~~

## 3. dis_max: query several fields and keep the best one (highest relevance score)

1. Find documents whose `title` or `content` field contains `java solution`

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ]
    }
  }
}
~~~

Result:

~~~
"hits": [
  {
    "_index": "forum", "_type": "article", "_id": "2", "_score": 0.8849759,
    "_source": {
      "title": "this is java blog",
      "content": "i think java is the best programming language"
    },
    "highlight": {
      "title": [ "this is java blog" ],
      "content": [ "i think java is the best programming language" ]
    }
  },
  {
    "_index": "forum", "_type": "article", "_id": "4", "_score": 0.7120095,
    "_source": {
      "title": "this is java, elasticsearch, hadoop blog",
      "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
    },
    "highlight": {
      "title": [ "this is java, elasticsearch, hadoop blog" ],
      "content": [ "elasticsearch and hadoop are all very good solution, i am a beginner" ]
    }
  },
  {
    "_index": "forum", "_type": "article", "_id": "5", "_score": 0.56008905,
    "_source": {
      "title": "this is spark blog",
      "content": "spark is best big data solution based on scala ,an programming language similar to java"
    },
    "highlight": {
      "content": [ "spark is best big data solution based on scala ,an programming language similar to java" ]
    }
  },
  {
    "_index": "forum", "_type": "article", "_id": "1", "_score": 0.26742277,
    "_source": {
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article"
    },
    "highlight": {
      "title": [ "this is java and elasticsearch blog" ]
    }
  }
]
~~~

* The `content` field of the document with id=5 clearly contains both java and solution, yet its relevance score is not the highest. That is not the result we want. The score is computed roughly as follows:

~~~
Relevance score of each document: the sum of the scores of the matching queries, times the number of matched queries, divided by the total number of queries

Score for doc4:
{ "match": { "title": "java solution" }}: has a score for doc4
{ "match": { "content": "java solution" }}: also has a score for doc4
so the two scores are added, say 1.1 + 1.2 = 2.3
matched queries = 2
total queries = 2
2.3 * 2 / 2 = 2.3

Score for doc5: only one query scores
{ "match": { "title": "java solution" }}: no score for doc5
{ "match": { "content": "java solution"
}}: has a score for doc5
so only one query scores, say 2.3
matched queries = 1
total queries = 2
2.3 * 1 / 2 = 1.15

doc5 score = 1.15 < doc4 score = 2.3
~~~

2. Enter the dis_max query

* It takes the highest relevance score among the queries instead of averaging them. This is the best fields strategy: a result in which a single field matches as many keywords as possible ranks first, rather than a result in which many fields each match only a few keywords.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ]
    }
  }
}
~~~

* Now the document with id=5 moves up, ahead of id=4.

~~~
"hits": {
  "total": 4,
  "max_score": 0.68640786,
  "hits": [
    {
      "_index": "forum", "_type": "article", "_id": "2", "_score": 0.68640786,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5", "userID": 1,
        "hidden": false, "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language"
      }
    },
    {
      "_index": "forum", "_type": "article", "_id": "5", "_score": 0.56008905,
      "_source": {
        "articleID": "hjPX-R-hhh-#aDn", "userID": 3,
        "hidden": true, "postDate": "2017-01-04",
        "title": "this is spark blog",
        "content": "spark is best big data solution based on scala ,an programming language similar to java"
      }
    },
    {
      "_index": "forum", "_type": "article", "_id": "4", "_score": 0.5565415,
      "_source": {
        "articleID": "QQPX-R-3956-#aD8", "userID": 2,
        "hidden": true, "postDate": "2017-01-02",
        "title": "this is java, elasticsearch, hadoop blog",
        "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
      }
    },
    {
      "_index": "forum", "_type": "article", "_id": "1", "_score": 0.26742277,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3", "userID": 1,
        "hidden": false, "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article"
      }
    }
  ]
}
}
~~~

3. dis_max only considers the highest-scoring query, which is a weakness. Adding `tie_breaker` improves it.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ],
      "tie_breaker": 0.3
    }
  }
}
~~~

`tie_breaker` (a value between 0 and 1) is multiplied by the score of every query other than the best one. The products are added to the best score to give the final score, so the results of the other queries are taken into account as well.

4. Expressing dis_max with `multi_match`

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

## 4. Field strategies

The best-fields strategy (the default) ranks first the docs in which a single field matches as many keywords as possible.

The most-fields strategy ranks first the docs in which as many fields as possible match a keyword.

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

~~~
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"sub_title" : "learning more courses"} }
{ "update": { "_id": "2"} }
{ "doc" : {"sub_title" : "learned a lot of course"} }
{ "update": { "_id": "3"} }
{ "doc" : {"sub_title" : "we have a lot of fun"} }
{ "update": { "_id": "4"} }
{ "doc" : {"sub_title" : "both of them are good"} }
{ "update": { "_id": "5"} }
{ "doc" : {"sub_title" : "haha, hello world"} }
~~~

### 4.1 match search

1. Search `sub_title` with match. `sub_title` uses the english analyzer, which reduces plurals, gerunds, and past-tense forms to their stems. The search text `learning courses` is analyzed with the same analyzer as the field, so it is split into the stems `learn` and `cours`.

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title": "learning courses"
    }
  }
}
~~~

* How the search text is analyzed

~~~
GET _analyze
{
  "analyzer": "english",
  "text": "learning courses"
}
~~~

Result:

~~~
{
  "tokens": [
    {
      "token": "learn",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cours",
      "start_offset": 9,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
~~~

`position` is the term position. It is used in proximity matching (`match_phrase`), where it expresses the distance between two terms.

The search on `sub_title` returns the following. Because of english stemming, `learning more courses` ends up ranked lower:

~~~
"hits": [
  ...
  "sub_title": "learned a lot of course"
  ...
  "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

Searching the `sub_title.std` sub-field (standard analyzer) instead:

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title.std": "learning courses"
    }
  }
}
~~~

gives the result we expect:

~~~
"hits": [
  {
    "_index": "forum", "_type": "article", "_id": "1", "_score": 0.5063205,
    "_source": {
      "articleID": "XHDK-A-1293-#fJ3", "userID": 1,
      "hidden": false, "postDate": "2017-01-01",
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article",
      "sub_title": "learning more courses"
    }
  }
]
}
}
~~~

2. A `multi_match` query over several fields brings the field strategy into play.

> 1. The default `best_fields` query

~~~
GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "learning courses",
      "fields": ["sub_title", "sub_title.std"]
    }
  }
}
~~~

In the result, `learned a lot of course` ranks first:

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum", "_type": "article", "_id": "2", "_score": 1.219939,
      "_source": {
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum", "_type": "article", "_id": "1", "_score": 0.5063205,
      "_source": {
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

> 2. The `most_fields` strategy

With `"type": "most_fields"` on the same query, `learned a lot of course` still ranks first, but its score is unchanged, while the score of `learning more courses` rises from 0.5063205 to 1.012641. This shows that most_fields lets every matching field contribute to the score.

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum", "_type": "article", "_id": "2", "_score": 1.219939,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5", "userID": 1,
        "hidden": false, "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum", "_type": "article", "_id": "1", "_score": 1.012641,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3", "userID": 1,
        "hidden": false, "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

> Differences between best_fields and most_fields:
>
> (1) best_fields searches several fields and keeps the score of the field that matches best. When several queries tie on the best score, the scores of the other queries are considered to some extent. In short, you search several fields and want the docs in which one field contains as many of the keywords as possible.
>
> Pros: with the best_fields strategy, some weight for the other fields, and `minimum_should_match` support, the most precise matches can be pushed to the top.
>
> Cons: apart from the precise matches, the remaining results score about the same, so their order is uneven and hard to tell apart.
>
> Real-world example: search engines such as Baidu. The best matches come first, but the rest are hard to tell apart.
>
> (2) most_fields searches several fields together and lets the query on every field contribute to the total score. The result is a mixed bag, like the first result in the best_fields example above, and not necessarily precise: one document may contain more of the keywords in a single field, yet another document ranks higher because more of its fields match. This is why a field such as `sub_title.std` is needed: a field that matches the query string precisely contributes a higher score and moves the more precise matches toward the top.
>
> Pros: results that match in as many fields as possible are pushed to the top, and the overall order is fairly even.
>
> Cons: the most precise matches may not reach the top.
>
> Real-world example: wiki search is clearly most_fields. The results are fairly even, but you do have to page through several pages to find the best match.

> 3. `cross_fields` is for searching a single entity that spans several fields, such as a person's name or a place name.

~~~
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}
~~~

> Problem 1: most_fields only finds the docs with as many matching fields as possible, not the docs in which every term matches. Solution: require every term to appear in at least one of the fields.
>
> For Peter, Smith:
>
> * Peter must appear in author_first_name or author_last_name
> * Smith must appear in author_first_name or author_last_name
>
> Peter Smith may be spread across several fields, so each term is required to appear in some field. Only together do the terms form the identifier we want, a full name.
>
> With most_fields, a doc such as Smith Williams could also be returned, because most_fields only requires that any one field matches, and the more fields match, the higher the score.
>
> Problem 2: most_fields cannot use `minimum_should_match` to cut off the long tail, that is, the results that match very little. Solution: since every term is required to appear, the long tail is removed automatically.
>
> For java hadoop spark, all 3 terms must appear in some field. A document in which only one field contains java is dropped, so the long tail disappears.
>
> Problem 3: TF/IDF. Take Peter Smith and Smith Williams. When searching for Peter Smith, Smith is rare in first_name, so its frequency across all documents in that field is very low and its score very high. Smith Williams may then rank above Peter Smith. Solution: when computing IDF, take the term's IDF in every field and use the smallest value, which avoids extreme highs.
>
> For the query Peter Smith, the terms are Peter and Smith.
>
> Smith occurs rarely in the author_first_name field across all docs, which gives a very high IDF score. Smith also gets an IDF score from its frequency in the author_last_name field. Smith is usually common as a last name, so that IDF score is normal and not too high. For Smith, the smaller of the two IDF scores is used, so the IDF score never becomes excessively high.
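
The two cross_fields rules described above (every term must appear in at least one field, and the smallest per-field IDF is used) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Elasticsearch implements scoring: the four sample authors, the helper names `cross_fields_and_match`, `idf`, and `blended_idf`, the whitespace tokenizer, and the textbook IDF formula are all hypothetical.

```python
import math


def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def cross_fields_and_match(doc, fields, query):
    """True if every query term appears in at least one of the given fields."""
    field_tokens = [set(tokenize(doc.get(field, ""))) for field in fields]
    return all(
        any(term in tokens for tokens in field_tokens)
        for term in tokenize(query)
    )


def idf(term, docs, field):
    """Textbook IDF, log(1 + N / (df + 1)); an assumption, not Lucene's exact formula."""
    df = sum(1 for doc in docs if term in tokenize(doc.get(field, "")))
    return math.log(1 + len(docs) / (df + 1))


def blended_idf(term, docs, fields):
    """Use the smallest per-field IDF, as described in Problem 3 of the notes."""
    return min(idf(term, docs, field) for field in fields)


# Hypothetical sample data, not taken from the forum index.
docs = [
    {"author_first_name": "Peter", "author_last_name": "Smith"},
    {"author_first_name": "Smith", "author_last_name": "Williams"},
    {"author_first_name": "John", "author_last_name": "Smith"},
    {"author_first_name": "Mary", "author_last_name": "Smith"},
]
fields = ["author_first_name", "author_last_name"]

# Only the first doc contains both "peter" and "smith" somewhere in the two fields.
matches = [doc for doc in docs if cross_fields_and_match(doc, fields, "Peter Smith")]

# "smith" is rare as a first name (high IDF) and common as a last name (low IDF).
first_name_idf = idf("smith", docs, "author_first_name")
last_name_idf = idf("smith", docs, "author_last_name")
smith_idf = blended_idf("smith", docs, fields)
```

In this toy data, Smith Williams is filtered out by the "every term must appear" rule, and the inflated first-name IDF of `smith` is replaced by the lower last-name value, which is the effect the notes describe.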