Operations · TUNA-daily

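[TOC]

## 1. Search queries (query DSL)

> * A structured query is passed in the `query` parameter of the request body. An empty clause `{}` matches every document. The examples below build queries with the JSON query DSL.

### 1.1 match_all

> `match_all` returns all documents; it is the default when no query condition is given.

~~~
GET _search
{
  "query": {
    "match_all": {}
  }
}
~~~

### 1.2 match

1. Match documents whose age is 32

~~~
GET _search
{
  "query": {
    "match": { "age": 32 }
  }
}
~~~

### 1.3 match_phrase

> Phrase matching: like the match query, but used to match exact phrases or word-proximity matches.
> The match_phrase query analyzes the text and builds a phrase query from the analyzed terms, so all terms must appear, in the same order and (by default) next to each other.
> For example, `hello world` only matches text that contains the phrase hello world.

* Return documents containing the phrase "java spark"

~~~
GET /forum/article/_search
{
  "query": {
    "match_phrase": { "content": "java spark" }
  }
}
~~~

* The slop parameter (proximity match) controls how far apart the terms of the phrase may be, which gives an approximate match

~~~
GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 10    # maximum distance between java and spark, counted in term positions
      }
    }
  }
}
~~~

### 1.4 match_phrase_prefix

> match_phrase_prefix is the same as match_phrase, except that it allows prefix matching on the last term of the text.
> It accepts the same parameters as the phrase type. In addition it accepts a max_expansions parameter (default 50) that controls how many suffixes the last term is expanded to. Setting it to an acceptable value is strongly recommended, to keep the execution time of the query under control. For example:

~~~
GET /_search
{
  "query": {
    "match_phrase_prefix": {
      "message": {
        "query": "quick brown f",
        "max_expansions": 50
      }
    }
  }
}
~~~

> * This query is built as the phrase "quick brown" followed by a term starting with f. `"max_expansions": 50` means it looks through the sorted term dictionary for the first 50 terms that start with f and adds those terms to the phrase query. As a result, a term that starts with f but sorts after the first 50 will not be found at first; it will be found once the user has typed more characters after the f.

> Both full-text search and exact-value lookups mostly go through match.

### 1.4 multi_match

> 1. Builds on match by matching against several fields

~~~
GET _search
{
  "query": {
    "multi_match": {
      "query": "like",                     # query text
      "fields": ["interests", "about"]     # fields to search
    }
  }
}
~~~

> 2. Field names can be given with wildcards

~~~
GET _search
{
  "query": {
    "multi_match": {
      "query": "like",
      "fields": ["interests", "*t"]    # matches fields whose name ends in t
    }
  }
}
~~~

> 3. Use ^n to boost a field

~~~
GET _search
{
  "query": {
    "multi_match": {
      "query": "music",
      "fields": ["interests", "about^3"]
    }
  }
}
~~~

> * This raises the weight of about when scoring: a document whose about field contains music ranks higher in the results.

> 4. multi_match query types

> 4.1 best_fields (the default multi_match type)

Finds documents that match any field, but uses the `_score` of the best field. The best_fields type is most useful when you search for several words that are best found in the same field. For example, "brown fox" in a single field is more meaningful than "brown" in one field and "fox" in another. The best_fields type generates a match query for each field and wraps them in a dis_max query to find the single best matching field. For example, this query:

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "brown fox",
      "type": "best_fields",
      "fields": ["subject", "message"],
      "tie_breaker": 0.3
    }
  }
}
~~~

is converted into a dis_max query containing two match queries:

~~~
GET /_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "subject": "brown fox" }},
        { "match": { "message": "brown fox" }}
      ],
      "tie_breaker": 0.3    # weight given to the other, less exact matches
    }
  }
}
~~~

Normally the best_fields type uses the score of the single best matching field, but if tie_breaker is specified, the score is calculated as follows:

1. the score from the best matching field
2. plus tie_breaker * _score for every other matching field

* The best_fields and most_fields types are field-centric: they generate a match query per field. This means that the operator and minimum_should_match parameters are applied to each field separately, which may not be what you want.

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Will Smith",
      "type": "best_fields",
      "fields": ["first_name", "last_name"],
      "operator": "and"
    }
  }
}
~~~

This generates one query per field and applies the parameters to each field separately. "Will Smith" is split into terms, and because of and, both terms must be present in first_name, or both must be present in last_name. In other words, all terms must be present in a single field for a document to match:

~~~
(+first_name:will +first_name:smith)    # upper case is lowercased
| (+last_name:will +last_name:smith)    # upper case is lowercased
~~~

Changing and to or:

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Will smith",
      "type": "best_fields",
      "fields": ["first_name", "last_name"],
      "operator": "or"
    }
  }
}
~~~

is converted to:

~~~
(first_name:will or first_name:smith)    # upper case is lowercased
| (last_name:will or last_name:smith)    # upper case is lowercased
~~~

> 4.2 most_fields

The most_fields type is most useful when you query several fields that contain the same text analyzed in different ways. For example, the main field may contain synonyms, stemming and terms without diacritics; a second field may contain the original terms, and a third field may contain shingles. By combining the scores from all three fields we can match as many documents as possible with the main field, and use the second and third fields to push the most similar results to the top of the list.

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "quick brown fox",
      "type": "most_fields",
      "fields": ["title", "title.original", "title.shingles"]
    }
  }
}
~~~

It is executed as:

~~~
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "quick brown fox" }},
        { "match": { "title.original": "quick brown fox" }},
        { "match": { "title.shingles": "quick brown fox" }}
      ]
    }
  }
}
~~~

The scores from the match clauses are added together and then divided by the number of match clauses.

> 4.3 cross_fields

The cross_fields type is particularly useful for structured documents where several fields should match. For example, when querying the first_name and last_name fields for "Will Smith", the best match probably has "Will" in one field and "Smith" in the other.

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Will Smith",
      "type": "cross_fields",
      "fields": ["first_name", "last_name"],
      "operator": "and"
    }
  }
}
~~~

It is converted as follows. Unlike best_fields, each term of "Will Smith" is looked up across both fields:

~~~
+(first_name:will last_name:will)
+(first_name:smith last_name:smith)
~~~

### 1.5 bool

> A bool query is similar to a bool filter. The difference is that a bool filter only answers whether a document matches, while a bool query also computes a `_score` (relevance score) for every query clause.
> The following query finds documents whose title field matches "how to make millions" and whose tag field is not marked as spam. Documents that are tagged "starred" or dated on or after 2014-01-01 rank higher than otherwise similar documents:

~~~
GET _search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "how to make millions" }},
      "must_not": { "match": { "tag": "spam" }},
      "should": [
        { "match": { "tag": "starred" }},
        { "range": { "date": { "gte": "2014-01-01" }}}
      ]
    }
  }
}
~~~

* Find documents whose about field contains basketball and whose age is between 35 and 40

~~~
GET /megacorp/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "about": "basketball" }},
        { "range": { "age": { "gte": 35, "lte": 40 }}}
      ]
    }
  }
}
~~~

### 1.6 Common Terms query

> * Every term in a query has a performance cost. Searching for "The brown fox" needs three term queries, one each for "the", "brown" and "fox", and all of them run against every document in the index. "the" matches a very large number of documents, so it is far less useful than the other two terms.
> * The old approach was to remove words such as "the". This causes real problems; for example, we can no longer tell "happy" from "not happy".
> * The common terms query splits the query terms into two groups: important terms (low-frequency terms that say more about relevance) and less important terms (high-frequency terms such as stop words).
> 1. It first looks up the important terms. They appear in fewer documents (which is efficient) and are good indicators of relevance.
> 2. It then runs the query for the less important terms. When scoring, it does not score every matching document, only the documents already found in the first step. In this way high-frequency terms can improve the relevance calculation without paying the cost of poor performance.
> 3. If a query consists only of high-frequency terms, a single query is executed as an AND (conjunction) query; in other words, all terms are required. Even though each individual term matches many documents, the combination of terms narrows the result set to the most relevant ones. The single query can also be executed as an OR with a specific minimum_should_match, in which case a high enough value should be used.

* In this example, words with a document frequency greater than 0.1% (for example "this" and "is") are treated as common terms.

~~~
GET /_search
{
  "query": {
    "common": {
      "body": {
        "query": "this is bonsai cool",
        "cutoff_frequency": 0.001
      }
    }
  }
}
~~~

The number of terms that should match can be controlled with the minimum_should_match (high_freq, low_freq), low_freq_operator (default "or") and high_freq_operator (default "or") parameters. For low-frequency terms, set low_freq_operator to "and" to make all of them required:

~~~
GET /_search
{
  "query": {
    "common": {
      "body": {
        "query": "nelly the elephant as a cartoon",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "and"
      }
    }
  }
}
~~~

which is roughly equivalent to:

~~~
GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "body": "nelly" }},
        { "term": { "body": "elephant" }},
        { "term": { "body": "cartoon" }}
      ],
      "should": [
        { "term": { "body": "the" }},
        { "term": { "body": "as" }},
        { "term": { "body": "a" }}
      ]
    }
  }
}
~~~

### 1.7 Query String query

A query that uses a query parser to parse its content.

~~~
GET /_search
{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "this AND that OR thus"
    }
  }
}
~~~

### 1.8 Batch queries

1. The benefit of batch queries: if you fetch documents one at a time, fetching 100 documents takes 100 network requests, which is a lot of overhead. With a batch query the same 100 documents take a single network request, so the number of round trips drops by a factor of 100.

2. mget syntax

(1) Fetching one document at a time

~~~
GET /test_index/test_type/1
GET /test_index/test_type/2
~~~

(2) Batch query with mget

~~~
GET /_mget
{
  "docs": [
    { "_index": "test_index", "_type": "test_type", "_id": 1 },
    { "_index": "test_index", "_type": "test_type", "_id": 2 }
  ]
}

# response
{
  "docs": [
    {
      "_index": "test_index",
      "_type": "test_type",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": { "test_field1": "test field1", "test_field2": "test field2" }
    },
    {
      "_index": "test_index",
      "_type": "test_type",
      "_id": "2",
      "_version": 1,
      "found": true,
      "_source": { "test_content": "my test" }
    }
  ]
}
~~~

(3) If the documents are all in one index (possibly under different types), put the index in the URL

~~~
GET /test_index/_mget
{
  "docs": [
    { "_type": "test_type", "_id": 1 },
    { "_type": "test_type", "_id": 2 }
  ]
}
~~~

(4) If the documents are all under the same type of the same index, this is the simplest form

~~~
GET /test_index/test_type/_mget
{
  "ids": [1, 2]
}
~~~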
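The mget request bodies above are easy to generate from a list of ids. The following is a minimal Python sketch, not an official client API: `build_mget_body` is a hypothetical helper introduced here for illustration, and the `test_index` / `test_type` names are taken from the examples. It only builds the JSON body and leaves the HTTP call to whatever client you use.

```python
import json


def build_mget_body(ids, index=None, doc_type=None):
    """Build an _mget request body (hypothetical helper, for illustration only).

    When index and doc_type are both omitted they are assumed to be part of
    the URL (GET /index/type/_mget), so the short {"ids": [...]} form is used.
    Otherwise every entry carries the _index and/or _type that was given.
    """
    if index is None and doc_type is None:
        return {"ids": list(ids)}
    docs = []
    for doc_id in ids:
        entry = {"_id": doc_id}
        if index is not None:
            entry["_index"] = index
        if doc_type is not None:
            entry["_type"] = doc_type
        docs.append(entry)
    return {"docs": docs}


# Body for GET /_mget: every entry names its index and type.
full_body = build_mget_body([1, 2], index="test_index", doc_type="test_type")
# Body for GET /test_index/test_type/_mget: the ids alone are enough.
short_body = build_mget_body([1, 2])

print(json.dumps(full_body, indent=2))
print(json.dumps(short_body))
```

Whichever form is used, the 100 documents of the earlier example travel in one request body instead of 100 separate requests.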
3. Why mget matters

mget is important. In general, whenever you need to fetch several documents at once, use the batch API. It cuts the number of network round trips as far as possible and can improve performance several times over, sometimes by an order of magnitude.

### 1.9 prefix query

* Find documents whose title starts with C3. A prefix query does not compute relevance (every hit gets the same `_score`) and the prefix itself is not analyzed; the query has to scan the terms of the inverted index for matches, so performance is comparatively poor.

~~~
GET my_index/_search
{
  "query": {
    "prefix": {
      "title": { "value": "C3" }
    }
  }
}
~~~

Result:

~~~
"hits": {
  "total": 2,
  "max_score": 1,
  "hits": [
    { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1, "_source": { "title": "C3-K5-DFG65" } },
    { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "title": "C3-D0-KD345" } }
  ]
}
~~~

### 1.10 wildcard (pattern search)

* Find documents whose title starts with C and ends with 5

~~~
GET my_index/_search
{
  "query": {
    "wildcard": {
      "title": { "value": "C*5" }
    }
  }
}
~~~

Result:

~~~
"hits": {
  "total": 3,
  "max_score": 1,
  "hits": [
    { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1, "_source": { "title": "C3-K5-DFG65" } },
    { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "title": "C3-D0-KD345" } },
    { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 1, "_source": { "title": "C4-I8-UI365" } }
  ]
~~~

### 1.11 Search suggestions

1. How ngram and index-time search suggestions work

What is an ngram? For example, the ngrams of quick at five lengths:

~~~
ngram length=1, q u i c k
ngram length=2, qu ui ic ck
ngram length=3, qui uic ick
ngram length=4, quic uick
ngram length=5, quick
~~~

What is an edge ngram? For quick, the ngrams anchored at the first letter:

~~~
q
qu
qui
quic
quick
~~~

Edge ngrams split every word further at index time, and the resulting ngrams are used to implement prefix-based search suggestions:

~~~
hello world
hello we

h
he
hel
hell
hello     doc1,doc2

w         doc1,doc2
wo
wor
worl
world
e         doc2

helloworld

min ngram = 1
max ngram = 3

h
he
hel

hello w

hello --> hello, doc1
w --> w, doc1
~~~

> doc1 contains both hello and w, and the positions match as well, so doc1 (hello world) is returned.
> At search time there is no need to take a prefix and scan the whole inverted index. The prefix is simply looked up in the inverted index as a term, and if it is there, it is a match. This is an ordinary full-text match query.

2. Trying out ngram

1. Define a custom analyzer

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,     # minimum length of a generated gram
          "max_gram": 20     # maximum length of a generated gram
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",     # standard tokenizer
          "filter": [
            "lowercase",               # lowercase the tokens
            "autocomplete_filter"      # edge ngrams for search suggestions
          ]
        }
      }
    }
  }
}
~~~

Test the custom analyzer:

~~~
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}
~~~

Result: quick brown is split for search suggestions into q, qu, qui, and so on.

~~~
"tokens": [
  { "token": "q",     "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
  { "token": "qu",    "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
  { "token": "qui",   "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
  { "token": "quic",  "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
  { "token": "quick", "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
  { "token": "b",     "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
  { "token": "br",    "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
  { "token": "bro",   "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
  { "token": "brow",  "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
  { "token": "brown", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 }
]
}
~~~

2. Set the mapping

~~~
PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",       # index time: use the custom analyzer
      "search_analyzer": "standard"     # search time: normal analysis
    }
  }
}
~~~

3. Insert test data

~~~
PUT my_index/my_type/1
{ "title": "hello world" }

PUT my_index/my_type/2
{ "title": "hello win" }

PUT my_index/my_type/3
{ "title": "hello dog" }
~~~

4. Test

~~~
GET my_index/_search
{
  "query": {
    "match_phrase": { "title": "hello w" }
  }
}
~~~

5. This returns every document that can start with hello w, which can then be suggested to the user. With `"min_gram": 1` (minimum gram length) and `"max_gram": 4` (maximum gram length), hello is split into h, he, hel, hell.

~~~
hello world

h
he
hel
hell
hello

w
wo
wor
worl
world

hello w

h
he
hel
hell
hello

w

hello w --> hello --> w

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": { "title": "hello w" }
  }
}
~~~

With match, documents that only contain hello are returned as well (it is a full-text search; they just score lower). match_phrase is recommended: it requires every term to be present and the positions to be exactly one apart, which is what we want here.

### 1.12 Fuzzy (typo-tolerant) queries

Test data:

~~~
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Surprise me!"}
{ "index": { "_id": 2 }}
{ "text": "That was surprising."}
{ "index": { "_id": 3 }}
{ "text": "I wasn't surprised."}
~~~

Query:

~~~
GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 2     # maximum number of edits that may be corrected
      }
    }
  }
}
~~~

Result:

~~~
"hits": [
  { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 0.22585157, "_source": { "text": "Surprise me!" } },
  { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 0.1898702, "_source": { "text": "I wasn't surprised." } }
]
}
}
~~~

* Automatic correction

~~~
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {                  # field
        "query": "SURPIZE ME",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}
~~~

Result:

~~~
"hits": {
  "total": 1,
  "max_score": 0.44248468,
  "hits": [
    { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 0.44248468, "_source": { "text": "Surprise me!" } }
  ]
}
}
~~~

### 1.13 histogram: grouping by numeric interval

> histogram groups a numeric field into buckets of a fixed width. `"interval": 2000` means the buckets are 2000 wide.

1. Group by price, one bucket per 2000

~~~
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "histogram": {
        "field": "price",
        "interval": 2000
      }
    }
  }
}
~~~

Result:

~~~
"buckets": [
  { "key": 0,    "doc_count": 3 },    # 3 documents in 0-2000
  { "key": 2000, "doc_count": 4 },    # 4 documents in 2000-4000
  { "key": 4000, "doc_count": 0 },
  { "key": 6000, "doc_count": 0 },
  { "key": 8000, "doc_count": 1 }
]
}
}
}
~~~

### 1.14 date_histogram: aggregating by time interval

1. Total sales per month

> * `"min_doc_count": 0`: return an interval such as 2017-01-01 to 2017-01-31 even if it contains no documents at all; by default empty intervals are filtered out
> * `"interval": "month"`: aggregate by month
> * `"extended_bounds": { "min": "2016-01-01", "max": "2017-12-12" }`: the time bounds to cover

~~~
GET tvs/_search
{
  "size": 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": { "min": "2016-01-01", "max": "2017-12-12" }
      },
      "aggs": {
        "sum_price_month": {
          "sum": { "field": "price" }
        }
      }
    }
  }
}
~~~

### 1.15 Query-scoped vs. global aggregation

1. Compare the average sale price of 长虹 (Changhong) TVs with the average sale price of all TVs

~~~
GET tvs/_search
{
  "size": 0,
  "query": {
    "term": { "brand": { "value": "长虹" } }
  },
  "aggs": {                    # aggregations scoped to the query
    "single_avg_price": {
      "avg": { "field": "price" }
    },
    "globle": {                # aggregation name
      "global": {},            # builds one bucket over all documents
      "aggs": {
        "globle_avg_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}
~~~

Result:

~~~
},
"aggregations": {
  "globle": {
    "doc_count": 8,
    "globle_avg_price": { "value": 2650 }
  },
  "single_avg_price": { "value": 1666.6666666666667 }
}
}
~~~

### 1.16 Distinct counts in aggregations

1. For each quarter, see how many distinct brands had sales

~~~
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "groupby_mounth": {
      "date_histogram": {
        "field": "sold_date",
        "format": "yyyy-MM-dd",
        "interval": "quarter",
        "extended_bounds": { "min": "2016-01-01", "max": "2017-08-08" }
      },
      "aggs": {
        "distinct_brand": {
          "cardinality": {
            "field": "brand"     # count distinct brands inside each bucket
          }
        }
      }
    }
  }
}
~~~

Result:

~~~
"hits": []
},
"aggregations": {
  "groupby_mounth": {
    "buckets": [
      { "key_as_string": "2016-01-01", "key": 1451606400000, "doc_count": 0, "distinct_brand": { "value": 0 } },
      { "key_as_string": "2016-04-01", "key": 1459468800000, "doc_count": 1, "distinct_brand": { "value": 1 } },
      { "key_as_string": "2016-07-01", "key": 1467331200000, "doc_count": 2, "distinct_brand": { "value": 1 } },
      { "key_as_string": "2016-10-01", "key": 1475280000000, "doc_count": 3, "distinct_brand": { "value": 1 } },
      { "key_as_string": "2017-01-01", "key": 1483228800000, "doc_count": 2, "distinct_brand": { "value": 2 }
~~~

2. Controlling the accuracy of the distinct count

> 1. cardinality is the equivalent of count(distinct). It is approximate, with an error rate of about 5%, and takes around 100 ms.
> 2. `"precision_threshold": 100` means that when brand has no more than 100 unique values (小米, 长虹, 三星, TCL, HTL, ...), the distinct count is almost guaranteed to be 100% accurate.
> 3. To keep the count accurate, raise precision_threshold as needed.
> 4. A small drawback:
> the cardinality algorithm uses about precision_threshold * 8 bytes of memory. 100 * 8 = 800 bytes is very little. If the number of unique values really is within the threshold, the result is essentially 100% accurate; with a threshold of 100 and millions of unique values, the error rate stays within 5%.

~~~
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "groupby_mounth": {
      "date_histogram": {
        "field": "sold_date",
        "format": "yyyy-MM-dd",
        "interval": "quarter",
        "extended_bounds": { "min": "2016-01-01", "max": "2017-08-08" }
      },
      "aggs": {
        "distinct_brand": {
          "cardinality": {
            "field": "brand",
            "precision_threshold": 100     # accurate as long as there are at most 100 distinct values
          }
        }
      }
    }
  }
}
~~~

## 2. Filters

### 2.1 bool filter

> * Use filters for exact values. The bool filter combines the results of several filter conditions with boolean logic. It has the following operators:
> 1. must: all of the conditions must match, equivalent to and
> 2. must_not: none of the conditions may match, equivalent to not
> 3. should: at least one of the conditions must match, equivalent to or

~~~
{
  "bool": {
    "must":     { "term": { "folder": "inbox" }},
    "must_not": { "term": { "tag": "spam" }},
    "should": [
      { "term": { "starred": true }},
      { "term": { "unread": true }}
    ]
  }
}
~~~

Number of terms to match:

~~~
GET /_search
{
  "query": {
    "common": {
      "body": {
        "query": "nelly the elephant not as a cartoon",
        "cutoff_frequency": 0.001,
        "minimum_should_match": {
          "low_freq": 2,
          "high_freq": 3
        }
      }
    }
  }
}
~~~

### 2.2 term filter

Mainly used to match exact values, such as numbers, dates, booleans, or not_analyzed strings (text that has not been analyzed).

### 2.3 terms filter

> * Like term, but it accepts several values to match: a document matches if the field contains any of them. For example, to find products priced at 20 or 30:

~~~
GET /my_store/products/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": { "price": [20, 30] }
      }
    }
  }
}
~~~

### 2.4 range filter

> Filters data within a range
> 1. gt: greater than
> 2. gte: greater than or equal to
> 3. lt: less than
> 4. lte: less than or equal to

> * Find products with a price of at least 10 and at most 20

~~~
GET /my_store/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "price": { "gte": 10, "lte": 20 }
        }
      }
    }
  }
}
~~~

> Find documents from the last hour

~~~
"range": {
  "timestamp": { "gt": "now-1h" }
}
~~~

### 2.5 exists filter

Returns documents that contain a given field.

> * Find documents whose tags field has a value

~~~
GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": { "field": "tags" }
      }
    }
  }
}
~~~

### 2.6 missing filter

The opposite of exists: returns documents that have no value for the given field. (The `missing` query was removed in Elasticsearch 5.0; on later versions use `exists` inside `must_not` instead.)

~~~
GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "missing": { "field": "tags" }
      }
    }
  }
}
~~~

## 3. Compound queries

> In practice a query usually needs the help of filter clauses, pure full-text search being the exception. A query can contain filter clauses and vice versa. The search API only accepts a `query`, so we use bool to hold both the query clauses and the "filter" clauses. The first request below finds people whose last name is Smith and whose age is at most 25; the second uses term to require an age of exactly 25.

~~~
GET /_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "last_name": "Smith" }} ],
      "filter": [ { "range": { "age": { "lte": 25 }}} ]
    }
  }
}

# using term
GET /_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "last_name": "Smith" }} ],
      "filter": [ { "term": { "age": 25 }} ]
    }
  }
}
~~~

## 4. Index management

### 4.1 Creating a custom index

~~~
PUT /my_index
{
  "settings": { ... any settings ... },
  "mappings": {
    "type_one": { ... any mappings ... },
    "type_two": { ... any mappings ... },
    ...
  }
}
~~~

### 4.2 Deleting indices

~~~
DELETE /index_one,index_two
DELETE /index_*
~~~

Create an index with a single shard and no replicas:

~~~
PUT /my_temp_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
~~~

Change the number of replicas dynamically:

~~~
PUT /my_temp_index/_settings
{
  "number_of_replicas": 1
}
~~~

### 4.3 Updating

> * If the field exists it is updated; if it does not exist it is created

~~~
POST my_store/products/1/_update
{
  "doc": {
    "bookname": "elasticsearch"
  }
}
~~~

## 5. Inserting documents

### 5.1 _bulk

~~~
Each operation takes two JSON strings, with the following syntax:
{"action": {"metadata"}}
{"data"}
~~~

> Which kinds of operations can be executed?
> (1) delete: deletes a document; a single JSON string is enough
> (2) create: PUT /index/type/id/_create, forces creation
> (3) index: an ordinary put; it either creates the document or replaces it in full
> (4) update: performs a partial update

Insert data:

~~~
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price": 10, "productID": "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price": 20, "productID": "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price": 30, "productID": "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price": 30, "productID": "QQPX-R-3956-#aD8" }
~~~

Query the data:

~~~
GET /my_store/products/_search
{
  "query": {
    "bool": {
      "filter": { "term": { "price": "20" }}
    }
  }
}
~~~

### 5.2 Bulk-importing data into elasticsearch

* The data file uses the format shown in 5.1. Import command, run in the directory that contains the data file:

~~~
curl -H 'Content-Type: application/x-ndjson' -XPUT 'http://192.168.2.88:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
~~~

## 6. mapping

> * A mapping can be thought of as the model (schema) of a document: it defines the properties of every field
> * A field can be given an index parameter that specifies how its string value is indexed
> * index is not the same as _index: _index is the index a document belongs to, while index describes a field and says how that field is indexed

The index parameter takes the following values:

1. analyzed: index the field as full text; the value is analyzed and tokenized first and then added to the inverted index (full-text indexing)
2. not_analyzed: index the field so that it is searchable, but index the value exactly as given, without analysis (exact-value matching)
3. no: do not index the field; it cannot be searched

## 7. Aggregations

### Nested aggregations (multi-level grouping)

1. Group by country
2. Within each country group, group by join_date
3. Within each innermost group, compute the average age: country(join_date(avg))

* Mapping

~~~
PUT /company
{
  "mappings": {
    "employee": {
      "properties": {
        "age": { "type": "long" },
        "country": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          },
          "fielddata": true     # enable fielddata on country so that the text field can be grouped on; not needed when aggregating on country.keyword
        },
        "join_date": {
          "type": "date"        # date fields are never analyzed, nothing extra to set
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",     # "type": "keyword" means not analyzed
              "ignore_above": 256
            }
          }
        },
        "position": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "salary": { "type": "long" }
      }
    }
  }
}
~~~

* Query

~~~
GET company/_search
{
  "size": 0,
  "aggs": {
    "group_by_country": {            # 1. first grouping, by country
      "terms": { "field": "country" },
      "aggs": {
        "group_by_date": {           # 2. second grouping, by join_date
          "terms": { "field": "join_date" },
          "aggs": {
            "avg_age": {
              "avg": { "field": "age" }
            }
          }
        }
      }
    }
  }
}
~~~

* Result

~~~
"group_by_country": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 0,
  "buckets": [                       # country groups
    {
      "key": "china",
      "doc_count": 3,
      "group_by_date": {             # group_by_date groups; there are two, 2016-01-01 and 2017-01-01
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": 1483228800000,
            "key_as_string": "2017-01-01T00:00:00.000Z",
            "doc_count": 2,
            "avg_age": { "value": 31 }
          },
          {
            "key": 1451606400000,
            "key_as_string": "2016-01-01T00:00:00.000Z",
            "doc_count": 1,
            "avg_age": { "value": 32 }
          }
        ]
~~~

> Group by the state field and count the documents in each group, like GROUP BY in MySQL. The terms aggregation groups by the given field and reports the number of members in each group: `SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC`

~~~
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state.keyword" }
    }
  }
}
~~~

> `"size": 0`: do not return the hits of the query, only the aggregation results. terms is the aggregation that does the grouping.

* On top of the grouping above, compute the average balance of each group

~~~
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state.keyword" },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
~~~

This returns buckets within buckets (the average inside each group):

~~~
"aggregations": {
  "group_by_state": {                # name of the first-level aggregation
    "doc_count_error_upper_bound": 20,
    "sum_other_doc_count": 770,
    "buckets": [
      {
        "key": "ID",
        "doc_count": 27,
        "average_balance": { "value": 24368.777777777777 }
      },
      {
        "key": "TX",
        "doc_count": 27,
        "average_balance": {         # the average
          "value": 27462.925925925927
        }
      },
      {
~~~

* Sorting an aggregation, aggs{group, avg}
* Sort by the average computed in average_balance

~~~
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": { "average_balance": "desc" }
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
~~~

* Group by price range

~~~
GET /ecommerce/product/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "range": {
        "field": "price",
        "ranges": [
          { "from": 0,  "to": 20 },
          { "from": 20, "to": 40 },
          { "from": 40, "to": 50 }
        ]
      },
      "aggs": {
        "group_by_tags": {
          "terms": { "field": "tags" },
          "aggs": {
            "average_price": {
              "avg": { "field": "price" }
            }
          }
        }
      }
    }
  }
}
~~~

## 8. geo queries

> The following kinds of geographic queries are available
> 1. geo_shape query
> Finds documents with geo-shapes that intersect, are contained by, or do not intersect the specified geo-shape.
> 2. geo_bounding_box query
> Finds documents with geo-points that fall inside the specified rectangle.
> 3. geo_distance query
> Finds documents with geo-points within the specified distance of a central point.
> 4. geo_distance_range query
> Like the geo_distance query, but the range starts at a specified distance from the central point.
> 5. geo_polygon query
> Finds documents with geo-points inside the specified polygon.

### 8.1 Geo-points

A geo-point is a single point on the surface of the earth described by a latitude and a longitude. Geo-points can be used to compute the distance between two coordinates, to check whether a coordinate lies inside an area, and in aggregations. Geo-points are not detected automatically by dynamic mapping; the field type has to be declared explicitly as geo_point:

~~~
PUT /attractions
{
  "mappings": {              # mapping
    "restaurant": {          # type
      "properties": {
        "name": {            # field
          "type": "string"
        },
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
~~~

## 9. _termvectors: term statistics

The terms can be filtered. Commonly used filter parameters:

* max_num_terms: maximum number of terms to return
* min_term_freq: minimum term frequency; ignores terms that occur fewer than this many times in the field
* max_term_freq: maximum term frequency
* min_doc_freq: minimum document frequency; ignores terms that occur in fewer than this many documents
* max_doc_freq: maximum document frequency
* min_word_length: minimum word length
* max_word_length: maximum word length

1. Term frequency statistics for the content field

~~~
GET news/new/1/_termvectors
{
  "fields": ["content"]
}
~~~

Result:

~~~
"terms": {
  "30": {
    "term_freq": 1,
    "tokens": [
  ...
  "与": {
    "term_freq": 1,          # number of times the term occurs
    "tokens": [
      { "position": 1, "start_offset": 2, "end_offset": 3 }
    ]
  ...
~~~

2. Filtering the terms

~~~
GET /news/new/9/_termvectors
{
  "fields": ["content"],
  "filter": {
    "min_word_length": 2,    # word length greater than 1, so single-character terms are dropped
    "min_term_freq": 2       # the term must occur at least twice
  }
}
~~~

3. java api

~~~
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.elasticsearch.action.termvectors.TermVectorsRequest;
import org.elasticsearch.action.termvectors.TermVectorsResponse;

// Method of a class that holds an Elasticsearch `client` field.
public List<Map<String,Object>> termVectos(String index, String type, String id, String field) throws IOException {
    TermVectorsRequest.FilterSettings filterSettings = new TermVectorsRequest.FilterSettings();
    filterSettings.minWordLength = 2;
    filterSettings.maxNumTerms = 10000;   // maximum number of terms returned

    TermVectorsResponse resp = client.prepareTermVectors(index, type, id)
            .setFilterSettings(filterSettings)
            .setSelectedFields(field)
            .execute().actionGet();

    // fields contained in the response
    Fields fields = resp.getFields();
    Iterator<String> iterator = fields.iterator();
    List<Map<String,Object>> result = new ArrayList<Map<String, Object>>();
    Map<String,Object> temp = null;
    while (iterator.hasNext()) {
        String dfield = iterator.next();
        Terms terms = fields.terms(dfield);        // terms of this field
        TermsEnum termsEnum = terms.iterator();    // termsEnum carries the term statistics
        while (termsEnum.next() != null) {
            String word = termsEnum.term().utf8ToString();
            // 120 is the value of the PostingsEnum.ALL flag
            int freq = termsEnum.postings(null, 120).freq();
            temp = new HashMap<String, Object>();
            temp.put("word", word);
            temp.put("freq", freq);
            result.add(temp);
        }
    }
    return result;
}
~~~

## 10. Highlighting

~~~
GET /ecommerce/product/_search
{
  "query": {
    "match": { "producer": "producer" }
  },
  "highlight": {
    "fields": {
      "producer": {}
    }
  }
}
~~~

Result:

~~~
"_index": "ecommerce",
"_type": "product",
"_id": "1",
"_score": 0.25811607,
"_source": {
  "name": "gaolujie yagao",
  "desc": "gaoxiao meibai",
  "price": 30,
  "producer": "gaolujie producer",
  "tags": [ "meibai", "fangzhu" ]
},
"highlight": {
  "producer": [
    "gaolujie <em>producer</em>"     # highlight markup
  ]
}
},
~~~

## 11. Inserting data

### 11.1 Specifying the document id manually

(1) Whether specifying the document id manually is appropriate depends on the application:

> Typically this approach is used when data is imported into es from some other system: the unique identifier the data already has in that system becomes the document id in es. For example, suppose we are building search for an e-commerce site, or employee lookup for an OA system. The data lives first in the database of the site or the internal IT system, so it necessarily has a database primary key already (auto-increment, UUID, or a business number). When that data is imported into es, the existing primary key is a natural choice for the id.
> If, on the other hand, es is the system's only main data store, data may have no id at all when it is produced and goes straight into es. In that case specifying the document id manually is not a good fit, because you do not know what the id should be; let es generate the id automatically, as described below.

(2) put /index/type/id: put with an explicit id

~~~
PUT /test_index/test_type/2
{
  "test_content": "my test"
}
~~~

### 11.2 Auto-generated document id

(1) post /index/type

~~~
POST /test_index/test_type
{
  "test_content": "my test"
}

# response
{
  "_index": "test_index",
  "_type": "test_type",
  "_id": "AVp4RN0bhjxldOOnBxaE",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}
~~~

(2) An auto-generated id is 20 characters long, URL-safe, base64-encoded, and a GUID: ids generated in parallel across a distributed system will not collide.

## 12. Modifying documents

### 12.1 Full replacement (put)

> 1. Full replacement
> If the document does not exist it is created with version 1. If it exists, the newly put data replaces the original document and the version is incremented by 1. The old document is not physically deleted right away, but it can no longer be accessed; it is only marked as deleted, and es removes it in the background as data accumulates.
> 2. Forcing creation of a document
> Creating a document and replacing it in full use the same syntax. Sometimes we only want to create a new document and never replace an existing one. How do we force a create?
> PUT /index/type/id?op_type=create, PUT /index/type/id/_create

## 13. partial update

1. What is a partial update?

> PUT /index/type/id creates a document or replaces it; the syntax is the same
> In an application, the flow usually looks like this:
> (1) The application sends a get request, fetches the document and shows it in the front end for the user to view and edit
> (2) The user edits the data in the front end and sends it to the back end
> (3) The back-end code applies the user's changes in memory and assembles the complete modified document
> (4) It then sends a PUT request to es for a full replacement
> (5) es marks the old document as deleted and creates a new document

partial update

~~~
post /index/type/id/_update
{
  "doc": {
    "only the few fields to change; the full document is not needed"
  }
}
~~~

This looks more convenient: each request carries only the few fields that changed, and the full document does not have to be sent. Partial update looks very convenient, but how does it work internally, and what are its advantages?

3. Hands-on: partial update

~~~
PUT /test_index/test_type/10
{
  "test_field1": "test1",
  "test_field2": "test2"
}

POST /test_index/test_type/10/_update
{
  "doc": {
    "test_field2": "updated test2"
  }
}
~~~

## 14. Controlling search precision

> Full-text search:
> 1. There are two ways to do full-text search: the match query and should
> 2. Search precision is controlled with the and operator and with minimum_should_match (minimum number of matching terms)

Prepare the data:

~~~
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }
~~~

### 1. match query

1. Searching for titles that contain java or elasticsearch returns four documents

~~~
"title": "this is java, elasticsearch, hadoop blog"    # elasticsearch java
"title": "this is java and elasticsearch blog"         # elasticsearch java
"title": "this is elasticsearch blog"                  # elasticsearch
"title": "this is java blog"                           # java
~~~

2. To be more precise, use the and operator. (There is also an or operator, but writing it out is pointless: match already behaves as or by default, which is exactly item 1.) Find documents that contain both java and elasticsearch:

~~~
GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"
      }
    }
  }
}
~~~

This returns two documents; two of the previous results are filtered out:

~~~
"title": "this is java, elasticsearch, hadoop blog"
"title": "this is java and elasticsearch blog"
~~~

3. Minimum match: with `"minimum_should_match": "75%"`, at least 3 (75%) of the four keywords java, elasticsearch, spark and hadoop must appear

~~~
GET forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"     # at least 75% of the terms above must match
      }
    }
  }
}
~~~

Result:

~~~
"title": "this is java, elasticsearch, hadoop blog"
~~~

* At least 2 (50%) of the four keywords java, elasticsearch, spark and hadoop

~~~
GET forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "50%"
      }
    }
  }
}
~~~

The condition is looser, so one more document is returned:

~~~
"title": "this is java, elasticsearch, hadoop blog"
"title": "this is java and elasticsearch blog"
~~~

### 2. should

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java" }},
        { "match": { "title": "elasticsearch" }},
        { "match": { "title": "spark" }},
        { "match": { "title": "hadoop" }}
      ],
      "minimum_should_match": 3
    }
  }
}
~~~

This has the same effect as item 3 of the match query section.

> By default, when a bool query also has a must clause, a document does not have to match any of the should clauses; the should clauses then only influence the score.
> There is one exception: if there is no must clause, at least one of the should clauses has to match.
> For example, in a search with four should conditions and no must, a document is returned by default as soon as it satisfies any one of the conditions.
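The effect of minimum_should_match on the five test titles can be checked by hand. The following is a minimal Python sketch under stated assumptions: it is not how Elasticsearch executes the query, it ignores scoring entirely, and `tokenize`, `required_matches` and `search` are hypothetical helpers written for this illustration. The tokenizer is a simple stand-in for the standard analyzer, and percentages are rounded down to a whole number of clauses.

```python
import re

# The five titles from the test data in this section.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (stand-in for the standard analyzer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def required_matches(clause_count, minimum_should_match):
    """Turn an integer or a percentage string such as "75%" into a clause count."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return clause_count * percent // 100  # percentages are rounded down
    return int(minimum_should_match)


def search(terms, minimum_should_match=1):
    """Return the ids of titles that contain at least the required number of terms."""
    needed = required_matches(len(terms), minimum_should_match)
    hits = []
    for doc_id, title in sorted(TITLES.items()):
        tokens = tokenize(title)
        matched = sum(1 for term in terms if term in tokens)
        if matched >= needed:
            hits.append(doc_id)
    return hits


terms = ["java", "elasticsearch", "spark", "hadoop"]
print(search(terms, "75%"))  # 3 of the 4 terms required
print(search(terms, "50%"))  # 2 of the 4 terms required
print(search(terms, 3))      # an absolute count, same requirement as "75%" here
print(search(terms))         # default: any single term is enough
```

With "75%" only document 4 qualifies, and with "50%" documents 1 and 4 do, which agrees with the results listed in this section.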