Data Types · JAVA

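[TOC]

# Type Mappings

## Core Data Types

* String - text
    * Used for full-text indexing. Values of this type are run through an analyzer and split into terms, which are then used to build the index
* String - keyword
    * Not analyzed, so the field can only be searched by its exact, complete value. Typically used for filtering, sorting, and aggregations
* Numeric
    * long: signed 64-bit integer, -2^63 ~ 2^63 - 1
    * integer: signed 32-bit integer, -2^31 ~ 2^31 - 1
    * short: signed 16-bit integer, -32768 ~ 32767
    * byte: signed 8-bit integer, -128 ~ 127
    * double: 64-bit IEEE 754 floating-point number
    * float: 32-bit IEEE 754 floating-point number
    * half\_float: 16-bit IEEE 754 floating-point number
    * scaled\_float: floating-point number stored as a long, scaled by a fixed scaling\_factor
* Boolean - boolean
    * Accepted values: false, "false", true, "true"
* Date - date
    * JSON has no date type, so Elasticsearch decides whether a string is a date by checking it against the patterns defined in format
    * The default format is `strict_date_optional_time||epoch_millis`; see [format](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html)
* Binary - binary
    * The value is treated as a Base64-encoded string. By default it is not stored and is not searchable
* Range types
    * A range field holds a range of values rather than one specific value
    * For example, if age is of type integer\_range, its value can be {"gte" : 10, "lte" : 20}. Searching with "term" : {"age": 15} matches it, and so does "range": {"age": {"gte":11, "lte": 15}}
    * The relation parameter of the range query sets the matching mode
        * INTERSECTS: the default mode; matches whenever the query range and the field value overlap
        * WITHIN: the field value must lie entirely inside the query range, i.e. the field value is a subset of the query range
        * CONTAINS: the opposite of WITHIN; matches only documents whose field value fully contains the query range
    * integer\_range
    * float\_range
    * long\_range
    * double\_range
    * date\_range: unsigned 64-bit integer, a timestamp in milliseconds
    * ip\_range: strings in IPv4 or IPv6 format

~~~
# Create an index with range fields
PUT range_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

# Index a document
PUT range_index/_doc/1
{
  "expected_attendees" : {
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : {
    "gte" : "2015-10-31 12:00:00",
    "lte" : "2015-11-05"
  }
}

# 12 falls inside 10 ~ 20, so document 1 is found
GET range_index/_search
{
  "query" : {
    "term" : {
      "expected_attendees" : {
        "value": 12
      }
    }
  }
}

# Range query with a relation. The stored time_frame (2015-10-31 12:00:00 ~ 2015-11-05)
# fully contains this query range, so by the definitions above it matches with
# contains or intersects, but not with within.
# Change the dates and the relation to compare CONTAINS, WITHIN, and INTERSECTS
GET range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : {
        "gte" : "2015-11-02",
        "lte" : "2015-11-03",
        "relation" : "within"
      }
    }
  }
}
~~~

## Complex Data Types

* Array type Array
    * Array of strings \[ "one", "two" \]
    * Array of integers \[ 1, 2 \]
    * Array of arrays \[ 1, \[ 2, 3 \]\], which is equivalent to \[ 1, 2, 3 \]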
    * Array of objects \[ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }\]
    * All values in one array must have the same type; mixed types are not allowed, so \[ 10, "some string" \] is invalid
    * null values in an array are either replaced by the value configured in null\_value or skipped
    * An empty array \[\] is treated as a missing field
* Object type Object
    * An object may contain inner objects
    * Fields are indexed as flattened key paths such as manager.name.first

~~~
# tags is an array of strings, lists is an array of objects
PUT my_index/_doc/1
{
  "message": "some arrays in this document...",
  "tags": [ "elasticsearch", "wow" ],
  "lists": [
    {
      "name": "prog_list",
      "description": "programming list"
    },
    {
      "name": "cool_list",
      "description": "cool stuff list"
    }
  ]
}
~~~

## Nested Type

The nested type is a specialized version of the object type. It allows arrays of objects to be indexed so that **each object is indexed independently**.

**How nested differs from Object**

An example shows the difference:

1. Index a document without defining a mapping. The user field is automatically detected as an **array of objects**

~~~
DELETE my_index

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ]
}
~~~

2. Query for documents where user.first is Alice and user.last is Smith. Ideally no document should match
3. Yet document 1 is returned. Why?

~~~
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last": "Smith" }}
      ]
    }
  }
}
~~~

4. This happens because the Object type is internally converted into a document of the following form:

~~~
{
  "group" : "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" : [ "smith", "white" ]
}
~~~

5. user.first and user.last are flattened into multi-valued fields, so the **association** between alice and white **is lost**. As a result, this document wrongly matches a query for alice and smith
6. What if user is mapped as a nested object from the start?

~~~
DELETE my_index

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "group": "fans",
  "user": [
    {
      "first": "John",
      "last": "Smith"
    },
    {
      "first": "Alice",
      "last": "White"
    }
  ]
}
~~~

7. Run the queries again. The first query below finds no document and the second returns document 1, as expected

~~~
GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last": "Smith" }}
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last": "White" }}
          ]
        }
      },
      "inner_hits": {
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}
~~~

8. A nested field indexes each object in the array as a separate hidden document, which means every nested object can be searched independently
9. Points to note:
    * Use the [nested query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html) to search
    * Use the nested and [reverse\_nested](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-reverse-nested-aggregation.html) aggregations to analyze
    * Use [nested sorting](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#nested-sorting) to sort
    * Use [nested inner hits](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html#nested-inner-hits) to retrieve and highlight

## Geo Data Types

* geo\_point
    * A geographic point. Its value can take any of the following four forms:
        * Object: "location": {"lat": 41.12, "lon": -71.34}
        * String, in lat,lon order: "location": "41.12,-71.34"
        * [geohash](http://geohash.gofreerange.com/): "location": "drm3btev3e86"
        * Array, in lon, lat order: "location": \[ -71.34, 41.12 \]
    * Query it with the [Geo Bounding Box Query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-bounding-box-query.html)
* geo\_shape

## Specialized Data Types

* Store IP addresses: ip
* Provide auto-completion: completion
* Store the number of tokens: token\_count
* Store the hash of a string value: murmur3
* Percolator

~~~
# ip type, stores IP addresses
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "ip_addr": "192.168.1.1"
}

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}
~~~

## Multi-fields

* Multi-fields let the same field be indexed with different settings, for example different analyzers. A common example is pinyin search on person names: just add a **sub-field** named pinyin to the name field
* Configured through the fields parameter
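
As a minimal sketch of the fields parameter: the request below maps a text field with an extra keyword sub-field. The index name multi\_field\_index, the field name name, and the sub-field name raw are hypothetical and chosen only for illustration. A pinyin sub-field would be declared the same way, but it needs a pinyin analyzer supplied by a plugin, so it is left out here.

~~~
# Hypothetical index and field names, for illustration only
PUT multi_field_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT multi_field_index/_doc/1
{
  "name": "John Smith"
}

# Full-text search goes through the analyzed main field
GET multi_field_index/_search
{
  "query": {
    "match": {
      "name": "smith"
    }
  }
}

# Exact-value search goes through the keyword sub-field, addressed as name.raw
GET multi_field_index/_search
{
  "query": {
    "term": {
      "name.raw": "John Smith"
    }
  }
}
~~~

The sub-field is addressed as the parent field name plus the sub-field name, joined by a dot. The document source still holds only name; the sub-field exists in the index alone.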