Course outline

1. The default analyzer: standard

> standard tokenizer: splits text on word boundaries
> standard token filter: does nothing
> lowercase token filter: converts all letters to lowercase
> stop token filter (disabled by default): removes stopwords such as a, the, it, and so on

2. Changing analyzer settings

Enable the English stopwords token filter:

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
~~~

The standard analyzer splits the string on whitespace, producing six tokens: a, dog, is, in, the, house.

~~~
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
~~~

The custom es_std analyzer defined on this index additionally drops common English words such as a, is, and the, leaving only two tokens: dog and house.

~~~
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
~~~

3. Building your own analyzer

Here `&_to_and` is a `mapping` char filter that rewrites `&` to `and`, and `my_stopwords` is a custom `stop` token filter that removes "the" and "a". The `my_analyzer` analyzer chains everything together: the built-in `html_strip` char filter plus `&_to_and`, then the standard tokenizer, then the `lowercase` token filter followed by `my_stopwords`.

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
~~~

char_filter: transforms the raw characters before tokenization
filter: transforms the tokens after tokenization

Test the analyzer: `html_strip` removes the `<a>` tag, `&_to_and` turns `&` into `and`, `lowercase` turns HAHA into haha, and `my_stopwords` drops "the" and "a", leaving the tokens tomandjerry, are, friend, in, house, haha.

~~~
GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
~~~

Finally, attach the analyzer to a field so it is applied at index time (this uses the pre-7.0 typed mapping API):

~~~
PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
~~~
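To confirm the mapping actually takes effect at index time, one quick end-to-end check is to index a document and query the `content` field. A minimal sketch, assuming the same index and the legacy typed API used above (the document id `1` and the query text are illustrative):

~~~
PUT /my_index/my_type/1
{
  "content": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "tom&jerry"
    }
  }
}
~~~

A match query analyzes its input with the field's analyzer, so `tom&jerry` is rewritten to the same token tomandjerry that was indexed, and the document matches. A query for "the" or "a" would return nothing, since the stop filter removed those terms from the index.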