Course outline

1. The default analyzer: standard

> standard tokenizer: splits text on word boundaries
> standard token filter: does nothing
> lowercase token filter: converts all letters to lowercase
> stop token filter (disabled by default): removes stopwords such as a, the, it, and so on

2. Changing analyzer settings

Enable the English stopwords token filter:

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
~~~

The standard analyzer splits the string on whitespace, producing six tokens: a, dog, is, in, the, house.

~~~
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
~~~

The custom es_std analyzer defined on this index additionally drops common English words such as a, is, and the, leaving only two tokens: dog and house.

~~~
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
~~~

3. Building your own analyzer

Here `&_to_and` is a `mapping` char filter that rewrites `&` to `and`, and `my_stopwords` is a custom `stop` token filter that removes "the" and "a". The `my_analyzer` analyzer chains everything together: the built-in `html_strip` char filter plus `&_to_and`, then the standard tokenizer, then the `lowercase` token filter followed by `my_stopwords`.

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
~~~

char_filter: transforms the raw characters before tokenization
filter: transforms the tokens after tokenization

Test the analyzer: `html_strip` removes the `<a>` tag, `&_to_and` turns `&` into `and`, `lowercase` turns HAHA into haha, and `my_stopwords` drops "the" and "a", leaving the tokens tomandjerry, are, friend, in, house, haha.

~~~
GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
~~~

Finally, attach the analyzer to a field so it is applied at index time (this uses the pre-7.0 typed mapping API):

~~~
PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
~~~
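To confirm the mapping actually takes effect at index time, one quick end-to-end check is to index a document and query the `content` field. A minimal sketch, assuming the same index and the legacy typed API used above (the document id `1` and the query text are illustrative):

~~~
PUT /my_index/my_type/1
{
  "content": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "tom&jerry"
    }
  }
}
~~~

A match query analyzes its input with the field's analyzer, so `tom&jerry` is rewritten to the same token tomandjerry that was indexed, and the document matches. A query for "the" or "a" would return nothing, since the stop filter removed those terms from the index.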