多应用+插件架构,代码干净,二开方便,首家独创一键云编译技术,文档视频完善,免费商用码云13.8K 广告
[[synonyms-analysis-chain]] === Synonyms and The Analysis Chain The example we ((("synonyms", "and the analysis chain")))showed in <<synonym-formats>>, used `u s a` as a synonym. Why did we use that instead of `U.S.A.`? The reason is that the `synonym` token filter sees only the terms that the previous token filter or tokenizer has emitted.((("analysis", "synonyms and the analysis chain"))) Imagine that we have an analyzer that consists of the `standard` tokenizer, with the `lowercase` token filter followed by a `synonym` token filter. The analysis process for the text `U.S.A.` would look like this: [source,text] ------------------------------------------------ original string → "U.S.A." standard tokenizer → (U),(S),(A) lowercase token filter → (u),(s),(a) synonym token filter → (usa) ------------------------------------------------ If we had specified the synonym as `U.S.A.`, it would never match anything because, by the time `my_synonym_filter` sees the terms, the periods have been removed and the letters have been lowercased. This is an important point to consider. What if we want to combine synonyms with stemming, so that `jumps`, `jumped`, `jump`, `leaps`, `leaped`, and `leap` are all indexed as the single term `jump`? We ((("stemming words", "combining synonyms with")))could place the synonyms filter before the stemmer and list all inflections: "jumps,jumped,leap,leaps,leaped => jump" But the more concise way would be to place the synonyms filter after the stemmer, and to list just the root words that would be emitted by the stemmer: "leap => jump" ==== Case-Sensitive Synonyms Normally, synonym filters are placed after the `lowercase` token filter and so all synonyms are ((("synonyms", "and the analysis chain", "case-sensitive synonyms")))((("case-sensitive synonyms")))written in lowercase, but sometimes that can lead to odd conflations. For instance, a `CAT` scan and a `cat` are quite different, as are `PET` (positron emmision tomography) and a `pet`. For that matter, the surname `Little` is distinct from the adjective `little` (although if a sentence starts with the adjective, it will be uppercased anyway). If you need use case to distinguish between word senses, you will need to place your synonym filter before the `lowercase` filter. Of course, that means that your synonym rules would need to list all of the case variations that you want to match (for example, `Little,LITTLE,little`). Instead of that, you could have two synonym filters: one to catch the case-sensitive synonyms and one for all the case-insentive synonyms. For instance, the case-sensitive rules could look like this: "CAT,CAT scan => cat_scan" "PET,PET scan => pet_scan" "Johnny Little,J Little => johnny_little" "Johnny Small,J Small => johnny_small" And the case-insentive rules could look like this: "cat => cat,pet" "dog => dog,pet" "cat scan,cat_scan scan => cat_scan" "pet scan,pet_scan scan => pet_scan" "little,small" The case-sensitive rules would `CAT scan` but would match only the `CAT` in `CAT scan`. For this reason, we have the odd-looking rule `cat_scan scan` in the case-insensitive list to catch bad replacements. TIP: You can see how quickly it can get complicated. As always, the `analyze` API is your friend--use it to check that your analyzers are configured correctly. See <<analyze-api>>.