Overview
1. The default analyzer
The default analyzer is standard, which is made up of the following components (a quick test follows the list):
standard tokenizer: splits text at word boundaries
standard token filter: does nothing
lowercase token filter: converts all tokens to lowercase
stop token filter (disabled by default): removes stopwords such as a, the, it, etc.
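To see these components in action, the standard analyzer can be exercised directly through the _analyze API without creating an index. This is a minimal sketch: the sample text is made up, and the request style follows the 5.x-era body syntax used throughout this article.
GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown Fox"
}
The response should contain the lowercased tokens the, quick, brown and fox; because the stop filter is disabled by default, the is not removed.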
2. Modifying analyzer settings
Example: enable the standard analyzer's stopword token filter with the English stopword list.
Here, es_std is the name we give the new analyzer.
PUT /index0
{
"settings": {
"analysis": {
"analyzer": {
"es_std":{
"type":"standard",
"stopwords":"_english_"
}
}
}
}
}
Test:
Analyze "a little dog" with the standard analyzer:
GET /index0/_analyze
{
"analyzer":"standard",
"text":"a little dog"
}
Result:
{
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "little",
"start_offset": 2,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "dog",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Now analyze "a little dog" with the es_std analyzer we configured; in the result the stopword a has been filtered out:
GET /index0/_analyze
{
"analyzer":"es_std",
"text":"a little dog"
}
Result:
{
"tokens": [
{
"token": "little",
"start_offset": 2,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "dog",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
}
]
}
3. Defining your own custom analyzer
Example
char_filter: a character filter of type mapping; here we define our own replacement filter that maps & to and, and name it &_to_and
my_stopwords: a token filter of type stop; here we define our own stopword list containing the two words a and the
my_analyzer: an analyzer of type custom. Before tokenization, html_strip strips HTML tags and &_to_and is our own character filter (replacing & with and); tokenization uses the standard tokenizer; stopwords are removed with my_stopwords, and all tokens are lowercased.
PUT /index0
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and":{
"type":"mapping",
"mappings":["&=> and"]
}
},
"filter":{
"my_stopwords":{
"type":"stop",
"stopwords":["a","the"]
}
},
"analyzer":{
"my_analyzer":{
"type":"custom",
"char_filter":["html_strip","&_to_and"],
"tokenizer":"standard",
"filter":["lowercase","my_stopwords"]
}
}
}
}
}
Executing this fails, because the index already exists:
{
"error": {
"root_cause": [
{
"type": "index_already_exists_exception",
"reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
"index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
"index": "index0"
}
],
"type": "index_already_exists_exception",
"reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
"index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
"index": "index0"
},
"status": 400
}
Delete the index first with DELETE /index0, then run the PUT again.
This time it succeeds:
{
"acknowledged": true,
"shards_acknowledged": true
}
Test our my_analyzer analyzer.
Sample text: tom and jery in the a house <a> & me HAHA
The result shows that a and the were filtered out, HAHA was lowercased, & was replaced with and, and the <a> tag was stripped.
GET /index0/_analyze
{
"analyzer": "my_analyzer",
"text":"tom and jery in the a house <a> & me HAHA"
}
Result:
{
"tokens": [
{
"token": "tom",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "and",
"start_offset": 4,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "jery",
"start_offset": 8,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "in",
"start_offset": 13,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "house",
"start_offset": 22,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "and",
"start_offset": 32,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "me",
"start_offset": 34,
"end_offset": 36,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "haha",
"start_offset": 37,
"end_offset": 41,
"type": "<ALPHANUM>",
"position": 9
}
]
}
4. Using our custom analyzer in the index
Set the content field of my_type to use our custom analyzer my_analyzer (setting a mapping requires PUT, not GET):
PUT /index0/_mapping/my_type
{
"properties":{
"content":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
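With the mapping in place, you can index a document and analyze text against the content field to confirm that my_analyzer is applied. This is a hedged sketch: the document id and the sample text below are made up for illustration.
PUT /index0/my_type/1
{
  "content": "Tom & Jerry in the house"
}
GET /index0/_analyze
{
  "field": "content",
  "text": "Tom & Jerry in the house"
}
Because a field-based _analyze request looks up the analyzer from the mapping, the response should contain the tokens tom, and, jerry, in and house: the is dropped by my_stopwords, & becomes and via &_to_and, and everything is lowercased.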