Overview
1. The default analyzer
The default analyzer is standard, which is made up of the following components (a quick test follows the list):
standard tokenizer: splits text at word boundaries
standard token filter: does nothing
lowercase token filter: converts all tokens to lowercase
stop token filter (disabled by default): removes stopwords such as a, the, it, etc.
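To see these components in action, the standard analyzer can be exercised directly through the _analyze API without creating an index. This is a minimal sketch: the sample text is made up, and the request style follows the 5.x-era body syntax used throughout this article.
GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown Fox"
}
The response should contain the lowercased tokens the, quick, brown and fox; because the stop filter is disabled by default, the is not removed.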
2. Modifying analyzer settings
Example: enable the standard analyzer's stopword token filter with the English stopword list.
Here, es_std is the name we give the new analyzer.
PUT /index0
{
"settings": {
"analysis": {
"analyzer": {
"es_std":{
"type":"standard",
"stopwords":"_english_"
}
}
}
}
}
Test:
Analyze "a little dog" with the standard analyzer:
GET /index0/_analyze
{
"analyzer":"standard",
"text":"a little dog"
}
Result:
{
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "little",
"start_offset": 2,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "dog",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Now analyze "a little dog" with the es_std analyzer we configured; in the result the stopword a has been filtered out:
GET /index0/_analyze
{
"analyzer":"es_std",
"text":"a little dog"
}
Result:
{
"tokens": [
{
"token": "little",
"start_offset": 2,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "dog",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
}
]
}
3. Defining your own custom analyzer
Example
char_filter: a character filter of type mapping; here we define our own replacement filter that maps & to and, and name it &_to_and
my_stopwords: a token filter of type stop; here we define our own stopword list containing the two words a and the
my_analyzer: an analyzer of type custom. Before tokenization, html_strip strips HTML tags and &_to_and is our own character filter (replacing & with and); tokenization uses the standard tokenizer; stopwords are removed with my_stopwords, and all tokens are lowercased.
PUT /index0
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and":{
"type":"mapping",
"mappings":["&=> and"]
}
},
"filter":{
"my_stopwords":{
"type":"stop",
"stopwords":["a","the"]
}
},
"analyzer":{
"my_analyzer":{
"type":"custom",
"char_filter":["html_strip","&_to_and"],
"tokenizer":"standard",
"filter":["lowercase","my_stopwords"]
}
}
}
}
}
Executing this fails, because the index already exists:
{
"error": {
"root_cause": [
{
"type": "index_already_exists_exception",
"reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
"index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
"index": "index0"
}
],
"type": "index_already_exists_exception",
"reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
"index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
"index": "index0"
},
"status": 400
}
Delete the index first with DELETE /index0, then run the PUT again.
This time it succeeds:
{
"acknowledged": true,
"shards_acknowledged": true
}
Test our my_analyzer analyzer.
Sample text: tom and jery in the a house <a> & me HAHA
The result shows that a and the were filtered out, HAHA was lowercased, & was replaced with and, and the <a> tag was stripped.
GET /index0/_analyze
{
"analyzer": "my_analyzer",
"text":"tom and jery in the a house <a> & me HAHA"
}
Result:
{
"tokens": [
{
"token": "tom",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "and",
"start_offset": 4,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "jery",
"start_offset": 8,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "in",
"start_offset": 13,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "house",
"start_offset": 22,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "and",
"start_offset": 32,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "me",
"start_offset": 34,
"end_offset": 36,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "haha",
"start_offset": 37,
"end_offset": 41,
"type": "<ALPHANUM>",
"position": 9
}
]
}
4. Using our custom analyzer in the index
Set the content field of my_type to use our custom analyzer my_analyzer (setting a mapping requires PUT, not GET):
PUT /index0/_mapping/my_type
{
"properties":{
"content":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
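With the mapping in place, you can index a document and analyze text against the content field to confirm that my_analyzer is applied. This is a hedged sketch: the document id and the sample text below are made up for illustration.
PUT /index0/my_type/1
{
  "content": "Tom & Jerry in the house"
}
GET /index0/_analyze
{
  "field": "content",
  "text": "Tom & Jerry in the house"
}
Because a field-based _analyze request looks up the analyzer from the mapping, the response should contain the tokens tom, and, jerry, in and house: the is dropped by my_stopwords, & becomes and via &_to_and, and everything is lowercased.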