Elasticsearch Custom Filter Examples


This post, shared by 酷酷天空 on 靠谱客, collects examples of custom character filters, tokenizers, and token filters in Elasticsearch, offered here as a reference.

Overview

- HTML strip Character Filter
    - Adding an analyzer
    - Parameters
- Standard tokenizer
    - Parameters
- Lowercase token filter
    - Creating an analyzer
    - Parameters
    - Customizing
- Combined usage
- A more complex example

HTML strip Character Filter

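Strips HTML elements from the text and replaces HTML entities with their decoded values (for example, replaces &amp; with &). The html_strip filter uses Lucene's HTMLStripCharFilter.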

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
/* Output */
{
  "tokens" : [
    {
      "token" : "\nI'm so happy!\n",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}

Adding an analyzer

The following create index API request uses the html_strip filter to configure a new custom analyzer.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}

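Parameters

escaped_tags (optional, array of strings): an array of HTML element names without the enclosing angle brackets (< >). The filter skips these elements when stripping HTML from the text. For example, a value of [ "p" ] skips the <p> HTML element.

Example request. The custom character filter below, my_custom_html_strip_char_filter, leaves <b> elements in place: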

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
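
To make the effect of escaped_tags easier to see without a running cluster, here is a minimal Python sketch that approximates the filter with the standard library. It is not Lucene's HTMLStripCharFilter: the function name strip_html and the regular expression are assumptions made for illustration, and the real filter handles more cases (for instance, it replaces block-level tags such as <p> with a newline, which this sketch does not).

```python
import html
import re

# Matches an opening or closing tag and captures its name, e.g. <b>, </p>, <a href="x">.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html(text, escaped_tags=()):
    """Rough stand-in for html_strip: drop tags, then decode entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match):
        # Tags listed in escaped_tags are left in place; everything else is removed.
        return match.group(0) if match.group(1).lower() in keep else ""

    without_tags = TAG_RE.sub(replace_tag, text)
    # Decode entities after removing tags so that "&lt;b&gt;" is not mistaken for a tag.
    return html.unescape(without_tags)


sample = "<p>I&apos;m so <b>happy</b>!</p>"
print(strip_html(sample))                       # all tags removed
print(strip_html(sample, escaped_tags=["b"]))   # <b> kept, <p> removed
```

With escaped_tags set to ["b"], the <b> element survives while <p> is still stripped, which mirrors the configuration in the request above.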

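Standard tokenizer

The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.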

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
/* Output */
{
  "tokens" : [
    { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "QUICK", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "Brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "Foxes", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "dog's", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}

Parameters

max_token_length: the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
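
The splitting rule behind max_token_length can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenize function is hypothetical, and the regular expression is a crude stand-in for the UAX #29 segmentation that the real standard tokenizer performs.

```python
import re


def tokenize(text, max_token_length=255):
    """Split on non-word characters, then cut long tokens into fixed-size pieces."""
    tokens = []
    # Assumption: letters, digits and apostrophes form a token (not real UAX #29 rules).
    for word in re.findall(r"[\w']+", text):
        # Emit the word in slices of at most max_token_length characters.
        for start in range(0, len(word), max_token_length):
            tokens.append(word[start:start + max_token_length])
    return tokens


print(tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", max_token_length=5))
```

With a limit of 5, the six-letter word "jumped" is emitted as the two tokens jumpe and d, while every other word in the sentence fits within the limit and passes through unchanged.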

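Lowercase token filter

Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. In addition to the default filter, the lowercase token filter provides access to Lucene's language-specific lowercase filters for Greek, Irish, and Turkish.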

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}
/* Output */
[ the, quick, fox, jumps ]

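Creating an analyzer

The following create index API request uses the lowercase filter to configure a new custom analyzer.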

PUT lowercase_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "whitespace_lowercase" : {
          "tokenizer" : "whitespace",
          "filter" : ["lowercase"]
        }
      }
    }
  }
}

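Parameters

language (optional, string): the language-specific lowercase token filter to use. Valid values are:
1. greek: uses Lucene's GreekLowerCaseFilter
2. irish: uses Lucene's IrishLowerCaseFilter
3. turkish: uses Lucene's TurkishLowerCaseFilter
If not specified, defaults to Lucene's LowerCaseFilter.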
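
Why a language-specific variant matters is easiest to see with Turkish, where the lowercase of "I" is the dotless "ı" and the lowercase of "İ" is "i". The Python sketch below illustrates that rule only; turkish_lower is a hypothetical helper, not a port of TurkishLowerCaseFilter, which may handle further cases.

```python
def turkish_lower(text):
    """Lowercase text using Turkish rules for dotted and dotless i."""
    # Map the two Turkish-specific capitals first, then lowercase the rest normally.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print("ISPARTA".lower())         # default rules: ASCII "i"
print(turkish_lower("ISPARTA"))  # Turkish rules: dotless "ı"
```

A default lowercase filter would index the first form, so a Turkish query typed with a dotless ı would not match it; choosing "language": "turkish" avoids that mismatch.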

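Customizing

To customize the lowercase filter, duplicate it to create the basis for a new custom token filter. You can then modify the filter using its configurable parameters.
For example, the following request creates a custom lowercase filter for the Greek language.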

PUT custom_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}

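Combined usage

Setting type to custom declares that you are defining a custom analyzer (type can also be set to a built-in analyzer such as standard or simple). This example uses a tokenizer, token filters, and a character filter with their default configurations, but you can also create configured versions of each and use them in a custom analyzer.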

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
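
The order in which the three stages run (character filter, then tokenizer, then token filters in the listed order) can be traced with a small Python sketch. Everything here is an approximation built on the standard library: analyze and ascii_fold are hypothetical names, the tag-stripping regular expression and the \w+ tokenizer are simplifications, and ascii_fold covers only accents that decompose under Unicode normalization, which is a subset of what asciifolding does.

```python
import html
import re
import unicodedata


def ascii_fold(token):
    """Drop combining accents after Unicode decomposition (a subset of asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    # Character filter: remove tags and decode entities (stand-in for html_strip).
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    # Tokenizer: runs of word characters (stand-in for the standard tokenizer).
    tokens = re.findall(r"\w+", text)
    # Token filters, applied in the order listed in the analyzer: lowercase, then asciifolding.
    return [ascii_fold(token.lower()) for token in tokens]


print(analyze("Is this <b>déjà vu</b>?"))
```

For the sample text, the sketch produces the terms is, this, deja, and vu: the <b> tags are gone before tokenization, and the accents are folded only after lowercasing.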

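A more complex example

Character filter: Mapping Character Filter, configured to replace :) with _happy_ and :( with _sad_
Tokenizer: Pattern Tokenizer, configured to split on punctuation characters
Token filters: Lowercase Token Filter and Stop Token Filter (configured to use the predefined list of English stop words)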

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
/* Output */
[ i'm, _happy_, person, you ]
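
To see how each configured component contributes to that result, the pipeline can be mimicked in plain Python. This is a sketch under stated assumptions, not how Elasticsearch implements it: analyze is a hypothetical function, and ENGLISH_STOP is a deliberately small subset of stop words chosen for this sentence, whereas the predefined _english_ list is longer.

```python
import re

# Assumption: a small subset of English stop words, enough for this sentence.
ENGLISH_STOP = {"a", "an", "and", "the", "is", "of", "to", "in"}
# The same two mappings as the "emoticons" character filter.
EMOTICONS = {":)": "_happy_", ":(": "_sad_"}


def analyze(text):
    # Character filter: apply each mapping to the raw text.
    for emoticon, replacement in EMOTICONS.items():
        text = text.replace(emoticon, replacement)
    # Tokenizer: split on the same character class as the pattern tokenizer, dropping empties.
    tokens = [t for t in re.split(r"[ .,!?]", text) if t]
    # Token filters: lowercase first, then remove stop words.
    return [t.lower() for t in tokens if t.lower() not in ENGLISH_STOP]


print(analyze("I'm a :) person, and you?"))
```

The emoticon is rewritten before the text is split, which is why _happy_ survives as a single token; "a" and "and" are then dropped by the stop filter after lowercasing.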


Category: Elasticsearch 7.7 documentation translation
Published: 2023-10-08 14:36:27
Link: https://www.kaopuke.com/article/k-p-k_14_uzokfy_13_z_10_2.html
