Overview
1. How ngram and index-time search suggestions work
The idea: generate ngram tokens at index time, then match the search terms against those tokens directly.
What is an ngram?
Take the word quick; it has ngrams at 5 different lengths:
ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick
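The sliding-window idea above can be sketched in a few lines of Python (`ngrams` is a hypothetical helper written for illustration, not an Elasticsearch API):

```python
def ngrams(word, n):
    """Return all contiguous substrings of word with length n (sliding window)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "quick"
for n in range(1, len(word) + 1):
    print(f"ngram length={n}: {' '.join(ngrams(word, n))}")
```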
What is an edge ngram?
For quick, the ngrams are anchored at the first letter, so you get every prefix:
q
qu
qui
quic
quick
With edge ngrams, every word is further split into its prefixes, and those prefix tokens are what implements the prefix search-suggestion feature.
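A minimal Python sketch of what the edge_ngram token filter does to each word. The `min_gram`/`max_gram` parameters mirror the filter's settings; this is an illustration, not the actual Lucene implementation:

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Every prefix of word from min_gram to max_gram characters,
    anchored at the first letter (like the edge_ngram token filter)."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

# Index-time analysis of "hello world": lowercase, split into words,
# then expand every word into its edge ngrams.
tokens = [g for w in "hello world".lower().split() for g in edge_ngrams(w)]
print(tokens)
```

With min_gram=1 and max_gram=3, `edge_ngrams("hello", 1, 3)` produces only h, he, hel, matching the capped example further down.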
Consider two documents:

doc1: hello world
doc2: hello we

After edge-ngram analysis, the inverted index contains these terms (an asterisk marks which docs contain each term):

term     doc1  doc2
h         *     *
he        *     *
hel       *     *
hell      *     *
hello     *     *
w         *     *
wo        *
wor       *
worl      *
world     *
we              *
If you cap the ngram length, e.g. analyze "hello world" with:
min ngram = 1
max ngram = 3
then the word hello only produces:
h
he
hel
When you search for "hello w", each query term is looked up as an exact term: hello matches doc1 and doc2 and the lookup stops there, and w likewise matches doc1 without walking any further through the index.

hello w
hello --> term "hello": doc1, doc2
w     --> term "w": doc1

doc1 contains both hello and w, and their positions line up, so doc1 ("hello world") is returned.
So at search time there is no need to take a prefix and scan the whole inverted index. You simply look the prefix up in the inverted index as an ordinary term; if it is there, the document matches -- the same exact-term matching a normal match full-text query does.
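To make the "exact term lookup instead of a prefix scan" point concrete, here is a toy in-memory inverted index (a sketch for illustration, not Elasticsearch code; `edge_ngrams` and `search` are hypothetical helpers):

```python
from collections import defaultdict

def edge_ngrams(word, min_gram=1, max_gram=20):
    """Every prefix of word, like the edge_ngram token filter."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

docs = {1: "hello world", 2: "hello we"}

# Index time: every edge ngram becomes an ordinary term in the inverted index.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        for term in edge_ngrams(word):
            index[term].add(doc_id)

# Search time: the query is NOT expanded into ngrams; each query term is
# looked up directly as an exact term -- no scan over the index.
def search(query):
    ids = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*ids) if ids else set()

print(search("hello w"))   # -> {1, 2}: both docs produced the term "w"
print(search("hello wo"))  # -> {1}: only "world" produced the term "wo"
```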
2. Trying out ngram
First delete any existing index:
DELETE my_index

Result:
{ "acknowledged": true }
Create my_index with an autocomplete analyzer:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

Result:
{ "acknowledged": true, "shards_acknowledged": true }
Check how the analyzer tokenizes:
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

Result:
{
  "tokens": [
    { "token": "q",     "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "qu",    "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "qui",   "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "quic",  "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "quick", "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "b",     "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "br",    "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "bro",   "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brow",  "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brown", "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 }
  ]
}
Set up the mapping:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}

Result:
{ "acknowledged": true }

Note the search_analyzer: at search time we keep the standard analyzer, so a query like "hello w" is tokenized into hello + w. The query itself must not go through ngram/edge_ngram analysis, otherwise searching would be slower.
For example, when a document is indexed, the title "hello world" produces the terms:
h
he
hel
hell
hello
w
wo
wor
worl
world
Now you search for:
hello w
If the query were also analyzed with the autocomplete analyzer, it would be expanded into:
h
he
hel
hell
hello
w
With the standard search analyzer, it is simply split: hello w --> hello + w.
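The two analysis paths for the query string can be simulated directly (`edge_ngrams` is the same hypothetical helper as in the sketches above, repeated so this snippet stands alone):

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Every prefix of word, like the edge_ngram token filter."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

query = "hello w"
standard_terms = query.lower().split()  # roughly what the standard analyzer yields
autocomplete_terms = [g for w in standard_terms for g in edge_ngrams(w)]

print(standard_terms)      # ['hello', 'w']
print(autocomplete_terms)  # ['h', 'he', 'hel', 'hell', 'hello', 'w']
```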
Add test data:
PUT /my_index/my_type/1
{ "title": "hello world" }

PUT /my_index/my_type/2
{ "title": "hello we" }

PUT /my_index/my_type/3
{ "title": "hello win" }

PUT /my_index/my_type/4
{ "title": "hello dog" }
Testing
Test 1 -- match_phrase:
GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Result:
{
  "took": 20,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1.1983768,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1.1983768, "_source": { "title": "hello we" } },
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 0.8271048, "_source": { "title": "hello world" } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 0.797104,  "_source": { "title": "hello win" } }
    ]
  }
}
Test 2 -- match:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

Result:
{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1.1983768,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1.1983768, "_source": { "title": "hello we" } },
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 0.8271048, "_source": { "title": "hello world" } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 0.797104,  "_source": { "title": "hello win" } },
      { "_index": "my_index", "_type": "my_type", "_id": "4", "_score": 0.2495691, "_source": { "title": "hello dog" } }
    ]
  }
}
With match, documents that only contain hello (like "hello dog") are also returned -- that is ordinary full-text semantics, just with a lower score.
match_phrase is the better fit here: it requires every term to be present and their positions to be adjacent (exactly one apart), which is exactly the prefix-suggestion behavior we want.
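A toy simulation of why match and match_phrase differ on these four documents (boolean matching only, ignoring scoring; a sketch of the idea, not real Elasticsearch internals):

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Every prefix of word, like the edge_ngram token filter."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

docs = {1: "hello world", 2: "hello we", 3: "hello win", 4: "hello dog"}

# term -> {doc_id: set of word positions}
index = {}
for doc_id, text in docs.items():
    for pos, word in enumerate(text.lower().split()):
        for term in edge_ngrams(word):
            index.setdefault(term, {}).setdefault(doc_id, set()).add(pos)

def match(query):
    """Any query term may match (full-text OR semantics)."""
    return {d for t in query.lower().split() for d in index.get(t, {})}

def match_phrase(query):
    """Every term must match, at consecutive positions."""
    terms = query.lower().split()
    hits = set()
    for doc_id in docs:
        postings = [index.get(t, {}).get(doc_id) for t in terms]
        if all(postings) and any(
            all(p + i in postings[i] for i in range(len(terms)))
            for p in postings[0]
        ):
            hits.add(doc_id)
    return hits

print(sorted(match("hello w")))         # [1, 2, 3, 4] -- "hello dog" matches on "hello" alone
print(sorted(match_phrase("hello w")))  # [1, 2, 3] -- "w" must directly follow "hello"
```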
Finally
The content above -- Advanced Part 23, Deep Dive into Search: implementing index-time search suggestions with the ngram tokenization mechanism -- was collected and organized by 拉长小懒虫.