Overview
Dice's coefficient (also known as the Dice coefficient) is a similarity measure related to the Jaccard index.
For sets X and Y of keywords used in information retrieval, the coefficient may be defined as:[1]
s = 2|X ∩ Y| / (|X| + |Y|)
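A minimal Python sketch of the set-based definition, assuming the usual form: twice the size of the intersection divided by the sum of the two set sizes. The function name `dice_coefficient`, the sample keyword sets, and the handling of two empty sets are illustrative choices, not part of the cited source.

```python
def dice_coefficient(x, y):
    """Dice's coefficient for two collections, treated as sets."""
    x, y = set(x), set(y)
    if not x and not y:
        # Assumption: two empty sets are treated as identical.
        # The formula itself is undefined here (0 / 0).
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical keyword sets for two documents.
doc_a = {"information", "retrieval", "index"}
doc_b = {"information", "retrieval", "query", "ranking"}

# Two shared keywords, set sizes 3 and 4: 2 * 2 / (3 + 4)
score = dice_coefficient(doc_a, doc_b)
print(round(score, 3))
```

The result always lies between 0 (no shared keywords) and 1 (identical sets).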
When taken as a string similarity measure, the coefficient may be calculated for two strings x and y using bigrams as follows:[2]
s = 2nt / (nx + ny)
where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x and ny is the number of bigrams in string y. For example, to calculate the similarity between:
- night
- nacht
We would find the set of bigrams in each word:
- {ni, ig, gh, ht}
- {na, ac, ch, ht}
Each set has 4 elements, and the intersection of these two sets has only one element: ht.
Plugging this into the formula, we calculate s = (2 * 1) / (4 + 4) = 0.25.
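The worked example can be reproduced with a short Python sketch. It assumes, as the example does, that each string is reduced to a set of character bigrams, so a bigram repeated within one string is counted once. The helper names `bigrams` and `bigram_dice` are illustrative, and returning 0.0 when neither string has any bigrams is an assumption made here to avoid division by zero.

```python
def bigrams(s):
    """Return the set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_dice(x, y):
    """Dice's coefficient over the character-bigram sets of x and y."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: strings shorter than two characters have no
        # bigrams, so the similarity is reported as 0.0.
        return 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))       # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))       # ['ac', 'ch', 'ht', 'na']
print(bigram_dice("night", "nacht"))  # 0.25
```

If repeated bigrams should count separately, a multiset such as `collections.Counter` could replace the sets; that variant would give different scores for strings with repeated bigrams.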
See also
- Jaccard index
- Levenshtein distance
- Sørensen similarity index
Notes
- [1] C. J. van Rijsbergen (1979)
- [2] Kondrak, G. et al. (2003)
References
- C. J. van Rijsbergen (1979) Information Retrieval (London: Butterworths)
- Kondrak, G., Marcu, D. and Knight, K. (2003) "Cognates Can Improve Statistical Translation Models" in Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 46-48