我是靠谱客的博主 疯狂宝贝,最近开发中收集的这篇文章主要介绍字符串相似度之编辑距离(levenshtein、jaro、jaro_winkler算法)的rust实现,觉得挺不错的,现在分享给大家,希望可以做个参考。
概述
最近闲着没事学学rust;正好公司之前用来分析账号相似度的模块是用python写的,于是想到用rust重写底层算法提高运行效率,顺便练练手。
稍微翻了一下github,在字符串相似度方面现成的开源rust轮子不多,一个手数的过来,而且质量普遍不算很高(无脑递归,或者直接将jaro_winkler的p因子固定为0.1,前者仅是运行性能层面的不足,而后者则会导致运算结果与预期不一致。另外某高star的项目是不支持中文编码的。)
先把levenshtein、jaro、jaro_winkler的轮子造一下,基本够用,有空再把其他基础算法补上吧。
gitee:https://gitee.com/DontBeProud/str_sim
github:https://github.com/DontBeProud/str_sim
levenshtein.rs
use std::cmp::min;
/// levenshtein edit distance
pub fn levenshtein_distance(s1: &str, s2:&str) -> usize{
let row = s1.chars().count() + 1;
let col = s2.chars().count() + 1;
let mut matrix = vec![vec![0; col].into_boxed_slice(); row].into_boxed_slice();
for i in 0..row{
for j in 0..col{
if i * j == 0{
matrix[i][j] = i + j;
}
else {
let b_not_equal = if s1.chars().nth(i - 1) != s2.chars().nth(j - 1) {1} else {0};
matrix[i][j] = min(min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1), matrix[i - 1][j - 1] + b_not_equal);
}
}
}
matrix[row - 1][col - 1]
}
jaro.rs
use std::cmp;
/// jaro similarity
pub fn sim_jaro(s1: &str, s2: &str) -> f64 {
let s1_len = s1.chars().count();
let s2_len = s2.chars().count();
if s1_len == 0 && s2_len == 0 { return 1.0; }
let match_distance = cmp::max(s1_len, s2_len) / 2 - 1;
let mut s1_matches = vec![false; s1_len];
let mut s2_matches = vec![false; s2_len];
let mut m: isize = 0;
for i in 0..s1_len {
let start = cmp::max(0, i as isize - match_distance as isize) as usize;
let end = cmp::min(i + match_distance + 1, s2_len);
for j in start..end {
if !s2_matches[j] && s1.chars().nth(i) == s2.chars().nth(j) {
s1_matches[i] = true;
s2_matches[j] = true;
m += 1;
break;
}
}
}
if m == 0 { return 0.0; }
let mut t = 0.0;
let mut k = 0;
for i in 0..s1_len {
if s1_matches[i] {
while !s2_matches[k] { k += 1; }
if s1.chars().nth(i) != s2.chars().nth(k) { t += 0.5; }
k += 1;
}
}
let m = m as f64;
(m / s1_len as f64 + m / s2_len as f64 + (m
- t) / m) / 3.0
}
jaro_winkler.rs
use crate::jaro::sim_jaro;
use std::cmp::min;
/// jaro-winkler similarity
pub fn sim_jaro_winkler(s1: &str, s2: &str, prefix_weight: f64) -> f64 {
let jaro_sim = sim_jaro(s1, s2);
let prefix_len = get_longest_prefix_length(s1, s2);
let mut prefix_factor_coefficient = prefix_weight * prefix_len as f64;
prefix_factor_coefficient = if prefix_factor_coefficient < 0.0 { 0.0 } else if prefix_factor_coefficient > 1.0 { 1.0 } else { prefix_factor_coefficient };
jaro_sim + prefix_factor_coefficient * (1.0 - jaro_sim)
}
fn get_longest_prefix_length(s1: &str, s2: &str) -> usize{
let min_len = min(s1.chars().count(), s2.chars().count());
let mut longest_prefix_length = min_len;
for i in 0..min_len{
if s1.chars().nth(i) != s2.chars().nth(i){
longest_prefix_length = i;
break;
}
}
longest_prefix_length
}
最后
以上就是疯狂宝贝为你收集整理的字符串相似度之编辑距离(levenshtein、jaro、jaro_winkler算法)的rust实现的全部内容,希望文章能够帮你解决字符串相似度之编辑距离(levenshtein、jaro、jaro_winkler算法)的rust实现所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
发表评论 取消回复