概述
-
算法代码实现:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869package
com.util;
public
class
SimFeatureUtil {
private
static
int
min(
int
one,
int
two,
int
three) {
int
min = one;
if
(two < min) {
min = two;
}
if
(three < min) {
min = three;
}
return
min;
}
public
static
int
ld(String str1, String str2) {
int
d[][];
// 矩阵
int
n = str1.length();
int
m = str2.length();
int
i;
// 遍历str1的
int
j;
// 遍历str2的
char
ch1;
// str1的
char
ch2;
// str2的
int
temp;
// 记录相同字符,在某个矩阵位置值的增量,不是0就是1
if
(n ==
0
) {
return
m;
}
if
(m ==
0
) {
return
n;
}
d =
new
int
[n +
1
][m +
1
];
for
(i =
0
; i <= n; i++) {
// 初始化第一列
d[i][
0
] = i;
}
for
(j =
0
; j <= m; j++) {
// 初始化第一行
d[
0
][j] = j;
}
for
(i =
1
; i <= n; i++) {
// 遍历str1
ch1 = str1.charAt(i -
1
);
// 去匹配str2
for
(j =
1
; j <= m; j++) {
ch2 = str2.charAt(j -
1
);
if
(ch1 == ch2) {
temp =
0
;
}
else
{
temp =
1
;
}
// 左边+1,上边+1, 左上角+temp取最小
d[i][j] = min(d[i -
1
][j] +
1
, d[i][j -
1
] +
1
, d[i -
1
][j -
1
]+ temp);
}
}
return
d[n][m];
}
public
static
double
sim(String str1, String str2) {
try
{
double
ld = (
double
)ld(str1, str2);
return
(
1
-ld/(
double
)Math.max(str1.length(), str2.length()));
}
catch
(Exception e) {
return
0.1
;
}
}
public
static
void
main(String[] args) {
String str1 =
"测试12"
;
String str2 =
"测试123"
;
System.out.println(
"ld="
+ ld(str1, str2));
System.out.println(
"sim="
+ sim(str1, str2));
}
}
算法介绍:
算法原理:
设我们可以使用d[ i , j ]个步骤(可以使用一个二维数组保存这个值),表示将串s[ 1…i ] 转换为 串t [ 1…j ]所需要的最少步骤个数,那么,在最基本的情况下,即在i等于0时,也就是说串s为空,那么对应的d[0,j] 就是 增加j个字符,使得s转化为t,在j等于0时,也就是说串t为空,那么对应的d[i,0] 就是 减少 i个字符,使得s转化为t。
然后我们考虑一般情况,加一点动态规划的想法,我们要想得到将s[1..i]经过最少次数的增加,删除,或者替换操作就转变为t[1..j],那么我们就必须在之前可以以最少次数的增加,删除,或者替换操作,使得现在串s和串t只需要再做一次操作或者不做就可以完成s[1..i]到t[1..j]的转换。所谓的“之前”分为下面三种情况:
1)我们可以在k个操作内将 s[1…i] 转换为 t[1…j-1]
2)我们可以在k个操作里面将s[1..i-1]转换为t[1..j]
3)我们可以在k个步骤里面将 s[1…i-1] 转换为 t [1…j-1]
针对第1种情况,我们只需要在最后将 t[j] 加上s[1..i]就完成了匹配,这样总共就需要k+1个操作。
针对第2种情况,我们只需要在最后将s[i]移除,然后再做这k个操作,所以总共需要k+1个操作。
针对第3种情况,我们只需要在最后将s[i]替换为 t[j],使得满足s[1..i] == t[1..j],这样总共也需要k+1个操作。而如果在第3种情况下,s[i]刚好等于t[j],那我们就可以仅仅使用k个操作就完成这个过程。
最后,为了保证得到的操作次数总是最少的,我们可以从上面三种情况中选择消耗最少的一种最为将s[1..i]转换为t[1..j]所需要的最小操作次数。
算法实现步骤:
步骤 说明 1 设置n为字符串s的长度。("GUMBO")
设置m为字符串t的长度。("GAMBOL")
如果n等于0,返回m并退出。
如果m等于0,返回n并退出。
构造两个向量v0[m+1] 和v1[m+1],串联0..m之间所有的元素。2 初始化 v0 to 0..m。 3 检查 s (i from 1 to n) 中的每个字符。 4 检查 t (j from 1 to m) 中的每个字符 5 如果 s[i] 等于 t[j],则编辑代价cost为 0;
如果 s[i] 不等于 t[j],则编辑代价cost为1。6 设置单元v1[j]为下面的最小值之一:
a、紧邻该单元上方+1:v1[j-1] + 1
b、紧邻该单元左侧+1:v0[j] + 1
c、该单元对角线上方和左侧+cost:v0[j-1] + cost7 在完成迭代 (3, 4, 5, 6) 之后,v1[m]便是编辑距离的值。
算法步骤详解:
本小节将演示如何计算"GUMBO"和"GAMBOL"两个字符串的Levenshtein距离。
步骤1、2
v0 v1 G U M B O 0 1 2 3 4 5 G 1 A 2 M 3 B 4 O 5 L 6
初始化完了之后重点是理解步骤6.
步骤3-6,当 i = 1
v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 A 2 1 M 3 2 B 4 3 O 5 4 L 6 5 我们算V1中的值:以红色的0所在的格子为例
根据步骤5:
如果 s[i] 等于 t[j],则编辑代价cost为 0;
如果 s[i] 不等于 t[j],则编辑代价cost为1。
和
步骤6: 设置单元v1[j]为下面的最小值之一:
a、紧邻该单元上方+1:v1[j-1] + 1
b、紧邻该单元左侧+1:v0[j] + 1
c、该单元对角线上方和左侧+cost:v0[j-1] + cost
得到:
a: 该格子所在上方为 1加上1为2 b:该格子左边为1加上1为2 c:该格子对角线上方和左侧(也就是左斜对角)为0+ cost(cost是通过步骤5得到的编辑花费,这里G等于G所以编辑花费为0,cost为0) 为0
三个值中 最小的为0,则 该格子的值为0
其他格子以此类推。
步骤3-6,当 i = 2
v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 A 2 1 1 M 3 2 2 B 4 3 3 O 5 4 4 L 6 5 5 步骤3-6,当 i = 3
v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 2 A 2 1 1 2 M 3 2 2 1 B 4 3 3 2 O 5 4 4 3 L 6 5 5 4 步骤3-6,当 i = 4
v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 2 3 A 2 1 1 2 3 M 3 2 2 1 2 B 4 3 3 2 1 O 5 4 4 3 2 L 6 5 5 4 3 步骤3-6,当 i = 5
v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 2 3 4 A 2 1 1 2 3 4 M 3 2 2 1 2 3 B 4 3 3 2 1 2 O 5 4 4 3 2 1 L 6 5 5 4 3 2 步骤7
编辑距离就是矩阵右下角的值,v1[m] == 2。由"GUMBO"变换为"GAMBOL"的过程对于我来说是很只管的,即通过将"A"替换为"U",并在末尾追加"L"这样子(实际上替换的过程是由移除和插入两个操作组合而成的)。
我们得到最小编辑距离为2
那么它们的相似度为 (1-ld/(double)Math.max(str1.length(), str2.length()));
1 - 2/6=0.6666666666666667
参考链接:
http://www.cnblogs.com/ymind/archive/2012/03/27/fast-memory-efficient-Levenshtein-algorithm.html
http://teiraisan.blog.163.com/blog/static/12278141420098685835372/
http://teiraisan.blog.163.com/blog/static/12278141420098685835372/
其他语言的代码实现:
c++
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164In C++, the size of an array must be a constant, and
this
code fragment causes an error at compile time:
int
sz =
5
;
int
arr[sz];
This limitation makes the following C++ code slightly more complicated than it would be
if
the matrix could simply be declared as a two-dimensional array, with a size determined at run-time.
In C++ it
's more idiomatic to use the System Template Library'
s vector
class
, as Anders Sewerin Johansen has done in an alternative C++ implementation.
Here is the definition of the
class
(distance.h):
class
Distance
{
public
:
int
LD (
char
const
*s,
char
const
*t);
private
:
int
Minimum (
int
a,
int
b,
int
c);
int
*GetCellPointer (
int
*pOrigin,
int
col,
int
row,
int
nCols);
int
GetAt (
int
*pOrigin,
int
col,
int
row,
int
nCols);
void
PutAt (
int
*pOrigin,
int
col,
int
row,
int
nCols,
int
x);
};
Here is the implementation of the
class
(distance.cpp):
#include
"distance.h"
#include <string.h>
#include <malloc.h>
//****************************
// Get minimum of three values
//****************************
int
Distance::Minimum (
int
a,
int
b,
int
c)
{
int
mi;
mi = a;
if
(b < mi) {
mi = b;
}
if
(c < mi) {
mi = c;
}
return
mi;
}
//**************************************************
// Get a pointer to the specified cell of the matrix
//**************************************************
int
*Distance::GetCellPointer (
int
*pOrigin,
int
col,
int
row,
int
nCols)
{
return
pOrigin + col + (row * (nCols +
1
));
}
//*****************************************************
// Get the contents of the specified cell in the matrix
//*****************************************************
int
Distance::GetAt (
int
*pOrigin,
int
col,
int
row,
int
nCols)
{
int
*pCell;
pCell = GetCellPointer (pOrigin, col, row, nCols);
return
*pCell;
}
//*******************************************************
// Fill the specified cell in the matrix with the value x
//*******************************************************
void
Distance::PutAt (
int
*pOrigin,
int
col,
int
row,
int
nCols,
int
x)
{
int
*pCell;
pCell = GetCellPointer (pOrigin, col, row, nCols);
*pCell = x;
}
//*****************************
// Compute Levenshtein distance
//*****************************
int
Distance::LD (
char
const
*s,
char
const
*t)
{
int
*d;
// pointer to matrix
int
n;
// length of s
int
m;
// length of t
int
i;
// iterates through s
int
j;
// iterates through t
char
s_i;
// ith character of s
char
t_j;
// jth character of t
int
cost;
// cost
int
result;
// result
int
cell;
// contents of target cell
int
above;
// contents of cell immediately above
int
left;
// contents of cell immediately to left
int
diag;
// contents of cell immediately above and to left
int
sz;
// number of cells in matrix
// Step 1
n = strlen (s);
m = strlen (t);
if
(n ==
0
) {
return
m;
}
if
(m ==
0
) {
return
n;
}
sz = (n+
1
) * (m+
1
) * sizeof (
int
);
d = (
int
*) malloc (sz);
// Step 2
for
(i =
0
; i <= n; i++) {
PutAt (d, i,
0
, n, i);
}
for
(j =
0
; j <= m; j++) {
PutAt (d,
0
, j, n, j);
}
// Step 3
for
(i =
1
; i <= n; i++) {
s_i = s[i-
1
];
// Step 4
for
(j =
1
; j <= m; j++) {
t_j = t[j-
1
];
// Step 5
if
(s_i == t_j) {
cost =
0
;
}
else
{
cost =
1
;
}
// Step 6
above = GetAt (d,i-
1
,j, n);
left = GetAt (d,i, j-
1
, n);
diag = GetAt (d, i-
1
,j-
1
, n);
cell = Minimum (above +
1
, left +
1
, diag + cost);
PutAt (d, i, j, n, cell);
}
}
// Step 7
result = GetAt (d, n, m, n);
free (d);
return
result;
} </malloc.h></string.h>
Visual Basic
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293'*******************************
'*** Get minimum of three values
'*******************************
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim mi As Integer
mi = a
If b < mi Then
mi = b
End If
If c < mi Then
mi = c
End If
Minimum = mi
End Function
'********************************
'*** Compute Levenshtein Distance
'********************************
Public Function LD(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim n As Integer ' length of s
Dim i As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost
' Step
1
n = Len(s)
m = Len(t)
If n =
0
Then
LD = m
Exit Function
End If
If m =
0
Then
LD = n
Exit Function
End If
ReDim d(
0
To n,
0
To m) As Integer
' Step
2
For i =
0
To n
d(i,
0
) = i
Next i
For j =
0
To m
d(
0
, j) = j
Next j
' Step
3
For i =
1
To n
s_i = Mid$(s, i,
1
)
' Step
4
For j =
1
To m
t_j = Mid$(t, j,
1
)
' Step
5
If s_i = t_j Then
cost =
0
Else
cost =
1
End If
' Step
6
d(i, j) = Minimum(d(i -
1
, j) +
1
, d(i, j -
1
) +
1
, d(i -
1
, j -
1
) + cost)
Next j
Next i
' Step
7
LD = d(n, m)
Erase d
End Function
Python代码
123456789101112131415161718192021222324252627282930313233#!/user/bin/env python
# -*- coding: utf-
8
-*-
class
arithmetic():
def __init__(self):
pass
''
''
' 【编辑距离算法】 【levenshtein distance】 【字符串相似度算法】 '
''
def levenshtein(self,first,second):
if
len(first) > len(second):
first,second = second,first
if
len(first) ==
0
:
return
len(second)
if
len(second) ==
0
:
return
len(first)
first_length = len(first) +
1
second_length = len(second) +
1
distance_matrix = [range(second_length)
for
x in range(first_length)]
#print distance_matrix
for
i in range(
1
,first_length):
for
j in range(
1
,second_length):
deletion = distance_matrix[i-
1
][j] +
1
insertion = distance_matrix[i][j-
1
] +
1
substitution = distance_matrix[i-
1
][j-
1
]
if
first[i-
1
] != second[j-
1
]:
substitution +=
1
distance_matrix[i][j] = min(insertion,deletion,substitution)
print distance_matrix
return
distance_matrix[first_length-
1
][second_length-
1
]
if
__name__ ==
"__main__"
:
arith = arithmetic()
print arith.levenshtein(
'GUMBOsdafsadfdsafsafsadfasfadsfasdfasdfs'
,
'GAMBOL00000000000dfasfasfdafsafasfasdfdsa'
http://www.2cto.com/kf/201407/314271.html
最后
以上就是默默招牌为你收集整理的java文本相似度计算(Levenshtein Distance算法(中文翻译:编辑距离算法))----代码和详解的全部内容,希望文章能够帮你解决java文本相似度计算(Levenshtein Distance算法(中文翻译:编辑距离算法))----代码和详解所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
发表评论 取消回复