java文本相似度计算(Levenshtein Distance算法(中文翻译：编辑距离算法))----代码和详解

293 阅读 0 评论 194 点赞

我是靠谱客的博主默默招牌，这篇文章主要介绍java文本相似度计算(Levenshtein Distance算法(中文翻译：编辑距离算法))----代码和详解，现在分享给大家，希望可以做个参考。

算法代码实现:

 
     ? 
    
          package 
           com.util; 
         
          public 
           class 
           SimFeatureUtil { 
         
          private 
           static 
           int 
           min( 
          int 
           one,  
          int 
           two,  
          int 
           three) { 
         
          int 
           min = one; 
         
          if 
           (two < min) { 
         
          min = two; 
         
          } 
         
          if 
           (three < min) { 
         
          min = three; 
         
          } 
         
          return 
           min; 
         
          } 
         
          public 
           static 
           int 
           ld(String str1, String str2) { 
         
          int 
           d[][];  
          // 矩阵 
         
          int 
           n = str1.length(); 
         
          int 
           m = str2.length(); 
         
          int 
           i;  
          // 遍历str1的 
         
          int 
           j;  
          // 遍历str2的 
         
          char 
           ch1;  
          // str1的 
         
          char 
           ch2;  
          // str2的 
         
          int 
           temp;  
          // 记录相同字符,在某个矩阵位置值的增量,不是0就是1 
         
          if 
           (n ==  
          0 
          ) { 
         
          return 
           m; 
         
          } 
         
          if 
           (m ==  
          0 
          ) { 
         
          return 
           n; 
         
          } 
         
          d =  
          new 
           int 
          [n +  
          1 
          ][m +  
          1 
          ]; 
         
          for 
           (i =  
          0 
          ; i <= n; i++) {  
          // 初始化第一列 
         
          d[i][ 
          0 
          ] = i; 
         
          } 
         
          for 
           (j =  
          0 
          ; j <= m; j++) {  
          // 初始化第一行 
         
          d[ 
          0 
          ][j] = j; 
         
          } 
         
          for 
           (i =  
          1 
          ; i <= n; i++) {  
          // 遍历str1 
         
          ch1 = str1.charAt(i -  
          1 
          ); 
         
          // 去匹配str2 
         
          for 
           (j =  
          1 
          ; j <= m; j++) { 
         
          ch2 = str2.charAt(j -  
          1 
          ); 
         
          if 
           (ch1 == ch2) { 
         
          temp =  
          0 
          ; 
         
          }  
          else 
           { 
         
          temp =  
          1 
          ; 
         
          } 
         
          // 左边+1,上边+1, 左上角+temp取最小 
         
          d[i][j] = min(d[i -  
          1 
          ][j] +  
          1 
          , d[i][j -  
          1 
          ] +  
          1 
          , d[i -  
          1 
          ][j -  
          1 
          ]+ temp); 
         
          } 
         
          } 
         
          return 
           d[n][m]; 
         
          } 
         
          public 
           static 
           double 
           sim(String str1, String str2) { 
         
          try 
           { 
         
          double 
           ld = ( 
          double 
          )ld(str1, str2); 
         
          return 
           ( 
          1 
          -ld/( 
          double 
          )Math.max(str1.length(), str2.length())); 
         
          }  
          catch 
           (Exception e) { 
         
          return 
           0.1 
          ; 
         
          } 
         
          } 
         
          public 
           static 
           void 
           main(String[] args) { 
         
          String str1 =  
          "测试12" 
          ; 
         
          String str2 =  
          "测试123" 
          ; 
         
          System.out.println( 
          "ld=" 
           + ld(str1, str2)); 
         
          System.out.println( 
          "sim=" 
           + sim(str1, str2)); 
         
          } 
         
          }

算法介绍:

编辑距离（Edit Distance），又称Levenshtein距离，是指两个字串之间，由一个转成另一个所需的最少编辑操作次数。许可的编辑操作包括将一个字符替换成另一个字符，插入一个字符，删除一个字符。

算法原理:

设我们可以使用d[ i , j ]个步骤（可以使用一个二维数组保存这个值），表示将串s[ 1…i ] 转换为串t [ 1…j ]所需要的最少步骤个数，那么，在最基本的情况下，即在i等于0时，也就是说串s为空，那么对应的d[0,j] 就是增加j个字符，使得s转化为t，在j等于0时，也就是说串t为空，那么对应的d[i,0] 就是减少 i个字符，使得s转化为t。

然后我们考虑一般情况，加一点动态规划的想法，我们要想得到将s[1..i]经过最少次数的增加，删除，或者替换操作就转变为t[1..j]，那么我们就必须在之前可以以最少次数的增加，删除，或者替换操作，使得现在串s和串t只需要再做一次操作或者不做就可以完成s[1..i]到t[1..j]的转换。所谓的“之前”分为下面三种情况：

1）我们可以在k个操作内将 s[1…i] 转换为 t[1…j-1]

2）我们可以在k个操作里面将s[1..i-1]转换为t[1..j]

3）我们可以在k个步骤里面将 s[1…i-1] 转换为 t [1…j-1]

针对第1种情况，我们只需要在最后将 t[j] 加上s[1..i]就完成了匹配，这样总共就需要k+1个操作。

针对第2种情况，我们只需要在最后将s[i]移除，然后再做这k个操作，所以总共需要k+1个操作。

针对第3种情况，我们只需要在最后将s[i]替换为 t[j]，使得满足s[1..i] == t[1..j]，这样总共也需要k+1个操作。而如果在第3种情况下，s[i]刚好等于t[j]，那我们就可以仅仅使用k个操作就完成这个过程。

最后，为了保证得到的操作次数总是最少的，我们可以从上面三种情况中选择消耗最少的一种最为将s[1..i]转换为t[1..j]所需要的最小操作次数。

算法实现步骤:

步骤	说明
1	设置n为字符串s的长度。("GUMBO") 设置m为字符串t的长度。("GAMBOL") 如果n等于0，返回m并退出。如果m等于0，返回n并退出。构造两个向量v0[m+1] 和v1[m+1]，串联0..m之间所有的元素。
2	初始化 v0 to 0..m。
3	检查 s (i from 1 to n) 中的每个字符。
4	检查 t (j from 1 to m) 中的每个字符
5	如果 s[i] 等于 t[j]，则编辑代价cost为 0；如果 s[i] 不等于 t[j]，则编辑代价cost为1。
6	设置单元v1[j]为下面的最小值之一： a、紧邻该单元上方+1：v1[j-1] + 1 b、紧邻该单元左侧+1：v0[j] + 1 c、该单元对角线上方和左侧+cost：v0[j-1] + cost
7	在完成迭代 (3, 4, 5, 6) 之后，v1[m]便是编辑距离的值。

算法步骤详解:

本小节将演示如何计算"GUMBO"和"GAMBOL"两个字符串的Levenshtein距离。

步骤1、2

	v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1
A	2
M	3
B	4
O	5
L	6

初始化完了之后重点是理解步骤6.

步骤3-6，当 i = 1

	v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0
A	2	1
M	3	2
B	4	3
O	5	4
L	6	5

我们算V1中的值：以红色的0所在的格子为例

根据步骤5：
如果 s[i] 等于 t[j]，则编辑代价cost为 0；
如果 s[i] 不等于 t[j]，则编辑代价cost为1。
和
步骤6: 设置单元v1[j]为下面的最小值之一：
a、紧邻该单元上方+1：v1[j-1] + 1
b、紧邻该单元左侧+1：v0[j] + 1
c、该单元对角线上方和左侧+cost：v0[j-1] + cost

得到:
a：该格子所在上方为 1加上1为2 b：该格子左边为1加上1为2 c：该格子对角线上方和左侧(也就是左斜对角)为0+ cost（cost是通过步骤5得到的编辑花费，这里G等于G所以编辑花费为0，cost为0）为0
三个值中最小的为0，则该格子的值为0
其他格子以此类推。

步骤3-6，当 i = 2

		v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1
A	2	1	1
M	3	2	2
B	4	3	3
O	5	4	4
L	6	5	5

步骤3-6，当 i = 3

			v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2
A	2	1	1	2
M	3	2	2	1
B	4	3	3	2
O	5	4	4	3
L	6	5	5	4

步骤3-6，当 i = 4

				v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2	3
A	2	1	1	2	3
M	3	2	2	1	2
B	4	3	3	2	1
O	5	4	4	3	2
L	6	5	5	4	3

步骤3-6，当 i = 5

					v0	v1
		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2	3	4
A	2	1	1	2	3	4
M	3	2	2	1	2	3
B	4	3	3	2	1	2
O	5	4	4	3	2	1
L	6	5	5	4	3	2

步骤7

编辑距离就是矩阵右下角的值，v1[m] == 2。由"GUMBO"变换为"GAMBOL"的过程对于我来说是很只管的，即通过将"A"替换为"U"，并在末尾追加"L"这样子（实际上替换的过程是由移除和插入两个操作组合而成的）。

我们得到最小编辑距离为2

那么它们的相似度为 (1-ld/(double)Math.max(str1.length(), str2.length()));

1 - 2/6=0.6666666666666667

参考链接:

http://www.cnblogs.com/ymind/archive/2012/03/27/fast-memory-efficient-Levenshtein-algorithm.html

http://teiraisan.blog.163.com/blog/static/12278141420098685835372/

其他语言的代码实现:

c++

 
     ? 
    
          In C++, the size of an array must be a constant, and  
          this 
           code fragment causes an error at compile time:    
         
          int 
           sz =  
          5 
          ;    
         
          int 
           arr[sz];    
         
          This limitation makes the following C++ code slightly more complicated than it would be  
          if 
           the matrix could simply be declared as a two-dimensional array, with a size determined at run-time.    
         
          In C++ it 
          's more idiomatic to use the System Template Library' 
          s vector  
          class 
          , as Anders Sewerin Johansen has done in an alternative C++ implementation.    
         
          Here is the definition of the  
          class 
           (distance.h):    
         
          class 
           Distance    
         
          {    
         
          public 
          :    
         
          int 
           LD ( 
          char 
           const 
           *s,  
          char 
           const 
           *t);    
         
          private 
          :    
         
          int 
           Minimum ( 
          int 
           a,  
          int 
           b,  
          int 
           c);    
         
          int 
           *GetCellPointer ( 
          int 
           *pOrigin,  
          int 
           col,  
          int 
           row,  
          int 
           nCols);    
         
          int 
           GetAt ( 
          int 
           *pOrigin,  
          int 
           col,  
          int 
           row,  
          int 
           nCols);    
         
          void 
           PutAt ( 
          int 
           *pOrigin,  
          int 
           col,  
          int 
           row,  
          int 
           nCols,  
          int 
           x);    
         
          };     
         
          Here is the implementation of the  
          class 
           (distance.cpp):    
         
          #include  
          "distance.h"  
         
          #include <string.h>    
         
          #include <malloc.h>    
         
          //****************************    
         
          // Get minimum of three values    
         
          //****************************    
         
          int 
           Distance::Minimum ( 
          int 
           a,  
          int 
           b,  
          int 
           c)    
         
          {    
         
          int 
           mi;    
         
          mi = a;    
         
          if 
           (b < mi) {    
         
          mi = b;    
         
          }    
         
          if 
           (c < mi) {    
         
          mi = c;    
         
          }    
         
          return 
           mi;    
         
          }    
         
          //**************************************************    
         
          // Get a pointer to the specified cell of the matrix    
         
          //**************************************************     
         
          int 
           *Distance::GetCellPointer ( 
          int 
           *pOrigin,  
          int 
           col,  
          int 
           row,  
          int 
           nCols)    
         
          {    
         
          return 
           pOrigin + col + (row * (nCols +  
          1 
          ));    
         
          }    
         
          //*****************************************************    
         
          // Get the contents of the specified cell in the matrix     
         
          //*****************************************************    
         
          int 
           Distance::GetAt ( 
          int 
           *pOrigin,  
          int 
           col,  
          int 
           row,  
          int 
           nCols)    
         
          {    
         
          int 
           *pCell;    
         
          pCell = GetCellPointer (pOrigin, col, row, nCols);    
         
          return 
           *pCell;    
         
          }    
         
          //*******************************************************    
         
          // Fill the specified cell in the matrix with the value x    
         
          //*******************************************************    
         
          void 
           Distance::PutAt ( 
          int 
           *pOrigin,  
          int 
           col,  
          int 
           row,  
          int 
           nCols,  
          int 
           x)    
         
          {    
         
          int 
           *pCell;    
         
          pCell = GetCellPointer (pOrigin, col, row, nCols);    
         
          *pCell = x;    
         
          }    
         
          //*****************************    
         
          // Compute Levenshtein distance    
         
          //*****************************    
         
          int 
           Distance::LD ( 
          char 
           const 
           *s,  
          char 
           const 
           *t)    
         
          {    
         
          int 
           *d;  
          // pointer to matrix    
         
          int 
           n;  
          // length of s    
         
          int 
           m;  
          // length of t    
         
          int 
           i;  
          // iterates through s    
         
          int 
           j;  
          // iterates through t    
         
          char 
           s_i;  
          // ith character of s    
         
          char 
           t_j;  
          // jth character of t    
         
          int 
           cost;  
          // cost    
         
          int 
           result;  
          // result    
         
          int 
           cell;  
          // contents of target cell    
         
          int 
           above;  
          // contents of cell immediately above    
         
          int 
           left;  
          // contents of cell immediately to left    
         
          int 
           diag;  
          // contents of cell immediately above and to left    
         
          int 
           sz;  
          // number of cells in matrix    
         
          // Step 1     
         
          n = strlen (s);    
         
          m = strlen (t);    
         
          if 
           (n ==  
          0 
          ) {    
         
          return 
           m;    
         
          }    
         
          if 
           (m ==  
          0 
          ) {    
         
          return 
           n;    
         
          }    
         
          sz = (n+ 
          1 
          ) * (m+ 
          1 
          ) * sizeof ( 
          int 
          );    
         
          d = ( 
          int 
           *) malloc (sz);    
         
          // Step 2    
         
          for 
           (i =  
          0 
          ; i <= n; i++) {    
         
          PutAt (d, i,  
          0 
          , n, i);    
         
          }    
         
          for 
           (j =  
          0 
          ; j <= m; j++) {    
         
          PutAt (d,  
          0 
          , j, n, j);    
         
          }    
         
          // Step 3    
         
          for 
           (i =  
          1 
          ; i <= n; i++) {    
         
          s_i = s[i- 
          1 
          ];    
         
          // Step 4    
         
          for 
           (j =  
          1 
          ; j <= m; j++) {    
         
          t_j = t[j- 
          1 
          ];    
         
          // Step 5    
         
          if 
           (s_i == t_j) {    
         
          cost =  
          0 
          ;    
         
          }    
         
          else 
           {    
         
          cost =  
          1 
          ;    
         
          }    
         
          // Step 6     
         
          above = GetAt (d,i- 
          1 
          ,j, n);    
         
          left = GetAt (d,i, j- 
          1 
          , n);    
         
          diag = GetAt (d, i- 
          1 
          ,j- 
          1 
          , n);    
         
          cell = Minimum (above +  
          1 
          , left +  
          1 
          , diag + cost);    
         
          PutAt (d, i, j, n, cell);    
         
          }    
         
          }    
         
          // Step 7    
         
          result = GetAt (d, n, m, n);    
         
          free (d);    
         
          return 
           result;    
         
          }   </malloc.h></string.h>

Visual Basic

 
     ? 
    
          '*******************************    
         
          '*** Get minimum of three values    
         
          '*******************************    
         
          Private Function Minimum(ByVal a As Integer, _    
         
          ByVal b As Integer, _    
         
          ByVal c As Integer) As Integer    
         
          Dim mi As Integer    
         
          mi = a    
         
          If b < mi Then    
         
          mi = b    
         
          End If    
         
          If c < mi Then    
         
          mi = c    
         
          End If    
         
          Minimum = mi    
         
          End Function    
         
          '********************************    
         
          '*** Compute Levenshtein Distance    
         
          '********************************    
         
          Public Function LD(ByVal s As String, ByVal t As String) As Integer    
         
          Dim d() As Integer ' matrix    
         
          Dim m As Integer ' length of t    
         
          Dim n As Integer ' length of s    
         
          Dim i As Integer ' iterates through s    
         
          Dim j As Integer ' iterates through t    
         
          Dim s_i As String ' ith character of s    
         
          Dim t_j As String ' jth character of t    
         
          Dim cost As Integer ' cost    
         
          ' Step  
          1  
         
          n = Len(s)    
         
          m = Len(t)    
         
          If n =  
          0 
           Then    
         
          LD = m    
         
          Exit Function    
         
          End If     
         
          If m =  
          0 
           Then    
         
          LD = n    
         
          Exit Function    
         
          End If     
         
          ReDim d( 
          0 
           To n,  
          0 
           To m) As Integer    
         
          ' Step  
          2  
         
          For i =  
          0 
           To n    
         
          d(i,  
          0 
          ) = i    
         
          Next i    
         
          For j =  
          0 
           To m    
         
          d( 
          0 
          , j) = j    
         
          Next j    
         
          ' Step  
          3  
         
          For i =  
          1 
           To n    
         
          s_i = Mid$(s, i,  
          1 
          )    
         
          ' Step  
          4  
         
          For j =  
          1 
           To m    
         
          t_j = Mid$(t, j,  
          1 
          )    
         
          ' Step  
          5  
         
          If s_i = t_j Then    
         
          cost =  
          0  
         
          Else    
         
          cost =  
          1  
         
          End If    
         
          ' Step  
          6  
         
          d(i, j) = Minimum(d(i -  
          1 
          , j) +  
          1 
          , d(i, j -  
          1 
          ) +  
          1 
          , d(i -  
          1 
          , j -  
          1 
          ) + cost)    
         
          Next j    
         
          Next i    
         
          ' Step  
          7  
         
          LD = d(n, m)    
         
          Erase d    
         
          End Function

Python代码

 
     ? 
    
          #!/user/bin/env python    
         
          # -*- coding: utf- 
          8 
           -*-    
         
          class 
           arithmetic():    
         
          def __init__(self):    
         
          pass   
         
          '' 
          '' 
          ' 【编辑距离算法】 【levenshtein distance】 【字符串相似度算法】 ' 
          ''  
         
          def levenshtein(self,first,second):    
         
          if 
           len(first) > len(second):    
         
          first,second = second,first    
         
          if 
           len(first) ==  
          0 
          :    
         
          return 
           len(second)    
         
          if 
           len(second) ==  
          0 
          :    
         
          return 
           len(first)    
         
          first_length = len(first) +  
          1  
         
          second_length = len(second) +  
          1  
         
          distance_matrix = [range(second_length)  
          for 
           x in range(first_length)]     
         
          #print distance_matrix    
         
          for 
           i in range( 
          1 
          ,first_length):    
         
          for 
           j in range( 
          1 
          ,second_length):    
         
          deletion = distance_matrix[i- 
          1 
          ][j] +  
          1  
         
          insertion = distance_matrix[i][j- 
          1 
          ] +  
          1  
         
          substitution = distance_matrix[i- 
          1 
          ][j- 
          1 
          ]    
         
          if 
           first[i- 
          1 
          ] != second[j- 
          1 
          ]:    
         
          substitution +=  
          1  
         
          distance_matrix[i][j] = min(insertion,deletion,substitution)    
         
          print distance_matrix    
         
          return 
           distance_matrix[first_length- 
          1 
          ][second_length- 
          1 
          ]    
         
          if 
           __name__ ==  
          "__main__" 
          :    
         
          arith = arithmetic()    
         
          print arith.levenshtein( 
          'GUMBOsdafsadfdsafsafsadfasfadsfasdfasdfs' 
          , 
          'GAMBOL00000000000dfasfasfdafsafasfasdfdsa' 
         
          http://www.2cto.com/kf/201407/314271.html