String Similarity Algorithms and Their Applications


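I am 坚定发夹, a blogger at 靠谱客. This article introduces string similarity algorithms and their applications; I am sharing it here in the hope that it serves as a useful reference.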

Introduction to Levenshtein Distance

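String similarity is computed with a classic dynamic-programming (DP) algorithm. Two articles explain it well and in considerable detail, and both are worth studying:

Article 1: http://blog.csdn.net/orbit/article/details/6649322 (C++)

Article 2: http://www.cnblogs.com/ivanyb/archive/2011/11/25/2263356.html (C#)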
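For orientation, the standard Levenshtein recurrence can be written out as follows. The notation is an assumption made here for illustration: s and t are the two strings, M and N their lengths, and D_{i,j} is the distance between the first i characters of t and the first j characters of s.

```latex
\begin{aligned}
D_{0,j} &= j, \qquad 0 \le j \le M,\\
D_{i,0} &= i, \qquad 0 \le i \le N,\\
D_{i,j} &= \min\bigl(
    D_{i-1,j-1} + c_{i,j},\;
    D_{i-1,j} + 1,\;
    D_{i,j-1} + 1
\bigr),
\qquad
c_{i,j} =
\begin{cases}
0 & \text{if } s_j = t_i,\\
1 & \text{otherwise.}
\end{cases}
\end{aligned}
```

The final distance is D_{N,M}. The three terms inside the minimum correspond to substituting (or keeping) a character, deleting one, and inserting one.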

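The Levenshtein distance implementation in Article 1 is already fairly concise, but it allocates the distance array on the stack, so a very long string can exhaust the stack. The version here allocates the array on the heap instead.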

// Implementation of the Levenshtein distance.
// author mingspy 5/30 2012
#include <algorithm>  // std::min
#include <cassert>    // assert
#include <climits>    // INT_MAX
#include <cstring>    // std::memset
#include <iostream>

// Access element (i, j) of a row-major matrix stored in a flat array
// that has `col` columns.
#define Element(arr, col, i, j) (arr)[(i) * (col) + (j)]

// Return the smallest of three values.
template<typename T>
inline const T& min(const T& i, const T& j, const T& k)
{
    return std::min(std::min(i, j), k);
}

int LevenshteinDistance(const char * str1, int len1, const char * str2, int len2)
{
    if(!str1 || !str2 || len1 < 0 || len2 < 0)
    {
        assert(false);
        return INT_MAX;
    }

    // Allocate a (len2+1) x (len1+1) matrix on the heap to hold the results.
    const int cols = len1 + 1;
    const int rows = len2 + 1;
    int * distance = new int[rows * cols];
    std::memset(distance, 0, rows * cols * sizeof(int));

    // Initialize the first row: turning an empty prefix into i characters of str1.
    for(int i = 0; i < cols; i++)
    {
        distance[i] = i;
    }

    // Initialize the first column: turning an empty prefix into i characters of str2.
    for(int i = 0; i < rows; i++)
    {
        distance[cols * i] = i;
    }

    // Use dynamic programming to calculate
    // the distance between str1 and str2.
    int dist = 0;
    for(int i = 1; i < rows; i++)
    {
        for(int j = 1; j < cols; j++)
        {
            // Substitution costs 0 when the characters match, 1 otherwise.
            dist = (str1[j - 1] == str2[i - 1]) ? 0 : 1;

            Element(distance, cols, i, j) = min(
                Element(distance, cols, i - 1, j - 1) + dist,  // substitute or keep
                Element(distance, cols, i - 1, j) + 1,         // skip a character of str2
                Element(distance, cols, i, j - 1) + 1);        // skip a character of str1
        }
    }

#ifdef _DEBUG
    // Print the distance matrix.
    for(int i = 0; i < rows; i++)
    {
        for(int j = 0; j < cols; j++)
        {
            std::cout << Element(distance, cols, i, j) << " ";
        }
        std::cout << std::endl;
    }
#endif

    // The bottom-right cell holds the distance between the full strings.
    dist = Element(distance, cols, len2, len1);
    delete [] distance;
    return dist;
}
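As a variation on the heap allocation above, here is a minimal sketch that stores the matrix in a std::vector, so the memory is released automatically even if an exception is thrown. The names levenshtein_vec and check_levenshtein_vec are hypothetical helpers introduced for illustration; they are not part of the original listing or of either linked article.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: same full-matrix recurrence as the listing above,
// but the matrix is owned by a std::vector instead of new[]/delete[].
int levenshtein_vec(const std::string& s1, const std::string& s2)
{
    const std::size_t cols = s1.size() + 1;
    const std::size_t rows = s2.size() + 1;
    std::vector<int> d(rows * cols, 0);

    // First row and first column: distance from the empty prefix.
    for (std::size_t j = 0; j < cols; ++j) d[j] = static_cast<int>(j);
    for (std::size_t i = 0; i < rows; ++i) d[i * cols] = static_cast<int>(i);

    for (std::size_t i = 1; i < rows; ++i)
    {
        for (std::size_t j = 1; j < cols; ++j)
        {
            const int cost = (s1[j - 1] == s2[i - 1]) ? 0 : 1;
            d[i * cols + j] = std::min({
                d[(i - 1) * cols + (j - 1)] + cost,  // substitute or keep
                d[(i - 1) * cols + j] + 1,           // skip a character of s2
                d[i * cols + (j - 1)] + 1});         // skip a character of s1
        }
    }
    return d[rows * cols - 1];  // bottom-right cell
}

// Small self-check on a well-known pair of words.
inline void check_levenshtein_vec()
{
    assert(levenshtein_vec("kitten", "sitting") == 3);
    assert(levenshtein_vec("", "abc") == 3);
}
```

Calling check_levenshtein_vec() from main is one quick way to confirm the function behaves as expected; the std::min overload that takes a braced list requires C++11 or later.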

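The algorithm uses (M+1)*(N+1) extra memory, where M and N are the lengths of the two strings, which feels wasteful. Reducing the memory footprint would make it close to ideal. I found an article on CodeProject that needs only M+N extra space: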

http://www.codeproject.com/Articles/8265/More-than-strcmp-similarity-in-strings
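To illustrate how the memory can be reduced, here is a sketch of the common two-row technique. It is not the code from the CodeProject article, and its footprint differs from the M+N figure quoted above: it keeps only the previous and current rows, about 2*(min(M,N)+1) integers in total. The function name levenshtein_two_rows is a hypothetical one chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: Levenshtein distance using two rows instead of
// the full matrix. Each cell depends only on the row above and the cell
// to its left, so older rows can be discarded.
int levenshtein_two_rows(std::string a, std::string b)
{
    // The distance is symmetric, so put the shorter string along the row
    // to keep the two buffers as small as possible.
    if (a.size() < b.size()) std::swap(a, b);
    assert(b.size() <= a.size());

    std::vector<int> prev(b.size() + 1);
    std::vector<int> curr(b.size() + 1);

    // Row 0: distance from the empty prefix of a to each prefix of b.
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({
                prev[j - 1] + cost,  // substitute or keep
                prev[j] + 1,         // skip a character of a
                curr[j - 1] + 1});   // skip a character of b
        }
        std::swap(prev, curr);  // the row just computed becomes "previous"
    }
    return prev[b.size()];
}
```

The trade-off is that only the distance survives: without the full matrix, the sequence of edits cannot be traced back afterwards.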