Smith Water算法实现字符串相似度匹配

53 阅读 0 评论 35 点赞

我是靠谱客的博主沉静黄蜂，最近开发中收集的这篇文章主要介绍Smith Water算法实现字符串相似度匹配，觉得挺不错的，现在分享给大家，希望可以做个参考。

概述

Smith Waterman 字符串相似度度量算法

smith waterman 算法最初用于基因序列匹配中，用于检测基因序列之间的相似性

序：最近在做数据清洗，需要用到去重处理。想到Smith Waterman可以用于序列对的匹配，并且能处理漏写，简写的问题，所以将问题进行整理，以供参考。

基本思想

创建得分矩阵H并初始化其首行和首列
根据匹配结果填写得分矩阵
从矩阵尾部往前进行回溯，找出得分最高的一条回溯路线，并将得分返回

伪代码

Simth_Waterman

Input: str1, str2
return: the score of match and the comparation path
Smith_Waterman(str1, str2){
    # initialize matrix H
    for i in len(str1)
        H[i, 0] = 0
    end for 
    for j in len(str2)
        H[0, j] = 0
    end for
    # fill the H according to the punish matrix
    for i in len(str1)
        for j in len(str2)
            H[i, j] = max{H[i-1, j-1] + S(str[i], str2[j]), max{H[i-k, j] - W[k], k>=1}, max{H[i, j-l - W[l}, l>=1}
        end for
    end for
    score = H[len(str1), len(str2)]
    path = trace_back(H)
    return score, path
}

Trace Back

Input: H
return score
flaot Trace_Back(H){
    w, h = H.shape
    while(H[i, j] != 0):
        if H[i-1, j] + W[1] == H[i, j]:
            i = i - 1
            record path
        else if H[i, j-1] + W[i] == H[i, j]
            j = j - 1
            record path
        else
            i = i - 1
            j = j - 1
            record path
        end if
     end while
     return path, score
}

python 实现

导入相关包

import numpy as np

Smith_Waterman 实现

def Smith_Waterman(str1, str2, s_score, m_score):
	len1, len2 = len(str1), len(str2)
	matrix = np.zeros([len1 + 1, len2 + 1])
	for i in range(len1):
		matrix[i, 0] = 0
	for i in range(len2):
		matrix[0, i] = 0
	Space = 0
	for i in range(1, len1 + 1):
		for j in range(1, len2 + 1):
			Mkj = matrix[i-1, j] - Space
			Mik = matrix[i, j-1] - Space
			Mij = matrix[i-1, j-1] + 1 if str1[i-1] == str2[j-1] else matrix[i-1, j-1] -1
			matrix[i, j] = max(Mij, Mkj, Mik, 0)
	match_str1, match_str2, match_rate = Trace_back(str1, str2, matrix, Space)
	# print(match_str1)
	# print(match_str2)
	# print(match_rate)
	return match_str1, match_str2, match_rate

Trace Back 实现

def Trace_back(str1, str2, M, Space):
	#find max
	x, y = np.where(M == np.max(M))
	x, y = x[0], y[0]
	# print(M)
	# print(x, y)
	match_str1, match_str2 = '', ''
	match_count = 0
	score = 0
	count = 0
	while M[x, y] != 0:
		count += 1
		# print(x, y)
		if M[x - 1, y] - Space == M[x, y]:
			x = x -1
			match_str1, match_str2 = str1[x] + match_str1, '_' + match_str2
			score += 0.5
		elif M[x, y - 1] - Space == M[x, y]:
			y = y - 1
			match_str1, match_str2 = '_' + match_str1, str2[y] + match_str2
			score += 0.5
		else:
			x, y = x-1, y-1
			match_str1, match_str2 = str1[x] + match_str1, str2[y] + match_str2
			match_count += 1
			score += 1
		# match_rate = match_count/min(len(str1), len(str2))
	return match_str1, match_str2, score/count

测试代码

if __name__ == '__main__':
	str1 = 'asdfa'
	str2 = 'asda'
	print(Smith_Waterman(str1, str2, 0.5, 1))

('asdfa', 'asd_a', 0.9)

如有任何疑问请留言，多多指教

最后

以上就是沉静黄蜂为你收集整理的Smith Water算法实现字符串相似度匹配的全部内容，希望文章能够帮你解决Smith Water算法实现字符串相似度匹配所遇到的程序开发问题。

如果觉得靠谱客网站的内容还不错，欢迎将靠谱客网站推荐给程序员好友。

本图文内容来源于网友提供，作为学习参考使用，或来自网络收集整理，版权属于原作者所有。

本文分类：数据清洗
浏览次数：53 次浏览
发布日期：2023-10-08 17:05:52
本文链接：https://www.kaopuke.com/article/k-p-k_14_uzokfy_12__23_k0.html

Smith Water算法实现字符串相似度匹配

概述

Smith Waterman 字符串相似度度量算法

相关定义

基本思想

伪代码

Simth_Waterman

Trace Back

python 实现

导入相关包

Smith_Waterman 实现

Trace Back 实现

测试代码

最后

评论列表共有 0 条评论

发表评论取消回复

Smith Water算法 实现字符串相似度匹配

概述

Smith Waterman 字符串相似度度量算法

相关定义

基本思想

伪代码

Simth_Waterman

Trace Back

python 实现

导入相关包

Smith_Waterman 实现

Trace Back 实现

测试代码

最后

相关文章

评论列表共有 0 条评论

发表评论 取消回复

Smith Water算法实现字符串相似度匹配

发表评论取消回复