题目来自*Speech and Language Processing
Daniel-Jurafsky Stanford-University
*
Augment the minimum edit distance algorithm to output an alignment; you will need to store pointers and add a stage to compute the backtrace.
复制代码
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115# To extend the edit distance algorithm to produce an alignment, we can start by # visualizing an alignment as a path through the edit distance matrix. # In the first step, we augment the minimum edit distance algorithm to store backpointers in each cell. # The backpointer from a cell points to the previous cell (or cells) that we came from in entering the current cell. # Some cells have multiple backpointers because the minimum extension could have come from multiple previous cells. # In the second step, we perform a backtrace. In a backtrace, we start from the last cell (at the final row and column), # and follow the pointers back through the dynamic programming matrix. # Each complete path between the final cell and the initial cell is a minimum distance alignment. import numpy as np def edit(str1, str2): # 计算两个单词之间的最小编辑距离,并返回一个从源字符串转换到到目标字符串的操作列表(其中0代表无操作,1代表替换,2代表插入,3代表删除) # 初始化矩阵的行、列长度 len_str1 = len( str1 ) + 1 len_str2 = len( str2 ) + 1 # 初始化矩阵 matrix = np.zeros( shape=(len_str1, len_str2), dtype=int ) # np.zeros 创建指定大小的数组,数组元素以 0 来填充 # numpy 产生三维矩阵,i行j列,每个单元2个元素,这两个元素代表matrix中的一个坐标,用于存储matrix中(i,j)位置的回溯,回溯即当前位置的编辑距离由min(x,y,z)决定 matrix_pointer = np.zeros( shape=(len_str1, len_str2, 3), dtype=int ) print( "编辑距离矩阵和回溯矩阵的性状:", matrix.shape, matrix_pointer.shape ) # 初始化x维度 for i in range( 1, len_str1 ): matrix[i][0] = i matrix_pointer[i][0] = i - 1, 0, 3 # 初始化y维度 for j in range( 1, len_str2 ): matrix[0][j] = j matrix_pointer[0][j] = 0, j - 1, 2 # 动态规划,计算最小编辑距离 for i in range( 1, len_str1 ): for j in range( 1, len_str2 ): # 当前i,j字符是否相同 if str1[i - 1:i] == str2[j - 1:j]: # 字符串截取,左闭右开 flag = 0 else: flag = 2 # substitute操作,权重设置为2,其他版本的算法也将该权重设置为1 # 选择距离 matrix[i][j] = min( matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + flag ) # 如果数值相同,优先级是斜上、左和上 if str1[i - 1:i] == str2[j - 1:j]: # 若两字符串在该位置的字符相同,则直接向左上方回溯(none 0) matrix_pointer[i][j] = i - 1, j - 1, 0 elif matrix[i][j] - 2 == matrix[i - 1][j - 1]: # 若最小距离是由左上+2得来(substitute 1),则向左上方回溯 matrix_pointer[i][j] = i - 1, j - 1, 1 # If two boldfaced cells occur in the same row, there will be an insertion in going from the source to the target; # two boldfaced cells in the same column indicate a deletion. elif matrix[i][j] - 1 == matrix[i][j - 1]: # 若最小距离是由左边+1得来(insert 2),则向左边回溯 matrix_pointer[i][j] = i, j - 1, 2 elif matrix[i][j] - 1 == matrix[i - 1][j]: # 若最小距离是由上方+1得来(delete 3),则向上方回溯 matrix_pointer[i][j] = i - 1, j, 3 i, j, k = matrix_pointer[len_str1 - 1][len_str2 - 1] # i,j代表回溯路径中的坐标,k代表操作 op_list = [k] # 初始化操作列表 print("编辑距离矩阵:") print(matrix) print( "回溯路径的各单元坐标(从右下角单元回溯到左上角单元,其间的这条路径展示了两个字符串的对齐方式):" ) while True: if i != 0 or j != 0: # print( i, j, k, matrix[i][j] ) print( i, j ) i, j, k = matrix_pointer[i][j] op_list.append( k ) # 扩展操作列表 elif i == 0 and j == 0: # print( i, j, k ) break op_list.reverse() # print( matrix, matrix.shape, matrix_pointer ) # 返回最小编辑距离矩阵的右下角元素(最小编辑距离),源字符串对应的操作列表 return matrix[len_str1 - 1][len_str2 - 1], list( enumerate( op_list ) ) def print_alignment(src, tar, op_list): # 根据源字符串和目标字符串以及操作序列输出对齐方式 list_src = list( src ) list_tar = list( tar ) list_ope = [] # I N T E * N T I O N # * E X E C U T I O N 0n 1s 2i 3d # d s s i s for i, j in op_list: if j == 0: list_ope.append( " " ) elif j == 3: list_tar.insert( i, "*" ) list_ope.append( "d" ) # delete elif j == 1: list_ope.append( "s" ) # substitute elif j == 2: list_src.insert( i, "*" ) list_ope.append( "i" ) # insert print( ''.join( list_src ) ) print( ''.join( list_tar ) ) print( ''.join( list_ope ) ) if __name__ == '__main__': src = "intention" tar = "execution" res, op_list = edit( src, tar ) print( "intention和execution的最小编辑距离是:", res ) print( "对源字符串的操作序列:", op_list ) print( "源字符串和目标字符串的对齐方式如下:" ) print_alignment( src, tar, op_list )
复制代码
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104output: 编辑距离矩阵和回溯矩阵的性状: (10, 10) (10, 10, 3) 编辑距离矩阵: [[ 0 1 2 3 4 5 6 7 8 9] [ 1 2 3 4 5 6 7 6 7 8] [ 2 3 4 5 6 7 8 7 8 7] [ 3 4 5 6 7 8 7 8 9 8] [ 4 3 4 5 6 7 8 9 10 9] [ 5 4 5 6 7 8 9 10 11 10] [ 6 5 6 7 8 9 8 9 10 11] [ 7 6 7 8 9 10 9 8 9 10] [ 8 7 8 9 10 11 10 9 8 9] [ 9 8 9 10 11 12 11 10 9 8]] 回溯路径的各单元坐标(从右下角单元回溯到左上角单元,其间的这条路径展示了两个字符串的对齐方式): 8 8 7 7 6 6 5 5 4 4 4 3 3 2 2 1 1 0 intention和execution的最小编辑距离是: 8 对源字符串的操作序列: [(0, 3), (1, 1), (2, 1), (3, 0), (4, 2), (5, 1), (6, 0), (7, 0), (8, 0), (9, 0)] 源字符串和目标字符串的对齐方式如下: inte*ntion *execution dss is
最后
以上就是快乐小懒虫最近收集整理的关于求解最小编辑距离并通过对矩阵回溯展示两个字符串的对齐的全部内容,更多相关求解最小编辑距离并通过对矩阵回溯展示两个字符串内容请搜索靠谱客的其他文章。
本图文内容来源于网友提供,作为学习参考使用,或来自网络收集整理,版权属于原作者所有。
发表评论 取消回复