Overview
1. Data Processing
Raw text processing
table2entity2text.py
Example: one record from the raw data
Key-value pairs:
name_1:walter name_2:extra image:<none> image_size:<none> caption:<none> birth_name:<none> birth_date_1:1954 birth_place:<none> death_date:<none> death_place:<none> death_cause:<none> resting_place:<none> resting_place_coordinates:<none> residence:<none> nationality_1:german ethnicity:<none> citizenship:<none> other_names:<none> known_for:<none> education:<none> alma_mater:<none> employer:<none> occupation_1:aircraft occupation_2:designer occupation_3:and occupation_4:manufacturer home_town:<none> title:<none> salary:<none> networth:<none> height:<none> weight:<none> term:<none> predecessor:<none> successor:<none> party:<none> boards:<none> religion:<none> spouse:<none> partner:<none> children:<none> parents:<none> relations:<none> signature:<none> website:<none> footnotes:<none> article_title_1:walter article_title_2:extra
Description:
walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .
After preprocessing, the record becomes the following, where field names the attribute each source word comes from, lpos is the word's position within that attribute counting from the left, and rpos its position counting from the right:
{"source": "walter extra 1954 german aircraft designer and manufacturer walter extra", "target": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1"}
A key-fact label is then added for each source word: 1 if the word appears in the target, 0 otherwise:
{"source": "walter extra 1954 german aircraft designer and manufacturer walter extra", "target": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1", "label": "1 1 0 1 1 1 1 1 1 1", "pivot": "walter extra german aircraft designer and manufacturer walter extra"}
A separate pivot file is generated at the same time.
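The labeling rule is simple enough to sketch. Below is a minimal illustration assuming plain whitespace tokenization; the helper name label_key_facts is ours, not the repo's:

def label_key_facts(source: str, target: str):
    # A source word is a key fact (label '1') iff it also occurs in the target;
    # the pivot is the subsequence of source words labeled '1'.
    src_words = source.split()
    tgt_words = set(target.split())
    labels = ['1' if w in tgt_words else '0' for w in src_words]
    pivot = ' '.join(w for w, l in zip(src_words, labels) if l == '1')
    return ' '.join(labels), pivot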
extract_entity.py
Tags the text with the Stanford CoreNLP toolkit, which assigns a POS tag to every word. We keep the target words whose POS tags belong to the set {NN, NNS, NNP, NNPS, JJ, JJR, JJS, CD, FW}, i.e. {noun, singular or mass; noun, plural; proper noun, singular; proper noun, plural; adjective; adjective, comparative; adjective, superlative; cardinal number; foreign word}, and drop the remaining words. This builds the pseudo-parallel data, giving the entity file:
walter extra german award-winning aerobatic pilot chief aircraft designer founder extra flugzeugbau extra aircraft construction manufacturer aerobatic aircraft
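As a rough illustration of this filtering step (the repo calls Stanford CoreNLP; the sketch below substitutes nltk.pos_tag, so treat it as an approximation rather than the actual pipeline):

import nltk  # requires nltk.download('averaged_perceptron_tagger')

KEEP_TAGS = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'CD', 'FW'}

def extract_entity(target: str) -> str:
    # Keep only nouns, adjectives, cardinal numbers and foreign words.
    tagged = nltk.pos_tag(target.split())
    return ' '.join(word for word, tag in tagged if tag in KEEP_TAGS)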
extract_pivot.py
Writes source (value), target (text), field, lpos, rpos, entity, label, and pivot together into train.pivot.jsonl:
{"value": "walter extra 1954 german aircraft designer and manufacturer walter extra", "text": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1", "entity": "walter extra german award-winning aerobatic pilot chief aircraft designer founder extra flugzeugbau extra aircraft construction manufacturer aerobatic aircraft", "label": "1 1 0 1 1 1 1 1 1 1", "pivot": "walter extra german aircraft designer and manufacturer walter extra"}
process_pivot.py
Generates two files:
table2pivot:
Randomly selects 10,000 records from the dataset and writes their value, label, field, lpos, and rpos into train.t2p.jsonl:
{"value": "walter extra 1954 german aircraft designer and manufacturer walter extra", "label": "1 1 0 1 1 1 1 1 1 1", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1"}
pivot2text:
for i, d in enumerate(ori_datas):
    if index is None or i in index:
        data = {'source': d['pivot'], 'target': d['text']}
    else:
        data = {'source': d['entity'], 'target': d['text']}
    datas.append(data)
{"source": "walter extra german award-winning aerobatic pilot chief aircraft designer founder extra flugzeugbau extra aircraft construction manufacturer aerobatic aircraft", "target": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft ."}
construct_corpus.py
Here index is the random-index file generated earlier, with size = 10,000.
train_parallel_dataset:
for d in ori_datas:
    data = {'source': d['value'], 'target': d['text'],
            'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']}
    datas.append(data)
Generates train.parallel.{10000}.jsonl, len = 10,000.
train_p2t_dataset:
def get_filter_data(data: Dict) -> Dict:
    label = data['label'].split(' ')
    field = data['field'].split(' ')
    lpos = data['lpos'].split(' ')
    rpos = data['rpos'].split(' ')
    _field = [f for f, l in zip(field, label) if l == '1']
    _lpos = [f for f, l in zip(lpos, label) if l == '1']
    _rpos = [f for f, l in zip(rpos, label) if l == '1']
    return {'source': data['pivot'], 'target': data['text'],
            'field': ' '.join(_field), 'lpos': ' '.join(_lpos), 'rpos': ' '.join(_rpos)}
def train_p2t_dataset(r_path: str, w_path: str, index: List[int]) -> List[Dict]:
    '''value, text, field, lpos, rpos, pivot, entity'''
    ori_datas = loads(open(r_path))[1:]
    statistic = {'length': len(ori_datas)}
    datas = [statistic]
    index = set(index)
    for i, d in enumerate(ori_datas):
        if i in index:
            # datas.append({'source': d['pivot'], 'target': d['text']})
            datas.append(get_filter_data(d))
        else:
            datas.append({'source': d['entity'], 'target': d['text']})
    dumps(datas, open(w_path, 'w'))
Generates train.p2t.{10000}.jsonl, len = 580,000.
train_t2p_dataset:
for d in ori_datas:
    data = {'value': d['value'], 'label': d['label'],
            'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']}
    datas.append(data)
Generates train.t2p.{10000}.jsonl, len = 10,000.
train_aug_dataset:
for i, d in enumerate(ori_datas):
    if i in index:
        datas.append({'source': d['value'], 'target': d['text'],
                      'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']})
    else:
        datas.append({'source': d['entity'], 'target': d['text']})
Generates train.aug.{10000}.jsonl, len = 580,000 (the loop covers every record, so the length matches train.p2t above).
train_semi_dataset:
for i, d in enumerate(ori_datas):
    if i in index:
        datas.append({'source': d['value'], 'target': d['text'],
                      'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']})
    else:
        datas.append({'source': d['text'], 'target': d['text']})
Generates train.semi.{10000}.jsonl, len = 580,000.
train_pretrain_dataset:
for i, d in enumerate(ori_datas):
    if i in index:
        datas.append({'source': d['value'], 'target': d['text'],
                      'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']})
    else:
        datas.append({'source': '', 'target': d['text']})
Generates train.pretrain.{10000}.jsonl, len = 580,000.
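Taken together, the variants differ only in what the unlabeled records contribute: parallel keeps only the 10K labeled table-to-text pairs; p2t turns unlabeled records into entity-to-text pairs (and labeled ones into filtered pivot-to-text pairs); aug mixes labeled value-to-text pairs with entity-to-text pairs; semi replaces the unlabeled side with text-to-text autoencoding pairs; and pretrain reduces it to empty-source-to-text pairs, i.e. pure language-model targets.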
Converting the data into the format expected by AllenNLP
dataset.pivot.py
class Table2PivotDataset:
Converts the input data into Instances. Each Instance has a TextField containing a sentence and a SequenceLabelField containing the corresponding labels.
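A minimal sketch of how such an Instance could be built with the pre-2.0 AllenNLP API (the field names 'source' and 'label' are assumptions for illustration):

from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

def make_instance(value: str, label: str) -> Instance:
    # One token per source word; one '0'/'1' tag per token.
    text_field = TextField([Token(w) for w in value.split()],
                           {'tokens': SingleIdTokenIndexer()})
    label_field = SequenceLabelField(label.split(' '), text_field)
    return Instance({'source': text_field, 'label': label_field})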
class Pivot2TextDataset:
Converts the generated key facts into Instances and injects noise into the data:
def add_noise(self, words):
    # words = self.word_shuffle(words)
    # randomly drop some words
    words = self.word_drop(words)
    # randomly replace some words with '@@UNKNOWN@@'
    words = self.word_blank(words)
    # insert a proportion of '@@UNKNOWN@@' tokens at random positions
    words = self.word_append(words)
    return words
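As an example of what one of these noise operations might look like, here is a hypothetical word_drop (the signature and rate handling are assumptions, not the repo's code):

import random

def word_drop(words, drop_rate=0.1):
    # Independently drop each word with probability drop_rate,
    # but never return an empty sequence.
    kept = [w for w in words if random.random() >= drop_rate]
    return kept if kept else [random.choice(words)]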
class Table2PivotCorpus:
Loads the data:
train_path = os.path.join(data_path, 'train.t2p.{0}.jsonl'.format(scale))
dev_path = os.path.join(data_path, 'valid.t2p.jsonl')
test_path = os.path.join(data_path, 'test.t2p.jsonl')
vocab_dir = os.path.join(data_path, 'dicts-{0}-t2p-{1}'.format(vocab_size, scale))
Converts it to Instances:
self.train_dataset = Table2PivotDataset(path=train_path, max_len=max_len)
self.test_dataset = Table2PivotDataset(path=test_path, max_len=max_len)
self.dev_dataset = Table2PivotDataset(path=dev_path, max_len=max_len)
Wraps the datasets in PyTorch DataLoaders:
self.train_loader = torch.utils.data.DataLoader(dataset=self.train_dataset,
                                                batch_size=batch_size,
                                                collate_fn=collate_fn,
                                                shuffle=True)
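collate_fn is not shown here; a plausible sketch, padding with 0 so that (src > 0) can serve as the sequence mask later, and ignoring the key/position features for brevity:

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # batch: list of (src_ids, label_ids) pairs of per-example length.
    srcs, labels = zip(*batch)
    src = pad_sequence([torch.tensor(s) for s in srcs], batch_first=True, padding_value=0)
    tgt = pad_sequence([torch.tensor(l) for l in labels], batch_first=True, padding_value=0)
    return src, tgt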
Saves the predictions as predict-{10000}.txt:
model_words, _fields, _lposs, _rposs = [], [], [], []
for ids, source, field, lpos, rpos in zip(model_ids, sources, fields, lposs, rposs):
    words = [s for id, s in zip(ids, source) if id > 0]
    _field = [s for id, s in zip(ids, field) if id > 0]
    _lpos = [s for id, s in zip(ids, lpos) if id > 0]
    _rpos = [s for id, s in zip(ids, rpos) if id > 0]
    model_words.append(' '.join(words))
    _fields.append(' '.join(_field))
    _lposs.append(' '.join(_lpos))
    _rposs.append(' '.join(_rpos))
with open(os.path.join(self.data_path, 'predict-{0}.txt'.format(self.scale)), 'w') as f:
    print('\n'.join(model_words), file=f)
Model evaluation: def evaluate
class Pivot2TextCorpus:
Loads the data:
train_path = os.path.join(data_path, 'train.p2t.{0}.jsonl'.format(scale))
dev_path = os.path.join(data_path, 'valid.p2t.jsonl')
test_path = os.path.join(data_path, 'test.predict.{0}.jsonl'.format(scale))
Converts the generated key facts into Instances; noise is added only to the training set:
self.train_dataset = Pivot2TextDataset(path=train_path, src_max_len=src_max_len, tgt_max_len=tgt_max_len,
                                       share=share, append_rate=append_rate, drop_rate=drop_rate,
                                       blank_rate=blank_rate, use_feature=use_feature)
self.test_dataset = Pivot2TextDataset(path=test_path, src_max_len=src_max_len, tgt_max_len=tgt_max_len,
                                      share=share, use_feature=use_feature)
self.dev_dataset = Pivot2TextDataset(path=dev_path, src_max_len=src_max_len, tgt_max_len=tgt_max_len,
                                     share=share, use_feature=use_feature)
Evaluation: def evaluate
2. Models
Model used by the table2pivot module:
sequence_labeling.py
A bidirectional LSTM encoder followed by a linear classifier as the decoder. The word embeddings, field (key) embeddings, and position embeddings are concatenated to form the model input x:
self.src_embedding = nn.Embedding(self.vocab_size, emb_size)
self.key_embedding = nn.Embedding(vocab.get_vocab_size('keys'), key_emb_size)
self.lpos_embedding = nn.Embedding(vocab.get_vocab_size('lpos'), pos_emb_size)
self.rpos_embedding = nn.Embedding(vocab.get_vocab_size('rpos'), pos_emb_size)
self.encoder = rnn.rnn_encoder(emb_size + key_emb_size + pos_emb_size * 2,
                               hidden_size, enc_layers, dropout, bidirectional)
self.decoder = nn.Linear(hidden_size, self.label_size)
self.accuracy = SequenceAccuracy()
src_embs = torch.cat([self.src_embedding(src), self.key_embedding(keys),
                      self.lpos_embedding(lpos), self.rpos_embedding(rpos)], dim=-1)
src_embs = pack(src_embs, lengths, batch_first=True)
encode_outputs = self.encoder(src_embs)
out_logits = self.decoder(encode_outputs['hidden_outputs'])
seq_mask = (src > 0).float()
self.accuracy(predictions=out_logits, gold_labels=tgt, mask=seq_mask)
loss = sequence_cross_entropy_with_logits(logits=out_logits, targets=tgt,
                                          weights=seq_mask, average='token')
# outputs = out_logits.max(-1)[1]  # greedy
Prediction:
out_logits = self.decoder(encode_outputs['hidden_outputs'])
outputs = out_logits.max(-1)[1]
outputs = outputs.index_select(dim=0, index=rev_indices)
src = src.index_select(dim=0, index=rev_indices)
seq_mask = src > 0
correct_tokens = (tgt.eq(outputs) * seq_mask).sum()
total_tokens = seq_mask.sum()
h_tokens = (tgt.eq(outputs) * seq_mask * tgt.eq(0)).sum()
r_total_tokens = (tgt.eq(0) * seq_mask).sum()
p_total_tokens = (outputs.eq(0) * seq_mask).sum()
output_ids = 1 - outputs
return {'correct': correct_tokens.item(), 'total': total_tokens.item(),
        'hit': h_tokens.item(), 'r_total': r_total_tokens.item(),
        'p_total': p_total_tokens.item(), 'output_ids': output_ids.tolist()}
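From these counts the evaluation can derive the metrics reported later; note that internally index 0 apparently denotes the key-fact class, and output_ids = 1 - outputs maps predictions back to the data's 0/1 labels. A sketch of the assumed aggregation:

accuracy = 100.0 * correct / total   # token-level labeling accuracy
precision = 100.0 * hit / p_total    # predicted key-fact tokens that are correct
recall = 100.0 * hit / r_total       # gold key-fact tokens that are recovered
f1 = 2 * precision * recall / (precision + recall)

This is consistent with the numbers reported below: 2 * 91.68 * 87.06 / (91.68 + 87.06) ≈ 89.31.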
Model used by the pivot2text module:
transformer.py
A standard Transformer architecture:
self.encoder = TransformerEncoder(n_hidden, ff_size, n_head, dropout, n_block)
self.decoder = TransformerDecoder(n_hidden, ff_size, n_head, dropout, n_block)
self.generator = nn.Linear(n_hidden, self.vocab_size)
self.accuracy = SequenceAccuracy()

def encode(self, src):
    embs = self.src_embedding(src)
    out = self.encoder(embs)
    return out

def decode(self, tgt, memory, step_wise=False):
    embs = self.tgt_embedding(tgt)
    outputs = self.decoder(embs, memory, step_wise)
    out, attns = outputs['outs'], outputs['attns']
    out = self.generator(out)
    return out, attns

def forward(self, src: Dict[str, torch.Tensor], tgt: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    src, tgt = src['tokens'], tgt['tokens']
    encode_outputs = self.encode(src)
    out_logits, _ = self.decode(tgt[:, :-1], encode_outputs)
    targets = tgt[:, 1:].contiguous()
    seq_mask = (targets > 0).float()
    self.accuracy(predictions=out_logits, gold_labels=targets, mask=seq_mask)
    loss = sequence_cross_entropy_with_logits(logits=out_logits, targets=targets,
                                              weights=seq_mask, average='token',
                                              label_smoothing=self.label_smoothing)
    outputs = {'loss': loss}
    return outputs
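Note the teacher forcing in forward: the decoder reads the target shifted right (tgt[:, :-1]) while the loss is computed against tgt[:, 1:], with padding masked via (targets > 0) and label smoothing applied.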
Greedy search is used when decoding at prediction time:
def greedy_search(self, src: Dict[str, torch.Tensor], max_decoding_step: int) -> Dict[str, torch.Tensor]:
    src = src['tokens']
    ys = torch.ones(src.size(0), 1).long().fill_(self._bos).cuda()
    # self.decoder.init_cache()
    # output_ids, attns = [], []
    encode_outputs = self.encode(src)
    for i in range(max_decoding_step):
        outputs, attn = self.decode(ys, encode_outputs, step_wise=False)
        logits = outputs[:, -1]
        # logits = outputs
        next_id = logits.max(1, keepdim=True)[1]
        # ys = next_id
        # output_ids.append(ys)
        # attns.append(attn)
        ys = torch.cat([ys, next_id], dim=1)
    output_ids = ys[:, 1:]
    attns = attn
    # output_ids = torch.cat(output_ids, dim=1)
    # attns = torch.cat(attns, dim=1)
    alignments = attns.max(2)[1]
    outputs = {'output_ids': output_ids.tolist(), 'alignments': alignments.tolist()}
    return outputs
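Since step_wise=False and the cache calls are commented out, every step re-runs the decoder over the whole prefix ys; the commented-out lines sketch the incremental, cached variant. Only the attention of the last step is kept, and its argmax over source positions gives the alignments returned with the output ids.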
Overall pipeline:
table2pivot:
train:
train.t2p.10000.jsonl: 10K labeled examples used to train the model.
The model is then evaluated on the validation set valid.t2p.jsonl (72K examples) to track the best checkpoint and decide whether to stop early.
test:
test.t2p.jsonl (72K examples): loads the best.th checkpoint and generates the key facts test.predict.10000.jsonl (keeping words whose predicted label is > 0).
Accuracy: 85.21618881852217, Precision: 91.67537147343486, Recall: 87.06278910894117, F1: 89.3095632971675
pivot2text:
train:
Trains on train.p2t.10000.jsonl (580K examples).
accuracy: 0.9140, loss: 0.3335 ||: 100%|████| 9105/9105 [30:46<00:00, 5.84it/s]
The model is then evaluated on the validation set valid.p2t.jsonl (72K examples) to track the best checkpoint and decide whether to stop early.
100%|█████████████████████████████████████████| 569/569 [03:16<00:00, 3.06it/s] BLEU = 29.33, 54.4/32.9/24.6/19.3 (BP=0.967, ratio=0.967, hyp_len=1836100, ref_len=1898421)
test:
100%|█████████████████████████████████████████| 569/569 [03:16<00:00, 2.88it/s] BLEU = 29.63, 54.2/32.9/24.6/19.3 (BP=0.978, ratio=0.978, hyp_len=1856611, ref_len=1898421)
Finally, text is generated from test.predict.10000.jsonl and compared against the target.