This post walks through the code for the paper "Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation"; I hope it can serve as a useful reference.

Overview

1. Data processing

Raw text processing

table2entity2text.py

Example: one record from the raw data

Key-value pairs:

name_1:walter   name_2:extra    image:<none>    image_size:<none>   caption:<none>  birth_name:<none>   birth_date_1:1954   birth_place:<none>  death_date:<none>   death_place:<none>  death_cause:<none>  resting_place:<none>    resting_place_coordinates:<none>    residence:<none>    nationality_1:german    ethnicity:<none>    citizenship:<none>  other_names:<none>  known_for:<none>    education:<none>    alma_mater:<none>   employer:<none> occupation_1:aircraft   occupation_2:designer   occupation_3:and    occupation_4:manufacturer   home_town:<none>    title:<none>    salary:<none>   networth:<none> height:<none>   weight:<none>   term:<none> predecessor:<none>  successor:<none>    party:<none>    boards:<none>   religion:<none> spouse:<none>   partner:<none>  children:<none> parents:<none>  relations:<none>    signature:<none>    website:<none>  footnotes:<none>    article_title_1:walter  article_title_2:extra

Description:

walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .

After preprocessing, it becomes:

{"source": "walter extra 1954 german aircraft designer and manufacturer walter extra", "target": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1"}

Then a key-fact label is added for each source token (1 if the word from source also appears in target, 0 otherwise):

{"source": "walter extra 1954 german aircraft designer and manufacturer walter extra", "target": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1", "label": "1 1 0 1 1 1 1 1 1 1", "pivot": "walter extra german aircraft designer and manufacturer walter extra"}

A separate pivot file is generated at the same time.
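A minimal sketch of this labeling step (my reading of the rule above, not the repository's exact code; the simple set-membership test is an assumption):

def add_key_fact_labels(example):
    # label a source token 1 if it also appears in the target, else 0
    target_tokens = set(example['target'].split())
    source_tokens = example['source'].split()
    labels = ['1' if tok in target_tokens else '0' for tok in source_tokens]
    example['label'] = ' '.join(labels)
    # the pivot is simply the sequence of label-1 source tokens
    example['pivot'] = ' '.join(t for t, l in zip(source_tokens, labels) if l == '1')
    return example

On the record above, '1954' is the only source token absent from the target, which matches the label string "1 1 0 1 1 1 1 1 1 1".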

extract_entity.py

The text is tagged with the Stanford CoreNLP toolkit, which assigns a POS tag to each word. Words in the target whose POS tag falls in the set {NN, NNS, NNP, NNPS, JJ, JJR, JJS, CD, FW}, i.e. {noun, singular or mass; noun, plural; proper noun, singular; proper noun, plural; adjective; adjective, comparative; adjective, superlative; cardinal number; foreign word}, are kept and the remaining words are removed. This pseudo-parallel construction produces the entity file:

walter extra german award-winning aerobatic pilot chief aircraft designer founder extra flugzeugbau extra aircraft construction manufacturer aerobatic aircraft
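To illustrate the filtering logic only, here is a rough sketch that uses nltk.pos_tag as a stand-in for the Stanford CoreNLP tagger used by the repository:

from nltk import pos_tag  # requires nltk.download('averaged_perceptron_tagger')

# POS tags kept when building the entity file
KEEP_TAGS = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'CD', 'FW'}

def extract_entity_words(text):
    # WikiBio text is already tokenized, so splitting on spaces is enough here
    tagged = pos_tag(text.split())
    return ' '.join(word for word, tag in tagged if tag in KEEP_TAGS)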

extract_pivot.py

The source (value), target (text), field, lpos, rpos, entity, label, and pivot are combined into the train.pivot.jsonl file:

{"value": "walter extra 1954 german aircraft designer and manufacturer walter extra", "text": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1", "entity": "walter extra german award-winning aerobatic pilot chief aircraft designer founder extra flugzeugbau extra aircraft construction manufacturer aerobatic aircraft", "label": "1 1 0 1 1 1 1 1 1 1", "pivot": "walter extra german aircraft designer and manufacturer walter extra"}

process_pivot.py

Two files are generated:

table2pivot:

10,000 records are randomly sampled from the dataset; their value, label, field, lpos, and rpos fields are combined into the train.t2p.jsonl file:

{"value": "walter extra 1954 german aircraft designer and manufacturer walter extra", "label": "1 1 0 1 1 1 1 1 1 1", "field": "name name birth_date nationality occupation occupation occupation occupation article_title article_title", "lpos": "1 2 1 1 1 2 3 4 1 2", "rpos": "2 1 1 1 4 3 2 1 2 1"}

pivot2text:

for i, d in enumerate(ori_datas):
    if index is None or i in index:
        # annotated subset: use the extracted key facts (pivot) as the source
        data = {'source': d['pivot'], 'target': d['text']}
    else:
        # remaining data: use the POS-filtered entity words as a pseudo source
        data = {'source': d['entity'], 'target': d['text']}
    datas.append(data)
{"source": "walter extra german award-winning aerobatic pilot chief aircraft designer founder extra flugzeugbau extra aircraft construction manufacturer aerobatic aircraft", "target": "walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft ."}

construct_corpus.py

First, index is the random index file generated earlier, with size = 10,000.

train_parallel_dataset:

for d in ori_datas:    
    data = {'source': d['value'], 'target': d['text'], 'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']}    
    datas.append(data)

This produces train.parallel.{10000}.jsonl, len = 10000.

 

train_p2t_dataset:

def get_filter_data(data: Dict) -> Dict:
    _data = {}
    label = data['label'].split(' ')
    field = data['field'].split(' ')
    lpos = data['lpos'].split(' ')
    rpos = data['rpos'].split(' ')
    _field = [f for f, l in zip(field, label) if l == '1']
    _lpos = [f for f, l in zip(lpos, label) if l == '1']
    _rpos = [f for f, l in zip(rpos, label) if l == '1']

    return {'source': data['pivot'], 'target': data['text'],
            'field': ' '.join(_field), 'lpos': ' '.join(_lpos),
            'rpos': ' '.join(_rpos)}

def train_p2t_dataset(r_path: str, w_path: str, index: List[int]) -> List[Dict]:
    '''
    value, text, field, lpos, rpos, pivot, entity
    '''
    ori_datas = loads(open(r_path))[1:]

    statistic = {'length': len(ori_datas)}
    datas = [statistic]
    index = set(index)

    for i, d in enumerate(ori_datas):
        if i in index:
            # datas.append({'source': d['pivot'], 'target': d['text']})
            datas.append(get_filter_data(d))
        else:
            datas.append({'source': d['entity'], 'target': d['text']})

    dumps(datas, open(w_path, 'w'))

This produces the train.p2t.{10000}.jsonl file, len = 580000.

 

train_t2p_dataset:

for d in ori_datas:
    data = {'value': d['value'], 'label': d['label'], 'field': d['field'],
            'lpos': d['lpos'], 'rpos': d['rpos']}
    datas.append(data)

This produces the train.t2p.{10000}.jsonl file, len = 10000.

 

train_aug_dataset:

    for i, d in enumerate(ori_datas):
        if i in index:
            datas.append({'source': d['value'], 'target': d['text'], 'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']})
        else:
            datas.append({'source': d['entity'], 'target': d['text']})

This produces the train.aug.{10000}.jsonl file, len = 58000.

 

train_semi_dataset:

    for i, d in enumerate(ori_datas):
        if i in index:
            datas.append({'source': d['value'], 'target': d['text'], 'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']})
        else:
            datas.append({'source': d['text'], 'target': d['text']})

This produces the train.semi.{10000}.jsonl file, len = 58000.

 

train_pretrain_dataset:

    for i, d in enumerate(ori_datas):
        if i in index:
            datas.append({'source': d['value'], 'target': d['text'], 'field': d['field'], 'lpos': d['lpos'], 'rpos': d['rpos']})
        else:
            datas.append({'source': '', 'target': d['text']})

This produces the train.pretrain.{10000}.jsonl file, len = 58000.

 

Converting the data into the format used by allennlp

dataset.pivot.py

class Table2PivotDataset:

Converts the input data into Instances. Each Instance contains a TextField holding a sentence and a SequenceLabelField holding the corresponding labels.
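A rough sketch of that conversion using standard AllenNLP fields (the exact token indexer setup in the repository is an assumption):

from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

def text_to_instance(words, labels=None):
    text_field = TextField([Token(w) for w in words],
                           {'tokens': SingleIdTokenIndexer()})
    fields = {'source': text_field}
    if labels is not None:
        # one label per token, aligned with the TextField
        fields['label'] = SequenceLabelField(labels, sequence_field=text_field)
    return Instance(fields)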

class Pivot2TextDataset:

Converts the generated key facts into Instances and adds noise:

    def add_noise(self, words):
        # words = self.word_shuffle(words)
        # randomly drop some words
        words = self.word_drop(words)
        # randomly replace some words with '@@UNKNOWN@@'
        words = self.word_blank(words)
        # insert a proportion of '@@UNKNOWN@@' tokens at random positions
        words = self.word_append(words)

        return words
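word_drop, word_blank and word_append are not shown in the excerpt above; one plausible implementation, with the '@@UNKNOWN@@' symbol and the rate names taken from this file, everything else an assumption:

import random

UNK = '@@UNKNOWN@@'

def word_drop(words, drop_rate=0.1):
    kept = [w for w in words if random.random() >= drop_rate]
    return kept if kept else words  # never drop every word

def word_blank(words, blank_rate=0.1):
    return [UNK if random.random() < blank_rate else w for w in words]

def word_append(words, append_rate=0.1):
    out = list(words)
    for _ in range(int(len(words) * append_rate)):
        out.insert(random.randrange(len(out) + 1), UNK)
    return out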

class Table2PivotCorpus:

Loads the data:

train_path = os.path.join(data_path, 'train.t2p.{0}.jsonl'.format(scale))
dev_path = os.path.join(data_path, 'valid.t2p.jsonl')
test_path = os.path.join(data_path, 'test.t2p.jsonl')

vocab_dir = os.path.join(data_path, 'dicts-{0}-t2p-{1}'.format(vocab_size, scale))

Converts to Instances:

self.train_dataset = Table2PivotDataset(path=train_path, max_len=max_len)
self.test_dataset = Table2PivotDataset(path=test_path, max_len=max_len)
self.dev_dataset = Table2PivotDataset(path=dev_path, max_len=max_len)
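One common AllenNLP pattern for the vocab_dir above is to build the vocabulary from the training instances and cache it on disk; whether the repository does exactly this is an assumption:

import os
from allennlp.data import Vocabulary

def load_or_build_vocab(vocab_dir, train_instances, vocab_size):
    # reuse a cached vocabulary if present, otherwise build and save it
    if os.path.isdir(vocab_dir):
        return Vocabulary.from_files(vocab_dir)
    vocab = Vocabulary.from_instances(train_instances, max_vocab_size=vocab_size)
    vocab.save_to_files(vocab_dir)
    return vocab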

Wraps the datasets in torch DataLoaders:

self.train_loader = torch.utils.data.DataLoader(dataset=self.train_dataset,
                                                batch_size=batch_size,
                                                collate_fn=collate_fn,
                                                shuffle=True)
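collate_fn is defined elsewhere; a hypothetical version that pads variable-length id sequences in a batch (assuming pad id 0 and that each item exposes 'src'/'tgt' tensors) could look like:

from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    srcs = [item['src'] for item in batch]
    tgts = [item['tgt'] for item in batch]
    return {'src': pad_sequence(srcs, batch_first=True, padding_value=0),
            'tgt': pad_sequence(tgts, batch_first=True, padding_value=0)}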

The predictions are saved to the predict-{10000}.txt file:

        for ids, source, field, lpos, rpos in zip(model_ids, sources, fields, lposs, rposs):
            words = [s for id, s in zip(ids, source) if id > 0]
            _field = [s for id, s in zip(ids, field) if id > 0]
            _lpos = [s for id, s in zip(ids, lpos) if id > 0]
            _rpos = [s for id, s in zip(ids, rpos) if id > 0]
            model_words.append(' '.join(words))
            _fields.append(' '.join(_field))
            _lposs.append(' '.join(_lpos))
            _rposs.append(' '.join(_rpos))

        with open(os.path.join(self.data_path, 'predict-{0}.txt'.format(self.scale)), 'w') as f:
            print('\n'.join(model_words), file=f)

Model evaluation: def evaluate

 

class Pivot2TextCorpus:

Loads the data:

train_path = os.path.join(data_path, 'train.p2t.{0}.jsonl'.format(scale))
dev_path = os.path.join(data_path, 'valid.p2t.jsonl')
test_path = os.path.join(data_path, 'test.predict.{0}.jsonl'.format(scale))

The generated key facts are converted into Instances, with noise added to the training data:

self.train_dataset = Pivot2TextDataset(path=train_path, src_max_len=src_max_len, tgt_max_len=tgt_max_len,
                                       share=share, append_rate=append_rate, drop_rate=drop_rate,
                                       blank_rate=blank_rate, use_feature=use_feature)

self.test_dataset = Pivot2TextDataset(path=test_path, src_max_len=src_max_len, tgt_max_len=tgt_max_len,
                                      share=share, use_feature=use_feature)

self.dev_dataset = Pivot2TextDataset(path=dev_path, src_max_len=src_max_len, tgt_max_len=tgt_max_len,
                                     share=share, use_feature=use_feature)

Evaluation: def evaluate

 

2. Model

Model used by the table2pivot module:

sequence_labeling.py

A bidirectional LSTM encoder plus a linear classification decoder; the word embedding, attribute (field) embedding, and position embeddings are concatenated to form the model input x:

self.src_embedding = nn.Embedding(self.vocab_size, emb_size)
self.key_embedding = nn.Embedding(vocab.get_vocab_size('keys'), key_emb_size)
self.lpos_embedding = nn.Embedding(vocab.get_vocab_size('lpos'), pos_emb_size)
self.rpos_embedding = nn.Embedding(vocab.get_vocab_size('rpos'), pos_emb_size)

self.encoder = rnn.rnn_encoder(emb_size + key_emb_size + pos_emb_size * 2, hidden_size,
                               enc_layers, dropout, bidirectional)
self.decoder = nn.Linear(hidden_size, self.label_size)
self.accuracy = SequenceAccuracy()

src_embs = torch.cat([self.src_embedding(src),
                      self.key_embedding(keys),
                      self.lpos_embedding(lpos),
                      self.rpos_embedding(rpos)], dim=-1)

src_embs = pack(src_embs, lengths, batch_first=True)
encode_outputs = self.encoder(src_embs)
out_logits = self.decoder(encode_outputs['hidden_outputs'])
seq_mask = (src > 0).float()

self.accuracy(predictions=out_logits, gold_labels=tgt, mask=seq_mask)
loss = sequence_cross_entropy_with_logits(logits=out_logits, targets=tgt,
                                          weights=seq_mask, average='token')
# outputs = out_logits.max(-1)[1] greedy

Prediction:

out_logits = self.decoder(encode_outputs['hidden_outputs'])
outputs = out_logits.max(-1)[1]

outputs = outputs.index_select(dim=0, index=rev_indices)
src = src.index_select(dim=0, index=rev_indices)

seq_mask = src > 0
correct_tokens = (tgt.eq(outputs) * seq_mask).sum()
total_tokens = seq_mask.sum()

h_tokens = (tgt.eq(outputs) * seq_mask * (tgt.eq(0))).sum()
r_total_tokens = (tgt.eq(0) * seq_mask).sum()
p_total_tokens = (outputs.eq(0) * seq_mask).sum()

output_ids = 1 - outputs

return {'correct': correct_tokens.item(), 'total': total_tokens.item(), 'hit': h_tokens.item(),
        'r_total': r_total_tokens.item(), 'p_total': p_total_tokens.item(),
        'output_ids': output_ids.tolist()}
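The accuracy/precision/recall/F1 numbers reported later can be recovered from these per-batch counts; the aggregation below is my reading of hit/p_total/r_total, not code from the repository:

def aggregate_metrics(batch_stats):
    correct = sum(s['correct'] for s in batch_stats)
    total = sum(s['total'] for s in batch_stats)
    hit = sum(s['hit'] for s in batch_stats)
    p_total = sum(s['p_total'] for s in batch_stats)
    r_total = sum(s['r_total'] for s in batch_stats)
    accuracy = 100.0 * correct / total
    precision = 100.0 * hit / p_total   # hits among predicted positives
    recall = 100.0 * hit / r_total      # hits among gold positives
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}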

Model used by the pivot2text module:

transformer.py

A standard Transformer architecture:

        self.encoder = TransformerEncoder(n_hidden, ff_size, n_head, dropout, n_block)
        self.decoder = TransformerDecoder(n_hidden, ff_size, n_head, dropout, n_block)

        self.generator = nn.Linear(n_hidden, self.vocab_size)
        self.accuracy = SequenceAccuracy()

    def encode(self, src):
        embs = self.src_embedding(src)
        out = self.encoder(embs)
        return out

    def decode(self, tgt, memory, step_wise=False):
        embs = self.tgt_embedding(tgt)
        outputs = self.decoder(embs, memory, step_wise)
        out, attns = outputs['outs'], outputs['attns']
        out = self.generator(out)
        return out, attns

    def forward(self,
                src: Dict[str, torch.Tensor],
                tgt: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        src, tgt = src['tokens'], tgt['tokens']
        encode_outputs = self.encode(src)
        out_logits, _ = self.decode(tgt[:, :-1], encode_outputs)

        targets = tgt[:, 1:].contiguous()
        seq_mask = (targets > 0).float()

        self.accuracy(predictions=out_logits, gold_labels=targets, mask=seq_mask)
        loss = sequence_cross_entropy_with_logits(logits=out_logits,
                                                  targets=targets,
                                                  weights=seq_mask,
                                                  average='token',
                                                  label_smoothing=self.label_smoothing)
        outputs = {'loss': loss}

        return outputs

Greedy search is used when decoding at prediction time:

    def greedy_search(self,
                      src: Dict[str, torch.Tensor],
                      max_decoding_step: int) -> Dict[str, torch.Tensor]:

        src = src['tokens']
        ys = torch.ones(src.size(0), 1).long().fill_(self._bos).cuda()
        # self.decoder.init_cache()
        # output_ids, attns = [], []

        encode_outputs = self.encode(src)

        for i in range(max_decoding_step):
            outputs, attn = self.decode(ys, encode_outputs, step_wise=False)
            logits = outputs[:, -1]
            # logits = outputs
            next_id = logits.max(1, keepdim=True)[1]
            # ys = next_id
            # output_ids.append(ys)
            # attns.append(attn)
            ys = torch.cat([ys, next_id], dim=1)

        output_ids = ys[:, 1:]
        attns = attn
        # output_ids = torch.cat(output_ids, dim=1)
        # attns = torch.cat(attns, dim=1)
        alignments = attns.max(2)[1]
        outputs = {'output_ids': output_ids.tolist(), 'alignments': alignments.tolist()}

        return outputs
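The returned output_ids still have to be mapped back to text before BLEU is computed; a hypothetical post-processing step (the eos handling and vocabulary namespace are assumptions):

def ids_to_text(output_ids, vocab, eos_id):
    words = []
    for idx in output_ids:
        if idx == eos_id:
            break  # stop at the first end-of-sequence symbol
        words.append(vocab.get_token_from_index(idx, namespace='tokens'))
    return ' '.join(words)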

Overall workflow:

table2pivot:

train:

train.t2p.10000.jsonl: 10K labeled examples used to train the model.

The validation set valid.t2p.jsonl (72K examples) is then used to track whether the current model is the best so far and whether to stop early.

 

test:

test.t2p.jsonl (72K examples): the best.th checkpoint is loaded and the predicted key facts (tokens with predicted label > 0) are written to test.predict.10000.jsonl.

Accuracy: 85.21618881852217, Precision: 91.67537147343486, Recall: 87.06278910894117, F1: 89.3095632971675

 

pivot2text:

train:

Training on train.p2t.10000.jsonl (780K examples):

accuracy: 0.9140, loss: 0.3335 ||: 100%|████| 9105/9105 [30:46<00:00, 5.84it/s]

The validation set valid.p2t.jsonl (72K examples) is then used to track whether the current model is the best so far and whether to stop early:

100%|█████████████████████████████████████████| 569/569 [03:16<00:00, 3.06it/s] BLEU = 29.33, 54.4/32.9/24.6/19.3 (BP=0.967, ratio=0.967, hyp_len=1836100, ref_len=1898421)

 

test:

100%|█████████████████████████████████████████| 569/569 [03:16<00:00, 2.88it/s] BLEU = 29.63, 54.2/32.9/24.6/19.3 (BP=0.978, ratio=0.978, hyp_len=1856611, ref_len=1898421)

The final text is generated from test.predict.10000.jsonl and compared against the target.
