Relation Extraction Series (2): Paper Notes on Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification




Authors: Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi*, Bingchen Li, Hongwei Hao, Bo Xu

Affiliation: Institute of Automation, Chinese Academy of Sciences

Source: ACL 2016

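2.1 Motivation

This paper is a classic in relation extraction, and one of the more mature works on non-distantly-supervised datasets from the era before pretrained language models. Its main motivation is to apply the BiLSTM + attention architecture, which had already succeeded on many NLP tasks at the time, to relation extraction. The advantage of this framework is that it is fully end-to-end and needs no feature engineering. The CNN in the previous post removed much of the feature-selection work, but it still relied on carefully designed lexical and sentence features, in particular hypernyms among the lexical features and position encodings among the sentence features, whereas this paper is entirely end-to-end.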

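2.2 Method

Architecturally, going upward from the input layer, the model consists of a word-embedding layer, then two parallel LSTM layers (forward and backward) whose outputs are combined (by element-wise sum, see 2.5) and passed through an attention layer, and finally a softmax layer that produces the relation class, as shown in the figure below:
[Figure: model architecture]

The code is fairly clear: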
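
For reference, the same stack can be written compactly as equations. This is a sketch in the paper's notation as best it maps onto the code in this post: $T$ is the sentence length, $d^w$ the hidden size, $w$ the trained attention vector (named `u_omega` in the code), and $\oplus$ denotes element-wise sum. The symbol-to-code mapping is my own reading and should be checked against the paper.

```latex
% BiLSTM output per token: element-wise sum of both directions
\begin{aligned}
h_i &= \overrightarrow{h_i} \oplus \overleftarrow{h_i}, \qquad
H = [h_1, h_2, \dots, h_T] \in \mathbb{R}^{d^w \times T} \\
% Attention over the T time steps
M &= \tanh(H) \\
\alpha &= \operatorname{softmax}(w^{\top} M) \\
r &= H \alpha^{\top} \\
% Sentence representation and classifier
h^{*} &= \tanh(r) \\
\hat{p}(y \mid S) &= \operatorname{softmax}\big(W^{(S)} h^{*} + b^{(S)}\big)
\end{aligned}
```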

import tensorflow as tf  # written against the TensorFlow 1.x API


def attention(inputs):
    # Trainable parameters
    hidden_size = inputs.shape[2].value
    u_omega = tf.get_variable("u_omega", [hidden_size], initializer=tf.keras.initializers.glorot_normal())

    with tf.name_scope('v'):
        v = tf.tanh(inputs)

    # For each timestep, the D-dimensional vector from `v` is reduced to a scalar score with `u_omega`
    vu = tf.tensordot(v, u_omega, axes=1, name='vu')  # (B,T) shape
    alphas = tf.nn.softmax(vu, name='alphas')  # (B,T) shape

    # Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)

    # Final output with tanh
    output = tf.tanh(output)

    return output, alphas

class AttLSTM:
    def __init__(self, sequence_length, num_classes, vocab_size, embedding_size, hidden_size, l2_reg_lambda=0.0):
        # Placeholders for input, output and dropout
        self.input_text = tf.placeholder(tf.int32, shape=[None, sequence_length], name='input_text')
        self.input_y = tf.placeholder(tf.float32, shape=[None, num_classes], name='input_y')
        self.emb_dropout_keep_prob = tf.placeholder(tf.float32, name='emb_dropout_keep_prob')
        self.rnn_dropout_keep_prob = tf.placeholder(tf.float32, name='rnn_dropout_keep_prob')
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')

        initializer = tf.keras.initializers.glorot_normal

        # Word Embedding Layer (no position encoding is added here)
        with tf.device('/cpu:0'), tf.variable_scope("word-embeddings"):
            self.W_text = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -0.25, 0.25), name="W_text")
            self.embedded_chars = tf.nn.embedding_lookup(self.W_text, self.input_text)

        # Dropout for Word Embedding
        with tf.variable_scope('dropout-embeddings'):
            self.embedded_chars = tf.nn.dropout(self.embedded_chars, self.emb_dropout_keep_prob)

        # Bidirectional LSTM
        with tf.variable_scope("bi-lstm"):
            _fw_cell = tf.nn.rnn_cell.LSTMCell(hidden_size, initializer=initializer())
            fw_cell = tf.nn.rnn_cell.DropoutWrapper(_fw_cell, self.rnn_dropout_keep_prob)
            _bw_cell = tf.nn.rnn_cell.LSTMCell(hidden_size, initializer=initializer())
            bw_cell = tf.nn.rnn_cell.DropoutWrapper(_bw_cell, self.rnn_dropout_keep_prob)
            self.rnn_outputs, _ = tf.nn.bidirectional_dynamic_rnn(
                cell_fw=fw_cell,
                cell_bw=bw_cell,
                inputs=self.embedded_chars,
                sequence_length=self._length(self.input_text),
                dtype=tf.float32)
            # Element-wise sum of the forward and backward outputs (not concatenation)
            self.rnn_outputs = tf.add(self.rnn_outputs[0], self.rnn_outputs[1])

        # Attention
        with tf.variable_scope('attention'):
            self.attn, self.alphas = attention(self.rnn_outputs)

        # Dropout
        with tf.variable_scope('dropout'):
            self.h_drop = tf.nn.dropout(self.attn, self.dropout_keep_prob)

        # Fully connected layer
        with tf.variable_scope('output'):
            self.logits = tf.layers.dense(self.h_drop, num_classes, kernel_initializer=initializer())
            self.predictions = tf.argmax(self.logits, 1, name="predictions")

        # Calculate mean cross-entropy loss
        with tf.variable_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits_v2(logits=self.logits, labels=self.input_y)
            self.l2 = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * self.l2

        # Accuracy
        with tf.variable_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name="accuracy")

    # Length of the sequence data
    @staticmethod
    def _length(seq):
        relevant = tf.sign(tf.abs(seq))
        length = tf.reduce_sum(relevant, axis=1)  # `reduction_indices` is a deprecated alias of `axis`
        length = tf.cast(length, tf.int32)
        return length
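
To check the tensor shapes without a TensorFlow 1.x installation, the attention step can be mirrored in plain NumPy. This is only a sketch: `attention_np`, `softmax`, and the random inputs are hypothetical stand-ins, not part of the referenced repository, and the forward/backward outputs are simulated rather than produced by a real LSTM.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention_np(inputs, u_omega):
    """inputs: (B, T, D) RNN outputs; u_omega: (D,) attention vector.

    Returns the (B, D) sentence vector and the (B, T) attention weights.
    """
    v = np.tanh(inputs)                                   # (B, T, D)
    vu = np.tensordot(v, u_omega, axes=1)                 # (B, T) scores
    alphas = softmax(vu, axis=-1)                         # (B, T) weights
    output = np.sum(inputs * alphas[..., None], axis=1)   # (B, D) weighted sum
    return np.tanh(output), alphas


rng = np.random.default_rng(0)
B, T, D = 2, 5, 4                        # batch, time steps, hidden size
fw = rng.normal(size=(B, T, D))          # simulated forward LSTM outputs
bw = rng.normal(size=(B, T, D))          # simulated backward LSTM outputs
rnn_outputs = fw + bw                    # element-wise sum, like tf.add in the model
u_omega = rng.normal(size=D)

output, alphas = attention_np(rnn_outputs, u_omega)
print(output.shape, alphas.shape)        # (2, 4) (2, 5)
```

Like the TensorFlow code above, this sketch applies the softmax over all T positions, including any padded ones.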

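2.3 Experimental Results

The results were SOTA at the time: F1 reached 84%, about 2% higher than the CNN method from the previous post. Detailed results are shown in the figure below:
[Figure: experimental results]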

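2.4 Contributions

The main contributions of the paper are:

It applied LSTM + attention, the best-performing end-to-end framework before Transformers and pretrained models appeared
It achieved SOTA on the dataset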

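2.5 Personal Comments

A small detail: the forward and backward LSTM outputs of the BiLSTM are combined by element-wise addition rather than the usual direct concatenation. The likely considerations are a lower dimensionality and somewhat fewer parameters, with little difference in principle; the dataset used in the experiments is also fairly small.
In terms of approach, the paper keeps emphasizing that its contribution is end-to-end, with no feature engineering. Personally, I don't think feature engineering is necessarily undesirable: stacking usable engineered features on top of a good model, provided they can be tuned effectively, is something that will certainly be done in engineering practice at least. End-to-end is attractive mainly because in many text scenarios one cannot tell which features are truly good. For relation extraction, however, position features that encode the distance between each word and the entities have already been shown to be quite effective.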
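
To make the position feature concrete, here is a minimal sketch of how the distance of each word to an entity is commonly turned into an embedding index. The function name `relative_positions`, the `max_dist` clipping, and the example sentence are all assumptions for illustration; this is not taken from the paper or from the referenced repository.

```python
def relative_positions(num_tokens, entity_index, max_dist=30):
    """Signed distance of every token to the entity token.

    Distances are clipped to [-max_dist, max_dist] and shifted by max_dist,
    so the result is a list of non-negative ids usable with an embedding table
    of size 2 * max_dist + 1.
    """
    positions = []
    for i in range(num_tokens):
        d = max(-max_dist, min(max_dist, i - entity_index))
        positions.append(d + max_dist)
    return positions


# Made-up example sentence; the entity indices are chosen by hand
tokens = "the fire was caused by a faulty heater".split()
e1, e2 = 1, 7  # "fire" and "heater"

p1 = relative_positions(len(tokens), e1, max_dist=3)
p2 = relative_positions(len(tokens), e2, max_dist=3)
print(p1)  # [2, 3, 4, 5, 6, 6, 6, 6]
print(p2)  # [0, 0, 0, 0, 0, 1, 2, 3]
```

Each token then gets two extra ids (one per entity) whose embeddings would be concatenated with its word embedding before the encoder.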

2.6 Notes on Running the Model

2.6.1 Code Repository

The code is mainly based on https://github.com/SeoSangwoo/Attention-Based-BiLSTM-relation-extraction.git; thanks to the code contributor.

2.6.2 Runtime Environment

tensorflow 1.4.0+
python 3.5+
nltk
GloVe pretrained word vectors

2.6.3 Dataset Selection and Download

The dataset is the one used in the paper: SemEval-2010 Task 8

2.6.4 Steps to Run the Model

1) Training: python train.py --embedding_path "{path to word vectors}"

2) Testing: python eval.py --checkpoint_dir "runs/1523902663/checkpoints/"

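2.6.5 A Small Issue Encountered with the Code

One small issue: if the NLTK data was not downloaded when nltk was installed, the code may fail at runtime. The fix:
python -> import nltk -> nltk.download('punkt')
Here, nltk.tokenize.punkt contains many pretrained tokenization models. See Dive into NLTK II for details. A concrete usage example follows: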