Text Augmentation Techniques


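I'm 魁梧荷花, a blogger on 靠谱客. This post introduces text augmentation techniques; I'm sharing it here in the hope that it serves as a useful reference.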

Source: the talk 《文本增强技术的研究进展及应用实践》 ("Research Progress and Practical Applications of Text Augmentation") by Dr. Li Yu (李渔) of 熵简科技.

Background

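In essence, text augmentation increases the number of samples in under-represented classes (classes with few samples or little variety).
Typical scenarios include:
- Few-shot settings (for example, text annotation is slow and labor-intensive, so large labeled sets are hard to obtain)
- Class imbalance in classification tasks (note: how do the common remedies, undersampling and oversampling, affect the model?)
- Semi-supervised training (Google's 2019 semi-supervised learning algorithm UDA [6] shows that text augmentation can be applied to unlabeled samples to construct the sample pairs that semi-supervised training needs, so that the model obtains the gradients it needs for optimization from unlabeled data)
- Improving model robustness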

Typical methods

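1) Back translation
- Example:
  
  [Original text] 文本数据增强技术在自然语言处理中属于基础性技术； (roughly: "Text data augmentation is a foundational technique in natural language processing;")
  [Translated into Japanese] テキストデータ拡張技術は、自然言語処理の基本的な技術です；
  [Japanese translated into English] Text data extension technology is a basic technology of natural language processing；
  [English translated back into Chinese] 文本数据扩展技术是自然语言处理的基本技术。 (roughly: "Text data extension technology is a basic technology of natural language processing.")
- Details
  - With translation services such as Google Translate or Youdao (有道), switching the pivot language yields an N-fold expansion of the data. With a self-trained translation model, varied outputs can instead be obtained through decoding strategies such as random sampling or beam search.
  - Long texts are usually split into segments, back-translated segment by segment, and then merged.
- Used in Google's semi-supervised learning algorithm UDA [6]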
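
The back-translation procedure above (several pivot languages, long texts split and merged) can be sketched roughly as follows. This is only a minimal sketch: `translate(text, src, tgt)` is a hypothetical callable standing in for whatever translation service or model is used, and `split_sentences`, `back_translate`, and `toy_translate` are illustrative names, not part of any library.

```python
import re
from typing import Callable, List, Sequence

# Hypothetical signature: translate(text, src_lang, tgt_lang) -> translated text.
Translator = Callable[[str, str, str], str]

_END = "。！？；.!?;"
_SENTENCE = re.compile(rf"[^{_END}]+[{_END}]?")


def split_sentences(text: str) -> List[str]:
    """Split a long text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def back_translate(text: str, translate: Translator, src: str = "zh",
                   pivots: Sequence[str] = ("en", "ja"),
                   joiner: str = "") -> List[str]:
    """Return one paraphrase per pivot language (N pivots -> N new texts)."""
    results = []
    for pivot in pivots:
        segments = []
        for sentence in split_sentences(text):      # translate piece by piece
            forward = translate(sentence, src, pivot)
            segments.append(translate(forward, pivot, src))
        results.append(joiner.join(segments))       # merge the pieces again
    return results


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in translator for the demo: it only tags the text."""
    return f"[{src}->{tgt}]{text}"


if __name__ == "__main__":
    for line in back_translate("今天天气很好。明天呢？", toy_translate):
        print(line)
```

In practice `toy_translate` would be swapped for a real translator, and `joiner` would be set to a space for languages that separate sentences with whitespace.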
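2) EDA (Easy Data Augmentation)
- Four operations: synonym replacement, random insertion, random swap, and random deletion
  
  [Synonym Replacement (SR)] Randomly select words in the sentence that are not stop words, and replace each of them with a randomly chosen synonym.
  [Random Insertion (RI)] Randomly pick a word in the sentence that is not a stop word, take one of its synonyms at random, and insert that synonym at a random position in the sentence. Repeat n times.
  [Random Swap (RS)] Randomly pick two words in the sentence and swap their positions. Repeat n times.
  [Random Deletion (RD)] Delete each word in the sentence independently with probability p.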
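- Example:
  
  [Original text] 今天天气很好。 ("The weather is very good today.")
  [Synonym Replacement (SR)] 今天天气不错。 (好 is replaced with 不错)
  [Random Insertion (RI)] 今天不错天气很好。 (不错 is inserted)
  [Random Swap (RS)] 今天很好天气。 (很好 and 天气 swap positions)
  [Random Deletion (RD)] 今天天气好。 (很 is deleted)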
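- One question: since EDA applies random operations to the text, does the class label still hold after augmentation?
  - It largely does, as the accompanying figure shows (figure not reproduced here): the original test set and the augmented corpus were both fed into model A, and the outputs of the model's final linear layer were compared after t-SNE dimensionality reduction.
- Results (figure not reproduced here)
- How to set the replacement ratio and the augmentation multiple: the original paper gives recommendations [11] (table not reproduced here). There, α is the fraction of words changed by an operation. For synonym replacement, for example, the number of replaced words is n = α·L, where L is the sentence length; random insertion and random swap are handled the same way. N_aug is the number of augmented sentences that EDA generates from each original sentence.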
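
The four EDA operations and the n = α·L rule can be sketched as below. This is a minimal sketch under stated assumptions: `STOP_WORDS` and `SYNONYMS` are tiny toy tables invented for the demo (a real setup would use a proper stop-word list and synonym dictionary), the input is assumed to be already tokenized, and reusing α as the deletion probability p is an assumption of this sketch.

```python
import random
from typing import Dict, List

# Toy resources, for illustration only.
STOP_WORDS = {"很", "的"}
SYNONYMS: Dict[str, List[str]] = {"好": ["不错"], "天气": ["气候"]}


def _candidates(words: List[str]) -> List[int]:
    """Indices of non-stop words that have at least one synonym."""
    return [i for i, w in enumerate(words)
            if w not in STOP_WORDS and w in SYNONYMS]


def synonym_replacement(words, n, rng):
    out = list(words)
    idxs = _candidates(out)
    rng.shuffle(idxs)
    for i in idxs[:n]:                      # replace up to n words
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    out = list(words)
    for _ in range(n):
        idxs = _candidates(out)
        if not idxs:
            break
        synonym = rng.choice(SYNONYMS[out[rng.choice(idxs)]])
        out.insert(rng.randint(0, len(out)), synonym)   # any position
    return out


def random_swap(words, n, rng):
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    words = list(words)
    if len(words) <= 1:
        return words
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]      # never return an empty sentence


def eda(words, alpha=0.1, n_aug=4, seed=0):
    """Generate n_aug variants, cycling through SR, RI, RS, RD."""
    rng = random.Random(seed)
    n = max(1, int(alpha * len(words)))     # n = alpha * L, at least 1
    ops = [
        lambda w: synonym_replacement(w, n, rng),
        lambda w: random_insertion(w, n, rng),
        lambda w: random_swap(w, n, rng),
        lambda w: random_deletion(w, alpha, rng),
    ]
    return [ops[k % len(ops)](words) for k in range(n_aug)]


if __name__ == "__main__":
    for variant in eda(["今天", "天气", "很", "好"], alpha=0.25, n_aug=4):
        print("".join(variant))
```

Each call produces one augmented sentence per operation in turn, so `n_aug` plays the role of N_aug and `alpha` the role of α in the text above.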
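3) An EDA refinement: replacing non-core words
- Goal: avoid replacing important words. A certain proportion of the unimportant words in the text are replaced with unimportant words from the vocabulary, producing new text.
- Method: use TF-IDF to measure word importance
- UDA [6] describes a concrete way to compute this importance (formula not reproduced here).
- Results (figure not reproduced here)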
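
One way to realize TF-IDF-guided replacement is sketched below. It follows my reading of the scheme described in UDA [6]: a word's replacement probability grows as its TF-IDF score falls below the sentence maximum, and replacement words are drawn preferentially from low-scoring vocabulary entries. The exact normalization, the handling of ties, and all function names here are assumptions of this sketch, not the authors' code.

```python
import math
import random
from collections import Counter
from typing import Dict, List


def idf_table(corpus: List[List[str]]) -> Dict[str, float]:
    """idf(w) = log(N / df(w)) over a corpus of tokenized documents."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / count) for w, count in df.items()}


def replace_probs(tokens: List[str], idf: Dict[str, float], p: float = 0.3):
    """Per-token replacement probability: min(p * (C - tfidf_i) / Z, 1)."""
    tf = Counter(tokens)
    scores = [tf[w] * idf.get(w, 0.0) for w in tokens]
    c = max(scores)                              # most important word
    z = sum(c - s for s in scores) / len(scores)
    if z == 0:                                   # all words equally important
        return [0.0] * len(tokens)
    return [min(p * (c - s) / z, 1.0) for s in scores]


def sampling_weights(corpus: List[List[str]], idf: Dict[str, float]):
    """Weight for drawing w as a replacement: max S - S(w), S = freq * idf."""
    freq = Counter(w for doc in corpus for w in doc)
    score = {w: freq[w] * idf[w] for w in freq}
    top = max(score.values())
    return {w: top - s for w, s in score.items()}


def tfidf_replace(tokens, corpus, p=0.3, seed=0):
    rng = random.Random(seed)
    idf = idf_table(corpus)
    probs = replace_probs(tokens, idf, p)
    weights = sampling_weights(corpus, idf)
    vocab = list(weights)
    w = [weights[v] for v in vocab]
    if sum(w) == 0:                              # fall back to uniform sampling
        w = [1.0] * len(vocab)
    return [rng.choices(vocab, weights=w)[0] if rng.random() < q else tok
            for tok, q in zip(tokens, probs)]


if __name__ == "__main__":
    corpus = [["the", "movie", "was", "great"],
              ["the", "plot", "was", "dull"],
              ["the", "acting", "was", "great"]]
    print(tfidf_replace(corpus[0], corpus, p=0.5, seed=1))
```

In the toy corpus, "the" and "was" appear in every document, so they are the likeliest to be replaced, while "movie" (the highest TF-IDF word in its sentence) is never touched.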
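4) Text augmentation based on contextual information
- Principle: start from a trained language model (LM). For each original text to be augmented, randomly remove one word or character (depending on whether the LM operates on words or characters). Then feed the rest of the text into the LM and use its top-k predictions to fill the removed position, yielding k new texts.
- One implementation: contextual augmentation based on a bidirectional LM, proposed in 2018 by the Japanese company Preferred Networks [12]. It adds polarity (label) information so that the label of the text is preserved.
- Another implementation: replace the bidirectional LM with a fine-tuned BERT, again bringing in the label of the original text [13] (experimental results not reproduced here).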
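
The remove-one-word, fill-with-top-k idea can be illustrated with a deliberately tiny stand-in for the language model. The sketch below uses bigram counts from both the left and the right neighbor as a toy "bidirectional" predictor; a real system would use a trained bidirectional LM or BERT as described above, and the label conditioning of [12] is not implemented here. All names are illustrative.

```python
import random
from collections import Counter, defaultdict
from typing import List


def train_bigram_lm(corpus: List[List[str]]):
    """Count which words follow a left neighbor and precede a right neighbor."""
    left = defaultdict(Counter)    # left[prev_word][word]
    right = defaultdict(Counter)   # right[next_word][word]
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
            left[prev][word] += 1
            right[nxt][word] += 1
    return left, right


def predict_top_k(tokens: List[str], pos: int, lm, k: int) -> List[str]:
    """Top-k candidates for position pos, given both neighbors."""
    left, right = lm
    padded = ["<s>"] + tokens + ["</s>"]
    prev, nxt = padded[pos], padded[pos + 2]
    scores = Counter()
    scores.update(left[prev])      # evidence from the left context
    scores.update(right[nxt])      # evidence from the right context
    scores.pop(tokens[pos], None)  # do not propose the original word
    return [w for w, _ in scores.most_common(k)]


def contextual_augment(tokens: List[str], lm, k: int = 2, seed: int = 0):
    """Drop one random position and fill it with each top-k prediction."""
    rng = random.Random(seed)
    pos = rng.randrange(len(tokens))
    return [tokens[:pos] + [w] + tokens[pos + 1:]
            for w in predict_top_k(tokens, pos, lm, k)]


if __name__ == "__main__":
    corpus = [["the", "movie", "was", "great"],
              ["the", "film", "was", "great"],
              ["the", "movie", "was", "good"]]
    lm = train_bigram_lm(corpus)
    print(predict_top_k(corpus[0], 1, lm, k=1))
    print(contextual_augment(corpus[0], lm, k=2, seed=0))
```

With such a small corpus some positions have no alternative candidate, in which case `contextual_augment` returns an empty list.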
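5) Text augmentation based on a generative language model: LAMBADA (IBM, November 2019, GPT-based [2])
- Principle: LAMBADA starts from a model pre-trained on a large amount of text, so that it captures the structure of language and can produce coherent sentences. The model is then fine-tuned on the small dataset of each task, and the fine-tuned model (the generator) is used to generate new sentences. Finally, a classifier (the discriminator) trained on the same small dataset filters the generated sentences, so that the newly generated data has a distribution close to that of the existing small dataset. (This is loosely analogous to how a GAN generates images.)
- Experimental results: BERT, SVM, and LSTM denote the classifiers used for evaluation (results not reproduced here)
- Analysis:
  - No comparison with back translation is given.
  - On the ATIS dataset, the improvement over the baseline exceeds 50%. The explanation given in the original paper is that ATIS is markedly imbalanced, and LAMBADA effectively compensates for the imbalance of the original dataset.
  - Possible improvements: LAMBADA leaves much to explore, for example combining it with the UDA framework mentioned above to achieve semi-supervised learning in few-shot settings, or, as the paper's authors mention, applying the technique to zero-shot learning in future work.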
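
The generate-then-filter step described above can be sketched as follows. This is a simplified reading, not the paper's code: `generate(label, n)` and `classify(sentence)` are hypothetical callables standing in for the fine-tuned generator and the classifier trained on the small dataset, and keeping the most confident sentences whose predicted label matches the requested label is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interfaces for the two models.
Generator = Callable[[str, int], List[str]]        # (label, n) -> sentences
Classifier = Callable[[str], Tuple[str, float]]    # sentence -> (label, confidence)


def lambada_filter(labels: Sequence[str], generate: Generator,
                   classify: Classifier, per_class: int,
                   oversample: int = 10) -> Dict[str, List[str]]:
    """Generate extra candidates per class and keep the most confident ones."""
    selected: Dict[str, List[str]] = {}
    for label in labels:
        candidates = generate(label, per_class * oversample)
        scored = []
        for sentence in candidates:
            predicted, confidence = classify(sentence)
            if predicted == label:          # drop sentences that changed class
                scored.append((confidence, sentence))
        scored.sort(key=lambda item: item[0], reverse=True)
        selected[label] = [s for _, s in scored[:per_class]]
    return selected


if __name__ == "__main__":
    # Stub models for the demo: confidence decreases with the sample index.
    def fake_generate(label: str, n: int) -> List[str]:
        return [f"{label} sample {i}" for i in range(n)]

    def fake_classify(sentence: str) -> Tuple[str, float]:
        parts = sentence.split()
        return parts[0], 1.0 / (1 + int(parts[-1]))

    print(lambada_filter(["flight", "fare"], fake_generate, fake_classify,
                         per_class=2, oversample=3))
```

The `oversample` factor reflects the idea of generating more sentences than needed so that filtering still leaves enough per class.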
6) Other methods: data augmentation based on text style transfer
- To be continued; see [19, 20]

Summary

Viewing the effectiveness of text augmentation from a machine learning perspective

References

[1] Wei, Jason W., and Kai Zou. “EDA: Easy data augmentation techniques for boosting performance on text classification tasks.” arXiv preprint arXiv:1901.11196 (2019).
[2] Anaby-Tavor, Ateret, et al. “Not Enough Data? Deep Learning to the Rescue!.” arXiv preprint arXiv:1911.03118 (2019).
[3] Hu, Zhiting, et al. “Learning Data Manipulation for Augmentation and Weighting.” Advances in Neural Information Processing Systems. 2019.
[4] Wang, William Yang, and Diyi Yang. “That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets.” Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015.
[5] Chawla, Nitesh V., et al. “SMOTE: synthetic minority over-sampling technique.” Journal of Artificial Intelligence Research 16 (2002): 321-357.
[6] Xie, Qizhe, et al. “Unsupervised data augmentation.” arXiv preprint arXiv:1904.12848 (2019).
[7] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[8] Sennrich, Rico, Barry Haddow, and Alexandra Birch. “Improving neural machine translation models with monolingual data.” arXiv preprint arXiv:1511.06709 (2015).
[9] Edunov, Sergey, et al. “Understanding back-translation at scale.” arXiv preprint arXiv:1808.09381 (2018).
[10] Yu, Adams Wei, et al. “QANet: Combining local convolution with global self-attention for reading comprehension.” arXiv preprint arXiv:1804.09541 (2018).
[11] Wei, Jason W., and Kai Zou. “EDA: Easy data augmentation techniques for boosting performance on text classification tasks.” arXiv preprint arXiv:1901.11196 (2019).
[12] Kobayashi, Sosuke. “Contextual augmentation: Data augmentation by words with paradigmatic relations.” arXiv preprint arXiv:1805.06201 (2018).
[13] Wu, Xing, et al. “Conditional BERT contextual augmentation.” International Conference on Computational Science. Springer, Cham, 2019.
[14] Liu, Ting, et al. “Generating and exploiting large-scale pseudo training data for zero pronoun resolution.” arXiv preprint arXiv:1606.01603 (2016).
[15] Hou, Yutai, et al. “Sequence-to-sequence data augmentation for dialogue language understanding.” arXiv preprint arXiv:1807.01554 (2018).
[16] Dong, Li, et al. “Learning to paraphrase for question answering.” arXiv preprint arXiv:1708.06022 (2017).
[17] Radford, Alec, et al. “Improving language understanding by generative pre-training.” (2018). https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
[18] Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI Blog 1.8 (2019): 9.
[19] Hu, Zhiting, et al. “Toward controlled generation of text.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.
[20] Guu, Kelvin, et al. “Generating sentences by editing prototypes.” Transactions of the Association for Computational Linguistics 6 (2018): 437-450.