
Overview

An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction

Motivation

GEC is treated as a machine translation (MT) problem, but EncDec models require a large amount of training data, so augmenting the genuine data with pseudo training data has been studied intensively.

However, a consensus on the experimental configurations is yet to be formulated, particularly regarding:

(i) the method of generating the pseudo data,

(ii) the seed corpus for the pseudo data, and

(iii) the optimization setting.

Problem Formulation and Notation

D: training data

X: ungrammatical source sentence

Y: grammatical target sentence

$D = \{(X_n, Y_n)\}_{n}$

$D_g$: genuine parallel data

$D_p$: pseudo data

$T$: seed corpus

$\Theta$: all trainable parameters of the model

Objective: find the optimal parameter set $\hat{\Theta}$ that minimizes the following objective function: $L(D, \Theta) = -\frac{1}{|D|} \sum_{(X,Y) \in D} \log p(Y \mid X, \Theta)$
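
As a concrete reference, below is a minimal PyTorch sketch of this objective for a single batch of sentence pairs. The `model(src, tgt_in)` interface returning per-token vocabulary logits under teacher forcing is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def gec_nll(model, pairs):
    """Sketch of L(D, Theta) = -1/|D| * sum_{(X, Y) in D} log p(Y | X, Theta)."""
    total_log_prob = torch.tensor(0.0)
    for src, tgt in pairs:                               # (X, Y): ungrammatical -> grammatical, as token-id tensors
        logits = model(src, tgt[:-1])                    # teacher forcing: (len(Y)-1, |V|) logits (assumed interface)
        log_probs = F.log_softmax(logits, dim=-1)
        gold = tgt[1:].unsqueeze(-1)                     # gold next tokens
        # log p(Y | X, Theta) factorizes into a sum of per-token log-probabilities
        total_log_prob = total_log_prob + log_probs.gather(-1, gold).sum()
    return -total_log_prob / len(pairs)                  # average over the |D| sentence pairs
```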

Aspect (i): multiple methods for generating pseudo data $D_p$ are available.
Aspect (ii): options for the seed corpus $T$ are numerous.

We compare three corpora as a first trial: Wikipedia, Simple Wikipedia (SimpleWiki), and English Gigaword. Wikipedia vs. SimpleWiki: similar domains, different grammatical complexities.

Gigaword: used to investigate whether clean text improves model performance.

Aspect (iii): two major settings for incorporating $D_p$ into the optimization of Equation 1 are available.

JOINT: $D = D_p \cup D_g$
PRETRAIN: use $D_p$ for pretraining, namely, minimize $L(D_p, \Theta)$ to acquire $\Theta_0$, and then fine-tune the model by minimizing $L(D_g, \Theta_0)$.
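
The two settings differ only in which data each optimization run sees. A minimal sketch, assuming a generic `train` callable that minimizes $L(\cdot, \Theta)$ and returns updated parameters (the callable and type names are placeholders, not from the paper):

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                              # (ungrammatical X, grammatical Y)
TrainFn = Callable[[List[Pair], object], object]    # minimizes L(data, theta) and returns new theta

def joint_setting(train: TrainFn, d_p: List[Pair], d_g: List[Pair], theta_init):
    """JOINT: a single optimization of L(D, Theta) on D = D_p ∪ D_g."""
    return train(d_p + d_g, theta_init)

def pretrain_setting(train: TrainFn, d_p: List[Pair], d_g: List[Pair], theta_init):
    """PRETRAIN: minimize L(D_p, Theta) to obtain Theta_0, then minimize L(D_g, Theta_0)."""
    theta_0 = train(d_p, theta_init)                # pretraining on pseudo data only
    return train(d_g, theta_0)                      # fine-tuning on genuine data
```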

Methods for Generating Pseudo Data

BACKTRANS (NOISY) and BACKTRANS (SAMPLE)

Backtranslation: train a reverse model (grammatical → ungrammatical) to generate the ungrammatical source sentences.

BACKTRANS (NOISY): adds $r \beta_{random}$ to the score of each hypothesis in the beam at every time step, where $r$ is a random value and $\beta_{random}$ is a hyperparameter controlling the noise strength.

BACKTRANS (SAMPLE): sentences are decoded by sampling from the distribution of the reverse model.
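
A rough sketch of where the two variants differ at decoding time, assuming we already have the reverse model's per-hypothesis beam scores and its next-token distribution; the Uniform(0, 1) draw for $r$ and the function names are illustrative assumptions:

```python
import random

def noisy_beam_scores(hypothesis_scores, beta_random):
    """BACKTRANS (NOISY): add r * beta_random to every hypothesis score at each time step."""
    # r is drawn independently per hypothesis; beta_random controls the noise magnitude.
    return [score + random.random() * beta_random for score in hypothesis_scores]

def sample_next_token(next_token_probs):
    """BACKTRANS (SAMPLE): draw the next token from the reverse model's output distribution."""
    tokens, probs = zip(*next_token_probs.items())
    return random.choices(tokens, weights=probs, k=1)[0]
```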

DIRECTNOISE: directly injects noise into each token of a grammatical sentence by probabilistically choosing one of (i) masking with <mask>, (ii) deletion, (iii) insertion of a random token, or (iv) keeping the original.
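
A rough sketch of this procedure; the per-operation probabilities and the vocabulary argument are illustrative assumptions, not the paper's actual settings:

```python
import random

def direct_noise(tokens, vocab, p_mask=0.1, p_del=0.1, p_ins=0.1):
    """Inject noise into a grammatical sentence, token by token."""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < p_mask:                          # (i) mask the token
            noised.append("<mask>")
        elif r < p_mask + p_del:                # (ii) delete the token
            continue
        elif r < p_mask + p_del + p_ins:        # (iii) insert a random token before it
            noised.append(random.choice(vocab))
            noised.append(tok)
        else:                                   # (iv) keep the original token
            noised.append(tok)
    return noised

# Example: direct_noise("She like playing tennis .".split(), vocab=["the", "a", "of"])
```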

Experimental Configurations

Dataset: BEA-train as the genuine parallel data $D_g$; evaluation on the BEA benchmarks, CoNLL-2014, and JFLEG.

Model: Transformer EncDec model

Optimization: JOINT uses Adam; PRETRAIN uses Adam for pretraining and Adafactor for fine-tuning.

Experiment: Generating Pseudo Data

BACKTRANS (NOISY) and DIRECTNOISE: improve faster and perform better than BACKTRANS (SAMPLE).

Seed Corpus

The difference in $F_{0.5}$ is small, which implies that the seed corpus $T$ has only a minor effect on the model performance.

Gigaword nevertheless gives the best score among the three seed corpora.

Optimization Setting

We use Wikipedia as the seed corpus in these experiments.

Joint Training or Pretraining:

Pretraining: can exploit more pseudo data and yields better performance than joint training.

Amount of Pseudo Data: with BACKTRANS (NOISY), performance keeps improving as the amount of pseudo data increases.

Comparison with Current Top Models

The present experimental results show that the following configurations are effective for improving the model performance:

(i) the combination of JOINT and Gigaword,

(ii) the amount of pseudo data $D_p$ not being too large in JOINT,
(iii) PRETRAIN with BACKTRANS (NOISY) using large pseudo data $D_p$.

Therefore, the best approach available is simply to pretrain the model with large (70M) BACKTRANS (NOISY) pseudo data and then fine-tune using BEA-train, which hereinafter we refer to as PRETLARGE.

We use Gigaword for the seed corpus $T$ because it has the best performance in Table 3.

To further improve the performance, we incorporate the following techniques that are widely used in shared tasks such as BEA-2019 and WMT13:

Synthetic Spelling Error (SSE): character-level noise; a rough character-noise sketch follows the SED discussion below.

Right-to-left Re-ranking (R2L): re-rank the n-best hypotheses of the left-to-right model with models trained in the right-to-left direction.

Sentence-level Error Detection (SED): SED could potentially reduce the number of false-positive errors of the GEC model.

However, unfortunately, incorporating SED decreased the performance on CoNLL-2014 and JFLEG. This implies that SED is sensitive to the domain of the test set, since the SED model is fine-tuned with the official validation split of the BEA dataset.
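
For reference, here is a rough sketch of the kind of character-level spelling noise SSE refers to; the specific edit operations and the error rate used here are assumptions, not the paper's exact settings:

```python
import random
import string

def synthetic_spelling_errors(sentence, p_err=0.03):
    """Probabilistically corrupt characters: delete, duplicate, swap, or substitute."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c != " " and random.random() < p_err:
            op = random.choice(["delete", "duplicate", "swap", "substitute"])
            if op == "delete":
                pass                                    # drop the character
            elif op == "duplicate":
                out.extend([c, c])                      # repeat the character
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])           # transpose with the next character
                i += 1
            else:
                out.append(random.choice(string.ascii_lowercase))  # substitute a random letter
        else:
            out.append(c)
        i += 1
    return "".join(out)
```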

Conclusion

We found the following to be effective: (i) utilizing Gigaword as the seed corpus, and (ii) pretraining the model with BACKTRANS (NOISY) data.
