Overview
An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction
Motivation
We tackle GEC as machine translation (MT), but the EncDec model requires a large amount of training data, so methods that augment the training data with pseudo data have been studied intensively.
However, a consensus on the experimental configurations has yet to be formed, specifically with respect to:
(i) the method of generating the pseudo data,
(ii) the seed corpus for the pseudo data, and
(iii) the optimization setting.
Problem Formulation and Notation
D: training data
X: ungrammatical source sentence
Y: grammatical target sentence
$D = \{(X_n, Y_n)\}_n$
$D_g$: genuine parallel data
$D_p$: pseudo data
$T$: seed corpus
$\Theta$: all trainable parameters of the model
Objective: to find the optimal parameter set $\hat{\Theta}$ that minimizes the following objective function (Equation 1): $L(D, \Theta) = -\frac{1}{|D|} \sum_{(X,Y)\in D} \log p(Y \mid X, \Theta)$
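As a quick illustration, here is a minimal Python sketch of Equation 1; `log_prob(X, Y)` is a hypothetical stand-in for $\log p(Y \mid X, \Theta)$ computed by the EncDec model, not the paper's implementation.

```python
# A minimal sketch of Equation 1: the negative mean log-likelihood over D.
# `log_prob(X, Y)` stands in for log p(Y | X, Theta) from the EncDec model.
def gec_loss(D, log_prob):
    return -sum(log_prob(X, Y) for X, Y in D) / len(D)

# Example with a dummy scorer that returns a constant log-probability:
# gec_loss([("a cat sat", "a cat sat")], lambda X, Y: -1.5)  ->  1.5
```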
Aspect (i): multiple methods for generating the pseudo data $D_p$ are available.
Aspect (ii): options for the seed corpus $T$ are numerous.
We compare three corpora, namely Wikipedia, Simple Wikipedia (SimpleWiki), and English Gigaword, as a first trial. Wikipedia vs. SimpleWiki: similar domains, different grammatical complexities.
Gigaword: to investigate whether clean text improves model performance.
Aspect (iii): two major settings for incorporating $D_p$ into the optimization of Equation 1 are available.
JOINT: $D = D_p \cup D_g$
PRETRAIN: use $D_p$ for pretraining, namely, minimizing $L(D_p, \Theta)$ to acquire $\Theta_0$, and then fine-tuning the model by minimizing $L(D_g, \Theta_0)$.
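A minimal sketch of the two settings, assuming a placeholder `train_on` helper (standing in for one full optimization run of Equation 1) and lists of sentence pairs for $D_p$ and $D_g$; this is illustrative, not the paper's code.

```python
# Placeholder for minimizing L(data, Theta); in the paper this is a full
# Transformer EncDec training run.
def train_on(model, data, label):
    print(f"optimizing on {len(data)} pairs ({label})")
    return model  # the updated parameters

def joint(model, D_g, D_p):
    # JOINT: a single training run on D = D_p ∪ D_g.
    return train_on(model, D_p + D_g, "JOINT")

def pretrain_then_finetune(model, D_g, D_p):
    # PRETRAIN: minimize L(D_p, Theta) to acquire Theta_0 ...
    model = train_on(model, D_p, "pretraining on D_p")
    # ... then fine-tune by minimizing L(D_g, Theta_0).
    return train_on(model, D_g, "fine-tuning on D_g")
```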
Methods for Generating Pseudo Data
BACKTRANS (NOISY) and BACKTRANS (SAMPLE)
Backtranslation: train a reverse model to generate ungrammatical sentences from grammatical ones.
BACKTRANS (NOISY): adds $r\beta_{random}$ to the score of each hypothesis in the beam for every time step.
BACKTRANS (SAMPLE): sentences are decoded by sampling from the distribution of the reverse model.
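The noisy-beam idea can be sketched as below; the noise term $\beta_{random}$ is drawn uniformly from [0, 1] as in the noisy beam search the paper builds on, while the scale r and the surrounding beam-search loop of the reverse model are assumptions/omissions here.

```python
import random

# Sketch of BACKTRANS (NOISY): at every decoding step of the reverse model,
# add r * beta_random to each hypothesis score before pruning the beam.
# r is a noise-strength hyperparameter (the default value here is arbitrary).
def perturb_beam_scores(hypothesis_scores, r=6.0):
    return [score + r * random.uniform(0.0, 1.0) for score in hypothesis_scores]

# BACKTRANS (SAMPLE), by contrast, replaces beam search with sampling from
# the reverse model's output distribution at each step.
```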
DIRECTNOISE: (i) masking with <mask>, (ii) deletion, (iii) insertion of a random token, (iv) keeping the original
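A rough sketch of the DIRECTNOISE corruption step; the probabilities and the placeholder vocabulary below are illustrative assumptions, not the paper's settings.

```python
import random

# Inject noise into each token of a grammatical sentence:
# (i) mask, (ii) delete, (iii) insert a random token, (iv) keep as-is.
VOCAB = ["the", "a", "to", "of", "and"]  # placeholder vocabulary for insertion

def direct_noise(tokens, p_mask=0.1, p_delete=0.1, p_insert=0.1):
    noised = []
    for tok in tokens:
        u = random.random()
        if u < p_mask:                              # (i) masking with <mask>
            noised.append("<mask>")
        elif u < p_mask + p_delete:                 # (ii) deletion
            continue
        elif u < p_mask + p_delete + p_insert:      # (iii) insertion
            noised.extend([random.choice(VOCAB), tok])
        else:                                       # (iv) keeping the original
            noised.append(tok)
    return noised

# Example: direct_noise("she went to the park yesterday".split())
```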
Experimental Configurations
Dataset
Model: Transformer EncDec model
Optimization: JOINT: Adam; PRETRAIN: Adam-Adafactor
Generating Pseudo Data experiment
BACKTRANS (NOISY) and DIRECTNOISE: faster pseudo-data generation and better performance.
seed corpus
The difference in F0.5 is small, which implies that the seed corpus $T$ has only a minor effect on model performance.
Gigaword gives the best score among the three corpora.
Optimization Setting
We use Wikipedia as the seed corpus
Joint Training or Pretraining:
Pretraining: can exploit more pseudo data and achieves better performance.
Amount of Pseudo Data: varied using BACKTRANS (NOISY) pseudo data; under PRETRAIN, performance improves as the amount grows.
Comparison with Current Top Models
The present experimental results show that the following configurations are effective for improving the model performance:
(i) the combination of JOINT and Gigaword,
(ii) keeping the amount of pseudo data $D_p$ not too large in JOINT, and
(iii) PRETRAIN with BACKTRANS (NOISY) using large pseudo data $D_p$.
Therefore, the best approach available is simply to pretrain the model with large (70M) BACKTRANS(NOISY) pseudo data and then fine-tune using BEA-train, which hereinafter we refer to as PRETLARGE.
We use Gigaword for the seed corpus T because it has the best performance in Table 3.
To further improve the performance, we incorporate the following techniques that are widely used in shared tasks such as BEA-2019 and WMT13:
Synthetic Spelling Error (SSE): character-level noise (see the sketch below)
Right-to-left Re-ranking: re-rank the n-best hypotheses with right-to-left models
Sentence-level Error Detection (SED): could potentially reduce the number of false-positive errors of the GEC model
However, unfortunately, incorporating SED decreased the performance on CoNLL-2014 and JFLEG. This implies that SED is sensitive to the domain of the test set, since the SED model is fine-tuned on the official validation split of the BEA dataset.
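For the SSE technique listed above, here is a minimal sketch of character-level noise injection; the four operations (delete, insert, substitute, swap) and the per-word probability are assumptions for illustration, not the paper's exact recipe.

```python
import random
import string

def spell_noise(word, p=0.1):
    # With probability p, apply one random character-level edit to the word.
    if len(word) < 2 or random.random() > p:
        return word
    i = random.randrange(len(word))
    op = random.choice(["delete", "insert", "substitute", "swap"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i:]
    if op == "substitute":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]
    j = min(i + 1, len(word) - 1)                  # swap with the next character
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

# Example: " ".join(spell_noise(w) for w in "the quick brown fox".split())
```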
Conclusion
We found the following to be effective: (i) utilizing Gigaword as the seed corpus, and (ii) pretraining the model with BACKTRANS (NOISY) data.