Overview
An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction
Motivation
We tackle GEC as machine translation (MT), but the EncDec model requires a large amount of training data, so methods that augment the training data with pseudo data have been studied intensively.
However, a consensus on the experimental configurations has yet to be formed, specifically with respect to:
(i) the method of generating the pseudo data,
(ii) the seed corpus for the pseudo data, and
(iii) the optimization setting.
Problem Formulation and Notation
D: training data
X: ungrammatical source sentence
Y: grammatical target sentence
$D = \{(X_n, Y_n)\}_n$
$D_g$: genuine parallel data
$D_p$: pseudo data
$T$: seed corpus
$\Theta$: all trainable parameters of the model
Objective: to find the optimal parameter set $\hat{\Theta}$ that minimizes the following objective function (Equation 1): $L(D, \Theta) = -\frac{1}{|D|} \sum_{(X,Y)\in D} \log p(Y \mid X, \Theta)$
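As a quick illustration, here is a minimal Python sketch of Equation 1; `log_prob(X, Y)` is a hypothetical stand-in for $\log p(Y \mid X, \Theta)$ computed by the EncDec model, not the paper's implementation.

```python
# A minimal sketch of Equation 1: the negative mean log-likelihood over D.
# `log_prob(X, Y)` stands in for log p(Y | X, Theta) from the EncDec model.
def gec_loss(D, log_prob):
    return -sum(log_prob(X, Y) for X, Y in D) / len(D)

# Example with a dummy scorer that returns a constant log-probability:
# gec_loss([("a cat sat", "a cat sat")], lambda X, Y: -1.5)  ->  1.5
```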
Aspect (i): multiple methods for generating the pseudo data $D_p$ are available.
Aspect (ii): options for the seed corpus $T$ are numerous.
We compare three corpora, namely Wikipedia, Simple Wikipedia (SimpleWiki), and English Gigaword, as a first trial. Wikipedia vs. SimpleWiki: similar domains, different grammatical complexities.
Gigaword: to investigate whether clean text improves model performance.
Aspect (iii): two major settings for incorporating $D_p$ into the optimization of Equation 1 are available.
JOINT: $D = D_p \cup D_g$
PRETRAIN: use $D_p$ for pretraining, namely, minimizing $L(D_p, \Theta)$ to acquire $\Theta_0$, and then fine-tuning the model by minimizing $L(D_g, \Theta_0)$.
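A minimal sketch of the two settings, assuming a placeholder `train_on` helper (standing in for one full optimization run of Equation 1) and lists of sentence pairs for $D_p$ and $D_g$; this is illustrative, not the paper's code.

```python
# Placeholder for minimizing L(data, Theta); in the paper this is a full
# Transformer EncDec training run.
def train_on(model, data, label):
    print(f"optimizing on {len(data)} pairs ({label})")
    return model  # the updated parameters

def joint(model, D_g, D_p):
    # JOINT: a single training run on D = D_p ∪ D_g.
    return train_on(model, D_p + D_g, "JOINT")

def pretrain_then_finetune(model, D_g, D_p):
    # PRETRAIN: minimize L(D_p, Theta) to acquire Theta_0 ...
    model = train_on(model, D_p, "pretraining on D_p")
    # ... then fine-tune by minimizing L(D_g, Theta_0).
    return train_on(model, D_g, "fine-tuning on D_g")
```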
Methods for Generating Pseudo Data
BACKTRANS (NOISY) and BACKTRANS (SAMPLE)
Backtranslation: train a reverse model to generate ungrammatical sentences from grammatical ones.
BACKTRANS (NOISY): adds $r\beta_{random}$ to the score of each hypothesis in the beam for every time step.
BACKTRANS (SAMPLE): sentences are decoded by sampling from the distribution of the reverse model.
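The noisy-beam idea can be sketched as below; the noise term $\beta_{random}$ is drawn uniformly from [0, 1] as in the noisy beam search the paper builds on, while the scale r and the surrounding beam-search loop of the reverse model are assumptions/omissions here.

```python
import random

# Sketch of BACKTRANS (NOISY): at every decoding step of the reverse model,
# add r * beta_random to each hypothesis score before pruning the beam.
# r is a noise-strength hyperparameter (the default value here is arbitrary).
def perturb_beam_scores(hypothesis_scores, r=6.0):
    return [score + r * random.uniform(0.0, 1.0) for score in hypothesis_scores]

# BACKTRANS (SAMPLE), by contrast, replaces beam search with sampling from
# the reverse model's output distribution at each step.
```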
DIRECTNOISE: (i) masking with <mask>, (ii) deletion, (iii) insertion of a random token, (iv) keeping the original
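A rough sketch of the DIRECTNOISE corruption step; the probabilities and the placeholder vocabulary below are illustrative assumptions, not the paper's settings.

```python
import random

# Inject noise into each token of a grammatical sentence:
# (i) mask, (ii) delete, (iii) insert a random token, (iv) keep as-is.
VOCAB = ["the", "a", "to", "of", "and"]  # placeholder vocabulary for insertion

def direct_noise(tokens, p_mask=0.1, p_delete=0.1, p_insert=0.1):
    noised = []
    for tok in tokens:
        u = random.random()
        if u < p_mask:                              # (i) masking with <mask>
            noised.append("<mask>")
        elif u < p_mask + p_delete:                 # (ii) deletion
            continue
        elif u < p_mask + p_delete + p_insert:      # (iii) insertion
            noised.extend([random.choice(VOCAB), tok])
        else:                                       # (iv) keeping the original
            noised.append(tok)
    return noised

# Example: direct_noise("she went to the park yesterday".split())
```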
Experimental Configurations
Dataset
Model: Transformer EncDec model
Optimization: JOINT: Adam; PRETRAIN: Adam-Adafactor
Generating Pseudo Data experiment
BACKTRANS (NOISY) and DIRECTNOISE: faster pseudo-data generation and better performance.
seed corpus
The difference in F0.5 is small, which implies that the seed corpus $T$ has only a minor effect on model performance.
Gigaword gives the best score among the three corpora.
Optimization Setting
We use Wikipedia as the seed corpus
Joint Training or Pretraining:
Pretraining: can exploit more pseudo data and achieves better performance.
Amount of Pseudo Data: varied using BACKTRANS (NOISY) pseudo data; under PRETRAIN, performance improves as the amount grows.
Comparison with Current Top Models
The present experimental results show that the following configurations are effective for improving the model performance:
(i) the combination of JOINT and Gigaword,
(ii) keeping the amount of pseudo data $D_p$ not too large in JOINT, and
(iii) PRETRAIN with BACKTRANS (NOISY) using large pseudo data $D_p$.
Therefore, the best approach available is simply to pretrain the model with large (70M) BACKTRANS(NOISY) pseudo data and then fine-tune using BEA-train, which hereinafter we refer to as PRETLARGE.
We use Gigaword for the seed corpus T because it has the best performance in Table 3.
To further improve the performance, we incorporate the following techniques that are widely used in shared tasks such as BEA-2019 and WMT13:
Synthetic Spelling Error (SSE): character-level noise (see the sketch below)
Right-to-left Re-ranking: re-rank the n-best hypotheses with right-to-left models
Sentence-level Error Detection (SED): could potentially reduce the number of false-positive errors of the GEC model
However, unfortunately, incorporating SED decreased the performance on CoNLL-2014 and JFLEG. This implies that SED is sensitive to the domain of the test set, since the SED model is fine-tuned on the official validation split of the BEA dataset.
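For the SSE technique listed above, here is a minimal sketch of character-level noise injection; the four operations (delete, insert, substitute, swap) and the per-word probability are assumptions for illustration, not the paper's exact recipe.

```python
import random
import string

def spell_noise(word, p=0.1):
    # With probability p, apply one random character-level edit to the word.
    if len(word) < 2 or random.random() > p:
        return word
    i = random.randrange(len(word))
    op = random.choice(["delete", "insert", "substitute", "swap"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i:]
    if op == "substitute":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]
    j = min(i + 1, len(word) - 1)                  # swap with the next character
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

# Example: " ".join(spell_noise(w) for w in "the quick brown fox".split())
```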
Conclusion
We found the following to be effective: (i) utilizing Gigaword as the seed corpus, and (ii) pretraining the model with BACKTRANS (NOISY) data.