
This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.

CBMM Memo No. 052 August, 2016

Universal Dependencies for Learner English

Published in the Proceedings of ACL 2016

Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza and Boris Katz

We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error corrected versions of each sentence. Further on, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research on second language acquisition as well as automatic processing of ungrammatical language.


Universal Dependencies for Learner English

Yevgeni Berzak, CSAIL MIT
Jessica Kenney, EECS & Linguistics MIT
Carolyn Spadine, Linguistics MIT
Jing Xian Wang, EECS MIT
Lucia Lam, MECHE MIT
Keiko Sophie Mori, Linguistics MIT
Sebastian Garza, Linguistics MIT
Boris Katz, CSAIL MIT

Abstract

We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error corrected versions of each sentence. Further on, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research on second language acquisition as well as automatic processing of ungrammatical language1.

1 Introduction

The majority of the English text available worldwide is generated by non-native speakers (Crystal, 2003). Such texts introduce a variety of challenges, most notably grammatical errors, and are of paramount importance for the scientific study of language acquisition as well as for NLP.

1The treebank is available at universaldependencies.org. The annotation manual used in this project and a graphical query engine are available at esltreebank.org.

Despite the ubiquity of non-native English, there is currently no publicly available syntactic treebank for English as a Second Language (ESL).

To address this shortcoming, we present the Treebank of Learner English (TLE), a first of its kind resource for non-native English, containing 5,124 sentences manually annotated with POS tags and dependency trees. The TLE sentences are drawn from the FCE dataset (Yannakoudakis et al., 2011), and authored by English learners from 10 different native language backgrounds. The treebank uses the Universal Dependencies (UD) formalism (De Marneffe et al., 2014; Nivre et al., 2016), which provides a unified annotation framework across different languages and is geared towards multilingual NLP (McDonald et al., 2013). This characteristic allows our treebank to support computational analysis of ESL using not only English based but also multilingual approaches which seek to relate ESL phenomena to native language syntax.

While the annotation inventory and guidelines are defined by the English UD formalism, we build on previous work in learner language analysis (Díaz-Negrillo et al., 2010; Dickinson and Ragheb, 2013) to formulate an additional set of annotation conventions aiming at a uniform treatment of ungrammatical learner language. Our annotation scheme uses a two-layer analysis, whereby a distinct syntactic annotation is provided for the original and the corrected version of each sentence. This approach is enabled by a pre-existing error annotation of the FCE (Nicholls, 2003) which is used to generate an error corrected variant of the dataset. Our inter-annotator agreement results provide evidence for the ability of the annotation scheme to support consistent annotation of ungrammatical structures.


Finally, a corpus that is annotated with both grammatical errors and syntactic dependencies paves the way for empirical investigation of the relation between grammaticality and syntax. Understanding this relation is vital for improving tagging and parsing performance on learner language (Geertzen et al., 2013), syntax based grammatical error correction (Tetreault et al., 2010; Ng et al., 2014), and many other fundamental challenges in NLP. In this work, we take the first step in this direction by benchmarking tagging and parsing accuracy on our dataset under different training regimes, and obtaining several estimates for the impact of grammatical errors on these tasks.

To summarize, this paper presents three contributions. First, we introduce the first large scale syntactic treebank for ESL, manually annotated with POS tags and universal dependencies. Second, we describe a linguistically motivated annotation scheme for ungrammatical learner English and provide empirical support for its consistency via inter-annotator agreement analysis. Third, we benchmark a state of the art parser on our dataset and estimate the influence of grammatical errors on the accuracy of automatic POS tagging and dependency parsing.

The remainder of this paper is structured as follows. We start by presenting an overview of the treebank in section 2. In sections 3 and 4 we provide background information on the annotation project, and review the main annotation stages leading to the current form of the dataset. The ESL annotation guidelines are summarized in section 5. Inter-annotator agreement analysis is presented in section 6, followed by parsing experiments in section 7. Finally, we review related work in section 8 and present the conclusion in section 9.

2 Treebank Overview

The TLE currently contains 5,124 sentences (97,681 tokens) with POS tag and dependency annotations in the English Universal Dependencies (UD) formalism (De Marneffe et al., 2014; Nivre et al., 2016). The sentences were obtained from the FCE corpus (Yannakoudakis et al., 2011), a collection of upper intermediate English learner essays, containing error annotations with 75 error categories (Nicholls, 2003). Sentence level segmentation was performed using an adaptation of the NLTK sentence tokenizer2. Under-segmented sentences were split further manually. Word level tokenization was generated using the Stanford PTB word tokenizer3.

2http://www.nltk.org/api/nltk.tokenize.html
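The following is a minimal Python sketch of this preprocessing step, with off-the-shelf NLTK components standing in for the adapted NLTK sentence tokenizer and the Stanford PTB word tokenizer used for the TLE; the manual splitting of under-segmented sentences is not modeled.

# Minimal sketch of the segmentation and tokenization pipeline described above.
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

nltk.download("punkt", quiet=True)          # model for sentence segmentation
ptb_tokenizer = TreebankWordTokenizer()     # PTB-style word tokenization

def segment_and_tokenize(essay_text):
    """Split an essay into sentences, then into PTB-style word tokens."""
    return [ptb_tokenizer.tokenize(sentence)
            for sentence in sent_tokenize(essay_text)]

print(segment_and_tokenize("That time I had to sleep in tent. It was cold."))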

The treebank represents learners with 10 different native language backgrounds: Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish, Russian and Turkish. For every native language, we randomly sampled 500 automatically segmented sentences, under the constraint that selected sentences have to contain at least one grammatical error that is not punctuation or spelling.

The TLE annotations are provided in two versions. The first version is the original sentence authored by the learner, containing grammatical errors. The second, corrected sentence version, is a grammatical variant of the original sentence, generated by correcting all the grammatical errors in the sentence according to the manual error annotation provided in the FCE dataset. The resulting corrected sentences constitute a parallel corpus of standard English. Table 1 presents basic statistics of both versions of the annotated sentences.

                     original            corrected
sentences            5,124               5,124
tokens               97,681              98,976
sentence length      19.06 (std 9.47)    19.32 (std 9.59)
errors per sentence  2.67 (std 1.9)      -
authors              924
native languages     10

Table 1: Statistics of the TLE. Standard deviations are denoted in parentheses.
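The corrected sentences are derived from the FCE error markup illustrated in section 4.1 (spans of the form <ns type="..."><i>...</i><c>...</c></ns>, where <i> holds the erroneous text and <c> the correction). The following is a hedged Python sketch of how the two sentence versions can be read off such an error coded string; it assumes non-nested error spans and is illustrative only, since the released treebank applies the manual FCE error annotation directly.

import re

NS_PATTERN = re.compile(r'<ns[^>]*>(.*?)</ns>', re.DOTALL)
I_PATTERN = re.compile(r'<i>(.*?)</i>', re.DOTALL)
C_PATTERN = re.compile(r'<c>(.*?)</c>', re.DOTALL)

def _resolve(span_body, keep):
    """Return the <i> or <c> content of an error span (empty if absent)."""
    match = keep.search(span_body)
    return match.group(1) if match else ""

def split_versions(error_coded_sentence):
    """Produce the original and error corrected strings from one #SENT line."""
    original = NS_PATTERN.sub(lambda m: _resolve(m.group(1), I_PATTERN),
                              error_coded_sentence)
    corrected = NS_PATTERN.sub(lambda m: _resolve(m.group(1), C_PATTERN),
                               error_coded_sentence)
    squeeze = lambda s: re.sub(r"\s+", " ", s).strip()  # tidy leftover spaces
    return squeeze(original), squeeze(corrected)

sent = 'That time I had to sleep in <ns type="MD"><c>a</c></ns> tent.'
print(split_versions(sent))
# ('That time I had to sleep in tent.', 'That time I had to sleep in a tent.')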

To avoid potential annotation biases, the annotations of the treebank were created manually from scratch, without utilizing any automatic annotation tools. To further assure annotation quality, each annotated sentence was reviewed by two additional annotators. To the best of our knowledge, TLE is the first large scale English treebank constructed in a completely manual fashion.

3 Annotator Training

The treebank was annotated by six students, five undergraduates and one graduate. Among the undergraduates, three are linguistics majors and two are engineering majors with a linguistics minor. The graduate student is a linguist specializing in syntax. An additional graduate student in NLP participated in the final debugging of the dataset.

3http://nlp.stanford.edu/software/tokenizer.shtml


Prior to annotating the treebank sentences, the annotators were trained for about 8 weeks. During the training, the annotators attended tutorials on dependency grammars, and learned the English UD guidelines4, the Penn Treebank POS guidelines (Santorini, 1990), the grammatical error annotation scheme of the FCE (Nicholls, 2003), as well as the ESL guidelines described in section 5 and in the annotation manual.

Furthermore, the annotators completed six annotation exercises, in which they were required to annotate POS tags and dependencies for practice sentences from scratch. The exercises were done individually, and were followed by group meetings in which annotation disagreements were discussed and resolved. Each of the first three exercises consisted of 20 sentences from the UD gold standard for English, the English Web Treebank (EWT) (Silveira et al., 2014). The remaining three exercises contained 20-30 ESL sentences from the FCE. Many of the ESL guidelines were introduced or refined based on the disagreements in the ESL practice exercises and the subsequent group discussions. Several additional guidelines were introduced in the course of the annotation process.

During the training period, the annotators also learned to use a search tool that enables formulating queries over word and POS tag sequences as regular expressions and obtaining their annotation statistics in the EWT. After experimenting with both textual and graphical interfaces for performing the annotations, we converged on a simple text based format described in section 4.1, where the annotations were filled in using a spreadsheet or a text editor, and tested with a script for detecting annotation typos. The annotators continued to meet and discuss annotation issues on a weekly basis throughout the entire duration of the project.

4 Annotation Procedure

The formation of the treebank was carried out in four steps: annotation, review, disagreement resolution and targeted debugging.

4.1 Annotation

In the first stage, the annotators were given sentences for annotation from scratch. We use a CoNLL based textual template in which each word is annotated in a separate line. Each line contains 6 columns, the first of which has the word index (IND) and the second the word itself (WORD). The remaining four columns had to be filled in with a Universal POS tag (UPOS), a Penn Treebank POS tag (POS), a head word index (HIND) and a dependency relation (REL) according to version 1 of the English UD guidelines.

4http://universaldependencies.org/#en

The annotation section of the sentence is preceded by a metadata header. The first field in this header, denoted with SENT, contains the FCE error coded version of the sentence. The annotators were instructed to verify the error annotation, and add new error annotations if needed. Corrections to the sentence segmentation are specified in the SEGMENT field5. Further down, the field TYPO is designated for literal annotation of spelling errors and ill formed words that happen to form valid words (see section 5.2).

The example below presents a pre-annotated original sentence given to an annotator.

#SENT=That time I had to sleep in <ns type="MD"><c>a</c></ns> tent.
#SEGMENT=
#TYPO=

#IND WORD UPOS POS HIND REL
1 That
2 time
3 I
4 had
5 to
6 sleep
7 in
8 tent
9 .

Upon completion of the original sentence, the annotators proceeded to annotate the corrected sentence version. To reduce annotation time, annotators used a script that copies over annotations from the original sentence and updates head indices of tokens that appear in both sentence versions. Head indices and relation labels were filled in only if the head word of the token appeared in both the original and corrected sentence versions. Tokens with automatically filled annotations included an additional # sign in a seventh column of each word's annotation. The # signs had to be removed, and the corresponding annotations either approved or changed as appropriate. Tokens that did not appear in the original sentence version were annotated from scratch.
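Below is a hedged Python sketch of a reader for this template; the field names and dictionary layout are illustrative rather than the released file format, and columns are assumed to be whitespace separated.

# Illustrative reader for the annotation template shown above: a metadata
# header (#SENT, #SEGMENT, #TYPO), a column header line (#IND ...), and one
# token per line with up to seven whitespace-separated columns, where a
# seventh "#" column marks automatically copied annotations.
def read_template(lines):
    metadata, tokens = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#IND"):               # column header line
            continue
        if line.startswith("#"):                  # metadata fields
            key, _, value = line[1:].partition("=")
            metadata[key] = value
            continue
        fields = line.split()
        token = dict(zip(["ind", "word", "upos", "pos", "hind", "rel"], fields))
        token["auto_filled"] = len(fields) > 6 and fields[6] == "#"
        tokens.append(token)
    return metadata, tokens

example = [
    '#SENT=That time I had to sleep in <ns type="MD"><c>a</c></ns> tent.',
    "#SEGMENT=",
    "#TYPO=",
    "#IND WORD UPOS POS HIND REL",
    "1 That",
    "2 time",
]
meta, toks = read_template(example)
print(meta["SENT"], toks[0])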

5The released version of the treebank splits the sentences according to the markings in the SEGMENT field when those apply both to the original and corrected versions of the sentence. Resulting segments without grammatical errors in the original version are currently discarded.


4.2 Review

All annotated sentences were randomly assigned to a second annotator (henceforth reviewer), in a double blind manner. The reviewer's task was to mark all the annotations that they would have annotated differently. To assist the review process, we compiled a list of common annotation errors, available in the released annotation manual.

The annotations were reviewed using an active editing scheme in which an explicit action was required for all the existing annotations. The scheme was introduced to prevent reviewers from overlooking annotation issues due to passive approval. Specifically, an additional # sign was added at the seventh column of each token's annotation. The reviewer then had to either "sign off" on the existing annotation by erasing the # sign, or provide an alternative annotation following the # sign.

4.3 Disagreement Resolution

In the final stage of the annotation process all annotator-reviewer disagreements were resolved by a third annotator (henceforth judge), whose main task was to decide in favor of the annotator or the reviewer. Similarly to the review process, the judging task was carried out in a double blind manner. Judges were allowed to resolve annotator-reviewer disagreements with a third alternative, as well as introduce new corrections for annotation issues overlooked by the reviewers.

Another task performed by the judges was to mark acceptable alternative annotations for ambiguous structures determined through review disagreements or otherwise present in the sentence. These annotations were specified in an additional metadata field called AMBIGUITY. The ambiguity markings are provided along with the resolved version of the annotations.

4.4 Final Debugging

After applying the resolutions produced by the judges, we queried the corpus with debugging tests for specific linguistic constructions. This additional testing phase further reduced the number of annotation errors and inconsistencies in the treebank. Including the training period, the treebank creation lasted over a year, with an aggregate of more than 2,000 annotation hours.

5 Annotation Scheme for ESL

Our annotations use the existing inventory of English UD POS tags and dependency relations, and follow the standard UD annotation guidelines for English. However, these guidelines were formulated with grammatical usage of English in mind and do not cover non canonical syntactic structures arising due to grammatical errors6. To encourage consistent and linguistically motivated annotation of such structures, we formulated a complementary set of ESL annotation guidelines.

Our ESL annotation guidelines follow the general principle of literal reading, which emphasizes syntactic analysis according to the observed language usage. This strategy continues a line of work in SLA which advocates for centering analysis of learner language around morpho-syntactic surface evidence (Ragheb and Dickinson, 2012; Dickinson and Ragheb, 2013). Similarly to our framework, which includes a parallel annotation of corrected sentences, such strategies are often presented in the context of multi-layer annotation schemes that also account for error corrected sentence forms (Hirschmann et al., 2007; Díaz-Negrillo et al., 2010; Rosen et al., 2014).

Deploying a strategy of literal annotation within UD, a formalism which enforces cross-linguistic consistency of annotations, will enable meaningful comparisons between non-canonical structures in English and canonical structures in the author's native language. As a result, a key novel characteristic of our treebank is its ability to support cross-lingual studies of learner language.

5.1 Literal Annotation

With respect to POS tagging, literal annotation implies adhering as much as possible to the observed morphological forms of the words. Syntactically, argument structure is annotated according to the usage of the word rather than its typical distribution in the relevant context. The following list of conventions defines the notion of literal reading for some of the common non canonical structures associated with grammatical errors.

Argument Structure

Extraneous prepositions We annotate all nominal dependents introduced by extraneous prepositions as nominal modifiers.

6The English UD guidelines do address several issues encountered in informal genres, such as the relation "goeswith", which is used for fragmented words resulting from typos.


In the following sentence, "him" is marked as a nominal modifier (nmod) instead of an indirect object (iobj) of "give".

#SENT=...I had to give <ns type="UT"><i>to</i></ns> him water...

...
21 I PRON PRP 22 nsubj
22 had VERB VBD 5 parataxis
23 to PART TO 24 mark
24 give VERB VB 22 xcomp
25 to ADP IN 26 case
26 him PRON PRP 24 nmod
27 water NOUN NN 24 dobj
...

Omitted prepositions We treat nominal dependents of a predicate that are lacking a preposition as arguments rather than nominal modifiers. In the example below, "money" is marked as a direct object (dobj) instead of a nominal modifier (nmod) of "ask". As "you" functions in this context as a second argument of "ask", it is annotated as an indirect object (iobj) instead of a direct object (dobj).

#SENT=...I have to ask you <ns type="MT"><c>for</c></ns> the money <ns type="RT"><i>of</i><c>for</c></ns> the tickets back.

...
12 I PRON PRP 13 nsubj
13 have VERB VBP 2 conj
14 to PART TO 15 mark
15 ask VERB VB 13 xcomp
16 you PRON PRP 15 iobj
17 the DET DT 18 det
18 money NOUN NN 15 dobj
19 of ADP IN 21 case
20 the DET DT 21 det
21 tickets NOUN NNS 18 nmod
22 back ADV RB 15 advmod
23 . PUNCT . 2 punct
...

Tense

Cases of erroneous tense usage are annotated according to the morphological tense of the verb. For example, below we annotate "shopping" with present participle VBG, while the correction "shop" is annotated in the corrected version of the sentence as VBP.

#SENT=...when you <ns type="TV"><i>shopping</i><c>shop</c></ns>...

...
4 when ADV WRB 6 advmod
5 you PRON PRP 6 nsubj
6 shopping VERB VBG 12 advcl
...

Word Formation

Erroneous word formations that are contextually plausible and can be assigned with a PTB tag are annotated literally. In the following example, "stuffs" is handled as a plural count noun.

#SENT=...into fashionable <ns type="CN"><i>stuffs</i><c>stuff</c></ns>...

...
7 into ADP IN 9 case
8 fashionable ADJ JJ 9 amod
9 stuffs NOUN NNS 2 ccomp
...

Similarly, in the example below we annotate "necessaryiest" as a superlative.

#SENT=The necessaryiest things...

1 The DET DT 3 det
2 necessaryiest ADJ JJS 3 amod
3 things NOUN NNS 0 root
...

5.2 Exceptions to Literal Annotation

Although our general annotation strategy for ESL follows literal sentence readings, several types of word formation errors make such readings uninformative or impossible, essentially forcing certain words to be annotated using some degree of interpretation (Rosen and De Smedt, 2010). We hence annotate the following cases in the original sentence according to an interpretation of an intended word meaning, obtained from the FCE error correction.

Spelling

Spelling errors are annotated according to the correctly spelled version of the word. To support error analysis of automatic annotation tools, misspelled words that happen to form valid words are annotated in the metadata field TYPO for POS tags with respect to the most common usage of the misspelled word form. In the example below, the TYPO field contains the typical POS annotation of "where", which is clearly unintended in the context of the sentence.

#SENT=...we <ns type="SX"><i>where</i><c>were</c></ns> invited to visit...
#TYPO=5 ADV WRB

...
4 we PRON PRP 6 nsubjpass
5 where AUX VBD 6 auxpass
6 invited VERB VBN 0 root
7 to PART TO 8 mark
8 visit VERB VB 6 xcomp
...

Word Formation

Erroneous word formations that cannot be assigned with an existing PTB tag are annotated with respect to the correct word form.

#SENT=I am <ns type="IV"><i>writting</i><c>writing</c></ns>...

1 I PRON PRP 3 nsubj
2 am AUX VBP 3 aux
3 writting VERB VBG 0 root
...

In particular, ill formed adjectives that have a plural suffix receive a standard adjectival POS tag. When applicable, such cases also receive an additional marking for unnecessary agreement in the error annotation using the attribute "ua".

#SENT=...<ns type="IJ" ua=true><i>interestings</i><c>interesting</c></ns> things...

...
6 interestings ADJ JJ 7 amod
7 things NOUN NNS 3 dobj
...

Wrong word formations that result in a valid, but contextually implausible word form are also annotated according to the word correction. In the example below, the nominal form "sale" is likely to be an unintended result of an ill formed verb. Similarly to spelling errors that result in valid words, we mark the typical literal POS annotation in the TYPO metadata field.

#SENT=...they do not <ns type="DV"><i>sale</i><c>sell</c></ns> them...
#TYPO=15 NOUN NN

...
12 they PRON PRP 15 nsubj
13 do AUX VBP 15 aux
14 not PART RB 15 neg
15 sale VERB VB 0 root
16 them PRON PRP 15 dobj
...

Taken together, our ESL conventions cover many of the annotation challenges related to grammatical errors present in the TLE. In addition to the presented overview, the complete manual of ESL guidelines used by the annotators is publicly available. The manual contains further details on our annotation scheme, additional annotation guidelines and a list of common annotation errors. We plan to extend and refine these guidelines in future releases of the treebank.

6 Editing Agreement

We utilize our two step review process to estimate agreement rates between annotators7. We measure agreement as the fraction of annotation tokens approved by the editor. Table 2 presents the agreement between annotators and reviewers, as well as the agreement between reviewers and the judges. Agreement measurements are provided for both the original and the corrected versions of the dataset.

Overall, the results indicate a high agreement rate in the two editing tasks. Importantly, the gap between the agreement on the original and corrected sentences is small. Note that this result is obtained despite the introduction of several ESL annotation guidelines in the course of the annotation process, which inevitably increased the number of edits related to grammatical errors. We interpret this outcome as evidence for the effectiveness of the ESL annotation scheme in supporting consistent annotations of learner language.

7All experimental results on agreement and parsing exclude punctuation tokens.

                      UPOS    POS     HIND    REL
Annotator-Reviewer
  original            98.83   98.35   97.74   96.98
  corrected           99.02   98.61   97.97   97.20
Reviewer-Judge
  original            99.72   99.68   99.37   99.15
  corrected           99.80   99.77   99.45   99.28

Table 2: Inter-annotator agreement on the entire TLE corpus. Agreement is measured as the fraction of tokens that remain unchanged after an editing round. The four evaluation columns correspond to universal POS tags, PTB POS tags, unlabeled attachment, and dependency labels. Cohen's Kappa scores (Cohen, 1960) for POS tags and dependency labels in all evaluation conditions are above 0.96.
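The sketch below illustrates how such figures can be computed from two parallel annotation passes; the token labels are invented for illustration, and scikit-learn's cohen_kappa_score is used for the chance-corrected statistic.

# Agreement between two editing rounds, computed per annotation column
# (e.g. UPOS, PTB POS, head index or dependency label); punctuation tokens
# are assumed to be filtered out beforehand.
from sklearn.metrics import cohen_kappa_score

def agreement(before, after):
    """Fraction of annotation values left unchanged by an editing round."""
    return sum(b == a for b, a in zip(before, after)) / len(before)

# Illustrative per-token UPOS labels from an annotator pass and a reviewer pass.
annotator = ["PRON", "VERB", "ADP", "NOUN", "NOUN"]
reviewer = ["PRON", "VERB", "ADP", "PROPN", "NOUN"]
print(agreement(annotator, reviewer))            # raw agreement: 0.8
print(cohen_kappa_score(annotator, reviewer))    # chance-corrected agreement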

7 Parsing Experiments

The TLE enables studying parsing for learner language and exploring relationships between grammatical errors and parsing performance. Here, we present parsing benchmarks on our dataset, and provide several estimates for the extent to which grammatical errors degrade the quality of automatic POS tagging and dependency parsing.

Our first experiment measures tagging and parsing accuracy on the TLE and approximates the global impact of grammatical errors on automatic annotation via performance comparison between the original and error corrected sentence versions. In this, and subsequent experiments, we utilize version 2.2 of the Turbo tagger and Turbo parser (Martins et al., 2013), state of the art tools for statistical POS tagging and dependency parsing.

Table 3 presents tagging and parsing results on a test set of 500 TLE sentences (9,591 original tokens, 9,700 corrected tokens). Results are provided for three different training regimes. The first regime uses the training portion of version 1.3 of the EWT, the UD English treebank, containing 12,543 sentences (204,586 tokens). The second training mode uses 4,124 training sentences (78,541 original tokens, 79,581 corrected tokens) from the TLE corpus. In the third setup we combine these two training corpora. The remaining 500 TLE sentences (9,549 original tokens, 9,695 corrected tokens) are allocated to a development set, not used in this experiment. Parsing of the test sentences was performed on predicted POS tags.

The EWT training regime, which uses out of domain texts written in standard English, provides the lowest performance on all the evaluation metrics.


Test set   Train set      UPOS    POS     UAS     LA      LAS
TLEorig    EWT            91.87   94.28   86.51   88.07   81.44
TLEcorr    EWT            92.90   95.17   88.37   89.74   83.80
TLEorig    TLEorig        95.88   94.94   87.71   89.26   83.40
TLEcorr    TLEcorr        96.92   95.17   89.69   90.92   85.64
TLEorig    EWT+TLEorig    93.33   95.77   90.30   91.09   86.27
TLEcorr    EWT+TLEcorr    94.27   96.48   92.15   92.54   88.30

Table 3: Tagging and parsing results on a test set of 500 sentences from the TLE corpus. EWT is the English UD treebank. TLEorig are original sentences from the TLE. TLEcorr are the corresponding error corrected sentences.

An additional factor which negatively affects performance in this regime is a set of systematic differences between the EWT annotation of possessive pronouns, expletives and names and the UD guidelines, which are utilized in the TLE. In particular, the EWT annotates possessive pronoun UPOS as PRON rather than DET, which leads the UPOS results in this setup to be lower than the PTB POS results. Improved results are obtained using the TLE training data, which, despite its smaller size, is closer in genre and syntactic characteristics to the TLE test set. The strongest PTB POS tagging and parsing results are obtained by combining the EWT with the TLE training data, yielding 95.77 POS accuracy and a UAS of 90.3 on the original version of the TLE test set.

The dual annotation of sentences in their original and error corrected forms enables estimating the impact of grammatical errors on tagging and parsing by examining the performance gaps between the two sentence versions. Averaged across the three training conditions, the POS tagging accuracy on the original sentences is lower than the accuracy on the sentence corrections by 1.0 UPOS and 0.61 POS. Parsing performance degrades by 1.9 UAS, 1.59 LA and 2.21 LAS.
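For reference, the following minimal Python sketch computes these metrics over aligned gold and predicted token lists (token dictionaries as in the reader sketched in section 4.1); punctuation is assumed to be filtered out beforehand, as in our evaluation.

# Minimal evaluation sketch: UPOS accuracy, unlabeled attachment (UAS),
# label accuracy (LA) and labeled attachment (LAS) over aligned token lists.
# Each token is a dict with "upos", "hind" (head index) and "rel" keys.
def parse_scores(gold, predicted):
    n = len(gold)
    pairs = list(zip(gold, predicted))
    upos = sum(g["upos"] == p["upos"] for g, p in pairs) / n
    uas = sum(g["hind"] == p["hind"] for g, p in pairs) / n
    la = sum(g["rel"] == p["rel"] for g, p in pairs) / n
    las = sum(g["hind"] == p["hind"] and g["rel"] == p["rel"]
              for g, p in pairs) / n
    return {"UPOS": 100 * upos, "UAS": 100 * uas, "LA": 100 * la, "LAS": 100 * las}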

To further elucidate the influence of grammatical errors on parsing quality, table 4 compares performance on tokens in the original sentences appearing inside grammatical error tags to those appearing outside such tags. Although grammatical errors may lead to tagging and parsing errors with respect to any element in the sentence, we expect erroneous tokens to be more challenging to analyze compared to grammatical tokens.

This comparison indeed reveals a substantial difference between the two types of tokens, with an average gap of 5.0 UPOS, 6.65 POS, 4.67 UAS, 6.56 LA and 7.39 LAS.

Tokens         Train set      UPOS    POS     UAS     LA      LAS
Ungrammatical  EWT            87.97   88.61   82.66   82.66   74.93
Grammatical    EWT            92.62   95.37   87.26   89.11   82.70
Ungrammatical  TLEorig        90.76   88.68   83.81   83.31   77.22
Grammatical    TLEorig        96.86   96.14   88.46   90.41   84.59
Ungrammatical  EWT+TLEorig    89.76   90.97   86.32   85.96   80.37
Grammatical    EWT+TLEorig    94.02   96.70   91.07   92.08   87.41

Table 4: Tagging and parsing results on the original version of the TLE test set for tokens marked with grammatical errors (Ungrammatical) and tokens not marked for errors (Grammatical).

Note that differently from the global measurements in the first experiment, this analysis, which focuses on the local impact of remove/replace errors, suggests a stronger effect of grammatical errors on the dependency labels than on the dependency structure.

Finally, we measure tagging and parsing performance relative to the fraction of sentence tokens marked with grammatical errors. Similarly to the previous experiment, this analysis focuses on remove/replace rather than insert errors.

[Figure 1 plot. X-axis bins of the percentage of error-marked tokens, with the number of sentences per bin in parentheses: 0-5 (362), 5-10 (1033), 10-15 (1050), 15-20 (955), 20-25 (613), 25-30 (372), 30-35 (214), 35-40 (175). Y-axis: mean per sentence score, 76-100. Curves: POS, UAS and LAS for the original and the corrected sentences.]

Figure 1: Mean per sentence POS accuracy, UAS and LAS of the Turbo tagger and Turbo parser, as a function of the percentage of original sentence tokens marked with grammatical errors. The tagger and the parser are trained on the EWT corpus, and tested on all 5,124 sentences of the TLE. Points connected by continuous lines denote performance on the original TLE sentences. Points connected by dashed lines denote performance on the corresponding error corrected sentences. The number of sentences whose errors fall within each percentage range appears in parentheses.

Figure 1 presents the average sentential performance as a function of the percentage of tokens in the original sentence marked with grammatical errors.


In this experiment, we train the parser on the EWT training set and test on the entire TLE corpus. Performance curves are presented for POS, UAS and LAS on the original and error corrected versions of the annotations. We observe that while the performance on the corrected sentences is close to constant, original sentence performance decreases as the percentage of erroneous tokens in the sentence grows.
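A hedged sketch of the binning behind Figure 1 is given below; it assumes a boolean per-token error flag, as in the Table 4 analysis, and mirrors the 5% intervals on the x-axis.

# Group per-sentence scores by the percentage of tokens marked with
# grammatical errors, mirroring the 5% bins used in Figure 1.
from collections import defaultdict

def mean_score_by_error_rate(sentences, bin_width=5, max_pct=40):
    """`sentences` is an iterable of (error_flags, score) pairs, where
    error_flags is a list of booleans (True = token inside an error tag)
    and score is a per-sentence metric such as POS accuracy, UAS or LAS."""
    bins = defaultdict(list)
    for error_flags, score in sentences:
        pct = 100.0 * sum(error_flags) / len(error_flags)
        if pct >= max_pct:          # sentences beyond the last bin are skipped
            continue
        lower = int(pct // bin_width) * bin_width
        bins[(lower, lower + bin_width)].append(score)
    return {rng: sum(s) / len(s) for rng, s in sorted(bins.items())}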

Overall, our results suggest a negative, albeit limited effect of grammatical errors on parsing. This outcome contrasts with a study by Geertzen et al. (2013), which reported a larger performance gap of 7.6 UAS and 8.8 LAS between sentences with and without grammatical errors. We believe that our analysis provides a more accurate estimate of this impact, as it controls for both sentence content and sentence length. The latter factor is crucial, since it correlates positively with the number of grammatical errors in the sentence, and negatively with parsing accuracy.

8 Related Work

Previous studies on learner language proposed several annotation schemes for both POS tags and syntax (Hirschmann et al., 2007; Díaz-Negrillo et al., 2010; Dickinson and Ragheb, 2013; Rosen et al., 2014). The unifying theme in these proposals is a multi-layered analysis aiming to decouple the observed language usage from conventional structures in the foreign language.

In the context of ESL, Díaz-Negrillo et al. (2010) propose three parallel POS tag annotations for the lexical, morphological and distributional forms of each word. In our work, we adopt the distinction between morphological word forms, which roughly correspond to our literal word readings, and distributional forms as the error corrected words. However, we account for morphological forms only when these constitute valid existing PTB POS tags and are contextually plausible. Furthermore, while the internal structure of invalid word forms is an interesting object of investigation, we believe that it is more suitable for annotation as word features rather than POS tags. Our treebank supports the addition of such features to the existing annotations.

The work of Ragheb and Dickinson (2009; 2012; 2013) proposes ESL annotation guidelines for POS tags and syntactic dependencies based on the CHILDES annotation framework. This approach, called "morphosyntactic dependencies", is related to our annotation scheme in its focus on surface structures. Differently from this proposal, our annotations are grounded in a parallel annotation of grammatical errors and include an additional layer of analysis for the corrected forms. Moreover, we refrain from introducing new syntactic categories and dependency relations specific to ESL, thereby supporting computational treatment of ESL using existing resources for standard English. At the same time, we utilize a multilingual formalism which, in conjunction with our literal annotation strategy, facilitates linking the annotations to native language syntax.

While the above mentioned studies focus on annotation guidelines, attention has also been drawn to the topic of parsing in the learner language domain. However, due to the shortage of syntactic resources for ESL, much of the work in this area resorted to using surrogates for learner data. For example, in Foster (2007) and Foster et al. (2008) parsing experiments are carried out on synthetic learner-like data that was created by automatic insertion of grammatical errors into well formed English text. In Cahill et al. (2014) a treebank of secondary level native students' texts was used to approximate learner text in order to evaluate a parser that utilizes unlabeled learner data.

Syntactic annotations for ESL were previously developed by Nagata et al. (2011), who annotate an English learner corpus with POS tags and shallow syntactic parses. Our work departs from shallow syntax to full syntactic analysis, and provides annotations on a significantly larger scale. Furthermore, differently from this annotation effort, our treebank covers a wide range of learner native languages. An additional syntactic dataset for ESL, currently not available publicly, consists of 1,000 sentences from the EFCamDat dataset (Geertzen et al., 2013), annotated with Stanford dependencies (De Marneffe and Manning, 2008). This dataset was used to measure the impact of grammatical errors on parsing by comparing performance on sentences with grammatical errors to error free sentences. The TLE enables a more direct way of estimating the magnitude of this performance gap by comparing performance on the same sentences in their original and error corrected versions. Our comparison suggests that the effect of grammatical errors on parsing is smaller than the one reported in this study.


9 Conclusion

We present the first large scale treebank of learner language, manually annotated and double-reviewed for POS tags and universal dependencies. The annotation is accompanied by a linguistically motivated framework for handling syntactic structures associated with grammatical errors. Finally, we benchmark automatic tagging and parsing on our corpus, and measure the effect of grammatical errors on tagging and parsing quality. The treebank will support empirical study of learner syntax in NLP, corpus linguistics and second language acquisition.

10 Acknowledgements

We thank Anna Korhonen for helpful discussions and insightful comments on this paper. We also thank Dora Alexopoulou, Andrei Barbu, Markus Dickinson, Sue Felshin, Jeroen Geertzen, Yan Huang, Detmar Meurers, Sampo Pyysalo, Roi Reichart and the anonymous reviewers for valuable feedback on this work. This material is based upon work supported by the Center for Brains, Minds, and Machines (CBMM), funded by NSF STC award CCF-1231216.

References

Aoife Cahill, Binod Gyawali, and James V Bruno. 2014. Self-training for parsing learner text. In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, pages 66-73.

Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37.

David Crystal. 2003. English as a global language. Ernst Klett Sprachen.

Marie-Catherine De Marneffe and Christopher D Manning. 2008. Stanford typed dependencies manual. Technical report, Stanford University.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of LREC, pages 4585-4592.

Ana Díaz-Negrillo, Detmar Meurers, Salvador Valera, and Holger Wunsch. 2010. Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum, 36(1-2):139-154.

Markus Dickinson and Marwa Ragheb. 2009. Dependency annotation for learner corpora. In Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories (TLT-8), pages 59-70.

Markus Dickinson and Marwa Ragheb. 2013. Annotation for learner English guidelines, v. 0.1. Technical report, Indiana University, Bloomington, IN, June 9, 2013.

Jennifer Foster, Joachim Wagner, and Josef Van Genabith. 2008. Adapting a WSJ-trained parser to grammatically noisy text. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 221-224. Association for Computational Linguistics.

Jennifer Foster. 2007. Treebanks gone bad. International Journal of Document Analysis and Recognition (IJDAR), 10(3-4):129-145.

Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In Proceedings of the 31st Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.

Hagen Hirschmann, Seanna Doolittle, and Anke Lüdeling. 2007. Syntactic annotation of non-canonical linguistic structures.

Andre FT Martins, Miguel Almeida, and Noah A Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In ACL (2), pages 617-622. Citeseer.

Ryan T McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. 2013. Universal dependency annotation for multilingual parsing. In ACL (2), pages 92-97. Citeseer.

Ryo Nagata, Edward Whittaker, and Vera Sheinman. 2011. Creating a manually error-tagged and shallow-parsed learner corpus. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1210-1219. Association for Computational Linguistics.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL Shared Task, pages 1-14.

Diane Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference, pages 572-581.


Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Marwa Ragheb and Markus Dickinson. 2012. Defining syntax for learner language annotation. In COLING (Posters), pages 965-974.

Victoria Rosen and Koenraad De Smedt. 2010. Syntactic annotation of learner corpora. Systematisk, variert, men ikke tilfeldig, pages 120-132.

Alexandr Rosen, Jirka Hana, Barbora Stindlova, and Anna Feldman. 2014. Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation, 48(1):65-92.

Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision). Technical Reports (CIS).

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel R Bowman, Miriam Connor, John Bauer, and Christopher D Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

Joel Tetreault, Jennifer Foster, and Martin Chodorow. 2010. Using parse features for preposition selection and error detection. In Proceedings of the ACL 2010 Conference Short Papers, pages 353-358. Association for Computational Linguistics.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In ACL, pages 180-189.