Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2755–2768, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


Deep Contextualized Word Embeddings in Transition-Based and Graph-Based Dependency Parsing – A Tale of Two Parsers Revisited*

Artur Kulmizev, Miryam de Lhoneux, Johannes Gontrum, Elena Fano, Joakim Nivre
Department of Linguistics and Philology, Uppsala University

{artur.kulmizev,miryam.de_lhoneux,joakim.nivre}@lingfil.uu.se
{johannes.gontrum.4608,elena.fano.3249}@student.uu.se

Abstract

Transition-based and graph-based dependency parsers have previously been shown to have complementary strengths and weaknesses: transition-based parsers exploit rich structural features but suffer from error propagation, while graph-based parsers benefit from global optimization but have restricted feature scope. In this paper, we show that, even though some details of the picture have changed after the switch to neural networks and continuous representations, the basic trade-off between rich features and global optimization remains essentially the same. Moreover, we show that deep contextualized word embeddings, which allow parsers to pack information about global sentence structure into local feature representations, benefit transition-based parsers more than graph-based parsers, making the two approaches virtually equivalent in terms of both accuracy and error profile. We argue that the reason is that these representations help prevent search errors and thereby allow transition-based parsers to better exploit their inherent strength of making accurate local decisions. We support this explanation by an error analysis of parsing experiments on 13 languages.

1 Introduction

For more than a decade, research on data-driven dependency parsing has been dominated by two approaches: transition-based parsing and graph-based parsing (McDonald and Nivre, 2007, 2011). Transition-based parsing reduces the parsing task to scoring single parse actions and is often combined with local optimization and greedy search algorithms. Graph-based parsing decomposes parse trees into subgraphs and relies on global optimization and exhaustive (or at least non-greedy) search to find the best tree. These radically different approaches often lead to comparable parsing accuracy, but with distinct error profiles indicative of their respective strengths and weaknesses, as shown by McDonald and Nivre (2007, 2011).

* We gratefully acknowledge the inspiration for our subtitle in the seminal paper by Zhang and Clark (2008).

In recent years, dependency parsing, like most of NLP, has shifted from linear models and discrete features to neural networks and continuous representations. This has led to substantial accuracy improvements for both transition-based and graph-based parsers and raises the question whether their complementary strengths and weaknesses are still relevant. In this paper, we replicate the analysis of McDonald and Nivre (2007, 2011) for neural parsers. In addition, we investigate the impact of deep contextualized word representations (Peters et al., 2018; Devlin et al., 2019) for both types of parsers.

Based on what we know about the strengths and weaknesses of the two approaches, we hypothesize that deep contextualized word representations will benefit transition-based parsing more than graph-based parsing. The reason is that these representations make information about global sentence structure available locally, thereby helping to prevent search errors in greedy transition-based parsing. The hypothesis is corroborated in experiments on 13 languages, and the error analysis supports our suggested explanation. We also find that deep contextualized word representations improve parsing accuracy for longer sentences, both for transition-based and graph-based parsers.

2 Two Models of Dependency Parsing

After playing a marginal role in NLP for many years, dependency-based approaches to syntactic parsing have become mainstream during the last fifteen years. This is especially true if we consider languages other than English, ever since the influential CoNLL shared tasks on dependency parsing in 2006 (Buchholz and Marsi, 2006) and 2007 (Nivre et al., 2007) with data from 19 languages.

The transition-based approach to dependency parsing was pioneered by Yamada and Matsumoto (2003) and Nivre (2003), with inspiration from history-based parsing (Black et al., 1992) and data-driven shift-reduce parsing (Veenstra and Daelemans, 2000). The idea is to reduce the complex parsing task to the simpler task of predicting the next parsing action and to implement parsing as greedy search for the optimal sequence of actions, guided by a simple classifier trained on local parser configurations. This produces parsers that are very efficient, often with linear time complexity, and which can benefit from rich non-local features defined over parser configurations but which may suffer from compounding search errors.

The graph-based approach to dependency parsing was developed by McDonald et al. (2005a,b), building on earlier work by Eisner (1996). The idea is to score dependency trees by a linear combination of scores of local subgraphs, often single arcs, and to implement parsing as exact search for the highest scoring tree under a globally optimized model. These parsers do not suffer from search errors, but their parsing algorithms are more complex and restrict the scope of features to local subgraphs.

The terms transition-based and graph-based were coined by McDonald and Nivre (2007, 2011), who performed a contrastive error analysis of the two top-performing systems in the CoNLL 2006 shared task on multilingual dependency parsing: MaltParser (Nivre et al., 2006) and MSTParser (McDonald et al., 2006), which represented the state of the art in transition-based and graph-based parsing, respectively, at the time. Their analysis shows that, despite having almost exactly the same parsing accuracy when averaged over 13 languages, the two parsers have very distinctive error profiles. MaltParser is more accurate on short sentences, on short dependencies, on dependencies near the leaves of the tree, on nouns and pronouns, and on subject and object relations. MSTParser is more accurate on long sentences, on long dependencies, on dependencies near the root of the tree, on verbs, and on coordination relations and sentence roots.

McDonald and Nivre (2007, 2011) argue that these patterns can be explained by the complementary strengths and weaknesses of the systems. The transition-based MaltParser prioritizes rich structural features, which enable accurate disambiguation in local contexts, but is limited by a locally optimized model and greedy algorithm, resulting in search errors for structures that require longer transition sequences. The graph-based MSTParser benefits from a globally optimized model and exact inference, which gives a better analysis of global sentence structure, but is more restricted in the features it can use, which limits its capacity to score local structures accurately.

Figure 1: Labeled precision by dependency length for MST (global–exhaustive–graph), Malt (local–greedy–transition) and ZPar (global–beam–transition). From Zhang and Nivre (2012).

Many of the developments in dependency parsing during the last decade can be understood in this light as attempts to mitigate the weaknesses of traditional transition-based and graph-based parsers without sacrificing their strengths. This may mean evolving the model structure through new transition systems (Nivre, 2008, 2009; Kuhlmann et al., 2011) or higher-order models for graph-based parsing (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010); it may mean exploring alternative learning strategies, in particular for transition-based parsing, where improvements have been achieved thanks to global structure learning (Zhang and Clark, 2008; Zhang and Nivre, 2011; Andor et al., 2016) and dynamic oracles (Goldberg and Nivre, 2012, 2013); it may mean using alternative search strategies, such as transition-based parsing with beam search (Johansson and Nugues, 2007; Titov and Henderson, 2007; Zhang and Clark, 2008) or exact search (Huang and Sagae, 2010; Kuhlmann et al., 2011), or graph-based parsing with heuristic search to cope with the complexity of higher-order models, especially for non-projective parsing (McDonald and Pereira, 2006; Koo et al., 2010; Zhang and McDonald, 2012); or it may mean hybrid or ensemble systems (Sagae and Lavie, 2006; Nivre and McDonald, 2008; Zhang and Clark, 2008; Bohnet and Kuhn, 2012). A nice illustration of the impact of new techniques can be found in Zhang and Nivre (2012), where an error analysis along the lines of McDonald and Nivre (2007, 2011) shows that a transition-based parser using global learning and beam search (instead of local learning and greedy search) performs on par with graph-based parsers for long dependencies, while retaining the advantage of the original transition-based parsers on short dependencies (see Figure 1).

Neural networks for dependency parsing, first explored by Titov and Henderson (2007) and Attardi et al. (2009), have come to dominate the field during the last five years. While this has dramatically changed learning architectures and feature representations, most parsing models are still either transition-based (Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015; Andor et al., 2016; Kiperwasser and Goldberg, 2016) or graph-based (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017). However, more accurate feature learning using continuous representations and nonlinear models has allowed parsing architectures to be simplified. Thus, most recent transition-based parsers have moved back to local learning and greedy inference, seemingly without losing accuracy (Chen and Manning, 2014; Dyer et al., 2015; Kiperwasser and Goldberg, 2016). Similarly, graph-based parsers again rely on first-order models and obtain no improvements from using higher-order models (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017).

The increasing use of neural networks has also led to a convergence in feature representations and learning algorithms for transition-based and graph-based parsers. In particular, most recent systems rely on an encoder, typically in the form of a BiLSTM, that provides contextualized representations of the input words as input to the scoring of transitions – in transition-based parsers – or of dependency arcs – in graph-based parsers. By making information about the global sentence context available in local word representations, this encoder can be assumed to mitigate error propagation for transition-based parsers and to widen the feature scope beyond individual word pairs for graph-based parsers. For both types of parsers, this also obviates the need for complex structural feature templates, as recently shown by Falenska and Kuhn (2019). We should therefore expect neural transition-based and graph-based parsers to be not only more accurate than their non-neural counterparts but also more similar to each other in their error profiles.

3 Deep Contextualized Word Representations

Neural parsers rely on vector representations of words as their primary input, often in the form of pretrained word embeddings such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or fastText (Bojanowski et al., 2016), which are sometimes extended with character-based representations produced by recurrent neural networks (Ballesteros et al., 2015). These techniques assign a single static representation to each word type and therefore cannot capture context-dependent variation in meaning and syntactic behavior.

By contrast, deep contextualized word representations encode words with respect to the sentential context in which they appear. Like word embeddings, such models are typically trained with a language-modeling objective, but yield sentence-level tensors as representations, instead of single vectors. These representations are typically produced by transferring a model's entire feature encoder – be it a BiLSTM (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017) – to a target task, where the dimensionality of the tensor S is typically S ∈ R^(N×L×D) for a sentence of length N, an encoder with L layers, and word-level vectors of dimensionality D. The advantage of such models, compared to the parser-internal encoders discussed in the previous section, is that they not only produce contextualized representations but do so over several layers of abstraction, as captured by the model's different layers, and are pre-trained on corpora much larger than typical treebanks.

Deep contextualized embedding models have proven to be adept at a wide array of NLP tasks, achieving state-of-the-art performance in standard Natural Language Understanding (NLU) benchmarks, such as GLUE (Wang et al., 2019). Though many such models have been proposed, we adopt the two arguably most popular ones for our experiments: ELMo and BERT. Both models have previously been used for dependency parsing (Che et al., 2018; Jawahar et al., 2018; Lim et al., 2018; Kondratyuk, 2019; Schuster et al., 2019), but there has been no systematic analysis of their impact on transition-based and graph-based parsers.

3.1 ELMo

ELMo is a deep contextualized embedding model proposed by Peters et al. (2018), which produces sentence-level representations yielded by a multi-layer BiLSTM language model. ELMo is trained with a standard language-modeling objective, in which a BiLSTM reads a sequence of N learned context-independent embeddings w_1, ..., w_N (obtained via a character-level CNN) and produces a context-dependent representation h_{j,k} = BiLSTM(w_{1:N}, k), where j (1 ≤ j ≤ L) is the BiLSTM layer and k is the index of the word in the sequence. The output of the last layer, h_{L,k}, is then employed in conjunction with a softmax layer to predict the next token at k + 1.

The simplest way of transferring ELMo to a downstream task is to encode the input sentence S = w_1, ..., w_N by extracting the representations from the BiLSTM at layer L for each token w_k ∈ S: h_{L,1}, ..., h_{L,N}. However, Peters et al. (2018) posit that the best way to take advantage of ELMo's representational power is to compute a linear combination of BiLSTM layers:

ELMo_k = γ Σ_{j=0}^{L} s_j h_{j,k}    (1)

where s_j is a softmax-normalized task-specific parameter and γ is a task-specific scalar. Peters et al. (2018) demonstrate that this scales the layers of linguistic abstraction encoded by the BiLSTM for the task at hand.
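To make Eq. (1) concrete, the following minimal NumPy sketch (ours, not the authors' code) computes the scalar mix for a single token from made-up layer activations; sizes and values are purely illustrative.

import numpy as np

def scalar_mix(h, s, gamma):
    """Compute ELMo_k = gamma * sum_j softmax(s)_j * h[j] for one token.

    h:     array of shape (L+1, D) holding the layer activations h_{j,k}
    s:     unnormalized scalar parameters, shape (L+1,)
    gamma: task-specific scalar
    """
    weights = np.exp(s - s.max())
    weights /= weights.sum()                   # softmax-normalized s_j
    return gamma * (weights[:, None] * h).sum(axis=0)

# toy example: 3 layers (j = 0..2), D = 4; equal weights when s is all zeros
h = np.random.randn(3, 4)
elmo_k = scalar_mix(h, s=np.zeros(3), gamma=1.0)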

3.2 BERT

BERT (Devlin et al., 2019) is similar to ELMo in that it employs a language-modeling objective over unannotated text in order to produce deep contextualized embeddings. However, BERT differs from ELMo in that, in place of a BiLSTM, it employs a bidirectional Transformer (Vaswani et al., 2017), which, among other factors, carries the benefit of learning potential dependencies between words directly. This lies in contrast to recurrent models, which may struggle to learn correspondences between constituent signals when the time-lag between them is long (Hochreiter et al., 2001). For a token w_k in sentence S, BERT's input representation is composed by summing a word embedding x_k, a position embedding i_k, and a WordPiece embedding s_k (Wu et al., 2016): w_k = x_k + i_k + s_k.

Each w_k ∈ S is passed to an L-layered Bi-Transformer, which is trained with a masked language modeling objective (i.e., randomly masking a percentage of input tokens and only predicting said tokens). For use in downstream tasks, Devlin et al. (2019) propose to extract the Transformer's encoding of each token w_k ∈ S at layer L, which effectively produces BERT_k.
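As a rough illustration of the input composition w_k = x_k + i_k + s_k described above, here is a minimal NumPy sketch with randomly initialized lookup tables. The vocabulary sizes, the example token ids, and the treatment of s_k as a small segment-type table are our assumptions for demonstration, not details taken from the paper or from the released BERT code.

import numpy as np

rng = np.random.default_rng(0)
V, P, S, D = 30522, 512, 2, 768          # vocab, positions, segment types, hidden size (illustrative)
x_emb = rng.standard_normal((V, D))      # word (x_k) table
i_emb = rng.standard_normal((P, D))      # position (i_k) table
s_emb = rng.standard_normal((S, D))      # WordPiece/segment-type (s_k) table

def bert_input(token_ids, segment_ids):
    """w_k = x_k + i_k + s_k for every (sub)word position k, as in Section 3.2."""
    positions = np.arange(len(token_ids))
    return x_emb[token_ids] + i_emb[positions] + s_emb[segment_ids]

w = bert_input(token_ids=[101, 7592, 2088, 102], segment_ids=[0, 0, 0, 0])
# w (shape 4 x 768) would then be fed to the L-layer Transformer; for parsing,
# the encoding of each token at layer L is extracted as BERT_k.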

4 Hypotheses

Based on our discussion in Section 2, we assume that transition-based and graph-based parsers still have distinctive error profiles due to the basic trade-off between rich structural features, which allow transition-based parsers to make accurate local decisions, and global learning and exact search, which give graph-based parsers an advantage with respect to global sentence structure. At the same time, we expect the differences to be less pronounced than they were ten years ago because of the convergence in neural architectures and feature representations. But how will the addition of deep contextualized word representations affect the behavior of the two parsers?

Given recent work showing that deep contextualized word representations incorporate rich information about syntactic structure (Goldberg, 2019; Liu et al., 2019; Tenney et al., 2019; Hewitt and Manning, 2019), we hypothesize that transition-based parsers have the most to gain from these representations, because they will improve their capacity to make decisions informed by global sentence structure and therefore reduce the number of search errors. Our main hypothesis can be stated as follows:

Deep contextualized word representations are more effective at reducing errors in transition-based parsing than in graph-based parsing.

If this holds true, then the analysis of McDonald and Nivre (2007, 2011) suggests that the differential error reduction should be especially visible on phenomena such as:

1. longer dependencies,
2. dependencies closer to the root,
3. certain parts of speech,
4. certain dependency relations,
5. longer sentences.


The error analysis will consider all these factors as well as non-projective dependencies.

5 Experimental Setup

5.1 Parsing Architecture

To be able to compare transition-based and graph-based parsers under equivalent conditions, we use and extend UUParser [1] (de Lhoneux et al., 2017a; Smith et al., 2018a), an evolution of bist-parser (Kiperwasser and Goldberg, 2016), which supports transition-based and graph-based parsing with a common infrastructure but different scoring models and parsing algorithms.

[1] https://github.com/UppsalaNLP/uuparser

For an input sentence S = w_1, ..., w_N, the parser creates a sequence of vectors w_{1:N}, where the vector w_k = x_k ◦ BiLSTM(c_{1:M}) representing input word w_k is the concatenation of a pretrained word embedding x_k and a character-based embedding BiLSTM(c_{1:M}) obtained by running a BiLSTM over the character sequence c_{1:M} of w_k. Finally, each input element is represented by a BiLSTM vector, h_k = BiLSTM(w_{1:N}, k).
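The following sketch shows one way to implement this input layer. It uses PyTorch as a stand-in for the DyNet code in UUParser, and the class name, module names, and dimensions are illustrative assumptions rather than the original implementation.

import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Sketch of the input layer: w_k = x_k ◦ BiLSTM(c_{1:M}), h_k = BiLSTM(w_{1:N}, k)."""

    def __init__(self, n_chars, word_dim=300, char_dim=50, hidden=125):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, word_vecs, char_ids):
        # word_vecs: (N, word_dim) pretrained embeddings x_k
        # char_ids:  list of 1-D LongTensors, one per word, with character indices
        char_reprs = []
        for chars in char_ids:
            _, (h_n, _) = self.char_lstm(self.char_emb(chars).unsqueeze(0))
            char_reprs.append(torch.cat([h_n[0, 0], h_n[1, 0]]))   # final fwd/bwd char states
        w = torch.cat([word_vecs, torch.stack(char_reprs)], dim=1)  # w_k = x_k ◦ char-BiLSTM
        h, _ = self.word_lstm(w.unsqueeze(0))                       # h_k = BiLSTM(w_{1:N}, k)
        return h.squeeze(0)                                         # (N, 2 * hidden)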

In transition-based parsing, the BiLSTM vectors are input to a multi-layer perceptron (MLP) for scoring transitions, using the arc-hybrid transition system from Kuhlmann et al. (2011) extended with a SWAP transition to allow the construction of non-projective dependency trees (Nivre, 2009; de Lhoneux et al., 2017b). The scoring is based on the top three words on the stack and the first word of the buffer, and the input to the MLP includes the BiLSTM vectors for these words as well as their leftmost and rightmost dependents (up to 12 words in total).
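A sketch of how such a feature vector could be assembled from the BiLSTM output is given below. The helper maps left_dep/right_dep and the padding vector are hypothetical names, and this is our reading of the feature set rather than the original UUParser code.

import torch

def extract_features(h, stack, buffer, left_dep, right_dep, pad):
    """Collect the (up to) 12 BiLSTM vectors used to score transitions:
    the top 3 stack items and the first buffer item, each with its current
    leftmost and rightmost dependent. `h` is the (N, d) BiLSTM output,
    `left_dep`/`right_dep` map a word index to its leftmost/rightmost
    dependent so far (or None), and `pad` is a learned (d,) padding vector."""
    top3 = list(reversed(stack[-3:])) + [None] * (3 - min(len(stack), 3))
    buf1 = buffer[:1] + [None] * (1 - min(len(buffer), 1))
    feats = []
    for i in top3 + buf1:
        for j in (i,
                  left_dep.get(i) if i is not None else None,
                  right_dep.get(i) if i is not None else None):
            feats.append(h[j] if j is not None else pad)
    return torch.cat(feats)   # (12 * d,) vector fed to the transition-scoring MLP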

In graph-based parsing, the BiLSTM vectors are input to an MLP for scoring all possible dependency relations under an arc-factored model, meaning that only the vectors corresponding to the head and dependent are part of the input (2 words in total). The parser then extracts a maximum spanning tree over the score matrix using the Chu-Liu-Edmonds (CLE) algorithm [2] (Edmonds, 1967), which allows us to construct non-projective trees.

[2] We use the implementation from Qi et al. (2018).
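The following sketch illustrates arc-factored scoring and maximum-spanning-tree decoding. It is not the authors' implementation: the MLP is any user-supplied scoring module, and decoding here uses networkx's Chu-Liu/Edmonds routine instead of the Qi et al. (2018) code used in the paper.

import torch
import networkx as nx

def score_arcs(h, mlp):
    """Arc-factored scoring: score[i, j] = MLP(h_i ◦ h_j) for head i, dependent j.
    `h` is the (N, d) BiLSTM output including an artificial root at position 0;
    `mlp` is any module mapping a 2d-dim vector to a scalar."""
    n = h.size(0)
    pairs = torch.cat([h.repeat_interleave(n, 0), h.repeat(n, 1)], dim=1)
    return mlp(pairs).view(n, n)       # rows = heads, columns = dependents

def decode_mst(scores):
    """Maximum spanning arborescence rooted at node 0 (no incoming edges to 0)."""
    n = scores.shape[0]
    g = nx.DiGraph()
    g.add_edges_from((i, j, {"weight": float(scores[i, j])})
                     for i in range(n) for j in range(1, n) if i != j)
    tree = nx.maximum_spanning_arborescence(g)
    return {j: i for i, j in tree.edges}    # dependent -> head

# toy usage (shapes only):
# h = torch.randn(6, 250)
# mlp = torch.nn.Sequential(torch.nn.Linear(500, 100), torch.nn.Tanh(), torch.nn.Linear(100, 1))
# heads = decode_mst(score_arcs(h, mlp).detach().numpy())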

It is important to note that, while we acknowledge the existence of graph-based parsers that outperform the implementation of Kiperwasser and Goldberg (2016), such models do not meet our criteria for systematic comparison. The parser by Dozat et al. (2017) is very similar, but employs the MLP as a further step in the featurization process prior to scoring via a biaffine classifier. To keep the comparison as exact as possible, we forego comparing our transition-based systems to the Dozat et al. (2017) parser (and its numerous modifications). In addition, preliminary experiments showed that our chosen graph-based parser outperforms its transition-based counterpart, which was itself competitive in the CoNLL 2018 shared task (Zeman et al., 2018).

5.2 Input Representations

In our experiments, we evaluate three pairs of systems – differing only in their input representations. The first is a baseline that represents tokens by w_k = x_k ◦ BiLSTM(c_{1:M}), as described in Section 5.1. The word embeddings x_k are initialized via pretrained fastText vectors (x_k ∈ R^300) (Grave et al., 2018), which are updated for the parsing task. We term these transition-based and graph-based baselines TR and GR.

For the ELMo experiments, we make use of pretrained models provided by Che et al. (2018), who train ELMo on 20 million words randomly sampled from raw WikiDump and Common Crawl datasets for 44 languages. We encode each gold-segmented sentence in our treebank via the ELMo model for that language, which yields a tensor S_ELMo ∈ R^(N×L×D), where N is the number of words in the sentence, L = 3 is the number of ELMo layers, and D = 1024 is the ELMo vector dimensionality. Following Peters et al. (2018) (see Eq. 1), we learn a linear combination and a task-specific γ of each token's ELMo representation, which yields a vector ELMo_k ∈ R^1024. We then concatenate this vector with w_k and pass it to the BiLSTM. We call the transition-based and graph-based systems enhanced with ELMo TR+E and GR+E.

For the BERT experiments, we employ the pretrained multilingual cased model provided by Google [3][4], which is trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias.[5] The model's parameters feature a 12-layer transformer trained with 768 hidden units and 12 self-attention heads. In order to obtain a word-level vector for each token in a sentence, we experimented with a variety of representations: namely, concatenating each transformer layer's word representation into a single vector w_concat ∈ R^(768*12), employing the last layer's representation, or learning a linear combination over a range of layers, as we do with ELMo (e.g., via Eq. 1). In a preliminary set of experiments, we found that the latter approach over layers 4–8 consistently yielded the best results, and thus chose to adopt this method going forward. Regarding tokenization, we select the vector for the first subword token, as produced by the native BERT tokenizer. Surprisingly, this gave us better results than averaging subword token vectors in a preliminary round of experiments. Like with the ELMo representations, we concatenate each BERT vector BERT_k ∈ R^768 with w_k and pass it to the respective TR+B and GR+B parsers.

[3] https://github.com/google-research/bert
[4] Except for Chinese, for which we make use of a separate, pretrained model.
[5] See sorted list here: https://meta.wikimedia.org/wiki/List_of_Wikipedias
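A minimal NumPy sketch of this word-level extraction step is shown below; the exact layer indexing (here 0-based layers 4–8), the array layout, and the function name are our assumptions for illustration.

import numpy as np

def bert_word_vectors(layer_outputs, first_subword_index, s, gamma):
    """Turn BERT subword activations into one 768-dim vector per word.

    layer_outputs:       array (12, n_subwords, 768) of Transformer layer activations
    first_subword_index: for each word, the index of its first subword token
    s, gamma:            mixing parameters as in Eq. (1), here over layers 4-8
    """
    layers = layer_outputs[4:9]                              # five layers: 4, 5, 6, 7, 8
    weights = np.exp(s - s.max()); weights /= weights.sum()  # softmax-normalized s_j
    mixed = gamma * np.tensordot(weights, layers, axes=1)    # (n_subwords, 768)
    return mixed[first_subword_index]                        # first-subword vector per word

# toy usage: 3 words split into 5 subwords
acts = np.random.randn(12, 5, 768)
word_vecs = bert_word_vectors(acts, first_subword_index=[0, 1, 3],
                              s=np.zeros(5), gamma=1.0)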

It is important to note that while the ELMo models we work with are monolingual, the BERT model is multilingual. In other words, while the standalone ELMo models were trained on the tokenized WikiDump and CommonCrawl for each language respectively, the BERT model was trained only on the former, albeit simultaneously for 104 languages. This means that the models are not strictly comparable, and it is an interesting question whether either of the models has an advantage in terms of training regime. However, since our purpose is not to compare the two models but to study their impact on parsing, we leave this question for future work.

5.3 Language and Treebank Selection

For treebank selection, we rely on the criteria proposed by de Lhoneux et al. (2017c) and adapted by Smith et al. (2018b) to have languages from different language families, with different morphological complexity, different scripts and character set sizes, different training sizes and domains, and with good annotation quality. This gives us 13 treebanks from UD v2.3 (Nivre et al., 2018), information about which is shown in Table 1.

Language  Treebank    Family  Order  Train
Arabic    PADT        non-IE  VSO    6.1k
Basque    BDT         non-IE  SOV    5.4k
Chinese   GSD         non-IE  SVO    4.0k
English   EWT         IE      SVO    12.5k
Finnish   TDT         non-IE  SVO    12.2k
Hebrew    HTB         non-IE  SVO    5.2k
Hindi     HDTB        IE      SOV    13.3k
Italian   ISDT        IE      SVO    13.1k
Japanese  GSD         non-IE  SOV    7.1k
Korean    GSD         non-IE  SOV    4.4k
Russian   SynTagRus   IE      SVO    48.8k
Swedish   Talbanken   IE      SVO    4.3k
Turkish   IMST        non-IE  SOV    3.7k

Table 1: Languages and treebanks used in experiments. Family = Indo-European (IE) or not. Order = dominant word order according to WALS (Haspelmath et al., 2005). Train = number of training sentences.

5.4 Parser Training and Evaluation

In all experiments, we train parsers with default settings [6] for 30 epochs and select the model with the best labeled attachment score on the dev set. For each combination of model and training set, we repeat this procedure three times with different random seeds, apply the three selected models to the test set, and report the average result.

[6] All hyperparameters are specified in the supplementary material (Part A).

5.5 Error Analysis

In order to conduct an error analysis along the lines of McDonald and Nivre (2007, 2011), we extract all sentences from the smallest development set in our treebank sample (Hebrew HTB, 484 sentences) and sample the same number of sentences from each of the other development sets (6,292 sentences in total). For each system, we then extract parses of these sentences for the three training runs with different random seeds (18,876 predictions in total). Although it could be interesting to look at each language separately, we follow McDonald and Nivre (2007, 2011) and base our main analysis on all languages together to prevent data sparsity for longer dependencies, longer sentences, etc.[7]

[7] The supplementary material contains tables for the error analysis (Part B) and graphs for each language (Part C).
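A sketch of the sampling step as we read it; the random seed and the data structure are arbitrary choices of ours, not taken from the paper.

import random

def sample_error_analysis_sets(dev_sentences_by_language, n=484, seed=1):
    """Take all sentences from the smallest dev set (484 sentences) and sample
    the same number from every other dev set."""
    rng = random.Random(seed)
    return {lang: sents if len(sents) <= n else rng.sample(sents, n)
            for lang, sents in dev_sentences_by_language.items()}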

6 Results and Discussion

Language  TR    GR    TR+E  GR+E  TR+B  GR+B
Arabic    79.1  79.9  82.0  81.7  81.9  81.8
Basque    73.6  77.6  80.1  81.4  77.9  79.8
Chinese   75.3  76.7  79.8  80.4  83.7  83.4
English   82.7  83.3  87.0  86.5  87.8  87.6
Finnish   80.0  81.4  87.0  86.6  85.1  83.9
Hebrew    81.1  82.4  85.2  85.9  85.5  85.9
Hindi     88.4  89.6  91.0  91.2  89.5  90.8
Italian   88.0  88.2  90.9  90.6  92.0  91.7
Japanese  92.1  92.2  93.1  93.0  92.9  92.1
Korean    79.6  81.2  82.3  82.3  83.7  84.2
Russian   88.3  88.0  90.7  90.6  91.5  91.0
Swedish   80.5  81.6  86.9  86.2  87.6  86.9
Turkish   57.8  61.2  62.6  63.8  64.2  64.9
Average   80.5  81.8  84.5  84.6  84.9  84.9

Table 2: Labeled attachment score on 13 languages for parsing models with and without deep contextualized word representations.

Table 2 shows labeled attachment scores for the six parsers on all languages, averaged over three training runs with random seeds. The results clearly corroborate our main hypothesis. While ELMo and BERT provide significant improvements for both transition-based and graph-based parsers, the magnitude of the improvement is greater in the transition-based case: 3.99 vs. 2.85 for ELMo and 4.47 vs. 3.13 for BERT. In terms of error reduction, this corresponds to 21.1% vs. 16.5% for ELMo and 22.5% vs. 17.4% for BERT. The differences in error reduction are statistically significant at α = 0.01 (Wilcoxon).
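For reference, relative error reduction is the LAS improvement divided by the baseline error. Computed from the rounded averages in Table 2 it comes out slightly lower than the figures above, which are presumably based on unrounded scores; the sketch below only illustrates the formula.

def error_reduction(las_base, las_enh):
    # relative reduction of labeled attachment error, in percent
    return 100.0 * (las_enh - las_base) / (100.0 - las_base)

print(error_reduction(80.5, 84.5))   # TR -> TR+E: ~20.5% from rounded averages (paper: 21.1%)
print(error_reduction(81.8, 84.6))   # GR -> GR+E: ~15.4% from rounded averages (paper: 16.5%)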

Although both parsing accuracy and absolute improvements vary across languages, the overall trend is remarkably consistent and the transition-based parser improves more with both ELMo and BERT for every single language. Furthermore, a linear mixed effect model analysis reveals that, when accounting for language as a random effect, there are no significant interactions between the improvement of each model (over its respective baseline) and factors such as language family (IE vs. non-IE), dominant word order, or number of training sentences. In other words, the improvements for both parsers seem to be largely independent of treebank-specific factors. Let us now see to what extent they can be explained by the error analysis.

6.1 Dependency Length

Figure 2: Labeled F-score by dependency length.

Figure 2 shows labeled F-score for dependencies of different lengths, where the length of a dependency between words w_i and w_j is equal to |i − j| (and with root tokens in a special bin on the far left). For the baseline parsers, we see that the curves diverge with increasing length, clearly indicating that the transition-based parser still suffers from search errors on long dependencies, which require longer transition sequences for their construction. However, the differences are much smaller than in McDonald and Nivre (2007, 2011) and the transition-based parser no longer has an advantage for short dependencies, which is consistent with the BiLSTM architecture providing the parsers with more similar features that help the graph-based parser overcome the limited scope of the first-order model.
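For concreteness, a small sketch of how labeled F-score per dependency-length bin can be computed under the definition above; this is our illustration, not the analysis code used for the paper.

from collections import Counter

def length_bin(head, dep):
    """Dependency length |i - j|, with arcs from the artificial root (index 0) binned separately."""
    return "root" if head == 0 else abs(head - dep)

def labeled_f_by_length(gold, pred):
    """gold/pred: dicts mapping dependent index -> (head index, label)."""
    correct, gold_n, pred_n = Counter(), Counter(), Counter()
    for d, (h, lab) in gold.items():
        gold_n[length_bin(h, d)] += 1
    for d, (h, lab) in pred.items():
        pred_n[length_bin(h, d)] += 1
        if gold.get(d) == (h, lab):
            correct[length_bin(h, d)] += 1
    # F = 2PR / (P + R) simplifies to 2 * correct / (gold + predicted) per bin
    return {b: 2 * correct[b] / (gold_n[b] + pred_n[b]) for b in gold_n | pred_n}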

Adding deep contextualized word representations clearly helps the transition-based parser to perform better on longer dependencies. For ELMo there is still a discernible difference for dependencies longer than 5, but for BERT the two curves are almost indistinguishable throughout the whole range. This could be related to the aforementioned intuition that a Transformer captures long dependencies more effectively than a BiLSTM (see Tran et al. (2018) for contrary observations, albeit for different tasks). The overall trends for both baseline and enhanced models are quite consistent across languages, although with large variations in accuracy levels.

6.2 Distance to Root

Figure 3: Labeled F-score by distance to root.

Figure 3 reports labeled F-score for dependencies at different distances from the root of the tree, where distance is measured by the number of arcs in the path from the root. There is a fairly strong (inverse) correlation between dependency length and distance to the root, so it is not surprising that the plots in Figure 3 largely show the mirror image of the plots in Figure 2. For the baseline parsers, the graph-based parser has a clear advantage for dependencies near the root (including the root itself), but the transition-based parser closes the gap with increasing distance.[8] For ELMo and BERT, the curves are much more similar, with only a slight advantage for the graph-based parser near the root and with the transition-based BERT parser being superior from distance 5 upwards. The main trends are again similar across all languages.

[8] At the very end, the curves appear to diverge again, but the data is very sparse in this part of the plot.
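The distance measure itself is straightforward; a minimal sketch, assuming heads maps each word to its head and 0 is the artificial root.

def distance_to_root(heads, k):
    """Number of arcs on the path from the root to word k."""
    d = 0
    while k != 0:
        k = heads[k]
        d += 1
    return d

# example: a 4-word sentence rooted at word 2
heads = {1: 2, 2: 0, 3: 2, 4: 3}
assert [distance_to_root(heads, k) for k in range(1, 5)] == [2, 1, 2, 3]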

6.3 Non-Projective Dependencies

Figure 4: Labeled precision (left) and recall (right) for non-projective dependencies.

Figure 4 shows precision and recall specifically for non-projective dependencies. We see that there is a clear tendency for the transition-based parser to have better precision and the graph-based parser better recall.[9] In other words, non-projective dependencies are more likely to be correct when they are predicted by the transition-based parser using the swap transition, but real non-projective dependencies are more likely to be found by the graph-based parser using a spanning tree algorithm. Interestingly, adding deep contextualized word representations has almost no effect on the graph-based parser,[10] while especially the ELMo embeddings improve both precision and recall for the transition-based parser.

[9] Incidentally, the same pattern is reported by McDonald and Nivre (2007, 2011), even though the techniques for processing non-projective dependencies are different in that study: pseudo-projective parsing (Nivre and Nilsson, 2005) for the transition-based parser and approximate second-order non-projective parsing (McDonald and Pereira, 2006) for the graph-based parser.
[10] The breakdown per language shows marginal improvements for the enhanced graph-based models on a few languages, canceled out by equally marginal degradations on others.
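A sketch of the arc-level non-projectivity test underlying the analysis in Section 6.3, under the usual definition that an arc is non-projective if some word strictly between head and dependent is not a descendant of the head; this is our illustration, not the authors' code.

def is_descendant(heads, node, ancestor):
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return ancestor == 0

def is_nonprojective(heads, dep):
    """True if the arc (heads[dep], dep) crosses another arc. `heads` maps each
    word (1-indexed) to its head; 0 is the artificial root."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    return any(not is_descendant(heads, k, head) for k in range(lo + 1, hi))

# example: root -> 2 -> 3 -> 1; the arc (3, 1) crosses the arc (0, 2)
heads = {1: 3, 2: 0, 3: 2}
assert is_nonprojective(heads, 1) and not is_nonprojective(heads, 2)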

6.4 Parts of Speech and Dependency Types

Thanks to the cross-linguistically consistent UD annotations, we can relate errors to linguistic categories more systematically than in the old study. The main impression, however, is that there are very few clear differences, which is again indicative of the convergence between the two parsing approaches. We highlight the most notable differences and refer to the supplementary material (Part B) for the full results.

Looking first at parts of speech, the baseline graph-based parser is slightly more accurate on verbs and nouns than its transition-based counterpart, which is consistent with the old study for verbs but not for nouns. After adding the deep contextualized word representations, both differences are essentially eliminated.

With regard to dependency relations, the baseline graph-based parser has better precision and recall than the baseline transition-based parser for the relations of coordination (conj), which is consistent with the old study, as well as clausal subjects (csubj) and clausal complements (ccomp), which are relations that involve verbs in clausal structures. Again, the differences are greatly reduced in the enhanced parsing models, especially for clausal complements, where the transition-based parser with ELMo representations is even slightly more accurate than the graph-based parser.

6.5 Sentence Length

Figure 5: Labeled attachment score by sentence length.

Figure 5 plots labeled attachment score for sentences of different lengths, measured by number of words in bins of 1–10, 11–20, etc. Here we find the most unexpected results of the study. First of all, although the baseline parsers exhibit the familiar pattern of accuracy decreasing with sentence length, it is not the transition-based but the graph-based parser that is more accurate on short sentences and degrades faster. In other words, although the transition-based parser still seems to suffer from search errors, as shown by the results on dependency length and distance to the root, it no longer seems to suffer from error propagation in the sense that earlier errors make later errors more probable. The most likely explanation for this is the improved training for transition-based parsers using dynamic oracles and aggressive exploration to learn how to behave optimally also in non-optimal configurations (Goldberg and Nivre, 2012, 2013; Kiperwasser and Goldberg, 2016).

Turning to the models with deep contextualized word representations, we find that transition-based and graph-based parsers behave more similarly, which is in line with our hypotheses. However, the most noteworthy result is that accuracy improves with increasing sentence length. For ELMo this holds only from 1–10 to 11–20, but for BERT it holds up to 21–30, and even sentences of length 31–40 are parsed with higher accuracy than sentences of length 1–10. A closer look at the breakdown per language reveals that this picture is slightly distorted by different sentence length distributions in different languages. More precisely, high-accuracy languages seem to have a higher proportion of sentences of mid-range length, causing a slight boost in the accuracy scores of these bins, and no single language exhibits exactly the patterns shown in Figure 5. Nevertheless, several languages exhibit an increase in accuracy from the first to the second bin or from the second to the third bin for one or more of the enhanced models (especially the BERT models). And almost all languages show a less steep degradation for the enhanced models, clearly indicating that deep contextualized word representations improve the capacity to parse longer sentences.

7 Conclusion

In this paper, we have essentially replicated the study of McDonald and Nivre (2007, 2011) for neural parsers. In the baseline setting, where parsers use pre-trained word embeddings and character representations fed through a BiLSTM, we can still discern the basic trade-off identified in the old study, with the transition-based parser suffering from search errors leading to lower accuracy on long dependencies and dependencies near the root of the tree. However, important details of the picture have changed. The graph-based parser is now as accurate as the transition-based parser on shorter dependencies and dependencies near the leaves of the tree, thanks to improved representation learning that overcomes the limited feature scope of the first-order model. And with respect to sentence length, the pattern has actually been reversed, with the graph-based parser being more accurate on short sentences and the transition-based parser gradually catching up thanks to new training methods that prevent error propagation.

When adding deep contextualized word representations, the behavior of the two parsers converges even more, and the transition-based parser in particular improves with respect to longer dependencies and dependencies near the root, as a result of fewer search errors thanks to enhanced information about the global sentence structure. One of the most striking results, however, is that both parsers improve their accuracy on longer sentences, with some models for some languages in fact being more accurate on medium-length sentences than on shorter sentences. This is a milestone in parsing research, and more research is needed to explain it.

In a broader perspective, we hope that future studies on dependency parsing will take the results obtained here into account and extend them by investigating other parsing approaches and neural network architectures. Indeed, given the rapid development of new representations and architectures, future work should include analyses of how all components in neural parsing architectures (embeddings, encoders, decoders) contribute to distinct error profiles (or lack thereof).

Acknowledgments

We want to thank Ali Basirat, Christian Hardmeier, Jamie Henderson, Ryan McDonald, Paola Merlo, Gongbo Tang, and the EMNLP reviewers and area chairs for valuable feedback on preliminary versions of this paper. We acknowledge the computational resources provided by CSC in Helsinki and Sigma2 in Oslo through NeIC-NLPL (www.nlpl.eu).


References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2442–2452.

Giuseppe Attardi, Felice Dell'Orletta, Maria Simi, and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of EVALITA 2009.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 349–359.

Ezra Black, Frederick Jelinek, John D. Lafferty, David M. Magerman, Robert L. Mercer, and Salim Roukos. 1992. Towards history-based grammars: Using richer models for probabilistic parsing. In Proceedings of the 5th DARPA Speech and Natural Language Workshop, pages 31–37.

Bernd Bohnet and Jonas Kuhn. 2012. The best of both worlds – a graph-based completion model for transition-based parsers. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 77–87.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pages 149–164.

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task of EMNLP-CoNLL 2007, pages 957–961.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 334–343.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), pages 340–345.

Agnieszka Falenska and Jonas Kuhn. 2019. The (non-)utility of structural features in BiLSTM-based dependency parsers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 117–128.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. CoRR, abs/1901.05287.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 959–976.

Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Martin Haspelmath, Matthew S. Dryer, David Gil, and Bernard Comrie. 2005. The World Atlas of Language Structures. Oxford University Press.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. 2001. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1077–1086.

Ganesh Jawahar, Benjamin Muller, Amal Fethi, Louis Martin, Éric Villemonte de la Clergerie, Benoît Sagot, and Djamé Seddah. 2018. ELMoLex: Connecting ELMo and lexicon features for dependency parsing. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 223–237.

Richard Johansson and Pierre Nugues. 2007. Incremental dependency parsing using online learning. In Proceedings of the CoNLL Shared Task of EMNLP-CoNLL 2007, pages 1134–1138.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.

Daniel Kondratyuk. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. CoRR, abs/1904.02099.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–11.

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1288–1298.

Marco Kuhlmann, Carlos Gómez-Rodríguez, and Giorgio Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 673–682.

Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017a. From raw text to Universal Dependencies – look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 207–217.

Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2017b. Arc-hybrid non-projective dependency parsing with a static-dynamic oracle. In Proceedings of the 15th International Conference on Parsing Technologies, pages 99–104.

Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2017c. Old school vs. new school: Comparing transition-based parsers with and without neural network enhancement. In Proceedings of the 15th Treebanks and Linguistic Theories Workshop (TLT).

KyungTae Lim, Cheoneum Park, Changki Lee, and Thierry Poibeau. 2018. SEx BiST: A multi-source trainable parser with deep contextualized lexical representations. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 143–152.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. CoRR, abs/1903.08855.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.

Ryan McDonald, Kevin Lerman, and Fernando Pereira. 2006. Multilingual dependency analysis with a two-stage discriminative parser. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pages 216–220.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 122–131.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, pages 197–230.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 81–88.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34:513–553.

Joakim Nivre. 2009. Non-projective dependency parsing in expected linear time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP), pages 351–359.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal Dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task of EMNLP-CoNLL 2007, pages 915–932.

Joakim Nivre, Johan Hall, Jens Nilsson, Gülşen Eryiğit, and Svetoslav Marinov. 2006. Labeled pseudo-projective dependency parsing with support vector machines. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pages 221–225.

Joakim Nivre and Ryan McDonald. 2008. Integrating graph-based and transition-based dependency parsers. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), pages 950–958.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99–106.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. In Proceedings of the 2018 CoNLL Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, page 160.

Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 129–132.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613.

Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, and Sara Stymne. 2018a. 82 treebanks, 34 models: Universal dependency parsing with multi-treebank models. In Proceedings of the 2018 CoNLL Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Aaron Smith, Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2018b. An investigation of the interactions between pre-trained word embeddings, character models and POS tags in dependency parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the 7th International Conference on Learning Representations.

Ivan Titov and James Henderson. 2007. A latent variable model for generative dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies (IWPT), pages 144–155.

Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4731–4736.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Jorn Veenstra and Walter Daelemans. 2000. A memory-based alternative for connectionist shift-reduce parsing. Technical Report ILK-0012, Tilburg University.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 7th International Conference on Learning Representations.

David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 323–333.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 195–206.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 Shared Task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Hao Zhang and Ryan McDonald. 2012. Generalized higher-order dependency parsing with cube pruning. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 320–331.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 562–571.

Yue Zhang and Joakim Nivre. 2011. Transition-based parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 188–193.

Yue Zhang and Joakim Nivre. 2012. Analyzing the effect of global learning and beam-search on transition-based dependency parsing. In Proceedings of COLING 2012: Posters, pages 1391–1400.