
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7035–7046, July 5-10, 2020. ©2020 Association for Computational Linguistics


Transition-based Semantic Dependency Parsing with Pointer Networks

Daniel Fernández-González and Carlos Gómez-Rodríguez
Universidade da Coruña, CITIC
FASTPARSE Lab, LyS Group

Depto. de Ciencias de la Computación y Tecnologías de la Información
Campus de Elviña, s/n, 15071 A Coruña, Spain

[email protected], [email protected]

Abstract

Transition-based parsers implemented with Pointer Networks have become the new state of the art in dependency parsing, excelling in producing labelled syntactic trees and outperforming graph-based models in this task. In order to further test the capabilities of these powerful neural networks on a harder NLP problem, we propose a transition system that, thanks to Pointer Networks, can straightforwardly produce labelled directed acyclic graphs and perform semantic dependency parsing. In addition, we enhance our approach with deep contextualized word embeddings extracted from BERT. The resulting system not only outperforms all existing transition-based models, but also matches the best fully-supervised accuracy to date on the SemEval 2015 Task 18 English datasets among previous state-of-the-art graph-based parsers.

1 Introduction

In dependency parsing, the syntactic structure of a sentence is represented by means of a labelled tree, where each word is forced to be attached exclusively to another that acts as its head. In contrast, semantic dependency parsing (SDP) (Oepen et al., 2014) aims to represent binary predicate-argument relations between words of a sentence, which requires producing a labelled directed acyclic graph (DAG): not only can semantic predicates have multiple or zero arguments, but words from the sentence can be attached as arguments to more than one head word (predicate), or they can be outside the SDP graph (being neither a predicate nor an argument), as shown in the examples in Figure 1. Since existing dependency parsers cannot be directly applied, most SDP research has focused on adapting them to deal with the absence of single-head and connectedness constraints and to produce an SDP graph instead.

Figure 1: Sentence from the SemEval 2015 Task 18 development set parsed with semantic dependencies following the DM, PAS and PSD formalisms.

As in dependency parsing, we can find two main families of approaches to efficiently generate accurate SDP graphs. On the one hand, graph-based algorithms have drawn more attention since adapting them to this task is relatively straightforward. In particular, these globally optimized methods independently score arcs (or sets of them) and then search for a high-scoring graph by combining these scores. From one of the first graph-based DAG parsers, proposed by McDonald and Pereira (2006), to the current state-of-the-art models (Wang et al., 2019; He and Choi, 2019), different graph-based SDP approaches have been presented, providing accuracies above their main competitors: transition-based DAG algorithms.

A transition-based parser generates a sequence of actions to incrementally build a valid graph (usually from left to right). This is typically done by local, greedy prediction and can efficiently parse a sentence in a linear or quadratic number of actions (transitions); however, the lack of global inference makes these parsers more prone to suffer from error propagation: i.e., since transitions are sequentially and locally predicted, an erroneous action can affect future predictions, having a significant impact in long sentences and making them, to date, less appealing for SDP. In fact, in recent years only a few contributions, such as the system developed by Wang et al. (2018), present a purely transition-based SDP parser. It is more common to find hybrid systems that combine transition-based approaches with graph-based techniques to alleviate the impact of error propagation on accuracy (Du et al., 2015), but this penalizes the efficiency provided by transition-based algorithms.

Away from the current mainstream, we present a purely transition-based parser that directly generates SDP graphs without the need for any additional techniques. We rely on Pointer Networks (Vinyals et al., 2015) to predict transitions that can attach multiple heads to the same word and incrementally build a labelled DAG. This kind of neural network provides an encoder-decoder architecture that is capable of capturing information from the whole sentence and previously created arcs, alleviating the impact of error propagation and already showing remarkable results in transition-based dependency parsing (Ma et al., 2018; Fernández-González and Gómez-Rodríguez, 2019). We further enhance our neural network with deep contextualized word embeddings extracted from the pre-trained language model BERT (Devlin et al., 2019).

The proposed SDP parser¹ can process sentences in SDP treebanks (where structures are sparse DAGs with a low in-degree) in O(n² log n) time, or O(n²) without cycle detection. This is more efficient than the current fully-supervised state-of-the-art system by Wang et al. (2019) (O(n³) without cycle detection), while matching its accuracy on the SemEval 2015 Task 18 datasets (Oepen et al., 2015). In addition, we also prove that our novel transition-based model provides promising accuracies in the semi-supervised scenario, achieving some state-of-the-art results.

¹ Source code available at https://github.com/danifg/SemanticPointer.

2 Related Work

An early approach to DAG parsing was implemented as a modification of a graph-based parser by McDonald and Pereira (2006). This produced DAGs using approximate inference by first finding a dependency tree, and then adding extra edges that would increase the graph's overall score. A few years later, this attempt was outperformed by the first transition-based DAG parser by Sagae and Tsujii (2008). They extended the existing transition system by Nivre (2003) to allow multiple heads per token. The resulting algorithm was not able to produce DAGs with crossing dependencies, requiring the pseudo-projective transformation by Nivre and Nilsson (2005) (plus a cycle removal procedure) as a post-processing stage.

More recently, there has been a predominance of purely graph-based DAG models since the SemEval 2015 Task 18 (Oepen et al., 2015). Almeida and Martins (2015) adapted the pre-deep-learning dependency parser by Martins et al. (2013) to produce SDP graphs. This graph-based parser encodes higher-order information with hand-crafted features and employs the AD³ algorithm (Martins et al., 2011) to find valid DAGs during decoding. This was extended by Peng et al. (2017) with BiLSTM-based feature extraction and multitask learning: the three formalisms considered in the shared task were jointly learned to improve final accuracy.

After the success of Dozat et al. (2017) in graph-based dependency parsing, Dozat and Manning (2018) proposed minor adaptations to use this biaffine neural architecture to produce SDP graphs. To that end, they removed the maximum spanning tree algorithm (Chu and Liu, 1965; Edmonds, 1967) necessary for decoding well-formed dependency trees and simply kept those edges with a positive score. In addition, they trained the unlabelled parser with a sigmoid cross-entropy loss (instead of the original softmax one) in order to accept multiple heads.

The parser by Dozat and Manning (2018) was recently improved by two contributions. Firstly, Wang et al. (2019) manage to add second-order information for score computation and then apply either mean field variational inference or loopy belief propagation to decode the highest-scoring SDP graph. While significantly boosting parsing accuracy, the original O(n²) runtime complexity is increased to O(n³) in the resulting SDP system. Secondly, He and Choi (2019) significantly improve the original parser's accuracy by not only using contextualized word embeddings extracted from BERT (Devlin et al., 2019), but also introducing contextual string embeddings (called Flair) (Akbik et al., 2018), which consist in a novel type of word vector representation based on character-level language modeling. Both extensions, (Wang et al., 2019) and (He and Choi, 2019), are currently the state of the art on the SemEval 2015 Task 18 in the fully-supervised and semi-supervised scenarios, respectively.

Kurita and Søgaard (2019) have also recently proposed a complex approach that iteratively applies the syntactic dependency parser by Zhang et al. (2017), sequentially building a DAG structure. At each iteration, the graph-based parser selects the highest-scoring arcs, keeping the single-head constraint. The process ends when no arcs are added in the last iteration. The combination of partial parses results in an SDP graph. Since the graph is built in a sequential process, they use reinforcement learning to guide the model through more optimal paths. Following Peng et al. (2017), multi-task learning is also added to boost final accuracy.

On the other hand, the use of transition-based algorithms for the SDP task had been less explored until very recently. Du et al. (2015) presented a voting-based ensemble of fourteen graph- and transition-based parsers. In their work, they noticed that individual graph-based models outperform transition-based algorithms, assigning higher weights to the former during voting. Among the transition systems used, we can find the one developed by Titov et al. (2009), which is not able to cover all SDP graphs.

We have to wait until the work by Wang et al. (2018) to see that a purely transition-based SDP parser (enhanced with a simple model ensemble technique) can achieve competitive results. They simply modified the preconditions of the complex transition system by Choi and McCallum (2013) to produce unrestricted DAG structures. In addition, their system was implemented by means of stack-LSTMs (Dyer et al., 2015), enhanced with BiLSTMs and Tree-LSTMs for feature extraction.

We are, to the best of our knowledge, the first to explore DAG parsing with Pointer Networks, proposing a purely transition-based algorithm that can be a competitive alternative to graph-based SDP models.

Finally, during the reviewing process of this work, the proceedings of the CoNLL 2019 shared task (Oepen et al., 2019) were released. In that event, SDP parsers were evaluated on updated versions of the SemEval 2015 Task 18 datasets, as well as on datasets in other semantic formalisms such as Abstract Meaning Representation (AMR) (Banarescu et al., 2013) and Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013). Although graph-based parsers achieved better accuracy in the SDP track, several BERT-enhanced transition-based approaches were proposed. Among them we can find an extension (Che et al., 2019) of the system by Wang et al. (2018), several adaptations for SDP (Hershcovich and Arviv, 2019; Bai and Zhao, 2019) of the transition-based UCCA parser by Hershcovich et al. (2017), as well as an SDP variant (Lai et al., 2019) of the constituent transition system introduced by Fernández-González and Gómez-Rodríguez (2019). Also in parallel to the development of this research, Zhang et al. (2019) proposed a transition-based parser that, while it can be applied to SDP, was specifically designed for AMR and UCCA parsing (where graph nodes do not correspond to words and must be generated during the parsing process). In particular, this approach incrementally builds a graph by predicting at each step a semantic relation composed of the target and source nodes plus the arc label. While this can be seen as an extension of our approach for those tasks where nodes must be generated, its complexity penalizes accuracy in the SDP task.

3 Multi-head Transition System

We design a novel transition system that is able to straightforwardly attach multiple heads to each word in a single pass, incrementally building, from left to right, a valid SDP graph: a labelled DAG.

To implement it, we use Pointer Networks (Vinyals et al., 2015). These neural networks are able to learn the conditional probability of a sequence of discrete numbers that correspond to positions in an input sequence and, at decoding time, perform as a pointer that selects a position from the input. In other words, we can train this neural network to, given a word, point to the position of the sentence where its head (Fernández-González and Gómez-Rodríguez, 2019) or dependent words (Ma et al., 2018) are located, depending on what interpretation we use during training. In particular, the approach of Fernández-González and Gómez-Rodríguez (2019) proved to be more suitable for dependency parsing than that of Ma et al. (2018), since it requires half as many steps to produce the same dependency parse, being not only faster, but also more accurate (as this mitigates the impact of error propagation).

Inspired by Fernández-González and Gómez-Rodríguez (2019), we train a Pointer Network to point to the head of a given word and propose an algorithm that does not use any kind of data structures (stack or buffer, required in classic transition-based parsers (Nivre, 2008)), but just a focus word pointer i for marking the word currently being processed. More in detail, given an input sentence of n words w_1, . . . , w_n, the parsing process starts with i pointing at the first word w_1. At each time step, the current focus word w_i is used by the Pointer Network to return a position p from the input sentence (or 0, where the ROOT node is located). This information is used to choose between the two available transitions:

• If p ≠ i, then the pointed word w_p is considered as a semantic head word (predicate) of w_i and an Attach-p transition is applied, creating the directed arc w_p → w_i. The Attach-p transition is only permissible if the resulting predicate-argument arc neither exists nor generates a cycle in the already-built graph, in order to output a valid DAG.

• On the contrary, if p = i (i.e., the model points to the current focus word), then w_i is considered to have found all its head words, and a Shift transition is chosen to move i one position to the right to process the next word w_{i+1}.

The parsing ends when the last word from the sentence is shifted, meaning that the input is completely processed. As stated by Ma et al. (2018) for attaching dependent words, it is necessary to fix the order in which (in our case, head) words are assigned in order to define a deterministic decoding. As the sentence is parsed in a left-to-right manner, we adopt the same order for head assignments. For instance, the SDP graph in Figure 1(a) is produced by the transition sequence described in Table 1. We just need n Shift transitions to move the focus word pointer through the whole sentence and m Attach-p transitions to create the m arcs present in the SDP graph.

It is worth mentioning that we manage to significantly reduce the number of transitions necessary for generating DAGs in comparison to those proposed in the complex transition systems by Choi and McCallum (2013) and Titov et al. (2009), used in the SDP systems by Wang et al. (2018) and Du et al. (2015), respectively. In addition, the described multi-head transition system is able to directly produce any DAG structure without exception, while some transition systems, such as the mentioned ones (Sagae and Tsujii, 2008; Titov et al., 2009), are limited to a subset of DAGs.

p    transition   focus word w_i    added arc
1    Shift        The1
1    Attach-1     results2          The1 → results2
4    Attach-4     results2          results2 ← in4
2    Shift        results2
3    Shift        were3
0    Attach-0     in4               ROOT0 → in4
6    Attach-6     in4               in4 ← with6
4    Shift        in4
4    Attach-4     line5             in4 → line5
5    Shift        line5
6    Shift        with6
7    Shift        analysts7
8    Shift        ’8
6    Attach-6     expectations9     with6 → expectations9
7    Attach-7     expectations9     analysts7 → expectations9
9    Shift        expectations9
10   Shift        .10

Table 1: Transition sequence for generating the SDP graph in Figure 1(a).

Finally, while the outcome of the proposed transition system is an SDP graph without cycles, in other research, such as (Kurita and Søgaard, 2019) and the state-of-the-art models by Dozat and Manning (2018) and Wang et al. (2019), the parser is not forced to produce well-formed DAGs, allowing the presence of cycles.
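To make the transition system concrete, the sketch below simulates it in Python. The `pointer` argument stands in for the trained Pointer Network and is assumed to return every input position (0 for ROOT plus 1..n, including the focus position itself) ranked by score; the acyclicity constraint is checked here with a naive reachability search rather than with the efficient data structures discussed in Section 5.4. Names are illustrative and do not correspond to the released implementation.

```python
# Illustrative simulation of the multi-head transition system (hypothetical names).
# `pointer(i)` plays the role of the Pointer Network: it returns all positions
# 0..n (0 = ROOT), ranked by score, for the current focus word i.
from typing import Callable, List, Set, Tuple


def creates_cycle(arcs: Set[Tuple[int, int]], head: int, dep: int) -> bool:
    """Would adding the arc head -> dep close a directed cycle?"""
    stack, seen = [dep], set()
    while stack:                          # naive reachability: is `head` reachable from `dep`?
        node = stack.pop()
        if node == head:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(d for (h, d) in arcs if h == node)
    return False


def parse(n: int, pointer: Callable[[int], List[int]]) -> Set[Tuple[int, int]]:
    """Parse a sentence of n words (1..n) into a set of (head, dependent) arcs."""
    arcs: Set[Tuple[int, int]] = set()
    i = 1                                 # focus word pointer
    while i <= n:
        for p in pointer(i):              # positions in decreasing score order
            if p == i:                    # Shift: all heads of w_i have been found
                i += 1
                break
            if (p, i) not in arcs and not creates_cycle(arcs, p, i):
                arcs.add((p, i))          # Attach-p: create the arc w_p -> w_i
                break
            # otherwise the attachment is forbidden: try the next-best position
    return arcs
```

Because the focus position i itself is always a legal prediction, the loop is guaranteed to terminate after n Shift and m Attach-p transitions.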

4 Neural Network Architecture

4.1 Basic Approach

Vinyals et al. (2015) introduced an encoder-decoder architecture, called Pointer Network, that uses a mechanism of neural attention (Bahdanau et al., 2014) to select positions from the input sequence, without requiring a fixed size of the output dictionary. This allows Pointer Networks to easily address those problems where the target classes considered at each step are variable and depend on the length of the input sequence. We prove that implementing the transition system previously defined on this neural network results in an accurate SDP system.

We follow previous work in dependency parsing (Ma et al., 2018; Fernández-González and Gómez-Rodríguez, 2019) to design our neural architecture:

Encoder. A BiLSTM-CNN architecture (Ma and Hovy, 2016) is used to encode the input sentence w_1, . . . , w_n, word by word, into a sequence of encoder hidden states h_1, . . . , h_n. CNNs with max pooling are used for extracting character-level representations of words and, then, each word w_i is represented by the concatenation of character (e^c_i), word (e^w_i), lemma (e^l_i) and POS tag (e^p_i) embeddings:

x_i = e^c_i ⊕ e^w_i ⊕ e^l_i ⊕ e^p_i

After that, the x_i of each word w_i is fed one by one into a BiLSTM that captures context information in both directions and generates a vector representation h_i:

h_i = h^l_i ⊕ h^r_i = BiLSTM(x_i)

In addition, a special vector representation h_0, denoting the ROOT node, is prepended at the beginning of the sequence of encoder hidden states.
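A minimal PyTorch sketch of this encoder, using the hyper-parameter values from Table 2 but reducing the character-level CNN to a single convolution with max pooling; module names and the exact layout are illustrative assumptions, not the released code.

```python
# Illustrative encoder: char-CNN + word/lemma/POS embeddings -> 3-layer BiLSTM,
# with a learned vector prepended for the ROOT node (dropout omitted for brevity).
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, n_words, n_lemmas, n_tags, n_chars,
                 emb_dim=100, char_filters=50, hidden=512):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)
        self.lemma_emb = nn.Embedding(n_lemmas, emb_dim)
        self.tag_emb = nn.Embedding(n_tags, emb_dim)
        self.char_emb = nn.Embedding(n_chars, emb_dim)
        self.char_cnn = nn.Conv1d(emb_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(3 * emb_dim + char_filters, hidden,
                              num_layers=3, batch_first=True, bidirectional=True)
        self.root = nn.Parameter(torch.randn(2 * hidden))  # h_0 for the ROOT node

    def forward(self, words, lemmas, tags, chars):
        # chars: (batch, seq_len, max_word_len) character indices per word
        b, n, c = chars.shape
        ch = self.char_emb(chars).view(b * n, c, -1).transpose(1, 2)
        e_c = self.char_cnn(ch).max(dim=2).values.view(b, n, -1)   # max pooling
        x = torch.cat([e_c, self.word_emb(words), self.lemma_emb(lemmas),
                       self.tag_emb(tags)], dim=-1)                 # x_i = e^c ⊕ e^w ⊕ e^l ⊕ e^p
        h, _ = self.bilstm(x)                                       # h_i = h^l_i ⊕ h^r_i
        root = self.root.expand(b, 1, -1)
        return torch.cat([root, h], dim=1)                          # prepend h_0 (ROOT)
```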

Decoder. An LSTM is used to output, at each time step t, a decoder hidden state s_t. As input to the decoder, we use the encoder hidden state h_i of the current focus word w_i plus extra high-order features. In particular, we take into account the hidden state of the last head word (h_h) attached to w_i, which will be a co-parent of a future predicate assigned to w_i. Following Ma et al. (2018), we use element-wise sum to add this information without increasing the dimensionality of the input:

r_i = h_i + h_h ;  s_t = LSTM(r_i)

Note that feature information like this can be easily added in transition-based models without increasing the parser's runtime complexity, something that does not happen in graph-based models, where, for instance, the second-order features added by Wang et al. (2019) penalize runtime complexity.

We experimented with other high-order features such as grandparent or sibling information of the current focus word w_i, but no significant improvements were obtained from their addition, so they were discarded for simplicity. Further feature exploration might improve parser performance, but we leave this for future work.

Once s_t is generated, the attention vector a_t, which will work as a pointer over the input, must be computed in the pointer layer. First, following the previously cited work, the scores between s_t and each encoder hidden representation h_j from the input sentence are computed using the biaffine attention scoring function of Dozat and Manning (2017):

v_{tj} = score(s_t, h_j) = f_1(s_t)^T W f_2(h_j) + U^T f_1(s_t) + V^T f_2(h_j) + b

where the parameter W is the weight matrix of the bilinear term, U and V are the weight tensors of the linear terms and b is the bias vector. In addition, f_1(·) and f_2(·) are two single-layer multilayer perceptrons (MLPs) with ELU activation, proposed by Dozat and Manning (2017) for reducing dimensionality and minimizing overfitting.

Then, a softmax is applied on the resulting score vector v_t to compute a probability distribution over the input words:

a_t = softmax(v_t)
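A possible PyTorch rendering of this pointer layer, scoring one decoder state s_t against all encoder states (including h_0 for ROOT) at once; parameter names follow the formula above, while the module itself is an illustrative assumption (initialization details are simplified).

```python
# Illustrative biaffine pointer: two one-layer MLPs with ELU reduce s_t and
# every h_j, then the bilinear and linear terms give one score per position
# and a softmax turns them into the attention vector a_t.
import torch
import torch.nn as nn


class BiaffinePointer(nn.Module):
    def __init__(self, dec_dim, enc_dim, mlp_dim=512):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(dec_dim, mlp_dim), nn.ELU())
        self.f2 = nn.Sequential(nn.Linear(enc_dim, mlp_dim), nn.ELU())
        self.W = nn.Parameter(torch.zeros(mlp_dim, mlp_dim))  # bilinear weights
        self.U = nn.Parameter(torch.zeros(mlp_dim))           # linear term on f1(s_t)
        self.V = nn.Parameter(torch.zeros(mlp_dim))           # linear term on f2(h_j)
        self.b = nn.Parameter(torch.zeros(1))                  # bias

    def forward(self, s_t, h):
        # s_t: (dec_dim,)   h: (n + 1, enc_dim), encoder states with ROOT at index 0
        q = self.f1(s_t)                              # (mlp_dim,)
        k = self.f2(h)                                # (n + 1, mlp_dim)
        v = k @ (self.W.t() @ q)                      # f1(s_t)^T W f2(h_j) for every j
        v = v + self.U @ q + k @ self.V + self.b      # add linear terms and bias
        return torch.softmax(v, dim=0)                # attention vector a_t
```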

The resulting attention vector a_t can now be used as a pointer to select the highest-scoring position p from the input. This information will be employed by the transition system to choose between the two available actions and create a predicate-argument relation between w_p and w_i (Attach-p) or move the focus word pointer to w_{i+1} (Shift). In case the chosen Attach-p is forbidden due to the acyclicity constraint, the next highest-scoring position in a_t is considered as output instead. Figure 2 depicts the neural architecture and the decoding procedure for the SDP structure in Figure 1(a).

Label prediction. We jointly train a multi-class classifier that scores every label for each pair of words. This classifier shares the same encoder and uses the same biaffine attention function as the pointer:

s^l_{tp} = score(s_t, h_p, l) = g_1(s_t)^T W_l g_2(h_p) + U_l^T g_1(s_t) + V_l^T g_2(h_p) + b_l

where a distinct weight matrix W_l, weight tensors U_l and V_l and bias b_l are used for each label l, where l ∈ {1, 2, . . . , L} and L is the number of labels. In addition, g_1(·) and g_2(·) are two single-layer MLPs with ELU activation.

The scoring function is applied over each predicted arc between the dependent word w_i (represented by s_t) and the pointed head word w_p in position p (represented by h_p) to compute the score of each possible label and assign the highest-scoring one.
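The labeler admits a similar sketch, with one set of biaffine parameters per label; again, the module is illustrative rather than the released implementation.

```python
# Illustrative labeler: one weight matrix, two weight vectors and a bias per
# label; the biaffine form mirrors the pointer's scoring function.
import torch
import torch.nn as nn


class BiaffineLabeler(nn.Module):
    def __init__(self, dec_dim, enc_dim, n_labels, mlp_dim=128):
        super().__init__()
        self.g1 = nn.Sequential(nn.Linear(dec_dim, mlp_dim), nn.ELU())
        self.g2 = nn.Sequential(nn.Linear(enc_dim, mlp_dim), nn.ELU())
        self.W = nn.Parameter(torch.zeros(n_labels, mlp_dim, mlp_dim))
        self.U = nn.Parameter(torch.zeros(n_labels, mlp_dim))
        self.V = nn.Parameter(torch.zeros(n_labels, mlp_dim))
        self.b = nn.Parameter(torch.zeros(n_labels))

    def forward(self, s_t, h_p):
        # s_t: decoder state of the dependent w_i, h_p: encoder state of the head w_p
        q, k = self.g1(s_t), self.g2(h_p)                        # (mlp_dim,) each
        bilinear = torch.einsum('i,lij,j->l', q, self.W, k)      # g1(s_t)^T W_l g2(h_p)
        scores = bilinear + self.U @ q + self.V @ k + self.b     # one score per label
        return scores                                             # argmax gives the label
```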

Figure 2: Neural network architecture and decoding steps to partially parse the SDP graph in Figure 1.

Training objectives. The Pointer Network is trained to minimize the negative log likelihood (implemented as cross-entropy loss) of producing the correct SDP graph y for a given sentence x: P_θ(y|x). Let y be a DAG for an input sentence x that is decomposed into a set of m directed arcs a_1, . . . , a_m following a left-to-right order. This probability can be factorized as follows:

P_θ(y|x) = ∏_{k=1}^{m} P_θ(a_k | a_{<k}, x)

where a_{<k} denotes previously predicted arcs.

On the other hand, the labeler is trained with softmax cross-entropy to minimize the negative log likelihood of assigning the correct label l, given a dependency arc with head word w_h and dependent word w_i.

The whole neural model is jointly trained by summing the parser and labeler losses prior to computing the gradients. In that way, model parameters are learned to minimize the sum of the cross-entropy loss objectives over the whole corpus.
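Under these definitions, the joint objective can be sketched as the sum of two cross-entropy terms, assuming the gold transition sequence and gold arc labels have been precomputed for each training sentence (names are illustrative):

```python
# Illustrative joint objective: the pointer's arc loss and the labeler's loss
# are both cross-entropy terms and are summed before backpropagation.
import torch.nn.functional as F


def joint_loss(arc_scores, gold_positions, label_scores, gold_labels):
    # arc_scores:     (steps, n + 1) unnormalised pointer scores, one row per transition
    # gold_positions: (steps,)       gold pointer position (p = i encodes a Shift)
    # label_scores:   (arcs, L)      label scores for every gold arc
    # gold_labels:    (arcs,)        gold label index for every gold arc
    arc_loss = F.cross_entropy(arc_scores, gold_positions)    # -sum_k log P(a_k | a_<k, x)
    label_loss = F.cross_entropy(label_scores, gold_labels)
    return arc_loss + label_loss
```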

4.2 Deep Contextualized Word Embeddings Augmentation

In order to further improve the accuracy of our approach, we augment our model with deep contextualized word embeddings provided by the widely-used pre-trained language model BERT (Devlin et al., 2019).

Instead of including and training the whole BERT model as the encoder of our system, we follow the common, greener and more cost-effective approach of leveraging the potential of BERT by extracting the weights of one or several layers as word-level embeddings. To that end, the pre-trained uncased BERTBASE model is used.

Since BERT is trained on subwords (i.e., substrings of the original token), we take the 768-dimensional vector of each subword of an input token and use the average embedding as the final representation e^BERT_i. Finally, this is directly concatenated to the resulting basic word representation before feeding the BiLSTM-based encoder:

x'_i = x_i ⊕ e^BERT_i ;  h_i = BiLSTM(x'_i)

Higher performance can be achieved by summing or concatenating (depending on the task) several layers of BERT; however, exploring these combinations is out of the scope of this paper and we simply use embeddings extracted from the second-to-last hidden layer (since the last layer is biased towards the target objectives used to train BERT's language model).
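Assuming the HuggingFace transformers interface (the paper does not prescribe a particular toolkit), the extraction of these frozen embeddings can be sketched as follows: subword vectors from the second-to-last hidden layer are averaged into one 768-dimensional vector per original token.

```python
# Illustrative extraction of frozen BERT embeddings with HuggingFace transformers.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()


def bert_embeddings(tokens):
    """tokens: list of words of one sentence -> (len(tokens), 768) tensor."""
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).hidden_states[-2][0]    # second-to-last hidden layer
    word_ids = enc.word_ids(0)                        # subword -> word alignment
    vectors = []
    for i in range(len(tokens)):
        idx = [j for j, w in enumerate(word_ids) if w == i]
        vectors.append(hidden[idx].mean(dim=0))       # average the subword vectors
    return torch.stack(vectors)                        # e^BERT_i for each word
```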

5 Experiments

5.1 Data

In order to test the proposed approach, we conduct experiments on the SemEval 2015 Task 18 English datasets (Oepen et al., 2015), where all sentences are annotated with three different formalisms: DELPH-IN MRS (DM) (Flickinger et al., 2012), Predicate-Argument Structure (PAS) (Miyao and Tsujii, 2004) and Prague Semantic Dependencies (PSD) (Hajič et al., 2012). The standard split, as in previous work (Almeida and Martins, 2015; Du et al., 2015), results in 33,964 training sentences from Sections 00-19 of the Wall Street Journal corpus (Marcus et al., 1993), 1,692 development sentences from Section 20, 1,410 sentences from Section 21 as in-domain test set, and 1,849 sentences sampled from the Brown Corpus (Francis and Kucera, 1982) as out-of-domain test data. For the evaluation, we use the official script,² reporting labelled F-measure scores (LF1) (including ROOT arcs) on the in-domain (ID) and out-of-domain (OOD) test sets for each formalism as well as the macro-average over the three of them.
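For reference, the labelled F1 reported by the official script can be sketched as below, with arcs represented as (head, dependent, label) triples; the label strings in the example are illustrative only, and the toolkit itself also reports additional statistics (e.g., unlabelled scores).

```python
# Illustrative labelled precision/recall/F1 over (head, dependent, label) triples.
def labelled_f1(gold_arcs, pred_arcs):
    gold, pred = set(gold_arcs), set(pred_arcs)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A mislabelled arc counts against both precision and recall:
gold = {(0, 4, "ROOT"), (1, 2, "label-a"), (4, 2, "label-b")}
pred = {(0, 4, "ROOT"), (1, 2, "label-a"), (4, 2, "label-c")}
print(round(labelled_f1(gold, pred), 3))  # 0.667
```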

Architecture hyper-parameters
CNN window size: 3
CNN number of filters: 50
BiLSTM encoder layers: 3
BiLSTM encoder size: 512
LSTM decoder layers: 1
LSTM decoder size: 512
LSTM layers dropout: 0.33
Word/POS/Char./Lemma embedding dimension: 100
BERT embedding dimension: 768
Embeddings dropout: 0.33
MLP layers: 1
MLP activation function: ELU
Arc MLP size: 512
Label MLP size: 128
UNK replacement probability: 0.5

Adam optimizer hyper-parameters
Initial learning rate: 0.001
β1, β2: 0.9
Batch size: 32
Decay rate: 0.75
Gradient clipping: 5.0

Table 2: Model hyper-parameters.

5.2 Settings

We use the Adam optimizer (Kingma and Ba, 2014) and follow Ma et al. (2018) and Dozat and Manning (2017) for parameter optimization. We do not specifically perform hyper-parameter selection for SDP and just adopt those proposed by Ma et al. (2018) for syntactic dependency parsing (detailed in Table 2). For initializing word and lemma vectors, we use the pre-trained structured-skipgram embeddings developed by Ling et al. (2015). POS tag and character embeddings are randomly initialized and all embeddings (except the deep contextualized ones) are fine-tuned during training. Due to random initialization, we report average accuracy over 5 repetitions for each experiment. In addition, during a 500-epoch training, the model with the highest labelled F-score on the development set is chosen. Finally, while further beam-size exploration might improve accuracy, we use beam-search decoding with beam size 5 in all experiments.

² https://github.com/semantic-dependency-parsing/toolkit

5.3 Results and Discussion

Table 3 reports the accuracy obtained by the state-of-the-art SDP parsers described in Section 2 in comparison to our approach. To perform a fair comparison, we group SDP systems in three blocks depending on the embeddings provided to the architecture: (1) just basic pre-trained word and POS tag embeddings, (2) augmentation with character and pre-trained lemma embeddings, and (3) augmentation with pre-trained deep contextualized embeddings. As shown by these results, our approach outperforms all existing transition-based models and the widely-used approach by Dozat and Manning (2018) with or without character and lemma embeddings, and it is on par with the best graph-based SDP parser by Wang et al. (2019) on average in the fully-supervised scenario.³

In addition, our model achieves the best fully-supervised accuracy to date on the PSD formalism, considered the hardest to parse. We hypothesize that this might be explained by the fact that the PSD formalism is the most tree-oriented (as pointed out by Oepen et al. (2015)) and presents a lower ratio of arcs per sentence, making it more suitable for our transition-based approach.

In the semi-supervised scenario, BERT-based embeddings proved to be more beneficial for the out-of-domain data. In fact, while not being a fair comparison, since we neither include contextual string embeddings (Flair) (Akbik et al., 2018) nor explore different BERT layer combinations, our new transition-based parser manages to outperform the state-of-the-art system by He and Choi (2019)⁴ on average on the out-of-domain test set, obtaining a remarkable accuracy on the PSD formalism.

³ It is common practice in the literature that systems that only use standard pre-trained word or lemma embeddings are classed as fully-supervised models, even though, strictly, they are not trained exclusively on the official training data.

⁴ He and Choi (2019) do not specify in their paper the BERT layer configuration used for generating the word embeddings.

Parser | DM ID/OOD | PAS ID/OOD | PSD ID/OOD | Avg ID/OOD
Du et al. (2015) Tb Gb +Ens | 89.1/81.8 | 91.3/87.2 | 75.7/73.3 | 85.3/80.8
Almeida and Martins (2015) Gb | 88.2/81.8 | 90.9/86.9 | 76.4/74.8 | 85.2/81.2
Peng et al. (2017) Gb | 89.4/84.5 | 92.2/88.3 | 77.6/75.3 | 86.4/82.7
Peng et al. (2017) Gb+MT | 90.4/85.3 | 92.7/89.0 | 78.5/76.4 | 87.2/83.6
Wang et al. (2018) Tb | 89.3/83.2 | 91.4/87.2 | 76.1/73.2 | 85.6/81.2
Wang et al. (2018) Tb+Ens | 90.3/84.9 | 91.7/87.6 | 78.6/75.9 | 86.9/82.8
Dozat and Manning (2018) Gb | 91.4/86.9 | 93.9/90.8 | 79.1/77.5 | 88.1/85.0
Kurita and Søgaard (2019) Gb | 91.1/- | 92.4/- | 78.6/- | 87.4/-
Kurita and Søgaard (2019) Gb+MT+RL | 91.2/- | 92.9/- | 78.8/- | 87.6/-
Wang et al. (2019) Gb | 93.0/88.4 | 94.3/91.5 | 80.9/78.9 | 89.4/86.3
This work Tb | 92.5/87.7 | 94.2/91.0 | 81.0/78.7 | 89.2/85.8

Dozat and Manning (2018) Gb+char+lemma | 93.7/88.9 | 93.9/90.6 | 81.0/79.4 | 89.5/86.3
Kurita and Søgaard (2019) Gb+MT+RL+lemma | 92.0/87.2 | 92.8/88.8 | 79.3/77.7 | 88.0/84.6
Wang et al. (2019) Gb+char+lemma | 94.0/89.7 | 94.1/91.3 | 81.4/79.6 | 89.8/86.9
This work Tb+char+lemma | 93.9/89.6 | 94.2/91.2 | 81.8/79.8 | 90.0/86.9

Zhang et al. (2019) Tb+char+BERTLARGE | 92.2/87.1 | -/- | -/- | -/-
He and Choi (2019) Gb+lemma+Flair+BERTBASE | 94.6/90.8 | 96.1/94.4 | 86.8/79.5 | 92.5/88.2
This work Tb+char+lemma+BERTBASE | 94.4/91.0 | 95.1/93.4 | 82.6/82.0 | 90.7/88.8

Table 3: Accuracy comparison of state-of-the-art SDP parsers on the SemEval 2015 Task 18 datasets. Gb and Tb stand for graph- and transition-based models, +char and +lemma for augmentations with character-level and lemma embeddings, +Flair and +BERTBASE|LARGE for augmentations with deep contextualized character-level and word-level embeddings, and, finally, +MT, +RL and +Ens for the application of multi-task learning, reinforcement learning and ensemble techniques.

5.4 Complexity

Given a sentence with length n whose SDP graph has m arcs, the proposed transition system requires n Shift plus m Attach-p transitions to parse it. Therefore, since a DAG can have at most Θ(n²) edges (as is also the case for general directed graphs), it could potentially need O(n²) transitions in the worst case. However, we prove that this does not happen in practice and real sentences can be parsed with O(n) transitions instead.

The parsing complexity of a transition-based dependency parsing algorithm can be determined by the number of transitions performed with respect to the number of words in a sentence (Kübler et al., 2009). Therefore, we measure the transition sequence length predicted by the system to analyze every sentence from the development sets of the three available formalisms and depict the relation between sequence length and sentence length. As shown in Figure 3, a linear behavior is observed in all cases, proving that the number of Attach-p transitions evaluated by the model at each step is considerably low (behaving practically like a constant).

This can be explained by the fact that, on average on the training set, the ratio of predicate-argument dependencies per word in a sentence is 0.79 in DM, 0.99 in PAS and 0.70 in PSD, meaning that the transition sequence necessary for parsing a given sentence will need no more Attach-p transitions than Shift ones (which are one per word in the sentence). It is true that one argument can be attached to more than one predicate; however, the amount of words unattached in the resulting DAG (singletons)⁵ can be significant in some formalisms (as described graphically in Figure 1): on average on the training set, 23% of words per sentence in DM, 6% in PAS and 35% in PSD. In addition, the edge density on non-singleton words, computed by Oepen et al. (2015) on the test sets, also backs the linear behavior shown in our experiments: 0.96 in DM, 1.02 in PAS and 1.01 in PSD for the in-domain set and 0.95 in DM, 1.02 in PAS and 0.99 in PSD for the out-of-domain data. In conclusion, we can state that, on the datasets tested, the proposed transition system executes O(n) transitions.

To determine the runtime complexity of the implementation of the transition system, we need to consider the following: firstly, at each transition, the attention vector a_t needs to be computed, which means that each of the O(n) transitions takes O(n) time to run. Therefore, the overall time complexity of the parser, ignoring cycle detection, is O(n²). Note that this is in contrast to algorithms like that of Wang et al. (2019), which takes cubic time even though it does not enforce acyclicity.

⁵ A singleton is a word that has neither incoming nor outgoing edges.


Figure 3: Number of predicted transitions relative to the length of the sentence, for the three SDP formalisms on the development set from SemEval 2015 Task 18.

If we add cycle detection, needed to forbid transitions that would create cycles and therefore to enforce that the output is a DAG, then the complexity becomes O(n² log n). This is because an efficient implementation of cycle detection contributes an additive factor of O(n² log n) to the worst-case time complexity, which becomes the dominant factor. To achieve this efficient implementation, we incrementally keep two data structures: on the one hand, we keep track of weakly connected components using path compression and union by rank, which can be done in inverse Ackermann time, as is commonly done for cycle detection in tree and forest parsers (Covington, 2001; Gómez-Rodríguez and Nivre, 2010). On the other hand, we keep a weak topological numbering of the graph using the algorithm by Bender et al. (2015), which takes overall O(n² log n) time over all edge insertions. When these two data structures are kept, cycles can be checked in constant time: an arc a → b creates a cycle if the involved nodes are in the same weakly connected component and a has a greater topological number than b.
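The following sketch illustrates the check described above: union-find with path compression and union by rank tracks weakly connected components, and a topological numbering of the current DAG allows the constant-time test. For brevity, the numbering is recomputed with a full topological sort after every insertion instead of being maintained incrementally with the algorithm of Bender et al. (2015), so the sketch does not achieve the stated O(n² log n) bound; names are illustrative.

```python
# Illustrative cycle check for arc insertions into a growing DAG.
from collections import defaultdict


class DAGBuilder:
    def __init__(self, n):
        self.parent = list(range(n + 1))            # union-find over nodes 0..n
        self.rank = [0] * (n + 1)
        self.order = {v: v for v in range(n + 1)}   # topological number of each node
        self.succ = defaultdict(set)                 # adjacency list of the DAG

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x

    def _union(self, x, y):
        rx, ry = self._find(x), self._find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx                                  # union by rank
        self.parent[ry] = rx
        self.rank[rx] += self.rank[rx] == self.rank[ry]

    def creates_cycle(self, head, dep):
        # constant-time test: same weakly connected component and head numbered after dep
        return (self._find(head) == self._find(dep)
                and self.order[head] > self.order[dep])

    def add_arc(self, head, dep):
        if self.creates_cycle(head, dep):
            return False
        self.succ[head].add(dep)
        self._union(head, dep)
        self._retopsort()          # Bender et al. (2015) would update this incrementally
        return True

    def _retopsort(self):
        # Kahn's algorithm over the current DAG (non-incremental, for clarity)
        indeg = defaultdict(int)
        for h in self.succ:
            for d in self.succ[h]:
                indeg[d] += 1
        queue = [v for v in list(self.order) if indeg[v] == 0]
        num = 0
        while queue:
            v = queue.pop()
            self.order[v] = num
            num += 1
            for d in self.succ[v]:
                indeg[d] -= 1
                if indeg[d] == 0:
                    queue.append(d)
```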

Therefore, the overall expected worst-case running time of the proposed SDP system is O(n² log n) for the range of data attested in the experiments, and it can be lowered to O(n²) if we are willing to forgo enforcing acyclicity.

6 Conclusions and Future work

Our multi-head transition system can accurately parse a sentence in quadratic worst-case runtime thanks to Pointer Networks. While being more efficient, our approach outperforms the previous state-of-the-art parser by Dozat and Manning (2018) and matches the accuracy of the best model to date (Wang et al., 2019), proving that, with a state-of-the-art neural architecture, transition-based SDP parsers are a competitive alternative.

By adding BERT-based embeddings, we significantly improve our model's accuracy while only marginally affecting computational cost, achieving state-of-the-art F-scores on the out-of-domain test sets.

Despite the promising results, the accuracy of our approach could probably be boosted further by experimenting with new feature information and specifically tuning hyper-parameters for the SDP task, as well as by using different enhancements such as implementing the hierarchical decoding recently presented by Liu et al. (2019), including contextual string embeddings (Akbik et al., 2018) like He and Choi (2019), or applying multi-task learning across the three formalisms like Peng et al. (2017).

Acknowledgments

This work has received funding from the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), from the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and from Xunta de Galicia (ED431B 2017/01, ED431G 2019/01).


References

Omri Abend and Ari Rappoport. 2013. Universal conceptual cognitive annotation (UCCA). In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–238, Sofia, Bulgaria. Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Mariana S. C. Almeida and André F. T. Martins. 2015. Lisbon: Evaluating TurboSemanticParser on multiple languages and out-of-domain data. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 970–973, Denver, Colorado. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Hongxiao Bai and Hai Zhao. 2019. SJTU at MRP 2019: A transition-based multi-task parser for cross-framework meaning representation parsing. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, pages 86–94, Hong Kong. Association for Computational Linguistics.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.

Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Robert E. Tarjan. 2015. A new approach to incremental cycle detection and related problems. ACM Trans. Algorithms, 12(2):14:1–14:22.

Wanxiang Che, Longxu Dou, Yang Xu, Yuxuan Wang, Yijia Liu, and Ting Liu. 2019. HIT-SCIR at MRP 2019: A unified pipeline for meaning representation parsing via efficient training and effective encoding. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, pages 76–85, Hong Kong. Association for Computational Linguistics.

Jinho D. Choi and Andrew McCallum. 2013. Transition-based dependency parsing with selectional branching. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1052–1062, Sofia, Bulgaria. Association for Computational Linguistics.

Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Michael A. Covington. 2001. A fundamental algorithm for dependency parsing. In Proceedings of the 39th Annual ACM Southeast Conference, pages 95–102.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR. OpenReview.net.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484–490, Melbourne, Australia. Association for Computational Linguistics.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30, Vancouver, Canada. Association for Computational Linguistics.

Yantao Du, Fan Zhang, Xun Zhang, Weiwei Sun, and Xiaojun Wan. 2015. Peking: Building semantic dependency graphs with a hybrid parser. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 927–931, Denver, Colorado. Association for Computational Linguistics.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.


Daniel Fernández-González and Carlos Gómez-Rodríguez. 2019. Faster shift-reduce constituent parsing with a non-binary, bottom-up strategy. Artificial Intelligence, 275:559–574.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2019. Left-to-right dependency parsing with pointer networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Volume 1 (Long and Short Papers), pages 710–716, Minneapolis, Minnesota. Association for Computational Linguistics.

Dan Flickinger, Yi Zhang, and Valia Kordoni. 2012. DeepBank: a dynamically annotated treebank of the Wall Street Journal. In Proceedings of the Eleventh International Workshop on Treebanks and Linguistic Theories (TLT11), pages 85–96, Lisbon. HU.

W. N. Francis and H. Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Usage. Houghton Mifflin.

Carlos Gómez-Rodríguez and Joakim Nivre. 2010. A transition-based parser for 2-planar dependency structures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1492–1501, Uppsala, Sweden. Association for Computational Linguistics.

Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English dependency treebank 2.0. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3153–3160, Istanbul, Turkey. European Language Resources Association (ELRA).

Han He and Jinho D. Choi. 2019. Establishing strong baselines for the new decade: Sequence tagging, syntactic and semantic parsing with BERT. ArXiv, abs/1908.04943.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1127–1138, Vancouver, Canada. Association for Computational Linguistics.

Daniel Hershcovich and Ofir Arviv. 2019. TUPA at MRP 2019: A multi-task baseline system. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, pages 28–39, Hong Kong. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.

Sandra Kübler, Ryan T. McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Shuhei Kurita and Anders Søgaard. 2019. Multi-task semantic dependency parsing with policy gradient for learning easy-first strategies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2420–2430, Florence, Italy. Association for Computational Linguistics.

Sunny Lai, Chun Hei Lo, Kwong Sak Leung, and Yee Leung. 2019. CUHK at MRP 2019: Transition-based parser with cross-framework variable-arity resolve action. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, pages 104–113, Hong Kong. Association for Computational Linguistics.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Two/too simple adaptations of Word2Vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1299–1304, Denver, Colorado. Association for Computational Linguistics.

Linlin Liu, Xiang Lin, Shafiq Joty, Simeng Han, and Lidong Bing. 2019. Hierarchical pointer net parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.

Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard H. Hovy. 2018. Stack-pointer networks for dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1403–1414.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

André Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 617–622, Sofia, Bulgaria. Association for Computational Linguistics.


André Martins, Noah Smith, Mário Figueiredo, and Pedro Aguiar. 2011. Dual decomposition with many overlapping components. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 238–249, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.

Yusuke Miyao and Jun'ichi Tsujii. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 1392–1398, Geneva, Switzerland. COLING.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the Eighth International Conference on Parsing Technologies, pages 149–160, Nancy, France.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34:513–553.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 99–106, Ann Arbor, Michigan. Association for Computational Linguistics.

Stephan Oepen, Omri Abend, Jan Hajič, Daniel Hershcovich, Marco Kuhlmann, Tim O'Gorman, Nianwen Xue, Jayeol Chun, Milan Straka, and Zdeňka Urešová. 2019. MRP 2019: Cross-framework meaning representation parsing. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, pages 1–27, Hong Kong. Association for Computational Linguistics.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 915–926, Denver, Colorado. Association for Computational Linguistics.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Yi Zhang. 2014. SemEval 2014 task 8: Broad-coverage semantic dependency parsing. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 63–72, Dublin, Ireland. Association for Computational Linguistics.

Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2037–2048, Vancouver, Canada. Association for Computational Linguistics.

Kenji Sagae and Jun'ichi Tsujii. 2008. Shift-reduce dependency DAG parsing. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 753–760, Manchester, UK. Coling 2008 Organizing Committee.

Ivan Titov, James Henderson, Paola Merlo, and Gabriele Musillo. 2009. Online graph planarization for synchronous parsing of semantic and syntactic dependencies. In International Joint Conferences on Artificial Intelligence (IJCAI-09).

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Xinyu Wang, Jingxian Huang, and Kewei Tu. 2019. Second-order semantic dependency parsing with end-to-end neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4609–4618, Florence, Italy. Association for Computational Linguistics.

Yuxuan Wang, Wanxiang Che, Jiang Guo, and Ting Liu. 2018. A neural transition-based approach for semantic dependency graph parsing. In Proc. AAAI.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019. Broad-coverage semantic parsing as transduction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3786–3798, Hong Kong, China. Association for Computational Linguistics.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency parsing as head selection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 665–676.