
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1006–1017, Vancouver, Canada, July 30 – August 4, 2017. ©2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-1093

Adversarial Connective-exploiting Networks for Implicit Discourse Relation Classification

Lianhui Qin1,2, Zhisong Zhang1,2, Hai Zhao1,2,∗, Zhiting Hu3, Eric P. Xing3

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction

and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
3 Carnegie Mellon University

{qinlianhui, zzs2011}@sjtu.edu.cn, [email protected], {zhitingh, epxing}@cs.cmu.edu

Abstract

Implicit discourse relation classification is challenging due to the lack of connectives as strong linguistic cues, which motivates the use of annotated implicit connectives to improve recognition. We propose a feature imitation framework in which an implicit relation network is driven to learn from another neural network with access to connectives, and is thus encouraged to extract similarly salient features for accurate classification. We develop an adversarial model to enable an adaptive imitation scheme through competition between the implicit network and a rival feature discriminator. Our method effectively transfers the discriminability of connectives to the implicit features, and achieves state-of-the-art performance on the PDTB benchmark.

1 Introduction

Discourse relations connect linguistic units such as clauses and sentences to form coherent semantics. Identification of discourse relations can benefit a variety of downstream applications including question answering (Liakata et al., 2013), machine translation (Li et al., 2014), text summarization (Gerani et al., 2014), opinion spam detection (Chen and Zhao, 2015), and so forth.

∗Corresponding authors. This paper was partially supported by Cai Yuanpei Program (CSC No. 201304490199 and No. 201304490171), National Natural Science Foundation of China (No. 61170114, No. 61672343 and No. 61272248), National Basic Research Program of China (No. 2013CB329401), Major Basic Research Program of Shanghai Science and Technology Committee (No. 15JC1400103), Art and Science Interdisciplinary Funds of Shanghai Jiao Tong University (No. 14JCRZ04), and Key Project of National Society Science Foundation of China (No. 15-ZDA041).

Connectives (e.g., but, so) are among the most critical linguistic cues for identifying discourse relations. When explicit connectives are present in the text, a simple frequency-based mapping is sufficient to achieve over 85% classification accuracy (Xue et al., 2016; Li et al., 2016). In contrast, implicit discourse relation recognition has long been seen as a challenging problem, with the best accuracy so far still lower than 50% (Chen et al., 2015). In the implicit case, discourse relations are not lexicalized by connectives, but must be inferred from the relevant sentences (i.e., arguments). For example, the following two adjacent sentences Arg1 and Arg2 imply the relation Cause (i.e., Arg2 is the cause of Arg1).

[Arg1]: Never mind.

[Arg2]: You already know the answer.

[Implicit connective]: Because

[Discourse relation]: Cause

Various attempts have been made to directly infer underlying relations by modeling the semantics of the arguments, ranging from feature-based methods (Lin et al., 2009; Pitler et al., 2009) to the very recent end-to-end neural models (Chen et al., 2016a; Qin et al., 2016c). Despite impressive performance, the absence of strong explicit connective cues has made the inference extremely hard and hindered further improvement. In fact, even human annotators make use of connectives to aid relation annotation. For instance, the popular Penn Discourse Treebank (PDTB) benchmark data (Prasad et al., 2008) was annotated by first manually inserting a connective expression (i.e., an implicit connective, as shown in the above example), and then determining the abstract relation by combining both the implicit connective and contextual semantics.


Therefore, the huge performance gap between explicit and implicit parsing (namely, 85% vs. 50%), as well as the human annotation practice, strongly motivates incorporating connective information to guide the reasoning process. This paper aims to advance implicit parsing by making use of the annotated implicit connectives available in training data. Little recent work has explored such a combination. Zhou et al. (2010) developed a two-step approach that first predicts implicit connectives, whose sense is then disambiguated to obtain the relation. However, the pipeline approach usually suffers from error propagation, and the method itself relies on hand-crafted features which do not necessarily generalize well. Other research leveraged explicit connective examples for data augmentation (Rutherford and Xue, 2015; Braud and Denis, 2015; Ji et al., 2015; Braud and Denis, 2016). Our work is orthogonal and complementary to this line.

In this paper, we propose a novel neural method that incorporates implicit connectives in a principled adversarial framework. We use deep neural models for relation classification, and build on the intuition that sentence arguments integrated with connectives would enable highly discriminative neural features for accurate relation inference, so an ideal implicit relation classifier, even without access to connectives, should mimic the connective-augmented reasoning behavior by extracting similarly salient features. We therefore set up a secondary network in addition to the implicit relation classifier, built upon connective-augmented inputs and serving as a feature learning model for the implicit classifier to emulate.

Methodologically, however, feature imitation in our problem is challenging due to the semantic gap induced by adding the connective cues. It is necessary to develop an adaptive scheme to flexibly drive learning and transfer discriminability. We devise a novel adversarial approach which enables a self-calibrated imitation mechanism. Specifically, we build a discriminator which distinguishes between the features produced by the two counterpart networks. The implicit relation network is then trained to correctly classify relations and simultaneously to fool the discriminator, resulting in an adversarial framework. The adversarial mechanism has been an emerging method in different contexts, especially for image generation (Goodfellow et al., 2014) and domain adaptation (Ganin et al., 2016; Chen et al., 2016c). Our adversarial framework is unique in addressing neural feature emulation between two models. Besides, to the best of our knowledge, this is the first adversarial approach in the context of discourse parsing. Compared to previous connective-exploiting work (Zhou et al., 2010; Xu et al., 2012), our method provides a new integration paradigm and an end-to-end procedure that avoids inefficient feature engineering and error propagation.

Our method is evaluated on the PDTB 2.0 benchmark in a variety of experimental settings. The proposed adversarial model greatly improves over standalone neural models and previous best-performing approaches. We also demonstrate that our implicit recognition network successfully imitates and extracts crucial hidden representations.

We begin by briefly reviewing related work in section 2. Section 3 presents the proposed adversarial model. Section 4 shows substantially improved experimental results over previous methods. Section 5 discusses extensions and future work.

2 Related Work

2.1 Implicit Discourse Relation Recognition

There has been a surge of interest in implicit discourse parsing since the release of PDTB (Prasad et al., 2008), the first large discourse corpus distinguishing implicit examples from explicit ones. A large body of work has focused on direct classification based on the observed sentences, including structured methods with linguistically-informed features (Lin et al., 2009; Pitler et al., 2009; Zhou et al., 2010), end-to-end neural models (Qin et al., 2016b,c; Chen et al., 2016a; Liu and Li, 2016), and combined approaches (Ji and Eisenstein, 2015; Ji et al., 2016). However, the lack of connective cues makes learning purely from contextual semantics highly challenging.

Prior work has attempted to leverage connective information. Zhou et al. (2010) also incorporate implicit connectives, but in a pipeline manner, by first predicting the implicit connective with a language model and then determining the discourse relation accordingly. Instead of treating implicit connectives as intermediate prediction targets, which can suffer from error propagation, we use the connectives to induce highly discriminative features that guide the learning of an implicit network, serving as an adaptive regularization mechanism for enhanced robustness and generalization. Our framework is also end-to-end, avoiding costly feature engineering. Another notable line aims at adapting explicit examples for data synthesis (Biran and McKeown, 2013; Rutherford and Xue, 2015; Braud and Denis, 2015; Ji et al., 2015), multi-task learning (Lan et al., 2013; Liu et al., 2016), and word representation (Braud and Denis, 2016). Our work is orthogonal and complementary to these methods, as we use the implicit connectives which have been annotated for implicit examples.

2.2 Adversarial Networks

Deep neural networks have achieved impressive success in various natural language processing tasks (Wang et al., 2016; Zhang et al., 2016b; Cai et al., 2017), and adversarial networks have been shown especially effective in deep generative modeling (Goodfellow et al., 2014) and domain adaptation (Ganin et al., 2016). Generative adversarial nets (Goodfellow et al., 2014) learn to produce realistic samples through competition between a generator and a real/fake discriminator. Professor forcing (Lamb et al., 2016) applies a similar idea to improve long-term generation of a recurrent neural language model. Other approaches (Chen et al., 2016b; Hu et al., 2017; Liang et al., 2017) extend the framework for controllable image/text generation. Li et al. (2015) and Salimans et al. (2016) propose feature matching, which trains generators to match the statistics of real/fake examples. Their features are extracted by the discriminator rather than the classifier networks as in our case. Our work differs from the above since we consider the context of discriminative modeling. Adversarial domain adaptation forces a neural network to learn domain-invariant features using a classifier that distinguishes the domain of the network's input data based on the hidden feature. Our adversarial framework is distinct in that, besides the implicit relation network, we construct a second neural network serving as a teacher model for feature emulation.

To the best of our knowledge, this is the first work to employ the idea of adversarial learning in the context of discourse parsing. We propose a novel connective-exploiting scheme based on feature imitation, and to this end derive a new adversarial framework, achieving substantial performance gains over existing methods. The proposed approach is generally applicable to other tasks for utilizing any indicative side information. We give more discussion in section 5.

3 Adversarial Method

Discourse connectives are key indicators of discourse relations. In the annotation procedure of the PDTB implicit relation benchmark, annotators inserted implicit connective expressions between adjacent sentences to lexicalize abstract relations and help with final decisions. Our model aims at making full use of the provided implicit connectives at training time to regulate the learning of the implicit relation recognizer, encouraging extraction of highly discriminative semantics from the raw arguments and improving generalization at test time. Our method provides a novel adversarial framework that leverages connective information in a flexible, adaptive manner, and is efficiently trained end-to-end through standard back-propagation.

The basic idea of the proposed approach is simple. We want our implicit relation recognizer, which predicts the underlying relation of sentence arguments without a discourse connective, to have prediction behaviors close to a connective-augmented relation recognizer which is provided with a discourse connective in addition to the arguments. The connective-augmented recognizer is analogous to an annotator aided by connectives, as in the human annotation process, and the implicit recognizer is improved by learning from such an "informed" annotator. Specifically, we want the latent features extracted by the two models to match as closely as possible, which explicitly transfers the discriminability of the connective-augmented representations to the implicit ones.

To this end, instead of manually selecting a closeness metric, we take advantage of the adversarial framework by constructing a two-player zero-sum game between the implicit recognizer and a rival discriminator. The discriminator attempts to distinguish between the features extracted by the two relation models, while the implicit relation model is trained to maximize the accuracy on implicit data and, at the same time, to confuse the discriminator.

In the following we first present the overall architecture of the proposed approach (section 3.1), then develop the training procedure (section 3.2). The components are realized as deep (convolutional) neural networks, with detailed modeling choices discussed in section 3.3.



Figure 1: Architecture of the proposed method. The framework contains three main components: 1) an implicit relation network i-CNN over raw sentence arguments, 2) a connective-augmented relation network a-CNN whose inputs are augmented with implicit connectives, and 3) a discriminator distinguishing between the features produced by the two networks. The features are fed to the final classifier for relation classification. The discriminator and i-CNN form an adversarial pair for feature imitation. At test time, the implicit network i-CNN with the classifier is used for prediction.

3.1 Model Architecture

Let (x, y) be a pair of input and output of implicit relation classification, where x = (x_1, x_2) is a pair of sentence arguments and y is the underlying discourse relation. Each training example also includes an annotated implicit connective c that best expresses the relation. Figure 1 shows the architecture of our framework.

The neural model for implicit relation classification (i-CNN in the figure) extracts a latent representation from the arguments, denoted as H_I(x_1, x_2), and feeds the feature into a classifier C for the final prediction C(H_I(x_1, x_2)). For ease of notation, we also use H_I(x) to denote the latent feature on data x.

The second relation network (a-CNN) takes as inputs the sentence arguments along with an implicit connective, to induce the connective-augmented representation H_A(x_1, x_2, c), and obtains the relation prediction C(H_A(x_1, x_2, c)). Note that the same final classifier C is used for both networks, so that the feature representations of the two networks are ensured to lie within the same semantic space, enabling the feature emulation presented shortly.

We further pair the implicit network with a rival discriminator D to form our adversarial game. The discriminator is to differentiate between the reasoning behaviors of the implicit network i-CNN and the augmented network a-CNN. Specifically, D is a binary classifier that takes as input a latent feature H derived from either i-CNN or a-CNN given appropriate data (where the implicit connective is either missing or present, respectively). The output D(H) estimates the probability that H comes from the connective-augmented a-CNN rather than from i-CNN.

3.2 Training Procedure

The system is trained through an alternating optimization procedure that updates the components in an interleaved manner. In this section, we first present the training objective for each component, and then give the overall training algorithm.

Let θ_D denote the parameters of the discriminator. The training objective of D is straightforward, i.e., to maximize the probability of correctly distinguishing the input features:

max_{θ_D} L_D = E_{(x,c,y)∼data} [ log D(H_A(x, c); θ_D) + log(1 − D(H_I(x); θ_D)) ],    (1)

where E_{(x,c,y)∼data}[·] denotes the expectation with respect to the data distribution.

We denote the parameters of the implicit network i-CNN and the classifier C as θ_I and θ_C, respectively. The model is then trained to (a) correctly classify relations in the training data and (b) produce salient features close to the connective-augmented ones. The first objective is fulfilled by minimizing the usual cross-entropy loss:

L_{I,C}(θ_I, θ_C) = E_{(x,y)∼data} [ J(C(H_I(x; θ_I); θ_C), y) ],    (2)


Algorithm 1 Adversarial Model for Implicit Recognition

Input: Training data {(x, c, y)_n}
Parameters: λ_1, λ_2 – balancing parameters

1: Initialize {θ_I, θ_C} and {θ_A} by minimizing Eq.(2) and Eq.(4), respectively
2: repeat
3:     Train the discriminator through Eq.(1)
4:     Train the relation models through Eq.(5)
5: until convergence

Output: Adversarially enhanced implicit relation network i-CNN with classifier C for prediction

where J(p, y) = −∑_k I(y = k) log p_k is the cross-entropy loss between the predictive distribution p and the ground-truth label y. We achieve objective (b) by minimizing the discriminator's chance of correctly telling apart the features:

L_I(θ_I) = E_{x∼data} [ log(1 − D(H_I(x; θ_I))) ].    (3)

The parameters of the augmented network a-CNN, denoted as θ_A, can be learned by simply fitting to the data, i.e., minimizing the cross-entropy loss as follows:

L_A(θ_A) = E_{(x,c,y)∼data} [ J(C(H_A(x, c; θ_A)), y) ].    (4)

As mentioned above, here we use the same classifier C as for the implicit network, forcing a unified feature space for both networks. We combine the above objectives Eqs.(2)–(4) of the relation classifiers and minimize the joint loss:

min_{θ_I, θ_A, θ_C} L_{I,A,C} = L_{I,C}(θ_I, θ_C) + λ_1 L_I(θ_I) + λ_2 L_A(θ_A),    (5)

where λ_1 and λ_2 are two balancing parameters calibrating the weights of the classification losses and the feature-regulating loss. In practice, we pre-train the implicit and augmented networks independently by minimizing Eq.(2) and Eq.(4), respectively. In the adversarial training process, we found that setting λ_2 = 0 gives stable convergence. That is, the connective-augmented features are fixed after the pre-training stage.
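For concreteness, the objectives in Eqs.(1)–(5) can be transcribed as follows for a single example. This is a minimal NumPy sketch of the formulas only, assuming the model outputs (class distributions and discriminator probabilities) are already given; it is not the authors' TensorFlow code.

```python
import numpy as np

def cross_entropy(p, y):
    # J(p, y) = -sum_k 1[y = k] log p_k, used in Eq.(2) and Eq.(4)
    return -np.log(p[y])

def discriminator_objective(d_on_HA, d_on_HI):
    # Eq.(1): the discriminator maximizes this quantity
    return np.log(d_on_HA) + np.log(1.0 - d_on_HI)

def implicit_adversarial_loss(d_on_HI):
    # Eq.(3): the implicit network minimizes this to fool the discriminator
    return np.log(1.0 - d_on_HI)

def joint_relation_loss(p_implicit, p_augmented, y, d_on_HI, lam1=0.1, lam2=0.0):
    # Eq.(5): classification losses plus the feature-regulating term
    L_IC = cross_entropy(p_implicit, y)        # Eq.(2)
    L_A  = cross_entropy(p_augmented, y)       # Eq.(4)
    L_I  = implicit_adversarial_loss(d_on_HI)  # Eq.(3)
    return L_IC + lam1 * L_I + lam2 * L_A

# Toy values standing in for model outputs on one training example.
p_imp = np.array([0.2, 0.5, 0.3]); p_aug = np.array([0.1, 0.8, 0.1]); y = 1
print(discriminator_objective(d_on_HA=0.9, d_on_HI=0.2))
print(joint_relation_loss(p_imp, p_aug, y, d_on_HI=0.2))
```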

Algorithm 1 summarizes the training procedure, where we interleave the optimization of Eq.(1) and Eq.(5) at each iteration. More practical details are provided in section 4. We instantiate all modules as neural networks (section 3.3), which are differentiable, and perform the optimization efficiently through standard stochastic gradient descent and back-propagation.


Figure 2: Neural structure of i-CNN. Two sets of convolutional filters are shown, with the corresponding features in red and blue, respectively. The weights of the filters on the two input arguments are tied.

Through Eq.(1) and Eq.(3), the discriminator and the implicit relation network follow a minimax competition, which drives both to improve until the implicit feature representations are close to the connective-augmented latent representations, encouraging the implicit network to extract highly discriminative features from the raw sentence arguments for relation classification. Alternatively, we can view Eq.(3) as an adaptive regularization on the implicit model, which, compared to pre-fixed regularizers such as ℓ2-regularization, provides a more flexible, self-calibrated mechanism to improve generalization ability.


Figure 3: Neural structure of the discriminator D.


3.3 Component Structures

We have presented our adversarial framework for implicit relation classification. We now discuss the model realization of each component. All components of the framework are parameterized with neural networks. The distinct roles of the modules in the framework lead to different modeling choices.

Relation Classification Networks Figure 2 illustrates the structure of the implicit relation network i-CNN. We use a convolutional network as it is a common architectural choice for discourse parsing. The network takes as inputs the word vectors of the tokens in each sentence argument, and maps each argument to intermediate features through a shared convolutional layer. The resulting representations are then concatenated and fed into a max pooling layer to select the most salient features as the final representation. The final classifier C is a simple fully-connected layer followed by a softmax classifier.

The connective-augmented network a-CNN has a similar structure to i-CNN, wherein the implicit connective is appended to the second sentence as input. The key difference from i-CNN is that here we adopt average k-max pooling, which takes the average of the top-k maximum values in each pooling window. The reason is to prevent the network from solely selecting the connective-induced features (which are typically the most salient ones), as would happen with max pooling, and instead force it to also attend to contextual features derived from the arguments. This makes the output features of the two networks more homogeneous, and thus facilitates feature imitation. In all the experiments we fixed k = 2.
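The pooling difference between the two networks can be sketched as below; this is a hypothetical NumPy re-implementation of the described operations, not the released code.

```python
import numpy as np

def max_pool(feature_map, window):
    # Standard max pooling used in i-CNN: one maximum per pooling window.
    n = feature_map.shape[0] // window
    return feature_map[:n * window].reshape(n, window).max(axis=1)

def avg_kmax_pool(feature_map, window, k=2):
    # Average k-max pooling used in a-CNN: mean of the top-k values per window,
    # so a single connective-induced activation cannot dominate the pooled feature.
    n = feature_map.shape[0] // window
    windows = feature_map[:n * window].reshape(n, window)
    topk = np.sort(windows, axis=1)[:, -k:]
    return topk.mean(axis=1)

acts = np.array([0.1, 0.9, 0.2, 0.3, 0.8, 0.7, 0.05, 0.6])
print(max_pool(acts, window=4))            # [0.9, 0.8]
print(avg_kmax_pool(acts, window=4, k=2))  # [(0.9 + 0.3)/2, (0.8 + 0.7)/2] = [0.6, 0.75]
```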

Discriminator The discriminator is a binary classifier that identifies the correct source of an input feature vector. To make it a strong rival to the feature-imitating network (i-CNN), we model the discriminator as a multi-layer perceptron (MLP) enhanced with a gating mechanism for efficient information flow (Srivastava et al., 2015; Qin et al., 2016c), as shown in Figure 3.
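A minimal sketch of such a gated MLP, built from highway-style gated layers in the spirit of Srivastava et al. (2015), is given below. The layer sizes and the gating pattern here are illustrative assumptions; they are not the exact 4-layer, 2-gate configuration reported in section 4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t):
    # Gated information flow: the gate g decides how much transformed signal
    # versus raw input passes through to the next layer.
    h = np.tanh(x @ W_h)
    g = sigmoid(x @ W_t)
    return g * h + (1.0 - g) * x

def discriminator(feature, params):
    # A small gated MLP ending in a single logistic unit: the output is the
    # probability that `feature` came from a-CNN rather than i-CNN.
    h = feature
    for W_h, W_t in params["hidden"]:
        h = highway_layer(h, W_h, W_t)
    return sigmoid(h @ params["out"])

rng = np.random.default_rng(1)
dim = 64
params = {"hidden": [(rng.normal(size=(dim, dim)) * 0.1,
                      rng.normal(size=(dim, dim)) * 0.1) for _ in range(2)],
          "out": rng.normal(size=dim) * 0.1}
print(discriminator(rng.normal(size=dim), params))
```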

4 Experiments

We demonstrate the effectiveness of our approach both quantitatively and qualitatively with extensive experiments. We evaluate prediction performance on the PDTB benchmark in different settings. Our method substantially improves over a diverse set of previous models, especially in the practical multi-class classification task. We perform an in-depth analysis of the model behaviors, and show that our adversarial framework successfully enables the implicit relation model to imitate and learn discriminative features.

4.1 Experiment Setup

We use PDTB 2.0 (http://www.seas.upenn.edu/~pdtb/), one of the largest manually annotated discourse relation corpora. The dataset contains 16,224 implicit relation instances in total, with three levels of senses: Level-1 Class, Level-2 Type, and Level-3 Subtypes. The 1st level consists of four major relation Classes: COMPARISON, CONTINGENCY, EXPANSION and TEMPORAL. The 2nd level contains 16 Types.

To make extensive comparison with prior work on implicit discourse relation classification, we evaluate on two popular experimental settings: 1) multi-class classification of 2nd-level types (Lin et al., 2009; Ji and Eisenstein, 2015), and 2) one-versus-others binary classification of 1st-level classes (Pitler et al., 2009). We describe the detailed configurations in the respective sections below. We focus our analysis on the multi-class classification setting, which is the most realistic in practice and serves as a building block for a complete discourse parser such as those of the CoNLL-2015 and 2016 shared tasks (Xue et al., 2015, 2016).

Model Training Here we provide the detailed architecture configuration of each component used in the experiments.

• Throughout the experiments, i-CNN and a-CNN contain 3 sets of convolutional filters, with the filter sizes selected on the dev set. Table 1 lists the filter configurations of the convolutional layer in i-CNN and a-CNN for the different tasks. As described in section 3.3, the convolutional layer is followed by a max pooling layer in i-CNN, and by an average k-max pooling layer with k = 2 in a-CNN.

• The final single-layer classifier C contains 512 neurons with the tanh activation function.

• The discriminator D consists of 4 fully-connected layers, with 2 gated pathways from layer 1 to layer 3 and to layer 4 (see Figure 3). The size of each layer is set to 1024 and is fixed in all the experiments.



Task         Filter sizes   Filter number
PDTB-Lin     2, 4, 8        3 × 256
PDTB-Ji      2, 5, 10       3 × 256
One-vs-all   2, 5, 10       3 × 1024

Table 1: The convolutional architectures of i-CNN and a-CNN in the different tasks (section 4). For example, in PDTB-Lin, we use 3 sets of filters of sizes 2, 4, and 8, respectively, and each set has 256 filters.

• We set the dimension of the input word vectors to 300 and initialize them with pre-trained word2vec (Mikolov et al., 2013). The maximum length of a sentence argument is set to 80. Truncation and zero-padding are applied when necessary.

All experiments were performed on a TITAN-X GPU with 128GB RAM, using a neural implementation based on TensorFlow (https://www.tensorflow.org).

For adversarial model training, it is critical to keep a balance between the progress of the two players. We use a simple strategy which, at each iteration, optimizes the discriminator and the implicit relation network each on a randomly-sampled mini-batch. We found this sufficient to stabilize the training. The neural parameters are trained using AdaGrad (Duchi et al., 2011) with an initial learning rate of 0.001. For the balancing parameters in Eq.(5), we set λ_1 = 0.1 and λ_2 = 0. That is, after the initialization stage the weights of the connective-augmented network a-CNN are fixed. This has proven capable of giving stable and good predictive performance for our system.
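The interleaved updates of Algorithm 1 with these settings can be sketched as follows. This is a hypothetical, runnable skeleton on random toy data using PyTorch and small stand-in MLP encoders; the authors' implementation uses TensorFlow with the CNN encoders of section 3.3.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM_IN, DIM_FEAT, N_REL = 50, 32, 11
i_cnn = nn.Sequential(nn.Linear(DIM_IN, DIM_FEAT), nn.Tanh())        # stand-in for i-CNN
a_cnn = nn.Sequential(nn.Linear(DIM_IN + 10, DIM_FEAT), nn.Tanh())   # stand-in for a-CNN
clf   = nn.Linear(DIM_FEAT, N_REL)                                    # shared classifier C
disc  = nn.Sequential(nn.Linear(DIM_FEAT, 64), nn.Tanh(), nn.Linear(64, 1), nn.Sigmoid())

opt_d = torch.optim.Adagrad(disc.parameters(), lr=0.001)
opt_i = torch.optim.Adagrad(itertools.chain(i_cnn.parameters(), clf.parameters()), lr=0.001)
lam1 = 0.1   # weight of the adversarial feature-regulating term; lambda2 = 0, a-CNN frozen

for step in range(100):
    # One randomly sampled mini-batch per player, as described above (toy tensors here).
    x = torch.randn(16, DIM_IN)         # argument-pair encodings
    c = torch.randn(16, 10)             # implicit-connective encodings
    y = torch.randint(0, N_REL, (16,))  # relation labels

    h_a = a_cnn(torch.cat([x, c], dim=1)).detach()   # connective-augmented features, fixed

    # --- update the discriminator: maximize Eq.(1) ---
    h_i = i_cnn(x).detach()
    d_loss = -(torch.log(disc(h_a) + 1e-8).mean()
               + torch.log(1 - disc(h_i) + 1e-8).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- update i-CNN and C: minimize Eq.(5) with lambda2 = 0 ---
    h_i = i_cnn(x)
    cls_loss = F.cross_entropy(clf(h_i), y)            # Eq.(2)
    adv_loss = torch.log(1 - disc(h_i) + 1e-8).mean()  # Eq.(3)
    opt_i.zero_grad()
    (cls_loss + lam1 * adv_loss).backward()
    opt_i.step()
```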

4.2 Implicit Relation Classification

We focus mainly on the general multi-class classification problem in the two alternative settings adopted in prior work, showing the superiority of our model over the previous state of the art. We perform an in-depth comparison with carefully designed baselines, providing empirical insights into the working mechanism of the proposed framework. For broader comparison we also report performance in the one-versus-all setting.


#    Model                                 PDTB-Lin   PDTB-Ji
1    Word-vector                           34.07      36.86
2    CNN                                   43.12      44.51
3    Ensemble                              42.17      44.27
4    Multi-task                            43.73      44.75
5    ℓ2-reg                                44.12      45.33
6    Lin et al. (2009)                     40.20      -
7    Lin et al. (2009) + Brown clusters    -          40.66
8    Ji and Eisenstein (2015)              -          44.59
9    Qin et al. (2016a)                    43.81      45.04
10   Ours                                  44.65      46.23

Table 2: Accuracy (%) on the test sets of the PDTB-Lin and PDTB-Ji settings for multi-class classification. Please see the text for more details.

Multi-class Classifications

We first adopt the standard PDTB splitting convention following Lin et al. (2009), denoted PDTB-Lin, where sections 2–21, 22, and 23 are used as the training, dev, and test sets, respectively. The most frequent 11 types of relations are selected for the task. During training, instances with more than one annotated relation type are treated as multiple instances, each with one of the annotations. At test time, a prediction that matches one of the gold types is considered correct. The test set contains 766 examples. More details are given in Lin et al. (2009). An alternative, slightly different multi-class setting is used by Ji and Eisenstein (2015), denoted PDTB-Ji, where sections 2–20, 0–1, and 21–22 are used as the training, dev, and test sets, respectively. The resulting test set contains 1039 examples. We also evaluate in this setting for thorough comparison.

Table 2 shows the classification accuracy in both settings. We see that our model (Row 10) achieves state-of-the-art performance, greatly outperforming previous methods (Rows 6–9) with various modeling paradigms, including the linguistic feature-based model (Lin et al., 2009), pure neural methods (Qin et al., 2016c), and a combined approach (Ji and Eisenstein, 2015).

To obtain better insight into the working mechanism of our method, we further compare with a set of carefully selected baselines, shown in Rows 1-5. 1) "Word-vector" sums the word vectors for the sentence representation, showing the base effect of word embeddings. 2) "CNN" is a standalone convolutional net having the exact same architecture as our implicit relation network.


Our model trained within the proposed framework provides a significant improvement over it, showing the benefit of utilizing implicit connectives at training time. 3) "Ensemble" has the same neural architecture as the proposed framework except that the input of a-CNN is not augmented with implicit connectives. This is essentially an ensemble of two implicit recognition networks. We see that this method performs even worse than the single CNN model, which further confirms the necessity of exploiting connective information. 4) "Multi-task" is the convolutional net augmented with an additional task of simultaneously predicting the implicit connectives from the network features. As a straightforward way of incorporating connectives, this method slightly improves over the stand-alone CNN, while falling behind our approach by a large margin. This indicates that the proposed feature imitation is a more effective scheme for making use of implicit connectives. 5) Finally, "ℓ2-reg" also implements feature mimicking, by imposing an ℓ2 distance penalty between the implicit relation features and the connective-augmented features. This simple model already improves over previous best-performing systems in both settings, further validating the idea of imitation. However, in contrast to the fixed ℓ2 regularization, our adversarial framework provides an adaptive mechanism, which is more flexible and performs better as shown in the table.
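For reference, the feature-mimicking term of the "ℓ2-reg" baseline can be sketched as below (a hypothetical NumPy illustration; how the penalty is weighted against the classification loss is left out).

```python
import numpy as np

def l2_feature_penalty(h_implicit, h_augmented):
    # Fixed feature-mimicking regularizer of the "l2-reg" baseline: squared Euclidean
    # distance between the two feature vectors, used in place of the adversarial term Eq.(3).
    return np.sum((h_implicit - h_augmented) ** 2)

h_i = np.array([0.2, -0.1, 0.7])
h_a = np.array([0.3,  0.0, 0.9])
print(l2_feature_penalty(h_i, h_a))  # 0.06
```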

Model                   COMP.   CONT.   EXP.    TEMP.
Pitler et al. (2009)    21.96   47.13   -       16.76
Qin et al. (2016c)      41.55   57.32   71.50   35.43
Zhang et al. (2016a)    35.88   50.56   71.48   29.54
Zhou et al. (2010)      31.79   47.16   70.11   20.30
Liu and Li (2016)       36.70   54.48   70.43   38.84
Chen et al. (2016a)     40.17   54.76   -       31.32
Ours                    40.87   54.56   72.38   36.20

Table 3: Comparison of F1 scores (%) for binary classification.

One-versus-all Classifications

We also report the results of the four one-versus-all binary classifications for further comparison with prior work. We follow the conventional experimental setting (Pitler et al., 2009) by selecting sections 2–20, 21–22, and 0–1 as the training, dev, and test sets. Table 4 lists the statistics of the data.

Following previous work, Table 3 reports the F1 scores.

Relation      Train        Dev        Test
Comparison    1942/1942    197/986    152/894
Contingency   3342/3342    295/888    279/767
Expansion     7004/7004    671/512    574/472
Temporal      760/760      64/1119    85/961

Table 4: Distributions of positive and negative instances in the train/dev/test sets of the four binary relation classification tasks.

Our method outperforms most of the prior systems on all the tasks. We achieve state-of-the-art performance in recognition of the Expansion relation, and obtain scores comparable to the best-performing methods on each of the other relations. Notably, our feature imitation scheme greatly improves over Zhou et al. (2010), which leverages implicit connectives as an intermediate prediction task. This provides additional evidence for the effectiveness of our approach.

4.3 Qualitative Analysis

We now take a closer look at the modeling behavior of our framework, by investigating the progress of the adversarial game during training as well as the feature imitation effects.

Figure 4 shows the training progress of the different components. The a-CNN network keeps a high predictive accuracy since the implicit connectives are given, showing the importance of connective cues. The rise-and-fall pattern in the accuracy of the discriminator clearly shows its competition with the implicit relation network i-CNN as training proceeds. In the first few iterations the accuracy of the discriminator increases quickly to over 0.9, while at the later stage the accuracy drops to around 0.6, showing that the discriminator is getting confused by i-CNN (an accuracy of 0.5 indicates full confusion). The i-CNN network keeps improving in terms of implicit relation classification accuracy, as it gradually fits the data and simultaneously learns increasingly discriminative features by mimicking a-CNN. The system exhibits similar learning patterns in the two different settings, showing the stability of the training strategy.

We finally visualize the output feature vectors of i-CNN and a-CNN using the t-SNE method (Maaten and Hinton, 2008) in Figure 5. Without feature imitation, the features extracted by the two networks are clearly separated (Figure 5(a)). In contrast, as shown in Figures 5(b)-(c), the feature vectors become increasingly mixed as training proceeds. Thus our framework has successfully driven i-CNN to induce representations similar to those of a-CNN, even though connectives are not present.
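Such a visualization can be reproduced along the following lines (a hypothetical sketch using scikit-learn and matplotlib; the feature arrays here are random placeholders, whereas in the paper they are the hidden vectors extracted by the two trained networks on test examples).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# h_implicit, h_augmented: one row of extracted features per test example.
rng = np.random.default_rng(2)
h_implicit  = rng.normal(size=(200, 64))          # placeholder for i-CNN features
h_augmented = rng.normal(size=(200, 64)) + 0.5    # placeholder for a-CNN features

# Embed both feature sets jointly into 2-D and color by source network.
points = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([h_implicit, h_augmented]))
plt.scatter(points[:200, 0], points[:200, 1], s=8, label="i-CNN")
plt.scatter(points[200:, 0], points[200:, 1], s=8, label="a-CNN")
plt.legend()
plt.show()
```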



Figure 4: (Best viewed in colors.) Test-set performance of the three components over training epochs. Relation networks a-CNN and i-CNN are measured with multi-class classification accuracy (with or without implicit connectives, respectively), while the discriminator is evaluated with binary classification accuracy. Top: the PDTB-Lin setting (Lin et al., 2009), where the first 8 epochs are the initialization stage (thus the discriminator is fixed and not shown); Bottom: the PDTB-Ji setting (Ji and Eisenstein, 2015), where the first 3 epochs are for initialization.


Figure 5: (Best viewed in colors.) Visualizations of the hidden features extracted by the implicit relation network i-CNN (blue) and the connective-augmented relation network a-CNN (orange), in the multi-class classification setting (Lin et al., 2009). (a) The two networks trained without adversary (with a shared classifier); (b) the two networks trained within our framework at epoch 10; (c) at epoch 20. The implicit relation network successfully imitates the connective-augmented features through the adversarial game. Visualization is conducted with the t-SNE algorithm (Maaten and Hinton, 2008).


5 Discussions

We have developed an adversarial neural framework that enables an implicit relation network to extract highly discriminative features by mimicking a connective-augmented network. Our method achieves state-of-the-art performance for implicit discourse relation classification. Besides implicit connective examples, our model can naturally exploit the enormous amount of explicit connective data to further improve discourse parsing.

The proposed adversarial feature imitation scheme is also generally applicable in other contexts to incorporate indicative side information available at training time for enhanced inference. Our framework shares a similar spirit with the iterative knowledge distillation method (Hu et al., 2016a,b), which trains a "student" network to mimic the classification behavior of a knowledge-informed "teacher" network. Our approach encourages imitation at the feature level instead of the final prediction level. This allows our approach to apply to regression tasks and, more interestingly, to contexts in which the student and teacher networks have different prediction outputs, e.g., performing different tasks, while transferring knowledge between each other can still be beneficial. Besides, our adversarial mechanism provides an adaptive metric to measure and drive the imitation procedure.


References

Or Biran and Kathleen McKeown. 2013. Aggregated word pair features for implicit discourse relation disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL, Volume 2: Short Papers). Sofia, Bulgaria, pages 69–73.

Chloe Braud and Pascal Denis. 2015. Comparing word representations for implicit discourse relation classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Lisbon, Portugal, pages 2201–2211.

Chloe Braud and Pascal Denis. 2016. Learning connective-based word representations for implicit discourse relation identification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, Texas, pages 203–213.

Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and accurate neural word segmentation for Chinese. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). Vancouver, Canada.

Changge Chen, Peilu Wang, and Hai Zhao. 2015. Shallow discourse parsing using constituent parsing tree. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task (CoNLL). Beijing, China, pages 37–41.

Changge Chen and Hai Zhao. 2015. Deceptive opinion spam detection using deep level linguistic feature. In The 4th CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2015), LNCS. Nanchang, China, volume 9362, pages 465–474.

Jifan Chen, Qi Zhang, Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016a. Implicit discourse relation detection via a deep architecture with gated relevance network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL, Volume 1: Long Papers). Berlin, Germany, pages 1726–1735.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016b. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. pages 2172–2180.

Xilun Chen, Ben Athiwaratkun, Yu Sun, Kilian Weinberger, and Claire Cardie. 2016c. Adversarial deep averaging networks for cross-lingual sentiment classification. arXiv preprint arXiv:1606.01614.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(59):1–35.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, T. Raymond Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1602–1613.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. pages 2672–2680.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric P. Xing. 2016a. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). Berlin, Germany, pages 2410–2420.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955.

Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2016b. Deep neural networks with massive learned knowledge. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, USA, pages 1670–1679.

Yangfeng Ji and Jacob Eisenstein. 2015. One vector is not enough: Entity-augmented distributed semantics for discourse relations. Transactions of the Association for Computational Linguistics (TACL) 3:329–344.

Yangfeng Ji, Gholamreza Haffari, and Jacob Eisenstein. 2016. A latent variable recurrent neural network for discourse-driven language models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). San Diego, California, pages 332–342.

Yangfeng Ji, Gongbo Zhang, and Jacob Eisenstein. 2015. Closing the gap: Domain adaptation from explicit to implicit discourse relations. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Lisbon, Portugal, pages 2219–2224.

Alex M. Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems. pages 4601–4609.


Man Lan, Yu Xu, and Zhengyu Niu. 2013. Leveraging synthetic discourse data via multi-task learning for implicit discourse relation recognition. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL, Volume 1: Long Papers). Sofia, Bulgaria, pages 476–485.

Junyi Jessy Li, Marine Carpuat, and Ani Nenkova. 2014. Assessing the discourse factors that influence the quality of machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL, Volume 2: Short Papers). Baltimore, Maryland, pages 283–288.

Yujia Li, Kevin Swersky, and Richard S. Zemel. 2015. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML). Lille, France, pages 1718–1727.

Zhongyi Li, Hai Zhao, Chenxi Pang, Lili Wang, and Huan Wang. 2016. A constituent syntactic parse tree based discourse parser. In Proceedings of the Twentieth Conference on Computational Natural Language Learning - Shared Task (CoNLL). Berlin, Germany, pages 60–64.

Maria Liakata, Simon Dobnik, Shyamasree Saha, Colin Batchelor, and Dietrich Rebholz-Schuhmann. 2013. A discourse-driven content model for summarising scientific articles evaluated in a complex question answering task. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP). Seattle, Washington, USA, pages 747–757.

Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P. Xing. 2017. Recurrent topic-transition GAN for visual paragraph generation. arXiv preprint arXiv:1703.07022.

Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. 2009. Recognizing implicit discourse relations in the Penn Discourse Treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore, pages 343–351.

Yang Liu and Sujian Li. 2016. Recognizing implicit discourse relations via repeated reading: Neural networks with multi-level attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, Texas, pages 1224–1233.

Yang Liu, Sujian Li, Xiaodong Zhang, and Zhifang Sui. 2016. Implicit discourse relation classification via multi-task neural networks. arXiv preprint arXiv:1603.02776.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579–2605.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (3). South Lake Tahoe, Nevada, USA, pages 3111–3119.

Emily Pitler, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing (ACL-IJCNLP). Suntec, Singapore, pages 683–691.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse Treebank 2.0. In The Sixth International Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco, pages 2961–2968.

Lianhui Qin, Zhisong Zhang, and Hai Zhao. 2016a. Implicit discourse relation recognition with context-aware character-enhanced embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan, pages 1914–1924.

Lianhui Qin, Zhisong Zhang, and Hai Zhao. 2016b. Shallow discourse parsing using convolutional neural network. In Proceedings of the CoNLL-16 Shared Task. Berlin, Germany, pages 70–77.

Lianhui Qin, Zhisong Zhang, and Hai Zhao. 2016c. A stacking gated neural architecture for implicit discourse relation classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, Texas, pages 2263–2270.

Attapol Rutherford and Nianwen Xue. 2015. Improving the inference of implicit discourse relations via classifying explicit discourse connectives. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Denver, Colorado, pages 799–808.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems. pages 2226–2234.

Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.

Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2016. Learning distributed word representations for bidirectional LSTM recurrent neural network. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). San Diego, California, pages 527–533.


Yu Xu, Man Lan, Yue Lu, Zheng Yu Niu, and Chew Lim Tan. 2012. Connective prediction using machine learning for implicit discourse relation classification. In The 2012 International Joint Conference on Neural Networks (IJCNN). Brisbane, Australia, pages 1–8.

Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and Attapol Rutherford. 2015. The CoNLL-2015 shared task on shallow discourse parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task (CoNLL). Beijing, China, pages 1–16.

Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Bonnie Webber, Attapol Rutherford, Chuan Wang, and Hongmin Wang. 2016. The CoNLL-2016 shared task on shallow discourse parsing. In Proceedings of the Twentieth Conference on Computational Natural Language Learning - Shared Task (CoNLL). Berlin, Germany, pages 1–19.

Biao Zhang, Deyi Xiong, Jinsong Su, Qun Liu, Rongrong Ji, Hong Duan, and Min Zhang. 2016a. Variational neural discourse relation recognizer. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, Texas, pages 382–391.

Zhisong Zhang, Hai Zhao, and Lianhui Qin. 2016b. Probabilistic graph-based dependency parsing with convolutional neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). Berlin, Germany, pages 1382–1392.

Zhi-Min Zhou, Yu Xu, Zheng-Yu Niu, Man Lan, Jian Su, and Chew Lim Tan. 2010. Predicting discourse connectives for implicit discourse relation recognition. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010). Beijing, China, pages 1507–1514.
