


NAACL HLT 2016

The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the Student Research Workshop

June 12-17, 2016
San Diego, California, USA

©2016 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-941643-81-5


Introduction

    Welcome to the NAACL-HLT 2016 Student Research Workshop.

This year, we have three different kinds of papers: research papers, thesis proposals, and undergraduate research papers. Thesis proposals were intended for advanced students who have decided on a thesis topic and wish to get feedback on their proposal and broader ideas for their continuing work, while research papers describe completed work or work in progress with preliminary results. In order to encourage undergraduate research, we offered a special track for research papers where the first author is an undergraduate student.

We received 11 research papers, 5 thesis proposals, and 8 undergraduate research papers, making the total number of submissions 24. We accepted 9 research papers, 2 thesis proposals, and 6 undergraduate research papers (17 accepted in total). This translates to an acceptance rate of 81% for research papers, 40% for thesis proposals, and 75% for undergraduate research papers (70% overall). This year, all the SRW papers will be presented at the main conference evening poster session. In addition, each SRW paper is assigned a dedicated mentor. The mentor is an experienced researcher from academia or industry who will prepare in-depth comments and questions in advance for the poster session and will provide feedback to the student author. Thanks to our funding sources, this year's SRW covers registration expenses and provides partial travel and/or lodging support to all student first authors of the SRW papers. We gratefully acknowledge the support from the NSF and Google. We thank our dedicated program committee members who gave constructive and detailed feedback for the student papers. We also would like to thank the NAACL-HLT 2016 organizers and local arrangement chairs.


Organizers:

Jacob Andreas, University of California Berkeley
Eunsol Choi, University of Washington
Angeliki Lazaridou, University of Trento

    Faculty Advisors:

Jacob Eisenstein, Georgia Institute of Technology
Nianwen Xue, Brandeis University

    Program Committee:

Amjad Abu-Jbara, Microsoft
Bharat Ram Ambati, Apple
Gabor Angeli, Stanford University
Yoav Artzi, Cornell University
Daniel Bauer, Columbia University
Yevgeni Berzak, Massachusetts Institute of Technology
Arianna Bisazza, University of Amsterdam
Shu Cai, University of Southern California
Hiram Calvo, Instituto Politécnico Nacional, Mexico
Dallas Card, Carnegie Mellon University
Asli Celikyilmaz, Microsoft
Danqi Chen, Stanford University
Jesse Dodge, University of Washington
Raquel Fernandez, University of Amsterdam
Thomas Francois, UC Louvain
Lea Frermann, University of Edinburgh
Daniel Fried, University of California Berkeley
Annemarie Friedrich, Saarland University
Yoav Goldberg, Bar Ilan University
Amit Goyal, Yahoo! Labs
Michael Hahn, University of Edinburgh
David Hall, Semantic Machines
Eva Hasler, University of Cambridge
Luheng He, University of Washington
John Henderson, MITRE Corporation
Derrick Higgins, Civis Analytics
Dirk Hovy, University of Copenhagen
Yuening Hu, Yahoo! Labs
Philipp Koehn, University of Edinburgh
Lingpeng Kong, Carnegie Mellon University
Ioannis Konstas, University of Washington


Jonathan K. Kummerfeld, University of California Berkeley
Kenton Lee, University of Washington
Tal Linzen, Ecole Normale Supérieure
Fei Liu, University of Central Florida
Adam Lopez, University of Edinburgh
Nitin Madnani, ETS
Shervin Malmasi, Harvard University
Diego Marcheggiani, University of Amsterdam
Karthik Narasimhan, Massachusetts Institute of Technology
Arvind Neelakantan, University of Massachusetts Amherst
Denis Paperno, University of Trento
Adam Pauls, University of California, Berkeley
Ted Pedersen, University of Minnesota Duluth
Barbara Plank, University of Groningen
Christopher Potts, Stanford University
Daniel Preoţiuc-Pietro, University of Pennsylvania
Preethi Raghavan, IBM T.J. Watson Research Center
Roi Reichart, Technion
Tim Rocktäschel, University College London
Roy Schwartz, The Hebrew University
Minjoon Seo, University of Washington
Kairit Sirts, Tallinn University of Technology
Huan Sun, University of Washington
Swabha Swayamdipta, Carnegie Mellon University
Kapil Thadani, Columbia University
Travis Wolfe, Johns Hopkins University
Luke Zettlemoyer, University of Washington


Table of Contents

An End-to-end Approach to Learning Semantic Frames with Feedforward Neural Network
Yukun Feng, Yipei Xu and Dong Yu . . . . . . . . . . 1

Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't.

    Anna Gladkova, Aleksandr Drozd and Satoshi Matsuoka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Argument Identification in Chinese Editorials
Marisa Chow . . . . . . . . . . 16

Automatic tagging and retrieval of E-Commerce products based on visual features
Vasu Sharma and Harish Karnick . . . . . . . . . . 22

Combining syntactic patterns and Wikipedia's hierarchy of hyperlinks to extract meronym relations
Debela Tesfaye Gemechu, Michael Zock and Solomon Teferra . . . . . . . . . . 29

Data-driven Paraphrasing and Stylistic Harmonization
Gerold Hintz . . . . . . . . . . 37

    Detecting "Smart" Spammers on Social Network: A Topic Model ApproachLinqing Liu, Yao Lu, Ye Luo, Renxian Zhang, Laurent Itti and Jianwei Lu . . . . . . . . . . . . . . . . . . 45

Developing language technology tools and resources for a resource-poor language: Sindhi
Raveesh Motlani . . . . . . . . . . 51

Effects of Communicative Pressures on Novice L2 Learners' Use of Optional Formal Devices
Yoav Binoun, Francesca Delogu, Clayton Greenberg, Mindaugas Mozuraitis and Matthew Crocker . . . . . . . . . . 59

Explicit Argument Identification for Discourse Parsing In Hindi: A Hybrid Pipeline
Rohit Jain and Dipti Sharma . . . . . . . . . . 66

Exploring Fine-Grained Emotion Detection in Tweets
Jasy Suet Yan Liew and Howard R. Turtle . . . . . . . . . . 73

Extraction of Bilingual Technical Terms for Chinese-Japanese Patent Translation
Wei Yang, Jinghui Yan and Yves Lepage . . . . . . . . . . 81

Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter
Zeerak Waseem and Dirk Hovy . . . . . . . . . . 88

Non-decreasing Sub-modular Function for Comprehensible Summarization
Litton J Kurisinkel, Pruthwik Mishra, Vigneshwaran Muralidaran, Vasudeva Varma and Dipti Misra Sharma . . . . . . . . . . 94


Phylogenetic simulations over constraint-based grammar formalisms
Andrew Lamont and Jonathan Washington . . . . . . . . . . 102

Question Answering over Knowledge Base using Factual Memory Networks
Sarthak Jain . . . . . . . . . . 109

Using Related Languages to Enhance Statistical Language Models
Anna Currey, Alina Karakanta and Jon Dehdari . . . . . . . . . . 116


Conference Program

    Monday June 13th

An End-to-end Approach to Learning Semantic Frames with Feedforward Neural Network
Yukun Feng, Yipei Xu and Dong Yu

Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't.
Anna Gladkova, Aleksandr Drozd and Satoshi Matsuoka

Argument Identification in Chinese Editorials
Marisa Chow

Automatic tagging and retrieval of E-Commerce products based on visual features
Vasu Sharma and Harish Karnick

Combining syntactic patterns and Wikipedia's hierarchy of hyperlinks to extract meronym relations
Debela Tesfaye Gemechu, Michael Zock and Solomon Teferra

Data-driven Paraphrasing and Stylistic Harmonization
Gerold Hintz

    Detecting "Smart" Spammers on Social Network: A Topic Model ApproachLinqing Liu, Yao Lu, Ye Luo, Renxian Zhang, Laurent Itti and Jianwei Lu

Developing language technology tools and resources for a resource-poor language: Sindhi
Raveesh Motlani


Tuesday June 14th

Effects of Communicative Pressures on Novice L2 Learners' Use of Optional Formal Devices
Yoav Binoun, Francesca Delogu, Clayton Greenberg, Mindaugas Mozuraitis and Matthew Crocker

Explicit Argument Identification for Discourse Parsing In Hindi: A Hybrid Pipeline
Rohit Jain and Dipti Sharma

Exploring Fine-Grained Emotion Detection in Tweets
Jasy Suet Yan Liew and Howard R. Turtle

Extraction of Bilingual Technical Terms for Chinese-Japanese Patent Translation
Wei Yang, Jinghui Yan and Yves Lepage

Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter
Zeerak Waseem and Dirk Hovy

Non-decreasing Sub-modular Function for Comprehensible Summarization
Litton J Kurisinkel, Pruthwik Mishra, Vigneshwaran Muralidaran, Vasudeva Varma and Dipti Misra Sharma

Phylogenetic simulations over constraint-based grammar formalisms
Andrew Lamont and Jonathan Washington

Question Answering over Knowledge Base using Factual Memory Networks
Sarthak Jain

Using Related Languages to Enhance Statistical Language Models
Anna Currey, Alina Karakanta and Jon Dehdari


Proceedings of NAACL-HLT 2016, pages 1–7, San Diego, California, June 12-17, 2016. ©2016 Association for Computational Linguistics

An End-to-end Approach to Learning Semantic Frames with Feedforward Neural Network

Yukun Feng, Yipei Xu and Dong Yu∗
College of Information Science
Beijing Language and Culture University
No.15 Xueyuan Rd., Beijing, China, 100083

    {fengyukun,xuyipei,yudong}@blcu.edu.cn

    Abstract

We present an end-to-end method for learning verb-specific semantic frames with a feedforward neural network (FNN). Previous work in this area mainly adopts a multi-step procedure including part-of-speech tagging, dependency parsing and so on. In contrast, our method uses an FNN model that maps verb-specific sentences directly to semantic frames. The simple model gets good results on annotated data and has good generalization ability. Finally, we get an F-score of 0.82 on 63 verbs and 0.73 on 407 verbs.

    1 Introduction

Lexical items usually have particular requirements for their semantic roles. Semantic frames are the structures of the linked semantic roles near the lexical items. A semantic frame specifies its characteristic interactions with things necessarily or typically associated with it (Alan, 2001). It is valuable to build such resources. These resources can be effectively used in many natural language processing (NLP) tasks, such as question answering (Narayanan and Harabagiu, 2004) and machine translation (Boas, 2002).

Current semantic frame resources, such as FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005) and VerbNet (Schuler, 2005), have been created manually. These resources have promising applications, but building them is time-consuming and expensive. El Maarouf and Baisa (2013) used a bootstrapping model to classify the patterns of verbs from the Pattern Dictionary of English (PDEV, http://pdev.org.uk/).

    ∗The corresponding author.

El Maarouf et al. (2014) used a Support Vector Machine (SVM) model to classify the patterns of PDEV. The above supervised approaches are the most closely related to ours, since PDEV is also used in our experiments, but the models above are tested on only 25 verbs and they are not end-to-end. Popescu used Finite State Automata (FSA) to learn the patterns of semantic frames (Popescu, 2013), but the generalization ability of this rule-based method may be weak. Recently, some unsupervised studies have focused on acquiring semantic frames from raw corpora (Materna, 2012; Materna, 2013; Kawahara et al., 2014b; Kawahara et al., 2014a). Materna used LDA-Frames to identify semantic frames based on Latent Dirichlet Allocation (LDA) and the Dirichlet Process. Kawahara et al. used the Chinese Restaurant Process to induce semantic frames from a syntactically annotated corpus. These unsupervised approaches have a different goal from supervised approaches: they aim at identifying semantic frames by clustering parsed sentences, but they do not learn from semantic frames that have already been built. They are also under a pipeline framework and not end-to-end.

One resource related to our work is Corpus Pattern Analysis (CPA) frames (Hanks, 2012). CPA proposes a heuristic procedure to obtain semantic frames, and most current supervised and unsupervised approaches follow a similar pipeline procedure. The procedure can be summarized as follows, using the example sentence "The old music deeply moved the old man":




step 1: Identify the arguments near "moved", which can be expressed as (subject:music, object:man).

step 2: Attach meanings to the above arguments, which can be expressed as (subject:Entity, object:Human).

step 3: Cluster or classify the arguments to get semantic frames.

However, steps 1 and 2 proved to be difficult in SemEval-2015 Task 15 (http://alt.qcri.org/semeval2015/task15/) (Feng et al., 2015; Mills and Levow, 2015).

This paper presents an end-to-end approach that directly learns semantic frames from verb-specific sentences. One key component of our model is well pre-trained word vectors. These vectors capture fine-grained semantic and syntactic regularities (Mikolov et al., 2013) and give our model good generalization ability. Another key component is the FNN model: a supervised signal allows the FNN to learn the semantic frames directly. As a result, this simple model achieves good results. On the instance resources of PDEV, we got an F-score of 0.82 on 63 verbs and 0.73 on 407 verbs.

The contributions of this paper are summarized as follows:

• Semantic frames can be learned with a neural network as an end-to-end mapping, and we analyse our method in detail.

• We show the power of pre-trained vectors and a simple neural network for learning semantic frames, which is helpful for developing more powerful approaches.

• We evaluate the learned semantic frames on annotated data and get good results without much training data.

    2 Model Description

2.1 Overview

Our model gets verb-specific semantic frames directly from verb-specific sentences. A running example of learning semantic frames is shown in Figure 1.

Figure 1: Model architecture for an example of learning semantic frames directly from a verb-specific sentence. The sentence is divided into two windows: "The old music deeply" is in the left window and "the old man" is in the right window. The target verb "moved" is not used in the input. The input is connected to the output layer, and each unit of the output layer corresponds to one semantic frame of the target verb (e.g. "Entity move Human", "Human move Vehicle").

Our FNN model can be regarded as a continuous function

c = f(x). (1)

Here x ∈ R^n represents the vector space of the sentence and c represents the index of the semantic frame. Instead of a multi-step pipeline, the FNN model directly maps the sentence into a semantic frame. In the training phase, "The old music deeply moved the old man" is mapped into the vector space and "Entity move Human" is learned from that vector space. In the testing phase, an example result of the FNN model can be roughly expressed as f("The fast melody moved the beautiful girl") = "Entity move Human", which is an end-to-end mapping.

2.2 Feedforward Neural Network

Denote C_{i:j} as the concatenation of word vectors in a sentence, where i and j are word indexes in the sentence. The input layer is divided into two windows (padded with zero vectors where necessary), which are called the left window and the right window. The input to the FNN is represented as

x = C_{v-lw:v-1} ⊕ C_{v+1:v+rw}, (2)

where v denotes the index of the target verb in the sentence, ⊕ is the concatenation operator, lw is the length of the left window, and rw is the length of the right window.


Both lw and rw are hyperparameters. The target verb can be ignored by the input layer because its arguments lie in the left and right windows. W, U and V respectively represent the weight matrices between the input and hidden layer, the hidden and output layer, and the input and output layer; d and b respectively represent the bias vectors of the hidden and output layer. We use the hyperbolic tangent as the activation function in the hidden layer. Using matrix-vector notation, the net input of the softmax layer of the FNN can be expressed as:

a = λ(U tanh(Wx + d) + b) + (1 − λ)Vx. (3)

Here λ controls the relative weight of the two terms in the above formula: the FNN has three layers when λ is set to 1 and two layers without bias when λ is set to 0. A softmax function then follows for classification:

p_i = e^{a_i} / Σ_i e^{a_i}. (4)

Here p_i represents the probability of semantic frame i given x. The cost we minimize during training is the negative log likelihood of the model plus the L2 regularization term. The cost can be expressed as:

L = −Σ_{m=1}^{M} log p_{t_m} + βR(U, W, V). (5)

Here M is the number of training samples and t_m is the index of the correct semantic frame for the m'th sample. R is a weight decay penalty applied to the weights of the model and β is the hyperparameter controlling the weight of the regularization term in the cost function.
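Equations 2-5 specify the whole forward computation, so they can be illustrated compactly. The NumPy sketch below is not the authors' implementation: the toy embedding table, random weights, window sizes and frame count are hypothetical, and the stochastic gradient descent updates described later in Section 3.2 are omitted.

import numpy as np

def window_input(tokens, verb_idx, emb, dim, lw=5, rw=5):
    """Build x = C_{v-lw:v-1} (+) C_{v+1:v+rw} (Eq. 2): concatenate the word
    vectors of the left and right windows, zero-padding missing positions
    and skipping the target verb itself."""
    def vec(i):
        if 0 <= i < len(tokens):
            return emb.get(tokens[i], np.zeros(dim))  # OOV words get a zero vector
        return np.zeros(dim)                          # padding outside the sentence
    left = [vec(i) for i in range(verb_idx - lw, verb_idx)]
    right = [vec(i) for i in range(verb_idx + 1, verb_idx + 1 + rw)]
    return np.concatenate(left + right)

def fnn_probabilities(x, W, d, U, b, V, lam=0.0):
    """Net input a = lam*(U tanh(Wx + d) + b) + (1 - lam)*Vx (Eq. 3),
    followed by a softmax over the verb's semantic frames (Eq. 4)."""
    a = lam * (U @ np.tanh(W @ x + d) + b) + (1.0 - lam) * (V @ x)
    e = np.exp(a - a.max())  # numerically stabilised softmax
    return e / e.sum()

def cost_one_sample(probs, gold_frame, params, beta=1e-4):
    """Single-sample version of Eq. 5: negative log likelihood of the correct
    frame plus an L2 penalty on the weight matrices."""
    l2 = sum((p ** 2).sum() for p in params)
    return -np.log(probs[gold_frame]) + beta * l2

# Toy run with hypothetical sizes: 4-dim embeddings, 3 hidden units, 2 frames.
dim, hidden, frames = 4, 3, 2
emb = {w: np.random.randn(dim) for w in ["the", "old", "music", "deeply", "man"]}
tokens = "the old music deeply moved the old man".split()
x = window_input(tokens, tokens.index("moved"), emb, dim)

rng = np.random.default_rng(0)
W = rng.normal(size=(hidden, x.size)); d = np.zeros(hidden)
U = rng.normal(size=(frames, hidden)); b = np.zeros(frames)
V = rng.normal(size=(frames, x.size))
p = fnn_probabilities(x, W, d, U, b, V, lam=0.0)   # lam=0: the two-layer variant
print(p, cost_one_sample(p, 0, [U, W, V]))

With lam set to 0 only the direct connection Vx feeds the softmax, which corresponds to the two-layer setting the paper later reports as sufficient (Section 4.2).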

2.3 Model Analysis

We extend Equation 1 as

c = f(w_{v-lw}, ..., w_i, ..., w_{v+rw}). (6)

Here w_i is the i'th word vector in the input vector space above. Note that f is a continuous function and similar words are likely to have similar word vectors. That is to say, if c1 = f(w_{v-lw}, ..., w_i, ..., w_{v+rw}), we usually have c1 = f(w_{v-lw}, ..., sw_i, ..., w_{v+rw}) when w_i is similar to sw_i. An obvious, if roughly expressed, example: if "Entity move Human" = f("The", "old", "music", "the", "old", "man"), then we will also have "Entity move Human" = f("The", "fast", "melody", "the", "beautiful", "girl"), because "music" and "melody" can be regarded as similar words, as can "man" and "girl". Since one of the critical factors for a semantic frame is the semantic information in specific units (e.g., subject and object), pre-trained vectors can easily capture what this task needs. Thus pre-trained vectors give good generalization ability for semantic frame learning. In the training phase, the FNN can learn to capture the key words which have more impact on the target verb; this will be shown later in the experiments. Because the input of the FNN is a window with fixed length, its ability to capture long-distance key words is limited. Despite this weakness, the model still gets good results.
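As a rough illustration of this argument (not part of the paper), the similarity of the substituted words can be checked directly on pre-trained vectors, for example through gensim; the file path below is a placeholder and the exact scores depend on which vectors are loaded.

from gensim.models import KeyedVectors

# Hypothetical path to pre-trained vectors in word2vec text format.
kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# If these pairs are close in the vector space, substituting one word for the
# other changes x only slightly, so a continuous f(x) tends to map both
# sentences to the same semantic frame.
for w1, w2 in [("music", "melody"), ("man", "girl")]:
    print(w1, w2, kv.similarity(w1, w2))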

    3 Experiments

3.1 Task and Datasets

SemEval-2015 Task 15 is a CPA (Hanks, 2012) dictionary-entry-building task. The task has three subtasks; the two related subtasks are summarized as follows:

• CPA parsing. This task requires identifying the syntactic arguments of the target verb and their semantic types. The result of this task on our example sentence can be "The old [subject/Entity music] deeply moved the old [object/Human man]". The syntactic arguments in the example are "subject" and "object", respectively labelled on the words "music" and "man", and their semantic types are "Entity" and "Human". Thus a pattern of the target verb "move" can be "[subject/Entity] move [object/Human]".

• CPA Clustering. The result of the first task gives the patterns of the sentences. This task aims at clustering the most similar sentences according to the found patterns: two sentences which belong to similar patterns are more likely to be in the same cluster.

(The third subtask is CPA Automatic Lexicography; since it is not related to our work, we do not introduce it here.)


                   Dataset statistics                                              B-cubed or micro-average F-score of methods
          Verb number  Training data  Testing data  Semantic frame number          FNN    SEB    DULUTH   BOB90
MTDSEM    4            136.5          159           4.86                           0.7    0.59   0.52     0.74
          3            1546.33        214.67
PDEV1     407          373.49         158.32        6.53                           0.73   0.63   -        -
PDEV2     63           1421.22        606.60        9.60                           0.82   0.64   -        -

Table 1: Summary statistics for the datasets (left) and results of our FNN model against other methods (right). On the right side, MTDSEM is evaluated by B-cubed F-score for clustering. On PDEV1 and PDEV2, the FNN model is evaluated by micro-average F-score. SEB is always evaluated by B-cubed F-score as the base score. DULUTH and BOB90 are participant teams in 2015.

SemEval-2015 Task 15 has two datasets, called the Microcheck dataset and the Wingspread dataset. The data of SemEval-2015 Task 15 was derived from PDEV (Baisa et al.); that is to say, all the sentences in SemEval-2015 Task 15 are from PDEV. These datasets cover many verbs and have many sentences for each verb. Each sentence of each verb corresponds to one index of the semantic frames. Note that the semantic frames are verb-specific and each verb has a closed set of its own semantic frames. Thus in our experiments we build one model for each verb. Our task is to classify each sentence directly into one semantic frame, which is different from CPA clustering, but we will also test our model with the clustering metric against other systems. We only remove punctuation from these datasets. To test our model we split these datasets into training data and testing data. Summary statistics of these datasets are in Table 1. In Table 1, Figure 2 and Table 3, Verb number is the number of verbs, Training data and Testing data represent the average number of sentences for each verb, and Semantic frame number is the average number of semantic frames for each verb. Details of creating the datasets are as follows:

• MTDSEM: the Microcheck test dataset of SemEval-2015 Task 15. For each verb in MTDSEM we select training sentences from PDEV that do not appear in MTDSEM.

• PDEV1: For each verb, we filter PDEV with the number of sentences not less than 100 and the number of semantic frames not less than 2. Then we split the filtered data into training data and testing data, accounting for 70% and 30% respectively of each semantic frame of each verb (a minimal sketch of this split follows this list).

• PDEV2: Same as PDEV1, but with the threshold on the number of sentences set to 700. PDEV2 ensures that the model has relatively enough training data.

• MTTSEM: the Microcheck train dataset and test dataset of SemEval-2015 Task 15. We split MTTSEM as above to get training data and testing data for each verb. The summary statistics of this dataset are shown separately in Table 3.
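The per-frame 70%/30% split used for PDEV1 and PDEV2 (referenced in the PDEV1 item above) can be reconstructed as follows. This is an illustrative sketch rather than the authors' script; one verb's data is assumed to be a list of (sentence, frame_index) pairs and the shuffling seed is arbitrary.

import random
from collections import defaultdict

def split_by_frame(examples, train_ratio=0.7, seed=0):
    """Split one verb's (sentence, frame_index) pairs so that roughly
    train_ratio of each semantic frame's sentences go to the training set."""
    by_frame = defaultdict(list)
    for sentence, frame in examples:
        by_frame[frame].append(sentence)
    rng = random.Random(seed)
    train, test = [], []
    for frame, sentences in by_frame.items():
        rng.shuffle(sentences)
        cut = int(round(train_ratio * len(sentences)))
        train += [(s, frame) for s in sentences[:cut]]
        test += [(s, frame) for s in sentences[cut:]]
    return train, test

# Tiny hypothetical example: one verb with a 10-sentence frame and a 4-sentence frame.
examples = [("sentence %d" % i, 0) for i in range(10)] + \
           [("sentence %d" % i, 1) for i in range(10, 14)]
train, test = split_by_frame(examples)
print(len(train), len(test))  # 10 and 4, split roughly 70/30 within each frame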

We use publicly available word vectors trained with the GloVe model (Pennington et al., 2014) on Wikipedia and Gigaword. The vectors have a dimensionality of 300. Word vectors not in the pre-trained vocabulary are set to zero.

3.2 Experimental Setup

We build one model for each verb. Training is done by stochastic gradient descent with shuffled minibatches; we keep the word vectors static and only update the other parameters. In our experiments we keep the same hyperparameters for every verb: we set the learning rate to 0.1, lw and rw to 5, the minibatch size to 5, the L2 regularization parameter β to 0.0001, the number of hidden units to 30 and λ to 0. Because of the limited training data, we do not use early stopping; training stops when the zero-one loss is zero over the training data for each verb. The official evaluation method used the B-cubed definition of precision and recall (Bagga and Baldwin, 1998) for CPA clustering, and the final score is the average of B-cubed F-scores over all verbs. Since our task can be regarded as supervised classification, we also use the micro-average F-score to evaluate our results.
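For reference, the B-cubed score used for the clustering comparison can be computed per verb as in the sketch below (our own minimal reading of Bagga and Baldwin (1998), not the official scorer); predicted and gold clusterings are assumed to be given as label lists aligned with the test sentences.

def b_cubed_f(pred_labels, gold_labels):
    """B-cubed F-score for one verb: per-item precision/recall over the
    predicted and gold clusters containing that item, averaged over items."""
    n = len(pred_labels)
    precisions, recalls = [], []
    for i in range(n):
        pred_cluster = {j for j in range(n) if pred_labels[j] == pred_labels[i]}
        gold_cluster = {j for j in range(n) if gold_labels[j] == gold_labels[i]}
        overlap = len(pred_cluster & gold_cluster)
        precisions.append(overlap / len(pred_cluster))
        recalls.append(overlap / len(gold_cluster))
    p = sum(precisions) / n
    r = sum(recalls) / n
    return 2 * p * r / (p + r)

# The SEB baseline puts every sentence in one cluster; on a toy gold labelling
# with two equally sized frames this gives an F-score of about 0.67.
print(b_cubed_f([0, 0, 0, 0], [0, 0, 1, 1]))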

3.3 Experimental Results

Table 1 shows the results on MTDSEM for supervised and unsupervised approaches. The SemEval-2015 Task 15 baseline (SEB) clusters all sentences together for each verb.


Verb-specific Sentences                                                          Verb-specific Semantic Frames
Mary resisted the temptation to answer her back and after a moment's silence    [[Human 1]] answer ([[Human 2]]) back [[Human 1]]
Pamala Klein would seem to have a lot to answer for.                             [[Human]] have a lot to answer for [NO OBJ]
and I will answer for her safety                                                 [[Human]] answer [NO OBJ] for [[Eventuality]]
he cannot answer for Labour party policies                                       [[Human]] answer [NO OBJ] for [[Eventuality]]
it is fiction and can not be made real by acting it out                          [[Human]] act [[Event or Human Role or Emotion]] out
You should try to build up a network of people you trust                         [[Human]] build ([[Entity]]) up

Table 2: Example results of our FNN model mapping verb-specific sentences to semantic frames on PDEV.

That is to say, SEB assigns the same cluster to all the sentences and is evaluated by B-cubed F-score for clustering, so its score depends on the distribution of semantic frames: the higher the score, the more concentrated the distribution of semantic frames. A higher SEB score usually indicates that other methods are more likely to get high scores, so we use it as a base score. DULUTH (Pedersen, 2015) treated this task as an unsupervised word sense discrimination or induction problem; the number of semantic frames was predicted on the basis of the best value of the clustering criterion function. BOB90, which did not submit an article, used a supervised approach to tackle the clustering problem (Baisa et al., 2015) and got the best score on MTDSEM. Example results of the FNN model on PDEV are shown in Table 2.

    4 Discussions

    4.1 Large vs. Small Training Data

MTDSEM is divided into two parts, reported in the left part of Table 1: one part has larger training data while the other has little. Our FNN model gets a relatively lower score on MTDSEM, mainly because the training data of one part is too small: FNN got a 0.88 B-cubed F-score on the part with larger training data and 0.57 on the other part. In order to show the real power of our model, PDEV1 and PDEV2 were made, which have much more training data than MTDSEM and more verbs to test; they show a better result on hundreds of verbs. We also made Figure 2 to show the performance of the FNN model as the training data size increases. As a result, our method can perform really well given sufficient training data.

    4.2 The Direct Connection

Our FNN model has a direct connection from the input to the output layer, controlled by λ in the second term of Equation 3.


Figure 2: Results of FNN on PDEV2. The testing data is fixed at 606.60. The training data (x-axis, from 11.63 to 1421.22 average sentences per verb) increases two times at each step. The y-axis represents B-cubed F-score for SEB and micro-average F-score for FNN.

It is designed to speed up the convergence of training (Bengio et al., 2006), since the direct connection allows the model to learn quickly from the input. But in our experiments the number of epochs before convergence is very close for the FNN with two layers and the FNN with three layers. On the contrary, we observed that the FNN with two layers, where λ is set to zero, got a slightly better F-score than the FNN where λ is set to 0.5 or 1. This may suggest that an FNN with two layers is good enough on PDEV.

4.3 The Ability of Capturing Key Words

The FNN has the ability to capture the key words of the target verb. To show this, we test our FNN model on MTTSEM with the different preprocessing schemes shown in Table 3. We only remove punctuation for MTTSEM1, as before. MTTSEM2 only contains the gold annotations of syntactic arguments provided by CPA parsing; note that MTTSEM2 thus only contains the key words for each target verb and ignores the unimportant words in the sentences. MTTSEM3 is the same as MTTSEM2, but with the arguments for each target verb provided by the Stanford Parser (De Marneffe et al., 2006).


Dependents that have the following relations to the target verb are extracted as arguments: nsubj, xsubj, dobj, iobj, ccomp, xcomp, prep_*.
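Given a dependency parse of the sentence, that filtering step can be sketched as below. The parse is represented as plain (dependent_index, relation, head_index) triples rather than any particular parser's output format, and the toy parse is hypothetical; prep_* relations are matched by prefix.

KEPT_RELATIONS = {"nsubj", "xsubj", "dobj", "iobj", "ccomp", "xcomp"}

def extract_arguments(tokens, dependencies, verb_idx):
    """Keep dependents of the target verb whose relation to it is one of the
    listed types, or any prep_* relation."""
    args = []
    for dep_idx, relation, head_idx in dependencies:
        if head_idx != verb_idx:
            continue
        if relation in KEPT_RELATIONS or relation.startswith("prep_"):
            args.append((relation, tokens[dep_idx]))
    return args

# Toy parse of the running example (indices and relations are illustrative).
tokens = ["The", "old", "music", "deeply", "moved", "the", "old", "man"]
deps = [(2, "nsubj", 4), (3, "advmod", 4), (7, "dobj", 4)]
print(extract_arguments(tokens, deps, 4))  # [('nsubj', 'music'), ('dobj', 'man')]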

As a result, the FNN reasonably gets the best score on MTTSEM2, and it also gets a good score on MTTSEM1 but a much lower score on MTTSEM3. This shows that the FNN does have the ability to capture the key words of the target verb. The results on MTTSEM1 and MTTSEM3 show that our FNN model captures the key words more effectively than the parser for this task.

                 MTTSEM1               MTTSEM2            MTTSEM3
                 (verb-specific        (gold              (automatic
                 sentences)            annotations)       annotations)
Verb number      28
Training data    111.25
Testing data     46.39
FNN              0.76                  0.82               0.67
SEB              0.62

Table 3: Results on MTTSEM with different preprocessing.

    5 Conclusion

This paper has described an end-to-end approach to obtaining verb-specific semantic frames. We evaluated our method on annotated data, but we do not identify the semantic roles for target verbs, and the verb-specific model suffers from not having enough training data. A promising direction is to merge these semantic frames over multiple verbs, which can greatly increase the training data size. In addition, a convolutional layer could be applied to the input vectors to extract features around the verb, and more powerful neural networks could be used to model the verb.

    Acknowledgments

We would like to thank the anonymous reviewers for their helpful suggestions and comments.

References

Keith Alan. 2001. In Natural Language Semantics, page 251. Blackwell Publishers Ltd, Oxford.

Amit Bagga and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 79–85. Association for Computational Linguistics.

Vít Baisa, Ismaïl El Maarouf, Pavel Rychlý, and Adam Rambousek. Software and data for corpus pattern analysis.

Vít Baisa, Jane Bradbury, Silvie Cinková, Ismaïl El Maarouf, Adam Kilgarriff, and Octavian Popescu. 2015. SemEval-2015 task 15: A corpus pattern analysis dictionary-entry-building task.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics.

Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.

Hans Christian Boas. 2002. Bilingual FrameNet dictionaries for machine translation. In LREC.

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

Ismaïl El Maarouf and Vít Baisa. 2013. Automatic classification of patterns from the pattern dictionary of English verbs. In Joint Symposium on Semantic Processing, page 95.

Ismaïl El Maarouf, Jane Bradbury, Vít Baisa, and Patrick Hanks. 2014. Disambiguating verbs by collocation: Corpus lexicography meets natural language processing. In LREC, pages 1001–1006.

Yukun Feng, Qiao Deng, and Dong Yu. 2015. BLCUNLP: Corpus pattern analysis for verbs based on dependency chain. Proceedings of SemEval.

Patrick Hanks. 2012. How people use words to make meanings: Semantic types meet valencies. Input, Process and Product: Developments in Teaching and Language Corpora, pages 54–69.

Daisuke Kawahara, Daniel Peterson, and Martha Palmer. 2014a. A step-wise usage-based method for inducing polysemy-aware verb classes. In ACL (1), pages 1030–1040.

Daisuke Kawahara, Daniel Peterson, Octavian Popescu, Martha Palmer, and Fondazione Bruno Kessler. 2014b. Inducing example-based semantic frames from a massive amount of verb uses. In EACL, pages 58–67.

Jiří Materna. 2012. LDA-frames: An unsupervised approach to generating semantic frames. In Computational Linguistics and Intelligent Text Processing, pages 376–387. Springer.

Jiří Materna. 2013. Parameter estimation for LDA-frames. In HLT-NAACL, pages 482–486.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

Chad Mills and Gina-Anne Levow. 2015. CMILLS: Adapting semantic role labeling features to dependency parsing. SemEval-2015, page 433.

Srini Narayanan and Sanda Harabagiu. 2004. Question answering based on semantic structures. In Proceedings of the 20th International Conference on Computational Linguistics, page 693. Association for Computational Linguistics.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Ted Pedersen. 2015. Duluth: Word sense discrimination in the service of lexicography. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 438–442, Denver, Colorado, June. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Octavian Popescu. 2013. Learning corpus patterns using finite state automata. In Proceedings of the 10th International Conference on Computational Semantics, pages 191–203.

Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon.


Proceedings of NAACL-HLT 2016, pages 8–15, San Diego, California, June 12-17, 2016. ©2016 Association for Computational Linguistics

Analogy-based Detection of Morphological and Semantic Relations with Word Embeddings: What Works and What Doesn't.

Anna Gladkova
Department of Language and Information Sciences
The University of Tokyo
Tokyo, Japan
[email protected]

Aleksandr Drozd
Global Scientific Information and Computing Center
Tokyo Institute of Technology
Tokyo, Japan
[email protected]

Satoshi Matsuoka
Global Scientific Information and Computing Center
Tokyo Institute of Technology
Tokyo, Japan
[email protected]

    Abstract

Following up on numerous reports of analogy-based identification of "linguistic regularities" in word embeddings, this study applies the widely used vector offset method to 4 types of linguistic relations: inflectional and derivational morphology, and lexicographic and encyclopedic semantics. We present a balanced test set with 99,200 questions in 40 categories, and we systematically examine how accuracy for different categories is affected by window size and dimensionality of the SVD-based word embeddings. We also show that GloVe and SVD yield similar patterns of results for different categories, offering further evidence for conceptual similarity between count-based and neural-net-based models.

    1 Introduction

The recent boom of research on analogies with word embedding models is largely due to the striking demonstration of "linguistic regularities" (Mikolov et al., 2013b). In the so-called Google analogy test set (Mikolov et al., 2013a) the task is to solve analogies with vector offsets (a frequently cited example is king - man + woman = queen). This test is a popular benchmark for word embeddings, with some achieving 80% accuracy (Pennington et al., 2014).

Analogical reasoning is a promising line of research, since it can be used for morphological analysis (Lavallée and Langlais, 2010), word sense disambiguation (Federici et al., 1997), and even for broad-range detection of both morphological and semantic features (Lepage and Goh, 2009). However, it remains to be seen to what extent word embeddings capture these "linguistic regularities". The Google analogy test set includes only 15 relations, and Köper et al. (2015) showed that lexicographic relations such as synonymy are not reliably discovered in the same way.

This study systematically examines how well various kinds of linguistic relations can be detected with the vector offset method, and how this process is affected by window size and dimensionality of count-based word embeddings. We develop a new, more balanced test set (BATS) which includes 99,200 questions in 40 morphological and semantic categories. The results of this study are of practical use in real-world applications of analogical reasoning, and also provide a more accurate estimate of the degree to which word embeddings capture linguistic relations.

    2 Related work

Current research on analogical reasoning in word embeddings focuses on the so-called "proportional analogies" of the a:b::c:d kind. The task is to detect whether two pairs of words have the same relation. A recent term is "linguistic regularity" (Mikolov et al., 2013b), used to refer to any "similarities between pairs of words" (Levy et al., 2014). Analogies have been successfully used for detecting different semantic relations, such as synonymy and antonymy (Turney, 2008), ConceptNet relations and selectional preferences (Herdadelen and Baroni, 2009), and also for inducing morphological categories from unparsed data (Soricut and Och, 2015).

The fact that analogies are so versatile means that, to make any claims about a model being good at analogical reasoning, we need to show what types of analogies it can handle.


This can only be determined with a comprehensive test set. However, the current sets tend to include only a certain type of relations (semantic-only: SAT (Turney et al., 2003), SemEval2012-Task2 (Jurgens et al., 2012); morphology-only: MSR (Mikolov et al., 2013b)). The Google analogy test (Mikolov et al., 2013a) contains 9 morphological and 5 semantic categories, with 20-70 unique word pairs per category which are combined in all possible ways to yield 8,869 semantic and 10,675 syntactic questions. (For semantic relations there are also generic resources such as EVALution (Santus et al., 2015), and semantic similarity sets such as BLESS and WordSim353 (Baroni and Lenci, 2011), which are sometimes used as sources for compiling analogy tests; for example, Vylomova et al. (2015) present a compilation with 18 relations in total (58 to 3163 word pairs per relation): 10 semantic, 4 morphological, 2 affix-derived word relations, animal collective nouns, and verb-object pairs.)

None of the existing tests is balanced across different types of relations (word-formation getting particularly little attention). With unbalanced sets, and potentially high variation in performance for different relations, it is important to evaluate results on all relations, and not only the average.

Unfortunately, this is not common practice. Despite the popularity of the Google test set, the only study we have found that provides data for individual categories is (Levy et al., 2014). In their experiments, accuracy varied between 10.53% and 99.41%, and much of the success in the semantic part was due to the fact that two categories explore the same capital:country relation and together constitute 56.72% of all semantic questions. This shows that a model may be more successful with some relations but not others, and more comprehensive tests are needed to show what it can and cannot do.

Model parameters can also have a major impact on performance (Levy et al., 2015; Lai et al., 2015). So far they have been studied in the context of semantic priming (Lapesa and Evert, 2014), semantic similarity tasks (Kiela and Clark, 2014), and across groups of tasks (Bullinaria and Levy, 2012). However, these results are not necessarily transferable to different tasks; e.g. dependency-based word embeddings perform better on the similarity task, but worse on analogies (Levy and Goldberg, 2014a).


Some studies report effects of changing model parameters on general accuracy on the Google analogy test (Levy et al., 2015; Lai et al., 2015), but, to our knowledge, this is the first study to address the effect of model parameters on individual linguistic relations in the context of the analogical reasoning task.

    3 The Bigger Analogy Test Set (BATS)

We introduce BATS, the Bigger Analogy Test Set. It covers 40 linguistic relations that are listed in Table 1. Each relation is represented with 50 unique word pairs, which yields 2,480 questions per category (99,200 in the full set). BATS is balanced across 4 types of relations: inflectional and derivational morphology, and lexicographic and encyclopedic semantics.

A major feature of BATS that is not present in the MSR and Google test sets is that the morphological categories are sampled to reduce homonymy. For example, for verb present tense the Google set includes pairs like walk:walks, which could be both verbs and nouns. It is impossible to completely eliminate homonymy, as a big corpus will have some creative uses for almost any word, but we reduce it by excluding words attributed to more than one part of speech in WordNet (Miller and Fellbaum, 1998). After generating lists of such pairs, we select 50 pairs by top frequency in our corpus (section 4.2).
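One way to implement that filter, sketched here with NLTK's WordNet interface (the paper does not name its tooling, so the library choice is an assumption):

from nltk.corpus import wordnet as wn  # requires the 'wordnet' corpus: nltk.download('wordnet')

def is_single_pos(word):
    """True if WordNet lists the word under exactly one part of speech
    (adjective satellites, tagged 's', are folded into adjectives)."""
    pos_tags = {s.pos().replace("s", "a") for s in wn.synsets(word)}
    return len(pos_tags) == 1

print(is_single_pos("aardvark"))  # True: noun only
print(is_single_pos("dog"))       # False: listed as both noun and verb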

The semantic part of BATS does include homonyms, since semantic categories are overall smaller than morphological categories, and it is the more frequently used words that tend to have multiple functions. For example, both dog and cat are also listed in WordNet as verbs, and aardvark is not; a homonym-free list of animals would mostly contain low-frequency words, which in itself decreases performance. However, we did our best to avoid clearly ambiguous words; e.g. prophet Muhammad was not included in the E05 name:occupation section, because many people have the same name.

The lexicographic part of BATS is based on SemEval2012-Task2, extended by the authors with words similar to those included in the SemEval set. About 15% of the extra words came from BLESS and EVALution. The encyclopedic section was compiled on the basis of word lists in Wikipedia and other internet resources.

(Sources for the encyclopedic categories: E06-08: https://en.wikipedia.org/wiki/List_of_animal_names; E02: http://www.infoplease.com/ipa/A0855611.html)


Type          Subcategory       Analogy structure and examples
Inflections   Nouns             I01: regular plurals (student:students)
                                I02: plurals - orthographic changes (wife:wives)
              Adjectives        I03: comparative degree (strong:stronger)
                                I04: superlative degree (strong:strongest)
              Verbs             I05: infinitive: 3Ps.Sg (follow:follows)
                                I06: infinitive: participle (follow:following)
                                I07: infinitive: past (follow:followed)
                                I08: participle: 3Ps.Sg (following:follows)
                                I09: participle: past (following:followed)
                                I10: 3Ps.Sg: past (follows:followed)
Derivation    No stem change    D01: noun+less (life:lifeless)
                                D02: un+adj. (able:unable)
                                D03: adj.+ly (usual:usually)
                                D04: over+adj./Ved (used:overused)
                                D05: adj.+ness (same:sameness)
                                D06: re+verb (create:recreate)
                                D07: verb+able (allow:allowable)
              Stem change       D08: verb+er (provide:provider)
                                D09: verb+ation (continue:continuation)
                                D10: verb+ment (argue:argument)
Lexicography  Hypernyms         L01: animals (cat:feline)
                                L02: miscellaneous (plum:fruit, shirt:clothes)
              Hyponyms          L03: miscellaneous (bag:pouch, color:white)
              Meronyms          L04: substance (sea:water)
                                L05: member (player:team)
                                L06: part-whole (car:engine)
              Synonyms          L07: intensity (cry:scream)
                                L08: exact (sofa:couch)
              Antonyms          L09: gradable (clean:dirty)
                                L10: binary (up:down)
Encyclopedia  Geography         E01: capitals (Athens:Greece)
                                E02: country:language (Bolivia:Spanish)
                                E03: UK city:county (York:Yorkshire)
              People            E04: nationalities (Lincoln:American)
                                E05: occupation (Lincoln:president)
              Animals           E06: the young (cat:kitten)
                                E07: sounds (dog:bark)
                                E08: shelter (fox:den)
              Other             E09: thing:color (blood:red)
                                E10: male:female (actor:actress)

Table 1: The Bigger Analogy Test Set: categories and examples

Categories E01 and E10 are based on the Google test set, and category E09 on the color dataset (Bruni et al., 2012). In most cases we did not rely on one source completely, as the sources did not make the necessary distinctions, included clearly ambiguous or low-frequency words, and/or were sometimes inconsistent (e.g. sheep:flock in EVALution is a better example of the member:collection relation than jury:court).

Another new feature of BATS, as compared to the Google test set and SemEval, is that it contains several acceptable answers (sourced from WordNet), where applicable. For example, both mammal and canine are hypernyms of dog.

(Further sources: E03: http://whitefiles.org/b4_g/5_towns_to_counties_index/; L02: https://www.vocabulary.com/lists/189583#view=notes; L07: http://justenglish.me/2012/10/17/character-feelings)

(No claims are made about our own work being free from inconsistencies, as no dictionary will ever be so.)


    4 Testing the test

4.1 The vector offset method

As mentioned above, Mikolov et al. (2013a) suggested capturing the relations between words as the offset of their vector embeddings. The answer to the question "a is to b as c is to ?d" is represented by the hidden vector d, calculated as argmax_{d∈V} sim(d, c − a + b). Here V is the vocabulary excluding words a, b and c, and sim is a similarity measure, for which Mikolov and many other researchers use the angular distance: sim(u, v) = cos(u, v) = u·v / (||u|| ||v||).

Levy and Goldberg (2014b) propose an alternative optimization objective: argmax_{d∈V} cos(d − c, b − a). They report that this method produces more accurate results for some categories. Essentially it requires d − c and b − a to share the same direction and discards the lengths of these vectors.

We supply the BATS test set with a Python evaluation script that implements both methods (http://vsm.blackbird.pw/bats). We report results calculated with Mikolov's method for the sake of consistency, but some authors choose the best result for each category from either method (Levy and Goldberg, 2014b).
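Both objectives are straightforward to state in NumPy. The sketch below is an illustration rather than the released evaluation script; it assumes an embedding matrix with L2-normalised rows and a word-to-index mapping, and it excludes a, b and c from the candidate answers as described above.

import numpy as np

def solve_analogy(a, b, c, vocab, vectors, method="3cosadd"):
    """Answer "a is to b as c is to ?d".
    vectors: array of shape (|V|, dim) with L2-normalised rows; vocab: word -> row.
    "3cosadd" follows Mikolov et al. (2013a): argmax_d cos(d, c - a + b).
    "pairdirection" follows Levy and Goldberg (2014b): argmax_d cos(d - c, b - a)."""
    va, vb, vc = (vectors[vocab[w]] for w in (a, b, c))
    if method == "3cosadd":
        target = vc - va + vb
        scores = vectors @ (target / np.linalg.norm(target))
    else:
        offset = vb - va
        diffs = vectors - vc
        norms = np.linalg.norm(diffs, axis=1)
        norms[norms == 0] = 1e-9                   # avoid division by zero at word c
        scores = (diffs @ offset) / (norms * np.linalg.norm(offset))
    for w in (a, b, c):                            # exclude the question words
        scores[vocab[w]] = -np.inf
    best = int(np.argmax(scores))
    return next(w for w, i in vocab.items() if i == best)

# Toy call with random (hence meaningless) vectors, just to show the interface.
words = ["king", "man", "woman", "queen", "apple"]
vocab = {w: i for i, w in enumerate(words)}
vecs = np.random.randn(len(words), 50)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(solve_analogy("man", "king", "woman", vocab, vecs, method="pairdirection"))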

4.2 Corpus and models

One of the current topics in research on word embeddings is the (de)merits of count-based models as compared to neural-net-based models. While some researchers find that the latter outperform the former (Baroni et al., 2014), others show that these approaches are mathematically similar (Levy and Goldberg, 2014b). We compare models of both types as a contribution to the ongoing dispute.

Our count-based model is built with Pointwise Mutual Information (PMI) frequency weighting. In the dimensionality reduction step we used Singular Value Decomposition (SVD), raising the Σ matrix element-wise to the power of a, where 0 < a ≤ 1, to give a boost to dimensions with smaller variance (Caron, 2001). In this study, unless mentioned otherwise, a = 1. The co-occurrence extraction was performed with the kernel developed by Drozd et al. (2015).
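The mathematical shape of that model can be sketched densely as below. This is not the actual pipeline: the real models use sparse co-occurrence counts over a multi-billion-token corpus extracted with the kernel of Drozd et al. (2015), and clipping negative PMI values to zero (PPMI) is a common convention assumed here rather than stated in the paper.

import numpy as np

def svd_embeddings(cooc, dim, a=1.0):
    """Word vectors from a co-occurrence matrix: PMI weighting (negative and
    undefined values set to zero), truncated SVD, and element-wise raising of
    the singular values to the power a before scaling U."""
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total
    p_c = cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc / total) / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    pmi = np.maximum(pmi, 0.0)
    U, sigma, _ = np.linalg.svd(pmi, full_matrices=False)
    # Keep the top `dim` dimensions; a < 1 boosts dimensions with smaller variance.
    return U[:, :dim] * (sigma[:dim] ** a)

# Toy 4-word co-occurrence matrix reduced to 2 dimensions with a = 0.5.
cooc = np.array([[0, 3, 1, 0],
                 [3, 0, 2, 1],
                 [1, 2, 0, 4],
                 [0, 1, 4, 0]], dtype=float)
print(svd_embeddings(cooc, dim=2, a=0.5))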



Figure 1: GloVe and SVD: accuracy on different types of relations. The x-axis lists the 40 BATS categories (I01-I10, D01-D10, L01-L10, E01-E10) and the y-axis shows accuracy. The legend reports GloVe w10 d300 (average accuracy 0.285) and SVD w3 d1000 (average accuracy 0.221).

As a representative of implicit models we chose GloVe (Pennington et al., 2014), which has achieved the highest performance on the Google test set to date. Our source corpus combines the English Wikipedia snapshot from July 2015 (1.8B tokens), Araneum Anglicum Maius (1.2B) (Benko, 2014) and ukWaC (2B) (Baroni et al., 2009). We discarded words occurring less than 100 times, resulting in a vocabulary of 301,949 words (uncased).

To check the validity of our models we evaluated them on the Google test set, for which there are numerous reported results. For GloVe we used the parameters from the original study (Pennington et al., 2014): 300 dimensions, window 10, 100 iterations, xmax = 100, a = 3/4, sentence borders ignored. For comparison we also built an SVD model with 300 dimensions and window size 10. On our 5B corpus GloVe achieved 80.4% average accuracy (versus 71.7% on the 6B corpus in the original study). The comparable SVD model achieved 49.9%, as opposed to the 52.6% result reported by Levy et al. (2015) for 500 dimensions and window size 10 on a 1.5B-token Wikipedia corpus.

To evaluate the effects of window size and dimensionality we built 19 SVD-based models: for windows 2-8 at 1000 dimensions, and for dimensions 100-1200 at window size 5.

    5 Results and discussion

    5.1 Word category effect

Figure 1 presents the results of the BATS test on the GloVe model (built with the parameters from the original study (Pennington et al., 2014)) and the best-performing SVD model, which was the model with window size 3 at 1000 dimensions. The model built with the same parameters as GloVe achieved only 15.9% accuracy on BATS, and is not shown.

While GloVe outperforms the SVD-based model on most categories, neither of them achieves even 30% accuracy, suggesting that BATS is much more difficult than the Google test set. Many categories are either not captured well by the embeddings, or cannot be reliably retrieved with the vector offset, or both. The overall pattern of easier and more difficult categories is the same for GloVe and SVD, which supports the conclusion of Levy and Goldberg (2014b) about the conceptual similarity of explicit and implicit models. The overall performance of both models could perhaps be improved by parameters that we did not consider, but the point is that the current state of the art in analogical reasoning with word embeddings handles well only certain types of linguistic relations, and there are directions for improvement that have not been considered so far.

The high variation we observe in this experiment is consistent with evidence from systems competing at SemEval2012-Task2, where not a single system was able to achieve superior performance on all subcategories.


Fried and Duh (2015) also showed a similar pattern in 7 different word embeddings.

As expected, inflectional morphology is overall easier than semantics, as shown even by the Google test results (see Skip-Gram (Mikolov et al., 2013a; Lai et al., 2015), GloVe (Pennington et al., 2014), and K-Net (Cui et al., 2014), among others). But it is surprising that derivational morphology is significantly more difficult to detect than inflectional: only 3 categories out of ten yield even 20% accuracy.

The low accuracy on the lexicographic part of BATS is consistent with the findings of Köper et al. (2015). It is not clear why lexicographic relations are so difficult to detect with the vector offset method, despite numerous successful word similarity tests on much the same relations, and the fact that BATS makes the task easier by accepting several correct answers. The easiest category is binary antonyms of the up:down kind - the category for which the choice should be the most obvious in the semantic space.

A typical mistake that our SVD models make in semantic questions is suggesting a morphological form of one of the source words in the a:b::c:d analogy: cherry:red :: potato:?potatoes instead of potato:brown. It would thus be beneficial to exclude from the set of possible answers not only the words a, b and c, but also their morphological forms.

5.2 Window size effect

Evaluating two count-based models on the semantic and syntactic parts of the Google test set, Lebret and Collobert (2015) show that the former benefit from larger windows while the latter do not. Our experiments with SVD models using different window sizes only partly concur with this finding.

Table 2 presents the accuracy for all categories of BATS using a 1000-dimension SVD model with window size varying between 2 and 8. The codes and examples for each category are listed in Table 1. All categories are best detected between window sizes 2-4, although 9 of them yield equally good performance in larger windows. This indicates that there is not a one-to-one correspondence between "semantics" and "larger windows" or "morphology" and "smaller windows".

      2   3   4   5   6   7   8        2   3   4   5   6   7   8
I01   62  71  70  68  67  65  58  L01  11  10   9   8   7   6   6
I02   41  50  47  44  42  40  34  L02   5   4   4   4   4   5   4
I03   57  61  58  52  47  41  32  L03  10   8   8   8   7   6   4
I04   49  57  51  45  40  35  25  L04   5   5   5   5   5   5   4
I05   27  37  39  36  34  32  29  L05   2   0   1   1   1   1   1
I06   62  71  67  63  60  58  53  L06   3   3   4   3   3   3   3
I07   26  32  36  36  36  36  34  L07  13  12   9   7   6   5   4
I08   21  20  19  18  18  18  16  L08  19  16  13  12  10   9   6
I09   23  30  34  35  36  36  35  L09  15  19  17  14  12  11   9
I10   25  25  23  21  19  19  17  L10  32  33  30  28  27  25  24
D01    0   0   0   0   0   0   0  E01  69  77  79  77  74  71  69
D02   12  13  12  12  11  10   9  E02  29  28  24  22  21  20  17
D03   10  18  20  20  20  20  19  E03  11  18  18  18  18  18  17
D04   12   8   6   5   4   3   2  E04  19  10   3   3   3   3   4
D05    7  13  13  11   9   8   5  E05  20  15  15  14  14  13  13
D06   15  24  18  13  10   8   5  E06   2   2   1   1   1   1   1
D07    4   4   3   2   2   1   1  E07   2   3   3   2   2   1   1
D08    1   2   2   2   1   1   1  E08   0   0   0   0   0   0   0
D09    6  10  11  11  11  11  10  E09  19  18  19  18  18  19  18
D10    3  12  12  10  10   9   9  E10  20  25  25  25  24  23  21

Table 2: Accuracy of the SVD-based model on 40 BATS categories, window sizes 2-8, 1000 dimensions

Also, different categories benefit from changing window size in different ways: for noun plurals the difference between the best and the worst choice is 13%, but for categories where accuracy is lower overall there is not much gain from altering the window size.

Our results are overall consistent with the evaluation of an SVD-based model on the Google set by Levy et al. (2015). That study reports 59.1% average accuracy for window size 2, 56.9% for window size 5, and 56.2% for window size 10. However, using window sizes 3-4 clearly merits further investigation. Another question is whether changing window size has a different effect on different models, as the data of Levy et al. (2015) suggest that GloVe actually benefits from larger windows.

    5.3 Vector dimensionality effect

Intuitively, larger vectors capture more information about individual words, and therefore should increase accuracy of detecting linguistic patterns. In our data this was true of 19 BATS categories (I01-02, I04, I06, D02-03, D05-07, E01, E03, E07, E10, L03-04, L07-10): all of them either peaked at 1200 dimensions or did not start decreasing by that point.

However, the other 20 relations show all kinds of patterns. 14 categories peaked between 200 and 1100 dimensions, and then performance started decreasing (I03, I05, I07-10, D01, D04, D09, E02, E05, E09, L01, L06). 2 categories showed a negative effect of higher dimensionality (D08, E04). Finally, 2 categories showed no dimensionality effect (E08, L05), and 3 more showed idiosyncratic patterns with several peaks (D10, E02, L06); however, this could be chance variation, as in these categories performance was generally low (under 10%). Figure 2 shows several examples of these different trends (see footnote 5).

Figure 2: Effect of vector dimensionality: example categories. [Figure omitted: accuracy (0.0-0.9) as a function of vector size (100-1200 dimensions) for E04 nationalities, E10 male - female, I01 noun plurals, and I05 infinitive - 3Ps.Sg.]

The main takeaway from this experiment is that, although 47.5% of BATS categories do perform better at higher dimensions (at least for SVD-based models), 40% do not, and, as with window size, there is no correlation between the type of relation (semantic or morphological) and its preference for higher or lower dimensionality. One possible explanation for the lower saturation points of some relations is that, once the dimensions corresponding to the core aspects of a particular relation are included in the vectors, adding more dimensions increases noise. For practical purposes this means that choosing model parameters would have to be done to target specific relations rather than relation types.

    5.4 Other parameters

Within the scope of this study we did not investigate all possible parameters, but our pilot experiments show that changing the power a applied to the Σ matrix of the SVD transformation can boost or decrease the performance on individual categories by 40-50%. A smaller value of a gives more weight to the dimensions which capture less variance in the original data, which can correspond to subtle linguistic nuances. However, as with windows and dimensions, no setting yields the best result for all categories.
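
    To make the role of this parameter concrete, the following sketch (ours, not the paper's code) shows where the exponent a enters a typical SVD-based count model; the PPMI co-occurrence matrix and the target dimensionality are assumed inputs.

```python
import numpy as np

def svd_word_vectors(ppmi_matrix, dim=1000, a=1.0):
    """Word vectors from a (words x contexts) PPMI co-occurrence matrix.

    `a` is the power applied to the singular values: a = 1 keeps the
    standard truncated SVD, while smaller values flatten the spectrum,
    giving relatively more weight to low-variance dimensions.
    """
    U, s, _ = np.linalg.svd(ppmi_matrix, full_matrices=False)
    return U[:, :dim] * (s[:dim] ** a)  # rows are the word vectors

# Illustrative use: compare two weightings of the same matrix.
# vectors_flat = svd_word_vectors(ppmi, dim=1000, a=0.25)
# vectors_std  = svd_word_vectors(ppmi, dim=1000, a=1.0)
```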

A big factor is word frequency, and it deserves more attention than we can provide within the scope of this paper. Some categories could perform worse because they contain only low-frequency vocabulary; in our corpus, this could be the case for D01 and D04-06 (see footnote 6). But other derivational categories still do not yield higher accuracy even if their frequency distribution is comparable with that of an “easier” category (e.g. D08 and E10). Also, SVD was shown to handle low frequencies well (Wartena, 2014).

    5 All data for all categories can be found at http://vsm.blackbird.pw/bats

    6 Conclusion

This study follows up on numerous reports of successful detection of linguistic relations with the vector offset method in word embeddings. We develop BATS, a balanced analogy test set with 40 morphological and semantic relations (99,200 questions in total). Our experiments show that derivational and lexicographic relations remain a major challenge. Our best-performing SVD-based model and GloVe achieved only 22.1% and 28.5% average accuracy, respectively. The overall pattern of “easy” and “difficult” categories is the same for the two models, offering further evidence in favor of conceptual similarity between explicit and implicit word embeddings. We hope that this study will draw the attention of the NLP community to word embeddings and analogical reasoning algorithms in the context of lexicographic and derivational relations (see footnote 7).

Our evaluation of the effect of vector dimensionality on the accuracy of analogy detection with SVD-based models shows that roughly half of the BATS categories are best discovered with over 1000 dimensions, but 40% peak between 200 and 1100. There does not seem to be a correlation between the type of linguistic relation and a preference for higher or lower dimensionality. Likewise, our data does not confirm the intuition that larger windows are more beneficial for semantic relations and smaller windows for morphological ones, as our SVD model performed best on both relation types with windows 2-4. Further research is needed to establish whether other models behave in the same way.

6 Data on the frequency distribution of words in BATS categories in our corpus can be found at http://vsm.blackbird.pw/bats

7 BATS was designed for word-level models and does not focus on word phrases, but we included WordNet phrases as possible correct answers, which may be useful for phrase-aware models. Also, morphological categories involving orthographic changes may be of interest for character-based models.

References

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, GEMS '11, pages 1–10. Association for Computational Linguistics.

    Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

    Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 238–247.

    Vladimír Benko. 2014. Aranea: Yet another family of (comparable) web corpora. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech, and Dialogue: 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings, LNCS 8655, pages 257–264. Springer.

    Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 136–145. Association for Computational Linguistics.

    John A. Bullinaria and Joseph P. Levy. 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods, 44(3):890–907.

    John Caron. 2001. Experiments with LSA scoring: Optimal rank and basis. In Computational Information Retrieval, pages 157–169. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

    Qing Cui, Bin Gao, Jiang Bian, Siyu Qiu, and Tie-Yan Liu. 2014. Learning effective word embedding using morphological word similarity.

    Aleksandr Drozd, Anna Gladkova, and Satoshi Matsuoka. 2015. Python, performance, and natural language processing. In Proceedings of the 5th Workshop on Python for High-Performance and Scientific Computing, PyHPC '15, pages 1:1–1:10, New York, NY, USA. ACM.

Stefano Federici, Simonetta Montemagni, and Vito Pirrelli. 1997. Inferring semantic similarity from distributional evidence: an analogy-based approach to word sense disambiguation. In Proceedings of the ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 90–97.

    Daniel Fried and Kevin Duh. 2015. Incorporating both distributional and relational semantics in word representations.

    Amaç Herdağdelen and Marco Baroni. 2009. BagPack: A general framework to represent semantic relations. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, GEMS '09, pages 33–40. Association for Computational Linguistics.

David A. Jurgens, Peter D. Turney, Saif M. Mohammad, and Keith J. Holyoak. 2012. SemEval-2012 Task 2: Measuring degrees of relational similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 356–364. Association for Computational Linguistics.

    Douwe Kiela and Stephen Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) at EACL, pages 21–30.

    Maximilian Köper, Christian Scheible, and Sabine Schulte im Walde. 2015. Multilingual reliability and semantic structure of continuous word spaces. In Proceedings of the 11th International Conference on Computational Semantics 2015, pages 40–45. Association for Computational Linguistics.

    Siwei Lai, Kang Liu, Liheng Xu, and Jun Zhao. 2015. How to generate a good word embedding?

    Gabriella Lapesa and Stefan Evert. 2014. A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. Transactions of the Association for Computational Linguistics, 2:531–545.

Jean-François Lavallée and Philippe Langlais. 2010. Unsupervised morphological analysis by formal analogy. In Multilingual Information Access Evaluation I. Text Retrieval Experiments, pages 617–624. Springer.

    Rémi Lebret and Ronan Collobert. 2015. Rehabilitation of count-based models for word vector representations. In Computational Linguistics and Intelligent Text Processing, pages 417–429. Springer.

    Yves Lepage and Chooi-ling Goh. 2009. Towards automatic acquisition of linguistic features. In Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA 2009), eds. Kristiina Jokinen and Eckard Bick, pages 118–125.

    Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In ACL (2), pages 302–308.

Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space.

    Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013), pages 746–751. Association for Computational Linguistics.

    George Miller and Christiane Fellbaum. 1998. WordNet: An electronic lexical database. MIT Press, Cambridge.

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), volume 12, pages 1532–1543.

    Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics (LDL-2015), pages 64–69.

    Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1627–1637.

    Peter Turney, Michael Littman, Jeffrey Bigham, and Victor Shnayder. 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 482–489.

    Peter D. Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 905–912.

    Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2015. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning.

    Christian Wartena. 2014. On the effect of word frequency on distributional similarity. In Proceedings of the 12th edition of the KONVENS conference, Hildesheim, volume 1, pages 1–10.

Proceedings of NAACL-HLT 2016, pages 16–21, San Diego, California, June 12-17, 2016. ©2016 Association for Computational Linguistics

    Argument Identification in Chinese Editorials

Marisa Chow
    Princeton University
    1482 Frist Campus Ctr
    Princeton, NJ 08544, USA

    [email protected]

    Abstract

In this paper, we develop and evaluate several techniques for identifying argumentative paragraphs in Chinese editorials. We first use three methods of evaluation to score a paragraph's argumentative nature: a relative word frequency approach; a method which targets known argumentative words in our corpus; and a combined approach which uses elements from the previous two. Then, we determine the best score thresholds for separating argumentative and non-argumentative paragraphs. The results of our experimentation show that our relative word frequency approach provides a reliable way to identify argumentative paragraphs, with an F1 score of 0.91, though challenges in accurate scoring invite improvement through context-aware means.

    1 Introduction

Argumentation – the act of reasoning in support of an opinion or idea – frequently presents itself in all types of texts, from casual chat messages to online blogs. Argumentation mining aims to identify and determine a persuasive text's argumentative components, or the atomic units of its underlying structure. For example, an argumentation mining system might seek to locate and classify sections of claims and supporting evidence within an essay. More comprehensive mining might map the relations between different units, such as the support of evidence or the opposition of counterarguments to the thesis.

Argument identification offers a wide variety of practical applications. If argumentative text can be identified accurately, then the main arguments of large sets of data may be extracted. For example, argument identification could isolate arguments surrounding subjects like U.S. immigration law, or summarize the arguments in research papers. Recent efforts in argumentation mining have included applications such as automatic essay scoring (Song et al., 2014; Ong et al., 2014), online debates (Boltuzic and Šnajder, 2014), and arguments in specific domains such as online Greek social media sites (Sardianos et al., 2015). However, to our knowledge, no work in argumentation mining to date has been performed for Chinese, a large and rich domain for NLP work.

Here we focus on the first step of argumentation mining, locating argumentative units within a text. We develop and evaluate several methods of argument identification when performed upon a corpus of Chinese editorials, making the assumption that editorials are opinionated texts, although a single editorial may contain both opinionated and non-opinionated paragraphs.

Our work met with several challenges. Although newspaper editorials can be assumed to carry an opinion of some sort, the opinion is not always explicitly expressed at a word level, and methods of argumentation can vary widely from editorial to editorial. For example, one might exhibit a thesis followed by supporting evidence, but others might only state facts until the final paragraph. Furthermore, editorials commonly build arguments by citing facts. In our work, we not only had to define 'argumentative' and 'non-argumentative', but also limit the scope of an argument. In order to capture the larger argument structure, our work focuses on identifying arguments in paragraph units of no more than 200 characters (around 3-5 sentences), although we do not concatenate shorter paragraphs to ensure a minimum size.

Our task aims to label paragraphs such as the following as argumentative: "不幸的是,如今在深圳的各十字路口,绳子在显示作用,白线却无力地趴在地上。这是法规的悲哀。" ("Unfortunately, nowadays at Shenzhen's ten road intersections, cords are used to show that the white road lines lie uselessly on the ground. This legislation is tragic.")

The contributions of this paper are collecting and annotating a dataset of Chinese editorials; manually creating a list of argumentative words; and comparing and analyzing three methods.

    2 Data

    2.1 Corpora

This work makes use of two corpora, one of Chinese editorials and one of Chinese reportage, both of which are subcorpora of the Lancaster Corpus of Mandarin Chinese (McEnery and Xiao, 2004). The LCMC, a member of the Brown family of corpora, is a 1M-word balanced corpus with seventeen subcorpora on various topics. We used the Press: Reportage and Press: Editorials subcorpora, containing 53K and 88K words respectively. Samples in both subcorpora were drawn from mainland Mandarin Chinese newspaper issues published between 1989 and 1993, which increases the likelihood that both articles and editorials discuss the same topics and use a similar vocabulary.

Our unit of text was the paragraph, which typically contains a single argument or thought. We decided to use paragraphs that were no more than 200 characters in our experimentation, assuming that longer paragraphs might hold multiple arguments. We split our raw data into two subsets: paragraphs of 200 characters and below, and paragraphs larger than 200 characters. The small paragraphs were left in their original form, but we manually split the larger paragraphs into small sections under 200 characters, with individual paragraphs no smaller than 50 characters. We omitted large paragraphs which could not reasonably be split up into sentences (for example, a long one-sentence paragraph).

    2.2 Gold Standard Annotation

To evaluate our experiments, we employed workers through Amazon Mechanical Turk to tag our set of 719 editorial paragraphs. For each paragraph, the worker was asked, "Does the author of this paragraph express an argument?" In response, the worker categorized the paragraph by selecting "Makes argument," "Makes NO argument," or "Unsure". All text shown to the worker was written in both English and manually translated Mandarin. Instructions were screened by native speakers for clarity. Each paragraph was rated by three "Master Workers," distinguished as accurate AMT workers.

Though we provided clear instructions and examples for our categorization task, we found that the three workers for each task often did not all agree on an answer. Only 26% of paragraphs received an unambiguous consensus of "has argument" or "no argument" for the paragraph's argumentative nature. The remaining paragraphs received at least two different answers. Since paragraphs receiving three different answers were likely unreliable for identification, we threw out those paragraphs, leaving 622 paragraphs for our methods. Around 78% of paragraphs were rated as argumentative, and 22% as non-argumentative.
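
    The consensus rule can be summarized in a short sketch (ours; the paper gives no code, and the label names below are illustrative): a paragraph is kept only when at least two of its three workers chose the same definite label, while "unsure" majorities and three-way splits are discarded.

```python
from collections import Counter

def consensus_label(answers):
    """answers: the three AMT responses for one paragraph, e.g.
    ["argument", "argument", "unsure"] (label names are illustrative).
    Returns the majority label, or None when all three answers differ."""
    label, votes = Counter(answers).most_common(1)[0]
    return label if votes >= 2 else None

def build_gold_standard(annotations):
    """Keep only paragraphs whose majority label is a definite
    argument / no-argument judgement."""
    gold = {}
    for pid, answers in annotations.items():
        label = consensus_label(answers)
        if label in ("argument", "no_argument"):
            gold[pid] = label
    return gold

# Illustrative use:
# build_gold_standard({"p001": ["argument", "argument", "unsure"],
#                      "p002": ["argument", "no_argument", "unsure"]})
# -> {"p001": "argument"}   (p002 has no usable consensus)
```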

Paragraph consensus     Count   Percentage
    Makes an argument         484       67.32%
    Makes NO argument         138       19.19%
    Unsure                     43        5.98%
    No consensus               54        7.51%
    Total                     719

    Table 1: Breakdown of AMT paragraph results.

    3 Models

We first score paragraphs according to the methods outlined below. Then, we determine the best score threshold for each method, and accordingly label paragraphs "argumentative" or "non-argumentative."

3.1 Method 1: Identification by Comparative Word Frequency

Our first method of evaluation is based on a process outlined by Kim and Hovy in a paper on identifying opinion-bearing words (Kim and Hovy, 2005). We first follow Kim and Hovy's process to construct a list of word-score pairs. Then, we use these scores to evaluate our paragraphs of editorial text.

Kim and Hovy postulate that words which appear more often in editorials than in non-editorial text could be opinion-bearing words. For a given word, we use the Reportage and Editorials subcorpora to find its unigram probabilities in both corpora, then compute a score that indicates its frequency bias toward editorial or reportage text. Words that are relatively more frequent in editorial text are more likely argumentative.

Score(W) = EditorialProb(W) / ReportageProb(W)    (1)

Kim and Hovy further specify a way to eliminate words which do not have a repeated bias toward editorial or reportage text. We divide the Reportage and Editorial corpora each into three subsets, creating three pairs of reportage and editorial subsets. Then, for each word, we compute word scores as specified above, but for each pair of reportage and editorial subsets. This creates Score1(W), Score2(W), Score3(W), which are essentially ratios between editorial and reportage appearances of a word. We only retain words whose scores are all greater than 1.0, or all below 1.0, since this indicates repeated bias toward either editorials or reportage (opinionated or non-opinionated) text.

After scoring individual words, we rate paragraphs by assigning a score based on the scores of the individual words which comprise them. If a paragraph P contains n opinion words with corresponding frequencies f1, f2, ..., fn and assigned scores s1, s2, ..., sn, then the score for the paragraph is calculated as follows:

Score(P) = f1·s1 + f2·s2 + ... + fn·sn    (2)

From these scores and our tagged data, we determine the best score threshold by tuning on the tagged data, which produced a threshold of 40.0.
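
    A minimal sketch of Method 1 as described above (ours, not the authors' released code; the tokenisation and the small smoothing constant are assumptions the paper does not specify): Eq. (1) scores each word by its editorial-to-reportage probability ratio, the three-split check keeps only consistently biased words, and Eq. (2) sums frequency-weighted scores over a paragraph, which is then compared against the tuned threshold of 40.0.

```python
from collections import Counter

def unigram_model(tokens, eps=1e-6):
    """Return a unigram probability function over `tokens`.
    The additive constant `eps` (our assumption) avoids division by zero."""
    counts, total = Counter(tokens), len(tokens)
    return lambda w: (counts[w] + eps) / (total + eps)

def opinion_word_scores(vocab, editorial, reportage, edit_splits, rep_splits):
    """Eq. (1) on the full corpora, keeping only words whose scores on the
    three corpus splits all lie on the same side of 1.0."""
    p_edit, p_rep = unigram_model(editorial), unigram_model(reportage)
    split_models = [(unigram_model(e), unigram_model(r))
                    for e, r in zip(edit_splits, rep_splits)]
    scores = {}
    for w in vocab:
        split_scores = [pe(w) / pr(w) for pe, pr in split_models]
        if all(s > 1.0 for s in split_scores) or all(s < 1.0 for s in split_scores):
            scores[w] = p_edit(w) / p_rep(w)
    return scores

def paragraph_score(tokens, word_scores):
    """Eq. (2): sum of frequency x score over the opinion words present."""
    freqs = Counter(t for t in tokens if t in word_scores)
    return sum(f * word_scores[w] for w, f in freqs.items())

def is_argumentative(tokens, word_scores, threshold=40.0):
    return paragraph_score(tokens, word_scores) >= threshold
```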

3.2 Method 2: Targeting Known Argumentative Words

Our second method involves creating a list of known argumentative words that appear in the Editorials corpus and scoring paragraphs based on how many of these words appear in them. First, we constructed a list of the most frequent argumentative words that appear in the Editorials corpus. Then, we assigned each paragraph a score based on the presence of these words.

We manually selected the most frequent argumentative words in the Editorials corpus by sorting a list of the words and their frequencies. Words were selected for their likelihood of indicating argumentation. Generally, the most common words which indicated opinion also possessed non-argumentative meanings. For example, the common word "要" can mean "to want" as well as "must" or "if."

Word    Translation             Count     %
    我们    we                        219   2.55
    要      must                      210   2.45
    问题    problem                   192   2.24
    就      right away, at once       158   1.84
    而      and so, yet, but          131   1.53
    都      all, even (emphasis)      116   1.35
    更      even more, further         87   1.01
    但      but                        86   1.00
    还      still                      84   0.98
    好      good (adj)                 76   0.89
    ---------------------------------------------
    人们    people                     64   0.75
    自己    self                       61   0.71
    却      however                    57   0.66
    人民    the people                 53   0.62
    必须    must                       49   0.57
    ---------------------------------------------
    认为    believe                    49   0.57
    为了    in order to                48   0.56
    我      I                          47   0.55
    重要    important                  46   0.54
    因此    consequently               46   0.54

    Table 2: Constructed list of known argumentative words by frequency. Horizontal lines mark the boundaries between the 10-, 15-, and 20-word lists.

Scoring paragraphs based on this list was simple: we awarded a paragraph a point for each instance of any of the words on the list. We were interested in whether the presence of a few argumentative words could indicate argumentation in the entire paragraph. We determined the best word list size and the best threshold that returned the most accurate labels: a word list size of 15 words and a threshold of 1. For this model, a threshold of 1 means that if the paragraph contains at least one word from the word list, it is labeled as argumentative.

    3.3 Method 3: Combined Method

Our third method of identifying arguments combines the previous two methods. Similar to the second method, we scored paragraphs based on a list of argumentative words. However, instead of manually selecting argumentative words from a word frequency distribution, we created a list of opinionated words by picking a number of the highest-scoring words from the first method.

In structuring this combination method, we theorized that the highest-scoring words are those which are the most common opinionated words, since they have the highest probability of consistently appearing throughout the Editorials corpus and not the Reportage corpus. By using these words instead of manually picked argumentative words, we scored paragraphs using a list of words based on the composition of the corpus itself, with the intent of creating a more relevant list of words.

Scoring remained similar to the second method, where we awarded a paragraph a point for each instance of any of the words on the list. Again, the threshold which produced the best results was 1. That is, if a paragraph contained at least one word from the list, it was labeled as argumentative.
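
    Methods 2 and 3 share the same scoring rule and differ only in where the word list comes from; the sketch below (ours, with illustrative names) shows that rule, with the Method 3 list drawn from the top-scoring words of Method 1.

```python
def count_list_words(tokens, word_list):
    """One point per token that appears in the word list (Methods 2 and 3)."""
    word_set = set(word_list)
    return sum(1 for t in tokens if t in word_set)

def label_paragraph(tokens, word_list, threshold=1):
    """With the best-performing threshold of 1, a single list word is
    enough to mark the paragraph as argumentative."""
    if count_list_words(tokens, word_list) >= threshold:
        return "argumentative"
    return "non-argumentative"

# Method 2 uses the manually selected list from Table 2; Method 3 instead
# takes the top-N words by the Method 1 score, e.g. (word_scores as above):
# method3_list = sorted(word_scores, key=word_scores.get, reverse=True)[:20]
```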

    4 Results

4.1 Method 1: Identification by Comparative Word Frequency

Method   Accuracy   Precision   Recall   F1 score
    1           0.841       0.847    0.971      0.905
    2           0.801       0.826    0.942      0.880
    3           0.371       0.850    0.233      0.366

    1 = Relative Word Frequency Method (T=40)
    2 = Targeting Argument Words (T=1, W=15)
    3 = Combined Method (T=1, W=20)
    T = threshold, W = word list size

    Table 3: A comparison of the best metric scores of all three methods.

Our experiments produced the best performance under the relative word frequency method, achieving 84% accuracy and an F1 score of 0.91. These scores were closely followed by the second method, with 80% accuracy and an F1 score of 0.88.

Despite these high scores, we were surprised to find that our relative word frequency system had scored many non-argumentative words very high. For example, the highest-scoring word was 自民党, "Liberal Democratic Party." When we eliminated words with non-argumentative POS tags (i.e. nouns and other noun forms), the highest-scoring word was 监测, "to monitor" (Table 4). These words were likely rare enough in the Reportage corpus that they were awarded high scores. Words indicative of arguments, such as 必要, "necessary," scored high, but largely did not make up the highest-scoring words.

Word          Translation             Score
    监测          to monitor             57.917
    谈判          to negotiate           51.656
    污染          to pollute             34.437
    发展中国家    developing country     32.872
    停火          to cease fire          29.741
    整治          to renovate, restore   28.176
    腐败          to corrupt             26.610
    北方          north                  25.129
    断            to break               25.129
    匿            to hide                25.062

    Table 4: Highest-scoring words from the Kim and Hovy scoring in the statistical method, with words bearing non-argumentative POS tags removed.

As a result, our list of opinion-bearing words contained non-argumentative words along with argumentative identifiers, artificially raising paragraph scores. Paragraphs were more likely to score high, since our system labeled many non-argumentative paragraphs as argumentative. The inflated word scores are likely a result of mismatched editorial and reportage corpora, since a word that is relatively rare in the Reportage corpus and more common in the Editorials corpus will score high, regardless of its actual meaning. However, this approach still performed well, suggesting that these non-argumentative words, such as "to monitor," may be used to persuade in context (e.g. "The government monitors its people too closely").

4.2 Method 2: Targeting Known Argumentative Words

Our second method similarly performed well, with high accuracy and fewer false positives than the previous method, due to the list of words that clearly indicated argumentation. The best performance was given by a threshold of 1. That is, the system performed best when it marked a paragraph argumentative as long as it had at least one of the words from the list. Results did not significantly improve when the list was expanded or the score threshold was raised, implying that adding words to the 10-word list did not help, even if the new words had exclusively argumentative meanings. The more frequent semi-argumentative words like "而" ("and so," "yet") had a greater positive effect on accuracy than obviously argumentative words like "必须" ("must"), which do not appear as often in the corpus.

    4.3 Method 3: Combined Method

Since our combined method relied heavily upon the word scores generated by the relative word frequency approach, the results showed significant errors. Seeded with a word list that did not contain solely argumentative words (e.g. "to monitor" as well as "to pollute"), the combined method attempted to find argumentative paragraphs using words which did not exclusively indicate argumentation. Overall, the combined method rated many more argumentative paragraphs as non-argumentative than the reverse, and performed poorly overall with an F1 score of 0.37.

    5 Related Work

Prior work on argument identification has been largely domain-specific. Among them, Sardianos et al. (2015) produced work on argument extraction from news in Greek, and Boltužić and Šnajder (2015) worked on recognizing arguments in online discussions. Kiesel et al. have worked on a shared task for argument mining in newspaper editorials (Kiesel et al., 2015). They contributed a data set of tagged newspaper editorials and a method for modeling an editorial's argumentative structure.

Because argument mining is a relatively new field within the NLP community, there has been no argument identification study performed on Chinese editorials, although there has been a significant amount of work on opinion identification. In particular, Bin Lu's work on opinion detection in Chinese news text (Lu, 2010) has produced a highest F-measure of 78.4 for opinion holders and 59.0 for opinion targets.

    6 Conclusion and Future Work

In this study, we sought to computationally identify argumentative paragraphs in Chinese editorials through three methods: using relative word frequencies to score paragraphs; targeting known argumentative words in paragraphs; and combining the two methods. Our experiments produced the best performance under the relative word frequency method, achieving 84% accuracy and an F1 score of 0.91.

Despite these high scores, we found our relative word frequency system scored many non-argumentative words very high. These words were likely