
RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information

Shikhar Vashishth1  Rishabh Joshi2∗  Sai Suman Prayaga1
Chiranjib Bhattacharyya1  Partha Talukdar1
1 Indian Institute of Science
2 Birla Institute of Technology and Science, Pilani
{shikhar,chiru,ppt}@iisc.ac.in

[email protected], [email protected]

Abstract

Distantly-supervised Relation Extraction (RE) methods train an extractor by automatically aligning relation instances in a Knowledge Base (KB) with unstructured text. In addition to relation instances, KBs often contain other relevant side information, such as aliases of relations (e.g., founded and co-founded are aliases for the relation founderOfCompany). RE models usually ignore such readily available side information. In this paper, we propose RESIDE, a distantly-supervised neural relation extraction method which utilizes additional side information from KBs for improved relation extraction. It uses entity type and relation alias information for imposing soft constraints while predicting relations. RESIDE employs Graph Convolution Networks (GCN) to encode syntactic information from text and improves performance even when limited side information is available. Through extensive experiments on benchmark datasets, we demonstrate RESIDE's effectiveness. We have made RESIDE's source code available to encourage reproducible research.

1 Introduction

The construction of large-scale Knowledge Bases (KBs) like Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014) has proven to be useful in many natural language processing (NLP) tasks like question-answering, web search, etc. However, these KBs are not exhaustive. Relation Extraction (RE) attempts to fill this gap by extracting semantic relationships between entity pairs from plain text. This task can be modeled as a simple classification problem after the entity pairs are specified. Formally, given an entity pair (e1, e2) from the KB and an entity-annotated sentence (or instance), we aim to predict the

∗Research internship at Indian Institute of Science.

relation r, from a predefined relation set, that exists between e1 and e2. If no relation exists, we simply label it NA.

Most supervised relation extraction methods require large labeled training data which is expensive to construct. Distant Supervision (DS) (Mintz et al., 2009) helps with the construction of this dataset automatically, under the assumption that if two entities have a relationship in a KB, then all sentences mentioning those entities express the same relation. While this approach works well in generating large amounts of training instances, the DS assumption does not hold in all cases. Riedel et al. (2010); Hoffmann et al. (2011); Surdeanu et al. (2012) propose multi-instance based learning to relax this assumption. However, they use NLP tools to extract features, which can be noisy.
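For concreteness, the following toy sketch (the KB facts and sentences are made up, not taken from any dataset) illustrates how distant supervision groups sentences into bags per entity pair and labels each bag with the KB relation; the second sentence below is exactly the kind of instance that gets mislabeled under the DS assumption.

```python
from collections import defaultdict

# Toy KB facts: (subject, object) -> relation
kb = {("Bill Gates", "Microsoft"): "founderOfCompany"}

# Entity-annotated sentences: (subject, object, text)
sentences = [
    ("Bill Gates", "Microsoft", "Microsoft was started by Bill Gates."),
    ("Bill Gates", "Microsoft", "Bill Gates gave a talk at Microsoft."),  # noisy under DS
]

# Group sentences into bags per entity pair and label each bag with the KB relation (or NA)
bags = defaultdict(list)
for e1, e2, sent in sentences:
    bags[(e1, e2)].append(sent)
labels = {pair: kb.get(pair, "NA") for pair in bags}
print(labels)  # {('Bill Gates', 'Microsoft'): 'founderOfCompany'}
```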

Recently, neural models have demonstrated promising performance on RE. Zeng et al. (2014, 2015) employ Convolutional Neural Networks (CNN) to learn representations of instances. For alleviating noise in distant supervised datasets, attention has been utilized by (Lin et al., 2016; Jat et al., 2018). Syntactic information from dependency parses has been used by (Mintz et al., 2009; He et al., 2018) for capturing long-range dependencies between tokens. Recently proposed Graph Convolution Networks (GCN) (Defferrard et al., 2016) have been effectively employed for encoding this information (Marcheggiani and Titov, 2017; Bastings et al., 2017). However, all the above models rely only on the noisy instances from distant supervision for RE.

Relevant side information can be effective for improving RE. For instance, in the sentence, Microsoft was started by Bill Gates., the type information of Bill Gates (person) and Microsoft (organization) can be helpful in predicting the correct relation founderOfCompany.


Figure 1: Overview of RESIDE. RESIDE first encodes each sentence in the bag by concatenating embeddings (denoted by ⊕) from Bi-GRU and Syntactic GCN for each token, followed by word attention. Then, sentence embedding is concatenated with relation alias information, which comes from the Side Information Acquisition Section (Figure 2), before computing attention over sentences. Finally, bag representation with entity type information is fed to a softmax classifier. Please see Section 5 for more details.

This is because every relation constrains the type of its target entities. Similarly, the relation phrase “was started by” extracted using Open Information Extraction (Open IE) methods can be useful, given that the aliases of the relation founderOfCompany, e.g., founded, co-founded, etc., are available. KBs used for DS readily provide such information, which has not been completely exploited by current models.

In this paper, we propose RESIDE, a novel distant supervised relation extraction method which utilizes additional supervision from KB through its neural network based architecture. RESIDE makes principled use of entity type and relation alias information from KBs, to impose soft constraints while predicting the relation. It uses encoded syntactic information obtained from Graph Convolution Networks (GCN), along with embedded side information, to improve neural relation extraction. Our contributions can be summarized as follows:

• We propose RESIDE, a novel neural method which utilizes additional supervision from KB in a principled manner for improving distant supervised RE.

• RESIDE uses Graph Convolution Networks (GCN) for modeling syntactic information and has been shown to perform competitively even with limited side information.

• Through extensive experiments on benchmark datasets, we demonstrate RESIDE's effectiveness over state-of-the-art baselines.

RESIDE's source code and datasets used in the paper are available at http://github.com/malllabiisc/RESIDE.

2 Related Work

Distant supervision: Relation extraction is the task of identifying the relationship between two entity mentions in a sentence. In the supervised paradigm, the task is considered as a multi-class classification problem but suffers from lack of large labeled training data. To address this limitation, (Mintz et al., 2009) propose the distant supervision (DS) assumption for creating large datasets, by heuristically aligning text to a given Knowledge Base (KB). As this assumption does not always hold true, some of the sentences might be wrongly labeled. To alleviate this shortcoming, Riedel et al. (2010) relax distant supervision for multi-instance single-label learning. Subsequently, for handling overlapping relations between entities, (Hoffmann et al., 2011; Surdeanu et al., 2012) propose the multi-instance multi-label learning paradigm.

Neural Relation Extraction: The performance of the above methods strongly relies on the quality of hand-engineered features. Zeng et al. (2014) propose an end-to-end CNN based method which can automatically capture relevant lexical and sentence level features. This method is further improved through piecewise max-pooling by (Zeng et al., 2015). Lin et al. (2016); Nagarajan et al. (2017) use attention (Bahdanau et al., 2014) for learning from multiple valid sentences. We also make use of attention for learning sentence and bag representations.

Dependency tree based features have been found to be relevant for relation extraction (Mintz et al., 2009). He et al. (2018) use them for getting promising results through a recursive tree-GRU based model. In RESIDE, we make use of recently proposed Graph Convolution Networks (Defferrard et al., 2016; Kipf and Welling, 2017), which have been found to be quite effective for modelling syntactic information (Marcheggiani and Titov, 2017; Nguyen and Grishman, 2018; Vashishth et al., 2018a).

Side Information in RE: Entity descriptions from KB have been utilized for RE (Ji et al., 2017), but such information is not available for all entities. Type information of entities has been used by Ling and Weld (2012); Liu et al. (2014) as features in their models. Yaghoobzadeh et al. (2017) also attempt to mitigate noise in DS through their joint entity typing and relation extraction model. However, KBs like Freebase readily provide reliable type information which could be directly utilized. In our work, we make principled use of entity type and relation alias information obtained from KB. We also use unsupervised Open Information Extraction (Open IE) methods (Mausam et al., 2012; Angeli et al., 2015), which automatically discover possible relations without the need of any predefined ontology; the extracted relation phrases are used as side information as described in Section 5.2.

3 Background: Graph Convolution Networks (GCN)

In this section, we provide a brief overview of Graph Convolution Networks (GCN) for graphs with directed and labeled edges, as used in (Marcheggiani and Titov, 2017).

3.1 GCN on Labeled Directed Graph

For a directed graph, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ represent the set of vertices and edges respectively, an edge from node $u$ to node $v$ with label $l_{uv}$ is represented as $(u, v, l_{uv})$. Since information in a directed edge does not necessarily propagate along its direction, following (Marcheggiani and Titov, 2017) we define an updated edge set $\mathcal{E}'$ which includes inverse edges $(v, u, l_{uv}^{-1})$ and self-loops $(u, u, \top)$ along with the original edge set $\mathcal{E}$, where $\top$ is a special symbol to denote self-loops. For each node $v$ in $\mathcal{G}$, we have an initial representation $x_v \in \mathbb{R}^d$, $\forall v \in \mathcal{V}$. On employing GCN, we get an updated $d$-dimensional hidden representation $h_v \in \mathbb{R}^d$, $\forall v \in \mathcal{V}$, by considering only its immediate neighbors (Kipf and Welling, 2017). This can be formulated as:

$$h_v = f\left(\sum_{u \in \mathcal{N}(v)} \left(W_{l_{uv}} x_u + b_{l_{uv}}\right)\right).$$

Here, $W_{l_{uv}} \in \mathbb{R}^{d \times d}$ and $b_{l_{uv}} \in \mathbb{R}^{d}$ are label-dependent model parameters which are trained based on the downstream task. $\mathcal{N}(v)$ refers to the set of neighbors of $v$ based on $\mathcal{E}'$ and $f$ is any non-linear activation function. In order to capture the multi-hop neighborhood, multiple GCN layers can be stacked. The hidden representation of node $v$ in this case after the $k$-th GCN layer is given as:

$$h_v^{k+1} = f\left(\sum_{u \in \mathcal{N}(v)} \left(W^k_{l_{uv}} h^k_u + b^k_{l_{uv}}\right)\right).$$

3.2 Integrating Edge Importance

In automatically constructed graphs, some edges might be erroneous and hence need to be discarded. Edgewise gating in GCN by (Bastings et al., 2017; Marcheggiani and Titov, 2017) allows us to alleviate this problem by subduing the noisy edges. This is achieved by assigning a relevance score to each edge in the graph. At the $k$-th layer, the importance of an edge $(u, v, l_{uv})$ is computed as:

$$g^k_{uv} = \sigma\left(h^k_u \cdot w^k_{l_{uv}} + b^k_{l_{uv}}\right), \qquad (1)$$

Here, $w^k_{l_{uv}} \in \mathbb{R}^{m}$ and $b^k_{l_{uv}} \in \mathbb{R}$ are parameters which are trained and $\sigma(\cdot)$ is the sigmoid function. With edgewise gating, the final GCN embedding for a node $v$ after the $k$-th layer is given as:

$$h_v^{k+1} = f\left(\sum_{u \in \mathcal{N}(v)} g^k_{uv} \times \left(W^k_{l_{uv}} h^k_u + b^k_{l_{uv}}\right)\right). \qquad (2)$$
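As an illustration of Equations 1 and 2, the following is a minimal PyTorch sketch of one gated GCN layer over a labeled directed graph; it is not the authors' released implementation, and the class name, dimensions, and toy edge list are our own assumptions.

```python
import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    """One GCN layer over a labeled directed graph with edgewise gating (Eq. 1-2)."""
    def __init__(self, dim, num_labels):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_labels, dim, dim) * 0.1)  # W_l
        self.b = nn.Parameter(torch.zeros(num_labels, dim))             # b_l
        self.w_gate = nn.Parameter(torch.randn(num_labels, dim) * 0.1)  # w_l for the gate
        self.b_gate = nn.Parameter(torch.zeros(num_labels))             # scalar gate bias

    def forward(self, h, edges):
        # h: (num_nodes, dim); edges: (u, v, label) triples incl. inverse edges and self-loops
        out = torch.zeros_like(h)
        for u, v, l in edges:
            gate = torch.sigmoid(h[u] @ self.w_gate[l] + self.b_gate[l])   # Eq. 1
            out[v] = out[v] + gate * (self.W[l] @ h[u] + self.b[l])        # Eq. 2, pre-activation
        return torch.relu(out)                                             # f = ReLU

# Toy usage: 3 nodes, labels 0 = forward, 1 = backward, 2 = self-loop
h = torch.randn(3, 4)
edges = [(0, 1, 0), (1, 0, 1), (0, 0, 2), (1, 1, 2), (2, 2, 2)]
print(GatedGCNLayer(dim=4, num_labels=3)(h, edges).shape)  # torch.Size([3, 4])
```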


4 RESIDE Overview

In the multi-instance learning paradigm, we are given a bag of sentences (or instances) {s1, s2, ..., sn} for a given entity pair, and the task is to predict the relation between them. RESIDE consists of three components for learning a representation of a given bag, which is fed to a softmax classifier. We briefly present the components of RESIDE below. Each component will be described in detail in the subsequent sections. The overall architecture of RESIDE is shown in Figure 1.

1. Syntactic Sentence Encoding: RESIDE uses a Bi-GRU over the concatenated positional and word embedding for encoding the local context of each token. For capturing long-range dependencies, GCN over the dependency tree is employed and its encoding is appended to the representation of each token. Finally, attention over tokens is used to subdue irrelevant tokens and get an embedding for the entire sentence. More details in Section 5.1.

2. Side Information Acquisition: In this module, we use additional supervision from KBs and utilize Open IE methods for getting relevant side information. This information is later utilized by the model as described in Section 5.2.

3. Instance Set Aggregation: In this part, the sentence representation from the syntactic sentence encoder is concatenated with the matched relation embedding obtained from the previous step. Then, using attention over sentences, a representation for the entire bag is learned. This is then concatenated with the entity type embedding before feeding into the softmax classifier for relation prediction. Please refer to Section 5.3 for more details.

5 RESIDE Details

In this section, we provide a detailed description of the components of RESIDE.

5.1 Syntactic Sentence Encoding

For each sentence in the bag $s_i$ with $m$ tokens $\{w_1, w_2, \ldots, w_m\}$, we first represent each token by a $k$-dimensional GloVe embedding (Pennington et al., 2014). For incorporating the relative position of tokens with respect to target entities, we use $p$-dimensional position embeddings, as used by (Zeng et al., 2014). The combined token embeddings are stacked together to get the sentence representation $\mathcal{H} \in \mathbb{R}^{m \times (k+2p)}$. Then, using a Bi-GRU (Cho et al., 2014) over $\mathcal{H}$, we get the new sentence representation $\mathcal{H}^{gru} \in \mathbb{R}^{m \times d_{gru}}$, where $d_{gru}$ is the hidden state dimension. Bi-GRUs have been found to be quite effective in encoding the context of tokens in several tasks (Sutskever et al., 2014; Graves et al., 2013).
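A minimal sketch of this encoding step is shown below; the dimensions, position-id scheme, and tensors are toy stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

m, k, p, d_gru = 6, 50, 10, 32          # tokens, GloVe dim, position dim, Bi-GRU output dim
word_emb = torch.randn(m, k)            # stand-in for pretrained GloVe vectors of the sentence
pos1_emb = nn.Embedding(100, p)         # position embedding w.r.t. the subject entity
pos2_emb = nn.Embedding(100, p)         # position embedding w.r.t. the object entity
pos1, pos2 = torch.arange(m), torch.arange(m) + 3   # toy relative-position ids

# H in R^{m x (k + 2p)}: word embedding concatenated with both position embeddings
H = torch.cat([word_emb, pos1_emb(pos1), pos2_emb(pos2)], dim=-1)

bi_gru = nn.GRU(input_size=k + 2 * p, hidden_size=d_gru // 2,
                bidirectional=True, batch_first=True)
H_gru, _ = bi_gru(H.unsqueeze(0))       # (1, m, d_gru): local context for each token
```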

Although the Bi-GRU is capable of capturing local context, it fails to capture long-range dependencies, which can be captured through dependency edges. Prior works (Mintz et al., 2009; He et al., 2018) have exploited features from syntactic dependency trees for improving relation extraction. Motivated by their work, we employ Syntactic Graph Convolution Networks for encoding this information. For a given sentence, we generate its dependency tree using Stanford CoreNLP (Manning et al., 2014). We then run GCN over the dependency graph and use Equation 2 for updating the embeddings, taking $\mathcal{H}^{gru}$ as the input. Since the dependency graph has 55 different edge labels, incorporating all of them over-parameterizes the model significantly. Therefore, following (Marcheggiani and Titov, 2017; Nguyen and Grishman, 2018; Vashishth et al., 2018a) we use only three edge labels based on the direction of the edge {forward (→), backward (←), self-loop (⊤)}. We define the new edge label $L_{uv}$ for an edge $(u, v, l_{uv})$ as follows:

$$L_{uv} = \begin{cases} \rightarrow & \text{if the edge exists in the dependency parse} \\ \leftarrow & \text{if the edge is an inverse edge} \\ \top & \text{if the edge is a self-loop} \end{cases}$$
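For illustration, building this simplified edge set from a dependency parse can be sketched as follows (the edge-list format and toy indices are assumptions, not a fixed interface):

```python
# Hypothetical input: dependency edges as (head_index, dependent_index) pairs for a 4-token sentence
dep_edges = [(1, 0), (1, 3), (3, 2)]
num_tokens = 4

FORWARD, BACKWARD, SELF_LOOP = 0, 1, 2   # the three direction-based edge labels

edges = []
for u, v in dep_edges:
    edges.append((u, v, FORWARD))    # edge as it appears in the dependency parse
    edges.append((v, u, BACKWARD))   # added inverse edge
for i in range(num_tokens):
    edges.append((i, i, SELF_LOOP))  # self-loop for every token
```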

For each token $w_i$, the GCN embedding $h^{gcn}_{i_{k+1}} \in \mathbb{R}^{d_{gcn}}$ after the $k$-th layer is defined as:

$$h^{gcn}_{i_{k+1}} = f\left(\sum_{u \in \mathcal{N}(i)} g^k_{iu} \times \left(W^k_{L_{iu}} h^{gcn}_{u_k} + b^k_{L_{iu}}\right)\right).$$

Here, $g^k_{iu}$ denotes edgewise gating as defined in Equation 1 and $L_{iu}$ refers to the edge label defined above. We use ReLU as the activation function $f$ throughout our experiments. The syntactic graph encoding from GCN is appended to the Bi-GRU output to get the final token representation, $h^{concat}_i$, as $[h^{gru}_i ; h^{gcn}_{i_{k+1}}]$.


Figure 2: Relation alias side information extraction for a given sentence. First, the Syntactic Context Extractor identifies relevant relation phrases P between target entities. They are then matched in the embedding space with the extended set of relation aliases R from the KB. Finally, the relation embedding corresponding to the closest alias is taken as relation alias information. Please refer to Section 5.2.

Since not all tokens are equally relevant for the RE task, we calculate the degree of relevance of each token using attention, as used in (Jat et al., 2018). For token $w_i$ in the sentence, the attention weight $\alpha_i$ is calculated as:

$$\alpha_i = \frac{\exp(u_i)}{\sum_{j=1}^{m} \exp(u_j)} \quad \text{where } u_i = h^{concat}_i \cdot r,$$

where $r$ is a random query vector and $u_i$ is the relevance score assigned to each token. Attention values $\{\alpha_i\}$ are calculated by taking softmax over $\{u_i\}$. The representation of a sentence is given as a weighted sum of its tokens, $s = \sum_{i=1}^{m} \alpha_i h^{concat}_i$.
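A short sketch of this token-level attention pooling follows (toy dimensions and tensors; the query vector would be a learned parameter in practice):

```python
import torch

m, d = 6, 64                                # number of tokens, size of h_concat = [h_gru ; h_gcn]
h_concat = torch.randn(m, d)                # per-token representations
r = torch.randn(d)                          # random query vector

u = h_concat @ r                            # relevance scores u_i = h_concat_i . r
alpha = torch.softmax(u, dim=0)             # attention weights over tokens
s = (alpha.unsqueeze(1) * h_concat).sum(0)  # sentence embedding: weighted sum of tokens
```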

5.2 Side Information Acquisition

Relevant side information has been found to improve performance on several tasks (Ling and Weld, 2012; Vashishth et al., 2018b). In distant supervision based relation extraction, since the entities are from a KB, knowledge about them can be utilized to improve relation extraction. Moreover, several unsupervised relation extraction methods (Open IE) (Angeli et al., 2015; Mausam et al., 2012) allow extracting relation phrases between target entities without any predefined ontology and thus can be used to obtain relevant side information. In RESIDE, we employ Open IE methods and additional supervision from KB for improving neural relation extraction.

Relation Alias Side Information

RESIDE uses Stanford Open IE (Angeli et al., 2015) for extracting relation phrases between target entities, which we denote by $\mathcal{P}$. As shown in Figure 2, for the sentence Matt Coffin, executive of lowermybills, a company.., Open IE methods extract “executive of” between Matt Coffin and lowermybills. Further, we extend $\mathcal{P}$ by including tokens at one hop distance in the dependency path from target entities. Such features from the dependency parse have been exploited in the past by (Mintz et al., 2009; He et al., 2018). The degree of match between the extracted phrases in $\mathcal{P}$ and the aliases of a relation can give important clues about the relevance of that relation for the sentence. Several KBs like Wikidata provide such relation aliases, which can be readily exploited. In RESIDE, we further expand the relation alias set using the Paraphrase Database (PPDB) (Pavlick et al., 2015). We note that even for cases when aliases for relations are not available, providing only the names of relations gives competitive performance. We shall explore this point further in Section 7.3.

For matching $\mathcal{P}$ with the PPDB-expanded relation alias set $\mathcal{R}$, we project both in a $d$-dimensional space using GloVe embeddings (Pennington et al., 2014). Projecting phrases using word embeddings helps to further expand these sets, as semantically similar words are closer in embedding space (Mikolov et al., 2013; Pennington et al., 2014). Then, for each phrase $p \in \mathcal{P}$, we calculate its cosine distance from all relation aliases in $\mathcal{R}$ and take the relation corresponding to the closest relation alias as a matched relation for the sentence. We use a threshold on cosine distance to remove noisy aliases. In RESIDE, we define a $k_r$-dimensional embedding for each relation which we call the matched relation embedding ($h^{rel}$). For a given sentence, $h^{rel}$ is concatenated with its representation $s$, obtained from the syntactic sentence encoder (Section 5.1), as shown in Figure 1. For sentences with $|\mathcal{P}| > 1$, we might get multiple matched relations. In such cases, we take the average of their embeddings. We hypothesize that this helps in improving the performance and find it to be true as shown in Section 7.
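A rough sketch of this matching step is given below; the stand-in GloVe vectors, alias lists, and threshold value are placeholders, not the settings used in the paper.

```python
import numpy as np

def phrase_vec(phrase, glove):
    # average GloVe vectors of the phrase tokens (zero vector if all are out of vocabulary)
    vecs = [glove[w] for w in phrase.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(next(iter(glove.values())).shape)

def match_relation(phrases, rel_aliases, glove, threshold=0.4):
    """Return relations whose closest alias lies within a cosine-distance threshold."""
    matched = set()
    for p in phrases:                                  # P: phrases from Open IE / dependency context
        pv = phrase_vec(p, glove)
        best_rel, best_dist = None, 1.0
        for rel, aliases in rel_aliases.items():       # R: expanded alias set per relation
            for a in aliases:
                av = phrase_vec(a, glove)
                denom = np.linalg.norm(pv) * np.linalg.norm(av) + 1e-8
                dist = 1.0 - float(pv @ av) / denom    # cosine distance
                if dist < best_dist:
                    best_rel, best_dist = rel, dist
        if best_rel is not None and best_dist < threshold:
            matched.add(best_rel)
    return matched

# Toy usage with 3-dimensional stand-in "GloVe" vectors
glove = {"executive": np.array([1., 0., 0.]), "head": np.array([0.9, 0.1, 0.]),
         "of": np.array([0., 1., 0.]), "capital": np.array([0., 0., 1.])}
aliases = {"chief_of": ["head of"], "capital_of": ["capital of"]}
print(match_relation(["executive of"], aliases, glove))  # {'chief_of'}
```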

Entity Type Side Information

Type information of target entities has been shown to give promising results on relation extraction (Ling and Weld, 2012; Yaghoobzadeh et al., 2017). Every relation puts some constraint on the type of entities which can be its subject and object. For example, the relation person/place of birth can only occur between a person and a location. Sentences in distant supervision are based on entities in KBs, where the type information is readily available.

In RESIDE, we use types defined by FIGER (Ling and Weld, 2012) for entities in Freebase. For each type, we define a $k_t$-dimensional embedding which we call the entity type embedding ($h^{type}$). For cases when an entity has multiple types in different contexts, for instance, Paris may have the types government and location, we take the average over the embeddings of each type. We concatenate the entity type embeddings of the target entities to the final bag representation before using it for relation classification. To avoid over-parameterization, instead of using all 112 fine-grained entity types, we use 38 coarse types which form the first hierarchy of FIGER types.
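As a tiny illustration (the embedding size and type ids below are arbitrary placeholders), averaging the coarse type embeddings for an entity with multiple types could look like:

```python
import torch
import torch.nn as nn

k_t, num_coarse_types = 16, 38
type_emb = nn.Embedding(num_coarse_types, k_t)    # one embedding per coarse FIGER type

# e.g., Paris -> {government, location}; the ids below are arbitrary placeholders
entity_type_ids = torch.tensor([4, 11])
h_type = type_emb(entity_type_ids).mean(dim=0)    # k_t-dimensional entity type embedding
```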

5.3 Instance Set Aggregation

For utilizing all valid sentences, following (Lin et al., 2016; Jat et al., 2018), we use attention over sentences to obtain a representation for the entire bag. Instead of directly using the sentence representation $s_i$ from Section 5.1, we concatenate the embedding of each sentence with the matched relation embedding $h^{rel}_i$ as obtained from Section 5.2. The attention score $\alpha_i$ for the $i$-th sentence is formulated as:

$$\alpha_i = \frac{\exp(\hat{s}_i \cdot q)}{\sum_{j=1}^{n} \exp(\hat{s}_j \cdot q)} \quad \text{where } \hat{s}_i = [s_i ; h^{rel}_i],$$

here $q$ denotes a random query vector. The bag representation $B$, which is the weighted sum of its sentences, is then concatenated with the entity type embeddings of the subject ($h^{type}_{sub}$) and object ($h^{type}_{obj}$) from Section 5.2 to obtain $\hat{B}$:

$$\hat{B} = [B ; h^{type}_{sub} ; h^{type}_{obj}] \quad \text{where } B = \sum_{i=1}^{n} \alpha_i \hat{s}_i.$$

Finally, $\hat{B}$ is fed to a softmax classifier to get the probability distribution over the relations:

$$p(y) = \mathrm{Softmax}(W \cdot \hat{B} + b).$$
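A compact sketch of this aggregation and classification step described in Section 5.3 (all dimensions, the number of relations, and the tensors below are toy stand-ins):

```python
import torch
import torch.nn as nn

n, d_sent, k_rel, k_type, num_rel = 4, 64, 16, 16, 53
s = torch.randn(n, d_sent)                  # sentence embeddings from the encoder
h_rel = torch.randn(n, k_rel)               # matched relation embedding per sentence
h_type_sub = torch.randn(k_type)            # entity type embedding of the subject
h_type_obj = torch.randn(k_type)            # entity type embedding of the object
q = torch.randn(d_sent + k_rel)             # random query vector for sentence attention

s_hat = torch.cat([s, h_rel], dim=-1)       # \hat{s}_i = [s_i ; h_rel_i]
alpha = torch.softmax(s_hat @ q, dim=0)     # attention over sentences in the bag
B = (alpha.unsqueeze(1) * s_hat).sum(0)     # bag representation
B_hat = torch.cat([B, h_type_sub, h_type_obj])              # \hat{B}

classifier = nn.Linear(d_sent + k_rel + 2 * k_type, num_rel)
p_y = torch.softmax(classifier(B_hat), dim=-1)              # p(y) over relations
```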

6 Experimental Setup

6.1 Datasets

In our experiments, we evaluate the models on the Riedel and Google Distant Supervision (GIDS) datasets. Statistics of the datasets are summarized in Table 1. Below we describe each in detail1.

Datasets                    Split    # Sentences    # Entity-pairs
Riedel (# Relations: 53)    Train    455,771        233,064
                            Valid    114,317        58,635
                            Test     172,448        96,678
GIDS (# Relations: 5)       Train    11,297         6,498
                            Valid    1,864          1,082
                            Test     5,663          3,247

Table 1: Details of datasets used. Please see Section 6.1 for more details.

1. Riedel: The dataset was developed by (Riedel et al., 2010) by aligning Freebase relations with the New York Times (NYT) corpus, where sentences from the years 2005-2006 are used for creating the training set and from the year 2007 for the test set. The entity mentions are annotated using Stanford NER (Finkel et al., 2005) and are linked to Freebase. The dataset has been widely used for RE by (Hoffmann et al., 2011; Surdeanu et al., 2012) and more recently by (Lin et al., 2016; Feng et al.; He et al., 2018).

2. GIDS: Jat et al. (2018) created the Google Distant Supervision (GIDS) dataset by extending the Google relation extraction corpus2 with additional instances for each entity pair. The dataset assures that the at-least-one assumption of multi-instance learning holds. This makes automatic evaluation more reliable and thus removes the need for manual verification.

1 Data splits and hyperparameters are in the supplementary material.
2 https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html


Figure 3: Comparison of Precision-recall curves on (a) the Riedel dataset and (b) the GIDS dataset. RESIDE achieves higher precision over the entire range of recall than all the baselines on both datasets. Please refer to Section 7.1 for more details.

6.2 Baselines

For evaluating RESIDE, we compare against the following baselines:

• Mintz: Multi-class logistic regression model proposed by (Mintz et al., 2009) for the distant supervision paradigm.

• MultiR: Probabilistic graphical model for multi-instance learning by (Hoffmann et al., 2011).

• MIMLRE: A graphical model which jointly models multiple instances and multiple labels. More details in (Surdeanu et al., 2012).

• PCNN: A CNN based relation extraction model by (Zeng et al., 2015) which uses piecewise max-pooling for sentence representation.

• PCNN+ATT: A piecewise max-pooling over CNN based model which is used by (Lin et al., 2016) to get sentence representations, followed by attention over sentences.

• BGWA: Bi-GRU based relation extraction model with word and sentence level attention (Jat et al., 2018).

• RESIDE: The method proposed in this paper; please refer to Section 5 for more details.

6.3 Evaluation Criteria

Following prior works (Lin et al., 2016; Feng et al.), we evaluate the models using the held-out evaluation scheme. This is done by comparing the relations discovered from test articles with those in Freebase. We evaluate the performance of the models with Precision-Recall curves and the top-N precision (P@N) metric in our experiments.
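For reference, P@N over ranked predictions can be computed as in the sketch below (the list of (confidence, is_correct) pairs is hypothetical):

```python
def precision_at_n(predictions, n):
    """predictions: (confidence, is_correct) pairs for all non-NA predictions of a model."""
    ranked = sorted(predictions, key=lambda x: x[0], reverse=True)[:n]
    return sum(1 for _, correct in ranked if correct) / float(n)

preds = [(0.95, True), (0.90, True), (0.80, False), (0.75, True), (0.60, False)]
print(precision_at_n(preds, 3))  # 0.666...
```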

7 Results

In this section we attempt to answer the following questions:

Q1. Is RESIDE more effective than existing approaches for distant supervised RE? (7.1)

Q2. What is the effect of ablating different components on RESIDE's performance? (7.2)

Q3. How is the performance affected in the absence of relation alias information? (7.3)

7.1 Performance Comparison

For evaluating the effectiveness of our proposed method, RESIDE, we compare it against the baselines stated in Section 6.2. We use only the neural baselines on the GIDS dataset. The Precision-Recall curves on Riedel and GIDS are presented in Figure 3. Overall, we find that RESIDE achieves higher precision over the entire recall range on both datasets. All the non-neural baselines could not perform well, as the features used by them are mostly derived from NLP tools, which can be erroneous. RESIDE outperforms PCNN+ATT and BGWA, which indicates that incorporating side information helps in improving the performance of the model. The higher performance of BGWA and PCNN+ATT over PCNN shows that attention helps in distant supervised RE. Following (Lin et al., 2016; Liu et al., 2017), we also evaluate our method with different numbers of sentences. Results summarized in Table 2 show the improved precision of RESIDE in all test settings, as compared to the neural baselines, which demonstrates the efficacy of our model.

            One                       Two                       All
            P@100  P@200  P@300       P@100  P@200  P@300       P@100  P@200  P@300
PCNN        73.3   64.8   56.8        70.3   67.2   63.1        72.3   69.7   64.1
PCNN+ATT    73.3   69.2   60.8        77.2   71.6   66.1        76.2   73.1   67.4
BGWA        78.0   71.0   63.3        81.0   73.0   64.0        82.0   75.0   72.0
RESIDE      80.0   75.5   69.3        83.0   73.5   70.6        84.0   78.5   75.6

Table 2: P@N for relation extraction using a variable number of sentences in bags (with more than one sentence) in the Riedel dataset. Here, One, Two and All represent the number of sentences randomly selected from a bag. RESIDE attains improved precision in all settings. More details in Section 7.1.

Figure 4: Performance comparison of different ablated versions of RESIDE on the Riedel dataset. Overall, GCN and side information help RESIDE improve performance. Refer to Section 7.2.

7.2 Ablation Results

In this section, we analyze the effect of various components of RESIDE on its performance. For this, we evaluate various versions of our model with cumulatively removed components. The experimental results are presented in Figure 4. We observe that on removing different components from RESIDE, the performance of the model degrades drastically. The results validate that GCNs are effective at encoding syntactic information. Further, the improvement from side information shows that it is complementary to the features extracted from text, thus validating the central thesis of this paper, that inducing side information leads to improved relation extraction.

7.3 Effect of Relation Alias Side Information

In this section, we test the performance of the model in a setting where relation alias information is not readily available. For this, we evaluate the performance of the model in four different settings:

• None: Relation aliases are not available.

• One: The name of the relation is used as its alias.

• One+PPDB: Relation name extended using the Paraphrase Database (PPDB).

• All: Relation aliases from the Knowledge Base3.

3 Each relation in the Riedel dataset is manually mapped to the corresponding Wikidata property for getting relation aliases. A few examples are presented in the supplementary material.

Figure 5: Performance on the settings defined in Section 7.3 with respect to the presence of relation alias side information on the Riedel dataset. RESIDE performs comparably in the absence of relations from the KB.

The overall results are summarized in Figure 5. We find that the model performs best when aliases are provided by the KB itself. Overall, we find that RESIDE gives competitive performance even when a very limited amount of relation alias information is available. We observe that performance improves further with the availability of more alias information.

8 Conclusion

In this paper, we propose RESIDE, a novel neural network based model which makes principled use of relevant side information, such as entity type and relation alias, from the Knowledge Base, for improving distant supervised relation extraction. RESIDE employs Graph Convolution Networks for encoding syntactic information of sentences and is robust to limited side information. Through extensive experiments on benchmark datasets, we demonstrate RESIDE's effectiveness over state-of-the-art baselines. We have made RESIDE's source code publicly available to promote reproducible research.

Acknowledgements

We thank the anonymous reviewers for their constructive comments. This work is supported in part by the Ministry of Human Resource Development (Government of India), CAIR (DRDO) and by a gift from Google.

References

Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. 2015. Leveraging linguistic structure for open domain information extraction. In ACL (1), pages 344–354. The Association for Computer Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967, Copenhagen, Denmark. Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA. ACM.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.

Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 3844–3852, USA. Curran Associates Inc.

Xiaocheng Feng, Jiang Guo, Bing Qin, Ting Liu, and Yongjie Liu. Effective deep memory networks for distant supervised relation extraction.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

A. Graves, A. r. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649.

Zhengqiu He, Wenliang Chen, Zhenghua Li, Meishan Zhang, Wei Zhang, and Min Zhang. 2018. SEE: Syntax-aware entity embedding for neural relation extraction.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550. Association for Computational Linguistics.

S. Jat, S. Khandelwal, and P. Talukdar. 2018. Improving distantly supervised relation extraction using word and entity based attention. ArXiv e-prints.

Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In AAAI.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2124–2133.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI'12, pages 94–100. AAAI Press.

Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1790–1795. Association for Computational Linguistics.

Yang Liu, Kang Liu, Liheng Xu, and Jun Zhao. 2014. Exploring fine-grained entity type constraints for distantly supervised relation extraction. In COLING.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. CoRR, abs/1703.04826.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 1003–1011. Association for Computational Linguistics.

Tushar Nagarajan, Sharmistha Jat, and Partha Talukdar. 2017. Candis: Coupled & attention-driven neural distant supervision. CoRR, abs/1710.09942.

Thien Nguyen and Ralph Grishman. 2018. Graph convolutional networks with argument-aware pooling for event detection.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers, pages 425–430.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

Shikhar Vashishth, Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. 2018a. Dating documents using graph convolution networks. In 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), July 15-20, 2018, Melbourne, Australia.

Shikhar Vashishth, Prince Jain, and Partha Talukdar. 2018b. CESI: Canonicalizing open knowledge bases using embeddings and side information. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pages 1317–1327, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85.

Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schütze. 2017. Noise mitigation for neural entity typing and relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1183–1194. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344.