Saliency Learning: Teaching the Model Where to Pay Attention
Reza Ghaeini, Xiaoli Fern, Hamed Shahbazi, Prasad Tadepalli
Oregon State University


Page 1

Saliency Learning: Teaching the Model Where to Pay Attention

Reza Ghaeini, Xiaoli Fern, Hamed Shahbazi, Prasad Tadepalli

Oregon State University

Page 2

Motivation

• Do deep models make the right prediction for the right reason? How reliable are deep models?

The Office, S04E03

Page 3

Motivation

• Do deep models make the right prediction for the right reason? How reliable are deep models?

The Office, S04E03

Pizza Delivery Guy

Page 4

Motivation

• Do deep models make the right prediction for the right reason? How reliable are deep models?

The Office, S04E03

Pizza Delivery Guy

Wearing Jeans

Page 5

Motivation


• Attempts toward interpretation and explanation.

• Teach the model to make the right prediction for the right reason.

The Office, S04E03

Pizza Delivery Guy

Page 6

Motivation


• Attempts toward interpretation and explanation.

• Teach the model to make the right prediction for the right reason.

The Office, S04E03

Pizza Delivery Guy

Right Reason: Carrying Pizzas

Page 7

Saliency Learning

• Contributory Words (Z): words whose occurrence in a sample suggests the gold prediction; the prediction should be made by focusing on them.

• Saliency: an explanation method that determines the impact of a unit on a prediction, i.e., the gradient of the prediction with respect to the unit.

• Goal: align the behavior of the model with the expected and desired behavior.

• Methodology: teach the model to assign positive saliency to contributory words.


Page 8

Saliency Learning

• We propose a penalization term (the explanation loss) to enforce positive saliency for the contributory words.

C(θ, X, y, Z) = L(θ, X, y) + λ ∑_{i=1}^n max(0, −Z_i S(X_i))

Page 9

Saliency Learning

• We propose a penalization term (the explanation loss) to enforce positive saliency for the contributory words.

C(θ, X, y, Z) = L(θ, X, y) + λ ∑_{i=1}^n max(0, −Z_i S(X_i))

Term legend: C is the training cost function; L is the traditional loss function; λ is a hyper-parameter; θ are the model parameters; y is the gold label; Z is the gold explanation marking the contributory words; X is the input (sentence) and X_i a word; n is the sentence length; S(X_i) is the word saliency; the summation term is the explanation loss.
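The explanation loss follows directly from the formula above. Below is a minimal PyTorch sketch, assuming S(X_i) is the gradient of the model's prediction score with respect to the word embeddings, summed over embedding dimensions (that reduction is our assumption; the slides do not pin it down):

```python
import torch

def explanation_loss(model, embeddings, z, lam):
    """Hinge penalty lam * sum_i max(0, -Z_i * S(X_i)); a sketch,
    not the authors' released implementation. `z` is a 0/1 float
    tensor marking contributory words."""
    x = embeddings.clone().detach().requires_grad_(True)
    score = model(x)  # scalar prediction score
    grads = torch.autograd.grad(score, x, create_graph=True)[0]
    saliency = grads.sum(dim=-1)  # one saliency value per word (assumed reduction)
    return lam * torch.clamp(-z * saliency, min=0).sum()
```

The full training cost is then the task loss plus this term, C = L + explanation loss; `create_graph=True` keeps the penalty differentiable so it can be minimized jointly with L.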

Page 10

Saliency Learning

• Sentence: An unknown man had [broken into] a house last November.

Event Detection: Event Mention: broken into; Event Type: Attack.
Modified Event Detection: Contributory Words: broken into; Label: Positive.

X = X_1, X_2, X_3, X_4, X_5, X_6, …, X_n
Z = 0, 0, 1, 1, 0, 1, …, 0

Explanation Loss = 0

Page 11

Saliency Learning

Explanation Loss = 0

Page 12

Saliency Learning

X_3 is contributory (Z_3 = 1) and adds the term max(0, −S(X_3)).

Explanation Loss = max(0, −S(X_3))

Page 13

Saliency Learning

X_4 is contributory (Z_4 = 1) and adds max(0, −S(X_4)).

Explanation Loss = max(0, −S(X_3)) + max(0, −S(X_4))

Page 14

Saliency Learning

X_5 is not contributory (Z_5 = 0) and adds nothing.

Explanation Loss = max(0, −S(X_3)) + max(0, −S(X_4))

Page 15

Saliency Learning

X_6 is contributory (Z_6 = 1) and adds max(0, −S(X_6)).

Explanation Loss = max(0, −S(X_3)) + max(0, −S(X_4)) + max(0, −S(X_6))

Page 16

Saliency Learning

The remaining words follow the same pattern.

Explanation Loss = max(0, −S(X_3)) + max(0, −S(X_4)) + max(0, −S(X_6)) + ⋯
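To make the accumulation concrete, here is a tiny numeric check of the same sum; the saliency values below are invented purely for illustration:

```python
# Hypothetical saliency values for the six words shown above.
saliency = [0.9, -0.2, -0.5, 0.3, 0.1, -0.4]
z        = [0,    0,    1,   1,   0,   1]   # Z from the slide

# X_3 adds max(0, 0.5) = 0.5; X_4 adds max(0, -0.3) = 0;
# X_6 adds max(0, 0.4) = 0.4; words with Z_i = 0 add nothing.
loss = sum(max(0.0, -zi * si) for zi, si in zip(z, saliency))
print(loss)  # 0.9
```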

Page 17

Tasks and Datasets

• Event Detection:
  • ACE 2005 → Annotations: Event Mentions
  • Rich ERE 2015 → Annotations: Event Mentions

• Cloze-Style Question Answering:
  • CBT-NE → Annotations: Gold Replacement (Candidate)
  • CBT-CN → Annotations: Gold Replacement (Candidate)

Page 18

Event Detection

• Event Detection: Given a sentence, find the event mention and determine its event type.

• Modified Event Detection: Given a sentence, determine if it contains an event.

Sentence: An unknown man had [broken into] a house last November.

Event Detection: Event Mention: broken into; Event Type: Attack.
Modified Event Detection: Contributory Words: broken into; Label: Positive.

Page 19

Cloze-Style QA

• Cloze-Style QA: Given a document, a query with a blank, and a set of possible entities for filling the blank, find the right entity.

• Modified Cloze-Style QA: Given a sentence and a query with a blank, determine whether the sentence contains the right entity for the blank in the query.

Page 20

Dataset Statistics


Saliency Learning: Teaching the Model Where to Pay Attention (Appendix)

A Background: Saliency

The concept of saliency was first introduced in vision for visualizing the spatial support of a particular object class on an image (Simonyan et al., 2013). Consider a deep model prediction as a differentiable function f parameterized by θ with input X ∈ R^{n×d}. Such a model can be described using the Taylor series as follows:

f(x) = f(a) + f′(a)(x − a) + (f″(a) / 2!)(x − a)² + …   (1)

By approximating the deep model as a linear function, we can use just the first-order Taylor expansion:

f(x) ≈ f′(a) x + b   (2)

According to Equation 2, the first derivative of the model's prediction with respect to the input (f′(a), or ∂f/∂x at x = a) describes the model's behaviour near the input. Put more plainly, a larger derivative/gradient indicates more impact on, and contribution toward, the model's prediction. Consequently, the large-magnitude derivative values identify the units of the input that would most affect f(x) if changed.
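To see the first-order view in action, here is a tiny self-contained check; the toy function below merely stands in for a deep model, so everything here is illustrative:

```python
import torch

# Near a point a, f(x) is approximated by f(a) + f'(a)(x - a)  (Eq. 2).
f = lambda x: torch.tanh(3 * x).sum()       # toy differentiable "model"
a = torch.tensor([0.2, -0.1], requires_grad=True)
fa = f(a)
fa.backward()                               # a.grad now holds f'(a)
x = a.detach() + 0.01                       # a nearby input
approx = fa.item() + (a.grad * (x - a.detach())).sum().item()
print(f(x).item(), approx)                  # close for small perturbations
```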

B Task and Dataset

Here, we first describe the original Event Extraction and Cloze-Style Question Answering tasks (before our modification). Next, we provide data statistics of the modified versions of the ACE, ERE, CBT-NE, and CBT-CN datasets in Table 1.

• Event Extraction: Given a set of ontologized event types (e.g., Movement, Transaction, Conflict), the goal of event extraction is to identify the mentions of different events, along with their types, in natural text.

Table 1: Dataset statistics of the modified tasks and datasets (P. = positive sample count, N. = negative sample count).

Dataset   Train P.   Train N.   Test P.   Test N.
ACE       3.2K       15K        293       421
ERE       3.1K       4K         2.7K      1.91K
CBT-NE    359K       1.82M      8.8K      41.1K
CBT-CN    256K       2.16M      5.5K      44.4K

• Cloze-Style Question Answering: Documents in CBT consist of 20 contiguous sentences from the body of a popular children's book, and queries are formed by replacing a token from the 21st sentence with a blank. Given a document, a query, and a set of candidates, the goal is to find the correct replacement for the blank in the query among the given candidates. To avoid having too many negative examples in our modified datasets, we only consider the sentences that contain at least one candidate. To be more clear, each sample from the CBT dataset is split into at most 20 samples: one per sentence of the main sample, as long as that sentence contains one of the candidates.
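The per-sentence splitting reads naturally as a small preprocessing function; the sketch below uses invented field names and naive substring matching purely for illustration:

```python
def split_cbt_sample(sentences, query, candidates, answer):
    """Turn one CBT sample (20 sentences + query) into up to 20
    binary samples, keeping only sentences containing a candidate."""
    out = []
    for sentence in sentences:
        present = [c for c in candidates if c in sentence]
        if not present:            # drop sentences with no candidate
            continue
        label = answer in present  # positive iff the gold entity appears
        out.append((sentence, query, label))
    return out
```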

C Training

All hyper-parameters are tuned on the development set. We use pre-trained 300-D GloVe 840B vectors to initialize our word embedding vectors. All hidden states and feature sizes are 300-dimensional (d = 300). The weights are learned by minimizing the cost function on the training data via the Adam optimizer. The initial learning rate is 0.0001, and λ = 0.5, 0.7, 0.4, and 0.35 for ACE, ERE, CBT-NE, and CBT-CN, respectively.
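These choices translate directly into a few lines of setup; the model object below is a placeholder, not the authors' architecture:

```python
import torch

# Per-dataset explanation-loss weights as stated above.
LAMBDA = {"ACE": 0.5, "ERE": 0.7, "CBT-NE": 0.4, "CBT-CN": 0.35}

model = torch.nn.Linear(300, 1)   # placeholder for the CNN model (d = 300)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
lam = LAMBDA["ACE"]
# per batch: cost = task_loss + lam * explanation_loss  (the C above)
```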


Page 21

Toy Models

[Figure: (a) Event Detection model: Sentence → Conv-W3 and Conv-W5 → Max-Pooling → Dim & Seq Max-Pooling. (b) Cloze-Style Question Answering model: the same sentence pipeline plus a Query branch: Conv-W3 and Conv-W5 → Max-Pooling → Max-Pooling.]

Page 22

Toy Models


W: Word Representation


Page 23

Toy Models


I: Intermediate Representation


Page 24

Toy Models


D: Decision Representation


Page 25

Train Cost Function

Enforcing positive saliency at multiple levels of the model:

C(θ, X, y, Z) = L(θ, X, y)
              + λ ∑_{i=1}^n max(0, −Z_i S(W_i))
              + λ ∑_{i=1}^n max(0, −Z_i S(I_i))
              + λ ∑_{i=1}^n max(0, −Z_i S(D_i))
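All three penalty terms can be computed from one backward pass by asking autograd for gradients at each saved level. A sketch, assuming per-position saliency is the gradient summed over feature dimensions:

```python
import torch

def multilevel_explanation_loss(score, levels, z, lam):
    """Hinge penalties on saliency at several representation levels
    (e.g., W, I, D tensors saved during the forward pass); this is an
    illustrative sketch, not the authors' implementation."""
    grads = torch.autograd.grad(score, levels, create_graph=True)
    loss = 0.0
    for g in grads:
        s = g.sum(dim=-1) if g.dim() > 1 else g  # per-position saliency
        loss = loss + lam * torch.clamp(-z * s, min=0).sum()
    return loss
```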

Page 26

Train Cost Function

Enforcing positive saliency at W: Word Representation.

Page 27

Train Cost Function

Enforcing positive saliency at I: Intermediate Representation.

Page 28

Train Cost Function

Enforcing positive saliency at D: Decision Representation.

Page 29

Results


Figure 1: A high-level view of the models used for event extraction (a) and question answering (b).

…to binary tasks. Note that for both tasks, if an example is negative, its explanation annotation will be all zeros. In other words, for negative examples we have C = L.

4 Model

We use simple CNN-based models to avoid complexity. Figure 1 illustrates the models used in this paper. Both models have a similar structure; the main difference is that the Q.A. model has two inputs (sentence and query). We first describe the event extraction model, followed by the Q.A. model.

Figure 1 (a) shows the event extraction model. Given a sentence W = [w_1, …, w_n] where w_i ∈ R^d, we first pass the embeddings to two CNNs with feature size d and window sizes 3 and 5. Next we apply max-pooling to both CNN outputs. This gives us the representation I ∈ R^{n×d}, which we refer to as the intermediate representation. Then we apply sequence-wise and dimension-wise max-pooling to I to capture D_seq ∈ R^d and D_dim ∈ R^n, respectively. D_dim will be referred to as the decision representation. Finally, we pass the concatenation of D_seq and D_dim to a feed-forward layer for prediction.

Figure 1 (b) depicts the Q.A. model. The main difference is having the query as an extra input. To process the query, we use a structure similar to the main model. After the CNNs and max-pooling we end up with Q ∈ R^{m×d}, where m is the length of the query. To obtain a sequence-independent vector, we apply another max-pooling to Q, resulting in a query representation q ∈ R^d. We follow a similar approach on the sentence as in event extraction; the only difference is that we apply the dot product between the intermediate representations and the query representation (I_i = I_i · q).
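Read as code, the event-extraction model is compact. Here is a PyTorch sketch of the description above; the elementwise max across the two CNN outputs and the fixed sentence length n are our assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

class ToyEventModel(nn.Module):
    """Sketch of the event-extraction toy model; illustrative only."""
    def __init__(self, d=300, n=50):
        super().__init__()
        self.conv3 = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.conv5 = nn.Conv1d(d, d, kernel_size=5, padding=2)
        self.head = nn.Linear(d + n, 1)

    def forward(self, w):                     # w: (n, d) word embeddings
        x = w.t().unsqueeze(0)                # (1, d, n) layout for Conv1d
        i = torch.max(self.conv3(x), self.conv5(x))  # max-pool the two CNN outputs
        i = i.squeeze(0).t()                  # intermediate representation I: (n, d)
        d_seq = i.max(dim=0).values           # sequence-wise max-pool -> (d,)
        d_dim = i.max(dim=1).values           # dimension-wise max-pool -> (n,), decision rep.
        return self.head(torch.cat([d_seq, d_dim]))  # scalar prediction
```

The Q.A. variant would add a query branch with the same structure, pool it to a vector q ∈ R^d, and multiply it into the intermediate representation before the final pooling, as described above.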

As mentioned previously, we can apply saliency regularization to different levels of the model. In this paper, we apply saliency regularization at the following three levels: 1) word embeddings (W); 2) intermediate representation (I); 3) decision representation (D_dim). Note that the aforementioned levels share the same annotation for training. For training details, please refer to Section C of the Appendix¹.

Table 1: Performance of trained models on multiple datasets using the traditional method and saliency learning (S. = saliency learning, P. = precision, R. = recall, Acc. = accuracy).

Dataset   S.    P.     R.     F1     Acc.
ACE       No    66.0   77.5   71.3   74.4
ACE       Yes   70.1   76.1   73.0   76.9
ERE       No    85.0   86.6   85.8   83.1
ERE       Yes   85.8   87.3   86.6   84.0
CBT-NE    No    55.6   76.3   64.3   75.5
CBT-NE    Yes   57.2   74.5   64.7   76.5
CBT-CN    No    47.4   39.0   42.8   77.3
CBT-CN    Yes   48.3   38.9   43.1   77.7

5 Experiments and Analysis

5.1 Performance

Table 1 shows the performance of the trained models on the ACE, ERE, CBT-NE, and CBT-CN datasets using the aforementioned models with and without saliency learning. The results indicate that using saliency learning yields better accuracy and F1 measure on all four datasets. It is interesting to note that saliency learning consistently helps the models achieve noticeably higher precision without hurting the F1 measure or accuracy. This observation suggests that saliency learning is effective in providing proper guidance for more accurate predictions; note that here we only have guidance for positive predictions. To verify the statistical significance of the observed performance improvement over traditionally trained models without saliency learning, we conducted the one-sided McNemar's test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE, ERE, CBT-NE, and CBT-CN respectively, indicating that the performance gain from saliency learning is statistically significant.

5.2 Saliency Accuracy and Visualization

In this section, we examine how well the saliency of the trained model aligns with the annotation. To this end, we define a metric called saliency accuracy (sacc), which measures what

¹ Code will be publicly available upon acceptance.


Page 30

Results

Table 1 (reproduced): performance with (Yes) and without (No) saliency learning (S. = saliency learning, P. = precision, R. = recall, Acc. = accuracy).

Dataset   S.    P.     R.     F1     Acc.
ACE       No    66.0   77.5   71.3   74.4
ACE       Yes   70.1   76.1   73.0   76.9
ERE       No    85.0   86.6   85.8   83.1
ERE       Yes   85.8   87.3   86.6   84.0
CBT-NE    No    55.6   76.3   64.3   75.5
CBT-NE    Yes   57.2   74.5   64.7   76.5
CBT-CN    No    47.4   39.0   42.8   77.3
CBT-CN    Yes   48.3   38.9   43.1   77.7


Without Explanation

Page 31: Saliency Learning: Teaching the Model Where to Pay Attention

Results

!31 NAACL 2019

Dataset   S.a   P.b    R.c    F1     Acc.d

ACE       No    66.0   77.5   71.3   74.4
          Yes   70.1   76.1   73.0   76.9

ERE       No    85.0   86.6   85.8   83.1
          Yes   85.8   87.3   86.6   84.0

CBT-NE    No    55.6   76.3   64.3   75.5
          Yes   57.2   74.5   64.7   76.5

CBT-CN    No    47.4   39.0   42.8   77.3
          Yes   48.3   38.9   43.1   77.7

a Saliency Learning. b Precision. c Recall. d Accuracy.


With Explanation

Page 32: Saliency Learning: Teaching the Model Where to Pay Attention

Results

!32 NAACL 2019

Dataset   S.a   P.b    R.c    F1     Acc.d

ACE       No    66.0   77.5   71.3   74.4
          Yes   70.1   76.1   73.0   76.9

ERE       No    85.0   86.6   85.8   83.1
          Yes   85.8   87.3   86.6   84.0

CBT-NE    No    55.6   76.3   64.3   75.5
          Yes   57.2   74.5   64.7   76.5

CBT-CN    No    47.4   39.0   42.8   77.3
          Yes   48.3   38.9   43.1   77.7

a Saliency Learning. b Precision. c Recall. d Accuracy.


Page 33: Saliency Learning: Teaching the Model Where to Pay Attention

Results

!33 NAACL 2019

Dataset   S.a   P.b    R.c    F1     Acc.d

ACE       No    66.0   77.5   71.3   74.4
          Yes   70.1   76.1   73.0   76.9

ERE       No    85.0   86.6   85.8   83.1
          Yes   85.8   87.3   86.6   84.0

CBT-NE    No    55.6   76.3   64.3   75.5
          Yes   57.2   74.5   64.7   76.5

CBT-CN    No    47.4   39.0   42.8   77.3
          Yes   48.3   38.9   43.1   77.7

a Saliency Learning. b Precision. c Recall. d Accuracy.

3

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Figure 1: A high-level view of the models used forevent extraction (a) and question answering (b).

to binary tasks. Note that for both tasks if an ex-ample is negative, its explanation annotation willbe all zero. In other words, for negative exampleswe have C = L.

4 Model

We use simple CNN based models to avoid com-plexity. Figure 1 illustrates the models used in thispaper. Both models have a similar structure. Themain difference is that Q.A. has two inputs (sen-tence and query). We first describe the event ex-traction model followed by the Q.A. model.

Figure 1 (a) shows the event extraction model.Given a sentence W = [w1, . . . , wn] where wi 2Rd, we first pass the embeddings to two CNNswith feature size of d and window size of 3 and 5.Next we apply max-pooling to both CNN outputs.It will give us the representation I 2 Rn⇥d, whichwe refer to it as the intermediate representation.Then, we apply sequence-wise and dimension-wise max-poolings to I to capture Dseq 2 Rd andDdim 2 Rn respectively. Ddim will be referredas decision representation. Finally we pass theconcatenation of Dseq and Ddim to a feed-forwardlayer for prediction.

Figure 1 (b) depicts the Q.A. model. The maindifference is having query as an extra input. Toprocess the query, we use a similar structure asthe main model. After CNNs and max-pooling weend up with Q 2 Rm⇥d where m is the length ofquery. To obtain a sequence independent vector,we apply another max-pooling to Q resulting in aquery representation q 2 Rd. We follow a similarapproach on the sentence as in event extraction.The only difference is that we apply the dot prod-uct between the intermediate representations andquery representation (Ii = Ii � q).

As mentioned previously, we can apply saliencyregularization to different levels of the model. Inthis paper, we apply saliency regularization on thefollowing three levels: 1) Word embeddings (W ).

Dataset S.a P.b R.c F1 Acc.d

ACE No 66.0 77.5 71.3 74.4Yes 70.1 76.1 73.0 76.9

ERE No 85.0 86.6 85.8 83.1Yes 85.8 87.3 86.6 84.0

CBT-NE No 55.6 76.3 64.3 75.5Yes 57.2 74.5 64.7 76.5

CBT-CN No 47.4 39.0 42.8 77.3Yes 48.3 38.9 43.1 77.7

aSaliency Learning. bPrecision.cRecall. dAccuracy

Table 1: Performance of trained models on multipledatasets using traditional method and saliency learning.

2) Intermediate representation (I). 3) Decisionrepresentation (Ddim). Note that the aforemen-tioned levels share the same annotation for train-ing. For training details please refer to Section Cof the Appendix1.

5 Experiments and Analysis

5.1 PerformanceTable 1 shows the performance of the trained mod-els on ACE, ERE, CBT-NE, and CBT-CN datasetsusing the aforementioned models with and with-out saliency learning. The results indicate that us-ing saliency learning yields better accuracy andF1 measure on all four datasets. It is interestingto note that saliency learning consistently helpsthe models to achieve noticeably higher preci-sion without hurting the F1 measure and accuracy.This observation suggests that saliency learning iseffective in providing proper guidance for moreaccurate predictions – Note that here we onlyhave guidance for positive prediction. To ver-ify the statistical significance of the observed per-formance improvement over traditionally trainedmodels without saliency learning, we conductedthe one-sided McNemar’s test. The obtained p-values are 0.03, 0.03, 0.0001, and 0.04 for ACE,ERE, CBT-NE, and CBT-CN respectively, indicat-ing that the performance gain by saliency learningis statistically significant.

5.2 Saliency Accuracy and VisualizationIn this section, we examine how well does thesaliency of the trained model align with the an-notation. To this end, we define a metric calledsaliency accuracy (sacc), which measures what



Dataset   S.      W.[a]   I.[b]   D.[c]
ACE       No      61.60   66.05   63.27
          Yes     99.26   77.92   65.49
ERE       No      51.62   56.71   44.37
          Yes     99.77   77.45   51.78
CBT-NE    No      52.32   65.38   68.81
          Yes     98.17   98.34   95.56
CBT-CN    No      47.78   53.68   45.15
          Yes     99.13   98.94   97.06

[a] Word-level saliency accuracy. [b] Intermediate-level saliency accuracy. [c] Decision-level saliency accuracy.

Table 2: Saliency accuracies at different layers of our models trained on ACE, ERE, CBT-NE, and CBT-CN.

Table 2 shows the saliency accuracies at different layers of the model trained with and without saliency learning. According to Table 2, our method achieves much higher saliency accuracy on all datasets, indicating that the training was indeed effective in aligning the model's saliency with the annotation. In other words, important words have positive contributions in the saliency-trained model, and as such, it learns to focus on the right parts of the data. This claim can also be verified by visualizing the saliency; visualizations are provided in Section D of the Appendix.

5.3 Verification

Up to this point, we have shown that saliency learning yields noticeably better precision, F1 measure, accuracy, and saliency accuracy. Here, we aim to verify our claim that saliency learning coerces the model to pay more attention to the critical parts. The annotation $Z$ describes the influential words toward the positive labels. Our hypothesis is that removing such words should have more impact on saliency-trained models, since through training they should be more sensitive to these words. We measure the impact as the percentage change of the model's true positive rate (TPR):

$$\Delta TPR = 100 \times \frac{TPR_0 - TPR_1}{TPR_0}$$

where $TPR_0$ and $TPR_1$ are the TPRs before and after removing the contributory words. This measure is chosen because negative examples do not have any annotated contributory words; hence, we are particularly interested in how removing the contributory words of positive examples impacts the model's true positive rate.
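As a quick check of the $\Delta TPR$ arithmetic against Table 3 below, the ACE row without saliency learning gives $100 \times (77.5 - 52.2) / 77.5 \approx 32.6$:

```python
def tpr_change(tpr0: float, tpr1: float) -> float:
    """Percentage reduction in true positive rate after word removal."""
    return 100.0 * (tpr0 - tpr1) / tpr0

assert round(tpr_change(77.5, 52.2), 1) == 32.6  # ACE row of Table 3
```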

Dataset   S.      TPR_0[a]   TPR_1[b]   ΔTPR[c]
ACE       No      77.5       52.2       32.6
          Yes     76.1       45.0       40.9
ERE       No      86.6       73.2       15.4
          Yes     87.3       70.6       19.1
CBT-NE    No      76.3       30.2       60.4
          Yes     74.5       28.5       61.8
CBT-CN    No      39.0       16.6       57.4
          Yes     38.9       15.4       60.4

[a] True positive rate (before removal). [b] TPR after removing the critical word(s). [c] TPR change rate.

Table 3: True positive rate and true positive rate change of the trained models before and after removing the contributory word(s).

Table 3 shows the outcome of the aforementioned experiment, where the last column lists the TPR reduction rates. From the table, we see a consistently higher rate of TPR reduction for saliency-trained models compared to traditionally trained models, suggesting that saliency-trained models are more sensitive to the presence of the contributory words and confirming our hypothesis.

It is worth noting that we observe a less substantial change in the true positive rate for the event task. This is likely due to the fact that we are using the trigger words as simulated explanations. While trigger words are clearly related to events, there are often other words in the sentence that relate to events but are not annotated as trigger words.

6 Conclusion

In this paper, we proposed saliency learning, a novel approach for teaching a model where to pay attention. We demonstrated the effectiveness of our method on multiple tasks and datasets using simulated explanations. The results show that saliency learning enables us to obtain better precision, F1 measure, and accuracy on these tasks and datasets. Further, it produces models whose saliency is better aligned with the desired explanation. In other words, saliency learning gives us more reliable predictions while delivering better performance than traditionally trained models. Finally, our verification experiments illustrate that saliency-trained models show higher sensitivity to the removal of contributory words in positive examples. For future work, we will extend our study to examine saliency learning on NLP tasks in an active learning setting where real explanations are requested and provided by humans.

sacc = 100⇥P

i �(ZiGi > 0)Pi Zi

<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>

NAACL 2019

Page 36: Saliency Learning: Teaching the Model Where to Pay Attentionweb.engr.oregonstate.edu/~ghaeinim/files/Saliency... · 2019-10-15 · • Sentence: An unknown man had [broken into] a

Saliency Accuracy

!36

4

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Dataset S. W.a I.b D.c

ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49

ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78

CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56

CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06

aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.

Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.

percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100

Pi �(ZiGi>0)P

i Ziwhere Gi is the gradient

of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-

ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.

5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).

Table 3 shows the outcome of the aforemen-

Dataset S. TPRa0 TPRb

1 �TPRc

ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9

ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1

CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8

CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4

aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.

Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).

tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.

It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.

6 Conclusion

In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.

sacc = 100⇥P

i �(ZiGi > 0)Pi Zi

<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>

Saliency Accuracy

Gradient/SaliencyIndicator Function

NAACL 2019

Page 37: Saliency Learning: Teaching the Model Where to Pay Attentionweb.engr.oregonstate.edu/~ghaeinim/files/Saliency... · 2019-10-15 · • Sentence: An unknown man had [broken into] a

Saliency Accuracy

!37

sacc = 100⇥P

i �(ZiGi > 0)Pi Zi

<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>

Saliency Accuracy

Gradient/SaliencyIndicator Function

NAACL 2019

Dataset S. W.a I.b D.c

ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49

ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78

CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56

CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06

aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.

4

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Dataset S. W.a I.b D.c

ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49

ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78

CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56

CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06

aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.

Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.

percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100

Pi �(ZiGi>0)P

i Ziwhere Gi is the gradient

of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-

ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.

5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).

Table 3 shows the outcome of the aforemen-

Dataset S. TPRa0 TPRb

1 �TPRc

ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9

ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1

CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8

CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4

aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.

Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).

tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.

It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.

6 Conclusion

In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.

Other words are also impactful toward the occurrence of an event.

Page 38: Saliency Learning: Teaching the Model Where to Pay Attentionweb.engr.oregonstate.edu/~ghaeinim/files/Saliency... · 2019-10-15 · • Sentence: An unknown man had [broken into] a

True Positive Rate

!38

4

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Dataset S. W.a I.b D.c

ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49

ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78

CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56

CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06

aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.

Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.

percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100

Pi �(ZiGi>0)P

i Ziwhere Gi is the gradient

of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-

ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.

5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).

Table 3 shows the outcome of the aforemen-

Dataset S. TPRa0 TPRb

1 �TPRc

ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9

ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1

CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8

CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4

aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.

Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).

tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.

It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.

6 Conclusion

In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.

�TPR = 100⇥ TPR0 � TPR1

TPR0<latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit>

NAACL 2019

Page 39: Saliency Learning: Teaching the Model Where to Pay Attentionweb.engr.oregonstate.edu/~ghaeinim/files/Saliency... · 2019-10-15 · • Sentence: An unknown man had [broken into] a

True Positive Rate

!39

4

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

NAACL-HLT 2019 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Dataset S. W.a I.b D.c

ACE No 61.60 66.05 63.27Yes 99.26 77.92 65.49

ERE No 51.62 56.71 44.37Yes 99.77 77.45 51.78

CBT-NE No 52.32 65.38 68.81Yes 98.17 98.34 95.56

CBT-CN No 47.78 53.68 45.15Yes 99.13 98.94 97.06

aWord Level Saliency Accuracy.bIntermediate Level Saliency Accuracy.cDecision Level Saliency Accuracy.

Table 2: Saliency accuracies of different layer of ourmodels trained on ACE, ERE, CBT-NE, CBT-CN.

percentage of all positive positions of annotationZ indeed obtain a positive gradient. Formally,sacc = 100

Pi �(ZiGi>0)P

i Ziwhere Gi is the gradient

of unit i and � is the indicator function.Table 2 shows the saliency accuracies at dif-

ferent layers of the trained model with and with-out saliency learning. According to Table 2, ourmethod achieves a much higher saliency accuracyfor all datasets indicating that the learning was indeed effective in aligning the model saliency withthe annotation. In other words, important wordswill have positive contributions in the saliency-trained model, and as such, it learns to focus onthe right part of the data. This claim can also beverified by visualizing the saliency, which are pro-vided in section D of the Appendix.

5.3 VerificationUp to this point we show that using saliency learn-ing yields noticeably better precision, F1 measure,accuracy, and saliency accuracy. Here, we aim toverify our claim that saliency learning coerces themodel to pay more attention to the critical parts.The annotation Z describes the influential wordstoward the positive labels. Our hypothesis is thatremoving such words would cause more impacton saliency-trained models, since by training theyshould be more sensitive to these words. We mea-sure the impact as the percentage change of themodel’s true positive rate. This measure is cho-sen because negative examples do not have anyannotated contributory words, and hence we areparticularly interested in how removing contribu-tory words of positive examples would impact themodel’s true positive rate (TPR).

Table 3 shows the outcome of the aforemen-

Dataset S. TPRa0 TPRb

1 �TPRc

ACE No 77.5 52.2 32.6Yes 76.1 45.0 40.9

ERE No 86.6 73.2 15.4Yes 87.3 70.6 19.1

CBT-NE No 76.3 30.2 60.4Yes 74.5 28.5 61.8

CBT-CN No 39.0 16.6 57.4Yes 38.9 15.4 60.4

aTrue Positive Rate (before removal).bTPR after removing the critical word(s).cTPR change rate.

Table 3: True positive rate and true positive rate changeof the trained models before and after removing thecontributory word(s).

tioned experiment, where the last column lists theTPR reduction rates. From the table, we see a con-sistently higher rate of TPR reduction for saliency-trained models compared to traditionally trainedmodels, suggesting that saliency-trained modelsare more sensitive to the presence of the contribu-tory words and confirming our hypothesis.

It worth noting that we observe less substantialchange to the true positive rate for the event task.This is likely due to the fact that we are using thetrigger words as simulated explanations. Whiletrigger words are clearly related to events, thereare often other words in the sentence relating toevents but not annotated as trigger words.

6 Conclusion

In this paper, we proposed saliency learning, anovel approach for teaching a model where topay attention. We demonstrated the effectivenessof our method on multiple tasks and datasets us-ing simulated explanations. The results show thatsaliency learning enables us to obtain better pre-cision, F1 measure and accuracy on these tasksand datasets. Further, it produces models whosesaliency is more properly aligned with the desiredexplanation. In other words, saliency learninggives us more reliable predictions while deliveringbetter performance as traditionally trained models.Finally, our verification experiments illustrate thatthe saliency trained models show higher sensitiv-ity to the removal of contributory words in a posi-tive example. For future work, we will extend ourstudy to examine saliency learning on NLP tasksin an active learning setting where real explana-tions are requested and provided by human.

$$\Delta \mathrm{TPR} = 100 \times \frac{\mathrm{TPR}_0 - \mathrm{TPR}_1}{\mathrm{TPR}_0}$$
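The verification protocol reduces to a few lines of code. Below is a sketch under assumed interfaces (a predict function returning 1 for a positive prediction, tokenized positive examples, and per-example sets of annotated contributory words); it is not the authors' evaluation code.

```python
from typing import Callable, List, Sequence, Set

def delta_tpr(predict: Callable[[List[str]], int],
              positives: Sequence[List[str]],
              contributory: Sequence[Set[str]]) -> float:
    # TPR over a batch of positive examples: fraction predicted positive.
    def tpr(batch):
        return sum(predict(tokens) for tokens in batch) / len(batch)

    tpr0 = tpr(positives)                            # TPR_0: before removal
    ablated = [[t for t in toks if t not in words]   # drop contributory words
               for toks, words in zip(positives, contributory)]
    tpr1 = tpr(ablated)                              # TPR_1: after removal
    return 100.0 * (tpr0 - tpr1) / tpr0              # ΔTPR, as defined above
```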

NAACL 2019

TPR_0: TPR before removing contributory words

TPR_1: TPR after removing contributory words

Page 40: Saliency Learning: Teaching the Model Where to Pay Attention

True Positive Rate

!40


Page 41: Saliency Learning: Teaching the Model Where to Pay Attention

Saliency Visualization

!41


Table 2 (Appendix): Top 6 salient tokens visualization of samples in ACE and ERE for the baseline and saliency-based models. Z lists the annotated contributory words; PB and PS are the baseline and saliency-based model predictions (1 = positive). The per-token highlighting that distinguishes the two models is not recoverable from the text extraction.

1. "The judge at Hassan's extradition hearing said that he found the French handwriting report very problematic, very confusing, and with suspect conclusions." (Z: extradition, hearing, said; PB = 1; PS = 1)
2. "Solana said the EU would help in the humanitarian crisis expected to follow an attack on Iraq." (Z: attack; PB = 1; PS = 1)
3. "The trial will start on March 13, the court said." (Z: trial; PB = 1; PS = 1)
4. "India's has been reeling under a heatwave since mid-May which has killed 1,403 people." (Z: killed; PB = 1; PS = 1)
5. "Retired General Electric Co. Chairman Jack Welch is seeking work-related documents of his estranged wife in his high-stakes divorce case." (Z: Retired, divorce; PB = 1; PS = 1)
6. "The following year, he was acquitted in the Guatemala case, but the U.S. continued to push for his prosecution." (Z: acquitted, case; PB = 1; PS = 1)
7. "In 2011, a Spanish National Court judge issued arrest warrants for 20 men, including Montano, suspected of participating in the slaying of the priests." (Z: issued, slaying, arrest; PB = 1; PS = 1)
8. "Slobodan Milosevic's wife will go on trial next week on charges of mismanaging state property during the former president's rule, a court said Thursday." (Z: trial, charges, former; PB = 1; PS = 1)
9. "Iraqis mostly fought back with small arms, pistols, machine guns and rocket-propelled grenades." (Z: fought; PB = 1; PS = 1)
10. "But the Saint Petersburg summit ended without any formal declaration on Iraq." (Z: summit; PB = 1; PS = 1)
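The "top 6 salient tokens" shown above can be extracted with the same gradient machinery used for saliency accuracy. A minimal sketch follows, again with an assumed model interface (x holds token embeddings; the model returns per-class scores); this is not the authors' visualization code.

```python
import torch

def top_salient_tokens(model, x, tokens, target_class, k=6):
    # Rank tokens by their word-level gradient saliency and return the
    # top-k in sentence order, mirroring the visualization above.
    x = x.clone().detach().requires_grad_(True)
    model(x)[target_class].backward()             # scalar score for the gold class
    scores = x.grad.sum(dim=-1)                   # one saliency score per token
    top = torch.topk(scores, k=min(k, len(tokens))).indices.tolist()
    return [tokens[i] for i in sorted(top)]
```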

NAACL 2019

Page 42: Saliency Learning: Teaching the Model Where to Pay Attention


Saliency Visualization

!42 NAACL 2019

Page 43: Saliency Learning: Teaching the Model Where to Pay Attention


Saliency Visualization

!43 NAACL 2019

Irrelevant but helpful data

Page 44: Saliency Learning: Teaching the Model Where to Pay Attention

Conclusion

• Proposing saliency learning, a novel approach for teaching a model where to pay attention.

• Experimenting on multiple tasks and datasets.

• Achieving better precision, F1 measure, and accuracy.

• Obtaining models whose saliency is more properly aligned with the desired explanation (more reliable).

• Future work: using saliency learning in a semi-supervised framework.

!44 NAACL 2019

Page 45: Saliency Learning: Teaching the Model Where to Pay Attention

Thank You
