Page 1:

February 2012

Princeton Plasma Physics Laboratory

With thanks to:

Collaborators: Ming-Wei Chang, James Clarke, Michael Connor, Dan Goldwasser, Lev Ratinov, Vivek Srikumar, and many others.
Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP).

Learning and Inference for

Natural Language Understanding

Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign

Page 2:

Nice to Meet You

Page 3:

Learning and Inference in Natural Language

Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome. Structured output problems: multiple dependent output variables.

(Learned) models/classifiers for different sub-problems. In some cases, not all local models can be learned simultaneously; in these cases, constraints may appear only at evaluation time.

Incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

Page 4:

Comprehension

1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.

A process that maintains and updates a collection of propositions about the state of affairs.

This is an Inference Problem

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

Page 5:

Why is it difficult?

[Diagram: the mapping between Meaning and Language, complicated in both directions by Ambiguity and Variability.]

Page 6:

Context Sensitive Paraphrasing

He used a Phillips head to tighten the screw.

The bank owner tightened security after a spate of local crimes.

The Federal Reserve will aggressively tighten monetary policy.

Candidate paraphrases of "tighten": Loosen, Strengthen, Step up, Toughen, Improve, Fasten, Impose, Intensify, Ease, Beef up, Simplify, Curb, Reduce

Page 7:

Variability in Natural Language Expressions

Example: Relation Extraction: “Works_for”

Jim Carpenter works for the U.S. Government.

The American government employed Jim Carpenter.

Jim Carpenter was fired by the US Government.

Jim Carpenter worked in a number of important positions. … As a press liaison for the IRS, he made contacts in the White House.

Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter.

Former US Secretary of Defence Jim Carpenter spoke today…

Page 8:

Textual Entailment

Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year

Yahoo acquired Overture

Is it true that…? (Textual Entailment)

Overture is a search company. Google is a search company.

…

Google owns Overture


A key problem in natural language understanding is to abstract over the inherent syntactic and semantic variability in natural language.

Page 9:

Why? A law office wants to get the list of all people that were mentioned in email correspondence with the office. For each name, determine whether it was mentioned adversarially or not.

A political scientist studies climate change and its effect on societal instability. He wants to identify all events related to demonstrations, protests, parades, elections, analyze them (who, when, where, why), and generate a timeline.

An electronic health record (EHR) is a personal health record in digital format. It includes information relating to current and historical health, medical conditions, and medical tests; medical referrals, treatments, medications, demographic information, etc. Today it is a write-only document. Can we use it in medical advice systems; medication selection and tracking (Vioxx…); disease outbreak and control; science (correlating response to drugs with other conditions)?

Page 10:

Background

It's difficult to program predicates of interest due to:
Ambiguity (everything has multiple meanings)
Variability (everything you want to say, you can say in many ways)

Consequently: all of Natural Language Processing is driven by Statistical Machine Learning

Even simple predicates like: what is the part of speech of the word "can"? Or: correct the sentence "I'd like a peace of cake".

Not to mention harder problems like co-reference resolution, parsing, semantic parsing, named entity recognition,…

Machine Learning problems in NLP are large: often 10^6 features (due to lexical items, conjunctions of them, etc.). We are pretty good for some classes of problems.

Page 11:

Classification: Ambiguity Resolution

Illinois' bored of education [board]
Nissan Car and truck plant; plant and animal kingdom
(This Art) (can N) (will MD) (rust V): V, N, N
The dog bit the kid. He was taken to a veterinarian; a hospital

Tiger was in Washington for the PGA Tour: Finance; Banking; World News; Sports

Important or not important; love or hate

Page 12:

Classification is well understood. Classification: learn a function f: X → Y that maps observations in a domain to one of several categories.

Theoretically: generalization bounds. How many examples does one need to see in order to guarantee good behavior on previously unobserved examples?

Algorithmically: good learning algorithms for linear representations. They can deal with very high dimensionality (10^6 features), are very efficient in terms of computation and number of examples, and can be run on-line.

Key issues remaining:
Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; adaptation to new domains.
What are the features? No good theoretical understanding here.
Is it sufficient for making progress in NLP?

Page 13:

Significant Progress in NLP and Information Extraction

Page 14:

Comprehension

1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.

A process that maintains and updates a collection of propositions about the state of affairs.

This is an Inference Problem

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

Page 15:

Coherency in Semantic Role Labeling

Predicate-arguments generated should be consistent across phenomena

The touchdown scored by Cutler cemented the victory of the Bears.

Verb Nominalization Preposition

Predicate: score

A0: Cutler (scorer)
A1: The touchdown (points scored)

Predicate: win

A0: the Bears (winner)

Sense: 11(6)

“the object of the preposition is the object of the underlying verb of the nominalization”

Linguistic constraints:
A0: the Bears / Sense(of): 11(6)
A0: Cutler / Sense(by): 1(1)

Page 16:

Semantic Parsing

Successful interpretation involves multiple decisions:
What entities appear in the interpretation? Does "New York" refer to a state or a city?
How do we compose fragments together? state(next_to()) vs. next_to(state())

X: "What is the largest state that borders New York and Maryland?"

Y: largest( state( next_to( state(NY)) AND next_to( state(MD))))

Page 17:

Learning and Inference

Natural language decisions are structured: global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome. It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.

Today: how to support structured predictions in NLP.
Using declarative constraints in Natural Language Processing.
Learning and inference issues.
Mostly examples.

Page 18:

Outline

Background: Natural Language Processing, problems and difficulties

Global inference with expressive structural constraints in NLP: Constrained Conditional Models

Some learning issues in the presence of minimal supervision: Constraints Driven Learning; Learning with Indirect Supervision; Response Based Learning

More examples

Page 19:

Statistics or Linguistics?

Statistical approaches were very successful in NLP. But it has become clear that there is a need to move from strictly data-driven approaches to knowledge-driven approaches.

Knowledge: Linguistics, Background world knowledge

How to incorporate Knowledge into Statistical Learning & Decision Making?

In many respects Structured Prediction addresses this question.

This also distinguishes it from the “standard” study of probabilistic models.

Page 20:

Inference with General Constraint Structure: Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

E1, E2, E3 are the entity mentions (Dole, Elizabeth, N.C.); R12 and R23 are the relations between E1–E2 and E2–E3.

[Figure: local classifier scores over the entity labels {other, per, loc} for E1, E2, E3 (e.g., other 0.05 / per 0.85 / loc 0.10; other 0.10 / per 0.60 / loc 0.30; other 0.05 / per 0.50 / loc 0.45) and over the relation labels {irrelevant, spouse_of, born_in} for R12 and R23 (e.g., irrelevant 0.10 / spouse_of 0.05 / born_in 0.85; irrelevant 0.05 / spouse_of 0.45 / born_in 0.50).]

How to guide the global inference? Why not learn jointly?

Page 21:

Pipeline

Conceptually, pipelining is a crude approximation. Interactions occur across levels, and downstream decisions often interact with previous decisions. This leads to propagation of errors. Occasionally, later-stage problems are easier, but they cannot correct earlier errors.

But there are good reasons to use pipelines: putting everything in one basket may not be right. How about choosing some stages and thinking about them jointly?

Most problems are not single classification problems.

[Pipeline: Raw Data → POS Tagging → Phrases → Semantic Entities → Relations; similarly Parsing, WSD, Semantic Role Labeling.]

Page 22:

Semantic Role Labeling

I left my pearls to my daughter in my will .

[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .

A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location.
I left my pearls to my daughter in my will .

Overlapping arguments.
If A2 is present, A1 must also be present.

Who did what to whom, when, where, why,…

How to express the constraints on the decisions? How to “enforce” them?

Page 23:

Semantic Role Labeling (2/2)

PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Treebank II. (Almost) all the labels are on the constituents of the parse trees.

Core arguments: A0-A5 and AA, with different semantics for each verb, specified in the PropBank frame files.

13 types of adjuncts, labeled as AM-arg, where arg specifies the adjunct type.

Page 24:

Algorithmic Approach

Identify argument candidates: pruning [Xue & Palmer, EMNLP'04]; argument identifier: binary classification (A-Perc).

Classify argument candidates: argument classifier, multi-class classification (A-Perc).

Inference: use the estimated probability distribution given by the argument classifier; use structural and linguistic constraints; infer the optimal global output.

I left my nice pearls to her
[Figure: overlapping bracketings over the sentence, marking the candidate arguments.]

Page 25:

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Table: the argument classifier's probability estimates for each candidate argument over the possible argument labels.]

Page 26:

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Table: the same probability estimates for each candidate argument over the possible argument labels.]

Page 27:

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Table: the same probability estimates for each candidate argument over the possible argument labels.]

One inference problem for each verb predicate.

Page 28:

Integer Linear Programming Inference

For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.

The goal is to maximize ∑_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.

If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.

The Constrained Conditional Model is completely decomposed during training
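Below is a minimal sketch of this inference step in Python using the PuLP ILP package (the system described in the talk used Xpress-MP); the candidate set, label set, and scores are invented toy values, not the talk's.

import pulp

candidates = ["a0", "a1", "a2"]          # argument candidates
labels = ["A0", "A1", "A2", "null"]      # possible argument types
score = {("a0", "A0"): 0.7, ("a0", "A1"): 0.2, ("a0", "A2"): 0.05, ("a0", "null"): 0.05,
         ("a1", "A0"): 0.1, ("a1", "A1"): 0.6, ("a1", "A2"): 0.2,  ("a1", "null"): 0.1,
         ("a2", "A0"): 0.3, ("a2", "A1"): 0.2, ("a2", "A2"): 0.4,  ("a2", "null"): 0.1}

prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
x = {(a, t): pulp.LpVariable(f"x_{a}_{t}", cat="Binary")
     for a in candidates for t in labels}

# Objective: the expected number of correct arguments.
prob += pulp.lpSum(score[a, t] * x[a, t] for a in candidates for t in labels)
# Each candidate receives exactly one label.
for a in candidates:
    prob += pulp.lpSum(x[a, t] for t in labels) == 1
# No duplicate argument classes: each core label is used at most once.
for t in ["A0", "A1", "A2"]:
    prob += pulp.lpSum(x[a, t] for a in candidates) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({a: next(t for t in labels if x[a, t].value() == 1) for a in candidates})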

Page 29:

No duplicate argument classes: ∑_{a ∈ POTARG} x_{a = A0} ≤ 1

R-ARG: ∀ a2 ∈ POTARG, x_{a2 = R-A0} ≤ ∑_{a ∈ POTARG} x_{a = A0}
(If there is an R-ARG phrase, there is an ARG phrase.)

C-ARG: ∀ a2 ∈ POTARG, x_{a2 = C-A0} ≤ ∑_{a ∈ POTARG, a before a2} x_{a = A0}
(If there is a C-ARG phrase, there is an ARG before it.)

Many other possible constraints: unique labels; no overlapping or embedding; relations between the number of arguments; order constraints; if the verb is of type A, no argument of type B.

Any Boolean rule can be encoded as a (collection of) linear constraints; e.g., the rule y1 ∧ y2 ⇒ y3 becomes y1 + y2 − y3 ≤ 1.

Constraints

Joint inference can also be used to combine different (SRL) systems.

Universally quantified rules: LBJ allows a developer to encode constraints in FOL, which are compiled into linear inequalities automatically.

Page 30:

SRL: Formulation & Outcomes

maximize
$$\sum_{i=0}^{n-1} \sum_{y \in \mathcal{Y}} \lambda_{x_i,y}\, \mathbf{1}_{\{y_i = y\}}, \qquad \text{where } \lambda_{x,y} = \lambda \cdot F(x,y) = \lambda_y \cdot F(x)$$

subject to
$$\forall i,\; \sum_{y \in \mathcal{Y}} \mathbf{1}_{\{y_i = y\}} = 1$$
$$\forall y \in \mathcal{Y},\; \sum_{i=0}^{n-1} \mathbf{1}_{\{y_i = y\}} \le 1$$
$$\forall y \in \mathcal{Y}_R,\; \sum_{i=0}^{n-1} \mathbf{1}_{\{y_i = y = \text{``R-Ax''}\}} \le \sum_{i=0}^{n-1} \mathbf{1}_{\{y_i = \text{``Ax''}\}}$$
$$\forall j,\, y \in \mathcal{Y}_C,\; \mathbf{1}_{\{y_j = y = \text{``C-Ax''}\}} \le \sum_{i=0}^{j} \mathbf{1}_{\{y_i = \text{``Ax''}\}}$$

Demo: http://cogcomp.cs.illinois.edu/page/demos

1) Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.
2) Produces a very good semantic parser: F1 ≈ 90%.
3) Easy and fast: ~7 sentences/sec (using Xpress-MP).

Page 31:

Constrained Conditional Models (aka ILP Inference)

y* = argmax_y  w · φ(x, y) − ∑_k ρ_k d(y, 1_{C_k(x)})

w · φ(x, y): the weight vector for the "local" models, over features and classifiers; log-linear models (HMM, CRF) or a combination.
ρ_k: the penalty for violating constraint C_k.
d(y, 1_{C_k(x)}): how far y is from a "legal" assignment; the (soft) constraints component.

How to solve? This is an Integer Linear Program. Solving using ILP packages gives an exact solution; Cutting Planes, Dual Decomposition, and other search techniques are possible.

How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?

Page 32:

Three Ideas (Modeling, Inference, Learning)

Idea 1: Separate modeling and problem formulation from algorithms. Similar to the philosophy of probabilistic modeling.

Idea 2: Keep the model simple, make expressive decisions (via constraints). Unlike probabilistic modeling, where models become more expressive.

Idea 3: Expressive structured decisions can be supervised indirectly via related simple binary decisions. Global inference can be used to amplify the minimal supervision.

Page 33:

Examples: CCM Formulations (aka ILP for NLP)

CCMs can be viewed as a general interface for easily combining declarative domain knowledge with data-driven statistical models.

Sequential prediction, HMM/CRF based: argmax ∑ λ_{ij} x_{ij}
Linguistic constraint: cannot have both A states and B states in an output sequence.

Sentence compression/summarization, language-model based: argmax ∑ λ_{ijk} x_{ijk}
Linguistic constraints: if a modifier is chosen, include its head; if a verb is chosen, include its arguments.

Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequence tagging (HMM/CRF + global constraints); a toy sketch of this appears below.
2. Sentence compression (language model + global constraints)
3. SRL (independent classifiers + global constraints)
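As a toy illustration of formulation (1), the sketch below scores tag sequences with HMM-style emission and transition weights and imposes the declarative constraint from this slide (no A states together with B states). Brute-force search stands in for ILP/Viterbi purely for clarity, and all weights are made up.

from itertools import product

labels = ["A", "B", "O"]
# Invented scores: emission[i][y] for position i, transition[(prev, cur)].
emission = [{"A": 0.9, "B": 0.2, "O": 0.1},
            {"A": 0.3, "B": 0.8, "O": 0.2},
            {"A": 0.1, "B": 0.7, "O": 0.5}]
transition = {(p, c): (0.5 if p == c else 0.1) for p in labels for c in labels}

def score(seq):
    s = sum(emission[i][y] for i, y in enumerate(seq))
    return s + sum(transition[seq[i - 1], seq[i]] for i in range(1, len(seq)))

def legal(seq):
    # Declarative constraint: A states and B states cannot co-occur.
    return not ("A" in seq and "B" in seq)

best = max((s for s in product(labels, repeat=3) if legal(s)), key=score)
print(best)  # ('B', 'B', 'B'): the constrained argmax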

Page 34:

Outline

Background: Natural Language Processing, problems and difficulties

Global inference with expressive structural constraints in NLP: Constrained Conditional Models

Some learning issues in the presence of minimal supervision: Constraints Driven Learning; Learning with Indirect Supervision; Response Based Learning

More examples

Learning structured models requires annotating structures. Interdependencies among decision variables should be exploited in inference and learning. Goal: learn from minimal, indirect supervision; amplify it using the variables' interdependencies.

Page 35:

Information Extraction without Prior Knowledge

Prediction result of a trained HMM on the citation:

Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

[The HMM scatters the labels AUTHOR, TITLE, EDITOR, BOOKTITLE, TECH-REPORT, INSTITUTION, and DATE incorrectly across these segments.]

Violates lots of natural constraints!

Page 36:

Strategies for Improving the Results

(Pure) machine learning approaches: a higher-order HMM/CRF? Increasing the window size? Adding a lot of new features? Increasing the model complexity requires a lot of labeled examples and increases the difficulty of learning.

What if we only have a few labeled examples?

Other options: constrain the output to make sense; push the (simple) model in a direction that makes sense.

Can we keep the learned model simple and still make expressive decisions?

Page 37:

Examples of Constraints

Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE.
Four digits starting with 20xx or 19xx are DATE.
Quotations can appear only in TITLE.
…

Easy to express pieces of "knowledge". Non-propositional; may use quantifiers.

Page 38:

Information Extraction with Constraints

Adding constraints, we get correct results, without changing the model:

[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .

Constrained Conditional Models allow learning a simple model while making decisions with a more complex one. This is accomplished by directly incorporating constraints to bias/re-rank the decisions made by the simpler model.

Page 39:

Guiding (Semi-Supervised) Learning with Constraints

[Diagram: un-labeled data flows into the model together with constraints; constraints also apply at decision time.]

In traditional semi-supervised learning, the model can drift away from the correct one. Constraints can be used to generate better training data: at training, to improve the labeling of un-labeled data (and thus improve the model); at decision time, to bias the objective function towards favoring constraint satisfaction.

Page 40:

This can be viewed as a constrained Expectation Maximization (EM) algorithm. It is the hard EM version; it can be generalized in several directions.

Constraints Driven Learning (CoDL)

[Chang, Ratinov, Roth, ACL'07; ICML'08; MLJ, to appear]. Related to Ganchev et al.'s Posterior Regularization work ['09, '10].

(w0, ρ0) = learn(L)    // a supervised learning algorithm parameterized by (w, ρ)
For N iterations do:
  T = ∅
  For each x in the unlabeled dataset:
    h ← argmax_y w^T φ(x, y) − ∑_k ρ_k d_C(x, y)    // inference with constraints
    T = T ∪ {(x, h)}    // augment the training set
  (w, ρ) = γ (w0, ρ0) + (1 − γ) learn(T)    // learn from the new training data; weigh the supervised and unsupervised models

Learning can be justified as an optimization procedure for an objective function. Excellent experimental results show the advantages of using constraints, especially with small amounts of labeled data [Chang et al., others]. A Python sketch of the loop appears below.

Several Training Paradigms
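A schematic Python rendering of the CoDL loop above; learn() and constrained_inference() are placeholders for whatever base learner and constraint-aware inference one plugs in, and models are assumed to be parameter vectors.

import numpy as np

def interpolate(m0, m1, gamma):
    # Parameter-wise convex combination of the two models.
    return gamma * m0 + (1.0 - gamma) * m1

def codl(labeled, unlabeled, learn, constrained_inference,
         iterations=10, gamma=0.9):
    model0 = learn(labeled)            # (w0, rho0) = learn(L)
    model = model0
    for _ in range(iterations):
        # Label the unlabeled data with the current model, biased toward
        # constraint-satisfying outputs, and use it as new training data.
        pseudo = [(x, constrained_inference(model, x)) for x in unlabeled]
        # Weigh the supervised and unsupervised models.
        model = interpolate(model0, learn(pseudo), gamma)
    return model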

Page 41:

Constraints Driven Learning (CoDL): a semi-supervised learning paradigm that uses constraints to bootstrap from a small number of examples. [Chang, Ratinov, Roth, ACL'07; ICML'08; MLJ, to appear]. Related to Ganchev et al.'s Posterior Regularization work ['09, '10].

[Chart: performance of the objective function vs. the number of available labeled examples, comparing a poor model plus constraints (learning with 10 constraints) against learning without constraints with 300 examples.]

Constraints are used to: bootstrap a semi-supervised learner; correct the weak model's predictions on unlabeled data, which in turn are used to keep training the model.

Page 42:

Constrained Conditional Models [AAAI'08, MLJ'12]

Constrained Conditional Models (ILP formulations) have been shown useful in the context of many NLP problems [Roth & Yih, 04, 07; Chang et al. 07, 08, …]: SRL, summarization, co-reference, information and relation extraction, event identification, transliteration, textual entailment, knowledge acquisition.

Some theoretical work on training paradigms [Punyakanok et al., 05; more].

See a NAACL'10 tutorial on my web page and the NAACL'09 ILPNLP workshop. Summary of work and a bibliography: http://L2R.cs.uiuc.edu/tutorials.html

Page 43:

Outline

Background: Natural Language Processing, problems and difficulties

Global inference with expressive structural constraints in NLP: Constrained Conditional Models

Some learning issues in the presence of minimal supervision: Constraints Driven Learning; Learning with Indirect Supervision; Response Based Learning

More examples

Learning structured models requires annotating structures. Interdependencies among decision variables should be exploited in inference and learning. Goal: learn from minimal, indirect supervision; amplify it using the variables' interdependencies.

Page 44:

Connecting Language to the World [CoNLL'10, ACL'11, IJCAI'11]

"Can I get a coffee with no sugar and just a bit of milk"
→ Semantic Parser → MAKE(COFFEE, SUGAR=NO, MILK=LITTLE)
→ the system acts, and the user reacts ("Great!" or "Arggg").

Can we rely on this interaction to provide supervision? This requires that we use the minimal binary supervision ("good structure" / "bad structure") as a way to learn how to generate good structures.

Page 45:

Key Ideas in Learning Structures

Idea 1: Simple, easy-to-supervise binary decisions often depend on the structure you care about. Learning to do well on the binary task can drive the structure learning.

Idea 2: Global inference can be used to amplify the minimal supervision.

Idea 2½: There are several settings where a binary label can be used to replace a structured label. Perhaps the most intriguing is where you use the world's response to the model's actions.

Page 46:

I. Paraphrase Identification

Consider the following sentences:

S1: Druce will face murder charges, Conte said.

S2: Conte said Druce will be charged with murder .

Are S1 and S2 paraphrases of each other? There is a need for an intermediate representation to justify this decision.

Given an input x ∈ X, learn a model f : X → {−1, 1}.

We need latent variables that explain why this is a positive example:

Given an input x ∈ X, learn a model f : X → H → {−1, 1}.

[Diagram: X → H → Y]

Page 47:

Algorithms: Two Conceptual Approaches

Two-stage approach (a pipeline; typically used for TE, paraphrase identification, and others): learn the hidden variables and fix them. This needs supervision for the hidden layer (or heuristics); for each example, extract features over x and (the fixed) h, then learn a binary classifier for the target task.

Proposed approach: joint learning. Drive the learning of h from the binary labels; find the best h(x). An intermediate structure representation is good to the extent it supports better final prediction. Algorithm? How do we drive learning a good H?

[Diagram: X → H → Y]

Page 48:

Algorithmic Intuition

If x is positive, there must exist a good explanation (intermediate representation): ∃ h, w^T φ(x, h) ≥ 0; or, max_h w^T φ(x, h) ≥ 0.

If x is negative, no explanation is good enough to support the answer: ∀ h, w^T φ(x, h) ≤ 0; or, max_h w^T φ(x, h) ≤ 0.

Altogether, this can be combined into an objective function:

$$\min_w \; \frac{\lambda}{2}\|w\|^2 + C \sum_i L\Big(1 - z_i \max_{h \in C} w^T \sum_s h_s \phi_s(x_i)\Big)$$

Why does inference help? It constrains the intermediate representations to support good predictions. The inference step finds the best h subject to the constraints C; the chosen h selects a representation, yielding a new feature vector for the final decision.

Page 49:

Optimization

The problem is non-convex, due to the maximization term inside the global minimization problem. In each iteration:
Find the best feature representation h* for all positive examples (off-the-shelf ILP solver).
Having fixed the representation for the positive examples, update w by solving the resulting convex optimization problem.

This is not the standard SVM/LR: it needs inference. Asymmetry: only positive examples require a good intermediate representation that justifies the positive label. Consequently, the objective function decreases monotonically. A sketch of the loop follows.
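A minimal numpy sketch of this alternating optimization, under the simplifying assumption that the legal hidden structures form a small, explicitly enumerable set of indicator vectors; phi and the toy data are invented for illustration.

import numpy as np

def phi(x, h):
    # Feature vector of x under hidden structure h; toy choice: h masks x.
    return h * x

def best_h(w, x, legal_hs):
    # Inference: the constraint-satisfying h maximizing w . phi(x, h).
    return max(legal_hs, key=lambda h: float(w @ phi(x, h)))

def train_lclr(examples, legal_hs, dim, lam=0.1, C=1.0,
               outer=20, inner=10, lr=0.05):
    w = np.zeros(dim)
    for _ in range(outer):
        # Step 1: fix the best representation for each positive example.
        h_pos = {i: best_h(w, x, legal_hs)
                 for i, (x, z) in enumerate(examples) if z == +1}
        # Step 2: subgradient descent on lam/2 ||w||^2 +
        # C * sum_i hinge(1 - z_i w . phi(x_i, h_i)); h for negatives is
        # re-inferred at every step (a max of linear functions is convex).
        for _ in range(inner):
            g = lam * w
            for i, (x, z) in enumerate(examples):
                h = h_pos[i] if z == +1 else best_h(w, x, legal_hs)
                if 1.0 - z * float(w @ phi(x, h)) > 0.0:
                    g = g - C * z * phi(x, h)
            w = w - lr * g
    return w

# Toy usage: 3-dim inputs, hidden structures = one-hot masks.
legal_hs = [np.eye(3)[k] for k in range(3)]
examples = [(np.array([2.0, 0.1, 0.0]), +1), (np.array([0.1, 0.1, 0.1]), -1)]
w = train_lclr(examples, legal_hs, dim=3)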

Page 50:

Formalized as a Structured SVM with constrained hidden structure. LCLR: Learning over Constrained Latent Representations.

Iterative objective function learning:
Initial objective function → inference: best h subject to C (the ILP inference discussed earlier; restrict the possible hidden structures considered) → prediction with the inferred h (generate features) → training with respect to the binary decision label (feedback relative to the binary problem) → update the weight vector → repeat.

Page 51:

Experimental Results

[Tables: results on Transliteration, Recognizing Textual Entailment, and Paraphrase Identification.]

Page 52:

II. Structured Prediction

Before, the structure was in the intermediate level: we cared about the structured representation only to the extent it helped the final binary decision, and the binary decision variable was given as supervision.

What if we care about the structure? Information extraction, relation extraction, POS tagging, and many others.

Invent a companion binary decision problem!
Parse citations: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
Companion: Given a citation, does it have a legitimate citation parse?
POS tagging. Companion: Given a word sequence, does it have a legitimate POS tagging sequence?
Binary supervision is almost free.

[Diagram: X → H → Y]

Page 53:

Predicting Phonetic Alignment (for Transliteration)

Target task: input, an English named entity and its Hebrew transliteration; output, the phonetic alignment (a character-sequence mapping). A structured-output prediction task (many constraints), hard to label.

Companion task: input, an English named entity and a Hebrew named entity; output, do they form a transliteration pair? A binary-output problem, easy to label. Negative examples are free, given positive examples.

[Diagram: the target task aligns an English name such as "Italy" character-by-character with its Hebrew counterpart; the companion task only answers Yes/No for a pair such as "Illinois" and a Hebrew name. Why is it a companion task?]

Page 54:

Companion Task: Binary Label as Indirect Supervision

The two tasks are related just like the binary and structured tasks discussed earlier: all positive examples must have a good structure, and negative examples cannot have a good structure. We are in the same setting as before.

Positive transliteration pairs must have "good" phonetic alignments; negative transliteration pairs cannot have "good" phonetic alignments.

Binary labeled examples are easier to obtain, and we can take advantage of this to help learn a structured model. Algorithm: combine binary learning and structured learning.

Page 55:

Joint Learning Framework

Joint learning: if available, make use of both supervision types.

$$\min_w \; \frac{1}{2}\|w\|^2 \;+\; C_1 \sum_{i \in S} L_S(x_i, y_i; w) \;+\; C_2 \sum_{i \in B} L_B(x_i, z_i; w)$$

The first sum is the loss on the target (structured) task, the second the loss on the companion (binary) task. The loss function is the same as described earlier. Key: the same parameter vector w is used for both components, as sketched below.

[Diagram: the target task (phonetic alignment of a pair such as "Italy" and its Hebrew form) and the companion task (Yes/No for a pair such as "Illinois" and a Hebrew name) share w.]
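In code form, the combined objective is simply the sum of the two loss terms over one shared weight vector; loss_S and loss_B below are placeholders for the structured and binary losses described earlier.

def joint_objective(w, structured, binary, loss_S, loss_B, C1=1.0, C2=1.0):
    # min_w 1/2 ||w||^2 + C1 * sum_{i in S} L_S(x_i, y_i; w)
    #                   + C2 * sum_{i in B} L_B(x_i, z_i; w)
    reg = 0.5 * sum(wi * wi for wi in w)
    return (reg
            + C1 * sum(loss_S(x, y, w) for x, y in structured)
            + C2 * sum(loss_B(x, z, w) for x, z in binary))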

Page 56:

Experimental Result

Very little direct (structured) supervision.

Page 57:

Experimental Result

Very little direct (structured) supervision, plus a large amount of (almost free) binary indirect supervision.

Page 58:

Driving the Supervision Signal from the World's Response

"Can I get a coffee with no sugar and just a bit of milk"
→ Semantic Parser → MAKE(COFFEE, SUGAR=NO, MILK=LITTLE)
→ the system acts, and the user reacts ("Great!" or "Arggg").

Can we rely on this interaction to provide supervision?

Page 59:

Semantic parsing is a structured prediction problem: identify mappings from text to a meaning representation.

NL query x: "What is the largest state that borders NY?"
Logical query y: largest( state( next_to( const(NY))))
Query response r: Pennsylvania (real-world feedback from the interactive computer system)

Traditional approach: learn from logical forms and gold alignments. EXPENSIVE! Our approach: use only the responses.

Supervision = expected response: check whether the predicted response equals the expected response.
Expected: Pennsylvania, Predicted: Pennsylvania → positive response.
Expected: Pennsylvania, Predicted: NYC → negative response.

Train a structured predictor with this binary supervision!

Page 60:

Response Based Learning

X: What is the largest state that borders NY?
Y: largest( state( next_to( const(NY))))

Use the expected response as supervision. Additional difficulty: the algorithm generates potential structures, which are tested against the DB oracle; it is difficult to generate positive examples (good structures).

Learning approach: iteratively identify more correct structures.
DIRECT protocol: convert the learning problem into binary prediction.
AGGRESSIVE protocol: convert the feedback into structured supervision.
COMBINED approach, with a weighted loss function.

The domain's semantics is used to constrain interpretations via declarative constraints: lexical resources (WordNet); type consistency; distance in the sentence, in the dependency tree, …

Repeat
  for all input sentences do
    find the best structured output
    query the feedback function
  end for
  learn a new w using the feedback
Until convergence

A Python sketch of this loop follows.
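In the sketch below, best_structure() stands for constrained inference with the current weights, feedback() for the DB oracle comparing predicted and expected responses, and learn() for the (DIRECT-style) binary learner; all names are illustrative placeholders.

import numpy as np

def converged(w_old, w_new, tol=1e-4):
    return float(np.linalg.norm(w_new - w_old)) < tol

def response_based_learning(sentences, best_structure, feedback, learn, w,
                            max_rounds=20):
    for _ in range(max_rounds):
        training = []
        for x in sentences:
            y = best_structure(w, x)     # find the best structured output
            z = feedback(x, y)           # query the feedback function: +1/-1
            training.append((x, y, z))   # a binary example over structures
        w_new = learn(training)          # learn a new w using the feedback
        if converged(w, w_new):
            break
        w = w_new
    return w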

Page 61:

Empirical Evaluation [CoNLL'10, ACL'11, current work]

Key question: can we learn from this type of supervision?

Algorithm                                         | # training structures | Test set accuracy
No learning: initial objective function           | 0                     | 22.2%
Binary signal: DIRECT protocol                    | 0                     | 69.2%
Binary signal: AGGRESSIVE protocol                | 0                     | 73.2%
Binary signal: COMBINED protocol                  | 0                     | 81.6%
WM* 2007 (fully supervised; uses gold structures) | 310                   | 75%

*[WM] Y.-W. Wong and R. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. ACL.

Current emphasis: learning to understand natural language instructions for games via response-based learning.

Page 62:

Data vs. Annotated Data

One of the key challenges facing machine learning today is the lack of annotated data. We do have huge amounts of data… but not annotated as required by many tasks: semantic parsing; event extraction and analysis; co-reference resolution; …

Our goal was to show that sometimes minimal, easy-to-come-by supervision can be amplified and used.

The next case provides an extreme example.

Page 63:

Reference Resolution

Page 64:

Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.

Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."

Document 3: David Kennedy was born in Leicester, England in 1959. … Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).

[Each document mentions "Kennedy": which real-world entity does each mention refer to?]

The Reference Problem: the same problem exists with other types of entities and concepts. Decisions with respect to entities and concepts are interdependent (and depend on a lot more things beyond the variables of interest).

Page 65:

Organizing knowledge

It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the ”N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 66:

Organizing knowledge

It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the ”N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 67:

Cross-document co-reference resolution

It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the ”N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 68:

Reference resolution: (disambiguation to Wikipedia)


It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the ”N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 69:

The “reference” collection has structure


It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the ”N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

[Diagram: the reference collection's structure: Used_In, Is_a, Succeeded, and Released relations among the referents.]

Page 70:

Analysis of Information Networks


It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the ”N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 71:

Wikipedia as a knowledge (and supervision) resource

[Diagram: Wikipedia pages connected by Used_In, Is_a, Succeeded, and Released relations.]

Page 72:

Problem formulation: a matching/ranking problem between text documents (news, blogs, …) and Wikipedia articles. We view this, too, as a constrained optimization problem.

Page 73:

Local Approach

Γ is a solution to the problem: a set of pairs (m, t), where m is a mention in the document (news, blogs, …) and t is the matched Wikipedia title.

Page 74:

Local Approach (continued)

Each pair (m, t) in Γ contributes a local score for matching the mention to the title; the local approach chooses, for each mention, the title with the highest local score.

Page 75:

A Global Augmentation

1. Invent a surrogate solution Γ': disambiguate each mention independently.
2. Evaluate the structure based on pairwise coherence scores Ψ(ti, tj).

Training the models is very involved and relies on the correctness of the (partial) link structure in Wikipedia, but it requires no annotation, relying on Wikipedia itself. A sketch of the two steps appears below.
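A small sketch of the two steps just described; phi(m, t) is a local mention-title score and psi(t1, t2) a pairwise title-coherence score, both placeholders for learned scoring functions.

def disambiguate(mentions, candidates, phi, psi):
    # Step 1: surrogate solution Gamma' -- disambiguate each mention
    # independently by its local score.
    gamma = {m: max(candidates[m], key=lambda t: phi(m, t)) for m in mentions}
    # Step 2: evaluate the structure by pairwise coherence over the titles.
    titles = list(gamma.values())
    coherence = sum(psi(titles[i], titles[j])
                    for i in range(len(titles))
                    for j in range(i + 1, len(titles)))
    return gamma, coherence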

Page 76:

Demo: state-of-the-art results [ACL'11]

Page 77:

Wikification: Demo Screen Shot (Demo)

http://en.wikipedia.org/wiki/Mahmoud_Abbas

Page 78:

Robustness Across Domains (Demo)

http://en.wikipedia.org/wiki/Bone_tumor
http://en.wikipedia.org/wiki/Protein_precursor

The training method used resulted in robustness across domains.

Page 79:

Conclusion

Understanding natural language requires making decisions that rely both on statistical learning and on the incorporation of (declarative) background knowledge.

We presented Constrained Conditional Models: a computational framework for learning and inference that augments probabilistic models with declarative constraints (within an ILP formulation).

We discussed learning protocols that aim at reduced annotation cost: Constraints Driven Learning (CoDL); learning structure with indirect supervision; response based learning.

Many open issues…

Check out our tools & demos

Page 80:

Questions?

Thank you