CSE 454 Advanced Internet Systems
Machine Learning for Extraction
Dan Weld
Logistics
• Project Warm-Up
  – Due this Sunday
• Computing
  – $100 EC2 credit per student
• Team Selection
  – Topic survey later this week
An (Incomplete) Timeline of UW MR Systems
[Timeline figure, 1997–2011+: ShopBot, WIEN, Mulder, RESOLVER, KnowItAll, REALM, Opine, LEX, TextRunner, Kylin, Luchs, KOG, WOE, USP, OntoUSP, SNE, Velvet, PrecHybrid, Holmes, Sherlock, WebTables, LDA-SP, IIAUCR, AuContraire, SRL-IE, ReVerb, MultiR, Figer]
Perspective
• Crawling the Web
• Inverted indices
• Query processing
• Pagerank computation & ranking
• Search UI
• Computational advertising
• Security & malware
• Social systems
• Information extraction
Today’s Outline
• Supervised Learning – Compact Introduction
  – Learning as Function Approximation
  – Need for Bias
  – Overfitting
  – Bias / Variance Tradeoff
  – Loss Functions; Regularization; Learning as Optimization
  – Curse of Dimensionality
  – Logistic Regression
• IE as Supervised Learning
• Features for IE
Terminology
• Examples
  – Features
  – Labels
• Training Set
• Validation Set
• Test Set

Input: { … <X1, …, Xk, Y> … }
Output: hypothesis h: X → Y approximating the target function F: X → Y
Objective: Minimize error of h on (unseen) test examples
Learning is Function Approximation
Linear Regression
hw(x) = w1x + w0
Classifier: Y (Range of F) is Discrete
[Scatter plot: training examples labeled + and –, a hypothesis separating them, and unlabeled points marked ?]
Hypothesis: Function for labeling examples
Labels: + and –
Objective: Minimize error on test examples
[Same scatter plot of + and – examples, with the unlabeled ? points to be classified by the hypothesis]
So… How good is this hypothesis?
Objective: Minimize error on test examples
[Same scatter plot of + and – examples, now showing only the labeled training data]
Only know training data, so minimize error on that
Loss(F) = Σj=1..n |yj – F(xj)|
Generalization
• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples yields a [consistent] hypothesis that does not generalize.
Why is Learning Possible?
Experience alone never justifies any conclusion about any unseen instance.
Learning occurs when PREJUDICE meets DATA!
Learning a “Frobnitz”
[Example images, some labeled “Frobnitz” and some labeled “Not a Frobnitz”]
Bias
• The nice word for prejudice is “bias”
  – Different from “bias” in statistics
• What kind of hypotheses will you consider?
  – What is the allowable range of approximation functions?
  – E.g., conjunctions, linear functions
• What kind of hypotheses do you prefer?
  – E.g., simple hypotheses (Occam’s Razor): few parameters, small parameters
Fitting a Polynomial
Overfitting
[Plot: accuracy (0.6–0.9) vs. model complexity (e.g., number of parameters in the polynomial), with one curve for accuracy on training data and one for accuracy on test data]
Bias / Variance Tradeoff
Slide from T. Dietterich
• Variance: E[ (h(x*) – h̄(x*))² ], where h̄ is the average hypothesis over training sets
  How much h(x*) varies between training sets
  Reducing variance risks underfitting
• Bias: [ h̄(x*) – f(x*) ]
  Describes the average error of h(x*)
  Reducing bias risks overfitting
Learning as Optimization
• Loss Function
  – Loss(h, data) = error(h, data) + complexity(h)
  – Error + regularization
  – Minimize loss over training data
• Optimization Methods
  – Closed form
  – Greedy search
  – Gradient ascent
Effect of Regularization
Loss(hw) = Σj=1..n (yj – (w1xj + w0))² + λ Σi=1..k |wi|
[Figure: fitted curves for different regularization strengths; ln λ = -25]
[Figure: regularized vs. unregularized fit]
Curse of Dimensionality
• Intuitions fail
• Hard to distinguish hypotheses
A Great Learning Algorithm
• Logistic Regression
Univariate Linear Regression
hw(x) = w1x + w0
Loss(hw) = Σj=1..n L2(yj, hw(xj)) = Σj=1..n (yj – hw(xj))² = Σj=1..n (yj – (w1xj + w0))²
Understanding Weight Space
hw(x) = w1x + w0
Loss(hw) = Σj=1..n (yj – (w1xj + w0))²
Finding Minimum Loss
hw(x) = w1x + w0
Loss(hw) = Σj=1..n (yj – (w1xj + w0))²
Argminw Loss(hw): set ∂Loss(hw)/∂w0 = 0 and ∂Loss(hw)/∂w1 = 0
Unique Solution!
hw(x) = w1x + w0
Argminw Loss(hw):
  w1 = ( N Σj xjyj – (Σj xj)(Σj yj) ) / ( N Σj xj² – (Σj xj)² )
  w0 = ( (Σj yj) – w1 (Σj xj) ) / N
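These two formulas translate directly into a few lines of NumPy; a minimal sketch (the data points below are made up for illustration, not from the slides):

```python
# Closed-form univariate linear regression via the normal equations (illustrative data).
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.5])
y = np.array([1.1, 1.4, 2.1, 2.6, 3.2, 3.9])
N = len(x)

w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N

print(w1, w0)               # slope and intercept from the formulas above
print(np.polyfit(x, y, 1))  # least-squares fit from NumPy; should agree: [w1, w0]
```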
Could also Solve Iteratively
Argminw Loss(hw):
  w := any point in weight space
  Loop until convergence:
    For each wi in w do
      wi := wi – α ∂Loss(w)/∂wi
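A minimal sketch of that loop for the same univariate loss (the learning rate α, the iteration budget, and the data are illustrative choices):

```python
# Gradient descent on the sum-of-squares loss for hw(x) = w1*x + w0 (illustrative data and step size).
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.5])
y = np.array([1.1, 1.4, 2.1, 2.6, 3.2, 3.9])

w0, w1 = 0.0, 0.0           # any point in weight space
alpha = 0.01                # learning rate
for _ in range(5000):       # "until convergence" approximated by a fixed budget here
    err = (w1 * x + w0) - y
    grad_w0 = 2 * np.sum(err)        # dLoss/dw0
    grad_w1 = 2 * np.sum(err * x)    # dLoss/dw1
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)               # approaches the closed-form solution above
```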
Multivariate Linear Regression
hw(xj) = w0 + Σi wi xj,i = wTxj   (absorbing w0 via a constant feature xj,0 = 1)
Argminw Loss(hw)
Unique solution: w* = (XTX)–1XTy
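In NumPy the unique solution is essentially one line; a sketch on synthetic data (solving the linear system rather than forming the inverse explicitly, which is the numerically safer equivalent):

```python
# Multivariate least squares via the normal equations, w* = (X^T X)^{-1} X^T y (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 examples, 3 features
X = np.hstack([np.ones((100, 1)), X])     # prepend the constant feature so w[0] acts as w0
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)                             # close to true_w
```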
Problem….
Overfitting
Regularize!! Penalize high weights:
Loss(hw) = Σj=1..n (yj – (w1xj + w0))² + λ Σi=1..k wi²
Alternatively…
Loss(hw) = Σj=1..n (yj – (w1xj + w0))² + λ Σi=1..k |wi|
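A hedged sketch of the two penalties using scikit-learn (an illustrative tool choice, not prescribed by the slides): Ridge implements the squared-weight (L2) penalty and Lasso the absolute-value (L1) penalty.

```python
# Contrasting L2 (Ridge) and L1 (Lasso) regularization on synthetic data where only two features matter.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2: shrinks all weights toward zero
print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: tends to drive irrelevant weights exactly to zero
```

The L1 penalty’s tendency to zero out weights is why it is often used for feature selection.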
Regularization
[Figure comparing the L1 and L2 penalties]
Back to Classification
[Figure: region where P(edible|X) = 1, region where P(edible|X) = 0, separated by the decision boundary]
Logistic Regression
Learn P(Y|X) directly! Assume a particular functional form.
[Figure: a hard threshold that jumps from P(Y)=0 to P(Y)=1 at the decision boundary]
Not differentiable…
Logistic Regression
Learn P(Y|X) directly! Assume a particular functional form: the logistic function (aka sigmoid).
Logistic Function in n Dimensions
Sigmoid applied to a linear function of the data:
Features can be discrete or continuous!
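Concretely (a small NumPy sketch; it follows the convention used later in these slides, where P(Y=0|X,w) = 1 / (1 + exp(w0 + Σi wiXi)) and P(Y=1|X,w) is the complement; the weights and features are made up):

```python
# Sigmoid applied to a linear function of the features.
import numpy as np

def p_y0_given_x(x, w0, w):
    """P(Y=0 | x, w) = 1 / (1 + exp(w0 + w.x)); P(Y=1 | x, w) = 1 - P(Y=0 | x, w)."""
    return 1.0 / (1.0 + np.exp(w0 + np.dot(w, x)))

x = np.array([1.2, -0.7, 3.0])      # feature values; discrete or continuous both work
w = np.array([0.5, -1.0, 0.25])     # one weight per feature
print(p_y0_given_x(x, w0=-0.3, w=w))
```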
Understanding Sigmoids
[Three sigmoid curves plotted over x in [-6, 6], y in [0, 1], for (w0=0, w1=-1), (w0=-2, w1=-1), and (w0=0, w1=-0.5)]
Very convenient!
P(Y=0|X,w) = 1 / (1 + exp(w0 + Σi wiXi))
implies P(Y=1|X,w) = exp(w0 + Σi wiXi) / (1 + exp(w0 + Σi wiXi))
implies ln [ P(Y=1|X,w) / P(Y=0|X,w) ] = w0 + Σi wiXi
implies a linear classification rule!
© Carlos Guestrin 2005-2009
Loss Functions: Likelihood vs. Conditional Likelihood
• Generative (Naïve Bayes) loss function: data likelihood
• Discriminative (Logistic Regression) loss function: conditional data likelihood
• Discriminative models can’t compute P(xj|w)! Or, … “they don’t waste effort learning P(X)”: focus only on P(Y|X), which is all that matters for classification
© Carlos Guestrin 2005-2009
Expressing Conditional Log Likelihood
©Carlos Guestrin 2005-2009
l(w) = Σj [ yj ln P(Y=1|xj,w) + (1 – yj) ln P(Y=0|xj,w) ]
  (yj is 1 when the correct answer is 1, selecting the probability of predicting 1;
   (1 – yj) is 1 when the correct answer is 0, selecting the probability of predicting 0)

ln P(Y=0|X,w) = –ln(1 + exp(w0 + Σi wiXi))
ln P(Y=1|X,w) = (w0 + Σi wiXi) – ln(1 + exp(w0 + Σi wiXi))
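A small NumPy sketch of this quantity on made-up data (using the algebraically simplified form l(w) = Σj [ yj zj – ln(1 + exp(zj)) ] with zj = w0 + Σi wi xj,i; no numerical stabilization):

```python
# Conditional log likelihood of the labels under a logistic regression model (toy data).
import numpy as np

def cond_log_likelihood(w0, w, X, y):
    """l(w) = sum_j [ y_j * z_j - ln(1 + exp(z_j)) ], where z_j = w0 + w . x_j."""
    z = w0 + X @ w
    return np.sum(y * z - np.log1p(np.exp(z)))

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(cond_log_likelihood(-1.0, np.array([1.0, -0.5]), X, y))
```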
Maximizing Conditional Log Likelihood
Bad news: no closed-form solution to maximize l(w)
Good news: l(w) is a concave function of w!
  – No suboptimal local optima
  – Concave functions are easy to optimize
©Carlos Guestrin 2005-2009
Optimizing Concave Functions: Gradient Ascent
Conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent.
Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).
Gradient: ∇w l(w) = [ ∂l(w)/∂w0, …, ∂l(w)/∂wn ]
Update rule, with learning rate η > 0: w ← w + η ∇w l(w)
© Carlos Guestrin 2005-2009
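A minimal sketch of that update loop in NumPy (the data, η, and iteration count are illustrative; it uses the fact that for this model ∂l(w)/∂wi = Σj xj,i (yj – P(Y=1|xj,w))):

```python
# Gradient ascent on the conditional log likelihood for logistic regression (toy data).
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
y = np.array([0, 0, 1, 1, 1])

Xb = np.hstack([np.ones((len(X), 1)), X])   # constant feature, so w[0] plays the role of w0
w = np.zeros(Xb.shape[1])
eta = 0.1                                   # learning rate

for _ in range(1000):
    p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))    # P(Y=1 | x, w)
    grad = Xb.T @ (y - p1)                  # gradient of l(w)
    w = w + eta * grad                      # ascend: we are maximizing l(w)

print(w)
```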
Earthquake or Nuclear Test?
[Scatter plot of examples in feature space (x1, x2)]
If P(Y=0|X,w) / P(Y=1|X,w) > 1, then predict Y=0
Taking logs, this is a linear classification rule!
Logistic Regression with Initial Weights: w0=20, w1= -5, w2 =10
[Left: examples in feature space (x1, x2) with the current decision boundary; right: surface of l(w) over (w0, w1)]
Loss(Hw) = Error(Hw, data); minimizing error here means maximizing l(w) = ln P(DY | DX, Hw)
Update rule: w ← w + η ∇w l(w)
Gradient Ascent: w0=40, w1= -10, w2 =5
[Examples in feature space (x1, x2) with the updated decision boundary; surface of l(w) over (w0, w1)]
Maximize l(w) = ln P(DY | DX, Hw)
Update rule: w ← w + η ∇w l(w)
IE as Classification
+  Citigroup has taken over EMI, the British …
+  Citigroup’s acquisition of EMI comes just ahead of …
–  Google’s Adwords system has long included ways to connect to Youtube.
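To make the framing concrete, a hedged toy sketch (scikit-learn is an illustrative tool choice, and the unseen test sentence is hypothetical): treat "does this sentence express an acquisition?" as bag-of-words logistic regression over the three examples above.

```python
# Toy sketch: the acquisition relation as sentence classification with logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Citigroup has taken over EMI, the British ...",
    "Citigroup's acquisition of EMI comes just ahead of ...",
    "Google's Adwords system has long included ways to connect to Youtube.",
]
labels = [1, 1, 0]   # + + -

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["Facebook has taken over Instagram."]))   # hopefully labeled +
```

A real extractor would classify candidate instances with much richer features, which is what the preprocessing below provides.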
Preprocessed Data Files
Each line corresponds to a sentence: "John likes eating sausage."
  tokens (after tokenization):  John likes eating sausage .
  pos (part-of-speech tags):  John/NNP likes/VBZ eating/VBG sausage/NN ./.
Grade School: “9 parts of speech in English”
• Noun
• Verb
• Article
• Adjective
• Preposition
• Pronoun
• Adverb
• Conjunction
• Interjection
But: plurals, possessive, case, tense, aspect, ….
Preprocessed Data Files
Each line corresponds to a sentence: "John likes eating sausage."
  tokens (after tokenization):  John likes eating sausage .
  pos (part-of-speech tags):  John/NNP likes/VBZ eating/VBG sausage/NN ./.
  ner (named entities)
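One way to produce these layers is with an off-the-shelf NLP pipeline; a hedged sketch with spaCy (an illustrative tool choice, not prescribed by the slides; it assumes the en_core_web_sm model has been downloaded):

```python
# Tokens, POS tags, and named entities for one sentence using spaCy (illustrative tool choice).
import spacy

nlp = spacy.load("en_core_web_sm")             # assumes: python -m spacy download en_core_web_sm
doc = nlp("John likes eating sausage.")

print([t.text for t in doc])                   # tokens
print([(t.text, t.tag_) for t in doc])         # Penn Treebank tags such as NNP, VBZ, VBG, NN
print([(e.text, e.label_) for e in doc.ents])  # named entities ("John" should come out as PERSON)
```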
Text as Vectors
• Each document, j, can be viewed as a vector of frequency values
  – one component for each word (or phrase)
• So we have a vector space
  – words (or phrases) are axes
  – documents live in this space
  – even with stemming, may have 20,000+ dimensions
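A minimal vectorization sketch (scikit-learn's CountVectorizer is one convenient option; the two documents are made up):

```python
# Documents as term-frequency vectors: one axis per word, one vector per document.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "citigroup acquires emi",
    "google connects adwords to youtube",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: rows = documents, columns = words

print(vectorizer.get_feature_names_out())   # the word axes
print(X.toarray())                          # frequency values for each document
```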
Vector Space Representation
[Figure: documents and a query as points in the vector space]
Documents that are close to the query (measured using a vector-space metric) are returned first.
slide from Raghavan, Schütze, Larson
Lemmatization
• Reduce inflectional/variant forms to base form
  – am, are, is → be
  – car, cars, car’s, cars’ → car
• “the boy’s cars are different colors” → “the boy car be different color”
slide from Raghavan, Schütze, Larson
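A small sketch with NLTK's WordNet lemmatizer (one possible tool, not implied by the slides; it assumes the WordNet data has been downloaded, and it must be told the part of speech to map "are" to "be"):

```python
# Lemmatization with NLTK's WordNet lemmatizer (illustrative tool choice).
from nltk.stem import WordNetLemmatizer   # assumes: nltk.download("wordnet") has been run

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # treated as a noun by default -> "car"
print(lemmatizer.lemmatize("are", pos="v"))   # verbs must be flagged explicitly -> "be"
print(lemmatizer.lemmatize("colors"))         # -> "color"
```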
Stemming
• Reduce terms to their “roots” before indexing
  – language dependent
  – e.g., automate(s), automatic, automation all reduced to automat
Original: “for example compressed and compression are both accepted as equivalent to compress.”
Stemmed: “for exampl compres and compres are both accept as equival to compres.”
slide from Raghavan, Schütze, Larson
Porter’s algorithm
• Common algorithm for stemming English
• Conventions + 5 phases of reductions
  – phases applied sequentially
  – each phase consists of a set of commands
  – sample convention: of the rules in a compound command, select the one that applies to the longest suffix
• Porter’s stemmer available: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
slide from Raghavan, Schütze, Larson
Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
slide from Raghavan, Schütze, Larson
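These rules are easy to try directly; a hedged sketch using NLTK's PorterStemmer (one convenient implementation; the word list is illustrative):

```python
# Applying the Porter stemmer to a few words (NLTK's implementation, chosen for illustration).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "conditional", "compression", "automation"]:
    print(word, "->", stemmer.stem(word))
```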
Challenges
• Sandy
• Sanded
• Sander
→ Sand ???
slide from Raghavan, Schütze, Larson
Why Extract Temporal Information?
• Many relations and events are temporally bounded
  – a person’s place of residence or employer
  – an organization’s members
  – the duration of a war between two countries
  – the precise time at which a plane landed
  – …
• Temporal Information Distribution
  – One of every fifty lines of database application code involves a date or time value (Snodgrass, 1998)
  – Each news document in PropBank (Kingsbury and Palmer, 2002) includes eight temporal arguments
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Time-Intensive Slot Types
Person: per:alternate_names, per:title, per:date_of_birth, per:member_of, per:age, per:employee_of, per:country_of_birth, per:religion, per:stateorprovince_of_birth, per:spouse, per:city_of_birth, per:children, per:origin, per:parents, per:date_of_death, per:siblings, per:country_of_death, per:other_family, per:stateorprovince_of_death, per:charges, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended
Organization: org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Temporal Expression Examples (reference date = December 8, 2012)
Expression → value in TIMEX format:
• December 8, 2012 → 2012-12-08
• Friday → 2012-12-07
• today → 2012-12-08
• 1993 → 1993
• the 1990's → 199X
• midnight, December 8, 2012 → 2012-12-08T00:00:00
• 5pm → 2012-12-08T17:00
• the previous day → 2012-12-07
• last October → 2011-10
• last autumn → 2011-FA
• last week → 2012-W48
• Thursday evening → 2012-12-06TEV
• three months ago → 2012-09
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Temporal Expression Extraction
• Rule-based (Strötgen and Gertz, 2010; Chang and Manning, 2012; Do et al., 2012)
• Machine Learning
  – Risk minimization model (Boguraev and Ando, 2005)
  – Conditional Random Fields (Ahn et al., 2005; UzZaman and Allen, 2010)
• State of the art: about 95% F-measure for extraction and 85% F-measure for normalization
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
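As a flavor of the rule-based approach, a deliberately tiny, hypothetical sketch (real systems such as the rule-based ones cited above cover far more patterns): a few rules that map expressions to TIMEX-style values relative to a reference date.

```python
# A toy rule-based normalizer for a few temporal expressions (illustrative only, not a real TIMEX system).
import re
from datetime import date, timedelta

REFERENCE = date(2012, 12, 8)   # the reference date used in the example table above

def normalize(expr, ref=REFERENCE):
    expr = expr.lower().strip()
    if expr == "today":
        return ref.isoformat()                          # -> 2012-12-08
    if expr == "the previous day":
        return (ref - timedelta(days=1)).isoformat()    # -> 2012-12-07
    m = re.fullmatch(r"(\d+) months ago", expr)         # digits only in this toy version
    if m:
        months = int(m.group(1))
        year = ref.year + (ref.month - 1 - months) // 12
        month = (ref.month - 1 - months) % 12 + 1
        return f"{year}-{month:02d}"                    # "3 months ago" -> 2012-09
    return None                                         # everything else is unhandled

print(normalize("today"), normalize("the previous day"), normalize("3 months ago"))
```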
Ordering events in discourse
• (1) John entered the room at 5:00pm.
• (2) It was pitch black.
• (3) It had been three days since he’d slept.
[Timeline figure with: Time: now; Time: 5pm; Event: John entered the room; State: pitch black; State: John slept, Time: 3 days]
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Ordering events in time
Speech (S), Event (E), & Reference (R) time (Reichenbach, 1947)
• Tense: relates R and S; grammatical aspect: relates R and E
• R associated with temporal anaphora (Partee, 1984)
• Order events by comparing R across sentences
• “By the time Boris noticed his blunder, John had (already) won the game”

Sentence                 | Tense           | Order
John wins the game       | Present         | E,R,S
John won the game        | Simple Past     | E,R < S
John had won the game    | Perfective Past | E < R < S
John has won the game    | Present Perfect | E < S,R
John will win the game   | Future          | S < E,R
Etc.                     | Etc.            | Etc.

See Michaelis (2006) for a good explanation of tense and grammatical aspect.
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Types of eventualities
Chart from (Dölling, 2011); Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Inter-eventuality relations
• A boundary begins/ends a happening
• A boundary culminates an event
• A moment is the reduction of an episode
• A state is the result of a change
• A habitual state is realized by a class of occurrences
• A process is made of event constituents
• …
Chart from (Dölling, 2011); Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial