CSE 454 Advanced Internet Systems: Machine Learning for Extraction (Dan Weld)

Page 1: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

CSE 454 Advanced Internet Systems

Machine Learning for Extraction

Dan Weld

Page 2: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Logistics
• Project Warm-Up
  – Due this Sunday
• Computing
  – $100 EC2 credit per student
• Team Selection
  – Topic survey later this week

Page 3: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

An (Incomplete) Timeline of UW MR Systems

[Timeline figure spanning 1997–2011+, with systems including ShopBot, WIEN, Mulder, RESOLVER, KnowItAll, REALM, Opine, LEX, TextRunner, Kylin, Luchs, KOG, WOE, USP, OntoUSP, SNE, Velvet, PrecHybrid, Holmes, Sherlock, WebTables, LDA-SP, IIAUCR, AuContraire, SRL-IE, ReVerb, MultiR, and Figer.]

Page 4: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Perspective
• Crawling the Web
• Inverted indices
• Query processing
• Pagerank computation & ranking
• Search UI
• Computational advertising
• Security & malware
• Social systems
• Information extraction


Page 6: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Today’s Outline
• Supervised Learning – Compact Introduction
  – Learning as Function Approximation
  – Need for Bias
  – Overfitting
  – Bias / Variance Tradeoff
  – Loss Functions; Regularization; Learning as Optimization
  – Curse of Dimensionality
  – Logistic Regression
• IE as Supervised Learning
• Features for IE

Page 7: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Terminology
• Examples
  – Features
  – Labels
• Training Set
• Validation Set
• Test Set

Input: { …<X1, …, Xk, Y>… }
Output: target function F: X → Y; learned hypothesis h: X → Y
Objective: minimize error of h on (unseen) test examples

Page 8: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld


Learning is Function Approximation

Page 9: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Linear Regression

h_w(x) = w1 x + w0

Page 10: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Classifier: Y (Range of F) is Discrete

Hypothesis: a function for labeling examples.

[Figure: 2-D feature space (axes roughly 0.0–6.0 and 0.0–3.0) with training points labeled + and –, plus several unlabeled points marked “?” whose labels the hypothesis must predict.]

Page 11: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Objective: Minimize error on test examples

[Same scatter plot of + and – training points, with the unlabeled “?” points.]

So… how good is this hypothesis?

Page 12: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Objective: Minimize error on test examples

[Scatter plot of + and – training points.]

Only know the training data, so minimize error on that:

Loss(F) = Σ_{j=1}^{n} | y_j – F(x_j) |

Page 13: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Generalization

• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples yields a [consistent] hypothesis that does not generalize.

Page 14: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld


Why is Learning Possible?

Experience alone never justifies any conclusion about any unseen instance.

Learning occurs when PREJUDICE meets DATA!

Learning a “Frobnitz”

Page 15: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

[Figure: example objects labeled “Frobnitz” and “Not a Frobnitz”.]

Page 16: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld


Bias
• The nice word for prejudice is “bias”.
  – Different from “bias” in statistics.
• What kind of hypotheses will you consider?
  – What is the allowable range of approximation functions? E.g., conjunctions, linear functions.
• What kind of hypotheses do you prefer?
  – E.g., simple hypotheses (Occam’s Razor): few parameters, small parameters.

Page 17: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Fitting a Polynomial

Page 18: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld


Overfitting

[Plot: accuracy (roughly 0.6–0.9) vs. model complexity (e.g., number of parameters in the polynomial). Accuracy on the training data keeps improving with complexity, while accuracy on the test data peaks and then degrades.]
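The curve above is easy to reproduce. Below is a minimal sketch (not from the slides; the dataset, noise level, and degrees are illustrative assumptions) that fits polynomials of increasing degree with numpy and reports train vs. test error; the degree at which test error turns upward is where overfitting begins.

```python
# Sketch: reproduce the overfitting curve with polynomial fits (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy target

# Hold out every other point as a test set.
train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)

for degree in [1, 3, 5, 9, 15]:
    w = np.polyfit(x[train], y[train], degree)        # least-squares fit
    err_train = np.mean((np.polyval(w, x[train]) - y[train]) ** 2)
    err_test = np.mean((np.polyval(w, x[test]) - y[test]) ** 2)
    print(f"degree {degree:2d}: train MSE {err_train:.3f}, test MSE {err_test:.3f}")
```

Training error falls monotonically with degree, while test error bottoms out at a moderate degree and then climbs, exactly the shape sketched above.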

Page 19: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Bias / Variance Tradeoff

• Variance: E[ (h(x*) – h̄(x*))² ], where h̄(x*) is the prediction averaged over training sets.
  How much h(x*) varies between training sets.
  Reducing variance risks underfitting.
• Bias: [ h̄(x*) – f(x*) ].
  Describes the average error of h(x*).
  Reducing bias risks overfitting.

Slide from T. Dietterich

Page 20: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Learning as Optimization

• Loss Function
  – Loss(h, data) = error(h, data) + complexity(h)
  – Error + regularization
  – Minimize loss over training data
• Optimization Methods
  – Closed form
  – Greedy search
  – Gradient ascent

Page 21: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Effect of Regularization

Loss(h_w) = Σ_{j=1}^{n} ( y_j – (w1 x_j + w0) )² + λ Σ_{i=1}^{k} | w_i |

[Figure: fits under this regularized loss; the slide labels one setting ln λ = –25.]

Page 22: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Regularization: L1 vs. L2

Page 23: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Curse of Dimensionality

• Intuitions fail in high dimensions
• Hard to distinguish hypotheses

Page 24: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

A Great Learning Algorithm

• Logistic Regression

Page 25: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Univariate Linear Regression

h_w(x) = w1 x + w0

Loss(h_w) = Σ_{j=1}^{n} L2( y_j, h_w(x_j) ) = Σ_{j=1}^{n} ( y_j – h_w(x_j) )² = Σ_{j=1}^{n} ( y_j – (w1 x_j + w0) )²

Page 26: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Understanding Weight Space

h_w(x) = w1 x + w0

Loss(h_w) = Σ_{j=1}^{n} ( y_j – (w1 x_j + w0) )²


Page 28: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Finding Minimum Loss

h_w(x) = w1 x + w0

Loss(h_w) = Σ_{j=1}^{n} ( y_j – (w1 x_j + w0) )²

Argmin_w Loss(h_w): set both partial derivatives to zero:

∂ Loss(h_w) / ∂w0 = 0
∂ Loss(h_w) / ∂w1 = 0

Page 29: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Unique Solution!

h_w(x) = w1 x + w0

Argmin_w Loss(h_w):

w1 = ( N Σ(x_j y_j) – (Σ x_j)(Σ y_j) ) / ( N Σ(x_j²) – (Σ x_j)² )
w0 = ( (Σ y_j) – w1 (Σ x_j) ) / N
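As a sanity check, here is a minimal numpy sketch (not from the slides; the toy data is an illustrative assumption) that computes w1 and w0 directly from these two formulas:

```python
# Sketch: closed-form univariate least squares, exactly the two formulas above.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # roughly y = 2x + 1 with noise
N = len(x)

w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N
print(f"h_w(x) = {w1:.3f} x + {w0:.3f}")   # close to 2x + 1
```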

Page 30: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Could also Solve Iteratively

Argmin_w Loss(h_w):

w := any point in weight space
Loop until convergence:
  For each w_i in w do:
    w_i := w_i – η ∂ Loss(w) / ∂w_i
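A minimal gradient-descent sketch of that loop (illustrative; the data, learning rate, and stopping rule are assumptions). For the squared loss above, ∂Loss/∂w0 = –2 Σ (y_j – h_w(x_j)) and ∂Loss/∂w1 = –2 Σ (y_j – h_w(x_j)) x_j:

```python
# Sketch: batch gradient descent for Loss(h_w) = sum_j (y_j - (w1*x_j + w0))^2.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
w0, w1, eta = 0.0, 0.0, 0.01           # initial point and learning rate

for _ in range(5000):                   # "loop until convergence", crudely
    resid = y - (w1 * x + w0)           # y_j - h_w(x_j)
    g0, g1 = -2 * resid.sum(), -2 * (resid * x).sum()
    w0, w1 = w0 - eta * g0, w1 - eta * g1
    if max(abs(g0), abs(g1)) < 1e-8:    # gradient ~ 0: at the minimum
        break

print(f"w1 = {w1:.3f}, w0 = {w0:.3f}")  # matches the closed-form solution
```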

Page 31: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Multivariate Linear Regression

h_w(x_j) = w0 + Σ_i w_i x_{j,i} = Σ_i w_i x_{j,i} = wᵀ x_j    (taking x_{j,0} = 1)

Argmin_w Loss(h_w) has a unique solution: w = (XᵀX)⁻¹ Xᵀ y

Problem…

Page 32: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Overfitting

Regularize!! Penalize high weights:

Loss(h_w) = Σ_{j=1}^{n} ( y_j – (w1 x_j + w0) )² + λ Σ_{i=1}^{k} w_i²

Alternatively…

Loss(h_w) = Σ_{j=1}^{n} ( y_j – (w1 x_j + w0) )² + λ Σ_{i=1}^{k} | w_i |
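For the L2 penalty the regularized problem still has a closed form: adding λ Σ w_i² to the multivariate loss turns the normal equations into w = (XᵀX + λI)⁻¹ Xᵀ y (ridge regression). That identity is standard but not shown on the slides; a minimal sketch with assumed toy data:

```python
# Sketch: ridge regression, the closed form for L2-regularized least squares.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])  # x_{j,0} = 1 bias column
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=20)

lam = 0.1
I = np.eye(X.shape[1])
w_plain = np.linalg.solve(X.T @ X, X.T @ y)             # (X^T X)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)   # shrinks weights toward 0
# (For brevity the bias weight is penalized too; in practice it is often left out.)

print("unregularized:", np.round(w_plain, 3))
print("ridge (lam=0.1):", np.round(w_ridge, 3))
```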

Page 33: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Regularization

[Figure contrasting the L1 and L2 penalties; L1 tends to drive some weights exactly to zero, while L2 shrinks all weights smoothly.]

Page 34: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Back to Classification

[Figure: feature space with regions where P(edible|X) = 1 and P(edible|X) = 0, separated by a decision boundary.]

Page 35: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Logistic Regression

• Learn P(Y|X) directly!
• Assume a particular functional form
• A step function (P(Y)=1 on one side of a threshold, P(Y)=0 on the other)? Not differentiable…

Page 36: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Logistic Regression

• Learn P(Y|X) directly!
• Assume a particular functional form: the logistic function, a.k.a. the sigmoid

Page 37: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data (in the convention used below):

P(Y=0 | X, w) = 1 / ( 1 + exp(w0 + Σ_i w_i X_i) )

Features X_i can be discrete or continuous!

Page 38: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Understanding Sigmoids

[Three sigmoid plots over x ∈ [–6, 6], y ∈ [0, 1], for (w0 = 0, w1 = –1), (w0 = –2, w1 = –1), and (w0 = 0, w1 = –0.5). Changing w0 shifts the curve; shrinking |w1| makes the transition more gradual.]
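A short matplotlib sketch (illustrative, using the deck's P(Y=0|X,w) convention) that regenerates the three curves:

```python
# Sketch: plot 1 / (1 + exp(w0 + w1*x)) for the three (w0, w1) settings above.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
for w0, w1 in [(0, -1), (-2, -1), (0, -0.5)]:
    plt.plot(x, 1.0 / (1.0 + np.exp(w0 + w1 * x)), label=f"w0={w0}, w1={w1}")

plt.xlabel("x"); plt.ylabel("P(Y=0 | x, w)"); plt.legend(); plt.show()
```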

Page 39: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Very convenient!

P(Y=0 | X, w) = 1 / ( 1 + exp(w0 + Σ_i w_i X_i) )

implies P(Y=1 | X, w) = exp(w0 + Σ_i w_i X_i) / ( 1 + exp(w0 + Σ_i w_i X_i) )

implies P(Y=1 | X, w) / P(Y=0 | X, w) = exp(w0 + Σ_i w_i X_i)

implies a linear classification rule: predict Y = 0 exactly when w0 + Σ_i w_i X_i < 0.

© Carlos Guestrin 2005-2009

Page 40: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Loss Functions: Likelihood vs. Conditional Likelihood

• Generative (Naïve Bayes) loss function: data likelihood
• Discriminative (Logistic Regression) loss function: conditional data likelihood
• Discriminative models can’t compute P(x_j | w)!
  Or, … “they don’t waste effort learning P(X)”: focus only on P(Y|X), which is all that matters for classification.

© Carlos Guestrin 2005-2009

Page 41: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Expressing Conditional Log Likelihood

ln P(Y=0 | X, w) = – ln( 1 + exp(w0 + Σ_i w_i X_i) )
ln P(Y=1 | X, w) = w0 + Σ_i w_i X_i – ln( 1 + exp(w0 + Σ_i w_i X_i) )

[Annotated derivation: one term is the log probability of predicting 1, the other of predicting 0; indicator factors that are 1 when the correct answer is 1 (resp. 0) select the matching term.]

© Carlos Guestrin 2005-2009


Page 43: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Maximizing Conditional Log Likelihood

l(w) = ln P(D_Y | D_X, w) = Σ_j ln P(y_j | x_j, w)

Bad news: no closed-form solution to maximize l(w).
Good news: l(w) is a concave function of w!
  – No local optima
  – Concave functions are easy to maximize

© Carlos Guestrin 2005-2009

Page 44: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Optimizing Concave Functions: Gradient Ascent

• The conditional likelihood for logistic regression is concave ⇒ find the optimum with gradient ascent.
• Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient: ∇_w l(w) = [ ∂l(w)/∂w_0, …, ∂l(w)/∂w_n ]

Update rule: w := w + η ∇_w l(w), with learning rate η > 0.

© Carlos Guestrin 2005-2009
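A compact numpy sketch of gradient ascent on the conditional log likelihood (the toy data, learning rate, and iteration count are assumptions). Using the convention P(Y=1|x,w) = exp(wᵀx)/(1+exp(wᵀx)), the gradient is ∇_w l(w) = Σ_j x_j ( y_j – P(Y=1|x_j, w) ):

```python
# Sketch: logistic regression trained by gradient ascent on the
# conditional log likelihood l(w).
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # bias + 2 features
true_w = np.array([-1.0, 2.0, -3.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_w))).astype(float)  # sampled labels

w, eta = np.zeros(3), 1.0
for _ in range(2000):
    p1 = 1 / (1 + np.exp(-X @ w))   # P(Y=1 | x_j, w)
    grad = X.T @ (y - p1) / n       # averaged gradient of l(w)
    w = w + eta * grad              # ascent step (concave => no local optima)

print("true w:   ", true_w)
print("learned w:", np.round(w, 2))
```

Because l(w) is concave, this plain ascent loop reliably approaches the global optimum; fancier methods (conjugate gradient, L-BFGS) just get there faster.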

Page 45: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Earthquake or Nuclear Test?

[Figure: seismic events plotted in a 2-D feature space (x1, x2).]

Linear classification rule: if P(Y=0 | X, w) / P(Y=1 | X, w) > 1, then predict Y = 0.

Page 46: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Logistic Regression with Initial Weights: w0 = 20, w1 = –5, w2 = 10

[Figure: data in the (x1, x2) plane with the decision boundary implied by the initial weights, and the surface of l(w) over (w0, w1).]

Update rule: w := w + η ∇_w l(w)

Loss(H_w) = Error(H_w, data); minimizing error ⇔ maximizing l(w) = ln P(D_Y | D_X, H_w)

Page 47: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Gradient Ascent: w0 = 40, w1 = –10, w2 = 5

[Figure: after several ascent steps the weights have moved uphill on the l(w) surface and the decision boundary separates the data better.]

Maximize l(w) = ln P(D_Y | D_X, H_w)

Page 48: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

IE as Classification

“Citigroup has taken over EMI, the British …”  →  +
“Citigroup’s acquisition of EMI comes just ahead of …”  →  +
“Google’s Adwords system has long included ways to connect to Youtube.”  →  -
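Framed this way, relation extraction is ordinary supervised classification over sentences. A minimal scikit-learn sketch (the training set just reuses the slide's three examples, truncations and all, and is of course far too small for real use; the test sentence is hypothetical):

```python
# Sketch: classify whether a sentence expresses an acquisition relation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Citigroup has taken over EMI, the British ...",            # +
    "Citigroup's acquisition of EMI comes just ahead of ...",   # +
    "Google's Adwords system has long included ways to connect to Youtube.",  # -
]
labels = [1, 1, 0]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(sentences, labels)

# Hypothetical unseen sentence; with real training data we would hope for [1].
print(clf.predict(["Amazon has taken over Whole Foods"]))
```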

Page 49: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Preprocessed Data Files

Each line corresponds to a sentence, e.g. "John likes eating sausage."

tokens (after tokenization): John likes eating sausage .

Page 50: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Preprocessed Data Files

Each line corresponds to a sentence, e.g. "John likes eating sausage."

tokens (after tokenization): John likes eating sausage .
pos (part-of-speech tags): John/NNP likes/VBZ eating/VBG sausage/NN ./.

Grade school: “9 parts of speech in English”
• Noun
• Verb
• Article
• Adjective
• Preposition
• Pronoun
• Adverb
• Conjunction
• Interjection

But: plurals, possessives, case, tense, aspect, …

Page 51: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Preprocessed Data Files

Each line corresponds to a sentence, e.g. "John likes eating sausage."

tokens (after tokenization): John likes eating sausage .
pos (part-of-speech tags): John/NNP likes/VBZ eating/VBG sausage/NN ./.
ner (named entities)
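One plausible toolchain for producing such files is NLTK (an illustrative choice; the slides don't name a specific tool). The download calls fetch the models on first use (package names as in classic NLTK releases):

```python
# Sketch: produce tokens, POS tags, and named entities for one sentence with NLTK.
import nltk
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)     # one-time model downloads

sentence = "John likes eating sausage."
tokens = nltk.word_tokenize(sentence)  # ['John', 'likes', 'eating', 'sausage', '.']
tagged = nltk.pos_tag(tokens)          # [('John', 'NNP'), ('likes', 'VBZ'), ...]
tree = nltk.ne_chunk(tagged)           # groups tagged tokens into entity chunks

print(" ".join(tokens))
print(" ".join(f"{tok}/{tag}" for tok, tag in tagged))
print(tree)                            # 'John' should come out as a person entity
```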

Page 52: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Text as Vectors

• Each document j can be viewed as a vector of frequency values, with one component for each word (or phrase).
• So we have a vector space:
  – words (or phrases) are axes
  – documents live in this space
  – even with stemming, there may be 20,000+ dimensions

Page 53: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Vector Space Representation

Documents that are close to the query (measured using a vector-space metric) are returned first.

[Figure: documents and a query as vectors in term space.]

slide from Raghavan, Schütze, Larson
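A minimal stdlib sketch of both ideas (the documents are illustrative; cosine similarity is one standard vector-space metric):

```python
# Sketch: bag-of-words vectors plus cosine-similarity ranking of documents.
import math
from collections import Counter

docs = [
    "john likes eating sausage",
    "mary likes eating pasta",
    "the market likes the new sausage product",
]
query = "sausage eating"

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)                  # shared-axis products
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vecs = [Counter(d.split()) for d in docs]              # one axis per word
qv = Counter(query.split())
for score, doc in sorted(((cosine(qv, v), d) for v, d in zip(vecs, docs)),
                         reverse=True):
    print(f"{score:.3f}  {doc}")                       # closest document first
```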

Page 54: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Lemmatization

• Reduce inflectional/variant forms to the base form:
  – am, are, is → be
  – car, cars, car's, cars' → car
  – “the boy's cars are different colors” → “the boy car be different color”

slide from Raghavan, Schütze, Larson
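One way to get these mappings in code is NLTK's WordNet lemmatizer (an illustrative choice; note it needs the correct part of speech to map verbs like "are" to "be"):

```python
# Sketch: lemmatization with NLTK's WordNet lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("cars"))           # 'car'   (default POS is noun)
print(lem.lemmatize("are", pos="v"))   # 'be'    (needs pos='v' to treat it as a verb)
print(lem.lemmatize("colors"))         # 'color'
```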

Page 55: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Stemming

• Reduce terms to their “roots” before indexing
  – language dependent
  – e.g., automate(s), automatic, automation all reduced to automat.

“for example compressed and compression are both accepted as equivalent to compress.”
→ “for exampl compres and compres are both accept as equival to compres.”

slide from Raghavan, Schütze, Larson

Page 56: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Porter’s algorithm

• A common algorithm for stemming English
• Conventions + 5 phases of reductions
  – phases applied sequentially
  – each phase consists of a set of commands
  – sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
• Porter’s stemmer is available at http://www.sims.berkeley.edu/~hearst/irbook/porter.html

slide from Raghavan, Schütze, Larson

Page 57: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Typical rules in Porter

• sses → ss
• ies → i
• ational → ate
• tional → tion

slide from Raghavan, Schütze, Larson
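NLTK ships an implementation, so the rules are easy to try (an illustrative sketch; the compress examples come from the stemming slide above):

```python
# Sketch: applying the Porter stemmer to examples from these slides.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "compressed", "compression"]:
    print(f"{word} -> {stemmer.stem(word)}")
# caresses -> caress   (sses -> ss)
# ponies   -> poni     (ies -> i)
# compressed and compression both -> compress
```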

Page 58: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Challenges

• Sandy
• Sanded
• Sander

→ Sand ???

slide from Raghavan, Schütze, Larson

Page 59: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld
Page 60: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld
Page 61: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Why Extract Temporal Information?

• Many relations and events are temporally bounded
  – a person’s place of residence or employer
  – an organization’s members
  – the duration of a war between two countries
  – the precise time at which a plane landed
  – …
• Temporal Information Distribution
  – One of every fifty lines of database application code involves a date or time value (Snodgrass, 1998)
  – Each news document in PropBank (Kingsbury and Palmer, 2002) includes eight temporal arguments

Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

Page 62: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Time-intensive Slot Types

Person: per:alternate_names, per:date_of_birth, per:age, per:country_of_birth, per:stateorprovince_of_birth, per:city_of_birth, per:origin, per:date_of_death, per:country_of_death, per:stateorprovince_of_death, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended, per:title, per:member_of, per:employee_of, per:religion, per:spouse, per:children, per:parents, per:siblings, per:other_family, per:charges

Organization: org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website

Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

Page 63: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Temporal Expression Examples (reference date = December 8, 2012)

Expression → Value in Timex format
• December 8, 2012 → 2012-12-08
• Friday → 2012-12-07
• today → 2012-12-08
• 1993 → 1993
• the 1990's → 199X
• midnight, December 8, 2012 → 2012-12-08T00:00:00
• 5pm → 2012-12-08T17:00
• the previous day → 2012-12-07
• last October → 2011-10
• last autumn → 2011-FA
• last week → 2012-W48
• Thursday evening → 2012-12-06TEV
• three months ago → 2012-09

Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
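Rule-based normalizers of the kind cited on the next slide boil down to pattern matching plus arithmetic on a reference date. A toy stdlib sketch (the handful of patterns and the helper name are illustrative assumptions, nowhere near full Timex coverage):

```python
# Sketch: normalize a few relative expressions against a reference date.
from datetime import date, timedelta

REF = date(2012, 12, 8)  # the slide's reference date

def normalize(expr: str, ref: date = REF) -> str:
    """Map a temporal expression to a Timex-style value (toy coverage)."""
    expr = expr.lower().strip()
    if expr == "today":
        return ref.isoformat()
    if expr == "the previous day":
        return (ref - timedelta(days=1)).isoformat()
    if expr == "last week":
        y, week, _ = (ref - timedelta(weeks=1)).isocalendar()
        return f"{y}-W{week:02d}"
    if expr.endswith("months ago"):
        n = {"one": 1, "two": 2, "three": 3}[expr.split()[0]]
        month = (ref.month - n - 1) % 12 + 1            # wrap across year ends
        year = ref.year + (ref.month - n - 1) // 12
        return f"{year}-{month:02d}"
    raise ValueError(f"no rule for {expr!r}")

for e in ["today", "the previous day", "last week", "three months ago"]:
    print(e, "->", normalize(e))   # matches the table's values
```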

Page 64: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Temporal Expression Extraction

• Rule-based (Strötgen and Gertz, 2010; Chang and Manning, 2012; Do et al., 2012)
• Machine learning
  – Risk minimization model (Boguraev and Ando, 2005)
  – Conditional random fields (Ahn et al., 2005; UzZaman and Allen, 2010)
• State of the art: about 95% F-measure for extraction and 85% F-measure for normalization

Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

Page 65: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Ordering Events in Discourse

(1) John entered the room at 5:00pm.
(2) It was pitch black.
(3) It had been three days since he’d slept.

[Timeline diagram: the event “John entered the room” is anchored at 5pm; the state “pitch black” holds at that time; the state “John slept” ended three days earlier; “now” marks the speech time.]

Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

Page 66: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Ordering Events in Time
Speech (S), Event (E), & Reference (R) time (Reichenbach, 1947)

• Tense relates R and S; grammatical aspect relates R and E
• R is associated with temporal anaphora (Partee, 1984)
• Order events by comparing R across sentences
• “By the time Boris noticed his blunder, John had (already) won the game”

Sentence | Tense | Order
John wins the game | Present | E,R,S
John won the game | Simple Past | E,R<S
John had won the game | Perfective Past | E<R<S
John has won the game | Present Perfect | E<S,R
John will win the game | Future | S<E,R
Etc… | Etc… | Etc…

See Michaelis (2006) for a good explanation of tense and grammatical aspect.

Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

Page 67: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Types of eventualities

[Chart of eventuality types, from (Dölling, 2011).]

Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

Page 68: CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

Inter-eventuality relations

• A boundary begins/ends a happening
• A boundary culminates an event
• A moment is the reduction of an episode
• A state is the result of a change
• A habitual state is realized by a class of occurrences
• A process is made of event constituents …

Chart from (Dölling, 2011). Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial