CSE 454 Advanced Internet Systems
Machine Learning for Extraction
Dan Weld
Logistics
• Project Warm-Up
  – Due this Sunday
• Computing
  – $100 EC2 credit per student
• Team Selection
  – Topic survey later this week
An (Incomplete) Timeline of UW MR Systems
[Timeline figure, 1997–2011+: ShopBot, WIEN, Mulder, RESOLVER, KnowItAll, REALM, Opine, LEX, TextRunner, Kylin, Luchs, KOG, WOE, USP, OntoUSP, SNE, Velvet, PrecHybrid, Holmes, Sherlock, WebTables, LDA-SP, IIAUCR, AuContraire, SRL-IE, ReVerb, MultiR, Figer]
Perspective
• Crawling the Web
• Inverted indices
• Query processing
• Pagerank computation & ranking
• Search UI
• Computational advertising
• Security & malware
• Social systems
• Information extraction
Today’s Outline
• Supervised Learning – Compact Introduction
  – Learning as Function Approximation
  – Need for Bias
  – Overfitting
  – Bias / Variance Tradeoff
  – Loss Functions; Regularization; Learning as Optimization
  – Curse of Dimensionality
  – Logistic Regression
• IE as Supervised Learning
• Features for IE
Terminology
• Examples
  – Features
  – Labels
• Training Set
• Validation Set
• Test Set

Input: { … <X1, …, Xk, Y> … }
Output: hypothesis h: X → Y approximating the target function F: X → Y
Objective: Minimize error of h on (unseen) test examples
Learning is Function Approximation
Linear Regression
hw(x) = w1x + w0
Classifier: Y (Range of F) is Discrete
[Scatter plot: training examples labeled + and –, a hypothesis separating them, and unlabeled points marked ?]
Hypothesis: Function for labeling examples
Labels: + and –
Objective: Minimize error on test examples
[Same scatter plot of + and – examples, with the unlabeled ? points to be classified by the hypothesis]
So… How good is this hypothesis?
Objective: Minimize error on test examples
[Same scatter plot of + and – examples, now showing only the labeled training data]
Only know training data, so minimize error on that
Loss(F) = Σj=1..n |yj – F(xj)|
Generalization
• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples yields a [consistent] hypothesis that does not generalize.
Why is Learning Possible?
Experience alone never justifies any conclusion about any unseen instance.
Learning occurs when PREJUDICE meets DATA!
Learning a “Frobnitz”
[Example images, some labeled “Frobnitz” and some labeled “Not a Frobnitz”]
Bias
• The nice word for prejudice is “bias”
  – Different from “bias” in statistics
• What kind of hypotheses will you consider?
  – What is the allowable range of approximation functions?
  – E.g., conjunctions, linear functions
• What kind of hypotheses do you prefer?
  – E.g., simple hypotheses (Occam’s Razor): few parameters, small parameters
Fitting a Polynomial
Overfitting
[Plot: accuracy (0.6–0.9) vs. model complexity (e.g., number of parameters in the polynomial), with one curve for accuracy on training data and one for accuracy on test data]
Bias / Variance Tradeoff
Slide from T. Dietterich
• Variance: E[ (h(x*) – h̄(x*))² ], where h̄ is the average hypothesis over training sets
  How much h(x*) varies between training sets
  Reducing variance risks underfitting
• Bias: [ h̄(x*) – f(x*) ]
  Describes the average error of h(x*)
  Reducing bias risks overfitting
Learning as Optimization
• Loss Function
  – Loss(h, data) = error(h, data) + complexity(h)
  – Error + regularization
  – Minimize loss over training data
• Optimization Methods
  – Closed form
  – Greedy search
  – Gradient ascent
Effect of Regularization
Loss(hw) = Σj=1..n (yj – (w1xj + w0))² + λ Σi=1..k |wi|
[Figure: fitted curves for different regularization strengths; ln λ = -25]
[Figure: regularized vs. unregularized fit]
Curse of Dimensionality
• Intuitions fail
• Hard to distinguish hypotheses
A Great Learning Algorithm
• Logistic Regression
Univariate Linear Regression
hw(x) = w1x + w0
Loss(hw) = Σj=1..n L2(yj, hw(xj)) = Σj=1..n (yj – hw(xj))² = Σj=1..n (yj – (w1xj + w0))²
Understanding Weight Space
hw(x) = w1x + w0
Loss(hw) = Σj=1..n (yj – (w1xj + w0))²
Finding Minimum Loss
hw(x) = w1x + w0
Loss(hw) = Σj=1..n (yj – (w1xj + w0))²
Argminw Loss(hw): set ∂Loss(hw)/∂w0 = 0 and ∂Loss(hw)/∂w1 = 0
Unique Solution!
hw(x) = w1x + w0
Argminw Loss(hw):
  w1 = ( N Σj xjyj – (Σj xj)(Σj yj) ) / ( N Σj xj² – (Σj xj)² )
  w0 = ( (Σj yj) – w1 (Σj xj) ) / N
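These two formulas translate directly into a few lines of NumPy; a minimal sketch (the data points below are made up for illustration, not from the slides):

```python
# Closed-form univariate linear regression via the normal equations (illustrative data).
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.5])
y = np.array([1.1, 1.4, 2.1, 2.6, 3.2, 3.9])
N = len(x)

w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N

print(w1, w0)               # slope and intercept from the formulas above
print(np.polyfit(x, y, 1))  # least-squares fit from NumPy; should agree: [w1, w0]
```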
Could also Solve Iteratively
Argminw Loss(hw):
  w := any point in weight space
  Loop until convergence:
    For each wi in w do
      wi := wi – α ∂Loss(w)/∂wi
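A minimal sketch of that loop for the same univariate loss (the learning rate α, the iteration budget, and the data are illustrative choices):

```python
# Gradient descent on the sum-of-squares loss for hw(x) = w1*x + w0 (illustrative data and step size).
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.5])
y = np.array([1.1, 1.4, 2.1, 2.6, 3.2, 3.9])

w0, w1 = 0.0, 0.0           # any point in weight space
alpha = 0.01                # learning rate
for _ in range(5000):       # "until convergence" approximated by a fixed budget here
    err = (w1 * x + w0) - y
    grad_w0 = 2 * np.sum(err)        # dLoss/dw0
    grad_w1 = 2 * np.sum(err * x)    # dLoss/dw1
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)               # approaches the closed-form solution above
```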
Multivariate Linear Regression
hw(xj) = w0 + Σi wi xj,i = wTxj   (absorbing w0 via a constant feature xj,0 = 1)
Argminw Loss(hw)
Unique solution: w* = (XTX)–1XTy
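In NumPy the unique solution is essentially one line; a sketch on synthetic data (solving the linear system rather than forming the inverse explicitly, which is the numerically safer equivalent):

```python
# Multivariate least squares via the normal equations, w* = (X^T X)^{-1} X^T y (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 examples, 3 features
X = np.hstack([np.ones((100, 1)), X])     # prepend the constant feature so w[0] acts as w0
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)                             # close to true_w
```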
Problem….
Overfitting
Regularize!! Penalize high weights:
Loss(hw) = Σj=1..n (yj – (w1xj + w0))² + λ Σi=1..k wi²
Alternatively…
Loss(hw) = Σj=1..n (yj – (w1xj + w0))² + λ Σi=1..k |wi|
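A hedged sketch of the two penalties using scikit-learn (an illustrative tool choice, not prescribed by the slides): Ridge implements the squared-weight (L2) penalty and Lasso the absolute-value (L1) penalty.

```python
# Contrasting L2 (Ridge) and L1 (Lasso) regularization on synthetic data where only two features matter.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2: shrinks all weights toward zero
print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: tends to drive irrelevant weights exactly to zero
```

The L1 penalty’s tendency to zero out weights is why it is often used for feature selection.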
Regularization
[Figure comparing the L1 and L2 penalties]
Back to Classification
[Figure: region where P(edible|X) = 1, region where P(edible|X) = 0, separated by the decision boundary]
Logistic Regression
Learn P(Y|X) directly! Assume a particular functional form.
[Figure: a hard threshold that jumps from P(Y)=0 to P(Y)=1 at the decision boundary]
Not differentiable…
Logistic Regression
Learn P(Y|X) directly! Assume a particular functional form: the logistic function (aka sigmoid).
Logistic Function in n Dimensions
Sigmoid applied to a linear function of the data:
Features can be discrete or continuous!
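Concretely (a small NumPy sketch; it follows the convention used later in these slides, where P(Y=0|X,w) = 1 / (1 + exp(w0 + Σi wiXi)) and P(Y=1|X,w) is the complement; the weights and features are made up):

```python
# Sigmoid applied to a linear function of the features.
import numpy as np

def p_y0_given_x(x, w0, w):
    """P(Y=0 | x, w) = 1 / (1 + exp(w0 + w.x)); P(Y=1 | x, w) = 1 - P(Y=0 | x, w)."""
    return 1.0 / (1.0 + np.exp(w0 + np.dot(w, x)))

x = np.array([1.2, -0.7, 3.0])      # feature values; discrete or continuous both work
w = np.array([0.5, -1.0, 0.25])     # one weight per feature
print(p_y0_given_x(x, w0=-0.3, w=w))
```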
Understanding Sigmoids
[Three sigmoid curves plotted over x in [-6, 6], y in [0, 1], for (w0=0, w1=-1), (w0=-2, w1=-1), and (w0=0, w1=-0.5)]
Very convenient!
P(Y=0|X,w) = 1 / (1 + exp(w0 + Σi wiXi))
implies P(Y=1|X,w) = exp(w0 + Σi wiXi) / (1 + exp(w0 + Σi wiXi))
implies ln [ P(Y=1|X,w) / P(Y=0|X,w) ] = w0 + Σi wiXi
implies a linear classification rule!
© Carlos Guestrin 2005-2009
Loss Functions: Likelihood vs. Conditional Likelihood
• Generative (Naïve Bayes) loss function: data likelihood
• Discriminative (Logistic Regression) loss function: conditional data likelihood
• Discriminative models can’t compute P(xj|w)! Or, … “they don’t waste effort learning P(X)”: focus only on P(Y|X), which is all that matters for classification
© Carlos Guestrin 2005-2009
Expressing Conditional Log Likelihood
©Carlos Guestrin 2005-2009
l(w) = Σj [ yj ln P(Y=1|xj,w) + (1 – yj) ln P(Y=0|xj,w) ]
  (yj is 1 when the correct answer is 1, selecting the probability of predicting 1;
   (1 – yj) is 1 when the correct answer is 0, selecting the probability of predicting 0)

ln P(Y=0|X,w) = –ln(1 + exp(w0 + Σi wiXi))
ln P(Y=1|X,w) = (w0 + Σi wiXi) – ln(1 + exp(w0 + Σi wiXi))
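A small NumPy sketch of this quantity on made-up data (using the algebraically simplified form l(w) = Σj [ yj zj – ln(1 + exp(zj)) ] with zj = w0 + Σi wi xj,i; no numerical stabilization):

```python
# Conditional log likelihood of the labels under a logistic regression model (toy data).
import numpy as np

def cond_log_likelihood(w0, w, X, y):
    """l(w) = sum_j [ y_j * z_j - ln(1 + exp(z_j)) ], where z_j = w0 + w . x_j."""
    z = w0 + X @ w
    return np.sum(y * z - np.log1p(np.exp(z)))

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(cond_log_likelihood(-1.0, np.array([1.0, -0.5]), X, y))
```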
Maximizing Conditional Log Likelihood
Bad news: no closed-form solution to maximize l(w)
Good news: l(w) is a concave function of w!
  – No suboptimal local optima
  – Concave functions are easy to optimize
©Carlos Guestrin 2005-2009
Optimizing Concave Functions: Gradient Ascent
Conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent.
Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).
Gradient: ∇w l(w) = [ ∂l(w)/∂w0, …, ∂l(w)/∂wn ]
Update rule, with learning rate η > 0: w ← w + η ∇w l(w)
© Carlos Guestrin 2005-2009
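A minimal sketch of that update loop in NumPy (the data, η, and iteration count are illustrative; it uses the fact that for this model ∂l(w)/∂wi = Σj xj,i (yj – P(Y=1|xj,w))):

```python
# Gradient ascent on the conditional log likelihood for logistic regression (toy data).
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
y = np.array([0, 0, 1, 1, 1])

Xb = np.hstack([np.ones((len(X), 1)), X])   # constant feature, so w[0] plays the role of w0
w = np.zeros(Xb.shape[1])
eta = 0.1                                   # learning rate

for _ in range(1000):
    p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))    # P(Y=1 | x, w)
    grad = Xb.T @ (y - p1)                  # gradient of l(w)
    w = w + eta * grad                      # ascend: we are maximizing l(w)

print(w)
```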
Earthquake or Nuclear Test?
[Scatter plot of examples in feature space (x1, x2)]
If P(Y=0|X,w) / P(Y=1|X,w) > 1, then predict Y=0
Taking logs, this is a linear classification rule!
Logistic Regression with Initial Weights: w0=20, w1= -5, w2 =10
[Left: examples in feature space (x1, x2) with the current decision boundary; right: surface of l(w) over (w0, w1)]
Loss(Hw) = Error(Hw, data); minimizing error here means maximizing l(w) = ln P(DY | DX, Hw)
Update rule: w ← w + η ∇w l(w)
Gradient Ascent: w0=40, w1= -10, w2 =5
[Examples in feature space (x1, x2) with the updated decision boundary; surface of l(w) over (w0, w1)]
Maximize l(w) = ln P(DY | DX, Hw)
Update rule: w ← w + η ∇w l(w)
IE as Classification
+  Citigroup has taken over EMI, the British …
+  Citigroup’s acquisition of EMI comes just ahead of …
–  Google’s Adwords system has long included ways to connect to Youtube.
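To make the framing concrete, a hedged toy sketch (scikit-learn is an illustrative tool choice, and the unseen test sentence is hypothetical): treat "does this sentence express an acquisition?" as bag-of-words logistic regression over the three examples above.

```python
# Toy sketch: the acquisition relation as sentence classification with logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Citigroup has taken over EMI, the British ...",
    "Citigroup's acquisition of EMI comes just ahead of ...",
    "Google's Adwords system has long included ways to connect to Youtube.",
]
labels = [1, 1, 0]   # + + -

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["Facebook has taken over Instagram."]))   # hopefully labeled +
```

A real extractor would classify candidate instances with much richer features, which is what the preprocessing below provides.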
Preprocessed Data Files
Each line corresponds to a sentence: "John likes eating sausage."
  tokens (after tokenization):  John likes eating sausage .
  pos (part-of-speech tags):  John/NNP likes/VBZ eating/VBG sausage/NN ./.
Grade School: “9 parts of speech in English”
• Noun
• Verb
• Article
• Adjective
• Preposition
• Pronoun
• Adverb
• Conjunction
• Interjection
But: plurals, possessive, case, tense, aspect, ….
Preprocessed Data Files
Each line corresponds to a sentence: "John likes eating sausage."
  tokens (after tokenization):  John likes eating sausage .
  pos (part-of-speech tags):  John/NNP likes/VBZ eating/VBG sausage/NN ./.
  ner (named entities)
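One way to produce these layers is with an off-the-shelf NLP pipeline; a hedged sketch with spaCy (an illustrative tool choice, not prescribed by the slides; it assumes the en_core_web_sm model has been downloaded):

```python
# Tokens, POS tags, and named entities for one sentence using spaCy (illustrative tool choice).
import spacy

nlp = spacy.load("en_core_web_sm")             # assumes: python -m spacy download en_core_web_sm
doc = nlp("John likes eating sausage.")

print([t.text for t in doc])                   # tokens
print([(t.text, t.tag_) for t in doc])         # Penn Treebank tags such as NNP, VBZ, VBG, NN
print([(e.text, e.label_) for e in doc.ents])  # named entities ("John" should come out as PERSON)
```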
Text as Vectors
• Each document, j, can be viewed as a vector of frequency values
  – one component for each word (or phrase)
• So we have a vector space
  – words (or phrases) are axes
  – documents live in this space
  – even with stemming, may have 20,000+ dimensions
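A minimal vectorization sketch (scikit-learn's CountVectorizer is one convenient option; the two documents are made up):

```python
# Documents as term-frequency vectors: one axis per word, one vector per document.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "citigroup acquires emi",
    "google connects adwords to youtube",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: rows = documents, columns = words

print(vectorizer.get_feature_names_out())   # the word axes
print(X.toarray())                          # frequency values for each document
```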
Vector Space Representation
[Figure: documents and a query as points in the vector space]
Documents that are close to the query (measured using a vector-space metric) are returned first.
slide from Raghavan, Schütze, Larson
Lemmatization
• Reduce inflectional/variant forms to base form
  – am, are, is → be
  – car, cars, car’s, cars’ → car
• “the boy’s cars are different colors” → “the boy car be different color”
slide from Raghavan, Schütze, Larson
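A small sketch with NLTK's WordNet lemmatizer (one possible tool, not implied by the slides; it assumes the WordNet data has been downloaded, and it must be told the part of speech to map "are" to "be"):

```python
# Lemmatization with NLTK's WordNet lemmatizer (illustrative tool choice).
from nltk.stem import WordNetLemmatizer   # assumes: nltk.download("wordnet") has been run

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # treated as a noun by default -> "car"
print(lemmatizer.lemmatize("are", pos="v"))   # verbs must be flagged explicitly -> "be"
print(lemmatizer.lemmatize("colors"))         # -> "color"
```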
Stemming
• Reduce terms to their “roots” before indexing
  – language dependent
  – e.g., automate(s), automatic, automation all reduced to automat
Original: “for example compressed and compression are both accepted as equivalent to compress.”
Stemmed: “for exampl compres and compres are both accept as equival to compres.”
slide from Raghavan, Schütze, Larson
Porter’s algorithm
• Common algorithm for stemming English
• Conventions + 5 phases of reductions
  – phases applied sequentially
  – each phase consists of a set of commands
  – sample convention: of the rules in a compound command, select the one that applies to the longest suffix
• Porter’s stemmer available: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
slide from Raghavan, Schütze, Larson
Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
slide from Raghavan, Schütze, Larson
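These rules are easy to try directly; a hedged sketch using NLTK's PorterStemmer (one convenient implementation; the word list is illustrative):

```python
# Applying the Porter stemmer to a few words (NLTK's implementation, chosen for illustration).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "conditional", "compression", "automation"]:
    print(word, "->", stemmer.stem(word))
```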
Challenges
• Sandy
• Sanded
• Sander
→ Sand ???
slide from Raghavan, Schütze, Larson
Why Extract Temporal Information?
• Many relations and events are temporally bounded
  – a person’s place of residence or employer
  – an organization’s members
  – the duration of a war between two countries
  – the precise time at which a plane landed
  – …
• Temporal Information Distribution
  – One of every fifty lines of database application code involves a date or time value (Snodgrass, 1998)
  – Each news document in PropBank (Kingsbury and Palmer, 2002) includes eight temporal arguments
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Time-Intensive Slot Types
Person: per:alternate_names, per:title, per:date_of_birth, per:member_of, per:age, per:employee_of, per:country_of_birth, per:religion, per:stateorprovince_of_birth, per:spouse, per:city_of_birth, per:children, per:origin, per:parents, per:date_of_death, per:siblings, per:country_of_death, per:other_family, per:stateorprovince_of_death, per:charges, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended
Organization: org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Temporal Expression Examples (reference date = December 8, 2012)
Expression → value in TIMEX format:
• December 8, 2012 → 2012-12-08
• Friday → 2012-12-07
• today → 2012-12-08
• 1993 → 1993
• the 1990's → 199X
• midnight, December 8, 2012 → 2012-12-08T00:00:00
• 5pm → 2012-12-08T17:00
• the previous day → 2012-12-07
• last October → 2011-10
• last autumn → 2011-FA
• last week → 2012-W48
• Thursday evening → 2012-12-06TEV
• three months ago → 2012-09
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Temporal Expression Extraction
• Rule-based (Strötgen and Gertz, 2010; Chang and Manning, 2012; Do et al., 2012)
• Machine Learning
  – Risk minimization model (Boguraev and Ando, 2005)
  – Conditional Random Fields (Ahn et al., 2005; UzZaman and Allen, 2010)
• State of the art: about 95% F-measure for extraction and 85% F-measure for normalization
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
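As a flavor of the rule-based approach, a deliberately tiny, hypothetical sketch (real systems such as the rule-based ones cited above cover far more patterns): a few rules that map expressions to TIMEX-style values relative to a reference date.

```python
# A toy rule-based normalizer for a few temporal expressions (illustrative only, not a real TIMEX system).
import re
from datetime import date, timedelta

REFERENCE = date(2012, 12, 8)   # the reference date used in the example table above

def normalize(expr, ref=REFERENCE):
    expr = expr.lower().strip()
    if expr == "today":
        return ref.isoformat()                          # -> 2012-12-08
    if expr == "the previous day":
        return (ref - timedelta(days=1)).isoformat()    # -> 2012-12-07
    m = re.fullmatch(r"(\d+) months ago", expr)         # digits only in this toy version
    if m:
        months = int(m.group(1))
        year = ref.year + (ref.month - 1 - months) // 12
        month = (ref.month - 1 - months) % 12 + 1
        return f"{year}-{month:02d}"                    # "3 months ago" -> 2012-09
    return None                                         # everything else is unhandled

print(normalize("today"), normalize("the previous day"), normalize("3 months ago"))
```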
Ordering events in discourse
• (1) John entered the room at 5:00pm.
• (2) It was pitch black.
• (3) It had been three days since he’d slept.
[Timeline figure with: Time: now; Time: 5pm; Event: John entered the room; State: pitch black; State: John slept, Time: 3 days]
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Ordering events in time
Speech (S), Event (E), & Reference (R) time (Reichenbach, 1947)
• Tense: relates R and S; grammatical aspect: relates R and E
• R associated with temporal anaphora (Partee, 1984)
• Order events by comparing R across sentences
• “By the time Boris noticed his blunder, John had (already) won the game”

Sentence                 | Tense           | Order
John wins the game       | Present         | E,R,S
John won the game        | Simple Past     | E,R < S
John had won the game    | Perfective Past | E < R < S
John has won the game    | Present Perfect | E < S,R
John will win the game   | Future          | S < E,R
Etc.                     | Etc.            | Etc.

See Michaelis (2006) for a good explanation of tense and grammatical aspect.
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Types of eventualities
Chart from (Dölling, 2011); Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
Inter-eventuality relations
• A boundary begins/ends a happening
• A boundary culminates an event
• A moment is the reduction of an episode
• A state is the result of a change
• A habitual state is realized by a class of occurrences
• A process is made of event constituents
• …
Chart from (Dölling, 2011); Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial