

Machine Learning in Natural Language Processing

Fernando Pereira, University of Pennsylvania

NASSLLI, June 2002. Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee, Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby

    ML in NLP

    Introduction

    ML in NLP

Why ML in NLP
• Examples are easier to create than rules
• Rule writers miss low-frequency cases
• Many factors are involved in language interpretation
• People do it
  – AI
  – Cognitive science
• Let the computer do it
  – Moore's law
  – storage
  – lots of data

ML in NLP

Classification
• Document topic: politics, business, national, environment
• Word sense: treasury bonds vs. chemical bonds


    ML in NLP

Analysis
• Tagging: e.g. causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ
• Parsing: e.g. the lexicalized tree for "workers dumped sacks into a bin":
  [S(dumped) [NP-C(workers) [N(workers) workers]] [VP(dumped) [V(dumped) dumped] [NP-C(sacks) [N(sacks) sacks]] [PP(into) [P(into) into] [NP-C(bin) [D(a) a] [N(bin) bin]]]]]

    ML in NLP

Language Modeling
• Is this a likely English sentence?
  P(colorless green ideas sleep furiously) / P(furiously sleep ideas green colorless) ≈ 2 × 10⁵
• Disambiguate noisy transcription:
  It's easy to wreck a nice beach
  It's easy to recognize speech

    ML in NLP

Inference
• Translation:
  ligações covalentes → covalent bonds
  obrigações do tesouro → treasury bonds
• Information extraction:

    Sara Lee to Buy 30% of DIM

Chicago, March 3 - Sara Lee Corp said it agreed to buy a 30 percent interest in Paris-based DIM S.A., a subsidiary of BIC S.A., at a cost of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of about 2 million dollars.

The investment includes the purchase of 5 million newly issued DIM shares valued at about 5 million dollars, and a loan of about 15 million dollars, it said. The loan is convertible into an additional 16 million DIM shares, it noted.

The proposed agreement is subject to approval by the French government, it said.

(extracted roles: acquirer, acquired)

    ML in NLP

Machine Learning Approach
• Algorithms that write programs
• Specify:
  – Form of output programs
  – Accuracy criterion
• Input: set of training examples
• Output: program that performs as accurately as possible on the training examples
• But will it work on new examples?


    ML in NLP

Fundamental Questions
• Generalization: is the learned program useful on new examples?
  – Statistical learning theory: quantifiable tradeoffs between number of examples, complexity of the program class, and generalization error
• Computational tractability: can we find a good program quickly?
  – If not, can we find a good approximation?
• Adaptation: can the program learn quickly from new evidence?
  – Information-theoretic analysis: relationship between adaptation and compression

    ML in NLP

Learning Tradeoffs
[Plot: error of the best program vs. program class complexity, for training data and for testing on new examples, ranging from rote learning to overfitting]

    ML in NLP

Machine Learning Methods
• Classifiers
  – Document classification
  – Disambiguation
• Structured models
  – Tagging
  – Parsing
  – Extraction
• Unsupervised learning
  – Generalization
  – Structure induction

    ML in NLP

Jargon
• Instance: event type of interest
  – Document and its class
  – Sentence and its analysis
  – ...
• Supervised learning: learn a classification function from hand-labeled instances
• Unsupervised learning: exploit correlations to organize training instances
• Generalization: how well does it work on unseen data?
• Features: map an instance to a set of elementary events


    ML in NLP

Classification Ideas
• Represent instances by feature vectors
  – Content
  – Context
• Learn a function from feature vectors to
  – Class
  – Class-probability distribution
• Redundancy is our friend: many weak clues

ML in NLP

Structured Model Ideas
• Interdependent decisions
  – Successive parts of speech
  – Parsing/generation steps
  – Lexical choice (parsing, translation)
• Combining decisions
  – Sequential decisions
  – Generative models
  – Constraint satisfaction

    ML in NLP

Unsupervised Learning Ideas
• Clustering: class induction
• Latent variables
  "I'm thinking of sports" → more sporty words
• Distributional regularities
  Know words by the company they keep
• Data compression
• Infer dependencies among variables: structure learning

ML in NLP

Methodological Detour

• Empiricist/information-theoretic view: words combine following their associations in previous material
• Rationalist/generative view: words combine according to a formal grammar in the class of possible natural-language grammars


    ML in NLP

Chomsky's Challenge to Empiricism

    (1) Colorless green ideas sleep furiously.

    (2) Furiously sleep ideas green colorless.

It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

    Chomsky 57

    ML in NLP

The Return of Empiricism
• Empiricist methods work:
  – Markov models can capture a surprising fraction of the unpredictability in language
  – Statistical information retrieval methods beat alternatives
  – Statistical parsers are more accurate than competitors based on rationalist methods
  – Machine-learning and statistical techniques come close to human performance in part-of-speech tagging and sense disambiguation
• Just engineering tricks?

    ML in NLP

Unseen Events
• Chomsky's implicit assumption: any model must assign zero probability to unseen events
  – naïve estimation of Markov model probabilities from frequencies
  – no latent (hidden) events
• Any such model overfits the data: many events are likely to be missing in any finite sample
∴ The learned model cannot generalize to unseen data
∴ Support for poverty-of-the-stimulus arguments

    ML in NLP

The Science of Modeling
• Probability estimates can be smoothed to accommodate unseen events
• Redundancy in language supports effective statistical inference procedures
  ∴ the stimulus is richer than it might seem
• Statistical learning theory: the generalization ability of a model class can be measured independently of model representation
• Beyond Markov models: effects of latent conditioning variables can be estimated from data


    ML in NLP

Richness of the Stimulus
• Information about: mutual information
  – between linguistic and non-linguistic events
  – between parts of a linguistic event
• Global coherence:
  banks can now sell stocks and bonds
• Word statistics carry more information than it might seem
  – Markov models in speech recognition
  – Success of the bag-of-words model in information retrieval
  – Statistical machine translation
• How far can these methods go?

ML in NLP

Questions
• Generative or discriminative?
• Structured models: local classification or global constraint satisfaction?
• Does unsupervised learning help?

    ML in NLP

    Classification

    ML in NLP

Generative or Discriminative?
• Generative models
  – Estimate the instance-label distribution p(x, y)
• Discriminative models
  – Estimate the label-given-instance distribution p(y | x)
  – Or minimize an upper bound on the training error ∑_i [[f(x_i) ≠ y_i]]


    ML in NLP

Simple Generative Model
• Binary naïve Bayes: represent instances by sets of binary features
  – Does word occur in document?
• Finite predefined set of classes
• Graphical model: class variable C with feature children F1, F2, F3, F4, ...

  P(F1, ..., Fn, C) = P(C) ∏_i P(Fi | C)
  P(c | F1, ..., Fn) ∝ P(c) ∏_i P(Fi | c)
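To make the counting-based training concrete, here is a minimal sketch of binary naïve Bayes in Python. It is not the tutorial's code: the data format (sets of words paired with a class) and the add-one smoothing are assumptions the slides do not specify.

```python
from collections import defaultdict
import math

def train_nb(docs, vocab):
    """docs: list of (set_of_words, class_label); vocab: set of feature words."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))  # feat_count[c][w] = #docs of class c containing w
    for words, c in docs:
        class_count[c] += 1
        for w in words & vocab:
            feat_count[c][w] += 1
    n = len(docs)
    log_prior = {c: math.log(class_count[c] / n) for c in class_count}
    log_p, log_not_p = {}, {}
    for c in class_count:
        log_p[c], log_not_p[c] = {}, {}
        for w in vocab:
            # P(F_w = 1 | c), with add-one smoothing (an assumption, not from the slides)
            p = (feat_count[c][w] + 1) / (class_count[c] + 2)
            log_p[c][w] = math.log(p)
            log_not_p[c][w] = math.log(1 - p)
    return log_prior, log_p, log_not_p

def classify_nb(words, vocab, log_prior, log_p, log_not_p):
    """Pick argmax_c P(c) * prod_w P(F_w | c), using each feature's on/off value."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c]
        for w in vocab:
            score += log_p[c][w] if w in words else log_not_p[c][w]
        if score > best_score:
            best, best_score = c, score
    return best
```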

    ML in NLP

Generative Claims
• Easy to train: just count
• Language modeling: probability of observed forms
• More robust
  – Small training sets
  – Label noise
• Full advantage of probabilistic methods

ML in NLP

Discriminative Models
• Define a functional form for p(y | x; θ)
• Binary classification: define a discriminant function
  y = sign h(x; θ)
• Adjust parameter(s) θ to maximize the probability of the training labels / minimize error

    ML in NLP

Simple Discriminative Forms
• Linear discriminant function:
  h(x; θ₀, θ₁, ..., θₙ) = θ₀ + ∑_i θ_i f_i(x)
• Logistic form:
  P(+1 | x) = 1 / (1 + exp(−h(x; θ)))
• Multi-class exponential form (maxent):
  h(x, y; θ₀, θ₁, ..., θₙ) = θ₀ + ∑_i θ_i f_i(x, y)
  P(y | x; θ) = exp h(x, y; θ) / ∑_{y′} exp h(x, y′; θ)
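A minimal sketch of the binary logistic form above, trained by gradient ascent on the conditional log-likelihood. The optimizer, learning rate, and data layout are assumptions; the slide only gives the functional form.

```python
import math

def logistic_prob(theta, x):
    """P(+1 | x) = 1 / (1 + exp(-h(x; theta))), h = theta[0] + sum_i theta[i+1] * x[i]."""
    h = theta[0] + sum(t * f for t, f in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-h))

def train_logistic(data, dim, rate=0.1, epochs=100):
    """data: list of (feature_vector, y) with y in {+1, -1}."""
    theta = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in data:
            p = logistic_prob(theta, x)
            err = (1.0 if y == +1 else 0.0) - p   # gradient of the log-likelihood
            theta[0] += rate * err
            for i, f in enumerate(x):
                theta[i + 1] += rate * err * f
    return theta
```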


    ML in NLP

Discriminative Claims
• Focus modeling resources on the instance-to-label mapping
• Avoid restrictive probabilistic assumptions on the instance distribution
• Optimize what you care about
• Higher accuracy

ML in NLP

Classification Tasks
• Document categorization
  – News categorization
  – Message filtering
  – Web page selection
• Tagging
  – Named entity
  – Part of speech
  – Sense disambiguation
• Syntactic decisions
  – Attachment

    ML in NLP

Document Models
• Binary vector: f_t(d) = [t ∈ d]
• Frequency vector:
  tf(d, t) = |{i : d_i = t}|,  idf(t) = |D| / |{d ∈ D : t ∈ d}|
  raw frequency: r_t(d) = tf(d, t)
  TF*IDF: x_t(d) = log(1 + tf(d, t)) · log(1 + idf(t))
• N-gram language model:
  p(d | c) = p(|d| | c) ∏_{i=1..|d|} p(d_i | d_1 ⋯ d_{i−1}; c)
  p(d_i | d_1 ⋯ d_{i−1}; c) ≈ p(d_i | d_{i−n} ⋯ d_{i−1}; c)

ML in NLP

Term Weighting and Feature Selection
• Select or weight the most informative features
• TF*IDF: adjust term weight by how document-specific the term is
• Feature selection:
  – Remove low, unreliable counts
  – Mutual information
  – Information gain
  – Other statistics
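A small sketch of the TF*IDF weighting defined above, x_t(d) = log(1 + tf(d, t)) · log(1 + idf(t)). The document collection and tokenization are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()                          # document frequency of each term
    for tokens in docs:
        df.update(set(tokens))
    idf = {t: n_docs / df[t] for t in df}   # idf(t) = |D| / |{d : t in d}|
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: math.log(1 + tf[t]) * math.log(1 + idf[t]) for t in tf})
    return vectors

# Example: tfidf_vectors([["treasury", "bonds"], ["chemical", "bonds", "bonds"]])
```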


    ML in NLP

Documents vs. Vectors (1)
• Many documents have the same binary or frequency vector
• Document multiplicity must be handled correctly in probability models
• Binary naïve Bayes:
  p(f | c) = ∏_t [ f_t p(t | c) + (1 − f_t)(1 − p(t | c)) ]
• Multiplicity is not recoverable

    ML in NLP

Documents vs. Vectors (2)
• Document probability (unigram language model):
  p(d | c) = p(|d| | c) ∏_{i=1..|d|} p(d_i | c)
• Raw frequency vector probability, with r_t = tf(d, t):
  p(r | c) = p(L | c) L! ∏_t p(t | c)^{r_t} / r_t!   where L = ∑_t r_t

    ML in NLP

Documents vs. Vectors (3)
• Unigram model:
  p(c | d) = p(c) p(|d| | c) ∏_{i=1..|d|} p(d_i | c) / ∑_{c′} p(c′) p(|d| | c′) ∏_{i=1..|d|} p(d_i | c′)
• Vector model:
  p(c | r) = p(c) p(L | c) ∏_t p(t | c)^{r_t} / ∑_{c′} p(c′) p(L | c′) ∏_t p(t | c′)^{r_t}

    ML in NLP

Linear Classifiers
• Embedding into a high-dimensional vector space
• Geometric intuitions and techniques
• Easier separability
  – Increase dimension with interaction terms
  – Nonlinear embeddings (kernels)
• Swiss Army knife


    ML in NLP

Kinds of Linear Classifiers
• Naïve Bayes
• Exponential models
• Large-margin classifiers
  – Support vector machines (SVM)
  – Boosting
• Online methods
  – Perceptron
  – Winnow

ML in NLP

Learning Linear Classifiers
• Rocchio:
  w_k = max( 0, (1/|c|) ∑_{x∈c} x_k − (1/|D−c|) ∑_{x∈D−c} x_k )
• Widrow-Hoff:
  w ← w − 2η (w·x_i − y_i) x_i
• (Balanced) winnow:
  y = sign(w⁺·x − w⁻·x − θ)
  positive error: w⁺ ← α w⁺, w⁻ ← β w⁻, with α > 1 > β > 0
  negative error: w⁺ ← β w⁺, w⁻ ← α w⁻

    ML in NLP

Linear Classification
• Linear discriminant function:
  h(x) = w·x + b = ∑_k w_k x_k + b
  [Figure: two classes of points (x, o) separated by the hyperplane determined by w and b]

    ML in NLP

Margin
• Instance (functional) margin:
  γ_i = y_i (w·x_i + b)
• Normalized (geometric) margin:
  γ_i = y_i ( (w/‖w‖)·x_i + b/‖w‖ )
• Training set margin γ
  [Figure: geometric margins γ_i, γ_j of instances x_i, x_j, and the training set margin γ]


    ML in NLP

Perceptron Algorithm
• Given:
  – Linearly separable training set S
  – Learning rate η > 0

  w ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1...N
      if y_i (w·x_i + b) ≤ 0
        w ← w + η y_i x_i
        b ← b + η y_i R²
  until there are no mistakes

ML in NLP
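The same mistake-driven update, written out as a runnable Python sketch. The data format (plain lists with y ∈ {+1, −1}) and the max_epochs safety cap are assumptions beyond the slide's pseudocode.

```python
def perceptron(data, rate=1.0, max_epochs=100):
    """data: list of (x, y) with x a list of floats, y in {+1, -1}. Returns (w, b)."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    R = max(sum(xi * xi for xi in x) ** 0.5 for x, _ in data)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y * R * R
                mistakes += 1
        if mistakes == 0:          # converged (data assumed linearly separable)
            break
    return w, b
```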

Duality
• The final hypothesis is a linear combination of training points:
  w = ∑_i α_i y_i x_i,  α_i ≥ 0
• Dual perceptron algorithm:

  α ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1...N
      if y_i ( ∑_j α_j y_j x_j·x_i + b ) ≤ 0
        α_i ← α_i + 1
        b ← b + y_i R²
  until there are no mistakes

    ML in NLP

Why Maximize the Margin?
• There is a constant c such that for any data distribution D with support in a ball of radius R and any training sample S of size N drawn from D
  p( err(h) ≤ (c/N) ( (R²/γ²) log² N + log(1/δ) ) ) ≥ 1 − δ
  where γ is the margin of h on S

    ML in NLP

Canonical Hyperplanes
• Multiple representations for the same hyperplane: (λw, λb), λ > 0
• Canonical hyperplane: functional margin = 1
• Geometric margin for a canonical hyperplane:
  γ = ½ ( (w/‖w‖)·x⁺ − (w/‖w‖)·x⁻ ) = (1/(2‖w‖)) ( w·x⁺ − w·x⁻ ) = 1/‖w‖


    ML in NLP

Convex Optimization (1)
• Constrained optimization problem:
  min_{w∈Ω} f(w)  subject to  g_i(w) ≤ 0,  h_j(w) = 0
• Lagrangian function:
  L(w, α, β) = f(w) + ∑_i α_i g_i(w) + ∑_j β_j h_j(w)
• Dual problem:
  max_{α,β} inf_{w∈Ω} L(w, α, β)  subject to  α_i ≥ 0

    ML in NLP

Convex Optimization (2)
• Kuhn-Tucker conditions:
  – f convex
  – g_i, h_j affine (h(w) = Aw − b)
• The solution w*, α*, β* must satisfy:
  ∂L(w*, α*, β*)/∂w = 0
  ∂L(w*, α*, β*)/∂β = 0
  α_i* g_i(w*) = 0
  g_i(w*) ≤ 0
  α_i* ≥ 0
• Complementarity condition: a parameter is non-zero iff its constraint is active

    ML in NLP

Maximizing the Margin (1)
• Given a separable training sample:
  min_{w,b} ‖w‖² = w·w  subject to  y_i (w·x_i + b) ≥ 1
• Lagrangian:
  L(w, b, α) = ½ w·w − ∑_i α_i [ y_i (w·x_i + b) − 1 ]
  ∂L(w, b, α)/∂w = w − ∑_i y_i α_i x_i = 0
  ∂L(w, b, α)/∂b = ∑_i y_i α_i = 0

    ML in NLP

Maximizing the Margin (2)
• Dual Lagrangian at the stationary point:
  W(α) = L(w*, b*, α) = ∑_i α_i − ½ ∑_{i,j} y_i y_j α_i α_j x_i·x_j
• Dual maximization problem:
  max_α W(α)  subject to  α_i ≥ 0,  ∑_i y_i α_i = 0
• Maximum margin weight vector:
  w* = ∑_i y_i α_i* x_i  with margin  γ = 1/‖w*‖ = ( ∑_{i∈sv} α_i* )^{−1/2}


    ML in NLP

Building the Classifier
• Computing the offset (from the primal constraints):
  b* = − ( max_{y_i = −1} w*·x_i + min_{y_i = 1} w*·x_i ) / 2
• Decision function:
  h(x) = sgn( ∑_i y_i α_i* x_i·x + b* )

    ML in NLP

Consequences
• The complementarity condition yields the support vectors:
  α_i [ y_i (w*·x_i + b) − 1 ] = 0
  α_i > 0 ⇒ w*·x_i + b = y_i
• A functional margin of 1 implies minimum geometric margin γ = 1/‖w*‖

    ML in NLP

General SVM Form
• Margin maximization for an arbitrary kernel K:
  max_α ∑_i α_i − ½ ∑_{i,j} y_i y_j α_i α_j K(x_i, x_j)  subject to  α_i ≥ 0,  ∑_i y_i α_i = 0
• Decision rule:
  h(x) = sgn( ∑_i y_i α_i* K(x_i, x) + b* )

    ML in NLP

Soft Margin
• Handles the non-separable case
• Primal problem (2-norm):
  min_{w,b,ξ} w·w + C ∑_i ξ_i²  subject to  y_i (w·x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• Dual problem:
  max_α ∑_i α_i − ½ ∑_{i,j} y_i y_j α_i α_j ( x_i·x_j + (1/C) δ_ij )  subject to  α_i ≥ 0,  ∑_i y_i α_i = 0


    ML in NLP

Conditional Maxent Model
• Model form:
  p(y | x; Λ) = exp( ∑_k λ_k f_k(x, y) ) / Z(x; Λ)
  Z(x; Λ) = ∑_y exp( ∑_k λ_k f_k(x, y) )
• Useful properties:
  – Multi-class
  – May use different features for different classes
  – Training is convex optimization

    ML in NLP

Duality
• Maximize the conditional log-likelihood:
  Λ* = argmax_Λ ∑_i log p(y_i | x_i; Λ)
• Maximizing the conditional entropy
  p* = argmax_p ∑_i [ − ∑_y p(y | x_i) log p(y | x_i) ]
  subject to the constraints
  ∑_i f_k(x_i, y_i) = ∑_i ∑_y p(y | x_i) f_k(x_i, y)
  yields
  p*(y | x) = p(y | x; Λ*)

    ML in NLP

Relationship to (Binary) Logistic Discrimination

  p(+1 | x) = exp( ∑_k λ_k f_k(x, +1) ) / ( exp( ∑_k λ_k f_k(x, +1) ) + exp( ∑_k λ_k f_k(x, −1) ) )
            = 1 / ( 1 + exp( − ∑_k λ_k ( f_k(x, +1) − f_k(x, −1) ) ) )
            = 1 / ( 1 + exp( − ∑_k λ_k g_k(x) ) )

    ML in NLP

Relationship to Linear Discrimination
• Decision rule:
  sign( log p(+1 | x)/p(−1 | x) ) = sign( ∑_k λ_k g_k(x) )
• Bias term: parameter for an always-on feature
• Question: relationship to other trainers for linear discriminant functions


    ML in NLP

Solution Techniques (1)
• Generalized iterative scaling (GIS)
• Parameter updates:
  λ_k ← λ_k + (1/C) log( ∑_i f_k(x_i, y_i) / ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) )
• Requires that features add up to a constant independent of instance or label (add a slack feature):
  ∑_k f_k(x_i, y) = C  ∀ i, y
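A compact sketch of one GIS pass for the conditional maxent model, under the assumption that the feature function already includes a slack feature so that ∑_k f_k(x, y) = C for every (x, y). The data and feature representations are hypothetical.

```python
import math

def p_y_given_x(lam, feats, labels, x):
    """feats(x, y) -> dict {feature_id: value}; returns the conditional maxent distribution."""
    scores = {y: math.exp(sum(lam.get(k, 0.0) * v for k, v in feats(x, y).items())) for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def gis_step(lam, data, feats, labels, C):
    """One GIS update: lam_k += (1/C) * log(observed_k / expected_k)."""
    observed, expected = {}, {}
    for x, y in data:
        for k, v in feats(x, y).items():
            observed[k] = observed.get(k, 0.0) + v
        p = p_y_given_x(lam, feats, labels, x)
        for y2 in labels:
            for k, v in feats(x, y2).items():
                expected[k] = expected.get(k, 0.0) + p[y2] * v
    for k in observed:
        if expected.get(k, 0.0) > 0:
            lam[k] = lam.get(k, 0.0) + math.log(observed[k] / expected[k]) / C
    return lam
```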

    ML in NLP

Solution Techniques (2)
• Improved iterative scaling (IIS)
• Parameter updates: λ_k ← λ_k + δ_k, where δ_k solves
  ∑_i f_k(x_i, y_i) = ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}
  f#(x, y) = ∑_k f_k(x, y)
• For binary features this reduces to solving a polynomial with positive coefficients
• Reduces to GIS if the feature sum is constant

    ML in NLP

Deriving IIS (1)
• Conditional log-likelihood:
  l(Λ) = ∑_i log p(y_i | x_i; Λ)
• Log-likelihood update:
  l(Λ + Δ) − l(Λ) = ∑_i Δ·f(x_i, y_i) − ∑_i log( Z(x_i; Λ + Δ) / Z(x_i; Λ) )
                  = ∑_i Δ·f(x_i, y_i) − ∑_i log ∑_y e^{(Λ+Δ)·f(x_i, y)} / Z(x_i; Λ)
                  = ∑_i Δ·f(x_i, y_i) − ∑_i log ∑_y p(y | x_i; Λ) e^{Δ·f(x_i, y)}
  Using log x ≤ x − 1:
                  ≥ ∑_i Δ·f(x_i, y_i) + N − ∑_i ∑_y p(y | x_i; Λ) e^{Δ·f(x_i, y)}  ≡ A(Δ)

    ML in NLP

Deriving IIS (2)
• By Jensen's inequality:
  A(Δ) ≥ ∑_i Δ·f(x_i, y_i) + N − ∑_i ∑_y p(y | x_i; Λ) ∑_k ( f_k(x_i, y) / f#(x_i, y) ) e^{δ_k f#(x_i, y)}  ≡ B(Δ)
• Maximize the lower bound on the update:
  ∂B(Δ)/∂δ_k = ∑_i f_k(x_i, y_i) − ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}


    ML in NLP

Solution Techniques (3)
• GIS is very slow if the slack variable takes large values
• IIS is faster, but still problematic
• Recent suggestion: use standard convex optimization techniques
  – E.g. conjugate gradient
  – Some evidence of faster convergence

    ML in NLP

Gaussian Prior
• Log-likelihood gradient:
  ∂l(Λ)/∂λ_k = ∑_i f_k(x_i, y_i) − ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) − λ_k/σ_k²
• Modified IIS update: λ_k ← λ_k + δ_k, where δ_k solves
  ∑_i f_k(x_i, y_i) = ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)} + (λ_k + δ_k)/σ_k²
  f#(x, y) = ∑_k f_k(x, y)

    ML in NLP

Instance Representation
• Fixed-size instance (PP attachment): binary features
  – Word identity
  – Word class
• Variable-size instance (document classification)
  – Word identity
  – Word relative frequency in document

ML in NLP

Enriching Features
• Word n-grams
• Sparse word n-grams
• Character n-grams (noisy transcriptions: speech, OCR)
• Unknown-word features: suffixes, capitalization
• Feature combinations (cf. n-grams)


    ML in NLP

    I understood each and every word you said but not the order in which they appeared.

    ML in NLP

Structured Models: Finite State

ML in NLP

Structured Model Applications
• Language modeling
• Story segmentation
• POS tagging
• Information extraction (IE)
• (Shallow) parsing

ML in NLP

Structured Models
• Assign a labeling to a sequence
  – Story segmentation
  – POS tagging
  – Named entity extraction
  – (Shallow) parsing


    ML in NLP

Constraint Satisfaction in Structured Models
• Train to minimize the labeling loss:
  θ* = argmin_θ ∑_i Loss(x_i, y_i | θ)
• Computing the best labeling:
  argmin_y Loss(x, y | θ)
• Efficient minimization requires:
  – A common currency for local labeling decisions
  – An efficient algorithm to combine the decisions

    ML in NLP

Local Classification Models
• Train to minimize the per-decision loss in context:
  θ* = argmin_θ ∑_i ∑_j loss(y_{i,j} | x_i, y_i(j); θ)
• Apply by guessing the context and finding each lowest-loss label


    ML in NLP

Markov's Unreasonable Effectiveness
• Entropy estimates for English:

  model                                bits/char
  human prediction (Cover & King 78)     1.34
  word trigrams (Brown et al 92)         1.75
  compress                               4.43

• Local word relations dominate the statistics (Jelinek)
  [Figure: for each word of a test sentence, its rank in the trigram model's ranked predictions, e.g. 1, 2, 2, 7, 9, 98, 1641]

    ML in NLP

Limits of Markov Models
• No dependency structure
• Likelihoods based on sequencing, not dependency
  [Same figure as above: word ranks under the trigram model]

    ML in NLP

Unseen Events (1)
• What's the probability of unseen events?
• Bias forces nonzero probabilities for some unseen events
• Typical bias: tie the probabilities of related events
  – specific unseen event ← general seen event:  eat pineapple ← eat _
  – event decomposition: event ← event1 + event2:  eat pineapple ← eat _ , _ pineapple
  – Factoring via latent variables:
    P(eat | pineapple) ≈ ∑_C P(eat | C) P(C | pineapple)

    ML in NLP

Unseen Events (2)
• Discount estimates for seen events
• Use the leftover probability for unseen events
• How to allocate the leftover?
  – Back off from the unseen event to less specific seen events: n-gram to (n−1)-gram
  – Hypothesize a hidden cause for unseen events: latent variable model
  – Relate the unseen event to distributionally similar seen events


    ML in NLP

Important Detour: Latent Variable Models

ML in NLP

Expectation-Maximization (EM)
• Latent (hidden) variable models:
  p(y, x, z | Λ),   p(y, x | Λ) = ∑_z p(y, x, z | Λ)
• Examples:
  – Mixture models
  – Class-based models (hidden classes)
  – Hidden Markov models

    ML in NLP

Maximizing Likelihood
• Data log-likelihood:
  D = { (x_1, y_1), ..., (x_N, y_N) }
  L(D | Λ) = ∑_i log p(x_i, y_i | Λ) ∝ ∑_{x,y} p̃(x, y) log p(x, y | Λ)
  p̃(x, y) = |{ i : x_i = x, y_i = y }| / N
• Find the parameters that maximize the (log-)likelihood:
  Λ* = argmax_Λ ∑_{x,y} p̃(x, y) log p(x, y | Λ)

    ML in NLP

Convenient Lower Bounds (1)
• Convex function:
  f( a x₀ + (1 − a) x₁ ) ≤ a f(x₀) + (1 − a) f(x₁)
• Jensen's inequality: if f is convex and p is a probability density,
  f( ∑_x p(x) x ) ≤ ∑_x p(x) f(x)
  [Figure: a convex curve f with the chord between (x₀, f(x₀)) and (x₁, f(x₁)) lying above it]


    ML in NLP

Convenient Lower Bounds (2)
[Figure: the log-likelihood L(D | λ) as a function of λ, together with successive lower bounds; alternating steps E0, M0, E1, ... climb the likelihood]

    ML in NLP

Auxiliary Function
• Find a convenient non-negative function that lower-bounds the likelihood increase:
  L(D | Λ′) − L(D | Λ) ≥ Q(Λ′, Λ) ≥ 0
• Maximize the lower bound:
  Λ_{i+1} = argmax_{Λ′} Q(Λ′, Λ_i)

    ML in NLP

Comments
• Likelihood keeps increasing, but:
  – Can get stuck in a local maximum (or saddle point!)
  – Can oscillate between different local maxima with the same log-likelihood
• If maximizing the auxiliary function is too hard, find some Λ′ that increases the likelihood: generalized EM (GEM)
• The sum over hidden variable values can be exponential if not done carefully (sometimes it is not possible)

    ML in NLP

Example: Mixture Model
• Base distributions: p_i(y), 1 ≤ i ≤ m
• Mixture coefficients: λ_i ≥ 0, ∑_{i=1..m} λ_i = 1
• Mixture distribution:
  p(y | Λ) = ∑_i λ_i p_i(y)


    ML in NLP

Auxiliary Quantities
• Mixture coefficient λ_i = prior probability of being in class i
• Joint probability:
  p(c, y | Λ) = λ_c p_c(y)
• Auxiliary function:
  Q(Λ′, Λ) = ∑_y p̃(y) ∑_c p(c | y, Λ) log( p(y, c | Λ′) / p(y, c | Λ) )

    ML in NLP

Solution
• E step:
  C_i = ∑_y p̃(y) ( λ_i p_i(y) / ∑_j λ_j p_j(y) )
• M step:
  λ_i ← C_i / ∑_j C_j
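The E and M steps above, written as a small EM sketch for a mixture with fixed base distributions; only the mixture coefficients λ are estimated, as on the slide. The sample and base-distribution representations are hypothetical.

```python
def em_mixture(samples, base_dists, iters=50):
    """samples: list of observed y values.
    base_dists: list of functions p_i(y) for the fixed base distributions.
    Returns the mixture coefficients lambda_i."""
    m = len(base_dists)
    lam = [1.0 / m] * m
    for _ in range(iters):
        # E step: expected class counts C_i (summing posteriors over the sample)
        counts = [0.0] * m
        for y in samples:
            joint = [lam[i] * base_dists[i](y) for i in range(m)]
            total = sum(joint)
            for i in range(m):
                counts[i] += joint[i] / total
        # M step: lam_i <- C_i / sum_j C_j
        total_counts = sum(counts)
        lam = [c / total_counts for c in counts]
    return lam
```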

    ML in NLP

    More Finite-State Models

    ML in NLP

Example: Information Extraction
• Given: types of entities and relationships we are interested in
  – People, places, organizations, dates, amounts, materials, processes, ...
  – Employed by, located in, used for, arrived when, ...
• Find all entities and relationships of the given types in the source material
• Collect them in a suitable database


    ML in NLP

IE Example
• Rely on:
  – Syntactic structure
  – Phrase classification

  "Nance, who is also a paid consultant to ABC News, said ..."
  [Figure: "Nance" labeled person, "a paid consultant to ABC News" labeled person-descriptor, "ABC News" labeled organization, with an employee relation and a co-reference link]

    ML in NLP

IE Methods
• Partial matching:
  – Hand-built patterns
  – Automatically-trained hidden Markov models
  – Cascaded finite-state transducers
• Parsing-based:
  – Parse the whole text: shallow parser (chunking), automatically-induced grammar
  – Classify phrases and phrase relations as the desired entities and relationships

    ML in NLP

Global Constraint Models
• Train to minimize the labeling loss:
  θ* = argmin_θ ∑_i Loss(x_i, y_i | θ)
• Computing the best labeling:
  argmin_y Loss(x, y | θ)
• Efficient minimization requires:
  – A common currency for local labeling decisions
  – A dynamic programming algorithm to combine the decisions

    ML in NLP

Local Classification Models
• Train to minimize the per-symbol loss in context:
  θ* = argmin_θ ∑_i ∑_j loss(y_{i,j} | x_i, y_i(j); θ)
• Apply by guessing the context and finding each lowest-loss label


    ML in NLP

Structured Model Claims
• Global constraint
  – Principled
  – Probabilistic interpretation allows model composition
  – Efficient optimal decoding
• Local classifier
  – Wider range of models
  – More efficient training
  – Heuristic decoding comparable to pruning in global models

ML in NLP

Generative vs. Discriminative
• Hidden Markov models (HMMs): generative, global
• Conditional exponential models (MEMMs, CRFs): discriminative, global
• Boosting, winnow: discriminative, local

    ML in NLP

Generative Models
• Stochastic process that generates instance-label pairs
  – Process structure
  – Process parameters
• (Hypothesize structure)
• Estimate parameters from training data

ML in NLP

Model Structure
• Decompose the generation of instances into elementary steps
• Define dependencies between steps
• Parameterize the dependencies
• Useful descriptive language: graphical models


    ML in NLP

Binary Naïve Bayes
• Represent instances by sets of binary features
  – Does word occur in document?
  – ...
• Finite predefined set of classes
• Graphical model: class variable C with feature children F1, F2, F3, F4, ...

  P(F1, ..., Fn, C) = P(C) ∏_i P(Fi | C)
  P(c | F1, ..., Fn) ∝ P(c) ∏_i P(Fi | c)

    ML in NLP

Discrete Hidden Markov Model
• Instances: symbol sequences
• Labels: class sequences
• Graphical model: a class chain C0 → C1 → C2 → ... with an emission Xi from each Ci

  P(X, C) = P(C0) P(X0 | C0) ∏_i P(Ci | Ci−1) P(Xi | Ci)
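A small sketch of the joint probability above, plus Viterbi decoding of the most likely class sequence (Viterbi is only named later in the tutorial). The parameter tables init, trans, and emit are hypothetical nested dicts.

```python
def hmm_joint(init, trans, emit, classes, symbols):
    """P(X, C) = P(C0) P(X0|C0) * prod_i P(Ci|Ci-1) P(Xi|Ci)."""
    p = init[classes[0]] * emit[classes[0]][symbols[0]]
    for i in range(1, len(classes)):
        p *= trans[classes[i - 1]][classes[i]] * emit[classes[i]][symbols[i]]
    return p

def viterbi(init, trans, emit, states, symbols):
    """Most likely class sequence for the observed symbols."""
    V = [{s: init[s] * emit[s][symbols[0]] for s in states}]
    back = [{}]
    for t in range(1, len(symbols)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, V[t - 1][r] * trans[r][s] * emit[s][symbols[t]]) for r in states),
                key=lambda item: item[1])
            V[t][s], back[t][s] = score, prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(symbols) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```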

    ML in NLP

Generating Multiple Features
• Instances: sequences of feature sets
  – Word identity
  – Word properties (e.g. spelling, capitalization)
• Labels: class sequences
• Graphical model: a class chain C0 → C1 → C2 → ... where each Ci emits several features Fi,0, Fi,1, Fi,2
• Limitation: conditionally independent features

ML in NLP

Independence or Intractability
• Trees are good: each node has a single immediate ancestor, so the joint probability is computed in linear time
• But that forces features to be conditionally independent given the class
• Unrealistic:
  – Suffixes and capitalization
  – San and Francisco in the same document


    ML in NLP

Score Card
+ No independence assumptions
+ Richer features: combinations of existing features
− Optimization problem for parameters
− Limited probabilistic interpretation
− Insensitive to input distribution

ML in NLP

Information Extraction with HMMs
• Parameters = P(s|s′), P(o|s) for all states in S = {s1, s2, ...}
• Observations: words
• Training: maximize the probability of the observations (+ prior)
• For IE, states indicate the database field
[Seymore & McCallum 99] [Freitag & McCallum 99]

    ML in NLP

Problems with HMMs
1. Would prefer a richer representation of text: multiple overlapping features, whole chunks of text
   • Example line features:
     – length of line
     – line is centered
     – percent of non-alphabetics
     – total amount of white space
     – line contains two verbs
     – line begins with a number
     – line is grammatically a question
   • Example word features:
     – identity of word
     – word is in all caps
     – word ends in -tion
     – word is part of a noun phrase
     – word is in bold font
     – word is on left hand side of page
     – word is under node X in WordNet
2. HMMs are generative models. Generative models do not easily handle overlapping, non-independent features. Would prefer a conditional model: P({s}|{o}).

    ML in NLP

Solution: Conditional Model
• Hidden Markov model: P(o|s), P(s|s′)
• Maximum entropy Markov model: P(s|o, s′) (represented by an exponential model)
• For the time being, capture the dependency on s′ with |S| independent functions, P_{s′}(s|o)
• Each state contains a next-state classifier black box that, given the next observation, will produce a probability distribution over possible next states, P_{s′}(s|o)


    ML in NLP

Two Sequence Models
[Figure: HMM with dependencies s_{t−1} → s_t → o_t and parameters P(o|s), P(s|s′); MEMM with dependencies s_{t−1} → s_t and o_t → s_t and parameter P(s|o, s′)]
• Standard belief propagation: forward-backward procedure
• Viterbi and Baum-Welch follow naturally

ML in NLP

Transition Features
• Model P_{s′}(s | o) in terms of multiple arbitrary overlapping (binary) features
• Example observation predicates:
  – o is the word "apple"
  – o is capitalized
  – o is on a left-justified line
• A feature f depends on both a predicate b and a destination state s:
  f_{⟨b,s⟩}(o, s′) = 1 if b(o) is true and s′ = s; 0 otherwise

    ML in NLP

Next-State Classifier
• Per-state conditional maxent model:
  P_{s′}(s | o) = (1 / Z(o, s′)) exp( ∑ λ_f f(o, s) )
• Training: each state model is trained independently from labeled sequences
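A minimal sketch of the per-state next-state classifier P_{s′}(s | o): one exponential model per previous state over ⟨predicate, destination-state⟩ features, normalized by Z(o, s′). The weight layout and predicate interface are assumptions.

```python
import math

def next_state_probs(weights, predicates, states, prev_state, obs):
    """weights[prev_state][(i, s)]: weight of feature 'predicate i fires and destination is s'.
    predicates: list of functions b(obs) -> bool. Returns P_{prev_state}(s | obs)."""
    w = weights[prev_state]
    scores = {}
    for s in states:
        total = sum(w.get((i, s), 0.0) for i, b in enumerate(predicates) if b(obs))
        scores[s] = math.exp(total)
    z = sum(scores.values())          # Z(o, s'): normalization within this previous state
    return {s: v / z for s, v in scores.items()}
```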

    ML in NLP

Example: Q-A pairs from FAQ

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal programs to work properly. They are as far as I know the informal standard agreed upon by commercial comms software developers for the Arc.

  Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This is to avoid the well known serial port chip bugs. The modem's DCD (Data Carrier Detect) signal has been re-routed to the Arc's RI (Ring Indicator): most modems broadcast a software RING signal anyway, and even then it's really necessary to detect it for the modem to answer the call.

  2.7) The sound from the speaker port seems quite muffled. How can I get unfiltered sound from an Acorn machine?

  All Acorn machines are equipped with a sound filter designed to remove high frequency harmonics from the sound output. To bypass the filter, hook into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip 39) and hook the capacitor like this:


    ML in NLP

Experimental Data
• 38 files belonging to 7 UseNet FAQs
• Procedure: for each FAQ, train on one file, test on the others; average.
  [Truncated excerpt of the FAQ text above]

    ML in NLP

Features in Experiments
  begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

    ML in NLP

Models Tested
• ME-Stateless: a single maximum entropy classifier applied to each line independently
• TokenHMM: a fully-connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters)
• FeatureHMM: identical to TokenHMM, only the lines in a document are first converted to sequences of features
• MEMM: maximum entropy Markov model

ML in NLP

Results

  Learner        Segmentation precision   Segmentation recall
  ME-Stateless   0.038                    0.362
  TokenHMM       0.276                    0.140
  FeatureHMM     0.413                    0.529
  MEMM           0.867                    0.681


    ML in NLP

• Example (after Bottou 91): the "rib"/"rob" network shown in the Label Bias Experiment below
• Bias toward states with fewer outgoing transitions
• Per-state normalization does not allow the required score(1,2 | ro)


    ML in NLP

Efficient Estimation
• Matrix notation:
  M_t(s′, s | o) = exp Λ_t(s′, s | o)
  Λ_t(s′, s | o) = ∑_f λ_f f(s_{t−1} = s′, s_t = s, o, t)
  P_Λ(s | o) = (1 / Z_Λ(o)) ∏_t M_t(s_{t−1}, s_t | o)
  Z_Λ(o) = ( M_1(o) M_2(o) ⋯ M_{n+1}(o) )_{start,stop}
• Efficient normalization: forward-backward algorithm
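A sketch of the matrix form above: building each transition matrix M_t and computing Z_Λ(o) as the (start, stop) entry of their product. It assumes numpy and a hypothetical log_potential(t, s′, s) function that closes over the observation sequence o.

```python
import numpy as np

def crf_partition(states, n, log_potential):
    """states includes 'start' and 'stop'; log_potential(t, s_prev, s) = Lambda_t(s', s | o).
    Returns Z_Lambda(o) = (M_1(o) M_2(o) ... M_{n+1}(o))_{start,stop}."""
    idx = {s: i for i, s in enumerate(states)}
    prod = np.eye(len(states))
    for t in range(1, n + 2):                       # positions 1 .. n+1
        M = np.zeros((len(states), len(states)))
        for sp in states:
            for s in states:
                M[idx[sp], idx[s]] = np.exp(log_potential(t, sp, s))
        prod = prod @ M
    return prod[idx["start"], idx["stop"]]
```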

    ML in NLP

Forward-Backward Calculations
• For any path function G(s) = ∑_t g_t(s_{t−1}, s_t):
  E_Λ G = ∑_s P_Λ(s | o) G(s)
        = ∑_{t,s′,s} α_t(s′ | o) g_{t+1}(s′, s) M_{t+1}(s′, s | o) β_{t+1}(s | o) / Z_Λ(o)
  α_t(o) = α_{t−1}(o) M_t(o)
  β_t(o) = M_{t+1}(o) β_{t+1}(o)
  Z_Λ(o) = α_{n+1}(end | o) = β_0(start | o)

    ML in NLP

Training
• Maximize
  L(Λ) = ∑_k log P_Λ(s^k | o^k)
• Log-likelihood gradient:
  ∂L(Λ)/∂λ_f = ∑_k #f(s^k | o^k) − ∑_k E_Λ #f(S | o^k)
  #f(s | o) = ∑_t f(s_{t−1}, s_t, o, t)
• Methods: iterative scaling, conjugate gradient
• Comparable to standard Baum-Welch

    ML in NLP

Label Bias Experiment
• Data source: a noisy version of a small finite-state network with two branches from state 0, one spelling r-i-b and one spelling r-o-b
  [Figure: the rib/rob network over states 0-6]
• P(intended symbol) = 29/30, P(other) = 1/30
• Train both an MEMM and a CRF with identical topologies on data from the source
• Compute decoding error: CRF 4.6%, MEMM 42% (2,000 training samples, 500 test)


    ML in NLP

Mixed-Order Sources
• Data generated by mixing sparse first- and second-order HMMs with a varying mixing coefficient
• Modeled by first-order HMM, MEMM and CRF (without contextual or overlapping features)

    ML in NLP

Part-of-Speech Tagging
• Trained on 50% of the 1.1 million words in the Penn treebank. In this set, 5.45% of the words occur only once and were mapped to oov.
• Experiments with two different sets of features:
  – traditional: just the words
  – take advantage of the power of conditional models: use words plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

ML in NLP

POS Tagging Results

  model   error   oov error
  HMM     5.69%   45.99%
  MEMM    6.37%   54.61%
  CRF     5.55%   48.05%
  MEMM+   4.81%   26.99%
  CRF+    4.27%   23.76%

    ML in NLP

Structured Models: Stochastic Grammars


    ML in NLP

Beyond Finite State
• Constituency and meaningful relations:
  What sleeps? How does it sleep? What kind of ideas?
• Related sentences:
  Sleep furiously is all that colorless green ideas do.
  [Figure: parse tree for "revolutionary new ideas advance slowly": S → NP VP; NP → Adj (revolutionary) N′; N′ → Adj (new) N′; N′ → N (ideas); VP → VP (V advance) Adv (slowly)]

    ML in NLP

Stochastic Context-Free Grammars
[Figure: the parse tree for "revolutionary new ideas advance slowly" as above]

  S  → NP VP    1.0
  NP → Det N′   0.2
  NP → N′       0.8
  N′ → Adj N′   0.3
  N′ → N        0.7
  VP → VP Adv   0.4
  VP → V        0.6
  N  → ideas    0.1
  N  → people   0.2
  V  → sleep    0.3
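Because derivation steps are independent (context-freeness, next slide but one), the probability of a tree is just the product of the probabilities of the rules it uses. Here is a small sketch over the toy grammar above; the nested-tuple tree encoding is a hypothetical representation, not the tutorial's.

```python
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N'")): 0.2,
    ("NP", ("N'",)): 0.8,
    ("N'", ("Adj", "N'")): 0.3,
    ("N'", ("N",)): 0.7,
    ("VP", ("VP", "Adv")): 0.4,
    ("VP", ("V",)): 0.6,
    ("N", ("ideas",)): 0.1,
    ("N", ("people",)): 0.2,
    ("V", ("sleep",)): 0.3,
}

def tree_prob(tree):
    """tree = (label, child1, child2, ...); leaves are plain strings.
    P(tree) = product of P(rule) over the rules used in the derivation."""
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROBS[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# Example: tree_prob(("NP", ("N'", ("N", "ideas"))))  ->  0.8 * 0.7 * 0.1
```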

    ML in NLP

Stochastic CFG Inference
• Inside-outside algorithm (Baker 79): find rule probabilities that locally maximize the likelihood of a training corpus (an instance of EM)
• Extended inside-outside algorithm: use information about training-corpus phrase structure to guide rule probability reestimation
  – Better modeling of phrase structure
  – Improved convergence
  – Improved computational complexity

    ML in NLP

SCFG Derivations
• Context-freeness ⇒ independence of derivation steps
  [Figure: the probability of the full tree for "revolutionary new ideas advance slowly" factors into the product of the probabilities of its sub-derivations, e.g. P(N′ → Adj N′) times the probabilities of the subtrees]


    ML in NLP

Inside-Outside Reestimation
• Ratios of expected rule frequencies to expected phrase frequencies:
  P_{n+1}(N′ → Adj N′) = E_n(# N′ → Adj N′) / E_n(# N′)
• Computed from the probabilities of each phrase and phrase context in the training corpus
  [Figure: a phrase and its phrase context within the tree for "revolutionary new ideas advance slowly"]
• Iterate

    ML in NLP

Problems with I-O Reestimation
• Hill-climbing procedure: sensitivity to initial rule probabilities
• Does not learn grammar structure directly: structure is only implicit in the rule probabilities
• Linguistically inadequate grammars: high mutual-information sequences are grouped into phrases
  ((What (((is the) cheapest) fare)) ((I can) (get ?)))
  Contrast: Is $300 the cheapest fare?

    ML in NLP

(Partially) Bracketed Text
• Hand-annotated text with (some) phrase boundaries:
  (((List (the fares (for ((flight) (number 891))))) .)
• Use only derivations compatible with the training bracketing

    ML in NLP

Predictive Power
[Plots: cross entropy vs. iteration (0-80) on training data and on test data, comparing raw and bracketed training]


    ML in NLP

Bracketing Accuracy
• Accuracy criterion: proportion of phrases in the most likely analysis compatible with the treebank bracketing
  [Plot: bracketing accuracy vs. iteration for raw vs. bracketed test data]
• Conclusion: structure is not evident from distribution alone

    ML in NLP

Limitations of SCFGs
• Likelihoods independent of particular words
• Markovian assumption on syntactic categories
  [Figure: Markov (sequential) vs. hierarchical dependencies over "We need to resolve the issue"]

ML in NLP

Lexicalization
• Markov model vs. dependency model
  [Figure: sequential bigram links vs. head-dependency links over "We need to resolve the issue"]

    ML in NLP

Best Current Models
• Representation: surface trees with head-word propagation
• Generative power still context-free
• Model variables: head word, dependency type, argument vs. adjunct, heaviness, slash
• Main challenge: smoothing method for unseen dependencies
• Learned from hand-parsed text (treebank)
• Around 90% constituent accuracy


    ML in NLP

Lexicalized Tree (Collins 98)

  Dependency         Direction   Relation
  workers → dumped   L           NP, S, VP
  sacks → dumped     R           NP, VP, V
  into → dumped      R           PP, VP, V
  a → bin            L           D, NP, N
  bin → into         R           NP, PP, P

  [Figure: lexicalized tree for "workers dumped sacks into a bin": S(dumped) → NP-C(workers) VP(dumped); VP(dumped) → V(dumped) NP-C(sacks) PP(into); PP(into) → P(into) NP-C(bin); NP-C(bin) → D(a) N(bin)]

    ML in NLP

    Inducing Representations

    ML in NLP

Unsupervised Learning
• Latent variable models
  – Model observables from latent variables
  – Search for a good set of latent variables
• Information bottleneck
  – Find an efficient compression of some observables
  – preserving the information about other observables

ML in NLP

Do Induced Classes Help?
• Generalization
  – Better statistics for coarser events
• Dimensionality reduction
  – Smaller models
  – Improved classification accuracy


    ML in NLP

Chomsky's Challenge to Empiricism

    (1) Colorless green ideas sleep furiously.

    (2) Furiously sleep ideas green colorless.

It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

    Chomsky 57

    ML in NLP

Complex Events
• What Chomsky was talking about: in Markov models, the state is just a record of observations
• But statistical models can have hidden state:
  – representation of past experience
  – uncertainty about the correct grammar
  – uncertainty about the correct interpretation of experience: ambiguity
• Probabilistic relationships involving hidden variables can be induced from observable data alone: EM algorithm

    ML in NLP

In Any Model?
• Factored bigram model:
  P(w_{i+1} | w_i) ≈ ∑_{c=1..16} P(w_{i+1} | c) P(c | w_i)
  P(w_1 ⋯ w_n) ≈ P(w_1) ∏_{i=2..n} P(w_i | w_{i−1})
• Trained for large-vocabulary speech recognition from newswire text by EM
  P(colorless green ideas sleep furiously) / P(furiously sleep ideas green colorless) ≈ 2 × 10⁵

    ML in NLP

Distributional Clustering
• Automatic grouping of words according to the contexts in which they appear
• Approach to data sparseness: approximate the distribution of a relatively rare event (word) by the collective distribution of similar events (cluster)
• Sense ambiguity → membership in several soft clusters
• Case study: cluster nouns according to the verbs that take them as direct objects


    ML in NLP

Training Data
• Universe: two word classes V and N, a single relation between them (e.g. main verb ↔ head noun of the verb's direct object)
• Data: frequencies f_vn of (v, n) pairs extracted from text by parsing or pattern matching

    ML in NLP

Distributional Representation
• Describing n ∈ N: use the conditional distribution p(V | n)
  [Bar chart: relative frequencies of the verbs buy, hold, issue, own, purchase, select, sell, tender, trade as contexts of the nouns "stock" and "bond"]

    ML in NLP

Reminder: Bottleneck Model
• Markov condition:
  p(ñ | v) = ∑_n p(ñ | n) p(n | v)
• Find p(Ñ | N) to maximize the mutual information I(Ñ, V) for fixed I(Ñ, N)
  I(Ñ, V) = ∑_{ñ,v} p(ñ, v) log( p(ñ, v) / (p(ñ) p(v)) )
  [Diagram: N → Ñ → V, with the quantities I(Ñ,N), I(N,V), I(Ñ,V)]
• Solution:
  p(ñ | n) = ( p(ñ) / Z_n ) exp( −β D_KL( p(V | n) ‖ p(V | ñ) ) )

ML in NLP
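A sketch of the soft cluster assignment above, p(ñ | n) ∝ p(ñ) exp(−β D_KL(p(V|n) ‖ p(V|ñ))). Distributions are plain dicts over verbs; the eps guard against zero entries in the KL divergence is an assumption to keep it finite.

```python
import math

def kl(p, q, eps=1e-12):
    """D_KL(p || q) over a shared verb vocabulary (eps guards against zeros in q)."""
    return sum(pv * math.log(pv / max(q.get(v, 0.0), eps)) for v, pv in p.items() if pv > 0)

def soft_assignments(p_v_given_n, p_v_given_centroid, p_centroid, beta):
    """p(n~ | n) = p(n~)/Z_n * exp(-beta * D_KL(p(V|n) || p(V|n~))) for each noun n."""
    out = {}
    for n, pn in p_v_given_n.items():
        weights = {c: p_centroid[c] * math.exp(-beta * kl(pn, pc))
                   for c, pc in p_v_given_centroid.items()}
        z = sum(weights.values())
        out[n] = {c: w / z for c, w in weights.items()}
    return out
```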

Search for Cluster Solutions
• The scale parameter β (inverse temperature) determines how much a noun contributes to nearby centroids
• β increases ⇒ clusters split ⇒ hierarchical clustering
  [Figure: clusters splitting as β grows from β₀]


    ML in NLP

Small Example
• Cluster the 64 most common direct objects of "fire" in the 1988 Associated Press newswire; four of the resulting clusters:

  missile 0.835, rocket 0.850, bullet 0.917, gun 0.940
  officer 0.484, aide 0.612, chief 0.649, manager 0.651
  shot 0.858, bullet 0.925, rocket 0.930, missile 1.037
  gun 0.758, missile 0.786, weapon 0.862, rocket 0.875

    ML in NLP

Mutual Information Ratios
[Plot: the ratio I(Ñ,V)/I(Ñ,N) against I(Ñ,N)/H(N), traced out as β increases]

    ML in NLP

Using Clusters for Prediction
• Model verb-object associations through object clusters:
  p̂(v | n) = ∑_ñ p(v | ñ) p(ñ | n)
  p̂(v | n) = (1/Z_n) ∑_ñ p(v | ñ) p(ñ | n)
  (one form was used in the experiments, the other is noted as more appropriate)
• Depends on β
• Intuition: the associations of a word are a mixture of the associations of the sense classes to which the word belongs

    ML in NLP

Evaluation
• Relative entropy of held-out data to the asymmetric model
• Decision task: which of two verbs is more likely to take a noun as direct object, estimated from the model for training data in which the pairs relating the noun to one of the verbs have been deleted


    ML in NLP

Relative Entropy
[Plot: average relative entropy (bits) vs. number of clusters (0-600) for train, test, and new data]
• Train: 756,721 verb-object pair training set
• Test: 81,240 pair held-out test set
• New: held-out data for 1000 nouns not in the training data
• Verb-object pairs from the 1988 AP Newswire

    ML in NLP

Decision Task
[Plot: decision error vs. number of clusters (0-500), for all pairs and for exceptional pairs]
• Which of v1 and v2 is more likely to take object o?
• Held out: 10⁴ (v2, o) pairs; need to guess from p_c(v2)
• Testing: compare all v1 and v2 such that (v2, o) was held out and (v1, o) occurs
  – All: all test data
  – Exceptional: verb pairs whose frequency ratio with o is reversed from their overall frequency ratio