

Machine Learning in Natural Language Processing

Fernando Pereira, University of Pennsylvania

NASSLLI, June 2002. Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee, Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby

    ML in NLP

    Introduction

    ML in NLP

Why ML in NLP
• Examples are easier to create than rules
• Rule writers miss low-frequency cases
• Many factors are involved in language interpretation
• People do it
  – AI
  – Cognitive science
• Let the computer do it
  – Moore's law
  – storage
  – lots of data

ML in NLP

Classification
• Document topic: politics, business, national, environment
• Word sense: treasury bonds vs. chemical bonds


    ML in NLP

Analysis
• Tagging: e.g. causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ
• Parsing: e.g. the lexicalized tree for "workers dumped sacks into a bin":
  [S(dumped) [NP-C(workers) [N(workers) workers]] [VP(dumped) [V(dumped) dumped] [NP-C(sacks) [N(sacks) sacks]] [PP(into) [P(into) into] [NP-C(bin) [D(a) a] [N(bin) bin]]]]]

    ML in NLP

Language Modeling
• Is this a likely English sentence?
  P(colorless green ideas sleep furiously) / P(furiously sleep ideas green colorless) ≈ 2 × 10⁵
• Disambiguate noisy transcription:
  It's easy to wreck a nice beach
  It's easy to recognize speech

    ML in NLP

Inference
• Translation:
  ligações covalentes → covalent bonds
  obrigações do tesouro → treasury bonds
• Information extraction:

    Sara Lee to Buy 30% of DIM

Chicago, March 3 - Sara Lee Corp said it agreed to buy a 30 percent interest in Paris-based DIM S.A., a subsidiary of BIC S.A., at a cost of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of about 2 million dollars.

The investment includes the purchase of 5 million newly issued DIM shares valued at about 5 million dollars, and a loan of about 15 million dollars, it said. The loan is convertible into an additional 16 million DIM shares, it noted.

The proposed agreement is subject to approval by the French government, it said.

(extracted roles: acquirer, acquired)

    ML in NLP

Machine Learning Approach
• Algorithms that write programs
• Specify:
  – Form of output programs
  – Accuracy criterion
• Input: set of training examples
• Output: program that performs as accurately as possible on the training examples
• But will it work on new examples?


    ML in NLP

Fundamental Questions
• Generalization: is the learned program useful on new examples?
  – Statistical learning theory: quantifiable tradeoffs between number of examples, complexity of the program class, and generalization error
• Computational tractability: can we find a good program quickly?
  – If not, can we find a good approximation?
• Adaptation: can the program learn quickly from new evidence?
  – Information-theoretic analysis: relationship between adaptation and compression

    ML in NLP

Learning Tradeoffs
[Plot: error of the best program vs. program class complexity, for training data and for testing on new examples, ranging from rote learning to overfitting]

    ML in NLP

Machine Learning Methods
• Classifiers
  – Document classification
  – Disambiguation
• Structured models
  – Tagging
  – Parsing
  – Extraction
• Unsupervised learning
  – Generalization
  – Structure induction

    ML in NLP

Jargon
• Instance: event type of interest
  – Document and its class
  – Sentence and its analysis
  – ...
• Supervised learning: learn a classification function from hand-labeled instances
• Unsupervised learning: exploit correlations to organize training instances
• Generalization: how well does it work on unseen data?
• Features: map an instance to a set of elementary events


    ML in NLP

Classification Ideas
• Represent instances by feature vectors
  – Content
  – Context
• Learn a function from feature vectors to
  – Class
  – Class-probability distribution
• Redundancy is our friend: many weak clues

ML in NLP

Structured Model Ideas
• Interdependent decisions
  – Successive parts of speech
  – Parsing/generation steps
  – Lexical choice (parsing, translation)
• Combining decisions
  – Sequential decisions
  – Generative models
  – Constraint satisfaction

    ML in NLP

Unsupervised Learning Ideas
• Clustering: class induction
• Latent variables
  "I'm thinking of sports" → more sporty words
• Distributional regularities
  Know words by the company they keep
• Data compression
• Infer dependencies among variables: structure learning

ML in NLP

Methodological Detour

• Empiricist/information-theoretic view: words combine following their associations in previous material
• Rationalist/generative view: words combine according to a formal grammar in the class of possible natural-language grammars


    ML in NLP

Chomsky's Challenge to Empiricism

    (1) Colorless green ideas sleep furiously.

    (2) Furiously sleep ideas green colorless.

It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

    Chomsky 57

    ML in NLP

The Return of Empiricism
• Empiricist methods work:
  – Markov models can capture a surprising fraction of the unpredictability in language
  – Statistical information retrieval methods beat alternatives
  – Statistical parsers are more accurate than competitors based on rationalist methods
  – Machine-learning and statistical techniques come close to human performance in part-of-speech tagging and sense disambiguation
• Just engineering tricks?

    ML in NLP

Unseen Events
• Chomsky's implicit assumption: any model must assign zero probability to unseen events
  – naïve estimation of Markov model probabilities from frequencies
  – no latent (hidden) events
• Any such model overfits the data: many events are likely to be missing in any finite sample
∴ The learned model cannot generalize to unseen data
∴ Support for poverty-of-the-stimulus arguments

    ML in NLP

The Science of Modeling
• Probability estimates can be smoothed to accommodate unseen events
• Redundancy in language supports effective statistical inference procedures
  ∴ the stimulus is richer than it might seem
• Statistical learning theory: the generalization ability of a model class can be measured independently of model representation
• Beyond Markov models: effects of latent conditioning variables can be estimated from data


    ML in NLP

Richness of the Stimulus
• Information about: mutual information
  – between linguistic and non-linguistic events
  – between parts of a linguistic event
• Global coherence:
  banks can now sell stocks and bonds
• Word statistics carry more information than it might seem
  – Markov models in speech recognition
  – Success of the bag-of-words model in information retrieval
  – Statistical machine translation
• How far can these methods go?

ML in NLP

Questions
• Generative or discriminative?
• Structured models: local classification or global constraint satisfaction?
• Does unsupervised learning help?

    ML in NLP

    Classification

    ML in NLP

Generative or Discriminative?
• Generative models
  – Estimate the instance-label distribution p(x, y)
• Discriminative models
  – Estimate the label-given-instance distribution p(y | x)
  – Or minimize an upper bound on the training error ∑_i [[f(x_i) ≠ y_i]]


    ML in NLP

Simple Generative Model
• Binary naïve Bayes: represent instances by sets of binary features
  – Does word occur in document?
• Finite predefined set of classes
• Graphical model: class variable C with feature children F1, F2, F3, F4, ...

  P(F1, ..., Fn, C) = P(C) ∏_i P(Fi | C)
  P(c | F1, ..., Fn) ∝ P(c) ∏_i P(Fi | c)
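To make the counting-based training concrete, here is a minimal sketch of binary naïve Bayes in Python. It is not the tutorial's code: the data format (sets of words paired with a class) and the add-one smoothing are assumptions the slides do not specify.

```python
from collections import defaultdict
import math

def train_nb(docs, vocab):
    """docs: list of (set_of_words, class_label); vocab: set of feature words."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))  # feat_count[c][w] = #docs of class c containing w
    for words, c in docs:
        class_count[c] += 1
        for w in words & vocab:
            feat_count[c][w] += 1
    n = len(docs)
    log_prior = {c: math.log(class_count[c] / n) for c in class_count}
    log_p, log_not_p = {}, {}
    for c in class_count:
        log_p[c], log_not_p[c] = {}, {}
        for w in vocab:
            # P(F_w = 1 | c), with add-one smoothing (an assumption, not from the slides)
            p = (feat_count[c][w] + 1) / (class_count[c] + 2)
            log_p[c][w] = math.log(p)
            log_not_p[c][w] = math.log(1 - p)
    return log_prior, log_p, log_not_p

def classify_nb(words, vocab, log_prior, log_p, log_not_p):
    """Pick argmax_c P(c) * prod_w P(F_w | c), using each feature's on/off value."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c]
        for w in vocab:
            score += log_p[c][w] if w in words else log_not_p[c][w]
        if score > best_score:
            best, best_score = c, score
    return best
```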

    ML in NLP

Generative Claims
• Easy to train: just count
• Language modeling: probability of observed forms
• More robust
  – Small training sets
  – Label noise
• Full advantage of probabilistic methods

ML in NLP

Discriminative Models
• Define a functional form for p(y | x; θ)
• Binary classification: define a discriminant function
  y = sign h(x; θ)
• Adjust parameter(s) θ to maximize the probability of the training labels / minimize error

    ML in NLP

Simple Discriminative Forms
• Linear discriminant function:
  h(x; θ₀, θ₁, ..., θₙ) = θ₀ + ∑_i θ_i f_i(x)
• Logistic form:
  P(+1 | x) = 1 / (1 + exp(−h(x; θ)))
• Multi-class exponential form (maxent):
  h(x, y; θ₀, θ₁, ..., θₙ) = θ₀ + ∑_i θ_i f_i(x, y)
  P(y | x; θ) = exp h(x, y; θ) / ∑_{y′} exp h(x, y′; θ)
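A minimal sketch of the binary logistic form above, trained by gradient ascent on the conditional log-likelihood. The optimizer, learning rate, and data layout are assumptions; the slide only gives the functional form.

```python
import math

def logistic_prob(theta, x):
    """P(+1 | x) = 1 / (1 + exp(-h(x; theta))), h = theta[0] + sum_i theta[i+1] * x[i]."""
    h = theta[0] + sum(t * f for t, f in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-h))

def train_logistic(data, dim, rate=0.1, epochs=100):
    """data: list of (feature_vector, y) with y in {+1, -1}."""
    theta = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in data:
            p = logistic_prob(theta, x)
            err = (1.0 if y == +1 else 0.0) - p   # gradient of the log-likelihood
            theta[0] += rate * err
            for i, f in enumerate(x):
                theta[i + 1] += rate * err * f
    return theta
```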


    ML in NLP

Discriminative Claims
• Focus modeling resources on the instance-to-label mapping
• Avoid restrictive probabilistic assumptions on the instance distribution
• Optimize what you care about
• Higher accuracy

ML in NLP

Classification Tasks
• Document categorization
  – News categorization
  – Message filtering
  – Web page selection
• Tagging
  – Named entity
  – Part of speech
  – Sense disambiguation
• Syntactic decisions
  – Attachment

    ML in NLP

Document Models
• Binary vector: f_t(d) = [t ∈ d]
• Frequency vector:
  tf(d, t) = |{i : d_i = t}|,  idf(t) = |D| / |{d ∈ D : t ∈ d}|
  raw frequency: r_t(d) = tf(d, t)
  TF*IDF: x_t(d) = log(1 + tf(d, t)) · log(1 + idf(t))
• N-gram language model:
  p(d | c) = p(|d| | c) ∏_{i=1..|d|} p(d_i | d_1 ⋯ d_{i−1}; c)
  p(d_i | d_1 ⋯ d_{i−1}; c) ≈ p(d_i | d_{i−n} ⋯ d_{i−1}; c)

ML in NLP

Term Weighting and Feature Selection
• Select or weight the most informative features
• TF*IDF: adjust term weight by how document-specific the term is
• Feature selection:
  – Remove low, unreliable counts
  – Mutual information
  – Information gain
  – Other statistics
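A small sketch of the TF*IDF weighting defined above, x_t(d) = log(1 + tf(d, t)) · log(1 + idf(t)). The document collection and tokenization are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()                          # document frequency of each term
    for tokens in docs:
        df.update(set(tokens))
    idf = {t: n_docs / df[t] for t in df}   # idf(t) = |D| / |{d : t in d}|
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: math.log(1 + tf[t]) * math.log(1 + idf[t]) for t in tf})
    return vectors

# Example: tfidf_vectors([["treasury", "bonds"], ["chemical", "bonds", "bonds"]])
```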


    ML in NLP

Documents vs. Vectors (1)
• Many documents have the same binary or frequency vector
• Document multiplicity must be handled correctly in probability models
• Binary naïve Bayes:
  p(f | c) = ∏_t [ f_t p(t | c) + (1 − f_t)(1 − p(t | c)) ]
• Multiplicity is not recoverable

    ML in NLP

Documents vs. Vectors (2)
• Document probability (unigram language model):
  p(d | c) = p(|d| | c) ∏_{i=1..|d|} p(d_i | c)
• Raw frequency vector probability, with r_t = tf(d, t):
  p(r | c) = p(L | c) L! ∏_t p(t | c)^{r_t} / r_t!   where L = ∑_t r_t

    ML in NLP

Documents vs. Vectors (3)
• Unigram model:
  p(c | d) = p(c) p(|d| | c) ∏_{i=1..|d|} p(d_i | c) / ∑_{c′} p(c′) p(|d| | c′) ∏_{i=1..|d|} p(d_i | c′)
• Vector model:
  p(c | r) = p(c) p(L | c) ∏_t p(t | c)^{r_t} / ∑_{c′} p(c′) p(L | c′) ∏_t p(t | c′)^{r_t}

    ML in NLP

Linear Classifiers
• Embedding into a high-dimensional vector space
• Geometric intuitions and techniques
• Easier separability
  – Increase dimension with interaction terms
  – Nonlinear embeddings (kernels)
• Swiss Army knife


    ML in NLP

Kinds of Linear Classifiers
• Naïve Bayes
• Exponential models
• Large-margin classifiers
  – Support vector machines (SVM)
  – Boosting
• Online methods
  – Perceptron
  – Winnow

ML in NLP

Learning Linear Classifiers
• Rocchio:
  w_k = max( 0, (1/|c|) ∑_{x∈c} x_k − (1/|D−c|) ∑_{x∈D−c} x_k )
• Widrow-Hoff:
  w ← w − 2η (w·x_i − y_i) x_i
• (Balanced) winnow:
  y = sign(w⁺·x − w⁻·x − θ)
  positive error: w⁺ ← α w⁺, w⁻ ← β w⁻, with α > 1 > β > 0
  negative error: w⁺ ← β w⁺, w⁻ ← α w⁻

    ML in NLP

Linear Classification
• Linear discriminant function:
  h(x) = w·x + b = ∑_k w_k x_k + b
  [Figure: two classes of points (x, o) separated by the hyperplane determined by w and b]

    ML in NLP

Margin
• Instance (functional) margin:
  γ_i = y_i (w·x_i + b)
• Normalized (geometric) margin:
  γ_i = y_i ( (w/‖w‖)·x_i + b/‖w‖ )
• Training set margin γ
  [Figure: geometric margins γ_i, γ_j of instances x_i, x_j, and the training set margin γ]


    ML in NLP

Perceptron Algorithm
• Given:
  – Linearly separable training set S
  – Learning rate η > 0

  w ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1...N
      if y_i (w·x_i + b) ≤ 0
        w ← w + η y_i x_i
        b ← b + η y_i R²
  until there are no mistakes

ML in NLP
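The same mistake-driven update, written out as a runnable Python sketch. The data format (plain lists with y ∈ {+1, −1}) and the max_epochs safety cap are assumptions beyond the slide's pseudocode.

```python
def perceptron(data, rate=1.0, max_epochs=100):
    """data: list of (x, y) with x a list of floats, y in {+1, -1}. Returns (w, b)."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    R = max(sum(xi * xi for xi in x) ** 0.5 for x, _ in data)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y * R * R
                mistakes += 1
        if mistakes == 0:          # converged (data assumed linearly separable)
            break
    return w, b
```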

Duality
• The final hypothesis is a linear combination of training points:
  w = ∑_i α_i y_i x_i,  α_i ≥ 0
• Dual perceptron algorithm:

  α ← 0; b ← 0; R = max_i ‖x_i‖
  repeat
    for i = 1...N
      if y_i ( ∑_j α_j y_j x_j·x_i + b ) ≤ 0
        α_i ← α_i + 1
        b ← b + y_i R²
  until there are no mistakes

    ML in NLP

Why Maximize the Margin?
• There is a constant c such that for any data distribution D with support in a ball of radius R and any training sample S of size N drawn from D
  p( err(h) ≤ (c/N) ( (R²/γ²) log² N + log(1/δ) ) ) ≥ 1 − δ
  where γ is the margin of h on S

    ML in NLP

Canonical Hyperplanes
• Multiple representations for the same hyperplane: (λw, λb), λ > 0
• Canonical hyperplane: functional margin = 1
• Geometric margin for a canonical hyperplane:
  γ = ½ ( (w/‖w‖)·x⁺ − (w/‖w‖)·x⁻ ) = (1/(2‖w‖)) ( w·x⁺ − w·x⁻ ) = 1/‖w‖


    ML in NLP

Convex Optimization (1)
• Constrained optimization problem:
  min_{w∈Ω} f(w)  subject to  g_i(w) ≤ 0,  h_j(w) = 0
• Lagrangian function:
  L(w, α, β) = f(w) + ∑_i α_i g_i(w) + ∑_j β_j h_j(w)
• Dual problem:
  max_{α,β} inf_{w∈Ω} L(w, α, β)  subject to  α_i ≥ 0

    ML in NLP

Convex Optimization (2)
• Kuhn-Tucker conditions:
  – f convex
  – g_i, h_j affine (h(w) = Aw − b)
• The solution w*, α*, β* must satisfy:
  ∂L(w*, α*, β*)/∂w = 0
  ∂L(w*, α*, β*)/∂β = 0
  α_i* g_i(w*) = 0
  g_i(w*) ≤ 0
  α_i* ≥ 0
• Complementarity condition: a parameter is non-zero iff its constraint is active

    ML in NLP

Maximizing the Margin (1)
• Given a separable training sample:
  min_{w,b} ‖w‖² = w·w  subject to  y_i (w·x_i + b) ≥ 1
• Lagrangian:
  L(w, b, α) = ½ w·w − ∑_i α_i [ y_i (w·x_i + b) − 1 ]
  ∂L(w, b, α)/∂w = w − ∑_i y_i α_i x_i = 0
  ∂L(w, b, α)/∂b = ∑_i y_i α_i = 0

    ML in NLP

Maximizing the Margin (2)
• Dual Lagrangian at the stationary point:
  W(α) = L(w*, b*, α) = ∑_i α_i − ½ ∑_{i,j} y_i y_j α_i α_j x_i·x_j
• Dual maximization problem:
  max_α W(α)  subject to  α_i ≥ 0,  ∑_i y_i α_i = 0
• Maximum margin weight vector:
  w* = ∑_i y_i α_i* x_i  with margin  γ = 1/‖w*‖ = ( ∑_{i∈sv} α_i* )^{−1/2}


    ML in NLP

Building the Classifier
• Computing the offset (from the primal constraints):
  b* = − ( max_{y_i = −1} w*·x_i + min_{y_i = 1} w*·x_i ) / 2
• Decision function:
  h(x) = sgn( ∑_i y_i α_i* x_i·x + b* )

    ML in NLP

Consequences
• The complementarity condition yields the support vectors:
  α_i [ y_i (w*·x_i + b) − 1 ] = 0
  α_i > 0 ⇒ w*·x_i + b = y_i
• A functional margin of 1 implies minimum geometric margin γ = 1/‖w*‖

    ML in NLP

General SVM Form
• Margin maximization for an arbitrary kernel K:
  max_α ∑_i α_i − ½ ∑_{i,j} y_i y_j α_i α_j K(x_i, x_j)  subject to  α_i ≥ 0,  ∑_i y_i α_i = 0
• Decision rule:
  h(x) = sgn( ∑_i y_i α_i* K(x_i, x) + b* )

    ML in NLP

Soft Margin
• Handles the non-separable case
• Primal problem (2-norm):
  min_{w,b,ξ} w·w + C ∑_i ξ_i²  subject to  y_i (w·x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• Dual problem:
  max_α ∑_i α_i − ½ ∑_{i,j} y_i y_j α_i α_j ( x_i·x_j + (1/C) δ_ij )  subject to  α_i ≥ 0,  ∑_i y_i α_i = 0


    ML in NLP

Conditional Maxent Model
• Model form:
  p(y | x; Λ) = exp( ∑_k λ_k f_k(x, y) ) / Z(x; Λ)
  Z(x; Λ) = ∑_y exp( ∑_k λ_k f_k(x, y) )
• Useful properties:
  – Multi-class
  – May use different features for different classes
  – Training is convex optimization

    ML in NLP

Duality
• Maximize the conditional log-likelihood:
  Λ* = argmax_Λ ∑_i log p(y_i | x_i; Λ)
• Maximizing the conditional entropy
  p* = argmax_p ∑_i [ − ∑_y p(y | x_i) log p(y | x_i) ]
  subject to the constraints
  ∑_i f_k(x_i, y_i) = ∑_i ∑_y p(y | x_i) f_k(x_i, y)
  yields
  p*(y | x) = p(y | x; Λ*)

    ML in NLP

Relationship to (Binary) Logistic Discrimination

  p(+1 | x) = exp( ∑_k λ_k f_k(x, +1) ) / ( exp( ∑_k λ_k f_k(x, +1) ) + exp( ∑_k λ_k f_k(x, −1) ) )
            = 1 / ( 1 + exp( − ∑_k λ_k ( f_k(x, +1) − f_k(x, −1) ) ) )
            = 1 / ( 1 + exp( − ∑_k λ_k g_k(x) ) )

    ML in NLP

Relationship to Linear Discrimination
• Decision rule:
  sign( log p(+1 | x)/p(−1 | x) ) = sign( ∑_k λ_k g_k(x) )
• Bias term: parameter for an always-on feature
• Question: relationship to other trainers for linear discriminant functions


    ML in NLP

Solution Techniques (1)
• Generalized iterative scaling (GIS)
• Parameter updates:
  λ_k ← λ_k + (1/C) log( ∑_i f_k(x_i, y_i) / ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) )
• Requires that features add up to a constant independent of instance or label (add a slack feature):
  ∑_k f_k(x_i, y) = C  ∀ i, y
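A compact sketch of one GIS pass for the conditional maxent model, under the assumption that the feature function already includes a slack feature so that ∑_k f_k(x, y) = C for every (x, y). The data and feature representations are hypothetical.

```python
import math

def p_y_given_x(lam, feats, labels, x):
    """feats(x, y) -> dict {feature_id: value}; returns the conditional maxent distribution."""
    scores = {y: math.exp(sum(lam.get(k, 0.0) * v for k, v in feats(x, y).items())) for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def gis_step(lam, data, feats, labels, C):
    """One GIS update: lam_k += (1/C) * log(observed_k / expected_k)."""
    observed, expected = {}, {}
    for x, y in data:
        for k, v in feats(x, y).items():
            observed[k] = observed.get(k, 0.0) + v
        p = p_y_given_x(lam, feats, labels, x)
        for y2 in labels:
            for k, v in feats(x, y2).items():
                expected[k] = expected.get(k, 0.0) + p[y2] * v
    for k in observed:
        if expected.get(k, 0.0) > 0:
            lam[k] = lam.get(k, 0.0) + math.log(observed[k] / expected[k]) / C
    return lam
```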

    ML in NLP

Solution Techniques (2)
• Improved iterative scaling (IIS)
• Parameter updates: λ_k ← λ_k + δ_k, where δ_k solves
  ∑_i f_k(x_i, y_i) = ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}
  f#(x, y) = ∑_k f_k(x, y)
• For binary features this reduces to solving a polynomial with positive coefficients
• Reduces to GIS if the feature sum is constant

    ML in NLP

Deriving IIS (1)
• Conditional log-likelihood:
  l(Λ) = ∑_i log p(y_i | x_i; Λ)
• Log-likelihood update:
  l(Λ + Δ) − l(Λ) = ∑_i Δ·f(x_i, y_i) − ∑_i log( Z(x_i; Λ + Δ) / Z(x_i; Λ) )
                  = ∑_i Δ·f(x_i, y_i) − ∑_i log ∑_y e^{(Λ+Δ)·f(x_i, y)} / Z(x_i; Λ)
                  = ∑_i Δ·f(x_i, y_i) − ∑_i log ∑_y p(y | x_i; Λ) e^{Δ·f(x_i, y)}
  Using log x ≤ x − 1:
                  ≥ ∑_i Δ·f(x_i, y_i) + N − ∑_i ∑_y p(y | x_i; Λ) e^{Δ·f(x_i, y)}  ≡ A(Δ)

    ML in NLP

Deriving IIS (2)
• By Jensen's inequality:
  A(Δ) ≥ ∑_i Δ·f(x_i, y_i) + N − ∑_i ∑_y p(y | x_i; Λ) ∑_k ( f_k(x_i, y) / f#(x_i, y) ) e^{δ_k f#(x_i, y)}  ≡ B(Δ)
• Maximize the lower bound on the update:
  ∂B(Δ)/∂δ_k = ∑_i f_k(x_i, y_i) − ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}


    ML in NLP

Solution Techniques (3)
• GIS is very slow if the slack variable takes large values
• IIS is faster, but still problematic
• Recent suggestion: use standard convex optimization techniques
  – E.g. conjugate gradient
  – Some evidence of faster convergence

    ML in NLP

Gaussian Prior
• Log-likelihood gradient:
  ∂l(Λ)/∂λ_k = ∑_i f_k(x_i, y_i) − ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) − λ_k/σ_k²
• Modified IIS update: λ_k ← λ_k + δ_k, where δ_k solves
  ∑_i f_k(x_i, y_i) = ∑_i ∑_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)} + (λ_k + δ_k)/σ_k²
  f#(x, y) = ∑_k f_k(x, y)

    ML in NLP

Instance Representation
• Fixed-size instance (PP attachment): binary features
  – Word identity
  – Word class
• Variable-size instance (document classification)
  – Word identity
  – Word relative frequency in document

ML in NLP

Enriching Features
• Word n-grams
• Sparse word n-grams
• Character n-grams (noisy transcriptions: speech, OCR)
• Unknown-word features: suffixes, capitalization
• Feature combinations (cf. n-grams)


    ML in NLP

    I understood each and every word you said but not the order in which they appeared.

    ML in NLP

Structured Models: Finite State

ML in NLP

Structured Model Applications
• Language modeling
• Story segmentation
• POS tagging
• Information extraction (IE)
• (Shallow) parsing

ML in NLP

Structured Models
• Assign a labeling to a sequence
  – Story segmentation
  – POS tagging
  – Named entity extraction
  – (Shallow) parsing


    ML in NLP

Constraint Satisfaction in Structured Models
• Train to minimize the labeling loss:
  θ* = argmin_θ ∑_i Loss(x_i, y_i | θ)
• Computing the best labeling:
  argmin_y Loss(x, y | θ)
• Efficient minimization requires:
  – A common currency for local labeling decisions
  – An efficient algorithm to combine the decisions

    ML in NLP

Local Classification Models
• Train to minimize the per-decision loss in context:
  θ* = argmin_θ ∑_i ∑_j loss(y_{i,j} | x_i, y_i(j); θ)
• Apply by guessing the context and finding each lowest-loss label


    ML in NLP

Markov's Unreasonable Effectiveness
• Entropy estimates for English:

  model                                bits/char
  human prediction (Cover & King 78)     1.34
  word trigrams (Brown et al 92)         1.75
  compress                               4.43

• Local word relations dominate the statistics (Jelinek)
  [Figure: for each word of a test sentence, its rank in the trigram model's ranked predictions, e.g. 1, 2, 2, 7, 9, 98, 1641]

    ML in NLP

Limits of Markov Models
• No dependency structure
• Likelihoods based on sequencing, not dependency
  [Same figure as above: word ranks under the trigram model]

    ML in NLP

Unseen Events (1)
• What's the probability of unseen events?
• Bias forces nonzero probabilities for some unseen events
• Typical bias: tie the probabilities of related events
  – specific unseen event ← general seen event:  eat pineapple ← eat _
  – event decomposition: event ← event1 + event2:  eat pineapple ← eat _ , _ pineapple
  – Factoring via latent variables:
    P(eat | pineapple) ≈ ∑_C P(eat | C) P(C | pineapple)

    ML in NLP

Unseen Events (2)
• Discount estimates for seen events
• Use the leftover probability for unseen events
• How to allocate the leftover?
  – Back off from the unseen event to less specific seen events: n-gram to (n−1)-gram
  – Hypothesize a hidden cause for unseen events: latent variable model
  – Relate the unseen event to distributionally similar seen events


    ML in NLP

Important Detour: Latent Variable Models

ML in NLP

Expectation-Maximization (EM)
• Latent (hidden) variable models:
  p(y, x, z | Λ),   p(y, x | Λ) = ∑_z p(y, x, z | Λ)
• Examples:
  – Mixture models
  – Class-based models (hidden classes)
  – Hidden Markov models

    ML in NLP

Maximizing Likelihood
• Data log-likelihood:
  D = { (x_1, y_1), ..., (x_N, y_N) }
  L(D | Λ) = ∑_i log p(x_i, y_i | Λ) ∝ ∑_{x,y} p̃(x, y) log p(x, y | Λ)
  p̃(x, y) = |{ i : x_i = x, y_i = y }| / N
• Find the parameters that maximize the (log-)likelihood:
  Λ* = argmax_Λ ∑_{x,y} p̃(x, y) log p(x, y | Λ)

    ML in NLP

Convenient Lower Bounds (1)
• Convex function:
  f( a x₀ + (1 − a) x₁ ) ≤ a f(x₀) + (1 − a) f(x₁)
• Jensen's inequality: if f is convex and p is a probability density,
  f( ∑_x p(x) x ) ≤ ∑_x p(x) f(x)
  [Figure: a convex curve f with the chord between (x₀, f(x₀)) and (x₁, f(x₁)) lying above it]


    ML in NLP

Convenient Lower Bounds (2)
[Figure: the log-likelihood L(D | λ) as a function of λ, together with successive lower bounds; alternating steps E0, M0, E1, ... climb the likelihood]

    ML in NLP

Auxiliary Function
• Find a convenient non-negative function that lower-bounds the likelihood increase:
  L(D | Λ′) − L(D | Λ) ≥ Q(Λ′, Λ) ≥ 0
• Maximize the lower bound:
  Λ_{i+1} = argmax_{Λ′} Q(Λ′, Λ_i)

    ML in NLP

Comments
• Likelihood keeps increasing, but:
  – Can get stuck in a local maximum (or saddle point!)
  – Can oscillate between different local maxima with the same log-likelihood
• If maximizing the auxiliary function is too hard, find some Λ′ that increases the likelihood: generalized EM (GEM)
• The sum over hidden variable values can be exponential if not done carefully (sometimes it is not possible)

    ML in NLP

Example: Mixture Model
• Base distributions: p_i(y), 1 ≤ i ≤ m
• Mixture coefficients: λ_i ≥ 0, ∑_{i=1..m} λ_i = 1
• Mixture distribution:
  p(y | Λ) = ∑_i λ_i p_i(y)


    ML in NLP

Auxiliary Quantities
• Mixture coefficient λ_i = prior probability of being in class i
• Joint probability:
  p(c, y | Λ) = λ_c p_c(y)
• Auxiliary function:
  Q(Λ′, Λ) = ∑_y p̃(y) ∑_c p(c | y, Λ) log( p(y, c | Λ′) / p(y, c | Λ) )

    ML in NLP

Solution
• E step:
  C_i = ∑_y p̃(y) ( λ_i p_i(y) / ∑_j λ_j p_j(y) )
• M step:
  λ_i ← C_i / ∑_j C_j
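The E and M steps above, written as a small EM sketch for a mixture with fixed base distributions; only the mixture coefficients λ are estimated, as on the slide. The sample and base-distribution representations are hypothetical.

```python
def em_mixture(samples, base_dists, iters=50):
    """samples: list of observed y values.
    base_dists: list of functions p_i(y) for the fixed base distributions.
    Returns the mixture coefficients lambda_i."""
    m = len(base_dists)
    lam = [1.0 / m] * m
    for _ in range(iters):
        # E step: expected class counts C_i (summing posteriors over the sample)
        counts = [0.0] * m
        for y in samples:
            joint = [lam[i] * base_dists[i](y) for i in range(m)]
            total = sum(joint)
            for i in range(m):
                counts[i] += joint[i] / total
        # M step: lam_i <- C_i / sum_j C_j
        total_counts = sum(counts)
        lam = [c / total_counts for c in counts]
    return lam
```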

    ML in NLP

    More Finite-State Models

    ML in NLP

Example: Information Extraction
• Given: types of entities and relationships we are interested in
  – People, places, organizations, dates, amounts, materials, processes, ...
  – Employed by, located in, used for, arrived when, ...
• Find all entities and relationships of the given types in the source material
• Collect them in a suitable database


    ML in NLP

IE Example
• Rely on:
  – Syntactic structure
  – Phrase classification

  "Nance, who is also a paid consultant to ABC News, said ..."
  [Figure: "Nance" labeled person, "a paid consultant to ABC News" labeled person-descriptor, "ABC News" labeled organization, with an employee relation and a co-reference link]

    ML in NLP

IE Methods
• Partial matching:
  – Hand-built patterns
  – Automatically-trained hidden Markov models
  – Cascaded finite-state transducers
• Parsing-based:
  – Parse the whole text: shallow parser (chunking), automatically-induced grammar
  – Classify phrases and phrase relations as the desired entities and relationships

    ML in NLP

Global Constraint Models
• Train to minimize the labeling loss:
  θ* = argmin_θ ∑_i Loss(x_i, y_i | θ)
• Computing the best labeling:
  argmin_y Loss(x, y | θ)
• Efficient minimization requires:
  – A common currency for local labeling decisions
  – A dynamic programming algorithm to combine the decisions

    ML in NLP

Local Classification Models
• Train to minimize the per-symbol loss in context:
  θ* = argmin_θ ∑_i ∑_j loss(y_{i,j} | x_i, y_i(j); θ)
• Apply by guessing the context and finding each lowest-loss label


    ML in NLP

Structured Model Claims
• Global constraint
  – Principled
  – Probabilistic interpretation allows model composition
  – Efficient optimal decoding
• Local classifier
  – Wider range of models
  – More efficient training
  – Heuristic decoding comparable to pruning in global models

ML in NLP

Generative vs. Discriminative
• Hidden Markov models (HMMs): generative, global
• Conditional exponential models (MEMMs, CRFs): discriminative, global
• Boosting, winnow: discriminative, local

    ML in NLP

Generative Models
• Stochastic process that generates instance-label pairs
  – Process structure
  – Process parameters
• (Hypothesize structure)
• Estimate parameters from training data

ML in NLP

Model Structure
• Decompose the generation of instances into elementary steps
• Define dependencies between steps
• Parameterize the dependencies
• Useful descriptive language: graphical models


    ML in NLP

Binary Naïve Bayes
• Represent instances by sets of binary features
  – Does word occur in document?
  – ...
• Finite predefined set of classes
• Graphical model: class variable C with feature children F1, F2, F3, F4, ...

  P(F1, ..., Fn, C) = P(C) ∏_i P(Fi | C)
  P(c | F1, ..., Fn) ∝ P(c) ∏_i P(Fi | c)

    ML in NLP

Discrete Hidden Markov Model
• Instances: symbol sequences
• Labels: class sequences
• Graphical model: a class chain C0 → C1 → C2 → ... with an emission Xi from each Ci

  P(X, C) = P(C0) P(X0 | C0) ∏_i P(Ci | Ci−1) P(Xi | Ci)
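A small sketch of the joint probability above, plus Viterbi decoding of the most likely class sequence (Viterbi is only named later in the tutorial). The parameter tables init, trans, and emit are hypothetical nested dicts.

```python
def hmm_joint(init, trans, emit, classes, symbols):
    """P(X, C) = P(C0) P(X0|C0) * prod_i P(Ci|Ci-1) P(Xi|Ci)."""
    p = init[classes[0]] * emit[classes[0]][symbols[0]]
    for i in range(1, len(classes)):
        p *= trans[classes[i - 1]][classes[i]] * emit[classes[i]][symbols[i]]
    return p

def viterbi(init, trans, emit, states, symbols):
    """Most likely class sequence for the observed symbols."""
    V = [{s: init[s] * emit[s][symbols[0]] for s in states}]
    back = [{}]
    for t in range(1, len(symbols)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, V[t - 1][r] * trans[r][s] * emit[s][symbols[t]]) for r in states),
                key=lambda item: item[1])
            V[t][s], back[t][s] = score, prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(symbols) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```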

    ML in NLP

Generating Multiple Features
• Instances: sequences of feature sets
  – Word identity
  – Word properties (e.g. spelling, capitalization)
• Labels: class sequences
• Graphical model: a class chain C0 → C1 → C2 → ... where each Ci emits several features Fi,0, Fi,1, Fi,2
• Limitation: conditionally independent features

ML in NLP

Independence or Intractability
• Trees are good: each node has a single immediate ancestor, so the joint probability is computed in linear time
• But that forces features to be conditionally independent given the class
• Unrealistic:
  – Suffixes and capitalization
  – San and Francisco in the same document


    ML in NLP

Score Card
+ No independence assumptions
+ Richer features: combinations of existing features
− Optimization problem for parameters
− Limited probabilistic interpretation
− Insensitive to input distribution

ML in NLP

Information Extraction with HMMs
• Parameters = P(s|s′), P(o|s) for all states in S = {s1, s2, ...}
• Observations: words
• Training: maximize the probability of the observations (+ prior)
• For IE, states indicate the database field
[Seymore & McCallum 99] [Freitag & McCallum 99]

    ML in NLP

Problems with HMMs
1. Would prefer a richer representation of text: multiple overlapping features, whole chunks of text
   • Example line features:
     – length of line
     – line is centered
     – percent of non-alphabetics
     – total amount of white space
     – line contains two verbs
     – line begins with a number
     – line is grammatically a question
   • Example word features:
     – identity of word
     – word is in all caps
     – word ends in -tion
     – word is part of a noun phrase
     – word is in bold font
     – word is on left hand side of page
     – word is under node X in WordNet
2. HMMs are generative models. Generative models do not easily handle overlapping, non-independent features. Would prefer a conditional model: P({s}|{o}).

    ML in NLP

Solution: Conditional Model
• Hidden Markov model: P(o|s), P(s|s′)
• Maximum entropy Markov model: P(s|o, s′) (represented by an exponential model)
• For the time being, capture the dependency on s′ with |S| independent functions, P_{s′}(s|o)
• Each state contains a next-state classifier black box that, given the next observation, will produce a probability distribution over possible next states, P_{s′}(s|o)


    ML in NLP

Two Sequence Models
[Figure: HMM with dependencies s_{t−1} → s_t → o_t and parameters P(o|s), P(s|s′); MEMM with dependencies s_{t−1} → s_t and o_t → s_t and parameter P(s|o, s′)]
• Standard belief propagation: forward-backward procedure
• Viterbi and Baum-Welch follow naturally

ML in NLP

Transition Features
• Model P_{s′}(s | o) in terms of multiple arbitrary overlapping (binary) features
• Example observation predicates:
  – o is the word "apple"
  – o is capitalized
  – o is on a left-justified line
• A feature f depends on both a predicate b and a destination state s:
  f_{⟨b,s⟩}(o, s′) = 1 if b(o) is true and s′ = s; 0 otherwise

    ML in NLP

Next-State Classifier
• Per-state conditional maxent model:
  P_{s′}(s | o) = (1 / Z(o, s′)) exp( ∑ λ_f f(o, s) )
• Training: each state model is trained independently from labeled sequences
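A minimal sketch of the per-state next-state classifier P_{s′}(s | o): one exponential model per previous state over ⟨predicate, destination-state⟩ features, normalized by Z(o, s′). The weight layout and predicate interface are assumptions.

```python
import math

def next_state_probs(weights, predicates, states, prev_state, obs):
    """weights[prev_state][(i, s)]: weight of feature 'predicate i fires and destination is s'.
    predicates: list of functions b(obs) -> bool. Returns P_{prev_state}(s | obs)."""
    w = weights[prev_state]
    scores = {}
    for s in states:
        total = sum(w.get((i, s), 0.0) for i, b in enumerate(predicates) if b(obs))
        scores[s] = math.exp(total)
    z = sum(scores.values())          # Z(o, s'): normalization within this previous state
    return {s: v / z for s, v in scores.items()}
```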

    ML in NLP

Example: Q-A pairs from FAQ

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal programs to work properly. They are as far as I know the informal standard agreed upon by commercial comms software developers for the Arc.

  Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This is to avoid the well known serial port chip bugs. The modem's DCD (Data Carrier Detect) signal has been re-routed to the Arc's RI (Ring Indicator): most modems broadcast a software RING signal anyway, and even then it's really necessary to detect it for the modem to answer the call.

  2.7) The sound from the speaker port seems quite muffled. How can I get unfiltered sound from an Acorn machine?

  All Acorn machines are equipped with a sound filter designed to remove high frequency harmonics from the sound output. To bypass the filter, hook into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip 39) and hook the capacitor like this:


    ML in NLP

Experimental Data
• 38 files belonging to 7 UseNet FAQs
• Procedure: for each FAQ, train on one file, test on the others; average.
  [Truncated excerpt of the FAQ text above]

    ML in NLP

Features in Experiments
  begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

    ML in NLP

Models Tested
• ME-Stateless: a single maximum entropy classifier applied to each line independently
• TokenHMM: a fully-connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters)
• FeatureHMM: identical to TokenHMM, only the lines in a document are first converted to sequences of features
• MEMM: maximum entropy Markov model

ML in NLP

Results

  Learner        Segmentation precision   Segmentation recall
  ME-Stateless   0.038                    0.362
  TokenHMM       0.276                    0.140
  FeatureHMM     0.413                    0.529
  MEMM           0.867                    0.681


    ML in NLP

• Example (after Bottou 91): the "rib"/"rob" network shown in the Label Bias Experiment below
• Bias toward states with fewer outgoing transitions
• Per-state normalization does not allow the required score(1,2 | ro)


    ML in NLP

Efficient Estimation
• Matrix notation:
  M_t(s′, s | o) = exp Λ_t(s′, s | o)
  Λ_t(s′, s | o) = ∑_f λ_f f(s_{t−1} = s′, s_t = s, o, t)
  P_Λ(s | o) = (1 / Z_Λ(o)) ∏_t M_t(s_{t−1}, s_t | o)
  Z_Λ(o) = ( M_1(o) M_2(o) ⋯ M_{n+1}(o) )_{start,stop}
• Efficient normalization: forward-backward algorithm
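A sketch of the matrix form above: building each transition matrix M_t and computing Z_Λ(o) as the (start, stop) entry of their product. It assumes numpy and a hypothetical log_potential(t, s′, s) function that closes over the observation sequence o.

```python
import numpy as np

def crf_partition(states, n, log_potential):
    """states includes 'start' and 'stop'; log_potential(t, s_prev, s) = Lambda_t(s', s | o).
    Returns Z_Lambda(o) = (M_1(o) M_2(o) ... M_{n+1}(o))_{start,stop}."""
    idx = {s: i for i, s in enumerate(states)}
    prod = np.eye(len(states))
    for t in range(1, n + 2):                       # positions 1 .. n+1
        M = np.zeros((len(states), len(states)))
        for sp in states:
            for s in states:
                M[idx[sp], idx[s]] = np.exp(log_potential(t, sp, s))
        prod = prod @ M
    return prod[idx["start"], idx["stop"]]
```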

    ML in NLP

Forward-Backward Calculations
• For any path function G(s) = ∑_t g_t(s_{t−1}, s_t):
  E_Λ G = ∑_s P_Λ(s | o) G(s)
        = ∑_{t,s′,s} α_t(s′ | o) g_{t+1}(s′, s) M_{t+1}(s′, s | o) β_{t+1}(s | o) / Z_Λ(o)
  α_t(o) = α_{t−1}(o) M_t(o)
  β_t(o) = M_{t+1}(o) β_{t+1}(o)
  Z_Λ(o) = α_{n+1}(end | o) = β_0(start | o)

    ML in NLP

Training
• Maximize
  L(Λ) = ∑_k log P_Λ(s^k | o^k)
• Log-likelihood gradient:
  ∂L(Λ)/∂λ_f = ∑_k #f(s^k | o^k) − ∑_k E_Λ #f(S | o^k)
  #f(s | o) = ∑_t f(s_{t−1}, s_t, o, t)
• Methods: iterative scaling, conjugate gradient
• Comparable to standard Baum-Welch

    ML in NLP

Label Bias Experiment
• Data source: a noisy version of a small finite-state network with two branches from state 0, one spelling r-i-b and one spelling r-o-b
  [Figure: the rib/rob network over states 0-6]
• P(intended symbol) = 29/30, P(other) = 1/30
• Train both an MEMM and a CRF with identical topologies on data from the source
• Compute decoding error: CRF 4.6%, MEMM 42% (2,000 training samples, 500 test)


    ML in NLP

Mixed-Order Sources
• Data generated by mixing sparse first- and second-order HMMs with a varying mixing coefficient
• Modeled by first-order HMM, MEMM and CRF (without contextual or overlapping features)

    ML in NLP

Part-of-Speech Tagging
• Trained on 50% of the 1.1 million words in the Penn treebank. In this set, 5.45% of the words occur only once and were mapped to oov.
• Experiments with two different sets of features:
  – traditional: just the words
  – take advantage of the power of conditional models: use words plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

ML in NLP

POS Tagging Results

  model   error   oov error
  HMM     5.69%   45.99%
  MEMM    6.37%   54.61%
  CRF     5.55%   48.05%
  MEMM+   4.81%   26.99%
  CRF+    4.27%   23.76%

    ML in NLP

Structured Models: Stochastic Grammars


    ML in NLP

Beyond Finite State
• Constituency and meaningful relations:
  What sleeps? How does it sleep? What kind of ideas?
• Related sentences:
  Sleep furiously is all that colorless green ideas do.
  [Figure: parse tree for "revolutionary new ideas advance slowly": S → NP VP; NP → Adj (revolutionary) N′; N′ → Adj (new) N′; N′ → N (ideas); VP → VP (V advance) Adv (slowly)]

    ML in NLP

Stochastic Context-Free Grammars
[Figure: the parse tree for "revolutionary new ideas advance slowly" as above]

  S  → NP VP    1.0
  NP → Det N′   0.2
  NP → N′       0.8
  N′ → Adj N′   0.3
  N′ → N        0.7
  VP → VP Adv   0.4
  VP → V        0.6
  N  → ideas    0.1
  N  → people   0.2
  V  → sleep    0.3
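Because derivation steps are independent (context-freeness, next slide but one), the probability of a tree is just the product of the probabilities of the rules it uses. Here is a small sketch over the toy grammar above; the nested-tuple tree encoding is a hypothetical representation, not the tutorial's.

```python
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N'")): 0.2,
    ("NP", ("N'",)): 0.8,
    ("N'", ("Adj", "N'")): 0.3,
    ("N'", ("N",)): 0.7,
    ("VP", ("VP", "Adv")): 0.4,
    ("VP", ("V",)): 0.6,
    ("N", ("ideas",)): 0.1,
    ("N", ("people",)): 0.2,
    ("V", ("sleep",)): 0.3,
}

def tree_prob(tree):
    """tree = (label, child1, child2, ...); leaves are plain strings.
    P(tree) = product of P(rule) over the rules used in the derivation."""
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROBS[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# Example: tree_prob(("NP", ("N'", ("N", "ideas"))))  ->  0.8 * 0.7 * 0.1
```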

    ML in NLP

Stochastic CFG Inference
• Inside-outside algorithm (Baker 79): find rule probabilities that locally maximize the likelihood of a training corpus (an instance of EM)
• Extended inside-outside algorithm: use information about training-corpus phrase structure to guide rule probability reestimation
  – Better modeling of phrase structure
  – Improved convergence
  – Improved computational complexity

    ML in NLP

SCFG Derivations
• Context-freeness ⇒ independence of derivation steps
  [Figure: the probability of the full tree for "revolutionary new ideas advance slowly" factors into the product of the probabilities of its sub-derivations, e.g. P(N′ → Adj N′) times the probabilities of the subtrees]


    ML in NLP

Inside-Outside Reestimation
• Ratios of expected rule frequencies to expected phrase frequencies:
  P_{n+1}(N′ → Adj N′) = E_n(# N′ → Adj N′) / E_n(# N′)
• Computed from the probabilities of each phrase and phrase context in the training corpus
  [Figure: a phrase and its phrase context within the tree for "revolutionary new ideas advance slowly"]
• Iterate

    ML in NLP

Problems with I-O Reestimation
• Hill-climbing procedure: sensitivity to initial rule probabilities
• Does not learn grammar structure directly: structure is only implicit in the rule probabilities
• Linguistically inadequate grammars: high mutual-information sequences are grouped into phrases
  ((What (((is the) cheapest) fare)) ((I can) (get ?)))
  Contrast: Is $300 the cheapest fare?

    ML in NLP

(Partially) Bracketed Text
• Hand-annotated text with (some) phrase boundaries:
  (((List (the fares (for ((flight) (number 891))))) .)
• Use only derivations compatible with the training bracketing

    ML in NLP

Predictive Power
[Plots: cross entropy vs. iteration (0-80) on training data and on test data, comparing raw and bracketed training]


    ML in NLP

Bracketing Accuracy
• Accuracy criterion: proportion of phrases in the most likely analysis compatible with the treebank bracketing
  [Plot: bracketing accuracy vs. iteration for raw vs. bracketed test data]
• Conclusion: structure is not evident from distribution alone

    ML in NLP

Limitations of SCFGs
• Likelihoods independent of particular words
• Markovian assumption on syntactic categories
  [Figure: Markov (sequential) vs. hierarchical dependencies over "We need to resolve the issue"]

ML in NLP

Lexicalization
• Markov model vs. dependency model
  [Figure: sequential bigram links vs. head-dependency links over "We need to resolve the issue"]

    ML in NLP

Best Current Models
• Representation: surface trees with head-word propagation
• Generative power still context-free
• Model variables: head word, dependency type, argument vs. adjunct, heaviness, slash
• Main challenge: smoothing method for unseen dependencies
• Learned from hand-parsed text (treebank)
• Around 90% constituent accuracy


    ML in NLP

Lexicalized Tree (Collins 98)

  Dependency         Direction   Relation
  workers → dumped   L           NP, S, VP
  sacks → dumped     R           NP, VP, V
  into → dumped      R           PP, VP, V
  a → bin            L           D, NP, N
  bin → into         R           NP, PP, P

  [Figure: lexicalized tree for "workers dumped sacks into a bin": S(dumped) → NP-C(workers) VP(dumped); VP(dumped) → V(dumped) NP-C(sacks) PP(into); PP(into) → P(into) NP-C(bin); NP-C(bin) → D(a) N(bin)]

    ML in NLP

    Inducing Representations

    ML in NLP

Unsupervised Learning
• Latent variable models
  – Model observables from latent variables
  – Search for a good set of latent variables
• Information bottleneck
  – Find an efficient compression of some observables
  – preserving the information about other observables

ML in NLP

Do Induced Classes Help?
• Generalization
  – Better statistics for coarser events
• Dimensionality reduction
  – Smaller models
  – Improved classification accuracy


    ML in NLP

Chomsky's Challenge to Empiricism

    (1) Colorless green ideas sleep furiously.

    (2) Furiously sleep ideas green colorless.

It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally remote from English. Yet (1), though nonsensical, is grammatical, while (2) is not.

    Chomsky 57

    ML in NLP

Complex Events
• What Chomsky was talking about: in Markov models, the state is just a record of observations
• But statistical models can have hidden state:
  – representation of past experience
  – uncertainty about the correct grammar
  – uncertainty about the correct interpretation of experience: ambiguity
• Probabilistic relationships involving hidden variables can be induced from observable data alone: EM algorithm

    ML in NLP

In Any Model?
• Factored bigram model:
  P(w_{i+1} | w_i) ≈ ∑_{c=1..16} P(w_{i+1} | c) P(c | w_i)
  P(w_1 ⋯ w_n) ≈ P(w_1) ∏_{i=2..n} P(w_i | w_{i−1})
• Trained for large-vocabulary speech recognition from newswire text by EM
  P(colorless green ideas sleep furiously) / P(furiously sleep ideas green colorless) ≈ 2 × 10⁵

    ML in NLP

Distributional Clustering
• Automatic grouping of words according to the contexts in which they appear
• Approach to data sparseness: approximate the distribution of a relatively rare event (word) by the collective distribution of similar events (cluster)
• Sense ambiguity → membership in several soft clusters
• Case study: cluster nouns according to the verbs that take them as direct objects


    ML in NLP

Training Data
• Universe: two word classes V and N, a single relation between them (e.g. main verb ↔ head noun of the verb's direct object)
• Data: frequencies f_vn of (v, n) pairs extracted from text by parsing or pattern matching

    ML in NLP

Distributional Representation
• Describing n ∈ N: use the conditional distribution p(V | n)
  [Bar chart: relative frequencies of the verbs buy, hold, issue, own, purchase, select, sell, tender, trade as contexts of the nouns "stock" and "bond"]

    ML in NLP

Reminder: Bottleneck Model
• Markov condition:
  p(ñ | v) = ∑_n p(ñ | n) p(n | v)
• Find p(Ñ | N) to maximize the mutual information I(Ñ, V) for fixed I(Ñ, N)
  I(Ñ, V) = ∑_{ñ,v} p(ñ, v) log( p(ñ, v) / (p(ñ) p(v)) )
  [Diagram: N → Ñ → V, with the quantities I(Ñ,N), I(N,V), I(Ñ,V)]
• Solution:
  p(ñ | n) = ( p(ñ) / Z_n ) exp( −β D_KL( p(V | n) ‖ p(V | ñ) ) )

ML in NLP
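A sketch of the soft cluster assignment above, p(ñ | n) ∝ p(ñ) exp(−β D_KL(p(V|n) ‖ p(V|ñ))). Distributions are plain dicts over verbs; the eps guard against zero entries in the KL divergence is an assumption to keep it finite.

```python
import math

def kl(p, q, eps=1e-12):
    """D_KL(p || q) over a shared verb vocabulary (eps guards against zeros in q)."""
    return sum(pv * math.log(pv / max(q.get(v, 0.0), eps)) for v, pv in p.items() if pv > 0)

def soft_assignments(p_v_given_n, p_v_given_centroid, p_centroid, beta):
    """p(n~ | n) = p(n~)/Z_n * exp(-beta * D_KL(p(V|n) || p(V|n~))) for each noun n."""
    out = {}
    for n, pn in p_v_given_n.items():
        weights = {c: p_centroid[c] * math.exp(-beta * kl(pn, pc))
                   for c, pc in p_v_given_centroid.items()}
        z = sum(weights.values())
        out[n] = {c: w / z for c, w in weights.items()}
    return out
```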

Search for Cluster Solutions
• The scale parameter β (inverse temperature) determines how much a noun contributes to nearby centroids
• β increases ⇒ clusters split ⇒ hierarchical clustering
  [Figure: clusters splitting as β grows from β₀]


    ML in NLP

Small Example
• Cluster the 64 most common direct objects of "fire" in the 1988 Associated Press newswire; four of the resulting clusters:

  missile 0.835, rocket 0.850, bullet 0.917, gun 0.940
  officer 0.484, aide 0.612, chief 0.649, manager 0.651
  shot 0.858, bullet 0.925, rocket 0.930, missile 1.037
  gun 0.758, missile 0.786, weapon 0.862, rocket 0.875

    ML in NLP

Mutual Information Ratios
[Plot: the ratio I(Ñ,V)/I(Ñ,N) against I(Ñ,N)/H(N), traced out as β increases]

    ML in NLP

Using Clusters for Prediction
• Model verb-object associations through object clusters:
  p̂(v | n) = ∑_ñ p(v | ñ) p(ñ | n)
  p̂(v | n) = (1/Z_n) ∑_ñ p(v | ñ) p(ñ | n)
  (one form was used in the experiments, the other is noted as more appropriate)
• Depends on β
• Intuition: the associations of a word are a mixture of the associations of the sense classes to which the word belongs

    ML in NLP

Evaluation
• Relative entropy of held-out data to the asymmetric model
• Decision task: which of two verbs is more likely to take a noun as direct object, estimated from the model for training data in which the pairs relating the noun to one of the verbs have been deleted


    ML in NLP

Relative Entropy
[Plot: average relative entropy (bits) vs. number of clusters (0-600) for train, test, and new data]
• Train: 756,721 verb-object pair training set
• Test: 81,240 pair held-out test set
• New: held-out data for 1000 nouns not in the training data
• Verb-object pairs from the 1988 AP Newswire

    ML in NLP

Decision Task
[Plot: decision error vs. number of clusters (0-500), for all pairs and for exceptional pairs]
• Which of v1 and v2 is more likely to take object o?
• Held out: 10⁴ (v2, o) pairs; need to guess from p_c(v2)
• Testing: compare all v1 and v2 such that (v2, o) was held out and (v1, o) occurs
  – All: all test data
  – Exceptional: verb pairs whose frequency ratio with o is reversed from their overall frequency ratio