Noun Phrase Extraction
A Description of Current Techniques
What is a noun phrase?
A phrase whose head is a noun or pronoun, optionally accompanied by a set of modifiers:
Determiners:
• Articles: a, an, the
• Demonstratives: this, that, those
• Numerals: one, two, three
• Possessives: my, their, whose
• Quantifiers: some, many
Adjectives: the red ball
Relative clauses: the books that I bought yesterday
Prepositional phrases: the man with the black hat
Is that really what we want?
POS tagging already identifies pronouns and nouns by themselves.
The man whose red hat I borrowed yesterday in the street that is next to my house lives next door.
[The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP.
Base Noun Phrases
[The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
How Prevalent is this Problem?
Established by Steven Abney in 1991 as a core step in Natural Language Processing.
A well-explored problem.
What were the successful early solutions?
Simple rule-based methods / finite state automata
Both of these rely on the aptitude of the linguist formulating the rule set.
Simple Rule-based / Finite State Automata
A list of grammar rules and relationships is established. For example:
• If an article precedes a noun, that article marks the beginning of a noun phrase.
• A noun phrase cannot begin immediately after an article.
The simplest method.
FSA simple NPE example
[FSA diagram: states S0, S1, and NP; transitions labeled determiner/adjective, noun/pronoun, adjective, relative clause/prepositional phrase, and noun.]
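To make the finite-state idea concrete, the following is a minimal sketch (not from the slides) of a base-NP chunker driven by a small automaton over POS tags; the tag set and the transition rules are simplified assumptions for illustration.

# Minimal finite-state base-NP chunker sketch (illustrative, simplified tag set).
# State "outside": not in a noun phrase; state "inside": collecting a noun phrase.
DETERMINERS = {"DT"}                       # articles, demonstratives, possessives, ...
ADJECTIVES  = {"JJ"}
NOUNS       = {"NN", "NNS", "NNP", "PRP"}  # nouns and pronouns

def chunk_base_nps(tagged_tokens):
    """tagged_tokens: list of (word, POS) pairs; returns the base NPs found."""
    chunks, current = [], []
    for word, pos in tagged_tokens:
        if pos in DETERMINERS or pos in ADJECTIVES or pos in NOUNS:
            current.append(word)               # determiner/adjective/noun opens or extends an NP
        elif current:
            chunks.append(" ".join(current))   # any other tag closes the current NP
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_base_nps([("The", "DT"), ("man", "NN"), ("lives", "VBZ"),
                      ("next", "JJ"), ("door", "NN")]))
# ['The man', 'next door']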
Simple rule NPE example
"Contextualization" and "lexicalization": the ratio between the number of occurrences of a POS tag in a chunk and the number of occurrences of that POS tag in the training corpora.
Parsing FSAs, grammars, regular expressions: LR(k) Parsing
The L means we do a left-to-right scan of the input tokens.
The R means we are guided by rightmost derivations.
The k means we will look at the next k tokens to help us make decisions about handles.
We shift input tokens onto a stack and then reduce that stack by replacing RHS handles with LHS non-terminals.
An Expression Grammar
1. E -> E + T
2. E -> E - T
3. E -> T
4. T -> T * F
5. T -> T / F
6. T -> F
7. F -> (E)
8. F -> i
LR Table for Exp Grammar
An LR(1) NPE Example
1. S -> NP VP
2. NP -> Det N
3. NP -> N
4. VP -> V NP
Stack        Input   Action
[]           N V N   SH N
[N]          V N     RE 3) NP -> N
[NP]         V N     SH V
[NP V]       N       SH N
[NP V N]             RE 3) NP -> N
[NP V NP]            RE 4) VP -> V NP
[NP VP]              RE 1) S -> NP VP
[S]                  Accept!
(Abney, 1991)
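The following is a minimal sketch of the shift-reduce loop behind this trace, using the four-rule toy grammar above. The greedy reduce-first control strategy is an assumption made for illustration; a real LR(k) parser would consult a parse table and k tokens of lookahead.

# Shift-reduce sketch for the toy grammar:
#   1. S -> NP VP   2. NP -> Det N   3. NP -> N   4. VP -> V NP
RULES = [("S", ["NP", "VP"]),
         ("NP", ["Det", "N"]),
         ("NP", ["N"]),
         ("VP", ["V", "NP"])]

def parse(tokens):
    stack = []
    while True:
        reduced = True
        while reduced:                      # reduce: replace a RHS handle on top of
            reduced = False                 # the stack with its LHS non-terminal
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]
                    reduced = True
                    break
        if tokens:
            stack.append(tokens.pop(0))     # shift the next input token
        else:
            return stack                    # accept when the stack is exactly ['S']

print(parse(["N", "V", "N"]))               # ['S'], mirroring the trace above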
Why isn't this enough?
Unanticipated rules
Difficulty finding non-recursive, base NPs
Structural ambiguity
Structural Ambiguity
"I saw the man with the telescope."
[Two parse trees for the sentence: in one, the prepositional phrase "with the telescope" attaches to the noun phrase "the man"; in the other, it attaches to the verb phrase headed by "saw".]
What are the more current solutions?
Machine Learning:
• Transformation-based Learning
• Memory-based Learning
• Maximum Entropy Model
• Hidden Markov Model
• Conditional Random Field
• Support Vector Machines
Machine Learning means TRAINING!
Corpus: a large, structured set of texts
• Establish usage statistics
• Learn linguistic rules
The Brown Corpus: American English, roughly 1 million words, tagged with parts of speech.
http://www.edict.com.hk/concordance/WWWConcappE.htm
Transformation-based Machine Learning
An "error-driven" approach for learning an ordered set of rules:
1. Generate all rules that correct at least one error.
2. For each rule:
   (a) Apply to a copy of the most recent state of the training set.
   (b) Score the result using the objective function.
3. Select the rule with the best score.
4. Update the training set by applying the selected rule.
5. Stop if the score is smaller than some pre-set threshold T; otherwise repeat from step 1.
Transformation-based NPE example
Input: "Whitney/NN currently/ADV has/VB the/DT right/ADJ idea/NN."
Expected output: "[NP Whitney] [ADV currently] [VB has] [NP the right idea]."
Rules generated (not all shown):
From   To   If
NN     NP   always
ADJ    NP   the previous word was ART
DT     NP   the next word is an ADJ
DT     NP   the previous word was VB
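The following is a minimal sketch (with made-up rule encodings) of how an ordered list of learned transformations like the ones in the table above is applied to an initial tagging; the ART tag from the table is treated as DT here, and the data structures are assumptions for illustration.

# Apply an ordered list of learned transformations to an initial tag sequence.
def prev_is(tag):
    return lambda tags, i: i > 0 and tags[i - 1] == tag

def next_is(tag):
    return lambda tags, i: i + 1 < len(tags) and tags[i + 1] == tag

def always(tags, i):
    return True

RULES = [                                   # (from_tag, to_tag, condition), in learned order
    ("NN",  "NP", always),                  # NN  -> NP always
    ("ADJ", "NP", prev_is("DT")),           # ADJ -> NP if the previous word was an article
    ("DT",  "NP", next_is("ADJ")),          # DT  -> NP if the next word is an ADJ
    ("DT",  "NP", prev_is("VB")),           # DT  -> NP if the previous word was VB
]

def apply_rules(tags):
    for frm, to, cond in RULES:
        tags = [to if t == frm and cond(tags, i) else t
                for i, t in enumerate(tags)]
    return tags

tokens = ["Whitney", "currently", "has", "the", "right", "idea"]
print(list(zip(tokens, apply_rules(["NN", "ADV", "VB", "DT", "ADJ", "NN"]))))
# "Whitney", "the", "right", and "idea" end up tagged NP, matching the expected bracketing.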
Memory-based Machine Learning
Classify data according to similarities to other data observed earlier: "nearest neighbor" learning.
Learning: store all "rules" in memory.
Classification:
• Given a new test instance X, compare it to all memory instances.
• Compute a distance between X and each memory instance Y.
• Update the top k of closest instances (nearest neighbors).
• When done, take the majority class of the k nearest neighbors as the class of X.
Daelemans, 2005
Memory-based Machine Learning Continued
Distance…?
The Overlapping Function: count the number of mismatching features.
The Modified Value Difference Metric (MVDM) Function: estimate a numeric distance between two "rules".
The distance between two N-dimensional vectors A, B with discrete (for example symbolic) elements, in a K-class problem, is computed using conditional probabilities:
d(A,B) = Σ_{j=1..N} Σ_{i=1..K} | P(C_i | A_j) - P(C_i | B_j) |
where P(C_i | A_j) is estimated by calculating the number N_i(A_j) of times feature value A_j occurred in vectors belonging to class C_i, and dividing it by the number of times feature value A_j occurred for any class.
Dusch, 1998
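A minimal sketch of both distance functions and the k-nearest-neighbor vote they feed; the toy memory of POS-tag windows and the B/I/O labels are assumptions for illustration.

from collections import Counter

def overlap_distance(a, b):
    """Overlapping function: count the number of mismatching features."""
    return sum(1 for x, y in zip(a, b) if x != y)

def mvdm_distance(a, b, memory):
    """MVDM: d(A,B) = sum_j sum_i |P(C_i | A_j) - P(C_i | B_j)|, estimated from memory."""
    classes = {cls for _, cls in memory}
    def p(cls, j, value):
        with_value = [c for vec, c in memory if vec[j] == value]
        return with_value.count(cls) / len(with_value) if with_value else 0.0
    return sum(abs(p(c, j, a[j]) - p(c, j, b[j]))
               for j in range(len(a)) for c in classes)

def knn_classify(x, memory, k=1, dist=overlap_distance):
    """memory: list of (feature_vector, chunk_class) pairs stored during training."""
    neighbours = sorted(memory, key=lambda item: dist(x, item[0]))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]

# Toy memory: (previous, current, next) POS tags -> chunk tag of the current word.
memory = [(("DT", "NN", "VB"), "I"), (("VB", "DT", "NN"), "B"),
          (("NN", "VB", "DT"), "O"), (("DT", "ADJ", "NN"), "I")]
print(knn_classify(("VB", "DT", "ADJ"), memory, k=1))   # 'B'
# dist=lambda a, b: mvdm_distance(a, b, memory) would swap in the MVDM metric.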
Memory-based NPE example
Suppose we have the following candidate sequence:
DT ADJ ADJ NN NN
• "The beautiful, intelligent summer intern"
In our rule set we have:
DT ADJ ADJ NN NNP
DT ADJ NN NN
Maximum Entropy
The least biased probability distribution that encodes the given information maximizes the information entropy, that is, the measure of uncertainty associated with a random variable.
Consider that we have m unique propositions:
• The most informative distribution is one in which we know one of the propositions is true – information entropy is 0.
• The least informative distribution is one in which there is no reason to favor any one proposition over another – information entropy is log m.
Maximum Entropy applied to NPE
Consider several French translations of the English word "in":
p(dans) + p(en) + p(á) + p(au cours de) + p(pendant) = 1
Now suppose we find that either dans or en is chosen 30% of the time. We must add that constraint to the model and choose the most uniform distribution that satisfies it:
p(dans) = 3/20, p(en) = 3/20, p(á) = 7/30, p(au cours de) = 7/30, p(pendant) = 7/30
What if we now find that either dans or á is used half of the time?
p(dans) + p(en) = 0.3
p(dans) + p(á) = 0.5
Now what is the most "uniform" distribution?
Berger, 1996
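A minimal numeric sketch (assuming NumPy and SciPy are available) of choosing the maximum-entropy distribution over the five translations once both constraints are imposed; this is an illustration of the principle, not the original derivation.

# Maximize entropy subject to: sum p = 1, p(dans)+p(en) = 0.3, p(dans)+p(á) = 0.5.
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "á", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))        # minimizing sum p log p maximizes entropy H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},
]
result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(words, result.x.round(3))))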
Hidden Markov Model
In a statistical model of a system possessing the Markov property…
• There are a discrete number of possible states.
• The probability distribution of future states depends only on the present state and is independent of past states.
These states are not directly observable in a hidden Markov model.
The goal is to determine the hidden properties from the observable ones.
Hidden Markov Model
[State diagram: x: hidden states; y: observable states; a: transition probabilities; b: output probabilities.]
HMM Example
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
   'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
   'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6}, }
emission_probability = {
   'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
   'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}, }
In this case, the weather possesses the Markov property.
HMM as applied to NPE
In the case of noun phrase extraction, the hidden property is the unknown grammar "rule".
Our observations are formed by our training data.
Contextual probabilities represent the state transitions: given our previous two transitions, what is the likelihood of continuing, ending, or beginning a noun phrase? P(o_j | o_{j-1}, o_{j-2})
Output probabilities: given our current state transition, what is the likelihood of our current word being part of, beginning, or ending a noun phrase? P(i_j | o_j)
max over o_1…o_T of  Π_{j=1…T} P(o_j | o_{j-1}, o_{j-2}) · P(i_j | o_j)
The Viterbi Algorithm
Now that we've constructed this probabilistic representation, we need to traverse it.
It finds the most likely sequence of states.
Viterbi Algorithm
"Whitney gave a painfully long presentation."
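The following is a minimal sketch of the Viterbi decoder, applied here to the weather HMM defined a few slides back (the dictionary representation mirrors that example).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden-state sequence."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        step = {}
        for s in states:
            # best previous state, times the transition and emission probabilities
            prob, path = max((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                              V[-1][prev][1] + [s]) for prev in states)
            step[s] = (prob, path)
        V.append(step)
    return max(V[-1].values())

states = ('Rainy', 'Sunny')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                          'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
print(viterbi(('walk', 'shop', 'clean'), states, start_probability,
              transition_probability, emission_probability))
# (0.01344, ['Sunny', 'Rainy', 'Rainy'])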
Conditional Random Fields
An undirected graphical model in which each vertex represents a random variable whose distribution is to be inferred, and each edge represents a dependency between two random variables. In a CRF, the distribution of each discrete random variable Y in the graph is conditioned on an input sequence X.
[Linear-chain diagram: output variables y1, y2, …, yn-1, yn, each connected to the input sequence x1, …, xn-1, xn. In the NPE case each Yi could be B, I, or O.]
Conditional Random Fields
The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs.
The transition probabilities of the HMM have been transformed into feature functions that are conditional upon the input sequence.
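To illustrate the difference in conditioning, here is a minimal sketch of feature functions for a linear-chain CRF over B/I/O chunk labels. The specific features and weights are invented for illustration, and training (estimating the weights) and the normalizing constant Z(x) are omitted.

import math

# Feature functions f(y_prev, y, x, i) may inspect the WHOLE input sequence x,
# which is what relaxes the HMM's independence assumptions.
def f_det_begins_np(y_prev, y, x, i):
    return 1.0 if y == "B" and x[i][1] == "DT" else 0.0

def f_inside_after_begin(y_prev, y, x, i):
    return 1.0 if y_prev == "B" and y == "I" else 0.0

def f_capitalized_begins_np(y_prev, y, x, i):
    return 1.0 if y == "B" and x[i][0][0].isupper() else 0.0   # looks at the raw word

FEATURES = [f_det_begins_np, f_inside_after_begin, f_capitalized_begins_np]
WEIGHTS  = [1.5, 2.0, 0.8]                  # would be learned in a real CRF

def score(labels, x):
    """Unnormalized score of one label sequence given the whole input sequence."""
    total = 0.0
    for i in range(len(x)):
        y_prev = labels[i - 1] if i > 0 else "START"
        total += sum(w * f(y_prev, labels[i], x, i)
                     for w, f in zip(WEIGHTS, FEATURES))
    return math.exp(total)                  # P(labels | x) = this divided by Z(x)

x = [("The", "DT"), ("man", "NN"), ("sleeps", "VBZ")]
print(score(["B", "I", "O"], x) > score(["O", "O", "O"], x))   # True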
Support Vector Machines
We wish to graph a number of data points of dimension p and separate those points with a (p-1)-dimensional hyperplane that guarantees the maximum distance between the two classes of points – this ensures the most generalization.
These data points represent pattern samples whose dimension is dependent upon the number of features used to describe them.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/#GUI
What if our points are separated by a nonlinear barrier?
• The kernel function (Φ) maps points into a higher-dimensional space (for example, from 2-D to 3-D) where a linear separator may exist.
• The Radial Basis Function is currently the best kernel we have for this.
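A minimal sketch contrasting an explicit feature map from 2-D to 3-D (a standard textbook example, not from the slides) with the Radial Basis Function kernel, which computes similarity in the mapped space without ever constructing it.

import numpy as np

def phi(p):
    """Explicit degree-2 polynomial feature map: (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return np.array([x * x, np.sqrt(2) * x * y, y * y])

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel: exp(-gamma * ||a - b||^2), an implicit, very high-dimensional map."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-gamma * diff.dot(diff))

a, b = (1.0, 2.0), (0.5, -1.0)
print(np.dot(phi(a), phi(b)), np.dot(a, b) ** 2)   # both 2.25: the kernel trick
print(rbf_kernel(a, b))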
SVMs applied to NPE
Normally, SVMs are binary classifiers. For NPE we generally want to know about (at least) three classes:
• B: a token is at the beginning of a chunk
• I: a token is inside a chunk
• O: a token is outside a chunk
We can consider one class vs. all other classes for all possible combinations, or we can do a pairwise classification: if we have k classes, we build k · (k-1)/2 classifiers (a sketch of the pairwise approach follows).
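The following is a minimal sketch (assuming scikit-learn is installed) of that pairwise strategy: one binary SVM per pair of chunk classes, with a majority vote at prediction time. The POS-window encoding and the toy training data are invented for illustration.

from itertools import combinations
import numpy as np
from sklearn.svm import SVC

TAGS = ["DT", "ADJ", "NN", "VB", "ADV"]
def encode(window):
    """Encode a (previous, current, next) POS window as a crude numeric vector."""
    return [TAGS.index(t) for t in window]

# Toy training data: POS windows labelled with B/I/O chunk tags.
X = np.array([encode(w) for w in [("VB", "DT", "NN"), ("DT", "NN", "VB"),
                                  ("NN", "VB", "DT"), ("DT", "ADJ", "NN"),
                                  ("ADJ", "NN", "VB"), ("VB", "ADV", "DT")]])
y = np.array(["B", "I", "O", "I", "I", "O"])

# k classes -> k * (k - 1) / 2 pairwise classifiers (3 for B, I, O).
classifiers = {}
for a, b in combinations(sorted(set(y)), 2):
    mask = (y == a) | (y == b)
    classifiers[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])

def classify(window):
    votes = [clf.predict([encode(window)])[0] for clf in classifiers.values()]
    return max(set(votes), key=votes.count)   # majority vote over the pairwise SVMs

print(classify(("VB", "DT", "ADJ")))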
Performance Metrics Used
Precision = number of correct responses / number of responses
Recall = number of correct responses / number correct in key
F-measure = (β² + 1) · R · P / (β² · R + P)
where β² represents the relative weight of recall to precision (typically 1)
(Bikel, 1998)(Bikel, 1998)
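A small sketch of these three metrics computed over predicted versus gold-standard ("key") NP chunks; representing chunks as sets of (start, end) token spans is an assumption for illustration.

def evaluate(predicted, gold, beta=1.0):
    """predicted, gold: sets of chunk spans, e.g. (start, end) token offsets."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * recall * precision / (beta ** 2 * recall + precision)
    return precision, recall, f

gold      = {(0, 2), (4, 6), (7, 9)}
predicted = {(0, 2), (4, 5), (7, 9)}          # one chunk boundary is wrong
print(evaluate(predicted, gold))              # roughly (0.667, 0.667, 0.667)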
Primary Work

Primary Work | Method | Implementation | Evaluation Data | Performance (F-measure) | Pros | Cons
Dejean | Simple rule-based | "ALLiS" (uses XML input); not available | CONLL 2000 task | 92.09 | Extremely simple, quick; doesn't require a training corpus | Not very robust, difficult to improve upon; extremely difficult to generate rules
Ramshaw, Marcus | Transformation-based learning | C++, Perl; available | Penn Treebank | 92.03-93 | … | Extremely dependent upon the training set and its "completeness" (how many different ways the NPs are formed); requires a fair amount of memory
Tjong Kim Sang | Memory-based learning | "TiMBL" (Python); available | Penn Treebank, CONLL 2000 task | 93.34, 92.5 | Highly suited to the NLP task | Cannot intelligently weight "important" features and cannot identify feature dependency; both result in a loss of accuracy
Koeling | Maximum entropy | Not available | CONLL 2000 task | 91.97 | First statistical approach, higher accuracy | Always makes the best local decision without much regard for position
Molina, Pla | Hidden Markov model | Not available | CONLL 2000 task | 92.19 | Takes position into account | Makes conditional independence assumptions that ignore special input features such as capitalization, suffixes, surrounding words
Sha, Pereira | Conditional random fields | Java (available… sort of); CRF++ in C++ by Kudo is available | Penn Treebank, CONLL 2000 task | 94.38 ("no significant difference") | Can handle millions of features; handles both position and dependencies | "Overfitting"
Kudo, Matsumoto | Support vector machines | C++, Perl, Python; available | Penn Treebank, CONLL 2000 task | 94.22, 93.91 | Minimizes error, resulting in higher accuracy; handles tons of features | Doesn't really take position into account