Noun Phrase Extraction
A Description of Current Techniques
What is a noun phrase?
A phrase whose head is a noun or pronoun, optionally accompanied by a set of modifiers:
Determiners:
• Articles: a, an, the
• Demonstratives: this, that, those
• Numerals: one, two, three
• Possessives: my, their, whose
• Quantifiers: some, many
Adjectives: the red ball
Relative clauses: the books that I bought yesterday
Prepositional phrases: the man with the black hat
Is that really what we want?
POS tagging already identifies pronouns and nouns by themselves.
The man whose red hat I borrowed yesterday in the street that is next to my house lives next door.
[The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP.
Base Noun Phrases
[The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
How Prevalent is this Problem?
Established by Steven Abney in 1991 as a core step in Natural Language Processing.
A well-explored problem.
What were the successful early solutions?
Simple rule-based methods / finite state automata
Both of these rely on the aptitude of the linguist formulating the rule set.
Simple Rule-based / Finite State Automata
A list of grammar rules and relationships is established. For example:
• If an article precedes a noun, that article marks the beginning of a noun phrase.
• A noun phrase cannot begin immediately after an article.
The simplest method.
FSA simple NPE example
[FSA diagram: states S0, S1, and NP; transitions labeled determiner/adjective, noun/pronoun, adjective, relative clause/prepositional phrase, and noun.]
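To make the finite-state idea concrete, the following is a minimal sketch (not from the slides) of a base-NP chunker driven by a small automaton over POS tags; the tag set and the transition rules are simplified assumptions for illustration.

# Minimal finite-state base-NP chunker sketch (illustrative, simplified tag set).
# State "outside": not in a noun phrase; state "inside": collecting a noun phrase.
DETERMINERS = {"DT"}                       # articles, demonstratives, possessives, ...
ADJECTIVES  = {"JJ"}
NOUNS       = {"NN", "NNS", "NNP", "PRP"}  # nouns and pronouns

def chunk_base_nps(tagged_tokens):
    """tagged_tokens: list of (word, POS) pairs; returns the base NPs found."""
    chunks, current = [], []
    for word, pos in tagged_tokens:
        if pos in DETERMINERS or pos in ADJECTIVES or pos in NOUNS:
            current.append(word)               # determiner/adjective/noun opens or extends an NP
        elif current:
            chunks.append(" ".join(current))   # any other tag closes the current NP
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_base_nps([("The", "DT"), ("man", "NN"), ("lives", "VBZ"),
                      ("next", "JJ"), ("door", "NN")]))
# ['The man', 'next door']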
Simple rule NPE example
"Contextualization" and "lexicalization": the ratio between the number of occurrences of a POS tag in a chunk and the number of occurrences of that POS tag in the training corpora.
Parsing FSAs, grammars, regular expressions: LR(k) Parsing
The L means we do a left-to-right scan of the input tokens.
The R means we are guided by rightmost derivations.
The k means we will look at the next k tokens to help us make decisions about handles.
We shift input tokens onto a stack and then reduce that stack by replacing RHS handles with LHS non-terminals.
An Expression Grammar
1. E -> E + T
2. E -> E - T
3. E -> T
4. T -> T * F
5. T -> T / F
6. T -> F
7. F -> (E)
8. F -> i
LR Table for Exp Grammar
An LR(1) NPE Example
1. S -> NP VP
2. NP -> Det N
3. NP -> N
4. VP -> V NP
Stack        Input   Action
[]           N V N   SH N
[N]          V N     RE 3) NP -> N
[NP]         V N     SH V
[NP V]       N       SH N
[NP V N]             RE 3) NP -> N
[NP V NP]            RE 4) VP -> V NP
[NP VP]              RE 1) S -> NP VP
[S]                  Accept!
(Abney, 1991)
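The following is a minimal sketch of the shift-reduce loop behind this trace, using the four-rule toy grammar above. The greedy reduce-first control strategy is an assumption made for illustration; a real LR(k) parser would consult a parse table and k tokens of lookahead.

# Shift-reduce sketch for the toy grammar:
#   1. S -> NP VP   2. NP -> Det N   3. NP -> N   4. VP -> V NP
RULES = [("S", ["NP", "VP"]),
         ("NP", ["Det", "N"]),
         ("NP", ["N"]),
         ("VP", ["V", "NP"])]

def parse(tokens):
    stack = []
    while True:
        reduced = True
        while reduced:                      # reduce: replace a RHS handle on top of
            reduced = False                 # the stack with its LHS non-terminal
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]
                    reduced = True
                    break
        if tokens:
            stack.append(tokens.pop(0))     # shift the next input token
        else:
            return stack                    # accept when the stack is exactly ['S']

print(parse(["N", "V", "N"]))               # ['S'], mirroring the trace above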
Why isn't this enough?
Unanticipated rules
Difficulty finding non-recursive, base NPs
Structural ambiguity
Structural Ambiguity
"I saw the man with the telescope."
[Two parse trees for the sentence: in one, the prepositional phrase "with the telescope" attaches to the noun phrase "the man"; in the other, it attaches to the verb phrase headed by "saw".]
What are the more current solutions?
Machine Learning:
• Transformation-based Learning
• Memory-based Learning
• Maximum Entropy Model
• Hidden Markov Model
• Conditional Random Field
• Support Vector Machines
Machine Learning means TRAINING!
Corpus: a large, structured set of texts
• Establish usage statistics
• Learn linguistic rules
The Brown Corpus: American English, roughly 1 million words, tagged with parts of speech.
http://www.edict.com.hk/concordance/WWWConcappE.htm
Transformation-based Machine Learning
An "error-driven" approach for learning an ordered set of rules:
1. Generate all rules that correct at least one error.
2. For each rule:
   (a) Apply to a copy of the most recent state of the training set.
   (b) Score the result using the objective function.
3. Select the rule with the best score.
4. Update the training set by applying the selected rule.
5. Stop if the score is smaller than some pre-set threshold T; otherwise repeat from step 1.
Transformation-based NPE example
Input: "Whitney/NN currently/ADV has/VB the/DT right/ADJ idea/NN."
Expected output: "[NP Whitney] [ADV currently] [VB has] [NP the right idea]."
Rules generated (not all shown):
From   To   If
NN     NP   always
ADJ    NP   the previous word was ART
DT     NP   the next word is an ADJ
DT     NP   the previous word was VB
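The following is a minimal sketch (with made-up rule encodings) of how an ordered list of learned transformations like the ones in the table above is applied to an initial tagging; the ART tag from the table is treated as DT here, and the data structures are assumptions for illustration.

# Apply an ordered list of learned transformations to an initial tag sequence.
def prev_is(tag):
    return lambda tags, i: i > 0 and tags[i - 1] == tag

def next_is(tag):
    return lambda tags, i: i + 1 < len(tags) and tags[i + 1] == tag

def always(tags, i):
    return True

RULES = [                                   # (from_tag, to_tag, condition), in learned order
    ("NN",  "NP", always),                  # NN  -> NP always
    ("ADJ", "NP", prev_is("DT")),           # ADJ -> NP if the previous word was an article
    ("DT",  "NP", next_is("ADJ")),          # DT  -> NP if the next word is an ADJ
    ("DT",  "NP", prev_is("VB")),           # DT  -> NP if the previous word was VB
]

def apply_rules(tags):
    for frm, to, cond in RULES:
        tags = [to if t == frm and cond(tags, i) else t
                for i, t in enumerate(tags)]
    return tags

tokens = ["Whitney", "currently", "has", "the", "right", "idea"]
print(list(zip(tokens, apply_rules(["NN", "ADV", "VB", "DT", "ADJ", "NN"]))))
# "Whitney", "the", "right", and "idea" end up tagged NP, matching the expected bracketing.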
Memory-based Machine Learning
Classify data according to similarities to other data observed earlier: "nearest neighbor" learning.
Learning: store all "rules" in memory.
Classification:
• Given a new test instance X, compare it to all memory instances.
• Compute a distance between X and each memory instance Y.
• Update the top k of closest instances (nearest neighbors).
• When done, take the majority class of the k nearest neighbors as the class of X.
Daelemans, 2005
Memory-based Machine Learning Continued
Distance…?
The Overlapping Function: count the number of mismatching features.
The Modified Value Difference Metric (MVDM) Function: estimate a numeric distance between two "rules".
The distance between two N-dimensional vectors A, B with discrete (for example symbolic) elements, in a K-class problem, is computed using conditional probabilities:
d(A,B) = Σ_{j=1..N} Σ_{i=1..K} | P(C_i | A_j) - P(C_i | B_j) |
where P(C_i | A_j) is estimated by calculating the number N_i(A_j) of times feature value A_j occurred in vectors belonging to class C_i, and dividing it by the number of times feature value A_j occurred for any class.
Dusch, 1998
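A minimal sketch of both distance functions and the k-nearest-neighbor vote they feed; the toy memory of POS-tag windows and the B/I/O labels are assumptions for illustration.

from collections import Counter

def overlap_distance(a, b):
    """Overlapping function: count the number of mismatching features."""
    return sum(1 for x, y in zip(a, b) if x != y)

def mvdm_distance(a, b, memory):
    """MVDM: d(A,B) = sum_j sum_i |P(C_i | A_j) - P(C_i | B_j)|, estimated from memory."""
    classes = {cls for _, cls in memory}
    def p(cls, j, value):
        with_value = [c for vec, c in memory if vec[j] == value]
        return with_value.count(cls) / len(with_value) if with_value else 0.0
    return sum(abs(p(c, j, a[j]) - p(c, j, b[j]))
               for j in range(len(a)) for c in classes)

def knn_classify(x, memory, k=1, dist=overlap_distance):
    """memory: list of (feature_vector, chunk_class) pairs stored during training."""
    neighbours = sorted(memory, key=lambda item: dist(x, item[0]))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]

# Toy memory: (previous, current, next) POS tags -> chunk tag of the current word.
memory = [(("DT", "NN", "VB"), "I"), (("VB", "DT", "NN"), "B"),
          (("NN", "VB", "DT"), "O"), (("DT", "ADJ", "NN"), "I")]
print(knn_classify(("VB", "DT", "ADJ"), memory, k=1))   # 'B'
# dist=lambda a, b: mvdm_distance(a, b, memory) would swap in the MVDM metric.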
Memory-based NPE example
Suppose we have the following candidate sequence:
DT ADJ ADJ NN NN
• "The beautiful, intelligent summer intern"
In our rule set we have:
DT ADJ ADJ NN NNP
DT ADJ NN NN
Maximum Entropy
The least biased probability distribution that encodes the given information maximizes the information entropy, that is, the measure of uncertainty associated with a random variable.
Consider that we have m unique propositions:
• The most informative distribution is one in which we know one of the propositions is true – information entropy is 0.
• The least informative distribution is one in which there is no reason to favor any one proposition over another – information entropy is log m.
Maximum Entropy applied to NPE
Consider several French translations of the English word "in":
p(dans) + p(en) + p(á) + p(au cours de) + p(pendant) = 1
Now suppose we find that either dans or en is chosen 30% of the time. We must add that constraint to the model and choose the most uniform distribution that satisfies it:
p(dans) = 3/20, p(en) = 3/20, p(á) = 7/30, p(au cours de) = 7/30, p(pendant) = 7/30
What if we now find that either dans or á is used half of the time?
p(dans) + p(en) = 0.3
p(dans) + p(á) = 0.5
Now what is the most "uniform" distribution?
Berger, 1996
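A minimal numeric sketch (assuming NumPy and SciPy are available) of choosing the maximum-entropy distribution over the five translations once both constraints are imposed; this is an illustration of the principle, not the original derivation.

# Maximize entropy subject to: sum p = 1, p(dans)+p(en) = 0.3, p(dans)+p(á) = 0.5.
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "á", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))        # minimizing sum p log p maximizes entropy H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},
]
result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(words, result.x.round(3))))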
Hidden Markov Model
In a statistical model of a system possessing the Markov property…
• There are a discrete number of possible states.
• The probability distribution of future states depends only on the present state and is independent of past states.
These states are not directly observable in a hidden Markov model.
The goal is to determine the hidden properties from the observable ones.
Hidden Markov Model
[State diagram: x: hidden states; y: observable states; a: transition probabilities; b: output probabilities.]
HMM Example
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
   'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
   'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6}, }
emission_probability = {
   'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
   'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}, }
In this case, the weather possesses the Markov property.
HMM as applied to NPE
In the case of noun phrase extraction, the hidden property is the unknown grammar "rule".
Our observations are formed by our training data.
Contextual probabilities represent the state transitions: given our previous two transitions, what is the likelihood of continuing, ending, or beginning a noun phrase? P(o_j | o_{j-1}, o_{j-2})
Output probabilities: given our current state transition, what is the likelihood of our current word being part of, beginning, or ending a noun phrase? P(i_j | o_j)
max over o_1…o_T of  Π_{j=1…T} P(o_j | o_{j-1}, o_{j-2}) · P(i_j | o_j)
The Viterbi Algorithm
Now that we've constructed this probabilistic representation, we need to traverse it.
It finds the most likely sequence of states.
Viterbi Algorithm
"Whitney gave a painfully long presentation."
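The following is a minimal sketch of the Viterbi decoder, applied here to the weather HMM defined a few slides back (the dictionary representation mirrors that example).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden-state sequence."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        step = {}
        for s in states:
            # best previous state, times the transition and emission probabilities
            prob, path = max((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                              V[-1][prev][1] + [s]) for prev in states)
            step[s] = (prob, path)
        V.append(step)
    return max(V[-1].values())

states = ('Rainy', 'Sunny')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                          'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
print(viterbi(('walk', 'shop', 'clean'), states, start_probability,
              transition_probability, emission_probability))
# (0.01344, ['Sunny', 'Rainy', 'Rainy'])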
Conditional Random Fields
An undirected graphical model in which each vertex represents a random variable whose distribution is to be inferred, and each edge represents a dependency between two random variables. In a CRF, the distribution of each discrete random variable Y in the graph is conditioned on an input sequence X.
[Linear-chain diagram: output variables y1, y2, …, yn-1, yn, each connected to the input sequence x1, …, xn-1, xn. In the NPE case each Yi could be B, I, or O.]
Conditional Random Fields
The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs.
The transition probabilities of the HMM have been transformed into feature functions that are conditional upon the input sequence.
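To illustrate the difference in conditioning, here is a minimal sketch of feature functions for a linear-chain CRF over B/I/O chunk labels. The specific features and weights are invented for illustration, and training (estimating the weights) and the normalizing constant Z(x) are omitted.

import math

# Feature functions f(y_prev, y, x, i) may inspect the WHOLE input sequence x,
# which is what relaxes the HMM's independence assumptions.
def f_det_begins_np(y_prev, y, x, i):
    return 1.0 if y == "B" and x[i][1] == "DT" else 0.0

def f_inside_after_begin(y_prev, y, x, i):
    return 1.0 if y_prev == "B" and y == "I" else 0.0

def f_capitalized_begins_np(y_prev, y, x, i):
    return 1.0 if y == "B" and x[i][0][0].isupper() else 0.0   # looks at the raw word

FEATURES = [f_det_begins_np, f_inside_after_begin, f_capitalized_begins_np]
WEIGHTS  = [1.5, 2.0, 0.8]                  # would be learned in a real CRF

def score(labels, x):
    """Unnormalized score of one label sequence given the whole input sequence."""
    total = 0.0
    for i in range(len(x)):
        y_prev = labels[i - 1] if i > 0 else "START"
        total += sum(w * f(y_prev, labels[i], x, i)
                     for w, f in zip(WEIGHTS, FEATURES))
    return math.exp(total)                  # P(labels | x) = this divided by Z(x)

x = [("The", "DT"), ("man", "NN"), ("sleeps", "VBZ")]
print(score(["B", "I", "O"], x) > score(["O", "O", "O"], x))   # True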
Support Vector Machines
We wish to graph a number of data points of dimension p and separate those points with a (p-1)-dimensional hyperplane that guarantees the maximum distance between the two classes of points – this ensures the most generalization.
These data points represent pattern samples whose dimension is dependent upon the number of features used to describe them.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/#GUI
What if our points are separated by a nonlinear barrier?
• The kernel function (Φ) maps points into a higher-dimensional space (for example, from 2-D to 3-D) where a linear separator may exist.
• The Radial Basis Function is currently the best kernel we have for this.
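A minimal sketch contrasting an explicit feature map from 2-D to 3-D (a standard textbook example, not from the slides) with the Radial Basis Function kernel, which computes similarity in the mapped space without ever constructing it.

import numpy as np

def phi(p):
    """Explicit degree-2 polynomial feature map: (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = p
    return np.array([x * x, np.sqrt(2) * x * y, y * y])

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel: exp(-gamma * ||a - b||^2), an implicit, very high-dimensional map."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-gamma * diff.dot(diff))

a, b = (1.0, 2.0), (0.5, -1.0)
print(np.dot(phi(a), phi(b)), np.dot(a, b) ** 2)   # both 2.25: the kernel trick
print(rbf_kernel(a, b))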
SVMs applied to NPE
Normally, SVMs are binary classifiers. For NPE we generally want to know about (at least) three classes:
• B: a token is at the beginning of a chunk
• I: a token is inside a chunk
• O: a token is outside a chunk
We can consider one class vs. all other classes for all possible combinations, or we can do a pairwise classification: if we have k classes, we build k · (k-1)/2 classifiers (a sketch of the pairwise approach follows).
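The following is a minimal sketch (assuming scikit-learn is installed) of that pairwise strategy: one binary SVM per pair of chunk classes, with a majority vote at prediction time. The POS-window encoding and the toy training data are invented for illustration.

from itertools import combinations
import numpy as np
from sklearn.svm import SVC

TAGS = ["DT", "ADJ", "NN", "VB", "ADV"]
def encode(window):
    """Encode a (previous, current, next) POS window as a crude numeric vector."""
    return [TAGS.index(t) for t in window]

# Toy training data: POS windows labelled with B/I/O chunk tags.
X = np.array([encode(w) for w in [("VB", "DT", "NN"), ("DT", "NN", "VB"),
                                  ("NN", "VB", "DT"), ("DT", "ADJ", "NN"),
                                  ("ADJ", "NN", "VB"), ("VB", "ADV", "DT")]])
y = np.array(["B", "I", "O", "I", "I", "O"])

# k classes -> k * (k - 1) / 2 pairwise classifiers (3 for B, I, O).
classifiers = {}
for a, b in combinations(sorted(set(y)), 2):
    mask = (y == a) | (y == b)
    classifiers[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])

def classify(window):
    votes = [clf.predict([encode(window)])[0] for clf in classifiers.values()]
    return max(set(votes), key=votes.count)   # majority vote over the pairwise SVMs

print(classify(("VB", "DT", "ADJ")))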
Performance Metrics Used
Precision = number of correct responses / number of responses
Recall = number of correct responses / number correct in key
F-measure = (β² + 1) · R · P / (β² · R + P)
where β² represents the relative weight of recall to precision (typically 1)
(Bikel, 1998)(Bikel, 1998)
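A small sketch of these three metrics computed over predicted versus gold-standard ("key") NP chunks; representing chunks as sets of (start, end) token spans is an assumption for illustration.

def evaluate(predicted, gold, beta=1.0):
    """predicted, gold: sets of chunk spans, e.g. (start, end) token offsets."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * recall * precision / (beta ** 2 * recall + precision)
    return precision, recall, f

gold      = {(0, 2), (4, 6), (7, 9)}
predicted = {(0, 2), (4, 5), (7, 9)}          # one chunk boundary is wrong
print(evaluate(predicted, gold))              # roughly (0.667, 0.667, 0.667)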
Primary Work

Primary Work | Method | Implementation | Evaluation Data | Performance (F-measure) | Pros | Cons
Dejean | Simple rule-based | "ALLiS" (uses XML input); not available | CONLL 2000 task | 92.09 | Extremely simple, quick; doesn't require a training corpus | Not very robust, difficult to improve upon; extremely difficult to generate rules
Ramshaw, Marcus | Transformation-based learning | C++, Perl; available | Penn Treebank | 92.03-93 | … | Extremely dependent upon the training set and its "completeness" (how many different ways the NPs are formed); requires a fair amount of memory
Tjong Kim Sang | Memory-based learning | "TiMBL" (Python); available | Penn Treebank, CONLL 2000 task | 93.34, 92.5 | Highly suited to the NLP task | Cannot intelligently weight "important" features and cannot identify feature dependency; both result in a loss of accuracy
Koeling | Maximum entropy | Not available | CONLL 2000 task | 91.97 | First statistical approach, higher accuracy | Always makes the best local decision without much regard for position
Molina, Pla | Hidden Markov model | Not available | CONLL 2000 task | 92.19 | Takes position into account | Makes conditional independence assumptions that ignore special input features such as capitalization, suffixes, surrounding words
Sha, Pereira | Conditional random fields | Java (available… sort of); CRF++ in C++ by Kudo is available | Penn Treebank, CONLL 2000 task | 94.38 ("no significant difference") | Can handle millions of features; handles both position and dependencies | "Overfitting"
Kudo, Matsumoto | Support vector machines | C++, Perl, Python; available | Penn Treebank, CONLL 2000 task | 94.22, 93.91 | Minimizes error, resulting in higher accuracy; handles tons of features | Doesn't really take position into account