Corpora and Statistical Methods Lecture 11

Corpora and Statistical Methods Lecture 11. Albert Gatt. Part 2. Statistical parsing.

Preliminary issues: how parsers are evaluated

Evaluation

The issue: what objective criterion are we trying to maximise? That is, under what objective function can I say that my parser does well (and how well)? Answering this requires a gold standard.

Possibilities:
- strict match of the candidate parse against the gold standard
- match of components of the candidate parse against gold standard components

A classic evaluation metric is PARSEVAL:
- an initiative to compare parsers on the same data
- not initially concerned with stochastic parsers
- evaluates parser output piece by piece

Main components:
- compares the gold standard tree to the parser's tree
- typically, the gold standard is the tree in a treebank

PARSEVAL computes:
- precision
- recall
- crossing brackets

PARSEVAL: labeled recall

The proportion of nodes in the treebank (gold standard) parse that are matched by a correct node in the candidate parse.

Correct node = node in the candidate parse which:
- has the same node label (labels were originally omitted from PARSEVAL to avoid theoretical conflict)
- spans the same words

PARSEVAL: labeled precision

The proportion of correctly labelled and correctly spanning nodes in the candidate.

Combining precision and recall

As usual, precision and recall can be combined into a single F-measure, their (weighted) harmonic mean.
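A minimal sketch of labeled precision, recall and the balanced F-measure, assuming each constituent is represented as a (label, start, end) triple over word-boundary positions. The function name `parseval` and the toy constituents are illustrative assumptions, not part of any PARSEVAL tool.

```python
from collections import Counter

def parseval(gold, candidate):
    """Labeled precision, recall and F1 over (label, start, end) constituents."""
    gold_counts = Counter(gold)
    cand_counts = Counter(candidate)
    # Multiset intersection: a node is correct if label and span both match.
    correct = sum((gold_counts & cand_counts).values())
    precision = correct / len(candidate) if candidate else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical constituents for a five-word sentence (positions 0..5).
gold = [("S", 0, 5), ("VP", 0, 5), ("NP", 1, 5),
        ("NP", 1, 3), ("PP", 3, 5), ("NP", 4, 5)]
candidate = [("S", 0, 5), ("VP", 0, 5),
             ("NP", 1, 3), ("PP", 3, 5), ("NP", 4, 5)]

p, r, f = parseval(gold, candidate)
print(p, r, f)
```

Here the candidate misses one gold node, so every node it proposes is correct (precision 1) while recall drops to 5/6.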

PARSEVAL: crossed brackets

The number of brackets in the candidate parse which cross brackets in the treebank parse, e.g. the treebank has ((X Y) Z) and the candidate has (X (Y Z)).
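The ((X Y) Z) versus (X (Y Z)) case can be checked with a short sketch. It assumes brackets are given as (start, end) spans over word-boundary positions; the helper names `crosses` and `crossing_brackets` are illustrative.

```python
def crosses(a, b):
    """True if spans a=(i, j) and b=(k, l) overlap without one nesting in the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j

def crossing_brackets(gold_spans, candidate_spans):
    """Number of candidate brackets that cross at least one gold bracket."""
    return sum(any(crosses(c, g) for g in gold_spans) for c in candidate_spans)

# Words X Y Z occupy positions 0..3.
gold = [(0, 3), (0, 2)]        # ((X Y) Z)
candidate = [(0, 3), (1, 3)]   # (X (Y Z))

n = crossing_brackets(gold, candidate)
print(n)
```

The candidate bracket (1, 3) around Y Z overlaps the gold bracket (0, 2) around X Y without nesting, so it is counted as crossing.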

Unlike precision and recall, the crossed-brackets count is an objective function to minimise.

Current performance

Current parsers achieve:
- ca. 90% precision
- >90% recall
- 1% cross-bracketed constituents

Some issues with PARSEVAL

These measures evaluate parses at the level of individual decisions (nodes). They ignore the difficulty of getting a globally correct solution by carrying out a correct sequence of decisions.

Success on crossing brackets depends on the kind of parse trees used. The Penn Treebank has very flat trees (not much embedding), so crossed brackets are less likely.

In PARSEVAL, if a constituent is attached lower in a tree than in the gold standard, all its daughters are counted wrong.

Probabilistic parsing with PCFGs: the basic algorithm

The basic PCFG parsing algorithm

Many statistical parsers use a version of the CYK algorithm.

Assumptions: CFG productions are in Chomsky Normal Form:
- A → B C
- A → a

Use indices between words, e.g. for "Book the flight through Houston":
(0) Book (1) the (2) flight (3) through (4) Houston (5)

Procedure (bottom-up):
- Traverse the input sentence left to right.
- Use a chart to store constituents together with their span and their probability.

Probabilistic CYK: example PCFG

Rule              Probability
S → NP VP         .80
NP → Det N        .30
VP → V NP         .20
V → includes      .05
Det → the         .4
Det → a           .4
N → meal          .01
N → flight        .02

Probabilistic CYK: initialisation

Input: The flight includes a meal. The chart starts empty, with rows i = 0..4 (start index) and columns j = 1..5 (end index).

// Lexical lookup
for j = 1 to length(string) do:
    chart[j-1, j] := {X : X -> word in G}
    // Syntactic lookup
    for i = j-2 down to 0 do:
        chart[i, j] := {}
        for k = i+1 to j-1 do:
            for each A -> B C do:
                if B in chart[i, k] & C in chart[k, j]:
                    chart[i, j] := chart[i, j] U {A}

Probabilistic CYK: filling the chart

Running the lexical and syntactic lookup steps on "The flight includes a meal" fills the chart in this order (values as shown on the slides):

1. Lexical step: chart[0,1] = Det (.4)
2. Lexical step: chart[1,2] = N (.02)
3. Syntactic step: chart[0,2] = NP (.30 × .4 × .02 = .0024)
4. Lexical step: chart[2,3] = V (.05)
5. Lexical step: chart[3,4] = Det (.4)
6. Lexical step: chart[4,5] = N (.01)
7. Syntactic step: chart[3,5] = NP (.30 × .4 × .01 = .0012, shown rounded to .001)
8. Syntactic step: chart[2,5] = VP (.20 × .05 × .001 = .00001)
9. Syntactic step: chart[0,5] = S (.80 × .0024 × .00001 = .0000000192)

Note that steps 8 and 9 use the rounded NP value from step 7. Without rounding, VP = .000012 and S = .00000002304.
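The chart-filling procedure can be sketched in Python as follows. This is an illustrative sketch, not the lecture's implementation: the grammar is the example PCFG, the dictionary layout and the names `pcyk`, `tree`, `BINARY` and `LEXICAL` are assumptions, and each cell keeps only the most probable analysis per non-terminal plus a backpointer.

```python
from collections import defaultdict

# Example PCFG from the slides, in Chomsky Normal Form.
BINARY = {("S", "NP", "VP"): 0.80, ("NP", "Det", "N"): 0.30,
          ("VP", "V", "NP"): 0.20}
LEXICAL = {("V", "includes"): 0.05, ("Det", "the"): 0.4, ("Det", "a"): 0.4,
           ("N", "meal"): 0.01, ("N", "flight"): 0.02}

def pcyk(words):
    """Fill chart[i, j][A] with the best probability of A spanning i..j."""
    n = len(words)
    chart = defaultdict(dict)
    back = {}  # back[i, j, A] = (k, B, C): split point and daughters
    for j in range(1, n + 1):
        # Lexical lookup for the word between positions j-1 and j.
        for (tag, word), p in LEXICAL.items():
            if word == words[j - 1]:
                chart[j - 1, j][tag] = p
        # Syntactic lookup for all longer spans ending at j.
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY.items():
                    if b in chart[i, k] and c in chart[k, j]:
                        prob = p * chart[i, k][b] * chart[k, j][c]
                        if prob > chart[i, j].get(a, 0.0):
                            chart[i, j][a] = prob
                            back[i, j, a] = (k, b, c)
    return chart, back

def tree(back, words, i, j, a):
    """Rebuild the best tree for A over i..j by following backpointers."""
    if (i, j, a) not in back:
        return (a, words[i])
    k, b, c = back[i, j, a]
    return (a, tree(back, words, i, k, b), tree(back, words, k, j, c))

words = "the flight includes a meal".split()
chart, back = pcyk(words)
best = chart[0, 5]["S"]
parse = tree(back, words, 0, 5, "S")
print(best, parse)
```

Because the sketch does not round intermediate values, the probability it finds for S is .00000002304 rather than the .0000000192 shown in the chart, which is computed from rounded NP and VP values.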

Probabilistic CYK: summary

- Cells in the chart hold probabilities.
- The bottom-up procedure computes the probability of a parse incrementally.
- To obtain parse trees, cells need to be augmented with backpointers.

Probabilistic parsing with lexicalised PCFGs

Main approaches (focus on Collins (1997, 1999)); see also Charniak (1997).

Unlexicalised PCFG estimation

Charniak (1996) used Penn Treebank POS and phrasal categories to induce a maximum likelihood PCFG:
- only used the relative frequency of local trees as the estimates for rule probabilities
- did not apply smoothing or any other techniques
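Relative-frequency estimation amounts to P(A → β) = count(A → β) / count(A). A minimal sketch, assuming trees are nested tuples of the form (label, child, ...) with words as plain strings; the two-tree treebank and the function names are hypothetical, and this is not Charniak's code.

```python
from collections import Counter

def local_trees(tree):
    """Yield (lhs, rhs) for every local tree (depth-1 subtree)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield label, (children[0],)   # pre-terminal -> word
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from local_trees(child)

def mle_pcfg(treebank):
    """Relative-frequency rule probabilities, with no smoothing."""
    rule_counts = Counter(r for t in treebank for r in local_trees(t))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

# Tiny hypothetical treebank.
treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "flight")), ("VP", ("V", "left"))),
    ("S", ("NP", ("N", "Sue")), ("VP", ("V", "left"))),
]
probs = mle_pcfg(treebank)
print(probs[("NP", ("Det", "N"))])
```

NP is expanded twice in this toy treebank, once as Det N and once as N, so each expansion receives probability 0.5.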

Works surprisingly well: 80.4% recall; 78.8% precision (crossed brackets not estimated).

This suggests that most parsing decisions are mundane and can be handled well by an unlexicalised PCFG.

Probabilistic lexicalised PCFGs

Standard format of lexicalised rules:
- associate the head word with each non-terminal, e.g. for "dumped sacks into":
  VP(dumped) → VBD(dumped) NP(sacks) PP(into)
- associate the head tag with each non-terminal as well:
  VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)

Types of rules:
- lexical rules expand pre-terminals to words, e.g. NNS(sacks,NNS) → sacks; their probability is always 1
- internal rules expand non-terminals, e.g. VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)

Estimating probabilities

Non-generative model:
- take an MLE estimate of the probability of an entire rule
- non-generative models suffer from serious data sparseness problems

Generative model:
- estimate the probability of a rule by breaking it up into sub-rules

Collins Model 1

Main idea: represent CFG rules as expansions into head + left modifiers + right modifiers:

LHS → STOP Ln ... L1 H R1 ... Rm STOP

- Li/Ri is of the form L/R(word,tag), e.g. NP(sacks,NNS)
- STOP: special symbol indicating the left/right boundary

Parsing: given the LHS, generate the head of the rule, then the left modifiers (until STOP) and the right modifiers (until STOP), inside-out. Each step has a probability.

Collins Model 1: example

VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)

(1) Head H(hw,ht)
(2) Left modifiers
(3) Right modifiers

Total probability: the product of (1) to (3).
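The individual formulas for this example did not survive in this text. As a hedged reconstruction following the usual presentation of Collins Model 1 (without the distance extension), the head is generated given the parent and head word, and each modifier, including STOP, is generated given the parent, the head tag and the head word. The subscripts H, L and R for the three distributions are notational assumptions.

```latex
\begin{align*}
P(\text{rule}) ={}& P_H(\mathrm{VBD} \mid \mathrm{VP}, \mathit{dumped})
  && \text{(1) head} \\
\times{}& P_L(\mathrm{STOP} \mid \mathrm{VP}, \mathrm{VBD}, \mathit{dumped})
  && \text{(2) left modifiers} \\
\times{}& P_R(\mathrm{NP}(\mathit{sacks},\mathrm{NNS}) \mid \mathrm{VP}, \mathrm{VBD}, \mathit{dumped})
  && \text{(3) right modifiers} \\
\times{}& P_R(\mathrm{PP}(\mathit{into},\mathrm{IN}) \mid \mathrm{VP}, \mathrm{VBD}, \mathit{dumped}) \\
\times{}& P_R(\mathrm{STOP} \mid \mathrm{VP}, \mathrm{VBD}, \mathit{dumped})
\end{align*}
```

The rule has no left modifiers, so the left side contributes only the probability of generating STOP immediately.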

Variations on Model 1: distance

Collins proposed to extend rules by conditioning on the distance of modifiers from the head, where distance is a function of the yield of the modifiers seen so far.

Distance for the R2 probability = words under R1.

Using a distance function

The simplest kind of distance function is a tuple of binary features:
- Is the string of length 0?
- Does the string contain a verb?

Example uses:
- if the string has length 0, PR should be higher: English is right-branching and most right modifiers are adjacent to the head verb
- if the string contains a verb, PR should be lower: this lets the model capture the preference to attach dependents to the most recent verb
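A minimal sketch of the two-feature distance function, assuming the string between the head and the modifier being generated is available as (word, tag) pairs. The verb tag set is assumed for illustration and the name `distance` is hypothetical.

```python
# Penn Treebank verb tags, assumed here for illustration.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def distance(tagged_words):
    """Return (is_empty, contains_verb) for the string of (word, tag) pairs."""
    is_empty = len(tagged_words) == 0
    contains_verb = any(tag in VERB_TAGS for _, tag in tagged_words)
    return (is_empty, contains_verb)

# "dumped sacks into": R1 = NP(sacks) is adjacent to the head,
# while R2 = PP(into) is separated from it by the words under R1.
d_r1 = distance([])
d_r2 = distance([("sacks", "NNS")])
print(d_r1, d_r2)
```

The resulting tuple is added to the conditioning context of each modifier probability, so adjacency and intervening verbs can shift the estimate.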

Further additions

Collins Model 2: adds subcategorisation preferences and a distinction between complements and adjuncts.

Model 3: augmented to deal with long-distance (WH) dependencies.

Smoothing and backoff

Rules may condition on words that never occur in the training data. Collins used a 3-level backoff model, with the levels combined using linear interpolation:
1. use the head word
2. use the head tag
3. parent only
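A sketch of how three backoff estimates might be combined by linear interpolation. The nested form below and the fixed weights are assumptions for illustration; in practice the weights would be set from training counts rather than chosen by hand.

```python
def interpolate(e1, e2, e3, lam1, lam2):
    """Combine estimates from most specific (e1) to least specific (e3).

    e1: estimate conditioned on the head word (and tag)
    e2: estimate conditioned on the head tag only
    e3: estimate conditioned on the parent only
    """
    return lam1 * e1 + (1 - lam1) * (lam2 * e2 + (1 - lam2) * e3)

# Hypothetical values: the head word was never seen with this modifier,
# so e1 is 0 and the less specific levels supply the probability mass.
p = interpolate(e1=0.0, e2=0.2, e3=0.1, lam1=0.5, lam2=0.5)
print(p)
```

Even though the most specific estimate is zero, the interpolated probability stays above zero, which is the point of backing off.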

Other parsing approaches

Data-oriented parsing

An alternative to grammar-based models:
- does not attempt to derive a grammar from a treebank
- treebank data is stored as fragments of trees
- the parser uses whichever trees seem to be useful

Suppose we want to parse "Sue heard Jim". The corpus contains potentially useful tree fragments, which the parser can combine to give a parse.

There may be multiple fundamentally distinct derivations of a single tree.

Parse using Monte Carlo simulation methods:
- randomly produce a large sample of derivations
- use these to find the most probable parse
- disadvantage: needs very large samples to make parses accurate, therefore potentially slow

Data-oriented parsing vs. PCFGs

Possible advantages:
- using partial trees directly accounts for lexical dependencies
- also accounts for multi-word expressions and idioms (e.g. take advantage of)
- while PCFG rules only represent trees of depth 1, DOP fragments can represent trees of arbitrary depth

Similarities to PCFG:
- tree fragments could be equivalent to PCFG rules
- probabilities estimated for grammar rules are exactly the same as for tree fragments

History Based Grammars (HBG)

General idea: any derivational step can be influenced by any earlier derivational step (Black et al. 1993). The probability of expansion of the current node is conditioned on all previous nodes along the path from the root.

Black et al. lexicalise their grammar. Every phrasal node inherits 2 words:
- its lexical head H1
- a secondary head H2, deemed to be useful
e.g. the PP "in the bank" might have H1 = in and H2 = bank

Every non-terminal is also assigned:
- a syntactic category (Syn), e.g. PP
- a semantic category (Sem), e.g. with-Data

An index I indicates which child of the parent node is being expanded.

HBG example (Black et al. 1993)

Estimation of the probability of a rule R:

The probability of:
- the current rule R to be applied
- its Syn and Sem category
- its heads H1 and H2

conditioned on:
- Syn and Sem of the parent node
- the rule that gave rise to the parent
- the index of this child relative to the parent
- the heads H1 and H2 of the parent
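Written as a single conditional probability, the quantities listed above give the following. The subscript p for parent-node values and the symbol I_{pc} for the child index are notational assumptions.

```latex
\[
P\left(\mathit{Syn}, \mathit{Sem}, R, H_1, H_2 \;\middle|\;
  \mathit{Syn}_p, \mathit{Sem}_p, R_p, I_{pc}, H_{1p}, H_{2p}\right)
\]
```

Everything to the left of the bar describes the node being expanded, and everything to the right describes its parent.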

Summary

- This concludes our overview of statistical parsing.
- We've looked at three important models.
- We also considered basic search techniques and algorithms.