Natural Language Processing Applications
Lecture 5: Chunking with NLTK
Claire Gardent
CNRS/LORIA
Campus Scientifique, BP 239, F-54506 Vandœuvre-lès-Nancy, France
2007/2008
Chunking vs. Parsing
What are chunks?
Representing chunks
Chunker Input and Output
Chunking in NLTK
Evaluating Chunkers
Summary and Reading
Syntax, Grammars and Syntactic Analysis (Parsing)
◮ Syntax captures structural relationships between words and phrases, i.e., it describes the constituent structure of NL expressions
◮ Grammars are used to describe the syntax of a language
◮ Syntactic analysers assign a syntactic structure to a string on the basis of a grammar
◮ A syntactic analyser is also called a parser
Syntactic tree example
Constituent tree for "John often gives a book to Mary" (rendered here in bracketed form):

(S (NP John)
   (VP (V (Adv often) (V gives))
       (NP (Det a) (n book))
       (PP (Prep to) (NP Mary))))
Why parse sentences in the first place?
◮ Parsing is usually an intermediate stage in a larger processing framework.
◮ It is useful e.g., for interpreting a string (assigning it a meaning representation) or for comparing strings (machine translation)
But Parsing has its problems ...
Coverage
◮ No complete grammar of any language
◮ Sapir: “All grammars leak”
Ambiguity
◮ As coverage increases, so does ambiguity.
◮ Problem of ranking parses by degree of ‘plausibility’
Problems with Full Parsing, 2
◮ Complexity of rule-based chart parsing is O(n³) in the length of the sentence, multiplied by a factor O(G²), where G is the size of the grammar.
◮ Practical results are often better, but still slow for parsing large (e.g., the web) corpora in reasonable time.
◮ Finite state machines have worst-case complexity O(n) in the length of the string.
Chunking vs Parsing
Chunking is a popular alternative to full parsing:
◮ more efficient: based on Finite State techniques (finite state machines have worst-case complexity O(n) in the length of the string)
◮ more robust (always gives a solution)
◮ often deterministic (gives only one solution)
◮ often sufficient when the application involves:
  ◮ extracting information
  ◮ ignoring information
What is Chunking?
Chunking is partial parsing. A chunker:
◮ assigns a partial syntactic structure to a sentence.
◮ yields flatter structures than full parsing (fixed tree depth, usually max. 2, vs arbitrarily deep trees)
◮ only deals with "chunks" (simplified constituents which usually only capture constituents "up to their head")
◮ Doesn't try to deal with all of language
◮ Doesn't attempt to resolve all semantically significant decisions
◮ Uses deterministic grammars for easy-to-parse pieces, and other methods for other pieces, depending on the task.
Chunks vs Constituents
1. Parsing:

[
  [ G.K. Chesterton ],
  [
    [ author ] of
    [
      [ The Man ] who was
      [ Thursday ]
    ]
  ]
]

2. Chunking:

[ G.K. Chesterton ], [ author ] of [ The Man ] who was [ Thursday ]
Extracting Information: Coreference Annotation
Extracting Information: Message Understanding
Ignoring Information: Lexical Acquisition
◮ studying syntactic patterns, e.g. finding verbs in a corpus, displaying possible arguments
◮ e.g. gave, in 100 files of the Penn Treebank corpus
◮ replaced internal details of each noun phrase with NP
gave NP
gave up NP in NP
gave NP up
gave NP help
gave NP to NP
◮ use in lexical acquisition, grammar development (a rough sketch of such an extraction is given below)
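A minimal sketch (not part of the original slides) of how such patterns could be collected from the CoNLL 2000 chunked corpus introduced later in this lecture: every NP chunk is replaced by the label NP, and a small window starting at the verb "gave" is printed. The corpus section name ('train') and the Python 2 print statement follow the NLTK version assumed in these slides.

import nltk

# Replace every NP chunk by the label "NP"; keep unchunked words as they are.
def simplify(chunked_sent):
    out = []
    for child in chunked_sent:
        if isinstance(child, nltk.Tree):   # an NP chunk
            out.append('NP')
        else:                              # an unchunked (word, tag) pair
            out.append(child[0])
    return out

# Print a short window starting at "gave" for every sentence containing it.
# The section name 'train' follows the NLTK version assumed in these slides.
for sent in nltk.corpus.conll2000.chunked_sents('train', chunk_types=('NP',)):
    words = simplify(sent)
    if 'gave' in words:
        i = words.index('gave')
        print ' '.join(words[i:i+4])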
Analogy with Tokenising and Tagging
◮ fundamental in NLP: segmentation and labelling
◮ tokenization and tagging
◮ other similarities: finite-state; application specific
What are chunks?
◮ Abney (1994):
[when I read] [a sentence], [I read it]
[a chunk] [at a time]
◮ Chunks are non-overlapping regions of text:
  [walk] [straight past] [the lake]
◮ (Usually) each chunk contains a head, with the possible addition of some preceding function words and modifiers
  [ walk ] [straight past ] [the lake ]
◮ Chunks are non-recursive:
  ◮ A chunk cannot contain another chunk of the same category
What are chunks?
◮ Chunks are non-exhaustive
  ◮ Some words in a sentence may not be grouped into a chunk
  [take] [the second road] that [is] on [the left hand side]
◮ NP postmodifiers (e.g., PPs, relative clauses) are often recursive and/or structurally ambiguous:
  ◮ they are not included in noun chunks.
◮ Chunks are typically subsequences of constituents (they don't cross constituent boundaries)
  ◮ noun groups — everything in NP up to and including the head noun
  ◮ verb groups — everything in VP (including auxiliaries) up to and including the head verb
Psycholinguistic Motivations
◮ Chunks as processing units — evidence that humans tend to read texts one chunk at a time
◮ Chunks are phonologically relevant
  ◮ prosodic phrase breaks
  ◮ rhythmic patterns
◮ Chunking might be a first step in full parsing
Representing Chunks: Tags vs Trees
Chunks can be represented by:
◮ IOB tags: each token is tagged with one of three special chunk tags, INSIDE (I), OUTSIDE (O), or BEGIN (B)
  ◮ A token is tagged as BEGIN if it is at the beginning of a chunk, and contained within that chunk
  ◮ Subsequent tokens within the chunk are tagged INSIDE
  ◮ All other tokens are tagged OUTSIDE.
◮ trees spanning the entire text
Tag Representation and Tree Representation

(figures omitted; a small example of both representations is given below)
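A minimal sketch (not from the slides) of both representations for the tagged text used later in this lecture. It assumes the helper functions nltk.chunk.tagstr2tree (also used on the evaluation slide below) and nltk.chunk.tree2conlltags are available in this NLTK version.

import nltk

# Tree representation: a two-level tree with NP chunks and unchunked tokens.
tree = nltk.chunk.tagstr2tree(
    "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
print tree
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))

# Tag (IOB) representation: one B-NP / I-NP / O tag per token.
# (tree2conlltags is assumed to be available in this NLTK version)
print nltk.chunk.tree2conlltags(tree)
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('cat', 'NN', 'I-NP'),
#  ('sat', 'VBD', 'O'), ('on', 'IN', 'O'),
#  ('the', 'DT', 'B-NP'), ('mat', 'NN', 'I-NP')]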
Chunker Input and Output
Input: Chunk parsers usually operate on tagged texts.
[ the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN ]
Output: the chunker combines chunks along with the intervening tokens into a chunk structure. A chunk structure is a two-level tree that spans the entire text and contains both chunks and un-chunked tokens.
(S: (NP: 'I')
  'saw'
  (NP: 'the' 'big' 'dog')
  'on'
  (NP: 'the' 'hill'))
Viewing NLTK Chunked corpora
The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, annotated with chunk tags in the IOB format.

>>> print nltk.corpus.conll2000.chunked_sents('train')[99]
(S
(PP Over/IN)
(NP a/DT cup/NN)
(PP of/IN)
(NP coffee/NN)
,/,
(NP Mr./NNP Stone/NNP)
(VP told/VBD)
(NP his/PRP$ story/NN)
./.)
Viewing NLTK Chunked corpora
We can also select which chunk types to read (only NPs)
>>> print nltk.corpus.conll2000.chunked_sents('train', chunk_types=('NP',))[99]
(S
Over/IN
(NP a/DT cup/NN)
of/IN
(NP coffee/NN)
,/,
(NP Mr./NNP Stone/NNP)
told/VBD
(NP his/PRP$ story/NN)
./.)
Chunk Parsing: Accuracy
Chunk parsing attempts to do less, but does it more accurately.
◮ Smaller solution space
◮ Less word-order flexibility within chunks than between chunks.
◮ Better locality:
  ◮ doesn't attempt to deal with unbounded dependencies
  ◮ less context-dependence
  ◮ doesn't attempt to resolve ambiguity — only does those things which can be done reliably
    [the boy] [saw] [the man] [with a telescope]
◮ less error propagation
Chunk Parsing: Domain Independence
Chunk parsing can be relatively domain independent, in that dependencies involving lexical or semantic information tend to occur at levels 'higher' than chunks:
◮ attachment of PPs and other modifiers
◮ argument selection
◮ constituent re-ordering
Chunk Parsing: Efficiency
Chunk parsing is more efficient:
◮ smaller solution space
◮ relevant context is small and local
◮ chunks are non-recursive
◮ can be implemented with a finite state automaton (FSA)
◮ can be applied to very large text sources
Chunking with Regular Expressions, 1
◮ Assume input is tagged.
◮ Identify chunks (e.g., noun groups) by sequences of tags:
announce  any  new  policy  measures  in  his  ...
VB        DT   JJ   NN      NNS       IN  PRP$
Chunking with Regular Expressions, 2
◮ Assume input is tagged.
◮ Identify chunks (e.g., noun groups) by sequences of tags:
announce any new policy measures in his . . .
VB DT JJ NN NNS IN PRP$
◮ Define rules in terms of tag patterns
◮ grammar = r"""
NP: {<DT><JJ><NN><NNS>}
"""
Extending the example
◮ Extending the example:
in his Mansion House speech
IN PRP$ NNP NNP NN
◮ DT or PRP$: '<DT|PRP$><JJ><NN><NNS>'
◮ JJ and NN are optional: '<DT|PRP$><JJ>*<NN>*<NNS>'
◮ we can have NNPs: '<DT|PRP$><JJ>*<NNP>*<NN>*<NNS>'
◮ NN or NNS: '<DT|PRP$><JJ>*<NNP>*<NN>*<NN|NNS>' (this final pattern is tried out in the sketch below)
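A minimal sketch (not part of the slides) that runs the final pattern over the two example fragments; the tagged tokens are written out by hand. Note that in an actual NLTK tag pattern the $ of PRP$ must be escaped, as in the chunk parser example a few slides further on.

import nltk

grammar = r"""
  NP: {<DT|PRP\$><JJ>*<NNP>*<NN>*<NN|NNS>}   # the pattern developed above
"""
cp = nltk.RegexpParser(grammar)
# Example fragments from the slide, tagged by hand.
tagged = [("announce", "VB"), ("any", "DT"), ("new", "JJ"),
          ("policy", "NN"), ("measures", "NNS"), ("in", "IN"),
          ("his", "PRP$"), ("Mansion", "NNP"), ("House", "NNP"),
          ("speech", "NN")]
print cp.parse(tagged)
# (S
#   announce/VB
#   (NP any/DT new/JJ policy/NN measures/NNS)
#   in/IN
#   (NP his/PRP$ Mansion/NNP House/NNP speech/NN))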
Tag Strings and Tag Patterns
◮ A tag string is a string consisting of tags delimited with angle-brackets, e.g., '<DT><JJ><NN><VBD><DT><NN>'
◮ NLTK tag patterns are a special kind of regular expression over tag strings:
  ◮ the angle brackets group their contents into atomic units:
    '<NN>+' matches one or more repetitions of the tag string '<NN>'
    '<NN|JJ>' matches the tag strings '<NN>' or '<JJ>'
  ◮ the wildcard '.' is constrained not to cross tag boundaries, e.g.
    '<NN.*>' matches any single tag starting with NN
  ◮ whitespace is ignored in tag patterns, e.g.
    '<NN | JJ>' is equivalent to '<NN|JJ>'
Chunk Parsing in NLTK
◮ regular expressions over part-of-speech tags
◮ Tag string: a string consisting of tags delimited with angle-brackets, e.g. <DT><JJ><NN><VBD><DT><NN>
◮ Tag pattern: a regular expression over tag strings
  ◮ <DT><JJ>?<NN>
  ◮ <NN|JJ>+
  ◮ <NN.*>
◮ chunk a sequence of words matching a tag pattern
grammar = r"""
NP: {<DT><JJ><NN><NNS>}
"""
A Simple Chunk Parser
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and nouns
{<NNP>+} # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"),
("down", "RP"), ("her", "PP$"), ("long", "JJ"),
("golden", "JJ"), ("hair", "NN")]
>>> print cp.parse(tagged_tokens)
(S
(NP Rapunzel/NNP)
let/VBD
down/RP
(NP her/PP$ long/JJ golden/JJ hair/NN))
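A small follow-up sketch (not from the slides): once the sentence has been parsed, the NP chunks can be picked out of the resulting tree. Tree.node is assumed to be the label attribute in the NLTK version used here (later NLTK releases call it label()).

result = cp.parse(tagged_tokens)
for subtree in result.subtrees():
    if subtree.node == 'NP':   # .node holds the chunk label in this NLTK version
        print subtree
# (NP Rapunzel/NNP)
# (NP her/PP$ long/JJ golden/JJ hair/NN)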
Rule ordering and defaults
◮ If a tag pattern matches at multiple overlapping locations, the first match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked (illustrated in the sketch below).
◮ When a chunk rule is applied to a chunking hypothesis, it will only create chunks that do not partially overlap with chunks in the hypothesis.
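A minimal sketch (not from the slides) of the first point: a rule matching two consecutive nouns, applied to three consecutive nouns, chunks only the first two.

import nltk

cp = nltk.RegexpParser("NP: {<NN><NN>}")   # chunk exactly two consecutive nouns
print cp.parse([("city", "NN"), ("centre", "NN"), ("shop", "NN")])
# (S (NP city/NN centre/NN) shop/NN)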
Tracing
◮ The trace argument specifies whether debugging output should be shown during parsing.
◮ The debugging output shows the rules that are applied, and shows the chunking hypothesis at each stage of processing.
Developing Chunk Parsers
tagged_tokens = [("The", "DT"), ("enchantress", "NN"),
("clutched", "VBD"), ("the", "DT"), ("beautiful", "JJ"),
("hair", "NN")]
cp1 = nltk.RegexpParser(r"""
NP: {<DT><JJ><NN>} # Chunk det+adj+noun
{<DT|NN>+} # Chunk sequences of NN and DT
""")
cp2 = nltk.RegexpParser(r"""
NP: {<DT|NN>+} # Chunk sequences of NN and DT
{<DT><JJ><NN>} # Chunk det+adj+noun
""")
>>> print cp1.parse(tagged_tokens, trace=1)
# Input:
<DT> <NN> <VBD> <DT> <JJ> <NN>
# Chunk det+adj+noun:
<DT> <NN> <VBD> {<DT> <JJ> <NN>}
# Chunk sequences of NN and DT:
{<DT> <NN>} <VBD> {<DT> <JJ> <NN>}
(S
  (NP The/DT enchantress/NN)
  clutched/VBD
  (NP the/DT beautiful/JJ hair/NN))
More Chunking Rules: Chinking
◮ chink: a sequence of tokens that is not included in a chunk (typically stopwords/function words)
◮ chinking: process of removing a sequence of tokens from achunk
◮ A chink rule chinks anything that matches a given tag pattern.
More Chunking Rules: Chinking
Entire chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink a/DT big/JJ cat/NN
Output: a/DT big/JJ cat/NN
Middle of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink big/JJ
Output: [a/DT] big/JJ [cat/NN]
End of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink cat/NN
Output: [a/DT big/JJ] cat/NN
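A minimal sketch (not from the slides) of the "middle of a chunk" case above: the whole sequence is first chunked, then the adjective is chinked out, splitting the chunk in two.

import nltk

grammar = r"""
  NP: {<DT><JJ><NN>}   # first chunk the whole sequence
      }<JJ>{           # then chink the adjective out of the chunk
"""
cp = nltk.RegexpParser(grammar)
print cp.parse([("a", "DT"), ("big", "JJ"), ("cat", "NN")])
# (S (NP a/DT) big/JJ (NP cat/NN))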
Chinking Example

The following grammar puts the entire sentence into a single chunk, then excises the chink:
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
tagged_tokens = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"),
("cat", "NN")]
cp = nltk.RegexpParser(grammar)
>>> print cp.parse(tagged_tokens)
(S
(NP the/DT little/JJ yellow/JJ dog/NN)
barked/VBD
at/IN
(NP the/DT cat/NN))
Evaluating Chunk Parsers
◮ Process:
  1. take some already chunked text
  2. strip off the chunks
  3. rechunk it
  4. compare the result with the original chunked text
◮ Metrics (made precise below):
  ◮ precision: what fraction of the returned chunks were correct?
  ◮ recall: what fraction of the correct chunks were returned?
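In terms of the set G of chunks returned by the chunker and the set C of chunks in the original annotation (notation not from the slides), the scores reported by NLTK's ChunkScore are:

precision P = |G ∩ C| / |G|
recall    R = |G ∩ C| / |C|
F-measure F = 2PR / (P + R)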
Evaluating Chunk Parsers in NLTK

First, flatten a chunk structure into a tree consisting only of a root node and leaves:
>>> correct = nltk.chunk.tagstr2tree(
...     "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
>>> correct.flatten()
(S: ('the', 'DT') ('little', 'JJ') ('cat', 'NN') ('sat', 'VBD')
  ('on', 'IN') ('the', 'DT') ('mat', 'NN'))
>>> grammar = r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> tagged_tokens = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
... ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> chunkscore = nltk.chunk.ChunkScore()
>>> guess = cp.parse(correct.flatten())
>>> chunkscore.score(correct, guess)
>>> print chunkscore
ChunkParse score:
Precision: 100.0%
Recall: 100.0%
F-Measure: 100.0%
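A follow-up sketch (not from the slides): the same grammar can be scored over many sentences of the CoNLL 2000 corpus by accumulating results in a single ChunkScore object. The corpus section name ('train') follows the NLTK version assumed in these slides.

import nltk

cp = nltk.RegexpParser(r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}")
chunkscore = nltk.chunk.ChunkScore()
# Score the grammar against the first 100 gold-standard sentences.
# The section name 'train' follows the NLTK version assumed in these slides.
for correct in nltk.corpus.conll2000.chunked_sents('train', chunk_types=('NP',))[:100]:
    guess = cp.parse(correct.flatten())
    chunkscore.score(correct, guess)
print chunkscore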
Cascaded Chunking
◮ chunks so far are flat
◮ it is possible to build chunks of arbitrary depth by connecting the output of one chunker to the input of another.
Cascaded chunking
grammar = r"""
NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>} # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|S>+$} # Chunk rightmost verbs and arguments/adjuncts
S: {<NP><VP>} # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> print cp.parse(tagged_tokens)
(S
(NP Mary/NN)
saw/VBD
(S
(NP the/DT cat/NN)
(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
Repeated Cascaded Chunking
In the previous output the top-level S is incomplete: the VP rule for "saw" could only apply once the embedded S had been built. Repeat the process by adding an optional second argument, loop, to specify the number of times the set of patterns should be run (here applied to the tagged sentence "John thinks Mary saw the cat sit on the mat"):
>>> cp = nltk.RegexpParser(grammar, loop=2)
>>> print cp.parse(tagged_tokens)
(S
(NP John/NNP)
thinks/VBZ
(S
(NP Mary/NN)
(VP
saw/VBD
(S
(NP the/DT cat/NN)
(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))
Summary
◮ Chunking is less ambitious than full parsing, but more efficient.
◮ Maybe sufficient for many practical tasks:
  ◮ Information Extraction
  ◮ Question Answering
  ◮ Extracting subcategorization frames
  ◮ Providing features for machine learning, e.g., for building Named Entity recognizers.
Reading
◮ Jurafsky and Martin, Section 10.5
◮ NLTK Book chapter on Chunking
◮ Steven Abney. Parsing By Chunks. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht. 1991.
◮ Steven Abney. Partial Parsing via Finite-State Cascades. J. of Natural Language Engineering, 2(4): 337-344. 1996.
◮ Abney's publications: http://www.vinartus.net/spa/publications.html