Natural Language Processing Applications
Lecture 5: Chunking with NLTK
Claire Gardent
CNRS/LORIA
Campus Scientifique, BP 239, F-54506 Vandœuvre-lès-Nancy, France
2007/2008
Chunking vs. Parsing
What are chunks?
Representing chunks
Chunker Input and Output
Chunking in NLTK
Evaluating Chunkers
Summary and Reading
Syntax, Grammars and Syntactic Analysis (Parsing)
◮ Syntax captures structural relationships between words and phrases, i.e., it describes the constituent structure of NL expressions
◮ Grammars are used to describe the syntax of a language
◮ Syntactic analysers assign a syntactic structure to a string on the basis of a grammar
◮ A syntactic analyser is also called a parser
Syntactic tree example
Constituent tree for "John often gives a book to Mary" (rendered here in bracketed form):

(S (NP John)
   (VP (V (Adv often) (V gives))
       (NP (Det a) (n book))
       (PP (Prep to) (NP Mary))))
Why parse sentences in the first place?
◮ Parsing is usually an intermediate stage in a larger processing framework.
◮ It is useful e.g., for interpreting a string (assigning it a meaning representation) or for comparing strings (machine translation)
But Parsing has its problems ...
Coverage
◮ No complete grammar of any language
◮ Sapir: “All grammars leak”
Ambiguity
◮ As coverage increases, so does ambiguity.
◮ Problem of ranking parses by degree of ‘plausibility’
Problems with Full Parsing, 2
◮ Complexity of rule-based chart parsing is O(n³) in the length of the sentence, multiplied by a factor O(G²), where G is the size of the grammar.
◮ Practical results are often better, but still slow for parsing large (e.g., the web) corpora in reasonable time.
◮ Finite state machines have worst-case complexity O(n) in the length of the string.
Chunking vs Parsing
Chunking is a popular alternative to full parsing:
◮ more efficient: based on Finite State techniques (finite state machines have worst-case complexity O(n) in the length of the string)
◮ more robust (always gives a solution)
◮ often deterministic (gives only one solution)
◮ often sufficient when the application involves:
  ◮ extracting information
  ◮ ignoring information
What is Chunking?
Chunking is partial parsing. A chunker:
◮ assigns a partial syntactic structure to a sentence.
◮ yields flatter structures than full parsing (fixed tree depth, usually max. 2, vs arbitrarily deep trees)
◮ only deals with "chunks" (simplified constituents which usually only capture constituents "up to their head")
◮ Doesn't try to deal with all of language
◮ Doesn't attempt to resolve all semantically significant decisions
◮ Uses deterministic grammars for easy-to-parse pieces, and other methods for other pieces, depending on the task.
Chunks vs Constituents
1. Parsing:

[
  [ G.K. Chesterton ],
  [
    [ author ] of
    [
      [ The Man ] who was
      [ Thursday ]
    ]
  ]
]

2. Chunking:

[ G.K. Chesterton ], [ author ] of [ The Man ] who was [ Thursday ]
Extracting Information: Coreference Annotation
Extracting Information: Message Understanding
Ignoring Information: Lexical Acquisition
◮ studying syntactic patterns, e.g. finding verbs in a corpus, displaying possible arguments
◮ e.g. gave, in 100 files of the Penn Treebank corpus
◮ replaced internal details of each noun phrase with NP
gave NP
gave up NP in NP
gave NP up
gave NP help
gave NP to NP
◮ use in lexical acquisition, grammar development (a rough sketch of such an extraction is given below)
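A minimal sketch (not part of the original slides) of how such patterns could be collected from the CoNLL 2000 chunked corpus introduced later in this lecture: every NP chunk is replaced by the label NP, and a small window starting at the verb "gave" is printed. The corpus section name ('train') and the Python 2 print statement follow the NLTK version assumed in these slides.

import nltk

# Replace every NP chunk by the label "NP"; keep unchunked words as they are.
def simplify(chunked_sent):
    out = []
    for child in chunked_sent:
        if isinstance(child, nltk.Tree):   # an NP chunk
            out.append('NP')
        else:                              # an unchunked (word, tag) pair
            out.append(child[0])
    return out

# Print a short window starting at "gave" for every sentence containing it.
# The section name 'train' follows the NLTK version assumed in these slides.
for sent in nltk.corpus.conll2000.chunked_sents('train', chunk_types=('NP',)):
    words = simplify(sent)
    if 'gave' in words:
        i = words.index('gave')
        print ' '.join(words[i:i+4])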
Analogy with Tokenising and Tagging
◮ fundamental in NLP: segmentation and labelling
◮ tokenization and tagging
◮ other similarities: finite-state; application specific
What are chunks?
◮ Abney (1994):
[when I read] [a sentence], [I read it]
[a chunk] [at a time]
◮ Chunks are non-overlapping regions of text:
  [walk] [straight past] [the lake]
◮ (Usually) each chunk contains a head, with the possible addition of some preceding function words and modifiers
  [ walk ] [straight past ] [the lake ]
◮ Chunks are non-recursive:
  ◮ A chunk cannot contain another chunk of the same category
What are chunks?
◮ Chunks are non-exhaustive
  ◮ Some words in a sentence may not be grouped into a chunk
  [take] [the second road] that [is] on [the left hand side]
◮ NP postmodifiers (e.g., PPs, relative clauses) are often recursive and/or structurally ambiguous:
  ◮ they are not included in noun chunks.
◮ Chunks are typically subsequences of constituents (they don't cross constituent boundaries)
  ◮ noun groups — everything in NP up to and including the head noun
  ◮ verb groups — everything in VP (including auxiliaries) up to and including the head verb
Psycholinguistic Motivations
◮ Chunks as processing units — evidence that humans tend to read texts one chunk at a time
◮ Chunks are phonologically relevant
  ◮ prosodic phrase breaks
  ◮ rhythmic patterns
◮ Chunking might be a first step in full parsing
Representing Chunks: Tags vs Trees
Chunks can be represented by:
◮ IOB tags: each token is tagged with one of three special chunk tags, INSIDE (I), OUTSIDE (O), or BEGIN (B)
  ◮ A token is tagged as BEGIN if it is at the beginning of a chunk, and contained within that chunk
  ◮ Subsequent tokens within the chunk are tagged INSIDE
  ◮ All other tokens are tagged OUTSIDE.
◮ trees spanning the entire text
Tag Representation and Tree Representation

(figures omitted; a small example of both representations is given below)
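A minimal sketch (not from the slides) of both representations for the tagged text used later in this lecture. It assumes the helper functions nltk.chunk.tagstr2tree (also used on the evaluation slide below) and nltk.chunk.tree2conlltags are available in this NLTK version.

import nltk

# Tree representation: a two-level tree with NP chunks and unchunked tokens.
tree = nltk.chunk.tagstr2tree(
    "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
print tree
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))

# Tag (IOB) representation: one B-NP / I-NP / O tag per token.
# (tree2conlltags is assumed to be available in this NLTK version)
print nltk.chunk.tree2conlltags(tree)
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('cat', 'NN', 'I-NP'),
#  ('sat', 'VBD', 'O'), ('on', 'IN', 'O'),
#  ('the', 'DT', 'B-NP'), ('mat', 'NN', 'I-NP')]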
Chunker Input and Output
Input: Chunk parsers usually operate on tagged texts.
[ the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN ]
Output: the chunker combines chunks along with the intervening tokens into a chunk structure. A chunk structure is a two-level tree that spans the entire text and contains both chunks and un-chunked tokens.
(S: (NP: 'I')
  'saw'
  (NP: 'the' 'big' 'dog')
  'on'
  (NP: 'the' 'hill'))
Viewing NLTK Chunked corpora
The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, annotated with chunk tags in the IOB format.

>>> print nltk.corpus.conll2000.chunked_sents('train')[99]
(S
(PP Over/IN)
(NP a/DT cup/NN)
(PP of/IN)
(NP coffee/NN)
,/,
(NP Mr./NNP Stone/NNP)
(VP told/VBD)
(NP his/PRP$ story/NN)
./.)
Viewing NLTK Chunked corpora
We can also select which chunk types to read (only NPs)
>>> print nltk.corpus.conll2000.chunked_sents('train', chunk_types=('NP',))[99]
(S
Over/IN
(NP a/DT cup/NN)
of/IN
(NP coffee/NN)
,/,
(NP Mr./NNP Stone/NNP)
told/VBD
(NP his/PRP$ story/NN)
./.)
Chunk Parsing: Accuracy
Chunk parsing attempts to do less, but does it more accurately.
◮ Smaller solution space
◮ Less word-order flexibility within chunks than between chunks.
◮ Better locality:
  ◮ doesn't attempt to deal with unbounded dependencies
  ◮ less context-dependence
  ◮ doesn't attempt to resolve ambiguity — only does those things which can be done reliably
    [the boy] [saw] [the man] [with a telescope]
◮ less error propagation
Chunk Parsing: Domain Independence
Chunk parsing can be relatively domain independent, in that dependencies involving lexical or semantic information tend to occur at levels 'higher' than chunks:
◮ attachment of PPs and other modifiers
◮ argument selection
◮ constituent re-ordering
Chunk Parsing: Efficiency
Chunk parsing is more efficient:
◮ smaller solution space
◮ relevant context is small and local
◮ chunks are non-recursive
◮ can be implemented with a finite state automaton (FSA)
◮ can be applied to very large text sources
Chunking with Regular Expressions, 1
◮ Assume input is tagged.
◮ Identify chunks (e.g., noun groups) by sequences of tags:
announce  any  new  policy  measures  in  his  ...
VB        DT   JJ   NN      NNS       IN  PRP$
Chunking with Regular Expressions, 2
◮ Assume input is tagged.
◮ Identify chunks (e.g., noun groups) by sequences of tags:
announce any new policy measures in his . . .
VB DT JJ NN NNS IN PRP$
◮ Define rules in terms of tag patterns
◮ grammar = r"""
NP: {<DT><JJ><NN><NNS>}
"""
Extending the example
◮ Extending the example:
in his Mansion House speech
IN PRP$ NNP NNP NN
◮ DT or PRP$: '<DT|PRP$><JJ><NN><NNS>'
◮ JJ and NN are optional: '<DT|PRP$><JJ>*<NN>*<NNS>'
◮ we can have NNPs: '<DT|PRP$><JJ>*<NNP>*<NN>*<NNS>'
◮ NN or NNS: '<DT|PRP$><JJ>*<NNP>*<NN>*<NN|NNS>' (this final pattern is tried out in the sketch below)
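A minimal sketch (not part of the slides) that runs the final pattern over the two example fragments; the tagged tokens are written out by hand. Note that in an actual NLTK tag pattern the $ of PRP$ must be escaped, as in the chunk parser example a few slides further on.

import nltk

grammar = r"""
  NP: {<DT|PRP\$><JJ>*<NNP>*<NN>*<NN|NNS>}   # the pattern developed above
"""
cp = nltk.RegexpParser(grammar)
# Example fragments from the slide, tagged by hand.
tagged = [("announce", "VB"), ("any", "DT"), ("new", "JJ"),
          ("policy", "NN"), ("measures", "NNS"), ("in", "IN"),
          ("his", "PRP$"), ("Mansion", "NNP"), ("House", "NNP"),
          ("speech", "NN")]
print cp.parse(tagged)
# (S
#   announce/VB
#   (NP any/DT new/JJ policy/NN measures/NNS)
#   in/IN
#   (NP his/PRP$ Mansion/NNP House/NNP speech/NN))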
Tag Strings and Tag Patterns
◮ A tag string is a string consisting of tags delimited with angle-brackets, e.g., '<DT><JJ><NN><VBD><DT><NN>'
◮ NLTK tag patterns are a special kind of regular expression over tag strings:
  ◮ the angle brackets group their contents into atomic units:
    '<NN>+' matches one or more repetitions of the tag string '<NN>'
    '<NN|JJ>' matches the tag strings '<NN>' or '<JJ>'
  ◮ the wildcard '.' is constrained not to cross tag boundaries, e.g.
    '<NN.*>' matches any single tag starting with NN
  ◮ whitespace is ignored in tag patterns, e.g.
    '<NN | JJ>' is equivalent to '<NN|JJ>'
Chunk Parsing in NLTK
◮ regular expressions over part-of-speech tags
◮ Tag string: a string consisting of tags delimited with angle-brackets, e.g. <DT><JJ><NN><VBD><DT><NN>
◮ Tag pattern: a regular expression over tag strings
  ◮ <DT><JJ>?<NN>
  ◮ <NN|JJ>+
  ◮ <NN.*>
◮ chunk a sequence of words matching a tag pattern
grammar = r"""
NP: {<DT><JJ><NN><NNS>}
"""
A Simple Chunk Parser
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and nouns
{<NNP>+} # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"),
("down", "RP"), ("her", "PP$"), ("long", "JJ"),
("golden", "JJ"), ("hair", "NN")]
>>> print cp.parse(tagged_tokens)
(S
(NP Rapunzel/NNP)
let/VBD
down/RP
(NP her/PP$ long/JJ golden/JJ hair/NN))
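A small follow-up sketch (not from the slides): once the sentence has been parsed, the NP chunks can be picked out of the resulting tree. Tree.node is assumed to be the label attribute in the NLTK version used here (later NLTK releases call it label()).

result = cp.parse(tagged_tokens)
for subtree in result.subtrees():
    if subtree.node == 'NP':   # .node holds the chunk label in this NLTK version
        print subtree
# (NP Rapunzel/NNP)
# (NP her/PP$ long/JJ golden/JJ hair/NN)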
Rule ordering and defaults
◮ If a tag pattern matches at multiple overlapping locations, the first match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked (illustrated in the sketch below).
◮ When a chunk rule is applied to a chunking hypothesis, it will only create chunks that do not partially overlap with chunks in the hypothesis.
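A minimal sketch (not from the slides) of the first point: a rule matching two consecutive nouns, applied to three consecutive nouns, chunks only the first two.

import nltk

cp = nltk.RegexpParser("NP: {<NN><NN>}")   # chunk exactly two consecutive nouns
print cp.parse([("city", "NN"), ("centre", "NN"), ("shop", "NN")])
# (S (NP city/NN centre/NN) shop/NN)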
Tracing
◮ The trace argument specifies whether debugging output should be shown during parsing.
◮ The debugging output shows the rules that are applied, and shows the chunking hypothesis at each stage of processing.
Developing Chunk Parsers
tagged_tokens = [("The", "DT"), ("enchantress", "NN"),
("clutched", "VBD"), ("the", "DT"), ("beautiful", "JJ"),
("hair", "NN")]
cp1 = nltk.RegexpParser(r"""
NP: {<DT><JJ><NN>} # Chunk det+adj+noun
{<DT|NN>+} # Chunk sequences of NN and DT
""")
cp2 = nltk.RegexpParser(r"""
NP: {<DT|NN>+} # Chunk sequences of NN and DT
{<DT><JJ><NN>} # Chunk det+adj+noun
""")
>>> print cp1.parse(tagged_tokens, trace=1)
# Input:
<DT> <NN> <VBD> <DT> <JJ> <NN>
# Chunk det+adj+noun:
<DT> <NN> <VBD> {<DT> <JJ> <NN>}
# Chunk sequences of NN and DT:
{<DT> <NN>} <VBD> {<DT> <JJ> <NN>}
(S
  (NP The/DT enchantress/NN)
  clutched/VBD
  (NP the/DT beautiful/JJ hair/NN))
More Chunking Rules: Chinking
◮ chink: a sequence of tokens that is not included in a chunk (typically stopwords/function words)
◮ chinking: process of removing a sequence of tokens from achunk
◮ A chink rule chinks anything that matches a given tag pattern.
More Chunking Rules: Chinking
Entire chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink a/DT big/JJ cat/NN
Output: a/DT big/JJ cat/NN
Middle of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink big/JJ
Output: [a/DT] big/JJ [cat/NN]
End of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink cat/NN
Output: [a/DT big/JJ] cat/NN
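A minimal sketch (not from the slides) of the "middle of a chunk" case above: the whole sequence is first chunked, then the adjective is chinked out, splitting the chunk in two.

import nltk

grammar = r"""
  NP: {<DT><JJ><NN>}   # first chunk the whole sequence
      }<JJ>{           # then chink the adjective out of the chunk
"""
cp = nltk.RegexpParser(grammar)
print cp.parse([("a", "DT"), ("big", "JJ"), ("cat", "NN")])
# (S (NP a/DT) big/JJ (NP cat/NN))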
Chinking Example

The following grammar puts the entire sentence into a single chunk, then excises the chink:
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
tagged_tokens = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"),
("cat", "NN")]
cp = nltk.RegexpParser(grammar)
>>> print cp.parse(tagged_tokens)
(S
(NP the/DT little/JJ yellow/JJ dog/NN)
barked/VBD
at/IN
(NP the/DT cat/NN))
Evaluating Chunk Parsers
◮ Process:
  1. take some already chunked text
  2. strip off the chunks
  3. rechunk it
  4. compare the result with the original chunked text
◮ Metrics (made precise below):
  ◮ precision: what fraction of the returned chunks were correct?
  ◮ recall: what fraction of the correct chunks were returned?
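In terms of the set G of chunks returned by the chunker and the set C of chunks in the original annotation (notation not from the slides), the scores reported by NLTK's ChunkScore are:

precision P = |G ∩ C| / |G|
recall    R = |G ∩ C| / |C|
F-measure F = 2PR / (P + R)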
Evaluating Chunk Parsers in NLTK

First, flatten a chunk structure into a tree consisting only of a root node and leaves:
>>> correct = nltk.chunk.tagstr2tree(
...     "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
>>> correct.flatten()
(S: ('the', 'DT') ('little', 'JJ') ('cat', 'NN') ('sat', 'VBD')
  ('on', 'IN') ('the', 'DT') ('mat', 'NN'))
>>> grammar = r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> tagged_tokens = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
... ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> chunkscore = nltk.chunk.ChunkScore()
>>> guess = cp.parse(correct.flatten())
>>> chunkscore.score(correct, guess)
>>> print chunkscore
ChunkParse score:
Precision: 100.0%
Recall: 100.0%
F-Measure: 100.0%
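A follow-up sketch (not from the slides): the same grammar can be scored over many sentences of the CoNLL 2000 corpus by accumulating results in a single ChunkScore object. The corpus section name ('train') follows the NLTK version assumed in these slides.

import nltk

cp = nltk.RegexpParser(r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}")
chunkscore = nltk.chunk.ChunkScore()
# Score the grammar against the first 100 gold-standard sentences.
# The section name 'train' follows the NLTK version assumed in these slides.
for correct in nltk.corpus.conll2000.chunked_sents('train', chunk_types=('NP',))[:100]:
    guess = cp.parse(correct.flatten())
    chunkscore.score(correct, guess)
print chunkscore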
Cascaded Chunking
◮ chunks so far are flat
◮ it is possible to build chunks of arbitrary depth by connecting the output of one chunker to the input of another.
Cascaded chunking
grammar = r"""
NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>} # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|S>+$} # Chunk rightmost verbs and arguments/adjuncts
S: {<NP><VP>} # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> print cp.parse(tagged_tokens)
(S
(NP Mary/NN)
saw/VBD
(S
(NP the/DT cat/NN)
(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
Repeated Cascaded Chunking
In the previous output the top-level S is incomplete: the VP rule for "saw" could only apply once the embedded S had been built. Repeat the process by adding an optional second argument, loop, to specify the number of times the set of patterns should be run (here applied to the tagged sentence "John thinks Mary saw the cat sit on the mat"):
>>> cp = nltk.RegexpParser(grammar, loop=2)
>>> print cp.parse(tagged_tokens)
(S
(NP John/NNP)
thinks/VBZ
(S
(NP Mary/NN)
(VP
saw/VBD
(S
(NP the/DT cat/NN)
(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))
Summary
◮ Chunking is less ambitious than full parsing, but more efficient.
◮ Maybe sufficient for many practical tasks:
  ◮ Information Extraction
  ◮ Question Answering
  ◮ Extracting subcategorization frames
  ◮ Providing features for machine learning, e.g., for building Named Entity recognizers.
Reading
◮ Jurafsky and Martin, Section 10.5
◮ NLTK Book chapter on Chunking
◮ Steven Abney. Parsing By Chunks. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht. 1991.
◮ Steven Abney. Partial Parsing via Finite-State Cascades. J. of Natural Language Engineering, 2(4): 337-344. 1996.
◮ Abney's publications: http://www.vinartus.net/spa/publications.html