Today
Handout: basic English grammar
Determine time for a one-time lab
Begin chunking/shallow parsing
Slide modified from Steven Bird's
Shallow (Chunk) Parsing
Goal: divide a sentence into a sequence of chunks.
Chunks are non-overlapping regions of a text
[I] saw [a tall man] in [the park].
Chunks are non-recursive
– A chunk cannot contain other chunks
Chunks are non-exhaustive
– Not all words are included in chunks
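The three properties above can be checked mechanically. The sketch below assumes chunks are stored as (start, end) token offsets with an exclusive end; the function name check_chunking is hypothetical and not part of NLTK.

```python
def check_chunking(spans, n_tokens):
    """Validate a flat chunking and return the indices of unchunked tokens.

    spans: list of (start, end) token offsets, end exclusive.
    Raises ValueError if a span is empty, out of range, or overlaps another.
    """
    covered = set()
    previous_end = 0
    for start, end in sorted(spans):
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"bad span {(start, end)}")
        if start < previous_end:
            # Rejecting overlap also rejects nesting: a nested chunk would
            # start before its enclosing chunk ends.
            raise ValueError(f"span {(start, end)} overlaps an earlier chunk")
        covered.update(range(start, end))
        previous_end = end
    # Non-exhaustive: some tokens may legitimately be left outside any chunk.
    return [i for i in range(n_tokens) if i not in covered]


tokens = "I saw a tall man in the park .".split()
np_chunks = [(0, 1), (2, 5), (6, 8)]  # [I] [a tall man] [the park]
unchunked = check_chunking(np_chunks, len(tokens))
print([tokens[i] for i in unchunked])
```

Here the words left outside the chunks are "saw", "in", and the final period.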
Chunk Parsing Examples
Noun-phrase chunking:
[I] saw [a tall man] in [the park].
Verb-phrase chunking:
The man who [was in the park] [saw me].
Prosodic chunking:
[I saw] [a tall man] [in the park].
Question answering:
What [Spanish explorer] discovered [the Mississippi River]?
Shallow Parsing: Motivation
Locating information
– e.g., text retrieval: index a document collection on its noun phrases
Ignoring information
– Generalize in order to study higher-level patterns
– e.g., phrases involving “gave” in the Penn Treebank: gave NP; gave up NP in NP; gave NP up; gave NP help; gave NP to NP
Sometimes a full parse has too much structure
– Too nested
– Chunks usually are not recursive
Comparison with Full Syntactic Parsing
Parsing is usually an intermediate stage
– Builds structures that are used by later stages of processing
Full parsing is a sufficient but not necessary intermediate stage for many NLP tasks
– Parsing often provides more information than we need
Shallow parsing is an easier problem
– Less word-order flexibility within chunks than between chunks
– More locality: fewer long-range dependencies, less context-dependence, less ambiguity
Chunks and Constituency
Constituents: [[a tall man] [in [the park]]].
Chunks: [a tall man] in [the park].
A constituent is part of some higher unit in the hierarchical syntactic parse
Chunks are not constituents
– Constituents are recursive
But chunks are typically subsequences of constituents
– Chunks do not cross major constituent boundaries
Chunk Parsing in NLTK
Chunk parsers usually ignore lexical content
– Only need to look at part-of-speech tags
Possible steps in chunk parsing
– Chunking, unchunking
– Chinking
– Merging, splitting
Evaluation
– Compare to a baseline
– Evaluate in terms of precision, recall, and F-measure
– Count missed chunks (false negatives) and incorrect chunks (false positives)
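The evaluation measures listed above can be computed from two sets of chunk spans. This is a minimal sketch under two assumptions that the slide does not state: a guessed chunk counts as correct only if its span matches a gold chunk exactly, and the F-measure is the balanced harmonic mean of precision and recall. The function name chunk_scores is hypothetical.

```python
def chunk_scores(guessed, correct):
    """Compare guessed chunk spans with gold-standard spans.

    Both arguments are iterables of (start, end) token offsets.
    """
    guessed, correct = set(guessed), set(correct)
    hits = guessed & correct
    precision = len(hits) / len(guessed) if guessed else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "missed": sorted(correct - guessed),     # false negatives
        "incorrect": sorted(guessed - correct),  # false positives
    }


correct = [(0, 3), (5, 7)]          # gold chunks
guessed = [(0, 3), (3, 4), (5, 7)]  # chunker output with one spurious chunk
scores = chunk_scores(guessed, correct)
print(scores)
```

In this example both gold chunks are found, so recall is 1.0, while the spurious chunk (3, 4) lowers precision to 2/3.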
Chunking
Define a regular expression that matches the sequences of tags in a chunk
A simple noun phrase chunk regexp:
<DT>? <JJ>* <NN.?>
(Note that <NN.?> matches NN optionally followed by one more character, e.g. NN, NNS, or NNP; <NN.*> would match any tag starting with NN)
Chunk all matching subsequences:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
If matching subsequences overlap, the first one gets priority
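The chunk rule above can be imitated with Python's re module alone. This is a sketch of the idea, not NLTK's implementation: the helpers to_regex, chunk, and render are hypothetical names, and the tag sequence is assumed to be encoded as a string such as <DT><JJ><NN>.

```python
import re


def to_regex(tag_pattern):
    """Turn a tag pattern like '<DT>? <JJ>* <NN.?>' into regex source."""
    src = re.sub(r"\s+", "", tag_pattern)
    # Group each <...> so that ?, * and + apply to a whole tag.
    src = src.replace("<", "(?:<(?:").replace(">", ")>)")
    # Keep '.' from matching across a tag boundary.
    return src.replace(".", "[^<>]")


def chunk(tagged, tag_pattern):
    """Return (start, end) token spans whose tags match tag_pattern."""
    regex = re.compile(to_regex(tag_pattern))
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Map each tag's character offset in tag_string back to a token index.
    index_at = {}
    offset = 0
    for i, (_, tag) in enumerate(tagged):
        index_at[offset] = i
        offset += len(tag) + 2
    index_at[offset] = len(tagged)
    # finditer scans left to right and never returns overlapping matches,
    # so the earliest match wins when candidates overlap.
    return [(index_at[m.start()], index_at[m.end()])
            for m in regex.finditer(tag_string) if m.end() > m.start()]


def render(tagged, spans):
    """Show the chunked sentence in the bracket notation used on the slides."""
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    pieces = []
    for i, (word, tag) in enumerate(tagged):
        text = f"{word}/{tag}"
        if i in starts:
            text = "[" + text
        if i + 1 in ends:
            text += "]"
        pieces.append(text)
    return " ".join(pieces)


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
spans = chunk(sentence, "<DT>? <JJ>* <NN.?>")
print(render(sentence, spans))
```

Only the tags are consulted when matching; the words are carried along for display.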
Unchunking
Remove any chunk with a given pattern
e.g., UnChunkRule('<NN|DT>+', 'Unchunk NNDT')
Combine with the chunk rule <NN|DT|JJ>+
Chunk all matching subsequences, then unchunk those made up only of NN and DT tags:
Input:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
Apply the chunk rule:
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
Apply the unchunk rule (the first chunk survives because it contains a JJ):
[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
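The unchunk step can be sketched as a single regular-expression substitution. The encoding is an assumption made for illustration: tags are written as <TAG> and chunk boundaries as braces. The helpers to_regex and unchunk are hypothetical names, not NLTK calls.

```python
import re


def to_regex(tag_pattern):
    """Turn a tag pattern like '<NN|DT>+' into regex source."""
    src = re.sub(r"\s+", "", tag_pattern)
    src = src.replace("<", "(?:<(?:").replace(">", ")>)")
    return src.replace(".", "[^<>{}]")


def unchunk(chunk_string, tag_pattern):
    """Remove the braces from every chunk that matches tag_pattern in full."""
    whole_chunk = re.compile(r"\{(" + to_regex(tag_pattern) + r")\}")
    return whole_chunk.sub(r"\1", chunk_string)


# State after the chunk rule <NN|DT|JJ>+ has been applied.
chunked = "{<DT><JJ><NN>}<VBD><IN>{<DT><NN>}"
result = unchunk(chunked, "<NN|DT>+")
print(result)
```

The pattern must cover a chunk from its opening brace to its closing brace, which is why the first chunk, with its JJ, is left alone.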
Chinking
A chink is a subsequence of the text that is not a chunk.
Define a regular expression that matches the sequences of tags in a chink
A simple chink regexp for finding NP chunks:
(<VB.?>|<IN>)+
First apply a chunk rule to chunk everything:
Input: the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
ChunkRule('<.*>+', 'Chunk everything')
[the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]
Apply the chink rule above:
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
(chunk, chink, chunk)
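Chinking can be sketched the same way: find the chink pattern inside a chunk and close the chunk before it, then reopen it after. The string encoding (tags as <TAG>, chunks in braces) and the helper names to_regex and chink are assumptions for illustration, not NLTK's API.

```python
import re


def to_regex(tag_pattern):
    """Turn a tag pattern like '(<VB.?>|<IN>)+' into regex source."""
    src = re.sub(r"\s+", "", tag_pattern)
    src = src.replace("<", "(?:<(?:").replace(">", ")>)")
    return src.replace(".", "[^<>{}]")


def chink(chunk_string, tag_pattern):
    """Cut every match of tag_pattern out of the chunk that contains it."""
    # Lookahead: the next brace to the right is a closing one, so the
    # match lies inside a chunk.
    inside_chunk = r"(?=[^{]*\})"
    rule = re.compile("(" + to_regex(tag_pattern) + ")" + inside_chunk)
    result = rule.sub(r"}\1{", chunk_string)
    # A chink at the edge of a chunk leaves an empty chunk behind; drop it.
    return result.replace("{}", "")


everything = "{<DT><JJ><NN><VBD><IN><DT><NN>}"  # after ChunkRule('<.*>+')
result = chink(everything, "(<VB.?>|<IN>)+")
print(result)
```

Starting from one chunk that covers the whole sentence, the verb and preposition are cut out, leaving the two noun phrases as chunks.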
Merging
Combine adjacent chunks into a single chunk
Define a regular expression that matches the sequences of tags on both sides of the point to be merged
Example: merge a chunk ending in JJ with a chunk starting with NN
MergeRule('<JJ>', '<NN>', 'Merge adjs and nouns')
Before: [the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN
After: [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
Splitting is the opposite of merging
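A merge can be sketched as deleting the "}{" boundary between two adjacent chunks when the tags on either side match the two patterns. As before, the brace encoding and the helper names to_regex and merge are assumptions for illustration rather than NLTK's own code.

```python
import re


def to_regex(tag_pattern):
    """Turn a tag pattern like '<JJ>' into regex source."""
    src = re.sub(r"\s+", "", tag_pattern)
    src = src.replace("<", "(?:<(?:").replace(">", ")>)")
    return src.replace(".", "[^<>{}]")


def merge(chunk_string, left_pattern, right_pattern):
    """Join adjacent chunks where the left one ends with left_pattern
    and the right one starts with right_pattern."""
    boundary = re.compile(
        "(" + to_regex(left_pattern) + r")\}\{(?=" + to_regex(right_pattern) + ")"
    )
    # Keep the left-hand tags and drop the '}{' between the two chunks.
    return boundary.sub(r"\1", chunk_string)


before = "{<DT><JJ>}{<NN>}<VBD><IN><DT><NN>"
after = merge(before, "<JJ>", "<NN>")
print(after)
```

The right-hand pattern is checked with a lookahead so that its tags stay in place and only the boundary is removed.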
Tokens and Labels in NLTK
Tokens are at many levels of description
– Document
– Sentence
– Word
Can have multiple representations at the same level
– A sentence can be marked up with TREE and WORDS simultaneously
– A word can have both TEXT and POS (or TAG)
Usually resolve this kind of problem by checking the API documentation:
http://nltk.sourceforge.net/api-1.4/index.html
But it is not all that helpful in this case. The tutorial has the answer.
Next Time and Upcoming
Finish Shallow Parsing
– Evaluating shallow parsing results
– More examples of chunk/chink/unchunk rules
– Revisit topics from previous week
Shallow Parsing Assignment
– Sent out Tues or Wed
– Due on Wed Sept 29
Next week:
– Read paper on end-of-sentence disambiguation
– Presley and Barbara lecturing on categorization