1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 20, 2004


1

SIMS 290-2: Applied Natural Language Processing

Marti Hearst
Sept 20, 2004

 

2

Today

Handout: basic English grammar
Determine time for a one-time lab
Begin chunking/shallow parsing

3
Slide modified from Steven Bird's

Shallow (Chunk) Parsing

Goal: divide a sentence into a sequence of chunks.

Chunks are non-overlapping regions of a text
    [I] saw [a tall man] in [the park].

Chunks are non-recursive
    A chunk cannot contain other chunks

Chunks are non-exhaustive
    Not all words are included in chunks

4
Slide modified from Steven Bird's

Chunk Parsing Examples

Noun-phrase chunking:
    [I] saw [a tall man] in [the park].

Verb-phrase chunking:
    The man who [was in the park] [saw me].

Prosodic chunking:
    [I saw] [a tall man] [in the park].

Question answering:
    What [Spanish explorer] discovered [the Mississippi River]?

5
Slide modified from Steven Bird's

Shallow Parsing: Motivation

Locating information
    e.g., text retrieval
        – Index a document collection on its noun phrases

Ignoring information
    Generalize in order to study higher-level patterns
        – e.g., phrases involving “gave” in the Penn Treebank: gave NP; gave up NP in NP; gave NP up; gave NP help; gave NP to NP

Sometimes a full parse has too much structure
        – Too nested
        – Chunks usually are not recursive

6
Slide modified from Steven Bird's

Representation

BIO (or IOB)

Trees
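
A minimal sketch of the BIO representation, in plain Python rather than NLTK. It assumes chunks are given as (start, end) token spans with an exclusive end, and it uses the variant in which every chunk opens with a B tag. The names `to_bio` and `from_bio` and the NP label are illustrative only.

```python
def to_bio(tokens, spans, label="NP"):
    """Tag each token B-<label>, I-<label>, or O from (start, end) chunk spans."""
    tags = ["O"] * len(tokens)              # words outside every chunk
    for start, end in spans:                # end is exclusive
        tags[start] = f"B-{label}"          # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining words of the chunk
    return list(zip(tokens, tags))


def from_bio(tagged):
    """Recover (start, end) chunk spans from a BIO-tagged sentence."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag.startswith("I-") and start is not None:
            continue                        # still inside the current chunk
        if start is not None:               # a B or O tag closes the open chunk
            spans.append((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    if start is not None:                   # chunk running to the end of the sentence
        spans.append((start, len(tagged)))
    return spans


tokens = "I saw a tall man in the park".split()
spans = [(0, 1), (2, 5), (6, 8)]            # [I] saw [a tall man] in [the park]
bio = to_bio(tokens, spans)
```

Because chunks are non-overlapping and non-recursive, one tag per word is enough to encode them, and `from_bio(bio)` gives back the original spans.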

7
Slide modified from Steven Bird's

Comparison with Full Syntactic Parsing

Parsing is usually an intermediate stage
    Builds structures that are used by later stages of processing

Full parsing is a sufficient but not necessary intermediate stage for many NLP tasks

Parsing often provides more information than we need

Shallow parsing is an easier problem
    Less word-order flexibility within chunks than between chunks
    More locality:
        – Fewer long-range dependencies
        – Less context-dependence
        – Less ambiguity

8
Slide modified from Steven Bird's

Chunks and Constituency

Constituents: [[a tall man] [in [the park]]].

Chunks: [a tall man] in [the park].

A constituent is part of some higher unit in the hierarchical syntactic parse

Chunks are not constituents

Constituents are recursive

But, chunks are typically subsequences of constituents

Chunks do not cross major constituent boundaries

9
Slide modified from Steven Bird's

Chunk Parsing in NLTK

Chunk parsers usually ignore lexical content
    Only need to look at part-of-speech tags

Possible steps in chunk parsing
    Chunking, unchunking
    Chinking
    Merging, splitting

Evaluation
    Compare to a baseline
    Evaluate in terms of
        – Precision, Recall, F-Measure
        – Missed (False Negative), Incorrect (False Positive)
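
A small sketch of the evaluation measures listed above, in plain Python. It assumes a chunk counts as correct only when its (start, end) span exactly matches a gold-standard span; the function name `chunk_scores` and the example spans are hypothetical.

```python
def chunk_scores(gold, guessed):
    """Compare guessed chunk spans with gold-standard spans (exact match only)."""
    gold, guessed = set(gold), set(guessed)
    correct = gold & guessed
    precision = len(correct) / len(guessed) if guessed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "missed": sorted(gold - guessed),       # false negatives
        "incorrect": sorted(guessed - gold),    # false positives
    }


gold = [(0, 3), (5, 7)]        # [the little cat] sat on [the mat]
guessed = [(0, 3), (4, 7)]     # second chunk wrongly starts at "on"
scores = chunk_scores(gold, guessed)
```

Here one of the two guessed chunks is right and one of the two gold chunks is found, so precision, recall, and F-measure are all 0.5.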

10
Slide modified from Steven Bird's

Chunking

Define a regular expression that matches the sequences of tags in a chunk

A simple noun phrase chunk regexp:

    <DT>? <JJ>* <NN.*>

(Note that <NN.*> matches any tag starting with NN)

Chunk all matching subsequences:

the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

If matching subsequences overlap, the first one gets priority
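
The idea of matching a regular expression against the tag sequence can be sketched in plain Python. This is not NLTK's implementation: the helpers `tag_regex`, `tag_string`, `chunk`, and `bracket` are hypothetical names, and the tag pattern is translated to an ordinary `re` pattern over a string such as `<DT><JJ><NN>`. The pattern used is the noun phrase regexp with `<NN.*>` from the note above.

```python
import re


def tag_regex(pattern):
    """Translate a tag pattern such as '<DT>?<JJ>*<NN.*>' into a regex string."""
    pattern = re.sub(r"\s+", "", pattern)                      # spaces are not significant
    pattern = re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", pattern)   # treat each <...> as one unit
    return pattern.replace(".", "[^<>]")                       # '.' never crosses a tag boundary


def tag_string(tagged):
    """Encode the tags of (word, tag) pairs as '<DT><JJ><NN>...'."""
    return "".join(f"<{tag}>" for _, tag in tagged)


def chunk(tagged, pattern):
    """Return (start, end) token spans whose tag sequence matches the pattern."""
    tags = tag_string(tagged)
    spans = []
    for m in re.finditer(tag_regex(pattern), tags):
        if m.end() > m.start():                                # skip empty matches
            # Counting '<' before an offset converts it to a token index.
            spans.append((tags.count("<", 0, m.start()), tags.count("<", 0, m.end())))
    return spans


def bracket(tagged, spans):
    """Render the sentence with chunks in square brackets."""
    starts = {s for s, _ in spans}
    ends = {e - 1 for _, e in spans}
    out = []
    for i, (word, tag) in enumerate(tagged):
        token = f"{word}/{tag}"
        if i in starts:
            token = "[" + token
        if i in ends:
            token = token + "]"
        out.append(token)
    return " ".join(out)


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
np_spans = chunk(sentence, "<DT>? <JJ>* <NN.*>")
print(bracket(sentence, np_spans))
```

`re.finditer` scans left to right and returns non-overlapping matches, which mirrors the rule that the first of two overlapping subsequences gets priority.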

11

Unchunking

Remove any chunk with a given pattern
    e.g., UnChunkRule('<NN|DT>+', 'Unchunk NNDT')
    Combine with chunk rule <NN|DT|JJ>+

Chunk all matching subsequences:

Input:

the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

Apply chunk rule:

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

Apply unchunk rule:

[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
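
A plain-Python sketch of what the unchunk step does, under the assumption that a chunk is removed only when its entire tag sequence matches the unchunk pattern. The helper names are hypothetical and this is not NLTK's own code; the starting spans are the result of the chunk rule shown above.

```python
import re


def tag_regex(pattern):
    """Translate a tag pattern such as '<NN|DT>+' into a regex string."""
    pattern = re.sub(r"\s+", "", pattern)                      # spaces are not significant
    pattern = re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", pattern)   # treat each <...> as one unit
    return pattern.replace(".", "[^<>]")                       # '.' never crosses a tag boundary


def tag_string(tagged):
    """Encode the tags of (word, tag) pairs as '<DT><JJ><NN>...'."""
    return "".join(f"<{tag}>" for _, tag in tagged)


def unchunk(tagged, spans, pattern):
    """Drop every chunk whose whole tag sequence matches the pattern."""
    regex = re.compile(tag_regex(pattern))
    kept = []
    for start, end in spans:
        if not regex.fullmatch(tag_string(tagged[start:end])):
            kept.append((start, end))       # chunk does not match, so it survives
    return kept


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
chunked = [(0, 3), (5, 7)]                  # after the chunk rule <NN|DT|JJ>+
remaining = unchunk(sentence, chunked, "<NN|DT>+")
```

The chunk `the/DT little/JJ cat/NN` contains a JJ, so `<NN|DT>+` cannot cover it and it is kept; `the/DT mat/NN` matches in full and is unchunked.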

12
Slide modified from Steven Bird's

Chinking

A chink is a subsequence of the text that is not a chunk.
Define a regular expression that matches the sequences of tags in a chink
    A simple chink regexp for finding NP chunks: (<VB.?>|<IN>)+

First apply a chunk rule to chunk everything
    Input: the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

    ChunkRule('<.*>+', 'Chunk everything')

    [the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]

Then apply the chink rule above:

    [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

    Chunk: the little cat; chink: sat on; chunk: the mat
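
The chinking step can be sketched in plain Python as splitting each existing chunk around every match of the chink pattern. The helper names are hypothetical and the code is an illustration of the idea, not NLTK's implementation; the starting span stands for the single chunk produced by the "chunk everything" rule.

```python
import re


def tag_regex(pattern):
    """Translate a tag pattern such as '(<VB.?>|<IN>)+' into a regex string."""
    pattern = re.sub(r"\s+", "", pattern)                      # spaces are not significant
    pattern = re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", pattern)   # treat each <...> as one unit
    return pattern.replace(".", "[^<>]")                       # '.' never crosses a tag boundary


def tag_string(tagged):
    """Encode the tags of (word, tag) pairs as '<DT><JJ><NN>...'."""
    return "".join(f"<{tag}>" for _, tag in tagged)


def chink(tagged, spans, pattern):
    """Remove every match of the chink pattern from inside the given chunks."""
    regex = re.compile(tag_regex(pattern))
    result = []
    for start, end in spans:
        tags = tag_string(tagged[start:end])
        cursor = start                                  # start of the piece still to be kept
        for m in regex.finditer(tags):
            if m.end() == m.start():
                continue                                # ignore empty matches
            gap_start = start + tags.count("<", 0, m.start())
            gap_end = start + tags.count("<", 0, m.end())
            if gap_start > cursor:
                result.append((cursor, gap_start))      # piece before the chink
            cursor = gap_end
        if end > cursor:
            result.append((cursor, end))                # piece after the last chink
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
everything = [(0, len(sentence))]                       # one chunk covering the sentence
np_spans = chink(sentence, everything, "(<VB.?>|<IN>)+")
```

Chinking `sat/VBD on/IN` out of the single all-inclusive chunk leaves the two noun phrase chunks on either side.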

13
Slide modified from Steven Bird's

Merging

Combine adjacent chunks into a single chunk
    Define a regular expression that matches the sequences of tags on both sides of the point to be merged

Example:
    Merge a chunk ending in JJ with a chunk starting with NN

    MergeRule('<JJ>', '<NN>', 'Merge adjs and nouns')

    [the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN

    [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN

Splitting is the opposite of merging
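
A plain-Python sketch of the merge step, assuming two chunks are merged when they are directly adjacent, the left one ends with tags matching the first pattern, and the right one begins with tags matching the second. The helper names are hypothetical and the code illustrates the idea rather than reproducing NLTK.

```python
import re


def tag_regex(pattern):
    """Translate a tag pattern such as '<JJ>' into a regex string."""
    pattern = re.sub(r"\s+", "", pattern)                      # spaces are not significant
    pattern = re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", pattern)   # treat each <...> as one unit
    return pattern.replace(".", "[^<>]")                       # '.' never crosses a tag boundary


def tag_string(tagged):
    """Encode the tags of (word, tag) pairs as '<DT><JJ><NN>...'."""
    return "".join(f"<{tag}>" for _, tag in tagged)


def merge(tagged, spans, left_pattern, right_pattern):
    """Merge adjacent chunks when the tags on both sides of the boundary match."""
    left = re.compile("(?:" + tag_regex(left_pattern) + ")$")   # must match at the end
    right = re.compile(tag_regex(right_pattern))                 # must match at the start
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            if (prev_end == start
                    and left.search(tag_string(tagged[prev_start:prev_end]))
                    and right.match(tag_string(tagged[start:end]))):
                merged[-1] = (prev_start, end)          # extend the previous chunk
                continue
        merged.append((start, end))
    return merged


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
before = [(0, 2), (2, 3)]                   # [the/DT little/JJ] [cat/NN]
after = merge(sentence, before, "<JJ>", "<NN>")
```

A split step would do the reverse: find the same kind of boundary inside one chunk and cut the chunk in two there.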

14

Tokens and Labels in NLTK

Tokens are at many levels of description
    Document
    Sentence
    Word

Can have multiple representations at the same level
    A sentence can be marked up with TREE and WORDS simultaneously
    A word can have both TEXT and POS (or TAG)

15

Applying Chunking to Treebank Data

16

17

18

Usually resolve this kind of problem by checking out the API:
    http://nltk.sourceforge.net/api-1.4/index.html
But the API is not all that helpful in this case. The tutorial has the answer.

19

20
Slide modified from Steven Bird's

Cascaded Chunking

21

Next Time and Upcoming

Finish Shallow Parsing
    Evaluating Shallow Parsing Results
    More examples of chunk/chink/unchunk rules
Revisit topics from previous week

Shallow Parsing Assignment
    Sent out Tues or Wed
    Due on Wed Sept 29

Next week:
    Read paper on end-of-sentence disambiguation
    Presley and Barbara lecturing on categorization