25
October 2005 CSA3180: Text Processing II 1 CSA3180: Natural Language Processing Shallow Parsing Shallow Parsing and Chunking

CSA3180: Natural Language Processing

  • Upload
    kordell

  • View
    39

  • Download
    3

Embed Size (px)

DESCRIPTION

CSA3180: Natural Language Processing. Shallow Parsing Shallow Parsing and Chunking. Given grammar G and sentence A discover all valid parse trees for G that exactly cover A. Parsing Problem. S. VP. NP. V. Nom. Det. book. N. that. flight. The elephant is in the garden. S. VP. - PowerPoint PPT Presentation

Citation preview

Page 1: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 1

CSA3180: Natural Language Processing

Shallow Parsing

Shallow Parsing and Chunking

Page 2: CSA3180: Natural Language Processing

October 2006 csa3180: Parsing Algorithms 1 2

Parsing Problem• Given grammar G and sentence A discover all

valid parse trees for G that exactly cover A

S

VP

NPV

DetNom

Nbook

that

flight

Page 3: CSA3180: Natural Language Processing

October 2006 csa3180: Parsing Algorithms 1 3

The elephant is in the garden

I shot an elephant in my garden

NP

VP

NP

PP

NP

S

Page 4: CSA3180: Natural Language Processing

October 2006 csa3180: Parsing Algorithms 1 4

I own the garden

I shot an elephant in my garden

NP

VP

PP

NP

S

Page 5: CSA3180: Natural Language Processing

October 2006 csa3180: Parsing Algorithms 1 5

Parsing as Search

• Search within a space defined by– Start State– Goal State– State to state transformations

• Two distinct parsing strategies:– Top down– Bottom up

• Different parsing strategy, different state space, different problem.

• N.B. Parsing strategy ≠ search strategy

Page 6: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 6

Shallow/Chunk Parsing

Goal: divide a sentence into a sequence of chunks.

• Chunks are non-overlapping regions of a text[I] saw [a tall man] in [the park].

• Chunks are non-recursive– A chunk can not contain other chunks

• Chunks are non-exhaustive– Not all words are included in chunks

Page 7: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 7

Chunk Parsing Examples

• Noun-phrase chunking:[I] saw [a tall man] in [the park].

• Verb-phrase chunking:The man who [was in the park] [saw me].

• Prosodic chunking:

[I saw] [a tall man] [in the park].

• Question answering:– What [Spanish explorer] discovered [the

Mississippi River]?

Page 8: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 8

Motivation

• Locating information– e.g., text retrieval

• Index a document collection on its noun phrases

• Ignoring information– Generalize in order to study higher-level patterns

• e.g. phrases involving “gave” in Penn treebank:– gave NP; gave up NP in NP; gave NP up; gave NP help; gave

NP to NP

– Sometimes a full parse has too much structure• Too nested• Chunks usually are not recursive

Page 9: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 9

Representation

• BIO (or IOB)

Trees

Page 10: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 10

Comparison with Full Parsing

• Parsing is usually an intermediate stage– Builds structures that are used by later stages of processing

• Full parsing is a sufficient but not necessary intermediate stage for many NLP tasks– Parsing often provides more information than we need

• Shallow parsing is an easier problem– Less word-order flexibility within chunks than between chunks– More locality:

• Fewer long-range dependencies• Less context-dependence• Less ambiguity

Page 11: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 11

Chunks and Constituency

Constituents: [[a tall man] [ in [the park]]].

Chunks: [a tall man] in [the park].

• A constituent is part of some higher unit in the hierarchical syntactic parse

• Chunks are not constituents– Constituents are recursive

• But, chunks are typically subsequences of constituents– Chunks do not cross major constituent boundaries

Page 12: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 12

Unchunking

• Remove any chunk with a given pattern– e.g., unChunkRule(‘<NN|DT>+’, ‘Unchunk NNDT’)– Combine with Chunk Rule <NN|DT|JJ>+

• Chunk all matching subsequences:– Input:

the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

– Apply chunk rule

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]– Apply unchunk rule

[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN

Page 13: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 13

Chinking

• A chink is a subsequence of the text that is not a chunk.• Define a regular expression that matches the sequences

of tags in a chinkA simple chink regexp for finding NP chunks: (<VB.?>|<IN>)+

• First apply chunk rule to chunk everything– Input:

the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

– ChunkRule('<.*>+', ‘Chunk everything’)

[the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]– Apply Chink rule above:

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

Page 14: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 14

Merging

• Combine adjacent chunks into a single chunk– Define a regular expression that matches the sequences of tags

on both sides of the point to be merged

• Example:– Merge a chunk ending in JJ with a chunk starting with NN

MergeRule(‘<JJ>’, ‘<NN>’, ‘Merge adjs and nouns’)

[the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN

[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN

• Splitting is the opposite of merging

Page 15: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 15

Python and NLTK

• Natural Language Toolkit (NLTK)

• http://nltk.sourceforge.net/

• NLTK Slides partly based on Diane Litman Lectures

• Chunk parsing slides partly based on Marti Hearst Lectures

Page 16: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 16

Python for NLP

• Python is a great language for NLP:– Simple– Easy to debug:

• Exceptions• Interpreted language

– Easy to structure• Modules• Object oriented programming

– Powerful string manipulation

Page 17: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 17

Python Modules and Packages

• Python modules “package program code and data for reuse.” (Lutz)– Similar to library in C, package in Java.

• Python packages are hierarchical modules (i.e., modules that contain other modules).

• Three commands for accessing modules:1.import2.from…import3.reload

Page 18: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 18

Import Command

• The import command loads a module:# Load the regular expression module

>>> import re

• To access the contents of a module, use dotted names:

# Use the search method from the re module

>>> re.search(‘\w+’, str)

• To list the contents of a module, use dir:>>> dir(re)

[‘DOTALL’, ‘I’, ‘IGNORECASE’,…]

Page 19: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 19

from...import

• The from…import command loads individual functions and objects from a module:

# Load the search function from the re module>>> from re import search

• Once an individual function or object is loaded with from…import, it can be used directly:

# Use the search method from the re module>>> search (‘\w+’, str)

Page 20: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 20

Import vs. from...import

Import• Keeps module

functions separate from user functions.

• Requires the use of dotted names.

• Works with reload.

from…import• Puts module functions

and user functions together.

• More convenient names.

• Does not work with reload.

Page 21: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 21

Reload

• If you edit a module, you must use the reload command before the changes become visible in Python:

>>> import mymodule

...

>>> reload (mymodule)

• The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.

Page 22: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 22

NLTK Introduction

• The Natural Language Toolkit (NLTK) provides:– Basic classes for representing data relevant to natural

language processing.– Standard interfaces for performing tasks, such as

tokenization, tagging, and parsing.– Standard implementations of each task, which can be

combined to solve complex problems.

Page 23: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 23

NLTK Example Modules

• nltk.token: processing individual elements of text, such as words or sentences.

• nltk.probability: modeling frequency distributions and probabilistic systems.

• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.

• nltk.parser: high-level interface for parsing texts.• nltk.chartparser: a chart-based implementation of

the parser interface.• nltk.chunkparser: a regular-expression based

surface parser.

Page 24: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 24

Chunk Parsing in NLTK

• Chunk parsers usually ignore lexical content– Only need to look at part-of-speech tags

• Possible steps in chunk parsing– Chunking, unchunking– Chinking– Merging, splitting

• Evaluation– Compare to a Baseline– Evaluate in terms of

• Precision, Recall, F-Measure• Missed (False Negative), Incorrect (False Positive)

Page 25: CSA3180: Natural Language Processing

October 2005 CSA3180: Text Processing II 25

Chunk Parsing in NLTK

• Define a regular expression that matches the sequences of tags in a chunk

A simple noun phrase chunk regexp:(Note that <NN.*> matches any tag starting with NN)

<DT>? <JJ>* <NN.?>

• Chunk all matching subsequences:

the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

• If matching subsequences overlap, first 1 gets priority