Chunking: Shallow Parsing


School of Computing, FACULTY OF ENGINEERING

Chunking: Shallow Parsing

Eric Atwell, Language Research Group

Shallow Parsing

Break text up into non-overlapping contiguous subsets of tokens.

• Also called chunking, partial parsing, light parsing.

What is it useful for? – semantic patterns

• Finding key “meaning-elements”: Named Entity Recognition

• people, locations, organizations

• Studying linguistic patterns, e.g. semantic patterns of verbs

• gave NP

• gave up NP in NP

• gave NP NP

• gave NP to NP

• Can ignore complex structure when not relevant

A Relationship between Segmenting and Labeling

Tokenization segments the text

Tagging labels the text

Shallow parsing does both simultaneously.

Chunking vs. Full Syntactic Parsing

“G.K. Chesterton, author of The Man who was Thursday”

Representations for Chunks

IOB tags

• Inside, outside, and begin

• In English, the start of a phrase is often marked by a function-word
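A minimal sketch of IOB encoding, assuming chunks are given as (start, end, label) token spans with an exclusive end. The sentence, the spans and the function name `chunks_to_iob` are illustrative, not taken from the slides.

```python
def chunks_to_iob(tokens, chunks):
    """Label each token B-X (begins chunk X), I-X (inside X) or O (outside)."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return list(zip(tokens, tags))

# Hypothetical example sentence with two NP chunks.
tokens = ["the", "little", "dog", "barked", "at", "the", "cat"]
chunks = [(0, 3, "NP"), (5, 7, "NP")]
iob = chunks_to_iob(tokens, chunks)
for word, tag in iob:
    print(word, tag)
```

Because chunks never overlap, one tag per token is enough to recover every chunk boundary.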

Representations for Chunks

Trees

• Chunk structure is a two-level tree that spans the entire text, containing both chunks and non-chunks
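The two-level tree can be sketched with plain tuples and lists, assuming IOB-tagged input. The representation chosen here (an "S" root whose children are either bare words or (label, words) pairs) and the name `iob_to_tree` are assumptions for illustration only.

```python
def iob_to_tree(tagged):
    """Build ("S", children) from (word, iob_tag) pairs.

    Each child is either a word (a non-chunk) or a (label, [words]) chunk.
    """
    children = []
    current = None  # the chunk currently being extended, if any
    for word, tag in tagged:
        if tag == "O":
            current = None
            children.append(word)
        elif tag.startswith("B-") or current is None or current[0] != tag[2:]:
            # A B- tag, or an I- tag with no matching open chunk, starts a chunk.
            current = (tag[2:], [word])
            children.append(current)
        else:
            current[1].append(word)
    return ("S", children)

# Hypothetical example input.
tagged = [("the", "B-NP"), ("dog", "I-NP"), ("barked", "O"),
          ("at", "O"), ("the", "B-NP"), ("cat", "I-NP")]
tree = iob_to_tree(tagged)
print(tree)
```

The tree is never deeper than two levels: the root spans the whole text and each chunk sits directly under it.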

CONLL Corpus: training data for Machine Learning of chunking

From the shared task (competition) of the Conference on Computational Natural Language Learning (CoNLL), 2000

Goal: create machine learning methods to improve on the chunking task

CONLL Corpus

Data in IOB format from the Wall Street Journal (WSJ):

• Word POS-tag IOB-tag

• Training set: 8936 sentences

• Test set: 2012 sentences

Tags from the Brill tagger

• Penn Treebank Tags
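A small reader for the "Word POS-tag IOB-tag" layout, assuming one token per line, whitespace-separated columns and a blank line between sentences. The sample text is illustrative and `read_conll` is a hypothetical helper name.

```python
def read_conll(text):
    """Return a list of sentences, each a list of (word, pos, iob) triples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences

# Illustrative sample in the three-column format.
sample = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP

It PRP B-NP
fell VBD B-VP
"""
sentences = read_conll(sample)
print(len(sentences), sentences[0][:2])
```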

Evaluation measure: F-score

• 2*precision*recall / (recall+precision)

• Baseline was: select the chunk tag that is most frequently associated with the POS tag, F =77.07

• Best score in the contest was F=94.13
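The baseline and the F-score can be sketched as follows, assuming a chunk counts as correct only when its boundaries and label both match the gold standard. The toy training and test data are made up, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_baseline(sentences):
    """Map each POS tag to the chunk tag it most often carries in training."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for _word, pos, iob in sentence:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def chunk_spans(iob_tags):
    """Return the set of (start, end, label) chunks encoded by IOB tags."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(iob_tags) + ["O"]):  # sentinel closes last chunk
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != label):
            spans.add((start, i, label))
            start, label = None, None
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return spans

def f_score(gold_tags, predicted_tags):
    """F = 2 * precision * recall / (recall + precision), over whole chunks."""
    gold, pred = chunk_spans(gold_tags), chunk_spans(predicted_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (recall + precision)

# Toy data (hypothetical): the baseline has never seen the POS tag JJ.
train = [
    [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
    [("a", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("slept", "VBD", "O")],
]
model = train_baseline(train)
test_pos = ["DT", "NN", "VBD", "DT", "JJ", "NN"]
gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP"]
predicted = [model.get(pos, "O") for pos in test_pos]  # unseen POS tags get O
score = f_score(gold, predicted)
print(predicted, round(score, 2))
```

In this toy run the unseen JJ splits the second noun phrase, so only one of the three predicted chunks matches one of the two gold chunks.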

Chunking with Regular Expressions

This time we write regexes over TAGS rather than characters

• <DT><JJ>?<NN>

• <NN.*>

• <JJ|NN>+

Compile them with parse.ChunkRule()

• rule = parse.ChunkRule('<DT|NN>+')

• chunkparser = parse.RegexpChunk([rule], chunk_node = 'NP')

Resulting object is a (sort-of) parse tree

• Top-level node called S

• Chunks are labelled NP
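To show the idea without depending on any toolkit version, here is a standard-library sketch of matching a tag pattern against a sequence of POS tags. It is not the NLTK implementation: the helper names are hypothetical, the example sentence is made up, and tags containing regex metacharacters (e.g. PRP$) are not handled.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT><JJ>?<NN>' into a regex over '<TAG>' strings."""
    def wrap(match):
        # '.' may match any character of a tag but must not cross tag brackets.
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + inner + ")>)"
    return re.compile(re.sub(r"<([^>]+)>", wrap, tag_pattern))

def regexp_chunk(tagged, tag_pattern, label="NP"):
    """Return ("S", children); children are (word, tag) pairs or (label, [pairs])."""
    regex = tag_pattern_to_regex(tag_pattern)
    tag_string = "".join("<%s>" % tag for _word, tag in tagged)
    # Map the character offset where each tag starts to its token index.
    starts, offset = {}, 0
    for i, (_word, tag) in enumerate(tagged):
        starts[offset] = i
        offset += len(tag) + 2
    starts[offset] = len(tagged)
    children, pos = [], 0
    for m in regex.finditer(tag_string):
        if m.start() == m.end():
            continue  # ignore empty matches
        first, last = starts[m.start()], starts[m.end()]
        children.extend(tagged[pos:first])
        children.append((label, tagged[first:last]))
        pos = last
    children.extend(tagged[pos:])
    return ("S", children)

# Hypothetical tagged sentence, chunked with one of the patterns above.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN")]
tree = regexp_chunk(tagged, "<DT><JJ>?<NN>")
print(tree)
```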

Chunking with Regular Expressions

Rule application is sensitive to order

Chinking

Specify what does not go into a chunk.

• Kind of like defining punctuation as whatever is not alphanumeric or whitespace.

• Can be more difficult to think about.

Simple chink-chunk approach: function vs. content word-class

Regular expressions for chunks and chinks CAN get complex

BUT the whole point is to be simpler than full parsing!

SO: use a simple model which works “reasonably well”

(then tidy up afterwards…)

Chunk = nominal content-word (noun)

Chink = others (verb, pronoun, determiner, preposition, conjunction) (+adjective, adverb as a borderline category)

Example

Fruit flies like a banana

fruit\N flies\N like\V a\A banana\N

[fruit flies] like a [banana]

[S [NP fruit\N flies\N NP]
   [VP like\V
       [NP a\A banana\N NP]
   VP]
S]
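The chink-chunk step in this example can be sketched in a few lines, assuming the word\CLASS notation used on the slide and treating only N (noun) as chunk material. The function names are hypothetical.

```python
def parse_tagged(text):
    """Split 'fruit\\N flies\\N ...' into (word, word_class) pairs."""
    return [tuple(token.split("\\")) for token in text.split()]

def chink_chunk(tagged, chunk_classes=("N",)):
    """Bracket maximal runs of chunk-class words; everything else is a chink."""
    pieces, current = [], []
    for word, word_class in tagged:
        if word_class in chunk_classes:
            current.append(word)
        else:
            if current:
                pieces.append("[" + " ".join(current) + "]")
                current = []
            pieces.append(word)
    if current:
        pieces.append("[" + " ".join(current) + "]")
    return " ".join(pieces)

# Raw string so that \N is kept as a literal backslash plus N.
tagged = parse_tagged(r"fruit\N flies\N like\V a\A banana\N")
result = chink_chunk(tagged)
print(result)
```

Attaching the determiner to its noun, as in the bracketed tree, would be part of the later tidy-up step and is not attempted here.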

An alternative parse

This sentence is grammatically ambiguous:

Fruit flies like a banana

fruit\N flies\N like\V a\A banana\N [fruit flies] like a [banana]

fruit\N flies\V like\I a\A banana\N [fruit] flies like a [banana]

cf. “bank robbers like a chase” vs. “bread bakes in an oven”

[S [NP fruit\N NP]
   [VP flies\V
       [PP like\I [NP a\A banana\N NP] PP]
   VP]
S]

Ambiguity leads to more rules

fruit\N flies\N like\V a\A banana\N [fruit flies] like a [banana]

fruit\N flies\V like\I a\A banana\N [fruit] flies like a [banana]

BUT what about: Time flies like an arrow - time\N, time\V

time\N flies\N like\V an\A arrow\N [time flies] like an [arrow]

time\N flies\V like\I an\A arrow\N [time] flies like an [arrow]

time\V flies\N like\I an\A arrow\N time [flies] like an [arrow]

The 3rd PoS-tagging still gives an ambiguous parse: “like an arrow” can attach to the verb “time” or to the noun “flies”
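The dependence of the chunking on the PoS-tagging can be checked mechanically. This sketch assumes the same noun-only chunk rule and runs it over the three taggings listed above; the function name is hypothetical.

```python
from itertools import groupby

def chink_chunk(tagged_text):
    """Bracket maximal runs of N-tagged words in word\\CLASS notation."""
    tokens = [token.split("\\") for token in tagged_text.split()]
    out = []
    for is_noun, group in groupby(tokens, key=lambda t: t[1] == "N"):
        words = " ".join(word for word, _cls in group)
        out.append("[" + words + "]" if is_noun else words)
    return " ".join(out)

# The three PoS-taggings of the same word string (raw strings keep the backslashes).
taggings = [
    r"time\N flies\N like\V an\A arrow\N",
    r"time\N flies\V like\I an\A arrow\N",
    r"time\V flies\N like\I an\A arrow\N",
]
chunkings = [chink_chunk(t) for t in taggings]
for chunking in chunkings:
    print(chunking)
```

The chunker itself is deterministic; each tagging yields exactly one chunking, so the ambiguity has to be resolved by the tagger or by extra rules.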

Chunking can predict prosodic breaks

http://www.acm.org/crossroads/

An Approach for Detecting Prosodic Phrase Boundaries in Spoken English by Claire Brierley and Eric Atwell

Summary

Shallow parsing is useful for:

Entity recognition

• people, locations, organizations

Studying linguistic patterns

• gave NP

• gave up NP in NP

• gave NP NP

• gave NP to NP

Prosodic phrase breaks – pauses in speech

Can ignore complex structure when not relevant

Chink-chunk approach: “quick-and-dirty” chunking, content vs. function PoS

Chink-chunk parsing is simpler than context-free grammar parsing!
