102
Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

  • View
    226

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction

Shallow Processing Techniques for NLPLing570

December 5, 2011

Page 2: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

RoadmapInformation extraction:

Definition & Motivation

General approach

Relation extraction techniques:Fully supervised

Lightly supervised

Page 3: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information ExtractionMotivation:

Unstructured data hard to manage, assess, analyze

Page 4: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information ExtractionMotivation:

Unstructured data hard to manage, assess, analyze

Many tools for manipulating structured data:Databases: efficient search, agglomerationStatistical analysis packagesDecision support systems

Page 5: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information ExtractionMotivation:

Unstructured data hard to manage, assess, analyze

Many tools for manipulating structured data:Databases: efficient search, agglomerationStatistical analysis packagesDecision support systems

Goal:Given unstructured information in textsConvert to structured data

E.g., populate DBs

Page 6: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

IE ExampleCiting high fuel prices, United Airlines said Friday it

has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline: Amount: Effective Date: Follower:

Page 7: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

IE ExampleCiting high fuel prices, United Airlines said Friday it

has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline: United Airlines Amount: $6 Effective Date: 2006-10-26 Follower: American Airlines

Page 8: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)

Page 9: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

Page 10: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

CompanyA, CompanyB, CompanyC, amount, type

Terrorist incidents

Page 11: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

CompanyA, CompanyB, CompanyC, amount, type

Terrorist incidents Incident type, Actor, Victim, Date, Location

General:

Page 12: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

CompanyA, CompanyB, CompanyC, amount, type

Terrorist incidents Incident type, Actor, Victim, Date, Location

General:PricegrabberStock market analysisApartment finder

Page 13: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information ExtractionComponents

Incorporates range of techniques, tools

Page 14: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information ExtractionComponents

Incorporates range of techniques, toolsFSTs (pattern extraction)Supervised learning:

ClassificationSequence labeling

Page 15: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information ExtractionComponents

Incorporates range of techniques, toolsFSTs (pattern extraction)Supervised learning:

ClassificationSequence labeling

Semi-, un-supervised learning: clustering, bootstrapping

Page 16: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information ExtractionComponents

Incorporates range of techniques, toolsFSTs (pattern extraction)Supervised learning:

ClassificationSequence labeling

Semi-, un-supervised learning: clustering, bootstrapping

Part-of-Speech taggingNamed Entity RecognitionChunkingParsing

Page 17: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction Process

Typically pipeline of major components

Page 18: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction Process

Typically pipeline of major components

First step: Named Entity Recognition (NER)Often application-specific

Generic type: Person, Organiztion, LocationSpecific: genes to college courses

Page 19: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction Process

Typically pipeline of major components

First step: Named Entity Recognition (NER)Often application-specific

Generic type: Person, Organiztion, LocationSpecific: genes to college courses

Identify NE mentions: specific instances

Page 20: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: Pre-NERCiting high fuel prices, United Airlines said Friday

it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

Page 21: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: post-NERCiting high fuel prices, [ORGUnited Airlines] said

[TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Page 22: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction:Reference Resolution

Builds on NER

Clusters entity mentions referring to same thingsCreates entity equivalence classes

Page 23: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction:Reference Resolution

Builds on NER

Clusters entity mentions referring to same thingsCreates entity equivalence classes

Includes anaphora resolutionE.g. pronoun resolution (it, he, she…; one; this, that)

Page 24: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction:Reference Resolution

Builds on NER

Clusters entity mentions referring to same thingsCreates entity equivalence classes

Includes anaphora resolutionE.g. pronoun resolution (it, he, she…; one; this, that)

Also non-anaphoric, general mentions

Page 25: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: post-NERCiting high fuel prices, [ORGUnited Airlines] said

[TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Page 26: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Page 27: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Page 28: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Mix of application-specific and generic relations

Generic relations:

Page 29: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Mix of application-specific and generic relations

Generic relations:

Family, employment, part-whole, membership, geospatial

Page 30: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Page 31: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Mix of application-specific and generic relationsGeneric relations:

Family, employment, part-whole, membership, geospatial

Generic relations:United is part-of UAL Corp.American Airlines is part-of AMRTim Wagner is an employee of American Airlines

Page 32: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Detection & Classification

Relation detection & classification: Find and classify relations among entities in text

Mix of application-specific and generic relationsGeneric relations:

Family, employment, part-whole, membership, geospatial

Generic relations:United is part-of UAL Corp.American Airlines is part-of AMRTim Wagner is an employee of American Airlines

Domain-specific relations:United serves Chicago; United serves Denver, etc

Page 33: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Page 34: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Event Detection &Classification

Event detection and classification:Find key events

Page 35: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Event Detection &Classification

Event detection and classification:Find key events

Fare increase by UnitedMatching fare increase by AmericanReporting of fare increases

How cued?

Page 36: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Event Detection &Classification

Event detection and classification:Find key events

Fare increase by UnitedMatching fare increase by AmericanReporting of fare increases

How cued? Instances of “said” and ”cite”

Page 37: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Temporal Expressions:Recognition and Analysis

Temporal expression recognitionFinds temporal events in text

Page 38: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Page 39: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Temporal Expressions:Recognition and Analysis

Temporal expression recognitionFinds temporal events in text

E.g. Friday, Thursday in example text

Page 40: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Temporal Expressions:Recognition and Analysis

Temporal expression recognitionFinds temporal events in text

E.g. Friday, Thursday in example textOthers:

Date expressions: Absolute: days of week, holidays Relative: two days from now, next year

Clock times: noon, 3:30pm

Page 41: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Temporal Expressions:Recognition and Analysis

Temporal expression recognition Finds temporal events in text

E.g. Friday, Thursday in example text Others:

Date expressions: Absolute: days of week, holidays Relative: two days from now, next year

Clock times: noon, 3:30pm

Temporal analysis: Map expressions to points/spans on timeline

Anchor: e.g. Friday, Thursday: tied to example dateline

Page 42: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Template FillingTexts describe stereotypical situations in domain

Identify texts evoking template situations

Fill slots in templates based on text

Page 43: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Template FillingTexts describe stereotypical situations in domain

Identify texts evoking template situations

Fill slots in templates based on text

In example: fare-raise is stereo-typical event

Page 44: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Information Extraction Steps

Named Entity Recognition

Reference Resolution

Relation Detection & Classification

Event Detection & Classification

Temporal Expression Recognition & Analysis

Template-filling

Page 45: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Detection & Classification

Page 46: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Common Relations

Page 47: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relations and Entitiesfrom Example

Page 48: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Supervised Approaches to Relation Analysis

Two stage process:

Page 49: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities

Page 50: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities Positive examples:

Page 51: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities Positive examples:

Drawn from labeled training data Negative examples:

Page 52: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities Positive examples:

Drawn from labeled training data Negative examples:

Generated from within-sentence entity pairs w/no relation

Relation classification:Label each related pair of entities by type

Multi-class classification task

Page 53: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignWhich features can we use to classify relations?

Page 54: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Example: post-NERCiting high fuel prices, [ORGUnited Airlines] said

[TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Page 55: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Common Relations

Page 56: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignWhich features can we use to classify relations?

Page 57: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignWhich features can we use to classify relations?

Named entity features:Named entity types of candidate argumentsConcatenation of two entity typesHead words of argumentsBag-of-words from arguments

Page 58: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignWhich features can we use to classify relations?

Named entity features:Named entity types of candidate argumentsConcatenation of two entity typesHead words of argumentsBag-of-words from arguments

Context word features:Bag-of-words/bigrams between entities (+/-

stemmed)Words and stems before/after entitiesDistance in words, # entities between arguments

Page 59: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignFeatures for relation classification

Syntactic structure features can signal relations E.g. appositive signals part-of relation

Page 60: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignFeatures for relation classification

Syntactic structure features can signal relations E.g. appositive signals part-of relation

Page 61: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Page 62: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Features:Presence of constructionChunk base-phrase pathsBag-of-chunk headsDependency/constituency pathDistance between arguments

How can we represent syntax?

Page 63: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Features:Presence of constructionChunk base-phrase pathsBag-of-chunk headsDependency/constituency pathDistance between arguments

How can we represent syntax?Binary features: Detection of construction patterns

Page 64: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Features:Presence of constructionChunk base-phrase pathsBag-of-chunk headsDependency/constituency pathDistance between arguments

How can we represent syntax?Binary features: Detection of construction patternsStructure features: Encode paths through trees

Page 65: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Some Example Features

Page 66: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Lightly Supervised Approaches

Supervised approaches highly effective, but

Page 67: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Page 68: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, but

Page 69: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, butMay not yield results consistent with desired

categories

Lightly supervised approaches:

Page 70: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, butMay not yield results consistent with desired

categories

Lightly supervised approaches:Build on small seed sets of patterns that capture

relations of interestE.g. manually constructed regular expressions

Page 71: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, butMay not yield results consistent with desired

categories

Lightly supervised approaches:Build on small seed sets of patterns that capture

relations of interestE.g. manually constructed regular expressions

Iteratively generalize and refine patterns to optimize

Page 72: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsCreate seed patterns and instances

Example: finding airline hub citiesInitial pattern:

Page 73: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsCreate seed patterns and instances

Example: finding airline hub citiesInitial pattern: / * has a hub at * /

In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American airlines has a hub at the San Juan airport

Page 74: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsCreate seed patterns and instances

Example: finding airline hub citiesInitial pattern: / * has a hub at * /

In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American airlines has a hub at the San Juan airport

Instances: (Delta, LaGuardia), (Midwest,KCI),(American,San Juan)

Page 75: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsIssues:

Page 76: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix?

Page 77: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at

Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix?

Page 78: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix? Relax patterns

Page 79: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix? Relax patterns – see issue 1

Page 80: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping Relations Issues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/ Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix? Relax patterns – see issue 1 Add new high precision patterns

Page 81: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

BootstrappingHow can we obtain more high quality patterns?

Page 82: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

BootstrappingHow can we obtain more high quality patterns?

Human labeling?

Page 83: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above…Use tuples identified by seed patternsFind other occurrences of tuplesExtract patterns from those occurrences

Page 84: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above…Use tuples identified by seed patternsFind other occurrences of tuplesExtract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hub

Page 85: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above…Use tuples identified by seed patternsFind other occurrences of tuplesExtract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hubOccurrences:

Budget airline Ryanair, which uses Charleroi as a hubAll flights in and out of Ryanair’s Belgian hub at Charleroi

Patterns:

Page 86: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hub Occurrences:

Budget airline Ryanair, which uses Charleroi as a hubAll flights in and out of Ryanair’s Belgian hub at Charleroi

Patterns: /[ORG], which uses [LOC] as a hub /

Page 87: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hub Occurrences:

Budget airline Ryanair, which uses Charleroi as a hubAll flights in and out of Ryanair’s Belgian hub at Charleroi

Patterns: /[ORG], which uses [LOC] as a hub / /[ORG]’s hub at [LOC] /

Page 88: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011
Page 89: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Bootstrapping IssuesIssues:

How do we represent patterns?

How to we assess the patterns?

How to we assess the tuples?

Page 90: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Pattern RepresentationGeneral approaches:

Page 91: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Pattern RepresentationGeneral approaches:

Regular expressionsSpecific, high precision

Page 92: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Pattern RepresentationGeneral approaches:

Regular expressionsSpecific, high precision

Machine learning style feature vectorsMore flexible, higher recall

Page 93: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Pattern RepresentationGeneral approaches:

Regular expressionsSpecific, high precision

Machine learning style feature vectorsMore flexible, higher recall

Components in representation:Context before first entity mentionContext between entity mentionsContext after last entity mentionOrder of arguments

Page 94: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Assessing Patterns & Tuples

Issue:

Page 95: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Assessing Patterns & Tuples

Issue: No gold standard!Can use seed patterns

Page 96: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Assessing Patterns & Tuples

Issue: No gold standard!Can use seed patterns

Balance two main factors:Performance on seed tuples

Hits and missesProductivity in full collection

Finds

Page 97: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Assessing Patterns & Tuples

Issue: No gold standard!Can use seed patterns

Balance two main factors:Performance on seed tuples

Hits and missesProductivity in full collection

Finds

Use conservative strategy to minimize semantic driftE.g. Sydney has a ferry hub at Circular Key.

Page 98: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Page 99: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

Page 100: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

No fully labeled corpus

Page 101: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

No fully labeled corpusEvaluate retrieved tuples (e.g. DB contents)

Intuition: Don’t care how many times we find same tuple

Page 102: Information Extraction Shallow Processing Techniques for NLP Ling570 December 5, 2011

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

No fully labeled corpusEvaluate retrieved tuples (e.g. DB contents)

Intuition: Don’t care how many times we find same tuple

Checked by human assessors