
Information Extraction

• Extract meaningful information from text

• Without fully understanding everything!

• Basic idea:
  – Define domain-specific templates
  – Simple and reliable linguistic processing
  – Recognize known types of entities and relations
  – Fill templates with recognized information


Example

4 Apr. Dallas - Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported.

Event: tornado
Date: 4/3/97
Time: 19:15
Location: “northwest Dallas” : Texas : USA
Damage: “mobile homes” (2), “Texaco station” (1)
Injuries: none


Processing stages, from input text to filled template:

Input text
  4 Apr. Dallas – Early last evening, a tornado swept through northwest....

Tokenization & Tagging
  Early/ADV last/ADJ evening/NN:time ,/, a/DT tornado/NN:weather swept/VBD ...

Sentence Analysis
  Early last evening: adv-phrase:time
  a tornado: noun-group:subject
  swept: verb-group
  ...

Pattern Extraction
  tornado swept: Event: tornado
  through northwest Dallas: Loc: “northwest Dallas”
  causing extensive damage: Damage

Merging
  Early last evening, a tornado swept through northwest Dallas.
  The twister occurred without warning at about ....

Template Generation
  Event: tornado
  Date: 4/3/97
  Time: 19:15
  Location: “northwest Dallas” : Texas : USA
  ...


MUC: Message Understanding Conference

• “Competitive” conference with predefined tasks for research groups to address

• Tasks (MUC-7):
  – Named Entities: Extract typed entities from text
  – Equivalence Classes: Solving coreference
  – Attributes: Fill in attributes of entities
  – Facts: Extract logical relations between entities
  – Events: Extract descriptions of events from text


Tokenization & Tagging

• Tokenization & POS tagging

• Also lexical semantic information, such as “time”, “location”, “weather”, “person”, etc.

Sentence Analysis

• Shallow parsing for phrase types

• Use tagging & semantics to tag phrases

• Note phrase heads


Pattern Extraction

• Find domain-specific relations between text units

• Typically use lexical triggers and relation-specific patterns to recognize relations

Concept: Damaged-Object
Trigger: destroyed
Position: direct-object
Constraints: physical-thing

... and [ destroyed ] [ two mobile homes ]
Damaged-Object = “two mobile homes”
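
A minimal sketch of how a concept definition like the one above might be applied. It assumes the sentence-analysis stage hands over each clause as a list of (syntactic role, text, semantic class) triples. The names `ConceptNode`, `apply_concept`, and the clause representation are illustrative assumptions, not taken from any system named in these slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A clause is a list of (syntactic role, text, semantic class) triples,
# a stand-in for the output of the sentence-analysis stage (assumed format).
Clause = List[Tuple[str, str, str]]

@dataclass
class ConceptNode:
    concept: str     # slot to fill, e.g. "Damaged-Object"
    trigger: str     # word that activates the pattern
    position: str    # syntactic role that holds the filler
    constraint: str  # required semantic class of the filler

def apply_concept(node: ConceptNode, clause: Clause) -> Optional[str]:
    """Return the filler text if the trigger fires and the constraint holds."""
    has_trigger = any(role == "verb" and text == node.trigger
                      for role, text, _ in clause)
    if not has_trigger:
        return None
    for role, text, sem_class in clause:
        if role == node.position and sem_class == node.constraint:
            return text
    return None

damaged = ConceptNode("Damaged-Object", "destroyed",
                      "direct-object", "physical-thing")
clause = [("verb", "destroyed", "event"),
          ("direct-object", "two mobile homes", "physical-thing")]
print(damaged.concept, "=", apply_concept(damaged, clause))
```

Keeping the trigger, position, and constraint as data rather than code means a new domain only needs new `ConceptNode` entries, which is the property the learning methods on the later slides try to exploit.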


Learning Extraction Patterns

• Very difficult to predefine extraction patterns

• Must be redone for each new domain

• Hence, corpus-based approaches are indicated

• Some methods:
  – AutoSlog (1992) – “syntactic” learning
  – PALKA (1995) – “conceptual” learning
  – CRYSTAL (1995) – covering algorithm


AutoSlog (Lehnert 1992)

• Patterns based on recognizing “concepts”
  – Concept: what concept to recognize
  – Trigger: a word indicating an occurrence
  – Position: what syntactic role the concept will take in the sentence
  – Constraints: what type of entity to allow
  – Enabling conditions: constraints on the linguistic context


• Concept: Event-Time

• Trigger: “at”

• Position: prep-phrase-object

• Constraints: time

• Enabling conditions: post-verb

The twister occurred without warning at about 7:15 pm and destroyed two mobile homes.

Event-Time = 19:15
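
The Event-Time concept above can be approximated with a regular expression, as a rough sketch. The trigger "at", the time constraint, and the post-verb enabling condition come from the slide; the function name `event_time`, the caller-supplied verb set, and the 24-hour normalization are assumptions made for illustration.

```python
import re
from typing import Optional

# Trigger "at", optionally hedged by "about", followed by a clock time.
TIME_AFTER_AT = re.compile(
    r"\bat\s+(?:about\s+)?(\d{1,2}):(\d{2})\s*(am|pm)\b", re.IGNORECASE)

def event_time(sentence: str, verbs: set) -> Optional[str]:
    """Extract a 24-hour Event-Time if the 'at' phrase follows a known verb."""
    match = TIME_AFTER_AT.search(sentence)
    if match is None:
        return None
    # Enabling condition (post-verb): some known verb precedes the trigger.
    preceding = re.findall(r"[a-z]+", sentence[:match.start()].lower())
    if not any(word in verbs for word in preceding):
        return None
    hour = int(match.group(1))
    minute = match.group(2)
    meridiem = match.group(3).lower()
    if meridiem == "pm" and hour != 12:
        hour += 12
    elif meridiem == "am" and hour == 12:
        hour = 0
    return f"{hour:02d}:{minute}"

sentence = ("The twister occurred without warning at about 7:15 pm "
            "and destroyed two mobile homes.")
print("Event-Time =", event_time(sentence, {"occurred", "destroyed"}))
```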


Learning Patterns

• Supervised: training data is text annotated with the items to be extracted from it

• Knowledge: 13 general syntactic patterns

• Algorithm:
  – Find a sentence containing a target noun phrase, e.g. “two mobile homes”
  – Partially parse the sentence to find syntactic relations
  – Try all linguistic patterns to find a match
  – Generate a concept pattern from the match


Linguistic Patterns

• Identify domain-specific thematic roles based on syntactic structure

active-voice-verb followed by target=direct object

Concept = target concept
Trigger = verb of active-voice-verb
Position = direct-object
Constraints = semantic-class of target
Enabling conditions = active-voice
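
The step of turning a matched linguistic pattern into a concept definition might look roughly like this. The sketch assumes a hypothetical partial parse given as a dictionary with `subject`, `verb`, `voice`, and `direct-object` keys, and it implements only two patterns rather than the full set of 13. The second pattern (target as subject of a passive verb) and its "passive-voice" enabling condition are assumptions added for contrast.

```python
from typing import Callable, Dict, List, Optional

Parse = Dict[str, str]    # hypothetical partial-parse output
Concept = Dict[str, str]

def active_verb_dobj(parse: Parse, target: str) -> Optional[Concept]:
    """Pattern: active-voice-verb followed by target = direct object."""
    if parse.get("voice") == "active" and parse.get("direct-object") == target:
        return {"Trigger": parse["verb"], "Position": "direct-object",
                "Enabling conditions": "active-voice"}
    return None

def subj_passive_verb(parse: Parse, target: str) -> Optional[Concept]:
    """Pattern (assumed): target = subject of a passive-voice verb."""
    if parse.get("voice") == "passive" and parse.get("subject") == target:
        return {"Trigger": parse["verb"], "Position": "subject",
                "Enabling conditions": "passive-voice"}
    return None

LINGUISTIC_PATTERNS: List[Callable[[Parse, str], Optional[Concept]]] = [
    active_verb_dobj, subj_passive_verb]

def propose_concept(parse: Parse, target: str, target_concept: str,
                    target_class: str) -> Optional[Concept]:
    """Try each linguistic pattern in turn; build a concept from the first match."""
    for pattern in LINGUISTIC_PATTERNS:
        match = pattern(parse, target)
        if match is not None:
            return {"Concept": target_concept,
                    "Constraints": target_class, **match}
    return None

parse = {"subject": "the twister", "verb": "destroyed",
         "voice": "active", "direct-object": "two mobile homes"}
node = propose_concept(parse, "two mobile homes",
                       "Damaged-Object", "physical-thing")
print(node)
```

Because the proposal is purely syntactic, a pattern with an uninformative trigger can be generated just as easily as a useful one, so the output still needs filtering.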


More Examples

– victim was murdered

– perpetrator bombed

– perpetrator attempted to kill

– was aimed at target

• Some bad extraction patterns occur (e.g., “is” as a trigger)

• Human review process


CRYSTAL

• Complex syntactic patterns

• Use “covering” algorithm:
  – Generate most specific possible patterns for all occurrences of targets in corpus
  – Loop:
    • Find most specific unifier of the most similar patterns C & C’, generating new pattern P
    • If P has less than ε error on corpus, replace C and C’ with P
    • Continue until no new patterns can be added
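
The covering loop above can be sketched in a much simplified form. This is not CRYSTAL itself: patterns here are flat dictionaries of feature constraints, the "unifier" of two patterns is just the set of constraints they share, and similarity is the number of shared constraints. All names (`covering_sketch`, `unify`, `error_rate`) and the toy corpus are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pattern = Dict[str, str]
Instance = Tuple[Dict[str, str], bool]  # (features, is it a true target?)

def covers(pattern: Pattern, features: Dict[str, str]) -> bool:
    return all(features.get(k) == v for k, v in pattern.items())

def error_rate(pattern: Pattern, corpus: List[Instance]) -> float:
    """Fraction of the instances covered by the pattern that are not targets."""
    covered = [label for feats, label in corpus if covers(pattern, feats)]
    if not covered:
        return 1.0
    return covered.count(False) / len(covered)

def unify(a: Pattern, b: Pattern) -> Pattern:
    """Keep only the constraints both patterns share."""
    return {k: v for k, v in a.items() if b.get(k) == v}

def covering_sketch(corpus: List[Instance], epsilon: float) -> List[Pattern]:
    # Start from the most specific pattern for every target occurrence.
    patterns = [dict(feats) for feats, label in corpus if label]
    merged = True
    while merged:
        merged = False
        # Consider pairs from most to least similar.
        pairs = sorted(
            combinations(range(len(patterns)), 2),
            key=lambda ij: len(unify(patterns[ij[0]], patterns[ij[1]])),
            reverse=True)
        for i, j in pairs:
            candidate = unify(patterns[i], patterns[j])
            if candidate and error_rate(candidate, corpus) < epsilon:
                patterns = [p for k, p in enumerate(patterns)
                            if k not in (i, j)]
                if candidate not in patterns:
                    patterns.append(candidate)
                merged = True
                break
    return patterns

corpus = [
    ({"verb": "destroyed", "role": "direct-object", "class": "physical-thing"}, True),
    ({"verb": "damaged", "role": "direct-object", "class": "physical-thing"}, True),
    ({"verb": "destroyed", "role": "direct-object", "class": "abstract"}, False),
]
print(covering_sketch(corpus, epsilon=0.1))
```

Each accepted merge shrinks the pattern set by at least one, so the loop terminates; the ε threshold is what stops generalization before a pattern starts covering non-targets.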


Merging

Motor Vehicles International Corp. announced a major management shake-up ... MVI said the CEO has resigned ... The Big 10 auto maker is attempting to regain market share ... It will announce losses ... A company spokesman said they are moving their operations ... MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers...


Coreference Resolution

• Many different kinds of linguistic phenomena:
  – Proper names
  – Aliases (MVI)
  – Definite NPs (the Big 10 auto maker)
  – Pronouns (it, they)
  – Appositives (, the first company to ...)

• Errors of previous phases may be amplified


Learning to Merge

• Treat coreference as a classification task
  – Should this pair of entities be linked?

• Methodology:
  – Training corpus: manually link all coreferential expressions
  – Each possible pair is a training example: if they are linked it is positive, if not, it is negative
  – Create a feature vector for each example
  – Use your favorite learning algorithm
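
The pair-generation step of this methodology could be sketched as follows. The mentions come from the MVI passage; the manual annotation is modeled as a hypothetical `chain_of` mapping from mention to entity id, and the two toy features are placeholders for illustration, not the features of any system discussed here.

```python
from itertools import combinations
from typing import Dict, List, Tuple

PRONOUNS = {"it", "they", "he", "she"}

def make_pairs(mentions: List[str],
               chain_of: Dict[str, int]) -> List[Tuple[str, str, bool]]:
    """Every pair of mentions is an example; positive if manually linked."""
    return [(a, b, chain_of[a] == chain_of[b])
            for a, b in combinations(mentions, 2)]

def features(a: str, b: str) -> List[int]:
    """Toy feature vector: [lexical overlap?, is the later mention a pronoun?]."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    return [int(bool(words_a & words_b)), int(b.lower() in PRONOUNS)]

mentions = ["Motor Vehicles International Corp.", "MVI", "the CEO",
            "The Big 10 auto maker", "It"]
# Hypothetical manual annotation: entity 0 is the company, entity 1 the CEO.
chain_of = {"Motor Vehicles International Corp.": 0, "MVI": 0,
            "the CEO": 1, "The Big 10 auto maker": 0, "It": 0}

examples = make_pairs(mentions, chain_of)
dataset = [(features(a, b), label) for a, b, label in examples]
print(len(dataset), "examples,",
      sum(label for _, label in dataset), "positive")
```

The resulting `(feature vector, label)` pairs can be handed to any classifier. Note that the number of pairs grows quadratically with the number of mentions, and negatives usually dominate.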


MLR (1995)

• 66 features were used, in 4 categories:
  – Lexical features of each phrase, e.g., do they overlap?
  – Grammatical role of each phrase, e.g., subject, direct-object
  – Semantic classes of each phrase, e.g., physical-thing, company
  – Relative positions of the phrases, e.g., X one sentence after Y

• Decision-tree learning (C4.5)


C4.5

• Incrementally build decision-tree from labeled training examples

• At each stage choose “best” attribute to split dataset
  – E.g., use info-gain to compare features

• After building complete tree, prune the leaves to prevent overfitting
  – Use statistical tests to determine if enough examples are in leaf bins; if not, prune!
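
As a small worked example of the attribute-selection step, the sketch below computes plain information gain for categorical attributes. C4.5 itself normalizes this into a gain ratio; the code shows only the info-gain comparison named on the slide. The function names and the four-example dataset are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, str], str]  # (attribute values, class label)

def entropy(labels: List[str]) -> float:
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(examples: List[Example], attribute: str) -> float:
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [label for _, label in examples]
    partitions: Dict[str, List[str]] = {}
    for feats, label in examples:
        partitions.setdefault(feats[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(examples: List[Example], attributes: List[str]) -> str:
    return max(attributes, key=lambda a: info_gain(examples, a))

examples = [
    ({"f1": "yes", "f2": "yes"}, "C1"),
    ({"f1": "yes", "f2": "no"}, "C1"),
    ({"f1": "no", "f2": "yes"}, "C2"),
    ({"f1": "no", "f2": "no"}, "C2"),
]
print(best_attribute(examples, ["f1", "f2"]))
```

Here `f1` separates the two classes perfectly while `f2` tells nothing about the class, so `f1` is chosen for the first split.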


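C4.5

[Figure: example decision tree. The root holds 40 training examples and splits on f1 into subsets of 25 and 15 examples. These are split in turn on f2 and f3, giving leaves with 7, 18, 2 and 13 training examples, labeled C1, C2, C2 and C1.]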


RESOLVE (1995)

• C4.5 with 8 complex features:
  – NAME-{1,2}: does reference include a name?
  – JV-CHILD-{1,2}: does reference refer to part of a joint venture?
  – ALIAS: does one reference contain an alias for the other?
  – BOTH-JV-CHILD: do both refer to part of a joint venture?
  – COMMON-NP: do both contain a common NP?
  – SAME-SENTENCE: are both in the same sentence?


Decision Tree

[Figure: learned decision tree rooted at the test COMMON-NP?, with further yes / no / unknown tests on BOTH-JV-CHILD, ALIAS, NAME-2, JV-CHILD-2 and SAME-SENTENCE, and leaves labeled COREFERENCE or NOT-COREF.]


RESOLVE Results

• 50 texts, leave-one-out cross-validation:

System     Recall   Precision
Unpruned   85.4%    87.6%
Pruned     80.1%    92.4%
Manual     67.7%    94.4%


Full System: FASTUS (1996)

Input Text → Pattern Recognition → Partial Templates → Coreference Resolution → Template Merger → Output Template


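Pattern Recognition

• Multiple passes of finite-state methods

John Smith, 47, was named president of ABC Corp.

Basic tags: John Smith = Pers-Name; 47 = Num; was = Aux; named = V; president = N; of = P; ABC Corp. = Org-Name
Groups: was named = V-Group; president of ABC Corp. = Poss-N-Group
Whole sentence: Domain-Event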
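
A rough sketch of the "multiple passes" idea: each pass rewrites a sequence of tags into a larger unit, and later passes work on the output of earlier ones. The toy lexicon, the rewrite rules, and the function names are assumptions chosen to reproduce the labels on this slide; they are not the actual FASTUS grammar.

```python
import re
from typing import List

# Toy lexicon for the first pass (assumed for this one sentence).
LEXICON = {"John": "PersName", "Smith": "PersName", "47": "Num",
           "was": "Aux", "named": "V", "president": "N", "of": "P",
           "ABC": "OrgName", "Corp.": "OrgName"}

def collapse_runs(tags: List[str], basic: str, label: str) -> List[str]:
    """Merge each maximal run of `basic` tags into a single `label`."""
    out: List[str] = []
    for tag in tags:
        if tag == basic:
            if not out or out[-1] != label:
                out.append(label)
        else:
            out.append(tag)
    return out

def rewrite(tags: List[str], pattern: List[str], label: str) -> List[str]:
    """Replace every occurrence of the tag sequence `pattern` with `label`."""
    out: List[str] = []
    i, n = 0, len(pattern)
    while i < len(tags):
        if tags[i:i + n] == pattern:
            out.append(label)
            i += n
        else:
            out.append(tags[i])
            i += 1
    return out

def recognize(sentence: str) -> List[str]:
    tokens = re.findall(r"[A-Za-z]+\.?|\d+", sentence)
    tags = [LEXICON.get(tok, "Other") for tok in tokens]
    # Pass 1: names
    tags = collapse_runs(tags, "PersName", "Pers-Name")
    tags = collapse_runs(tags, "OrgName", "Org-Name")
    # Pass 2: verb and noun groups
    tags = rewrite(tags, ["Aux", "V"], "V-Group")
    tags = rewrite(tags, ["N", "P", "Org-Name"], "Poss-N-Group")
    # Pass 3: domain event
    tags = rewrite(tags, ["Pers-Name", "Num", "V-Group", "Poss-N-Group"],
                   "Domain-Event")
    return tags

print(recognize("John Smith, 47, was named president of ABC Corp."))
```

Each pass is a simple left-to-right scan, which is why a cascade of this kind stays fast and predictable even though no full parse is ever built.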


Partially-Instantiated Templates

Start:
  Person: _______
  Pos: President
  Org: ABC Corp.

End:
  Person: John Smith
  Pos: President
  Org: ABC Corp.

Domain-Dependent!!


The Next Sentence...

He replaces Mike Jones.

Start:
  Person: Mike Jones
  Pos: ________
  Org: ________

End:
  Person: John Smith
  Pos: ________
  Org: ________

Coreference analysis: He = John Smith


Unification

Unify new template with preceding template(s), if possible...

Start:
  Person: Mike Jones
  Pos: President
  Org: ABC Corp.

End:
  Person: John Smith
  Pos: President
  Org: ABC Corp.
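
Slot-wise unification of two partial templates can be sketched as below, assuming a template is a dictionary whose unfilled slots hold `None`. The function name `unify` and this representation are illustrative assumptions. Two templates unify when no slot has conflicting fillers.

```python
from typing import Dict, Optional

Template = Dict[str, Optional[str]]  # None marks an unfilled slot

def unify(a: Template, b: Template) -> Optional[Template]:
    """Merge two templates slot by slot; return None on conflicting fillers."""
    merged: Template = {}
    for slot in a.keys() | b.keys():
        x, y = a.get(slot), b.get(slot)
        if x is not None and y is not None and x != y:
            return None
        merged[slot] = x if x is not None else y
    return merged

# Start state from the first sentence, then from "He replaces Mike Jones."
start_first = {"Person": None, "Pos": "President", "Org": "ABC Corp."}
start_next = {"Person": "Mike Jones", "Pos": None, "Org": None}
print(unify(start_first, start_next))

# Different people in the same slot cannot be merged.
print(unify({"Person": "John Smith"}, {"Person": "Mike Jones"}))
```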


Principle of Least Commitment

• Idea: Maintain options as long as possible

• E.g., parsing – maintain a lattice structure:

The committee heads announced that...

[Figure: tag lattice over the sentence. Labels: DT; NN1; NN2 or VBZ (alternative tags kept for “heads”); VBD; CSub; N-GRP; Event]

Event: Announce
Actor: Committee heads


Principle of Least Commitment

• Idea: Maintain options as long as possible

• E.g., parsing – maintain a lattice structure:

The committee heads ABC’s recruitment effort.

[Figure: tag lattice over the sentence. Labels: DT; NN1; NN2 or VBZ (alternative tags kept for “heads”); NNpos; NN; N-GRP; Event]

Event:
  Head: Committee
  Effort: ABC’s recruitment


More Least Commitment

• Maintain multiple coreference hypotheses:
  – Disambiguate when creating domain-events
  – More information available

• Too many possibilities?
  – Use beam search algorithm: maintain k ‘best’ hypotheses at every stage
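
A minimal sketch of beam search over coreference hypotheses, assuming each stage offers a list of (choice, score) options and that scores add (for example, log-probabilities). The function name `beam_search`, the candidate antecedents, and all scores are made up for illustration.

```python
import heapq
from typing import List, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (total score, choices so far)

def beam_search(stages: List[List[Tuple[str, float]]],
                k: int) -> List[Hypothesis]:
    """Extend every kept hypothesis with every option, then keep the k best."""
    beam: List[Hypothesis] = [(0.0, ())]
    for options in stages:
        expanded = [(score + s, path + (choice,))
                    for score, path in beam
                    for choice, s in options]
        beam = heapq.nlargest(k, expanded, key=lambda h: h[0])
    return beam

# Hypothetical scored antecedent choices for two pronouns.
stages = [
    [("It=MVI", -0.2), ("It=the CEO", -1.6)],
    [("they=MVI", -0.4), ("they=unionized workers", -1.1)],
]
for score, path in beam_search(stages, k=2):
    print(round(score, 2), path)
```

With k = 1 this degenerates to greedy commitment at every stage; a larger k keeps more alternatives alive until later stages supply the information needed to choose.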