Automatic Acquisition of Lexical Classes and Extraction Patterns
for Information Extraction
Kiyoshi Sudo
Ph.D. Research Proposal
New York University
Committee:
Ralph Grishman
Satoshi Sekine
I. Dan Melamed
August 9, 2002, Kiyoshi Sudo, Thesis Proposal Presentation
Outline
Introduction
Research Proposal
– Problem Setting
– Approach
– Application to Information Extraction
Discussion
MUC Scenario Template Task
MURREE, Pakistan (AP) -- Masked gunmen firing Kalashnikov rifles burst through the front gates of a Christian school Monday, killing six people and wounding three in the latest attack against Western interests since Pakistan joined the war against terrorism.
Template slots: Date, Perpetrator, Weapon, Victim, Location
MUC Scenario Template Task (filled)
Date: Monday
Perpetrator: Masked gunmen
Weapon: Kalashnikov rifles
Victim: six people; three
Location: a Christian school
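The filled template above can be held in a small record type. This is only an illustrative sketch: the class name `TerrorismTemplate` and its field names are assumptions made here, mirroring the five slots on the slide; they are not taken from the MUC specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TerrorismTemplate:
    """One scenario template; slot names mirror the slide (hypothetical type)."""
    date: Optional[str] = None
    perpetrator: Optional[str] = None
    weapon: Optional[str] = None
    victims: List[str] = field(default_factory=list)  # a slot may take several fills
    location: Optional[str] = None

    def filled_slots(self) -> List[str]:
        """Names of the slots that received at least one fill."""
        return [name for name, value in vars(self).items() if value]


# Fills taken from the news excerpt on the slide.
template = TerrorismTemplate(
    date="Monday",
    perpetrator="Masked gunmen",
    weapon="Kalashnikov rifles",
    victims=["six people", "three"],
    location="a Christian school",
)
print(template.filled_slots())
```

An empty `TerrorismTemplate()` corresponds to the unfilled template: every slot is present but none has a fill.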
High Cost for Acquiring Knowledge-Base
Find extraction patterns
– Find relevant documents
– Find relevant events
– Analyze sentences
Find domain-specific lexicon
– Find existing KB (e.g. thesaurus, gazetteers)
Prior Work
Automatic Knowledge Acquisition
Lexical Acquisition, Pattern Acquisition
Mutual Bootstrapping (Riloff and Jones 1999)
Simultaneous Multi-Semantic Class (Thelen and Riloff 2002) (Yangarber et al. 2002)
Pattern Discovery with Document Re-ranking (Yangarber et al. 2000)
Pattern Acquisition for QA (Ravichandran and Hovy 2002)
Challenge
[Diagram labels: User, Seed Lexicon, Seed Pattern, Expanded Lexicon, Expanded Pattern Set, Knowledge Base]
MUC-3: Terrorism Event
Date, Type, Perpetrator-Individual, Perpetrator-Org, Physical Target, Physical Target-Num, Physical Target-Type, Human Target, Human Target-Num, …
Meeting the Challenge
[Diagram labels: User, Seed Lexicon, Seed Pattern, Expanded Lexicon, Expanded Pattern Set, Knowledge Base, Scenario Description, Semantic Clustering, Semantic Cluster]
Semantic Clustering
[Diagram labels: Scenario Description, Semantic Clustering, Semantic Cluster (Semantic Lexicon, Extraction Patterns)]
Input:
– Description specific enough to define the scenario
– (terrorism, bombing, kidnapping)
– “Tell me about the terrorism action, such as bombing and kidnapping.”
Goal:
– Find scenario-specific Semantic Clusters, each of which consists of
– Semantic Lexicon
– Extraction Patterns
Benefit for User
[Diagram labels: Scenario Description, Semantic Clustering, Semantic Cluster]
Simplify Domain Analysis
Low-cost Knowledge-base Acquisition for IE systems
Extraction Patterns
Definition: a set of patterns context(w, c) : L,
where c unifies with the context that is defined by semantic class L
context =
Case Frame: (bomb (v), x (subj), himself (obj))
Sequential: (x, bombs, himself)
Dependency: bomb -(V:subj)-> x, bomb -(V:obj)-> himself
(cf. Sudo et al. 2001)
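To make the dependency form of a pattern concrete, the sketch below matches a pattern with one open slot x against a list of dependency triples. The `Dependency` and `Pattern` types and the example parse are assumptions introduced for illustration; they are not the representation used in the proposal.

```python
from typing import List, NamedTuple


class Dependency(NamedTuple):
    head: str       # lemma of the governing word, e.g. "bomb"
    relation: str   # relation label, e.g. "V:subj"
    dependent: str  # lemma of the dependent word


class Pattern(NamedTuple):
    head: str
    relation: str   # the slot x sits at the dependent end of this relation


def extract(pattern: Pattern, parse: List[Dependency]) -> List[str]:
    """Return every word that fills the slot x of `pattern` in `parse`."""
    return [d.dependent for d in parse
            if d.head == pattern.head and d.relation == pattern.relation]


# Hypothetical parse of "The man bombs himself."
parse = [
    Dependency("bomb", "V:subj", "man"),
    Dependency("bomb", "V:obj", "himself"),
]
print(extract(Pattern("bomb", "V:subj"), parse))  # ['man']
```

A pattern that constrains several arguments at once, such as (x, bombs, himself), would need to check more than one relation per head; the single-relation case is kept here for brevity.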
Outline
Introduction
Research Proposal
– Problem Setting
– Approach
– Information Extraction
Evaluation
Overview
[Diagram labels: Scenario Description, Source, Semantic Clustering (Information Retrieval, Bootstrapping, Query Expansion), Semantic Cluster]
Information Retrieval
Get relevant document set R
Get list of lexical items and extraction patterns ordered by relevance to the scenario
– TF/IDF scoring
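The slide does not give the exact TF/IDF formula, so the following is a sketch under an assumption: each item (a word or a pattern rendered as a string) is scored by its frequency in the relevant set R times the log inverse document frequency over the whole collection. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(relevant: List[List[str]],
                 collection: List[List[str]]) -> Dict[str, float]:
    """Score each item in R by tf(R) * log(N / df(collection)). Assumed variant."""
    tf = Counter(item for doc in relevant for item in doc)
    df = Counter(item for doc in collection for item in set(doc))
    n_docs = len(collection)
    return {item: count * math.log(n_docs / df[item])
            for item, count in tf.items() if df[item] > 0}


def ranked(scores: Dict[str, float]) -> List[str]:
    """Items ordered from most to least relevant."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection; the first two documents play the role of the relevant set R.
collection = [
    ["president", "named", "company"],
    ["president", "resign", "company"],
    ["company", "years"],
    ["company", "sales"],
    ["named", "sales"],
]
scores = tfidf_scores(collection[:2], collection)
print(ranked(scores))  # ['president', 'resign', 'named', 'company']
```

Items that occur in most documents of the collection, such as "company" here, end up near the bottom of the list even when they are frequent in R.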
Example of TF/IDF scoring (Management Succession: Business)
300 documents retrieved from WSJ (7/94 - 8/94)
Extracted by MINIPAR (Lin 1998)

w (N,V)      TF/IDF
president    0.3897
officer      0.3835
named        0.3832
executive    0.3273
Mr.          0.2587
chairman     0.2214
vice         0.2186
years        0.1800
company      0.1606
Inc.         0.1605

p                          TF/IDF
(succeed, V:obj:N, x)      0.3435
(x, N:title:, Mr.)         0.3311
(succeed, V:subj:N, x)     0.3167
(name, V:obj:N, x)         0.3141
(name, V:subj:N, x)        0.3069
(name, V:iobj:N, x)        0.2454
(resign, V:subj:N, x)      0.1920
(as, Prep:pcomp-n:N, x)    0.1118
(retire, V:subj:N, x)      0.0915
(remain, V:subj:N, x)      0.0861
Overview
[Diagram labels: Scenario Description, Source, Semantic Clustering (Information Retrieval, Bootstrapping, Query Expansion), Semantic Cluster, extraction patterns, lexicon]
Bootstrapping
Assumption:
– Patterns provide Lexical Classes.
– Lexicon provides contextual information.
(Riloff and Jones 1999; Agichtein and Gravano 2000)
Find one cluster that consists of Lexicon and Extraction Patterns
Bootstrapping (Cont.)
Algorithm (cf. Riloff and Jones 1999)
– Given
  the ordered list of terms
  the ordered list of extraction patterns
  Lexicon = (), Pattern = ()
– w ← the most relevant term in the list; add it to Lexicon
1. p ← the most relevant pattern among those that extract w
2. Add p to Pattern
3. w ← the most relevant term among those that are extracted by p
4. Add w to Lexicon
5. Go to 1
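The loop above can be sketched as follows. The slide gives no stopping rule, so this sketch assumes two additions that are not in the original: items already selected are skipped, and the loop stops when no new pattern or term is available or after `max_iterations` rounds. The `extracts` mapping (pattern to extracted terms) is hypothetical toy data; the scores are taken from the TF/IDF example.

```python
from typing import Dict, List, Set, Tuple


def bootstrap(term_score: Dict[str, float],
              pattern_score: Dict[str, float],
              extracts: Dict[str, Set[str]],
              max_iterations: int = 10) -> Tuple[List[str], List[str]]:
    """Alternate between picking a pattern and picking a term, as on the slide."""
    lexicon: List[str] = []
    patterns: List[str] = []
    # Initial step: seed the lexicon with the most relevant term.
    w = max(term_score, key=lambda t: term_score[t])
    lexicon.append(w)
    for _ in range(max_iterations):  # assumed stopping rule
        # Step 1: most relevant unused pattern among those that extract w.
        candidates = [p for p, terms in extracts.items()
                      if w in terms and p not in patterns]
        if not candidates:
            break
        p = max(candidates, key=lambda c: pattern_score.get(c, 0.0))
        patterns.append(p)  # Step 2
        # Step 3: most relevant new term among those extracted by p.
        new_terms = [t for t in extracts[p] if t not in lexicon]
        if not new_terms:
            break
        w = max(new_terms, key=lambda t: term_score.get(t, 0.0))
        lexicon.append(w)  # Step 4, then back to step 1
    return lexicon, patterns


term_score = {"president": 0.3897, "officer": 0.3835, "executive": 0.3273,
              "chairman": 0.2214, "company": 0.1606}
pattern_score = {"(succeed, V:obj:N, x)": 0.3435,
                 "(name, V:obj:N, x)": 0.3141,
                 "(resign, V:subj:N, x)": 0.1920}
extracts = {  # hypothetical: which terms each pattern extracts
    "(succeed, V:obj:N, x)": {"president", "chairman"},
    "(name, V:obj:N, x)": {"president", "officer", "chairman"},
    "(resign, V:subj:N, x)": {"chairman", "executive"},
}
lexicon, patterns = bootstrap(term_score, pattern_score, extracts)
print(lexicon)
print(patterns)
```

On this toy data the cluster grows from "president" through "(succeed, V:obj:N, x)" to "chairman", then through "(name, V:obj:N, x)" to "officer", and stops because no unused pattern extracts "officer".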
Example of Bootstrapping (Management Succession: Business)
From WSJ (7/94 - 8/94)
Extracted by MINIPAR (Lin 1998)

w (N,V)      TF/IDF
president    0.3897
officer      0.3835
named        0.3832
executive    0.3273
Mr.          0.2587
chairman     0.2214
vice         0.2186
years        0.1800
company      0.1606
Inc.         0.1605

p                          TF/IDF
(succeed, V:obj:N, x)      0.3435
(x, N:title:, Mr.)         0.3311
(succeed, V:subj:N, x)     0.3167
(name, V:obj:N, x)         0.3141
(name, V:subj:N, x)        0.3069
(name, V:iobj:N, x)        0.2454
(resign, V:subj:N, x)      0.1920
(as, Prep:pcomp-n:N, x)    0.1118
(retire, V:subj:N, x)      0.0915
(remain, V:obj:N, x)       0.0861
Problem: Polysemous Lexicon, Pattern
Lexicon can be ambiguous
– e.g. Clinton (Person, Organization, Location … )
Extraction patterns can be ambiguous
– e.g. be killed in <x> (x: Location, Date … )
Needs more study
– more restriction
– Probabilistic Model ??
Overview
[Diagram labels: Scenario Description, Source, Semantic Clustering (Information Retrieval, Bootstrapping, Query Expansion), Semantic Cluster, pattern, lexicon, pt, lex]
Query Expansion
Generalize terms in a query with a newly discovered cluster
– cf. Rocchio 1971 (Vector model)
– Zhai and Lafferty 2001 (Language-modeling)
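As one possible reading of the vector-model option, the sketch below applies the positive-feedback part of the Rocchio update, new query = alpha * query + beta * centroid(feedback vectors), to sparse term-weight vectors. How a discovered cluster is turned into feedback vectors is not specified on the slide; the vectors, the term "attack", and the values of `alpha` and `beta` are assumptions for illustration, and the negative-feedback term of the full formula is omitted.

```python
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]


def rocchio_expand(query: Vector, feedback: List[Vector],
                   alpha: float = 1.0, beta: float = 0.75) -> Vector:
    """Return alpha * query + beta * centroid(feedback) as a sparse vector."""
    expanded: Dict[str, float] = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight
    for vector in feedback:
        for term, weight in vector.items():
            # Dividing by len(feedback) averages the feedback vectors.
            expanded[term] += beta * weight / len(feedback)
    return dict(expanded)


query = {"terrorism": 1.0, "bombing": 1.0}
# Hypothetical vectors built from the terms of a newly discovered cluster.
cluster = [{"bombing": 1.0, "kidnapping": 1.0},
           {"kidnapping": 1.0, "attack": 1.0}]
expanded = rocchio_expand(query, cluster)
print(expanded)
```

Terms that appear only in the cluster ("kidnapping", "attack") enter the query with a lower weight than the original query terms, so the expanded query generalizes without drifting far from the scenario description.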
Application to Information Extraction
[Diagram labels: Scenario Description, Semantic Clustering, Semantic Cluster, Semantic Lexicon, Extraction Patterns, Pattern Matching, Preprocessing, Entity Recognition, Event Recognition, Role Assignment, Merging]
Human Intervention
Extraction patterns
– Event pattern
  Context contains a verb or nominalization of a verb
  Used for event extraction and role assignment
  e.g. (terrorist, fire, x)
– Local pattern
  Context contains only enough information to recognize semantic class
  Used for entity recognition only
  e.g. (x, Inc.)
Association of Event Pattern to Role
– e.g. (company, hire, x) → PersonIn and (company, fire, x) → PersonOut
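The association of event patterns to roles can be pictured as a hand-made lookup table that the role-assignment step consults. This is a sketch only: the table layout, the function `assign_roles`, the small lexicon, and the example triples are all hypothetical; only the two pattern-to-role pairs come from the slide.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object) from a parsed sentence

# Human-supplied association; the slot x is the object in both patterns.
ROLE_OF_PATTERN: Dict[Tuple[str, str], str] = {
    ("company", "hire"): "PersonIn",
    ("company", "fire"): "PersonOut",
}


def assign_roles(triples: List[Triple],
                 lexicon: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each object filler to the role of the event pattern it matched."""
    roles: Dict[str, List[str]] = {}
    for subject, verb, obj in triples:
        subject_class = lexicon.get(subject, "")  # semantic class, "" if unknown
        role = ROLE_OF_PATTERN.get((subject_class, verb))
        if role is not None:
            roles.setdefault(role, []).append(obj)
    return roles


# Hypothetical lexicon entries and parsed triples.
lexicon = {"IBM": "company", "Acme": "company"}
triples = [("IBM", "hire", "Smith"),
           ("Acme", "fire", "Jones"),
           ("Smith", "visit", "Paris")]
print(assign_roles(triples, lexicon))
# {'PersonIn': ['Smith'], 'PersonOut': ['Jones']}
```

The third triple matches no event pattern and is ignored, which is the intended behavior for sentences outside the scenario.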
Discussion
Domain Portability
– User only needs to specify the scenario
Language Portability
– Language-dependent Tools
  Segmentation (Lemmatization)
  Dependency Parsing
Evaluation
MUC-style (Scenario-Template task)
– Slot-based
  Precision, Recall, F-measure
– Domain Portability
  Several pre-defined tasks that differ in difficulty
– Language Portability
  Japanese, English
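For reference, slot-based precision, recall, and F-measure can be computed as below. This sketch assumes exact matching of (slot, fill) pairs and the balanced F-measure; the official MUC scorer is more elaborate (for example, it handles partial matches), and the system output shown is invented for illustration. The answer key reuses the fills from the template example.

```python
from typing import Set, Tuple

SlotFill = Tuple[str, str]  # (slot name, fill)


def slot_scores(system: Set[SlotFill],
                key: Set[SlotFill]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and balanced F-measure over slot fills."""
    correct = len(system & key)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


key = {("Date", "Monday"), ("Perpetrator", "Masked gunmen"),
       ("Weapon", "Kalashnikov rifles"), ("Victim", "six people"),
       ("Victim", "three"), ("Location", "a Christian school")}
# Hypothetical system output: three correct fills and one wrong location.
system = {("Date", "Monday"), ("Perpetrator", "Masked gunmen"),
          ("Victim", "six people"), ("Location", "Pakistan")}
precision, recall, f_measure = slot_scores(system, key)
print(precision, recall, round(f_measure, 2))  # 0.75 0.5 0.6
```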
Contribution
Tool for Domain Analysis
Low-cost Knowledge-base Acquisition
Towards Open-domain Information Extraction
Conclusion
Proposed New Approach for Knowledge-base Acquisition (Semantic Clustering)
Discussed Application of Acquired KB to Information Extraction (Human Intervention and Local vs. Event patterns)
Discussed Evaluation with several predefined MUC-style tasks different in difficulty and across languages (Domain portability and Language portability)
ToDo
Implementation
Preparation for Evaluation
Evaluation
Time for Questions