Speeding up Information Extraction Programs
Helena Galhardas
INESC-ID and IST, Universidade de Lisboa
Who am I?
• Assistant Professor at IST, University of Lisbon (http://www.ist.utl.pt)
• Researcher at INESC-ID (http://www.inesc-id.pt)
• Former (2000) PhD student at INRIA Rocquencourt/Univ. Versailles
• Main research interests:
  – Data Cleaning
    • Support for User Involvement in Data Cleaning, DaWaK 2011
    • Automatic optimization of the match operation (similarity join)
  – Medical Information Systems
    • A Machine Learning based Natural Language Interface for a Database of Medicines, poster paper DILS 2014
    • ProntApp: A Mobile Question Answering System for Medicines
  – Information Extraction
4/3/15
Agenda
• A Holistic Optimizer for Speeding up Information Extraction Programs
  – PVLDB'13 and presented at VLDB'14
  – Joint work with Gonçalo Simões, INESC-ID and IST/Univ. Lisboa, and Luis Gravano, Columbia University
• A Learning-based Approach to Rank Documents (briefly)
  – EDBT'15
  – Joint work with Pablo Barrio, Columbia University, Gonçalo Simões, and Luis Gravano
Information Extraction Discovers Structured Information in Text Documents

"The eruption of Grímsvötn during 22-25 May 2011 brought back the memories of the eruptions of Eyjafjallajökull in 2010."

• Natural Disaster: eruption of Grímsvötn; eruptions of Eyjafjallajökull
• Temporal Expression: 22-25 May 2011; 2010

An extraction plan (Extract Natural Disaster; Extract Temporal Expressions; Select Temporal Expressions from 2011; Extract Occurs-In) produces:

  Natural Disaster        Temporal Expression
  eruption of Grímsvötn   22-25 May 2011

Much richer querying and analysis
Information Extraction Specification and Execution Plans are Graphs of Operators
• Types of operator: EE (Entity Extractor) and RE (Relation Extractor)
• Each operator is associated with a technique and an algorithm
[Figure: an IE specification and a naïve execution plan, both shown as operator graphs]
Information Extraction is Challenging and Time-consuming
• Operates on a large set of features
• Relies on complex text analysis (e.g., lemmatization, character capitalization patterns, dependency parsing)
• Often applied over large document collections
[Figure: example features for "The eruption brought back the memories of the eruptions in 2010" — lemmatization features (eruption → eruption, brought → bring, memories → memory), character capitalization features (e.g., Cc+, N+.N+), and a dependency parse (det, nsubj, prt, dobj, prep_of)]
IE Implementation Choices to the Rescue
• These implementation choices have an impact on execution time and on the quality of results (recall and precision)
  – Recall: fraction of correct tuples that the plan produces
  – Precision: fraction of produced tuples that are correct
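To make the two measures concrete, here is a minimal sketch; the function name and the tuples are illustrative, not taken from the experiments:

```python
# Recall and precision of an extraction plan, given the tuples it
# produced and a gold standard of correct tuples.
def recall_precision(produced, gold):
    """produced, gold: sets of extracted tuples."""
    correct = produced & gold
    recall = len(correct) / len(gold) if gold else 0.0
    precision = len(correct) / len(produced) if produced else 0.0
    return recall, precision

# Example: the plan produced 3 tuples, 2 of which are in the gold set of 4.
gold = {("eruption of Grimsvotn", "22-25 May 2011"),
        ("cholera", "2010-2013"),
        ("fire", "Booneville"),
        ("tornado", "hawaii")}
produced = {("eruption of Grimsvotn", "22-25 May 2011"),
            ("cholera", "2010-2013"),
            ("volcano", "chile")}
r, p = recall_precision(produced, gold)  # r = 0.5, p ≈ 0.67
```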
Three implementation choices (for an example plan with operators A, B, C, D):
• Choosing an algorithm for each operator: each operator has alternative implementations (A1, A2, A3; B1, B2; C1; D1), e.g., an EE-CRF entity extractor decoding with Viterbi, VBS-10, or VBS-5, or an RE-SVM relation extractor running over all candidate pairs
• Choosing the operator execution order, e.g.:
  1. Execute A
  2. Find the sentences that produced results for A
  3. Execute B with those sentences only
• Choosing the document retrieval strategy: scan the whole collection, or query for relevant documents, e.g., ["natural disaster"], [earthquake], [eruption], [tornado], (…)
An execution plan fixes all three choices, e.g., A2, D1, C1, B1 over Query (210K docs).
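The operator-ordering idea above (run A, keep only the sentences that produced results, then run B on those) can be sketched as follows; the keyword-based extractors are stand-ins for illustration, not the CRF/SVM extractors used in the talk:

```python
import re

# Stand-in extractor A: detect natural-disaster mentions by keyword.
def extract_disasters(sentence):
    keywords = ("eruption", "earthquake", "tornado")
    return [w for w in keywords if w in sentence.lower()]

# Stand-in extractor B: detect 4-digit years as temporal expressions.
def extract_dates(sentence):
    return re.findall(r"\b(?:19|20)\d{2}\b", sentence)

def ordered_execution(sentences):
    results = []
    for s in sentences:                   # 1. execute A on every sentence
        disasters = extract_disasters(s)
        if not disasters:                 # 2. keep only sentences with A-results
            continue
        for date in extract_dates(s):     # 3. execute B on those sentences only
            for d in disasters:
                results.append((d, date))
    return results

sentences = [
    "The eruption of Grimsvotn in 2011 was widely reported.",
    "The interview was a lengthy one.",   # no disaster found, so B never runs
]
print(ordered_execution(sentences))  # [('eruption', '2011')]
```

Running the cheaper or more selective operator first prunes the input of the later operator, which is where the speedup comes from.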
State-of-the-art IE Optimization Systems
• SystemT and CIMPLE address the choice of operator execution order and of algorithms:
  – Both use a cost-based optimizer to choose the execution order of operators
  – SystemT uses heuristics to choose among different algorithms for an operator
  – Neither allows approximate results
• SQoUT addresses the choice of document retrieval strategy (query for relevant documents vs. scan the whole collection)
• An information extraction optimizer should make these implementation choices collectively!
Making Implementation Choices Collectively Leads to Faster Plans

  Plan                              Operators        Retrieval           Time     Recall  Precision
  Naive Plan                        A1, D1, C1, B1   Scan (1M docs)      13h 16m  100%    100%
  Operator Order Optimization       A1, D1, C1, B1   Scan (1M docs)      7h 30m   93%     100%
  Algorithms Optimization           A3, D1, C1, B2   Scan (1M docs)      9h 23m   91%     95%
  Document Retrieval Optimization   A1, D1, C1, B1   Query (153K docs)   2h 07m   50%     100%
  Collectively Optimized Plan       A3, D1, C1, B2   Query (210K docs)   1h 31m   50%     96%
  Direct Combination Plan           A3, D1, C1, B2   Query (153K docs)   57m      39%     98%

• By combining multiple choices we get a faster plan
• We can only control the quality of the results with collective optimization: directly combining the individually optimized choices makes it hard to control the expected recall and precision (e.g., against constraints such as Precision >= 90% and Recall >= 50%, the direct combination plan reaches only 39% recall)
Holistic Optimizer for Information Extraction
• Inputs:
  – Document collection
  – IE specification: a graph of operators (A, B, C, D)
  – Recall and precision constraints, e.g., Precision >= 90% and Recall >= 50%
• Enumeration of execution plans produces candidate plans (e.g., A3, D1, C1, B2 over Query; A2, D1, C1, B1 over Query (210K docs); …)
• A recall and precision predictor, parameterized from a document sample, estimates the time, recall, and precision of each candidate (e.g., 57m / 39% recall / 78% precision for one candidate; 1h 24m / 98% recall / 95% precision for another)
• The optimizer outputs the best plan that meets the constraints, e.g., A2, D1, C1, B1 over Query (210K docs): time 43m, recall 50%, precision 95%
Predictor Parameters Estimation
• Predictor parameters are determined from a document sample
• Three phases:
  1. Sampling to select the subset of input documents to use in the estimation (stratified sampling)
  2. Extraction to retrieve tuples from the sample using the plans produced by the enumeration step (dynamic programming)
  3. Estimation to determine the optimizer parameters using the results of 2) (maximum-a-posteriori estimation)
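Phase 1 (stratified sampling) can be sketched as follows; the stratification criterion used here (document length) and the helper names are assumptions for illustration, since the optimizer's actual strata are not spelled out in the talk:

```python
import random

def stratified_sample(docs, strata_of, per_stratum, seed=0):
    """Pick up to `per_stratum` documents from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for d in docs:                        # group documents by stratum key
        strata.setdefault(strata_of(d), []).append(d)
    sample = []
    for members in strata.values():
        rng.shuffle(members)              # random choice within each stratum
        sample.extend(members[:per_stratum])
    return sample

# Toy collection: short and long documents fall into different strata.
docs = [f"doc-{i}" * (i % 3 + 1) for i in range(30)]
sample = stratified_sample(docs, strata_of=lambda d: len(d) // 10, per_stratum=2)
```

Stratifying before sampling keeps every kind of document represented in the (small) sample, which makes the parameter estimates less biased than a uniform sample would be.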
Extraction Quality Prediction
• Based on two predictions for each operator:
  – Number of input tuples of the operator
  – Number of output tuples of the operator
[Figure: tuples t1, t2, t3, … flowing along alternative paths of the plan and through the alternative algorithms A1, A2, A3 of operator A]
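One plausible way to turn per-operator tuple predictions into a plan-level recall estimate (not the paper's actual estimator): if, for each operator/algorithm choice on the plan's path, we can predict the fraction of correct tuples that survive from input to output, the plan's recall is the product of those fractions. The retention values below are illustrative assumptions, not measured numbers:

```python
def predict_recall(retention_by_operator):
    """retention_by_operator: list of (operator_name, retained_fraction)."""
    recall = 1.0
    for _op, retained in retention_by_operator:
        recall *= retained      # each operator loses a fraction of correct tuples
    return recall

plan = [("A2", 0.90), ("D1", 0.80), ("C1", 0.95), ("B1", 0.85)]
print(round(predict_recall(plan), 3))  # → 0.581
```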
Experimental Validation
• Impact of the Sample Size and Prediction Quality
• Impact of the Parameter Estimation Strategy
• Impact of Individual Implementation Choices
• Impact of Precision Constraints
• Comparison with the State-of-the-Art Techniques
Experimental Setup
• Datasets: researchers' homepages and news articles
• Information Extraction Programs (over sparse and dense relations):
  – Advises — from Kedar Bellare's homepage: "PhD student at University of Massachusetts, Amherst under supervision of Andrew McCallum." → Advisor: Andrew McCallum, Advisee: Kedar Bellare
  – ConferencesDates — "Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment, EMNLP 2009, August 2009" → Conference: EMNLP 2009, Date: August 2009
  – DiseaseOutbreaks — "The Haiti cholera outbreak between 2010 and 2013 was the worst epidemic of cholera in recent history" → Disease: cholera, Time Period: between 2010 and 2013
  – OrganizationAffiliation — "Google co-founders Larry Page and Sergey Brin recently sat down with billionaire venture capitalist Vinod Khosla for a lengthy interview." → (Larry Page, Google), (Sergey Brin, Google)
Experimental Setup (cont.)
• IE Optimization Systems (baselines):
  – Cimple
  – System T
  – SQoUT
  – SQoUT-Boosted: variation of SQoUT that uses the execution plans produced by Cimple and System T
• IE Optimization Techniques:
  – Optimized-Ret: implementation choice of the retrieval strategy
  – Optimized-Alg: implementation choice of the algorithms
  – Optimized-Order: implementation choice of the operator execution order
  – Holistic-MAP: uses maximum a posteriori for estimating the parameters
  – Holistic-ML: uses maximum likelihood for estimating the parameters
Impact of Individual Implementation Choices (ConferencesDates)
• Holistic-MAP obtains the fastest plan in every case
• Optimized-Ret becomes too slow when the desired recall increases
• Optimized-Order and Optimized-Alg consistently produce slow plans
• Holistic optimization outperforms individual optimization
Comparison with State-of-the-Art Techniques (OrganizationAffiliation)
• For dense relations, Holistic-MAP:
  – Obtains faster plans
  – Closely matches recall constraints
Comparison with State-of-the-Art Techniques (DiseaseOutbreaks)
• For sparse relations, Holistic-MAP:
  – Obtains faster plans
  – Does not always match recall constraints
  – Can correct the errors in recall constraints as the target recall increases
Conclusions: Holistic Optimizer
• IE programs can be optimized along multiple interrelated dimensions, namely, the choice of:
  – Algorithms for each extraction operator
  – Operator execution order
  – Document retrieval strategy
• We presented the first holistic optimization approach for IE, which optimizes all dimensions collectively
  – It outperforms the state-of-the-art techniques and selects efficient plans that meet target recall and precision values
What Comes Next?
• Ranking Models for IE (second part of the talk): process documents in a ranked order rather than a random order
• Distributed Executions for IE (on-going work)
  – by determining document placement in a distributed file system
Agenda
• A Holistic Optimizer for Speeding up Information Extraction Programs
  – PVLDB'13 and presented at VLDB'14
  – Joint work with Gonçalo Simões and Luis Gravano
• A Learning-based Approach to Rank Documents
  – EDBT'15
  – Joint work with Pablo Barrio, Gonçalo Simões, and Luis Gravano
Reducing IE Processing Time
• Only a small, topic-specific fraction of the entire collection is useful
  Ex: only 2% of documents in a New York Times archive, mostly environment-related, are useful for Natural Disaster-Location with a state-of-the-art IE system
  → Should focus extraction over these documents and ignore the rest
• Useful documents share distinctive words and phrases
  Ex: "earthquake," "storm," "richter," "volcano eruption" for Natural Disaster-Location
  → Should learn to distinguish useful documents for an IE task
• The information extraction system labels documents as useful or not, for free
  → Could use the result of an IE process to generate an ever-expanding training set for learning and further identifying useful documents
• Documents are useful if they produce output for a given IE task
Existing Approaches: QXtract and FactCrawl
• QXtract and FactCrawl learn from a small document sample and so exhibit far-from-perfect recall
• FactCrawl re-ranks documents using the quality of the learned queries
• FactCrawl relies on queries derived from a small set of documents and does not adapt to newly processed documents
Our Approach
• Focus on the potentially useful documents for the extraction task at hand
  → A learning-to-rank approach for document ranking
• The results of the extraction process form an ever-expanding training set
  → An adaptive approach to update the document ranking continuously
• Features: words and phrases, and attribute values
[Figure: each document di in the collection receives a score f(di) = si, ranked so that s1 ≥ s2 ≥ s3 ≥ … ≥ sn; newly observed words (e.g., "lava", "fissure") become new training instances that feed back into learning]
Ranking Documents Adaptively for IE
• Learning: the ranker learns that "tornado" and "earthquake" are markers of useful documents
• Document processing and update detection: processing the ranked collection yields tuples such as <tornado, hawaii>; some useful documents are about volcanoes, not yet observed prominently in the IE process:
  – "… 'Aftermath' narrates the story of a man that goes missing…" (not useful)
  – "… Still recovering from an earthquake, Chile is threatened by the eruption of Copahue volcano…" (useful; yields <volcano, chile>)
• Online relearning: new information can potentially help improve the ranking, so update! The ranker now learns that "volcano" and "eruption" are also markers of useful documents
• Ranking adaptation: the system performs online learning and adapts the ranking accordingly
Ranking Documents Adaptively for IE: Key Ideas
• To address efficiency, rely on online learning
  – Train the ranking model incrementally, one document at a time
• To handle large (and expanding) feature sets, rely on in-training feature selection
  – The learning-to-rank algorithm can efficiently identify the most discriminative features during the training of the document ranking model
• We propose two learning-to-rank techniques for IE that integrate online learning and in-training feature selection: BAgg-IE and RSVM-IE
• And two update detection techniques for document ranking adaptation: Top-K and Mod-C
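The two key ideas can be sketched together: an online pairwise update (a useful document should outrank a useless one) plus in-training feature selection via weight truncation. This is only a minimal stand-in for the mechanism; BAgg-IE and RSVM-IE themselves use more elaborate update rules, and the feature names below are illustrative:

```python
def score(w, features):
    """Score a document (a list of features) under weight vector w."""
    return sum(w.get(f, 0.0) for f in features)

def update(w, useful, useless, lr=0.1, k=3):
    """One online step on a (useful, useless) document pair."""
    if score(w, useful) <= score(w, useless):   # pair is mis-ranked
        for f in useful:
            w[f] = w.get(f, 0.0) + lr
        for f in useless:
            w[f] = w.get(f, 0.0) - lr
    # In-training feature selection: keep only the k strongest weights.
    top = sorted(w, key=lambda f: abs(w[f]), reverse=True)[:k]
    return {f: w[f] for f in top}

w = {}
w = update(w, ["earthquake", "volcano"], ["interview", "movie"])
# Four features were touched, but truncation (k=3) keeps only three,
# so the model stays small as the feature set keeps expanding.
```

The truncation step is what makes the expanding feature set tractable: each document seen is one cheap update, and the model never grows beyond k weights.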
Experimental Validation
1. Impact of learning-to-rank approach
2. Impact of sampling strategies
3. Impact of adaptation
4. Impact of update detection
5. Scalability of our approach
6. Comparison with the state-of-the-art ranking strategies
Experimental Setup
• Dataset: 1.8 million articles from 1987-2007
• Information extraction systems:
  – Complex extraction systems: CRFs, SVM kernels
  – Simple extraction systems: HMMs, text patterns
• Relations, from dense to sparse:
  – Disease-Outbreaks: "The Haiti cholera outbreak between 2010 and 2013 was the worst epidemic of cholera in recent history" → Disease: cholera, Time Period: between 2010 and 2013
  – Person-Organization: "Google co-founders Larry Page and Sergey Brin recently sat down with billionaire venture capitalist Vinod Khosla for a lengthy interview." → (Larry Page, Google), (Sergey Brin, Google)
  – Person-Career: "'This is not a victimless crime,' said Jim Kendall, president of the Washington Association of Internet Service Providers." → Person: Jim Kendall, Career: President
  – Man Made Disaster-Location: "A fire destroyed a Cargill Meat Solutions beef processing plant in Booneville." → Disaster: fire, Location: Booneville
  – Other relations: Person-Charge, Election-Winner, Natural Disaster-Location
Recall Analysis (Disease-Outbreaks; update detection: Mod-C)
• Our techniques bring a significant improvement over our adaptive implementation of the state of the art
• RSVM-IE performs best, as it prioritizes useful documents better, favoring adaptation
Extraction Time (Person-Organization Affiliation)
• Our techniques improve the efficiency of the process even for inexpensive IE systems, again compared against our adaptive implementation of the state of the art
Conclusion: Document Ranking for Scalable Information Extraction
• Running an IE system over large text collections is computationally expensive
• We proposed a lightweight, adaptive approach and learning-based alternatives:
  – Online learning algorithms with in-training feature selection: RSVM-IE, BAgg-IE
  – Update detection based on feature changes: Mod-C, Top-K
  – RSVM-IE + Mod-C performs best: useful documents are better prioritized, enabling richer, more efficient ranking adaptation
[Figure: an IE system over a text collection producing tuples such as <tornado, Florida>, <volcano, Chile>, …]
Future Work: Ranking at Different Granularities
• Few collections on the Web are relevant to a given IE task
  – Prioritize them based on the number of useful documents they contain
• Few sentences in a text document output tuples for a given IE task
  – Prioritize them based on usefulness and diversity