Graphical models for structure extraction and information integration

Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita
Information Extraction (IE) & Integration

The Extraction task: Given
– E: a set of structured elements
– S: an unstructured source
extract all instances of E from S.
• Many versions involving many source types
• Actively researched in varied communities
• Several tools and techniques
• Several commercial applications

The Integration task: Also given
– a database of existing inter-linked entities,
resolve which extracted entities already exist, and insert appropriate links and entities.
IE from free format text

• Classical Named Entity Recognition
– Extract person, location, organization names

"According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms."

Several applications
– News tracking
  • Monitor events
– Bio-informatics
  • Protein and gene names from publications
– Customer care
  • Part number, problem description from emails in help centers
Text segmentation

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
→ Author | Year | Title | Journal | Volume | Page

4089 Whispering Pines, Nobel Drive, San Diego, CA 92122
→ House number | Building | Road | City | State | Zip
Personal Information Systems

– Automatically add a BibTeX entry for a paper I download
– Integrate a resume arriving in email with the candidates database

[Diagram: sources and entities such as People, Papers, Projects, Emails, Web, Files, Resumes linked together.]
History of approaches

• Manually-developed set of scripts
– Tedious, lots and lots of special cases
– Needs continuous refinement as new cases arise
– Ad hoc ways of combining a varied set of clues
– Example: wrappers; OK for regular tasks
• Learning-based approaches (lots!)
– Rule-based (Whisk, Rapier, etc.), 1980s
  • Brittle
– Statistical
  • Generative: HMMs, 1990s
    – Intuitive but not too flexible
  • Conditional models with flexible feature sets, 2000s
Basic chain model for extraction

t:  1   2      3   4         5     6        7   8   9
x:  My  review of  Fermat's  last  theorem  by  S.  Singh
y:  Other Other Other Title Title Title Other Author Author

Independent model: each label yt is predicted separately.
Features

• The word as-is
• Orthographic word properties
– Capitalized? Digit? Ends-with-dot?
• Part of speech
– Noun?
• Match in a dictionary
– Appears in a dictionary of people names?
– Appears in a list of stop-words?

• Fire these for each label, for
– the token itself,
– W tokens to the left or right, or
– a concatenation of tokens.
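The feature families listed above can be sketched as a small extractor. This is a minimal illustration, not the talk's actual feature set: the feature names, the toy dictionaries, and the window handling are all assumptions.

```python
# Toy token-level feature extraction for a chain model.
# Dictionaries and feature names are illustrative assumptions.
PEOPLE_NAMES = {"singh", "fagin", "ullman"}   # toy name dictionary
STOP_WORDS = {"of", "the", "by", "a"}         # toy stop-word list

def token_features(tokens, i, window=1):
    """Features for token i, plus copies fired for tokens within a window."""
    feats = {}

    def add(prefix, tok):
        feats[f"{prefix}:word={tok.lower()}"] = 1.0
        feats[f"{prefix}:capitalized"] = float(tok[:1].isupper())
        feats[f"{prefix}:digit"] = float(tok.isdigit())
        feats[f"{prefix}:ends_with_dot"] = float(tok.endswith("."))
        feats[f"{prefix}:in_name_dict"] = float(tok.lower().rstrip(".,") in PEOPLE_NAMES)
        feats[f"{prefix}:stop_word"] = float(tok.lower() in STOP_WORDS)

    add("w0", tokens[i])
    for d in range(1, window + 1):
        if i - d >= 0:
            add(f"w-{d}", tokens[i - d])
        if i + d < len(tokens):
            add(f"w+{d}", tokens[i + d])
    return feats

toks = "My review of Fermat's last theorem by S. Singh".split()
feats = token_features(toks, 7)   # features for the token "S."
```

In a CRF these feature values would additionally be paired with candidate labels, so each (feature, label) combination gets its own weight.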
Basic chain model for extraction (contd.)

Instead of labeling each position independently, the chain model defines a global conditional model over Pr(y1, y2, …, y9 | x).
Outline

Graphical models for extraction
– Chain models: basic extraction (word-level)
– Associative Markov Networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)

Graphical models for extraction + integration
– Segmentation models: matching with entity databases
– Constrained models: integrating with multiple tables
Undirected graphical models

• Joint probability distribution of multiple variables expressed compactly as a graph
• Discrete variables over a finite set of labels, e.g. {Author, Title, Other}

Example: variables y1…y5, where
– y3 is directly dependent on y4
– y3 is independent of y1 and y5 given y2 & y4

The joint probability distribution factorizes over the cliques c of the graph:

  Pr(y1, …, y5) = (1/Z) ∏_c ψ_c(y_c)

where each ψ_c is a potential function over clique c and Z is the normalizing constant.
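The clique factorization Pr(y) = (1/Z) ∏_c ψ_c(y_c) can be tried concretely on a toy model. This is an illustrative sketch: the three-variable chain, the label set, and the potential values are all made up.

```python
import itertools

# Toy undirected model: a 3-node chain with pairwise cliques {y1,y2}, {y2,y3}.
LABELS = ["Author", "Title", "Other"]
EDGES = [(0, 1), (1, 2)]

def psi(a, b):
    # Associative pairwise potential: adjacent variables prefer equal labels.
    return 2.0 if a == b else 1.0

def unnormalized(y):
    p = 1.0
    for i, j in EDGES:
        p *= psi(y[i], y[j])
    return p

# The normalizing constant Z sums the potential product over all assignments.
Z = sum(unnormalized(y) for y in itertools.product(LABELS, repeat=3))

def prob(y):
    return unnormalized(y) / Z
```

Enumerating all assignments is only feasible for tiny graphs; the inference algorithms discussed later exploit the graph structure to avoid it.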
Conditional Random Fields (CRFs)
(Lafferty et al., ICML 2001)

Model the probability of a set of labels y given the observed variables x:

  Pr(y | x) = (1/Z(x)) ∏_c ψ_c(y_c, x)

Form of the potentials (log-linear):

  ψ_c(y_c, x) = exp( Σ_k λ_k f_k(y_c, x) )

where the f_k are numeric features and the λ_k are model parameters.
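The log-linear potential exp(Σ_k λ_k f_k) is straightforward to evaluate. A minimal sketch, where the weights and the (feature, label) names are fabricated for illustration; in a trained CRF the λ_k come from maximum-likelihood training:

```python
import math

# Hypothetical weights for (feature, label) combinations.
weights = {"capitalized&Author": 1.5, "stop_word&Other": 2.0}

def potential(active_features):
    """exp(sum_k lambda_k * f_k) for a dict of feature name -> value."""
    return math.exp(sum(weights.get(k, 0.0) * v
                        for k, v in active_features.items()))
```

Features with zero weight, or features that do not fire, contribute nothing to the exponent, so an empty feature set yields a potential of 1.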
Inference on graphical models

• Probability of an assignment of variables
• Most likely (MAP) assignment of variables
• Marginal probability of a subset of variables
All naively involve sums over exponentially many terms.

Message passing
• Efficient two-pass dynamic programming algorithm for graphs without cycles
– Viterbi is a special case for chains
• Cyclic graphs
– Approximate answer after convergence, or
– Transform cliques to nodes in a junction tree
• Alternatives to message passing
– Exploit structure of potentials to design special algorithms (two examples in this talk)
– Upper bound using one or more trees
– MCMC sampling
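The two-pass dynamic program for chains, Viterbi, can be sketched directly. The scoring functions below are illustrative log-potentials, not trained weights; only the algorithm itself is standard.

```python
def viterbi(node_score, edge_score, n, labels):
    """Most likely label sequence for a chain of length n.
    node_score(t, y) and edge_score(y_prev, y) are log-potentials."""
    best = [{y: node_score(0, y) for y in labels}]   # forward pass
    back = []
    for t in range(1, n):
        cur, bp = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[-1][yp] + edge_score(yp, y))
            cur[y] = best[-1][prev] + edge_score(prev, y) + node_score(t, y)
            bp[y] = prev
        best.append(cur)
        back.append(bp)
    y = max(labels, key=lambda lbl: best[-1][lbl])   # best final label
    path = [y]
    for bp in reversed(back):                        # backward pass
        y = bp[y]
        path.append(y)
    return list(reversed(path))

# Toy scores: the first two positions look like Author, the rest like Other;
# same-label transitions get a small bonus.
labels = ["Author", "Other"]
node = lambda t, y: 1.0 if (t < 2) == (y == "Author") else 0.0
edge = lambda yp, y: 0.5 if yp == y else 0.0
path = viterbi(node, edge, 4, labels)
```

Each position only consults the previous position's scores, which is what makes the algorithm linear in sequence length (times the squared label count).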
Dependency graph

• Assume only word-level matches: edges link repeated occurrences of a word (e.g. "synthase" and "eNos"/"eNOS" in "nitric oxide synthase eNos … with … synthase interaction eNOS")
• Approximate message passing
• Sample results (Bunescu et al., ACL 2004)
– Protein names from Medline abstracts: F1 65% → 68%
– Person names, organization names, etc. from news articles: F1 80% → 82%
Associative Markov Networks

• Consider a simpler graph
• Binary labels
• Only associative edges
– Higher potential when the same label is assigned to both endpoints
• Exact inference in polynomial time via mincut (Greig et al., 1989)
• Multi-class: metric labeling, approximate algorithm with guarantees (Kleinberg, 1999)
Factorial CRFs: multiple linked chains

• Several synchronized inter-dependent tasks
– POS tagging, noun-phrase chunking, entity extraction
• Cascading the tasks propagates errors
• Joint models avoid this

[Figure: two linked chains, POS labels and IE labels, over "i saw mr. ray canning at the market".]
Inference with multiple chains

• Graph has cycles; exact inference is most likely intractable
• Two alternatives
– Approximate message passing
– Upper-bound marginals (piecewise training): treat each edge potential as an independent training instance
• Results (F1) on combined noun-phrase + POS labeling:
– Piecewise training: 88%, faster
– Belief propagation: 86%
(Sutton et al., ICML 2004; McCallum et al., EMNLP/HLT 2005)
Outline (contd.): Graphical models for extraction + integration
– Segmentation models: matching with entity databases
– Constrained models: integrating with multiple tables
Conventional Extraction Research

[Diagram: labeled unstructured text trains a model; the model extracts entities from unstructured texts 1, 2, and 3 independently.]

Data integration

[Diagram: training additionally exploits a linked entity database; entities extracted from unstructured texts 1, 2, and 3 are integrated with the existing data.]
Goals of integration

• Exploit the database to improve extraction
– The entity might already exist in the database
• Integrate extracted entities, resolving whether each entity is already in the database
– If existing, create links
– If not, create a new entry
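The resolve-or-insert step can be sketched with a toy version of the Authors table used later in the talk. This is a deliberately crude illustration: the similarity measure (difflib's ratio) and the threshold are assumptions, not the matching machinery the talk actually uses.

```python
import difflib

# Toy Authors table: id -> name, plus variant-to-canonical links.
authors = {11: "M Y Vardi", 2: "J. Ullman", 3: "Ron Fagin", 4: "Jeffrey Ullman"}
canonical = {2: 4, 3: 3, 4: 4}

def resolve_or_insert(name, threshold=0.8):
    """Link an extracted name to its canonical entry, or insert a new one."""
    best_id, best_sim = None, 0.0
    for eid, existing in authors.items():
        sim = difflib.SequenceMatcher(None, name.lower(), existing.lower()).ratio()
        if sim > best_sim:
            best_id, best_sim = eid, sim
    if best_sim >= threshold:                  # matched: follow canonical link
        return canonical.get(best_id, best_id)
    new_id = max(authors) + 1                  # no match: create a new entry
    authors[new_id] = name
    canonical[new_id] = new_id
    return new_id
```

The real systems discussed here instead learn the match decision jointly with extraction, precisely because fixed thresholds like this one are brittle.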
Database: normalized, stores noisy variants; variant entries link to canonical entries. Three top-level entity tables (Articles, Journals, Authors) plus a Writes link table:

Articles
Id | Title            | Year | Journal | Canonical
2  | Update Semantics | 1983 | 10      |

Journals
Id | Name                 | Canonical
10 | ACM TODS             |
17 | AI                   | 17
16 | ACM Trans. Databases | 10

Writes
Article | Author
2       | 11
2       | 2
2       | 3

Authors
Id | Name           | Canonical
11 | M Y Vardi      |
2  | J. Ullman      | 4
3  | Ron Fagin      | 3
4  | Jeffrey Ullman | 4
Input: "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see"

Word-level labeling:
t:  1   2      3    4   5         6       7          8
x:  R.  Fagin  and  J.  Helpbern  Belief  Awareness  Reasoning
y:  Author Author Other Author Author Title Title Title

Features describe the single word "Fagin".
Segmentation models (Semi-CRFs)

Label segments instead of single words:

segments: (l1=1, u1=2)  (l2=u2=3)  (l3=4, u3=5)  (l4=6, u4=8)
x:        R. Fagin      and        J. Helpbern   Belief Awareness Reasoning
y:        Author        Other      Author        Title

Features describe the whole segment from l to u, e.g. similarity of the segment to the author column in the database.
Graphical models for segmentation

• Graph has many cycles
– Clique size = maximum segment length
• Two kinds of potentials
– Transition potentials: only across adjacent nodes
– Segment potentials: require all positions in a segment to have the same label
• Exact inference possible in time linear in sequence length times the maximum segment length (Cohen & Sarawagi, 2004)
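The linear-in-length-times-max-segment-length inference for segmentation models can be sketched as a semi-Markov Viterbi: a dynamic program over segment boundaries rather than single positions. The segment scoring function below (dictionary lookup of the whole segment) is an illustrative assumption.

```python
def semi_viterbi(score, n, labels, max_len):
    """Best segmentation of n tokens; score(l, u, y) is the log-potential of
    labeling tokens l..u (inclusive) with label y."""
    best = [0.0] + [float("-inf")] * n       # best[i]: best score of tokens 0..i-1
    back = [None] * (n + 1)
    for u in range(1, n + 1):
        for l in range(max(0, u - max_len), u):   # segment start candidates
            for y in labels:
                s = best[l] + score(l, u - 1, y)
                if s > best[u]:
                    best[u], back[u] = s, (l, y)
    segments, u = [], n                      # recover the segmentation
    while u > 0:
        l, y = back[u]
        segments.append((l, u - 1, y))
        u = l
    return list(reversed(segments))

# Toy segment scorer: whole-segment match against a tiny author dictionary.
tokens = "R. Fagin and J. Helpbern".split()
AUTHOR_DICT = {"r. fagin", "j. helpbern"}

def score(l, u, y):
    seg = " ".join(tokens[l:u + 1]).lower()
    if y == "Author":
        return 2.0 if seg in AUTHOR_DICT else -1.0
    return 0.5 if l == u else -1.0           # "Other" prefers single tokens

segs = semi_viterbi(score, len(tokens), ["Author", "Other"], max_len=3)
```

Because the inner loop only looks back max_len positions, the total work is O(n * max_len * |labels|), matching the complexity claim above.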
Effect of database on extraction performance

PersonalBib   L     L+DB   %Δ
author        75.7  79.5   4.9
journal       33.9  50.3   48.6
title         61.0  70.3   15.1

Address       L     L+DB   %Δ
city_name     72.4  76.7   6.0
state_name    13.9  33.2   138.5
zipcode       91.6  94.3   3.0

L = only labeled structured data
L + DB = similarity to database entities and other DB features
(from Mansuri et al., ICDE 2006)
After integration, the database gains new variant entries (ids 7, 8, 9):

Articles
Id | Title                        | Year | Journal | Canonical
2  | Update Semantics             | 1983 | 10      |
7  | Belief, awareness, reasoning | 1988 | 17      |

Journals
Id | Name                 | Canonical
10 | ACM TODS             |
17 | AI                   | 17
16 | ACM Trans. Databases | 10

Writes
Article | Author
2       | 11
2       | 2
2       | 3
7       | 8
7       | 9

Authors
Id | Name           | Canonical
11 | M Y Vardi      |
2  | J. Ullman      | 4
3  | Ron Fagin      | 3
4  | Jeffrey Ullman | 4
8  | R Fagin        | 3
9  | J Helpern      | 8
Input: "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see"

Extraction: Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 1988
Integration: match with existing linked entities while respecting all constraints.
Only extraction vs. combined extraction + integration

Input: "CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI"

Only extraction:
Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 2000
Year mismatch! The database already stores article 7 ("Belief, awareness, reasoning", journal AI) with year 1988.

Combined extraction + integration:
Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning in AI | Journal: CACM | Year: 2000
Combined extraction + matching

• Convert the predicted label into a pair y = (a, r), where r is the id of the matching database entity
• r = 0 means none-of-the-above, i.e. a new entry

Example segmentation:
segments: (l1=1, u1=2)   (l2=u2=3)  (l3=4, u3=8)
x:        CACM. 2000     Fagin      Belief Awareness Reasoning In AI
y:        Journal, Year  Author     Title
r:        0, 7           3          7

Constraints exist on the ids that can be assigned to two segments.
Constrained models

• Training
– Ignore constraints, or use max-margin methods that require only MAP estimates
• Application
– Formulate as a constrained integer programming problem (expensive), or
– Use general A* search to find the most likely constrained assignment
Full integration performance

PersonalBib   L     L+DB   %Δ
author        70.8  74.0   4.5
journal       29.6  45.5   53.6
title         51.6  65.0   25.9

Address       L     L+DB   %Δ
city_name     70.1  74.6   6.4
state_name    9.0   28.3   213.8
pincode       87.8  90.7   3.3

• L = conventional extraction + matching
• L + DB = technology presented here
• Much higher accuracies possible with more training data
(from Mansuri et al., ICDE 2006)
What next in data integration?

• Lots to be done in building large-scale, viable data integration systems
• Online collective inference
– Cannot freeze the database
– Cannot batch too many inferences
– Need theoretically sound, practical alternatives to exact, batch inference
• Performance of integration (Chandel et al., ICDE 2006)
• Other operations
– Data standardization
– Schema management
Probabilistic Querying Systems

• Integration systems, while improving, cannot be perfect, particularly for domains like the web
• User supervision of each integration result is impossible
• Create uncertainty-aware storage and querying engines. Two enablers:
– Probabilistic database querying engines over generic uncertainty models
– Conditional graphical models produce well-calibrated probabilities
Probabilities in CRFs are well-calibrated

[Calibration plots on the Cora citations and Cora headers datasets: probability of a segmentation vs. probability it is correct, both close to the ideal diagonal. E.g. segmentations predicted with probability 0.5 are correct about 50% of the time.]
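A calibration check of the kind plotted here can be computed by binning predictions by confidence and comparing each bin's empirical accuracy to its midpoint. The sketch below is generic; the (probability, correct) pairs in the example are fabricated for illustration, not Cora results.

```python
def calibration_bins(preds, n_bins=5):
    """preds: (predicted probability, was the prediction correct) pairs.
    Returns (bin midpoint, empirical accuracy) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in preds:
        bins[min(int(p * n_bins), n_bins - 1)].append(ok)
    # Well-calibrated: accuracy in each bin is close to the bin's midpoint.
    return [((i + 0.5) / n_bins, sum(b) / len(b))
            for i, b in enumerate(bins) if b]

# Fabricated predictions: 0.9-confidence ones are right 90% of the time,
# 0.1-confidence ones 10% of the time, i.e. perfectly calibrated.
preds = [(0.9, True)] * 9 + [(0.9, False)] + [(0.1, False)] * 9 + [(0.1, True)]
bins = calibration_bins(preds)
```

Plotting midpoint against accuracy for a calibrated model traces the ideal diagonal shown on the slide.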
Uncertainty in integration systems

[Diagram: the model maps unstructured text to several alternative entity sets with probabilities p1, p2, …, pk, which feed a probabilistic database system; very uncertain cases are routed back as additional training data. Other, more compact models?]

Example queries over the probabilistic database:

Select conference name of article RJ03:
IEEE Intl. Conf. On Data Mining   0.8
Conf. On Data Mining              0.2

Find most cited author:
D Johnson   16000   0.6
J Ullman    13000   0.4
In summary

• Data integration provides scope for several interesting learning problems
• Probabilistic graphical models provide a robust, unified mechanism for exploiting a wide variety of clues and dependencies
• Many open research challenges remain in making graphical models work in practical settings