Graphical models for structure extraction and information integration

Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita
Information Extraction (IE) & Integration

The Extraction task: Given
– E: a set of structured elements
– S: an unstructured source
extract all instances of E from S.
• Many versions involving many source types
• Actively researched in varied communities
• Several tools and techniques
• Several commercial applications

The Integration task: Also given
– a database of existing inter-linked entities,
resolve which extracted entities already exist, and insert appropriate links and entities.
IE from free format text

• Classical Named Entity Recognition
– Extract person, location, organization names

"According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms."

Several applications
– News tracking
  • Monitor events
– Bio-informatics
  • Protein and gene names from publications
– Customer care
  • Part number, problem description from emails in help centers
Text segmentation

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
→ Author | Year | Title | Journal | Volume | Page

4089 Whispering Pines, Nobel Drive, San Diego, CA 92122
→ House number | Building | Road | City | State | Zip
Personal Information Systems

– Automatically add a BibTeX entry for a paper I download
– Integrate a resume arriving in email with the candidates database

[Diagram: sources and entities such as People, Papers, Projects, Emails, Web, Files, Resumes linked together.]
History of approaches

• Manually-developed set of scripts
– Tedious, lots and lots of special cases
– Needs continuous refinement as new cases arise
– Ad hoc ways of combining a varied set of clues
– Example: wrappers; OK for regular tasks
• Learning-based approaches (lots!)
– Rule-based (Whisk, Rapier, etc.), 1980s
  • Brittle
– Statistical
  • Generative: HMMs, 1990s
    – Intuitive but not too flexible
  • Conditional models with flexible feature sets, 2000s
Basic chain model for extraction

t:  1   2      3   4         5     6        7   8   9
x:  My  review of  Fermat's  last  theorem  by  S.  Singh
y:  Other Other Other Title Title Title Other Author Author

Independent model: each label yt is predicted separately.
Features

• The word as-is
• Orthographic word properties
– Capitalized? Digit? Ends-with-dot?
• Part of speech
– Noun?
• Match in a dictionary
– Appears in a dictionary of people names?
– Appears in a list of stop-words?

• Fire these for each label, for
– the token itself,
– W tokens to the left or right, or
– a concatenation of tokens.
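The feature families listed above can be sketched as a small extractor. This is a minimal illustration, not the talk's actual feature set: the feature names, the toy dictionaries, and the window handling are all assumptions.

```python
# Toy token-level feature extraction for a chain model.
# Dictionaries and feature names are illustrative assumptions.
PEOPLE_NAMES = {"singh", "fagin", "ullman"}   # toy name dictionary
STOP_WORDS = {"of", "the", "by", "a"}         # toy stop-word list

def token_features(tokens, i, window=1):
    """Features for token i, plus copies fired for tokens within a window."""
    feats = {}

    def add(prefix, tok):
        feats[f"{prefix}:word={tok.lower()}"] = 1.0
        feats[f"{prefix}:capitalized"] = float(tok[:1].isupper())
        feats[f"{prefix}:digit"] = float(tok.isdigit())
        feats[f"{prefix}:ends_with_dot"] = float(tok.endswith("."))
        feats[f"{prefix}:in_name_dict"] = float(tok.lower().rstrip(".,") in PEOPLE_NAMES)
        feats[f"{prefix}:stop_word"] = float(tok.lower() in STOP_WORDS)

    add("w0", tokens[i])
    for d in range(1, window + 1):
        if i - d >= 0:
            add(f"w-{d}", tokens[i - d])
        if i + d < len(tokens):
            add(f"w+{d}", tokens[i + d])
    return feats

toks = "My review of Fermat's last theorem by S. Singh".split()
feats = token_features(toks, 7)   # features for the token "S."
```

In a CRF these feature values would additionally be paired with candidate labels, so each (feature, label) combination gets its own weight.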
Basic chain model for extraction (contd.)

Instead of labeling each position independently, the chain model defines a global conditional model over Pr(y1, y2, …, y9 | x).
Outline

Graphical models for extraction
– Chain models: basic extraction (word-level)
– Associative Markov Networks: collective labeling
– Dynamic CRFs: two labelings (POS, extraction)
– 2-D CRFs: layout-driven extraction (web)

Graphical models for extraction + integration
– Segmentation models: matching with entity databases
– Constrained models: integrating with multiple tables
Undirected graphical models

• Joint probability distribution of multiple variables expressed compactly as a graph
• Discrete variables over a finite set of labels, e.g. {Author, Title, Other}

Example: variables y1…y5, where
– y3 is directly dependent on y4
– y3 is independent of y1 and y5 given y2 & y4

The joint probability distribution factorizes over the cliques c of the graph:

  Pr(y1, …, y5) = (1/Z) ∏_c ψ_c(y_c)

where each ψ_c is a potential function over clique c and Z is the normalizing constant.
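The clique factorization Pr(y) = (1/Z) ∏_c ψ_c(y_c) can be tried concretely on a toy model. This is an illustrative sketch: the three-variable chain, the label set, and the potential values are all made up.

```python
import itertools

# Toy undirected model: a 3-node chain with pairwise cliques {y1,y2}, {y2,y3}.
LABELS = ["Author", "Title", "Other"]
EDGES = [(0, 1), (1, 2)]

def psi(a, b):
    # Associative pairwise potential: adjacent variables prefer equal labels.
    return 2.0 if a == b else 1.0

def unnormalized(y):
    p = 1.0
    for i, j in EDGES:
        p *= psi(y[i], y[j])
    return p

# The normalizing constant Z sums the potential product over all assignments.
Z = sum(unnormalized(y) for y in itertools.product(LABELS, repeat=3))

def prob(y):
    return unnormalized(y) / Z
```

Enumerating all assignments is only feasible for tiny graphs; the inference algorithms discussed later exploit the graph structure to avoid it.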
Conditional Random Fields (CRFs)
(Lafferty et al., ICML 2001)

Model the probability of a set of labels y given the observed variables x:

  Pr(y | x) = (1/Z(x)) ∏_c ψ_c(y_c, x)

Form of the potentials (log-linear):

  ψ_c(y_c, x) = exp( Σ_k λ_k f_k(y_c, x) )

where the f_k are numeric features and the λ_k are model parameters.
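The log-linear potential exp(Σ_k λ_k f_k) is straightforward to evaluate. A minimal sketch, where the weights and the (feature, label) names are fabricated for illustration; in a trained CRF the λ_k come from maximum-likelihood training:

```python
import math

# Hypothetical weights for (feature, label) combinations.
weights = {"capitalized&Author": 1.5, "stop_word&Other": 2.0}

def potential(active_features):
    """exp(sum_k lambda_k * f_k) for a dict of feature name -> value."""
    return math.exp(sum(weights.get(k, 0.0) * v
                        for k, v in active_features.items()))
```

Features with zero weight, or features that do not fire, contribute nothing to the exponent, so an empty feature set yields a potential of 1.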
Inference on graphical models

• Probability of an assignment of variables
• Most likely (MAP) assignment of variables
• Marginal probability of a subset of variables
All naively involve sums over exponentially many terms.

Message passing
• Efficient two-pass dynamic programming algorithm for graphs without cycles
– Viterbi is a special case for chains
• Cyclic graphs
– Approximate answer after convergence, or
– Transform cliques to nodes in a junction tree
• Alternatives to message passing
– Exploit structure of potentials to design special algorithms (two examples in this talk)
– Upper bound using one or more trees
– MCMC sampling
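The two-pass dynamic program for chains, Viterbi, can be sketched directly. The scoring functions below are illustrative log-potentials, not trained weights; only the algorithm itself is standard.

```python
def viterbi(node_score, edge_score, n, labels):
    """Most likely label sequence for a chain of length n.
    node_score(t, y) and edge_score(y_prev, y) are log-potentials."""
    best = [{y: node_score(0, y) for y in labels}]   # forward pass
    back = []
    for t in range(1, n):
        cur, bp = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[-1][yp] + edge_score(yp, y))
            cur[y] = best[-1][prev] + edge_score(prev, y) + node_score(t, y)
            bp[y] = prev
        best.append(cur)
        back.append(bp)
    y = max(labels, key=lambda lbl: best[-1][lbl])   # best final label
    path = [y]
    for bp in reversed(back):                        # backward pass
        y = bp[y]
        path.append(y)
    return list(reversed(path))

# Toy scores: the first two positions look like Author, the rest like Other;
# same-label transitions get a small bonus.
labels = ["Author", "Other"]
node = lambda t, y: 1.0 if (t < 2) == (y == "Author") else 0.0
edge = lambda yp, y: 0.5 if yp == y else 0.0
path = viterbi(node, edge, 4, labels)
```

Each position only consults the previous position's scores, which is what makes the algorithm linear in sequence length (times the squared label count).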
Dependency graph

• Assume only word-level matches: edges link repeated occurrences of a word (e.g. "synthase" and "eNos"/"eNOS" in "nitric oxide synthase eNos … with … synthase interaction eNOS")
• Approximate message passing
• Sample results (Bunescu et al., ACL 2004)
– Protein names from Medline abstracts: F1 65% → 68%
– Person names, organization names, etc. from news articles: F1 80% → 82%
Associative Markov Networks

• Consider a simpler graph
• Binary labels
• Only associative edges
– Higher potential when the same label is assigned to both endpoints
• Exact inference in polynomial time via mincut (Greig et al., 1989)
• Multi-class: metric labeling, approximate algorithm with guarantees (Kleinberg, 1999)
Factorial CRFs: multiple linked chains

• Several synchronized inter-dependent tasks
– POS tagging, noun-phrase chunking, entity extraction
• Cascading the tasks propagates errors
• Joint models avoid this

[Figure: two linked chains, POS labels and IE labels, over "i saw mr. ray canning at the market".]
Inference with multiple chains

• Graph has cycles; exact inference is most likely intractable
• Two alternatives
– Approximate message passing
– Upper-bound marginals (piecewise training): treat each edge potential as an independent training instance
• Results (F1) on combined noun-phrase + POS labeling:
– Piecewise training: 88%, faster
– Belief propagation: 86%
(Sutton et al., ICML 2004; McCallum et al., EMNLP/HLT 2005)
Outline (contd.): Graphical models for extraction + integration
– Segmentation models: matching with entity databases
– Constrained models: integrating with multiple tables
Conventional Extraction Research

[Diagram: labeled unstructured text trains a model; the model extracts entities from unstructured texts 1, 2, and 3 independently.]

Data integration

[Diagram: training additionally exploits a linked entity database; entities extracted from unstructured texts 1, 2, and 3 are integrated with the existing data.]
Goals of integration

• Exploit the database to improve extraction
– The entity might already exist in the database
• Integrate extracted entities, resolving whether each entity is already in the database
– If existing, create links
– If not, create a new entry
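The resolve-or-insert step can be sketched with a toy version of the Authors table used later in the talk. This is a deliberately crude illustration: the similarity measure (difflib's ratio) and the threshold are assumptions, not the matching machinery the talk actually uses.

```python
import difflib

# Toy Authors table: id -> name, plus variant-to-canonical links.
authors = {11: "M Y Vardi", 2: "J. Ullman", 3: "Ron Fagin", 4: "Jeffrey Ullman"}
canonical = {2: 4, 3: 3, 4: 4}

def resolve_or_insert(name, threshold=0.8):
    """Link an extracted name to its canonical entry, or insert a new one."""
    best_id, best_sim = None, 0.0
    for eid, existing in authors.items():
        sim = difflib.SequenceMatcher(None, name.lower(), existing.lower()).ratio()
        if sim > best_sim:
            best_id, best_sim = eid, sim
    if best_sim >= threshold:                  # matched: follow canonical link
        return canonical.get(best_id, best_id)
    new_id = max(authors) + 1                  # no match: create a new entry
    authors[new_id] = name
    canonical[new_id] = new_id
    return new_id
```

The real systems discussed here instead learn the match decision jointly with extraction, precisely because fixed thresholds like this one are brittle.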
Database: normalized, stores noisy variants; variant entries link to canonical entries. Three top-level entity tables (Articles, Journals, Authors) plus a Writes link table:

Articles
Id | Title            | Year | Journal | Canonical
2  | Update Semantics | 1983 | 10      |

Journals
Id | Name                 | Canonical
10 | ACM TODS             |
17 | AI                   | 17
16 | ACM Trans. Databases | 10

Writes
Article | Author
2       | 11
2       | 2
2       | 3

Authors
Id | Name           | Canonical
11 | M Y Vardi      |
2  | J. Ullman      | 4
3  | Ron Fagin      | 3
4  | Jeffrey Ullman | 4
Input: "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see"

Word-level labeling:
t:  1   2      3    4   5         6       7          8
x:  R.  Fagin  and  J.  Helpbern  Belief  Awareness  Reasoning
y:  Author Author Other Author Author Title Title Title

Features describe the single word "Fagin".
Segmentation models (Semi-CRFs)

Label segments instead of single words:

segments: (l1=1, u1=2)  (l2=u2=3)  (l3=4, u3=5)  (l4=6, u4=8)
x:        R. Fagin      and        J. Helpbern   Belief Awareness Reasoning
y:        Author        Other      Author        Title

Features describe the whole segment from l to u, e.g. similarity of the segment to the author column in the database.
Graphical models for segmentation

• Graph has many cycles
– Clique size = maximum segment length
• Two kinds of potentials
– Transition potentials: only across adjacent nodes
– Segment potentials: require all positions in a segment to have the same label
• Exact inference possible in time linear in sequence length times the maximum segment length (Cohen & Sarawagi, 2004)
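The linear-in-length-times-max-segment-length inference for segmentation models can be sketched as a semi-Markov Viterbi: a dynamic program over segment boundaries rather than single positions. The segment scoring function below (dictionary lookup of the whole segment) is an illustrative assumption.

```python
def semi_viterbi(score, n, labels, max_len):
    """Best segmentation of n tokens; score(l, u, y) is the log-potential of
    labeling tokens l..u (inclusive) with label y."""
    best = [0.0] + [float("-inf")] * n       # best[i]: best score of tokens 0..i-1
    back = [None] * (n + 1)
    for u in range(1, n + 1):
        for l in range(max(0, u - max_len), u):   # segment start candidates
            for y in labels:
                s = best[l] + score(l, u - 1, y)
                if s > best[u]:
                    best[u], back[u] = s, (l, y)
    segments, u = [], n                      # recover the segmentation
    while u > 0:
        l, y = back[u]
        segments.append((l, u - 1, y))
        u = l
    return list(reversed(segments))

# Toy segment scorer: whole-segment match against a tiny author dictionary.
tokens = "R. Fagin and J. Helpbern".split()
AUTHOR_DICT = {"r. fagin", "j. helpbern"}

def score(l, u, y):
    seg = " ".join(tokens[l:u + 1]).lower()
    if y == "Author":
        return 2.0 if seg in AUTHOR_DICT else -1.0
    return 0.5 if l == u else -1.0           # "Other" prefers single tokens

segs = semi_viterbi(score, len(tokens), ["Author", "Other"], max_len=3)
```

Because the inner loop only looks back max_len positions, the total work is O(n * max_len * |labels|), matching the complexity claim above.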
Effect of database on extraction performance

PersonalBib   L     L+DB   %Δ
author        75.7  79.5   4.9
journal       33.9  50.3   48.6
title         61.0  70.3   15.1

Address       L     L+DB   %Δ
city_name     72.4  76.7   6.0
state_name    13.9  33.2   138.5
zipcode       91.6  94.3   3.0

L = only labeled structured data
L + DB = similarity to database entities and other DB features
(from Mansuri et al., ICDE 2006)
After integration, the database gains new variant entries (ids 7, 8, 9):

Articles
Id | Title                        | Year | Journal | Canonical
2  | Update Semantics             | 1983 | 10      |
7  | Belief, awareness, reasoning | 1988 | 17      |

Journals
Id | Name                 | Canonical
10 | ACM TODS             |
17 | AI                   | 17
16 | ACM Trans. Databases | 10

Writes
Article | Author
2       | 11
2       | 2
2       | 3
7       | 8
7       | 9

Authors
Id | Name           | Canonical
11 | M Y Vardi      |
2  | J. Ullman      | 4
3  | Ron Fagin      | 3
4  | Jeffrey Ullman | 4
8  | R Fagin        | 3
9  | J Helpern      | 8
Input: "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see"

Extraction: Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 1988
Integration: match with existing linked entities while respecting all constraints.
Only extraction vs. combined extraction + integration

Input: "CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI"

Only extraction:
Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 2000
Year mismatch! The database already stores article 7 ("Belief, awareness, reasoning", journal AI) with year 1988.

Combined extraction + integration:
Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning in AI | Journal: CACM | Year: 2000
Combined extraction + matching

• Convert the predicted label into a pair y = (a, r), where r is the id of the matching database entity
• r = 0 means none-of-the-above, i.e. a new entry

Example segmentation:
segments: (l1=1, u1=2)   (l2=u2=3)  (l3=4, u3=8)
x:        CACM. 2000     Fagin      Belief Awareness Reasoning In AI
y:        Journal, Year  Author     Title
r:        0, 7           3          7

Constraints exist on the ids that can be assigned to two segments.
Constrained models

• Training
– Ignore constraints, or use max-margin methods that require only MAP estimates
• Application
– Formulate as a constrained integer programming problem (expensive), or
– Use general A* search to find the most likely constrained assignment
Full integration performance

PersonalBib   L     L+DB   %Δ
author        70.8  74.0   4.5
journal       29.6  45.5   53.6
title         51.6  65.0   25.9

Address       L     L+DB   %Δ
city_name     70.1  74.6   6.4
state_name    9.0   28.3   213.8
pincode       87.8  90.7   3.3

• L = conventional extraction + matching
• L + DB = technology presented here
• Much higher accuracies possible with more training data
(from Mansuri et al., ICDE 2006)
What next in data integration?

• Lots to be done in building large-scale, viable data integration systems
• Online collective inference
– Cannot freeze the database
– Cannot batch too many inferences
– Need theoretically sound, practical alternatives to exact, batch inference
• Performance of integration (Chandel et al., ICDE 2006)
• Other operations
– Data standardization
– Schema management
Probabilistic Querying Systems

• Integration systems, while improving, cannot be perfect, particularly for domains like the web
• User supervision of each integration result is impossible
• Create uncertainty-aware storage and querying engines. Two enablers:
– Probabilistic database querying engines over generic uncertainty models
– Conditional graphical models produce well-calibrated probabilities
Probabilities in CRFs are well-calibrated

[Calibration plots on the Cora citations and Cora headers datasets: probability of a segmentation vs. probability it is correct, both close to the ideal diagonal. E.g. segmentations predicted with probability 0.5 are correct about 50% of the time.]
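A calibration check of the kind plotted here can be computed by binning predictions by confidence and comparing each bin's empirical accuracy to its midpoint. The sketch below is generic; the (probability, correct) pairs in the example are fabricated for illustration, not Cora results.

```python
def calibration_bins(preds, n_bins=5):
    """preds: (predicted probability, was the prediction correct) pairs.
    Returns (bin midpoint, empirical accuracy) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in preds:
        bins[min(int(p * n_bins), n_bins - 1)].append(ok)
    # Well-calibrated: accuracy in each bin is close to the bin's midpoint.
    return [((i + 0.5) / n_bins, sum(b) / len(b))
            for i, b in enumerate(bins) if b]

# Fabricated predictions: 0.9-confidence ones are right 90% of the time,
# 0.1-confidence ones 10% of the time, i.e. perfectly calibrated.
preds = [(0.9, True)] * 9 + [(0.9, False)] + [(0.1, False)] * 9 + [(0.1, True)]
bins = calibration_bins(preds)
```

Plotting midpoint against accuracy for a calibrated model traces the ideal diagonal shown on the slide.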
Uncertainty in integration systems

[Diagram: the model maps unstructured text to several alternative entity sets with probabilities p1, p2, …, pk, which feed a probabilistic database system; very uncertain cases are routed back as additional training data. Other, more compact models?]

Example queries over the probabilistic database:

Select conference name of article RJ03:
IEEE Intl. Conf. On Data Mining   0.8
Conf. On Data Mining              0.2

Find most cited author:
D Johnson   16000   0.6
J Ullman    13000   0.4
In summary

• Data integration provides scope for several interesting learning problems
• Probabilistic graphical models provide a robust, unified mechanism for exploiting a wide variety of clues and dependencies
• Many open research challenges remain in making graphical models work in practical settings