Adaptive Graphical Approach to Entity Resolution

Dmitri V. Kalashnikov

Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra

Computer Science Department, University of California, Irvine

Additional information is available at http://www.ics.uci.edu/~dvk

Copyright © by Dmitri V. Kalashnikov, 2007

ACM IEEE Joint Conference on Digital Libraries 2007

Structure of the Talk

• Motivation
• Generic Disambiguation Framework
  – High-level
• Entity Resolution Approach
  – Part of the Framework
• Experiments

Entity Resolution & Data Cleaning

Raw Dataset(s)

... J. Smith ...

... John Smith ...

... Jane Smith ...

MIT

Intel Inc. ?

A "nice" regular Database

Analysis on bad data leads to wrong conclusions!

• Uncertainty
• Errors
• Missing data

Why do we need “Entity Resolution”?

Jane Smith (fresh Ph.D.): “Hi, I’m Jane Smith. I’d like to apply for a faculty position.”

Tom (recruiter): “Wow! I am sure we will accept a strong candidate like that!”

Tom (recruiter): “OK, let me check something quickly …”

???

[Figure: CiteSeer Rank, with two publication lists (Publications: 1. …… 2. …… 3. ……)]

Suspicious entries
– Let’s go to the DBLP website
  – which stores bibliographic entries of many CS authors
– Let’s check two people
  – “A. Gupta”
  – “L. Zhang”

What is the problem?

[Figures: CiteSeer, the top-k most cited authors; two DBLP pages]

Comparing raw and cleaned CiteSeer

Rank          Author                Location
1 (100.00%)   douglas schmidt       cs@wustl
2 (100.00%)   rakesh agrawal        almaden@ibm
3 (100.00%)   hector garciamolina   @
4 (100.00%)   sally floyd           @aciri
5 (100.00%)   jennifer widom        @stanford
6 (100.00%)   david culler          cs@berkeley
6 (100.00%)   thomas henzinger      eecs@berkeley
7 (100.00%)   rajeev motwani        @stanford
8 (100.00%)   willy zwaenepoel      cs@rice
9 (100.00%)   van jacobson          lbl@gov
10 (100.00%)  rajeev alur           cis@upenn
11 (100.00%)  john ousterhout       @pacbell
12 (100.00%)  joseph halpern        cs@cornell
13 (100.00%)  andrew kahng          @ucsd
14 (100.00%)  peter stadler         tbi@univie
15 (100.00%)  serge abiteboul       @inria

Raw CiteSeer’s Top-K Most Cited Authors

Cleaned CiteSeer’s Top-K Most Cited Authors

What is the lesson?

– Data should be cleaned first
  – E.g., determine the (unique) real authors of publications
– Solving such challenges is not always “easy”
  – This explains the large body of work on Entity Resolution

“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.

Typical Data Processing Flow

[Diagram with stages labeled: Raw Data, Representation, Data Cleaning, Extraction, Analysis]

Two most common types of Entity Resolution

... J. Smith ...

... John Smith ...

... Jane Smith ...

MIT

Intel Inc.

Fuzzy lookup
– match references to objects
– list of all objects is given
– [SDM’05], [TODS’06]

Fuzzy grouping
– group references that co-refer
– [IQIS’05], [JCDL’07]
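
The difference between the two problem types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the string measure (`difflib.SequenceMatcher` from the standard library), the greedy grouping strategy, the function names, and the thresholds are all choices made here and are not taken from the talk or the cited papers.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any real matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(reference: str, known_objects: list[str]) -> str:
    """Lookup: the list of objects is given; return the best-matching object."""
    return max(known_objects, key=lambda obj: name_similarity(reference, obj))


def fuzzy_grouping(references: list[str], threshold: float) -> list[list[str]]:
    """Grouping: no object list; greedily cluster references that seem to co-refer."""
    groups: list[list[str]] = []
    for ref in references:
        for group in groups:
            # Join the first group that has a sufficiently similar member.
            if any(name_similarity(ref, member) >= threshold for member in group):
                group.append(ref)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([ref])
    return groups
```

With a purely string-based measure like this one, `fuzzy_grouping(["J. Smith", "John Smith", "Jane Smith"], 0.7)` places all three references in a single group, which is the kind of ambiguity the rest of the talk addresses.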

Structure of the Talk

• Motivation
• Generic Framework
  – High-level
• Approach
  – Part of the Framework
• Experiments

Traditional Approach to Entity Resolution

[Figure: two references, X and Y, compared feature by feature; values shown include "J. Smith", "Jane Smith", js@mit.edu, and sm@yahoo.com]

Traditional Methods: Features and Context

s(X,Y) = f(X,Y)

Similarity = Similarity of Features
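
A minimal sketch of the traditional scheme, s(X,Y) = f(X,Y), follows. It assumes each reference is a dictionary of string-valued features and that f is the average per-feature string similarity; the feature names, the averaging, and the use of `difflib.SequenceMatcher` are illustrative assumptions, not the method of any particular system.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Similarity of two feature values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def feature_similarity(x: dict[str, str], y: dict[str, str]) -> float:
    """f(X, Y): average similarity over the features both references have."""
    shared = [name for name in x if name in y]
    if not shared:
        return 0.0  # nothing to compare
    return sum(string_sim(x[name], y[name]) for name in shared) / len(shared)


def traditional_similarity(x: dict[str, str], y: dict[str, str]) -> float:
    """s(X, Y) = f(X, Y): similarity is the similarity of features only."""
    return feature_similarity(x, y)
```

For example, one could compare `{"name": "J. Smith", "email": "js@mit.edu"}` with `{"name": "Jane Smith", "email": "sm@yahoo.com"}`; the pairing of names with e-mail addresses here is hypothetical.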

Key Observation: More Info is Available

[Figure: a "nice" regular database containing the references “Jane Smith”, “John Smith”, and “J. Smith”]

Solution: Main Idea

[Figure: Traditional Methods (features and context: X and Y compared on features f1–f4) + Relationship Analysis (ARG in which X and Y are linked through nodes A–F)]

s(X,Y) = c(X,Y) + γ f(X,Y)

Similarity = Similarity of Features + “Connection Strength”

New Paradigm
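
The combined measure s(X,Y) = c(X,Y) + γ f(X,Y) can be sketched as below. The slide does not define how the connection strength c is computed, so this sketch assumes one simple possibility: sum `decay ** length` over all simple paths between X and Y of at most `max_len` edges. The graph encoding (an adjacency dictionary), `decay`, `max_len`, and both function names are assumptions made for illustration.

```python
def connection_strength(graph, x, y, max_len=4, decay=0.5):
    """c(X, Y): sum of decay**length over simple X-Y paths of at most max_len edges.

    `graph` maps each node to an iterable of its neighbors (undirected graph).
    """
    total = 0.0
    stack = [(x, (x,))]  # (current node, nodes visited on this path)
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            if nxt == y:
                total += decay ** len(path)  # this path has len(path) edges
            elif len(path) < max_len:
                stack.append((nxt, path + (nxt,)))
    return total


def combined_similarity(c: float, f: float, gamma: float) -> float:
    """s(X, Y) = c(X, Y) + gamma * f(X, Y)."""
    return c + gamma * f
```

In this sketch shorter paths contribute more than longer ones, and γ controls how much weight the feature similarity receives relative to the connections.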

Illustrative Example

“Indirect connections”
– Suppose your co-worker’s name is “John White”
– Suppose you see on the Web, on my homepage
  – My name: “Dmitri …”
  – Somebody named: “John White”
– Who is this “John White”?
– From data you might establish a connection:
  – “Dmitri” might be connected to more than one “John White” …

[Figure: connection path Dmitri – Visited – JCDL’07 – Visited – <you> – WorksAT – <your ORG> – WorksAT – John White]
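
The indirect connection in this example can be found with a breadth-first search over the relationship graph. The sketch below is a reading of the slide's figure, not a verbatim copy: the edge list is reconstructed from the node and edge labels shown, and the function names are hypothetical.

```python
from collections import deque

# Relationships reconstructed from the figure labels (an assumption).
EDGES = [
    ("Dmitri", "Visited", "JCDL'07"),
    ("<you>", "Visited", "JCDL'07"),
    ("<you>", "WorksAT", "<your ORG>"),
    ("John White", "WorksAT", "<your ORG>"),
]


def build_graph(edges):
    """Turn labeled edges into an undirected adjacency map."""
    graph = {}
    for src, _label, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set()).add(src)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None
```

A path from “Dmitri” to this particular “John White” is evidence that the name on the homepage refers to your co-worker rather than to some other person with the same name.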

Key Features of the Framework

Our goal is to create a framework with:

– solid theoretical foundation

– lookup

– domain-independent framework

– self-tuning

– scales to large datasets

– robust under uncertainty

– high disambiguation quality

Structure of the Talk

• Motivation
• Generic Framework
  – High-level
• Approach
  – Part of the Framework
• Experiments

Approach

• Graph Creation
  – Entity-Relationship Graph
• Consolidation Algorithm
  – Bottom-up clustering
• Adaptiveness to data
  – That is, self-tuning
  – Supervised learning
• External Data
  – To improve the quality further
  – A theoretical possibility
    – Not tested yet

ER Graph Creation

Virtual Connected Subgraph (VCS)

[Figure legend: Nodes: person, publication, department, organization. Edges: similarity, regular. A VCS is marked in the graph.]

• VCS
  – Similarity edges form VCSs
  – Subgraphs in the ER graph
  1. “Virtual”
     – Contains only similarity edges
  2. “Connected”
     – A path between any 2 nodes
  3. Completeness
     – Adding more nodes/edges would violate (1) or (2)
• Logically, the Goal is
  – Partition each VCS properly
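
Under the three properties above, the VCSs are the connected components of the graph restricted to its similarity edges. A minimal sketch using union-find follows; the function name, the input encoding, and the decision to skip nodes that have no similarity edge are assumptions made here.

```python
def virtual_connected_subgraphs(nodes, similarity_edges):
    """Return the VCSs: maximal node sets connected through similarity edges only.

    Regular (non-similarity) edges are ignored on purpose, so they are not passed in.
    Every endpoint of a similarity edge is assumed to appear in `nodes`.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Find the component representative, shortening the chain as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in similarity_edges:
        parent[find(a)] = find(b)  # union the two components

    components = {}
    for n in nodes:
        components.setdefault(find(n), set()).add(n)

    # A node without any similarity edge has nothing to disambiguate against.
    return [c for c in components.values() if len(c) > 1]
```

Each returned set is then a separate partitioning problem: the references inside one VCS may or may not co-refer, while references in different VCSs are never candidates for merging.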

Consolidation Algorithm: Merging
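
The content of this slide did not survive extraction, so the following is only a generic sketch of bottom-up (agglomerative) merging, not the algorithm from the talk. It assumes average-link similarity between clusters, a fixed merge threshold, distinct references, and a caller-supplied pairwise `similarity` function; all of these are illustrative choices.

```python
from itertools import combinations


def consolidate(references, similarity, threshold):
    """Bottom-up clustering: start with singletons, merge the most similar pair
    of clusters until no pair reaches the threshold."""
    clusters = [frozenset([r]) for r in references]

    def cluster_sim(c1, c2):
        # Average-link: mean pairwise similarity across the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(similarity(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        if cluster_sim(*best) < threshold:
            break  # nothing left that is similar enough to merge
        c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters
```

Applied to the references of one VCS, with `similarity` set to a measure such as s(X,Y), the result is one partition of that VCS into groups of references assumed to co-refer.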

Self-tuning via Supervised Learning

Self-tuning (2)

External Knowledge to Improve Quality

Structure of the Talk

• Motivation
• Generic Framework
  – High-level
• Approach
  – Part of the Framework
• Experiments

Quality

“Context” was proposed in [Bhattacharya et al., DMKD’04]

The two algorithms were proposed in [Dong et al., SIGMOD’05]

Scalability & Efficiency

Impact of Random Relationships

Contact Information

• Info about our disambiguation project
  – http://www.ics.uci.edu/~dvk
• Overall design
  – Dmitri V. Kalashnikov
  – dvk [at] domain
• Implementation details in JCDL’07
  – Zhaoqi (Stella) Chen
  – chenz [at] domain
• domain = ics.uci.edu
