
Adaptive Graphical Approach to Entity Resolution

Dmitri V. Kalashnikov

Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra

Computer Science Department
University of California, Irvine

ACM IEEE Joint Conference on Digital Libraries 2007

Structure of the Talk

Motivation

• Generic Disambiguation Framework – High-level

• Entity Resolution Approach – Part of the Framework

• Experiments

Entity Resolution & Data Cleaning

Raw Dataset(s)

...J. Smith ...

.. John Smith ...

.. Jane Smith ...

Intel Inc. ?

A "nice" regular Database

Analysis on bad data leads to wrong conclusions!

• Uncertainty
• Errors
• Missing data

Why do we need “Entity Resolution”?

“Hi, I’m Jane Smith. I’d like to apply for a faculty position.”

“Wow! I am sure we will accept a strong candidate like that!”

Jane Smith – Fresh Ph.D. Tom - Recruiter

“OK, let me check something quickly …”

Publications:
1. ……
2. ……
3. ……

CiteSeer Rank

Suspicious entries
– Let’s go to the DBLP website
– which stores bibliographic entries of many CS authors
– Let’s check two people:
– “A. Gupta”
– “L. Zhang”

What is the problem?

[Figure: CiteSeer, the top-k most cited authors, shown next to two DBLP pages]

Comparing raw and cleaned CiteSeer

Rank            Author               Location
1 (100.00%)     douglas schmidt      cs@wustl
2 (100.00%)     rakesh agrawal       almaden@ibm
3 (100.00%)     hector garciamolina  @
4 (100.00%)     sally floyd          @aciri
5 (100.00%)     jennifer widom       @stanford
6 (100.00%)     david culler         cs@berkeley
6 (100.00%)     thomas henzinger     eecs@berkeley
7 (100.00%)     rajeev motwani       @stanford
8 (100.00%)     willy zwaenepoel     cs@rice
9 (100.00%)     van jacobson         lbl@gov
10 (100.00%)    rajeev alur          cis@upenn
11 (100.00%)    john ousterhout      @pacbell
12 (100.00%)    joseph halpern       cs@cornell
13 (100.00%)    andrew kahng         @ucsd
14 (100.00%)    peter stadler        tbi@univie
15 (100.00%)    serge abiteboul      @inria

Raw CiteSeer’s Top-K Most Cited Authors

Cleaned CiteSeer’s Top-K Most Cited Authors

What is the lesson?

– Data should be cleaned first
– E.g., determine the (unique) real authors of publications
– Solving such challenges is not always “easy”
– This explains the large body of work on Entity Resolution

“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.

Typical Data Processing Flow

[Diagram labels: Raw Data; Representation; Data Cleaning; Extraction; Analysis]

Two most common types of Entity Resolution

...J. Smith ...

.. John Smith ...

.. Jane Smith ...

Intel Inc.

Fuzzy lookup

– match references to objects
– list of all objects is given

– [SDM’05], [TODS’06]

Fuzzy grouping

– group references that co-refer

– [IQIS’05], [JCDL’07]
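A minimal sketch of fuzzy lookup, assuming plain string similarity from Python's standard `difflib` as the matcher. The helper name `fuzzy_lookup`, the cutoff value, and the object list are illustrative assumptions, not the method from the cited papers. It shows why lookup by name alone stays ambiguous: "J. Smith" resembles both Smiths.

```python
from difflib import get_close_matches


def fuzzy_lookup(reference, known_objects, cutoff=0.6):
    """Return the known objects whose names resemble the reference.

    Hypothetical helper: uses only string similarity, no context.
    """
    return get_close_matches(
        reference, known_objects, n=len(known_objects), cutoff=cutoff
    )


# The list of all objects is given (the fuzzy lookup setting).
known_objects = ["John Smith", "Jane Smith", "Intel Inc."]
candidates = fuzzy_lookup("J. Smith", known_objects)
print(candidates)  # both Smiths are returned, so the reference is still ambiguous
```

In fuzzy grouping no such object list is given, so references have to be clustered against each other instead.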

• Motivation

Generic Framework – High-level

• Approach – Part of the Framework

• Experiments

Traditional Approach to Entity Resolution

"J. Smith"

js@mit.edu

sm@yahoo.com?

Traditional Methods

Features and Context

"Jane Smith"

s(X,Y) = f(X,Y)

Similarity = Similarity of Features

Key Observation: More Info is Available

A "nice" regular Database

Jane Smith

John Smith

J. Smith

Solution: Main Idea

Traditional Methods

Relationship Analysis

features and context

s(X,Y) = c(X,Y) + γ f(X,Y)

Similarity = Similarity of Features + “Connection Strength”
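The combined score can be sketched as follows. This is only an illustration under stated assumptions: f is approximated by `difflib` string similarity, and c by the Jaccard overlap of the entities each reference is linked to, which is a stand-in rather than the connection-strength measure used in the framework. The `neighbors` map and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def feature_similarity(x, y):
    """f(X, Y): string similarity of two reference names, in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def connection_strength(x, y, neighbors):
    """c(X, Y): assumed here to be the Jaccard overlap of linked entities."""
    nx, ny = neighbors.get(x, set()), neighbors.get(y, set())
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0


def similarity(x, y, neighbors, gamma=1.0):
    """s(X, Y) = c(X, Y) + gamma * f(X, Y)."""
    return connection_strength(x, y, neighbors) + gamma * feature_similarity(x, y)


# Hypothetical relationships: which entities each reference is linked to.
neighbors = {
    "J. Smith": {"MIT", "paper-1"},
    "John Smith": {"MIT", "paper-1"},
    "Jane Smith": {"Intel"},
}
s_john = similarity("J. Smith", "John Smith", neighbors)
s_jane = similarity("J. Smith", "Jane Smith", neighbors)
```

With this toy data the two candidates have the same feature similarity, so only the connection term separates them.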

New Paradigm

Illustrative Example

“Indirect connections”
– Suppose your co-worker’s name is “John White”
– Suppose you see on the Web, on my homepage:
– My name: “Dmitri …”
– Somebody named: “John White”
– Who is this “John White”?
– From the data you might establish a connection:
– “Dmitri” might be connected to more “John White”s …

[Figure: graph with labels Dmitri, JCDL’07, Visited, WorksAT, WorksAT, John White]

Key Features of the Framework

Our goal was to create a framework with the following properties:
– solid theoretic foundation

– lookup

– domain-independent framework

– self-tuning

– scales to large datasets

– robust under uncertainty

– high disambiguation quality

• Motivation

• Generic Framework – High-level

Approach – Part of the Framework

• Experiments

Approach

• Graph Creation
– Entity-Relationship Graph

• Consolidation Algorithm
– Bottom-up clustering

• Adaptiveness to data
– That is, self-tuning
– Supervised learning

• External Data
– To improve the quality further
– A theoretic possibility
– Not tested yet
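Bottom-up clustering, as named in the list above, can be sketched generically. The version below is plain average-link agglomerative merging with a stopping threshold; the linkage rule, the threshold, and the toy scores are assumptions made for illustration, not the consolidation algorithm of the talk.

```python
from itertools import combinations


def consolidate(references, sim, threshold):
    """Greedy bottom-up clustering: merge the most similar pair of clusters
    until no pair reaches the threshold (average-link, an assumption)."""
    clusters = [frozenset([r]) for r in references]

    def cluster_sim(a, b):
        # Average pairwise similarity between members of the two clusters.
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        if cluster_sim(a, b) < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical pairwise scores, e.g. produced by a similarity function s(X, Y).
scores = {
    frozenset({"J. Smith", "John Smith"}): 0.9,
    frozenset({"J. Smith", "Jane Smith"}): 0.3,
    frozenset({"John Smith", "Jane Smith"}): 0.1,
}
clusters = consolidate(
    ["J. Smith", "John Smith", "Jane Smith"],
    sim=lambda x, y: scores[frozenset({x, y})],
    threshold=0.5,
)
```

Here "J. Smith" and "John Smith" are merged first; the merged cluster's average similarity to "Jane Smith" is 0.2, below the threshold, so merging stops.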

ER Graph Creation

Virtual Connected Subgraph (VCS)

[Figure legend: node types person, publication, department, organization; edge types similarity, regular]

• VCS
– Similarity edges form VCSs
– Subgraphs in the ER graph

1. “Virtual”
– Contains only similarity edges

2. “Connected”
– A path between any 2 nodes

3. Completeness
– Adding more nodes/edges would violate (1) and (2)

• Logically, the Goal is
– Partition each VCS properly
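Read together, the three properties describe the connected components of the subgraph formed by similarity edges alone. A minimal sketch under that reading follows; the function name and the reference ids are hypothetical.

```python
from collections import defaultdict


def virtual_connected_subgraphs(similarity_edges):
    """Return the maximal connected node sets induced by similarity edges.

    Regular (relationship) edges are ignored on purpose: a VCS is "virtual",
    i.e. it contains only similarity edges.
    """
    adjacency = defaultdict(set)
    for u, v in similarity_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal over similarity edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical reference ids; each pair is one similarity edge.
edges = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
vcs_list = virtual_connected_subgraphs(edges)
```

Each returned set is one VCS; the consolidation step then only has to partition references within a set, never across sets.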

Consolidation Algorithm: Merging

Self-tuning via Supervised Learning

Self-tuning (2)

External Knowledge to Improve Quality

• Motivation

• Generic Framework – High-level

• Approach – Part of the Framework

Experiments

Quality

“Context” is proposed in [Bhattacharya et al., DMKD’04]

The two algos are proposed in [Dong et al., SIGMOD’05]

Scalability & Efficiency

Impact of Random Relationships

Contact Information

• Info about our disambiguation project
– http://www.ics.uci.edu/~dvk

• Overall design
– Dmitri V. Kalashnikov
– dvk [at] domain

• Implementation details in JCDL’07
– Zhaoqi (Stella) Chen
– chenz [at] domain
– domain = ics.uci.edu
