
Exploiting Relationships for Object Consolidation

Zhaoqi Chen Dmitri V. Kalashnikov Sharad Mehrotra

Computer Science Department, University of California, Irvine

http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)

ACM IQIS 2005

Copyright (c) Dmitri V. Kalashnikov, 2005. Work supported by NSF Grants IIS-0331707 and IIS-0083489.

2

Talk Overview

• Examples
  – motivating data cleaning (DC)
  – motivating the analysis of relationships for DC

• Object consolidation
  – one of the DC problems
  – the problem this work addresses

• Proposed approach
  – the RelDC framework
  – relationship analysis and graph partitioning

• Experiments

3

Why do we need “Data Cleaning”?

Jane Smith (fresh Ph.D.): “Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.”

Tom (recruiter): “Wow! Unbelievable! You must be a really hard worker! I am sure we will accept a candidate like that!”

Tom: “OK, let me check something quickly…” ???

[Cartoon panels: Jane’s publication list and her CiteSeer rank.]

4

Suspicious entries
– Let’s go to the DBLP website
  – which stores bibliographic entries of many CS authors
– Let’s check two people
  – “A. Gupta”
  – “L. Zhang”

What is the problem?

[Figure: CiteSeer’s top-k most cited authors, with the corresponding DBLP entries for comparison.]

5

Comparing raw and cleaned CiteSeer

Rank          Author               Location       # citations
 1 (100.00%)  douglas schmidt      cs@wustl       5608
 2 (100.00%)  rakesh agrawal       almaden@ibm    4209
 3 (100.00%)  hector garciamolina  @              4167
 4 (100.00%)  sally floyd          @aciri         3902
 5 (100.00%)  jennifer widom       @stanford      3835
 6 (100.00%)  david culler         cs@berkeley    3619
 6 (100.00%)  thomas henzinger     eecs@berkeley  3752
 7 (100.00%)  rajeev motwani       @stanford      3570
 8 (100.00%)  willy zwaenepoel     cs@rice        3624
 9 (100.00%)  van jacobson         lbl@gov        3468
10 (100.00%)  rajeev alur          cis@upenn      3577
11 (100.00%)  john ousterhout      @pacbell       3290
12 (100.00%)  joseph halpern       cs@cornell     3364
13 (100.00%)  andrew kahng         @ucsd          3288
14 (100.00%)  peter stadler        tbi@univie     3187
15 (100.00%)  serge abiteboul      @inria         3060

[Two side-by-side lists on the slide: the CiteSeer top-k and the cleaned CiteSeer top-k.]

6

What is the lesson?

– data should be cleaned first
  – e.g., determine the (unique) real authors of publications
– solving such challenges is not always “easy”
  – which explains the large body of work on data cleaning
– note
  – CiteSeer is aware of the problem with its ranking
  – there are more issues with CiteSeer, many not related to data cleaning

“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.

7

RelDC Framework

[Figure: the RelDC framework. Traditional methods match representations X and Y by comparing feature values f1–f4 alone; RelDC additionally analyzes the relationships connecting X and Y through intermediate nodes (A–F) of an attributed relational graph (ARG). Pipeline: raw data → extraction of features and context → representation as tables/ARGs → relationship-based data cleaning via ARG analysis.]

8

Object Consolidation

Notation
– O = {o1, …, o|O|}: the set of entities
  – unknown in general
– X = {x1, …, x|X|}: the set of representations
– d[xi]: the entity xi refers to
  – unknown in general
– C[xi]: all representations that refer to d[xi]
  – the “group set”
  – unknown in general
  – the goal is to find it for each xi
– S[xi]: all representations that can be xi
  – the “consolidation set”
  – determined by FBS
– we assume C[xi] ⊆ S[xi]
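The notation above can be captured in a toy sketch (plain Python; the entities, representations, and the mapping d are hypothetical illustrations, not data from the paper):

```python
# Toy instance of the notation: two entities, four representations.
O = {"o1", "o2"}                      # set of entities (unknown in general)
X = {"x1", "x2", "x3", "x4"}          # set of representations

# d[x]: the entity a representation refers to (unknown in general).
d = {"x1": "o1", "x2": "o1", "x3": "o2", "x4": "o2"}

# C[x]: the "group set" -- all representations that refer to d[x].
# This is what the algorithm must find for each x.
C = {x: {y for y in X if d[y] == d[x]} for x in X}

# S[x]: the "consolidation set" determined by FBS -- all representations
# that could be the same entity as x.  Here FBS separated nothing.
S = {x: set(X) for x in X}

# The assumption C[x] ⊆ S[x]: the true group never leaves the FBS set.
assert all(C[x] <= S[x] for x in X)
```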

9

Attributed Relational Graph (ARG)

ARG in RelDC

Nodes
– one per cluster of representations
– one per representation (for “tough” cases)

Edges
– regular
– similarity

Node types include person, publication, department, and organization.
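A minimal sketch of such an ARG as adjacency lists (node and edge names are illustrative; a real implementation might use a graph library instead):

```python
from collections import defaultdict

class ARG:
    """Attributed relational graph: typed nodes, regular/similarity edges."""
    def __init__(self):
        self.node_type = {}                 # node -> person/publication/...
        self.adj = defaultdict(list)        # node -> [(neighbor, edge_kind)]

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, v, kind="regular"):
        self.adj[u].append((v, kind))
        self.adj[v].append((u, kind))

g = ARG()
g.add_node("J.Smith#1", "person")           # one node per representation
g.add_node("J.Smith#2", "person")           # (a "tough" case)
g.add_node("P1", "publication")
g.add_edge("J.Smith#1", "P1")               # authorship: a regular edge
g.add_edge("J.Smith#1", "J.Smith#2",        # FBS cannot tell them apart
           kind="similarity")
```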

10

Context Attraction Principle (CAP)

Take a guess: who is “J. Smith”?
– Jane?
– John?

[Figure: a new publication by “J. Smith” is being merged; its relationships connect it to both Jane Smith and John Smith.]

11

Questions to Answer

1. Does the CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?

2. Can we design a generic solution to exploiting relationships for disambiguation?

12

Consolidation Algorithm

1. Construct the ARG and identify all VCSs
   – use FBS in constructing the ARG

2. Choose a VCS and compute c(u,v)
   – for each pair of representations connected via a similarity edge

3. Partition the VCS
   – use a graph partitioning algorithm
   – partitioning is based on the c(u,v) values
   – after partitioning, adjust the ARG accordingly
   – go to Step 2 if more VCSs exist
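The three steps can be sketched end to end with toy stand-ins for each stage (common-neighbor counting in place of the L-short-path c(u,v) model, and a greedy threshold grouping in place of normalized cut; all names and data are illustrative):

```python
def compute_connection_strength(arg, u, v):
    # Toy stand-in: count common neighbors (the real model sums over
    # L-short simple paths; see "Connection Strength").
    return len(set(arg[u]) & set(arg[v]))

def partition(vcs, strengths, threshold=1):
    # Toy stand-in: put r with an existing cluster iff some c(u,v)
    # reaches the threshold (the real algorithm uses normalized cut).
    clusters = []
    for r in vcs:
        for cl in clusters:
            if any(strengths.get(frozenset((r, s)), 0) >= threshold
                   for s in cl):
                cl.add(r)
                break
        else:
            clusters.append({r})
    return clusters

# A virtual consolidation set of three "J. Smith" representations;
# arg maps each node to its neighbors via regular edges.
arg = {"js1": ["P1", "P2"], "js2": ["P2", "P3"], "js3": ["P4"],
       "P1": ["js1"], "P2": ["js1", "js2"], "P3": ["js2"], "P4": ["js3"]}
vcs = ["js1", "js2", "js3"]
strengths = {frozenset((u, v)): compute_connection_strength(arg, u, v)
             for i, u in enumerate(vcs) for v in vcs[i + 1:]}
clusters = partition(vcs, strengths)  # js1, js2 share P2; js3 stands alone
```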

13

Connection Strength

Computation of c(u,v)

Phase 1: Discover connections
– all L-short simple paths between u and v
– this is the bottleneck
– optimizations exist, but are not covered in IQIS’05

Phase 2: Measure the strength
– of the discovered connections
– many c(u,v) models exist
– we use a model similar to diffusion kernels

[Figure: example connections between nodes u and v through intermediate nodes A–H and z.]
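A minimal sketch of the two phases, assuming a fixed base weight per edge for illustration (the paper derives the weights from edge types, and uses optimized path discovery rather than this plain depth-first search):

```python
# Phase 1: depth-first enumeration of all simple paths of length <= L.
def l_short_simple_paths(adj, u, v, L):
    paths, stack = [], [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v:
            paths.append(path)
            continue
        if len(path) - 1 >= L:            # reached the path-length limit
            continue
        for nxt in adj[node]:
            if nxt not in path:           # simple path: no repeated nodes
                stack.append((nxt, path + [nxt]))
    return paths

# Phase 2: score each discovered path and sum the scores; here every
# edge gets the same base weight purely for illustration.
def connection_strength(adj, base_weight, u, v, L=7):
    total = 0.0
    for path in l_short_simple_paths(adj, u, v, L):
        score = 1.0
        for a, b in zip(path, path[1:]):
            score *= base_weight(a, b)
        total += score
    return total

# u connects to v through A and then either B or C (illustrative graph).
adj = {"u": ["A"], "A": ["u", "B", "C"], "B": ["A", "v"],
       "C": ["A", "v"], "v": ["B", "C"]}
c_uv = connection_strength(adj, lambda a, b: 0.5, "u", "v")  # 2 * 0.5**3
```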

14

Existing c(u,v) Models

Models for c(u,v)
– many exist
  – diffusion kernels, random walks, etc.
– none is fully adequate
  – they cannot learn similarity from the data

Diffusion kernels

– σ1(x,y): “base similarity”, via direct links (paths of length 1)
– σk(x,y): “indirect similarity”, via links of length k
– B: base similarity matrix, where Bxy = σ1(x,y)
– B^k: indirect similarity matrix
– K: total similarity matrix, or “kernel”, aggregating the B^k terms
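The quantities on this slide can be sketched with NumPy; the decay factor lam and the series truncation below are illustrative choices (classic diffusion kernels use an exponential series, K = e^{λB}):

```python
import numpy as np

# Base similarity matrix B over direct links: a 3-node chain x0 - x1 - x2.
B = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

lam = 0.5                       # decay: longer paths count for less
K = np.zeros_like(B)            # total similarity matrix ("kernel")
Bk = np.eye(3)
for k in range(1, 6):           # truncate the series at length-5 paths
    Bk = Bk @ B                 # Bk now equals B^k: similarity via length-k links
    K += lam ** k * Bk
```

Even though x0 and x2 share no direct link (B[0,2] = 0), the kernel assigns them positive similarity through the indirect B^k terms.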

15

Our c(u,v) Model

[Figure: a path between John Smith and Alan White through publication P1 and MIT, with edge types T1, T2, … labeled along the way.]

Our c(u,v) model
– regular edges have types T1, …, Tn
– types T1, …, Tn have weights w1, …, wn
– σ1(x,y) = wi
  – take the type of the given edge
  – assign its weight as the base similarity
– paths with similarity edges
  – might not exist; use heuristics

Our model vs. diffusion kernels
– virtually identical, but…
– we do not compute the whole matrix K
  – we compute one c(u,v) at a time
– we limit path lengths by L
– σ1(x,y) is unknown in general
  – the analyst assigns the weights
  – learning them from data is ongoing work

[Figure: three example connection subgraphs (a), (b), (c) linking representations such as R1:John and R2:J.Smith through publications P1–P4, co-authors A1–A7, MIT, and Stanford.]

16

Consolidation via Partitioning

Observations
– each VCS contains representations of at least one object
– if a representation is in a VCS, then the rest of the representations of the same object are in it too

Partitioning
– two cases: k, the number of entities in the VCS, is known, or k is unknown
– when k is known, use any partitioning algorithm
  – maximize intra-cluster connections, minimize inter-cluster connections
  – we use normalized cut [Shi, Malik 2000]
– when k is unknown
  – split into two, just to see the cut
  – compare the cut against a threshold
  – decide whether to actually split

[Figure: two VCSs partitioned into clusters; representations of entities 1–5 are grouped by their connection strengths.]
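A rough sketch of the k-unknown case, using spectral bisection (the sign of the Fiedler vector, a standard relaxation of graph cuts) in place of the full Shi–Malik normalized-cut algorithm; the matrix W of c(u,v) values and the threshold are illustrative:

```python
import numpy as np

def spectral_bisect(W):
    # Sign of the Fiedler vector (eigenvector of the 2nd-smallest
    # eigenvalue of the graph Laplacian) gives a two-way split.
    d = W.sum(axis=1)
    L = np.diag(d) - W
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return vecs[:, 1] >= 0

def cut_weight(W, part):
    # Total c(u,v) weight crossing the proposed two-way split.
    return W[np.ix_(part, ~part)].sum()

# c(u,v) matrix for four representations: two tightly connected pairs
# (weight 9) joined by weak links (weight 1).  Values are illustrative.
W = np.array([[0, 9, 1, 0],
              [9, 0, 0, 1],
              [1, 0, 0, 9],
              [0, 1, 9, 0]], dtype=float)

part = spectral_bisect(W)
cut = cut_weight(W, part)                # only weight 2 crosses the split
split = cut < 3.0                        # cut below threshold: do split
```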

17

Measuring Quality of Outcome

Existing measures
– dispersion [DMKD’04]
  – for an entity: into how many clusters its representations are scattered; ideal is 1
– diversity
  – for a cluster: how many distinct entities it covers; ideal is 1
– easy, with clear semantics
  – but they have problems, see the figure

Entropy
– for an entity: if, out of its m representations, m1 go to cluster C1, …, mn to Cn, then
  H = −Σi (mi/m) log (mi/m)
– for a cluster consisting of representations m1 of entity E1, …, mn of En, the formula is the same
– ideal entropy is zero

Ideal Clustering
  C1: Div 1, H 0       C2: Div 1, H 0
  E1: Dis 1, H 0       E2: Dis 1, H 0

One Misassigned (Example 1)
  C1: Div 2, H 0.65    C2: Div 2, H 0.65
  E1: Dis 2, H 0.65    E2: Dis 2, H 0.65

Half Misassigned
  C1: Div 2, H 1       C2: Div 2, H 1
  E1: Dis 2, H 1       E2: Dis 2, H 1

One Misassigned (Example 2)
  C1: Div 2, H 0.592   C2: Div 1, H 0
  E1: Dis 1, H 0       E2: Dis 2, H 0.65

Dis/Div cannot distinguish One Misassigned (Example 1) from Half Misassigned.
Entropy can: since 0.65 < 1, the first clustering is better.
In Example 2, the average entropy decreases (improves) compared to Example 1.
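The entropy values in these examples follow directly from the definition H = −Σi (mi/m) log2(mi/m); a quick sketch (the count vectors correspond to the cluster contents shown on the slide):

```python
from math import log2

def entropy(counts):
    # H = -sum_i (m_i / m) * log2(m_i / m), over nonzero counts.
    m = sum(counts)
    return -sum((mi / m) * log2(mi / m) for mi in counts if mi > 0)

entropy([6])      # a pure cluster (ideal): 0.0
entropy([5, 1])   # five of one entity, one of another: ~0.65
entropy([3, 3])   # half misassigned: 1.0
entropy([6, 1])   # the seven-element cluster of Example 2: ~0.592
```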

18

Experimental Setup

Parameters
– L-short simple paths, L = 7
– L is the path-length limit

Note
– the algorithm is applied to the “tough cases”, after FBS has already successfully consolidated many entries!

RealMov
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing

Uncertainty
– d1, d2, …, dn are director entities
– pick a fraction, e.g., d1, d2, …, d10
– group them, e.g., in groups of two: {d1,d2}, …, {d9,d10}
– make all representations within each group (e.g., of d1 and d2) indiscernible by FBS

Baseline 1
– one cluster per VCS, regardless
– naive, but yields ideal dispersion and H(E)

Baseline 2
– knows the grouping statistics
– guesses the number of entities in each VCS
– randomly assigns representations to clusters

19

Sample Movies Data

20

The Effect of L on Quality

[Plots: Cluster Entropy & Diversity; Entity Entropy & Dispersion.]

21

Effect of Threshold and Scalability

22

Summary

RelDC
– developed in Aug 2003 (for reference disambiguation)
– a domain-independent data cleaning framework
– uses relationships for data cleaning
  – reference disambiguation [SDM’05]
  – object consolidation [IQIS’05]

Ongoing work
– “learning” the importance of relationships from data

23

Contact Information

RelDC project
www.ics.uci.edu/~dvk/RelDC
www.itr-rescue.org (RESCUE)

Zhaoqi Chen
[email protected]

Dmitri V. Kalashnikov
www.ics.uci.edu/~dvk
[email protected]

Sharad Mehrotra
www.ics.uci.edu/~sharad
[email protected]