Self-tuning in Graph-Based Reference Disambiguation

Rabia Nuray-Turan Dmitri V. Kalashnikov

Sharad Mehrotra

Computer Science Department, University of California, Irvine

April 19, 2023 DASFAA 2007, Bangkok, Thailand 2

Overview

• Intro to Data Cleaning
– Entity resolution

• RelDC Framework
– Past work

• Adapting to data
– The new part
– Reduction to an optimization problem
– Linear programming

• Experiments

Raw Datasets (uncertainty, errors, multiple sources)

... J. Smith ...
... John Smith ...
... Jane Smith ...
MIT
Intel Inc.
?

Regular Database (can be analyzed)

John Smith    Intel
Jane Smith    MIT
...           ...

Data Cleaning

Analysis on bad data leads to wrong conclusions

Example of the problem: CiteSeer top-k

• Suspicious entries
– Let's go to the DBLP website, which stores bibliographic entries of many CS authors
– Let's check two people: “A. Gupta” and “L. Zhang”

[Figure: CiteSeer list of the top-k most cited authors, shown next to two DBLP pages]

Two Most Common Entity-Resolution Challenges

... J. Smith ...
... John Smith ...
... Jane Smith ...
MIT
Intel Inc.

Fuzzy lookup
– reference disambiguation
– match references to objects
– list of all objects is given

Fuzzy grouping
– group together object representations that correspond to the same object

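Standard Approach to Entity Resolution

[Figure: Traditional Methods: Features and Context. Two records, X and Y, are compared feature by feature (name, f2, f3) to decide whether they co-refer; the example values shown are “J. Smith”, “Jane Smith”, js@mit.edu and sm@yahoo.com.]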

Overview

• Intro to Data Cleaning

• RelDC Framework
– Past work

• Adapting to data
– The new part
– Reduction to an optimization problem
– Linear programming

• Experiments

RelDC Framework

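[Figure: RelDC Framework = Traditional Methods + Relationship Analysis. Traditional methods compare the features and context (f1 to f4) of X and Y; relationship analysis examines how X and Y are connected through other entities (A to F) in the ARG.]

RelDC: Relationship-based Data Cleaning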

RelDC Framework

• Past work
– SDM’05, TODS’06

• Domain-independent framework
– Views the dataset as an Entity-Relationship Graph
– Analyzes paths in this graph

• Solid theoretical foundation
– Optimization problem

• Scales to large datasets

• Robust under uncertainty

• High disambiguation quality

• No self-tuning
– This paper solves this challenge

Entity-Relationship Graph

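[Figure: entity-relationship graph. Example: the reference “J. Smith” with choice node r1, options “John Smith” and “Jane Smith”, and option-edge weights w1 and w2. General case: a choice node for reference r is connected to its context entity xr and, through option-edges er1, er2, …, erN with unknown weights wr1, wr2, …, wrN, to the options yr1, yr2, …, yrN. Legend: regular nodes, choice nodes, options of choice r, option-edges, context entity of r.]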

• Choice node
– For uncertain references
– Encodes the options/possibilities yr1, …, yrN

• Among options yr1, …, yrN
– Pick the most strongly connected one
– CAP principle
– Analyze paths in G that exist between xr and yrj, for all j
– Use a model to measure connection strength

• “Connection strength” model
– c(u,v), for nodes u and v in G
– Measures how strongly u and v are connected in G
– Past work: RandomWalk-based, fixed, based on intuition!
– This paper, instead, learns such a model from data.
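
The selection step can be sketched in Python as follows. This is a minimal illustration, assuming some connection-strength function c(u, v) is already available; the names `pick_option` and `toy_strength` and all numbers are hypothetical and do not come from the paper.

```python
from typing import Callable, Hashable, Sequence


def pick_option(
    context: Hashable,
    options: Sequence[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option most strongly connected to the context entity."""
    if not options:
        raise ValueError("a choice node needs at least one option")
    # max() keeps the first option in case of a tie.
    return max(options, key=lambda y: strength(context, y))


# Hypothetical connection strengths for a reference "J. Smith" whose
# context entity is "MIT" (illustrative numbers only).
TOY_STRENGTHS = {("MIT", "John Smith"): 0.2, ("MIT", "Jane Smith"): 0.7}


def toy_strength(u: Hashable, v: Hashable) -> float:
    """Look up a precomputed strength; unknown pairs count as unconnected."""
    return TOY_STRENGTHS.get((u, v), 0.0)


best = pick_option("MIT", ["John Smith", "Jane Smith"], toy_strength)
```

Passing the strength function as a parameter keeps the selection rule independent of the model, so a fixed RandomWalk-style model and a learned one could be swapped without changing this step.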

Overview

• Intro to Data Cleaning

• RelDC Framework
– Past work

• Adapting to data
– The new part
– Reduction to an optimization problem
– Linear programming

• Experiments

Adaptive Solution

• Classify the paths found in the graph into a finite set of path types: ST = {T1, T2, …, TN}

• If paths p1 and p2 are of the same type, they are treated as identical.

• The connection between nodes u and v can then be represented by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti between u and v.

• If each path type Ti can be associated with a weight wi, the connection strength is:

$$c(u,v) = \sum_{i=1}^{N} w_i \cdot c_i$$
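
As a small worked example of this weighted sum, the Python sketch below multiplies a weight vector by a path-type count vector. The function name `connection_strength` and the numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
from typing import Sequence


def connection_strength(weights: Sequence[float], counts: Sequence[int]) -> float:
    """Compute c(u, v) = sum_i w_i * c_i for one pair of nodes."""
    if len(weights) != len(counts):
        raise ValueError("need exactly one weight per path type")
    return sum(w * c for w, c in zip(weights, counts))


# Hypothetical example with N = 3 path types.
weights = [1.0, 0.5, 0.0]  # w_1, w_2, w_3 (assumed values)
t_uv = [2, 4, 7]           # c_1, c_2, c_3: path counts per type between u and v
strength = connection_strength(weights, t_uv)  # 1.0*2 + 0.5*4 + 0.0*7
```

A path type with weight 0 contributes nothing however many such paths exist, which is how a learned model can ignore uninformative relationships.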

Problems to Answer

• How will we classify the paths?

• How will we associate each path type with a weight?

Classifying Paths

• Path Type Model (PTM):
– Views each path as a sequence of edges <e1, e2, e3, …, en>
– Each edge ei has a type Ei associated with it
– Thus, each path p can be associated with a string <E1, E2, E3, …, En>
– Different strings correspond to different path types
– Associate a weight with each string

• Different models are also possible
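
The PTM classification can be sketched in Python: enumerate the simple paths between two nodes up to a length limit, read off each path's edge-type string, and count paths per string. The graph encoding, the helper names `build_graph` and `path_type_counts`, and the toy graph are assumptions made for this sketch, not the authors' implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(neighbor, edge type)]


def build_graph(edges: List[Tuple[str, str, str]]) -> Graph:
    """Build an undirected graph from (node, node, edge type) triples."""
    graph: Graph = {}
    for a, b, etype in edges:
        graph.setdefault(a, []).append((b, etype))
        graph.setdefault(b, []).append((a, etype))
    return graph


def path_type_counts(graph: Graph, u: str, v: str, max_len: int) -> Counter:
    """Count simple u-v paths with at most max_len edges, grouped by type string."""
    counts: Counter = Counter()

    def dfs(node: str, visited: set, types: Tuple[str, ...]) -> None:
        if node == v:
            counts[types] += 1  # reached the target: record this path's type
            return
        if len(types) == max_len:
            return  # path-length limit reached
        for nxt, etype in graph.get(node, []):
            if nxt not in visited:  # simple paths never revisit a node
                dfs(nxt, visited | {nxt}, types + (etype,))

    dfs(u, {u}, ())
    return counts


# Hypothetical toy graph; node names and edge types are illustrative only.
edges = [
    ("x", "a", "e1"), ("a", "y", "e3"),
    ("x", "b", "e1"), ("b", "y", "e3"),
    ("x", "c", "e2"), ("c", "d", "e2"), ("d", "y", "e3"),
]
counts = path_type_counts(build_graph(edges), "x", "y", max_len=5)
```

Here the two paths through `a` and `b` share the type string e1-e3 and are therefore treated as identical, while the path through `c` and `d` has type e2-e2-e3. The limit of 5 matches the path-length limit L used in the experimental setup.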

Learning Path Weights: Optimization Problem

• The CAP principle states that:
– the right option will be better connected

• Linear programming

• Learn the path-type weights wi

Final Solution

• The value of c(xr, yrj) - c(xr, yrl) should be maximized for all r and all l ≠ j, where yrj is the right option for reference r

• Then final solution:
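
The formula on this slide is not reproduced here. As a hedged illustration only, the Python sketch below solves one simplified reading of the problem: maximize the sum of c(xr, yrj) - c(xr, yrl) over all training references r and wrong options l, subject to 0 ≤ wi ≤ 1. Both the summed objective and the [0, 1] bounds are assumptions of this sketch, and the authors' linear program may contain further constraints. Under these assumptions the program separates per weight, so no LP solver is needed. The function names and training counts are hypothetical.

```python
from typing import List, Sequence, Tuple

# One training reference: (count vector of the correct option,
#                          count vectors of the incorrect options)
Example = Tuple[Sequence[int], List[Sequence[int]]]


def objective_coefficients(examples: List[Example], n_types: int) -> List[int]:
    """Sum T(x_r, y_rj) - T(x_r, y_rl) over all references r and wrong options l."""
    coeffs = [0] * n_types
    for correct, wrong_options in examples:
        for wrong in wrong_options:
            for i in range(n_types):
                coeffs[i] += correct[i] - wrong[i]
    return coeffs


def learn_weights(
    examples: List[Example], n_types: int, free_value: float = 0.5
) -> List[float]:
    """Maximize sum_i coeffs[i] * w_i subject to 0 <= w_i <= 1 (assumed bounds).

    With only box constraints the linear program separates per weight:
    w_i = 1 for a positive coefficient, 0 for a negative one, and any
    value in [0, 1] (here free_value) for a zero coefficient.
    """
    coeffs = objective_coefficients(examples, n_types)
    return [1.0 if d > 0 else 0.0 if d < 0 else free_value for d in coeffs]


# Hypothetical training data with N = 3 path types (illustrative counts only).
training = [
    ([3, 1, 0], [[1, 1, 2]]),             # reference 1: one wrong option
    ([2, 0, 1], [[0, 0, 1], [1, 0, 3]]),  # reference 2: two wrong options
]
learned = learn_weights(training, n_types=3)
```

In this toy data the first path type occurs more often on paths to correct options, so its weight goes to 1; the third occurs more often on paths to wrong options, so its weight goes to 0; the second does not discriminate and may take any value in [0, 1].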

Example - Graph

[Figure: example graph with context entities x1 and x2, choice nodes r1 and r2, and options y1 to y6, each marked + or -; edges have types e1, e2 and e3, and weight labels w1 to w4 appear on the graph.]

Path types:
P1 = e1-e3-e1
P2 = e1-e1-e3
P3 = e1-e2-e2-e3
P4 = e1-e2-e3-e2-e3

Example - Solution

• w1 = 1

• w3 = w4 = 0

• w2 can be anything between 0 and 1.

Overview

• Intro to Data Cleaning

• RelDC Framework
– Past work

• Adapting to data
– The new part
– Reduction to an optimization problem
– Linear programming

• Experiments

Experimental Setup

Parameters
– When looking for L-short simple paths, L = 5
– L is the path-length limit

SynPub datasets:
– many datasets of five different types
– emulation of RealPub
– publications (5K)
– authors (1K)
– organizations (25K)
– departments (125K)
– ground truth is known

RealMov:
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing
– ground truth is known

Experimental Results on Movies

Parameters:
– Fraction: the fraction of uncertain references in the dataset
– Each reference has 2 choices

Experimental Results on Movies - II

Number of options based on PMF Distribution

Experimental Results on SynPub

• RandomWalk, PTM and the Hybrid Model have the same accuracy

• Is RandomWalk the optimal model for the Publications domain?

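Hybrid Model:

$$c(u,v) = \sum_{p \in P_L(u,v)} c(p) \cdot w_i$$

where P_L(u,v) is the set of L-short simple paths between u and v, c(p) is the strength of path p, and w_i is the weight of the path type of p.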
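
The hybrid model is read here as a sum, over the L-short simple paths between u and v, of each path's own strength c(p) multiplied by the weight of its path type. The Python sketch below illustrates that reading. It assumes each c(p) has already been computed (presumably by the RandomWalk-based model); the names `hybrid_strength`, `paths_uv` and `type_weights` and all values are hypothetical.

```python
from typing import Dict, List, Tuple

# A path is summarized as (path type, per-path strength c(p)).
PathSummary = Tuple[str, float]


def hybrid_strength(paths: List[PathSummary], type_weights: Dict[str, float]) -> float:
    """Sum c(p) * w_i over the given paths, where i is the type of path p."""
    # Assumption: path types without a learned weight contribute nothing.
    return sum(c_p * type_weights.get(ptype, 0.0) for ptype, c_p in paths)


# Hypothetical paths between u and v with illustrative strengths and weights.
paths_uv = [("T1", 0.25), ("T1", 0.5), ("T2", 0.125)]
type_weights = {"T1": 1.0, "T2": 0.5}
strength = hybrid_strength(paths_uv, type_weights)  # 0.25 + 0.5 + 0.0625
```

Unlike the pure path-type model, two paths of the same type can contribute different amounts here, because each keeps its own c(p).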

Effect of Random Relationships in the Publications Domain

Summary

• Main Contribution
– An adaptive solution for connection strength
– The model learns the weights of different path types

• Ongoing work
– Using different models to learn the importance of paths in the connection strength
– Using standard machine learning techniques for learning, such as decision trees
– Different ways to classify paths

Contact Information

• RelDC project
– www.ics.uci.edu/~dvk/RelDC
– www.itr-rescue.org (RESCUE)

• Rabia Nuray-Turan (contact author)
– www.ics.uci.edu/~rnuray

• Dmitri V. Kalashnikov
– www.ics.uci.edu/~dvk

• Sharad Mehrotra
– www.ics.uci.edu/~sharad

Thank you!
