Self-tuning in Graph-Based Reference Disambiguation Rabia Nuray-Turan Dmitri V. Kalashnikov Sharad Mehrotra Computer Science Department University of California, Irvine


Page 1: Self-tuning in Graph-Based Reference Disambiguation

Self-tuning in Graph-Based Reference Disambiguation

Rabia Nuray-Turan Dmitri V. Kalashnikov

Sharad Mehrotra

Computer Science Department, University of California, Irvine

Page 2: Self-tuning in Graph-Based Reference Disambiguation

April 19, 2023 DASFAA 2007, Bangkok, Thailand 2

Overview

• Intro to Data Cleaning
  – Entity resolution
• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming

• Experiments

Page 3: Self-tuning in Graph-Based Reference Disambiguation


Data Cleaning

[Figure: raw datasets (uncertainty, errors, multiple sources) containing references such as "J. Smith", "John Smith", "Jane Smith", "MIT" and "Intel Inc." are turned by data cleaning into a regular database that can be analyzed, e.g. rows (John Smith, Intel) and (Jane Smith, MIT).]

Analysis on bad data leads to wrong conclusions

Page 4: Self-tuning in Graph-Based Reference Disambiguation


Example of the problem: CiteSeer top-K

CiteSeer: the top-k most cited authors

• Suspicious entries
  – Let's go to the DBLP website, which stores bibliographic entries of many CS authors
  – Let's check two people: “A. Gupta” and “L. Zhang”

[Figure: DBLP pages for the two authors.]

Page 5: Self-tuning in Graph-Based Reference Disambiguation


Two Most Common Entity-Resolution Challenges

[Figure: references "J. Smith", "John Smith" and "Jane Smith" alongside the entities MIT and Intel Inc.]

Fuzzy lookup
  – reference disambiguation
  – match references to objects
  – list of all objects is given

Fuzzy grouping
  – group together object representations that correspond to the same object

Page 6: Self-tuning in Graph-Based Reference Disambiguation


Standard Approach to Entity Resolution

[Figure: Traditional methods rely on features and context. Two descriptions, X and Y ("J. Smith" and "Jane Smith"), are compared on features such as f2, f3 and an e-mail address.]

Page 7: Self-tuning in Graph-Based Reference Disambiguation


Overview

• Intro to Data Cleaning
• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming

• Experiments

Page 8: Self-tuning in Graph-Based Reference Disambiguation


RelDC Framework

Relationship-based Data Cleaning

[Figure: RelDC framework = traditional methods (features and context f1 to f4 of the descriptions X and Y) + relationship analysis over the ARG, a graph that links X and Y through other nodes A to F.]

Page 9: Self-tuning in Graph-Based Reference Disambiguation


RelDC Framework

• Past work
  – SDM’05, TODS’06
• Domain-independent framework
  – Views the dataset as an Entity-Relationship Graph
  – Analyzes paths in this graph
• Solid theoretical foundation
  – Optimization problem
• Scales to large datasets
• Robust under uncertainty
• High disambiguation quality
• No self-tuning
  – This paper solves this challenge

Page 10: Self-tuning in Graph-Based Reference Disambiguation


Entity-Relationship Graph

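[Figure: entity-relationship graph. The choice node for reference r links the context entity xr, through option-edges er1, er2, …, erN with unknown weights wr1, wr2, …, wrN, to the options yr1, yr2, …, yrN. Example: the reference "J. Smith" has the options "John Smith" and "Jane Smith" (edge weights w1, w2). Legend: regular nodes, choice nodes, options of choice r, option-edges, context entity of r.]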

• Choice node
  – For uncertain references
  – Encodes the options/possibilities yr1, …, yrN
• Among options yr1, …, yrN
  – Pick the most strongly connected one (CAP principle)
  – Analyze the paths in G that exist between xr and yrj, for all j
  – Use a model to measure connection strength
• “Connection strength” model
  – c(u,v), for nodes u and v in G
  – Measures how strongly u and v are connected in G
  – Past work: RandomWalk-based, fixed, based on intuition
  – This paper, instead, learns such a model from data.
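
The selection step described above can be sketched as follows. This is a minimal illustration, not the RelDC implementation: the function name `pick_option`, the `strength` callback and the toy strength values are all hypothetical.

```python
from typing import Callable, Hashable, Iterable


def pick_option(
    context: Hashable,
    options: Iterable[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option most strongly connected to the context entity."""
    return max(options, key=lambda option: strength(context, option))


# Hypothetical connection strengths c(x_r, y_rj), for illustration only.
toy_strength = {("x_r", "John Smith"): 0.7, ("x_r", "Jane Smith"): 0.2}

best = pick_option(
    "x_r",
    ["John Smith", "Jane Smith"],
    lambda u, v: toy_strength[(u, v)],
)
print(best)
```

Any connection strength model c(u,v) can be plugged in as `strength`; the rest of the talk is about how to obtain that model.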

Page 11: Self-tuning in Graph-Based Reference Disambiguation


Overview

• Intro to Data Cleaning

• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming

• Experiments

Page 12: Self-tuning in Graph-Based Reference Disambiguation


Adaptive Solution

• Classify the found paths in the graph into a finite set of path types

ST ={ T1, T2, …, TN}

• If paths p1 and p2 are of the same type then they are treated as identical.

• The connection between nodes u and v can be represented by a path-type count vector Tuv = { c1, c2, …, cN }, where ci is the number of paths of type Ti between u and v

• If each path type Ti can be associated with a weight wi, then the connection strength is:

$$c(u,v) = \sum_{i=1}^{N} w_i \cdot c_i$$
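
As a small worked example of this weighted sum, the sketch below computes c(u,v) from a weight vector and a path-type count vector. The function name and the numbers are illustrative assumptions, not values from the talk.

```python
from typing import Sequence


def connection_strength(weights: Sequence[float], counts: Sequence[int]) -> float:
    """Compute c(u, v) as the sum over path types of w_i * c_i."""
    if len(weights) != len(counts):
        raise ValueError("weights and counts must cover the same path types")
    return sum(w * c for w, c in zip(weights, counts))


# Hypothetical weights w_1..w_4 and path-type counts c_1..c_4 for one node pair.
weights = [1.0, 0.5, 0.0, 0.0]
counts_uv = [2, 1, 0, 3]

strength_uv = connection_strength(weights, counts_uv)
print(strength_uv)  # 2.5
```

Path types with weight 0 contribute nothing, no matter how many such paths connect u and v.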

Page 13: Self-tuning in Graph-Based Reference Disambiguation


Problems to Answer

• How will we classify the paths?

• How will we associate each path type with a weight?

Page 14: Self-tuning in Graph-Based Reference Disambiguation


Classifying Paths

• Path Type Model (PTM):
  – Views each path as a sequence of edges <e1,e2,e3,…,en>
  – Each edge ei has a type Ei associated with it
  – Thus, each path p can be associated with a string <E1,E2,E3,…,En>
  – Different strings correspond to different path types
  – Each string is associated with a weight
• Different models are also possible
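
The classification above can be sketched in a few lines. The code below is an illustration under stated assumptions: the graph is an undirected adjacency list with a type label on each edge, paths are simple and limited to `max_len` edges, and the names `undirected` and `path_type_counts` as well as the toy graph are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Adjacency list: node -> list of (neighbor, edge type).
Graph = Dict[str, List[Tuple[str, str]]]


def undirected(edges: List[Tuple[str, str, str]]) -> Graph:
    """Build an undirected typed graph from (u, v, edge_type) triples."""
    graph: Graph = {}
    for u, v, edge_type in edges:
        graph.setdefault(u, []).append((v, edge_type))
        graph.setdefault(v, []).append((u, edge_type))
    return graph


def path_type_counts(graph: Graph, source: str, target: str, max_len: int) -> Counter:
    """Count simple source-target paths of at most max_len edges, by edge-type string."""
    counts: Counter = Counter()

    def walk(node: str, visited: set, edge_types: list) -> None:
        if node == target:
            counts[tuple(edge_types)] += 1  # one more path of this type
            return
        if len(edge_types) == max_len:
            return  # path-length limit reached
        for neighbor, edge_type in graph.get(node, []):
            if neighbor not in visited:  # keep the path simple
                visited.add(neighbor)
                edge_types.append(edge_type)
                walk(neighbor, visited, edge_types)
                edge_types.pop()
                visited.remove(neighbor)

    walk(source, {source}, [])
    return counts


# Toy graph, for illustration only.
toy = undirected([
    ("x", "a", "e1"), ("a", "y", "e3"),
    ("x", "b", "e1"), ("b", "y", "e3"),
    ("x", "c", "e1"), ("c", "d", "e1"), ("d", "y", "e3"),
])
counts = path_type_counts(toy, "x", "y", max_len=5)
print(counts)
```

Here the two paths through a and b share the string <e1,e3> and are therefore counted as the same path type, while the path through c and d has type <e1,e1,e3>.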

Page 15: Self-tuning in Graph-Based Reference Disambiguation


Learning Path Weights: Optimization Problem

• The CAP principle states that the right option will be better connected

• Linear programming

• Learn the path-type weights wi

Page 16: Self-tuning in Graph-Based Reference Disambiguation


Final Solution

• The value of c(xr,yrj) - c(xr,yrl) should be maximized for all r and all l ≠ j, where yrj is the right option for reference r

• Then final solution:
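
One plausible way to write this requirement as a linear program is sketched below. It is an assumption of this sketch, not the formula from the talk, that the differences are summed into a single objective and that each weight is bounded to [0, 1]; c_i(u,v) denotes the number of type-T_i paths between u and v, and y_rj is the right option for reference r.

```latex
\begin{aligned}
\max_{w_1,\dots,w_N} \quad
  & \sum_{r} \sum_{l \neq j} \bigl[ c(x_r, y_{rj}) - c(x_r, y_{rl}) \bigr]
    = \sum_{r} \sum_{l \neq j} \sum_{i=1}^{N}
      w_i \bigl[ c_i(x_r, y_{rj}) - c_i(x_r, y_{rl}) \bigr] \\
\text{subject to} \quad
  & 0 \le w_i \le 1, \qquad i = 1, \dots, N
\end{aligned}
```

Because c(u,v) is a weighted sum of path-type counts, the objective is linear in the weights, which is what makes linear programming applicable.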

Page 17: Self-tuning in Graph-Based Reference Disambiguation


Example - Graph

[Figure: example graph with nodes x1 and x2, choice nodes r1 and r2, options y1 to y6 (each marked + or -), edges of types e1, e2 and e3, and paths annotated with the weights w1 to w4.]

Path types:
P1 = e1-e3-e1
P2 = e1-e1-e3
P3 = e1-e2-e2-e3
P4 = e1-e2-e3-e2-e3

Page 18: Self-tuning in Graph-Based Reference Disambiguation


Example - Solution

• w1 = 1

• w3 = w4 = 0

• w2 can be anything between 0 and 1.
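
One way to see how a weight can remain undetermined: if the objective is a linear function of the weights and each weight is only bounded to [0, 1] (both are assumptions of this sketch), then a positive coefficient pushes its weight to 1, a negative one pushes it to 0, and a zero coefficient leaves the weight free. The coefficients below are hypothetical and are not derived from the example graph; `maximize_box_linear` is an illustrative name.

```python
from typing import List, Optional, Sequence


def maximize_box_linear(coeffs: Sequence[float]) -> List[Optional[float]]:
    """Maximize sum(a_i * w_i) subject to 0 <= w_i <= 1.

    Returns 1.0 or 0.0 for weights that are determined, and None where
    any value in [0, 1] is optimal (zero coefficient).
    """
    solution: List[Optional[float]] = []
    for a in coeffs:
        if a > 0:
            solution.append(1.0)   # increasing w_i increases the objective
        elif a < 0:
            solution.append(0.0)   # increasing w_i decreases the objective
        else:
            solution.append(None)  # w_i does not affect the objective
    return solution


# Hypothetical aggregated coefficients for w_1..w_4.
hypothetical_coeffs = [3, 0, -1, -2]
solution = maximize_box_linear(hypothetical_coeffs)
print(solution)  # [1.0, None, 0.0, 0.0]
```

A general linear program with additional constraints would need an LP solver; this shortcut only holds for the box-constrained case assumed here.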

Page 19: Self-tuning in Graph-Based Reference Disambiguation


Overview

• Intro to Data Cleaning

• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming
• Experiments

Page 20: Self-tuning in Graph-Based Reference Disambiguation


Experimental Setup

Parameters
  – When looking for L-short simple paths, L = 5
  – L is the path-length limit

SynPub datasets:
  – many datasets of five different types
  – emulation of RealPub
  – publications (5K)
  – authors (1K)
  – organizations (25K)
  – departments (125K)
  – ground truth is known

RealMov:
  – movies (12K)
  – people (22K)
    – actors
    – directors
    – producers
  – studios (1K)
    – producing
    – distributing
  – ground truth is known

Page 21: Self-tuning in Graph-Based Reference Disambiguation


Experimental Results on Movies

Parameters:
  – Fraction: the fraction of uncertain references in the dataset
  – Each reference has 2 choices

Page 22: Self-tuning in Graph-Based Reference Disambiguation


Experimental Results on Movies - II

Number of options based on PMF Distribution

Page 23: Self-tuning in Graph-Based Reference Disambiguation


Experimental Results on SynPub

• RandomWalk, PTM and the Hybrid Model have the same accuracy

• Is RandomWalk the optimal model for the Publications domain?

Hybrid Model:

$$c(u,v) = \sum_{p \in P_L(u,v)} c(p) \cdot w_i$$
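
A minimal sketch of the hybrid formula, under the assumption that c(p) is a per-path strength supplied by the RandomWalk-based model and w_i is the learned weight of the type of path p. The function name, the default weight of 0 for unseen path types, and all numbers are hypothetical.

```python
from typing import Dict, Iterable, Tuple

PathType = Tuple[str, ...]


def hybrid_strength(
    paths: Iterable[Tuple[PathType, float]],
    type_weights: Dict[PathType, float],
) -> float:
    """Sum c(p) * w_i over all paths p, where w_i is the weight of p's type."""
    # Assumption: a path type without a learned weight contributes nothing.
    return sum(c_p * type_weights.get(path_type, 0.0) for path_type, c_p in paths)


# Hypothetical L-short simple paths between u and v: (path type, c(p)).
paths_uv = [
    (("e1", "e3"), 0.25),
    (("e1", "e3"), 0.5),
    (("e1", "e1", "e3"), 0.125),
]
type_weights = {("e1", "e3"): 1.0, ("e1", "e1", "e3"): 0.5}

strength = hybrid_strength(paths_uv, type_weights)
print(strength)  # 0.8125
```

Compared with the pure path-type model, each path here contributes its own strength c(p) instead of a count of 1.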

Page 24: Self-tuning in Graph-Based Reference Disambiguation


Effect of Random Relationships in the Publications Domain

Page 25: Self-tuning in Graph-Based Reference Disambiguation


Summary

• Main contribution
  – An adaptive solution for connection strength
  – The model learns the weights of different path types
• Ongoing work
  – Using different models to learn the importance of paths in the connection strength
  – Using standard machine learning techniques for learning, such as decision trees
  – Different ways to classify paths

Page 26: Self-tuning in Graph-Based Reference Disambiguation


Contact Information

• RelDC project
  – www.ics.uci.edu/~dvk/RelDC
  – www.itr-rescue.org (RESCUE)
• Rabia Nuray-Turan (contact author)
  – www.ics.uci.edu/~rnuray
• Dmitri V. Kalashnikov
  – www.ics.uci.edu/~dvk
• Sharad Mehrotra
  – www.ics.uci.edu/~sharad

Page 27: Self-tuning in Graph-Based Reference Disambiguation

Thank you!