SOLOMON: SEEKING THE TRUTH VIA COPYING DETECTION Xin Luna Dong AT&T Labs-Research 8/2011

We Live in an Information Era. A visualization of the topology of a portion of the Internet. Web 2.0. But the Freely Accessible Information Has Its Downside.


Page 1: Solomon: Seeking the Truth Via Copying Detection

SOLOMON: SEEKING THE TRUTH VIA COPYING DETECTION

Xin Luna Dong, AT&T Labs-Research

8/2011

Page 5: Solomon: Seeking the Truth Via Copying Detection

False Information Can Be Propagated (I)

UA’s bankruptcy: Chicago Tribune, 2002

Sun-Sentinel.com

Google News

Bloomberg.com

The UAL stock plummeted to $3 from $12.5

Page 6: Solomon: Seeking the Truth Via Copying Detection

False Information Can Be Propagated (II)

Maurice Jarre (1924-2009) French Conductor and Composer

“One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”

2:29, 30 March 2009

Page 7: Solomon: Seeking the Truth Via Copying Detection

False Information Can Be Propagated (III)

“[Please spread the word] From my friend living in Chiba Prefecture. The weather forecast says it will rain from Monday. People living around Chiba, please be careful. The explosion at the Cosmo oil refinery will cause harmful substance to rise to clouds and become toxic rain. So when you go out, take your umbrella or raincoat, and make sure the rain doesn’t touch your body!”

“The creator of Pokemon died today in the #tsunami, #Japan. RIP: Satoshi Tajiri. #prayforjapan.” By xCyrusAndLovato

“The Creator of Hello Kitty, Yuko Yamaguchi, died today in Japan. #prayforjapan”

Relief aid from individuals: In order to avoid confusion, we ask that you please refrain [from distributing relief supplies].

Chain letters with specific bank account information for donations are getting sent around.

Please Help Japan! Earthquake Weapons caused Tsunami

Numerous rumors after the Japan earthquake and tsunami

Page 8: Solomon: Seeking the Truth Via Copying Detection

False Information Can Be Propagated (IV)

Posted by Andrew Breitbart in his blog

Page 9: Solomon: Seeking the Truth Via Copying Detection

The Internet needs a way to help people separate rumor from real science.

– Tim Berners-Lee

We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama

Page 10: Solomon: Seeking the Truth Via Copying Detection

Copying Can Happen on Structured Data (Copying of Weather Data)

Page 11: Solomon: Seeking the Truth Via Copying Detection

Copying Can Be Large Scaled (Copying of AbeBooks Data)

Data collected from AbeBooks[Yin et al., 2007]

Page 12: Solomon: Seeking the Truth Via Copying Detection

Intuitively Meaningful Clusters According to the Copying Relationships

Page 13: Solomon: Seeking the Truth Via Copying Detection

Intuitively Meaningful Clusters According to the Copying Relationships

Page 14: Solomon: Seeking the Truth Via Copying Detection

Copying Can Be Large Scaled (Copying of AbeBooks Data)

Page 15: Solomon: Seeking the Truth Via Copying Detection

Solomon

Goal
• Discover copying relationships between structured data sources
• Leverage the copying relationships to improve various components of data integration

Other applications
• Business purpose: data are valuable
• In-depth data analysis: information dissemination

Page 16: Solomon: Seeking the Truth Via Copying Detection

Solomon

Outline

Copying discovery
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]

Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [VLDB’11][EDBT’11]
• Record linkage [VLDB’10b]

Visualization and decision explanation
• Visualization
• Decision explanation [VLDB’10 demo]

Page 17: Solomon: Seeking the Truth Via Copying Detection

Problem Definition—Input

Src  ISBN  Name                                            Author
S1   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S1   2     Web Usability: A User-Centered Design Approach  Lazar, Jonathan
S2   1     IPV4: Theory, Protocol, and Practice            -
S2   2     Web Usability: A User                           Jonathan Lazar
S3   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S3   2     Web Usability: A User                           Jonathan Lazar
S4   1     IPV6: Theory, Protocol, and Practice            Loshin
S4   2     Web Usability: A User                           Lazar

Missing values

Different formats

Incorrect values

Objects: a real-world entity, described by a set of attributes; each associated w. a true value

Sources: each providing data for a subset of objects
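The input can be pictured as one small mapping per source. The sketch below is only an illustration: the dictionary layout, the use of `None` for a missing value, and the helper names are assumptions, not part of the talk. It groups the values that sources provide for each (ISBN, attribute) pair and lists the items on which the raw strings disagree.

```python
from collections import defaultdict

# Hypothetical layout: source -> {(isbn, attribute): value}; None marks a missing value.
SOURCES = {
    "S1": {(1, "name"): "IPV6: Theory, Protocol, and Practice",
           (1, "author"): "Loshin, Peter",
           (2, "name"): "Web Usability: A User-Centered Design Approach",
           (2, "author"): "Lazar, Jonathan"},
    "S2": {(1, "name"): "IPV4: Theory, Protocol, and Practice",
           (1, "author"): None,
           (2, "name"): "Web Usability: A User",
           (2, "author"): "Jonathan Lazar"},
    "S3": {(1, "name"): "IPV6: Theory, Protocol, and Practice",
           (1, "author"): "Loshin, Peter",
           (2, "name"): "Web Usability: A User",
           (2, "author"): "Jonathan Lazar"},
    "S4": {(1, "name"): "IPV6: Theory, Protocol, and Practice",
           (1, "author"): "Loshin",
           (2, "name"): "Web Usability: A User",
           (2, "author"): "Lazar"},
}


def values_by_item(sources):
    """Group provided (non-missing) values by data item: item -> {source: value}."""
    items = defaultdict(dict)
    for src, data in sources.items():
        for item, value in data.items():
            if value is not None:
                items[item][src] = value
    return dict(items)


def conflicting_items(sources):
    """Items for which the sources provide more than one distinct raw string."""
    return {item: provided
            for item, provided in values_by_item(sources).items()
            if len(set(provided.values())) > 1}


if __name__ == "__main__":
    for item, provided in sorted(conflicting_items(SOURCES).items()):
        print(item, provided)
```

Plain string comparison flags all four items as conflicting, because it cannot tell an incorrect value (IPV4 vs. IPV6) from a different format ("Loshin, Peter" vs. "Loshin"). That is the distinction the slide's three callouts point at.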

Page 18: Solomon: Seeking the Truth Via Copying Detection

Formatting Patterns for Author List

Page 19: Solomon: Seeking the Truth Via Copying Detection

Problem Definition: Output

For each S1, S2, decide the probability of S1 copying directly from S2
• A copier copies all or a subset of data
• A copier can add values and verify/modify copied values (independent contribution)
• A copier can re-format copied values (still considered as copied)

[Figure: copying graph over sources S1, S2, S3, S4]

Src  ISBN  Name                                            Author
S1   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S1   2     Web Usability: A User-Centered Design Approach  Lazar, Jonathan
S2   1     IPV4: Theory, Protocol, and Practice            -
S2   2     Web Usability: A User                           Jonathan Lazar
S3   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S3   2     Web Usability: A User                           Jonathan Lazar
S4   1     IPV6: Theory, Protocol, and Practice            Loshin
S4   2     Web Usability: A User                           Lazar

Page 20: Solomon: Seeking the Truth Via Copying Detection

Challenges in Copying Detection

• Sharing data may be due to both sources providing accurate data
• A copier can copy only a small fraction of data
• With only a snapshot it is hard to decide which source is a copier
• Copying relationship can be complex: co-copying, transitive copying

[Figure: copying graph over sources S1, S2, S3, S4]

Src  ISBN  Name                                            Author
S1   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S1   2     Web Usability: A User-Centered Design Approach  Lazar, Jonathan
S2   1     IPV4: Theory, Protocol, and Practice            -
S2   2     Web Usability: A User                           Jonathan Lazar
S3   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S3   2     Web Usability: A User                           Jonathan Lazar
S4   1     IPV6: Theory, Protocol, and Practice            Loshin
S4   2     Web Usability: A User                           Lazar

Page 21: Solomon: Seeking the Truth Via Copying Detection

High-Level Intuitions for Copying Detection

Intuition I: decide dependence (w/o direction)

For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect value

$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \;\Rightarrow\; S_1 \to S_2$

Page 22: Solomon: Seeking the Truth Via Copying Detection

Dependence?

Source 1 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; …; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama

Source 2 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; …; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama

Are Source 1 and Source 2 dependent?

Not necessarily

Page 23: Solomon: Seeking the Truth Via Copying Detection

Dependence? -- Common Errors

Source 1 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: Tom Jefferson; 4th: Abraham Lincoln; …; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Mickey Mouse; 44th: Barack Obama

Source 2 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: Tom Jefferson; 4th: Abraham Lincoln; …; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Mickey Mouse; 44th: John McCain

Are Source 1 and Source 2 dependent?

Very likely

Page 24: Solomon: Seeking the Truth Via Copying Detection

High-Level Intuitions for Copying Detection

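Intuition I: decide dependence (w/o direction)

For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect data

Intuition II: decide copying direction

Let F be a property function of the data (e.g., accuracy of data)

$|F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_1) - \Phi(S_2))| > |F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_2) - \Phi(S_1))| \;\Rightarrow\; S_1 \to S_2$

$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \;\Rightarrow\; S_1 \to S_2$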

Page 25: Solomon: Seeking the Truth Via Copying Detection

Dependence? -- Different Accuracy

Source 1 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: Abraham Lincoln; …; 41st: George W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: John McCain

Source 2 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: Tom Jefferson; 4th: Abraham Lincoln; …; 41st: Hillary Clinton; 42nd: William J. Clinton; 43rd: Mickey Mouse; 44th: John McCain

Are Source 1 and Source 2 dependent?

S2 more likely to be a copier

Page 26: Solomon: Seeking the Truth Via Copying Detection

Dependence? -- Different Accuracy

Source 1 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: Abraham Lincoln; …; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: George W. Bush; 44th: John McCain

Source 2 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: Tom Jefferson; 4th: Abraham Lincoln; …; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Mickey Mouse; 44th: John McCain

Are Source 1 and Source 2 dependent?

S1 more likely to be a copier
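The two "Different Accuracy" examples can be checked with a small sketch of Intuition II, taking F to be accuracy. It rests on several assumptions: only the eight positions shown on the slides are used (the "…" rows are skipped), a value counts as correct only on an exact string match (so "Tom Jefferson" counts as wrong), and the rule "the source whose own values differ more in accuracy from the shared values is the likely copier" is one reading of the inequality, not code from the talk.

```python
# Ground truth for the eight positions that appear on the slides.
TRUTH = {1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
         4: "James Madison", 41: "George H.W. Bush", 42: "William J. Clinton",
         43: "George W. Bush", 44: "Barack Obama"}


def source(*names):
    """Build a position -> name mapping in the same order as TRUTH."""
    return dict(zip(TRUTH, names))


# Page 25 example.
P25_S1 = source("George Washington", "John Adams", "Thomas Jefferson", "Abraham Lincoln",
                "George W. Bush", "William J. Clinton", "George W. Bush", "John McCain")
P25_S2 = source("George Washington", "Benjamin Franklin", "Tom Jefferson", "Abraham Lincoln",
                "Hillary Clinton", "William J. Clinton", "Mickey Mouse", "John McCain")
# Page 26 example.
P26_S1 = source("George Washington", "John Adams", "Thomas Jefferson", "Abraham Lincoln",
                "George W. Bush", "Hillary Clinton", "George W. Bush", "John McCain")
P26_S2 = source("George Washington", "Benjamin Franklin", "Tom Jefferson", "Abraham Lincoln",
                "George W. Bush", "Hillary Clinton", "Mickey Mouse", "John McCain")


def accuracy(src, keys):
    """F: fraction of the given positions on which src matches TRUTH exactly."""
    keys = list(keys)
    return sum(src[k] == TRUTH[k] for k in keys) / len(keys) if keys else 0.0


def direction_gaps(s1, s2):
    """Return (|F(shared) - F(S1 only)|, |F(shared) - F(S2 only)|)."""
    shared = [k for k in s1 if s1[k] == s2[k]]
    differing = [k for k in s1 if s1[k] != s2[k]]
    f_shared = accuracy(s1, shared)
    return (abs(f_shared - accuracy(s1, differing)),
            abs(f_shared - accuracy(s2, differing)))


def likely_copier(s1, s2):
    """Name the source with the larger gap, or None if the gaps are equal."""
    gap1, gap2 = direction_gaps(s1, s2)
    if gap1 == gap2:
        return None
    return "S1" if gap1 > gap2 else "S2"


if __name__ == "__main__":
    print(direction_gaps(P25_S1, P25_S2), likely_copier(P25_S1, P25_S2))
    print(direction_gaps(P26_S1, P26_S2), likely_copier(P26_S1, P26_S2))
```

Under these assumptions the sketch picks S2 for the first example and S1 for the second, in line with the conclusions stated on the two slides.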

Page 27: Solomon: Seeking the Truth Via Copying Detection

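Bayesian Analysis – Basic

[Figure: the data items provided by both S1 and S2, split into Same Values (TRUE: O.At; FALSE: O.Af) and Different Values (O.Ad)]

Observation: Ф

Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \to S_2 \mid \Phi)$ (sum up to 1)

According to the Bayes Rule, we need to know $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \to S_2)$

Key: computing $\Pr(\Phi_{O.A} \mid S_1 \perp S_2)$, $\Pr(\Phi_{O.A} \mid S_1 \to S_2)$ for each $O.A \in S_1 \cap S_2$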

Page 28: Solomon: Seeking the Truth Via Copying Detection

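Bayesian Analysis – Probability Computation

Pr    Independence                                        Copying
O.At  $(1-\varepsilon)^2$                                 $(1-\varepsilon)\,c + (1-\varepsilon)^2 (1-c)$
O.Af  $\frac{\varepsilon^2}{n}$                           $\varepsilon\,c + \frac{\varepsilon^2}{n} (1-c)$
O.Ad  $P_d = 1 - (1-\varepsilon)^2 - \frac{\varepsilon^2}{n}$   $P_d (1-c)$

ε: error rate; n: number of wrong values; c: copy rate

>: for same values, the Copying entry exceeds the Independence entry

[Figure: Same Values (TRUE: O.At; FALSE: O.Af) and Different Values (O.Ad) between S1 and S2]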
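As a rough illustration of how the three per-item probabilities combine under Bayes' rule, the sketch below multiplies them over the observed counts of shared-true, shared-false, and different values. The prior `prior`, the default parameter values, and the function name are assumptions made for the example; the talk does not give them.

```python
import math


def copy_posterior(k_true, k_false, k_diff, eps=0.2, n=100, c=0.8, prior=0.5):
    """Posterior probability of copying given counts of O.At, O.Af and O.Ad items.

    eps: error rate, n: number of wrong values, c: copy rate (slide notation);
    prior: assumed a-priori probability of copying (not from the slides).
    """
    p_t = (1 - eps) ** 2
    p_f = eps ** 2 / n
    p_d = 1 - p_t - p_f
    independent = (p_t, p_f, p_d)
    copying = ((1 - eps) * c + p_t * (1 - c),
               eps * c + p_f * (1 - c),
               p_d * (1 - c))
    counts = (k_true, k_false, k_diff)
    # Work in log space so that many items do not underflow.
    log_ind = math.log(1 - prior) + sum(k * math.log(p) for k, p in zip(counts, independent))
    log_cop = math.log(prior) + sum(k * math.log(p) for k, p in zip(counts, copying))
    delta = log_ind - log_cop
    if delta > 700:  # exp() would overflow; the posterior is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(delta))


if __name__ == "__main__":
    print("8 shared true values:      ", copy_posterior(8, 0, 0))
    print("4 shared true + 4 false:   ", copy_posterior(4, 4, 0))
    print("8 different values:        ", copy_posterior(0, 0, 8))
```

With these assumed parameters, sharing false values pushes the posterior up far more than sharing the same number of true values, and disagreeing values push it down, which matches the presidents examples earlier in the deck.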

Page 29: Solomon: Seeking the Truth Via Copying Detection

Considering Source Accuracy

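Pr    Independence                                            S1 Copies S2                               S2 Copies S1
O.At  $P_t = (1-\varepsilon_{S_1})(1-\varepsilon_{S_2})$      $(1-\varepsilon_{S_1})\,c + P_t (1-c)$     $(1-\varepsilon_{S_2})\,c + P_t (1-c)$
O.Af  $P_f = \frac{\varepsilon_{S_1}\varepsilon_{S_2}}{n}$    $\varepsilon_{S_1}\,c + P_f (1-c)$         $\varepsilon_{S_2}\,c + P_f (1-c)$
O.Ad  $P_d = 1 - P_t - P_f$                                   $P_d (1-c)$                                $P_d (1-c)$

≠: for O.At and O.Af, the entries for the two copying directions differ

[Figure: Same Values (TRUE: O.At; FALSE: O.Af) and Different Values (O.Ad) between S1 and S2]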

Page 30: Solomon: Seeking the Truth Via Copying Detection

Correctness of Data as Evidence for Copying

Src  ISBN  Name                                            Author
S1   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S1   2     Web Usability: A User-Centered Design Approach  Lazar, Jonathan
S2   1     IPV4: Theory, Protocol, and Practice            -
S2   2     Web Usability: A User                           Jonathan Lazar
S3   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S3   2     Web Usability: A User                           Jonathan Lazar
S4   1     IPV6: Theory, Protocol, and Practice            Loshin
S4   2     Web Usability: A User                           Lazar

[Figure: copying graph over sources S1, S2, S3, S4]

Page 31: Solomon: Seeking the Truth Via Copying Detection

Extending the Basic Technique

Consider correctness of data [VLDB’09a]

Consider additional evidence [VLDB’10a]

Page 32: Solomon: Seeking the Truth Via Copying Detection

Formatting as Evidence for Copying

Src  ISBN  Name                                            Author
S1   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S1   2     Web Usability: A User-Centered Design Approach  Lazar, Jonathan
S2   1     IPV4: Theory, Protocol, and Practice            -
S2   2     Web Usability: A User                           Jonathan Lazar
S3   1     IPV6: Theory, Protocol, and Practice            Loshin, Peter
S3   2     Web Usability: A User                           Jonathan Lazar
S4   1     IPV6: Theory, Protocol, and Practice            Loshin
S4   2     Web Usability: A User                           Lazar

[Figure: copying graph over sources S1, S2, S3, S4]

Different formats

SubValues

Page 33: Solomon: Seeking the Truth Via Copying Detection

Extending the Basic Technique

Consider correctness of data [VLDB’09a]

Consider additional evidence [VLDB’10a]

Consider correlated copying [VLDB’10a]

Page 34: Solomon: Seeking the Truth Via Copying Detection

Correlated Copying

    K  A1 A2 A3 A4
O1  S  S  S  D  D
O2  S  D  S  S  D
O3  S  S  D  S  D
O4  S  S  S  D  S
O5  S  D  S  S  S

17 same values, and 8 different values

    K  A1 A2 A3 A4
O1  S  S  S  S  S
O2  S  S  S  S  S
O3  S  S  S  S  S
O4  S  D  D  D  D
O5  S  D  D  D  D

17 same values, and 8 different values

Copying

S: Two sources providing the same value
D: Two sources providing different values
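The two tables can be compared with a few lines of code. This is only a sketch: the row strings transcribe the S/D grids above (columns K, A1..A4), and the "fully shared objects" count is an assumed, simplified stand-in for the per-object evidence that correlated copying relies on.

```python
# Each string is one object (row): columns K, A1, A2, A3, A4.
TABLE_A = ["SSSDD", "SDSSD", "SSDSD", "SSSDS", "SDSSS"]
TABLE_B = ["SSSSS", "SSSSS", "SSSSS", "SDDDD", "SDDDD"]


def summarize(rows):
    """Return (#same values, #different values, #objects shared on every attribute)."""
    same = sum(row.count("S") for row in rows)
    diff = sum(row.count("D") for row in rows)
    fully_shared = sum(1 for row in rows if set(row) == {"S"})
    return same, diff, fully_shared


if __name__ == "__main__":
    print("table A:", summarize(TABLE_A))
    print("table B:", summarize(TABLE_B))
```

Both tables have the same totals, so a model that treats every value independently cannot separate them. Only the second table has whole objects on which the two sources agree everywhere.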

Page 35: Solomon: Seeking the Truth Via Copying Detection

Extending the Basic Technique

Local Detection

Global Detection [VLDB’10a]

Consider correctness of data [VLDB’09a]

Consider additional evidence [VLDB’10a]

Consider correlated copying [VLDB’10a]

Consider updates [VLDB’09b]

Page 36: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

Multi-source copying: S1 {V1-V100}; S2 and S3: {V51-V130}, {V1-V50, V101-V130}

Co-copying: S1 {V1-V100}; S2 and S3: {V21-V70}, {V1-V50}

Transitive copying: S1 {V1-V100}; S2 and S3: {V21-V50, V81-V100}, {V1-V50} (V81-V100 are popular values)

Page 37: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

Local copying detection results

[Figure: the same three scenarios (multi-source copying, co-copying, transitive copying) over S1, S2, S3, with the same value sets]

Page 38: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

- Looking at the copying probabilities?

[Figure: the same three scenarios (multi-source copying, co-copying, transitive copying) over S1, S2, S3, with the same value sets]

Page 39: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

X Looking at the copying probabilities?
- Counting shared values?

[Figure: the same three scenarios over S1, S2, S3; in every scenario each pair of sources is labeled 1]

Page 40: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

X Looking at the copying probabilities?
X Counting shared values?
- Comparing the set of shared values?

[Figure: the same three scenarios over S1, S2, S3; in every scenario the three pairs of sources are labeled 50, 50, and 30]
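The 50/50/30 labels can be reproduced directly from the value sets. In the sketch below the helper names are illustrative, and the two non-S1 sources are called `first` and `second` because the extracted slide text does not make clear which set belongs to S2 and which to S3.

```python
def vrange(lo, hi):
    """Set of value identifiers V<lo>..V<hi>, inclusive."""
    return {f"V{i}" for i in range(lo, hi + 1)}


S1 = vrange(1, 100)

# scenario -> (first set, second set) for the two sources other than S1.
SCENARIOS = {
    "multi-source copying": (vrange(51, 130), vrange(1, 50) | vrange(101, 130)),
    "co-copying": (vrange(21, 70), vrange(1, 50)),
    "transitive copying": (vrange(21, 50) | vrange(81, 100), vrange(1, 50)),
}


def shared_counts(s1, first, second):
    """Sizes of the three pairwise intersections, largest first."""
    return sorted([len(s1 & first), len(s1 & second), len(first & second)], reverse=True)


if __name__ == "__main__":
    for name, (first, second) in SCENARIOS.items():
        print(name, shared_counts(S1, first, second))
```

All three scenarios produce the same three counts, which is why counting shared values alone cannot tell them apart.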

Page 41: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

X Looking at the copying probabilities?
X Counting shared values?
- Comparing the set of shared values?

[Figure: the same three scenarios over S1, S2, S3, with each pair of sources labeled by its set of shared values]

Multi-source copying: shared sets V1-V50; V101-V130; V51-V100

Co-copying: shared sets V1-V50; V21-V50; V21-V70

Transitive copying: shared sets V1-V50; V21-V50; V21-V50, V81-V100 (V81-V100 are popular values)

Page 42: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

X Looking at the copying probabilities?
X Counting shared values?
X Comparing the set of shared values?

[Figure: the same three scenarios over S1, S2, S3, with each pair of sources labeled by its set of shared values]

Multi-source copying: shared sets V1-V50; V101-V130; V51-V100

Co-copying: shared sets V1-V50; V21-V50; V21-V70

Transitive copying: shared sets V1-V50; V21-V50; V21-V50, V81-V100 (V81-V100 are popular values)

V21-V50 shared by 3 sources

We need to reason for each data item in a principled way!

Page 43: Solomon: Seeking the Truth Via Copying Detection

Global Copying Detection

1. Find a set of copyings R that significantly influence the rest of the copyings
• Maximize [formula missing]
• Finding R is NP-complete
• We propose a fast greedy algorithm

2. Adjust copying probability for the rest of the copyings: $P(S_1 \to S_2 \mid R)$
• Replace $\Pr(\Phi_{O.A}(S_1) \mid S_1 \perp S_2)$ everywhere with $\Pr(\Phi_{O.A}(S_1) \mid S_1 \perp S_2, R)$, which considers sources that S1 copies from according to R and that provide the same value on O.A as S1

$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \;\Rightarrow\; S_1 \to S_2$

Page 44: Solomon: Seeking the Truth Via Copying Detection

Multi-Source Copying? Co-copying? Transitive Copying?

[Figure: the same three scenarios over S1, S2, S3, with each pair of sources labeled by its set of shared values; some edges are marked X or ?]

Multi-source copying: R={S3→S1}, Pr(Ф(S3)) = Pr(Ф(S3)|R) for V101-V130

Co-copying: R={S3→S1}, Pr(Ф(S3)) << Pr(Ф(S3)|R) for V21-V50

Transitive copying: R={S3→S2}, Pr(Ф(S3)) << Pr(Ф(S3)|R) for V21-V50; Pr(Ф(S3)) is high for V81-V100 (V81-V100 are popular values)

Page 45: Solomon: Seeking the Truth Via Copying Detection

Experiment Setup

• 18 weather websites
• for 30 major USA cities
• collected every 45 minutes for a day
• 33 collections, so 990 objects
• 28 distinct attributes in total

Page 46: Solomon: Seeking the Truth Via Copying Detection

• 18 weather websites
• for 30 major USA cities
• collected every 45 minutes for a day
• 33 collections, so 990 objects
• 28 distinct attributes in total

Silver Standard

Page 47: Solomon: Seeking the Truth Via Copying Detection

Experiment Results

Measure: Precision, Recall, F-measure

C: real copying; D: detected copying

$P = \frac{|C \cap D|}{|D|}, \quad R = \frac{|C \cap D|}{|C|}, \quad F = \frac{2PR}{P+R}$

Methods                     Precision  Recall  F-measure
Corr (Only correctness)     .5         .43     .46
Enriched (More evidence)    1          .14     .25
Local (correlated copying)  .33        .86     .48
Global (global detection)   .79        .79     .79

Transitive/co-copying not removed

Ignoring evidence from correlated copying

Enriched improves over Corr when true/false notion does apply
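The three measures are the standard set-based definitions, and the F-measure column can be recomputed from the reported precision and recall. The sketch below uses illustrative function names and a made-up pair of copying sets for the usage example; only the four (P, R) pairs come from the table.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(real, detected):
    """P, R, F for a set of real copyings C and a set of detected copyings D."""
    hits = len(real & detected)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(real) if real else 0.0
    return precision, recall, f_measure(precision, recall)


# (precision, recall) as reported in the results table.
REPORTED = {"Corr": (0.5, 0.43), "Enriched": (1.0, 0.14),
            "Local": (0.33, 0.86), "Global": (0.79, 0.79)}

if __name__ == "__main__":
    for method, (p, r) in REPORTED.items():
        print(method, round(f_measure(p, r), 2))
    # Hypothetical copier -> original pairs, for illustration only.
    real = {("S2", "S1"), ("S3", "S1")}
    detected = {("S2", "S1"), ("S4", "S1")}
    print(precision_recall_f(real, detected))
```

Rounded to two decimals, the recomputed F values agree with the F-measure column above.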

Page 48: Solomon: Seeking the Truth Via Copying Detection

Solomon

Outline

Copying discovery
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]

Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [VLDB’11][EDBT’11]
• Record linkage [VLDB’10b]

Visualization and decision explanation
• Visualization
• Decision explanation [VLDB’10 demo]

Page 49: Solomon: Seeking the Truth Via Copying Detection

Data Integration Faces 3 Challenges

Data Conflicts

Instance Heterogeneity

Structure Heterogeneity

Page 50: Solomon: Seeking the Truth Via Copying Detection

Data Integration Faces 3 Challenges

Data Conflicts

Instance Heterogeneity

Structure Heterogeneity

Page 51: Solomon: Seeking the Truth Via Copying Detection

Data Integration Faces 3 Challenges

Data Conflicts

Instance Heterogeneity

Structure Heterogeneity

Scissors

Paper Scissors

Page 52: Solomon: Seeking the Truth Via Copying Detection

Data Integration Faces 3 Challenges

Data Conflicts

Instance Heterogeneity

Structure Heterogeneity

Scissors

Glue

Page 53: Solomon: Seeking the Truth Via Copying Detection

Existing Solutions Assume Independence of Data Sources

Data Conflicts

Instance Heterogeneity

Structure Heterogeneity

• Schema matching
• Model management
• Query answering using views
• Information extraction

• String matching (edit distance, token-based, etc.)
• Object matching (aka. record linkage, reference reconciliation, …)

• Data fusion
• Truth discovery

Assume INDEPENDENCE of data sources

Page 54: Solomon: Seeking the Truth Via Copying Detection

Data Conflicts

Instance Heterogeneity

Structure Heterogeneity

Source Copying Adds A New Dimension to Data Integration

Data Fusion
• Truth discovery [VLDB’09a, VLDB’09b]
• Online data fusion [VLDB’11]
• Integrating probabilistic data

Record Linkage
• Improve record linkage
• Distinguish between wrong values and alternative representations [VLDB’10b]

Query Answering
• Query optimization [EDBT’11]
• Improve schema matching

Source Recommendation
• Recommend trustworthy, up-to-date, and independent sources

Page 55: Solomon: Seeking the Truth Via Copying Detection

             S1      S2        S3
Stonebraker  MIT     Berkeley  MIT
Dewitt       MSR     MSR       UWisc
Bernstein    MSR     MSR       MSR
Carey        UCI     AT&T      BEA
Halevy       Google  Google    UW

Application I. Truth Discovery—Naïve Voting

Page 56: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery—Naïve Voting

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW
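Naïve voting on the five-source table is just a per-row majority count. The sketch below transcribes the table; the function name is illustrative.

```python
from collections import Counter

SOURCES = ["S1", "S2", "S3", "S4", "S5"]

# Affiliation claimed by each source, in the order of SOURCES.
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def naive_vote(values):
    """Return the value provided by the largest number of sources."""
    return Counter(values).most_common(1)[0][0]


if __name__ == "__main__":
    for person, values in CLAIMS.items():
        print(person, "->", naive_vote(values))
```

For Dewitt, Carey, and Halevy the winning value is backed only by S3, S4, and S5. If those three sources are not independent, the majority reflects one opinion counted three times, which is the problem the following slides address.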

Page 57: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery—Our Solution

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Round 1

[Figure: Copying Relationship graph over S1-S5 with edge probabilities (in extraction order) .87, .2, .2, .99, .99, .99]

[Figure: Truth Discovery for Carey: UCI (S1), AT&T (S2), BEA (S3, S4, S5), with vote annotations (1-.99*.8=.2) and (.22)]
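The annotation (1-.99*.8=.2) suggests that a source's vote is discounted by one minus the copy rate times the probability that it copied the value. The sketch below rests on that reading. The function name, the default copy rate of .8, and the choice to multiply the discounts when a source may have copied from several sources are assumptions, not statements from the slide.

```python
def independent_vote_weight(copy_probs, copy_rate=0.8):
    """Weight of a source's vote for a value.

    copy_probs: probabilities that this source copies from each other source
    that provides the same value. With no such source the weight is 1.
    """
    weight = 1.0
    for prob in copy_probs:
        weight *= 1 - copy_rate * prob
    return weight


if __name__ == "__main__":
    print(independent_vote_weight([]))            # a source with nobody to copy from
    print(independent_vote_weight([0.99]))        # one likely original
    print(independent_vote_weight([0.99, 0.99]))  # two likely originals
```

With one likely original at probability .99, the weight comes out close to the .2 shown on the slide.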

Page 58: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery—Our Solution

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Round 2

[Figure: Copying Relationship graph over S1-S5 with edge probabilities (in extraction order) .14, .49, .49, .49, .08, .49, .49, .49]

[Figure: Truth Discovery for Carey: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]

Page 59: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery—Our Solution

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Round 3

[Figure: Copying Relationship graph over S1-S5 with edge probabilities (in extraction order) .12, .49, .49, .49, .06, .49, .49, .49]

[Figure: Truth Discovery for Carey: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]

Page 60: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery—Our Solution

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Round 4

[Figure: Copying Relationship graph over S1-S5 with edge probabilities (in extraction order) .10, .48, .49, .50, .05, .49, .48, .50]

[Figure: Truth Discovery for Carey: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]

Page 61: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery—Our Solution

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Round 5

[Figure: Copying Relationship graph over S1-S5 with edge probabilities (in extraction order) .09, .47, .49, .51, .04, .49, .47, .51]

[Figure: Truth Discovery for Carey: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]

Page 62: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery—Our Solution

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Round 13

[Figure: Copying Relationship graph over S1-S5 with edge probabilities (in extraction order) .55, .49, .55, .49, .44, .44]

[Figure: Truth Discovery for Carey: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]

Page 63: Solomon: Seeking the Truth Via Copying Detection

Application I. Truth Discovery (Cont’d)

[Figure: iterative cycle of three steps (Step 1, Step 2, Step 3) linking Truth Discovery, Source-accuracy Computation, and Copying Detection]

Theorem: w/o accuracy, converges

Observation: w. accuracy, converges when #objs >> #srcs
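To give a feel for the iteration, here is a deliberately reduced sketch that alternates only two of the three steps, truth discovery and source-accuracy computation, and leaves copying detection out. It is not the algorithm from the talk. The weighting scheme (summing source accuracies per value), the initial accuracy of .8, and the stopping rule are all assumptions made for illustration.

```python
SOURCES = ["S1", "S2", "S3", "S4", "S5"]
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def accuracy_weighted_truth(claims, sources, initial_accuracy=0.8, max_rounds=20):
    """Alternate accuracy-weighted voting and accuracy re-estimation until stable."""
    accuracy = {s: initial_accuracy for s in sources}
    truth = {}
    for _ in range(max_rounds):
        # Truth discovery: per object, pick the value with the largest weighted vote.
        new_truth = {}
        for obj, values in claims.items():
            score = {}
            for src, value in zip(sources, values):
                score[value] = score.get(value, 0.0) + accuracy[src]
            new_truth[obj] = max(score, key=score.get)
        # Source accuracy: fraction of objects on which the source matches the chosen truth.
        accuracy = {
            src: sum(values[i] == new_truth[obj] for obj, values in claims.items()) / len(claims)
            for i, src in enumerate(sources)
        }
        if new_truth == truth:  # fixed point reached
            break
        truth = new_truth
    return truth, accuracy


if __name__ == "__main__":
    truth, accuracy = accuracy_weighted_truth(CLAIMS, SOURCES)
    print(truth)
    print(accuracy)
```

On the five-source table this reduced loop settles on the same values as naïve voting and assigns S3 and S4 perfect accuracy, since nothing in it discounts votes from sources that may share an origin. That gap is what the copying-detection step in the cycle is meant to fill.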

Page 64: Solomon: Seeking the Truth Via Copying Detection

Application II. QA & Online Data Fusion

Where is AT&T Shannon Research Labs?

[VLDB’11]


Page 71: Solomon: Seeking the Truth Via Copying Detection

Application II. QA & Online Data Fusion

Where is AT&T Shannon Research Labs?

[VLDB’11]

Quickly find answers

Computing probabilities

Source ordering

Page 72: Solomon: Seeking the Truth Via Copying Detection

Solomon

Outline

Copying discovery
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]

Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [EDBT’11]
• Record linkage [VLDB’10b]

Visualization and decision explanation
• Visualization
• Decision explanation [VLDB’10 demo]

Page 73: Solomon: Seeking the Truth Via Copying Detection

Copying of AbeBooks Data

AbeBooks data set:
• 877 bookstores, 1265 CS books, 24364 listings
• Copying between 465 pairs of sources

Page 74: Solomon: Seeking the Truth Via Copying Detection

Demo Here

Page 75: Solomon: Seeking the Truth Via Copying Detection

Related Work

Copying detection [Sigmod’11 Tutorial]
• Texts
• Programs
• Images/Videos
• Structured sources

Data provenance [Buneman et al., PODS’08]
• Focus on effective presentation and retrieval
• Assume knowledge of provenance/lineage

Page 76: Solomon: Seeking the Truth Via Copying Detection

Take-Aways

• Copying is common on the Web
• Copying can be detected using statistical approaches
• Knowing the copying relationship can benefit various aspects of data integration

Page 77: Solomon: Seeking the Truth Via Copying Detection

Acknowledgements

Ken Lyons (AT&T Research)

Divesh Srivastava (AT&T Research)

Alon Halevy (Google)

Yifan Hu (AT&T Research)

Remi Zajac (AT&T Research)

Songtao Guo (AT&T Interactive)

Laure Berti-Equille (Institute of Research for Development)

Xuan Liu (Singapore National Univ.)

Xian Li (SUNY Binghamton)

Amelie Marian (Rutgers Univ.)

Anish Das Sarma (Google)

Beng Chin Ooi (Singapore National Univ.)

Ordered by the amount of time spent at AT&T

Page 78: Solomon: Seeking the Truth Via Copying Detection

SOLOMON: SEEKING THE TRUTH VIA COPYING DETECTION

http://www2.research.att.com/~yifanhu/SourceCopying/

Page 79: Solomon: Seeking the Truth Via Copying Detection

What Is Missing? (a.k.a. Future Work)

Local Detection

Global Detection

• Loop copying
• Copying by category
• Summarizing copying patterns
• Exploring evidence from schemas, tuple ordering, etc.
• Scalability
• Detecting opinion influence
• Hidden Sources
• Global detection for dynamic data

Page 80: Solomon: Seeking the Truth Via Copying Detection

Data Conflicts

Instance Heterogeneity

Structure Heterogeneity

What is Missing (a.k.a. Future Work)

Data Fusion
• Truth discovery [VLDB’09a, VLDB’09b]
• Integrating probabilistic data

Record Linkage
• Improve record linkage
• Distinguish between wrong values and alternative representations [VLDB’10b]

Query Answering
• Query optimization [Submitted]
• Improve schema matching

Source Recommendation
• Recommend trustworthy, up-to-date, and independent sources

Page 81: Solomon: Seeking the Truth Via Copying Detection

Future Work: Explaining Copying-Detection Decisions

Provide the simplest, understandable explanation for Bayesian analysis

A copying detection decision is complex
• Why copying?
• Why a particular copying pattern (per-object copying vs. per-attribute copying)?
• Why a particular copying direction?
• Why is the local decision different from the global decision?

Answer “what-if” questions
• What if the two sources actually use the same format for those common values?
• What if there is a hidden source that S1 and S2 both copy from?

Answer “comparison” questions
• Why is S1 a copier of S2 but not a copier of S3?
• Why has S1 copied attributes “title” but not “authors”?

Page 82: Solomon: Seeking the Truth Via Copying Detection

Experiment on Static Data [VLDB’09a]

Dataset: AbeBooks
• 877 bookstores
• 1265 CS books
• 24364 listings, w. ISBN, name, author-list
• After pre-cleaning, each book on avg has 19 listings and 4 author lists (ranges from 1-23)

Golden standard: 100 random books
• Manually check author list from book cover

Measure: Precision=#(Corr author lists)/#(All lists)

Page 83: Solomon: Seeking the Truth Via Copying Detection

Naïve Voting and Types of Errors

Naïve voting has precision .71

Error type          Num
Missing authors     23
Additional authors  4
Mis-ordering        3
Mis-spelling        2
Incomplete names    2

Page 84: Solomon: Seeking the Truth Via Copying Detection

Contributions of Various Components

Methods                Prec  #Rnds  Time(s)
Naïve                  .71   1      .2
Only value similarity  .74   1      .2
Only source accuracy   .79   23     1.1
Only source copying    .83   3      28.3
Copy+accu              .87   22     185.8
Copy+accu+sim          .89   18     197.5

Precision improves by 25.4% over Naïve

Considering copying improves the results most

Reasonably fast
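The 25.4% figure is the relative gain of the best configuration (Copy+accu+sim, precision .89) over Naïve (.71). A minimal check, with an illustrative function name:

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100


if __name__ == "__main__":
    print(round(relative_improvement(0.71, 0.89), 1))
```

The same helper applied to the "Only source copying" row (.83) gives the gain from considering copying alone.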

Page 85: Solomon: Seeking the Truth Via Copying Detection

Experiment on Dynamic Data [VLDB’09b]

Dataset: Manhattan restaurants
• Data crawled from 12 restaurant websites
• 8 versions: weekly from 1/22/2009 to 3/12/2009
• 5269 restaurants, 5231 appearing in the first crawling and 5251 in the last crawling
• 467 restaurants deleted from some websites, 280 closed before 3/15/2009 (Golden standard)

Measure: Precision, Recall, F-measure

G: really closed restaurants; D: detected closed restaurants

$P = \frac{|G \cap D|}{|D|}, \quad R = \frac{|G \cap D|}{|G|}, \quad F = \frac{2PR}{P+R}$

Page 86: Solomon: Seeking the Truth Via Copying Detection

Discovered Copying

Copying is likely between 12 out of 66 pairs

Page 87: Solomon: Seeking the Truth Via Copying Detection

Contributions of Various Components

         Ever-existing  Closed
Method   #Rest          Prec  Rec  F-msr  #Rnds  Time(s)
ALL      -              .60   1.0  .75    -      -
ALL2     -              .94   .34  .50    -      -
Naïve    1192           .70   .93  .80    1      158
Quality  5068           .83   .88  .85    7      637
CopyQua  5186           .86   .87  .86    6      1408
Google   -              .84   .19  .30    -      -

Quality and CopyQua obtain high precision and recall

Applying rules is inadequate

Naïve missed a lot of restaurants

Google Map listed a lot of out-of-business restaurants