25 May 2016, AAFD & SFC’16 The Veracity of Big Data Pierre Senellart



23 April 2013, Dow Jones (cnn.com)

Twitter feed of Associated Press hacked

Algorithmic trading systems reacting to tweets


The Four Vs of Big Data

Volume: Data volumes beyond what is manageable by traditional data management systems (from TB to PB to EB)

Variety: Very diverse forms of data (text, multimedia, graphs, structured data), very diverse organization of data

Velocity: Data produced or changing at high speed (LHC: 100,000,000 collisions / second), more than can be stored

Veracity: Data quality very diverse; imprecise, imperfect, untrustworthy information


Uncertain data is everywhere

Numerous sources of uncertain data:

Measurement errors

Data integration from contradicting sources

Imprecise mappings between heterogeneous schemas

Imprecise automatic processes (information extraction, classification, natural language processing, etc.)

Imperfect human judgment

Lies, opinions, rumors


Uncertainty in Web information extraction

Never-ending Language Learning (NELL, CMU), http://rtw.ml.cmu.edu/rtw/kbbrowser/


Google Squared (terminated), screenshot from (Fink et al. 2011)


Subject        Predicate    Object             Confidence
Elvis Presley  diedOnDate   1977-08-16         97.91%
Elvis Presley  isMarriedTo  Priscilla Presley  97.29%
Elvis Presley  influences   Carlo Wolff        96.25%

YAGO, http://www.mpi-inf.mpg.de/yago-naga/yago

(Suchanek et al. 2007)


Dealing with Uncertainty

Three main research questions:

How to estimate the veracity of a source, or of a piece of information? ⇒ truth finding

How to ensure the provenance of a piece of information, to know where it comes from? ⇒ provenance management

How to efficiently process uncertain data at scale? ⇒ probabilistic database management systems


Outline

Introduction

Truth Finding
  Setting
  Model
  Experiments

Probabilistic Databases

Conclusion


Motivating Example

What are the capital cities of European countries?

           France   Italy     Poland      Romania     Hungary
Alice      Paris    Rome      Warsaw      Bucharest   Budapest
Bob        ?        Rome      Warsaw      Bucharest   Budapest
Charlie    Paris    Rome      Katowice    Bucharest   Budapest
David      Paris    Rome      Bratislava  Budapest    Sofia
Eve        Paris    Florence  Warsaw      Budapest    Sofia
Fred       Rome     ?         ?           Budapest    Sofia
George     Rome     ?         ?           ?           Sofia


Voting

Information: redundancy

           France   Italy     Poland      Romania     Hungary
Alice      Paris    Rome      Warsaw      Bucharest   Budapest
Bob        ?        Rome      Warsaw      Bucharest   Budapest
Charlie    Paris    Rome      Katowice    Bucharest   Budapest
David      Paris    Rome      Bratislava  Budapest    Sofia
Eve        Paris    Florence  Warsaw      Budapest    Sofia
Fred       Rome     ?         ?           Budapest    Sofia
George     Rome     ?         ?           ?           Sofia

Frequency  P. 0.67  R. 0.80   W. 0.60     Buch. 0.50  Bud. 0.43
           R. 0.33  F. 0.20   K. 0.20     Bud. 0.50   S. 0.57
                              B. 0.20
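The voting frequencies can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `COUNTRIES`, `ANSWERS` and `vote` are assumptions made for the example, and a "?" in the table is encoded as `None` and ignored when counting.

```python
from collections import Counter

COUNTRIES = ["France", "Italy", "Poland", "Romania", "Hungary"]

# Answers from the motivating example; None stands for "?" (no answer).
ANSWERS = {
    "Alice":   ["Paris", "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Bob":     [None, "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Charlie": ["Paris", "Rome", "Katowice", "Bucharest", "Budapest"],
    "David":   ["Paris", "Rome", "Bratislava", "Budapest", "Sofia"],
    "Eve":     ["Paris", "Florence", "Warsaw", "Budapest", "Sofia"],
    "Fred":    ["Rome", None, None, "Budapest", "Sofia"],
    "George":  ["Rome", None, None, None, "Sofia"],
}


def vote(answers):
    """Per country, the fraction of non-missing answers given to each value."""
    frequencies = {}
    for col, country in enumerate(COUNTRIES):
        counts = Counter(row[col] for row in answers.values() if row[col] is not None)
        total = sum(counts.values())
        frequencies[country] = {value: n / total for value, n in counts.items()}
    return frequencies


if __name__ == "__main__":
    for country, freq in vote(ANSWERS).items():
        ranked = sorted(freq.items(), key=lambda kv: -kv[1])
        print(country, ", ".join(f"{value} {p:.2f}" for value, p in ranked))
```

On this data, plain voting leaves Romania tied and prefers Sofia for Hungary, which is why redundancy alone is not enough.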


Evaluating Trustworthiness of Sources

Information: redundancy, trustworthiness of sources (= average frequency of predicted correctness)

           France   Italy     Poland      Romania     Hungary   Trust
Alice      Paris    Rome      Warsaw      Bucharest   Budapest  0.60
Bob        ?        Rome      Warsaw      Bucharest   Budapest  0.58
Charlie    Paris    Rome      Katowice    Bucharest   Budapest  0.52
David      Paris    Rome      Bratislava  Budapest    Sofia     0.55
Eve        Paris    Florence  Warsaw      Budapest    Sofia     0.51
Fred       Rome     ?         ?           Budapest    Sofia     0.47
George     Rome     ?         ?           ?           Sofia     0.45

Frequency weighted by trust:
           P. 0.70  R. 0.82   W. 0.61     Buch. 0.53  Bud. 0.46
           R. 0.30  F. 0.18   K. 0.19     Bud. 0.47   S. 0.54
                              B. 0.20
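A minimal sketch of this step, assuming that a source's trust is the plain average of the voting frequencies of the values it claims, and that each vote is then weighted by that trust. The function names `weighted_vote` and `source_trust` are hypothetical; "?" is encoded as `None`.

```python
# Columns: France, Italy, Poland, Romania, Hungary. None stands for "?".
ANSWERS = {
    "Alice":   ["Paris", "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Bob":     [None, "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Charlie": ["Paris", "Rome", "Katowice", "Bucharest", "Budapest"],
    "David":   ["Paris", "Rome", "Bratislava", "Budapest", "Sofia"],
    "Eve":     ["Paris", "Florence", "Warsaw", "Budapest", "Sofia"],
    "Fred":    ["Rome", None, None, "Budapest", "Sofia"],
    "George":  ["Rome", None, None, None, "Sofia"],
}


def weighted_vote(answers, trust):
    """Per question, the share of total trust backing each claimed value."""
    n_questions = len(next(iter(answers.values())))
    frequencies = []
    for col in range(n_questions):
        mass = {}
        for source, row in answers.items():
            if row[col] is not None:
                mass[row[col]] = mass.get(row[col], 0.0) + trust[source]
        total = sum(mass.values())
        frequencies.append({value: m / total for value, m in mass.items()})
    return frequencies


def source_trust(answers, frequencies):
    """Trust of a source: average frequency of the values it claims."""
    trust = {}
    for source, row in answers.items():
        scores = [frequencies[col][v] for col, v in enumerate(row) if v is not None]
        trust[source] = sum(scores) / len(scores)
    return trust


uniform = {source: 1.0 for source in ANSWERS}
plain_freq = weighted_vote(ANSWERS, uniform)   # plain voting (all sources equal)
trust = source_trust(ANSWERS, plain_freq)      # Alice comes out at about 0.60
trusted_freq = weighted_vote(ANSWERS, trust)   # Paris comes out at about 0.70
```

Under these assumptions the rounded values match the Trust column and the weighted frequencies in the table above.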


Iterative Fixpoint Computation

Information: redundancy, trustworthiness of sources with iterative fixpoint computation

           France   Italy     Poland      Romania     Hungary   Trust
Alice      Paris    Rome      Warsaw      Bucharest   Budapest  0.65
Bob        ?        Rome      Warsaw      Bucharest   Budapest  0.63
Charlie    Paris    Rome      Katowice    Bucharest   Budapest  0.57
David      Paris    Rome      Bratislava  Budapest    Sofia     0.54
Eve        Paris    Florence  Warsaw      Budapest    Sofia     0.49
Fred       Rome     ?         ?           Budapest    Sofia     0.39
George     Rome     ?         ?           ?           Sofia     0.37

Frequency weighted by trust:
           P. 0.75  R. 0.83   W. 0.62     Buch. 0.57  Bud. 0.51
           R. 0.25  F. 0.17   K. 0.20     Bud. 0.43   S. 0.49
                              B. 0.19
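The slides do not spell out the exact update rule or the stopping criterion. The sketch below assumes the simplest reading: alternate "trust-weighted vote" and "trust = average frequency of claimed values" until the trust scores stop changing. The names `fixpoint`, `max_iter` and `tol` are illustrative, and the result need not coincide digit for digit with the table above.

```python
# Columns: France, Italy, Poland, Romania, Hungary. None stands for "?".
ANSWERS = {
    "Alice":   ["Paris", "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Bob":     [None, "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Charlie": ["Paris", "Rome", "Katowice", "Bucharest", "Budapest"],
    "David":   ["Paris", "Rome", "Bratislava", "Budapest", "Sofia"],
    "Eve":     ["Paris", "Florence", "Warsaw", "Budapest", "Sofia"],
    "Fred":    ["Rome", None, None, "Budapest", "Sofia"],
    "George":  ["Rome", None, None, None, "Sofia"],
}


def weighted_vote(answers, trust):
    """Per question, the share of total trust backing each claimed value."""
    n_questions = len(next(iter(answers.values())))
    frequencies = []
    for col in range(n_questions):
        mass = {}
        for source, row in answers.items():
            if row[col] is not None:
                mass[row[col]] = mass.get(row[col], 0.0) + trust[source]
        total = sum(mass.values())
        frequencies.append({value: m / total for value, m in mass.items()})
    return frequencies


def source_trust(answers, frequencies):
    """Trust of a source: average frequency of the values it claims."""
    trust = {}
    for source, row in answers.items():
        scores = [frequencies[col][v] for col, v in enumerate(row) if v is not None]
        trust[source] = sum(scores) / len(scores)
    return trust


def fixpoint(answers, max_iter=1000, tol=1e-9):
    """Alternate both steps until trust changes by less than tol (assumed criterion)."""
    trust = {source: 1.0 for source in answers}
    for _ in range(max_iter):
        new_trust = source_trust(answers, weighted_vote(answers, trust))
        delta = max(abs(new_trust[s] - trust[s]) for s in answers)
        trust = new_trust
        if delta < tol:
            break
    return trust, weighted_vote(answers, trust)


trust, frequencies = fixpoint(ANSWERS)
```

The loop is bounded by `max_iter`, so it terminates even if the scores keep oscillating on other inputs.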


Context and problem

Context:
  Set of sources stating facts
  (Possible) functional dependencies between facts
  Fully unsupervised setting: we do not assume any information on truth values of facts or inherent trust in sources

Problem: determine which facts are true and which facts are false

Real world applications: query answering, source selection, data quality assessment on the web, making good use of the wisdom of crowds


General Model

Set of facts $F = \{f_1, \dots, f_n\}$

Examples: “Paris is capital of France”, “Rome is capital of France”, “Rome is capital of Italy”

Set of views (= sources) $V = \{V_1, \dots, V_m\}$, where a view is a partial mapping from $F$ to {T, F}

Example: $\lnot$ “Paris is capital of France” $\land$ “Rome is capital of France”

Objective: find the most likely real world $W$ given $V$, where the real world is a total mapping from $F$ to {T, F}

Example: “Paris is capital of France” $\land$ $\lnot$ “Rome is capital of France” $\land$ “Rome is capital of Italy” $\land$ ...
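One possible encoding of this model in Python, given purely as an illustration: a view is a dictionary that may omit facts (partial mapping), while the real world assigns a truth value to every fact (total mapping). The helper names `is_total` and `disagreements` are hypothetical.

```python
from typing import Dict, List

Fact = str
View = Dict[Fact, bool]   # partial mapping: unstated facts are simply absent
World = Dict[Fact, bool]  # total mapping: every fact has a truth value

FACTS: List[Fact] = [
    "Paris is capital of France",
    "Rome is capital of France",
    "Rome is capital of Italy",
]

# Example view: denies the first fact, asserts the second, says nothing about the third.
view: View = {
    "Paris is capital of France": False,
    "Rome is capital of France": True,
}

# Example real world: a truth value for every fact.
world: World = {
    "Paris is capital of France": True,
    "Rome is capital of France": False,
    "Rome is capital of Italy": True,
}


def is_total(mapping: Dict[Fact, bool], facts: List[Fact]) -> bool:
    """True if the mapping assigns a truth value to every fact."""
    return all(fact in mapping for fact in facts)


def disagreements(view: View, world: World) -> List[Fact]:
    """Facts the view states with a truth value different from the world's."""
    return [fact for fact, value in view.items() if world[fact] != value]
```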


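Generative Probabilistic Model (Galland et al. 2010)

For each pair $(V_i, f_j)$:
  with probability $\varphi(V_i)\varphi(f_j)$: ? (no statement)
  with probability $1 - \varphi(V_i)\varphi(f_j)$:
    with probability $\varepsilon(V_i)\varepsilon(f_j)$: $\lnot W(f_j)$
    with probability $1 - \varepsilon(V_i)\varepsilon(f_j)$: $W(f_j)$

$\varphi(V_i)\varphi(f_j)$: probability that $V_i$ “forgets” $f_j$
$\varepsilon(V_i)\varepsilon(f_j)$: probability that $V_i$ “makes an error” on $f_j$

Number of parameters: $n + 2(n + m)$

Size of data: $\tilde{\varphi} n m$ with $\tilde{\varphi}$ the average forget rate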
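The generative process can be simulated directly. The following is a sketch under the reading given above (forget first, then possibly err); the function name `generate_view` and its parameter names are assumptions made for the example, not notation from Galland et al.

```python
import random


def generate_view(world, phi_view, eps_view, phi_fact, eps_fact, rng):
    """Sample one view V_i from the real world W.

    world:     dict fact -> bool, the total mapping W
    phi_view:  forget factor of the view, phi(V_i)
    eps_view:  error factor of the view, eps(V_i)
    phi_fact:  dict fact -> forget factor phi(f_j)
    eps_fact:  dict fact -> error factor eps(f_j)
    rng:       a random.Random instance (kept explicit for reproducibility)
    """
    view = {}
    for fact, truth in world.items():
        if rng.random() < phi_view * phi_fact[fact]:
            continue                    # "?": the view forgets f_j
        if rng.random() < eps_view * eps_fact[fact]:
            view[fact] = not truth      # error: the view states the negation of W(f_j)
        else:
            view[fact] = truth          # the view states W(f_j)
    return view


if __name__ == "__main__":
    world = {"Paris is capital of France": True, "Rome is capital of France": False}
    half = {fact: 0.5 for fact in world}
    print(generate_view(world, 0.4, 0.2, half, half, random.Random(42)))
```

Truth finding is the inverse problem: recovering `world` (and the factors) from a collection of such sampled views.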


Obvious Approach

Method: use this generative model to find the most likely parameters given the data

Invert the generative model to compute the probability of a set of parameters given the data

Not practically applicable:
  Non-linearity of the model and Boolean parameter $W(f_j)$ ⇒ equations for inverting the generative model very complex
  Large number of parameters ($n$ and $m$ can both be quite large) ⇒ any exponential technique impractical

⇒ Heuristic fix-point algorithms (many proposed ones!)


Hubdub (1/2)

http://www.hubdub.com/

357 questions, 1 to 20 answers, 473 participants


Hubdub (2/2)

                                   Number of errors      Number of errors
                                   (no post-filtering)   (with post-filtering)
Voting                             278                   292
Counting                           340                   327
TruthFinder (Yin et al. 2007)      458                   274
3-Estimates (Galland et al. 2010)  272                   270


General-Knowledge Quiz (1/2)

http://www.madore.org/~david/quizz/quizz1.html

17 questions, 4 to 14 answers, 601 participants


General-Knowledge Quiz (2/2)

                                   Number of errors      Number of errors
                                   (no post-filtering)   (with post-filtering)
Voting                             11                    6
Counting                           12                    6
TruthFinder (Yin et al. 2007)      78                    77
3-Estimates (Galland et al. 2010)  9                     0


Many variations. . .

Modeling of various real-world phenomena:

Sources copying each other (Dong et al. 2010)

Complex source dependencies (Pochampally et al. 2014)

Similarity between attribute values (Yin et al. 2008)

Correlated group of attributes (Ba et al. 2015)

Heterogeneous data types (Q. Li et al. 2014)

. . .

See extensive evaluations of different techniques (X. Li et al. 2012; Waguih and Berti-Equille 2014). General problem far from being solved!


Outline

Introduction

Truth Finding

Probabilistic Databases
  Uncertainty Management
  Querying Probabilistic Databases

Conclusion


Different types of uncertainty

Two dimensions:

Different types:
  Unknown value: NULL in an RDBMS
  Alternative between several possibilities: either A or B or C
  Imprecision on a numeric value: a sensor gives a value that is an approximation of the actual value
  Confidence in a fact as a whole: cf. information extraction
  Structural uncertainty: the schema of the data itself is uncertain

Qualitative (NULL) or Quantitative (95%, low-confidence, etc.) uncertainty


Managing uncertainty

Objective
Not to pretend this imprecision does not exist, and manage it as rigorously as possible throughout a long, automatic and human, potentially complex, process.

Especially:

Represent all different forms of uncertainty

Use probabilities to represent quantitative information on the confidence in the data

Query data and retrieve uncertain results

Allow adding, deleting, modifying data in an uncertain way

Bonus (if possible): Keep as well lineage/provenance information, so as to ensure traceability


Why probabilities?

Not the only option: fuzzy set theory (Galindo et al. 2005), Dempster-Shafer theory (Zadeh 1986)

Mathematically rich theory, nice semantics with respect to traditional database operations (e.g., joins)

Some applications already generate probabilities (e.g., statistical information extraction or natural language probabilities)

In other cases, we “cheat” and pretend that (normalized) confidence scores are probabilities: see this as a first-order approximation


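Tuple-independent databases (TID)

S
a  a  1
b  v  0.5
b  w  0.2

[Figure: the same instance drawn as a graph, with edges (a, a), (b, v), (b, w) labelled 1, 0.5, 0.2]

This TID instance represents the following probability distribution:

Probability                     Possible world S
$0.5 \times 0.2$                (a, a), (b, v), (b, w)
$0.5 \times (1 - 0.2)$          (a, a), (b, v)
$(1 - 0.5) \times 0.2$          (a, a), (b, w)
$(1 - 0.5) \times (1 - 0.2)$    (a, a)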
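The distribution over possible worlds can be enumerated mechanically. In this illustrative sketch, the dictionary `S` and the function `possible_worlds` are names chosen for the example; each tuple is kept or dropped independently with its own probability, and worlds of probability zero are skipped.

```python
from itertools import product

# The TID instance S: tuple -> marginal probability.
S = {("a", "a"): 1.0, ("b", "v"): 0.5, ("b", "w"): 0.2}


def possible_worlds(tid):
    """Yield (world, probability) pairs for a tuple-independent instance."""
    tuples = list(tid)
    for keep in product([True, False], repeat=len(tuples)):
        prob = 1.0
        for t, kept in zip(tuples, keep):
            prob *= tid[t] if kept else 1.0 - tid[t]
        if prob > 0:
            yield frozenset(t for t, kept in zip(tuples, keep) if kept), prob


worlds = dict(possible_worlds(S))

if __name__ == "__main__":
    for world, prob in worlds.items():
        print(sorted(world), round(prob, 3))
```

The four worlds that remain carry probabilities 0.1, 0.4, 0.1 and 0.4, which sum to 1.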


Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

$q : \exists x\, y\ R(x) \land S(x, y) \land T(y)$

R
a  1
b  0.4
c  0.6

S
a  a  1
b  v  0.5
b  w  0.2

T
v  0.3
w  0.7
b  1

The query is true iff R(b) is present and one of:
  S(b, v) and T(v) are present
  S(b, w) and T(w) are present

→ Probability: $0.4 \times \bigl(1 - (1 - 0.5 \times 0.3) \times (1 - 0.2 \times 0.7)\bigr)$
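As a sanity check, the closed-form probability can be compared with a brute-force sum over all possible worlds. This is a sketch for illustration only: the encoding of the relations as dictionaries and the names `query_holds` and `query_probability` are assumptions of the example.

```python
from itertools import product

# TID relations: tuple -> marginal probability.
R = {("a",): 1.0, ("b",): 0.4, ("c",): 0.6}
S = {("a", "a"): 1.0, ("b", "v"): 0.5, ("b", "w"): 0.2}
T = {("v",): 0.3, ("w",): 0.7, ("b",): 1.0}


def query_holds(r, s, t):
    """q: there exist x, y with R(x), S(x, y) and T(y) all present."""
    return any((x,) in r and (y,) in t for (x, y) in s)


def query_probability(relations):
    """Sum the probabilities of all possible worlds in which the query holds."""
    facts = [(name, tup, p) for name, rel in relations.items() for tup, p in rel.items()]
    total = 0.0
    for keep in product([True, False], repeat=len(facts)):
        prob = 1.0
        world = {name: set() for name in relations}
        for (name, tup, p), kept in zip(facts, keep):
            prob *= p if kept else 1.0 - p
            if kept:
                world[name].add(tup)
        if prob > 0 and query_holds(world["R"], world["S"], world["T"]):
            total += prob
    return total


brute_force = query_probability({"R": R, "S": S, "T": T})
closed_form = 0.4 * (1 - (1 - 0.5 * 0.3) * (1 - 0.2 * 0.7))

if __name__ == "__main__":
    print(round(brute_force, 4), round(closed_form, 4))
```

Both computations give 0.1076. The enumeration is exponential in the number of tuples, so it only illustrates the semantics and is not a practical evaluation strategy.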

Page 51: The Veracity of Big Data - Pierre Senellart · 5/25/2016  · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional

32 / 38 AAFD & SFC’16 Pierre Senellart

Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

q : 9x y R(x) ^ S(x; y) ^ T (y)

R

a 1

b 0:4

c 0:6

S

a a 1

b v 0:5

b w 0:2

T

v 0:3

w 0:7

b 1

The query is true iff R(b) is here and one of:S(b; v) and T (v) are hereS(b; w) and T (w) are here

! Probability: 0:4��1� (1� 0:5� 0:3)� (1� 0:2� 0:7)

Page 52: The Veracity of Big Data - Pierre Senellart · 5/25/2016  · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional

32 / 38 AAFD & SFC’16 Pierre Senellart

Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

q : 9x y R(x) ^ S(x; y) ^ T (y)

R

a 1

b 0:4

c 0:6

S

a a 1

b v 0:5

b w 0:2

T

v 0:3

w 0:7

b 1

The query is true iff R(b) is here and one of:S(b; v) and T (v) are hereS(b; w) and T (w) are here

! Probability: 0:4��1� (1� 0:5� 0:3)� (1� 0:2� 0:7)

Page 53: The Veracity of Big Data - Pierre Senellart · 5/25/2016  · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional

32 / 38 AAFD & SFC’16 Pierre Senellart

Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

q : 9x y R(x) ^ S(x; y) ^ T (y)

R

a 1

b 0:4

c 0:6

S

a a 1

b v 0:5

b w 0:2

T

v 0:3

w 0:7

b 1

The query is true iff R(b) is here and one of:S(b; v) and T (v) are hereS(b; w) and T (w) are here

! Probability: 0:4��1� (1� 0:5� 0:3)� (1� 0:2� 0:7)

Page 54: The Veracity of Big Data - Pierre Senellart · 5/25/2016  · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional

32 / 38 AAFD & SFC’16 Pierre Senellart

Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

q : 9x y R(x) ^ S(x; y) ^ T (y)

R

a 1

b 0:4

c 0:6

S

a a 1

b v 0:5

b w 0:2

T

v 0:3

w 0:7

b 1

The query is true iff R(b) is here and one of:S(b; v) and T (v) are hereS(b; w) and T (w) are here

! Probability: 0:4��1� (1� 0:5� 0:3)� (1� 0:2� 0:7)

Page 55: The Veracity of Big Data - Pierre Senellart · 5/25/2016  · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional

32 / 38 AAFD & SFC’16 Pierre Senellart

Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

q : 9x y R(x) ^ S(x; y) ^ T (y)

R

a 1

b 0:4

c 0:6

S

a a 1

b v 0:5

b w 0:2

T

v 0:3

w 0:7

b 1

The query is true iff R(b) is here and one of:S(b; v) and T (v) are hereS(b; w) and T (w) are here

! Probability: 0:4��1� (1� 0:5� 0:3)� (1� 0:2� 0:7)

Page 56: The Veracity of Big Data - Pierre Senellart · 5/25/2016  · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional

32 / 38 AAFD & SFC’16 Pierre Senellart

Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

q : 9x y R(x) ^ S(x; y) ^ T (y)

R

a 1

b 0:4

c 0:6

S

a a 1

b v 0:5

b w 0:2

T

v 0:3

w 0:7

b 1

The query is true iff R(b) is here and one of:S(b; v) and T (v) are hereS(b; w) and T (w) are here

! Probability: 0:4��1� (1� 0:5� 0:3)� (1� 0:2� 0:7)

Page 57: The Veracity of Big Data - Pierre Senellart · 5/25/2016  · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional

32 / 38 AAFD & SFC’16 Pierre Senellart

Query evaluation on probabilistic instances

We want to evaluate the probability of a query on a TID instance

q : 9x y R(x) ^ S(x; y) ^ T (y)

R

a 1

b 0:4

c 0:6

S

a a 1

b v 0:5

b w 0:2

T

v 0:3

w 0:7

b 1

The query is true iff R(b) is here and one of:

S(b; v) and T (v) are hereS(b; w) and T (w) are here

! Probability: 0:4��1� (1� 0:5� 0:3)� (1� 0:2� 0:7)

Complexity of probabilistic query evaluation (PQE)

What is the data complexity of probabilistic query evaluation on TID depending on the class Q of queries and class I of instances?

Existing dichotomy result: (Dalvi and Suciu 2012)

Q are (unions of) conjunctive queries, I is all TID instances
There is a class $S \subseteq Q$ of safe queries
PQE is PTIME for any $q \in S$ on all instances
PQE is #P-hard for any $q \in Q \setminus S$ on all instances
$q : \exists x\, y \;\; R(x) \land S(x, y) \land T(y)$ is unsafe!

Is there a smaller class I such that PQE is tractable for a larger Q?
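
For contrast with the unsafe query above, here is a minimal sketch of polynomial-time evaluation for a safe query, $\exists x\, y \;\; R(x) \land S(x, y)$, which is hierarchical and can be evaluated using independence alone. The instance and the function names are hypothetical, chosen for illustration; this is not the general algorithm of Dalvi and Suciu, only a hand-written plan for this one query, checked against possible-world enumeration.

```python
from itertools import product

# Hypothetical TID instance (not from the slides).
R = {"a": 0.4, "b": 0.6}
S = {("a", "u"): 0.5, ("a", "v"): 0.2, ("b", "u"): 0.3}


def safe_plan_probability(r, s):
    """P(exists x, y: R(x) and S(x, y)), using independence only."""
    none_matches = 1.0
    for x, p_r in r.items():
        # Probability that no S(x, y) fact is present for this x.
        no_s = 1.0
        for (x2, _y), p_s in s.items():
            if x2 == x:
                no_s *= 1.0 - p_s
        # Different values of x involve disjoint, hence independent, facts.
        none_matches *= 1.0 - p_r * (1.0 - no_s)
    return 1.0 - none_matches


def brute_force_probability(r, s):
    """Reference value: enumerate every possible world."""
    facts = ([("R", k, p) for k, p in r.items()]
             + [("S", k, p) for k, p in s.items()])
    total = 0.0
    for choices in product([False, True], repeat=len(facts)):
        weight = 1.0
        present_r, present_s = set(), set()
        for present, (name, key, p) in zip(choices, facts):
            weight *= p if present else 1.0 - p
            if present:
                (present_r if name == "R" else present_s).add(key)
        if any(x in present_r for (x, _y) in present_s):
            total += weight
    return total


if __name__ == "__main__":
    print(safe_plan_probability(R, S), brute_force_probability(R, S))
```

The same trick fails for $R(x) \land S(x, y) \land T(y)$ in general: different values of x can share T(y) facts, so the per-x events are no longer independent.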

Trees and treelike instances

Idea: let I be treelike instances (constant bound on treewidth)

Trees have treewidth 1
Cycles have treewidth 2
k-cliques and $(k - 1)$-grids have treewidth $k - 1$

→ Known results (Courcelle 1990):
I: treelike instances; Q: monadic second-order queries

→ non-probabilistic QE is in linear time

→ Does this extend to probabilistic QE?
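
The treewidth values listed above can be reproduced on small graphs. The sketch below uses the standard min-degree elimination heuristic, which in general only yields an upper bound on treewidth; it happens to be exact on trees, cycles and cliques. The function name and the example graphs are illustrative assumptions, not material from the talk.

```python
def min_degree_width(edges):
    """Upper bound on treewidth via min-degree vertex elimination."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    width = 0
    while adj:
        # Eliminate a vertex of minimum degree.
        v = min(adj, key=lambda n: len(adj[n]))
        nbrs = adj.pop(v)
        width = max(width, len(nbrs))
        # Turn its neighbourhood into a clique (fill-in edges).
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}
    return width


# Small example graphs.
path = [(i, i + 1) for i in range(5)]                       # a tree
cycle = [(i, (i + 1) % 6) for i in range(6)]                # a 6-cycle
clique = [(i, j) for i in range(5) for j in range(i + 1, 5)]  # a 5-clique

if __name__ == "__main__":
    print(min_degree_width(path), min_degree_width(cycle), min_degree_width(clique))
```

On these inputs the bound matches the slide: 1 for the tree, 2 for the cycle, and k − 1 = 4 for the 5-clique.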

Dichotomy for PQE

An instance-based dichotomy result:

Upper bound. (Amarilli et al. 2015)

For I the treelike instances and Q the MSO queries
→ PQE is in linear time modulo arithmetic costs

Also for expressive provenance representations
Also with bounded-treewidth correlations

Lower bound. (Amarilli et al. 2016)

For any unbounded-tw family I and Q the FO queries
→ PQE is #P-hard under RP reductions assuming:

High-tw instances in I are easily constructible
Signature arity is 2 (graphs)
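
To give a feel for the upper bound, here is a minimal sketch of linear-time PQE on the simplest treelike instances: a directed path whose edges are independent facts. The query is fixed to $\exists x\, y\, z \;\; E(x, y) \land E(y, z)$, which on a path holds iff two consecutive edges are both present. This is a hand-written dynamic program for that one query, not the general construction of Amarilli et al.; the function names are illustrative. "Modulo arithmetic costs" corresponds here to counting each multiplication and addition as one step.

```python
from itertools import product


def prob_two_consecutive_edges(edge_probs):
    """P(two consecutive edges are both present), in one left-to-right pass."""
    # Probabilities of "no match so far", split by the state of the last edge.
    no_match_last_absent = 1.0
    no_match_last_present = 0.0
    for p in edge_probs:
        no_match_last_absent, no_match_last_present = (
            (no_match_last_absent + no_match_last_present) * (1.0 - p),
            no_match_last_absent * p,
        )
    return 1.0 - (no_match_last_absent + no_match_last_present)


def brute_force(edge_probs):
    """Reference value: enumerate every possible world of the path."""
    total = 0.0
    for world in product([False, True], repeat=len(edge_probs)):
        weight = 1.0
        for present, p in zip(world, edge_probs):
            weight *= p if present else 1.0 - p
        if any(a and b for a, b in zip(world, world[1:])):
            total += weight
    return total


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.7, 0.4, 0.5]
    print(prob_two_consecutive_edges(probs), brute_force(probs))
```

The pass keeps a constant amount of state per edge, so the running time is linear in the length of the path, whereas the enumeration is exponential.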

Application: Efficient querying of uncertain graphs (Maniu et al. 2014)

[Figure: example probabilistic graph whose edges carry labels such as "1: 0.75" and "2: 0.06", with parts labelled (α) to (ζ)]

Problem: Optimize query evaluation on probabilistic graphs

Challenge: Real graph data is not treelike

Methodology: Build partial tree decompositions and use different query evaluation techniques on treelike parts and on the rest of the data
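
One plausible way to build such a partial decomposition is sketched below, under the assumption that nodes of small degree are eliminated greedily until only a dense core remains. This is an illustration of the idea, not necessarily the procedure of Maniu et al.; the function name, the `max_width` parameter and the example graph are hypothetical.

```python
def partial_decomposition(edges, max_width):
    """Peel off nodes of degree <= max_width; return (bags, remaining core)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    bags = []
    while adj:
        v = min(adj, key=lambda n: len(adj[n]))
        if len(adj[v]) > max_width:
            break  # every remaining node is too connected: stop peeling
        nbrs = adj.pop(v)
        bags.append({v} | nbrs)  # bag of at most max_width + 1 nodes
        # Connect the neighbours so that later bags remain consistent.
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}
    return bags, set(adj)


# Hypothetical graph: a 5-clique (nodes 0..4) with a pendant path 4 - 5 - 6.
example = [(i, j) for i in range(5) for j in range(i + 1, 5)] + [(4, 5), (5, 6)]

if __name__ == "__main__":
    bags, core = partial_decomposition(example, max_width=2)
    print(bags, core)
```

With `max_width=2`, the pendant path is turned into two small bags while the clique stays in the core, where a different evaluation technique would have to be used.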

Outline

Introduction

Truth Finding

Probabilistic Databases

Conclusion

Conclusion

The real world is uncertain

Tools we use to process the real world introduce uncertainty

Need for principled methods to:

Estimate uncertainty (veracity, truthfulness...) of information
Properly manage the confidence (probability, level of certainty...) in the information
Keep information on the provenance of data

Thank you.
