Once upon a time in Datatown (Or: Query-driven Data Completeness Management) Simon Razniewski Supervisor: Werner Nutt


Slides from my PhD defense admission exam, Bozen, 9.7.2014

Page 1: Once upon a time in Datatown

Once upon a time in Datatown (Or: Query-driven Data Completeness Management)

Simon Razniewski

Supervisor: Werner Nutt

Page 2: Once upon a time in Datatown

Once upon a time in Datatown …

One database for all schools

Page 3: Once upon a time in Datatown

Monitoring school developments

• The central school administration decided last year that instead of HTML, now Ruby on Rails shall be taught in Computer Science classes.

• School district administrator Alice wants to monitor the impact of this decision

How many pupils have grade A in Computer Science?

Query result:
2012: 8563
2013: 8619 (+0.7%)
2014: 3202 (-63%)

DB
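
The effect on Alice's query can be reproduced with a small sketch. The table layout, the school names and the row counts below are invented for illustration; only the question (how many pupils have grade A in Computer Science, per year) comes from the slide. The same count query is run once over the hypothetical real-world data and once over a database into which only part of the 2014 data has been loaded.

```python
import sqlite3


def count_a_grades(rows):
    """Count grade-A results in Computer Science per year over the given rows."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema; the slides do not show the real one.
    con.execute(
        "CREATE TABLE result "
        "(pupil TEXT, school TEXT, subject TEXT, year INTEGER, grade TEXT)"
    )
    con.executemany("INSERT INTO result VALUES (?, ?, ?, ?, ?)", rows)
    cur = con.execute(
        "SELECT year, COUNT(*) FROM result "
        "WHERE subject = 'CS' AND grade = 'A' "
        "GROUP BY year ORDER BY year"
    )
    counts = dict(cur.fetchall())
    con.close()
    return counts


# Invented real-world data: three schools, three A grades each, in both years.
real_world = [
    (f"p{year}{school}{i}", school, "CS", year, "A")
    for year in (2013, 2014)
    for school in ("A", "B", "C")
    for i in range(3)
]

# What has been loaded so far: all of 2013, but for 2014 only school A.
loaded = [row for row in real_world if row[3] == 2013 or row[1] == "A"]

complete_counts = count_a_grades(real_world)
loaded_counts = count_a_grades(loaded)
```

In this sketch `complete_counts` has 9 A grades for both years, while `loaded_counts` reports only 3 for 2014, although no school's results changed.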

Page 4: Once upon a time in Datatown

• Was teaching Ruby instead of HTML a terrible idea??

• Alice orders her assistant Bob to investigate

Bob calls school A … “No problem, everything as usual”

Bob calls school B … “No, the CS grades in our school are as usual”

Bob calls school C … “What do you want? Everything is fine here”

…

• Bob concludes that something must be wrong with the data

2014: 3202 (-63%)

Page 5: Once upon a time in Datatown

“Something must be wrong with the data”

• Bob calls the DB admin Tom

• Tom: “Dude, of course these numbers are nonsense, most of the data wasn’t loaded yet!”

Page 6: Once upon a time in Datatown

“Most of the data wasn’t loaded yet”

• Alice is relieved to hear that probably the change in teaching did not wreck the grades

• But how can such misunderstandings be prevented in the future?

Page 7: Once upon a time in Datatown

• Alice asks Bob and Tom to research a technique for analyzing whether query answers over partially complete databases are complete

• Tom finds cryptic old papers in the archive that seem related

• Tom: “Motro describes a similar problem to ours: When do queries return complete answers over incomplete databases?”

• Bob: “Levy introduces a formalism to describe which parts of a database table are complete”

• Tom: “But those papers do not contain algorithms”

How can such misunderstandings be prevented in the future?

“Obtaining Complete Answers from Incomplete Databases”, Alon Y. Levy, 1996

“Integrity = Validity + Completeness”, Amihai Motro, 1989

Page 8: Once upon a time in Datatown

“But those papers do not contain algorithms”

• Bob: “Maybe we can reduce this to conjunctive-query style containment?”

• Tom: “That works, but note that we also need to find procedures for asymmetric containment problems”

Bob and Tom sit down and write these procedures

Result 1: Development of decision procedures for completeness reasoning and complexity analysis [VLDB’11]
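
To make the idea of completeness statements concrete, here is a deliberately tiny sketch. It is not the decision procedure from the paper: it assumes a single table, queries and statements that consist only of equality conditions, and attribute domains large enough that several narrower statements cannot jointly cover a query. All names (`is_query_complete`, the column names, the example statements) are hypothetical. Under these assumptions, a query is guaranteed to be complete if some statement's conditions are all implied by the query's conditions, because every row the query asks for then falls into a part of the table that is known to be complete.

```python
from typing import Dict, List

# A condition maps a column name to the value it must equal.
Condition = Dict[str, object]


def is_query_complete(query: Condition, statements: List[Condition]) -> bool:
    """Sufficient check: some statement covers every row the query selects."""
    return any(
        all(col in query and query[col] == val for col, val in stmt.items())
        for stmt in statements
    )


# Hypothetical statements: "all results of school A are loaded"
# and "all results of the year 2013 are loaded".
statements = [{"school": "A"}, {"year": 2013}]

q_2013 = {"subject": "CS", "grade": "A", "year": 2013}
q_2014 = {"subject": "CS", "grade": "A", "year": 2014}
q_2014_school_a = {"subject": "CS", "grade": "A", "year": 2014, "school": "A"}

complete_2013 = is_query_complete(q_2013, statements)
complete_2014 = is_query_complete(q_2014, statements)
complete_2014_school_a = is_query_complete(q_2014_school_a, statements)
```

With these statements, the 2013 query and the school-A query for 2014 are reported as complete, while the district-wide 2014 query is not. The general case with joins is where the containment-style reasoning mentioned by Bob and Tom comes in.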

Page 9: Once upon a time in Datatown

Does this also work for null values?

• During the first presentation of a demo system to Alice, the system crashes

• Tom: “Understandable, because it is not clear what a null means: whether computer science was an ungraded subject, or whether the grade is missing”

• Alice: “Fix it!”

…Tom goes to work

Result 2: Extension of completeness reasoning to databases with null values, complexity analysis, and introduction of a technique to avoid the ambiguity of null values

[CIKM’12]

java.lang.NullPointerException ("Grade in CS is null")
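
The two readings of a null that Tom describes can be illustrated with a short sketch. This is only one possible way to make the distinction explicit and is not claimed to be the technique from the paper; the names `Status`, `CsGrade` and `count_a` are invented. Each grade carries an explicit status, so "CS was not graded" and "the grade is missing" can no longer be confused, and a count over the data can be reported as a range instead of a single, possibly misleading number.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    GRADED = "graded"
    NOT_APPLICABLE = "not applicable"  # CS was an ungraded subject
    UNKNOWN = "unknown"                # a grade exists but is missing


@dataclass(frozen=True)
class CsGrade:
    status: Status
    value: Optional[str] = None  # only set when status is GRADED


def count_a(grades: List[CsGrade]) -> Tuple[int, int]:
    """Return (lower bound, upper bound) on the number of A grades."""
    certain = sum(
        1 for g in grades if g.status is Status.GRADED and g.value == "A"
    )
    unknown = sum(1 for g in grades if g.status is Status.UNKNOWN)
    # Every unknown grade might turn out to be an A; ungraded ones cannot.
    return certain, certain + unknown


grades = [
    CsGrade(Status.GRADED, "A"),
    CsGrade(Status.GRADED, "B"),
    CsGrade(Status.NOT_APPLICABLE),
    CsGrade(Status.UNKNOWN),
    CsGrade(Status.GRADED, "A"),
]
bounds = count_a(grades)
```

For the five example records, `bounds` is `(2, 3)`: two certain A grades, plus one record whose grade is missing and could still be an A.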

Page 10: Once upon a time in Datatown

Late evening, Bar Nadamas

• Alice greatly impresses her colleague Frank, head of the statistics office of Datatown, with her new completeness tool/toy

• New EU guidelines on open government require Frank’s office to publish their data in RDF

– Frank: “Do you think this tool could be adapted to also handle RDF data?”

– Alice: “What’s the difference?”

– Frank: “Well, there’s the OPT-construct, the RDFS closure, and also, the completeness statements should be expressed in RDF themselves”

– Alice: “Let me ask Tom…”

Result 3: Formalisms and algorithms for assessing the completeness of SPARQL queries over RDF data

[ISWC’13]

Page 11: Once upon a time in Datatown

Completeness of Geographical Data

• When bored at work, Bob likes to draw random objects into the free, open mapping project OpenStreetMap

• After getting blocked for the 17th time, he finally decides to make a useful contribution

• Bob: Couldn’t we use the completeness statements on the OSM Wiki to also annotate spatial queries with completeness information?

(a few games of Minesweeper later)

• Bob: But things are different there. Query completeness is not a binary issue: queries are complete in certain areas while in others they are not. Also, we can now divide the objects in the database into certain, possible and impossible answers

Result 4: Model, algorithms and experimental evaluation of techniques for calculating the completeness area of a query over spatial data, and for classifying answers into certain answers, possible answers and impossible answers

[BNCOD’13, SIGSPATIAL’14 (submitted)]
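
Bob's observation that a spatial query is complete in some areas and not in others can be sketched as follows. The sketch assumes, purely for illustration, that completeness statements and query regions are axis-aligned rectangles; the model in the papers may differ, and the names `Rect`, `intersect`, `completeness_area` and `is_in_complete_area` are hypothetical. The completeness area of a query is taken to be the part of the query region that overlaps some area for which the map is stated to be complete.

```python
from typing import List, NamedTuple, Optional


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the overlap of two rectangles, or None if they do not overlap."""
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    x2, y2 = min(a.x2, b.x2), min(a.y2, b.y2)
    return Rect(x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None


def completeness_area(query: Rect, complete: List[Rect]) -> List[Rect]:
    """Parts of the query region covered by some completeness statement."""
    parts = (intersect(query, c) for c in complete)
    return [p for p in parts if p is not None]


def is_in_complete_area(x: float, y: float, area: List[Rect]) -> bool:
    """True if the point lies in a part of the map known to be complete."""
    return any(r.x1 <= x <= r.x2 and r.y1 <= y <= r.y2 for r in area)


# Hypothetical statements: the map is complete in these two rectangles.
statements = [Rect(5, 5, 20, 20), Rect(30, 30, 40, 40)]
query_region = Rect(0, 0, 10, 10)

area = completeness_area(query_region, statements)
```

Here only the upper-right quarter of the query region, `Rect(5, 5, 10, 10)`, is known to be complete; for the rest of the region, objects may be missing from the answer.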

Page 12: Once upon a time in Datatown

Automation

• The techniques work, but minions are often too lazy to give completeness statements

• Alice: Can’t we automate this by looking at the work processes?

• Tom: In principle yes, but we need a formal description of processes that manipulate data in the database and in the real world

• Alice: Transition systems are the most general formalism, let’s extend those

• Tom: Ok, and I think we can again use containment-style reasoning

Result 5: Introduction of quality-aware transition systems (QATS) and development of algorithms for checking query completeness over QATS

[BPM’13]

Page 13: Once upon a time in Datatown

Open issues

• Alice: If a query is not complete, could you give me at least numerical estimates?

• Bob: How can we utilize the state of the database to draw additional conclusions?

• Tom: I would like to study Mathematics and solve the problem of Query Determinacy as raised by Gauß, Segoufin, Fermat and Vianu

Page 14: Once upon a time in Datatown

The end.

Page 15: Once upon a time in Datatown

Main publications

• 1: Completeness of Queries over Incomplete Databases, Simon Razniewski and Werner Nutt, Int. Conference on Very Large Databases (VLDB), 2011

– Acceptance rate: 18.1%

• 2: Completeness of Queries over SQL Databases, Werner Nutt and Simon Razniewski, Conference on Information and Knowledge Management (CIKM), 2012

– Acceptance rate: 13.4%

• 3: Completeness Statements about RDF Data Sources and Their Use for Query Answering, Fariz Darari, Werner Nutt, Giuseppe Pirro and Simon Razniewski, Int. Semantic Web Conference (ISWC), 2013

– Acceptance rate: 21.5%

• 4a: Assessing the Completeness of Geographical Data, Simon Razniewski and Werner Nutt, British National Conference on Databases (BNCOD), 2013 (Short Paper)

– Acceptance rate: 47.6%

• 4b: Adding Completeness Information to Query Answers over Spatial Databases, Simon Razniewski and Werner Nutt, SIGSPATIAL 2014

– Submitted

• 5: Verification of Query Completeness over Processes, Simon Razniewski, Werner Nutt and Marco Montali, International Conference on Business Process Management (BPM), 2013

– Acceptance rate: 14.4%