Slides from my PhD defense admission exam, Bozen, 9.7.2014
Once upon a time in Datatown (Or: Query-driven Data Completeness Management)
Simon Razniewski
Supervisor: Werner Nutt
Once upon a time in Datatown …
One database for all schools
Monitoring school developments
• The central school administration decided last year that Computer Science classes shall teach Ruby on Rails instead of HTML.
• School district administrator Alice wants to monitor the impact of this decision.
Query: How many pupils have grade A in Computer Science?
Query result:
2012: 8563
2013: 8619 (+0.7%)
2014: 3202 (-63%)
• Was teaching Ruby instead of HTML a terrible idea??
• Alice orders her assistant Bob to investigate
– Bob calls school A … “No problem, everything as usual”
– Bob calls school B … “No, the CS grades in our school are as usual”
– Bob calls school C … “What do you want? Everything is fine here”
– …
• Bob concludes that something must be wrong with the data
2014: 3202 (-63%)
“Something must be wrong with the data”
• Bob calls the DB admin Tom
• Tom: “Dude, of course these numbers are nonsense, most of the data wasn’t loaded yet!”
“Most of the data wasn’t loaded yet”
• Alice is relieved to hear that the change in teaching probably did not wreck the grades
• But how can such misunderstandings be prevented in the future?
How can such misunderstandings be prevented in the future?
• Alice tasks Bob and Tom with a research question: find a technique for deciding whether query answers over partially complete databases are complete
• Tom finds cryptic old papers in the archive that seem related
• Tom: “Motro describes a similar problem to ours: When do queries return complete answers over incomplete databases?”
• Bob: “Levy introduces a formalism to describe which parts of a database table are complete”
• Tom: “But those papers do not contain algorithms”
Obtaining complete answers from incomplete databases, Alon Y. Levy, 1996
Integrity = Validity + Completeness, Amihai Motro, 1989
“But those papers do not contain algorithms”
• Bob: “Maybe we can reduce this to conjunctive-query style containment?”
• Tom: “That works, but note that we also need to find procedures for asymmetric containment problems”
Bob and Tom sit down and write these procedures
Result 1: Development of decision procedures for completeness reasoning and complexity analysis [VLDB’11]
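The idea of reasoning from table completeness statements to query completeness can be illustrated with a deliberately simplified sketch. It assumes a single table, queries and statements made only of equality conditions, and hypothetical names (`Statement`, `SelectionQuery`, `is_complete`, the `result` table and its columns) that do not come from the slides. It is not the decision procedure from the VLDB’11 paper, only a sufficient check in the same spirit: a query is reported complete if some statement covers every row the query could select.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Asserts that all real-world rows of `table` matching `conditions` are stored."""
    table: str
    conditions: frozenset  # set of (column, value) pairs


@dataclass(frozen=True)
class SelectionQuery:
    """A selection over one table with equality conditions only (an assumption)."""
    table: str
    conditions: frozenset


def stmt(table, **conds):
    return Statement(table, frozenset(conds.items()))


def query(table, **conds):
    return SelectionQuery(table, frozenset(conds.items()))


def is_complete(q, statements):
    """Sufficient check: some statement's conditions are a subset of the query's.

    If a statement is less restrictive than the query, every row the query
    selects lies in the part of the table that is asserted to be complete.
    """
    return any(
        s.table == q.table and s.conditions <= q.conditions
        for s in statements
    )


# Hypothetical completeness statements: school A is fully loaded,
# and all results for 2013 are fully loaded.
statements = [stmt("result", school="A"), stmt("result", year=2013)]

q_2013 = query("result", subject="CS", grade="A", year=2013)
q_2014 = query("result", subject="CS", grade="A", year=2014)
q_school_a = query("result", subject="CS", school="A", year=2014)

print(is_complete(q_2013, statements))      # covered by the 2013 statement
print(is_complete(q_2014, statements))      # not covered by any statement
print(is_complete(q_school_a, statements))  # covered by the school A statement
```

In this sketch Alice’s 2014 query would be flagged as possibly incomplete, which is the warning she was missing. Joins and several statements that only cover a query jointly are not handled by the subset test.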
Does this also work for null values?
• When first presenting a demo system to Alice, the demo system crashes
• Tom: “Understandable, because it is not clear what a null means: whether computer science was an ungraded subject, or whether the grade is missing”
• Alice: “Fix it!”
…Tom goes to work
Result 2: Extension of completeness reasoning to databases with null values, complexity analysis, and introduction of a technique to avoid the ambiguity of null values [CIKM’12]
java.lang.NullPointerException ("Grade in CS is null")
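The ambiguity Tom describes can be made concrete with a small sketch. It assumes one possible way of disambiguating nulls, namely two explicit markers (the hypothetical `Missing.NOT_APPLICABLE` and `Missing.UNKNOWN`); this is an illustration only and not necessarily the technique introduced in the CIKM’12 paper. Once the two meanings are kept apart, a count query can at least be answered with a lower and an upper bound.

```python
from enum import Enum


class Missing(Enum):
    """Two meanings that a single database null would conflate (assumed markers)."""
    NOT_APPLICABLE = "subject was ungraded"
    UNKNOWN = "grade exists but is not recorded"


def count_grade_a_bounds(grades):
    """Return (lower, upper) bounds on the number of pupils with grade A.

    Known grades count exactly, UNKNOWN entries might turn out to be an A,
    and NOT_APPLICABLE entries can never be an A.
    """
    certain = sum(1 for g in grades if g == "A")
    unknown = sum(1 for g in grades if g is Missing.UNKNOWN)
    return certain, certain + unknown


# Hypothetical CS grades of five pupils.
grades = ["A", "B", Missing.NOT_APPLICABLE, Missing.UNKNOWN, "A"]
bounds = count_grade_a_bounds(grades)
print(bounds)  # (2, 3)
```

With a plain null in place of both markers, the upper bound would have to count the ungraded pupil as well, so the answer would be less informative.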
Late evening, Bar Nadamas
• Alice greatly impresses her colleague Frank, head of the statistics office of Datatown, with her new completeness tool/toy
• New EU guidelines on open government require Frank’s office to publish their data in RDF
– Frank: “Do you think this tool could be adapted to also handle RDF data?”
– Alice: “What’s the difference?”
– Frank: “Well, there’s the OPT-construct, the RDFS closure, and also, the completeness statements should be expressed in RDF themselves”
– Alice: “Let me ask Tom…”
Result 3: Formalisms and algorithms for assessing the completeness of SPARQL queries over RDF data [ISWC’13]
Completeness of Geographical Data
• When bored at work, Bob likes to draw random objects into the free open mapping project OpenStreetMap
• After being blocked for the 17th time, he finally decides to make a useful contribution
• Bob: Couldn’t we use the completeness statements on the OSM Wiki to also annotate spatial queries with completeness information?
(a few games of Minesweeper later)
• Bob: But things are different there. Query completeness is not a binary issue; instead, queries are complete in certain areas while in others they are not. Also, we can now divide objects in the database into certain, possible and impossible answers
Result 4: Model, algorithms and experimental evaluation of techniques for calculating the completeness area of a query over spatial data, and for classifying answers into certain answers, possible answers and impossible answers [BNCOD’13, SIGSPATIAL’14 (submitted)]
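The notion that a spatial query is complete only in certain areas can be sketched as follows. The sketch assumes that completeness statements are given as axis-aligned bounding boxes per feature type, and all names (`Box`, `intersect`, `completeness_area`, the example coordinates) are hypothetical. It is not the algorithm from the BNCOD’13 or SIGSPATIAL’14 papers; it only shows how the part of a query window that is known to be complete could be computed.

```python
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle (an assumed simplification of map regions)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlap of two boxes, or None if they do not overlap."""
    box = Box(
        max(a.min_x, b.min_x),
        max(a.min_y, b.min_y),
        min(a.max_x, b.max_x),
        min(a.max_y, b.max_y),
    )
    if box.min_x >= box.max_x or box.min_y >= box.max_y:
        return None
    return box


def completeness_area(
    feature_type: str,
    query_box: Box,
    statements: List[Tuple[str, Box]],
) -> List[Box]:
    """Parts of the query window in which `feature_type` is asserted complete."""
    parts = []
    for stmt_type, stmt_box in statements:
        if stmt_type != feature_type:
            continue
        part = intersect(query_box, stmt_box)
        if part is not None:
            parts.append(part)
    return parts


# Hypothetical statements: "all schools are mapped in this box", and so on.
statements = [
    ("school", Box(0, 0, 5, 5)),
    ("pharmacy", Box(0, 0, 10, 10)),
    ("school", Box(8, 8, 12, 12)),
]

area = completeness_area("school", Box(3, 3, 10, 10), statements)
print(area)
```

Here the query for schools in the window from (3, 3) to (10, 10) is complete in two corner regions and of unknown completeness elsewhere. The returned boxes may overlap each other in general; merging them is left out of the sketch.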
Automation
• The techniques work, but minions are often too lazy to give completeness statements
• Alice: Can’t we automate this by looking at the work processes?
• Tom: In principle yes, but we need a formal description of processes that manipulate data in the database and in the real world
• Alice: Transition systems are the most general formalism, let’s extend those
• Tom: Ok, and I think we can again use containment-style reasoning
Result 5: Introduction of quality-aware transition systems (QATS) and development of algorithms for checking query completeness over QATS [BPM’13]
Open issues
• Alice: If a query is not complete, could you give me at least numerical estimates?
• Bob: How can we utilize the state of the database to draw additional conclusions?
• Tom: I would like to study Mathematics and solve the problem of Query Determinacy as raised by Gauß, Segoufin, Fermat and Vianu
The end.
Main publications
• 1: Completeness of Queries over Incomplete Databases, Simon Razniewski and Werner Nutt, Int. Conference on Very Large Databases (VLDB), 2011
– Acceptance rate: 18.1%
• 2: Completeness of Queries over SQL Databases, Werner Nutt and Simon Razniewski, Conference on Information and Knowledge Management (CIKM), 2012
– Acceptance rate: 13.4%
• 3: Completeness Statements about RDF Data Sources and Their Use for Query Answering, Fariz Darari, Werner Nutt, Giuseppe Pirro and Simon Razniewski, Int. Semantic Web Conference (ISWC), 2013
– Acceptance rate: 21.5%
• 4a: Assessing the Completeness of Geographical Data, Simon Razniewski and Werner Nutt, British National Conference on Databases (BNCOD), 2013 (Short Paper)
– Acceptance rate: 47.6%
• 4b: Adding Completeness Information to Query Answers over Spatial Databases, Simon Razniewski and Werner Nutt, SIGSPATIAL 2014
– Submitted
• 5: Verification of Query Completeness over Processes, Simon Razniewski, Werner Nutt and Marco Montali, International Conference on Business Process Management (BPM), 2013
– Acceptance rate: 14.4%