DESCRIPTION
by Sławek Staworko (joint work with Peter Buneman), University of Edinburgh, presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
Experiments with evolving RDF
Sławek Staworko (joint work with Peter Buneman)
University of Edinburgh
Preservation of evolving data
[Figure: three versions of an RDF graph. Version 1: Tom has cat; cat eats tuna. Version 2: Tom has cat; cat dies Apr 1. Version 3: Tom has dog; dog eats dog food. …]
Archive
• Version retrieval
• Timeline queries
• Storage space efficiency
Approaches to data preservation
• Store all versions
• Store the original database and log the changes
• Hybrid approach of the above two
  • store the initial and every 10th version
  • store logged changes for the intermediate versions
• Annotation-based approach
  • never delete data but annotate its validity with time intervals
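Annotation of RDF
[Figure: Version 1 (Tom has cat; cat eats tuna), Version 2 (Tom has cat; cat dies Apr 1) and Version 3 (Tom has dog; dog eats dog food) merged into a single archive graph whose edges carry validity intervals.]
Archive
• (Tom, has, cat) [1–2]
• (cat, eats, tuna) [1–1]
• (cat, dies, Apr 1) [2–2]
• (Tom, has, dog) [3–]
• (dog, eats, dog food) [3–]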
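To make version retrieval and timeline queries concrete, here is a minimal Python sketch of the annotated archive from the Tom example. The dictionary layout and the function names `retrieve` and `timeline` are assumptions made for illustration, not the representation used in the experiments; an open interval such as [3–] is modelled with `None` as its end.

```python
from typing import Optional

Triple = tuple[str, str, str]
Interval = tuple[int, Optional[int]]  # (first version, last version); None = still valid

# Hypothetical in-memory archive: every triple ever seen, with its validity intervals.
archive: dict[Triple, list[Interval]] = {
    ("Tom", "has", "cat"): [(1, 2)],
    ("cat", "eats", "tuna"): [(1, 1)],
    ("cat", "dies", "Apr 1"): [(2, 2)],
    ("Tom", "has", "dog"): [(3, None)],
    ("dog", "eats", "dog food"): [(3, None)],
}

def valid_at(intervals: list[Interval], version: int) -> bool:
    """True if some interval covers the given version."""
    return any(s <= version and (e is None or version <= e) for s, e in intervals)

def retrieve(archive: dict[Triple, list[Interval]], version: int) -> set[Triple]:
    """Version retrieval: the set of triples valid in one version."""
    return {t for t, ivs in archive.items() if valid_at(ivs, version)}

def timeline(archive: dict[Triple, list[Interval]], triple: Triple) -> list[Interval]:
    """Timeline query: the validity intervals of one triple."""
    return archive.get(triple, [])
```

Nothing is ever deleted from the archive, so any version can be rebuilt by filtering on the intervals.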
What exactly is the input?
Delta = difference between two databases, expressed with two atomic operations: inserting a triple and deleting a triple
[Figure: the three snapshots (Version 1, Version 2, Version 3) with the deltas between consecutive versions.]
Deltas
• Version 1 to Version 2: delete (cat, eats, tuna); insert (cat, dies, Apr 1)
• Version 2 to Version 3: delete (Tom, has, cat); insert (Tom, has, dog); insert (dog, eats, dog food); delete (cat, dies, Apr 1)
Snapshots = complete database instances
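When every node has a stable identifier, a delta between two snapshots is just a pair of set differences. The Python sketch below assumes exactly that (no blank nodes, no renamed URIs); the function name `delta` and the dictionary keys are illustrative choices.

```python
Triple = tuple[str, str, str]

def delta(old: set[Triple], new: set[Triple]) -> dict[str, set[Triple]]:
    """Delta between two snapshots as triples to delete and triples to insert.

    Assumes node identifiers are stable across versions, so plain set
    difference is enough and no entity resolution is needed.
    """
    return {"delete": old - new, "insert": new - old}

# The three snapshots of the Tom example.
v1 = {("Tom", "has", "cat"), ("cat", "eats", "tuna")}
v2 = {("Tom", "has", "cat"), ("cat", "dies", "Apr 1")}
v3 = {("Tom", "has", "dog"), ("dog", "eats", "dog food")}
```

Applied to `v1` and `v2`, and then to `v2` and `v3`, this gives the same operations as the deltas listed for the example.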
Challenges in preserving evolving data with annotations
1. The task is relatively simple if deltas are known:
• deleting a triple closes its interval
• adding a triple opens a new interval
2. It gets complicated when only snapshots are given:
• it boils down to computing deltas
• main challenge: identify objects that are the same across versions of the database
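The simple case, where deltas are known, can be sketched in a few lines of Python: a deletion closes the open interval of a triple and an insertion opens a new one. The function name `apply_delta` and the convention that a triple deleted on the way to version n was last valid in version n-1 are assumptions for illustration.

```python
from typing import Optional

Triple = tuple[str, str, str]
Interval = tuple[int, Optional[int]]  # (first version, last version); None = still open

def apply_delta(archive: dict[Triple, list[Interval]], version: int,
                deletes: list[Triple], inserts: list[Triple]) -> None:
    """Update the archive in place with the delta that leads to `version`."""
    for triple in deletes:
        # Deleting a triple closes its most recent (open) interval.
        start, _ = archive[triple][-1]
        archive[triple][-1] = (start, version - 1)
    for triple in inserts:
        # Adding a triple opens a new interval starting at this version.
        archive.setdefault(triple, []).append((version, None))

# Version 1 is treated as a delta that only inserts.
archive: dict[Triple, list[Interval]] = {}
apply_delta(archive, 1, [], [("Tom", "has", "cat"), ("cat", "eats", "tuna")])
apply_delta(archive, 2, [("cat", "eats", "tuna")], [("cat", "dies", "Apr 1")])
apply_delta(archive, 3,
            [("Tom", "has", "cat"), ("cat", "dies", "Apr 1")],
            [("Tom", "has", "dog"), ("dog", "eats", "dog food")])
```

A triple that is deleted and later re-inserted simply gets a second interval in its list.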
Entity resolution problem
Which data objects represent the same entity across different versions?
A well-studied database problem in various settings (from duplicate elimination to record matching)
Entity resolution and RDF
URI (Uniform Resource Identifier)
URIs are supposed to make things easy but…
• RDF also has blank nodes
• URIs don’t exactly solve the problem in the context of evolving/merged ontologies…
Two different RDF nodes need not represent different objects
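Blank nodes
• LOD initiative frowns upon them
• Blank nodes are commonplace (and misused?)
[Figure: two uses of blank nodes. Reification: the statement "Peter believes (Tom has cat)" is encoded with a blank node _b that carries subject, pred and object edges. Complex number: a blank node _b with the components 2.4 and -0.4.]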
Blank nodes (cont.)
1. Reification (Peter believes that Tom has a cat)
2. Data structures (complex types)
3. Anonymization (Tom has a pet)
Assumptions on reasonable use of blank nodes:
1. Represent concrete objects
2. The objects can be identified from the context
Deblanking
[Figure: LISP-style encoding of the list of numbers [5,3,7] with blank nodes: _b3 (head 5, tail _b2), _b2 (head 3, tail _b1), _b1 (head 7, tail end). Deblanking proceeds in steps: _b1 becomes #(7,end), then _b2 becomes #(3,7,end), then _b3 becomes #(5,3,7,end).]
Assumption: graph has no cycles consisting of blanks only
Assumption: identity of a blank node is determined by its contents
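The two assumptions suggest a round-based procedure: in each round, every blank node whose outgoing edges all point to non-blank nodes is replaced by a label computed from its contents. The Python sketch below is one possible reading of that idea, not the authors' implementation. The name `deblank`, the convention that blank nodes are strings starting with `_`, and the label format `#(pred=obj,...)` (which nests rather than flattens, unlike the labels in the figure) are all assumptions.

```python
Triple = tuple[str, str, str]

def is_blank(node: str) -> bool:
    """Assumed convention: blank node names start with an underscore."""
    return node.startswith("_")

def deblank(triples: set[Triple]) -> tuple[set[Triple], int]:
    """Replace blank nodes by content-based labels, round by round.

    Returns the rewritten triples and the number of rounds performed.
    Blank nodes on a cycle of blanks are never resolved and stay as they are.
    """
    triples = set(triples)
    rounds = 0
    while True:
        # Collect the outgoing edges of every blank subject.
        outgoing: dict[str, list[tuple[str, str]]] = {}
        for s, p, o in triples:
            if is_blank(s):
                outgoing.setdefault(s, []).append((p, o))
        # A blank node is ready once none of its objects is blank.
        ready = {
            b: "#(" + ",".join(f"{p}={o}" for p, o in sorted(edges)) + ")"
            for b, edges in outgoing.items()
            if not any(is_blank(o) for _, o in edges)
        }
        if not ready:
            return triples, rounds
        # Rewriting into a set merges blank nodes with identical contents.
        triples = {(ready.get(s, s), p, ready.get(o, o)) for s, p, o in triples}
        rounds += 1

# The list [5,3,7] from the figure, encoded LISP-style.
lst = {
    ("_b1", "head", "7"), ("_b1", "tail", "end"),
    ("_b2", "head", "3"), ("_b2", "tail", "_b1"),
    ("_b3", "head", "5"), ("_b3", "tail", "_b2"),
}
result, rounds = deblank(lst)
```

Because two blank nodes with the same contents receive the same label, duplicates collapse into one node, which is one way the triple count can shrink between rounds.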
Experiments
• 10 versions of Experimental Factor Ontology (EFO) data expressed in OWL
• 200k triples in the 1st version, 290k in the last
• On average 20k blank nodes in each version
• 920k triples overall (blank nodes are independent)
• many triples do not last more than 1 version
Experiment
Deblanking and life expectancy of an object
Round  Triples  Blanks  Life expect.
0      921896   165935  2.55
1      358857   33253   6.39
2      348356   28150   6.57
3      339695   23502   6.88
4      330564   18862   7.10
5      318761   14763   7.24
6      311562   11021   7.39
7      304628   7299    7.54
8      297744   3622    7.83
9      285484   58      7.83
10     285334   2       7.83
11     285334   1       7.83
12     285334   0       7.83
Improving space efficiency
Lift common intervals to subject
[Figure: before lifting, Peter has the edges lives [1–10] to Edinburgh and phone [1–10] to +44 712 4567. After lifting, the interval [1–10] is attached to Peter and the edges lives and phone carry no interval. A second panel shows an edge has [1–5] pointing to dog.]
• Intervals moved from all but 33.7k triples (of total 285k)
• Number of subjects with histories is 34.3k
• Total number of intervals is reduced from 285k to 60k
• The size of the index reduced by almost 80%
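One way to read interval lifting is sketched below in Python: for each subject, the interval shared by most of its triples is stored once on the subject, and only triples that deviate from it keep their own annotation. The slides do not say how the subject's interval is chosen, so picking the most frequent one is an assumption, as are the name `lift_intervals` and the third, hypothetical triple in the example.

```python
from collections import Counter, defaultdict
from typing import Optional

Triple = tuple[str, str, str]
Interval = tuple[int, Optional[int]]

def lift_intervals(archive: dict[Triple, Interval]):
    """Move the most common interval of each subject's triples to the subject.

    Returns (subject_intervals, residual), where residual holds only the
    triples whose interval differs from the one lifted to their subject.
    """
    by_subject: dict[str, list[tuple[Triple, Interval]]] = defaultdict(list)
    for triple, interval in archive.items():
        by_subject[triple[0]].append((triple, interval))

    subject_intervals: dict[str, Interval] = {}
    residual: dict[Triple, Interval] = {}
    for subject, items in by_subject.items():
        common, _ = Counter(iv for _, iv in items).most_common(1)[0]
        subject_intervals[subject] = common
        for triple, interval in items:
            if interval != common:
                residual[triple] = interval
    return subject_intervals, residual

# The Peter example, plus one hypothetical triple with a different interval.
archive = {
    ("Peter", "lives", "Edinburgh"): (1, 10),
    ("Peter", "phone", "+44 712 4567"): (1, 10),
    ("Peter", "owns", "dog"): (3, 5),  # hypothetical, not from the slides
}
subject_intervals, residual = lift_intervals(archive)
```

In this toy case three stored intervals become two: one on Peter and one on the deviating triple.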
Future:
• Bisimulation
• Nested RDF
Conclusions
• Annotation offers an attractive way of representing an evolving RDF dataset (need for nested RDF?)
• Evolution of data may require more complex atomic operations. For instance, vocabulary evolution: adding, splitting, merging classes. (can bisimulation help here?)