Querying Incomplete Geospatial Information in RDF
Charalampos Nikolaou and Manolis Koubarakis
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
International Symposium on Spatial and Temporal Databases (SSTD) 2013, August 23, 2013
Motivation
• Increased interest in publishing geospatial datasets as linked data (i.e., encoded in RDF and with semantic links to other datasets)
• Geospatial information might be:
  o Quantitative (e.g., exact geometric information)
  o Qualitative (e.g., topological relations)
... and express knowledge that is
  o Complete
  o Incomplete (or indefinite)
Linked Geospatial Data
• Ordnance Survey (UK): 73,546,231 triples
• Global Administrative Areas (GADM): 9,896,532 triples
• Nomenclature of Territorial Units for Statistics (NUTS): 316,246 triples
[Figure: linked data cloud diagram, as of September 2011. Legend categories: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content.]
~ 62 billion triples
Question
How do we manage (represent, store, query) this data efficiently?
Challenges: Theory
① RDF extensions for representing and querying incomplete qualitative and quantitative geospatial information
• GeoSPARQL
  o OGC standard query language for RDF data with geospatial information
  o Topological relations can be expressed and queried, but no reasoning is offered
• We proposed RDFi
  o Can work with any topological/temporal constraint language with or without constant symbols (e.g., RCC-5, RCC-8, IA)
  o Formal semantics and an algorithm for computing certain answers
  o Preliminary complexity results for various constraint languages
• Open issue: no published algorithm for query processing when considering RCC-8 and constants
RDFi by example
gag:Region rdfs:subClassOf geo:Feature .
gag:WestGreece rdf:type gag:Region .
gag:Municipality rdfs:subClassOf geo:Feature .
gag:OlympiaMuni rdf:type gag:Municipality .
noa:Hotspot rdfs:subClassOf geo:Feature .
noa:hotspot rdf:type noa:Hotspot .
noa:Fire rdfs:subClassOf geo:Feature .
noa:fire rdf:type noa:Fire .
gag:OlympiaMuni geo:hasGeometry ex:oGeo .
ex:oGeo rdf:type sf:Polygon .
ex:oGeo geo:asWKT "POLYGON((..))"^^geo:wktLiteral .
noa:hotspot geo:hasGeometry ex:rec .
ex:rec geo:asWKT "POLYGON((..))"^^geo:wktLiteral .
gag:WestGreece geo:sfContains gag:OlympiaMuni .
noa:hotspot geo:sfContains noa:fire .
[Figure: West Greece and Olympia]
RDFi by example (cont’d)
Query: Find fires inside the region of West Greece.
GeoSPARQL query:
CERTAIN SELECT ?f
WHERE {
  ?f rdf:type noa:Fire .
  gag:WestGreece geo:sfContains ?f .
}
[Figure: West Greece and Olympia, with three relations labeled "contains"]
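Why noa:fire is a certain answer even though the fire has no geometry can be sketched in a few lines of Python. This is a minimal illustration, not the RDFi query evaluation algorithm. The set names and helper functions are hypothetical, and so is the assumption that comparing the two WKT polygons shows that Olympia contains the hotspot. Plain pattern matching over the stated triples finds nothing, while chaining the contains relations does.

```python
# Hypothetical, simplified encoding of the example: only "contains" facts
# are kept, and prefixed names are plain strings.
EXPLICIT_CONTAINS = {
    ("gag:WestGreece", "gag:OlympiaMuni"),  # stated with geo:sfContains
    ("noa:hotspot", "noa:fire"),            # stated with geo:sfContains
}
# Assumption: comparing the WKT polygons ex:oGeo and ex:rec shows that the
# Olympia polygon contains the hotspot rectangle.
GEOMETRIC_CONTAINS = {("gag:OlympiaMuni", "noa:hotspot")}
FIRES = {"noa:fire"}


def transitive_closure(pairs):
    """Close a set of (container, contained) pairs under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def fires_inside(region, contains):
    """Fires f such that (region, f) is in the given contains relation."""
    return sorted(f for f in FIRES if (region, f) in contains)


# Pattern matching on stated triples only: no reasoning, no answer.
pattern_matching = fires_inside("gag:WestGreece", EXPLICIT_CONTAINS)
# Qualitative and geometric facts combined and chained.
with_reasoning = fires_inside(
    "gag:WestGreece",
    transitive_closure(EXPLICIT_CONTAINS | GEOMETRIC_CONTAINS),
)
print(pattern_matching)  # []
print(with_reasoning)    # ['noa:fire']
```

Wherever the fire actually lies, it is inside the hotspot, so the answer holds in every possible completion of the data; that is what makes it certain.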
Challenges: Theory
② Efficient computation of the entailment relation Φ ⊨ Θ
• where Φ and Θ are quantifier-free first-order formulas of a constraint language expressing the topological relations of various frameworks (RCC-8, DE-9IM, etc.)
Challenges: Theory
③ Computing entailment is equivalent to checking consistency of formulas with constraint networks
• Constraint networks:
  o Spatial relations among regions
  o Regions might be constant ones (exact geometric information) or identified by a URI
• Most recent results considered basic and complete RCC-5 networks with polygonal regions
• For RCC-8, deciding consistency is NP-complete
• No published algorithm for checking consistency
• Are there tractable cases?
Open issue
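The link between entailment and consistency can be written in one line. This is the standard reduction for quantifier-free formulas and is not specific to RDFi; the RCC-8 instance below it is an illustrative example with hypothetical region names a, b, c.

```latex
\Phi \models \Theta
\quad\Longleftrightarrow\quad
\Phi \wedge \neg\Theta \ \text{is inconsistent (unsatisfiable)}

% Illustrative RCC-8 instance:
% \Phi = (a \;\mathrm{NTPP}\; b) \wedge (b \;\mathrm{NTPP}\; c), \qquad
% \Theta = (a \;\mathrm{NTPP}\; c).
% \neg\Theta says that a and c are related by one of the other seven base
% relations. Since \mathrm{NTPP} \circ \mathrm{NTPP} = \mathrm{NTPP}, the
% network for \Phi \wedge \neg\Theta is inconsistent, hence \Phi \models \Theta.
```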
Challenges: Practice
④ Scale to billions of triples
• Reasoners from qualitative spatial reasoning (QSR) scale only up to hundreds of regions with complex spatial relations
How do they perform in our case?
• Setting:
  o Real linked geospatial datasets
  o No constants
  o Only base RCC-8 relations
  o Evaluation of consistency checking using the well-known path-consistency algorithm
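The path-consistency algorithm mentioned in the setting can be sketched as follows. To keep the composition table short, this sketch swaps in the point algebra (base relations <, =, >) instead of RCC-8; for RCC-8 one would replace `BASE`, `CONVERSE`, and `COMPOSE` with the eight base relations and their 8 × 8 composition table. The function and variable names are hypothetical, and the naive fixpoint loop is only one possible formulation, not the implementation evaluated in the talk.

```python
from itertools import product

# Point algebra used as a small stand-in for RCC-8 (assumption, see lead-in).
BASE = frozenset("<=>")
CONVERSE = {"<": ">", "=": "=", ">": "<"}
COMPOSE = {
    ("<", "<"): "<", ("<", "="): "<", ("<", ">"): "<=>",
    ("=", "<"): "<", ("=", "="): "=", ("=", ">"): ">",
    (">", "<"): "<=>", (">", "="): ">", (">", ">"): ">",
}


def compose(r, s):
    """Composition of two sets of base relations."""
    out = set()
    for a, b in product(r, s):
        out |= set(COMPOSE[(a, b)])
    return frozenset(out)


def path_consistency(n, constraints):
    """Refine an n-node network until no triple (i, k, j) changes anything.

    constraints maps (i, j) to an iterable of base relations.
    Returns the refined n x n matrix, or None if some relation becomes empty.
    """
    rel = [[BASE] * n for _ in range(n)]  # complete network: n^2 entries
    for i in range(n):
        rel[i][i] = frozenset("=")
    for (i, j), r in constraints.items():
        rel[i][j] = frozenset(r)
        rel[j][i] = frozenset(CONVERSE[b] for b in r)
    changed = True
    while changed:
        changed = False
        for i, k, j in product(range(n), repeat=3):  # n^3 triples per sweep
            refined = rel[i][j] & compose(rel[i][k], rel[k][j])
            if refined != rel[i][j]:
                if not refined:
                    return None  # empty relation: network is inconsistent
                rel[i][j] = refined
                rel[j][i] = frozenset(CONVERSE[b] for b in refined)
                changed = True
    return rel


result = path_consistency(3, {(0, 1): "<", (1, 2): "<"})
print(sorted(result[0][2]))  # ['<']
print(path_consistency(3, {(0, 1): "<", (1, 2): "<", (2, 0): "<"}))  # None
```

The matrix holds an entry for every pair of nodes and each sweep visits every triple, which is where quadratic memory and cubic time come from; queue-based variants avoid the repeated sweeps of this naive loop.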
Experimental evaluation
[Figure: elapsed time in minutes (log scale, 0.01 to 1000) per dataset (gag, nuts, admingeo, gadm-geovocab). Legend: Timeout, PostgreSQL, Renz, PyRCC8, PPyRCC8.]
Setup: Intel Xeon E5620, 2.4 GHz, 12MB L3, 48GB RAM, RAID 5, Ubuntu 12.04
• Computation of the complete constraint network
• Running time: O(n³)
• Memory requirements: O(n²)
n ≈ thousands to millions
[Chart annotations: hundreds of regions (one dataset); thousands of regions (three datasets); timeout after one day]
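A quick back-of-the-envelope calculation shows what quadratic memory means at this scale. The sketch below uses the GADM-RDF region count from the dataset characteristics table (276 728 regions) and assumes, purely for illustration, one byte per pair of regions (eight bits, one per RCC-8 base relation); real implementations may use more. The function name is hypothetical.

```python
def complete_network_pairs(n):
    """Number of unordered pairs of distinct regions in a complete network."""
    return n * (n - 1) // 2


regions = 276_728  # GADM-RDF, from the dataset characteristics table
pairs = complete_network_pairs(regions)

# Assumption: one byte per pair (a bitmask over the 8 RCC-8 base relations).
bytes_needed = pairs

print(pairs)                # 38289054628
print(bytes_needed / 1e9)   # roughly 38 GB
```

Under that assumption the relation matrix alone already approaches the 48 GB of RAM in the experimental setup, before any bookkeeping by the reasoner.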
Network structure
• We have started working on algorithms that take into account the structure of these networks:
  o Node degrees fit a power-law distribution
  o Network is sparse
[Figure: number of nodes (log scale, 10⁰ to 10⁶) versus degree (log scale, 10⁰ to 10⁵); power-law fit with exponent 2.1]
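One way to inspect the degree distribution of such a network is sketched below. The slides do not say how the exponent was fitted; the estimator here is the usual continuous maximum-likelihood approximation, 1 + n / Σ ln(dᵢ / d_min), which is an assumption on our part and only approximate for integer degrees. All names and the toy edge list are hypothetical.

```python
import math
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_histogram(deg):
    """Map each degree value to the number of nodes having it."""
    return Counter(deg.values())


def powerlaw_exponent(values, x_min=1):
    """Continuous MLE approximation of a power-law exponent (assumption)."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all values equal x_min; exponent is undefined")
    return 1 + len(tail) / log_sum


# Toy network, not taken from any of the datasets.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
deg = degrees(edges)
hist = degree_histogram(deg)
alpha = powerlaw_exponent(deg.values())
print(dict(hist))
print(alpha)
```

Plotting `hist` on log-log axes gives the kind of chart shown on the slide; a roughly straight line suggests a power law.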
Network structure (cont’d)
• Edges of three kinds: externally connected, equals, non-tangential proper part
• Reflect networks composed of components with hierarchical structure
  o R-tree extensions (Papadias, Kalnis, Mamoulis, AAAI’99)
• Parallel algorithms combined with backward-chaining techniques for lazy query processing
  o Graph partitioning
  o Path compression data structures and indexes
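A very simple form of graph partitioning is splitting the network into connected components, and a classic data structure that relies on path compression is union-find. The sketch below combines the two. Whether this is the partitioning or the data structure the authors have in mind is not stated in the slides, so treat it as an illustration under that assumption; the function names and the sample edges are hypothetical.

```python
def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        # Point x directly at the root, then continue with its old parent.
        parent[x], x = root, parent[x]
    return root


def components(edges):
    """Group nodes of a labeled edge list into connected components.

    edges is an iterable of (region, relation, region) triples; the
    relation label is ignored for partitioning purposes.
    """
    parent = {}
    for u, _rel, v in edges:
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv  # union of the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Toy edges using the three kinds of relations named on the slide.
edges = [
    ("a", "NTPP", "b"),
    ("b", "NTPP", "c"),
    ("x", "EC", "y"),
    ("y", "EQ", "z"),
]
parts = components(edges)
print(parts)  # [['a', 'b', 'c'], ['x', 'y', 'z']]
```

Each component can then be handed to a separate worker, which fits the parallel processing idea on the slide.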
Related work: Spatial
• Qualitative spatial reasoning
  - Efficient algorithms for consistency checking of constraint networks (complex spatial relations, small number of regions)
  - Does not consider query processing
• Description logic reasoners
  - PelletSpatial: RCC-8 reasoning (cannot handle disjunctions)
  - RacerPro: RCC-8 reasoning
Related work: Temporal
• Chaudhuri (VLDB’88)
• The knowledge representation language Telos (TOIS’90)
• Foundations of temporal constraint databases (Koubarakis, PhD thesis, ’94)
• Qualitative temporal reasoning community (since the 1980s)
• SQL+i system (BNCOD’96)
• LaTeR system (IEEE’97)
• Hurtado and Vaisman (2006)
Conclusions
• What’s the CHALLENGE?
Implementing an efficient query processing system for incomplete geospatial information in RDFi
• The desired system should:
  o reason about qualitative and quantitative spatial information that might be incomplete
  o be scalable to billions of triples in the most useful cases
Thank you
Dataset characteristics
Dataset | #triples | #regions | #RCC-8 relations | #definite relations (after PC) | #indefinite relations (after PC)
ADMGB | 149 046 | 11 762 | 77 907 | 46 777 728 | 45 777 577
GAG | 11 780 | 412 | 3023 | 4870 | 82 231
NUTS-RDF | 316 246 | 2236 | 3176 | 906 558 | 2 045 451
GADM-RDF | 9 896 532 | 276 728 | 590 445 | X | X
GADM-RDF-EUROPE | 355 656 | 23 037 | 51 309 | X | X