28
Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Embed Size (px)

Citation preview

Page 1: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Linked Open GeoData Management in the Cloud

K. Kritikos, Y. Roussakis ICS-FORTHD. Kotzinos ICS-FORTH & TEI of Serres

Page 2: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Cloud Computing

2

Better (faster, reliable, etc.) infrastructure - IaaS

Development infrastructure – PaaS

Software infrastructure – SaaS

Page 3: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Cloud Computing

3

DataData as a Service (DaaS)Data as a Service (DaaS)• Publication• Querying• Updating

Page 4: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Linked (Open) Data as a Service

• Publishing Linked Data– URI construction– Conceptual Model– Storage as RDF files or SPARQL endpoints

• Querying Linked Data– SPARQL– GeoSPARQL

• Updating Linked Data– SPARUL– Synchronization with original sources

4

Page 5: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Problem Introduction (I)

• INGeoCloudS FP7 Pilot B Project (www.ingeoclouds.eu)• Geophysical data from different sources and in different

formats (excel, xml, relational, nothing …)• Borehole and Groundwater Water Analysis

– Boreholes located in Mygdonia/Thriasio of Greece, whole country in Denmark and France and their features (static data over time)

– Chemical analyses of ground waters sampled from their boreholes (data updated over time)

• Earthquake events and features• Landslides

5

Page 6: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Data granularity

• Data refer to different levels of granularity, e.g. susceptibility maps refer to a country-wide area while earthquakes or boreholes are point-level data

• Data might need to be aggregated by such aggregation is based on the spatial dimension, i.e. points contained within a polygon

• Some problems of aggregation do exist since phenomena outside the area of concern may affect it, so spatial aggregation might not be enough

Page 7: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Earthquakes:•How much back in time should we go?•What information should be kept/would be relevant?•How should we query the repository to get the relevant information?

Earthquakes:•How much back in time should we go?•What information should be kept/would be relevant?•How should we query the repository to get the relevant information?

Landslides:•Which area and how much is it affected?•How does this change over time?•Is the earthquake effect cumulative or fades over time?

Landslides:•Which area and how much is it affected?•How does this change over time?•Is the earthquake effect cumulative or fades over time?

Page 8: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Problem Introduction (II)

• Data/Metadata Standards– INSPIRE standard proposes generic conceptual schema for scientific

data + models for 34 spatial data themes

• Deal with geospatial data & maintaining schemas/ontologies becomes difficult– Challenge is to exploit semantic heterogeneity

• Need to offer seamless & transparent LOD as a service (LODaaS) way to manage LOD data– Lack of tools for mapping, transforming & synchronizing geo-spatial

LD– Generic LOD management independent of way LOD are stored

Page 9: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Points of interest

• GeoData get bigger and more important– Used in a variety of applications in different fields

• Size & high demand impose considerable requirements in infrastructure storage size & compute power

• Need to be reused and linked with other data sets– Go beyond current Web paradigm of isolated data silos

• Current geo-spatial open data management work does not offer such effort– Cloud-based approaches:

• do not provide geo-spatial support • Some do not fully support SPARQL or offer SPARQL end-points

– Centralized approaches offer geo-spatial support but:• Do not enable automatic mapping between relational and RDF data• Worse performance in general (with the exception of Strabon wrt geo-

spatial query support)

Page 10: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Proposed Solution (I)

• A specific set of LODaaS services for geo-spatial LOD publishing, integration & querying

• Cloud is offering its scalability & elasticity of computation, 24/7 availability & multiple data storage and integration offerings

• Our cloud-based service-oriented system:– Exhibits good LOD management performance– Exposes a LOD management service that abstracts away

RDF Store peculiarities & provides a generic way for LOD access and management

Page 11: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Proposed Solution (II)

• A particular solution is adopted for mapping geo-spatial data in different formats to RDF data

• The latter conform to extensible conceptual models that accurately capture thematic areas and are integrated via GeoScientific Observation Model– This allows imposing queries across providers and thematic fields

• Our solution is part of the system, developed in the context of the InGeoCloudS project, that exploits cloud capabilities & LD technology to integrate & store heterogeneous geo-spatial data sets of different thematic fields + host & execute applications that exploit these data sets

Page 12: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Architecture (I)

• System is scalable and elastic by exploiting cloud facilities• An extensive application pool can be built on top that

exploits the offered services to perform various added-value and high-demanding tasks:– LO GeoData visualization, discovery & composition of data-sets,

LO GeoData analytics– System could be extended to host such applications & offer

various (geo-spatial) LO GeoData processing services and pre-built applications

Page 13: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Architecture (II)

• Distributor: equally distributes generic queries & collects back the results, non-generic queries are sent to instances with the appropriate data, data distribution achieved by assigning new data to the less loaded wrt storage space scaling layer, exploits CPU monitoring & elasticity facilities of Amazon

• Scaling Layer: comprises one or more LOD management components, data are replicated across these components to enhance reliability & enable layer-based load balancing

• LOD Management Component: comprises LOD Management Service (LMS) instance & Virtuoso server for storage

• LMS: provides methods for data providers to manage LOD & for other users to query & export the LOD stored

• Virtuoso: underlying RDF triple store also allowing the mapping & synchronization between relational and RDF data

Page 14: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres
Page 15: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

General Query Evaluation Behavior

Response Time

Time passed

2nd instance involvement

Page 16: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

LOD Integration & Publishing (I)

• Extension of the high-level CIDOC-CRM conceptual model• New model is called Geo-Scientific Spatial Observation

Model (GSOM) & expressed in RDF/S• It enables to capture all information coming different

fields & countries + link data across different providers• INSPIRE was not exploited as did not cover all

requirements:– Capturing of scientific events– Complicated and cumbersome for information integration– In some cases, does not cover all appropriate information

required by the data providers in particular thematic fields• GSOM-to-INSPIRE mapping specification to enable

exporting INSPIRE-compliant data

Page 17: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

LOD Integration & Publishing (II)

• Two alternatives for publishing LOD:1. Create and import RDF-based descriptions of data-sets via

particular LMS method– Data update process must be controlled by performing SPARUL

updates via particular LMS method– Data provider responsibility to keep synchronized relational &

RDF data• A perfect synchronization may be also not required as it may incur costs

-> second alternative becomes more preferable

Page 18: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

LOD Integration & Publishing (III)

2. Data provider publishes relational data of his/her data sets + provides a mapping file in R2RML to enable the synchronization of relational to RDF data (by executing LMS method)

– System takes care of this synchronization– Relational storage in the way used many years + additional

RDF storage for the data with automatic one-way synchronization between the two

– Provider should have a good knowledge of GSOM & RDF

Page 19: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

LOD Integration & Publishing (IV)

• R2RML: – W3C recommendation since 2012– Can specify customized mappings between RDB & RDF data– R2RML specification is just a RDF graph in Turtle– No specific implementation is imposed

• Virtuoso supports R2RML by processing the R2RML specification & creating the respective RDB2RDF triggers (used for creating/updating RDF data from relational ones)– An RDF view or physical RDF graph can be created with the

second option mapping to far better performance

Page 20: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

BoreholeName

Sample ID

WaterLevel

B1 XYZ Level1

Borehole Relational Model

E42.IdentifierSample_IDSample_ID,

E41.AppelationBorehole_NamBorehole_Namee

S16.Borehole

E54.DimensionWaterlevelWaterlevel

P43F.has_dimension

E26.Physical_Feature

S15.Acquifer_ConceptIntake

P121F.overlaps_with

P1F.is_identified_by

P1F.is_identified_by

S13.Sample

O5.removed

S2.SampleTaking

O4.sampled_from

URI Identification:http://orgURL/SampleID/XYZ

R2RML

GSOM

Publication

Synchronization

RDB

Page 21: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

LOD Management Service (I)

• REST-based service with API exposing all appropriate management functionality needed by geo-spatial LOD users– Abstracts away from peculiarities of RDF triple stores– Enables simple & intuitive use of a specific set of LOD

management methods– Programmatic or form-based access to methods– Production of query results in different forms, such as WKT,

GML, & KML– Imporing/exporting capabilities in different formats

(RDF/XML, NTriples, Turtle)

Page 22: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

LOD Management Service (II)

• The provided methods are:– meta_query (SPARQL string, timeout (opt.), row limit (opt.)): user-

requested format (e.g., JSON)– meta_update (SPARUL string, baseURI, timeout (opt.), row limit

(opt.))– meta_addMappings(R2RML string, graphURI) -> initiates mapping

procedure– meta_export(graphURI, subjURI, predURI, objURI, internal): user-

requested format -> last param indicates if result will be inline in the response

– meta_import(url, graphUri, format, blocking): ImportStatus -> RDF data are imported by downloading them via provided URL or inline in user-request + method can be blocking or non-blocking

– import_status(importID): ImportStatus -> in case of blocking import request, the user can inquire the status of his/her import by exploiting the value of a specific field (importID) returned from the previous method as input to this method

Page 23: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

LOD Management Service (III)

• Each method accessible via specific URL + produces meaningful exception messages (e.g., in case user input is wrong)

• User-friendly HTML Documentation produced via Enunciate

• Implementation exploited Sesame RDF Data Management API, Virtuoso’s JDBC Driver & Jersey

Page 24: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Open Issues (I)

• Model:– Extend it to capture other thematic fields– Data published in our system could fulfill all requirements to be

5-star LOD if respective owners decide to do so

• Data mapping:– Cloud-based Virtuoso version supports native Relational DB for

RDB2RDF synchronization• Trade-off between LOD management completeness & cost

– Mapping tools are needed to allow visual-based editing of R2RML without needing from data providers to have good knowledge of RDF

– Research issue: support bi-directional RDB2RDF mappings

Page 25: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Open Issues (II)

• Geo-spatial query support:– Virtuoso does not support GeoSPARQL– Virtuoso has limited geo-spatial query support only in

commercial versions• 2D geometries + limited set of topological relation operators

– Additional support in terms of geometry dimensionality + feature aggregation operators

• Could extend Virtuoso via frameworks, such as uSeekM, which provide adequate geo-spatial support along with the capability of evaluating GeoSPARQL queries

– Such solutions require processing all RDF data stored to create geo-spatial indices as well as deploy another DB -> do not fit well with automatic geo-spatial LOD management

– Could resolve problem by: (a) performing re-indexing in infrequent time intervals, (b) create specialized triggers which trigger re-indexing only when RDF data are updated

Page 26: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Open Issues (III)

• Quality & provenance:– Original input data sets may not have the appropriate quality -

> resulting RDF data can have the same or lower quality level– Proposed infrastructure must be extended with quality

resolving procedures & methods (e.g., data cleansing methods for correcting the data exploited)

– Provenance information can ensure the correct updating of LD + assist in LD reasoning process by deriving additional facts

– Thus, provenance information should be exploited by our system, especially if we consider that such exploitation is not enabled by most LOD management systems

Page 27: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

Conclusions

• Proposed a scalable, geo-spatial LOD as-a-Service management system deployed on Amazon cloud– Distributes query load + scales-up/down when CPU utilization

surpasses specific thresholds– Exposes REST-based service with LOD management methods– Provides two different ways for publishing open geo-spatial data sets

• Advance geo-spatial support level by following two directions:– Realize GSOM-to-INSPIRE mapping to enable producing INSPIRE-

compliant data– Extend Virtuoso with geo-spatial indexing & query systems to enable

the efficient processing of rich & expressive geo-spatial queries, expressed either in SPARQL or GeoSPARQL

Page 28: Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres

28