Data-as-a-Service: DataGraft


Dumitru Roman, dumitru.roman@sintef.no

https://datagraft.net

“Data is the new oil” … but many of us just need gasoline

Data-as-a-Service … is the new filling station

Data-as-a-Service

• Outsourcing of various data operations to the cloud

• Eliminates

– upfront costs on data infrastructure

– ongoing investment of time and resources in managing the data infrastructure

• Complete package for

– transformation of raw data into meaningful data assets

– reliable delivery of data assets

Example #1: Using open data – petroleum activities on the Norwegian continental shelf

• ~70 tabular datasets

• Difficult to query across tables, integrate with other data, e.g. Business Registry

• Simplified integration with external datasets

• Distribution of integrated dataset

• Live service

• Reliable access

• …

• Which companies have been owners in license X?

• What is the oil production for each field in year X?

• What is the total production of the top 10 companies by number of employees in year X?

• ....
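The last question above needs a join between a production table and a business-registry table. A minimal Python sketch of that join follows; the company names, fields, and figures are hypothetical placeholders, not values from factpages.npd.no or the Business Registry, and the function name is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for a production table and a registry table.
production = [
    {"company": "A", "field": "F1", "year": 2014, "oil": 10.0},
    {"company": "A", "field": "F2", "year": 2014, "oil": 5.0},
    {"company": "B", "field": "F1", "year": 2014, "oil": 7.5},
    {"company": "C", "field": "F3", "year": 2014, "oil": 2.0},
    {"company": "B", "field": "F1", "year": 2013, "oil": 9.0},
]
registry = [
    {"company": "A", "employees": 500},
    {"company": "B", "employees": 1200},
    {"company": "C", "employees": 40},
]


def total_production_of_top_companies(production, registry, year, top_n):
    """Sum production in `year` for the `top_n` companies by employee count."""
    # Rank companies by number of employees and keep the largest ones.
    top = sorted(registry, key=lambda r: r["employees"], reverse=True)[:top_n]
    top_names = {r["company"] for r in top}

    # Join on the company key and aggregate production per company.
    totals = defaultdict(float)
    for row in production:
        if row["year"] == year and row["company"] in top_names:
            totals[row["company"]] += row["oil"]
    return dict(totals)


result = total_production_of_top_companies(production, registry, year=2014, top_n=2)
print(result)
```

With the data held in separate tabular files, this kind of join has to be hand-written for every question, which is the difficulty the slide points at.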

Integration and querying service

Tabular data on the Web

Data Insights

factpages.npd.no

data.brreg.no/oppslag/enhetsregisteret

Example #2: Reporting state-owned real estate properties in Norway

• A hard copy of 314 pages and as a PDF file

• 6 Person-Months

• Data collection with spreadsheets

• Quality assurance through e-mails and phone correspondence

Pains

• Time consuming

• Poor data quality

• Static report without live updating

• Live service

• Efficient sharing of data

• Simplified integration with external datasets

• Live updating

• Reliable access

• …

• Risk and vulnerability analysis, e.g. buildings affected by flooding

• Analysis of leasing prices

Report Reporting Service 3rd party services

Sample data

Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access

Example #3: Personalized and Localized Urban Quality Index (PLUQI)

The index includes data from various domains:

• Daily life satisfaction: weather, transportation, community, …

• Healthcare level: number of doctors, hospitals, suicide statistics, …

• Safety and security: number of police stations, fire stations, crimes per capita, …

• Financial satisfaction: prices, incomes, housing, savings, debt, insurance, pension, …

• Level of opportunity: jobs, unemployment, education, re-education, …

• Environmental needs and efficiency: green space, air quality, …

Sample data

DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way

Powerful data transformation and reliable data access capabilities

Tabular Data

• Open Data is mostly tabular data

• Excel, CSV, TSV, etc.

• Records organized in silos of collections

• Very few links within and/or across collections

• Difficult to understand the nature of the data

• Difficult to integrate / query

Graph Data

Based on Linked Data

• Method for publishing data on the Web

• Self-describing data and relations

• Interlinking

• Accessed using semantic queries

• Open standards by W3C

– Data format: RDF

– Knowledge representation: RDFS/OWL

– Query language: SPARQL

http://www.w3.org/standards/semanticweb/data

europeandataportal.eu
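To make the tabular-versus-graph contrast concrete, here is a minimal Python sketch that turns CSV rows into RDF statements in N-Triples syntax. The `http://example.org/` namespace, the column names, and the sample rows are assumptions for illustration only, and the literal escaping covers only backslashes and double quotes, so this is not a complete N-Triples serialiser.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not a real vocabulary

# Hypothetical tabular input: one record per row, one attribute per column.
CSV_TEXT = """id,name,employees
c1,Example Oil AS,500
c2,Sample Energy ASA,1200
"""


def literal(value):
    """Serialise a plain string literal (escapes backslash and quote only)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rows_to_ntriples(csv_text, id_column):
    """Emit one triple per non-key cell: <row IRI> <column IRI> "value" ."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}company/{quote(row[id_column])}>"
        for column, value in row.items():
            if column == id_column:
                continue  # the key column becomes the subject IRI
            predicate = f"<{BASE}prop/{quote(column)}>"
            triples.append(f"{subject} {predicate} {literal(value)} .")
    return triples


triples = rows_to_ntriples(CSV_TEXT, "id")
print("\n".join(triples))
```

Each row becomes a node with an IRI, so records from other datasets can link to it, which is what the "Interlinking" bullet refers to.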

Data Transformation and RDF Publication Process

• Interactive design of transformations?

• Repeatable transformations?

• Reuse/share transformations (user-based access)?

• Cloud-based deployment of transformations?

• Self-serviced process?

• Data and Transformation as-a-Service?

Tabular Data → Graph Data → Semantic graph database

DataGraft: Data-as-a-Service for the Data Transformation and RDF Publication Process

https://www.ssb.no/statistikkbanken

Example: Using statistical data

Data records (rows)

• Add row

• Take row(s)

• Drop row(s)

• Shift row

• Filter rows (grep)

• Remove duplicate rows

Entire dataset

• Sort

• Reshape dataset

• Group (categorize) and aggregate

Columns

• Add column(s)

• Take column(s)

• Drop column(s)

• Move column

• Merge columns

• Split column

• Rename column(s)

• Apply function to all values in a column
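A few of the row, dataset, and column operations listed above can be sketched as small composable steps. This is plain Python written for illustration; the function names and sample rows are hypothetical and do not reproduce the API of DataGraft, Grafter, or Grafterizer.

```python
from collections import defaultdict
from functools import reduce


def drop_rows(n):
    """Drop the first n rows (e.g. a junk title row)."""
    return lambda rows: rows[n:]


def filter_rows(predicate):
    """Keep only rows for which predicate(row) is true."""
    return lambda rows: [r for r in rows if predicate(r)]


def remove_duplicates():
    """Remove exact duplicate rows, keeping the first occurrence."""
    def step(rows):
        seen, out = set(), []
        for r in rows:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out
    return step


def rename_column(old, new):
    return lambda rows: [
        {(new if k == old else k): v for k, v in r.items()} for r in rows
    ]


def apply_to_column(column, fn):
    """Apply fn to every value in one column."""
    return lambda rows: [{**r, column: fn(r[column])} for r in rows]


def group_and_sum(key, value):
    """Group rows by `key` and sum `value` within each group."""
    def step(rows):
        totals = defaultdict(int)
        for r in rows:
            totals[r[key]] += r[value]
        return [{key: k, value: v} for k, v in sorted(totals.items())]
    return step


def pipeline(*steps):
    """Compose steps left to right into one repeatable transformation."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)


# Hypothetical spreadsheet rows, all values read as strings.
raw = [
    {"region": "Table title", "n": ""},  # junk header row
    {"region": "Oslo", "n": "10"},
    {"region": "Oslo", "n": "10"},       # exact duplicate
    {"region": "Bergen", "n": "4"},
    {"region": "Oslo", "n": "3"},
    {"region": "", "n": "1"},            # missing region
]

transform = pipeline(
    drop_rows(1),
    filter_rows(lambda r: r["region"] != ""),
    remove_duplicates(),
    apply_to_column("n", int),
    rename_column("n", "employed"),
    group_and_sum("region", "employed"),
)
clean = transform(raw)
print(clean)
```

Because `transform` is an ordinary value, the same pipeline can be applied again to a new version of the spreadsheet, which is the sense of a repeatable transformation.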

Data pages and federated querying

What is the population of locations and total number of persons employed in Human health and social work activities?

Configuring data visualizations

APIs

DataGraft key feature: Flexible management and sharing of data and transformations

• Fork, reuse and extend transformations built by other professionals from DataGraft’s transformations catalog

• Interactively build, modify and share data transformations

• Share transformations privately or publicly

• Reuse transformations to repeatably clean and transform spreadsheet data

• Programmatically access transformations and the transformation catalogue

Reuse of transformations in environmental data publishing

TRAGSA Pilot

• Number of transformations: 42

– Created via reuse: 25

• Number of triples: ~7.7M

ARPA Pilot

• Number of transformations: 5

– Created via reuse: 2

• Number of triples: ~14K

Forking/reusing transformations helped us spend less time on creating new transformations

DataGraft key feature: Reliable data hosting and querying services

• Host data on DataGraft’s reliable, cloud-based semantic graph database

• Share data privately or publicly

• Query data through your own SPARQL endpoint

• Programmatically access the data catalogue
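Querying a SPARQL endpoint programmatically can be sketched with the Python standard library. The request shape follows the SPARQL 1.1 Protocol (a `query` parameter on a GET request, with JSON results requested through the `Accept` header); the endpoint URL, the query, and the sample response below are hypothetical and are not taken from DataGraft.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint URL

QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_rows(results_json):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


request = build_request(ENDPOINT, QUERY)

# Hypothetical response body in the SPARQL 1.1 Query Results JSON Format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["s", "p", "o"]},
    "results": {"bindings": [{
        "s": {"type": "uri", "value": "http://example.org/company/c1"},
        "p": {"type": "uri", "value": "http://example.org/prop/name"},
        "o": {"type": "literal", "value": "Example Oil AS"},
    }]},
})
rows = bindings_to_rows(SAMPLE_RESPONSE)
print(request.full_url)
print(rows)
```

Against a real endpoint, the request would be sent with `urllib.request.urlopen(request)` and the response body passed to `bindings_to_rows`; private datasets would additionally need whatever authentication the hosting service requires.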

Operations & maintenance performed on behalf of users

DataGraft Enablers

Grafter, Grafterizer, Semantic Graph DBaaS, Data Portal

DataGraft – 1 package, 2 audiences

• Data Publisher: helping to integrate and publish data

• Application Developer: giving better, easier tools

DataGraft – targeted impacts

Reduction in costs for organisations which lack sufficient expertise and resources to make their data available

Reduction in the dependency of data owners on generic Cloud platforms to build, deploy and maintain their linked data from scratch

Increase in the speed of publishing new datasets and updating existing datasets

Reduction in the cost and complexity of developing applications that use data

Increase in the reuse of data by providing reliable access to numerous datasets hosted on DataGraft.net

Example: The benefit of DataGraft in PLUQI

Before: Datasets gathering → Data transformation → Data provisioning/access → Implementing App

After (with DataGraft): Datasets gathering → Data transformation → Data provisioning/access → Implementing App

1. 23% reduction in development cost

• Reducing cost for implementing transformations

• Integrating the process is simpler

2. Able to focus on service quality

• Gathering enough good datasets

• Designing/implementing

DataGraft in numbers (as of end of Jan 2016)

• 238 registered users

• 607 (208 public) registered data transformations

• 1828 uploaded files

• 192 public data pages

DataGraft in the wild

• Investigating crime data in small geographies

• Used DataGraft to transform data and publish RDF

http://benproctor.co.uk/investigating-crime-data-at-small-geographies/

Data Science and DataGraft

Greater Data Science:

1. Data Exploration and Preparation

2. Data Representation and Transformation

3. Computing with Data

4. Data Visualization and Presentation

5. Data Modeling

6. Science about Data Science

“50 years of Data Science” by David Donoho

http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

DataGraft

Summary

• DataGraft – emerging Data-as-a-Service solution for making (linked) data more accessible

– Platform, portal, methodology, APIs

– Online service, functional and documented

– Validated through several use cases

• Key features:

– Support for Sharable/Repeatable/Reusable Data Transformations

– Reliable RDF Database-as-a-Service

https://datagraft.net

Thank you!

Contact: dumitru.roman@sintef.no