The Dendro research data management platform: Applying ontologies to long-term preservation in a...


DESCRIPTION

It has been shown that data management should start as early as possible in the research workflow to minimize the risks of data loss. Given the large numbers of datasets produced every day, curators may be unable to describe them all, so researchers should take an active part in the process. However, since they are not data management experts, they must be provided with user-friendly but powerful tools to capture the context information necessary for others to interpret and reuse their datasets. In this paper, we present Dendro, a fully ontology-based collaborative platform for research data management. Its graph data model innovates in the sense that it allows domain-specific lightweight ontologies to be used in resource description, acting as a staging area for later deposit in long-term preservation solutions.


The Dendro research data management platform

Applying ontologies to long-term preservation in a collaborative environment

João Rocha da Silva joaorosilva@gmail.com

Faculdade de Engenharia da Universidade do Porto / INESC TEC

João Aguiar Castro joaoaguiarcastro@gmail.com

Cristina Ribeiro mcr@fe.up.pt

DEI, Faculdade de Engenharia da Universidade do Porto / INESC TEC

João Correia Lopes jlopes@fe.up.pt

iPRES 2014, October 06 - 10 2014, Melbourne, Australia

Contents

• Research data management in the long tail

• Linked Open Data: why do we need it?

• Collaboration for easier metadata production

• The Dendro platform

• Conclusions


Research Data Management in the long tail of research

Why we need to start early


2011: Science magazine reviewers are asked about their data requirements

~1700 replied

The long tail of research


Dealing with data. Challenges and opportunities. Introduction. (2011). Science (New York, N.Y.), 331(6018), 692–3. doi:10.1126/science.331.6018.692

Source


Gathering

Processing

Paper writing

Preservation, Sharing


Gathering

Processing

Paper writing

Researcher leaves

Metadata


Gathering

Processing

Paper writing

Project ends

“Where is the data?”

“How / when / by whom was the data produced?”

Gathering

Processing

Paper writing


Researchers must participate in RDM from the start

They are the domain experts

Curators cannot cope with a posteriori description


Linked Open Data

What is it? Why do we need it?

Linked Open Data

• Simplicity!

- LOD is a very simple model for representing knowledge

• Meaning!

- Resources are interlinked by properties with established meaning

• Interoperability!

- Standard methods for querying data - SPARQL

- Representations use standard formats - RDF, OWL

Example resource description (graph figure):

http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls

  nie:isLogicalPartOf  http://dendro.fe.up.pt/project/datanotes/data

  dc:title  “Base data of the DCB experiments”

  nie:title  base data.xls

  rdf:type  nie:File

  dcb:initialCrackLength  180mm
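As a minimal sketch of how the resource description on this slide could be written down as RDF triples, the Python below stores each statement as a tuple and prints it in an N-Triples-like form. The reading of the figure (the file is the subject of all five statements) is an interpretation of the slide. The `dc` and `dcb` namespace IRIs are assumptions: the slide only shows the prefixes, and the `dcb` IRI here is a hypothetical placeholder.

```python
# Prefix-to-namespace map. The rdf IRI is the W3C one and the nie IRI is the
# NEPOMUK Information Element namespace as commonly published.
# The dc and dcb IRIs are ASSUMPTIONS made for this sketch.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "nie": "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#",
    "dc": "http://purl.org/dc/terms/",            # assumed: the slide only says "dc"
    "dcb": "http://example.org/ontologies/dcb#",  # hypothetical placeholder
}

FOLDER = "http://dendro.fe.up.pt/project/datanotes/data"
FILE = FOLDER + "/base%20data.xls"

# Each statement: (subject, predicate, object, object_is_literal)
TRIPLES = [
    (FILE, "rdf:type", "nie:File", False),
    (FILE, "nie:isLogicalPartOf", FOLDER, False),
    (FILE, "nie:title", "base data.xls", True),
    (FILE, "dc:title", "Base data of the DCB experiments", True),
    (FILE, "dcb:initialCrackLength", "180mm", True),
]


def expand(term):
    """Expand a prefixed name such as nie:File to a full IRI."""
    if term.startswith("http://"):
        return term
    prefix, local = term.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriples(triples):
    """Serialize the statements, one per line, with plain string literals."""
    lines = []
    for s, p, o, is_literal in triples:
        if is_literal:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        else:
            obj = "<" + expand(o) + ">"
        lines.append("<%s> <%s> %s ." % (expand(s), expand(p), obj))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
```

The generic statements (`dc:title`, `nie:title`) and the domain-specific one (`dcb:initialCrackLength`) sit side by side on the same resource, which is the point the slide appears to make.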


Analytical Chemistry Dataset

  Generic: Author, Description, Creation date

  Domain Specific: Sample Count, Analysed Substance

Fracture Mechanics Dataset

  Generic: Author, Description, Creation date, …

  Domain Specific: Initial Crack Length, Specimen Type

…
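The split between generic and domain-specific descriptors can be illustrated with a small Python sketch. It is a toy model, not Dendro's implementation: the `DescriptorRegistry` class, the `generic` and `achem` prefixes, and the camel-case descriptor names are all hypothetical. Only `dcb` and the descriptor labels come from the slides.

```python
class DescriptorRegistry:
    """Toy registry: every loaded ontology contributes a set of descriptors."""

    def __init__(self):
        self._ontologies = {}

    def load_ontology(self, prefix, descriptors):
        # Loading one more ontology extends the model without touching the rest.
        self._ontologies[prefix] = list(descriptors)

    def descriptors_for(self, prefixes):
        """Return the prefixed descriptors offered by the chosen ontologies."""
        result = []
        for prefix in prefixes:
            for name in self._ontologies.get(prefix, []):
                result.append(prefix + ":" + name)
        return result


registry = DescriptorRegistry()
# Hypothetical prefixes and names, mirroring the labels on the slide.
registry.load_ontology("generic", ["author", "description", "creationDate"])
registry.load_ontology("achem", ["sampleCount", "analysedSubstance"])
registry.load_ontology("dcb", ["initialCrackLength", "specimenType"])

fracture = registry.descriptors_for(["generic", "dcb"])
chemistry = registry.descriptors_for(["generic", "achem"])

if __name__ == "__main__":
    print(fracture)
    print(chemistry)
```

Both dataset types share the generic descriptors, and each gains its own domain vocabulary simply because another ontology was loaded.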


Collaboration

For metadata useful now and in the future

Gathering

Processing

Paper writing

Preservation, Sharing


Gathering

Deposit

“Freeze” in repository

Collaboration Description

Sharing


Gathering

…

Demo

Dendroβ

The Dendro platform

An open-source platform for Linked Open Data in research environments

Metadata

Ontologies

• Data store fully built on Linked Data

• No relational database to preserve

• Model can grow by loading more ontologies

• External systems can retrieve resources via SPARQL

Description
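To make the SPARQL bullet concrete, here is a hedged Python sketch of how an external system might prepare a query using the standard SPARQL protocol, which passes the query text in a `query` URL parameter. The endpoint URL is a hypothetical placeholder, since the slides do not give the address of Dendro's endpoint. The query reuses the `nie` terms and the folder URI shown earlier in the deck.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not state where Dendro's endpoint lives.
ENDPOINT = "http://example.org/sparql"

# List the files (and their titles) that are logical parts of one folder.
QUERY = """
PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
SELECT ?file ?title WHERE {
  ?file a nie:File ;
        nie:isLogicalPartOf <http://dendro.fe.up.pt/project/datanotes/data> ;
        nie:title ?title .
}
""".strip()


def build_query_url(endpoint, query):
    """Build a SPARQL protocol GET URL: the query goes in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)

if __name__ == "__main__":
    # The URL is only printed here. Sending it (for example with
    # urllib.request.urlopen) needs a reachable endpoint.
    print(url)
```

Because the request uses only the standard protocol, the client needs no knowledge of the platform beyond the vocabulary used in the query.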


Metadata

Ontologies

File Storage

• GridFS cluster for large or numerous files

• Can work in the cloud if needed

Deposit


Metadata

Ontologies

File Storage

Business Logic

• Flexible access control system

• Backup / Restore

• Versions history

• File type previews

• Integration

- DSpace (SWORD)

- ePrints (SWORD)

- CKAN

- Figshare

- …

Collaboration


Metadata

Ontologies

File Storage

Business Logic

API

Sharing

• All operations available via RESTful API using JSON

• All resources are de-referenceable (HTTP content negotiation)

• Plugin architecture allows integration with external systems

Web UI
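HTTP content negotiation, as mentioned in the bullets above, means that the same resource URI can yield different representations depending on the `Accept` header. The Python sketch below only builds the requests and does not send them. The media types are assumptions based on the formats named elsewhere in the deck (JSON, RDF/XML); which ones a Dendro instance actually honours is not stated in the slides.

```python
from urllib.request import Request

# Resource URI taken from the example earlier in the deck.
RESOURCE = "http://dendro.fe.up.pt/project/datanotes/data"


def negotiated_request(uri, media_type):
    """Prepare a GET request asking for one representation of the resource."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


# Assumed media types: JSON for API clients, RDF/XML for Linked Data clients.
as_json = negotiated_request(RESOURCE, "application/json")
as_rdf = negotiated_request(RESOURCE, "application/rdf+xml")

if __name__ == "__main__":
    for req in (as_json, as_rdf):
        # Sending would be urllib.request.urlopen(req). It is omitted because
        # it depends on the server being reachable.
        print(req.get_method(), req.full_url, req.get_header("Accept"))
```

The URI stays the same in both requests. Only the header changes, which is what makes each resource de-referenceable by browsers and machine clients alike.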

For curators

• Curators can work with researchers to build more ontologies using existing tools (e.g. Protégé)

• Established ontologies can be loaded (DC, FOAF…)

• Ontologies mature (reuse across Dendro instances)

• Data, metadata and its meaning go together

Creating lightweight ontologies for dataset description: Practical applications in a cross-domain research data management workflow Castro, J., Rocha da Silva, J., Ribeiro, C. Digital Libraries 2014 (DL2014) (pre-print available at http://dendro.fe.up.pt/)

Beyond INSPIRE: An ontology for biodiversity metadata records Rocha da Silva, J., Castro, J., Ribeiro, C., Honrado, J., Lomba, A., Gonçalves, J. 10th International Workshop on Ontology Content (OntoContent 2014) (pre-print available at http://dendro.fe.up.pt/)

For programmers

• 100% Open-source software

• Rich API allows Dendro to be connected to almost any system (e.g. mobile apps)

LabTablet: semantic metadata collection on a multi-domain laboratory notebook Amorim, R., Castro, J., Rocha da Silva, J., Ribeiro, C. 8th Metadata and Semantics Research Conference (MTSR 2014) (pre-print available at http://dendro.fe.up.pt/)

Ontology-based multi-domain metadata for research data management using triple stores Rocha da Silva, J., Ribeiro, C., Correia Lopes, J. 18th International Database Engineering & Applications Symposium (IDEAS 2014) (pre-print available at http://dendro.fe.up.pt/)

Triple Store Ontologies

Dendro dies, data lives on

“Database” “Documentation”

Conclusions

• Research data management should start early

• Linked Open Data: simple, interoperable, flexible

• Collaboration support helps researchers while gathering metadata for later deposit

• Dendro: a fully open-source platform for RDM, built on Linked Open Data

• Dendro integrates with major repository platforms


Conclusions (cont’d)

• Ontologies: source of metadata descriptors

• Data model grows as more ontologies are loaded

• Curators can model and share the ontologies

• Domain ontologies evolve with reuse


Visit us at

http://dendro.fe.up.pt

João Rocha da Silva is an Informatics Engineering PhD student at the Faculty of Engineering of the University of Porto. He specializes in research data management, applying the latest Semantic Web technologies to the adequate preservation and discovery of research data assets. He is also an experienced freelance iOS developer with several apps published on the App Store, and a self-taught DIY mechanic with a special interest in classic cars, particularly his 1987 Toyota Corolla GT Twin Cam, also known as Hachi-Roku or AE86.

PhD Student, Senior Web Developer, Semantic Web at INESC TEC

João Rocha da Silva

João Correia Lopes is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. He graduated in Electrical Engineering from the University of Porto in 1984 and received a PhD in Computing Science from Glasgow University in 1997. His teaching includes undergraduate and graduate courses in databases and web applications, software engineering and object-oriented programming, markup languages and the semantic web. He has been involved in research projects in the areas of long-term preservation, service-oriented architectures and e-Science. His current research interests are e-Science and the management of research data.

Cristina Ribeiro is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. She graduated in Electrical Engineering and holds a Master's degree in Electrical and Computer Engineering and a Ph.D. in Informatics. Her teaching includes undergraduate and graduate courses in information retrieval, digital libraries, knowledge representation and markup languages. She has been involved in research projects in the areas of cultural heritage, multimedia databases and information retrieval. Her current research interests are information retrieval, digital preservation and the management of research data.

Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC

Cristina Ribeiro

Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC

João Correia Lopes

João Aguiar Castro holds a Master's degree in Information Science and is currently a Digital Platforms PhD student at the Faculty of Engineering of the University of Porto. His research is in research data management, particularly the definition of application profiles that meet the metadata needs of different research domains.

PhD Student, Research Data Management researcher at INESC TEC

João Aguiar Castro

Extras

Architecture (figure), by layer:

Data

- Graph Database (LOD): Openlink Virtuoso 7

- Distributed document index: ElasticSearch

- File Storage Cluster: MongoDB (GridFS)

Logic

- Business Logic: NodeJS (JavaScript)

Presentation

- Web Interface: AngularJS (JavaScript)

Connectors and formats shown in the figure: DB Adapter, ES Endpoint, GridFS Client; JSON; RDF/XML, SPARQL Endpoint; JSON API; HTML; Human Users, Web

Workflow (figure), with steps numbered 1 to 4. Labels shown: Data producers, Working Files, Dendro, Curator, Curated Dataset, Metadata validation, Deposit, Free-Text Search, API, SPARQL Endpoint, Web Portal, CKAN, Dryad, Data reuser, Domain-Specific Lightweight Ontologies (dcb), Ontology concept reuse (FOAF, DC; dc:title, nie:isPartOf, dcb:specimenLength), Specification of new metadata ontologies, Sharing & evolution, “Mature” ontologies on the web
