A Generic Scientific Data Model and Ontology for Representation of Chemical Data. Stuart J. Chalk, Department of Chemistry, University of North Florida. CINF Paper 171 – 251st ACS Meeting, Spring 2016. #ACSCINFDataSummit


Page 1: A Generic Scientific Data Model and Ontology for Representation of Chemical Data

Stuart J. Chalk, Department of Chemistry, University of North Florida

CINF Paper 171 – 251st ACS Meeting Spring 2016

#ACSCINFDataSummit

Page 2: Scientific Data Should be Open

Simple: openness as the norm, not the exception

- Data made available, without restriction, so it's useful
- Mechanisms/tools to make data available
- Formats to allow others to get the data…
- …but also so it's easy to use
- Annotate the data to make it easy to find

Community-driven promotion of, and action on, this issue

Page 3: Options for Storing Data?

- Research Notebook
- Spectral Files (JCAMP-DX, proprietary)
- Excel Spreadsheets
- Personal Databases
- Online Databases

PDF Files: No!

RDF (Resource Description Framework): Yes!

Page 4: The Linked Data Platform

W3C Recommendation, 2015

- Specification: https://www.w3.org/TR/ldp/
- Primer: https://www.w3.org/TR/ldp-primer/

From: http://www.dataversity.net/introduction-linked-data-platform/

Page 5: JSON for Linked Data (JSON-LD)

Use JavaScript Object Notation (JSON) as a text format for storing data and metadata so that it can be converted to RDF.

{
  "@context": {
    "name": "http://schema.org/name",
    "isAlive": "http://example.org/isAlive",
    "age": "http://example.org/age",
    "height": "http://schema.org/height",
    "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
  },
  "@id": "",
  "name": "Stuart Chalk",
  "isAlive": true,
  "age": 49,
  "height": 188.0
}

http://json-ld.org/playground/
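The conversion from this JSON-LD document to the RDF triples listed on Page 6 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a JSON-LD processor: it assumes a flat object whose context maps each term directly to an IRI, and the function names `to_literal` and `flat_jsonld_to_ntriples` are hypothetical. Real work would normally rely on a full JSON-LD library or the playground linked above.

```python
import json
from urllib.parse import urljoin

XSD = "http://www.w3.org/2001/XMLSchema#"


def to_literal(value):
    """Render a JSON scalar as an N-Triples literal (simplified typing rules)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int) or (isinstance(value, float) and value.is_integer()):
        # Numbers without a fractional part become xsd:integer, as on Page 6.
        return f'"{int(value)}"^^<{XSD}integer>'
    if isinstance(value, float):
        # Simplification: a real processor emits a canonical xsd:double form.
        return f'"{value}"^^<{XSD}double>'
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def flat_jsonld_to_ntriples(doc):
    """Convert a flat JSON-LD object with a term-to-IRI context into triples."""
    context = doc["@context"]
    base = context.get("@base", "")
    subject = urljoin(base, doc.get("@id", ""))  # an empty "@id" resolves to the base
    lines = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # keywords are handled above, not emitted as properties
        lines.append(f"<{subject}> <{context[key]}> {to_literal(value)} .")
    return sorted(lines)


document = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "isAlive": "http://example.org/isAlive",
    "age": "http://example.org/age",
    "height": "http://schema.org/height",
    "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
  },
  "@id": "",
  "name": "Stuart Chalk",
  "isAlive": true,
  "age": 49,
  "height": 188.0
}
""")

triples = flat_jsonld_to_ntriples(document)
print("\n".join(triples))
```

Each property becomes one subject-predicate-object line: the subject comes from "@base" and "@id", the predicate from the context, and the object from the JSON value.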

Page 6: JSON for Linked Data (JSON-LD)

<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/age> "49"^^<http://www.w3.org/2001/XMLSchema#integer> .

<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/isAlive> "true"^^<http://www.w3.org/2001/XMLSchema#boolean> .

<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/height> "188"^^<http://www.w3.org/2001/XMLSchema#integer> .

<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/name> "Stuart Chalk" .

Page 7: Store all Scientific Data in RDF?

- Nice idea, but because anything can be linked to anything else, the resulting graph has a variable structure…
- …which makes it difficult to search and hard to maintain
- OK, use a regular relational database? That means a rigid schema, and it is not good to force data to fit the schema…
- Use a hybrid approach! Encode some structure in RDF using a framework…
- …then add data to the structured graph in an organized way

Page 8: What Metadata is Important for Data?

Consider the FAIR Principles (http://www.datafairport.org)

To be Findable:
- F1. (meta)data are assigned a globally unique and persistent identifier
- F2. data are described with rich metadata (defined by R1 below)
- F3. metadata clearly and explicitly include the identifier of the data they describe
- F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:
- A1. (meta)data are retrievable by their identifier using a standardized communications protocol
- A2. metadata are accessible, even when the data are no longer available

To be Interoperable:
- I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
- I2. (meta)data use vocabularies that follow FAIR principles
- I3. (meta)data include qualified references to other (meta)data

To be Reusable:
- R1. (meta)data are richly described with a plurality of accurate and relevant attributes
- R1.1. (meta)data are released with a clear and accessible data usage license
- R1.2. (meta)data are associated with detailed provenance
- R1.3. (meta)data meet domain-relevant community standards

Page 9: What Should a Data Model Represent?

Define scope as data obtained from an experiment, a series of experiments, or a project

- Who did the work and where are they?
- Metadata about the data “packet”
- The raw data…
- …and its associated metadata (enough to properly contextualize the data)
- Access rights
- Published location

Page 10: General Framework

SciData – Scientific Data Model (SDM)

Overview – http://stuchalk.github.io/scidata/

GitHub Repo – https://github.com/stuchalk/scidata

Page 11: General Framework - The Context

- “@context” contains the context definition
- Refers to other context files
- Namespace abbreviations
- Default vocabulary “@vocab”
- “@id” links a term to an ontology term
- “@type” states the data type
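As a rough illustration of the context elements listed above, the following Python sketch builds a small "@context" and expands two terms to full IRIs. Everything under example.org is a placeholder, not an actual SciData URL, and the term names `temperature` and `pressure` and the helper `expand_term` are assumptions made for this example. The expansion logic is deliberately simplified and covers only the cases shown.

```python
import json

# Hypothetical context: all example.org IRIs are placeholders.
context = [
    "https://example.org/contexts/base.jsonld",      # reference to another context file
    {
        "sci": "https://example.org/scidata#",        # namespace abbreviation
        "xsd": "http://www.w3.org/2001/XMLSchema#",   # namespace abbreviation
        "@vocab": "https://example.org/vocab/",       # default vocabulary for undefined terms
        "temperature": {
            "@id": "sci:temperature",                 # links the term to an ontology term
            "@type": "xsd:float",                     # states the data type of its values
        },
    },
]


def expand_term(term, local_context):
    """Expand a term to a full IRI using one local context object (simplified)."""
    definition = local_context.get(term)
    if isinstance(definition, dict):
        term = definition["@id"]
    elif isinstance(definition, str):
        term = definition
    prefix, sep, suffix = term.partition(":")
    if sep and isinstance(local_context.get(prefix), str) and not suffix.startswith("//"):
        return local_context[prefix] + suffix  # compact IRI such as "sci:temperature"
    if sep:
        return term  # already an absolute IRI
    return local_context.get("@vocab", "") + term  # fall back to the default vocabulary


local = context[1]
print(json.dumps({"@context": context}, indent=2))
print(expand_term("temperature", local))
print(expand_term("pressure", local))
```

A term with its own definition resolves through its "@id" and a namespace abbreviation, while an undefined term such as `pressure` falls back to "@vocab".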

Page 12: Methodology, System, and Dataset

Page 13: Example Data - pH

Page 14: Example Data - Literature Value

“scope” provides an internal link to an “@id” value

Each value of a name-value pair has a default data type that can be overridden by expanding the value to a JSON object and adding “@value” and “@type”
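The two points above can be sketched in Python as follows. Only the keys "scope", "@id", "@value" and "@type" come from the slide. The surrounding keys (`system`, `data`, `property`, `value`, `unit`), the placeholder numbers, and the helpers `typed_value` and `resolve_scope` are assumptions for illustration, not the actual SciData layout.

```python
# Hypothetical fragment with placeholder values.
record = {
    "system": [{"@id": "substance/1", "name": "example compound"}],
    "data": [
        {
            "scope": "substance/1",      # internal link: matches an "@id" above
            "property": "example property",
            "value": {"@value": "1.23", "@type": "xsd:decimal"},  # expanded, explicit type
            "unit": "example unit",      # plain value keeps its default type
        }
    ],
}


def typed_value(value, default_type="xsd:string"):
    """Return (value, datatype) for either a plain or an expanded value."""
    if isinstance(value, dict) and "@value" in value:
        return value["@value"], value.get("@type", default_type)
    return value, default_type


def resolve_scope(record, scope):
    """Find the object whose "@id" equals the given scope string."""
    return next(item for item in record["system"] if item["@id"] == scope)


datum = record["data"][0]
print(typed_value(datum["value"]))
print(typed_value(datum["unit"]))
print(resolve_scope(record, datum["scope"])["name"])
```

Expanding a value costs a little verbosity but makes the intended type explicit where the default would be wrong, for instance to keep a decimal from being read as a string.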

Page 15: Example Data - NMR Spectrum

“dataseries” are JSON arrays of data on one axis

Bring them together with “datagroup” and we can represent a spectrum

“parameter” is a generic container for data or metadata
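A minimal Python sketch of this idea: two "dataseries", one per axis, are held in a "datagroup" and paired up to recover the points of a spectrum. Only the key names "dataseries", "datagroup" and "parameter" come from the slide. The other keys (`title`, `axis`, `quantity`, `unit`, `values`), the helper names, and the numbers are placeholders assumed for illustration.

```python
# Hypothetical layout with placeholder numbers, not measured data.
spectrum = {
    "datagroup": {
        "title": "NMR spectrum (placeholder numbers)",
        "dataseries": [
            {
                "axis": "x",
                "parameter": {"quantity": "chemical shift", "unit": "ppm",
                              "values": [7.30, 7.26, 7.22]},
            },
            {
                "axis": "y",
                "parameter": {"quantity": "intensity", "unit": "arbitrary",
                              "values": [0.12, 1.00, 0.15]},
            },
        ],
    }
}


def series_by_axis(datagroup):
    """Map each axis label to the array of values stored for that axis."""
    return {s["axis"]: s["parameter"]["values"] for s in datagroup["dataseries"]}


def to_points(datagroup):
    """Pair the x and y series element by element to rebuild the spectrum."""
    axes = series_by_axis(datagroup)
    if len(axes["x"]) != len(axes["y"]):
        raise ValueError("x and y dataseries must have the same length")
    return list(zip(axes["x"], axes["y"]))


points = to_points(spectrum["datagroup"])
print(points)
```

Storing each axis as its own array keeps units and other metadata attached to the axis rather than repeated on every point.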

Page 16: Example Data - CC Calculation

“datagroup”s are structures to aggregate data at any level

“datagroup”s can be nested to any depth

“uid” is optional and can be used to uniquely identify any piece of data
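Nesting and optional identifiers can be illustrated with a short recursive walk in Python. The key names "datagroup" and "uid" come from the slide. The `data` key, the identifiers, the values, and the helpers `walk` and `index_by_uid` are assumptions made for this sketch.

```python
# Hypothetical nested structure with placeholder content.
calculation = {
    "uid": "calc-1",
    "datagroup": [
        {
            "uid": "geometry",
            "datagroup": [
                {"uid": "atom-1", "data": {"element": "O"}},
                {"data": {"element": "H"}},  # "uid" is optional, so this node has none
            ],
        },
        {"uid": "energy", "data": {"value": -1.0}},
    ],
}


def walk(node, depth=0):
    """Yield (depth, node) for a node and every nested datagroup, depth first."""
    yield depth, node
    for child in node.get("datagroup", []):
        yield from walk(child, depth + 1)


def index_by_uid(root):
    """Build a lookup table of every node that carries a "uid"."""
    return {node["uid"]: node for _, node in walk(root) if "uid" in node}


index = index_by_uid(calculation)
max_depth = max(depth for depth, _ in walk(calculation))
print(sorted(index))
print(max_depth)
```

Because the walk is recursive, the same code handles any nesting depth, and nodes without a "uid" are simply left out of the index.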

Page 17: The SDM Ontology

SciData Ontology – Scientific Data Model Ontology (SDMO)

OWL File – https://github.com/stuchalk/scidata/blob/master/ontology/scidata.owl

Page 18: Future Work

- Get community feedback, refine/extend/standardize
- Generate large corpus of disparate data in JSON-LD, ingest into triple store and query (SPARQL)
- Evaluate inferencing on the triple store data
- Push adoption through collaboration
- Run hackathons to build developer implementations
- Develop Electronic Laboratory Notebook (ELN) to generate data in JSON-LD
- Get feedback from data community, RDA - https://rd-alliance.org/
- Test using the NDS - http://www.nationaldataservice.org/

Page 19: Pain Points, Challenges, Opportunities

- Normalization
- Tools to generate metadata automatically
- User Perspective
- Gaps in Data
- Gaps in Ontology Coverage

Pain Points?
- Gather stakeholders to work on standards
- Broad knowledge domain representation
- IUPAC, RDA Chemistry Research Data IG

Priorities?
- Data annotation and representation
- Data exchange (repo <-> repo, user <-> user)
- Structure representation (chiral centers)
- Curation infrastructures
- Domain vocabulary translations
- Units of measure

Page 20: Reality Check

“to err is human; to forgive, divine”
Alexander Pope

“to err is human; to really screw things up requires a computer”
Paul Ehrlich

“to err is human; all hell will break loose if you don’t provide accurate semantics to a computer”
Stuart Chalk