
A Standard Data Format for Computational Chemistry: CSX

Stuart J. Chalk (1,2), Neil Ostlund (1), Mirek Sopek (1), Bing Wang (1)

1) Chemical Semantics Inc., Gainesville FL

2) Department of Chemistry, University of North Florida

[email protected]

249th ACS Meeting, Denver, CO – March 2015

Semantic Annotation of Data

Current DOE Project

Data Transformations

Common Standard for eXchange (CSX)

CSX as a Standard Data Format

The CSX Schema

CSX - Publication Information

CSX - Molecular System Information

CSX - Calculated Result Information

Future Plans

Conclusion

Outline

Create a way to ‘teach’ computers what information means – contextualize the data

Example

What is this? 904-620-1938

A computer just sees it as…

… a string

By using an appropriate semantic definition in RDF (the Resource Description Framework) we can identify to the computer that the text is a phone number (using the Friend of a Friend (FOAF) specification), i.e.

Semantic Annotation of Data

RDF Specification: http://www.w3.org/RDF/

FOAF Specification: http://xmlns.com/foaf/spec/

<foaf:phone rdf:datatype="#string">904-620-1938</foaf:phone>

RDF can be used to relate information as well as annotate it

The following RDF/XML shows how some information is related (XML is the eXtensible Markup Language)

Applying this technology to computational chemistry calculations will allow integration of the calculation and results with data about chemicals from other sources

Semantic Annotation of Data

<rdf:Description rdf:about="http://example.org/StuartChalk">
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  <foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
  <foaf:phone rdf:datatype="…#string">904-620-1938</foaf:phone>
</rdf:Description>
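To make the idea of "relating information" concrete, the following minimal Python sketch reads the RDF/XML fragment above and lists the statements it encodes as subject, predicate, object triples. It uses only the standard library rather than a dedicated RDF toolkit. The `rdf:RDF` wrapper and namespace declarations are added here so the fragment parses on its own, the `rdf:datatype` attribute is left out because the slide elides its full value, and the function name `extract_triples` is purely illustrative.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The slide's fragment wrapped in an rdf:RDF root with namespace declarations.
# The wrapper is an addition so the snippet parses; the rdf:datatype attribute
# is omitted because the slide elides its full value.
RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://example.org/StuartChalk">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
    <foaf:phone>904-620-1938</foaf:phone>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree stores qualified names as "{namespace}local".
            namespace, local = prop.tag[1:].split("}")
            predicate = namespace + local
            # A property points either at another resource or at a literal value.
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(RDF_XML)
for triple in triples:
    print(triple)
```

Each printed tuple is one statement about the resource http://example.org/StuartChalk: its type, who it knows, and its phone number. This sketch only handles the flat form shown on the slide; a full RDF parser would be needed for nested or abbreviated RDF/XML.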

Chemical Semantics is funded by DOE to create a web portal to collect, organize and make searchable the results output from computational chemistry (CC) calculations

This will be freely available and will accept output from all CC software packages

The intent is to capture calculation results and…

Software used to calculate the results

Input parameters used in the calculation

Methodology by which the calculation was done

Details of the molecular system studied

DOE SBIR Grant

The approach Chemical Semantics is taking is to

1. Add code to software packages to generate an XML file alongside the normal output file, or parse an existing output file (using a free application) and generate an XML file

2. Send the XML file into the web portal

3. Convert the XML file into RDF in Turtle format (TTL)

4. Finally, ingest TTL into a triplestore (Virtuoso)

All the data in Virtuoso can then be searched using SPARQL (SPARQL Protocol and RDF Query Language)

Data Transformations

Virtuoso: http://virtuoso.openlinksw.com/

SPARQL: http://www.w3.org/TR/sparql11-query/
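The XML-to-Turtle step in the list above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the converter used by the portal: the sample XML, its element names, the base IRI, and the `ex:` vocabulary are all hypothetical stand-ins and are not taken from the CSX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a CSX file; the element names are invented for
# illustration and are not taken from the CSX schema.
SAMPLE_XML = """\
<calculation id="calc1">
  <software>ExamplePackage</software>
  <method>HF</method>
  <basisSet>STO-3G</basisSet>
</calculation>
"""

BASE = "http://example.org/calc/"    # hypothetical base IRI for resources
VOCAB = "http://example.org/vocab#"  # hypothetical vocabulary namespace


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def xml_to_turtle(xml_text):
    """Map the root element to a subject and each child element to one literal."""
    root = ET.fromstring(xml_text)
    # The first statement gives the subject a type named after the root element.
    predicates = [f"a ex:{root.tag.capitalize()}"]
    for child in root:
        value = escape_literal((child.text or "").strip())
        predicates.append(f'ex:{child.tag} "{value}"')
    body = " ;\n    ".join(predicates)
    subject = f"<{BASE}{root.get('id')}>"
    return f"@prefix ex: <{VOCAB}> .\n\n{subject}\n    {body} ."


turtle = xml_to_turtle(SAMPLE_XML)
print(turtle)
```

The resulting text is what would be loaded into the triplestore in step 4. A real conversion would map elements onto an agreed ontology rather than reusing the XML tag names as predicates, which is done here only to keep the sketch short.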

Why XML?

Human readable (plain text - UTF-8)

Platform neutral

Archivable

Validatable

Why not use CML?

Inability to represent complex structures, e.g. residues

No standard way to add CC results

Intermediate XML File

A CSX file is a text-based file written in XML

It is a structured data container designed to hold CC result data and additional metadata

Version 0.x was developed by Neil Ostlund

Version 1.0 is the current stable release developed as part of Phase 1 of the SBIR grant (limited scope)

Version 2.0 is currently under development as part of Phase 2 of the SBIR grant

Common Standard for eXchange (CSX)

It is well known that the formats in which data are reported in CC output files are:

Highly variable (software specific)

Sometimes difficult to interpret

Standardization would:

Allow data from different packages to be more easily compared

Open up opportunities for software development to display and reuse data for different applications

This mirrors movement in the CC community toward a common driver base for CC software packages

CSX as a Standard Data Format

In order to describe the layout and allowed names of elements and attributes, and values for both, a schema document is available for the CSX specification

This can be used to help new users write valid CSX files (using XML editing applications such as XMLSpy and Oxygen XML Editor) and…

… validate existing CSX files using any of a number of XML validators (e.g. Xerces) …

… and understand the structure of the data, especially for less frequently calculated results

The CSX Schema
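As a rough illustration of what validation buys, the Python sketch below performs two much weaker checks than a schema validator such as Xerces: it confirms that a document is well-formed XML and that a hand-written list of required child elements is present. It is not XML Schema validation. The element names (`csx`, `publication`, `molecularSystem`, `results`) are hypothetical and merely echo the three kinds of information listed in the outline; the real names are defined by the CSX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical required children of the root element; the real CSX schema
# defines its own element names, which are not reproduced here.
REQUIRED_CHILDREN = ("publication", "molecularSystem", "results")


def check_document(xml_text):
    """Return a list of problems; an empty list means both checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Not well-formed XML, so no structural check is possible.
        return [f"not well-formed: {err}"]
    present = {child.tag for child in root}
    return [f"missing <{name}>" for name in REQUIRED_CHILDREN if name not in present]


good = "<csx><publication/><molecularSystem/><results/></csx>"
incomplete = "<csx><publication/><results/></csx>"
broken = "<csx><publication></csx>"

print(check_document(good))
print(check_document(incomplete))
print(check_document(broken))
```

A schema goes well beyond this: it also constrains attribute names, allowed values, element order, and nesting, which is why a schema document plus a standard validator is preferable to ad hoc checks of this kind.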

CSX Schema v1.0


CSX – Publication Information

CSX – Molecular System Information

CSX – Calculated Result Information

Work on CSX 2.0 is ongoing, expanding the format to cover multiple systems and sets of calculated results

Develop CSX focused website with converter functionality, libraries, and documentation

Engage CC software users/programmers to get involved with the project

Organize a community developer workshop over summer 2015

Publish version 2.0 of CSX in Fall 2015

Future Plans

CSX started out as a stepping stone to transfer information to the CS portal

Having a data standard for CC is an important development in and of itself

The CC community can do more with their data

Leverage XML tools to visualize, process, etc.

Compare results across CC packages

Validate results

Reference basis sets (https://bse.pnl.gov/)

Conclusion

[email protected]

Phone: 904-620-1938

Skype: stuartchalk

LinkedIn/SlideShare: https://www.linkedin.com/in/stuchalk

ORCID: http://orcid.org/0000-0002-0703-7776

ResearcherID: http://www.researcherid.com/rid/D-8577-2013

Questions?