A Standard Data Format for Computational Chemistry: CSX
Stuart J. Chalk1,2, Neil Ostlund1, Mirek Sopek1, Bing Wang1
1) Chemical Semantics Inc., Gainesville FL
2) Department of Chemistry, University of North Florida
249th ACS Meeting, Denver, CO – March 2015
Semantic Annotation of Data
Current DOE Project
Data Transformations
Common Standard for eXchange (CSX)
CSX as a Standard Data Format
The CSX Schema
CSX - Publishing Information
CSX - Molecular System Information
CSX - Calculated Result Information
Future Plans
Conclusion
Outline
Create a way to ‘teach’ computers what information means – contextualize the data
Example
What is this? 904-620-1938
A computer just sees it as…
… a string
By using an appropriate semantic definition in RDF (the Resource Description Framework) we can identify to the computer that the text is a phone number (using the Friend of a Friend (FOAF) specification), i.e.
Semantic Annotation of Data
RDF Specification http://www.w3.org/RDF/
FOAF Specification http://xmlns.com/foaf/spec/
<foaf:phone rdf:datatype="#string">904-620-1938</foaf:phone>
RDF can be used to relate information as well as annotate it
The following RDF/XML shows how some information is related (XML is the eXtensible Markup Language)
Applying this technology to computational chemistry calculations will allow integration of the calculation and results with data about chemicals from other sources
Semantic Annotation of Data
<rdf:Description rdf:about="http://example.org/StuartChalk">
  <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
  <foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
  <foaf:phone rdf:datatype="…#string">904-620-1938</foaf:phone>
</rdf:Description>
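To make the idea of "relating" information concrete, the RDF/XML above can be read as a set of subject/predicate/object triples. The following is a minimal sketch using only the Python standard library. It assumes the snippet is wrapped in an rdf:RDF root element carrying the RDF and FOAF namespace declarations, and it leaves out the rdf:datatype attribute because its full IRI is elided on the slide. The function name `triples` is made up for illustration, and a real project would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Copy of the slide's snippet, wrapped in an rdf:RDF root (an assumption) so
# that the rdf: and foaf: prefixes are declared. The datatype attribute is
# omitted because its IRI is elided in the original.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}">
  <rdf:Description rdf:about="http://example.org/StuartChalk">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <foaf:knows rdf:resource="http://example.org/NeilOstlund"/>
    <foaf:phone>904-620-1938</foaf:phone>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml):
    """Return (subject, predicate, object) tuples for simple rdf:Description blocks."""
    out = []
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            # ElementTree stores tags as "{namespace}local"; join them into one IRI.
            predicate = prop.tag[1:].replace("}", "")
            # A property either points at another resource or carries literal text.
            obj = prop.get(f"{{{RDF}}}resource") or (prop.text or "").strip()
            out.append((subject, predicate, obj))
    return out


TRIPLES = triples(DOC)
for t in TRIPLES:
    print(t)
```

Each printed tuple is one statement: the person's type, who they know, and their phone number, with the predicate IRI telling the computer what the value means.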
Chemical Semantics is funded by DOE to create a web portal to collect, organize and make searchable the results output from computational chemistry (CC) calculations
This will be freely available and will accept output from all CC software packages
The intent is to capture calculation results and…
Software used to calculate the results
Input parameters used in the calculation
Methodology by which the calculation was done
Details of the molecular system studied
DOE SBIR Grant
The approach Chemical Semantics is taking is to
1. Add code to software packages to generate an XML file alongside the normal output file, or parse an existing output file (using a free application) and generate an XML file
2. Send the XML file into the web portal
3. Convert the XML file into RDF in Turtle format (TTL)
4. Finally, ingest TTL into a triplestore (Virtuoso)
All the data in Virtuoso can then be searched using SPARQL (SPARQL Protocol and RDF Query Language)
Data Transformations
Virtuoso http://virtuoso.openlinksw.com/
SPARQL http://www.w3.org/TR/sparql11-query/
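Step 3 of the approach above (XML to Turtle) can be sketched roughly as follows. This is only an illustration, not the portal's actual converter. The element names in the sample record, the `http://example.org/` namespaces, and the function `xml_to_turtle` are all hypothetical and are not taken from the CSX schema. Sending the file to the portal and loading it into Virtuoso (steps 2 and 4) are not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical calculation record: these element names are NOT from the CSX schema.
XML_DOC = """<calculation id="calc1">
  <software>ExamplePackage</software>
  <method>HF</method>
  <basisSet>STO-3G</basisSet>
</calculation>"""

BASE = "http://example.org/calc/"    # assumed namespace for record identifiers
VOCAB = "http://example.org/vocab#"  # assumed vocabulary for property names


def xml_to_turtle(xml_text):
    """Turn each child element of the record into one Turtle predicate/object pair."""
    root = ET.fromstring(xml_text)
    pairs = []
    for child in root:
        # Escape backslashes and double quotes so the literal stays valid Turtle.
        value = (child.text or "").strip().replace("\\", "\\\\").replace('"', '\\"')
        pairs.append(f'    ex:{child.tag} "{value}"')
    lines = [
        f"@prefix ex: <{VOCAB}> .",
        "",
        f"<{BASE}{root.get('id')}>",
        " ;\n".join(pairs) + " .",
    ]
    return "\n".join(lines)


TURTLE = xml_to_turtle(XML_DOC)
print(TURTLE)
```

The output is one subject (the calculation) followed by a predicate/object pair per child element, which is the shape a triplestore expects to ingest.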
Why XML?
Human readable (plain text - UTF-8)
Platform neutral
Archivable
Validatable
Why not use CML?
Inability to represent complex structures e.g. residues
No standard way to add CC results
Intermediate XML File
A CSX file is a text based file written in XML
It is a structured data container designed to hold CC result data and additional metadata
Version 0.x was developed by Neil Ostlund
Version 1.0 is the current stable release developed as part of Phase 1 of the SBIR grant (limited scope)
Version 2.0 is currently under development as part of Phase 2 of the SBIR grant
Common Standard for eXchange (CSX)
It is well known that the formats in which data are reported in CC output files are:
Highly variable (software specific)
Sometimes difficult to interpret
Standardization would:
Allow data from different packages to be more easily compared
Open up opportunities for software development to display and reuse data for different applications
This mirrors movement in the CC community toward a common driver base for CC software packages
CSX as a Standard Data Format
In order to describe the layout and allowed names of elements and attributes, and values for both, a schema document is available for the CSX specification
This can be used to help new users write valid CSX files (using XML editing applications such as XML Spy and oxygenXML) and…
… validate existing CSX files using any of a number of XML validators (e.g. Xerces) …
… and understand the structure of the data especially for less frequently calculated results
The CSX Schema
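Full schema validation needs an XSD-aware validator such as Xerces, which the Python standard library does not provide. As a rough illustration of the kind of rule a schema enforces (which element names are allowed), the sketch below checks that a document is well-formed and flags any element outside an allowed set. The allowed names and the `check` function are hypothetical and do not come from the CSX schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowed element names; the real CSX schema defines its own.
ALLOWED = {"calculation", "software", "method", "basisSet"}


def check(xml_text, allowed=ALLOWED):
    """Return a list of problems: a parse error, or element names outside `allowed`."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    return [f"unexpected element: {el.tag}" for el in root.iter() if el.tag not in allowed]


GOOD = "<calculation><software>ExamplePackage</software><method>HF</method></calculation>"
BAD_NAME = "<calculation><metod>HF</metod></calculation>"  # misspelled element name
BROKEN = "<calculation><method>HF</calculation>"           # missing closing tag

for doc in (GOOD, BAD_NAME, BROKEN):
    print(check(doc) or "ok")
```

A real schema also constrains attributes, nesting, ordering and value types, so this only hints at what a validator reports.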
Work on CSX 2.0 is ongoing – expand to multiple systems and sets of calculated results
Develop CSX focused website with converter functionality, libraries, and documentation
Engage CC software users/programmers to get involved with the project
Organize a community developer workshop over summer 2015
Publish version 2.0 of CSX in Fall 2015
Future Plans
CSX started out as a stepping stone to transfer information to the CS portal
Having a data standard for CC is an important development in and of itself
The CC community can do more with their data
Leverage XML tools to visualize, process, etc.
Compare results across CC packages
Validate results
Reference basis sets (https://bse.pnl.gov/)
Conclusion
Phone: 904-620-1938
Skype: stuartchalk
LinkedIn/SlideShare: https://www.linkedin.com/in/stuchalk
ORCID: http://orcid.org/0000-0002-0703-7776
ResearcherID: http://www.researcherid.com/rid/D-8577-2013
Questions?