Data Provenance and Annotation Dec. 2, 2003
Collaboratory for Multi-scale Chemical Science (CMCS):
A Knowledge Grid / Adaptive Informatics Infrastructure
Jim Myers, Carmen Pancerella
CMCS – Enabling New Forms of Research and Communication
Distributed Research Groups
Chemical Databases
Rich Publication
Community Annotation
Informatics Analysis
Cross-scale Communication
Peer Data Review
Pedigree Analysis
Automated informatics
Automated monitoring/analysis
Adaptive Informatics Infrastructure
Infrastructure – a well-designed, scalable, reusable, flexible set of tools, middleware, and services
Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning
Adaptive – able to dynamically change to incorporate new knowledge and support new activities
› Low Barriers
    Many access points
    Storage of data in original formats with dynamic metadata extraction and translation
› Powerful
    Arbitrary formats (binary, ASCII, XML)
    Integrated data, metadata, pedigree across internal and external tools
› Evolvable
    Schema can be changed/extended as needed
    Metadata, translations, viewers, portal, etc. can be dynamically configured
SAM Architecture
Notebook Services
Semantic Services
Metadata Services
DataGrid
Database
WebDAV, DASL, JMS, SAM Extensions
DAV, JDBC, GridFTP
SAM Metadata Services Layer
Jakarta Slide DAV server plus configurable:
› MIME Type Assignment
    CMCS default: Based on dc:format tag within .xml file
› Property Generation from binary/ASCII/XML files
    12 types standard CMCS properties
› Resource Translation
    12+ Viewers/Translators for CMCS including Interactive Applets
› Mapping to Data Store(s)
    NIST Kinetics DB
› JMS Events for access and changes
    Feeds events to CMCS NED Email Notification daemon
› Authentication/Authorization model (single sign-on with CMCS Portal – username/password or GridCert)
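The default MIME type rule above (use the dc:format tag inside an .xml file) can be illustrated with a minimal Python sketch. This is not the SAM/Slide implementation: the function name `mime_type_from_dc_format`, the fallback type, and the sample record with its `chemical/x-example` value are all hypothetical, and the sketch assumes the tag lives in the standard Dublin Core element namespace.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace (assumed to be what "dc:" refers to).
DC_NS = "http://purl.org/dc/elements/1.1/"


def mime_type_from_dc_format(xml_text, default="application/octet-stream"):
    """Return the text of the first non-empty dc:format element, else `default`.

    Hypothetical helper for illustration only.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return default  # not well-formed XML: fall back to the default type
    # iter() walks the whole tree, including the root element itself.
    for elem in root.iter("{%s}format" % DC_NS):
        if elem.text and elem.text.strip():
            return elem.text.strip()
    return default


# Hypothetical sample record; the MIME type value is made up.
sample = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example record</dc:title>
  <dc:format>chemical/x-example</dc:format>
</record>"""

print(mime_type_from_dc_format(sample))
```

Files that are not XML, or that carry no dc:format tag, fall through to the default type, which mirrors the idea that the assignment rule is configurable rather than fixed.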
Extensible Scientific Interchange Language (XSIL) / Binary Format Description (BFD) language
XSIL (Roy Williams, Caltech) - XML Encoding and Java code for scientific data
› Ints, floats, vectors, arrays, time series, …
› Can describe the byte structure of external data files/streams (encoding, byte order, …)
› Can have link(s) to external data
BFD (Alan Chappell, Jim Myers, PNNL) - XML Encoding and Java code for describing binary/ASCII files
› Bug fixes, removed ambiguities
› Parameterized logic (if, while, for, …)
› Parameterized Stream interface
Being used as input for Grid Forum Data Format Description Language (DFDL) standard
<XSIL>
  <Param Name="date" Type="String" />
  <Param Name="Program Version" Type="float" />
  <Param Name="numColumns" Type="int" />
  <Array Name="data" Type="float">
    <Dim> <XBFDvalue-of select="/XSIL/Param[@Name='numColumns']" /> </Dim>
    <Dim>6</Dim>
  </Array>
  <Stream Encoding="Binary" Type="Remote" XBFDstreamnumber="0" />
</XSIL>
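To make the description above concrete, here is a rough Python sketch of what a reader driven by such a description has to do for the "data" array: take the two dimensions (the value of numColumns, then 6) and decode that many floats from the binary stream. The sketch does not parse XSIL/BFD itself. It assumes numColumns is already known, and the 4-byte big-endian float encoding and row-major layout are assumptions, since the slide does not state them; `read_float_array` is a hypothetical name.

```python
import struct


def read_float_array(stream_bytes, dims, byte_order=">"):
    """Decode 4-byte floats into nested lists shaped by `dims` (row-major).

    Hypothetical helper; float width and byte order are assumptions.
    """
    count = 1
    for d in dims:
        count *= d
    needed = 4 * count
    if len(stream_bytes) < needed:
        raise ValueError(
            "stream too short: need %d bytes, got %d" % (needed, len(stream_bytes))
        )
    flat = list(struct.unpack("%s%df" % (byte_order, count), stream_bytes[:needed]))
    # Fold the flat list into nested lists, innermost dimension first.
    for d in reversed(dims[1:]):
        flat = [flat[i:i + d] for i in range(0, len(flat), d)]
    return flat


# Stand-in for the remote binary stream: numColumns = 2, so dims are [2, 6].
num_columns = 2
values = [0.5 * i for i in range(num_columns * 6)]
stream = struct.pack(">%df" % len(values), *values)

table = read_float_array(stream, [num_columns, 6])
print(table)
```

The point of the parameterized description is visible in the call: the first dimension is not hard-coded but comes from a parameter value, so the same description can cover files of different sizes.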
Demo
Example
Binary XML Properties
Translation of Chemistry Data
SAM-based Electronic Notebook
CMCS Portal/Pedigree Browser
Fortran Application
‘LocalDisk’ DataGrid
DAV
DAV+
JMS
ELN
CMCS Provenance: de facto standards
cmcs:hasinputs – workflow
cmcs:hasoutputs – workflow
sam:hastranslations – virtual workflow
cmcs:ispartofproject – hierarchy
eln:children – hierarchy
(DAV:collection) – hierarchy
dcterms:references – scientific pedigree
dcterms:isreferencedby – scientific pedigree
eln:references – informal/private scientific pedigree
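Because these relationships are stored as properties on resources, a pedigree view amounts to following one property transitively. The Python sketch below illustrates that idea on an in-memory dictionary. The resource paths and the `upstream` function are hypothetical, the property name is written with a lowercase prefix as an assumption, and a real client would fetch the properties from the data store rather than from a dict.

```python
from collections import deque

# Hypothetical resources, each with a dict of provenance properties.
resources = {
    "/project/mechanism.xml": {
        "cmcs:hasinputs": ["/project/rates.xml", "/project/thermo.xml"],
    },
    "/project/rates.xml": {"cmcs:hasinputs": ["/project/raw/kinetics.dat"]},
    "/project/thermo.xml": {"cmcs:hasinputs": ["/project/raw/kinetics.dat"]},
    "/project/raw/kinetics.dat": {},
}


def upstream(resources, start, prop="cmcs:hasinputs"):
    """Return every resource reachable from `start` via `prop`, breadth-first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in resources.get(current, {}).get(prop, []):
            if target not in seen:  # also guards against cycles
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


for path in upstream(resources, "/project/mechanism.xml"):
    print(path)
```

Passing a different property name (for example a reference or hierarchy property) to the same function gives a different view over the same resources, which is one way to read the grouping of properties into workflow, hierarchy, and scientific pedigree.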
Applications/Chemistry Services
Extensible Computational Chemistry Environment
› Export to CMCS with pedigree/metadata
Active Thermochemical Tables
› Portlet/web service using CMCS data store
RIOT – adaptive mechanism reduction
› Portlet/web service using CMCS data store – asynchronous invocation mechanism
Standard Protocol and API
WebDAV: An early web service (XML commands over HTTP)
› A widely adopted standard for metadata/data transport
› Put/Get data with arbitrary properties (dynamic)
› Properties can be discovered and accessed independently
› DASL, Versioning, Transactions, …
JSR 170: Java Content Repository
› An API for working with nodes with properties (versioning, queries, typing, notification, …)
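As an illustration of "XML commands over HTTP", the Python sketch below builds the XML body of a WebDAV PROPPATCH request that sets an arbitrary property on a resource. Only the body is constructed; nothing is sent. The `DAV:` namespace and the propertyupdate/set/prop structure come from the WebDAV standard, while the property namespace URI, the property value, and the `build_proppatch` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"  # namespace defined by the WebDAV standard
CMCS_NS = "http://example.org/cmcs#"  # hypothetical namespace for custom properties


def build_proppatch(properties, namespace=CMCS_NS):
    """Build a PROPPATCH request body that sets each name/value pair.

    Hypothetical helper for illustration only.
    """
    root = ET.Element("{%s}propertyupdate" % DAV_NS)
    set_el = ET.SubElement(root, "{%s}set" % DAV_NS)
    prop_el = ET.SubElement(set_el, "{%s}prop" % DAV_NS)
    for name, value in properties.items():
        child = ET.SubElement(prop_el, "{%s}%s" % (namespace, name))
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical property value; the property name echoes the provenance slide.
body = build_proppatch({"ispartofproject": "/projects/example"})
print(body)
```

A client would send this text as the entity of a PROPPATCH request to the resource's URL, and a matching PROPFIND request can later read the property back without downloading the data itself.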
Path Forward
Pilot groups doing “real” chemistry
Exploring new practice
› Peer-Review / Endorsement Mechanisms/Interfaces
    Digital publication, third-party annotation
› Activity Reporting tools
› Scoping Searches, Notifications
    Based on user-defined notion of provenance/hierarchy
› Notebook Views of Other Hierarchies
    E.g. a notebook sharing a computational chemistry project hierarchy
› Validation of Chemical networks
    E.g. Active Thermochemical Tables
› Workflow by Example…
› Informatics Data File Assembly Tool
URLs/Team Members
http://cmcs.org/
http://www.scidac.org/SAM/
CMCS Team Members: Thomas C. Allison, Kaizar Amin, Sandra Bittner, Brett Didier, Michael Frenklach, William H. Green, Jr., Yen-Ling Ho, John Hewson, Wendy Koegler, Carina Lansing, David Leahy, Michael Lee, Renata McCoy, Michael Minkoff, James D. Myers, Sandeep Nijsure, Gregor von Laszewski, David Montoya, Carmen Pancerella, Reinhardt Pinzon, William Pitz, Larry Rahn, Branko Ruscic, Karen Schuchardt, Eric Stephan, Al Wagner, Baoshan Wang, Theresa Windus, Lili Xu, Christine Yang