Data Provenance and Annotation Dec. 2, 2003 Collaboratory for Multi-scale Chemical Science (CMCS): A Knowledge Grid/ Adaptive Informatics Infrastructure Jim Myers, Carmen Pancerella


Data Provenance and Annotation Dec. 2, 2003

Collaboratory for Multi-scale Chemical Science (CMCS):

A Knowledge Grid/ Adaptive Informatics Infrastructure

Jim Myers, Carmen Pancerella


CMCS – Enabling New Forms of Research and Communication

Distributed Research Groups

Chemical Databases

Rich Publication

Community Annotation

Informatics Analysis

Cross-scale Communication

Peer Data Review

Pedigree Analysis

Automated informatics

Automated monitoring/analysis


Adaptive Informatics Infrastructure

Infrastructure – a well-designed, scalable, reusable, flexible set of tools, middleware, and services

Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning

Adaptive – able to dynamically change to incorporate new knowledge and support new activities

› Low Barriers: many access points; storage of data in original formats with dynamic metadata extraction and translation

› Powerful: arbitrary formats (binary, ASCII, XML); integrated data, metadata, pedigree across internal and external tools

› Evolvable: schema can be changed/extended as needed; metadata, translations, viewers, portal, etc. can be dynamically configured


SAM Architecture

[Diagram components: Notebook Services, Semantic Services, Metadata Services, DataGrid, Database. Interface labels: "WebDAV, DASL, JMS, SAM Extensions" and "DAV, JDBC, GridFTP".]


SAM Metadata Services Layer

Jakarta Slide DAV server plus configurable:

› MIME Type Assignment: CMCS default is based on the dc:format tag within the .xml file

› Property Generation from binary/ASCII/XML files: 12 types, standard CMCS properties

› Resource Translation: 12+ Viewers/Translators for CMCS, including Interactive Applets

› Mapping to Data Store(s): NIST Kinetics DB

› JMS Events for access and changes: feeds events to the CMCS NED Email Notification daemon

› Authentication/Authorization model (single sign-on with CMCS Portal – username/password or GridCert)
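The MIME type assignment rule above (take the type from the dc:format tag inside an .xml file) can be sketched as follows. This is a minimal illustration, not the Jakarta Slide or SAM implementation: the function name, the fallback type, the sample record, and the use of the standard Dublin Core element namespace are all assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: dc: refers to the standard Dublin Core element namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"


def mime_type_from_dc_format(xml_text, default="application/octet-stream"):
    """Return the text of the first non-empty dc:format element, else a default.

    Hypothetical helper: the name and the fallback value are illustrative only.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML, so no dc:format tag can be read.
        return default
    for elem in root.iter("{%s}format" % DC_NS):
        if elem.text and elem.text.strip():
            return elem.text.strip()
    return default


# Made-up sample record; the MIME type string is a placeholder.
sample = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example record</dc:title>
  <dc:format>chemical/x-example</dc:format>
</record>"""

print(mime_type_from_dc_format(sample))
```

Files without a dc:format tag, or files that are not XML at all, fall back to the default type in this sketch; a configurable server could equally fall back to an extension-based lookup.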


Extensible Scientific Interchange Language (XSIL) / Binary Format Description (BFD) language

XSIL (Roy Williams, CalTech) - XML Encoding and Java code for scientific data

› Ints, floats, vectors, arrays, time series, …

› Can describe the byte structure of external data files/streams (encoding, byte order, …)

› Can have link(s) to external data

BFD (Alan Chappell, Jim Myers, PNNL) - XML Encoding and Java code for describing binary/ASCII files

› Bug fixes, removed ambiguities

› Parameterized logic (if, while, for, …)

› Parameterized Stream interface

Being used as input for the Grid Forum Data Format Description Language (DFDL) standard

<XSIL>
  <Param Name="date" Type="String" />
  <Param Name="Program Version" Type="float" />
  <Param Name="numColumns" Type="int" />
  <Array Name="data" Type="float">
    <Dim><XBFDvalue-of select="/XSIL/Param[@Name='numColumns']" /></Dim>
    <Dim>6</Dim>
  </Array>
  <Stream Encoding="Binary" Type="Remote" XBFDstreamnumber="0" />
</XSIL>
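The key idea in the description above is that one array dimension is not fixed but read from a parameter (numColumns) that itself comes from the stream. The sketch below illustrates that idea only; it is not the BFD or XSIL Java code. The byte layout is an assumption made for the example: a big-endian 4-byte integer numColumns followed directly by numColumns × 6 big-endian 4-byte floats, with the date and Program Version parameters left out for brevity.

```python
import struct


def decode_table(buf, inner_dim=6):
    """Decode a header-driven float table from a bytes object.

    Assumed layout (illustrative, not taken from the slides):
      bytes 0-3 : big-endian int32, the number of columns
      remainder : num_columns * inner_dim big-endian float32 values
    """
    (num_columns,) = struct.unpack_from(">i", buf, 0)
    if num_columns < 0:
        raise ValueError("negative column count in header")
    count = num_columns * inner_dim
    values = struct.unpack_from(">%df" % count, buf, 4)
    # Reshape the flat tuple into num_columns rows of inner_dim floats.
    return [list(values[i * inner_dim:(i + 1) * inner_dim])
            for i in range(num_columns)]


# Build a small made-up payload: 2 columns, 12 floats.
payload = struct.pack(">i", 2) + struct.pack(">12f", *[float(i) for i in range(12)])
table = decode_table(payload)
print(len(table), len(table[0]))
```

A description language such as BFD moves the layout knowledge (types, byte order, which parameter sizes which dimension) out of code like this and into the XML, so the same generic reader can handle many file formats.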


Demo


Example

[Diagram labels: Binary, XML, Properties; Translation of Chemistry Data; SAM-based Electronic Notebook (ELN); CMCS Portal/Pedigree Browser; Fortran Application; 'LocalDisk' DataGrid; connections labeled DAV, DAV+, JMS.]


CMCS Provenance: de-facto standards

Cmcs:hasinputs – workflow

Cmcs:hasoutputs – workflow

Sam:hastranslations – virtual workflow

Cmcs:ispartofproject – hierarchy

Eln:children – hierarchy

(Dav:collection) – hierarchy

Dcterms:references – scientific pedigree

Dcterms:isreferencedby – scientific pedigree

Eln:references – informal/private scientific pedigree
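Because these relationships are stored as ordinary properties on resources, a pedigree browser can be thought of as a graph walk over them. The following is a minimal sketch under stated assumptions: the resource paths and the in-memory dictionary are invented for illustration, the property names are written in lower case, and a real client would read the properties from the DAV server rather than from a dict.

```python
from collections import deque

# Hypothetical resources and property values, for illustration only.
RESOURCES = {
    "/proj/mechanism.dat": {"cmcs:hasinputs": ["/proj/rates.xml", "/proj/thermo.xml"]},
    "/proj/rates.xml": {"dcterms:references": ["/lit/paper-a"]},
    "/proj/thermo.xml": {"cmcs:hasinputs": ["/proj/raw-thermo.bin"]},
    "/proj/raw-thermo.bin": {},
    "/lit/paper-a": {},
}


def upstream(resources, start, props=("cmcs:hasinputs", "dcterms:references")):
    """Breadth-first list of everything `start` depends on via `props`."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for prop in props:
            for target in resources.get(current, {}).get(prop, []):
                if target not in visited:  # guards against cycles
                    visited.add(target)
                    order.append(target)
                    queue.append(target)
    return order


print(upstream(RESOURCES, "/proj/mechanism.dat"))
```

Passing a different `props` tuple selects which notion of provenance to follow, for example only the workflow links or only the scientific-pedigree links.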


Applications/Chemistry Services

Extensible Computational Chemistry Environment

› Export to CMCS with pedigree/metadata

Active Thermochemical Tables

› Portlet/web service using CMCS data store

RIOT – adaptive mechanism reduction

› Portlet/web service using CMCS data store – asynchronous invocation mechanism


Standard Protocol and API

WebDAV: An early web service (XML commands over HTTP)

› A widely adopted standard for metadata/data transport

› Put/Get data with arbitrary properties (dynamic)

› Properties can be discovered and accessed independently

› DASL, Versioning, Transactions, …

JSR 170: Java Content Repository

› An API for working with nodes with properties (versioning, queries, typing, notification, …)
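To make "XML commands over HTTP" concrete, the sketch below builds the XML body of a WebDAV PROPPATCH request that sets arbitrary properties on a resource. The propertyupdate/set/prop structure and the "DAV:" namespace come from the WebDAV specification; the property namespace, prefix, property names, and values are hypothetical, and no request is actually sent.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"                            # WebDAV namespace
EXAMPLE_NS = "http://example.org/cmcs#"    # hypothetical property namespace


def proppatch_body(properties, ns=EXAMPLE_NS, prefix="ex"):
    """Build a PROPPATCH request body that sets the given properties.

    `properties` maps local property names to values; all names used
    in the example call below are made up for illustration.
    """
    ET.register_namespace("D", DAV_NS)
    ET.register_namespace(prefix, ns)
    update = ET.Element("{%s}propertyupdate" % DAV_NS)
    set_elem = ET.SubElement(update, "{%s}set" % DAV_NS)
    prop = ET.SubElement(set_elem, "{%s}prop" % DAV_NS)
    for name, value in properties.items():
        ET.SubElement(prop, "{%s}%s" % (ns, name)).text = str(value)
    return ET.tostring(update, encoding="unicode")


body = proppatch_body({"species": "OH", "temperature": 298.15})
print(body)
```

The resulting string would be sent as the payload of an HTTP request with method PROPPATCH to the resource URL; a PROPFIND request can later retrieve the same properties without fetching the data itself.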


Path Forward

Pilot groups doing “real” chemistry

Exploring new practice

› Peer-Review / Endorsement Mechanisms/Interfaces: digital publication, third-party annotation

› Activity Reporting tools

› Scoping Searches, Notifications: based on user-defined notion of provenance/hierarchy

› Notebook Views of Other Hierarchies: e.g. a notebook sharing a computational chemistry project hierarchy

› Validation of Chemical networks: e.g. Active Thermo-chemical Tables

› Workflow by Example…

› Informatics Data File Assembly Tool


URLs/Team Members

http://cmcs.org/

http://www.scidac.org/SAM/

CMCS Team Members: Thomas C. Allison, Kaizar Amin, Sandra Bittner, Brett Didier, Michael Frenklach, William H. Green, Jr., Yen-Ling Ho, John Hewson, Wendy Koegler, Carina Lansing, David Leahy, Michael Lee, Renata McCoy, Michael Minkoff, James D. Myers, Sandeep Nijsure, Gregor von Laszewski, David Montoya, Carmen Pancerella, Reinhardt Pinzon, William Pitz, Larry Rahn, Branko Ruscic, Karen Schuchardt, Eric Stephan, Al Wagner, Baoshan Wang, Theresa Windus, Lili Xu, Christine Yang