December 26, 2012 1
Climate Science for a Sustainable Energy Future (CSSEF) Provenance
Eric Stephan
Pacific Northwest National Laboratory, Richland, WA
Provenance Definitions
• Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.
  (Source: https://dvcs.w3.org/hg/prov/raw-file/tip/presentations/wg-overview/overview/index.html)
• Metadata used to describe the origin of the data and any of its modifications.
• A log of historical events describing the origin of data and any subsequent changes.
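To make the definitions above concrete, here is a minimal sketch of a provenance record held as subject-predicate-object triples. The prov: property names come from the W3C PROV-O vocabulary; every identifier under http://example.org/ (the datasets, the activity, the person) is a hypothetical placeholder and not part of CSSEF.

```python
# Minimal sketch: a provenance record as (subject, predicate, object) triples.
# PROV-O property names are real; all EX identifiers are hypothetical.
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"

triples = [
    (EX + "dataset_v2", PROV + "wasDerivedFrom", EX + "dataset_v1"),
    (EX + "dataset_v1", PROV + "wasDerivedFrom", EX + "raw_obs"),
    (EX + "dataset_v2", PROV + "wasGeneratedBy", EX + "regrid_run"),
    (EX + "regrid_run", PROV + "used", EX + "dataset_v1"),
    (EX + "regrid_run", PROV + "wasAssociatedWith", EX + "alice"),
    (EX + "dataset_v2", PROV + "wasAttributedTo", EX + "alice"),
]


def lineage(entity, triples):
    """Follow prov:wasDerivedFrom links from an entity back to its sources."""
    chain = []
    seen = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == PROV + "wasDerivedFrom" and o not in seen:
                seen.add(o)
                chain.append(o)
                frontier.append(o)
    return chain


# Usage: list everything dataset_v2 was derived from, nearest source first.
origins = lineage(EX + "dataset_v2", triples)
```

The same triples answer both kinds of question in the definitions: where the data came from (wasDerivedFrom) and what changed it (wasGeneratedBy, used).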
Popular Provenance Vocabularies
See Also: W3C Incubator Group, http://www.w3.org/2005/Incubator/prov/wiki/W3C_Provenance_Incubator_Group_Wiki
Open Provenance Model
The Provenance Ontology (PROV-O)
Proof Markup Language Ontology
Dublin Core Provenance Task Force
The Systems Science Challenge
• Studying complex systems typically has the following characteristics:
  - Interdisciplinary studies involve multiple stakeholders
  - Leverage multiple tools, algorithms, data products, and sensors
  - Reliant on highly iterative and repetitive techniques
  - Steps are difficult to document and are often committed to memory or notes
• Sharing complex systems data between collaborators has the following inherent problems:
  - To establish data confidence, scientists accessing data (consumers) need to know data origin and modification history (data provenance).
  - Scientists producing the data need a consistent means to convey data provenance to targeted scientific communities:
    - The data provenance needs to be diverse enough to support any data.
    - It must also be based on community standards to support cross-reference searches.
Example: Motivating User Questions About the CSSEFARMBE Diagnostics Dataset
CAM Modeler:
  - How do CAM output variables map to the CSSEFARMBE variables?
  - What additional ancillary information is available about this dataset?
Atmosphere Scientist:
  - How did both CSSEFARMBE and ARMBE originate?
The Knowledge Gap: CSSEF Users Needing Additional Answers from Data Producers
[Figure: the CSSEFARMBE Developers are linked to six native sources (Test NCL Code, ARMBE Header, CSSEF ARMBE Header, Tech Report, CF Terms, CAM Web Page) by edges labeled "wrote", "read", and "compared". The CAM Modeler and Atmosphere Scientist questions from the previous slide are shown alongside.]
Goals of CSSEF Provenance Environment (ProvEn) Services
• Identify future user communities that will need provenance while the data is still being generated by the scientists producing it
  - Knowledge products (e.g., reports, archivable provenance records)
• Create consumer-oriented provenance products by:
  - Capturing historical information from any native source necessary to describe the origin of the dataset.
  - Retaining, for user reference, a copy of the native source familiar to the domain community.
Goals of CSSEF Provenance Environment (ProvEn) Services
• Store this information in a cross-referenced knowledge model by mapping domain ontology to foundational ontology
  - Domain ontologies are diverse and subject to constant changes defined by the concepts extracted from native sources.
  - Foundational ontologies are stable and seldom change.
• Use composite knowledge model to provide finished products to different kinds of consumers
  - Stability implies that many methodologies, tools, and services are available to leverage.

Foundational Ontology                     Cross-Reference Capability
W3C Provenance Ontology (PROV-O)          Core ontology describing data origin
Dublin Core Terms                         Data citations and software
Friend of a Friend (FOAF)                 Description of scientist and collaborators
(Future) Proof Markup Language 3.0        Description of justification and trust
(Future) Dublin Core to PROV-O Mapping    Support integration of DC provenance and PROV-O
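One way to picture mapping a domain ontology onto a foundational ontology is sketched below. It is an illustration under stated assumptions, not the ProvEn implementation: the atm: terms and the alignment table are hypothetical, while prov:wasDerivedFrom and dcterms:references are real PROV-O and Dublin Core terms. The point is that a consumer can search with the stable foundational term and still find facts recorded in changeable domain vocabulary.

```python
# Hypothetical alignment: domain property -> foundational property.
# Only the prov: and dcterms: names are real vocabulary terms.
alignment = {
    "atm:mappedFromCamVariable": "prov:wasDerivedFrom",
    "atm:documentedIn": "dcterms:references",
}

# Facts recorded with (hypothetical) domain ontology terms.
domain_facts = [
    ("atm:cssefarmbe_var", "atm:mappedFromCamVariable", "atm:cam_var"),
    ("atm:cssefarmbe_dataset", "atm:documentedIn", "atm:tech_report"),
]


def with_foundational(facts, alignment):
    """Return the facts plus the foundational triples implied by the alignment."""
    inferred = [(s, alignment[p], o) for s, p, o in facts if p in alignment]
    return facts + inferred


def query(facts, predicate):
    """Return (subject, object) pairs for every triple using the predicate."""
    return [(s, o) for s, p, o in facts if p == predicate]


# Usage: search by the foundational term rather than the domain term.
model = with_foundational(domain_facts, alignment)
derived = query(model, "prov:wasDerivedFrom")
```

If the domain vocabulary changes, only the alignment table needs updating; queries written against the foundational terms keep working.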
Identifying a New Product with Native Sources, Domain Concepts, and Terms for a Dataset
[Figure: the native sources (Test NCL Code, ARMBE Header, CSSEF ARMBE Header, Tech Report, CF Terms, CAM Web Page) annotated with two groups of concepts: Observational Data Origin Concepts, and Identified Variable Mapping Concepts and Terms.]
Creating and Maintaining Domain Ontologies (Knowledge Engineer)
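[Figure: the Atmosphere Diagnostics dataset origin/mapping terms and concepts feed the Atmosphere Domain Ontology (Build Ontology). The domain ontology is aligned with the Foundational Ontologies (Align Ontologies) to form the Aligned Knowledge Model for Atmosphere, which connects to ProvEn Services through steps labeled Register and Add.]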
Creating a New Product by Populating ProvEn Services with CSSEFARMBE Dataset Native Sources
[Figure: the CSSEFARMBE Developers contribute native sources (Test NCL Code, ARMBE Header, CSSEF ARMBE Header, Tech Report, CF Terms, CAM Web Page) to ProvEn Services. Native source concept extraction maps the native provenance to the Atmosphere Domain Ontology within the Aligned Knowledge Model for Atmosphere, which rests on the Foundational Ontologies. ProvEn Services also holds a copy of the corresponding native sources and native source references. The result is CSSEFARMBE knowledge relevant to the CAM Modeler and the Atmosphere Scientist.]
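The populating step on this slide can be sketched roughly as follows: extract concepts from a native source, map them to domain ontology terms, and keep a verbatim copy of the source for reference. This is only an illustration. The "key: value" header layout, the KEY_TO_TERM table, the atm: terms, and the ingest function are all assumptions made for the example, not the ProvEn API or the real ARMBE header format.

```python
import hashlib

# Hypothetical mapping from native header keys to domain ontology terms.
KEY_TO_TERM = {
    "source": "atm:dataOrigin",
    "history": "atm:processingHistory",
}


def ingest(name, text):
    """Extract mapped 'key: value' pairs and keep a verbatim copy of the source."""
    triples = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        term = KEY_TO_TERM.get(key.strip().lower())
        if sep and term:
            # Only keys with a known domain term become provenance triples.
            triples.append((name, term, value.strip()))
    return {
        "native_copy": text,  # retained unchanged for the domain community
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "triples": triples,
    }


# Usage with a made-up header; unmapped keys stay only in the native copy.
header = "source: station observations\nhistory: regridded\ncomment: ignored"
record = ingest("example_header", header)
```

Storing a checksum next to the copy lets a consumer confirm that the retained native source is the one the triples were extracted from.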
Producing ProvEn Services Product: CSSEFARMBE Dataset Origin Report
[Figure: the ProvEn Services Store, shown with a Glassfish Server, holds the Foundational Ontologies, the Aligned Knowledge Model for Atmosphere, and the native provenance mapped to the Atmosphere Domain Ontology. Standard vocabulary cross-reference searching and reasoning over the store addresses the CAM Modeler's question ("What additional ancillary information is available about this dataset?") and the Atmosphere Scientist's question ("How did both CSSEFARMBE and ARMBE originate?").]
ProvEn Services Architecture
[Figure: ProvEn Services architecture.
  Components: Sesame Store; AliBaba Object to RDF API; Searching and Inferencing API; ProvEn (Jersey) REST Services.
  Operations: Store Native Provenance; Query and Cross-Reference Provenance.
  Packaging: a portable jar file, deployed to an ESGF Node, a local compute cluster, or UVCDAT.]