
Towards a Provenance Architecture Karen Schuchardt PNNL


Page 1: Towards a Provenance Architecture Karen Schuchardt PNNL

Towards a Provenance Architecture

Karen Schuchardt, PNNL

Page 2: Towards a Provenance Architecture Karen Schuchardt PNNL

Kepler Provenance Meeting, Jan 05

Outline

Past and Present Work
Use Cases
Thoughts on Workflow Provenance and Architectures

Page 3: Towards a Provenance Architecture Karen Schuchardt PNNL


Past and Present Provenance Work

Ecce Chemistry Environment (mid 90s)
Electronic Laboratory Notebooks (late 90s)
Collaboratory for Multi-Scale Chemical Science (CMCS) (2000)
Scientific Annotation Middleware (2000)
Towards a Semantic Data Grid for Systems Science (2004-2006)

Page 4: Towards a Provenance Architecture Karen Schuchardt PNNL


Ecce Chemistry Environment

Chemistry-based calculation workflow
Provenance captured as user performs actions
  W's (who, what, when)
  Job submission status info
  Relationships (XLinks) between calculations, outputs, inputs, etc.
  Linkbase for molecular dynamics multi-step processes
WebDAV-based server captures all inputs, outputs, and metadata
Provenance used to
  provide at-a-glance summary of work performed
  duplicate and rerun
  search
  bind rules based on types and relationships
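The capture model above (the W's plus typed links between calculations, inputs, and outputs) can be sketched roughly as follows. This is only an illustration: the class names, the `has_input`/`has_output` relation names, the user id, and the file paths are all hypothetical, and an in-memory list stands in for the WebDAV server that Ecce actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class ActionRecord:
    """One captured user action: the W's plus links to related resources."""
    who: str
    what: str
    when: datetime
    # Relation name -> resource paths; a stand-in for XLink relationships.
    links: Dict[str, List[str]] = field(default_factory=dict)


class CaptureLog:
    """Hypothetical in-memory log (not Ecce's WebDAV-backed storage)."""

    def __init__(self) -> None:
        self.records: List[ActionRecord] = []

    def record(self, who: str, what: str, **links: List[str]) -> ActionRecord:
        """Capture an action at the moment the user performs it."""
        rec = ActionRecord(who, what, datetime.now(timezone.utc), dict(links))
        self.records.append(rec)
        return rec

    def summary(self) -> List[str]:
        """At-a-glance summary of work performed, in capture order."""
        return [f"{r.when:%Y-%m-%d} {r.who}: {r.what}" for r in self.records]

    def find_by_link(self, relation: str, target: str) -> List[ActionRecord]:
        """Search for actions whose `relation` link points at `target`."""
        return [r for r in self.records if target in r.links.get(relation, [])]


log = CaptureLog()
log.record("user1", "geometry optimization",
           has_input=["/calc/1/input.txt"], has_output=["/calc/1/output.txt"])
log.record("user1", "frequency calculation",
           has_input=["/calc/1/output.txt"])

# Which later steps consumed the output of the first calculation?
consumers = log.find_by_link("has_input", "/calc/1/output.txt")
```

Because each record keeps its links, the same data supports the summary, the search, and (by re-reading the inputs) a duplicate-and-rerun operation.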

Page 5: Towards a Provenance Architecture Karen Schuchardt PNNL


Electronic Laboratory Notebooks

Hierarchical, chronological chapters/pages/notes
File upload, sketch, text, equations, forms, image capture, …
Add/View/Search notes
Records functionality:
  Non-repudiation - digital signatures and timestamps
  Persistence/completeness - write-once/no deletions/audit trail
  Standardized lifecycle - signing/witnessing policies, archiving, retention schedules, …
Now based on WebDAV
Provenance
  Structure of notebook
  Records data
  Mimetype-based functionality

Page 6: Towards a Provenance Architecture Karen Schuchardt PNNL


Collaboratory for Multi-Scale Chemical Science (CMCS)

Dublin Core for basic pedigree: title, creator, dates, publisher, is-referenced-by, references, replaces, is-replaced-by, has-version
  Dublin Core Element Set and Qualified Dublin Core
  Both XML and RDF to encode metadata values
  Use of XLink to express values of relationships
CMCS properties for chemical science to enable searching: species name, CAS, chemical properties, and chemical formula.
CMCS properties for defining scientific data: has-inputs, has-outputs, and is-part-of-project.
CMCS properties for scientific publication and peer review annotations: is-sanctioned-by.
Flexible infrastructure for addition of new metadata. As new metadata is added to the infrastructure, current apps will not break!
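As a rough illustration of the metadata mix described above, the sketch below builds one XML record that combines Dublin Core pedigree, a CMCS-style chemistry property, and relationships expressed as XLink references. The Dublin Core and XLink namespace URIs are the standard ones; the `cmcs` namespace URI, the `metadata` wrapper element, the property spellings, and the paths are assumptions made up for the example, not the actual CMCS schema.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core Element Set
DCTERMS = "http://purl.org/dc/terms/"     # Qualified Dublin Core terms
XLINK = "http://www.w3.org/1999/xlink"
CMCS = "http://example.org/cmcs#"         # hypothetical, not the real CMCS namespace

for prefix, uri in [("dc", DC), ("dcterms", DCTERMS),
                    ("xlink", XLINK), ("cmcs", CMCS)]:
    ET.register_namespace(prefix, uri)


def describe(title, creator, formula, inputs, replaces=None):
    """Build a record mixing pedigree, chemistry, and data-definition properties."""
    record = ET.Element("metadata")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    ET.SubElement(record, f"{{{CMCS}}}chemical-formula").text = formula
    for href in inputs:
        # Relationship values are carried as XLink references.
        ET.SubElement(record, f"{{{CMCS}}}has-inputs", {f"{{{XLINK}}}href": href})
    if replaces:
        ET.SubElement(record, f"{{{DCTERMS}}}replaces", {f"{{{XLINK}}}href": replaces})
    return record


def known_title(record):
    """An older client reads only the properties it knows about."""
    return record.findtext(f"{{{DC}}}title")


record = describe("Example thermochemical data set", "A. Researcher", "H2O",
                  inputs=["/data/example/input.xml"],
                  replaces="/data/example/v1.xml")
xml_text = ET.tostring(record, encoding="unicode")
```

A client such as `known_title` looks up properties by qualified name and ignores everything else, which is one way new metadata can be added without breaking existing applications.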

Page 7: Towards a Provenance Architecture Karen Schuchardt PNNL


Scientific Annotation Middleware

Provides a node plus metadata/relationship view of underlying data sources
Support put/get/search/access control of arbitrary data/metadata
Configurable metadata extraction from binary/ASCII/XML files
Configurable data translation
Semantic/graph queries
RDF export
Notebook services (page display, signatures, timestamps, …)
Pluggable security

The direct connection between metadata and resource limits its use as a next-generation provenance store

Page 8: Towards a Provenance Architecture Karen Schuchardt PNNL


Towards a Semantic Data Grid

Explore frameworks for advanced model-driven data integration capabilities
  Seamlessly integrate files, databases
  Automated scientific workflow mechanisms
  Capture, represent, and disseminate knowledge
  Identify changes via discovery mechanisms

Internally funded 2-year project

Page 9: Towards a Provenance Architecture Karen Schuchardt PNNL


Towards a Semantic Data Grid

What proteins in my organism(s) are both predicted and shown by experiment to interact with E. coli?

Resources required
  Microarray spreadsheets
  NCBI data services
  BIND database
  DIP database
  Work-group specific databases
Other data services
  Extraction
  Translation
  Merging
  HPC services
  Public Web services
  Discovery

Page 10: Towards a Provenance Architecture Karen Schuchardt PNNL


Use Case - Personal Records

Capture and organized display of provenance simplifies the job of keeping track of activities performed over the course of a long research process.

Example: A bioinformaticist performs data integration/analysis for many diverse projects. After 6 months, he/she can’t remember what a particular result pertained to or how it was generated.

Page 11: Towards a Provenance Architecture Karen Schuchardt PNNL


Use Case - Verifiability

Data generated from instruments/experiments undergoes numerous automatic processes before becoming available to researcher(s).

Example: Data from high-throughput biology experiments runs through several automated and, in some cases, manual processes before it becomes available to the bioinformaticist. The bioinformaticist often does not trust the data. They want to know who created it, what was done to it, when it was generated….

Page 12: Towards a Provenance Architecture Karen Schuchardt PNNL


Use Case - Applicability

Increasingly, research problems span disciplines or scales. Though data needs to move across these boundaries, moving it is often a manual process involving personal communications.

Example: In the combustion multi-scale research environment, data generated at one scale (e.g. thermochemical data) serves as input to successive scales (e.g. mechanisms). But it's not that simple: we must be able to determine the applicability of available data. Are the theoretical underpinnings under which it was generated consistent with the intended use?

Page 13: Towards a Provenance Architecture Karen Schuchardt PNNL


Use Case - Best Practices

By capturing and providing access to provenance of prior work, best practices can be shared.

Example: This is a little bit hypothetical but… best practices can be shared by sharing workflow definitions or by viewing provenance (and inputs) from instances of workflows.

Page 14: Towards a Provenance Architecture Karen Schuchardt PNNL


Types of Provenance in Workflow Environment

Interaction Provenance
  Data that moves between services
State Provenance
  Data known only to the actor itself
Observable Provenance
  Start/completion times
  Error detection
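One way to read the three types above is by who is in a position to record them. The sketch below is an assumption-laden illustration, not Kepler code: a hypothetical `run_actor` wrapper plays the role of the engine, which can see the data crossing the actor boundary (interaction) and the timing and errors (observable), while the actor itself must hand back anything internal (state).

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Tuple


class Kind(Enum):
    INTERACTION = "interaction"  # data that moves between services
    STATE = "state"              # data known only to the actor itself
    OBSERVABLE = "observable"    # start/completion times, error detection


@dataclass
class ProvenanceRecord:
    kind: Kind
    actor: str
    content: Dict[str, Any]


def run_actor(name: str,
              func: Callable[[Any], Tuple[Any, Dict[str, Any]]],
              value: Any,
              sink: List[ProvenanceRecord]) -> Any:
    """Run one step; `func` returns (result, state it chooses to disclose)."""
    started = time.monotonic()
    result, error = None, None
    try:
        result, state = func(value)
        sink.append(ProvenanceRecord(Kind.STATE, name, state))
    except Exception as exc:  # the engine observes the failure from outside
        error = repr(exc)
    sink.append(ProvenanceRecord(Kind.INTERACTION, name,
                                 {"input": value, "output": result}))
    sink.append(ProvenanceRecord(Kind.OBSERVABLE, name,
                                 {"elapsed_s": time.monotonic() - started,
                                  "error": error}))
    return result


def scale(x):
    factor = 2.5  # internal parameter the engine cannot see on its own
    return x * factor, {"factor": factor}


records: List[ProvenanceRecord] = []
out = run_actor("scale", scale, 4, records)
```

If the actor raises, no state record is produced, but the interaction and observable records still are, which matches the idea that those two can be captured without the actor's cooperation.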

Page 15: Towards a Provenance Architecture Karen Schuchardt PNNL


Other Provenance

Other applications will record data
  Pedigree/Provenance
  Experiment metadata
  Project organization
  Categorization
  Detected features
  Instrument logs
  Digital signatures
  Endorsements
  Community annotations
  Other workflow engines

Page 16: Towards a Provenance Architecture Karen Schuchardt PNNL


Logical Architecture

[Figure: logical architecture diagram, extracted from escience Strawman - Moreau. Component labels recovered from the diagram:]

Provenance Store(s)
  Query Interface
  Submission Interface
User Recording Tools
  Portlets
  Annotator
  Notebooks
  Science Applications
Client Query Library
Client Submission Library
Experiment Services
  Workflow engine
  Domain specific services
Presentation Services
  Visualizer/Browser
  Difference Visualizer
  Workflow construction
Processing Services
  Difference Analyzer
  Quality Analyzer
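The separation between a submission interface and a query interface in the diagram can be sketched as two small protocols with one store behind them. Everything here is hypothetical: the method names, the dictionary-shaped assertions, and the assumption that recording tools and experiment services use only the submission side while presentation services use only the query side.

```python
from typing import Callable, Dict, List, Protocol

Assertion = Dict[str, str]  # hypothetical record shape: arbitrary key/value content


class SubmissionInterface(Protocol):
    def submit(self, assertion: Assertion) -> str: ...


class QueryInterface(Protocol):
    def query(self, predicate: Callable[[Assertion], bool]) -> List[Assertion]: ...


class InMemoryStore:
    """Toy provenance store that satisfies both interfaces."""

    def __init__(self) -> None:
        self._items: Dict[str, Assertion] = {}

    def submit(self, assertion: Assertion) -> str:
        key = f"prov-{len(self._items) + 1}"
        self._items[key] = dict(assertion)  # copy so callers cannot mutate it later
        return key

    def query(self, predicate: Callable[[Assertion], bool]) -> List[Assertion]:
        return [a for a in self._items.values() if predicate(a)]


def notebook_annotate(store: SubmissionInterface, page: str, note: str) -> str:
    """A user recording tool needs only the submission side."""
    return store.submit({"source": "notebook", "subject": page, "note": note})


def workflow_record(store: SubmissionInterface, run_id: str, step: str) -> str:
    """An experiment service submits through the same interface."""
    return store.submit({"source": "workflow", "subject": run_id, "step": step})


def browse(store: QueryInterface, subject: str) -> List[Assertion]:
    """A presentation service needs only the query side."""
    return store.query(lambda a: a.get("subject") == subject)


store = InMemoryStore()
first_key = notebook_annotate(store, "page-7", "signal looks noisy")
workflow_record(store, "run-42", "align")
run_history = browse(store, "run-42")
```

Typing each client against only the interface it needs keeps notebooks, workflow engines, and browsers independent of the store implementation, so a store could be swapped or multiplied without changing them.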

Page 17: Towards a Provenance Architecture Karen Schuchardt PNNL


Components of Physical Architecture

One or more RDF triple stores
Global naming service
Arbitrary data stores for data referenced by the provenance
Security services (pluggable for scalability)
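A small sketch of how the first three components could fit together, under stated assumptions: a set of tuples stands in for an RDF triple store, a dictionary stands in for the global naming service, and the predicate names and locations are invented for the example. Security is left out.

```python
import uuid
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class NamingService:
    """Hypothetical global naming service: issues location-independent names."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, location: str) -> str:
        name = f"urn:uuid:{uuid.uuid4()}"
        self._locations[name] = location
        return name

    def resolve(self, name: str) -> str:
        return self._locations[name]


class TripleStore:
    """Toy stand-in for an RDF triple store."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(t for t in self._triples
                      if (s is None or t[0] == s)
                      and (p is None or t[1] == p)
                      and (o is None or t[2] == o))


names = NamingService()
raw = names.register("ftp://instrument.example.org/run17.raw")
peaks = names.register("file:///archive/run17.peaks")

triples = TripleStore()
triples.add(peaks, "derivedFrom", raw)
triples.add(peaks, "createdBy", "peak-picker")

# The provenance holds only names; the data stays in whatever store it lives in.
source_name = triples.match(s=peaks, p="derivedFrom")[0][2]
source_location = names.resolve(source_name)
```

Keeping names separate from locations means the referenced data can move between stores without rewriting any provenance triples.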

Page 18: Towards a Provenance Architecture Karen Schuchardt PNNL


Workflow and Provenance

Requires binding to provenance service
Need mechanism to associate provenance from workflow instance
  Id? Links?
Requires communication of service information or other mechanism for actors to contribute state provenance

Page 19: Towards a Provenance Architecture Karen Schuchardt PNNL


Summary

We’ve done a lot of work on provenance but see value in moving to a more flexible architecture
Workflow engines are just one component that can contribute to the provenance of research results.
Provenance capture should be thought of as a cross-cutting technology
Models for provenance need to be flexible, allowing arbitrary content
Provenance services need to be scalable
  low-footprint usages for individual applications
  large experimental facilities
  virtual organizations