Enabling and Supporting Provenance in e-Science Applications Luc Moreau University of Southampton [email protected]

Slide 1
Enabling and Supporting Provenance in e-Science Applications
Luc Moreau
University of Southampton
[email protected]

Slide 2
Contents
• Provenance: problem definition
• Use cases of provenance in e-Science
• Architectural vision for provenance
• First experimentation, current work
• Research agenda
• Conclusion

Slide 3
Provenance: definition
Main Entry: provenance
Pronunciation: 'präv-n&n(t)s, 'prä-v&-"nän(t)s
Function: noun
Etymology: French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come -- more at PRO-, COME
Date: 1785
1 : ORIGIN, SOURCE
2 : the history of ownership of a valued object or work of art or literature
(Merriam-Webster Online)

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8
The Grid and Virtual Organisations
• The Grid problem is defined as coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01].
• Effort is required to allow users to place their trust in the data produced by such virtual organisations.

Slide 9
Provenance and Virtual Organisations
• Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result, how can we determine the process that generated the result, especially after the virtual organisation has been disbanded?

Slide 10
Provenance and Workflows
• Workflow enactment has become popular in the Grid and Web Services communities.
• Workflow enactment can be seen as a scripted form of virtual organisation.
• The problem is similar: how can we determine the origin of enactment results?

Slide 11
Use cases
• Bioinformatics
• Aerospace Engineering
• Organ transplant management
• Combechem
• Physics

Slide 12
Provenance in Bioinformatics
• myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments.
• Manchester, Southampton, Newcastle, Nottingham, EBI
• IBM, SUN, GSK, AZ, Merck

Slide 13
Graves' disease
• Autoimmune disease of the thyroid
• Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos

Slide 14
Provenance in Bioinformatics
[Figure: myGrid architecture diagram. Component labels: Notification Service, Knowledge Services, DB2, Registry, Semantic registration, Service, Structural registration, Knowledge Service, Ontology Server, Reasoner, Matcher, Workflow templates, Data, Provenance, mInfo Repository, Workflow enactment engine, Workflow instances, Build/Edit Workflow, Service Discovery, Test Data, WSFL, JMS, Distributed Query Processor, Information Extraction, PASTA, Job Execution, SoapLab, mIR, Provenance service, Component Discovery, Metadata, Concepts, Registry View, UDDI, UDDI-M]

Slide 15
Provenance: Execution Trail
[Figure: execution trail graph. Labels: process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: John, run_for, output, Gene:AC005412.6, SNP:000010197]

Slide 16
Provenance: Domain Level Trail
[Figure: domain level trail graph. Labels: Gene:AC005412.6, SNP:000010197, process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: John, run_for, output, contains_single_nucleotide_polymorphism, as stated by]

Slide 17
Provenance: Annotation
[Figure: annotated trail graph. Labels: Gene:AC005412.6, SNP:000010197, process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: John, run_for, output, contains_single_nucleotide_polymorphism, as stated by, urn: Alice, disputed by]

Slide 18
myGrid Provenance Requirements
• Execution trail
• Knowledge level representation of the execution, expressed in domain specific terms
• Undisputed view of execution
• Capability of annotating and providing interpretations to results
• Interpretation of execution

Slide 19
Provenance in Bioinformatics
• myGrid's focus is on the scientist and their collaborations: provenance is a form of log book.
• There are other uses of provenance in bioinformatics.
• Provenance in the drug discovery process: drug companies are required to keep a record of the provenance of drug discovery for as long as the drug is in use (sometimes up to 50 years).

Slide 20
Provenance in Bioinformatics
• The mRNA that is to be translated contains stretches of noncoding sequence that are removed before translation begins.
• Noncoding stretches are called introns (for INtervening sequences).
• Sequences that are translated are called exons (for EXpressed sequences).
• Klaus-Peter Zauner's study of the quantity of information (Kolmogorov complexity) contained in introns and exons involves bioinformatics and statistical processes, relying on brute force and guesswork.

Slide 21
Provenance in Bioinformatics
• Determining the difference in the system between two runs of an experiment
• Determining how best to run the experiment in future
• Historical record and proof of process
• Checks on validity of process
• Tracing the origin of data

Slide 22
Provenance in Aerospace Engineering
• Provenance requirement: to maintain a historical record of inputs/outputs from each sub-system involved in simulations.
• Aircraft provenance data needs to be kept for up to 99 years when the aircraft is sold to some countries.
• Currently, little direct support is available for this.

Slide 23
Provenance in Organ Transplant Management
• Decision support systems for organ and tissue transplant rely on a wide range of data sources, patient data, and doctors' and surgeons' knowledge.
• Heavily regulated domain: European, national, regional and site-specific rules govern how decisions are made.
• The application of these rules must be ensured and auditable, and the rules may change over time.
• Provenance allows tracking of previous decisions, which is crucial to maximise the efficiency of matching and the recovery rate of patients.

Slide 24
Provenance in Combechem
• Mechanism by which a PhD student's supervisor may check that the student's experiment was performed properly, especially if the results are odd.
• If enough information is recorded about an experiment, the paper describing it can be automatically created.
• Protection of intellectual property rights.
• The signing chemist will use their expertise to determine whether the experiment was performed correctly, and the provenance should be complete enough that they could potentially re-run the experiment to check the results.

Slide 25
GriPhyN

Slide 26
Architectural Vision

Slide 27
What is the problem?
• Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments.
• Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.
• Methods are generally ad hoc and do not interoperate.

Slide 28
Architectural Vision
Typical workflow enactment in a service-oriented architecture

Slide 29
Architectural Vision with provenance support

Slide 30
A First Prototype

Slide 31
Sequence Diagram/Data Model
• Must support recording of all information necessary to replay execution
• Must support all complex forms of workflows (recursion, iterations, parallel execution)
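The two data model requirements on Slide 31 can be made concrete with a small sketch. The slides do not give the prototype's actual schema, so everything below is an assumption for illustration: the names InvocationRecord, ProvenanceStore and replay_plan are hypothetical, as are the fields parent_id (nesting and recursion), iteration (loop index) and branch (parallel branch label). Each record keeps the inputs and outputs of one invocation, which is the minimum needed to replay it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class InvocationRecord:
    """One service invocation and its result (hypothetical schema)."""
    record_id: str
    service: str
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]
    parent_id: Optional[str] = None   # enclosing invocation: nesting, recursion
    iteration: Optional[int] = None   # loop index when inside an iteration
    branch: Optional[str] = None      # label of a parallel branch, if any


class ProvenanceStore:
    """In-memory store; a real service would persist records (assumption)."""

    def __init__(self) -> None:
        self._records: Dict[str, InvocationRecord] = {}

    def record(self, rec: InvocationRecord) -> None:
        if rec.record_id in self._records:
            raise ValueError(f"duplicate record id: {rec.record_id}")
        self._records[rec.record_id] = rec

    def children(self, parent_id: str) -> List[InvocationRecord]:
        # Dicts keep insertion order, so children come back in recording order.
        return [r for r in self._records.values() if r.parent_id == parent_id]

    def replay_plan(self, root_id: str) -> List[Tuple[int, str, Dict[str, Any]]]:
        """Depth-first list of (depth, service, inputs) needed to replay a run."""
        plan: List[Tuple[int, str, Dict[str, Any]]] = []

        def visit(rec: InvocationRecord, depth: int) -> None:
            plan.append((depth, rec.service, rec.inputs))
            for child in self.children(rec.record_id):
                visit(child, depth + 1)

        visit(self._records[root_id], 0)
        return plan


# Illustrative data only: a workflow with one retrieval and a two-step loop.
store = ProvenanceStore()
store.record(InvocationRecord("r1", "workflow", {"gene": "G1"}, {"status": "done"}))
store.record(InvocationRecord("r2", "retrieve", {"gene": "G1"}, {"hits": 2}, parent_id="r1"))
store.record(InvocationRecord("r3", "analyse", {"hit": 0}, {"score": 0.4},
                              parent_id="r1", iteration=0))
store.record(InvocationRecord("r4", "analyse", {"hit": 1}, {"score": 0.9},
                              parent_id="r1", iteration=1))
plan = store.replay_plan("r1")
```

Linking records through parent_id rather than storing a flat log is one possible design choice: it lets recursive and nested workflows be reconstructed as a tree, while iteration and branch distinguish repeated or concurrent invocations of the same service.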

Slide 32
PReP: Provenance Recording Protocol
[Figure: message exchange between client, service and Provenance Service. Labels: invocation, result, negotiate configuration, "invocation and result notify" (shown twice)]

Slide 33
PReP Formalisation
• Abstract machines
• Properties: termination, liveness, safety
• Foundation for adding necessary cryptographic techniques

Slide 34
PReP: Client Side State Space

Slide 35
Research Agenda (1)
• In order for provenance data to be useful, we expect such a protocol to support some classical properties of distributed algorithms.
• Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server and, vice versa, a provenance server can ensure that it receives data from a given service.
• With non-repudiation, we can retain evidence of the fact that a service has committed to executing a particular invocation and has produced a given result.
• We anticipate that cryptographic techniques will be useful to ensure such properties.

Slide 36
Research Agenda (2)
• Access control. Medical applications: organ transplant, IXI, e-Diamond
• Scalability: DC2 10^7 files; CERN envisions 10^12 files
• From execution level provenance, how to infer domain level provenance?

Slide 37
Research Agenda (3)
Using provenance of data, trust metrics of the data can be derived from:
• Trust the user places in invoked services
• Trust the user places in the input data
• Trust the user places in the enacted workflow
• Trust the user places in the enactor
• Trust the user places in the provenance service
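Slide 37 lists the sources of trust but not how they would be combined. As a minimal sketch, assume each source is scored in [0, 1] and treated as independent, so that the derived trust in a result is the product of all the scores. The product rule and the function name derived_trust are assumptions made here for illustration, not something the slides specify; taking the minimum score would be an equally plausible choice.

```python
from math import prod
from typing import Sequence


def derived_trust(
    services: Sequence[float],
    inputs: Sequence[float],
    workflow: float,
    enactor: float,
    provenance_service: float,
) -> float:
    """Combine per-source trust scores into one score for a result.

    Assumption: every score lies in [0, 1] and the sources are independent,
    so the scores are simply multiplied together.
    """
    factors = [*services, *inputs, workflow, enactor, provenance_service]
    for value in factors:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"trust score out of range: {value}")
    return prod(factors)


# Illustrative scores only: one partially trusted service and a partially
# trusted provenance service; everything else is fully trusted.
score = derived_trust(
    services=[1.0, 0.5],
    inputs=[1.0],
    workflow=1.0,
    enactor=1.0,
    provenance_service=0.5,
)
```

With a product rule, a single distrusted element (score 0) makes the whole result untrusted, which matches the intuition that a result is only as trustworthy as every step that produced it.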

Slide 38
• The purpose of the PASOA project is to investigate provenance in Grid architectures.
• Funded by EPSRC under the fundamental computer science for e-Science call
• In collaboration with Cardiff
• www.pasoa.org

Slide 39
EU Provenance STREP: Enabling and Supporting Provenance in Grids for Complex Problems
• IBM United Kingdom Ltd, University of Southampton, German Aerospace Centre, University of Wales, Cardiff, Universitat Politecnica de Catalunya, MTA SZTAKI
• To design, conceive and implement an industrial-strength open provenance architecture for Grid computing, and to deploy and evaluate it in complex grid applications (aerospace engineering and organ transplant management).

Slide 40
Conclusion
• Provenance is a rather unexplored domain.
• Strategic to bring trust in open environments.
• Necessity to design a secure, scalable and configurable architecture capable of supporting multiple requirements from very different application domains.
• Need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.

Slide 41
Acknowledgements
• myGrid: Simon Miles, Juri Papay, Ananth Krishna, Michael Luck, David De Roure, Terry Payne, Mark Greenwood, Carole Goble, Martin Szomszor
• Combechem: Gareth Hughes, Hugo Mills, monica schraeffel
• PASOA: Omer Rana, Paul Groth, Simon Miles, Ben Caroll
• EU-Provenance: Syd Chapman, John Ibbotson, Laszlo Varga, Steve Willmott, Ulises Cortes, Andreas Schreiber, Rolf Hempel

Slide 42