Architecture Tutorial
Security and privacy in provenance
Simon Miles
King’s College London
Outline
• Provenance
• Models and Systems
• Illustrative Application
• Privacy and Security Issues
Provenance
What Provenance Is
• Oxford English Dictionary:
– the fact of coming from some particular source or quarter; origin, derivation
– the history or pedigree of a work of art, manuscript, rare book, etc.;
– concretely, a record of the passage of an item through its various owners.
• Provenance is important for:
– Interpretation
– Judging value
Causation
• Everything that is part of the provenance of an item is a cause of that item being as it is
• For example, the provenance of a bottle of wine includes:
– Grapes from which it is made
– Where those grapes grew
– Steps in the wine’s preparation
– How the wine was stored
– Between which parties the wine was transported, e.g. producer to distributor to retailer
Motivating Applications
• We and other projects interviewed and supported users with issues regarding provenance in a range of domains, including:
– Bioinformatics
– Particle Physics
– Proteomics
– Organ transplant
– Aircraft simulation
– Police database integration
– Social planning
– Chemical analysis
– Genetic diseases
– Grid service fault tolerance
– Brain image analysis
– Astronomy
Provenance Questions
• How did I (or someone else) come by this result?
• What was common and relevant in the history of this set of successful outcomes?
• Was the process claimed to be performed the one which was actually performed?
Provenance Questions
• What inputs were used to derive this output?
• What software produced this data?
• Can I generalise from the process by which this result was produced to a re-usable plan?
Provenance Questions
• Were these regulations followed in producing this result?
• Are these two independent conclusions actually based on the same faulty assumption/input?
• What differed between the way these two results were produced?
Shared Histories and Futures
• Multiple data can be produced by one process
• One process can use data from many sources as input
• The provenance (and futures) of data items overlap
• It is therefore suspect to assume that one data item has exactly one provenance, or that provenance can simply be stored with the data
Causal Provenance Models
Illustrative Application
Causal graphs
[Figure: a causal graph built up over five slides. Nodes: Donor Organ Decision: Yes; Family Consent Decision: Yes; Blood Test Results: -ve; Blood Test Request: 432; Family Consent Request: 432; Patient Brain Death: PID 432. Edge labels: “decision based on”, “response to” (twice), “triggered by” (twice).]
Causal Connections
[Figure labels: “Donation operation”; “Patient after donation with two kidneys”]
• Causes and effects are occurrences
– Occurrence of an event, or
– Occurrence of a data item or physical object being in a particular state
Documentation and Provenance
• We can distinguish
– process documentation (the documentation recorded into a store about processes)
– provenance (everything that caused an item to be as it is)
• Process documentation is recorded as processes are executed
• The data items that a process will ultimately produce may not be known at that time
• Provenance of an entity is obtained as the result of a query over process documentation
[Figure labels: Process documentation; Provenance]
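The distinction between process documentation and provenance can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the function name `provenance`, and the edge endpoints (one plausible reading of the transplant graph) are assumptions, not part of any provenance system described in these slides.

```python
from collections import deque

# Process documentation: (effect, relation, cause) assertions recorded as the
# process runs. The endpoints below are an assumed reading of the slides' graph.
DOCUMENTATION = [
    ("Donor Organ Decision", "decision based on", "Family Consent Decision"),
    ("Donor Organ Decision", "decision based on", "Blood Test Results"),
    ("Family Consent Decision", "response to", "Family Consent Request"),
    ("Blood Test Results", "response to", "Blood Test Request"),
    ("Family Consent Request", "triggered by", "Patient Brain Death"),
    ("Blood Test Request", "triggered by", "Patient Brain Death"),
]


def provenance(item, documentation):
    """Query: return every occurrence that directly or indirectly caused `item`."""
    causes_of = {}
    for effect, _relation, cause in documentation:
        causes_of.setdefault(effect, []).append(cause)
    found, queue = set(), deque([item])
    while queue:
        current = queue.popleft()
        for cause in causes_of.get(current, []):
            if cause not in found:
                found.add(cause)
                queue.append(cause)
    return found
```

The documentation is written once, without knowing which item will later be asked about; the provenance of any item is computed on demand by the query.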
Process Documentation
• Documentation of one process comes from multiple, possibly independent, sources
• May share a store or use separate ones
[Figure: three documentation sources (Family, Testing Lab, Doctor), each recording part of the transplant graph: Blood Test Results; Blood Test Request; Family Consent Decision; Family Consent Request; Donor Organ Decision: Yes; Patient Brain Death: PID 432]
Provenance Scope
• An item is caused to be as it is by previous events, which were themselves caused by other events
• The causal graph could go back to the beginning of time
• If all this information were returned as the result of a query, it would be unmanageable and mostly irrelevant to the querier
• Therefore, the querier needs to scope the query to what is relevant
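One way to scope a query, sketched here under assumptions, is to bound how many causal steps are followed and which relation types count. The function name `scoped_provenance`, its parameters, and the sample data are hypothetical.

```python
# (effect, relation, cause) assertions; an assumed fragment of the transplant example.
DOC = [
    ("Donor Organ Decision", "decision based on", "Family Consent Decision"),
    ("Donor Organ Decision", "decision based on", "Blood Test Results"),
    ("Family Consent Decision", "response to", "Family Consent Request"),
    ("Blood Test Results", "response to", "Blood Test Request"),
    ("Blood Test Request", "triggered by", "Patient Brain Death"),
]


def scoped_provenance(item, documentation, max_depth, relations=None):
    """Return causes of `item` at most `max_depth` steps away, optionally
    following only the relation types named in `relations`."""
    causes_of = {}
    for effect, relation, cause in documentation:
        if relations is None or relation in relations:
            causes_of.setdefault(effect, []).append(cause)
    found, frontier = set(), {item}
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier:
            for cause in causes_of.get(node, []):
                if cause not in found:
                    found.add(cause)
                    next_frontier.add(cause)
        frontier = next_frontier
    return found
```

With `max_depth=1` the query returns only the immediate basis of the decision; raising the depth or widening `relations` pulls in progressively older causes.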
Open Data Model
• Distributed processes involve functionality from multiple independent organisations
• Each needs to record documentation independently
• We need a common, open data model and interfaces for recording and querying data in that model
[Figure labels: Organisation 1; Organisation 2; Organisation 3; Provenance Stores]
Inference
[Figure, built up over two slides: a digitally controlled process (Blood Test Request, Blood Test Results) and an inferred physical process (Sent Blood Sample, Received Blood Sample)]
Privacy and Security Issues
Anonymised User Actions
• Provenance records for healthcare will include documentation regarding the actions of patients (or samples of theirs)
– Going to see a particular (their) GP
– Undergoing surgery at a particular hospital
– Their blood sample being sent to a testing lab
• Even if the patient is anonymised within the records, the pattern of their actions can be enough to uniquely identify them
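The re-identification risk can be illustrated with a small check for pseudonyms whose set of recorded actions is shared by nobody else. The records, pseudonyms, and the function name `uniquely_identifiable` are made up for illustration.

```python
from collections import Counter

# (pseudonym, action) pairs as they might appear in anonymised records.
# All values here are hypothetical.
RECORDS = [
    ("anon-1", "GP visit: Practice A"),
    ("anon-1", "Surgery: Hospital H1"),
    ("anon-1", "Blood sample sent: Lab L1"),
    ("anon-2", "GP visit: Practice A"),
    ("anon-2", "Blood sample sent: Lab L1"),
    ("anon-3", "GP visit: Practice A"),
    ("anon-3", "Blood sample sent: Lab L1"),
]


def uniquely_identifiable(records):
    """Return pseudonyms whose pattern of actions matches no other pseudonym."""
    patterns = {}
    for pseudonym, action in records:
        patterns.setdefault(pseudonym, set()).add(action)
    counts = Counter(frozenset(actions) for actions in patterns.values())
    return {p for p, actions in patterns.items() if counts[frozenset(actions)] == 1}
```

Here `anon-1` is the only patient with that combination of GP, hospital and lab, so anyone who knows those three facts about a person can pick out their record despite the pseudonym.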
Data and Metadata Rights
• Provenance is often viewed as metadata to the data of which it provides a history
• Provenance information is usually generated automatically at runtime, and it is not known in advance what that information will be, yet appropriate rights still have to be applied to it
• How do access rights of the provenance metadata relate to those of the data?
Multi-Data Metadata
• Furthermore, provenance is often metadata to multiple data items
• For example, a record of the process of a transplant operation is the provenance of
– The transplanted organ,
– The decision to transplant,
– Blood tests carried out to decide to transplant, etc.
• Each may be stored separately and have very different access control policies
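As a sketch of one possible answer, and not a policy proposed in these slides, a provenance record could be readable only by roles permitted to read every data item it describes. The roles, policies, and the function name `may_read_record` are assumptions.

```python
# Access control policy per data item: the set of roles allowed to read it.
# Items follow the transplant example; the roles are hypothetical.
POLICIES = {
    "transplanted organ": {"surgeon", "coordinator"},
    "decision to transplant": {"surgeon", "coordinator", "auditor"},
    "blood tests": {"surgeon", "lab technician"},
}


def may_read_record(role, described_items, policies):
    """Conservative rule: allow access to a provenance record only if the role
    may read every data item that the record is provenance of."""
    return all(role in policies[item] for item in described_items)
```

This rule is safe but restrictive: an auditor entitled to see the decision would be refused the whole record, which suggests why finer-grained rights over parts of the provenance may be needed.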
Necessary Distribution of Query
• It is sometimes necessary to distribute parts of the provenance data about a process into multiple stores
• For example, in the OTM case, by EU law the data regarding activity within each hospital had to remain within that hospital
• To answer a provenance question, we need to query across distributed stores
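A query across distributed stores can be sketched as repeated local lookups, so that each store only ever answers questions about its own records. Store names, contents, and function names below are hypothetical.

```python
# Each store holds (effect, cause) pairs for activity inside one organisation.
# Store names and contents are hypothetical.
STORES = {
    "Hospital A": [
        ("Donor Organ Decision", "Blood Test Results"),
        ("Blood Test Request", "Patient Brain Death"),
    ],
    "Testing Lab": [
        ("Blood Test Results", "Blood Test Request"),
    ],
}


def causes_in_store(records, item):
    """Local query: the direct causes of `item` known to a single store."""
    return [cause for effect, cause in records if effect == item]


def distributed_provenance(item, stores):
    """Follow causes across all stores without copying records between them."""
    found, pending = set(), [item]
    while pending:
        current = pending.pop()
        for records in stores.values():
            for cause in causes_in_store(records, current):
                if cause not in found:
                    found.add(cause)
                    pending.append(cause)
    return found
```

Querying Hospital A alone stops at the blood test results; the chain back to the patient's brain death only appears once the lab's store is consulted as well.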
Automatic Capture
• Provenance is often viewed as metadata to the data of which it provides a history
• Provenance information is usually generated automatically at runtime, and it is not known in advance what that information will be, yet appropriate rights still have to be applied to it
• How do access rights of the provenance metadata relate to those of the data?
Traffic Confidentiality and Inference
• Traffic confidentiality means hiding the fact that a service was used by a client, even where transmitted data is encrypted
– e.g. a pharmaceutical company querying a small lab’s public database concerning a particular disease
• Intermediaries who use multiple services can help achieve confidentiality
• But the actual service used could be inferred from provenance set up to allow inferences
Extra Material
Extra Material Index
• Motivation for general provenance models
• Interoperability and the Open Provenance Model
• Provenance technologies in database research, digital libraries, semantic web
• Provenance in Tupelo (from NCSA)
• Provenance in Taverna (from Manchester)
• The Provenance Challenges
• Open research issues
Motivation for Common, General Provenance Models
Separately Documented Aspects
• Attribution and related events
– Modified by Simon Miles, compressed by X
– Created at time T1, deposited at T2
• Documentation of the processing of data
– Enactment of workflows
– Chain of ownership
• Versioning
• Differing practice, technologies, emphasis: workflows, DB research, libraries, semweb
Preparation for Questions
• Don’t know in advance of something being produced that it will be produced
– When documenting events, can’t yet associate that documentation with what those events ultimately produce
• Don’t know in advance of being asked (about provenance) what will be asked
– When documenting provenance, can’t restrict documentation to what you know will be used
Alternative Accounts
• In some disciplines or for some kinds of data, provenance can be disputed
• Even within a computer system, there can be multiple accounts of apparently the same event
[Figure: nodes A and B with two accounts of the same message, “A sent X to B” and “A sent Y to B”; the difference is labelled “corruption”]
Common General Models
• Provide skeleton for documenting all aspects of provenance
• Record lots without (much) regard to particular questions...
• Then query as relevant to required usage
• System interoperation through common serialisation
• Can connect records from different systems involved in producing 1 data item
Interoperability
Open Provenance Model
Can describe any process (not just WF execution)
Allows alternate accounts by different observers
http://openprovenance.org
OPM Requirements
• To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.
• To allow developers to build and share tools that operate on such a provenance model.
• To define the model in a precise, technology-agnostic manner.
• To support a digital representation of provenance for any “thing”, whether produced by computer systems or not.
OPM Non-Requirements
• OPM does not specify the internal representations that systems have to adopt to store and manipulate provenance internally.
• OPM does not define a computer-parsable syntax for this model (but prototype RDF and XML schemas have been developed).
• OPM does not specify protocols to store such provenance information in provenance repositories.
• OPM does not specify protocols to query provenance repositories.
Contributors
• Original contributors from:
– Universities: Southampton, Indiana, King’s College, Manchester, Davis, Hasselt, Utah, Southern California
– Microsoft, NCSA, PNNL
• Plus 3rd challenge participants including:
– Universities: Harvard, Chicago, Santa Barbara, Amsterdam
– SDSC
Open Provenance Model
• 3 node types – artifact, process, agent
• 5 arc types – used, generated, triggered, derived, controlled – and inference rules
• Generic – extensibility via annotation
• Choice of granularity and focus (e.g., artifact or process-centric)
Entities
• Artifact: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
• Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts.
• Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.
Edges
[Figure: OPM edges between artifacts (A) and processes (P): P used A; A was generated by P; P was triggered by P; A was derived from A]
Role identifiers on edges specify in what way an artifact relates to a process
Pegasus Example
[Figure: OPM graph for a Pegasus run]
• Process: Produce Sky Mosaic
– used (inputSet) artifact FITS DataSet
– used (size) artifact Degree
– was controlled by (enactor) agent Pegasus / Condor DAGMan
• Artifact: Mosaic
– was generated by (output) process Produce Sky Mosaic
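The Pegasus example can be written down as data under a minimal, assumed encoding of OPM's three node types and five arc types. The class names `Node` and `Edge` and the camel-case arc names are illustrative choices; OPM itself does not prescribe any programming interface.

```python
from dataclasses import dataclass

# For each OPM arc type: (kind of the effect node, kind of the cause node).
ARC_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasControlledBy": ("process", "agent"),
}


@dataclass(frozen=True)
class Node:
    kind: str  # "artifact", "process" or "agent"
    name: str


@dataclass(frozen=True)
class Edge:
    effect: Node
    arc: str
    cause: Node
    role: str = ""

    def __post_init__(self):
        # Reject edges whose endpoints do not match the arc type.
        expected = ARC_TYPES[self.arc]
        if (self.effect.kind, self.cause.kind) != expected:
            raise ValueError(f"{self.arc} must link {expected[0]} to {expected[1]}")


dataset = Node("artifact", "FITS DataSet")
degree = Node("artifact", "Degree")
mosaic = Node("artifact", "Mosaic")
produce = Node("process", "Produce Sky Mosaic")
enactor = Node("agent", "Pegasus / Condor DAGMan")

graph = [
    Edge(produce, "used", dataset, role="inputSet"),
    Edge(produce, "used", degree, role="size"),
    Edge(mosaic, "wasGeneratedBy", produce, role="output"),
    Edge(produce, "wasControlledBy", enactor, role="enactor"),
]

# Example query: which artifacts did the process use?
inputs = sorted(e.cause.name for e in graph if e.arc == "used" and e.effect == produce)
```

Edges point from effect to cause, so following them from `Mosaic` leads back through the process to its inputs and its controlling agent.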
Mapping Attribution to OPM
[Figure: the statement below mapped to an OPM graph. Artifact A was generated by a process “creation”, which used two other artifacts and is linked to the agent “Simon Miles” by a wasActionOf edge]
A dc:creator “Simon Miles”
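A rough sketch of this mapping, assuming a simple tuple encoding of edges: a `dc:creator` statement is expanded into a creation process that generated the artifact and is tied to the named agent. The function name and the process naming scheme are hypothetical; the `wasActionOf` label is taken from the slide.

```python
def attribution_to_opm(artifact, creator):
    """Expand `artifact dc:creator creator` into two OPM-style edges."""
    process = f"creation of {artifact}"
    return [
        (artifact, "wasGeneratedBy", process),
        (process, "wasActionOf", creator),
    ]


edges = attribution_to_opm("A", "Simon Miles")
```

The flat attribution statement says only who; the graph form also leaves room to attach what the creation process used.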
Provenance Technologies in database research, digital libraries, semantic web
Database Research
• In database research, the concept of provenance has been used for:
– Inferring what database table values affected a query result (Buneman et al.)
– Tracking the changes in relational data structure between versions of a database
– Tracking changes in database schemas (Chiticariu and Tan)
Why & Where Provenance (Buneman et al.)
SELECT name, telephone
FROM employee
WHERE salary > (SELECT AVG(salary) FROM employee)
[Figure: source table employee (name, telephone, salary) with rows for Alfred, Bertha, Charlie, Denise and Eric, telephone numbers 020 7848 …. and salaries 900, 800, 700, 600, 500; result table (name, telephone) showing Denise and Eric. Arrows labelled “where” and “why” link result values back to source values.]
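The two notions can be approximated with SQLite by carrying row identifiers through the query. This is a simplified reading of why- and where-provenance, not Buneman et al.'s formal definitions, and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, telephone TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ann", "555 0001", 500), ("Bob", "555 0002", 700), ("Cat", "555 0003", 900)],
)

# Run the slide's query, also selecting SQLite's implicit rowid.
rows = conn.execute(
    "SELECT rowid, name, telephone FROM employee "
    "WHERE salary > (SELECT AVG(salary) FROM employee) ORDER BY rowid"
).fetchall()

# Where-provenance: each output value was copied from one source cell.
where = {
    (name, column): ("employee", rowid, column)
    for rowid, name, _telephone in rows
    for column in ("name", "telephone")
}

# Why-provenance (simplified): the selected row plus every row feeding the average.
all_rowids = [r[0] for r in conn.execute("SELECT rowid FROM employee ORDER BY rowid")]
why = {name: {"selected": rowid, "average_over": all_rowids} for rowid, name, _t in rows}
```

Where-provenance points at a single cell, while why-provenance here involves the whole table, because changing any salary could change the average and therefore the result.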
Digital Library Technologies
• In digital libraries, a set of standards is sometimes used to provide data structures for storing metadata along with archived objects: OAIS, METS, PREMIS...
• An Archival Information Package (AIP) provides write-once data and metadata
• AIP metadata can contain identifiers and relationships to connect one version to preceding versions, and record events relevant to the archived object, e.g. compression, integrity check
Provenance in RDF
• Different schemes have been suggested for recording documentation on the provenance of statements in RDF
• Reified statements:
A: http://...subj http://...isRelated http://...obj
B: <A> http://...hasCreator “Simon”
• Named graphs
• Causal graph explicit as part of data model
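Reification can be sketched with plain tuples, using the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). The `http://example.org/` URIs and the function name `reify` are placeholders, not the elided URIs on the slide.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # placeholder namespace for illustration


def reify(statement_id, triple):
    """Describe `triple` as a resource so that other triples can refer to it."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


statement_a = (EX + "subj", EX + "isRelated", EX + "obj")
triples = reify(EX + "A", statement_a)
# Statement B: provenance attached to statement A itself.
triples.append((EX + "A", EX + "hasCreator", "Simon"))
```

The cost is visible in the triple count: one original statement becomes four reification triples before any provenance is attached, which is part of the motivation for named graphs.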
Provenance as Bibliography
• Dublin Core can be used to express bibliography information: creator, publisher, subject, etc.
– http://purl.org/dc/elements/1.1/creator
• Not as expressive as causal graphs and can be captured in a graph
– e.g. who created something is part of the process by which it was created
• But DC metadata is common across applications and easy to use
• Users can find it helpful to include both
Provenance in Tupelo
Thanks to Joe Futrelle, National Centre for Supercomputing Applications, for the following slides
Tupelo: semantic content
Abstracts content from storage implementations (e.g., Sesame, Mulgara)
Provides location-independent addressing of content and metadata
Supports transparent mirroring, caching, failover, etc.
(tupeloproject.org)
Tupelo
• “Tupelo... provides a Web access protocol and Java API (Application Program Interface) that interface with an RDF (Resource Description Framework) mapping of the Open Provenance Model.”
– Towards provenance-aware geographic information systems, ACM SIGSPATIAL 2008
NCSA Provenance Infrastructure
[Figure: NCSA provenance infrastructure. Components: Open Provenance Model; OPM toolkit; Tupelo Semantic Content Repository; Contexts; Stores; desktop, portal, etc. Layer functions: visualization, interaction; tracking, modeling, presentation; abstraction, inference, storage]
Tupelo Provenance API
• Java API to record OPM data as RDF, e.g.
// Create an OPM artifact, then assert it into the graph
Artifact artifact = graph.newArtifact("input file 1");
graph.assertArtifact(artifact);
Tupelo Provenance API
• Query OPM graph by searching for patterns in RDF
Unifier u = new Unifier();
u.setColumnNames("file", "path");
u.addPattern("file", Rdf.TYPE, PC3Utilities.ns("CSV_file"));
u.addPattern("file", PC3Utilities.ns("PathToFile"), "path");
context.perform(u);
Provenance in Taverna
Thanks to Paolo Missier, University of Manchester, for following slides
Taverna
• “The Taverna workbench is a free software tool for designing and executing workflows, created by the myGrid project”
– Taverna website
Collections example: from genes to SNPs
[Figure: workflow steps: gene -> genomic region; extend region; retrieve SNPs in the region; rearrange SNP details]
• See myexperiment.org: http://www.myexperiment.org/workflows/166
[ ENSG00000139618 , ENSG00000083093 ]
[[<1,23554512,16,rs45585833>, <1,23554712,16,rs45594034>,...],[<1,31820153,13,ENSSNP10730823>, <1,31818497,13,ENSSNP10730820>,...] ]
Collections, iterations, and provenance
[Figure: the genes-to-SNPs workflow annotated with processor signatures (l(s) → l(s), l(s) → l(s), s → s, s → l(s), s → s) and the values flowing between processors, e.g. [139618, 83093], [16,13], [23520984, 31786617], [23560179, 31871809], <16, 23560179,..>, <13, 31871809,...>, [ <1,23553692,16,rs152451>,...], [<1,31840948,13,rs169546>,...]; one step is labelled “Dot product”]
Capturing provenance with iterations
• Workflow processor: the elementary graph building block
– processor P with input port X:s and output port Y:s
• Iteration due to list depth mismatch: P receives a list [a1...ai...an] on X and produces [b1...bi...bn] on Y
• Semantics: Y = (map P [a1...an]) = [ (P a1) ... (P an) ] (extends to multiple inputs...)
• Unfolding during execution: one invocation P[1] ... P[n] per list element
• OPM pattern: P[i] used ai, and bi wasGeneratedBy P[i], for i = 1...n
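The unfolding can be sketched as a map that records one pair of OPM-style edges per element. This is an illustration of the pattern only, not Taverna's implementation; the function name and the a1/b1 labelling of values are assumptions.

```python
def run_with_provenance(processor_name, function, inputs):
    """Apply `function` to each list element, recording per-invocation edges."""
    outputs, edges = [], []
    for index, value in enumerate(inputs, start=1):
        invocation = f"{processor_name}[{index}]"
        outputs.append(function(value))
        # Invocation i used input element i and generated output element i.
        edges.append((invocation, "used", f"a{index}"))
        edges.append((f"b{index}", "wasGeneratedBy", invocation))
    return outputs, edges


outputs, edges = run_with_provenance("P", lambda x: x * 2, [1, 2, 3])
```

Because each output element is linked only to its own invocation, a lineage query for `b2` leads to `a2` alone rather than to the whole input list.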
Querying provenance graphs
• Problem:
– users are rarely interested in the complete provenance graph
– it is noisy, possibly large, and difficult to navigate
• Goal: let users identify
– variables that carry interesting values for which provenance is sought
– nodes in the graph where provenance information should be reported
Provenance query - no semantics
provenance query syntax:
SELECT merged_pathways
AT get_pathways_by_genes1, mmusculus_gene_ensembl
[Figure labels: merged_pathways is the interesting value; get_pathways_by_genes1 and mmusculus_gene_ensembl are the interesting processors]
Role of semantics in provenance
[Figure: the Taverna runtime executes a workflow of processors P1 to P6 and emits the dataflow topology plus raw lineage events to a provenance capture and query processor, backed by a lineage database (RDB); this part is labelled “current implementation”. Semantic overlays add semantic resource annotations and reference ontologies. Example query: “describe the derivation of each pathway through Kegg, in which gene g is involved”]
The Provenance Challenges
Provenance Challenges 1 & 2
• IPAW 2006, HPDC 2007
• 20 teams, 1 workflow, 9 queries
• Interoperability?
– lots of manual work required
– call for standards
(source: gridprovenance.org)
Provenance Challenge 3
• Ended with a workshop in Amsterdam, 10-11th June
• Specifically aimed at interoperability
• Each team:
– Runs an astronomy data analysis process
– Executes queries on provenance
– Exports provenance as OPM
– Imports other teams’ OPM provenance and re-runs queries
Open Issues
Intention and Reason
• OPM provides a mechanistic view of what has occurred
• It does not capture assertions such as:
– X occurred because I aimed to achieve Y
– X occurred because I believed that Y was true
– X occurred because I had an obligation to ensure it did
Inference and Physical Processes
[Figure, built up over two slides: a digitally controlled process (Blood Test Request, Blood Test Results) and an inferred physical process (Sent Blood Sample, Received Blood Sample)]