Provenance abstraction for implementing security policies: Learning Health System and securing provenance of health data

Dr Vasa Curcin
King’s College London

Overview

• Learning Health System
• LHS requirements for provenance data
• TRANSFoRm project
• Transformation-oriented Access Control Language for Provenance (TACLP)

Learning Health System

“ ... one in which progress in science, informatics, and care culture align to generate new knowledge as an ongoing, natural by-product of the care experience, and seamlessly refine and deliver best practices for continuous improvement in health and health care.” (Institute of Medicine)

We can’t afford to waste data!

Learning Health System

Defining functions of an LHS are to:
1. routinely and securely aggregate data from disparate sources
2. convert the data to knowledge
3. disseminate that knowledge, in actionable forms, to everyone who can benefit from it.
c/o C. Friedman

Learning Health System take-up

• US medical/academic centres
  o Mayo, Duke, Vanderbilt
  o PCORI
• National data aggregators
  o Clinical Practice Research Datalink
  o NIVEL
• EHR vendors
  o CSC, Asseco, TPP, InPractice Systems
• European academic-industrial collaborations
  o TRANSFoRm, EHR4CR, Semantic HealthNet

…and Bill

Example: Clinical trial challenges

• Major motivation for the LHS work
• Trials too expensive and difficult to run
• Efficacy-effectiveness gap (EEG)
  o Disconnect between outcomes from clinical trials and information needed for clinical practice
  o Interaction of drug effect and real-life contextual factors
  o Challenge to identify contextual factors

• LHS provides context and workflow

LHS for Clinical Trials

• EHR integration
  o Eligibility checking done automatically from EHR data
  o eCRFs partially filled based on EHR information
  o All collected data stored in the EHR system as well as the research database
• Closing the loop
  o eCRF data enriches the EHR
  o Helps the clinician
  o Adds value to the EHR system
• Data does not go to waste!

Trust in the LHS

• Research community is struggling to ensure transparency and correctness of published research

• Reasons complex and interleaving (positive bias, intractable analysis, deluge of journals)

• A Bayer Healthcare team published work showing that only 25% of the academic studies they examined could be replicated
  o Prinz et al. Nat. Rev. Drug Discov. 10, 712, 2011
• Of 53 oncology studies from 2001-2011, each highlighting big new apparent advances in the field, only 11% (6) could be robustly replicated
  o Begley & Ellis. Nature 483, 531–533, 2012

Trust in the LHS (cont.)

• The problem is by no means restricted to preclinical studies
• Twelve randomised clinical trials tested 52 observational claims and failed to reproduce a single one
  o Young SS, Karr A. Deming, data and observational studies. Significance, Sep 2011; 8(3):116–120
• Replication of 100 experiments published in 2008 in three high-ranking psychology journals: fewer than half of the findings replicated
  o Estimating the reproducibility of psychological science. Science, Aug 2015; 349(6251)
• Random sample of 441 biomedical journal articles, 2000-2014: none made all their data available, one provided a full protocol, and the majority did not disclose funding or conflicts of interest
  o Iqbal et al. Reproducible Research Practices and Transparency across the Biomedical Literature. PLoS Biology 2016; 14(1)
• Cost of irreproducible research in the life sciences is estimated at $28 billion per year in the U.S.
  o Freedman LP, Cockburn IM, Simcoe TS. The Economics of Reproducibility in Preclinical Research. PLOS Biology, Jun 2015; 13(6)

• Each component in the healthcare system produces and consumes data:
  o Epidemiological research using record linkages
  o Research data embedded in the EHR
  o Decision support for diagnosis
• Provenance infrastructure required to support all these domains

Data in the Learning Health System

• Specific research data
  o Clinical trials
  o Controlled populations
  o Well-defined questions
• Routinely collected data
  o EHR systems
  o Wide coverage
  o Vast quantity
  o May lack detail and quality
• Actionable data
  o Distilled scientific findings
  o Usable in clinical practice
  o Decision support

TRANSFoRm project

• €7.5M, European Commission, 2010-2015
• Funded under the Patient Safety Work Program of FP7
• Developing methods, models, services, validated architectures and demonstrations to support:
  o Epidemiological research using GP records, including genotype-phenotype studies and other record linkages
  o Clinical trials embedded in the EHR
  o Decision support for diagnosis

www.transformproject.eu

TRANSFoRm software landscape

[Figure: TRANSFoRm software landscape. Components shown: RCT tools (electronic data collection); epidemiological study tools (data queries); diagnostic support tools; middleware; secure data transport; authentication framework; provenance framework; vocabulary service; data source connectivity module.]

Use case 1: Type 2 Diabetes

• Research Question: In type 2 diabetic patients, are selected single nucleotide polymorphisms (SNPs) associated with variations in drug response to oral antidiabetic drugs (Sulfonylurea)?

• Design: Case-control study

• Data: primary care databases (phenotype data) pre-linked to genomic databases (genetic risk factors) – data federation

Use case 2: Gastro-oesophageal reflux disease (GORD)

• Research Question: What gives the best symptom relief and improvement in Quality of Life: continuous or on demand Proton Pump Inhibitor use?

• Design: Randomised Controlled Trial (RCT)
• Data: Collection through EHR and web-based questionnaire: electronic case report forms and mobile Patient Reported Outcome Measures

• Provenance and security

Use case 3: Diagnostic Decision Support

• Early diagnostic suggestions for presenting problems:
  o chest pain
  o abdominal pain
  o shortness of breath
• Clinical Prediction Rule web service (with underlying ontology)
• Prototype Decision Support System integrated with a commercial electronic health record system
  o Vision by InPractice Systems

Provenance challenge for TRANSFoRm

• Viable methods for adoption in a heterogeneous software environment
  o No shared workflow middleware to rely on
• Need to achieve domain specificity
• Able to demonstrate conformance to standards
  o Title 21 of the Code of Federal Regulations; Electronic Records; Electronic Signatures (21 CFR Part 11)
  o Good Clinical Practice (GCP)
  o EudraLex Vol. 4 Annex 11: Computerised Systems in EU
  o CONSORT, STROBE, RECORD

Semantic annotations

• Semantic concepts in the provenance graph defined using TRANSFoRm ontologies:
  o Clinical Research Information Model (CRIM)
  o Software infrastructure ontology
  o Clinical evidence ontology
• Ontology concept annotations on provenance nodes
• Provenance templates define domain actions that map to provenance fragments

[Figure: provenance infrastructure. Boxes shown: PCROM (UML model); Randomised Clinical Trial Ontology (RCTO); Randomised Clinical Trial Provenance Ontology (RCTPO); provenance templates; provenance server; provenance database; existing tools. Edge labels: API service calls; OPM graphs annotated with OWL.]

1. Tools are agnostic to provenance representation
2. Service invocation matches some provenance template in the provenance server
3. Template is instantiated into a provenance graph fragment with OWL concept annotations
4. Graphs are merged inside the database
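The template mechanism in steps 2 to 4 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the template layout, the `ex:` concept names, and the `instantiate` and `merge` helpers are hypothetical and are not the TRANSFoRm provenance server API.

```python
# Hypothetical template for one service call. Node ids contain {placeholders};
# "concept" stands in for an OWL concept IRI (the ex: names are invented).
TEMPLATE = {
    "nodes": [
        {"id": "rec:{call_id}", "type": "artifact", "concept": "ex:Recommendation"},
        {"id": "run:{call_id}", "type": "process", "concept": "ex:DiagnosticRun"},
        {"id": "ehr:{patient_id}", "type": "artifact", "concept": "ex:PatientRecord"},
    ],
    "edges": [
        ("rec:{call_id}", "wasGeneratedBy", "run:{call_id}"),
        ("run:{call_id}", "used", "ehr:{patient_id}"),
    ],
}


def instantiate(template, bindings):
    """Step 3: fill the placeholders to obtain an annotated graph fragment."""
    nodes = {
        n["id"].format(**bindings): {"type": n["type"], "concept": n["concept"]}
        for n in template["nodes"]
    }
    edges = {
        (src.format(**bindings), rel, dst.format(**bindings))
        for src, rel, dst in template["edges"]
    }
    return nodes, edges


def merge(store, fragment):
    """Step 4: merge a fragment into the store; shared node ids are unified."""
    nodes, edges = fragment
    store["nodes"].update(nodes)
    store["edges"] |= edges
    return store


store = {"nodes": {}, "edges": set()}
merge(store, instantiate(TEMPLATE, {"call_id": "c1", "patient_id": "p7"}))
merge(store, instantiate(TEMPLATE, {"call_id": "c2", "patient_id": "p7"}))
```

Because both calls refer to the same patient record id, the two fragments join on that node after merging, which is how separate service invocations end up in one connected graph.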

Example: Provenance of diagnostic recommendation

Provenance security

• Use a single provenance graph for:
  o Full trial audit
  o Reporting studies
  o Publication review
  o Collaborators
  o Readers
• Need to abstract parts of the graph
• Access control and view generation for provenance graphs
  o Roxana Danger, Vasa Curcin, Paolo Missier, Jeremy Bryans. Future Generation Computer Systems, Volume 49, August 2015, Pages 8-27

Basic idea

• The aim of an access control strategy is not only to determine if the resource can be viewed or not, but to construct a view of the graph which satisfies the security constraints

• The goal is to retain the maximum amount of information

• NB Based on TRANSFoRm use cases but not implemented in the live system

Access control

• Ensuring that a principal (person, process, etc.) can only access the services or data in a system that they are authorized to access

• Implemented through security policies that try to enforce a certain protection goal such as to prevent unauthorized disclosure (secrecy) and intentional or accidental unauthorized changes (integrity)

• Authorizations for some resource can be:
  o Positive (allow)
  o Negative (deny)

Access control

• Two classical approaches:
  o Closed policy
    • Deny-by-default
    • Access to a resource is only granted if a corresponding positive authorization policy exists
  o Open policy
    • Permit-by-default
    • Access is granted unless a corresponding negative authorization policy exists
• Combined approach used to support policy exceptions
• Conflict resolution needed if multiple policies apply, e.g.
  o denials-take-precedence
  o most-specific-takes-precedence
  o priority levels
  o time-dependent access
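A minimal sketch of how closed and open defaults interact with denials-take-precedence conflict resolution. The `Authorization` record and the `decide` function are hypothetical names chosen for this illustration and do not come from any particular access control framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Authorization:
    subject: str
    resource: str
    allow: bool  # True = positive authorization, False = negative


def decide(auths, subject, resource, default="closed"):
    """Return True if access is granted.

    Conflicts are resolved with denials-take-precedence; when no
    authorization matches, the closed/open default decides.
    """
    matching = [a for a in auths if a.subject == subject and a.resource == resource]
    if any(not a.allow for a in matching):
        return False  # a single denial overrides any permits
    if any(a.allow for a in matching):
        return True
    return default == "open"  # closed: deny-by-default, open: permit-by-default


auths = [
    Authorization("gp", "ehr", True),
    Authorization("gp", "trial", True),
    Authorization("gp", "trial", False),  # conflicting exception
]
```

With these authorizations, the GP can read the EHR, the conflicting pair on the trial resolves to a denial, and a subject with no matching authorization is refused under the closed default but admitted under the open one.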

Access control languages for provenance

• Qin Ni et al.
  o Semantic description of subjects (user roles) and resources to be accessed
  o Conditions under which restrictions are applied
  o Four different types of access permissions
• Cadenhead et al.
  o Added regular expressions for resource and condition descriptions
• Transformation-oriented Access Control Language for Provenance (TACLP)
  o Allows users to define subgraphs to be transformed, with three different levels of abstraction (namely hide, minimal and maximal)

Indirect relations

• Introduce some new relations to be used in abstraction

External effects and causes

• External effects and causes of a set of nodes S w.r.t. a set of nodes R
  o Set of nodes that represent the immediate effects/causes of S that would be affected by removal of the nodes in R from the graph
  o If S = R, these are denoted ef(R) and ca(R)
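One possible reading of this definition, sketched under explicit assumptions: edges are stored as (effect, cause) pairs pointing from an effect to its cause, S is a subset of R, and the external causes (effects) of S w.r.t. R are the nodes outside R reached from S by following edges forwards (backwards) through nodes of R only. The function names are hypothetical, and the formal definition in the underlying paper may differ in detail.

```python
def _frontier(start, removed, neighbours):
    """Nodes outside `removed` reached from `start` via `removed` nodes only."""
    seen, found, stack = set(start), set(), list(start)
    while stack:
        node = stack.pop()
        for nxt in neighbours(node):
            if nxt in removed:
                if nxt not in seen:  # keep walking inside the removed region
                    seen.add(nxt)
                    stack.append(nxt)
            else:
                found.add(nxt)  # first surviving node on this path
    return found


def external_causes(edges, s, r):
    """Assumed reading of ca: surviving causes that S depends on through R."""
    return _frontier(s, r, lambda n: [c for e, c in edges if e == n])


def external_effects(edges, s, r):
    """Assumed reading of ef: surviving effects that depend on S through R."""
    return _frontier(s, r, lambda n: [e for e, c in edges if c == n])


# x and z depend on the region {a, b}, which in turn depends on y.
edges = {("x", "a"), ("a", "b"), ("b", "y"), ("z", "b")}
R = {"a", "b"}
```

For S = R the two functions return the direct neighbours of the region, which corresponds to ef(R) and ca(R) under this reading.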

External effects and causes

Basic operations

• Node removal
  o Subgraph needs to be hidden
  o e.g. if it is unnecessary for an analysis or user access to it has been restricted
• Node replacement
  o Removing details of data and operations in a subgraph while retaining some information (an abstract entity) about the existence of that subgraph

Operation: node removal

• Let Prov = (V, E, type) and let R ⊆ V be a set of nodes to be removed. The result is a new provenance graph Prov′ = (V′, E′, type′), such that:

Operation: node replacement

• As before, with operation AR replacing node set R with node va
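The formal conditions for the two operations are not reproduced in this text, so the following is only a sketch under stated assumptions: edges are (effect, cause) pairs; removal deletes R with its incident edges and links every external effect of R to every external cause of R; replacement routes the same dependencies through a single abstract node. The helper names `boundary`, `remove_nodes` and `replace_nodes` are hypothetical, and edge types are ignored.

```python
def boundary(edges, r):
    """Direct effects and causes of the node set r that lie outside r."""
    effects = {e for e, c in edges if c in r and e not in r}
    causes = {c for e, c in edges if e in r and c not in r}
    return effects, causes


def remove_nodes(edges, r):
    """Assumed removal: drop r, then connect its effects to its causes."""
    effects, causes = boundary(edges, r)
    kept = {(e, c) for e, c in edges if e not in r and c not in r}
    return kept | {(e, c) for e in effects for c in causes}


def replace_nodes(edges, r, va):
    """Assumed replacement: collapse r into the abstract node va."""
    effects, causes = boundary(edges, r)
    kept = {(e, c) for e, c in edges if e not in r and c not in r}
    return kept | {(e, va) for e in effects} | {(va, c) for c in causes}


edges = {("x", "a"), ("a", "b"), ("b", "y")}
removed = remove_nodes(edges, {"a", "b"})
replaced = replace_nodes(edges, {"a", "b"}, "va")
```

In this chain example, removal leaves a single edge from x to y, while replacement keeps a trace of the hidden subgraph as x → va → y.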

Abstract nodes and edges

• Dummy nodes introduced during entity replacement
• Preserve the causality of the rest of the graph
• Two types of dependencies:
  o Indirect
    • Denoted with double lines
    • Represent multi-step dependencies (wdf+, u+, wgb+, wtb+)
  o Soft dependencies
    • Denoted with double dashed lines
    • Generic transitive relationship which is not one of the above

Removal and Replacement

[Figures: two example graphs, each showing the results of Replace (A,B) and Remove (A,B)]

False dependencies

• False dependencies introduce a previously non-existent path in the new graph, e.g. removing A, B

Causality preserving transformation

• A transformation is called causality preserving if it does not introduce false dependencies.

• Given a provenance graph and a set of entities to be abstracted/hidden, the question is how can these entities be joined or removed from the graph using only causality-preserving transformations?
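A false dependency can be detected by comparing reachability before and after a transformation. The sketch below assumes edges are (effect, cause) pairs and that a false dependency is a pair of surviving nodes connected by a path in the transformed graph but not in the original; `reachable` and `false_dependencies` are hypothetical helper names, and the two transformed edge sets are written out by hand for illustration.

```python
def reachable(edges, start):
    """All nodes reachable from `start` by following effect -> cause edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for effect, cause in edges:
            if effect == node and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


def false_dependencies(original, transformed):
    """Pairs (u, v) of surviving nodes with a path only in the new graph."""
    orig_nodes = {n for edge in original for n in edge}
    new_nodes = {n for edge in transformed for n in edge}
    survivors = orig_nodes & new_nodes
    found = set()
    for u in survivors:
        before = reachable(original, u)
        for v in reachable(transformed, u) & survivors:
            if v not in before:
                found.add((u, v))
    return found


# Two unrelated chains: x -> a -> y and z -> b -> w.
original = {("x", "a"), ("a", "y"), ("z", "b"), ("b", "w")}
# Removing {a, b} as one unit links every effect to every cause.
joint = {("x", "y"), ("x", "w"), ("z", "y"), ("z", "w")}
# Removing {a} and {b} separately keeps the chains apart.
separate = {("x", "y"), ("z", "w")}
```

Removing A and B together makes x appear to depend on w and z on y, which is the kind of non-existent path the slide refers to; treating the two nodes separately avoids it.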

Causality preserving partition and transformation

• Given a set of nodes R ⊆ V, a causality-preserving partition of R is one for which removing or replacing any set in the partition does not introduce false causal dependencies
• A graph transformation by partition of R is then a sequential application of Rem_p or Rep_p to the sets of the partition
• The necessary and sufficient condition for such a transformation to be causality preserving is that, for each set P in the partition, all of P’s external causes and effects are connected

Optimal causality preserving partition

• Default partition of R consists of singletons, i.e. each node in R is a set in the partition.

• Optimal partition is one in which no two sets have the same sets of external causes and effects w.r.t. R
• Partitioning algorithm
  o Step 1: determine external causes and effects for the default partition
  o Step 2: gradually merge the sets of the partition until it is optimal
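A sketch of one reading of the two steps, not the published algorithm: start from singletons, compute each block's external effects and causes w.r.t. R, and merge any two blocks whose pairs coincide until none are left. It assumes (effect, cause) edge pairs and that external causes and effects are the nodes outside R reached through R-only paths; `external` and `optimal_partition` are hypothetical names.

```python
def external(edges, s, r, forward):
    """forward=True: external causes of s w.r.t. r; False: external effects."""
    seen, found, stack = set(s), set(), list(s)
    while stack:
        node = stack.pop()
        for e, c in edges:
            src, dst = (e, c) if forward else (c, e)
            if src != node:
                continue
            if dst in r:
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
            else:
                found.add(dst)
    return found


def optimal_partition(edges, r):
    """Merge blocks pairwise while two share the same (effects, causes)."""
    def signature(block):
        return (frozenset(external(edges, block, r, forward=False)),
                frozenset(external(edges, block, r, forward=True)))

    blocks = [frozenset([n]) for n in sorted(r)]  # Step 1: default partition
    merged = True
    while merged:  # Step 2: pairwise merging
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if signature(blocks[i]) == signature(blocks[j]):
                    blocks[i] = blocks[i] | blocks[j]
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return set(blocks)


# Chain x -> a -> b -> y plus an unrelated chain z -> c -> w.
edges = {("x", "a"), ("a", "b"), ("b", "y"), ("z", "c"), ("c", "w")}
partition = optimal_partition(edges, {"a", "b", "c"})
```

Nodes a and b share the same external effect and cause, so they end up in one block, while c stays on its own. The pairwise comparison mirrors the quadratic behaviour described for the partition step.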

Provenance graph transformation algorithm

• Once the partition is computed, the transformations are iteratively applied to each element in the partition

• Labels input provides names for generated abstract nodes

• Levels input provides the abstraction level for each set in the partition
  o Hide
    • remove operation
  o Minimum abstraction, maximum abstraction
    • replace operation
    • isolated singletons removed as a special case

Computational efficiency

• Transformation algorithm performance depends on the performance of the partition algorithm

• The other steps are linear in the cardinality of the set of partitions and its edges

• The partition algorithm considers pair-wise combinations of nodes.

• Overall complexity is O(|R|²), where R is the set of nodes to abstract

Experimental results

• Provenance view transformation algorithm was implemented in Python 2.7 using the NetworkX library.
• Experiments were executed on Ubuntu 12.04, on an Intel Core i7-3687U CPU at 2.10 GHz with 8 GB RAM

• Synthetic provenance graphs used, randomly generating edges for each node within the degree range 2-10

• Two parameters:
  o the percentage of nodes to abstract (from 5 to 25 with a step of 5)
  o the percentage of nodes to abstract which are causally dependent (from 0 to 100 with a step of 25)
• Each configuration was executed 10 times and the plots presented show the averages of these executions.
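A rough sketch of how such synthetic inputs could be generated, using only the standard library. This is not the generator used in the experiments: it assumes that each node draws between 2 and 10 causes from earlier-numbered nodes (which keeps the graph acyclic), and it does not model the second parameter, the share of causally dependent nodes. Function names and seeds are hypothetical.

```python
import random


def synthetic_graph(n, min_deg=2, max_deg=10, seed=0):
    """Random DAG as (effect, cause) pairs; causes are earlier-numbered nodes."""
    rng = random.Random(seed)
    edges = set()
    for node in range(1, n):
        # The first few nodes have fewer candidates than the drawn degree.
        k = min(rng.randint(min_deg, max_deg), node)
        for cause in rng.sample(range(node), k):
            edges.add((node, cause))
    return edges


def pick_nodes_to_abstract(n, percent, seed=0):
    """Uniformly choose the given percentage of nodes for abstraction."""
    rng = random.Random(seed)
    return set(rng.sample(range(n), n * percent // 100))


edges = synthetic_graph(200)
to_abstract = pick_nodes_to_abstract(200, 25)
```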

Performance behaviour

• Execution time (Y) in seconds as a function of the number of nodes (X) and the percentage of nodes to abstract (Z)

• Quadratic time

Use case: Access to health data

• Access control for the provenance data collected from an Electronic Health Record (EHR) and clinical trial systems

• Rules:
  o Auditors. Healthcare system auditors or law enforcement agencies can access the whole provenance graph during the auditing process.
  o Family doctors (FDs) and patients. Electronic health records and their data provenance can only be accessed by patients during weekends, and by FDs during weekdays.
  o Active FDs. Active FDs have access to the EHRs of their patients and the associated provenance data.
  o Clinical trial 1. If some data comes from a clinical trial, the GP needs to be a participant in the trial to see the subgraph associated with that trial.
  o Clinical trial 2. Patients do not have access to clinical trial processes.
  o Laboratory. Patients do not have access to laboratory processes.
  o Automatic diagnosis recommendation. Patients have no access to any information related to the automatic diagnosis recommendation, nor to the graph segment connecting it with the clinical evidence.
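The first two rules combine a role check with a time condition. As a small illustration only (the role names and the `may_access` function are hypothetical and cover just the auditor and weekday/weekend rules, ignoring the audit-period condition):

```python
from datetime import date


def may_access(role, day):
    """Time-dependent access to EHR provenance under the first two rules."""
    weekend = day.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    if role == "auditor":
        return True  # auditors see the whole graph
    if role == "patient":
        return weekend
    if role == "family_doctor":
        return not weekend
    return False  # closed policy for any other role


saturday = date(2024, 1, 6)
wednesday = date(2024, 1, 3)
```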

TACLP

• Transformation-oriented Access Control Language for Provenance (TACLP)

• Extends the works of Ni and Cadenhead by introducing transformations

• A policy consists of:
  o Target
  o Effect
  o Transformation
  o Condition (optional)
  o Obligation (optional)
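The policy structure can be pictured as a small record type. This is a sketch of the listed elements only, not TACLP's concrete syntax: the class and field names are hypothetical, the effect and abstraction values are the ones named elsewhere in this talk, and the `ex:` IRIs are invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Effect(Enum):
    ABSOLUTE_PERMIT = "absolute permit"
    DENY = "deny"
    NECESSARY_PERMIT = "necessary permit"
    PERMIT = "permit"


class Abstraction(Enum):
    HIDE = "hide"
    MINIMUM = "minimum"
    MAXIMUM = "maximum"


@dataclass(frozen=True)
class Target:
    subjects: frozenset  # IRIs of the users the policy applies to
    records: frozenset  # IRIs of the resources the policy applies to
    restriction: Optional[str] = None  # optional conditional expression
    scope: Optional[str] = None  # optional, e.g. "transferable"


@dataclass(frozen=True)
class Policy:
    target: Target
    effect: Effect
    transformation: Abstraction
    condition: Optional[str] = None
    obligation: Optional[str] = None


# Illustrative policy: patients do not see laboratory processes.
lab_policy = Policy(
    target=Target(frozenset({"ex:Patient"}), frozenset({"ex:LaboratoryProcess"})),
    effect=Effect.DENY,
    transformation=Abstraction.HIDE,
)
```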

TACLP Target

• Subject element
  o Set of users to which the policy should be applied, expressed through IRI references
• Record element
  o Set of resources to which the policy should be applied, expressed through IRI references
• Restriction element (optional)
  o A conditional expression under which the policy is applied
  o Either a relational comparison between a value in a property path and a literal, or a full logical expression
• Scope element (optional)
  o Whether the policy is ‘transferable’ or ‘non-transferable’ with respect to subjects
  o Whether it applies to all the ancestors of matched elements in the graph, or to the matched elements only

TACLP Effect

• Specifies the intended outcome
• Four possibilities:
  o Absolute permit guarantees access to the graph regardless of the effect of other policies
    • e.g. for allowing access to auditors or law enforcement agencies; avoids the need for additional conditions in deny policies
  o Deny guarantees that certain parts of the graph will not be accessed by users in the subject element
  o Necessary permit is used to describe the necessary, but not always sufficient, conditions for accessing certain parts of the graph
  o Permit is used to describe those parts of the graph that can be accessed if there are no other policies denying access to them

TACLP Transformation

• How to transform the provenance graph in order to hide certain resources

• Specification of which nodes need to be hidden and Removal/Replace operations to be applied to them

• Set of policies comprising
  o Policy type (target, record, condition, effect, transformation element and obligation)
  o Policy evaluation type (deny-takes-precedence or permit-takes-precedence)

TACLP Transformation

• Abstraction level
  o Hide
    • Matched nodes of the subgraph have to be completely hidden (removed) from the graph
    • Remove transformation is applied
  o Minimum abstraction
    • Replace transformation is applied
    • No caused-by relationship (soft dependencies) will appear in the transformed graph
  o Maximum abstraction
    • Replace transformation is applied
    • Soft dependencies can appear in the transformed graph

Access control evaluation algorithms

• Aim to produce an abstracted graph that satisfies the constraints

• Deny-takes-precedence
  1. Absolute permit policies evaluated first
  2. Necessary permit and deny policies
  3. Permit policies
• Permit-takes-precedence
  1. Absolute permit policies evaluated first
  2. Necessary permit policies
  3. Permit policies
  4. Deny policies
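The two orderings can be expressed as a ranking over policy effects. The sketch below only shows the ordering step, not the graph transformation itself; the `ORDER` table and `evaluation_order` are hypothetical names, and policies are reduced to (name, effect) pairs. The second mode, permit-takes-precedence, is the allow-takes-precedence ordering.

```python
# Evaluation stages per mode; effects in the same set share a stage.
ORDER = {
    "deny-takes-precedence": [
        {"absolute permit"},
        {"necessary permit", "deny"},
        {"permit"},
    ],
    "permit-takes-precedence": [
        {"absolute permit"},
        {"necessary permit"},
        {"permit"},
        {"deny"},
    ],
}


def evaluation_order(policies, mode):
    """Sort (name, effect) pairs into the evaluation order for `mode`.

    sorted() is stable, so policies within one stage keep their input order.
    """
    rank = {effect: i for i, stage in enumerate(ORDER[mode]) for effect in stage}
    return sorted(policies, key=lambda policy: rank[policy[1]])


policies = [
    ("p1", "deny"),
    ("p2", "permit"),
    ("p3", "absolute permit"),
    ("p4", "necessary permit"),
]
deny_first = [name for name, _ in evaluation_order(policies, "deny-takes-precedence")]
permit_first = [name for name, _ in evaluation_order(policies, "permit-takes-precedence")]
```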

Example: Source provenance graph

Example: Abstracted provenance graph

Summary

• The Learning Health System presents a new set of challenges for the medical and informatics communities
• Provenance can help establish trust in the LHS
• Methods needed to verify trust
• Abstraction of provenance traces needed to address requirements of multiple stakeholders
  o Researchers
  o Regulators
  o Publishers
• Future work
  o Projects running on provenance of decision support and visual analytics for health data
  o Looking for partnerships to investigate applications of the security work

Acknowledgements

• Thanks to:
  o Roxana Danger
  o Paolo Missier
  o Jeremy Bryans
  o Derek Corrigan
  o Brendan Delaney

Questions?

Thank you!