ERICA A cloud orchestration meta-framework for secure health data analytics Tim Churches SW Sydney Clinical School & Centre for Big Data Research in Health UNSW Medicine

ERICA · ERICA virtual workstations • Most current users Windows 7 or Windows 10 • Linux workstations available • Software can be pre-installed in workstation images (up to



ERICA

A cloud orchestration meta-framework for secure health data analytics

Tim Churches

SW Sydney Clinical School & Centre for Big Data Research in Health

UNSW Medicine


Why we need secure platforms for health data analysis


Tran B, Straka P, Falster MO, Douglas KA, Britz T, Jorm LR. Overcoming the data drought: exploring general practice in Australia by network analysis of big data. Med J Aust 2018; 209(2):68-73

Overcoming the GP data drought

• No systematically reported national data on the size and structure of general practices in Australia

• Network analysis of 21 years of Medicare claims shows:

• general practices have increased in size

• continuity of care and patient loyalty have remained stable

• greater sharing of patients by GPs is associated with greater patient loyalty

• This new approach allows continuous monitoring of the characteristics of Australian general practices


Re-operation after breast conserving surgery

• Linked hospital inpatient and death data for NSW

• Primary unilateral or bilateral BCS

• 90-day reoperation (re-excision or mastectomy)

• 29% overall re-operation

• 17% BCS

• 12% mastectomy

• ↑ BCS over time, ↓ mastectomy over time

• Significant variation by hospital

van Leeuwen MT, Falster MO, Vajdic CM, Crowe PJ, Lujic S, Klaes E, Jorm L, Sedrakyan A. Reoperation after breast-conserving surgery for cancer in Australia: statewide cohort study of linked hospital data. BMJ Open 2018, vol. 8, pp. e020858.


Sydney Morning Herald


Why we need secure platforms for health data analysis

• Current research at UNSW Medicine (CBDRH, SPHCM, SWSCS, MRIs) for health services research, clinical epidemiology and ML research

• whole-of-NSW-Health administrative data (hospital admissions, ED visits, cancer registry, death certificates) linked at person level over a 15-year span

• linked MBS-PBS data (not the retracted dataset!)

• EMR and cancer information system data from specific hospitals

• DVA linked data

• 25% subset of NPS MedicineWise data

• All of these are de-identified

• But all of these are potentially re-identifiable

• Must be kept safe!

• Security requirements that exceed “…data will be stored on a password-protected file server…”


ERICA: key features

• Provides up to 256 secure remote-access analysis project spaces per instance

• “Enclave” model: each project space is completely self-contained and disconnected from other projects, from the internet, and from the users’ desktops

• Provides an invigilated gateway for data coming in and research results going out, with complete audit trail

• Uses Amazon Web Services (AWS) commercial cloud computing

• Leverages the features and scalability of AWS

• Different OS and workspace configurations

• High performance computing

• Multiple storage and pricing options


ERICA: key features

• Is institution-based

• Governed and managed by a host institution and its policies and procedures

• Multiple instances (‘clones’), governed by different host institutions, can be established (anywhere that AWS operates), currently:

• UNSW

• Australian Institute of Health and Welfare (AIHW SRAE)

• NSW Government Data Analytics Centre

• A code-driven ‘orchestration framework’

• Testable and tested for correct behaviour

• System administrators do not manually configure resources

• Project space configuration is point-and-click

• Minimises human error

• Accredited by eHealth NSW under their Privacy and Security Assessment Framework (PSAF) to hold fully-identified NSW Health data


Typical ERICA virtual workstation


ERICA virtual workstations

• Most current users use Windows 7 or Windows 10

• Linux workstations available

• Software can be pre-installed in workstation images (up to 100)

• e.g. MS Office, SAS, SPSS, Stata, R, Python, TensorFlow etc. pre-installed

• System administrators can define additional HPC resources via templates, restricted to specific project spaces e.g.

• Linux compute server with multiple high-end GPU cards

• Apache Spark cluster with many nodes

• End-users in the project space can start and stop these on demand and are given warnings if left running!
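
The "warnings if left running" behaviour could be sketched as below. This is only an illustration: the slides do not say how ERICA detects or reports long-running resources, so the `WARN_AFTER` threshold, the `left_running` helper and the resource names are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical policy: warn once a resource has been running this long.
WARN_AFTER = timedelta(hours=8)


def left_running(started_at, now, warn_after=WARN_AFTER):
    """Return the names of resources that have run longer than warn_after.

    started_at maps a resource name to its start time, or None if stopped.
    """
    return sorted(
        name
        for name, started in started_at.items()
        if started is not None and now - started > warn_after
    )


# Hypothetical on-demand resources in one project space.
now = datetime(2019, 1, 1, 18, 0)
resources = {
    "gpu-server": datetime(2019, 1, 1, 8, 0),      # running for 10 hours
    "spark-cluster": datetime(2019, 1, 1, 15, 0),  # running for 3 hours
    "old-cluster": None,                           # stopped
}
overdue = left_running(resources, now)
```

Here only `gpu-server` exceeds the assumed eight-hour threshold, so it is the only resource that would trigger a warning.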


‘Five safes’ framework

1. Safe Projects: Is this use of the data appropriate?

2. Safe People: Can the researchers be trusted to use it appropriately?

3. Safe Data: Is there a disclosure risk in the data itself?

4. Safe Settings: Does the access facility limit unauthorised use?

5. Safe Outputs: Are the statistical results non-disclosive?

Desai T, et al. Five Safes: designing data access for research. Economics working paper series 1601. Bristol: University of the West of England, 2016


ERICA: Safe projects

• Policies set by host institution

• UNSW ERICA

• Projects must have data custodian and ethics approvals

• Projects must therefore meet NHMRC guidelines for human research

• ERICA must be named as data storage and analysis facility on HREC applications


ERICA: Safe people

• Roles defined in ERICA code and assigned to individuals according to policies of host institution

• System Administrator

• Project Chief Investigator

• Project Controller

• Project Manager

• Project Researcher

• Online training module for researchers

• With an exam that must be passed…


ERICA: Safe data

• Designed for research using sensitive microdata

• The datasets, variables, level of detail and any suppression or perturbation are governed by host institution’s policies and data provider policies

• UNSW ERICA: governed according to data custodian and ethics approvals

• Project Controller checks and approves all inbound files

• Role can be assigned to data custodian nominee (e.g. AIHW staff member) or research team member

• Data custodians can upload encrypted data themselves through eHub or large file ingress facility

• By carefully attending to the other four “Safes”, ERICA and similar secure analysis platforms dramatically reduce the level of anonymisation which data providers and data custodians need to do

• Data anonymisation is the enemy of quality research and effective ML model development


ERICA: Safe settings - threat model

• Basic premise: researchers are honest-but-sloppy

• Ignorant of IT security

• Reliant on institutional IT security

• Driven by convenience

• Designed to protect against

• Innocent acts-of-omission by researchers

• Acts-of-carelessness by researchers

• Malicious acts by non-users (i.e. external hackers)

• But not necessarily malicious acts-of-commission by researchers

• e.g. Filming the screen as they scroll through data


ERICA: Safe settings – identity and authentication

• Authentication and authorisation use a Microsoft Active Directory instance specific to ERICA

• ERICA user accounts are assigned to one or more roles (e.g. Project Controller, Project Researcher) for each project space

• At all external access points, users authenticate themselves using a single set of login credentials (account name and password) plus a mandatory multi-factor authentication code (using smartphone)

• External access points can be further restricted to specific IP address ranges or source networks, or client-side digital certificates can be used to restrict access to specific devices (e.g. specific laptop or desktop computers)

• e.g. the UNSW Medicine ERICA instance is accessible only from the UNSW internal network, behind the main UNSW firewall, so it has no Internet-facing interface
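
The three checks described on this slide (source network, multi-factor authentication, per-project role) can be combined as in the following minimal sketch. It is not ERICA's code: the slides say authentication uses Active Directory, whereas here the `Account` class, the `may_access` function and the `10.0.0.0/8` allow-list are hypothetical stand-ins, and the MFA result is simply passed in as a boolean.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network

# Hypothetical allow-list; 10.0.0.0/8 stands in for an institution's internal range.
ALLOWED_NETWORKS = [ip_network("10.0.0.0/8")]


@dataclass
class Account:
    name: str
    # Maps a project space name to the set of roles held in that space.
    roles: dict = field(default_factory=dict)


def source_allowed(source_ip: str) -> bool:
    """Return True if the client address falls inside an allowed network."""
    addr = ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


def may_access(account: Account, project: str, source_ip: str, mfa_ok: bool) -> bool:
    """Grant access only if the network, MFA and role checks all pass."""
    has_role = bool(account.roles.get(project))
    return source_allowed(source_ip) and mfa_ok and has_role


# Hypothetical user holding one role in one project space.
alice = Account("alice", {"project-a": {"Project Researcher"}})
```

A request from outside the allowed range, without a valid MFA code, or for a project space in which the account holds no role is refused.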


Logging into ERICA

AWS Desktop client


ERICA: Safe settings – movement of data

• All research data held in ERICA are encrypted both at rest and in transit

• AWS key management and encryption services are used to strongly encrypt all EBS and S3 data stores used by ERICA

• Secure protocols, including HTTPS (TLS v1.3), LDAPS, scp and encrypted SMB/CIFS are used for all communications and data movement

• Users can only import or export data via a controlled gateway mechanism known as the Hub

• All other file or data ingress and egress mechanisms, including clipboard, email, messenger services, printing services and internet access, are blocked by two independent and redundant layers in the system network architecture.

• Project workspaces are isolated from each other, and no data can be transferred between them (except via the Hub)
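
As an illustration of the "secure protocols" point, a client can refuse anything older than TLS 1.3 using Python's standard `ssl` module. This is a sketch of the general technique only; the slides do not show how ERICA configures its endpoints, and the variable name `context` is arbitrary.

```python
import ssl

# Start from the library defaults: certificate verification and
# hostname checking are both enabled.
context = ssl.create_default_context()

# Refuse to negotiate any protocol version older than TLS 1.3.
context.minimum_version = ssl.TLSVersion.TLSv1_3
```

A connection wrapped with this context fails its handshake if the server cannot offer TLS 1.3.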


Importing and exporting


ERICA: Safe settings – logging and audit

• All data movements inbound to and outbound from ERICA are fully logged and subject to full-copy audit trails

• An activity trail displays the time, project and the action that a particular user has taken within the system regarding data movement

• A checksum of the imported/exported file is maintained and logged to ensure the file has not been modified during ingress/egress.

• All logging is aggregated into AWS CloudWatch, which provides a single unalterable and digitally signed and timestamped source of information for auditing purposes

• Key security event logs include those generated by: border routing devices, network and application firewalls, intrusion detection, anti-virus and malicious code protection services, internet-connected services

• Automated log analysis and notification using industry standard tools is currently being implemented
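
The checksum-and-activity-trail idea can be sketched as follows. The slides do not name the checksum algorithm or the log format, so SHA-256, the `audit_record` fields and the helper names are assumptions made for illustration; a real deployment would ship such records to the central log service instead of keeping them in a local dictionary.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_record(user: str, project: str, action: str, path: Path) -> dict:
    """Build a hypothetical activity-trail entry for one data movement."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "project": project,
        "action": action,
        "file": path.name,
        "sha256": sha256_of(path),
    }


def unchanged(record: dict, path: Path) -> bool:
    """Return True if the file still matches the logged checksum."""
    return record["sha256"] == sha256_of(path)


# Demonstration with a throwaway file standing in for an export.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "results.csv"
    export.write_bytes(b"group,count\nA,120\n")
    record = audit_record("alice", "project-a", "export", export)
    intact = unchanged(record, export)        # file not yet modified
    export.write_bytes(b"group,count\nA,121\n")
    tampered = not unchanged(record, export)  # modification is detected
```

Comparing the logged checksum with a freshly computed one on arrival is what lets an auditor confirm that the file was not altered in transit.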


ERICA: Safe outputs

• Project Controller checks and approves all outbound files

• Project Controller role is assigned according to policies of host institution

• Can be assigned to data custodian nominee (e.g. AIHW staff member) or research team member

• Confidentialisation applied according to policies of host institution

• Users are trained in the principles of Statistical Disclosure Control (SDC)

• SDC tools provided

• Expert SDC help available to end users on-demand
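
One common SDC step is primary suppression of small cells in a frequency table before release. The sketch below illustrates that general technique, not ERICA's SDC tools, which the slides do not describe; the threshold of 5, the function name and the counts are all hypothetical, and real rules are set by the host institution and data custodians.

```python
# Hypothetical threshold: non-zero counts below this are suppressed.
MIN_CELL_COUNT = 5


def suppress_small_cells(table, threshold=MIN_CELL_COUNT):
    """Replace non-zero counts below the threshold with None."""
    return {
        cell: (None if 0 < count < threshold else count)
        for cell, count in table.items()
    }


# Made-up counts, for illustration only.
counts = {"Hospital A": 120, "Hospital B": 3, "Hospital C": 0}
safe_counts = suppress_small_cells(counts)
```

This sketch does not handle secondary suppression: if row or column totals are also released, further cells may need to be withheld so that a suppressed value cannot be recovered by subtraction.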


ERICA licensing model

• Institutions manage and operate their own ERICA instances

• Employ their own system administrator/s

• Responsible for user accounts

• Responsible for end-user software licenses (e.g. SAS)

• Provide tier one user support

• Set up user accounts, projects

• Triage user issues

• Apply their own policies

• Control the allocation of project roles, auditing etc

• Pay a license fee to UNSW

• Participate in user community and development roadmap

• Shared training and help desk (including SDC) resources and services


ERICA licensing model

• UNSW

• Has no access to other instances at all

• Manages the ERICA master code repository

• Manages development and testing

• Provides tier two support – escalate to AWS or engage developers if code fix is required

• Other ERICA instances are ‘clones’

• Updates from master repository are pulled by each instance

• DevOps model for easy deployment of updates


ERICA: future plans

• Expand user base

• UNSW ERICA: cross-Faculty projects

• AIHW SRAE: soft launch 2019, hard launch 2020

• Additional ERICA instances

• eHealth NSW/NSW Ministry of Health

• Being evaluated by NSW govt Data Analytics Centre

• Four other Australian universities considering instances

• New and enhanced features

• Re-engineer some components to use microservices to further reduce costs

• Further streamline setup of new instances of ERICA

• Further streamline on-demand end-user access to HPC

• Possibly diversify to other cloud providers that meet on-shore and security standards (e.g. Australian government IRAP accreditation)


Australasia’s First Postgraduate Programs in Health Data Science

Find out more about the programs:

[email protected]

+61 2 9385 9064

cbdrh.med.unsw.edu.au/study-with-us

Master of Science

Graduate Diploma

Graduate Certificate
