
The Francis Crick Institute

Creating a Research Computing Platform for the Science of Human Health

David Fergusson

Introduction


Challenges for biomedical “big data” science

– Distributed data sets

– Distributed computing resources

– Separate authentication/authorisation mechanisms

– Researchers want to combine and synthesise data

How do we do this?

Example

Dr David Fergusson, Head of Scientific Computing, Francis Crick Institute

Challenges of providing shared platforms for staff from existing institutes
– CRUK London Research Institute
– National Institute for Medical Research

Compute and data requirements for 1,250 scientists working in biomedicine
– In a central London building

Direction of travel towards more and wider collaboration, with a requirement for controlled sharing of sensitive data


Photo credit: Francis Crick Institute

Addressing the problem in the short term

SafeShare – shared secure authorisation/authentication

Shared Data Centre(s) – avoid costly/insecure moving of data

eMedlab – collaborative science/shared operations model


UK e-Infrastructure

A new bottom-up approach


People’s National eInfrastructure

[Diagram: components of a national eInfrastructure – Medical Bioinformatics (MRC £120M), business and local government data (ESRC £64M), secure access, international links (e.g. Uganda)]

What has worked?

Consolidation through collaboration

Swansea: One system supporting Farr Wales, ADRC Wales, MRC CLIMB, Dementia Platform UK

Scotland: EPCC supporting Farr Scotland and ADRC Scotland, leveraging expertise from Archer, UK-RDF

Leeds: ARC supporting Farr HeRC, Leeds Med Bio, Consumer Data RC

Slough DC: eMedLab, Imperial Med Bio, KCL bio cluster

Jisc network: Safe Share

Jisc Safe Share


John Chapman, Deputy head, information security, Jisc

The safe share project

Assent:

A single, unifying technology that enables you to effectively manage and control access to a wide range of web and non-web services and applications. These include cloud infrastructures, high-performance computing, grid computing and commonly deployed services such as email, file store, remote access and instant messaging.


Safe Share:

Providing and building services on an encrypted VPN infrastructure between organisations. Enhanced confidentiality and integrity requirements per ISO 27001. Requirement to move electronic health data securely and support research collaboration. Working with biomedical researchers at the Farr Institute, the MRC Medical Bioinformatics initiative and the ESRC Administrative Data Centres.


Drivers

• Requirement for connectivity to move and access electronic health data securely

• Challenge to give public confidence that data is appropriately protected

• Provide economies of scale in secure connectivity


• Jisc management and funding of £960k to pilot potential solutions with the aim of developing a service in 2016/17

Partners


University of Bristol

Cardiff University

University of Leeds

Swansea University

University of Edinburgh

UCL

Francis Crick Institute

University of Oxford

University of Southampton

University of Manchester

St Andrews University

The Farr Institute

The MRC Medical Bioinformatics initiative

The Administrative Data Research Network



Authentication, Authorisation and Accounting Infrastructure (AAAI)

Use Cases:
• HeRC, N8 HPC – access between facilities using home institution credentials

• eMedLab – partners will be able to use a common AAAI to access this new system (for analysis of, for instance, human genome data, medical images, and clinical, psychological and social data)

• Swansea University Health Informatics Group – investigating Moonshot as an authentication mechanism to allow use of home institution credentials

• University of Oxford: to enable researchers to use home institution credentials for authentication to request access to datasets for studies e.g. into dementia


Example “service slice”: Farr

[Diagram: institution LANs with a Safe Share router at the edge, connected across Janet, the internet or other networks to the Safe Share core and on to the Farr trusted environments]

[Diagram: the same service slice, with the Safe Share core also linking to a UK academic shared data centre]


Shared data centre

£900K investment from HEFCE. Anchor tenants:

– Francis Crick Institute
– King’s College London
– London School of Economics
– Queen Mary University of London
– Wellcome Trust Sanger Institute
– University College London

Potential cost-saving/resource benefits

The Jisc Shared Datacentre is already a cost saving. The eMedLab award, and the need to spend it quickly, gave impetus to UCL, KCL, QMUL, Sanger, LSE and the Crick to identify off-site datacentre hosting (Slough).
– Anchor tenants get a price reduction based on the volume of space used
– Procurement led by Jisc
– Datacentre connected to the Janet network (Jisc investment)
– Improved PUE: Slough achieves 1.25 cf. ~2 for a typical HEI datacentre (saving UCL ~£2M p.a.; a rough illustration of the arithmetic follows)
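To make the PUE comparison concrete, here is a minimal sketch. The IT load and electricity price are hypothetical placeholders, not figures from the slides; the point is only that total facility power is IT power multiplied by PUE, so moving the same load from PUE ~2 to 1.25 cuts the non-IT overhead by more than half.

```python
# Rough illustration of how a PUE improvement translates into an
# annual energy-cost saving. IT_LOAD_MW and PRICE_PER_MWH are invented.
IT_LOAD_MW = 1.5          # assumed average IT load in megawatts
PRICE_PER_MWH = 120.0     # assumed electricity price, GBP per MWh
HOURS_PER_YEAR = 24 * 365

def annual_energy_cost(pue: float) -> float:
    """Total facility energy cost: IT load scaled by PUE."""
    return IT_LOAD_MW * pue * HOURS_PER_YEAR * PRICE_PER_MWH

cost_hei = annual_energy_cost(2.0)      # typical on-campus datacentre
cost_slough = annual_energy_cost(1.25)  # shared Slough facility
print(f"Saving: ~GBP {cost_hei - cost_slough:,.0f} per year")
```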

Datacentre Connection Topology

[Diagram: datacentre connection topology, including links to the N3 and PSN/PSN for Health networks]

eMedLab

Collaborative science

Shared Operations


Objectives - Flexibility

• To help generate new insights and clinical outcomes by combining data from diverse sources and disciplines

• Bring computing workloads to the data, minimising the need for costly data movements

• To allow customised use of resources
• To enable innovative ways of working collaboratively
• To allow a distributed support model


Institutional Collaboration

Support team

eMedLab academy
• Training via CDFs and courses
• Promote collaborations via “Labs”

eMedLab infrastructure

• Shared computer cluster
• Integrate and exchange heterogeneous data
• Methods and insights across diseases

Hardware Overview

[Diagram: eMedLab hardware overview]

What is eMedLab?

eMedLab is a hub:
– 6+1 partners
– 3 data types: electronic health records, genomic, images
– 3 expertises: clinician scientists, analytics, basic science
– 3 disease areas: rare, cancer, cardio
– >6M patients

Distributed/federated support (what has worked, savings)

[Diagram: a shared eMedLab Ops team at the centre, exchanging support with each partner institution; knowledge sharing/transfer, including developing UK industrial capacity (OCF/OpenStack)]

Project Model

Many projects, same challenges

– Information governance
– Secure data transfer
– User management
– AAAI
Working with Janet to explore how to support most/all projects

Cultural Barriers and Challenges

Finance – government funding with a spend window of only one year
+ Mitigated by use of efficient procurement teams and framework agreements
+ Working closely with vendors to ensure tight time targets are met
- Drain on (unfunded) project management and finance team resources

Regulatory challenge
+ Mitigated by clear policies and governance, supported by training
+ Changing EU data protection legislation
- Risk of bad PR and/or data leaks

People
+ Everyone is open, collaborative, generous with time and knowledge

eMedLab production service – projects

• UCL & WTSI - Enabling Collaborative Medical Genomics Analysis Using Arvados – Javier Herrero

• Crick KCL UCL - A scalable and flexible collaborative eMedLab cancer genomics cluster to share large-scale datasets and computational resources – Peter van Loo

• UCL QMUL Farr - Creating and exploiting a research data mart using i2b2 and novel data-driven methods - Spiros Denaxas

• LSHTM & QMUL - An evaluation of a genomic analysis tools VM on eMedLab, applied to infectious disease projects at the LSHTM using data from EBI and Sanger, and Genetic Analysis of UK Biobank Data - Taane Clark & Helen Warren

• UCL & ICH - The HIGH-5 Programme - High definition, in-depth phenotyping at GOSH, plus related projects - Phil Beales & Hywel Williams & Chela James

eMedLab enables projects

eMedLab brings data and expertise together across diseases

Cancer evolution and heterogeneity (Swanton & Van Loo)
• Cancers evolve heterogeneously
• Diverse driver mutations and instability mechanisms
• TRACERx: track lung cancer evolution
• Data: genomes, MRI, molecular pathology
• Who: clinicians, statisticians, evolutionary biologists

Potential:
• Mechanisms of cancer diversity and genome instability
• Better understanding of biomarkers
• DARWIN clinical trial to target clonal drivers


CAMP: Crick Analysis and Data Management Platform

David Fergusson, Bruno Silva, Adam Hufman, Luke Reimbach

CAMP

[Architecture diagram: instruments feed fast data ingest and a post-processing buffer; fast parallel shared storage sits at the centre, with batch and cloud automated analysis, advanced cloud storage, tiered storage, archive, and DR storage and compute]

Analysis of data independent of location, platform and OS

[Diagram: data ingest → shared storage → analysis and cloud → advanced cloud storage]

Data file profiles

[Chart: numbers of files by file size – up to ~1,000,000 files, clustered in two bands, 64 KB – 1 MB and 500 MB – 4 GB]

Future Problems

Metadata

Collecting metadata
– Metadata collection has to be automated – manual systems are too labour-intensive (a minimal sketch follows this list)
– Open, shared metadata formats – currently many are proprietary (GA4GH, etc.)
– Tools for managing metadata

Using metadata
– Object storage
– Non-tree-based search
– Natural language search
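As an illustration of automating collection, here is a minimal sketch (not the Crick's actual tooling) that walks a data directory and emits one JSON metadata record per file. The root path is hypothetical; a real deployment would add instrument- and format-specific readers and map the records onto open schemas such as GA4GH's.

```python
# Minimal automated metadata harvester: basic file facts plus a checksum.
import hashlib
import json
from pathlib import Path

def sha256sum(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file so large images/genomes don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def harvest(root: str):
    """Yield a basic metadata record for every file under `root`."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            stat = path.stat()
            yield {
                "path": str(path),
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,
                "sha256": sha256sum(path),  # supports dedup/integrity checks
            }

if __name__ == "__main__":
    for record in harvest("/data/instruments"):  # hypothetical root path
        print(json.dumps(record))
```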

Staging data between multiple infrastructures
Data infrastructures do not have a shared understanding of data location. Ideally, compute moves to the data.
– Much biomedical data (medical imaging, genomics) is becoming too large to move efficiently
– Published resource information would allow jobs to be moved to the data location – a “data resource broker” (a toy sketch follows this list)
– Increasingly, multiple data sets need to be synthesised to create new knowledge
– Compute jobs need to span multiple data locations
– (Shared physical data locations and virtual secure networks between data resources are a beginning)
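The broker idea can be made concrete with a toy sketch: given a published catalogue of which site holds which dataset (all names and sizes below are invented), place each job at the site holding the largest volume of its inputs, so only the smaller remainder is staged over the network.

```python
# Toy "data resource broker": schedule compute at the data, not vice versa.
from collections import Counter

# Hypothetical published catalogue: dataset id -> (hosting site, size in GB).
CATALOGUE = {
    "genomes/cohort-a": ("slough-dc", 120_000),
    "imaging/mri-set-3": ("slough-dc", 40_000),
    "ehr/extract-2016": ("swansea", 500),
}

def place_job(input_datasets):
    """Pick the site holding the largest volume of the job's inputs,
    so only the smaller remainder moves over the network."""
    volume_by_site = Counter()
    for ds in input_datasets:
        site, size_gb = CATALOGUE[ds]
        volume_by_site[site] += size_gb
    site, local_gb = volume_by_site.most_common(1)[0]
    to_move_gb = sum(volume_by_site.values()) - local_gb
    return site, to_move_gb

site, moved = place_job(["genomes/cohort-a", "ehr/extract-2016"])
print(f"Run at {site}; ~{moved} GB of remote input to stage in")
```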

Sharing data seamlessly between infrastructures
Researchers and their work need to be able to move seamlessly between infrastructures.
– Currently many proprietary barriers
– Different virtualisation technologies
– Is containerisation the answer?
– Also relies on shared AAAI

Two types of data/analysis
Currently the biomedical data we deal with splits into two main categories (a toy classifier is sketched after the list):

1. Moderate numbers of large files
– Typically image data, but also annotated genomic sequence data
– Files typically 10s of MB – 10s of GB
– Generally 1,000s – 10,000s of files

2. Very large numbers of small files
– Typically sequencing data
– Files typically MB down to KB sizes
– Very large numbers of files (~2–300,000,000)
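A toy sketch of how an infrastructure might recognise the two profiles automatically; the thresholds and the example path are illustrative, not taken from the slides.

```python
# Classify a directory into the two data profiles described above.
from pathlib import Path
from statistics import median

def profile(root: str) -> str:
    sizes = [p.stat().st_size for p in Path(root).rglob("*") if p.is_file()]
    if not sizes:
        return "empty"
    # >1M files with a small median size: the sequencing-style profile.
    if len(sizes) > 1_000_000 and median(sizes) < 1 << 20:
        return "many small files (e.g. sequencing output)"
    # Median above 10 MB: the imaging-style profile of fewer, larger files.
    if median(sizes) > 10 << 20:
        return "moderate numbers of large files (e.g. imaging)"
    return "mixed profile"

print(profile("/data/project-x"))  # hypothetical path
```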


Data Pipelines

Data typically exists in three states (a simple tiering policy is sketched after the list):

Active analysis

Available for re-analysis

Archive
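One minimal way to implement those three states is an access-age policy. The sketch below assigns each file to a tier by time since last access; the paths and age thresholds are illustrative, and on filesystems mounted with noatime the access time may be unreliable.

```python
# Simple tiering policy: hot data stays fast, cold data ages out to archive.
import time
from pathlib import Path

TIERS = [  # (max age in days since last access, tier name)
    (30, "fast"),            # active analysis
    (365, "warm"),           # available for re-analysis
    (float("inf"), "archive"),
]

def tier_for(path: Path) -> str:
    age_days = (time.time() - path.stat().st_atime) / 86400
    for max_age, name in TIERS:
        if age_days <= max_age:
            return name
    return "archive"

for f in Path("/data/projects").rglob("*"):  # hypothetical root
    if f.is_file():
        print(f, "->", tier_for(f))
```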


Storage tiers


Data volumes

Instruments producing large-scale data
– 100s of TB per sample for new EMs (electron microscopes)

Distributed production of data
– Shared and international instruments
– Sequencing everywhere

Data throughput in analysis
For biomedical research, the ability to move data through a pipeline rapidly, and to run many pipelines in parallel, is essential.
– “Balanced” infrastructures, where data I/O, network and compute speeds match (a back-of-envelope check follows this list)
– Being able to support concurrent analyses of widely differing profiles on one infrastructure
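A back-of-envelope sketch of the balance point: a pipeline's sustained rate is capped by its slowest stage, so provisioning compute far beyond I/O or network speed is wasted. All figures below are invented.

```python
# Spot the bottleneck stage and estimate wall-clock time for a data volume.
STAGES_GB_PER_S = {
    "storage read": 20.0,  # assumed parallel file system rate
    "network": 12.5,       # assumed ~100 Gbit/s link
    "compute": 18.0,       # assumed aggregate processing rate
}

bottleneck = min(STAGES_GB_PER_S, key=STAGES_GB_PER_S.get)
rate = STAGES_GB_PER_S[bottleneck]
print(f"Pipeline limited by {bottleneck}: ~{rate} GB/s")
print(f"Time to push 150 TB through: ~{150_000 / rate / 3600:.1f} hours")
```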

Data segmentation in analysis

Image data
– 1,000s of medium-sized files (MB – GB)
– Video
– 3D
– Data may be inflated immediately (recently a 50 TB capture -> 150 TB)
– Identification of objects, modelling

Sequencing data
– Millions of small files
– Genomic analysis can be 1,000s of medium-sized files
– Data inflates with analysis and annotation
– Finding interconnections and distant relations


Data management efficiency

Compression
– Image data
– Sequence data – CRAM (a scripted example follows)

Managing duplication

Data management tools – where is a data tool suite?
– Monitoring
– Viewing
– Manipulating
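For sequence data, reference-based CRAM compression can be scripted with samtools. A minimal sketch, assuming placeholder file names and that the reads were aligned to the given reference; CRAM stores reads as differences against the reference, which is why it compresses well beyond BAM.

```python
# Convert BAM to CRAM with samtools (view -C emits CRAM, -T names the
# reference). File paths are hypothetical placeholders.
import subprocess

BAM = "sample.bam"   # input alignment
REF = "GRCh38.fa"    # reference the reads were aligned to
CRAM = "sample.cram"

subprocess.run(
    ["samtools", "view", "-C", "-T", REF, "-o", CRAM, BAM],
    check=True,  # raise if samtools reports an error
)
print(f"Wrote {CRAM}; verify with: samtools quickcheck {CRAM}")
```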


Thank you
