Big Data in Biology & Healthcare
Ewan Birney, Director, EMBL-EBI, www.ebi.ac.uk

Big Data in Biology & Healthcare - European Commission · ec.europa.eu/jrc/sites/jrcsh/files/1-ewan_birney... · 2019/12/02


Page 1

Big Data in Biology & Healthcare

Ewan Birney

Director, EMBL-EBI

www.ebi.ac.uk

Page 2

What is EMBL-EBI?

• Europe’s home for biological data services, research and training

• A trusted data provider for the life sciences

• 200 Petabytes of storage (0.2 exabytes)

• >40,000 CPU Cores

• Part of EMBL, an intergovernmental research organisation

• International: 600 members of staff from 60 nations

• Home of the ELIXIR Technical hub.

Page 3

See the live map at www.ebi.ac.uk/about/our-impact

Global reference data

Page 4

We have been living through a revolution.

Page 5

One genome 2003 to 2017

The cost of sequencing a genome in 2017

The cost of sequencing a genome in 2003

$100 Genome within the next 5 years (likely 3 years)

Page 6

Real-time genomics in the field

Measuring DNA, RNA, protein…

(Note: I am a long-term, paid consultant to Oxford Nanopore)

Page 7

Medical Genomics

Page 8

Sequencing is now “cheap enough”

• Between $200-$300 per exome, and $800-$1,000 per whole genome

• Line of sight to $100 genome

• Quoted by Illumina, contenders emerging, steady progress.

• More costs now in consent, DNA sample acquisition (storage and standard analysis low-ish, but not 0!)

• All-in costs at or below “routine” medical diagnostics, e.g. MRI scans

Page 9

Clinical Utility is present: Rare disease

• Consistent 20-30% yield of diagnosis for suspected rare diseases

• Diagnosis ends the “diagnostic odyssey” for patients, which is painful, emotionally draining and costly to the healthcare service

• Opens up reproductive choices for the parents

• Like-for-like study in Australia

• 5-fold more diagnoses at 1/3 of the cost of the previous standard of care!

• Roll out in Denmark, Finland, France, UK

Page 10

Clinical Utility is present: Cancer

• (Cancer logistics harder: sample acquisition and DNA extraction harder to standardise; timelines far shorter)

• In umbrella + basket trials, 1 in 10 patients have treatment-changing findings from cancer genomic information

• Often being deployed in aggressive, “any option” metastasis scenarios

• Broader molecular phenotyping via genomics showing promise

• Signatures of homologous recombination repair (BRCA1/BRCA2) defects are far broader than suspected from germline associations

• Age of key mutations becoming more obvious

Page 11

Cohorts and Medical genomics

Medical Genomes (map legend)
• Countries with active national medical genome projects
• Countries with some activity of medical genomics
• Countries planning medical genome projects

USA

Brazil

Canada

Iceland

South Korea

Japan

China

Finland

Australia

India

Spain

UK

Ireland

Estonia

Saudi Arabia

Turkey

France

Mexico

Sweden

Norway

Taiwan

Cohorts (map legend)
• National cohorts > 100k, genotyped or sequenced at least 25k
• National cohorts > 100k people, active collection now
• Planning national cohorts > 100k

H3Africa

South Africa

Malaysia

Singapore

Iran

Israel

Austria

Switzerland

Germany

Netherlands

Denmark

Jordan

Kuwait

Qatar

U.A.E

Scotland

Page 12

Big numbers!

Page 13

Genomics: from research to healthcare

Research

• English language
• Light-weight legal
• Similar systems
• Open data
• Publications
• Grant funding

Practicing Medicine

• National language
• Heavy legal framework
• Different systems
• Closed data
• Not published
• Contract funding

Page 14

Bridges need at least two anchors

Page 15

Global standards: the GA4GH

• GA4GH is THE standards-setting body for genomics and healthcare

• Embraces federated approach

• Setting community standards early

• Cloud: Analysis carried out where the data ‘lives’

• “You’re already using it!”: SAM/BAM/CRAM/VCF formats

• Tools: htsget – the first step away from file-based access

• Rare disease diagnoses: Matchmaker Exchange

• Federated discovery: GA4GH Beacons
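
The discovery idea in the bullets above can be illustrated with a small sketch. It assumes a Beacon-style service answers only "has this allele been seen here?" and that each site holds its variants as VCF text. The function names (`alleles_from_vcf`, `beacon_answers`), the site names and the records are hypothetical and made up for illustration; this is not the GA4GH Beacon API itself.

```python
from typing import NamedTuple, Set


class Allele(NamedTuple):
    chrom: str
    pos: int   # 1-based position, as in the VCF POS column
    ref: str
    alt: str


def alleles_from_vcf(text: str) -> Set[Allele]:
    """Collect every (CHROM, POS, REF, ALT) allele from VCF text."""
    alleles = set()
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip meta-information and header lines
            continue
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):           # ALT may list several alleles
            alleles.add(Allele(chrom, int(pos), ref, alt))
    return alleles


def beacon_answers(alleles: Set[Allele], query: Allele) -> bool:
    """Beacon-style response: only 'seen' or 'not seen', never the records."""
    return query in alleles


# Made-up example records, not real data.
SITE_A = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tA\tG\t.\tPASS\t.\n"
    "2\t2000\t.\tC\tT,G\t.\tPASS\t.\n"
)
SITE_B = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tA\tT\t.\tPASS\t.\n"
)

sites = {"site_a": alleles_from_vcf(SITE_A), "site_b": alleles_from_vcf(SITE_B)}
query = Allele("2", 2000, "C", "G")
responses = {name: beacon_answers(held, query) for name, held in sites.items()}
print(responses)  # {'site_a': True, 'site_b': False}
```

Each site returns a single yes/no, so a researcher can find out where a variant exists without any site releasing its underlying records.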

Page 16

Federation

Open research data
• Aggregate data globally
• Download, analyse locally

Healthcare data with research use
• Analyse data locally (via VMs)
• Collate analyses
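
The federated route on this slide (analyse where the data lives, then collate the analyses) can be sketched as follows. This is a minimal illustration under stated assumptions: the site names, genotype values and the functions `analyse_locally` and `collate` are hypothetical, and the statistic being pooled (an allele frequency) is chosen only because it combines exactly from per-site counts.

```python
from typing import Dict, List, Tuple

# Genotypes are coded as the number of alternate alleles per person (0, 1 or 2).
# These are made-up values for illustration, not real data.
SITE_GENOTYPES: Dict[str, List[int]] = {
    "hospital_a": [0, 1, 2, 0, 1],
    "hospital_b": [0, 0, 1],
    "hospital_c": [2, 1, 1, 0],
}


def analyse_locally(genotypes: List[int]) -> Tuple[int, int]:
    """Runs inside a site: return (alt allele count, total allele count) only."""
    return sum(genotypes), 2 * len(genotypes)


def collate(summaries: List[Tuple[int, int]]) -> float:
    """Runs at the coordinator: combine per-site summaries into one frequency."""
    alt = sum(a for a, _ in summaries)
    total = sum(n for _, n in summaries)
    return alt / total


summaries = [analyse_locally(g) for g in SITE_GENOTYPES.values()]
pooled_frequency = collate(summaries)
print(summaries)         # [(4, 10), (1, 6), (4, 8)]
print(pooled_frequency)  # 0.375
```

Only the two counts per site cross the site boundary; the individual-level genotypes never leave the hospital that holds them.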

Page 17

GA4GH work streams (abbreviation used on the slide)

• Clinical & Phenotypic (C & P)
• Cloud
• Discovery (Discov)
• Data Security (Secur)
• Data Use & Researcher IDs (DURI)
• Genomic Knowledge Standards (GKS)
• Large-Scale Genomics (LSG)
• Regulatory & Ethics (R & E)

Deliverables

1. Pheno ontology recommendations
2. Info models for clin data exchange
3. Implementing pheno standards
4. Test bed & interoperability demo
5. TES
6. TRS
7. WES
8. DOS
9. Beacon
10. Search
11. Service registry
12. Variant submission
13. IoG
14. Breach response
15. AAI
16. Researcher ID & Bona Fide status
17. DUO
18. Variant Annotation
19. Variant Representation
20. htsget streaming API
21. Reference sequence retrieval API
22. Read file formats
23. Genetic variation file formats
24. RNASeq expression matrix
25. Return of results policy
26. Participant values survey
27. Code of conduct for data sharing
28. Cloud access policy

Page 18

Europe’s opportunity

• Strengths/Opportunities

• Public Healthcare systems

• Strong genomics

• Strong public health delivery

• Strong infrastructure

• Transnational requirement

• Weaknesses/Threats

• Less IT depth in some healthcare systems

• Fragmentation of skills

• AI / Big Data capacity (skills + capital)

• Transnational complexity

Page 19

EMBL-EBI, ELIXIR and GA4GH

• EMBL-EBI is the world’s leading bioinformatics infrastructure provider

• Human Reference Genome, Annotation, Transcription, Proteomics, Structure, Pathways and Literature

• ELIXIR is Europe’s transnational coordination of bioinformatics infrastructure

• 23 European countries + EMBL-EBI

• Human data community

• GA4GH is the global standards setting organisation in human genomics

• ELIXIR and GA4GH have a strategic partnership

Page 20

Humans: a new model organism

Page 21

Humans are…

• Similar to most other life forms on Earth

• Outbred organisms with pretty good genetics

• Huge cohorts – millions of people

• Big (lots and lots of cells)

• Willing participants – they take themselves to hospitals to be phenotyped

• Popular organisms – research into them attracts a lot of funding

• …A great model organism for understanding biology – including human disease!

Page 22

Trabeculation

UK Biobank – 500,000 healthy UK citizens, consistently phenotyped and genotyped (full genome sequence to follow)

100,000 will be MRI imaged (head including fMRI, chest including cardiac MRI)

Page 23

Fractal dimension trabeculation

Page 24

Co-registration

Page 27

Meta analysis

Loci highlighted: GOSR2, TTN, TNNT2, SLC35F1

Trait labels: systolic BP, pulse rate, heart phenotypes, DCM

Many loci also show changes in QRS

Some loci have “other heart conditions” ICD-10 codes

Page 28

Meta analysis

Page 29

Replication in 1,200 other healthy Brits

Page 32

Thanks

Hannah Meyer, EBI

Declan O’Regan, LMS, MRC

Page 33

Thank you!

Follow me on twitter: @ewanbirney

I blog regularly (Google Ewan Birney)

2/14/2019

Page 34

Imaging: new technologies change the game

EM tomography

Atomic-scale models from EM

Super-resolution light microscopy

High-resolution MRI and CT

Light sheet microscopy

Page 35

Huge impact on biological research

Tools for the wet lab

Tools for the dry lab

Page 36

‘White-collar’ and ‘blue-collar’ problems

Ground-breaking ideas: innovative, interesting, blue-skies thinking

Making them work: tools and data management (necessary, less glamorous)

Page 37

Life science: many data types

Genes, genomes & variation

Gene, protein & metabolite expression

Protein sequences, families & motifs

Macromolecular structures

Interactions, reactions & pathways

Chemogenomics & metabolomics

Phenotypes

Page 38

Data resources at EMBL-EBI

Literature & ontologies
• Experimental Factor Ontology
• Gene Ontology
• BioStudies
• Europe PMC

Chemical biology
• ChEBI
• ChEMBL
• SureChEMBL

Molecular structures
• Protein Data Bank in Europe
• Electron Microscopy Data Bank

Gene, protein & metabolite expression
• Expression Atlas
• MetaboLights
• PRIDE
• RNAcentral

Protein sequences, families & motifs
• InterPro
• Pfam
• UniProt

Genes, genomes & variation
• Ensembl
• Ensembl Genomes
• GWAS Catalog
• Metagenomics portal

Systems
• BioModels
• BioSamples
• Enzyme Portal
• IntAct
• Reactome

Molecular Archives
• European Nucleotide Archive
• European Variation Archive
• European Genome-phenome Archive
• ArrayExpress

~410 people

Worldwide collaborations

Page 39

Data Growth

Doubling time ~16 months

Doubling time ~6 months
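
To see how far apart these two doubling times are, here is a small worked sketch. It assumes simple exponential growth, N(t) = N0 · 2^(t / T), where T is the doubling time. The 48-month horizon and the function name `growth_factor` are illustrative choices, not figures from the slides.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


HORIZON_MONTHS = 48.0  # illustrative horizon (4 years), not taken from the slides

slow = growth_factor(HORIZON_MONTHS, 16.0)  # doubling every ~16 months
fast = growth_factor(HORIZON_MONTHS, 6.0)   # doubling every ~6 months

print(slow)         # 8.0
print(fast)         # 256.0
print(fast / slow)  # 32.0
```

Under that assumption, four years gives 3 doublings (8-fold growth) at a 16-month doubling time and 8 doublings (256-fold growth) at a 6-month doubling time.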