Big Data in Biology & Healthcare
Ewan Birney
Director, EMBL-EBI
www.ebi.ac.uk
What is EMBL-EBI?
• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• 200 Petabytes of storage (0.2 exabytes)
• >40,000 CPU Cores
• Part of EMBL, an intergovernmental research organisation
• International: 600 members of staff from 60 nations
• Home of the ELIXIR Technical hub.
See the live map at www.ebi.ac.uk/about/our-impact
Global reference data
We have been living through a revolution.
One genome 2003 to 2017
The cost of sequencing a genome in 2017
The cost of sequencing a genome in 2003
$100 Genome within the next 5 years (likely 3 years)
Real-time genomics in the field
Measuring DNA, RNA, protein…
(Note: I am a long-term, paid consultant to Oxford Nanopore)
Medical Genomics
Sequencing is now “cheap enough”
• Between $200-300 / exome, and $800-$1000 for whole genome
• Line of sight to $100 genome
• Quoted by Illumina, contenders emerging, steady progress.
• More costs now in consent, DNA sample acquisition (storage and standard analysis low-ish, but not 0!)
• All-in costs at or below “routine” medical diagnosis, e.g. MRI scans
Clinical Utility is present: Rare disease
• Consistent 20-30% yield of diagnosis for suspected rare diseases
• Diagnosis ends the “diagnostic odyssey” for patients – painful, emotionally draining and costly to the healthcare service
• Opens up reproductive choices for the parents
• Like for like study in Australia
• 5 fold more diagnoses at 1/3 cost to previous standard of care!
• Roll out in Denmark, Finland, France, UK
Clinical Utility is present: Cancer
• (Cancer logistics harder: sample acquisition and DNA extraction harder to standardise; timelines far shorter)
• In umbrella + basket trials, 1 in 10 patients have treatment-changing information from cancer genomics
• Often being deployed in aggressive, “any option” metastasis scenarios
• Broader molecular phenotyping via genomics showing promise
• Signatures of NHER (BRCA1/BRCA2) defects far broader than suspected from germline associations
• Age of key mutations becoming more obvious
Cohorts and Medical genomics
Medical Genomes
• Countries with active national medical genome projects
• Countries with some activity of medical genomics
• Countries planning medical genome projects
USA
Brazil
Canada
Iceland
South Korea
Japan
China
Finland
Australia
India
Spain
UK
Ireland
Estonia
Saudi Arabia
Turkey
France
Mexico
Sweden
Norway
Taiwan
Cohorts
• National cohorts > 100k, genotyped or sequenced at least 25k
• National cohorts > 100k people, active collection now
• Planning national cohorts > 100k
H3Africa
South Africa
Malaysia
Singapore
Iran
Israel
Austria
Switzerland
Germany
Netherlands
Denmark
Jordan
Kuwait
Qatar
U.A.E
Scotland
Big numbers!
Genomics: from research to healthcare
Research
• English language
• Light-weight legal
• Similar systems
• Open data
• Publications
• Grant funding
Practicing Medicine
• National language
• Heavy legal framework
• Different systems
• Closed data
• Not published
• Contract funding
Bridges need at least two anchors
Global standards: the GA4GH
• GA4GH is THE standards-setting body for genomics and healthcare
• Embraces federated approach
• Setting community standards early
• Cloud: Analysis carried out where the data ‘lives’
• “You’re already using it!”: SAM/BAM/CRAM/VCF formats
• Tools: htsget – the first step away from file-based access
• Rare disease diagnoses: Matchmaker Exchange
• Federated discovery: GA4GH Beacons
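As a rough illustration of what the step away from file-based access looks like: with htsget, a client asks for a genomic region and receives a JSON "ticket" listing the URLs of the data blocks to fetch, rather than downloading a whole BAM/CRAM file. The sketch below only builds such a request URL and reads a ticket. The server address `https://htsget.example.org`, the identifier `sample1` and the sample ticket are hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode, quote


def htsget_reads_url(base_url, read_id, reference_name, start, end, fmt="BAM"):
    """Build an htsget 'reads' request URL for one genomic interval.

    start/end are 0-based, half-open coordinates, as in the htsget protocol.
    """
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{quote(read_id, safe='')}?{query}"


def ticket_block_urls(ticket_json):
    """Return the data-block URLs listed in an htsget ticket, in order."""
    ticket = json.loads(ticket_json)
    return [block["url"] for block in ticket["htsget"]["urls"]]


# Hypothetical server and identifier, for illustration only.
url = htsget_reads_url("https://htsget.example.org", "sample1", "chr1", 10000, 20000)
print(url)

# A made-up ticket in the shape the protocol describes.
example_ticket = json.dumps({
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": "https://data.example.org/block1"},
            {"url": "https://data.example.org/block2"},
        ],
    }
})
print(ticket_block_urls(example_ticket))
```

A real client would fetch each listed block in order and concatenate the bytes to obtain a valid file covering just the requested region.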
Federation
Open research data
• Aggregate data globally
• Download, analyse locally
Healthcare data with research use
• Analyse data locally (via VMs)
• Collate analyses
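The federated pattern (analyse locally, then collate) can be sketched in a few lines: each site computes a small summary on its own data, and only those summaries travel to a central point. The site names, the genotype encoding (0, 1 or 2 alternate alleles per person) and the data below are invented for illustration; in practice the local step would run inside each data holder's own environment.

```python
def local_allele_counts(genotypes):
    """Run at each site: summarise genotypes (0, 1 or 2 alt alleles per person)."""
    alt = sum(genotypes)
    total = 2 * len(genotypes)  # diploid: two alleles per person
    return {"alt": alt, "total": total}


def collate(summaries):
    """Run centrally: combine per-site summaries into a pooled allele frequency."""
    alt = sum(s["alt"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return alt / total if total else float("nan")


# Made-up genotype data held at three hypothetical sites.
site_data = {
    "site_a": [0, 1, 0, 2, 0],
    "site_b": [0, 0, 1, 0],
    "site_c": [1, 1, 0, 0, 0, 0],
}

# Only the per-site summaries leave each site, never the individual genotypes.
summaries = [local_allele_counts(g) for g in site_data.values()]
pooled_frequency = collate(summaries)
print(summaries)
print(round(pooled_frequency, 3))
```

Individual-level records stay where they live, which is the point of running the analysis next to the data.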
Clinical & Phenotypic
Cloud
Discovery
Data Security
Data Use & Researcher IDs
Genomic Knowledge Standards
Large-Scale Genomics
Regulatory & Ethics
1. Pheno ontology recommendations
2. Info models for clin data exchange
3. Implementing pheno standards
4. Test bed & interoperability demo
5. TES
6. TRS
7. WES
8. DOS
9. Beacon
10. Search
11. Service registry
12. Variant submission
13. IoG
14. Breach response
15. AAI
16. Researcher ID & Bona Fide status
17. DUO
18. Variant Annotation
19. Variant Representation
20. htsget streaming API
21. Reference sequence retrieval API
22. Read file formats
23. Genetic variation file formats
24. RNASeq expression matrix
25. Return of results policy
26. Participant values survey
27. Code of conduct for data sharing
28. Cloud access policy
[Figure: matrix assigning each of the 28 items above to one or more work streams, abbreviated C & P (Clinical & Phenotypic), Cloud, Discov (Discovery), Secur (Data Security), DURI (Data Use & Researcher IDs), GKS (Genomic Knowledge Standards), LSG (Large-Scale Genomics) and R & E (Regulatory & Ethics)]
Europe’s opportunity
• Strengths/Opportunities
• Public Healthcare systems
• Strong genomics
• Strong public health delivery
• Strong infrastructure
• Transnational requirement
• Weaknesses/Threats
• Less IT depth in some healthcare systems
• Fragmentation of skills
• AI / Big Data capacity (skills + capital)
• Transnational complexity
EMBL-EBI, ELIXIR and GA4GH
• EMBL-EBI is the world’s leading bioinformatics infrastructure provider
• Human Reference Genome, Annotation, Transcription, Proteomics, Structure, Pathways and Literature
• ELIXIR is Europe’s transnational coordination of bioinformatics infrastructure
• 23 European countries + EMBL-EBI
• Human data community
• GA4GH is the global standards setting organisation in human genomics
• ELIXIR and GA4GH have a strategic partnership
Humans: a new model organism
Humans are…
• Similar to most other life forms on Earth
• Outbred organisms with pretty good genetics
• Huge cohorts – millions of people
• Big (lots and lots of cells)
• Willing participants – they take themselves to hospitals to be phenotyped
• Popular organisms – research into them attracts a lot of funding
• …A great model organism for understanding biology – including human disease!
Trabeculation
UK Biobank – 500,000 healthy UK citizens, consistently phenotyped and genotyped (full genome sequencing planned)
100,000 will be MRI imaged (head including fMRI, chest including cardiac MRI)
Fractal dimension trabeculation
Co-registration
Meta analysis
[Figure: loci GOSR2, TTN, TNNT2 and SLC35F1, annotated with systolic BP, heart phenotypes, DCM and pulse rate]
Many loci also show changes in QRS
Some loci have “other heart conditions” ICD-10 codes
Meta analysis
Replication in 1,200 other healthy Brits
Thanks
Hannah Meyer, EBI
Declan O’Regan, LMS, MRC
Thank you!
Follow me on twitter: @ewanbirney
I blog regularly (Google Ewan Birney)
2/14/2019
Imaging: new technologies change the game
EM tomography
Atomic-scale models from EM
Super-resolution light microscopy
High-resolution MRI and CT
Light sheet microscopy
Huge impact on biological research
Tools for the wet lab
Tools for the dry lab
‘White-collar’ and ‘blue-collar’ problems
Ground-breaking ideas
• Innovative, interesting, blue-skies thinking
Making them work
• Tools and data management: necessary, less glamorous
Life science: many data types
Genes, genomes & variation
Gene, protein & metabolite expression
Protein sequences, families & motifs
Macromolecular structures
Interactions, reactions & pathways
Chemogenomics & metabolomics
Phenotypes
Data resources at EMBL-EBI
Literature & ontologies
• Experimental Factor Ontology
• Gene Ontology
• BioStudies
• Europe PMC
Chemical biology
• ChEBI
• ChEMBL
• SureChEMBL
Molecular structures
• Protein Data Bank in Europe
• Electron Microscopy Data Bank
Gene, protein & metabolite expression
• Expression Atlas
• MetaboLights
• PRIDE
• RNAcentral
Protein sequences, families & motifs
• InterPro
• Pfam
• UniProt
Genes, genomes & variation
• Ensembl
• Ensembl Genomes
• GWAS Catalog
• Metagenomics portal
Systems
• BioModels
• BioSamples
• Enzyme Portal
• IntAct
• Reactome
Molecular Archives
• European Nucleotide Archive
• European Variation Archive
• European Genome-phenome Archive
• ArrayExpress
~410 people
Worldwide collaborations
Data Growth
Doubling time ~16 months
Doubling time ~6 months
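To get a feel for what these doubling times imply: with a doubling time of T months, the volume after t months grows by a factor of 2^(t/T). The short sketch below applies that formula to the two figures above. The five-year horizon is an arbitrary choice for illustration, and the slide text does not say which data types each doubling time refers to.

```python
def growth_factor(months_elapsed, doubling_time_months):
    """Multiplicative growth after a given time, for a fixed doubling time."""
    return 2 ** (months_elapsed / doubling_time_months)


horizon = 60  # five years, chosen only for illustration
for doubling_time in (16, 6):
    factor = growth_factor(horizon, doubling_time)
    print(f"doubling every {doubling_time} months: x{factor:,.1f} after {horizon} months")
```

Over the same five years, a 16-month doubling time gives roughly a 13-fold increase, while a 6-month doubling time gives ten doublings, i.e. about a 1,000-fold increase.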