The 100,000 Genomes Project
David Montaner
Bioinformatics
Valencia University, October 6th 2016
Talk Outline
1. Introduction & Background
2. Pipelines
3. Systems and Databases
4. Cancer
5. Rare Diseases
The 100,000 Genomes Project
Genomics England & Partners
Genomics England
• Owned by the Department of Health, UK
• Set up to deliver the 100,000 Genomes Project: 100,000 whole genome sequences of National Health Service (NHS) patients with:
  • Rare Diseases (and family members)
  • Cancer
Aims:
• Create an ethical and transparent programme based on consent
• Establish the infrastructure, human capacity & capability to set up a genomic medicine service for the NHS and bring benefit to patients
• Enable new scientific discovery and medical insights, and add to the already extensive databases on human variation
• Working with the National Health Service (NHS), academics and industry to make the UK a world leader in Genomic Medicine
Who are we & what are we doing?
Generate health & wealth
• Sequence 100,000 genomes
• Cancer and rare genetic disease
• Capture data delivered electronically, store it securely and analyse it
  • within an English data centre (reading library)
• Combine genomes with extracted clinical information for analysis, interpretation, and aggregation
• Create capacity, capability and legacy in personalised medicine for the UK
Goals of Genomics England
1. To bring benefit to NHS patients
2. To enable new scientific discovery and medical insights
3. To create an ethical and transparent programme based on consent
4. To kickstart the development of a UK genomics industry
Inception of the 100,000 genomes project (2012, 2014)
“If we get this right, we could transform how we diagnose and treat our most complex diseases not only here but across the world” (December 2012)
“I am determined to do all I can to support the health and scientific sector to unlock the power of DNA, turning an important scientific breakthrough into something that will help deliver better tests, better drugs and above all better care for patients.” (August 2014)
Schedule
2012-2014: consortium creation
2014-2015: pilot studies
2016 onwards: main project
Where are we?
London: Management; all data storage
Cambridge: Software for genomic data storage
Oxford: Software for clinical data storage and collection
Recruitment and clinical interface
13 “GMCs”, Scotland and Northern Ireland
• Genomic Medicine Centres
  • Networks of NHS hospitals including genomics labs
  • 13 “Lead organisations” plus 71 “Local Delivery Partners”
  • Contracted by NHS England
  • Cover recruitment, data and return of results
• Scotland
  • Doing own sequencing
• Northern Ireland
  • Similar to a GMC
  • Contracted by NI payer
The Journey of a Genome
Consent & sample collection
DNA extraction
Bio-repository
Sequencing
Variant Calling
Interpretation
Feedback to clinician
Validation
Treatment
The Journey of a Genome: Partners
Genomic Medicine Centres (GMCs): 13x NHS organisations
Genomics England Clinical Interpretation Partnerships (GeCIPs): collaborations of clinicians & academics, > 2,000 researchers
Clinical interpretation companies
• Omicia
• Congenica
• Nextcode
HiSeq X Ten
GENE Consortium
• Working together on a year-long Industry Trial involving a selection of whole genome sequences across cancer and rare diseases
• Aims to identify most effective and secure way to accelerate development of new diagnostics and treatments for patients
• Working in a pre-competitive environment
AbbVie, Alexion Pharmaceuticals, AstraZeneca, Berg Health, Biogen, Dimension Therapeutics, GSK, Helomics, NGM Biopharmaceuticals, Roche, Takeda
Genomics Expert Network for Enterprises
Simplified Workflow
BAM file from Illumina
Variant calling pipelines: VCF file
QC1, QC2
Variant Annotation
Tiering of variants
Dispatch
Clinical Interpretation
QC Portal, Reporting portal
Medical review
Validation
Genomic Medicine Centre (GMC)
Bioinformatics Team Role
Consent & sample collection → DNA extraction → Biorepository → Sequencing → Variant Calling → Interpretation → Feedback to clinician → Validation → Treatment
Genomics Education
Health Education England
• MSc in Genomic Medicine
  • 10 Universities across the UK
• Online training courses and resources
  • The fundamentals of genomics
  • Sample handling and DNA extraction
  • Bioinformatics
  • How to support patients through the consent process
Genomics England Communications Team
Update on numbers: at about 10%
• >10,000 genomes received
• >1PB of primary data
• >1.3M files received or generated and indexed
• 200M germline variants databased
• 48M somatic variants databased
• 70,000 HPO terms asserted
• >450,000 hospital episodes
100,000 Genomes
• Rare Disease
  • Each genome: 100GB
  • Trio is preferred, so 300GB per participant
  • x 50,000 participants = 15,000,000GB total
• Cancer
  • Germline: 100GB
  • Tumour: 200GB
  • 300GB per patient
  • x 25,000 participants = 7,500,000GB total
• 10,000,000GB = 10 Petabytes
• Expecting around 30 Petabytes
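The storage estimate can be rechecked with a few lines of arithmetic. This is a minimal sketch that takes the per-genome sizes and participant counts from the slide and assumes decimal units (1 PB = 1,000,000 GB); the function and variable names are illustrative only.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def cohort_storage_gb(gb_per_participant, participants):
    """Raw storage for a cohort, in gigabytes."""
    return gb_per_participant * participants


# Rare disease: a trio of three 100 GB genomes per participant.
rare_disease_gb = cohort_storage_gb(3 * 100, 50_000)

# Cancer: one 100 GB germline genome plus one 200 GB tumour genome per patient.
cancer_gb = cohort_storage_gb(100 + 200, 25_000)

total_pb = (rare_disease_gb + cancer_gb) / GB_PER_PB

print(f"Rare disease: {rare_disease_gb:,} GB")
print(f"Cancer:       {cancer_gb:,} GB")
print(f"Total:        {total_pb} PB")
```

With these inputs the cancer product is 7,500,000 GB and the raw total is 22.5 PB, so the expectation of around 30 Petabytes sits above the sum of the per-genome figures alone.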
Huge Amount of Data
10 Billion Photos = 1.5 Petabytes
Data Processed in 1 day = 20 Petabytes
Pipelines
bertha_default 1.1.0
Single Sample QC & Processing
Analysis
Intake QC
Multi Sample QC
Cross Sample Contamination
Single-Sample QC Check Point
Identity by Descent
Mendelian Inconsistency Rate
Sex Check
Somatic VCF re-headering
Tumour Cross Sample Contamination
Cross Species Contamination
Depth of Coverage
Concordance check
Intake QC Check Point
Merge Array Genotypes
Multi-Sample QC Check Point
Consent Check Point
Variant Calling
Variant Normalisation
Tumour Ploidy
Tumour Purity
Tumour Clonality
Mutation Signature
Viral Insertions
Actionable Mutation Coverage
SNV & Indel Refinement
Mutation Burden
Inbreeding Coefficient
Homozygosity Runs
Variant Annotation
Variant Tiering
Interpretation Dispatch
Exomiser
Delivery API
Integrity Check
MD5 Check
Validate BAM Picard
Filtered Bamstats
Unfiltered Bamstats
Q30 Bamstats
VCF QC
Fix Permissions
Plot Filtered Bamstats
Generate Filtered Metrics Bamstats
Plot Unfiltered Bamstats
Generate Q30 Metrics Bamstats
QC Stats Post-processing
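The intake stage lists an integrity check and an MD5 check. As a rough illustration of what such a step involves (not the project's actual implementation), the sketch below streams a delivered file through Python's standard `hashlib` and compares the digest with the checksum supplied alongside it. The function names `md5_of_file` and `md5_matches` are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_hex):
    """True if the file's MD5 equals the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for files of around 100GB per genome.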
Workflow diagram
Data intake
Single Sample QC & Processing
Multi-sample QC
Analysis
Interpretation Request Dispatched
Interpretation API
Bertha: Distributed Workflow Management System
Interpretation Dispatch
Message Broker
Tracking DB
Job Scheduler
Dashboard
Delivery API
Auditor
Orchestrator
Grid Consumer
Oxford Bus
6 node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response times for regional queries
• Whole genome filtering queries for all individuals within seconds
OpenCGA: storage
Extensive capabilities to query across genotype and phenotype relationships
https://github.com/opencb/opencga
To be fully GA4GH compatible from v1.0
global data standards for Genomics - http://ga4gh.org/
Clinical data
+ 150 tables (+2000 variables)
• Administrative & Consent
• Clinical / medical reviews
• Imaging, blood & non-genetic tests
• Disease status and phenotype
• Family & pedigree
• Treatments and clinical history
Security and logs: CMCs access here
Catalog
Bioinformatics
Oxford
OpenCGA - Catalog
Metadata store and A&A for OpenCGA
• Manages roles, groups, ACLs
• Audit log
• LDAP integration
• Arbitrary schemas (annotation sets)
Cellbase: annotation
Reference Genomic data warehouse
• Compared in testing against VEP
  • More than 99.999% similarity in consequence types
• Phased annotation implemented for MNVs
• Initial structural variation annotation
• Can annotate 4-5 families per hour (>8000 variants/s) on a single database instance
• Will have (very soon) an R package similar to biomaRt
PanelApp
https://panelapp.extge.co.uk/crowdsourcing/PanelApp
Panel list
Platform for interpretation
● Filter and classify variants
● Well-defined rules, stable across the project
● General: it works for any family configuration
● Implemented using VCF/Cellbase or OpenCGA
● Based on GA4GH variant model
● Uses pedigrees as defined at Genomics England (based on PhenoTips format)
● Uses PanelApp as source of gene panels
Variant Tiering
Variant
Filters applied first:
• The variant allele is not commonly found in the general healthy population (set criteria for allele frequency filter)
• Familial segregation
• Allelic state matches known mode of inheritance for the gene and disorder (MOI required)
Impact of the variant?
• Known pathogenic (not implemented)
• Expected pathogenic (set criteria: transcript_ablation, splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant)
  • Is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 1. No: Tier 3.
• Other coding impact (set criteria: inframe_insertion, inframe_deletion, missense_variant, transcript_amplification, splice_region_variant, incomplete_terminal_codon_variant)
  • Is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 2. No: Tier 3.
• Other: does not fit any of the other criteria
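The tiering decision flow can be sketched as a small function. This is an illustration under stated assumptions, not the Genomics England implementation: the consequence terms come from the slide, but the function name `tier_variant`, the single `passes_filters` flag standing in for the allele frequency, segregation and mode of inheritance filters, and the choice to return `None` for variants in the "Other" branch are all hypothetical. The mapping of the Yes/No branches to tiers is read off the flattened diagram and should be treated as an assumption too.

```python
# Consequence terms listed on the slide as "expected pathogenic".
HIGH_IMPACT = {
    "transcript_ablation", "splice_donor_variant", "splice_acceptor_variant",
    "stop_gained", "frameshift_variant", "stop_lost", "initiator_codon_variant",
}

# Consequence terms listed on the slide as "other coding impact".
MODERATE_IMPACT = {
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "transcript_amplification", "splice_region_variant",
    "incomplete_terminal_codon_variant",
}


def tier_variant(consequences, gene, green_genes, passes_filters):
    """Return 'Tier 1', 'Tier 2', 'Tier 3', or None if the variant is not tiered.

    consequences   -- iterable of consequence terms for the variant
    gene           -- gene symbol the variant falls in
    green_genes    -- set of genes on the virtual gene panel (green list)
    passes_filters -- True if frequency, segregation and inheritance filters pass
    """
    if not passes_filters:
        return None
    consequences = set(consequences)
    in_panel = gene in green_genes
    if consequences & HIGH_IMPACT:
        return "Tier 1" if in_panel else "Tier 3"
    if consequences & MODERATE_IMPACT:
        return "Tier 2" if in_panel else "Tier 3"
    return None  # assumption: the "Other" branch is left untiered


panel = {"GENE_A"}  # hypothetical green list
print(tier_variant(["stop_gained"], "GENE_A", panel, True))
print(tier_variant(["missense_variant"], "GENE_A", panel, True))
print(tier_variant(["missense_variant"], "GENE_B", panel, True))
```

Checking the high-impact set before the moderate set means a variant with several annotated consequences is tiered by its most severe one.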
The Cancer Programme
Cancer
Which cancers?
• Lung
• Breast
• Colon
• Prostate
• Ovary
• Hematological malignancies (CLL)
• Pediatric Cancers
Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)
Why sequence?
• Disease of disordered genomes
• >200 driver genes known
• Stratified management/targeted therapy
• Complications: heterogeneity
Sequencing cancer genomes
Tumour genome: tumour variants
Germline genome: germline variants
Somatic variation = tumour variants minus germline variants
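The idea of somatic variation as what is seen in the tumour but not in the germline can be shown as a set difference. This is only a conceptual sketch: production somatic callers model tumour and normal reads jointly rather than subtracting two call sets, and the `Variant` type and `somatic_variants` function are hypothetical names.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """Minimal variant key: chromosome, 1-based position, ref and alt alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_variants(tumour: Iterable[Variant],
                     germline: Iterable[Variant]) -> Set[Variant]:
    """Variants called in the tumour sample that are absent from the germline sample."""
    return set(tumour) - set(germline)


germline_calls = [Variant("1", 1000, "A", "G"), Variant("2", 500, "C", "T")]
tumour_calls = germline_calls + [Variant("7", 42, "G", "A")]
print(somatic_variants(tumour_calls, germline_calls))
```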
Coverage
Germline samples: 35x coverage
• Rare Disease participants
• Cancer “Normal” samples
Cancer samples: 75x coverage
• Cancer “Tumour” samples
Why Higher Depth for Cancer?
• Normal contamination
• Clonality/heterogeneity
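A back-of-the-envelope calculation shows why normal contamination and clonality call for more depth. The sketch below assumes a heterozygous somatic variant in a region that is diploid in both tumour and normal cells, so the expected variant allele fraction is purity × clonal fraction / 2. The function names and the example purity and clonal fraction are illustrative assumptions, not project figures.

```python
def expected_vaf(purity, clonal_fraction):
    """Expected variant allele fraction for a heterozygous somatic variant.

    purity          -- fraction of cells in the sample that are tumour cells
    clonal_fraction -- fraction of tumour cells that carry the variant
    Assumes the locus is diploid in both tumour and normal cells.
    """
    if not (0.0 <= purity <= 1.0 and 0.0 <= clonal_fraction <= 1.0):
        raise ValueError("purity and clonal_fraction must be between 0 and 1")
    return purity * clonal_fraction / 2.0


def expected_alt_reads(depth, purity, clonal_fraction):
    """Expected number of reads supporting the variant at a given mean depth."""
    return depth * expected_vaf(purity, clonal_fraction)


# Example: 50% tumour purity, variant present in 40% of tumour cells.
for depth in (35, 75):
    reads = expected_alt_reads(depth, purity=0.5, clonal_fraction=0.4)
    print(f"{depth}x: about {reads:.1f} supporting reads expected")
```

In this example the variant allele fraction is 10%, so 35x gives only three or four supporting reads on average, while 75x roughly doubles that.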
Cancer Pilot
Fresh Frozen vs Formalin-fixed, paraffin-embedded (FFPE) tumour samples
• Resections/biopsies are routinely fixed in formalin and embedded in paraffin
  • Causes DNA damage
  • Difficult to extract DNA
• Fresh frozen is logistically difficult & not trusted to maintain morphology
Cancer Pilot
Problems with FFPE
• Difficulty in obtaining long fragments
• “Random” DNA damage
• “Cross-links” DNA, which can be reversed, but currently only at high temperatures
• Chimeric fragments in library preparation
• Heat: repetitive regions re-anneal, causing chimeric reads; GC-rich regions are more robust
Read Alignment
CG Content
FF Copy Number Data
FFPE Copy Number Data
Fraction of overlapping SNVs in FF and FFPE samples from 5 trios
Improving FFPE Sequencing
What can we do?
Procedure → Fixation → DNA Extraction → Library Preparation
• Cold ischaemic time
• Storage conditions
• Time of fixation
• Size of sample
• pH of fixative
• Temperature of de-crosslinking
• Addition of salt
Cancer reports
• Quality metrics pre- and post-sequencing
• A small number of clinically actionable mutations
• Germline results which affect cancer development
• Remainder of results are mostly of research interest for now, but in future may assist:
  • Drug development
  • Targeted treatment selection
  • Prediction of prognosis
  • Monitoring of disease progression
Rare Disease Programme
The case for whole genomes
• Severe intellectual disability occurs in 0.5% of newborns
• Whole-genome sequencing at 80x in 50 parent-offspring trios with no diagnosis for their severe intellectual disability
• Overall 62% increase in diagnostic yield with WGS
• Most diagnoses were for de novo dominant mutations, roughly equally divided between SNVs and CNVs
Gilissen et al. (2014), Nature. PMID: 24896178
Why make a genetic diagnosis?
For a patient with rare disease
• Understand why their condition happened
• More accurate knowledge of how it might develop in future
• Possible treatment avenues
• Early intervention may help avoid disability
• Contact with others with the same condition
For the family
• Predict whether family members will get the condition
• Offer screening/treatment to prevent it
• Reproductive decisions
For medical research
• Further our understanding of disease mechanisms
• Novel drug development or drug repurposing
Rare disease programme
• Over 200 disorders so far
Data model: describes the clinical information to be collected for each disorder
Disorders nominated by the NHS and academia
Eligibility & exclusion criteria for recruitment: rare, Mendelian, unmet clinical diagnostic need, prior genetic testing
Virtual Gene panel to aid analysis
Challenges
• Equity of diseases for inclusion
• Tightness of criteria for patient inclusion
• Equity of WGS consumption per phenotype
The biggest challenge?
Interpretation
• ~5-10 million variants in our genome
• ~3.5 million “known” SNPs
• ~0.5 million “novel” SNPs
• ~0.5 million small indels
• ~1000 large (>500bp) CNVs
• ~20,000-25,000 coding variants
• ~9,000-11,000 non-synonymous
• 92 rare missense variants (MAF <0.1%)
• 5 rare truncating variants (MAF <0.1%)
• 0-2 de novo variants
What information is needed?
To aid interpretation of variants
• Allele frequency: How common is the variant in the ‘healthy’ population?
• Familial segregation: Is the variant present in the family members with the disorder, and not in those without it?
• Mode of inheritance: Does the pattern fit with the inheritance within the family and what is known about the gene?
• Likely consequence: Does the variant cause a change in the protein sequence likely to affect function?
• Gene panel: Is the variant in a gene associated with causing the disorder?
• Known pathogenicity? Has the variant been seen before in people with the same disease?
Rare Diseases
Gender
• X chromosome homozygosity, Y chromosome genotyping rate
• Copy number for X and Y chromosomes
Relatedness
• Mendelian error checking for parent-child pairs
• IBD sharing estimation for all participants
Inbreeding / excess homozygosity
• Observed vs expected homozygosity
Ancestry
• Multidimensional scaling
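The parent-child Mendelian error check can be illustrated with a short sketch. It assumes autosomal genotypes represented as pairs of allele indices (for example `(0, 1)` for a heterozygote) with `None` for missing calls; for a single parent-child pair, a site is inconsistent when the two genotypes share no allele. The function names are hypothetical and this is not the project's pipeline code.

```python
def is_mendelian_error(parent_gt, child_gt):
    """True if parent and child share no allele at this site.

    A child inherits one allele from each parent, so for a true parent-child
    pair at least one allele must be shared (e.g. parent 0/0 with child 1/1
    is inconsistent).
    """
    return not (set(parent_gt) & set(child_gt))


def mendelian_error_rate(parent_gts, child_gts):
    """Fraction of sites, among those called in both samples, that are inconsistent."""
    pairs = [(p, c) for p, c in zip(parent_gts, child_gts)
             if p is not None and c is not None]
    if not pairs:
        return 0.0
    errors = sum(is_mendelian_error(p, c) for p, c in pairs)
    return errors / len(pairs)


parent = [(0, 0), (0, 1), (1, 1), None]
child = [(1, 1), (0, 0), (0, 1), (0, 1)]
print(mendelian_error_rate(parent, child))
```

A rate close to zero is what a genuine parent-child pair should show; a high rate suggests a sample swap or a misreported relationship.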
Genetic data checks and analyses
Dr Katherine Smith, Lead Analyst for Rare Disorders (Bioinformatics)
Rare Disease Pilot
4,800 people
Primary Data
• 4,128 participants' data cleansed (15,065 including family members)
• 149 different conditions
• 56,004 HPO terms used
  • 12,966 terms present
  • 43,088 terms absent
Secondary Data
• Hospital Episodes
  • 250,000 records
  • 11,910 - Accident Dept
  • 37,479 - Inpatient
  • 199,418 - Outpatients
Rare disease pilot – 4,919 samples
Relatedness checking
Georgia
Georgia and her family. Image courtesy of Great Ormond Street Hospital.
• Undiagnosed condition that included physical and mental developmental delay, a rare eye condition affecting sight, impaired kidney function, verbal dyspraxia.
• Through enrolling in the project, a mutation in a single gene was found in Georgia’s genome which is likely to be the cause of her condition.
• Provides a molecular diagnosis for her condition for the first time.
Maria Bitner-Glindzicz – Great Ormond Street Hospital
http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/
Jessica
Jessica and her family. Image courtesy of Great Ormond Street Hospital.
Mum, Kate Palmer:
“Now that we have this diagnosis there are things that we can do differently almost straight away. Her condition is one that has a high chance of improvement on a special diet, which means that her medication dose is likely to decrease and her epilepsy may be more easily controlled. Hopefully she might have better balance so she can be more stable and walk more…”
“…More than anything the outcome of the project has taken the uncertainty out of life for us and the worry of not knowing what was wrong. It has allowed us to feel like we can take control of things and make positive changes for Jessica. It may also open doors to other research projects that we can go on. These could be more specific to her condition and we are hopeful that they could one day find a cure.”
http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/
Thank you!