Elastic{on} 2015 talk on the use of Elasticsearch to index genetic variation in cancer.
The Search for Cancer's Causes and Cures
Wade L. Schulz, MD, PhD
Yale University, Department of Laboratory Medicine
{ } CC-BY-ND 4.0
Cancer Statistics: An Improving Outlook?
{ 1 }
Figure: cancer incidence and mortality, rate per 100,000 (axis from 0 to 600)
{ 2 }
Precision Medicine
Tailoring medical therapy to a particular patient's characteristics
Presentation to Precision Care
{ 3 }
Images adapted from Servier Medical Art, CC-BY
When Cells Go Bad
{ 4 }
Genetics in 60 Seconds
{ 5 }
Genetics in 60 Seconds
{ 6 }
Searching for Mutations
{ 7 }
Gels and Capillaries
Next Generation Sequencing
{ 8 }
Massively Parallel
NGS: The Technology
{ 9 }
Figure: cost per genome compared with Moore's Law, Sep-01 to May-14 (cost axis from $1 to $100,000,000)
Cost of Sequencing
{ 10 }
Bases to Bytes
23 chromosomes, 21,000 genes, 3,300,000,000 base pairs
3.3e9 bases × 2 bits = 825 MB/sequence
With metadata: 150 GB/sequence
3,000,000 variants/genome
{ 11 }
How big is the genome?
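The 825 MB figure follows from storing each of the four bases in 2 bits. A minimal sketch of that arithmetic, plus a toy packer, is below. The bit assignment in `BASE_CODES` and both function names are assumptions made for illustration; the slides do not specify an encoding.

```python
# Assumed 2-bit code for the four bases (hypothetical; not taken from the slides).
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def packed_size_mb(n_bases, bits_per_base=2):
    """Return the size in megabytes (10**6 bytes) of n_bases at bits_per_base each."""
    return n_bases * bits_per_base / 8 / 1_000_000


def pack_bases(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte, first base in the high bits."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= BASE_CODES[base] << (6 - 2 * (i % 4))
    return bytes(packed)


# 3.3e9 bases at 2 bits each
print(packed_size_mb(3_300_000_000))
```

This only covers the raw bases. The 150 GB/sequence figure on the slide includes metadata, which this sketch does not model.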
What are the Problems?
Constantly evolving data schema
Ability to integrate diverse data silos
Rapidly increasing needs for data storage
Need for easy, flexible analysis
{ 12 }
Why Elasticsearch?
- Rapid on-premise and cloud installations
- Dynamic schema that supported clinical results and annotation data
- Availability of libraries for multiple languages (NEST, elasticsearch-py)
- Tool availability (Kibana, Shield)
It's great!
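To illustrate the dynamic-schema point, the sketch below serializes two differently shaped documents into the newline-delimited body that the Elasticsearch `_bulk` endpoint accepts. It uses only the standard library and does not contact a cluster. The index name `variants`, the document IDs, the helper name, and the fields of the annotation document are hypothetical; the clinical-result fields are borrowed from the variant document shown later in the deck.

```python
import json


def to_bulk_ndjson(index_name, documents):
    """Serialize (doc_id, doc) pairs into a newline-delimited _bulk request body."""
    lines = []
    for doc_id, doc in documents:
        # Action line, followed by the document source on its own line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


# Two document shapes that could share one index (values are illustrative).
clinical_result = {"chromosome": "chr7", "position": 148506396, "type": "snv", "gene": "EZH2"}
annotation = {"chromosome": "chr7", "position": 148506396, "source": "ClinVar",
              "note": "illustrative placeholder"}

body = to_bulk_ndjson("variants", [("1", clinical_result), ("2", annotation)])
print(body)
```

With a client library such as elasticsearch-py, the same documents would normally be sent through the library's bulk helper rather than by building the body by hand.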
{ 13 }
Sequencing and Interpretation Pipeline
{ 14 }
Gene Sequencing
Sequence Alignment
Quality Assurance
Variant Annotation
Clinical Interpretation
Clinical Trial Eligibility
Research Management
{galileo} {kepler}
{galileo} {galileo/kepler} {galileo/kepler}
What's in a Variant?
{ 15 }
60G6V:01053:03044 16 chr1 161383 0 16M * 0 0 TTTGCCAGAAAGCAAG
)///7;;6*669:1:5 ZP:B:f,0.00279573,0.0054005,2.19516e-07
ZM:B:s,244,0,242,0,0,242,2,270,494,300,0,248,36,0,0,0,272,0,204,272,398,248,246,268,270,0,0,0,302,0,0,0,550,
38,44,194,14,32,204,2,666,212,222,494,2,2,238,630,92,220,4,102,438,2,60,384,2,76,2,2,294,394,34 ZF:i:28
RG:Z:60G6V. PG:Z:tmap MD:Z:16 NM:i:0 AS:i:16 XA:Z:map4-1 XS:i:16
60G6V:00605:00113 0 chr1 415215 2 8M5I31M3S * 0 0
CCAGCCTGGGTGCGTGACAGAGCAAGACTCCGTCTAAAAAGAAAGGT
B
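The records above are in SAM alignment format: eleven mandatory columns followed by optional `TAG:TYPE:VALUE` fields. A minimal parsing sketch is shown below. It splits on whitespace because the slide text has lost its tabs (real SAM files are tab-delimited), and the example line keeps only a few of the optional tags from the first record. The names `SamRecord`, `parse_sam_line`, and `is_reverse_strand` are hypothetical.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence (chromosome)
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read bases
    qual: str    # per-base quality string
    tags: dict   # optional fields keyed by tag name


def parse_sam_line(line):
    """Parse one SAM alignment line into a SamRecord."""
    fields = line.split()
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags = {}
    for field in fields[11:]:
        name, type_code, value = field.split(":", 2)
        # Integer tags (type 'i') are converted; everything else stays a string.
        tags[name] = int(value) if type_code == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10], tags)


def is_reverse_strand(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


EXAMPLE = ("60G6V:01053:03044 16 chr1 161383 0 16M * 0 0 "
           "TTTGCCAGAAAGCAAG )///7;;6*669:1:5 PG:Z:tmap MD:Z:16 NM:i:0 AS:i:16")
record = parse_sam_line(EXAMPLE)
print(record.rname, record.pos, record.cigar, is_reverse_strand(record))
```

For production work an established library would be the usual choice; this sketch only shows how the columns on the slide map to named fields.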
What's in a Variant?
{ 16 }
{
  "chromosome": "chr7",
  "position": 148506396,
  "type": "snv",
  "refAllele": "A",
  "altAllele": "C",
  "totalReads": 1998,
  "forwardReads": 1038,
  "forwardRefReads": 524,
  "forwardAltReads": 514,
  "reverseReads": 960,
  "reverseRefReads": 500,
  "reverseAltReads": 460,
  "refReads": 1024,
  "altReads": 974,
  "vaf": 48.749,
  "variantRegion": "intronic",
  "variantEffect": "",
  "snvEffect": "A>C",
  "gene": "EZH2"
}
- Variant location in genome
- Nucleotide change
- Sequencing statistics
- Variant prevalence in specimen
- Variant coding/protein effects
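The sequencing statistics in the document above are internally consistent, and the `vaf` value matches alternate reads as a percentage of total reads. The sketch below reproduces that check. The function names are hypothetical, and the percentage formula is inferred from the slide's numbers rather than stated by the author.

```python
def variant_allele_fraction(ref_reads, alt_reads):
    """Return alternate-allele reads as a percentage of all reads at the position."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return 100.0 * alt_reads / total


def read_counts_consistent(doc):
    """Check that the per-strand and per-allele counts add up to the totals."""
    return (
        doc["forwardRefReads"] + doc["forwardAltReads"] == doc["forwardReads"]
        and doc["reverseRefReads"] + doc["reverseAltReads"] == doc["reverseReads"]
        and doc["forwardReads"] + doc["reverseReads"] == doc["totalReads"]
        and doc["refReads"] + doc["altReads"] == doc["totalReads"]
    )


# Read counts copied from the variant document on the slide.
variant = {
    "totalReads": 1998, "forwardReads": 1038, "forwardRefReads": 524,
    "forwardAltReads": 514, "reverseReads": 960, "reverseRefReads": 500,
    "reverseAltReads": 460, "refReads": 1024, "altReads": 974,
}

print(read_counts_consistent(variant))
print(round(variant_allele_fraction(variant["refReads"], variant["altReads"]), 3))
```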
{Elastic} Searching for Meaning
{ 17 }
Azure Elasticsearch
Local SQL and Elasticsearch
OMIM
COSMIC
dbSNP
ClinVar
Public Databases
Sequencers, Variant Analysis, Effect Prediction
Public Variant Data
Private Variant Data
{Elastic} Searching for Meaning
{ 18 }
OMIM
COSMIC
dbSNP
ClinVar
Public Databases
Sequencers, Variant Analysis, Effect Prediction
Public Variant Data
Private Variant Data
MVC Application (NEST)
Kibana Drilldown
{ 19 }
Rapid population stats
Physicians/researchers can quickly analyze data
Integration with health record
Demographics
Laboratory testing
Comorbidities
Treatment information
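Population statistics of this kind typically come from an aggregation query. The sketch below builds a request body that counts variants per gene with a `terms` aggregation, alongside a local equivalent that shows what the aggregation computes. It does not contact a cluster. The field names `type` and `gene` follow the variant document shown earlier and are assumed to be indexed as exact (non-analyzed) values; the function names and the sample documents are hypothetical.

```python
from collections import Counter


def variants_per_gene_query(variant_type, top_n=10):
    """Build a search body that counts variants of one type per gene."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {"term": {"type": variant_type}},
        "aggs": {"by_gene": {"terms": {"field": "gene", "size": top_n}}},
    }


def count_by_gene(documents, variant_type):
    """Local equivalent of the aggregation, for illustration only."""
    return Counter(d["gene"] for d in documents if d["type"] == variant_type)


# Illustrative documents; not real patient data.
sample = [
    {"gene": "EZH2", "type": "snv"},
    {"gene": "EZH2", "type": "snv"},
    {"gene": "TP53", "type": "snv"},
    {"gene": "TP53", "type": "indel"},
]

print(variants_per_gene_query("snv"))
print(count_by_gene(sample, "snv"))
```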
Kibana Drilldown
{ 20 }
Service Integration
{ 21 }
Predictive Algorithms
Quality Assurance
Variant Database
Clinical Interpretation System
Web Service Interfaces
Custom Validation Scripts
Third-Party Data Analysis Software
Data Sharing
{ 22 }
Variant Database
Clinical Interpretation System
Web Service Interfaces
Conclusions
Clinical implications
- Genetic sequencing and clinical consultation complete within one week of biopsy
- Integrated multiple analysis pipelines for clinical interpretation and research applications
- Frequently identify patients eligible for clinical trials
System statistics
- Two Elasticsearch clusters
- Over 60 million variant annotations
- Nearly 10 million documents related to cancer-associated mutations
- Kibana and custom web applications using NEST for data visualization
{ 24 }
Thank you!
Wade L. Schulz, MD, PhD
http://www.wadeschulz.com
Many images adapted from Servier Medical Art, CC-BY
Henry Rinder MD, Richard Torres MD, Christopher Tormey MD, Brian Smith MD, John Howe PhD,
Karl Hager PhD, Rodion Rathbone MD, Nathaniel Price, Alexa Siddon MD
This work is licensed under the Creative Commons
Attribution-NoDerivatives 4.0 International License.
To view a copy of this license, visit:
http://creativecommons.org/licenses/by-nd/4.0/
or send a letter to:
Creative Commons
PO Box 1866
Mountain View, CA 94042
USA
{ 25 }