spark-summit
Which needle is the right one?
Every human carries between 200 and 800 mutations that disrupt the function of a gene.
The human genome is 3 billion letters long.
Finding genetic underpinnings for diseases and phenotypic traits
What are the biological mechanisms?
Who is at risk for a disease?
How to prevent and treat?
5319 talented staff
$1 billion+ budget
Working with 2800+ industry partners
55 sites across Australia
Top 1% of global research agencies
Each year 6 CSIRO technologies contribute $5 billion to the economy
Genome Wide Association Studies
Image courtesy of Pasieka, Science Photo Library
1000+ samples
Relatively common (> 1%)
~500,000 SNPs
Look at the data
Typical GWAS: 1M variants x 5K samples
Full genome: 80M variants x 2.5K samples
[Figure: a genotype matrix of variants (10^6) x samples (10^3) with entries 0, 1 or 2. The variants x samples matrix is transposed so that the variants become the predictors, which are then associated with a 1 x samples response vector (shown as D N D … N).]
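The reshaping on this slide can be illustrated with a toy example. The sketch below uses invented genotype values and hypothetical response labels ("D" and "N"); it only shows the transpose from the variants x samples orientation to a samples x variants table of predictors, not how VariantSpark stores its data.

```python
# Toy genotype matrix in the variants x samples orientation:
# one row per variant, one column per sample. Entries count copies of the
# alternate allele (0, 1 or 2). All values are invented for illustration.
variants_by_samples = [
    [0, 1, 0, 2],  # variant v1 across 4 samples
    [1, 1, 0, 0],  # variant v2
    [0, 2, 1, 1],  # variant v3
]
response = ["D", "N", "D", "N"]  # hypothetical labels, one per sample

# Transpose so that each row is a sample and each column is a predictor.
samples_by_variants = [list(row) for row in zip(*variants_by_samples)]

for features, label in zip(samples_by_variants, response):
    print(features, "->", label)
```

After the transpose, each row pairs one sample's predictors with its response, which is the shape most learning algorithms expect.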
[Figure: samples (0 to 50,000) against variants (100,000 to 100,000,000) for GWAS studies and 1000 Genomes.]
[Figure: GWAS studies and associations per year, 2008 to 2015.]
2713 studies, 31183 associations
Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
The NHGRI-EBI Catalog of published genome-wide association studies
Missing Heritability
Manolio et al. (2009) Finding the missing heritability of complex diseases
… human height heritability is ~80% yet more than 40 associated loci explain only about 5% of phenotypic variance …
“Dark matter” of genomics
Epistasis
The traditional approach to interaction modeling ’squares’ the problem size:
500,000 SNPs → ~100,000,000,000 pairs
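The pair count can be checked with a short calculation. This sketch counts unordered pairs of distinct SNPs, which is one reasonable reading of "pairs" on the slide.

```python
import math

n_snps = 500_000

# Number of unordered pairs of distinct SNPs: n * (n - 1) / 2.
n_pairs = math.comb(n_snps, 2)

print(f"{n_pairs:,}")  # 124,999,750,000
```

That is roughly 1.25 x 10^11, the same order of magnitude as the slide's figure.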
Random Forest to the rescue
Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests
Breiman (2001) Random Forests. Machine Learning
Random Forest in GWAS
• Non-parametric and arbitrarily expressive
• Insensitive to outliers and non-informative predictors
• Stable performance – no overfitting
• Easy to tune
• Built in error estimate (OOB error)
• Variable importance measures
• Ability to deal with heterogeneous data
• Easy to parallelize and scale on HPC
Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
RF is an appropriate candidate to capture the genetic heterogeneity underlying the trait because RF itself is an ensemble of many heterogeneous trees built from uncorrelated subsamples of the original data
VariantSpark
[Figure: time in seconds (0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task: binary-conversion, clustering, pre-processing.]
It can cluster 3000 individuals and 80 million variants
O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
Transformational Bioinformatics Team
Natalie Twine, Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson
Software
Open source (MIT) @ https://github.com/csirobigdata/variant-spark
Random Forest SparkML
Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
• Failing for millions of variables
• Relatively slow
“Cursed Forest”
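[Figure: the driver broadcasts the initial sample (node 1) and then the split subsets (nodes 2,1 and 2,2) to the executors, which hold the variables v1, v2, v3, …, vn. Each executor returns its local best split (var, point); aggregating these on the driver gives the global best split for each node: (var1, point1), then (var21, point21) and (var22, point22).]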
Partition data by variables (columns)
• Columns are “small” – easy to partition
• An executor can find (an exact) best split for many variables
• Finding global best split is efficient
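The local-then-global search described above can be sketched in plain Python. This is a simplified, single-process illustration and not VariantSpark's code: the two "executors" are just dictionaries of columns, Gini impurity is assumed as the split criterion, and all names and data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def local_best_split(columns, labels):
    """Best (impurity, variable, threshold) among the columns one executor holds."""
    best = None
    n = len(labels)
    for var, values in columns.items():
        # Candidate splits have the form "value <= threshold".
        for threshold in sorted(set(values))[:-1]:
            left = [y for x, y in zip(values, labels) if x <= threshold]
            right = [y for x, y in zip(values, labels) if x > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            candidate = (impurity, var, threshold)
            if best is None or candidate < best:
                best = candidate
    return best

# Invented data: 6 samples, 4 variables with genotypes 0, 1 or 2.
labels = [1, 1, 0, 0, 1, 0]
data = {
    "v1": [0, 1, 0, 1, 2, 0],
    "v2": [2, 2, 0, 0, 1, 0],
    "v3": [1, 0, 1, 0, 0, 1],
    "v4": [0, 1, 1, 2, 0, 1],
}

# Partition by columns: each hypothetical executor holds whole variables.
partitions = [
    {"v1": data["v1"], "v2": data["v2"]},
    {"v3": data["v3"], "v4": data["v4"]},
]

# Each executor reports one small tuple; the driver keeps the minimum.
local_splits = [local_best_split(p, labels) for p in partitions]
global_best = min(s for s in local_splits if s is not None)
print(local_splits)
print(global_best)
```

Because each executor sees complete columns, its local search is exact, and the driver only has to compare one small tuple per executor.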
Some implementation tricks
• “Native” data shape
– VCF files are organized by “variables”
• Building by levels and tree batching
– Minimize communication overhead and the number of stages
• Optimized split finding for ordered factors
– The most frequent operation
– Java implementation faster than Scala
• Choice of data representation
– byte representation for variant data
– with sparsity 0.75 a sparse vector is 3x bigger than a byte array
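The 3x figure can be reproduced with simple arithmetic. The sketch assumes a sparse vector that stores a 4-byte integer index and an 8-byte double for every non-zero entry and ignores object overhead; these sizes are assumptions made for illustration, not numbers stated on the slide.

```python
def byte_array_size(n_samples):
    """Bytes needed to store one genotype per byte."""
    return n_samples

def sparse_vector_size(n_samples, sparsity, index_bytes=4, value_bytes=8):
    """Bytes needed when only non-zero entries are stored as (index, value).

    index_bytes and value_bytes are assumed sizes (int index, double value).
    """
    non_zero = round(n_samples * (1.0 - sparsity))
    return non_zero * (index_bytes + value_bytes)

n = 10_000
dense = byte_array_size(n)
sparse = sparse_vector_size(n, sparsity=0.75)
print(dense, sparse, sparse / dense)  # 10000 30000 3.0
```

With a quarter of the entries non-zero and 12 bytes per stored entry, the sparse form costs 3 bytes per sample on average against 1 byte for the byte array.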
Big data performance
• Yarn Cluster (12 workers)
– 16 x Intel Xeon [email protected] CPU
– 128 GB of RAM
• Spark 1.6.1 on YARN
– 128 executors
– 6GB / executor (0.75TB)
• Synthetic dataset (mtry = 0.25)
Typical GWAS range
100K trees: 5 – 50h
AWS: ~$215.50
Whole genome range
100K trees: 200 – 2000h
AWS: ~$8620.00
50M variables x 10k samples!
Other features
• Various input formats
– VCF, CSV, parquet
• A variety of RF (fine) tuning parameters
– Sampling
– Depths
– Splitting
• Insight into RF model
– cumulative OOB error
– per tree variable importance
– per tree OOB predictions
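To show how per tree OOB predictions relate to a cumulative OOB error, here is a minimal sketch. It assumes classification with majority voting over the trees for which a sample was out-of-bag; the function name, the input layout and the data are hypothetical and do not reflect VariantSpark's output format.

```python
from collections import Counter

def cumulative_oob_error(per_tree_oob, truth):
    """Error of the majority OOB vote after each additional tree.

    per_tree_oob: one dict per tree mapping sample id -> predicted label,
                  containing only the samples that were out-of-bag for that tree.
    truth:        dict mapping sample id -> true label.
    """
    votes = {sample: Counter() for sample in truth}
    errors = []
    for tree_predictions in per_tree_oob:
        for sample, label in tree_predictions.items():
            votes[sample][label] += 1
        # Only samples with at least one OOB vote so far can be scored.
        scored = [s for s in truth if votes[s]]
        wrong = sum(
            1 for s in scored if votes[s].most_common(1)[0][0] != truth[s]
        )
        errors.append(wrong / len(scored) if scored else float("nan"))
    return errors

# Invented example: 4 samples, 3 trees.
truth = {"s1": 1, "s2": 0, "s3": 1, "s4": 0}
per_tree_oob = [
    {"s1": 1, "s2": 1},  # tree 1: s2 is misclassified
    {"s3": 1, "s4": 0},  # tree 2
    {"s1": 1, "s3": 1},  # tree 3
]
errors = cumulative_oob_error(per_tree_oob, truth)
print(errors)  # [0.5, 0.25, 0.25]
```

Plotting such a sequence against the number of trees is a common way to judge whether adding more trees still improves the forest.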
Simulated Data Study
• Synthetic dataset of 2.5M variables and 5000 samples
• 5 informative variables with dichotomous response
• Compare RF importance ranking with the model
• Rank-biased overlap (RBO) – measure of ranking overlap (with emphasis on highly ranked elements)
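A small sketch of rank-biased overlap may help. It implements the truncated form of RBO, the weighted sum of prefix agreements up to the length of the shorter list, with a persistence parameter p that is assumed here to be 0.9. The slides do not say which RBO variant or which p was used, so treat both as assumptions; the optional normalization rescales the score so that identical rankings give 1.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9, normalize=False):
    """Truncated rank-biased overlap of two rankings without repeated items.

    At each depth d the agreement is |top-d of A intersect top-d of B| / d,
    weighted by p**(d - 1), so agreement near the top counts most.
    """
    depth = min(len(ranking_a), len(ranking_b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    score *= 1 - p
    if normalize:
        # Identical rankings reach exactly 1 - p**depth before normalization.
        score /= 1 - p ** depth
    return score

# Hypothetical rankings of the five informative variables.
model_ranking = ["w_1", "w_2", "w_3", "w_4", "w_5"]
rf_ranking = ["w_1", "w_2", "w_4", "w_3", "w_5"]

print(rbo_truncated(model_ranking, model_ranking, normalize=True))
print(rbo_truncated(model_ranking, rf_ranking, normalize=True))
```

Swapping two items lower in the list costs less than swapping the top two, which is the emphasis on highly ranked elements that the slide mentions.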
[Figure: RBO (axis 0 to 1.5) for w_1 to w_5.]
Bone Mineral Density Study
• Osteoporotic fracture is a leading cause of morbidity and mortality, particularly amongst the elderly.
• In 2004 ten million Americans were estimated to have osteoporosis, resulting in 1.5 million fractures per annum.
• Hip fracture is associated with a one year mortality rate of 36% in men and 21% in women
Burden of disease of osteoporotic fractures overall is similar to that of colorectal cancer and greater than that of
hypertension and breast cancer
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
Bone Mineral Density Study
• 2036 samples & 288,768 SNPs
• Replicates 21 of 26 known associated genes
• Identifies 2 novel loci (known association with BMD)
• Provides strong evidence for further 4 loci
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
BMD - VariantSpark Results
Known BMD locations have significantly higher ranking (Mann-Whitney U, p = 1.3e-7)
A few novel highly ranked locations with plausible association with BMD: COLEC10, PRODH
Not replicated DCDC5 ranked 9,667 out of 10,000
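For readers unfamiliar with the test, the following sketch shows how a Mann-Whitney U comparison of two sets of ranks works. It uses a normal approximation without continuity or tie correction, and the rank values are invented; it does not reproduce the p-value reported above.

```python
import math

def mann_whitney_u(x, y):
    """U statistic for x against y with a normal-approximation p-value.

    U counts the pairs (a, b), a from x and b from y, with a < b;
    ties count one half. No continuity or tie correction is applied.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    n1, n2 = len(x), len(y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p_two_sided

# Invented importance ranks (1 = most important).
known_loci_ranks = [3, 8, 15, 21, 40]
other_loci_ranks = [55, 60, 72, 90, 120, 150, 300]

u, z, p = mann_whitney_u(known_loci_ranks, other_loci_ranks)
print(u, round(z, 3), round(p, 4))
```

Here every known locus outranks every other locus, so U reaches its maximum of n1 x n2 and the p-value is small.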
Future work and directions
Technical
Compare and ’merge’ with yggdrasil
Deployment on cloud platforms
Further performance improvements
Functional
Implementation of cutting edge research
Integration within genomics platforms (GATK4)
More ML algorithms
Research
Applications
Data science research
Gradient Boosted Trees
References
1. Aleesha Bates (2016) Practical aspects of GWAS Association studies under statistical genetics and GenABEL hands-on tutorial
2. Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
3. The NHGRI-EBI Catalog of published genome-wide association studies
4. Manolio et al. (2009) Finding the missing heritability of complex diseases
5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests
6. Breiman (2001) Random Forests. Machine Learning
7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
8. Danecek et al. The Variant Call Format and VCFtools
9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
Conclusions
Apache Spark is a feasible platform for machine learning in population-scale genomics.
VariantSpark with CursedForest is a promising alternative to traditional GWAS approaches.
Data shape, type, etc. matter – different optimizations are needed.
Thank You
Email: [email protected]
https://github.com/csirobigdata/variant-spark