Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul

FINDING NEEDLES IN GENOMIC HAYSTACKS WITH “WIDE” RANDOM FOREST

Piotr Szul, CSIRO Data61

Which needle is the right one?

All humans carry between 200 and 800 mutations that disrupt the function of a gene.

The human genome is 3 billion letters long.

Finding genetic underpinnings for diseases and phenotypic traits

What are the biological mechanisms?

Who is at risk for a disease?

How to prevent and treat?

5319 talented staff

$1 billion+ budget

Working with over 2800 industry partners

55 sites across Australia

Top 1% of global research agencies

Each year 6 CSIRO technologies contribute $5 billion to the economy

Agenda
• Intro to Genome Wide Association Studies
• Variant Spark and “Cursed Forest”
• GWAS use-cases

Genome Wide Association Studies

Image courtesy of Pasieka, Science Photo Library

1000+ samples

Relatively common > 1%

~500,000 SNPs

Look at the data

Typical GWAS: 1M variants x 5K samples

Full genome: 80M variants x 2.5K samples

[Figure: genotype matrix with entries 0, 1 or 2, laid out as variants (10^6) x samples (10^3) and then transposed; the transposed matrix supplies the predictors, a 1 x samples vector (“DND.N”) supplies the response, and the two are associated.]
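As a rough illustration of this data layout, the sketch below builds a tiny genotype matrix in plain Python with variants as rows and samples as columns, then transposes it so that each sample becomes a row of predictors. All values, labels and names are made up for illustration, and reading 0/1/2 as a count of alternate alleles is the usual convention rather than something stated on the slide.

```python
# Toy genotype matrix: one row per variant, one column per sample.
# Entries 0/1/2 are assumed to count alternate alleles (illustrative values).
variants_by_samples = [
    [0, 1, 0, 1],  # variant v1 across 4 samples
    [1, 1, 1, 1],  # variant v2
    [0, 0, 0, 2],  # variant v3
]

# Hypothetical response, one label per sample (1 = case, 0 = control).
response = [1, 0, 1, 0]


def transpose(matrix):
    """Turn a variants x samples matrix into samples x variants."""
    return [list(column) for column in zip(*matrix)]


samples_by_variants = transpose(variants_by_samples)

# Each sample is now a row of predictors paired with its response label.
for predictors, label in zip(samples_by_variants, response):
    print(predictors, "->", label)
```

In a real study the matrix is far wider than it is long (millions of variants, thousands of samples), which is why the talk calls the problem “wide”.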

[Figure: number of samples (axis 0 to 50,000) against number of variants (axis 100,000 to 100,000,000) for GWAS studies and 1000 Genomes.]

GWAS Studies

[Figure: associations and studies by year, 2008 to 2015 (axis 0 to 12000).]

2713 studies, 31183 associations

Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population

The NHGRI-EBI Catalog of published genome-wide association studies

Missing Heritability

Manolio et al. (2009) Finding the missing heritability of complex diseases

… human height heritability is ~80% yet more than 40 associated loci explain only about 5% of phenotypic variance …

“Dark matter” of genomics

Epistasis

Traditional approach for interaction modeling ‘squares’ the problem size

500,000 SNPs → ~100,000,000,000 pairs
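The pair count can be checked with a quick calculation. The sketch below counts unordered SNP pairs, which comes to roughly 1.25 x 10^11, the same order of magnitude as the figure quoted on the slide. The variable names are illustrative only.

```python
n_snps = 500_000

# Number of unordered SNP pairs: n choose 2 = n * (n - 1) / 2.
n_pairs = n_snps * (n_snps - 1) // 2

print(f"{n_pairs:,}")  # 124,999,750,000
```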

Random Forest to the rescue

Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests

Breiman (2001) Random Forests. Machine Learning

Random Forest in GWAS
• Non-parametric and arbitrarily expressive
• Insensitive to outliers and non-informative predictors
• Stable performance – no overfitting
• Easy to tune
• Built in error estimate (OOB error)
• Variable importance measures
• Ability to deal with heterogeneous data
• Easy to parallelize and scale on HPC

Sun (2010) Multigenic Modeling of Complex Disease by Random Forests

RF is an appropriate candidate to capture the genetic heterogeneity underlying the trait because RF itself is an ensemble of many heterogeneous trees built from uncorrelated subsamples of the original data

VariantSpark

[Figure: time in seconds (axis 0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task: binary conversion, clustering, pre-processing.]

It can cluster 3000 individuals and 80 million variants

O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information

Natalie Twine, Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul

Transformational Bioinformatics Team

Aidan O’Brien, Laurence Wilson

Software

Open source (MIT) @ https://github.com/csirobigdata/variant-spark

Random Forest SparkML

Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK

• Failing for millions of variables
• Relatively slow

“Cursed Forest”

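[Diagram: the driver broadcasts the current sample subsets (the initial sample, then the split subsets 1; 2,1 and 2,2; …) to the executors. The executors hold the variables v1, v2, v3, …, vn and each returns its local best split (var, point). The driver aggregates these into the global best split (var1, point1; then var21, point21 and var22, point22; …).]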

Partition data by variables (columns)

• Columns are “small” – easy to partition

• An executor can find (an exact) best split for many variables

• Finding global best split is efficient
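The column-partitioning idea can be sketched in a few lines of single-process Python. This is not VariantSpark's implementation: the partitions are simulated as plain dictionaries, Gini impurity is assumed as the split criterion, and every function and variable name here is hypothetical. The point is only that each partition scans its own columns exactly and hands back one small candidate, so picking the global winner is cheap.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_for_variable(values, labels):
    """Exact best threshold for one variable: (weighted impurity, threshold)."""
    best = (float("inf"), None)
    n = len(labels)
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[0]:
            best = (impurity, threshold)
    return best


def local_best_split(partition, labels):
    """Executor role (simulated): scan only the columns held locally."""
    candidates = []
    for name, values in partition.items():
        impurity, threshold = best_split_for_variable(values, labels)
        if threshold is not None:
            candidates.append((impurity, name, threshold))
    return min(candidates) if candidates else None


def global_best_split(partitions, labels):
    """Driver role (simulated): compare one small result per partition."""
    local_results = [local_best_split(p, labels) for p in partitions]
    return min(r for r in local_results if r is not None)


# Toy data: 6 samples, 4 variables spread over 2 column partitions.
labels = [1, 1, 0, 0, 1, 0]
partitions = [
    {"v1": [0, 1, 0, 1, 2, 0], "v2": [1, 1, 1, 0, 0, 2]},
    {"v3": [2, 2, 0, 0, 1, 0], "v4": [0, 0, 0, 1, 1, 1]},
]

impurity, variable, threshold = global_best_split(partitions, labels)
print(variable, threshold, impurity)  # v3 0 0.0
```

Only the labels travel to the partitions and only one (impurity, variable, threshold) tuple travels back from each, regardless of how many columns a partition holds.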

Some implementation tricks
• “Native” data shape
– VCF files are organized by “variables”
• Building by levels and tree batching
– Minimize communication overhead and the number of stages
• Optimized split finding for ordered factors
– The most frequent operation
– Java implementation faster than Scala
• Choice of data representation
– byte representation for variant data
– with sparsity 0.75 a sparse vector is 3x bigger than a byte array
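The 3x figure for sparse vectors can be reproduced with simple arithmetic. The sketch below assumes that sparsity means the fraction of zero entries, that a dense genotype column costs one byte per value, and that a sparse vector stores a 4-byte integer index plus an 8-byte double value for every non-zero entry (the layout used by Spark MLlib's SparseVector). Object and array headers are ignored, and the function names are illustrative.

```python
def dense_byte_array_size(n_values):
    """One byte per genotype value."""
    return n_values * 1


def sparse_vector_size(n_values, sparsity, index_bytes=4, value_bytes=8):
    """Index plus value stored for every non-zero entry."""
    n_nonzero = round(n_values * (1.0 - sparsity))
    return n_nonzero * (index_bytes + value_bytes)


n = 1_000_000
dense = dense_byte_array_size(n)
sparse = sparse_vector_size(n, sparsity=0.75)

print(sparse / dense)  # 3.0
```

Under these assumptions a sparse vector only breaks even with a byte array once more than 11 in 12 entries are zero.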

How fast is it?

16 CPU cores, 32 GB RAM, local mode

Big data performance
• YARN cluster (12 workers)
– 16 x Intel Xeon [email protected] CPU
– 128 GB of RAM
• Spark 1.6.1 on YARN
– 128 executors
– 6 GB / executor (0.75 TB)
• Synthetic dataset (mtry = 0.25)

Typical GWAS range
100K trees: 5 – 50 h
AWS: ~$215.50

Whole genome range
100K trees: 200 – 2000 h
AWS: ~$8620.00

50M variables x 10K samples!

Other features
• Various input formats
– VCF, CSV, parquet
• A variety of RF (fine) tuning parameters
– Sampling
– Depths
– Splitting
• Insight into RF model
– cumulative OOB error
– per tree variable importance
– per tree OOB predictions

Simulated Data Study
• Synthetic dataset of 2.5M variables and 5000 samples
• 5 informative variables with dichotomous response
• Compare RF importance ranking with the model
• Rank-biased overlap (RBO) – measure of ranking overlap (with emphasis on highly ranked elements)
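For readers unfamiliar with rank-biased overlap, here is a minimal sketch of a truncated version in plain Python. The slides do not say which RBO variant or persistence parameter was used, so the truncation at the shorter list length, the default p = 0.9, and the example rankings are all assumptions made for illustration.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two rankings without duplicates.

    Sums (1 - p) * p**(d - 1) * A_d over depths d = 1..k, where A_d is the
    fraction of items shared by the two top-d prefixes and k is the length
    of the shorter ranking. No extrapolation beyond depth k is applied.
    """
    k = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, k + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += (1 - p) * p ** (d - 1) * agreement
    return score


# Hypothetical rankings of the five informative variables.
model_ranking = ["w_1", "w_2", "w_3", "w_4", "w_5"]
forest_ranking = ["w_1", "w_3", "w_2", "w_5", "w_4"]

print(rbo_truncated(model_ranking, model_ranking))
print(rbo_truncated(model_ranking, forest_ranking))
```

Because the weights p**(d - 1) shrink with depth, disagreements near the top of the rankings lower the score more than disagreements further down.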

[Figure: RBO (axis 0 to 1.5) for w_1, w_2, w_3, w_4 and w_5.]

Bone Mineral Density Study
• Osteoporotic fracture is a leading cause of morbidity and mortality, particularly amongst the elderly.
• In 2004 ten million Americans were estimated to have osteoporosis, resulting in 1.5 million fractures per annum.
• Hip fracture is associated with a one year mortality rate of 36% in men and 21% in women

Burden of disease of osteoporotic fractures overall is similar to that of colorectal cancer and greater than that of hypertension and breast cancer

Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.

Bone Mineral Density Study

• 2036 samples & 288,768 SNPs

• Replicates 21 of 26 known associated genes

• Identifies 2 novel loci (known association with BMD)

• Provides strong evidence for a further 4 loci

Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.

BMD - VariantSpark Results

Known BMD locations have significantly higher ranking (Mann-Whitney U, p = 1.3e-7)

A few novel highly ranked locations with plausible association with BMD: COLEC10, PRODH

Not replicated DCDC5 ranked 9,667 out of 10,000
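The Mann-Whitney U comparison mentioned above asks whether one group of ranks tends to sit ahead of another. The sketch below computes only the U statistic by direct pair counting, with made-up ranks. It does not compute a p-value and is not the analysis behind the reported result, and the function name is hypothetical.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a < b count 1, ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical importance ranks (1 = most important).
known_loci_ranks = [1, 3, 4]
other_loci_ranks = [2, 5, 6, 7]

u = mann_whitney_u(known_loci_ranks, other_loci_ranks)
max_u = len(known_loci_ranks) * len(other_loci_ranks)

print(u, "out of", max_u)  # 10.0 out of 12
```

A value of U close to the maximum means the first group almost always outranks the second, while a value near half the maximum is what unrelated groups would be expected to give.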

Future work and directions

Technical
• Compare and ‘merge’ with yggdrasil
• Deployment on cloud platforms
• Further performance improvements

Functional
• Implementation of cutting edge research
• Integration within genomics platforms (GATK4)
• More ML algorithms

Research
• Applications
• Data science research
• Gradient Boosted Trees

References
1. Aleesha Bates (2016) Practical aspects of GWAS Association studies under statistical genetics and GenABEL hands-on tutorial
2. Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
3. The NHGRI-EBI Catalog of published genome-wide association studies
4. Manolio et al. (2009) Finding the missing heritability of complex diseases
5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests
6. Breiman (2001) Random Forests. Machine Learning
7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
8. Danecek et al. The Variant Call Format and VCFtools
9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.

Conclusions

Apache Spark is a feasible platform for machine learning in population scale genomics.

VariantSpark with CursedForest is a promising alternative to traditional GWAS approaches.

Data shape, type, etc. matter – different optimizations are needed.

Thank You

Email: [email protected]

https://github.com/csirobigdata/variant-spark
