28
Scalable genomic data processing and interoperable systems with ADAM/Spark Andy Petrella Xavier Tordoir 2015-02-19

BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale

Embed Size (px)

Citation preview

Page 1: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Scalable genomic data processing and interoperable systems with ADAM/Spark

Andy PetrellaXavier Tordoir

2015-02-19

Page 2: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

LineupIntro

● who we are ● we do distributed computing

Abstract● Content: Distributed machine learning on

genomes data● Distributed data and processing (S3, Spark,

Tachyon)● Distributed machine learning (MLlib, H2O)● Spark Notebook

Context● 1000 genomes in VCF● Distributed genomic data in ADAM● Size matters (VCF → ADAM + partitioned)● Data available on S3 (s3://med-at-

scale/1000genomes)● Stratification

Procedure● Deploy Spark on ec2● Deploy Spark Notebook● Load data● Clean data● Transform data● Train KMeans

Results● Prediction (confusion matrix)● Performance

On the bench● GA4GH compliant and scalable server● Ad hoc analyses and sharing (through Tachyon)

Page 3: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Andy

@Noootsab, I am@SparkNotebook creator@Devoxx4Kids organizerMaths & CSScalable systems Machine learning

Med@Scale

Xavier

@xtordoirPhysicsData analysisGenomicsDistributed computing

Page 4: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Products (OSS)● SparkNotebook● GA4GH server

What we do?Distributed computing consultancy in

● Internet of Things● Finance● Geospatial● Marketing

Training and coaching in● Scala● Spark● Distributed architecture● Distributed machine learning

Research and development● Distributed machine learning models● Genomics and health

Page 5: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Data: 1000genomes (Genotypes + Samples Population)

- Quite some data → real scalability test

- Machine learning:- Genotype inference- Population classification (supervised learning)- Population stratification (unsupervised learning)

Distributed Machine Learning on Genotypes Data

Page 6: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

The era of distributed computing

Strong Open Source ecosystem, Industrial developments and research

- Infrastructure can be elastic (e.g. EC2/S3)

- Data storage: HDFS (large blocks…), S3 (remote...)

- Processing: Beefed up MapReduce: Spark

- Escaping the IOPs: Tachyon in-memory filesystem

- Scheduling, HA (Mesos, Marathon)

Distributed Data Processing

Page 7: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

BerkeleyDataAnalyticsStackmore here

Distributed Data Processing

Page 8: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

SparkNotebook

Interactive Distributed Computing

Dev’ timeDev’ time

Dev’ timeDev’ timeDev’ time

Dev’ timeDev’ time

Page 9: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Distributed Genomic Data1000 genomes1092 samples43,372,735,220 genotypes

Original DataVCF not partitioned files on FTP or S3: 152 GB (gzipped)VCF format not easily parallelizable, even worst with compression

Adam / med-at-scaleADAM files S3: 70.75GB (parquet, compressed)9172 partitions (7Mb each)@see http://med-at-scale.s3.amazonaws.com/1000genomes/counts.html

Eggo projecthttps://github.com/bigdatagenomics/eggo

Page 10: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

DataWe have the 1000 genomes data, hence

- we have genotypes- we have samples population labels

ExplorationWe can cluster samples.We can compare with samples populations.

ModelWe can run simple stratification algorithms, K-Means.

Technology assessment

Page 11: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

K-Means

MLLib provides K-Means (not hierarchical) → limit to 3 populations

MLLib uses breeze linalg library → Only euclidean metric (at that moment)

AT 1

AA 0

TT 2

A ref allele

11

2

Page 12: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

ProcedureSpark on EC2 cluster

- spark-ec2 script

- 2 to 40 workers (x 13GB / 4 cores)

- 10 to 40 minutes to launch Driver

Worker

Worker

Worker Worker

$ ./spark-ec2 launch

Page 13: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

ProcedureSparkNotebook on EC2 cluster

- access from your browser

- configure spark

- control computations on the cluster

Driver

Worker

Worker

Worker

Worker

Page 14: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

ProcedureLoad data

- Read ADAM data from S3 repo

- Read the samples populations

Worker

WorkerWorker

Worker

Driver

Page 15: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

ProcedureFilter and clean data

- Sample: chromosome slice (chr22), 3 populations (GBR, ASW, CHB)

- Missing genotypes (remove incomplete variants)

Variant1 Variant2 Variant3 Variant4 Variant5 Variant6 Variant7

Sample1 0 0 1 0 1 0 1

Sample2 2 NA 1 2 1 0 0

Sample3 2 0 1 2 2 0 2

Sample4 1 1 0 0 0 NA 0

Page 16: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

ProcedureTransform data

- Flat Genotype collection → Sample collection

- Each Sample is a Vector of Genotypes (0, 1, 2)

- Vector is ordered consistently

Genotype

Variant

Sample (ID)

Alleles

Sample

Sample (ID)

Vector[Genotype]

Vector[Variant]

Page 17: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

ProcedureTrain K-Means

- 10 iterations

- 3 clusters

Sample

Sample (ID)

Vector[Genotype]

Vector

Vector

Vector

Page 18: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Results

~ 100,000 variants

#0 #1 #2GBR 0 0 89ASW 54 0 7CHB 0 97 0

The procedure reconstructs the actual populations.

Page 19: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Results

Performance (cluster size)

2 NODES 20 NODES(*)

Cluster Launch 10 min 30.0 min

Count chr22 genotypes (S3) 6 min 1.1 min

Save chr22 from s3 to HDFS 26 min 3.5 min

Count chr22 genotypes (HDFS) 10 min 1.4 min

(*) Cluster size / nb partitions not optimal here: 80 cores / 114 partitions

Page 20: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Results

Performance (cluster size)

121,023 Variants 2 NODES 20 NODES

Missing data (collect) 7.8 min 33 sec

Train (10 iter) 2.1 min 28 sec

Predict (collect) 8 sec 2 sec

Page 21: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Results

Performance, 20 NODES (data size)

121,023 Variants

491,222 Variants

Missing data (collect) 33 sec 3.7 min

Train (10 iter) 28 sec 1.6 min

Predict (collect) 2 sec 25 sec

Page 22: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

On the benchGlobal Alliance for Genomic and Health (GA4GH)

http://genomicsandhealth.org/http://ga4gh.org/

- Framework for responsible data sharing- Define schemas- Define services for interoperability

Page 23: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

On the benchGA4GH schemas

Page 24: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

On the benchGA4GH google implementation

Page 25: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

On the benchGA4GH google implementation

Page 26: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

On the benchGA4GH compliant & scalable server

Open source and available on GitHub,https://github.com/med-at-scale/high-health

PRs are welcome!

Page 27: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

On the bench

Methods grouped in micro services

GA4GH & Custom methods

Page 28: BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale

Thank you

Biobankcloud, KTH (Jim Dowling)UC Berkeley AMPLab, bdgenomics.org team (Frank Nothaft, Matt Massie)Cloudera (Uri Laserson)

Hey…

Come back tomorrow morning → for demos

And afternoon → to hack on it!