Using Apache Spark to Fight World Hunger - Spark Meetup


Spark Meetup, December 2015
Noam Barkai - noamb@nrgene.com

Overview

Food shortage: new problems, new solutions

Intermezzo: how DNA works

Tach’les (“bottom line”): what we do with Apache Spark

The planet has gotten very populous

And it’s the only one we’ve got

World Population

Annual growth rate:
Peak - 2.1% (1962)
Current - 1.1% (2009)

https://en.wikipedia.org/wiki/World_population#/media/File:World-Population-1800-2100.svg

Food intake

source: http://www.coolgeography.co.uk/A-level/AQA/Year%2012/Food%20supply/Patterns%20and%20intro/Food_consumption.gif

Scaling up: same area, more crops

Plant breeding

An ancient art

Incremental changes

Slow but considerable

source: https://en.wikipedia.org/wiki/Zea_%28genus%29#/media/File:Maize-teosinte.jpg

How long does it take today?

Maize: 10-15 years

source: http://www.cropj.com/shimelis_6_11_2012_1542_1549.pdf

How breeding works

[Slide animation: a five-step breeding cycle, repeated across successive generations]

Computational genomics

⬇ Prices of DNA sequencing
⬆ Number of samples per crop sequenced and analyzed
⬆ Amount and quality of genomic data
⬇ Prices of computation
⬇ Prices of storage

We’re entering a new era

BIG DATA Genomics

Food security - a computational problem?

The plant’s potential lies in its DNA.

We analyze and compare sequences from many plants.

The result: better predictions for breeding, and a faster rate of crop improvement.

Intermezzo: DNA - how does it work?

Four “letters”: cytosine (C), guanine (G), adenine (A), thymine (T)

These encode 20 amino acids (three letters per codon)

Which combine to make 100K+ proteins

Conceptually we can think of this as a “pipeline” - “The Central Dogma”
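To make the pipeline concrete, here is a toy Scala sketch. Two assumptions to note: the codon table is truncated to a handful of entries, and the DNA → RNA transcription step is collapsed for brevity.

```scala
// Toy model of the Central Dogma's DNA -> protein step.
// NOTE: real biology goes DNA -> mRNA -> protein; transcription is
// collapsed here, and the codon table is truncated for illustration.
object CentralDogma {
  val codonTable: Map[String, Char] = Map(
    "ATG" -> 'M',               // methionine - the usual start codon
    "TGG" -> 'W',               // tryptophan
    "GCT" -> 'A', "GCC" -> 'A'  // alanine - several codons, one amino acid
  )

  // Read the sequence three letters (one codon) at a time,
  // mapping each known codon to its amino acid.
  def translate(dna: String): String =
    dna.grouped(3).flatMap(codonTable.get).mkString

  def main(args: Array[String]): Unit =
    println(translate("ATGTGGGCT"))  // prints "MWA"
}
```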

DNA as storage

Durable

Supports random access

Efficient sequential reads

Easily replicated

Contains error correction mechanisms

Maximally “data local”

Part 2: What we do with Apache Spark

Analyze lots of genome sequences.

Apply similarity algorithms to find where they match (see the sketch after this list).

Finally, assist the breeding program.
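To give a flavor of the similarity step, here is a heavily hedged Spark sketch: the real algorithms are more involved, and the path, input format, and the k-mer/Jaccard approach are illustrative stand-ins, not the actual method.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative stand-in for the real similarity algorithms:
// compare samples by the Jaccard similarity of their k-mer sets.
object KmerSimilarity {
  def kmers(seq: String, k: Int = 21): Set[String] = seq.sliding(k).toSet

  def jaccard(a: Set[String], b: Set[String]): Double =
    a.intersect(b).size.toDouble / a.union(b).size

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmer-sim"))
    // (sampleId, sequence) pairs; path and format are made up.
    val samples = sc.textFile("hdfs:///genomes/samples.tsv")
      .map(_.split("\t"))
      .map(cols => (cols(0), kmers(cols(1))))
    // All-pairs comparison: fine for a sketch, quadratic in reality.
    val scores = samples.cartesian(samples)
      .filter { case ((id1, _), (id2, _)) => id1 < id2 }
      .map { case ((id1, k1), (id2, k2)) => (id1, id2, jaccard(k1, k2)) }
    scores.take(10).foreach(println)
    sc.stop()
  }
}
```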

Input data is “noisy”

Contains errors and gaps.

Is fragmented.

All due to sequencing technology.

Our setup

Hadoop clusters on both private cloud and AWS

Textual files, using Parquet.

MapR 5 Hadoop distro

Spark 1.4.1

SparkSQL and Hive (JDBC)

Instances: ~150GB RAM, 40 cores (see the conf sketch after this list).

Provisioning: Ansible
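For illustration, roughly what pointing Spark at such instances looks like; the app name and exact values are assumptions, not our actual configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical conf for ~150GB / 40-core nodes - leave the OS some headroom.
val conf = new SparkConf()
  .setAppName("genomics-pipeline")                    // made-up app name
  .set("spark.executor.memory", "100g")
  .set("spark.executor.cores", "32")
  .set("spark.serializer",
       "org.apache.spark.serializer.KryoSerializer")  // cheaper shuffles
val sc = new SparkContext(conf)
```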

Our data

A dozen or so different crops, aiming for hundreds.

Each crop: potentially ~1K fully sequenced samples

~100K “markers”.

Each sequence: 1Gbp - 10Gbp long (giga base-pairs; one base pair ≈ one character)

Current: several terabytes, aiming at petabytes
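(Back-of-envelope, assuming one byte per base: ~1K samples × ~10 Gbp ≈ 10 TB of raw sequence per crop, so scaling to hundreds of crops is exactly the jump from terabytes to petabytes.)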

Working with Spark and Scala

Scala’s type system is your friend

Thinking functional takes time - and can be “overdone”

Remember to add @tailrec when needed (see the sketch after this list)

Scala case classes - great

Nested structure: keeps you DRY, but sluggish.

Scala has its pitfalls - profile.

Spark as the “ultimate scala collection” - Martin Odersky.
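A minimal sketch of the @tailrec point (Marker is a made-up case class): the annotation makes the compiler reject the method if a refactor accidentally moves the recursive call out of tail position, rather than silently risking a stack overflow on long lists.

```scala
import scala.annotation.tailrec

// Made-up domain type - also shows why case classes are great:
// equality, pattern matching and copy() for free.
case class Marker(name: String, position: Long)

object Scan {
  // Compiles only while the recursion stays in tail position.
  @tailrec
  def firstAfter(markers: List[Marker], pos: Long): Option[Marker] =
    markers match {
      case Nil                         => None
      case m :: _ if m.position >= pos => Some(m)
      case _ :: rest                   => firstAfter(rest, pos)
    }
}
```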

Complex unmanaged framework - the usual 20/80 rule:

20% fun algorithmic stuff,

80% integration/devops/tuning/black-voodoo

Integrations with Spark

Integration with Hive - doable but cumbersome

DataFrames API - very clean (sketch below)

Parquet in Spark 1.4 - seamless; Parquet with SparkSQL < 1.3 - rather sucks.
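A sketch of the Spark 1.4 DataFrame/Parquet path; the file path and the column names ("quality", "chromosome") are made up for illustration.

```scala
import org.apache.spark.sql.SQLContext

// Spark 1.4's DataFrameReader makes Parquet a one-liner.
val sqlContext = new SQLContext(sc)
val markers = sqlContext.read.parquet("hdfs:///crops/maize/markers.parquet")

// Hypothetical columns, just to show the fluent API.
markers.filter(markers("quality") > 0.9)
       .groupBy("chromosome")
       .count()
       .show()
```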

Performance tuning with Spark

If RDD objects need high RAM → memory gets tricky.

Spark UI in 1.4.1 - very nice

PairRDD - need to be your own “query optimizer” (sketch below)

repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great).
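One example of playing query optimizer yourself; the path, k-mer size, and partition count are illustrative.

```scala
// Count k-mers: (kmer, 1) pairs keyed for aggregation.
val kmerPairs = sc.textFile("hdfs:///genomes/sample1.txt")  // made-up path
  .flatMap(_.sliding(21))
  .map(kmer => (kmer, 1))

// reduceByKey combines map-side before the shuffle;
// groupByKey().mapValues(_.sum) would ship every raw value - avoid it.
val counts = kmerPairs.reduceByKey(_ + _)

// After heavy filtering, shrink the partition count without a full shuffle.
val frequent = counts.filter(_._2 > 1).coalesce(64)
```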

Testing, packaging and extending Spark

Testing: “local” mode is great, but that gives you integration tests, not unit tests :-(
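What that local-mode testing looks like in practice - a small, hypothetical ScalaTest example. It spins up a real SparkContext, which is exactly why it counts as integration rather than unit testing.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class KmerCountSuite extends FunSuite {
  test("counts repeated k-mers on a tiny input") {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      val counts = sc.parallelize(Seq("ACGTACGT"))
        .flatMap(_.sliding(4))   // 4-mers: ACGT, CGTA, GTAC, TACG, ACGT
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("ACGT") == 2)
    } finally sc.stop()
  }
}
```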

sbt-pack - good alternative to sbt-assembly.

Spark packages: spark-csv, spark-notebook and more.

Speaking of open-source packages...


ADAM Project - Genomics using Spark

Fully open sourced (from UC Berkeley’s AMPLab)

Similarity algorithms

Population clustering

Predictive analysis using Deep Learning

And more

Spark Meetup, December 2015
Noam Barkai - noamb@nrgene.com

Thank you