Spark Meetup, December 2015 Noam Barkai [email protected]

Using Apache Spark to fight world hunger - Spark Meetup

Overview

Food shortage: new problems, new solutions

Intermezzo: how DNA works

Tach’les (down to business): what we do with Apache Spark

The planet has gotten very populous

And it’s the only one we’ve got

World Population

Annual growth rate:

Peak - 2.1% (1962)

Current - 1.1% (2009)

source: https://en.wikipedia.org/wiki/World_population#/media/File:World-Population-1800-2100.svg

Food intake

source: http://www.coolgeography.co.uk/A-level/AQA/Year%2012/Food%20supply/Patterns%20and%20intro/Food_consumption.gif

Upscale: Same area, more crops

Plant breeding

An ancient art

Incremental changes

Slow but considerable

source: https://en.wikipedia.org/wiki/Zea_%28genus%29#/media/File:Maize-teosinte.jpg

How long does it take today?

Maize: 10-15 years

source: http://www.cropj.com/shimelis_6_11_2012_1542_1549.pdf

How breeding works

[Diagram: items numbered 1-5, repeated across several panels]

Computational genomics

⬇ Prices of DNA sequencing

⬆ Number of samples per crop sequenced and analyzed

⬆ Amount and quality of genomic data

⬇ Prices of computation

⬇ Prices of storage

We’re entering a new era

BIG DATA Genomics

Food security - a computational problem?

The plant’s potential lies in its DNA.

We analyze and compare sequences from many plants.

Resulting in better predictions for breeding.

Faster rate of crop improvement.

Intermezzo: DNA - how does it work?

Four “letters”: cytosine (C), guanine (G), adenine (A), thymine (T)

These encode 20 amino acids

Which combine to make 100K+ proteins
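
A quick way to see why four letters are enough: a minimal Python sketch (not from the talk) that counts how many distinct "words" the four letters can spell at each length. The figure of 21 meanings (20 amino acids plus a stop signal) and the helper name `shortest_word_length` are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"  # the four DNA letters


def shortest_word_length(num_symbols: int, num_meanings: int) -> int:
    """Return the smallest n such that num_symbols ** n >= num_meanings."""
    length = 1
    while num_symbols ** length < num_meanings:
        length += 1
    return length


# 20 amino acids plus one "stop" signal (assumption made for this sketch).
MEANINGS = 20 + 1

codon_length = shortest_word_length(len(BASES), MEANINGS)
codons = ["".join(p) for p in product(BASES, repeat=codon_length)]

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "letter(s) ->", len(BASES) ** n, "possible words")
    print("word length needed:", codon_length, "| distinct words:", len(codons))
```

Two-letter words give only 16 combinations, so three letters are the shortest length that covers all 20 amino acids, with room to spare.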

Conceptually we can think of this as a “pipeline”: “The Central Dogma” (DNA → RNA → protein)
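
The pipeline view (DNA is transcribed to RNA, RNA is translated to protein) can be sketched in a few lines of Python. This is an illustrative toy, not code from the talk: it assumes the input is the coding strand, starts reading at position 0, and uses the standard genetic code written as a compact 64-letter string. The function names `transcribe` and `translate` are made up for this sketch.

```python
from itertools import product

# Standard genetic code, one amino-acid letter per codon, with codons
# enumerated in T, C, A, G order. '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA: same letters, with U in place of T."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> protein: read three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]  # raises KeyError on non-ACGT letters
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCCTTTTAA"
    mrna = transcribe(dna)
    print(dna, "->", mrna, "->", translate(mrna))
```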

DNA as storage

Durable

Supports random access

Efficient sequential reads

Easily replicated

Contains error correction mechanisms

Maximally “data local”

Part 2: What we do with Apache Spark

Analyze lots of genome sequences.

Apply similarity algorithms, find where they match.

Finally, assist the breeding program.
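
The talk does not say which similarity algorithms are used. As one common stand-in, here is a minimal single-machine Python sketch based on k-mers (substrings of length k): Jaccard similarity over k-mer sets, plus a small index that reports where two sequences share a k-mer. The function names, the choice of k and the example sequences are all assumptions for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 4) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0  # both too short to have any k-mer: treat as identical
    return len(ka & kb) / len(ka | kb)


def match_positions(reference: str, query: str, k: int = 4) -> list:
    """(reference_offset, query_offset) pairs where a k-mer matches exactly."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTGA"
    query = "TACGT"
    print("similarity:", jaccard(reference, query))
    print("matches at:", match_positions(reference, query))
```

On a cluster the same idea maps naturally onto key-value operations keyed by k-mer, which is one reason this family of methods suits Spark; the sketch above only shows the logic.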

Input data is “noisy”

Contains errors and gaps.

Is fragmented.

All due to sequencing technology.
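
One practical consequence of noisy input is that comparisons have to tolerate unknown positions. A minimal Python sketch, assuming gaps are written as `N` (the usual code for "any base") and that the two sequences are already aligned to the same length; the function name `identity` is hypothetical and this is not the talk's actual method.

```python
def identity(a: str, b: str, gap: str = "N"):
    """Fraction of matching positions, skipping positions unknown in either input.

    Returns None when no position can be compared at all.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # unknown base: neither a match nor a mismatch
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else None


if __name__ == "__main__":
    print(identity("ACGTNACG", "ACGANACG"))  # one mismatch, one gap skipped
    print(identity("NNN", "ACG"))            # nothing comparable
```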

Our setup

Hadoop clusters on both private cloud and AWS

Textual files, using Parquet.

MapR 5 Hadoop distro

Spark 1.4.1

SparkSQL and Hive (JDBC)

Instances: ~150GB RAM, 40 cores.

Provisioning: Ansible

Our data

A dozen or so different crops, aiming for hundreds.

Each crop: potentially ~1K fully sequenced samples

~100K “markers”.

Each sequence: 1 Gbp - 10 Gbp long (giga base-pairs; one base pair = one character)

Current: several terabytes, aiming at petabytes
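
A back-of-the-envelope check of how these numbers reach petabytes. This Python sketch uses only the figures on the slide plus two assumptions: one byte per character for raw text, and decimal units (1 TB = 10^12 bytes). The 2-bit packing line is an illustrative what-if, not something the talk claims to do.

```python
GBP = 10**9   # base pairs in one giga base-pair
TB = 10**12   # bytes in one terabyte (decimal)
PB = 10**15   # bytes in one petabyte (decimal)


def storage_bytes(samples: int, bases_per_sample: int, bits_per_base: int = 8) -> int:
    """Raw size of `samples` sequences at the given encoding density."""
    return samples * bases_per_sample * bits_per_base // 8


# One crop, ~1K samples, one character per base pair.
per_crop_low = storage_bytes(1000, 1 * GBP)    # 1 Gbp genomes
per_crop_high = storage_bytes(1000, 10 * GBP)  # 10 Gbp genomes

# "Hundreds" of crops, taking 100 as a round number.
hundred_crops_high = 100 * per_crop_high

# What-if: 4 letters fit in 2 bits, so packed storage is 4x smaller.
per_crop_high_packed = storage_bytes(1000, 10 * GBP, bits_per_base=2)

if __name__ == "__main__":
    print("per crop:", per_crop_low / TB, "-", per_crop_high / TB, "TB")
    print("100 crops (upper end):", hundred_crops_high / PB, "PB")
    print("per crop, 2-bit packed:", per_crop_high_packed / TB, "TB")
```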

Working with Spark and Scala

Scala’s type system is your friend

Thinking functional takes time - and can be “overdone”

Remember to add @tailrec when needed

Scala case classes - great

Nested structure: keeps you DRY, but sluggish.

Scala has its pitfalls - profile.

Spark as the “ultimate Scala collection” - Martin Odersky.

Integrations with Spark

Complex unmanaged framework - the usual 20/80 rule:

20% fun algorithmic stuff,

80% integration/devops/tuning/black-voodoo

Integration with Hive - doable but cumbersome

DataFrames API - very clean

Parquet in Spark 1.4 - seamless; Parquet with SparkSQL < 1.3 - rather sucks.

Performance tuning with Spark

If RDD objects need high RAM → memory gets tricky.

Spark UI in 1.4.1 - very nice

PairRDD - need to be your own “query optimizer”

repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great).
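
Why high data variability makes partitioning tricky can be shown with a toy model in plain Python. This is not Spark code and does not reproduce Spark's partitioner: it uses CRC32 as a deterministic stand-in for a hash partitioner, and a hypothetical workload in which one key ("chr1") holds 90% of the records.

```python
import zlib
from collections import Counter


def hash_partition(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner (CRC32, not Spark's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def sizes_by_key(keys, num_partitions: int) -> list:
    """Records per partition when every record goes where its key hashes."""
    counts = Counter(hash_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def sizes_round_robin(keys, num_partitions: int) -> list:
    """Records per partition when records are dealt out evenly, ignoring keys."""
    counts = Counter(i % num_partitions for i, _ in enumerate(keys))
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew(sizes) -> float:
    """Largest partition relative to the mean; 1.0 means perfectly balanced."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical workload: 900 records for one hot key, 10 each for ten others.
keys = ["chr1"] * 900 + [f"chr{i}" for i in range(2, 12) for _ in range(10)]

if __name__ == "__main__":
    print("by key:     ", sizes_by_key(keys, 4), "skew", skew(sizes_by_key(keys, 4)))
    print("round robin:", sizes_round_robin(keys, 4), "skew", skew(sizes_round_robin(keys, 4)))
```

Balancing by record count fixes the sizes but scatters each key across partitions, so any later per-key step needs another shuffle. Choosing between the two by hand is the "be your own query optimizer" work the slide describes.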

Testing, packaging and extending Spark

Testing: “local” is great, but means no unit-test :-(

sbt-pack - good alternative to sbt-assembly.

Spark packages: spark-csv, spark-notebook and more.

Speaking of open-source packages...

ADAM Project - Genomics using Spark

Fully open sourced from

Similarity algorithms

Population clustering

Predictive analysis using Deep Learning

And more

Thank you