All product images owned by respective companies/institutions

Intro to Apache Spark



Takeaways

To understand:
•  Why we have big data today
•  What big data problems Spark solves
•  How Spark approaches big data differently

But most of all… to feel comfortable trying Spark out!

Image Credit: http://commons.wikimedia.org/wiki/File:BigData_2267x1146_white.png

Why does big data exist?


[Figure: company/platform logos with scale figures: 7.2 B, 6.8 B, 1.44 B, 300 M, 236 M, 3.5 B / day]


When data is small it’s cute and cuddly, easy to contain…

When data gets big, we need tools to help us.

What tools can help?

2002 – MapReduce @ Google

2004 – MapReduce Paper

2006 – Hadoop @ Yahoo

2011 – Hadoop Released

Hadoop Data Flow

But MapReduce falls short…

Hadoop’s Limitations

Hadoop lacks the one thing it needs to succeed at:
•  Iterative queries
•  Interactive queries
That missing thing: fast data sharing.

Image courtesy of: http://workinganalytics.com/

We need… a better way.

We need… fault tolerance and speed.

We need… a better data abstraction.

The Solution…

•  Resilient Distributed Datasets – a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

2002 – MapReduce @ Google

2004 – MapReduce Paper

2006 – Hadoop @ Yahoo

2009 – Spark at UC Berkeley

2011 – Hadoop Released

2013 – Spark @ Apache

2014 – Spark 1.0 Released

Hadoop Data Flow

Spark Data Flow


Why Spark?
•  Fast
•  General Purpose
•  Easy
•  Streaming
•  Adoption

Image Credit: http://pixabay.com/en/tunnel-light-speed-fast-auto-101976/
Image Credit: http://www.freestockphotos.biz/stockphoto/9182
Image Credit: http://upload.wikimedia.org/wikipedia/commons/9/92/Easy_button.JPG
Image Credit: http://pixabay.com/en/faucet-water-bad-sanitaryblock-686958/


Use Cases

Spark Use Cases
•  ETL
•  Machine Learning
•  Analytics
•  Modeling
•  Data Mining

Table Credit: http://www.wsj.com/articles/SB10001424052970203914304576630742911364206

Spark Modules

Image Credit: http://www.numaq.com

Basics


Spark Data Flow

Creating RDDs

•  From practically any data source
   –  HDFS
   –  Local file system
   –  S3
   –  NoSQL (Cassandra, HBase, …)
   –  JDBC
•  From any collection
•  By transforming an existing RDD
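The three creation routes above can be sketched in Scala. This is a minimal local-mode sketch, not from the deck: the object name and sample data are illustrative, and the file path is left as a commented-out placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreatingRdds {
  def run(): Seq[Int] = {
    // Local-mode context for experimentation; on a real cluster the
    // master is supplied by the cluster manager / spark-submit instead
    val sc = new SparkContext(
      new SparkConf().setAppName("creating-rdds").setMaster("local[*]"))
    try {
      // 1. From a data source (placeholder path, so left commented out):
      // val fromFile = sc.textFile("hdfs://...")

      // 2. From an in-memory collection
      val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

      // 3. By transforming an existing RDD
      val doubled = fromCollection.map(_ * 2)

      doubled.collect().toSeq // action: materialize the result
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(",")) // prints 2,4,6,8,10
}
```

Note that `parallelize` is mainly useful for experiments and tests; real jobs typically start from an external source such as HDFS or S3.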

The word-count data flow, built up step by step:

Text File → [Read File] → File RDD → [Split Words] → Word RDD → [Count Words] → Word Count RDD → [Store Result] → All Word Counts

Further actions branch off the same lineage: Word Count RDD → Top 10 Words; Word RDD → “A” Word RDD → Words starting with “A”.

We start with some data. Put it in a form Spark understands…

RDDs:
•  Computation blueprint
•  Lazy: hold instructions – not data

Transformations chain operations together. Nothing is actually computed yet…

Actions compute results. Why is laziness good?

Only compute what we need. This allows you to:
•  Focus more on the algorithm
•  Worry less about performance

By default, RDDs are recomputed on each use.

For better performance… persist reused RDDs.

RDDs are fault tolerant. Lineage allows recreation.
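The laziness and persistence points above can be sketched in local mode. The object name and data are illustrative; the shape mirrors the diagram (transformations build lineage, actions trigger work, persist avoids recomputation for a reused RDD).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LazyPersist {
  // Returns (total word count, count of words starting with "a")
  def run(): (Long, Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-persist").setMaster("local[*]"))
    try {
      val lines = sc.parallelize(Seq("a b", "a c", "b c"))

      // Transformation only: builds the lineage, computes nothing yet
      val words = lines.flatMap(_.split(" "))

      // `words` is reused by two actions below; persisting it avoids
      // recomputing the flatMap for each one
      words.persist(StorageLevel.MEMORY_ONLY)

      // Each action triggers an actual computation
      val total  = words.count()
      val aWords = words.filter(_.startsWith("a")).count()
      (total, aWords)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // prints (6,2)
}
```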

Once more, with code

Word Count Example

val input = sc.textFile("hdfs://...") // HadoopRDD

// Transformation
val words = input.flatMap(line => line.split(" ")) // FlatMappedRDD

// Transformation
val result = words.map(word => (word, 1))
                  .reduceByKey((acc, curr) => acc + curr)

// Action
val collectedResult = result.collect()

Image courtesy of http://blog.jetoile.fr
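The “Top 10 Words” action from the data-flow diagram can be added to the same pipeline with `takeOrdered`. A runnable local-mode sketch, with inline data standing in for the HDFS file (object name and sentences are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TopWords {
  // Word count plus the "Top 10 Words" action from the data-flow diagram
  def run(): Seq[(String, Int)] = {
    val sc = new SparkContext(
      new SparkConf().setAppName("top-words").setMaster("local[*]"))
    try {
      // Inline data stands in for sc.textFile("hdfs://...") so the
      // sketch runs without a cluster
      val input  = sc.parallelize(Seq("spark is fast", "spark is easy"))
      val counts = input.flatMap(_.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
      // Action: up to ten most frequent words, descending by count
      counts.takeOrdered(10)(Ordering.by { case (_, c) => -c }).toSeq
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

Because `takeOrdered` is an action, it is the point where the whole lineage (read, split, count) actually executes.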

Cluster Basics


Image courtesy of https://spark.apache.org

Driver Program: connects our program to Spark (the “Main”); creates RDDs and executes code on the cluster.

Cluster Manager: acquires cluster resources (YARN, Mesos, Standalone…).

Worker Nodes: spawn executors, which perform the tasks.

Managed by Spark vs. managed by you.
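The cluster manager in use is selected by the master URL the driver is configured with. A small sketch (object name and hostnames are placeholders):

```scala
import org.apache.spark.SparkConf

object ChoosingAMaster {
  // The master URL tells the driver which cluster manager to talk to:
  //   "yarn"               -> Hadoop YARN
  //   "mesos://host:5050"  -> Apache Mesos (host is a placeholder)
  //   "spark://host:7077"  -> Spark standalone (host is a placeholder)
  //   "local[*]"           -> local mode, all cores; handy for development
  def run(): String = {
    val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
    conf.get("spark.master")
  }

  def main(args: Array[String]): Unit = println(run()) // prints local[*]
}
```

In practice the master is often passed via `spark-submit` rather than hard-coded, so the same program can run locally or on a cluster unchanged.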


In Action

Questions?

More Information on Spark

•  https://spark.apache.org/docs/latest/index.html
•  http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
•  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
•  https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
•  http://www.meetup.com/Washington-DC-Area-Spark-Interactive/
•  https://spark-summit.org/

Shared Variables

●  Broadcast variables
  o  Allow the user to keep a read-only variable cached on each machine instead of shipping it with every task
  o  e.g. a lookup table

●  Accumulators
  o  Workers can “add” to them using associative operations
  o  Only the driver can read the value
  o  Used for:
    §  counters
    §  sums
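Both kinds of shared variable can be shown in one small local-mode sketch: a broadcast lookup table plus an accumulator counting lookup misses. The object name and data are illustrative, and this uses the newer `longAccumulator` API (the Spark 1.x era of this deck used `sc.accumulator` instead).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariables {
  // Returns (expanded names, number of lookup misses)
  def run(): (Seq[String], Long) = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-variables").setMaster("local[*]"))
    try {
      // Broadcast variable: the lookup table is shipped to each machine
      // once and cached, instead of being serialized with every task
      val countries = sc.broadcast(Map("us" -> "United States", "fr" -> "France"))

      // Accumulator: tasks add to it, only the driver reads the result.
      // (Updates inside transformations can over-count if tasks are
      // retried; use actions when exactness matters.)
      val misses = sc.longAccumulator("lookup misses")

      val names = sc.parallelize(Seq("us", "fr", "de")).map { code =>
        countries.value.getOrElse(code, { misses.add(1); "unknown" })
      }

      val collected = names.collect().toSeq // action: triggers the map
      (collected, misses.value.longValue)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```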