Intro to Spark Dave Smelker

Introduction to Spark


Introduction to Spark for the Boulder / Denver Spark meetup

Page 1: Introduction to Spark

Intro to Spark
Dave Smelker

Page 2: Introduction to Spark

What is Spark?
• In-memory Map/Reduce engine
• Developed in 2009 by the Berkeley AMPLab
• Became an Apache project in 2013
• Written in Scala
• Scala, Java, and Python APIs

Page 3: Introduction to Spark

Most Active Big Data Project within Apache

Data from Spark-Summit 2014

Page 4: Introduction to Spark

[Ecosystem diagram: Spark core, with Spark Streaming, Spark SQL, MLBase, and GraphX on top; runs standalone or over HDFS, Tachyon, Cassandra, cloud services, and RDBMS]

Page 5: Introduction to Spark

Spark vs. Hadoop

Hadoop Map/Reduce limitations:
• High latency
• No in-memory caching
• Map/Reduce code is complicated to write

Spark:
• In-memory processing
• Simple API
• Can run standalone, even on Windows
• Up to 100x faster in memory and 10x faster on disk

Page 6: Introduction to Spark

Hadoop Word Count Example
(see code)

Page 7: Introduction to Spark

Spark Word Count Example

val file = spark.textFile("file.name")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
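Since Spark also offers a Python API, the same flatMap/map/reduceByKey pipeline can be sketched in plain Python (no cluster required) to check what the word count actually computes. This is a hypothetical illustration of the semantics, not Spark itself:

```python
from collections import Counter

# Plain-Python sketch of the flatMap -> map -> reduceByKey pipeline above.
# No Spark involved: this only shows what the word-count logic computes.
def word_count(lines):
    words = (word for line in lines for word in line.split(" "))  # flatMap
    pairs = ((word, 1) for word in words)                         # map
    counts = Counter()                                            # reduceByKey(_ + _)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```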

Page 8: Introduction to Spark

RDD - Resilient Distributed Dataset
• Operations
  • Transformations
  • Actions
• Persistence
  • Allows an RDD to persist between operations
  • Provides the ability to write to disk if too large for memory
• Parallelized collections
  • Typically you want 2-4 slices per CPU in your cluster
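To make "slices" concrete: when Spark parallelizes a collection it splits the data into partitions. A hypothetical plain-Python sketch of one way such a positional split can work (the function name and exact split boundaries are illustrative, not Spark's implementation):

```python
# Illustrative sketch of splitting a collection into "slices" (partitions),
# in the spirit of parallelizing a collection across a cluster.
def split_into_slices(data, num_slices):
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

print(split_into_slices(list(range(10)), 4))
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```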

Page 9: Introduction to Spark

Operations

Transformations:
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct

Actions:
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
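The key difference between the two columns is laziness: transformations only describe a computation, while actions force it to run. A hypothetical plain-Python analogy using generators (not Spark code) shows the same split:

```python
# Generators are lazy, like transformations: building the pipeline
# does no work. Consuming it (the "action") forces evaluation.
data = range(1, 11)

# "Transformations": nothing computed yet.
evens   = (x for x in data if x % 2 == 0)   # like filter
squared = (x * x for x in evens)            # like map

# "Action": forces the whole pipeline, like reduce/collect/count.
total = sum(squared)
print(total)  # 2^2 + 4^2 + 6^2 + 8^2 + 10^2 = 220
```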

Page 10: Introduction to Spark

Operations continued

Page 11: Introduction to Spark

Persistence
• Store an RDD for later operations
• Each node persists a partition
• Partitions are fault-tolerant
• persist() or cache()
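Why persistence matters: without it, each action re-runs the transformations that produced the RDD. A hypothetical plain-Python analogy (call counters and a list standing in for persist()/cache(); not Spark code):

```python
# Without "caching", every action recomputes the transformation;
# materializing once and reusing is the persist()/cache() idea.
compute_calls = 0

def expensive(x):
    global compute_calls
    compute_calls += 1
    return x * 2

data = [1, 2, 3]

# No persistence: each pass re-runs the transformation.
assert sum(expensive(x) for x in data) == 12
assert sum(expensive(x) for x in data) == 12
assert compute_calls == 6  # recomputed for each "action"

# "Cached": materialize once, then reuse for later actions.
compute_calls = 0
cached = [expensive(x) for x in data]  # analogous to cache() + first action
assert sum(cached) == 12
assert max(cached) == 6
assert compute_calls == 3  # computed only once
```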

Page 12: Introduction to Spark

Persistence storage levels
• MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk
• MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
• DISK_ONLY - Store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2 - Same as the levels above, but replicate each partition on two cluster nodes
• OFF_HEAP - Store RDD in serialized format in Tachyon

Page 13: Introduction to Spark

Spark Advantages
• Same code can be used for streaming and batch processing
• In-memory processing
• Fault-tolerant RDD persistence
• Machine learning library built in
• Spark SQL (coming soon)
• Graph processing (GraphX, Bagel/Pregel)

Page 14: Introduction to Spark

Spark Drawbacks
• No append for output
• Lack of a job scheduler
• Spark on YARN not quite ready for prime time
• Still a young project

Page 15: Introduction to Spark

Questions?