Introduction to Spark


Introduction to Spark for the Boulder / Denver Spark meetup


Intro to Spark
Dave Smelker

What is Spark?
• In-memory Map/Reduce engine
• Developed in 2009 by the Berkeley AMPLab
• Converted to an Apache project in 2013
• Written in Scala
• Scala, Java, and Python APIs

Most Active Big Data Project within Apache
(data from Spark Summit 2014)

[Ecosystem diagram: the Spark core with Spark Streaming, Spark SQL, GraphX, and MLBase on top; running standalone or over HDFS, Tachyon, Cassandra, RDBMS, and cloud services.]

Spark vs. Hadoop
• Hadoop Map/Reduce limitations:
  • High latency
  • No in-memory caching
  • Map/Reduce code is very complicated to write
• Spark:
  • In-memory processing
  • Very easy API
  • Can run standalone, even on Windows
  • 100x faster in memory and 10x faster on disk

Hadoop Word Count Example
(see code)

Spark Word Count Example

val file = spark.textFile("file.name")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
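The same dataflow can be sketched with plain Scala collections, no Spark cluster needed. This is a local stand-in, not the Spark API: collections have no reduceByKey, so groupBy plus a per-key sum approximates it, and the input lines are made up for illustration.

```scala
// Plain-Scala sketch of the word-count dataflow, with local
// collections standing in for RDDs. (Hypothetical input lines;
// groupBy + sum approximates reduceByKey.)
val lines = Seq("to be or not", "to be")

val counts = lines
  .flatMap(line => line.split(" "))                          // split lines into words
  .map(word => (word, 1))                                    // pair each word with 1
  .groupBy { case (word, _) => word }                        // reduceByKey, step 1: group by key
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // step 2: sum the 1s per key

println(counts)   // a Map from word to count, e.g. "to" -> 2
```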

RDD (Resilient Distributed Dataset)
• Operations
  • Transformations
  • Actions
• Persistence
  • Allows an RDD to persist between operations
  • Provides the ability to write to disk if too large for memory
• Parallelized Collections
  • Typically you want 2-4 slices per CPU in your cluster
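The 2-4 slices per CPU guideline can be sketched locally: derive a slice count from the core count and split the data, which is roughly what handing a numSlices argument to Spark's parallelize does across a cluster. This is a local stand-in, not the Spark API, and the numbers are illustrative.

```scala
// Local sketch of the "2-4 slices per CPU" guideline: pick a slice
// count from the available cores and split a collection accordingly.
// (Stand-in for parallelizing a collection with an explicit slice count.)
val cores = Runtime.getRuntime.availableProcessors
val numSlices = cores * 3                  // middle of the 2-4 range
val data = (1 to 100).toVector

// Ceiling division so every element lands in some slice.
val sliceSize = math.max(1, (data.size + numSlices - 1) / numSlices)
val slices = data.grouped(sliceSize).toVector
```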

Operations

Transformations:
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct

Actions:
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
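The split between the two lists can be illustrated with plain Scala collections: transformations build a new dataset from an existing one, while actions return a plain value to the driver. The collections and data here are a local stand-in for RDDs, made up for illustration.

```scala
// Collections-based sketch of the transformations vs. actions split.
val nums = Seq(3, 1, 2, 3)

// Transformations (dataset -> dataset in Spark):
val doubled = nums.map(_ * 2)          // Map
val evens   = nums.filter(_ % 2 == 0)  // Filter
val uniques = nums.distinct            // Distinct

// Actions (dataset -> value in Spark):
val total   = nums.reduce(_ + _)       // Reduce
val howMany = nums.size                // Count
val first   = nums.head                // First
```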


Persistence
• Stores an RDD for later operations
• Each node persists a partition
• Partitions are fault-tolerant
• persist() or cache()

Persistence storage levels
• MEMORY_ONLY - Store the RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK - Store the RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk
• MEMORY_ONLY_SER - Store the RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
• DISK_ONLY - Store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2 - Same as the levels above, but replicate each partition on two cluster nodes
• OFF_HEAP - Store the RDD in serialized format in Tachyon

Spark Advantages
• Same code can be used for streaming and batch processing
• In-memory processing
• Fault-tolerant RDD persistence
• Machine learning library built in
• Spark SQL (coming soon)
• Graph processing (GraphX, Bagel/Pregel)

Spark Drawbacks
• No append for output
• Lack of a job scheduler
• Spark on YARN not quite ready for prime time
• Still a young project

Questions?