Intro to Spark Dave Smelker

Introduction to Spark


Introduction to Spark for the Boulder / Denver Spark meetup

Page 1: Introduction to Spark

Intro to Spark
Dave Smelker

Page 2: Introduction to Spark

What is Spark?
• In-memory Map/Reduce engine
• Developed in 2009 by the Berkeley AMPLab
• Became an Apache project in 2013
• Written in Scala
• Scala, Java, and Python APIs

Page 3: Introduction to Spark

Most Active Big Data Project within Apache

Data from Spark-Summit 2014

Page 4: Introduction to Spark

[Ecosystem diagram: Spark core, with Spark Streaming, Spark SQL, MLBase, and GraphX on top; runs standalone or over HDFS, Tachyon, Cassandra, cloud services, and RDBMS]

Page 5: Introduction to Spark

Spark vs. Hadoop

Hadoop Map/Reduce limitations:
• High latency
• No in-memory caching
• Map/Reduce code is complicated to write

Spark:
• In-memory processing
• Simple API
• Can run standalone, even on Windows
• Up to 100x faster in memory and 10x faster on disk

Page 6: Introduction to Spark

Hadoop Word Count Example
(see code)

Page 7: Introduction to Spark

Spark Word Count Example

val file = spark.textFile("file.name")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
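Since Spark also offers a Python API, the same flatMap/map/reduceByKey pipeline can be sketched in plain Python (no cluster required) to check what the word count actually computes. This is a hypothetical illustration of the semantics, not Spark itself:

```python
from collections import Counter

# Plain-Python sketch of the flatMap -> map -> reduceByKey pipeline above.
# No Spark involved: this only shows what the word-count logic computes.
def word_count(lines):
    words = (word for line in lines for word in line.split(" "))  # flatMap
    pairs = ((word, 1) for word in words)                         # map
    counts = Counter()                                            # reduceByKey(_ + _)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```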

Page 8: Introduction to Spark

RDD - Resilient Distributed Dataset
• Operations
  • Transformations
  • Actions
• Persistence
  • Allows an RDD to persist between operations
  • Provides the ability to write to disk if too large for memory
• Parallelized collections
  • Typically you want 2-4 slices per CPU in your cluster
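To make "slices" concrete: when Spark parallelizes a collection it splits the data into partitions. A hypothetical plain-Python sketch of one way such a positional split can work (the function name and exact split boundaries are illustrative, not Spark's implementation):

```python
# Illustrative sketch of splitting a collection into "slices" (partitions),
# in the spirit of parallelizing a collection across a cluster.
def split_into_slices(data, num_slices):
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

print(split_into_slices(list(range(10)), 4))
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```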

Page 9: Introduction to Spark

Operations

Transformations:
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct

Actions:
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
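The key difference between the two columns is laziness: transformations only describe a computation, while actions force it to run. A hypothetical plain-Python analogy using generators (not Spark code) shows the same split:

```python
# Generators are lazy, like transformations: building the pipeline
# does no work. Consuming it (the "action") forces evaluation.
data = range(1, 11)

# "Transformations": nothing computed yet.
evens   = (x for x in data if x % 2 == 0)   # like filter
squared = (x * x for x in evens)            # like map

# "Action": forces the whole pipeline, like reduce/collect/count.
total = sum(squared)
print(total)  # 2^2 + 4^2 + 6^2 + 8^2 + 10^2 = 220
```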

Page 10: Introduction to Spark

Operations continued

Page 11: Introduction to Spark

Persistence
• Store an RDD for later operations
• Each node persists a partition
• Partitions are fault-tolerant
• persist() or cache()
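Why persistence matters: without it, each action re-runs the transformations that produced the RDD. A hypothetical plain-Python analogy (call counters and a list standing in for persist()/cache(); not Spark code):

```python
# Without "caching", every action recomputes the transformation;
# materializing once and reusing is the persist()/cache() idea.
compute_calls = 0

def expensive(x):
    global compute_calls
    compute_calls += 1
    return x * 2

data = [1, 2, 3]

# No persistence: each pass re-runs the transformation.
assert sum(expensive(x) for x in data) == 12
assert sum(expensive(x) for x in data) == 12
assert compute_calls == 6  # recomputed for each "action"

# "Cached": materialize once, then reuse for later actions.
compute_calls = 0
cached = [expensive(x) for x in data]  # analogous to cache() + first action
assert sum(cached) == 12
assert max(cached) == 6
assert compute_calls == 3  # computed only once
```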

Page 12: Introduction to Spark

Persistence storage levels
• MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk
• MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
• DISK_ONLY - Store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2 - Same as the levels above, but replicate each partition on two cluster nodes
• OFF_HEAP - Store RDD in serialized format in Tachyon

Page 13: Introduction to Spark

Spark Advantages
• Same code can be used for streaming and batch processing
• In-memory processing
• Fault-tolerant RDD persistence
• Machine learning library built in
• Spark SQL (coming soon)
• Graph processing (GraphX, Bagel/Pregel)

Page 14: Introduction to Spark

Spark Drawbacks
• No append for output
• Lack of a job scheduler
• Spark on YARN not quite ready for prime time
• Still a young project

Page 15: Introduction to Spark

Questions?