Intro to Spark
Dave Smelker
What is Spark?
• In-memory map/reduce engine
• Developed in 2009 by the Berkeley AMPLab
• Became an Apache project in 2013
• Written in Scala
• Scala, Java, and Python APIs
Most Active Big Data Project within Apache
(data from Spark Summit 2014)
[Stack diagram: Spark core alongside Spark Streaming, Spark SQL, MLBase, and GraphX, running standalone or on HDFS, Tachyon, Cassandra, cloud services, and RDBMSs]
Spark vs. Hadoop
• Hadoop Map/Reduce limitations
  • High latency
  • No in-memory caching
  • Map/Reduce code is complicated to write
• Spark
  • In-memory processing
  • Simple, expressive API
  • Can run standalone, even on Windows
  • Up to 100x faster in memory and 10x faster on disk
Hadoop Word Count Example
(see code)
Spark Word Count Example

val file = spark.textFile("file.name")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
RDD – Resilient Distributed Dataset
• Operations
  • Transformations
  • Actions
• Persistence
  • Allows an RDD to persist between operations
  • Provides the ability to spill to disk if too large for memory
• Parallelized collections
  • Typically you want 2–4 slices per CPU in your cluster
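A minimal sketch of a parallelized collection, assuming an already-created SparkContext named sc (the collection and slice count here are illustrative):

```scala
// Assumes an existing SparkContext, conventionally named sc
val data = 1 to 1000

// The second argument sets the number of slices (partitions);
// at 2-4 slices per CPU, 8 slices suits a small cluster of a few cores
val distData = sc.parallelize(data, 8)

distData.reduce(_ + _)  // operates on the slices in parallel
```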
Operations
• Transformations: map, filter, sample, join, reduceByKey, groupByKey, distinct
• Actions: reduce, collect, count, first, take, saveAs, countByKey
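A rough sketch of the difference, assuming a SparkContext sc: transformations only describe a new RDD and are evaluated lazily, while an action triggers the computation and returns a value to the driver.

```scala
// Assumes an existing SparkContext sc
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations: lazy, nothing executes yet
val evens   = nums.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Action: runs the whole pipeline and returns a result
val total = doubled.reduce(_ + _)  // 4 + 8 = 12
```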
Persistence
• Store an RDD for later operations
• Each node persists a partition
• Partitions are fault-tolerant
• persist() or cache()
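A hedged sketch of caching an RDD between operations, assuming a SparkContext sc (the file path and filter are illustrative):

```scala
// Assumes an existing SparkContext sc; the path is illustrative
val lines  = sc.textFile("access.log")
val errors = lines.filter(_.contains("ERROR"))

errors.cache()    // shorthand for persist() at the default storage level

errors.count()    // first action: computes the RDD and caches its partitions
errors.first()    // subsequent actions reuse the cached partitions
```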
Persistence storage levels
• MEMORY_ONLY – store the RDD as deserialized Java objects in the JVM
• MEMORY_AND_DISK – store the RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk
• MEMORY_ONLY_SER – store the RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK_SER – similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
• DISK_ONLY – store the RDD partitions only on disk
• MEMORY_ONLY_2, MEMORY_AND_DISK_2 – same as the levels above, but replicate each partition on two cluster nodes
• OFF_HEAP – store the RDD in serialized format in Tachyon
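For a level other than the default, the level is passed to persist(); a minimal sketch, again assuming a SparkContext sc and an illustrative path:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext sc; the path is illustrative
val big = sc.textFile("huge-file.txt")

// Keep what fits in memory, spill the remaining partitions to disk
big.persist(StorageLevel.MEMORY_AND_DISK)

big.count()  // materializes and persists the partitions
```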
Spark Advantages
• Same code can be used for streaming and batch processing
• In-memory processing
• Fault-tolerant RDD persistence
• Machine learning library built in
• Spark SQL (coming soon)
• Graph processing (GraphX, Bagel/Pregel)
Spark Drawbacks
• No append for output
• Lacks a built-in job scheduler
• Spark on YARN not quite ready for prime time
• Still a young project
Questions?