Scrap Your MapReduce - Apache Spark

Lightning-fast cluster computing

Rahul Kavale (rahulkav@thoughtworks.com)

Unmesh Joshi (uvjoshi@thoughtworks.com)

Some properties of “Big Data”

• Big data is inherently immutable, meaning it is not supposed to be updated once generated.

• Write operations are mostly coarse-grained; data is appended in bulk rather than updated record by record.

• Commodity hardware makes more sense for storage and computation of such enormous data, hence the data is distributed across a cluster of many such machines.

• The distributed nature makes programming against such data complicated.

Brushing up on Hadoop concepts

Distributed Storage => HDFS

Cluster Manager => YARN

Fault tolerance => achieved via replication

Job scheduling => Scheduler in YARN

Processing model => MapReduce (Mapper, Reducer, Combiner)

[Figure: HDFS architecture (http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif)]

MapReduce Programming Model

[Image: https://twitter.com/francesc/status/507942534388011008]

[Figure: http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop]

[Figure: http://www.slideshare.net/JimArgeropoulos/hadoop-101-32661121]

MapReduce pain points

• Considerable latency

• Only Map and Reduce phases

• Non-trivial to test

• Complex jobs result in convoluted workflows of chained MapReduce stages

• Not suitable for iterative processing

Immutability and MapReduce model

• HDFS storage is immutable, or append-only.

• The MapReduce model fails to exploit the immutable nature of the data.

• Intermediate results are persisted to disk, resulting in heavy IO and a serious performance hit.

Wouldn’t it be very nice if we could have

• Low latency

• A programmer-friendly programming model

• Unified ecosystem

• Fault tolerance and other typical distributed system properties

• Easily testable code

• Of course open source :)

What is Apache Spark

• A cluster computing engine

• Abstracts the storage and cluster management

• Unified interfaces to data

• API in Scala, Python, Java, R*

Where does it fit in the existing Big Data ecosystem

[Figure: http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html]

Why should you care about Apache Spark

• Abstracts underlying storage

• Abstracts cluster management

• Easy programming model

• Very easy to test the code

• Highly performant

• Petabyte sort record (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html)

• Offers in-memory caching of data

• Specialized applications:

• GraphX for graph processing

• Spark Streaming for stream processing

• MLlib for machine learning

• Spark SQL for structured queries

• Data exploration via the Spark shell

Programming model for Apache Spark

Word Count example

// `sc` is the SparkContext (the Spark shell creates one for you as `sc`).
val file = sc.textFile("input path")

val counts = file.flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))                           // pair each word with a count of 1
  .reduceByKey((a, b) => a + b)                     // sum the counts for each word

counts.saveAsTextFile("destination path")

Comparing the example with MapReduce: the canonical Hadoop word count needs a Mapper class, a Reducer class, and driver boilerplate, roughly an order of magnitude more code than the five lines above.

Spark Shell Demo

• SparkContext

• RDD

• RDD operations (a sketch of the session follows)
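
A minimal sketch of what such a demo session might look like in the Spark shell; the values are illustrative, not from the original slides:

$ spark-shell
scala> sc                                     // the SparkContext the shell creates for you
res0: org.apache.spark.SparkContext = ...

scala> val nums = sc.parallelize(1 to 100)    // an RDD from a local collection
scala> nums.filter(_ % 2 == 0).count()        // a transformation followed by an action
res1: Long = 50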

RDD

• RDD stands for Resilient Distributed Dataset.

• It is the basic abstraction in Spark.

• The equivalent of a distributed collection.

• The interface makes the distributed nature of the underlying data transparent.

• An RDD is immutable.

• Can be created by (each shown in the sketch below):

• parallelising a collection,

• transforming an existing RDD by applying a transformation function,

• reading from a persistent data store like HDFS.
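
A minimal Scala sketch of the three creation routes; the application name, master setting, and HDFS path are placeholder assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))    // 1. parallelise a local collection
val fromTransform  = fromCollection.map(_ * 2)          // 2. transform an existing RDD
val fromStorage    = sc.textFile("hdfs:///some/input")  // 3. read from a persistent store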

RDD is lazily evaluated

RDDs have two types of operations (see the sketch below):

• Transformations

Build up a DAG of transformations to be applied to the RDD

Do not evaluate anything

• Actions

Evaluate the DAG of transformations
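
A small sketch of the laziness, assuming the Spark shell's `sc`:

val lines   = sc.textFile("input path")   // transformation: nothing is read yet
val lengths = lines.map(_.length)         // transformation: only the DAG grows
val total   = lengths.reduce(_ + _)       // action: the whole DAG is evaluated here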

RDD operations

Transformations

map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]

filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]

flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]

sample(fraction : Float) : RDD[T] ⇒ RDD[T] (Deterministic sampling)

union() : (RDD[T],RDD[T]) ⇒ RDD[T]

join() : (RDD[(K, V)],RDD[(K, W)]) ⇒ RDD[(K, (V, W))]

groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]

reduceByKey(f : (V,V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]

partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
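
A few of the transformations above in use; the sample data is made up:

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val evens   = sc.parallelize(1 to 10).filter(_ % 2 == 0)            // filter
val words   = sc.parallelize(Seq("x y", "z")).flatMap(_.split(" ")) // flatMap
val summed  = pairs.reduceByKey(_ + _)                              // ("a", 4), ("b", 2)
val grouped = pairs.groupByKey()                                    // ("a", [1, 3]), ("b", [2])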

Actions

count() : RDD[T] ⇒ Long

collect() : RDD[T] ⇒ Seq[T]

reduce(f : (T,T) ⇒ T) : RDD[T] ⇒ T

lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)

save(path : String) : Outputs RDD to a storage system, e.g., HDFS
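
The actions, sketched against the `pairs` and `summed` RDDs defined above (`save` in the table corresponds to `saveAsTextFile` in the API; collect order may vary across partitions):

pairs.count()                        // 3
pairs.collect()                      // Array((a,1), (b,2), (a,3))
pairs.map(_._2).reduce(_ + _)        // 6
summed.lookup("a")                   // Seq(4)
pairs.saveAsTextFile("output path")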

Job Execution

Spark Execution in Context of YARN

[Figure: http://kb.cnblogs.com/page/198414/]

Fault tolerance via lineage

Each RDD remembers the chain of transformations that produced it. A lost partition is rebuilt by replaying that lineage from the source, with no need to replicate intermediate data:

HadoopRDD → MappedRDD → FlatMappedRDD → FilteredRDD → MappedRDD
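
The lineage of any RDD can be inspected with `toDebugString`; a sketch producing a chain like the one above (exact RDD class names vary by Spark version):

val result = sc.textFile("input path")  // HadoopRDD, then MappedRDD (values only)
  .flatMap(_.split(" "))                // FlatMappedRDD
  .filter(_.nonEmpty)                   // FilteredRDD
  .map(_.toUpperCase)                   // MappedRDD

println(result.toDebugString)           // prints the lineage graph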

Testing
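
The original slide left the test code to the demo; a minimal sketch of the usual approach, running the word-count logic from earlier against a local-mode SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

// Local mode needs no cluster, which is what makes Spark code easy to test.
val sc = new SparkContext(new SparkConf().setAppName("word-count-test").setMaster("local[2]"))

val counts = sc.parallelize(Seq("a b", "b"))
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .collectAsMap()

assert(counts == Map("a" -> 1, "b" -> 2))
sc.stop()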

Why is Spark more performant than MapReduce

Reduced IO

• No disk IO between phases since phases themselves are pipelined

• No network IO involved unless a shuffle is required

No Mandatory Shuffle

• Programs are not bound to map and reduce phases

• No mandatory shuffle-and-sort step between phases (see the sketch below)
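
A sketch of where a shuffle does and does not occur; `toDebugString` shows the resulting stage boundary:

val rdd = sc.textFile("input path")
  .map(_.toLowerCase)      // narrow: pipelined within one stage, no disk or network IO
  .filter(_.nonEmpty)      // narrow: still the same stage
  .map((_, 1))
  .reduceByKey(_ + _)      // wide: the only point where a shuffle happens

println(rdd.toDebugString) // two stages, separated by the shuffle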

In memory caching of data

• Optional in-memory caching (a sketch follows)

• The DAG engine can apply optimisations because, by the time an action is called, it knows the full set of transformations to be applied
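
Caching in a sketch, for a job that reuses the same RDD more than once:

val data = sc.textFile("input path").map(_.length).cache() // mark for in-memory caching

val firstPass  = data.reduce(_ + _) // computed from storage, then kept in memory
val secondPass = data.count()       // served from the cache, no re-read of the input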

Questions?

Thank You!
