In-memory data processing with Apache Spark
Pelle Jakovits
31 October 2018, Tartu
Outline
• Disk based vs In-Memory data processing
• Introduction to Apache Spark
– Resilient Distributed Datasets (RDD)
– RDD actions and transformations
– Fault tolerance
– Frameworks powered by Spark
• Advantages & Disadvantages
Memory vs Disk based processing
• In Hadoop MapReduce all input, intermediate and output data must be written to disk
• Even if the data is significantly reduced, it cannot be kept in memory between the Map and Reduce tasks
• Hadoop MapReduce is not suitable for all types of algorithms
– Iterative algorithms, graph processing, machine learning
In-Memory data processing frameworks
• Goal is to support computationally complex applications which can benefit from keeping intermediate data in memory
• Keep data in memory between data processing operations
• Input and output are still disk-based file storage systems such as HDFS
In-Memory data processing
• Data must fit into the collective memory of the cluster
• Should still support keeping data on disk
– when it does not fit into memory
– for fault tolerance
• Fault tolerance is more complicated
– When data is kept only in memory, a failure affects the whole application
– In Hadoop, input data is replicated in HDFS and readily available, so only the last Map or Reduce task has to be re-executed
Apache Spark
• MapReduce-like & in-memory data processing framework
• From Map & Reduce -> Map, Join, Co-group, Filter, Distinct, Union, Sample, ReduceByKey, etc
• Directed acyclic graph (DAG) task execution engine
– Users have more control over the data processing execution flow
• Uses the Resilient Distributed Dataset (RDD) abstraction
– Input data is loaded into RDDs
– RDD transformations and user defined functions are applied to define data processing applications
Apache Spark
• More than just a replacement for MapReduce
– Spark works with Scala, Java, Python and R
– Extended with built-in tools for SQL queries, stream processing, ML and graph processing
• Integrated with Hadoop Yarn and HDFS
• Included in many public cloud platforms alongside Hadoop MapReduce
– IBM cloud, Amazon AWS, Google Cloud, Microsoft Azure
Hadoop MapReduce vs Spark
• Time per iteration (s):
– Logistic Regression: Hadoop 110 s, Spark 0.96 s
– K-Means Clustering: Hadoop 155 s, Spark 4.1 s
Source: Introduction to Spark – Patrick Wendell, Databricks
Resilient Distributed Datasets
• Collections of data objects
• Distributed across cluster
• Stored in RAM or Disk
• Immutable/Read-only
• Built through parallel transformations
• Automatically rebuilt on failures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
Structure of RDDs
• Contains a number of rows
• Rows are divided into partitions
• Partitions are distributed between nodes in the cluster
• Each row is a tuple of records, similar to Apache Pig
• Can contain nested data structures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
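The row/partition structure described above can be sketched in plain Python (this is an illustration of the concept, with a hypothetical `partition` helper, not Spark's actual partitioner):

```python
# Plain-Python sketch of how an RDD's rows are split into partitions.
# Each chunk would live on one node of the cluster.
def partition(rows, num_partitions):
    """Split a list of rows into num_partitions roughly equal chunks."""
    chunks = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        chunks[i % num_partitions].append(row)
    return chunks

rows = [("w1", 1), ("w2", 1), ("w3", 1), ("w4", 1), ("w5", 1)]
partitions = partition(rows, 2)
# partitions[0] -> [("w1", 1), ("w3", 1), ("w5", 1)]
# partitions[1] -> [("w2", 1), ("w4", 1)]
```

Spark transformations are then applied to each partition in parallel, which is what makes the model scale across nodes.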
Spark DAG execution flow
Source: http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
Spark in Java
• A lot of additional boilerplate code related to data and function types
• There are different classes for each Tuple length (Tuple2, … , Tuple9):
Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
• In Java 8 you can use lambda functions:
JavaPairRDD<String, Integer> counts = pairs.reduceByKey( (a, b) -> a + b );
• But In older Java you must use predefined function interfaces:
– Function, Function2, Function3
– FlatMapFunction
– PairFunction
Java 7 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String line) {
return Arrays.asList(line.split(" "));
}});
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String word) {
return new Tuple2<String, Integer>(word, 1);
}});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>(){
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}});
Java 8 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
line -> Arrays.asList(line.split(" ")).iterator()
);
JavaPairRDD<String, Integer> pairs = words.mapToPair(
word -> new Tuple2<String, Integer>(word, 1)
);
JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey(
(x, y) -> x + y
);
Python example - WordCount
• Word count in Spark's Python API:
lines = sc.textFile(input_folder)
words = lines.flatMap(lambda line: line.split() )
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
RDD operations
• Actions
– Creating RDDs
– Storing RDDs
– Extracting data from RDDs
• Transformations
– Restructure or transform RDDs into new RDDs
– Apply user defined functions
RDD Actions
Loading Data
• Local data directly from memory
dataset = [1, 2, 3, 4, 5]
slices = 5  # Number of partitions
distData = sc.parallelize(dataset, slices)
• External data from HDFS or the local file system
input = sc.textFile("file.txt")
input = sc.textFile("directory/*.txt")
input = sc.textFile("hdfs://xxx:9000/path/file")
Storing data
counts.saveAsTextFile("hdfs://...");
counts.saveAsObjectFile("hdfs://...");
counts.saveAsHadoopFile(
"testfile.seq",
Text.class,
LongWritable.class,
SequenceFileOutputFormat.class
);
Extracting data from RDD
• Extract data out of distributed RDD object into driver program memory:
– collect() – Retrieve the whole RDD content as a list
– first() – Take the first element of the RDD
– take(n) – Take the first n elements of the RDD as a list
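The semantics of these three actions can be sketched in plain Python, with a local list standing in for the distributed RDD content (this is an illustration, not Spark code):

```python
# Plain-Python sketch of the extraction actions' semantics.
rdd_content = [("hello", 3), ("world", 1), ("spark", 2)]

collected = list(rdd_content)  # collect(): whole RDD content as a list
first = rdd_content[0]         # first(): the first element only
taken = rdd_content[:2]        # take(2): the first two elements as a list
```

In a real cluster, collect() pulls all partitions into the driver's memory, so it should only be used on small (already reduced) RDDs.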
Broadcast
• Share data with every node in the Spark cluster, so it can be accessed inside Spark functions:
broadcastVar = sc.broadcast([1992, "gray", "bear"])
result = input.map(lambda line: weight_first_bc(line, broadcastVar))
• Broadcast is not necessary if the data is very small. This would also work:
globalVar = [1992, "gray", "bear"]
result = input.map(lambda line: weight_first_bc(line, globalVar))
• However, this is inefficient when the passed-along data is larger (> 1 MB)
• Spark uses Torrent protocol to optimize broadcast data distribution
Other actions
• reduce(func) – Apply an aggregation function to all tuples in the RDD
• count() – Count the number of elements in the RDD
• countByKey() – Count the number of values for each unique key
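The semantics of these aggregation actions can be sketched in plain Python, with a local list of (key, value) pairs standing in for the RDD (an illustration, not Spark code):

```python
# Plain-Python sketch of reduce(), count() and countByKey() semantics.
from functools import reduce
from collections import Counter

pairs = [("a", 1), ("b", 2), ("a", 3)]

# reduce(func): fold all elements with a binary function (here: sum values)
total = reduce(lambda x, y: x + y, [v for _, v in pairs])

# count(): number of elements in the RDD
count = len(pairs)

# countByKey(): number of elements per unique key
by_key = Counter(k for k, _ in pairs)
```

Like collect(), these actions return their result to the driver program, so the output must be small enough to fit in driver memory.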
RDD Transformations
Map
• Applies a user defined function to every tuple in RDD.
• From the WordCount example, using a lambda function:
pairs = words.map(lambda word: (word, 1))
• Using a separately defined function:
def toPair(word):
pair = (word, 1)
return pair
pairs = words.map(toPair)
Map transformation
• pairs = words.map(lambda word: (word, 1))
FlatMap
• Similar to Map - applied to each tuple in RDD
• But can result in multiple output tuples
• From the Python WordCount example:
words = file.flatMap(lambda line: line.split())
• User defined function has to return a list
• Each element in the output list results in a new tuple inside the resulting RDD
Pelle Jakovits 26/42
FlatMap transformation
• words = lines.flatMap( lambda line: line.split() )
GroupBy & GroupByKey
• Restructure the RDD by grouping all the values inside the RDD
• Such restructuring is inefficient and should be avoided if possible
– It is better to use reduceByKey or aggregateByKey, which automatically apply an aggregation function on the grouped data
• The groupByKey operation uses the first value inside the RDD tuples as the grouping key
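The difference between the two can be sketched in plain Python (an illustration of the semantics, not Spark code): groupByKey ships every individual value to the grouping node, while reduceByKey merges values as they arrive, so far less data crosses the network.

```python
# Plain-Python sketch of groupByKey vs reduceByKey semantics.
from collections import defaultdict

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# groupByKey(): all values are collected into one list per key
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
# -> {"a": [1, 1, 1], "b": [1]}

# reduceByKey(lambda x, y: x + y): values are merged as they arrive,
# so only one running value per key has to be kept and shuffled
sums = {}
for k, v in pairs:
    sums[k] = sums.get(k, 0) + v
# -> {"a": 3, "b": 1}
```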
GroupByKey transformation
• Groups RDD by key, and results in a nested RDD
wordCounts = pairs.groupByKey()
ReduceByKey
• Groups all tuples in RDD by the first field in the tuple
• Applies a user defined aggregation function to all tuples inside a group
• Outputs a single tuple for each group
• From the Python WordCount example:
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
ReduceByKey
• ReduceByKey() applies GroupByKey() together with a nested Reduce(UDF)
wordCounts = pairs.reduceByKey(lambda a, b: a + b)
Working with Keys
• When using ...ByKey transformations, Spark expects the RDD to contain (key, value) tuples
• If the input RDD contains longer tuples, we first need to restructure it using a map() operation (note that Python 3 lambdas can no longer unpack tuples in their argument list):
data = sc.parallelize([("hi", 1, "file1"), ("bye", 3, "file2")])
pairs = data.map(lambda t: (t[0], (t[1], t[2])))
sums = pairs.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1]))
output = sums.collect()
for (key, value) in output:
    print(key, ",", value)
Other transformations
• sample(withReplacement, fraction, seed)
• distinct([numTasks]))
• union(otherDataset)
• filter(func)
• join(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
• cogroup(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
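The join and cogroup semantics can be sketched in plain Python over small local lists (an illustration only, not Spark code):

```python
# Plain-Python sketch of join() and cogroup() semantics on (K, V) pairs.
from collections import defaultdict

left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("c", "y")]

# join(): one (K, (V, W)) pair for every pair of values sharing a key
joined = [(k, (v, w)) for k, v in left for k2, w in right if k == k2]
# -> [("a", (1, "x")), ("a", (2, "x"))]

# cogroup(): one (K, (all V's, all W's)) tuple for every key in either dataset
grouped = defaultdict(lambda: ([], []))
for k, v in left:
    grouped[k][0].append(v)
for k, w in right:
    grouped[k][1].append(w)
# -> {"a": ([1, 2], ["x"]), "b": ([3], []), "c": ([], ["y"])}
```

Note that join keeps only keys present in both datasets, while cogroup keeps every key and may return empty sequences.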
Persisting/Caching data
• Spark uses Lazy evaluation
• Intermediate RDDs may be discarded to optimize memory consumption
• To force Spark to keep any intermediate data in memory, we can use:
– lineLengths.persist(StorageLevel.MEMORY_ONLY);
– This forces the RDD to be cached in memory after the first time it is computed
• NB! Caching should be used when an RDD is accessed multiple times!
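Why caching matters under lazy evaluation can be sketched in plain Python: a generator pipeline (like an uncached RDD) is recomputed for every traversal, while a materialized list (like a persisted RDD) is computed once. This is only an analogy for the concept, not Spark code:

```python
# Plain-Python sketch of lazy evaluation vs caching.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1          # count how often the map function runs
    return x * 2

data = [1, 2, 3]

# Uncached: each traversal re-applies the function to every element
uncached = lambda: (expensive(x) for x in data)
list(uncached())
list(uncached())
after_uncached = calls["n"]  # the function ran twice per element

# "Cached": compute once, then reuse the materialized result freely
cached = [expensive(x) for x in data]
sum(cached)
len(cached)
after_cached = calls["n"]    # only 3 more calls, however often reused
```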
Persistence levels
• DISK_ONLY
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
– More space-efficient
– Uses more CPU
• MEMORY_ONLY_2
– Replicate data on 2 executors
Fault tolerance
• Faults are inevitable when running distributed applications in large clusters and repeating long-running tasks can be costly
• Fault recovery is more complicated for In-memory frameworks
– In Spark only the initial input data is replicated on HDFS
– Hadoop MapReduce data is replicated in HDFS, so failed tasks can easily be repeated
• Checkpointing is typically used for long running in-memory distributed applications
– Processes periodically store their memory into disk storage
– Can affect the efficiency of the application
Spark Lineage
• Lineage is the history of RDDs
• Spark keeps track of each RDD partition's lineage
– What functions were applied to produce it
– Which input data partitions were involved
• Rebuild lost RDD partitions according to lineage, using the latest still available partitions
• No performance cost if nothing fails (in comparison to checkpointing)
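Lineage-based recovery can be sketched in plain Python: a lost partition is not restored from a replica; instead the chain of transformations that produced it is replayed on the surviving input partition. The `rebuild` helper below is a hypothetical illustration of the idea, not Spark's implementation:

```python
# Plain-Python sketch of rebuilding a lost partition from its lineage.
lineage = [
    lambda rows: [r.lower() for r in rows],  # e.g. a map() step
    lambda rows: [r for r in rows if r],     # e.g. a filter() step
]

input_partition = ["Alpha", "", "Beta"]      # still available (e.g. on HDFS)

def rebuild(partition, steps):
    """Recompute a lost partition by replaying its lineage."""
    for step in steps:
        partition = step(partition)
    return partition

recovered = rebuild(input_partition, lineage)
# -> ["alpha", "beta"]
```

Because the transformations are coarse-grained (applied to whole partitions), storing this small list of operations is far cheaper than replicating the data itself.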
Lineage
Source: Glenn K. Lockwood, Advanced Technologies at NERSC/LBNL
Apache Spark built-in extensions
• Spark SQL - Seamlessly mix SQL queries with Spark programs
– Similar to Pig and Hive
• Spark Streaming – Apply Spark on Streaming data
• Structured Streaming – Higher level abstraction for streaming applications
• MLlib - Machine learning library
• GraphX - Spark's API for graphs and graph-parallel computation
• SparkR – Utilize Spark in R scripts
Advantages of Spark
• Much faster than Hadoop when the data fits into memory
– Affects all higher-level Spark or Hadoop MapReduce frameworks
• Support for more programming languages
– Scala, Java, Python, R
• Has a lot of built-in extensions
– DataFrames, SQL, R, ML, Streaming, Graph processing
• It is constantly being updated
• Well suited for computationally complex algorithms processing medium-to-large scale data
Disadvantages of Spark
• What if data does not fit into the memory?
• Hard to keep track of how (well) the data is distributed
• Working in Java requires a lot of boilerplate code
• Saving as text files can be very slow
Conclusion
• RDDs offer a reasonably simple and efficient programming model for a broad range of applications
• Spark provides more data manipulation operations than just Map and Reduce.
• Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage
• Provides definite speedup when data fits into the collective memory
• Very large development community which has resulted in creation of many integrated tools for different types of applications
That’s all for this week
• Next week’s practice session
– Processing data with Apache Spark in Python
– Focus on RDD transformations
• Next week’s lecture
– SQL abstraction for distributed data processing
• HiveQL
• Spark SQL