In-memory data processing with Apache Spark
Pelle Jakovits
31 October 2018, Tartu
Outline
• Disk based vs In-Memory data processing
• Introduction to Apache Spark
– Resilient Distributed Datasets (RDD)
– RDD actions and transformations
– Fault tolerance
– Frameworks powered by Spark
• Advantages & Disadvantages
Memory vs Disk based processing
• In Hadoop MapReduce all input, intermediate and output data must be written to disk
• Even if the data is significantly reduced, it cannot be kept in memory between the Map and Reduce tasks
• Hadoop MapReduce is not suitable for all types of algorithms
– Iterative algorithms, graph processing, machine learning
In-Memory data processing frameworks
• Goal is to support computationally complex applications which can benefit from keeping intermediate data in memory
• Keep data in memory between data processing operations
• Input and output are still disk-based file storage systems such as HDFS
In-Memory data processing
• Data must fit into the collective memory of the cluster
• Should still support keeping data on disk
– when it does not fit into memory
– for fault tolerance
• Fault tolerance is more complicated
– When data is kept only in memory, a failure affects the whole application
– In Hadoop, input data is replicated in HDFS and readily available, so only the last Map or Reduce task has to be re-executed
Apache Spark
• MapReduce-like & in-memory data processing framework
• From Map & Reduce -> Map, Join, Co-group, Filter, Distinct, Union, Sample, ReduceByKey, etc
• Directed acyclic graph (DAG) task execution engine
– Users have more control over the data processing execution flow
• Uses the Resilient Distributed Dataset (RDD) abstraction
– Input data is loaded into RDDs
– RDD transformations and user defined functions are applied to define data processing applications
Apache Spark
• More than just a replacement for MapReduce
– Spark works with Scala, Java, Python and R
– Extended with built-in tools for SQL queries, stream processing, ML and graph processing
• Integrated with Hadoop Yarn and HDFS
• Included in many public cloud platforms alongside Hadoop MapReduce
– IBM cloud, Amazon AWS, Google Cloud, Microsoft Azure
Hadoop MapReduce vs Spark
• Time per iteration (s):
– Logistic Regression: Hadoop 110 s, Spark 0.96 s
– K-Means Clustering: Hadoop 155 s, Spark 4.1 s
Source: Introduction to Spark – Patrick Wendell, Databricks
Resilient Distributed Datasets
• Collections of data objects
• Distributed across cluster
• Stored in RAM or Disk
• Immutable/Read-only
• Built through parallel transformations
• Automatically rebuilt on failures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
Structure of RDDs
• Contains a number of rows
• Rows are divided into partitions
• Partitions are distributed between nodes in the cluster
• Each row is a tuple of records, similar to Apache Pig
• Can contain nested data structures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
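The row/partition structure described above can be sketched in plain Python (this is an illustration of the concept, with a hypothetical `partition` helper, not Spark's actual partitioner):

```python
# Plain-Python sketch of how an RDD's rows are split into partitions.
# Each chunk would live on one node of the cluster.
def partition(rows, num_partitions):
    """Split a list of rows into num_partitions roughly equal chunks."""
    chunks = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        chunks[i % num_partitions].append(row)
    return chunks

rows = [("w1", 1), ("w2", 1), ("w3", 1), ("w4", 1), ("w5", 1)]
partitions = partition(rows, 2)
# partitions[0] -> [("w1", 1), ("w3", 1), ("w5", 1)]
# partitions[1] -> [("w2", 1), ("w4", 1)]
```

Spark transformations are then applied to each partition in parallel, which is what makes the model scale across nodes.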
Spark DAG execution flow
Source: http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
Spark in Java
• A lot of additional boilerplate code related to data and function types
• There are different classes for each Tuple length (Tuple2, … , Tuple9):
Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
• In Java 8 you can use lambda functions:
JavaPairRDD<String, Integer> counts = pairs.reduceByKey( (a, b) -> a + b );
• But In older Java you must use predefined function interfaces:
– Function, Function2, Function3
– FlatMapFunction
– PairFunction
Java 7 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String line) {
return Arrays.asList(line.split(" "));
}});
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String word) {
return new Tuple2<String, Integer>(word, 1);
}});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>(){
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}});
Java 8 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
line -> Arrays.asList(line.split(" ")).iterator()
);
JavaPairRDD<String, Integer> pairs = words.mapToPair(
word -> new Tuple2<String, Integer>(word, 1)
);
JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey(
(x, y) -> x + y
);
Python example - WordCount
• Word count in Spark's Python API:
lines = sc.textFile(input_folder)
words = lines.flatMap(lambda line: line.split() )
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
RDD operations
• Actions
– Creating RDDs
– Storing RDDs
– Extracting data from RDDs
• Transformations
– Restructure or transform RDDs into new RDDs
– Apply user defined functions
RDD Actions
Loading Data
• Local data directly from memory
dataset = [1, 2, 3, 4, 5]
slices = 5  # Number of partitions
distData = sc.parallelize(dataset, slices)
• External data from HDFS or the local file system
input = sc.textFile("file.txt")
input = sc.textFile("directory/*.txt")
input = sc.textFile("hdfs://xxx:9000/path/file")
Storing data
counts.saveAsTextFile("hdfs://...");
counts.saveAsObjectFile("hdfs://...");
counts.saveAsHadoopFile(
"testfile.seq",
Text.class,
LongWritable.class,
SequenceFileOutputFormat.class
);
Extracting data from RDD
• Extract data out of distributed RDD object into driver program memory:
– collect() – Retrieve the whole RDD content as a list
– first() – Take the first element of the RDD
– take(n) – Take the first n elements of the RDD as a list
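The semantics of these three actions can be sketched in plain Python, with a local list standing in for the distributed RDD content (this is an illustration, not Spark code):

```python
# Plain-Python sketch of the extraction actions' semantics.
rdd_content = [("hello", 3), ("world", 1), ("spark", 2)]

collected = list(rdd_content)  # collect(): whole RDD content as a list
first = rdd_content[0]         # first(): the first element only
taken = rdd_content[:2]        # take(2): the first two elements as a list
```

In a real cluster, collect() pulls all partitions into the driver's memory, so it should only be used on small (already reduced) RDDs.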
Broadcast
• Share data with every node in the Spark cluster, so it can be accessed inside Spark functions:
broadcastVar = sc.broadcast([1992, "gray", "bear"])
result = input.map(lambda line: weight_first_bc(line, broadcastVar))
• Broadcast is not necessary if the data is very small. This would also work:
globalVar = [1992, "gray", "bear"]
result = input.map(lambda line: weight_first_bc(line, globalVar))
• However, this is inefficient when the passed-along data is larger (> 1 MB)
• Spark uses Torrent protocol to optimize broadcast data distribution
Other actions
• reduce(func) – Apply an aggregation function to all tuples in the RDD
• count() – Count the number of elements in the RDD
• countByKey() – Count the number of values for each unique key
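The semantics of these aggregation actions can be sketched in plain Python, with a local list of (key, value) pairs standing in for the RDD (an illustration, not Spark code):

```python
# Plain-Python sketch of reduce(), count() and countByKey() semantics.
from functools import reduce
from collections import Counter

pairs = [("a", 1), ("b", 2), ("a", 3)]

# reduce(func): fold all elements with a binary function (here: sum values)
total = reduce(lambda x, y: x + y, [v for _, v in pairs])

# count(): number of elements in the RDD
count = len(pairs)

# countByKey(): number of elements per unique key
by_key = Counter(k for k, _ in pairs)
```

Like collect(), these actions return their result to the driver program, so the output must be small enough to fit in driver memory.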
RDD Transformations
Map
• Applies a user defined function to every tuple in RDD.
• From the WordCount example, using a lambda function:
pairs = words.map(lambda word: (word, 1))
• Using a separately defined function:
def toPair(word):
pair = (word, 1)
return pair
pairs = words.map(toPair)
Map transformation
• pairs = words.map(lambda word: (word, 1))
FlatMap
• Similar to Map - applied to each tuple in RDD
• But can result in multiple output tuples
• From the Python WordCount example:
words = file.flatMap(lambda line: line.split())
• User defined function has to return a list
• Each element in the output list results in a new tuple inside the resulting RDD
Pelle Jakovits 26/42
FlatMap transformation
• words = lines.flatMap( lambda line: line.split() )
GroupBy & GroupByKey
• Restructure the RDD by grouping all the values inside the RDD
• Such restructuring is inefficient and should be avoided if possible
– It is better to use reduceByKey or aggregateByKey, which automatically apply an aggregation function on the grouped data
• The groupByKey operation uses the first value inside the RDD tuples as the grouping key
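The difference between the two can be sketched in plain Python (an illustration of the semantics, not Spark code): groupByKey ships every individual value to the grouping node, while reduceByKey merges values as they arrive, so far less data crosses the network.

```python
# Plain-Python sketch of groupByKey vs reduceByKey semantics.
from collections import defaultdict

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# groupByKey(): all values are collected into one list per key
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
# -> {"a": [1, 1, 1], "b": [1]}

# reduceByKey(lambda x, y: x + y): values are merged as they arrive,
# so only one running value per key has to be kept and shuffled
sums = {}
for k, v in pairs:
    sums[k] = sums.get(k, 0) + v
# -> {"a": 3, "b": 1}
```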
GroupByKey transformation
• Groups RDD by key, and results in a nested RDD
wordCounts = pairs.groupByKey()
ReduceByKey
• Groups all tuples in RDD by the first field in the tuple
• Applies a user defined aggregation function to all tuples inside a group
• Outputs a single tuple for each group
• From the Python WordCount example:
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
ReduceByKey
• ReduceByKey() applies GroupByKey() together with a nested Reduce(UDF)
wordCounts = pairs.reduceByKey(lambda a, b: a + b)
Working with Keys
• When using ...ByKey transformations, Spark expects the RDD to contain (key, value) tuples
• If the input RDD contains longer tuples, we first need to restructure it using a map() operation (note that Python 3 lambdas can no longer unpack tuples in their argument list):
data = sc.parallelize([("hi", 1, "file1"), ("bye", 3, "file2")])
pairs = data.map(lambda t: (t[0], (t[1], t[2])))
sums = pairs.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1]))
output = sums.collect()
for (key, value) in output:
    print(key, ",", value)
Other transformations
• sample(withReplacement, fraction, seed)
• distinct([numTasks]))
• union(otherDataset)
• filter(func)
• join(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
• cogroup(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
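The join and cogroup semantics can be sketched in plain Python over small local lists (an illustration only, not Spark code):

```python
# Plain-Python sketch of join() and cogroup() semantics on (K, V) pairs.
from collections import defaultdict

left = [("a", 1), ("a", 2), ("b", 3)]
right = [("a", "x"), ("c", "y")]

# join(): one (K, (V, W)) pair for every pair of values sharing a key
joined = [(k, (v, w)) for k, v in left for k2, w in right if k == k2]
# -> [("a", (1, "x")), ("a", (2, "x"))]

# cogroup(): one (K, (all V's, all W's)) tuple for every key in either dataset
grouped = defaultdict(lambda: ([], []))
for k, v in left:
    grouped[k][0].append(v)
for k, w in right:
    grouped[k][1].append(w)
# -> {"a": ([1, 2], ["x"]), "b": ([3], []), "c": ([], ["y"])}
```

Note that join keeps only keys present in both datasets, while cogroup keeps every key and may return empty sequences.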
Persisting/Caching data
• Spark uses Lazy evaluation
• Intermediate RDDs may be discarded to optimize memory consumption
• To force Spark to keep any intermediate data in memory, we can use:
– lineLengths.persist(StorageLevel.MEMORY_ONLY);
– This forces the RDD to be cached in memory after the first time it is computed
• NB! Caching should be used when an RDD is accessed multiple times!
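Why caching matters under lazy evaluation can be sketched in plain Python: a generator pipeline (like an uncached RDD) is recomputed for every traversal, while a materialized list (like a persisted RDD) is computed once. This is only an analogy for the concept, not Spark code:

```python
# Plain-Python sketch of lazy evaluation vs caching.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1          # count how often the map function runs
    return x * 2

data = [1, 2, 3]

# Uncached: each traversal re-applies the function to every element
uncached = lambda: (expensive(x) for x in data)
list(uncached())
list(uncached())
after_uncached = calls["n"]  # the function ran twice per element

# "Cached": compute once, then reuse the materialized result freely
cached = [expensive(x) for x in data]
sum(cached)
len(cached)
after_cached = calls["n"]    # only 3 more calls, however often reused
```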
Persistence levels
• DISK_ONLY
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
– More space-efficient
– Uses more CPU
• MEMORY_ONLY_2
– Replicate data on 2 executors
Fault tolerance
• Faults are inevitable when running distributed applications in large clusters and repeating long-running tasks can be costly
• Fault recovery is more complicated for In-memory frameworks
– In Spark only the initial input data is replicated on HDFS
– Hadoop MapReduce data is replicated in HDFS, so failed tasks can easily be repeated
• Checkpointing is typically used for long running in-memory distributed applications
– Processes periodically store their memory into disk storage
– Can affect the efficiency of the application
Spark Lineage
• Lineage is the history of RDDs
• Spark keeps track of each RDD partition's lineage
– What functions were applied to produce it
– Which input data partitions were involved
• Rebuild lost RDD partitions according to lineage, using the latest still available partitions
• No performance cost if nothing fails (in comparison to checkpointing)
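Lineage-based recovery can be sketched in plain Python: a lost partition is not restored from a replica; instead the chain of transformations that produced it is replayed on the surviving input partition. The `rebuild` helper below is a hypothetical illustration of the idea, not Spark's implementation:

```python
# Plain-Python sketch of rebuilding a lost partition from its lineage.
lineage = [
    lambda rows: [r.lower() for r in rows],  # e.g. a map() step
    lambda rows: [r for r in rows if r],     # e.g. a filter() step
]

input_partition = ["Alpha", "", "Beta"]      # still available (e.g. on HDFS)

def rebuild(partition, steps):
    """Recompute a lost partition by replaying its lineage."""
    for step in steps:
        partition = step(partition)
    return partition

recovered = rebuild(input_partition, lineage)
# -> ["alpha", "beta"]
```

Because the transformations are coarse-grained (applied to whole partitions), storing this small list of operations is far cheaper than replicating the data itself.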
Lineage
Source: Glenn K. Lockwood, Advanced Technologies at NERSC/LBNL
Apache Spark built-in extensions
• Spark SQL - Seamlessly mix SQL queries with Spark programs
– Similar to Pig and Hive
• Spark Streaming – Apply Spark on Streaming data
• Structured Streaming – Higher level abstraction for streaming applications
• MLlib - Machine learning library
• GraphX - Spark's API for graphs and graph-parallel computation
• SparkR – Utilize Spark in R scripts
Advantages of Spark
• Much faster than Hadoop when the data fits into memory
– Affects all higher-level Spark or Hadoop MapReduce frameworks
• Support for more programming languages
– Scala, Java, Python, R
• Has a lot of built-in extensions
– DataFrames, SQL, R, ML, Streaming, Graph processing
• It is constantly being updated
• Well suited for computationally complex algorithms processing medium-to-large scale data
Disadvantages of Spark
• What if data does not fit into the memory?
• Hard to keep track of how (well) the data is distributed
• Working in Java requires a lot of boilerplate code
• Saving as text files can be very slow
Conclusion
• RDDs offer a reasonably simple and efficient programming model for a broad range of applications
• Spark provides more data manipulation operations than just Map and Reduce.
• Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage
• Provides definite speedup when data fits into the collective memory
• Very large development community which has resulted in creation of many integrated tools for different types of applications
That’s all for this week
• Next week’s practice session
– Processing data with Apache Spark in Python
– Focus on RDD transformations
• Next week’s lecture
– SQL abstraction for distributed data processing
• HiveQL
• Spark SQL