Summary of Apache Spark
Original Papers:
1. “Spark: Cluster Computing with Working Sets” by Matei Zaharia, et al. HotCloud 2010.
2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” by Matei Zaharia, et al. NSDI 2012.
Motivation
• MapReduce is great, but there are applications where you iterate over the same set of data, e.g.,

for many iterations { old_data = new_data; new_data = func(old_data); }

• Problem: The body of each iteration can be described as a MapReduce task whose inputs and final outputs are GFS files. There is redundant work in storing the output data to GFS and then reading it back in the next iteration.
• Idea: We can provide a mode to cache the final outputs in memory if possible.
  o Challenge: if the machine crashes, we lose the outputs. Can we recover?
  o Solution: store the lineage of the data so that it can be reconstructed as needed (e.g., if partitions are lost or memory is insufficient).
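In Spark terms, the loop above becomes a chain of cached RDDs. A minimal sketch, assuming a spark-shell session (where `sc` is predefined) and with a hypothetical `step` function standing in for the per-iteration MapReduce body:

```scala
// Hypothetical stand-in for the per-iteration MapReduce body.
def step(data: org.apache.spark.rdd.RDD[Double]): org.apache.spark.rdd.RDD[Double] =
  data.map(_ * 0.5)

var current = sc.parallelize(Seq(1.0, 2.0, 3.0)).cache()
for (_ <- 1 to 10) {
  current = step(current).cache()  // outputs stay in memory; no round-trip to a DFS
}
val result = current.collect()     // each element halved 10 times
```

If a cached partition is lost, Spark replays the `map` steps recorded in the lineage rather than re-reading anything from stable storage.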
Motivation
• Spark’s goal was to generalize MapReduce to support new apps within the same engine
  o MapReduce problems can be expressed in Spark too.
  o Where Spark shines and MapReduce does not: applications that need to reuse a working set of data across multiple parallel operations
• Two reasonably small additions allowed the previous specialized models to be expressed within Spark:
  o fast data sharing
  o general DAGs
Motivation
Some key points about Spark:
• handles batch, interactive, and real-time processing within a single framework
• native integration with Java, Python, and Scala
  o has APIs written in these languages
• programming at a higher level of abstraction
• more general: map/reduce is just one set of supported constructs
Use Example
• We’ll run Spark’s interactive shell…

./bin/spark-shell

• Then from the “scala>” REPL prompt, let’s create some data…

val data = 1 to 10000

• Create an RDD based on that data…

val distData = sc.parallelize(data)

• Then use a filter to select values less than 10…

distData.filter(_ < 10).collect()
Resilient Distributed Datasets (RDD)
• Represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• RDDs can only be created through deterministic operations (aka transformations) on:
  o either data in stable storage:
    ▪ any file stored in HDFS
    ▪ or other storage systems supported by Hadoop
  o or other RDDs
• A program cannot reference an RDD that it cannot reconstruct after a failure
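A sketch of the creation paths, assuming a spark-shell session (`sc` predefined); the HDFS path is a placeholder:

```scala
// 1. From data in stable storage (HDFS or any Hadoop-supported system):
val lines = sc.textFile("hdfs://...")   // placeholder path; lazy, nothing is read yet
// 2. From a local collection distributed across the cluster:
val nums  = sc.parallelize(1 to 100)
// 3. From another RDD, via a deterministic transformation:
val evens = nums.filter(_ % 2 == 0)
```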
Resilient Distributed Datasets (RDD)
• Two types of operations on RDDs: transformations and actions
  o Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter).
    ▪ Transformations create a new dataset from an existing one.
    ▪ Transformations are lazy (not computed immediately); instead, the RDD remembers the transformations applied to some base dataset.
    ▪ The transformed RDD gets recomputed each time an action is run on it (by default).
  o They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system.
• However, an RDD can be persisted in memory or on disk
  o Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset, often making future actions more than 10x faster.
  o The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
  o Note that by default, the base RDD (the original data in stable storage) is not loaded into RAM, because the useful data after transformation might be only a small fraction (small enough to fit into memory).
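The laziness and persistence behavior can be seen in a few lines from spark-shell (`sc` predefined); a sketch:

```scala
val base    = sc.parallelize(1 to 1000)
val squares = base.map(x => x.toLong * x)  // transformation: lazy, nothing runs yet
squares.persist()                          // mark for in-memory reuse (cache() is shorthand)
val total = squares.reduce(_ + _)          // action: triggers the computation, fills the cache
val top   = squares.max()                  // second action: reuses the cached partitions
```

Note that `base` itself is never cached: only the transformed `squares` RDD is kept in memory, matching the default described above.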
RDD Implementation
Common interface of each RDD:
• A set of partitions: atomic pieces of the dataset
• A set of dependencies on parent RDDs: one RDD can have multiple parents
  o narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
  o wide dependencies: multiple child partitions may depend on one parent partition; requires the shuffle operation
• A function for computing the dataset based on its parents
• Metadata about its partitioning scheme
• Metadata about its data placement, e.g., preferredLocations(p) returns a list of nodes where partition p can be accessed faster due to data locality
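These interface pieces are visible on any RDD from spark-shell (`sc` predefined); a small probe:

```scala
val pairs   = sc.parallelize(1 to 100, numSlices = 4).map(x => (x % 10, x))
val grouped = pairs.reduceByKey(_ + _)
grouped.partitions.length                          // the set of partitions (4 here)
grouped.dependencies                               // parent link: a ShuffleDependency
grouped.preferredLocations(grouped.partitions(0))  // placement hint (empty for in-memory data)
println(grouped.toDebugString)                     // the full lineage, stage by stage
```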
Narrow vs Wide Dependencies
• Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
• Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and the re-computation can be done in parallel.
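Both kinds of dependency appear in a single word-count pipeline (spark-shell, `sc` predefined); a sketch:

```scala
val docs   = sc.parallelize(Seq("a b", "b c", "a a"))
val counts = docs
  .flatMap(_.split(" "))   // narrow: pipelined with the next map in one stage
  .map(w => (w, 1))        // narrow: still the same stage
  .reduceByKey(_ + _)      // wide: needs a shuffle, so a new stage begins here
val result = counts.collect().toMap
```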
Job Scheduling on RDDs
• Similar to Dryad, but takes data locality into account
• When the user runs an action, the scheduler builds a DAG of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible.
• Stage boundaries are the shuffle operations required for wide dependencies
• The scheduler launches tasks to compute missing partitions from each stage. Tasks are assigned to machines based on data locality using delay scheduling.
• For wide dependencies, intermediate records are materialized on the nodes holding parent partitions (similar to the intermediate map outputs of MapReduce) to simplify fault recovery.
Checkpointing
• Although lineage can always be used to recover RDDs after a failure, checkpointing can be helpful for RDDs with long lineage chains containing wide dependencies.
• For RDDs with narrow dependencies on data in stable storage, checkpointing is not worthwhile: reconstruction can be done in parallel for these RDDs, at a fraction of the cost of replicating the whole RDD.
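The API itself is two calls; a sketch from spark-shell (`sc` predefined), with a hypothetical checkpoint directory and a deliberately long lineage:

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory
var rdd = sc.parallelize(1 to 10)
for (_ <- 1 to 100) rdd = rdd.map(_ + 1)       // builds a 100-step lineage chain
rdd.checkpoint()                               // truncate the lineage at the next action
val total = rdd.sum()                          // action: materializes and checkpoints
```

(The chain here is all-narrow, so per the slide it would not actually be worth checkpointing; it only illustrates the mechanics.)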
RDD vs Distributed Shared Memory (DSM)
• Previous frameworks that support data reuse, e.g., Pregel and Piccolo:
  o perform data sharing implicitly for these patterns
  o specialized frameworks; do not provide abstractions for more general reuse
  o programming interface supports fine-grained updates (reads and writes to each memory location): fault tolerance requires expensive replication of data across machines or logging of updates across machines
• RDD:
  o only coarse-grained transformations (e.g., map, filter, and join): apply the same operation to many data items. Note that reads on RDDs can still be fine-grained.
  o fault tolerance only requires logging the transformation used to build a dataset instead of the actual data
  o RDDs are not suitable for applications that make asynchronous fine-grained updates to shared state.
RDD vs Distributed Shared Memory (DSM)
• Other benefits of RDDs:
  o RDDs are immutable. A system can mitigate slow nodes (stragglers) by running backup copies of slow tasks, as in MapReduce.
  o In bulk operations, a runtime can schedule tasks based on data locality to improve performance.
  o RDDs degrade gracefully when there is not enough memory to store them. An LRU eviction policy is used at the level of RDDs: a partition from the least recently accessed RDD is evicted to make room for a newly computed RDD partition. This is user-configurable via the “persistence priority” for each RDD.
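The released Spark API exposes this memory/recomputation trade-off as per-RDD storage levels rather than the paper’s numeric priority; a sketch (spark-shell, `sc` predefined):

```scala
import org.apache.spark.storage.StorageLevel

val big = sc.parallelize(1 to 100000)
big.persist(StorageLevel.MEMORY_AND_DISK)  // evicted partitions spill to disk instead of being dropped
val n = big.count()                        // action: materializes under the chosen level
// big.unpersist() releases the cached partitions explicitly
```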
Debugging RDDs
• One can reconstruct the RDDs later from the lineage and let the user query them interactively
• One can re-run any task from the job in a single-process debugger by recomputing the RDD partitions it depends on
• Similar to replay debuggers, but without the capturing/recording overhead
RDD Example
// load error messages from a log into memory; then interactively search for
// various patterns
// base RDD
val lines = sc.textFile("hdfs://...")
// transformed RDDs!
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()
// action 1
messages.filter(_.contains("mysql")).count()
// action 2
messages.filter(_.contains("php")).count()
[Diagram: the transformations build a chain of RDDs (lines → errors → messages → filtered); an action then returns a value to the driver.]
Shared Variables
• Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
  o For example, to give every node a copy of a large input dataset efficiently
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
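A sketch of a broadcast lookup table (spark-shell, `sc` predefined; the table contents are made up for illustration):

```scala
// Read-only lookup table, shipped once per executor rather than once per task.
val lookup = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))
val levels = sc.parallelize(Seq("INFO", "ERROR", "INFO"))
val scores = levels.map(l => lookup.value.getOrElse(l, 0)).collect()
```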
Shared Variables
• Accumulators are variables that can only be “added” to through an associative operation
  o used to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend them for new types
• Only the driver program can read an accumulator’s value, not the workers
  o Each accumulator is given a unique ID upon creation
  o Each worker creates a separate copy of the accumulator
  o Workers send messages to the driver about their updates to the accumulator
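A counter sketch using the current `longAccumulator` API (the original paper used `sc.accumulator`); spark-shell, `sc` predefined:

```scala
val badLines = sc.longAccumulator("badLines")       // created on the driver, given a unique ID
val records  = sc.parallelize(Seq("ok", "", "ok", ""))
records.foreach(l => if (l.isEmpty) badLines.add(1)) // each worker updates its own copy
val bad = badLines.value                             // driver merges the workers' updates
```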