Spark: Cluster Computing with Working Sets --Aaron 2013/03/28


Overview

• Motivation
• Spark
• Resilient Distributed Datasets (RDD)
• Shared Variables
• Experiments

Motivation

Hadoop

• Advantages:
  – Scalability
  – Fault Tolerance

• Disadvantage:
  – Acyclic data flow: not suitable for iterative algorithms

Disadvantage of Hadoop(1)

• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter.

• Interactive analytics: when querying large datasets, a user should be able to load a dataset of interest into memory across a number of machines and query it repeatedly.

Disadvantage of Hadoop(2)

• In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system.

• This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.

Spark

Programming Model

• Driver Program: implements the high-level control flow of the application and launches various operations in parallel.

• Resilient Distributed Datasets (RDD)
• Parallel Operations: passing a function to apply on a dataset.
• Shared Variables: they can be used in functions running on the cluster.
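A minimal Scala sketch of this programming model, using the released Apache Spark API rather than the paper's 2010 prototype (the application name and file path are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverSketch {
      def main(args: Array[String]): Unit = {
        // Driver program: sets up the context and controls the parallel operations
        val sc = new SparkContext(new SparkConf().setAppName("driver-sketch"))

        // An RDD defined from a file in stable storage (the path is hypothetical)
        val lines = sc.textFile("hdfs://namenode/logs/app.log")

        // A parallel operation: pass a function that Spark applies on the dataset
        val errorCount = lines.filter(line => line.contains("ERROR")).count()

        println(s"errors: $errorCount")
        sc.stop()
      }
    }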

Framework of Spark

Resilient Distributed Datasets(RDD)

Resilient Distributed Datasets

• A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

• The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage. This means that RDDs can always be reconstructed if nodes fail.

Changing the Persistence of RDD

• By default, RDDs are lazy and ephemeral.
• Users can alter the persistence of an RDD through two actions (sketched below):
  – Cache action: hints that the RDD should be kept in memory after the first time it is computed, because it will be reused.
  – Save action: evaluates the dataset and writes it to a distributed filesystem such as HDFS.
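A hedged sketch of the two actions, reusing sc and lines from the earlier sketch; cache() corresponds to the paper's cache action, and saveAsTextFile() stands in for its save action (the output path is hypothetical):

    val errors = lines.filter(_.contains("ERROR"))

    errors.cache()                      // hint: keep the RDD in memory after it is first computed
    val total    = errors.count()       // first action: computes the partitions and caches them
    val timeouts = errors.filter(_.contains("timeout")).count()  // reuses the in-memory data

    errors.saveAsTextFile("hdfs://namenode/out/errors")  // evaluate and write to HDFS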

Operations on RDD(1)

• Transformations (create an RDD): operations on either (1) data in stable storage or (2) other RDDs.

• Examples of transformations include map, filter, and join.

Operations on RDD(2)

• Actions: operations that return a value to the application or export data to a storage system.

• Examples of actions include count, collect, and save.
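A short Scala sketch of both kinds of operations, continuing with the same sc (paths and record formats are hypothetical):

    // Transformations: define new RDDs from stable storage or from other RDDs
    val users  = sc.textFile("hdfs://namenode/data/users.csv")
      .map(line => (line.split(",")(0), line))          // (userId, user record)
    val visits = sc.textFile("hdfs://namenode/data/visits.csv")
      .map(line => (line.split(",")(1), line))          // (userId, visit record)
    val joined = users.join(visits)                     // join two RDDs by key

    // Actions: return a value to the driver or export data to storage
    val numJoined = joined.count()                      // value returned to the driver
    joined.saveAsTextFile("hdfs://namenode/out/joined") // export to a storage system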

Operations on RDD(3)

• Programmers can call a persist method to indicate which RDDs they want to reuse in future operations.

• Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.

• Users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
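In the released API this is the persist method with a storage level; a sketch reusing the joined RDD from the sketch above (the per-RDD spill priority described in the paper has no direct equivalent here):

    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
    // MEMORY_AND_DISK lets Spark spill partitions to disk when RAM runs out
    joined.persist(StorageLevel.MEMORY_AND_DISK)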

Example1 of RDD

• These datasets will be stored as a chain of objects capturing the lineage of each RDD. Each dataset object contains a pointer to its parent and information about how the parent was transformed.
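The code for this example did not survive extraction; below is a sketch in the spirit of the paper's log-mining example, where each RDD records a pointer to its parent and the transformation that produced it (the path is hypothetical):

    val file       = sc.textFile("hdfs://namenode/logs/app.log")   // HDFS-backed RDD
    val errs       = file.filter(_.contains("ERROR"))              // parent: file,       how: filter
    val cachedErrs = errs.cache()                                  // same data, kept in memory on first use
    val ones       = cachedErrs.map(_ => 1)                        // parent: cachedErrs, how: map
    val count      = ones.reduce(_ + _)                            // action: walks the lineage and evaluates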

Lineage Chain of Example1

Example2 of RDD

Lineage Chain of Example2

RDD Objects

• Each RDD object implements the same simple interface, which consists of five pieces of information:

Examples of RDD Operations

• HDFS files (lines):
  – partitions(): returns one partition for each block of the file (with the block’s offset stored in each Partition object)
  – preferredLocations: gives the nodes the block is on
  – iterator: reads the block
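Only the HDFS-file examples survived extraction; a hedged Scala sketch of what such an interface could look like (the member names, including dependencies and partitioner, follow the authors' later RDD paper and are illustrative, not the exact classes in the released codebase):

    // Illustrative types; not the real Spark classes
    trait Partition   { def index: Int }
    trait Dependency
    trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }

    trait RDDSketch[T] {
      def partitions: Seq[Partition]                    // e.g. one partition per HDFS block
      def preferredLocations(p: Partition): Seq[String] // e.g. the nodes holding that block
      def iterator(p: Partition): Iterator[T]           // e.g. read the block and yield its lines
      def dependencies: Seq[Dependency]                 // parent RDDs and how their partitions are used
      def partitioner: Option[Partitioner]              // how keys map to partitions, if hash/range partitioned
    }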

Dependencies between RDDs(1)

• Narrow Dependencies: each partition of the parent RDD is used by at most one partition of the child RDD (1:1). Map leads to a narrow dependency.

• Wide Dependencies: multiple child partitions may depend on one parent partition (1:N). Join leads to wide dependencies.

Dependencies between RDDs(2)

• Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis.

• Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.

• Recovery after a node failure is more efficient with a narrow dependency than with a wide dependency.
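A small sketch of the two kinds of dependencies, reusing lines, users, and visits from the earlier sketches:

    // Narrow dependencies: map followed by filter is pipelined on one node,
    // element by element, without moving data between machines
    val nonEmpty = lines.map(_.trim).filter(_.nonEmpty)

    // Wide dependency: joining two RDDs shuffles data from all parent partitions
    val joinedVisits = users.join(visits)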

Parallel Operations

• reduce: Combines dataset elements using an associative function to produce a result at the driver program.

• collect: Sends all elements of the dataset to the driver program.
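A tiny sketch of both operations with the same sc (the numbers are made up):

    val nums  = sc.parallelize(1 to 1000)
    val total = nums.reduce(_ + _)   // associative function, merged at the driver: 500500
    val all   = nums.collect()       // Array[Int] with every element, sent to the driver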

Shared Variables

Shared Variables

• Programmers invoke operations like map, filter and reduce by passing closures (functions) to Spark. These closures can refer to variables in the scope where they are created; normally, when Spark runs a closure on a worker node, these variables are copied to the worker.

• However, Spark also lets programmers create two restricted types of shared variables to support two simple but common usage patterns.

Broadcast Variables

• When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
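A short sketch of a broadcast variable with the released API, reusing sc (the lookup table is hypothetical):

    // Ship a read-only lookup table to the workers once, rather than with every closure
    val bcNames = sc.broadcast(Map("US" -> "United States", "FR" -> "France"))

    val codes = sc.parallelize(Seq("US", "FR", "US"))
    val names = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
    names.collect()   // Array("United States", "France", "United States")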

Accumulators

• Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type.

• On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
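A short sketch with the released accumulator API, reusing sc and lines (longAccumulator replaces the older sc.accumulator call):

    // Workers only add to the accumulator; the merged value is read at the driver
    val badLines = sc.longAccumulator("badLines")
    lines.foreach { line =>
      if (!line.contains("\t")) badLines.add(1)   // per-task updates are sent back with task results
    }
    println(badLines.value)   // only meaningful on the driver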

Experiments

Results(1)

• Dataset: 29 GB
• Environment: 20 “m1.xlarge” EC2 nodes with 4 cores each
• Algorithm: logistic regression

Results(2)

Results(3)

Thank You!!!
