
Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

Apache Spark

Corso di Sistemi e Architetture per Big Data A.A. 2016/17

Valeria Cardellini


The reference Big Data stack


(Figure: the layered reference Big Data stack — High-level Interfaces, Data Processing, Data Storage, Resource Management, with Support / Integration alongside.)


MapReduce: weaknesses and limitations

•  Programming model
   –  Hard to implement everything as a MapReduce program
   –  Multiple MapReduce steps can be needed even for simple operations
   –  Lack of control, structures and data types

•  No native support for iteration
   –  Each iteration writes/reads data to/from disk, leading to overhead
   –  Need to design algorithms that minimize the number of iterations


MapReduce: weaknesses and limitations

•  Efficiency (recall HDFS)
   –  High communication cost: computation (map), communication (shuffle), computation (reduce)
   –  Frequent writing of output to disk
   –  Limited exploitation of main memory

•  Not feasible for real-time processing
   –  A MapReduce job requires scanning the entire input
   –  Stream processing and random access are impossible


Alternative programming models

•  Based on directed acyclic graphs (DAGs)
   –  E.g., Spark, Spark Streaming and Storm

•  SQL-based
   –  We have already analyzed Hive and Pig

•  NoSQL databases
   –  We have already analyzed HBase, MongoDB, Cassandra, …


Alternative programming models

•  Based on Bulk Synchronous Parallel (BSP)
   –  Developed by Leslie Valiant during the 1980s
   –  Considers communication actions en masse
   –  Suitable for graph analytics at massive scale and massive scientific computations (e.g., matrix, graph and network algorithms)


   –  Examples: Google’s Pregel, Apache Hama, Apache Giraph
   –  Apache Giraph: open-source counterpart to Pregel, used at Facebook to analyze the users’ social graph


Apache Spark

•  Separate, fast and general-purpose engine for large-scale data processing
   –  Not a modified version of Hadoop
   –  The leading candidate for “successor to MapReduce”

•  In-memory data storage for very fast iterative queries
   –  At least 10x faster than Hadoop

•  Suitable for general execution graphs and powerful optimizations

•  Compatible with Hadoop’s storage APIs
   –  Can read/write to any Hadoop-supported system, including HDFS and HBase


Project history

•  Spark project started in 2009
•  Developed originally at UC Berkeley’s AMPLab by Matei Zaharia
•  Open sourced in 2010, Apache project since 2013
•  In 2014, Zaharia founded Databricks
•  It is now the most popular project for Big Data analysis
•  Current version: 2.1.1


Spark: why a new programming model?

•  MapReduce simplified Big Data analysis
   –  But it executes jobs in a simple yet rigid structure:
      •  Processing or transformation step (map)
      •  Synchronization step (shuffle)
      •  Step to combine results from the cluster nodes (reduce)

•  As soon as MapReduce became popular, users wanted:
   –  More complex, multi-stage applications (e.g., iterative graph algorithms and machine learning)
   –  More efficiency
   –  More interactive ad-hoc queries
   –  Faster in-memory data sharing across parallel jobs (required by both multi-stage and interactive applications)


Memory I/O vs. disk I/O

•  See “Latency numbers every programmer should know” http://bit.ly/2pZXIU9


Data sharing in MapReduce

•  Slow due to replication, serialization and disk I/O


Data sharing in Spark

•  Distributed in-memory: 10x-100x faster than disk and network


Spark stack


Spark core

•  Provides basic functionalities (including task scheduling, memory management, fault recovery, and interaction with storage systems) used by the other components

•  Provides a data abstraction called resilient distributed dataset (RDD): a collection of items distributed across many compute nodes that can be manipulated in parallel
   –  Spark Core provides many APIs for building and manipulating these collections

•  Written in Scala, but offers APIs for Java, Python and R


Spark tools

•  A number of frameworks built on top of Spark

•  Spark SQL
   –  Spark’s package for working with structured data
   –  Allows querying data via SQL as well as the Apache Hive variant of SQL (HiveQL), and supports many data sources, including Hive tables, Parquet, and JSON
   –  Extends the Spark RDD API

•  Spark Streaming
   –  Enables processing live streams of data
   –  Extends the Spark RDD API


Spark tools

•  MLlib
   –  Library containing common machine learning (ML) functionalities and algorithms, including classification, regression, clustering and collaborative filtering

•  GraphX
   –  Library that provides an API for manipulating graphs and performing graph-parallel computations
   –  Also includes common graph algorithms (e.g., PageRank)
   –  Extends the Spark RDD API


Spark schedulers

•  Spark can exploit different schedulers to execute its applications

•  Standalone Spark scheduler
   –  Simple cluster scheduler included in Spark

•  Hadoop YARN

•  Mesos
   –  Mesos and Spark both originated at the AMPLab @ UC Berkeley


Spark architecture


•  Master/worker architecture


Spark architecture


http://spark.apache.org/docs/latest/cluster-overview.html

•  Driver that talks to the master
•  Master manages the workers, in which executors run


Spark architecture

•  Applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (the driver program)

•  Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads

•  To run on a cluster, the SparkContext can connect to a cluster manager (Spark’s cluster manager, Mesos or YARN), which allocates resources across applications

•  Once connected, Spark acquires executors on nodes in the cluster, which run computations and store data. Next, it sends the application code to the executors. Finally, the SparkContext sends tasks to the executors to run


Spark data flow


Spark programming model


Spark programming model


Resilient Distributed Dataset (RDD)

•  The primary abstraction in Spark: a distributed memory abstraction

•  Immutable, partitioned collection of elements
   –  Like a LinkedList<MyObjects>
   –  Operated on in parallel
   –  Cached in memory across the cluster nodes

•  Each node of the cluster used to run an application contains at least one partition of the RDD(s) defined in the application


Resilient Distributed Dataset (RDD)

•  Stored in the main memory of the executors running on the worker nodes (when possible) or on the nodes’ local disks (if there is not enough main memory)
   –  Can persist in memory, on disk, or both

•  Allow executing in parallel the code invoked on them
   –  Each executor of a worker node runs the specified code on its partition of the RDD
   –  A partition is an atomic piece of information
   –  Partitions of an RDD can be stored on different cluster nodes


Resilient Distributed Dataset (RDD)

•  Immutable once constructed
   –  i.e., the RDD content cannot be modified

•  Automatically rebuilt on failure (but no replication)
   –  Spark tracks lineage information to efficiently recompute lost data
   –  For each RDD, Spark knows how it has been constructed and can rebuild it if a failure occurs
   –  This information is represented by means of a DAG connecting input data and RDDs

•  Interface
   –  Clean language-integrated API for Scala, Python, Java, and R
   –  Can be used interactively from the Scala console


Resilient Distributed Dataset (RDD)

•  An RDD can be created by:
   –  Parallelizing existing collections of the hosting programming language (e.g., collections and lists of Scala, Java, Python, or R)
      •  Number of partitions can be specified by the user
      •  API: parallelize
   –  Loading (large) files stored in HDFS or any other file system
      •  One partition per HDFS block
      •  API: textFile
   –  Transforming an existing RDD
      •  Number of partitions depends on the transformation type
      •  API: transformation operations (map, filter, flatMap, …)


Resilient Distributed Dataset (RDD)

•  Applications suitable for RDDs
   –  Batch applications that apply the same operation to all elements of a dataset

•  Applications not suitable for RDDs
   –  Applications that make asynchronous fine-grained updates to shared state, e.g., the storage system of a web application


Resilient Distributed Dataset (RDD)

•  Spark programs are written in terms of operations on RDDs

•  RDDs are built and manipulated through:
   –  Coarse-grained transformations
      •  map, filter, join, …
   –  Actions
      •  count, collect, save, …


Spark and RDDs

•  Spark manages scheduling and synchronization of the jobs

•  Manages the splitting of RDDs into partitions and allocates RDDs’ partitions to the nodes of the cluster

•  Hides the complexities of fault tolerance and slow machines

•  RDDs are automatically rebuilt in case of machine failure


Spark and RDDs


Spark programming model

•  The Spark programming model is based on parallelizable operators

•  Parallelizable operators are higher-order functions that execute user-defined functions in parallel

•  A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs

•  Job description based on a DAG


Higher-order functions

•  Higher-order functions: the RDD operators

•  Two types of RDD operators: transformations and actions

•  Transformations: lazy operations that create new RDDs
   –  Lazy: partitions of a dataset are materialized on demand, when they are used in a parallel operation

•  Actions: operations that launch a computation and return a value to the driver program or write data to external storage


Higher-order functions


•  Transformations and actions available on RDDs in Spark
   –  Seq[T] denotes a sequence of elements of type T


RDD transformations: map

•  map: takes as input a function which is applied to each element of the RDD and maps each input item to another item

•  filter: generates a new RDD by filtering the source dataset using the specified function

•  flatMap: takes as input a function which is applied to each element of the RDD; can map each input item to zero or more output items


Examples in Scala
   –  Range class in Scala: ordered sequence of integer values in range [start; end) with non-zero step
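The Scala code on the original slide is embedded as an image; below is a minimal sketch of the three transformations, assuming an existing SparkContext sc (data and variable names are illustrative):

// Parallelize a Scala Range into an RDD
val nums = sc.parallelize(1 to 10)

// map: one output element per input element
val squares = nums.map(x => x * x)              // 1, 4, 9, ..., 100

// filter: keep only the elements satisfying the predicate
val evens = squares.filter(x => x % 2 == 0)     // 4, 16, 36, 64, 100

// flatMap: zero or more output elements per input element
val lines = sc.parallelize(Seq("a b", "c"))
val words = lines.flatMap(line => line.split(" "))   // "a", "b", "c"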


RDD transformations: reduceByKey

•  reduceByKey: aggregates values with identical key using the specified function

•  Groups are independently processed
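A minimal sketch, assuming an existing SparkContext sc and illustrative key/value data:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2)))
val sums  = pairs.reduceByKey((x, y) => x + y)
// result: ("a", 3), ("b", 1) — the groups for "a" and "b" are aggregated independently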


RDD transformations: join

•  join: performs an equi-join on the key of two RDDs

•  Join candidates are independently processed
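A minimal sketch of the equi-join, assuming an existing SparkContext sc and illustrative data:

val ratings = sc.parallelize(Seq(("movie1", 4.0), ("movie2", 3.5)))
val titles  = sc.parallelize(Seq(("movie1", "The Movie"), ("movie2", "Another Movie")))
val joined  = ratings.join(titles)
// result: ("movie1", (4.0, "The Movie")), ("movie2", (3.5, "Another Movie"))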


Basic RDD actions

•  collect: returns all the elements of the RDD as an array

•  take: returns an array with the first n elements in the RDD

•  count: returns the number of elements in the RDD
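A minimal sketch of these three actions, assuming an existing SparkContext sc:

val rdd   = sc.parallelize(1 to 100)
val all   = rdd.collect()   // Array with all 100 elements (only safe for small RDDs)
val first = rdd.take(5)     // Array(1, 2, 3, 4, 5)
val n     = rdd.count()     // 100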


Basic RDD actions

•  reduce: aggregates the elements in the RDD using the specified function

•  saveAsTextFile: writes the elements of the RDD as a text file either to the local file system or HDFS
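A minimal sketch, assuming an existing SparkContext sc (the output path is illustrative):

val rdd = sc.parallelize(1 to 100)
val sum = rdd.reduce((a, b) => a + b)        // 5050
rdd.saveAsTextFile("hdfs://.../output")      // writes one part file per partition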


How to create RDDs

•  Turn a collection into an RDD

•  Load text file from local file system, HDFS, or S3
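The code on the original slide is an image; a minimal sketch of both creation methods, assuming an existing SparkContext sc:

// From an in-memory collection; the second argument sets the number of partitions
val data  = sc.parallelize(List(1, 2, 3, 4), 2)

// From a text file on the local file system, HDFS, or S3 (one partition per HDFS block)
val lines = sc.textFile("hdfs://...")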


Lazy transformations

•  Transformations are lazy: they are not computed until an action requires a result to be returned to the driver program

•  This design enables Spark to perform operations more efficiently, as operations can be grouped together
   –  E.g., Spark can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset
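A small illustration of laziness, assuming an existing SparkContext sc: nothing is read or computed until the action is invoked.

val lines  = sc.textFile("hdfs://...")
val errors = lines.filter(line => line.contains("ERROR"))   // no work done yet
val n = errors.count()   // the action triggers reading the file and running the filter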


First examples

•  Let us analyze some simple examples that use the Spark API
   –  WordCount
   –  Pi estimation

•  See http://spark.apache.org/examples.html


WordCount in Scala
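The code on the original slide is an image; a minimal sketch of the same word count written step by step (without chaining), assuming an existing SparkContext sc:

val textFile = sc.textFile("hdfs://...")
val words    = textFile.flatMap(line => line.split(" "))   // split lines into words
val pairs    = words.map(word => (word, 1))                // one (word, 1) pair per word
val counts   = pairs.reduceByKey(_ + _)                    // sum the counts per word
counts.saveAsTextFile("hdfs://...")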


WordCount in Scala: with chaining

•  Transformations and actions can be chained together

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


WordCount in Python

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
output = counts.collect()   # returns the counts to the driver as a list
counts.saveAsTextFile("hdfs://...")


WordCount in Java (Java 7 and earlier)

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");


Spark’s Java API allows creating tuples using the scala.Tuple2 class

Pair RDDs are RDDs containing key/value pairs


WordCount in Java 8

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");


SparkContext

•  Main entry point for Spark functionality
   –  Represents the connection to a Spark cluster, and can be used to create RDDs on that cluster

•  Also available in the shell (as the variable sc)

•  Only one SparkContext may be active per JVM
   –  You must stop() the active SparkContext before creating a new one
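A minimal sketch of how a standalone application typically creates its SparkContext (the application name and master URL are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("local[*]")    // or e.g. a YARN, Mesos or spark://host:7077 master
val sc = new SparkContext(conf)
// ... build and use RDDs via sc ...
sc.stop()                   // stop the active context before creating a new one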


WordCount in Java (complete - 1)


WordCount in Java (complete - 2)


Persistence

•  Spark supports persisting a dataset in memory across operations
   –  When you persist an RDD, each node stores in memory any partitions of it that it computes, and reuses them in other actions on that dataset (or on datasets derived from it)
   –  This allows future actions to be much faster (often more than 10x)
   –  An RDD is persisted using the persist() or cache() methods on it
   –  You can specify the storage level (memory is the default)
   –  Spark’s cache is fault-tolerant: a lost RDD partition is automatically recomputed using the transformations that originally created it

•  Useful when data is accessed repeatedly, e.g., when querying a small “hot” dataset or when running an iterative algorithm like logistic regression or PageRank


Persistence: storage level

•  Using persist() you can specify the storage level for persisting an RDD
   –  cache() is the same as calling persist() with the default storage level (MEMORY_ONLY)

•  Storage levels for persist():
   –  MEMORY_ONLY
   –  MEMORY_AND_DISK
   –  MEMORY_ONLY_SER, MEMORY_AND_DISK_SER
   –  DISK_ONLY, …

•  Which storage level is best? A few things to consider:
   –  Try to keep as much as possible in memory
   –  Serialization makes the objects much more space-efficient
   –  Try not to spill to disk unless the functions that computed your datasets are expensive
   –  Use replication only if you want fault tolerance
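A minimal sketch of choosing a storage level, assuming an existing SparkContext sc and an illustrative input path:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://...")
// cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)
lines.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk if they do not fit in memory
lines.count()   // the first action materializes and caches the RDD
lines.count()   // later actions reuse the cached partitions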


Persistence: example of iterative algorithm

•  Logistic regression
   –  Supervised learning algorithm used to classify input data into categories
   –  Uses gradient descent
   –  x: training data
   –  w: weights
   –  y: values we want to predict given x

val points = spark.textFile(...).map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}


–  We ask Spark to persist the RDD, if needed, for reuse
–  The gradient formula is based on the logistic function
–  Source: “Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing”


Persistence: example of iterative algorithm

•  Spark outperforms Hadoop by up to 20x in iterative machine learning
   –  The speedup comes from avoiding I/O and deserialization costs by storing data in memory


HadoopBinMem: Hadoop deployment that converts input data into low-overhead binary format in first iteration to eliminate text parsing and stores it in in-memory HDFS


Spark programming interface

•  A Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster


How Spark works at runtime

•  The user application creates RDDs, transforms them, and runs actions

•  This results in a DAG of operators

•  The DAG is compiled into stages
   –  Stages are sequences of RDDs without a shuffle in between

•  Each stage is executed as a series of tasks (one task for each partition)

•  Actions drive the execution
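A small illustration of these concepts, assuming an existing SparkContext sc: toDebugString prints the lineage DAG, and the shuffle introduced by reduceByKey splits the job into two stages, each executed as one task per partition.

val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)          // shuffle: stage boundary

println(counts.toDebugString)  // shows the lineage/DAG; nothing has run yet
counts.count()                 // the action triggers the stages and their tasks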


Stage execution

•  Spark:
   –  Creates a task for each partition in the new RDD
   –  Schedules and assigns tasks to slaves

•  All this happens internally (you do not need to do anything)


Summary of Spark components

•  RDD: parallel dataset with partitions

•  DAG: logical graph of RDD operations

•  Stage: set of tasks that run in parallel

•  Task: fundamental unit of execution in Spark


(The components above are listed from coarse grain to fine grain.)


Fault tolerance in Spark

•  RDDs track the series of transformations used to build them (their lineage)

•  Lineage information is used to recompute lost data

•  RDDs are stored as a chain of objects capturing the lineage of each RDD


Job scheduling in Spark

•  Spark job scheduling takes into account which partitions of persistent RDDs are available in memory

•  When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph

•  A stage contains as many pipelined transformations with narrow dependencies as possible

•  The boundaries of a stage are:
   –  Shuffles for wide dependencies
   –  Already computed partitions


Job scheduling in Spark

•  The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD

•  Tasks are assigned to machines based on data locality
   –  If a task needs a partition that is available in the memory of a node, the task is sent to that node


Spark stack


Spark SQL


•  Spark interface for structured data processing
•  Porting of Apache Hive to run on Spark
•  Compatible with existing Hive data (you can run unmodified Hive queries on existing data)
•  Speedup of up to 40x
•  Motivation:
   –  Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
   –  Many data users know SQL
   –  Can we extend Hive to run on Spark?
   –  It started with the Shark project


Spark SQL: the beginning

•  The Shark project modified the Hive backend to run over Spark

•  Shark employs column-oriented storage

•  Shark limitations:
   –  Limited integration with Spark programs
   –  Hive optimizer not designed for Spark


Spark SQL

•  Borrows from Shark:
   –  Hive data loading
   –  In-memory column store

•  Added by Spark SQL:
   –  RDD-aware optimizer (Catalyst optimizer)
   –  Schema added to RDDs (DataFrame)
   –  Rich language interfaces


Spark SQL: efficient in-memory storage

•  Simply caching records as Java objects is inefficient due to high per-object overhead

•  Spark SQL employs column-oriented storage using arrays of primitive types

•  This format is called Parquet


Spark SQL: DataFrame

•  Extension of RDD that provides a distributed collection of rows with a homogeneous schema

•  Equivalent to a table in a relational database (or a data frame in Python)

•  Can be constructed from: structured data files, tables in Hive, external databases, or existing RDDs

•  Can also be manipulated in similar ways to RDDs

•  DataFrames are lazy

•  The DataFrame API is available in Scala, Java, Python, and R
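A minimal sketch of creating and querying a DataFrame, assuming an existing SparkSession named spark (as in the Spark shell); the input file is illustrative:

val df = spark.read.json("people.json")   // infer the schema from a JSON file
df.printSchema()
df.filter(df("age") > 21).show()

df.createOrReplaceTempView("people")      // register the DataFrame for SQL queries
spark.sql("SELECT name, age FROM people WHERE age > 21").show()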


Spark on Amazon EMR

•  Example in Scala using Hive on Spark: query about domestic flights in the US http://amzn.to/2r9vgme

val df = session.read.parquet("s3://us-east-1.elasticmapreduce.samples/flightdata/input/")

// Parquet files can be registered as tables and used in SQL statements
df.createOrReplaceTempView("flights")

// Top 10 airports with the most departures since 2000
val topDepartures = session.sql("""SELECT origin, count(*) AS total_departures
  FROM flights WHERE year >= '2000'
  GROUP BY origin
  ORDER BY total_departures DESC LIMIT 10""")


Spark Streaming

•  Spark Streaming: extension of the core Spark API that allows the analysis of streaming data
   –  Data is ingested and analyzed in micro-batches

•  High-level abstraction called DStream (discretized stream), which represents a continuous stream of data
   –  A sequence of RDDs
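A minimal sketch of a DStream word count, assuming an existing SparkContext sc; the socket source and batch interval are illustrative (Spark Streaming is covered in detail later):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(1))         // 1-second micro-batches
val lines  = ssc.socketTextStream("localhost", 9999)      // DStream = sequence of RDDs
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()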


Spark Streaming

•  Internally, the input data stream is divided into micro-batches, which the Spark engine processes to produce batches of results

•  We will analyze Spark Streaming in detail in the next lessons


References

•  Zaharia et al., “Spark: Cluster Computing with Working Sets”, HotCloud’10. http://bit.ly/2rj0uqH

•  Zaharia et al., “Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing”, NSDI'12. http://bit.ly/1XWkOFB

•  Karau et al., “Learning Spark - Lightning-Fast Big Data Analysis”, O’Reilly, 2015.
