1
Rapid Cluster Computing with Apache Spark
Zohar Elkayam CTO, Brillix
Twitter: @realmgic
2
Who am I?
• Zohar Elkayam, CTO at Brillix
• Programmer, DBA, team leader, database trainer, public speaker, and a senior consultant for over 18 years
• Oracle ACE Associate
• Part of ilOUG – Israel Oracle User Group
• Blogger – www.realdbamagic.com and www.ilDBA.co.il
3
About Brillix
• We offer complete, integrated end-to-end solutions based on best-of-breed innovations in database, security and big data technologies
• We provide complete end-to-end 24x7 expert remote database services
• We offer professional customized on-site trainings, delivered by our top-notch world recognized instructors
4
Some of Our Customers
5
Agenda
• The Big Data problem and possible solutions
• Basic Spark Core
• Working with RDDs
• Working with Spark Cluster and Parallel programming
• Spark modules: Spark SQL and Spark Streaming
• Performance and Troubleshooting
6
Our Goal Today
• Know more about Big Data and Big Data solutions
• Get a taste of Spark abilities – not becoming a Spark expert
• This is a starting point – don’t be afraid to try
7
The REAL Agenda
• At the end of the seminar day a feedback form will be distributed; we would love to hear your opinion. A tablet will be raffled each day among those who fill in the feedback!
10:30-10:45 Break
12:30-13:30 Lunch for all conference attendees in Sasson Hall 2
15:00-15:15 Sweet break in the reception area
16:30 Going home
The Challenge
And a Possible Solution…
9
The Big Data Challenge
10
Volume
• Big data comes in one size: Big.
• Size is measured in Terabytes (10^12), Petabytes (10^15), Exabytes (10^18), Zettabytes (10^21)
• Storing and handling the data becomes an issue
• Producing value out of the data in a reasonable time is an issue
11
Data Grows Fast!
12
Variety
• Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio and videos
• Wide variety of rapidly evolving data types requires highly flexible stores and handling

Un-Structured        Structured
Objects              Tables
Flexible             Columns and Rows
Structure Unknown    Predefined Structure
Textual and Binary   Mostly Textual
13
Data Types By Industry
14
Velocity
• The speed at which data is being generated and collected
• Streaming data and large-volume data movement
• High velocity of data capture requires rapid ingestion
• Might cause a backlog problem
15
The Backlog Problem
• Caused when data is produced very quickly
• The time it takes to digest new data is as long as, or very close to, the time it takes for new data to be generated
• If the intake of new data is down for any reason, there is no way to catch up on the missing data, causing a backlog
16
The Internet of Things (IoT)
17
Value
Big data is not about the size of the data,
It’s about the value within the data
18
19
So, We Define the Big Data Problem…
• When the data is too big or moves too fast to handle in a sensible amount of time
• When the data doesn't fit any conventional database structure
• When we think we can still produce value from that data and want to handle it
• When the technical solution to the business need becomes part of the problem
How to do Big Data
21
22
Big Data in Practice
• Big data is big: technological framework and infrastructure solutions are needed
• Big data is complicated:
– We need developers to manage handling of the data
– We need devops to manage the clusters
– We need data analysts and data scientists to produce value
23
Possible Solutions: Scale Up
• Older solution: use a giant server with a lot of resources (scale up: more cores, faster processors, more memory) to handle the data
– Process everything on a single server with hundreds of CPU cores
– Use lots of memory (1+ TB)
– Have a huge data store on high-end storage solutions
• Data needs to be copied to the processes in real time, so it's no good for large amounts of data (Terabytes to Petabytes)
24
Another Solution: Distributed Systems
• A scale-out solution: use distributed systems – multiple machines for a single job/application
• More machines means more resources
– CPU
– Memory
– Storage
• But the solution is still complicated: infrastructure and frameworks are needed
25
Distributed Infrastructure Challenges
• We need infrastructure that is built for:
– Large-scale
– Linear scale out ability
– Data-intensive jobs that spread the problem across clusters of server nodes
• Storage: efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data
• Network infrastructure that can quickly import large data sets and then replicate it to various nodes for processing
• High-end hardware is too expensive - we need a solution that uses cheaper hardware
26
Distributed System/Framework Challenges
• How do we distribute our workload across the system?
• Programming complexity – keeping the data synced
• What to do with faults and redundancy?
• How do we handle security demands to protect highly-distributed infrastructure and data?
A Big Data Solution: Apache Hadoop
28
Apache Hadoop
• Open source project run by the Apache Foundation (2006)
• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure
• It has been the driving force behind the growth of the big data industry
• Get the public release from:
– http://hadoop.apache.org/core/
29
Original Hadoop 1.0 Components
• HDFS (Hadoop Distributed File System) – distributed file system that runs in a clustered environment
• MapReduce – programming technique for running processes over a clustered environment
• Hadoop's main idea: distribute the data to many servers, and then bring the program to the data
30
Hadoop 2.0
• Hadoop 2.0 changed the Hadoop conception and introduced a better resource management concept:
– Hadoop Common
– HDFS
– YARN
– Multiple data processing frameworks including MapReduce, Spark and others
31
HDFS is...
• A distributed file system
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures and still stay resilient
• Intended for larger files
• Designed for batch inserts and appending data (no updates)
32
Files and Blocks
• Files are split into 128MB blocks (the single unit of storage)
– Managed by NameNode and stored on DataNodes
– Transparent to users
• Replicated across machines at load time
– Same block is stored on multiple machines
– Good for fault-tolerance and access
– Default replication factor is 3
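A back-of-the-envelope sketch of the block math above. This is plain Python, not an HDFS API; `hdfs_usage` is a hypothetical helper for illustration only:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128MB block, as on the slide
REPLICATION = 3                 # default replication factor

def hdfs_usage(file_size_bytes):
    """Return (block_count, raw_bytes_stored_with_replication)."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    raw = file_size_bytes * REPLICATION  # each byte stored on 3 machines
    return blocks, raw

# A 1GB file occupies 8 blocks and ~3GB of raw cluster storage
print(hdfs_usage(1024 * 1024 * 1024))
```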
33
HDFS Node Types
• HDFS has three types of nodes
• DataNode
– Responsible for actual file storage
– Serves file data to clients
• NameNode (master node)
– Distributes files in the cluster
– Responsible for replication between the DataNodes and for file block locations
• BackupNode
– A backup of the NameNode
34
Using HDFS in Command Line
35
What Does HDFS Look Like (GUI)
36
Interfacing with HDFS
37
MapReduce is...
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• MapReduce can be written in Java, Scala, C, Python, Ruby and others
• Bring the code to the data, not the data to the code
38
MapReduce Paradigm
• Implement two functions:
• MAP – takes a large problem, divides it into sub-problems, and performs the same function on all sub-problems
Map(k1, v1) -> list(k2, v2)
• REDUCE – combines the output from all sub-problems
Reduce(k2, list(v2)) -> list(v3)
• The framework handles everything else (almost)
• Values with the same key must go to the same reducer
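The Map/shuffle/Reduce contract above can be sketched in a few lines of plain Python. `run_mapreduce` and the example data are illustrative only (a single-process stand-in, not any framework's API); the shuffle step is what guarantees that values with the same key reach the same reducer:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # MAP: each (k1, v1) record yields a list of (k2, v2) pairs
    mapped = [kv for rec in records for kv in map_fn(*rec)]
    # SHUFFLE: group values so each key's values land at one reducer
    groups = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)
    # REDUCE: Reduce(k2, list(v2)) -> final value per key
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}

# Hypothetical example: max reading per sensor
map_fn = lambda key, line: [(line.split()[0], int(line.split()[1]))]
reduce_fn = lambda key, values: max(values)
result = run_mapreduce([(1, "TLV 30"), (2, "NYC 25"), (3, "TLV 33")],
                       map_fn, reduce_fn)
```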
39
Divide and Conquer
40
YARN
• Takes care of distributed processing and coordination
• Scheduling
– Jobs are broken down into smaller chunks called tasks
– These tasks are scheduled to run on data nodes
• Task Localization with Data
– Framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task
– Code is moved to where the data is
41
YARN (cont.)
• Error Handling
– Failures are an expected behavior so tasks are automatically re-tried on other machines
• Data Synchronization
– Shuffle and Sort barrier re-arranges and moves data between machines
– Input and output are coordinated by the framework
42
Submitting a Job
• The yarn script, with a class argument, launches a JVM and executes the provided job

$ yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob \
/user/sample/hamlet.txt \
/user/sample/wordcount/
43
Resource Manager: UI
44
Application View
45
Hadoop Main Problems
• But Hadoop (the MapReduce framework – not the MapReduce paradigm) had some problems:
– Developing MapReduce was complicated – there was more than just business logic to develop
– Transferring data between stages requires the intermediate data to be written to disk (and then read by the next step)
– Multi-step jobs needed orchestration and abstraction solutions
– Initial resource management was very painful – the MapReduce framework was based on resource slots
A Different Big Data Solution: Apache Spark
47
Introducing Apache Spark
• Apache Spark is a fast, general engine for large-scale data processing on a cluster
• Originally developed at UC Berkeley in 2009 as a research project, and is now an open source Apache top-level project
• Main idea: use the memory resources of the cluster for better performance
• It is now one of the fastest-growing projects today
48
Spark Advantages
• High-level programming framework: programmers focus on what (logic), not how
• Cluster computing
– Managed by a single master node
– Distributed to worker nodes
– Scalable and fault tolerant by the framework
• Distributed storage
– Data is distributed when stored
– Replication for efficiency and fault tolerance
– "Bring code to the data" state of mind
• High performance via in-memory utilization and caching
49
Code Complexity
50
Scalability
• Spark is highly scalable
• Adding worker nodes to the cluster increases performance in a near-linear scale
– More processing power
– More memory
• Nodes can be added and removed according to load – ideal for cloud computing (EC2)
51
Fault Tolerance
• Commodity hardware is bound to fail
• Spark is built for low-cost clusters and has fault tolerance embedded in the framework
– The system continues to function
– The master re-assigns tasks to nodes
– Data is replicated so there is no data loss
– Nodes rejoin the cluster automatically when they recover
52
Spark and Hadoop
• Spark and Hadoop are built to co-exist
• Spark can use other storage systems (S3, local disks, NFS) but works best when combined with HDFS
– Uses Hadoop InputFormats and OutputFormats
– Fully compatible with Avro and SequenceFiles, as well as other types of files
• Spark can use YARN for running jobs
53
Spark and Hadoop (cont.)
• Spark interacts with the Hadoop ecosystem:
– Flume
– Sqoop (watch out for DDoS on the database…)
– HBase
– Hive
• Spark can also interact with tools outside the Hadoop ecosystem: Kafka, NoSQL, Cassandra, XAP, Relational databases, and more
54
The Spark Stack
• In addition to the core Spark engine, there are related projects that extend Spark functionality
55
Spark Use Cases
• Spark is especially useful when working with any combination of:
– Large amounts of data
– Intensive computing
– Iterative algorithms
• Spark does well because of:
– Distributed storage
– Distributed computing
– In-memory processing and pipelining
56
Common Spark Use Cases
• ETL Processing
• Text Mining
• Index Building
• Graph Creation and analysis
• Pattern recognition
• Fraud detection
• Collaborative filtering
• Stream processing
• Prediction models
• Sentiment analysis
• Risk assessment
• Machine learning
57
Examples for Common Use Cases
• Risk analysis
– How likely is this borrower to pay back a loan?
• Recommendation
– Which products will this customer enjoy?
• Predictions
– How can we prevent service outage?
• Classification
– How can we tell which mail is spam and which is not?
58
Spark 2.0
59
Spark 2.0 Major Changes
• Major performance improvements
• Unifying DataFrames and Datasets for Scala/Java
• Changes to extensions:
– Multiple changes to MLlib and Machine Learning
– Improvements to Spark Streaming [ALPHA]
– Spark SQL supports ANSI SQL 2003
– R UDFs
• Over 2000 bugs fixed
• Current version: 2.0.2 (released Nov 14, 2016)
Basic Spark
Spark Core
61
What is Apache Spark
• Apache Spark is a fast, general engine for large-scale data processing
• Written in Scala
• Spark Shell
– Interactive interface for learning, testing, and data exploration
– Scala and Python shells available
– Spark on R using RStudio and SparkR
• Spark Application
– Framework for running large-scale processes
– Supports Scala, Python, and Java
62
Starting the Shells

$ pyspark
Welcome to Spark version 1.6.0
Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

$ spark-shell
Welcome to Spark version 1.6.0
Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Spark context available as sc. SQL context available as sqlContext.
scala>
63
Spark Context
• Every Spark application requires a Spark Context
– The main entry point to the Spark API
• Spark shells provide a preconfigured Spark Context called sc
• Spark applications need to create their own Spark Context instance
• The Spark Context is where the "magic" happens
64
RDD: Resilient Distributed Datasets
• The basic component is the RDD
– Resilient – if data is lost, it can be recreated from previous steps
– Distributed – appears as a single collection, but is actually distributed across nodes
– Dataset – initial data can come from a file or be created programmatically
• RDDs are the fundamental unit of data in Spark
• Most Spark programs consist of performing operations on RDDs
65
How to Create an RDD
• We can create an RDD in 3 ways:
– Create the RDD from a file, set of files, or directory
– Create RDD from data already in memory
– Create RDD by manipulating another RDD
• Later, we will talk about creating RDDs from different data sources…
66
Creating RDD from Files
• Creating an RDD from files:
– Use SparkContext.textFile – it can read a single file, a comma-delimited list, or wildcards
– Each line in the file is a separate record in the RDD
• Files are referenced by absolute or relative URI
– Absolute: file:/home/myfile.txt
– Relative (uses the default file system): myfile.txt

sc.textFile("myfile.txt")
sc.textFile("mydata/*.txt")
sc.textFile("myfile1.txt,myfile2.txt")
67
Example: Creating RDD From Files (Scala)

scala> val mydata = sc.textFile("file:/home/spark/derby.log")
16/06/12 13:15:39 INFO SparkContext: Created broadcast 3 from textFile at <console>:27
mydata: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:27

scala> mydata.count()
[..]
16/06/12 13:15:41 INFO DAGScheduler: Job 0 finished: count at <console>:30, took 0.489132 s
res3: Long = 13

scala> val mydata = sc.textFile("hdfs:/tmp/eventlog-demo.log")
[..]
mydata: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:27

scala> mydata.count()
16/06/12 13:16:41 INFO DAGScheduler: Job 1 finished: count at <console>:30, took 1.515042 s
res4: Long = 451210
68
RDD Operations
• There are two types of operations:
– Actions – return values
– Transformations – define a new RDD based on the current one(s)
• RDD have a lazy execution model
– Transformations set things up
– Actions cause calculations to actually be performed
69
RDD Operations: Actions
• Common actions:
– take(n) – return an array of the first n elements
– collect() – return an array of all elements
– saveAsTextFile(file) – save the RDD to a text file
– count() – return the number of elements in the RDD

scala> mydata.take(2)
res6: Array[String] = Array(2016-06-08T16:48:49|121.170.77.248|FR|SUCCESS, 2016-06-08T16:48:49|142.13.127.131|FR|SUCCESS)

scala> for (line <- mydata.take(2)) println(line)
2016-06-08T16:48:49|121.170.77.248|FR|SUCCESS
2016-06-08T16:48:49|142.13.127.131|FR|SUCCESS

scala> mydata.count()
res4: Long = 451210
70
RDD Operations: Transformations
• Transformations define a new RDD based on the current one
• RDDs are immutable
– Data in an RDD cannot change
– Transform in sequence to modify as needed
• Operations can be chained (piped) for multiple operations
• Common transformations:
– map(function) – creates a new RDD by performing a function on each record in the base RDD
– filter(function) – creates a new RDD by including/excluding records of the base RDD according to the Boolean function it receives
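The lazy execution model described above can be imitated with plain Python generators. This is a pure-Python sketch, not Spark: `lazy_map`/`lazy_filter` are illustrative stand-ins showing that transformations define work without running it, and only the final "action" (consuming the iterator) triggers computation:

```python
def lazy_map(fn, source):
    for item in source:            # body runs only when consumed
        yield fn(item)

def lazy_filter(pred, source):
    for item in source:
        if pred(item):
            yield item

trace = []
def traced(x):
    trace.append(x)                # record when work actually happens
    return x * 10

# "Transformations": nothing executes yet
pipeline = lazy_map(traced, lazy_filter(lambda x: x % 2 == 0, range(6)))
assert trace == []                 # still lazy

# "Action": forces the whole chain to run
result = list(pipeline)
```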
What is Functional Programming?
72
Functional Programming
• Spark depends on the concept of functional programming
– Functions are the fundamental unit of programming
– Functions have input and output only – there is no state or side effects
• Key concepts
– Passing functions as input to other functions
– Anonymous functions
73
Passing Functions as Parameters
• Some of the RDD operations take functions as parameters
• The received function is applied to all the records in the RDD
• For example:
– map gets a function as a parameter. That function will convert each record in the RDD to a key and value tuple (AKA Pair RDD)
– filter gets a function which checks each record in the RDD and returns a Boolean value for filtering the data
74
Defining Functions and Passing Them
• Python example:

> def toUpper(s):
    return s.upper()
> mydata.map(toUpper).take(3)

• Scala example:

> def toUpper(s: String): String = { s.toUpperCase }
> mydata.map(toUpper).take(2)
75
Anonymous Functions
• Scala, Python, R, and Java can declare anonymous one-time functions
• These are in-line functions without a name, often used for one-off operations
• Spark doesn't require the use of anonymous functions
• Examples:
– Python: lambda x: …
– Scala: x => …
– Java 8: x -> …
76
Using Anonymous Functions
• Python:
> mydata.map(lambda line: line.upper()).take(3)
• Scala:
> mydata.map(line => line.toUpperCase()).take(2)
• Scala, using "_" as the anonymous parameter:
> mydata.map(_.toUpperCase()).take(2)
Working with RDDs
78
RDD Data Types
• An RDD can hold any type of element
– Primitive: integers, chars, Booleans, etc.
– Sequences: strings, lists, tuples, dictionaries, arrays, and all kinds of nested data types
– Scala and Java serialized types
– Mixed types
• Some RDD types have additional functionality– Pair RDDs consist of Key-Value pairs
– Double RDDs consist of numeric data
79
Generating RDDs From Collections
• We can create RDDs from collections rather than files. Common uses: testing, integration, and cases where we need to generate data programmatically
• Example:

> randomnumlist = [random.uniform(0, 10) for _ in xrange(10000)]
> randomrdd = sc.parallelize(randomnumlist)
> print "Mean is %f" % randomrdd.mean()
80
Common RDD Operations (1)
• Common transformations:
– flatMap – maps an element to 0 or more output elements
– distinct – returns a new dataset that contains the unique elements of the original RDD
• Other RDD operations:
– first – return the first element in the dataset
– foreach – apply a function to each element in an RDD
– top(n) – return the n largest elements using natural ordering
81
Common RDD Operations (2)
• Sampling operations
– sample(percent) – create a new RDD with a sampling of elements
– takeSample(percent) – return an array of sampled elements (with or without replacement)
• Double RDD operations
– Statistical functions: mean, sum, stdev etc.
82
Using flatMap and distinct

> sc.textFile(file) \
.flatMap(lambda line: line.split()) \
.distinct()

> sc.textFile(file).flatMap(line => line.split("\\W")).distinct
Input:
I see the world
and the world see me

After flatMap: I, see, the, world, and, the, world, see, me
After distinct: I, see, the, world, and, me
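The same flow can be simulated in plain Python (a sketch of the semantics, not Spark's API; note that Spark's distinct does not guarantee the order shown here):

```python
lines = ["I see the world", "and the world see me"]

# flatMap: one line maps to many words, flattened into one list
words = [w for line in lines for w in line.split()]

# distinct: keep unique elements (dict.fromkeys preserves first-seen order)
unique = list(dict.fromkeys(words))
```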
83
Pair RDD
• Pair RDDs are a special form of RDD
– Each element must be a key-value pair (tuple)
– Keys and values can be any type
• Use Pair RDDs when using MapReduce algorithms
• Many additional functions for common data processing needs: sorting, grouping, joining, etc.
84
Simple Pair RDD
• Create a Pair RDD from a comma-delimited file (CSV)

> users = sc.textFile("file:/users.csv") \
.map(lambda line: line.split(",")) \
.map(lambda fields: (fields[0], fields[1]))

> val users = sc.textFile("file:/users.csv")
.map(line => line.split(','))
.map(fields => (fields(0), fields(1)))
user001,Zohar Elkayam
user002,Efrat
user009,Tamar
user100,Ido

(user001, Zohar Elkayam)
(user002, Efrat)
(user009, Tamar)
(user100, Ido)
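The two map steps on this slide can be mimicked with plain Python lists (an illustrative sketch of the semantics, not PySpark itself):

```python
csv_lines = ["user001,Zohar Elkayam", "user002,Efrat",
             "user009,Tamar", "user100,Ido"]

# first "map": split each CSV line into fields
fields = [line.split(",") for line in csv_lines]

# second "map": build (key, value) tuples -> the "Pair RDD"
pairs = [(f[0], f[1]) for f in fields]
```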
85
Creating Pair RDDs
• Commonly used functions that create Pair RDDs:
– map
– flatMap
– flatMapValues
– keyBy
• Deciding what the key is and what the value is matters: the first step in most workflows is to convert the base RDD to a Pair RDD
86
Complex Values
• When creating key-value pairs, the value can be a complex value

> users = sc.textFile("file:/users.csv") \
.map(lambda line: line.split(",")) \
.map(lambda fields: (fields[0], (fields[1], fields[2])))
user001,Zohar,Elkayam
user002,Efrat,Elkayam
user009,Tamar,Fritzi
user100,Ido,Bob

(user001, (Zohar, Elkayam))
(user002, (Efrat, Elkayam))
(user009, (Tamar, Fritzi))
(user100, (Ido, Bob))
87
Using flatMapValues
• flatMapValues converts multiple values having the same key into separate key-value pairs

Input file:
0001 a1:b1:c1
0002 a2:b2
0003 c3

> users = sc.textFile("file") \
.map(lambda line: line.split(" ")) \
.map(lambda fields: (fields[0], fields[1])) \
.flatMapValues(lambda val: val.split(":"))

After the maps: (0001, a1:b1:c1), (0002, a2:b2), (0003, c3)
After flatMapValues: (0001, a1), (0001, b1), (0001, c1), (0002, a2), (0002, b2), (0003, c3)
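A plain-Python sketch of what flatMapValues does (illustrative only, not the Spark API): the key is repeated for every value produced from its original value.

```python
pairs = [("0001", "a1:b1:c1"), ("0002", "a2:b2"), ("0003", "c3")]

# flatMapValues: split each value, keep the key for every piece
flat = [(k, v) for k, vals in pairs for v in vals.split(":")]
```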
88
MapReduce: Reminder
• MapReduce is a common programming paradigm
• MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
• Hadoop MapReduce was the first major framework implementation, but it had some limitations:
– Each job can have only one Map and one Reduce
– Job output and intermediate data must be saved to files
• Spark implements the MapReduce model with greater flexibility:
– Map and Reduce functions can be interspersed
– Results are stored in memory (or spilled to disk if there is not enough memory)
– Operations can easily be chained
89
MapReduce in Spark
• MapReduce in Spark works on Pair RDDs
• Map phase:
– Operates on one record at a time
– "Maps" each record to one or more new records
– Use map and flatMap for the mapping phase
• Reduce phase:
– Works on the Map output
– Consolidates multiple records
– reduceByKey operation
90
Word Count using Spark
• The famous word count example, made very easy using Spark:

Input: the cat sat on the mat

> count = sc.textFile("file") \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda v1, v2: v1 + v2)

After flatMap: the, cat, sat, on, the, mat
After map: (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1)
After reduceByKey: (the, 2), (cat, 1), (sat, 1), (on, 1), (mat, 1)
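The same three-step pipeline, sketched in plain Python to show what each Spark stage contributes (a single-process illustration, not PySpark):

```python
from collections import Counter

lines = ["the cat sat on the mat"]

words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                        # map to (word, 1)

counts = Counter()
for word, one in pairs:                                # reduceByKey: v1 + v2
    counts[word] += one
```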
91
Why Is It Always Word Count?!
• Word count is an easy way to explain many things:
– File handling
– Breaking lines into key-value pairs
– Reducing by key and handling the values
• Calculated statistics are often simple aggregate functions, just like in the word count example
• Many common tasks are very similar to word count – log file analysis, for example
92
ReduceByKey
• The reduceByKey function behaves like the reduce phase in the original MapReduce model
• The function reduceByKey receives must be binary – it combines two values of the same key
• In order to work properly, the function must be
– Commutative (x+y = y+x)
– Associative ((x+y)+z = x+(y+z))
• All keys are being handled together (piped)
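Why commutativity and associativity matter: Spark is free to combine partial results per partition in any order, so any grouping must give the same answer. A plain-Python sketch (the "partitions" here are just slices for illustration):

```python
from functools import reduce

values = [3, 1, 4, 1, 5]
op = lambda a, b: a + b          # commutative and associative

# reducing everything in order...
sequential = reduce(op, values)

# ...must equal reducing per "partition" and then merging the partials
partition1, partition2 = values[:2], values[2:]
combined = op(reduce(op, partition1), reduce(op, partition2))
```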
93
Other Pair RDD Operations
• Pair RDDs can do other things besides reduce:
– countByKey – returns a map with the count of occurrences per key
– groupByKey – groups all values for each key in an RDD
– sortByKey – sorts in ascending/descending order by key
– join – returns an RDD containing all pairs with matching keys from two Pair RDDs
94
Pair RDD Operations Examples

> grpUsers = users.groupByKey()
> sortUsers = users.sortByKey(ascending=False)

Input: (0001, a1), (0001, b1), (0001, c1), (0002, a2), (0002, b2), (0003, c3)
After sortByKey(ascending=False): (0003, c3), (0002, b2), (0002, a2), (0001, c1), (0001, b1), (0001, a1)
95
Joining RDDs
• Using joins is a common programming pattern:
– Map separate datasets into key-value Pair RDDs
– Join by key (make sure the keys have the same data type and structure)
– Map the joined data into the desired format
– Save, display, or continue processing the new RDD
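The join-by-key step can be sketched in plain Python with hypothetical data (illustrative only; Spark's join matches every pair of values sharing a key across two Pair RDDs):

```python
users = {"u1": "Zohar", "u2": "Efrat"}                  # keyed dataset 1
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]  # keyed dataset 2

# inner join by key: each order is matched with the user sharing its key
joined = [(k, (users[k], item)) for k, item in orders if k in users]
```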
96
Other Pair RDD Operations (cont.)
• Other pair operations:
– keys – return an RDD of just the keys (no values)
– values – return an RDD of just the values
– lookup(key) – return the value for a specific key
– leftOuterJoin, rightOuterJoin – left and right outer joins
– mapValues, flatMapValues – execute function on just the values, keeping the key the same
• See PairRDDFunctions class for a full (long) list of functions
Working with Clusters
Spark Clusters and Resource Management
98
Spark Clusters
• Spark is designed to run on a cluster
– Spark includes a basic cluster manager called "Spark Standalone"
– Can also run on Hadoop and Mesos
• Jobs are broken down into tasks and sent to worker nodes
– Each worker node runs executors in standalone JVMs
• Spark cluster workers can work closely with HDFS
99
Spark Cluster Options
• Locally (No distributed processing)
• Locally with multiple Worker threads
• On an actual cluster, resources managed by:
– Spark Standalone
– Apache Hadoop YARN (Yet Another Resource Negotiator)
– Apache Mesos
100
Spark Cluster Terms
• Cluster is a group of computers working together
– i.e. Spark Standalone Cluster, Hadoop (YARN), Mesos
• Node is an individual computer in the cluster
– The master node manages the distributed work
– Worker nodes do the actual work
• Daemon is a program running on a node
– Each performs a different function in the cluster
101
Spark Standalone Cluster Daemons
• Spark Master (cluster manager)
– One per cluster
– Manages applications, distributes individual tasks to Spark Workers
• Spark Worker
– One per node
– Starts and monitors Executors for applications
– Spark Workers can run on Hadoop DataNodes – for reading data from HDFS efficiently
102
Spark Driver
• The Spark driver is the main program
• Runs in a Spark Shell or as a Spark Application
• The driver creates the Spark Context for the run
• Communicates with the Cluster Manager to distribute the work between workers
103
Driver Modes: Client vs. Cluster
• Driver runs outside the cluster by default
– Called "client" deploy mode
– Most common
– Required for interactive use
• We can run the driver from one of the worker nodes in a cluster
– Called “cluster” deploy mode
– Doesn’t require interaction with cluster’s nodes
104
Running a Cluster Application
109
Supported Cluster Resource Managers
• Spark Standalone (EC2 or private)
– Included with Spark
– Easy to install and run
– Limited configurability and scalability
– Useful for testing, development, or smaller systems
• Hadoop YARN
– Requires a Hadoop Cluster
– Common for production sites
– Allows sharing cluster resources with other applications (MapReduce, Hive, etc.)
• Apache Mesos
– Original platform for Spark
– Less common
110
Setting sc.master
• Using the --master parameter, we set the SparkContext.master parameter in Spark shells
• The Spark shell can select different cluster masters on the command line:
– URL – the URL of the cluster manager (Spark Standalone Master or Mesos master)
– local[*] – runs locally with as many threads as cores (default)
– local[n] – runs locally with n worker threads
– local – does not use distributed processing
– yarn – use YARN as the cluster manager
• Examples:
$ pyspark --master spark://sparkmasternode:7077
$ spark-shell --master yarn
111
UI Management
• The Spark standalone master provides a UI interface for monitoring and history
• The UI runs by default on port 18080
112
Spark Job Details UI
113
Spark Job Details Timeline
114
Stage Breakdown
Parallel Programming with Spark
116
Datasets in the Cluster
• Resilient Distributed Datasets
– Data is partitioned across worker nodes
• Partitioning is done by the Spark framework – no action is needed by the programmer
• We can control the number of partitions
117
Working With Partitioned Files
• Partitioning of a single file is based on the size of the file – the default is 2 partitions
• We can control the number of partitions (optional):
sc.textFile("myfile.txt", 4)
• The more partitions we have, the more parallel the program is
118
Working With Multiple Files
• When working with sc.textFile, we can pass a wildcard or a directory
– Each file will become at least one partition
– Operations can be done per file (JSON and XML parsing)
• We can automatically create a Pair RDD by using sc.wholeTextFiles("mydir")
– Useful for many small files
– Key = file name
– Value = file contents

> sc.textFile("mydir/*")
119
Running Operations on Partitions
• Most operations work on single elements
• Some operations can be run at the partition level:
– foreachPartition – call a function for each partition
– mapPartitions – create a new RDD by executing a function on each partition in the RDD (transformation)
– mapPartitionsWithIndex – same as mapPartitions but includes the partition index (transformation)
• Commonly used for initializations
• Functions in partition operators get iterators as arguments to iterate through the elements
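A plain-Python sketch of the per-partition pattern (illustrative only; in Spark, mapPartitions passes your function an iterator per partition, so per-partition setup runs once per partition instead of once per element):

```python
def map_partitions(fn, partitions):
    # fn receives an iterator per partition and returns an iterator
    return [list(fn(iter(part))) for part in partitions]

def add_header(it):
    # per-partition initialization: happens once per partition
    yield "header"
    for x in it:
        yield x

parts = [[1, 2], [3, 4]]       # two toy "partitions"
out = map_partitions(add_header, parts)
```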
120
HDFS and Data Locality
• By default, Spark partitions file-based RDDs by block; each block is loaded into a single partition
• An action triggers execution: tasks on executors load data from blocks into partitions
• When using HDFS, workers run near their respective blocks
• The collect operation copies the data from the workers back to the driver (no locality here)
121
Parallel Operations
• RDD operations run in parallel on partitions
• Some operations preserve partitioning:
– map, flatMap, filter
• Some operations require repartitioning:
– reduce, sort, group
• Repartitioning requires data to move between workers – hurting performance
– Try to reduce data movement by running better sequences up to the reshuffle stages
122
Execution Terminology
• Job – a set of tasks executed as a result of an action
• Stage – a set of tasks in a job that can be executed in parallel
• Task – an individual unit of work sent to one executor
123
From Job to Stages
• Spark calculates a Directed Acyclic Graph (DAG) of RDD dependencies
• Narrow operations:
– Only one child depends on the RDD
– No shuffle required between nodes
– Can be collected into a single stage
– e.g., map, filter, union
• Wide operations:
– Multiple children depend on the RDD
– Define a new stage
– e.g., reduceByKey, join, groupByKey
124
Controlling the Level of Parallelism
• Wide operations partition result RDDs
– More partitions = more parallel tasks
– The cluster will be under-utilized if there are too few partitions
• We can control the number of partitions:
– Setting a default property (spark.default.parallelism)
– Setting an optional parameter in the function call:

> users = sc.textFile("file") \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda v1, v2: v1 + v2, 10)
125
Spark Lineage
• Each transformation operation creates a new child RDD
• Spark keeps track of the parent RDD for each new RDD
• Child RDDs depend on their parents
• Action operations execute the parent transformations
• Each action re-executes the lineage transformations starting with the base RDD
126
Caching
• Since running all the transformations from the base RDD can be expensive, RDDs can be cached
• Caching an RDD means saving it to memory to reduce the dependency link length
• Caching is a suggestion to Spark
– If not enough memory is available, transformations will be re-executed when needed
– Cache will never spill to disk – it's in memory only
127
Caching and Fault-Tolerance• Resilient Distributed Datasets
– Resiliency is a product of tracking lineage

• RDDs can always be recomputed from their base if needed

• In case a worker fails, the task can be re-run on a different node, recalculating data from the base RDD using the same partition
128
Persistence Levels• Cache stores data in-memory only
• The persist method offers other options called Storage Levels
• Storage location – where to store the data– MEMORY_ONLY – same as cache
– MEMORY_AND_DISK – stores partition in memory, use disk if not enough memory
– MEMORY_ONLY_SER – stores partition as serialized java object
– DISK_ONLY – store partition on disk
• Replication – store partition on two cluster nodes– MEMORY_ONLY_2, MEMORY_AND_DISK_2, DISK_ONLY_2
129
Persistence Options• To stop persisting and remove from memory
and/or disk– rdd.unpersist()
• To change the persistence level we need to unpersist and re-persist the data at the new level
130
When and Where to Cache• When should we cache a dataset
– When dataset is likely to be re-used (machine learning, iterative algorithms, etc.)
– When calculation is long and we don’t want to lose steps in case of failure
• How to choose a persistence level– Memory only – whenever possible; use serialized objects to
reduce memory usage if possible– Disk – choose when recomputation is more expensive than
disk read (filtering large datasets, expensive functions)– Replication – choose when fast recovery is worth the
extra memory
131
Checkpointing• Maintaining RDD lineage provides resilience, but can
also cause problems when the lineage gets very long
• Recovery can be very expensive
• Potential stack overflow
• Solution: Checkpointing, saving the data to HDFS (reliable) or to local disk (local)
– HDFS provides fault-tolerant storage across nodes

– Lineage is not saved
– Must be checkpointed before any actions on the RDD
Spark Modules
Spark SQL and Spark Streaming
133
The Spark Stack• In addition to the core Spark engine, there are
some related projects to extend Spark functionality
134
Spark SQL• Spark SQL is a Spark module for structured data
processing• Spark SQL provides more information about the
structure of both the data and the computation being performed for more optimization
• Supports basic SQL and HiveQL• Spark SQL can also act as a distributed query engine
using its JDBC/ODBC or command-line interface• For more information:
http://spark.apache.org/docs/latest/sql-programming-guide.html
135
Dataframes, Datasets and RDDs• A DataFrame is a distributed collection of data
organized into named columns– It is conceptually equivalent to a table in a relational
database or a data frame in R/Python, but with richer optimizations under the hood
• A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine– A Dataset can be constructed from JVM objects and then
manipulated using functional transformations
136
Spark SQL Context• Spark SQL entry point is Spark SQL Context:
Python:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Scala:
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

R:
sqlContext <- sparkRSQL.init(sc)
137
Running a Query
Python:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")

Scala:
val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

R:
sqlContext <- sparkRSQL.init(sc)
df <- sql(sqlContext, "SELECT * FROM table")
138
Connecting Oracle and Spark SQL• Connecting Spark SQL and Oracle is easy when
using JDBC:
• More about it (and a demo): https://www.realdbamagic.com/spark-sql-and-oracle-database-integration/
scala> val employees = sqlContext.load("jdbc", Map(
         "url" -> "jdbc:oracle:thin:zohar/zohar@//localhost:1521/single",
         "dbtable" -> "hr.employees"))
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
employees: org.apache.spark.sql.DataFrame = [EMPLOYEE_ID: decimal(6,0),
  FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string,
  HIRE_DATE: timestamp, JOB_ID: string, SALARY: decimal(8,2),
  COMMISSION_PCT: decimal(2,2), MANAGER_ID: decimal(6,0), DEPARTMENT_ID: decimal(4,0)]
140
Agenda• What is stream processing?
• Principles
• Stream processing with Spark
• Demo
• How does it compare?
141
Spark Streaming• Spark Streaming is an extension of the core Spark
API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
142
What is Stream Data Processing?• Stream – a constant flow of data events
• Data (event) – any occurrence that happens at a clearly defined time and is recorded in a collection of fields
• Processing – the act of analyzing data
143
When Do You Use Stream Processing?• Wherever you have a continuous stream of data
• This data needs to be processed quickly so that the business can react
• Examples include trading, fraud detection, spam filtering, and many more
144
Data Delivery Methods• At-most-once – possibility for data loss.
• At-least-once – messages may be redelivered.
• Exactly-once – each message is only delivered once.
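As a rough illustration (plain Python, not any Spark API): exactly-once *effects* are often built on top of an at-least-once channel by deduplicating on a message ID, so redelivered messages are skipped rather than processed twice.

```python
def process_stream(messages):
    """Process (msg_id, payload) pairs from an at-least-once channel,
    applying each effect exactly once by tracking seen message IDs."""
    seen, results = set(), []
    for msg_id, payload in messages:
        if msg_id in seen:
            continue                     # redelivered duplicate: skip
        seen.add(msg_id)
        results.append(payload.upper())  # the (idempotent) "effect"
    return results

# Message 1 is redelivered by the channel, but is only processed once
deliveries = [(1, "a"), (2, "b"), (1, "a"), (3, "c")]
```

This is only one strategy; real systems may instead rely on idempotent writes or transactional sinks to get the same guarantee.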
145
Keep The Data Moving• Data events are processed in the stream (in
memory)
• Storage operations add unnecessary latency
• Data will be stored on disk at the end of the stream
146
Window Consideration• Windowing: grouping of events based on time
• Windowing can also be data-driven
• Out of order events make windowing tricky
• There are different types of windowing including fixed, sliding and count windows
147
Window Types (diagram): fixed windows split the input events into consecutive, non-overlapping time intervals, so each event belongs to exactly one window; sliding windows overlap, so the same event can fall into more than one window.
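The two window types can be sketched in plain Python (illustrative only, not a streaming API): a fixed window assigns each timestamped event to exactly one bucket, while sliding windows overlap.

```python
def fixed_windows(events, size):
    """events: list of (timestamp, value); each event lands in one window."""
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // size * size, []).append(value)
    return windows

def sliding_windows(events, size, slide):
    """Each window covers [start, start + size); events may repeat."""
    last_ts = max(ts for ts, _ in events)
    result, start = {}, 0
    while start <= last_ts:
        result[start] = [v for ts, v in events if start <= ts < start + size]
        start += slide
    return result

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
```

Note how event "b" (timestamp 1) appears in two sliding windows but in only one fixed window.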
Stream Processing With Spark
149
Spark Streaming • An extension of the core Spark API
• Fault tolerant
• Built in support for merging streaming data with historical data
• Supports Scala, Python and Java
150
Spark Streaming - Metrics• Exactly once delivery
• Provides stateful state management
• Groups events into micro batches
• Latency depends on the configuration of the DStream microbatch interval
151
The DStream• An abstraction provided by Spark
• Represents a continuous stream of data
• Internally treated as a sequence of RDDs
• Each RDD contains the data received during the last batch interval (the last X seconds)
152
How do we use it?• Create a StreamingContext
• Define the input sources
• Apply transformations and output operations to DStreams
• Issue streamingContext.start() to start receiving data
• Wait for the processing to be stopped using streamingContext.awaitTermination()
153
DStreams and Receivers• Every input DStream is associated with a receiver
object
• The receiver receives the data and stores it in memory for processing
• If you receive multiple input DStreams, multiple receivers will be created
• Remember to allocate Spark enough cores to process the data as well as to run the receivers
154
Scala Example

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words, pairs, and count
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
155
Python Example

from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)

    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
Spark Streaming Demo
157
Source Types• Spark Streaming provides two built in types of
streaming sources:
– Basic sources: such as file systems and socket connections
– Advanced sources: such as Kafka, Flume, Kinesis, etc.
158
Basic Sources - Files• The StreamingContext API provides several
methods for creating input DStreams from files
• File Streams: monitors a given directory and processes any files created in it– Files must have the same format

– Once moved into the directory the files cannot be changed
• There is an easier method for simple text files which doesn’t require a receiver
159
Advanced Sources• Requires interfacing with non-Spark libraries
• Functionality to create DStreams from advanced sources has moved to separate libraries
• This is done to prevent version conflict issues
• Libraries need to be explicitly linked when needed
• There is also the ability to create a custom source using a user defined receiver
160
Sliding Window Operations• Spark Streaming allows for windowed computations
• Used to apply transformations on a sliding window of data
• Every time the window slides, the source RDDs that fall within the window are combined to produce the RDDs of the windowed DStream
• A window operation needs these two parameters
– Window length: the duration of the window
– Sliding interval: the interval in which the window slides
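These two parameters can be sketched in plain Python (an illustrative simulation of micro-batches, not the Spark API): with a window length of 3 batches and a sliding interval of 1 batch, each result combines the last 3 micro-batches and is recomputed every batch.

```python
def windowed_counts(batches, window_length):
    """batches: a list of micro-batches (lists of words).
    Returns, for each batch, the word counts over the trailing window."""
    results = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_length + 1): i + 1]
        counts = {}
        for batch in window:
            for word in batch:
                counts[word] = counts.get(word, 0) + 1
        results.append(counts)
    return results

# Four 1-second micro-batches; window length = 3 batches, slide = 1 batch
batches = [["a"], ["a", "b"], ["b"], ["c"]]
```

In real Spark Streaming both parameters are durations and must be multiples of the DStream's batch interval.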
161
Checkpointing• A streaming application usually runs 24/7
• Therefore it has to survive numerous types of failures outside the application logic
• Spark streaming checkpoints data to a fault tolerant storage system to recover if and when any failure occurs
• Two types of data are checkpointed for that purpose
162
Checkpointed Data Types• Metadata: the information defining the stream
computation
– The configuration used to create the streaming application
– The DStream operations that define the streaming application
– Batches whose jobs are queued but not yet completed
• Data itself: the RDDs containing the data
– Necessary in transformations that combine data from multiple batches
163
How To Checkpoint• Set up a directory in a fault tolerant file system
• Use StreamingContext.checkpoint(dir) to enable data checkpointing
• The application will recover from driver failures (metadata checkpointing) if it does the following:– On the first run, create the StreamingContext, set up the DStreams,
then call start()

– On restart, recreate the StreamingContext from the checkpoint directory
164
Monitoring And Tuning Streaming Applications• Statistics about a running streaming application
are available through the Spark web UI
• There are two main metrics to monitor and tune:
– Processing Time: the time to process each batch of data (lower is better)
– Scheduling Delay: are my batches being processed as fast as they are arriving (they should be)
165
Reducing Batch Processing Time• Divide your input DStream into several DStreams
• Increase parallelism of data processing if you feel your resources are under-utilized
• Reduce serialization overhead by tuning the serialization format
166
Setting The Right Batch Interval• Batch processing time should be less than the batch
interval time
• First start with a conservative interval (a few seconds)
• Make sure batch processing time is below batch interval
• If so you may increase data rate or lower batch interval
• Consistently monitor the batch processing time
167
Other Streaming Frameworks• Apache Storm
• Samza
• Apache Flink
168
How Does It Compare?
• Delivery semantics – Storm: at least once (exactly-once with Trident); Samza: at least once; Flink: exactly once; Spark: exactly once
• State management – Storm: stateless (roll your own or use Trident); Samza: stateful (embedded key-value store); Flink: stateful (periodically writes state without interrupting); Spark: stateful (writes state to storage)
• Latency – Storm: sub-second; Samza: sub-second; Flink: sub-second; Spark: seconds (depending on batch size)
• Language support – Storm: any JVM language, Ruby, Python, JavaScript, Perl; Samza: Scala, Java; Flink: Scala, Java; Spark: Scala, Java, Python
169
For More Info• For more information:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Applications
171
Spark Shell vs. Spark Applications• The Spark Shell allows interactive exploration and
manipulation of data (REPL – read, evaluate, print, loop)
• Spark applications run as independent programs
– Python, Scala, R with SparkR package, or Java
– Common uses: ETL processing, Streaming, and more
172
SparkContext• Every Spark program needs a SparkContext
– The interactive shell creates SC for us
– When creating our own application, we need to create the context ourselves
– A common convention is to name the context sc
173
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

    // get threshold
    val threshold = args(1).toInt

    // read in text file and split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    System.out.println(charCounts.collect().mkString(", "))
  }
}
174
import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Spark Count")
    sc = SparkContext(conf=conf)

    # get threshold
    threshold = int(sys.argv[2])

    # read in text file and split each document into words
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))

    # count the occurrence of each word
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    # filter out words with fewer than threshold occurrences
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)

    # count characters
    charCounts = filtered.flatMap(lambda pair: pair[0]).map(lambda c: (c, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    list = charCounts.collect()
    print repr(list)[1:-1]
175
library(SparkR)

args <- commandArgs(trailing = TRUE)

if (length(args) != 2) {
  print("Usage: wordcount <master> <file>")
  q("no")
}

# Initialize Spark context
sc <- sparkR.init(args[[1]], "RwordCount")
lines <- textFile(sc, args[[2]])

words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words, function(word) { list(word, 1L) })

counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}
176
Building a Spark Application: Scala or Java• Scala or Java applications must be compiled and
assembled into JAR files.
– The JAR file will be passed (uploaded) to worker nodes
• Most developers use Apache Maven or SBT to build
– See http://spark.apache.org/docs/latest/building-spark.html for more details about building an application
• Build details will differ depending on Hadoop version, deployment platform, and other factors
177
Running a Spark Application• The easiest way to run a Spark application is by using spark-
submit– Python
– Scala and Java
• Spark-submit options:– --master (local, local[*], yarn, etc.)– --deploy-mode (client or cluster)– --name – application name for UI– --conf – configuration changes from default or settings– … more …

$ spark-submit WordCount.py fileURL

$ spark-submit --class WordCount myJarFile.jar fileURL
178
Runtime Configuration Options• Spark-submit can accept a properties file with
settings, instead of parameters– Tab or space-delimited list of properties
– Load with spark-submit --properties-file filename
– Example:
• Site defaults properties file– $SPARK_HOME/conf/spark-defaults.conf

spark.master     spark://masternode:7077
spark.local.dir  /tmp/spark
spark.ui.port    28080
179
Setting Configuration at Runtime• Spark allow changing the configuration when
creating the SparkContext
• Configure the parameters with the SparkConf object
• Some functions– setAppName(name)
– setMaster(master)
– set(property-name, value)
• Set functions return a SparkConf object to support chaining
180
Logging
• Spark uses Apache log4j for logging
• You can configure it by adding a log4j.properties file in the $SPARK_HOME/conf directory
• Log file locations depends on the cluster management platform– Spark daemons: /var/log/spark
– Individual tasks: $SPARK_HOME/work on each worker node
– YARN has a log aggregator for log files from workers
Spark Performance and Troubleshooting
Common use cases, problems and solutions
182
Broadcast Variables• Broadcast variables are set by the driver and retrieved
by workers
• They are read-only once set
• The first read of a Broadcast variable retrieves its value and caches it on the node
183
Why Use Broadcast Variables• Use to minimize transfer of data over the network,
which is usually the biggest bottleneck
• Spark Broadcast variables are distributed to worker nodes using a very efficient peer-to-peer algorithm

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res5: Array[Int] = Array(1, 2, 3)
184
Accumulators• Accumulators are shared variables
– Worker nodes can add to the value– Only the driver application can access the value
• Default accumulator is of type int or double, but we can create custom types when needed (extend class AccumulatorParam)
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
16/06/18 14:34:40 INFO DAGScheduler: Job 5 finished: foreach at <console>:30, took 0.111459 s

scala> accum.value
res6: Int = 10
185
Accumulators (cont.)
• Accumulators will only increment once per task

– If a task must be re-run due to failure, Spark will only count the attempt that succeeded
• Only the driver can access the value
– Code will throw an exception if we use .value on a worker
• Supports the increment (+=) operator
186
Common Performance Issues: Serialization• Serialization affects
– Network bandwidth– Memory (save memory by serializing to disk)
• The default serialization method in Spark is basic Java serialization– Simple, but slow
• Use Kryo Serialization for Scala and Java– Set spark.serializer = spark.KryoSerializer– Create KryoRegistrar class and set the class in
spark.kryo.registrator=MyRegistrator– Register classes with Kryo (kryo.register(classOf[MyClass]))
187
Small Partitions
• Problem: filter() can result in partitions with small amounts of data
– Results in many small tasks
• Solution: repartition(n)
– This is the same as coalesce(n, shuffle=true)

• This will build a newly-partitioned RDD, reducing the number of tasks
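The effect can be sketched in plain Python (illustrative only, not the Spark API): merging many near-empty partitions into a few keeps all the data while eliminating tiny tasks.

```python
def coalesce(partitions, n):
    """Merge the existing partitions into n groups, round-robin."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

# 6 partitions, mostly near-empty after a selective filter
filtered = [[1], [], [2], [], [], [3]]
compact = coalesce(filtered, 2)   # 2 partitions, 2 tasks instead of 6
```

Real coalesce also tries to preserve data locality; this sketch only shows the task-count effect.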
188
Passing Too Much Data in Functions• Problem: Passing large amounts of data to parallel
functions results in poor performance
• Solution:
– If the data is relatively small, use broadcast variable
– If the data is very large, parallelize into RDDs
189
Where to Look for Performance Issues• Scheduling and launching tasks
– Are you passing too much data to tasks?
– Use broadcast variable, or RDD
• Task execution– Are there tasks with a very high per-record overhead?
• mydata.map(dbLookup)
• Each lookup call opens a connection to the DB, reads, and closes
– Try mapPartitions
– Are a few tasks taking much more time than others?• Repartition, partition on a different key, or write custom partitioner
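The per-record-overhead point can be sketched in plain Python (illustrative only, not the Spark API): a map-style lookup pays the connection cost once per record, while a mapPartitions-style lookup amortizes it over the whole partition.

```python
def db_lookup_per_record(partitions):
    """map-style: pretend to open/close a DB connection for every record."""
    opens, out = 0, []
    for part in partitions:
        for record in part:
            opens += 1                 # one connection per record
            out.append(record * 10)    # the "lookup" (hypothetical)
    return out, opens

def db_lookup_per_partition(partitions):
    """mapPartitions-style: one connection serves the whole partition."""
    opens, out = 0, []
    for part in partitions:
        opens += 1                     # one connection per partition
        out.extend(record * 10 for record in part)
    return out, opens

partitions = [[1, 2, 3], [4, 5]]
```

Same results, far fewer connections – which is exactly the win mapPartitions offers for expensive per-record setup.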
190
Where to Look for Performance Issues (cont)• Shuffling
– Make sure you have enough memory for buffer cache
– Make sure spark.local.dir is local disk, ideally dedicated or SSD
• Collecting data to the Driver– Beware of returning large amounts of data to the
driver (using collect())
– Process data on the worker, not the driver
– Save large results to HDFS
191
Conclusion• We talked about the Big Data problem and
Hadoop
• We learned how to use Spark Core
• We got an overview about Spark Cluster and Parallel programming
• We reviewed Spark modules: Spark SQL, Spark Streaming, Machine learning and Graphs
• Spark is one of the leading technologies in today's world
192
What Did We Not Talk About?• Spark unified APIs and data frames
• Spark MLlib and machine learning in general
• Spark graph processing
Q&AAny Questions? Now will be the time!
Zohar Elkayamtwitter: @[email protected]
www.ilDBA.co.ilwww.realdbamagic.com
195