Apache Spark Architecture


Page 1: Spark architecture

Apache Spark Architecture

Page 2: Spark architecture

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consults on Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3: Spark architecture

RDD - Resilient Distributed Dataset

● A big collection of data with the following properties (see the sketch below):

○ Immutable
○ Lazy evaluated
○ Cacheable
○ Type inferred
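
A minimal sketch of these four properties using Spark's Scala API, assuming a SparkContext named sc is already in scope (as it is in spark-shell):

val numbers = sc.parallelize(Seq(1, 2, 4, 5)) // RDD[Int]: type inferred

// Immutable: map never touches numbers, it describes a new RDD
val incremented = numbers.map(_ + 1)

// Lazy: nothing has run yet; cacheable: keep partitions after the first run
incremented.cache()

// An action finally triggers the computation
val result = incremented.collect() // Array(2, 3, 5, 6)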

Page 4: Spark architecture

Immutability

● Immutability means once created, it never changes

● Big data is by nature immutable

● Immutability helps with:

○ Parallelism
○ Caching

Page 5: Spark architecture

Immutability in action

● const int a = 0; // immutable
● int b = 0; // mutable

● Updates:

○ b++; // in place
○ c = a + 1; // new value

● Immutability is about the value, not the reference
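
The snippet above is C-style pseudocode; the same distinction in Scala, with names chosen only for illustration:

val a = 0       // immutable: reassigning a will not compile
var b = 0       // mutable
b += 1          // updated in place
val c = a + 1   // derives a new value; a itself is left intact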

Page 6: Spark architecture

Immutability in collections

Mutable:

var collection = [1,2,4,5]
for (i = 0; i < collection.length; i++) {
  collection[i] += 1;
}

Uses a loop for updating; collection is updated in place.

Immutable:

val collection = [1,2,4,5]
val newCollection = collection.map(value => value + 1)

Uses a transformation for change; creates a new copy of the collection and leaves the original
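
A runnable Scala version of the two columns above, with ArrayBuffer standing in for the pseudocode's mutable array:

import scala.collection.mutable.ArrayBuffer

// Mutable: update in place with a loop
val mutableCollection = ArrayBuffer(1, 2, 4, 5)
for (i <- mutableCollection.indices) {
  mutableCollection(i) += 1
} // mutableCollection is now ArrayBuffer(2, 3, 5, 6)

// Immutable: transform into a new collection; the original is untouched
val collection = Array(1, 2, 4, 5)
val newCollection = collection.map(value => value + 1) // Array(2, 3, 5, 6)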

Page 7: Spark architecture

Multiple transformations

Mutable:

var collection = [1,2,4,5]
for (i = 0; i < collection.length; i++) {
  collection[i] += 1;
}
for (i = 0; i < collection.length; i++) {
  collection[i] += 2;
}

Immutable:

val collection = [1,2,4,5]
val newCollection = collection.map(value => value + 1)
val nextNewCollection = newCollection.map(value => value + 2)

Creates 3 copies of the same collection.

Page 8: Spark architecture

Challenges of immutability

● Immutability is great for parallelism but not good for space

● Doing multiple transformations results in:

○ Multiple copies of the data
○ Multiple passes over the data

● In big data, multiple copies and multiple passes lead to poor performance

Page 9: Spark architecture

Let’s get lazy

● Laziness means not computing a transformation until it’s needed

● Laziness defers evaluation

● Laziness allows separating execution from evaluation

Page 10: Spark architecture

Laziness in action

● val c1 = collection.map(value => value + 1) // nothing is computed yet
● val c2 = c1.map(value => value + 2) // still nothing is computed
● print c2 // now the work runs, transformed into the single pass on the next slide

Page 11: Spark architecture

Transformation

● val c2 = collection.map { value =>
    var result = value + 1
    result = result + 2
    result
  }

● Multiple transformations are combined into one
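
Plain Scala collections evaluate map eagerly, but a view reproduces this fusion at small scale; the sketch below is an analogy for how Spark pipelines narrow transformations, not Spark's actual code:

val collection = Array(1, 2, 4, 5)

// .view defers evaluation: neither map runs here
val c1 = collection.view.map(value => value + 1)
val c2 = c1.map(value => value + 2)

// Forcing the view makes one pass, applying both functions per element
println(c2.toList) // List(4, 5, 7, 8)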

Page 12: Spark architecture

Laziness and Immutability

● You can be lazy only if the underlying data is immutable

● You cannot combine transformations if a transformation has side effects

● So combining laziness and immutability gives better performance and distributed processing

Page 13: Spark architecture

Challenges of Laziness

● Laziness poses challenges in terms of data types

● If laziness defers execution, determining the type of a variable becomes challenging

● If we can’t determine the right type, semantic issues creep in

● Running big data programs only to hit semantic errors is no fun

Page 14: Spark architecture

Type inference

● Type inference is the part of the compiler that determines a type from its value

● As all the transformations are side-effect free, we can determine the type from the operation

● Every transformation has a specific return type

● Type inference relieves you from thinking about the representation for many transforms

Page 15: Spark architecture

Type inference in action

● val collection = [1,2,4,5] // explicit type: Array
● val c1 = collection.map(value => value + 1) // still Array
● val c2 = c1.map(value => value + 2) // still Array
● val c3 = c2.count() // inferred as Int
● val c4 = c3.map(value => value + 3) // error: you cannot map over an integer
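
A runnable Scala sketch of the same steps. It uses length where the slide says count(), because Array.count in Scala expects a predicate (on an RDD, count() exists and returns a Long):

val collection = Array(1, 2, 4, 5)          // inferred as Array[Int]
val c1 = collection.map(value => value + 1) // still Array[Int]
val c2 = c1.map(value => value + 2)         // still Array[Int]
val c3 = c2.length                          // inferred as Int
// val c4 = c3.map(value => value + 3)      // compile error: Int has no map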

Page 16: Spark architecture

Caching

● Immutable data allows you to cache data for a long time

● Lazy transformations allow recreating data on failure

● Transformations can also be saved

● Caching data improves execution engine performance
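
A minimal caching sketch against the RDD API; the SparkContext sc and the HDFS path are assumed here for illustration:

val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical path
val parsed = lines.map(_.split(",")).filter(_.length > 1)

parsed.cache()  // mark for caching; materialized on the first action

parsed.count()  // computes the RDD and fills the cache
parsed.count()  // served from cache; lost partitions are recomputed lazily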

Page 17: Spark architecture

Spark running architecture

[Diagram: the Spark driver program and Spark scheduler dispatch work to worker nodes running transformations; the cluster runs on Mesos / YARN and reads from data sources such as HDFS and NoSQL stores]

Page 18: Spark architecture

Components

● Spark driver (context)

● Spark DAG scheduler

● Cluster management systems

○ YARN
○ Apache Mesos

● Data sources

○ In memory
○ HDFS
○ NoSQL
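
A sketch tying these components together: the driver creates the SparkContext, the master URL selects the cluster manager, and each action hands a job to the DAG scheduler. The app name and master URL are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object SparkArchitectureDemo {
  def main(args: Array[String]): Unit = {
    // The driver holds the SparkContext; the master URL picks the cluster
    // manager: "yarn", "mesos://host:5050", or "local[*]" for local runs
    val conf = new SparkConf()
      .setAppName("spark-architecture-demo")
      .setMaster("local[*]")

    val sc = new SparkContext(conf)

    // Data sources: in-memory collections, HDFS files, or NoSQL connectors
    val inMemory = sc.parallelize(1 to 100)
    println(inMemory.sum()) // the DAG scheduler splits this job into stages

    sc.stop()
  }
}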