Apache Spark Overview – Ten Tools for Ten Big Data Areas
Series 03 Big Data Analytics
www.sparkera.ca
Ten Tools for Ten Big Data Areas – Overview
10 Tools, 10 Areas: Data Warehouse, Data Platform, Data Bus, Programming, Data Analytics, Search and Index, Visualization, Data Integration, Streaming, Database
First ETL fully on YARN
Data storing platform / Data computing platform
SQL & metadata
Visualize with just a few clicks
Powerful as Java, simple as Python
Real-time streaming made easier
Your Google
Lightning-fast cluster computing
Real-time distributed data store
High-throughput distributed messaging
Agenda
1 About Apache Spark History
2 About Apache Spark Stack and Architecture
3 Resilient Distributed Dataset – RDD
4 RDD Transformations and Actions
5 Spark SQL and DataFrame
A Little About DI – Data Integration
• DI involves combining data residing in different sources and providing users with a unified view of that data.
• The DI process is also called Enterprise Information Integration (EII).
• DI usually means ETL – data extract, transform, and load.
• About 80% of the effort in enterprise data projects is spent on DI work.
• Data cleansing, auditing, and master data management are usually considered part of DI.
Apache Spark History
• Started in 2009 as a research project in the UC Berkeley RAD Lab, which became the AMPLab.
• Spark researchers found that Hadoop MapReduce was inefficient for iterative and interactive computing.
• Spark was designed from the beginning to be fast for interactive and iterative workloads, with support for in-memory storage and fault tolerance.
• Apart from UC Berkeley, Databricks, Yahoo!, and Intel are major contributors.
• Spark was open sourced in March 2010 and became an Apache Foundation project in June 2013. It is one of the most active projects in Apache.
Apache Spark Stack
Spark Core
Cluster managers: Standalone Scheduler, YARN, other
API libraries: Spark SQL, Spark Streaming, MLlib, GraphX
Full stack and great ecosystem. One stack to rule them all.
Apache Spark Ecosystem
Spark Summit Europe 2015
Apache Spark Advantages
• Fast by leveraging memory
• General-purpose framework with a rich ecosystem
• Good and wide API support – native integration with Java, Python, and Scala
• Spark handles batch, interactive, and real-time workloads within a single framework
• Leverages the Resilient Distributed Dataset (RDD) to process data in memory
  - Resilient – data can be recreated if it is lost from memory
  - Distributed – stored in memory across the cluster
  - Dataset – can be created from a file or programmatically
Apache Spark Hall of Fame
Largest cluster: more than 8,000 nodes
Largest single-day intake: 1 PB+/day
Longest-running job: 1 week on 1 PB+ of data
Largest shuffle job: 1 PB sorted in 4 hours
How to Work with Spark
• Scala IDE
  - Add the Hadoop dependency or use mvn
  - Compile to a jar; build or download Spark (http://spark.apache.org/docs/latest/building-spark.html)
  - Run with spark-submit (a minimal application sketch follows below)
• spark-shell
  - Interactive command-line application
• spark-sql
  - SQL command line where Spark uses the Hive metadata
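A minimal, hypothetical application that could be packaged into a jar and run with spark-submit; this is only a sketch assuming the Spark 1.x SparkContext API, and the object name LineCount, the jar name, and the input path are placeholders, not from the original deck:

import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    // The master is normally supplied by spark-submit (--master yarn, local[*], ...)
    val conf = new SparkConf().setAppName("LineCount")
    val sc = new SparkContext(conf)
    val lines = sc.textFile(args(0))           // input path passed on the command line
    println("line count = " + lines.count())
    sc.stop()
  }
}

Submitted, for example, as: spark-submit --class LineCount --master yarn linecount.jar hdfs:///tmp/input.txt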
Demo – Word Count
MR code vs. Hive code
How about Spark code?
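The original slide presumably showed the Spark version as an image; as a hedged sketch, the classic word count typed into the spark-shell looks roughly like this (the input path is an assumption):

val text = sc.textFile("hdfs:///tmp/shakespeare.txt")       // input path is an assumption
val words = text.flatMap(_.split("\\s+"))                   // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey(_ + _) // sum the counts per word
counts.take(10).foreach(println)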
Spark Architecture
[Architecture diagram] A ZooKeeper cluster coordinates the cluster manager master and its standby. The driver application (SparkContext) requests resources from the cluster manager, which launches executors on the worker nodes; each executor holds a cache and runs tasks.
One worker node has one executor by default; this can be changed with SPARK_WORKER_INSTANCES.
Each worker will try to use all cores (SPARK_WORKER_CORES).
Spark Core Components
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node; it runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
Example of Partial Word Count
• Each core by default executes one task at a time (spark.task.cpus)
• One task per partition, one partition per input split
• The number of stages comes from shuffle/wide transformations (see the shell sketch below)
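These rules can be checked from the spark-shell; a small sketch, where the partition count and the input path are illustrative assumptions:

val words = sc.textFile("hdfs:///tmp/shakespeare.txt", 4)   // request 4 input partitions (path is an assumption)
words.partitions.size                                       // one task per partition in each stage
val counts = words.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
// reduceByKey is a wide transformation: its shuffle starts a second stage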
Two Main Spark Abstractions
RDD – Resilient Distributed Dataset
DAG – Directed Acyclic Graph
RDD Concept
Simple explanation
• An RDD is a collection of data items split into partitions and stored in memory on the worker nodes of the cluster, on which operations can be performed.
Complex explanation
• An RDD is an interface for data transformation.
• An RDD refers to data stored either in a persistent store (HDFS, Cassandra, HBase, etc.), in cache (memory, memory+disk, disk only, etc.), or in another RDD.
• Partitions are recomputed on failure or cache eviction.
Spark Program Flow by RDD
Create input RDD
• Spark context
• Create from external storage
• Create from a SEQ
Transform RDD
• Using map, flatMap, distinct, filter
• Wide/narrow
Persist intermediate RDD
• cache
• persist
Action on RDD
• Produce a result

Example (a fuller flow sketch follows below):
myRDD = sc.parallelize([1,2,3,4])
myRDD = sc.textFile("hdfs://tmp/shakepeare.txt")
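A fuller sketch of the four steps in Scala, assuming the spark-shell (so sc already exists); the data is made up:

val nums = sc.parallelize(1 to 1000)       // 1. create the input RDD from a Scala sequence
val evens = nums.filter(_ % 2 == 0)        // 2. narrow transformation
val squares = evens.map(n => n * n)        // 2. another narrow transformation
squares.persist()                          // 3. persist the intermediate RDD
squares.count()                            // 4. action: triggers computation and fills the cache
squares.take(5)                            // a second action reuses the cached data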
RDD Partitions
• A partition is one of the chunks that an RDD is split into and that is sent to a node.
• The more partitions we have, the more parallelism we get.
• Each partition is a candidate to be spread out to different worker nodes (see the sketch below).
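A quick way to inspect and change the partition count, as a sketch in the spark-shell (the numbers are arbitrary):

val r = sc.parallelize(1 to 100, 4)   // explicitly request 4 partitions
r.partitions.size                     // 4
val r8 = r.repartition(8)             // wide: shuffles the data into 8 partitions
val r2 = r.coalesce(2)                // narrow: merges partitions without a full shuffle
r8.partitions.size
r2.partitions.size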
RDD Operations
• The RDD is the only data manipulation tool in Apache Spark.
• Operations are either lazy or non-lazy.
• There are two types of RDD operations:
  - Transformations – lazy – return a new RDD; results are materialized only when an action (e.g. collect, returning Array[T]) is called
  - Actions – non-lazy – return a value to the driver
• Transformations are further divided into narrow and wide transformations (a laziness sketch follows below).
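Laziness can be seen directly in the shell; a sketch where the log path is an assumption:

val logs = sc.textFile("hdfs:///tmp/app.log")     // path is an assumption
val errors = logs.filter(_.contains("ERROR"))     // transformation: only records lineage, nothing runs yet
errors.count()                                    // action: the file is actually read here
errors.collect()                                  // action: returns Array[String] to the driver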
RDD Transformations - map
• Applies a transformation function on each item of the RDD and returns the result as a new RDD.
• A narrow transformation
f: T -> U

val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val b = a.map(x => (x, x.length))
b.collect()
b.partitions.size
RDD Transformations - flatMap
• Similar to map, but each input item can be mapped to 0 or more output items.
• A narrow transformation
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val c = a.flatMap(x => List(x, x))
c.collect
val d = a.map(x => List(x, x)).collect
RDD Transformations - filter
• Creates a new RDD by passing in a function used to filter the results. It is similar to the WHERE clause in SQL.
• A narrow transformation
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val e = a.filter(x => x.length > 3)
e.collect
RDD Transformations - sample
• Return a random sample subset RDD of the input RDD
• A narrow transformation
val sam = sc.parallelize(1 to 10, 3)
sam.sample(false, 0.1, 0).collect
sam.sample(true, 0.3, 1).collect

def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T]
RDD Transformations - distinct
• Returns a new RDD containing the distinct elements of the source RDD, optionally with a specified number of partitions
• A narrow/wide transformation
val dis = sc.parallelize(List(1,2,3,4,5,6,7,8,9,9,2,6))
dis.distinct.collect
dis.distinct(2).partitions.length
RDD Transformations - reduceByKey
• Operates on (K,V) pairs; the reduce function must be of type (V,V) => V
• A narrow/wide transformation
F(A, B) = A + B

val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x => (x.length, x))
f.reduceByKey((a, b) => a + b).collect
f.reduceByKey(_ + _).collect
f.groupByKey.collect

Compared with groupByKey, reduceByKey is more efficient because it runs a combiner in the map phase.
RDD Transformations - sortByKey
• Sorts (K,V) pairs by K. Ascending by default; pass false for descending.
• A narrow/wide transformation

val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x => (x.length, x))
f.sortByKey().collect
f.sortByKey(false).collect
RDD Transformations - union
• Returns the union of two RDDs as a new RDD
• A narrow transformation
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.union(u2).collect
(u1 ++ u2).collect
RDD Transformations - intersection
• Returns a new RDD containing only common elements from source RDD and argument.
• A narrow/wide transformation
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.intersection(u2).collect
RDD Transformations - join
• When invoked on (K,V) and (K,W), this operation creates a new RDD of (K, (V,W)). It must be applied to key-value pairs.
• A wide transformation
val j1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
val j2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
j1.join(j2).collect
j1.leftOuterJoin(j2).collect
j1.rightOuterJoin(j2).collect
RDD Actions - count
• Get the number of data elements in the RDD
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 10)
u1.count
u2.count
RDD Actions - take
• Fetches the specified number of data elements from the RDD
val t1 = sc.parallelize(3 to 10)
t1.take(3)
RDD Actions - foreach
• Executes a function for each data element in the RDD
val c = sc.parallelize(List("cat", "dog", "tiger", "lion"))
c.foreach(x => println(x + "s are fun"))
RDD Actions - saveAsTextFile
• Writes the content of the RDD as a text file or a set of text files to the local file system or HDFS. The default scheme is hdfs://.
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("file:/Users/will/Downloads/mydata")
RDD Persistence – cache/persist
• Caches data in memory/disk. It is a lazy operation.
• cache() = persist(MEMORY_ONLY); persist() can take other levels, e.g. persist(MEMORY_AND_DISK)

val u1 = sc.parallelize(1 to 1000).filter(_ % 2 == 0)
u1.cache()
u1.count
Storage Level for persist
Storage Level – Purpose
MEMORY_ONLY (default level): Keeps the RDD in available cluster memory as deserialized Java objects. Some partitions may not be cached if there is not enough cluster memory; those partitions are recalculated on the fly as needed.
MEMORY_AND_DISK: Stores the RDD as deserialized Java objects. If the RDD does not fit in cluster memory, the remaining partitions are stored on disk and read as needed.
MEMORY_ONLY_SER: Stores the RDD as serialized Java objects (one byte array per partition). More CPU intensive but more space efficient. Some partitions may not be cached; those are recalculated on the fly as needed.
MEMORY_AND_DISK_SER: Same as above, except that disk is used when memory is not sufficient.
DISK_ONLY: Stores the RDD only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the corresponding levels, but each partition is replicated on two cluster nodes.
By default, Spark uses a Least Recently Used (LRU) policy to evict old, unused RDDs and free memory. You can also manually remove an unused RDD from memory with unpersist() (see the sketch below).
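A sketch of choosing an explicit storage level and releasing memory manually (the data size is arbitrary):

import org.apache.spark.storage.StorageLevel

val big = sc.parallelize(1 to 1000000).map(_ * 2)
big.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk if memory is short
big.count()                                 // first action materializes the cache
big.count()                                 // served from the cache
big.unpersist()                             // evict manually instead of waiting for LRU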
Spark Variable - Broadcast
• Like Hadoop's distributed cache, broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks – for example, to give every node a copy of a large input dataset efficiently (see the lookup sketch below)
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

val fn = sc.textFile("hdfs://filename").map(_.split("\\|"))   // "|" must be escaped, since split takes a regex
val fn_bc = sc.broadcast(fn.collect())                        // RDDs cannot be broadcast directly; collect to the driver first
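A typical use is shipping a small lookup table once per executor instead of once per task; a sketch with made-up data:

val countryNames = Map("ca" -> "Canada", "us" -> "United States")   // small driver-side table
val bcNames = sc.broadcast(countryNames)                            // shipped once per executor

val codes = sc.parallelize(Seq("ca", "us", "ca"))
val named = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
named.collect()                                                     // Array(Canada, United States, Canada)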
Spark Variable - Accumulators
• Accumulators are variables that can only be "added" to through an associative operation; they are used to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types
• Only the driver program can read an accumulator’s value, not the tasks
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Two Main Spark Abstractions
RDD – Resilient Distributed Dataset
DAG – Directed Acyclic Graph
DAG
• Directed Acyclic Graph – a sequence of computations performed on data (see the lineage sketch below)
  - Node: an RDD partition
  - Edge: a transformation on top of the data
  - Acyclic: the graph cannot return to an older partition
  - Directed: a transformation is an action that transitions a data partition's state (from A to B)
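The DAG of an RDD can be inspected from its lineage string; a sketch in the spark-shell, where the input path is an assumption:

val words = sc.textFile("hdfs:///tmp/shakespeare.txt")
val counts = words.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.toDebugString   // prints the lineage; the shuffle introduced by reduceByKey marks a stage boundary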
How Spark Works by DAG
Apache Spark SQL Stack
Spark applications (Python/Scala/Java/MLlib/R) use Spark SQL directly; BI tools (Tableau/Qlik/ZoomData/…) connect over JDBC
Spark SQL (SchemaRDD/DataFrame)
Data sources: Hive, JSON, CSV, Parquet, JDBC, HBase, other
Spark SQL History
Stopped in mid-2014
Ready at the beginning of 2015
Released in April 2014
Spark SQL with Hive Meta
Supports almost all Hive queries via HiveContext, which is a superset of SQLContext.
Spark SQL uses Hive's metastore, but its own version of the Thrift server (see the sketch below).
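A sketch of querying a Hive table through HiveContext, assuming a Spark 1.x build with Hive support and the sample_07 table used later in this deck (its columns are an assumption):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // superset of SQLContext; reads Hive's metastore
val top = hiveContext.sql("SELECT code, salary FROM sample_07 ORDER BY salary DESC LIMIT 10")
top.show()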
Spark SQL with BI Tools
Leverage existing BI Tools
Spark SQL with Other Integrations
Picture from databricks.com
Spark SQL Supported Keywords
Picture from databricks.com
Supports SQL-92
Spark SQL – How It Works
Catalyst: frontend and backend
Picture from databricks.com
DataFrame (DF) vs. RDD
• Schema on read
• Works with structured and semi-structured data
• Simplifies working with structured data
• Keeps track of schema information
• Reads/writes structured data such as JSON, Hive tables, Parquet, etc.
• SQL inside your Spark app
• Best performance and a more powerful operations API
• Accessed through SQLContext
• Integrated with ML
DF – Performance
Picture from databricks.com
DF Example – Direct from files
val df = sqlContext.read.json("/demo/data/data_json/customer.json")
df.printSchema()
df.show()
df.select("name").show
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
SqlContext directly deals with DF
DF Example – Direct from Hive
val a = sqlContext.table("sample_07")
a.printSchema()
a.show(1)
SqlContext directly deals with DF
DF Example – Indirect from Other RDD
val rdd = sc.textFile("/demo/data/data_cust/data1.csv").flatMap(_.split(",")).map(x=>(x,1)).reduceByKey(_+_)
// auto schema
val df1 = sqlContext.createDataFrame(rdd)
val df2 = rdd.toDF
// specify schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val schemaString = "word cnt"
val schema = StructType(StructField("word", StringType, true) :: StructField("cnt", IntegerType, false) :: Nil)
val rowRDD = rdd.map(p => Row(p._1, p._2))
val dfcust = sqlContext.createDataFrame(rowRDD, schema)
SqlContext directly deals with DF
www.sparkera.ca
BIG DATA is not only about the data, but about understanding the data and how people actively use it to improve their lives.