DESCRIPTION
Shared about Spark at the invitation of OCF and OSSF. If you have any interest in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF), please check http://ocf.tw/ or http://www.openfoundry.org/. Thanks also to CLBC for the venue. If you would like to work in a great environment, feel free to contact CLBC: http://clbc.tw/
Introduction to Spark
Wisely Chen (aka thegiive)
Sr. Engineer at Yahoo
Agenda
• What is Spark? (Easy)
• Spark Concept (Middle)
• Break: 10 min
• Spark EcoSystem (Easy)
• Spark Future (Middle)
• Q&A
Who am I?
• Wisely Chen ([email protected])
• Sr. Engineer on the Yahoo! [Taiwan] data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Spark Summit 2014 San Francisco
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
Recommendation
Forecast
HADOOP
Opinion from Cloudera
• The leading candidate for "successor to MapReduce" today is Apache Spark
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. !
• From http://0rz.tw/y3OfM
What is Spark
• From UC Berkeley AMP Lab
• Most active big data open source project since Hadoop
Community
Where is Spark?
HDFS
YARN
MapReduce
Hadoop 2.0
Storm HBase Others
HDFS
YARN
MapReduce
Hadoop Architecture
Hive
Storage
Resource Management
Computing Engine
SQL
HDFS
YARN
MapReduce
Hadoop vs Spark
Spark
Hive Shark/SparkSQL
Spark vs Hadoop
• Spark runs on YARN, Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
More than MapReduce
HDFS
Spark Core : MapReduce
Shark: Hive | GraphX: Pregel | MLlib: Mahout | Streaming: Storm
Resource Management System(Yarn, Mesos)
Why Spark?
In all martial arts under heaven, nothing is unbreakable, except speed.
3x to 25x faster than the MapReduce framework
From Matei’s paper: http://0rz.tw/VVqgP
Running time (s):
• Logistic regression: MapReduce 76 vs Spark 3
• KMeans: MapReduce 106 vs Spark 33
• PageRank: MapReduce 171 vs Spark 23
What is Spark
• Apache Spark™ is a very fast and general engine for large-scale data processing
Language Support
• Python
• Java
• Scala
Python Word Count
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Access data via Spark API
Process via Python
What is Spark
• Apache Spark™ is a very fast and general engine for large-scale data processing
Why is Spark so fast?
Most machine learning algorithms need iterative computing
PageRank
[Figure: a 4-node graph (a, b, c, d) over three iterations, each producing a temporary rank result ("Rank Tmp Result") that feeds the next. All ranks start at 1.0; after the 2nd iteration a = 1.85, after the 3rd a = 1.31, with the other nodes' ranks ranging from 0.39 to 1.72.]
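The iteration shown in the figure can be sketched in plain Python. The concrete 4-node link structure and the 0.85 damping factor below are standard-PageRank assumptions for illustration, not values read off the slide.

```python
# Minimal PageRank iteration sketch (plain Python, no Spark).
# The graph and damping factor are illustrative assumptions.
links = {
    "a": ["b", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a", "c"],
}

ranks = {node: 1.0 for node in links}  # every page starts at rank 1.0

for _ in range(10):  # each pass is one "Iter" box in the figure
    contribs = {node: 0.0 for node in links}
    for node, outlinks in links.items():
        share = ranks[node] / len(outlinks)  # split rank across out-links
        for target in outlinks:
            contribs[target] += share
    # standard damping: rank = 0.15 + 0.85 * incoming contributions
    ranks = {node: 0.15 + 0.85 * contribs[node] for node in links}

print(sorted(ranks.items()))
```

Each iteration only needs the previous iteration's ranks, which is exactly the temporary result Spark can keep in memory instead of writing to HDFS.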
HDFS is 100x slower than memory
MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> ... -> Iter N
Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> ... -> Iter N
PageRank algorithm on 1 billion URL records:
• First iteration (HDFS) takes 200 sec
• 2nd iteration (mem) takes 7.4 sec
• 3rd iteration (mem) takes 7.7 sec
Spark Concept
Shuffle
Map Reduce
DAG Engine
DAG Engine
RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
Fault Tolerance
In all martial arts under heaven, nothing is unbreakable, except speed.
RDD
RDD a -> RDD b
val a = sc.textFile("hdfs://....")
val b = a.filter( line => line.contains("Spark") )
Value c
val c = b.count()
Transformation Action
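The transformation/action split can be mimicked with plain Python generators: building the pipeline does no work until a terminal operation pulls data through. The sample lines and the use of `sum` as a stand-in for `count()` are illustrative assumptions, not Spark API.

```python
# Transformations are lazy: building the pipeline does no work yet.
lines = ["Spark is fast", "Hadoop MapReduce", "Spark RDD"]

# 'filtered' is like RDD b = a.filter(...): a recipe, not a result.
filtered = (line for line in lines if "Spark" in line)

# The action forces evaluation, like b.count() returning a plain value.
c = sum(1 for _ in filtered)
print(c)  # 2 lines contain "Spark"
```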
Log mining
val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )
err.cache()
err.count()
val m = err.filter( t => t.contains("MYSQL") ).count()
val ap = err.filter( t => t.contains("APACHE") ).count()
Driver
Worker (Task)
Worker (Task)
Worker (Task)
Log mining
Driver
Worker (Block1): RDD a
Worker (Block2): RDD a
Worker (Block3): RDD a
Log mining
Driver
Worker (Block1): RDD err
Worker (Block2): RDD err
Worker (Block3): RDD err
Log mining
Driver
Worker (Block1): RDD err
Worker (Block2): RDD err
Worker (Block3): RDD err
Log mining
Driver
Worker (Cache1): RDD err
Worker (Cache2): RDD err
Worker (Cache3): RDD err
Log mining
Driver
Worker (Cache1): RDD m
Worker (Cache2): RDD m
Worker (Cache3): RDD m
Log mining
Driver
Worker (Cache1): RDD a
Worker (Cache2): RDD a
Worker (Cache3): RDD a
1st iteration (no cache) takes the same time
With cache, it takes 7 sec
RDD Cache
• Data locality
• Cache: a big shuffle takes 20 min; after caching, only 265 ms (self-join on 5 billion records)
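The caching idea can be illustrated in plain Python: recompute an expensive result on every action, versus compute once and reuse the materialized result. The call counter and filter below are illustrative stand-ins; the 20 min / 265 ms figures above are Spark-specific measurements this sketch does not reproduce.

```python
calls = {"n": 0}

def expensive_filter(data):
    # stand-in for a costly recomputation (e.g. rereading from HDFS)
    calls["n"] += 1
    return [x for x in data if x % 2 == 0]

data = range(10)

# Without cache: every action recomputes the lineage from scratch
a = expensive_filter(data)
b = expensive_filter(data)

# With cache: compute once, then reuse the materialized result
cached = expensive_filter(data)
c = cached  # reuse
d = cached  # reuse

print(calls["n"])  # 3 computations instead of 4
```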
Scala Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Step by Step
• file.flatMap(line => line.split(" ")) => (aaa, bb, cc)
• .map(word => (word, 1)) => ((aaa,1),(bb,1)..)
• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
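The three steps above can be reproduced in plain Python without Spark; a dict plays the role of reduceByKey. The sample lines are illustrative.

```python
lines = ["aaa bb cc", "aaa bb", "aaa"]

# flatMap: split every line into words -> one flat stream of words
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the 1s per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'aaa': 3, 'bb': 2, 'cc': 1}
```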
Java Wordcount
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Java vs Scala
• Scala: file.flatMap(line => line.split(" "))
• Java version:
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});
Python
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Highly Recommend
• Scala: latest API features, stable
• Python
  • Very familiar language
  • Native libs: NumPy, SciPy
How to use it?
• 1. go to https://spark.apache.org/
• 2. Download and unzip it
• 3. ./sbin/start-all.sh or ./bin/spark-shell
DEMO
EcoSystem/Future
Hadoop EcoSystem
Hadoop EcoSystem
Spark ECOSystem
HDFS
Spark Core : MapReduce
SparkSQL: Hive | GraphX: Pregel | MLlib: Mahout | Streaming: Storm
Resource Management System(Yarn, Mesos)
Unified Platform
Detail
SparkSQL
Spark
MLlib
Hive HDFS Cassandra RDBMS
Streaming BI ETL
Complexity
Performance
Write once, run many use cases
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
Spark bridge people together
Data Analyst
Data Engineer Data Scientist
Bridge people together
• Scala : Engineer
• Java : Engineer
• Python : Data Scientist , Engineer
• R : Data Scientist , Data Analyst
• SQL : Data Analyst
Yahoo EC team
Data Platform
Filtered Data (HDFS)
Data Mart (Oracle)
ML Model (Spark)
BI Report (MSTR)
Traffic Data
Transaction Data
Shark
Data Analyst
Data Analyst
• SELECT tweet FROM tweets_data WHERE similarity(tweet, "FIFA") > 0.01
• http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr
350 TB data
Machine Learning
https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900
Data Scientist
http://goo.gl/q5CAx8 http://research.janelia.org/zebrafish/
SQL (Data Analyst)
Cloud Computing
(Data Engineer)
Machine Learning (Data Scientist)
Spark
Databricks Cloud DEMO
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
Instant BI Reporthttp://youtu.be/dJQ5lV5Tldw?t=30m30s
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
Background Knowledge
• Tweet real-time data is stored into a SQL database
• Spark MLlib uses Wikipedia data to train a TF-IDF model
• SparkSQL selects tweets and filters them with the TF-IDF model
• Generate a live BI report
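A toy version of that pipeline in plain Python. The real demo uses Spark MLlib and SparkSQL; the mini corpus, tokenizer, threshold, and function names below are all illustrative assumptions.

```python
import math

# Toy "Wikipedia" corpus used to learn IDF weights (illustrative)
corpus = [
    "fifa world cup football",
    "python programming language",
    "football match tonight",
]

def tf(doc):
    # term frequency within one document
    words = doc.split()
    return {w: words.count(w) / len(words) for w in words}

def idf(corpus):
    # inverse document frequency over the corpus
    n = len(corpus)
    vocab = {w for doc in corpus for w in doc.split()}
    return {w: math.log(n / sum(1 for d in corpus if w in d.split()))
            for w in vocab}

idf_w = idf(corpus)

def tfidf(doc):
    return {w: f * idf_w.get(w, 0.0) for w, f in tf(doc).items()}

def similarity(a, b):
    # cosine similarity between TF-IDF vectors
    va, vb = tfidf(a), tfidf(b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# like: select tweet from tweets where similarity(tweet, "fifa ...") > 0.01
tweets = ["fifa football fans", "cooking pasta recipe"]
hits = [t for t in tweets if similarity(t, "fifa world cup football") > 0.01]
print(hits)
```

In the actual demo the `similarity` function is registered with SparkSQL so analysts can call it directly from SQL.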
Code
val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)
select tweet from tweet where similarity(tweet, "$search") > 0.01
DEMO
http://youtu.be/dJQ5lV5Tldw?t=39m30s
Q & A