Introduction to Spark

Wisely Chen (aka thegiive), Sr. Engineer at Yahoo

OCF.tw's talk about "Introduction to spark"


DESCRIPTION

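At the invitation of OCF and OSSF, I am sharing an introduction to Spark. If you have any interest in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF), please check http://ocf.tw/ or http://www.openfoundry.org/. Thanks also to CLBC for providing the venue. If you would like to work in a good working environment, feel free to contact CLBC: http://clbc.tw/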


Page 1: OCF.tw's talk about "Introduction to spark"

Introduction to Spark

Wisely Chen (aka thegiive)

Sr. Engineer at Yahoo

Page 2: OCF.tw's talk about "Introduction to spark"

Agenda

• What is Spark? ( Easy )

• Spark Concept ( Middle )

• Break : 10min

• Spark EcoSystem ( Easy )

• Spark Future ( Middle )

• Q&A

Page 3: OCF.tw's talk about "Introduction to spark"

Who am I?

• Wisely Chen ( [email protected] )

• Sr. Engineer on the Yahoo Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• Spark Summit 2014 San Francisco

• COSCUP 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012

Page 4: OCF.tw's talk about "Introduction to spark"

Taiwan Data Team

Data Highway

BI Report

Serving API

Data Mart

ETL / Forecast

Machine Learning

Page 5: OCF.tw's talk about "Introduction to spark"
Page 6: OCF.tw's talk about "Introduction to spark"

Recommendation

Forecast

Page 7: OCF.tw's talk about "Introduction to spark"

HADOOP

Page 8: OCF.tw's talk about "Introduction to spark"

Opinion from Cloudera

• The leading candidate for “successor to MapReduce” today is Apache Spark

• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. !

• From http://0rz.tw/y3OfM

Page 9: OCF.tw's talk about "Introduction to spark"

What is Spark

• From UC Berkeley AMP Lab

• The most active big data open source project since Hadoop

Page 10: OCF.tw's talk about "Introduction to spark"

Community

Page 11: OCF.tw's talk about "Introduction to spark"

Community

Page 12: OCF.tw's talk about "Introduction to spark"

Where is Spark?

Page 13: OCF.tw's talk about "Introduction to spark"

HDFS

YARN

MapReduce

Hadoop 2.0

Storm HBase Others

Page 14: OCF.tw's talk about "Introduction to spark"

HDFS

YARN

MapReduce

Hadoop Architecture

Hive

Storage

Resource Management

Computing Engine

SQL

Page 15: OCF.tw's talk about "Introduction to spark"

HDFS

YARN

MapReduce

Hadoop vs Spark

Spark

Hive Shark/SparkSQL

Page 16: OCF.tw's talk about "Introduction to spark"

Spark vs Hadoop

• Spark runs on YARN, Mesos, or in standalone mode

• Spark’s main concept is based on MapReduce

• Spark can read from

• HDFS: data locality

• HBase

• Cassandra

Page 17: OCF.tw's talk about "Introduction to spark"

More than MapReduce

HDFS

Spark Core : MapReduce

Shark: Hive

GraphX: Pregel

MLlib: Mahout

Streaming: Storm

Resource Management System(Yarn, Mesos)

Page 18: OCF.tw's talk about "Introduction to spark"

Why Spark?

Page 19: OCF.tw's talk about "Introduction to spark"

Of all the martial arts under heaven, no strength is unbreakable; only speed cannot be beaten

Page 20: OCF.tw's talk about "Introduction to spark"

3X~25X faster than the MapReduce framework

From Matei’s paper: http://0rz.tw/VVqgP

Running time (s), MapReduce (MR) vs Spark:

• Logistic regression: MR 76, Spark 3

• KMeans: MR 106, Spark 33

• PageRank: MR 171, Spark 23

Page 21: OCF.tw's talk about "Introduction to spark"

What is Spark

• Apache Spark™ is a very fast and general engine for large-scale data processing

Page 22: OCF.tw's talk about "Introduction to spark"

Language Support

• Python

• Java

• Scala

Page 23: OCF.tw's talk about "Introduction to spark"

Python Word Count

file = spark.textFile("hdfs://...")

counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs://...")

Access data via Spark API

Process via Python

Page 24: OCF.tw's talk about "Introduction to spark"

What is Spark

• Apache Spark™ is a very fast and general engine for large-scale data processing

Page 25: OCF.tw's talk about "Introduction to spark"

Why is Spark so fast?

Page 26: OCF.tw's talk about "Introduction to spark"

Most machine learning algorithms need iterative computing

Page 27: OCF.tw's talk about "Introduction to spark"

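PageRank

Figure: a four-node graph (a, b, c, d) shown at the 1st, 2nd, and 3rd iterations. Each iteration reads the current ranks and writes a temporary rank result for the next one.

• 1st Iter: every node has rank 1.0

• 2nd Iter: a = 1.85, the other nodes 1.0, 0.58, 0.58

• 3rd Iter: a = 1.31, the other nodes 1.72, 0.39, 0.58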
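The iteration pictured on this slide can be sketched in plain Python, with no Spark involved. The link structure below is an assumption: it is one graph that happens to reproduce the rank values shown in the figure when combined with the common update rule rank = 0.15 + 0.85 × (sum of incoming contributions). The slide itself states neither the edges nor the constants.

```python
# Plain-Python PageRank sketch. The edges and the 0.15/0.85 constants are
# assumptions chosen for illustration; they are not stated on the slide.
LINKS = {
    "a": ["b"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["a", "c"],
}


def pagerank_step(ranks, links, damping=0.85):
    """Run one iteration: every page splits its rank evenly among its out-links."""
    contribs = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            contribs[target] += share
    return {page: (1 - damping) + damping * total for page, total in contribs.items()}


def pagerank(links, iterations):
    """Start every page at rank 1.0 and apply the update `iterations` times."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(ranks, links)
    return ranks


if __name__ == "__main__":
    for n in (0, 1, 2):
        print(n, sorted(pagerank(LINKS, n).items()))
```

Every step needs the full rank table produced by the previous step, which is why the storage location of that intermediate table dominates the running time of an iterative job.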

Page 28: OCF.tw's talk about "Introduction to spark"

HDFS is 100x slower than memory

MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → Iter N

Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → Iter N

Page 29: OCF.tw's talk about "Introduction to spark"

PageRank algorithm on 1 billion URL records

First iteration (HDFS) takes 200 sec

2nd iteration (mem) takes 7.4 sec

3rd iteration (mem) takes 7.7 sec

Page 30: OCF.tw's talk about "Introduction to spark"

Spark Concept

Page 31: OCF.tw's talk about "Introduction to spark"

Shuffle

Map Reduce
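As a rough illustration of what the shuffle between the map side and the reduce side does, here is a minimal plain-Python sketch. It is not Spark or Hadoop code: the function names, the two sample map outputs, and the use of crc32 as the partitioning hash are all assumptions made for this example.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer for a key; crc32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Regroup (key, value) pairs from every mapper by destination reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for mapper_output in map_outputs:
        for key, value in mapper_output:
            buckets[partition_for(key, num_reducers)][key].append(value)
    return buckets


def reduce_side(buckets):
    """Each reducer sums the values it received for each of its keys."""
    return [{key: sum(values) for key, values in bucket.items()} for bucket in buckets]


# Hypothetical output of two map tasks
map_outputs = [
    [("spark", 1), ("hadoop", 1), ("spark", 1)],
    [("hadoop", 1), ("spark", 1)],
]
reduced = reduce_side(shuffle(map_outputs, num_reducers=2))
merged = {key: total for part in reduced for key, total in part.items()}
```

The point of the sketch is that every value for a given key ends up on exactly one reducer, which in a real cluster means moving data across the network.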

Page 32: OCF.tw's talk about "Introduction to spark"

DAG Engine

Page 33: OCF.tw's talk about "Introduction to spark"

DAG Engine

Page 34: OCF.tw's talk about "Introduction to spark"

RDD

• Resilient Distributed Dataset

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

Page 35: OCF.tw's talk about "Introduction to spark"

Fault Tolerance

Of all the martial arts under heaven, no strength is unbreakable; only speed cannot be beaten

Page 36: OCF.tw's talk about "Introduction to spark"

RDD

val a = sc.textFile("hdfs://....")   // RDD a

val b = a.filter( line => line.contains("Spark") )   // RDD b (transformation)

val c = b.count()   // Value c (action)
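The difference between a transformation and an action can be sketched with a toy class in plain Python. MiniRDD and its methods are hypothetical and only mimic the idea; this is not Spark's implementation. A transformation merely records a step, and nothing is read until an action asks for a value.

```python
# A toy stand-in for an RDD, written in plain Python for illustration only.
# MiniRDD and its methods are hypothetical; they mimic the idea, not Spark's API.
class MiniRDD:
    def __init__(self, source, steps=()):
        self._source = source      # zero-argument function that yields records
        self._steps = steps        # recorded transformations (the "lineage")

    def filter(self, predicate):
        # Transformation: only records the step and returns a new MiniRDD.
        return MiniRDD(self._source, self._steps + (predicate,))

    def count(self):
        # Action: now the source is read and every recorded filter is applied.
        total = 0
        for record in self._source():
            if all(step(record) for step in self._steps):
                total += 1
        return total


reads = []  # grows by one entry each time the source is actually scanned


def read_lines():
    reads.append("scan")
    return iter(["Spark is fast", "Hadoop MapReduce", "Spark on YARN"])


a = MiniRDD(read_lines)                       # nothing is read yet
b = a.filter(lambda line: "Spark" in line)    # still nothing is read
scans_before_action = len(reads)
c = b.count()                                 # the action triggers the scan
```

Because the recorded steps describe how to rebuild the data from its source, the same record of steps is also what makes recomputation after a failure possible.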

Page 37: OCF.tw's talk about "Introduction to spark"

Log mining

val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") )
           .count()
val a = err.filter( t => t.contains("APACHE") )
           .count()

Driver

Worker 1: Task

Worker 2: Task

Worker 3: Task

Page 38: OCF.tw's talk about "Introduction to spark"

Log mining

Driver

Worker 1: Block1, RDD a

Worker 2: Block2, RDD a

Worker 3: Block3, RDD a

Page 39: OCF.tw's talk about "Introduction to spark"

Log mining

Driver

Worker 1: Block1, RDD err

Worker 2: Block2, RDD err

Worker 3: Block3, RDD err

Page 40: OCF.tw's talk about "Introduction to spark"

Log mining

Driver

Worker 1: Block1, RDD err

Worker 2: Block2, RDD err

Worker 3: Block3, RDD err

Page 41: OCF.tw's talk about "Introduction to spark"

Log mining

Driver

Worker 1: Cache1, RDD err

Worker 2: Cache2, RDD err

Worker 3: Cache3, RDD err

Page 42: OCF.tw's talk about "Introduction to spark"

Log mining

Driver

Worker 1: Cache1, RDD m

Worker 2: Cache2, RDD m

Worker 3: Cache3, RDD m

Page 43: OCF.tw's talk about "Introduction to spark"

Log mining

Driver

Worker 1: Cache1, RDD a

Worker 2: Cache2, RDD a

Worker 3: Cache3, RDD a

Page 44: OCF.tw's talk about "Introduction to spark"

RDD Cache

1st iteration (no cache) takes the same time

With cache, takes 7 sec
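The effect of caching in the log mining walkthrough can be illustrated with a small plain-Python sketch. CachedDataset, load_log, and the three sample log lines are hypothetical names and data invented for this example. The sketch only counts how often the expensive load runs; it does not measure real timings.

```python
# Plain-Python illustration of caching; CachedDataset is a hypothetical helper,
# not part of Spark.
class CachedDataset:
    def __init__(self, load):
        self._load = load          # expensive loader, e.g. a scan of remote storage
        self._use_cache = False
        self._cached = None
        self.loads = 0             # how many times the loader actually ran

    def cache(self):
        # Only marks the data to be kept; nothing is loaded until first use.
        self._use_cache = True
        return self

    def records(self):
        if self._cached is not None:
            return self._cached
        self.loads += 1
        data = list(self._load())
        if self._use_cache:
            self._cached = data
        return data

    def count(self, predicate=lambda _record: True):
        return sum(1 for record in self.records() if predicate(record))


def load_log():
    return ["ERROR 2014 MYSQL down", "INFO 2014 ok", "ERROR 2014 APACHE timeout"]


uncached = CachedDataset(load_log)
uncached.count()
uncached.count(lambda line: "MYSQL" in line)      # loads the data a second time

cached = CachedDataset(load_log).cache()
cached.count()                                    # first pass still loads the data
mysql_errors = cached.count(lambda line: "MYSQL" in line)    # served from memory
apache_errors = cached.count(lambda line: "APACHE" in line)  # served from memory
```

This mirrors the slide: the first pass costs the same with or without the cache, and only later passes benefit.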

Page 45: OCF.tw's talk about "Introduction to spark"

RDD Cache

• Data locality

• Cache

Self join on 5 billion records of data

A big shuffle takes 20 min

After cache, takes only 265 ms

Page 46: OCF.tw's talk about "Introduction to spark"

Scala Word Count

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

Page 47: OCF.tw's talk about "Introduction to spark"

Step by Step

• file.flatMap(line => line.split(" ")) => (aaa,bb,cc)

• .map(word => (word, 1)) => ((aaa,1),(bb,1)..)

• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
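The three steps above can be traced on a single machine with plain Python collections. This is only a sketch of what each step produces, not Spark code, and the three input lines are made up for the example.

```python
from collections import defaultdict
from itertools import chain

lines = ["aaa bb cc", "aaa bb", "aaa"]   # hypothetical input lines

# flatMap: split every line and flatten the pieces into one list of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values that share a key using addition
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one
counts = dict(counts)
```

In Spark the same three steps run in parallel across partitions, with reduceByKey being the step that needs a shuffle.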

Page 48: OCF.tw's talk about "Introduction to spark"

Java Wordcount

JavaRDD<String> file = spark.textFile("hdfs://...");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://...");

Page 49: OCF.tw's talk about "Introduction to spark"

Java vs Scala

• Scala : file.flatMap(line => line.split(" "))

• Java version :

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});

Page 50: OCF.tw's talk about "Introduction to spark"

Python

file = spark.textFile("hdfs://...")

counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs://...")

Page 51: OCF.tw's talk about "Introduction to spark"

Highly Recommended

• Scala : Latest API feature, Stable

• Python

• very familiar language

• Native Lib: NumPy, SciPy

Page 52: OCF.tw's talk about "Introduction to spark"

How to use it?

• 1. go to https://spark.apache.org/

• 2. Download and unzip it

• 3. ./sbin/start-all.sh or ./bin/spark-shell

Page 53: OCF.tw's talk about "Introduction to spark"

DEMO

Page 54: OCF.tw's talk about "Introduction to spark"

EcoSystem/Future

Page 55: OCF.tw's talk about "Introduction to spark"
Page 56: OCF.tw's talk about "Introduction to spark"

Hadoop EcoSystem

Page 57: OCF.tw's talk about "Introduction to spark"

Hadoop EcoSystem

Page 58: OCF.tw's talk about "Introduction to spark"

Spark ECOSystem

HDFS

Spark Core : MapReduce

SparkSQL: Hive

GraphX: Pregel

MLlib: Mahout

Streaming: Storm

Resource Management System(Yarn, Mesos)

Page 59: OCF.tw's talk about "Introduction to spark"

Unified Platform

Page 60: OCF.tw's talk about "Introduction to spark"

Detail

SparkSQL

Spark

MLlib

Hive HDFS Cassandra RDBMS

Streaming BI ETL

Page 61: OCF.tw's talk about "Introduction to spark"

Complexity

Page 62: OCF.tw's talk about "Introduction to spark"

Performance

Page 63: OCF.tw's talk about "Introduction to spark"

Write once, Run use case

Page 64: OCF.tw's talk about "Introduction to spark"

BI (SparkSQL)

Streaming (SparkStreaming)

Machine Learning (MLlib)

Spark

Page 65: OCF.tw's talk about "Introduction to spark"

Spark bridges people together

Page 66: OCF.tw's talk about "Introduction to spark"

Data Analyst

Data Engineer Data Scientist

Page 67: OCF.tw's talk about "Introduction to spark"

Bridge people together

• Scala : Engineer

• Java : Engineer

• Python : Data Scientist , Engineer

• R : Data Scientist , Data Analyst

• SQL : Data Analyst

Page 68: OCF.tw's talk about "Introduction to spark"

Yahoo EC team

Data Platform

Filtered Data (HDFS)

Data Mart (Oracle)

ML Model (Spark)

BI Report (MSTR)

Traffic Data

Transaction Data

Shark

Page 69: OCF.tw's talk about "Introduction to spark"

Data Analyst

Page 70: OCF.tw's talk about "Introduction to spark"

Data Analyst

• Select tweet from tweets_data where similarity(tweet, "FIFA") > 0.01

• http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr

350 TB data

Machine Learning

https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900

Page 71: OCF.tw's talk about "Introduction to spark"

Data Scientist

http://goo.gl/q5CAx8 http://research.janelia.org/zebrafish/

Page 72: OCF.tw's talk about "Introduction to spark"

SQL (Data Analyst)

Cloud Computing

(Data Engineer)

Machine Learning (Data Scientist)

Spark

Page 73: OCF.tw's talk about "Introduction to spark"

Databricks Cloud DEMO

Page 74: OCF.tw's talk about "Introduction to spark"

BI (SparkSQL)

Streaming (SparkStreaming)

Machine Learning (MLlib)

Spark

Page 75: OCF.tw's talk about "Introduction to spark"

Instant BI Report

http://youtu.be/dJQ5lV5Tldw?t=30m30s

Page 76: OCF.tw's talk about "Introduction to spark"

BI (SparkSQL)

Streaming (SparkStreaming)

Machine Learning (MLlib)

Spark

Page 77: OCF.tw's talk about "Introduction to spark"

Background Knowledge

• Real-time tweet data is stored in a SQL database

• Spark MLlib uses Wikipedia data to train a TF-IDF model

• SparkSQL selects tweets and filters them with the TF-IDF model

• A live BI report is generated

Page 78: OCF.tw's talk about "Introduction to spark"

Code

val wiki = sql("select text from wiki")

val model = new TFIDF()

model.train(wiki)

registerFunction("similarity", model.similarity _)

select tweet from tweet where similarity(tweet, "$search") > 0.01
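To give a feel for what a similarity function backed by a TF-IDF model might compute, here is a minimal plain-Python sketch. It is not the code used in the demo: TfIdfModel, the smoothed IDF formula, the cosine similarity, and the tiny corpus are all assumptions made for illustration. Only the 0.01 threshold comes from the slide.

```python
import math
from collections import Counter


class TfIdfModel:
    """Tiny TF-IDF model; a hypothetical stand-in for the TFIDF class on the slide."""

    def __init__(self):
        self.idf = {}

    def train(self, documents):
        # Inverse document frequency: rarer words get larger weights.
        doc_count = len(documents)
        doc_freq = Counter(word for doc in documents for word in set(doc.lower().split()))
        self.idf = {w: math.log((1 + doc_count) / (1 + df)) + 1 for w, df in doc_freq.items()}

    def _vector(self, text):
        term_freq = Counter(text.lower().split())
        # Words never seen during training are ignored.
        return {w: tf * self.idf[w] for w, tf in term_freq.items() if w in self.idf}

    def similarity(self, left, right):
        # Cosine similarity between the two TF-IDF vectors.
        a, b = self._vector(left), self._vector(right)
        dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0


# Hypothetical training corpus and tweets
wiki = ["fifa organizes the football world cup", "spark is a cluster computing engine"]
tweets = ["watching the fifa world cup final", "learning spark today"]

model = TfIdfModel()
model.train(wiki)
matches = [t for t in tweets if model.similarity(t, "fifa") > 0.01]
```

Registering such a function for use inside SQL is what lets an analyst filter rows with a trained model using an ordinary where clause.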

Page 79: OCF.tw's talk about "Introduction to spark"

DEMO

http://youtu.be/dJQ5lV5Tldw?t=39m30s

Page 80: OCF.tw's talk about "Introduction to spark"

Q & A