DESCRIPTION
Shared about Spark at the invitation of OCF and OSSF. If you have any interest in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF), please check http://ocf.tw/ or http://www.openfoundry.org/. Thanks also to CLBC for the venue. If you would like to work in a great environment, feel free to contact CLBC: http://clbc.tw/
Introduction to Spark
Wisely Chen (aka thegiive)
Sr. Engineer at Yahoo
Agenda
• What is Spark? (Easy)
• Spark Concept (Middle)
• Break: 10 min
• Spark EcoSystem (Easy)
• Spark Future (Middle)
• Q&A
Who am I?
• Wisely Chen ([email protected])
• Sr. Engineer on the Yahoo! [Taiwan] data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Spark Summit 2014 San Francisco
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
Recommendation
Forecast
HADOOP
Opinion from Cloudera
• The leading candidate for "successor to MapReduce" today is Apache Spark
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. !
• From http://0rz.tw/y3OfM
What is Spark
• From UC Berkeley AMP Lab
• Most active big data open source project since Hadoop
Community
Where is Spark?
HDFS
YARN
MapReduce
Hadoop 2.0
Storm HBase Others
HDFS
YARN
MapReduce
Hadoop Architecture
Hive
Storage
Resource Management
Computing Engine
SQL
HDFS
YARN
MapReduce
Hadoop vs Spark
Spark
Hive Shark/SparkSQL
Spark vs Hadoop
• Spark runs on YARN, Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
More than MapReduce
HDFS
Spark Core : MapReduce
Shark: Hive | GraphX: Pregel | MLlib: Mahout | Streaming: Storm
Resource Management System(Yarn, Mesos)
Why Spark?
In all martial arts under heaven, nothing is unbreakable, except speed.
3x to 25x faster than the MapReduce framework
From Matei’s paper: http://0rz.tw/VVqgP
Running time (s):
• Logistic regression: MapReduce 76 vs Spark 3
• KMeans: MapReduce 106 vs Spark 33
• PageRank: MapReduce 171 vs Spark 23
What is Spark
• Apache Spark™ is a very fast and general engine for large-scale data processing
Language Support
• Python
• Java
• Scala
Python Word Count
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Access data via Spark API
Process via Python
What is Spark
• Apache Spark™ is a very fast and general engine for large-scale data processing
Why is Spark so fast?
Most machine learning algorithms need iterative computing
PageRank
[Figure: a 4-node graph (a, b, c, d) over three iterations, each producing a temporary rank result ("Rank Tmp Result") that feeds the next. All ranks start at 1.0; after the 2nd iteration a = 1.85, after the 3rd a = 1.31, with the other nodes' ranks ranging from 0.39 to 1.72.]
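The iteration shown in the figure can be sketched in plain Python. The concrete 4-node link structure and the 0.85 damping factor below are standard-PageRank assumptions for illustration, not values read off the slide.

```python
# Minimal PageRank iteration sketch (plain Python, no Spark).
# The graph and damping factor are illustrative assumptions.
links = {
    "a": ["b", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a", "c"],
}

ranks = {node: 1.0 for node in links}  # every page starts at rank 1.0

for _ in range(10):  # each pass is one "Iter" box in the figure
    contribs = {node: 0.0 for node in links}
    for node, outlinks in links.items():
        share = ranks[node] / len(outlinks)  # split rank across out-links
        for target in outlinks:
            contribs[target] += share
    # standard damping: rank = 0.15 + 0.85 * incoming contributions
    ranks = {node: 0.15 + 0.85 * contribs[node] for node in links}

print(sorted(ranks.items()))
```

Each iteration only needs the previous iteration's ranks, which is exactly the temporary result Spark can keep in memory instead of writing to HDFS.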
HDFS is 100x slower than memory
MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> ... -> Iter N
Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> ... -> Iter N
PageRank algorithm on 1 billion URL records:
• First iteration (HDFS) takes 200 sec
• 2nd iteration (mem) takes 7.4 sec
• 3rd iteration (mem) takes 7.7 sec
Spark Concept
Shuffle
Map Reduce
DAG Engine
DAG Engine
RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
Fault Tolerance
In all martial arts under heaven, nothing is unbreakable, except speed.
RDD
RDD a -> RDD b
val a = sc.textFile("hdfs://....")
val b = a.filter( line => line.contains("Spark") )
Value c
val c = b.count()
Transformation Action
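The transformation/action split can be mimicked with plain Python generators: building the pipeline does no work until a terminal operation pulls data through. The sample lines and the use of `sum` as a stand-in for `count()` are illustrative assumptions, not Spark API.

```python
# Transformations are lazy: building the pipeline does no work yet.
lines = ["Spark is fast", "Hadoop MapReduce", "Spark RDD"]

# 'filtered' is like RDD b = a.filter(...): a recipe, not a result.
filtered = (line for line in lines if "Spark" in line)

# The action forces evaluation, like b.count() returning a plain value.
c = sum(1 for _ in filtered)
print(c)  # 2 lines contain "Spark"
```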
Log mining
val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )
err.cache()
err.count()
val m = err.filter( t => t.contains("MYSQL") ).count()
val ap = err.filter( t => t.contains("APACHE") ).count()
Driver
Worker (Task)
Worker (Task)
Worker (Task)
Log mining
Driver
Worker (Block1): RDD a
Worker (Block2): RDD a
Worker (Block3): RDD a
Log mining
Driver
Worker (Block1): RDD err
Worker (Block2): RDD err
Worker (Block3): RDD err
Log mining
Driver
Worker (Block1): RDD err
Worker (Block2): RDD err
Worker (Block3): RDD err
Log mining
Driver
Worker (Cache1): RDD err
Worker (Cache2): RDD err
Worker (Cache3): RDD err
Log mining
Driver
Worker (Cache1): RDD m
Worker (Cache2): RDD m
Worker (Cache3): RDD m
Log mining
Driver
Worker (Cache1): RDD a
Worker (Cache2): RDD a
Worker (Cache3): RDD a
1st iteration (no cache) takes the same time
With cache, it takes 7 sec
RDD Cache
• Data locality
• Cache: a big shuffle takes 20 min; after caching, only 265 ms (self-join on 5 billion records)
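The caching idea can be illustrated in plain Python: recompute an expensive result on every action, versus compute once and reuse the materialized result. The call counter and filter below are illustrative stand-ins; the 20 min / 265 ms figures above are Spark-specific measurements this sketch does not reproduce.

```python
calls = {"n": 0}

def expensive_filter(data):
    # stand-in for a costly recomputation (e.g. rereading from HDFS)
    calls["n"] += 1
    return [x for x in data if x % 2 == 0]

data = range(10)

# Without cache: every action recomputes the lineage from scratch
a = expensive_filter(data)
b = expensive_filter(data)

# With cache: compute once, then reuse the materialized result
cached = expensive_filter(data)
c = cached  # reuse
d = cached  # reuse

print(calls["n"])  # 3 computations instead of 4
```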
Scala Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Step by Step
• file.flatMap(line => line.split(" ")) => (aaa, bb, cc)
• .map(word => (word, 1)) => ((aaa,1),(bb,1)..)
• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
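The three steps above can be reproduced in plain Python without Spark; a dict plays the role of reduceByKey. The sample lines are illustrative.

```python
lines = ["aaa bb cc", "aaa bb", "aaa"]

# flatMap: split every line into words -> one flat stream of words
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the 1s per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'aaa': 3, 'bb': 2, 'cc': 1}
```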
Java Wordcount
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Java vs Scala
• Scala: file.flatMap(line => line.split(" "))
• Java version:
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});
Python
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Highly Recommend
• Scala: latest API features, stable
• Python
  • Very familiar language
  • Native libs: NumPy, SciPy
How to use it?
• 1. go to https://spark.apache.org/
• 2. Download and unzip it
• 3. ./sbin/start-all.sh or ./bin/spark-shell
DEMO
EcoSystem/Future
Hadoop EcoSystem
Hadoop EcoSystem
Spark ECOSystem
HDFS
Spark Core : MapReduce
SparkSQL: Hive | GraphX: Pregel | MLlib: Mahout | Streaming: Storm
Resource Management System(Yarn, Mesos)
Unified Platform
Detail
SparkSQL
Spark
MLlib
Hive HDFS Cassandra RDBMS
Streaming BI ETL
Complexity
Performance
Write once, run many use cases
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
Spark bridge people together
Data Analyst
Data Engineer Data Scientist
Bridge people together
• Scala : Engineer
• Java : Engineer
• Python : Data Scientist , Engineer
• R : Data Scientist , Data Analyst
• SQL : Data Analyst
Yahoo EC team
Data Platform
Filtered Data (HDFS)
Data Mart (Oracle)
ML Model (Spark)
BI Report (MSTR)
Traffic Data
Transaction Data
Shark
Data Analyst
Data Analyst
• SELECT tweet FROM tweets_data WHERE similarity(tweet, "FIFA") > 0.01
• http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr
350 TB data
Machine Learning
https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900
Data Scientist
http://goo.gl/q5CAx8 http://research.janelia.org/zebrafish/
SQL (Data Analyst)
Cloud Computing
(Data Engineer)
Machine Learning (Data Scientist)
Spark
Databricks Cloud DEMO
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
Instant BI Reporthttp://youtu.be/dJQ5lV5Tldw?t=30m30s
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
Background Knowledge
• Tweet real-time data is stored into a SQL database
• Spark MLlib uses Wikipedia data to train a TF-IDF model
• SparkSQL selects tweets and filters them with the TF-IDF model
• Generate a live BI report
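A toy version of that pipeline in plain Python. The real demo uses Spark MLlib and SparkSQL; the mini corpus, tokenizer, threshold, and function names below are all illustrative assumptions.

```python
import math

# Toy "Wikipedia" corpus used to learn IDF weights (illustrative)
corpus = [
    "fifa world cup football",
    "python programming language",
    "football match tonight",
]

def tf(doc):
    # term frequency within one document
    words = doc.split()
    return {w: words.count(w) / len(words) for w in words}

def idf(corpus):
    # inverse document frequency over the corpus
    n = len(corpus)
    vocab = {w for doc in corpus for w in doc.split()}
    return {w: math.log(n / sum(1 for d in corpus if w in d.split()))
            for w in vocab}

idf_w = idf(corpus)

def tfidf(doc):
    return {w: f * idf_w.get(w, 0.0) for w, f in tf(doc).items()}

def similarity(a, b):
    # cosine similarity between TF-IDF vectors
    va, vb = tfidf(a), tfidf(b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# like: select tweet from tweets where similarity(tweet, "fifa ...") > 0.01
tweets = ["fifa football fans", "cooking pasta recipe"]
hits = [t for t in tweets if similarity(t, "fifa world cup football") > 0.01]
print(hits)
```

In the actual demo the `similarity` function is registered with SparkSQL so analysts can call it directly from SQL.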
Code
val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)
select tweet from tweet where similarity(tweet, "$search") > 0.01
DEMO
http://youtu.be/dJQ5lV5Tldw?t=39m30s
Q & A