Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Reynold Xin 辛湜 (shi2)
UC Berkeley
Prof. Michael Franklin
My Background
PhD student in the AMP Lab at UC Berkeley
» 50-person lab focusing on big data
Works on the Berkeley Data Analytics Stack (BDAS)
Work/intern experience at Google Research, IBM DB2, Altera
AMPLab @ Berkeley
NSF & DARPA funded
Sponsors: Amazon Web Services / Google / SAP
Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies)
Collaboration across databases (Mike Franklin, sitting right here), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), and machine learning (Michael Jordan).
What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 100x faster than Hadoop MapReduce
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
What is Shark?
A SQL analytics engine built on top of Spark
Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc)
Similar speedups of up to 100x
Project History
Spark project started in 2009, open sourced 2010
Shark started summer 2011, open sourced 2012
Spark 0.7 released Feb 2013
Streaming alpha in Spark 0.7
GraphLab support soon
Adoption
In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease)
600+ member meetup, 800+ watchers on GitHub
30+ contributors (including Intel Shanghai)
AMPCamp 150 attendees, 5000 online attendees
AMPCamp on Mar 15, ECNU, Shanghai
This Talk
Introduction
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
Why go Beyond MapReduce?
MapReduce simplified big data analysis by giving a reliable programming model for large clusters
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications
» More interactive ad-hoc queries
Why go Beyond MapReduce?
Complex jobs and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
[Figure: a multi-stage iterative algorithm (Stage 1 → Stage 2 → Stage 3) alongside an interactive data-mining session (Query 1, Query 2, Query 3), both reusing intermediate data]
In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS) -> slow!
Examples
[Figure: an iterative job must write to and read from HDFS between every iteration (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …), and each interactive query (query 1, 2, 3) re-reads the same input from HDFS to produce its result]
I/O and serialization can take 90% of the time
Goal: In-Memory Data Sharing
[Figure: input undergoes one-time processing into distributed memory; subsequent iterations and queries (query 1, 2, 3, …) then read from memory instead of disk]
10-100× faster than network and disk
Solution: Resilient Distributed Datasets (RDDs)
Distributed collections of objects that can be stored in memory for fast reuse
Automatically recover lost data on failure
Support a wide range of applications
Programming Model
Resilient distributed datasets (RDDs)
» Immutable, partitioned collections of objects
» Can be cached in memory for efficient reuse
Transformations (e.g. map, filter, groupBy, join)
» Build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage
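The split between transformations and actions can be mimicked with plain Scala collections; the sketch below is only illustrative (local lists stand in for distributed RDDs, and the object and method names are hypothetical, not the Spark API):

```scala
// A minimal sketch of the RDD programming model using plain Scala
// collections in place of distributed data.
object RddModelSketch {
  // Transformations (filter, map) describe new datasets derived from
  // old ones; an action (here, returning the result list) produces a
  // concrete value for the driver program.
  def errorMessages(lines: List[String]): List[String] =
    lines.filter(_.startsWith("ERROR"))   // transformation
         .map(_.split('\t')(2))           // transformation
}
```

For example, `RddModelSketch.errorMessages(List("ERROR\tdisk\tfull", "INFO\tok"))` returns the third tab-separated field of each error line.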
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count          // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver sends tasks to workers, each of which reads a block of the file (Block 1-3), caches its slice of messages (Cache 1-3), and returns results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count
myrdd.flatMap { doc => doc.split("\\s") }
     .map { word => (word, 1) }
     .reduceByKey { case (v1, v2) => v1 + v2 }

myrdd.flatMap(_.split("\\s"))
     .map((_, 1))
     .reduceByKey(_ + _)
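For readers unfamiliar with reduceByKey, the same computation can be written against plain Scala collections; this local sketch (hypothetical object name, not the Spark API) makes the group-then-reduce semantics explicit:

```scala
// Word count over local collections, spelling out what reduceByKey
// does: group (word, 1) pairs by key, then sum each group's values.
object WordCountSketch {
  def wordCount(docs: List[String]): Map[String, Int] =
    docs.flatMap(_.split("\\s+"))          // split documents into words
        .map(word => (word, 1))            // pair each word with a count
        .groupBy(_._1)                     // group the pairs by word...
        .map { case (w, ps) => (w, ps.map(_._2).sum) } // ...and sum
}
```

`WordCountSketch.wordCount(List("a b a"))` yields `Map("a" -> 2, "b" -> 1)`.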
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
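The lineage idea can be sketched as a tiny Scala program; everything here is illustrative (the real RDD classes track partitions and dependencies, not whole in-memory lists):

```scala
// Toy illustration of lineage-based recovery: each derived dataset
// remembers the function and parent used to build it, so lost data can
// be recomputed by replaying functions from stable input.
object LineageSketch {
  sealed trait Dataset { def compute(): List[String] }

  // Stable input (e.g. a file in HDFS).
  case class Source(data: List[String]) extends Dataset {
    def compute(): List[String] = data
  }

  // A derived dataset keeps no materialized copy here: losing it just
  // means replaying `f` over the parent's (re)computed data.
  case class Derived(parent: Dataset, f: List[String] => List[String])
      extends Dataset {
    def compute(): List[String] = f(parent.compute())
  }
}
```

Chaining two `Derived` nodes over a `Source` mirrors the HadoopRDD → FilteredRDD → MappedRDD chain above.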
Fault Recovery Results
[Chart: iteration time (s) over 10 iterations; the first iteration takes 119 s, subsequent iterations ~57-59 s; when a failure happens, one iteration rises to ~81 s while lost partitions are recomputed, then times return to normal]
Tradeoff Space
[Chart: granularity of updates (fine → coarse) vs. write throughput (low → high); K-V stores, databases, and RAMCloud occupy the fine-grained corner serving transactional workloads at network bandwidth, while HDFS and RDDs occupy the coarse-grained, high-throughput corner serving batch analytics, with RDDs reading at memory bandwidth]
Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory]
  Cache disabled   68.8 s
  25% cached       58.1 s
  50% cached       40.7 s
  75% cached       29.7 s
  Fully cached     11.5 s
Example: Logistic Regression

// Load data in memory once
val data = spark.textFile(...).map(readPoint).cache()

// Initial parameter vector
var w = Vector.random(D)

// Repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
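To make the gradient formula concrete, here is a single-machine sketch with one-dimensional points, so the vector operations reduce to plain doubles; the names are hypothetical, and the real Spark version keeps `data` cached across iterations on the cluster:

```scala
// Local logistic-regression sketch with 1-D points (label y is +1/-1).
object LogisticSketch {
  case class Point(x: Double, y: Double)

  // Same per-point term as the slide: (1/(1+e^{-y(w*x)}) - 1) * y * x
  def gradient(data: List[Point], w: Double): Double =
    data.map { p =>
      (1.0 / (1.0 + math.exp(-p.y * (w * p.x))) - 1.0) * p.y * p.x
    }.sum

  // Repeated gradient-descent steps over the same (cached) data.
  def train(data: List[Point], iterations: Int): Double = {
    var w = 0.0
    for (_ <- 1 to iterations) w -= gradient(data, w)
    w
  }
}
```

On linearly separable data such as (x=1, y=+1) and (x=-1, y=-1), `train` drives w to a positive value that classifies both points correctly.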
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1 to 30); Hadoop takes ~110 s per iteration, while Spark takes 80 s for the first iteration (loading data) and ~1 s for each further iteration]
Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
Scheduler
Dryad-like task DAG
Pipelines functions within a stage
Cache-aware data locality & reuse
Partitioning-aware to avoid shuffles
[Figure: an example DAG of RDDs A-G built with groupBy, map, union, and join, cut into Stages 1-3 at shuffle boundaries; shaded boxes mark previously computed partitions that are skipped]
Implementation
Use Mesos / YARN to share resources with Hadoop & other frameworks
Can access any Hadoop input source (HDFS, S3, …)
Core engine is only 20k lines of code
[Figure: Spark, Hadoop, and MPI frameworks running side by side on cluster nodes, managed by Mesos or YARN]
User Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
. . .
Conviva GeoReport
Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated reading, deserialization and filtering
[Chart: time (hours) — Hive: 20, Spark: 0.5]
Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
This Talk
Introduction
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
Challenges
Data volumes expanding.
Faults and stragglers complicate parallel database design.
Complexity of analysis: machine learning, graph algorithms, etc.
Low-latency, interactivity.
MPP Databases
Vertica, SAP HANA, Teradata, Google Dremel...
Fast!
Generally not fault-tolerant; challenging for long-running queries as clusters scale up.
Lack rich analytics such as machine learning and graph algorithms.
MapReduce
Apache Hive, Google Tenzing, Turn Cheetah …
Deterministic, idempotent tasks enable fine-grained fault-tolerance and resource sharing.
Expressive enough for machine learning algorithms.
High latency; dismissed for interactive workloads.
Shark
A data warehouse that
» builds on Spark,
» scales out and is fault-tolerant,
» supports low-latency, interactive queries through in-memory computation,
» supports both SQL and complex analytics,
» is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).
Hive Architecture
[Diagram: a client (CLI, JDBC) talks to a Driver containing the SQL Parser, Query Optimizer, and Physical Plan Execution; queries run as MapReduce jobs over HDFS, with table metadata kept in the Meta Store]
Shark Architecture
[Diagram: same structure as Hive (client, Driver with SQL Parser, Query Optimizer, and Physical Plan Execution, Meta Store, HDFS), but the execution engine is Spark plus a Cache Manager instead of MapReduce]
Engine Features
Columnar Memory Store
Machine Learning Integration
Partial DAG Execution
Hash-based Shuffle vs Sort-based Shuffle
Data Co-partitioning
Partition Pruning based on Range Statistics
...
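As an illustration of one feature above, partition pruning based on range statistics can be sketched as follows; this is a toy model with hypothetical names, not Shark's implementation:

```scala
// Partition pruning sketch: keep per-partition (min, max) statistics
// for a column and skip any partition whose range cannot contain the
// value being filtered for.
object PruningSketch {
  case class Partition(min: Int, max: Int, rows: List[Int])

  // Scan only partitions whose [min, max] range can match `value`.
  def select(parts: List[Partition], value: Int): List[Int] =
    parts.filter(p => value >= p.min && value <= p.max)
         .flatMap(_.rows.filter(_ == value))
}
```

With partitions covering ranges [0, 9] and [10, 19], a query for the value 7 never touches the second partition's rows.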
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage          Column Storage
1  john   4.1        1     2     3
2  mike   3.5        john  mike  sally
3  sally  6.4        4.1   3.5   6.4
Benefit: similarly compact size to serialized data, but >5x faster to access
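The row-versus-column layouts above can be sketched in plain Scala; this is a simplified illustration with hypothetical names (Shark's actual column buffers also apply compression):

```scala
// Row storage boxes each record as an object; column storage packs
// each field into one array, so scanning a single column touches a
// single primitive array with no per-record object overhead.
object ColumnStoreSketch {
  case class Row(id: Int, name: String, score: Double)

  // Column layout: one array per field, aligned by index.
  case class Columns(ids: Array[Int], names: Array[String],
                     scores: Array[Double])

  def toColumns(rows: Seq[Row]): Columns =
    Columns(rows.map(_.id).toArray, rows.map(_.name).toArray,
            rows.map(_.score).toArray)

  // A column scan reads only the primitive Double array.
  def avgScore(c: Columns): Double = c.scores.sum / c.scores.length
}
```

Averaging the score column of the three example rows touches only the `scores` array, never the names.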
Machine Learning Integration
Unified system for query processing and machine learning
Query processing and ML share the same set of workers and caches
Performance
1.7 TB Real Warehouse Data on 100 EC2 nodes
This Talk
Introduction
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
Why are previous MR-based systems slow?
Disk-based intermediate outputs.
Inferior data format and layout (no control of data co-partitioning).
Execution strategies (lack of optimization based on data statistics).
Task scheduling and launch overhead!
Task Launch Overhead
Hadoop uses heartbeats to communicate scheduling decisions.
Task launch delay of 5-10 seconds.
Spark uses an event-driven architecture and can launch tasks in 5 ms, enabling:
» better parallelism
» easier straggler mitigation
» elasticity
» multi-tenancy resource sharing
This Talk
Hadoop & MapReduce
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.
Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Figure: successive micro-batches (T=1, T=2, …) flow through map and reduceByWindow steps]
[Zaharia et al, HotCloud 2012]
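A local, non-distributed sketch of windowed aggregation over micro-batches follows; the names are hypothetical, and the real reduceByWindow operates over streams of RDDs rather than local maps:

```scala
// Streaming as a series of small batches: each batch yields per-word
// counts, and a reduceByWindow-style step merges the last `window`
// batch results by summing counts per key.
object WindowSketch {
  type Counts = Map[String, Int]

  def countBatch(tweets: List[String]): Counts =
    tweets.flatMap(_.toLowerCase.split("\\s+"))
          .groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Merge the most recent `window` batches' counts.
  def overWindow(batches: List[Counts], window: Int): Counts =
    batches.takeRight(window).foldLeft(Map.empty[String, Int]) {
      (acc, b) => b.foldLeft(acc) { case (m, (w, n)) =>
        m + (w -> (m.getOrElse(w, 0) + n))
      }
    }
}
```

With three one-second batches and a two-batch window, only the counts from the last two batches contribute to the windowed result.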
Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
Alpha released Feb 2013
Conclusion
Spark and Shark speed up your interactive and complex analytics on Hadoop data
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos and YARN
Email: [email protected]
Twitter: @rxin
Weibo: @hashjoin