Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Reynold Xin 辛湜 (shi2)
UC Berkeley
Prof. Michael Franklin
My Background
PhD student in the AMP Lab at UC Berkeley
» 50-person lab focusing on big data
Works on the Berkeley Data Analytics Stack (BDAS)
Work/intern experience at Google Research, IBM DB2, Altera
AMPLab @ Berkeley
NSF & DARPA funded
Sponsors: Amazon Web Services / Google / SAP
Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies)
Collaboration across databases (Mike Franklin, sitting right here), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), and machine learning (Michael Jordan).
What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 100x faster than Hadoop MapReduce
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
What is Shark?
A SQL analytics engine built on top of Spark
Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc)
Similar speedups of up to 100x
Project History
Spark project started in 2009, open sourced 2010
Shark started summer 2011, open sourced 2012
Spark 0.7 released Feb 2013
Streaming alpha in Spark 0.7
GraphLab support soon
Adoption
In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease)
600+ member meetup, 800+ watchers on GitHub
30+ contributors (including Intel Shanghai)
AMPCamp 150 attendees, 5000 online attendees
AMPCamp on Mar 15, ECNU, Shanghai
This Talk
Introduction
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
Why go Beyond MapReduce?
MapReduce simplified big data analysis by giving a reliable programming model for large clusters
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications
» More interactive ad-hoc queries
Why go Beyond MapReduce?
Complex jobs and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
[Figure: a multi-stage iterative algorithm (Stage 1 → Stage 2 → Stage 3) alongside an interactive data-mining session (Query 1, Query 2, Query 3), both reusing intermediate data]
In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS) -> slow!
Examples
[Figure: an iterative job must write to and read from HDFS between every iteration (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …), and each interactive query (query 1, 2, 3) re-reads the same input from HDFS to produce its result]
I/O and serialization can take 90% of the time
Goal: In-Memory Data Sharing
[Figure: input undergoes one-time processing into distributed memory; subsequent iterations and queries (query 1, 2, 3, …) then read from memory instead of disk]
10-100× faster than network and disk
Solution: Resilient Distributed Datasets (RDDs)
Distributed collections of objects that can be stored in memory for fast reuse
Automatically recover lost data on failure
Support a wide range of applications
Programming Model
Resilient distributed datasets (RDDs)
» Immutable, partitioned collections of objects
» Can be cached in memory for efficient reuse
Transformations (e.g. map, filter, groupBy, join)
» Build RDDs from other RDDs
Actions (e.g. count, collect, save)
» Return a result or write it to storage
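The split between transformations and actions can be mimicked with plain Scala collections; the sketch below is only illustrative (local lists stand in for distributed RDDs, and the object and method names are hypothetical, not the Spark API):

```scala
// A minimal sketch of the RDD programming model using plain Scala
// collections in place of distributed data.
object RddModelSketch {
  // Transformations (filter, map) describe new datasets derived from
  // old ones; an action (here, returning the result list) produces a
  // concrete value for the driver program.
  def errorMessages(lines: List[String]): List[String] =
    lines.filter(_.startsWith("ERROR"))   // transformation
         .map(_.split('\t')(2))           // transformation
}
```

For example, `RddModelSketch.errorMessages(List("ERROR\tdisk\tfull", "INFO\tok"))` returns the third tab-separated field of each error line.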
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count          // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver sends tasks to workers, each of which reads a block of the file (Block 1-3), caches its slice of messages (Cache 1-3), and returns results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count
myrdd.flatMap { doc => doc.split("\\s") }
     .map { word => (word, 1) }
     .reduceByKey { case (v1, v2) => v1 + v2 }

myrdd.flatMap(_.split("\\s"))
     .map((_, 1))
     .reduceByKey(_ + _)
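For readers unfamiliar with reduceByKey, the same computation can be written against plain Scala collections; this local sketch (hypothetical object name, not the Spark API) makes the group-then-reduce semantics explicit:

```scala
// Word count over local collections, spelling out what reduceByKey
// does: group (word, 1) pairs by key, then sum each group's values.
object WordCountSketch {
  def wordCount(docs: List[String]): Map[String, Int] =
    docs.flatMap(_.split("\\s+"))          // split documents into words
        .map(word => (word, 1))            // pair each word with a count
        .groupBy(_._1)                     // group the pairs by word...
        .map { case (w, ps) => (w, ps.map(_._2).sum) } // ...and sum
}
```

`WordCountSketch.wordCount(List("a b a"))` yields `Map("a" -> 2, "b" -> 1)`.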
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
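The lineage idea can be sketched as a tiny Scala program; everything here is illustrative (the real RDD classes track partitions and dependencies, not whole in-memory lists):

```scala
// Toy illustration of lineage-based recovery: each derived dataset
// remembers the function and parent used to build it, so lost data can
// be recomputed by replaying functions from stable input.
object LineageSketch {
  sealed trait Dataset { def compute(): List[String] }

  // Stable input (e.g. a file in HDFS).
  case class Source(data: List[String]) extends Dataset {
    def compute(): List[String] = data
  }

  // A derived dataset keeps no materialized copy here: losing it just
  // means replaying `f` over the parent's (re)computed data.
  case class Derived(parent: Dataset, f: List[String] => List[String])
      extends Dataset {
    def compute(): List[String] = f(parent.compute())
  }
}
```

Chaining two `Derived` nodes over a `Source` mirrors the HadoopRDD → FilteredRDD → MappedRDD chain above.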
Fault Recovery Results
[Chart: iteration time (s) over 10 iterations; the first iteration takes 119 s, subsequent iterations ~57-59 s; when a failure happens, one iteration rises to ~81 s while lost partitions are recomputed, then times return to normal]
Tradeoff Space
[Chart: granularity of updates (fine → coarse) vs. write throughput (low → high); K-V stores, databases, and RAMCloud occupy the fine-grained corner serving transactional workloads at network bandwidth, while HDFS and RDDs occupy the coarse-grained, high-throughput corner serving batch analytics, with RDDs reading at memory bandwidth]
Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory]
  Cache disabled   68.8 s
  25% cached       58.1 s
  50% cached       40.7 s
  75% cached       29.7 s
  Fully cached     11.5 s
Example: Logistic Regression

// Load data in memory once
val data = spark.textFile(...).map(readPoint).cache()

// Initial parameter vector
var w = Vector.random(D)

// Repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
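To make the gradient formula concrete, here is a single-machine sketch with one-dimensional points, so the vector operations reduce to plain doubles; the names are hypothetical, and the real Spark version keeps `data` cached across iterations on the cluster:

```scala
// Local logistic-regression sketch with 1-D points (label y is +1/-1).
object LogisticSketch {
  case class Point(x: Double, y: Double)

  // Same per-point term as the slide: (1/(1+e^{-y(w*x)}) - 1) * y * x
  def gradient(data: List[Point], w: Double): Double =
    data.map { p =>
      (1.0 / (1.0 + math.exp(-p.y * (w * p.x))) - 1.0) * p.y * p.x
    }.sum

  // Repeated gradient-descent steps over the same (cached) data.
  def train(data: List[Point], iterations: Int): Double = {
    var w = 0.0
    for (_ <- 1 to iterations) w -= gradient(data, w)
    w
  }
}
```

On linearly separable data such as (x=1, y=+1) and (x=-1, y=-1), `train` drives w to a positive value that classifies both points correctly.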
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1 to 30); Hadoop takes ~110 s per iteration, while Spark takes 80 s for the first iteration (loading data) and ~1 s for each further iteration]
Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
Scheduler
Dryad-like task DAG
Pipelines functions within a stage
Cache-aware data locality & reuse
Partitioning-aware to avoid shuffles
[Figure: an example DAG of RDDs A-G built with groupBy, map, union, and join, cut into Stages 1-3 at shuffle boundaries; shaded boxes mark previously computed partitions that are skipped]
Implementation
Use Mesos / YARN to share resources with Hadoop & other frameworks
Can access any Hadoop input source (HDFS, S3, …)
Core engine is only 20k lines of code
[Figure: Spark, Hadoop, and MPI frameworks running side by side on cluster nodes, managed by Mesos or YARN]
User Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
. . .
Conviva GeoReport
Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated reading, deserialization and filtering
[Chart: time (hours) — Hive: 20, Spark: 0.5]
Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
This Talk
Introduction
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
Challenges
Data volumes expanding.
Faults and stragglers complicate parallel database design.
Complexity of analysis: machine learning, graph algorithms, etc.
Low-latency, interactivity.
MPP Databases
Vertica, SAP HANA, Teradata, Google Dremel...
Fast!
Generally not fault-tolerant; challenging for long-running queries as clusters scale up.
Lack rich analytics such as machine learning and graph algorithms.
MapReduce
Apache Hive, Google Tenzing, Turn Cheetah …
Deterministic, idempotent tasks enable fine-grained fault-tolerance and resource sharing.
Expressive enough for machine learning algorithms.
High latency; dismissed for interactive workloads.
Shark
A data warehouse that
» builds on Spark,
» scales out and is fault-tolerant,
» supports low-latency, interactive queries through in-memory computation,
» supports both SQL and complex analytics,
» is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).
Hive Architecture
[Diagram: a client (CLI, JDBC) talks to a Driver containing the SQL Parser, Query Optimizer, and Physical Plan Execution; queries run as MapReduce jobs over HDFS, with table metadata kept in the Meta Store]
Shark Architecture
[Diagram: same structure as Hive (client, Driver with SQL Parser, Query Optimizer, and Physical Plan Execution, Meta Store, HDFS), but the execution engine is Spark plus a Cache Manager instead of MapReduce]
Engine Features
Columnar Memory Store
Machine Learning Integration
Partial DAG Execution
Hash-based Shuffle vs Sort-based Shuffle
Data Co-partitioning
Partition Pruning based on Range Statistics
...
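As an illustration of one feature above, partition pruning based on range statistics can be sketched as follows; this is a toy model with hypothetical names, not Shark's implementation:

```scala
// Partition pruning sketch: keep per-partition (min, max) statistics
// for a column and skip any partition whose range cannot contain the
// value being filtered for.
object PruningSketch {
  case class Partition(min: Int, max: Int, rows: List[Int])

  // Scan only partitions whose [min, max] range can match `value`.
  def select(parts: List[Partition], value: Int): List[Int] =
    parts.filter(p => value >= p.min && value <= p.max)
         .flatMap(_.rows.filter(_ == value))
}
```

With partitions covering ranges [0, 9] and [10, 19], a query for the value 7 never touches the second partition's rows.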
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types

Row Storage          Column Storage
1  john   4.1        1     2     3
2  mike   3.5        john  mike  sally
3  sally  6.4        4.1   3.5   6.4
Benefit: similarly compact size to serialized data, but >5x faster to access
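The row-versus-column layouts above can be sketched in plain Scala; this is a simplified illustration with hypothetical names (Shark's actual column buffers also apply compression):

```scala
// Row storage boxes each record as an object; column storage packs
// each field into one array, so scanning a single column touches a
// single primitive array with no per-record object overhead.
object ColumnStoreSketch {
  case class Row(id: Int, name: String, score: Double)

  // Column layout: one array per field, aligned by index.
  case class Columns(ids: Array[Int], names: Array[String],
                     scores: Array[Double])

  def toColumns(rows: Seq[Row]): Columns =
    Columns(rows.map(_.id).toArray, rows.map(_.name).toArray,
            rows.map(_.score).toArray)

  // A column scan reads only the primitive Double array.
  def avgScore(c: Columns): Double = c.scores.sum / c.scores.length
}
```

Averaging the score column of the three example rows touches only the `scores` array, never the names.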
Machine Learning Integration
Unified system for query processing and machine learning
Query processing and ML share the same set of workers and caches
Performance
1.7 TB Real Warehouse Data on 100 EC2 nodes
This Talk
Introduction
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
Why are previous MR-based systems slow?
Disk-based intermediate outputs.
Inferior data format and layout (no control of data co-partitioning).
Execution strategies (lack of optimization based on data statistics).
Task scheduling and launch overhead!
Task Launch Overhead
Hadoop uses heartbeats to communicate scheduling decisions.
Task launch delay of 5-10 seconds.
Spark uses an event-driven architecture and can launch tasks in 5 ms, enabling:
» better parallelism
» easier straggler mitigation
» elasticity
» multi-tenancy resource sharing
This Talk
Hadoop & MapReduce
Spark
Shark: SQL on Spark
Why is Hadoop MapReduce slow?
Streaming Spark
What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.
Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Figure: successive micro-batches (T=1, T=2, …) flow through map and reduceByWindow steps]
[Zaharia et al, HotCloud 2012]
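A local, non-distributed sketch of windowed aggregation over micro-batches follows; the names are hypothetical, and the real reduceByWindow operates over streams of RDDs rather than local maps:

```scala
// Streaming as a series of small batches: each batch yields per-word
// counts, and a reduceByWindow-style step merges the last `window`
// batch results by summing counts per key.
object WindowSketch {
  type Counts = Map[String, Int]

  def countBatch(tweets: List[String]): Counts =
    tweets.flatMap(_.toLowerCase.split("\\s+"))
          .groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Merge the most recent `window` batches' counts.
  def overWindow(batches: List[Counts], window: Int): Counts =
    batches.takeRight(window).foldLeft(Map.empty[String, Int]) {
      (acc, b) => b.foldLeft(acc) { case (m, (w, n)) =>
        m + (w -> (m.getOrElse(w, 0) + n))
      }
    }
}
```

With three one-second batches and a two-batch window, only the counts from the last two batches contribute to the windowed result.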
Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
Alpha released Feb 2013
Conclusion
Spark and Shark speed up your interactive and complex analytics on Hadoop data
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos and YARN
Email: [email protected]
Twitter: @rxin
Weibo: @hashjoin