57
Reynold Xin 辛辛 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Embed Size (px)

Citation preview

Page 1: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Reynold Xin 辛湜 (shi2)

UC Berkeley

Spark and SharkHigh-Speed In-Memory Analyticsover Hadoop and Hive Data

UC BERKELEY

Page 2: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Prof. Michael Franklin

Page 3: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

My BackgroundPhD student in the AMP Lab at UC Berkeley

»50-person lab focusing on big data

Works on the Berkeley Data Analytics System (BDAS)

Work/intern experience at Google Research, IBM DB2, Altera

Page 4: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

AMPLab @ BerkeleyNSF & DARPA funded

Sponsors: Amazon Web Services / Google / SAP

Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies)

Collaboration databases (Mike Franklin – sitting right here), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), machine learning (Michael Jordan).

Page 5: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

What is Spark?Not a modified version of Hadoop

Separate, fast, MapReduce-like engine»In-memory data storage for very fast

iterative queries»General execution graphs and powerful

optimizations»Up to 100x faster than Hadoop MapReduce

Compatible with Hadoop’s storage APIs»Can read/write to any Hadoop-supported

system, including HDFS, HBase, SequenceFiles, etc

Page 6: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

What is Shark?A SQL analytics engine built on top of Spark

Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc)

Similar speedups of up to 100x

Page 7: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Project HistorySpark project started in 2009, open sourced 2010

Shark started summer 2011, open sourced 2012

Spark 0.7 released Feb 2013

Streaming alpha in Spark 0.7

GraphLab support soon

Page 8: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

AdoptionIn use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease)

600+ member meetup, 800+ watchers on GitHub

30+ contributors (including Intel Shanghai)

AMPCamp 150 attendees, 5000 online attendees

AMPCamp on Mar 15, ECNU, Shanghai

Page 9: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY
Page 10: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY
Page 11: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

This TalkIntroduction

Spark

Shark: SQL on Spark

Why is Hadoop MapReduce slow?

Streaming Spark

Page 12: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Why go Beyond MapReduce?MapReduce simplified big data analysis by giving a reliable programming model for large clusters

But as soon as it got popular, users wanted more:

»More complex, multi-stage applications»More interactive ad-hoc queries

Page 13: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Why go Beyond MapReduce?Complex jobs and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharingS

tag

e 3

Sta

ge 2

Sta

ge 1

Iterative algorithm

Query 1

Query 2

Query 3

Interactive data mining

Page 14: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Sta

ge 3

Sta

ge 2

Why go Beyond MapReduce?Complex jobs and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

Sta

ge 1

Iterative algorithm

Query 1

Query 2

Query 3

Interactive data mining

In MapReduce, the only way to share data across jobs is stable storage (e.g.

HDFS) -> slow!

Page 15: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Examples

iter. 1 iter. 2 . . .

Input

HDFSread

HDFSwrite

HDFSread

HDFSwrite

Input

query 1

query 2

query 3

result 1

result 2

result 3

. . .

HDFSread

I/O and serialization can take 90% of the time

Page 16: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

iter. 1 iter. 2 . . .

Input

Goal: In-Memory Data Sharing

Distributedmemory

Input

query 1

query 2

query 3

. . .

one-timeprocessing

10-100× faster than network and disk

Page 17: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Solution: ResilientDistributed Datasets (RDDs)Distributed collections of objects that can be stored in memory for fast reuse

Automatically recover lost data on failure

Support a wide range of applications

Page 18: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Programming ModelResilient distributed datasets (RDDs)

»Immutable, partitioned collections of objects»Can be cached in memory for efficient reuse

Transformations (e.g. map, filter, groupBy, join)

»Build RDDs from other RDDs

Actions (e.g. count, collect, save)»Return a result or write it to storage

Page 19: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Example: Log MiningLoad error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

cachedMsgs = messages.cache()Block 1

Block 2

Block 3

Worker

Worker

Worker

Driver

cachedMsgs.filter(_.contains(“foo”)).count

cachedMsgs.filter(_.contains(“bar”)).count

. . .

tasks

results

Cache 1

Cache 2

Cache 3

Base RDD

Transformed RDD

Action

Result: full-text search of Wikipedia in <1 sec (vs 20

sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec

(vs 170 sec for on-disk data)

Page 20: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1); private Text word = new Text();

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } }}

public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); }}

Page 21: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Word Countmyrdd.flatMap { doc => doc.split(“\s”) } .map { word => (word, 1) } .reduceByKey { case(v1, v2) => v1 + v2 }

myrdd.flatMap(_.split(“\s”)) .map((_, 1)) .reduceByKey(_ + _)

Page 22: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Fault ToleranceRDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g:messages = textFile(...).filter(_.contains(“error”)) .map(_.split(‘\t’)(2))

HadoopRDDpath = hdfs://…

FilteredRDDfunc =

_.contains(...)

MappedRDDfunc = _.split(…)

Page 23: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

1 2 3 4 5 6 7 8 9 100

20406080

100120140

119

57 56 58 5881

57 59 57 59

Iteration

Itera

trio

n t

ime (

s) Failure happens

Fault Recovery Results

Page 24: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Memorybandwidth

Networkbandwidth

Tradeoff Space

Granularityof Updates

Write Throughput

Fine

Coarse

Low High

K-V stores,databases,RAMCloud

Batchanalytics

Transactionalworkloads

HDFS RDDs

Page 25: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY
Page 26: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Behavior with Not Enough RAM

Cache disabled

25% 50% 75% Fully cached

0

20

40

60

80

10068.8

58.1

40.7

29.7

11.5

% of working set in memory

Itera

tion

tim

e (

s)

Page 27: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient}

println("Final w: " + w)

Initial parameter vector

Repeated MapReduce steps

to do gradient descent

Load data in memory once

Page 28: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Logistic Regression Performance

1 5 10 20 300

500

1000

1500

2000

2500

3000

3500

Hadoop

Number of Iterations

Ru

nn

ing

Tim

e (

s) 110 s / iteration

first iteration 80 s

further iterations 1 s

Page 29: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Supported Operatorsmap

filter

groupBy

sort

join

leftOuterJoin

rightOuterJoin

reduce

count

reduceByKey

groupByKey

first

union

cross

sample

cogroup

take

partitionBy

pipe

save

...

Page 30: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

SchedulerDryad-like task DAG

Pipelines functionswithin a stage

Cache-aware datalocality & reuse

Partitioning-awareto avoid shuffles

join

union

groupBy

map

Stage 3

Stage 1

Stage 2

A: B:

C: D:

E:

F:

G:

= previously computed partition

Page 31: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

ImplementationUse Mesos / YARN to share resources with Hadoop & other frameworks

Can access any Hadoop input source (HDFS, S3, …)

SparkHadoo

pMPI

Mesos

Node Node Node Node

Core engine is only 20k lines of code

YARN

Page 32: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

User ApplicationsIn-memory analytics & anomaly detection (Conviva)

Interactive queries on data streams (Quantifind)

Exploratory log analysis (Foursquare)

Traffic estimation w/ GPS data (Mobile Millennium)

Twitter spam classification (Monarch)

. . .

Page 33: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Conviva GeoReport

Group aggregations on many keys w/ same filter

40× gain over Hive from avoiding repeated reading, deserialization and filtering

Spark

Hive

0 2 4 6 8 10 12 14 16 18 20

0.5

20

Time (hours)

Page 34: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Mobile Millennium Project

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

Iterative EM algorithm scaling to 160 nodes

Estimate city traffic from crowdsourced GPS data

Page 35: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

This TalkIntroduction

Spark

Shark: SQL on Spark

Why is Hadoop MapReduce slow?

Streaming Spark

Page 36: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

ChallengesData volumes expanding.

Faults and stragglers complicate parallel database design.

Complexity of analysis: machine learning, graph algorithms, etc.

Low-latency, interactivity.

Page 37: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

MPP Databases Vertica, SAP HANA, Teradata, Google Dremel...

Fast!

Generally not fault-tolerant; challenging for long running queries as clusters scale up.

Lack rich analytics such as machine learning and graph algorithms.

Page 38: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

MapReduceApache Hive, Google Tenzing, Turn Cheetah …

Deterministic, idempotent tasks: enables fine-grained fault-tolerance and resource sharing.

Expressive Machine Learning algorithms.

High-latency, dismissed for interactive workloads.

Page 39: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY
Page 40: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

SharkA data warehouse that

»builds on Spark,»scales out and is fault-tolerant,»supports low-latency, interactive queries

through in-memory computation,»supports both SQL and complex analytics,»is compatible with Apache Hive (storage,

serdes, UDFs, types, metadata).

Page 41: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Hive Architecture

Meta store

HDFS

Client

Driver

SQL Parse

r

Query Optimize

r

Physical Plan

Execution

CLI JDBC

MapReduce

Page 42: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Shark Architecture

Meta store

HDFS

Client

Driver

SQL Parse

r

Physical Plan

Execution

CLI JDBC

Spark

Cache Mgr.

Query Optimize

r

Page 43: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Engine FeaturesColumnar Memory Store

Machine Learning Integration

Partial DAG Execution

Hash-based Shuffle vs Sort-based Shuffle

Data Co-partitioning

Partition Pruning based on Range Statistics

...

Page 44: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Efficient In-Memory StorageSimply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Shark employs column-oriented storage using arrays of primitive types

1

Column Storage

2 3

johnmike

sally

4.1 3.5 6.4

Row Storage

1 john 4.1

2mike

3.5

3sally

6.4

Page 45: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Efficient In-Memory StorageSimply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Shark employs column-oriented storage using arrays of primitive types

1

Column Storage

2 3

johnmike

sally

4.1 3.5 6.4

Row Storage

1 john 4.1

2mike

3.5

3sally

6.4

Benefit: similarly compact size to serialized data,

but >5x faster to access

Page 46: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Machine Learning IntegrationUnified system for query processing and machine learning

Query processing and ML share the same set of workers and caches

Page 47: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Performance

1.7 TB Real Warehouse Data on 100 EC2 nodes

Page 48: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

This TalkIntroduction

Spark

Shark: SQL on Spark

Why is Hadoop MapReduce slow?

Streaming Spark

Page 49: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Why are previous MR-based systems slow?

Disk-based intermediate outputs.

Inferior data format and layout (no control of data co-partitioning).

Execution strategies (lack of optimization based on data statistics).

Task scheduling and launch overhead!

Page 50: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Task Launch Overhead

Hadoop uses heartbeat to communicate scheduling decisions.

Task launch delay 5 - 10 seconds.

Spark uses an event-driven architecture and can launch tasks in 5ms

»better parallelism»easier straggler mitigation»elasticity»multi-tenancy resource sharing

Page 51: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Task Launch Overhead

Page 52: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

This TalkHadoop & MapReduce

Spark

Shark: SQL on Spark

Why is Hadoop MapReduce slow?

Streaming Spark

Page 53: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

What’s Next?Recall that Spark’s model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing

»Track and update state in memory as events arrive

»Large-scale reporting, click analysis, spam filtering, etc

Page 54: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Streaming SparkExtends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermix seamlessly with batch and ad-hoc queries

tweetStream .flatMap(_.toLower.split) .map(word => (word, 1)) .reduceByWindow(“5s”, _ + _)

T=1

T=2

map reduceByWindow

[Zaharia et al, HotCloud 2012]

Page 55: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Streaming SparkExtends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermix seamlessly with batch and ad-hoc queries

tweetStream .flatMap(_.toLower.split) .map(word => (word, 1)) .reduceByWindow(5, _ + _)

T=1

T=2

map reduceByWindow

[Zaharia et al, HotCloud 2012]

Result: can process 42 million records/second

(4 GB/s) on 100 nodes at sub-second latency

Page 56: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

Streaming SparkExtends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermix seamlessly with batch and ad-hoc queries

tweetStream .flatMap(_.toLower.split) .map(word => (word, 1)) .reduceByWindow(5, _ + _)

T=1

T=2

map reduceByWindow

[Zaharia et al, HotCloud 2012]

Alpha released Feb 2013

Page 57: Reynold Xin 辛湜 (shi2) UC Berkeley Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

ConclusionSpark and Shark speed up your interactive and complex analytics on Hadoop data

Download and docs: www.spark-project.org

»Easy to run locally, on EC2, or on Mesos and YARN

Email: [email protected]

Twitter: @rxin

Weibo: @hashjoin