
UC Berkeley

IndexedRDD

Ankur Dave, UC Berkeley AMPLab

Efficient Fine-Grained Updates for RDDs

Immutability in Spark

Query: Sequence of data-parallel bulk transformations on RDDs
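For concreteness, a minimal sketch of such a query in standard Spark (assumes a Spark shell, where `sc` is the SparkContext and the pair-RDD implicits are in scope; the input file name is hypothetical):

// Word count as a sequence of bulk transformations: each step produces
// a new immutable RDD rather than mutating its input.
val lines  = sc.textFile("input.txt")  // hypothetical input file
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)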

Immutability in Spark

[Figure: timeline of map and reduce tasks scheduled across workers (worker ID vs. time)]

Transformations are executed as independent tasks

Immutability in Spark

[Figure: the same task timeline; map task 2 fails and is re-executed as "map task 2 (retry)" on another worker]

Automatic mid-query fault tolerance

Immutability in Spark

[Figure: the same task timeline; a slow "map task 2" is killed and "map task 2 (replica)" is launched on another worker]

Automatic straggler mitigation

Task replayability: Repeated task executions must give the same result

Enforced by input immutability and task determinism

The Need for Efficient Updates

Motivation
•  Spark's core abstraction is bulk transformations of immutable datasets (RDDs)
•  For batch analytics, immutability enables automatic mid-query fault tolerance and straggler mitigation
•  But applications like streaming aggregation depend on efficient fine-grained updates
•  Naive solutions involving mutable state sacrifice the advantages of Spark
•  Instead, we propose to enable efficient updates without sacrificing immutable semantics

Naïve Solutions
1.  Local mutable hash table: partial task failures corrupt local state
2.  Local or external database with atomic updates and snapshots (LevelDB, Cassandra): severely limits program generality
3.  Copy the entire data structure on update: highly wasteful for small updates

Efficient Updates to Immutable Data
Ankur Dave, Joseph Gonzalez, Ion Stoica

username          # followers
@taylorswift13    48,182,689
@ankurdave        206
@mejoeyg          81

@taylorswift13: +1 follower

Purely Functional Data Structures
•  Enable efficient updates without modifying the existing version in any way
•  Immutable semantics: updates return a new copy of the data that internally shares structure with the existing copy (similar to a copy-on-write snapshot)
•  Nodes from old versions are GC'd
•  Different data structures optimize different operations (joins, range scans)
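As a minimal illustration of these semantics, Scala's built-in immutable Map already behaves this way (the values below echo the follower table and are illustrative only):

// `updated` returns a new map that shares most of its internal structure
// with the original; the original map itself is never modified.
val v1 = Map("@taylorswift13" -> 48182689L, "@ankurdave" -> 206L)
val v2 = v1.updated("@taylorswift13", v1("@taylorswift13") + 1)

assert(v1("@taylorswift13") == 48182689L)  // old version unchanged
assert(v2("@taylorswift13") == 48182690L)  // new version sees the update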

Challenges
1.  Garbage collection: tunable branching factor, reference counting
2.  Efficient checkpointing: incremental checkpoints using a data structure diff
3.  Aging data out to disk

Evaluation

[Figure: read-modify-write scaling with 10M existing elements. Log-log plot of total time (ms) vs. read/write size (1 to 10^7 records). Series: hash table with cloning, CoW B-tree, LevelDB with snapshots, patricia trie, red-black tree, and hash table (mutable baseline).]

Related Work
1.  Multiversion Concurrency Control
2.  Copy-on-write B-trees
3.  Stratified Doubling Array

Streaming Aggregation

Incremental Algorithms
[Figure: Old Ranks + Graph Update = New Ranks]

Existing Solutions
Direct mutation:

def f(row: Row) = { row.x += 1; row.y += 1 }

[Figure: timeline in which a map task mutates shared state (x = 0, y = 0 → x = 1, y = 0 → x = 2, y = 1) and a retried map task re-applies the updates to already-updated state]

Problem: Task failures corrupt state

Existing Solutions
Atomic batched database updates:

def f(row: Row) = { BEGIN; row.x += 1; row.y += 1; COMMIT }

[Figure: a Pages task and an Ads task running over Clicks and Ad Views to maintain Popular Pages and Top Ads]

Problem: Long-lived snapshots are uncommon in databases

Existing Solutions
Full copy:

def f(row: Row) = { val newRow = row.clone(); newRow.x += 1; newRow.y += 1; newRow }

Problem: Very inefficient for small updates

Tradeoff

Immutability (for fault tolerance, straggler mitigation, dataset reusability, parallel recovery, …)

vs.

Fine-grained updates (for streaming aggregation, incremental algorithms, …)

Can we have both?

Persistent Data Structures
Support efficient updates without modifying the existing version, using structural sharing (fine-grained copy-on-write)

IndexedRDD
An RDD-based distributed key-value store supporting immutable semantics and efficient updates

[Figure: software stack: IndexedRDD (800 LOC) and PART (1100 LOC) built on top of Spark]

IndexedRDD

[Figure: a cached RDD[T] stores each partition as an Array[T]; an IndexedRDD[K,V] instead stores each partition as a PART[K,V]]

IndexedRDD API

class IndexedRDD[K: KeySerializer, V] extends RDD[(K, V)] {
  // Fine-grained ops -----------------------------------
  def get(k: K): Option[V]
  def multiget(ks: Array[K]): Map[K, V]
  def put(k: K, v: V): IndexedRDD[K, V]
  def multiput(kvs: Map[K, V],
               merge: (K, V, V) => V): IndexedRDD[K, V]
  def delete(ks: Array[K]): IndexedRDD[K, V]

  // Accelerated RDD ops --------------------------------
  def filter(pred: (K, V) => Boolean): IndexedRDD[K, V]

  // Specialized joins ----------------------------------
  def fullOuterJoin[V2, W](other: RDD[(K, V2)])
      (f: (K, Option[V], Option[V2]) => W): IndexedRDD[K, W]
}
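A short usage sketch of this API (based on the spark-indexedrdd package's documented usage; assumes a Spark shell where `sc` is the SparkContext):

import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

// Build an index over a key-value RDD, then read and update single keys.
val rdd     = sc.parallelize((1L to 1000000L).map(x => (x, 0)))
val indexed = IndexedRDD(rdd).cache()

indexed.get(42L)                   // Some(0): point lookup, no full scan
val updated = indexed.put(42L, 1)  // returns a new IndexedRDD
indexed.get(42L)                   // still Some(0): old version unchanged
updated.get(42L)                   // Some(1): new version sees the update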

Persistent Adaptive Radix Trees (PART)

Adaptive Radix Tree
•  Recent work in main-memory indexing for databases
•  256-ary radix tree (trie) with node compression

V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: ARTful indexing for main-memory databases. In ICDE 2013.

Why a Radix Tree?
1.  Sorted-order traversals (unlike hash tables)
2.  Better asymptotic performance than binary search trees for long keys: O(k) vs. O(k log n)
3.  Very efficient union and intersection operations
4.  Predictable performance: no rehashing or rebalancing

Persistent Adaptive Radix Tree (PART)
•  Adds persistence to the adaptive radix tree using path copying (shadowing) and reference counting
•  1100 lines of Java

[Figure: path copying: the new version copies only the nodes on the updated path ("new") and shares all other subtrees with the old version ("old")]
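A minimal sketch of path copying on a simplified 256-ary trie (illustrative only; this is not PART's actual node layout, which adapts node sizes):

sealed trait Trie[V]
case class Leaf[V](value: V) extends Trie[V]
case class Branch[V](children: Vector[Option[Trie[V]]]) extends Trie[V]

def emptyBranch[V]: Branch[V] = Branch(Vector.fill(256)(Option.empty[Trie[V]]))

// Inserting copies only the nodes along the key's path; every other
// subtree is shared, unmodified, between the old and new versions.
def insert[V](t: Trie[V], key: List[Int], value: V): Trie[V] = (t, key) match {
  case (_, Nil) => Leaf(value)
  case (Branch(cs), k :: rest) =>
    val child = cs(k).getOrElse(emptyBranch[V])
    Branch(cs.updated(k, Some(insert(child, rest, value))))  // copy this node only
  case (Leaf(_), _) => insert(emptyBranch[V], key, value)
}

// val v2 = insert(v1, keyBytes, value)  -- v1 remains fully readable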

Batched Updates
Updates can be performed in place if:
1.  The update applies to nodes referenced by only one version, and
2.  That version will never be referenced in the future

[Figure: version diagram: when an intermediate version (v2) will never be read again, the batch producing v3 mutates nodes referenced only by v2 in place instead of copying them]
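A hedged sketch of that rule using reference counts (illustrative; PART's actual bookkeeping differs):

// A node may be mutated directly only when exactly one version references
// it and that version is being consumed by the current batch.
final class MutNode[V](var value: V, var refCount: Int)

def updateNode[V](node: MutNode[V], newValue: V,
                  owningVersionIsDead: Boolean): MutNode[V] =
  if (node.refCount == 1 && owningVersionIsDead) {
    node.value = newValue       // in place: no other version can observe it
    node
  } else {
    new MutNode(newValue, 1)    // otherwise path-copy, leaving the shared node intact
  }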

Microbenchmarks

[Figure: insert throughput, single threaded (millions of inserts/s, 0 to 1.6): PART (batch size 1K), PART (batch size 10M), hash table, red-black tree, B-tree]

Microbenchmarks

[Figure: three panels comparing the same data structures: lookup throughput (millions of lookups/s), scan throughput (millions of elements scanned/s), and memory usage (GB)]

Incremental Checkpointing
Tree partitioning: minimize the number of changed files by segregating frequently-changing nodes from infrequently-changing nodes.

[Figure: tree split into a frequently-changing region and an infrequently-changing region]

Real App: Streaming Aggregation
•  Counting occurrences of 26-character string ids
•  Synthetic data generated on the slaves (8 r3.2xlarge)
•  Loaded with 1 billion keys, then a stream of 100 million keys
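A minimal sketch of the per-batch update using the multiput API above (illustrative only; it collects each batch's counts to the driver for simplicity, which a production job would avoid):

import org.apache.spark.rdd.RDD

// Merge one micro-batch of id occurrences into the persistent index.
// multiput's merge function adds the batch count to the existing count.
def mergeBatch(counts: IndexedRDD[String, Long],
               batch: RDD[String]): IndexedRDD[String, Long] = {
  val batchCounts = batch.map((_, 1L)).reduceByKey(_ + _).collect().toMap
  counts.multiput(batchCounts, (id, oldCount, delta) => oldCount + delta)
}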

Limitations
1.  GC pauses. Future work: off-heap storage with reference counting
2.  Scan performance. Future work: layout-aware allocator

Thanks!

[email protected]

IndexedRDD: https://github.com/amplab/spark-indexedrdd

Use it in your project:

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

libraryDependencies += "amplab" % "spark-indexedrdd" % "0.1"