
Page 1: CS 347: Parallel and Distributed Data Management
Notes X: BigTable, HBASE, Cassandra

Page 2: Sources

• HBase: The Definitive Guide, Lars George, O'Reilly Publishers, 2011.

• Cassandra: The Definitive Guide, Eben Hewitt, O’Reilly Publishers, 2011.

• Bigtable: A Distributed Storage System for Structured Data, F. Chang et al., ACM Transactions on Computer Systems, Vol. 26, No. 2, June 2008.


Page 3: Lots of Buzz Words!

• "Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable."

• Clearly, it is buzz-word compliant!!


Page 4: Basic Idea: Key-Value Store


Table T:

key   value
k1    v1
k2    v2
k3    v3
k4    v4

Page 5: Basic Idea: Key-Value Store


Table T (keys are sorted):

key   value
k1    v1
k2    v2
k3    v3
k4    v4

• API (see the sketch below):
  – lookup(key) → value
  – lookup(key range) → values
  – getNext → value
  – insert(key, value)
  – delete(key)
• Each row has a timestamp
• Single-row actions are atomic (but not persistent in some systems?)
• No multi-key transactions
• No query language!
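To make the API concrete, here is a minimal in-memory sketch of such a sorted key-value table (illustrative Python only; the class and method names are made up, not from any of the systems above):

import bisect

class KVTable:
    """Minimal single-node sorted key-value table (no persistence)."""

    def __init__(self):
        self._keys = []          # kept sorted
        self._rows = {}          # key -> (timestamp, value)

    def insert(self, key, value, ts):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = (ts, value)

    def delete(self, key):
        if key in self._rows:
            self._keys.remove(key)
            del self._rows[key]

    def lookup(self, key):
        row = self._rows.get(key)
        return None if row is None else row[1]

    def lookup_range(self, lo, hi):
        """Values for keys in [lo, hi], in key order (supports getNext-style scans)."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return [self._rows[k][1] for k in self._keys[i:j]]

t = KVTable()
t.insert("k1", "v1", ts=1)
t.insert("k2", "v2", ts=2)
print(t.lookup("k1"))               # v1
print(t.lookup_range("k1", "k3"))   # ['v1', 'v2']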

Page 6: Fragmentation (Sharding)


Full table:

key   value
k1    v1
k2    v2
k3    v3
k4    v4
k5    v5
k6    v6
k7    v7
k8    v8
k9    v9
k10   v10

split across servers:

server 1          server 2          server 3
key   value       key   value       key   value
k1    v1          k5    v5          k7    v7
k2    v2          k6    v6          k8    v8
k3    v3                            k9    v9
k4    v4                            k10   v10

• Use a partition vector to decide which server stores which key range (see the sketch below)
• "Auto-sharding": the partition vector is selected automatically
• Each fragment is called a tablet
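A minimal sketch of routing by partition vector (illustrative Python; the split keys and server names are made up, and real systems compare binary row keys rather than these toy strings):

import bisect

# The partition vector is the sorted list of split keys; server i holds
# the key range [splits[i-1], splits[i]).
splits  = ["k5", "k7"]
servers = ["server 1", "server 2", "server 3"]

def server_for(key):
    """Route a key to the server holding its tablet."""
    return servers[bisect.bisect_right(splits, key)]

print(server_for("k3"))   # server 1
print(server_for("k6"))   # server 2
print(server_for("k9"))   # server 3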

Page 7: Tablet Replication

The tablet holding k7..k10 is copied on three servers:

server 3 (primary)   server 4 (backup)   server 5 (backup)

key   value          key   value         key   value
k7    v7             k7    v7            k7    v7
k8    v8             k8    v8            k8    v8
k9    v9             k9    v9            k9    v9
k10   v10            k10   v10           k10   v10

• Cassandra:
  – Replication Factor (# of copies)
  – R/W Rule: One, Quorum, All
  – Policy (e.g., Rack Unaware, Rack Aware, ...)
  – Read all copies (return fastest reply, do repairs if necessary)

• HBase: Does not manage replication, relies on HDFS
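As a rough illustration of the Cassandra-style R/W rules above: a read is guaranteed to overlap the latest write when R + W > N. A small sketch (the helper names are made up, not Cassandra's API):

N = 3  # replication factor: number of copies of each tablet

LEVELS = {"ONE": 1, "QUORUM": N // 2 + 1, "ALL": N}

def overlaps(read_level, write_level):
    """A read is guaranteed to see the latest write when the read and
    write replica sets must intersect, i.e. R + W > N."""
    return LEVELS[read_level] + LEVELS[write_level] > N

print(overlaps("QUORUM", "QUORUM"))  # True  (2 + 2 > 3)
print(overlaps("ONE", "ONE"))        # False (1 + 1 <= 3)
print(overlaps("ONE", "ALL"))        # True  (1 + 3 > 3)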

Page 8: Need a "Directory"

• Maps (table name, key) → the server that stores the key, plus its backup servers

• Can be implemented as a special table.


Page 9: Tablet Internals

memory:
key   value
k3    v3
k8    v8
k9    delete
k15   v15

disk:
key   value        key   value
k2    v2           k4    v4
k6    v6           k5    delete
k9    v9           k10   v10
k12   v12          k20   v20
                   k22   v22

Design Philosophy: Only write to disk using sequential I/O.

Page 10: Tablet Internals

(memory layer and disk segments as on the previous slide)

• Flush the memory layer to disk periodically (minor compaction)
• The tablet is the merge of all layers (files) (see the read-path sketch below)
• Disk segments are immutable
• Writes are efficient; reads are only efficient when all data is in memory
• Use Bloom filters on reads (to skip segments that cannot contain the key)
• Periodically reorganize into a single layer (merge, major compaction)
• A "delete" entry is a tombstone: it masks older values for that key
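A minimal sketch of the layered read path and major compaction described above (illustrative Python, not BigTable/HBase code); the newest layer wins and a tombstone hides older values:

TOMBSTONE = "delete"

# Newest layer first: the in-memory layer, then disk segments, using the
# data from the figure above.
layers = [
    {"k3": "v3", "k8": "v8", "k9": TOMBSTONE, "k15": "v15"},            # memory
    {"k2": "v2", "k6": "v6", "k9": "v9", "k12": "v12"},                 # disk
    {"k4": "v4", "k5": TOMBSTONE, "k10": "v10", "k20": "v20", "k22": "v22"},
]

def lookup(key):
    """Return the current value for key, or None if absent or deleted."""
    for layer in layers:          # a per-segment Bloom filter would let us
        if key in layer:          # skip segments that cannot contain the key
            value = layer[key]
            return None if value == TOMBSTONE else value
    return None

def major_compaction():
    """Merge all layers into one, dropping tombstones and shadowed values."""
    merged = {}
    for layer in reversed(layers):        # oldest first, newer layers overwrite
        merged.update(layer)
    return {k: v for k, v in merged.items() if v != TOMBSTONE}

print(lookup("k9"))                  # None: the tombstone hides the older v9
print(sorted(major_compaction()))    # all live keys, now in a single layer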

Page 11: Column Family

K    A     B     C     D     E
k1   a1    b1    c1    d1    e1
k2   a2    null  c2    d2    e2
k3   null  null  null  d3    e3
k4   a4    b4    c4    e4    e4
k5   a5    b5    null  null  null

Page 12: Column Family

(same table as on the previous slide)

• For storage, treat each row as a single "super value"
• The API provides access to sub-values; use family:qualifier to refer to a sub-value, e.g., price:euros, price:dollars (see the sketch below)
• Cassandra allows a "super column": two-level nesting of columns (e.g., column A can have sub-columns X and Y)
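A toy illustration of family:qualifier addressing (plain Python dicts, not the HBase/Cassandra API; the row contents are made up):

# Each row stores its columns as "family:qualifier" -> value (one "super value").
row_k1 = {
    "price:euros":   9.99,
    "price:dollars": 10.79,
    "info:title":    "some item",
}

def get_cell(row, family, qualifier):
    """Read one sub-value of a row by column family and qualifier."""
    return row.get(f"{family}:{qualifier}")

print(get_cell(row_k1, "price", "euros"))   # 9.99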

Page 13: Vertical Partitions

K    A     B     C     D     E
k1   a1    b1    c1    d1    e1
k2   a2    null  c2    d2    e2
k3   null  null  null  d3    e3
k4   a4    b4    c4    e4    e4
k5   a5    b5    null  null  null

can be manually implemented as

K   A        K   B        K   C        K   D   E
k1  a1       k1  b1       k1  c1       k1  d1  e1
k2  a2       k4  b4       k2  c2       k2  d2  e2
k4  a4       k5  b5       k4  c4       k3  d3  e3
k5  a5                                 k4  e4  e4

Page 14: Vertical Partitions

(same tables as on the previous slide; each vertical partition corresponds to a column family)

• Good for sparse data
• Good for column scans
• Not so good for reading whole tuples
• Are atomic updates to a row still supported?
• API supports actions on the full table, mapped to actions on the column tables (see the sketch below)
• API supports column "project"
• To decide on a vertical partition, you need to know the access patterns
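A sketch (illustrative Python; the table contents come from the example above) of how a full-tuple read and a column project can be mapped onto the per-column-family tables:

# One dict per vertical partition (column family), from the tables above.
cf_A  = {"k1": {"A": "a1"}, "k2": {"A": "a2"}, "k4": {"A": "a4"}, "k5": {"A": "a5"}}
cf_B  = {"k1": {"B": "b1"}, "k4": {"B": "b4"}, "k5": {"B": "b5"}}
cf_C  = {"k1": {"C": "c1"}, "k2": {"C": "c2"}, "k4": {"C": "c4"}}
cf_DE = {"k1": {"D": "d1", "E": "e1"}, "k2": {"D": "d2", "E": "e2"},
         "k3": {"D": "d3", "E": "e3"}, "k4": {"D": "e4", "E": "e4"}}
FAMILIES = [cf_A, cf_B, cf_C, cf_DE]

def read_row(key):
    """Reassemble a full tuple by probing every column-family table."""
    row = {}
    for cf in FAMILIES:
        row.update(cf.get(key, {}))     # nulls simply stay absent
    return row

def project(column):
    """Column 'project': only the partitions holding the column are scanned."""
    return {k: cols[column] for cf in FAMILIES
            for k, cols in cf.items() if column in cols}

print(read_row("k2"))   # {'A': 'a2', 'C': 'c2', 'D': 'd2', 'E': 'e2'}
print(project("B"))     # {'k1': 'b1', 'k4': 'b4', 'k5': 'b5'}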

Page 15: Failure Recovery (BigTable, HBase)

(figure) A tablet server keeps its tablet in memory and uses write-ahead logging: updates are appended to a log stored in GFS or HDFS. A master node pings the tablet servers; if one fails, a spare tablet server takes over and recovers the tablet from the log.

Page 16: Failure Recovery (Cassandra)

• No master node, all nodes in “cluster” equal


(figure: server 1, server 2, server 3)

Page 17: Failure Recovery (Cassandra)

• No master node, all nodes in “cluster” equal


(figure: server 1, server 2, server 3)

• A client can access any table in the cluster at any server
• That server sends requests to the other servers

Page 18: CS 347: Parallel and Distributed Data Management
Notes X: MemCacheD

Page 19: MemCacheD

• General-purpose distributed memory caching system

• Open source


Page 20: What MemCacheD Is

(figure) Three data sources, each with its own cache: data source 1 / cache 1, data source 2 / cache 2, data source 3 / cache 3. The application calls put(cache 1, MyName, x) to store a value and get_object(cache 1, MyName) to read x back.

Page 21: What MemCacheD Is

(figure: same three caches and data sources as on the previous slide)

• Each cache is a hash table of (name, value) pairs
• The cache can purge MyName whenever it wants
• There is no connection between a cache and its data source: the application itself reads from the data source and puts the value into the cache (see the sketch below)
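A sketch of that cache-aside pattern using the pymemcache client (the memcached address and the load_from_data_source helper are assumptions for illustration):

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))      # "cache 1"; address is an assumption

def load_from_data_source(name):
    # placeholder for the real database / service call
    return b"value-for-" + name.encode()

def get_object(name):
    """Cache-aside read: try the cache first, fall back to the data source."""
    value = cache.get(name)
    if value is None:                       # miss (or the cache purged it)
        value = load_from_data_source(name)
        cache.set(name, value, expire=300)  # best effort; may be purged anytime
    return value

print(get_object("MyName"))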

Page 22: What MemCacheD Could Be (but ain't)

(figure) The three caches combined into one logical distributed cache in front of the data sources; a client issues get_object(X) against the cache as a whole.

Page 23: What MemCacheD Could Be (but ain't)

(figure: same distributed cache; the get_object(X) request is routed to the cache node that holds x, and x is returned)

Page 24: Persistence?

(figure) cache 1, cache 2, cache 3: as before, each cache is a hash table of (name, value) pairs, accessed with put(cache 1, MyName, x) and get_object(cache 1, MyName), and each cache can purge MyName whenever it wants.

Page 25: Persistence?

(figure: the same three caches, each now backed by its own DBMS for persistence)

Example: MemcacheDB = memcached + BerkeleyDB

Page 26: CS 347: Parallel and Distributed Data Management
Notes X: ZooKeeper

Page 27: ZooKeeper

• Coordination service for distributed processes
• Provides clients with a high-throughput, highly available, memory-only file system

(figure) Several clients connect to the ZooKeeper service, which exposes a tree of znodes rooted at "/", e.g. /a, /b, /a/c, /a/d, /a/d/e; each znode, such as /a/d/e, has state.

Page 28: ZooKeeper Servers

(figure) Clients connect to one of several ZooKeeper servers; each server keeps a full state replica.

Page 29: ZooKeeper Servers

(figure: same clients and servers; a read is handled by the server the client is connected to, from its local state replica)

Page 30: ZooKeeper Servers

(figure: same clients and servers; a write arrives at one server and is propagated to and synchronized with the other state replicas)

• Writes are totally ordered, using the Zab algorithm

Page 31: Failures

(figure: same clients and servers)

• If your server dies, just connect to a different one!

Page 32: ZooKeeper Notes

• Differences from a file system:
  – all nodes can store data
  – storage size is limited
• API: insert node, read node, read children, delete node, ... (see the sketch below)
• Can set triggers (watches) on nodes
• Clients and servers must know all servers
• ZooKeeper works as long as a majority of servers are available
• Writes are totally ordered; reads are ordered with respect to writes
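A sketch of that API using the kazoo Python client (assuming a ZooKeeper ensemble reachable at 127.0.0.1:2181; paths and data are made up):

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# insert node: create /a/d/e (and intermediate znodes) with some state
zk.create("/a/d/e", b"some state", makepath=True)

# read node and read children
data, stat = zk.get("/a/d/e")
children = zk.get_children("/a/d")

# trigger: watch the znode for the next change
def on_change(event):
    print("znode changed:", event.path)

zk.get("/a/d/e", watch=on_change)

# delete node
zk.delete("/a/d/e")
zk.stop()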


Page 33: ZooKeeper Applications

• Metadata/configuration store
• Lock server
• Leader election (see the sketch below)
• Group membership
• Barrier
• Queue
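For example, kazoo ships ready-made recipes for the lock server and leader election cases (a sketch, same assumed ensemble as above):

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Lock server: a distributed lock backed by znodes
lock = zk.Lock("/locks/my-resource", "client-42")
with lock:          # blocks until the lock is acquired
    pass            # critical section

# Leader election: run leader_work() whenever this client becomes leader
def leader_work():
    print("I am the leader")

election = zk.Election("/election/my-service", "client-42")
election.run(leader_work)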


Page 34: CS 347: Parallel and Distributed Data Management
Notes X: S4

Page 35: Material based on:

Page 36: S4

• Platform for processing unbounded data streams
  – general purpose
  – distributed
  – scalable
  – partially fault tolerant (whatever this means!)

Page 37: Data Stream

Example events in the stream:

[A 34, B abcd, C 15, D TRUE]   [A 3, B def, C 0, D FALSE]   [A 34, B abcd, C 78]

Terminology: event (data record), key, attribute

Question: Can an event have duplicate attributes for the same key? (I think so...)

The stream is unbounded, generated by, say, user queries, purchase transactions, phone calls, sensor readings, ...

Page 38: S4 Processing Workflow

(figure) Events such as [A 34, B abcd, C 15, D TRUE] flow through user-specified "processing units" that form the workflow.

Page 39: Inside a Processing Unit

(figure) Inside a processing unit there is one processing element (PE) per key value: PE1 for key=a, PE2 for key=b, ..., PEn for key=z.

Page 40: Example

• Stream of English quotes
• Produce a sorted list of the top K most frequent words



Page 42: Processing Nodes

(figure) Events are routed to one of m processing nodes by hashing the key (hash(key)=1, ..., hash(key)=m); inside each processing node, PEs for key=a, key=b, ... handle the events.

Page 43: Dynamic Creation of PEs

• As a processing node sees new key attributes, it dynamically creates new PEs to handle them
• Think of PEs as threads (see the sketch below)
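A toy sketch of a processing node for the word-count example (plain Python, not the S4 API): a PE is created lazily the first time a key, here a word, is seen.

class WordCountPE:
    """Processing element: counts occurrences of one word (its key)."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += event.get("count", 1)

class ProcessingNode:
    """Holds one PE per key and creates PEs dynamically as new keys arrive."""
    def __init__(self):
        self.pes = {}                       # key -> PE (think: one thread per key)

    def route(self, event):
        key = event["word"]
        pe = self.pes.get(key)
        if pe is None:                      # first time this key is seen
            pe = self.pes[key] = WordCountPE(key)
        pe.process(event)

node = ProcessingNode()
for w in "to be or not to be".split():
    node.route({"word": w})
print({k: pe.count for k, pe in node.pes.items()})
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}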


Page 44: Another View of Processing Node

Page 45: Failures

• Communication layer detects node failures and provides failover to standby nodes

• What happens to events in transit during a failure? (My guess: events are lost!)


Page 46: How do we do DB operations on top of S4?

• Selects & projects are easy!

• What about joins?

Page 47: What is a Stream Join?

• For a true join, we would need to store all inputs forever! Not practical...

• Instead, define a window join:
  – at time t a new R tuple arrives
  – it only joins with the previous w S tuples

(figure: two input streams, R and S)
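A sketch of a per-key window join PE along the lines of the next slides (illustrative Python; the window size w counts tuples, and whether the window should be global or per key is exactly the question raised below):

from collections import deque

class WindowJoinPE:
    """One PE per join-key value; keeps only the last w tuples of each stream."""
    def __init__(self, w):
        self.r_window = deque(maxlen=w)     # last w R tuples seen
        self.s_window = deque(maxlen=w)     # last w S tuples seen

    def process(self, event):
        if event["rel"] == "R":
            self.r_window.append(event)
            return [(event, s) for s in self.s_window]   # join with recent S
        else:                                            # event["rel"] == "S"
            self.s_window.append(event)
            return [(r, event) for r in self.r_window]   # join with recent R

pe = WindowJoinPE(w=2)
pe.process({"rel": "S", "C": 0})
pe.process({"rel": "S", "C": 15})
print(pe.process({"rel": "R", "C": 15}))    # joins with both buffered S tuples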

Page 48: One Idea for Window Join

(figure: streams R and S feed events such as [key 2, rel R, C 15, D TRUE] and [key 2, rel S, C 0, D FALSE] into one PE per key; "key" is the join attribute)

code for PE:
  for each event e:
    if e.rel = R [ store e in Rset (last w)
                   for s in Sset: output join(e, s) ]
    else ...   (symmetric for S)

Page 49: One Idea for Window Join

(same figure and PE code as on the previous slide)

Is this right??? (This enforces the window on a per-key-value basis.)

Maybe add sequence numbers to events to enforce the correct window?

Page 50: Another Idea for Window Join

(figure: streams R and S; all R & S events carry the same key, key=fake, e.g. [key fake, rel R, C 15, D TRUE] and [key fake, rel S, C 0, D FALSE]; say the join key is C)

code for PE:
  for each event e:
    if e.rel = R [ store e in Rset (last w)
                   for s in Sset: if e.C = s.C then output join(e, s) ]
    else if e.rel = S ...   (symmetric)

Page 51: Another Idea for Window Join

(same figure and PE code as on the previous slide)

Since all R & S events have key=fake, the entire join is done in one PE: no parallelism.

Page 52: Do You Have a Better Idea for Window Join?

(figure: streams R and S, events [key ?, rel R, C 15, D TRUE] and [key ?, rel S, C 0, D FALSE]; what should the key be?)

Page 53: Final Comment: Managing State of a PE

• Who manages the state? S4: the user does. Mupet: the system does.
• Is the state persistent?

(figure: a PE processing events such as [A 34, B abcd, C 15, D TRUE] while keeping internal state)

Page 54: CS 347: Parallel and Distributed Data Management
Notes X: Hyracks

Page 55: Hyracks

• Generalization of map-reduce
• Infrastructure for "big data" processing
• Material here based on a paper that appeared in ICDE 2011

Page 56: A Hyracks Data Object

• Records partitioned across N sites

• Simple record schema is available (more than just key-value)

(figure: the records of one data object partitioned across site 1, site 2, site 3)

Page 57: Operations

(figure: an operator together with a distribution rule for its output)

Page 58: Operations (parallel execution)

(figure: several instances of the operator, op 1, op 2, op 3, running in parallel)

Page 59: Example: Hyracks Specification

Page 60: Example: Hyracks Specification

(figure annotations: initial partitions; input & output is a set of partitions; mapping to new partitions; 2 replicated partitions)

Page 61: Notes

• Job specification can be done manually or automatically

Page 62: Example: Activity Node Graph

Page 63: Example: Activity Node Graph

(figure annotations: activities, connected by sequencing constraints)

Page 64: Example: Parallel Instantiation

Page 65: Example: Parallel Instantiation

(figure annotation: a stage starts after its input stages finish)

Page 66: System Architecture

Page 67: Library of Operators

• File readers/writers
• Mappers
• Sorters
• Joiners (various types)
• Aggregators
• Can add more

Page 68: Library of Connectors

• N:M hash partitioner
• N:M hash-partitioning merger (input sorted)
• N:M range partitioner (with partition vector)
• N:M replicator
• 1:1
• Can add more!

Page 69: Hyracks Fault Tolerance: Work in progress?

Page 70: Hyracks Fault Tolerance: Work in progress?

(figure: each operator saves its output to files)

One approach: save partial results to persistent storage; after a failure, redo all the work needed to reconstruct the missing data.

Page 71: Hyracks Fault Tolerance: Work in progress?

Can we do better? Maybe: each process retains previous results until no longer needed?

(figure) Pi output: r1, r2, r3, r4, r5, r6, r7, r8; the earlier results have already "made their way" into the final result, the later ones are the current output.

Page 72: CS 347: Parallel and Distributed Data Management
Notes X: Pregel

Page 73: Material based on:

• The Pregel paper, which appeared in SIGMOD 2010
• Note there is an open-source version of Pregel called Giraph

Page 74: Pregel

• A computational model/infrastructure for processing large graphs

• Prototypical example: Page Rank

(figure: a small directed graph with vertices a, b, d, e, f, x, where a and b point to x)

PR[i+1, x] = f( PR[i, a]/n_a , PR[i, b]/n_b )   (n_a, n_b = number of out-edges of a and b)

Page 75: Pregel

• Synchronous computation in iterations
• In one iteration, each node:
  – gets messages from neighbors
  – computes
  – sends data to neighbors

(same graph and PageRank formula as on the previous slide)

Page 76: Pregel vs Map-Reduce/S4/Hyracks/...

• In Map-Reduce, S4, Hyracks, ..., the workflow is separate from the data


• In Pregel, data (graph) drives data flow

Page 77: Pregel Motivation

• Many applications require graph processing

• Map-Reduce and other workflow systems are not a good fit for graph processing

• Need to run graph algorithms on many processors


Page 78: Example of Graph Computation

Page 79: Termination

• After each iteration, each vertex votes to halt or not
• If all vertices vote to halt, the computation terminates


Page 80: Vertex Compute (simplified)

• Available data for iteration i:
  – InputMessages: { [from, value] }
  – OutputMessages: { [to, value] }
  – OutEdges: { [to, value] }
  – MyState: value

• OutEdges and MyState are remembered for next iteration


Page 81: Max Computation

• change := false
• for [f, w] in InputMessages do
    if w > MyState.value then [ MyState.value := w; change := true ]
• if (superstep = 1) OR change
    then for [t, w] in OutEdges do add [t, MyState.value] to OutputMessages
    else vote to halt


Page 82: Page Rank Example

Page 83: Page Rank Example

(figure annotations on the PageRank compute code: the iteration count, the loop iterating through InputMessages, MyState.value, and a shorthand for sending the same message to all out-edges)
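Since the slide's code is not reproduced in this transcript, here is a sketch of a Pregel-style PageRank compute using the data of Page 80 (the Vertex class, damping factor 0.85, and the fixed superstep limit are assumptions):

from dataclasses import dataclass, field

DAMPING = 0.85
MAX_SUPERSTEPS = 30

@dataclass
class Vertex:
    my_state: float                     # MyState: current PageRank value
    out_edges: list                     # OutEdges: [(to, value)]
    input_messages: list = field(default_factory=list)   # [(from, value)]
    output_messages: list = field(default_factory=list)  # [(to, value)]
    halted: bool = False

def pagerank_compute(v, superstep, num_vertices):
    """One PageRank superstep for one vertex, in the style of Page 80."""
    if superstep > 0:
        total = sum(value for _frm, value in v.input_messages)
        v.my_state = (1 - DAMPING) / num_vertices + DAMPING * total
    if superstep < MAX_SUPERSTEPS:
        # shorthand on the slide: send the same message along every out-edge
        share = v.my_state / len(v.out_edges)        # assumes at least one out-edge
        v.output_messages = [(to, share) for to, _val in v.out_edges]
    else:
        v.halted = True                              # vote to halt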

Page 84: Single-Source Shortest Paths
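The slide's code is also not reproduced; the standard Pregel single-source shortest paths compute looks roughly like this (a sketch reusing the Vertex class above; INF and the is_source flag are assumptions):

INF = float("inf")

def sssp_compute(v, is_source):
    """One shortest-paths superstep; v.my_state starts at INF for every vertex."""
    min_dist = 0 if is_source else INF
    for _frm, dist in v.input_messages:
        min_dist = min(min_dist, dist)
    if min_dist < v.my_state:
        v.my_state = min_dist                          # found a shorter path
        v.output_messages = [(to, min_dist + weight)   # relax every out-edge
                             for to, weight in v.out_edges]
    else:
        v.output_messages = []
        v.halted = True                                # vote to halt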

Page 85: Architecture

(figure) A master and workers A, B, C; the input data (input data 1, input data 2) holds the graph, which has nodes a, b, c, d, ...; a sample input record is [a, value].

Page 86: Architecture

(figure) Partition the graph and assign it to workers: worker A gets vertices a, b, c; worker B gets d, e; worker C gets f, g, h.

Page 87: Architecture

(figure: same assignment) The workers read the input data; worker A forwards input values to the appropriate workers.

Page 88: Architecture

(figure: same workers) Run superstep 1.

Page 89: Architecture

(figure: same workers) At the end of superstep 1, the workers send messages; the master checks: halt?

Page 90: Architecture

(figure: same workers) Run superstep 2.

Page 91: Architecture

(figure: same workers) Checkpoint.

Page 92: Architecture

(figure: same workers) Checkpoint: each worker writes to stable store its vertices' MyState, OutEdges, and InputMessages (or OutputMessages).

Page 93: Architecture

(figure: worker C is gone) If a worker dies, find a replacement and restart from the latest checkpoint.

Page 94: Architecture

(figure: the full configuration again, with all three workers, their vertices, and the input data)

Page 95: Interesting Challenge

• How best to partition graph for efficiency?

(figure: a small graph with vertices a, b, d, e, f, g, x)

Page 96: CS 347: Parallel and Distributed Data Management
Notes X: Kestrel

Page 97: Kestrel

• Kestrel server handles a set of reliable, ordered message queues

• A server does not communicate with other servers (advertised as good for scalability!)

(figure: clients talking to two independent Kestrel servers, each holding its own queues)

Page 98: Kestrel

• Kestrel server handles a set of reliable, ordered message queues

• A server does not communicate with other servers (advertised as good for scalability!)

(figure: same clients and servers; a client issues "put q" to add an item to queue q on one server and "get q" to remove the next item; each server journals its queues to a local log)
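Kestrel speaks a memcached-compatible protocol, so the put q / get q operations above can be sketched with an ordinary memcache client (an assumption about the deployment; the host name, port 22133, and the queue name are illustrative):

from pymemcache.client.base import Client

kestrel = Client(("kestrel-host", 22133))   # assumed Kestrel memcache port

# put q: enqueue an item on queue "q"
kestrel.set("q", b"job-123")

# get q: dequeue the next item (None if the queue is empty)
item = kestrel.get("q")
print(item)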