ClusterOn: Building highly configurable and reusable clustered data services using simple data nodes

Ali Anwar (Virginia Tech), Yue Cheng (Virginia Tech), Hai Huang (IBM Research), Ali R. Butt (Virginia Tech)



Growing data needs → new storage applications


• K-V store: BigTable, Redis, Memcached, DynamoDB
• NoSQL DB: Cassandra, MongoDB, HyperTable, HyperDex
• Object store: Swift, Amazon S3
• DFS: HDFS, LustreFS

A new storage system is born daily*

[Bar chart: number of storage-systems papers in SOSP, OSDI, ATC, and EuroSys conferences in the last decade]
2006: 5, 2007: 6, 2008: 5, 2009: 7, 2010: 9, 2011: 12, 2012: 10, 2013: 16, 2014: 17, 2015: 16

*We exaggerate here, but new storage applications are developed fairly regularly!


Distributed storage systems are notoriously hard to implement!


"Fault-tolerant algorithms are notoriously hard to express correctly, even as pseudo-code. This problem is worse when the code for such an algorithm is intermingled with all the other code that goes into building a complete system…"
(Tushar Chandra, Robert Griesemer, and Joshua Redstone, Paxos Made Live, PODC'07)

Case study: Redis 3.0.1

• replication.c – replicates data from master to slave, or slave to slave
• sentinel.c – monitors nodes and handles failover from master to slave
• cluster.c – supports cluster mode with multiple masters/shards

These files are 20% of the code base (450K of 2,100K in size).

Quantifying LoC (Redis vs. HDFS)

Lines of code:
                                           Redis     HDFS
• Core IO                                 12,955   31,117
• Distributed management                   8,237   56,208
• Etc. (config/auth/stats/compatibility)  20,202   25,185


Abstracting and generalizing common features/functionality can simplify the development of new distributed storage applications.


Steps involved in developing a distributed storage application

(a) Vanilla

void Put(Str key, Obj val) {
  if (this.master):
    Lock(key)
    HashTbl.insert(key, val)
    Unlock(key)
    Sync(master.slaves)
}

Obj Get(Str key) {
  if (this.master):
    Obj val = Quorum(key)
    Sync(master.slaves)
    return val
}

void Lock(Str key)        { ... }  // Acquire lock
void Unlock(Str key)      { ... }  // Release lock
void Sync(Replicas peers) { ... }  // Update replicas
Obj Quorum(Str key)       { ... }  // Select a node

(b) Zookeeper based

void Put(Str key, Obj val) {
  if (this.master):
    zk.Lock(key)    // zookeeper
    HashTbl.insert(key, val)
    zk.Unlock(key)  // zookeeper
    Sync(master.slaves)
}

Obj Get(Str key) {
  if (this.master):
    Obj val = Quorum(key)
    Sync(master.slaves)
    return val
}

void Sync(Replicas peers) { ... }  // Update replicas
Obj Quorum(Str key)       { ... }  // Select a node

(c) Vsync

#include <vsynclib>

void Put(Str key, Obj val) {
  if (this.master):
    zk.Lock(key)    // zookeeper
    HashTbl.insert(key, val)
    zk.Unlock(key)  // zookeeper
    Vsync.Sync(master.slaves)
}

Obj Get(Str key) {
  if (this.master):
    Obj val = Vsync.Quorum(key)
    Vsync.Sync(master.slaves)
    return val
}

(d) ClusterOn

void Put(Str key, Obj val) {
  HashTbl.insert(key, val)
}

Obj Get(Str key) {
  return HashTbl(key)
}

+ ClusterOn = Distributed application
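The core idea behind the (d) ClusterOn variant is that the datalet stays oblivious to distribution, which the middleware supplies. A minimal sketch of that separation, assuming a master-slave topology with synchronous replication (the `Datalet`/`Middleware` class and method names are illustrative, not ClusterOn's actual API):

```python
class Datalet:
    """All the application developer writes: a plain, non-distributed KV store."""
    def __init__(self):
        self.tbl = {}

    def put(self, key, val):
        self.tbl[key] = val

    def get(self, key):
        return self.tbl.get(key)


class Middleware:
    """Wraps one master datalet plus replicas; datalets never see replication."""
    def __init__(self, replication=3):
        self.master = Datalet()
        self.slaves = [Datalet() for _ in range(replication - 1)]

    def put(self, key, val):
        self.master.put(key, val)
        for s in self.slaves:  # synchronous replication ~ strong consistency
            s.put(key, val)

    def get(self, key):
        return self.master.get(key)
```

With a replication factor of 3, a single put lands on the master and both slaves before returning, mirroring the Sync(master.slaves) call that the vanilla version had to hand-write.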

Design goals
• Minimize framework overhead
• Enable effective service differentiation
• Realize reusable distributed storage platform components

Design challenges
• Diversity of applications: replication policies, sharding policies, membership management, failover recovery, client connector
• CAP tradeoffs: latency, throughput, availability
• Consistency: none, strong, eventual
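To give a flavor of one of these dimensions, here is a hedged sketch of a sharding policy: deterministic hash sharding, the simplest form of the sharding policies listed above (the `shard_for` function is hypothetical, not taken from ClusterOn):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard deterministically via a stable hash.

    A stable hash (not Python's per-process randomized hash()) ensures
    every node routes the same key to the same shard.
    """
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Consistent hashing is the usual refinement: it keeps most keys in place when the shard count changes, which plain modulo sharding does not.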

Using ClusterOn

The application developer writes a plain, non-distributed KV store (the datalet):

void Put(Str key, Obj val) {
  HashTbl.insert(key, val)
}

Obj Get(Str key) {
  return HashTbl(key)
}

and supplies bootstrap info:

{
  "Topology": "Master-Slave",
  "Consistency": "Strong",
  "Replication": 3,
  "Sharding": false
}
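The bootstrap info is plain JSON, so middleware can consume it directly. A small sketch of loading it and deriving the topology (the slave-count derivation is illustrative):

```python
import json

# Bootstrap info as shown on the slide.
bootstrap = json.loads("""
{
  "Topology": "Master-Slave",
  "Consistency": "Strong",
  "Replication": 3,
  "Sharding": false
}
""")

# With a replication factor of 3 in a master-slave topology,
# one node is the master and the rest are slaves.
num_slaves = bootstrap["Replication"] - 1
```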

[Figure: ClusterOn architecture. The ClusterOn middleware comprises replication, consistency, topology, load-balancing, auto-scaling, and failover modules plus a metadata server; below it, datalets (kvs instances) provide the storage and can be an object store, file system, database, etc.]

ClusterOn architecture

[Figure: ClusterOn architecture handling a user write request under a master/slave topology with strong consistency. The request passes through the middleware to the master (M), is synced to the slaves (S), and the reply is then returned to the user.]
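The strong-consistency write path can be sketched as: forward to the master, synchronously update each slave, and reply only once every replica holds the value (an illustrative sketch, not ClusterOn code):

```python
def handle_write(master, slaves, key, val):
    """Strong-consistency write: reply only after all replicas are updated."""
    master[key] = val
    acks = 0
    for s in slaves:
        s[key] = val   # synchronous replication to each slave
        acks += 1
    return acks == len(slaves)   # True -> safe to reply to the user
```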

ClusterOn architecture

[Figure: ClusterOn architecture handling failover under a master/slave topology with strong consistency. When a datalet fails, the middleware's failover module brings up a replacement kvs instance as a new slave (S).]
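The failover scenario can be sketched as: promote a surviving slave to master, then bring up fresh replicas re-synced from it so the replication factor is restored (illustrative logic, not ClusterOn's actual failover module):

```python
def failover(slaves, replication):
    """Handle a failed master: promote a slave and restore the replica count."""
    new_master = slaves.pop(0)           # promote the first surviving slave
    while len(slaves) < replication - 1:
        slaves.append(dict(new_master))  # re-sync a fresh replica from it
    return new_master, slaves
```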

Preliminary evaluation
• Proof-of-concept prototype implementation
  – ClusterOn framework (so far: 5,000+ lines of C++, not including header files)
    • Event handling, Protobuf
    • Strong consistency with Zookeeper
  – Storage apps
    • Redis: in-memory KV cache/store
    • LevelDB: persistent KV store
• Measure overhead & scalability

Data forwarding overhead: Redis

[Figure: throughput (10^3 RPS) vs. batch size (0-140) for Direct Redis, Twemproxy + Redis, and ClusterOn + Redis.]

Setup: 1 ClusterOn proxy + 1 Redis backend; memory cache; 16B keys, 32B values; 10 million KV requests; YCSB, 100% GET.

Scaling up: LevelDB

Throughput (10^3 QPS)    SET 1KB   SET 10KB   GET 1KB   GET 10KB
Direct LDB                 24.1       3.2       33.2      23.4
ClusterOn + 1 LDB          22.6       2.9       29.0      22.1
ClusterOn + 2 LDB          40.0       5.5       44.1      42.3
ClusterOn + 4 LDB          72.9      11.0       81.1      57.1

Setup: 1 ClusterOn proxy + N LevelDB backends; persistent store on SATA SSD; 1 replica; 1 million KV requests.
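A quick sanity check of the scaling implied by the numbers above (values copied from the slide): the single-backend proxy overhead and the four-backend speedup for 1KB SETs.

```python
# 10^3 QPS for SET 1KB, from the table above
direct_ldb  = 24.1   # Direct LDB
clusteron_1 = 22.6   # ClusterOn + 1 LDB
clusteron_4 = 72.9   # ClusterOn + 4 LDB

proxy_overhead = 1 - clusteron_1 / direct_ldb   # ~6% slower than direct
speedup_4x = clusteron_4 / clusteron_1          # ~3.2x with 4 backends
```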

Related work

• Chuang, W.-C., Sang, B., Yoo, S., Gu, R., Kulkarni, M., and Killian, C. EventWave: Programming model and runtime support for tightly-coupled elastic cloud applications. In ACM SoCC'13.

• Hunt, G. C., Scott, M. L., et al. The Coign automatic distributed partitioning system. In USENIX OSDI'99.

• Vsync: Consistent data replication for cloud computing. https://vsync.codeplex.com/

Summary
• Modern distributed storage applications share a large portion of common functionality, which can be generalized and abstracted
• ClusterOn can automatically provide core functionality, such as service distribution, to reduce development effort
  – Faster realization of new storage applications
  – Easier code maintenance
  – Flexible and extensible service differentiation
• Questions & contact: Ali Anwar, [email protected], http://research.cs.vt.edu/dssl/
