Managing Data in the Cloud
Scaling in the Cloud
[Figure: client sites send requests through a load balancer (proxy) to a tier of replicated app servers, which read and write a MySQL master DB; the master replicates to MySQL slave DBs]
The database becomes the scalability bottleneck: unlike the app-server tier, it cannot leverage elasticity.

Scaling in the Cloud with Key-Value Stores
[Figure: the same client sites, load balancer (proxy), and Apache + app server tier, with the replicated MySQL tier replaced by a distributed key-value store]
CAP Theorem (Eric Brewer)
• “Towards Robust Distributed Systems.” PODC 2000.
• “CAP Twelve Years Later: How the ‘Rules’ Have Changed.” IEEE Computer, 2012.
Key-Value Stores
• Key-value data model (sketched in code below)
– The key is the unique identifier
– The key is the granularity for consistent access
– The value can be structured or unstructured
• Gained widespread popularity
– In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon)
– Open source: HBase, Hypertable, Cassandra, Voldemort
• A popular choice for the modern breed of web applications
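To make the data model concrete, here is a minimal sketch of the get/put interface such stores expose; the class and method names are illustrative, not any particular system's API.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The key is the unit of atomic, consistent access.
        self._data[key] = value

    def get(self, key):
        # The value comes back as an opaque blob; interpreting any
        # structure inside it is the application's job.
        return self._data.get(key)

store = KeyValueStore()
store.put("user:42", b'{"name": "Ada"}')
print(store.get("user:42"))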
Bigtable (Google)
• Data model
– A sparse, persistent, multi-dimensional sorted map
– Data is partitioned across multiple servers
– The map is indexed by a row key, a column key, and a timestamp
– The output value is an uninterpreted array of bytes
– (row: byte[ ], column: byte[ ], time: int64) → byte[ ] (sketched below)
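A minimal sketch of this map, using an in-memory Python dict as a stand-in for the distributed, sorted, tablet-partitioned structure Bigtable actually maintains:

# The Bigtable map as a plain dict: (row, column, timestamp) -> bytes.
# The real system keeps this map sorted by row key and split into
# tablets spread over many servers; this only shows the model's shape.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_latest(row, column):
    # Return the value with the highest timestamp for this (row, column).
    versions = [(t, v) for (r, c, t), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put(b"com.cnn.www", b"contents:", 1, b"<html>v1")
put(b"com.cnn.www", b"contents:", 2, b"<html>v2")
print(read_latest(b"com.cnn.www", b"contents:"))  # b"<html>v2"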
Architecture Overview
• Shared-nothing architecture consisting of thousands of nodes (commodity PCs)
[Figure: Google's Bigtable data model layered on top of the Google File System]
Atomicity Guarantees in Bigtable
• Every read or write of data under a single row key is atomic.
• Objective: make read operations single-sited!
Bigtable's Building Blocks
• Google File System (GFS)
– Highly available distributed file system that stores log and data files
• Chubby
– Highly available, persistent, distributed lock manager
• Tablet servers
– Each handles reads and writes to its tablets and splits tablets that grow too large
– Each tablet is typically 100-200 MB in size
• Master server
– Assigns tablets to tablet servers
– Detects the addition and deletion of tablet servers
– Balances tablet-server load
Overview of Bigtable Architecture
[Figure: the master and Chubby coordinate via control operations and lease management; each tablet server hosts tablets T1, T2, …, Tn and contains master and Chubby proxies, a log manager, and a cache manager; all tablet servers sit on top of the Google File System]
GFS Architectural Design
• A GFS cluster
– A single master
– Multiple chunkservers per master, running on commodity Linux machines
– Accessed by multiple clients
• A file
– Represented as fixed-size chunks (see the offset arithmetic below)
• Each labeled with a 64-bit globally unique ID
• Stored at chunkservers
• 3-way replicated across chunkservers
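A small sketch of how a client turns a byte offset into chunk indexes; the 64 MB chunk size is from the design, while the function names are illustrative rather than the GFS client library's:

CHUNK_SIZE = 64 * 1024 * 1024  # GFS's fixed 64 MB chunk size

def chunk_index(byte_offset):
    # The client maps a file offset to a chunk index locally, then asks
    # the master only for that chunk's ID and replica locations.
    return byte_offset // CHUNK_SIZE

def chunk_range(byte_offset, length):
    # A read may span several chunks; the client can batch all of their
    # location lookups into a single master request.
    return range(chunk_index(byte_offset),
                 chunk_index(byte_offset + length - 1) + 1)

print(list(chunk_range(60 * 1024 * 1024, 10 * 1024 * 1024)))  # [0, 1]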
Architectural Design
[Figure: an application calls the GFS client, which asks the GFS master "chunk location?" and then exchanges "chunk data?" requests directly with GFS chunkservers, each storing chunks in a local Linux file system]
Single-Master Design
• Simple
• The master answers only chunk-location queries
• A client typically asks for multiple chunk locations in a single request
• The master also predictively provides chunk locations immediately following those requested
Metadata
• The master stores three major types
– File and chunk namespaces, persistent in the operation log
– File-to-chunk mappings, persistent in the operation log
– Locations of a chunk's replicas, not persistent
• All kept in memory: fast!
– Quick global scans
• For garbage collection and reorganization
– Only 64 bytes of metadata per 64 MB of data (worked out below)
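At 64 bytes per 64 MB chunk, the master's memory footprint stays small. A back-of-the-envelope check (the ratio is from the slide; the petabyte figure is just an example):

BYTES_PER_CHUNK_META = 64        # metadata per chunk (from the slide)
CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB per chunk

data_size = 1024**5              # example: 1 PB of file data
chunks = data_size // CHUNK_SIZE # about 16.8 million chunks
meta_bytes = chunks * BYTES_PER_CHUNK_META
print(meta_bytes // 1024**3, "GB of master memory")  # 1 GB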
Mutation Operations in GFS
• A mutation is any write or append operation
• The data must be written to all replicas of the chunk
• Guarantee: all replicas apply mutations in the same order, even when multiple clients issue them concurrently (sketched below)
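In GFS this ordering comes from a lease-holding primary replica that assigns serial numbers to mutations. The sketch below shows only that serialization idea, with illustrative names rather than the paper's actual protocol messages:

import itertools

class PrimaryReplica:
    # The lease-holding primary assigns a serial number to each mutation;
    # secondaries apply mutations strictly in that order, so all replicas
    # converge on the same sequence.
    def __init__(self, secondaries):
        self._serial = itertools.count(1)
        self._secondaries = secondaries

    def mutate(self, data):
        n = next(self._serial)      # one global order for this chunk
        for s in self._secondaries:
            s.apply(n, data)        # forwarded with the serial number
        return n

class SecondaryReplica:
    def __init__(self):
        self.log = []

    def apply(self, serial, data):
        self.log.append((serial, data))  # applied in serial-number order

primary = PrimaryReplica([SecondaryReplica(), SecondaryReplica()])
primary.mutate(b"record-1")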
GFS Revisited
• “GFS: Evolution on Fast-Forward,” an interview with the GFS designers, CACM, March 2011.
• The single master was critical for early deployment.
• “the choice to establish 64MB …. was much larger than the typical file-system block size, but only because the files generated by Google's crawling and indexing system were unusually large.”
• As the application mix changed over time, GFS had to deal efficiently with large numbers of files requiring far less than 64 MB (think in terms of Gmail, for example). The problem was not so much the number of files itself as the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design.
GFS Revisited (Cont'd)
• “the initial emphasis in designing GFS was on batch efficiency as opposed to low latency.”
• “The original single-master design: A single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving.”
• Future directions: a distributed master, etc.
• An interesting and entertaining read.
PNUTS Overview
• Data model
– Simple relational model; really a key-value store
– Single-table scans with predicates
• Fault tolerance
– Redundancy at multiple levels: data, metadata, etc.
– Leverages relaxed consistency for high availability: reads and writes despite failures
• Pub/sub message system
– Yahoo! Message Broker for asynchronous updates
Asynchronous replication
Consistency Model
• Hides the complexity of data replication
• Sits between the two extremes:
– One-copy serializability, and
– Eventual consistency
• Key assumption:
– Applications manipulate one record at a time
• Per-record timeline consistency:
– All replicas of a record preserve the update order
Implementation
• A read returns a consistent version
• One replica is designated master (per record)
• All updates are forwarded to that master
• Master designation is adaptive: the replica receiving most of the writes becomes the master (see the sketch below)
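A sketch of per-record mastering with illustrative names: each record's master alone assigns version numbers, which is what gives every replica the same per-record timeline.

class Replica:
    def __init__(self, name):
        self.name = name
        self.versions = {}          # key -> (version, value)

    def apply(self, key, version, value):
        # Replicas apply updates in version order, so every copy of a
        # record moves through the same timeline v.1, v.2, ...
        self.versions[key] = (version, value)

class RecordMaster:
    # One replica is the master *per record*: it serializes all writes
    # to that record by assigning monotonically increasing versions.
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_version = {}      # key -> next version number

    def write(self, key, value):
        v = self.next_version.get(key, 1)
        self.next_version[key] = v + 1
        for r in self.replicas:     # asynchronous in PNUTS (via YMB)
            r.apply(key, v, value)
        return v

master = RecordMaster([Replica("west"), Replica("east")])
master.write("Brian", "v1-data")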
Consistency Model
• Goal: make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Brian”?
[Figure: the record's timeline: an insert creates v.1 of generation 1; successive updates produce v.2 through v.8, and a delete ends the generation; at any moment some replicas hold stale versions while one holds the current version]
• Read: may be served by any replica, so it can return a stale version
• Read up-to-date: returns the current version
• Read ≥ v.6: returns any version at least as new as v.6
• Write: appends a new version to the record's timeline
• Write if = v.7: a test-and-set write; it returns ERROR if the current version is no longer v.7 (the full set of calls is sketched below)
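These access levels correspond to the record-level calls described in the PNUTS paper (read-any, read-critical, read-latest, write, test-and-set-write). The sketch below mimics them on a single in-memory record and is illustrative, not Yahoo!'s actual client library:

class TimelineRecord:
    # Per-record timeline consistency: the record's master assigns
    # version numbers, so versions totally order all writes.
    def __init__(self):
        self.version = 0            # v.0 = not yet inserted
        self.value = None

    def read_latest(self):
        # Always the current version (served by the master replica).
        return self.version, self.value

    def read_critical(self, required_version):
        # Any version at least as new as required_version (e.g. >= v.6);
        # a replica that is too stale must forward or wait.
        if self.version < required_version:
            raise RuntimeError("replica too stale; retry at master")
        return self.version, self.value

    def write(self, value):
        # Appends the next version to the record's timeline.
        self.version += 1
        self.value = value
        return self.version

    def test_and_set_write(self, required_version, value):
        # "Write if = v.7": fails if the record has moved past v.7.
        if self.version != required_version:
            raise RuntimeError(f"ERROR: current version is v.{self.version}")
        return self.write(value)

rec = TimelineRecord()
for i in range(7):
    rec.write(f"update-{i + 1}")    # timeline now at v.7
rec.test_and_set_write(7, "v8")     # succeeds: record was at v.7
# A second test_and_set_write(7, ...) would raise: record is now v.8.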
PNUTS Architecture: Data-Path Components
[Figure: clients issue requests through a REST API; routers locate records and direct reads and writes to storage units; the tablet controller maintains the record-to-tablet mapping; the message broker carries asynchronous updates]
PNUTS Architecture: Regions
[Figure: a local region and remote regions, each with its own clients, REST API, routers, tablet controller, and storage units; YMB (Yahoo! Message Broker) propagates updates between regions]
System Architecture: Key Features
• Pub/sub mechanism: Yahoo! Message Broker
• Physical storage: storage units
• Mapping of records to tablets: tablet controller
• Record location: routers
Highlights of the PNUTS Approach
• Shared-nothing architecture
• Multiple datacenters for geographic distribution
• Timeline consistency and access to stale data
• A publish/subscribe system for reliable, fault-tolerant communication
• Replication with a per-record master
AMAZON’S KEY-VALUE STORE: DYNAMO
Adapted from Amazon’s Dynamo Presentation
Highlights of Dynamo
• High write availability
• Optimistic replication: vector clocks for conflict resolution
• Consistent hashing (as in Chord) in a controlled environment (sketched below)
• Quorums for relaxed consistency
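A minimal consistent-hashing sketch in the spirit of Dynamo's partitioning; the code is illustrative, and Dynamo additionally uses virtual nodes and preference lists:

import bisect
import hashlib

def _hash(s):
    # Place both node names and keys on a ring of 2**128 positions.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self._by_pos = {_hash(n): n for n in nodes}
        self._ring = sorted(self._by_pos)

    def owner(self, key):
        # A key belongs to the first node clockwise from its position,
        # so adding or removing a node only remaps one ring segment.
        i = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._by_pos[self._ring[i]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.owner("cart:12345"))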
TOO MANY CHOICES – WHICH SYSTEM SHOULD I USE?
Cooper et al., SOCC 2010
Benchmarking Serving Systems
• A standard benchmarking tool for evaluating key-value stores: the Yahoo! Cloud Serving Benchmark (YCSB)
• Evaluates different systems on common workloads
• Focuses on performance and scale-out
Benchmark Tiers
• Tier 1: Performance
– Latency as throughput increases
• Tier 2: Scalability
– Latency as the database and system size increase (“scale-out”)
– Latency as we elastically add servers (“elastic speedup”)
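For example, Workload A in the next plot is an update-heavy 50/50 mix. A toy generator for such a YCSB-style workload (illustrative; real YCSB draws keys from configurable distributions such as Zipfian):

import random

def workload_a_op(read_proportion=0.5, record_count=10_000):
    # 50/50 read/update mix over a shared key space; uniform key
    # choice here for brevity.
    key = f"user{random.randrange(record_count)}"
    op = "READ" if random.random() < read_proportion else "UPDATE"
    return (op, key)

ops = [workload_a_op() for _ in range(100_000)]
print(sum(op == "READ" for op, _ in ops) / len(ops))  # about 0.5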
Workload A: Update Heavy (50/50 read/update)
[Figure: average read latency (ms) vs. throughput (ops/sec) for Cassandra, HBase, PNUTS, and MySQL]
Cassandra (based on Dynamo) is optimized for heavy updates; it uses hash partitioning.
Workload B: Read Heavy (95/5 read/update)
[Figure: average read latency (ms) vs. throughput (operations/sec) for Cassandra, HBase, PNUTS, and MySQL]
PNUTS uses MySQL as its storage engine, and MySQL is optimized for read operations.
Workload E: Short Scans (scans of 1-100 records of size 1 KB)
[Figure: average scan latency (ms) vs. throughput (operations/sec) for HBase, PNUTS, and Cassandra]
HBase uses an append-only log, so it is optimized for scans; the same holds for MySQL and hence for PNUTS. Cassandra uses hash partitioning, so its scan performance is poor.
Summary
• Different databases are suitable for different workloads
• Evolving systems: the landscape is changing dramatically
• Active development communities around the open-source systems

Two Approaches to Scalability
• Scale-up
– Classical enterprise setting (RDBMS)
– Flexible ACID transactions
– Transactions in a single node
• Scale-out
– Cloud friendly (key-value stores)
– Execution at a single server
– Limited functionality and guarantees: no multi-row or multi-step transactions
Key-Value Store Lessons
What are the design principles learned?
Design Principles [DNIS 2010]
• Separate system state from application state
– System metadata is critical but small
– Application data has varying needs
– Separation allows the use of different classes of protocols
Design Principles
• Decouple ownership from data storage
– Ownership is exclusive read/write access to data
– Decoupling allows lightweight ownership migration
[Figure: in a classical DBMS, the transaction manager, cache manager, recovery, and storage form a single stack; with decoupled ownership and storage, ownership (multi-step transactions or read/write access) is layered separately above the storage layer]
Design Principles
• Limit most interactions to a single node
– Allows horizontal scaling
– Graceful degradation during failures
– No distributed synchronization
Thanks: Curino et al VLDB 2010
Design Principles
• Limited distributed synchronization is practical
– For maintenance of metadata
– Provide strong guarantees only for data that needs them
Fault Tolerance in the Cloud
• Need to tolerate catastrophic failures
– Geographic replication
• How to support ACID transactions over data replicated at multiple datacenters?
– One-copy serializability: clients can access data in any datacenter; it appears as a single copy with atomic access
Megastore: Entity Groups (Google, CIDR 2011)
• Entity groups are sub-databases
– Static partitioning
– Cheap transactions within an entity group (the common case; sketched in code after the examples below)
– Expensive cross-entity-group transactions (rare)
Megastore Entity Groups: Semantically Predefined
• Email
– Each email account forms a natural entity group
– Operations within an account are transactional: a user's sent message is guaranteed to be observed despite fail-over to another replica
• Blogs
– A user's profile is an entity group
– Operations such as creating a new blog rely on asynchronous messaging with two-phase commit
• Maps
– The globe is divided into non-overlapping patches
– Each patch can be an entity group
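A sketch of the idea that transactions are cheap exactly when all touched rows share an entity group. The key layout here is hypothetical, chosen only to illustrate the grouping rule, and is not Megastore's actual schema:

def entity_group(key):
    # Hypothetical key layout "<group>/<local-id>", e.g.
    # "account:alice/msg:17". The prefix names the entity group.
    return key.split("/", 1)[0]

def can_commit_cheaply(keys):
    # A transaction confined to one entity group takes the cheap
    # common-case path; crossing groups needs the expensive path
    # (two-phase commit or asynchronous messaging).
    return len({entity_group(k) for k in keys}) == 1

print(can_commit_cheaply(["account:alice/msg:1", "account:alice/msg:2"]))  # True
print(can_commit_cheaply(["account:alice/msg:1", "account:bob/msg:9"]))    # False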
Megastore
Slides adapted from the authors' presentation
Google's Spanner: Database Tech That Can Scan the Planet (OSDI 2012)

The Big Picture (OSDI 2012)
[Figure: tablets are stored as logs and SSTables on the Colossus File System; 2PC provides atomicity, Paxos provides consistency, and 2PL with wound-wait provides isolation; movedir performs load balancing; TrueTime is driven by GPS and atomic clocks]
TrueTime
• APIs that provide real time with bounds on error
– Powered by GPS and atomic clocks
• Enforces external consistency (see the commit-wait sketch below)
– If the start of T2 occurs after the commit of T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1
• Concurrency control
– Update transactions: 2PL
– Read-only transactions: use real time to return a consistent snapshot
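TrueTime exposes "now" as an interval [earliest, latest] guaranteed to contain true time. A minimal sketch of that interval and of Spanner's commit-wait rule, with an illustrative uncertainty bound:

import time

class TTInterval:
    # TrueTime returns an interval guaranteed to contain true real time.
    def __init__(self, earliest, latest):
        self.earliest = earliest
        self.latest = latest

EPSILON = 0.007  # illustrative uncertainty bound (a few milliseconds)

def tt_now():
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(apply_txn):
    # Commit wait: choose commit timestamp s = TT.now().latest, then
    # wait until s has definitely passed before making the transaction
    # visible. Any transaction starting after this commit then gets a
    # strictly larger timestamp, which is external consistency.
    s = tt_now().latest
    while tt_now().earliest <= s:
        time.sleep(0.001)
    apply_txn(s)

commit(lambda ts: print(f"committed at {ts:.3f}"))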
Primary References
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
• Ghemawat, Gobioff, Leung: The Google File System. SOSP 2003.
• McKusick, Quinlan: GFS: Evolution on Fast-Forward. Communications of the ACM, 2010.
• Cooper, Ramakrishnan, Srivastava, Silberstein, Bohannon, Jacobsen, Puz, Weaver, Yerneni: PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008.
• DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's Highly Available Key-Value Store. SOSP 2007.
• Cooper, Silberstein, Tam, Ramakrishnan, Sears: Benchmarking Cloud Serving Systems with YCSB. SoCC 2010.