Managing Data in the Cloud
Scaling in the Cloud
[Figure: client sites send requests through a load balancer (proxy) to a tier of replicated app servers, which read and write a MySQL master DB; the master replicates to MySQL slave DBs]
The database becomes the scalability bottleneck: unlike the app-server tier, it cannot leverage elasticity.

Scaling in the Cloud with Key-Value Stores
[Figure: the same client sites, load balancer (proxy), and Apache + app server tier, with the replicated MySQL tier replaced by a distributed key-value store]
CAP Theorem (Eric Brewer)
• “Towards Robust Distributed Systems.” PODC 2000.
• “CAP Twelve Years Later: How the ‘Rules’ Have Changed.” IEEE Computer, 2012.
Key-Value Stores
• Key-value data model (sketched in code below)
– The key is the unique identifier
– The key is the granularity for consistent access
– The value can be structured or unstructured
• Gained widespread popularity
– In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon)
– Open source: HBase, Hypertable, Cassandra, Voldemort
• A popular choice for the modern breed of web applications
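To make the data model concrete, here is a minimal sketch of the get/put interface such stores expose; the class and method names are illustrative, not any particular system's API.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The key is the unit of atomic, consistent access.
        self._data[key] = value

    def get(self, key):
        # The value comes back as an opaque blob; interpreting any
        # structure inside it is the application's job.
        return self._data.get(key)

store = KeyValueStore()
store.put("user:42", b'{"name": "Ada"}')
print(store.get("user:42"))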
Bigtable (Google)
• Data model
– A sparse, persistent, multi-dimensional sorted map
– Data is partitioned across multiple servers
– The map is indexed by a row key, a column key, and a timestamp
– The output value is an uninterpreted array of bytes
– (row: byte[ ], column: byte[ ], time: int64) → byte[ ] (sketched below)
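A minimal sketch of this map, using an in-memory Python dict as a stand-in for the distributed, sorted, tablet-partitioned structure Bigtable actually maintains:

# The Bigtable map as a plain dict: (row, column, timestamp) -> bytes.
# The real system keeps this map sorted by row key and split into
# tablets spread over many servers; this only shows the model's shape.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_latest(row, column):
    # Return the value with the highest timestamp for this (row, column).
    versions = [(t, v) for (r, c, t), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put(b"com.cnn.www", b"contents:", 1, b"<html>v1")
put(b"com.cnn.www", b"contents:", 2, b"<html>v2")
print(read_latest(b"com.cnn.www", b"contents:"))  # b"<html>v2"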
Architecture Overview
• Shared-nothing architecture consisting of thousands of nodes (commodity PCs)
[Figure: Google's Bigtable data model layered on top of the Google File System]
Atomicity Guarantees in Bigtable
• Every read or write of data under a single row key is atomic.
• Objective: make read operations single-sited!
Bigtable's Building Blocks
• Google File System (GFS)
– Highly available distributed file system that stores log and data files
• Chubby
– Highly available, persistent, distributed lock manager
• Tablet servers
– Each handles reads and writes to its tablets and splits tablets that grow too large
– Each tablet is typically 100-200 MB in size
• Master server
– Assigns tablets to tablet servers
– Detects the addition and deletion of tablet servers
– Balances tablet-server load
Overview of Bigtable Architecture
[Figure: the master and Chubby coordinate via control operations and lease management; each tablet server hosts tablets T1, T2, …, Tn and contains master and Chubby proxies, a log manager, and a cache manager; all tablet servers sit on top of the Google File System]
GFS Architectural Design
• A GFS cluster
– A single master
– Multiple chunkservers per master, running on commodity Linux machines
– Accessed by multiple clients
• A file
– Represented as fixed-size chunks (see the offset arithmetic below)
• Each labeled with a 64-bit globally unique ID
• Stored at chunkservers
• 3-way replicated across chunkservers
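A small sketch of how a client turns a byte offset into chunk indexes; the 64 MB chunk size is from the design, while the function names are illustrative rather than the GFS client library's:

CHUNK_SIZE = 64 * 1024 * 1024  # GFS's fixed 64 MB chunk size

def chunk_index(byte_offset):
    # The client maps a file offset to a chunk index locally, then asks
    # the master only for that chunk's ID and replica locations.
    return byte_offset // CHUNK_SIZE

def chunk_range(byte_offset, length):
    # A read may span several chunks; the client can batch all of their
    # location lookups into a single master request.
    return range(chunk_index(byte_offset),
                 chunk_index(byte_offset + length - 1) + 1)

print(list(chunk_range(60 * 1024 * 1024, 10 * 1024 * 1024)))  # [0, 1]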
Architectural Design
[Figure: an application calls the GFS client, which asks the GFS master "chunk location?" and then exchanges "chunk data?" requests directly with GFS chunkservers, each storing chunks in a local Linux file system]
Single-Master Design
• Simple
• The master answers only chunk-location queries
• A client typically asks for multiple chunk locations in a single request
• The master also predictively provides chunk locations immediately following those requested
Metadata
• The master stores three major types
– File and chunk namespaces, persistent in the operation log
– File-to-chunk mappings, persistent in the operation log
– Locations of a chunk's replicas, not persistent
• All kept in memory: fast!
– Quick global scans
• For garbage collection and reorganization
– Only 64 bytes of metadata per 64 MB of data (worked out below)
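At 64 bytes per 64 MB chunk, the master's memory footprint stays small. A back-of-the-envelope check (the ratio is from the slide; the petabyte figure is just an example):

BYTES_PER_CHUNK_META = 64        # metadata per chunk (from the slide)
CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB per chunk

data_size = 1024**5              # example: 1 PB of file data
chunks = data_size // CHUNK_SIZE # about 16.8 million chunks
meta_bytes = chunks * BYTES_PER_CHUNK_META
print(meta_bytes // 1024**3, "GB of master memory")  # 1 GB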
Mutation Operations in GFS
• A mutation is any write or append operation
• The data must be written to all replicas of the chunk
• Guarantee: all replicas apply mutations in the same order, even when multiple clients issue them concurrently (sketched below)
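In GFS this ordering comes from a lease-holding primary replica that assigns serial numbers to mutations. The sketch below shows only that serialization idea, with illustrative names rather than the paper's actual protocol messages:

import itertools

class PrimaryReplica:
    # The lease-holding primary assigns a serial number to each mutation;
    # secondaries apply mutations strictly in that order, so all replicas
    # converge on the same sequence.
    def __init__(self, secondaries):
        self._serial = itertools.count(1)
        self._secondaries = secondaries

    def mutate(self, data):
        n = next(self._serial)      # one global order for this chunk
        for s in self._secondaries:
            s.apply(n, data)        # forwarded with the serial number
        return n

class SecondaryReplica:
    def __init__(self):
        self.log = []

    def apply(self, serial, data):
        self.log.append((serial, data))  # applied in serial-number order

primary = PrimaryReplica([SecondaryReplica(), SecondaryReplica()])
primary.mutate(b"record-1")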
GFS Revisited
• “GFS: Evolution on Fast-Forward,” an interview with the GFS designers, CACM, March 2011.
• The single master was critical for early deployment.
• “the choice to establish 64MB …. was much larger than the typical file-system block size, but only because the files generated by Google's crawling and indexing system were unusually large.”
• As the application mix changed over time, GFS had to deal efficiently with large numbers of files requiring far less than 64 MB (think in terms of Gmail, for example). The problem was not so much the number of files itself as the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design.
GFS Revisited (Cont'd)
• “the initial emphasis in designing GFS was on batch efficiency as opposed to low latency.”
• “The original single-master design: A single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving.”
• Future directions: a distributed master, etc.
• An interesting and entertaining read.
PNUTS Overview
• Data model
– Simple relational model; really a key-value store
– Single-table scans with predicates
• Fault tolerance
– Redundancy at multiple levels: data, metadata, etc.
– Leverages relaxed consistency for high availability: reads and writes despite failures
• Pub/sub message system
– Yahoo! Message Broker for asynchronous updates
Asynchronous replication
Consistency Model
• Hides the complexity of data replication
• Sits between the two extremes:
– One-copy serializability, and
– Eventual consistency
• Key assumption:
– Applications manipulate one record at a time
• Per-record timeline consistency:
– All replicas of a record preserve the update order
Implementation
• A read returns a consistent version
• One replica is designated master (per record)
• All updates are forwarded to that master
• Master designation is adaptive: the replica receiving most of the writes becomes the master (see the sketch below)
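A sketch of per-record mastering with illustrative names: each record's master alone assigns version numbers, which is what gives every replica the same per-record timeline.

class Replica:
    def __init__(self, name):
        self.name = name
        self.versions = {}          # key -> (version, value)

    def apply(self, key, version, value):
        # Replicas apply updates in version order, so every copy of a
        # record moves through the same timeline v.1, v.2, ...
        self.versions[key] = (version, value)

class RecordMaster:
    # One replica is the master *per record*: it serializes all writes
    # to that record by assigning monotonically increasing versions.
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_version = {}      # key -> next version number

    def write(self, key, value):
        v = self.next_version.get(key, 1)
        self.next_version[key] = v + 1
        for r in self.replicas:     # asynchronous in PNUTS (via YMB)
            r.apply(key, v, value)
        return v

master = RecordMaster([Replica("west"), Replica("east")])
master.write("Brian", "v1-data")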
Consistency Model
• Goal: make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Brian”?
[Figure: the record's timeline: an insert creates v.1 of generation 1; successive updates produce v.2 through v.8, and a delete ends the generation; at any moment some replicas hold stale versions while one holds the current version]
• Read: may be served by any replica, so it can return a stale version
• Read up-to-date: returns the current version
• Read ≥ v.6: returns any version at least as new as v.6
• Write: appends a new version to the record's timeline
• Write if = v.7: a test-and-set write; it returns ERROR if the current version is no longer v.7 (the full set of calls is sketched below)
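These access levels correspond to the record-level calls described in the PNUTS paper (read-any, read-critical, read-latest, write, test-and-set-write). The sketch below mimics them on a single in-memory record and is illustrative, not Yahoo!'s actual client library:

class TimelineRecord:
    # Per-record timeline consistency: the record's master assigns
    # version numbers, so versions totally order all writes.
    def __init__(self):
        self.version = 0            # v.0 = not yet inserted
        self.value = None

    def read_latest(self):
        # Always the current version (served by the master replica).
        return self.version, self.value

    def read_critical(self, required_version):
        # Any version at least as new as required_version (e.g. >= v.6);
        # a replica that is too stale must forward or wait.
        if self.version < required_version:
            raise RuntimeError("replica too stale; retry at master")
        return self.version, self.value

    def write(self, value):
        # Appends the next version to the record's timeline.
        self.version += 1
        self.value = value
        return self.version

    def test_and_set_write(self, required_version, value):
        # "Write if = v.7": fails if the record has moved past v.7.
        if self.version != required_version:
            raise RuntimeError(f"ERROR: current version is v.{self.version}")
        return self.write(value)

rec = TimelineRecord()
for i in range(7):
    rec.write(f"update-{i + 1}")    # timeline now at v.7
rec.test_and_set_write(7, "v8")     # succeeds: record was at v.7
# A second test_and_set_write(7, ...) would raise: record is now v.8.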
PNUTS Architecture: Data-Path Components
[Figure: clients issue requests through a REST API; routers locate records and direct reads and writes to storage units; the tablet controller maintains the record-to-tablet mapping; the message broker carries asynchronous updates]
PNUTS Architecture: Regions
[Figure: a local region and remote regions, each with its own clients, REST API, routers, tablet controller, and storage units; YMB (Yahoo! Message Broker) propagates updates between regions]
System Architecture: Key Features
• Pub/sub mechanism: Yahoo! Message Broker
• Physical storage: storage units
• Mapping of records to tablets: tablet controller
• Record location: routers
Highlights of the PNUTS Approach
• Shared-nothing architecture
• Multiple datacenters for geographic distribution
• Timeline consistency and access to stale data
• A publish/subscribe system for reliable, fault-tolerant communication
• Replication with a per-record master
AMAZON’S KEY-VALUE STORE: DYNAMO
Adapted from Amazon’s Dynamo Presentation
Highlights of Dynamo
• High write availability
• Optimistic replication: vector clocks for conflict resolution
• Consistent hashing (as in Chord) in a controlled environment (sketched below)
• Quorums for relaxed consistency
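A minimal consistent-hashing sketch in the spirit of Dynamo's partitioning; the code is illustrative, and Dynamo additionally uses virtual nodes and preference lists:

import bisect
import hashlib

def _hash(s):
    # Place both node names and keys on a ring of 2**128 positions.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self._by_pos = {_hash(n): n for n in nodes}
        self._ring = sorted(self._by_pos)

    def owner(self, key):
        # A key belongs to the first node clockwise from its position,
        # so adding or removing a node only remaps one ring segment.
        i = bisect.bisect_right(self._ring, _hash(key)) % len(self._ring)
        return self._by_pos[self._ring[i]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.owner("cart:12345"))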
TOO MANY CHOICES – WHICH SYSTEM SHOULD I USE?
Cooper et al., SOCC 2010
Benchmarking Serving Systems
• A standard benchmarking tool for evaluating key-value stores: the Yahoo! Cloud Serving Benchmark (YCSB)
• Evaluates different systems on common workloads
• Focuses on performance and scale-out
Benchmark Tiers
• Tier 1: Performance
– Latency as throughput increases
• Tier 2: Scalability
– Latency as the database and system size increase (“scale-out”)
– Latency as we elastically add servers (“elastic speedup”)
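For example, Workload A in the next plot is an update-heavy 50/50 mix. A toy generator for such a YCSB-style workload (illustrative; real YCSB draws keys from configurable distributions such as Zipfian):

import random

def workload_a_op(read_proportion=0.5, record_count=10_000):
    # 50/50 read/update mix over a shared key space; uniform key
    # choice here for brevity.
    key = f"user{random.randrange(record_count)}"
    op = "READ" if random.random() < read_proportion else "UPDATE"
    return (op, key)

ops = [workload_a_op() for _ in range(100_000)]
print(sum(op == "READ" for op, _ in ops) / len(ops))  # about 0.5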
Workload A: Update Heavy (50/50 read/update)
[Figure: average read latency (ms) vs. throughput (ops/sec) for Cassandra, HBase, PNUTS, and MySQL]
Cassandra (based on Dynamo) is optimized for heavy updates; it uses hash partitioning.
Workload B: Read Heavy (95/5 read/update)
[Figure: average read latency (ms) vs. throughput (operations/sec) for Cassandra, HBase, PNUTS, and MySQL]
PNUTS uses MySQL as its storage engine, and MySQL is optimized for read operations.
Workload E: Short Scans (scans of 1-100 records of size 1 KB)
[Figure: average scan latency (ms) vs. throughput (operations/sec) for HBase, PNUTS, and Cassandra]
HBase uses an append-only log, so it is optimized for scans; the same holds for MySQL and hence for PNUTS. Cassandra uses hash partitioning, so its scan performance is poor.
Summary
• Different databases are suitable for different workloads
• Evolving systems: the landscape is changing dramatically
• Active development communities around the open-source systems

Two Approaches to Scalability
• Scale-up
– Classical enterprise setting (RDBMS)
– Flexible ACID transactions
– Transactions in a single node
• Scale-out
– Cloud friendly (key-value stores)
– Execution at a single server
– Limited functionality and guarantees: no multi-row or multi-step transactions
Key-Value Store Lessons
What are the design principles learned?
Design Principles [DNIS 2010]
• Separate system state from application state
– System metadata is critical but small
– Application data has varying needs
– Separation allows the use of different classes of protocols
Design Principles
• Decouple ownership from data storage
– Ownership is exclusive read/write access to data
– Decoupling allows lightweight ownership migration
[Figure: in a classical DBMS, the transaction manager, cache manager, recovery, and storage form a single stack; with decoupled ownership and storage, ownership (multi-step transactions or read/write access) is layered separately above the storage layer]
Design Principles
• Limit most interactions to a single node
– Allows horizontal scaling
– Graceful degradation during failures
– No distributed synchronization
Thanks: Curino et al VLDB 2010
Design Principles
• Limited distributed synchronization is practical
– For maintenance of metadata
– Provide strong guarantees only for data that needs them
Fault Tolerance in the Cloud
• Need to tolerate catastrophic failures
– Geographic replication
• How to support ACID transactions over data replicated at multiple datacenters?
– One-copy serializability: clients can access data in any datacenter; it appears as a single copy with atomic access
Megastore: Entity Groups (Google, CIDR 2011)
• Entity groups are sub-databases
– Static partitioning
– Cheap transactions within an entity group (the common case; sketched in code after the examples below)
– Expensive cross-entity-group transactions (rare)
Megastore Entity Groups: Semantically Predefined
• Email
– Each email account forms a natural entity group
– Operations within an account are transactional: a user's sent message is guaranteed to be observed despite fail-over to another replica
• Blogs
– A user's profile is an entity group
– Operations such as creating a new blog rely on asynchronous messaging with two-phase commit
• Maps
– The globe is divided into non-overlapping patches
– Each patch can be an entity group
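A sketch of the idea that transactions are cheap exactly when all touched rows share an entity group. The key layout here is hypothetical, chosen only to illustrate the grouping rule, and is not Megastore's actual schema:

def entity_group(key):
    # Hypothetical key layout "<group>/<local-id>", e.g.
    # "account:alice/msg:17". The prefix names the entity group.
    return key.split("/", 1)[0]

def can_commit_cheaply(keys):
    # A transaction confined to one entity group takes the cheap
    # common-case path; crossing groups needs the expensive path
    # (two-phase commit or asynchronous messaging).
    return len({entity_group(k) for k in keys}) == 1

print(can_commit_cheaply(["account:alice/msg:1", "account:alice/msg:2"]))  # True
print(can_commit_cheaply(["account:alice/msg:1", "account:bob/msg:9"]))    # False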
Megastore
Slides adapted from the authors' presentation
Google's Spanner: Database Tech That Can Scan the Planet (OSDI 2012)

The Big Picture (OSDI 2012)
[Figure: tablets are stored as logs and SSTables on the Colossus File System; 2PC provides atomicity, Paxos provides consistency, and 2PL with wound-wait provides isolation; movedir performs load balancing; TrueTime is driven by GPS and atomic clocks]
TrueTime
• APIs that provide real time with bounds on error
– Powered by GPS and atomic clocks
• Enforces external consistency (see the commit-wait sketch below)
– If the start of T2 occurs after the commit of T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1
• Concurrency control
– Update transactions: 2PL
– Read-only transactions: use real time to return a consistent snapshot
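TrueTime exposes "now" as an interval [earliest, latest] guaranteed to contain true time. A minimal sketch of that interval and of Spanner's commit-wait rule, with an illustrative uncertainty bound:

import time

class TTInterval:
    # TrueTime returns an interval guaranteed to contain true real time.
    def __init__(self, earliest, latest):
        self.earliest = earliest
        self.latest = latest

EPSILON = 0.007  # illustrative uncertainty bound (a few milliseconds)

def tt_now():
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(apply_txn):
    # Commit wait: choose commit timestamp s = TT.now().latest, then
    # wait until s has definitely passed before making the transaction
    # visible. Any transaction starting after this commit then gets a
    # strictly larger timestamp, which is external consistency.
    s = tt_now().latest
    while tt_now().earliest <= s:
        time.sleep(0.001)
    apply_txn(s)

commit(lambda ts: print(f"committed at {ts:.3f}"))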
Primary References
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
• Ghemawat, Gobioff, Leung: The Google File System. SOSP 2003.
• McKusick, Quinlan: GFS: Evolution on Fast-Forward. Communications of the ACM, 2010.
• Cooper, Ramakrishnan, Srivastava, Silberstein, Bohannon, Jacobsen, Puz, Weaver, Yerneni: PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008.
• DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's Highly Available Key-Value Store. SOSP 2007.
• Cooper, Silberstein, Tam, Ramakrishnan, Sears: Benchmarking Cloud Serving Systems with YCSB. SoCC 2010.