Data Serving in the Cloud. Raghu Ramakrishnan (Chief Scientist, Audience and Cloud Computing), Brian Cooper, Adam Silberstein, Utkarsh Srivastava; Yahoo! Research. Joint work with the Sherpa team in Cloud Computing.

Ramakrishnan Keynote Ladis2009


DESCRIPTION

We are in the midst of a computing revolution. As the cost of provisioning hardware and software stacks grows, and the cost of securing and administering these complex systems grows even faster, we're seeing a shift towards computing clouds. For cloud service providers, there is efficiency from amortizing costs and averaging usage peaks. Internet portals like Yahoo! have long offered application services, such as email for individuals and organizations. Companies are now offering services such as storage and compute cycles, enabling higher-level services to be built on top. In this talk, I will discuss Yahoo!'s vision of cloud computing, and describe some of the key initiatives, highlighting the technical challenges involved in designing hosted, multi-tenanted data management systems.


Page 1: Ramakrishnan Keynote Ladis2009

1

Data Serving in the Cloud

Raghu Ramakrishnan, Chief Scientist, Audience and Cloud Computing

Brian Cooper, Adam Silberstein, Utkarsh Srivastava

Yahoo! Research

Joint work with the Sherpa team in Cloud Computing

Page 2: Ramakrishnan Keynote Ladis2009

2

Outline

• Clouds
• Scalable serving—the new landscape
  – Very Large Scale Distributed systems (VLSD)
• Yahoo!’s PNUTS/Sherpa
• Comparison of several systems
  – Preview of upcoming Y! Cloud Serving (YCS) benchmark

Page 3: Ramakrishnan Keynote Ladis2009

3

Types of Cloud Services

• Two kinds of cloud services:
  – Horizontal (“Platform”) Cloud Services
    • Functionality enabling tenants to build applications or new services on top of the cloud
  – Functional Cloud Services
    • Functionality that is useful in and of itself to tenants, e.g. various SaaS instances such as Salesforce.com, Google Analytics, and Yahoo!’s IndexTools; Yahoo! properties aimed at end users and small businesses, e.g. Flickr, Groups, Mail, News, Shopping
    • Could be built on top of horizontal cloud services or from scratch
    • Yahoo! has been offering these for a long while (e.g., Mail for SMB, Groups, Flickr, BOSS, Ad exchanges, YQL)

Page 4: Ramakrishnan Keynote Ladis2009

5

Yahoo! Horizontal Cloud Stack

(Diagram: layers of horizontal cloud services, with Provisioning (Self-serve) and Monitoring/Metering/Security as cross-cutting concerns)

• EDGE: YCS, YCPI, Brooklyn, …
• WEB: VM/OS, yApache
• APP: VM/OS
• OPERATIONAL STORAGE: PNUTS/Sherpa, MObStor
• BATCH STORAGE: Hadoop
• Also shown: Data Highway, Serving Grid, PHP App Engine

Page 5: Ramakrishnan Keynote Ladis2009

6

Cloud-Power @ Yahoo!

• Ads Optimization
• Content Optimization
• Search Index
• Image/Video Storage & Delivery
• Machine Learning (e.g., spam filters)
• Attachment Storage

Page 6: Ramakrishnan Keynote Ladis2009

7

Yahoo!’s Cloud: Massive Scale, Geo-Footprint

• Massive user base and engagement
  – 500M+ unique users per month
  – Hundreds of petabytes of storage
  – Hundreds of billions of objects
  – Hundreds of thousands of requests/sec

• Global
  – Tens of globally distributed data centers
  – Serving each region at low latencies

• Challenging users
  – Downtime is not an option (outages cost $millions)
  – Very variable usage patterns

Page 7: Ramakrishnan Keynote Ladis2009

8

New in 2010!

• SIGMOD and SIGOPS are starting a new annual conference (co-located with SIGMOD in 2010):
  ACM Symposium on Cloud Computing (SoCC)
  PC Chairs: Surajit Chaudhuri & Mendel Rosenblum; GC: Joe Hellerstein; Treasurer: Brian Cooper
• Steering committee: Phil Bernstein, Ken Birman, Joe Hellerstein, John Ousterhout, Raghu Ramakrishnan, Doug Terry, John Wilkes

Page 8: Ramakrishnan Keynote Ladis2009

9

VERY LARGE SCALE DISTRIBUTED (VLSD) DATA SERVING

ACID or BASE? Litmus tests are colorful, but the picture is cloudy

Page 10: Ramakrishnan Keynote Ladis2009

11

Web Data Management

• Large data analysis (Hadoop)
  – Warehousing
  – Scan-oriented workloads
  – Focus on sequential disk I/O
  – $ per CPU cycle

• Structured record storage (PNUTS/Sherpa)
  – CRUD
  – Point lookups and short scans
  – Index-organized table and random I/Os
  – $ per latency

• Blob storage (MObStor)
  – Object retrieval and streaming
  – Scalable file storage
  – $ per GB of storage & bandwidth

Page 11: Ramakrishnan Keynote Ladis2009

12

The World Has Changed

• Web serving applications need:
  – Scalability!
    • Preferably elastic
  – Flexible schemas
  – Geographic distribution
  – High availability
  – Reliable storage
• Web serving applications can do without:
  – Complicated queries
  – Strong transactions
    • But some form of consistency is still desirable

Page 12: Ramakrishnan Keynote Ladis2009

13

Typical Applications

• User logins and profiles
  – Including changes that must not be lost!
  – But single-record “transactions” suffice

• Events
  – Alerts (e.g., news, price changes)
  – Social network activity (e.g., user goes offline)
  – Ad clicks, article clicks

• Application-specific data
  – Postings in message boards
  – Uploaded photos, tags
  – Shopping carts

Page 13: Ramakrishnan Keynote Ladis2009

14

Data Serving in the Y! Cloud

(Example: the FredsList.com application, built on simple Web Service APIs in front of the Y! cloud services)

• Database: PNUTS/Sherpa
• Search: Vespa
• Messaging: Tribble
• Storage: MObStor (foreign key: photo → listing)
• Compute: Grid (batch export)
• Caching: memcached (ALTER Listings MAKE CACHEABLE)

DECLARE DATASET Listings AS
  ( ID String PRIMARY KEY,
    Category String,
    Description Text )

Example records:
  1234323, transportation, For sale: one bicycle, barely used
  5523442, childcare, Nanny available in San Jose
  32138, camera, Nikon D40, USD 300

Page 14: Ramakrishnan Keynote Ladis2009

15

VLSD Data Serving Stores

• Must partition data across machines
  – How are partitions determined?
  – Can partitions be changed easily? (Affects elasticity)
  – How are read/update requests routed?
  – Range selections? Can requests span machines?
• Availability: What failures are handled?
  – With what semantic guarantees on data access?
• (How) Is data replicated?
  – Sync or async? Consistency model? Local or geo?
• How are updates made durable?
• How is data stored on a single machine?

Page 15: Ramakrishnan Keynote Ladis2009

16

The CAP Theorem

• You have to give up one of the following in a distributed system (Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002):
  – Consistency of data
    • Think serializability
  – Availability
    • Pinging a live node should produce results
  – Partition tolerance
    • Live nodes should not be blocked by partitions

Page 16: Ramakrishnan Keynote Ladis2009

17

Approaches to CAP

• “BASE”
  – No ACID; use a single version of the DB, reconcile later
• Defer transaction commit
  – Until partitions are fixed and the distributed transaction can run
• Eventual consistency (e.g., Amazon Dynamo)
  – Eventually, all copies of an object converge
• Restrict transactions (e.g., sharded MySQL)
  – 1-machine transactions: objects in a transaction are on the same machine
  – 1-object transactions: a transaction can only read/write one object
• Object timelines (PNUTS)

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

Page 17: Ramakrishnan Keynote Ladis2009

18

“I want a big, virtual database”

“What I want is a robust, high performance virtual relational database that runs transparently over a cluster, nodes dropping in and out of service at will, read-write replication and data migration all done automatically.

I want to be able to install a database on a server cloud and use it like it was all running on one machine.”

-- Greg Linden’s blog

Page 18: Ramakrishnan Keynote Ladis2009

19

PNUTS / SHERPA
To Help You Scale Your Mountains of Data

Y! CCDI

Page 19: Ramakrishnan Keynote Ladis2009

20

Yahoo! Serving Storage Problem

• Small records – 100KB or less
• Structured records – lots of fields, evolving
• Extreme data scale – tens of TB
• Extreme request scale – tens of thousands of requests/sec
• Low latency globally – 20+ datacenters worldwide
• High availability – outages cost $millions
• Variable usage patterns – applications and users change

20

Page 20: Ramakrishnan Keynote Ladis2009

21

What is PNUTS/Sherpa?

• A parallel database with geographic replication
• Structured, flexible schema
• Hosted, managed infrastructure

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR,
  …
)

(Diagram: a table of records such as A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E, replicated across regions)

21

Page 21: Ramakrishnan Keynote Ladis2009

22

What Will It Become?

• Indexes and views

(Diagram: the same replicated table, augmented with indexes and views maintained alongside it)

Page 22: Ramakrishnan Keynote Ladis2009

23

Design Goals

Scalability
• Thousands of machines
• Easy to add capacity
• Restrict the query language to avoid costly queries

Geographic replication
• Asynchronous replication around the globe
• Low-latency local access

High availability and fault tolerance
• Automatically recover from failures
• Serve reads and writes despite failures

Consistency
• Per-record guarantees
• Timeline model
• Option to relax if needed

Multiple access paths
• Hash table, ordered table
• Primary, secondary access

Hosted service
• Applications plug and play
• Share operational cost

Page 23: Ramakrishnan Keynote Ladis2009

24

Technology Elements

Applications access the system through the PNUTS API or the Tabular API (YCA: Authorization)

PNUTS
• Query planning and execution
• Index maintenance

Distributed infrastructure for tabular data
• Data partitioning
• Update consistency
• Replication

YDOT FS: ordered tables
YDHT FS: hash tables
Tribble: pub/sub messaging
Zookeeper: consistency service

24

Page 24: Ramakrishnan Keynote Ladis2009

25

PNUTS: Key Components

• Tablet controller
  – Maintains the map from database.table.key to tablet to storage unit (SU)
  – Provides load balancing
• Routers
  – Cache the maps from the tablet controller (TC)
  – Route client requests to the correct SU
• Storage units
  – Store records
  – Service get/set/delete requests

25

Page 25: Ramakrishnan Keynote Ladis2009

26

Detailed Architecture

(Diagram: clients call a REST API; routers consult the tablet controller and forward requests to storage units in the local region; Tribble carries updates between the local region and remote regions)

26

Page 26: Ramakrishnan Keynote Ladis2009

27

DATA MODEL

27

Page 27: Ramakrishnan Keynote Ladis2009

28

Data Manipulation

• Per-record operations
  – Get
  – Set
  – Delete
• Multi-record operations
  – Multiget
  – Scan
  – Getrange
• Web service (RESTful) API

28
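A hedged sketch of what these per-record calls might look like from an application over a RESTful web-service API; the base URL, paths, and JSON encoding are illustrative assumptions, not the actual PNUTS interface:

import json
import urllib.request

BASE = "http://sherpa.example.com/v1"   # hypothetical web-service endpoint

def get(table, key):
    # Per-record read.
    with urllib.request.urlopen(f"{BASE}/{table}/{key}") as resp:
        return json.load(resp)

def set_record(table, key, record):
    # Per-record write: PUT the full record as JSON.
    req = urllib.request.Request(
        f"{BASE}/{table}/{key}",
        data=json.dumps(record).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()

def delete(table, key):
    # Per-record delete.
    urllib.request.urlopen(urllib.request.Request(f"{BASE}/{table}/{key}", method="DELETE")).close()

def multiget(table, keys):
    # Multi-record read, expressed here as repeated gets for simplicity.
    return {k: get(table, k) for k in keys}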

Page 28: Ramakrishnan Keynote Ladis2009

29

Tablets—Hash Table

(Diagram: a hash-organized tablet. Each record has Name, Description, and Price fields; names such as Apple, Avocado, Banana, Grape, Kiwi, Lemon, Lime, Orange, Strawberry, and Tomato are hashed into the space 0x0000–0xFFFF, with tablet boundaries at points such as 0x2AF3 and 0x911F. Sample descriptions include “Apple is wisdom” and “Limes are green”; prices range from $1 to $900.)

29

Page 29: Ramakrishnan Keynote Ladis2009

30

Tablets—Ordered Table

(Diagram: the same records (Name, Description, Price), now clustered into tablets by sorted name ranges with boundaries at A, H, Q, and Z.)

Page 30: Ramakrishnan Keynote Ladis2009

31

Flexible Schema

Posted date   Listing id   Item    Price
6/1/07        424252       Couch   $570
6/1/07        763245       Bike    $86
6/3/07        211242       Car     $1123
6/5/07        421133       Lamp    $15

Some records also carry extra fields, e.g. Color (Red) and Condition (Good, Fair).

Page 31: Ramakrishnan Keynote Ladis2009

32

Primary vs. Secondary Access

Primary table
Posted date   Listing id   Item    Price
6/1/07        424252       Couch   $570
6/1/07        763245       Bike    $86
6/3/07        211242       Car     $1123
6/5/07        421133       Lamp    $15

Secondary index
Price   Posted date   Listing id
15      6/5/07        421133
86      6/1/07        763245
570     6/1/07        424252
1123    6/3/07        211242

Planned functionality

Page 32: Ramakrishnan Keynote Ladis2009

33

Index Maintenance

• How to have lots of interesting indexes and views without killing performance?
• Solution: asynchrony!
  – Indexes/views are updated asynchronously when the base table is updated (see the sketch below)
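A minimal sketch of that asynchrony, with an in-process queue standing in for the pub/sub update stream; the table and index names are illustrative, not the PNUTS implementation:

import queue
import threading

update_log = queue.Queue()   # stands in for the pub/sub stream of base-table updates
base_table = {}              # primary table: listing_id -> record
price_index = {}             # secondary index: price -> set of listing_ids

def write_listing(listing_id, record):
    # The base-table write returns immediately; the index catches up later.
    base_table[listing_id] = record
    update_log.put((listing_id, record))

def index_maintainer():
    # Background consumer that keeps the secondary index eventually in sync.
    while True:
        listing_id, record = update_log.get()
        price_index.setdefault(record["price"], set()).add(listing_id)
        update_log.task_done()

threading.Thread(target=index_maintainer, daemon=True).start()
write_listing(424252, {"item": "Couch", "price": 570})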

Page 33: Ramakrishnan Keynote Ladis2009

34

PROCESSING READS & UPDATES

34

Page 34: Ramakrishnan Keynote Ladis2009

35

Updates

(Write-path diagram, via routers and message brokers)
1. Write key k arrives at a router
2. Router forwards the write to the storage unit holding the record
3. SU publishes the write to the message broker
4. Broker logs the update
5. SUCCESS returned to the caller
6. Broker delivers the write to the other replicas
7–8. A sequence number for key k accompanies the propagated update

35

Page 35: Ramakrishnan Keynote Ladis2009

36

Accessing Data

(Read-path diagram)
1. Get key k arrives at a router
2. Router forwards the get to the storage unit holding the record
3. SU returns the record for key k
4. Router returns the record for key k to the client

Page 36: Ramakrishnan Keynote Ladis2009

37

Bulk Read

(Diagram)
1. The client sends a set of keys {k1, k2, … kn} to a scatter/gather server
2. The scatter/gather server issues Get k1, Get k2, Get k3, … to the storage units holding them
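A small sketch of the scatter/gather step, assuming a per-key get function like the one sketched earlier; the server fans the keys out in parallel and gathers the results:

from concurrent.futures import ThreadPoolExecutor

def bulk_read(table, keys, get_fn, max_workers=16):
    # Scatter: issue one get per key in parallel threads.
    # Gather: collect the results into a single dict keyed by record key.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda k: (k, get_fn(table, k)), keys))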

Page 37: Ramakrishnan Keynote Ladis2009

38

Range Queries in YDOT

• Clustered, ordered retrieval of records
• (Diagram) The router maps ordered key ranges to storage units (e.g., tablet boundaries at Canteloupe, Lime, and Strawberry, assigned to storage units 1, 3, and 2), so a range query such as “Grapefruit…Pear?” is broken into per-tablet sub-queries (“Grapefruit…Lime?”, “Lime…Pear?”) and sent to the storage units that hold those ranges.

Page 38: Ramakrishnan Keynote Ladis2009

39

Bulk Load in YDOT

• YDOT bulk inserts can cause performance hotspots

• Solution: preallocate tablets

Page 39: Ramakrishnan Keynote Ladis2009

40

ASYNCHRONOUS REPLICATION AND CONSISTENCY

40

Page 40: Ramakrishnan Keynote Ladis2009

41

Asynchronous Replication

41

Page 41: Ramakrishnan Keynote Ladis2009

42

Consistency Model

• If copies are asynchronously updated, what can we say about stale copies?
  – ACID guarantees require synchronous updates
  – Eventual consistency: copies can drift apart, but will eventually converge if the system is allowed to quiesce
    • To what value will copies converge?
    • Do systems ever “quiesce”?
  – Is there any middle ground?

Page 42: Ramakrishnan Keynote Ladis2009

43

Example: Social Alice

(Diagram) West and East each hold a copy of Alice’s record (User, Status):
• Alice logs on: her status is set to Busy
• A network fault sends the next update, Free, to East
• Readers in each region now see different values: Busy, Free, or ???

Record timeline: ___ → Busy → Free → Free

Page 43: Ramakrishnan Keynote Ladis2009

44

PNUTS Consistency Model

• Goal: Make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Alice”?

(Timeline diagram: the record is inserted and then receives a sequence of updates and a delete, producing versions v.1 through v.8 within Generation 1)

As the record is updated, copies may get out of sync.

Page 44: Ramakrishnan Keynote Ladis2009

45

PNUTS Consistency Model

(Timeline: a Write advances the current version; other replicas may briefly hold stale versions)

Achieved via a per-record primary-copy protocol. (To maximize availability, record mastership is automatically transferred if a site fails.) Consistency can be selectively weakened to eventual consistency (local writes that are reconciled using version vectors).

Page 45: Ramakrishnan Keynote Ladis2009

46

PNUTS Consistency Model

(Timeline: “Write if = v.7” succeeds only if the record is still at version 7; otherwise the caller gets an ERROR)

Test-and-set writes facilitate per-record transactions.
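A hedged sketch of the test-and-set idea on a versioned record store; the store interface here is assumed, and in PNUTS the check-and-write is applied atomically at the record’s master copy:

class StaleVersionError(Exception):
    """The record has moved past the version the caller expected."""

def test_and_set_write(store, key, expected_version, new_value):
    # "Write if = v.N": apply the write only if the record is still at version N.
    value, version = store.get(key)          # assumed to return (value, version)
    if version != expected_version:
        raise StaleVersionError(f"{key} is at v.{version}, expected v.{expected_version}")
    store.put(key, new_value, version + 1)   # assumed versioned put
    return version + 1

def increment(store, key):
    # Typical read-modify-write loop built on test-and-set: retry on conflict.
    while True:
        value, version = store.get(key)
        try:
            return test_and_set_write(store, key, version, value + 1)
        except StaleVersionError:
            continue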

Page 46: Ramakrishnan Keynote Ladis2009

47

PNUTS Consistency Model

(Timeline: a Read may be answered from any replica, current or stale)

In general, reads are served using a local copy.

Page 47: Ramakrishnan Keynote Ladis2009

48

PNUTS Consistency Model

(Timeline: “Read up-to-date” returns the current version)

But an application can request, and get, the current version.

Page 48: Ramakrishnan Keynote Ladis2009

49

PNUTS Consistency Model

(Timeline: “Read ≥ v.6” returns any copy at least as new as version 6)

Or variations such as “read forward”—while copies may lag the master record, every copy goes through the same sequence of changes.
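A sketch of how an application might pick among these read guarantees, assuming replica and master handles that expose a get_with_version(key) call returning (value, version); the names are illustrative, not the PNUTS API:

def read_any(replica, key):
    # Fastest option: serve from the local copy, which may be stale.
    return replica.get_with_version(key)

def read_forward(replica, master, key, min_version):
    # "Read >= v.N": the local copy is acceptable only once it has reached v.N;
    # otherwise fall back to the record master.
    value, version = replica.get_with_version(key)
    if version >= min_version:
        return value, version
    return master.get_with_version(key)

def read_latest(master, key):
    # "Read up-to-date": always ask the record master for the current version.
    return master.get_with_version(key)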

Page 49: Ramakrishnan Keynote Ladis2009

50

OPERABILITY

50

Page 50: Ramakrishnan Keynote Ladis2009

51

Distribution

(Diagram: records spread across Server 1–Server 4; distribution for parallelism, data shuffling for load balancing)

Posted date   Listing id   Item      Price
6/1/07        424252       Couch     $570
6/1/07        256623       Car       $1123
6/2/07        636353       Bike      $86
6/5/07        662113       Chair     $10
6/7/07        121113       Lamp      $19
6/9/07        887734       Bike      $56
6/11/07       252111       Scooter   $18
6/11/07       116458       Hammer    $8000

Page 51: Ramakrishnan Keynote Ladis2009

52

Tablet Splitting and Balancing

52

Each storage unit has many tablets (horizontal partitions of the table)

• Tablets may grow over time; overfull tablets split
• A storage unit may become a hotspot; shed load by moving tablets to other servers

(Diagram: tablets laid out across storage units)

Page 52: Ramakrishnan Keynote Ladis2009

53

Consistency Techniques

• Per-record mastering
  – Each record is assigned a “master region”
    • May differ between records
  – Updates to the record are forwarded to the master region
  – Ensures consistent ordering of updates
• Tablet-level mastering
  – Each tablet is assigned a “master region”
  – Inserts and deletes of records are forwarded to the master region
  – The master region decides tablet splits
• These details are hidden from the application
  – Except for the latency impact!

Page 53: Ramakrishnan Keynote Ladis2009

54

Mastering

(Diagram: every replica stores, with each record, the region that masters it, e.g. A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E; a change of a record’s master, such as B from W to E, is itself replicated to all copies)

Page 54: Ramakrishnan Keynote Ladis2009

55

Record vs. Tablet Master

(Diagram: the same replicated tablet)
• The record master serializes updates to an existing record
• The tablet master serializes inserts of new records

Page 55: Ramakrishnan Keynote Ladis2009

56

Coping With Failures

(Diagram: when the region mastering a record fails, mastership is overridden and transferred, OVERRIDE W → E, so that updates can continue at a surviving region)

Page 56: Ramakrishnan Keynote Ladis2009

57

Further PNutty Reading

Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008)
Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan

PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008)
Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni

Asynchronous View Maintenance for VLSD Databases (SIGMOD 2009)
Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Raghu Ramakrishnan

Cloud Storage Design in a PNUTShell
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava
Beautiful Data, O'Reilly Media, 2009

Adaptively Parallelizing Distributed Range Queries (VLDB 2009)
Ymir Vigfusson, Adam Silberstein, Brian Cooper, Rodrigo Fonseca

Page 57: Ramakrishnan Keynote Ladis2009

58

COMPARING SOMECLOUD SERVING STORES

Green Apples and Red Apples

Page 58: Ramakrishnan Keynote Ladis2009

59

Motivation

• Many “cloud DB” and “NoSQL” systems out there
  – PNUTS
  – BigTable
    • HBase, Hypertable, HTable
  – Azure
  – Cassandra
  – Megastore (behind Google AppEngine)
  – Amazon Web Services
    • S3, SimpleDB, EBS
  – And more: CouchDB, Voldemort, etc.
• How do they compare?
  – Feature tradeoffs
  – Performance tradeoffs
  – Not clear!

Page 59: Ramakrishnan Keynote Ladis2009

60

The Contestants

• Baseline: Sharded MySQL
  – Horizontally partition data among MySQL servers
• PNUTS/Sherpa
  – Yahoo!’s cloud database
• Cassandra
  – BigTable + Dynamo
• HBase
  – BigTable + Hadoop

Page 60: Ramakrishnan Keynote Ladis2009

61

SHARDED MYSQL

61

Page 61: Ramakrishnan Keynote Ladis2009

62

Architecture

• Our own implementation of sharding

(Diagram: many clients connect to a set of shard servers, each running MySQL)

Page 62: Ramakrishnan Keynote Ladis2009

63

Shard Server

• Server is Apache + plugin + MySQL (roughly 100 Apache processes per server)
  – MySQL schema: key varchar(255), value mediumtext
  – Flexible schema: the value is a blob of key/value pairs
• Why not go direct to MySQL?
  – With a flexible schema, an update means: read the record from MySQL, apply the changes, write the record back to MySQL
  – Doing this in the shard server keeps the read local: no need to pass the whole record over the network to change one field

Page 63: Ramakrishnan Keynote Ladis2009

64

Client

• Application plus shard client
• Shard client
  – Loads a config file of servers
  – Hashes the record key
  – Chooses the server responsible for that hash range
  – Forwards the query to the server

(Diagram: Application → shard client, which applies Hash(), consults the server map, and issues the query via CURL)
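A sketch of that routing logic; the config-file format, hash choice, and query URL are assumptions for illustration:

import bisect
import hashlib
import json
import urllib.request

class ShardClient:
    def __init__(self, config_path):
        # Config file maps the upper bound of each hash range to a server URL,
        # e.g. {"16384": "http://shard1", "32768": "http://shard2", ...}
        with open(config_path) as f:
            config = json.load(f)
        self.bounds = sorted(int(b) for b in config)
        self.servers = [config[str(b)] for b in self.bounds]

    def _server_for(self, key: str) -> str:
        # Hash the record key and pick the server responsible for that range.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16) % (self.bounds[-1] + 1)
        return self.servers[bisect.bisect_left(self.bounds, h)]

    def get(self, key: str) -> bytes:
        # Forward the query to the chosen shard server over HTTP.
        with urllib.request.urlopen(f"{self._server_for(key)}/get?key={key}") as resp:
            return resp.read()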

Page 64: Ramakrishnan Keynote Ladis2009

65

Pros and Cons

• Pros
  – Simple
  – “Infinitely” scalable
  – Low latency
  – Geo-replication
• Cons
  – Not elastic (resharding is hard)
  – Poor support for load balancing
  – Failover? (Adds complexity)
  – Replication unreliable (async log shipping)

Page 65: Ramakrishnan Keynote Ladis2009

66

Azure SDS

• Cloud of SQL Server instances
• App partitions data into instance-sized pieces
  – Transactions and queries within an instance

(Diagram: an SDS instance provides data storage plus per-field indexing)

Page 66: Ramakrishnan Keynote Ladis2009

67

Google MegaStore

• Transactions across entity groups
  – Entity group: hierarchically linked records
    • Ramakris
    • Ramakris.preferences
    • Ramakris.posts
    • Ramakris.posts.aug-24-09
  – Can transactionally update multiple records within an entity group
    • Records may be on different servers
    • Uses Paxos to ensure ACID and to deal with server failures
  – Can join records within an entity group
  – Reportedly moving to ordered, async replication w/o ACID
• Other details
  – Built on top of BigTable
  – Supports schemas, column partitioning, some indexing

Phil Bernstein, http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx

Page 67: Ramakrishnan Keynote Ladis2009

68

PNUTS

68

Page 68: Ramakrishnan Keynote Ladis2009

69

Architecture

(Diagram: clients → REST API → routers → storage units; the tablet controller maintains the tablet map, and log servers provide reliable logging)

Page 69: Ramakrishnan Keynote Ladis2009

70

Routers

• Direct requests to the storage unit
  – Decouple clients from the storage layer
    • Easier to move data, add/remove servers, etc.
  – Tradeoff: some latency to get increased flexibility

(Diagram: a router is Y! Traffic Server plus a PNUTS router plugin)

Page 70: Ramakrishnan Keynote Ladis2009

71

Msg/Log Server

• Topic-based, reliable publish/subscribe
  – Provides reliable logging
  – Provides intra- and inter-datacenter replication

(Diagram: pub/sub hubs backed by log servers that persist messages to disk)

Page 71: Ramakrishnan Keynote Ladis2009

72

Pros and Cons

• Pros
  – Reliable geo-replication
  – Scalable consistency model
  – Elastic scaling
  – Easy load balancing
• Cons
  – System complexity relative to sharded MySQL (to support geo-replication, consistency, etc.)
  – Latency added by the router

Page 72: Ramakrishnan Keynote Ladis2009

73

HBASE

73

Page 73: Ramakrishnan Keynote Ladis2009

74

Architecture

(Diagram: clients reach the cluster either through a Java client or through a REST API fronted by the HBaseMaster; data is served by HRegionServers, each with its own disk)

Page 74: Ramakrishnan Keynote Ladis2009

75

HRegion Server

• Records are partitioned by column family into HStores
  – Each HStore contains many MapFiles
• All writes to an HStore are applied to a single memcache
• Reads consult the MapFiles and the memcache
• Memcaches are flushed to disk as MapFiles (HDFS files) when full
• Compactions limit the number of MapFiles

(Diagram: writes go to the in-memory memcache, which is flushed to disk as MapFiles; reads consult both)

Page 75: Ramakrishnan Keynote Ladis2009

76

Pros and Cons

• Pros
  – Log-based storage for high write throughput
  – Elastic scaling
  – Easy load balancing
  – Column storage for OLAP workloads
• Cons
  – Writes not immediately persisted to disk
  – Reads cross multiple disk and memory locations
  – No geo-replication
  – Latency/bottleneck of the HBaseMaster when using REST

Page 76: Ramakrishnan Keynote Ladis2009

77

CASSANDRA

77

Page 77: Ramakrishnan Keynote Ladis2009

78

Architecture

• Facebook’s storage system
  – BigTable data model
  – Dynamo partitioning and consistency model
  – Peer-to-peer architecture

(Diagram: clients connect to a ring of Cassandra nodes, each with its own disk)

Page 78: Ramakrishnan Keynote Ladis2009

79

Routing

• Consistent hashing, like Dynamo or Chord
  – Server position = hash(serverid)
  – Content position = hash(contentid)
  – A server is responsible for all content in a hash interval

(Diagram: servers placed on a hash ring, each covering a hash interval)
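A minimal sketch of that placement rule (no virtual nodes or replication, which real systems add); the class and helper names are illustrative:

import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string onto a fixed hash ring (here: a 128-bit space).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, server_ids):
        # Server position = hash(serverid); keep positions sorted around the ring.
        self.ring = sorted((_hash(s), s) for s in server_ids)
        self.positions = [pos for pos, _ in self.ring]

    def server_for(self, content_id: str) -> str:
        # Content position = hash(contentid); the first server at or after that
        # position is responsible for it (wrapping around the ring).
        pos = _hash(content_id)
        idx = bisect.bisect_right(self.positions, pos) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.server_for("user:1234"))   # stable assignment of this key to one node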

Page 79: Ramakrishnan Keynote Ladis2009

80

Cassandra Server

• Writes go to the log and to a memory table (memtable)
• Periodically the memory table is merged with the disk table (SSTable)

(Diagram: an update is appended to the log on disk and applied to the in-RAM memtable; later the memtable is written out as an SSTable file)
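A toy sketch of that write path: append to a commit log, apply to an in-memory table, and periodically roll the memtable into an immutable sorted table (kept in memory here for brevity; a real store writes an SSTable file). Not Cassandra’s actual classes, just the idea:

import json

class ToyLSMStore:
    def __init__(self, log_path, flush_threshold=1000):
        self.log = open(log_path, "a")    # commit log: sequential appends only
        self.memtable = {}                # recent writes, in memory
        self.sstables = []                # immutable sorted tables, newest first
        self.flush_threshold = flush_threshold

    def update(self, key, value):
        # 1. Log the write for durability; 2. apply it to the memtable.
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # (later) Roll the memtable into an immutable sorted table.
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # Reads consult the memtable first, then the sorted tables, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            if key in table:
                return table[key]
        return None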

Page 80: Ramakrishnan Keynote Ladis2009

81

Pros and Cons

• Pros
  – Elastic scalability
  – Easy management
    • Peer-to-peer configuration
  – BigTable model is nice
    • Flexible schema, column groups for partitioning, versioning, etc.
  – Eventual consistency is scalable
• Cons
  – Eventual consistency is hard to program against
  – No built-in support for geo-replication
    • Gossip can work across datacenters, but is not really optimized for it
  – Load balancing?
    • Consistent hashing limits the options
  – System complexity
    • P2P systems are complex and have complex corner cases

Page 81: Ramakrishnan Keynote Ladis2009

82

Cassandra Findings

• Tunable memtable size
  – Can have a large memtable flushed less frequently, or a small memtable flushed frequently
  – The tradeoff is throughput versus recovery time
    • A larger memtable requires fewer flushes, but takes a long time to recover after a failure
    • With a 1 GB memtable: 45 minutes to 1 hour to restart
• Can turn off log flushing
  – Risks loss of durability
• Replication is still synchronous with the write
  – Durable if updates are propagated to other servers that don’t fail

Page 82: Ramakrishnan Keynote Ladis2009

83

NUMBERS

Thanks to Ryan Rawson & J.D. Cryans for advice on HBase configuration, and Jonathan Ellis on Cassandra

83

Page 83: Ramakrishnan Keynote Ladis2009

84

Overview

• Setup
  – Six server-class machines
    • 8 cores (2 × quad-core) 2.5 GHz CPUs, RHEL 4, gigabit Ethernet
    • 8 GB RAM
    • 6 × 146 GB 15K RPM SAS drives in RAID 1+0
  – Plus extra machines for clients, routers, controllers, etc.
• Workloads
  – 120 million 1 KB records = 20 GB per server
  – Write-heavy workload: 50/50 read/update
  – Read-heavy workload: 95/5 read/update
• Metrics
  – Latency versus throughput curves
• Caveats
  – Write performance would be improved for PNUTS, Sharded MySQL, and Cassandra with a dedicated log disk
  – We tuned each system as well as we knew how

Page 84: Ramakrishnan Keynote Ladis2009

85

Results

(Plot: read latency in ms versus actual throughput in ops/sec, 95/5 read/write workload, for PNUTS, Sharded MySQL, Cassandra, and HBase)

Page 85: Ramakrishnan Keynote Ladis2009

86

Results

(Plot: write latency in ms versus actual throughput in ops/sec, 95/5 read/write workload, for PNUTS, Sharded MySQL, Cassandra, and HBase)

Page 86: Ramakrishnan Keynote Ladis2009

87

Results

Page 87: Ramakrishnan Keynote Ladis2009

88

Results

Page 88: Ramakrishnan Keynote Ladis2009

89

Qualitative Comparison

• Storage layer
  – File based: HBase, Cassandra
  – MySQL: PNUTS, Sharded MySQL
• Write persistence
  – Writes committed synchronously to disk: PNUTS, Cassandra, Sharded MySQL
  – Writes flushed asynchronously to disk: HBase (current version)
• Read pattern
  – Find record in MySQL (disk or buffer pool): PNUTS, Sharded MySQL
  – Find record and deltas in memory and on disk: HBase, Cassandra

Page 89: Ramakrishnan Keynote Ladis2009

90

Qualitative Comparison

• Replication (not yet utilized in benchmarks)
  – Intra-region: HBase, Cassandra
  – Inter- and intra-region: PNUTS
  – Inter- and intra-region: MySQL (but not guaranteed)
• Mapping record to server
  – Router: PNUTS, HBase (with REST API)
  – Client holds mapping: HBase (Java library), Sharded MySQL
  – P2P: Cassandra

Page 90: Ramakrishnan Keynote Ladis2009

91

YCS Benchmark

• Will distribute a Java application, plus an extensible benchmark suite
  – Many systems have Java APIs
  – Non-Java systems can support REST

Scenario file: R/W mix, record size, data set, …
Command-line parameters: DB to use, target throughput, number of threads, …

(Diagram: the YCSB client’s scenario executor drives client threads through a DB-client layer against the cloud DB, collecting statistics)

Extensible: plug in new clients; define new scenarios
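A sketch of what the pluggable DB-client layer could look like; the interface below is illustrative, not the released benchmark code:

from abc import ABC, abstractmethod

class CloudDBClient(ABC):
    """Interface the benchmark's client threads program against."""

    @abstractmethod
    def read(self, table: str, key: str) -> dict: ...

    @abstractmethod
    def update(self, table: str, key: str, fields: dict) -> None: ...

    @abstractmethod
    def insert(self, table: str, key: str, fields: dict) -> None: ...

class InMemoryClient(CloudDBClient):
    """Trivial plug-in, useful for testing scenarios without a real cloud DB."""
    def __init__(self):
        self.tables = {}

    def read(self, table, key):
        return self.tables.get(table, {}).get(key, {})

    def update(self, table, key, fields):
        self.tables.setdefault(table, {}).setdefault(key, {}).update(fields)

    def insert(self, table, key, fields):
        self.tables.setdefault(table, {})[key] = dict(fields)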

Page 91: Ramakrishnan Keynote Ladis2009

92

Benchmark Tiers

• Tier 1: Cluster performance
  – Tests fundamental design independent of replication/scale-out/availability
• Tier 2: Replication
  – Local replication
  – Geographic replication
• Tier 3: Scale out
  – Scale out the number of machines by a factor of 10
  – Scale out the number of replicas
• Tier 4: Availability
  – Kill system components

If you’re interested in this, please contact me or [email protected]

Page 92: Ramakrishnan Keynote Ladis2009

93

Shadoop = Sherpa + Hadoop

Adam Silberstein and the Sherpa team in Y! Research and CCDI

Page 93: Ramakrishnan Keynote Ladis2009

94

Sherpa vs. HDFS

• Sherpa is optimized for low-latency record-level access: B-trees
• HDFS is optimized for batch-oriented access: a file system
• At 50,000 feet the two data stores appear similar (clients go through a Sherpa router or an HDFS name node)

Shadoop goals:
1. Provide a batch-processing framework for Sherpa using Hadoop
2. Minimize/characterize the penalty for reading/writing Sherpa vs. HDFS

Is Sherpa a reasonable backend for Hadoop for Sherpa users?

Page 94: Ramakrishnan Keynote Ladis2009

95

Building Shadoop

Input side (Sherpa → Hadoop map tasks):
1. Split the Sherpa table into hash ranges (e.g., scan(0x0-0x2), scan(0x2-0x4), scan(0x8-0xa), scan(0xa-0xc), scan(0xc-0xe))
2. Each Hadoop task is assigned a range
3. The task uses a Sherpa scan to retrieve the records in its range
4. A RecordReader feeds the scan results to the map function

Output side (Hadoop map or reduce tasks → Sherpa):
1. Tasks call Sherpa set, through the router, to write output
   – No DOT range-clustering
   – Record-at-a-time inserts

A sketch of the input side follows.
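This sketch assumes a sherpa handle exposing scan(table, start_hash, end_hash) and a 16-bit hash space as in the diagram; both are illustrative assumptions:

def hash_range_splits(num_tasks, space_bits=16):
    # Carve the hash space (0x0000 .. 2**space_bits) into contiguous ranges,
    # one per Hadoop task.
    size = (1 << space_bits) // num_tasks
    splits = []
    for i in range(num_tasks):
        lo = i * size
        hi = (1 << space_bits) if i == num_tasks - 1 else (i + 1) * size
        splits.append((lo, hi))
    return splits

def run_map_task(sherpa, table, split, map_fn):
    # Each task scans only its assigned hash range (the RecordReader role)
    # and feeds every record to the user's map function.
    lo, hi = split
    for key, record in sherpa.scan(table, start_hash=lo, end_hash=hi):
        map_fn(key, record)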

Page 95: Ramakrishnan Keynote Ladis2009

96

Use Cases

• Bulk load Sherpa tables (HDFS → Sherpa): source data in HDFS; Map/Reduce or Pig; output is servable content
• Migrate Sherpa tables (Sherpa → Sherpa): between farms or code versions; Map/Reduce or Pig; output is servable content
• Validator ops tool (Sherpa → HDFS): ensure replicas are consistent; reads the replicas; Map/Reduce or Pig; output is input to further analysis
• Sherpa as a Hadoop “cache” (HDFS + Sherpa): standard Hadoop over HDFS, using Sherpa as a shared cache

Page 96: Ramakrishnan Keynote Ladis2009

98

SYSTEMS IN CONTEXT

98

Page 97: Ramakrishnan Keynote Ladis2009

99

Application Design Space

(Diagram: systems plotted along two axes, records vs. files and “get a few things” vs. “scan everything”; the systems placed include Sherpa, MObStor, Everest, Hadoop, YMDB, MySQL, Filer, Oracle, and BigTable, e.g. Sherpa for record-level gets, MObStor for file/object retrieval, Hadoop for scanning everything)

Page 98: Ramakrishnan Keynote Ladis2009

100

Comparison Matrix

(Table comparing the systems below along the listed dimensions)

Systems compared: PNUTS, MySQL, HDFS, BigTable, Dynamo, Cassandra, Megastore, Azure.

Dimensions:
• Partitioning: hash and/or sort (H, S, H+S, other); dynamic or not
• Availability: failures handled (colo + server); routing during failure (client, router, or P2P)
• Replication: sync or async; local, nearby, or geo
• Consistency: values include timeline + eventual, ACID, multi-version, eventual, and N/A (no updates)
• Durability: WAL, double WAL, triple WAL, or triple replication
• Storage: buffer pages, files, or LSM/SSTables
• Reads/writes supported

Page 99: Ramakrishnan Keynote Ladis2009

101

Comparison Matrix

(Table comparing Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, and Cassandra along: elasticity, operability, global low latency, availability, structured access, updates, consistency model, and SQL/ACID)

Page 100: Ramakrishnan Keynote Ladis2009

102

QUESTIONS?

102