
Page 1: Managing Cloud Resources:  Distributed Rate Limiting

Managing Cloud Resources: Distributed Rate Limiting

Alex C. Snoeren, Kevin Webb, Bhanu Chandra Vattikonda, Barath Raghavan,

Kashi Vishwanath, Sriram Ramabhadran, and Kenneth Yocum

Building and Programming the Cloud Workshop, 13 January 2010

Page 2: Managing Cloud Resources:  Distributed Rate Limiting

Centralized Internet services

Hosting with a single physical presence
However, clients are across the Internet

Page 3: Managing Cloud Resources:  Distributed Rate Limiting

Cloud-based services (e.g., Windows Live)

Resources and clients distributed across the world
Often incorporates resources from multiple providers

Page 4: Managing Cloud Resources:  Distributed Rate Limiting

Resources in the Cloud

Distributed resource consumption
Clients consume resources at multiple sites
Metered billing is state-of-the-art
Service “punished” for popularity
» Those unable to pay are disconnected
No control of resources used to serve increased demand
Overprovision and pray
Application designers typically cannot describe needs
Individual service bottlenecks varied but severe
» IOps, network bandwidth, CPU, RAM, etc.
» Need a way to balance resource demand

Page 5: Managing Cloud Resources:  Distributed Rate Limiting

Two lynchpins for success

Need a way to control and manage distributed resources as if they were centralized
All current models from the OS scheduling and provisioning literature assume full knowledge and absolute control
(This talk focuses specifically on network bandwidth)

Must be able to efficiently support rapidly evolving application demand
Match resource needs to the hardware realization automatically, without application-designer input
(Another talk if you’re interested)

Page 6: Managing Cloud Resources:  Distributed Rate Limiting

Ideal: Emulate a single limiter

[Figure: traffic sources (S) and destinations (D) behind a set of limiters, drawn 0 ms apart — behaving as one limiter]

Make distributed feel centralized
Packets should experience the same limiter behavior

Page 7: Managing Cloud Resources:  Distributed Rate Limiting

Engineering tradeoffs

Accuracy (how close to K Mbps is delivered; flow-rate fairness)
+ Responsiveness (how quickly demand shifts are accommodated)
vs.
Communication efficiency (how much and how often rate limiters must communicate)

Page 8: Managing Cloud Resources:  Distributed Rate Limiting

An initial architecture

[Figure: four limiters exchanging gossip messages. On each packet arrival a limiter enforces its local limit; on an estimate-interval timer it estimates local demand, gossips it, combines it with the gossiped demands into a global demand, and sets its local allocation.]
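The allocation step is exactly where the designs that follow (GTB, GRD, FPS) differ. As a rough Python sketch of the per-limiter loop in the figure — with a hypothetical demand-proportional allocation standing in for the real policy, and direct method calls standing in for the gossip channel:

```python
import threading

class Limiter:
    """Sketch of one limiter's control loop: estimate local demand each
    interval, gossip it to peers, recompute the local allocation, and
    enforce that allocation on packet arrival."""

    def __init__(self, global_limit_bps, peers, interval=0.5):
        self.global_limit = global_limit_bps   # aggregate limit across all limiters
        self.peers = peers                     # other Limiter objects (gossip targets)
        self.interval = interval               # estimate-interval timer, in seconds
        self.bytes_this_interval = 0           # local arrivals since the last estimate
        self.peer_demand = {}                  # latest demand heard from each peer (bps)
        self.allocation = global_limit_bps     # this limiter's share of the global limit

    def on_estimate_timer(self):
        # 1. Estimate local demand from traffic seen during the last interval.
        local_demand = 8 * self.bytes_this_interval / self.interval
        self.bytes_this_interval = 0
        # 2. Gossip the estimate (a direct call here; really a lossy control channel).
        for peer in self.peers:
            peer.peer_demand[id(self)] = local_demand
        # 3. Combine into global demand and set the local allocation.  A plain
        #    demand-proportional split is shown; GTB/GRD/FPS replace this step.
        global_demand = local_demand + sum(self.peer_demand.values())
        if global_demand > 0:
            self.allocation = self.global_limit * local_demand / global_demand
        threading.Timer(self.interval, self.on_estimate_timer).start()

    def on_packet_arrival(self, length_bytes):
        # 4. Account for the packet and enforce the current allocation,
        #    e.g. with a local token bucket (see the next slide's sketch).
        self.bytes_this_interval += length_bytes
```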

Page 9: Managing Cloud Resources:  Distributed Rate Limiting

Token bucket limiters

[Figure: a token bucket with fill rate K Mbps; arriving packets consume tokens]
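A minimal Python sketch of the token-bucket limiter each node runs locally; the byte-level accounting and the burst parameter are illustrative choices, not details from the talk:

```python
import time

class TokenBucket:
    """Token bucket with fill rate K Mbps: tokens accrue at the fill rate up
    to a fixed burst size; a packet is forwarded only if it can pay for
    itself in tokens, otherwise it is dropped."""

    def __init__(self, fill_rate_mbps, burst_bytes):
        self.rate = fill_rate_mbps * 1e6 / 8   # fill rate, bytes per second
        self.burst = burst_bytes               # bucket depth
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def admit(self, packet_bytes):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True    # forward
        return False       # drop
```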

Page 10: Managing Cloud Resources:  Distributed Rate Limiting

A global token bucket (GTB)?

[Figure: Limiter 1 and Limiter 2 exchange demand info (bytes/sec) to emulate one shared token bucket]

Page 11: Managing Cloud Resources:  Distributed Rate Limiting

A baseline experiment

Limiter 1: 3 TCP flows (S → D)
Limiter 2: 7 TCP flows (S → D)
Reference: a single token bucket carrying all 10 TCP flows (S → D)

Page 12: Managing Cloud Resources:  Distributed Rate Limiting

GTB performance

[Figure: flow-rate plots for the single token bucket (10 TCP flows) versus the global token bucket (3 and 7 TCP flows at the two limiters)]

Problem: GTB requires near-instantaneous arrival info

Page 13: Managing Cloud Resources:  Distributed Rate Limiting

Take 2: Global Random Drop

Limiters send and collect global rate info from the others
Example: 5 Mbps limit, 4 Mbps global arrival rate
Case 1: Below the global limit, forward the packet

Page 14: Managing Cloud Resources:  Distributed Rate Limiting

Global Random Drop (GRD)

Example: 5 Mbps limit, 6 Mbps global arrival rate
Case 2: Above the global limit, drop with probability Excess / Global arrival rate = (6 − 5) / 6 = 1/6
Same drop probability at all limiters
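The forwarding decision above fits in a few lines. A sketch, where global_arrival_bps stands for each limiter's current estimate of the aggregate arrival rate learned from its peers:

```python
import random

def grd_forward(global_limit_bps, global_arrival_bps):
    """Global Random Drop: below the limit, always forward (Case 1); above
    it, drop with probability excess / global arrival rate (Case 2) --
    e.g. (6 - 5) / 6 = 1/6 in the example above, identical at every limiter."""
    if global_arrival_bps <= global_limit_bps:
        return True
    drop_prob = (global_arrival_bps - global_limit_bps) / global_arrival_bps
    return random.random() >= drop_prob
```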

Page 15: Managing Cloud Resources:  Distributed Rate Limiting

GRD baseline performance

[Figure: flow-rate plots comparing the single token bucket (10 TCP flows) against the distributed limiters (3 and 7 TCP flows)]

Delivers flow behavior similar to a central limiter

Page 16: Managing Cloud Resources:  Distributed Rate Limiting

GRD under dynamic arrivals (50-ms estimate interval)

Page 17: Managing Cloud Resources:  Distributed Rate Limiting

Returning to our baseline

Limiter 1: 3 TCP flows (S → D)
Limiter 2: 7 TCP flows (S → D)

Page 18: Managing Cloud Resources:  Distributed Rate Limiting

Basic idea: flow counting

Goal: Provide inter-flow fairness for TCP flows
Limiter 1 reports “3 flows”, Limiter 2 reports “7 flows”
Local token-bucket enforcement

Page 19: Managing Cloud Resources:  Distributed Rate Limiting

Estimating TCP demand

Two sources, 1 TCP flow each
Local token rate (limit) = 10 Mbps
Flow A = 5 Mbps, Flow B = 5 Mbps
Flow count = 2 flows

Page 20: Managing Cloud Resources:  Distributed Rate Limiting

FPS under dynamic arrivals (500-ms estimate interval)

Page 21: Managing Cloud Resources:  Distributed Rate Limiting

Comparing FPS to GRD

GRD (50-ms estimate interval) vs. FPS (500-ms estimate interval)
Both are responsive and provide similar utilization
GRD requires accurate estimates of the global rate at all limiters

Page 22: Managing Cloud Resources:  Distributed Rate Limiting

Estimating skewed demand

Limiter 1: two sources, 1 TCP flow each (S → D)
Limiter 2: 3 TCP flows (S → D)

Page 23: Managing Cloud Resources:  Distributed Rate Limiting

Estimating skewed demand

Local token rate (limit) = 10 Mbps
Flow A = 8 Mbps
Flow B = 2 Mbps (bottlenecked elsewhere)
Flow count ≠ demand
Key insight: Use a TCP flow’s rate to infer demand

Page 24: Managing Cloud Resources:  Distributed Rate Limiting

Estimating skewed demand

Local token rate (limit) = 10 Mbps
Flow A = 8 Mbps
Flow B = 2 Mbps (bottlenecked elsewhere)

Local Limit / Largest Flow’s Rate = 10 / 8 = 1.25 flows
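A sketch of that demand estimate in Python. It covers only the case shown on the slide, where the largest flow is not consuming the whole local limit (flows bottlenecked elsewhere); the full FPS estimator also handles the saturated case, which is not shown here.

```python
def fps_flow_weight(local_limit_mbps, flow_rates_mbps):
    """Infer demand ("flow count") from observed TCP flow rates rather than
    counting flows: local limit / largest flow's rate."""
    largest = max(flow_rates_mbps)
    return local_limit_mbps / largest

# Slide's example: 10 Mbps local limit, flows at 8 and 2 Mbps -> 1.25 "flows"
print(fps_flow_weight(10, [8, 2]))   # 1.25
```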

Page 25: Managing Cloud Resources:  Distributed Rate Limiting

FPS example

Global limit = 10 Mbps
Limiter 1: 1.25 flows; Limiter 2: 3 flows

Set local token rate = Global limit × local flow count / Total flow count
                     = 10 Mbps × 1.25 / (1.25 + 3)
                     = 2.94 Mbps
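The allocation step follows directly from the formula above; a small sketch reproducing the slide’s numbers:

```python
def fps_local_rate(global_limit_mbps, local_flow_count, remote_flow_counts):
    """FPS: set the local token rate to the global limit scaled by this
    limiter's share of the total (estimated) flow count."""
    total = local_flow_count + sum(remote_flow_counts)
    return global_limit_mbps * local_flow_count / total

# Limiter 1 has 1.25 flows, Limiter 2 has 3 flows, under a 10 Mbps global limit
print(round(fps_local_rate(10, 1.25, [3]), 2))   # 2.94 Mbps
```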

Page 26: Managing Cloud Resources:  Distributed Rate Limiting

FPS bottleneck example

Initially a 3:7 split between 10 un-bottlenecked flows
At 25 s, the 7-flow aggregate is bottlenecked to 2 Mbps
At 45 s, an un-bottlenecked flow arrives: a 3:1 split of the remaining 8 Mbps

Page 27: Managing Cloud Resources:  Distributed Rate Limiting

Real-world constraints

Resources spent tracking usage are pure overhead
Efficient implementation (<3% CPU, sample & hold)
Modest communication budget (<1% bandwidth)

Control channel is slow and lossy
Need to extend gossip protocols to tolerate loss
An interesting research problem on its own…

The nodes themselves may fail or partition
In an asynchronous system, you cannot tell the difference
Need a mechanism that deals gracefully with both

Page 28: Managing Cloud Resources:  Distributed Rate Limiting

Robust control communication

7 limiters enforcing a 10 Mbps limit
Demand fluctuates every 5 seconds between 1 and 100 flows
Varying loss on the control channel

Page 29: Managing Cloud Resources:  Distributed Rate Limiting

Handling partitions

Failsafe operation: each disconnected group of k of the n limiters enforces k/n of the limit
Ideally: Bank-o-mat problem (credit/debit scheme)
Challenge: group membership with asymmetric partitions
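A sketch of the failsafe fallback only (the credit/debit scheme is not shown): when the control channel partitions, a group that can still reach k of the n limiters enforces k/n of the global limit, so the aggregate never exceeds the limit.

```python
def failsafe_limit(global_limit_mbps, reachable_limiters, total_limiters):
    """Partition failsafe: a disconnected group of k limiters falls back to
    enforcing k/n of the global limit."""
    return global_limit_mbps * reachable_limiters / total_limiters

# A group of 3 out of 7 limiters enforces 3/7 of a 10 Mbps limit
print(round(failsafe_limit(10, 3, 7), 2))   # 4.29 Mbps
```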

Page 30: Managing Cloud Resources:  Distributed Rate Limiting

Following PlanetLab demand

Apache Web servers on 10 PlanetLab nodes
5 Mbps aggregate limit
Shift load over time from 10 nodes to 4

Page 31: Managing Cloud Resources:  Distributed Rate Limiting

Current limiting options

[Figure: demands at 10 Apache servers on PlanetLab; when demand shifts to just 4 nodes, static per-node limits leave wasted capacity]

Page 32: Managing Cloud Resources:  Distributed Rate Limiting

Applying FPS on PlanetLab

Page 33: Managing Cloud Resources:  Distributed Rate Limiting

Hierarchical limiting

Page 34: Managing Cloud Resources:  Distributed Rate Limiting

A sample use case

T = 0: A: 5 flows at L1
T = 55: A: 5 flows at L2
T = 110: B: 5 flows at L1
T = 165: B: 5 flows at L2

Page 35: Managing Cloud Resources:  Distributed Rate Limiting

Worldwide flow join

8 nodes split between UCSD and Polish Telecom
5 Mbps aggregate limit
A new flow arrives at each limiter every 10 seconds

Page 36: Managing Cloud Resources:  Distributed Rate Limiting

Worldwide demand shift

Same demand-shift experiment as before
At 50 seconds, Polish Telecom demand disappears
It reappears at 90 seconds

Page 37: Managing Cloud Resources:  Distributed Rate Limiting

Where to go from here

Need to “let go” of full control and make decisions with only a “cloudy” view of actual resource consumption
Distinguish between what you know and what you don’t know
Operate efficiently when you know you know; have failsafe options when you know you don’t

Moreover, we cannot rely upon application/service designers to understand their resource demands
The system needs to dynamically adjust to shifts
We’ve started to manage the demand equation
We’re now focusing on the supply side: custom-tailored resource provisioning
