Page 1: Reliable Communication for Datacenters

Maelstrom Ricochet Conclusion

Reliable Communication for Datacenters

Mahesh Balakrishnan

Cornell University, Ithaca, NY

Mahesh Balakrishnan Reliable Communication for Datacenters

Page 2: Reliable Communication for Datacenters

Datacenters

- Internet Services (90s): Websites, Search, Online Stores
- Since then:

[Figure: # of low-end volume servers (Millions, 0 to 30) by year, 2000 to 2005]

Installed Server Base 00-05:
- Commodity: up by 100%
- High/Mid: down by 40%

2007: ≈ 7 million new units

- Today: Datacenters are ubiquitous
- How have they evolved?

Data partially sourced from IDC press releases (www.idc.com)

Page 3: Reliable Communication for Datacenters

Networks of Datacenters

Why? Business Continuity, Client Locality, Distributed Datasets or Operations ... Any modern enterprise!

[Figure: map of datacenter sites with compass points N, S, E, W and link RTT labels: 100 ms, 110 ms, 200 ms, 210 ms, 220 ms]

Page 4: Reliable Communication for Datacenters

Networks of Real-Time Datacenters

- Finance, Aerospace, Military, Search and Rescue...
- ... documents, chat, email, games, videos, photos, blogs, social networks
- The Datacenter is the Computer!
- Not hard real-time: real fast, highly responsive, time-critical

Gartner Survey:
- Real-Time Infrastructure (RTI): reaction time in secs/mins
- 83%: RTI is important or very important
- 84%: Have no RTI capability

Page 8: Reliable Communication for Datacenters

The Real-Time Datacenter: Systems Challenges
How do we recover from failures within seconds?

[Figure: diagram labeled "Real-World, Real-Time" and "Software Stack" (Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...). Failure classes shown: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss. Systems shown: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato]

Page 9: Reliable Communication for Datacenters

Reliable Communication

Goal: Recover lost packets fast!

- Existing protocols react to loss: too much, too late
- We want proactive recovery: stable overhead, low latencies

- Maelstrom: Reliability between datacenters
- Ricochet: Reliability within datacenters

Page 10: Reliable Communication for Datacenters

Reliable Communication between Datacenters

Step 1: Open a TCP/IP socket from CA to NY. What happens?
- $\text{Throughput} \propto \text{BufSize} / \text{RTT}$
- BufSize = Receiver Buffer Size
- Current Solution: Manually reset buffer sizes

- TCP/IP needs zero loss on high-speed long-distance links:
  - Sensitive Congestion Control
  - RTT dependence + Sequenced Delivery

Current State-of-the-Art Solution: $$$!
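As a rough illustration of the relation Throughput ∝ BufSize / RTT, the sketch below computes the window-limited ceiling for one connection and the buffer needed to keep a link full. The 64 KiB buffer, the 100 ms RTT, and both function names are assumptions chosen for illustration; they do not come from the talk.

```python
def max_throughput_bps(buf_size_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one receiver buffer of data
    can be in flight per round trip."""
    return buf_size_bytes * 8 / rtt_seconds


def buffer_needed_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: buffer needed to keep the link full."""
    return link_bps * rtt_seconds / 8


# Assumed example values (not from the slides).
print(max_throughput_bps(64 * 1024, 0.100) / 1e6)  # Mbit/s ceiling with a 64 KiB buffer
print(buffer_needed_bytes(1e9, 0.100) / 1e6)       # MB needed to fill 1 Gbps at 100 ms RTT
```

Under these assumptions a 64 KiB buffer caps a 100 ms path at roughly 5 Mbit/s, while filling a 1 Gbps link would take a buffer of about 12.5 MB, which is why buffers end up being reset by hand.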

Page 15: Reliable Communication for Datacenters

TeraGrid: Supercomputer Network
SDSC, PSC, CTC, IU, NCSA ...

- End-to-End UDP Probes: Zero Congestion, Non-Zero Loss!
- Possible Reasons:
  - transient congestion
  - degraded fiber
  - malfunctioning HW
  - misconfigured HW
  - switching contention
  - low receiver power
  - end-host overflow
  - ...

[Figure: histogram of % of Measurements (0 to 30) against % of Lost Packets (0.01 to 1); series: All Measurements, Without Indiana U. Site; bar labels shown: 14, 24, 3, 14]

Problem Statement: Run unmodified TCP/IP over lossy high-speed long-distance networks

Page 16: Reliable Communication for Datacenters

The Maelstrom Network Appliance

[Figure: Sending End-hosts (Commodity TCP) -> Maelstrom Send-side Appliance (Inserts FEC) -> link carrying FEC+Data, subject to Packet Loss -> Maelstrom Receive-side Appliance (Recovers via FEC) -> Receiving End-hosts (Commodity TCP)]

FEC = Forward Error Correction
Transparent: No modification to end-host or network

Page 17: Reliable Communication for Datacenters

What is FEC?

[Figure: data packets A B C D E with repair packets X X X; packets C D E shown alongside recovered packets A B]

3 repair packets from every 5 data packets
Receiver can recover from any 3 lost packets

Rate [1]: (r, c), meaning c repair packets for every r data packets.

- Pro: No Feedback Loop, Constant Overhead
- Why not run FEC at end-hosts?
- Con: Recovery Latency dependent on channel data rate
- FEC in the Network:
  - Where and What: At the appliance, across multiple channels

[1] Rateless codes are popular, but inapplicable to real-time streams
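To make the (r, c) rate concrete, here is a small sketch of what a fixed-rate code costs in bandwidth. The function names are hypothetical; the (5, 3) rate is the one shown in the figure.

```python
from fractions import Fraction


def fec_overhead(r: int, c: int) -> Fraction:
    """Fraction of transmitted packets that are repair packets for rate (r, c)."""
    return Fraction(c, r + c)


def goodput_fraction(r: int, c: int) -> Fraction:
    """Fraction of the link left for data packets."""
    return Fraction(r, r + c)


print(fec_overhead(5, 3))      # 3/8 of packets on the wire are repair packets
print(goodput_fraction(5, 3))  # 5/8 of the link carries data
```

The overhead depends only on r and c, not on how much loss actually occurs, which is the "Constant Overhead" property listed above.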

Page 19: Reliable Communication for Datacenters

The Maelstrom Network Appliance

[Figure: Sending End-hosts (Commodity TCP) -> Maelstrom Send-side Appliance (Inserts FEC) -> link carrying FEC+Data, subject to Packet Loss -> Maelstrom Receive-side Appliance (Recovers via FEC) -> Receiving End-hosts (Commodity TCP)]

FEC = Forward Error Correction
Transparent: No modification to end-host or network

Where: at the appliance, What: aggregated data

Page 20: Reliable Communication for Datacenters

Maelstrom Mechanism

Send-Side Appliance:
- Intercept IP packets
- Create repair packet = XOR + 'recipe' of data packet IDs

Receive-Side Appliance:
- Lost packet recovered using XOR and other data packets
- At receiver end-host: out of order, no loss

[Figure: data packets 25 to 29 pass through the send-side Appliance (LAN MTU on the host side, Jumbo MTU on the Lambda link) together with an XOR repair packet whose 'Recipe List' is 25,26,27,28,29. Packet 27 is marked LOSS; the receive-side Appliance receives 25, 26, 28, 29 and the XOR, and emits 27 as the Recovered Packet]
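The mechanism above can be sketched in a few lines. This is a minimal illustration, not the appliance's kernel code: it assumes equal-length payloads, a single XOR per group of packets, and hypothetical helper names (`make_repair`, `recover`).

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length payloads byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets: dict[int, bytes]) -> tuple[list[int], bytes]:
    """Build a repair packet: the 'recipe' (packet IDs) plus the XOR of the payloads."""
    recipe = sorted(packets)
    return recipe, reduce(xor_bytes, (packets[i] for i in recipe))


def recover(recipe: list[int], repair: bytes, received: dict[int, bytes]) -> tuple[int, bytes]:
    """Recover the single missing packet named in the recipe."""
    missing = [i for i in recipe if i not in received]
    if len(missing) != 1:
        raise ValueError("one XOR can recover exactly one lost packet")
    payload = reduce(xor_bytes, (received[i] for i in recipe if i in received), repair)
    return missing[0], payload


# Packets 25..29 as in the figure; payloads are made-up, equal-length placeholders.
sent = {i: f"payload-{i}".encode() for i in range(25, 30)}
recipe, repair = make_repair(sent)
received = {i: p for i, p in sent.items() if i != 27}  # packet 27 is lost
print(recover(recipe, repair, received))               # (27, b'payload-27')
```

Because the recovered packet is rebuilt only after the rest of its group has arrived, it reaches the end-host late relative to its neighbours, which matches the "out of order, no loss" bullet.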

Page 21: Reliable Communication for Datacenters

Flow Control

- Two Flow Control Modes for TCP/IP Traffic:

A) End-to-End Flow Control
[Figure: End-Host, Appliance, Appliance, End-Host]

B) Split Flow Control
[Figure: End-Host, Appliance, Appliance, End-Host]

- Recall: Two problems with TCP/IP: loss and buffering
- End-to-end mode hides loss
- Split mode buffers data (standard PeP)

Page 23: Reliable Communication for Datacenters

Layered Interleaving for Bursty Loss
Recovery Latency ∝ Actual Burst Size, not Max Burst Size

[Figure: Data Stream with XORs at three interleaves: X1 over packets 1, 2, 3; X2 over packets 1, 11, 21; X3 over packets 1, 101, 201]

- XORs at different interleaves
- Recovery latency degrades gracefully with loss burstiness:
  - X1 catches random singleton losses
  - X2 catches loss bursts of 10 or less
  - X3 catches bursts of 100 or less

[Figure: Comparison of Recovery Probability, r=7, c=2; Recovery probability (0 to 1) against Loss probability (0 to 1); series: Reed-Solomon, Maelstrom]
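A simplified sketch of how the three layers divide the work follows. It assumes three packets per XOR (as drawn in the figure), interleaves of 1, 10, and 100, and that repair packets themselves are never lost; the function names are hypothetical and the real appliance may group packets differently.

```python
def xor_group(packet: int, interleave: int, r: int = 3) -> list[int]:
    """IDs (1-based) of the r packets XORed together with `packet` at this interleave."""
    block, offset = divmod(packet - 1, interleave * r)
    first = block * interleave * r + offset % interleave + 1
    return [first + k * interleave for k in range(r)]


def first_layer_that_recovers(lost: set[int], packet: int, interleaves=(1, 10, 100)):
    """Smallest interleave whose XOR group contains no other lost packet, else None."""
    for i in interleaves:
        if not (set(xor_group(packet, i)) - {packet}) & lost:
            return i
    return None


print(xor_group(1, 1), xor_group(1, 10), xor_group(1, 100))
# [1, 2, 3] [1, 11, 21] [1, 101, 201]

burst = set(range(40, 48))                   # a burst of 8 consecutive losses
print(first_layer_that_recovers({5}, 5))     # 1: singleton loss, caught by X1
print(first_layer_that_recovers(burst, 43))  # 10: burst of 8, caught by X2
```

A wider interleave only comes into play when the narrower ones fail, so the wait for recovery grows with the burst that actually occurred instead of the largest burst the code is provisioned for.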

Page 24: Reliable Communication for Datacenters

Implementation Details

- In Kernel: Linux 2.6.20 Module
- Commodity Box: 3 GHz, 1 Gbps NIC (≈ $800)
- Max speed: 1 Gbps, Memory Footprint: 10 MB
- 50-60% CPU → NIC is the bottleneck (for c = 3)
- How do we efficiently store/access/clean a gigabit of data every second?
- Scaling to Multi-Gigabit: Partition IP space across proxies

Page 25: Reliable Communication for Datacenters

Evaluation: FEC mode and loss
Claim: Maelstrom effectively hides loss from TCP/IP

[Figure: Throughput (Mbps, 0 to 1000) against One Way Link Latency (ms, 0 to 250); series: TCP No Loss, Maelstrom No Loss, Maelstrom (0.1%), Maelstrom (1.0%), TCP/IP (0.1%), TCP (1% loss); annotations: Tput ∝ 1/RTT; Data + FEC ≈ 1 Gbps; TCP/IP with loss; TCP/IP without loss]

Page 29: Reliable Communication for Datacenters

Evaluation: Split mode and buffering
Claim: Maelstrom eliminates the need for larger end-host buffers

[Figure: bar chart of Throughput (Mbits/sec, 0 to 600) at RTT 100 ms, 0.1% loss; bars: client-buf, M-ALL, M-FEC, M-BUF, tcp; annotations: Maelstrom FEC; Maelstrom FEC + Split Mode; Maelstrom FEC + Hand-Tuned Buffers]

Page 33: Reliable Communication for Datacenters

Evaluation: Delivery Latency
Claim: Maelstrom eliminates TCP/IP's loss-related jitter

[Figure: two panels of Delivery Latency (ms, 0 to 600) against Packet # (0 to 6000): "TCP/IP: 0.1% Loss" and "Maelstrom: 0.1% Loss"]

- Receive-side buffering due to sequencing
- Send-side buffering due to congestion control

Page 34: Reliable Communication for Datacenters

SMFS: The Smoke and Mirrors Filesystem

- Classic Mirroring Trade-off:
  - Fast: return to user after sending to mirror
  - Safe: return to user after ACK from mirror
- Maelstrom: Lossy Network → Lossless Network → Disk!
- SMFS: return to user after sending enough FEC
- Result: Fast, Safe Mirroring independent of link length!
- General Principle: Gray-box Exposure of Protocol State

Page 35: Reliable Communication for Datacenters

The Big Picture

[Figure: diagram labeled "Real-World, Real-Time" and "Software Stack" (Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...). Failure classes shown: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss. Systems shown: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato]

Page 37: Reliable Communication for Datacenters

Maelstrom Ricochet Conclusion Motivation Design and Implementation Evaluation

Ricochet: Multicast in the Datacenter

- Commodity Datacenters: Extreme Scale-Out
- Data Replication:
  - Fault Tolerance
  - High Availability
  - Performance
- Updating data in multiple locations: Multicast!

[Figure label: Updates]

Data is Clustered / Replicated / Published / Cached...

Page 38: Reliable Communication for Datacenters

How is Multicast Used?

Financial Pub-Sub Example:
- Each equity is mapped to a multicast group
- Each node is interested in a different set of equities ...

Each node in many groups ⟹ Low per-group data rate
High per-node data rate ⟹ Overload

[Figure labels: Tracking S&P 500, Tracking Portfolio]

Page 41: Reliable Communication for Datacenters

Where does loss occur in a Datacenter?

Packet Loss occurs at end-hosts: independent and bursty

[Figure: two panels of burst length (packets, 0 to 100) against time in seconds (60 to 180). Top panel: receiver r1, loss bursts and data bursts in A and B. Bottom panel: receiver r2, loss bursts and data bursts in A. Annotations: Loss Bursts, Data Rate, Less Loaded Node, Overloaded Node]

Page 42: Reliable Communication for Datacenters

Problem Statement

- Low-Latency Reliable Multicast
- Scalability:
  - Number of Receivers
  - Number of Senders
  - Number of Groups

[Figure labels: Multicast Data, Lost Packet]

Page 46: Reliable Communication for Datacenters

Design Space for Reliable Multicast
How does latency scale?

Existing mechanisms: $\text{recovery latency} \propto \frac{1}{\text{datarate}}$

- Example: SRM [1] NAK/Sequencing
- data rate: at a single sender, in a single group

What about FEC?

FEC in the Network: Where and What?

[Figure: SRM Average Discovery Delay (Milliseconds, 0 to 9000) against Groups per Node (Log Scale, 1 to 128)]

[1] S. Floyd, V. Jacobson, C.-G. Liu, S. McCanne, and L. Zhang. A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM ToN, 5(6):784-803, 1997

Page 48: Reliable Communication for Datacenters

Receiver-Based Forward Error Correction

- Receivers generate XORs of incoming multicast packets and exchange with other receivers
- Each XOR sent to c other receivers
- $\text{latency} \propto \frac{1}{\sum_s \text{datarate}}$
  data rate: across all senders, in a single group

[Figure: SENDERS sending Multicast Data to a GROUP]
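One way to read this proportionality: a receiver can emit an XOR only once enough packets have arrived, so the wait shrinks as the pooled arrival rate grows. The sketch below assumes steady per-sender rates in packets per second and r packets per XOR; the function name and the numbers are illustrative only.

```python
def time_to_fill_xor(r: int, sender_rates_pps: list[float]) -> float:
    """Seconds until r packets have arrived, pooling the rates of all senders in the group."""
    return r / sum(sender_rates_pps)


print(time_to_fill_xor(5, [100.0]))      # 0.05 s with one sender at 100 packets/s
print(time_to_fill_xor(5, [100.0] * 4))  # 0.0125 s with four such senders
```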

Page 50: Reliable Communication for Datacenters

Lateral Error Correction: Principle

- XORs can combine data from multiple groups

[Figure: two panels, each showing INCOMING DATA PACKETS from groups A (A1 to A6) and B (B1 to B5) at Node n1 and Node n2, with Loss and Recovery marked.
Single-Group RFEC: Repair Packet I: (A1, A2, A3, A4, A5); Repair Packet II: (B1, B2, B3, B4, B5).
Lateral Error Correction: Repair Packet I: (A1, A2, A3, B1, B2); Repair Packet II: (A4, A5, B3, B4, A6)]
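The difference between the two schemes can be sketched by counting how many arrivals a receiver waits for before it can emit each repair packet. The arrival order below is an assumption chosen to be consistent with the repair packets listed in the figure; `repair_packets` is a hypothetical helper, not Ricochet's API.

```python
def repair_packets(arrivals, r, group_of):
    """Emit a repair packet whenever r packets have accumulated in a bucket.

    Returns (arrival_count_when_emitted, packet_ids) pairs.
    """
    buckets, repairs = {}, []
    for count, pkt in enumerate(arrivals, start=1):
        bucket = buckets.setdefault(group_of(pkt), [])
        bucket.append(pkt)
        if len(bucket) == r:
            repairs.append((count, tuple(bucket)))
            bucket.clear()
    return repairs


# Assumed arrival order at one receiver that belongs to groups A and B.
arrivals = ["A1", "A2", "A3", "B1", "B2", "A4", "A5", "B3", "B4", "A6", "B5"]

single_group = repair_packets(arrivals, 5, group_of=lambda p: p[0])  # one bucket per group
lateral = repair_packets(arrivals, 5, group_of=lambda p: "all")      # one shared bucket

print(single_group)  # first repair packet only after 7 arrivals
print(lateral)       # first repair packet after 5 arrivals
```

With a shared bucket the repair packet is ready as soon as any five packets arrive, regardless of which group they belong to, so a node in many slow groups does not have to wait on each group separately.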

Page 51: Reliable Communication for Datacenters

Nodes and Disjoint Regions

[Figure: diagram of nodes and overlapping groups, with labels 1, 2, 3, 4]

- Receiver n1 belongs to groups A, B, and C
- Divides groups into disjoint regions
- Is unaware of groups it does not belong to (D)
- Works with any conventional Group Membership Service
- Can we build a scalable multi-group GMS?

Page 53: Reliable Communication for Datacenters

Regional Selection

[Figure: node 1 sending XORs for group A into regions a, ab, ac, abc, with per-region counts $c_A^{a}$, $c_A^{ab}$, $c_A^{ac}$, $c_A^{abc}$]

- Select targets for XORs from regions, not groups
- From each region, select proportional fraction of $c_A$: $c_A^x = \frac{|x|}{|A|} \cdot c_A$

$\text{latency} \propto \frac{1}{\sum_s \sum_g \text{datarate}}$
data rate: across all senders, in intersections of groups
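The proportional split can be written out directly. This sketch assumes the regions partition group A, so |A| is the sum of the region sizes; the region sizes and the value of c_A are hypothetical, and how fractional target counts are rounded is not stated on the slide, so the values are left as exact fractions.

```python
from fractions import Fraction


def regional_targets(region_sizes: dict[str, int], c_A: int) -> dict[str, Fraction]:
    """Split c_A XOR targets across the regions of group A in proportion to region size."""
    group_size = sum(region_sizes.values())  # |A|
    return {x: Fraction(size, group_size) * c_A for x, size in region_sizes.items()}


# Hypothetical region sizes for group A (the slide gives no numbers).
regions_of_A = {"a": 5, "ab": 2, "ac": 2, "abc": 1}
targets = regional_targets(regions_of_A, c_A=5)
print({x: float(v) for x, v in targets.items()})
# {'a': 2.5, 'ab': 1.0, 'ac': 1.0, 'abc': 0.5}
print(sum(targets.values()))  # 5, the per-region shares add up to c_A
```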

Page 55: Reliable Communication for Datacenters

Scalability in Groups
Claim: Ricochet scales to hundreds of groups.

[Figure: two panels against Groups per Node (Log Scale, 2 to 1024): LEC Recovery Percentage (90 to 100) and Average Recovery Latency (Microseconds, 0 to 50000)]

At 128 groups, SRM latency was 8 seconds. 400 times slower!

Page 56: Reliable Communication for Datacenters

Distribution of Recovery Latency
Claim: Ricochet is reliable and time-critical

[Figure: three histograms of Recovery Percentage (0 to 50) against Milliseconds (0 to 250), with LEC and NAK portions marked: (a) 10% Loss Rate, 96.8% LEC + 3.2% NAK; (b) 15% Loss Rate, 92% LEC + 8% NAK; (c) 20% Loss Rate, 84% LEC + 16% NAK]

Most lost packets recovered < 50 ms by LEC. Remainder via reactive NAKs.

Bursty Loss: 100 packet burst → 90% recovered at 50 ms avg

Page 57: Reliable Communication for Datacenters

The Big Picture: The Datacenter is the Computer

[Figure: diagram labeled "Real-World, Real-Time" and "Software Stack" (Compilers, Databases, Distributed Systems, Machine Learning, Filesystems, Operating Systems, Networking ...). Failure classes shown: Disk Failure, Disasters, Crashes, Overloads, Bugs, Exploits, Packet Loss. Systems shown: SMFS, KyotoFS, Tempest, Maelstrom, Ricochet, Plato. Venue labels shown: NSDI 2008, NSDI 2007, DSN 2008, SRDS 2006, HotOS XI, Submitted. Also shown: Mistral (MobiHoc 2006), Sequoia (PODC 2007)]

Bottomline: Performance versus Price

We need new abstractions! Power, Parallelism, Privacy...

Page 59: Reliable Communication for Datacenters

Conclusion

- The Real-Time Datacenter
  - Recover from failures within seconds
- Reliable Communication: FEC in the Network
  - Recover lost packets in milliseconds
  - Maelstrom: between Datacenters
  - Ricochet: within Datacenters

Thank You!

Page 61: Reliable Communication for Datacenters

Extra Slide: FEC and Bursty Loss

- Existing solution: interleaving
- Interleave i and rate (r, c) tolerates (c * i) burst...
- ...with i times the latency

[Figure: packets A B C D E and F G H I J with repair packets X X X, then the interleaved streams A C E G I and B D F H J, each with repair packets X X X]
Figure: Interleave of 2. Even and Odd packets encoded separately

Wanted: Graceful degradation of recovery latency with actual burst size for constant overhead
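A short sketch of plain interleaving, using the ten packets from the figure. The helper names are hypothetical; c = 3 is the repair count shown in the figure, and the burst is assumed to hit consecutive data packets only.

```python
def interleave(packets, i):
    """Split a stream into i sub-streams; sub-stream k gets packets k, k+i, k+2i, ..."""
    return [packets[k::i] for k in range(i)]


def worst_losses_per_substream(burst_len: int, i: int) -> int:
    """Most packets a burst of consecutive losses can remove from any one sub-stream."""
    return -(-burst_len // i)  # ceiling division


print(interleave("ABCDEFGHIJ", 2))       # ['ACEGI', 'BDFHJ']
print(worst_losses_per_substream(6, 2))  # 3, still within c = 3 repair packets
print(worst_losses_per_substream(7, 2))  # 4, exceeds c = 3
```

The cost is that each sub-stream fills at 1/i of the channel rate, so every recovery, even of a single lost packet, waits i times longer; that fixed penalty is what the "Wanted" line asks to avoid.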

Page 63: Reliable Communication for Datacenters

Extra Slide: Maelstrom Evaluation
Maelstrom goodput is near theoretical maximum

[Figure: Throughput (Mbits/sec, 0 to 1000) against FEC Layers (c, 0 to 6); series: theoretical goodput, M-FEC goodput]

Page 64: Reliable Communication for Datacenters

Extra Slide: Layered Interleaving

[Figure: Recovery Latency (Milliseconds, 0 to 100) against Recovered Packet # (0 to 1000); series: Reed Solomon, Layered Interleaving]