On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines
Luwei Cheng, Cho-Li Wang, Francis C.M. Lau
Department of Computer Science, The University of Hong Kong
Xen Project Developer Summit 2013, Edinburgh, UK, October 24-25, 2013


DESCRIPTION

While datacenters are increasingly adopting VMs to provide elastic cloud services, they still rely on traditional TCP for congestion control. In this talk, I will first show that VM scheduling delays can heavily contaminate the RTTs sensed by VM senders, preventing TCP from correctly learning the physical network condition. Focusing on the incast problem, which is commonly seen in large-scale distributed data processing such as MapReduce and web search, I find that the solutions developed for *physical* clusters fall short in a Xen *virtual* cluster. Second, I will provide a concrete understanding of the problem, and reveal that the situation when the sending VM is preempted differs from the situation when the receiving VM is preempted. Third, I will introduce my recent attempts at paravirtualizing TCP to overcome the negative effects caused by VM scheduling delays.


Page 1

On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines

Luwei Cheng, Cho-Li Wang, Francis C.M. Lau
Department of Computer Science
The University of Hong Kong

Xen Project Developer Summit 2013
Edinburgh, UK, October 24-25, 2013

Pages 2-3

Outline

- Motivation: Physical datacenter vs. Virtualized datacenter; Incast congestion
- Understand the Problem: Pseudo-congestion; Sender-side vs. Receiver-side
- PVTCP – A ParaVirtualized TCP: Design, Implementation, Evaluation
- Questions & Comments

Page 4

Physical datacenter
- A set of physical machines.
- Network delays: propagation delays of the physical network/switch.

Virtualized datacenter
- A set of virtual machines.
- Network delays: additional delays due to virtualization overhead.

[Figure: two topologies, each with a core switch, ToR switches, and servers in racks; in the virtualized datacenter each server hosts several VMs]

Page 5

Virtualization brings "delays"

1. I/O virtualization overhead (PV or HVM)
   - Guest VMs are unable to directly access the hardware.
   - Additional data movement between dom0 and domUs.
   - HVM: passthrough I/O can avoid it.
2. VM scheduling delays
   - Multiple VMs share one physical core.

[Figure: several VMs queued on each pCPU under the hypervisor; a VM waiting for its turn experiences scheduling delay]

Page 6: XPDS13: On Paravirualizing TCP - Congestion Control on Xen VMs - Luwei Cheng, Student, University of Hong Kong

Virtualization brings “delays”

[1VM 2VMs] [1VM 3VMs]

Peak: 30ms Peak: 60ms

Avg: 0.147ms Avg: 0.374ms

[PM PM] [1VM 1VM]

Delays of I/O virtualization (PV guests): < 1ms

VM scheduling delays: 10× ms – Queuing delays VM scheduling delays

The dominant factor to network RTT

Page 7

Network delays in public clouds

[Figure: RTT measurements in public clouds, as reported in [HPDC'10] and [INFOCOM'10]]

Page 8

Incast network congestion
- A special form of network congestion, typically seen in distributed processing applications (scatter-gather) with barrier-synchronized request workloads.
- The limited buffer space of the switch output port can be easily overfilled by simultaneous transmissions.
- Application-level throughput (goodput) can be orders of magnitude lower than the link capacity. [SIGCOMM'09]

Page 9

Solutions for physical clusters

The dominant factor: once packet loss happens, how soon the sender can learn about it.
- In case of "tail loss", the sender can only count on the retransmit timer's firing.

Two representative papers:
- Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST'08]
- Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN'09]

Prior works, none of which can fully eliminate the throughput collapse:
- Increase switch buffer size
- Limited transmit
- Reduce duplicate ACK threshold
- Disable slow-start
- Randomize timeout value
- Reno, NewReno, SACK

Page 10

Solutions for physical clusters (cont'd)

- Significantly reducing RTOmin has been shown to be a safe and effective approach. [SIGCOMM'09]
- Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages. [DCTCP, SIGCOMM'10]
- RTOmin in a virtual cluster? Not well studied.

Page 11

Outline

- Motivation: Physical datacenter vs. Virtualized datacenter; Incast congestion
- Understand the Problem: Pseudo-congestion; Sender-side vs. Receiver-side
- PVTCP – A ParaVirtualized TCP: Design, Implementation, Evaluation
- Questions & Comments

Page 12

Pseudo-congestion

A small RTOmin → frequent spurious RTOs.

[Figure: measured RTTs (red points) and calculated RTO values (blue points) under RTOmin = 200ms, 100ms, 10ms and 1ms; 3 VMs per core, each running for 30ms in turn]

There is NO network congestion, yet RTT spikes still occur.

TCP's low-pass filter for the Retransmit TimeOut:
    RTO = SRTT + 4 * RTTVAR, lower-bounded by RTOmin.
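To make the lower bound concrete, here is a minimal sketch of the clamp above in C (our own illustration, not the kernel's code; times in milliseconds):

    /* RTO = SRTT + 4 * RTTVAR, lower-bounded by RTOmin. */
    static double next_rto(double srtt, double rttvar, double rto_min)
    {
        double rto = srtt + 4.0 * rttvar;
        return rto > rto_min ? rto : rto_min;
    }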

Page 13

Pseudo-congestion (cont'd)

- A small RTOmin: serious spurious RTOs with largely varied RTTs.
- A big RTOmin: throughput collapse with heavy network congestion.

"Adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two." -- Allman and Paxson [SIGCOMM'99]

Virtualized datacenters: a new instantiation of this dilemma.

Page 14

Sender-side vs. Receiver-side

To transmit 4000 1MB data blocks:

    Freq.      3VMs→1VM (sender delayed)   1VM→3VMs (receiver delayed)
    1× RTOs    1086                        677
    2× RTOs    0                           673
    3× RTOs    0                           196
    4× RTOs    0                           30

- Scheduling delays to the sender VM: an RTO happens only once at a time.
- Scheduling delays to the receiver VM: successive RTOs are normal.

Page 15

A micro-view with tcpdump

[Figure: two traces plotting time (ms) against sequence number (snd.una and snd.nxt, from the sender VM) and against ACK number (from the receiver VM), sequence numbers in the 8.4-9.1 x10^6 range. snd.una: the first sent but unacknowledged byte. snd.nxt: the next byte that will be sent]

When the receiver VM is preempted:
- The receiver VM has been stopped, so the generation and the return of the ACKs are delayed.
- RTO happens twice before the receiver VM wakes up.
- RTOs must happen on the sender's side.

When the sender VM is preempted:
- The sender VM has been stopped; an ACK arrives before the sender VM wakes up.
- The ACK's arrival time is not delayed, but it is received too late.
- RTO happens just after the sender VM wakes up.
- From TCP's perspective, RTO should not be triggered.

Page 16

The sender-side problem: OS reasons

[Figure: VM1, VM2 and VM3 share a pCPU; while VM1 waits in the scheduling queue, the driver domain buffers the ACKs returned by the TCP receiver over the physical network, and VM1's retransmit timer reaches its expiry time inside the hypervisor]

After the VM wakes up, both TIMER and NET interrupts are pending:
1. Timer IRQ: RTO happens!
2. Network IRQ: the ACK is received → spurious RTO!

The RTO fires just before the ACK enters the VM, for reasons rooted in common OS design:
- The timer interrupt is executed before other interrupts.
- Network processing happens a little later (bottom half).

Page 17

To detect spurious RTOs

Two well-known detection algorithms: F-RTO and Eifel.
- Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR'03].
- F-RTO is implemented in Linux.

[Figure: detection rates under 3VMs→1VM and 1VM→3VMs; both show a low detection rate]

F-RTO interacts badly with delayed ACK (ACK coalescing).
- Reducing the delayed ACK timeout value does NOT help.
- Disabling delayed ACK seems to be helpful.

Page 18

Delayed ACK vs. CPU overhead

Disabling delayed ACK → significant CPU overhead.

[Figure: CPU utilization of the sender VM and the receiver VM in both experiments]

Page 19

Delayed ACK vs. CPU overhead

Disabling delayed ACK → significant CPU overhead.

    Total ACKs    delack-200ms   delack-1ms   w/o delack
    [3VMs→1VM]    229,650        244,757      2,832,260
    [1VM→3VMs]    252,278        262,274      2,832,179

[Figure: CPU utilization of the sender VM and the receiver VM under each setting]

Disabling delayed ACK: 11~13× more ACKs are sent.

Page 20

Outline

- Motivation: Physical datacenter vs. Virtualized datacenter; Incast congestion
- Understand the Problem: Pseudo-congestion; Sender-side vs. Receiver-side
- PVTCP – A ParaVirtualized TCP: Design, Implementation, Evaluation
- Questions & Comments

Page 21

PVTCP – A ParaVirtualized TCP

Main idea: if we can detect such moments and let the guest OS be aware of them, there is a chance to handle the problem.

Observation: spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.

"The more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data." -- Allman and Paxson [SIGCOMM'99]

Page 22

Detect the VM's wakeup moment

[Figure: 3 VMs per core, each running for 30ms in turn; the hypervisor delivers virtual timer IRQs to the guest OS every 1ms]

- While the VM is running, each virtual timer IRQ advances the guest's system clock: jiffies++ every 1ms (HZ=1000).
- While the VM is NOT running, no timer IRQs are delivered; when the VM runs again, the clock jumps forward, e.g. jiffies += 60 after a 60ms scheduling delay.
- An acute increase of the system clock (jiffies) ⇒ the VM just woke up (checked with a one-shot timer; a user-space sketch of the heuristic follows below).
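The slides detect the jump inside the guest kernel; as a rough user-space analogue (the names and the 10ms threshold are our choices, not PVTCP's), the same heuristic looks like this in C:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    static uint64_t now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000u + ts.tv_nsec / 1000000u;
    }

    int main(void)
    {
        const uint64_t jump_threshold = 10; /* many missed 1ms ticks => was preempted */
        uint64_t last = now_ms();

        for (;;) {
            usleep(1000);                   /* one 1ms "tick" (HZ=1000) */
            uint64_t cur = now_ms();
            if (cur - last > jump_threshold)
                printf("wakeup detected: clock jumped %llu ms\n",
                       (unsigned long long)(cur - last));
            last = cur;
        }
    }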

Page 23

PVTCP – the sender VM is preempted

Spurious RTOs can be avoided: no need to detect them at all!

[Figure: the same timeline as on Page 16: while VM1 waits out the VM scheduling latency, the driver domain buffers the ACKs; (1) the retransmit timer's expiry time passes and the Timer IRQ fires an RTO, (2) then the Network IRQ delivers the ACK: a spurious RTO]

Page 24

PVTCP – the sender VM is preempted

Spurious RTOs can be avoided: no need to detect them at all!

Solution: after the VM wakes up, extend the TCP retransmit timer's expiry time by 1ms (a kernel-style sketch follows below).

[Figure: with the 1ms extension, the Net IRQ is handled first: the ACK enters and resets the timer before it can expire]
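A kernel-style sketch of this step (the helper name is hypothetical and this is our illustration, not the PVTCP source; the fields are Linux's inet_connection_sock):

    #include <linux/jiffies.h>
    #include <net/sock.h>
    #include <net/inet_connection_sock.h>

    /* On the VM-wakeup path: if a retransmit is pending, push its expiry
     * 1ms past "now", so the pending Network IRQ is handled first and the
     * buffered ACK can reset the timer before it fires. */
    static void pvtcp_extend_rto_on_wakeup(struct sock *sk)
    {
        struct inet_connection_sock *icsk = inet_csk(sk);

        if (icsk->icsk_pending == ICSK_TIME_RETRANS)
            sk_reset_timer(sk, &icsk->icsk_retransmit_timer,
                           jiffies + msecs_to_jiffies(1));
    }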

Page 25

PVTCP – the sender VM is preempted (cont'd)

[Figure: as before, the 1ms extension lets the Net IRQ deliver the ACK first, which resets the retransmit timer within its StartTime-ExpiryTime window]

TCP's low-pass filter to estimate RTT/RTO:
    Smoothed RTT:       SRTT_i    = 7/8 * SRTT_{i-1} + 1/8 * MRTT_i
    RTT variance:       RTTVAR_i  = 3/4 * RTTVAR_{i-1} + 1/4 * |SRTT_{i-1} - MRTT_i|
    Expected RTO value: RTO_{i+1} = SRTT_i + 4 * RTTVAR_i

Measured RTT: MRTT = TrueRTT + VMSchedDelay.

Solution: MRTT_i ← SRTT_{i-1}, i.e. feed the filter the previous smoothed RTT instead of the contaminated measurement.
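A self-contained C sketch of this filter with the substitution applied (variable names are ours; milliseconds throughout):

    #include <math.h>

    /* One step of the low-pass filter above. When the sample is known to be
     * contaminated by VM scheduling delay (the VM just woke up), feed the
     * filter the previous smoothed RTT instead of the measured one. */
    static void rtt_update(double *srtt, double *rttvar,
                           double mrtt, int contaminated)
    {
        if (contaminated)
            mrtt = *srtt;                        /* MRTT_i <- SRTT_{i-1} */
        *rttvar = 0.75 * *rttvar + 0.25 * fabs(*srtt - mrtt);
        *srtt   = 0.875 * *srtt + 0.125 * mrtt;
        /* RTO_{i+1} = *srtt + 4 * *rttvar, lower-bounded by RTOmin. */
    }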

Page 26

PVTCP – the receiver VM is preempted

Spurious RTOs cannot be avoided, so we have to let the sender detect them.

Detection algorithms require a deterministic return of future ACKs from the receiver:
- Delayed ACK enabled → retransmission ambiguity.
- Delayed ACK disabled → significant CPU overhead.

Solution: temporarily disable delayed ACK when the receiver VM just wakes up (an illustrative sketch follows below).
- Eifel: checks the timestamp of the first ACK.
- F-RTO: checks the ACK numbers of the first two ACKs.
- Just-in-time: do not delay the ACKs for the first three segments.
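For illustration only, a user-space analogue of "temporarily disable delayed ACK" using Linux's real TCP_QUICKACK socket option (the slides' mechanism lives in the guest kernel; this sketch just shows the semantics):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* TCP_QUICKACK is deliberately one-shot in Linux: it suppresses delayed
     * ACKs only until the kernel clears it again, which matches the
     * "temporarily disable" behaviour wanted right after a wakeup. */
    static int quickack_after_wakeup(int sockfd)
    {
        int on = 1;
        return setsockopt(sockfd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
    }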

Page 27

PVTCP evaluation: throughput

Experimental setup: 20 sender VMs → 1 receiver VM.

[Figure: goodput vs. RTOmin for TCP-200ms, TCP-1ms and PVTCP-1ms, illustrating TCP's dilemma between pseudo-congestion and real congestion]

PVTCP avoids throughput collapse over the whole range.

Page 28

PVTCP evaluation: CPU overhead

[Figure: CPU utilization of the sender VM and the receiver VM in both experiments]

With delayed ACK enabled: PVTCP (RTOmin=1ms) ≈ TCP (RTOmin=200ms).

Page 29

PVTCP evaluation: CPU overhead

    Total ACKs    TCP-200ms   TCP-1ms   PVTCP-1ms
    [3VMs→1VM]    192,587     244,757   192,863    (+0% vs. TCP-200ms)
    [1VM→3VMs]    194,384     262,274   208,688    (+7.4% vs. TCP-200ms)

- Sender side (+0%): spurious RTOs are avoided.
- Receiver side (+7.4%): delayed ACK is temporarily disabled to help the sender detect spurious RTOs.

[Figure: CPU utilization of the sender VM and the receiver VM under each setting]

Page 30

One concern

Page 31

The buffer of the netback

The vif's buffer temporarily stores incoming packets while the VM is preempted:
- ifconfig vifX.Y txqueuelen [value]
- The default value is too small → intensive packet loss: #define XENVIF_QUEUE_LENGTH 32
- This parameter should be set much bigger (> 10,000, perhaps); a programmatic sketch follows below.

[Figure: two timelines within the hypervisor, "the scheduling delays to the receiver VM" and "the scheduling delays to the sender VM". In both, three VMs rotate through the scheduling queue (RUN/WAIT); data packets wait in the driver domain's buffer for the preempted VM while the peer keeps ACKing over the physical network, and the sender's RTO fires. The buffer size matters!]
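The queue length can also be set programmatically; a small C sketch using the standard SIOCSIFTXQLEN ioctl (the interface name and the 20,000 value are example choices, per the "> 10,000" suggestion above):

    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Equivalent to: ifconfig vifX.Y txqueuelen 20000 */
    static int set_txqueuelen(const char *ifname, int qlen)
    {
        struct ifreq ifr;
        int ret, fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_qlen = qlen;        /* e.g. set_txqueuelen("vif1.0", 20000) */
        ret = ioctl(fd, SIOCSIFTXQLEN, &ifr);
        close(fd);
        return ret;
    }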

Page 32

Summary

Problem: VM scheduling delays cause spurious RTOs.

Proposed solution: a ParaVirtualized TCP (PVTCP)
- Provides a method to detect a VM's wakeup moment.

Sender-side problem (OS reasons):
- Spurious RTOs can be avoided.
- Slightly extend the retransmit timer's expiry time after the sender VM wakes up.

Receiver-side problem (networking):
- Spurious RTOs can be detected.
- Temporarily disable delayed ACK after the receiver VM wakes up (just-in-time ACKs).

Future work: your inputs ..

Page 33

Thank you for listening

Comments & Questions

Email: [email protected]
URL: http://www.cs.hku.hk/~lwcheng